Logging Best Practices
Logs are the written record of what your application does. When something goes wrong at 2 AM and the on-call engineer needs to figure out what happened, logs are the primary evidence they have to work with. Good logs tell a clear story: what happened, when it happened, and in what context. Bad logs are noise — walls of unstructured text that bury critical information under irrelevant details, or worse, logs that are missing entirely when you need them most.
Logging seems simple, but doing it well requires deliberate choices about format, content, levels, retention, and security. This lesson covers the practices that separate useful, actionable logs from the ones that sit unused on disk until storage runs out.
Structured Logging
Traditional logging writes human-readable strings to a file. A typical log line might look like this:
2024-03-15 14:23:07 ERROR Failed to process payment for user 12345, order #789, amount $49.99 - Connection timeout to payment gateway
This is readable by a human, but it is difficult for machines to parse. If you want to find all payment failures above $100, or all errors for a specific user, you need to write fragile regular expressions that break when the message format changes slightly.
Structured logging solves this problem by writing logs in a consistent, machine-parseable format — typically JSON:
{
"timestamp": "2024-03-15T14:23:07.432Z",
"level": "error",
"message": "Failed to process payment",
"service": "payment-api",
"user_id": 12345,
"order_id": 789,
"amount": 49.99,
"currency": "USD",
"error": "Connection timeout to payment gateway",
"gateway": "stripe",
"duration_ms": 30000,
"correlation_id": "abc-123-def-456"
}
With structured logs, every piece of information is in a named field. Log management tools can index these fields, letting you search, filter, aggregate, and visualize your logs with precision. "Show me all payment errors over $100 in the last 24 hours grouped by gateway" becomes a simple query instead of a text search nightmare.
Most modern logging libraries support structured logging natively. In Node.js, libraries like Winston and Pino output JSON by default. In Python, the structlog library adds structured logging on top of the standard logging module. In Java, SLF4J with Logback supports structured output through the Logstash encoder.
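As a concrete illustration, here is a minimal sketch of structured logging in Python with structlog, one of the libraries mentioned above. The field names mirror the JSON example and are placeholders rather than a required schema.

# Minimal structured logging sketch with structlog (pip install structlog).
# Field names (user_id, order_id, correlation_id, ...) are illustrative placeholders.
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),  # ISO-8601 timestamp field
        structlog.processors.add_log_level,           # adds the "level" field
        structlog.processors.JSONRenderer(),          # one JSON object per line
    ]
)

log = structlog.get_logger(service="payment-api")

log.error(
    "Failed to process payment",
    user_id=12345,
    order_id=789,
    amount=49.99,
    currency="USD",
    gateway="stripe",
    duration_ms=30000,
    correlation_id="abc-123-def-456",
)

Every keyword argument becomes a named, indexable field in the emitted JSON, which is what makes the "payment errors over $100" query trivial.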
Log Levels
Log levels indicate the severity and importance of a log entry. Using them consistently allows you to control the verbosity of your logs per environment (verbose in development, concise in production) and to filter noise when investigating problems. The standard levels, from least to most severe (a short configuration sketch follows the list):
- DEBUG: Detailed diagnostic information useful during development and troubleshooting. Examples: variable values at key decision points, SQL queries being executed, cache hit/miss information. DEBUG logs should be disabled in production by default to avoid excessive volume and potential exposure of sensitive data.
- INFO: General operational events that confirm things are working as expected. Examples: application startup ("Server listening on port 8080"), successful completion of significant operations ("Order #789 processed successfully"), scheduled job execution ("Daily report generation completed in 4.2 seconds"). INFO is typically the default level in production.
- WARN: Something unexpected happened that is not an error but might indicate a problem. The application can continue operating, but someone should investigate. Examples: a deprecated API was called, a retry was needed for a database query, disk usage exceeded 80%, a configuration value fell back to a default because the expected value was missing.
- ERROR: Something failed. The application could not complete a requested operation. Examples: a database query failed after all retries, a payment could not be processed, an external API returned an unexpected response. ERROR logs should include enough context to diagnose the problem — the input that caused the failure, the error message, and the stack trace.
- FATAL (or CRITICAL): A catastrophic failure that prevents the application from continuing to operate. Examples: the database connection pool is exhausted, a required configuration file is missing at startup, an out-of-memory condition. FATAL events typically result in the process exiting. They should always trigger immediate alerts.
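To make the per-environment control concrete, here is a minimal sketch using Python's standard logging module. The LOG_LEVEL environment variable name is an assumption, not a convention every framework shares.

# Sketch: pick the log level from the environment (LOG_LEVEL is an assumed variable name).
# DEBUG in development, INFO (the usual production default) otherwise.
import logging
import os

level_name = os.getenv("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logger = logging.getLogger("payment-api")

logger.debug("Cache miss for user profile")             # suppressed unless LOG_LEVEL=DEBUG
logger.info("Server listening on port 8080")
logger.warning("Disk usage exceeded 80%")
logger.error("Payment gateway returned an unexpected response")
logger.critical("Database connection pool exhausted")   # Python's closest analogue to FATAL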
What to Log
Good logging requires knowing what information will be valuable when you are diagnosing a problem. Log these categories of events:
- Incoming requests: Log the method, path, status code, response time, and client identifier for every HTTP request. This is the foundation of operational visibility. Most web frameworks can do this automatically with middleware (a minimal sketch follows this list).
- Errors and exceptions: Log the full error message, stack trace, and the input or context that triggered the error. Without context, a stack trace is only half the story.
- State changes: Log when important data changes — user account created, order status changed, configuration updated, feature flag toggled. These logs create an audit trail that is invaluable for debugging and compliance.
- Security events: Log authentication attempts (both successful and failed), authorization failures, password changes, and permission modifications. These logs are essential for detecting and investigating security incidents. Many compliance frameworks (SOC 2, HIPAA, PCI DSS) require security event logging.
- External service interactions: Log requests to external APIs, databases, and third-party services, including the response time and status. When your application is slow or failing, these logs help you quickly determine whether the problem is in your code or in a dependency.
- Application lifecycle events: Log when the application starts, shuts down, reloads configuration, or connects to resources. These logs help correlate problems with deployments and infrastructure changes.
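To illustrate the first item in the list above, here is a request-logging sketch written as a plain WSGI middleware. It is a sketch rather than a drop-in component: real frameworks ship equivalents, and the extra fields only appear in output if a structured (JSON) formatter is configured.

# Sketch: log method, path, status, duration, and client for every request (WSGI).
import logging
import time

logger = logging.getLogger("http")

class RequestLoggingMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        start = time.monotonic()
        captured = {}

        def capturing_start_response(status, headers, exc_info=None):
            captured["status"] = status.split(" ", 1)[0]  # e.g. "200"
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, capturing_start_response)
        finally:
            logger.info(
                "request handled",
                extra={
                    "method": environ.get("REQUEST_METHOD"),
                    "path": environ.get("PATH_INFO"),
                    "status": captured.get("status"),
                    "duration_ms": round((time.monotonic() - start) * 1000, 1),
                    "client": environ.get("REMOTE_ADDR"),
                },
            )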
What NOT to Log
Some data should never appear in your logs, regardless of how useful it might seem for debugging. Logging sensitive data creates security and compliance risks that far outweigh any debugging benefit:
- Passwords: Never log passwords, whether in plaintext or hashed form. If a password is being sent to your application, log the fact that an authentication attempt occurred, but not the password itself.
- Authentication tokens: Session tokens, JWTs, API keys, and OAuth tokens grant access to accounts. If these appear in logs and the logs are compromised, attackers can hijack user sessions. Log a truncated or hashed version if you need to correlate requests with tokens.
- Personally Identifiable Information (PII): Full names, email addresses, phone numbers, physical addresses, Social Security numbers, and government IDs should not appear in logs. If you need to reference a user, log the user ID (an opaque identifier) rather than their personal details.
- Financial data: Full credit card numbers, bank account numbers, and financial transaction details should not be logged. PCI DSS compliance requires that full card numbers are never stored in logs. If you need to log a card reference, log only the last four digits.
- Health information: Medical records, diagnoses, and other health data are protected by regulations like HIPAA. Logging this data in an unsecured log file is a violation.
- Encryption keys and secrets: Database passwords, API secrets, encryption keys, and private certificates should never appear in logs. Use a secrets manager and log that a connection was made, not the credentials used to make it.
As a general rule: if you would not be comfortable seeing a piece of data on the front page of a newspaper, do not log it. Review your logging statements as part of code review to catch accidental exposure of sensitive data.
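Code review is the primary defense, but an automated safety net helps catch slips. Here is a hedged sketch of a logging filter that redacts obviously sensitive field names before records are written; the SENSITIVE_KEYS set is illustrative and no substitute for not logging the data in the first place.

# Sketch: a logging.Filter that scrubs sensitive keys from structured log records.
# SENSITIVE_KEYS is an illustrative, non-exhaustive list; adapt it to your own field names.
import logging

SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "ssn", "card_number"}

class RedactSensitiveFields(logging.Filter):
    def filter(self, record):
        for key in list(vars(record)):
            if key.lower() in SENSITIVE_KEYS:
                setattr(record, key, "[REDACTED]")
        return True  # never drop the record, only scrub it

handler = logging.StreamHandler()
handler.addFilter(RedactSensitiveFields())
logging.basicConfig(handlers=[handler], level=logging.INFO)

# A field passed via extra={"password": ...} is now emitted as "[REDACTED]"
# by any structured formatter attached to the handler.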
Correlation IDs
In a distributed system where a single user request might pass through an API gateway, an authentication service, a business logic service, and a database, tracing that request through the logs of each service is extremely difficult without a common identifier.
A correlation ID (also called a trace ID or request ID) is a unique identifier generated at the entry point of a request and passed through every service that handles it. Each log entry includes the correlation ID, making it possible to search across all services and see every log entry related to a single request, in order.
Implementation is straightforward:
- When a request arrives at your entry point (API gateway, load balancer, or first service), generate a unique ID (UUID v4 is common) or use an incoming X-Request-ID header if one exists.
- Include this ID in every log entry for the duration of the request.
- Pass the ID to downstream services via an HTTP header (commonly X-Request-ID or X-Correlation-ID).
- Each downstream service reads the header and includes the ID in its own log entries.
With correlation IDs in place, debugging a failed request becomes a simple search: filter your logs by the correlation ID, and you see the complete story of that request across every service, in chronological order. This is foundational to observability in distributed systems.
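As a sketch of that pattern in a single Python service, a context variable keeps the ID available to every log call without threading it through function arguments. The header name matches the list above; the handle_request helper is hypothetical.

# Sketch: carry a correlation ID through a request with contextvars,
# and inject it into every log record via a logging.Filter.
import contextvars
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationIdFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s [%(correlation_id)s] %(message)s"))
logging.basicConfig(handlers=[handler], level=logging.INFO)

def handle_request(headers):
    # Reuse an incoming X-Request-ID if present, otherwise mint a new UUID v4.
    correlation_id.set(headers.get("X-Request-ID", str(uuid.uuid4())))
    logging.getLogger("orders").info("processing order")
    # Forward the same value to downstream services.
    return {"X-Request-ID": correlation_id.get()}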
Centralized Log Management
When your application runs on a single server, reading log files directly might be sufficient. But as soon as you have multiple servers, containers, or services, you need centralized log management — a system that collects logs from all sources, indexes them, and provides a unified interface for searching and analysis.
- ELK Stack (Elasticsearch, Logstash, Kibana): The most widely used open-source log management stack. Elasticsearch stores and indexes the logs, Logstash (or the lighter Filebeat) collects and ships them, and Kibana provides the visualization and search interface. The ELK Stack is powerful and flexible but requires significant operational effort to run at scale. Managed versions are available from Elastic Cloud and Amazon OpenSearch Service.
- Datadog Log Management: A fully managed service that integrates log management with infrastructure monitoring, APM, and error tracking. Datadog provides powerful search, filtering, and alerting on logs, with automatic correlation to traces and metrics. The cost can add up at high volumes, but the operational simplicity and integrated experience are significant advantages.
- AWS CloudWatch Logs: If your infrastructure runs on AWS, CloudWatch Logs provides native log collection and search. It integrates with other AWS services (Lambda, ECS, EC2) and can trigger alarms based on log patterns. It is less feature-rich than dedicated log management tools but requires no additional infrastructure to set up.
- Grafana Loki: An open-source log aggregation system designed to be cost-effective by indexing only metadata (labels) rather than the full log content. It pairs with Grafana for visualization and is a good choice for teams that already use Grafana for metrics dashboards and want to keep logs in the same ecosystem.
Log Retention Policies
Logs consume storage, and storage costs money. Without a retention policy, log volume grows unbounded until someone notices the disk is full or the cloud bill spikes. A good retention policy balances debugging needs, compliance requirements, and cost:
- Hot storage (0-30 days): Recent logs that are actively searchable and immediately accessible. This covers most debugging scenarios, since the vast majority of investigations involve events from the last few days.
- Warm storage (30-90 days): Older logs that are still searchable but may have slower query times. Useful for trend analysis and investigating issues that were discovered after a delay.
- Cold storage (90 days to years): Archived logs stored in cheap storage (like AWS S3 Glacier) for compliance purposes. Not immediately searchable, but retrievable when needed. Compliance frameworks like PCI DSS require retaining audit logs for at least one year, with three months immediately available.
Define retention policies based on log type and importance. Security audit logs might need 2-year retention for compliance, while DEBUG-level application logs might be deleted after 7 days. Automate retention enforcement so logs are deleted or archived on schedule without manual intervention.
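As one way to automate enforcement, here is a hedged sketch that applies an S3 lifecycle rule to an archived-logs bucket with boto3. The bucket name, prefix, and day counts are assumptions to adapt to your own policy and compliance requirements.

# Sketch: automate retention on an S3 log archive with a lifecycle rule.
# Requires boto3 and AWS credentials; bucket name, prefix, and day counts are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",  # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-and-expire-app-logs",
                "Filter": {"Prefix": "app-logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"}  # move to cold storage after 90 days
                ],
                "Expiration": {"Days": 730},  # delete roughly two years after creation
            }
        ]
    },
)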
Using Logs for Debugging and Auditing
Well-structured logs serve two distinct but equally important purposes:
Debugging is the immediate use case. When a user reports a problem or a monitoring alert fires, logs are the first place you look. Effective debugging with logs follows a pattern: start with the error log entry, use the correlation ID to find related entries across services, read the breadcrumb trail of INFO and DEBUG entries leading up to the error, and identify the root cause based on the full context.
Auditing is the long-term use case. Audit logs answer questions like "who accessed this record and when?" and "what changes were made to this configuration?" Audit logs should be immutable (append-only) and stored separately from application logs so they cannot be tampered with. Many compliance frameworks require audit logs as evidence that security controls are in place and functioning.
The best logging practice serves both purposes simultaneously: structured, contextual, properly leveled log entries that are immediately useful for debugging and that also create a reliable audit trail over time.
Resources
- The Twelve-Factor App: Logs — Treat logs as event streams, the foundational philosophy behind modern logging
- Elastic Stack Documentation — Official guides for Elasticsearch, Logstash, Kibana, and Beats