Logging Best Practices

A practical guide to structured logging for production services, with emphasis on observability platforms like Datadog, Splunk, and CloudWatch.

1. Log Levels - When to Use Each

CRITICAL  →  System is unusable, wake someone up at 3am
ERROR     →  Something failed that shouldn't have, needs investigation
WARNING   →  Something unexpected but handled, or approaching limits
INFO      →  High-level business events worth keeping in production
DEBUG     →  Detailed diagnostic info for troubleshooting

The "Would I Want to Be Paged?" Test

If the answer is...	Use this level
"Yes, wake me up"	ERROR or CRITICAL
"I should know about this tomorrow"	WARNING
"Useful for understanding what happened"	INFO
"Only useful when debugging a specific issue"	DEBUG

Practical Examples

# DEBUG - Verbose progress, only useful when actively debugging
logger.debug("Fetching user profile from cache", extra={"user_id": user_id})

# INFO - Business-relevant events worth keeping
logger.info("Order completed", extra={"order_id": order_id, "total": 99.99, "items": 3})

# WARNING - Recoverable issues, degraded functionality
logger.warning("Cache miss, falling back to database", extra={"user_id": user_id})

# ERROR - Something broke that shouldn't have
logger.error("Payment processing failed", extra={"order_id": order_id, "error": str(e)})

# CRITICAL - System is broken, immediate action needed
logger.critical("Database connection pool exhausted", extra={"active_connections": 100})

Common Mistakes

Wrong	Right
INFO for every function entry/exit	DEBUG for verbose progress
ERROR when a user provides bad input	WARNING (it's expected behavior)
DEBUG for "user logged in"	INFO (business event worth tracking)
Logging sensitive data at any level	Never log passwords, tokens, PII

2. Structured Logging vs String Formatting

The Wrong Way: String Embedding

logger.info(f"Order {order_id} completed for user {user_id} in {duration_ms}ms")

Problems:

order_id is buried in the message string
Can't search by @order_id:12345 in your logging platform
Can't aggregate/filter programmatically
Message varies, harder to group identical events

The Right Way: Structured Logging

logger.info(
    "Order completed",
    extra={
        "order_id": order_id,
        "user_id": user_id,
        "duration_ms": duration_ms,
        "status": "completed"
    }
)

Benefits:

Attributes are searchable: @order_id:12345
Consistent message text → easy to group/count
Can create dashboards filtering by any attribute
Machine-readable for automated analysis

The Key Insight

Think of your log message as a static identifier (like an event name), and the extra dict as the variable data. This makes logs both human-readable AND machine-queryable.

3. Tags vs Attributes (Datadog-Specific)

This distinction is crucial for cost control at scale:

Aspect	Tags	Attributes
Indexed?	Yes (fast queries)	No (searched inline)
Cost	Scales with cardinality	Flat cost per log
Use for	Low cardinality: `env`, `service`, `status`	High cardinality: `order_id`, `user_id`, `request_id`
Query syntax	`env:production`	`@order_id:12345`

Why This Matters

With millions of requests, making request_id a tag would explode your Datadog bill because:

Each unique tag value creates an index entry
High-cardinality tags = high indexing costs

As an attribute (via extra={}):

Stored inline with each log entry
Searchable with @request_id:xxx
No cardinality-based cost scaling

# This becomes a searchable ATTRIBUTE (cheap for high-cardinality data)
logger.info("Request completed", extra={"request_id": request_id})

# Tags are set at the service level, not per-log (low cardinality only)
# env:production, service:api-gateway, version:1.2.3

4. What to Log

DO Log

# Request boundaries (start/end of significant operations)
logger.info("Processing started", extra={"request_id": request_id})
logger.info("Processing completed", extra={"request_id": request_id, "duration_ms": elapsed})

# State changes
logger.info("Order status changed", extra={"order_id": order_id, "from": "pending", "to": "shipped"})

# Errors with context
logger.error("Failed to send email", extra={"user_id": user_id, "error": str(e)})

# Business metrics
logger.info("Payment processed", extra={
    "order_id": order_id,
    "amount": 149.99,
    "currency": "USD",
    "payment_method": "card"
})

DON'T Log

# Sensitive data - NEVER
logger.info(f"User authenticated with password {password}")
logger.info(f"API key used: {api_key}")
logger.info(f"Credit card: {card_number}")

# Excessive verbosity in production
logger.info(f"Entering function xyz")  # Use DEBUG
logger.info(f"Variable x = {x}")  # Use DEBUG

# Large payloads
logger.info(f"Full response: {json.dumps(huge_object)}")  # Truncate or omit

# Redundant information
logger.info(f"Starting to process")
logger.info(f"Still processing...")
logger.info(f"Almost done processing...")
logger.info(f"Done processing")  # Just log start and end

5. Log at the Right Layer

┌─────────────────────────────────────────────────────────────┐
│  API Layer (routes, controllers)                             │
│  → Log: Request received, request completed, HTTP errors     │
│  → Level: INFO for boundaries, WARNING/ERROR for failures    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Service Layer (business logic)                              │
│  → Log: Business operations, state transitions               │
│  → Level: INFO for business events, DEBUG for internal steps │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Data Layer (repositories, DAOs)                             │
│  → Log: Query execution, cache hits/misses                   │
│  → Level: DEBUG for queries, WARNING for slow queries        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  External Integrations (APIs, queues, storage)               │
│  → Log: External calls, retries, failures                    │
│  → Level: DEBUG for success, WARNING/ERROR for failures      │
└─────────────────────────────────────────────────────────────┘

The "One INFO Per Layer" Rule

For a single request flowing through your system, aim for roughly one INFO log per architectural layer. More than that creates noise; less than that makes debugging hard. Use DEBUG for the detailed breadcrumbs within each layer.

6. Correlation IDs

Always propagate identifiers through your system:

# In your API handler
request_id = str(uuid.uuid4())
logger.info("Request received", extra={"request_id": request_id, "endpoint": "/api/orders"})

# In your service layer
logger.info("Processing order", extra={"request_id": request_id, "order_id": order_id})

# In your data layer
logger.debug("Executing query", extra={"request_id": request_id, "table": "orders"})

# In your external integration
logger.info("Calling payment gateway", extra={"request_id": request_id, "provider": "stripe"})

Now in your logging platform: @request_id:abc-123 shows the entire journey of that request across all services and layers.

Middleware Pattern (FastAPI Example)

from contextvars import ContextVar
import uuid

request_id_var: ContextVar[str] = ContextVar("request_id", default="")

@app.middleware("http")
async def add_request_id(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    request_id_var.set(request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    return response

# Then in any function:
def get_logger_extra() -> dict:
    return {"request_id": request_id_var.get()}

logger.info("Something happened", extra=get_logger_extra())

7. Error Logging Patterns

# Include exception info for unexpected errors
try:
    result = await risky_operation()
except Exception as e:
    logger.exception(  # Automatically includes stack trace
        "Unexpected error in risky operation",
        extra={"request_id": request_id, "error": str(e)}
    )
    raise

# Use warning for expected/handled failures
try:
    cached = await cache.get(key)
except CacheConnectionError as e:
    logger.warning(
        "Cache unavailable, falling back to database",
        extra={"request_id": request_id, "error": str(e)}
    )
    cached = None  # Graceful degradation

# NEVER swallow errors silently
try:
    do_something()
except Exception:
    pass  # NEVER DO THIS - you'll regret it at 3am

The Exception Logging Decision Tree

Is this error expected in normal operation?
├── Yes (e.g., user not found, invalid input)
│   └── WARNING with context, handle gracefully
└── No (e.g., database down, null pointer)
    ├── Can you recover?
    │   ├── Yes → ERROR + logger.exception() + recover
    │   └── No  → ERROR + logger.exception() + re-raise
    └── Is the system unusable?
        └── Yes → CRITICAL + alert + potentially shutdown

8. Production Configuration

import os
from your_logging_module import setup_logging

# Determine environment
env = os.getenv("ENVIRONMENT", "development")
use_json = env not in ("local", "development", "dev")
log_level = os.getenv("LOG_LEVEL", "INFO" if env == "production" else "DEBUG")

# Initialize logging BEFORE other imports that might log
setup_logging(level=log_level, use_json=use_json)

Environment	Format	Level	Why
Local/Dev	Human-readable	DEBUG	Easy to read in terminal
Staging	JSON	DEBUG	Test structured logging, full verbosity
Production	JSON	INFO	Reduce volume, machine-parseable

JSON Format Example

Human-readable (development):

2024-01-15 10:23:45 INFO Order completed - order_id=12345, duration_ms=234

JSON (production):

{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "INFO",
  "message": "Order completed",
  "order_id": "12345",
  "duration_ms": 234,
  "service": "order-api",
  "trace_id": "abc123"
}

9. Performance Considerations

Lazy Evaluation

# BAD - expensive_call() runs even if DEBUG is disabled
logger.debug(f"User data: {expensive_call()}")

# GOOD - expensive_call() only runs if DEBUG is enabled
if logger.isEnabledFor(logging.DEBUG):
    logger.debug(f"User data: {expensive_call()}")

# ALSO GOOD - use lazy % formatting
logger.debug("User data: %s", expensive_call())  # Only evaluated if needed

Avoid Logging in Hot Paths

# BAD - logging inside tight loops
for item in million_items:
    logger.debug(f"Processing item {item.id}")
    process(item)

# GOOD - log boundaries, not iterations
logger.info("Starting batch processing", extra={"item_count": len(million_items)})
for item in million_items:
    process(item)
logger.info("Batch processing completed", extra={"item_count": len(million_items)})

10. Quick Reference Card

┌────────────────────────────────────────────────────────────┐
│                    LOGGING CHEAT SHEET                      │
├────────────────────────────────────────────────────────────┤
│ LEVEL     │ USE WHEN                                        │
├───────────┼────────────────────────────────────────────────┤
│ DEBUG     │ Detailed diagnostic info for troubleshooting   │
│ INFO      │ Business events, request boundaries            │
│ WARNING   │ Recoverable issues, fallbacks triggered        │
│ ERROR     │ Failures requiring investigation               │
│ CRITICAL  │ System unusable, immediate action needed       │
├────────────────────────────────────────────────────────────┤
│ PATTERN   │ EXAMPLE                                         │
├───────────┼────────────────────────────────────────────────┤
│ Structured│ logger.info("msg", extra={"key": value})       │
│ Correlation│ Always include request_id/trace_id            │
│ Boundaries│ Log start AND completion of operations         │
│ Errors    │ Use logger.exception() for stack traces        │
├────────────────────────────────────────────────────────────┤
│ DATADOG   │ CARDINALITY RULE                                │
├───────────┼────────────────────────────────────────────────┤
│ Tags      │ Low cardinality only (env, service, status)    │
│ Attributes│ High cardinality OK (request_id, user_id)      │
├────────────────────────────────────────────────────────────┤
│ DON'T LOG │ EVER                                            │
├───────────┼────────────────────────────────────────────────┤
│ Passwords │ logger.info(f"password={pw}")                  │
│ API Keys  │ logger.info(f"key={api_key}")                  │
│ PII       │ logger.info(f"ssn={ssn}")                      │
│ Card #s   │ logger.info(f"card={card_number}")             │
└────────────────────────────────────────────────────────────┘

inchoate/logging-best-practices.md

Select an option

No results found