@inchoate
Created January 25, 2026 20:41
Logging Best Practices

A practical guide to structured logging for production services, with emphasis on observability platforms like Datadog, Splunk, and CloudWatch.


1. Log Levels - When to Use Each

CRITICAL  →  System is unusable, wake someone up at 3am
ERROR     →  Something failed that shouldn't have, needs investigation
WARNING   →  Something unexpected but handled, or approaching limits
INFO      →  High-level business events worth keeping in production
DEBUG     →  Detailed diagnostic info for troubleshooting

The "Would I Want to Be Paged?" Test

If the answer is...                             Use this level
"Yes, wake me up"                               ERROR or CRITICAL
"I should know about this tomorrow"             WARNING
"Useful for understanding what happened"        INFO
"Only useful when debugging a specific issue"   DEBUG

Practical Examples

# DEBUG - Verbose progress, only useful when actively debugging
logger.debug("Fetching user profile from cache", extra={"user_id": user_id})

# INFO - Business-relevant events worth keeping
logger.info("Order completed", extra={"order_id": order_id, "total": 99.99, "items": 3})

# WARNING - Recoverable issues, degraded functionality
logger.warning("Cache miss, falling back to database", extra={"user_id": user_id})

# ERROR - Something broke that shouldn't have
logger.error("Payment processing failed", extra={"order_id": order_id, "error": str(e)})

# CRITICAL - System is broken, immediate action needed
logger.critical("Database connection pool exhausted", extra={"active_connections": 100})

Common Mistakes

Wrong                                  Right
INFO for every function entry/exit     DEBUG for verbose progress
ERROR when a user provides bad input   WARNING (it's expected behavior)
DEBUG for "user logged in"             INFO (business event worth tracking)
Logging sensitive data at any level    Never log passwords, tokens, PII

2. Structured Logging vs String Formatting

The Wrong Way: String Embedding

logger.info(f"Order {order_id} completed for user {user_id} in {duration_ms}ms")

Problems:

  • order_id is buried in the message string
  • Can't search by @order_id:12345 in your logging platform
  • Can't aggregate/filter programmatically
  • Message varies, harder to group identical events

The Right Way: Structured Logging

logger.info(
    "Order completed",
    extra={
        "order_id": order_id,
        "user_id": user_id,
        "duration_ms": duration_ms,
        "status": "completed"
    }
)

Benefits:

  • Attributes are searchable: @order_id:12345
  • Consistent message text → easy to group/count
  • Can create dashboards filtering by any attribute
  • Machine-readable for automated analysis

The Key Insight

Think of your log message as a static identifier (like an event name), and the extra dict as the variable data. This makes logs both human-readable AND machine-queryable.
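This works because Python's standard logging module copies each key in extra= onto the LogRecord as a plain attribute, which is exactly what JSON formatters and log shippers pick up as top-level fields. A quick way to see it (the capture handler exists only for illustration):

```python
import logging

class CaptureHandler(logging.Handler):
    """Illustration-only handler that stores records for inspection."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
handler = CaptureHandler()
logger.addHandler(handler)

# Each key in extra= becomes an attribute on the LogRecord itself
logger.info("Order completed", extra={"order_id": "12345", "duration_ms": 234})

record = handler.records[0]
assert record.getMessage() == "Order completed"  # static event name
assert record.order_id == "12345"                # variable data, queryable
assert record.duration_ms == 234
```

Note that extra= keys must not collide with built-in LogRecord attributes like message or asctime; logging raises a KeyError if they do.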


3. Tags vs Attributes (Datadog-Specific)

This distinction is crucial for cost control at scale:

Aspect        Tags                                     Attributes
Indexed?      Yes (fast queries)                       No (searched inline)
Cost          Scales with cardinality                  Flat cost per log
Use for       Low cardinality: env, service, status    High cardinality: order_id, user_id, request_id
Query syntax  env:production                           @order_id:12345

Why This Matters

With millions of requests, making request_id a tag would explode your Datadog bill because:

  • Each unique tag value creates an index entry
  • High-cardinality tags = high indexing costs

As an attribute (via extra={}):

  • Stored inline with each log entry
  • Searchable with @request_id:xxx
  • No cardinality-based cost scaling

# This becomes a searchable ATTRIBUTE (cheap for high-cardinality data)
logger.info("Request completed", extra={"request_id": request_id})

# Tags are set at the service level, not per-log (low cardinality only)
# env:production, service:api-gateway, version:1.2.3

4. What to Log

DO Log

# Request boundaries (start/end of significant operations)
logger.info("Processing started", extra={"request_id": request_id})
logger.info("Processing completed", extra={"request_id": request_id, "duration_ms": elapsed})

# State changes
logger.info("Order status changed", extra={"order_id": order_id, "from": "pending", "to": "shipped"})

# Errors with context
logger.error("Failed to send email", extra={"user_id": user_id, "error": str(e)})

# Business metrics
logger.info("Payment processed", extra={
    "order_id": order_id,
    "amount": 149.99,
    "currency": "USD",
    "payment_method": "card"
})

DON'T Log

# Sensitive data - NEVER
logger.info(f"User authenticated with password {password}")
logger.info(f"API key used: {api_key}")
logger.info(f"Credit card: {card_number}")

# Excessive verbosity in production
logger.info(f"Entering function xyz")  # Use DEBUG
logger.info(f"Variable x = {x}")  # Use DEBUG

# Large payloads
logger.info(f"Full response: {json.dumps(huge_object)}")  # Truncate or omit

# Redundant information
logger.info(f"Starting to process")
logger.info(f"Still processing...")
logger.info(f"Almost done processing...")
logger.info(f"Done processing")  # Just log start and end
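The "never log sensitive data" rule can also be enforced mechanically with a logging.Filter that scrubs known-sensitive attributes before any handler sees them. A minimal sketch; the key list is illustrative rather than exhaustive, and RedactingFilter is not a standard-library class:

```python
import logging

# Illustrative list - extend for your domain
SENSITIVE_KEYS = {"password", "api_key", "token", "card_number", "ssn"}

class RedactingFilter(logging.Filter):
    """Scrub known-sensitive attributes (passed via extra=) before emission."""
    def filter(self, record):
        for key in SENSITIVE_KEYS & set(record.__dict__):
            setattr(record, key, "[REDACTED]")
        return True  # keep the record; only scrub it

logger = logging.getLogger("secure")
logger.setLevel(logging.INFO)
logger.addFilter(RedactingFilter())

# Even if a sensitive value slips into extra=, it is scrubbed
# before any handler formats it
logger.info("Auth attempt", extra={"user_id": "u-1", "api_key": "sk-live-..."})
```

A filter like this is a safety net, not a substitute for not logging secrets in the first place: it only catches keys you anticipated.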

5. Log at the Right Layer

┌─────────────────────────────────────────────────────────────┐
│  API Layer (routes, controllers)                             │
│  → Log: Request received, request completed, HTTP errors     │
│  → Level: INFO for boundaries, WARNING/ERROR for failures    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Service Layer (business logic)                              │
│  → Log: Business operations, state transitions               │
│  → Level: INFO for business events, DEBUG for internal steps │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Data Layer (repositories, DAOs)                             │
│  → Log: Query execution, cache hits/misses                   │
│  → Level: DEBUG for queries, WARNING for slow queries        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  External Integrations (APIs, queues, storage)               │
│  → Log: External calls, retries, failures                    │
│  → Level: DEBUG for success, WARNING/ERROR for failures      │
└─────────────────────────────────────────────────────────────┘

The "One INFO Per Layer" Rule

For a single request flowing through your system, aim for roughly one INFO log per architectural layer. More than that creates noise; less than that makes debugging hard. Use DEBUG for the detailed breadcrumbs within each layer.
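The rule above can be sketched end-to-end. The function and event names below are hypothetical, but the shape is the point: roughly one INFO per layer, with DEBUG breadcrumbs underneath:

```python
import logging

logger = logging.getLogger("order-service")

# API layer: INFO at the request boundaries
def handle_create_order(request_id, payload):
    logger.info("Request received", extra={"request_id": request_id, "endpoint": "/api/orders"})
    order_id = create_order(request_id, payload)
    logger.info("Request completed", extra={"request_id": request_id, "order_id": order_id})
    return order_id

# Service layer: one INFO for the business event, DEBUG for internal steps
def create_order(request_id, payload):
    logger.debug("Validating payload", extra={"request_id": request_id})
    order_id = save_order(request_id, payload)
    logger.info("Order created", extra={"request_id": request_id, "order_id": order_id})
    return order_id

# Data layer: DEBUG only - queries are breadcrumbs, not business events
def save_order(request_id, payload):
    logger.debug("Executing insert", extra={"request_id": request_id, "table": "orders"})
    return "ord-001"  # stand-in for a real insert
```

One request through this sketch produces three INFO lines and two DEBUG lines; in production at INFO level, that is the whole story of the request.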


6. Correlation IDs

Always propagate identifiers through your system:

# In your API handler
request_id = str(uuid.uuid4())
logger.info("Request received", extra={"request_id": request_id, "endpoint": "/api/orders"})

# In your service layer
logger.info("Processing order", extra={"request_id": request_id, "order_id": order_id})

# In your data layer
logger.debug("Executing query", extra={"request_id": request_id, "table": "orders"})

# In your external integration
logger.info("Calling payment gateway", extra={"request_id": request_id, "provider": "stripe"})

Now in your logging platform: @request_id:abc-123 shows the entire journey of that request across all services and layers.

Middleware Pattern (FastAPI Example)

from contextvars import ContextVar
import uuid

from fastapi import FastAPI, Request

app = FastAPI()

request_id_var: ContextVar[str] = ContextVar("request_id", default="")

@app.middleware("http")
async def add_request_id(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    request_id_var.set(request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    return response

# Then in any function:
def get_logger_extra() -> dict:
    return {"request_id": request_id_var.get()}

logger.info("Something happened", extra=get_logger_extra())
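An alternative to threading get_logger_extra() through every call site is a logging.Filter that stamps the context variable onto each record automatically. A sketch, redeclaring request_id_var so the snippet stands alone:

```python
import logging
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="")

class RequestIdFilter(logging.Filter):
    """Stamp the current request_id onto every record automatically."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # never drop records, only annotate them

logger = logging.getLogger("api")
logger.setLevel(logging.INFO)
logger.addFilter(RequestIdFilter())

# Once the middleware has set the context variable, every log line
# from this logger carries request_id with no extra= needed
request_id_var.set("abc-123")
logger.info("Something happened")
```

Because ContextVar is async-task-local, concurrent requests each see their own request_id even on a shared event loop.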

7. Error Logging Patterns

# Include exception info for unexpected errors
try:
    result = await risky_operation()
except Exception as e:
    logger.exception(  # Automatically includes stack trace
        "Unexpected error in risky operation",
        extra={"request_id": request_id, "error": str(e)}
    )
    raise

# Use warning for expected/handled failures
try:
    cached = await cache.get(key)
except CacheConnectionError as e:
    logger.warning(
        "Cache unavailable, falling back to database",
        extra={"request_id": request_id, "error": str(e)}
    )
    cached = None  # Graceful degradation

# NEVER swallow errors silently
try:
    do_something()
except Exception:
    pass  # NEVER DO THIS - you'll regret it at 3am

The Exception Logging Decision Tree

Is this error expected in normal operation?
├── Yes (e.g., user not found, invalid input)
│   └── WARNING with context, handle gracefully
└── No (e.g., database down, null pointer)
    ├── Can you recover?
    │   ├── Yes → ERROR + logger.exception() + recover
    │   └── No  → ERROR + logger.exception() + re-raise
    └── Is the system unusable?
        └── Yes → CRITICAL + alert + potentially shutdown

8. Production Configuration

import os
from your_logging_module import setup_logging

# Determine environment
env = os.getenv("ENVIRONMENT", "development")
use_json = env not in ("local", "development", "dev")
log_level = os.getenv("LOG_LEVEL", "INFO" if env == "production" else "DEBUG")

# Initialize logging BEFORE other imports that might log
setup_logging(level=log_level, use_json=use_json)

Environment   Format           Level   Why
Local/Dev     Human-readable   DEBUG   Easy to read in terminal
Staging       JSON             DEBUG   Test structured logging, full verbosity
Production    JSON             INFO    Reduce volume, machine-parseable

JSON Format Example

Human-readable (development):

2024-01-15 10:23:45 INFO Order completed - order_id=12345, duration_ms=234

JSON (production):

{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "INFO",
  "message": "Order completed",
  "order_id": "12345",
  "duration_ms": 234,
  "service": "order-api",
  "trace_id": "abc123"
}

9. Performance Considerations

Lazy Evaluation

# BAD - expensive_call() runs even if DEBUG is disabled
logger.debug(f"User data: {expensive_call()}")

# GOOD - expensive_call() only runs if DEBUG is enabled
if logger.isEnabledFor(logging.DEBUG):
    logger.debug(f"User data: {expensive_call()}")

# CAUTION - lazy % formatting defers only the string formatting, NOT the
# argument evaluation: expensive_call() still runs even when DEBUG is off.
# Use it to skip formatting costs, but keep genuinely expensive calls
# behind the isEnabledFor() check above.
logger.debug("User data: %s", expensive_call())

Avoid Logging in Hot Paths

# BAD - logging inside tight loops
for item in million_items:
    logger.debug(f"Processing item {item.id}")
    process(item)

# GOOD - log boundaries, not iterations
logger.info("Starting batch processing", extra={"item_count": len(million_items)})
for item in million_items:
    process(item)
logger.info("Batch processing completed", extra={"item_count": len(million_items)})
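If a batch runs long enough that you need visibility between the boundaries, a periodic checkpoint every N items keeps log volume bounded. A sketch, where process() is a stand-in for real per-item work:

```python
import logging

logger = logging.getLogger("batch")

def process(item):
    pass  # stand-in for real per-item work

def process_batch(items, progress_every=10_000):
    """Log the boundaries plus a bounded number of progress checkpoints."""
    logger.info("Starting batch processing", extra={"item_count": len(items)})
    for i, item in enumerate(items, start=1):
        process(item)
        if i % progress_every == 0:
            logger.info("Batch progress", extra={"processed": i, "total": len(items)})
    logger.info("Batch processing completed", extra={"item_count": len(items)})
```

A million items with progress_every=10_000 produces about a hundred log lines instead of a million, while still showing whether the batch is moving.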

10. Quick Reference Card

┌─────────────────────────────────────────────────────────────┐
│                     LOGGING CHEAT SHEET                     │
├─────────────┬───────────────────────────────────────────────┤
│ LEVEL       │ USE WHEN                                      │
├─────────────┼───────────────────────────────────────────────┤
│ DEBUG       │ Detailed diagnostic info for troubleshooting  │
│ INFO        │ Business events, request boundaries           │
│ WARNING     │ Recoverable issues, fallbacks triggered       │
│ ERROR       │ Failures requiring investigation              │
│ CRITICAL    │ System unusable, immediate action needed      │
├─────────────┼───────────────────────────────────────────────┤
│ PATTERN     │ EXAMPLE                                       │
├─────────────┼───────────────────────────────────────────────┤
│ Structured  │ logger.info("msg", extra={"key": value})      │
│ Correlation │ Always include request_id/trace_id            │
│ Boundaries  │ Log start AND completion of operations        │
│ Errors      │ Use logger.exception() for stack traces       │
├─────────────┼───────────────────────────────────────────────┤
│ DATADOG     │ CARDINALITY RULE                              │
├─────────────┼───────────────────────────────────────────────┤
│ Tags        │ Low cardinality only (env, service, status)   │
│ Attributes  │ High cardinality OK (request_id, user_id)     │
├─────────────┼───────────────────────────────────────────────┤
│ DON'T LOG   │ EVER                                          │
├─────────────┼───────────────────────────────────────────────┤
│ Passwords   │ logger.info(f"password={pw}")                 │
│ API Keys    │ logger.info(f"key={api_key}")                 │
│ PII         │ logger.info(f"ssn={ssn}")                     │
│ Card #s     │ logger.info(f"card={card_number}")            │
└─────────────┴───────────────────────────────────────────────┘

Generated with help from Claude Code
