A practical guide to structured logging for production services, with emphasis on observability platforms like Datadog, Splunk, and CloudWatch.
- `CRITICAL` → System is unusable; wake someone up at 3am
- `ERROR` → Something failed that shouldn't have; needs investigation
- `WARNING` → Something unexpected but handled, or approaching limits
- `INFO` → High-level business events worth keeping in production
- `DEBUG` → Detailed diagnostic info for troubleshooting
A quick test: ask whether someone needs to act on this event, and when.

| If the answer is... | Use this level |
|---|---|
| "Yes, wake me up" | ERROR or CRITICAL |
| "I should know about this tomorrow" | WARNING |
| "Useful for understanding what happened" | INFO |
| "Only useful when debugging a specific issue" | DEBUG |
```python
# DEBUG - Verbose progress, only useful when actively debugging
logger.debug("Fetching user profile from cache", extra={"user_id": user_id})

# INFO - Business-relevant events worth keeping
logger.info("Order completed", extra={"order_id": order_id, "total": 99.99, "items": 3})

# WARNING - Recoverable issues, degraded functionality
logger.warning("Cache miss, falling back to database", extra={"user_id": user_id})

# ERROR - Something broke that shouldn't have
logger.error("Payment processing failed", extra={"order_id": order_id, "error": str(e)})

# CRITICAL - System is broken, immediate action needed
logger.critical("Database connection pool exhausted", extra={"active_connections": 100})
```

| Wrong | Right |
|---|---|
| INFO for every function entry/exit | DEBUG for verbose progress |
| ERROR when a user provides bad input | WARNING (it's expected behavior) |
| DEBUG for "user logged in" | INFO (business event worth tracking) |
| Logging sensitive data at any level | Never log passwords, tokens, PII |
logger.info(f"Order {order_id} completed for user {user_id} in {duration_ms}ms")Problems:
order_idis buried in the message string- Can't search by
@order_id:12345in your logging platform - Can't aggregate/filter programmatically
- Message varies, harder to group identical events
The structured version:

```python
logger.info(
    "Order completed",
    extra={
        "order_id": order_id,
        "user_id": user_id,
        "duration_ms": duration_ms,
        "status": "completed"
    }
)
```

Benefits:
- Attributes are searchable: `@order_id:12345`
- Consistent message text → easy to group/count
- Can create dashboards filtering by any attribute
- Machine-readable for automated analysis
Think of your log message as a static identifier (like an event name), and the `extra` dict as the variable data. This makes logs both human-readable AND machine-queryable.
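To make this concrete, here is a minimal formatter sketch (my illustration, not any platform's SDK). It relies on the fact that Python's standard logging copies every key in `extra={}` onto the `LogRecord`, so a formatter can lift those keys into the emitted JSON:

```python
import json
import logging

# Attributes present on a default LogRecord; anything else came from extra={}
_STANDARD_ATTRS = set(
    logging.LogRecord("", 0, "", 0, "", (), None).__dict__
) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),  # the static event name
        }
        # Lift extra={} fields to top-level, searchable JSON attributes
        payload.update(
            {k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS}
        )
        return json.dumps(payload, default=str)
```

Install it with `handler.setFormatter(JsonFormatter())` and every `extra` key becomes a top-level JSON attribute.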
In Datadog specifically, the distinction between tags and attributes is crucial for cost control at scale:
| Aspect | Tags | Attributes |
|---|---|---|
| Indexed? | Yes (fast queries) | No (searched inline) |
| Cost | Scales with cardinality | Flat cost per log |
| Use for | Low cardinality: env, service, status | High cardinality: order_id, user_id, request_id |
| Query syntax | `env:production` | `@order_id:12345` |
With millions of requests, making `request_id` a tag would explode your Datadog bill because:
- Each unique tag value creates an index entry
- High-cardinality tags = high indexing costs

As an attribute (via `extra={}`):
- Stored inline with each log entry
- Searchable with `@request_id:xxx`
- No cardinality-based cost scaling
```python
# This becomes a searchable ATTRIBUTE (cheap for high-cardinality data)
logger.info("Request completed", extra={"request_id": request_id})

# Tags are set at the service level, not per-log (low cardinality only)
# env:production, service:api-gateway, version:1.2.3
```

What's worth logging:

```python
# Request boundaries (start/end of significant operations)
logger.info("Processing started", extra={"request_id": request_id})
logger.info("Processing completed", extra={"request_id": request_id, "duration_ms": elapsed})

# State changes
logger.info("Order status changed", extra={"order_id": order_id, "from": "pending", "to": "shipped"})

# Errors with context
logger.error("Failed to send email", extra={"user_id": user_id, "error": str(e)})

# Business metrics
logger.info("Payment processed", extra={
    "order_id": order_id,
    "amount": 149.99,
    "currency": "USD",
    "payment_method": "card"
})
```

What to keep out of your logs:

```python
# Sensitive data - NEVER
logger.info(f"User authenticated with password {password}")
logger.info(f"API key used: {api_key}")
logger.info(f"Credit card: {card_number}")
# Excessive verbosity in production
logger.info(f"Entering function xyz") # Use DEBUG
logger.info(f"Variable x = {x}") # Use DEBUG
# Large payloads
logger.info(f"Full response: {json.dumps(huge_object)}") # Truncate or omit
# Redundant information
logger.info(f"Starting to process")
logger.info(f"Still processing...")
logger.info(f"Almost done processing...")
logger.info(f"Done processing") # Just log start and end┌─────────────────────────────────────────────────────────────┐
│ API Layer (routes, controllers) │
│ → Log: Request received, request completed, HTTP errors │
│ → Level: INFO for boundaries, WARNING/ERROR for failures │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Service Layer (business logic) │
│ → Log: Business operations, state transitions │
│ → Level: INFO for business events, DEBUG for internal steps │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Data Layer (repositories, DAOs) │
│ → Log: Query execution, cache hits/misses │
│ → Level: DEBUG for queries, WARNING for slow queries │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ External Integrations (APIs, queues, storage) │
│ → Log: External calls, retries, failures │
│ → Level: DEBUG for success, WARNING/ERROR for failures │
└─────────────────────────────────────────────────────────────┘
For a single request flowing through your system, aim for roughly one INFO log per architectural layer. More than that creates noise; fewer than that makes debugging hard. Use DEBUG for the detailed breadcrumbs within each layer.
Always propagate identifiers through your system:
```python
# In your API handler
request_id = str(uuid.uuid4())
logger.info("Request received", extra={"request_id": request_id, "endpoint": "/api/orders"})
# In your service layer
logger.info("Processing order", extra={"request_id": request_id, "order_id": order_id})
# In your data layer
logger.debug("Executing query", extra={"request_id": request_id, "table": "orders"})
# In your external integration
logger.info("Calling payment gateway", extra={"request_id": request_id, "provider": "stripe"})Now in your logging platform: @request_id:abc-123 shows the entire journey of that request across all services and layers.
```python
import uuid
from contextvars import ContextVar

from fastapi import FastAPI, Request  # assumed framework: @app.middleware below is FastAPI's

app = FastAPI()

request_id_var: ContextVar[str] = ContextVar("request_id", default="")

@app.middleware("http")
async def add_request_id(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    request_id_var.set(request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    return response

# Then in any function:
def get_logger_extra() -> dict:
    return {"request_id": request_id_var.get()}

logger.info("Something happened", extra=get_logger_extra())
```
```python
# Include exception info for unexpected errors
try:
    result = await risky_operation()
except Exception as e:
    logger.exception(  # automatically includes the stack trace
        "Unexpected error in risky operation",
        extra={"request_id": request_id, "error": str(e)}
    )
    raise

# Use warning for expected/handled failures
try:
    cached = await cache.get(key)
except CacheConnectionError as e:
    logger.warning(
        "Cache unavailable, falling back to database",
        extra={"request_id": request_id, "error": str(e)}
    )
    cached = None  # Graceful degradation

# NEVER swallow errors silently
try:
    do_something()
except Exception:
    pass  # NEVER DO THIS - you'll regret it at 3am
```

A decision tree for picking the right response:

```
Is this error expected in normal operation?
├── Yes (e.g., user not found, invalid input)
│   └── WARNING with context, handle gracefully
└── No (e.g., database down, null pointer)
    ├── Can you recover?
    │   ├── Yes → ERROR + logger.exception() + recover
    │   └── No → ERROR + logger.exception() + re-raise
    └── Is the system unusable?
        └── Yes → CRITICAL + alert + potentially shutdown
```
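The top branch of this tree is easy to encode once. Here is a sketch assuming an application-specific `ExpectedError` base class (a hypothetical name, not from the guide):

```python
# Hypothetical base class marking failures that are normal in operation
class ExpectedError(Exception):
    """Bad input, missing records, and other failures we anticipate."""

try:
    handle_request()  # stand-in for your operation
except ExpectedError as e:
    # Expected branch: WARNING with context, handle gracefully
    logger.warning("Handled expected failure", extra={"error": str(e)})
except Exception:
    # Unexpected branch: ERROR with stack trace, then re-raise (no recovery here)
    logger.exception("Unexpected error")
    raise
```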
```python
import os
from your_logging_module import setup_logging
# Determine environment
env = os.getenv("ENVIRONMENT", "development")
use_json = env not in ("local", "development", "dev")
log_level = os.getenv("LOG_LEVEL", "INFO" if env == "production" else "DEBUG")
# Initialize logging BEFORE other imports that might log
setup_logging(level=log_level, use_json=use_json)
```
| Environment | Format | Level | Why |
|---|---|---|---|
| Local/Dev | Human-readable | DEBUG | Easy to read in terminal |
| Staging | JSON | DEBUG | Test structured logging, full verbosity |
| Production | JSON | INFO | Reduce volume, machine-parseable |
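The snippet above imports `setup_logging` from a placeholder `your_logging_module`. Here is a minimal sketch of what such a function might look like, reusing the `JsonFormatter` sketched earlier (both are illustrations, not a published API):

```python
import logging
import sys

def setup_logging(level: str = "INFO", use_json: bool = False) -> None:
    """Configure the root logger once, before anything else logs."""
    handler = logging.StreamHandler(sys.stdout)
    if use_json:
        handler.setFormatter(JsonFormatter())  # sketched earlier
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
    root = logging.getLogger()
    root.setLevel(level)       # accepts names like "INFO" or "DEBUG"
    root.handlers = [handler]  # replace any previously installed handlers
```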
Human-readable (development):
```
2024-01-15 10:23:45 INFO Order completed - order_id=12345, duration_ms=234
```
JSON (production):
```json
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "INFO",
  "message": "Order completed",
  "order_id": "12345",
  "duration_ms": 234,
  "service": "order-api",
  "trace_id": "abc123"
}
```

Two performance habits matter: avoid eager evaluation of expensive log arguments, and keep logging out of hot loops.

```python
# BAD - expensive_call() runs even if DEBUG is disabled
logger.debug(f"User data: {expensive_call()}")
# GOOD - expensive_call() only runs if DEBUG is enabled
if logger.isEnabledFor(logging.DEBUG):
logger.debug(f"User data: {expensive_call()}")
# ALSO GOOD - use lazy % formatting
logger.debug("User data: %s", expensive_call()) # Only evaluated if needed# BAD - logging inside tight loops
for item in million_items:
    logger.debug(f"Processing item {item.id}")
    process(item)

# GOOD - log boundaries, not iterations
logger.info("Starting batch processing", extra={"item_count": len(million_items)})
for item in million_items:
    process(item)
logger.info("Batch processing completed", extra={"item_count": len(million_items)})
```
```
┌─────────────────────────────────────────────────────────────┐
│                     LOGGING CHEAT SHEET                     │
├─────────────────────────────────────────────────────────────┤
│ LEVEL      │ USE WHEN                                       │
├────────────┼────────────────────────────────────────────────┤
│ DEBUG      │ Detailed diagnostic info for troubleshooting   │
│ INFO       │ Business events, request boundaries            │
│ WARNING    │ Recoverable issues, fallbacks triggered        │
│ ERROR      │ Failures requiring investigation               │
│ CRITICAL   │ System unusable, immediate action needed       │
├─────────────────────────────────────────────────────────────┤
│ PATTERN    │ EXAMPLE                                        │
├────────────┼────────────────────────────────────────────────┤
│ Structured │ logger.info("msg", extra={"key": value})       │
│ Correlation│ Always include request_id/trace_id             │
│ Boundaries │ Log start AND completion of operations         │
│ Errors     │ Use logger.exception() for stack traces        │
├─────────────────────────────────────────────────────────────┤
│ DATADOG    │ CARDINALITY RULE                               │
├────────────┼────────────────────────────────────────────────┤
│ Tags       │ Low cardinality only (env, service, status)    │
│ Attributes │ High cardinality OK (request_id, user_id)      │
├─────────────────────────────────────────────────────────────┤
│ DON'T LOG  │ EVER                                           │
├────────────┼────────────────────────────────────────────────┤
│ Passwords  │ logger.info(f"password={pw}")                  │
│ API Keys   │ logger.info(f"key={api_key}")                  │
│ PII        │ logger.info(f"ssn={ssn}")                      │
│ Card #s    │ logger.info(f"card={card_number}")             │
└─────────────────────────────────────────────────────────────┘
```
Further reading:

- The Twelve-Factor App: Logs
- Datadog Log Management Best Practices
- Python Logging HOWTO
- Structured Logging in Python
Generated with help from Claude Code