The first thing to check when inheriting a production system is the logs. Unstructured text like `ERROR: something went wrong in payment service` is a reliable signal that incident response is going to be painful. Structured logging is one of those practices that costs almost nothing to implement but transforms how fast you can diagnose problems.
## The Problem with Text Logs
Traditional log lines look like this:

```
2026-01-28 14:23:01 ERROR PaymentService - Failed to process payment for user 12345, order 67890, amount $150.00, error: timeout
```

Parsing this requires regex. Every service formats logs differently. Searching across services means writing different queries for each one. Correlating a single request across multiple services is nearly impossible.
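To make the contrast concrete, here is a small sketch comparing how a consumer extracts fields from each format. The regex and field names are illustrative, not a real parser:

```typescript
// Extracting fields from the unstructured line requires a brittle,
// service-specific regex that breaks whenever the format changes.
const textLine =
  "2026-01-28 14:23:01 ERROR PaymentService - Failed to process payment " +
  "for user 12345, order 67890, amount $150.00, error: timeout";

const match = textLine.match(/user (\d+), order (\d+), amount \$([\d.]+), error: (\w+)/);
const fromText = match
  ? { userId: match[1], orderId: match[2], amount: Number(match[3]), error: match[4] }
  : null;

// Extracting the same fields from a structured line is one generic call.
const jsonLine =
  '{"level":"error","userId":"12345","orderId":"67890","amount":150,"error":"upstream_timeout"}';
const fromJson = JSON.parse(jsonLine);

console.log(fromText?.userId, fromJson.userId); // both "12345"
```

The regex only works for this one service's format; the `JSON.parse` call works for every service that emits structured logs.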
## Structured Logging
The same event as structured JSON:

```json
{
  "timestamp": "2026-01-28T14:23:01.456Z",
  "level": "error",
  "service": "payment-service",
  "message": "Payment processing failed",
  "userId": "12345",
  "orderId": "67890",
  "amount": 150.00,
  "currency": "USD",
  "error": "upstream_timeout",
  "duration_ms": 30000,
  "traceId": "abc-123-def-456",
  "spanId": "span-789"
}
```

Every field is queryable. Every service uses the same format. Correlating a request across services is a single query on `traceId`.
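Because every field is machine-readable, correlation works with ordinary tooling even before a log backend exists. A quick sketch, using invented sample events:

```typescript
// Structured events from two services, as they would appear in a log stream.
const rawLines = [
  '{"service":"api-gateway","level":"info","traceId":"abc-123","message":"Request received"}',
  '{"service":"payment-service","level":"error","traceId":"abc-123","message":"Payment processing failed"}',
  '{"service":"payment-service","level":"info","traceId":"xyz-999","message":"Payment succeeded"}',
];

// The moral equivalent of a traceId query: parse each line, keep matches.
const trace = rawLines
  .map((line) => JSON.parse(line))
  .filter((event) => event.traceId === "abc-123");

console.log(trace.map((e) => `${e.service}: ${e.message}`));
// both abc-123 events survive, one from each service
```

A log backend like Loki does exactly this at scale, with indexing so the query doesn't scan every line.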
## Implementation
### Node.js with Pino
Pino is one of the fastest JSON loggers for Node.js, designed to keep overhead in the logging hot path to a minimum:
```typescript
import pino from "pino";

export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  formatters: {
    // Emit the level as its label ("info") rather than its numeric value.
    level(label) {
      return { level: label };
    },
  },
  // Fields attached to every log entry from this service.
  base: {
    service: process.env.SERVICE_NAME,
    environment: process.env.NODE_ENV,
    version: process.env.APP_VERSION,
  },
});
```

Usage in application code:
```typescript
import { logger } from "./logger";

async function processPayment(userId: string, orderId: string, amount: number) {
  const log = logger.child({ userId, orderId, amount });
  log.info("Processing payment");

  try {
    const result = await paymentGateway.charge(amount);
    log.info({ transactionId: result.id, duration_ms: result.duration }, "Payment succeeded");
    return result;
  } catch (error) {
    log.error({ error: error.message, code: error.code }, "Payment failed");
    throw error;
  }
}
```

The `child()` method creates a logger with context fields that are automatically included in every log entry. No more manually including `userId` in every log call.
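Request-scoped context can also be propagated implicitly rather than threaded through every function signature. Below is a minimal sketch of the pattern with Node's built-in `AsyncLocalStorage`; the plain-JSON `logInfo` function is a stand-in for a real logger (Pino can merge such ambient fields via its `mixin` option):

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Holds per-request context for the lifetime of an async call chain.
const requestContext = new AsyncLocalStorage<{ traceId: string }>();

// Stand-in logger: merges the ambient traceId into every entry it emits.
function logInfo(fields: Record<string, unknown>, message: string): string {
  return JSON.stringify({
    level: "info",
    traceId: requestContext.getStore()?.traceId,
    ...fields,
    message,
  });
}

// Wrap each incoming request so everything it awaits shares one traceId.
function handleRequest<T>(work: () => Promise<T>): Promise<T> {
  return requestContext.run({ traceId: randomUUID() }, work);
}

// Deeply nested code picks up the traceId without being handed it.
handleRequest(async () => logInfo({ userId: "12345" }, "Processing payment"))
  .then((entry) => console.log(entry));
```

In a real HTTP service the `handleRequest` wrapper would live in middleware, generating or extracting the trace ID from incoming headers once per request.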
### Go with zerolog
```go
package main

import (
	"os"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
)

func init() {
	zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
	log.Logger = zerolog.New(os.Stdout).With().
		Str("service", "payment-service").
		Str("version", os.Getenv("APP_VERSION")).
		Timestamp().
		Logger()
}

func processPayment(userID string, amount float64) error {
	log.Info().
		Str("userId", userID).
		Float64("amount", amount).
		Msg("Processing payment")
	return nil
}
```

## Shipping Logs
Structured logs are only useful if they're aggregated in a central, searchable system. A minimal self-hosted stack looks like this:
```yaml
services:
  vector:
    image: timberio/vector:latest-alpine
    volumes:
      - ./vector.toml:/etc/vector/vector.toml:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - loki
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  loki-data:
  grafana-data:
```

Vector collects logs from Docker containers, parses the JSON, and ships them to Loki. Grafana queries Loki for visualization and alerting.
A matching `vector.toml`:

```toml
[sources.docker]
type = "docker_logs"

[transforms.parse]
type = "remap"
inputs = ["docker"]
source = '''
. = parse_json!(.message)
'''

[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels.service = "{{ service }}"
labels.level = "{{ level }}"
```

## Query Patterns That Save Time
**Find all errors for a specific user in the last hour:**

```logql
{service="payment-service", level="error"} | json | userId = "12345"
```

**Trace a request across services:**

```logql
{level=~"info|error"} | json | traceId = "abc-123-def-456"
```

**Find slow requests:**

```logql
{service="api-gateway"} | json | duration_ms > 5000
```

**Error rate by service (last 15 minutes):**

```logql
sum by (service) (rate({level="error"}[15m]))
```

## Alerting on Logs
Logs aren't just for post-incident investigation. With structured data, you can alert proactively:
```yaml
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({level="error"}[5m])) by (service) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in {{ $labels.service }}"
      - alert: PaymentFailureSpike
        expr: |
          sum(rate({service="payment-service", level="error"} |= "Payment failed" [5m])) > 0.1
        for: 1m
        labels:
          severity: critical
```

## Key Takeaways
- Structured from day one — retrofitting structured logging is painful; start with JSON from the beginning
- Use child loggers for context — attach request-scoped fields once, not in every log call
- Include a trace ID in every log — this is the single most valuable field for debugging distributed systems
- Centralize immediately — logs on individual servers are useless during incidents when you need cross-service visibility
- Alert on log patterns — don't wait for users to report problems that your logs already show