logging-observability

Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, the Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
Install via the ClawdBot CLI:

```shell
clawdbot install wpank/logging-observability
```

Patterns for building observable systems across the three pillars: logs, metrics, and traces.
| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| Logs | What happened | Why did this request fail? | {"level":"error","msg":"payment declined","user_id":"u_82"} |
| Metrics | How much / how fast | Is latency increasing? | http_request_duration_seconds{route="/api/orders"} 0.342 |
| Traces | Request flow | Where is the bottleneck? | Span: api-gateway → auth → order-service → db |
Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.
Always emit logs as structured JSON — never free-text strings.
| Field | Purpose | Required |
|-------|---------|----------|
| timestamp | ISO-8601 with milliseconds | Yes |
| level | Severity (DEBUG … FATAL) | Yes |
| service | Originating service name | Yes |
| message | Human-readable description | Yes |
| trace_id | Distributed trace correlation | Yes |
| span_id | Current span within trace | Yes |
| correlation_id | Business-level correlation (order ID) | When applicable |
| error | Structured error object | On errors |
| context | Request-specific metadata | Recommended |
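A complete log line using the schema above might look like this (all field values are illustrative):

```json
{
  "timestamp": "2026-03-01T14:22:07.342Z",
  "level": "error",
  "service": "order-service",
  "message": "payment charge failed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "correlation_id": "ORD-1234",
  "error": { "type": "CardDeclinedError", "code": "CARD_DECLINED" },
  "context": { "user_id": "u_82", "route": "/api/orders" }
}
```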
Attach context at the middleware level so downstream logs inherit automatically:
```javascript
import crypto from 'node:crypto';
import { AsyncLocalStorage } from 'node:async_hooks';

const asyncLocalStorage = new AsyncLocalStorage();

app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // Everything downstream of next() can read ctx via asyncLocalStorage.getStore().
  asyncLocalStorage.run(ctx, () => next());
});
```
| Library | Language | Strengths | Perf |
|---------|----------|-----------|------|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.
| Level | When to Use | Example |
|-------|-------------|---------|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |
Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
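The production-default rule reduces to a numeric severity threshold. A sketch (level names follow the table above; the mechanism itself is illustrative):

```javascript
// Numeric severity ordering, lowest to highest.
const LEVELS = { TRACE: 10, DEBUG: 20, INFO: 30, WARN: 40, ERROR: 50, FATAL: 60 };

// Returns true when `level` is at or above `minLevel` (production default: 'INFO').
function shouldLog(level, minLevel = 'INFO') {
  return LEVELS[level] >= LEVELS[minLevel];
}
```

Most structured loggers (Pino, zerolog, zap) expose exactly this knob as a configuration option, so you rarely implement it yourself.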
Always prefer OpenTelemetry over vendor-specific SDKs:
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Context propagates across HTTP calls via the W3C traceparent header — the default in OTel. For async boundaries (message queues, scheduled jobs), inject the traceparent into the job payload so the consumer can resume the trace.

| Strategy | Use When |
|----------|----------|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |
Always sample 100% of error traces regardless of strategy.
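The probabilistic strategy with the error override can be sketched like this (a simplified stand-in for OTel's built-in samplers, not their actual API):

```javascript
// Build a sampler that keeps `rate` fraction of normal traces, but every error trace.
// `random` is injectable so the decision is deterministic in tests.
function makeSampler(rate, random = Math.random) {
  return function shouldSample({ isError }) {
    if (isError) return true;  // always keep error traces
    return random() < rate;    // probabilistic for the rest
  };
}
```

Note that in practice the error override usually requires tail-based sampling (e.g. the OTel Collector's tail_sampling processor), because a head-based decision is made before a trace's outcome is known.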
Monitor these three for every service endpoint:
| Metric | What It Measures | Prometheus Example |
|--------|------------------|--------------------|
| Rate | Requests/sec | rate(http_requests_total[5m]) |
| Errors | Failed request ratio | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) |
| Duration | Response time | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
For infrastructure components (CPU, memory, disk, network):
| Metric | What It Measures | Example |
|--------|-----------------|---------|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |
| Tool | Category | Best For |
|------|----------|----------|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |
Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.
| Severity | Response Time | Example |
|----------|---------------|---------|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |
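The P2 error-rate threshold might translate into a Prometheus alerting rule like this (metric names, label values, and thresholds are illustrative):

```yaml
groups:
  - name: order-service-alerts
    rules:
      - alert: HighErrorRate
        # Fires when 5xx responses exceed 5% of all requests for 5 minutes.
        expr: |
          sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="order-service"}[5m])) > 0.05
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "order-service 5xx rate above 5%"
```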
Every service must have health endpoints (/healthz and /readyz).

| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
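A minimal field-based PII scrubber, applied before a log line is serialized (the sensitive-field list is illustrative — tailor it to your own schema):

```javascript
// Field names that must never appear in logs in clear text.
const SENSITIVE_FIELDS = new Set(['email', 'password', 'ssn', 'card_number']);

// Recursively replace sensitive values with a redaction marker.
function scrub(value) {
  if (Array.isArray(value)) return value.map(scrub);
  if (value && typeof value === 'object') {
    const out = {};
    for (const [key, v] of Object.entries(value)) {
      out[key] = SENSITIVE_FIELDS.has(key) ? '[REDACTED]' : scrub(v);
    }
    return out;
  }
  return value;
}
```

Production loggers often support this natively (e.g. Pino's redact option), which is preferable to a hand-rolled pass.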
Generated Mar 1, 2026
Example use cases:

- Implement structured logging and distributed tracing to track user transactions across microservices like payment processing and inventory management. Use metrics to monitor API latency and error rates, ensuring high availability during peak shopping events.
- Set up logging with PII scrubbing to audit financial transactions and detect anomalies. Use traces to follow money flows across services and metrics to ensure system performance meets regulatory SLAs.
- Deploy observability for patient data processing, using logs to record access events and traces to monitor data pipeline latency. Metrics help track system resource usage to maintain uptime for critical applications.
- Apply RED metrics to monitor user request rates, errors, and durations across multi-tenant services. Use structured logging with correlation IDs to debug customer-specific issues and improve user experience.
- Implement logging and tracing for device communication across distributed networks. Use metrics to monitor data ingestion rates and system health, enabling proactive maintenance and alerting on failures.
- Offer managed logging, tracing, and metrics solutions to clients, using OpenTelemetry and Prometheus stacks. Provide dashboards and alerting to help businesses monitor their systems without in-house expertise.
- Help companies migrate from legacy logging to structured observability practices. Design and implement custom solutions for metrics collection and distributed tracing to improve system reliability and performance.
- Create and sell specialized logging libraries or monitoring tools that integrate with existing observability stacks. Focus on performance-optimized solutions for high-throughput environments like financial trading.
💬 Integration Tip
Start by instrumenting key services with OpenTelemetry and structured logging, then gradually expand to full observability stacks like Prometheus/Grafana for metrics.