prometheus-devops
Prometheus monitoring: scrape configuration, service discovery, recording rules, alert rules, and production deployment for infrastructure and application metrics.
Install via ClawdBot CLI:
clawdbot install wpank/prometheus-devops
Production Prometheus setup covering scrape configuration, service discovery,
recording rules, alert rules, and operational best practices for infrastructure
and application monitoring.
| Scenario | Example |
|----------|---------|
| Set up metrics collection | New service needs Prometheus scraping |
| Configure service discovery | K8s pods, file-based, or static targets |
| Create recording rules | Pre-compute expensive PromQL queries |
| Design alert rules | SLO-based alerts for availability and latency |
| Production deployment | HA setup with retention and storage planning |
| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |
Applications --(/metrics)--> Prometheus Server --> Alertmanager --> Slack/PD
      ^                             |
client libraries                    +--> Grafana (dashboards)
(prom client)                       +--> Thanos/Cortex (long-term storage)
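As a rough illustration of what an application serves at /metrics, the sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. It is not the official client library: `render_counter` and the sample metric are hypothetical names chosen for this example, and a real service would normally rely on a Prometheus client library instead.

```python
def _escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family in the text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Hypothetical helper.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{_escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        # Series without labels are written as the bare metric name.
        series = f"{name}{{{pairs}}}" if pairs else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"
```

Each rendered line after the HELP and TYPE comments is one time series that Prometheus stores on every scrape.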
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-west-2
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # Node exporters
  - job_name: node-exporter
    static_configs:
      - targets: ["node1:9100", "node2:9100", "node3:9100"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"
  # Application metrics (TLS)
  - job_name: my-app
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    static_configs:
      - targets: ["app1:9090", "app2:9090"]
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
Pod annotations to enable scraping:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
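To make the `__address__` rewrite rule above easier to follow, here is a small Python sketch that mimics it. It assumes the documented relabeling behavior: source label values are joined with `;`, the regex is anchored at both ends, and a non-matching `replace` rule leaves the target label unchanged. The function name `rewrite_address` is hypothetical and exists only for this illustration.

```python
import re

# Same pattern as the relabel rule. Prometheus anchors relabel regexes at both
# ends, which re.fullmatch reproduces here.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return the scrape address after the relabel rule (hypothetical helper)."""
    # Source label values are joined with ";" before matching.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        # With action: replace, a non-matching regex leaves the label as-is.
        return address
    # Equivalent of replacement: $1:$2
    return f"{match.group(1)}:{match.group(2)}"
```

For a pod discovered at `10.0.0.5:8080` with `prometheus.io/port: "9090"`, the rule yields `10.0.0.5:9090`. A pod without the port annotation keeps its discovered address.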
scrape_configs:
  - job_name: file-sd
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m
targets/production.json:
[{
  "targets": ["app1:9090", "app2:9090"],
  "labels": { "env": "production", "service": "api" }
}]
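Target files like the one above are usually generated by a script or a configuration management tool. The following Python sketch shows one way to write such a file atomically, so that Prometheus never reads a half-written file. The helper name `write_targets` and the `.tmp` suffix are assumptions made for this example; the suffix is chosen so the temporary file does not match the `*.json` glob.

```python
import json
import os
import tempfile


def write_targets(path: str, targets: list[str], labels: dict[str, str]) -> None:
    """Write one file_sd target group to `path` atomically (hypothetical helper)."""
    groups = [{"targets": targets, "labels": labels}]
    directory = os.path.dirname(path) or "."
    # Write to a temporary file in the same directory, then rename it over the
    # destination so readers only ever see a complete file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        # mkstemp creates the file as owner-only; make it readable for the
        # Prometheus user (assumes Prometheus may run as a different user).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling `write_targets("/etc/prometheus/targets/production.json", ["app1:9090", "app2:9090"], {"env": "production", "service": "api"})` would produce content equivalent to the example file.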
| Method | Best For | Dynamic |
|--------|----------|---------|
| static_configs | Fixed infrastructure, dev | No |
| file_sd_configs | CM-managed inventories | Yes (file watch) |
| kubernetes_sd_configs | K8s workloads | Yes (API watch) |
| consul_sd_configs | Consul service mesh | Yes (Consul watch) |
| ec2_sd_configs | AWS EC2 instances | Yes (API poll) |
Pre-compute expensive queries for dashboard and alert performance:
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m
      - record: job:http_duration:p95
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
  - name: resource_metrics
    interval: 30s
    rules:
      - record: instance:node_cpu:utilization
        expr: >
          100 - (avg by (instance)
            (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory:utilization
        expr: >
          100 - ((node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes) * 100)
      - record: instance:node_disk:utilization
        expr: >
          100 - ((node_filesystem_avail_bytes
            / node_filesystem_size_bytes) * 100)
Recording rule names follow the pattern level:metric_name:operations
| Part | Example | Meaning |
|------|---------|---------|
| level | job:, instance: | Aggregation level |
| metric_name | http_requests | Base metric |
| operations | :rate5m, :ratio | Applied functions |
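The naming convention in the table above can be checked mechanically, for example in a CI step that lints rule files. The Python sketch below shows one possible strict reading of the convention: exactly three colon-separated parts, each made of letters, digits, and underscores. `parse_rule_name` and `RuleName` are hypothetical names introduced for this example.

```python
import re
from typing import NamedTuple, Optional

# One fragment of a metric name: letters, digits, underscores, no leading digit.
_PART = r"[a-zA-Z_][a-zA-Z0-9_]*"
_RULE_NAME = re.compile(rf"({_PART}):({_PART}):([a-zA-Z0-9_]+)")


class RuleName(NamedTuple):
    level: str        # aggregation level, e.g. "job" or "instance"
    metric: str       # base metric, e.g. "http_requests"
    operations: str   # applied functions, e.g. "rate5m" or "ratio"


def parse_rule_name(name: str) -> Optional[RuleName]:
    """Split a recording rule name into its parts, or return None if it
    does not follow the three-part convention (hypothetical helper)."""
    match = _RULE_NAME.fullmatch(name)
    return RuleName(*match.groups()) if match else None
```

A raw metric name such as `http_requests_total` has no colons and is rejected, which helps keep recorded series visually distinct from scraped ones.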
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} down for >1 minute"
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}"
      - alert: HighP95Latency
        expr: job:http_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value }}s for {{ $labels.job }}"
  - name: resources
    rules:
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "CPU {{ $value }}% on {{ $labels.instance }}"
      - alert: HighMemory
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Memory {{ $value }}% on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk {{ $value }}% on {{ $labels.instance }}"
| Severity | Threshold | Response |
|----------|-----------|----------|
| critical | Service down, data loss risk | Page on-call immediately |
| warning | Degraded, approaching limit | Investigate within hours |
| info | Notable but not urgent | Review on the next business day |
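The `for` clause in the alert rules above controls how long a condition must hold before an alert moves from pending to firing. The Python sketch below is a simplified model of that state machine for a single series, assuming evaluations happen at a fixed interval. `alert_states` is a hypothetical helper; it ignores details such as `keep_firing_for` and Alertmanager grouping.

```python
def alert_states(condition_results, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True = expr returned a sample)
    interval_s: evaluation interval in seconds
    for_s: the rule's `for` duration in seconds
    """
    states = []
    active_since = None  # time at which the condition first became true
    for index, is_true in enumerate(condition_results):
        now = index * interval_s
        if not is_true:
            # Any evaluation where the condition is false resets the alert.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states
```

With a 15s evaluation interval and `for: 1m`, the alert stays pending for four evaluations and fires on the fifth. A single false evaluation in between resets the timer, which is why a `for` duration filters out transient spikes.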
# Validate config syntax
promtool check config prometheus.yml
# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml
# Test a query
promtool query instant http://localhost:9090 'up'
# Reload config without restart (requires the --web.enable-lifecycle flag)
curl -X POST http://localhost:9090/-/reload
| Practice | Detail |
|----------|--------|
| Naming: prefix_name_unit | Snake_case, _total for counters, _seconds/_bytes for units |
| Scrape intervals 15-60s | Shorter intervals waste resources and storage |
| Recording rules for dashboards | Pre-compute anything queried repeatedly |
| Monitor Prometheus itself | prometheus_tsdb_*, scrape_duration_seconds |
| HA deployment | 2+ instances scraping same targets |
| Retention planning | Match --storage.tsdb.retention.time to disk capacity |
| Federation for scale | Global Prometheus aggregates from regional instances |
| Long-term storage | Thanos or Cortex for >30d retention |
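For the retention planning row above, disk needs can be estimated with the rule of thumb from the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes). The Python sketch below applies that formula. `estimate_disk_bytes` is a hypothetical helper, and the default of 2 bytes per sample is a conservative assumption, not a measured value.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Estimate TSDB disk usage in bytes (rule-of-thumb sketch).

    Assumes every active series produces one sample per scrape interval.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    # samples/second = active_series / scrape_interval_s
    return retention_seconds * active_series * bytes_per_sample / scrape_interval_s
```

As an example, 1,000,000 active series scraped every 15s and kept for 30 days come to about 346 GB under these assumptions, so the volume size and `--storage.tsdb.retention.time` should be chosen together.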
| Problem | Diagnosis | Fix |
|---------|-----------|-----|
| Target shows DOWN | Check /targets page for error | Fix firewall, verify endpoint, check TLS |
| Metrics missing | Query up{job="x"} | Verify scrape config, check /metrics endpoint |
| High cardinality | prometheus_tsdb_head_series growing | Drop high-cardinality labels with metric_relabel_configs |
| Storage filling up | Check prometheus_tsdb_storage_* | Reduce retention, add disk, cap usage with --storage.tsdb.retention.size |
| Slow queries | Check prometheus_engine_query_duration_seconds | Add recording rules, reduce range, limit series |
| Config not applied | Check prometheus_config_last_reload_successful | Fix syntax, POST /-/reload |
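When chasing the high-cardinality problem from the table above, it helps to know which metric on a target contributes the most series. The Python sketch below counts series per metric name in a saved /metrics response. It is a deliberately simple parser for the text exposition format, and `series_per_metric` is a hypothetical helper for this example.

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count exposed series per metric name in a /metrics payload."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts
```

Feeding it the output of `curl` against a target's /metrics endpoint and calling `.most_common(10)` on the result lists the ten metrics with the most series, which are the usual candidates for `metric_relabel_configs`.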
| Anti-Pattern | Why | Do Instead |
|-------------|-----|------------|
| Scrape interval < 5s | Overwhelms targets and storage | Use 15-60s intervals |
| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |
| Alert without a for duration | Fires on transient spikes | Set for: to at least 1m |
| Skip recording rules | Dashboards compute expensive queries every load | Pre-compute with recording rules |
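| Store secrets in prometheus.yml | Config often in Git | Use file-based secrets (password_file, credentials_file) or render the config from a template at deploy time; Prometheus does not expand environment variables in its config |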
| Ignore up metric | Miss targets silently going down | Alert on up == 0 for all jobs |
| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas with shared targets |
| Unbounded retention | Disk fills, Prometheus crashes | Set explicit --storage.tsdb.retention.time |
| Template | Description |
|----------|-------------|
| templates/prometheus.yml | Full config with static, file-based, and K8s discovery |
| templates/alert-rules.yml | 25+ alert rules by category |
| templates/recording-rules.yml | Pre-computed metrics for HTTP, latency, resources, SLOs |