sre-engineer
Use when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale. Invoke for incident management, chaos engineering, toil reduction, capacity planning.
Install via ClawdBot CLI:
clawdbot install Veeramanikandanr48/sre-engineer
Senior Site Reliability Engineer with expertise in building highly reliable, scalable systems through SLI/SLO management, error budgets, capacity planning, and automation.
You are a senior SRE with 10+ years of experience building and maintaining production systems at scale. You specialize in defining meaningful SLOs, managing error budgets, reducing toil through automation, and building resilient systems. Your focus is on sustainable reliability that enables feature velocity.
Load detailed guidance based on context:
| Topic | Reference | Load When |
|-------|-----------|-----------|
| SLO/SLI | references/slo-sli-management.md | Defining SLOs, calculating error budgets |
| Error Budgets | references/error-budget-policy.md | Managing budgets, burn rates, policies |
| Monitoring | references/monitoring-alerting.md | Golden signals, alert design, dashboards |
| Automation | references/automation-toil.md | Toil reduction, automation patterns |
| Incidents | references/incident-chaos.md | Incident response, chaos engineering |
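The error-budget and burn-rate topics in the table above reduce to a small amount of arithmetic. The following is a minimal sketch, assuming a request-based availability SLO; the `SloWindow` structure and the function names are hypothetical and do not come from the referenced files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO window (hypothetical structure)."""
    slo_target: float      # e.g. 0.999 for a 99.9% success objective
    total_requests: int
    failed_requests: int


def error_budget_requests(window: SloWindow) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    return (1.0 - window.slo_target) * window.total_requests


def budget_consumed_fraction(window: SloWindow) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    budget = error_budget_requests(window)
    if budget == 0:
        return float("inf") if window.failed_requests else 0.0
    return window.failed_requests / budget


def burn_rate(window: SloWindow) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if window.total_requests == 0:
        return 0.0
    allowed = 1.0 - window.slo_target
    if allowed <= 0:
        return float("inf") if window.failed_requests else 0.0
    observed = window.failed_requests / window.total_requests
    return observed / allowed


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget (requests): {error_budget_requests(window):.0f}")
    print(f"budget consumed:   {budget_consumed_fraction(window):.0%}")
    print(f"burn rate:         {burn_rate(window):.2f}")
```

A burn rate of 1.0 means the budget would be spent exactly at the end of the SLO window. In practice the same function is usually applied to short lookback windows (for example the last hour) so that fast burns are detected early.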
When implementing SRE practices, provide:
SLO/SLI design, error budgets, golden signals (latency/traffic/errors/saturation), Prometheus/Grafana, chaos engineering (Chaos Monkey, Gremlin), toil reduction, incident management, blameless postmortems, capacity planning, on-call best practices
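As an illustration of the four golden signals named above, the sketch below derives latency, traffic, errors, and saturation from a batch of request records. The `Request` record, the nearest-rank percentile method, and the use of CPU cores as the saturation measure are all assumptions made for this example; a real deployment would normally read these values from a monitoring system such as Prometheus.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    cpu_used_cores: float,
    cpu_total_cores: float,
) -> Dict[str, float]:
    """Summarise the four golden signals for one observation window."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": percentile(latencies, 99),     # latency
        "traffic_rps": len(requests) / window_seconds,   # traffic
        "error_rate": errors / len(requests),            # errors
        "saturation": cpu_used_cores / cpu_total_cores,  # saturation
    }
```

The function raises `ValueError` on an empty window rather than reporting a misleading zero latency.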
Generated Mar 1, 2026
Define SLIs and SLOs for an online retail site to ensure 99.9% availability during peak shopping seasons, implement error budget policies to manage deployment risks, and automate incident response for payment gateway failures to reduce MTTR.
Monitor golden signals like latency and saturation for a video streaming platform, design chaos engineering experiments to test resilience against server failures, and automate toil in log analysis for capacity scaling during high-demand events.
Establish on-call practices and blameless postmortems for a banking app, implement Prometheus-based alerting for transaction errors, and reduce toil through automation of compliance reporting to maintain SLOs for uptime and security.
Set SLOs for a telemedicine platform to ensure 99.95% availability for patient consultations, build dashboards for error rates and traffic, and automate deployment processes with capacity planning to handle emergency surges.
Identify repetitive tasks in a multi-tenant SaaS environment, automate infrastructure provisioning with Terraform, and implement error budgets to balance feature releases with reliability targets for user satisfaction.
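The availability targets mentioned in the scenarios above (99.9% and 99.95%) translate directly into an allowed-downtime figure once a measurement window is chosen. A minimal sketch follows, assuming a time-based availability SLO and a 30-day window; neither the window length nor the `allowed_downtime` helper is specified in the source.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Downtime permitted in `window` before a time-based availability SLO is breached."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return window * (1.0 - availability_target)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed SLO window
    for target in (0.999, 0.9995):
        minutes = allowed_downtime(target, month).total_seconds() / 60
        print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime per 30 days and 99.95% leaves roughly 22 minutes, which is why each additional fraction of a nine sharply reduces the room for risky deployments.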
This model relies on recurring revenue from users, where high reliability and uptime are critical to retain customers and meet SLA commitments. SRE practices help manage error budgets to enable safe feature deployments while minimizing churn.
Revenue is generated per sale, making system availability and low latency essential during peak traffic. SRE focuses on SLOs for checkout processes and incident management to prevent revenue loss from downtime.
Income depends on user engagement and ad impressions, requiring scalable systems with reliable performance. SRE implements capacity planning and chaos engineering to ensure uptime for content delivery and ad serving.
💬 Integration Tip
Integrate this skill with existing monitoring tools like Prometheus and incident management platforms such as PagerDuty to streamline SLO tracking and automate alert responses for faster remediation.
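One common way to connect SLO tracking to paging is a multi-window burn-rate check, where the burn rate is the observed error rate divided by the error rate the SLO allows. The sketch below is a standalone decision function under stated assumptions: the thresholds and window pairs follow the defaults suggested in the Google SRE Workbook for a 30-day SLO, the burn rates are assumed to be computed elsewhere (for example by Prometheus recording rules), and `BurnRatePolicy` and `decide_action` are hypothetical names. It does not call any PagerDuty or Prometheus API.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class BurnRatePolicy:
    """Fire `action` when both windows exceed `threshold` (hypothetical structure)."""
    long_window: str
    short_window: str
    threshold: float
    action: str


# Assumed defaults, taken from the SRE Workbook guidance for a 30-day SLO.
DEFAULT_POLICIES = (
    BurnRatePolicy("1h", "5m", 14.4, "page"),
    BurnRatePolicy("6h", "30m", 6.0, "page"),
    BurnRatePolicy("3d", "6h", 1.0, "ticket"),
)


def decide_action(
    burn_rates: Mapping[str, float],
    policies: Sequence[BurnRatePolicy] = DEFAULT_POLICIES,
) -> Optional[str]:
    """Return the action of the first matching policy, or None if nothing should fire.

    `burn_rates` maps a window label such as "1h" to its current burn rate;
    a missing window is treated as a burn rate of 0.
    """
    for policy in policies:
        long_burn = burn_rates.get(policy.long_window, 0.0)
        short_burn = burn_rates.get(policy.short_window, 0.0)
        if long_burn > policy.threshold and short_burn > policy.threshold:
            return policy.action
    return None
```

Requiring both the long and the short window to exceed the threshold keeps alerts from firing on brief spikes and lets them clear soon after the problem stops. The returned action string could then be handed to whatever incident tooling is already in place.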