# Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
## Do not use this skill when
## Instructions
resources/implementation-playbook.md.
## Use this skill when
## Core Concepts
### 1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
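
The response-time targets in the table can be turned into concrete deadlines. The following is a minimal sketch, not part of the original templates: the names `Severity`, `RESPONSE_TARGETS`, and `response_deadline` are hypothetical, and SEV4 ("next business day") is modeled as having no fixed deadline because the table gives no duration for it.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"
    SEV4 = "SEV4"


# Response-time targets copied from the severity table.
# SEV4 is "next business day", which has no fixed duration, so it maps to None.
RESPONSE_TARGETS = {
    Severity.SEV1: timedelta(minutes=15),
    Severity.SEV2: timedelta(minutes=30),
    Severity.SEV3: timedelta(hours=2),
    Severity.SEV4: None,
}


def response_deadline(severity, detected_at):
    """Return the latest acceptable response time, or None if no fixed target applies."""
    target = RESPONSE_TARGETS[severity]
    return None if target is None else detected_at + target
```

For example, a SEV1 detected at 12:00 would need a response by 12:15 under these targets.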
### 2. Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

## Runbook Templates
### Template 1: Service Outage Runbook
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
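
To make the alert conditions concrete, here is a rough sketch that evaluates the three thresholds listed above against a metrics snapshot. The `Metrics` class and `firing_alerts` function are hypothetical names introduced for illustration, and rates are assumed to be fractions between 0 and 1; a real setup would define these rules in the alerting system itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0 (assumed unit)
    latency_p99_s: float   # 99th percentile latency in seconds
    success_rate: float    # fraction of successful payments, 0.0 to 1.0 (assumed unit)


def firing_alerts(m):
    """Return (alert name, notification channel) pairs for every threshold that is crossed."""
    alerts = []
    if m.error_rate > 0.05:
        alerts.append(("payment_error_rate", "PagerDuty"))
    if m.latency_p99_s > 2.0:
        alerts.append(("payment_latency_p99", "Slack"))
    if m.success_rate < 0.95:
        alerts.append(("payment_success_rate", "PagerDuty"))
    return alerts
```

The comparisons are strict, matching the `>` and `<` in the alert definitions, so a value exactly at a threshold does not fire.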
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# (-G with --data-urlencode stops curl from treating the {} and [] in the PromQL as URL globs)
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
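
The Prometheus query above returns JSON rather than a bare number. As an illustrative sketch, the helper below pulls the first sample value out of an instant-query response body; `first_sample_value` is a hypothetical name, and the code assumes the standard instant-vector response shape where each sample is a `[timestamp, "value"]` pair.

```python
import json


def first_sample_value(response_body):
    """Extract the first sample of a Prometheus instant-query response as a float.

    Returns None when the query matched no series.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value as string"].
    return float(result[0]["value"][1])
```

An empty result usually means no requests matched the selector in the window, which is different from an error rate of zero, so the sketch returns `None` rather than `0.0` in that case.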
### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history
### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
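
The classification table is effectively a lookup from symptom to likely cause and mitigation section. A small sketch of that lookup follows, which could back a chat-bot command or a CLI helper; the names `TRIAGE` and `route_symptom` are hypothetical.

```python
# Symptom -> (likely cause, mitigation section), copied from the classification table.
TRIAGE = {
    "all requests failing": ("Service down", "4.1"),
    "high latency": ("Database/dependency", "4.2"),
    "partial failures": ("Code bug", "4.3"),
    "spike in errors": ("Traffic surge", "4.4"),
}


def route_symptom(symptom):
    """Return (likely cause, section) for a known symptom, or None if it is not in the table."""
    return TRIAGE.get(symptom.strip().lower())
```

Returning `None` for unknown symptoms keeps the responder from being sent to the wrong section when the symptom does not match the table.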
## Mitigation Procedures
### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# (the filter repeats the expression because a SELECT alias cannot be used in WHERE)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# (set PID to a pid returned by Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
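
The runbook only toggles `STRIPE_CIRCUIT_BREAKER_ENABLED`; it does not show what the service does with it. For readers unfamiliar with the pattern, the following is a generic, minimal circuit breaker sketch, not the payment service's actual implementation. The class name, thresholds, and method names are all assumptions.

```python
import time


class CircuitBreaker:
    """Stop calling a dependency after repeated failures, then retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed (calls allowed)

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial request through (half-open state).
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before each call to the dependency and report the outcome with `record_success()` or `record_failure()`. While the circuit is open, requests fail fast instead of waiting on a slow dependency, which is what relieves latency during the incident.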
### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
# (the Content-Type header marks the body as JSON; curl -d defaults to form encoding)
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
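
The Step 1 pipeline (`grep -i error | sort | uniq -c | sort -rn | head -20`) can also be approximated offline on saved logs. The sketch below is a rough Python equivalent; `top_error_lines` is a hypothetical helper, and unlike `uniq -c` it strips surrounding whitespace before counting.

```python
from collections import Counter


def top_error_lines(log_lines, limit=20):
    """Count identical log lines containing 'error' (case-insensitive), most frequent first."""
    errors = (line.strip() for line in log_lines if "error" in line.lower())
    return Counter(errors).most_common(limit)
```

Both this helper and the shell pipeline count exact duplicates, so lines that differ only by timestamp or request ID are counted separately. Stripping those fields first usually gives a more useful ranking.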
### 4.4 Traffic Surge
```bash
# Step 1: Check current load (kubectl top reports per-pod CPU and memory, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        # - add each suspicious CIDR range here as a list item
EOF
```
## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq .

# Verify error rate is back to normal
# (-G with --data-urlencode stops curl from treating the {} and [] in the PromQL as URL globs)
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq .

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
## Escalation Matrix
| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
## Communication Templates
### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:

Updates in #payments-incidents
```
### Status Update
```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:

Next Steps:

ETA to Resolution: ~15 minutes
```
### Resolution Notification
```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:

Follow-up:
```

### Template 2: Database Incident Runbook
# Database Incident Runbook
## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"

# VACUUM to reclaim space
# (VACUUM FULL takes an exclusive lock on the table and needs free space for a full copy while it runs)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
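
As a complement to the `df -h` check in the Disk Space Critical steps, disk usage can be checked programmatically before it becomes an incident. This is a minimal sketch using the standard library's `shutil.disk_usage`; the 90% threshold and the function names are assumptions, not values from the runbook.

```python
import shutil

CRITICAL_PERCENT = 90.0  # assumed threshold; tune to the environment


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def is_disk_critical(path, threshold=CRITICAL_PERCENT):
    """Return True when the filesystem containing `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) >= threshold
```

For the database host in this template, the path would be the data directory passed to `df` above, for example `is_disk_critical("/var/lib/postgresql/data")`.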
## Best Practices
### Do's
### Don'ts
## Resources