Autonomous incident response and self-healing codebase agent. Use when building SRE automation, incident pipelines, error detection, auto-remediation, or production monitoring systems. Covers the full lifecycle from error ingestion to diagnosis, fix generation, approval gating, phone escalation, and deployment.
```shell
npx @senso-ai/shipables install ayushozha/deepops-sre
```

An autonomous agent that monitors live applications, detects errors in real time, diagnoses root causes, writes and deploys fixes, and calls the on-call engineer only when human approval is needed.
Use this skill when:
- Building SRE automation or incident response pipelines
- Implementing error detection and auto-remediation
- Designing production monitoring systems with approval gating
The system follows a closed-loop incident pipeline:
```
Error Detected → Ingested (Airbyte) → Stored (Aerospike) → Diagnosed (Macroscope)
→ Fixed (Kiro) → Auth Check (Auth0) → Deployed (TrueFoundry) → Resolved (Overmind)
```
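The stage order above implies a state machine over the incident `status` field (values from the canonical record below). A minimal sketch — this transition table is illustrative, not part of the shipped contract:

```python
# Illustrative status state machine for the pipeline above.
# The transition table is an assumption; the real agent may differ.
ALLOWED_TRANSITIONS = {
    "stored": {"diagnosing", "failed"},
    "diagnosing": {"fixing", "failed"},
    "fixing": {"gating", "failed"},
    "gating": {"awaiting_approval", "deploying", "failed"},  # manual vs auto
    "awaiting_approval": {"deploying", "failed"},            # approved vs rejected
    "deploying": {"resolved", "failed"},
}

def advance(status: str, next_status: str) -> str:
    """Validate and apply a status transition; raise on illegal moves."""
    if next_status not in ALLOWED_TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {next_status}")
    return next_status
```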
| Severity | Route | Human Needed? |
|---|---|---|
| Low/Medium | Auto-deploy | No |
| High | Manual approval via dashboard | Yes |
| Critical | Phone escalation via Bland AI | Yes (voice call) |
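The routing table above can be expressed as a small dispatch function. A minimal sketch — the function name and return shape are illustrative:

```python
def route_incident(severity: str) -> dict:
    """Map incident severity to a remediation route, per the table above."""
    if severity in ("low", "medium"):
        return {"route": "auto-deploy", "human_needed": False}
    if severity == "high":
        # Manual approval via the dashboard.
        return {"route": "manual_approval", "human_needed": True}
    if severity == "critical":
        # Phone escalation via Bland AI: a human decides on a voice call.
        return {"route": "phone_escalation", "human_needed": True}
    raise ValueError(f"unknown severity: {severity}")
```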
Always use a single canonical incident record that flows through the entire pipeline:
```json
{
  "incident_id": "inc-uuid",
  "status": "stored|diagnosing|fixing|gating|awaiting_approval|deploying|resolved|failed",
  "severity": "pending|low|medium|high|critical",
  "source": {"provider": "...", "error_type": "...", "error_message": "...", "source_file": "..."},
  "diagnosis": {"status": "...", "root_cause": "...", "confidence": 0.0, "affected_components": []},
  "fix": {"status": "...", "diff_preview": "...", "files_changed": [], "test_plan": []},
  "approval": {"required": false, "mode": "auto|manual", "status": "pending|approved|rejected"},
  "deployment": {"provider": "...", "status": "...", "deploy_url": "..."},
  "timeline": [{"at_ms": 0, "status": "...", "actor": "...", "message": "...", "sponsor": "..."}]
}
```
Always normalize errors into the canonical schema before storing:
```python
from agent.contracts import normalize_incident, default_incident

incident = default_incident(status="stored")
incident["source"]["error_type"] = "ZeroDivisionError"
incident["source"]["error_message"] = "division by zero"
incident["source"]["source_file"] = "app/routes.py"

normalized = normalize_incident(incident)
```
When using Aerospike, store the full incident as a single JSON payload bin with short index bins:
```python
import json

bins = {
    "payload": json.dumps(incident),      # Full JSON document
    "iid": incident["incident_id"],       # Index bin for point lookups
    "st": incident["status"],             # Filter by status
    "sev": incident["severity"],          # Filter by severity
    "crt_ms": incident["created_at_ms"],  # Created timestamp (ms)
    "upd_ms": incident["updated_at_ms"],  # Updated timestamp (ms)
}
```
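Reading an incident back needs only the `payload` bin; the short bins exist purely for filtering. A round-trip sketch, with hypothetical helper names standing in for real Aerospike client calls:

```python
import json

def incident_to_bins(incident: dict) -> dict:
    """Flatten an incident into Aerospike bins: full payload + index bins."""
    return {
        "payload": json.dumps(incident),
        "iid": incident["incident_id"],
        "st": incident["status"],
        "sev": incident["severity"],
    }

def incident_from_bins(bins: dict) -> dict:
    """Recover the canonical incident from the payload bin alone."""
    return json.loads(bins["payload"])
```

Because the payload bin holds the whole record, schema changes never require touching the index bins; only add a new short bin when a new query pattern needs it.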
Always try the primary LLM provider first, then fall back gracefully:

```python
import os

if os.environ.get("ANTHROPIC_API_KEY"):
    provider = "anthropic"  # Use Claude for diagnosis
elif os.environ.get("OPENAI_API_KEY"):
    provider = "openai"     # Fall back to GPT-4o
else:
    raise RuntimeError("No LLM API key configured")
```
For critical incidents, trigger a voice call with structured decision parsing:
```python
call_result = await bland_client.send_escalation_call(
    incident=incident,
    phone_number=settings.bland_phone_number,
    webhook_url=settings.bland_webhook_url,
)
# Webhook callback at /api/webhooks/bland parses the voice transcript
# into approve/reject/suggest decisions.
```
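The webhook's decision parsing could look like the following keyword-based sketch. The function name and keyword lists are illustrative, not the skill's actual parser:

```python
def parse_decision(transcript: str) -> str:
    """Classify a voice transcript into approve/reject/suggest.

    Keyword heuristic for illustration only; a production parser would
    prefer structured decision fields or an LLM classifier.
    """
    text = transcript.lower()
    if any(w in text for w in ("approve", "ship it", "go ahead", "deploy")):
        return "approve"
    if any(w in text for w in ("reject", "do not deploy", "roll back")):
        return "reject"
    # Anything else is treated as a suggested change from the engineer.
    return "suggest"
```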
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/health | Backend health + store status |
| GET | /api/incidents | List all incidents |
| POST | /api/ingest/demo-trigger | Inject a test bug |
| POST | /api/agent/run-once | Process one incident through full pipeline |
| POST | /api/approval/{id}/decision | Submit human approval/rejection |
| GET | /api/incidents/stream | SSE stream for real-time updates |
| GET | /api/settings/overview | Integration status dashboard |
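The `/api/incidents/stream` endpoint emits Server-Sent Events, where each `data:` line carries a JSON incident update. A minimal client-side parser sketch (the `data:` field format follows the SSE spec; the payload shape is assumed to be the canonical incident record):

```python
import json

def parse_sse_events(stream_text: str) -> list[dict]:
    """Parse raw SSE text into a list of decoded JSON payloads.

    Handles the common single-line `data:` form; multi-line data and
    `event:`/`id:` fields are ignored for brevity.
    """
    events = []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events
```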
- Always call `normalize_incident()` before storing to ensure schema compliance
- Never hard-code `fallback_mode=True` in production; let it be driven by env vars