Autonomous incident response and self-healing codebase agent. Use when building SRE automation, incident pipelines, error detection, auto-remediation, or production monitoring systems. Covers the full lifecycle from error ingestion to diagnosis, fix generation, approval gating, phone escalation, and deployment.
```shell
npx @senso-ai/shipables install ayushozha/deepops-sre
```

An autonomous agent that monitors live applications, detects errors in real time, diagnoses root causes, writes and deploys fixes, and calls the on-call engineer only when human approval is needed.
Use this skill when:
- Building SRE automation or incident response pipelines
- Implementing error detection and auto-remediation
- Designing production monitoring systems with approval gating
The system follows a closed-loop incident pipeline:
```
Error Detected → Ingested (Airbyte) → Stored (Aerospike) → Diagnosed (Macroscope)
→ Fixed (Kiro) → Auth Check (Auth0) → Deployed (TrueFoundry) → Resolved (Overmind)
```
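The stage order above implies a state machine over the incident `status` field (values from the canonical record below). A minimal sketch — this transition table is illustrative, not part of the shipped contract:

```python
# Illustrative status state machine for the pipeline above.
# The transition table is an assumption; the real agent may differ.
ALLOWED_TRANSITIONS = {
    "stored": {"diagnosing", "failed"},
    "diagnosing": {"fixing", "failed"},
    "fixing": {"gating", "failed"},
    "gating": {"awaiting_approval", "deploying", "failed"},  # manual vs auto
    "awaiting_approval": {"deploying", "failed"},            # approved vs rejected
    "deploying": {"resolved", "failed"},
}

def advance(status: str, next_status: str) -> str:
    """Validate and apply a status transition; raise on illegal moves."""
    if next_status not in ALLOWED_TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {next_status}")
    return next_status
```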
| Severity | Route | Human Needed? |
|---|---|---|
| Low/Medium | Auto-deploy | No |
| High | Manual approval via dashboard | Yes |
| Critical | Phone escalation via Bland AI | Yes (voice call) |
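The routing table above can be expressed as a small dispatch function. A minimal sketch — the function name and return shape are illustrative:

```python
def route_incident(severity: str) -> dict:
    """Map incident severity to a remediation route, per the table above."""
    if severity in ("low", "medium"):
        return {"route": "auto-deploy", "human_needed": False}
    if severity == "high":
        # Manual approval via the dashboard.
        return {"route": "manual_approval", "human_needed": True}
    if severity == "critical":
        # Phone escalation via Bland AI: a human decides on a voice call.
        return {"route": "phone_escalation", "human_needed": True}
    raise ValueError(f"unknown severity: {severity}")
```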
Always use a single canonical incident record that flows through the entire pipeline:
```json
{
  "incident_id": "inc-uuid",
  "status": "stored|diagnosing|fixing|gating|awaiting_approval|deploying|resolved|failed",
  "severity": "pending|low|medium|high|critical",
  "source": {"provider": "...", "error_type": "...", "error_message": "...", "source_file": "..."},
  "diagnosis": {"status": "...", "root_cause": "...", "confidence": 0.0, "affected_components": []},
  "fix": {"status": "...", "diff_preview": "...", "files_changed": [], "test_plan": []},
  "approval": {"required": false, "mode": "auto|manual", "status": "pending|approved|rejected"},
  "deployment": {"provider": "...", "status": "...", "deploy_url": "..."},
  "timeline": [{"at_ms": 0, "status": "...", "actor": "...", "message": "...", "sponsor": "..."}]
}
```
Always normalize errors into the canonical schema before storing:
```python
from agent.contracts import normalize_incident, default_incident

incident = default_incident(status="stored")
incident["source"]["error_type"] = "ZeroDivisionError"
incident["source"]["error_message"] = "division by zero"
incident["source"]["source_file"] = "app/routes.py"

normalized = normalize_incident(incident)
```
When using Aerospike, store the full incident as a single JSON payload bin with short index bins:
```python
import json

bins = {
    "payload": json.dumps(incident),      # Full JSON document
    "iid": incident["incident_id"],       # Index bin for point lookups
    "st": incident["status"],             # Filter by status
    "sev": incident["severity"],          # Filter by severity
    "crt_ms": incident["created_at_ms"],  # Created timestamp (ms)
    "upd_ms": incident["updated_at_ms"],  # Updated timestamp (ms)
}
```
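Reading an incident back needs only the `payload` bin; the short bins exist purely for filtering. A round-trip sketch, with hypothetical helper names standing in for real Aerospike client calls:

```python
import json

def incident_to_bins(incident: dict) -> dict:
    """Flatten an incident into Aerospike bins: full payload + index bins."""
    return {
        "payload": json.dumps(incident),
        "iid": incident["incident_id"],
        "st": incident["status"],
        "sev": incident["severity"],
    }

def incident_from_bins(bins: dict) -> dict:
    """Recover the canonical incident from the payload bin alone."""
    return json.loads(bins["payload"])
```

Because the payload bin holds the whole record, schema changes never require touching the index bins; only add a new short bin when a new query pattern needs it.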
Always try the primary LLM provider first, then fall back gracefully:

```python
import os

if os.environ.get("ANTHROPIC_API_KEY"):
    provider = "anthropic"  # Use Claude for diagnosis
elif os.environ.get("OPENAI_API_KEY"):
    provider = "openai"     # Fall back to GPT-4o
else:
    raise RuntimeError("No LLM API key configured")
```
For critical incidents, trigger a voice call with structured decision parsing:
```python
call_result = await bland_client.send_escalation_call(
    incident=incident,
    phone_number=settings.bland_phone_number,
    webhook_url=settings.bland_webhook_url,
)
# Webhook callback at /api/webhooks/bland parses the voice transcript
# into approve/reject/suggest decisions.
```
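The webhook's decision parsing could look like the following keyword-based sketch. The function name and keyword lists are illustrative, not the skill's actual parser:

```python
def parse_decision(transcript: str) -> str:
    """Classify a voice transcript into approve/reject/suggest.

    Keyword heuristic for illustration only; a production parser would
    prefer structured decision fields or an LLM classifier.
    """
    text = transcript.lower()
    if any(w in text for w in ("approve", "ship it", "go ahead", "deploy")):
        return "approve"
    if any(w in text for w in ("reject", "do not deploy", "roll back")):
        return "reject"
    # Anything else is treated as a suggested change from the engineer.
    return "suggest"
```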
| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/health | Backend health + store status |
| GET | /api/incidents | List all incidents |
| POST | /api/ingest/demo-trigger | Inject a test bug |
| POST | /api/agent/run-once | Process one incident through full pipeline |
| POST | /api/approval/{id}/decision | Submit human approval/rejection |
| GET | /api/incidents/stream | SSE stream for real-time updates |
| GET | /api/settings/overview | Integration status dashboard |
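The `/api/incidents/stream` endpoint emits Server-Sent Events, where each `data:` line carries a JSON incident update. A minimal client-side parser sketch (the `data:` field format follows the SSE spec; the payload shape is assumed to be the canonical incident record):

```python
import json

def parse_sse_events(stream_text: str) -> list[dict]:
    """Parse raw SSE text into a list of decoded JSON payloads.

    Handles the common single-line `data:` form; multi-line data and
    `event:`/`id:` fields are ignored for brevity.
    """
    events = []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events
```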
- Always call `normalize_incident()` before storing to ensure schema compliance
- Never hard-code `fallback_mode=True` in production; let it be driven by env vars