Build self-improving runtime security for autonomous AI agents — intercept actions, dispatch adversarial investigators, generate evolving scoring rules, and enforce deterministic block decisions with no LLM in the enforcement path.
npx @senso-ai/shipables install stevenybusiness-svg/sentinel-ai-security

Sentinel is a runtime security supervision layer for autonomous AI agents. It intercepts agent actions at the execution boundary, dispatches independent AI investigators to adversarially verify claims against ground truth, and blocks actions that fail verification — using deterministic scoring functions, not LLM judgment.
The core innovation is the self-improvement loop: when a novel attack is confirmed, Sentinel autonomously generates a Python scoring function capturing the behavioral fingerprint, validates it, and hot-deploys it. Generated rules evolve across incidents, compounding signal from every confirmed threat.
Payment Request
│
▼
┌─────────────────────┐
│ Sentinel Supervisor │ Claude Opus 4.6
│ (orchestration) │
└────────┬────────────┘
│
┌────┼────────────────┐
▼ ▼ ▼
┌─────┐ ┌────────────┐ ┌──────────┐
│Risk │ │ Compliance │ │Forensics │ Claude Sonnet 4.6
└──┬──┘ └─────┬──────┘ └────┬─────┘ (parallel dispatch)
│ │ │
└──────────┼──────────────┘
▼
┌─────────────────┐
│ Verdict Board │ Synthesized mismatches,
│ Assembly │ flags, z-scores
└────────┬────────┘
▼
┌─────────────────┐
│ Safety Gate │ Deterministic: score ≥ 1.0 → BLOCK
│ (no LLM here) │ Hardcoded + generated rules
└────────┬────────┘
▼
GO / NO-GO / ESCALATE
Always use asyncio.TaskGroup for parallel sub-agent dispatch. Never use asyncio.gather() — it does not cancel sibling tasks on exception.
async with asyncio.TaskGroup() as tg:
    risk_task = tg.create_task(risk_agent.investigate(verdict_board))
    compliance_task = tg.create_task(compliance_agent.investigate(verdict_board))
    forensics_task = tg.create_task(forensics_agent.investigate(verdict_board))
# All three complete or all cancel — no zombie agent calls
risk_result = risk_task.result()
compliance_result = compliance_task.result()
forensics_result = forensics_task.result()
The block decision must never depend on an LLM call. Use a composite scoring pattern where each rule returns a weighted anomaly score.
# Tier 1: Hardcoded rules (immutable, expert-authored)
# Tier 2: Generated rules (learned from confirmed attacks)
composite_score = 0.0
fired_rules = []
for rule in hardcoded_rules + generated_rules:
    score = rule.score(verdict_board)
    if score > 0.0:
        fired_rules.append((rule.rule_id, score))
        composite_score += score

if composite_score >= 1.0:
    decision = "NO-GO"      # Block the action
elif composite_score >= 0.6:
    decision = "ESCALATE"   # Human review required
else:
    decision = "GO"         # Proceed
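The rule objects iterated above can be sketched as small classes exposing `rule_id` and `score()`. The names here (`Rule`, `ExtremeConfidenceRule`) and the z-score threshold are illustrative assumptions, not Sentinel's actual API:

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Base shape shared by hardcoded and generated rules (illustrative)."""
    rule_id: str
    weight: float

    def score(self, verdict_board: dict) -> float:
        raise NotImplementedError


class ExtremeConfidenceRule(Rule):
    """Hypothetical Tier-1 rule: fires when agent confidence deviates sharply from baseline."""

    def score(self, verdict_board: dict) -> float:
        z = verdict_board.get("confidence_z_score", 0.0)
        return self.weight if z >= 3.0 else 0.0


rule = ExtremeConfidenceRule(rule_id="rule_001_extreme_confidence", weight=0.5)
print(rule.score({"confidence_z_score": 3.91}))  # 0.5
```

Because every rule returns a plain float, the gate loop stays pure computation: no rule can reintroduce an LLM call into the enforcement path.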
When an attack is confirmed, generate a Python scoring function that captures the behavioral pattern — not entity names or specific values.
# Rule generation pipeline:
# 1. Extract prediction errors (what supervisor expected vs what happened)
# 2. LLM generates Python scoring function targeting the behavioral fingerprint
# 3. Validate with 4 checks:
# - AST parse (syntactically valid Python)
# - Fires on attack fixture (score > 0.6)
# - Silent on clean baseline (score < 0.2)
# - No forbidden tokens (import, __, open, exec, eval)
# 4. Compile via RestrictedPython
# 5. Hot-deploy to Safety Gate (zero restart)
# 6. Persist to Aerospike with provenance metadata
Always validate generated rules against both attack and clean fixtures before deployment. A rule that fires on clean transactions is worse than no rule at all.
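The four-check validation above can be sketched as a single function; `validate_rule` and the fixture shapes are assumptions for illustration, with thresholds taken from the pipeline description:

```python
import ast

FORBIDDEN = ("import", "__", "open", "exec", "eval")


def validate_rule(rule_source: str, score_fn, attack_fixture: dict,
                  clean_fixture: dict) -> list[str]:
    """Run the four pre-deployment checks; return a list of failure reasons (empty = pass)."""
    failures = []
    # 1. Syntactically valid Python
    try:
        ast.parse(rule_source)
    except SyntaxError as e:
        failures.append(f"AST parse failed: {e}")
    # 2. No forbidden tokens (conservative substring match)
    for token in FORBIDDEN:
        if token in rule_source:
            failures.append(f"forbidden token: {token}")
    # 3. Fires on the confirmed-attack fixture (score > 0.6)
    if score_fn(attack_fixture) <= 0.6:
        failures.append("rule does not fire on attack fixture")
    # 4. Silent on the clean baseline (score < 0.2)
    if score_fn(clean_fixture) >= 0.2:
        failures.append("rule fires on clean baseline")
    return failures
```

Returning the full failure list, rather than failing fast, lets the generator feed every reason back into the next LLM attempt.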
When a second attack fires an existing generated rule, evolve it rather than creating a new one.
# Evolution strategy:
# - Strengthen conditions that appeared in BOTH incidents
# - Drop conditions specific to only one incident (noise)
# - Tighten thresholds (e.g., confidence_threshold 0.85 → 0.88)
# - Version the rule (rule_generated_004.py → rule_generated_004_v2.py)
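The append-only versioning step can be sketched as a filename bump; `next_rule_version` is a hypothetical helper, not part of Sentinel's codebase:

```python
import re


def next_rule_version(filename: str) -> str:
    """rule_generated_004.py -> rule_generated_004_v2.py; _v2 -> _v3; and so on."""
    m = re.fullmatch(r"(.+?)_v(\d+)\.py", filename)
    if m:
        return f"{m.group(1)}_v{int(m.group(2)) + 1}.py"
    return filename[:-len(".py")] + "_v2.py"


print(next_rule_version("rule_generated_004.py"))  # rule_generated_004_v2.py
```

Keeping the old version on disk preserves provenance: if an evolved rule regresses, the gate can be rolled back to the prior version without touching any other rule.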
Synthesize findings from all investigators into a structured signal set that rules can evaluate.
verdict_board = {
"mismatches": [...], # Field-level: agent_claim vs verified_value
"behavioral_flags": [...], # Union of all agent flags
"agent_confidence": 0.95, # Original agent confidence
"confidence_z_score": 3.91, # Statistical deviation from baseline
"step_sequence_deviation": True, # Wrong tool call order
"unable_to_verify": [...], # Agent IDs that couldn't verify claims
"prediction_errors": [...], # Supervisor prediction vs actual
}
Never use eval() for generated rules. Always compile with RestrictedPython and execute with minimal safe globals.
from RestrictedPython import compile_restricted, safe_globals

# Pre-check: reject any source containing forbidden tokens
# (substring match is deliberately conservative — e.g. "important" is also rejected)
FORBIDDEN = ["import", "__", "open", "exec", "eval"]
for token in FORBIDDEN:
    if token in rule_source:
        raise SecurityError(f"Forbidden token: {token}")

# Compile and execute in restricted environment
code = compile_restricted(rule_source, filename=rule_id, mode="exec")
restricted_globals = safe_globals.copy()
restricted_globals["_getattr_"] = getattr  # Allow attribute access on verdict_board
restricted_globals["_getitem_"] = lambda obj, key: obj[key]  # RestrictedPython routes a[b] through _getitem_
exec(code, restricted_globals)
score_fn = restricted_globals.get("score")
Stream investigation progress to the frontend. Define typed events for each phase.
# Event types to implement:
# investigation_started — Reset UI, initialize visualization
# payment_agent_step — Progress updates during agent execution
# agent_completed — Individual investigator results (x3)
# verdict_board_assembled — Mismatch table ready for display
# gate_evaluated — Decision + composite score + attribution
# episode_written — Persistence confirmation with latency
# rule_generating — Stream rule source tokens as they're generated
# rule_deployed — New rule added to gate, ready for next incident
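The event envelope can be sketched with stdlib dataclasses to stay dependency-free here; in Sentinel these would presumably live alongside the Pydantic models in `schemas/`, and the `payload` shape is an assumption:

```python
import json
from dataclasses import asdict, dataclass, field

EVENT_TYPES = frozenset({
    "investigation_started", "payment_agent_step", "agent_completed",
    "verdict_board_assembled", "gate_evaluated", "episode_written",
    "rule_generating", "rule_deployed",
})


@dataclass
class SentinelEvent:
    """Typed envelope for WebSocket events streamed to the frontend (illustrative)."""
    type: str
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.type not in EVENT_TYPES:
            raise ValueError(f"Unknown event type: {self.type}")

    def to_json(self) -> str:
        return json.dumps({"type": self.type, **asdict(self)})
```

Validating the event type at construction time means a typo in a new emitter fails loudly in tests instead of silently producing an event the frontend ignores.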
Maintain statistical baselines for agent behavior. Confidence z-scores are the strongest single signal.
# Compute z-score: how far is this agent's confidence from normal?
z_score = (agent_confidence - baseline_mean) / baseline_std
# Example: baseline_mean=0.52, baseline_std=0.11
# Agent reports 0.95 confidence → z_score = 3.91 (extreme anomaly)
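One way to maintain those baselines without storing full history is Welford's online algorithm; the class name and fields below are illustrative, not Sentinel's schema:

```python
import math
from dataclasses import dataclass


@dataclass
class ConfidenceBaseline:
    """Online mean/std of agent confidence via Welford's algorithm."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # Sum of squared deviations from the running mean

    def update(self, confidence: float) -> None:
        self.n += 1
        delta = confidence - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (confidence - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def z_score(self, confidence: float) -> float:
        return (confidence - self.mean) / self.std if self.std > 0 else 0.0
```

Because `update()` is O(1) per observation, the baseline can be refreshed on every clean episode and persisted as just three numbers.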
LLM in the enforcement path. The block decision must be pure computation. If model latency, downtime, or hallucination can affect whether a fraudulent action executes, the safety gate is broken.
Entity-specific rules. A rule that checks for "Globex Capital" catches one attack. A rule that checks for "extreme confidence + KYC verification failure + critical mismatches" catches an entire class. Always generate behavioral rules, not entity-specific ones.
Skipping rule validation. Every generated rule must be validated against both attack and clean fixtures before deployment. A rule that fires on legitimate transactions causes more damage than the attacks it prevents.
Using asyncio.gather() for agent dispatch. gather() does not cancel remaining tasks when one raises. Use asyncio.TaskGroup for structured concurrency with automatic cleanup.
Trusting agent self-reports. The entire architecture exists because agents are confident liars — not malicious, but adversarial inputs corrupt reasoning while leaving confidence intact. Always verify claims against independent ground truth.
Mutable rule state. Hardcoded rules must be immutable. Generated rules are append-only with versioning. Never modify a rule in place — create a new version.
sentinel/
├── agents/ # AI agents (Supervisor, Payment, Risk, Compliance, Forensics)
├── api/ # FastAPI app, routes, WebSocket manager
├── engine/ # Safety Gate, Rule Generator, Verdict Board, Prediction
├── gate/rules/ # Hardcoded (rule_*.py) + generated (rule_generated_*.py)
├── memory/ # Aerospike persistence (episodes, rules, baselines)
├── schemas/ # Pydantic models (Verdict, VerdictBoard, Episode, Payment)
├── fixtures/ # Test data (KYC ledger, counterparty DB, baselines)
└── config.py # Environment and model configuration
frontend/
└── src/
├── components/ # React components (InvestigationTree, GateDecisionPanel, etc.)
├── hooks/ # useWebSocket, state management
└── store.js # Zustand store
Always set an explicit timeout= on every Claude API call; the default 10-minute timeout can hang an entire investigation. Use asyncio.timeout() as a secondary guard.