Build self-improving runtime security for autonomous AI agents — intercept actions, dispatch adversarial investigators, generate evolving scoring rules, and enforce deterministic block decisions with no LLM in the enforcement path.
npx @senso-ai/shipables install stevenybusiness-svg/sentinel-ai-security

Sentinel is a runtime security supervision layer for autonomous AI agents. It intercepts agent actions at the execution boundary, dispatches independent AI investigators to adversarially verify claims against ground truth, and blocks actions that fail verification — using deterministic scoring functions, not LLM judgment.
The core innovation is the self-improvement loop: when a novel attack is confirmed, Sentinel autonomously generates a Python scoring function capturing the behavioral fingerprint, validates it, and hot-deploys it. Generated rules evolve across incidents, compounding signal from every confirmed threat.
Payment Request
│
▼
┌─────────────────────┐
│ Sentinel Supervisor │ Claude Opus 4.6
│ (orchestration) │
└────────┬────────────┘
│
┌────┼────────────────┐
▼ ▼ ▼
┌─────┐ ┌────────────┐ ┌──────────┐
│Risk │ │ Compliance │ │Forensics │ Claude Sonnet 4.6
└──┬──┘ └─────┬──────┘ └────┬─────┘ (parallel dispatch)
│ │ │
└──────────┼──────────────┘
▼
┌─────────────────┐
│ Verdict Board │ Synthesized mismatches,
│ Assembly │ flags, z-scores
└────────┬────────┘
▼
┌─────────────────┐
│ Safety Gate │ Deterministic: score ≥ 1.0 → BLOCK
│ (no LLM here) │ Hardcoded + generated rules
└────────┬────────┘
▼
GO / NO-GO / ESCALATE
Always use asyncio.TaskGroup for parallel sub-agent dispatch. Never use asyncio.gather() — it does not cancel sibling tasks on exception.
async with asyncio.TaskGroup() as tg:
    risk_task = tg.create_task(risk_agent.investigate(verdict_board))
    compliance_task = tg.create_task(compliance_agent.investigate(verdict_board))
    forensics_task = tg.create_task(forensics_agent.investigate(verdict_board))
# All three complete or all cancel — no zombie agent calls
risk_result = risk_task.result()
compliance_result = compliance_task.result()
forensics_result = forensics_task.result()
The block decision must never depend on an LLM call. Use a composite scoring pattern where each rule returns a weighted anomaly score.
# Tier 1: Hardcoded rules (immutable, expert-authored)
# Tier 2: Generated rules (learned from confirmed attacks)
composite_score = 0.0
fired_rules = []
for rule in hardcoded_rules + generated_rules:
    score = rule.score(verdict_board)
    if score > 0.0:
        fired_rules.append((rule.rule_id, score))
        composite_score += score

if composite_score >= 1.0:
    decision = "NO-GO"      # Block the action
elif composite_score >= 0.6:
    decision = "ESCALATE"   # Human review required
else:
    decision = "GO"         # Proceed
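The rule objects iterated above can be sketched as small classes exposing `rule_id` and `score()`. The names here (`Rule`, `ExtremeConfidenceRule`) and the z-score threshold are illustrative assumptions, not Sentinel's actual API:

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Base shape shared by hardcoded and generated rules (illustrative)."""
    rule_id: str
    weight: float

    def score(self, verdict_board: dict) -> float:
        raise NotImplementedError


class ExtremeConfidenceRule(Rule):
    """Hypothetical Tier-1 rule: fires when agent confidence deviates sharply from baseline."""

    def score(self, verdict_board: dict) -> float:
        z = verdict_board.get("confidence_z_score", 0.0)
        return self.weight if z >= 3.0 else 0.0


rule = ExtremeConfidenceRule(rule_id="rule_001_extreme_confidence", weight=0.5)
print(rule.score({"confidence_z_score": 3.91}))  # 0.5
```

Because every rule returns a plain float, the gate loop stays pure computation: no rule can reintroduce an LLM call into the enforcement path.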
When an attack is confirmed, generate a Python scoring function that captures the behavioral pattern — not entity names or specific values.
# Rule generation pipeline:
# 1. Extract prediction errors (what supervisor expected vs what happened)
# 2. LLM generates Python scoring function targeting the behavioral fingerprint
# 3. Validate with 4 checks:
# - AST parse (syntactically valid Python)
# - Fires on attack fixture (score > 0.6)
# - Silent on clean baseline (score < 0.2)
# - No forbidden tokens (import, __, open, exec, eval)
# 4. Compile via RestrictedPython
# 5. Hot-deploy to Safety Gate (zero restart)
# 6. Persist to Aerospike with provenance metadata
Always validate generated rules against both attack and clean fixtures before deployment. A rule that fires on clean transactions is worse than no rule at all.
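The four-check validation above can be sketched as a single function; `validate_rule` and the fixture shapes are assumptions for illustration, with thresholds taken from the pipeline description:

```python
import ast

FORBIDDEN = ("import", "__", "open", "exec", "eval")


def validate_rule(rule_source: str, score_fn, attack_fixture: dict,
                  clean_fixture: dict) -> list[str]:
    """Run the four pre-deployment checks; return a list of failure reasons (empty = pass)."""
    failures = []
    # 1. Syntactically valid Python
    try:
        ast.parse(rule_source)
    except SyntaxError as e:
        failures.append(f"AST parse failed: {e}")
    # 2. No forbidden tokens (conservative substring match)
    for token in FORBIDDEN:
        if token in rule_source:
            failures.append(f"forbidden token: {token}")
    # 3. Fires on the confirmed-attack fixture (score > 0.6)
    if score_fn(attack_fixture) <= 0.6:
        failures.append("rule does not fire on attack fixture")
    # 4. Silent on the clean baseline (score < 0.2)
    if score_fn(clean_fixture) >= 0.2:
        failures.append("rule fires on clean baseline")
    return failures
```

Returning the full failure list, rather than failing fast, lets the generator feed every reason back into the next LLM attempt.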
When a second attack fires an existing generated rule, evolve it rather than creating a new one.
# Evolution strategy:
# - Strengthen conditions that appeared in BOTH incidents
# - Drop conditions specific to only one incident (noise)
# - Tighten thresholds (e.g., confidence_threshold 0.85 → 0.88)
# - Version the rule (rule_generated_004.py → rule_generated_004_v2.py)
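The append-only versioning step can be sketched as a filename bump; `next_rule_version` is a hypothetical helper, not part of Sentinel's codebase:

```python
import re


def next_rule_version(filename: str) -> str:
    """rule_generated_004.py -> rule_generated_004_v2.py; _v2 -> _v3; and so on."""
    m = re.fullmatch(r"(.+?)_v(\d+)\.py", filename)
    if m:
        return f"{m.group(1)}_v{int(m.group(2)) + 1}.py"
    return filename[:-len(".py")] + "_v2.py"


print(next_rule_version("rule_generated_004.py"))  # rule_generated_004_v2.py
```

Keeping the old version on disk preserves provenance: if an evolved rule regresses, the gate can be rolled back to the prior version without touching any other rule.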
Synthesize findings from all investigators into a structured signal set that rules can evaluate.
verdict_board = {
"mismatches": [...], # Field-level: agent_claim vs verified_value
"behavioral_flags": [...], # Union of all agent flags
"agent_confidence": 0.95, # Original agent confidence
"confidence_z_score": 3.91, # Statistical deviation from baseline
"step_sequence_deviation": True, # Wrong tool call order
"unable_to_verify": [...], # Agent IDs that couldn't verify claims
"prediction_errors": [...], # Supervisor prediction vs actual
}
Never use eval() for generated rules. Always compile with RestrictedPython and execute with minimal safe globals.
from RestrictedPython import compile_restricted, safe_globals

# Pre-check: reject any source containing forbidden tokens
# (substring match is deliberately conservative — e.g. "important" is also rejected)
FORBIDDEN = ["import", "__", "open", "exec", "eval"]
for token in FORBIDDEN:
    if token in rule_source:
        raise SecurityError(f"Forbidden token: {token}")

# Compile and execute in restricted environment
code = compile_restricted(rule_source, filename=rule_id, mode="exec")
restricted_globals = safe_globals.copy()
restricted_globals["_getattr_"] = getattr  # Allow attribute access on verdict_board
restricted_globals["_getitem_"] = lambda obj, key: obj[key]  # RestrictedPython routes a[b] through _getitem_
exec(code, restricted_globals)
score_fn = restricted_globals.get("score")
Stream investigation progress to the frontend. Define typed events for each phase.
# Event types to implement:
# investigation_started — Reset UI, initialize visualization
# payment_agent_step — Progress updates during agent execution
# agent_completed — Individual investigator results (x3)
# verdict_board_assembled — Mismatch table ready for display
# gate_evaluated — Decision + composite score + attribution
# episode_written — Persistence confirmation with latency
# rule_generating — Stream rule source tokens as they're generated
# rule_deployed — New rule added to gate, ready for next incident
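The event envelope can be sketched with stdlib dataclasses to stay dependency-free here; in Sentinel these would presumably live alongside the Pydantic models in `schemas/`, and the `payload` shape is an assumption:

```python
import json
from dataclasses import asdict, dataclass, field

EVENT_TYPES = frozenset({
    "investigation_started", "payment_agent_step", "agent_completed",
    "verdict_board_assembled", "gate_evaluated", "episode_written",
    "rule_generating", "rule_deployed",
})


@dataclass
class SentinelEvent:
    """Typed envelope for WebSocket events streamed to the frontend (illustrative)."""
    type: str
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.type not in EVENT_TYPES:
            raise ValueError(f"Unknown event type: {self.type}")

    def to_json(self) -> str:
        return json.dumps({"type": self.type, **asdict(self)})
```

Validating the event type at construction time means a typo in a new emitter fails loudly in tests instead of silently producing an event the frontend ignores.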
Maintain statistical baselines for agent behavior. Confidence z-scores are the strongest single signal.
# Compute z-score: how far is this agent's confidence from normal?
z_score = (agent_confidence - baseline_mean) / baseline_std
# Example: baseline_mean=0.52, baseline_std=0.11
# Agent reports 0.95 confidence → z_score = 3.91 (extreme anomaly)
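One way to maintain those baselines without storing full history is Welford's online algorithm; the class name and fields below are illustrative, not Sentinel's schema:

```python
import math
from dataclasses import dataclass


@dataclass
class ConfidenceBaseline:
    """Online mean/std of agent confidence via Welford's algorithm."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # Sum of squared deviations from the running mean

    def update(self, confidence: float) -> None:
        self.n += 1
        delta = confidence - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (confidence - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def z_score(self, confidence: float) -> float:
        return (confidence - self.mean) / self.std if self.std > 0 else 0.0
```

Because `update()` is O(1) per observation, the baseline can be refreshed on every clean episode and persisted as just three numbers.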
LLM in the enforcement path. The block decision must be pure computation. If model latency, downtime, or hallucination can affect whether a fraudulent action executes, the safety gate is broken.
Entity-specific rules. A rule that checks for "Globex Capital" catches one attack. A rule that checks for "extreme confidence + KYC verification failure + critical mismatches" catches an entire class. Always generate behavioral rules, not entity-specific ones.
Skipping rule validation. Every generated rule must be validated against both attack and clean fixtures before deployment. A rule that fires on legitimate transactions causes more damage than the attacks it prevents.
Using asyncio.gather() for agent dispatch. gather() does not cancel remaining tasks when one raises. Use asyncio.TaskGroup for structured concurrency with automatic cleanup.
Trusting agent self-reports. The entire architecture exists because agents are confident liars — not malicious, but adversarial inputs corrupt reasoning while leaving confidence intact. Always verify claims against independent ground truth.
Mutable rule state. Hardcoded rules must be immutable. Generated rules are append-only with versioning. Never modify a rule in place — create a new version.
sentinel/
├── agents/ # AI agents (Supervisor, Payment, Risk, Compliance, Forensics)
├── api/ # FastAPI app, routes, WebSocket manager
├── engine/ # Safety Gate, Rule Generator, Verdict Board, Prediction
├── gate/rules/ # Hardcoded (rule_*.py) + generated (rule_generated_*.py)
├── memory/ # Aerospike persistence (episodes, rules, baselines)
├── schemas/ # Pydantic models (Verdict, VerdictBoard, Episode, Payment)
├── fixtures/ # Test data (KYC ledger, counterparty DB, baselines)
└── config.py # Environment and model configuration
frontend/
└── src/
├── components/ # React components (InvestigationTree, GateDecisionPanel, etc.)
├── hooks/ # useWebSocket, state management
└── store.js # Zustand store
Always set an explicit timeout= on every Claude API call; the default 10-minute timeout can hang an entire investigation. Use asyncio.timeout() as a secondary guard.