Builds an autonomous iterative research loop that narrows a wide uncertainty/exposure range toward a user-defined target by issuing Google searches, extracting evidence from results, re-estimating the range from accumulated evidence, and repeating until converged. Use when the user wants to implement autonomous research that progressively reduces uncertainty toward a quantified goal — risk analysis, market sizing, due diligence, literature review, or any domain where a wide estimate must be narrowed with real evidence.
npx @senso-ai/shipables install Sivolc2/risk-autoresearch

Implements an autonomous iterative research loop — inspired by karpathy/autoresearch — adapted for narrowing exposure/uncertainty ranges rather than minimizing ML validation loss.
Objective: minimize uncertainty_score = (range_high - range_low) / prior
Each iteration: generate targeted queries, run the searches, extract evidence from the results, re-estimate the range from accumulated evidence, and score progress.
The loop stops when: score converges, target reached, or max_iterations hit.
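The control flow above can be sketched as a simplified, synchronous skeleton. The real engine is async and yields stream events; every name here is illustrative and the callables are injected stubs, not the actual module functions.

```python
def run_loop(state, generate_queries, search, extract_evidence,
             re_estimate, progress_score, has_converged,
             max_iterations=8, target_score=0.95):
    """Simplified skeleton of the iteration loop; all callables are injected."""
    scores = []
    for iteration in range(1, max_iterations + 1):
        state["iteration"] = iteration
        for query in generate_queries(state):              # 1. propose searches
            for result in search(query):                   # 2. execute them
                evidence = extract_evidence(state, result)  # 3. mine evidence
                if evidence is not None:
                    state["evidence"].append(evidence)
        re_estimate(state)                                 # 4. re-estimate the range
        scores.append(progress_score(state))               # 5. score progress
        if scores[-1] >= target_score:
            break  # target reached
        if has_converged(scores):
            break  # plateau
    return state, scores

# Toy demo: each re-estimate halves the distance to a target range of [40, 60]
state = {"low": 0.0, "high": 100.0, "evidence": [], "iteration": 0}
final_state, scores = run_loop(
    state,
    generate_queries=lambda s: ["stub query"],
    search=lambda q: ["stub result"],
    extract_evidence=lambda s, r: {"summary": "stub"},
    re_estimate=lambda s: s.update(low=(s["low"] + 40) / 2,
                                   high=(s["high"] + 60) / 2),
    progress_score=lambda s: 1 - (s["high"] - s["low"]) / 100,
    has_converged=lambda scores: (len(scores) >= 3
                                  and scores[-1] - scores[-3] < 0.005),
)
print(final_state["iteration"], scores[-1])
```

With these stubs the score climbs toward but never reaches 0.95, so the loop runs to the hard iteration limit — the third stopping condition.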
your_project/
├── autoresearch/
│   ├── __init__.py
│   ├── models.py        # ResearchState, Evidence, SearchResult, AutoresearchRequest
│   ├── searcher.py      # Google Custom Search API + page fetching
│   ├── researcher.py    # generate_queries(), extract_evidence()
│   ├── evaluator.py     # progress_score(), re_estimate_exposure()
│   ├── engine.py        # Main async loop — yields stream events
│   ├── llm_utils.py     # ask_json() — disables LLM thinking for JSON calls
│   └── parse_utils.py   # extract_json() — robust code-fence stripping
└── routers/
    └── autoresearch_route.py  # POST /api/autoresearch NDJSON streaming endpoint
pip install httpx google-genai fastapi python-dotenv
GOOGLE_SEARCH_API_KEY=... # Google API key with Custom Search JSON API enabled
GOOGLE_SEARCH_CX=... # Programmable Search Engine ID
GEMINI_API_KEY=... # Or your preferred LLM provider
Without GOOGLE_SEARCH_API_KEY + GOOGLE_SEARCH_CX, the loop runs in LLM-only mode: queries are generated but no searches are executed; the LLM reasons from prior knowledge only.
autoresearch/models.py:

from dataclasses import dataclass, field
from pydantic import BaseModel
class AutoresearchRequest(BaseModel):
risk_factor_name: str
business_context: str
initial_exposure_low: int # USD — current wide estimate
initial_exposure_high: int
target_exposure_low: int # USD — desired narrow estimate
target_exposure_high: int
max_iterations: int = 8
max_searches_per_iteration: int = 4
fetch_page_content: bool = False
@dataclass
class Evidence:
source_url: str
source_title: str
query_used: str
summary: str
exposure_impact: str # narrows_low | narrows_high | narrows_both | widens | neutral
suggested_low: int | None
suggested_high: int | None
confidence: float # 0.0–1.0
iteration: int
@dataclass
class ResearchState:
risk_factor_name: str
business_context: str
current_low: int
current_high: int
initial_low: int
initial_high: int
target_low: int
target_high: int
evidence: list[Evidence] = field(default_factory=list)
queries_used: list[str] = field(default_factory=list)
gaps: list[str] = field(default_factory=list)
iteration: int = 0
total_searches: int = 0
total_tokens: int = 0
@property
def current_width(self): return max(0, self.current_high - self.current_low)
@property
def initial_width(self): return max(1, self.initial_high - self.initial_low)
@property
def target_width(self): return max(1, self.target_high - self.target_low)
autoresearch/llm_utils.py:

This is the most important implementation detail. Models like Gemini 2.5 Flash use thinking tokens that consume from max_output_tokens. Without disabling thinking, JSON responses are silently truncated.
from your_llm_client import gemini_client, DEFAULT_MODEL, ask_llm  # ask_llm = provider-agnostic fallback
async def ask_json(prompt_text: str, system_message: str,
max_tokens: int = 2000, temperature: float = 0.2) -> str:
if gemini_client is not None:
try:
response = gemini_client.models.generate_content(
model=DEFAULT_MODEL,
contents=prompt_text,
config={
"system_instruction": system_message,
"max_output_tokens": max_tokens,
"temperature": temperature,
"thinking_config": {"thinking_budget": 0}, # CRITICAL
},
)
return response.text
except Exception:
pass # Fall back to standard ask_llm
return await ask_llm(prompt_text, system_message, max_tokens, temperature)
autoresearch/parse_utils.py:

LLMs inconsistently wrap JSON in code fences. Use a 3-strategy extractor:
import json, re
from typing import Any
def extract_json(text: str) -> Any:
text = text.strip()
# Strategy 1: direct parse
try:
return json.loads(text)
except json.JSONDecodeError:
pass
# Strategy 2: strip ``` fences
fenced = re.sub(r"^```[a-z]*\n?", "", text)
fenced = re.sub(r"\n?```$", "", fenced.strip())
try:
return json.loads(fenced)
except json.JSONDecodeError:
pass
# Strategy 3: find first { } or [ ] block
for s, e in [('{', '}'), ('[', ']')]:
start, end = text.find(s), text.rfind(e)
if start != -1 and end > start:
try:
return json.loads(text[start:end + 1])
except json.JSONDecodeError:
pass
raise json.JSONDecodeError("No valid JSON found", text, 0)
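A quick check of the extractor against the three response shapes it handles. The function body repeats the code above so the snippet runs standalone:

```python
import json, re
from typing import Any

def extract_json(text: str) -> Any:
    text = text.strip()
    try:
        return json.loads(text)                       # 1. direct parse
    except json.JSONDecodeError:
        pass
    fenced = re.sub(r"^```[a-z]*\n?", "", text)       # 2. strip ``` fences
    fenced = re.sub(r"\n?```$", "", fenced.strip())
    try:
        return json.loads(fenced)
    except json.JSONDecodeError:
        pass
    for s, e in [('{', '}'), ('[', ']')]:             # 3. first {...} or [...] block
        start, end = text.find(s), text.rfind(e)
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                pass
    raise json.JSONDecodeError("No valid JSON found", text, 0)

print(extract_json('{"relevant": true}'))                    # bare JSON
print(extract_json('```json\n{"relevant": true}\n```'))      # fenced JSON
print(extract_json('Sure! ["q1", "q2"] Hope that helps.'))   # JSON buried in chatter
```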
autoresearch/evaluator.py:

Score is a harmonic mean so both width reduction AND center alignment must improve together. A narrow range pointing at the wrong magnitude is penalized.
def progress_score(state: ResearchState) -> float:
# Width progress: how far from initial_width toward target_width?
if state.initial_width <= state.target_width:
width_score = min(1.0, state.target_width / max(state.current_width, 1))
else:
needed = state.initial_width - state.target_width
achieved = state.initial_width - state.current_width
width_score = max(0.0, min(1.0, achieved / needed))
# Center alignment
current_center = (state.current_low + state.current_high) / 2
target_center = (state.target_low + state.target_high) / 2
center_diff = abs(current_center - target_center)
center_score = max(0.0, 1.0 - center_diff / (state.initial_width / 2))
# Harmonic mean — both must improve
if width_score + center_score == 0:
return 0.0
return 2 * width_score * center_score / (width_score + center_score)
def has_converged(scores: list[float], window: int = 2, threshold: float = 0.005) -> bool:
if len(scores) < window + 1:
return False
return (scores[-1] - scores[-(window + 1)]) < threshold
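A sanity check of how the scoring behaves on a few toy states. make_state is a throwaway stand-in for ResearchState with the width properties precomputed, and both functions repeat the formulas above so the example runs standalone:

```python
from types import SimpleNamespace

def make_state(cur_low, cur_high, initial=(0, 100), target=(40, 60)):
    # Minimal stand-in for ResearchState; widths follow the property definitions
    return SimpleNamespace(
        current_low=cur_low, current_high=cur_high,
        target_low=target[0], target_high=target[1],
        current_width=max(0, cur_high - cur_low),
        initial_width=max(1, initial[1] - initial[0]),
        target_width=max(1, target[1] - target[0]),
    )

def progress_score(state):
    if state.initial_width <= state.target_width:
        width_score = min(1.0, state.target_width / max(state.current_width, 1))
    else:
        needed = state.initial_width - state.target_width
        achieved = state.initial_width - state.current_width
        width_score = max(0.0, min(1.0, achieved / needed))
    current_center = (state.current_low + state.current_high) / 2
    target_center = (state.target_low + state.target_high) / 2
    center_score = max(0.0, 1.0 - abs(current_center - target_center)
                       / (state.initial_width / 2))
    if width_score + center_score == 0:
        return 0.0
    return 2 * width_score * center_score / (width_score + center_score)

def has_converged(scores, window=2, threshold=0.005):
    if len(scores) < window + 1:
        return False
    return (scores[-1] - scores[-(window + 1)]) < threshold

print(progress_score(make_state(0, 100)))  # starting point: no width progress → 0.0
print(progress_score(make_state(40, 60)))  # exactly on target → 1.0
# Fully narrowed but mis-centered: width_score=1.0, center_score=0.2 → harmonic ≈ 0.33
print(progress_score(make_state(0, 20)))
print(has_converged([0.40, 0.401, 0.402]))  # plateau → True
```

Note the third case: an arithmetic mean would give 0.6 for a narrow-but-wrong range; the harmonic mean drags it down to about 0.33, which is the point of the design.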
The engine (autoresearch/engine.py) yields NDJSON events. Wire these to your frontend:
| Event | Key fields | Use |
|---|---|---|
| `search_query` | `query`, `iteration` | Show in agent thread panel |
| `search_result` | `title`, `url`, `snippet` | Show source being processed |
| `evidence_found` | `summary`, `impact`, `confidence`, `suggested_low`, `suggested_high` | Update evidence list |
| `evidence_skipped` | `title`, `url` | Show in thread panel |
| `iteration_update` | `iteration`, `exposure_low`, `exposure_high`, `progress_score`, `width_reduction_pct`, `tokens_so_far` | Drive your chart |
| `signal` | `text` | Status messages in thread panel |
| `complete` | `result` | Final summary with top evidence |
The iteration_update event is what drives a live uncertainty-narrowing chart. Plot exposure_low/exposure_high as a band against tokens_so_far on the X-axis.
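Each NDJSON line parses independently on the client. A sketch of pulling chart points out of the stream; the stream here is simulated with a list of strings, a real client would iterate over lines of the HTTP response, and the `type` envelope field is an assumption, since the table above lists event names but not the exact wire shape:

```python
import json

def chart_points(ndjson_lines):
    """Collect (tokens_so_far, exposure_low, exposure_high) from iteration_update events."""
    points = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "iteration_update":
            points.append((event["tokens_so_far"],
                           event["exposure_low"],
                           event["exposure_high"]))
    return points

# Simulated stream; values echo the sample run later in this document
stream = [
    '{"type": "signal", "text": "Re-estimating exposure from 0 evidence items"}',
    '{"type": "iteration_update", "iteration": 1, "exposure_low": 4200000,'
    ' "exposure_high": 67000000, "progress_score": 0.0,'
    ' "width_reduction_pct": 0.0, "tokens_so_far": 3100}',
]
print(chart_points(stream))  # → [(3100, 4200000, 67000000)]
```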
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
import json
from autoresearch import engine
from autoresearch.models import AutoresearchRequest
router = APIRouter()
@router.post("/api/autoresearch")
async def autoresearch_endpoint(request: AutoresearchRequest):
async def generate():
async for event in engine.run(request):
yield json.dumps(event) + "\n"
return StreamingResponse(generate(), media_type="application/x-ndjson")
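Assuming the app is served locally on port 8000, the stream can be exercised with curl; `-N` disables output buffering so events print as they arrive. The initial bounds echo the sample run later in this document, while the target bounds are placeholder values:

```shell
curl -N -X POST http://localhost:8000/api/autoresearch \
  -H 'Content-Type: application/json' \
  -d '{
        "risk_factor_name": "Induced seismicity pipeline damage",
        "business_context": "Midstream operator, Permian Basin",
        "initial_exposure_low": 4200000,
        "initial_exposure_high": 67000000,
        "target_exposure_low": 10000000,
        "target_exposure_high": 25000000
      }'
```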
generate_queries() system prompt:

You are an expert risk researcher generating targeted Google search queries.
Given a risk factor, business context, current exposure estimate, target, and known gaps,
generate the most useful search queries to find evidence that narrows the exposure range toward the target.
Return ONLY a JSON array of strings (3–5 queries, no duplicates with provided history):
["query 1", "query 2", "query 3"]
Rules:
- Be specific: include industry names, geographic regions, regulatory bodies, incident types
- Vary query style: statistics/data, incidents, regulatory filings
- Avoid queries already used
- Prefer queries that would surface quantified data (dollar amounts, failure rates, incident counts)
extract_evidence() system prompt:

You are a specialist risk evidence analyst.
Return ONLY valid JSON (no markdown):
{
"relevant": true,
"summary": "1-3 sentences of specific finding relevant to the risk",
"exposure_impact": "narrows_low|narrows_high|narrows_both|widens|neutral",
"suggested_low": null,
"suggested_high": null,
"confidence": 0.0
}
exposure_impact:
narrows_low = evidence suggests lower bound can be raised
narrows_high = evidence suggests upper bound can be lowered
narrows_both = evidence constrains both ends
widens = evidence reveals more risk than estimated
neutral = no useful constraint on the range
confidence: 0.0–1.0 based on source quality (regulatory > peer-reviewed > industry > news > social)
re_estimate_exposure() system prompt:

You are a senior risk quantification analyst.
Return ONLY valid JSON (no markdown):
{
"exposure_low": 0,
"exposure_high": 0,
"failure_rate": 0.0,
"uncertainty": 0.0,
"rationale": "1-3 sentences explaining the range",
"remaining_gaps": ["gap1", "gap2"]
}
Do NOT make ranges artificially narrow. Only narrow where evidence supports it.
# In the main loop:
if score >= 0.95:
break # target reached
if has_converged(scores, window=2, threshold=0.005):
break # plateau — no new evidence is moving the range
if iteration >= max_iterations:
break # hard limit
Run without API keys to verify the loop structure: queries are generated, no searches execute, and the range stays unchanged (LLM-only mode, as described above).
Expected output with no search configured:
[search] iter=1 → "Texas oil pipeline seismic damage costs Permian Basin induced seismicity"
[signal] No results for: Texas oil pipeline...
[signal] Re-estimating exposure from 0 evidence items
=== Iteration 1 complete ===
Exposure: $4,200,000–$67,000,000 ← unchanged — correct, no evidence
Progress: 0.0000 | Width reduction: 0.0%
gap: Historical seismic activity data for specific pipeline routes
gap: Correlation between injection volumes and seismic events
Truncated JSON responses — Gemini 2.5 Flash thinking tokens consume output budget. Always use thinking_config: {thinking_budget: 0} for structured JSON calls. See llm_utils.py above.
Repeated queries across iterations — De-duplicate against state.queries_used before submitting. The LLM will generate similar queries if not given the history.
Range inversion — The LLM occasionally returns exposure_low > exposure_high. Always validate and swap if needed: if high < low: low, high = high, low.
Premature convergence — If max_searches_per_iteration is too low, has_converged triggers early because scores don't improve (no evidence found). Increase max_searches_per_iteration for production runs.
Score can decrease — If new evidence reveals the range is wider than estimated, progress_score drops. This is correct behavior — the loop is discovering that reality is more uncertain than the prior.
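The range-inversion fix above can live in one small helper; sanitize_range is an illustrative name, and clamping the lower bound at zero is an added assumption (exposures are USD amounts):

```python
def sanitize_range(low: int, high: int, floor: int = 0) -> tuple[int, int]:
    """Validate an LLM-returned exposure range before storing it."""
    if high < low:              # pitfall: range inversion — swap
        low, high = high, low
    low = max(floor, low)       # assumption: exposures cannot go negative
    high = max(low, high)
    return low, high

print(sanitize_range(67_000_000, 4_200_000))  # inverted → (4200000, 67000000)
print(sanitize_range(-500_000, 1_000_000))    # negative low → (0, 1000000)
```

Apply it to every range the LLM returns, both in evidence suggestions and in re-estimation output, before updating ResearchState.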
references/api-contract.md — Full request/response schema
references/event-types.md — Complete stream event reference