Builds an autonomous iterative research loop that narrows a wide uncertainty/exposure range toward a user-defined target by issuing Google searches, extracting evidence from results, re-estimating the range from accumulated evidence, and repeating until converged. Use when the user wants to implement autonomous research that progressively reduces uncertainty toward a quantified goal — risk analysis, market sizing, due diligence, literature review, or any domain where a wide estimate must be narrowed with real evidence.
npx @senso-ai/shipables install Sivolc2/risk-autoresearch

Implements an autonomous iterative research loop — inspired by karpathy/autoresearch — adapted for narrowing exposure/uncertainty ranges rather than minimizing ML validation loss.
Objective: minimize uncertainty_score = (range_high - range_low) / prior
Each iteration: generate targeted queries, run the searches, extract evidence from the results, re-estimate the range from accumulated evidence, and score progress.
The loop stops when: score converges, target reached, or max_iterations hit.
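The control flow above can be sketched as a simplified, synchronous skeleton. The real engine is async and yields stream events; every name here is illustrative and the callables are injected stubs, not the actual module functions.

```python
def run_loop(state, generate_queries, search, extract_evidence,
             re_estimate, progress_score, has_converged,
             max_iterations=8, target_score=0.95):
    """Simplified skeleton of the iteration loop; all callables are injected."""
    scores = []
    for iteration in range(1, max_iterations + 1):
        state["iteration"] = iteration
        for query in generate_queries(state):              # 1. propose searches
            for result in search(query):                   # 2. execute them
                evidence = extract_evidence(state, result)  # 3. mine evidence
                if evidence is not None:
                    state["evidence"].append(evidence)
        re_estimate(state)                                 # 4. re-estimate the range
        scores.append(progress_score(state))               # 5. score progress
        if scores[-1] >= target_score:
            break  # target reached
        if has_converged(scores):
            break  # plateau
    return state, scores

# Toy demo: each re-estimate halves the distance to a target range of [40, 60]
state = {"low": 0.0, "high": 100.0, "evidence": [], "iteration": 0}
final_state, scores = run_loop(
    state,
    generate_queries=lambda s: ["stub query"],
    search=lambda q: ["stub result"],
    extract_evidence=lambda s, r: {"summary": "stub"},
    re_estimate=lambda s: s.update(low=(s["low"] + 40) / 2,
                                   high=(s["high"] + 60) / 2),
    progress_score=lambda s: 1 - (s["high"] - s["low"]) / 100,
    has_converged=lambda scores: (len(scores) >= 3
                                  and scores[-1] - scores[-3] < 0.005),
)
print(final_state["iteration"], scores[-1])
```

With these stubs the score climbs toward but never reaches 0.95, so the loop runs to the hard iteration limit — the third stopping condition.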
your_project/
├── autoresearch/
│   ├── __init__.py
│   ├── models.py        # ResearchState, Evidence, SearchResult, AutoresearchRequest
│   ├── searcher.py      # Google Custom Search API + page fetching
│   ├── researcher.py    # generate_queries(), extract_evidence()
│   ├── evaluator.py     # progress_score(), re_estimate_exposure()
│   ├── engine.py        # Main async loop — yields stream events
│   ├── llm_utils.py     # ask_json() — disables LLM thinking for JSON calls
│   └── parse_utils.py   # extract_json() — robust code-fence stripping
└── routers/
    └── autoresearch_route.py  # POST /api/autoresearch NDJSON streaming endpoint
pip install httpx google-genai fastapi python-dotenv
GOOGLE_SEARCH_API_KEY=... # Google API key with Custom Search JSON API enabled
GOOGLE_SEARCH_CX=... # Programmable Search Engine ID
GEMINI_API_KEY=... # Or your preferred LLM provider
Without GOOGLE_SEARCH_API_KEY + GOOGLE_SEARCH_CX, the loop runs in LLM-only mode: queries are generated but no searches are executed; the LLM reasons from prior knowledge only.
autoresearch/models.py:

from dataclasses import dataclass, field
from pydantic import BaseModel
class AutoresearchRequest(BaseModel):
risk_factor_name: str
business_context: str
initial_exposure_low: int # USD — current wide estimate
initial_exposure_high: int
target_exposure_low: int # USD — desired narrow estimate
target_exposure_high: int
max_iterations: int = 8
max_searches_per_iteration: int = 4
fetch_page_content: bool = False
@dataclass
class Evidence:
source_url: str
source_title: str
query_used: str
summary: str
exposure_impact: str # narrows_low | narrows_high | narrows_both | widens | neutral
suggested_low: int | None
suggested_high: int | None
confidence: float # 0.0–1.0
iteration: int
@dataclass
class ResearchState:
risk_factor_name: str
business_context: str
current_low: int
current_high: int
initial_low: int
initial_high: int
target_low: int
target_high: int
evidence: list[Evidence] = field(default_factory=list)
queries_used: list[str] = field(default_factory=list)
gaps: list[str] = field(default_factory=list)
iteration: int = 0
total_searches: int = 0
total_tokens: int = 0
@property
def current_width(self): return max(0, self.current_high - self.current_low)
@property
def initial_width(self): return max(1, self.initial_high - self.initial_low)
@property
def target_width(self): return max(1, self.target_high - self.target_low)
autoresearch/llm_utils.py:

This is the most important implementation detail. Models like Gemini 2.5 Flash use thinking tokens that consume from max_output_tokens. Without disabling thinking, JSON responses are silently truncated.
from your_llm_client import gemini_client, DEFAULT_MODEL, ask_llm  # ask_llm = provider-agnostic fallback
async def ask_json(prompt_text: str, system_message: str,
max_tokens: int = 2000, temperature: float = 0.2) -> str:
if gemini_client is not None:
try:
response = gemini_client.models.generate_content(
model=DEFAULT_MODEL,
contents=prompt_text,
config={
"system_instruction": system_message,
"max_output_tokens": max_tokens,
"temperature": temperature,
"thinking_config": {"thinking_budget": 0}, # CRITICAL
},
)
return response.text
except Exception:
pass # Fall back to standard ask_llm
return await ask_llm(prompt_text, system_message, max_tokens, temperature)
autoresearch/parse_utils.py:

LLMs inconsistently wrap JSON in code fences. Use a 3-strategy extractor:
import json, re
from typing import Any
def extract_json(text: str) -> Any:
text = text.strip()
# Strategy 1: direct parse
try:
return json.loads(text)
except json.JSONDecodeError:
pass
# Strategy 2: strip ``` fences
fenced = re.sub(r"^```[a-z]*\n?", "", text)
fenced = re.sub(r"\n?```$", "", fenced.strip())
try:
return json.loads(fenced)
except json.JSONDecodeError:
pass
# Strategy 3: find first { } or [ ] block
for s, e in [('{', '}'), ('[', ']')]:
start, end = text.find(s), text.rfind(e)
if start != -1 and end > start:
try:
return json.loads(text[start:end + 1])
except json.JSONDecodeError:
pass
raise json.JSONDecodeError("No valid JSON found", text, 0)
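A quick check of the extractor against the three response shapes it handles. The function body repeats the code above so the snippet runs standalone:

```python
import json, re
from typing import Any

def extract_json(text: str) -> Any:
    text = text.strip()
    try:
        return json.loads(text)                       # 1. direct parse
    except json.JSONDecodeError:
        pass
    fenced = re.sub(r"^```[a-z]*\n?", "", text)       # 2. strip ``` fences
    fenced = re.sub(r"\n?```$", "", fenced.strip())
    try:
        return json.loads(fenced)
    except json.JSONDecodeError:
        pass
    for s, e in [('{', '}'), ('[', ']')]:             # 3. first {...} or [...] block
        start, end = text.find(s), text.rfind(e)
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                pass
    raise json.JSONDecodeError("No valid JSON found", text, 0)

print(extract_json('{"relevant": true}'))                    # bare JSON
print(extract_json('```json\n{"relevant": true}\n```'))      # fenced JSON
print(extract_json('Sure! ["q1", "q2"] Hope that helps.'))   # JSON buried in chatter
```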
autoresearch/evaluator.py:

Score is a harmonic mean so both width reduction AND center alignment must improve together. A narrow range pointing at the wrong magnitude is penalized.
def progress_score(state: ResearchState) -> float:
# Width progress: how far from initial_width toward target_width?
if state.initial_width <= state.target_width:
width_score = min(1.0, state.target_width / max(state.current_width, 1))
else:
needed = state.initial_width - state.target_width
achieved = state.initial_width - state.current_width
width_score = max(0.0, min(1.0, achieved / needed))
# Center alignment
current_center = (state.current_low + state.current_high) / 2
target_center = (state.target_low + state.target_high) / 2
center_diff = abs(current_center - target_center)
center_score = max(0.0, 1.0 - center_diff / (state.initial_width / 2))
# Harmonic mean — both must improve
if width_score + center_score == 0:
return 0.0
return 2 * width_score * center_score / (width_score + center_score)
def has_converged(scores: list[float], window: int = 2, threshold: float = 0.005) -> bool:
if len(scores) < window + 1:
return False
return (scores[-1] - scores[-(window + 1)]) < threshold
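A sanity check of how the scoring behaves on a few toy states. make_state is a throwaway stand-in for ResearchState with the width properties precomputed, and both functions repeat the formulas above so the example runs standalone:

```python
from types import SimpleNamespace

def make_state(cur_low, cur_high, initial=(0, 100), target=(40, 60)):
    # Minimal stand-in for ResearchState; widths follow the property definitions
    return SimpleNamespace(
        current_low=cur_low, current_high=cur_high,
        target_low=target[0], target_high=target[1],
        current_width=max(0, cur_high - cur_low),
        initial_width=max(1, initial[1] - initial[0]),
        target_width=max(1, target[1] - target[0]),
    )

def progress_score(state):
    if state.initial_width <= state.target_width:
        width_score = min(1.0, state.target_width / max(state.current_width, 1))
    else:
        needed = state.initial_width - state.target_width
        achieved = state.initial_width - state.current_width
        width_score = max(0.0, min(1.0, achieved / needed))
    current_center = (state.current_low + state.current_high) / 2
    target_center = (state.target_low + state.target_high) / 2
    center_score = max(0.0, 1.0 - abs(current_center - target_center)
                       / (state.initial_width / 2))
    if width_score + center_score == 0:
        return 0.0
    return 2 * width_score * center_score / (width_score + center_score)

def has_converged(scores, window=2, threshold=0.005):
    if len(scores) < window + 1:
        return False
    return (scores[-1] - scores[-(window + 1)]) < threshold

print(progress_score(make_state(0, 100)))  # starting point: no width progress → 0.0
print(progress_score(make_state(40, 60)))  # exactly on target → 1.0
# Fully narrowed but mis-centered: width_score=1.0, center_score=0.2 → harmonic ≈ 0.33
print(progress_score(make_state(0, 20)))
print(has_converged([0.40, 0.401, 0.402]))  # plateau → True
```

Note the third case: an arithmetic mean would give 0.6 for a narrow-but-wrong range; the harmonic mean drags it down to about 0.33, which is the point of the design.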
The engine (autoresearch/engine.py) yields NDJSON events. Wire these to your frontend:
| Event | Key fields | Use |
|---|---|---|
| `search_query` | `query`, `iteration` | Show in agent thread panel |
| `search_result` | `title`, `url`, `snippet` | Show source being processed |
| `evidence_found` | `summary`, `impact`, `confidence`, `suggested_low`, `suggested_high` | Update evidence list |
| `evidence_skipped` | `title`, `url` | Show in thread panel |
| `iteration_update` | `iteration`, `exposure_low`, `exposure_high`, `progress_score`, `width_reduction_pct`, `tokens_so_far` | Drive your chart |
| `signal` | `text` | Status messages in thread panel |
| `complete` | `result` | Final summary with top evidence |
The iteration_update event is what drives a live uncertainty-narrowing chart. Plot exposure_low/exposure_high as a band against tokens_so_far on the X-axis.
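Each NDJSON line parses independently on the client. A sketch of pulling chart points out of the stream; the stream here is simulated with a list of strings, a real client would iterate over lines of the HTTP response, and the `type` envelope field is an assumption, since the table above lists event names but not the exact wire shape:

```python
import json

def chart_points(ndjson_lines):
    """Collect (tokens_so_far, exposure_low, exposure_high) from iteration_update events."""
    points = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "iteration_update":
            points.append((event["tokens_so_far"],
                           event["exposure_low"],
                           event["exposure_high"]))
    return points

# Simulated stream; values echo the sample run later in this document
stream = [
    '{"type": "signal", "text": "Re-estimating exposure from 0 evidence items"}',
    '{"type": "iteration_update", "iteration": 1, "exposure_low": 4200000,'
    ' "exposure_high": 67000000, "progress_score": 0.0,'
    ' "width_reduction_pct": 0.0, "tokens_so_far": 3100}',
]
print(chart_points(stream))  # → [(3100, 4200000, 67000000)]
```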
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
import json
from autoresearch import engine
from autoresearch.models import AutoresearchRequest
router = APIRouter()
@router.post("/api/autoresearch")
async def autoresearch_endpoint(request: AutoresearchRequest):
async def generate():
async for event in engine.run(request):
yield json.dumps(event) + "\n"
return StreamingResponse(generate(), media_type="application/x-ndjson")
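Assuming the app is served locally on port 8000, the stream can be exercised with curl; `-N` disables output buffering so events print as they arrive. The initial bounds echo the sample run later in this document, while the target bounds are placeholder values:

```shell
curl -N -X POST http://localhost:8000/api/autoresearch \
  -H 'Content-Type: application/json' \
  -d '{
        "risk_factor_name": "Induced seismicity pipeline damage",
        "business_context": "Midstream operator, Permian Basin",
        "initial_exposure_low": 4200000,
        "initial_exposure_high": 67000000,
        "target_exposure_low": 10000000,
        "target_exposure_high": 25000000
      }'
```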
generate_queries() system prompt:

You are an expert risk researcher generating targeted Google search queries.
Given a risk factor, business context, current exposure estimate, target, and known gaps,
generate the most useful search queries to find evidence that narrows the exposure range toward the target.
Return ONLY a JSON array of strings (3–5 queries, no duplicates with provided history):
["query 1", "query 2", "query 3"]
Rules:
- Be specific: include industry names, geographic regions, regulatory bodies, incident types
- Vary query style: statistics/data, incidents, regulatory filings
- Avoid queries already used
- Prefer queries that would surface quantified data (dollar amounts, failure rates, incident counts)
extract_evidence() system prompt:

You are a specialist risk evidence analyst.
Return ONLY valid JSON (no markdown):
{
"relevant": true,
"summary": "1-3 sentences of specific finding relevant to the risk",
"exposure_impact": "narrows_low|narrows_high|narrows_both|widens|neutral",
"suggested_low": null,
"suggested_high": null,
"confidence": 0.0
}
exposure_impact:
narrows_low = evidence suggests lower bound can be raised
narrows_high = evidence suggests upper bound can be lowered
narrows_both = evidence constrains both ends
widens = evidence reveals more risk than estimated
neutral = no useful constraint on the range
confidence: 0.0–1.0 based on source quality (regulatory > peer-reviewed > industry > news > social)
re_estimate_exposure() system prompt:

You are a senior risk quantification analyst.
Return ONLY valid JSON (no markdown):
{
"exposure_low": 0,
"exposure_high": 0,
"failure_rate": 0.0,
"uncertainty": 0.0,
"rationale": "1-3 sentences explaining the range",
"remaining_gaps": ["gap1", "gap2"]
}
Do NOT make ranges artificially narrow. Only narrow where evidence supports it.
# In the main loop:
if score >= 0.95:
break # target reached
if has_converged(scores, window=2, threshold=0.005):
break # plateau — no new evidence is moving the range
if iteration >= max_iterations:
break # hard limit
Run without API keys to verify the loop structure: queries are generated, no searches execute, and the range stays unchanged (LLM-only mode, as described above).
Expected output with no search configured:
[search] iter=1 → "Texas oil pipeline seismic damage costs Permian Basin induced seismicity"
[signal] No results for: Texas oil pipeline...
[signal] Re-estimating exposure from 0 evidence items
=== Iteration 1 complete ===
Exposure: $4,200,000–$67,000,000 ← unchanged — correct, no evidence
Progress: 0.0000 | Width reduction: 0.0%
gap: Historical seismic activity data for specific pipeline routes
gap: Correlation between injection volumes and seismic events
Truncated JSON responses — Gemini 2.5 Flash thinking tokens consume output budget. Always use thinking_config: {thinking_budget: 0} for structured JSON calls. See llm_utils.py above.
Repeated queries across iterations — De-duplicate against state.queries_used before submitting. The LLM will generate similar queries if not given the history.
Range inversion — The LLM occasionally returns exposure_low > exposure_high. Always validate and swap if needed: if high < low: low, high = high, low.
Premature convergence — If max_searches_per_iteration is too low, has_converged triggers early because scores don't improve (no evidence found). Increase max_searches_per_iteration for production runs.
Score can decrease — If new evidence reveals the range is wider than estimated, progress_score drops. This is correct behavior — the loop is discovering that reality is more uncertain than the prior.
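The range-inversion fix above can live in one small helper; sanitize_range is an illustrative name, and clamping the lower bound at zero is an added assumption (exposures are USD amounts):

```python
def sanitize_range(low: int, high: int, floor: int = 0) -> tuple[int, int]:
    """Validate an LLM-returned exposure range before storing it."""
    if high < low:              # pitfall: range inversion — swap
        low, high = high, low
    low = max(floor, low)       # assumption: exposures cannot go negative
    high = max(low, high)
    return low, high

print(sanitize_range(67_000_000, 4_200_000))  # inverted → (4200000, 67000000)
print(sanitize_range(-500_000, 1_000_000))    # negative low → (0, 1000000)
```

Apply it to every range the LLM returns, both in evidence suggestions and in re-estimation output, before updating ResearchState.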
references/api-contract.md — Full request/response schema
references/event-types.md — Complete stream event reference