Evaluate, score, and systematically improve prompts in the codebase. Identifies weak prompts, generates test cases, scores outputs, and proposes optimized versions. Use when the user says "improve this prompt", "why is the AI doing X", "eval my prompts", or "optimize the agent".
`npx @senso-ai/shipables install KeyanVakil/eval-prompts`

Find bad prompts, fix them, and prove the fix works.
Search the codebase for strings passed to LLM APIs:

- `system:` and `content:` fields in API calls
- template literals containing "You are", "Your task is", etc.
- `prompt` variables or function parameters

List every prompt with its file location and purpose.
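The search above can be sketched as a small scanner. The regex patterns and file extensions here are illustrative starting points, not a complete inventory of how prompts appear in real codebases:

```python
import re
from pathlib import Path

# Heuristic patterns for prompt-like strings; tune these for your codebase.
PROMPT_PATTERNS = [
    re.compile(r"""system\s*:\s*["'`]"""),           # system: "..." fields
    re.compile(r"""content\s*:\s*["'`]"""),          # content: "..." fields
    re.compile(r"""["'`](You are|Your task is)"""),  # persona openers
    re.compile(r"""\bprompt\s*[=:]"""),              # prompt variables / parameters
]

def find_prompts(root: str, exts=(".py", ".ts", ".js")) -> list[tuple[str, int, str]]:
    """Return (file, line_number, line_text) for every prompt-like match under root."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in exts:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(p.search(line) for p in PROMPT_PATTERNS):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Line-level regex will miss multi-line template literals; treat the output as a candidate list to review, not a complete inventory.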
Rate each prompt from 1 to 5 on each dimension:
| Dimension | What to check |
|---|---|
| Specificity | Does it define exactly what output format is expected? |
| Persona clarity | Does the model know what role it's playing? |
| Scope bounds | Does it say what the model should NOT do? |
| Example grounding | Does it include examples of good output? |
| Failure handling | Does it tell the model what to do when it can't answer? |
Flag any prompt scoring below 3 on two or more dimensions.
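The flagging rule can be sketched as a small helper, assuming scores are kept as a dimension-to-score mapping (the dimension names below are illustrative keys, not a required schema):

```python
RUBRIC_DIMENSIONS = [
    "specificity",
    "persona_clarity",
    "scope_bounds",
    "example_grounding",
    "failure_handling",
]

def is_weak(scores: dict[str, int], threshold: int = 3, min_failures: int = 2) -> bool:
    """Flag a prompt that scores below `threshold` on `min_failures` or more dimensions.

    Missing dimensions count as unscored (0), i.e. failing.
    """
    low = [d for d in RUBRIC_DIMENSIONS if scores.get(d, 0) < threshold]
    return len(low) >= min_failures
```

One low score alone does not flag a prompt; it takes at least two failing dimensions.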
For each weak prompt, create 5 test inputs that span typical usage, edge cases, and inputs the prompt should refuse or deflect.
For each weak prompt, write an improved version that addresses every dimension it scored below 3 on.
Run both the original and the improved prompt against the test cases and compare outputs side by side. Adopt the new prompt only if it is strictly better on the test cases.
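The strictly-better rule can be sketched as follows. `run_prompt` and `score_output` are placeholders for whatever model call and judging step you use, not part of any real API:

```python
from typing import Callable

def compare_prompts(
    original: str,
    improved: str,
    test_inputs: list[str],
    run_prompt: Callable[[str, str], str],    # (prompt, input) -> model output; hypothetical
    score_output: Callable[[str, str], int],  # (input, output) -> score; hypothetical
) -> bool:
    """Adopt the improved prompt only if it is strictly better:
    at least as good on every case, and better on at least one."""
    better_somewhere = False
    for case in test_inputs:
        old = score_output(case, run_prompt(original, case))
        new = score_output(case, run_prompt(improved, case))
        if new < old:
            return False  # any regression rejects the new prompt outright
        if new > old:
            better_somewhere = True
    return better_somewhere
```

A tie on every case returns `False`: a rewrite that is merely "not worse" does not justify churning the prompt.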
When you replace a prompt in code, annotate the change with a short version comment, e.g.:

`// v2: added output format constraint, fixed JSON hallucination`