Evaluate, score, and systematically improve prompts in the codebase. Identifies weak prompts, generates test cases, scores outputs, and proposes optimized versions. Use when the user says "improve this prompt", "why is the AI doing X", "eval my prompts", or "optimize the agent".
`npx @senso-ai/shipables install KeyanVakil/eval-prompts`

Find bad prompts, fix them, and prove the fix works.
Search the codebase for strings passed to LLM APIs:

- `system:` and `content:` fields in API calls
- template literals containing "You are", "Your task is", etc.
- `prompt` variables or function parameters

List every prompt with its file location and purpose.
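The search above can be sketched as a small scanner. The regex patterns and file extensions here are illustrative starting points, not a complete inventory of how prompts appear in real codebases:

```python
import re
from pathlib import Path

# Heuristic patterns for prompt-like strings; tune these for your codebase.
PROMPT_PATTERNS = [
    re.compile(r"""system\s*:\s*["'`]"""),           # system: "..." fields
    re.compile(r"""content\s*:\s*["'`]"""),          # content: "..." fields
    re.compile(r"""["'`](You are|Your task is)"""),  # persona openers
    re.compile(r"""\bprompt\s*[=:]"""),              # prompt variables / parameters
]

def find_prompts(root: str, exts=(".py", ".ts", ".js")) -> list[tuple[str, int, str]]:
    """Return (file, line_number, line_text) for every prompt-like match under root."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in exts:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(p.search(line) for p in PROMPT_PATTERNS):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Line-level regex will miss multi-line template literals; treat the output as a candidate list to review, not a complete inventory.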
Rate each prompt from 1 to 5 on each dimension:
| Dimension | What to check |
|---|---|
| Specificity | Does it define exactly what output format is expected? |
| Persona clarity | Does the model know what role it's playing? |
| Scope bounds | Does it say what the model should NOT do? |
| Example grounding | Does it include examples of good output? |
| Failure handling | Does it tell the model what to do when it can't answer? |
Flag any prompt scoring below 3 on two or more dimensions.
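The flagging rule can be sketched as a small helper, assuming scores are kept as a dimension-to-score mapping (the dimension names below are illustrative keys, not a required schema):

```python
RUBRIC_DIMENSIONS = [
    "specificity",
    "persona_clarity",
    "scope_bounds",
    "example_grounding",
    "failure_handling",
]

def is_weak(scores: dict[str, int], threshold: int = 3, min_failures: int = 2) -> bool:
    """Flag a prompt that scores below `threshold` on `min_failures` or more dimensions.

    Missing dimensions count as unscored (0), i.e. failing.
    """
    low = [d for d in RUBRIC_DIMENSIONS if scores.get(d, 0) < threshold]
    return len(low) >= min_failures
```

One low score alone does not flag a prompt; it takes at least two failing dimensions.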
For each weak prompt, create 5 test inputs that span typical usage, edge cases, and inputs the prompt should refuse or deflect.
For each weak prompt, write an improved version that addresses every dimension it scored below 3 on.
Run both the original and the improved prompt against the test cases and compare outputs side by side. Adopt the new prompt only if it is strictly better on the test cases.
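The strictly-better rule can be sketched as follows. `run_prompt` and `score_output` are placeholders for whatever model call and judging step you use, not part of any real API:

```python
from typing import Callable

def compare_prompts(
    original: str,
    improved: str,
    test_inputs: list[str],
    run_prompt: Callable[[str, str], str],    # (prompt, input) -> model output; hypothetical
    score_output: Callable[[str, str], int],  # (input, output) -> score; hypothetical
) -> bool:
    """Adopt the improved prompt only if it is strictly better:
    at least as good on every case, and better on at least one."""
    better_somewhere = False
    for case in test_inputs:
        old = score_output(case, run_prompt(original, case))
        new = score_output(case, run_prompt(improved, case))
        if new < old:
            return False  # any regression rejects the new prompt outright
        if new > old:
            better_somewhere = True
    return better_somewhere
```

A tie on every case returns `False`: a rewrite that is merely "not worse" does not justify churning the prompt.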
When you replace a prompt in code, annotate the change with a short version comment, e.g.:

`// v2: added output format constraint, fixed JSON hallucination`