Evaluate, score, and systematically improve prompts in the codebase. Identifies weak prompts, generates test cases, scores outputs, and proposes optimized versions. Use when the user says "improve this prompt", "why is the AI doing X", "eval my prompts", or "optimize the agent".