Catalog / affaan-m/agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

global · 0 installs · 0 uses · ~1.0k
v1.1 · Saved Apr 20, 2026

Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

When to Activate

  • Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
  • Measuring agent performance before adopting a new tool or model
  • Running regression checks when an agent updates its model or tooling
  • Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.
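
The repository location and packaging are not documented here. Assuming a GitHub repository under the catalog's namespace and a standard Python package (both assumptions, not confirmed by this page), a from-source install might look like:

git clone https://github.com/affaan-m/agent-eval   # assumed URL
cd agent-eval
pip install .   # assumes a Python package; follow the repo's actual instructions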

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility

Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. This isolation keeps runs reproducible and ensures agents cannot interfere with each other or corrupt the base repo.
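
Conceptually, each run maps onto plain git worktree commands. The sketch below shows the assumed lifecycle; the paths and naming are illustrative, not agent-eval's actual internals:

# Create an isolated checkout of the pinned commit for one run
git worktree add .agent-eval/claude-code-run1 abc1234
# ... the agent edits files and judges execute inside that directory ...
# Tear down the worktree once metrics are recorded
git worktree remove --force .agent-eval/claude-code-run1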

Metrics Collected

Metric        What It Measures
Pass rate     Did the agent produce code that passes the judge?
Cost          API spend per task (when available)
Time          Wall-clock seconds to completion
Consistency   Pass rate across repeated runs (e.g., 3/3 = 100%)

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

mkdir tasks
# Write task definitions (see template above)
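
One YAML file per task keeps runs addressable by name. A possible layout (every file name besides the example above is hypothetical):

tasks/
  add-retry-logic.yaml
  fix-flaky-test.yaml
  refactor-config-loader.yaml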

2. Run Agents

Execute agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

  1. Creates a fresh git worktree from the specified commit
  2. Hands the prompt to the agent
  3. Runs the judge criteria
  4. Records pass/fail, cost, and time
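
The command above takes a single task file, and whether agent-eval batches tasks natively is not documented here. A shell loop covers a whole directory using only the flags already shown:

for task in tasks/*.yaml; do
  agent-eval run --task "$task" --agent claude-code --agent aider --runs 3
done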

3. Compare Results

Generate a comparison report:

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py

Model-Based (LLM-as-judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
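
Judge types compose within a single task: deterministic checks gate correctness, and the LLM judge layers a qualitative signal on top. A sketch reusing the schema from the examples above:

judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
  - type: llm
    prompt: |
      Does the implementation cap delays at 30s and add jitter?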

Best Practices

  • Start with 3-5 tasks that represent your real workload, not toy examples
  • Run at least 3 trials per agent to capture variance — agents are non-deterministic
  • Pin the commit in your task YAML so results are reproducible across days/weeks (see the snippet after this list)
  • Include at least one deterministic judge (tests, build) per task — LLM judges add noise
  • Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
  • Version your task definitions — they are test fixtures, treat them as code
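
To pin a commit, capture the hash of the tree you are testing against and paste it into the task's commit field:

git rev-parse --short HEAD   # prints e.g. abc1234; use as the commit value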
Files (1)

1 file · 1.0 KB

Overall Score: 73/100
Grade: B (Good)

Safety: 75 · Quality: 72 · Clarity: 78 · Completeness: 62

Summary

A systematic CLI tool for benchmarking and comparing coding agents (Claude Code, Aider, Codex, etc.) on reproducible tasks. It isolates each agent run in a git worktree, runs configurable judges (pytest, grep, LLM-based), and collects pass rate, cost, time, and consistency metrics to produce data-backed agent selection reports.

Detected Capabilities

  • Define reproducible tasks in YAML
  • Create isolated git worktrees per agent run
  • Execute agents with prompts against code
  • Run deterministic judges (pytest, bash commands, grep patterns)
  • Run LLM-based judges for qualitative evaluation
  • Collect pass rate, cost, time, and consistency metrics
  • Generate comparison reports in table format

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

  • compare coding agents
  • benchmark agent performance
  • agent regression testing
  • aider vs claude code
  • agent selection decision

Risk Signals

  • INFO · Git worktree creation and checkout to pinned commits (Workflow section, step 2)
  • INFO · External agent execution (Claude Code, Aider, etc.); agents receive prompts and modify code (Core Concepts, Workflow sections)
  • INFO · Bash command execution via judge (command type) (Judge Types section)
  • INFO · LLM-as-judge requires API calls to external models (Judge Types, llm judge subsection)
  • WARNING · No explicit guardrails documented for agent code modifications or rollback strategy (Workflow section)
  • WARNING · No documented handling of agent failures, timeouts, or destructive code (Workflow and error handling)

Referenced Domains

External domains referenced in skill content, detected by static analysis.

github.com

Use Cases

  • Compare coding agents before adopting a new tool
  • Measure agent performance on your own codebase
  • Run regression tests when an agent updates its model
  • Generate team-facing agent benchmark reports

Quality Notes

  • Strong conceptual clarity: core ideas (isolation via worktrees, metric collection, judge types) are well-explained
  • Good use of examples: YAML task definition, sample report table, and command-line syntax are concrete and actionable
  • Best practices section is valuable: addresses non-determinism, reproducibility via commit pinning, and cost trade-offs
  • Missing operational details: no guidance on how to handle agent failures, timeouts, or cleanup after failed runs
  • Error handling underspecified: what happens if an agent crashes mid-task? Does the worktree stay behind? How does cleanup work?
  • No explicit documentation of prerequisites: which agents need local installation vs. API keys? What Python/Node versions?
  • Limited scope for reproducibility: mentions commit pinning but doesn't explain handling of dependency version changes (pip, npm)
  • Judge execution model unclear: are judges run inside the worktree or on the main repo? If agents modify imports, can judges still run?
  • No mention of cost tracking implementation: skill says 'Cost' is collected but doesn't explain how agent-eval retrieves API spend data
Model: claude-haiku-4-5-20251001 · Analyzed: Apr 20, 2026

Reviews

No reviews yet

Version History

v1.1 (Latest) · 2026-04-20 · Content updated
v1.0 · 2026-04-12 · No changelog
