Catalog
github/agentic-eval

github

agentic-eval

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building evaluator-optimizer pipelines for quality-critical generation - Creating test-driven code refinement workflows - Designing rubric-based or LLM-as-judge evaluation systems - Adding iterative improvement to agent outputs (code, reports, analysis) - Measuring and improving agent response quality

global
New~1.5k
v1.0Saved Jun 26, 2026

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

  • Quality-critical generation: Code, reports, analysis requiring high accuracy
  • Tasks with clear evaluation criteria: Defined success metrics exist
  • Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    
    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        
        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    
    return output

Key insight: Use structured JSON output for reliable parsing of critique results.


Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")

Rubric-Based

Score outputs against weighted dimensions.

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5

Best Practices

Practice Rationale
Clear criteria Define specific, measurable evaluation criteria upfront
Iteration limits Set max iterations (3-5) to prevent infinite loops
Convergence check Stop if output score isn't improving between iterations
Log history Keep full trajectory for debugging and analysis
Structured output Use JSON for reliable parsing of evaluation results

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
Files1
1 files · 1.0 KB

Select a file to preview

Overall Score

86/100

Grade

A

Excellent

Safety

88

Quality

86

Clarity

87

Completeness

81

Summary

This skill provides patterns and techniques for evaluating and iteratively improving AI agent outputs through self-critique, evaluator-optimizer pipelines, and test-driven refinement loops. It teaches agents how to implement quality gates, rubric-based evaluation, and convergence-aware iteration for code, reports, and analysis.

Detected Capabilities

llm-generationjson-parsingcode-executionpattern-guidance

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

code quality evaluationself-critique looptest-driven refinementllm-as-judge patterniterative code improvementrubric-based scoringevaluator-optimizer pipelineagent output evaluation

Use Cases

  • .Implementing self-critique loops for agent-generated code
  • Building evaluator-optimizer pipelines for quality-critical content generation
  • Creating test-driven code refinement workflows with feedback loops
  • Designing rubric-based evaluation systems with measurable dimensions
  • Measuring and improving agent response quality through iterative refinement
  • Implementing LLM-as-judge patterns for output comparison and ranking

Quality Notes

  • Well-structured skill with clear conceptual examples across three concrete patterns (reflection, evaluator-optimizer, code-specific)
  • Best practices table provides actionable guidance on convergence detection, iteration limits, and structured output
  • Code examples use defensive JSON parsing patterns (json.loads) and include explicit termination conditions (all_pass, score_threshold)
  • Checklist format makes implementation straightforward and reduces cognitive load
  • Explicitly addresses safety concerns (convergence detection, iteration limits) demonstrating author awareness of infinite-loop risks
  • Evaluation strategies section covers multiple approaches (outcome-based, LLM-as-judge, rubric-based) giving agents flexibility
  • Good use of diagrams and pseudocode to illustrate conceptual flow
Model: claude-haiku-4-5-20251001Analyzed: Jun 26, 2026

Reviews

Add this skill to your library to leave a review.

No reviews yet

Be the first to share your experience.

Add github/agentic-eval to your library

Command Palette

Search for a command to run...