affaan-m/skill-comply

Visualizes whether skills, rules, and agent definitions are actually followed: it auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines.

global
New~569
v1.1 · Saved May 11, 2026

skill-comply: Automated Compliance Measurement

Measures whether coding agents actually follow skills, rules, or agent definitions by:

  1. Auto-generating expected behavioral sequences (specs) from any .md file
  2. Auto-generating scenarios with decreasing prompt strictness (supportive → neutral → competing)
  3. Running claude -p and capturing tool call traces via stream-json
  4. Classifying tool calls against spec steps using an LLM (not regex)
  5. Checking temporal ordering deterministically
  6. Generating self-contained reports with spec, prompts, and timelines
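
Step 5 can be illustrated with a minimal sketch. It assumes the LLM classifier has already labeled each tool call with a spec step id (or `None` when nothing matched) and that a step counts as followed only if it appears after the previous matched step. The function name, parameter names, and step ids are hypothetical and are not taken from the skill-comply source.

```python
from typing import Optional, Sequence


def check_order(expected_steps: Sequence[str],
                labeled_calls: Sequence[Optional[str]]) -> dict[str, bool]:
    """Return, for each expected step, whether it occurred in spec order.

    expected_steps: step ids in the order the spec requires (hypothetical ids).
    labeled_calls:  one classifier label per tool call, in trace order;
                    None means the call matched no spec step.
    """
    result: dict[str, bool] = {}
    cursor = 0  # first trace index at which the next step may be matched
    for step in expected_steps:
        found = False
        for i in range(cursor, len(labeled_calls)):
            if labeled_calls[i] == step:
                found = True
                cursor = i + 1  # later steps must appear after this call
                break
        result[step] = found
    return result
```

Because the check is a plain in-order scan over labels, the same labeled trace always yields the same verdict; only the labeling step depends on the model.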

Supported Targets

  • Skills (skills/*/SKILL.md): Workflow skills like search-first, TDD guides
  • Rules (rules/common/*.md): Mandatory rules like testing.md, security.md, git-workflow.md
  • Agent definitions (agents/*.md): Whether an agent gets invoked when expected (internal workflow verification not yet supported)

When to Activate

  • User runs /skill-comply <path>
  • User asks "is this rule actually being followed?"
  • After adding new rules/skills, to verify agent compliance
  • Periodically as part of quality maintenance

Usage

# Full run
uv run python -m scripts.run ~/.claude/rules/common/testing.md

# Dry run (no cost, spec + scenarios only)
uv run python -m scripts.run --dry-run ~/.claude/skills/search-first/SKILL.md

# Custom models
uv run python -m scripts.run --gen-model haiku --model sonnet <path>

Key Concept: Prompt Independence

Measures whether a skill or rule is followed even when the prompt doesn't explicitly support it, that is, at the neutral and competing strictness levels as well as the supportive one.
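
One way to read prompt independence numerically is to compare compliance rates across the three strictness levels. The sketch below is an assumption about how such a comparison could look, not the tool's actual scoring: `compliance_rate`, `independence_gap`, and the step ids are hypothetical, and the input data is made up for illustration.

```python
LEVELS = ("supportive", "neutral", "competing")


def compliance_rate(step_results: dict[str, bool]) -> float:
    """Fraction of spec steps that were followed in one scenario run."""
    if not step_results:
        return 0.0
    return sum(step_results.values()) / len(step_results)


def independence_gap(rates: dict[str, float]) -> float:
    """Drop in compliance from the supportive to the competing prompt."""
    return rates["supportive"] - rates["competing"]


# Illustrative, made-up step results per level (not real measurements).
runs = {
    "supportive": {"write_test": True, "run_test": True, "implement": True},
    "neutral": {"write_test": True, "run_test": True, "implement": False},
    "competing": {"write_test": True, "run_test": False, "implement": False},
}
rates = {level: compliance_rate(runs[level]) for level in LEVELS}
gap = independence_gap(rates)
```

A gap near zero would suggest the rule holds regardless of how the prompt is phrased; a large gap would suggest it is followed mainly when the prompt asks for it.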

Report Contents

Reports are self-contained and include:

  1. Expected behavioral sequence (auto-generated spec)
  2. Scenario prompts (what was asked at each strictness level)
  3. Compliance scores per scenario
  4. Tool call timelines with LLM classification labels

Advanced (optional)

For users familiar with hooks, reports also include hook promotion recommendations for steps with low compliance. This is informational — the main value is the compliance visibility itself.
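
As a rough sketch of how low-compliance steps might be picked out for hook promotion, the function below flags every step whose follow rate across scenarios falls under a threshold. The function name, the data shape, and the 0.5 threshold are all assumptions made for illustration; the source does not say how the recommendation is computed.

```python
from collections import defaultdict


def hook_candidates(runs: dict[str, dict[str, bool]],
                    threshold: float = 0.5) -> list[str]:
    """Return steps followed in fewer than `threshold` of the scenarios.

    runs maps a scenario name to {step id: followed?} (hypothetical shape).
    """
    followed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for step_results in runs.values():
        for step, ok in step_results.items():
            total[step] += 1
            followed[step] += int(ok)
    # Keep first-seen order so the output is stable across runs.
    return [step for step in total if followed[step] / total[step] < threshold]
```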

Files
21 files · 49.4 KB

Overall Score: 81/100
Grade: B (Good)

  • Safety: 78
  • Quality: 85
  • Clarity: 82
  • Completeness: 76

Summary

skill-comply measures whether coding agents actually follow skills, rules, and agent definitions by auto-generating behavioral specs from markdown files, creating scenarios at 3 prompt strictness levels, running agents via claude -p, and classifying tool calls against expected steps using an LLM. It reports compliance rates with timelines and recommends hook promotion for steps with low compliance.

Detected Capabilities

  • file read (skill/rule/spec files)
  • subprocess execution (claude -p commands)
  • JSONL stream parsing
  • YAML parsing and generation
  • LLM-based classification (Haiku/Sonnet models)
  • temporary sandbox creation and cleanup
  • markdown report generation

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

  • measure compliance
  • verify agent follows rules
  • test prompt independence
  • check skill adherence
  • validate rule behavior

Risk Signals

  • INFO: Subprocess execution of user-provided model name without strict validation (scripts/runner.py:line 34, ALLOWED_MODELS whitelist)
  • WARNING: Temporary sandbox directory cleanup uses shutil.rmtree on user-created paths (scripts/runner.py:_setup_sandbox, line 81)
  • INFO: Multiple unbounded subprocess calls to 'claude -p' without explicit failure mode documentation (scripts/spec_generator.py, scripts/scenario_generator.py, scripts/classifier.py)
  • INFO: Path traversal protection relies on resolve().relative_to() which could raise on invalid paths (scripts/runner.py:_safe_sandbox_dir, line 73)
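
The last signal concerns the common `resolve().relative_to()` containment check. The sketch below shows the general pattern with `pathlib`, where `relative_to` raises `ValueError` for a path outside the base; it is not the body of `_safe_sandbox_dir`, which is not shown in this listing, and the name `safe_child` is hypothetical.

```python
from pathlib import Path


def safe_child(base: Path, name: str) -> Path:
    """Resolve base/name and refuse anything that escapes base."""
    base = base.resolve()
    candidate = (base / name).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base.
        candidate.relative_to(base)
    except ValueError as exc:
        raise ValueError(f"{name!r} escapes sandbox root {base}") from exc
    return candidate
```

Catching the `ValueError` and re-raising it with a clearer message turns the "could raise" case into a documented failure mode that callers can handle before any `shutil.rmtree` call.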

Use Cases

  • Verify rule compliance after adding new rules to an agent
  • Measure whether TDD workflows are actually followed despite prompt variations
  • Detect skill drift when agents contradict specified behaviors
  • Identify which compliance steps need hook promotion for better reliability
  • Periodically validate agent behavior against documented specifications

Quality Notes

  • Excellent structure: clear separation of concerns across 8 focused modules (including parser, classifier, grader, runner, and report)
  • Strong type safety: uses frozen dataclasses throughout for immutable specs and results
  • Well-documented prompt templates with clear YAML schemas for spec generation and scenario construction
  • Comprehensive test suite (16 tests) covering parser, grader, and edge cases with mocked LLM classifiers
  • Fixture-based testing with realistic TDD trace examples (compliant and noncompliant)
  • Temporal ordering validation is deterministic after LLM classification, reducing false positives
  • Report generation is self-contained with timeline, evidence, and hook promotion recommendations
  • Supports dry-run mode for cost-effective spec/scenario generation without execution
  • Model selection is parameterized (gen-model vs model) for flexibility
  • Error handling includes retry logic with YAML parse feedback for spec generation
  • Good CLI UX with progress logging and clear output paths
  • Stream-json parsing correctly handles tool_use/tool_result message pairs
  • Sandbox isolation is implemented with path safety checks
  • Detector descriptions use natural language rather than regex, improving maintainability
Model: claude-haiku-4-5-20251001 · Analyzed: May 11, 2026
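
The note on stream-json parsing can be made concrete with a small sketch of pairing `tool_use` blocks with their `tool_result` blocks. It assumes each output line is a JSON object whose `message.content` is a list of blocks, with `tool_use` blocks carrying `id`, `name`, and `input`, and `tool_result` blocks carrying `tool_use_id` and `content`. That event layout is an assumption here, and `pair_tool_calls` is a hypothetical name, not a function from the skill-comply source.

```python
import json
from typing import Iterable


def pair_tool_calls(lines: Iterable[str]) -> list[dict]:
    """Pair tool_use blocks with their tool_result blocks, in call order."""
    calls: dict[str, dict] = {}  # tool_use id -> record (insertion-ordered)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if not isinstance(event, dict):
            continue
        message = event.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # events without block content, or plain-string content
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use":
                calls[block["id"]] = {
                    "name": block.get("name"),
                    "input": block.get("input"),
                    "result": None,  # filled in when the result arrives
                }
            elif block.get("type") == "tool_result":
                record = calls.get(block.get("tool_use_id"))
                if record is not None:
                    record["result"] = block.get("content")
    return list(calls.values())
```

A call whose result never arrives keeps `result` set to `None`, so an interrupted run is still visible in the timeline.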

Version History

  • v1.1 (2026-04-20, latest): Content updated
  • v1.0 (2026-04-12): No changelog
