affaan-m/skill-comply v1.2

skill-comply

Visualizes whether skills, rules, and agent definitions are actually followed. It auto-generates scenarios at three prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines.

global

origin:ECC

v1.2 · Saved Jul 14, 2026

skill-comply: Automated Compliance Measurement

Measures whether coding agents actually follow skills, rules, or agent definitions by:

Auto-generating expected behavioral sequences (specs) from any .md file
Auto-generating scenarios with decreasing prompt strictness (supportive → neutral → competing)
Running claude -p and capturing tool call traces via stream-json
Classifying tool calls against spec steps using an LLM (not regex)
Checking temporal ordering deterministically
Generating self-contained reports with spec, prompts, and timelines
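
The last analysis steps split a model-dependent part (classification) from a deterministic one (ordering). A minimal sketch of the ordering half, assuming the classifier has already labeled each tool call with a spec step id or `None`. The names `first_occurrences` and `check_order`, and the rule that only the first occurrence of a step counts, are illustrative assumptions, not the tool's actual implementation:

```python
from typing import Dict, List, Optional


def first_occurrences(labels: List[Optional[str]]) -> Dict[str, int]:
    """Map each step id to the index of the first tool call labeled with it."""
    seen: Dict[str, int] = {}
    for index, step_id in enumerate(labels):
        if step_id is not None and step_id not in seen:
            seen[step_id] = index
    return seen


def check_order(expected_steps: List[str], labels: List[Optional[str]]) -> Dict[str, bool]:
    """Per expected step: did it occur, and after the previous satisfied step?"""
    seen = first_occurrences(labels)
    result: Dict[str, bool] = {}
    last_index = -1
    for step_id in expected_steps:
        index = seen.get(step_id)
        ok = index is not None and index > last_index
        result[step_id] = ok
        if ok:
            last_index = index  # later steps must come after this one
    return result
```

Because the function takes plain labels rather than model output, ordering rules can be unit-tested with hand-written traces and no model calls.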

Supported Targets

Skills (skills/*/SKILL.md): Workflow skills like search-first, TDD guides
Rules (rules/common/*.md): Mandatory rules like testing.md, security.md, git-workflow.md
Agent definitions (agents/*.md): Whether an agent gets invoked when expected (internal workflow verification not yet supported)

When to Activate

User runs /skill-comply <path>
User asks "is this rule actually being followed?"
After adding new rules/skills, to verify agent compliance
Periodically as part of quality maintenance

Usage

# Full run
uv run python -m scripts.run ~/.claude/rules/common/testing.md

# Dry run (no cost, spec + scenarios only)
uv run python -m scripts.run --dry-run ~/.claude/skills/search-first/SKILL.md

# Custom models
uv run python -m scripts.run --gen-model haiku --model sonnet <path>

Key Concept: Prompt Independence

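Measures whether a skill or rule is followed even when the prompt doesn't explicitly support it. The three scenario levels (supportive, neutral, competing) progressively withdraw that support, so a rule that is followed only in the supportive scenario depends on the prompt rather than on the rule itself.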

Report Contents

Reports are self-contained and include:

Expected behavioral sequence (auto-generated spec)
Scenario prompts (what was asked at each strictness level)
Compliance scores per scenario
Tool call timelines with LLM classification labels
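
As a rough illustration of how per-scenario scores could be derived, the sketch below treats a score as the fraction of expected steps that were satisfied. This scoring formula and the names `compliance_rate` and `summarize` are assumptions made for the example; the source does not state how scores are computed.

```python
from typing import Dict


def compliance_rate(step_results: Dict[str, bool]) -> float:
    """Fraction of expected steps satisfied in one run (0.0 if there are no steps)."""
    if not step_results:
        return 0.0
    return sum(step_results.values()) / len(step_results)


def summarize(results_by_level: Dict[str, Dict[str, bool]]) -> Dict[str, float]:
    """Compute one compliance rate per strictness level."""
    return {level: compliance_rate(steps) for level, steps in results_by_level.items()}
```

A large gap between the supportive and competing rates would suggest the rule is being followed because of the prompt rather than independently of it.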

Advanced (optional)

For users familiar with hooks, reports also include hook promotion recommendations for steps with low compliance. This is informational — the main value is the compliance visibility itself.

Files

22 files · 57.6 KB

Overall Score

87/100

Grade

Excellent

Summary

skill-comply is an automated compliance measurement tool that verifies whether coding agents actually follow skills, rules, or agent definitions. It generates expected behavioral specs from markdown files, creates scenarios at three prompt strictness levels (supportive, neutral, competing), executes them via claude -p, classifies tool calls using an LLM, and produces detailed compliance reports with tool call timelines and temporal ordering analysis.

Detected Capabilities

file read
bash execution via subprocess
LLM inference (claude -p)
YAML/JSON parsing
sandbox directory creation and teardown
subprocess stream-json parsing
temporal ordering analysis

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

test rule compliance
verify skill enforcement
measure agent behavior
automation rule auditing
compliance measurement tool

Risk Signals

INFO

Subprocess execution with configurable model parameter; models restricted to allowlist (haiku/sonnet/opus)

scripts/runner.py:ALLOWED_MODELS, run_scenario()

INFO

Sandbox directory creation under /tmp/skill-comply-sandbox with path traversal protection via resolve().relative_to() check

scripts/runner.py:_safe_sandbox_dir()
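
The `resolve().relative_to()` pattern named above can be sketched as follows. This is a minimal illustration, not the contents of `_safe_sandbox_dir()`; the function name `safe_sandbox_dir` and the `scenario_id` parameter are hypothetical.

```python
from pathlib import Path

SANDBOX_ROOT = Path("/tmp/skill-comply-sandbox")


def safe_sandbox_dir(scenario_id: str, root: Path = SANDBOX_ROOT) -> Path:
    """Resolve a per-scenario directory and refuse anything that escapes the root."""
    resolved_root = root.resolve()
    # resolve() collapses ".." segments and symlinks before the containment check
    candidate = (resolved_root / scenario_id).resolve()
    try:
        candidate.relative_to(resolved_root)
    except ValueError:
        raise ValueError(f"sandbox path escapes root: {scenario_id!r}") from None
    return candidate
```

Resolving both paths before comparing matters: a plain string prefix check would accept `/tmp/skill-comply-sandbox/../../etc`.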

INFO

Setup command execution with allowlist of safe executables (git, npm, pip, touch, mkdir, etc.); blocks arbitrary commands

scripts/runner.py:ALLOWED_SETUP_EXECUTABLES, _setup_sandbox()

INFO

Shell builtins (cd/pushd/popd) explicitly detected and skipped to prevent subprocess crashes

scripts/runner.py:SHELL_BUILTINS, _setup_sandbox()
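
A hedged sketch of how the executable allowlist and the shell-builtin skip might combine. The set below contains only the executables named above (the real list is longer, per the "etc."), and the function name `plan_setup_command` and the choice to raise on a disallowed executable are assumptions:

```python
import shlex
from typing import List, Optional

# Only the executables named in the text; the real allowlist contains more.
ALLOWED_SETUP_EXECUTABLES = frozenset({"git", "npm", "pip", "touch", "mkdir"})
SHELL_BUILTINS = frozenset({"cd", "pushd", "popd"})


def plan_setup_command(command: str) -> Optional[List[str]]:
    """Return argv for an allowed command, None for a builtin or empty line, else raise."""
    argv = shlex.split(command)
    if not argv:
        return None
    executable = argv[0]
    if executable in SHELL_BUILTINS:
        return None  # builtins have no binary, so running them as a subprocess would fail
    if executable not in ALLOWED_SETUP_EXECUTABLES:
        raise ValueError(f"setup executable not allowed: {executable!r}")
    return argv
```

The returned list is meant to be passed to `subprocess.run` without `shell=True`, so characters such as `|` or `;` reach the program as literal arguments instead of being interpreted by a shell.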

INFO

LLM subprocess calls with hardcoded tools allowlist (Read,Write,Edit,Bash,Glob,Grep) passed to claude -p

scripts/runner.py:run_scenario(), --allowedTools parameter
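
Putting the model allowlist, the tools allowlist, and the turn limit together, the argument list handed to the subprocess might be assembled roughly like this. The function name `build_claude_argv`, the default model, and the exact flag combination are assumptions for illustration; the source confirms only `claude -p`, stream-json output, the `--allowedTools` parameter, and a default of 30 turns.

```python
from typing import List

ALLOWED_MODELS = frozenset({"haiku", "sonnet", "opus"})
ALLOWED_TOOLS = ("Read", "Write", "Edit", "Bash", "Glob", "Grep")


def build_claude_argv(prompt: str, model: str = "sonnet", max_turns: int = 30) -> List[str]:
    """Build an argv list for a non-interactive claude run, rejecting unknown models."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model not allowed: {model!r}")
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    return [
        "claude", "-p", prompt,
        "--model", model,
        "--output-format", "stream-json",
        "--verbose",
        "--max-turns", str(max_turns),
        "--allowedTools", ",".join(ALLOWED_TOOLS),
    ]
```

Building a list rather than a command string keeps the prompt as a single argument, so quoting inside the prompt cannot alter the command.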

INFO

Max turn limit enforced (default 30); rc=1 with max_turns marker is gracefully handled rather than treated as fatal error

scripts/runner.py:max_turns parameter, nonfatal_max_turns logic

INFO

Error messages include stdout tail (500 chars) to aid debugging of LLM failures

scripts/runner.py:RuntimeError message formatting

INFO

Tool call output truncated to 5000 chars to prevent unbounded memory usage from large outputs

scripts/runner.py:_parse_stream_json(), input_str/output_str slicing
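
To illustrate trace capture with truncation, the sketch below reads newline-delimited JSON events and pairs each tool call with its result. The event shape (a `message.content` list holding `tool_use` and `tool_result` blocks linked by id) is an assumption about the stream format, and `parse_tool_calls` is a hypothetical name, not the body of `_parse_stream_json()`.

```python
import json
from typing import Any, Dict, List

MAX_CHARS = 5000  # truncation limit named in the text


def parse_tool_calls(stream: str) -> List[Dict[str, Any]]:
    """Collect tool calls from JSON-lines events, truncating inputs and outputs."""
    calls: Dict[str, Dict[str, Any]] = {}
    for line in stream.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise instead of aborting the whole trace
        message = event.get("message") if isinstance(event, dict) else None
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use":
                calls[block.get("id", "")] = {
                    "tool": block.get("name", ""),
                    "input": json.dumps(block.get("input", {}))[:MAX_CHARS],
                    "output": "",
                }
            elif block.get("type") == "tool_result":
                call = calls.get(block.get("tool_use_id", ""))
                if call is not None:
                    call["output"] = str(block.get("content", ""))[:MAX_CHARS]
    return list(calls.values())  # dicts keep insertion order, so this is call order
```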

INFO

YAML parsing wrapped in retry loop with error feedback to LLM for self-correction

scripts/spec_generator.py:generate_spec(), max_retries loop
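
The retry-with-feedback pattern can be sketched independently of any model client. To stay runnable with the standard library alone, this example swaps in `json.loads` as the parser where the tool is described as parsing YAML; `generate_with_retries`, `ask_model`, and the wording of the feedback prompt are all hypothetical.

```python
import json
from typing import Any, Callable, Optional


def generate_with_retries(
    ask_model: Callable[[str], str],
    prompt: str,
    parse: Callable[[str], Any] = json.loads,  # stand-in for a YAML parser
    max_retries: int = 3,
) -> Any:
    """Ask the model, parse the reply, and feed parse errors back until it parses."""
    current_prompt = prompt
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        reply = ask_model(current_prompt)
        try:
            return parse(reply)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            current_prompt = (
                f"{prompt}\n\nYour previous output failed to parse:\n{exc}\n"
                "Return corrected output only."
            )
    raise RuntimeError(f"no parseable output after {max_retries} attempts: {last_error}")
```

With a YAML parser, the `except` clause would need to catch that library's own error type, since it may not derive from `ValueError`.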

Use Cases

Verify new rules or skills are actually followed by agents
Measure compliance across different prompt contexts (supportive vs competing instructions)
Identify steps with low compliance and determine if they should be promoted to hooks
Generate compliance reports with full tool call timelines for auditing and debugging
Test whether agents maintain workflow discipline (e.g. TDD) when prompts don't explicitly enforce it

Quality Notes

Excellent defensive programming: allowlists for models and setup executables prevent injection; path traversal protection uses resolve().relative_to() correctly
Comprehensive test coverage with realistic fixtures (compliant/noncompliant traces); tests verify both happy path and edge cases (missing executables, shell builtins, max_turns termination)
Well-structured error handling: distinguishes between fatal failures and graceful termination (max_turns); includes diagnostic context in errors
Prompt templates are clear and well-documented; spec/scenario/classifier prompts guide LLM generation with explicit rules and examples
Temporal ordering logic is deterministic and testable; separation of LLM classification and order verification allows unit testing order constraints without model variability
Tool call classification uses semantic matching (LLM) rather than regex, enabling flexible interpretation of intent across varied tool sequences
Report generation is comprehensive and self-contained: includes spec, prompts, compliance scores per scenario, and detailed timeline with tool call classification
YAML safety: uses yaml.safe_load() exclusively, no pickle/exec risk
Logging is informative: progress markers ([1/4], [2/4], etc.) aid debugging; compliance rates printed per scenario
Project structure is modular: clear separation between spec_gen, scenario_gen, runner, grader, classifier, and reporter concerns

Model: claude-haiku-4-5-20251001
Analyzed: Jul 14, 2026

Version History

v1.2 · Content updated · 2026-07-14 · Latest
v1.1 · Content updated · 2026-04-20
v1.0 · No changelog · 2026-04-12
