obra/finding-duplicate-functions


Use when auditing a codebase for semantic duplication - functions that do the same thing but have different names or implementations. Especially useful for LLM-generated codebases where new functions are often created rather than reusing existing ones.

v1.0 · Saved May 2, 2026

Finding Duplicate-Intent Functions

Overview

LLM-generated codebases accumulate semantic duplicates: functions that serve the same purpose but were implemented independently. Classical copy-paste detectors such as jscpd find syntactic duplicates but miss "same intent, different implementation."

This skill uses a two-phase approach: classical extraction followed by LLM-powered intent clustering.

When to Use

  • Codebase has grown organically with multiple contributors (human or LLM)
  • You suspect utility functions have been reimplemented multiple times
  • Before major refactoring to identify consolidation opportunities
  • After jscpd has been run and syntactic duplicates are already handled

Quick Reference

Phase | Tool | Model | Output
1. Extract | scripts/extract-functions.sh | - | catalog.json
2. Categorize | scripts/categorize-prompt.md | haiku | categorized.json
3. Split | scripts/prepare-category-analysis.sh | - | categories/*.json
4. Detect | scripts/find-duplicates-prompt.md | opus | duplicates/*.json
5. Report | scripts/generate-report.sh | - | report.md

Process

digraph duplicate_detection {
  rankdir=TB;
  node [shape=box];

  extract [label="1. Extract function catalog\n./scripts/extract-functions.sh"];
  categorize [label="2. Categorize by domain\n(haiku subagent)"];
  split [label="3. Split into categories\n./scripts/prepare-category-analysis.sh"];
  detect [label="4. Find duplicates per category\n(opus subagent per category)"];
  report [label="5. Generate report\n./scripts/generate-report.sh"];
  review [label="6. Human review & consolidate"];

  extract -> categorize -> split -> detect -> report -> review;
}

Phase 1: Extract Function Catalog

./scripts/extract-functions.sh src/ -o catalog.json

Options:

  • -o FILE: Output file (default: stdout)
  • -c N: Lines of context to capture (default: 15)
  • -t GLOB: File types (default: *.ts,*.tsx,*.js,*.jsx)
  • --include-tests: Include test files (excluded by default)

Test files (*.test.*, *.spec.*, __tests__/**) are excluded by default since test utilities are less likely to be consolidation candidates.
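The default exclusion can be sketched as a plain shell filter. This is an illustration of the glob logic, not the script's actual implementation, which may differ:

```shell
# Sketch: the default test-file exclusion expressed as shell case patterns.
# Reads paths on stdin, drops anything matching the default test globs.
filter_tests() {
  while IFS= read -r f; do
    case "$f" in
      *.test.*|*.spec.*|*__tests__/*) ;;   # excluded by default
      *) printf '%s\n' "$f" ;;
    esac
  done
}

printf '%s\n' src/util.ts src/util.test.ts src/__tests__/a.ts src/b.spec.tsx \
  | filter_tests
# Prints only src/util.ts
```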

Phase 2: Categorize by Domain

Dispatch a haiku subagent using the prompt in scripts/categorize-prompt.md.

Insert the contents of catalog.json where indicated in the prompt template. Save output as categorized.json.
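One way to splice the catalog into the template is the classic sed read-and-delete idiom. The marker name `CATALOG_JSON_HERE` below is hypothetical; check the actual template for its placeholder:

```shell
# Fixture files for illustration only.
printf '[{"name":"slugify","file":"src/str.ts"}]\n' > /tmp/catalog.json
cat > /tmp/prompt.md <<'EOF'
Categorize each function below by domain.

CATALOG_JSON_HERE
EOF

# `r` appends the catalog file after the marker line; `d` drops the marker itself.
sed -e '/CATALOG_JSON_HERE/r /tmp/catalog.json' \
    -e '/CATALOG_JSON_HERE/d' /tmp/prompt.md
```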

Phase 3: Split into Categories

./scripts/prepare-category-analysis.sh categorized.json ./categories

Creates one JSON file per category. Only categories with 3+ functions are worth analyzing.
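The split can be approximated with jq. The `categorized.json` shape shown here (a flat array with a `category` field per function) is an assumption; the real script may use a different schema:

```shell
mkdir -p /tmp/categories
cat > /tmp/categorized.json <<'EOF'
[{"name":"slugify","category":"string-formatting"},
 {"name":"toSlug","category":"string-formatting"},
 {"name":"isEmail","category":"validation"}]
EOF

# One output file per distinct category value.
for c in $(jq -r '.[].category' /tmp/categorized.json | sort -u); do
  jq --arg c "$c" '[.[] | select(.category == $c)]' /tmp/categorized.json \
    > "/tmp/categories/$c.json"
done
ls /tmp/categories
```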

Phase 4: Find Duplicates (Per Category)

For each category file in ./categories/, dispatch an opus subagent using the prompt in scripts/find-duplicates-prompt.md.

Save each output as ./duplicates/{category}.json.
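The fan-out is a plain loop over category files. `dispatch_opus` below is a hypothetical stand-in for however you invoke an opus subagent with the prompt template; the stub here only illustrates the plumbing:

```shell
mkdir -p /tmp/cats /tmp/duplicates
printf '[]' > /tmp/cats/validation.json   # fixture category file

# Stub for illustration; a real dispatch would return the model's JSON output.
dispatch_opus() {
  printf '{"category":"%s","duplicates":[]}\n' "$(basename "$2" .json)"
}

for f in /tmp/cats/*.json; do
  dispatch_opus scripts/find-duplicates-prompt.md "$f" \
    > "/tmp/duplicates/$(basename "$f")"
done
cat /tmp/duplicates/validation.json
```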

Phase 5: Generate Report

./scripts/generate-report.sh ./duplicates ./duplicates-report.md

Produces a prioritized markdown report grouped by confidence level.

Phase 6: Human Review

Review the report. For HIGH confidence duplicates:

  1. Verify the recommended survivor has tests
  2. Update callers to use the survivor
  3. Delete the duplicates
  4. Run tests
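Steps 2-3 for a single HIGH-confidence pair can be done mechanically with grep and sed. The names here (survivor `slugify`, duplicate `toSlug`) are hypothetical, and the sketch runs on a throwaway fixture:

```shell
mkdir -p /tmp/demo/src   # throwaway fixture
printf 'import { toSlug } from "./toSlug";\nexport const s = toSlug("Hello");\n' \
  > /tmp/demo/src/page.ts

# Step 2: point every caller at the survivor (GNU sed in-place edit).
grep -rl toSlug /tmp/demo/src | xargs sed -i 's/toSlug/slugify/g'

grep slugify /tmp/demo/src/page.ts
```

After this, step 3 is deleting the duplicate's source file, and step 4 is running the test suite to confirm nothing broke.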

High-Risk Duplicate Zones

Focus extraction on these areas first - they accumulate duplicates fastest:

Zone | Common Duplicates
utils/, helpers/, lib/ | General utilities reimplemented
Validation code | Same checks written multiple ways
Error formatting | Error-to-string conversions
Path manipulation | Joining, resolving, normalizing paths
String formatting | Case conversion, truncation, escaping
Date formatting | Same formats implemented repeatedly
API response shaping | Similar transformations for different endpoints
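A quick triage pass over these zones can estimate where the pipeline will pay off. This sketch assumes TypeScript-style `export` syntax and runs against a fixture; the directory names mirror the table above:

```shell
mkdir -p /tmp/proj/src/utils   # fixture directory for illustration
printf 'export function slugify(s) {}\nexport const pad = () => {};\n' \
  > /tmp/proj/src/utils/str.ts

# Count exported symbols per high-risk directory (rough heuristic).
for d in /tmp/proj/src/utils /tmp/proj/src/helpers /tmp/proj/src/lib; do
  [ -d "$d" ] || continue
  n=$(grep -rE '^export (function|const)' "$d" | wc -l)
  echo "$d: $n exported symbols"
done
```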

Common Mistakes

Extracting too much: Focus on exported functions and public methods. Internal helpers are less likely to be duplicated across files.

Skipping the categorization step: Going straight to duplicate detection on the full catalog produces noise. Categories focus the comparison.

Using haiku for duplicate detection: Haiku is cost-effective for categorization but misses subtle semantic duplicates. Use Opus for the actual duplicate analysis.

Consolidating without tests: Before deleting duplicates, ensure the survivor has tests covering all use cases of the deleted functions.

Files

6 files · 21.1 KB


Overall Score: 86/100
Grade: A (Excellent)

Safety: 88 · Quality: 88 · Clarity: 87 · Completeness: 82

Summary

This skill guides AI agents through a systematic 5-phase process to detect semantic duplicate functions in a codebase—functions that serve the same purpose but have different names or implementations. It uses shell scripts for extraction and splitting, plus LLM-powered subagents (haiku for categorization, opus for duplicate detection) to identify consolidation opportunities, which is especially valuable in LLM-generated codebases where reimplementation is common.

Detected Capabilities

  • Shell script execution for function extraction and file splitting
  • JSON parsing and transformation using jq
  • LLM subagent dispatch and prompt templating (haiku and opus models)
  • Markdown report generation from structured duplicate analysis
  • Bash glob patterns and ripgrep-based code scanning
  • Multi-phase pipeline orchestration with intermediate state files

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

  • detect duplicate functions
  • semantic code duplication
  • consolidate utilities
  • codebase refactoring audit
  • remove reimplemented functions

Risk Signals

  • INFO: Bash script uses ripgrep for code pattern matching across project (scripts/extract-functions.sh, lines ~80-95)
  • INFO: Scripts create and read intermediate JSON files in current working directory (scripts/extract-functions.sh, scripts/prepare-category-analysis.sh, scripts/generate-report.sh)
  • INFO: generate-report.sh uses jq to process JSON and write markdown output (scripts/generate-report.sh, lines ~40-90)
  • INFO: Skill requires two external LLM model calls (haiku and opus subagents) (SKILL.md, Phase 2 and Phase 4 sections)
  • INFO: extract-functions.sh invokes ripgrep with multiple glob patterns to scan source tree (scripts/extract-functions.sh, lines ~58-67)

Use Cases

  • Audit LLM-generated codebases for semantic duplication before refactoring
  • Identify utility function consolidation opportunities across a project
  • Prepare codebase cleanup after syntactic duplicate detection has been completed
  • Reduce maintenance burden by finding and merging intent-equivalent functions
  • Validate that functions with different names don't implement the same logic

Quality Notes

  • Excellent documentation with clear phase diagram and quick reference table
  • Well-structured process broken into discrete, testable phases with specific outputs
  • Good error handling in shell scripts (set -euo pipefail, input validation, helpful error messages)
  • Clear guidance on when to use each model (haiku for categorization efficiency, opus for accuracy)
  • Comprehensive 'Common Mistakes' section helps users avoid pitfalls (e.g., skipping categorization, using wrong model)
  • High-risk zones table provides practical guidance on where duplicates accumulate
  • Prompt templates are detailed and include explicit output format specifications
  • Scripts include usage documentation and example invocations
  • Test files are correctly excluded by default from extraction phase
  • Output confidence levels (HIGH/MEDIUM/LOW) are clearly defined with examples
  • Recommendation system (CONSOLIDATE/INVESTIGATE/KEEP_SEPARATE) is well-motivated
  • Process includes human review step (Phase 6) rather than fully automated consolidation
  • Missing: explicit guidance on handling large codebases (performance implications of ripgrep)
  • Missing: error recovery if a phase fails (e.g., what to do if Opus output is malformed)
Model: claude-haiku-4-5-20251001 · Analyzed: May 2, 2026

Reviews

No reviews yet