Data Throughput Accelerator

Use this skill when the bottleneck is moving, transforming, or saving large volumes of data. The goal is not just speed: it is correct data landing in the right place faster, with proof.

First Distinction

Separate these before optimizing:

source extraction speed;
network transfer speed;
warehouse/load speed;
transform speed;
serving-table freshness;
live tail growth while the job runs.

A pipeline can be "fast" and still appear behind if new data arrives during the run faster than the final catch-up pass can absorb it.
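The live-tail effect can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the choice of files per second as the unit, and the example numbers are assumptions, not part of the skill.

```python
import math


def catch_up_seconds(backlog_files: int, process_rate: float, arrival_rate: float) -> float:
    """Estimate seconds until the backlog is empty.

    Rates are in files per second (an assumed unit). The backlog only
    shrinks at the net rate: process_rate minus arrival_rate.
    """
    if backlog_files <= 0:
        return 0.0
    net_rate = process_rate - arrival_rate
    if net_rate <= 0:
        # New data arrives at least as fast as it is processed,
        # so the pipeline never catches up no matter how long it runs.
        return math.inf
    return backlog_files / net_rate


# Hypothetical numbers: 300 files behind, 8 files/s processed, 2 files/s arriving.
print(catch_up_seconds(300, 8.0, 2.0))  # 50.0
```

A job that processes 8 files per second looks fast in isolation, but only the net 6 files per second reduces the backlog, which is why tail growth should be measured separately from job speed.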

Fast Path Heuristics

Move compute to where the data already is.
Prefer warehouse-native scans, joins, and appends for large landed files.
Use manifests or checkpoints so completed files/partitions are skipped.
Use partitioning and clustering that match the read and append pattern.
Batch small files, requests, and writes.
Make writes idempotent through unique keys, manifests, or replaceable staging.
Keep raw, derived, and serving tables separately accountable.
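
Two of the heuristics above, manifest-based skipping and idempotent writes, might look like the following minimal sketch. It uses SQLite from the Python standard library as a stand-in for a warehouse; the table names (manifest, raw_events), the column names, and the load_files helper are all hypothetical.

```python
import sqlite3


def load_files(conn, files):
    """Load each (file_name, rows) pair at most once.

    rows is a list of (event_id, payload) tuples. Returns the file names
    loaded by this call; files already in the manifest are skipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manifest "
        "(file_name TEXT PRIMARY KEY, rows_in_file INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events "
        "(event_id TEXT PRIMARY KEY, payload TEXT)"
    )
    loaded = []
    for file_name, rows in files:
        done = conn.execute(
            "SELECT 1 FROM manifest WHERE file_name = ?", (file_name,)
        ).fetchone()
        if done:
            continue  # completed file: skip it instead of reloading
        with conn:  # one transaction: rows and manifest entry commit together
            # The unique key makes a replayed row a no-op, not a duplicate.
            conn.executemany(
                "INSERT OR IGNORE INTO raw_events (event_id, payload) VALUES (?, ?)",
                rows,
            )
            conn.execute(
                "INSERT INTO manifest (file_name, rows_in_file) VALUES (?, ?)",
                (file_name, len(rows)),
            )
        loaded.append(file_name)
    return loaded


conn = sqlite3.connect(":memory:")
batch = [("a.csv", [("1", "x"), ("2", "y")]), ("b.csv", [("2", "y"), ("3", "z")])]
print(load_files(conn, batch))  # ['a.csv', 'b.csv']
print(load_files(conn, batch))  # [] because both files are already in the manifest
```

Writing the rows and the manifest entry in one transaction means a crash cannot leave a file marked complete without its rows, or the reverse. Note that rows_in_file records the file's size as read, which can exceed the number of newly inserted rows when files overlap.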

Workflow

Read the current source, target, and manifest contracts.
Measure backlog: external files, manifest rows, raw rows, derived rows, min/max timestamps, and unprocessed counts.
Run a safe catch-up or sample benchmark.
Compare variants: batch size, worker count, warehouse SQL, file grouping, staging shape, and manifest update method.
Promote only the fastest path that keeps counts and timestamps coherent.
Codify the path as a CLI, scheduled job, workflow, or runbook.
Rerun final accounting after the codified path executes.
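
The compare-and-promote steps above could be sketched as a small harness that times each variant and rejects any variant whose counts do not add up. This is an assumption-laden illustration: only batch size is varied, write_batch is a hypothetical callable that returns the number of rows it wrote, and "coherent" is reduced to a row-count check.

```python
import time


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def benchmark_batch_sizes(rows, write_batch, batch_sizes):
    """Time each batch size and record whether the written count matches the input."""
    results = []
    for size in batch_sizes:
        started = time.perf_counter()
        written = sum(write_batch(batch) for batch in batched(rows, size))
        elapsed = time.perf_counter() - started
        results.append({
            "batch_size": size,
            "seconds": elapsed,
            "written": written,
            "coherent": written == len(rows),
        })
    return results


def pick_fastest_coherent(results):
    """Return the fastest variant that kept counts coherent, or None if none did."""
    coherent = [r for r in results if r["coherent"]]
    return min(coherent, key=lambda r: r["seconds"]) if coherent else None


rows = list(range(1000))
# `len` stands in for a real writer that reports how many rows it wrote.
winner = pick_fastest_coherent(benchmark_batch_sizes(rows, len, [10, 100, 500]))
print(winner["batch_size"], winner["coherent"])
```

Filtering on coherence before ranking by time keeps a lossy variant from winning the benchmark, however fast it is.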

Accounting Output

Use a hard accounting block:

Data throughput result:
- Source files discovered: 294
- Files processed this run: 294
- Raw rows added: 9,683,598
- Derived rows added: 8,917,585
- Remaining tail: 24 files at readback time
- Runtime: 38.7s
- Correctness gate: manifest counts and table max timestamps match
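
One way to keep the block honest is to render it from measured values rather than typing it by hand. The sketch below is a possible shape, not a prescribed one: the RunAccounting class and its field names are hypothetical, and the gate is interpreted here as "manifest file count equals files processed and manifest max timestamp equals table max timestamp", which is an assumption about what the gate should check. The timestamps in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass
class RunAccounting:
    files_discovered: int
    files_processed: int
    raw_rows_added: int
    derived_rows_added: int
    remaining_tail_files: int
    runtime_seconds: float
    manifest_file_count: int
    manifest_max_ts: str
    table_max_ts: str

    def gate_passes(self) -> bool:
        """Assumed gate: manifest and target table agree on counts and max timestamp."""
        return (
            self.manifest_file_count == self.files_processed
            and self.manifest_max_ts == self.table_max_ts
        )

    def render(self) -> str:
        gate = (
            "manifest counts and table max timestamps match"
            if self.gate_passes()
            else "FAILED: manifest and target tables disagree"
        )
        return "\n".join([
            "Data throughput result:",
            f"- Source files discovered: {self.files_discovered}",
            f"- Files processed this run: {self.files_processed}",
            f"- Raw rows added: {self.raw_rows_added:,}",
            f"- Derived rows added: {self.derived_rows_added:,}",
            f"- Remaining tail: {self.remaining_tail_files} files at readback time",
            f"- Runtime: {self.runtime_seconds:.1f}s",
            f"- Correctness gate: {gate}",
        ])


report = RunAccounting(294, 294, 9_683_598, 8_917_585, 24, 38.7,
                       manifest_file_count=294,
                       manifest_max_ts="2000-01-01T00:00:00Z",  # placeholder
                       table_max_ts="2000-01-01T00:00:00Z")     # placeholder
print(report.render())
```

Because the gate line is computed, a mismatch shows up in the report itself instead of depending on someone remembering to check.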

Guardrails

Do not delete raw data to make a metric look better.
Do not skip failed files silently.
Do not mix historical backfill status with live-tail freshness.
Do not call a pipeline complete until the target tables and manifest agree.
For finance, healthcare, regulated, or customer-impacting data, preserve replay evidence and approval gates.
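
As an illustration of the "do not skip failed files silently" guardrail, the sketch below records every failure with its reason and refuses to report completion while any remain. The helper names (process_all, require_complete) and the process_one callable are hypothetical.

```python
def process_all(files, process_one):
    """Process every file; return (succeeded names, {failed name: reason})."""
    succeeded, failed = [], {}
    for name in files:
        try:
            process_one(name)
        except Exception as exc:
            # Keep going, but record the failure so it appears in the accounting.
            failed[name] = repr(exc)
        else:
            succeeded.append(name)
    return succeeded, failed


def require_complete(failed):
    """Raise if any file failed, so the run cannot be reported as complete."""
    if failed:
        details = "; ".join(f"{name}: {reason}" for name, reason in failed.items())
        raise RuntimeError(f"{len(failed)} file(s) failed: {details}")


def flaky(name):
    # Stand-in processor that rejects one file.
    if name == "b.csv":
        raise ValueError("bad header")


ok, bad = process_all(["a.csv", "b.csv", "c.csv"], flaky)
print(ok)   # ['a.csv', 'c.csv']
print(bad)  # {'b.csv': "ValueError('bad header')"}
```

Calling require_complete(bad) at the end of a run turns a quiet partial load into an explicit failure that someone has to resolve or replay.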


Overall Score: 82/100
Grade: B (Good)
Safety: 88
Quality: 82
Clarity: 85
Completeness: 72

Summary

This skill guides agents through optimizing high-throughput data pipelines—ingestion, backfill, ETL, and warehouse loading operations. It provides a structured workflow for identifying bottlenecks, benchmarking optimization strategies, and ensuring data correctness through accounting and manifest validation.

Detected Capabilities

file reading
data analysis and measurement
workflow orchestration guidance
query optimization recommendations
manifest/checkpoint validation

Trigger Keywords

Phrases that MCP clients use to match this skill to user intent.

optimize data pipeline
speed up warehouse loading
batch data backfill
etl performance tuning
parallel ingestion
manifest synchronization

Use Cases

Accelerate large-scale data backfill operations with manifest tracking
Optimize ETL pipeline performance while preserving data correctness
Conduct benchmarking comparisons (batch size, worker count, SQL) for warehouse loading
Implement idempotent data ingestion with checkpoints and manifests
Validate data integrity through accounting gates and timestamp coherence

Quality Notes

Clear, practical heuristics (move compute to data, use warehouse-native operations, batch writes) grounded in real data pipeline patterns
Strong emphasis on correctness gates and accounting—addresses the critical risk of sacrificing data integrity for speed
Well-structured workflow with seven sequential steps from measurement through codification to final validation
Explicit guardrails section prohibits dangerous shortcuts (deleting raw data, skipping failed files silently, mixing backfill with live-tail freshness)
Excellent accounting block example shows the agent exactly what proof of correctness looks like (file counts, row counts, manifest agreement)
Scope is domain-specific (data warehousing, ETL) and does not prescribe implementation details—allows flexibility across tech stacks
References regulated data scenarios and includes recommendation for 'replay evidence and approval gates' showing awareness of compliance contexts
Minor gap: does not address common pitfalls like schema evolution mid-pipeline or handling partial file failures

Model: claude-haiku-4-5-20251001
Analyzed: May 25, 2026

