---
name: etl-elt-patterns
description: Use when designing data pipelines, choosing between ETL and ELT approaches, or implementing data transformation patterns. Covers modern data pipeline architecture.
allowed-tools: Read, Glob, Grep
---
ETL/ELT Patterns
Patterns for data extraction, loading, and transformation including modern ELT approaches and pipeline design.
When to Use This Skill
- Choosing between ETL and ELT
- Designing data pipelines
- Implementing data transformations
- Building modern data stacks
- Handling data quality in pipelines
ETL vs ELT
ETL (Extract, Transform, Load)
```
┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │  Transform  │ ──► │  Warehouse  │
└─────────┘     │   Server    │     └─────────────┘
                └─────────────┘

(Transformation happens before loading)
```
Characteristics:
- Transform before load
- Requires ETL server/tool
- Schema-on-write
- Traditional approach
Best for:
- Complex transformations
- Limited target storage
- Strict data quality requirements
- Legacy systems
ELT (Extract, Load, Transform)
```
┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │   Target    │ ──► │  Transform  │
└─────────┘     │ (load raw)  │     │ (in-place)  │
                └─────────────┘     └─────────────┘

(Transformation happens after loading)
```
Characteristics:
- Load first, transform later
- Uses target system's compute
- Schema-on-read
- Modern approach
Best for:
- Cloud data warehouses
- Flexible exploration
- Iterative development
- Large data volumes
Comparison
| Factor | ETL | ELT |
|---|---|---|
| Transform timing | Before load | After load |
| Compute location | Separate server | Target system |
| Raw data access | Limited | Full |
| Flexibility | Low | High |
| Latency | Higher | Lower |
| Cost model | ETL server + storage | Storage + target compute |
| Best for | Complex, pre-defined | Exploratory, iterative |
Modern Data Stack
```
Extract:    Fivetran, Airbyte, Stitch, Custom
    │
    ▼
Load:       Cloud Warehouse (Snowflake, BigQuery, Redshift)
    │
    ▼
Transform:  dbt, Dataform, SQLMesh
    │
    ▼
Visualize:  Looker, Tableau, Metabase
```
dbt (Data Build Tool)
Core concepts:
- Models: SQL SELECT statements that define transformations
- Tests: Data quality assertions
- Documentation: Inline docs and lineage
- Macros: Reusable SQL snippets
Example model:
```sql
-- models/customers.sql
SELECT
    customer_id,
    first_name,
    last_name,
    order_count
FROM {{ ref('stg_customers') }}
LEFT JOIN {{ ref('customer_orders') }} USING (customer_id)
```
Pipeline Patterns
Full Refresh
Strategy: Drop and recreate entire table
Process:
1. Extract all data from source
2. Truncate target table
3. Load all data
Pros: Simple, consistent
Cons: Slow and costly for large tables; the entire source is re-read on every run
Best for: Small dimension tables, reference data
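A minimal sketch of the truncate-and-reload process above, assuming generic DB-API connections; the connection objects, table names, and format-style placeholders are illustrative:

```python
# Full refresh: re-extract everything from the source and reload the target table.
# `source_conn` and `warehouse_conn` are hypothetical DB-API connections.

def full_refresh(source_conn, warehouse_conn):
    src = source_conn.cursor()
    src.execute("SELECT customer_id, name, updated_at FROM customers")  # step 1: extract all
    rows = src.fetchall()

    tgt = warehouse_conn.cursor()
    tgt.execute("TRUNCATE TABLE analytics.customers")                   # step 2: empty the target
    tgt.executemany(                                                    # step 3: load everything back
        "INSERT INTO analytics.customers (customer_id, name, updated_at) VALUES (%s, %s, %s)",
        rows,
    )
    warehouse_conn.commit()
```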
Incremental Load
Strategy: Only process new/changed records
Process:
1. Track high watermark (last processed timestamp/ID)
2. Extract records > watermark
3. Merge into target
Pros: Fast, efficient
Cons: More complex (watermark tracking); misses hard deletes, and can miss updates if the watermark column isn't reliably maintained
Best for: Large fact tables, event data
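A minimal sketch of a watermark-driven incremental load; the `watermark_store`, connections, and table names are hypothetical. Deduplication into the final table would typically use the merge pattern shown later in this skill.

```python
# Incremental load: only pull rows newer than the last high watermark.

def incremental_load(source_conn, warehouse_conn, watermark_store):
    last_ts = watermark_store.get("orders") or "1970-01-01 00:00:00"

    src = source_conn.cursor()
    src.execute(
        "SELECT order_id, customer_id, amount, updated_at FROM orders WHERE updated_at > %s",
        (last_ts,),
    )
    rows = src.fetchall()
    if not rows:
        return  # nothing new since the last run

    tgt = warehouse_conn.cursor()
    tgt.executemany(
        "INSERT INTO staging.orders (order_id, customer_id, amount, updated_at) VALUES (%s, %s, %s, %s)",
        rows,
    )
    warehouse_conn.commit()

    # Advance the watermark only after the load succeeds.
    watermark_store.set("orders", max(r[3] for r in rows))
```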
Change Data Capture (CDC)
Strategy: Capture all changes from source
Approaches:
Log-based CDC (Debezium, AWS DMS):
- Reads the database transaction log
- Captures inserts, updates, and deletes
- No source table modification needed

Trigger-based CDC:
- Database triggers fire on each change and write to a change table
- Adds load to the source database

Timestamp-based CDC:
- Queries by an updated_at timestamp column
- Simple, but misses hard deletes
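Downstream, captured changes still have to be applied to the target. A minimal sketch, assuming Debezium-style change envelopes (`op` / `before` / `after`) and a Postgres-style upsert; the table and column names are illustrative:

```python
# Apply log-based CDC events to a target table.
# op codes: "c" = insert, "u" = update, "d" = delete.

def apply_change_event(cursor, event: dict):
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]
        cursor.execute(
            """
            INSERT INTO analytics.customers (customer_id, name, email)
            VALUES (%(customer_id)s, %(name)s, %(email)s)
            ON CONFLICT (customer_id) DO UPDATE
              SET name = EXCLUDED.name, email = EXCLUDED.email
            """,
            row,
        )
    elif op == "d":
        cursor.execute(
            "DELETE FROM analytics.customers WHERE customer_id = %(customer_id)s",
            event["before"],
        )
```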
Merge (Upsert) Pattern
```sql
-- Snowflake/BigQuery style MERGE
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
    t.name = s.name,
    t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (id, name, created_at)
    VALUES (s.id, s.name, CURRENT_TIMESTAMP);
```
Data Quality Patterns
Validation Gates
Pipeline with quality gates:
```
Extract → Validate → Load → Transform → Validate → Serve
              │                              │
              ▼                              ▼
          Quarantine                    Alert / Block
        (bad records)                 (quality issue)
```
Quality Checks
Schema validation:
- Required fields present
- Data types match
- Field lengths within limits
Data validation:
- Null checks
- Range checks
- Format validation (dates, emails)
- Referential integrity
Statistical validation:
- Row count within expected range
- Value distributions normal
- No unexpected duplicates
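A minimal, hand-rolled sketch of these checks run before promoting a batch; the column names and thresholds are illustrative (frameworks such as Great Expectations or dbt tests cover the same ground declaratively):

```python
# Returns a list of failure messages; an empty list means the batch can be promoted.
def check_batch(rows: list[dict]) -> list[str]:
    failures = []

    # Schema / null checks
    if any(r.get("customer_id") is None for r in rows):
        failures.append("customer_id contains nulls")

    # Range check
    if any(r["amount"] < 0 for r in rows if r.get("amount") is not None):
        failures.append("amount has negative values")

    # Duplicate check
    ids = [r["customer_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate customer_id values")

    # Statistical check: row count within expected range
    if not (10_000 <= len(rows) <= 100_000):
        failures.append(f"row count {len(rows)} outside expected range")

    return failures
```

Any failure routes the batch to quarantine or raises an alert, matching the gates above.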
Data Contracts
Define expectations between producer and consumer:
```json
{
  "contract_version": "1.0",
  "schema": {
    "customer_id": {"type": "string", "required": true},
    "email": {"type": "string", "format": "email"},
    "created_at": {"type": "timestamp"}
  },
  "quality": {
    "freshness": "< 1 hour",
    "completeness": "> 99%",
    "row_count": "10000-100000"
  }
}
```
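A sketch of enforcing such a contract on the consumer side. The contract file path, the hard-coded one-hour freshness window, and the assumption of timezone-aware ISO-8601 `created_at` values are all illustrative:

```python
import json
from datetime import datetime, timedelta, timezone

def validate_contract(rows: list[dict], contract_path: str = "customer_contract.json") -> list[str]:
    with open(contract_path) as f:
        contract = json.load(f)
    violations = []

    # Schema: required fields present on every record
    for field, spec in contract["schema"].items():
        if spec.get("required") and any(field not in r or r[field] is None for r in rows):
            violations.append(f"required field missing: {field}")

    # Quality: row count window, e.g. "10000-100000"
    lo, hi = (int(x) for x in contract["quality"]["row_count"].split("-"))
    if not (lo <= len(rows) <= hi):
        violations.append(f"row count {len(rows)} outside {lo}-{hi}")

    # Quality: freshness ("< 1 hour" hard-coded here rather than parsed)
    newest = max(datetime.fromisoformat(r["created_at"]) for r in rows)
    if datetime.now(timezone.utc) - newest > timedelta(hours=1):
        violations.append("data is stale (freshness > 1 hour)")

    return violations
```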
Pipeline Architecture
Batch Pipeline
Schedule-based processing:
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│   Cron   │ ──► │  Spark   │ ──► │    DW    │
│ (daily)  │     │  (ETL)   │     │          │
└──────────┘     └──────────┘     └──────────┘
```
Best for: Non-real-time, large volumes
Tools: Airflow, Dagster, Prefect
Streaming Pipeline
Real-time processing:
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Kafka   │ ──► │  Flink   │ ──► │    DW    │
│ (events) │     │(process) │     │ (stream) │
└──────────┘     └──────────┘     └──────────┘
```
Best for: Real-time analytics, event-driven
Tools: Kafka Streams, Flink, Spark Streaming
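A minimal sketch of the consume-transform-load loop, using the kafka-python client as a lightweight stand-in for the stream processors listed above; the topic, brokers, and the `load_to_warehouse()` sink are assumptions:

```python
import json
from kafka import KafkaConsumer

def load_to_warehouse(row: dict) -> None:
    ...  # hypothetical sink, e.g. a micro-batch warehouse writer

consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,            # commit offsets only after a successful load
)

for message in consumer:
    event = message.value
    row = {                              # transform: flatten the event
        "order_id": event["order_id"],
        "amount": event["amount"],
        "event_time": event["timestamp"],
    }
    load_to_warehouse(row)
    consumer.commit()
```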
Lambda Architecture
Batch + Speed layers:
```
                   ┌───────────────────┐
             ┌──►  │    Batch Layer    │ ──┐
┌──────────┐ │     │  (comprehensive)  │   │    ┌──────────┐
│   Data   │─┤     └───────────────────┘   ├──► │ Serving  │
│ Sources  │ │     ┌───────────────────┐   │    │  Layer   │
└──────────┘ └──►  │    Speed Layer    │ ──┘    └──────────┘
                   │    (real-time)    │
                   └───────────────────┘
```
Pros: Complete + real-time
Cons: Complex, duplicate logic
Kappa Architecture
Streaming only (reprocess from log):
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Kafka   │ ──► │  Stream  │ ──► │ Serving  │
│  (log)   │     │ Process  │     │  Layer   │
└──────────┘     └──────────┘     └──────────┘
      │
      └── Replay for reprocessing
```
Pros: Simple, single codebase
Cons: Requires replayable log
Orchestration
DAG-Based Orchestration
Directed Acyclic Graph of tasks:
```
extract_a ──┐
            ├──► transform ──► load
extract_b ──┘
```
Tools: Airflow, Dagster, Prefect
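A minimal Airflow 2.x-style DAG matching the graph above; the DAG id, schedule, and the extract/transform/load callables are hypothetical placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders(): ...      # placeholder callables
def extract_customers(): ...
def build_models(): ...
def publish_marts(): ...

with DAG(
    dag_id="daily_orders",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=True,              # enables backfills for past dates
) as dag:
    extract_a = PythonOperator(task_id="extract_a", python_callable=extract_orders)
    extract_b = PythonOperator(task_id="extract_b", python_callable=extract_customers)
    transform = PythonOperator(task_id="transform", python_callable=build_models)
    load = PythonOperator(task_id="load", python_callable=publish_marts)

    [extract_a, extract_b] >> transform >> load
```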
Orchestration Best Practices
1. Idempotent tasks (safe to retry)
2. Clear dependencies
3. Appropriate granularity
4. Monitoring and alerting
5. Backfill support
6. Parameterized runs
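A sketch of how the first and last practices above combine: a task parameterized by run date that deletes and rewrites only its own partition, so retries and backfills overwrite rather than duplicate. The connection, schema, and table names are illustrative.

```python
def load_partition(warehouse_conn, run_date: str):
    cur = warehouse_conn.cursor()
    # Delete-then-insert for the run's partition makes the task safe to re-run.
    cur.execute("DELETE FROM analytics.daily_orders WHERE order_date = %s", (run_date,))
    cur.execute(
        """
        INSERT INTO analytics.daily_orders (order_date, customer_id, total_amount)
        SELECT order_date, customer_id, SUM(amount)
        FROM staging.orders
        WHERE order_date = %s
        GROUP BY order_date, customer_id
        """,
        (run_date,),
    )
    warehouse_conn.commit()
```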
Error Handling
Retry Strategies
Transient errors (network, timeout):
- Exponential backoff
- Max retry count
- Circuit breaker
Data errors (validation failure):
- Quarantine bad records
- Continue processing good records
- Alert for review
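A minimal retry wrapper for transient errors with exponential backoff and a retry cap; the delay values and the set of retryable exceptions are illustrative:

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0,
                 retryable=(ConnectionError, TimeoutError)):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise                                      # give up after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))    # 1s, 2s, 4s, 8s, ...
```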
Dead Letter Queue
Failed records → DLQ → Manual review → Reprocess
Capture:
- Original record
- Error message
- Timestamp
- Retry count
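A sketch of capturing that context when routing a bad record; the `dlq` sink (anything with a `put()` method, e.g. a queue or an object-store writer) is a hypothetical stand-in:

```python
import json
from datetime import datetime, timezone

def send_to_dlq(dlq, record: dict, error: Exception, retry_count: int):
    dlq.put(json.dumps({
        "original_record": record,
        "error_message": str(error),
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "retry_count": retry_count,
    }))
```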
Best Practices
Pipeline Design
1. Idempotent transformations
2. Clear lineage tracking
3. Appropriate checkpointing
4. Graceful failure handling
5. Comprehensive logging
6. Data quality gates
Performance
1. Partition data appropriately
2. Incremental processing when possible
3. Parallel extraction
4. Efficient file formats (Parquet, ORC)
5. Compression
6. Resource sizing
Related Skills
- data-architecture - Data platform design
- stream-processing - Real-time processing
- ml-system-design - Feature engineering