
etl-elt-patterns

Use when designing data pipelines, choosing between ETL and ELT approaches, or implementing data transformation patterns. Covers modern data pipeline architecture.

allowed_tools: Read, Glob, Grep

$ Installer

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/etl-elt-patterns ~/.claude/skills/claude-code-plugins

// tip: Run this command in your terminal to install the skill


name: etl-elt-patterns
description: Use when designing data pipelines, choosing between ETL and ELT approaches, or implementing data transformation patterns. Covers modern data pipeline architecture.
allowed-tools: Read, Glob, Grep

ETL/ELT Patterns

Patterns for data extraction, loading, and transformation, including modern ELT approaches and pipeline design.

When to Use This Skill

  • Choosing between ETL and ELT
  • Designing data pipelines
  • Implementing data transformations
  • Building modern data stacks
  • Handling data quality in pipelines

ETL vs ELT

ETL (Extract, Transform, Load)

┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │  Transform  │ ──► │  Warehouse  │
└─────────┘     │   Server    │     └─────────────┘
                └─────────────┘
                (Transformation happens
                 before loading)

Characteristics:
- Transform before load
- Requires ETL server/tool
- Schema-on-write
- Traditional approach

Best for:
- Complex transformations
- Limited target storage
- Strict data quality requirements
- Legacy systems

ELT (Extract, Load, Transform)

┌─────────┐     ┌─────────────┐     ┌─────────────┐
│ Sources │ ──► │   Target    │ ──► │  Transform  │
└─────────┘     │  (Load raw) │     │  (in-place) │
                └─────────────┘     └─────────────┘
                                    (Transformation happens
                                     after loading)

Characteristics:
- Load first, transform later
- Uses target system's compute
- Schema-on-read
- Modern approach

Best for:
- Cloud data warehouses
- Flexible exploration
- Iterative development
- Large data volumes

Comparison

Factor              ETL                     ELT
Transform timing    Before load             After load
Compute location    Separate server         Target system
Raw data access     Limited                 Full
Flexibility         Low                     High
Latency             Higher                  Lower
Cost model          ETL server + storage    Storage + target compute
Best for            Complex, pre-defined    Exploratory, iterative

Modern Data Stack

Extract:        Fivetran, Airbyte, Stitch, Custom
                        │
                        ▼
Load:           Cloud Warehouse (Snowflake, BigQuery, Redshift)
                        │
                        ▼
Transform:      dbt, Dataform, SQLMesh
                        │
                        ▼
Visualize:      Looker, Tableau, Metabase

dbt (Data Build Tool)

Core concepts:
- Models: SQL SELECT statements that define transformations
- Tests: Data quality assertions
- Documentation: Inline docs and lineage
- Macros: Reusable SQL snippets

Example model:
-- models/customers.sql
SELECT
    customer_id,
    first_name,
    last_name,
    order_count
FROM {{ ref('stg_customers') }}
LEFT JOIN {{ ref('customer_orders') }} USING (customer_id)

Pipeline Patterns

Full Refresh

Strategy: Drop and recreate entire table

Process:
1. Extract all data from source
2. Truncate target table
3. Load all data

Pros: Simple, consistent
Cons: Slow and costly for large tables; rewrites unchanged data on every run
Best for: Small dimension tables, reference data
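
A minimal full-refresh sketch in Python using pandas and SQLAlchemy; the connection strings and table name are placeholders, not part of the original skill:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings; replace with real source/warehouse DSNs.
source_engine = create_engine("postgresql://user:pass@source-host/db")
target_engine = create_engine("postgresql://user:pass@warehouse-host/dw")

def full_refresh(table: str) -> None:
    # 1. Extract all rows from the source table.
    df = pd.read_sql(f"SELECT * FROM {table}", source_engine)
    # 2-3. Replace the target table wholesale (drop/recreate, then load).
    df.to_sql(table, target_engine, if_exists="replace", index=False)

full_refresh("dim_country")  # small reference/dimension table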

Incremental Load

Strategy: Only process new/changed records

Process:
1. Track high watermark (last processed timestamp/ID)
2. Extract records > watermark
3. Merge into target

Pros: Fast, efficient
Cons: More complex; misses hard deletes and updates not reflected in the watermark column
Best for: Large fact tables, event data
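
A hedged sketch of watermark-based incremental extraction. It assumes a `watermarks` table in the target that stores the last processed `updated_at` per source table; the table names and connection strings are illustrative:

from datetime import datetime, timezone
import pandas as pd
from sqlalchemy import create_engine, text

source_engine = create_engine("postgresql://user:pass@source-host/db")     # hypothetical
target_engine = create_engine("postgresql://user:pass@warehouse-host/dw")  # hypothetical

def load_incremental(table: str) -> None:
    with target_engine.begin() as conn:
        # 1. Read the high watermark recorded by the previous run.
        row = conn.execute(
            text("SELECT last_updated_at FROM watermarks WHERE table_name = :t"),
            {"t": table},
        ).fetchone()
        watermark = row[0] if row else datetime(1970, 1, 1, tzinfo=timezone.utc)

    # 2. Extract only records changed since the watermark.
    df = pd.read_sql(
        text(f"SELECT * FROM {table} WHERE updated_at > :wm"),
        source_engine,
        params={"wm": watermark},
    )
    if df.empty:
        return

    # 3. Stage the batch; a MERGE/upsert into the target would follow in a real pipeline.
    df.to_sql(f"stg_{table}", target_engine, if_exists="replace", index=False)

    with target_engine.begin() as conn:
        # Advance the watermark to the newest timestamp just processed.
        conn.execute(
            text("UPDATE watermarks SET last_updated_at = :wm WHERE table_name = :t"),
            {"wm": df["updated_at"].max(), "t": table},
        )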

Change Data Capture (CDC)

Strategy: Capture all changes from source

Approaches:
┌────────────────────────────────────────────────┐
│ Log-based CDC (Debezium, AWS DMS)              │
│ - Reads database transaction log               │
│ - Captures inserts, updates, deletes           │
│ - No source table modification needed          │
└────────────────────────────────────────────────┘

┌────────────────────────────────────────────────┐
│ Trigger-based CDC                              │
│ - Database triggers on changes                 │
│ - Writes to change table                       │
│ - Adds load to source                          │
└────────────────────────────────────────────────┘

┌────────────────────────────────────────────────┐
│ Timestamp-based CDC                            │
│ - Query by updated_at timestamp                │
│ - Simple but misses hard deletes               │
└────────────────────────────────────────────────┘
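
To illustrate the consumer side of log-based CDC, the sketch below applies Debezium-style change events (an envelope with an `op` code plus `before`/`after` row images) to an in-memory stand-in for the target table. The event shape follows Debezium's documented envelope; the apply logic itself is just an assumption-laden example:

# Apply Debezium-style change events to a dict acting as the target table.
# Envelope: {"op": "c"|"u"|"r"|"d", "before": {...} | None, "after": {...} | None}

target: dict[int, dict] = {}  # keyed by primary key "id"

def apply_change(event: dict) -> None:
    op = event["op"]
    if op in ("c", "u", "r"):          # create, update, snapshot read
        row = event["after"]
        target[row["id"]] = row        # upsert the new row image
    elif op == "d":                    # delete
        row = event["before"]
        target.pop(row["id"], None)    # hard delete captured from the log

events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "d", "before": {"id": 1, "name": "Ada L."}, "after": None},
]
for e in events:
    apply_change(e)

print(target)  # {} - the insert and update were superseded by the delete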

Merge (Upsert) Pattern

-- Snowflake/BigQuery style MERGE
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
    t.name = s.name,
    t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (id, name, created_at)
    VALUES (s.id, s.name, CURRENT_TIMESTAMP);

Data Quality Patterns

Validation Gates

Pipeline with quality gates:

Extract → Validate → Load → Transform → Validate → Serve
            │                              │
            ▼                              ▼
         Quarantine                    Alert/Block
         (bad records)                 (quality issue)

Quality Checks

Schema validation:
- Required fields present
- Data types match
- Field lengths within limits

Data validation:
- Null checks
- Range checks
- Format validation (dates, emails)
- Referential integrity

Statistical validation:
- Row count within expected range
- Value distributions within expected bounds
- No unexpected duplicates
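
A minimal record-level validation gate sketch combining the checks above; the field names and thresholds are made up for illustration, and invalid records are quarantined while valid ones continue downstream:

import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    if not record.get("customer_id"):                      # null / required-field check
        errors.append("customer_id missing")
    if record.get("email") and not EMAIL_RE.match(record["email"]):
        errors.append("email format invalid")              # format check
    if not (0 <= record.get("order_count", 0) <= 10_000):  # range check
        errors.append("order_count out of range")
    return errors

def gate(records: list[dict]) -> tuple[list[dict], list[dict]]:
    good, quarantined = [], []
    for r in records:
        errors = validate(r)
        if errors:
            quarantined.append({**r, "_errors": errors})   # keep context for review
        else:
            good.append(r)
    return good, quarantined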

Data Contracts

Define expectations between producer and consumer:

{
  "contract_version": "1.0",
  "schema": {
    "customer_id": {"type": "string", "required": true},
    "email": {"type": "string", "format": "email"},
    "created_at": {"type": "timestamp"}
  },
  "quality": {
    "freshness": "< 1 hour",
    "completeness": "> 99%",
    "row_count": "10000-100000"
  }
}
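
A sketch of enforcing the dataset-level expectations from such a contract (freshness, completeness, row count). The thresholds mirror the JSON above; the metric arguments are assumed to be computed elsewhere in the pipeline:

from datetime import datetime, timedelta, timezone

def check_contract(row_count: int, completeness: float, last_loaded_at: datetime) -> list[str]:
    """Return contract violations for one pipeline run."""
    violations = []
    if not (10_000 <= row_count <= 100_000):               # "row_count": "10000-100000"
        violations.append(f"row_count {row_count} outside 10000-100000")
    if completeness <= 0.99:                               # "completeness": "> 99%"
        violations.append(f"completeness {completeness:.2%} not > 99%")
    if datetime.now(timezone.utc) - last_loaded_at >= timedelta(hours=1):  # "freshness": "< 1 hour"
        violations.append("data older than 1 hour")
    return violations

# Any violations would typically raise an alert or block the downstream serve step.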

Pipeline Architecture

Batch Pipeline

Schedule-based processing:

┌─────────┐     ┌─────────┐     ┌─────────┐
│  Cron   │ ──► │  Spark  │ ──► │   DW    │
│ (daily) │     │  (ETL)  │     │         │
└─────────┘     └─────────┘     └─────────┘

Best for: Non-real-time, large volumes
Tools: Airflow, Dagster, Prefect

Streaming Pipeline

Real-time processing:

┌─────────┐     ┌─────────┐     ┌─────────┐
│  Kafka  │ ──► │  Flink  │ ──► │   DW    │
│(events) │     │(process)│     │(stream) │
└─────────┘     └─────────┘     └─────────┘

Best for: Real-time analytics, event-driven
Tools: Kafka Streams, Flink, Spark Streaming
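
A hedged micro-batching consumer sketch using kafka-python; the topic name, group id, batch size, and the `load_batch` sink are placeholders. Real streaming jobs in Flink or Spark add windowing and exactly-once guarantees on top of this basic idea:

import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "order-events",                               # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="warehouse-loader",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)

def load_batch(rows: list[dict]) -> None:
    """Placeholder sink - write the micro-batch to the warehouse here."""
    print(f"loaded {len(rows)} rows")

batch: list[dict] = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:          # flush every 500 events (arbitrary threshold)
        load_batch(batch)
        consumer.commit()          # commit offsets only after a successful load
        batch = []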

Lambda Architecture

Batch + Speed layers:

                    ┌─────────────────┐
             ┌────► │   Batch Layer   │ ────┐
             │      │ (comprehensive) │     │
┌─────────┐  │      └─────────────────┘     │  ┌─────────┐
│  Data   │──┤                              ├─►│ Serving │
│ Sources │  │      ┌─────────────────┐     │  │  Layer  │
└─────────┘  └────► │   Speed Layer   │ ────┘  └─────────┘
                    │ (real-time)     │
                    └─────────────────┘

Pros: Complete + real-time
Cons: Complex; transformation logic is duplicated across the batch and speed layers

Kappa Architecture

Streaming only (reprocess from log):

┌─────────┐     ┌─────────┐     ┌─────────┐
│  Kafka  │ ──► │ Stream  │ ──► │ Serving │
│  (log)  │     │ Process │     │  Layer  │
└─────────┘     └─────────┘     └─────────┘
      │
      └── Replay for reprocessing

Pros: Simple, single codebase
Cons: Requires replayable log

Orchestration

DAG-Based Orchestration

Directed Acyclic Graph of tasks:

    extract_a ──┐
                ├── transform ── load
    extract_b ──┘

Tools: Airflow, Dagster, Prefect
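
The same fan-in graph expressed as an Airflow-style DAG sketch. The task bodies are stubs, and details such as the `schedule_interval` parameter name vary across Airflow versions:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_a(): ...
def extract_b(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="example_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,                      # avoid implicit historical backfills
) as dag:
    t_extract_a = PythonOperator(task_id="extract_a", python_callable=extract_a)
    t_extract_b = PythonOperator(task_id="extract_b", python_callable=extract_b)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # extract_a and extract_b fan in to transform, which feeds load
    [t_extract_a, t_extract_b] >> t_transform >> t_load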

Orchestration Best Practices

1. Idempotent tasks (safe to retry)
2. Clear dependencies
3. Appropriate granularity
4. Monitoring and alerting
5. Backfill support
6. Parameterized runs

Error Handling

Retry Strategies

Transient errors (network, timeout):
- Exponential backoff
- Max retry count
- Circuit breaker

Data errors (validation failure):
- Quarantine bad records
- Continue processing good records
- Alert for review
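
A small exponential-backoff helper for the transient-error case; the delay values and the exception types caught are illustrative choices, not a prescription:

import random
import time

def retry_with_backoff(func, max_attempts: int = 5, base_delay: float = 1.0):
    """Call func(); on transient failure, wait base_delay * 2^attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except (ConnectionError, TimeoutError):             # transient error types (illustrative)
            if attempt == max_attempts - 1:
                raise                                        # retries exhausted - surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# retry_with_backoff(lambda: fetch_source_page(url))  # hypothetical usage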

Dead Letter Queue

Failed records → DLQ → Manual review → Reprocess

Capture:
- Original record
- Error message
- Timestamp
- Retry count
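
A minimal dead-letter-queue sketch capturing the fields listed above. The DLQ is just an in-memory list here; a real pipeline would write to a durable topic or table:

from datetime import datetime, timezone

dead_letter_queue: list[dict] = []  # stand-in for a durable DLQ topic/table

def process_with_dlq(record: dict, handler) -> None:
    try:
        handler(record)
    except Exception as exc:
        dead_letter_queue.append({
            "original_record": record,                     # the payload as received
            "error_message": str(exc),                     # why it failed
            "failed_at": datetime.now(timezone.utc).isoformat(),
            "retry_count": record.get("_retry_count", 0),  # bumped on each reprocess attempt
        })

def reprocess_dlq(handler) -> None:
    """Manual review happens before this; replay entries and bump their retry count."""
    for entry in list(dead_letter_queue):
        dead_letter_queue.remove(entry)
        record = {**entry["original_record"], "_retry_count": entry["retry_count"] + 1}
        process_with_dlq(record, handler)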

Best Practices

Pipeline Design

1. Idempotent transformations
2. Clear lineage tracking
3. Appropriate checkpointing
4. Graceful failure handling
5. Comprehensive logging
6. Data quality gates

Performance

1. Partition data appropriately
2. Incremental processing when possible
3. Parallel extraction
4. Efficient file formats (Parquet, ORC)
5. Compression
6. Resource sizing

Related Skills

  • data-architecture - Data platform design
  • stream-processing - Real-time processing
  • ml-system-design - Feature engineering

Repository

melodic-software/claude-code-plugins/plugins/systems-design/skills/etl-elt-patterns