data-lake-platform

Universal data lake and lakehouse patterns covering ingestion (dlt, Airbyte), transformation (SQLMesh, dbt), storage formats (Iceberg, Delta, Hudi, Parquet), query engines (ClickHouse, DuckDB, Doris, StarRocks), streaming (Kafka, Flink), orchestration (Dagster, Airflow, Prefect), and visualization (Metabase, Superset, Grafana). Self-hosted and cloud options.

Install

git clone https://github.com/vasilyu1983/AI-Agents-public /tmp/AI-Agents-public && cp -r /tmp/AI-Agents-public/frameworks/claude-code-kit/framework/skills/data-lake-platform ~/.claude/skills/data-lake-platform

// tip: Run this command in your terminal to install the skill


name: data-lake-platform
description: >
  Universal data lake and lakehouse patterns covering ingestion (dlt, Airbyte),
  transformation (SQLMesh, dbt), storage formats (Iceberg, Delta, Hudi, Parquet),
  query engines (ClickHouse, DuckDB, Doris, StarRocks), streaming (Kafka, Flink),
  orchestration (Dagster, Airflow, Prefect), and visualization (Metabase, Superset,
  Grafana). Self-hosted and cloud options.

Data Lake Platform — Quick Reference

Build production data lakes and lakehouses: ingest from any source, transform with SQL, store in open formats, query at scale.


When to Use This Skill

  • Design data lake/lakehouse architecture (medallion, data mesh, lambda/kappa)
  • Set up data ingestion pipelines (dlt, Airbyte)
  • Build SQL transformation layers (SQLMesh, dbt)
  • Choose and configure table formats (Iceberg, Delta Lake, Hudi)
  • Deploy analytical query engines (ClickHouse, DuckDB, Doris, StarRocks)
  • Implement streaming pipelines (Kafka, Flink, Spark Streaming)
  • Set up orchestration (Dagster, Airflow, Prefect)
  • Implement data quality and governance (Great Expectations, DataHub)
  • Optimize storage costs and query performance
  • Migrate from legacy warehouse to lakehouse

Quick Reference

| Layer | Tools | Templates | When to Use |
|---|---|---|---|
| Ingestion | dlt, Airbyte | templates/ingestion/ | Extract from APIs, databases, files |
| Transformation | SQLMesh, dbt | templates/transformation/ | SQL-based data modeling |
| Storage | Iceberg, Delta, Hudi | templates/storage/ | Open table formats with ACID |
| Query Engines | ClickHouse, DuckDB, Doris | templates/query-engines/ | Fast analytical queries |
| Streaming | Kafka, Flink, Spark | templates/streaming/ | Real-time data pipelines |
| Orchestration | Dagster, Airflow, Prefect | templates/orchestration/ | Pipeline scheduling |
| Cloud | Snowflake, BigQuery, Redshift | templates/cloud/ | Managed warehouses |
| Visualization | Metabase, Superset, Grafana | templates/visualization/ | Dashboards, BI, monitoring |

Decision Tree: Choosing Your Stack

Data Lake Architecture?
    ├─ Self-hosted priority?
    │   ├─ OLAP queries? → ClickHouse (Yandex origin, Russian-friendly)
    │   ├─ Embedded analytics? → DuckDB (in-process, no server)
    │   ├─ Complex joins + updates? → Apache Doris or StarRocks
    │   └─ Data lakehouse? → Iceberg + Trino/Spark
    │
    ├─ Cloud-native?
    │   ├─ AWS? → Redshift, Athena + Iceberg
    │   ├─ GCP? → BigQuery
    │   └─ Multi-cloud? → Snowflake
    │
    ├─ Ingestion tool?
    │   ├─ Python-first, simple? → dlt (recommended)
    │   ├─ GUI, many connectors? → Airbyte
    │   └─ Enterprise, CDC? → Debezium, Fivetran
    │
    ├─ Transformation tool?
    │   ├─ SQL-first, CI/CD? → SQLMesh (recommended)
    │   ├─ Large community? → dbt
    │   └─ Python transformations? → Pandas + Great Expectations
    │
    ├─ Table format?
    │   ├─ Multi-engine reads? → Apache Iceberg (industry standard)
    │   ├─ Databricks ecosystem? → Delta Lake
    │   └─ CDC, real-time updates? → Apache Hudi
    │
    ├─ Streaming?
    │   ├─ Event streaming? → Apache Kafka
    │   ├─ Stream processing? → Apache Flink
    │   └─ Batch + streaming? → Spark Streaming
    │
    └─ Orchestration?
        ├─ Data-aware, modern? → Dagster (recommended)
        ├─ Mature, many integrations? → Airflow
        └─ Python-native, simple? → Prefect

Architecture Patterns

Pattern 1: Medallion Architecture (Bronze/Silver/Gold)

Use when: Building enterprise data lake with clear data quality tiers.

Sources → Bronze (raw) → Silver (cleaned) → Gold (business-ready)
            ↓              ↓                  ↓
         Iceberg        Iceberg            Iceberg
         append-only    deduped            aggregated

Tools: dlt → SQLMesh → ClickHouse/DuckDB → Metabase

See templates/cross-platform/template-medallion-architecture.md
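
For a concrete feel of the tiering, here is a minimal local sketch with DuckDB standing in for the lake engine; the names bronze_events, event_id, loaded_at, and event_time are illustrative assumptions, not part of the templates:

import duckdb

con = duckdb.connect("lake.duckdb")

# Bronze: raw, append-only landing of source files
con.sql("""
    CREATE TABLE IF NOT EXISTS bronze_events AS
    SELECT * FROM read_parquet('landing/*.parquet')
""")

# Silver: deduplicate on a business key, keeping the latest load
con.sql("""
    CREATE OR REPLACE TABLE silver_events AS
    SELECT * FROM bronze_events
    QUALIFY row_number() OVER (PARTITION BY event_id ORDER BY loaded_at DESC) = 1
""")

# Gold: business-ready daily aggregate
con.sql("""
    CREATE OR REPLACE TABLE gold_daily_events AS
    SELECT date_trunc('day', event_time) AS day, count(*) AS events
    FROM silver_events
    GROUP BY 1
""")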

Pattern 2: Data Mesh (Domain-Oriented)

Use when: Large organization, multiple domains, decentralized ownership.

Domain A ──→ Domain A Lake ──→ Data Products
Domain B ──→ Domain B Lake ──→ Data Products
Domain C ──→ Domain C Lake ──→ Data Products
                   ↓
            Federated Catalog (DataHub/OpenMetadata)

Tools: dlt (per domain) → SQLMesh → Iceberg → DataHub

See resources/architecture-patterns.md

Pattern 3: Lambda/Kappa (Streaming + Batch)

Use when: Real-time + historical analytics required.

Lambda: Kafka → Flink (speed layer) ───→ Serving
                   ↓                        ↑
        Batch (Spark) → Iceberg ───────────┘

Kappa:  Kafka → Flink → Iceberg → Serving (single path)

Tools: Kafka → Flink → Iceberg → ClickHouse

See resources/streaming-patterns.md
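
For a taste of the speed layer, a minimal PyFlink sketch (assumes pip install apache-flink); the in-memory collection stands in for a Kafka source, which in practice also needs the Flink Kafka connector jar:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Count events per action type, word-count style
events = env.from_collection(["click", "view", "click"])
events.map(lambda action: (action, 1)) \
      .key_by(lambda pair: pair[0]) \
      .reduce(lambda a, b: (a[0], a[1] + b[1])) \
      .print()

env.execute("speed-layer-sketch")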


Core Capabilities

1. Data Ingestion (dlt, Airbyte)

Extract data from any source: REST APIs, databases, files, SaaS platforms.

dlt (Python-native, recommended):

  • Simple pip install, Python scripts
  • 100+ verified sources
  • Incremental loading, schema evolution
  • Destinations: ClickHouse, DuckDB, Snowflake, BigQuery, Postgres

Airbyte (GUI, connectors):

  • 300+ pre-built connectors
  • Self-hosted or cloud
  • CDC replication
  • Declarative YAML configuration

See templates/ingestion/dlt/ and templates/ingestion/airbyte/
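
A minimal dlt sketch, assuming a public REST endpoint; the GitHub URL and table name are illustrative, and a real pipeline would typically start from one of dlt's verified sources:

import dlt
from dlt.sources.helpers import requests

@dlt.resource(table_name="repos", write_disposition="append")
def github_repos():
    # Illustrative public endpoint; swap in a verified source for real work
    yield requests.get("https://api.github.com/orgs/duckdb/repos").json()

pipeline = dlt.pipeline("demo", destination="duckdb", dataset_name="raw")
print(pipeline.run(github_repos()))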

2. SQL Transformation (SQLMesh, dbt)

Build data models with SQL, manage dependencies, test data quality.

SQLMesh (recommended):

  • Virtual data environments
  • Automatic change detection
  • Plan/apply workflow (like Terraform)
  • Built-in unit testing

dbt (popular alternative):

  • Large community
  • Many packages
  • Cloud offering
  • Jinja templating

See templates/transformation/sqlmesh/ and templates/transformation/dbt/
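
The plan/apply workflow is scriptable too; a hedged sketch of SQLMesh's Python API, assuming a SQLMesh project in the current directory and a dev environment:

from sqlmesh import Context

ctx = Context(paths=".")   # load the project
plan = ctx.plan("dev")     # detect changes against the dev environment
ctx.apply(plan)            # backfill and promote, Terraform-style
ctx.run()                  # execute models on their cadence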

3. Open Table Formats (Iceberg, Delta, Hudi)

Store data in open formats with ACID transactions, time travel, schema evolution.

Apache Iceberg (recommended):

  • Multi-engine support (Spark, Trino, Flink, ClickHouse)
  • Hidden partitioning
  • Snapshot isolation
  • Industry momentum (Snowflake, Databricks, AWS)

Delta Lake:

  • Databricks native
  • Z-ordering optimization
  • Change data feed

Apache Hudi:

  • Optimized for CDC/updates
  • Record-level indexing
  • Near real-time ingestion

See templates/storage/
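
A hedged time-travel sketch with pyiceberg; the catalog name "default" and table "raw.events" are assumptions, with connection details supplied via pyiceberg configuration:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("raw.events")

# Current snapshot as an Arrow table
current = table.scan().to_arrow()

# Time travel: re-read the table as of its first recorded snapshot
first = table.history()[0]
as_of = table.scan(snapshot_id=first.snapshot_id).to_arrow()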

4. Query Engines (ClickHouse, DuckDB, Doris, StarRocks)

Fast analytical queries on large datasets.

ClickHouse (self-hosted priority):

  • 100+ GB/s scan speed
  • Columnar storage + compression
  • Russian origin (Yandex), fully open source
  • MergeTree engine family

DuckDB (embedded):

  • In-process, no server
  • Parquet/CSV native
  • Python/R integration
  • Laptop-scale analytics

Apache Doris / StarRocks:

  • MPP architecture
  • Complex joins
  • Real-time updates
  • MySQL protocol compatible

See templates/query-engines/
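
A minimal MergeTree sketch using the clickhouse-connect driver; host, schema, and sample row are assumptions, and a running ClickHouse server is required:

from datetime import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# MergeTree table sorted for user-centric queries (schema is illustrative)
client.command("""
    CREATE TABLE IF NOT EXISTS events (
        event_time DateTime,
        user_id    UInt64,
        action     LowCardinality(String)
    ) ENGINE = MergeTree
    ORDER BY (user_id, event_time)
""")

client.insert("events", [[datetime(2024, 1, 1), 1, "click"]],
              column_names=["event_time", "user_id", "action"])
print(client.query("SELECT action, count() FROM events GROUP BY action").result_rows)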

5. Streaming (Kafka, Flink, Spark)

Real-time data pipelines and stream processing.

Apache Kafka:

  • Event streaming platform
  • Durable log storage
  • Connect ecosystem

Apache Flink:

  • True stream processing
  • Exactly-once semantics
  • Stateful computations

Spark Streaming:

  • Micro-batch processing
  • Unified batch + stream API
  • ML integration

See templates/streaming/
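
A minimal produce/consume round trip with confluent-kafka; the broker address, topic, and group id are assumptions:

from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key="user-1", value='{"action": "click"}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "lake-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()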

6. Orchestration (Dagster, Airflow, Prefect)

Schedule and monitor data pipelines.

Dagster (recommended):

  • Software-defined assets
  • Data-aware scheduling
  • Built-in observability
  • Type system

Airflow:

  • Mature, battle-tested
  • Many operators
  • Large community

Prefect:

  • Python-native
  • Hybrid execution
  • Simple deployment

See templates/orchestration/
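
A sketch of Dagster's asset-job scheduling; the asset name and cron expression are illustrative:

from dagster import AssetSelection, Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def daily_extract():
    ...  # e.g. trigger a dlt pipeline

daily_job = define_asset_job("daily_job", selection=AssetSelection.assets(daily_extract))

defs = Definitions(
    assets=[daily_extract],
    schedules=[ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *")],
)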

7. Visualization (Metabase, Superset, Grafana)

Business intelligence dashboards and operational monitoring.

Metabase (business users):

  • Intuitive question builder
  • Self-serve analytics
  • Signed embedding
  • Caching and permissions

Apache Superset (analysts):

  • SQL Lab for exploration
  • Advanced visualizations
  • Row-level security
  • Custom chart plugins

Grafana (infrastructure):

  • Time-series dashboards
  • Real-time alerting
  • Prometheus/InfluxDB native
  • Data pipeline monitoring

See templates/visualization/ and resources/bi-visualization-patterns.md
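
BI tools are mostly configured in the UI, but Metabase also exposes a REST API for automation; a hedged sketch (URL and credentials are assumptions):

import requests

MB = "http://localhost:3000"

# Authenticate and obtain a session token
session = requests.post(f"{MB}/api/session",
                        json={"username": "admin@example.com", "password": "secret"}).json()
headers = {"X-Metabase-Session": session["id"]}

# List connected databases (useful when wiring dashboards to the lake)
print(requests.get(f"{MB}/api/database", headers=headers).json())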


Navigation

Resources (Deep Guides)

| Resource | Description |
|---|---|
| architecture-patterns.md | Medallion, data mesh, lakehouse design |
| ingestion-patterns.md | dlt vs Airbyte, CDC, incremental loading |
| transformation-patterns.md | SQLMesh vs dbt, testing, CI/CD |
| storage-formats.md | Iceberg vs Delta vs Hudi comparison |
| query-engine-patterns.md | ClickHouse, DuckDB, Doris optimization |
| streaming-patterns.md | Kafka, Flink, Spark Streaming |
| orchestration-patterns.md | Dagster, Airflow, Prefect comparison |
| governance-catalog.md | DataHub, OpenMetadata, lineage |
| cost-optimization.md | Storage, compute, query optimization |
| operational-playbook.md | Monitoring, incidents, migrations |
| bi-visualization-patterns.md | Metabase, Superset, Grafana operations |

Templates (Copy-Paste Ready)

| Category | Templates |
|---|---|
| Cross-platform | medallion, pipeline, incremental, quality, schema, partitioning, cost, migration |
| Ingestion/dlt | pipeline, rest-api, database-source, incremental, warehouse-loading |
| Ingestion/Airbyte | connection, custom-connector |
| Transformation/SQLMesh | project, model, incremental, dag, testing, production, security |
| Transformation/dbt | project, incremental, testing |
| Storage | iceberg-table, iceberg-maintenance, delta-table, hudi-table, parquet-optimization |
| Query Engines | clickhouse (setup, ingestion, optimization, materialized-views, replication), duckdb, doris, starrocks |
| Streaming | kafka-ingestion, flink-processing, spark-streaming |
| Orchestration | dagster-pipeline, airflow-dag, prefect-flow |
| Cloud | snowflake-setup, bigquery-setup, redshift-setup |
| Visualization | metabase (connection-checklist, dashboard-request, incident-playbook) |


Quick Start Examples

Example 1: dlt + ClickHouse + Metabase

# Install dlt with ClickHouse
pip install "dlt[clickhouse]"

# Initialize pipeline
dlt init rest_api clickhouse

# Configure and run
python pipeline.py

Example 2: SQLMesh + DuckDB (Local Development)

# Install SQLMesh
pip install sqlmesh

# Initialize project
sqlmesh init duckdb

# Plan and apply
sqlmesh plan
sqlmesh run

Example 3: Dagster + dlt + SQLMesh

# dagster_pipeline.py
from dagster import Definitions, asset
import dlt
from sqlmesh import Context

@asset
def raw_data():
    # source_data() is a placeholder for your dlt source (REST API, database, ...)
    pipeline = dlt.pipeline("my_pipeline", destination="duckdb")
    return pipeline.run(source_data())

@asset(deps=[raw_data])
def transformed_data():
    ctx = Context(paths=".")  # SQLMesh project root
    ctx.run()

# Run locally with: dagster dev -f dagster_pipeline.py
defs = Definitions(assets=[raw_data, transformed_data])

Sources

See data/sources.json for 100+ curated sources covering:

  • Official documentation (dlt, SQLMesh, ClickHouse, Iceberg, etc.)
  • Architecture guides (medallion, data mesh)
  • Comparison articles (2024-2025 benchmarks)
  • Community resources (Discord, Stack Overflow)
  • Russian ecosystem (Yandex DataLens, Arenadata, Altinity)

Repository

vasilyu1983/AI-Agents-public/frameworks/claude-code-kit/framework/skills/data-lake-platform