data-architecture

Use when designing data platforms, choosing between data lakes/lakehouses/warehouses, or implementing data mesh patterns. Covers modern data architecture approaches.

allowed_tools: Read, Glob, Grep

Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/data-architecture ~/.claude/skills/claude-code-plugins

Tip: run this command in your terminal to install the skill.

Data Architecture

Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.

When to Use This Skill

  • Choosing between data lake, warehouse, and lakehouse
  • Designing a modern data platform
  • Implementing data mesh principles
  • Planning data storage strategy
  • Understanding data architecture trade-offs

Data Architecture Evolution

Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics

Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing

Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML

Architecture Comparison

Data Warehouse

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                              │
                                              ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘

Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage

Best for:
- Business intelligence
- Financial reporting
- Structured analytics

Data Lake

┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │   (Raw)     │
└─────────────┘     └─────────────┘
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
    ┌─────────┐     ┌─────────┐     ┌─────────┐
    │   ML    │     │   ETL   │     │  Spark  │
    │ Training│     │ to DW   │     │ Analysis│
    └─────────┘     └─────────┘     └─────────┘

Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"

Best for:
- Data science/ML
- Unstructured data
- Experimental analysis

Data Lakehouse

┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │    Metadata Layer        │   │
                    │  │ (Delta/Iceberg/Hudi)     │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │    Storage Layer         │   │
                    │  │    (Object Storage)      │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
         ┌─────────┐         ┌─────────┐         ┌─────────┐
         │   SQL   │         │   ML    │         │  Stream │
         │   BI    │         │ Workload│         │ Process │
         └─────────┘         └─────────┘         └─────────┘

Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats

Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms

Architecture Selection Guide

| Factor            | Warehouse  | Lake        | Lakehouse    |
|-------------------|------------|-------------|--------------|
| Data types        | Structured | All         | All          |
| Query performance | Excellent  | Poor-Medium | Good         |
| Data quality      | High       | Variable    | Configurable |
| Cost              | High       | Low         | Medium       |
| ML workloads      | Limited    | Excellent   | Excellent    |
| Real-time         | Limited    | Good        | Good         |
| Governance        | Strong     | Weak        | Strong       |
| Complexity        | Low        | High        | Medium       |

Decision Tree:

Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
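The decision tree above can be encoded as a small helper function. The function name and its boolean inputs are illustrative, not from any library:

```python
def recommend_architecture(mostly_structured_bi: bool,
                           needs_ml_and_bi: bool,
                           primarily_ml_unstructured: bool) -> str:
    """Walk the decision tree: warehouse, lakehouse, or lake."""
    if mostly_structured_bi:
        return "Data Warehouse"
    if needs_ml_and_bi:
        return "Lakehouse"
    if primarily_ml_unstructured:
        return "Data Lake"
    # Mixed or unclear workloads default to the lakehouse branch
    return "Lakehouse"

print(recommend_architecture(True, False, False))   # Data Warehouse
print(recommend_architecture(False, True, False))   # Lakehouse
```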

Lakehouse Technologies

Delta Lake (Databricks)

Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)

File format: Parquet + Delta log
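The core idea behind the Delta log can be sketched in a few lines: each commit is a numbered JSON entry listing files added and removed, and replaying the log up to version N reconstructs the table at that version (time travel). This is a conceptual illustration only, not the real Delta Lake protocol:

```python
import json

def write_commit(log: dict, version: int, add: list, remove: list) -> None:
    # Each commit is an ordered, zero-padded JSON entry in the log
    log[f"{version:020d}.json"] = json.dumps({"add": add, "remove": remove})

def files_at_version(log: dict, version: int) -> set:
    """Replay commits 0..version to reconstruct the table snapshot."""
    files = set()
    for v in range(version + 1):
        commit = json.loads(log[f"{v:020d}.json"])
        files |= set(commit["add"])
        files -= set(commit["remove"])
    return files

log = {}
write_commit(log, 0, add=["part-0.parquet"], remove=[])
write_commit(log, 1, add=["part-1.parquet"], remove=["part-0.parquet"])
print(files_at_version(log, 0))  # {'part-0.parquet'}
print(files_at_version(log, 1))  # {'part-1.parquet'}
```

Because old commits and old data files are retained, any earlier snapshot stays reproducible until the log is vacuumed.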

Apache Iceberg (Netflix)

Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral

File format: Parquet/ORC/Avro + metadata

Apache Hudi (Uber)

Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming

File format: Parquet + Hudi metadata

Technology Comparison

| Feature          | Delta Lake | Iceberg   | Hudi      |
|------------------|------------|-----------|-----------|
| ACID             | Yes        | Yes       | Yes       |
| Time Travel      | Yes        | Yes       | Yes       |
| Schema Evolution | Good       | Excellent | Good      |
| Streaming        | Excellent  | Good      | Excellent |
| Ecosystem        | Databricks | Wide      | Wide      |
| Performance      | Excellent  | Excellent | Good      |
| Community        | Large      | Growing   | Medium    |

Data Mesh

Principles

Data Mesh = Decentralized data architecture

Four Principles:

1. Domain Ownership
   - Data owned by domain teams
   - Not centralized data team

2. Data as a Product
   - Treat data like a product
   - Quality, discoverability, usability

3. Self-Serve Platform
   - Platform enables domain teams
   - Reduces friction

4. Federated Governance
   - Global standards
   - Local implementation

Data Products

Data Product = Autonomous unit of data

Components:
┌──────────────────────────────────────┐
│           Data Product               │
│  ┌──────────┐  ┌──────────────────┐ │
│  │   Data   │  │     Metadata     │ │
│  │ (Tables) │  │ (Schema, docs)   │ │
│  └──────────┘  └──────────────────┘ │
│  ┌──────────┐  ┌──────────────────┐ │
│  │   Code   │  │      APIs        │ │
│  │ (ETL)    │  │  (Access layer)  │ │
│  └──────────┘  └──────────────────┘ │
│  ┌──────────────────────────────────┐│
│  │         Quality + SLAs           ││
│  └──────────────────────────────────┘│
└──────────────────────────────────────┘
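One way to make the components in the diagram concrete is a descriptor for a data product. The fields below mirror the diagram (data, metadata, code, APIs, quality/SLAs); all names and URLs are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    # Illustrative descriptor mirroring the components in the diagram
    name: str
    owner_domain: str                  # domain ownership principle
    tables: list                       # Data
    schema_docs_url: str               # Metadata (schema, docs)
    pipeline_repo: str                 # Code (ETL)
    access_endpoint: str               # APIs (access layer)
    freshness_sla_hours: int           # Quality + SLAs
    quality_checks: list = field(default_factory=list)

orders = DataProduct(
    name="orders",
    owner_domain="checkout",
    tables=["orders_fact", "order_items"],
    schema_docs_url="https://example.internal/catalog/orders",
    pipeline_repo="checkout/orders-etl",
    access_endpoint="https://example.internal/data/orders",
    freshness_sla_hours=24,
    quality_checks=["not_null(order_id)", "unique(order_id)"],
)
print(orders.owner_domain)  # checkout
```

Publishing such a descriptor to a catalog is what makes the product discoverable and its SLAs enforceable.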

Data Mesh vs Centralized

| Aspect           | Centralized         | Data Mesh        |
|------------------|---------------------|------------------|
| Ownership        | Central data team   | Domain teams     |
| Scaling          | Team bottleneck     | Scales with org  |
| Domain knowledge | Lost in translation | Preserved        |
| Governance       | Centralized         | Federated        |
| Implementation   | Uniform             | Heterogeneous    |
| Complexity       | Lower initially     | Higher initially |

Data Modeling Patterns

Star Schema

        ┌─────────────┐
        │  Dim_Time   │
        └──────┬──────┘
               │
┌───────────┐  │  ┌────────────┐
│Dim_Product├──┼──┤Dim_Customer│
└───────────┘  │  └────────────┘
               │
        ┌──────┴──────┐
        │ Fact_Sales  │
        └─────────────┘

Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
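A star schema query joins the fact table directly to each dimension and aggregates; a minimal sketch with sqlite3 and illustrative table names:

```python
import sqlite3

# Illustrative star schema: one fact table, two dimensions
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO dim_time    VALUES (10, 2024);
INSERT INTO fact_sales  VALUES (1, 10, 99.5), (1, 10, 0.5);
""")

# Typical star query: fact joined straight to each dimension, then aggregated
row = conn.execute("""
    SELECT p.name, t.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.name, t.year
""").fetchone()
print(row)  # ('Widget', 2024, 100.0)
```

Because every dimension is one join away from the fact table, query plans stay simple and fast, which is the schema's main selling point.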

Snowflake Schema

Normalized dimensions:
Dim_Product → Dim_Category → Dim_Subcategory

Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies

Data Vault

Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)

Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
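A minimal illustration of the Data Vault split: business keys live in hubs, relationships in links, and descriptive attributes (with load timestamps, so history stays auditable) in satellites. Table shapes and keys are illustrative:

```python
from datetime import datetime, timezone

now = datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat()

# Hubs: one row per business key, nothing else
hub_customer = [{"customer_hk": "hk1", "customer_id": "C-100", "load_ts": now}]
hub_order    = [{"order_hk": "hk2", "order_id": "O-500", "load_ts": now}]

# Link: a relationship between hubs, keyed by their hash keys
link_customer_order = [{"link_hk": "hk3", "customer_hk": "hk1",
                        "order_hk": "hk2", "load_ts": now}]

# Satellite: descriptive attributes, append-only so every change is kept
sat_customer = [{"customer_hk": "hk1", "name": "Ada", "load_ts": now}]

print(link_customer_order[0]["customer_hk"])  # hk1
```

New sources or attributes only add satellites or links; hubs stay stable, which is where the flexibility comes from.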

Storage Layers

Bronze/Silver/Gold (Medallion Architecture)

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘

Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
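A toy medallion flow over in-memory records (illustrative; a real pipeline would materialize each layer as its own table):

```python
# Bronze: raw, append-only records exactly as ingested
bronze = [
    {"order_id": "1", "amount": "10.0"},
    {"order_id": "1", "amount": "10.0"},   # duplicate
    {"order_id": "2", "amount": "bad"},    # malformed
    {"order_id": "3", "amount": "5.5"},
]

def to_silver(rows):
    """Silver: cleaned, typed, deduplicated."""
    seen, out = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue                        # drop malformed rows
        if r["order_id"] in seen:
            continue                        # drop duplicates
        seen.add(r["order_id"])
        out.append({"order_id": r["order_id"], "amount": amount})
    return out

def to_gold(rows):
    """Gold: business-level aggregate ready for consumption."""
    return {"order_count": len(rows),
            "revenue": sum(r["amount"] for r in rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'order_count': 2, 'revenue': 15.5}
```

Keeping bronze untouched means silver and gold can always be rebuilt when cleaning rules change.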

Zones in Data Lake

Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation

Best Practices

Data Quality

Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
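The gates above can be sketched as simple row-level checks; the schema and thresholds here are illustrative:

```python
def run_quality_gates(rows, schema, ranges):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    for i, row in enumerate(rows):
        for col, typ in schema.items():
            if row.get(col) is None:                  # null check
                violations.append((i, col, "null"))
            elif not isinstance(row[col], typ):       # schema validation
                violations.append((i, col, "type"))
        for col, (lo, hi) in ranges.items():
            v = row.get(col)
            if isinstance(v, (int, float)) and not lo <= v <= hi:
                violations.append((i, col, "range"))  # range validation
    return violations

schema = {"user_id": int, "age": int}
ranges = {"age": (0, 130)}
rows = [{"user_id": 1, "age": 34}, {"user_id": None, "age": 200}]
print(run_quality_gates(rows, schema, ranges))
# [(1, 'user_id', 'null'), (1, 'age', 'range')]
```

Wiring such a function in as a gate means a failing batch is quarantined before it reaches the silver or gold layers.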

Governance

Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging

Performance

Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
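Date/region partitioning, for example, is commonly implemented Hive-style: the partition key is encoded in the storage path so engines can prune whole directories without reading data. A sketch with an illustrative bucket layout:

```python
from datetime import date

def partition_path(base: str, event_date: date, region: str) -> str:
    """Hive-style layout: engines prune by matching key=value directories."""
    return f"{base}/date={event_date.isoformat()}/region={region}/"

print(partition_path("s3://lake/events", date(2024, 5, 1), "eu"))
# s3://lake/events/date=2024-05-01/region=eu/
```

A query filtered to `date = '2024-05-01' AND region = 'eu'` then touches a single directory instead of scanning the whole table.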

Related Skills

  • etl-elt-patterns - Data transformation
  • stream-processing - Real-time data
  • database-scaling - Database patterns

Repository

melodic-software/claude-code-plugins/plugins/systems-design/skills/data-architecture

Author: melodic-software