Install
$ git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/data-architecture ~/.claude/skills/claude-code-plugins/
Tip: Run this command in your terminal to install the skill.
SKILL.md
---
name: data-architecture
description: Use when designing data platforms, choosing between data lakes/lakehouses/warehouses, or implementing data mesh patterns. Covers modern data architecture approaches.
allowed-tools: Read, Glob, Grep
---
Data Architecture
Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.
When to Use This Skill
- Choosing between data lake, warehouse, and lakehouse
- Designing a modern data platform
- Implementing data mesh principles
- Planning data storage strategy
- Understanding data architecture trade-offs
Data Architecture Evolution
Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics
Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing
Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML
Architecture Comparison
Data Warehouse
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘
Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage
Best for:
- Business intelligence
- Financial reporting
- Structured analytics
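A warehouse's schema-on-write discipline can be sketched in a few lines: rows are validated against a fixed schema before they are loaded, so the store only ever holds clean data. The schema and table here are illustrative, not from any particular warehouse product.

```python
# Schema-on-write sketch: validate every row *before* it is loaded.
SALES_SCHEMA = {"sale_id": int, "amount": float, "region": str}

def load_row(table: list, row: dict) -> None:
    """Reject the write if the row does not match the declared schema."""
    if set(row) != set(SALES_SCHEMA):
        raise ValueError(f"columns {set(row)} != schema {set(SALES_SCHEMA)}")
    for col, expected in SALES_SCHEMA.items():
        if not isinstance(row[col], expected):
            raise TypeError(f"{col}: expected {expected.__name__}")
    table.append(row)

warehouse: list = []
load_row(warehouse, {"sale_id": 1, "amount": 9.99, "region": "EU"})  # accepted
try:
    load_row(warehouse, {"sale_id": 2, "amount": "oops", "region": "EU"})
except TypeError:
    pass  # rejected at write time; the bad row never lands
```

This is why warehouse data quality is high: invalid data is refused at the door rather than cleaned up later.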
Data Lake
┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │    (Raw)    │
└─────────────┘     └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
     ┌─────────┐      ┌─────────┐      ┌─────────┐
     │   ML    │      │   ETL   │      │  Spark  │
     │ Training│      │  to DW  │      │ Analysis│
     └─────────┘      └─────────┘      └─────────┘
Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"
Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
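Schema-on-read, the lake's defining trait, can be sketched with raw JSON lines: everything is stored as-is, and structure is applied only when a consumer reads. The event fields are illustrative.

```python
import json

# Schema-on-read sketch: the lake keeps raw records untouched; a schema
# is imposed only at read time, per consumer.
raw_lake = [
    '{"event": "click", "user": "a1", "ts": 1700000000}',
    '{"event": "view", "user": "a2"}',   # missing ts: still stored
    'not even json',                     # junk lands in the lake too
]

def read_clicks(lake):
    """Apply structure at read time, skipping records that don't parse."""
    for line in lake:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # this is how "data swamps" hide: bad rows pile up silently
        if rec.get("event") == "click":
            yield {"user": rec["user"], "ts": rec.get("ts")}

print(list(read_clicks(raw_lake)))  # [{'user': 'a1', 'ts': 1700000000}]
```

The flexibility and the swamp risk are the same mechanism: nothing stops bad data from accumulating, because nothing validates it on the way in.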
Data Lakehouse
┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │          Data Lakehouse         │
│    (All)    │     │   ┌──────────────────────────┐  │
└─────────────┘     │   │      Metadata Layer      │  │
                    │   │   (Delta/Iceberg/Hudi)   │  │
                    │   └──────────────────────────┘  │
                    │   ┌──────────────────────────┐  │
                    │   │      Storage Layer       │  │
                    │   │     (Object Storage)     │  │
                    │   └──────────────────────────┘  │
                    └────────────────┬────────────────┘
                                     │
                ┌────────────────────┼────────────────────┐
                ▼                    ▼                    ▼
           ┌─────────┐          ┌─────────┐          ┌─────────┐
           │   SQL   │          │   ML    │          │ Stream  │
           │   BI    │          │Workload │          │ Process │
           └─────────┘          └─────────┘          └─────────┘
Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats
Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
Architecture Selection Guide
| Factor | Warehouse | Lake | Lakehouse |
|---|---|---|---|
| Data types | Structured | All | All |
| Query performance | Excellent | Poor-Medium | Good |
| Data quality | High | Variable | Configurable |
| Cost | High | Low | Medium |
| ML workloads | Limited | Excellent | Excellent |
| Real-time | Limited | Good | Good |
| Governance | Strong | Weak | Strong |
| Complexity | Low | High | Medium |
Decision Tree:
Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
Lakehouse Technologies
Delta Lake (Databricks)
Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)
File format: Parquet + Delta log
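The "Parquet + Delta log" idea can be sketched as a toy: a table is just immutable data plus an ordered log of commits, and time travel replays the log up to an earlier version. This is a deliberate simplification, not the real Delta protocol (which stores Parquet data files and JSON commit files in a `_delta_log` directory).

```python
# Toy transaction-log table: illustrates atomic commits and time travel,
# the two features the Delta log provides on top of plain Parquet files.
class ToyDeltaTable:
    def __init__(self):
        self._log = []  # ordered commits; each commit is a list of rows added

    def commit(self, rows) -> int:
        """Atomic append: the commit is visible all-or-nothing."""
        self._log.append(list(rows))
        return len(self._log) - 1  # version number of this commit

    def read(self, version=None):
        """Read the latest state, or the state as of an older version."""
        end = len(self._log) if version is None else version + 1
        return [row for commit in self._log[:end] for row in commit]

t = ToyDeltaTable()
v0 = t.commit([{"id": 1}])
v1 = t.commit([{"id": 2}, {"id": 3}])
print(len(t.read()))    # 3 rows at the latest version
print(len(t.read(v0)))  # 1 row when time-travelling back to version 0
```

Because old commits are never rewritten, any historical version stays reproducible, which is exactly what data versioning buys you.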
Apache Iceberg (Netflix)
Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral
File format: Parquet/ORC/Avro + metadata
Apache Hudi (Uber)
Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming
File format: Parquet + Hudi metadata
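Hudi's record-level updates amount to an upsert keyed on a record key: incoming records either overwrite an existing row or are inserted. The sketch below uses an in-memory dict purely to show the semantics; real Hudi does this against Parquet files with indexing.

```python
# Upsert-by-record-key sketch of Hudi-style record-level updates.
def upsert(table: dict, records: list, key: str = "id") -> dict:
    for rec in records:
        table[rec[key]] = rec  # insert new key, or overwrite existing record
    return table

state = {1: {"id": 1, "status": "new"}}
upsert(state, [
    {"id": 1, "status": "shipped"},  # updates existing record 1
    {"id": 2, "status": "new"},      # inserts record 2
])
print(state[1]["status"], len(state))  # shipped 2
```

This is why Hudi suits streaming CDC-style workloads: each micro-batch of changes is merged in place instead of requiring a full rewrite.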
Technology Comparison
| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| ACID | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Good | Excellent | Good |
| Streaming | Excellent | Good | Excellent |
| Ecosystem | Databricks | Wide | Wide |
| Performance | Excellent | Excellent | Good |
| Community | Large | Growing | Medium |
Data Mesh
Principles
Data Mesh = Decentralized data architecture
Four Principles:
1. Domain Ownership
- Data owned by domain teams
- Not centralized data team
2. Data as a Product
- Treat data like a product
- Quality, discoverability, usability
3. Self-Serve Platform
- Platform enables domain teams
- Reduces friction
4. Federated Governance
- Global standards
- Local implementation
Data Products
Data Product = Autonomous unit of data
Components:
┌──────────────────────────────────────┐
│             Data Product             │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Data   │  │     Metadata     │  │
│  │ (Tables) │  │  (Schema, docs)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌──────────┐  ┌──────────────────┐  │
│  │   Code   │  │       APIs       │  │
│  │  (ETL)   │  │  (Access layer)  │  │
│  └──────────┘  └──────────────────┘  │
│  ┌────────────────────────────────┐  │
│  │         Quality + SLAs         │  │
│  └────────────────────────────────┘  │
└──────────────────────────────────────┘
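One way to make the components concrete is a descriptor that every data product must fill in. This is a hypothetical shape, not part of any data mesh standard; the field names simply mirror the boxes in the diagram.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical data product descriptor: one field per component."""
    name: str
    owner_domain: str                # principle 1: owned by a domain team
    tables: list                     # the data itself
    schema_docs_url: str             # metadata: schema + docs (discoverability)
    api_endpoint: str                # access layer
    freshness_sla_hours: int         # quality + SLAs
    quality_checks: list = field(default_factory=list)

orders = DataProduct(
    name="orders",
    owner_domain="checkout",
    tables=["orders_v1"],
    schema_docs_url="https://example.internal/catalog/orders",
    api_endpoint="https://example.internal/data/orders",
    freshness_sla_hours=24,
    quality_checks=["no_null_order_id", "amount_non_negative"],
)
print(orders.owner_domain)  # checkout
```

Forcing every product through a shared descriptor like this is one place federated governance bites: the global standard is the shape, the values are local.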
Data Mesh vs Centralized
| Aspect | Centralized | Data Mesh |
|---|---|---|
| Ownership | Central data team | Domain teams |
| Scaling | Team bottleneck | Scales with org |
| Domain knowledge | Lost in translation | Preserved |
| Governance | Centralized | Federated |
| Implementation | Uniform | Heterogeneous |
| Complexity | Lower initially | Higher initially |
Data Modeling Patterns
Star Schema
              ┌─────────────┐
              │  Dim_Time   │
              └──────┬──────┘
                     │
┌───────────┐        │        ┌────────────┐
│Dim_Product├────────┼────────┤Dim_Customer│
└───────────┘        │        └────────────┘
                     │
              ┌──────┴──────┐
              │ Fact_Sales  │
              └─────────────┘
Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
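The classic star-schema query joins the fact table out to each dimension and aggregates. A minimal runnable version in SQLite (table and column names are illustrative):

```python
import sqlite3

# Minimal star schema: one fact table with foreign keys into two dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales   (product_id INTEGER, customer_id INTEGER,
                               amount REAL);
    INSERT INTO dim_product  VALUES (1, 'widget');
    INSERT INTO dim_customer VALUES (10, 'EU');
    INSERT INTO fact_sales   VALUES (1, 10, 10.0), (1, 10, 5.0);
""")

# The star-join: fact to each dimension, then aggregate.
row = con.execute("""
    SELECT p.name, c.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY p.name, c.region
""").fetchone()
print(row)  # ('widget', 'EU', 15.0)
```

Every query has the same shape (fact in the middle, one join per dimension), which is what makes the star schema both simple and fast for BI engines to optimize.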
Snowflake Schema
Normalized dimensions:
Dim_Product → Dim_Subcategory → Dim_Category
Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies
Data Vault
Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)
Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
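The three building blocks can be sketched with plain dicts: a Hub holds only the business key, a Satellite holds descriptive attributes with load timestamps (so history is auditable, never updated in place), and a Link records a relationship between hubs. Keys and fields are illustrative.

```python
# Data Vault sketch: Hub (business key) / Satellite (versioned attributes)
# / Link (relationship between hubs).
hub_customer = {"hub_key": "h1", "business_key": "CUST-001"}
sat_customer = [  # append-only history: every change is a new row
    {"hub_key": "h1", "name": "Acme", "load_ts": "2024-01-01"},
    {"hub_key": "h1", "name": "Acme Corp", "load_ts": "2024-06-01"},
]
hub_order = {"hub_key": "h2", "business_key": "ORD-900"}
link_customer_order = {"link_key": "l1",
                       "customer_hub": "h1", "order_hub": "h2"}

def current_attributes(satellite):
    """Latest satellite row wins; older rows remain for audit."""
    return max(satellite, key=lambda r: r["load_ts"])

print(current_attributes(sat_customer)["name"])  # Acme Corp
```

The auditability comes directly from the append-only satellites: you can always answer "what did we believe about this customer on date X".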
Storage Layers
Bronze/Silver/Gold (Medallion Architecture)
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘
Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
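The three layers can be sketched as a pipeline over the same records: raw ingestion, then cleaning and validation, then a business-level aggregate. Record fields are illustrative.

```python
# Medallion pipeline sketch: Bronze -> Silver -> Gold over the same records.
bronze = [  # raw ingestion, append-only: keep everything, even bad rows
    {"order_id": "1", "amount": "10.0"},
    {"order_id": "2", "amount": "not-a-number"},
    {"order_id": "3", "amount": "5.5"},
]

def to_silver(rows):
    """Silver: cast types and validate; rows that fail are dropped here."""
    out = []
    for r in rows:
        try:
            out.append({"order_id": int(r["order_id"]),
                        "amount": float(r["amount"])})
        except ValueError:
            continue
    return out

def to_gold(rows):
    """Gold: business-level aggregate, ready for consumption."""
    return {"order_count": len(rows),
            "revenue": sum(r["amount"] for r in rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'order_count': 2, 'revenue': 15.5}
```

Keeping Bronze append-only is the key design choice: if a Silver rule was wrong, you can reprocess history, because the raw data was never thrown away.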
Zones in Data Lake
Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation
Best Practices
Data Quality
Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
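Most of the gates above can be written as small composable check functions that all must pass before data is promoted downstream (referential integrity would follow the same pattern against a second table). Thresholds and field names are illustrative.

```python
import time

def check_schema(rows, required):           # schema validation
    return all(required <= set(r) for r in rows)

def check_not_null(rows, field):            # null checks
    return all(r.get(field) is not None for r in rows)

def check_range(rows, field, lo, hi):       # range validation
    return all(lo <= r[field] <= hi for r in rows)

def check_freshness(latest_ts, max_age_s):  # freshness monitoring
    return time.time() - latest_ts <= max_age_s

rows = [{"id": 1, "age": 34}, {"id": 2, "age": 29}]
gates = [
    check_schema(rows, {"id", "age"}),
    check_not_null(rows, "id"),
    check_range(rows, "age", 0, 130),
    check_freshness(time.time(), max_age_s=3600),
]
print(all(gates))  # promote downstream only if every gate passes
```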
Governance
Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging
Performance
Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
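Partitioning, the first technique above, can be sketched in miniature: events are grouped into per-day partitions so that a query filtered on date touches only one partition instead of scanning everything. The `dt=YYYY-MM-DD` layout mimics Hive-style partition paths; the data is illustrative.

```python
from collections import defaultdict

events = [
    {"dt": "2024-01-01", "user": "a"},
    {"dt": "2024-01-01", "user": "b"},
    {"dt": "2024-01-02", "user": "c"},
]

# Write side: one "directory" per day, Hive-style dt=... keys.
partitions = defaultdict(list)
for e in events:
    partitions[f"dt={e['dt']}"].append(e)

# Read side (partition pruning): a filter on dt reads exactly one
# partition rather than scanning all events.
hit = partitions["dt=2024-01-02"]
print(sorted(partitions), len(hit))
```

Clustering and Z-ordering extend the same idea inside a partition: co-locate rows that are queried together so fewer files need to be read.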
Related Skills
- etl-elt-patterns - Data transformation
- stream-processing - Real-time data
- database-scaling - Database patterns
Repository: melodic-software/claude-code-plugins/plugins/systems-design/skills/data-architecture
Author: melodic-software