
clickhouse-architect

ClickHouse schema design authority (hub skill). Use when designing schemas, selecting compression codecs, tuning ORDER BY, optimizing queries, or reviewing table structure. **Delegates to**: clickhouse-cloud-management for user creation, clickhouse-pydantic-config for DBeaver config, schema-e2e-validation for YAML contracts. Triggers: "design ClickHouse schema", "compression codecs", "MergeTree optimization", "ORDER BY tuning", "partition key", "ClickHouse performance", "SharedMergeTree", "ReplicatedMergeTree", "migrate to ClickHouse".

allowed_tools: Read, Bash, Grep, Skill

$ Installer

git clone https://github.com/terrylica/cc-skills /tmp/cc-skills && cp -r /tmp/cc-skills/plugins/quality-tools/skills/clickhouse-architect ~/.claude/skills/cc-skills

// tip: Run this command in your terminal to install the skill



ClickHouse Architect

Prescriptive schema design, compression selection, and performance optimization for ClickHouse (v24.4+). Covers both ClickHouse Cloud (SharedMergeTree) and self-hosted (ReplicatedMergeTree) deployments.

Core Methodology

Schema Design Workflow

Follow this sequence when designing or reviewing ClickHouse schemas:

  1. Define ORDER BY key (3-5 columns, lowest cardinality first)
  2. Select compression codecs per column type
  3. Configure PARTITION BY for data lifecycle management
  4. Add performance accelerators (projections, indexes)
  5. Validate with audit queries (see scripts/)
  6. Document with COMMENT statements (see references/schema-documentation.md)
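
A minimal sketch tying the six steps together (the events table and its columns are illustrative assumptions, not a prescribed schema):

CREATE TABLE events (
    tenant_id  LowCardinality(String),
    event_type LowCardinality(String),
    timestamp  DateTime64(3) CODEC(DoubleDelta, ZSTD),
    event_id   UInt64 CODEC(T64, ZSTD),
    payload    String CODEC(ZSTD(3))
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)                         -- step 3: lifecycle management
ORDER BY (tenant_id, event_type, timestamp, event_id)    -- step 1: lowest cardinality first
COMMENT 'Example schema produced by the workflow above'; -- step 6: document the table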

ORDER BY Key Selection

The ORDER BY clause is the most critical decision in ClickHouse schema design.

Rules:

  • Limit to 3-5 columns maximum (each additional column has diminishing returns)
  • Place lowest cardinality columns first (e.g., tenant_id before timestamp)
  • Include all columns used in WHERE clauses for range queries
  • PRIMARY KEY must be a prefix of ORDER BY (or omit to use full ORDER BY)

Example:

-- Correct: Low cardinality first, 4 columns
CREATE TABLE trades (
    exchange LowCardinality(String),
    symbol LowCardinality(String),
    timestamp DateTime64(3),
    trade_id UInt64,
    price Float64,
    quantity Float64
) ENGINE = MergeTree()
ORDER BY (exchange, symbol, timestamp, trade_id);

-- Wrong: High cardinality first (10x slower queries)
ORDER BY (trade_id, timestamp, symbol, exchange);
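
To sanity-check a candidate key, EXPLAIN with index analysis shows how many parts and granules the primary index prunes for a representative query (the filter values below are placeholders):

EXPLAIN indexes = 1
SELECT count()
FROM trades
WHERE exchange = 'binance'
  AND symbol = 'BTCUSDT'
  AND timestamp >= now() - INTERVAL 1 DAY;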

Compression Codec Quick Reference

| Column Type | Default Codec | Read-Heavy Alternative | Example |
|---|---|---|---|
| DateTime/DateTime64 | CODEC(DoubleDelta, ZSTD) | CODEC(DoubleDelta, LZ4) | timestamp DateTime64(3) CODEC(DoubleDelta, ZSTD) |
| Float prices/gauges | CODEC(Gorilla, ZSTD) | CODEC(Gorilla, LZ4) | price Float64 CODEC(Gorilla, ZSTD) |
| Integer counters | CODEC(T64, ZSTD) | | count UInt64 CODEC(T64, ZSTD) |
| Slowly changing integers | CODEC(Delta, ZSTD) | CODEC(Delta, LZ4) | version UInt32 CODEC(Delta, ZSTD) |
| String (low cardinality) | LowCardinality(String) | | status LowCardinality(String) |
| General data | CODEC(ZSTD(3)) | CODEC(LZ4) | Default compression level 3 |

When to use LZ4 over ZSTD: LZ4 provides 1.76x faster decompression. Use LZ4 for read-heavy workloads with monotonic sequences (timestamps, counters). Use ZSTD (default) when compression ratio matters or data patterns are unknown.
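
To verify a codec choice against real data, compare per-column compressed and uncompressed sizes; a sketch assuming the trades table above:

SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'trades'
ORDER BY data_compressed_bytes DESC;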

Note on codec combinations:

Delta/DoubleDelta + Gorilla combinations are blocked by default (allow_suspicious_codecs) because Gorilla already performs implicit delta compression internally—combining them is redundant, not dangerous. A historical corruption bug (PR #45615, Jan 2023) was fixed, but the blocking remains as a best practice guardrail.

Use each codec family independently for its intended data type:

-- Correct usage
price Float64 CODEC(Gorilla, ZSTD)              -- Floats: use Gorilla
timestamp DateTime64 CODEC(DoubleDelta, ZSTD)   -- Timestamps: use DoubleDelta
timestamp DateTime64 CODEC(DoubleDelta, LZ4)    -- Read-heavy: use LZ4

PARTITION BY Guidelines

PARTITION BY is for data lifecycle management, NOT query optimization.

Rules:

  • Partition by time units (month, week) for TTL and data management
  • Keep partition count under 1000 total across all tables
  • Each partition should contain 1-300 parts maximum
  • Never partition by high-cardinality columns

Example:

-- Correct: Monthly partitions for TTL management
PARTITION BY toYYYYMM(timestamp)

-- Wrong: Daily partitions (too many parts)
PARTITION BY toYYYYMMDD(timestamp)

-- Wrong: High-cardinality partition key
PARTITION BY user_id
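
Partitions pair naturally with TTL for retention; a sketch, where the 12-month window is an illustrative assumption:

-- Drop old data automatically once it ages past the retention window
ALTER TABLE trades
    MODIFY TTL toDateTime(timestamp) + INTERVAL 12 MONTH DELETE;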

Anti-Patterns Checklist (v24.4+)

| Pattern | Severity | Modern Status | Fix |
|---|---|---|---|
| Too many parts (>300/partition) | Critical | Still critical | Reduce partition granularity |
| Small batch inserts (<1000) | Critical | Still critical | Batch to 10k-100k rows |
| High-cardinality first ORDER BY | Critical | Still critical | Reorder: lowest cardinality first |
| No memory limits | High | Still critical | Set max_memory_usage |
| Denormalization overuse | High | Still critical | Use dictionaries + materialized views |
| Large JOINs | Medium | 180x improved | Still avoid for ultra-low-latency |
| Mutations (UPDATE/DELETE) | Medium | 1700x improved | Use lightweight updates (v24.4+) |
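
Two of the fixes above expressed in SQL (the memory limit value and the filter are illustrative assumptions):

-- Cap per-query memory
SET max_memory_usage = 10000000000;   -- ~10 GB

-- Lightweight delete: marks rows as deleted instead of rewriting whole parts
DELETE FROM trades WHERE symbol = 'DELISTED';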

Table Engine Selection

| Deployment | Engine | Use Case |
|---|---|---|
| ClickHouse Cloud | SharedMergeTree | Default for cloud deployments |
| Self-hosted cluster | ReplicatedMergeTree | Multi-node with replication |
| Self-hosted single | MergeTree | Single-node development/testing |

Cloud (SharedMergeTree):

-- On ClickHouse Cloud, a plain MergeTree() declaration is automatically backed by
-- SharedMergeTree; no replication path arguments are needed.
CREATE TABLE trades (...)
ENGINE = MergeTree()
ORDER BY (exchange, symbol, timestamp);

Self-hosted (ReplicatedMergeTree):

CREATE TABLE trades (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);

Skill Delegation Guide

This skill is the hub for ClickHouse-related tasks. When the user's needs extend beyond schema design, invoke the related skills below.

Delegation Decision Matrix

| User Need | Invoke Skill | Trigger Phrases |
|---|---|---|
| Create database users, manage permissions | devops-tools:clickhouse-cloud-management | "create user", "GRANT", "permissions", "credentials" |
| Configure DBeaver, generate connection JSON | devops-tools:clickhouse-pydantic-config | "DBeaver", "client config", "connection setup" |
| Validate schema contracts against live database | quality-tools:schema-e2e-validation | "validate schema", "Earthly E2E", "schema contract" |

Typical Workflow Sequence

  1. Schema Design (THIS SKILL) → Design ORDER BY, compression, partitioning
  2. User Setup → clickhouse-cloud-management (if cloud credentials needed)
  3. Client Config → clickhouse-pydantic-config (generate DBeaver JSON)
  4. Validation → schema-e2e-validation (CI/CD schema contracts)

Example: Full Stack Request

User: "I need to design a trades table for ClickHouse Cloud and set up DBeaver to query it."

Expected behavior:

  1. Use THIS skill for schema design
  2. Invoke clickhouse-cloud-management for creating database user
  3. Invoke clickhouse-pydantic-config for DBeaver configuration

Performance Accelerators

Projections

Create alternative sort orders that ClickHouse automatically selects:

ALTER TABLE trades ADD PROJECTION trades_by_symbol (
    SELECT * ORDER BY symbol, timestamp
);
ALTER TABLE trades MATERIALIZE PROJECTION trades_by_symbol;
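
To confirm queries actually pick up the projection, system.query_log records which projections each query used; a sketch:

SELECT query, projections
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%FROM trades%'
ORDER BY event_time DESC
LIMIT 5;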

Materialized Views

Pre-compute aggregations for dashboard queries:

CREATE MATERIALIZED VIEW trades_hourly_mv
ENGINE = SummingMergeTree()
ORDER BY (exchange, symbol, hour)
AS SELECT
    exchange,
    symbol,
    toStartOfHour(timestamp) AS hour,
    sum(quantity) AS total_volume,
    count() AS trade_count
FROM trades
GROUP BY exchange, symbol, hour;
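
When reading the view, re-aggregate: SummingMergeTree collapses rows per key only at merge time, so unmerged parts may still hold partial sums:

SELECT
    exchange,
    symbol,
    hour,
    sum(total_volume) AS total_volume,
    sum(trade_count)  AS trade_count
FROM trades_hourly_mv
GROUP BY exchange, symbol, hour;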

Dictionaries

Replace JOINs with O(1) dictionary lookups for large-scale star schemas:

When to use dictionaries (v24.4+):

  • Fact tables with 100M+ rows joining dimension tables
  • Dimension tables 1k-500k rows with monotonic keys
  • LEFT ANY JOIN semantics required

When JOINs are sufficient (v24.4+):

  • Dimension tables <500 rows (JOIN overhead negligible)
  • v24.4+ predicate pushdown provides 8-180x improvements
  • Complex JOIN types (FULL, RIGHT, multi-condition)

Benchmark context: 6.6x speedup measured on Star Schema Benchmark (1.4B rows).

CREATE DICTIONARY symbol_info (
    symbol String,
    name String,
    sector String
)
PRIMARY KEY symbol
SOURCE(CLICKHOUSE(TABLE 'symbols'))
LAYOUT(COMPLEX_KEY_HASHED())  -- String key; FLAT() (fastest, <500k entries) requires a UInt64 key
LIFETIME(3600);

-- Use in queries (O(1) lookup); complex-key dictionaries take the key as a tuple
SELECT
    symbol,
    dictGet('symbol_info', 'name', tuple(symbol)) AS symbol_name
FROM trades;

Scripts

Execute comprehensive schema audit:

clickhouse-client --multiquery < scripts/schema-audit.sql

The audit script checks:

  • Part count per partition (threshold: 300)
  • Compression ratios by column
  • Query performance patterns
  • Replication lag (if applicable)
  • Memory usage patterns
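
The part-count check boils down to a query like the following (a sketch of what the script likely runs; the bundled script may differ):

SELECT database, table, partition, count() AS part_count
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING part_count > 300
ORDER BY part_count DESC;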

Additional Resources

Reference Files

| Reference | Content |
|---|---|
| references/schema-design-workflow.md | Complete workflow with examples |
| references/compression-codec-selection.md | Decision tree + benchmarks |
| references/anti-patterns-and-fixes.md | 13 deadly sins + v24.4+ status |
| references/audit-and-diagnostics.md | Query interpretation guide |
| references/idiomatic-architecture.md | Parameterized views, dictionaries, dedup |
| references/schema-documentation.md | COMMENT patterns + naming for AI understanding |

Python Driver Policy

Use clickhouse-connect (official) for all Python integrations.

# ✅ RECOMMENDED: clickhouse-connect (official, HTTP)
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,  # HTTP port
    username='default',
    password=''
)
result = client.query("SELECT * FROM trades LIMIT 1000")
df = client.query_df("SELECT * FROM trades")  # Pandas integration

Why NOT clickhouse-driver

| Factor | clickhouse-connect | clickhouse-driver |
|---|---|---|
| Maintainer | ClickHouse Inc. | Solo developer |
| Weekly commits | Yes (active) | Sparse (months) |
| Open issues | 41 (addressed) | 76 (accumulating) |
| Downloads/week | 2.7M | 1.5M |
| Bus factor risk | Low (company) | High (1 person) |

Do NOT use clickhouse-driver despite its ~26% speed advantage for large exports. The maintenance risk outweighs performance gains:

  • Single maintainer (mymarilyn) with no succession plan
  • Issues accumulating without response
  • Risk of abandonment breaks production code

Exception: Only consider clickhouse-driver if you have extreme performance requirements (exporting millions of rows) AND accept the maintenance risk.

Related Skills

| Skill | Purpose |
|---|---|
| devops-tools:clickhouse-cloud-management | User/permission management |
| devops-tools:clickhouse-pydantic-config | DBeaver connection generation |
| quality-tools:schema-e2e-validation | YAML schema contracts |
| quality-tools:multi-agent-e2e-validation | Database migration validation |