databricks-expert
Databricks Expert Engineer Skill - Comprehensive guide for data engineering, machine learning infrastructure, and permission design
Installer

# Run this command in your terminal to install the skill
git clone https://github.com/i9wa4/dotfiles /tmp/dotfiles && cp -r /tmp/dotfiles/dot.config/claude/skills/databricks-expert ~/.claude/skills/
Databricks Expert Engineer Skill
This skill provides a comprehensive guide for Databricks development.
1. Databricks CLI Usage
1.1. About warehouse_id
- Find a Serverless SQL Warehouse and use its ID as `warehouse_id` (see the listing example below)
- Note: the databricks CLI does not read `warehouse_id` from config files automatically, so include it explicitly in the JSON each time
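To find an ID, the CLI can list the workspace's SQL warehouses (a minimal sketch; the exact output columns vary by CLI version):
# List SQL warehouses and note the ID of a Serverless one
databricks warehouses list --profile "DEFAULT"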
1.2. Basic Usage
# Execute query
databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{
  "warehouse_id": "xxxxxxxxxx",
  "catalog": "catalog_name",
  "schema": "schema_name",
  "statement": "select * from table_name limit 10"
}'
# Get results (statement_id is returned from execution)
databricks api get /api/2.0/sql/statements/{statement_id} --profile "DEFAULT"
1.3. Command Tips
- Query execution flow
  - `post` executes the query and returns a `statement_id`
  - `get` retrieves the results (wait until `state` is `SUCCEEDED`)
  - For long queries, add a `sleep` between retries (see the polling sketch at the end of this section)
- Error handling
  - `state: CLOSED`: result retrieval was too slow; fetch results sooner after execution
  - `state: FAILED`: SQL error; check `error_message`
  - `state: RUNNING`: still executing; wait and retry `get`
  - Timeout: for large data, verify with `limit` first
- Reading results
  - `data_array`: actual data (2D array)
  - `schema.columns`: column names and type info
  - `total_row_count`: total row count (reported even with `limit`)
  - `state`: query execution state
- Parameterized queries
databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{
  "warehouse_id": "xxxxxxxxxx",
  "statement": "select * from table where date >= :start_date",
  "parameters": [{"name": "start_date", "value": "2025-01-01", "type": "DATE"}]
}'
2. Well-Architected Lakehouse Framework
The framework consists of seven pillars:
2.1. Data and AI Governance
Policies and practices to securely manage data and AI assets. Minimize data copies with a unified governance solution.
2.2. Interoperability and Usability
Consistent user experience and seamless integration with external systems.
2.3. Operational Excellence
Processes supporting continuous production operations.
2.4. Security, Privacy, and Compliance
Implement safeguards against threats.
2.5. Reliability
Ensure disaster recovery capabilities.
2.6. Performance Efficiency
Adaptability to workload changes.
2.7. Cost Optimization
Cost management to maximize value delivery.
3. Unity Catalog
3.1. Basic Concepts
- "Define once, secure everywhere" approach
- Unified access control policies across multiple workspaces
- ANSI SQL compliant permission management
3.2. Object Model
3-level namespace: catalog.schema.table
- Catalog layer: Data isolation unit (by department, etc.)
- Schema layer: Logical group containing tables, views, volumes
- Object layer: Tables, views, volumes, functions, models
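For illustration, a query addressing a table by its fully qualified three-level name (all three names here are placeholders):
# Reference a table as catalog.schema.table
databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{
  "warehouse_id": "xxxxxxxxxx",
  "statement": "select * from main.default.my_table limit 10"
}'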
3.3. Permission Management
- Users cannot access data by default
- Explicit permission grants required
- Permissions inherit from parent to child (catalog -> schema -> table)
-- Check permissions
SHOW GRANTS ON SCHEMA main.default;
-- Grant permissions
GRANT CREATE TABLE ON SCHEMA main.default TO `finance-team`;
-- Revoke permissions
REVOKE CREATE TABLE ON SCHEMA main.default FROM `finance-team`;
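The same statements can be run through the Statement Execution API from section 1. For example, a grant at the catalog level, which child schemas and tables inherit (the group name is a placeholder):
# Catalog-level grant; inherited by child schemas and tables
databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{
  "warehouse_id": "xxxxxxxxxx",
  "statement": "GRANT SELECT ON CATALOG main TO `finance-team`"
}'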
3.4. Best Practices
- Managed tables/volumes recommended (Delta Lake format, full lifecycle management)
- Catalog isolation across workspaces possible
- Independent managed storage location per catalog recommended
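A sketch of creating a catalog with its own managed storage location (catalog name and storage path are placeholders; the path must fall under an external location you have privileges on):
# Create a catalog with a dedicated managed storage location (placeholders)
databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json "{
  \"warehouse_id\": \"xxxxxxxxxx\",
  \"statement\": \"CREATE CATALOG IF NOT EXISTS finance MANAGED LOCATION 's3://my-bucket/finance'\"
}"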
4. Data Engineering
4.1. Lakeflow Solution
Unifies data ingestion, transformation, and orchestration.
- Lakeflow Connect: Simplifies data ingestion
- Lakeflow Spark Declarative Pipelines (SDP): Declarative pipeline framework
- Lakeflow Jobs: Workflow automation
4.2. Delta Lake
- Parquet data files with file-based transaction log
- ACID transactions
- Time travel functionality
- Optimizations: liquid clustering, data skipping, file layout optimization, vacuum
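For example, time travel can read an earlier snapshot of a table, and OPTIMIZE compacts its file layout (table name and version number are placeholders; the API accepts one statement per call):
# Query an earlier snapshot of a Delta table (time travel)
databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{
  "warehouse_id": "xxxxxxxxxx",
  "statement": "select * from main.default.events VERSION AS OF 5 limit 10"
}'

# Compact small files (run as a separate call)
databricks api post /api/2.0/sql/statements --profile "DEFAULT" --json '{
  "warehouse_id": "xxxxxxxxxx",
  "statement": "OPTIMIZE main.default.events"
}'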
4.3. Lakeflow Jobs
Task types:
- Notebook tasks
- Pipeline tasks
- Python script tasks
Triggers:
- Time-based (e.g., daily at 2 AM)
- Event-based (on new data arrival)
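As an illustration of a time-based trigger, a minimal job definition with a daily 2 AM cron schedule (job name, notebook path, and timezone are placeholders; the sketch assumes serverless jobs compute, so no cluster spec is included):
# Create a job that runs a notebook daily at 2 AM (placeholders throughout)
databricks jobs create --json '{
  "name": "nightly-etl",
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "tasks": [
    {
      "task_key": "daily_notebook",
      "notebook_task": {"notebook_path": "/Workspace/etl/daily"}
    }
  ]
}'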
Limits:
- Workspace: Max 2000 concurrent task executions
- Saved jobs: Max 12000
- Tasks per job: Max 1000
5. Machine Learning Infrastructure
5.1. MLflow
- Core tool for experiment tracking and model management
- Dedicated features for GenAI
5.2. Feature Store
- Feature management system
- Automatic data pipelines and feature discovery
5.3. Model Serving
- Deploy custom models and LLMs as REST endpoints
- Auto-scaling and GPU support
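A hedged sketch of deploying a Unity Catalog model as a REST endpoint (endpoint name, model name, and version are placeholders; the payload schema may differ across releases):
# Create a serving endpoint for a registered model (placeholder names)
databricks api post /api/2.0/serving-endpoints --profile "DEFAULT" --json '{
  "name": "my-model-endpoint",
  "config": {
    "served_entities": [{
      "entity_name": "main.default.my_model",
      "entity_version": "1",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    }]
  }
}'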
6. Security
6.1. Authentication and Access Control
- SSO configuration
- Multi-factor authentication
- Access control lists
6.2. Network Security
- Private connectivity
- Serverless egress control
- Firewall settings
- VPC management
6.3. Data Encryption
- Encryption at rest and in transit
- Customer-managed keys
- Inter-cluster communication encryption
- Automatic credential masking
7. SQL Warehouse
7.1. Serverless SQL Warehouse Benefits
- Instant and elastic compute
- Auto-scaling
- Minimal management (Databricks handles capacity)
- Low total cost of ownership
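For reference, a sketch of creating a serverless warehouse via the API (name and sizing are placeholders; serverless requires the PRO warehouse type and serverless compute enabled for the workspace):
# Create a serverless SQL warehouse (placeholder name and sizing)
databricks api post /api/2.0/sql/warehouses --profile "DEFAULT" --json '{
  "name": "serverless-wh",
  "cluster_size": "Small",
  "warehouse_type": "PRO",
  "enable_serverless_compute": true,
  "auto_stop_mins": 10,
  "max_num_clusters": 2
}'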
8. Detailed Documentation
See docs/ directory for detailed documentation:
- unity-catalog.md - Unity Catalog details and permission management
- data-engineering.md - Data engineering best practices
- machine-learning.md - Machine learning infrastructure details
- security.md - Security and permission design
- well-architected.md - Well-Architected framework
9. Reference Links
- Official docs: https://docs.databricks.com/
- Unity Catalog: https://docs.databricks.com/en/data-governance/unity-catalog/
- Lakeflow Jobs: https://docs.databricks.com/en/jobs/
- MLflow: https://docs.databricks.com/en/mlflow/
- Delta Lake: https://docs.databricks.com/en/delta/
- Security: https://docs.databricks.com/en/security/