ai-mlops
Complete MLOps skill covering production ML lifecycle and security. Includes data ingestion, model deployment, drift detection, monitoring, plus ML security (prompt injection, jailbreak defense, RAG security, privacy, governance). Modern automation-first patterns with multi-layered defenses.
Install
git clone https://github.com/vasilyu1983/AI-Agents-public /tmp/AI-Agents-public && cp -r /tmp/AI-Agents-public/frameworks/claude-code-kit/framework/skills/ai-mlops ~/.claude/skills/
Tip: Run this command in your terminal to install the skill.
MLOps & ML Security - Complete Reference
Production ML lifecycle with modern security practices.
This skill covers:
- Production: Data ingestion, deployment, drift detection, monitoring, incident response
- Security: Prompt injection, jailbreak defense, RAG security, output filtering
- Governance: Privacy protection, supply chain security, safety evaluation
- Data ingestion (dlt): Load data from APIs, databases to warehouses
- Model deployment: Batch jobs, real-time APIs, hybrid systems, event-driven automation
- Operations: Real-time monitoring, 18-second drift detection, automated retraining, incident response
Key Advances:
- Event-driven, modular, auditable pipelines automating every key phase
- 18-second drift detection with F1 >0.99 post-attack recovery
- Automated retraining triggers (drift, schema change, volume threshold, manual override)
- Scalable architecture: >2,300 req/sec with sub-50ms latency
It is execution-focused:
- Data ingestion patterns (REST APIs, database replication, incremental loading)
- Deployment patterns (batch, online, hybrid, streaming, event-driven)
- Automated monitoring with real-time drift detection
- Automated retraining pipelines (monitor → detect → trigger → validate → deploy)
- Incident handling with rapid recovery (F1 >0.99 restoration)
- Links to copy-paste templates in templates/
Quick Reference
| Task | Tool/Framework | Command | When to Use |
|---|---|---|---|
| Data Ingestion | dlt (data load tool) | dlt pipeline run, dlt init | Loading from APIs, databases to warehouses |
| Batch Deployment | Airflow, Dagster, Prefect | airflow dags trigger, dagster job launch | Scheduled predictions on large datasets |
| API Deployment | FastAPI, Flask, TorchServe | uvicorn app:app, torchserve --start | Real-time inference (<500ms latency) |
| Model Registry | MLflow, W&B | mlflow.register_model(), wandb.log_model() | Versioning and promoting models |
| Drift Detection | Evidently, WhyLabs | evidently.dashboard(), monitor metrics | Automated drift monitoring (18s response) |
| Monitoring | Prometheus, Grafana | prometheus.yml, Grafana dashboards | Metrics, alerts, SLO tracking |
| Incident Response | Runbooks, PagerDuty | Documented playbooks, alert routing | Handling failures and degradation |
When to Use This Skill
Claude should invoke this skill when the user asks for deployment, operations, or data ingestion help, e.g.:
- "How do I deploy this model to prod?"
- "Design a batch + online scoring architecture."
- "Add monitoring and drift detection to our model."
- "Write an incident runbook for this ML service."
- "Package this LLM/RAG pipeline as an API."
- "Plan our retraining and promotion workflow."
- "Load data from Stripe API to Snowflake."
- "Set up incremental database replication with dlt."
- "Build an ELT pipeline for warehouse loading."
If the user is asking only about EDA, modelling, or theory, prefer:
- ai-ml-data-science (EDA, features, modelling, SQL transformation with SQLMesh)
- ai-llm (prompting, fine-tuning, eval)
- ai-rag (retrieval pipeline design)
- ai-llm-inference (compression, spec decode, serving internals)
If the user is asking about SQL transformation (after data is loaded), prefer:
- ai-ml-data-science (SQLMesh templates for staging, intermediate, marts layers)
Decision Tree: Choosing Deployment Strategy
User needs to deploy: [ML System]
├─ Data Ingestion?
│  ├─ From REST APIs? → dlt REST API templates
│  ├─ From databases? → dlt database sources (PostgreSQL, MySQL, MongoDB)
│  └─ Incremental loading? → dlt incremental patterns (timestamp, ID-based)
│
├─ Model Serving?
│  ├─ Latency <500ms? → FastAPI real-time API
│  ├─ Batch predictions? → Airflow/Dagster batch pipeline
│  └─ Mix of both? → Hybrid (batch features + online scoring)
│
├─ Monitoring & Ops?
│  ├─ Drift detection? → Evidently + automated retraining triggers
│  ├─ Performance tracking? → Prometheus + Grafana dashboards
│  └─ Incident response? → Runbooks + PagerDuty alerts
│
└─ LLM/RAG Production?
   ├─ Cost optimization? → Caching, prompt templates, token budgets
   └─ Safety? → This skill's security patterns (prompt injection, jailbreak defense, output filtering)
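For the real-time branch, a minimal sketch of a FastAPI scoring service is shown below. It assumes a scikit-learn-style classifier serialized as model.joblib; the file name, request schema, and endpoint path are illustrative, not part of this skill's templates.

```python
# Minimal sketch: real-time scoring endpoint.
# Assumptions: a scikit-learn-style model with predict_proba, saved as model.joblib.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI(title="scoring-api")
model = joblib.load("model.joblib")  # assumed artifact path

class ScoringRequest(BaseModel):
    features: list[float]

class ScoringResponse(BaseModel):
    score: float

@app.post("/score", response_model=ScoringResponse)
def score(req: ScoringRequest) -> ScoringResponse:
    # Single-row inference; batch rows upstream if throughput matters more than latency.
    proba = model.predict_proba([req.features])[0][1]
    return ScoringResponse(score=float(proba))

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```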
Core Patterns Overview
This skill provides 13 production-ready patterns organized into comprehensive guides:
Data & Infrastructure Patterns
Pattern 0: Data Contracts, Ingestion & Lineage - See Data Ingestion Patterns
- Data contracts with SLAs and versioning
- Ingestion modes (CDC, batch, streaming)
- Lineage tracking and schema evolution
- Replay and backfill procedures
Pattern 1: Choose Deployment Mode - See Deployment Patterns
- Decision table (batch, online, hybrid, streaming)
- When to use each mode
- Deployment mode selection checklist
Pattern 2: Standard Deployment Lifecycle - See Deployment Lifecycle
- Pre-deploy, deploy, observe, operate, evolve phases
- Environment promotion (dev → staging → prod)
- Gradual rollout strategies (canary, blue-green)
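As a simple illustration of the canary idea, the sketch below splits traffic in-process between a stable and a candidate model. The 5% share and function names are assumptions; production rollouts normally split traffic at the load balancer, gateway, or service mesh rather than in application code.

```python
# Minimal sketch: weighted canary routing between two loaded models.
# Assumptions: both models are already loaded; the 5% canary share is illustrative.
import random

CANARY_TRAFFIC_SHARE = 0.05  # fraction of requests routed to the candidate model

def route_prediction(features, stable_model, canary_model):
    model = canary_model if random.random() < CANARY_TRAFFIC_SHARE else stable_model
    return model.predict([features])[0]
```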
Pattern 3: Packaging & Model Registry - See Model Registry Patterns
- Model registry structure and metadata
- Packaging strategies (Docker, ONNX, MLflow)
- Promotion flows (experimental → production)
- Versioning and governance
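As one concrete shape of this promotion flow, the sketch below registers a logged model and promotes it via an alias, assuming an MLflow 2.x tracking server and the alias-based workflow; the run ID and model name are placeholders.

```python
# Minimal sketch: register a trained model and promote it to "production".
# Assumptions: MLflow 2.x, a run that already logged a model under the "model" artifact path.
import mlflow
from mlflow.tracking import MlflowClient

RUN_ID = "abc123"           # placeholder: training run to promote
MODEL_NAME = "churn-model"  # placeholder: registered model name

# Register the run's artifact as a new model version.
version = mlflow.register_model(f"runs:/{RUN_ID}/model", MODEL_NAME)

# Point the "production" alias at it; serving code loads "models:/churn-model@production".
client = MlflowClient()
client.set_registered_model_alias(MODEL_NAME, "production", version.version)
```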
Serving Patterns
Pattern 4: Batch Scoring Pipeline - See Deployment Patterns
- Orchestration with Airflow/Dagster
- Idempotent scoring jobs
- Validation and backfill procedures
Pattern 5: Real-Time API Scoring - See API Design Patterns
- Service design (HTTP/JSON, gRPC)
- Input/output schemas
- Rate limiting, timeouts, circuit breakers
Pattern 6: Hybrid & Feature Store Integration - See Feature Store Patterns
- Batch vs online features
- Feature store architecture
- Training-serving consistency
- Point-in-time correctness
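Point-in-time correctness can be illustrated without any feature store at all: the sketch below uses pandas.merge_asof so each training row only sees feature values computed at or before the event time. Column names and data are illustrative.

```python
# Minimal sketch: point-in-time correct feature join with pandas.
# Assumptions: toy data; column names are illustrative.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1],
    "event_time": pd.to_datetime(["2024-01-10", "2024-02-10"]),
})
features = pd.DataFrame({
    "user_id": [1, 1],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "avg_spend_30d": [42.0, 55.0],
})

# For each event, take the latest feature value at or before the event time,
# preventing leakage of "future" features into training rows.
training_rows = pd.merge_asof(
    events.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training_rows)
```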
Operations Patterns
Pattern 7: Monitoring & Alerting - See Monitoring Best Practices
- Data, performance, and technical metrics
- SLO definition and tracking
- Dashboard design and alerting strategies
Pattern 8: Drift Detection & Automated Retraining - See Drift Detection Guide
- Real-time drift detection (18-second response)
- Automated retraining triggers
- Event-driven retraining pipelines
- Performance targets (F1 >0.99 recovery)
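A minimal drift check that could gate such a retraining trigger is sketched below, assuming Evidently's Report API (0.4.x-style) and two pandas DataFrames with the same schema; the exact result-dict layout varies by Evidently version.

```python
# Minimal sketch: dataset-level drift check used as a retraining trigger.
# Assumptions: Evidently 0.4.x-style Report API; `reference` and `current` share a schema.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

def dataset_drift_detected(reference: pd.DataFrame, current: pd.DataFrame) -> bool:
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()
    # DataDriftPreset exposes a dataset-level drift flag; key path may differ by version.
    return bool(result["metrics"][0]["result"]["dataset_drift"])

# In an event-driven pipeline, a True result would publish a "retrain" event
# (to a queue or orchestrator) rather than retraining inline.
```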
Pattern 9: Incidents & Runbooks - See Incident Response Playbooks
- Common failure modes
- Detection, diagnosis, resolution
- Post-mortem procedures
Pattern 10: LLM / RAG in Production - See LLM & RAG Production Patterns
- Prompt and configuration management
- Safety and compliance (PII, jailbreaks)
- Cost optimization (token budgets, caching)
- Monitoring and fallbacks
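As a sketch of the caching lever, the snippet below caches completions by prompt hash so repeated prompts are not re-billed; llm_complete is a hypothetical client function, and a real deployment would add TTLs and a shared cache (e.g., Redis).

```python
# Minimal sketch: prompt-level response caching for LLM cost control.
# Assumptions: `llm_complete(prompt) -> str` is a hypothetical client function.
import hashlib

_cache: dict[str, str] = {}

def cached_complete(prompt: str, llm_complete) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = llm_complete(prompt)  # only uncached prompts incur token cost
    return _cache[key]
```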
Pattern 11: Cross-Region, Residency & Rollback - See Multi-Region Patterns
- Multi-region deployment architectures
- Data residency and tenant isolation
- Disaster recovery and failover
- Regional rollback procedures
Pattern 12: Online Evaluation & Feedback Loops - See Online Evaluation Patterns
- Feedback signal collection (implicit, explicit)
- Shadow and canary deployments
- A/B testing with statistical significance
- Human-in-the-loop labeling
- Automated retraining cadence
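For the statistical-significance step, a two-proportion z-test is often enough for conversion-style metrics; the sketch below uses statsmodels with illustrative counts.

```python
# Minimal sketch: significance check for an A/B test on a binary (conversion) metric.
# Assumptions: statsmodels installed; counts below are illustrative.
from statsmodels.stats.proportion import proportions_ztest

conversions = [312, 289]      # successes in variant B and control A
exposures = [10_000, 10_000]  # users exposed to each arm

stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
if p_value < 0.05:
    print(f"Statistically significant difference (p={p_value:.4f})")
else:
    print(f"No significant difference detected (p={p_value:.4f})")
```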
Resources (Detailed Guides)
For comprehensive operational guides, see:
Core Infrastructure:
- Data Ingestion Patterns - Data contracts, CDC, batch/streaming ingestion, lineage, schema evolution
- Deployment Lifecycle - Pre-deploy validation, environment promotion, gradual rollout, rollback
- Model Registry Patterns - Versioning, packaging, promotion workflows, governance
- Feature Store Patterns - Batch/online features, hybrid architectures, consistency, latency optimization
Serving & APIs:
- Deployment Patterns - Batch, online, hybrid, streaming deployment strategies and architectures
- API Design Patterns - ML/LLM/RAG API patterns, input/output schemas, reliability patterns, versioning
Operations & Reliability:
- Monitoring Best Practices - Metrics collection, alerting strategies, SLO definition, dashboard design
- Drift Detection Guide - Statistical tests, automated detection, retraining triggers, recovery strategies
- Incident Response Playbooks - Runbooks for common failure modes, diagnostics, resolution steps
Advanced Patterns:
- LLM & RAG Production Patterns - Prompt management, safety, cost optimization, caching, monitoring
- Multi-Region Patterns - Multi-region deployment, data residency, disaster recovery, rollback
- Online Evaluation Patterns - A/B testing, shadow deployments, feedback loops, automated retraining
Templates
Use these as copy-paste starting points for production artifacts:
Data Ingestion (dlt)
For loading data into warehouses and pipelines:
- dlt basic pipeline setup - Install, configure, run basic extraction and loading
- dlt REST API sources - Extract from REST APIs with pagination, authentication, rate limiting
- dlt database sources - Replicate from PostgreSQL, MySQL, MongoDB, SQL Server
- dlt incremental loading - Timestamp-based, ID-based, merge/upsert patterns, lookback windows
- dlt warehouse loading - Load to Snowflake, BigQuery, Redshift, Postgres, DuckDB
Use dlt when:
- Loading data from APIs (Stripe, HubSpot, Shopify, custom APIs)
- Replicating databases to warehouses
- Building ELT pipelines with incremental loading
- Managing data ingestion with Python
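A minimal end-to-end dlt pipeline looks like the sketch below; the endpoint URL, resource name, and DuckDB destination are illustrative (see the templates above for REST API pagination, database sources, and incremental loading).

```python
# Minimal sketch: load records from a JSON API into a local DuckDB warehouse with dlt.
# Assumptions: `pip install "dlt[duckdb]"`; the endpoint URL and names are illustrative.
import dlt
import requests

@dlt.resource(name="events", write_disposition="append")
def events():
    response = requests.get("https://api.example.com/events")  # assumed endpoint
    response.raise_for_status()
    yield from response.json()

pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",        # swap for "snowflake", "bigquery", "redshift", ...
    dataset_name="raw_events",
)
load_info = pipeline.run(events())
print(load_info)
```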
For SQL transformation (after ingestion), use:
→ ai-ml-data-science skill (SQLMesh templates for staging/intermediate/marts layers)
Deployment & Packaging
- Deployment & MLOps template - Complete MLOps lifecycle, model registry, promotion workflows
- API service template - Real-time REST/gRPC API with FastAPI, input validation, rate limiting
- Batch scoring pipeline template - Orchestrated batch inference with Airflow/Dagster, validation, backfill
Monitoring & Operations
- Monitoring & alerting template - Data/performance/technical metrics, dashboards, SLO definition
- Drift detection & retraining template - Automated drift detection, retraining triggers, promotion pipelines
- Incident runbook template - Failure mode playbooks, diagnosis steps, resolution procedures
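To make the monitoring template concrete, the sketch below instruments a prediction path with prometheus_client; metric and label names are illustrative, and the Prometheus scrape target is configured separately in prometheus.yml.

```python
# Minimal sketch: serving-side metrics exposed for Prometheus scraping.
# Assumptions: prometheus_client installed; metric/label names are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Predictions served", ["model_version"])
LATENCY = Histogram("model_inference_latency_seconds", "Inference latency in seconds")

def predict_with_metrics(model, features, model_version: str = "v1"):
    start = time.perf_counter()
    prediction = model.predict([features])[0]
    LATENCY.observe(time.perf_counter() - start)
    PREDICTIONS.labels(model_version=model_version).inc()
    return prediction

# Expose /metrics on port 9100; add this target to prometheus.yml for scraping.
start_http_server(9100)
```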
Navigation
Resources
- resources/drift-detection-guide.md
- resources/model-registry-patterns.md
- resources/online-evaluation-patterns.md
- resources/monitoring-best-practices.md
- resources/llm-rag-production-patterns.md
- resources/api-design-patterns.md
- resources/incident-response-playbooks.md
- resources/deployment-patterns.md
- resources/data-ingestion-patterns.md
- resources/deployment-lifecycle.md
- resources/feature-store-patterns.md
- resources/multi-region-patterns.md
Templates
- templates/ingestion/template-dlt-pipeline.md
- templates/ingestion/template-dlt-rest-api.md
- templates/ingestion/template-dlt-database-source.md
- templates/ingestion/template-dlt-incremental.md
- templates/ingestion/template-dlt-warehouse-loading.md
- templates/deployment/template-deployment-mlops.md
- templates/deployment/template-api-service.md
- templates/deployment/template-batch-pipeline.md
- templates/ops/template-incident-runbook.md
- templates/monitoring/template-drift-retraining.md
- templates/monitoring/template-monitoring-plan.md
Data
- data/sources.json - Curated external references
External Resources
See data/sources.json for curated references on:
- Serving frameworks (FastAPI, Flask, gRPC, TorchServe, KServe, Ray Serve)
- Orchestration (Airflow, Dagster, Prefect)
- Model registries and MLOps (MLflow, W&B, Vertex AI, SageMaker)
- Monitoring and observability (Prometheus, Grafana, OpenTelemetry, Evidently)
- Feature stores (Feast, Tecton, Vertex, Databricks)
- Streaming & messaging (Kafka, Pulsar, Kinesis)
- LLMOps & RAG infra (vector DBs, LLM gateways, safety tools)
Data Lake & Lakehouse
For comprehensive data lake/lakehouse patterns (beyond dlt ingestion), see data-lake-platform:
- Table formats: Apache Iceberg, Delta Lake, Apache Hudi
- Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
- Alternative ingestion: Airbyte (GUI-based connectors)
- Transformation: dbt (alternative to SQLMesh)
- Streaming: Apache Kafka patterns
- Orchestration: Dagster, Airflow
This skill focuses on ML-specific deployment, monitoring, and security. Use data-lake-platform for general-purpose data infrastructure.
Related Skills
For adjacent topics, reference these skills:
- ai-ml-data-science - EDA, feature engineering, modelling, evaluation, SQLMesh transformations
- ai-llm - Prompting, fine-tuning, evaluation for LLMs
- ai-agents - Agentic workflows, multi-agent systems, LLMOps
- ai-rag - RAG pipeline design, chunking, retrieval, evaluation
- ai-llm-inference - Model serving optimization, quantization, batching
- ai-prompt-engineering - Prompt design patterns and best practices
- data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)
Use this skill to turn trained models into reliable services, not to derive the model itself.
Repository
https://github.com/vasilyu1983/AI-Agents-public
