ml-cv-specialist
Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.
$ Install
git clone https://github.com/alirezarezvani/claude-cto-team /tmp/claude-cto-team && cp -r /tmp/claude-cto-team/skills/ml-cv-specialist ~/.claude/skills/claude-cto-team/
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: ml-cv-specialist
description: Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.
ML/CV Specialist
Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.
When to Use
- Selecting ML models for specific use cases
- Designing training and inference pipelines
- Optimizing ML system performance and cost
- Evaluating build vs. API for ML capabilities
- Planning data pipelines for ML workloads
ML System Design Framework
Model Selection Decision Tree
Use Case Identified
│
├─► Text/Language Tasks
│   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
│   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
│   ├─► Embeddings → OpenAI Ada, sentence-transformers
│   └─► Search/RAG → Vector DB + Embeddings + LLM
│
├─► Computer Vision Tasks
│   ├─► Classification → ResNet, EfficientNet, ViT
│   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
│   ├─► Segmentation → SAM, Mask R-CNN, U-Net
│   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
│   └─► Face Recognition → InsightFace, DeepFace
│
├─► Audio Tasks
│   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
│   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
│   └─► Audio Classification → PANNs, AudioSet models
│
└─► Structured Data
    ├─► Tabular → XGBoost, LightGBM, CatBoost
    ├─► Time Series → Prophet, ARIMA, Transformer-based
    └─► Recommendations → Two-tower, matrix factorization
API vs. Self-Hosted Decision
When to Use APIs
| Factor | API Preferred | Self-Hosted Preferred |
|---|---|---|
| Volume | < 10K requests/month | > 100K requests/month |
| Latency | > 500ms acceptable | < 100ms required |
| Customization | General use case | Domain-specific fine-tuning |
| Data Privacy | Non-sensitive data | PII, HIPAA, financial |
| Team Expertise | No ML engineers | ML team available |
| Budget | Predictable per-call costs | High volume justifies infra |
Cost Comparison Framework
API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700
Self-Hosted Costs (Example: Llama 70B)
- GPU instance: $3/hour (A100 40GB)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001
- 100K requests/month: $100 + $500 engineering time
Break-even Analysis
- < 50K requests: API likely cheaper
- > 50K requests: Self-hosted may be cheaper
- Factor in: engineering time, ops burden, model quality
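The arithmetic above, as a small sketch you can rerun with your own rates (the defaults are the example figures from this section, not current vendor pricing):

# Break-even sketch using the illustrative figures above
def api_cost(requests, in_tok=500, out_tok=200, in_rate=0.03, out_rate=0.06):
    per_request = (in_tok * in_rate + out_tok * out_rate) / 1000
    return requests * per_request                 # e.g. 100K -> $2,700

def self_hosted_cost(requests, gpu_hourly=3.0, req_per_hour=3000, fixed_eng=500.0):
    # GPU-hours needed, plus a one-time engineering estimate
    return requests / req_per_hour * gpu_hourly + fixed_eng

for n in (10_000, 50_000, 100_000, 500_000):
    print(n, round(api_cost(n), 2), round(self_hosted_cost(n), 2))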
Training Pipeline Architecture
Standard ML Pipeline
┌─────────────────────────────────────────────────────────────┐
│                         DATA LAYER                          │
├─────────────────────────────────────────────────────────────┤
│  Data Sources  →   ETL    →  Feature Store →  Training Data │
│  (S3, DBs)      (Airflow)      (Feast)        (Versioned)   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       TRAINING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  Experiment Tracking → Training Jobs  →  Model Registry     │
│  (MLflow, W&B)         (SageMaker)       (MLflow, S3)       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                        SERVING LAYER                        │
├─────────────────────────────────────────────────────────────┤
│  Model Server  →  Load Balancer  →  Monitoring              │
│  (TorchServe)     (K8s/ELB)         (Prometheus)            │
└─────────────────────────────────────────────────────────────┘
Component Selection Guide
| Component | Options | Recommendation |
|---|---|---|
| Feature Store | Feast, Tecton, SageMaker | Feast (open source), Tecton (enterprise) |
| Experiment Tracking | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| Training Orchestration | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| Model Registry | MLflow, SageMaker, custom S3 | MLflow (standard) |
| Model Serving | TorchServe, TFServing, Triton | Triton (multi-framework) |
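As a concrete taste of the experiment-tracking row, a hedged MLflow sketch (the experiment name, parameters, metric values, and artifact path are placeholders):

# Minimal MLflow experiment-tracking sketch (names and values are placeholders)
import mlflow

mlflow.set_experiment("object-detection")          # hypothetical experiment name
with mlflow.start_run(run_name="yolov8n-baseline"):
    mlflow.log_param("lr", 1e-3)
    mlflow.log_param("batch_size", 16)
    mlflow.log_metric("mAP50", 0.62)               # placeholder metric value
    mlflow.log_artifact("weights/best.pt")         # hypothetical local path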
Inference Architecture Patterns
Pattern 1: Synchronous API
Best for: Low-latency requirements, simple integration
Client → API Gateway → Model Server → Response
              │
        Load Balancer
              │
       ┌──────┴──────┐
       │             │
   Model Pod     Model Pod
Latency targets:
- P50: < 100ms
- P95: < 300ms
- P99: < 500ms
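A minimal sketch of this pattern as a FastAPI service (FastAPI is an assumption of this sketch; load_model and the request schema are hypothetical):

# Synchronous inference endpoint sketch (FastAPI assumed; model is hypothetical)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = None  # load once at startup, never per request

class PredictRequest(BaseModel):
    inputs: list[float]

@app.on_event("startup")
def load():
    global model
    model = load_model("model.onnx")  # hypothetical loader

@app.post("/predict")
def predict(req: PredictRequest):
    return {"prediction": model.predict(req.inputs)}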
Pattern 2: Asynchronous Processing
Best for: Long-running inference, batch processing
Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                           │
                                       S3/Redis
Use when:
- Inference > 5 seconds
- Batch processing required
- Variable load patterns
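A minimal sketch of the enqueue side, assuming AWS SQS via boto3 (the queue URL is a placeholder):

# Async pattern sketch: the API enqueues, a worker processes later
import json, uuid, boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/inference-jobs"  # placeholder

def submit_job(payload: dict) -> str:
    job_id = str(uuid.uuid4())
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"job_id": job_id, **payload}))
    return job_id  # client polls a result store (e.g. S3/Redis) with this id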
Pattern 3: Edge Inference
Best for: Privacy, offline capability, ultra-low latency
┌───────────────────────────────────────────┐
│                EDGE DEVICE                │
│  ┌─────────┐     ┌─────────────────────┐  │
│  │ Camera  │ ──► │  Optimized Model    │  │
│  └─────────┘     │  (ONNX, TFLite)     │  │
│                  └─────────────────────┘  │
│                            │              │
│                      Local Result         │
└───────────────────────────────────────────┘
                             │
                      Sync to Cloud
                      (non-blocking)
Model optimization for edge:
- Quantization (INT8): 4x smaller, 2-3x faster
- Pruning: 50-90% sparsity possible
- Distillation: Smaller model, similar accuracy
- ONNX/TFLite: Optimized runtime
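A hedged PyTorch sketch of two of these optimizations, dynamic INT8 quantization and ONNX export (the stand-in model and input shape are placeholders, and the two steps are independent paths, not one combined pipeline):

# Edge-optimization sketch: dynamic INT8 quantization and ONNX export (PyTorch)
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))  # stand-in model

# Path 1: post-training dynamic quantization of Linear layers to INT8
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Path 2: export the float model to ONNX for an optimized edge runtime
dummy = torch.randn(1, 3, 224, 224)  # example input shape; adjust to your model
torch.onnx.export(model, dummy, "model.onnx", opset_version=17)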
Computer Vision Pipeline Design
Real-Time Video Processing
Camera Stream → Frame Extraction → Preprocessing → Model → Postprocessing → Output
      │               │                 │             │            │
   RTSP/           1-30 FPS         Resize,       Batch or    NMS, tracking,
   WebRTC                           normalize     single      annotation
Performance optimization:
- Process every Nth frame (skip frames)
- Resize to model input size early
- Batch frames when latency allows
- Use GPU preprocessing (NVIDIA DALI)
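A minimal sketch of the frame-skipping and early-resize ideas from the list above (the stream URL, stride, input size, and run_inference are placeholders):

# Frame-skipping sketch with OpenCV (stream URL and stride are placeholders)
import cv2

STRIDE = 5  # process every 5th frame
cap = cv2.VideoCapture("rtsp://camera.local/stream")  # hypothetical RTSP URL
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % STRIDE == 0:
        small = cv2.resize(frame, (640, 640))  # resize early to model input size
        # run_inference(small)  # hypothetical model call
    frame_idx += 1
cap.release()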
Object Detection System
Pipeline Components
1. **Input Processing**
- Video decode: FFmpeg, OpenCV
- Frame buffer: Ring buffer for temporal context
- Preprocessing: NVIDIA DALI (GPU), OpenCV (CPU)
2. **Detection**
- Model: YOLOv8 (speed), DETR (accuracy)
- Batch size: 1-8 depending on latency requirements
- Confidence threshold: 0.5-0.7 typical
3. **Post-processing**
- NMS (Non-Maximum Suppression)
- Tracking: SORT, DeepSORT, ByteTrack
- Smoothing: Kalman filter for stable boxes
4. **Output**
- Annotations: Bounding boxes, labels, confidence
- Events: Trigger on detection (webhook, queue)
- Storage: Frame + metadata to S3/DB
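For the NMS step above, a plain-NumPy sketch of the logic (torchvision.ops.nms is the usual production choice; boxes are [x1, y1, x2, y2]):

# Non-Maximum Suppression sketch: keep the best box, drop overlapping ones
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    order = scores.argsort()[::-1]  # highest-confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes overlapping the kept one
    return keep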
LLM Integration Patterns
RAG (Retrieval-Augmented Generation)
User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
                              │
                          Vector DB
                   (Pinecone, Weaviate,
                    Chroma, pgvector)
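The flow above as a sketch; embed, vector_db, and llm stand in for whichever embedding model, vector store, and LLM client you pick (all are hypothetical interfaces):

# RAG flow sketch (embed, vector_db, llm are hypothetical interfaces)
def answer(query: str, vector_db, llm, k: int = 5) -> str:
    query_vec = embed(query)                      # hypothetical embedding call
    hits = vector_db.search(query_vec, top_k=k)   # hypothetical vector search
    context = "\n\n".join(h.text for h in hits)   # retrieved passages
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm.complete(prompt)                   # hypothetical LLM call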
Vector DB Selection:
| Database | Best For | Limitations |
|---|---|---|
| Pinecone | Managed, scale | Cost at scale |
| Weaviate | Self-hosted, features | Operational overhead |
| Chroma | Simple, local dev | Not for production scale |
| pgvector | PostgreSQL users | Performance at >1M vectors |
| Qdrant | Performance | Newer, smaller community |
LLM Serving Architecture
┌─────────────────────────────────────────────────────────────┐
│                         API GATEWAY                         │
│             Rate limiting, auth, request routing            │
└─────────────────────────────────────────────────────────────┘
                              │
               ┌──────────────┼──────────────┐
               │              │              │
               ▼              ▼              ▼
          ┌─────────┐    ┌─────────┐    ┌─────────┐
          │  GPT-4  │    │ Claude  │    │  Local  │
          │   API   │    │   API   │    │  Llama  │
          └─────────┘    └─────────┘    └─────────┘
                              │
                        Model Router
                 (cost/latency/capability)
Multi-model strategy:
- Simple queries → Cheaper model (GPT-3.5, Haiku)
- Complex reasoning → Expensive model (GPT-4, Opus)
- Sensitive data → Self-hosted (Llama, Mistral)
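A sketch of the routing logic; the heuristics and model names below are illustrative placeholders, not a recommended classifier:

# Model-router sketch (rules are illustrative placeholders)
def route(query: str, contains_pii: bool) -> str:
    if contains_pii:
        return "local-llama"   # sensitive data stays self-hosted
    if len(query.split()) > 200 or "step by step" in query.lower():
        return "gpt-4"         # crude proxy for complex reasoning
    return "gpt-3.5"           # cheap default for simple queries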
Performance Optimization
GPU Memory Optimization
| Technique | Memory Reduction | Speed Impact |
|---|---|---|
| FP16 (Half Precision) | 50% | Neutral to faster |
| INT8 Quantization | 75% | 10-20% slower |
| INT4 Quantization | 87.5% | 20-40% slower |
| Gradient Checkpointing | 60-80% | 20-30% slower |
| Model Sharding | Distributed | Communication overhead |
Batching Strategies
# Dynamic batching: group concurrent requests into one model call;
# flush when the batch fills or after max_wait_ms, whichever comes first.
import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model                    # must expose async predict_batch()
        self.queue = []                       # pending (request, future) pairs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000    # seconds
        self.lock = asyncio.Lock()

    async def add_request(self, request):
        fut = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.queue.append((request, fut))
            if len(self.queue) == 1:
                asyncio.create_task(self._flush_later())  # timer starts the flush
            if len(self.queue) >= self.max_batch:
                await self._process_batch()               # full batch: flush now
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.max_wait)
        async with self.lock:
            await self._process_batch()  # no-op if the size trigger already fired

    async def _process_batch(self):
        if not self.queue:
            return
        batch, self.queue = self.queue[:self.max_batch], self.queue[self.max_batch:]
        results = await self.model.predict_batch([req for req, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
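Usage sketch: concurrent callers share batches transparently (model here is any object with an async predict_batch method):

batcher = DynamicBatcher(model, max_batch=32, max_wait_ms=50)
results = await asyncio.gather(*(batcher.add_request(r) for r in requests))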
Model Monitoring
Key Metrics to Track
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Latency (P95) | Response time | > 2x baseline |
| Throughput | Requests/second | < 80% capacity |
| Error Rate | Failed predictions | > 1% |
| Model Drift | Distribution shift | PSI > 0.2 |
| Data Quality | Input anomalies | > 5% anomalies |
Drift Detection
Training Distribution ───┐
                         ├──► Statistical Test ──► Alert
Production Distribution ─┘
                         (PSI, KS test, JS divergence)
Population Stability Index (PSI):
- PSI < 0.1: No significant change
- 0.1 < PSI < 0.2: Moderate change, monitor
- PSI > 0.2: Significant change, investigate
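A minimal PSI sketch in NumPy (the binning choice is a judgment call; this version uses deciles of the training sample as bin edges):

# PSI sketch: compare a production feature sample against its training baseline
import numpy as np

def population_stability_index(expected, actual, bins=10):
    # Bin edges come from the training (expected) distribution
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf    # catch out-of-range production values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)     # avoid division by / log of zero
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# e.g. population_stability_index(train_scores, prod_scores) > 0.2 -> investigate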
Quick Reference Tables
Model Selection by Use Case
| Use Case | Recommended Model | Latency | Cost |
|---|---|---|---|
| Text Classification | DistilBERT | 10ms | Low |
| Text Generation | GPT-4 / Claude | 1-5s | Medium |
| Image Classification | EfficientNet-B0 | 5ms | Low |
| Object Detection | YOLOv8-n | 10ms | Low |
| Object Detection (Accurate) | YOLOv8-x | 50ms | Medium |
| Semantic Segmentation | SAM | 100ms | Medium |
| Speech-to-Text | Whisper-base | Real-time | Low |
| Embeddings | text-embedding-ada-002 | 50ms | Low |
Infrastructure Sizing
| Scale | GPU | Model Size | Throughput |
|---|---|---|---|
| Development | T4 (16GB) | < 7B params | 10-50 req/s |
| Production Small | A10G (24GB) | < 13B params | 50-100 req/s |
| Production Medium | A100 (40GB) | < 70B params | 100-500 req/s |
| Production Large | A100 (80GB) x 2+ | > 70B params | 500+ req/s |
References
- Model Catalog - Detailed model comparison and benchmarks
- Inference Patterns - Architecture patterns for different use cases
Repository
alirezarezvani/claude-cto-team/skills/ml-cv-specialist
Author
alirezarezvani