ml-cv-specialist

Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.

$ Install

git clone https://github.com/alirezarezvani/claude-cto-team /tmp/claude-cto-team && cp -r /tmp/claude-cto-team/skills/ml-cv-specialist ~/.claude/skills/claude-cto-team

// tip: Run this command in your terminal to install the skill


name: ml-cv-specialist
description: Deep expertise in ML/CV model selection, training pipelines, and inference architecture. Use when designing machine learning systems, computer vision pipelines, or AI-powered features.

ML/CV Specialist

Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.

When to Use

  • Selecting ML models for specific use cases
  • Designing training and inference pipelines
  • Optimizing ML system performance and cost
  • Evaluating build vs. API for ML capabilities
  • Planning data pipelines for ML workloads

ML System Design Framework

Model Selection Decision Tree

Use Case Identified
    │
    ├─► Text/Language Tasks
    │   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
    │   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
    │   ├─► Embeddings → OpenAI Ada, sentence-transformers
    │   └─► Search/RAG → Vector DB + Embeddings + LLM
    │
    ├─► Computer Vision Tasks
    │   ├─► Classification → ResNet, EfficientNet, ViT
    │   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
    │   ├─► Segmentation → SAM, Mask R-CNN, U-Net
    │   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
    │   └─► Face Recognition → InsightFace, DeepFace
    │
    ├─► Audio Tasks
    │   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
    │   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
    │   └─► Audio Classification → PANNs, AudioSet models
    │
    └─► Structured Data
        ├─► Tabular → XGBoost, LightGBM, CatBoost
        ├─► Time Series → Prophet, ARIMA, Transformer-based
        └─► Recommendations → Two-tower, matrix factorization

API vs. Self-Hosted Decision

Decision Factors

| Factor | API Preferred | Self-Hosted Preferred |
|---|---|---|
| Volume | < 10K requests/month | > 100K requests/month |
| Latency | > 500ms acceptable | < 100ms required |
| Customization | General use case | Domain-specific fine-tuning |
| Data Privacy | Non-sensitive data | PII, HIPAA, financial |
| Team Expertise | No ML engineers | ML team available |
| Budget | Predictable per-call costs | High volume justifies infra |

Cost Comparison Framework

## API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700

## Self-Hosted Costs (Example: Llama 70B)
- GPU instance: $3/hour (A100 40GB)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001
- 100K requests/month: $100 + $500 engineering time

## Break-even Analysis
- < 50K requests: API likely cheaper
- > 50K requests: Self-hosted may be cheaper
- Factor in: engineering time, ops burden, model quality
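
To sanity-check the break-even point, it helps to script it. A minimal sketch using the example rates above; it assumes pay-per-use GPU billing and treats the $500/month engineering overhead as a rough placeholder, so a GPU kept warm around the clock changes the picture:

# Break-even sketch using the example rates above; the engineering
# overhead and pay-per-use GPU billing are simplifying assumptions.
API_COST_PER_REQUEST = 0.027     # 500 input + 200 output tokens (GPT-4 rates)
GPU_COST_PER_HOUR = 3.00         # A100 40GB example rate
GPU_REQUESTS_PER_HOUR = 3000     # ~50 requests/minute
ENG_OVERHEAD_PER_MONTH = 500     # rough ops/engineering cost

def monthly_costs(requests: int) -> tuple[float, float]:
    api = requests * API_COST_PER_REQUEST
    gpu = requests / GPU_REQUESTS_PER_HOUR * GPU_COST_PER_HOUR
    return api, gpu + ENG_OVERHEAD_PER_MONTH

for n in (10_000, 50_000, 100_000):
    api, hosted = monthly_costs(n)
    print(f"{n:>7,} req/mo: API ${api:,.0f} vs self-hosted ${hosted:,.0f}")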

Training Pipeline Architecture

Standard ML Pipeline

┌──────────────────────────────────────────────────────────────┐
│                    DATA LAYER                                │
├──────────────────────────────────────────────────────────────┤
│  Data Sources → ETL → Feature Store → Training Data          │
│  (S3, DBs)     (Airflow)  (Feast)     (Versioned)            │
└──────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                  TRAINING LAYER                              │
├──────────────────────────────────────────────────────────────┤
│  Experiment Tracking → Training Jobs → Model Registry        │
│  (MLflow, W&B)         (SageMaker)    (MLflow, S3)           │
└──────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                  SERVING LAYER                               │
├──────────────────────────────────────────────────────────────┤
│  Model Server → Load Balancer → Monitoring                   │
│  (TorchServe)   (K8s/ELB)      (Prometheus)                  │
└──────────────────────────────────────────────────────────────┘

Component Selection Guide

| Component | Options | Recommendation |
|---|---|---|
| Feature Store | Feast, Tecton, SageMaker | Feast (open source), Tecton (enterprise) |
| Experiment Tracking | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| Training Orchestration | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| Model Registry | MLflow, SageMaker, custom S3 | MLflow (standard) |
| Model Serving | TorchServe, TFServing, Triton | Triton (multi-framework) |

Inference Architecture Patterns

Pattern 1: Synchronous API

Best for: Low-latency requirements, simple integration

Client → API Gateway → Model Server → Response
                           │
                      Load Balancer
                           │
                    ┌──────┴──────┐
                    │             │
                Model Pod    Model Pod

Latency targets:

  • P50: < 100ms
  • P95: < 300ms
  • P99: < 500ms
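
A minimal sketch of Pattern 1 using FastAPI. The request/response schema and the DummyModel stand-in are illustrative assumptions; in practice the predict call forwards to your model server:

# Synchronous inference endpoint sketch (FastAPI). The schema and the
# DummyModel stand-in are illustrative; swap in your model server client.
from fastapi import FastAPI
from pydantic import BaseModel

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    confidence: float

class DummyModel:
    def predict(self, text: str) -> tuple[str, float]:
        return ("positive", 0.93)   # placeholder inference

app = FastAPI()
model = DummyModel()

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest) -> PredictResponse:
    label, confidence = model.predict(req.text)
    return PredictResponse(label=label, confidence=confidence)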

Pattern 2: Asynchronous Processing

Best for: Long-running inference, batch processing

Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                          │
                                     S3/Redis

Use when:

  • Inference > 5 seconds
  • Batch processing required
  • Variable load patterns
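
A worker-loop sketch for Pattern 2, assuming SQS as the queue. The queue URL, run_inference(), and store_result() are placeholders to adapt:

# Async-pattern worker sketch, assuming an SQS queue. The queue URL,
# run_inference(), and store_result() are placeholders.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-jobs"

def run_inference(payload: dict) -> dict:
    raise NotImplementedError   # your model call

def store_result(result: dict) -> None:
    raise NotImplementedError   # write to S3/Redis, then webhook or poll

def worker_loop() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,   # up to 10 jobs per poll
            WaitTimeSeconds=20,       # long polling cuts empty receives
        )
        for msg in resp.get("Messages", []):
            store_result(run_inference(json.loads(msg["Body"])))
            # Delete only after the result is safely stored
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])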

Pattern 3: Edge Inference

Best for: Privacy, offline capability, ultra-low latency

┌──────────────────────────────────────────┐
│              EDGE DEVICE                 │
│  ┌─────────┐    ┌─────────────────────┐  │
│  │ Camera  │───▶│ Optimized Model     │  │
│  └─────────┘    │ (ONNX, TFLite)      │  │
│                 └─────────────────────┘  │
│                          │               │
│                     Local Result         │
└──────────────────────────────────────────┘
                           │
                    Sync to Cloud
                    (non-blocking)

Model optimization for edge:

  • Quantization (INT8): 4x smaller, 2-3x faster
  • Pruning: 50-90% sparsity possible
  • Distillation: Smaller model, similar accuracy
  • ONNX/TFLite: Optimized runtime
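
For the INT8 route, ONNX Runtime's post-training dynamic quantization is a one-call starting point. A sketch with placeholder file paths:

# One-call INT8 path via ONNX Runtime's post-training dynamic
# quantization; the .onnx file paths are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,   # weights stored as INT8
)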

Computer Vision Pipeline Design

Real-Time Video Processing

Camera Stream → Frame Extraction → Preprocessing → Model → Postprocessing → Output
     │              │                   │            │           │
   RTSP/         1-30 FPS           Resize,      Batch or    NMS, tracking,
   WebRTC                           normalize    single      annotation

Performance optimization:

  • Process every Nth frame (skip frames)
  • Resize to model input size early
  • Batch frames when latency allows
  • Use GPU preprocessing (NVIDIA DALI)
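
A sketch combining the first two optimizations with OpenCV; the stream URL and infer() call are placeholders:

# Frame-skipping sketch with OpenCV: decode, keep every Nth frame,
# resize early. STREAM_URL and infer() are placeholders.
import cv2

STREAM_URL = "rtsp://camera.local/stream"   # placeholder
FRAME_STRIDE = 5                            # process every 5th frame
MODEL_INPUT = (640, 640)                    # e.g. a YOLO-style input size

def infer(frame):
    raise NotImplementedError               # your detection model call

cap = cv2.VideoCapture(STREAM_URL)
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % FRAME_STRIDE:
        continue                            # skip to stay in the FPS budget
    detections = infer(cv2.resize(frame, MODEL_INPUT))
cap.release()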

Object Detection System

## Pipeline Components

1. **Input Processing**
   - Video decode: FFmpeg, OpenCV
   - Frame buffer: Ring buffer for temporal context
   - Preprocessing: NVIDIA DALI (GPU), OpenCV (CPU)

2. **Detection**
   - Model: YOLOv8 (speed), DETR (accuracy)
   - Batch size: 1-8 depending on latency requirements
   - Confidence threshold: 0.5-0.7 typical

3. **Post-processing**
   - NMS (Non-Maximum Suppression; see the sketch after this list)
   - Tracking: SORT, DeepSORT, ByteTrack
   - Smoothing: Kalman filter for stable boxes

4. **Output**
   - Annotations: Bounding boxes, labels, confidence
   - Events: Trigger on detection (webhook, queue)
   - Storage: Frame + metadata to S3/DB
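
The NMS step in stage 3 is compact enough to sketch directly. A standard greedy implementation in NumPy, with boxes as [x1, y1, x2, y2] rows:

# Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
# beyond the IoU threshold, repeat. Not tied to any specific library.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list[int]:
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop overlapping boxes
    return keep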

LLM Integration Patterns

RAG (Retrieval-Augmented Generation)

User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
                              │
                         Vector DB
                       (Pinecone, Weaviate,
                        Chroma, pgvector)
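
A minimal end-to-end sketch of this flow, using sentence-transformers for embeddings and a brute-force cosine search standing in for the vector DB; the model name, sample documents, and call_llm() are illustrative assumptions:

# RAG sketch: embed documents, retrieve nearest neighbors, prompt an LLM.
# Brute-force cosine search replaces the vector DB for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Invoices are due in 30 days.", "Refunds take 5-7 business days."]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                   # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def call_llm(prompt: str) -> str:
    raise NotImplementedError               # your LLM API call

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    return call_llm(f"Answer using only this context:\n{context}\n\nQ: {query}")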

Vector DB Selection:

| Database | Best For | Limitations |
|---|---|---|
| Pinecone | Managed, scale | Cost at scale |
| Weaviate | Self-hosted, features | Operational overhead |
| Chroma | Simple, local dev | Not for production scale |
| pgvector | PostgreSQL users | Performance at >1M vectors |
| Qdrant | Performance | Newer, smaller community |

LLM Serving Architecture

┌──────────────────────────────────────────────────────────────┐
│                    API GATEWAY                               │
│  Rate limiting, auth, request routing                        │
└──────────────────────────────────────────────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
              ▼             ▼             ▼
         ┌────────┐   ┌────────┐   ┌────────┐
         │ GPT-4  │   │ Claude │   │ Local  │
         │  API   │   │  API   │   │ Llama  │
         └────────┘   └────────┘   └────────┘
                            │
                    Model Router
              (cost/latency/capability)

Multi-model strategy:

  • Simple queries โ†’ Cheaper model (GPT-3.5, Haiku)
  • Complex reasoning โ†’ Expensive model (GPT-4, Opus)
  • Sensitive data โ†’ Self-hosted (Llama, Mistral)
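
A deliberately naive router implementing this strategy; the complexity heuristic and model names are placeholders for real routing logic:

# Naive model router: sensitive data stays self-hosted, complex queries
# go to the expensive tier, everything else to the cheap tier.
def route_model(query: str, contains_pii: bool) -> str:
    if contains_pii:
        return "self-hosted-llama"   # data never leaves your infra
    needs_reasoning = len(query) > 500 or "step by step" in query.lower()
    return "gpt-4" if needs_reasoning else "gpt-3.5-turbo"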

Performance Optimization

GPU Memory Optimization

| Technique | Memory Reduction | Speed Impact |
|---|---|---|
| FP16 (Half Precision) | 50% | Neutral to faster |
| INT8 Quantization | 75% | 10-20% slower |
| INT4 Quantization | 87.5% | 20-40% slower |
| Gradient Checkpointing | 60-80% | 20-30% slower |
| Model Sharding | Distributed | Communication overhead |
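
The FP16 row is often a one-line change in PyTorch. A sketch with a stand-in module (requires a CUDA GPU):

# FP16 inference sketch: cast weights to half precision, feed FP16 inputs.
# The Linear layer is a stand-in for a real model.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
model.half()                                   # cast weights to FP16
x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)
with torch.inference_mode():
    y = model(x)                               # FP16 matmul on tensor cores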

Batching Strategies

# Dynamic batching: collect requests, flush when full or after a timeout
import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue = []  # (request, future) pairs awaiting a batch run

    async def add_request(self, request):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))

        if len(self.queue) >= self.max_batch:
            await self.process_batch()            # batch full: flush now
        else:
            await asyncio.sleep(self.max_wait)    # wait for more requests
            if not future.done():
                await self.process_batch()        # timeout flush

        return await future  # resolves with this request's own result

    async def process_batch(self):
        batch, self.queue = self.queue[:self.max_batch], self.queue[self.max_batch:]
        if not batch:
            return
        requests = [req for req, _ in batch]
        results = await self.model.predict_batch(requests)
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
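
Usage sketch, assuming a model object whose async predict_batch(list) returns results in input order:

batcher = DynamicBatcher(model, max_batch=32, max_wait_ms=50)
result = await batcher.add_request(payload)  # inside an async handler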

Model Monitoring

Key Metrics to Track

| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Latency (P95) | Response time | > 2x baseline |
| Throughput | Requests/second | < 80% capacity |
| Error Rate | Failed predictions | > 1% |
| Model Drift | Distribution shift | PSI > 0.2 |
| Data Quality | Input anomalies | > 5% anomalies |

Drift Detection

Training Distribution ───┐
                         ├──► Statistical Test ──► Alert
Production Distribution ─┘
                          (PSI, KS test, JS divergence)

Population Stability Index (PSI):

  • PSI < 0.1: No significant change
  • 0.1 < PSI < 0.2: Moderate change, monitor
  • PSI > 0.2: Significant change, investigate
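
PSI is straightforward to compute directly. A NumPy sketch that bins on the training distribution (10 bins is a common default, not a rule):

# PSI sketch: bin edges come from the training (expected) distribution,
# then production (actual) bin frequencies are compared against it.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch out-of-range values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)         # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))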

Quick Reference Tables

Model Selection by Use Case

| Use Case | Recommended Model | Latency | Cost |
|---|---|---|---|
| Text Classification | DistilBERT | 10ms | Low |
| Text Generation | GPT-4 / Claude | 1-5s | Medium |
| Image Classification | EfficientNet-B0 | 5ms | Low |
| Object Detection | YOLOv8-n | 10ms | Low |
| Object Detection (Accurate) | YOLOv8-x | 50ms | Medium |
| Semantic Segmentation | SAM | 100ms | Medium |
| Speech-to-Text | Whisper-base | Real-time | Low |
| Embeddings | text-embedding-ada-002 | 50ms | Low |

Infrastructure Sizing

| Scale | GPU | Model Size | Throughput |
|---|---|---|---|
| Development | T4 (16GB) | < 7B params | 10-50 req/s |
| Production Small | A10G (24GB) | < 13B params | 50-100 req/s |
| Production Medium | A100 (40GB) | < 70B params | 100-500 req/s |
| Production Large | A100 (80GB) x 2+ | > 70B params | 500+ req/s |
