
ml-inference-optimization

ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.

allowed_tools: Read, Glob, Grep

$ Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/ml-inference-optimization ~/.claude/skills/claude-code-plugins

// tip: Run this command in your terminal to install the skill


---
name: ml-inference-optimization
description: ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.
allowed-tools: Read, Glob, Grep
---

ML Inference Optimization

When to Use This Skill

Use this skill when:

  • Optimizing ML inference latency
  • Reducing model size for deployment
  • Implementing model compression techniques
  • Designing inference caching strategies
  • Deploying models at the edge
  • Balancing accuracy vs. latency trade-offs

Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration

Inference Optimization Overview

┌─────────────────────────────────────────────────────────────────────┐
│                 Inference Optimization Stack                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Model Level                                │  │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                   Compiler Level                              │  │
│  │  Graph optimization │ Operator fusion │ Memory planning       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Runtime Level                                │  │
│  │  Batching │ Caching │ Async execution │ Multi-threading      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Hardware Level                               │  │
│  │  GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators            │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Model Compression Techniques

Technique Overview

Technique              | Size Reduction | Speed Improvement | Accuracy Impact
Quantization           | 2-4x           | 2-4x              | Low (1-2%)
Pruning                | 2-10x          | 1-3x              | Low-Medium
Distillation           | 3-10x          | 3-10x             | Medium
Low-rank factorization | 2-5x           | 1.5-3x            | Low-Medium
Weight sharing         | 10-100x        | Variable          | Medium-High

Knowledge Distillation

┌─────────────────────────────────────────────────────────────────────┐
│                    Knowledge Distillation                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐                                                   │
│  │ Teacher Model│ (Large, accurate, slow)                          │
│  │   GPT-4      │                                                   │
│  └──────────────┘                                                   │
│         │                                                           │
│         ▼ Soft labels (probability distributions)                   │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Training Process                           │  │
│  │  Loss = α × CrossEntropy(student, hard_labels)               │  │
│  │       + (1-α) × KL_Div(student, teacher_soft_labels)         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │Student Model │ (Small, nearly as accurate, fast)                │
│  │  DistilBERT  │                                                   │
│  └──────────────┘                                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
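
A minimal PyTorch sketch of the combined loss shown above, assuming `student_logits`, `teacher_logits`, and `hard_labels` come from your own training loop. The temperature `T` (not shown in the diagram) softens both distributions, and the KL term is conventionally scaled by T².

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, alpha=0.5, T=2.0):
    """Combine hard-label cross-entropy with soft-label KL divergence."""
    # Standard supervised loss against ground-truth labels
    ce = F.cross_entropy(student_logits, hard_labels)
    # Soft-label loss: student matches the teacher's tempered distribution
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 scaling keeps gradient magnitudes comparable across temperatures
    return alpha * ce + (1 - alpha) * kl
```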

Distillation Types:

Type                  | Description                | Use Case
Response distillation | Match teacher outputs      | General compression
Feature distillation  | Match intermediate layers  | Better transfer
Relation distillation | Match sample relationships | Structured data
Self-distillation     | Model teaches itself       | Regularization

Pruning Strategies

Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After:  [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]  (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries

Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
        │ C1│ C2│ C3│ C4│
        └───┴───┴───┴───┘
After:  ┌───┬───┬───┐
        │ C1│ C3│ C4│  (Removed C2 entirely)
        └───┴───┴───┘
• Works with standard hardware
• Lower compression ratio

Pruning Decision Criteria:

Method          | Description                 | Effectiveness
Magnitude-based | Remove smallest weights     | Simple, effective
Gradient-based  | Remove low-gradient weights | Better accuracy
Second-order    | Use Hessian information     | Best but expensive
Lottery ticket  | Find winning subnetwork     | Theoretical insight
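
As an illustration of magnitude-based (unstructured) pruning, a sketch using PyTorch's built-in pruning utilities on a hypothetical two-layer network; `amount=0.5` zeroes the 50% smallest-magnitude weights in each Linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical model standing in for the network you want to compress
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

# L1 (magnitude) unstructured pruning: zero the 50% smallest weights per Linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

zeros = sum((m.weight == 0).sum().item() for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
print(f"Global sparsity: {zeros / total:.0%}")
```

Note that zeroed weights only translate into speedups on sparse-aware hardware or kernels; on dense hardware the gain is mainly compressibility.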

Quantization (Detailed)

Precision Hierarchy:

FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████  (different mantissa/exponent)
INT8 (8 bits):  ████████
INT4 (4 bits):  ████
Binary (1 bit): █

Memory scales directly with bit width; compute speedups depend on the hardware supporting the lower precision natively.

Quantization Approaches:

Approach                          | When Applied                    | Quality | Effort
Dynamic quantization              | Runtime                         | Good    | Low
Static quantization               | Post-training, with calibration | Better  | Medium
Quantization-aware training (QAT) | During training                 | Best    | High
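
Dynamic quantization is the lowest-effort entry point. Below is a PyTorch sketch that quantizes the Linear layers of a hypothetical model to INT8 at load time; verify the API against your PyTorch version (newer releases also expose it under `torch.ao.quantization`).

```python
import torch
import torch.nn as nn

# Hypothetical FP32 model; in practice this is your trained network
model_fp32 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization: weights stored as INT8, activations quantized on the fly
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},       # layer types to quantize
    dtype=torch.qint8,
)

# The inference API is unchanged; only size and speed differ
x = torch.randn(1, 512)
with torch.no_grad():
    y = model_int8(x)
print(y.shape)
```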

Compiler-Level Optimization

Graph Optimization

Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output

Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output

Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth

Common Optimizations

Optimization          | Description                    | Speedup
Operator fusion       | Combine sequential ops         | 1.2-2x
Constant folding      | Pre-compute constants          | 1.1-1.5x
Dead code elimination | Remove unused ops              | Variable
Layout optimization   | Optimize tensor memory layout  | 1.1-1.3x
Memory planning       | Optimize buffer allocation     | 1.1-1.2x

Optimization Frameworks

Framework    | Vendor      | Best For
TensorRT     | NVIDIA      | NVIDIA GPUs, lowest latency
ONNX Runtime | Microsoft   | Cross-platform, broad support
OpenVINO     | Intel       | Intel CPUs/GPUs
Core ML      | Apple       | Apple devices
TFLite       | Google      | Mobile, embedded
Apache TVM   | Open source | Custom hardware, research
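
As one concrete example of turning on compiler-level optimizations, a sketch that loads a placeholder `model.onnx` with ONNX Runtime's highest graph-optimization level, which covers operator fusion, constant folding, and layout changes.

```python
import numpy as np
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Enable all graph optimizations: constant folding, node fusion, layout optimization
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "model.onnx" is a placeholder path; swap providers for GPU hosts
# (e.g. CUDAExecutionProvider or TensorrtExecutionProvider)
session = ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # adjust to your model's input shape
outputs = session.run(None, {input_name: dummy})
```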

Runtime Optimization

Batching Strategies

No Batching:
Request 1: [Process] → Response 1      10ms
Request 2: [Process] → Response 2      10ms
Request 3: [Process] → Response 3      10ms
Total: 30ms, GPU underutilized

Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput

Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput

Batching Parameters:

Parameter      | Description                  | Trade-off
batch_size     | Maximum batch size           | Throughput vs. latency
max_wait_time  | Wait time for batch to fill  | Latency vs. efficiency
min_batch_size | Minimum before processing    | Latency predictability
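
A minimal asyncio sketch of dynamic batching: requests are queued, and a worker flushes a batch either when `BATCH_SIZE` is reached or `MAX_WAIT_S` expires. `run_model` is a hypothetical stand-in for your batched inference call.

```python
import asyncio

BATCH_SIZE = 8        # maximum batch size
MAX_WAIT_S = 0.005    # max_wait_time: 5 ms for the batch to fill

async def run_model(batch_inputs):
    """Hypothetical batched inference call; replace with your model."""
    await asyncio.sleep(0.010)  # stands in for one forward pass over the whole batch
    return [f"result:{x}" for x in batch_inputs]

async def batch_worker(queue: asyncio.Queue):
    loop = asyncio.get_running_loop()
    while True:
        # Block for the first request, then collect more until size or time limit
        batch = [await queue.get()]
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = await run_model([x for x, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

async def infer(queue: asyncio.Queue, x):
    future = asyncio.get_running_loop().create_future()
    await queue.put((x, future))
    return await future

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batch_worker(queue))
    print(await asyncio.gather(*(infer(queue, i) for i in range(20))))

asyncio.run(main())
```

Production servers (e.g. Triton, TorchServe) implement this loop for you; the sketch only shows how the two knobs interact.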

Caching Strategies

┌─────────────────────────────────────────────────────────────────────┐
│                    Inference Caching Layers                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache exact inputs → Return cached outputs                   │   │
│  │ Hit rate: Low (inputs rarely repeat exactly)                 │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache computed embeddings for repeated tokens/entities       │   │
│  │ Hit rate: Medium (common tokens repeat)                      │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache key-value pairs for attention                          │   │
│  │ Hit rate: High (reuse across tokens in sequence)             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache semantic equivalents (fuzzy matching)                  │   │
│  │ Hit rate: Variable (depends on query distribution)           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Semantic Caching for LLMs:

Query: "What's the capital of France?"
       ↓
Hash + Embed query
       ↓
Search cache (similarity > threshold)
       ↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
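
A sketch of the lookup flow above using cosine similarity over embeddings. `embed` and `generate` are hypothetical stand-ins for your embedding model and LLM call, and `SIM_THRESHOLD` is an assumed value to tune against your query distribution.

```python
import numpy as np

SIM_THRESHOLD = 0.92  # assumed; tune for your query distribution

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding model; replace with a real sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def generate(query: str) -> str:
    """Hypothetical LLM call; replace with your serving endpoint."""
    return f"(generated answer for: {query})"

class SemanticCache:
    def __init__(self):
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, query_emb: np.ndarray) -> str | None:
        if not self.embeddings:
            return None
        mat = np.stack(self.embeddings)
        sims = mat @ query_emb / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query_emb))
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= SIM_THRESHOLD else None

    def store(self, query_emb: np.ndarray, response: str) -> None:
        self.embeddings.append(query_emb)
        self.responses.append(response)

def answer(query: str, cache: SemanticCache) -> str:
    emb = cache_key = embed(query)
    cached = cache.lookup(cache_key)
    if cached is not None:        # hit: skip the expensive model entirely
        return cached
    response = generate(query)    # miss: generate, cache, return
    cache.store(emb, response)
    return response
```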

Async and Parallel Execution

Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │  Total: 30ms
│10ms │ │15ms │ │5ms  │
└─────┘ └─────┘ └─────┘

Pipelined:
Request 1: │Prep│Model│Post│
Request 2:      │Prep│Model│Post│
Request 3:           │Prep│Model│Post│

Throughput: 3x higher
Latency per request: Same
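
A sketch of stage pipelining with asyncio queues: while the model handles one request, preprocessing for the next proceeds concurrently. The stage bodies are hypothetical placeholders using the timings from the diagram.

```python
import asyncio

async def preprocess(x):
    await asyncio.sleep(0.010)   # 10 ms, stands in for CPU-side preprocessing
    return x

async def model(x):
    await asyncio.sleep(0.015)   # 15 ms, stands in for the forward pass
    return x

async def postprocess(x):
    await asyncio.sleep(0.005)   # 5 ms
    return x

async def stage(fn, in_q, out_q):
    # Each stage consumes from its input queue and feeds the next stage
    while True:
        out_q.put_nowait(await fn(await in_q.get()))

async def main():
    q1, q2, q3, done = (asyncio.Queue() for _ in range(4))
    for fn, a, b in ((preprocess, q1, q2), (model, q2, q3), (postprocess, q3, done)):
        asyncio.create_task(stage(fn, a, b))
    start = asyncio.get_running_loop().time()
    for i in range(3):
        q1.put_nowait(i)
    results = [await done.get() for _ in range(3)]
    elapsed = asyncio.get_running_loop().time() - start
    print(results, f"{elapsed:.3f}s total (vs. ~0.090s sequential)")

asyncio.run(main())
```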

Hardware Acceleration

Hardware Comparison

Hardware             | Strengths                          | Limitations            | Best For
GPU (NVIDIA)         | High parallelism, mature ecosystem | Power, cost            | Training, large-batch inference
TPU (Google)         | Matrix ops, cloud integration      | Vendor lock-in         | Google Cloud workloads
NPU (Apple/Qualcomm) | Power efficient, on-device         | Limited models         | Mobile, edge
CPU                  | Flexible, available                | Slower for ML          | Low-batch, CPU-bound
FPGA                 | Customizable, low latency          | Development complexity | Specialized workloads

GPU Optimization

Optimization   | Description                     | Impact
Tensor Cores   | Use FP16/INT8 tensor operations | 2-8x speedup
CUDA graphs    | Reduce kernel launch overhead   | 1.5-2x for small models
Multi-stream   | Parallel execution              | Higher throughput
Memory pooling | Reduce allocation overhead      | Lower latency variance
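
A sketch of mixed-precision inference in PyTorch, one common way to exercise Tensor Cores; the model is a hypothetical placeholder and the CUDA path only runs when a GPU is present.

```python
import torch
import torch.nn as nn

# Hypothetical model; Tensor Core gains show up on matmul-heavy networks
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()
x = torch.randn(32, 1024)

if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    # Autocast runs matmul-heavy ops in FP16 (Tensor Cores) and keeps sensitive ops in FP32
    with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
        y = model(x)
else:
    with torch.no_grad():
        y = model(x)
print(y.shape)
```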

Edge Deployment

Edge Constraints

┌─────────────────────────────────────────────────────────────────────┐
│                      Edge Deployment Constraints                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory: 1-4 GB (vs. 64+ GB cloud)                             │
│  ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud)                    │
│  ├── Power: 5-15W (vs. 300W+ cloud)                                │
│  └── Storage: 16-128 GB (vs. TB cloud)                             │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                            │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Edge Optimization Strategies

Strategy                | Description                                       | Use When
Model selection         | Use edge-native models (MobileNet, EfficientNet)  | Accuracy acceptable
Aggressive quantization | INT8 or lower                                     | Memory/power constrained
On-device distillation  | Distill to tiny model                             | Extreme constraints
Split inference         | Edge preprocessing, cloud inference               | Network available
Model caching           | Cache results locally                             | Repeated queries

Edge ML Frameworks

Framework           | Platform               | Features
TensorFlow Lite     | Android, iOS, embedded | Quantization, delegates
Core ML             | iOS, macOS             | Neural Engine optimization
ONNX Runtime Mobile | Cross-platform         | Broad model support
PyTorch Mobile      | Android, iOS           | Familiar API
TensorRT            | NVIDIA Jetson          | Maximum performance
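
A TensorFlow Lite conversion sketch with full-integer post-training quantization, the kind of aggressive quantization edge targets usually need; `saved_model_dir` and the calibration generator are placeholders for your exported model and a small representative dataset.

```python
import numpy as np
import tensorflow as tf

saved_model_dir = "path/to/saved_model"   # placeholder: your exported SavedModel

def representative_data():
    # Placeholder calibration data; use ~100 real input samples in practice
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Force full-integer quantization for weights and activations
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```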

Latency Profiling

Profiling Methodology

┌─────────────────────────────────────────────────────────────────────┐
│                    Latency Breakdown Analysis                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:          ████████░░░░░░░░░░  15%                 │
│  2. Preprocessing:         ██████░░░░░░░░░░░░  10%                 │
│  3. Model Inference:       ████████████████░░  60%                 │
│  4. Postprocessing:        ████░░░░░░░░░░░░░░   8%                 │
│  5. Response Serialization:███░░░░░░░░░░░░░░░   7%                 │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Profiling Tools

Tool             | Use For
PyTorch Profiler | PyTorch model profiling
TensorBoard      | TensorFlow visualization
NVIDIA Nsight    | GPU profiling
Chrome Tracing   | General timeline visualization
perf             | CPU profiling
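
A PyTorch Profiler sketch that breaks one inference call down by operator; the model here is a hypothetical placeholder, and the CUDA activity is only added when a GPU is available.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()
x = torch.randn(32, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("inference"):
        with torch.no_grad():
            model(x)

# Sort by self CPU time to surface the heaviest operators
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```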

Key Metrics

Metric           | Description     | Target
P50 latency      | Median latency  | < SLA
P99 latency      | Tail latency    | < 2x P50
Throughput       | Requests/second | Meet demand
GPU utilization  | Compute usage   | > 80%
Memory bandwidth | Memory usage    | < limit

Optimization Workflow

Systematic Approach

┌─────────────────────────────────────────────────────────────────────┐
│                  Optimization Workflow                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy) │
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use right accelerator                             │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX)            │
│     ├── Runtime: Batching, caching, async                          │
│     ├── Model: Quantization, pruning                                │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5. Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Optimization Priority Matrix

                    High Impact
                         │
    Compiler Opts    ────┼──── Quantization
    (easy win)           │     (best ROI)
                         │
Low Effort ──────────────┼──────────────── High Effort
                         │
    Batching         ────┼──── Distillation
    (quick win)          │     (major effort)
                         │
                    Low Impact

Common Patterns

Multi-Model Serving

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Request → ┌─────────┐                                              │
│            │ Router  │                                              │
│            └─────────┘                                              │
│               │   │   │                                             │
│      ┌────────┘   │   └────────┐                                    │
│      ▼            ▼            ▼                                    │
│  ┌───────┐   ┌───────┐   ┌───────┐                                 │
│  │ Tiny  │   │ Small │   │ Large │                                 │
│  │ <10ms │   │ <50ms │   │<500ms │                                 │
│  └───────┘   └───────┘   └───────┘                                 │
│                                                                     │
│  Routing strategies:                                                │
│  • Complexity-based: Simple→Tiny, Complex→Large                    │
│  • Confidence-based: Try Tiny, escalate if low confidence          │
│  • SLA-based: Route based on latency requirements                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
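
A sketch of confidence-based routing: try the tiny model first and escalate only when its top-class probability falls below a threshold. Both models and the threshold are hypothetical placeholders over the same label space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.85  # assumed; calibrate on a validation set

# Hypothetical tiny and large classifiers over the same 10 labels
tiny_model = nn.Sequential(nn.Linear(128, 10)).eval()
large_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

@torch.no_grad()
def route(x: torch.Tensor) -> torch.Tensor:
    probs = F.softmax(tiny_model(x), dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= CONFIDENCE_THRESHOLD:
        return prediction                     # fast path: tiny model is confident enough
    return large_model(x).argmax(dim=-1)      # escalate to the large model

print(route(torch.randn(1, 128)))
```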

Speculative Execution

Query: "Translate: Hello"
        │
        ├──▶ Small model (draft): "Bonjour" (5ms)
        │
        └──▶ Large model (verify): Check "Bonjour" (10ms parallel)
             │
             ├── Accept: Return immediately
             └── Reject: Generate with large model

Speedup: 2-3x when drafts are often accepted
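
A simplified, greedy sketch of the draft-and-verify loop. `draft_next` and `target_next` are hypothetical callables that return each model's greedy next token for a given prefix; real implementations verify all draft tokens in a single batched forward pass of the large model and handle sampling, not just greedy decoding.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k tokens with the draft model, keep the longest prefix the target agrees with."""
    # 1. Draft model proposes k tokens autoregressively (cheap)
    draft_tokens, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    # 2. Target model checks each position (in practice: one batched forward pass)
    accepted, ctx = [], list(prefix)
    for t in draft_tokens:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)   # replace the first mismatch with the target's token
            break
        accepted.append(t)
        ctx.append(t)
    return accepted                     # at least 1 new token per step, up to k when drafts match

# Toy usage with trivial stand-in models that both count upward
print(speculative_step([1, 2, 3], draft_next=lambda c: c[-1] + 1, target_next=lambda c: c[-1] + 1))
```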

Cascade Models

Input → ┌────────┐
        │ Filter │ ← Cheap filter (reject obvious negatives)
        └────────┘
             │ (candidates only)
             ▼
        ┌────────┐
        │ Stage 1│ ← Fast model (coarse ranking)
        └────────┘
             │ (top-100)
             ▼
        ┌────────┐
        │ Stage 2│ ← Accurate model (fine ranking)
        └────────┘
             │ (top-10)
             ▼
         Output

Benefit: 10x cheaper, similar accuracy
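
A sketch of the cascade above with hypothetical scoring functions: each stage narrows the candidate set so the expensive model only scores the survivors.

```python
def cascade_rank(items, cheap_filter, fast_score, accurate_score,
                 keep_stage1=100, keep_final=10):
    """Filter, coarse-rank, then fine-rank a shrinking candidate set."""
    # Stage 0: cheap filter rejects obvious negatives
    candidates = [x for x in items if cheap_filter(x)]
    # Stage 1: fast model produces a coarse ranking, keep the top-100
    candidates = sorted(candidates, key=fast_score, reverse=True)[:keep_stage1]
    # Stage 2: accurate (expensive) model re-ranks only the survivors, keep the top-10
    return sorted(candidates, key=accurate_score, reverse=True)[:keep_final]

# Toy usage with stand-in filter and scoring functions
results = cascade_rank(
    range(10_000),
    cheap_filter=lambda x: x % 2 == 0,
    fast_score=lambda x: x % 997,
    accurate_score=lambda x: -abs(x - 5_000),
)
print(results)
```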

Optimization Checklist

Pre-Deployment

  • Profile baseline performance
  • Identify primary bottleneck (model, data, system)
  • Apply compiler optimizations (TensorRT, ONNX)
  • Evaluate quantization (INT8 usually safe)
  • Tune batch size for target throughput
  • Test accuracy after optimization

Deployment

  • Configure appropriate hardware
  • Enable caching where applicable
  • Set up monitoring (latency, throughput, errors)
  • Configure auto-scaling policies
  • Implement graceful degradation

Post-Deployment

  • Monitor p99 latency
  • Track accuracy metrics
  • Analyze cache hit rates
  • Review cost efficiency
  • Plan iterative improvements

Related Skills

  • llm-serving-patterns - LLM-specific serving optimization
  • ml-system-design - End-to-end ML pipeline design
  • quality-attributes-taxonomy - Performance as quality attribute
  • estimation-techniques - Capacity planning for ML systems

Version History

  • v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns

Last Updated

Date: 2025-12-26

Repository

melodic-software/claude-code-plugins/plugins/systems-design/skills/ml-inference-optimization
Author: melodic-software