ml-inference-optimization
ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.
allowed_tools: Read, Glob, Grep
$ Install
git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/ml-inference-optimization ~/.claude/skills/claude-code-plugins/
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: ml-inference-optimization
description: ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.
allowed-tools: Read, Glob, Grep
ML Inference Optimization
When to Use This Skill
Use this skill when:
- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs
Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration
Inference Optimization Overview
Inference Optimization Stack

  Model Level
    Distillation | Pruning | Quantization | Architecture search
        ↓
  Compiler Level
    Graph optimization | Operator fusion | Memory planning
        ↓
  Runtime Level
    Batching | Caching | Async execution | Multi-threading
        ↓
  Hardware Level
    GPU | TPU | NPU | CPU SIMD | Custom accelerators
Model Compression Techniques
Technique Overview
| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|---|---|---|---|
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |
Knowledge Distillation
Knowledge Distillation

  Teacher Model (large, accurate, slow), e.g. GPT-4
        ↓  soft labels (probability distributions)
  Training Process
    Loss = α × CrossEntropy(student, hard_labels)
         + (1 − α) × KL_Div(student, teacher_soft_labels)
        ↓
  Student Model (small, nearly as accurate, fast), e.g. DistilBERT
Distillation Types:
| Type | Description | Use Case |
|---|---|---|
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
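A minimal PyTorch sketch of the distillation loss above, combining a hard-label cross-entropy term with a temperature-softened KL term; the alpha and temperature defaults are illustrative, not prescribed by this skill.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, temperature=2.0):
    # Hard-label term: standard cross-entropy against ground truth.
    ce = F.cross_entropy(student_logits, hard_labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    # batchmean reduction plus T^2 scaling follows the common formulation.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1 - alpha) * kl
```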
Pruning Strategies
Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After: [0.0, 0.8, 0.0, 0.9, 0.0, 0.7] (50% sparse)
- Flexible, high sparsity possible
- Needs sparse-aware hardware/libraries
Structured Pruning (Channel/Layer-level):
Before: [ C1 | C2 | C3 | C4 ]
After:  [ C1 | C3 | C4 ]        (C2 removed entirely)
- Works with standard hardware
- Lower compression ratio
Pruning Decision Criteria:
| Method | Description | Effectiveness |
|---|---|---|
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
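A minimal sketch of unstructured magnitude (L1) pruning using PyTorch's built-in pruning utilities; the toy model and the 50% sparsity level are illustrative.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 50% smallest-magnitude weights (adds a weight mask).
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # Fold the mask into the weights so the sparsity is permanent.
        prune.remove(module, "weight")
```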
Quantization (Detailed)
Precision Hierarchy:
FP32 (32 bits):  ████████████████████████████████
FP16 (16 bits):  ████████████████
BF16 (16 bits):  ████████████████  (different mantissa/exponent split)
INT8 (8 bits):   ████████
INT4 (4 bits):   ████
Binary (1 bit):  █
Memory and Compute Scale Proportionally
Quantization Approaches:
| Approach | When Applied | Quality | Effort |
|---|---|---|---|
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training with calibration | Better | Medium |
| QAT | During training | Best | High |
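A minimal sketch of post-training dynamic quantization in PyTorch, which stores Linear weights as INT8 and quantizes activations on the fly at runtime; the toy model is illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # 8-bit integer weights
)
# `quantized` is a drop-in replacement; validate accuracy before deploying.
```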
Compiler-Level Optimization
Graph Optimization
Original Graph:
  Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output
Optimized Graph (Operator Fusion):
  Input → FusedConvBNReLU → FusedConvBNReLU → Output
Benefits:
- Fewer kernel launches
- Better memory locality
- Reduced memory bandwidth
Common Optimizations
| Optimization | Description | Speedup |
|---|---|---|
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |
Optimization Frameworks
| Framework | Vendor | Best For |
|---|---|---|
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |
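As a concrete example of turning on compiler-level optimizations, a minimal ONNX Runtime sketch; "model.onnx" is a placeholder path.

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph optimizations: operator fusion, constant folding, layout changes.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession("model.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
# outputs = session.run(None, {"input": input_array})
```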
Runtime Optimization
Batching Strategies
No Batching:
  Request 1: [Process] → Response 1    10 ms
  Request 2: [Process] → Response 2    10 ms
  Request 3: [Process] → Response 3    10 ms
  Total: 30 ms, GPU underutilized
Dynamic Batching:
  Requests 1-3: [Wait 5 ms] → [Process batch] → Responses
  Total: 15 ms, 2x throughput
Trade-off: Latency vs. Throughput
- Larger batches: higher throughput, higher latency
- Smaller batches: lower latency, lower throughput
Batching Parameters:
| Parameter | Description | Trade-off |
|---|---|---|
| batch_size | Maximum batch size | Throughput vs. latency |
| max_wait_time | Wait time for batch fill | Latency vs. efficiency |
| min_batch_size | Minimum before processing | Latency predictability |
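A minimal sketch of a dynamic batcher built on asyncio; `run_model` is a hypothetical function that takes a list of inputs and returns a list of outputs, and the defaults mirror the parameters above.

```python
import asyncio

class DynamicBatcher:
    def __init__(self, run_model, batch_size=8, max_wait_time=0.005):
        self.run_model = run_model
        self.batch_size = batch_size
        self.max_wait_time = max_wait_time
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, x):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def _loop(self):
        # Start once in the background: asyncio.create_task(batcher._loop())
        while True:
            items = [await self.queue.get()]                 # block for first request
            deadline = asyncio.get_running_loop().time() + self.max_wait_time
            while len(items) < self.batch_size:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.run_model([x for x, _ in items])  # one batched call
            for (_, fut), out in zip(items, outputs):
                fut.set_result(out)
```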
Caching Strategies
Inference Caching Layers

  Layer 1: Input Cache
    Cache exact inputs → return cached outputs
    Hit rate: low (inputs rarely repeat exactly)

  Layer 2: Embedding Cache
    Cache computed embeddings for repeated tokens/entities
    Hit rate: medium (common tokens repeat)

  Layer 3: KV Cache (for transformers)
    Cache key-value pairs for attention
    Hit rate: high (reused across tokens in a sequence)

  Layer 4: Result Cache
    Cache semantic equivalents (fuzzy matching)
    Hit rate: variable (depends on query distribution)
Semantic Caching for LLMs:
Query: "What's the capital of France?"
β
Hash + Embed query
β
Search cache (similarity > threshold)
β
βββ Hit: Return cached response
βββ Miss: Generate β Cache β Return
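A minimal sketch of such a semantic cache; `embed` and `generate` are hypothetical callables, and the 0.92 similarity threshold is illustrative.

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, generate, threshold=0.92):
        self.embed, self.generate, self.threshold = embed, generate, threshold
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def query(self, text: str) -> str:
        q = self.embed(text)
        q = q / np.linalg.norm(q)
        if self.keys:
            sims = np.stack(self.keys) @ q      # cosine similarity (keys are unit-norm)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.values[best]        # cache hit
        response = self.generate(text)          # cache miss: generate and store
        self.keys.append(q)
        self.values.append(response)
        return response
```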
Async and Parallel Execution
Sequential:
  [Prep 10 ms] → [Model 15 ms] → [Post 5 ms]    Total: 30 ms
Pipelined:
  Request 1: [Prep][Model][Post]
  Request 2:       [Prep][Model][Post]
  Request 3:             [Prep][Model][Post]
Throughput: up to 3x higher
Latency per request: unchanged
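A minimal sketch of stage-level pipelining with asyncio, where each stage runs as its own task so preprocessing of one request overlaps model execution of the previous one; `preprocess`, `run_model`, and `postprocess` are hypothetical callables.

```python
import asyncio

async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue):
    # Each stage pulls from its inbox, runs the (blocking) function in a
    # worker thread, and pushes the result downstream.
    while True:
        item = await inbox.get()
        outbox.put_nowait(await asyncio.to_thread(fn, item))

async def pipeline(requests, preprocess, run_model, postprocess):
    q_in, q_mid, q_out, results = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage(preprocess, q_in, q_mid)),
        asyncio.create_task(stage(run_model, q_mid, q_out)),
        asyncio.create_task(stage(postprocess, q_out, results)),
    ]
    for r in requests:
        q_in.put_nowait(r)
    outputs = [await results.get() for _ in requests]
    for t in tasks:
        t.cancel()
    return outputs
```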
Hardware Acceleration
Hardware Comparison
| Hardware | Strengths | Limitations | Best For |
|---|---|---|---|
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited models | Mobile, edge |
| CPU | Flexible, available | Slower for ML | Low-batch, CPU-bound |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |
GPU Optimization
| Optimization | Description | Impact |
|---|---|---|
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |
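As a sketch of the Tensor Core row above: running inference under autocast lets eligible PyTorch ops execute in FP16 on supported NVIDIA GPUs. The toy model and batch are illustrative, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                      nn.Linear(1024, 256)).cuda().eval()
inputs = torch.randn(64, 1024, device="cuda")

# Autocast runs eligible ops in FP16 so matrix multiplies can map onto Tensor Cores.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)
```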
Edge Deployment
Edge Constraints
Edge Deployment Constraints

  Resource constraints:
  - Memory: 1-4 GB (vs. 64+ GB in the cloud)
  - Compute: 1-10 TOPS (vs. 100+ TFLOPS in the cloud)
  - Power: 5-15 W (vs. 300+ W in the cloud)
  - Storage: 16-128 GB (vs. terabytes in the cloud)

  Operational constraints:
  - No network (offline operation)
  - Variable ambient conditions
  - Infrequent updates
  - Long deployment lifetime
Edge Optimization Strategies
| Strategy | Description | Use When |
|---|---|---|
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Model caching | Cache results locally | Repeated queries |
Edge ML Frameworks
| Framework | Platform | Features |
|---|---|---|
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |
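A minimal sketch of edge-oriented conversion with TensorFlow Lite's post-training optimization; the saved-model path and output filename are placeholders.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```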
Latency Profiling
Profiling Methodology
Latency Breakdown Analysis (example)

  1. Data loading:              15%
  2. Preprocessing:             10%
  3. Model inference:           60%
  4. Postprocessing:             8%
  5. Response serialization:     7%

  Target: model inference (60% of total = biggest optimization opportunity)
Profiling Tools
| Tool | Use For |
|---|---|
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |
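A minimal PyTorch Profiler sketch that produces a per-operator breakdown similar to the analysis above; the toy model and input are illustrative.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
x = torch.randn(32, 512)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.inference_mode():
        model(x)

# Per-operator table, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```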
Key Metrics
| Metric | Description | Target |
|---|---|---|
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meet demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Memory usage | < limit |
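A quick way to compute the latency metrics above from raw measurements, shown as a minimal NumPy sketch; the latency samples are illustrative.

```python
import numpy as np

latencies_ms = np.array([12.1, 9.8, 11.3, 48.0, 10.5, 13.2])  # illustrative samples
p50 = np.percentile(latencies_ms, 50)
p99 = np.percentile(latencies_ms, 99)
# Throughput if requests were processed serially; measure wall-clock time in practice.
throughput = len(latencies_ms) / (latencies_ms.sum() / 1000)
print(f"P50={p50:.1f} ms  P99={p99:.1f} ms  throughput≈{throughput:.1f} req/s")
```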
Optimization Workflow
Systematic Approach
Optimization Workflow

  1. Baseline
     Measure current performance (latency, throughput, accuracy)

  2. Profile
     Identify bottlenecks (model, data, system)

  3. Optimize (in order of effort/impact):
     - Hardware: use the right accelerator
     - Compiler: enable optimizations (TensorRT, ONNX Runtime)
     - Runtime: batching, caching, async execution
     - Model: quantization, pruning
     - Architecture: distillation, model change

  4. Validate
     Verify accuracy is maintained and latency improved

  5. Deploy and Monitor
     Track real-world performance
Optimization Priority Matrix
                          High Impact
                               |
       Compiler opts ----------+---------- Quantization
       (easy win)              |           (best ROI)
                               |
   Low Effort -----------------+----------------- High Effort
                               |
       Batching ---------------+---------- Distillation
       (quick win)             |           (major effort)
                               |
                           Low Impact
Common Patterns
Multi-Model Serving
  Request → Router → Tiny model   (<10 ms)
                   → Small model  (<50 ms)
                   → Large model  (<500 ms)

  Routing strategies:
  - Complexity-based: Simple → Tiny, Complex → Large
  - Confidence-based: Try Tiny, escalate if confidence is low
  - SLA-based: Route based on latency requirements
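A minimal sketch of the confidence-based routing strategy; the model objects, their `predict()` interface, and the 0.8 threshold are hypothetical.

```python
def route(request, tiny_model, large_model, confidence_threshold=0.8):
    prediction, confidence = tiny_model.predict(request)  # fast path
    if confidence >= confidence_threshold:
        return prediction
    prediction, _ = large_model.predict(request)          # escalate to the large model
    return prediction
```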
Speculative Execution
Query: "Translate: Hello"
β
ββββΆ Small model (draft): "Bonjour" (5ms)
β
ββββΆ Large model (verify): Check "Bonjour" (10ms parallel)
β
βββ Accept: Return immediately
βββ Reject: Generate with large model
Speedup: 2-3x when drafts are often accepted
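A minimal sketch of the draft-then-verify flow; `draft_model`, `large_model`, and `verify` are hypothetical callables rather than a specific library API.

```python
def speculative_infer(query, draft_model, large_model, verify):
    draft = draft_model.generate(query)        # cheap draft (e.g. "Bonjour")
    if verify(large_model, query, draft):      # large model checks the draft
        return draft                           # accept: return immediately
    return large_model.generate(query)         # reject: fall back to the large model
```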
Cascade Models
Input → [ Filter ]   ← cheap filter (reject obvious negatives)
            ↓  (candidates only)
        [ Stage 1 ]  ← fast model (coarse ranking)
            ↓  (top 100)
        [ Stage 2 ]  ← accurate model (fine ranking)
            ↓  (top 10)
          Output

Benefit: 10x cheaper, similar accuracy
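A minimal sketch of the cascade as a ranking pipeline; the model objects, their `score()` interface, and the top-100/top-10 cutoffs are illustrative.

```python
def cascade_rank(query, items, cheap_filter, fast_model, accurate_model):
    # Stage 0: cheap filter rejects obvious negatives.
    candidates = [x for x in items if cheap_filter(query, x)]
    # Stage 1: fast model does coarse ranking, keep the top 100.
    coarse = sorted(candidates, key=lambda x: fast_model.score(query, x),
                    reverse=True)[:100]
    # Stage 2: accurate model re-ranks the survivors, keep the top 10.
    return sorted(coarse, key=lambda x: accurate_model.score(query, x),
                  reverse=True)[:10]
```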
Optimization Checklist
Pre-Deployment
- Profile baseline performance
- Identify primary bottleneck (model, data, system)
- Apply compiler optimizations (TensorRT, ONNX)
- Evaluate quantization (INT8 usually safe)
- Tune batch size for target throughput
- Test accuracy after optimization
Deployment
- Configure appropriate hardware
- Enable caching where applicable
- Set up monitoring (latency, throughput, errors)
- Configure auto-scaling policies
- Implement graceful degradation
Post-Deployment
- Monitor p99 latency
- Track accuracy metrics
- Analyze cache hit rates
- Review cost efficiency
- Plan iterative improvements
Related Skills
- llm-serving-patterns - LLM-specific serving optimization
- ml-system-design - End-to-end ML pipeline design
- quality-attributes-taxonomy - Performance as a quality attribute
- estimation-techniques - Capacity planning for ML systems
Version History
- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns
Last Updated
Date: 2025-12-26