---
name: llm-serving-patterns
description: LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed-tools: Read, Glob, Grep
---
# LLM Serving Patterns

## When to Use This Skill

Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
## LLM Serving Architecture Overview

```
LLM Serving Stack
─────────────────
Clients (API, Chat UI, Agents)
        │
        ▼
Load Balancer / API Gateway
  • Rate limiting   • Authentication   • Request routing
        │
        ▼
Inference Server
  Request Queue ──▶ Batching Engine ──▶ KV Cache Management
                          │
                          ▼
  Model Execution Engine
    • Tensor operations   • Attention   • Token sampling
        │
        ▼
GPU/TPU Cluster
  • Model sharding   • Tensor parallelism   • Pipeline parallelism
```
## Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
|---|---|---|---|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
### Framework Selection Decision Tree

```
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No  → vLLM (general purpose)
```
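As a concrete starting point, here is a minimal vLLM offline-inference sketch; the model name is an illustrative assumption, and any Hugging Face causal LM that fits your GPU works the same way.

```python
# Minimal vLLM sketch; the model name below is illustrative, not prescriptive.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # weights pulled from the HF Hub
params = SamplingParams(temperature=0.7, max_tokens=128)  # per-request decoding settings

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For production serving, the same engine is typically exposed through vLLM's OpenAI-compatible HTTP server rather than this offline API.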
## Quantization Techniques

### Precision Levels

| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
### Quantization Methods

| Method | Description | Quality | Speed |
|---|---|---|---|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates difficulty to weights | Excellent | Moderate |
### Quantization Selection

Quality vs. efficiency trade-off:

```
Quality ◀──────────────────────────────────────────────▶ Efficiency

  FP32      FP16      INT8+AWQ   INT8+GPTQ   INT4      INT2
  Best      Great     Good       Good        Fair      Poor
```
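For a rough sense of how these options are applied in practice, the sketch below loads a model in INT4 with bitsandbytes via transformers; the model name and the NF4 settings are assumptions to validate against your own evaluation set.

```python
# Hedged sketch: loading a causal LM in INT4 with bitsandbytes (transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any HF causal LM

int4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 generally tracks FP16 quality better than plain INT4
    bnb_4bit_compute_dtype=torch.float16,   # compute in FP16 even though weights are 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=int4_config)
```

Swapping `load_in_4bit` for `load_in_8bit=True` gives the INT8 row of the table above with the same loading pattern.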
## Batching Strategies

### Static Batching

```
Request 1: [tokens: 100] ──┐
Request 2: [tokens: 50]  ──┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80]  ──┘
```

Problem: short requests wait for long ones (head-of-line blocking).
### Continuous Batching (Preferred)

```
Time ─────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████]          ──▶ Complete
Req 2: [████████████]  ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████]  ──▶ Complete ──▶ Req 5 starts [████████]
```

- New requests join the batch as others complete
- No padding waste
- Optimal GPU utilization
### Batching Parameters

| Parameter | Description | Trade-off |
|---|---|---|
| `max_batch_size` | Maximum concurrent requests | Memory vs. throughput |
| `max_waiting_tokens` | Tokens to wait before forcing a batch | Latency vs. throughput |
| `max_num_seqs` | Maximum sequences in a batch | Memory vs. concurrency |
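Parameter names vary by framework (for example, `max_waiting_tokens` is a TGI-style knob). A hedged sketch of the vLLM equivalents, with values meant only as starting points:

```python
# Sketch: batching knobs as vLLM engine arguments; exact defaults and semantics
# vary by vLLM version, so treat these values as assumptions to tune.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    max_num_seqs=256,              # max sequences resident in a batch (memory vs. concurrency)
    max_num_batched_tokens=8192,   # token budget per scheduler step (latency vs. throughput)
    gpu_memory_utilization=0.90,   # fraction of GPU memory reserved for weights + KV cache
)
```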
## KV Cache Management

### The KV Cache Problem

Attention computes softmax(Q·Kᵀ/√d)·V. For each new token generated:

- The model must attend to the keys (K) and values (V) of ALL previous tokens
- The K and V tensors grow with sequence length
- Memory scales as O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, 4K context):

- KV cache per request: ~8 GB
- 10 concurrent requests: ~80 GB of GPU memory
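A back-of-envelope sizing helper makes the memory formula concrete; the layer count, head count, and head size below are assumptions (roughly 70B-class shapes with full multi-head attention), and grouped-query attention shrinks the result substantially.

```python
# Back-of-envelope KV cache sizing for a dense MHA model (assumed shapes are illustrative).
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each of shape [seq_len, num_kv_heads, head_dim]
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * dtype_bytes

per_request = kv_cache_bytes(seq_len=4096, num_layers=80, num_kv_heads=64, head_dim=128)
print(f"{per_request / 1e9:.1f} GB per request")  # ~10.7 GB in FP16; GQA cuts this sharply
```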
### PagedAttention (vLLM Innovation)

```
Traditional KV cache:
┌─────────────────────────────────────────┐
│ Request 1 KV cache (contiguous, fixed)  │ ← wastes memory
├─────────────────────────────────────────┤
│ Request 2 KV cache (contiguous, fixed)  │
├─────────────────────────────────────────┤
│ FRAGMENTED / WASTED SPACE               │
└─────────────────────────────────────────┘

PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
```

- Non-contiguous memory allocation
- Near-zero memory waste
- 2-4x higher throughput
### KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
|---|---|---|
| Paged Attention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
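A sketch of enabling two of these strategies in vLLM; the argument names follow vLLM's engine args, and FP8 KV cache support depends on version and hardware, so treat the settings as assumptions to verify.

```python
# Sketch: prefix caching plus a quantized KV cache in vLLM (version/hardware dependent).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_prefix_caching=True,   # reuse KV blocks for shared prefixes (e.g., system prompts)
    kv_cache_dtype="fp8",         # quantize cached K/V values to reduce memory
)
```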
## Streaming Response Patterns

### Server-Sent Events (SSE)

```
Client                                   Server
  │                                        │
  ├── POST /v1/chat/completions ──────────▶│
  │    (stream: true)                      │
  │                                        │
  │◀──────────── HTTP 200 OK ──────────────┤
  │    Content-Type: text/event-stream     │
  │                                        │
  │◀── data: {"token": "Hello"} ───────────┤
  │◀── data: {"token": " world"} ──────────┤
  │◀── data: {"token": "!"} ───────────────┤
  │◀── data: [DONE] ───────────────────────┤
  │                                        │
```
SSE Benefits:
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
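A minimal SSE consumer against an OpenAI-compatible streaming endpoint; the URL, model name, and chunk schema are assumptions (the diagram above shows simplified token frames, while OpenAI-style servers stream `choices[0].delta` chunks).

```python
# Sketch of an SSE consumer; adjust URL, model, and payload for your server.
import json
import httpx

payload = {
    "model": "my-model",  # assumption: whatever model your server exposes
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}

with httpx.stream("POST", "http://localhost:8000/v1/chat/completions",
                  json=payload, timeout=60) as r:
    for line in r.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```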
### WebSocket Streaming

```
Client                                   Server
  │                                        │
  ├── WebSocket upgrade ──────────────────▶│
  │◀── 101 Switching Protocols ────────────┤
  │                                        │
  ├── {"prompt": "Hello"} ────────────────▶│
  │                                        │
  │◀── {"token": "Hi"} ────────────────────┤
  │◀── {"token": " there"} ────────────────┤
  │◀── {"token": "!"} ─────────────────────┤
  │◀── {"done": true} ─────────────────────┤
  │                                        │
```
WebSocket Benefits:
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
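A hedged WebSocket client sketch; the endpoint and message schema mirror the diagram above rather than any specific framework's wire format.

```python
# Sketch: consuming a WebSocket token stream (endpoint and schema are assumptions).
import asyncio
import json
import websockets

async def chat():
    async with websockets.connect("ws://localhost:8000/ws/generate") as ws:
        await ws.send(json.dumps({"prompt": "Hello"}))
        async for raw in ws:          # iterate messages as they arrive
            msg = json.loads(raw)
            if msg.get("done"):
                break
            print(msg["token"], end="", flush=True)

asyncio.run(chat())
```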
### Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
|---|---|---|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
## Speculative Decoding

### Concept

```
Standard decoding:
Large model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms  = 50ms total

Speculative decoding:
Draft model: [T1, T2, T3, T4, T5]  (proposed in parallel, ~5ms)
        │
        ▼
Large model: [Verify T1-T5 in one pass]  (~15ms)
             Accept: T1, T2, T3 ✓   Reject: T4, T5 ✗
        │
        ▼
             [Regenerate from T4 onward]
             (the verification pass already yields the corrected T4)

Total: ~25ms (roughly 2x speedup at ~60% acceptance)
```
### Speculative Decoding Trade-offs

| Factor | Impact |
|---|---|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
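One accessible way to try the pattern is assisted generation in Hugging Face transformers, where a small draft model proposes tokens and the target model verifies them; the model names below are assumptions, and draft and target must share a tokenizer.

```python
# Sketch: speculative/assisted decoding with transformers (model names illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-8B-Instruct"   # assumption: large target model
draft_id = "meta-llama/Llama-3.2-1B-Instruct"    # assumption: small same-tokenizer draft

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tok("Explain PagedAttention briefly.", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```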
## Scaling Strategies

### Horizontal Scaling

```
       ┌─────────────────────────────────────┐
       │            Load Balancer            │
       │  (round-robin, least-connections)   │
       └─────────────────────────────────────┘
            │            │            │
            ▼            ▼            ▼
       ┌─────────┐  ┌─────────┐  ┌─────────┐
       │  vLLM   │  │  vLLM   │  │  vLLM   │
       │ Node 1  │  │ Node 2  │  │ Node 3  │
       │ (GPU×4) │  │ (GPU×4) │  │ (GPU×4) │
       └─────────┘  └─────────┘  └─────────┘
```
### Model Parallelism

| Strategy | Description | Use Case |
|---|---|---|
| Tensor Parallelism | Split layers across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |
```
Tensor Parallelism (TP=4):
┌───────────────────────────────────────┐
│                Layer N                │
│  GPU0   │  GPU1   │  GPU2   │  GPU3   │
│  25%    │  25%    │  25%    │  25%    │
└───────────────────────────────────────┘

Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
```
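In vLLM, tensor parallelism is a single engine argument; the model name below is an assumption, and `pipeline_parallel_size` exists as a separate knob for models that still do not fit with TP alone.

```python
# Sketch: serving one large model across 4 GPUs with tensor parallelism (TP=4) in vLLM.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumption: any model needing multiple GPUs
    tensor_parallel_size=4,                     # split each layer across 4 GPUs
)
```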
## Latency Optimization Checklist

### Pre-deployment

- Choose appropriate quantization (INT8 for production)
- Enable continuous batching
- Configure KV cache size appropriately
- Set optimal batch size for hardware
- Enable prefix caching for system prompts
### Runtime

- Monitor GPU memory utilization
- Track p50/p95/p99 latencies
- Measure time-to-first-token (TTFT)
- Monitor tokens-per-second (TPS)
- Set appropriate timeouts
### Infrastructure

- Use fastest available interconnect (NVLink, InfiniBand)
- Minimize network hops
- Place inference close to users (edge)
- Consider dedicated inference hardware
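A small helper for the TTFT and TPS items in the runtime checklist above; it wraps any token iterator (for example, the SSE client shown earlier), so the transport itself is left as an assumption.

```python
# Sketch: measuring time-to-first-token (TTFT) and tokens-per-second (TPS) over a token stream.
import time

def measure_stream(token_iterator):
    """token_iterator yields decoded tokens as they arrive from the server."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_iterator:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    tps = count / (end - first_token_at) if first_token_at and end > first_token_at else 0.0
    return {"ttft_s": ttft, "tokens": count, "tps": tps}
```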
## Cost Optimization

### Cost Drivers

| Factor | Impact | Optimization |
|---|---|---|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
### Cost Estimation Formula

```
Monthly cost =
  (Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
  ─────────────────────────────────────────────────────────────────────────────
                                      3600
```

Example:

- 10M requests/month
- 500 tokens average
- 0.001 GPU-seconds/token (optimized)
- $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 ≈ $2,778/month
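The same formula as a helper function; all inputs are placeholders to replace with measured values.

```python
# Cost-estimation helper mirroring the formula above (inputs are assumptions).
def monthly_cost(requests_per_month, avg_tokens, gpu_seconds_per_token, dollars_per_gpu_hour):
    gpu_hours = requests_per_month * avg_tokens * gpu_seconds_per_token / 3600
    return gpu_hours * dollars_per_gpu_hour

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # ≈ $2,778/month
```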
## Common Patterns

### Multi-model Routing

```
       ┌─────────────────────────────────────┐
       │               Router                │
       │  • Classify request complexity      │
       │  • Route to the appropriate model   │
       └─────────────────────────────────────┘
            │            │            │
            ▼            ▼            ▼
       ┌─────────┐  ┌─────────┐  ┌─────────┐
       │  Small  │  │ Medium  │  │  Large  │
       │  Model  │  │  Model  │  │  Model  │
       │  (7B)   │  │  (13B)  │  │  (70B)  │
       │  Fast   │  │ Balanced│  │ Quality │
       └─────────┘  └─────────┘  └─────────┘
```
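A hedged sketch of a complexity-based router; the keyword heuristic and tier names are illustrative placeholders, and real routers often use a small classifier model instead.

```python
# Sketch: route prompts to a model tier based on a crude complexity heuristic.
def route(prompt: str) -> str:
    tokens = len(prompt.split())
    needs_reasoning = any(k in prompt.lower() for k in ("prove", "analyze", "step by step"))
    if needs_reasoning or tokens > 400:
        return "large-70b"   # quality tier
    if tokens > 100:
        return "medium-13b"  # balanced tier
    return "small-7b"        # fast tier

print(route("Summarize this paragraph in one line."))  # -> small-7b
```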
### Caching Strategies

| Cache Type | What to Cache | TTL |
|---|---|---|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
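A minimal exact-match response cache keyed on a prompt hash with a TTL; the in-memory dict is for illustration only, and production systems typically back this with Redis or a similar store.

```python
# Sketch: exact-match response cache with a TTL (in-memory, illustrative only).
import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}

def cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def get_cached(model: str, prompt: str, ttl_s: float = 3600.0):
    hit = _cache.get(cache_key(model, prompt))
    if hit and time.time() - hit[0] < ttl_s:
        return hit[1]
    return None

def put_cached(model: str, prompt: str, response: str) -> None:
    _cache[cache_key(model, prompt)] = (time.time(), response)
```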
## Related Skills

- `ml-system-design` - End-to-end ML pipeline design
- `rag-architecture` - Retrieval-augmented generation patterns
- `vector-databases` - Vector search for LLM context
- `ml-inference-optimization` - General inference optimization
- `estimation-techniques` - Capacity planning for LLM systems
## Version History

- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
## Last Updated

Date: 2025-12-26