llm-serving-patterns
LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed_tools: Read, Glob, Grep
Install
git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/llm-serving-patterns ~/.claude/skills/claude-code-plugins/
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: llm-serving-patterns
description: LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed-tools: Read, Glob, Grep
LLM Serving Patterns
When to Use This Skill
Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
LLM Serving Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Serving Stack │
├─────────────────────────────────────────────────────────────────────┤
│ Clients (API, Chat UI, Agents) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Load Balancer / API Gateway │ │
│ │ • Rate limiting • Authentication • Request routing │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inference Server │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Request │ │ Batching │ │ KV Cache │ │ │
│ │ │ Queue │──▶│ Engine │──▶│ Management │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Model Execution Engine │ │ │
│ │ │ • Tensor operations • Attention • Token sampling │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GPU/TPU Cluster │ │
│ │ • Model sharding • Tensor parallelism • Pipeline parallel │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Serving Framework Comparison
| Framework | Strengths | Best For | Considerations |
|---|---|---|---|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
Framework Selection Decision Tree
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
└── Need high throughput with many concurrent users?
├── Yes → vLLM (PagedAttention)
└── No
└── Need enterprise features + HF integration?
├── Yes → TGI
└── No
└── Simple local/edge deployment?
├── Yes → Ollama or llama.cpp
└── No → vLLM (general purpose)
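As a concrete starting point, a minimal offline vLLM sketch (assuming vLLM is installed and the model fits in GPU memory; the model name and sampling values are illustrative, not recommendations):

```python
# Minimal vLLM sketch: load a model and generate; continuous batching and KV
# cache management are handled by the engine. Model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For an HTTP deployment, the same engine is typically exposed through vLLM's bundled OpenAI-compatible API server rather than the offline LLM class.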
Quantization Techniques
Precision Levels
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
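The memory column can be sanity-checked with back-of-the-envelope arithmetic for the weights alone (KV cache and activations are extra); this is pure arithmetic, no framework assumed:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate dense-model weight footprint at a given precision."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B model @ {bits:>2}-bit: ~{weight_memory_gb(70, bits):.0f} GB of weights")
# ~280 GB (FP32), ~140 GB (FP16), ~70 GB (INT8), ~35 GB (INT4) -- weights only
```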
Quantization Methods
| Method | Description | Quality | Speed |
|---|---|---|---|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates difficulty to weights | Excellent | Moderate |
Quantization Selection
Quality vs. Efficiency Trade-off:
Quality ────────────────────────────────────────────▶ Efficiency
│ │
│ FP32 FP16 INT8+AWQ INT8+GPTQ INT4 INT2 │
│ ○───────○────────○──────────○──────────○──────○ │
│ │ │ │ │ │ │ │
│ Best Great Good Good Fair Poor │
│ │
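Serving an already-quantized checkpoint is usually a small configuration change. A hedged sketch using vLLM's quantization option (the checkpoint name is illustrative, and the exact flags vary by framework and version):

```python
# Sketch: serve a pre-quantized AWQ checkpoint; weights were quantized offline,
# activations remain in FP16. Checkpoint name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # any AWQ checkpoint works similarly
    quantization="awq",                     # select the AWQ kernel path
    dtype="float16",
)
out = llm.generate(["Summarize AWQ in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```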
Batching Strategies
Static Batching
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80] ─┘
Problem: Short requests wait for long ones (head-of-line blocking)
Continuous Batching (Preferred)
Time ──────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]
• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
Batching Parameters
| Parameter | Description | Trade-off |
|---|---|---|
| max_batch_size | Maximum concurrent requests | Memory vs. throughput |
| max_waiting_tokens | Tokens to wait before forcing a batch | Latency vs. throughput |
| max_num_seqs | Maximum sequences in a batch | Memory vs. concurrency |
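Parameter names differ between servers (the table mixes TGI-style and vLLM-style names). As one hedged example, vLLM exposes the continuous-batching budget roughly like this; values are illustrative starting points to tune against measured TTFT and throughput:

```python
# Sketch: continuous-batching knobs in vLLM.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=256,             # cap on sequences scheduled per engine step
    max_num_batched_tokens=8192,  # token budget per step (latency vs. throughput)
    gpu_memory_utilization=0.90,  # fraction of GPU memory for weights + KV cache
)
```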
KV Cache Management
The KV Cache Problem
Attention: Q × K^T × V
For each new token generated:
• Attention must be computed against the keys and values of ALL previous tokens
• The cached K and V tensors grow linearly with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)
Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
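The figure above can be sanity-checked with simple arithmetic; this sketch assumes FP16 (2 bytes per element) and full multi-head attention, so models using GQA/MQA come in lower:

```python
def kv_cache_gb(batch: int, seq_len: int, num_layers: int, hidden_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Rough KV-cache size: K and V tensors (factor of 2) per layer per token."""
    return 2 * batch * seq_len * num_layers * hidden_dim * bytes_per_elem / 1e9

# Illustrative 70B-class shape: 80 layers, hidden size 8192, 4K context, FP16.
print(f"{kv_cache_gb(1, 4096, 80, 8192):.1f} GB per request")       # ~10.7 GB
print(f"{kv_cache_gb(10, 4096, 80, 8192):.0f} GB for 10 requests")  # ~107 GB
```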
PagedAttention (vLLM Innovation)
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed) │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed) │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE │
└──────────────────────────────────────────┘
PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
KV Cache Optimization Strategies
| Strategy | Description | Memory Savings |
|---|---|---|
| Paged Attention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
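Prefix caching in particular is often a single engine flag. A hedged vLLM sketch (flag name per vLLM; other servers expose similar options):

```python
# Sketch: automatic prefix caching so a shared system prompt's KV blocks are
# computed once and reused across requests with the same prefix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)

system = "You are a concise assistant.\n\n"
params = SamplingParams(max_tokens=64)
llm.generate([system + "First question?"], params)   # prefix KV computed here
llm.generate([system + "Second question?"], params)  # prefix KV reused
```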
Streaming Response Patterns
Server-Sent Events (SSE)
Client Server
│ │
│──── POST /v1/chat/completions ─────▶│
│ (stream: true) │
│ │
│◀──── HTTP 200 OK ───────────────────│
│ Content-Type: text/event-stream│
│ │
│◀──── data: {"token": "Hello"} ──────│
│◀──── data: {"token": " world"} ─────│
│◀──── data: {"token": "!"} ──────────│
│◀──── data: [DONE] ──────────────────│
│ │
SSE Benefits:
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
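A minimal SSE client sketch against an OpenAI-compatible chat endpoint; the URL, model name, and chunk schema assume that API shape, which OpenAI-compatible servers (e.g., vLLM's) generally follow:

```python
# Sketch: consume an SSE token stream from an OpenAI-compatible chat endpoint.
import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue  # skip keep-alives and blank separator lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content") or ""
    print(delta, end="", flush=True)
```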
WebSocket Streaming
Client Server
│ │
│──── WebSocket Upgrade ─────────────▶│
│◀──── 101 Switching Protocols ───────│
│ │
│──── {"prompt": "Hello"} ───────────▶│
│ │
│◀──── {"token": "Hi"} ───────────────│
│◀──── {"token": " there"} ───────────│
│◀──── {"token": "!"} ────────────────│
│◀──── {"done": true} ────────────────│
│ │
WebSocket Benefits:
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
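A hedged WebSocket client sketch using the websockets library; the endpoint path and message schema simply mirror the diagram above and are not a standard API:

```python
# Sketch: stream tokens over a WebSocket whose message schema matches the
# diagram above ({"prompt": ...} in, {"token": ...}/{"done": true} out).
import asyncio
import json
import websockets

async def stream(prompt: str) -> None:
    async with websockets.connect("ws://localhost:8000/generate") as ws:
        await ws.send(json.dumps({"prompt": prompt}))
        while True:
            msg = json.loads(await ws.recv())
            if msg.get("done"):
                break
            print(msg["token"], end="", flush=True)

asyncio.run(stream("Hello"))
```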
Streaming Implementation Considerations
| Aspect | SSE | WebSocket |
|---|---|---|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
Speculative Decoding
Concept
Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
10ms 10ms 10ms 10ms 10ms = 50ms total
Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (parallel, 5ms)
│
▼
Large Model: [Verify T1-T5 in one pass] (15ms)
Accept: T1, T2, T3 ✓ Reject: T4, T5 ✗
│
▼
[Generate T4, T5 correctly]
Total: ~25ms (2x speedup if 60% acceptance)
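A toy sketch of one greedy draft-and-verify round; draft.next_token and target.greedy_tokens are hypothetical stand-ins for model calls, not a real library API:

```python
def speculative_step(target, draft, context, k=5):
    """One greedy speculative-decoding round with hypothetical model interfaces.

    draft.next_token(seq)             -> next greedy token from the small model
    target.greedy_tokens(ctx, props)  -> target model's greedy choice at each
                                         proposal position, from ONE batched pass
    """
    # 1. Draft model proposes k tokens autoregressively (cheap, sequential).
    proposals, seq = [], list(context)
    for _ in range(k):
        token = draft.next_token(seq)
        proposals.append(token)
        seq.append(token)

    # 2. Target model verifies all k positions in a single forward pass.
    target_choices = target.greedy_tokens(context, proposals)

    # 3. Accept proposals until the first mismatch; at the mismatch, keep the
    #    target model's own token, so each round yields at least one token.
    accepted = []
    for proposed, wanted in zip(proposals, target_choices):
        if proposed == wanted:
            accepted.append(proposed)
        else:
            accepted.append(wanted)
            break
    return accepted
```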
Speculative Decoding Trade-offs
| Factor | Impact |
|---|---|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
Scaling Strategies
Horizontal Scaling
┌─────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Round-robin, Least-connections) │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ (GPU×4) │ │ (GPU×4) │ │ (GPU×4) │
└─────────┘ └─────────┘ └─────────┘
Model Parallelism
| Strategy | Description | Use Case |
|---|---|---|
| Tensor Parallelism | Split layers across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |
Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│ Layer N │
│ GPU0 │ GPU1 │ GPU2 │ GPU3 │
│ 25% │ 25% │ 25% │ 25% │
└─────────────────────────────────────────┘
Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
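As a hedged configuration sketch, vLLM exposes both strategies as engine arguments (the corresponding server flags are --tensor-parallel-size and --pipeline-parallel-size; pipeline-parallel support depends on the version):

```python
# Sketch: TP=4 within a node, PP=2 across stages, for a 70B-class model.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,    # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 sequential stages
)
```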
Latency Optimization Checklist
Pre-deployment
- Choose appropriate quantization (INT8 for production)
- Enable continuous batching
- Configure KV cache size appropriately
- Set optimal batch size for hardware
- Enable prefix caching for system prompts
Runtime
- Monitor GPU memory utilization
- Track p50/p95/p99 latencies
- Measure time-to-first-token (TTFT)
- Monitor tokens-per-second (TPS)
- Set appropriate timeouts
Infrastructure
- Use fastest available interconnect (NVLink, InfiniBand)
- Minimize network hops
- Place inference close to users (edge)
- Consider dedicated inference hardware
Cost Optimization
Cost Drivers
| Factor | Impact | Optimization |
|---|---|---|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
Cost Estimation Formula
Monthly Cost =
(Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
─────────────────────────────────────────────────────────────────────────────
3600
Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour
Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
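The same formula as a small helper, reproducing the worked example:

```python
def monthly_cost(requests: int, tokens_per_request: float,
                 gpu_seconds_per_token: float, usd_per_gpu_hour: float) -> float:
    """Total GPU-seconds consumed per month, converted to GPU-hours and priced."""
    gpu_seconds = requests * tokens_per_request * gpu_seconds_per_token
    return gpu_seconds / 3600 * usd_per_gpu_hour

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # ~$2,778/month
```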
Common Patterns
Multi-model Routing
┌─────────────────────────────────────────────────────────┐
│ Router │
│ • Classify request complexity │
│ • Route to appropriate model │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Small │ │ Medium │ │ Large │
│ Model │ │ Model │ │ Model │
│ (7B) │ │ (13B) │ │ (70B) │
│ Fast │ │ Balanced│ │ Quality │
└─────────┘ └─────────┘ └─────────┘
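A hypothetical routing sketch; the keyword heuristic and pool URLs are placeholders, and production routers more often use a small classifier model or per-tenant cost/quality policies:

```python
# Sketch: pick a model tier from cheap heuristics on the incoming prompt.
MODEL_TIERS = {
    "small": "http://small-pool/v1",    # 7B: fast, cheap
    "medium": "http://medium-pool/v1",  # 13B: balanced
    "large": "http://large-pool/v1",    # 70B: highest quality
}

def route(prompt: str) -> str:
    hard = any(k in prompt.lower() for k in ("prove", "analyze", "step by step"))
    if hard or len(prompt) > 2000:
        return MODEL_TIERS["large"]
    if len(prompt) > 500:
        return MODEL_TIERS["medium"]
    return MODEL_TIERS["small"]
```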
Caching Strategies
| Cache Type | What to Cache | TTL |
|---|---|---|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
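An exact-match response cache sketch; it only makes sense for deterministic requests (temperature 0), and the helper names are illustrative:

```python
# Sketch: cache full responses keyed on a hash of the normalized request.
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(model: str, messages: list, temperature: float) -> str:
    raw = json.dumps({"model": model, "messages": messages, "temperature": temperature},
                     sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_generate(model, messages, temperature, generate_fn):
    key = cache_key(model, messages, temperature)
    if temperature == 0 and key in _cache:
        return _cache[key]  # exact-match hit: skip inference entirely
    result = generate_fn(model, messages, temperature)
    if temperature == 0:
        _cache[key] = result
    return result
```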
Related Skills
- ml-system-design - End-to-end ML pipeline design
- rag-architecture - Retrieval-augmented generation patterns
- vector-databases - Vector search for LLM context
- ml-inference-optimization - General inference optimization
- estimation-techniques - Capacity planning for LLM systems
Version History
- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
Last Updated
Date: 2025-12-26
Repository: melodic-software/claude-code-plugins/plugins/systems-design/skills/llm-serving-patterns
Author: melodic-software