llm-serving-patterns
LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed_tools: Read, Glob, Grep
$ Install
git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/llm-serving-patterns ~/.claude/skills/claude-code-plugins/
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: llm-serving-patterns
description: LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed-tools: Read, Glob, Grep
LLM Serving Patterns
When to Use This Skill
Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
LLM Serving Architecture Overview
LLM Serving Stack

Clients (API, Chat UI, Agents)
              │
              ▼
┌──────────────────────────────────────────────────────────┐
│ Load Balancer / API Gateway                              │
│ • Rate limiting  • Authentication  • Request routing     │
└──────────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────────┐
│ Inference Server                                         │
│  ┌───────────┐    ┌───────────┐    ┌───────────────┐     │
│  │  Request  │    │  Batching │    │    KV Cache   │     │
│  │   Queue   │───▶│   Engine  │───▶│   Management  │     │
│  └───────────┘    └───────────┘    └───────────────┘     │
│              │                                           │
│              ▼                                           │
│  ┌─────────────────────────────────────────────────┐     │
│  │ Model Execution Engine                          │     │
│  │ • Tensor operations • Attention • Token sampling│     │
│  └─────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────────┐
│ GPU/TPU Cluster                                          │
│ • Model sharding • Tensor parallelism • Pipeline parallel│
└──────────────────────────────────────────────────────────┘
Serving Framework Comparison
| Framework | Strengths | Best For | Considerations |
|---|---|---|---|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
Framework Selection Decision Tree
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
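For a concrete starting point, here is a minimal offline-inference sketch with vLLM, the general-purpose default in the tree above. It assumes vLLM is installed and a CUDA GPU is available; the model name is illustrative.

```python
# Minimal vLLM offline-inference sketch (assumes `pip install vllm` and a CUDA GPU).
# The model name is illustrative; substitute any Hugging Face causal LM you can load.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # weights download on first use
params = SamplingParams(temperature=0.7, max_tokens=128)

for output in llm.generate(["Explain continuous batching in one sentence."], params):
    print(output.outputs[0].text)
```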
Quantization Techniques
Precision Levels
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
Quantization Methods
| Method | Description | Quality | Speed |
|---|---|---|---|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates difficulty to weights | Excellent | Moderate |
Quantization Selection
Quality vs. Efficiency Trade-off:
Quality ◀──────────────────────────────────────────────────▶ Efficiency

  FP32      FP16      INT8+AWQ      INT8+GPTQ      INT4      INT2
  Best      Great     Good          Good           Fair      Poor
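As a rough illustration of putting quantization into practice, the sketch below loads a pre-quantized AWQ checkpoint in vLLM. The checkpoint name is illustrative and the exact arguments can vary by vLLM version.

```python
# Sketch: serving a pre-quantized AWQ checkpoint with vLLM. The repo name is
# illustrative; any checkpoint whose config declares a supported quantization works.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",                     # load 4-bit AWQ weights and kernels
    dtype="float16",                        # activations stay in FP16
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```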
Batching Strategies
Static Batching
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50]  ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80]  ─┘
Problem: Short requests wait for long ones (head-of-line blocking)
Continuous Batching (Preferred)
Time ──────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join the batch as others complete
• No padding waste
• Optimal GPU utilization
Batching Parameters
| Parameter | Description | Trade-off |
|---|---|---|
| max_batch_size | Maximum concurrent requests | Memory vs. throughput |
| max_waiting_tokens | Tokens before forcing batch | Latency vs. throughput |
| max_num_seqs | Maximum sequences in batch | Memory vs. concurrency |
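As a hedged example of how these knobs surface in practice, the sketch below uses vLLM's engine arguments (`max_num_seqs`, `max_num_batched_tokens`); `max_batch_size`/`max_waiting_tokens` in the table are the TGI-style equivalents.

```python
# Sketch: continuous-batching knobs as exposed by vLLM's engine arguments.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative
    max_num_seqs=64,                # upper bound on sequences scheduled per step
    max_num_batched_tokens=8192,    # upper bound on total tokens per scheduling step
    gpu_memory_utilization=0.90,    # VRAM fraction reserved for weights + KV cache
)
```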
KV Cache Management
The KV Cache Problem
Attention: Q × K^T × V

For each token generated:
• Without a cache, K and V must be recomputed for ALL previous tokens
• Cached K and V tensors grow with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
PagedAttention (vLLM Innovation)
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV cache (contiguous, fixed)   │  ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED / WASTED SPACE                │
└──────────────────────────────────────────┘

PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │  ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘

• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
KV Cache Optimization Strategies
| Strategy | Description | Memory Savings |
|---|---|---|
| Paged Attention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
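As one concrete example of prefix caching, the sketch below enables vLLM's automatic prefix caching so a shared system prompt is computed once and its KV blocks are reused. The flag name reflects recent vLLM releases; the model is illustrative.

```python
# Sketch: automatic prefix caching in vLLM reuses KV blocks for the shared prefix.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

system = "You are a concise systems-design assistant.\n\n"
prompts = [system + "Summarize PagedAttention.", system + "What is grouped-query attention?"]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```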
Streaming Response Patterns
Server-Sent Events (SSE)
Client                                        Server
  │                                             │
  │─── POST /v1/chat/completions ──────────────▶│
  │    (stream: true)                           │
  │                                             │
  │◀── HTTP 200 OK ─────────────────────────────│
  │    Content-Type: text/event-stream          │
  │                                             │
  │◀── data: {"token": "Hello"} ────────────────│
  │◀── data: {"token": " world"} ───────────────│
  │◀── data: {"token": "!"} ────────────────────│
  │◀── data: [DONE] ────────────────────────────│
  │                                             │
SSE Benefits:
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
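A minimal SSE consumer, assuming an OpenAI-compatible streaming endpoint (vLLM and TGI both expose one); the base URL, API key, and model name are assumptions.

```python
# Sketch: consuming an SSE token stream via the OpenAI-compatible chat API.
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give a one-line summary of PagedAttention."}],
    stream=True,  # server replies with text/event-stream "data:" chunks
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```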
WebSocket Streaming
Client                                        Server
  │                                             │
  │─── WebSocket upgrade ──────────────────────▶│
  │◀── 101 Switching Protocols ─────────────────│
  │                                             │
  │─── {"prompt": "Hello"} ────────────────────▶│
  │                                             │
  │◀── {"token": "Hi"} ─────────────────────────│
  │◀── {"token": " there"} ─────────────────────│
  │◀── {"token": "!"} ──────────────────────────│
  │◀── {"done": true} ──────────────────────────│
  │                                             │
WebSocket Benefits:
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
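A minimal WebSocket client sketch. The endpoint and the `{"prompt": ...}`/`{"token": ...}` message shapes simply mirror the diagram above; they are illustrative, not a standard API.

```python
# Sketch of a WebSocket token-stream client (assumed endpoint and message format).
import asyncio
import json

import websockets  # pip install websockets

async def chat(prompt: str) -> None:
    async with websockets.connect("ws://localhost:8000/ws") as ws:  # assumed endpoint
        await ws.send(json.dumps({"prompt": prompt}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("done"):
                break
            print(msg.get("token", ""), end="", flush=True)
    print()

asyncio.run(chat("Hello"))
```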
Streaming Implementation Considerations
| Aspect | SSE | WebSocket |
|---|---|---|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
Speculative Decoding
Concept
Standard Decoding:
Large Model:  [T1] → [T2] → [T3] → [T4] → [T5]
              10ms   10ms   10ms   10ms   10ms  = 50ms total

Speculative Decoding:
Draft Model:  [T1, T2, T3, T4, T5]  (parallel, 5ms)
                        │
                        ▼
Large Model:  [Verify T1-T5 in one pass]  (15ms)
              Accept: T1, T2, T3 ✓   Reject: T4, T5 ✗
                        │
                        ▼
              [Generate T4, T5 correctly]

Total: ~25ms (2x speedup if 60% acceptance)
Speculative Decoding Trade-offs
| Factor | Impact |
|---|---|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
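The trade-offs above reduce to simple arithmetic. The toy model below uses illustrative timings (draft 1 ms/token, one 15 ms verification pass, 10 ms/token sequentially for the large model) and reproduces the ~2x figure at 60% acceptance.

```python
# Toy speedup model for speculative decoding (all timings are illustrative).
def expected_speedup(k: int, accept_rate: float,
                     draft_ms: float, verify_ms: float, big_ms: float) -> float:
    # Tokens per round: the accepted prefix plus the one token the large model
    # emits itself during the verification pass.
    tokens_per_round = k * accept_rate + 1
    round_cost = k * draft_ms + verify_ms            # draft k tokens, verify once
    sequential_cost = tokens_per_round * big_ms      # same tokens, one at a time
    return sequential_cost / round_cost

print(f"{expected_speedup(k=5, accept_rate=0.6, draft_ms=1, verify_ms=15, big_ms=10):.1f}x")
# -> 2.0x, matching the ~2x figure quoted for 60% acceptance
```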
Scaling Strategies
Horizontal Scaling
┌───────────────────────────────────────────────────────┐
│                     Load Balancer                      │
│            (Round-robin, Least-connections)            │
└───────────────────────────────────────────────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
 ┌───────────┐        ┌───────────┐        ┌───────────┐
 │   vLLM    │        │   vLLM    │        │   vLLM    │
 │  Node 1   │        │  Node 2   │        │  Node 3   │
 │  (GPU×4)  │        │  (GPU×4)  │        │  (GPU×4)  │
 └───────────┘        └───────────┘        └───────────┘
Model Parallelism
| Strategy | Description | Use Case |
|---|---|---|
| Tensor Parallelism | Split layers across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |
Tensor Parallelism (TP=4):
┌──────────────────────────────────────────┐
│                 Layer N                  │
│  GPU0    │   GPU1    │   GPU2    │  GPU3 │
│  25%     │   25%     │   25%     │  25%  │
└──────────────────────────────────────────┘
Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
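As a hedged example, the sketch below requests TP=4 through vLLM on a single 4-GPU node; a `pipeline_parallel_size` argument additionally splits layer ranges as in the PP=4 layout above. The model name is illustrative.

```python
# Sketch: tensor-parallel serving with vLLM (TP=4 across one node's GPUs).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model too large for one GPU
    tensor_parallel_size=4,                     # shard each layer's weights across 4 GPUs
    dtype="float16",
)
```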
Latency Optimization Checklist
Pre-deployment
- Choose appropriate quantization (INT8 for production)
- Enable continuous batching
- Configure KV cache size appropriately
- Set optimal batch size for hardware
- Enable prefix caching for system prompts
Runtime
- Monitor GPU memory utilization
- Track p50/p95/p99 latencies
- Measure time-to-first-token (TTFT)
- Monitor tokens-per-second (TPS)
- Set appropriate timeouts
Infrastructure
- Use fastest available interconnect (NVLink, InfiniBand)
- Minimize network hops
- Place inference close to users (edge)
- Consider dedicated inference hardware
Cost Optimization
Cost Drivers
| Factor | Impact | Optimization |
|---|---|---|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
Cost Estimation Formula
Monthly Cost =
  (Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
  ────────────────────────────────────────────────────────────────────────────
                                     3600

Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 ≈ $2,778/month
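The same formula as a checkable helper; the inputs are the example's assumptions.

```python
# Monthly serving-cost estimate from the formula above.
def monthly_cost(requests: int, tokens_per_request: int,
                 gpu_seconds_per_token: float, dollars_per_gpu_hour: float) -> float:
    gpu_hours = requests * tokens_per_request * gpu_seconds_per_token / 3600
    return gpu_hours * dollars_per_gpu_hour

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # -> $2,778/month
```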
Common Patterns
Multi-model Routing
┌───────────────────────────────────────────────────────┐
│                        Router                         │
│   • Classify request complexity                       │
│   • Route to appropriate model                        │
└───────────────────────────────────────────────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
 ┌───────────┐        ┌───────────┐        ┌───────────┐
 │   Small   │        │  Medium   │        │   Large   │
 │   Model   │        │   Model   │        │   Model   │
 │   (7B)    │        │   (13B)   │        │   (70B)   │
 │   Fast    │        │ Balanced  │        │  Quality  │
 └───────────┘        └───────────┘        └───────────┘
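A minimal routing sketch under stated assumptions: the complexity classifier is a stand-in heuristic and the endpoints are invented placeholders; production routers often use a small classifier model or explicit rules instead.

```python
# Sketch: complexity-based routing to differently sized models (all names assumed).
ROUTES = {
    "small": "http://llm-7b.internal/v1",    # fast, cheap
    "medium": "http://llm-13b.internal/v1",  # balanced
    "large": "http://llm-70b.internal/v1",   # highest quality
}

def classify(prompt: str) -> str:
    # Placeholder complexity signal: prompt length plus a crude reasoning/code check.
    if len(prompt) > 2000 or "step by step" in prompt.lower():
        return "large"
    if len(prompt) > 300 or "```" in prompt:
        return "medium"
    return "small"

def route(prompt: str) -> str:
    return ROUTES[classify(prompt)]

print(route("What is 2 + 2?"))  # -> http://llm-7b.internal/v1
```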
Caching Strategies
| Cache Type | What to Cache | TTL |
|---|---|---|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
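For the exact-match response cache row, here is a minimal in-process sketch with per-entry TTL; a production deployment would more likely use Redis or another shared store.

```python
# Minimal exact-match response cache with per-entry TTL (in-process, illustrative).
import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}

def _key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def get_cached(model: str, prompt: str, ttl_s: float = 3600.0) -> str | None:
    entry = _cache.get(_key(model, prompt))
    if entry and time.time() - entry[0] < ttl_s:
        return entry[1]
    return None

def put_cached(model: str, prompt: str, response: str) -> None:
    _cache[_key(model, prompt)] = (time.time(), response)

put_cached("llama-8b", "ping", "pong")
print(get_cached("llama-8b", "ping"))  # -> "pong" until the TTL expires
```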
Related Skills
- ml-system-design - End-to-end ML pipeline design
- rag-architecture - Retrieval-augmented generation patterns
- vector-databases - Vector search for LLM context
- ml-inference-optimization - General inference optimization
- estimation-techniques - Capacity planning for LLM systems
Version History
- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
Last Updated
Date: 2025-12-26