
llm-serving-patterns

LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.

allowed_tools: Read, Glob, Grep

$ Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/llm-serving-patterns ~/.claude/skills/claude-code-plugins

// tip: Run this command in your terminal to install the skill


LLM Serving Patterns

When to Use This Skill

Use this skill when:

  • Designing LLM inference infrastructure
  • Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
  • Implementing quantization for production deployment
  • Optimizing batching and throughput
  • Building streaming response systems
  • Scaling LLM deployments cost-effectively

Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding

LLM Serving Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         LLM Serving Stack                           │
├─────────────────────────────────────────────────────────────────────┤
│  Clients (API, Chat UI, Agents)                                     │
│       │                                                             │
│       ▼                                                             │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │              Load Balancer / API Gateway                     │   │
│  │  • Rate limiting  • Authentication  • Request routing        │   │
│  └──────────────────────────────────────────────────────────────┘   │
│       │                                                             │
│       ▼                                                             │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                   Inference Server                           │   │
│  │  ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐ │   │
│  │  │  Request    │   │  Batching   │   │  KV Cache           │ │   │
│  │  │  Queue      │──▶│  Engine     │──▶│  Management         │ │   │
│  │  └─────────────┘   └─────────────┘   └─────────────────────┘ │   │
│  │       │                                      │               │   │
│  │       ▼                                      ▼               │   │
│  │  ┌───────────────────────────────────────────────────────┐   │   │
│  │  │              Model Execution Engine                   │   │   │
│  │  │  • Tensor operations  • Attention  • Token sampling   │   │   │
│  │  └───────────────────────────────────────────────────────┘   │   │
│  └──────────────────────────────────────────────────────────────┘   │
│       │                                                             │
│       ▼                                                             │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                    GPU/TPU Cluster                           │   │
│  │  • Model sharding  • Tensor parallelism  • Pipeline parallel │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
|-----------|-----------|----------|----------------|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |

Framework Selection Decision Tree

Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
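
For the common case at the bottom of the tree, a minimal vLLM sketch (assumes `pip install vllm` and a CUDA-capable GPU; the model id and sampling values are illustrative):

```python
# Minimal vLLM offline inference sketch. Model id and sampling values are
# illustrative; any Hugging Face causal LM id works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these prompts internally via continuous batching.
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```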

Quantization Techniques

Precision Levels

| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|-----------|------|------------------|----------------|----------|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |

Quantization Methods

| Method | Description | Quality | Speed |
|--------|-------------|---------|-------|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |

Quantization Selection

Quality vs. Efficiency Trade-off:

Quality ──────────────────────────────────────────────▶ Efficiency
   │                                                      │
   │  FP32    FP16    INT8+AWQ   INT8+GPTQ   INT4   INT2  │
   │   ○───────○────────○──────────○──────────○──────○    │
   │   │       │        │          │          │      │    │
   │  Best   Great    Good      Good       Fair   Poor    │
   │                                                      │
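
In practice you usually serve an already-quantized checkpoint rather than quantizing yourself. A sketch of loading one with vLLM (the checkpoint id is illustrative; vLLM infers most quantization formats from the model config):

```python
# Sketch: serving a pre-quantized checkpoint with vLLM.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # illustrative AWQ checkpoint id
    quantization="awq",                # usually inferred; shown for clarity
    dtype="float16",                   # activations stay FP16; weights are INT4
)
```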

Batching Strategies

Static Batching

Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50]  ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80]  ─┘

Problem: Short requests wait for long ones (head-of-line blocking)

Continuous Batching (Preferred)

Time ────────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join the batch as others complete
• No padding waste
• Optimal GPU utilization

Batching Parameters

| Parameter | Description | Trade-off |
|-----------|-------------|-----------|
| max_batch_size | Maximum concurrent requests | Memory vs. throughput |
| max_waiting_tokens | Tokens to wait before forcing a batch | Latency vs. throughput |
| max_num_seqs | Maximum sequences in a batch | Memory vs. concurrency |
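
A sketch of setting these knobs in vLLM (the parameter names are vLLM's; the values are illustrative starting points, and good defaults vary by hardware and version):

```python
# Sketch: tuning continuous-batching limits in vLLM.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=128,             # cap on sequences in a running batch
    max_num_batched_tokens=8192,  # cap on tokens processed per engine step
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)
```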

KV Cache Management

The KV Cache Problem

Attention: softmax(Q × K^T / √d) × V

For each new token generated:
• Attention must be computed against the K and V of ALL previous tokens
• Caching K and V avoids recomputing them, but the cache grows with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
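
A back-of-the-envelope sizing sketch; the model dimensions below are assumptions for a 70B-class model with full multi-head attention, which lands in the same ballpark as the ~8GB figure above (GQA models need far less):

```python
# Back-of-the-envelope KV cache sizing. Dimensions are assumptions for a
# 70B-class model with full multi-head attention.
num_layers   = 80
num_kv_heads = 64     # would be ~8 for a GQA variant
head_dim     = 128
bytes_per_el = 2      # FP16
seq_len      = 4096

# 2x for the separate K and V tensors, per token, across all layers.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
cache_per_request_gb = bytes_per_token * seq_len / 1e9
print(f"{cache_per_request_gb:.1f} GB per 4K-context request")  # ~10.7 GB
```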

PagedAttention (vLLM Innovation)

Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed)   │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE                  │
└──────────────────────────────────────────┘

PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │  ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput

KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
|----------|-------------|----------------|
| PagedAttention | Virtual-memory-style paging for the KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Multi-query / grouped-query attention | Architecture-dependent |
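
Prefix caching is a one-line switch in vLLM; a sketch (model id illustrative):

```python
# Sketch: automatic prefix caching in vLLM, so the KV cache for a shared
# system prompt is computed once and reused across requests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,  # reuse cached KV blocks for common prefixes
)

system = "You are a concise assistant.\n\n"  # shared prefix, cached after first use
params = SamplingParams(max_tokens=128)
llm.generate([system + "Question 1...", system + "Question 2..."], params)
```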

Streaming Response Patterns

Server-Sent Events (SSE)

Client                                Server
   │                                     │
   │──── POST /v1/chat/completions ─────▶│
   │      (stream: true)                 │
   │                                     │
   │◀──── HTTP 200 OK ───────────────────│
   │      Content-Type: text/event-stream│
   │                                     │
   │◀──── data: {"token": "Hello"} ──────│
   │◀──── data: {"token": " world"} ─────│
   │◀──── data: {"token": "!"} ──────────│
   │◀──── data: [DONE] ──────────────────│
   │                                     │

SSE Benefits:

  • HTTP/1.1 compatible
  • Auto-reconnection support
  • Simple to implement
  • Wide client support
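
A minimal client for the stream above, using httpx and the OpenAI-style chat-completions payload (URL and model name are illustrative):

```python
# Sketch: consuming an OpenAI-compatible SSE stream with httpx.
import json
import httpx

payload = {"model": "my-model", "stream": True,
           "messages": [{"role": "user", "content": "Hello"}]}

with httpx.stream("POST", "http://localhost:8000/v1/chat/completions",
                  json=payload, timeout=60) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data: "):
            continue                      # skip blank keep-alives/comments
        data = line[len("data: "):]
        if data == "[DONE]":              # sentinel ending the stream
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```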

WebSocket Streaming

Client                                Server
   │                                     │
   │──── WebSocket Upgrade ─────────────▶│
   │◀──── 101 Switching Protocols ───────│
   │                                     │
   │──── {"prompt": "Hello"} ───────────▶│
   │                                     │
   │◀──── {"token": "Hi"} ───────────────│
   │◀──── {"token": " there"} ───────────│
   │◀──── {"token": "!"} ────────────────│
   │◀──── {"done": true} ────────────────│
   │                                     │

WebSocket Benefits:

  • Bidirectional communication
  • Lower latency
  • Better for chat applications
  • Connection persistence
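
A minimal server-side sketch with FastAPI; `generate_tokens` is a placeholder for a real inference backend's async token iterator:

```python
# Sketch: WebSocket token streaming with FastAPI.
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder: yield tokens from your inference backend here.
    for tok in ["Hi", " there", "!"]:
        yield tok

@app.websocket("/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    req = await ws.receive_json()                 # {"prompt": "..."}
    async for token in generate_tokens(req["prompt"]):
        await ws.send_json({"token": token})      # one message per token
    await ws.send_json({"done": True})            # completion signal
    await ws.close()
```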

Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
|--------|-----|-----------|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |

Speculative Decoding

Concept

Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms = 50ms total

Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (cheap, 5ms)
                      │
                      ▼
Large Model: [Verify T1-T5 in one pass] (15ms)
             Accept: T1, T2, T3 ✓  Reject: T4, T5 ✗
                      │
                      ▼
             [Generate T4, T5 correctly]

Total: ~25ms (roughly 2x speedup at 60% acceptance)
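
A toy sketch of the draft-then-verify control flow; the stub functions below stand in for real draft and target models, with a 60% acceptance probability to mimic draft/target agreement:

```python
# Toy draft-then-verify loop illustrating speculative decoding control flow.
import random

def draft_next(ctx, k=5):
    """Cheap draft model proposes k tokens sequentially."""
    return [f"t{len(ctx) + i}" for i in range(k)]

def target_verify(ctx, proposed):
    """Large model scores all proposals in ONE forward pass; here each
    proposal is accepted with 60% probability to mimic agreement."""
    accepted = []
    for tok in proposed:
        if random.random() < 0.6:
            accepted.append(tok)
        else:
            break                       # first mismatch invalidates the rest
    correction = f"t{len(ctx) + len(accepted)}"  # target's own next token
    return accepted, correction

context = []
while len(context) < 20:
    proposals = draft_next(context)
    accepted, correction = target_verify(context, proposals)
    context += accepted + [correction]  # always advances >= 1 token per pass
print(context)
```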

Speculative Decoding Trade-offs

| Factor | Impact |
|--------|--------|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower drafting |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be less than sequential generation |

Scaling Strategies

Horizontal Scaling

┌─────────────────────────────────────────────────────────┐
│                    Load Balancer                        │
│         (Round-robin, Least-connections)                │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ vLLM    │    │ vLLM    │    │ vLLM    │
    │ Node 1  │    │ Node 2  │    │ Node 3  │
    │ (GPU×4) │    │ (GPU×4) │    │ (GPU×4) │
    └─────────┘    └─────────┘    └─────────┘

Model Parallelism

| Strategy | Description | Use Case |
|----------|-------------|----------|
| Tensor Parallelism | Split each layer across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |

Tensor Parallelism (TP=4):
┌──────────────────────────────────────────┐
│              Layer N                     │
│  GPU0   │   GPU1   │   GPU2   │   GPU3   │
│  25%    │   25%    │   25%    │   25%    │
└──────────────────────────────────────────┘

Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
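
In vLLM both strategies are constructor arguments; a sketch (model id illustrative; requires that many visible GPUs):

```python
# Sketch: tensor parallelism in vLLM. tensor_parallel_size shards each layer
# across GPUs; pipeline_parallel_size stages layers across GPUs/nodes.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,      # TP=4: split every layer across 4 GPUs
    # pipeline_parallel_size=2,  # optionally also stage layers across nodes
)
```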

Latency Optimization Checklist

Pre-deployment

  • Choose appropriate quantization (INT8 for production)
  • Enable continuous batching
  • Configure KV cache size appropriately
  • Set optimal batch size for hardware
  • Enable prefix caching for system prompts

Runtime

  • Monitor GPU memory utilization
  • Track p50/p95/p99 latencies
  • Measure time-to-first-token (TTFT)
  • Monitor tokens-per-second (TPS); a TTFT/TPS measurement sketch follows this list
  • Set appropriate timeouts
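
A sketch of measuring TTFT and TPS against an OpenAI-style streaming endpoint (URL and payload illustrative; it approximates one token per SSE chunk):

```python
# Sketch: measuring time-to-first-token and tokens-per-second from a stream.
import time
import httpx

payload = {"model": "my-model", "stream": True,
           "messages": [{"role": "user", "content": "Hello"}]}

start, first, tokens = time.perf_counter(), None, 0
with httpx.stream("POST", "http://localhost:8000/v1/chat/completions",
                  json=payload, timeout=60) as resp:
    for line in resp.iter_lines():
        if line.startswith("data: ") and line != "data: [DONE]":
            if first is None:
                first = time.perf_counter()  # time of first token
            tokens += 1                      # ~1 token per SSE chunk

elapsed = time.perf_counter() - start
print(f"TTFT: {(first - start) * 1000:.0f} ms, TPS: {tokens / elapsed:.1f}")
```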

Infrastructure

  • Use fastest available interconnect (NVLink, InfiniBand)
  • Minimize network hops
  • Place inference close to users (edge)
  • Consider dedicated inference hardware

Cost Optimization

Cost Drivers

| Factor | Impact | Optimization |
|--------|--------|--------------|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |

Cost Estimation Formula

Monthly Cost =
  (Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
  ────────────────────────────────────────────────────────────────────────────
                                     3600

Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 ≈ $2,778/month
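
The same formula as a helper function, reproducing the worked example:

```python
# Monthly cost estimator implementing the formula above.
def monthly_cost(requests, avg_tokens, gpu_seconds_per_token, dollars_per_gpu_hour):
    gpu_hours = requests * avg_tokens * gpu_seconds_per_token / 3600
    return gpu_hours * dollars_per_gpu_hour

# 10M requests x 500 tokens x 0.001 GPU-s/token at $2/GPU-hour
print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # ~$2,778
```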

Common Patterns

Multi-model Routing

┌─────────────────────────────────────────────────────────┐
│                     Router                              │
│  • Classify request complexity                          │
│  • Route to appropriate model                           │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ Small   │    │ Medium  │    │ Large   │
    │ Model   │    │ Model   │    │ Model   │
    │ (7B)    │    │ (13B)   │    │ (70B)   │
    │ Fast    │    │ Balanced│    │ Quality │
    └─────────┘    └─────────┘    └─────────┘
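
A sketch of a complexity-based router; the keyword heuristic and endpoint map are placeholders (production routers often use a small classifier model):

```python
# Sketch: complexity-based multi-model routing with a crude heuristic.
MODEL_ENDPOINTS = {            # hypothetical endpoints
    "small":  "http://llm-7b:8000/v1",
    "medium": "http://llm-13b:8000/v1",
    "large":  "http://llm-70b:8000/v1",
}

def classify(prompt: str) -> str:
    """Crude heuristic: route by prompt length and reasoning keywords."""
    if len(prompt) > 2000 or any(k in prompt.lower()
                                 for k in ("prove", "analyze", "step by step")):
        return "large"
    return "small" if len(prompt) < 200 else "medium"

def route(prompt: str) -> str:
    return MODEL_ENDPOINTS[classify(prompt)]

print(route("Summarize this sentence."))  # -> small-model endpoint
```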

Caching Strategies

| Cache Type | What to Cache | TTL |
|------------|---------------|-----|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
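
A sketch of the exact-match response cache row; a dict with timestamps keeps it self-contained, where production systems would typically use Redis with a TTL:

```python
# Sketch: exact-match response cache keyed on a hash of (model, prompt).
import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600  # illustrative TTL for the "Varies" row above

def cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def get_or_generate(model: str, prompt: str, generate) -> str:
    key = cache_key(model, prompt)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                       # cache hit: skip inference
    response = generate(prompt)             # cache miss: run the model
    _cache[key] = (time.time(), response)
    return response
```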

Related Skills

  • ml-system-design - End-to-end ML pipeline design
  • rag-architecture - Retrieval-augmented generation patterns
  • vector-databases - Vector search for LLM context
  • ml-inference-optimization - General inference optimization
  • estimation-techniques - Capacity planning for LLM systems

Version History

  • v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews

Last Updated

Date: 2025-12-26

Repository

melodic-software/claude-code-plugins/plugins/systems-design/skills/llm-serving-patterns