
llm-serving-patterns

LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.

allowed_tools: Read, Glob, Grep

Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/llm-serving-patterns ~/.claude/skills/claude-code-plugins

// tip: Run this command in your terminal to install the skill



LLM Serving Patterns

When to Use This Skill

Use this skill when:

  • Designing LLM inference infrastructure
  • Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
  • Implementing quantization for production deployment
  • Optimizing batching and throughput
  • Building streaming response systems
  • Scaling LLM deployments cost-effectively

Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding

LLM Serving Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         LLM Serving Stack                           │
├─────────────────────────────────────────────────────────────────────┤
│  Clients (API, Chat UI, Agents)                                     │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │              Load Balancer / API Gateway                     │   │
│  │  • Rate limiting  • Authentication  • Request routing        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                   Inference Server                           │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │   │
│  │  │  Request    │  │  Batching   │  │  KV Cache           │  │   │
│  │  │  Queue      │──▶│  Engine     │──▶│  Management        │  │   │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘  │   │
│  │       │                                      │               │   │
│  │       ▼                                      ▼               │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │              Model Execution Engine                  │    │   │
│  │  │  • Tensor operations  • Attention  • Token sampling │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────────────┘   │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    GPU/TPU Cluster                           │   │
│  │  • Model sharding  • Tensor parallelism  • Pipeline parallel │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
|---|---|---|---|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |

Framework Selection Decision Tree

Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
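
As a concrete starting point, here is a minimal offline-inference sketch with vLLM (the general-purpose default above); the model name is an assumption, so substitute any Hugging Face model you have access to.

```python
# Minimal vLLM offline inference sketch (model name is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")         # loads weights from the HF Hub
sampling = SamplingParams(temperature=0.7, max_tokens=128)  # per-request decoding settings

outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)
for request_output in outputs:
    print(request_output.outputs[0].text)
```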

Quantization Techniques

Precision Levels

| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |

Quantization Methods

| Method | Description | Quality | Speed |
|---|---|---|---|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |

Quantization Selection

Quality vs. Efficiency Trade-off:

Quality ────────────────────────────────────────────▶ Efficiency
   │                                                      │
   │  FP32    FP16    INT8+AWQ   INT8+GPTQ   INT4   INT2  │
   │   ○───────○────────○──────────○──────────○──────○    │
   │   │       │        │          │          │      │    │
   │  Best   Great    Good      Good       Fair   Poor   │
   │                                                      │
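
As an illustration of how quantization is consumed at serving time, the sketch below loads a pre-quantized AWQ checkpoint in vLLM; the checkpoint name is an assumption, and any AWQ-quantized model from the HF Hub should work the same way.

```python
# Serving an AWQ-quantized checkpoint with vLLM (checkpoint name is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # weights already quantized offline with AWQ
    quantization="awq",                     # select vLLM's AWQ kernels
    dtype="float16",                        # activations stay in FP16
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```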

Batching Strategies

Static Batching

Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50]  ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80]  ─┘

Problem: Short requests wait for long ones (head-of-line blocking)

Continuous Batching (Preferred)

Time ──────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization

Batching Parameters

| Parameter | Description | Trade-off |
|---|---|---|
| max_batch_size | Maximum concurrent requests | Memory vs. throughput |
| max_waiting_tokens | Tokens before forcing batch | Latency vs. throughput |
| max_num_seqs | Maximum sequences in batch | Memory vs. concurrency |
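
For reference, a sketch of how these knobs are typically set in vLLM (values are illustrative; TGI exposes equivalent flags such as --max-waiting-tokens on its launcher instead):

```python
# Continuous-batching configuration in vLLM (values are illustrative).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=256,               # cap on sequences in the running batch
    max_num_batched_tokens=8192,    # per-step token budget for the scheduler
    gpu_memory_utilization=0.90,    # fraction of GPU memory for weights + KV cache
)
```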

KV Cache Management

The KV Cache Problem

Attention(Q, K, V) = softmax(QK^T / √d) × V

For each new token generated:
• It must attend to the K and V of ALL previous tokens
• Caching K and V avoids recomputing them, but the cache grows with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
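
A rough back-of-the-envelope calculation makes this growth explicit. The architecture numbers below are illustrative assumptions for a 70B-class model with full multi-head attention (GQA models cache far less); the result lands in the same ballpark as the ~8GB figure above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size: K and V tensors (factor 2) per layer, FP16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Illustrative 70B-class model: 80 layers, 64 KV heads, head_dim 128, 4K context, FP16.
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
print(f"{per_request / 2**30:.1f} GiB per request")           # ~10 GiB
print(f"{10 * per_request / 2**30:.0f} GiB for 10 requests")  # ~100 GiB
```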

PagedAttention (vLLM Innovation)

Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed)   │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE                  │
└──────────────────────────────────────────┘

PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │  ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput

KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
|---|---|---|
| PagedAttention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Multi-query / grouped-query attention (fewer KV heads) | Architecture-dependent |
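
Prefix caching is typically a one-line switch. The sketch below shows the vLLM flag (model name is illustrative); with it enabled, the shared system prompt's KV blocks are computed once and reused across requests.

```python
# Prefix caching in vLLM: repeated prompt prefixes reuse existing KV blocks.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)

system_prompt = "You are a helpful assistant that answers concisely.\n\n"
prompts = [system_prompt + q for q in
           ["What is PagedAttention?", "What is continuous batching?"]]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```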

Streaming Response Patterns

Server-Sent Events (SSE)

Client                                Server
   │                                     │
   │──── GET /v1/chat/completions ──────▶│
   │      (stream: true)                 │
   │                                     │
   │◀──── HTTP 200 OK ───────────────────│
   │      Content-Type: text/event-stream│
   │                                     │
   │◀──── data: {"token": "Hello"} ──────│
   │◀──── data: {"token": " world"} ─────│
   │◀──── data: {"token": "!"} ──────────│
   │◀──── data: [DONE] ──────────────────│
   │                                     │

SSE Benefits:

  • HTTP/1.1 compatible
  • Auto-reconnection support
  • Simple to implement
  • Wide client support
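
A minimal SSE client sketch against an OpenAI-compatible endpoint (for example, one started with vLLM's OpenAI-compatible server); the base URL and model name here are assumptions.

```python
# Consuming an SSE token stream via the OpenAI-compatible chat API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local server

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,                 # server replies with text/event-stream token deltas
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```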

WebSocket Streaming

Client                                Server
   │                                     │
   │──── WebSocket Upgrade ─────────────▶│
   │◀──── 101 Switching Protocols ───────│
   │                                     │
   │──── {"prompt": "Hello"} ───────────▶│
   │                                     │
   │◀──── {"token": "Hi"} ───────────────│
   │◀──── {"token": " there"} ───────────│
   │◀──── {"token": "!"} ────────────────│
   │◀──── {"done": true} ────────────────│
   │                                     │

WebSocket Benefits:

  • Bidirectional communication
  • Lower latency
  • Better for chat applications
  • Connection persistence
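
A minimal server-side sketch of this flow using FastAPI WebSockets; `stream_tokens` is a hypothetical stand-in for the inference engine's async token stream.

```python
# WebSocket token streaming with FastAPI (run with, e.g., `uvicorn app:app`).
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def stream_tokens(prompt: str):
    """Hypothetical stand-in for the inference engine's async token generator."""
    for token in ["Hi", " there", "!"]:
        yield token

@app.websocket("/ws/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()               # e.g. {"prompt": "Hello"}
    async for token in stream_tokens(request["prompt"]):
        await ws.send_json({"token": token})        # one message per token
    await ws.send_json({"done": True})              # end-of-stream marker
```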

Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
|---|---|---|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |

Speculative Decoding

Concept

Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms = 50ms total

Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (parallel, 5ms)
                      │
                      ▼
Large Model: [Verify T1-T5 in one pass] (15ms)
             Accept: T1, T2, T3 ✓  Reject: T4 ✗ (T5 discarded)
             The verify pass also yields the correct T4 "for free"
                      │
                      ▼
             [Resume drafting from T5]

Total: 4 tokens in ~20ms vs. ~40ms sequentially (~2x speedup at 60% acceptance)

Speculative Decoding Trade-offs

| Factor | Impact |
|---|---|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
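
To make the draft/verify loop concrete, here is a toy greedy sketch in pure Python (engine-agnostic). Production systems accept draft tokens probabilistically with probability min(1, p_target/p_draft) rather than by exact match, and batch the verification into a single forward pass.

```python
def speculative_step(draft_model, target_model, context, k=5):
    """One greedy speculative-decoding step (simplified sketch).

    draft_model / target_model are callables mapping a token list to the next token;
    the target model is expensive but can score all k positions in one batched pass.
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted, ctx = [], list(context)
    for _ in range(k):
        token = draft_model(ctx)
        drafted.append(token)
        ctx.append(token)

    # 2. Verify against the large model; the first mismatch yields the
    #    corrected token "for free" and discards the rest of the draft.
    accepted, ctx = [], list(context)
    for token in drafted:
        target_token = target_model(ctx)
        if target_token != token:
            accepted.append(target_token)
            break
        accepted.append(token)
        ctx.append(token)
    return accepted
```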

Scaling Strategies

Horizontal Scaling

┌─────────────────────────────────────────────────────────┐
│                    Load Balancer                        │
│         (Round-robin, Least-connections)                │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ vLLM    │    │ vLLM    │    │ vLLM    │
    │ Node 1  │    │ Node 2  │    │ Node 3  │
    │ (GPU×4) │    │ (GPU×4) │    │ (GPU×4) │
    └─────────┘    └─────────┘    └─────────┘

Model Parallelism

| Strategy | Description | Use Case |
|---|---|---|
| Tensor Parallelism | Split layers across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |

Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│              Layer N                     │
│  GPU0   │   GPU1   │   GPU2   │   GPU3  │
│  25%    │   25%    │   25%    │   25%   │
└─────────────────────────────────────────┘

Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
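
In practice these strategies are usually engine flags. A sketch of the corresponding vLLM arguments (values and model name are illustrative; pipeline parallelism support depends on the vLLM version):

```python
# Sharding a large model across GPUs with vLLM (illustrative values).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # split every layer across 4 GPUs (TP=4)
    # pipeline_parallel_size=2,    # optionally spread layer ranges across stages (PP=2)
)
```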

Latency Optimization Checklist

Pre-deployment

  • Choose appropriate quantization (INT8 for production)
  • Enable continuous batching
  • Configure KV cache size appropriately
  • Set optimal batch size for hardware
  • Enable prefix caching for system prompts

Runtime

  • Monitor GPU memory utilization
  • Track p50/p95/p99 latencies
  • Measure time-to-first-token (TTFT)
  • Monitor tokens-per-second (TPS)
  • Set appropriate timeouts

Infrastructure

  • Use fastest available interconnect (NVLink, InfiniBand)
  • Minimize network hops
  • Place inference close to users (edge)
  • Consider dedicated inference hardware

Cost Optimization

Cost Drivers

| Factor | Impact | Optimization |
|---|---|---|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |

Cost Estimation Formula

Monthly Cost =
  (Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
  ─────────────────────────────────────────────────────────────────────────────
                                    3600

Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
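
The same formula as a small helper, reproducing the example above:

```python
def monthly_cost(requests_per_month, avg_tokens_per_request,
                 gpu_seconds_per_token, dollars_per_gpu_hour):
    """Monthly GPU cost; 3600 converts GPU-seconds to GPU-hours."""
    gpu_seconds = requests_per_month * avg_tokens_per_request * gpu_seconds_per_token
    return gpu_seconds / 3600 * dollars_per_gpu_hour

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # -> $2,778/month
```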

Common Patterns

Multi-model Routing

┌─────────────────────────────────────────────────────────┐
│                     Router                              │
│  • Classify request complexity                          │
│  • Route to appropriate model                           │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ Small   │    │ Medium  │    │ Large   │
    │ Model   │    │ Model   │    │ Model   │
    │ (7B)    │    │ (13B)   │    │ (70B)   │
    │ Fast    │    │ Balanced│    │ Quality │
    └─────────┘    └─────────┘    └─────────┘
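
A toy sketch of complexity-based routing; the endpoint names, thresholds, and the `estimate_complexity` heuristic are all illustrative assumptions (production routers often use a small classifier model instead).

```python
# Hypothetical complexity-based model router (names and thresholds are illustrative).
MODELS = {"small": "7b-endpoint", "medium": "13b-endpoint", "large": "70b-endpoint"}

def estimate_complexity(prompt: str) -> float:
    """Toy heuristic: long prompts and reasoning keywords score higher."""
    score = min(len(prompt) / 2000, 1.0)
    if any(k in prompt.lower() for k in ("prove", "analyze", "step by step")):
        score += 0.8
    return score

def route(prompt: str) -> str:
    score = estimate_complexity(prompt)
    if score < 0.3:
        return MODELS["small"]    # fast, cheap
    if score < 0.8:
        return MODELS["medium"]   # balanced
    return MODELS["large"]        # best quality

print(route("Summarize this sentence."))                   # -> 7b-endpoint
print(route("Prove the bound and explain step by step."))  # -> 70b-endpoint
```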

Caching Strategies

| Cache Type | What to Cache | TTL |
|---|---|---|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
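
As a sketch of the "Response cache" row above, here is a minimal exact-match cache with a TTL; in production this is typically backed by a shared store such as Redis.

```python
import hashlib
import time

class ResponseCache:
    """Exact-match response cache with a time-to-live (illustrative sketch)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}   # key -> (stored_at, response)

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        entry = self._store.get(self._key(model, prompt))
        if entry is None:
            return None
        stored_at, response = entry
        if time.time() - stored_at > self.ttl:
            return None                     # expired
        return response

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[self._key(model, prompt)] = (time.time(), response)

cache = ResponseCache(ttl_seconds=60)
cache.put("7b", "What is vLLM?", "A high-throughput LLM serving engine.")
print(cache.get("7b", "What is vLLM?"))
```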

Related Skills

  • ml-system-design - End-to-end ML pipeline design
  • rag-architecture - Retrieval-augmented generation patterns
  • vector-databases - Vector search for LLM context
  • ml-inference-optimization - General inference optimization
  • estimation-techniques - Capacity planning for LLM systems

Version History

  • v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews

Last Updated

Date: 2025-12-26

Repository

melodic-software/claude-code-plugins/plugins/systems-design/skills/llm-serving-patterns