
rate-limiting-patterns

Use when implementing rate limiting, throttling, or API quotas. Covers algorithms like token bucket and sliding window, plus distributed rate limiting patterns.

allowed-tools: Read, Glob, Grep

Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/systems-design/skills/rate-limiting-patterns ~/.claude/skills/claude-code-plugins

Tip: Run this command in your terminal to install the skill.


name: rate-limiting-patterns
description: Use when implementing rate limiting, throttling, or API quotas. Covers algorithms like token bucket and sliding window, plus distributed rate limiting patterns.
allowed-tools: Read, Glob, Grep

Rate Limiting Patterns

Patterns for protecting APIs and services through rate limiting, throttling, and quota management.

When to Use This Skill

  • Implementing API rate limiting
  • Choosing rate limiting algorithms
  • Designing distributed rate limiting
  • Setting up quota management
  • Protecting against abuse

Why Rate Limiting

Protection against:
- DDoS attacks
- Brute force attempts
- Resource exhaustion
- Cost overruns (cloud APIs)
- Cascading failures

Business benefits:
- Fair resource allocation
- Predictable performance
- Cost control
- SLA enforcement

Rate Limiting Algorithms

Token Bucket

Concept: Tokens are added at a fixed rate; each request consumes tokens

Configuration:
- Bucket size (max tokens): 100
- Refill rate: 10 tokens/second

Behavior:
┌─────────────────────────┐
│ Bucket (capacity: 100)  │
│ ████████████░░░░░░░░░░  │ 60 tokens available
└─────────────────────────┘
        ↑           ↓
   10 tokens/s   Request takes 1 token

Bursts are allowed up to the bucket size; sustained traffic is then limited to the refill rate.

Characteristics:

  • Allows controlled bursts
  • Simple to implement
  • Memory efficient
  • Most common algorithm

Implementation sketch:

token_bucket:
  tokens = min(tokens + (now - last_update) * rate, capacity)
  last_update = now
  if tokens >= cost:
    tokens -= cost
    return ALLOW
  return DENY
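
The same logic as runnable Python (a minimal sketch; the class and its names are illustrative, not from any particular library):

import time

class TokenBucket:
    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity          # max tokens (burst size)
        self.rate = rate                  # tokens refilled per second
        self.tokens = capacity            # start full so initial bursts pass
        self.last_update = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill for the time elapsed, capped at capacity
        self.tokens = min(self.tokens + (now - self.last_update) * self.rate,
                          self.capacity)
        self.last_update = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True                   # ALLOW
        return False                      # DENY

# The configuration above: bucket of 100, refilled at 10 tokens/second
limiter = TokenBucket(capacity=100, rate=10)
if not limiter.allow():
    print("429 Too Many Requests")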

Leaky Bucket

Concept: Requests enter a queue and are processed at a fixed rate

┌─────────────────────────┐
│ Queue (capacity: 100)   │
│ ██████████████████████  │ Requests waiting
└──────────┬──────────────┘
           │
           ▼ Process at fixed rate (10/sec)
       [Processing]

Smooths traffic to a constant rate.

Characteristics:

  • Smooth output rate
  • No bursts allowed
  • Requests may queue
  • Good for downstream protection
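
A minimal sketch in Python, modeling the bucket as a bounded queue drained by a background thread (illustrative only; a production limiter would process requests asynchronously and report backpressure to callers):

import queue
import threading
import time

class LeakyBucket:
    def __init__(self, capacity: int, rate: float):
        self.q = queue.Queue(maxsize=capacity)   # waiting requests
        self.interval = 1.0 / rate               # seconds between processed requests
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request) -> bool:
        try:
            self.q.put_nowait(request)           # queue if there is room
            return True
        except queue.Full:
            return False                         # bucket overflowed: reject

    def _drain(self):
        while True:
            handler = self.q.get()               # block until a request waits
            handler()                            # process it
            time.sleep(self.interval)            # enforce the fixed output rate

# The diagram above: capacity 100, drained at 10 requests/second
bucket = LeakyBucket(capacity=100, rate=10)
bucket.submit(lambda: print("handled"))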

Fixed Window

Concept: Count requests in fixed time windows

Window: 1 minute, Limit: 100 requests

|-------- Window 1 --------|-------- Window 2 --------|
   95 requests                  ? requests
   [Allow]                      [Reset to 0]

Problem: Boundary burst
End of window 1: 100 requests
Start of window 2: 100 requests
= 200 requests in ~1 second span

Characteristics:

  • Simple to implement
  • Memory efficient
  • Boundary burst problem
  • Good for simple use cases
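
A fixed-window counter needs only one count and the current window id (a sketch; names are illustrative):

import time

class FixedWindow:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.window_id = None    # id of the window the count belongs to
        self.count = 0

    def allow(self) -> bool:
        current_id = int(time.time() // self.window)
        if current_id != self.window_id:
            self.window_id = current_id   # new window: reset the counter
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

# 100 requests per 1-minute window
limiter = FixedWindow(limit=100, window_seconds=60)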

Sliding Window Log

Concept: Track timestamp of each request

Window: 1 minute, Limit: 100

Requests: [t-55s, t-50s, t-45s, ..., t-5s, t-2s, now]
Count all requests in [now - 60s, now]

No boundary burst problem, but memory intensive.

Characteristics:

  • Precise limiting
  • No boundary issues
  • Memory intensive (stores all timestamps)
  • Good for strict limits
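
A sketch using a deque of timestamps; memory grows with the number of requests in the window, which is the cost noted above:

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()   # timestamps of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that fell out of [now - window, now]
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True

# 100 requests per rolling 60 seconds
limiter = SlidingWindowLog(limit=100, window_seconds=60)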

Sliding Window Counter

Concept: Weighted average of current and previous windows

Previous window: 80 requests
Current window: 30 requests (40% through window)

Weighted count = 80 * 0.6 + 30 = 78
Limit: 100
Result: ALLOW (78 < 100)

Characteristics:

  • Approximation (usually good enough)
  • Memory efficient
  • Smooths boundary issues
  • Best balance for most cases
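
A sketch implementing the weighted count from the example above, using two counters instead of a full timestamp log:

import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_id = 0   # id of the current fixed window
        self.current = 0     # requests counted in the current window
        self.previous = 0    # requests counted in the previous window

    def allow(self) -> bool:
        now = time.time()
        current_id = int(now // self.window)
        if current_id != self.window_id:
            # Roll over; if more than one window elapsed, previous is empty
            self.previous = self.current if current_id == self.window_id + 1 else 0
            self.current = 0
            self.window_id = current_id
        fraction = (now % self.window) / self.window   # how far into the window
        weighted = self.previous * (1 - fraction) + self.current
        if weighted >= self.limit:
            return False
        self.current += 1
        return True

# Previous=80, 40% through, current=30 -> 80 * 0.6 + 30 = 78 < 100 -> ALLOW
limiter = SlidingWindowCounter(limit=100, window_seconds=60)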

Algorithm Selection Guide

Algorithm          Burst Handling   Memory     Precision   Use Case
Token Bucket       Allows bursts    Low        Good        General API limiting
Leaky Bucket       No bursts        Low        Good        Smooth rate enforcement
Fixed Window       Boundary burst   Very Low   Poor        Simple limits
Sliding Log        No bursts        High       Exact       Strict compliance
Sliding Counter    Minimal burst    Low        Good        Best general choice

Distributed Rate Limiting

Challenge

Single node: Simple in-memory counter
Multiple nodes: Need coordination

Without coordination:
Node 1: 50 requests (under 100 limit)
Node 2: 50 requests (under 100 limit)
Node 3: 50 requests (under 100 limit)
Total: 150 requests (over 100 limit!)

Pattern 1: Centralized (Redis)

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │     │ Node 2  │     │ Node 3  │
└────┬────┘     └────┬────┘     └────┬────┘
     │               │               │
     └───────────────┼───────────────┘
                     │
              ┌──────▼──────┐
              │    Redis    │
              │ (counters)  │
              └─────────────┘

Pros: Accurate, consistent
Cons: Redis dependency, latency, single point of failure

Pattern 2: Local + Sync

Each node gets a fraction of the limit:
- 3 nodes, 100 limit → 33 per node

Periodically sync to rebalance unused capacity.

Pros: Low latency, resilient
Cons: Less precise, sync complexity

Pattern 3: Sticky Sessions

Route same client to same node (by IP, API key, etc.)

Pros: Simple, no coordination needed
Cons: Uneven load, failover complexity

Redis Implementation

Token Bucket with Redis:

EVALSHA token_bucket_script 1 {key}
  {capacity} {refill_rate} {tokens_requested}

Script:
1. Get current tokens and timestamp
2. Calculate tokens to add since last request
3. If enough tokens, decrement and allow
4. Return tokens remaining
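
A sketch of that flow using redis-py with an embedded Lua script; the script runs atomically on the server, so concurrent nodes cannot race. The key layout, field names, and the client-supplied timestamp are assumptions of this sketch (client clock skew matters in practice):

import time
import redis

TOKEN_BUCKET_LUA = """
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local cost     = tonumber(ARGV[3])
local now      = tonumber(ARGV[4])

local state  = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts     = tonumber(state[2]) or now

-- Step 2: add tokens for the time elapsed since the last request
tokens = math.min(tokens + (now - ts) * rate, capacity)

-- Step 3: if enough tokens, decrement and allow
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)  -- drop idle buckets

-- Step 4: return the decision and tokens remaining (truncated to an integer)
return {allowed, tokens}
"""

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)  # loads script, returns a callable

def allow(key: str, capacity=100, rate=10, cost=1) -> bool:
    allowed, remaining = token_bucket(keys=[key], args=[capacity, rate, cost, time.time()])
    return allowed == 1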

Rate Limit Headers

De facto standard headers (X-RateLimit-*) used to communicate limits to clients:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1640000000
Retry-After: 30  (when rate limited)

Or the IETF draft standard (RateLimit header fields), where Reset is seconds until the window resets:
RateLimit-Limit: 100
RateLimit-Remaining: 45
RateLimit-Reset: 30

Rate Limit Response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retry_after": 30,
    "limit": 100,
    "window": "1m"
  }
}

Multi-Tier Rate Limiting

Apply limits at multiple levels (composed in the sketch after this list):

Level 1: Global (protect infrastructure)
  - 10,000 req/sec across all clients

Level 2: Per-tenant (fair allocation)
  - 1,000 req/min per organization

Level 3: Per-user (prevent abuse)
  - 100 req/min per user

Level 4: Per-endpoint (protect expensive operations)
  - 10 req/min for /export endpoint
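
The tiers compose: a request must pass every level, checked broadest-first. A sketch reusing the TokenBucket class from the token bucket section above (note that when a later tier denies, earlier tiers have already spent a token; real systems either tolerate that or refund it):

# One limiter per tier, rates expressed in tokens/second
tiers = [
    TokenBucket(capacity=10_000, rate=10_000),    # global: 10,000 req/sec
    TokenBucket(capacity=1_000,  rate=1_000/60),  # tenant: 1,000 req/min
    TokenBucket(capacity=100,    rate=100/60),    # user:   100 req/min
    TokenBucket(capacity=10,     rate=10/60),     # /export: 10 req/min
]

def allow_request() -> bool:
    return all(bucket.allow() for bucket in tiers)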

Quota Management

Quota vs Rate Limit

Rate Limit: Requests per time window (burst protection)
  - 100 requests/minute

Quota: Total allocation over period (budget)
  - 10,000 API calls/month

Quota Tracking

Track usage:
- Per API key
- Per endpoint
- Per operation type

Alert thresholds:
- 80% usage: Warning notification
- 100% usage: Hard block or overage charges
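
The threshold logic is a small pure function (a sketch; thresholds and messages are illustrative):

def check_quota(used: int, quota: int):
    """Return (allowed, alert) for the current billing period."""
    if used >= quota:
        return False, "100% used: hard block or start overage billing"
    if used >= 0.8 * quota:
        return True, "80% used: send warning notification"
    return True, None

allowed, alert = check_quota(used=8_500, quota=10_000)
# -> (True, "80% used: send warning notification")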

Best Practices

Graceful Degradation

Instead of hard block:
1. Reduce quality (lower resolution, fewer results)
2. Queue requests (process later)
3. Serve cached responses
4. Allow burst with penalty (slower recovery)

Client-Side Handling

Implement exponential backoff (sketched after this list):
1. Receive 429
2. Wait Retry-After (or 1s)
3. Retry
4. If 429 again, wait 2s
5. Continue doubling up to max (e.g., 60s)
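
A client-side sketch using only the Python standard library; it honors Retry-After when present (assuming the delta-seconds form) and otherwise doubles the wait, with jitter to avoid synchronized retries:

import random
import time
import urllib.error
import urllib.request

def get_with_backoff(url: str, max_wait: float = 60.0, max_tries: int = 8):
    wait = 1.0
    for _ in range(max_tries):
        try:
            return urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise                              # only retry rate-limit errors
            retry_after = e.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else wait
            time.sleep(min(delay, max_wait) + random.random())  # add jitter
            wait = min(wait * 2, max_wait)         # exponential growth, capped
    raise RuntimeError(f"still rate limited after {max_tries} attempts")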

Testing Rate Limits

Test scenarios (a sample test follows the list):
- Burst traffic
- Sustained high traffic
- Clock skew (distributed systems)
- Recovery after limit
- Multiple client types
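
For example, a burst-and-recovery test against the TokenBucket sketch above:

def test_burst_then_recovery():
    limiter = TokenBucket(capacity=5, rate=1)      # small bucket for a fast test
    assert all(limiter.allow() for _ in range(5))  # burst up to capacity passes
    assert not limiter.allow()                     # next request is denied
    time.sleep(1.1)                                # wait for ~1 token to refill
    assert limiter.allow()                         # recovery after the limit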

Related Skills

  • api-design-fundamentals - API design patterns
  • idempotency-patterns - Safe retries
  • quality-attributes-taxonomy - Performance attributes