---
name: performance-optimization
description: "Apply systematic performance optimization techniques for Python and Rust code: estimation + profiling, API/bulk design, algorithmic wins, cache-friendly memory layout, fewer allocations, fast paths, caching, and compiler-friendly hot loops. Use for performance code reviews, refactors, and profiling-driven optimizations. Keywords: performance, latency, throughput, cache, allocation, memory layout, PyO3, msgspec, tokio, async, pprof, py-spy, perf."
license: Apache-2.0
compatibility: Skills-compatible coding agents working on Python and Rust codebases.
metadata:
  upstream_url: "https://abseil.io/fast/hints"
  upstream_authors: "Jeff Dean; Sanjay Ghemawat"
  upstream_original_version: "2023-07-27"
  upstream_last_updated: "2025-12-25"
---
Python & Rust Performance Hints (Jeff Dean & Sanjay Ghemawat style)
This skill packages key ideas from Abseil's Performance Hints document, adapted for Python and Rust development.
Use it to:
- review Python/Rust code for performance risks
- propose high-impact optimizations with explicit tradeoffs
- design APIs/data structures that keep future optimizations possible
- write an experiment plan (profile + microbenchmark) to validate changes
Scope and guardrails
- Scope: single-process / single-binary performance (CPU, memory, allocations, cache behavior).
- Do not: change externally observable behavior unless the user explicitly agrees.
- Do not: introduce undefined behavior, data races, or brittle "clever" micro-opts without evidence.
- Default philosophy: choose the faster alternative when it doesn't materially harm readability or complexity; otherwise, measure first.
When to apply
Use this skill when the task involves any of:
- reducing latency or improving throughput
- cutting memory footprint or allocation rate
- improving cache locality / reducing cache misses
- designing performant APIs (bulk ops, view types, threading model)
- reviewing performance-sensitive Python or Rust changes
- interpreting a flat profile and finding next steps
What to ask for (minimum inputs)
If you don't have enough information, ask for the smallest set that changes your recommendation quality:
- Goal: latency vs throughput vs memory (and the SLO if any)
- Where: hot path vs init vs test-only (and typical input sizes)
- Evidence: profile / flame graph / perf counters / allocation profile (if available)
- Constraints: correctness constraints, API constraints, thread-safety requirements
If none exists yet, proceed with static analysis + "what to measure first".
Workflow: how an agent should use these hints
Step 1 — classify the code
- Test code: mostly care about asymptotic complexity and test runtime.
- Application code: separate init/cold vs hot path.
- Library code: prefer "safe, low-complexity" performance techniques because you can't predict callers.
Step 2 — do a back-of-the-envelope estimate
Before implementing changes, estimate what might dominate:
- Count expensive operations (seeks, round-trips, allocations, bytes touched, comparisons, etc.)
- Multiply by rough cost.
- If latency matters and there is concurrency, consider overlap.
Reference latency table (rough order-of-magnitude)
| Operation | Approx time |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 3 ns |
| Branch mispredict | 5 ns |
| Mutex lock/unlock (uncontended) | 15 ns |
| Main memory reference | 50 ns |
| Python function call overhead | 50-100 ns |
| Rust function call (non-inlined) | 1-10 ns |
| PyO3 GIL acquire/release | 100-500 ns |
| Compress 1K bytes with Snappy | 1,000 ns |
| Read 4KB from SSD | 20,000 ns |
| Round trip within same datacenter | 50,000 ns |
| Read 1MB sequentially from memory | 64,000 ns |
| Read 1MB over 100 Gbps network | 100,000 ns |
| Read 1MB from SSD | 1,000,000 ns (1 ms) |
| Disk seek | 5,000,000 ns (5 ms) |
| Read 1MB sequentially from disk | 10,000,000 ns (10 ms) |
| Send packet CA→Netherlands→CA | 150,000,000 ns (150 ms) |
Python-specific costs
| Operation | Approx time |
|---|---|
| dict lookup | 20-50 ns |
| list.append | 20-40 ns |
| getattr on object | 50-100 ns |
| isinstance check | 30-60 ns |
| JSON parse (stdlib) 1KB | 50-100 us |
| msgspec parse 1KB | 5-15 us |
| Django ORM query (simple) | 1-10 ms |
| Django ORM async query | 0.5-5 ms |
Rust-specific costs
| Operation | Approx time |
|---|---|
| HashMap lookup | 10-30 ns |
| Vec push (no realloc) | 2-10 ns |
| String allocation (small) | 20-50 ns |
| Arc clone | 10-20 ns |
| tokio task spawn | 200-500 ns |
| async channel send | 50-200 ns |
Estimation examples (templates)
Web request through PyO3 bridge:
- Rust HTTP parsing: ~1us
- GIL acquisition: ~200ns
- Python handler execution: ~50us (simple) to ~5ms (with ORM)
- Response serialization (msgspec): ~10us
- GIL release + Rust response: ~200ns
- Total: ~50us to ~5ms depending on handler complexity
Batch processing 10K items:
- Per-item Python function call: 10K × 100ns = 1ms (overhead alone)
- With msgspec struct validation: 10K × 500ns = 5ms
- With dict allocation per item: 10K × 50ns = 0.5ms (plus GC pressure)
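To turn these templates into quick arithmetic, here is a minimal Python sketch; the COSTS_NS values and operation names are illustrative assumptions drawn from the rough tables above, not measurements:
# Rough per-operation costs in ns (order-of-magnitude, taken from the tables above)
COSTS_NS = {
    "py_call": 100,           # Python function call overhead
    "dict_alloc": 50,         # dict allocation
    "msgspec_validate": 500,  # msgspec struct validation
    "gil_crossing": 300,      # PyO3 GIL acquire/release
}
def estimate_ms(counts: dict[str, int]) -> float:
    # Multiply operation counts by rough per-op costs; return milliseconds
    return sum(COSTS_NS[op] * n for op, n in counts.items()) / 1e6
# 10K items, one Python call plus one dict allocation each -> ~1.5 ms
print(estimate_ms({"py_call": 10_000, "dict_alloc": 10_000}))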
Step 3 — measure before paying complexity
When you can, measure to validate impact:
- Python profiling:
  - py-spy for sampling profiles (low overhead, production-safe)
  - cProfile for deterministic profiling (high overhead)
  - memray or memory_profiler for allocation profiling
  - scalene for CPU + memory + GPU profiling
- Rust profiling:
  - perf for Linux sampling profiles
  - the flamegraph crate for generating flame graphs
  - criterion for microbenchmarks
  - dhat for allocation profiling
- Watch for GIL contention in Python: contention can lower CPU usage and hide the "real" bottleneck.
- Watch for async runtime overhead in Rust: too many small tasks can hurt more than help.
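For a quick before/after comparison with no extra dependencies, a minimal sketch using the stdlib timeit module; the two versions are stand-in implementations of the same dedup operation, not code from this project:
import timeit
# Two stand-in implementations of the same operation (order-preserving dedup)
def old_version(items):
    out = []
    for x in items:
        if x not in out:  # O(N) membership check -> O(N^2) total
            out.append(x)
    return out
def new_version(items):
    return list(dict.fromkeys(items))  # O(N)
items = list(range(1_000)) * 2
for fn in (old_version, new_version):
    per_call = timeit.timeit(lambda: fn(items), number=20) / 20
    print(f"{fn.__name__}: {per_call * 1e6:.1f} us/call")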
Step 4 — pick the biggest lever first
Prioritize in this order unless evidence suggests otherwise:
- Algorithmic complexity wins (O(N²) → O(N log N) or O(N))
- Data structure choice / memory layout (cache locality; fewer cache lines)
- Allocation reduction (fewer allocs, better reuse)
- Avoid unnecessary work (fast paths, precompute, defer)
- Language boundary optimization (minimize Python↔Rust crossings)
- Compiler/interpreter friendliness (simplify hot loops, reduce abstraction overhead)
Step 5 — produce an LLM-friendly output
When you respond to the user, use this structure:
- Hot path hypothesis (what you think dominates, and why)
- Top issues (ranked): issue → evidence/estimate → proposed fix → expected impact
- Patch sketch (minimal code changes or pseudocode)
- Tradeoffs & risks (correctness, memory, API, complexity)
- Measurement plan (what to profile/benchmark and success criteria)
Techniques and patterns
1) API design for performance
Use bulk APIs to amortize overhead
When: callers do N similar operations (lookups, deletes, updates, decoding, locking).
Why: reduce boundary crossings and repeated fixed costs (locks, dispatch, decoding, syscalls).
Python patterns:
# BAD: N database round trips
for user_id in user_ids:
    user = await User.objects.aget(id=user_id)
# GOOD: 1 database round trip
users = await sync_to_async(list)(User.objects.filter(id__in=user_ids))
users_by_id = {u.id: u for u in users}
Rust patterns:
// BAD: N individual operations
for id in ids {
    let result = cache.get(&id)?;
    // ...
}
// GOOD: Batch lookup
let results = cache.get_many(&ids)?;
Prefer view types for function arguments
Python:
- Use Sequence[T] or Iterable[T] instead of list[T] when you don't mutate
- Accept bytes or memoryview instead of copying to bytearray (see the sketch after this list)
- Use msgspec.Struct for zero-copy deserialization
Rust:
- Use &[T] or impl AsRef<[T]> instead of Vec<T> when you don't need ownership
- Use &str instead of String for read-only string access
- Use Cow<'_, T> when you might need to own or borrow
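A minimal Python sketch of the view-type guidance above; the checksum and total_length functions are illustrative names, not library APIs:
from collections.abc import Iterable
def checksum(data: bytes | memoryview) -> int:
    # Any bytes-like view works; callers can pass a slice of a large buffer without copying
    return sum(data) & 0xFF
def total_length(items: Iterable[str]) -> int:
    # Accepts lists, tuples, and generators alike; no list() copy required
    return sum(len(s) for s in items)
buf = bytearray(b"hello world" * 1000)
print(checksum(memoryview(buf)[:11]))               # view into buf, no copy
print(total_length(s for s in ("a", "bb", "ccc")))  # 6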
Thread-compatible vs thread-safe types
Python:
- Default to thread-compatible (external GIL or explicit locks)
- Use threading.local() for per-thread state (see the sketch after this list)
- Prefer asyncio over threads for I/O-bound work
Rust:
- Default to Send + Sync for shared state
- Use Arc<RwLock<T>> only when needed; prefer message passing
- Consider dashmap or sharded maps for high-contention scenarios
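For the Python side, a minimal sketch of per-thread state with threading.local(); the scratch-buffer use case is illustrative:
import threading
_local = threading.local()
def get_scratch_buffer() -> bytearray:
    # Each thread lazily gets its own buffer, so no lock is needed
    if not hasattr(_local, "buffer"):
        _local.buffer = bytearray(4096)
    return _local.buffer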
2) Algorithmic improvements
The rare-but-massive wins.
Reduce complexity class
Common transformations:
- O(N²) → O(N log N) or O(N)
- O(N log N) sorted-list intersection → O(N) using a hash set
- O(log N) tree lookup → O(1) using hash lookup
Python-specific patterns
# BAD: O(N²) - checking membership in list
for item in items:
    if item in seen_list:  # O(N) lookup
        continue
    seen_list.append(item)
# GOOD: O(N) - using set
seen = set()
for item in items:
    if item in seen:  # O(1) lookup
        continue
    seen.add(item)
# BETTER: O(N) - using dict.fromkeys for deduplication
unique_items = list(dict.fromkeys(items))
Rust-specific patterns
// BAD: O(N²) - nested iteration
for a in &items_a {
    for b in &items_b {
        if a.key == b.key {
            // ...
        }
    }
}
// GOOD: O(N) - hash lookup
let b_map: HashMap<_, _> = items_b.iter().map(|b| (&b.key, b)).collect();
for a in &items_a {
    if let Some(b) = b_map.get(&a.key) {
        // ...
    }
}
3) Better memory representation and cache locality
Python: Prefer slots and dataclasses
# BAD: Regular class with __dict__
class Item:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
# GOOD: Slots-based class (smaller memory, faster attribute access)
class Item:
    __slots__ = ('x', 'y', 'z')
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
# BETTER: msgspec.Struct (even more compact, fast serialization)
class Item(msgspec.Struct):
    x: int
    y: int
    z: int
Python: Prefer numpy/array for numeric data
# BAD: List of floats (each is a Python object)
values = [1.0, 2.0, 3.0, ...] # ~28 bytes per float
# GOOD: numpy array (contiguous, cache-friendly)
values = np.array([1.0, 2.0, 3.0, ...]) # 8 bytes per float
# GOOD: array module for simpler cases
from array import array
values = array('d', [1.0, 2.0, 3.0, ...])
Rust: Memory layout and padding
// With #[repr(C)] (fields kept in declaration order): 24 bytes due to padding
#[repr(C)]
struct Item {
    flag: bool,  // 1 byte + 7 padding
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes + 4 padding
}
// Reordered largest-first: 16 bytes
#[repr(C)]
struct ItemReordered {
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes
    flag: bool,  // 1 byte + 3 padding
}
// Note: the default repr(Rust) may reorder fields for you; use #[repr(C)] when
// ABI/layout stability matters, and #[repr(packed)] only with care.
Rust: Indices instead of pointers
// Pointer-heavy: 8 bytes per link plus a separate heap allocation per node; poor cache locality
struct Node {
    data: i32,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}
// Index-based: small indices into one contiguous Vec of nodes; better cache locality
struct Tree {
    nodes: Vec<NodeData>,
}
struct NodeData {
    data: i32,
    left: Option<u32>,  // index into nodes
    right: Option<u32>,
}
Rust: SmallVec and tinyvec for small collections
use smallvec::SmallVec;
// Stack allocation for up to 8 elements, heap only if larger
let mut items: SmallVec<[Item; 8]> = SmallVec::new();
4) Reduce allocations
Python: Avoid unnecessary allocations
# BAD: Creates new list on every call
def process(items):
    return [transform(item) for item in items]
# GOOD: Generator for streaming (no intermediate list)
def process(items):
    for item in items:
        yield transform(item)
# GOOD: Pre-allocate when size is known
def process(items):
    result = [None] * len(items)
    for i, item in enumerate(items):
        result[i] = transform(item)
    return result
Python: Reuse objects
# BAD: New dict on every iteration
for data in stream:
    result = {}  # allocation
    result['key'] = process(data)
    yield result
# GOOD: Reuse dict (if consumers don't hold reference)
result = {}
for data in stream:
    result.clear()
    result['key'] = process(data)
    yield result  # Caveat: only if consumer processes immediately
Python: Use slots to reduce memory
# Without __slots__: ~152 bytes per instance
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
# With __slots__: ~56 bytes per instance
class Point:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y
Rust: Pre-allocate with capacity
// BAD: Multiple reallocations
let mut results = Vec::new();
for item in items {
    results.push(transform(item));
}
// GOOD: Single allocation
let mut results = Vec::with_capacity(items.len());
for item in items {
    results.push(transform(item));
}
// BETTER: Use collect with size hint
let results: Vec<_> = items.iter().map(transform).collect();
Rust: Avoid cloning when borrowing suffices
// BAD: Unnecessary clone
fn process(data: &Data) -> String {
    let s = data.name.clone();  // allocation
    format!("Hello, {}", s)
}
// GOOD: Borrow
fn process(data: &Data) -> String {
    format!("Hello, {}", &data.name)
}
5) Avoid unnecessary work
Fast paths for common cases
Python:
# BAD: Always takes slow path
def parse_int(s: str) -> int:
    return int(s)  # Handles all edge cases
# GOOD: Fast path for common ASCII digits
def parse_int(s: str) -> int:
    if len(s) <= 10 and s.isascii() and s.isdigit():  # Fast check (ASCII digits only)
        result = 0
        for c in s:
            result = result * 10 + (ord(c) - 48)
        return result
    return int(s)  # Fallback for edge cases
Rust:
// Fast path for single-byte varint (common case)
fn parse_varint(data: &[u8]) -> (u64, usize) {
    if data[0] < 128 {
        return (data[0] as u64, 1);  // 90% of cases
    }
    parse_varint_slow(data)  // Rare multi-byte
}
Defer expensive computations
Python:
# BAD: Always computes expensive value
def process(data, config):
    expensive = compute_expensive(data)  # Always runs
    if config.needs_expensive:
        use(expensive)
# GOOD: Defer until needed
def process(data, config):
    if config.needs_expensive:
        expensive = compute_expensive(data)  # Only when needed
        use(expensive)
Move loop-invariant code outside loops
# BAD: Repeated attribute lookup
for item in items:
    result = self.config.transform_fn(item)  # 2 lookups per iteration
# GOOD: Hoist invariant lookups
transform = self.config.transform_fn
for item in items:
    result = transform(item)
Cache computed results
from functools import lru_cache
# Cache expensive computations
@lru_cache(maxsize=1024)
def expensive_computation(key: str) -> Result:
    ...  # expensive work; repeated calls with the same key hit the cache
6) Python↔Rust boundary optimization (PyO3)
Minimize GIL crossings
// BAD: Call back into Python for each item, one at a time
fn process_items(py: Python, items: Vec<PyObject>) -> PyResult<Vec<PyObject>> {
    items.iter().map(|item| {
        // Each item crosses into Python individually under the GIL
        process_one(py, item)
    }).collect()
}
// GOOD: Hold GIL for batch extraction, release it for Rust-only work
fn process_items(py: Python, items: Vec<PyObject>) -> PyResult<Vec<PyObject>> {
    // Extract plain Rust data while holding the GIL
    let data: Vec<_> = items.iter().map(|i| extract_data(py, i)).collect::<PyResult<_>>()?;
    // Release the GIL for CPU-intensive work
    let results: Vec<_> = py.allow_threads(|| {
        data.par_iter().map(|d| rust_process(d)).collect()
    });
    // Re-acquire GIL to build Python objects
    results.iter().map(|r| to_python(py, r)).collect()
}
Batch data across the boundary
# BAD: N PyO3 calls
for item in items:
    rust_process(item)
# GOOD: Single PyO3 call with batch
rust_process_batch(items)
Use zero-copy types
// Accept bytes without copying
#[pyfunction]
fn process_bytes(data: &[u8]) -> PyResult<Vec<u8>> {
    // data is a view into Python bytes, no copy
    Ok(transform(data))
}
// Use Py<PyBytes> for owned bytes without copying
#[pyfunction]
fn process_bytes_owned(py: Python, data: Py<PyBytes>) -> PyResult<Py<PyBytes>> {
    let bytes = data.as_bytes(py);
    // ...
}
7) Async optimization
Python async patterns
# BAD: Sequential await
async def fetch_all(urls):
    results = []
    for url in urls:
        results.append(await fetch(url))  # Sequential
    return results
# GOOD: Concurrent await
async def fetch_all(urls):
    return await asyncio.gather(*[fetch(url) for url in urls])
# BETTER: Bounded concurrency
async def fetch_all(urls, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)
    async def fetch_with_limit(url):
        async with semaphore:
            return await fetch(url)
    return await asyncio.gather(*[fetch_with_limit(url) for url in urls])
Rust async patterns
// BAD: Sequential await
async fn fetch_all(urls: Vec<Url>) -> Vec<Response> {
    let mut results = Vec::new();
    for url in urls {
        results.push(fetch(&url).await);  // Sequential
    }
    results
}
// GOOD: Concurrent with join_all
async fn fetch_all(urls: Vec<Url>) -> Vec<Response> {
    futures::future::join_all(urls.iter().map(fetch)).await
}
// BETTER: Bounded concurrency with buffer_unordered
use futures::stream::{self, StreamExt};
async fn fetch_all(urls: Vec<Url>) -> Vec<Response> {
    stream::iter(urls)
        .map(|url| fetch(url))
        .buffer_unordered(10)  // Max 10 concurrent
        .collect()
        .await
}
8) Reduce logging and stats costs
Python logging in hot paths
# BAD: Logging overhead even when disabled
for item in items:
    logger.debug(f"Processing {item}")  # String formatting always happens
# GOOD: Check level first
if logger.isEnabledFor(logging.DEBUG):
    for item in items:
        logger.debug(f"Processing {item}")
# BETTER: Remove logging from innermost loops entirely
Rust logging in hot paths
// BAD: log! macro overhead in hot loop
for item in items {
    log::debug!("Processing {:?}", item);  // Per-iteration level check; args formatted when enabled
}
// GOOD: Check level outside loop
if log::log_enabled!(log::Level::Debug) {
    for item in items {
        log::debug!("Processing {:?}", item);
    }
}
// BETTER: Use tracing with static filtering
#[tracing::instrument(skip_all)]
fn process_batch(items: &[Item]) {
    // Span created once, not per item
    for item in items {
        // Hot loop without logging
    }
}
Flat-profile playbook
If no single hotspot dominates:
- Don't discount many small wins (twenty 1% improvements can matter).
- Look for loops closer to the top of call stacks (flame graphs help).
- Consider structural refactors (one-shot construction instead of incremental mutation).
- Replace overly general abstractions with specialized code.
- Reduce allocations (allocation profiles help).
- Use hardware counters (cache misses, branch misses) to find invisible costs.
- Python-specific: Look for GIL contention, excessive object creation, slow imports.
- Rust-specific: Look for excessive async overhead, unnecessary clones, poor cache locality.
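To make the one-shot-construction item above concrete, a minimal Python sketch (string building is just one common instance of the pattern):
# BAD: incremental mutation -- repeated concatenation can degrade to quadratic copying
def render_bad(rows):
    out = ""
    for row in rows:
        out += f"{row}\n"
    return out
# GOOD: one-shot construction -- build the pieces, join once
def render_good(rows):
    return "".join(f"{row}\n" for row in rows)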
Quick review checklist
When reviewing a performance-sensitive change, scan for:
Python
- Any O(N²) behavior on realistic N?
- List comprehension where generator would suffice?
- Missing __slots__ on frequently instantiated classes?
- dict/list allocation inside hot loops?
- Sequential awaits where concurrent would work?
- Repeated attribute lookups in loops?
- Logging/debug in innermost loops?
- Using stdlib json instead of msgspec/orjson? (see the sketch after this checklist)
- Synchronous I/O in async context?
- Missing @lru_cache for repeated expensive computations?
Rust
- Any O(N²) behavior on realistic N?
- Missing with_capacity() for known-size collections?
- Unnecessary .clone() where a borrow would work?
- String allocation in hot loops (use &str or Cow)?
- Box/Arc where stack allocation would work?
- Excessive async task spawning for small operations?
- Poor struct field ordering (padding waste)?
- Missing #[inline] on critical small functions?
- Lock contention in concurrent code?
- Logging format overhead in hot paths?
PyO3 boundary
- Multiple GIL acquisitions where one would suffice?
- Python object creation in hot Rust loops?
- Missing allow_threads() for CPU-bound Rust work?
- Copying data across the boundary when a view would work?
- Single-item APIs where batch API would reduce crossings?
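As a concrete instance of the stdlib-json checklist item, a minimal msgspec decode sketch (the Item fields are illustrative):
import msgspec
class Item(msgspec.Struct):
    id: int
    name: str
decoder = msgspec.json.Decoder(list[Item])  # build once, reuse across calls
def parse_items(payload: bytes) -> list[Item]:
    # Typed decode: validates while parsing, no intermediate dicts
    return decoder.decode(payload)
print(parse_items(b'[{"id": 1, "name": "a"}]'))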
Examples (inputs → outputs)
Example A — Python list vs generator
Input:
def process_large_file(path):
    lines = open(path).readlines()  # Load entire file
    return [transform(line) for line in lines]  # Another full copy
Agent output:
- Issue: Two full copies of file contents; O(N) memory for O(N) items.
- Fix: Use generator for streaming:
def process_large_file(path):
    with open(path) as f:
        for line in f:
            yield transform(line)
- Expected impact: O(1) memory instead of O(N).
- Tradeoff: Can only iterate once; caller must handle streaming.
Example B — Rust unnecessary clones
Input:
fn build_response(data: &ResponseData) -> String {
    let name = data.name.clone();
    let id = data.id.clone();
    format!("User {} ({})", name, id)
}
Agent output:
- Issue: Unnecessary clones; format! can borrow directly.
- Fix:
fn build_response(data: &ResponseData) -> String {
    format!("User {} ({})", &data.name, &data.id)
}
- Expected impact: 2 fewer allocations per call.
Example C — PyO3 batch optimization
Input:
# Python side
for item in items:
    result = rust_validate(item)  # 10K calls
Agent output:
- Issue: 10K GIL acquisitions/releases (~2-5ms overhead alone).
- Fix: Add batch API:
#[pyfunction]
fn rust_validate_batch(items: Vec<Item>) -> Vec<bool> {
    items.iter().map(|i| validate(i)).collect()
}
results = rust_validate_batch(items) # 1 call
- Expected impact: ~10-50x faster for large batches.
- Tradeoff: Must collect all items first; can't stream results.