Expert-level Rust systems programming for high-performance, critical systems like Reth (Ethereum execution client). Use when working on: performance optimization, systems-level code, blockchain execution layers, memory-critical paths, concurrent systems, or production-grade infrastructure requiring staff engineer precision. Specializes in: allocation reduction, redundancy elimination, prefetching, parallelization, and simplicity-first optimization.
Install
git clone https://github.com/yongkangc/agent-rules /tmp/agent-rules && cp -r /tmp/agent-rules/claude_code_configs/skills/rust-systems-engineering ~/.claude/skills/agent-rules
Tip: Run this command in your terminal to install the skill.
name: rust-systems-engineering
Rust Systems Engineering - Reth & High-Performance Critical Systems
Expert guidance for staff-level Rust systems programming with focus on Reth architecture patterns and production-grade performance engineering.
Core Engineering Principles
1. Simplicity First
- Measure before optimizing: Use benchmarks and profilers to identify actual bottlenecks
- Clear over clever: Direct code paths beat abstraction layers in hot paths
- Reduce indirection: Virtual dispatch adds runtime overhead; excessive trait bounds add compile-time and readability overhead
- Delete code: The fastest code is code that doesn't run
2. Modularity & Reusability
- Design components as independent, well-tested crates
- Each module should have clear boundaries and minimal coupling
- Follow Reth's philosophy: "Components should be reusable as standalone libraries"
- Document public APIs with examples and performance characteristics
3. Safety Without Compromise
- Leverage Rust's type system to eliminate entire bug classes
- Use unsafe only when measured performance gains justify it
- Document all safety invariants explicitly in comments
- Prefer safe abstractions that compile to efficient code
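As an illustration of the unsafe guidance above, here is a minimal sketch that confines unsafe to a single expression, checks the invariant once up front, and records it in a SAFETY comment. The function name sum_prefix is hypothetical. Whether skipping the bounds check pays off is something only a benchmark can tell, since the compiler often removes such checks on its own.

```rust
/// Sums the first `n` elements of `data`, wrapping on overflow.
///
/// Panics if `n > data.len()`.
pub fn sum_prefix(data: &[u64], n: usize) -> u64 {
    // Check the invariant once, outside the loop.
    assert!(n <= data.len(), "n must not exceed data.len()");
    let mut total = 0u64;
    for i in 0..n {
        // SAFETY: `i < n` by the loop bound and `n <= data.len()` by the
        // assert above, so `i` is always in bounds.
        total = total.wrapping_add(unsafe { *data.get_unchecked(i) });
    }
    total
}
```

The safe equivalent, data[..n].iter().sum(), should be the default until a profile shows the bounds check matters.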
4. Production-Grade Observability
- Add structured logging with tracing for debugging production issues
- Instrument critical paths with metrics (latency, throughput, error rates)
- Include debug assertions that can be disabled in release builds
- Design systems to be debuggable under load
Performance Optimization Framework
Apply these four principles in order:
1. Reduce Allocation
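Why: A heap allocation is often one to two orders of magnitude slower than a stack allocation, which is just a stack-pointer adjustment. Heavy allocation churn also causes fragmentation and extra allocator bookkeeping; Rust has no garbage collector, but a busy allocator can produce similar latency spikes.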
Techniques:
- Stack allocation: Use arrays instead of Vec when size is known: [T; N]
- Buffer reuse: Keep Vec across iterations, use .clear() instead of creating new
- Arena allocators: Use bumpalo or typed-arena for bulk allocations with same lifetime
- SmallVec: Inline small collections to avoid heap allocation
- Zero-copy: Use &[u8] slices and avoid cloning data
- Cow<'_, T>: Clone only when mutation is needed
- Custom allocators: Consider jemalloc or mimalloc for better performance profiles
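Two of the techniques above, stack arrays and buffer reuse, can be sketched with the standard library alone. The function names and the record shapes here are hypothetical and chosen only for illustration.

```rust
/// Fixed-size work stays on the stack: `[u8; 32]` needs no allocator call.
pub fn xor_fold(chunks: &[[u8; 32]]) -> [u8; 32] {
    let mut acc = [0u8; 32];
    for chunk in chunks {
        for (a, b) in acc.iter_mut().zip(chunk) {
            *a ^= *b;
        }
    }
    acc
}

/// Counts the non-zero bytes of each record, reusing one scratch buffer.
pub fn nonzero_counts(records: &[&[u8]]) -> Vec<usize> {
    // Allocate the output once with the exact capacity needed.
    let mut counts = Vec::with_capacity(records.len());
    // One scratch buffer for the whole loop; its capacity grows to fit the
    // largest record and is then reused.
    let mut scratch: Vec<u8> = Vec::new();
    for record in records {
        scratch.clear(); // drops the contents, keeps the capacity
        scratch.extend(record.iter().copied().filter(|&b| b != 0));
        counts.push(scratch.len());
    }
    counts
}
```

The scratch buffer is deliberately simple; the point is that the allocation happens outside the loop, so a steady-state iteration makes no allocator calls.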
Red flags:
- .clone() in hot loops
- Creating Vec or String per iteration
- Temporary allocations that could be reused
- Boxing when stack storage suffices
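One way to remove a per-iteration String is to return Cow so the common case borrows. The sketch below uses a hypothetical helper, to_lower_hex, that allocates only when the input actually needs changing.

```rust
use std::borrow::Cow;

/// Returns `input` lowercased, allocating only if it contains an
/// uppercase ASCII letter.
pub fn to_lower_hex(input: &str) -> Cow<'_, str> {
    if input.bytes().any(|b| b.is_ascii_uppercase()) {
        // Slow path: the input must change, so pay for one allocation.
        Cow::Owned(input.to_ascii_lowercase())
    } else {
        // Fast path: hand back the original slice, no allocation.
        Cow::Borrowed(input)
    }
}
```

Callers that only read the result can treat it as a &str through deref, so the borrowed and owned cases need no separate handling.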
2. Spot Redundant Patterns & Simplify
Why: The best optimization is eliminating unnecessary work entirely.
What to look for:
- Duplicate computations: Cache results of pure functions
- Unnecessary conversions: Avoid format → parse → format chains
- Redundant checks: If type system guarantees invariant, don't check again
- Over-generic code: Monomorphization can help, but sometimes manual specialization wins
- Algorithmic improvements: O(n²) → O(n log n) beats micro-optimizations
Simplification strategies:
- Collapse nested matches/ifs into single match
- Replace complex iterator chains with simple loops when clearer
- Inline small functions in hot paths (verify with benchmarks)
- Remove abstraction layers that don't justify their cost
- Question every Box, Rc, Arc: is the heap allocation or shared ownership truly needed?
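The first strategy above, collapsing nested matches, might look like the following. TxKind and the fee rule are hypothetical and exist only to show the shape of the refactor.

```rust
/// Hypothetical transaction kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TxKind {
    Legacy,
    Eip1559,
}

/// Nested form: two levels of branching to follow.
pub fn fee_nested(kind: TxKind, base: u64, tip: Option<u64>) -> u64 {
    match kind {
        TxKind::Legacy => base,
        TxKind::Eip1559 => match tip {
            Some(t) => base.saturating_add(t),
            None => base,
        },
    }
}

/// Collapsed form: one match over a tuple, with the single special case
/// listed first and everything else falling through.
pub fn fee_collapsed(kind: TxKind, base: u64, tip: Option<u64>) -> u64 {
    match (kind, tip) {
        (TxKind::Eip1559, Some(t)) => base.saturating_add(t),
        _ => base,
    }
}
```

Both functions return the same values; the collapsed version just makes the one interesting case visible at a glance.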
3. Prefetch & Parallelize
Why: Modern CPUs excel at parallel work and predictable memory access.
Prefetching:
- Linear access patterns: Process arrays sequentially for cache-friendly access
- Structure of arrays (SoA): Group same fields together instead of array of structs
- Cache line awareness: Align hot data to 64-byte boundaries
- Avoid pointer chasing: Minimize indirection in tight loops
- Batch processing: Process data in chunks that fit in L2/L3 cache
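A small sketch of the SoA and cache-line points above. The account fields are hypothetical, and the 64-byte figure assumes a typical x86-64 cache line.

```rust
/// Array-of-structs: each element carries every field.
pub struct Account {
    pub address: [u8; 20],
    pub nonce: u64,
    pub balance: u64,
}

/// Summing balances still pulls addresses and nonces through the cache.
pub fn total_balance_aos(accounts: &[Account]) -> u64 {
    accounts.iter().map(|a| a.balance).sum()
}

/// Struct-of-arrays: index `i` across the three vectors is one account.
#[derive(Default)]
pub struct Accounts {
    pub addresses: Vec<[u8; 20]>,
    pub nonces: Vec<u64>,
    pub balances: Vec<u64>,
}

impl Accounts {
    pub fn push(&mut self, address: [u8; 20], nonce: u64, balance: u64) {
        self.addresses.push(address);
        self.nonces.push(nonce);
        self.balances.push(balance);
    }

    /// Touches only the densely packed `balances` vector.
    pub fn total_balance(&self) -> u64 {
        self.balances.iter().sum()
    }
}

/// A hot counter padded to its own 64-byte line to avoid false sharing.
#[repr(align(64))]
pub struct CacheLinePadded(pub u64);
```

SoA costs some ergonomics, so it is worth applying only to the fields that a hot loop actually scans.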
Parallelization:
- Data parallelism: Use rayon for embarrassingly parallel work: .par_iter()
- Task parallelism: Spawn tokio tasks for independent I/O-bound work
- SIMD: Use std::simd for vectorized operations on numeric data (nightly-only at the time of writing; std::arch intrinsics are available on stable)
- Lock-free structures: Use atomic operations instead of mutexes when possible
- Staged concurrency: Follow Reth's staged-sync pattern: pipeline distinct phases
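To keep the example dependency-free, the data-parallel sketch below uses std::thread::scope rather than rayon; in a real codebase rayon's par_iter would usually be the shorter route. The function name parallel_sum and the chunking policy are assumptions made for illustration.

```rust
use std::thread;

/// Sums `data` by splitting it into at most `workers` contiguous chunks
/// and summing each chunk on its own scoped thread.
pub fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one chunk;
    // `.max(1)` avoids a zero chunk size when `data` is empty.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker first, then join, so the chunks overlap in time.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().copied().sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

Each thread reads a contiguous slice, so the access pattern stays linear. For small inputs the spawn cost will dominate, which is the fine-grained case where parallelizing does not pay.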
When NOT to parallelize:
- Work is too fine-grained (overhead dominates)
- Dependencies between iterations
- Already bottlenecked on memory bandwidth
4. Aim for Simplicity
Why: Simple code is debuggable, maintainable, and often fastest after compiler optimization.
Practices:
- Tight inner loops: Keep hot paths focused and linear
- Minimize branches: CPU branch prediction works best with predictable patterns
- Avoid allocations in loops: See principle #1
- Profile-guided decisions: Let data guide optimization choices
- Incremental optimization: Change one thing at a time, measure impact
- Document why, not what: Explain performance-critical design choices
Reth-Specific Architecture Patterns
Staged-Sync Architecture
Reth follows Erigon's staged-sync model:
- Break blockchain sync into distinct stages (headers, bodies, execution, etc.)
- Each stage can be optimized independently
- Enables parallel processing of different chain segments
- Allows graceful interruption and resumption
Apply this pattern:
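use std::time::Instant;
use async_trait::async_trait;
use tracing::info;

// Define clear stages with isolated concerns.
// StageInput, StageOutput, and Result are placeholders for the caller's own types.
// #[async_trait] keeps the async trait usable behind Box<dyn SyncStage>.
#[async_trait]
trait SyncStage: Send {
    async fn execute(&mut self, input: StageInput) -> Result<StageOutput>;
    fn name(&self) -> &str;
}

// Pipeline stages, measure each independently.
// Assumes StageInput is Clone so each stage receives its own copy.
async fn sync_pipeline(mut stages: Vec<Box<dyn SyncStage>>, input: StageInput) -> Result<()> {
    for stage in stages.iter_mut() {
        let start = Instant::now();
        stage.execute(input.clone()).await?;
        info!("{} completed in {:?}", stage.name(), start.elapsed());
    }
    Ok(())
}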
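The interruption-and-resumption property can be sketched without async or external crates. The following is a simplified synchronous model, not Reth's actual stage API: Stage, BatchedStage, and run_pipeline are hypothetical names, and a checkpoint is modeled as a plain block number per stage.

```rust
use std::time::Instant;

/// Hypothetical stage interface for this sketch.
pub trait Stage {
    fn name(&self) -> &'static str;
    /// Processes blocks after `checkpoint` up to at most `target` and
    /// returns the new checkpoint.
    fn execute(&mut self, checkpoint: u64, target: u64) -> Result<u64, String>;
}

/// Example stage that advances at most `batch` blocks per call.
pub struct BatchedStage {
    pub label: &'static str,
    pub batch: u64,
}

impl Stage for BatchedStage {
    fn name(&self) -> &'static str {
        self.label
    }
    fn execute(&mut self, checkpoint: u64, target: u64) -> Result<u64, String> {
        Ok(target.min(checkpoint.saturating_add(self.batch)))
    }
}

/// Runs each stage to `target` in order. `checkpoints[i]` is stage i's
/// progress, so a restarted run continues where the last one stopped.
pub fn run_pipeline(
    stages: &mut [Box<dyn Stage>],
    checkpoints: &mut [u64],
    target: u64,
) -> Result<(), String> {
    assert_eq!(stages.len(), checkpoints.len());
    for (stage, checkpoint) in stages.iter_mut().zip(checkpoints.iter_mut()) {
        let start = Instant::now();
        while *checkpoint < target {
            let next = stage.execute(*checkpoint, target)?;
            if next <= *checkpoint {
                return Err(format!("stage {} made no progress", stage.name()));
            }
            // A real system would persist the checkpoint here.
            *checkpoint = next;
        }
        eprintln!("{} reached block {} in {:?}", stage.name(), target, start.elapsed());
    }
    Ok(())
}
```

Because progress lives in the checkpoints rather than in the stages, stopping between batches loses at most one batch of work per stage.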
Database Design (MDBX)
Reth uses MDBX for high-performance storage:
- Memory-mapped files for fast random access
- Copy-on-write for ACID properties without write-ahead log
- Separate storage for historical vs current state
Key principles:
- Batch writes in transactions
- Use cursors for sequential access
- Design schema for access patterns (not normalization)
- Consider data layout: hot data together, cold data separate
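As a rough model of batching, cursor-style scans, and schema-for-access-pattern, the sketch below uses a BTreeMap as an in-memory stand-in for an ordered key-value table. It does not use MDBX or any Reth API; HeaderTable and its methods are hypothetical. The key point is the big-endian key encoding, which makes bytewise key order match numeric block order so range scans are sequential.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an ordered key-value table.
#[derive(Default)]
pub struct HeaderTable {
    rows: BTreeMap<[u8; 8], Vec<u8>>,
}

impl HeaderTable {
    /// Big-endian keys sort bytewise in the same order as block numbers.
    fn key(block: u64) -> [u8; 8] {
        block.to_be_bytes()
    }

    /// Applies a whole batch in one call; a real store would wrap this in
    /// a single write transaction.
    pub fn put_batch(&mut self, batch: impl IntoIterator<Item = (u64, Vec<u8>)>) {
        self.rows
            .extend(batch.into_iter().map(|(n, v)| (Self::key(n), v)));
    }

    /// Sequential scan over blocks in `[from, to)`, analogous to walking a
    /// cursor. An inverted range yields nothing.
    pub fn range(&self, from: u64, to: u64) -> impl Iterator<Item = (u64, &[u8])> + '_ {
        let to = to.max(from);
        self.rows
            .range(Self::key(from)..Self::key(to))
            .map(|(k, v)| (u64::from_be_bytes(*k), v.as_slice()))
    }
}
```

With little-endian keys, block 256 would sort before block 2 and the scan would jump around the table, which is the kind of mismatch between schema and access pattern the list above warns about.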
Component Modularity
Each Reth component is a standalone crate:
- reth-primitives: Core types (blocks, transactions)
- reth-db: Database abstractions
- reth-revm: EVM execution
- reth-network: P2P layer
Apply this:
- Design public APIs independent of implementation
- Use traits for pluggable components
- Document performance contracts
- Provide both sync and async variants when appropriate
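A pluggable component behind a trait could be sketched as follows. KeyValueStore, MemoryStore, and bump_counter are hypothetical names; the generic parameter means calls are statically dispatched, which keeps the abstraction out of the hot path.

```rust
use std::collections::HashMap;

/// Hypothetical storage interface: callers depend on this trait, not on a
/// concrete backend.
pub trait KeyValueStore {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&mut self, key: &[u8], value: &[u8]);
}

/// Simple in-memory backend, useful for tests.
#[derive(Default)]
pub struct MemoryStore {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl KeyValueStore for MemoryStore {
    // Returns an owned copy for simplicity; a tuned backend might lend a slice.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), value.to_vec());
    }
}

/// Generic over the backend, so the compiler monomorphizes one copy per
/// store type and no virtual call is involved.
pub fn bump_counter<S: KeyValueStore>(store: &mut S, key: &[u8]) -> u64 {
    let current = store
        .get(key)
        .and_then(|bytes| <[u8; 8]>::try_from(bytes.as_slice()).ok())
        .map(u64::from_be_bytes)
        .unwrap_or(0);
    let next = current.wrapping_add(1);
    store.put(key, &next.to_be_bytes());
    next
}
```

Swapping in a disk-backed store only requires another impl of the trait; bump_counter itself does not change.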
Critical Systems Standards
Error Handling
- Type-safe errors: Use thiserror for domain errors
- Context propagation: Add context with anyhow or custom types
- Fast path optimization: Result in hot paths, panic for programmer errors
- Structured logging: Log errors with context for debugging
#[derive(Debug, thiserror::Error)]
enum ExecutionError {
    #[error("Invalid transaction: {0}")]
    InvalidTransaction(String),
    #[error("State error: {0}")]
    StateError(#[from] StateError),
}
// In hot paths where every nanosecond counts, consider returning error codes
// instead of Result, but only after measuring: a Result with a small,
// allocation-free error type is usually just as cheap.
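For reference, here is a dependency-free sketch of roughly what the thiserror derive above provides: Display, std::error::Error with a source, and a From impl so that ? converts automatically. The shape of StateError and the helpers load_nonce and execute are hypothetical; the original does not define them.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical lower-level error.
#[derive(Debug)]
pub struct StateError(pub String);

impl fmt::Display for StateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl Error for StateError {}

#[derive(Debug)]
pub enum ExecutionError {
    InvalidTransaction(String),
    StateError(StateError),
}

impl fmt::Display for ExecutionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidTransaction(msg) => write!(f, "Invalid transaction: {msg}"),
            Self::StateError(err) => write!(f, "State error: {err}"),
        }
    }
}

impl Error for ExecutionError {
    // Expose the wrapped error so callers can walk the cause chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::StateError(err) => Some(err),
            Self::InvalidTransaction(_) => None,
        }
    }
}

impl From<StateError> for ExecutionError {
    fn from(err: StateError) -> Self {
        Self::StateError(err)
    }
}

fn load_nonce(present: bool) -> Result<u64, StateError> {
    if present {
        Ok(7)
    } else {
        Err(StateError("account not found".to_string()))
    }
}

/// `?` converts `StateError` into `ExecutionError` through the `From` impl.
pub fn execute(present: bool) -> Result<u64, ExecutionError> {
    let nonce = load_nonce(present)?;
    Ok(nonce + 1)
}
```

The success path does no allocation; the String is only built when the error actually occurs.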
Testing Strategy
- Unit tests: Test individual functions with property-based testing (proptest)
- Integration tests: Test component interactions
- Benchmarks: Use criterion for statistical rigor
- Fuzzing: Use cargo-fuzz for finding edge cases
- Regression tests: Lock in performance with benchmark CI gates
Profiling & Measurement
- Start with high-level metrics: What's actually slow?
- CPU profiling: cargo flamegraph or perf record
- Memory profiling: valgrind --tool=massif or heaptrack
- Cache analysis: perf stat -d or cachegrind
- Benchmark: cargo bench with statistical analysis
Always measure before and after optimizations.
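For a quick before-and-after sanity check when a full benchmark harness is not set up yet, a minimal timing helper might look like this. The name median_time is hypothetical, and this is no substitute for criterion's statistical analysis.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the median wall-clock duration.
///
/// Panics if `iters` is zero.
pub fn median_time<T>(iters: usize, mut f: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        // black_box keeps the optimizer from discarding the result.
        black_box(f());
        samples.push(start.elapsed());
    }
    // The median is less sensitive to scheduler noise than the mean.
    samples.sort_unstable();
    samples[iters / 2]
}
```

Run it in a release build and compare the medians of the old and new code paths on the same input.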
Code Review Checklist
For every performance-critical change:
- Benchmarked before and after?
- Allocations minimized? (checked with cargo instruments or dhat)
- Redundant work eliminated?
- Parallelization opportunities explored?
- Code is simple and maintainable?
- Error paths don't allocate unnecessarily?
- Public API documented with performance characteristics?
- Tests cover edge cases and validate correctness?
- Tracing/metrics added for production debugging?
Quick Reference Commands
# Profile CPU usage
cargo flamegraph --bin your-binary
# Benchmark with statistical analysis
cargo bench --bench your-bench
# Check assembly output
cargo asm --bin your-binary path::to::function
# Memory profiling (Linux)
heaptrack ./target/release/your-binary
# Cache analysis
perf stat -d ./target/release/your-binary
# Inspect type sizes and layout (nightly only; this does not report allocation sites)
RUSTFLAGS="-Z print-type-sizes" cargo +nightly build --release
