performance
Production-grade skill for C++ performance optimization. Covers profiling, benchmarking, cache optimization, SIMD vectorization, multithreading, and lock-free programming techniques.
$ 설치
git clone https://github.com/pluginagentmarketplace/custom-plugin-cpp /tmp/custom-plugin-cpp && cp -r /tmp/custom-plugin-cpp/skills/performance ~/.claude/skills/custom-plugin-cpp// tip: Run this command in your terminal to install the skill
═══════════════════════════════════════════════════════════════════════════════
SKILL: Performance
Version: 3.0.0 | SASMP v1.3.0 Compliant | Production-Grade
═══════════════════════════════════════════════════════════════════════════════
─────────────────────────────────────────────────────────────────────────────
IDENTITY
─────────────────────────────────────────────────────────────────────────────
name: performance version: "3.0.0" description: > Production-grade skill for C++ performance optimization. Covers profiling, benchmarking, cache optimization, SIMD vectorization, multithreading, and lock-free programming techniques.
─────────────────────────────────────────────────────────────────────────────
COMPLIANCE
─────────────────────────────────────────────────────────────────────────────
sasmp_version: "1.3.0" skill_version: "3.0.0"
─────────────────────────────────────────────────────────────────────────────
BONDING
─────────────────────────────────────────────────────────────────────────────
bonded_agent: 05-performance-optimizer bond_type: PRIMARY_BOND category: development
─────────────────────────────────────────────────────────────────────────────
PARAMETERS
─────────────────────────────────────────────────────────────────────────────
parameters: optimization_target: type: string required: false enum: [throughput, latency, memory, cpu, all] default: all description: "Primary optimization target" profiling_tool: type: string required: false enum: [perf, vtune, valgrind, tracy, instruments] description: "Profiling tool to use" optimization_level: type: string required: false enum: [quick_wins, moderate, aggressive] default: moderate description: "Depth of optimization effort" maintain_readability: type: boolean required: false default: true description: "Whether to prioritize code readability"
─────────────────────────────────────────────────────────────────────────────
ERROR HANDLING
─────────────────────────────────────────────────────────────────────────────
error_handling: retry_logic: max_attempts: 3 backoff: exponential initial_delay_ms: 1000 max_delay_ms: 16000 jitter: true fallback: on_benchmark_unstable: "increase_iterations" on_profiling_fail: "use_alternative_tool" on_no_improvement: "try_different_approach" on_regression: "rollback_and_analyze" validation: verify_no_regression: true statistical_significance: true test_multiple_inputs: true
Performance Skill
Production-Grade Development Skill | C++ Performance Engineering
Optimize C++ code for maximum performance through profiling, analysis, and targeted optimization.
Golden Rules
┌─────────────────────────────────────────────────────────────────┐
│ 1. MEASURE first - never optimize without profiling data │
│ 2. OPTIMIZE hotspots - focus on the 20% that takes 80% time │
│ 3. VERIFY improvements - benchmark before and after │
│ 4. MAINTAIN readability - premature optimization is evil │
└─────────────────────────────────────────────────────────────────┘
Profiling Tools
Linux perf
# Record CPU profile
perf record -g ./program
perf report
# Flamegraph generation
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# Hardware counters
perf stat -e cache-misses,cache-references,instructions,cycles ./program
# Specific function profiling
perf record -g -e cycles:u --call-graph dwarf ./program
Valgrind Callgrind
# Instruction-level profiling
valgrind --tool=callgrind ./program
kcachegrind callgrind.out.*
# Cache simulation
valgrind --tool=cachegrind ./program
cg_annotate cachegrind.out.*
Google Benchmark
#include <benchmark/benchmark.h>
static void BM_VectorPushBack(benchmark::State& state) {
for (auto _ : state) {
std::vector<int> v;
v.reserve(state.range(0)); // Fair comparison
for (int i = 0; i < state.range(0); ++i) {
v.push_back(i);
}
benchmark::DoNotOptimize(v.data());
benchmark::ClobberMemory();
}
state.SetComplexityN(state.range(0));
}
BENCHMARK(BM_VectorPushBack)
->Range(8, 8 << 10)
->Complexity(benchmark::oN);
BENCHMARK_MAIN();
Cache Optimization
Data Layout: AoS vs SoA
// ❌ Array of Structures (AoS) - cache unfriendly for iteration
struct ParticleAoS {
float x, y, z; // Position
float vx, vy, vz; // Velocity
float mass;
int id;
};
std::vector<ParticleAoS> particles; // 32 bytes per particle
// ✅ Structure of Arrays (SoA) - cache friendly
struct ParticlesSoA {
std::vector<float> x, y, z; // Contiguous positions
std::vector<float> vx, vy, vz; // Contiguous velocities
std::vector<float> mass;
std::vector<int> id;
void update_positions(float dt) {
const size_t n = x.size();
for (size_t i = 0; i < n; ++i) {
x[i] += vx[i] * dt; // Full cache line utilization
y[i] += vy[i] * dt;
z[i] += vz[i] * dt;
}
}
};
Cache Line Alignment
// Avoid false sharing with cache line alignment
struct alignas(64) CacheAlignedCounter {
std::atomic<long> count{0};
char padding[56]; // Ensure 64-byte alignment
};
// Per-thread counters without false sharing
std::array<CacheAlignedCounter, 8> thread_counters;
// Hot/cold data separation
struct HotData {
int frequently_accessed;
int also_frequent;
};
struct ColdData {
std::string rarely_used;
std::vector<int> debug_info;
};
struct OptimizedNode {
HotData hot;
ColdData* cold; // Pointer to cold data (loaded on demand)
};
SIMD Vectorization
Auto-vectorization Hints
// Help compiler vectorize with restrict and pragmas
void add_arrays(float* __restrict a, float* __restrict b,
float* __restrict result, size_t n) {
#pragma omp simd
for (size_t i = 0; i < n; ++i) {
result[i] = a[i] + b[i];
}
}
// Alignment for better vectorization
void process_aligned(float* data, size_t n) {
float* __restrict aligned_data =
std::assume_aligned<32>(data); // C++20
for (size_t i = 0; i < n; ++i) {
aligned_data[i] *= 2.0f;
}
}
Explicit SIMD (AVX)
#include <immintrin.h>
void add_vectors_avx(const float* a, const float* b,
float* result, size_t n) {
size_t i = 0;
// Process 8 floats at a time with AVX
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vr = _mm256_add_ps(va, vb);
_mm256_storeu_ps(&result[i], vr);
}
// Handle remainder
for (; i < n; ++i) {
result[i] = a[i] + b[i];
}
}
// Horizontal sum with AVX
float horizontal_sum_avx(__m256 v) {
__m128 lo = _mm256_castps256_ps128(v);
__m128 hi = _mm256_extractf128_ps(v, 1);
lo = _mm_add_ps(lo, hi);
lo = _mm_hadd_ps(lo, lo);
lo = _mm_hadd_ps(lo, lo);
return _mm_cvtss_f32(lo);
}
Multithreading
Parallel Algorithms (C++17)
#include <execution>
#include <algorithm>
#include <numeric>
std::vector<int> data(1'000'000);
// Parallel sort
std::sort(std::execution::par_unseq, data.begin(), data.end());
// Parallel transform
std::transform(std::execution::par, data.begin(), data.end(),
data.begin(), [](int x) { return x * 2; });
// Parallel reduce
long sum = std::reduce(std::execution::par,
data.begin(), data.end(), 0L);
// Parallel for_each
std::for_each(std::execution::par_unseq, data.begin(), data.end(),
[](int& x) { x = process(x); });
Thread Pool
#include <thread>
#include <queue>
#include <functional>
#include <future>
#include <condition_variable>
class ThreadPool {
std::vector<std::thread> workers_;
std::queue<std::function<void()>> tasks_;
std::mutex mutex_;
std::condition_variable cv_;
std::atomic<bool> stop_{false};
public:
explicit ThreadPool(size_t threads = std::thread::hardware_concurrency()) {
for (size_t i = 0; i < threads; ++i) {
workers_.emplace_back([this] {
while (true) {
std::function<void()> task;
{
std::unique_lock lock(mutex_);
cv_.wait(lock, [this] {
return stop_ || !tasks_.empty();
});
if (stop_ && tasks_.empty()) return;
task = std::move(tasks_.front());
tasks_.pop();
}
task();
}
});
}
}
template<typename F, typename... Args>
auto enqueue(F&& f, Args&&... args)
-> std::future<std::invoke_result_t<F, Args...>> {
using return_type = std::invoke_result_t<F, Args...>;
auto task = std::make_shared<std::packaged_task<return_type()>>(
std::bind(std::forward<F>(f), std::forward<Args>(args)...)
);
std::future<return_type> res = task->get_future();
{
std::lock_guard lock(mutex_);
tasks_.emplace([task]() { (*task)(); });
}
cv_.notify_one();
return res;
}
~ThreadPool() {
stop_ = true;
cv_.notify_all();
for (auto& worker : workers_) {
worker.join();
}
}
};
Quick Wins Checklist
Immediate Optimizations
- Use
reserve()for vectors with known size - Prefer
emplace_back()overpush_back() - Move instead of copy when possible
- Use
string_viewfor read-only strings - Avoid unnecessary allocations in loops
- Use
[[likely]]/[[unlikely]]for branch hints
Data Layout
- Profile cache misses first
- Consider SoA vs AoS for large datasets
- Align hot data to cache lines
- Separate hot and cold data
Algorithmic
- Choose right container for access pattern
- Use binary search on sorted data
- Avoid redundant computation
- Consider lookup tables for expensive functions
Performance Workflow
┌─────────────┐ ┌──────────────┐ ┌───────────────┐
│ PROFILE │───▶│ IDENTIFY │───▶│ OPTIMIZE │
│ (measure) │ │ (hotspots) │ │ (implement) │
└─────────────┘ └──────────────┘ └───────────────┘
▲ │
│ ▼
│ ┌──────────────┐ ┌───────────────┐
└──────────────│ VERIFY │◀───│ BENCHMARK │
│ (improved?) │ │ (measure) │
└──────────────┘ └───────────────┘
Troubleshooting Decision Tree
Performance issue?
├── High CPU, low throughput
│ ├── Check cache misses → perf stat -e cache-misses
│ ├── Check branch mispredictions → perf stat -e branch-misses
│ └── Profile hotspots → perf record + flamegraph
├── High latency spikes
│ ├── Check for locks → Look for mutex contention
│ ├── Check allocations → Use custom allocator
│ └── Check I/O blocking → Use async I/O
├── Memory growing
│ ├── Memory leak → Valgrind / ASan
│ ├── Fragmentation → Custom allocator
│ └── Retained references → Check lifetimes
└── Inconsistent performance
├── CPU throttling → Check power management
├── NUMA effects → Pin threads to cores
└── Context switches → Reduce thread count
Unit Test Template
#include <gtest/gtest.h>
#include <benchmark/benchmark.h>
#include <chrono>
class PerformanceTest : public ::testing::Test {
protected:
static constexpr size_t ITERATIONS = 1000;
template<typename Func>
auto measure(Func&& f) {
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < ITERATIONS; ++i) {
f();
}
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::microseconds>(
end - start).count() / ITERATIONS;
}
};
TEST_F(PerformanceTest, VectorReserveIsFaster) {
auto without_reserve = measure([]{
std::vector<int> v;
for (int i = 0; i < 1000; ++i) v.push_back(i);
});
auto with_reserve = measure([]{
std::vector<int> v;
v.reserve(1000);
for (int i = 0; i < 1000; ++i) v.push_back(i);
});
EXPECT_LT(with_reserve, without_reserve);
}
TEST_F(PerformanceTest, SoAFasterThanAoS) {
// Test cache efficiency
auto aos_time = measure([this]{ process_aos(); });
auto soa_time = measure([this]{ process_soa(); });
EXPECT_LT(soa_time, aos_time * 0.8); // At least 20% faster
}
Integration Points
| Component | Interface |
|---|---|
build-engineer | Optimization flags |
modern-cpp-expert | Move semantics |
memory-specialist | Allocation patterns |
cpp-debugger-agent | Performance debugging |
C++ Plugin v3.0.0 - Production-Grade Development Skill
Repository
