optimization-benchmark

Comprehensive performance benchmarking, regression detection, automated testing, and performance validation. Use for CI/CD quality gates, baseline comparisons, and systematic performance testing.

Install

git clone https://github.com/vamseeachanta/workspace-hub /tmp/workspace-hub && cp -r /tmp/workspace-hub/.claude/skills/tools/optimization/optimization-benchmark ~/.claude/skills/workspace-hub

Tip: Run this command in your terminal to install the skill.



Benchmark Suite Skill

Overview

This skill provides automated performance testing capabilities: benchmark execution, regression detection, performance validation, and quality assessment for keeping system performance within targets.

When to Use

  • Running performance benchmark suites before deployment
  • Detecting performance regressions between versions
  • Validating SLA compliance through automated testing
  • Load, stress, and endurance testing
  • CI/CD pipeline performance quality gates
  • Comparing performance across configurations

Quick Start

# Run comprehensive benchmark suite
npx claude-flow benchmark-run --suite comprehensive --duration 300

# Execute specific benchmark
npx claude-flow benchmark-run --suite throughput --iterations 10

# Compare with baseline
npx claude-flow benchmark-compare --current <results> --baseline <baseline>

# Quality assessment
npx claude-flow quality-assess --target swarm-performance --criteria throughput,latency

# Performance validation
npx claude-flow validate-performance --results <file> --criteria <file>

Architecture

+-----------------------------------------------------------+
|                   Benchmark Suite                          |
+-----------------------------------------------------------+
|  Benchmark Runner  |  Regression Detector  |  Validator   |
+--------------------+-----------------------+--------------+
         |                     |                    |
         v                     v                    v
+------------------+  +-------------------+  +--------------+
| Benchmark Types  |  | Detection Methods |  | Validation   |
| - Throughput     |  | - Statistical     |  | - SLA        |
| - Latency        |  | - ML-based        |  | - Regression |
| - Scalability    |  | - Threshold       |  | - Scalability|
| - Coordination   |  | - Trend Analysis  |  | - Reliability|
+------------------+  +-------------------+  +--------------+
         |                     |                    |
         v                     v                    v
+-----------------------------------------------------------+
|              Reporter & Comparator                         |
+-----------------------------------------------------------+

Benchmark Types

Standard Benchmarks

| Benchmark | Metrics | Duration | Targets |
|---|---|---|---|
| Throughput | requests/sec, tasks/sec, messages/sec | 5 min | min: 1000, optimal: 5000 |
| Latency | p50, p90, p95, p99, max | 5 min | p50 < 100ms, p99 < 1s |
| Scalability | linear coefficient, efficiency retention | variable | coefficient > 0.8 |
| Coordination | message latency, sync time | 5 min | < 50ms |
| Fault Tolerance | recovery time, failover success | 10 min | < 30s recovery |
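
The percentile targets above assume latency is aggregated from raw samples. A minimal sketch of deriving p50/p90/p99 with the nearest-rank method (an assumption for illustration; the suite's internal aggregation may differ):

// Nearest-rank percentile over raw latency samples
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  // Index of the smallest value covering p percent of samples
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [12, 85, 43, 110, 950, 77, 64, 98, 102, 55]; // ms, example data
console.log({
  p50: percentile(latencies, 50),
  p90: percentile(latencies, 90),
  p99: percentile(latencies, 99),
  max: Math.max(...latencies)
});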

Test Campaign Types

  1. Load Testing: Gradual ramp-up to sustained load
  2. Stress Testing: Find breaking points
  3. Volume Testing: Large data set handling
  4. Endurance Testing: Long-duration stability
  5. Spike Testing: Sudden load changes (sketched below)
  6. Configuration Testing: Different settings comparison
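
As an illustration of spike testing (item 5), here is a hypothetical configuration mirroring the shape of the load-test example later in this document; the exact fields accepted by the runner are an assumption:

// Hypothetical spike-test configuration (field names assumed for illustration)
const spikeTest = {
  type: 'spike',
  phases: [
    { phase: 'baseline', duration: 60000, load: 50 },
    { phase: 'spike',    duration: 10000, load: 500 },  // 10x sudden jump
    { phase: 'recovery', duration: 60000, load: 50 }
  ],
  successCriteria: {
    error_rate: { max: 0.02 },       // tolerate briefly elevated errors
    recovery_time: { max: 30000 }    // back to baseline latency within 30s
  }
};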

Core Capabilities

1. Comprehensive Benchmarking

// Run benchmark suite
const results = await benchmarkSuite.run({
  duration: 300000,      // 5 minutes
  iterations: 10,        // 10 iterations
  warmupTime: 30000,     // 30 seconds warmup
  cooldownTime: 10000,   // 10 seconds cooldown
  parallel: false,       // Sequential execution
  baseline: previousRun  // Compare with baseline
});

// Results include:
// - summary: Overall scores and status
// - detailed: Per-benchmark results
// - baseline_comparison: Delta from baseline
// - recommendations: Optimization suggestions

2. Regression Detection

Multi-algorithm detection:

| Method | Description | Use Case |
|---|---|---|
| Statistical | CUSUM change point detection | Detect gradual degradation |
| Machine Learning | Anomaly detection models | Identify unusual patterns |
| Threshold | Fixed limit comparisons | Hard performance limits |
| Trend | Time series regression | Long-term degradation |

# Detect performance regressions
npx claude-flow detect-regression --current <results> --historical <data>

# Set up automated regression monitoring
npx claude-flow regression-monitor --enable --sensitivity 0.95
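
To make the statistical method concrete, here is a minimal one-sided CUSUM sketch for detecting upward latency drift. The formulation and thresholds are illustrative assumptions, not claude-flow's actual algorithm:

// One-sided CUSUM: accumulate deviations above target and flag a change
// point once the cumulative sum crosses a threshold
function cusumDetect(samples, { target, slack, threshold }) {
  let sHigh = 0; // accumulates upward drift (e.g., rising latency)
  for (let i = 0; i < samples.length; i++) {
    sHigh = Math.max(0, sHigh + (samples[i] - target - slack));
    if (sHigh > threshold) return { regression: true, at: i };
  }
  return { regression: false };
}

// Latency samples drifting upward after index 5
const samples = [100, 101, 99, 100, 102, 108, 112, 115, 118, 121];
console.log(cusumDetect(samples, { target: 100, slack: 1, threshold: 20 }));
// => { regression: true, at: 7 }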

3. Automated Performance Testing

// Execute test campaign
const campaign = await tester.runTestCampaign({
  tests: [
    { type: 'load', config: loadTestConfig },
    { type: 'stress', config: stressTestConfig },
    { type: 'endurance', config: enduranceConfig }
  ],
  constraints: {
    maxDuration: 3600000,  // 1 hour max
    failFast: true         // Stop on first failure
  }
});

4. Performance Validation

Validation framework with multi-criteria assessment:

| Validation Type | Criteria |
|---|---|
| SLA Validation | Availability, response time, throughput, error rate |
| Regression Validation | Comparison with historical data |
| Scalability Validation | Linear scaling, efficiency retention |
| Reliability Validation | Error handling, recovery, consistency |
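
A sketch of SLA validation against measured results; the criteria shape mirrors the table above and is illustrative, not a documented claude-flow schema:

// Check each measured metric against its min/max bound
const slaCriteria = {
  availability: { min: 0.999 },
  response_p99: { max: 1000 },  // ms
  throughput:   { min: 1000 },  // requests/sec
  error_rate:   { max: 0.005 }
};

function validateSLA(results, criteria) {
  const violations = Object.entries(criteria).filter(([metric, bound]) => {
    const value = results[metric];
    return (bound.min !== undefined && value < bound.min) ||
           (bound.max !== undefined && value > bound.max);
  });
  return { passed: violations.length === 0, violations };
}

console.log(validateSLA(
  { availability: 0.9995, response_p99: 1200, throughput: 4200, error_rate: 0.001 },
  slaCriteria
)); // response_p99 violates the 1s bound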

MCP Integration

// Comprehensive benchmark integration
const benchmarkIntegration = {
  // Execute performance benchmarks
  async runBenchmarks(config = {}) {
    const [benchmark, metrics, trends, cost] = await Promise.all([
      mcp.benchmark_run({ suite: config.suite || 'comprehensive' }),
      mcp.metrics_collect({ components: ['system', 'agents', 'coordination'] }),
      mcp.trend_analysis({ metric: 'performance', period: '24h' }),
      mcp.cost_analysis({ timeframe: '24h' })
    ]);

    return { benchmark, metrics, trends, cost, timestamp: Date.now() };
  },

  // Quality assessment
  async assessQuality(criteria) {
    return await mcp.quality_assess({
      target: 'swarm-performance',
      criteria: criteria || ['throughput', 'latency', 'reliability', 'scalability']
    });
  }
};
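
Example usage of the integration object above, assuming an `mcp` client with the methods shown is in scope:

// Run a quick suite and assess quality (result shapes depend on the mcp client)
const snapshot = await benchmarkIntegration.runBenchmarks({ suite: 'quick' });
const quality = await benchmarkIntegration.assessQuality(['throughput', 'latency']);
console.log(snapshot, quality);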

Key Metrics

Benchmark Targets

const benchmarkTargets = {
  throughput: {
    requests_per_second: { min: 1000, optimal: 5000 },
    tasks_per_second: { min: 100, optimal: 500 },
    messages_per_second: { min: 10000, optimal: 50000 }
  },
  latency: {
    p50: { max: 100 },   // 100ms
    p90: { max: 200 },   // 200ms
    p95: { max: 500 },   // 500ms
    p99: { max: 1000 },  // 1s
    max: { max: 5000 }   // 5s
  },
  scalability: {
    linear_coefficient: { min: 0.8 },
    efficiency_retention: { min: 0.7 }
  }
};
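
These targets can be checked mechanically. A sketch, assuming a flat results shape (an illustration, not the suite's actual report format):

// Compare measured metrics against min/max bounds in benchmarkTargets
function checkTargets(measured, targets) {
  const failures = [];
  for (const [group, metrics] of Object.entries(targets)) {
    for (const [name, bound] of Object.entries(metrics)) {
      const value = measured[group]?.[name];
      if (value === undefined) continue; // metric not measured this run
      if (bound.min !== undefined && value < bound.min) {
        failures.push(`${group}.${name}: ${value} < min ${bound.min}`);
      }
      if (bound.max !== undefined && value > bound.max) {
        failures.push(`${group}.${name}: ${value} > max ${bound.max}`);
      }
    }
  }
  return failures;
}

console.log(checkTargets(
  { latency: { p50: 80, p99: 1400 }, throughput: { requests_per_second: 4500 } },
  benchmarkTargets
)); // flags latency.p99 exceeding its 1000ms target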

CI/CD Quality Gates

| Gate | Criteria | Action on Failure |
|---|---|---|
| Performance | < 10% degradation | Block deployment |
| Latency | p99 < 1s | Warning |
| Error Rate | < 0.5% | Block deployment |
| Scalability | > 80% linear | Warning |
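
A sketch of how these gates might map to CI actions; the comparison input fields are assumed for illustration:

// Evaluate gates against a baseline comparison; any failed "block" gate
// should fail the pipeline, "warn" gates only surface warnings
function evaluateGates(comparison) {
  const gates = [
    { name: 'performance', fail: comparison.degradation_pct >= 10, action: 'block' },
    { name: 'latency',     fail: comparison.p99_ms >= 1000,        action: 'warn'  },
    { name: 'error_rate',  fail: comparison.error_rate >= 0.005,   action: 'block' },
    { name: 'scalability', fail: comparison.linearity < 0.8,       action: 'warn'  }
  ];
  const blocked = gates.some(g => g.fail && g.action === 'block');
  return { blocked, warnings: gates.filter(g => g.fail && g.action === 'warn') };
}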

Load Testing Example

// Load test with gradual ramp-up
const loadTest = {
  type: 'load',
  phases: [
    { phase: 'ramp-up', duration: 60000, startLoad: 10, endLoad: 100 },
    { phase: 'sustained', duration: 300000, load: 100 },
    { phase: 'ramp-down', duration: 30000, startLoad: 100, endLoad: 0 }
  ],
  successCriteria: {
    p99_latency: { max: 1000 },
    error_rate: { max: 0.01 },
    throughput: { min: 80 }  // % of expected
  }
};

Stress Testing Example

// Stress test to find breaking point
const stressTest = {
  type: 'stress',
  startLoad: 100,
  maxLoad: 10000,
  loadIncrement: 100,
  duration: 60000,  // Per load level
  breakingCriteria: {
    error_rate: { max: 0.05 },    // 5% errors
    latency_p99: { max: 5000 },   // 5s latency
    timeout_rate: { max: 0.10 }   // 10% timeouts
  }
};
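
The loop implied by this configuration ramps load until a breaking criterion trips. A sketch, where `runAtLoad` is a hypothetical hook that executes one load level and returns measured metrics:

// Increase load stepwise until any breaking criterion is exceeded
async function findBreakingPoint(test, runAtLoad) {
  for (let load = test.startLoad; load <= test.maxLoad; load += test.loadIncrement) {
    const m = await runAtLoad(load, test.duration);
    const broken =
      m.error_rate   > test.breakingCriteria.error_rate.max ||
      m.latency_p99  > test.breakingCriteria.latency_p99.max ||
      m.timeout_rate > test.breakingCriteria.timeout_rate.max;
    if (broken) return { breakingPoint: load, metrics: m };
  }
  return { breakingPoint: null }; // survived up to maxLoad
}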

Integration Points

| Integration | Purpose |
|---|---|
| Performance Monitor | Continuous monitoring data for benchmarking |
| Load Balancer | Validates load balancing effectiveness |
| Topology Optimizer | Tests topology configurations |
| CI/CD Pipeline | Automated quality gates |

Best Practices

  1. Consistent Environment: Run benchmarks in consistent, isolated environments
  2. Warmup Period: Always include warmup to eliminate cold-start effects
  3. Multiple Iterations: Run multiple iterations for statistical significance (see the sketch after this list)
  4. Baseline Maintenance: Keep baseline updated with expected performance
  5. Historical Tracking: Store all benchmark results for trend analysis
  6. Realistic Workloads: Use production-like workload patterns
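
A minimal sketch for best practice 3: aggregating iterations into mean, standard deviation, and a rough 95% confidence interval under a normal approximation:

// Sample statistics across benchmark iterations
function iterationStats(values) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const variance = values.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const stddev = Math.sqrt(variance);
  const margin = 1.96 * stddev / Math.sqrt(n); // 95% CI, normal approximation
  return { mean, stddev, ci95: [mean - margin, mean + margin] };
}

// Throughput (requests/sec) across 10 iterations
console.log(iterationStats([4820, 4950, 5010, 4880, 4930, 4990, 4860, 4940, 5020, 4900]));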

Example: CI/CD Integration

#!/bin/bash
# ci-performance-gate.sh

# Run benchmark suite
RESULTS=$(npx claude-flow benchmark-run --suite quick --output json)

# Compare with baseline
COMPARISON=$(npx claude-flow benchmark-compare \
  --current "$RESULTS" \
  --baseline ./baseline.json)

# Check for regressions
if echo "$COMPARISON" | jq -e '.regression_detected == true' > /dev/null; then
  echo "Performance regression detected!"
  echo "$COMPARISON" | jq '.regressions'
  exit 1
fi

echo "Performance validation passed"
exit 0

Related Skills

  • optimization-monitor - Real-time performance monitoring
  • optimization-analyzer - Bottleneck analysis and reporting
  • optimization-load-balancer - Load distribution optimization
  • optimization-topology - Topology performance testing

Version History

  • 1.0.0 (2026-01-02): Initial release - converted from benchmark-suite agent with comprehensive benchmarking, regression detection, automated testing, and performance validation