ops-monitor

Monitor deployed infrastructure health and performance - check resource status, query CloudWatch metrics (CPU, memory, requests, errors), analyze performance trends, track SLI/SLO metrics, detect anomalies, generate health reports with resource status summaries, identify degraded services, provide performance optimization recommendations.

model: claude-haiku-4-5

$ Install

git clone https://github.com/fractary/claude-plugins /tmp/claude-plugins && cp -r /tmp/claude-plugins/plugins/faber-cloud/.archive/phase4-clean-separation/ops-monitor ~/.claude/skills/claude-plugins

// tip: Run this command in your terminal to install the skill


name: ops-monitor
model: claude-haiku-4-5
description: |
  Monitor deployed infrastructure health and performance - check resource status, query CloudWatch metrics (CPU, memory, requests, errors), analyze performance trends, track SLI/SLO metrics, detect anomalies, generate health reports with resource status summaries, identify degraded services, provide performance optimization recommendations.
tools: Bash, Read, Write

Operations Monitoring Skill

<CRITICAL_RULES> IMPORTANT: Monitoring and health check rules

  • Always check resource registry to know what resources exist
  • Query CloudWatch for actual runtime status and metrics
  • Report both healthy and unhealthy resources
  • Provide clear status summaries (healthy/degraded/unhealthy)
  • Include actionable recommendations for issues found
  • Track metrics over time to identify trends
  • Never assume health - always verify via AWS APIs </CRITICAL_RULES>

EXECUTE STEPS:

Step 1: Load Configuration and Registry

  • Read: .fractary/plugins/faber-cloud/devops.json
  • Read: .fractary/plugins/faber-cloud/deployments/${environment}/registry.json
  • Extract: List of deployed resources to monitor
  • Output: "✓ Found ${resource_count} resources to monitor"
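As a rough illustration of Step 1, the registry load could look like the sketch below. The registry schema is not shown in this document, so the top-level `"resources"` list is an assumption, and `load_registry` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Base directory taken from the paths listed in Step 1.
PLUGIN_DIR = Path(".fractary/plugins/faber-cloud")


def load_registry(environment: str, base: Path = PLUGIN_DIR) -> list:
    """Return the deployed resources recorded for one environment."""
    registry_path = base / "deployments" / environment / "registry.json"
    if not registry_path.exists():
        # Mirrors the "Resource registry not found" failure condition.
        raise FileNotFoundError(f"Resource registry not found: {registry_path}")
    registry = json.loads(registry_path.read_text())
    # Assumption: resources are stored under a top-level "resources" list.
    return registry.get("resources", [])
```

The length of the returned list would supply `${resource_count}` in the Step 1 output line.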

Step 2: Determine Operation

  • If operation == "health-check":
    • Read: workflow/health-check.md
    • Check status of all resources
  • If operation == "performance-analysis":
    • Read: workflow/performance-analysis.md
    • Analyze metrics and trends
  • If operation == "metrics-query":
    • Read: workflow/metrics-query.md
    • Query specific metrics
  • Output: "✓ Operation determined: ${operation}"
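The branching in Step 2 amounts to a lookup from operation name to workflow file. A minimal sketch follows; `workflow_for` is a hypothetical name, and raising on an unknown operation is an assumption, since the skill does not say what happens in that case.

```python
# Operation names and workflow files as listed in Step 2.
WORKFLOWS = {
    "health-check": "workflow/health-check.md",
    "performance-analysis": "workflow/performance-analysis.md",
    "metrics-query": "workflow/metrics-query.md",
}


def workflow_for(operation: str) -> str:
    """Return the workflow file to read for the requested operation."""
    try:
        return WORKFLOWS[operation]
    except KeyError:
        # Assumption: an unrecognized operation is treated as an error.
        raise ValueError(
            f"Unknown operation {operation!r}; expected one of {sorted(WORKFLOWS)}"
        ) from None
```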

Step 3: Execute Monitoring

  • For each resource in scope:
    • Query resource status via handler
    • Query CloudWatch metrics
    • Analyze current state
    • Compare against thresholds
  • Collect results for all resources
  • Output: "✓ Monitoring completed for ${resource_count} resources"
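The per-resource loop in Step 3 might be sketched as follows. The two callables stand in for the handler operations `get-resource-status` and `query-metrics`; their Python signatures, the `"id"` key on each resource, and the `MetricsUnavailable` exception are all assumptions made for illustration.

```python
class MetricsUnavailable(Exception):
    """Hypothetical error raised when a resource has no metric data yet."""


def monitor_resources(resources, get_status, query_metrics):
    """Collect status and metrics for every resource in scope."""
    results = []
    for resource in resources:
        resource_id = resource["id"]  # assumption: registry entries carry an "id"
        status = get_status(resource_id)
        try:
            metrics = query_metrics(resource_id)
        except MetricsUnavailable:
            # Missing metrics are recorded, not treated as a failure,
            # so the resource can later be reported as "status unknown".
            metrics = None
        results.append({"resource": resource_id, "status": status, "metrics": metrics})
    return results
```

Every resource ends up in the result list, which is what the "Some resources not checked" partial-completion rule requires.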

Step 4: Analyze Results

  • Read: workflow/analyze-health.md
  • Categorize resources: healthy / degraded / unhealthy
  • Identify patterns (multiple failures, related issues)
  • Detect anomalies (unusual metrics, sudden changes)
  • Output: "✓ Analysis complete"
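Once each resource has a category, the counts used in the summary can be tallied in a few lines. This sketch assumes each result carries a `"health"` field holding one of the status names; that field name is not defined by the skill.

```python
from collections import Counter


def summarize(results):
    """Count resources per health category."""
    counts = Counter(result["health"] for result in results)
    return {
        "total": len(results),
        "healthy": counts["HEALTHY"],
        "degraded": counts["DEGRADED"],
        "unhealthy": counts["UNHEALTHY"],
    }
```

Resources with any other status (for example UNKNOWN) count toward `total` only.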

Step 5: Generate Report

  • Create monitoring report with:
    • Overall health status
    • Resource-by-resource status
    • Metrics summary
    • Issues found
    • Recommendations
  • Save to: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
  • Output: "✓ Report generated: ${report_path}"
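A possible way to build the report path and write the file is sketched below. The helper names are hypothetical, and the timestamp is passed in as a string because the skill does not fix its format.

```python
import json
from pathlib import Path

PLUGIN_DIR = Path(".fractary/plugins/faber-cloud")


def report_path(environment: str, operation: str, timestamp: str,
                base: Path = PLUGIN_DIR) -> Path:
    """Build monitoring/${environment}/${timestamp}-${operation}.json."""
    return base / "monitoring" / environment / f"{timestamp}-{operation}.json"


def save_report(report: dict, path: Path) -> Path:
    """Write the report as JSON, creating the environment folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
    return path
```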

Step 6: Check Thresholds

  • Compare metrics against configured thresholds
  • Identify threshold violations
  • Prioritize by severity
  • Output: "✓ Threshold check complete"
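The threshold check can be pictured as a filter followed by a sort. The shape of the threshold configuration (`"max"` and `"severity"` per metric) is an assumption, as are the MEDIUM and LOW severity names; only HIGH appears in this document.

```python
# Lower number sorts first. MEDIUM and LOW are assumed severity names.
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def find_violations(metrics: dict, thresholds: dict) -> list:
    """Return threshold violations, most severe first."""
    violations = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None or value <= limit["max"]:
            continue  # metric missing or within its threshold
        violations.append({
            "metric": name,
            "current_value": value,
            "threshold": limit["max"],
            "severity": limit["severity"],
        })
    return sorted(violations, key=lambda v: SEVERITY_ORDER[v["severity"]])
```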

OUTPUT COMPLETION MESSAGE:

✅ COMPLETED: Operations Monitoring
Status: ${overall_health}
Resources Checked: ${total_count}
Healthy: ${healthy_count}
Degraded: ${degraded_count}
Unhealthy: ${unhealthy_count}

${issues_summary}

Report: ${report_path}
───────────────────────────────────────
${recommendations_summary}
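The `${name}` placeholders in this message happen to match the syntax of Python's `string.Template`, so one way to render it is sketched here. `render_completion` is a hypothetical helper, not part of the skill.

```python
from string import Template

# The completion message above, with its placeholders unchanged.
COMPLETION_TEMPLATE = Template(
    "✅ COMPLETED: Operations Monitoring\n"
    "Status: ${overall_health}\n"
    "Resources Checked: ${total_count}\n"
    "Healthy: ${healthy_count}\n"
    "Degraded: ${degraded_count}\n"
    "Unhealthy: ${unhealthy_count}\n"
    "\n"
    "${issues_summary}\n"
    "\n"
    "Report: ${report_path}\n"
    "───────────────────────────────────────\n"
    "${recommendations_summary}\n"
)


def render_completion(values: dict) -> str:
    """Fill in the completion message.

    substitute() raises KeyError if any placeholder has no value,
    so an incomplete summary is caught before it is shown.
    """
    return COMPLETION_TEMPLATE.substitute(values)
```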

IF ISSUES FOUND:

⚠️  COMPLETED: Operations Monitoring (Issues Found)
Status: DEGRADED
Resources Checked: ${total_count}
Unhealthy: ${unhealthy_count}

Issues:
${issue_list}

Recommendations:
${recommendations}
───────────────────────────────────────
Next: Investigate issues with ops-investigator

IF FAILURE:

❌ FAILED: Operations Monitoring
Step: ${failed_step}
Error: ${error_message}
───────────────────────────────────────
Resolution: ${resolution_steps}

<COMPLETION_CRITERIA> This skill is complete and successful when ALL of the following are verified:

1. Resources Identified

  • Resource registry loaded
  • All resources in scope identified
  • Resource types determined

2. Status Checked

  • Resource status queried from AWS
  • CloudWatch metrics collected
  • Current state determined

3. Health Analyzed

  • Resources categorized by health
  • Issues identified and prioritized
  • Patterns and anomalies detected

4. Report Generated

  • Monitoring report created
  • All findings documented
  • Recommendations provided

5. Thresholds Evaluated

  • Metrics compared to thresholds
  • Violations identified
  • Severity assessed

FAILURE CONDITIONS - Stop and report if:

  ❌ Cannot access CloudWatch (check AWS permissions)
  ❌ Resource registry not found (no deployments in environment)
  ❌ CloudWatch logs/metrics not available (check resource configuration)

PARTIAL COMPLETION - Not acceptable:

  ⚠️ Some resources not checked → Return to Step 3
  ⚠️ Report not generated → Return to Step 5

</COMPLETION_CRITERIA>

  1. Monitoring Report

    • Location: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
    • Format: JSON with detailed findings
    • Contains: Health status, metrics, issues, recommendations
  2. Health Summary

    • Overall status: HEALTHY / DEGRADED / UNHEALTHY
    • Resource counts by status
    • Critical issues list
    • Priority recommendations

Return to agent:

{
  "overall_health": "HEALTHY|DEGRADED|UNHEALTHY",
  "environment": "${environment}",
  "timestamp": "2025-10-28T...",

  "resources": {
    "total": 10,
    "healthy": 8,
    "degraded": 1,
    "unhealthy": 1
  },

  "issues": [
    {
      "severity": "HIGH",
      "resource": "api-lambda",
      "issue": "Error rate above threshold (5.2% > 1%)",
      "metric": "Errors",
      "current_value": "5.2%",
      "threshold": "1%"
    }
  ],

  "metrics_summary": {
    "api-lambda": {
      "invocations": 1250,
      "errors": 65,
      "error_rate": "5.2%",
      "duration_avg": "245ms",
      "throttles": 0
    }
  },

  "recommendations": [
    "Investigate api-lambda errors (5.2% error rate)",
    "Consider increasing Lambda memory (avg duration 245ms)",
    "Review database connection pooling"
  ],

  "report_path": ".fractary/plugins/faber-cloud/monitoring/test/2025-10-28-health-check.json"
}

**USE SKILL: handler-hosting-${hosting_handler}**
Operation: get-resource-status | query-metrics
Arguments: ${resource_id} ${metric_name} ${timeframe}

Reports are stored in:

  • .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
  • Historical trends in monitoring-history.json

<ERROR_HANDLING>

<CLOUDWATCH_ACCESS_ERROR>
Pattern: AccessDenied for CloudWatch operations
Action:
  1. Check if CloudWatch permissions granted
  2. Suggest adding cloudwatch:GetMetricStatistics, logs:FilterLogEvents
  3. Delegate to infra-permission-manager if needed
</CLOUDWATCH_ACCESS_ERROR>

<RESOURCE_NOT_FOUND>
Pattern: Resource doesn't exist in AWS
Action:
  1. Check if resource listed in registry but deleted
  2. Warn about registry drift
  3. Suggest verifying deployment
</RESOURCE_NOT_FOUND>

<METRICS_NOT_AVAILABLE>
Pattern: No metrics data for resource
Action:
  1. Check if resource recently created (metrics may lag)
  2. Verify CloudWatch logging enabled
  3. Report as "status unknown" rather than failing
</METRICS_NOT_AVAILABLE>

</ERROR_HANDLING>
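One way to route a failed query to the three cases above is sketched here. The inputs are assumptions: `error_code` is taken to be the error code string returned by AWS (or None when the call succeeded), and matching "NotFound" as a substring is a guess at how a missing resource would surface.

```python
def classify_error(error_code, datapoints):
    """Return the name of the matching error-handling case, or None."""
    if error_code and "AccessDenied" in error_code:
        return "CLOUDWATCH_ACCESS_ERROR"
    if error_code and "NotFound" in error_code:
        # Assumption: missing resources surface with "NotFound" in the code.
        return "RESOURCE_NOT_FOUND"
    if error_code is None and not datapoints:
        # The call succeeded but returned no data for the resource.
        return "METRICS_NOT_AVAILABLE"
    return None
```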

<HEALTH_STATUS_CRITERIA> Resources are classified as:

HEALTHY:

  • Resource exists and is running
  • All metrics within thresholds
  • No errors or minimal error rate (<0.1%)
  • Performance acceptable

DEGRADED:

  • Resource exists and is running
  • Some metrics approaching thresholds (above 80% of the threshold value)
  • Elevated error rate (0.1% - 1%)
  • Performance slightly degraded

UNHEALTHY:

  • Resource doesn't exist or is stopped
  • Metrics exceed thresholds
  • High error rate (>1%)
  • Performance severely degraded
  • Resource in failed state

UNKNOWN:

  • Cannot determine status
  • Metrics not available
  • CloudWatch access issues </HEALTH_STATUS_CRITERIA>
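The criteria above can be condensed into a small decision function. This is only a sketch: `threshold_ratio` (worst metric value divided by its threshold) is an assumed way of expressing "approaching thresholds (>80%)", and the qualitative "performance" criteria are left out because they have no numeric definition here.

```python
def classify_health(running, error_rate_pct, threshold_ratio):
    """Classify one resource as HEALTHY, DEGRADED, UNHEALTHY or UNKNOWN.

    running         -- True/False, or None if status could not be read
    error_rate_pct  -- error rate in percent, or None if unavailable
    threshold_ratio -- worst metric value / its threshold, or None
    """
    if running is None or error_rate_pct is None or threshold_ratio is None:
        return "UNKNOWN"
    if not running or error_rate_pct > 1.0 or threshold_ratio > 1.0:
        return "UNHEALTHY"
    if error_rate_pct >= 0.1 or threshold_ratio > 0.8:
        return "DEGRADED"
    return "HEALTHY"
```

For example, `classify_health(True, 5.2, 0.5)` falls in the UNHEALTHY branch because the error rate exceeds 1%.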

<METRICS_BY_RESOURCE_TYPE>

Lambda:

  • Invocations (count)
  • Errors (count)
  • Duration (ms)
  • Throttles (count)
  • ConcurrentExecutions (count)
  • Error rate = Errors / Invocations * 100
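The error-rate formula is simple, but the zero-invocation case needs a decision. In this sketch, returning None for an idle function is an assumption, chosen so that no traffic is not reported as a 0% error rate.

```python
def error_rate_percent(errors: int, invocations: int):
    """Error rate = Errors / Invocations * 100."""
    if invocations == 0:
        # Assumption: no traffic means the rate is undefined, not 0%.
        return None
    return errors / invocations * 100
```

With the figures from the sample report, 65 errors over 1250 invocations gives 5.2%.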

S3:

  • BucketSizeBytes (bytes)
  • NumberOfObjects (count)
  • 4xxErrors (count)
  • 5xxErrors (count)

RDS:

  • CPUUtilization (percent)
  • DatabaseConnections (count)
  • FreeableMemory (bytes)
  • ReadLatency (seconds)
  • WriteLatency (seconds)

ECS:

  • CPUUtilization (percent)
  • MemoryUtilization (percent)
  • RunningTaskCount (count)
  • DesiredTaskCount (count)

API Gateway:

  • Count (requests)
  • 4XXError (count)
  • 5XXError (count)
  • Latency (ms)
  • IntegrationLatency (ms) </METRICS_BY_RESOURCE_TYPE>
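For the Lambda metrics listed above, a CloudWatch query could be prepared as in the sketch below. It only builds the request parameters and makes no AWS call; the helper name, the 60-minute window, and the 300-second period are illustrative assumptions. `AWS/Lambda` and the `FunctionName` dimension are the standard CloudWatch namespace and dimension for Lambda.

```python
from datetime import datetime, timedelta, timezone


def lambda_metric_request(function_name, metric_name, minutes=60, now=None):
    """Build keyword arguments for a CloudWatch GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,  # e.g. "Errors" or "Invocations"
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 300,           # seconds per datapoint (assumed)
        "Statistics": ["Sum"],   # counts such as Errors are summed
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("cloudwatch").get_metric_statistics(**params)`, which requires the `cloudwatch:GetMetricStatistics` permission mentioned in the error handling section.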