ops-monitor
Monitor deployed infrastructure health and performance - check resource status, query CloudWatch metrics (CPU, memory, requests, errors), analyze performance trends, track SLI/SLO metrics, detect anomalies, generate health reports with resource status summaries, identify degraded services, provide performance optimization recommendations.
Install
git clone https://github.com/fractary/claude-plugins /tmp/claude-plugins && cp -r /tmp/claude-plugins/plugins/faber-cloud/.archive/phase4-clean-separation/ops-monitor ~/.claude/skills/claude-plugins
Run this command in your terminal to install the skill.
name: ops-monitor
model: claude-haiku-4-5
description: |
  Monitor deployed infrastructure health and performance - check resource status, query CloudWatch metrics (CPU, memory, requests, errors), analyze performance trends, track SLI/SLO metrics, detect anomalies, generate health reports with resource status summaries, identify degraded services, provide performance optimization recommendations.
tools: Bash, Read, Write
Operations Monitoring Skill
<CRITICAL_RULES> IMPORTANT: Monitoring and health check rules
- Always check resource registry to know what resources exist
- Query CloudWatch for actual runtime status and metrics
- Report both healthy and unhealthy resources
- Provide clear status summaries (healthy/degraded/unhealthy)
- Include actionable recommendations for issues found
- Track metrics over time to identify trends
- Never assume health - always verify via AWS APIs </CRITICAL_RULES>
EXECUTE STEPS:
Step 1: Load Configuration and Registry
- Read: .fractary/plugins/faber-cloud/devops.json
- Read: .fractary/plugins/faber-cloud/deployments/${environment}/registry.json
- Extract: List of deployed resources to monitor
- Output: "✓ Found ${resource_count} resources to monitor"
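Step 1 could be sketched roughly as follows. This is a minimal illustration, not the plugin's actual loader: the function name `load_registry` and the top-level `"resources"` key are assumptions, since the registry schema is not shown here.

```python
import json
from pathlib import Path

# Base directory taken from the paths listed in Step 1.
PLUGIN_DIR = Path(".fractary/plugins/faber-cloud")


def load_registry(environment: str, base: Path = PLUGIN_DIR) -> list:
    """Return the deployed resources listed in the environment's registry."""
    registry_path = base / "deployments" / environment / "registry.json"
    if not registry_path.exists():
        # Mirrors the "Resource registry not found" failure condition.
        raise FileNotFoundError(f"Resource registry not found: {registry_path}")
    registry = json.loads(registry_path.read_text())
    # ASSUMPTION: resources are stored under a top-level "resources" key.
    return registry.get("resources", [])
```

A caller would then print the Step 1 output line from `len(load_registry(environment))`.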
Step 2: Determine Operation
- If operation == "health-check":
  - Read: workflow/health-check.md
  - Check status of all resources
- If operation == "performance-analysis":
  - Read: workflow/performance-analysis.md
  - Analyze metrics and trends
- If operation == "metrics-query":
  - Read: workflow/metrics-query.md
  - Query specific metrics
- Output: "✓ Operation determined: ${operation}"
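The branching in Step 2 amounts to a lookup from operation name to workflow file. A small sketch, where the `resolve_workflow` helper is hypothetical and only the three operation names and file paths come from the step itself:

```python
# Operation names and workflow files as listed in Step 2.
WORKFLOWS = {
    "health-check": "workflow/health-check.md",
    "performance-analysis": "workflow/performance-analysis.md",
    "metrics-query": "workflow/metrics-query.md",
}


def resolve_workflow(operation: str) -> str:
    """Return the workflow file for an operation, rejecting unknown names."""
    try:
        return WORKFLOWS[operation]
    except KeyError:
        supported = ", ".join(sorted(WORKFLOWS))
        raise ValueError(
            f"Unknown operation {operation!r}; expected one of: {supported}"
        ) from None
```

Failing loudly on an unknown operation keeps the skill from silently running the wrong workflow.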
Step 3: Execute Monitoring
- For each resource in scope:
  - Query resource status via handler
  - Query CloudWatch metrics
  - Analyze current state
  - Compare against thresholds
- Collect results for all resources
- Output: "✓ Monitoring completed for ${resource_count} resources"
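The skill routes metric queries through the hosting handler, whose internals are not shown here. As an illustration of what such a query needs, the sketch below builds the keyword arguments for CloudWatch's `GetMetricStatistics` call. The helper name `metric_request` and its defaults (a 60-minute window, 300-second period, `Sum` statistic) are assumptions.

```python
from datetime import datetime, timedelta, timezone


def metric_request(namespace, metric_name, dimensions,
                   minutes=60, period=300, statistic="Sum", now=None):
    """Build keyword arguments for a CloudWatch GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        # CloudWatch expects dimensions as a list of Name/Value pairs.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": [statistic],
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("cloudwatch").get_metric_statistics(**metric_request("AWS/Lambda", "Errors", {"FunctionName": "api-lambda"}))`.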
Step 4: Analyze Results
- Read: workflow/analyze-health.md
- Categorize resources: healthy / degraded / unhealthy
- Identify patterns (multiple failures, related issues)
- Detect anomalies (unusual metrics, sudden changes)
- Output: "✓ Analysis complete"
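The skill does not say how anomalies are detected. One common approach, shown here purely as an assumed stand-in, is a z-score test of the current value against recent history. The function name and the threshold of 3 standard deviations are illustrative choices.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag a value that sits far outside the recent history (z-score test)."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a sudden change.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

A z-score is cheap and easy to explain in a report, but it assumes roughly stable metrics; strongly seasonal traffic would need a different baseline.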
Step 5: Generate Report
- Create monitoring report with:
  - Overall health status
  - Resource-by-resource status
  - Metrics summary
  - Issues found
  - Recommendations
- Save to: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Output: "✓ Report generated: ${report_path}"
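Saving the report in Step 5 might look like the sketch below. The `save_report` helper is hypothetical, and the timestamp format is an assumption: the skill only specifies a `${timestamp}` placeholder in the file name.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_report(report, environment, operation,
                base=Path(".fractary/plugins/faber-cloud"), now=None):
    """Write the report as JSON and return the path it was saved to."""
    now = now or datetime.now(timezone.utc)
    # ASSUMPTION: a filesystem-safe UTC timestamp; the real format is unspecified.
    timestamp = now.strftime("%Y-%m-%dT%H-%M-%SZ")
    report_dir = base / "monitoring" / environment
    report_dir.mkdir(parents=True, exist_ok=True)
    report_path = report_dir / f"{timestamp}-{operation}.json"
    report_path.write_text(json.dumps(report, indent=2))
    return report_path
```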
Step 6: Check Thresholds
- Compare metrics against configured thresholds
- Identify threshold violations
- Prioritize by severity
- Output: "✓ Threshold check complete"
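A threshold check along the lines of Step 6 can be sketched as below. The severity bands (2x the limit for HIGH, 1.25x for MEDIUM) and the `find_violations` name are assumptions made for illustration; the skill does not define how severity is assigned.

```python
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def find_violations(metrics, thresholds):
    """Return threshold violations, most severe first."""
    violations = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None or value <= limit:
            continue  # metric missing or within its threshold
        ratio = value / limit if limit else float("inf")
        # ASSUMPTION: severity bands are illustrative, not from the skill.
        if ratio >= 2:
            severity = "HIGH"
        elif ratio >= 1.25:
            severity = "MEDIUM"
        else:
            severity = "LOW"
        violations.append(
            {"metric": name, "current_value": value,
             "threshold": limit, "severity": severity}
        )
    return sorted(violations, key=lambda v: SEVERITY_ORDER[v["severity"]])
```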
OUTPUT COMPLETION MESSAGE:
✅ COMPLETED: Operations Monitoring
Status: ${overall_health}
Resources Checked: ${total_count}
Healthy: ${healthy_count}
Degraded: ${degraded_count}
Unhealthy: ${unhealthy_count}
${issues_summary}
Report: ${report_path}
───────────────────────────────────────
${recommendations_summary}
IF ISSUES FOUND:
⚠️ COMPLETED: Operations Monitoring (Issues Found)
Status: DEGRADED
Resources Checked: ${total_count}
Unhealthy: ${unhealthy_count}
Issues:
${issue_list}
Recommendations:
${recommendations}
───────────────────────────────────────
Next: Investigate issues with ops-investigator
IF FAILURE:
❌ FAILED: Operations Monitoring
Step: ${failed_step}
Error: ${error_message}
───────────────────────────────────────
Resolution: ${resolution_steps}
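The `${...}` placeholders in these messages match the syntax of Python's `string.Template`, so filling one in could look like the following sketch. The template here is a shortened version of the success message, and `render_completion` is a hypothetical helper.

```python
from string import Template

# Shortened form of the success message above.
COMPLETED = Template(
    "✅ COMPLETED: Operations Monitoring\n"
    "Status: ${overall_health}\n"
    "Resources Checked: ${total_count}\n"
    "Healthy: ${healthy_count}\n"
    "Degraded: ${degraded_count}\n"
    "Unhealthy: ${unhealthy_count}\n"
    "Report: ${report_path}"
)


def render_completion(summary):
    """Fill in the message; a missing field raises KeyError instead of printing a gap."""
    return COMPLETED.substitute(summary)
```

Using `substitute` rather than `safe_substitute` means an incomplete summary fails visibly rather than emitting a literal `${...}` to the user.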
<COMPLETION_CRITERIA> This skill is complete and successful when ALL verified:
✅ 1. Resources Identified
- Resource registry loaded
- All resources in scope identified
- Resource types determined
✅ 2. Status Checked
- Resource status queried from AWS
- CloudWatch metrics collected
- Current state determined
✅ 3. Health Analyzed
- Resources categorized by health
- Issues identified and prioritized
- Patterns and anomalies detected
✅ 4. Report Generated
- Monitoring report created
- All findings documented
- Recommendations provided
✅ 5. Thresholds Evaluated
- Metrics compared to thresholds
- Violations identified
- Severity assessed
FAILURE CONDITIONS - Stop and report if:
❌ Cannot access CloudWatch (check AWS permissions)
❌ Resource registry not found (no deployments in environment)
❌ CloudWatch logs/metrics not available (check resource configuration)
PARTIAL COMPLETION - Not acceptable:
⚠️ Some resources not checked → Return to Step 3
⚠️ Report not generated → Return to Step 5
</COMPLETION_CRITERIA>
- Monitoring Report
  - Location: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
  - Format: JSON with detailed findings
  - Contains: Health status, metrics, issues, recommendations
- Health Summary
  - Overall status: HEALTHY / DEGRADED / UNHEALTHY
  - Resource counts by status
  - Critical issues list
  - Priority recommendations
Return to agent:
{
  "overall_health": "HEALTHY|DEGRADED|UNHEALTHY",
  "environment": "${environment}",
  "timestamp": "2025-10-28T...",
  "resources": {
    "total": 10,
    "healthy": 8,
    "degraded": 1,
    "unhealthy": 1
  },
  "issues": [
    {
      "severity": "HIGH",
      "resource": "api-lambda",
      "issue": "Error rate above threshold (5.2% > 1%)",
      "metric": "Errors",
      "current_value": "5.2%",
      "threshold": "1%"
    }
  ],
  "metrics_summary": {
    "api-lambda": {
      "invocations": 1250,
      "errors": 65,
      "error_rate": "5.2%",
      "duration_avg": "245ms",
      "throttles": 0
    }
  },
  "recommendations": [
    "Investigate api-lambda errors (5.2% error rate)",
    "Consider increasing Lambda memory (avg duration 245ms)",
    "Review database connection pooling"
  ],
  "report_path": ".fractary/plugins/faber-cloud/monitoring/test/2025-10-28-health-check.json"
}
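The `overall_health` and `resources` fields of this payload can be derived from per-resource statuses. The sketch below assumes a "worst status wins" rollup, which the skill does not spell out, and a hypothetical `summarize` helper. Resources with status UNKNOWN count toward `total` only.

```python
from collections import Counter


def summarize(statuses):
    """Roll per-resource statuses up into the overall_health/resources fields."""
    counts = Counter(statuses.values())
    # ASSUMPTION: the worst status present determines overall health.
    if counts["UNHEALTHY"]:
        overall = "UNHEALTHY"
    elif counts["DEGRADED"]:
        overall = "DEGRADED"
    else:
        overall = "HEALTHY"
    return {
        "overall_health": overall,
        "resources": {
            "total": len(statuses),
            "healthy": counts["HEALTHY"],
            "degraded": counts["DEGRADED"],
            "unhealthy": counts["UNHEALTHY"],
        },
    }
```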
**USE SKILL: handler-hosting-${hosting_handler}**
Operation: get-resource-status | query-metrics
Arguments: ${resource_id} ${metric_name} ${timeframe}
Reports are stored in:
- .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Historical trends in monitoring-history.json
<ERROR_HANDLING>
<CLOUDWATCH_ACCESS_ERROR>
Pattern: AccessDenied for CloudWatch operations
Action:
1. Check if CloudWatch permissions granted
2. Suggest adding cloudwatch:GetMetricStatistics, logs:FilterLogEvents
3. Delegate to infra-permission-manager if needed
</CLOUDWATCH_ACCESS_ERROR>
<RESOURCE_NOT_FOUND>
Pattern: Resource doesn't exist in AWS
Action:
1. Check if resource listed in registry but deleted
2. Warn about registry drift
3. Suggest verifying deployment
</RESOURCE_NOT_FOUND>
<METRICS_NOT_AVAILABLE>
Pattern: No metrics data for resource
Action:
1. Check if resource recently created (metrics may lag)
2. Verify CloudWatch logging enabled
3. Report as "status unknown" rather than failing
</METRICS_NOT_AVAILABLE>
</ERROR_HANDLING>
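Mapping a failed or empty query onto these three cases could be sketched as follows. The substring checks on the AWS error code are assumptions for illustration, and `classify_failure` is a hypothetical helper.

```python
def classify_failure(error_code, datapoints):
    """Map a query outcome to one of the error-handling cases, or None if fine."""
    code = (error_code or "").lower()
    # ASSUMPTION: matching on substrings of the AWS error code.
    if "accessdenied" in code:
        return "CLOUDWATCH_ACCESS_ERROR"
    if "notfound" in code:
        return "RESOURCE_NOT_FOUND"
    if not code and not datapoints:
        # The call succeeded but returned no data: report "status unknown".
        return "METRICS_NOT_AVAILABLE"
    return None
```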
<HEALTH_STATUS_CRITERIA> Resources are classified as:
HEALTHY:
- Resource exists and is running
- All metrics within thresholds
- No errors or minimal error rate (<0.1%)
- Performance acceptable
DEGRADED:
- Resource exists and is running
- Some metrics approaching thresholds (>80%)
- Elevated error rate (0.1% - 1%)
- Performance slightly degraded
UNHEALTHY:
- Resource doesn't exist or is stopped
- Metrics exceed thresholds
- High error rate (>1%)
- Performance severely degraded
- Resource in failed state
UNKNOWN:
- Cannot determine status
- Metrics not available
- CloudWatch access issues </HEALTH_STATUS_CRITERIA>
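The criteria above can be sketched as a single function. The two-number summary used here (error rate in percent, and the highest metric expressed as a percentage of its threshold) and the `classify_health` name are simplifying assumptions. The 0.1% boundary is treated as DEGRADED and the 1% boundary as DEGRADED as well, since the criteria leave the exact edges open.

```python
from typing import Optional


def classify_health(running: Optional[bool],
                    error_rate_pct: Optional[float],
                    threshold_usage_pct: Optional[float]) -> str:
    """Classify a resource as HEALTHY, DEGRADED, UNHEALTHY or UNKNOWN."""
    if running is False:
        return "UNHEALTHY"  # missing, stopped, or failed
    if running is None or error_rate_pct is None or threshold_usage_pct is None:
        return "UNKNOWN"  # status or metrics could not be determined
    if error_rate_pct > 1.0 or threshold_usage_pct > 100.0:
        return "UNHEALTHY"  # high error rate or a threshold exceeded
    if error_rate_pct >= 0.1 or threshold_usage_pct > 80.0:
        return "DEGRADED"  # elevated errors or approaching a threshold
    return "HEALTHY"
```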
<METRICS_BY_RESOURCE_TYPE>
Lambda:
- Invocations (count)
- Errors (count)
- Duration (ms)
- Throttles (count)
- ConcurrentExecutions (count)
- Error rate = Errors / Invocations * 100
S3:
- BucketSizeBytes (bytes)
- NumberOfObjects (count)
- 4xxErrors (count)
- 5xxErrors (count)
RDS:
- CPUUtilization (percent)
- DatabaseConnections (count)
- FreeableMemory (bytes)
- ReadLatency (seconds)
- WriteLatency (seconds)
ECS:
- CPUUtilization (percent)
- MemoryUtilization (percent)
- RunningTaskCount (count)
- DesiredTaskCount (count)
API Gateway:
- Count (requests)
- 4XXError (count)
- 5XXError (count)
- Latency (ms)
- IntegrationLatency (ms) </METRICS_BY_RESOURCE_TYPE>
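For lookups in code, the metric lists above could be kept in a small catalog keyed by resource type. The dictionary keys and helper name below are illustrative assumptions; the namespaces are the standard CloudWatch ones for each service. The `error_rate` function implements the Lambda formula listed above, with a guard for zero invocations.

```python
# Resource type -> (CloudWatch namespace, metric names from the lists above).
METRICS = {
    "lambda": ("AWS/Lambda",
               ["Invocations", "Errors", "Duration", "Throttles",
                "ConcurrentExecutions"]),
    "s3": ("AWS/S3",
           ["BucketSizeBytes", "NumberOfObjects", "4xxErrors", "5xxErrors"]),
    "rds": ("AWS/RDS",
            ["CPUUtilization", "DatabaseConnections", "FreeableMemory",
             "ReadLatency", "WriteLatency"]),
    # Task counts are published by Container Insights (ECS/ContainerInsights),
    # so only the utilization metrics are listed under AWS/ECS here.
    "ecs": ("AWS/ECS", ["CPUUtilization", "MemoryUtilization"]),
    "api-gateway": ("AWS/ApiGateway",
                    ["Count", "4XXError", "5XXError", "Latency",
                     "IntegrationLatency"]),
}


def error_rate(errors: int, invocations: int) -> float:
    """Error rate in percent: Errors / Invocations * 100 (0.0 when idle)."""
    if invocations == 0:
        return 0.0
    return errors / invocations * 100
```

For example, 65 errors over 1250 invocations gives an error rate of 5.2%.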
