
aws-cloudwatch

Implement monitoring, alerting, and observability with CloudWatch

$ Install

git clone https://github.com/pluginagentmarketplace/custom-plugin-aws /tmp/custom-plugin-aws && cp -r /tmp/custom-plugin-aws/skills/aws-cloudwatch ~/.claude/skills/custom-plugin-aws

// tip: Run this command in your terminal to install the skill


name: aws-cloudwatch
description: Implement monitoring, alerting, and observability with CloudWatch
sasmp_version: "1.3.0"
bonded_agent: 08-aws-devops
bond_type: SECONDARY_BOND

AWS CloudWatch Skill

Set up comprehensive monitoring and alerting for AWS resources.

Quick Reference

| Attribute | Value |
|---|---|
| AWS Service | CloudWatch |
| Complexity | Medium |
| Est. Time | 15-30 min |
| Prerequisites | Resources to monitor |

Parameters

Required

| Parameter | Type | Description | Validation |
|---|---|---|---|
| namespace | string | Metric namespace | AWS/* or custom |
| metric_name | string | Metric name | Valid metric |
| resource_id | string | Resource identifier | Valid ARN or ID |

Optional

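| Parameter | Type | Default | Description |
|---|---|---|---|
| period | int | 300 | Evaluation period (seconds) |
| statistic | string | Average | Average, Sum, Minimum, Maximum, SampleCount, or a percentile such as p99 |
| threshold | float | varies | Alert threshold |
| evaluation_periods | int | 3 | Consecutive breaching periods required to trigger the alarm |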
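
The `statistic` parameter covers both standard statistics and percentiles, but the CloudWatch `PutMetricAlarm` API takes them in different fields: `Statistic` for the standard set and `ExtendedStatistic` for percentiles. The following is a minimal sketch of that split. The helper name `statistic_kwargs` and the `Min`/`Max` aliases are assumptions for illustration and are not part of this skill; only percentile-style extended statistics are handled.

```python
import re

# Standard statistics accepted by the Statistic field of PutMetricAlarm.
STANDARD_STATISTICS = {"SampleCount", "Average", "Sum", "Minimum", "Maximum"}

# Hypothetical convenience aliases for common short names.
ALIASES = {"Min": "Minimum", "Max": "Maximum"}

# Percentile statistics such as p95, p99.9, or p100.
PERCENTILE = re.compile(r"^p(100|\d{1,2}(\.\d+)?)$")


def statistic_kwargs(statistic: str) -> dict:
    """Return the alarm keyword argument that carries the given statistic."""
    name = ALIASES.get(statistic, statistic)
    if name in STANDARD_STATISTICS:
        return {"Statistic": name}
    if PERCENTILE.match(name):
        return {"ExtendedStatistic": name}
    raise ValueError(f"unsupported statistic: {statistic!r}")
```

The returned dictionary can be merged into the other alarm arguments, for example `put_metric_alarm(**base_kwargs, **statistic_kwargs("p99"))`.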

Essential Alarms

EC2 Alarms

- name: HighCPU
  metric: CPUUtilization
  threshold: 80
  period: 300
  evaluation_periods: 3

- name: StatusCheckFailed
  metric: StatusCheckFailed
  threshold: 1
  comparison: GreaterThanOrEqualTo  # metric is 0 or 1, so a strict > 1 never fires
  period: 60
  evaluation_periods: 2

ECS Alarms

- name: HighCPU
  metric: CPUUtilization
  threshold: 80

- name: HighMemory
  metric: MemoryUtilization
  threshold: 85

- name: RunningTaskCount
  metric: RunningTaskCount  # Container Insights metric (namespace ECS/ContainerInsights)
  threshold: 1
  comparison: LessThan

RDS Alarms

- name: HighCPU
  metric: CPUUtilization
  threshold: 80

- name: LowFreeStorage
  metric: FreeStorageSpace
  threshold: 10737418240  # 10GB
  comparison: LessThan

- name: HighConnections
  metric: DatabaseConnections
  threshold: 100
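
The alarm entries above use short keys (`name`, `metric`, `threshold`, `comparison`), while the CloudWatch API expects names such as `AlarmName` and `ComparisonOperator`. The following sketch shows one way to translate an entry, filling in the defaults from the Optional parameters table. The function name `to_alarm_kwargs`, the `COMPARISONS` mapping, the assumption that a missing `comparison` means "greater than", and the `example-db` identifier are all hypothetical; the skill itself does not specify how entries are loaded.

```python
# Defaults taken from the Optional parameters table.
DEFAULT_PERIOD = 300
DEFAULT_EVALUATION_PERIODS = 3

# Assumed mapping from short `comparison` values to API operator names.
COMPARISONS = {
    "GreaterThan": "GreaterThanThreshold",
    "GreaterThanOrEqualTo": "GreaterThanOrEqualToThreshold",
    "LessThan": "LessThanThreshold",
    "LessThanOrEqualTo": "LessThanOrEqualToThreshold",
}


def to_alarm_kwargs(entry: dict, namespace: str, dimensions: list) -> dict:
    """Translate one alarm entry into PutMetricAlarm keyword arguments."""
    # Assumption: entries without `comparison` alarm when the metric is high.
    comparison = entry.get("comparison", "GreaterThan")
    return {
        "AlarmName": entry["name"],
        "Namespace": namespace,
        "MetricName": entry["metric"],
        "Dimensions": dimensions,
        "Statistic": "Average",
        "Period": entry.get("period", DEFAULT_PERIOD),
        "EvaluationPeriods": entry.get(
            "evaluation_periods", DEFAULT_EVALUATION_PERIODS
        ),
        "Threshold": float(entry["threshold"]),
        "ComparisonOperator": COMPARISONS[comparison],
    }


low_storage = {
    "name": "LowFreeStorage",
    "metric": "FreeStorageSpace",
    "threshold": 10737418240,  # 10 GiB in bytes
    "comparison": "LessThan",
}
kwargs = to_alarm_kwargs(
    low_storage,
    namespace="AWS/RDS",
    dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-db"}],
)
```

With boto3, the result could then be passed as `cloudwatch.put_metric_alarm(**kwargs)`.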

Implementation

Create Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name prod-ec2-high-cpu \
  --alarm-description "EC2 CPU > 80% for 15 minutes" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --treat-missing-data notBreaching
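
The "for 15 minutes" in the alarm description follows from the other flags: a 300-second period evaluated over 3 consecutive periods gives 900 seconds. A small sketch of that arithmetic, using hypothetical helper names, can keep descriptions in sync with the alarm settings:

```python
def alarm_window_minutes(period_seconds: int, evaluation_periods: int) -> float:
    """Approximate time a metric must breach before the alarm fires."""
    return period_seconds * evaluation_periods / 60


def alarm_description(subject: str, threshold: float,
                      period_seconds: int, evaluation_periods: int) -> str:
    """Build a description string in the style of the command above."""
    minutes = alarm_window_minutes(period_seconds, evaluation_periods)
    return f"{subject} > {threshold}% for {minutes:g} minutes"


print(alarm_description("EC2 CPU", 80, 300, 3))
# prints: EC2 CPU > 80% for 15 minutes
```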

Dashboard Template

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "EC2 CPU Utilization",
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "ECS Service Memory",
        "metrics": [
          ["AWS/ECS", "MemoryUtilization", "ClusterName", "my-cluster", "ServiceName", "my-service"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1"
      }
    }
  ]
}
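
Dashboards with many widgets are easier to maintain when the JSON is generated. The sketch below builds a body in the same shape as the template, assuming hypothetical helpers `metric_widget` and `dashboard_body`; widget position and size fields are omitted, as they are in the template.

```python
import json


def metric_widget(title, namespace, metric, dimensions, region,
                  period=300, stat="Average"):
    """Build one metric widget in the shape used by the template."""
    # Flatten {"InstanceId": "i-xxx"} into ["InstanceId", "i-xxx"].
    dimension_pairs = [part for pair in dimensions.items() for part in pair]
    return {
        "type": "metric",
        "properties": {
            "title": title,
            "metrics": [[namespace, metric, *dimension_pairs]],
            "period": period,
            "stat": stat,
            "region": region,
        },
    }


def dashboard_body(widgets):
    """Serialize widgets into a dashboard body string."""
    return json.dumps({"widgets": widgets})


body = dashboard_body([
    metric_widget("EC2 CPU Utilization", "AWS/EC2", "CPUUtilization",
                  {"InstanceId": "i-xxx"}, region="us-east-1"),
])
```

With boto3, the resulting string could be passed as the `DashboardBody` argument of `put_dashboard`, together with a `DashboardName`.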

Custom Metrics

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish custom metric
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'RequestLatency',
            'Dimensions': [
                {'Name': 'Service', 'Value': 'API'},
                {'Name': 'Environment', 'Value': 'prod'}
            ],
            'Value': 150.5,
            'Unit': 'Milliseconds'
        }
    ]
)
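
In practice the latency value is measured rather than hard-coded. The following is a minimal sketch of one way to do that: a context manager times a block of work and appends an entry in the same `MetricData` shape to a list. The names `latency_datum`, `timed`, and `pending` are hypothetical, and the `time.sleep` call stands in for real request handling.

```python
import time
from contextlib import contextmanager


def latency_datum(service, environment, milliseconds):
    """Build one MetricData entry matching the example above."""
    return {
        "MetricName": "RequestLatency",
        "Dimensions": [
            {"Name": "Service", "Value": service},
            {"Name": "Environment", "Value": environment},
        ],
        "Value": milliseconds,
        "Unit": "Milliseconds",
    }


@contextmanager
def timed(service, environment, sink):
    """Measure the wrapped block and append a latency datum to `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        sink.append(latency_datum(service, environment, elapsed_ms))


pending = []
with timed("API", "prod", pending):
    time.sleep(0.01)  # stand-in for real request handling
```

The collected entries could then be published in one call, for example `cloudwatch.put_metric_data(Namespace='MyApp', MetricData=pending)`.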

Log Insights Queries

Error Rate

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)

Latency Analysis

fields @timestamp, latency
| stats avg(latency) as avg_latency,
        pct(latency, 95) as p95_latency,
        pct(latency, 99) as p99_latency
  by bin(1h)

Top Errors

fields @timestamp, @message
| filter @message like /Exception|Error/
| parse @message /(?<error_type>\w+Exception)/
| stats count() as count by error_type
| sort count desc
| limit 10
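
These queries can also be run programmatically. Logs Insights queries are asynchronous: `start_query` returns a query ID, and `get_query_results` is polled until the status is `Complete`. The sketch below assumes `logs` behaves like `boto3.client("logs")`; the function name `run_insights_query` and the polling defaults are illustrative choices, not part of this skill.

```python
import time


def run_insights_query(logs, log_group, query, start_time, end_time,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and return its rows as plain dicts.

    `start_time` and `end_time` are epoch seconds.
    """
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without AWS credentials.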

Troubleshooting

Common Issues

| Symptom | Cause | Solution |
|---|---|---|
| No data | Metric not emitting | Check CloudWatch Agent |
| Alarm stuck | Insufficient data | Check treat_missing_data |
| Dashboard empty | Wrong namespace | Verify metric source |
| High costs | Too many metrics | Use metric filters |

Debug Checklist

  • CloudWatch Agent installed and running?
  • IAM role allows cloudwatch:PutMetricData?
  • Correct namespace and dimensions?
  • Metric has data in expected period?
  • Alarm threshold reasonable?
  • SNS topic has subscriptions?
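
Two of the checklist items, whether the metric has recent data and whether the SNS topic has subscribers, can be checked from code. The sketch below assumes `cw` and `sns` behave like `boto3.client("cloudwatch")` and `boto3.client("sns")`; the function names and the default look-back window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def metric_has_data(cw, namespace, metric_name, dimensions,
                    minutes=60, period=300):
    """Return True if the metric reported any datapoint in the window."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    response = cw.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=["SampleCount"],
    )
    return len(response["Datapoints"]) > 0


def topic_has_subscriptions(sns, topic_arn):
    """Return True if at least one subscription exists on the topic."""
    response = sns.list_subscriptions_by_topic(TopicArn=topic_arn)
    return len(response["Subscriptions"]) > 0
```

An empty result from `metric_has_data` usually points at a wrong namespace or dimension set, since CloudWatch returns no datapoints rather than an error for an unknown combination.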

Test Template

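import boto3

# Client shared by the test; uses the default credentials and region.
cw = boto3.client('cloudwatch')


def test_cloudwatch_alarm():
    # Arrange
    alarm_name = "test-alarm"

    try:
        # Act: create a minimal alarm (no dimensions, no actions)
        cw.put_metric_alarm(
            AlarmName=alarm_name,
            MetricName='CPUUtilization',
            Namespace='AWS/EC2',
            Statistic='Average',
            Period=300,
            EvaluationPeriods=1,
            Threshold=80,
            ComparisonOperator='GreaterThanThreshold'
        )

        # Assert: the alarm can be read back by name
        response = cw.describe_alarms(AlarmNames=[alarm_name])
        assert len(response['MetricAlarms']) == 1
    finally:
        # Cleanup: runs even if the assertion fails
        cw.delete_alarms(AlarmNames=[alarm_name])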

Assets

  • assets/alarm-config.yaml - Common alarm configurations

References