AWS CloudWatch Monitoring
AWS CloudWatch is the unified monitoring and observability service that collects metrics, logs, and events from virtually every AWS resource and custom application. It provides the foundation for understanding what is happening inside your infrastructure, detecting problems before they affect users, and automating responses to operational events. Without CloudWatch, you are flying blind in the cloud, unable to answer basic questions like how much CPU your servers are using, whether your application is throwing errors, or why response times suddenly increased.
CloudWatch goes far beyond simple metric collection. It offers log aggregation and analysis through CloudWatch Logs, intelligent alerting through CloudWatch Alarms, visual operational dashboards, anomaly detection powered by machine learning, cross-account and cross-region observability, and deep query capabilities through CloudWatch Logs Insights. Together these features form a complete observability platform that scales from a single EC2 instance to thousands of microservices running across multiple AWS accounts.
Understanding CloudWatch deeply is essential for any engineer operating workloads on AWS. Whether you are running containers on ECS, serverless functions on Lambda, databases on RDS, or traditional servers on EC2, CloudWatch is the service that tells you whether everything is healthy and alerts you when it is not. This guide covers everything from basic metric collection to advanced anomaly detection, composite alarms, and Logs Insights queries. If you are following the AWS services roadmap, CloudWatch is the observability layer that ties all other services together.
What You Will Learn
After completing this guide, you will have a thorough understanding of AWS CloudWatch and how to build production-grade observability for your cloud infrastructure. Specifically, you will learn:
- How CloudWatch collects and stores metrics from AWS services automatically, including namespaces, dimensions, statistics, and retention periods that determine how long your data is available
- How to publish custom metrics from your applications using the PutMetricData API and the embedded metric format for high-throughput scenarios
- How CloudWatch Logs works for centralized log management, including log groups, log streams, retention policies, and subscription filters that route logs to other services
- How to write CloudWatch Logs Insights queries to search, filter, aggregate, and visualize log data across thousands of log streams in seconds
- How CloudWatch Alarms monitor metrics and trigger automated actions including SNS notifications, Auto Scaling policies, EC2 instance recovery, and Lambda function invocations
- How composite alarms combine multiple alarm states using boolean logic to reduce alert noise and create sophisticated alerting hierarchies
- How anomaly detection uses machine learning models to establish metric baselines and alert on unusual behavior without requiring you to set static thresholds
- How CloudWatch Dashboards provide real-time visual overviews of your infrastructure health with customizable widgets, automatic refresh, and cross-account visibility
Each section builds progressively, so reading from start to finish gives you the most complete understanding of CloudWatch from basic metric collection to advanced observability patterns.
Prerequisites
Before working through this guide, ensure you have the following ready:
- An active AWS account with permissions to access CloudWatch, create alarms, manage log groups, and publish custom metrics through IAM policies
- The AWS CLI installed and configured with credentials using aws configure, so you can interact with CloudWatch from your terminal for metric queries, log retrieval, and alarm management
- Basic familiarity with IAM roles and policies because CloudWatch access requires specific permissions like cloudwatch:PutMetricData, logs:CreateLogGroup, and cloudwatch:PutMetricAlarm
- At least one running AWS resource such as an EC2 instance or a Lambda function that generates metrics and logs you can observe in CloudWatch
- Understanding of JSON for working with metric filters, alarm configurations, and Logs Insights query results
- Comfort with the AWS Management Console for navigating CloudWatch dashboards, although this guide emphasizes CLI commands for reproducibility
No prior monitoring or observability experience is required. If you have ever checked a server's CPU usage or searched through application logs, the CloudWatch concepts will feel familiar but significantly more powerful.
Concept Overview
CloudWatch operates on a data model built around three core primitives: metrics, logs, and events. Metrics are time-series numerical data points that represent the behavior of your resources over time. Logs are timestamped text records that capture detailed information about what your applications and services are doing. Events are state changes and notifications that trigger automated workflows. Together these three primitives give you complete visibility into your infrastructure.
Metrics in CloudWatch are organized by namespace, which groups related metrics together. AWS services publish metrics to their own namespaces automatically. For example, EC2 metrics live in the AWS/EC2 namespace, Lambda metrics in AWS/Lambda, and RDS metrics in AWS/RDS. Each metric within a namespace is further identified by dimensions, which are name-value pairs that specify exactly which resource the metric applies to. An EC2 CPU utilization metric has an InstanceId dimension that tells you which specific instance reported that data point.
CloudWatch stores metrics at different resolutions. Standard resolution metrics are stored at one-minute intervals and retained for fifteen days at that granularity, then aggregated to five-minute periods for sixty-three days, and finally to one-hour periods for four hundred and fifty-five days. High-resolution metrics can be stored at one-second intervals for three hours, then follow the same aggregation pattern. This tiered retention means you always have recent detailed data for troubleshooting and longer-term trends for capacity planning.
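As a rough illustration of the retention tiers above, the following sketch maps the age of a data point to the finest period at which it should still be available. The function name finest_period is hypothetical, and the logic only restates the tiers described in this section; it does not call CloudWatch.

```bash
# Hypothetical helper: given a data point's age in seconds and whether the
# metric is high-resolution ("high") or standard ("standard"), print the
# finest period (in seconds) at which the data is still retained.
finest_period() {
  local age_seconds=$1 resolution=${2:-standard}
  local hour=3600 day=86400
  if [ "$resolution" = "high" ] && [ "$age_seconds" -lt $((3 * hour)) ]; then
    echo 1            # one-second data is kept for 3 hours
  elif [ "$age_seconds" -lt $((15 * day)) ]; then
    echo 60           # one-minute data is kept for 15 days
  elif [ "$age_seconds" -lt $((63 * day)) ]; then
    echo 300          # five-minute data is kept for 63 days
  elif [ "$age_seconds" -lt $((455 * day)) ]; then
    echo 3600         # one-hour data is kept for 455 days
  else
    echo expired      # older data is no longer available
  fi
}

finest_period $((2 * 3600)) high      # prints 1
finest_period $((20 * 86400))         # prints 300
```

In practice this means a query for data older than fifteen days should request a period of at least 300 seconds, and older than sixty-three days at least 3600 seconds.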
CloudWatch Logs provides a centralized location for all your log data. Applications running on EC2 send logs through the CloudWatch agent, Lambda functions send logs automatically, ECS containers route logs through the awslogs driver, and API Gateway can log every request and response. Log data is organized into log groups, which typically represent one application or service, and log streams within each group, which represent individual sources like a specific container instance or Lambda execution environment.
The power of CloudWatch comes from connecting these primitives together. A metric alarm watches a metric and triggers an action when it crosses a threshold. A metric filter scans log data and extracts numerical values to create custom metrics. A dashboard combines metrics from multiple services into a single visual overview. Logs Insights lets you query across millions of log records in seconds to find patterns, errors, and anomalies. This interconnected system means you can go from a high-level dashboard showing elevated error rates, to an alarm that fired, to the specific log entries that caused the problem, all within CloudWatch.
Step-by-Step Explanation
This section walks through the essential implementation steps in order. Each step builds on the previous one, providing a clear path from initial configuration to a production-ready setup that follows AWS best practices.
Publishing and Querying Metrics
Most AWS services publish metrics to CloudWatch automatically at no additional cost for basic monitoring. EC2 instances report CPU utilization, network traffic, disk operations, and status checks every five minutes with basic monitoring, or every one minute with detailed monitoring enabled. To view these metrics from the CLI, you use the get-metric-statistics command:
# Query EC2 CPU utilization for the last hour
# (uses GNU date; the trailing Z marks the timestamps as UTC)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average Maximum
# Publish a custom metric from your application
aws cloudwatch put-metric-data \
  --namespace MyApplication \
  --metric-name RequestLatency \
  --value 142 \
  --unit Milliseconds \
  --dimensions Name=Environment,Value=production Name=Service,Value=api
# List all available metrics in a namespace
aws cloudwatch list-metrics \
  --namespace AWS/Lambda \
  --metric-name Duration
Custom metrics let you track application-specific data that AWS services do not capture automatically. Common custom metrics include request latency percentiles, business transaction counts, queue depths, cache hit ratios, and error rates broken down by error type. You publish custom metrics using the put-metric-data API call, either directly from your application code using an AWS SDK or through the CloudWatch agent, which can collect system-level metrics like memory usage and disk space that EC2 basic monitoring does not include.
The embedded metric format is a newer approach for high-throughput custom metric publishing. Instead of making explicit API calls, you write specially structured JSON to stdout, and the CloudWatch agent or Lambda runtime automatically extracts metrics from the log output. This approach is more efficient because it batches metric data with your existing log output and avoids the overhead of separate API calls for every data point.
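To make the embedded metric format more concrete, here is a minimal sketch of a shell function that prints one such log line. The function name emit_emf is hypothetical, and the namespace, metric name, and Service dimension are reused from the put-metric-data example purely for illustration. The line only turns into a metric if it actually reaches CloudWatch Logs, for example from a Lambda function's stdout or through the CloudWatch agent.

```bash
# Print one embedded-metric-format log line to stdout.
# Arguments: metric name, numeric value, unit.
# The namespace and the Service dimension are illustrative values.
emit_emf() {
  local name=$1 value=$2 unit=$3
  local now_ms=$(( $(date +%s) * 1000 ))   # EMF timestamps are epoch milliseconds
  printf '{"_aws":{"Timestamp":%s,"CloudWatchMetrics":[{"Namespace":"MyApplication","Dimensions":[["Service"]],"Metrics":[{"Name":"%s","Unit":"%s"}]}]},"Service":"api","%s":%s}\n' \
    "$now_ms" "$name" "$unit" "$name" "$value"
}

emit_emf RequestLatency 142 Milliseconds
```

The _aws object describes which top-level keys are metrics and which are dimensions; the remaining keys carry the actual values, so the same line still reads as an ordinary structured log entry.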
Configuring CloudWatch Logs
CloudWatch Logs requires a log group to exist before log data can be sent to it. You create log groups with retention policies that control how long log data is kept, which directly affects your storage costs. Log data that exceeds the retention period is automatically deleted:
# Create a log group with 30-day retention
aws logs create-log-group --log-group-name /app/production/api
aws logs put-retention-policy \
  --log-group-name /app/production/api \
  --retention-in-days 30
# Create a metric filter that counts ERROR occurrences
aws logs put-metric-filter \
  --log-group-name /app/production/api \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ApplicationErrors,metricNamespace=MyApplication,metricValue=1,defaultValue=0
# Query recent log events from a specific stream
# (single quotes stop the shell from expanding $LATEST; --start-time is in epoch milliseconds)
aws logs get-log-events \
  --log-group-name /app/production/api \
  --log-stream-name '2025/01/15/[$LATEST]abc123' \
  --start-time 1705276800000 \
  --limit 50
Metric filters are one of the most powerful features in CloudWatch Logs. They continuously scan incoming log data for patterns you define and extract numerical values that become CloudWatch metrics. For example, you can create a metric filter that counts every log line containing the word ERROR, or one that extracts the response time value from structured log entries. These derived metrics can then trigger alarms, appear on dashboards, and participate in anomaly detection just like any native AWS metric.
Subscription filters let you route log data in real time to other AWS services for further processing. You can send logs to a Lambda function for custom processing, to an Amazon OpenSearch Service domain (formerly Amazon Elasticsearch Service) for full-text search, to Amazon Data Firehose (formerly Kinesis Data Firehose) for delivery to S3 or Redshift, or to a destination in another account for centralized logging. This real-time routing happens as log events arrive, with minimal delay, making it suitable for security monitoring, compliance auditing, and operational alerting.
Setting Up Alarms and Notifications
CloudWatch Alarms are the primary mechanism for automated monitoring. An alarm watches a single metric over a specified period and transitions between three states: OK when the metric is within acceptable bounds, ALARM when it has breached the threshold, and INSUFFICIENT_DATA when not enough data points are available to make a determination. When an alarm transitions to the ALARM state, it can trigger one or more actions:
# Create an alarm that triggers when CPU exceeds 80% for 5 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-production \
  --alarm-description "CPU utilization exceeded 80% for 5 consecutive minutes" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
# Create a composite alarm combining multiple conditions
aws cloudwatch put-composite-alarm \
  --alarm-name service-degraded \
  --alarm-rule "ALARM(high-cpu-production) AND ALARM(high-error-rate)" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:critical-alerts \
  --alarm-description "Service degraded: both CPU and error rate are elevated"
# Enable anomaly detection on a metric
aws cloudwatch put-anomaly-detector \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=my-api-handler \
  --stat Average
Composite alarms solve the alert fatigue problem that plagues monitoring systems. Instead of receiving separate notifications for high CPU, high memory, elevated error rates, and increased latency, you create a composite alarm that only fires when a specific combination of conditions is true simultaneously. This dramatically reduces false positives and ensures that on-call engineers are only paged for genuine incidents that require human intervention.
Anomaly detection alarms use machine learning to establish a baseline for your metric's normal behavior, accounting for daily patterns, weekly cycles, and seasonal trends. Instead of setting a static threshold like "alert when latency exceeds 500 milliseconds," you configure an anomaly detection alarm that alerts when latency deviates significantly from what the model expects at that specific time of day and day of week. This approach catches subtle degradations that static thresholds miss and avoids false alarms during known traffic spikes.
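The way alarm states and composite rules combine can be sketched with a small local model. This is a deliberately simplified illustration and not how CloudWatch evaluates alarms internally: it handles only integer datapoints and the GreaterThanThreshold operator, and it ignores missing-data treatment and "M out of N" settings. The function names alarm_state and composite_and are hypothetical.

```bash
# Simplified model of a GreaterThanThreshold alarm (integers only).
# Usage: alarm_state THRESHOLD EVALUATION_PERIODS DATAPOINT...
alarm_state() {
  local threshold=$1 periods=$2
  shift 2
  if [ "$#" -lt "$periods" ]; then
    echo INSUFFICIENT_DATA
    return
  fi
  # Drop older datapoints so only the most recent $periods remain.
  shift $(( $# - periods ))
  local value
  for value in "$@"; do
    if [ "$value" -le "$threshold" ]; then
      echo OK
      return
    fi
  done
  echo ALARM
}

# Simplified model of the rule "ALARM(a) AND ALARM(b)".
composite_and() {
  if [ "$1" = ALARM ] && [ "$2" = ALARM ]; then echo ALARM; else echo OK; fi
}

cpu=$(alarm_state 80 2 45 85 91)      # last two datapoints both exceed 80
errors=$(alarm_state 5 2 1 9 3)       # latest datapoint is back under 5
echo "cpu=$cpu errors=$errors composite=$(composite_and "$cpu" "$errors")"
```

Here the CPU alarm is in ALARM but the error-rate alarm is not, so the composite stays OK and nobody is paged for a CPU spike that is not accompanied by errors.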
Building Dashboards for Operational Visibility
CloudWatch Dashboards provide a real-time visual overview of your infrastructure health. Each dashboard contains widgets that display metrics as line graphs, stacked area charts, numbers, gauges, or text. Dashboards automatically refresh and can span multiple AWS accounts and regions, making them ideal for centralized operations centers:
# Create a dashboard with multiple widgets
aws cloudwatch put-dashboard \
  --dashboard-name production-overview \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
          "metrics": [
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"],
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0987654321fedcba0"]
          ],
          "region": "us-east-1",
          "period": 300,
          "stat": "Average",
          "title": "EC2 CPU Utilization"
        }
      },
      {
        "type": "metric",
        "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {
          "metrics": [
            ["AWS/Lambda", "Errors", "FunctionName", "my-api-handler"],
            ["AWS/Lambda", "Invocations", "FunctionName", "my-api-handler"]
          ],
          "region": "us-east-1",
          "period": 60,
          "stat": "Sum",
          "title": "Lambda Invocations and Errors"
        }
      }
    ]
  }'
# List all dashboards in the account
aws cloudwatch list-dashboards
Effective dashboards follow a layered approach. The top-level dashboard shows aggregate health indicators: total error rate across all services, overall request volume, and key business metrics. Clicking into a specific service reveals a service-level dashboard with detailed metrics for that component. This drill-down pattern lets operators quickly identify which service is causing a problem without being overwhelmed by hundreds of individual metrics on a single screen.
Querying Logs with CloudWatch Logs Insights
CloudWatch Logs Insights provides a purpose-built query language for searching and analyzing log data at scale. It can scan gigabytes of log data in seconds, making it practical for real-time troubleshooting during incidents. The query language supports filtering, pattern matching, aggregation, time-series visualization, and statistical functions:
# Run a Logs Insights query to find the slowest requests
aws logs start-query \
  --log-group-name /app/production/api \
  --start-time "$(date -u -d '1 hour ago' +%s)" \
  --end-time "$(date -u +%s)" \
  --query-string 'fields @timestamp, @message
    | filter @message like /latency/
    | parse @message "latency=* ms" as latency_ms
    | sort latency_ms desc
    | limit 20'
# Get query results (use the queryId from start-query response)
aws logs get-query-results --query-id "abc123-def456-ghi789"
Logs Insights queries are invaluable during incident response. When an alarm fires indicating elevated error rates, you can immediately query the relevant log group to find the specific error messages, identify which endpoints are affected, determine when the errors started, and correlate with deployment events. The ability to aggregate and visualize log data as time-series charts directly in the query results makes it easy to spot patterns that would be invisible when scrolling through individual log lines.
Common Logs Insights patterns include counting errors by type over time, calculating percentile latencies from structured logs, identifying the most frequent error messages, finding requests that exceeded a duration threshold, and correlating events across multiple log groups using the @logStream field. The query language also supports regular expressions for complex pattern matching and the stats command for computing averages, sums, counts, and percentiles across grouped results.
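One of the patterns above, counting errors over time, might look like the following sketch. The variable ERRORS_OVER_TIME and the function run_insights_query are hypothetical names; the function wraps the start-query and get-query-results commands shown earlier and polls until the query leaves the Scheduled or Running state. It needs AWS credentials to do anything, so the example call is left as a comment.

```bash
# Count ERROR lines in five-minute buckets.
ERRORS_OVER_TIME='filter @message like /ERROR/
| stats count(*) as error_count by bin(5m)
| sort error_count desc'

# Start a query over the last hour and wait until it finishes.
# Arguments: log group name, query string.
run_insights_query() {
  local log_group=$1 query=$2 query_id status
  local end_time start_time
  end_time=$(date +%s)
  start_time=$(( end_time - 3600 ))
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start_time" \
    --end-time "$end_time" \
    --query-string "$query" \
    --query queryId --output text) || return 1
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query status --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;   # still in progress, poll again
      *) break ;;                     # Complete, Failed, Cancelled, ...
    esac
  done
  aws logs get-query-results --query-id "$query_id"
}

# Example call (requires AWS credentials):
# run_insights_query /app/production/api "$ERRORS_OVER_TIME"
```

Computing the time range with shell arithmetic instead of date -d keeps the sketch independent of GNU date.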
Real-World Use Cases
CloudWatch serves different observability needs depending on your architecture and scale. Here are the most common production patterns:
Microservices health monitoring is the most fundamental use case. Each microservice publishes custom metrics for request count, error rate, and latency percentiles. CloudWatch Alarms watch these metrics and page the on-call engineer when error rates exceed acceptable thresholds. Dashboards show the health of all services at a glance, and Logs Insights queries help diagnose issues during incidents. For services running on Lambda, most of this monitoring comes free because Lambda automatically publishes invocation count, duration, error count, and throttle metrics.
Auto Scaling based on custom metrics extends beyond the default CPU-based scaling. You publish a custom metric representing your application's queue depth, active connection count, or business-specific load indicator. An Auto Scaling policy watches this metric and adds or removes capacity based on your application's actual workload rather than a generic CPU threshold that may not correlate with user-facing performance.
Cost optimization through usage tracking uses CloudWatch metrics to identify underutilized resources. By monitoring CPU utilization, network throughput, and storage IOPS over time, you can right-size EC2 instances, identify idle RDS databases, and find Lambda functions with over-provisioned memory. CloudWatch's fifteen-month metric retention gives you enough historical data to make confident sizing decisions.
Security and compliance monitoring leverages CloudWatch Logs and metric filters to detect suspicious activity. You create metric filters that count failed authentication attempts, unauthorized API calls, or access to sensitive resources. Alarms on these security metrics trigger immediate notifications to your security team. Combined with CloudTrail logs flowing into CloudWatch, you have a searchable audit trail of the API calls that CloudTrail records in your account.
Cross-account centralized observability uses CloudWatch cross-account observability to aggregate metrics and logs from multiple AWS accounts into a single monitoring account. This pattern is essential for organizations that use separate accounts for development, staging, and production environments, or that have multiple product teams each with their own account. A central dashboard shows the health of all accounts, and centralized alarms can detect issues regardless of which account they originate from.
Best Practices
Following these practices ensures your CloudWatch implementation is effective, cost-efficient, and maintainable as your infrastructure grows:
Use structured logging consistently across all services. JSON-formatted logs with consistent field names like timestamp, level, service, requestId, and message make Logs Insights queries dramatically more powerful. When every service logs in the same format, you can write queries that work across all log groups without service-specific parsing logic. The embedded metric format builds on structured logging by letting you emit metrics directly from your log output.
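A minimal sketch of what a consistent structured log line could look like, using the field names mentioned above. The helper name log_json is hypothetical, and the sketch assumes the arguments contain no quotes or backslashes, since it does no JSON escaping.

```bash
# Hypothetical helper that prints one JSON log line with a fixed set of fields.
# Note: this sketch does not escape quotes or backslashes inside the arguments.
log_json() {
  local level=$1 service=$2 request_id=$3 message=$4
  printf '{"timestamp":"%s","level":"%s","service":"%s","requestId":"%s","message":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$service" "$request_id" "$message"
}

log_json ERROR api req-42 "upstream timeout after 3000 ms"
```

With every service emitting the same fields, a single Logs Insights filter on the level or requestId field should work unchanged across log groups.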
Set appropriate retention policies on every log group. The default retention for CloudWatch Logs is indefinite, which means log data accumulates forever and storage costs grow continuously. Most operational logs only need thirty to ninety days of retention. Compliance-sensitive logs might need one to seven years. Set retention policies explicitly on every log group and review them quarterly to avoid unnecessary costs.
Design alarms around service-level objectives rather than resource metrics. Instead of alerting on raw CPU utilization, alert on the metrics that directly affect your users: error rate, latency percentiles, and availability. A server running at ninety percent CPU is not a problem if response times are still within your SLO. Conversely, a server at thirty percent CPU with elevated error rates is a genuine incident. Align your alarms with what matters to your users.
Use composite alarms to reduce alert noise. A single metric breaching a threshold often does not indicate a real problem. Network blips, deployment rollouts, and traffic spikes can cause momentary threshold breaches that resolve automatically. Composite alarms that require multiple conditions to be true simultaneously filter out these transient events and only page engineers for genuine multi-signal incidents.
Implement dashboard hierarchies for different audiences. Executive dashboards show business metrics and overall system health with green, yellow, and red indicators. Engineering dashboards show detailed service metrics, deployment markers, and error breakdowns. On-call dashboards show the specific metrics and logs needed to diagnose and resolve incidents quickly. Each audience needs different information at different levels of detail.
Leverage anomaly detection for metrics without obvious static thresholds. Metrics like request volume, Lambda duration, and API latency vary naturally throughout the day and week. Static thresholds either miss subtle degradations or fire false alarms during normal traffic patterns. Anomaly detection adapts to your metric's natural patterns and alerts only on genuinely unusual behavior.
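For the anomaly detection practice, an alarm on the expected band could be sketched as follows. The alarm name lambda-duration-anomaly, the band width of 2, and the three evaluation periods are illustrative assumptions; the function name and SNS topic are reused from the earlier examples. The small run wrapper is a hypothetical convenience that only prints the command unless DRY_RUN=0 is set, so the sketch can be read and tried without touching an account.

```bash
# Print the command by default; set DRY_RUN=0 to really call AWS.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# m1 is the raw metric; ad1 is the expected band around it.
# The second argument of ANOMALY_DETECTION_BAND controls the band width.
METRICS='[
  {"Id": "m1", "ReturnData": true,
   "MetricStat": {
     "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Duration",
                "Dimensions": [{"Name": "FunctionName", "Value": "my-api-handler"}]},
     "Period": 300, "Stat": "Average"}},
  {"Id": "ad1", "ReturnData": true,
   "Expression": "ANOMALY_DETECTION_BAND(m1, 2)", "Label": "Duration (expected)"}
]'

# Alarm when the metric rises above the upper edge of the band.
run aws cloudwatch put-metric-alarm \
  --alarm-name lambda-duration-anomaly \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 3 \
  --threshold-metric-id ad1 \
  --metrics "$METRICS" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```

Note that there is no --threshold value here: the alarm compares the metric against the band identified by --threshold-metric-id instead of a fixed number.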
Common Mistakes
These mistakes frequently cause monitoring gaps, excessive costs, or alert fatigue in CloudWatch implementations:
Ignoring CloudWatch Logs costs until the bill arrives. CloudWatch Logs charges for data ingestion and storage separately. A single verbose microservice logging every request at debug level can generate gigabytes of log data daily, costing hundreds of dollars per month. Always set log levels appropriately for production, use sampling for high-volume debug logging, and set retention policies before deploying new services.
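A back-of-the-envelope estimate makes the cost point tangible. The sketch below assumes a steady daily log volume and a placeholder ingestion price of 0.50 USD per GB; actual prices differ by Region and change over time, so treat the number as an assumption and check current pricing. The function name monthly_ingest_cost is hypothetical, and storage charges are left out.

```bash
# Rough monthly ingestion cost for a steady daily log volume.
# Both arguments are inputs you supply; the price used below is an assumption.
monthly_ingest_cost() {
  local gb_per_day=$1 price_per_gb=$2
  awk -v gb="$gb_per_day" -v price="$price_per_gb" \
    'BEGIN { printf "%.2f\n", gb * 30 * price }'
}

monthly_ingest_cost 20 0.50    # prints 300.00
```

Under these assumptions, one service writing 20 GB of debug logs per day already lands in the hundreds of dollars per month before any storage is counted.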
Creating too many alarms without a response plan. Every alarm should have a documented runbook that explains what the alarm means, what to check first, and how to resolve the issue. Alarms without runbooks create confusion during incidents and train engineers to ignore alerts. If you cannot write a runbook for an alarm, question whether the alarm should exist.
Not using dimensions effectively for metric granularity. Publishing a single aggregate metric for all instances or all endpoints makes it impossible to identify which specific resource is causing a problem. Use dimensions to break metrics down by instance, function name, endpoint, environment, and customer segment. This granularity costs slightly more but saves hours during incident diagnosis.
Relying solely on CloudWatch basic monitoring for EC2. Basic monitoring provides metrics at five-minute intervals, which is too coarse for detecting brief spikes or rapid degradations. Enable detailed monitoring for production instances to get one-minute resolution. Additionally, basic monitoring does not include memory utilization or disk space metrics. Install the CloudWatch agent to collect these critical system metrics that the hypervisor cannot observe.
Failing to test alarms before relying on them. An alarm that has never fired might have an incorrect threshold, a misconfigured action, or a broken SNS topic subscription. Use the set-alarm-state CLI command to manually transition alarms to the ALARM state and verify that notifications arrive correctly, runbooks are accessible, and automated actions execute as expected.
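A test of the notification path with set-alarm-state might look like the sketch below. The alarm name comes from the earlier example; the run wrapper and the test_alarm function are hypothetical conveniences, and by default the command is only printed so nothing is changed until DRY_RUN=0 is set.

```bash
# Print the command by default; set DRY_RUN=0 to really call AWS.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Force the named alarm into the ALARM state so its actions fire.
test_alarm() {
  local alarm_name=$1
  run aws cloudwatch set-alarm-state \
    --alarm-name "$alarm_name" \
    --state-value ALARM \
    --state-reason "Manual test of the notification path"
}

test_alarm high-cpu-production
```

The forced state is temporary: the alarm should return to its real state at its next evaluation, so the test does not need to be undone by hand.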
Not correlating metrics with deployments. When metrics change, the most common cause is a recent deployment. Use CloudWatch annotations or dashboard markers to indicate deployment times on your metric graphs. This simple practice dramatically speeds up incident diagnosis by making it immediately obvious whether a metric change correlates with a code or configuration change.
Summary
AWS CloudWatch provides the observability foundation for workloads running on AWS. Its metrics system automatically collects data from most AWS services and accepts custom metrics from your applications. CloudWatch Logs centralizes log data from all sources and makes it searchable through Logs Insights queries. Alarms watch metrics continuously and trigger automated responses when thresholds are breached or anomalies are detected. Dashboards tie everything together into visual overviews that make infrastructure health immediately apparent.
The key to effective CloudWatch usage is treating observability as a first-class concern rather than an afterthought. Instrument your applications with structured logging and custom metrics from the start. Set up alarms aligned with your service-level objectives before incidents occur. Build dashboards that answer the questions your team asks during incidents. Configure retention policies and cost controls before log volumes grow unexpectedly.
CloudWatch integrates deeply with the rest of the AWS ecosystem. EC2 instances report health metrics automatically. Lambda functions stream logs without any configuration. ECS containers route output through the awslogs driver. RDS databases publish performance metrics natively. This tight integration means you get baseline observability for free with every AWS service you use, and you build on that foundation with custom metrics, alarms, and dashboards tailored to your specific operational needs. Combined with the other services in the AWS services roadmap, CloudWatch is the layer that ensures everything else is running correctly and alerts you the moment something goes wrong.