Amazon CloudWatch is a monitoring and management service that captures key monitoring and operational data (logs, metrics, and events) in one centralized location, for both AWS and on-premises resources and services.
Key Points
- CloudWatch can natively collect metrics from most AWS services and resources
- You can leverage CloudWatch Agent or API to collect metrics from on-premises services and resources
- CloudWatch allows up to 1-second visibility of metrics and log data and up to 15 months of data retention
- Data retention depends on granularity; when a retention window ends, the data points are aggregated into the next, coarser tier:
  - Data points with a period under 60 seconds are kept for 3 hours, then aggregated to 1-minute metrics
  - 1-minute data points are kept for 15 days, then aggregated to 5-minute metrics
  - 5-minute data points are kept for 63 days, then aggregated to 1-hour metrics
  - 1-hour data points are kept for 455 days (15 months)
  - Note: you cannot delete metrics data. It simply expires at the end of its retention period.
- EC2 Standard monitoring is performed at 5 minute intervals, but detailed monitoring allows monitoring to be done at 1 minute intervals (at an extra cost)
- CloudWatch Alarms can be created to trigger alerts
- You can use IAM to specify which CloudWatch actions a user can perform.
- You cannot limit access to CloudWatch data for specific resources. Granting access to CloudWatch data grants access to all of it (for example, you cannot expose data from some EC2 instances but not others)
- You cannot use IAM roles with CloudWatch command line tools
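The retention tiers listed in the key points above can be expressed as a small lookup. The sketch below is only an illustration of that schedule: the names `RETENTION_TIERS` and `finest_period_seconds` are hypothetical, and the first tier assumes the metric was published with sub-minute (high-resolution) data points.

```python
from datetime import timedelta

# (period in seconds, how long data at that period is kept), finest first.
# Values are taken from the retention schedule in the key points above.
RETENTION_TIERS = [
    (1, timedelta(hours=3)),      # sub-minute data points (assumes high-resolution metrics)
    (60, timedelta(days=15)),     # 1-minute data points
    (300, timedelta(days=63)),    # 5-minute data points
    (3600, timedelta(days=455)),  # 1-hour data points (about 15 months)
]


def finest_period_seconds(age: timedelta):
    """Return the finest period (in seconds) still retained for data of the
    given age, or None if the data has expired."""
    for period, retention in RETENTION_TIERS:
        if age <= retention:
            return period
    return None


if __name__ == "__main__":
    for days in (0.1, 10, 30, 100, 500):
        print(days, "days old ->", finest_period_seconds(timedelta(days=days)))
```

In other words, a data point that is 30 days old can only be read back at 5-minute granularity or coarser, regardless of how it was originally published.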
Key Components of Amazon CloudWatch
CloudWatch Logs
CloudWatch Logs provides a centralized place to collect, monitor, and analyze logs from multiple sources, such as AWS services, your applications, and third parties.
- You can retain your logs and specify a retention period per log group (a logical grouping of related logs)
- You can query your log data using CloudWatch Logs Insights
CloudWatch Alarms
You can create CloudWatch alarms that monitor specific CloudWatch metrics and trigger a notification when a specific threshold is breached.
- Metric Alarm watches a single CloudWatch metric for a value or calculated value
- Composite Alarm works based on a rule expression that considers alarm states of multiple alarms
- Alarm history is available for 14 days
Configuring an alarm requires the following settings:
- Period – the length of time, in seconds, over which the metric or expression is evaluated to produce each data point
- Evaluation Periods – the number of most recent periods, or data points, to evaluate when determining the alarm state
- Datapoints to Alarm – the number of data points within the Evaluation Periods that must be breaching for the alarm to go to the ALARM state
- Additionally, you can specify how to treat missing data points when evaluating an alarm.
Alarm States
- OK – the metric or expression is within the defined threshold
- ALARM – the metric or expression is outside of the defined threshold
- INSUFFICIENT_DATA – the alarm has just started, the metric is not available, or not enough data is available for the metric to determine the state
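To make the interaction between Evaluation Periods and Datapoints to Alarm concrete, here is a deliberately simplified model of the evaluation. It is not CloudWatch's actual algorithm: the function `evaluate_alarm` is hypothetical, it assumes a "greater than threshold" comparison, and it treats missing data points (`None`) by simply ignoring them unless every point is missing.

```python
from typing import Optional, Sequence


def evaluate_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Simplified "M out of N" evaluation.

    datapoints: one value per period, oldest first; None marks a missing point.
    Returns "OK", "ALARM", or "INSUFFICIENT_DATA".
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")

    # Only the N most recent periods are considered.
    recent = datapoints[-evaluation_periods:]
    present = [v for v in recent if v is not None]
    if not present:
        return "INSUFFICIENT_DATA"

    # Assumption: the alarm uses a "greater than threshold" comparison.
    breaching = sum(1 for v in present if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    cpu = [10, 95, 97, 50, 99]  # hypothetical CPU utilization, one value per period
    print(evaluate_alarm(cpu, threshold=90, evaluation_periods=3, datapoints_to_alarm=2))
```

With Evaluation Periods set to 3 and Datapoints to Alarm set to 2, a single spike does not trip the alarm, but two breaching points among the last three do.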
CloudWatch Events (CWE)
CloudWatch Events is a stream of system events describing changes in your AWS resources.
- This is in addition to existing CloudWatch Metrics and Logs from these resources
- Currently only these resources are supported:
  - EC2, Auto Scaling, and CloudTrail
  - Via CloudTrail, mutating API calls (that is, calls other than Describe, List, and Get) across all services are also visible in CloudWatch Events
- You can create rules to trigger actions based on specific CloudWatch Events
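A rule selects events with a JSON event pattern. The sketch below builds a pattern for EC2 instance state changes and checks it against a sample event using a small local matcher. The helper names are hypothetical, the sample instance ID is made up, and the matcher covers only exact-value matching, which is a small subset of what the service supports.

```python
import json


def ec2_state_change_pattern(states):
    """Build an event pattern that selects EC2 state-change events for the given states."""
    return {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": list(states)},
    }


def matches(pattern, event):
    """Simplified local matcher (exact values only): every key in the pattern
    must be present in the event, and the event's value must be one of the
    values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    pattern = ec2_state_change_pattern(["stopped", "terminated"])
    sample_event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
    }
    print(json.dumps(pattern, indent=2))
    print("matches:", matches(pattern, sample_event))
```

The serialized pattern (`json.dumps(pattern)`) is the kind of string you would supply as the event pattern when creating a rule, for example through the console or an SDK call such as boto3's `put_rule`.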
Multi-dimensional usage of Amazon CloudWatch
Collect
- Logs – three primary categories:
  - Vended Logs – natively published logs (currently only from VPC Flow Logs and Route 53)
  - (AWS Service) Logs – published by AWS services (a fair number of AWS services support this)
  - Custom Logs – published by your applications and resources from within the AWS environment, or from on-premises (via the CloudWatch Agent or API)
- Metrics – most AWS services support capturing key metrics (specific to that service)
- Custom Metrics – from your applications and resources
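As an illustration of publishing a custom metric through the API, the sketch below builds a data point in the shape accepted by the `PutMetricData` operation and, optionally, sends it with boto3. The helper names, the `MyApp` namespace, and the `QueueDepth` metric are hypothetical; `publish` additionally assumes boto3 is installed and AWS credentials are configured.

```python
def build_metric_datum(name, value, unit="Count", dimensions=None, high_resolution=False):
    """Build one entry for the MetricData list of a PutMetricData request."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (sub-minute) storage, 60 = standard resolution.
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send the data points to CloudWatch (assumes boto3 and credentials are available)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


if __name__ == "__main__":
    datum = build_metric_datum(
        "QueueDepth", 42, dimensions={"Service": "checkout"}, high_resolution=True
    )
    print(datum)
    # publish("MyApp", [datum])  # uncomment to send to a real account
```

Keeping payload construction separate from the network call makes the data point easy to inspect or unit-test before anything is sent.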
Monitor
- CloudWatch Dashboards provide customizable views of metrics and logs for easy analysis
- CloudWatch Alarms let you set threshold-based triggers and actions on metrics
- Container Insights provides automatic dashboards for various metrics of deployed containers
- CloudWatch Anomaly Detection enables use of machine-learning algorithms to analyze collected metrics and trigger actions
- CloudWatch ServiceLens enhances the observability of your services and applications by enabling you to integrate traces, metrics, logs, and alarms into one place.
- CloudWatch Synthetics allows you to create scripts (called canaries) that run on a schedule, mimicking your customers' actions to monitor your endpoints and APIs
  - Canaries are Node.js scripts that run as Lambda functions
Act
- Auto Scaling can be triggered based on CloudWatch alarms
- CloudWatch Events can trigger actions enabling automation
Analyze
- CloudWatch metrics data can be analyzed in near real time, or you can analyze months' worth of captured data for seasonality trends
- CloudWatch Logs Insights enables customized queries with aggregations, filters, and regular expressions to gain useful insight from your captured log data
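A Logs Insights query combining a regular-expression filter with an aggregation might look like the one generated below. This is only a sketch: the helper names and the `/my/app` log group are hypothetical, and the second function merely assembles the arguments that the `StartQuery` API expects without calling AWS.

```python
import time


def error_count_query(keyword="ERROR", bin_size="5m", limit=20):
    """Build a Logs Insights query that counts matching log lines per time bin."""
    return (
        "fields @timestamp, @message\n"
        f"| filter @message like /{keyword}/\n"
        f"| stats count(*) as matches by bin({bin_size})\n"
        "| sort matches desc\n"
        f"| limit {limit}"
    )


def start_query_kwargs(log_group, start_epoch, end_epoch, query):
    """Assemble keyword arguments for a StartQuery call (times are epoch seconds)."""
    return {
        "logGroupName": log_group,
        "startTime": int(start_epoch),
        "endTime": int(end_epoch),
        "queryString": query,
    }


if __name__ == "__main__":
    now = time.time()
    kwargs = start_query_kwargs("/my/app", now - 3600, now, error_count_query())
    print(kwargs["queryString"])
    # With boto3 and credentials available, this could be submitted as:
    #   boto3.client("logs").start_query(**kwargs)
```

The query filters log lines with a regular expression, counts the matches in 5-minute bins, and returns the busiest bins first.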
Notes on Monitoring vs Observability:
- Monitoring is focused on operations (of an application, a resource, or an interaction) to determine state (good / bad / warning) or to detect behavioral deviation
  - Focuses on the State (and its variations), thus focusing on the “effects”
- Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  - Focuses on influencers of the State, thus enabling focus on the “causes”
Pricing
A fair amount of CloudWatch usage (metrics, alarms, and so on) is covered by its Free Tier, as summarized below.
Category | Free Tier allowance |
---|---|
Metrics | Basic Monitoring Metrics (at 5-minute frequency); 10 Detailed Monitoring Metrics (at 1-minute frequency); 1 million API requests (not applicable to GetMetricData and GetMetricWidgetImage) |
Dashboard | 3 dashboards for up to 50 metrics per month |
Alarms | 10 alarm metrics (not applicable to high-resolution alarms) |
Logs | 5 GB data (ingestion, archive storage, and data scanned by Logs Insights queries) |
Events | All events except custom events are included |
Contributor Insights | 1 Contributor Insights rule per month; the first one million log events that match the rule per month |
Synthetics | 100 canary runs per month |
Please visit the following page for detailed pricing of usage beyond the free tier described above:
External Resources