Real Time Monitoring with AWS ElastiCache

Dec 16, 2022

Amazon ElastiCache is a managed in-memory data store and cache service provided by AWS. We want to build alerting around our Amazon EKS cluster, which uses ElastiCache for Redis. When an outage occurs, we want to be notified right away so we can troubleshoot what went wrong with Redis and resolve it as soon as possible.

Solution

The best solution is to take a more proactive approach to monitoring and alerting. We will send an alert when urgent issues are detected or when resource limits are about to be reached.

Alerts & Notifications

It’s necessary to keep an eye on our metrics even when the system is performing adequately. We want to implement the following types of alerts to reduce our risk of downtime. To make issues easier to triage, we assign a priority to each kind of alert we receive from Grafana:

soft-alerts (low to moderate severity): Posting a message in your messaging application, such as Slack or Google Chat, gives the alert high visibility without disrupting anyone in the middle of the workday. It lets us know, for example, if we're about to run out of disk space, or if memory or CPU usage is climbing. This gives us time to diagnose the situation before something more severe occurs.

high-alerts (high severity): This is also posted as a notification in the same chat room, but it requires immediate attention. Recipients prioritize the alert and share their findings, then resolve the issue and validate that things are functioning as designed.
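To make the two alert types concrete, here is a minimal sketch of the kind of message a webhook could post to a chat room. In this setup Grafana sends the notifications itself, so this is only an illustration. The function name, the severity labels, and the message prefixes are assumptions, and the `{"text": ...}` body is the simple format accepted by Slack and Google Chat incoming webhooks.

```python
import json
from urllib.request import Request

# Hypothetical labels mirroring the two alert types described above.
SOFT = "soft-alert"
HIGH = "high-alert"


def build_chat_request(webhook_url: str, severity: str, summary: str) -> Request:
    """Build (but do not send) a POST request for a chat incoming webhook."""
    if severity not in (SOFT, HIGH):
        raise ValueError(f"unknown severity: {severity}")
    # The prefix wording is illustrative; pick whatever your team will notice.
    if severity == HIGH:
        prefix = "[HIGH] Immediate attention required"
    else:
        prefix = "[SOFT] Heads-up"
    body = json.dumps({"text": f"{prefix}: {summary}"}).encode("utf-8")
    return Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually deliver the message; the sketch stops short of that so it can be tried without a real webhook URL.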

Integrate AWS CloudWatch with Grafana

CloudWatch monitors all of our cloud resources, and connecting it to Grafana gives us better visualization of our environments' metrics. Let’s start by logging in to our AWS account and creating a user in the IAM module:

Select the checkbox to enable Programmatic Access:

Then attach an existing policy provided by AWS (the Amazon Managed Grafana policy):

Once you have done that, click the Next: Review button and finish creating the user, then save the user's access key ID and secret access key. Once you have successfully created the grafana-cloudwatch user, navigate to Access Management > Policies:

The same policy is attached to our grafana-cloudwatch user (navigate to the IAM > Users module and select the grafana-cloudwatch user):
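For readers who would rather see what read access to CloudWatch looks like as a policy document, the sketch below builds one in Python. It is not the managed policy used above: the statement ID and the selection of actions are assumptions, chosen as a small read-only set that a CloudWatch data source typically needs, so compare it with the policy you actually attach.

```python
import json


def cloudwatch_read_policy() -> dict:
    """Return an illustrative read-only IAM policy for CloudWatch metrics."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Hypothetical statement ID, used only for readability.
                "Sid": "AllowReadingMetricsFromCloudWatch",
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:DescribeAlarms",
                    "cloudwatch:ListMetrics",
                    "cloudwatch:GetMetricData",
                    "cloudwatch:GetMetricStatistics",
                ],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into the IAM policy editor.
    print(json.dumps(cloudwatch_read_policy(), indent=2))
```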

Add AWS X-Ray Plugin

The AWS X-Ray plugin provides a holistic view of requests as they travel through your system. It helps with analyzing and debugging applications & monitoring cloud performance metrics.

Grafana is going to need permissions granted via IAM, so click on the Add permissions button:

Select the Attach existing policies directly option and search for AWSXrayFullAccess. Once you have successfully added that existing policy, you’ll be able to see it on the Permissions tab:

Later on, when we add CloudWatch as a Data Source, you’ll see an X-ray trace link option, which integrates the plugin and gives us observability and traceability across our product platform.

Add CloudWatch as a Data Source in Grafana

Now we can add the CloudWatch Data Source in Grafana. Log in to Grafana and navigate to Configuration > Data sources:

Click on Add Data Source & select the CloudWatch Data Source:

Fill out all of your Connection Details:

Access key ID & secret access key
Assume Role ARN
Default Region

Once you have filled out all of those details, scroll down and you’ll see the X-ray trace link I mentioned above:
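For teams that prefer to script this step instead of using the form, the same connection details can be expressed as a JSON body for Grafana's data source HTTP API (`POST /api/datasources`). The following is a sketch under assumptions: the function name and the default data source name are made up for illustration, and the field names should be checked against the Grafana version you run.

```python
from typing import Optional


def cloudwatch_datasource_payload(
    access_key: str,
    secret_key: str,
    default_region: str,
    assume_role_arn: Optional[str] = None,
    name: str = "CloudWatch",  # hypothetical display name
) -> dict:
    """Build the request body for creating a CloudWatch data source."""
    json_data = {
        "authType": "keys",  # authenticate with an access key and secret key
        "defaultRegion": default_region,
    }
    if assume_role_arn:
        # Only include the role when one is configured.
        json_data["assumeRoleArn"] = assume_role_arn
    return {
        "name": name,
        "type": "cloudwatch",
        "access": "proxy",
        "jsonData": json_data,
        # Secrets go in secureJsonData so Grafana stores them encrypted.
        "secureJsonData": {"accessKey": access_key, "secretKey": secret_key},
    }
```

Keeping the keys in `secureJsonData` rather than `jsonData` means they are not returned in plain text when the data source is read back.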

Adding the AWS ElastiCache Redis Dashboard

Now we can import the AWS ElastiCache Dashboard in Grafana. Click the Import button:

We’re going to use the dashboard ID for the ElastiCache Dashboard, which is 969 in this case:

Then go ahead and name your dashboard and click the Import button:

Grafana should now display the new Dashboard:
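The import can also be scripted. The sketch below assumes you have already downloaded the dashboard JSON (for example, dashboard 969) and builds a body for the import endpoint that the Grafana UI itself calls (`POST /api/dashboards/import`). The function name is hypothetical, and the payload shape is an assumption to verify against your Grafana version. The `__inputs` list is the section of an exported dashboard that names the data sources it expects.

```python
def build_import_payload(
    dashboard: dict, datasource_name: str, overwrite: bool = True
) -> dict:
    """Map each data source input the dashboard declares to our data source."""
    inputs = [
        {
            "name": item["name"],
            "type": "datasource",
            "pluginId": item.get("pluginId", ""),
            "value": datasource_name,
        }
        for item in dashboard.get("__inputs", [])
        if item.get("type") == "datasource"
    ]
    return {"dashboard": dashboard, "overwrite": overwrite, "inputs": inputs}


# A trimmed, made-up stand-in for the downloaded dashboard JSON.
example_dashboard = {
    "title": "AWS ElastiCache Redis",
    "__inputs": [
        {"name": "DS_CLOUDWATCH", "type": "datasource", "pluginId": "cloudwatch"}
    ],
}
payload = build_import_payload(example_dashboard, "CloudWatch")
```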

Monitoring and Alerting

There are quite a few CloudWatch metrics which offer good insight into current real-time performance. By implementing this dashboard we will be able to take corrective action before performance issues occur. We will be setting up alerting on the following Redis metrics:

CPUUtilization
DatabaseMemoryUsagePercentage
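To show how these two metrics are addressed in CloudWatch, here is a sketch that builds query entries in the shape expected by the `MetricDataQueries` parameter of the GetMetricData API. The helper name, the cluster ID `my-redis-001`, the 60-second period, and the `Average` statistic are assumptions for illustration; nothing here calls AWS.

```python
def elasticache_metric_query(
    query_id: str,
    metric_name: str,
    cache_cluster_id: str,
    period_seconds: int = 60,
    stat: str = "Average",
) -> dict:
    """Build one MetricDataQueries entry for an ElastiCache metric."""
    return {
        "Id": query_id,  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ElastiCache",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "CacheClusterId", "Value": cache_cluster_id}
                ],
            },
            "Period": period_seconds,
            "Stat": stat,
        },
        "ReturnData": True,
    }


# Hypothetical cluster ID; replace with your own node or cluster name.
queries = [
    elasticache_metric_query("cpu", "CPUUtilization", "my-redis-001"),
    elasticache_metric_query("mem", "DatabaseMemoryUsagePercentage", "my-redis-001"),
]
```

With boto3 installed, a list like `queries` could be passed to the CloudWatch client's `get_metric_data` call together with a start and end time.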

CPUUtilization

The CPUUtilization metric tracks how much CPU your workload is using. It is reported as a percentage.

soft-alert: We'll set a soft-alert to notify us via webhook at 80%. Once CPU usage reaches that threshold, a message is sent to our team chat room. This is an eye-opener for us: Redis is running out of headroom for processing requests.

high-alert: For this alert, we’ll set our threshold at 90%. Once CPU usage reaches it, a webhook sends a message to our team chat room letting us know that an outage has likely occurred and our website may be down.
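The two thresholds can be summarized as a small classification function. This is a sketch that assumes an alert fires once the value meets or exceeds its threshold. The names are hypothetical, and Grafana evaluates its own alert rules rather than running code like this.

```python
SOFT_THRESHOLD = 80.0  # percent, soft-alert
HIGH_THRESHOLD = 90.0  # percent, high-alert


def classify(
    percent: float,
    soft: float = SOFT_THRESHOLD,
    high: float = HIGH_THRESHOLD,
) -> str:
    """Return which alert, if any, a utilization percentage should raise."""
    # Check the higher threshold first so 90%+ is never reported as soft.
    if percent >= high:
        return "high-alert"
    if percent >= soft:
        return "soft-alert"
    return "ok"
```

The same function applies unchanged to any percentage metric that uses the 80% and 90% thresholds.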

DatabaseMemoryUsagePercentage

The DatabaseMemoryUsagePercentage metric monitors the percentage of memory that is currently in use. This percentage is calculated as used_memory/maxmemory.
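As a worked example of that ratio, the sketch below computes the percentage from two byte counts. The function name and the sample sizes are assumptions; `used_memory` and `maxmemory` are the values the ratio is based on.

```python
def database_memory_usage_percentage(
    used_memory_bytes: int, maxmemory_bytes: int
) -> float:
    """Compute used_memory / maxmemory as a percentage."""
    if maxmemory_bytes <= 0:
        raise ValueError("maxmemory must be positive")
    return used_memory_bytes / maxmemory_bytes * 100.0


GIB = 1024 ** 3
# 3 GiB used out of a 4 GiB limit.
usage = database_memory_usage_percentage(3 * GIB, 4 * GIB)
print(usage)  # 75.0
```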

soft-alert: We’re going to set our soft-alert to notify us at 80%. Once memory usage reaches that threshold, a webhook sends a message to our team chat room. This should indicate that something out of the ordinary is happening and that we should investigate why it is occurring.

high-alert: For this alert, we’ll set our threshold at 90%. Once memory usage reaches it, we’ll receive a message in our team chat room. At this point an outage has likely occurred and is disrupting our business.

Customizing Alerts

As mentioned above, we want to be notified via our chat messaging system when an outage occurs. I wanted these messages to be team-specific, so I am going to create a notification channel. On the sidebar, click the bell icon and select Notification channels: