Note: This post was originally written for the Netreo blog. You can check out the original here.
In today’s hybrid and multi-cloud world, you need to be more sure than ever that you have a handle on your service-level agreement (SLA) performance. But how do you make sure your cloud providers are giving you what you’re paying for? You have likely read, or maybe skimmed, their SLA. How do you find out if they’re meeting that SLA? You do so by monitoring your SLA metrics.
In this post, you’ll learn about SLAs and the metrics you can use to monitor their performance. This can help you hold your providers and your team accountable. You’ll also see some examples of SLA metrics to help you get an idea of exactly what you can monitor.
What’s an SLA?
An SLA is a contract between a provider and its customers that defines the level of service the provider commits to deliver. It generally covers one or more services, such as an IaaS offering like AWS EC2 or a PaaS offering like Azure SQL Database. Usually, providers track performance against their SLAs for you. You likely offer SLAs to your organization’s end users, so you should track your own performance in the same way.
There are usually consequences if a provider fails to meet service levels. Service credits are a common remedy: the provider compensates customers when it misses an SLA. You should define metrics that you can monitor so you’ll know when a service level hasn’t been met.
What Are SLA Metrics?
SLA metrics are a set of key performance indicators (KPIs) that you can measure and monitor. You can have any number of SLA metrics to monitor, but you can group many of them into five types.
1. Availability
The availability of a particular cloud resource is the percentage or length of time it’s working for its users. You want availability to be as close to 100 percent as possible. Here are a couple of metrics and examples of availability.
Uptime
Uptime is the percentage of time an instance is up, running, and ready for use. For example, an AWS EC2 instance that runs through the whole measurement period without any reboots caused by an AWS outage has 100 percent uptime. If your AWS SLA for EC2 promises 99.99 percent, AWS is meeting its SLA.
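To make the uptime arithmetic concrete, here is a minimal Python sketch. The function names and the 30-day measurement period are assumptions for illustration; your provider’s SLA defines the actual period and what counts as downtime.

```python
from datetime import timedelta


def uptime_percent(period: timedelta, downtime: timedelta) -> float:
    """Return the percentage of the period during which the instance was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period")
    return 100.0 * (1 - downtime / period)


def meets_sla(period: timedelta, downtime: timedelta, target_percent: float) -> bool:
    """Compare measured uptime with the SLA target (for example, 99.99)."""
    return uptime_percent(period, downtime) >= target_percent


# Assumed 30-day period: a 99.99 percent target allows about 4.32 minutes of downtime.
month = timedelta(days=30)
print(meets_sla(month, timedelta(minutes=4), 99.99))  # True
print(meets_sla(month, timedelta(minutes=5), 99.99))  # False
```

Seen this way, a 99.99 percent target leaves very little room: a single five-minute outage in a month is already a miss.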
Service Availability
Service availability is the percentage of time that service requests return with an expected response. An example is an Azure web app service your organization uses that’s consistently able to respond when users need to log in. If your monitoring shows this service is suddenly failing, SLA performance suffers.
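One way to put a number on service availability is to count how many health checks came back with the expected response. The sketch below assumes you already have a list of pass/fail probe results; the function name and the sample data are hypothetical.

```python
def service_availability(probe_results: list[bool]) -> float:
    """Return the percentage of probes that got the expected response."""
    if not probe_results:
        raise ValueError("at least one probe result is required")
    # True counts as 1 and False as 0, so sum() is the number of successes.
    return 100.0 * sum(probe_results) / len(probe_results)


# Hypothetical data: 1,000 login probes, 3 of which failed.
results = [True] * 997 + [False] * 3
print(round(service_availability(results), 2))  # 99.7
```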
2. Response Time
The response time or latency from any cloud resource is the amount of time it takes for a response to return after a request. You want response time to be as low as possible since it most directly impacts the user experience. Here are a couple of examples:
MTTR
Mean time to repair (MTTR) is the average length of time it takes to resolve a problem. The R can mean either repair or resolution, depending on the system, but the expectation is the same: you are concerned with how fast the provider or your team fixes a problem. For a single incident, an example is the gap between the time you first observe a regional cloud network outage in your monitoring tool and the time that alert goes away. MTTR is the average of those gaps across incidents.
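As a rough sketch of the calculation, MTTR is the sum of each incident’s duration divided by the number of incidents. The incident timestamps below are made up, and the function name is an assumption.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the gap between detection and resolution across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)


# Hypothetical incidents: one lasted 30 minutes, the other 90 minutes.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 30)),
]
print(mean_time_to_repair(incidents))  # 1:00:00
```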
Transaction Response Time
The transaction response time metric is the length of time, usually in milliseconds, that it takes a transaction request to return a response. Let’s say one of your organization’s users sends an email through your Amazon SES service. The time it takes to get confirmation of a sent email after clicking the “send” button measures the transaction response time.
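If you want to measure a transaction yourself, you can wrap the call in a timer. This is only a sketch: `send_email_stub` is a hypothetical stand-in for a real request (it is not an Amazon SES call), and `timed_ms` is an assumed helper name.

```python
import time
from typing import Any, Callable


def timed_ms(operation: Callable[[], Any]) -> tuple[Any, float]:
    """Run the operation and return its result with the elapsed milliseconds."""
    start = time.perf_counter()
    result = operation()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def send_email_stub() -> str:
    """Hypothetical transaction: pretend the request takes about 50 ms."""
    time.sleep(0.05)
    return "sent"


status, elapsed = timed_ms(send_email_stub)
print(status, f"{elapsed:.0f} ms")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock, so the measurement isn’t thrown off by system clock adjustments.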
3. Throughput
The throughput metric is the amount of data your cloud resources send and receive over a period of time. You want throughput to be as high as a system supports. Here are a couple of examples:
Disk Write Bytes
Disk write bytes is a metric that measures the rate at which a system writes bytes of data to a disk over a period of time, usually in seconds. An example is an Amazon S3 storage system you use to save large files uploaded by your users. They might love coffee, but you don’t want them having to go get a cup after uploading files to your system and waiting for it to process. A low throughput in this scenario is bad for your SLA performance.
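Many monitoring agents expose disk writes as a counter that only goes up, so the rate comes from the difference between two samples. Here is a minimal sketch under that assumption; the counter values are invented.

```python
def bytes_per_second(start_bytes: int, end_bytes: int, interval_seconds: float) -> float:
    """Turn two readings of a cumulative byte counter into a rate."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    if end_bytes < start_bytes:
        # A counter that goes backwards usually means it was reset.
        raise ValueError("counter decreased between samples")
    return (end_bytes - start_bytes) / interval_seconds


# Hypothetical samples taken 60 seconds apart.
print(bytes_per_second(1_000_000, 7_000_000, 60))  # 100000.0
```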
Link Throughput
Link throughput is the amount of packet data that can transfer across a given network link over a period of time. This metric is represented in bytes or bits per second. An example is a network connection between locations in New York City and London that transfers 150 Mbps. If link throughput drops below a threshold you define, your monitoring tool can alert you before users are impacted.
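A threshold alert can be as simple as checking whether throughput stays under the limit for several samples in a row, which avoids alerting on a single brief dip. The threshold of 100 Mbps and the three-sample window below are assumptions, not recommendations.

```python
def sustained_drop(samples_mbps: list[float], threshold_mbps: float, consecutive: int = 3) -> bool:
    """Return True if throughput stays below the threshold for N samples in a row."""
    run = 0
    for sample in samples_mbps:
        # Extend the run on a low sample, reset it on a healthy one.
        run = run + 1 if sample < threshold_mbps else 0
        if run >= consecutive:
            return True
    return False


# Hypothetical readings from a link that normally carries about 150 Mbps.
print(sustained_drop([150, 148, 90, 85, 80], threshold_mbps=100))  # True
print(sustained_drop([150, 90, 150, 90], threshold_mbps=100))      # False
```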
4. Errors
The errors metric is the number or percentage of failed requests to a particular resource. Here are a couple of examples:
HTTP Errors
HTTP errors are the percentage of requests a user sends that return with an unexpected HTTP status code. An example is a user receiving the dreaded HTTP 500 “internal server error” when your web application calls an API. Any such error is cause for concern and should be investigated since it could be due to an outage, which could impact your SLA.
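Here is a small sketch of an HTTP error rate. It assumes you count 5xx responses as errors by default; whether 4xx responses also count depends on how your SLA defines a failed request, so the cutoff is a parameter. The status codes in the example are made up.

```python
def http_error_rate(status_codes: list[int], min_error_code: int = 500) -> float:
    """Return the percentage of responses at or above the error cutoff."""
    if not status_codes:
        raise ValueError("at least one status code is required")
    errors = sum(1 for code in status_codes if code >= min_error_code)
    return 100.0 * errors / len(status_codes)


# Hypothetical responses collected by a monitor.
codes = [200, 200, 404, 500, 503]
print(http_error_rate(codes))                      # 40.0 (5xx only)
print(http_error_rate(codes, min_error_code=400))  # 60.0 (4xx and 5xx)
```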
Disk Read Errors
The disk read errors metric is a percentage of read requests to disk that fail. An example is a PostgreSQL request pulling data from the disk on which the database data is stored. A read error could be the result of storage issues, which could impact your SLA.
5. Utilization
The utilization metric is the percentage of use of your cloud system’s resources. Here are a couple of examples:
Disk Utilization
Disk utilization is the amount of disk space being used on a given server instance. An example is an Azure instance that’s running out of available disk space. The instance disk utilization will tell you how much space you have left so you can determine whether you need an upgrade. A server instance with no more disk space is likely to trigger an uptime SLA violation.
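On a single server, you can read disk utilization with Python’s standard library. This sketch uses `shutil.disk_usage`; the 90 percent warning threshold and the helper names are assumptions you would tune for your own environment.

```python
import shutil


def disk_utilization_percent(path: str = "/") -> float:
    """Return the percentage of disk space in use on the volume holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.used / usage.total


def needs_attention(percent_used: float, threshold_percent: float = 90.0) -> bool:
    """Flag a volume once usage reaches the (assumed) warning threshold."""
    return percent_used >= threshold_percent


percent = disk_utilization_percent("/")
print(f"{percent:.1f}% used, needs attention: {needs_attention(percent)}")
```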
Memory Utilization
Memory utilization is the amount of RAM a system uses. An example is an AWS instance configured with too little memory. The instance memory utilization will let you know how much memory is being used at given periods of time. This can help you figure out whether you need more RAM or whether a reboot will do as a temporary fix to free up memory.
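As one illustration, a Linux instance reports memory figures in `/proc/meminfo`. The sketch below assumes that format and treats "used" as total minus available; the sample text is invented, and other operating systems expose memory differently.

```python
def memory_utilization_percent(meminfo_text: str) -> float:
    """Compute used memory as a percentage from /proc/meminfo-style text."""
    fields: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical sample; on a Linux host you would read the text from /proc/meminfo.
sample = """MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    2000000 kB
"""
print(memory_utilization_percent(sample))  # 75.0
```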
Move Forward with Important Metrics
You may be wondering why it’s important to track all these metrics. An often-used quote attributed to Peter Drucker is “If you can’t measure it, you can’t improve it.” When it comes to metrics, you cannot improve your SLA performance if you’re not measuring or monitoring the metrics that are tied to these SLAs. Remember, this is only a subset of the metrics you can collect and tie to your SLA.
The best way to measure your SLA metrics is to monitor them all with a monitoring solution. This is yet another service you implement using money, time, or both. So it’s very important that you properly measure your SLA metrics. You’ll not only need to monitor them but also use a tool that’s right for the job. Netreo can be that tool. You can use its reporting engine to notify you of the metrics that violate your SLAs. If you’re not sitting in front of the Netreo web UI, you can use the many integrations available to send alerts to other platforms such as PagerDuty. Or you can have quick access to any of those alerts with the Netreo mobile app, helping you keep your finger on the pulse of your SLA performance.
So make sure you have SLAs for your cloud and hybrid environment and use the examples above as a way forward. It’s important to identify ways to monitor cloud performance, and there is a plethora of monitoring tools to help you do that. Give Netreo a try as your monitoring tool of choice to monitor your SLA metrics. You can register for a free trial here.