Editor’s note: This post was originally written for The New Stack. You can check out the original here.
The most important ability is availability.
This is a common statement I hear from sports commentators. Whether athletes are in the NBA, NFL or EPL, they need to be available to play the game before anyone can talk about how good they are at scoring points or goals. Someone just cannot become a great player if they’re always on the bench because of an injury.
The same principle applies to monitoring systems. Traditionally, monitoring involved simply pinging a device to check if a system was up or down — in other words, its availability. Back in the day, that’s all we could get, and all we needed. Even with advancements in technology, the most basic requirement of an IT system is for it to be available. Nothing else matters if your users can’t get to that full-featured, cloud-native web application your developers just built from scratch.
But today, traditional monitoring is not enough. With software installed inside containers that are running on cloud infrastructure, we need more. While those old monitoring systems would tell you when something was broken, you also need to know why it broke.
That’s where observability comes in.
Observability isn’t as new as its popularity in the last few years would have you believe. The term, or variations of it, has roots in both control theory and quantum physics.
So the abstract concept of being able to observe the state of a system goes back decades. But in order to actually observe and measure something, a system must expose its state. Once that internal data is provided externally, it can be collected, stored and analyzed.
Traditional monitoring tools are not equipped to do this. In today’s world of complex clusters of microservices in your cloud infrastructure, you need to use tooling for your DevOps team that goes beyond traditional monitoring.
But even that’s changing.
Prediction: You’re Going to Need APM
Monitoring tools, specifically application performance monitoring (APM) solutions, have been adapting for years to be able to observe this type of data. Gartner predicts that enterprise organizations will have increased the percentage of applications they monitor with APM tools from 5 percent in 2018 to 20 percent in 2021. So over three years, your organization may have quadrupled its monitoring.
Why? It’s becoming clear that to succeed, these organizations need to broaden APM tool coverage to focus on the customer experience, business insights, and service health.
But what do you need APM tooling for? You have observability. Isn’t monitoring dead anyway?
Monitoring isn’t dead. It simply changed.
Observability and Monitoring Aren’t at Odds
Monitoring and observability aren’t the same, but they’re not different, either. Monitoring is essentially a subset of observability. Observability goes beyond monitoring by helping you find out why something is broken through better visibility and insights, more precise alerting, and faster root cause identification with debugging.
So the two aren’t at odds at all. They’re like family.
But it isn’t about whether they’re the same or not. It’s about having the right strategy in place to ensure that when something is broken, you find out that this is the case and are quickly able to determine why it broke.
The goal is to avoid or minimize the impact on your application’s users.
To ensure your software is operating at its peak performance levels, you need to develop an observability strategy. Let me discuss some things to think about when doing that.
Observability consists of three key areas that must be included in any strategy for observing your infrastructure, especially if your applications are cloud native: metrics, traces, and logs. A proper strategy includes all three. Anything less, and you might as well go back to just monitoring.
You’re likely familiar with the process of collecting metrics from your IT systems. Much of it is similar to traditional monitoring. Going back to the days of simple SNMP (Simple Network Management Protocol) monitoring, systems have had the ability to provide data on various metrics at any given point in time.
You can gather data about various metrics such as the number of total requests and the number of errors. This data is usually represented over a time interval, which lets you put it on graphs and dashboards. Metrics data also allows you to do trending and capacity planning.
With metrics, you’ll be able to see if there’s a problem or whether something isn’t working quite as expected. Once something isn’t working, you have the ability to generate alerts and get notified.
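To make the idea of metrics over a time interval concrete, here is a minimal sketch in Python. The event list, the 60-second interval, and the 50 percent alert threshold are all assumptions made up for illustration. A real monitoring tool would collect and store this data for you.

```python
from collections import defaultdict

# Hypothetical request events: (unix_timestamp_seconds, succeeded)
EVENTS = [
    (0, True), (12, True), (30, False), (59, True),
    (61, True), (75, False), (90, False), (118, False),
]

def counts_per_interval(events, interval_seconds=60):
    """Group events into fixed time buckets: {bucket_start: (total, errors)}."""
    buckets = defaultdict(lambda: [0, 0])
    for timestamp, succeeded in events:
        start = (timestamp // interval_seconds) * interval_seconds
        buckets[start][0] += 1          # total requests in this bucket
        if not succeeded:
            buckets[start][1] += 1      # failed requests in this bucket
    return {start: tuple(counts) for start, counts in sorted(buckets.items())}

def intervals_to_alert_on(buckets, max_error_ratio=0.5):
    """Return the bucket start times whose error ratio exceeds the threshold."""
    return [start for start, (total, errors) in buckets.items()
            if errors / total > max_error_ratio]

buckets = counts_per_interval(EVENTS)
alerts = intervals_to_alert_on(buckets)
print(buckets)  # the per-interval counts you would put on a graph
print(alerts)   # the intervals that would trigger a notification
```

The per-interval counts are what end up on a dashboard, and the threshold check is the simplest form of alerting on them.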
Collecting traces from your systems lets you see requests from the user’s point of view, as each request travels through your application. Traces are especially important in modern application architectures because of microservices: they give you visibility across all the services that make up your application and the associated infrastructure.
The trend is to become more user-centric. You need to not just observe system performance, but also observe how the user is affected by that performance. Traces help pinpoint where the problem is — not just the fact that there is a problem. You can more quickly identify the components in your system causing a problem and monitor the performance through various parts of your application.
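Here is a rough sketch of how trace data can point at the slow component. The Span class, the service names, and the timings are hypothetical, and the calculation assumes each span's direct children run one after another rather than in parallel. Real tracing systems record much more than this.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root span of the request
    service: str               # hypothetical service name
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

# One user request as it travels through four hypothetical services.
TRACE = [
    Span("a", None, "gateway", 0, 480),
    Span("b", "a", "auth", 10, 60),
    Span("c", "a", "orders", 70, 470),
    Span("d", "c", "inventory", 80, 130),
]

def self_time_by_service(spans):
    """Time spent in each span, excluding time spent in its direct children."""
    self_time = {span.span_id: span.duration_ms for span in spans}
    for span in spans:
        if span.parent_id is not None:
            self_time[span.parent_id] -= span.duration_ms
    return {span.service: self_time[span.span_id] for span in spans}

times = self_time_by_service(TRACE)
bottleneck = max(times, key=times.get)
print(times)
print(bottleneck)
```

Subtracting child time matters here: the gateway span is the longest overall, but almost all of that time is spent waiting on the services it calls.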
The collection of logs has been around a long time. You have logs generated by the OS, application, infrastructure — you name it. Logs typically contain detailed sets of events of what’s happened in your application.
Having logs can help you identify exactly what’s causing a problem. But the challenge is that finding exceptions being thrown by an application can be like finding a needle in a haystack. The sheer volume can make it difficult to find the cause of the problem.
With modern applications, you can’t just rely on the systems logging their own data. Developers have the ability to log custom data about their application that can be exposed and stored in logs. This should be part of your strategy as well.
So with logs, you can find the exact cause of the problem.
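One way to shrink the haystack is to log custom data in a structured form that can be searched by field. The sketch below uses Python's standard logging module. The logger name, the "context" field, and the simulated timeout are assumptions for illustration only.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # "context" is a hypothetical custom field passed via extra=.
            "context": getattr(record, "context", {}),
        }
        if record.exc_info:
            entry["exception"] = record.exc_info[0].__name__
        return json.dumps(entry)

# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("login_service")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("login succeeded", extra={"context": {"user_id": 17}})
try:
    raise TimeoutError("session store did not respond")  # simulated failure
except TimeoutError:
    logger.exception("login failed", extra={"context": {"user_id": 42}})

# Searching by field instead of scanning free text.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
failures = [entry for entry in entries if "exception" in entry]
print(failures)
```

Because each entry carries its own context, filtering for exceptions returns the failing request and the user it affected in one step.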
Get in Where You Fit in
With those three key areas of observability, monitoring may not seem needed. But the exact opposite is true. Monitoring tools are what allow for the collection of all this data.
APM tools already collect metrics data. Your developers can make their software observable by exposing metrics that follow the RED method: rate, errors, and duration. For example, you can increase the observability of your application by capturing metrics for the rate of login requests during deployment testing and feeding that information to your monitoring tool for analysis.
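As a small illustration of the RED method applied to login requests, the sketch below summarizes rate, errors, and duration for one test window. The sample requests, the 10-second window, and the choice of median and maximum as duration figures are assumptions, not something prescribed by any particular APM tool.

```python
import statistics

# Hypothetical login requests seen during a 10-second test window:
# (duration_in_ms, succeeded)
LOGIN_REQUESTS = [(120, True), (95, True), (310, False), (105, True), (880, False)]
WINDOW_SECONDS = 10

def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one window of requests."""
    if not requests:
        raise ValueError("need at least one request to summarize")
    durations = [duration for duration, _ in requests]
    errors = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "rate_per_second": len(requests) / window_seconds,   # Rate
        "error_ratio": errors / len(requests),               # Errors
        "median_duration_ms": statistics.median(durations),  # Duration
        "max_duration_ms": max(durations),
    }

summary = red_summary(LOGIN_REQUESTS, WINDOW_SECONDS)
print(summary)
```

A summary like this is the kind of data you would feed to your monitoring tool for each test run.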
Traces can be used to provide utilization information of, for example, each microservice that a user’s login requests traverse. With this data, your APM tool can help you quickly identify where in your cloud infrastructure any login slowdowns are occurring.
Many APM vendors also provide Crash Reporting tools. The capture of log data is best when it’s paired with these types of tools, which can show you the specific code that caused something to be logged. This lets you more quickly debug an issue that’s actively affecting users.
The best way to guarantee observability is for developers to build it into their code as the software is written. Having an APM tool that’s integrated with your continuous delivery pipeline ensures your whole system is observed, making it easier for you to troubleshoot when things break.
If you’re not sure where to start with your observability strategy, start with service level objectives, or SLOs. Your SLOs define the level of system performance you provide for your users. If you want to keep users happy, you want to employ whatever monitoring tools you have and start developing these SLOs. Once you have your objectives, you can begin to put in place the key areas of observability — metrics, traces, and logs — and send the data collected to your monitoring tool.
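To show what checking an SLO can look like in practice, here is a minimal sketch that compares measured availability against a target. The 99 percent target, the request counts, and the idea of tracking the remaining failures as a budget are assumptions for illustration. Your own objectives will depend on what your users expect.

```python
def slo_report(slo_target, total_requests, failed_requests):
    """Compare measured availability with an SLO target given as a fraction."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of failures the objective tolerates over this many requests.
    allowed_failures = round((1 - slo_target) * total_requests)
    availability = (total_requests - failed_requests) / total_requests
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": failed_requests <= allowed_failures,
    }

# Hypothetical month of traffic measured against a 99 percent objective.
report = slo_report(slo_target=0.99, total_requests=10_000, failed_requests=40)
print(report)
```

The inputs come straight from the metrics you already collect, which is why SLOs make a practical starting point.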
Conclusion: Becoming a Great Observer
Running an APM tool on top of your observability pillars is an approach that will help you focus on solving customer experience issues and not just application health issues. But with that come challenges, both technical and organizational. Not everyone wants to break down silos.
You want to simplify your observability strategy, but there’s no genie who’s going to do it for you. It will take work and time. You must spend the time to do it as correctly as possible, following any organizational processes. You must observe your data, including the data your APM tool collects, and make sure that when your users experience problems, you know not only that something is broken, but also why.
Now you’re being observant.