StatsD: How to Measure Anything in Your System

Editor’s note: This post was originally written for the Scalyr (now DataSet by SentinelOne) blog.

In his book “How to Measure Anything,” management consultant and author Douglas Hubbard states that “anything can be measured.” Hubbard argues that something that can be observed lends itself to being measured.

How can this apply to software development and operations? Well, in today’s world of increasingly complex IT systems, you can’t afford not to measure anything and everything. But in order to observe and then measure something, it needs to meet the literal definition of observability, meaning that a system’s internal state must be exposed externally. This allows you to measure it. With observability, you find out not only that your system malfunctioned, but also why. This is done with data from logs, metrics, and traces.

In 2011, the Etsy Engineering team made it a little easier to measure and observe your IT systems with the introduction of StatsD. Historically, collecting data about networks and servers has been easier than gathering the same information about applications. StatsD made collecting application metrics simpler by letting developers instrument their code with the specific metrics they want to observe. As a result, StatsD has become one of the most popular tools for gathering metrics data.

In this post, I’m going to give you a brief tutorial of StatsD and how you can use it to measure anything in your application.

What’s This StatsD You Speak Of?

StatsD is a network daemon released by Etsy and written in Node.js to collect, aggregate, and send developer-defined application metrics to a separate system for graphical analysis. Initially, the daemon’s job was to listen on a UDP port for incoming metrics data, parse and extract this information, and periodically send this data to Graphite in an aggregated format.

One big goal of StatsD is to collect data quickly, and the transport protocol that best fits that goal is UDP. UDP is connectionless, so the StatsD client just sends the metrics data without waiting for an acknowledgment. Delivery isn’t guaranteed, but packets are unlikely to be lost, especially if the daemon is on the same instance.
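To make the fire-and-forget idea concrete, here is a minimal sketch that uses only Python’s standard library. The helper name send_metric is hypothetical, and the default address 127.0.0.1:8125 is an assumption about where a daemon might be listening; treat this as an illustration rather than a replacement for a real client library.

```python
import socket


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> int:
    """Send one StatsD line over UDP; return the number of bytes handed to the OS.

    The host and port defaults are assumptions, not values the daemon enforces.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # sendto() returns once the datagram is queued. There is no handshake
        # and no acknowledgment, so the caller never waits on the daemon.
        return sock.sendto(line.encode("ascii"), (host, port))
```

Calling send_metric("page.login.users:1|c") returns immediately. The client gets no confirmation either way, which is the trade-off UDP makes for speed.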

What It’s Made Of

The StatsD architecture consists of three main components: client, server, and backend.

The client implementation contains the libraries for the specific language you’re using for your application. With StatsD’s increased popularity, there’s now support for multiple languages. With the appropriate client library, you can instrument your software code with any metrics you want to track, in almost any way you want to track them.

The server implementation includes a daemon that listens for UDP traffic coming from the client libraries. It then aggregates all their data and flushes everything to the backend system. By default, this happens every 10 seconds, so metrics reach the backend in near real time.

The backend component, which now includes more than Graphite, is where all of the metrics data will reside for graphing and analysis. The StatsD daemon sends the aggregated data to that system over a network connection, which is often HTTP-based (Graphite itself accepts data over a plain TCP connection). The backend could be installed on the same instance, but it’s more often another monitoring or logging solution that’s external to the client and server implementations.

How Metrics Are Formatted

The basic metrics data that the StatsD client sends contains three things: a metric name, its value, and a metric type. This data is formatted this way:

<metric_name>:<metric_value>|<metric_type>

Metric name (also called a bucket) is pretty self-explanatory. One key thing to remember is to name your metric in a way that aims to avoid confusion or misinterpretation later.

Metric value is the number associated with that metric’s performance at collection time. The actual value will depend on the type of metric you are collecting data for.

Metric type defines what type of data the metric actually represents. StatsD supports several metric types, including counters, gauges, timers, and sets.
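The three-part format is easy to build and to take apart. The sketch below is only an illustration: Metric, format_metric, and parse_metric are hypothetical names, not part of StatsD or any client library, and the parser handles just the basic name:value|type form shown above.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    name: str    # the bucket, e.g. "page.login.users"
    value: float
    type: str    # e.g. "c" for a counter or "ms" for a timer


def format_metric(name: str, value, metric_type: str) -> str:
    """Build a line in the <metric_name>:<metric_value>|<metric_type> form."""
    return f"{name}:{value}|{metric_type}"


def parse_metric(line: str) -> Metric:
    """Split a basic StatsD line back into its three parts."""
    name, colon, rest = line.partition(":")
    value, pipe, metric_type = rest.partition("|")
    if not (name and colon and pipe and metric_type):
        raise ValueError(f"not a StatsD line: {line!r}")
    # float() raises ValueError as well if the value is not numeric.
    return Metric(name, float(value), metric_type)
```

For example, format_metric("page.login.users", 1, "c") builds a counter line, and parse_metric turns that line back into a Metric tuple.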

A counter metric type counts the number of times a particular event occurred in your application. The counter is incremented each time the event happens, and at every flush the server sends both the total count and the count rate (the count per second) over the flush interval. An example could be the number of times users logged in:

page.login.users:1|c
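To see how a total count and a count rate come out of a flush, here is a simplified sketch of counter aggregation. CounterAggregator is a hypothetical class written for this post, not the daemon’s actual code, and it assumes the 10-second default flush interval mentioned earlier. Clearing the counters on flush is also a simplification.

```python
from collections import defaultdict


class CounterAggregator:
    """Accumulate counter increments and report totals and per-second rates."""

    def __init__(self, flush_interval_s: float = 10.0):
        self.flush_interval_s = flush_interval_s
        self._counts = defaultdict(float)

    def record(self, name: str, value: float = 1.0) -> None:
        # Each incoming "name:value|c" line adds its value to the bucket.
        self._counts[name] += value

    def flush(self):
        """Return (counts, rates) for the interval, then start a new interval."""
        counts = dict(self._counts)
        rates = {name: total / self.flush_interval_s for name, total in counts.items()}
        self._counts.clear()  # simplification: forget all buckets after a flush
        return counts, rates
```

With a 10-second interval, a single login recorded during that interval flushes as a count of 1 and a rate of 0.1 per second.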

A timer metric type records the amount of time, in milliseconds, that an operation such as a request takes to finish. An example could be how long it takes for a login page to load:

page.login.time:350|ms
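One way to produce a timer line like this is to wrap the code you want to measure. The sketch below is an assumption-laden illustration: timed is a hypothetical helper, and its send parameter stands in for whatever actually delivers the line (it defaults to print so the example runs without a daemon).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name: str, send=print):
    """Measure the wrapped block and emit a "<name>:<milliseconds>|ms" line."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Round to whole milliseconds to match the example format above.
        send(f"{name}:{elapsed_ms:.0f}|ms")


if __name__ == "__main__":
    with timed("page.login.time"):
        time.sleep(0.05)  # placeholder for the work being measured
```

Because the line is emitted in a finally clause, the timing is still reported when the wrapped code raises an exception.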

Why Sampling Measures More

In a production environment, your application is likely very busy. If the volume of metrics it sends becomes a burden on the application or the network, you can use an extended metric format to send only a sample of your data.

This lets the StatsD client send only a percentage of the metrics data to the server. So to send data only 50% of the time, you would specify 0.5 as your sampling rate.

Because only a sample arrives, the server multiplies the received value by the inverse of the sampling rate you specified (1 / 0.5 = 2 in this case) and sends this scaled number to your backend. The metric format for the sampling rate is:

<metric_name>:<metric_value>|<metric_type>|@<sampling_rate>

An example of the number of users logging in could be:

page.login.users:10|c|@0.5
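The two halves of sampling can be sketched as follows. Both function names are hypothetical, and the rng parameter exists only so the behavior can be made deterministic in a test; this is an illustration of the arithmetic, not the daemon’s or any client library’s implementation.

```python
import random


def maybe_format_sampled(name, value, metric_type, rate, rng=random.random):
    """Client side: return a sampled line with probability `rate`, else None."""
    if rng() >= rate:
        return None  # this event is skipped and nothing is sent
    return f"{name}:{value}|{metric_type}|@{rate}"


def scale_sampled(value: float, rate: float) -> float:
    """Server side: multiply by the inverse of the sampling rate."""
    return value / rate
```

For the example line above, scale_sampled(10, 0.5) gives 20.0: the server assumes it saw only half of the events and doubles the received value.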

How to Collect Data

Now that you’ve learned the basics about StatsD, let’s go through some of the steps to start collecting data. We’ll install and configure it to start sending data.

Installing and Configuring StatsD

Installing the StatsD daemon starts with installing Node.js. (Server implementations in other languages also exist, but this tutorial uses the original.) I’m using Ubuntu in this example. Adjust your commands appropriately if you’re using another distro.

sudo apt-get install -y nodejs

Once you have Node.js installed, you need to get the StatsD code, which currently lives on GitHub. You will need to clone the repository to run it on your machine.

If you don’t already have git on your machine, run:

sudo apt-get install -y git

Next, clone the repo:

git clone https://github.com/statsd/statsd.git

Now, let’s go to the StatsD config file to specify where our server and backend will run.

StatsD comes with an example file with the config data in it. Let’s make a copy of that:

cd statsd
sudo cp exampleConfig.js localConfig.js

Open the config file and make the necessary changes:

sudo nano localConfig.js

With that file open, make sure to scroll down to the bottom. What you’re looking for is something that looks like this:

{
  graphitePort: 2003
, graphiteHost: "graphite.example.com"
, port: 8125
, backends: [ "./backends/graphite" ]
}

This config is telling us that the daemon is listening on UDP port 8125 for messages from the client. The default server IP is 0.0.0.0.

It’s also specifying Graphite as your backend, but you may decide to use another system. For this test, comment out Graphite and add “console” as your backend. You should have this:

{
/*  graphitePort: 2003
, graphiteHost: "graphite.example.com"
,*/ port: 8125
, backends: [ "./backends/console" ]
}

Now save and close your config file.

Finally, start the StatsD server, making sure you’re still in the statsd directory you cloned:

node stats.js localConfig.js

If all worked well, you should see StatsD running:

ubuntu@hostname:~/statsd$ node stats.js localConfig.js
21 Apr 00:00:08 - [23604] reading config file: localConfig.js
21 Apr 00:00:08 - server is up INFO

Get the Metrics Out

Now that you have the StatsD daemon configured and running, the next step is to send some metrics data.

Sending data requires instrumenting your code. There are numerous client implementations of StatsD that can allow you to do this task.
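As a rough picture of what instrumenting your code can look like, here is a sketch of a decorator that emits a counter line each time a function is called. Everything here is hypothetical: counted is not part of any StatsD client library, render_login_page is a made-up handler, and send defaults to print so the example runs without a daemon. A real client library would take care of sending the line for you.

```python
import functools


def counted(name: str, send=print):
    """Emit "<name>:1|c" on every call to the decorated function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            send(f"{name}:1|c")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@counted("page.login.accessed")
def render_login_page() -> str:
    # Made-up handler standing in for real application code.
    return "<form>...</form>"
```

The appeal of this style is that the metric lives next to the code it measures, while the handler body stays unchanged.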

For the purposes of this tutorial, you’re going to send some data to the console.

So let’s try running the following command to send data:

echo "page.login.accessed:1|c" | nc -u -w 1 127.0.0.1 8125

This command sends the StatsD server a metric called “page.login.accessed” with a value of 1 and the counter metric type. If it runs successfully, you should see the data at the next flush in the console where the StatsD daemon is running.

{ counters:
   { 'statsd.bad_lines_seen': 0,
     'statsd.packets_received': 1,
     'statsd.metrics_received': 1,
     'page.login.accessed': 1 },
  timers: { 'page.login.time': [] },
  gauges: { 'statsd.timestamp_lag': 0 },
  timer_data: { 'page.login.time': { count_ps: 0, count: 0 } },
  counter_rates:
   { 'statsd.bad_lines_seen': 0,
     'statsd.packets_received': 0.1,
     'statsd.metrics_received': 0.1,
     'page.login.accessed': 0.1 },
  sets: {},
  pctThreshold: [ 90 ] }

Making Sense of the Data

Now that you have a client sending data to the daemon, the server will collect and aggregate the metrics data and forward it to the backend system. In our test, we sent data to the console backend.

In the monitoring backend, depending on your preferred system, you’re able to view all of the data from your application in graphical form. You’ll be able to search for the metrics that you’re sending and create dashboards in ways you want to view the data.

Because of the client-server-backend model with StatsD, you don’t have to be locked into using one backend. If you ever need to change, you simply add your backend to the backends directory, open the localConfig.js file, and update the backends setting.

Don’t Be Wrong

Instrumenting your application code with specific metrics you care about will help you identify potential problems not just in development, but also when your application goes into production. StatsD makes sending metrics simple, but deciding what to instrument may be the most complicated piece. Still, the ability to measure anything can be invaluable.

If it’s that important, you should take the time to define the custom metrics you want to measure. If there’s something uncertain about your application, you have a chance of being wrong.

So the lesson is: don’t be wrong. Measuring the performance of your application using StatsD will prove or disprove any thoughts or ideas you have about how your application actually performs in production.

And that’s why you measure anything and everything.