[Guide] Get the custom metrics you want with StatsD and Sensu

Learn more about StatsD — including how to get started — and how to get better insights into your applications and services using Sensu and StatsD together.

By: Todd Campbell and Jef Spaleta

If you’re reading this guide, you’re probably looking for ways to pull more metrics out of your applications and make even better use of the metrics you’re already exposing. StatsD is a great framework for this, and it’s easy to use with Sensu Go — both are highly flexible and allow you to work with the tools that best suit your needs.

As an observability pipeline that enables monitoring as code on any cloud, Sensu empowers you to send custom metrics (from StatsD, Prometheus, Nagios, and more) to the data platform of your choosing.

Read on to learn more about StatsD — including how to get started — and how to get better insights into your applications and services using Sensu and StatsD together.


What is StatsD?

StatsD is an open source network daemon that listens for stats, which it collects via both TCP and UDP, giving you a choice of delivery guarantees. (Here’s a great explanation of the two protocols and the differences between them.)

StatsD flushes all the collected stats to any number of compatible backends — for example, a time-series database. StatsD also defines a simple line-protocol data format, which makes it an easy interface for presenting telemetry data as plain text:

echo "foo:1|c" | nc -u -w0 localhost 8125

StatsD was originally developed at Etsy, starting in 2011 and inspired by work the Flickr team started three years earlier. The original open source StatsD daemon was written in Node.js[1].

Today, StatsD is an open standard protocol and reference architecture. Like most modern telemetry frameworks, StatsD provides support for a variety of different metric types such as counters, timers, gauges, and sets. You implement StatsD by installing a client in your code, applications, and services, which then emits metrics to a server or daemon that consumes and stores them in a compatible and appropriate backend.
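Each of those metric types maps to a one- or two-character suffix in the line protocol. Here's a minimal sketch of the format in Python (the metric names are purely illustrative):

```python
def statsd_line(name, value, metric_type, sample_rate=None):
    """Format one StatsD line: <name>:<value>|<type>[|@<rate>]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line

# Counter: add 1 to "page.views"
print(statsd_line("page.views", 1, "c"))        # page.views:1|c
# Gauge: report the current value of "fuel.level"
print(statsd_line("fuel.level", 0.95, "g"))     # fuel.level:0.95|g
# Timer: "query.time" took 320 ms
print(statsd_line("query.time", 320, "ms"))     # query.time:320|ms
# Set: record one occurrence of a unique value
print(statsd_line("users.uniques", 1234, "s"))  # users.uniques:1234|s
```

The optional sample rate (for example, `|@0.5`) tells the server the client only sent a fraction of the events, so it can scale counters back up at flush time.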

One of the most popular StatsD implementations has been Datadog’s custom metrics solution, backed by the company’s own StatsD daemon, DogStatsD, but StatsD is supported by many other products and tools, including Sensu. Later on in this guide, we’ll get into using DogStatsD examples with Sensu.

Using Sensu and StatsD together — it just works

Sensu has always provided support for popular standards-based monitoring and observability frameworks and protocols. When we first started out, Nagios service checks were a popular methodology, and Sensu still supports Nagios check specification, even as we’ve added support for more modern frameworks. That’s why you can reuse Nagios check scripts and they “just work” in Sensu.

Sensu has had native support for StatsD for a long time. With Sensu Core (the original version of Sensu), you could install StatsD in the Sensu agent by using an extension; now it’s built into Sensu Go, and there’s native support in the Sensu event format, so you can send metrics to the database of your choice.

Another reason Sensu and StatsD work nicely together is their conceptually similar architectures. StatsD has a client-to-server-to-backend architecture, while Sensu's is agent-to-pipeline-to-backend. The Sensu agent and pipeline together act like a StatsD server, giving you much richer workflow and automation capabilities in the observability pipeline. You can build a dual-delivery pipeline or a multi-tenant delivery pipeline for metrics you collect via StatsD and the Sensu platform.

Getting started with StatsD

Before we dive into the demos, let’s start with the code example above. We take the StatsD line protocol and use Netcat to send an example metric to a StatsD server.

$ echo "foo:1|c" | nc -u -w0 localhost 8125

In this case we’re sending the metric foo with the value of 1 as a counter — c — to our StatsD server. But that’s not a scalable approach.

Instead, you could start by creating a connection to your StatsD server, using the standard libraries for the programming language of your choice — you just send the StatsD line protocol in plain text to the server. Or you can integrate your applications with any of the excellent client libraries you’ll find on GitHub at https://github.com/statsd/statsd/wiki. There are so many different client libraries available, you’re likely to find just what you need.

If you prefer, you can roll your own in a reliable way by using standard libraries for TCP and UDP and a little bit of string interpolation to emit StatsD formatted text to a StatsD socket. It’s pretty quick and easy to build a StatsD implementation that you can use across your applications and services.
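As a sketch of that roll-your-own approach, here's a client in a few lines of Python that uses nothing beyond the standard library. The class and method names are our own invention; 8125 is the conventional StatsD port:

```python
import socket

class TinyStatsd:
    """A bare-bones StatsD client: string interpolation plus a UDP socket."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        # UDP is fire-and-forget: no delivery guarantee, but no blocking either.
        self.sock.sendto(payload.encode("ascii"), self.addr)

    def incr(self, name, value=1):
        self._send(f"{name}:{value}|c")

    def gauge(self, name, value):
        self._send(f"{name}:{value}|g")

    def timing(self, name, ms):
        self._send(f"{name}:{ms}|ms")

statsd = TinyStatsd()
statsd.incr("foo")  # same as: echo "foo:1|c" | nc -u -w0 localhost 8125
```

Swapping `SOCK_DGRAM` for `SOCK_STREAM` (and `sendto` for a connected `send`) would give you the TCP variant with its stronger delivery guarantee.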

Datadog provides a lot of information on StatsD, both on their blog and in code examples. We often refer our customers to this blog post, which offers a high-level view of the value StatsD provides — describing just how simple and lightweight it is — and dives deeper into the history of StatsD, which is interesting.

Now for the demos. In our recent webinar, we used some code examples from the DogStatsD documentation.

Get started with StatsD + Sensu

In each demo, we compile the example code and send it to the Sensu StatsD listener, using three different data types: counter, gauge, and histogram.

First we’re going to demonstrate how to have your existing StatsD code — without any modification — collected by the Sensu agent StatsD listener. The demo covers four examples: a counter, a gauge, a gauge for a set, and a histogram.

To show each example, we go to the DogStatsD documentation and copy the Go code for the example we want. We run the code on a server with a Sensu agent listening on the StatsD port, and then the data is collected into InfluxDB for graphing in Grafana.

Starting off, we look at the Sensu agent’s configuration and see that StatsD events will be sent to the InfluxDB handler on the backend. Looking at the Grafana dashboard, we also see that our graph is empty now, because there are no measurements as yet in InfluxDB.
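That agent-side wiring amounts to just a few settings. A sketch of the relevant agent configuration, assuming a handler named influxdb exists on the backend (the flag names come from the Sensu Go agent reference; adjust the handler name to match your setup):

```yaml
# /etc/sensu/agent.yml (excerpt)
# Forward every metric the agent's StatsD listener receives
# to the "influxdb" handler on the backend.
statsd-event-handlers:
  - influxdb
statsd-metrics-port: 8125   # the conventional StatsD port
statsd-flush-interval: 10   # seconds between flushes to the backend
```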

Now we go to the DogStatsD page and choose the first code example. It’s for a counter that will be included in an increment operation, a decrement operation and a static counter. We copy that code and paste it into the terminal, then kick it off.

We see data as it comes into Grafana from InfluxDB. We can see the increment operation, the decrement, and the static counter. As the data comes in, the graph moves along and the counters increase. When we go back and look at the measurements, we see the example metrics sent by the Datadog sample code.

The next example is the gauge data type. Again, we copy the example Go code from the DogStatsD page. After deleting the first demo, we paste the new example code into the terminal, execute the build, and run it.

When we go back to the Grafana dashboard, we see at first that the increment and decrement counts are flattening out. That’s because the data is no longer being sent from our first example — the counter that we stopped running before pasting in the copied code for the gauge example. As the process runs, you can see the gauge going up.

Our third example is a set, which is sent into StatsD as a gauge data type. We copy the Go code from the DogStatsD page, delete the prior demo, and paste the new code into the demo window. This time, we perform the build and discover that the code we copied doesn't compile. (There were some compile errors in the code on the DogStatsD page when the video was recorded; those have since been corrected.) Once we fix the code, it runs, and going back to Grafana, we see the gauge for the set example begin to populate.

The last example is for a histogram. We copy and paste the Go code for the histogram from the DogStatsD page, kick it off, return to Grafana, and watch the histogram being populated. As expected, it maps the mean and the max from this dataset.

Now let’s take a look at how we’re doing this. From the drop-down menu under Example, Histogram, we choose Edit, select InfluxDB as our data source, and select our example_metric table, which is in the Datadog code. Our environment tag is set equal to dev, and we have histogram.mean for the mean point and histogram.max for the max point.

The demo above, with all four examples, illustrates how to send your existing Go StatsD code to your Sensu agent StatsD listener and have it graph to InfluxDB and Grafana.

In this next demo, we show the same type of work, but instead of using the Go language examples, we use the Datadog Ruby code. We also limit this demo to just the histogram data type, instead of covering all the StatsD data types we’ve shown already.

The environment for this demo is the Sensu Go workshop, which is available on GitHub. It’s an easy way to spin up an entire reference architecture around Sensu, with Sensu Go as the backend agent, TimescaleDB as the default time-series database, and Grafana as the dashboard. Our Sensu environment for this demo has just a single agent configured.

Starting fresh from our previous demo, we paste the example Ruby code from the DogStatsD page into our terminal. Just as we did in the gauge data type demo in Go, we discover a small bug right away, so we fix the typo. We’re also running our agent in a Docker container, so we need to modify the default port.

Next we install the StatsD Ruby gem, dogstatsd-ruby, and run the example, which emits metrics to the local port. You can see in the dashboard when we start to get metrics in our database, and when they start being graphed in Grafana. Just as we get a graph that shows the min-max average in Datadog, you can get the same thing in Grafana, with the time-series database of your choosing. So there you have it: StatsD with Ruby and Sensu Go.

Using Sensu and StatsD together for custom metrics

The examples above show how easy it is to graph metrics from StatsD by using Sensu Go and your database of choice. In the next demo, we show how to use these elements — counters, gauges, and histograms — to collect user activity metrics for a web application using Sensu Go and StatsD.

Of note, the following demo shows how to use StatsD timers to get performance metrics out of a web app’s functions. Being able to get custom metrics that are exactly what you need for your instrumentation demonstrates just what a flexible framework StatsD really is.

For this demo, we use a Python Flask application — “Eel Slime,” an application Jef developed so he and his friends can play the Snake Oil card game while staying socially distant. This demo focuses on getting performance information about the functions he’s written into his app by using function timers, which helps Jef during prototyping, showing which function implementations are most performant for his use case.

The previous demos showed examples of the DogStatsD histogram metrics data type, which is actually an abstraction of the standard StatsD timer metrics data type. The original implementation for StatsD was built to provide metrics for web applications, and timing of function execution was an important part of that original mission. Many StatsD client implementations provide function decorators for timers, because they are commonly used when doing function performance testing.

In this demo, we show how easy it is to make use of the StatsD timer to measure web application function performance. Eel Slime is a pretty simple Flask app so far, with just one important web page view for managing card hands, allowing players to draw and discard cards from their hand on each round.

The core functionality of this web app is to draw cards from the pool of available nouns for each round of the game. (In case you’re not familiar with Snake Oil, you have to combine two nouns from your hand to make a compound noun that describes a product you then try to sell to another player. Hence the name of the game, “Snake Oil.”) Jef used an online random word generator to provide the nouns along with a static backup word list in case the random word service fails.

The function that pulls the cards is instrumented with a simple DogStatsD timer to see how well the function is performing. Jef has added a simple counter to help keep up with the error rate when trying to pull new random nouns from the online service.

Similar to the previous histogram and counter demos, we import the Datadog module and initialize with the StatsD host information, but this time, we use the Python function decorator instead of manually adding the histogram stat. The decorator does all the work behind the scenes to time the function and then send that information as if it were a histogram.
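Conceptually, such a timing decorator can be built in a few lines of standard-library Python. This is a sketch of the idea, not the actual DogStatsD implementation, and the metric and function names are hypothetical stand-ins for the card-draw function:

```python
import functools
import socket
import time

def statsd_timed(metric, host="127.0.0.1", port=8125):
    """Decorator that times a function and emits the duration as a StatsD timer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                # "|ms" marks this as a timer; servers aggregate it like a histogram.
                sock.sendto(f"{metric}:{elapsed_ms:.3f}|ms".encode(),
                            (host, port))
        return wrapper
    return decorator

@statsd_timed("cards.new_cards")  # hypothetical metric name
def new_cards():
    time.sleep(0.01)              # stand-in for the real card-draw work
    return ["anchor", "balloon"]
```

Because the timing happens in a `finally` block, a duration is emitted even when the wrapped function raises, which is exactly what you want when you're measuring a flaky external service.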

Running the Python function to load test, we can see how well the application performs by looking at the Grafana dashboard. We run the function in a loop with a small sleep in between, and we send it through the Sensu agent that’s running as a StatsD server. That in turn calls an InfluxDB handler and populates the Grafana view.

On the Grafana dashboard, you can see that the Sensu agent emits the StatsD event with the metrics every 10 seconds or so. The dashboard graphs both metrics: the top graph shows the error count alongside a counter tracking how many times the function has been called. Comparing the two, it looks as if the online service errors out every time the function runs.

This setup also gives us timing information for each 10-second window of data. For each packet we get the minimum, the mean, the maximum, and the 90th percentile. In this case, we see that the online service is failing, so we fall back to the static list of words every time and sort it randomly. The performance is reasonably fast, so we probably don't need the online random word generator at all.
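That per-flush summary is simple arithmetic over the timer samples received during the interval. A sketch of the aggregation a StatsD server performs at flush time (real implementations differ slightly in how they round the percentile index):

```python
import math

def summarize_timer(samples, percentile=90):
    """Aggregate raw timer samples into per-flush summary statistics."""
    ordered = sorted(samples)
    # Keep the value at the requested percentile rank.
    idx = max(0, math.ceil(len(ordered) * percentile / 100) - 1)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
        f"p{percentile}": ordered[idx],
    }

# Five timer samples (milliseconds) received during one flush interval:
print(summarize_timer([12.0, 15.0, 11.0, 90.0, 14.0]))
```

This is also why timers are cheap to store: however many samples arrive during a flush interval, only a handful of summary values travel on to the backend.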

However, instead of telemetrizing just this one function, we actually can telemetrize the entire Flask route.

@app.route('/cards', methods=['GET', 'POST'])
def cards():

When we take a look at the route page for the Flask app, you can see there’s not much in there right now. The Cards route does just a bit of work; it calls the new card function when the “form submit” button is pushed.

We want to provide information in the dashboard about how many times the submit button has been pushed as well as information about the whole route, including time needed to render the template plus any extra work beyond the random card draw function.

This can be done pretty simply. First we initialize the DogStatsD information, then add the decorator associated with the metric we want. We use a different metric name so as to not conflict with metrics coming from the other function.

Now we want to provide a submetric associated with this function that's just a gauge. The gauge we have is submit, a submetric of the flask_cards metric. We should now be able to restart the Flask application without getting any errors.
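On the wire, that gauge submetric is just another StatsD line. A stdlib-only sketch of what gets emitted (flask_cards.submit is the hypothetical metric name from the demo; the real app uses the DogStatsD client rather than a raw socket):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD_ADDR = ("127.0.0.1", 8125)
submit_count = 0

def record_submit():
    """Bump the gauge each time the submit button triggers a card draw."""
    global submit_count
    submit_count += 1
    # "flask_cards.submit" is a hypothetical name, chosen so it won't
    # collide with the metrics emitted by the card-draw function itself.
    sock.sendto(f"flask_cards.submit:{submit_count}|g".encode(), STATSD_ADDR)
```

Unlike a counter, a gauge simply reports the latest value, so the graph shows the running submit total rather than a per-flush delta.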

To check, we go to the application home page, then to the Cards page and hit the submit button, which performs the draw cards function. Every time we do that, the submit count goes up, indicating the metric is working.

When we check the Grafana dashboard, it doesn't yet show any information about these new values. We turn on one of the queries we have set up; instead of looking up new cards, we look up the metric flask_cards. Then we get the submit value, which is the value of that gauge.

The submit count appears as an orange line. As we continue to submit, the gauge should go up as well.

We still have all the information about the new card draw, and we also have information about how many times this route has been submitted. We can add some timing information, too. The dashed purple line you see is the max time associated with the route. It’s just a little more than the max time associated with the job, so we know that the drawing function accounts for most of the time.

This process lets you see the time-performance bottlenecks in your web application. This use of function timers is what we like to call “APM lite” (APM = application performance management). You monitor just the functions that matter for the critical path to get the information you need.

And while it’s great to know you can use StatsD to give you just enough APM for what you need, we wouldn’t call it a full-blown APM solution. (Stay tuned for an upcoming post that goes deeper into what we consider “just enough APM.”)

Get started with StatsD + Sensu

Just like so many other tools, StatsD works easily with Sensu. In fact, almost any StatsD guide you could find on the web should “just work” with Sensu, as we showed in our demos. And our documentation explains how to configure the enrichment, processing, and routing of metrics for storage and analysis in the data platform of your choosing.

But you don’t have to go find a StatsD guide yourself — you can download Sensu and try out the Sensu Go StatsD integration for free:

Try the Sensu Go StatsD Integration

As a reminder, the commercial version of Sensu is free to use up to 100 nodes. If you need to use Sensu at scale (which includes unlimited metrics), you can sign up for a free 30-day trial.

As you’re getting started with Sensu and StatsD, we invite you to join the Sensu Community on Discourse. Come ask questions, share your feedback, and engage with other Sensu users. We look forward to seeing you there!

  1. To learn more about the work done at Etsy, read Ian Malpass’s article, Measure Anything, Measure Everything, on Code as Craft — it’s 10 years old, but still highly relevant. Cal Henderson’s piece about measurement at Flickr, Counting and Timing, predates Ian’s article by three years, but it’s still a great read and well worth your time. ↩︎