Hey folks, I’ve got a special treat for your weekend reading: a Monitoring Weekly special issue! We’ve grown quite a lot since the first issue back in March, so many of you missed some really great articles that have run over the past few months. This special issue is chock full of the best articles and tools from every Monitoring Weekly issue over the last quarter. So, without further ado, grab another cup of coffee and enjoy your Saturday reading material!

Monitoring News, Articles, and Blog posts
Monitor your applications with Prometheus

Prometheus is a cloud-native monitoring system seeing impressive growth and adoption. This hands-on guide demonstrates how to instrument an existing app and expose your metrics using a Prometheus-compatible HTTP endpoint and then ask questions of your data with PromQL.
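
To give a flavor of what "Prometheus-compatible HTTP endpoint" means, here is a minimal sketch using only the Python standard library. It is not taken from the linked guide, and a real app would normally use an official Prometheus client library instead. The metric name `app_requests_total`, the `REQUEST_COUNTS` dict, and the `MetricsHandler` class are all made up for illustration.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-process counters; a real app would increment these as it works.
REQUEST_COUNTS = {"app_requests_total": 0}


def render_metrics(counters):
    """Render a dict of counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counters at /metrics and 404 everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, pass `MetricsHandler` to `http.server.HTTPServer` on a port of your choice and call `serve_forever()`. Once Prometheus scrapes the endpoint, a query such as `rate(app_requests_total[5m])` would show the per-second request rate.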

Practical Services Monitoring with Prometheus and Docker

A pragmatic look at how one company uses Prometheus to monitor their Docker Swarm cluster, from the Exporters used with containers to the custom service discovery methods employed to make “Federated” Prometheus work in their architecture. Most importantly, they conclude with some hard-won lessons and necessary improvements for deploying Prometheus in production environments.

A Practical Guide to Monitoring and Alerting with Time Series at Scale

Jamie Wilkinson, SRE at Google, gave a presentation at SREcon recently on how to design effective and useful alerts. Though Google uses Borgmon internally, Jamie relates all of his recommendations to Prometheus and how to implement them using it.

Lies My Parents Told Me (About Logs)

The Honeycomb.io blog is chock full of monitoring gold and this post about logging is no exception. I think a great alternative title to this article could be “Logging Antipatterns” or perhaps “11 Ways You’ve Screwed Up Logging.”

Metrics @ Robinhood

Part One of a multi-part series looking at how Robinhood (the stock trading service, not the backwoods outlaw) collects, manages, and interacts with the metrics used to monitor their internal services. In this post we get a first look at how application metrics get routed through statsd and Kafka to their OpenTSDB storage backend.
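
For readers who haven't seen statsd on the wire, here is a rough sketch of how an app might emit a metric. This is not Robinhood's code; the metric name and the helper functions `format_statsd` and `send_statsd` are hypothetical. It assumes the plain statsd line format (`name:value|type`) sent over UDP to the conventional port 8125.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build a statsd line: 'c' is a counter, 'g' a gauge, 'ms' a timer."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single statsd line over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Calling `send_statsd(format_statsd("orders.placed", 1, "c"))` would increment a counter on a local statsd daemon. Because it's UDP, the call doesn't wait for a reply, so instrumentation stays cheap for the app.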

Monitoring Redis

Mike Perham (of Ruby’s Sidekiq fame) gives some tips for monitoring Redis: gathering internal stats from the INFO command, avoiding disk pages, watching for network latency, and identifying slow commands.
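
If you want to pull those INFO stats into your own tooling, the output is easy to parse: `key:value` lines grouped under `# Section` headers. Below is a small sketch that is not from Mike's post. The `parse_redis_info` helper and the sample text are illustrative assumptions, and real INFO output contains many more fields.

```python
def parse_redis_info(raw):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats


# Hypothetical sample trimmed down to two fields.
SAMPLE = (
    "# Memory\r\n"
    "used_memory:1048576\r\n"
    "\r\n"
    "# Stats\r\n"
    "instantaneous_ops_per_sec:42\r\n"
)
```

Values are left as strings on purpose, since INFO mixes integers, floats, and comma-separated fields. Convert only the ones you plan to graph or alert on.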

What’s not Actionable & Business Critical Shouldn’t Ring: Building the Right Alerting System

I’m a big fan of stories about how teams have cut down on unnecessary alerts. This article is particularly interesting because of both the before-and-after numbers and the specifics of how they approached the problem.

Caveats in metric collection

Sometimes antipatterns creep into monitoring efforts. This article goes through some hard-won lessons found in instrumenting applications for metrics and logs.

Logs and Metrics

A wonderfully deep look at when you might want a metric versus when you might want a log, the role of unit tests versus monitoring, structured versus unstructured logging, whitebox versus blackbox metrics, and how all of this fits nicely under the umbrella of “observability.”

CPU Utilization is Wrong

Turns out every time Brendan Gregg drops some knowledge, I walk away with a new perspective on the world. No exception here, either: think you know what %CPU in ‘top’ means? Think again.

Don’t Read Your Logs

A caution that logs aren’t always the best instrumentation approach for your apps, including plenty of examples where exception capture tools or metric tools would be a far better solution than some common logging patterns.

The Calculus of Service Availability

The authors of the Site Reliability Engineering book expand on their concept of Service Level Objectives in this article.

A Million Metrics per Second

I love stories about the monitoring journey teams go through and the lessons they learn about their apps, infrastructure, and themselves along the way. This one is from the folks at Swissquote and is largely Graphite-focused. Also, 1.1 million metrics per second is nothing to sneeze at (everyone thinking “Graphite doesn’t scale” should probably settle down now…)

Metrics are dead? Thoughts after Monitorama

I love the different take on this. It’s true that Monitorama felt very much “metrics are the past” this year, but I think the author is spot on: metrics are here to stay, and for good reasons.

The Art of Data Visualization

Being in ops, we all love a good line chart. Histograms are finally starting to become a thing, but there are more options for visualization out there. The author goes over seven graph types and their typical use cases.

Your Dashboard Needs a Waffle Chart

Oh how I loathe pie charts. Pie charts have a special place in hell, in my opinion. As a visualization, they’re meh at best, utterly atrocious at worst, and there’s nearly always an alternative visualization that better conveys the information anyways. This article talks about one of the alternatives, amusingly named a “waffle chart”.

Going open-source in monitoring, part I: Deploying Prometheus and Grafana to Kubernetes

The first in what looks like it will be a pretty awesome series on implementing open-source monitoring. This article is exactly what the title suggests: Prometheus and Grafana, running on Kubernetes. You won’t find a super deep-dive here, but you will find a configs-included starter approach.

Going open-source in monitoring, part II: Creating the first dashboard in Grafana

The second installment in a series I’ve covered in past issues. The examples are great, but I really love the whole guiding purpose behind how the author is building Grafana dashboards: replace New Relic. For teams steeped in New Relic, this could make it easier to switch to Grafana.

What the heck is time-series data (and why do I need a time-series database)?

Foundational skills and knowledge are, in my opinion, what sets great engineers apart from good engineers. The folks at TimescaleDB give us all a great foundational walkthrough of what time series data is and how it’s different from other data. Even for someone well-versed in the monitoring world, it’s worth a read.

Cachet: The Open Source Status Page System

A self-hosted, PHP-based StatusPage.io clone.

Deadman Check

Monitoring absence-of-data things (such as a backup job that didn’t run) has always been a huge pain. This tool allows you to easily monitor those sorts of things using the “dead man’s switch” approach.
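
The core idea behind a dead man's switch is tiny: the job checks in when it succeeds, and you alert when the check-ins stop. Here is a bare-bones sketch of that logic. It is not how the Deadman Check tool itself is implemented, and the `is_overdue` helper and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def is_overdue(last_checkin, now, max_silence):
    """Return True when the job has been silent for longer than max_silence."""
    return now - last_checkin > max_silence


# Example: a nightly backup that last checked in 26 hours ago.
last_seen = datetime(2017, 7, 1, 2, 0)
current_time = datetime(2017, 7, 2, 4, 0)
backup_missed = is_overdue(last_seen, current_time, timedelta(hours=25))
```

Setting `max_silence` a bit longer than the job's schedule (25 hours for a daily job here) leaves room for normal run-time jitter without hiding a genuinely missed run.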

Cerebro: open alerting for DevOps teams

Cerebro is an “open alerting system” designed to integrate with Graphite’s time-series API and Seyren’s alerting and scheduling features. It offers a native REST API and dashboard to allow users to interactively or programmatically construct alerting rules with custom notification recipients.

Simple command line stats

We all love a well-designed, thought-out, permanent solution to an engineering problem. Of course, sometimes a clever, quick-and-dirty approach is just what you need.


About your friendly editor

I’m Mike Julian, a monitoring/observability consultant and trainer. I help companies improve their application and infrastructure monitoring. Interested in working together? You can find me at AsterLabs.io.

Do you enjoy Monitoring Weekly?

If you like what you’ve seen, here’s the link to invite your friends and colleagues! As always, if you have interesting articles, news, events, or tools to share, send them my way by emailing me (just reply to this email).

See you next week!

– Mike (@mike_julian) Monitoring Weekly editor