This issue is sponsored by:

One of the most frustrating things about some monitoring vendors is that they don’t even use their own software. Raygun totally isn’t one of those. See how they used their own software to find and fix an error buried in the app for only 263 users.

From The Community

Stack Overflow: How We Do Monitoring

I absolutely love it when Nick Craver writes up these posts about Stack Exchange. They’re always incredibly detailed and interesting.

Resilience Weekly

My friend Thai Wood, an engineer specializing in reliability at Walmart Labs and former EMT, sees a problem: most of the academic research on resilience, incident management, and emergency triage is tucked away inside esoteric papers and non-technical industry knowledge. His new email newsletter aims to bring the lessons academia and other emergency fields have found into our realm, where we can put them into practice.

Using Audit Logs for Security and Compliance

Every security engineer I know loves logs. Especially all the ones in /var/log/ that most ops people tend to ignore. This article talks about those logs, what’s in them, and why you should care.

5 alerting and visualization tools for sysadmins

I like this article, and not only because I’m quoted in it. It does bring up one point, though: I really don’t like the term “informational alert” that I used in the Practical Monitoring book, but I can’t think of a term that better describes a message that is sent automatically for “review at your convenience” and doesn’t need to wake someone up. For example, a high child process churn rate in supervisord, or an instance in an ASG that keeps getting killed and respawned: it’s good to know it’s happening so someone can look into it, but it’s not impacting customers, so it’s unnecessary to wake someone up. Anyone got a better term?

Implementing SLOs using Prometheus and Grafana

Great explanations of SLOs, error budgets, and metrics, but also an awesome bonus: they’ve codified the explanation in some publicly-available Grafana dashboard definitions.
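
For anyone who wants the error budget arithmetic spelled out, here is a minimal Python sketch. It is not taken from the article or its dashboards: the function name, the request-count-based SLO, and the sample numbers are all assumptions made for illustration.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize an availability error budget for one SLO window.

    Hypothetical helper: assumes the SLO is expressed as the fraction of
    requests that must succeed (e.g. 0.999 for "three nines").
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if failed_requests > total_requests:
        raise ValueError("failed_requests cannot exceed total_requests")

    # The budget is the number of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    # Guard against an empty window, where nothing is allowed to fail yet.
    consumed = failed_requests / allowed_failures if allowed_failures else 0.0

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_consumed": consumed,
        "slo_met": remaining >= 0,
    }


# Made-up numbers: a 99.9% SLO over one million requests, 250 of which failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
```

With these made-up numbers the budget is roughly 1,000 failed requests, so 250 failures use up about a quarter of it.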

How many metrics should an application return?

The question sounds kinda weird to those well-versed in monitoring and observability, but underlying it is a problem I run into a lot: many people tend to instrument too little, not too much. That makes the advice in this article a great starting point for level-setting. Think of it like negotiating a salary, in a sense: if you think your worth is X, it would be a game changer for someone knowledgeable to say “No, you should be expecting at least X+20%.”

This issue is sponsored by:

For you folks in large, complex companies, you’re gonna love this. Blue Medora allows you to ingest and ship monitoring data from multiple tools to other tools, without having to change your monitoring tools at all. Imagine that: no fights over which vendor is best–you can use whatever.

Cron jobs execution monitoring in slack

Exactly as the title says, but it also introduced me to a new tool: slacktee, which works like the Linux tee command but posts to Slack instead of writing to a file.

How to Monitor Your Database

When Baron Schwartz talks databases, you should be listening. This talk goes through his own framework for monitoring a database.

Why Your Server Monitoring (Still) Sucks

I wrote an article for Linux Journal on the top five reasons your monitoring still sucks.

Best Practices for On-Call and Incident Response

Here’s some good insight into how New Relic handles on-call and incident response. I really like this pattern of companies posting their on-call/incident management methodologies publicly now. It’s silly for people to reinvent the wheel when so many companies have workable processes already.

Observability at Scale: Building Uber’s Alerting Ecosystem

On a much lighter note, Uber’s alerting service is interesting. It’s mostly in-house tools put together in a pipeline, though they do rely on Graphite’s query language for metric queries. There’s some built-in alert dedupe going on, too.

GitOps Part 3 — Observability

Chock-full of references to other awesome articles, this article hits on the observability challenges Weaveworks encounters with their GitOps-based processes and a few of the solutions they’ve implemented.

This issue is sponsored by:

Monitoring Microservices on Kubernetes

While Kubernetes abstracts away many complexities, it also introduces new operational and monitoring challenges. This free ebook from SignalFx discusses each component of Kubernetes, the challenges of monitoring it effectively, and the solutions.

Why Use K-Means for Time Series Data? - Part 1, Part 2

These articles remind me of why I have a long way to go with my grasp of stats.

We can do better than percentile latencies

We’ve known for some time that using the average for things like latency results in missing a ton of data, which is why using 95th or 99th percentile is now common. But the author makes another point: many vendors implement percentiles in a pre-aggregated way, resulting in the same problem.
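
Here is a small Python sketch of one way pre-aggregation can go wrong. It rests on my own assumptions rather than the article's example: that "pre-aggregated" means each host computes its own p99 and the per-host values are then averaged, and that the host names and latencies below (all made up) are representative.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, which avoids float rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds from two hosts.
host_a = [10] * 990 + [20] * 10   # busy host, almost everything is fast
host_b = [10] * 50 + [500] * 50   # quiet host, half of its requests are slow

# Pre-aggregated view: compute p99 per host, then average the results.
p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)
averaged_p99 = (p99_a + p99_b) / 2

# Raw view: merge every sample first, then compute p99 once.
merged_p99 = percentile(host_a + host_b, 99)

print(f"per-host p99: {p99_a} ms and {p99_b} ms")
print(f"average of per-host p99: {averaged_p99} ms")
print(f"p99 over all requests: {merged_p99} ms")
```

With these numbers the averaged figure is 255 ms while the p99 over all requests is 500 ms, and neither per-host value matches the averaged one. Once the raw samples are gone, the real percentile cannot be recovered from the per-host summaries.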

Four Great SaaS Visualizations

Visualization goes hand-in-hand with great monitoring but I’ve found too few of us really think hard about it. This article isn’t about monitoring at all, but rather talks about the business side of things and visualizing business KPIs. That said, there are great takeaways for those of us building visualizations or just creating charts for a report every now and then.

Heatmaps Make Ops Better

If you’re still wondering why heatmaps are awesome, this article has some great graphs to show their value and why other visualizations fall short for some data/questions.
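
Under the hood, a latency heatmap is just a two-dimensional histogram: one axis is time windows, the other is latency buckets, and the color is the count in each cell. Here is a rough Python sketch of that bucketing. The function name, window size, bucket size, and sample data are my own assumptions, not anything from the article.

```python
from collections import Counter


def heatmap_counts(samples, window_s, bucket_ms):
    """Count samples per (time window start, latency bucket start) cell.

    `samples` is an iterable of (unix_timestamp_s, latency_ms) integer pairs.
    """
    counts = Counter()
    for timestamp_s, latency_ms in samples:
        # Round each value down to the start of its window or bucket.
        window_start = timestamp_s // window_s * window_s
        bucket_start = latency_ms // bucket_ms * bucket_ms
        counts[(window_start, bucket_start)] += 1
    return counts


# Made-up samples: three requests in the first minute, one in the second.
samples = [(0, 12), (30, 18), (45, 250), (61, 15)]
cells = heatmap_counts(samples, window_s=60, bucket_ms=100)

for (window_start, bucket_start), count in sorted(cells.items()):
    print(f"t={window_start}s latency>={bucket_start}ms count={count}")
```

The single 250 ms request in the first window gets its own cell, which is exactly the kind of outlier a line chart of the average would smooth away.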

Lumen: Custom, Self-Service Dashboarding For Netflix

This looks like a great dashboard framework from the folks at Netflix, but sadly, it doesn’t appear it’s open-sourced (yet). Still, a great read for those working on their own dashboard systems.

Building, testing and iterating our monitoring and alerting service

I’m a huge fan of the folks at the United Kingdom’s Government Digital Service and the work they’re doing to modernize government services–something that directly impacts the lives of citizens, residents, and visitors. This article talks about their monitoring and alerting.

How we built ‘BARITO’ to enhance logging

What do you do when you’re beyond what ELK can reasonably handle? Well, you either fork over a mind-bogglingly large sum of money to Splunk, or you build your own solution. You can guess which option the folks at GO-JEK opted for.

How To Improve On-Call

Sourced from a fantastic list of resources on Twitter, I wanted to put all of this advice in one convenient location. Enjoy. (got something you think needs to be in this list? send it over!)

How Pinterest runs Kafka at scale

One of the most common ways to scale and manage time series ingestion is to put Kafka in front. That’s pretty much the pattern for most SaaS monitoring solutions and every large-scale in-house monitoring system I’ve seen. So, with that, here’s an article about Pinterest’s Kafka setup.

Real World DevOps Podcast

You asked for a podcast, so you know what? Fine, here, have a podcast. (it’s pretty great)

This issue is sponsored by:

Incident management without data is called guessing

The foundation of any incident management practice is data–whether that’s logs, time series, or complaints on Twitter. Of course, even complaints on Twitter can be turned into time series data. Learn how VictorOps and Splunk can solve that problem for you.


Want your job listed here? Why not submit a post to the job board? It’s only $99/ad for 30 days.

See you next week!

– Mike (@mike_julian) Monitoring Weekly Editor