Issue 051

Hey folks, welcome to another installment of Monitoring Weekly! Did you write something about monitoring recently? Maybe you've got an idea rolling around in your head? Send it on over and let the community learn from you. :D

Monitoring News, Articles, and Blog posts
April 2018 SF Metrics Meetup

Graphite is pretty great, but one of the big objections to it is on the topic of scale. So I went looking for someone to really tell us how it is and found Brad Lhotsky from Craigslist. He’ll be speaking at the SF Metrics Meetup on April 4th on scaling Graphite. For those not in San Francisco, there’s a livestream!

Kausal to join Grafana Labs to bring Prometheus to the masses

This is the first time I’m learning about Kausal, which looks like a super neat product. The folks at Grafana have acquired them, so I suspect we’re going to see Grafana become significantly more than just visualization plus basic alerting in the future.

Time-series histograms with Rothko — metrics collection for large deployments

More people should use histograms for time series data visualization. Quantiles and percentiles are great, but they hide the spread of datapoints and the worst cases. Histograms solve those problems. This new tool is essentially a TSDB focused on storing metrics for later visualization as a histogram.
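
To make the point about percentiles concrete, here's a minimal sketch in plain Python. It has nothing to do with how Rothko itself works: the two latency datasets, the bucket width, and the helper names `percentile` and `histogram` are all made up for illustration. Both datasets report the same p50 and p99, yet their histograms look nothing alike.

```python
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 1-100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) * pct + 99) // 100  # integer ceiling of n * pct / 100
    return ordered[rank - 1]


def histogram(values, bucket_width):
    """Count values per bucket, keyed by each bucket's lower bound."""
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical request latencies in milliseconds (100 samples each).
tight = [100] * 99 + [110]
bimodal = [20] * 40 + [100] * 59 + [9000]

if __name__ == "__main__":
    for name, data in (("tight", tight), ("bimodal", bimodal)):
        print(name, "p50:", percentile(data, 50), "p99:", percentile(data, 99))
        print(name, "histogram:", histogram(data, 100))
```

The percentiles can't tell the two apart, while the histogram of the second dataset shows both the fast cluster and the single 9000 ms outlier.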

On Prometheus

The folks at Sitewards reflect on 12 months of Prometheus in their organization, lessons learned, and where they hope to go from here.

AWS Route 53 Logging with Logz.io and the ELK Stack

It turns out that Route 53 can do query logging now, which is kinda cool. You can ship the data to CloudWatch and S3, and from there to an external service for more in-depth analysis, such as ELK.

Burst credits of t2 EC2 instances need monitoring

Did you know t2 instances on EC2 depend on burst credits? Did you know running out of burst credits means you’d be better off just killing the instance? This article goes into more detail about it, including why and how to monitor burst credits on t2 instance types.
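
If you want a rough feel for the arithmetic behind that, here's a back-of-the-envelope sketch (mine, not from the article). It assumes one CPU credit equals one vCPU running at 100% for one minute, and the example numbers (a single vCPU earning 6 credits per hour with a 144-credit balance) are what I understand a t2.micro to look like, so check them against the AWS docs for your instance type. The function name is hypothetical; in practice the balance would come from something like the `CPUCreditBalance` metric in CloudWatch.

```python
def hours_until_credits_exhausted(balance, earn_per_hour, vcpus, utilization):
    """Estimate hours until the credit balance reaches zero.

    Assumes one credit = one vCPU at 100% for one minute, and that
    utilization (0.0 to 1.0) stays constant. Returns None when the
    instance earns credits at least as fast as it spends them.
    """
    spend_per_hour = vcpus * utilization * 60  # credits burned per hour
    net_burn = spend_per_hour - earn_per_hour
    if net_burn <= 0:
        return None
    return balance / net_burn


if __name__ == "__main__":
    # Assumed t2.micro-like figures: 144 credits banked, 6 earned per hour.
    print(hours_until_credits_exhausted(144, 6, vcpus=1, utilization=0.5))
```

An alert on the remaining hours (or simply on the balance dropping below some threshold) gives you time to react before the instance gets throttled.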

How our production team runs the weekly on-call handover

On-call handoff is an under-appreciated and oft-overlooked aspect of on-call. Having the opportunity to discuss the previous on-call period is incredibly worthwhile for teams. One of the tricky parts, though, is remembering everything about it. The folks at GitLab have created a tool to make this much easier to do.

Tonight We Monitor, For Tomorrow, We Test in Production! (video)

I love this quote: “If you’re not really monitoring, you’re not really testing–you’re just hoping that things go right.” This is a good watch from the recent Test in Production Meetup.

Why we removed Inbox delivery tests from our status page

Sometimes the metrics you depend on turn out to be a poor indicator of actual performance. The folks at Postmark recently came to this conclusion and wrote up a fantastic post about why the monitoring they had in place wasn’t actually telling them what they (and customers) wanted to know.

See you next week!

– Mike (@mike_julian) Monitoring Weekly Editor