Driving old, beat-up cars is both a treat and a nightmare, especially when it comes to figuring out why they’ve stopped working (this time). In many ways, diagnosing issues with an old car feels a lot like monitoring for and diagnosing failures in software.
The question sounds kinda weird to those well-versed in monitoring and observability, but underneath it is a point I run into a lot: many people instrument too little, not too much, so the advice in this article is a great starting point for level-setting. Think of it like negotiating a salary: if you think you’re worth X, it’s a game changer for someone knowledgeable to say, “No, you should be expecting at least X+20%.”
Great explanations of SLOs, error budgets, and metrics, plus an awesome bonus: they’ve codified the explanation in some publicly available Grafana dashboard definitions.
Sparse metrics, aka datapoints that don’t arrive at consistent intervals or that have large gaps between them, are a big pain to deal with in the time series world for a bunch of reasons. The folks at Influx have some suggestions on how to properly handle them so you don’t end up with incorrect answers.
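To make the "incorrect answers" part concrete, here is a minimal sketch of one way sparse data can mislead. It is not taken from the Influx article; the sample data, the function names, and the choice to carry the last value forward across gaps are all assumptions made for illustration. A plain average treats every datapoint equally, while a time-weighted average accounts for how long each value was (presumably) in effect.

```python
from typing import List, Tuple

# (timestamp in seconds, value); hypothetical readings with a long gap after t=30
Sample = Tuple[float, float]


def naive_mean(samples: List[Sample]) -> float:
    """Average the values, ignoring how far apart the samples are."""
    return sum(value for _, value in samples) / len(samples)


def time_weighted_mean(samples: List[Sample], end: float) -> float:
    """Average where each value is held until the next sample (or until `end`).

    Assumption: carrying the last observation forward is appropriate for this
    metric. That holds for gauges like queue depth, but not for every metric.
    """
    total = 0.0
    # Pair each sample with the timestamp of the one that follows it.
    following = samples[1:] + [(end, 0.0)]
    for (start, value), (next_start, _) in zip(samples, following):
        total += value * (next_start - start)
    return total / (end - samples[0][0])


samples = [(0, 10.0), (10, 10.0), (20, 10.0), (30, 100.0), (300, 10.0)]

print(naive_mean(samples))                         # 28.0
print(round(time_weighted_mean(samples, 310), 2))  # 88.39
```

The two numbers disagree by a factor of three on the same five datapoints. Which one is "right" depends on what the metric means during the gap, which is the decision sparse data forces you to make explicitly.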
Here’s some good insight into how New Relic handles on-call and incident response. I really like this pattern of companies posting their on-call/incident management methodologies publicly now. It’s silly for people to reinvent the wheel when so many companies have workable processes already.
Exactly as the title says, but it also introduced me to a new tool: slacktee, which works like the Linux tee command but posts to Slack instead of writing to a file.
Perhaps surprisingly, one of the most challenging things about operating RubyGems.org is the logs. Unlike most Rails applications, RubyGems sees between 4,000 and 25,000 requests per second, all day long, every single day. As you can probably imagine, this creates… a lot of logs. A single day of request logs is usually around 500 gigabytes on disk. That’s, uhh, a lot of logs. There are some interesting comments on this at Hacker News, mainly around how slow that parsing rate actually is (which is absolutely not the fault of the author) and why that might be.
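For a rough sense of scale, here is a back-of-the-envelope sketch using only the figures quoted above. It assumes decimal gigabytes and a constant request rate at each end of the quoted range, neither of which the article confirms; the constant and function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: "500 gigabytes" means 500 * 10^9 bytes.
LOG_BYTES_PER_DAY = 500 * 10**9


def bytes_per_request(requests_per_second: float) -> float:
    """Average log bytes per request if the rate held steady all day."""
    return LOG_BYTES_PER_DAY / (requests_per_second * SECONDS_PER_DAY)


def average_log_bytes_per_second() -> float:
    """Sustained write rate needed to produce a day's worth of logs."""
    return LOG_BYTES_PER_DAY / SECONDS_PER_DAY


for rate in (4_000, 25_000):
    print(f"{rate} req/s -> {bytes_per_request(rate):.0f} bytes per request")

print(f"{average_log_bytes_per_second() / 10**6:.2f} MB/s of log data on average")
```

Since real traffic moves between the two rates, the true average line size lands somewhere between the two results. Any tool that wants to keep up has to sustain at least the average byte rate just to avoid falling behind.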
When Baron Schwartz talks databases, you should be listening. This talk goes through his own framework for monitoring a database.
It feels like there’s a recent uptick in monitoring companies open-sourcing stuff, which I quite like, for all the reasons Sematext lays out in this article, actually. Their agent is Java-based, has all the integrations you’d expect, and writes to InfluxDB.
Capacity planning is one of those things that everyone knows they should be doing but no one ever actually does, usually because it’s so complex and just a total pain in the ass. The folks at Etsy, in their usual way, wrote up a concise article on how they recently did their capacity planning exercise in the wake of their migration from their own datacenter to Google Cloud Platform.
I’m speaking at OSMC in Nuremberg next week. Come on out if you’re in the area.
I had the pleasure of speaking with the hiring manager recently and it sounds like a really awesome gig. If you’re into Ops/SRE/DevOps and love monitoring, click through to check it out.
Want your job listed here? Why not submit a post to the job board? It’s only $199/ad for 30 days.
See you next week!
— Mike (@mike_julian) Monitoring Weekly Editor