Driving old, beat-up cars is both a treat and a nightmare, especially when it comes to figuring out why they’ve stopped working (this time). In many ways, diagnosing issues with any old car feels not-at-all dissimilar to monitoring for and diagnosing failures in software.
For those who didn’t know, California has been on fire for the past couple of weeks in the most devastating wildfire in California history. The scene outside has been, well, apocalyptic. Fred Moyer at Circonus, who lives just down the road from me, wrote up Take 2 of his air quality analysis using some IoT sensors, a metrics tool (Circonus, of course), and a bit of toothpicks-and-glue for good measure. Final verdict: don’t go outside, San Franciscans.
On a much lighter note, Uber’s alerting service is interesting. It’s mostly in-house tools put together in a pipeline, though they do rely on Graphite’s query language for metric queries. There’s some built-in alert dedupe going on, too.
You know how it’d be great to only send alerts during certain times of day? Turns out, that’s an open problem with Prometheus, but this article has a good approach that relies on PromQL. It’s a bit, well, involved, but it works. Also: timezones are hard.
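To make the "timezones are hard" bit concrete, here is a minimal Python sketch of the underlying idea: decide whether a UTC timestamp falls inside a local alerting window. This is not the linked article's PromQL approach, just an illustration of the logic. The function name `in_alert_window`, the 9-to-5 window, and the `America/Los_Angeles` zone are all assumptions made up for the example.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def in_alert_window(now_utc, tz_name="America/Los_Angeles",
                    start=time(9, 0), end=time(17, 0), weekdays_only=True):
    """Return True if now_utc falls inside the local alerting window.

    Hypothetical helper: the window and zone defaults are illustrative only.
    """
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    # Convert to local wall-clock time; zoneinfo applies DST rules for us.
    local = now_utc.astimezone(ZoneInfo(tz_name))
    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    if weekdays_only and local.weekday() >= 5:
        return False
    return start <= local.time() < end


# The same UTC clock time lands on different sides of 09:00 local,
# depending on whether daylight saving time is in effect.
summer = datetime(2018, 7, 16, 16, 30, tzinfo=timezone.utc)   # 09:30 PDT
winter = datetime(2018, 11, 19, 16, 30, tzinfo=timezone.utc)  # 08:30 PST
print(in_alert_window(summer), in_alert_window(winter))
```

A fixed UTC offset baked into a query gets this wrong for half the year, which is a big part of why this gets fiddly when you can only work with UTC hours.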
Speaking of Prometheus, the latest promtool has new functionality to test your PromQL expressions and ensure they’re doing what you intend.
The author walks us through instrumenting a Python app with Elastic APM. I don’t know why, but I’ve always had in my head that Elastic’s APM was a paid product, but it’s actually not. You really can get a free APM tool with them. That’s kinda cool.
I think it’s interesting to see companies discussing abstraction layers for monitoring tools lately. With so many specialist tools, it’s become a huge pain in the ass to manage instrumentation when you have to redo it for each vendor you may be using.
Aside from an awesome name, this tool from Eero does exactly what it says: makes monitoring Java GC much easier.
Hope you like math, ’cause there’s a whole bunch of it. I include this mainly because of the implications for capacity planning (for those of you who do capacity planning exercises).
I had the good fortune to see the inimitable John Allspaw do this talk live at this year’s PagerDuty Summit and it’s just as good the second time around, maybe better, even. I strongly recommend watching this video and letting John turn your understanding of incidents on its head.
Speaking of incidents, PagerDuty just open-sourced their incident response training to go along with their public incident response documentation.
I’m including this not because it’s a bunch of actionable stuff, but because a lot of you keep asking me some variation of, “Why would I use Honeycomb? I don’t understand what it’s for.” They really are building something we’ve not seen before, so I think it’s worth linking to and talking about. (Yeah, it is true that many of you are running systems that wouldn’t benefit from tools like this, and that’s totally okay too.)
I had the pleasure of speaking with the hiring manager and it sounds like a really awesome gig. If you’re into Ops/SRE/DevOps and love monitoring, click through to check it out and apply.
Want your job listed here? Why not submit a post to the job board? It’s only $199/ad for 30 days.
See you next week!
— Mike (@mike_julian)
Monitoring Weekly Editor