It’s time for another “best of” issue! We have some fantastic articles here covering the most popular topics and themes from the past few months. Enjoy!
This issue is sponsored by:
The Plug-and-Debug Serverless Observability Platform
Trouble locating bugs in your serverless environment? Quit wasting precious development time and get an end-to-end map of your services in just four minutes with 1-click distributed tracing. Navigate your serverless chaos seamlessly—with Lumigo.
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
O’Reilly’s new Observability Engineering book has been released, and Honeycomb has made the entire eBook available as a free download. Looking forward to reading this one soon.
What do you do when the most popular dashboards for Kubernetes are too old to support newer Grafana features and panels? You write your own, of course. Props to the author for explaining their motivation for each new visualization.
A quick introduction to log aggregation concepts along with a fairly objective comparison of numerous commercial and open source alternatives.
I feel like this can be a useful overview of monitoring principles for some internal customer teams, but if you’re planning a ground-up monitoring infrastructure for yourself, this article feels like it’s missing a lot of “hard lessons learned”-type considerations.
Pretty big announcement from Grafana. Interesting to read that they plan to support more than just Prometheus metrics.
A look at the role of logs in supporting production workloads along with some helpful tips and best practices to make them even more effective.
How Doctolib audited, and continues to iterate on, its noisy alerting behaviors.
Although the Prometheus project has some pretty good documentation, it runs a bit terse. Nice to see a more extended (and user-friendly) look at the different Prometheus metric types.
I love when observability companies talk about their systems designs. If I’m being honest, you’d think we’d have figured this all out by now (but it’s still a good read). 😜
One of the better articles I’ve read on Distributed Tracing, with some helpful analogies and context to help newcomers develop a foundational understanding of this key observability principle.
Observability powered by SQL
Promscale is a unified observability backend for Prometheus metrics and OpenTelemetry traces built on PostgreSQL and TimescaleDB. With full SQL support, it allows you to solve even the most complex issues in your distributed systems. Find out more about it here. (SPONSORED)
We talk a lot about the OpenTelemetry framework in terms of traces and spans, but it provides enormous value in the form of metrics as well. This post is an excellent guide to what makes OTel metrics unique, how to set them up, and when to use the various types.
This article speaks to me on a very personal level. I’ve built up Observability teams over the years; it’s not surprising that we share many of the same problems, but it’s always interesting to hear how we tackle (or prioritize) them differently.
Everyone’s favorite open source dashboard is out with another new release. Love to see the new alert grouping features.
A handy guide for adding exemplars to your Prometheus metrics (and why this matters).
Always interesting to read how other engineering teams work through really frustrating incidents.
Speaking of hard lessons learned, I love this article on how to approach outages. So many of these examples resonate with me (painfully).
An excellent look at Prometheus histograms and scenarios where you might be using them incorrectly. Please read this one if you care about your metrics.
I love technical postmortems, especially when it comes to TCP/IP networking and DNS. I think most of us can empathize when an old, seemingly benign setting wakes up much later to bite us in the rear.
A nice bit of background on what makes eBPF such an effective technology for observability use cases.
We see a lot of OpenTelemetry articles around here, but not many about the portability that OTel offers. Nice to see how easy it can be to swap backends when your company makes an unexpected vendor pivot.
I don’t know many folks that weren’t impacted by this massive outage in one way or another. Most of the technical aspects of this incident are already public, but Atlassian leadership has released this extended PIR for a broader, more official review of the incident. Grab a drink and get comfortable… this is a long one.
How (and why) Razorpay engineers switched from Jaeger to Hypertrace for their distributed tracing needs.
A refreshing look at monitoring tooling from the perspective of an application engineer.
I genuinely feel more calm and serene after reading this story. A level-headed approach to how we think about the risk associated with incidents.
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor