It’s time for another “best of” issue! This collection looks back through our summer months at our most popular articles. Hope you enjoy it!
This issue is sponsored by:
We’ve all heard about the 3 Pillars of Observability. But what about the 3 Phases of Observability? It’s simple: Know, Triage, Understand. Chronosphere is a SaaS cloud monitoring tool that helps teams rapidly navigate these three phases. See how teams zero in on the three phases to derive maximum value from their data. Learn more here.
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
It feels like a great day to revisit the golden signals of monitoring. I love how the author adds context around each of these signal types with some useful examples and perspective.
A detailed look at the motivation for and architecture of Workday’s observability platform, Pharos. It’s always interesting to see which companies are still building their own in-house monitoring services, and why.
We don’t talk nearly as much about the math of our metrics as we did in years past. Here’s a brief introduction to some of the more important metric types found in Prometheus, and when to use each.
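To make the distinction concrete, here’s a plain-Python sketch of the semantics behind Prometheus’s main metric types — no client library required, and the class shapes are illustrative rather than taken from the linked article:

```python
# Illustrative sketch of Prometheus metric type semantics.

class Counter:
    """Monotonically increasing; you only ever inc(). Query with rate()."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """A value that can go up or down, e.g. queue depth or temperature."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into cumulative buckets, e.g. request latency."""
    def __init__(self, buckets=(0.1, 0.5, 1.0)):
        self.buckets = {b: 0 for b in buckets}
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        for bound in self.buckets:
            if value <= bound:
                self.buckets[bound] += 1


requests = Counter()
requests.inc()               # one request served

latency = Histogram()
latency.observe(0.25)        # lands in the 0.5 and 1.0 buckets, not 0.1
```

The real takeaway is the querying side: counters only make sense through `rate()`, gauges can be read directly, and histograms let you estimate percentiles server-side.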
An introduction to DTrace and strace, along with some useful examples for using them effectively.
An excellent pattern for simplifying large Blackbox exporter configurations for Prometheus.
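For context, the general shape of a Blackbox exporter scrape job is a small set of reusable probe modules plus a relabeling step that routes targets through the exporter. The module and target names below are placeholders, not values from the linked article:

```yaml
# Prometheus scrape config for the Blackbox exporter (illustrative names).
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]          # reusable module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com
          - https://example.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the target as ?target=
      - source_labels: [__param_target]
        target_label: instance           # keep the probed URL as instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # exporter address (assumed)
```

Simplification patterns like the one in the article generally boil down to keeping the module list short and generating the target lists rather than hand-writing one job per endpoint.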
Some fascinating insights into eBay’s “Sherlock.io” monitoring system, including its use of machine learning for anomaly detection. Not gonna lie, some of their algorithms broke my brain.
If you can get past the triangle of SRE doom (my description, not theirs), this is a good deep-dive on SLOs & error budgets and how they impact SRE.
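If error budgets are new to you, the underlying arithmetic is refreshingly simple: your budget is just the unavailability your SLO permits over a window. A minimal sketch, assuming a 30-day window and illustrative targets (neither taken from the article):

```python
# Hypothetical helper: convert an SLO target into a monthly error budget.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} SLO -> {error_budget_minutes(target):.1f} min/month")
```

A 99.9% target works out to roughly 43 minutes of downtime a month — which is exactly why the budget framing is so effective at forcing the reliability-versus-velocity conversation.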
Because everyone knows that having six pillars is twice as good as only having three. I bet I could come up with another few and venture capitalists would be banging my door down.
P.S. All kidding aside, this is a great review of the “pillars of observability”, adding more context and broadening our definition of the data we use to understand our systems better.
This list of CLI tools is heavy on debugging utilities, with many landing somewhere between “things I use every day” and “things I forgot even existed”.
I’m starting to hear more frustration from the community around scaling Prometheus. This article makes the case for using VictoriaMetrics as an alternate backend, but are we trading one complex stack for another? Regardless, if you’d like to try it out, this looks like a solid guide for getting started.
An impressively detailed look at how Cloudflare ensures that their Prometheus queries and alerts are as reliable as possible.
An excellent primer on PostgreSQL logging and how to manage your logs effectively.
This article looks to be a good jumping-off point for folks curious about OpenTelemetry, with plenty of links to more in-depth resources elsewhere.
An impressive collection of hard lessons learned and precautionary planning steps to take ahead of your next outage troubleshooting session.
Really appreciate the lovely write-up from one of our conference speakers. Aside from the expected stress of managing an IRL event in a post-pandemic world, I also had a fantastic time.
I always enjoy seeing how different companies approach building their own monitoring stacks. This week we have an engineer from Ninja Van sharing the details of their architecture.
I’ve often preached to peers about the importance of monitoring and observability in the context of your product and users’ workflows. However, this is the first time I’ve heard of Critical User Journeys (CUJs); this strikes me as a fantastic way to frame this topic and to further the adoption of SLOs.
A primer on observability and incident management topics that somehow manages to be both technically relevant and an approachable introduction for business executives. Share this one with your CIO.
If you’re considering switching from Prometheus metrics to OpenTelemetry, this guide covers most of the differences and considerations for converting between them.
An unexpected flood of new metrics can ruin anyone’s day. Here’s one pattern for filtering out unwanted metrics in your Prometheus cluster.
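One common form of that filter is a `metric_relabel_configs` drop rule applied at scrape time, before samples hit storage. The job and metric name pattern below are placeholders, not details from the linked article:

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ["app:8080"]
    metric_relabel_configs:
      # Drop high-cardinality metrics before ingestion
      # (the name pattern here is illustrative).
      - source_labels: [__name__]
        regex: "app_debug_.*"
        action: drop
```

Note that `metric_relabel_configs` runs after the scrape (on the sample names and labels), unlike `relabel_configs`, which runs on targets before scraping.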
An approachable explanation of flame charts and percentiles. After you read this one, go immerse yourself in Brendan Gregg’s massive collection of flame graph resources.
Heavily influenced by traditional monitoring principles, this article covers numerous tools and services for monitoring your Linux systems.
A very thorough look at monitoring your MySQL containers in Kubernetes with Prometheus. I appreciate that the author went to the trouble of showing how to actually generate some sample load to help visualize the results.
A broad look at Observability, how to distinguish it from Monitoring, some practical examples, and a number of high-level best practices to consider before starting your own observability journey. Share this one with your CIO.
This is a well-reasoned, well-written article from someone I respect a lot. That said, I don’t agree that it has to be the case for companies with the foresight to avoid painting themselves into a corner. If this does sound like you, please share this post with your peers and reconsider how you’re building your systems. </rant>
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor