With the most recent Monitorama behind us, it feels like a great time for our quarterly “best of” issue! We have some fantastic articles here covering the most popular topics and themes from the past few months. Enjoy!

This issue is sponsored by:

In a single afternoon I set it up then ran 500 deployments, a dozen different ways…

When was the last time you ran one deployment without a hiccup, let alone 500? Learn how declarative deployment with a GitOps experience makes all the difference with Armory.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

The Pyramid of Alerting

Reading about how others think about alerts (and the failures of bad alerts) is something I’ll never get tired of. Unfortunately, I also think alerting is something we’ll never really master as a discipline. Still, it’s important to share our learnings and continue evolving our practice.

How we reduced our Prometheus infrastructure footprint by a third

How one team profiled their Prometheus metrics usage and found massive storage savings.

Getting Started with Mermaid for Diagramming and Charting

This is a bit of a one-off, but it reminds me that monitoring systems can be very complicated beasts. Easy diagramming is a win for everyone, admins and users alike.

Grafana vs. Prometheus Agent

A brief look at the history and differences between two popular open source Observability agents.

Demystifying OOM Killer in Kubernetes: Tracking Down Memory Issues

Kubernetes does a good job of managing resources, but it’s naive to think we won’t need to troubleshoot it like any other system from time to time. And knowing how to debug something is the first step to monitoring it effectively.

How to find unused Prometheus metrics using mimirtool

I’m a little sad that this tool needs to exist (I wrote something similar for Graphite a decade ago?) but it does address a valid need. Worth adding to your sack of monitoring tools.

A comparison of eBPF Observability vs Agents and Sidecars

I don’t believe that traditional instrumentation approaches are going anywhere, but eBPF is making a strong case for less intrusive metrics collection.

OpenTelemetry: The Star of KubeCon 2023

A recap from KubeCon with a particular emphasis on how OpenTelemetry continues to dominate observability instrumentation and what might be next.

Datadog: Metrics without Limit

Some lessons learned for reducing custom metrics usage with Datadog’s “Metrics without Limits” feature. Woof.

System Observability in a nutshell

A vendor-agnostic look at what Observability really means in a systems context. Useful for SRE folks and anyone else who cares for production services but might not actually be developing the services themselves.

Understanding Real-Time Application Monitoring

An overview of Expedia’s most important operational metrics across a variety of use cases and service types. If you’re a technical leader in your group, it might be a fun exercise to review these with your team.

Warden: Real Time Anomaly Detection at Pinterest

A fascinating look at Pinterest’s anomaly detection platform and the algorithm choices they’ve made in its design.

Distributed Tracing — Past, Present and Future

An excellent look at the state of distributed tracing, acknowledging the pains that we’ve experienced up to this point, with some thoughts on where the discipline might be heading.

Unveiling the Architectural Brilliance of Prometheus

As a fan of push-based metrics collection, I’m not sure I buy into the rhetoric here, but this is a very good look at Prometheus’ strengths and how to use its multitude of features.

OpenTelemetry Dynamic Integrations

OpenTelemetry already has a strong reputation for portability but this example really underscores just how easy it is to switch your final destination(s) using OTel collectors.

Monitoring is a Pain

There are some valid frustrations here, but I’ve been in this industry for a long time and literally every piece of software is going to cause heartache sooner or later. Still, I encourage everyone to read this and go make a positive impact where you can.

Observability at tb.lx: the key to our product’s success

An engineer at tb.lx talks about their adoption of observability practices and tooling, and how it’s leading to better outcomes not just for their internal teams and the business, but also for understanding customer issues.

Monitoring K8s? Here’s Why and How We Use Prometheus

I love this story from a ZipRecruiter engineer, recalling their path from Icinga and Graphite monitoring to adopting Prometheus for their Kubernetes infrastructure. Teams considering a similar adventure would do well to heed these learnings.

Observability on Kubernetes — lessons learned

One company’s take on running their observability stack (Grafana, Loki, Tempo, and Prometheus) on Kubernetes.

The Single Pain of Glass

A hot take on dashboards. Sorta. I don’t really get the argument that single pane dashboards are good or bad. Any dashboard is only as good as the effort you put into it to make it answer the questions that are relevant to your needs.

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor