SPECIAL EDITION: Q1 2024 Best of

It’s time for our quarterly “best of” issue! We have some fantastic articles here covering the most popular topics and themes from the past few months. Enjoy! 🌞📝🌈

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

eBPF Documentary

While the eBPF video documentary was released late last year, Brendan Gregg took the opportunity to add a little extra context to the story. Always great to see him post anything on his blog.

A checklist to choose a monitoring system

Most of us probably come to this decision with numerous biases and preconceptions. This is a pretty simple checklist but I bet I’ve overlooked at least a couple of these items in the past.

Monitorama PDX 2024 - Agenda is Live!

Monitorama organizers released the speaker list and agenda for this year’s PDX 2024 event. Exciting to see so many unique topics, I can feel the FOMO rising already. 😅

Monitoring Weekly readers can save $100 on General Admission tickets with the MWEEKLY2024 discount code. Hope to see you there!

Linux Crisis Tools

Another amazing post from Brendan Gregg, this time reviewing his go-to list of “crisis tools” for troubleshooting under pressure. And the imaginary scenario he describes hits a little close to home. 😜

Keep your dashboard clean: Acknowledgement is not a solution!

Tips on crafting sustainable alerts and ways to avoid alert fatigue. Sage advice, worth a read.

Terraform Strategies for Grafana

How one company manages their global Grafana footprint, keeping all of their dashboards and alerts in sync.

Alerts Are Fundamentally Messy

For those of us with years of experience crafting, tweaking, and responding to [bad] alerts, this author says the quiet parts out loud. I still argue that good alerting isn’t as difficult as some people would have you believe, but there are good lessons to be had regardless.

prodzilla/prodzilla

“Prodzilla is a modern synthetic monitoring tool built in Rust. It’s focused on surfacing whether existing behaviour in production is as expected in a human-readable format, so that stakeholders, or even customers, can contribute to system verification.”

Observability n.0?

A unique take on the state of commercial Observability tooling, from the perspective of an investor evaluating this space within the greater tech industry.

API load testing: A beginner’s guide

Grafana Labs recently published a series of guides for load testing with k6. This post in particular jumped out at me because API load testing is such an overlooked and underappreciated practice.

Go Microservices: Monitoring, Logging, Debugging, Tracing, and Profiling

A primer on Observability from the perspective of a Go programmer. Even if your jam isn’t Golang, the examples and context are approachable for anyone who codes.

OpenTelemetry Collector Anti-Patterns

Came here to discover how I’m using the OTel collector wrong, stayed for the pet rats. 🐀😹

What you need to know before creating your first OpenTelemetry pipeline for tracing

I love this story documenting the journey of considerations, planning, and lessons learned from adopting something as invasive as distributed tracing. Excellent post.

Observability trends and predictions for 2024

Probably some bias here, but still an interesting bit of prognostication from different folks at Grafana Labs.

How to use Prometheus for web application monitoring

A solid introduction to Prometheus with a more focused look at using it for synthetic monitoring of a website.

How I’m Getting Free Synthetic Monitoring

If you’re looking for a free Pingdom alternative, this new open source synthetic monitoring tool might be a good fit. The example uses a commercial hosting service for Rust web apps, but you could presumably host this anywhere.

Rolling out Distributed Tracing

A post on distributed tracing that covers possibly the hardest parts… planning, onboarding, and adoption. Anyone who’s ever attempted to help out their engineers by introducing something new (no matter how helpful) knows that it’s never a straightforward path. Gotta do your homework first.

Prometheus Guide: Metrics, PromQL & Dashboards

This post is an excellent overview of Prometheus in its own right, but it also goes satisfyingly deep on topics that are often glossed over: the various metric types, how to start building effective queries with labels, and common mistakes to avoid.

Incident Commander Training Strategies: What The Books Don’t Tell You

A collection of lessons learned from fellow Incident Commanders. Some of these are painful to read but I’m also nodding violently and screaming silently into the void.

Best Practices to Prevent Alert Fatigue

Unhealthy alerting is one of my pet peeves, so I’m grateful to see Datadog tackle this important topic. Yes, this post is heavy on Datadog specifics, but there’s plenty to take away and apply to other systems.

Prerequisites for building an efficient observability system

An overview of the primary organizational (people, policy, etc.) and technical (tools) considerations needed for an effective observability culture (my words, not theirs). Share with that executive who’s been pushing back on your asks for {more custom metrics, another observability engineer, etc.}.

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor