SPECIAL EDITION: Q4 2023 Best of

I hope you don’t mind another “gift”, because it’s time for our “Best of Q4” issue! I’ve gone back over the past few months and pulled out the most popular articles as chosen by you… Enjoy! 🎁🎄⛄

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Open source log monitoring: The concise guide to Grafana Loki

I really wish this primer on Grafana Loki existed four years ago when I was making a case for our next logging platform. At the time, the maintainers struggled (imho) to make a compelling case for why Loki was different (besides “Prometheus labels!”) or why I should care. It’s pretty clear the project has found its audience since then and I’m happy to see continued competition in the space.

Remove high cardinality in Prometheus

Speaking of cardinality, here are some tips and examples for tracking down and mitigating the sprawl of high-cardinality metrics in your Prometheus cluster. Love this.

There is an oom kill count in Linux!

Don’t worry George, you weren’t the only one who missed this. Nice catch.

Do you probe your network?

As someone who largely broke into the monitoring space through a love of TCP/IP networks and troubleshooting, this article sings to me. 💘

How to be on-call

I think we all deal with on-call so much we forget that not everyone is experienced with it (or knows how to do it well). An excellent article looking at the spectrum of responsibilities and the guardrails you should ensure are in place to optimize your team’s on-call rotation.

What’s next for observability?

A C-suite worthy analysis of the Observability landscape. If you’re trying to make a case with your leadership for dedicated resources, this could be a good article to share with them.

Migrating to OpenTelemetry

Another strong case for OpenTelemetry; not just for its technical capabilities and ability to avoid vendor lock-in, but also for reducing per-node vendor costs. Happy to see these stories are becoming more common every day.

A rabbit hole in monitoring

I really enjoyed reading this post and appreciate the author’s honesty, but imho this situation could’ve been easily avoided. I would also encourage them to be mindful of their next technology decision because it sounds like history might repeat itself.

Alert and Alert Manager in Prometheus

In my experience, Alertmanager tends to be one of those tools you learn through shadowing and tribal knowledge. This post cuts through a lot of that and goes beyond basic setup tips to demystify some of the less obvious aspects of using it in real scenarios.

How Grafanalib Helps You Manage Dashboards at Scale

How to manage the sprawl and maintain discoverability of Grafana dashboards and data is a common theme for most organizations. This post introduces a pattern with the Grafanalib library that sounds like a good option for many.

How to write a Postmortem

A framework for writing postmortems, with templates, some solid references, and sample incidents. Even if you already have a solid incident response program in place, you might pick up a few tips.

A Deep Dive Into CPU Requests and Limits in Kubernetes

An excellent technical deep dive into how Kubernetes handles CPU requests and limits, and how its design abstractions affect the way we use it.

Exploring the OpenTelemetry Collector

A creative example of leveraging some of the OTel Collector’s less obvious capabilities.

Managing Prometheus alerts in Kubernetes at scale using GitOps

Organizing and managing alerting rules can be a major hassle as your teams and architecture grow. This post demonstrates a pattern for decentralizing ownership of your alerts using GitOps.

Cinnamon Auto-Tuner: Adaptive Concurrency in the Wild

Frankly, I wanted to include this post from Uber Engineering mostly for the absolutely gorgeous visualizations. There’s also some pretty interesting talk about how their load shedding architecture works and demonstrations of its efficacy.

linkedin/oncall

“Oncall is a calendar tool designed for scheduling and managing on-call shifts”

Kubernetes: Liveness and Readiness Probes — Best practices

A handy primer on Kubernetes probes and how to make the best use of their respective states.

Insights from building a scalable distributed tracing platform for adidas

Lessons learned adopting distributed tracing (and its effects on the rest of their observability stack) inside a Platform team at Adidas Group.

Monitoring Multiple Kubernetes Clusters with Prometheus Federation

How to configure a relatively modest Prometheus federation for monitoring multiple Kubernetes clusters. However, I’d caution you to be prepared to start looking at other solutions as your complexity and scale grow.

Profiling: Flame Chart vs. Flame Graph

I honestly never gave much thought to the differences between flame charts and graphs before reading this article. A pretty handy guide for understanding when to use each visualization.

The Essential Guide to Linux System Monitoring with Top, Htop, and Vmstat

Kids these days and their fancy orchestrators and ephemeral runtimes. Learn you some basic Linux debugging commands and the world is your oyster.

See you next ~~week~~ year!

– Jason (@obfuscurity) Monitoring Weekly Editor