It’s time for our quarterly “best of” issue! We have some fantastic articles here covering the most popular topics and themes from the past few months. Enjoy!

This issue is sponsored by:

Chronosphere

Are you among the 99% of companies that are missing their MTTR targets? Technical teams are finding it not only increasingly challenging, but downright impossible, to remediate issues quickly. Discover the top three reasons for this extraordinary gap and what you can do to remediate issues faster.



Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Claims Datadog asked developer to kill open source data tool

We don’t see a lot of drama when searching for monitoring content, so you’ll excuse me if I was taken aback by this report of Datadog “killing” a contribution to the OpenTelemetry project: a tool that would make it possible to export data from their APM service. Woof.

20 tips for Prometheus Monitoring

Each of these tips is succinct, yet leaves you wanting a bit more. This post is the tapas of monitoring articles, and I’m here for it.

etcd: getting 30% more write/s

I love a good performance debugging story, and this one from the Zendesk engineering team delivers. Almost makes me miss fixing slow Graphite clusters… almost.

Monitoring 101 : A dashboard to rule them all

You’ve just deployed a new in-house service to production and the SRE team needs some visibility to understand how it’s performing. Where to start? This five-minute guide should get your first dashboard in their hands… now go write that runbook! 😅

Runbooks

Everyone knows the importance of runbooks, but so few teams invest in and care for them properly. Here are some tips to get you started.

USE vs RED vs The Four Golden Signals

A comparison of the “competing” monitoring principles that the SRE wars were fought over. 😈

Can We Stop With Those Horrible “System Overview” Dashboards Already?

Another examination of dashboard design, this time with an emphasis on the telemetry and signals used to inform our dashboards and the responders who rely on them.

9 Logging best practices

A collection of best practices and precautions for your logging setup. Many of these fall into the “seems obvious until you catch yourself doing the same thing” category.

18 Kubernetes Metrics to Monitor for Optimal Cluster Performance

I’ve seen my share of “k8s metrics to monitor” articles, but this might be the most thorough and concise list yet. You should read this if you interact with Kubernetes at all.

OpenTelemetry and the future of monitoring and observability

Probably one of the more honest appraisals of OpenTelemetry’s strengths and areas for improvement.

6 Best Practices for Effective Monitoring Alerts

I love this topic because alerts have the potential to directly impact our lives in a manner proportionate to the thought and consideration that went into their design. Practice effective alerting and it will reward your efforts.

Unreadable Metrics: Why You Can’t Find Anything in Your Monitoring Dashboards

I “grew up” in this industry cutting my teeth on dashboard design and usability. Tools like Grafana make this a lot easier than it used to be, back when you crafted your own charts and pages using D3.js. Still, it can be almost too easy to vomit a bunch of graphs on a monitor and call it a day. This article does a good job calling out the design considerations that will turn the dashboard into a truly useful resource for your team.

Percentiles don’t work: Analyzing the distribution of response times for web services

I was pleasantly surprised to discover this new post from Adrian Cockcroft with research into response time distributions. A fair bit of the math is above my head, but the implications for analyzing logarithmic time series data are exciting.

What are Structured Logs and Why do They Improve Performance?

If you’re reading this newsletter, there’s a good chance you already recognize the benefits of structured logs over traditional log formats. For everyone else, this post sums up the benefits and makes a strong case for switching.

OpenTelemetry — Mastering the basic main concepts

OpenTelemetry is a huge step forward in terms of standardizing the instrumentation and collection of observability data. But it can also feel like chewing an elephant to get it adopted and used effectively. This post attempts to cut through the noise and simplify the concepts of OpenTelemetry to help you get started on your journey.

Network Insights in a Distributed Environment

If you’re a NetFlow geek (or just really enjoy network monitoring tools), you’re going to love this article. I haven’t been this excited about a new tool in a very long time.

Prometheus Alertmanager best practices

Alertmanager is one of those “small, sharp tools” that’s fairly intuitive and easy to work with. Fortunately, it also has the right set of features to allow customization as your use cases become more sophisticated. This post does a good job providing some examples of these.

Logging Best Practices: Proven Techniques for Services

Beyond just using structured logging (which we covered above), this post covers some additional best practices for anyone working with logs.

Alerting and how 50 lines of code changed how we do it

Really appreciate when an engineer works through a complex problem and shares their solution publicly. I learned a lot more about ElastAlert (and a little Scala) than I expected, tbqh.

Grafana Mimir — our journey towards infinite wisdom with 5m active time series

A fabulously detailed writeup of loveholidays’ migration from Thanos to Mimir (and other considerations along the way). Great story!

Elasticsearch — solution to searching

This sort of reads like someone’s notes as they master Elasticsearch internals over the course of a year, but I honestly couldn’t put it down. This post is rich in useful details for anyone who admins Elasticsearch. Heck, most of this information is helpful even for users who just want to understand how its search internals work.

eBPF and its capabilities

We hear a lot about eBPF, but most of the articles I’ve seen are either very high-level or deeply technical. This author does a great job providing background context and motivation before expanding scope and eventually diving into the weeds with code examples.

Events

Monitorama 2023 PDX

Monitorama has announced their full agenda for this year’s event. Looks like an awesome collection of topics and speakers. Hope to see you there!

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor