It’s time for another “best of” issue! We have some fantastic articles here covering the most popular topics and themes from the past few months. Enjoy!

This issue is sponsored by:

Lumigo

The Plug-and-Debug Serverless Observability Platform

Trouble locating bugs in your serverless environment? Quit wasting precious development time and get an end-to-end map of your services in just four minutes with 1-click distributed tracing. Navigate your serverless chaos seamlessly—with Lumigo.



Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Observability Engineering - O’Reilly Book

O’Reilly’s new Observability Engineering book has been released, and Honeycomb has made the entire eBook available as a free download. Looking forward to reading this one soon.

A set of modern Grafana dashboards for Kubernetes

What do you do when the most popular dashboards for Kubernetes are too old to support newer Grafana features and panels? You write your own, of course. Props to the author for explaining their motivation for each new visualization.

What Is Log Aggregation: 101 Guide to Best Tools & Practices

A quick introduction to log aggregation concepts along with a fairly objective comparison of numerous commercial and open source alternatives.

Site Reliability Engineering: Setting up the right Monitoring System

I feel like this can be a useful overview of monitoring principles for internal customer teams, but if you’re planning a ground-up monitoring infrastructure yourself, it’s missing a lot of “hard lessons learned”-type considerations.

Announcing Grafana Mimir, the most scalable open source TSDB in the world

Pretty big announcement from Grafana. Interesting to read that they plan to support more than just Prometheus metrics.

Application logs: Your eyes in production

A look at the role of logs in supporting production workloads along with some helpful tips and best practices to make them even more effective.

How we avoided alarm fatigue syndrome by managing/reducing the alerting noise

How Doctolib audited their noisy alerting behavior and continues to iterate on it.

A Deep Dive Into the Four Types of Prometheus Metrics

Although the Prometheus project has some pretty good documentation, it runs a bit terse. Nice to see a more extended (and user-friendly) look at the different Prometheus metric types.
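If you want a quick, runnable refresher on the four types before diving in, here’s a minimal sketch using the official prometheus_client Python library (the instrument names and values are just illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: a monotonically increasing value (totals, errors).
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method"])

# Gauge: a value that can go up and down (in-flight requests, queue depth).
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")

# Histogram: observations sorted into buckets server-side (latency, sizes).
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

# Summary: count and sum of observations, aggregated client-side.
PAYLOAD = Summary("http_request_size_bytes", "Request payload size in bytes")

def handle_request():
    REQUESTS.labels(method="GET").inc()
    IN_FLIGHT.inc()
    with LATENCY.time():        # records elapsed time into the histogram
        PAYLOAD.observe(512)    # pretend we read a 512-byte payload
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)     # exposes metrics on http://localhost:8000/metrics
    handle_request()
```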

Introducing Husky, Datadog’s Third-Generation Event Store

I love when observability companies talk about their system designs. If I’m being honest, you’d think we’d have figured this all out by now (but it’s still a good read). 😜

Distributed Tracing: The Why, What, and How?

One of the better articles I’ve read on Distributed Tracing, with some helpful analogies and context to help newcomers develop a foundational understanding of this key observability principle.

Timescale

Observability powered by SQL

Promscale is a unified observability backend for Prometheus metrics and OpenTelemetry traces built on PostgreSQL and TimescaleDB. With full SQL support, it allows you to solve even the most complex issues in your distributed systems. Find out more about it here. (SPONSORED)



A Deep Dive Into OpenTelemetry Metrics

We talk a lot about the OpenTelemetry framework in terms of traces and spans, but it provides enormous value in the form of metrics as well. This post is an excellent guide to what makes OTel metrics unique, how to set them up, and when to use the various types.
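For a taste of what that looks like in code, here’s a minimal sketch using the OpenTelemetry Python API (the instrument names are made up, and the SDK/exporter wiring is left out):

```python
from opentelemetry import metrics

# Get a Meter from the globally configured MeterProvider.
# (Configuring the SDK, metric readers, and exporters is a separate step;
# without it these calls are no-ops, which is handy for a quick demo.)
meter = metrics.get_meter("checkout-service")

# Counter: a monotonic sum, e.g. orders processed.
orders = meter.create_counter(
    "orders_processed", unit="1", description="Orders processed")

# Histogram: a distribution of values, e.g. request latency.
latency = meter.create_histogram(
    "request_latency", unit="ms", description="Request latency")

# UpDownCounter: a non-monotonic sum, e.g. items currently queued.
queue_depth = meter.create_up_down_counter(
    "queue_depth", unit="1", description="Items waiting in the queue")

orders.add(1, {"payment.method": "card"})
latency.record(42.0, {"http.route": "/checkout"})
queue_depth.add(3)
queue_depth.add(-1)
```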

Notes on an Observability Team

This article speaks to me on a very personal level. I’ve built up Observability teams over the years; it’s not surprising that we share many of the same problems, but it’s always interesting to hear how we tackle (or prioritize) them differently.

New in Grafana 8.5

Everyone’s favorite open source dashboard is out with another new release. Love to see the new alert grouping features.

Enriching Prometheus metrics with exemplars for easier observation of a distributed system

A handy guide for adding exemplars to your Prometheus metrics (and why this matters).

Thoughts Over an Annoying Production Issue

Always interesting to read how other engineering teams work through really frustrating incidents.

10 years of major incidents

Speaking of hard lessons learned, I love this article on how to approach outages. So many of these examples resonate with me (painfully).

Have You Been Using Histogram Metrics Correctly?

An excellent look at Prometheus histograms and scenarios where you might be using them incorrectly. Please read this one if you care about your metrics.
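One pitfall worth calling out: the default buckets rarely match your latency distribution, and percentiles should be derived from bucket rates at query time rather than averaged per instance. Here’s a small sketch with the prometheus_client Python library, where the bucket edges are only an illustrative guess:

```python
from prometheus_client import Histogram

# The default buckets are coarse at the low end and top out around 10s;
# choose edges that actually bracket the latencies you expect to see.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

REQUEST_LATENCY.observe(0.042)

# Percentiles are then computed at query time from bucket rates, e.g. in PromQL:
#   histogram_quantile(0.95,
#     sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# rather than averaging pre-computed quantiles across instances.
```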

It’s Always DNS . . . Except When It’s Not: A Deep Dive through gRPC, Kubernetes, and AWS networking

I love technical postmortems, especially when it comes to TCP/IP networking and DNS. I think most of us can empathize when an old benign setting wakes up much later to bite us in the rear.

How is eBPF efficient for observability

A nice bit of background on what makes eBPF such an effective technology for observability use cases.

OpenTelemetry, the standardized observability framework for everyone

We see a lot of OpenTelemetry articles around here, but not many about the portability that OTel offers. Nice to see how easy it can be to swap backends when your company makes an unexpected vendor pivot.

Post-Incident Review on the Atlassian April 2022 outage

I don’t know many folks that weren’t impacted by this massive outage in one way or another. Most of the technical aspects of this incident are already public, but Atlassian leadership has released this extended PIR for a broader, more official review of the incident. Grab a drink and get comfortable… this is a long one.

Distributed Tracing with Hypertrace

How (and why) Razorpay engineers switched from Jaeger to Hypertrace for their distributed tracing needs.

On monitoring from a (slightly) different point of view

A refreshing look at monitoring tooling from the perspective of an application engineer.

Handling Incidents Mindfully 🧘🏽 — Part 1: Acceptance

I genuinely feel more calm and serene after reading this story. A level-headed approach to how we think about the risk associated with incidents.

Job Opportunities

Software Engineer, Cloud Foundations at Block (US Remote)

Engineering Manager, Observability at Block (US Remote)

Site Reliability Engineer at Fivetran (US Remote)

Site Reliability Engineer at GitHub (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor