It’s time for another “best of” issue! This collection looks back through our summer months at our most popular articles. Hope you enjoy it!

This issue is sponsored by:

We’ve all heard about the 3 Pillars of Observability. But what about the 3 Phases of Observability? It’s simple: Know, Triage, Understand. Chronosphere is a SaaS cloud monitoring tool that helps teams rapidly navigate these three phases. See how teams zero in on the three phases to derive maximum value from their data. Learn more here.



Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

SRE Bytes: The Four Golden Signals of Monitoring

It feels like a great day to revisit the golden signals of monitoring. I love how the author adds context around each of these signal types with some useful examples and perspective.

Pharos: The Observability Platform at Workday

A detailed look at the motivation for and architecture of Workday’s observability platform, Pharos. It’s always interesting to see which companies are still building their own in-house monitoring services, and why.

The Mathematics Behind Monitoring

We don’t talk nearly as much about the math of our metrics as we did in years past. Here’s a brief introduction to some of the more important metric types found in Prometheus, and when to use each.

External Debugging Tools 1: dtrace and strace

An introduction to dtrace and strace, along with some useful examples for using them effectively.

Single Prometheus job for dozens of Blackbox exporters

An excellent pattern for simplifying large Blackbox exporter configurations for Prometheus.

Sherlock.io: An Upgraded Machine Learning Monitoring System

Some fascinating insights into eBay’s “Sherlock.io” monitoring system, including its use of machine learning for anomaly detection. Not gonna lie, some of their algorithms broke my brain.

Effective SRE: SLO Engineering and Error Budget

If you can get past the triangle of SRE doom (my description, not theirs), this is a good deep-dive on SLOs & error budgets and how they impact SRE.

TEMPLE: Six Pillars of Observability

Because everyone knows that having six pillars is twice as good as only having three. I bet I could come up with another few and venture capitalists would be banging my door down.

P.S. All kidding aside, this is a great review of the “pillars of observability”, adding more context and broadening our definition of the data we use to understand our systems better.

9 Useful Interactive CLI Tools for Linux

This list of CLI tools is heavy on debugging utilities, with many landing somewhere between “things I use every day” and “things I forgot even existed”.

Multi-site monitoring with HA and dynamic scale using VictoriaMetrics

I’m starting to hear more frustration from the community around scaling Prometheus. This article makes the case for using VictoriaMetrics as an alternate backend, but are we trading one complex stack for another? Regardless, if you’d like to try it out, this looks like a solid guide for getting started.

Monitoring our monitoring: how we validate our Prometheus alert rules

An impressively detailed look at how Cloudflare ensures that their Prometheus queries and alerts are as reliable as possible.

PostgreSQL Logs Explained: Logging Configuration Tutorial

An excellent primer on PostgreSQL logs and how to manage them effectively.

A beginner’s guide to OpenTelemetry

This article looks to be a good jumping-off point for folks curious about OpenTelemetry, with plenty of links to more in-depth resources elsewhere.

Effective Troubleshooting

An impressive collection of hard lessons learned and precautionary planning steps to take ahead of your next outage troubleshooting session.

Monitorama 2022 — What an amazing experience!

Really appreciate the lovely write-up from one of our conference speakers. Aside from the expected stress of managing an IRL event in a post-pandemic world, I also had a fantastic time.

Ninja Van’s monitoring stack

I always enjoy seeing how different companies approach building their own monitoring stacks. This week we have an engineer from Ninja Van sharing the details of their architecture.

From Critical User Journey to SLO/SLIs

I’ve often preached to peers about the importance of monitoring and observability in the context of your product and users’ workflows. However, this is the first time I’ve heard of Critical User Journeys (CUJs); this strikes me as a fantastic way to frame this topic and to further the adoption of SLOs.

Observability offers promising benefits. Don’t dismiss it as a buzzword.

A primer on observability and incident management topics that somehow manages to be both technically relevant and an approachable introduction for business executives. Share this one with your CIO.

Prometheus vs. OpenTelemetry Metrics: A Complete Guide

If you’re considering switching from Prometheus metrics to OpenTelemetry, this guide covers most of the differences and considerations for converting between them.

Prevent metrics explosion in Prometheus

An unexpected flood of new metrics can ruin anyone’s day. Here’s one pattern for filtering out unwanted metrics in your Prometheus cluster.

How To Read Flame Charts and Percentiles

An approachable explanation of flame charts and percentiles. After you read this one, go immerse yourself in Brendan Gregg’s massive collection of flame graph resources.

Linux Server Monitoring Tools Overview

Heavily influenced by traditional monitoring principles, this article covers numerous tools and services for monitoring your Linux systems.

Monitoring MySQL using Prometheus, Grafana and mysqld_exporter in Kubernetes

A very thorough look at monitoring your MySQL containers in Kubernetes with Prometheus. I appreciate that the author went to the trouble of showing how to actually generate some sample load to help visualize the results.

Everything you need to know about Observability: A complete Guide

A broad look at Observability, how to distinguish it from Monitoring, some practical examples, and a number of high-level best practices to consider before starting your own observability journey. Share this one with your CIO.

Why Siloed Monitoring Hit the Wall

This is a well-reasoned and well-written article from someone I respect a lot. That said, I don’t agree that it has to be the case for companies with the foresight to avoid painting themselves into a corner. If this does sound like you, please share this post with your peers and reconsider how you’re building your systems. </rant>

See you next week!

– Jason (@obfuscurity), Monitoring Weekly Editor