Issue 280

Feels like everyone is squeezing the last few drops out of summer (at least here in North America), and I expect to start seeing more event announcements and project updates soon. This week I found a number of new technical guides, with an emphasis on on-call, outages, and error rates. Enjoy! 🍂📶☕

This issue is sponsored by:


Backend says: “99.999%” Frontend says: “Your mobile app sucks.”

It's time to learn what your SLOs aren't telling you about mobile. Join Embrace for a session on how to create SLOs for your mobile apps that actually measure what matters — your end user experiences.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Burn Rate Is a Better Error Rate

Helpful comparison of burn and error rates from Datadog. Props to the author for simplifying the math.
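For anyone who wants the gist before clicking through, here is a minimal sketch of the relationship as it is commonly defined: burn rate is the observed error rate divided by the error rate your SLO allows. This is not taken from the article; the function names and sample numbers are my own assumptions, purely for illustration.

```python
def error_rate(bad_events: int, total_events: int) -> float:
    """Fraction of events in the window that failed."""
    if total_events == 0:
        return 0.0
    return bad_events / total_events


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget would be spent exactly over
    the SLO period; higher values mean it runs out sooner.
    """
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate(bad_events, total_events) / error_budget


# Hypothetical numbers: 50 failures out of 10,000 requests, 99.9% SLO.
print(error_rate(50, 10_000))
print(burn_rate(50, 10_000, 0.999))
```

With those made-up numbers, a 0.5% error rate sounds small, but against a 99.9% target it burns budget at roughly five times the sustainable pace, which is the kind of context a raw error rate leaves out.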

Stanza Outage Simulator

Saw this post in my LinkedIn feed and had to include it. I love this as a general tool for approximating the impact of a potential outage. You’ll almost certainly want to click on the “Instructions” button at the bottom for more details.

What makes a good on-call shift system for DevOps engineer?

A look at some of the primary considerations for choosing an on-call service provider and a quick comparison of four of the most popular options.

CI/CD Observability using OpenTelemetry

Solid introduction to observability for CI/CD systems, an overview of OpenTelemetry, and a guide for setting it up with Jenkins.

Building a Scalable Logging Service: Akka Lightbend vs. Kafka Confluent

Comparing two different projects as the basis for a scalable logging service. IMHO this is less of a head-to-head showdown and more of a “how to” for evaluating these two options, if that’s something you’re already planning.

How to Set Up a Free Web App Status Page and On-Call System: A Step-by-Step Guide

A fun guide for gluing together some free service plans to handle website monitoring and paging duties. Definitely skews hard towards the “DIY” end of the spectrum; this probably isn’t a great long-term solution, especially when you factor in turnover concerns.

Monitoring in Kubernetes: Best Practices

A decent collection of monitoring concerns and best practices for Kubernetes. 50/50 chance this was written by AI, but it still has some solid points. 😅

Implementing Observability with Prometheus, VictoriaMetrics, and Tilt

Setting up Prometheus metrics collection with a VictoriaMetrics storage backend, using Tilt to manage the underlying resources on the Kubernetes cluster.

Tools

Stanza Outage Simulator

“This tool simulates the impact of various outage scenarios on a system, allowing you to adjust parameters and observe the results.”

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor