So much good stuff this week, where to begin? To be fair, I love network debugging stories, and we’ve got two really good ones. Oh, and an official word from Atlassian leadership on their recent multi-week outage. Plus a bunch of interesting tools including a new one for E2E testing with OTel. Enjoy! ☕😍📖

This issue is sponsored by:

Chronosphere

Ready to stop managing your own Prometheus?

The world of monitoring has fundamentally changed. Companies need a monitoring solution that is as scalable, reliable, and flexible as the cloud-native apps they need to monitor. Ready to stop managing your own Prometheus? Here’s your buyer’s guide.

Articles & News on

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Post-Incident Review on the Atlassian April 2022 outage

I don’t know many folks that weren’t impacted by this massive outage in one way or another. Most of the technical aspects of this incident are already public, but Atlassian leadership has released this extended PIR for a broader, more official review of the incident. Grab a drink and get comfortable… this is a long one.

It’s Always DNS . . . Except When It’s Not: A Deep Dive through gRPC, Kubernetes, and AWS networking

I love technical postmortems, especially when it comes to TCP/IP networking and DNS. I think most of us can empathize when an old benign setting wakes up much later to bite us in the rear.

First Responder’s Guide to Log Alerts

Having enough context around any incident is key to minimizing its MTTR. Go City engineers chose to glue aspects of their metrics and logging systems together to provide first responders with enough information to react effectively.

Monitor and troubleshoot Consul with Prometheus

If you’re already running Consul on Kubernetes, this guide will get you up to speed on exporting your metrics to Prometheus, which ones to be concerned about, and why. Great article.

Using Fault Injection Testing to Improve DoorDash Reliability

Chaos Engineering techniques provide a lot of value… if you can use them. DoorDash takes a slightly different approach, with automated and targeted resilience testing of their microservices.

Investigating TCP Self-Throttling Triggered Overload

Get ready to get lost in the TCP/IP networking weeds. If you thought the Datadog postmortem was interesting, buckle up and enjoy this one.

Robust Perception — Reliable Insights

If you’re a Prometheus user, it’s a virtual certainty that you’ve read one of Brian Brazil’s many blog posts on its idiosyncrasies. This post from a fan serves as a helpful collection of some of Brian’s most insightful articles.

Introducing Tracetest - Trace-based Testing with OpenTelemetry

Introducing a new tool that uses OpenTelemetry traces for end-to-end tests. Pretty cool stuff, looking forward to seeing this one evolve.

Sysdig

CrashLoopBackoff + Four Other K8s Troubleshooting Tips Everyone Should Know

We all love Kubernetes but it can be a hassle to fix when things go sideways. In this webinar, we will cover some of the common problems that plague every Kubernetes user and show you how to fix them. Join us at 10am PT on Thursday, April 28 to add these tips to your troubleshooting toolbox. Save your seat here. (SPONSORED)

HA Monitoring Setup with Thanos and Prometheus

An excellent two-part series on leveraging Thanos to upgrade your Prometheus deployment for high availability.

Auto-Generated Monitoring of Event Data with Annotations

A deep-dive into Udemy’s approach to automated event monitoring. Great to see them focused on resiliency and reliability but with an eye towards developer adoption and efficiency.

Migrating from Cortex to Grafana Mimir

If you’re already running Cortex and considering a move to Mimir, this guide from Grafana Labs is a great place to get started.

Docker Monitoring Stack with Grafana

A quick how-to for deploying a container-friendly monitoring stack with Docker.



Prototype implementation of Service-Level Fault Injection Testing in Python.


This service watches Elasticsearch logs for certain patterns defined as code in YAML by users.


Healthchecker is an application that generates a report on the health status of all deployed microservices in K8s environments.


End-to-end tests powered by your OpenTelemetry Traces.


SLOconf - Service Level Objective Conference 2022

SLOconf is back again as a virtual event, taking place May 9-12 online. Looks like a lot of familiar faces, looking forward to this one.

Monitorama PDX 2022 - June 27-29 (Portland, OR)

Monitorama is returning to Portland, OR this summer. It looks like a return to form for one of our favorite events (ok, we might be biased). Hope to see you there!

Job Opportunities

Infrastructure Engineer at (NA Remote)

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture–or your money back. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor