A little bit of everything this week, with an emphasis on production outages, alerting, performance, and more. Enjoy! ☕🔥🔔

This issue is sponsored by:

Armory

Deployed-to-Prod Horror Stories

“We gave a status update to the board that we’d reached a milestone for core functionality and were progressing nicely. In the CEO’s mind he thought, “looks good, let’s launch it”. So he told his son to launch and never bothered to tell anybody in the IT department.”



Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

What happened to Vivaldi Social?

An entertaining (for readers, anyways) postmortem of the Mastodon service run by Vivaldi. It’s almost always a good learning experience to understand how other admins respond to a service outage.

From Blind Spots to Clear Insights: The Evolution of Observability Tools and Practices at Greenlight

How one fintech company has leaned into Observability through a combination of bespoke in-house tooling and commercial vendors.

Failure Mitigation for Microservices: An Intro to Aperture

For a company like DoorDash, with numerous discrete services, automating reliability in a decentralized fashion comes with its own set of challenges. This post details those challenges and explains how their integration of the Aperture project delivered more sophisticated reliability countermeasures for mitigating outages.

Alertmanager’s Group wait, Group interval and Repeat interval explained

I’ve really enjoyed George Robinson’s articles on Alertmanager use and [somewhat undocumented] behaviors. Here is the last one I found published on his blog, looking at some internal timers and their effects on Alertmanager behavior.

Using OpenTelemetry and Prometheus: A practical guide to data collection

Some genuinely helpful tips for managing and querying OpenTelemetry metrics stored in Prometheus (or Mimir).

The Dark Side of Observability: Are We Inviting Cyber Attacks?

A reminder that the pillars of observability are potentially a treasure map for bad actors. As with any tool or technology, keep your security posture in mind when designing or adopting new software.

Adventures in Garbage Collection: Improving GC Performance in our Massive Monolith

Not specifically a monitoring story, but still a great read about Shopify’s efforts to improve GC performance for their monolith (with plenty of metrics and graphs).

Observing AWS Lambda with Golang and Datadog

A hands-on guide for instrumenting your Golang functions in AWS Lambda for collecting logs, traces, and spans in Datadog.

My htop Setup + Tips on making your own!

I almost wish I had a job where I could make use of an htop view like this one. Definitely takes me back a decade or two. 😸

Tools

https://github.com/fluxninja/aperture

Aperture is an observability-driven load management platform designed for classifying, scheduling, and rate-limiting API traffic in cloud applications.

See you next week!

– Jason (@obfuscurity), Monitoring Weekly Editor