A little bit of everything this week, with an emphasis on production outages, alerting, performance, and more. Enjoy! ☕🔥🔔
This issue is sponsored by:
Deployed-to-Prod Horror Stories
“We gave a status update to the board that we’d reached a milestone for core functionality and were progressing nicely. In the CEO’s mind he thought, ‘looks good, let’s launch it.’ So he told his son to launch and never bothered to tell anybody in the IT department.”
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
An entertaining (for readers, anyways) postmortem of the Mastodon service run by Vivaldi. It’s almost always a good learning experience to understand how other admins respond to a service outage.
How one fintech company has leaned into Observability through a combination of bespoke in-house tooling and commercial vendors.
For a company like DoorDash, with numerous discrete services, automating reliability in a decentralized fashion comes with its own set of challenges. This post details those challenges and explains how their integration of the Aperture project delivered more sophisticated reliability countermeasures for mitigating outages.
I’ve really enjoyed George Robinson’s articles on Alertmanager use and [somewhat undocumented] behaviors. Here is the last one I found published on his blog, looking at some internal timers and their effects on Alertmanager behavior.
Some genuinely helpful tips for managing and querying OpenTelemetry metrics stored in Prometheus (or Mimir).
A reminder that the pillars of observability are potentially a treasure map for bad actors. As with any tool or technology, keep your security posture in mind when designing or adopting new software.
Not specifically a monitoring story, but still a great read about Shopify’s efforts to improve GC performance for their monolith (with plenty of metrics and graphs).
A hands-on guide for instrumenting your Golang functions in AWS Lambda for collecting logs, traces, and spans in Datadog.
I almost wish I had a job where I could make use of an htop view like this one. Definitely takes me back a decade or two. 😸
“Aperture is an observability-driven load management platform designed for classifying, scheduling, and rate-limiting API traffic in cloud applications.”
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor