Issue 282
So much great stuff this week. I'm happy to see fresh posts from engineering teams sharing their experiences and challenges. Enjoy! 🍩☕🍂
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
Tales of Performance Engineering
Insightful read about Mercado Libre’s Performance Engineering team, how they collaborate with Observability and SRE teams, and an anecdote from one of their recent wins.
A summary of some of the talks from last month’s eBPF Summit. All of the streamed talks are also available for binging here.
OpenTelemetry Tracing in 200 lines of code
Really enjoyed this article and how the author broke down tracing into more approachable, digestible bits that the reader’s likely already comfortable with. Good one to share with your developer friends who might be tracing-averse.
The 4 Evolutions of Your Observability Journey
How to reason about where you might fall in your own observability journey, and what sorts of questions you’re probably trying to answer even if you aren’t explicitly aware of it.
How to use Prometheus to efficiently detect anomalies at scale
Maybe I’m just crotchety, but I’m happy to see more open source contributions for anomaly detection that don’t rely on LLMs or outsourcing private data. Gives me old Etsy and Graphite vibes, and I hope to see this framework continue to evolve.
Syncing PagerDuty Schedules to Slack Groups
I appreciate hearing how other folks solve these sorts of bespoke friction points. It’s unfortunate that much on-call management software is still pretty rough for these kinds of workflows.
Achieving Optimal Service Reliability: Insights Into Service Level Objectives
A good primer on service levels, error budgets, and burn rates. I would’ve liked to hear more about getting buy-in from the teams where SLAs originate (e.g. legal, sales, support) because, in my experience, this is where SLOs generally hit a brick wall in terms of usefulness.
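If the error budget and burn rate arithmetic is new to you, here’s a minimal sketch of the usual definitions. It is not taken from the linked article; the 99.9% target, the 30-day window, and the function names are all illustrative assumptions.

```python
# Illustrative sketch only: the SLO target, window length, and function
# names are assumptions, not taken from the linked article.

def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / error_budget(slo_target)


def hours_until_exhausted(rate: float, window_hours: float) -> float:
    """Hours until the whole budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


if __name__ == "__main__":
    # 50 failures out of 10,000 requests against a 99.9% target.
    rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget gone in about {hours_until_exhausted(rate, 30 * 24):.0f} hours")
```

A burn rate above 1.0 means the budget runs out before the window ends, which is typically the signal burn-rate alerts key off of.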
Patched versions of Grafana Alloy and Grafana Agent have been released to address a high-severity CVE. Note that users are encouraged to reinstall both applications, as the upgrade process will not make the necessary corrections.
Balancing Speed & Innovation with Reliability: Building a Blameless Incident Culture in Startups
Some useful tips for building a blameless incident response culture. This won’t happen overnight, but it’s a solid outline for any startup looking to improve their incident processes and posture.
Tools
“a highly configurable web SDK for real user monitoring”
grafana/promql-anomaly-detection
“A framework for anomaly detection using Prometheus and PromQL”
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor