Issue 282

So much great stuff this week. I’m happy to see fresh posts from engineering teams sharing their experiences and challenges. Enjoy! 🍩☕🍂

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Tales of Performance Engineering

Insightful read about Mercado Libre’s Performance Engineering team, how they collaborate with Observability and SRE teams, and an anecdote from one of their recent wins.

Recap from eBPF Summit 2024

A summary of some of the talks from last month’s eBPF Summit. All of the streamed talks are also available for binging here.

OpenTelemetry Tracing in 200 lines of code

Really enjoyed this article and how the author broke down tracing into more approachable, digestible bits that the reader’s likely already comfortable with. Good one to share with your developer friends who might be tracing-averse.

The 4 Evolutions of Your Observability Journey

How to reason about where you might fall in your own observability journey, and what sorts of questions you’re probably trying to answer even if you aren’t explicitly aware of it.

How to use Prometheus to efficiently detect anomalies at scale

Maybe I’m just crotchety, but I’m happy to see more open source contributions for anomaly detection that don’t rely on LLMs or outsourcing private data. Gives me old Etsy and Graphite vibes, and I hope to see this framework continue to evolve.

Syncing PagerDuty Schedules to Slack Groups

I appreciate hearing how other folks solve these sorts of bespoke friction points. It’s unfortunate that so much on-call management software is still pretty rough for these sorts of workflows.

Achieving Optimal Service Reliability: Insights Into Service Level Objectives

A good primer on service levels, error budgets, and burn rates. I would’ve liked to hear more about getting buy-in from the teams where SLAs originate (e.g. legal, sales, support) because, in my experience, this is where SLOs generally hit a brick wall in terms of usefulness.

Grafana Alloy and Grafana Agent Flow security release: High severity fix for CVE-2024-8975 and CVE-2024-8996

Patched versions of Grafana Alloy and Grafana Agent have been released to address two high-severity CVEs. Note that users are encouraged to reinstall both applications, as the upgrade process will not make the necessary corrections.

Balancing Speed & Innovation with Reliability: Building a Blameless Incident Culture in Startups

Some useful tips for building a blameless incident response culture. This won’t happen overnight, but it’s a solid outline for any startup looking to improve their incident processes and posture.

Tools

grafana/faro-web-sdk

“a highly configurable web SDK for real user monitoring”

grafana/promql-anomaly-detection

“A framework for anomaly detection using Prometheus and PromQL”

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor