Had a great time last week at Monitorama watching all the talks and seeing so many familiar faces. Feels serendipitous to come back and discover so many stories this week about production incidents, learning from our mistakes, and more. Enjoy! 🌞🍹📈

This issue is sponsored by:

Can you rely on your deployments?

In a recent Armory and Gartner report, 35% of respondents named reliability and consistency as their top pain point with app deployment. If you need help with consistent, reliable deployments, try Armory Continuous Deployment-as-a-Service. Check out more in the reports here.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

How Observability Changed My (Developer) Life

A genuine look at observability and its impact on our work from the perspective of a web developer.

Alerting: The Do’s and Don’ts for Effective Observability

This post is sincerely interesting; it starts off almost as a chapter from a novel before pivoting hard into thoughtful considerations for crafting effective alerts.

Developing a data driven tool to estimate the cost of incidents

Most companies I’ve seen struggle with quantifying the impact of an outage. Props to this HelloFresh engineer for sharing how they model incidents and derive actionable insights.

From Chaos to Recovery: How We Restored Our AWS Microservice After Accidental Deletion at Dolap

We’ve all been there… that moment of realization that you just did something very, very wrong and there’s no way to take it back (in my case, an errant rm -rf / at an OpenBSD hackathon). Still, this is how we learn from our mistakes and build more resiliency into our systems.

Why are Prometheus queries hard?

Maybe I’m biased because I used other time series query languages (Graphite, Librato, etc.) for many years before Prometheus came along, but I agree… PromQL can be a hassle to master. This post explains why it can feel that way and introduces a new open source project to help make it easier.

The Problem with Timeseries Data in Machine Learning Feature Systems

This might be a bit of a niche concern for our audience, but if you happen to be applying machine learning to your time series data, you’ll probably appreciate reading how Etsy stumbled across some potential issues.

There Are No Repeat Incidents

Preach, we should always strive to learn from (and avoid recurrences of) our mistakes.

How to run faster Loki metric queries with more accurate results

Some handy tips (and explanations for why they matter) for improving your Loki queries.

Activating Automatical Performance Analysis – Continuous Profiling

A first look at SkyWalking’s new continuous profiling capabilities.



Autometrics uses instrumented function names to generate Prometheus queries so you don’t need to hand-write complicated PromQL.

Job Opportunities

Software Engineer, Site Reliability at Redpanda Data (NA Remote)

Senior Staff Site Reliability Engineer at SentinelOne (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor