Issue 183

Some fantastic articles this week, with a strong emphasis on troubleshooting tips, tools, and techniques. Enjoy! 🔧☕😍

This issue is sponsored by:

Chronosphere

What are the potential pitfalls of Prometheus-based monitoring, and how can teams successfully address them? Chronosphere is teaming up with the co-founder of Prometheus to share the potential roadblocks and discuss important best practices to get the most from your cloud native monitoring. Register for the webinar now.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Effective Troubleshooting

An impressive collection of hard lessons learned and precautionary planning steps to take ahead of your next ~~outage~~ troubleshooting session.

Why Siloed Monitoring Hit the Wall

This is a well-reasoned and well-written article from someone I respect a lot. That said, I don’t agree that it has to be the case for companies with the foresight to avoid painting themselves into a corner. If this does sound like you, please share this post with your peers and reconsider how you’re building your systems. </rant>

Myth: service mesh can do distributed tracing of your application

Some hard truths about “out of the box” distributed tracing support among service meshes.

DIY multi-regional uptime monitoring with Fly.io and Uptime Kuma

If you’re looking for an ~~inexpensive~~ free alternative to uptime monitoring services, check out this OSS project running on Fly.io. This feels particularly timely given Heroku’s decision to discontinue free dynos.

Improving Meta’s SLO workflows with data annotations

A look at how Meta evolved their SLO management platform to support annotations for richer context. This is a really nice pattern for informing your on-call teammates, especially during shift transitions.

How adding Kubernetes label selectors caused an outage in Grafana Cloud Logs — and how we resolved it

An interesting look behind the curtain at how Grafana Labs diagnosed an incident in their hosted service. I’ll be honest, seeing the fix makes me glad I don’t have to support that configuration. 😬

Want to make Kubernetes clusters highly available? Adevinta's tech teams achieve high availability while operating eight clusters serving 54k requests per second over 20 tenants. Read more about their internal microservices platform and how they deliver a reliable service for tenants. Blog post here. (SPONSORED)

Auto-convert Grafana dashboards from influxQL to PromQL

I’m not sure how common this use case is, but if you’re one of those looking to migrate from InfluxDB to Prometheus, this looks like a huge win.

Is Docker eating up disk space?

A friendly reminder to check your Docker logging configuration before it wakes you up at night.

9 Useful Interactive CLI Tools for Linux

This list of CLI tools is heavy on debugging utilities, with many landing somewhere between “things I use every day” and “things I forgot even existed”.

Security release: New versions of Grafana and Grafana Image Renderer with a high severity security fix for CVE-2022-31176

A high severity fix affecting both Grafana and the dedicated Image Renderer. Please upgrade your installations ASAP.