A spike of stories from production teams this week and some unexpected discussions on profiling and service migrations. Enjoy! 😻🚀🍿
This issue is sponsored by:
Skyrocketing observability costs mean teams are trying to get spend under control. However, a wrong decision can make troubleshooting harder and lead to lengthy incidents and outages. In this platform walkthrough, see how Chronosphere's Control Plane allows you to control long-term data growth while ensuring your engineers have the necessary information to do their jobs.
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
I love this story from a ZipRecruiter engineer, recalling their path from Icinga and Graphite monitoring to adopting Prometheus for their Kubernetes infrastructure. Teams considering a similar adventure would do well to heed these learnings.
Profiling is an underappreciated skill for many teams, especially in the era of cloud native ephemeral resources. This post covers numerous approaches, why you need them, and how profiling dovetails with the pillars of observability.
A reminder that instrumentation (in this case, OpenTelemetry) has a performance cost, especially if you don’t plan accordingly.
A recent sink service change in Google Cloud has made it possible to create alerts for logs aggregated across disparate projects.
One company’s take on running their observability stack (Grafana, Loki, Tempo, and Prometheus) on Kubernetes.
A look back on three seemingly random but high-profile outages. Personally, I think comparing the incident reports over the five-year span is possibly more interesting than the events themselves.
I love a good production migration story, especially when it involves stateful systems. This post focuses on their replay traffic testing; I’m looking forward to more tactical details (hopefully with graphs).
Grafana released their latest version in the 9.x series a few weeks ago. Good to see improvements across alerting, accessibility, and security (service accounts).
A solid review of Elasticsearch metrics and how to monitor your cluster’s index and search performance.
Just six weeks left until everyone’s favorite monitoring conference of the year. I’m super excited to see the new speakers and to hear what everyone has been up to since the conference returned to Portland in 2022. Hope to see you there!
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor