Issue 216

A spike of stories from production teams this week and some unexpected discussions on profiling and service migrations. Enjoy! 😻🚀🍿

This issue is sponsored by:


Skyrocketing observability costs mean teams are trying to get spend under control. However, a wrong decision can make troubleshooting harder and lead to lengthy incidents and outages. In this platform walkthrough, see how Chronosphere's Control Plane allows you to control long-term data growth while ensuring your engineers have the necessary information to do their jobs.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Monitoring K8s? Here’s Why and How We Use Prometheus

I love this story from a ZipRecruiter engineer, recalling their path from Icinga and Graphite monitoring to adopting Prometheus for their Kubernetes infrastructure. Teams considering a similar adventure would do well to heed these learnings.

Accelerating Performance Issue Resolution through Code Profiling

Profiling is an underappreciated skill for many teams, especially in the era of cloud-native ephemeral resources. This post covers numerous approaches, why you need them, and how profiling dovetails with the pillars of observability.

Leveraging Web Workers for performance at HelloFresh

A reminder that instrumentation (in this case, OpenTelemetry) has a performance cost, especially if you don’t plan accordingly.

[Image: Futurama meme referencing the Heisenberg Principle]

Right way to alert on aggregated logs in Google Cloud

A recent sink service change in Google Cloud has made it possible to create alerts for logs aggregated across disparate projects.

Observability on Kubernetes — lessons learned

One company’s take on running their observability stack (Grafana, Loki, Tempo, and Prometheus) on Kubernetes.

Lessons from outages

A look back on three seemingly random but high-profile outages. Personally, I think comparing the incident reports over the five-year span is possibly more interesting than the events themselves.

Migrating Critical Traffic At Scale with No Downtime

I love a good production migration story, especially when it involves stateful systems. This post focuses on the team's replay traffic testing; I'm looking forward to more tactical details (hopefully with graphs).

Grafana 9.5 release

Grafana released their latest version in the 9.x series a few weeks ago. Good to see improvements across alerting, accessibility, and security (service accounts).

Top metrics for Elasticsearch monitoring with Prometheus

A solid review of Elasticsearch metrics and how to monitor your cluster’s index and search performance.

Events

Monitorama 2023 PDX

Just six weeks left until everyone’s favorite monitoring conference of the year. I’m super excited to see the new speakers and to hear what everyone has been up to since the conference returned to Portland in 2022. Hope to see you there!

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor