This was a surprisingly rich week for fun and intriguing articles, with a particular emphasis on Kubernetes, Prometheus, and sustainable practices. I hope you enjoy them as much as I did! 😸📈☕

This issue is sponsored by:

Are you looking to modernize Log Analytics while controlling the cost?

DataSet is the cloud-native event data platform that enables teams to achieve petabytes of effortless scalability and real-time performance at a fraction of the cost. Get complete visibility into your entire stack and experience the DataSet difference for free.

Articles & News on

Observability & Monitoring Community Slack

It’s been amazing to see the community continue to grow. We’d love to have you join us and share what you’ve been working on.

From The Community

Building a resilient SRE process

I love this tale of how Reputation (the company) approached their distributed service reliability concerns. Unlike a lot of SLO stories I’ve read, this one is very approachable and can serve as a model for other growing companies.

Monitor it! A short introduction to Prometheus

We’ve seen a bunch of “how to Prometheus” articles here, but I’m not sure I’ve seen one that is this concise yet so full of helpful pointers and references. Definitely give this one a look if you’re new to Prometheus or just want a quick refresher.

“Nobody could have known”: inclusive behaviors to counter short-termism

This isn’t the typical topic we cover here, but in light of the current state of the tech industry, I felt it would be prudent to share this with you all. This is an excellent article on sustainable work environments, and each of us should be able to take away some valuable lessons from it.

Migration from Thanos to Grafana Mimir

I can’t vouch for the why, but if you’re considering a move from Thanos to Mimir, this guide should help with the how.

Kubernetes IO Problem Investigation

This story of a disk performance issue on Kubernetes really hits close to home. It strains credulity that the underlying cAdvisor issue still hasn’t been fixed, at least seven years after the original bug report.

Observability at Kubecon

A recap of one vendor’s experience at KubeCon and the related observability events.

PrometheusDay NA 2022

On a related note, the CNCF have uploaded videos and provided a playlist of talks from the recent PrometheusDay NA event.

Microservices Observability: How, when, and what to measure?

A discussion on observability principles and benefits, framed in the context of Pipedrive’s own architecture and engineering needs.

Tales from the Kernel Parameter Side

To monitor a thing properly, we need to understand it first. How many times have you had to dig into some obscure performance issue, only to end up combing through kernel man pages (or worse, source code)? Save yourself some time and keep this post within arm’s reach.

Announcing Grafana Phlare

Grafana recently announced a couple of new OSS projects, but I found this one the more interesting of the two. I haven’t tried it yet, but it sort of reminds me of a modern take on Riemann. Hopefully this one doesn’t require me to learn Clojure (sorry, Kyle).

AWS ECS Task deployment failed alert using Amazon EventBridge

A quick but handy pattern for monitoring your ECS task deployments using Amazon EventBridge.


Monitorama PDX 2023 - June 26-28 (Portland, OR)

Monitorama is returning to Portland, OR next summer. The 2022 conference was a fantastic event and I look forward to seeing you all again in 2023.

Job Opportunities

Senior Site Reliability Engineer at Replicated (Remote)

Senior Platform Engineer at Articulate (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor