Issue 191
This was a surprisingly rich week for fun and intriguing articles, with a particular emphasis on Kubernetes, Prometheus, and sustainable practices. I hope you enjoy them as much as I did!
This issue is sponsored by:
Are you looking to modernize log analytics while controlling costs?
DataSet is the cloud-native event data platform that lets teams scale effortlessly to petabytes with real-time performance, at a fraction of the cost. Get complete visibility into your entire stack and experience the DataSet difference for free.
Articles & News on monitoring.love
Observability & Monitoring Community Slack
It's been amazing to see the community continue to grow. We'd love to have you join us and share what you've been working on.
From The Community
Building a resilient SRE process
I love this tale of how Reputation (the company) approached their distributed service reliability concerns. Unlike a lot of SLO stories I've read, this is a very approachable one that can serve as a model for other growing companies.
Monitor it! A short introduction to Prometheus
We've seen a bunch of "how to Prometheus" articles here, but I'm not sure I've seen one this concise that's also so full of helpful pointers and references. Definitely give this one a look if you're new to Prometheus or just want a quick refresher.
"Nobody could have known": inclusive behaviors to counter short-termism
This isn't the typical topic we cover here, but in light of the current state of the tech industry, I felt it would be prudent to share it with you all. This is an excellent article on sustainable work environments, and each of us should be able to take away some valuable lessons from it.
Migration from Thanos to Grafana Mimir
I can't vouch for the why, but if you're considering a move from Thanos to Mimir, this guide should help with the how.
Kubernetes IO Problem Investigation
This story of a disk performance issue on Kubernetes really hits close to home. It strains credulity that the underlying cAdvisor issue still hasn't been fixed, at least seven years after the original bug report.
A recap of one vendor's experience at KubeCon and the related observability events.
On a related note, the CNCF has uploaded videos and a playlist of talks from the recent PrometheusDay NA event.
Microservices Observability: How, when, and what to measure?
A discussion of observability principles and benefits, framed in the context of Pipedrive's own architecture and engineering needs.
Tales from the Kernel Parameter Side
To monitor a thing properly, we need to understand it first. How many times have you had to dig into some obscure performance issue, only to end up combing through kernel man pages (or worse, source code)? Save yourself some time and keep this post within arm's reach.
Grafana recently announced a couple of new OSS projects, and I found this one the more interesting of the two. I haven't tried it yet, but it reminds me of a modern take on Riemann. Hopefully this one doesn't require me to learn Clojure (sorry, Kyle).
AWS ECS Task deployment failed alert using Amazon EventBridge
A quick but handy pattern for monitoring your ECS task deployments using Amazon EventBridge.
Events
Monitorama PDX 2023 - June 26-28 (Portland, OR)
Monitorama is returning to Portland, OR next summer. The 2022 conference was a fantastic event and I look forward to seeing you all again in 2023.
Job Opportunities
Senior Site Reliability Engineer at Replicated (Remote)
Senior Platform Engineer at Articulate (US Remote)
See you next week!
- Jason (@obfuscurity), Monitoring Weekly Editor