It’s time for another “best of” issue! This collection looks back through our summer months at our most popular articles. Hope you enjoy it!
This issue is sponsored by:
Alerting is evolving. Signals is coming soon.
This winter from incident management platform FireHydrant: alerting and incident response in one ring-to-retro tool for the first time. Sign up for the early access waitlist and be the first to experience the power of alerting + incident response in one platform — at last.
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
This post is genuinely interesting; it starts off almost like a chapter from a novel before pivoting hard into thoughtful considerations for crafting effective alerts.
Maybe I’m biased because I used other time series query languages (Graphite, Librato, etc.) for many years before Prometheus came along, but I agree… PromQL can be a hassle to master. This post explains why it can feel that way and introduces a new open source project to help make it easier.
I genuinely can’t tell if this is fan fiction, developer advocacy, or an SRE biopic. Either way, it’s an interesting read.
A genuine look at observability and its impact on our work from the perspective of a web developer.
An overview of Thanos, its components, and how it complements Prometheus when horizontal scaling becomes a necessity.
A reflection on the early days of monitoring and whether anomaly detection has really gotten us anywhere (my words, not theirs).
This might be a bit of a niche concern for our audience, but if you happen to be applying machine learning to your time series data, you’ll probably appreciate reading how Etsy stumbled across some potential issues.
Inhibit rules are an important aspect of alerting but can have unexpected behavior if you don’t fully understand how to configure them properly. If you’re alerting with Prometheus and Alertmanager, you should definitely read this post.
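To make the idea concrete, here’s a minimal sketch of what an inhibit rule looks like in an Alertmanager configuration (a hypothetical example, not taken from the linked post): it mutes warning-level alerts while a matching critical alert is firing, but only when the listed labels agree on both alerts.

```yaml
inhibit_rules:
  # Mute "warning" alerts when a "critical" alert is already firing
  # for the same alertname and cluster.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "cluster"]
```

One of the surprises the post gets at: if every label in `equal` is missing from both the source and target alerts, the rule still applies, so sparse labeling can inhibit far more than you intended.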
I’ve really enjoyed George Robinson’s articles on Alertmanager use and [somewhat undocumented] behaviors. Here is the last one I found published on his blog, looking at some internal timers and their effects on Alertmanager behavior.
This appears intended to serve as an exhaustive overview of Kubernetes observability; it does a decent job touching on all of the related topics, but you’ll want to perform deeper research on any specific area. Frankly, if you just grabbed all of the section titles they would make a great checklist for your manager. 😜
We’ve all been there… that moment of realization that you just did something very, very wrong and there’s no way to take it back (in my case, an errant rm -rf / at an OpenBSD hackathon). Still, this is how we learn from our mistakes and build more resiliency into our systems.
Honestly, the title says it all. Although most of the best practices apply to logging in general, it’s still a good review for anyone using or maintaining logging infrastructure in Kubernetes.
This post speaks to the trade-offs we face with our technology choices. More specifically, it compares logging costs between CloudWatch, Datadog, and a “custom” solution using AWS components.
Some excellent tips on Loki performance gained from real-world use and frustration. Reminds me of my old Graphite Tips blog posts.
A detailed look at how DoorDash engineers have iterated on their eBPF agent and probes and where this has paid off in terms of debugging, observability, and for validating system migrations.
We take it for granted that engineers are born knowing how and what to log. This article reminded me that’s not the case, and does a good job covering the reasons we should log, along with examples for developers to apply to their own applications.
Loki seems to be gaining a lot of mindshare in the logging space. Here’s a quick post demonstrating one pattern for storing logs using its API.
Like many of you, I appreciate OpenTelemetry for its capabilities, but even more so I love it for its ability to protect us from vendor lock-in. This might be its one true killer feature.
Now that you’ve got your developers emitting logs on the regular, what next? There’s a lot to stay on top of when managing log aggregation at scale, and this post does a good job listing off a bunch of the considerations.
How one fintech company has leaned into Observability through a combination of bespoke in-house tooling and commercial vendors.
As an EM with a team of product developers, this one hits close to home. I expect most of the folks here are strong advocates of these principles, but it might be helpful to share this post with your peers.
A basic walkthrough for setting up Loki with Promtail and Grafana.
An entertaining (for readers, anyway) postmortem of the Mastodon service run by Vivaldi. It’s almost always a good learning experience to understand how other admins respond to a service outage.
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor