I hope you don’t mind another “gift”, because it’s time for our “Best of Q4” issue! I’ve gone back over the past few months and pulled out the most popular articles as chosen by you… Enjoy! 🍂☕🍎
This issue is sponsored by:
What steps can you take to alleviate some on-call pain this holiday season? Well, take it from someone who has first-hand experience with on-call burnout during the holidays. Check out Chronosphere’s guide to making on-call holidays suck less. Read the blog.
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
A unique look at the evolution (pun intended) of one company’s platform infrastructure, including Observability and related concerns.
I was excited to learn that Julia Evans, everyone’s favorite tech author and illustrator, is working on a new zine about debugging. In the meantime, she accumulated a massive number of log analysis tips and ended up sharing them in one super compressed (pun intended) blog post for us all.
We see a lot of guides for collecting metrics and monitoring them with Prometheus and Grafana, respectively, but this might be the most concise and comprehensive (is that possible?) article I’ve seen for anyone new to these tools and concepts. A++ would recommend to junior engineers and SRE managers.
I’ve been guilty of “monitoring all the things” in the past, but we still hear the same question repeated year after year… “what should I be monitoring?” This post revisits numerous important considerations for metrics design and collection.
A solid collection of approaches for managing your Grafana instance(s) in code. Super helpful comparison guide if you’re looking to level up your Grafana automation.
I’ve genuinely enjoyed these monitoring deep-dives on Kubernetes components from Sysdig. Although much of this information is available in the official docs, it’s nice to see it aggregated for a specific controller, along with the metrics relevant to its health.
It can be trendy to hate on dashboards these days. Overzealous vendors love to throw shade on open source approaches, but I’ve also seen the rush to get new services deployed under deadline lead to some truly garbage monitoring pages. Articles like this can help, highlighting design components and organization that can lead to more consistent and user-friendly experiences.
Database Performance Monitoring shouldn’t be hard
SolarWinds® Database Performance Monitor combines the simplicity of a SaaS-based database performance monitoring solution with big data analytics to visualize the health, performance, and availability of open-source and NoSQL databases. Download a fully-functional 30-day free trial and gain full visibility into traditional and open-source databases. (SPONSORED)
This is a fascinating read on Prometheus out-of-order metrics, particularly if you’re a crufty old TSDB admin and former Graphite maintainer who argues this should have been supported(*) years ago. All teasing aside, it really is a very interesting post with plenty of relevant technical details and helpful bits for Prometheus admins.
* I acknowledge that all TSDB authors make compromises relevant to their respective requirements, but after having seen countless “new hot metrics engines” come and go, it feels inevitable to me that all competing TSDBs eventually settle on roughly the same feature set with the primary differences boiling down to implementation details and a select collection of bugs deemed too difficult to fix. Don’t @ me.
Props to eBay engineering for sharing the story of their transition from Elastic Beats to OpenTelemetry for telemetry collection. Note that the author gave a corresponding talk at Open Observability Day back in October.
Monitoring for TLS versions and ciphers feels like a bit of an edge case, but I have no doubt there are security and compliance engineers in your org right now who would swoon over this.
A fun side project for one dev advocate turned into an OpenTelemetry tutorial with a collection of cloud-native tools. There’s a good chance I’m still working through this as you’re reading these words. 😆
A look behind the scenes at how LinkedIn has adopted eBPF, what’s working well, and where challenges remain. Great stuff.
I love this tale of how Reputation (the company) approached their distributed service reliability concerns. Unlike a lot of SLO stories I’ve read, this is a very approachable one that can serve as a model to other growing companies.
A reminder that not everything with “Observability” in the name really is.
A technical deep-dive on Uber’s log management challenges and how they improved their retention and compression.
An excellent guide for creating eBPF-based protocol tracers to inspect your HTTP traffic. If you’re new to eBPF, this feels like a great hands-on lab for getting started.
We’ve seen a bunch of “how to Prometheus” articles here, but I’m not sure I’ve seen one this concise that is also so full of helpful pointers and references. Definitely give this one a look if you’re new to Prometheus or just want a quick refresher.
Grafana recently announced a couple of new OSS projects, but I found this one the more interesting of the two. I haven’t tried it yet, but it sort of reminds me of a modern take on Riemann. Hopefully this one doesn’t require me to learn Clojure (sorry, Kyle).
Although this post was intended for data teams, it includes some valuable considerations for anyone dealing with the never-ending growth of metrics, logs, and event storage.
A hot take on the state of vendor-neutral observability tools and instrumentation. Props to the author for having strong opinions that aren’t aligned with a specific employer or product.
See you next
– Jason (@obfuscurity) Monitoring Weekly Editor