SPECIAL EDITION: Q4 2022 Best of
I hope you don’t mind another “gift”, because it’s time for our “Best of Q4” issue! I’ve gone back over the past few months and pulled out the most popular articles as chosen by you… Enjoy! 🍂☕🍎
This issue is sponsored by:
What steps can you take to alleviate some on-call pain this holiday season? Well, take it from someone who has first-hand experience with on-call burnout during the holidays. Check out Chronosphere’s guide to making on-call holidays suck less. Read the blog.
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
A unique look at the evolution (pun intended) of one company’s platform infrastructure, including Observability and related concerns.
I was excited to learn that Julia Evans, everyone’s favorite tech author and illustrator, is working on a new zine about debugging. In the meantime, she accumulated a massive number of log analysis tips and ended up sharing them in one super compressed (pun intended) blog post for us all.
Observability with Prometheus and Grafana
We see a lot of guides for collecting metrics and monitoring them with Prometheus and Grafana, respectively, but this might be the most concise and comprehensive (is that possible?) article I’ve seen for anyone new to these tools and concepts. A++ would recommend to junior engineers and SRE managers.
Phantom Metrics: Why Your Monitoring Dashboard May Be Lying to You
I’ve been guilty of “monitoring all the things” in the past, but we still hear the same question repeated year after year… “what should I be monitoring?” This post revisits numerous important considerations for metrics design and collection.
Grafana as code: A complete guide to tools, tips, and tricks
A solid collection of approaches for managing your Grafana instance(s) in code. Super helpful comparison guide if you’re looking to level up your Grafana automation.
How to monitor kube-controller-manager
I’ve genuinely enjoyed these monitoring deep-dives on Kubernetes components from Sysdig. Although much of this information is available in the official docs, it’s nice to see it aggregated for a specific controller, along with the metrics relevant to its health.
Visual Patterns to Improve Monitoring Dashboards
It can be trendy to hate on dashboards these days. Overzealous vendors love to throw shade on open source approaches, but I’ve also seen the rush to get new services deployed under deadline lead to some truly garbage monitoring pages. Articles like this can help, highlighting design components and organization that can lead to more consistent and user-friendly experiences.
Database Performance Monitoring shouldn’t be hard
SolarWinds® Database Performance Monitor combines the simplicity of a SaaS-based database performance monitoring solution with big data analytics to visualize the health, performance, and availability of open-source and NoSQL databases. Download a fully-functional 30-day free trial and gain full visibility into traditional and open-source databases. (SPONSORED)
Understanding Duplicate Samples and Out-of-order Timestamp Errors in Prometheus
This is a fascinating read on Prometheus out-of-order metrics, particularly if you’re a crufty old TSDB admin and former Graphite maintainer who argues this should have been supported(*) years ago. All teasing aside, it really is a very interesting post with plenty of relevant technical details and helpful bits for Prometheus admins.
* I acknowledge that all TSDB authors make compromises relevant to their respective requirements, but after having seen countless “new hot metrics engines” come and go, it feels inevitable to me that all competing TSDBs eventually settle on roughly the same feature set with the primary differences boiling down to implementation details and a select collection of bugs deemed too difficult to fix. Don’t @ me.
Why and How eBay Pivoted to OpenTelemetry
Props to eBay engineering for sharing the story of their transition from Elastic Beats to OpenTelemetry for telemetry collection. Note that the author gave a corresponding talk at Open Observability Day back in October.
k8spacket — are your TLS connections inside the cluster still secure?
Monitoring for TLS versions and ciphers feels like a bit of an edge case, but I have no doubt there are security and compliance engineers in your org right now who would swoon over this.
Running the OpenTelemetry Demo App on HashiCorp Nomad
A fun side project for one dev advocate turned into an OpenTelemetry tutorial with a collection of cloud-native tools. There’s a good chance I’m still working through this as you’re reading these words. 😆
Skyfall: eBPF agent for infrastructure observability
A look behind the scenes at how LinkedIn has adopted eBPF, what’s working well, and where challenges remain. Great stuff.
Building a resilient SRE process
I love this tale of how Reputation (the company) approached their distributed service reliability concerns. Unlike a lot of SLO stories I’ve read, this is a very approachable one that can serve as a model to other growing companies.
Observability Mythbusters: Observability Anti-Patterns
A reminder that not everything with “Observability” in the name really is.
Reducing Logging Cost by Two Orders of Magnitude using CLP
A technical deep-dive on Uber’s log management challenges and how they improved their retention and compression.
A Practical Guide to Capturing Production Traffic With EBPF
An excellent guide for creating eBPF-based protocol tracers to inspect your HTTP traffic. If you’re new to eBPF, this feels like a great hands-on lab for getting started.
Monitor it! A short introduction to Prometheus
We’ve seen a bunch of “how to Prometheus” articles here, but I’m not sure I’ve seen one this concise that is also so full of helpful pointers and references. Definitely give this one a look if you’re new to Prometheus or just want a quick refresher.
Grafana recently announced a couple of new OSS projects, but I found this one the more interesting of the two. I haven’t tried it yet, but it sort of reminds me of a modern take on Riemann. Hopefully this one doesn’t require me to learn Clojure (sorry, Kyle).
The ROI of monitoring data usage
Although this post was intended for data teams, it includes some valuable considerations for anyone dealing with the never-ending growth of metrics, logs, and event storage.
Notes on Vendor Neutral Observability Instrumentation
A hot take on the state of vendor-neutral observability tools and instrumentation. Props to the author for having strong opinions that aren’t aligned with a specific employer or product.
See you next
– Jason (@obfuscurity) Monitoring Weekly Editor