SPECIAL EDITION: Q4 2021 Best of

I hope you don’t mind another “gift”, because it’s time for our “Best of Q4” issue! I’ve gone back over the past few months and pulled out the most popular articles as chosen by you… Enjoy! 😍

This issue is sponsored by:

Join the Elastic Community Conference. Save the date and submit.

ElasticCC is a free technical conference for the community, happening February 11–12. Submit your stories and learnings from ELK to Elastic observability and security until January 4; introduction, deep dive, legacy, or cutting edge are all welcome. And don't forget to join us in February!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Sidecar Pattern

We hear about the sidecar pattern all the time, but rarely with an explanation of what it is or why we should care. Here you go.

Observability: The 5-Year Retrospective

If you’ve been asleep for the past five years, this is a great way to get caught up on the past and present of Observability as an industry practice.

5 open source APM tools compared

Ever wonder why we don’t hear more about open source APM tooling (i.e., is it even a thing)? Wonder no more; this article has you covered.

Pushing Logs to Loki Without Using Promtail

If you can’t (or don’t want to) use Promtail, it’s now possible to push your logs directly to Loki. Lots of useful bits about Python and Loki logging internals. A great read.
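
For a rough idea of what "pushing directly" looks like, here is a minimal sketch (mine, not the article's code) that builds the JSON body Loki's HTTP push endpoint accepts and POSTs it using only the standard library. The URL, the label set, and the helper names are assumptions for illustration; adjust them for your own setup.

```python
import json
import time
import urllib.request

# Assumption: a Loki instance listening locally on its default port.
LOKI_PUSH_URL = "http://localhost:3100/loki/api/v1/push"


def build_loki_payload(labels, lines, now_ns=None):
    """Build a push body: one stream, one [timestamp, line] pair per log line.

    Loki expects the timestamp as a string of nanoseconds since the epoch.
    """
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {"streams": [{"stream": labels, "values": [[ts, line] for line in lines]}]}


def push_to_loki(payload, url=LOKI_PUSH_URL):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical labels and log line, purely for illustration.
    body = build_loki_payload({"job": "demo", "host": "laptop"}, ["hello from Python"])
    print(json.dumps(body, indent=2))
    # push_to_loki(body)  # uncomment once a Loki instance is reachable
```

Keeping the payload builder separate from the network call makes it easy to check the body before anything leaves your machine.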

How to Perform Incident Post-Mortems: Identify Root Cause with “Five Whys”

This might be the best article I’ve read on incident response and postmortems in a long while. Read this. Share it.

Golden Signals - Monitoring from first principles

An excellent summary of Google’s “Four Golden Signals” for SRE, including some examples that feel appropriate for this audience.

Prometheus Definitive Guide Part III - Prometheus Operator

Running Kubernetes and thinking about monitoring it with Prometheus, but not sure how to get started? This is the definitive guide you’ve been looking for.

Forget predefined alerts, now you can create your own alert when your API is down with Monika

We’ve covered the Monika project a couple of times this year. Looks like they’ve added some new alerting features and more flexible capabilities to the project.

Groot: eBay’s Event-graph-based Approach for Root Cause Analysis

Whether or not you believe the “single root cause” exists, eBay’s Groot event-graph-based approach to RCA demonstrates some extremely impressive numbers for their causality graphs. The whitepaper on Groot’s design (in partnership with University of Illinois Urbana-Champaign and Peking University) can be found here.

Who is the winner — Comparing Vector, Fluent Bit, Fluentd performance

A comparison of three popular log aggregators. I’d like to see more dimensions covered (e.g. transforms, metrics exporting, etc) but it’s still a useful look at how each performs at basic log collection duties.

Building a Basic Website Monitoring Bot with Python — Part 1

On the other hand, if you’re feeling adventurous and thinking about writing your own website-monitoring Monika knock-off, this article has you covered.

Forgot to renew the TLS certificates? Monika will remind you from now on

Honestly, this would have saved my bacon at a previous gig where we used TLS certificates for everything.

Best Practices for Writing Incident Postmortems

Incident responses can be a chaotic experience for everyone. This post from Datadog highlights some best practices for collecting your data in preparation for writing the postmortem.

Kubernetes HPA optimization based on any metric

How to autoscale your Kubernetes systems using any custom metric. Yes please and thank you.

Open Source for Better Observability

I’m a big believer that for any new technology (e.g. events) to become ubiquitous, there needs to be an open source alternative to provide competition and training opportunities. This article does a good job summarizing the most popular open source tools representing the pillars of observability.

SLI’s and SLO’s, how to wrap your head around it and actually use them to calculate availability

Most of us have a passing understanding of SLIs, SLOs, and how they feed into SLAs. Unfortunately, many of us still struggle with the question of how to leverage them for availability numbers and error budgets. This post aims to answer these for us.
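
If the arithmetic is the part that trips you up, here is a small sketch of the usual calculations, assuming a request-based SLI (good events divided by total events) and a rolling time window. The function names and sample numbers are my own, not taken from the post.

```python
def availability(good_events: int, total_events: int) -> float:
    """SLI as a ratio: the fraction of events that were good."""
    return good_events / total_events


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical month: one million requests, 500 of them failed.
    print(f"availability: {availability(999_500, 1_000_000):.4%}")
    print(f"budget at 99.9% over 30 days: {error_budget_minutes(0.999):.1f} minutes")
    print(f"budget remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

A 99.9% target over 30 days works out to roughly 43 minutes of allowed downtime, which is a handy number to keep in your head when someone asks for "just one more nine".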

OpenTelemetry Collector achieves Tracing stability milestone

Congratulations to the OpenTelemetry project on reaching their GA milestone for Tracing components! 🎉

What is KUTTL?

Good monitoring and testing go hand in hand. Here’s an interesting tool I first learned about this week for testing Kubernetes operators.

Maintenance windows are a mistake

I have a lot of conflicting feels on this one. Yes, I agree with the author’s take, but I also recognize that not everyone has the resources or freedom to prioritize High Availability for their entire architecture. Here’s a terrible thought… are you better off taking a service down for maintenance without notifying your customers?

Unpacking Observability: The Path to OpenTelemetry

We’ve seen countless articles explaining what OpenTelemetry is, where and how it can help us, etc. This is one of the few articles I’ve read that actually walks us through the considerations leading up to adoption, which questions to ask yourselves, and how to plan the rollout.

Kubernetes Logging in Production

A thorough look at the logging patterns for Kubernetes clusters with a comprehensive look at the pros and cons for each approach.

Expanding the Observable Universe?

In a cloud-native world, it shouldn’t surprise me that we don’t see more network monitoring articles. I’m probably one of the few who gets excited to see articles about SNMP, but this one’s a doozy.

Take your alerts under control

I don’t think any single article can address the variety of cultural and systemic organizational issues that can lead to Really Bad Alerting Practices™. Nevertheless, this one does a solid job covering many of the aspects within our direct control and influence.

Observability Tips and Tricks For Using Grafana and Prometheus

Some excellent tips on how to leverage Prometheus labels more effectively in Grafana. Bonus points to the author for demonstrating the Prometheus labels API.

Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor