SPECIAL EDITION: Q1 2022 Best of

It’s time for another “best of” issue! We have some fantastic articles here covering the most popular topics and themes from the past few months. Enjoy!

This issue is sponsored by:

You might have heard discussions about the “three phases of observability.” But what do they really mean? Chronosphere is a SaaS cloud monitoring tool that helps teams rapidly navigate the three phases of observability. Learn more about Chronosphere and the three phases of observability here.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Who monitors the monitoring system?

A look at how HelloFresh implemented a Dead Man’s Switch on top of their Prometheus and Thanos stack.

Get Started with eBPF

Despite the title, this is a fairly deep dive into eBPF internals, writing your own eBPF programs, its potential for observability and much, much more.

5 Dashboard Design Best Practices

Most teams I’ve worked with will slap a bunch of metrics and graphs together without really understanding how to use the data effectively. This is a thoughtful look at how to design a dashboard with your users in mind.

OpenTelemetry democratises access to observability data & will enable massive innovation

We see a lot of articles about OpenTelemetry, but this might be the most concise and helpful one I’ve read yet. Bookmark this one and share it with your peers who need to learn about OpenTelemetry.

Transforming remote JSON into Prometheus metrics

Did you know you could consume data from remote JSON APIs into Prometheus? I can think of a number of different use cases for this. Nice example.

5 key observability trends for 2022

An overview of the most common trends in observability right now. Jibes with everything I’ve seen in this newsletter over the past year.

READS: Service Health Metrics

An insightful look at the bare minimum of metrics that service owners at Salesforce are expected to collect and monitor.

Why Don’t You Use …

I love this article from Brendan Gregg on why we do (or don’t) choose certain products. Frankly, it feels like the makings of a great checklist for any new potential vendor.

Design Patterns and Principles That Support Large Scale Systems

Scaling systems is the kind of challenge that most of us live for, but it takes experience to learn the pitfalls and patterns that save us time and money the next time around. It should be no surprise that so many of these considerations overlap with the observability domain.

SRE Principles Part 1

A look at some of the differences between SRE and DevOps principles, with a particular emphasis on service levels and monitoring signals.

The Delivery Hero Reliability Manifesto

Reliability means something different to every company, but it’s critical to have a shared understanding of what that means. This manifesto from Delivery Hero is a fantastic example of how to drive consensus and set expectations among your engineering teams.

5 Design Patterns for Building Observable Services

An excellent article from Salesforce engineering, covering their more popular design choices for building observable services.

Rapid Event Notification System at Netflix

Another fantastic article from Netflix engineers about building (and observing) systems at scale.

Saving on AWS Lambda Amazon CloudWatch Logs costs

A really clever way of buffering up debug logs in AWS Lambda to avoid blowing up your CloudWatch budget.

Exploring logging strategies with the Elastic Stack

Considerations for indexing your Elastic Stack logging services. There’s some good stuff in here, but it also reminds me why I happily paid the “Splunk tax” at my last gig.

Scaling Kafka Consumer for Billions of Events

PayPal engineers share their techniques for benchmarking Kafka and testing different failure scenarios before their services went to production.

How secure is your Grafana instance? What you need to know

A fairly exhaustive look at Grafana’s security features. Just note that most of its advanced capabilities are locked away in their commercial offerings.

Getting visibility into your container images

This article introduces a new (to me) tool that looks super helpful for creating an inventory of all the software versions running in a container. I know that you can sort of do this with Prometheus already, but a standalone tool for audits makes a lot of sense too.

How to monitor Starlink with Prometheus

I consider myself fortunate to live in a rural area with fiber internet. If you’re one of the lucky folks with access to Starlink, here’s a quick tutorial for monitoring your connection with Prometheus.

Making Alerts Actionable

If you’ve been around here for a while, you know I’m highly opinionated about writing alerts that are useful and empathetic towards the engineers who answer them. I love hearing from others who are just as passionate and thoughtful about ~~writing~~ iterating on alerts.

My Grafana Dashboard

I love these little weekend projects with dashboards and home automation (or in this case, home network monitoring).

Microservices Observability Design Patterns

So often we get hung up on tooling and its limitations without really thinking about the problems we’re trying to solve. I love this collection of design patterns for building observability into our (micro)services.

Unpacking Observability: The Paradigm Shift from APM to Observability

How to think about Observability if your organization is stuck in an APM (or monitoring-only) mindset.

Events

Monitorama PDX 2022 - June 27-29 (Portland, OR)

Monitorama is returning to Portland, OR this summer. It looks like a return to form for one of our favorite events (ok, we might be biased). Hope to see you there!

Job Opportunities

Site Reliability Engineer at Knock (US Remote)

Senior DevOps Engineer at Hive Collective (Remote)

DevOps Engineer at Amount Small Business (Remote)

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture–or your money back. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor