Welcome back to another week of your favorite monitoring and observability newsletter. Hot and fresh, right off the search engine!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. This past week there were some fascinating discussions around tracing at scale, and how one company in particular is doing tail-based sampling. Join our Slack so you don’t miss out. 😃

From The Community

Building a Healthy On-Call Culture

A really insightful article about how SoundCloud developed healthy habits and meaningful on-call rotations across their engineering organization. I don’t think that everything in the story will make sense for all companies, but there are a number of useful takeaways regardless.

The Easiest Way to Debug Kubernetes Workloads

What happens when you’re troubleshooting Kubernetes and the usual tools fall short? Or when you need to dig into a distroless image lacking shell access? Enter kubectl debug and ephemeral containers.
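
For the curious, here is a rough sketch of what attaching an ephemeral container looks like, written as a small Python helper that only builds the command line. The pod name, container name, and busybox image are placeholders I made up, not anything taken from the article, and your cluster needs ephemeral container support for the resulting command to work.

```python
import shlex


def kubectl_debug_argv(pod, target_container, image="busybox", namespace=None):
    """Build a `kubectl debug` command that attaches an ephemeral container.

    --image picks the debugging image, and --target asks for the process
    namespace of an existing container (useful when that container has no shell).
    """
    argv = ["kubectl", "debug", "-it", pod,
            f"--image={image}", f"--target={target_container}"]
    if namespace:
        argv += ["--namespace", namespace]
    return argv


# Placeholder names; substitute your own pod and container.
command = shlex.join(kubectl_debug_argv("my-app-pod", "app"))
print(command)
```

Hand the list to subprocess.run if you want to execute it rather than print it.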

Hacking your way to Observability — Part 2 : Alerts

The second part of an ongoing series on observability principles and tooling, this post covers alerting with Prometheus and Alertmanager, including how to start sending these alerts to your Slack workspace.

Enhancement to Kafka MirrorMaker to reduce CPU/memory pressure

A look at what led Pinterest engineers to deconstruct Kafka MirrorMaker’s message processing workflow. By streamlining the handling of payloads in transit, they were able to avoid unnecessary decompression (and re-compression) to alleviate significant CPU and memory overhead. Their improvements are currently being considered as a Kafka Improvement Proposal.

Network Validation

A very interesting read that tackles the domain of network monitoring using observability and automation principles more commonly found among software engineering teams.

Quest to the OS: Java Native Memory

Although I’m not a fan of Java, I always love a good debugging story. Props to any engineering team that isn’t satisfied simply applying the “fix” and who insists on digging further to understand the underlying behavior.

How to build your monitoring dashboards?

A concise list of tips (and things to avoid) when building Service Level Agreement (SLA) dashboards.

Build Prometheus Exporter for DHT22/AM2302 Sensor

Home DIY projects are always a fun weekend adventure, and this one is no exception. What do you get when you combine a Raspberry Pi, an inexpensive temperature / humidity sensor, and Prometheus? Read on to find out.
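
If you want a feel for how little it takes, here is a minimal sketch of an exporter using only the Python standard library. It is not the code from the linked post: the metric names, the port, and the read_sensor() stub are my own assumptions, and a real build would call an actual DHT22/AM2302 driver there instead of returning fixed values.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_sensor():
    """Hypothetical stand-in for a real DHT22/AM2302 driver call."""
    return 21.5, 40.0  # (temperature in Celsius, relative humidity in percent)


def render_metrics(temperature_c, humidity_pct):
    """Render two gauges in the Prometheus text exposition format."""
    lines = [
        "# HELP dht22_temperature_celsius Temperature reported by the sensor.",
        "# TYPE dht22_temperature_celsius gauge",
        f"dht22_temperature_celsius {temperature_c}",
        "# HELP dht22_humidity_percent Relative humidity reported by the sensor.",
        "# TYPE dht22_humidity_percent gauge",
        f"dht22_humidity_percent {humidity_pct}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else gets a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(*read_sensor()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9101):
    """Block forever, serving /metrics on the given (arbitrarily chosen) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Call serve() on the Pi, then point a Prometheus scrape job at that port.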

Monitoring

Coverwallet determined they were slow to identify some incidents before they became larger problems. To improve their communication and broader awareness of system health, they began an initiative to understand the role of monitoring and how it can be used to improve communication flows, respond faster to issues, and teach them how their applications should behave in production.

How to Correctly Frame and Calculate Latency SLOs

We so often hear how aggregated percentiles are bad and histograms are good, but after a while it can sound like white noise. This article from Theo Schlossnagle explains the difference between good math and bad math and makes a strong case for using histograms to define your SLOs.
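
To make the point concrete, here is a toy example of my own (not taken from the article). Two hypothetical hosts serve very different traffic volumes; averaging their p99s gives a number that matches neither reality nor any real request, while bucket counts can simply be added together. The bucket bounds, latencies, and 200 ms threshold are all invented for illustration.

```python
from bisect import bisect_left

# Upper bounds of the latency buckets in milliseconds; one extra overflow
# bucket catches anything slower than the last bound.
BOUNDS_MS = [50, 100, 200, 500, 1000]


def to_histogram(latencies_ms):
    """Count samples per bucket (a sample equal to a bound lands in that bucket)."""
    counts = [0] * (len(BOUNDS_MS) + 1)
    for value in latencies_ms:
        counts[bisect_left(BOUNDS_MS, value)] += 1
    return counts


def merge(*histograms):
    """Histograms with identical buckets aggregate by adding counts."""
    return [sum(column) for column in zip(*histograms)]


def fraction_within(counts, threshold_ms):
    """Share of requests at or below threshold_ms, which must be a bucket bound."""
    idx = BOUNDS_MS.index(threshold_ms)
    return sum(counts[: idx + 1]) / sum(counts)


def percentile(samples, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(samples)
    rank = (q * len(ordered) + 99) // 100  # integer ceiling of q * n / 100
    return ordered[rank - 1]


host_a = [40] * 990 + [150] * 10  # busy host, 1000 requests, all fast
host_b = [40] * 80 + [800] * 20   # quiet host, 100 requests, 20 of them slow

averaged_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(host_a + host_b, 99)
merged = merge(to_histogram(host_a), to_histogram(host_b))
within_200ms = fraction_within(merged, 200)

print(f"average of per-host p99s: {averaged_p99} ms")   # 420.0 ms
print(f"p99 over all requests:    {true_p99} ms")       # 800 ms
print(f"requests within 200 ms:   {within_200ms:.2%}")  # 98.18%
```

Note that the threshold sits exactly on a bucket boundary, so the merged histogram answers the "what share of requests met the target" question without any interpolation.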

Bad Apple but it’s time series

This might be the most unusual thing I’ve seen in time-series art for some time. Words fail me.

Tools

opstrace/opstrace

A secure open source observability platform, deployed inside your own network.

I first read of Opstrace back in February, and the idea stuck with me. A collection of the most popular Open Source observability tools, packaged in an easy-to-use manner, designed to run on your own cloud infrastructure, with a Datadog-compatible HTTP API? I haven’t tried it out yet, but they clearly have an eye for pretty UIs and solid documentation. I’m eager to see them add support for tracing (Prometheus exemplars, perhaps?), but for now it looks like they’ve got a good head start on the basics.

linkedin/cruise-control

Cruise Control is a product that helps run Apache Kafka clusters at large scale.

According to the README, LinkedIn maintains over 7,000 Kafka brokers. At that scale, broker failures are a daily occurrence, making Kafka a very expensive service (in terms of overhead) to maintain. Hence, Cruise Control was developed to automate the utilization tracking, anomaly detection, and rebalancing of Kafka workloads.

Events

Monitorama PDX 2021 - September 13-15 (Portland, OR)

Monitorama is returning to Portland this fall. If ticket sales are anything to go on, the community is ready to get back together for another fun event. Hope to see you there!

Job Opportunities

Site Reliability Engineer (Remote, USA) at Linear Financial Technologies, LLC

Senior DevOps Engineer (Remote, USA) at Nota

Devops Engineer (Remote, London UK) at Perfect Ward

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture–or your money back. (SPONSORED)



See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor