I’m very excited to bring back our quarterly “best of” issues for 2021. This collection looks back through our summer months with some really standout articles. Hope you enjoy it!

This issue is sponsored by:


Start incident response with context for all your alerts in one view

Moogsoft speeds up incident response with dynamic anomaly detection, suppressed alert noise, and correlated insights across all your telemetry data. Go from debugging across multiple tools, screens, and dashboards to a single incident view so you and your teams can take a more proactive approach to reducing MTTR. Sign up for the Moogsoft Free community plan today!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

How We Replaced Splunk at 100TB Scale in 120 Days

I get a lot of value out of Splunk, but you better believe I’ll be looking for alternatives if a vendor acts like they have me locked in. Even at 100TB of daily log ingestion, this team managed to plan and execute their way onto an open source alternative stack. But my favorite part of this article is how they’ve documented the process for a successful transition. You should be able to take these steps and apply them to any software transition.

“THANOS” — Monitoring with Prometheus and Grafana

If you haven’t tried Thanos yet, this is a solid introduction to all the components and a walkthrough for setting it up yourself. Props to the author for including the relevant configurations and command-line steps, along with screenshots from the various UI elements.

Grafana dashboard showcase

Some of these dashboards are truly gorgeous. I’m not sure how effective they are for daily use, but they’re solid inspiration for future projects.

Common Kubernetes Errors Made by Beginners

Not a monitoring article per se, but it struck me that all of these mistakes could be alerted on without much effort.
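As a sketch of what that alerting might look like, assuming you’re already scraping kube-state-metrics with Prometheus (the group name, thresholds, and severity labels here are hypothetical), a couple of rules covering classic beginner failure modes could be:

```yaml
groups:
  - name: kubernetes-beginner-mistakes
    rules:
      # Container stuck restarting (bad command, failing probe, etc.)
      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      # Typo'd image name or missing registry credentials
      - alert: PodImagePullFailure
        expr: kube_pod_container_status_waiting_reason{reason=~"ImagePullBackOff|ErrImagePull"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} cannot pull its image"

      # Memory limit set too low for the workload
      - alert: PodOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
```

Nothing fancy, but it turns each of those silent misconfigurations into a page (or at least a ticket) instead of a surprise.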

Monitoring Alerts That Don’t Suck

Healthy alerting practices, yes please and thank you.

Psychological safety in a software team

If you read nothing else this week, read this one and share it with your peers. You may not agree with all of their points on deployments and incidents, but the discussion on psychological safety at work is an important one.

Prometheus, but bigger

Frankly, I never get tired of seeing companies switch between build and buy (and back again). No matter which “team” you’re on, it’s always educational to hear that it’s possible (and cost-effective) to make that pivot. Another win for Thanos.

Five years evolution of open-source distributed tracing

If you’ve ever been confused by the proliferation of distributed tracing tools, this article was written for you. Fantastic post, well worth your time.

Unpacking Observability: A Beginner’s Guide

Much of this will probably sound familiar, but it’s a great primer and a fun read.

3 considerations from building a platform for Observability

Any effective observability platform should exist for the benefit of its customers. A few high-level considerations to keep in mind when beginning your observability journey.

Good and Bad Monitoring

Bad monitoring is unhealthy. IMHO good alerting is a byproduct of empathy for your peers (and systems).

Hacking your way to Observability — Part 3

The next part of a series on open source observability, this one pivots from Prometheus metrics to tracing with Jaeger and OpenTelemetry. Another great write-up with comprehensive code examples, diagrams, and configuration snippets.


See ROI on cloud and cloud native monitoring in minutes, free.

Ready to see insights on your cloud infrastructure workloads in minutes? The OpsRamp free trial makes it easy. Set it up with no credit cards or commitments, onboard your resources with our wizard, and use out-of-the-box or custom dashboards to get the metrics that matter. We'll even supply the GCP resources if you just want to see how it works. Get started today. (SPONSORED)

Using Grafana, academics created a next-level dashboard tracking the impact of Covid-19 in Romania

Holy crap, this is frickin’ cool. Check out the live site here.

The SLAyer your data pipeline needs

The tool is nice, but I had to include this one just for the application name. Well played.

Setting up Service Monitoring

This post dovetails nicely with the other SLO articles this week. Beyond the golden signals, what else should you be monitoring? Quite a bit, as it turns out.

Logging with Loki

This might be the first article that successfully explained to me what Loki is all about. Excellent summary here, pass it along to your peers.

Updates of PostgreSQL Observability diagram

Whoa. Ok, the story is helpful for context… but seriously, check out their interactive docs site.

Monitoring theory, from scratch

Gil Bahat has an exhaustive series on monitoring theory and practice. It’s such a great series that I’m including all of the articles published to date.

SLOs should be easy, say hi to Sloth

Sloth generates SLOs for Prometheus from a spec/manifest that scales and is easy to understand and maintain.

For as much as folks talk about SLOs, I haven’t seen a lot of standardization in how we document them, communicate them, etc. I’m very excited to see a project like Sloth surface, and I hope it continues to mature. Interestingly, this is how I first heard of the OpenSLO specification.
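To give a flavor of what that looks like, here’s a sketch of a Sloth manifest based on the project’s documented prometheus/v1 spec (the service name, metric, and queries are hypothetical, and field names may have evolved since):

```yaml
version: "prometheus/v1"
service: "myservice"
labels:
  owner: "myteam"
slos:
  # 99.9% of HTTP requests should not be server errors
  - name: "requests-availability"
    objective: 99.9
    description: "Availability SLO based on HTTP response codes."
    sli:
      events:
        # Numerator: rate of error responses over the SLO window
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"5.."}[{{.window}}]))
        # Denominator: rate of all responses over the same window
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    alerting:
      name: MyServiceHighErrorRate
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```

From a spec like this, Sloth generates the recording and multiwindow burn-rate alerting rules for you, which is exactly the kind of standardization I’ve been missing.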

How to Serve 200K Samples per Second with Single Prometheus

Lots of emphasis on Thanos’ long-term retention capabilities. This is one of the things that drove our own adoption of it at $DAYJOB, where it continues to serve us well.

Extreme HTTP Performance Tuning: 1.2M API req/s on a 4 vCPU EC2 Instance

This is a “chonky boi” of a technical article, but there’s so much good stuff in here I simply had to include it. Talk about squeezing every last drop of performance out of a system. And I loooove the inclusion of flame graphs.

Thoughts on HTTP instrumentation with OpenTelemetry

Lots to chew on here if you’re thinking about capturing tracing spans for HTTP requests.

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture, or your money back. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor