I’m very excited to bring back our quarterly “best of” issues for 2021. This collection looks back through our summer months with some really standout articles. Hope you enjoy it!
This issue is sponsored by:
Start incident response with context to all your alerts in one view
Moogsoft speeds up incident response with dynamic anomaly detection, suppressed alert noise, and correlated insights across all your telemetry data. Go from debugging across multiple tools, screens, and dashboards into a single incident view so you and your teams can take a more proactive approach to reduce MTTR. Sign up for the Moogsoft Free community plan today!
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
I get a lot of value out of Splunk, but you better believe I’ll be looking for alternatives if a vendor acts like they have me locked in. Even at 100TB of daily log ingestion, this team managed to plan and execute their way into an open source alternative stack. But my favorite part of this article is how they’ve documented the processes for a successful transition. You should be able to take these steps and apply them to any software transition.
If you haven’t tried Thanos yet, this is a solid introduction to all the components and a walkthrough for setting it up yourself. Props to the author for including the relevant configurations and command-line steps, along with screenshots from the various UI elements.
Some of these dashboards are truly gorgeous. I’m not sure how effective they are for daily use, but they’re solid inspiration for future projects.
Not a monitoring article per se, but it struck me that all of these mistakes could be alerted on without much effort.
Healthy alerting practices, yes please and thank you.
If you read nothing else this week, read this one and share it with your peers. You may not agree with all of their points on deployments and incidents, but the discussion on psychological safety at work is an important one.
Frankly, I never get tired of seeing companies switch between build versus buy (and back again). No matter which “team” you’re on, it’s always educational to hear that it’s possible (and cost-effective) to make that pivot. Another win for Thanos.
If you’ve ever been confused by the proliferation of distributed tracing tools, this article was written for you. Fantastic post, well worth your time.
Much of this will probably sound familiar, but it’s a great primer and a fun read.
Any effective observability platform should exist for the benefit of its customers. This post offers a few high-level considerations to keep in mind when beginning your observability journey.
Bad monitoring is unhealthy. IMHO good alerting is a byproduct of empathy for your peers (and systems).
The next part of a series on open source observability, this one pivots from Prometheus metrics to tracing with Jaeger and OpenTelemetry. Another great write-up with comprehensive code examples, diagrams, and configuration snippets.
See ROI on cloud and cloud native monitoring in minutes, free.
Ready to see insights on your cloud infrastructure workloads in minutes? The OpsRamp free trial makes it easy. Set it up with no credit cards or commitments, onboard your resources with our wizard, and use out-of-the-box or custom dashboards to get the metrics that matter. We'll even supply the GCP resources if you just want to see how it works. Get started today. (SPONSORED)
Holy crap, this is frickin’ cool. Check out the live site here.
The tool is nice, but I had to include this one just for the application name. Well played.
This post dovetails nicely with the other SLO articles this week. Beyond the golden signals, what else should you be monitoring? Quite a bit, as it turns out.
This might be the first article that successfully explained to me what Loki is all about. Excellent summary here, pass it along to your peers.
Whoa. Ok, the story is helpful for context… but seriously, check out their interactive docs site.
Monitoring theory, from scratch
Gil Bahat has an exhaustive series on monitoring theory and practice. It’s such a great series, I’m going to include all of the articles to date:
- The definitions
- Indicators and synthetics
- Metrics and thresholds
- Events and transactions
- Business process monitoring
Sloth generates SLOs easily for Prometheus based on a spec/manifest that scales and is easy to understand and maintain.
For as much as folks talk about SLOs, I haven’t seen a lot of standardization in how we document them, communicate them, etc. I’m very excited to see a project like Sloth surface, and I hope it continues to mature. Interestingly, this is how I first heard of the OpenSLO specification.
Lots of emphasis on Thanos’ long-term retention capabilities. This is one of the things that drove our own adoption of it at $DAYJOB, where it continues to serve us well.
This is a “chonky boi” of a technical article, but there’s so much good stuff in here I simply had to include it. Talk about squeezing every last drop of performance out of a system. And I loooove the inclusion of flame graphs.
Lots to chew on here if you’re thinking about capturing tracing spans for HTTP requests.
Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture–or your money back. (SPONSORED)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor