Issue 164
A fantastic week for SLO and reliability topics. And from the so-long-ago-you-might-have-forgotten-you’ve-seen-them department, a couple of hilarious videos about engineers yelling at servers. Enjoy! 🤣
This issue is sponsored by:
What are some of the market challenges with cloud-native and observability strategies? Join Chronosphere and ESG for an analyst webinar set to go live on Wednesday, April 20 to learn the three trends in cloud-native and observability you need to know. Save your seat here.
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
ZEN and the art of Reliability
An aspirational set of principles and guidelines from Zendesk Engineering for building reliability into your organization.
Onboarding SLOs for Salesforce services
How Salesforce automates service level management and provides a consistent and approachable experience for internal developers.
OpenTelemetry, the standardized observability framework for everyone
We see a lot of OpenTelemetry articles around here, but not many about the portability that OTel offers. Nice to see how easy it can be to swap backends when your company makes an unexpected vendor pivot.
TensorFlow Library Performance
Great to see Brendan Gregg continue to publish new blog posts. Here’s a new one about him and a fellow Netflix engineer debugging a TensorFlow performance issue.
A Deep Dive Into the Four Types of Prometheus Metrics
Although the Prometheus project has some pretty good documentation, it runs a bit terse. Nice to see a more extended (and user-friendly) look at the different Prometheus metric types.
Innovation Diffusion in Practice: SLOs
I love this two-part series on how Klarna developed and rolled out their SLO initiative. In contrast to the other reliability articles in this issue (Zendesk and Salesforce), this one leans heavily into the human factors associated with significant change at work.
How to effectively troubleshoot production issue
Monitoring tools offer little value without the skills to transform performance data into troubleshooting and remediation. This article offers an efficient distillation of the related chapter from Google’s SRE book.
Old Man Yelling at Cloud
There’s a good chance that many of you have already seen these, but they’re old enough that I feel justified in including them here. This first link is a viral video (at least within the tech industry) from way back in 2008, and was in many ways a watershed moment for how we think about systems monitoring and debugging. The second link is a follow-up talk from Bryan Cantrill with a long but rewardingly hilarious explanation of the events leading up to the original video. I urge you to watch both of them, if you haven’t already.
Thoroughly understand Events in Kubernetes
A walk through various Kubernetes events, how to read them effectively, and when to use them instead of logs.
NATS monitoring
If you use NATS at work, you’ll probably want to check out these two monitoring-specific articles from a longer series on adopting NATS for messaging in a distributed services infrastructure.
Monitoring Tesla Solar and Powerwall with Prometheus
I always get a kick out of seeing folks use Grafana for one-off home automation monitoring. This time we’ve got an engineer using it with Prometheus to monitor their Tesla Solar and Powerwall installation.
Tools
“A top-like tool for monitoring NATS servers.”
https://github.com/nats-io/nats-surveyor
“NATS surveyor polls the NATS server for Statz messages to generate data for Prometheus. This allows a single exporter to connect to any NATS server and get an entire picture of a NATS deployment without requiring extra monitoring components or sidecars.”
Events
Monitorama PDX 2022 - June 27-29 (Portland, OR)
Monitorama is returning to Portland, OR this summer. It looks like a return to form for one of our favorite events (ok, we might be biased). Hope to see you there!
Job Opportunities
DevOps Engineer at Array (Remote)
Site Reliability Engineer III at Stash (US Remote)
Senior Developer Relations & Community at Rootly (NA Remote)
Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor