Issue 168

Some exciting announcements and discussions this week. Big news for Kubernetes container probes, a new book on Observability, some fresh takes on retrospectives, and much more. Enjoy! 😍📖🔥

This issue is sponsored by:

Say goodbye to manual evidence collection and hello to automated compliance. Drata, G2’s highest-rated cloud compliance software, offers 60+ integrations that connect seamlessly with the tech stacks you use to manage compliance across your organization. Monitoring Weekly readers get 10% off Drata here.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Observability is NOT Only for SREs

Lessons from a QA engineer on how Observability has empowered them to be more active in troubleshooting and to provide more useful feedback to developers.

Application logs: Your eyes in production

A look at the role of logs in supporting production workloads along with some helpful tips and best practices to make them even more effective.

Improving Distributed Caching Performance and Efficiency at Pinterest

A fun post from Pinterest about performance and efficiency gains in their caching infrastructure made possible by their monitoring data.

Kubernetes 1.24: gRPC container probes in beta

Great to see gRPC probes finally reach beta status in Kubernetes 1.24, meaning they’re now available by default. This should be a big win for liveness, readiness, and startup checks on the gRPC services running in your clusters.

A Deep Dive Into OpenTelemetry Metrics

We talk a lot about the OpenTelemetry framework in terms of traces and spans, but it provides enormous value in the form of metrics as well. This post is an excellent guide to what makes OTel metrics unique, how to set them up, and when to use the various types.

Observability Engineering - O’Reilly Book

O’Reilly’s new Observability Engineering book has been released, and Honeycomb has made the entire eBook available as a free download. Looking forward to reading this one soon.

Failing Forward — How We Grow from Incidents

Spotify engineers performed a retrospective on all of their incidents from 2021 in an effort to understand how they performed overall, and to uncover any missed learning opportunities. Good stuff, and something I wish more companies provided the space to emulate.

incident.io has joined your #general Slack channel.

👋 I'm here to sponsor this issue and automate your entire incident management process in Slack. You just focus on fixing the issue, I'll keep your team and status page updated, nudge you to take the important actions, escalate to the right person when needed, auto-generate your post-mortem and make sure follow-up actions are taken care of.

Install incident.io to your Slack, type /incident and I'll take care of the rest.

incident.io has left the chat. (SPONSORED)

OpenTelemetry in Action: Identifying Database Dependencies

Databases can benefit from observability as much as all the other services we run in production. With the help of OpenTelemetry, we can surface hidden dependencies that might otherwise be difficult to uncover or debug.

Incident postmortem pitfalls

Some tips to follow and pitfalls to watch out for when managing your postmortems.

BellJar: A new framework for testing system recoverability at scale

Facebook engineers have developed an internal framework allowing them to isolate entire infrastructures into “vacuum-sealed” sandboxes, suitable for resiliency and recovery experimentation. It doesn’t sound like this project will be released as open source, but it serves as an interesting case study and pattern for developing our own pre-production environments.

What SREs Can Learn from the Atlassian Nightmare Outage of 2022

A recap of Atlassian’s extended outage and the lessons learned from it.

Events

Monitorama PDX 2022 - June 27-29 (Portland, OR)

Monitorama is returning to Portland, OR this summer. It looks like a return to form for one of our favorite events (ok, we might be biased). Hope to see you there!

OSMC 2022 - Call for Papers

OSMC is back again for 2022, taking place November 14-16 in Nuremberg. CfP submissions are being accepted through July 31, 2022.

Job Opportunities

Senior Staff Engineer, Kubernetes at Wayfair (US Remote)

Senior Infrastructure Engineer at Eleanor Health (US Remote)

Site Reliability Engineer - Runtime Platforms at Flatiron Health (US Remote)

Site Reliability Engineer - OS Engineering at Flatiron Health (US Remote)

Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor