Issue 153

This has genuinely been a fantastic week for monitoring and observability content. I hope you enjoy it as much as I have. ☕📖📈

You might have heard discussions about the “three phases of observability.” But what do they really mean? Chronosphere is a SaaS cloud monitoring tool that helps teams rapidly navigate the three phases of observability. Learn more about Chronosphere and the three phases of observability here.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

It’s been amazing to see the community grow throughout 2021 and into 2022. We’d love to have you join us and share what you’ve been working on.

From The Community

Making Alerts Actionable

If you’ve been around here for a while, you know I’m highly opinionated about writing alerts that are useful and empathetic towards the engineers who answer them. I love hearing from others who are just as passionate and thoughtful about ~~writing~~ iterating on alerts.

Roblox Return to Service 10/28-10/31 2021

Roblox has released a very thorough and insightful postmortem of their 73-hour outage dating back to October 2021. It would be easy to throw stones (e.g. the circular dependency between telemetry and Consul) but IMHO they deserve props for publishing this excellent analysis of their recovery efforts and the pairing with HashiCorp engineers.

Load testing the Bloom & Wild Rails Application

A great example of how monitoring and observability tools provide the visibility for application engineers to mock tests and design load test scenarios that accurately represent production use.

Traditions — Incident Response

A quick look at how Harry’s has defined their incident response. You know, in case their production services have any close shaves with an outage. 😜

Capacity Recommendation Engine: Throughput and Utilization Based Predictive Scaling

A dive into Uber’s in-house engine for capacity planning and predictive, automated scaling. Very interesting read.

How to create “story teller” metrics for a better monitoring

Metrics are fine, but it can be helpful to think of them in terms of the story they’re trying to tell you about the overall health of your services or application stack.

Join the Elastic Community Conference. Sign up and win a t-shirt!

ElasticCC is a free technical conference for the community, happening February 11–12, with stories and learnings ranging from ELK to Elastic observability and security. Tracks are available in English, Portuguese, French, Korean, Mandarin and Japanese. Sign up now! (SPONSORED)

A beginner’s guide to network monitoring with Grafana and Prometheus

With the rise of cloud computing, network monitoring isn’t nearly as ubiquitous as it once was. Still, it never hurts to brush up on something as eminently useful (and confusing) as SNMP. Could be a fun weekend project to start monitoring your home router with Prometheus and Grafana.

How to tackle Kubernetes observability challenges with Pixie

An introduction and demo of the Pixie project, a modular open source project that supports monitoring Kubernetes clusters out of the box.

Scaling Kubernetes to Over 4k Nodes and 200k Pods

This isn’t strictly monitoring-related, but I love reading scaling stories where it’s clear they couldn’t have told their tale without the use of our tooling. 😁

We Tested the Best Serverless Monitoring Solutions so You Don’t Have To

There are a lot of alternatives missing here, but it’s a good starting point if you’re curious about the ecosystem of serverless monitoring services.

Cloud Monitoring, We Need to Chat

In case you’re one of the three people actually using Google Chat and want to send your Google Cloud Monitoring alerts there. Kidding aside, this looks like a useful example for hooking up your own custom alerts pipeline.

Tools

pixie-io/pixie

“Pixie is an open source observability tool for Kubernetes applications. Use Pixie to view the high-level state of your cluster (service maps, cluster resources, application traffic) and also drill-down into more detailed views (pod state, flame graphs, individual full-body application requests).”

Job Opportunities

Deployment Engineer at Ada (NA Remote)

Customer Reliability Engineer at Nobl9 (US Remote)

Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor