Issue 154

Production scaling and debugging stories, alerting on error budgets, distributed tracing, and a lot more this week. Enjoy!

What should we expect for the observability space in 2022? Chronosphere’s Co-founder and CEO, Martin Mao, breaks down his top 3 predictions for what’s to come in observability at the Predict 22 Virtual Summit. Catch the recap here!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

It’s been amazing to see the community grow throughout 2021 and into 2022. We’d love to have you join us and share what you’ve been working on.

From The Community

Fixing Performance Regressions Before they Happen

Always interesting to hear how companies at Netflix’s scale think about regressions and anomalies, and which statistics they use to drive their validations.

5 key observability trends for 2022

An overview of the most common trends in observability right now. Jibes with everything I’ve seen in this newsletter over the past year.

Two reasons Kubernetes is so complex

Not strictly a monitoring or observability story, but the author provides context around what makes Kubernetes such a complex system to operate, which should inform how we think about observing it.

Error Budget Is All You Need

An excellent two-part series on SLOs, error budgeting, burn rates, and how to alert on them correctly.

Tracing Node.js application with OpenTelemetry & Jaeger UI

A very approachable example for instrumenting your Node.js app for tracing.

Observations on Resilience, Part I: The Heroic St(age)

Having worked in the monitoring space for so long, thinking about resilience and reliability is something I take for granted. Knowledge transfer and shared context of our systems are key to the resilience of our services, and that resilience should be a crucial trait of our monitoring systems too.

How We Saved 70K Cores Across 30 Mission-Critical Services

A look into how Uber reclaimed a significant amount of system resources by tuning their Go garbage collection.

Join the Elastic Community Conference. Sign up and win a t-shirt!

ElasticCC is a free technical conference for the community, happening February 11–12, with stories and learnings from ELK to Elastic observability and security. Tracks in English, Portuguese, French, Korean, Mandarin and Japanese. Sign up now! (SPONSORED)

A (de)bug’s life: Diagnosing and fixing performance issues in Grafana Loki’s read path

Great to see more practical examples of Grafana Loki for debugging actual production problems. Hoping to read more stories like this from the greater open source community.

Distributed Tracing in Microservices

A primer on the problems facing microservices, and why tracing can help diagnose transactions through a complex stack.

Sampling, verbosity, and the case for (much) broader applications of distributed tracing

An illustrated matrix of the various use cases, limitations, and broader implications of distributed tracing across our industry.

Tools

pyrra-dev/pyrra

“Making SLOs with Prometheus manageable, accessible, and easy to use for everyone!”

Job Opportunities

Software Engineer, Incident Management at Slack (CA Remote)

Infrastructure Engineer at Plausible Analytics (Remote)

Sr. Cloud Platform Engineer at Classkick (US Remote)

Staff Software Engineer - Observability at Fastly (US Remote)

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture, or your money back. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor