Issue 165

Lots of great “in the trenches” stories from a variety of engineering teams out there. Speaking of teams, we’ve got a stack of job postings this week… who’s looking for a new gig?! 💻📈💰

This issue is sponsored by:

Chronosphere

What are the 3 trends in cloud-native and observability you need to know?

Tune in for an on-demand discussion with Chronosphere and analyst group ESG as we talk about the market challenges with cloud-native and observability strategies. You’ll learn the cloud-native adoption benefits and challenges, observability impact on business outcomes, and much more. Register here!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Notes on an Observability Team

This article speaks to me on a very personal level. I’ve built up Observability teams over the years; it’s not surprising that we share many of the same problems, but it’s always interesting to hear how we tackle (or prioritize) them differently.

Thoughts Over an Annoying Production Issue

Always interesting to read how other engineering teams work through really frustrating incidents.

Monitoring CPU performance of Lyft’s Android applications

A look at how Lyft instruments their Android mobile app to track CPU usage and monitor for performance regressions.

How we avoided alarm fatigue syndrome by managing/reducing the alerting noise

How Doctolib audited and continues to iterate on their noisy alerting behaviors.

What Is Log Aggregation: 101 Guide to Best Tools & Practices

A quick introduction to log aggregation concepts along with a fairly objective comparison of numerous commercial and open source alternatives.

New in Grafana 8.5

Everyone’s favorite open source dashboard is out with another new release. Love to see the new alert grouping features.

API Observability with Apache APISIX Plugins

If you’re using Apache APISIX already, there are a number of Observability plugins at your disposal. This article brings together a wealth of resources for getting started with the usual observability pillars (metrics, logs, and traces), and how to integrate them within your existing toolset.

Sysdig

CrashLoopBackoff + Four Other K8s Troubleshooting Tips Everyone Should Know

We all love Kubernetes, but it can be a hassle to fix when things go sideways. In this webinar, we will cover some of the common problems that plague every Kubernetes user and show you how to fix them. Join us at 10am PT on Thursday, April 28 to add these tips to your troubleshooting toolbox. Save your seat here. (SPONSORED)

An Effective Incident Escalation Process of Sendoso

Great to see more companies talking about their incident management process publicly.

Google Cloud Monitoring: What You Need to Monitor and Why

A helpful guide for friends or peers who are new to monitoring on Google Cloud.

Improve observability using Stackdriver metrics programmatically

If you’ve been wanting to pull metrics out of the Google Cloud Monitoring API, this article has you covered. Props to the author for including a GitHub project with examples.

Create Monitoring & Alerting for Webhook Errors using Datadog

A look at Xendit’s pattern for monitoring outgoing webhook failures.

Events

Monitorama PDX 2022 - June 27-29 (Portland, OR)

Monitorama is returning to Portland, OR this summer. It looks like a return to form for one of our favorite events (ok, we might be biased). Hope to see you there!

Job Opportunities

Principal SRE - Logging, Metrics, and Monitoring at athenahealth (US Remote)

Lead Developer - Cloud Infrastructure Engineering at athenahealth (US Remote)

Software Engineer - SRE at Barracuda (Remote)

Senior Software Engineer - SRE at Barracuda (Remote)

Principal Software Engineer - SRE at Barracuda (Remote)

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture, or your money back. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor