Issue 139

Before we get into the articles, I want to take a moment to thank everyone for their support. It’s been a lot of fun bringing you this newsletter each week, and it sounds like you’re enjoying it as much as I am. Thank you and enjoy an inbox chock full of great articles from this past week!

This issue is sponsored by:

Start incident response with context to all your alerts in one view

Moogsoft speeds up incident response with dynamic anomaly detection, suppressed alert noise, and correlated insights across all your telemetry data. Go from debugging across multiple tools, screens, and dashboards into a single incident view so you and your teams can take a more proactive approach to reduce MTTR. Sign up for the Moogsoft Free community plan today!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

More details about the October 4 outage

By now you’ve heard all about the massive Facebook (and related properties) outage back on October 4. This is a follow-up post from Facebook providing more details on the cascading failure that led to a global shortage of cat memes.

What happened at Facebook on the 4th

In response to the Facebook postmortem, I joined Mandi Walls, Pete Cheslock, and Joshua Timberman on Twitch to talk about the event, how fragile the Internet truly is, and why power tools are still a vital part of our disaster recovery plans. Hot takes galore.

Polar Signals: an expedition beyond traditional observability

Looks like another VC-backed Observability startup wading into the mix with Parca, an open source “continuous profiling” system. Reminds me a bit of Riemann with a bunch of Prometheus-isms.

Safe Updates of Client Applications at Netflix

Stories like this one remind me why I love the Netflix tech blog. Tons of great thinking around deployments and monitoring for regressions in A/B testing and canaries.

Trigger a Kubernetes HPA with Prometheus metrics

Looking to autoscale your Kubernetes pods based on Prometheus triggers? Here you go.

Grafana 8.2 released: Dynamic plugin catalog, new fine-grained access control permissions, and more

Great to see enhancements to the date picker and performance improvements to the image renderer. Lots of good stuff in this release. Oh, and if you missed it, make sure to update your Grafana deployments for this recent CVE.

Changing the tires on a moving bus

I love stories about refactoring and the challenges of retooling a system in motion. Although this post has nothing to do with monitoring in the traditional sense, it should strike a chord with anyone who’s had to upgrade storage for metrics or logging systems.

A Lap around Kubernetes Security & Vulnerability scanning Tools

I shouldn’t need to say it, but security is everyone’s job. If you’re running Kubernetes you should check out this collection of security scanning tools. Props to the author for including sample output from each.

Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. Teams at enterprises, large cloud-native, and mid-market companies around the world trust Chronosphere to help them operate scalable, highly available, and resilient applications. Learn more here. (SPONSORED)

The Cost of Increasing Incidents: How COVID-19 Affected MTTR, MTTA, and More

I’m not surprised to hear folks are experiencing more burnout as a result of COVID-19, but there are some interesting datapoints regarding MTTR and MTTA trends over the past few years.

Monitoring - Prometheus, Grafana and Loki

A friendly introduction to the most common open source observability tools. Share with your non-observability-SME friends.

MetricsQL: PromQL compliance

If you’re a Prometheus user but considering a move to VictoriaMetrics, this writeup covers some of the important differences and incompatibilities between PromQL and MetricsQL.

Kubernetes Cost Monitoring with Prometheus & Grafana

There are probably easier ways of tracking your Kubernetes spend, but I’m sure they’re not free. At the very least, you can try this out and use it to justify a commercial alternative.

The Data Collection Herd Effect

Yes, their conclusion is for you to buy their observability service, but they still make some valid points about data ingestion and storage along the way.

Tools

parca-dev/parca

“Continuous profiling for analysis of CPU, memory usage over time, and down to the line number. Saving infrastructure cost, improving performance, and increasing reliability.”

Job Opportunities

Engineering Manager, Production Engineering at SoundCloud (Berlin)

Production Engineer/Site Reliability Engineer at SoundCloud (Berlin)

Senior Software Engineer - Platform at Upgrade (Remote)

Platform Engineer (SRE) at SparkPost (Remote)

Ready to lower your AWS bill? Now might be the perfect time for an AWS Cost Optimization project with The Duckbill Group. The Duckbill Group aims for a 15-20% cost reduction in identified savings opportunities through tweaks to your architecture–or your money back. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor