Somehow I managed to pause Ted Lasso long enough to put the finishing touches on this week’s newsletter. No tricks, but plenty of treats (and an XL-sized bag of KubeCon videos) for your enjoyment – stay safe and enjoy the stories! 🎃👻🦇

This issue is sponsored by:

Start incident response with context to all your alerts in one view

Moogsoft speeds up incident response with dynamic anomaly detection, suppressed alert noise, and correlated insights across all your telemetry data. Go from debugging across multiple tools, screens, and dashboards into a single incident view so you and your teams can take a more proactive approach to reduce MTTR. Sign up for the Moogsoft Free community plan today!

Articles & News on

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

KubeCon 2021 O11y Talks

KubeCon 2021 was a massive online event, with over 200 (!!) recorded talks. I’ve combed through all of them to create a playlist of the 19 videos specifically about monitoring and observability.

A developer’s guide to programmatically overcome fear of failure

Yes, yes, yes, and yes. An important article that we should all read and take to heart. I think it’s vital that we provide space for our teams to build, iterate, and most importantly, fail. I know it’s cliche, but failure provides the best learning opportunities. Please share this one.

The road to world-class monitoring at Azimo

I love reading about how companies think about monitoring, actionable (or not) alerts, and empowering teams with the data needed to increase reliability and to surface problems before they become customer-impacting events. Props to Azimo’s engineering leadership for sharing their experiences.

Federating Prometheus Effectively

Personally, I’d rather just throw Thanos in front of my Prometheus clusters and not have to think about manual federation. But to be fair, my last big Thanos deployment far exceeded the limitations of native Prometheus federation, and the author of this article may not have the freedom to deploy yet another collection of services. If this sounds like you, I would definitely check this out.

Forgot to renew the TLS certificates? Monika will remind you from now on

Honestly, this would have saved my bacon at a previous gig where we used TLS certificates for everything.

Why Your Services Need Observability

A fun look at how one company’s growth and evolution might influence the observability practices and tooling they adopt. This is a great article to share with friends who may not be as experienced in these areas.

CarbonJ: A high performance, high-scale, drop-in replacement for carbon-cache and carbon-relay

I’m genuinely surprised to hear of another alternative Carbon project, and even more so that it’s coming out of Salesforce. Still, it offers a compelling alternative to the traditional Python services, the Go-Graphite stack, and possibly even newcomers like ClickHouse.

Observability Into Your FinOps: Taking Distributed Tracing Beyond Monitoring

It doesn’t surprise me to read that folks are using observability data for use cases outside of traditional DevOps and Engineering applications. I’ve seen this in action myself, where business and marketing teams would leverage our systems rather than trying to build up more complex analytical queries elsewhere.

Customers named LogicMonitor #1 in satisfaction in the Fall 2021 Network Monitoring grid from G2. Download the full report to see real user reviews and rankings across top network monitoring vendors. (SPONSORED)

In our systems we trust

It can take years to build up trust in our systems (and among teams), so I can empathize with the situations presented here. Precision, transparency, and communications are key.

A different and (often) better way to downsample your Prometheus metrics

Timescale has released a beta version of Promscale with support for downsampled Prometheus metrics. There are some benefits to their “continuous aggregates” feature versus Prometheus recording rules, but it also means introducing a new system just for maintaining your aggregate data. Still, Timescale does offer some unique advantages over traditional PromQL queries.

Building an Enterprise Ready Monitoring Solution

InfluxDB (and its related projects) is one of those systems that’s evolved a lot since its early days. If you’ve ever been curious about what it might look like to deploy an “Enterprise-ready” Influx stack, this article is a good place to get started. (Note: the code formatting seems to be broken in this article, but there’s still a good bit of useful info before that.)



CarbonJ is a drop-in replacement for carbon-cache and carbon-relay. It was designed with high-performance read and write throughput in mind, and it supports writing millions of metric data points and serving millions of metric data points per minute with low query latency.

Job Opportunities

Senior Dev Ops Engineer at Coursedog (Remote)

Senior Infrastructure Software Engineer at MethaneSAT (Remote)

Cloud Engineer at Pingboard (Remote)

Senior Infrastructure Engineer at IRL (Remote)

Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor