I love issues like this one that skew heavy on the technical side, with debugging and hands-on guides. BTW if you ever run across something interesting that I’ve missed, please reach out and let me know! 📈💾👷‍♀️

This issue is sponsored by:

Chronosphere

Can you operate observability data at scale? Have you optimized for speed and performance? While cloud native is the modern architecture of choice, it can slow down your DevOps teams. In this ebook, learn 5 steps to align DevOps and cloud native operations to boost developer productivity.



Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Tracking down data corruption in Alertmanager notifications

An excellent diagnostic post from one of Alertmanager’s contributors. I genuinely love reading this author’s explanations on Alertmanager internals because they do a great job providing examples and sharing their thought process.

Tagging Everything

In the age of cloud and distributed systems, tags are a requisite for managing all of our disparate resources, particularly when it comes to observability. A broader look at the benefits of tagging versus metrics, though I wish the author would’ve touched on cardinality concerns.

Remove high cardinality in Prometheus

Speaking of cardinality, here are some tips and examples for tracking down and mitigating the sprawl of high-cardinality metrics in your Prometheus cluster. Love this.

Reducing the Cost of Custom Metrics in Datadog

Feels like we’re on a roll here… this one’s for those Datadog users with a glut of custom metrics weighing down your monthly invoice (oh wait, that’s everyone). 😜

OpenTelemetry Q&A Feat. Hazel Weakly

If you weren’t already aware, the OpenTelemetry project has their own official channel on YouTube. Their videos are generally rich in good information, but in particular I love this Q&A-style interview with Hazel Weakly where she describes the challenges and strategies for rolling out OpenTelemetry within an organization.

Incident management platform FireHydrant is building an alerting product, which will mark the first time alerting and incident response are offered in one platform. Sign up for early access to Signals by FireHydrant, and be among the first to experience the power of alerting + incident response together, at last. (SPONSORED)



Network health overview with mtr, ss, lsof and iperf3

Some useful networking CLI utilities that can aid in debugging a monitoring alert or even serve as the basis for a health check.

Create Datadog monitors and alert by code on Kubernetes

Datadog is a popular commercial offering because of its breadth of coverage and capabilities. This post demonstrates a quick way to automate your Datadog monitors for a Kubernetes cluster.

What’s next for observability?

A C-suite-worthy analysis of the observability landscape. If you’re trying to make a case with your leadership for dedicated resources, this could be a good article to share with them.

Monitoring SQS with Datadog

A very comprehensive guide for monitoring your SQS queues with Datadog.

Job Opportunities

Infrastructure Engineer at Nava (US Remote)

Senior Security Engineer at Redox (US Remote)

Staff Engineer, Solutions Architect at Bellese (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor