Some fun and interesting articles this week. Any week I can reference Brendan Gregg’s work with flame graphs is probably a good one. Oh, and some great articles on Kubernetes and the Cilium CNI… enjoy! 🔥📈🔔

This issue is sponsored by:

Troubleshoot Kubernetes in a Snap with Sysdig Monitor Advisor

Sysdig Monitor is making it easier to find important details about your clusters, namespaces, and deployments with a new feature called Advisor. In this on-demand webinar, you will learn how Advisor can help you debug and solve difficult Kubernetes problems 10x faster! Watch Now!

Articles & News on

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Managing Prometheus at scale with Cortex

This article is more of an overview and explanation of Cortex than it is a tutorial for using it to scale your Prometheus clusters, but it’s still an accurate introduction to Cortex’s features and architecture.

How To Read Flame Charts and Percentiles

An approachable explanation of flame charts and percentiles. After you read this one, go immerse yourself in Brendan Gregg’s massive collection of flame graph resources.

What You Need To Know To Debug A Preempted Pod On Kubernetes

I’m not a Kubernetes expert by any means, but I still learned a ton about pod preemption and the limits and criteria related to it. If you’re responsible for monitoring a Kubernetes cluster, you owe it to yourself to read this one.

Automated Incident Management Through Slack

I love that Airbnb has built their own incident management Slack chatops bot and shared the story with us here. However, our industry is stuffed with vendors that do precisely this. Unless you’ve identified a specific reason to build your own, I’d probably encourage you to focus on your core mission and just buy one off the shelf.

Using Custom Span Attributes in OpenTelemetry

Part four of an excellent series on OpenTelemetry, this post covers using custom span attributes to attach your own metadata to the spans in your traces.

Understanding OpenTelemetry Collectors

Part five of the same OpenTelemetry series, this post provides more context on the design of OpenTelemetry collectors and their internal components. Great stuff.

Don’t overcategorise incidents!

We’ve probably all been guilty of this at times, though I do think it can be helpful to at least track these categories internally for learning and planning purposes.

Ingest Graphite, Datadog, Influx, and Prometheus metrics into Grafana Mimir

Great to see Mimir add experimental support for other time-series metrics formats, including my old personal favorite, Graphite. :)

Kubernetes Networking with Cilium CNI and OKE on Oracle Cloud

Although this isn’t strictly a monitoring-related article, it provides a ton of useful context around the Cilium CNI that you’ll probably want if you have to work with or support it (and makes a great intro before reading the next story).

Key Metrics for Monitoring Cilium

Speaking of Cilium, Datadog has published this helpful overview of the CNI and the metrics you’ll want to keep an eye on.

AWS — Log Anomaly Detection and Recommendations

I’m a little hesitant to open my wallet to a) let AWS apply machine learning to all of my logs and b) send them enough logs in the first place to drive accurate anomaly detection. OTOH if you’re already in the latter bucket and need to automate some insights, this might be just what you’re looking for.



Grafana Mimir proxies are a collection of open source software projects that provide native ingest capability for third-party applications into Mimir.

Job Opportunities

Golang Software Engineer at Replicated (Remote)

Customer Reliability Engineer (K8s) at Replicated (Remote)

Customer Reliability Engineer (Go) at Replicated (Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor