How We Improved Our Monitoring Stack With Only a Few Small Changes

An engineer from Riskified details their journey of scaling, streamlining, and generally improving the resilience of their monitoring infrastructure.

Tracing Notifications

A fascinating look at how Slack traces the deliverability of their notification system. And what a nugget in the closing of the story… “at least a dozen tracers running simultaneously in the Slack app”. 👀

Distributed System Debugging with OpenTelemetry and Teletrace: Real-World Examples

Teletrace looks like an interesting new-to-me project for visualizing OpenTelemetry traces and debugging distributed systems. Anyone using this in production?

Datadog: Metrics without Limit

Some lessons learned for reducing custom metrics usage with Datadog’s “Metrics without Limits” feature. Woof.

Observability Driven Development (ODD) - Enhancing System Reliability

Maybe I’m too close to the problem, but this always felt like the desired state to me anyway. Regardless, if framing it as an acronym helps adoption in your org, I’m all for it. 😉

Kubernetes 1.27: Query Node Logs Using The Kubelet API

More details about the new “Node log query” feature introduced with Kubernetes 1.27.

Releasing Graphite Query Language in Open Source VictoriaMetrics

An interesting update from VictoriaMetrics, announcing support for the Graphite query API in their open source release v1.90. I’m a little surprised there’s enough demand for this to justify the effort, but it still makes me smile.

Distributed Tracing: OpenTelemetry and Grafana Tempo

Another distributed tracing how-to. This one provides a bit more detail and relies on Grafana Tempo for querying and visualization.

Analyzing a Django App Using OpenTelemetry APM

A quick walkthrough for instrumenting and tracing your Django or Python web application.



Monitorama 2023 PDX

Monitorama has announced their full agenda for this year’s event. Looks like an awesome collection of topics and speakers. Hope to see you there!

