A super fun week chock full of production stories and reflections. I particularly love the posts from DoorDash and Slack engineers. Enjoy! 🌈🚪💨

This issue is sponsored by:

Armory

The Darkest Knight

In 2012, a company called Knight Capital lost $440 million in 45 minutes and disrupted the stock market, all due to one failed deployment. Don’t let this happen to you. Read more deployment horror stories in this blog.



Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

How DoorDash Migrated from StatsD to Prometheus

Love the detail from DoorDash engineers on their migration to Prometheus. However, I do find it odd that they chose not to name the legacy “monitoring backend” they moved away from (StatsD is an aggregation service, not a backend). A quick Google search shows they were using Wavefront (now a smaller part of VMware Aria) back in 2018. Looking forward to their re-architecture for OTLP metrics ingestion in a couple of years. 😉

Prometheus Now Supports OpenTelemetry Metrics

Always good to see open standards prevail. I’m curious to see if this new OTLP ingestion endpoint, combined with the continued popularity of OpenTelemetry, forces Prometheus maintainers to reconsider their commitment to pull-based collection.

P.S. You should definitely read the caveats in this comment regarding the performance hit around these converted namespaces (as Prometheus labels). Sounds like a cache addition will be forthcoming.

OpenTelemetry: A beginner’s handbook to instrument your application

Speaking of OpenTelemetry, here’s a solid introduction for anyone looking to learn the basics before diving in.

Sofia’s Illuminating Voyage into Distributed Tracing

The next chapter from Sofia and her “Journey to Observability”. I still can’t tell if this is based on a real person or is supposed to be DevOps manga, but I’m here for it.

Service Delivery Index: A Driver for Reliability

I was genuinely impressed by this post from Slack engineers. They have a strong history of Observability leadership, and it’s clear they give a damn about their customers’ experience. This is an excellent read; I virtually guarantee you’ll take away something interesting.

Kubernetes Monitoring: Ensuring Performance and Stability in Containerized Environments

This appears to be intended as an exhaustive overview of Kubernetes observability. It does a decent job touching on all of the related topics, but you’ll want to do deeper research on any specific area. Frankly, if you just grabbed all of the section titles, they would make a great checklist for your manager. 😜

Logs monitoring with Loki, Node.js and Fastify.js

How to add effective logging to your application, with some tips for Promtail, Loki, and Grafana dashboards. This almost reads like a confessional by a developer figuring this stuff out for the first time, but they did a nice job capturing their experience.

How to Add Custom Label or Key to Records in Fluentd

A simple yet handy example for adding (or merging) custom labels in your logs with Fluentd.

Tools

grafana/loki

Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus.

Julien-R44/pino-loki

This module provides a transport for pino that forwards messages to a Loki instance.

pinojs/pino

Very low overhead Node.js logger.

Events

PromCon EU 2023

PromCon EU 2023 is the eighth conference fully dedicated to the Prometheus monitoring system. It will take place 2023-09-28 & 2023-09-29 (Thu & Fri) in Berlin as a single-track event with space for 300 attendees.

Job Opportunities

Senior Site Reliability Engineer at Assured (US Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor