Issue 072
This issue is sponsored by:
Getting Started with Telegraf
Telegraf is an open source, plugin-driven server agent for collecting and reporting metrics. It has plugins and integrations to source metrics directly from the system it’s running on, pull metrics from third-party APIs, or even listen for metrics via StatsD and Kafka consumer services.
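To give a flavor of that last bit, here’s a minimal sketch of an application firing metrics at Telegraf’s statsd listener over UDP. The host, port, and metric names are my own assumptions, not taken from the guide.

```python
import socket
import time

TELEGRAF_HOST = "127.0.0.1"  # assumed: Telegraf's inputs.statsd plugin running locally
TELEGRAF_PORT = 8125         # assumed: the default statsd listener port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def send_metric(line: str) -> None:
    """Fire one StatsD-formatted line (e.g. 'orders.created:1|c') at Telegraf."""
    sock.sendto(line.encode("utf-8"), (TELEGRAF_HOST, TELEGRAF_PORT))


# A counter increment and a timing measurement in StatsD line format.
send_metric("orders.created:1|c")
start = time.monotonic()
time.sleep(0.05)  # stand-in for real work
send_metric(f"orders.process_time:{(time.monotonic() - start) * 1000:.1f}|ms")
```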
Articles & News
Health Checks and Graceful Degradation in Distributed Systems
Cindy Sridharan/@copyconstruct is back with a new monster post on health checks and it’s great. Go read it.
CNCF to Host OpenMetrics in the Sandbox
Now this is some damn good news: the CNCF has accepted the OpenMetrics project into the fold. OpenMetrics, in a nutshell, aims to provide an open standard for transmitting metric data. Normally, projects like this languish before fading off into oblivion because there aren’t enough big names behind them (RIP Metrics 2.0). That isn’t true with OpenMetrics: AppOptics, Datadog, Google, InfluxData, Prometheus, Sysdig, and now the CNCF are putting their weight behind it. Sounds like good news to me.
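If you’ve never looked at the kind of exposition format OpenMetrics standardizes, here’s a tiny sketch using the prometheus_client Python library. OpenMetrics builds on Prometheus’s text format, so treat this as an approximation of the standard rather than the spec itself; the metric names are made up.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# Counters get a "_total" suffix in the exposed sample name.
http_requests = Counter(
    "http_requests", "Total HTTP requests served", ["method"], registry=registry
)
http_requests.labels(method="GET").inc()

# Prints something like:
#   # HELP http_requests_total Total HTTP requests served
#   # TYPE http_requests_total counter
#   http_requests_total{method="GET"} 1.0
print(generate_latest(registry).decode("utf-8"))
```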
Continuing the series also known as “please don’t treat building a TSDB like the intern’s summer project,” Frank Moyer of Circonus lays down some more knowledge around the complexity of designing and operating time series databases. You can catch up on Part 1 here.
Implement Django Watchman Custom Checks in NiceDay Backend
I’m a pretty big fan of the /health endpoint pattern, though sadly, it isn’t used as much as I’d like outside of the Kubernetes-based ecosystem. This article is a welcome exception: a walkthrough of using the django-watchman library to set up health checks in a Django application.
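For the curious, custom checks in django-watchman are roughly this shape: write a callable that returns a dict, wrap it in the library’s check decorator, and register it in WATCHMAN_CHECKS. The disk-space check and module path below are my own illustrative example, not something from the article.

```python
# myproject/checks.py -- hypothetical module; the disk-space check is made up.
import shutil

from watchman.decorators import check


@check
def disk_space():
    """Report whether the root volume has at least 1 GiB free."""
    usage = shutil.disk_usage("/")
    ok = usage.free >= 1 * 1024 ** 3
    return {"disk_space": {"ok": ok, "free_bytes": usage.free}}


# settings.py -- register built-in and custom checks so they show up
# under the health endpoint watchman exposes.
WATCHMAN_CHECKS = (
    "watchman.checks.databases",
    "watchman.checks.caches",
    "myproject.checks.disk_space",
)
```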
When it comes to statsd deployment patterns, there are two major ones: centralized statsd servers and local statsd servers. If you’re pushing an incredible amount of data through statsd like the folks at DoorDash are, though, you start to play with more complex deployment patterns to handle scaling concerns.
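As a rough illustration of those two patterns: from the application’s point of view, the main thing that changes is which host the client points at. The sketch below uses the statsd Python package with made-up host and metric names; it’s my own example, not DoorDash’s setup.

```python
import statsd

# Local pattern: each host runs its own statsd daemon/sidecar, so apps
# always talk to localhost and the daemon aggregates before forwarding.
local_client = statsd.StatsClient("127.0.0.1", 8125, prefix="checkout")

# Centralized pattern: every host fires UDP straight at a shared statsd
# fleet, which then becomes the thing you have to scale.
central_client = statsd.StatsClient("statsd.internal.example.com", 8125, prefix="checkout")

local_client.incr("orders.created")
with local_client.timer("orders.process_time"):
    pass  # stand-in for real work
```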
You can’t debug systems with dashboards
This interview between A Cloud Guru and Charity Majors has some great stuff in it that will get you thinking about what your future in monitoring and observability could be.
Post-mortems to the rescue – Increment: Documentation
You would think that doing post-mortems is pretty easy: gather some docs, write it up, review, done, right? In my experience, most companies are hilariously bad at effective post-mortems, not because the companies in question are bad but because this isn’t as easy as it looks. Even the basics, like “stop firing people for making mistakes,” are harder to do well than you’d first think. I love this article’s explanation and overall approach to the topic, though, and it should help you fix your post-mortem process.
You are what you benchmark: Introducing the Time Series Benchmark Suite (TSBS)
The folks behind TimescaleDB just released this gem: a TSDB benchmarking tool.
M3: Uber’s Open Source, Large-scale Metrics Platform for Prometheus
Seems like everyone is getting into the large-scale time series game these days. Here’s Uber’s release of M3, M3DB, and M3 Coordinator. Together, they make up Uber’s scaled, distributed time series system. From what I can tell, M3DB (the storage backend) has enough constraints to prevent most of you from making much use of it: namely, it can’t backfill data, and it only supports float64 values.
See you next week!
– Mike (@mike_julian) Monitoring Weekly Editor