Thanks for joining us for another issue of Monitoring Weekly! We seem to have hit on a theme this week, which we think you’ll quite enjoy: incident response and management.

Monitoring News, Articles, and Blog Posts

Practical Services Monitoring with Prometheus and Docker

A pragmatic look at how one company uses Prometheus to monitor their Docker Swarm cluster, from the Exporters used with containers to the custom service discovery methods employed to make “Federated” Prometheus work in their architecture. Most importantly, they conclude with some hard-won lessons and necessary improvements for deploying Prometheus in production environments.

Monitoring Google Compute Engine metrics

Another series of articles from Datadog covering a specific application or platform and how best to monitor it in production. This time they take a look at Google Compute Engine (GCE), identifying its key metrics and how to interpret them correctly.

Avoiding Incident Response Bottlenecks

This article goes through a list of common “bottlenecks” in Incident Response. More colloquially, we all know these as “the things that make me sad”: poor planning, shifting priorities, excessive alerting, and more.

Setting up a Web API for Success

Lessons learned from building and running production APIs at Salesforce.com. Much of the article is devoted to the design, documentation, and testing of RESTful services, but the author concludes with a solid look at logging and telemetry within your API, and how to visualize and alert on these measurements effectively.

Etsy’s Debriefing Facilitation Guide for Blameless Postmortems

No Incident Response framework is complete without a postmortem, and the folks at Etsy have made a name for themselves with blameless postmortems. They published their full guidelines late last year on how to conduct a blameless postmortem, including meeting structure and our personal favorite, how to ask great questions that surface the real factors involved in an incident.

Why I joined InfluxData

Ryan Betts, founding developer and CTO of VoltDB, has joined InfluxData to lead the team behind InfluxDB. This doesn’t coincide with any big product announcements but it suggests that InfluxDB will continue to be the driving force behind their TICK stack.

PagerDuty Incident Response Documentation

While we were thinking about this week’s Incident Response theme, PagerDuty’s internal IR guidelines came to mind. PagerDuty open-sourced these back in January of 2017 and we think they definitely deserve a revisit. The guide covers a range of topics: principles of on-call and its responsibilities (and the not-responsibilities!), alert design, incident management guidelines, and defined incident roles (such as Incident Commander, Scribe, and Communication Liaison). If your team wants to handle incidents in a more structured manner, this is a must-read.

Searching for needle in haystack

We often hear about how companies completely overhauled a service and saw big improvements. What we don’t usually get to see is the nitty-gritty behind these efforts. This article goes through the specifics of designing and implementing a sizable Elasticsearch setup for a unique deployment: document management for the State of Goiás, Brazil’s Justice Prosecutor’s Office.

Monitoring Docker Containers: What Does it Take to Get Started?

If you’re running Docker, you know how frustrating it can be to get visibility into your containers. This article takes you through using collectd to monitor Docker containers, as well as how to directly instrument the code running inside the containers.

Incident Management and the Incident Complexity Framework

In one of our all-time favorite talks on Incident Response, Curt Micol presents the Incident Complexity Framework at Monitorama 2015. Developed at Heroku, this framework for incident management was modeled after FEMA’s own National Incident Management System. He explains the history leading up to its development and then how it later helped the engineering and support teams at Simple navigate a complicated transition between financial service providers.

Tools

Superset: Scaling Data Access and Visual Insights at Airbnb

Back in March 2016, Airbnb open-sourced Superset, a data visualization tool for querying a variety of analytical and time series data sources. This past February they added SQL Lab, a new IDE allowing users to construct visualizations from arbitrary SQL queries. We recommend checking out the demo gif in the article to see how it works – it’s pretty slick.

Elastic Stack 5.3.0 Released

The Elastic Stack has a new release, touting a ton of new features, including encryption-at-rest.

Thanks for joining us, folks! If you like what you’ve seen, invite your friends and colleagues! As always, if you have interesting articles, news, events, or tools to share, send them our way by emailing us (just reply to this email).

See you next week!

– Mike (@mike_julian) & Jason (@obfuscurity), Monitoring Weekly curators