Real World DevOps: Observability in Mega-Scale Banking with Greg Parker

Ever thought hard about your company’s observability strategy and the challenges you’re facing? What if your company spanned 70 countries, had 90,000+ employees, and was a bank? My guest certainly thinks about this regularly. In this episode, I speak with Greg Parker, the head of the Enterprise Monitoring Services team at Standard Chartered Bank, about what it takes to design and implement a global monitoring strategy in a complex environment.

Chaos Engineering Observability: Q&A with Russ Miles

There’s a new O’Reilly ebook out, sponsored by the folks at Humio, about chaos engineering and observability.

Scaling up reporting on high-cardinality metrics

For those of you working on high-volume backend systems, you’ll like this article from the folks at Segment.

How We Prepared New York Times Engineering for the Midterm Elections

A great read about exactly what it sounds like.

How We Built an Automated Anomaly Detection System onto a Streaming Pipeline

A look under the hood of some interesting Salesforce engineering.

The Four Agreements of Incident Response

There are some gems in here, but my personal favorite is this one: “Don’t litigate incident severity during the call. It’s a waste of time. By the time you’re done discussing whether it’s a SEV-1 or SEV-2, it will definitely have become a SEV-2. Best practice: If you can’t decide whether it’s a SEV-1 or SEV-2, always assume it’s the higher severity option and move on.”

How to Use InfluxDB’s Holt-Winters Function for Predictions

The final part in a three-part series on Holt-Winters predictive functionality in InfluxDB.

Resilience Roundup - Learning From Organizational Incidents: Resilience Eng

From my good friend Thai’s Resilience Roundup: “In this study, a lot of employees said that the accidents happen anywhere from 0 to 5 times a year, but at the same time, almost everyone said that small accidents or incidents were happening all the time. The operators in this company had normalized risk to such a degree that things like getting burned or getting acid in their eyes counted to them as only a minor incident.”

John Allspaw on Twitter: “On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that’s not the topic of this thread.

The story of Knight Capital is an interesting one (which you can read about here), and this thread by John Allspaw points out some hypocrisy/hindsight bias among the peanut gallery as it relates to both Knight Capital’s story and the NY Stock Exchange halting in 2015 for similar reasons.


csabapalfi/awesome-web-performance-metrics: List of awesome web performance

A whole bunch of web performance metrics (and what they mean) and tools for collecting and analyzing them.

Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications

From the article, “UltraBrew Metrics can operate at millions of requests per second per JVM without measurably slowing the application down. We currently use the library to instrument multiple applications at Verizon Media, including one that uses this library 20+ million times per second on a single JVM.”

AWS Inter-Region Latency Monitoring

Someone had the great idea of setting up nodes in a bunch of AWS regions and measuring latencies between them. Very cool.

