Issue 099
This issue is sponsored by:
Data-Driven Guide to Engineering Leadership
Ship faster because you know more, not because you're rushing. Get actionable insights from 7 million commits and 85,000+ software engineers to increase your team's velocity. Free Guide
Latest Articles on monitoring.love
Real World DevOps: Observability in Mega-Scale Banking with Greg Parker
Ever thought hard about your company's observability strategy and the challenges you're facing? What about if your company spanned 70 countries, 90,000+ employees, and you were a bank? My guest certainly thinks about this regularly. In this episode, I speak with Greg Parker, the head of the Enterprise Monitoring Services team at Standard Chartered Bank, about what it takes to design and implement a global monitoring strategy in a complex environment.
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I'm also there, but I'm sure everyone else is way cooler.
From The Community
Chaos Engineering Observability: Q&A with Russ Miles
There's a new O'Reilly ebook out, sponsored by the folks at Humio, about chaos engineering and observability.
Scaling up reporting on high-cardinality metrics
For those of you working on high-volume backend systems, you'll like this article from the folks at Segment.
How We Prepared New York Times Engineering for the Midterm Elections
A great read about exactly what it sounds like.
How We Built an Automated Anomaly Detection System onto a Streaming Pipeline
A look under the hood of some interesting Salesforce engineering.
The Four Agreements of Incident Response
There are some gems in here, but my personal favorite is this one: "Don't litigate incident severity during the call. It's a waste of time. By the time you're done discussing whether it's a SEV-1 or SEV-2, it will definitely have become a SEV-2. Best practice: If you can't decide whether it's a SEV-1 or SEV-2, always assume it's the higher severity option and move on."
How to Use InfluxDB's Holt-Winters Function for Predictions
The final part in a three-part series on Holt-Winters predictive functionality in InfluxDB.
Resilience Roundup - Learning From Organizational Incidents: Resilience Eng
From my good friend Thai's Resilience Roundup: "In this study, a lot of employees said that the accidents happen anywhere from 0 to 5 times a year, but at the same time, almost everyone said that small accidents or incidents were happening all the time. The operators in this company had normalized risk to such a degree that things like getting burned or getting acid in their eyes counted to them as only a minor incident."
The story of Knight Capital is an interesting one (which you can read about here), and this thread by John Allspaw points out some hypocrisy/hindsight bias among the peanut gallery as it relates to both Knight Capital's story and the NY Stock Exchange halting in 2015 for similar reasons.
Tools
csabapalfi/awesome-web-performance-metrics: List of awesome web performance
A whole bunch of web performance metrics (and what they mean) and tools for collecting+analyzing them.
Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications
From the article, "UltraBrew Metrics can operate at millions of requests per second per JVM without measurably slowing the application down. We currently use the library to instrument multiple applications at Verizon Media, including one that uses this library 20+ million times per second on a single JVM."
AWS Inter-Region Latency Monitoring
Someone had the great idea of setting up nodes in a bunch of AWS regions and measuring latencies between them. Very cool.
This issue is sponsored by:
Why Distributed Tracing Will Be So Important in 2019
Distributed tracing helps you troubleshoot problems you don't even know you have, and it's only going to become more important as your software gets more complex.
Events
The CFP is now open for Datadog's DashCon.
See you next week!
- Mike (@mike_julian), Monitoring Weekly Editor