How To Improve On-Call
Many people bemoan their on-call, and certainly, on-call can seriously suck. Everything from constant alerts to burnout makes on-call duties particularly difficult.
But it doesn’t have to be that way. Here are some of the best resources around for making your on-call much better, even pleasurable!
Articles
On-call doesn’t have to suck by Cindy Sridharan/@copyconstruct
On being on call by Silvia Botros/@dbsmasher
The On-Call Handbook by Alice Goldfuss
No More On-Call Martyrs by Alice Goldfuss
Developers On Call by Josh Barton
Increment Magazine Issue #1: On-Call by Ryn Daniels, Sytse “Sid” Sijbrandij, and Increment staff
On-Call and Incident Response: Lessons for Success, the New Relic Way by Beth Long
PagerDuty’s Incident Response Training Manual
Talks & Videos
Volunteers, Not Conscripts: Fixing Out-of-Hours On-Call by Brian Scanlan
There is also an accompanying article to this talk: How we fixed our on call process to avoid engineer burnout
A story of being on call by Charity Majors
Martyrs on Film: Learning to hate the #oncallselfie by Alice Goldfuss
Keys to SRE by Ben Treynor
Optimizing Ops For Happiness by Jesse Newland
Books
Site Reliability Engineering: How Google Runs Production Systems
Authors: Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff
From the cover:
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?
In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.
This book is divided into four sections:
- Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
- Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
- Practices—Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
- Management—Explore Google’s best practices for training, communication, and meetings that your organization can use
Read for free on the book’s homepage
Buy from Amazon (affiliate link)
The Site Reliability Workbook: Practical Ways to Implement SRE
Authors: Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
From the cover:
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
- How to run reliable services in environments you don’t completely control—like cloud
- Practical applications of how to create, monitor, and run your services via Service Level Objectives
- How to convert existing ops teams to SRE—including how to dig out of operational overload
- Methods for starting SRE from either greenfield or brownfield
Buy from Amazon (affiliate link)
Seeking SRE: Conversations About Running Production Systems at Scale by David Blank-Edelman/@otterbook
From the cover:
Organizations—big and small—have started to realize just how crucial system and application reliability is to their business. At the same time, they’ve also learned just how difficult it is to maintain that reliability while iterating at the speed demanded by the marketplace. Site Reliability Engineering (SRE) is a proven approach to this challenge.
SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O’Reilly book that described Google’s creation of the discipline and the implementation that has allowed them to operate at a planetary scale. Inspired by that earlier work, this book explores a very different part of the SRE space.
The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now. Listen as engineers and other leaders in the field discuss different ways of implementing SRE and SRE principles in a wide variety of settings; how SRE relates to other approaches like DevOps; the specialities on the cutting edge that will soon be commonplace in SRE; best practices and technologies that make practicing SRE easier; and finally hear what people have to say about the important, but rarely discussed human side of SRE.
Buy from Amazon (affiliate link)