How To Improve On-Call

Many people bemoan their on-call, and certainly, on-call can seriously suck. Everything from constant alerts to burnout makes on-call duty particularly difficult.

But it doesn’t have to be that way. Here are some of the best resources around for how to make your on-call much better – pleasurable, even!

Articles

On-call doesn’t have to suck by Cindy Sridharan/@copyconstruct

On being on call by Silvia Botros/@dbsmasher

The On-Call Handbook by Alice Goldfuss

No More On-Call Martyrs by Alice Goldfuss

Developers On Call by Josh Barton

Increment Magazine Issue #1: On-Call by Ryn Daniels, Sytse “Sid” Sijbrandij, and Increment staff

On-Call and Incident Response: Lessons for Success, the New Relic Way by Beth Long

PagerDuty’s Incident Response Training Manual

Talks & Videos

Volunteers, Not Conscripts: Fixing Out-of-Hours On-Call by Brian Scanlan

There is also an accompanying article to this talk: How we fixed our on call process to avoid engineer burnout

A story of being on call by Charity Majors

Martyrs on Film: Learning to hate the #oncallselfie by Alice Goldfuss

Keys to SRE by Ben Treynor

Optimizing Ops For Happiness by Jesse Newland

Books

Site Reliability Engineering: How Google Runs Production Systems

Authors: Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

From the cover:

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four sections:

Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices—Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
Management—Explore Google’s best practices for training, communication, and meetings that your organization can use

Read for free on the book’s homepage

Buy from Amazon (affiliate link)

The Site Reliability Workbook: Practical Ways to Implement SRE

Authors: Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne

From the cover:

In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.

Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.

You’ll learn:

How to run reliable services in environments you don’t completely control—like cloud
Practical applications of how to create, monitor, and run your services via Service Level Objectives
How to convert existing ops teams to SRE—including how to dig out of operational overload
Methods for starting SRE from either greenfield or brownfield

Buy from Amazon (affiliate link)

Seeking SRE: Conversations About Running Production Systems at Scale by David Blank-Edelman/@otterbook

From the cover:

Organizations—big and small—have started to realize just how crucial system and application reliability is to their business. At the same time, they’ve also learned just how difficult it is to maintain that reliability while iterating at the speed demanded by the marketplace. Site Reliability Engineering (SRE) is a proven approach to this challenge.

SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O’Reilly book that described Google’s creation of the discipline and the implementation that has allowed them to operate at a planetary scale. Inspired by that earlier work, this book explores a very different part of the SRE space.

The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now. Listen as engineers and other leaders in the field discuss different ways of implementing SRE and SRE principles in a wide variety of settings; how SRE relates to other approaches like DevOps; the specialities on the cutting edge that will soon be common place in SRE; best practices and technologies that make practicing SRE easier; and finally hear what people have to say about the important, but rarely discussed human side of SRE.

Buy from Amazon (affiliate link)