Site Reliability Engineering: How Google Runs Production Systems (2016)

Niall Richard Murphy

Betsy Beyer

Chris Jones

Jennifer Petoff

Software Engineering Nonfiction

Why this book is great

Ever wonder what SLI, SLO and SLA mean? Curious about "error budget"? Not sure whether time-to-market speed or system reliability is more important? If you are a software engineer or work with software engineers at an Internet company, this book is for you. Dozens of senior DevOps engineers at Google wrote down their most important lessons in a concise, well-organized and easy-to-understand 550-page book.

We at BooksLegit think this book can teach you something new, regardless of how well you already understand DevOps. If you work on the business team, it will make you more aware of the challenges engineers face in keeping the system always on (well, 99.999% of the time).
If you're a small team without much experience running a production stack, it will introduce you to many useful DevOps best practices.
If you are already a systems engineer, it will let you see under the hood how Google designs, runs and monitors its production services.

Site Reliability Engineering is one of those rare books that can help cross-functional teams at tech companies work together better. For example, take the concept of the error budget, explained in the passage quoted below. It should help product managers space out features on their roadmap so that releases stay aligned with the overall system reliability expectation.

"The error budget stems from the observation that 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions). In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.

The business or the product must establish the system’s availability target. Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it."

Site Reliability Engineering, Chapter 1.
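
To make the arithmetic in the quote concrete, here is a minimal Python sketch of "one minus the availability target". The function names and the 30-day measurement window are our own assumptions for illustration; the book does not prescribe them.

```python
def error_budget(availability_target: float) -> float:
    """Return the error budget as a fraction: one minus the availability target."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Convert the error budget into minutes of permitted unavailability.

    The 30-day default window is an assumption made for this example.
    """
    window_minutes = window_days * 24 * 60
    return error_budget(availability_target) * window_minutes


if __name__ == "__main__":
    # Compare a few common availability targets over the same window.
    for target in (0.999, 0.9999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.3%} target -> {minutes:.2f} minutes of budget per 30 days")
```

Under these assumptions, the 99.99% service from the quote has roughly 4.3 minutes of error budget in a 30-day window, which gives product managers a tangible number to plan risky launches against.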

About the authors
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland.

Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC.

Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day.

Jennifer Petoff is a Program Manager for the Site Reliability Engineering team at Google Ireland.
