Site Reliability Engineering: How Google Runs Production Systems
Cloud & Infrastructure

Niall Richard Murphy et al. · Published 2016

The foundational text on SRE, explaining how to apply engineering principles to operations in order to build highly scalable, reliable systems.

Why It's On My Shelf

The concept of an error budget reframes reliability as a business decision rather than an engineering argument. By explicitly trading stability for speed, teams stop debating ideology and start aligning on risk tolerance. CIOs often reference this model when balancing uptime commitments against innovation pressure in cloud-native environments.
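
To make the trade-off concrete, here is a minimal Python sketch of the arithmetic behind an error budget. It assumes a request-based availability target, where the budget is simply the share of requests allowed to fail (1 minus the target). The function name, the field names, and the "ship only while budget remains" rule are illustrative assumptions, not something taken from the book.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for one measurement window.

    Hypothetical helper: slo_target is the availability objective as a
    fraction (for example 0.999), and the counts cover the same window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the number of failures the target tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining_failures = allowed_failures - failed_requests

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining_failures,
        "budget_consumed": failed_requests / allowed_failures,
        # Assumed policy: risky releases proceed only while budget remains.
        "can_ship": remaining_failures > 0,
    }


if __name__ == "__main__":
    # A 99.9% target over one million requests tolerates roughly 1,000 failures.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(report)
```

The useful property is that the decision to slow down comes from a number both sides agreed on in advance, not from a debate held during an incident.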

More from Cloud & Infrastructure

Cloud FinOps: Collaborative, Real-Time Cloud Financial Management

J.R. Storment & Mike Fuller

The definitive guide to the practice of FinOps, bringing financial accountability to the cloud's variable-spend model.
