Book Review: SRE – Site Reliability Engineering

6/10 – Overall. 8/10 for early chapters, 4/10 for later chapters.
The first 100 pages were excellent, but the later chapters were a mixed bag, partly due to rotating authors. I skim-read the later chapters as they mostly covered a broad spectrum of loosely related topics.
Chapters that covered topics I interact with were too shallow to interest me, while many other chapters were not relevant to me. Perhaps if I were an SRE rather than a developer I would have found the entire book better.

Key Takeaways for Me

Every large firm I’ve worked at has been structured incorrectly and had the wrong metrics for measuring stability.
In banks, the production support team has typically been tasked with “zero outages” whilst the developers are incentivised to develop and release as quickly as possible, with some front-office “quant-devs” not being held accountable for stability at all. The handover between the two amounts to throwing the release over a wall.
This book suggests a much better approach:
Rather than trading pace against stability, agree a global “Error Budget” target for everyone, using SLOs/SLIs. If the target is not met, responsibility for the service can move back and forth between DEV and SRE. Importantly, the target should have an upper and a lower bound, e.g. between 99.8% and 99.9% uptime; it should NOT be an absolute. If you go above it, developers should be taking more risks; if you fall below it, developers should work on stability.
100% is the wrong reliability target. I always intuitively knew this, but the book provided useful arguments, e.g. if you build a 100% reliable service but the users’ wifi is 99% reliable, you have wasted a lot of effort that users could never benefit from and that took time away from other work.
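To make the upper/lower bound idea concrete, here is a minimal sketch of how a team might map measured uptime onto the target band. The function name `release_posture`, the returned strings and the default bounds (taken from the 99.8% to 99.9% example above) are my own illustration, not something the book prescribes.

```python
def release_posture(measured_uptime: float,
                    lower: float = 0.998,
                    upper: float = 0.999) -> str:
    """Suggest a posture given measured uptime and a target band.

    All names and thresholds here are hypothetical illustrations.
    """
    if not 0.0 <= measured_uptime <= 1.0:
        raise ValueError("measured_uptime must be a fraction between 0 and 1")
    if measured_uptime > upper:
        # More reliable than anyone asked for: spend the slack on velocity.
        return "take more risk"
    if measured_uptime < lower:
        # Below the band: pause risky changes and work on stability.
        return "focus on stability"
    return "on target"


for uptime in (0.9995, 0.9985, 0.9970):
    print(f"{uptime:.2%} -> {release_posture(uptime)}")
```

The point of the band is that both branches trigger a change in behaviour, not just the lower one.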

Book Notes

Note: the full book is actually available online.
An outage is NOT a bad thing, it is an expected part of innovation.

Monitoring

Alerts – Immediate human action required
Ticket – Human action required within a few days to prevent damage
Logging – For forensics/diagnostics only
MTTF – Mean Time To Failure
MTTR – Mean Time To Repair
Humans add latency. Repair speed (MTTR) is critical to availability -> automation is best.
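A rough way to see why repair time matters so much is the standard steady-state formula availability = MTTF / (MTTF + MTTR). The sketch below uses made-up numbers (one failure every 30 days, a 2 hour human repair versus a 6 minute automated recovery) purely as an illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and to repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: the system fails on average every 720 hours (30 days).
manual = availability(720, 2.0)      # a human is paged and takes 2 hours
automated = availability(720, 0.1)   # automation recovers in 6 minutes

print(f"manual repair:    {manual:.4%}")
print(f"automated repair: {automated:.4%}")
```

With the failure rate held constant, shortening the repair time alone moves the service from under three nines to almost four.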

Google Specific Terms

Campus > Data centre > cluster > row > rack > server
Borg – Automates resources for applications
Chubby – Uses Paxos to provide global locks
Users -> GFrontEnd -> AppFrontEnd -> AppBackEnd -> DB (all coordinate via Load Balancer / DNS)

Embrace Risk

Time Availability = uptime / (uptime+downtime)
Aggregate Availability = successful Requests / Total Requests
This metric is more useful when there are regional outages etc.
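The two definitions can give quite different answers for the same incident. Below is a small sketch with invented numbers: one region out of two is down for 1 hour in a 24 hour day, and the request counts are hypothetical.

```python
def time_availability(uptime_s: float, downtime_s: float) -> float:
    """Time-based availability: uptime / (uptime + downtime)."""
    total = uptime_s + downtime_s
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_s / total


def aggregate_availability(successful: int, total: int) -> float:
    """Request-based availability: successful requests / total requests."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    return successful / total


# If the 1 hour regional outage is counted as the whole service being "down":
by_time = time_availability(uptime_s=23 * 3600, downtime_s=1 * 3600)

# Hypothetical traffic: 1,000,000 requests that day, 10,000 failed in the
# affected region while every other request succeeded.
by_requests = aggregate_availability(successful=990_000, total=1_000_000)

print(f"time-based:    {by_time:.2%}")
print(f"request-based: {by_requests:.2%}")
```

The time-based figure forces a yes/no decision about whether the service was "up", whereas the request-based figure reflects what users actually experienced.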
There are different types of failure
- Global outages, regional outages
- Full outages, partial functionality
- Choose which you want
Error Budget = Control loop to manage release velocity
Error Budget – Aligns incentives
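As a sketch of the control loop, assume the budget is counted in failed requests over some period and releases are simply gated on whether any budget is left. The class `ErrorBudget`, its fields and the gating rule are hypothetical; a real policy would be agreed between DEV and SRE.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one period."""
    slo: float             # target success ratio, e.g. 0.999
    total_requests: int    # requests served in the period
    failed_requests: int   # requests that failed in the period

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves unclaimed.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0  # a 100% target leaves no budget at all
        return 1.0 - self.failed_requests / allowed

    def can_release(self) -> bool:
        # Control loop: keep shipping only while budget remains.
        return self.remaining_fraction > 0.0


healthy = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
overspent = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=1_200)

print(healthy.can_release(), round(healthy.remaining_fraction, 2))
print(overspent.can_release(), round(overspent.remaining_fraction, 2))
```

Because developers and SREs read the same number, a spent budget slows releases without anyone having to argue about it.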

SLOs

SLI – Service Level Indicators – Measure a level of service e.g. latency/availability
SLO – Service Level Objective – A target value or range of values for a level of service, as measured by an SLI, e.g. average response <100ms
SLA – Service Level Agreement – agreed with customers, including consequences for missed SLOs
Choosing Targets:
- Don’t base it on current performance (it could be way off)
- Keep it simple
- Have as few as possible
- Keep a safety margin (tighter internal number)
- Don’t overachieve, each “9” is costly
Percentiles – are a better measurement than averages when the distribution has a long tail
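A quick illustration of the long-tail problem, using invented latencies where 95 requests are fast and 5 are very slow. The `percentile` helper is my own nearest-rank implementation, chosen to keep the arithmetic easy to follow; monitoring systems may interpolate differently.

```python
import math
import statistics


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical sample: 95 requests at 20 ms, 5 requests at 2000 ms.
latencies_ms = [20] * 95 + [2000] * 5

print("mean:", statistics.mean(latencies_ms))      # 119
print("median:", statistics.median(latencies_ms))  # 20
print("p99:", percentile(latencies_ms, 99))        # 2000
```

The mean of 119 ms describes no request that actually happened, while the median and the 99th percentile together show that most users are fine and a few are having a terrible time.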

Toil

-> Manual repetitive work devoid of enduring value, that could be automated
Toil = Lower morale, career stagnation, slower progress
Some amount of toil is unavoidable and can even be calming

Automation

Automation allows super-linear scaling of users vs human effort.

Levels of automation:

Fully automated – DB self identifies problem and preemptively resolves it
Internally maintained – System Specific – script shipped with database
Externally maintained – Generic – shared DB recovery script
Externally maintained – System Specific – A script on someone’s desktop
No Automation

Simplicity

Less code = Less maintenance
Simplicity = Stability

The later chapters held less of interest.
“You want a data recovery system NOT a data backup system.”

SRE Engagement Model – Not all services require SRE attention, as some don’t need high reliability and availability. Those teams are given advice and documentation instead.

Published by ryanh878
