SRE – Site Reliability Engineering – BOOK

6/10 – Overall.    8/10 for early chapters, 4/10 for later chapters.
The first 100 pages were excellent but the later chapters were a mixed bag, partially due to rotating authors. I skim-read the later chapters as they mostly focussed on a broad spectrum of not closely related topics.
Chapters that covered topics I interact with were too shallow to interst me, while many chapters were not of interest to me. Perhaps if I was an SRE rather than a developer I would have found the entire book better.

Key Takeways for Me

  1. Every large firm I’ve worked at has been structured incorrectly and had the wrong metrics for measuring stability.
    In banks, the productiodevops-wall-thrown support team has typically been tasked with “zero outages” whilst the developers are incentivised to develop and release as quickly as possible, with some front-office “quant-devs” not being held accountable for stability at all. With the handover method looking like throwing it over a wall:
  2. This book suggests a much better approach:
    Rather than pace vs stability, agree a global “Error Budget” target for everyone. using SLOs/SLIs that if not met can result in moving responstargetibilities back and forth from DEV to SRE owned. Importantly the target e.g. between 99.8% and 99.9% uptime should have an upper and lower bound, it should NOT be an absolute. If you go above it, developers should be taking more risks, below, developers should work on stability.
  3. 100% is the wrong reliability target. I always intuitively knew this but the book provided useful arguments. e.g. If you build 100% reliable but users wifi is 99% reliable, you wasted a lot of effort that users could never benefit from and that took time away from other work.

Book Notes

Note the full book is actually available online here.
An outage is NOT a bad thing, it is an expected part of innovation.

Monitoring

  • Alerts – Immediate human action required
  • Ticket – Human action required within few days to prevent damage
  • Logging – For forentsics/diagnostics only
  • MTTF – Mean Time To Failure
  • MTTR – Mean Time To Repair
  • Humans add latency. MTTR speed critical to availability -> automation is best.

Google Specific Terms

  • Campus > Data centre > cluster > row > rack > server
  • Borg – Automates resources for applications
  • Chubby – Uses paxos to provide global locks
  • Users -> GFrontEnd -> AppFrontEnd -> AppBackEnd -> DB  (all coordinate via Load Balancer / DNS)

Embrace Risk

  • Time Availability = uptime / (uptime+downtime)
  • Aggregate Availability = successful Requests / Total Requests
    This metric is more ususal when there are regional outages etc.
  • There are different types of failure
    • Global outages, regional outages
    • Full outages, partial funcitonality
    • Choose which you want
  • Error Budget = Control loop to manage release velocity
  • Error Budget – Aligns incentives

SLOS

  • SLI – Service Level Indicators – Measure a level of service e.g. latency/availability
  • SLO – Service Level Objective – A range of values that is measured by an SLI e.g. average response <100ms
  • SLA – Agreement – agreed with customers, including consequences for missed SLOs
  • Choosing Targets:
    • Don’t base it on current performance (it could be way off)
    • keep it simple
    • Have as few as possible
    • Keep a safetly margin (tighter internal number)
    • Don’t overachieve, each “9” is costly
  • Percentiles – are better measurement than averages in case of long tail

Toil

  • -> Manual repetitive work devoid of enduring value, that could be automated
  • Toil = Lower morale, career stagnation, slower progress
  • Some amount of toil is unavoidable and can even be calming

Automation

Automation allows super-linear scaling of users vs human effort.

Levels of automation:

  1. Fully automated  – DB self identifies problem and preemptively resolves it
  2. Internally maintained – Generic – script shipped with database
  3. Externally maintained – Generic – shared DB recovery script
  4. Externally Maintained – System Specific – A script on someones desktop
  5. No Automation

Simplicity

  • Less code = Less maintenance
  • Simplicity = Stability

The later chapters held less of interest.
“You want a data recovery system NOT a data backup system.”

SRE Engagement Model – Not all services require SRE attention as they don’t need high reliability and availability. Those teams get given advice and documentation.

Accelerate -The Science of Lean Software and Devops – Book

Overall 8/10 – Good book that presents good ideas and clear evidence for why.
I was aware of slightly over half the best practices from this book but not all of them have been adopted by large firms. I picked up a few actions I’d take away but really the usefulness in this book may be in presenting it as evidence to try and drive change in others.

accelerate-book

Book Notes:

Measuring Performance:

  • Use capabilities to measure performance not maturity levels as maturity suggests mission complete.
  • (Scrum) Velocity is only a capacity planning tool
  • Utilization isn’t the correct measure, it should not be 100%
  • Should measure global outcome to ensure teams are not pitted against each other
  • Software Delivery Performance Depends on:
    • Lead time
    • Deployment Frequency
    • Mean Time To Restore
    • Change Fail %

Measuring and Changing Culture

  • Don’t try to change how people think, first change what people do (or change the people :))
  • Westnam Theory: Orgs with better information flow function more effectively
  1. Level 1 – Things we just know
  2. Level 2 – Culture – We can debate these within the team, e.g. importance of security
  3. Level 3 – Written artifacts and established processes

Culture Types:

  1. Pathological – based on power
  2. Bureaucratic – based on rules
  3. Generative – based on performance

Continuous Delivery

Key Principles

  1. Build quality in
  2. Work in small batches
  3. Automate repetition
  4. Relentlessly pursue continuous improvement
  5. Everyone is responsible
  6. Foundations:
    1. Comprehensive config management
    2. Continuous Integration – Small daily branch merges
    3. Continuous Testing

What Works:

  • Version control
  • Test Automation
  • Test data management
  • Trunk based development

Architecture

Goal is loose coupling to ensure bandwidth between teams isn’t swamped with implementation details.
cohesion-coupling
Can the team by itself without speaking to outsiders:
– Change architecture significantly
– Do a deployment? now? during business hours? anytime?

Critical = Tesability and Deployability
Systems are loosely coupled and can be developed and validated independently.

Management Practices

Components of Lean Management

  • Limit work in progress
  • Visual Management
  • Feedback from production
  • Lightweight change approvals

CAB – doesn’t work to increase stability!
External approvals are negatively correlated with lead time, deploy freq. and restore time.
Lean Management <-> Software delivery performance, becomes a virtuous cycle.
Lean: Build -> Measure -> Learn

Capabilities

  • Small batches
  • flow of work from requirements to user known by team
  • Actively seek user feedbck
  • Authority to create/change specs during dev without approval

Sustainable

  • Invest in employee development
  • Foster supportive work environment (no blame)
  • Ask employees what’s preventing them from achieving their objectives
  • Give time to experiment and learn

Factors Causing Employee Burnout:

  • Work overload
  • Lack of control
  • Insufficient rewards
  • Community breakdown
  • Unfairness
  • Value conflicts

Transformational Leadership

  • Vision – Clear understanding of where to be in 5 years
  • Inspiring Communication – Says things that make employee proud to be part of org
  • Intellectually Stimulates – Challenges my assumptions, makes me rethink principles
  • Supportive – Considers and acts to benefit my feelings
  • Personal Recognition – Commends me when I do a good job

 

Key Takeaways for Me:

  1. Most the suggestions from other books I’ve read and that I had seen work myself were correct. The large survey conducted by these authors gives me the evidence to back up my opinions.
  2. Action: In my current work, we need to find a way to get the 3 critical measurements improved. Increased release frequency and lower overhead change management would seem to be the highest effort/reward.
  3. The importance of loosely-coupled architecture gives me a clearer way to conceptualise interactions between teams and why it’s important. (limited bandwidth)