devops – Ryan Hamilton

6/10 – Overall. 8/10 for early chapters, 4/10 for later chapters.
The first 100 pages were excellent but the later chapters were a mixed bag, partly due to rotating authors. I skim-read the later chapters as they mostly covered a broad spectrum of loosely related topics.
Chapters that covered topics I interact with were too shallow to interest me, while many other chapters were not relevant to me. Perhaps if I were an SRE rather than a developer I would have found the entire book better.

Key Takeaways for Me

Every large firm I’ve worked at has been structured incorrectly and had the wrong metrics for measuring stability.
In banks, the production support team has typically been tasked with “zero outages” whilst the developers are incentivised to develop and release as quickly as possible, with some front-office “quant-devs” not being held accountable for stability at all. The handover between the two looks like throwing code over a wall.
This book suggests a much better approach:
Rather than trading pace against stability, agree a global “Error Budget” target for everyone, using SLOs/SLIs; if the target is not met, responsibility for the service can move back and forth between DEV and SRE. Importantly, the target (e.g. between 99.8% and 99.9% uptime) should have an upper and lower bound; it should NOT be an absolute. If you are above it, developers should be taking more risks; if below, developers should work on stability.
100% is the wrong reliability target. I always intuitively knew this but the book provided useful arguments, e.g. if you build a 100% reliable service but your users’ wifi is 99% reliable, you have wasted a lot of effort that users could never benefit from and that took time away from other work.
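To make the bounded target and the wifi argument concrete, here is a minimal sketch. The function names, the 99.8%/99.9% defaults (taken from the example above) and the assumption that component failures are independent are mine, not the book's.

```python
def release_posture(measured: float, lower: float = 0.998, upper: float = 0.999) -> str:
    """Map measured availability onto a bounded target (hypothetical helper)."""
    if measured > upper:
        return "take more risk"      # over-delivering: error budget is going unspent
    if measured < lower:
        return "focus on stability"  # under target: slow down and fix reliability
    return "on target"


def end_to_end(*availabilities: float) -> float:
    """Availability the user sees when every component in the chain must work.

    Assumes failures are independent, which is a simplification.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


print(release_posture(0.9995))  # above the band
print(release_posture(0.9970))  # below the band

# A "perfect" service behind 99% reliable wifi vs. a 99.9% service behind the same wifi
print(f"{end_to_end(1.0, 0.99):.5f}")
print(f"{end_to_end(0.999, 0.99):.5f}")
```

The two end-to-end figures differ by roughly a tenth of a percentage point, which is the point of the argument: the last “9” on the service is mostly invisible to the user.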

Book Notes

Note: the full book is also available to read online.
An outage is NOT a bad thing, it is an expected part of innovation.

Monitoring

Alerts – Immediate human action required
Ticket – Human action required within a few days to prevent damage
Logging – For forensics/diagnostics only
MTTF – Mean Time To Failure
MTTR – Mean Time To Repair
Humans add latency. A fast MTTR is critical to availability -> automation is best.
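A rough illustration of why repair time matters so much, using the standard steady-state relation availability = MTTF / (MTTF + MTTR). The numbers below are made up for illustration and are not from the book.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails about once a month (720 hours).
manual = availability(mttf_hours=720, mttr_hours=2.0)     # a human is paged and fixes it
automated = availability(mttf_hours=720, mttr_hours=0.1)  # automation restarts it in minutes

print(f"manual repair:    {manual:.4%}")
print(f"automated repair: {automated:.4%}")
```

With the failure rate unchanged, cutting repair time alone moves the service from roughly two nines to more than three.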

Google Specific Terms

Campus > Data centre > cluster > row > rack > server
Borg – Automates resources for applications
Chubby – Uses Paxos to provide global locks
Users -> GFrontEnd -> AppFrontEnd -> AppBackEnd -> DB (all coordinate via Load Balancer / DNS)

Embrace Risk

Time Availability = uptime / (uptime+downtime)
Aggregate Availability = successful Requests / Total Requests
This metric is more useful when there are regional outages etc.
There are different types of failure
- Global outages, regional outages
- Full outages, partial functionality
- Choose which you want
Error Budget = Control loop to manage release velocity
Error Budget – Aligns incentives
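The two availability definitions and the error budget “control loop” can be sketched as follows. This is a minimal sketch under my own assumptions: the function names and the example request counts are hypothetical, and the release gate is simply “budget remaining > 0”.

```python
def time_availability(uptime: float, downtime: float) -> float:
    """uptime / (uptime + downtime), in any consistent time unit."""
    return uptime / (uptime + downtime)


def aggregate_availability(successful: int, total: int) -> float:
    """successful requests / total requests."""
    return successful / total


def error_budget_remaining(slo: float, successful: int, total: int) -> float:
    """Fraction of the period's error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be strictly between 0 and 1; 100% leaves no budget")
    allowed_failures = (1 - slo) * total
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures


# Example: 99.9% SLO, one million requests this period, 600 of them failed.
remaining = error_budget_remaining(slo=0.999, successful=999_400, total=1_000_000)
can_release = remaining > 0  # the control loop: keep shipping while budget remains

print(f"aggregate availability: {aggregate_availability(999_400, 1_000_000):.4%}")
print(f"budget remaining: {remaining:.0%}, release allowed: {can_release}")
```

Because developers and SREs both read the same number, neither side has to argue from “zero outages” or “ship faster”; the budget decides.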

SLOs

SLI – Service Level Indicators – Measure a level of service e.g. latency/availability
SLO – Service Level Objective – A range of values that is measured by an SLI e.g. average response <100ms
SLA – Service Level Agreement – Agreed with customers, including consequences for missed SLOs
Choosing Targets:
- Don’t base it on current performance (it could be way off)
- Keep it simple
- Have as few as possible
- Keep a safety margin (tighter internal number)
- Don’t overachieve, each “9” is costly
Percentiles – A better measurement than averages when the distribution has a long tail
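A small sketch of why percentiles beat averages for a long-tailed distribution. The latency sample is invented for illustration, and the nearest-rank percentile below is one of several common definitions.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [50] * 95 + [2000] * 5

mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean} ms, p50={p50} ms, p99={p99} ms")
```

The mean describes no request that actually happened: most users saw 50 ms and a few saw 2000 ms. An SLO written against p50 and p99 captures both groups.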

Toil

-> Manual repetitive work devoid of enduring value, that could be automated
Toil = Lower morale, career stagnation, slower progress
Some amount of toil is unavoidable and can even be calming

Automation

Automation allows super-linear scaling of users vs human effort.

Levels of automation:

Fully automated – DB self identifies problem and preemptively resolves it
Internally maintained – Generic – script shipped with database
Externally maintained – Generic – shared DB recovery script
Externally maintained – System specific – A script on someone’s desktop
No Automation

Simplicity

Less code = Less maintenance
Simplicity = Stability

The later chapters held less interest for me.
“You want a data recovery system NOT a data backup system.”

SRE Engagement Model – Not all services require SRE attention as they don’t need high reliability and availability. Those teams get given advice and documentation.

Overall 8/10 – Good book that presents good ideas and clear evidence for why.
I was aware of slightly over half the best practices from this book but not all of them have been adopted by large firms. I picked up a few actions I’d take away but really the usefulness in this book may be in presenting it as evidence to try and drive change in others.

Book Notes:

Measuring Performance:

Use capabilities, not maturity levels, to measure performance, as “maturity” suggests the mission is complete.
(Scrum) Velocity is only a capacity planning tool
Utilization isn’t the correct measure; it should not be 100%
Should measure global outcome to ensure teams are not pitted against each other
Software Delivery Performance Depends on:
- Lead time
- Deployment Frequency
- Mean Time To Restore
- Change Fail %
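The four measures listed above could be computed from deployment records along these lines. This is a sketch under my own assumptions: the `Deployment` record, its field names and the sample data are hypothetical, lead time is taken as commit-to-deploy, and median/mean are my choice of aggregation.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def delivery_metrics(deployments, period_days):
    """Return the four delivery measures for one reporting period."""
    failures = [d for d in deployments if d.failed]
    restores = [hours(d.deployed_at, d.restored_at) for d in failures if d.restored_at]
    return {
        "median_lead_time_hours": median(hours(d.committed_at, d.deployed_at) for d in deployments),
        "deploys_per_day": len(deployments) / period_days,
        "mean_time_to_restore_hours": mean(restores) if restores else None,
        "change_fail_pct": 100 * len(failures) / len(deployments),
    }


# Made-up week: four deployments, one of which failed and was restored within an hour.
week = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 11),
               failed=True, restored_at=datetime(2024, 1, 2, 12)),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 15)),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 17)),
]

metrics = delivery_metrics(week, period_days=7)
for name, value in metrics.items():
    print(name, value)
```

Two of the measures track throughput (lead time, deployment frequency) and two track stability (time to restore, change fail %), so reporting them together guards against optimising one at the expense of the other.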

Measuring and Changing Culture

Don’t try to change how people think, first change what people do (or change the people :))
Westrum’s Theory: Orgs with better information flow function more effectively

Level 1 – Things we just know
Level 2 – Culture – We can debate these within the team, e.g. importance of security
Level 3 – Written artifacts and established processes

Culture Types:

Pathological – based on power
Bureaucratic – based on rules
Generative – based on performance

Continuous Delivery

Key Principles

Build quality in
Work in small batches
Automate repetition
Relentlessly pursue continuous improvement
Everyone is responsible
Foundations:
1. Comprehensive config management
2. Continuous Integration – Small daily branch merges
3. Continuous Testing

What Works:

Version control
Test Automation
Test data management
Trunk based development

Architecture

Goal is loose coupling to ensure bandwidth between teams isn’t swamped with implementation details.

Can the team by itself without speaking to outsiders:
– Change architecture significantly
– Do a deployment? now? during business hours? anytime?

Critical = Testability and Deployability
Systems are loosely coupled and can be developed and validated independently.

Management Practices

Components of Lean Management

Limit work in progress
Visual Management
Feedback from production
Lightweight change approvals

CAB – doesn’t work to increase stability!
External approvals are negatively correlated with lead time, deploy freq. and restore time.
Lean Management <-> Software delivery performance, becomes a virtuous cycle.
Lean: Build -> Measure -> Learn

Capabilities

Small batches
Flow of work from requirements to user is visible to the team
Actively seek user feedback
Authority to create/change specs during dev without approval

Sustainable

Invest in employee development
Foster supportive work environment (no blame)
Ask employees what’s preventing them from achieving their objectives
Give time to experiment and learn

Factors Causing Employee Burnout:

Work overload
Lack of control
Insufficient rewards
Community breakdown
Unfairness
Value conflicts

Transformational Leadership

Vision – Clear understanding of where to be in 5 years
Inspiring Communication – Says things that make employees proud to be part of the org
Intellectually Stimulates – Challenges my assumptions, makes me rethink principles
Supportive – Considers and acts to benefit my feelings
Personal Recognition – Commends me when I do a good job

Key Takeaways for Me:

Most of the suggestions from other books I’ve read, and that I had seen work myself, were correct. The large survey conducted by these authors gives me the evidence to back up my opinions.
Action: In my current work, we need to find a way to get the 3 critical measurements improved. Increased release frequency and lower-overhead change management would seem to offer the best reward for the effort.
The emphasis on loosely-coupled architecture gives me a clearer way to conceptualise interactions between teams and why they matter: the bandwidth between teams is limited.


Book Review: SRE – Site Reliability Engineering