SRE – Site Reliability Engineering – BOOK

6/10 – Overall.    8/10 for early chapters, 4/10 for later chapters.
The first 100 pages were excellent but the later chapters were a mixed bag, partially due to rotating authors. I skim-read the later chapters as they mostly focussed on a broad spectrum of loosely related topics.
Chapters that covered topics I interact with were too shallow to interest me, while many chapters were not of interest to me. Perhaps if I were an SRE rather than a developer I would have found the entire book better.

Key Takeaways for Me

  1. Every large firm I’ve worked at has been structured incorrectly and had the wrong metrics for measuring stability.
    In banks, the production support team has typically been tasked with “zero outages” whilst the developers are incentivised to develop and release as quickly as possible, with some front-office “quant-devs” not being held accountable for stability at all. The handover method looks like throwing the release over a wall.
  2. This book suggests a much better approach:
    Rather than pace vs stability, agree a global “Error Budget” target for everyone, using SLOs/SLIs that, if not met, can result in moving responsibilities back and forth between DEV-owned and SRE-owned. Importantly, the target (e.g. between 99.8% and 99.9% uptime) should have an upper and lower bound; it should NOT be an absolute. If you go above it, developers should be taking more risks; if you fall below it, developers should work on stability.
  3. 100% is the wrong reliability target. I always intuitively knew this but the book provided useful arguments, e.g. if you build a 100% reliable service but the user’s wifi is 99% reliable, you wasted a lot of effort that users could never benefit from and that took time away from other work.
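
The banded target in point 2 can be sketched as a small decision function. This is only an illustration: the function name, the return strings and the default 99.8%/99.9% bounds (taken from the example above) are my own assumptions, not anything the book prescribes.

```python
def release_policy(measured_availability: float,
                   lower_bound: float = 0.998,
                   upper_bound: float = 0.999) -> str:
    """Suggest a focus based on where measured availability sits in the band.

    All names and default bounds here are hypothetical, for illustration only.
    """
    if not 0.0 <= measured_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if measured_availability > upper_bound:
        # More reliable than agreed: the unused error budget can fund riskier releases.
        return "take more risk"
    if measured_availability < lower_bound:
        # Less reliable than agreed: slow down and work on stability.
        return "focus on stability"
    return "on target"


if __name__ == "__main__":
    for value in (0.9995, 0.9985, 0.997):
        print(f"{value:.4%} -> {release_policy(value)}")
```

The point of having two bounds is that both "too unreliable" and "too reliable" trigger a change in behaviour.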

Book Notes

Note the full book is actually available online here.
An outage is NOT a bad thing, it is an expected part of innovation.

Monitoring

  • Alerts – Immediate human action required
  • Ticket – Human action required within a few days to prevent damage
  • Logging – For forensics/diagnostics only
  • MTTF – Mean Time To Failure
  • MTTR – Mean Time To Repair
  • Humans add latency. MTTR speed critical to availability -> automation is best.
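
To see why MTTR matters so much, here is a minimal sketch using the standard steady-state relationship availability = MTTF / (MTTF + MTTR). The formula is not spelled out in my notes above, and the function name and sample numbers are hypothetical.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time to failure and to repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Same failure rate, but automation halves the repair time (made-up numbers).
    manual = steady_state_availability(mttf_hours=999, mttr_hours=1.0)
    automated = steady_state_availability(mttf_hours=999, mttr_hours=0.5)
    print(f"manual repair:    {manual:.5f}")
    print(f"automated repair: {automated:.5f}")
```

With the failure rate held constant, every reduction in repair time feeds straight into availability, which is the argument for automating the response.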

Google Specific Terms

  • Campus > Data centre > cluster > row > rack > server
  • Borg – Automates resources for applications
  • Chubby – Uses paxos to provide global locks
  • Users -> GFrontEnd -> AppFrontEnd -> AppBackEnd -> DB  (all coordinate via Load Balancer / DNS)

Embrace Risk

  • Time Availability = uptime / (uptime+downtime)
  • Aggregate Availability = successful Requests / Total Requests
    This metric is more useful when there are regional outages etc.
  • There are different types of failure
    • Global outages, regional outages
    • Full outages, partial functionality
    • Choose which you want
  • Error Budget = Control loop to manage release velocity
  • Error Budget – Aligns incentives
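
The two availability definitions above, plus a request-based error budget, can be sketched as follows. The function names are my own, and treating the budget as "allowed failed requests minus actual failed requests" is one reasonable reading rather than a definition quoted from the book.

```python
def time_availability(uptime_s: float, downtime_s: float) -> float:
    """Time availability = uptime / (uptime + downtime)."""
    total = uptime_s + downtime_s
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_s / total


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Aggregate availability = successful requests / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def error_budget_remaining(slo_target: float,
                           successful_requests: int,
                           total_requests: int) -> float:
    """Failed requests still allowed under the SLO (negative means overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - successful_requests
    return allowed_failures - actual_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO.
    print(aggregate_availability(999_500, 1_000_000))
    print(error_budget_remaining(0.999, 999_500, 1_000_000))
```

A positive remaining budget is what allows release velocity to increase; a negative one is the signal to slow down.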

SLOs

  • SLI – Service Level Indicators – Measure a level of service e.g. latency/availability
  • SLO – Service Level Objective – A range of values that is measured by an SLI e.g. average response <100ms
  • SLA – Agreement – agreed with customers, including consequences for missed SLOs
  • Choosing Targets:
    • Don’t base it on current performance (it could be way off)
    • Keep it simple
    • Have as few as possible
    • Keep a safety margin (tighter internal number)
    • Don’t overachieve, each “9” is costly
  • Percentiles – are a better measurement than averages when there is a long tail
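
A small sketch of why percentiles beat averages with a long tail. The latency numbers are made up, and `percentile` is a hypothetical helper using the simple nearest-rank method (other percentile definitions interpolate and give slightly different values).

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical sample: 95 fast requests and 5 very slow ones.
    latencies_ms = [50] * 95 + [2000] * 5
    print("mean:", sum(latencies_ms) / len(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p99: ", percentile(latencies_ms, 99))
```

Here the mean describes no real request: most users see 50ms and the unlucky few see 2000ms, which only the high percentile exposes.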

Toil

  • -> Manual repetitive work devoid of enduring value, that could be automated
  • Toil = Lower morale, career stagnation, slower progress
  • Some amount of toil is unavoidable and can even be calming

Automation

Automation allows super-linear scaling of users vs human effort.

Levels of automation:

  1. Fully automated  – DB self identifies problem and preemptively resolves it
  2. Internally maintained – Generic – script shipped with database
  3. Externally maintained – Generic – shared DB recovery script
  4. Externally Maintained – System Specific – A script on someone’s desktop
  5. No Automation

Simplicity

  • Less code = Less maintenance
  • Simplicity = Stability

The later chapters held less of interest.
“You want a data recovery system NOT a data backup system.”

SRE Engagement Model – Not all services require SRE attention as they don’t need high reliability and availability. Those teams get given advice and documentation.

Accelerate -The Science of Lean Software and Devops – Book

Overall 8/10 – Good book that presents good ideas and clear evidence for why.
I was aware of slightly over half the best practices from this book but not all of them have been adopted by large firms. I picked up a few actions I’d take away but really the usefulness in this book may be in presenting it as evidence to try and drive change in others.


Book Notes:

Measuring Performance:

  • Use capabilities to measure performance not maturity levels as maturity suggests mission complete.
  • (Scrum) Velocity is only a capacity planning tool
  • Utilization isn’t the correct measure, it should not be 100%
  • Should measure global outcome to ensure teams are not pitted against each other
  • Software Delivery Performance Depends on:
    • Lead time
    • Deployment Frequency
    • Mean Time To Restore
    • Change Fail %
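
As a rough sketch, the four measures above could be computed from a log of deployments. The `Deployment` record, its field names and the choice to time restoration from the moment of deployment are all assumptions of mine, not definitions taken from the book.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def delivery_metrics(deployments: list[Deployment], period_days: float) -> dict:
    """Summarise lead time, deployment frequency, time to restore and change fail %."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_hours = [(d.deployed_at - d.committed_at).total_seconds() / 3600
                  for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: the outage starts at deployment and ends at restored_at.
    restore_hours = [(d.restored_at - d.deployed_at).total_seconds() / 3600
                     for d in failures if d.restored_at is not None]
    return {
        "median_lead_time_hours": median(lead_hours),
        "deploys_per_day": len(deployments) / period_days,
        "mean_time_to_restore_hours": mean(restore_hours) if restore_hours else None,
        "change_fail_percent": 100 * len(failures) / len(deployments),
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    hours = lambda n: start + timedelta(hours=n)
    log = [
        Deployment(hours(0), hours(2)),
        Deployment(hours(4), hours(8), failed=True, restored_at=hours(9)),
        Deployment(hours(24), hours(30)),
        Deployment(hours(26), hours(34)),
    ]
    print(delivery_metrics(log, period_days=2))
```

Reporting all four together matters: throughput measures (lead time, frequency) are balanced by stability measures (restore time, change fail %).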

Measuring and Changing Culture

  • Don’t try to change how people think, first change what people do (or change the people :))
  • Westrum Theory: Orgs with better information flow function more effectively
  1. Level 1 – Things we just know
  2. Level 2 – Culture – We can debate these within the team, e.g. importance of security
  3. Level 3 – Written artifacts and established processes

Culture Types:

  1. Pathological – based on power
  2. Bureaucratic – based on rules
  3. Generative – based on performance

Continuous Delivery

Key Principles

  1. Build quality in
  2. Work in small batches
  3. Automate repetition
  4. Relentlessly pursue continuous improvement
  5. Everyone is responsible
  6. Foundations:
    1. Comprehensive config management
    2. Continuous Integration – Small daily branch merges
    3. Continuous Testing

What Works:

  • Version control
  • Test Automation
  • Test data management
  • Trunk based development

Architecture

Goal is loose coupling to ensure bandwidth between teams isn’t swamped with implementation details.
Can the team, by itself and without speaking to outsiders:
– Change architecture significantly
– Do a deployment? now? during business hours? anytime?

Critical = Testability and Deployability
Systems are loosely coupled and can be developed and validated independently.

Management Practices

Components of Lean Management

  • Limit work in progress
  • Visual Management
  • Feedback from production
  • Lightweight change approvals

CAB – doesn’t work to increase stability!
External approvals are negatively correlated with lead time, deploy freq. and restore time.
Lean Management <-> Software delivery performance, becomes a virtuous cycle.
Lean: Build -> Measure -> Learn

Capabilities

  • Small batches
  • Flow of work from requirements to user known by team
  • Actively seek user feedback
  • Authority to create/change specs during dev without approval

Sustainable

  • Invest in employee development
  • Foster supportive work environment (no blame)
  • Ask employees what’s preventing them from achieving their objectives
  • Give time to experiment and learn

Factors Causing Employee Burnout:

  • Work overload
  • Lack of control
  • Insufficient rewards
  • Community breakdown
  • Unfairness
  • Value conflicts

Transformational Leadership

  • Vision – Clear understanding of where to be in 5 years
  • Inspiring Communication – Says things that make employees proud to be part of the org
  • Intellectually Stimulates – Challenges my assumptions, makes me rethink principles
  • Supportive – Considers and acts to benefit my feelings
  • Personal Recognition – Commends me when I do a good job

 

Key Takeaways for Me:

  1. Most of the suggestions from other books I’ve read and that I had seen work myself were correct. The large survey conducted by these authors gives me the evidence to back up my opinions.
  2. Action: In my current work, we need to find a way to get the 3 critical measurements improved. Increased release frequency and lower-overhead change management would seem to offer the best reward for the effort.
  3. The importance of loosely-coupled architecture gives me a clearer way to conceptualise interactions between teams and why it’s important. (limited bandwidth)

 

Andrew Grove – High Output Management – Book

Overall 10/10. A short concise book with a number of good ideas that anyone working in a large/medium or perhaps even a small company would benefit from. So good, that I re-read parts of it twice already.


Book Notes:

Andy uses the analogy of making a breakfast to show how process->assemble->test is a common workflow at any scale. For me many of his points in manufacturing paralleled topics from lean/agile software development.

Inspection Methods:
In-process inspection e.g. thermometer for boiling eggs
Receiving inspection e.g. inspecting eggs on delivery, parallel in software: validating user requirements
Each step takes time/effort and adds value, so it is best to eliminate waste at the earliest stage.

Picking Good Indicators: Indicators chosen dictate our direction.
Pairing Indicators – prevent overfitting, e.g. pair scrum points completed with business value delivered. Quantitative and qualitative indicators often pair well.
– black-box = Just measure in/output
– Cut a hole in the box to get leading indicators. Can look at a linear indicator (graph) or a stagger chart to ensure going at correct speed or to allow estimation.

Controlling Future Output – Goal should be to keep inventory at the earliest, lowest-cost stage.
1. Build to order
2. Build to forecast

Assuring Quality
– Act as a gate – Inspect everything, push back rejects
– Monitoring step – Inspect some, stop the line on problems
– Variable inspection is best – Too regular = expected, too few = gaps, no problems = inspect less, problems = inspect more, dive into one at random.
I thought some of this may be very applicable to software. Consider for example pull request reviews, user stories or releases: how often should senior developers check that requirements have been gathered adequately, that Jira tickets are well written, or inspect juniors’ code? Though on that last point of managing people, Andy very much suggests focussing on TRM – Task Relevant Maturity.

Misc
Nudges
– Most of a manager’s day is gathering information. After that there are a few direct decisions but often it will be a case of nudges: gently influencing items in the direction you think best. Note that, combined with Andy’s other point that (Manager’s Output) = (Output of his Org) + (Output of neighbouring Orgs), this nudging of nearby teams can be very powerful for the company overall.

Hybrid Organisations
Organisations come in two extreme forms:
Mission Oriented = Small hedge fund where everyone’s goal is to make money
Functional Oriented = IT Support within a bank, whose goal is to deliver IT assistance.
The hybrid organisational form is inevitable. The company I think of when considering these concepts is McDonald’s. McDonald’s will have a global marketing department ensuring consistency of branding but it will also have regional departments deciding which offers to run. Similarly some resources such as packaging will be produced in a shared department while regional specialities can be run at a much lower level. There will always be a conflict between the goals of these different groups but, similar to democracy being the least-worst form of government, hybrid organisations appear to be the least-worst method of organisation.

A related concept is Dual Reporting – Individuals can be individual contributors within one team e.g. coders but at other times contribute firm-wide as part of standards committees etc. This allows using their skill sets to the fullest.

…One-to-ones, meetings and a number of other topics were also covered in the book.

Key Takeaways for Me

I really like the idea of paired indicators. Once you’ve heard the concept it’s easy to see other teams doing it wrong and only looking at one indicator, e.g. within large finance firms there are change management or support teams whose job is to ensure stability of all systems, and the metric they will almost always look at is outages and their severity. You will typically see a line or bar chart over time reported or perhaps a breakdown by team. A high number of outages by a particular team can result in their releases being frozen. I would suggest this metric should be weighted against business value delivered. Does it matter if a system crashes badly once a week if it prints money the rest of the time? Compare that to, say, an accounting system that never fails but delivers little additional value.

What work should a manager do? The manager is responsible for overall team delivery. Therefore a manager should work on the highest leverage item. I think Andy is right that training is an extremely high leverage activity given the return over time. Training someone now will lead to higher quality output later. Higher quality output = fewer quality checks are required and staff are happier. The importance of training does however make you wonder about the impact of the rapidly increasing employee turnover in some countries and the increased reluctance of firms to invest in training.

TRM – Task Relevant Maturity. Either someone refuses to work or can’t do the work. Their ability to do the work will depend on their Task Relevant Maturity. When considering how to assign and monitor tasks, the key metric to keep in mind is TRM.

Summary

As I said at the start, a really good book. Some of his ideas I will have to take time to digest and consider how they would change my approach to certain work. Andy even included a todo list for once you’ve finished the book that I have partly completed. The parts of the book that I was more sceptical of, I plan to force myself to consider more fully. Given how knowledgeable and accurate Andy was in some areas I have experience of, I shouldn’t dismiss him in the areas that seem to me more dubious.

US Political Books

On holiday I took an unusual diversion to read three US political books:

Reagan – Was very biased in favour of Reagan and what a great job he had done.

James Comey – Felt like James’s honest version of the truth as he saw it, very anti-Trump. Some of the situations presented and how everyone tries to manage events to get what they want are interesting.

McCain – Was probably the most balanced book with a few more interesting stories. The best of the bunch.

The Five Dysfunctions of a Team – Book

Overall 5/10 – An OK book with little surprising content and an OK story.


I think the core content is true, but the narrative that the author tried to use to deliver his points was thin and didn’t resonate with me.

The one thing this book reminded me of was the interesting research Google performed analysing their teams. Over “two years we conducted 200+ interviews with Googlers (our employees) and looked at more than 250 attributes of 180+ active Google teams”. Interestingly, if you look at their five points (listed 1-5), they closely parallel this book (image below).

  1. Psychological safety: Can we take risks on this team without feeling insecure or embarrassed?
  2. Dependability: Can we count on each other to do high quality work on time?
  3. Structure & clarity: Are goals, roles, and execution plans on our team clear?
  4. Meaning of work: Are we working on something that is personally important for each of us?
  5. Impact of work: Do we fundamentally believe that the work we’re doing matters?

Key Takeaways

five-dysfunctions-of-a-team-levels.png

Each level relies on the one below. A team must first have trust, then no fear of conflict, then commitment to team goals, then hold themselves accountable and be committed to team results.

Quotes that float in my Mind

A colleague asked me to name books or articles that have influenced my thinking (great question). However my mind works, I seem to store ideas from multiple places but associate them with short quotes. Here are some of my favourites:

  • You only learn by listening, not by talking.
    • The best thing you can do to make a friend is to really listen to someone.

Software:

  • C.A.R. Hoare – “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”
  • Leslie Lamport – “State the problem before describing the solution”
  • Often when dealing with messy code or 3rd party teams that aren’t delivering or need help, I try to think to myself “be the change” and try to make a difference. It is a quote commonly attributed to Gandhi. I had one colleague who would jibe back “Don’t be a hero”, arguing that the soldier with his head over the parapet gets shot. He felt triumphant when he realised Gandhi was assassinated. I still try to be the change 🙂
  • Tom Cargill – “The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.” This is why I have to take a deep breath when someone says “oh, that’s just one line of code”; rarely, if ever, is it one line of code. Especially on a legacy system with no tests and a business that specifies unclear demands.
  • There are two types of companies/people. Those.. “that sees quality and efficiency as opposing forces, or one that sees them as inseparable”. If you rush to push things out quickly now, the overall delivery speed will be slower. – HN post.
  • Fred Brooks – “The bearing of a child takes nine months, no matter how many women are assigned.”
  • Donald Knuth – “premature optimization is the root of all evil (or at least most of it) in programming.” Until you have a benchmark showing where time is being taken and that this area is the constraint we will not be rewriting this currently simple clear code.
  • Brian Kernighan – “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
  • – “There is no silver bullet that’s going to fix that. No, we are going to have to use a lot of lead bullets.”
  • Martin Fowler – “If it hurts, do it more often”

Work in General

  • If you accept sh*t and shovel sh*t, don’t complain when people bring you more sh*t to shovel. You brought it on yourself. – James.
  • The best way to challenge something, may not be to oppose it but to ask “why” it’s being done.

Misc

  • The power of habit – “Practice makes Permanent” – Practice by itself will not make you better, you must reflect and get feedback on the practice you’ve done and consider how you can improve.

The Checklist – Atul Gawande – Book

Overall 6/10 – Good but the few good ideas didn’t justify the book size, some parts felt like filler.


I liked the idea of this book as a number of processes that I am responsible for involve a long complicated process with many steps of varying difficulty that a developer is likely to forget and I thought the ideas from this book may help. Unfortunately the takeaways do not seem to carry over from medicine to software development.