SRE – Site Reliability Engineering – BOOK

6/10 – Overall.    8/10 for early chapters, 4/10 for later chapters.
The first 100 pages were excellent but the later chapters were a mixed bag, partially due to rotating authors. I skim-read the later chapters as they mostly focussed on a broad spectrum of not closely related topics.
Chapters that covered topics I interact with were too shallow to interest me, while many others were simply not of interest to me. Perhaps if I were an SRE rather than a developer I would have found the entire book better.

Key Takeaways for Me

  1. Every large firm I’ve worked at has been structured incorrectly and had the wrong metrics for measuring stability.
    In banks, the production support team has typically been tasked with “zero outages” whilst the developers are incentivised to develop and release as quickly as possible, with some front-office “quant-devs” not being held accountable for stability at all. The handover amounts to throwing the release over a wall to the support team.
  2. This book suggests a much better approach:
    Rather than pace vs stability, agree a global “Error Budget” target for everyone, using SLOs/SLIs. If the target is not met, responsibilities can move back and forth between DEV and SRE ownership. Importantly, the target (e.g. between 99.8% and 99.9% uptime) should have an upper and lower bound; it should NOT be an absolute. If you are above it, developers should be taking more risks; if below, developers should work on stability.
  3. 100% is the wrong reliability target. I always intuitively knew this but the book provided useful arguments, e.g. if you build a 100% reliable service but the user’s wifi is 99% reliable, you wasted a lot of effort that users could never benefit from and that took time away from other work.
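To make the banded target concrete, here is a minimal sketch of how measured availability could map to the two reactions described above. The 99.8% to 99.9% band is the example from these notes; the function names, return strings and request counts are my own illustrative assumptions, not anything taken from the book.

```python
def release_guidance(availability, lower=0.998, upper=0.999):
    """Map measured availability to the action suggested by a target band.

    The band is the 99.8%-99.9% example from the notes; the wording of the
    returned guidance is an assumption made for illustration.
    """
    if availability > upper:
        return "above band: spend error budget, take more release risk"
    if availability < lower:
        return "below band: pause risky releases, work on stability"
    return "within band: carry on"


def error_budget_remaining(total_requests, failed_requests, target=0.999):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1 - target)
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests, 500 of them failed.
print(release_guidance(0.9995))
print(error_budget_remaining(1_000_000, 500))
```

The point of the two-sided check is that being well above target is treated as a signal too, not as a success to be maximised.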

Book Notes

Note the full book is actually available online here.
An outage is NOT a bad thing, it is an expected part of innovation.

Monitoring

  • Alerts – Immediate human action required
  • Ticket – Human action required within a few days to prevent damage
  • Logging – For forensics/diagnostics only
  • MTTF – Mean Time To Failure
  • MTTR – Mean Time To Repair
  • Humans add latency. MTTR speed critical to availability -> automation is best.
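A quick worked example of why repair speed matters so much. It uses the usual steady-state relation availability = MTTF / (MTTF + MTTR), which is standard reliability arithmetic rather than something spelled out in my notes, and the failure and repair times are made-up numbers.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails about once a month (720 hours).
manual = availability(mttf_hours=720, mttr_hours=2.0)     # human paged, fixes by hand
automated = availability(mttf_hours=720, mttr_hours=0.1)  # automated failover

print(f"manual repair:    {manual:.4%}")
print(f"automated repair: {automated:.4%}")
```

With the failure rate held constant, cutting the repair time is the only lever left, which is where automation comes in.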

Google Specific Terms

  • Campus > Data centre > cluster > row > rack > server
  • Borg – Automates resources for applications
  • Chubby – Uses paxos to provide global locks
  • Users -> GFrontEnd -> AppFrontEnd -> AppBackEnd -> DB  (all coordinate via Load Balancer / DNS)

Embrace Risk

  • Time Availability = uptime / (uptime+downtime)
  • Aggregate Availability = successful Requests / Total Requests
    This metric is more useful when there are regional outages etc.
  • There are different types of failure
    • Global outages, regional outages
    • Full outages, partial functionality
    • Choose which you want
  • Error Budget = Control loop to manage release velocity
  • Error Budget – Aligns incentives
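The two availability formulas above can give quite different answers for the same incident. A small sketch, assuming a hypothetical service with four equally loaded regions where one region is down for two hours (all numbers are invented for illustration):

```python
def time_availability(uptime_hours, downtime_hours):
    """Time Availability = uptime / (uptime + downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)


def aggregate_availability(successful_requests, total_requests):
    """Aggregate Availability = successful requests / total requests."""
    return successful_requests / total_requests


# Hypothetical day: 4 equally loaded regions, one of them down for 2 hours.
regions, hours, requests_per_region_hour = 4, 24, 10_000
total_requests = regions * hours * requests_per_region_hour
failed_requests = 1 * 2 * requests_per_region_hour

# Counting the whole service as "down" whenever any region is down:
by_time = time_availability(uptime_hours=22, downtime_hours=2)
by_requests = aggregate_availability(total_requests - failed_requests, total_requests)

print(f"time-based:    {by_time:.2%}")
print(f"request-based: {by_requests:.2%}")
```

The request-based number reflects that three quarters of users were unaffected during the outage, which the time-based number cannot express.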

SLOs

  • SLI – Service Level Indicators – Measure a level of service e.g. latency/availability
  • SLO – Service Level Objective – A target value or range of values for a service level that is measured by an SLI e.g. average response <100ms
  • SLA – Agreement – agreed with customers, including consequences for missed SLOs
  • Choosing Targets:
    • Don’t base it on current performance (it could be way off)
    • Keep it simple
    • Have as few as possible
    • Keep a safety margin (tighter internal number)
    • Don’t overachieve, each “9” is costly
  • Percentiles – are better measurement than averages in case of long tail
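A small sketch of the long-tail point, using made-up latencies. The `percentile` helper is a simple nearest-rank implementation written for this example, not a function from any monitoring library.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [40] * 98 + [2000] * 2

mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean}ms p50={p50}ms p99={p99}ms")
```

Here an “average response <100ms” objective would pass even though some users wait two seconds, which is exactly what a percentile-based objective would catch.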

Toil

  • -> Manual repetitive work devoid of enduring value, that could be automated
  • Toil = Lower morale, career stagnation, slower progress
  • Some amount of toil is unavoidable and can even be calming

Automation

Automation allows super-linear scaling of users vs human effort.

Levels of automation:

  1. Fully automated  – DB self identifies problem and preemptively resolves it
  2. Internally maintained – Generic – script shipped with database
  3. Externally maintained – Generic – shared DB recovery script
  4. Externally Maintained – System Specific – A script on someone’s desktop
  5. No Automation

Simplicity

  • Less code = Less maintenance
  • Simplicity = Stability

The later chapters held less of interest.
“You want a data recovery system NOT a data backup system.”

SRE Engagement Model – Not all services require SRE attention as they don’t need high reliability and availability. Those teams get given advice and documentation.

Accelerate – The Science of Lean Software and DevOps – Book

Overall 8/10 – Good book that presents good ideas and clear evidence for why.
I was aware of slightly over half the best practices from this book but not all of them have been adopted by large firms. I picked up a few actions I’d take away but really the usefulness in this book may be in presenting it as evidence to try and drive change in others.

Book Notes:

Measuring Performance:

  • Use capabilities, not maturity levels, to measure performance, as maturity suggests “mission complete”.
  • (Scrum) Velocity is only a capacity planning tool
  • Utilization isn’t the correct measure, it should not be 100%
  • Should measure global outcome to ensure teams are not pitted against each other
  • Software Delivery Performance Depends on:
    • Lead time
    • Deployment Frequency
    • Mean Time To Restore
    • Change Fail %
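As a rough sketch of how the four measures above could be computed from a deployment log. The `Deployment` record, its field names and the sample data are all hypothetical; the book defines the measures, not this data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed: datetime                  # when the change was committed
    deployed: datetime                   # when it reached production
    failed: bool = False                 # did it cause a production failure?
    restored: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments, period_days):
    lead_times = [d.deployed - d.committed for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deployments) / period_days,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
        "change_fail_pct": 100 * len(failures) / len(deployments),
    }


# Hypothetical week with four deployments, one of which failed.
week = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 17)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 3, 9),
               failed=True, restored=datetime(2024, 1, 3, 10)),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 13)),
    Deployment(datetime(2024, 1, 5, 9), datetime(2024, 1, 5, 21)),
]
metrics = delivery_metrics(week, period_days=7)
for name, value in metrics.items():
    print(name, value)
```

Using the median for lead time is my own choice, so that one unusually slow change does not dominate the figure.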

Measuring and Changing Culture

  • Don’t try to change how people think, first change what people do (or change the people :))
  • Westrum Theory: Orgs with better information flow function more effectively
  1. Level 1 – Things we just know
  2. Level 2 – Culture – We can debate these within the team, e.g. importance of security
  3. Level 3 – Written artifacts and established processes

Culture Types:

  1. Pathological – based on power
  2. Bureaucratic – based on rules
  3. Generative – based on performance

Continuous Delivery

Key Principles

  1. Build quality in
  2. Work in small batches
  3. Automate repetition
  4. Relentlessly pursue continuous improvement
  5. Everyone is responsible
  6. Foundations:
    1. Comprehensive config management
    2. Continuous Integration – Small daily branch merges
    3. Continuous Testing

What Works:

  • Version control
  • Test Automation
  • Test data management
  • Trunk based development

Architecture

Goal is loose coupling to ensure bandwidth between teams isn’t swamped with implementation details.
Can the team by itself without speaking to outsiders:
– Change architecture significantly
– Do a deployment? now? during business hours? anytime?

Critical = Testability and Deployability
Systems are loosely coupled and can be developed and validated independently.

Management Practices

Components of Lean Management

  • Limit work in progress
  • Visual Management
  • Feedback from production
  • Lightweight change approvals

CAB (Change Advisory Board) – doesn’t work to increase stability!
External approvals are negatively correlated with lead time, deploy freq. and restore time.
Lean Management <-> Software delivery performance, becomes a virtuous cycle.
Lean: Build -> Measure -> Learn

Capabilities

  • Small batches
  • Flow of work from requirements to user known by team
  • Actively seek user feedback
  • Authority to create/change specs during dev without approval

Sustainable

  • Invest in employee development
  • Foster supportive work environment (no blame)
  • Ask employees what’s preventing them from achieving their objectives
  • Give time to experiment and learn

Factors Causing Employee Burnout:

  • Work overload
  • Lack of control
  • Insufficient rewards
  • Community breakdown
  • Unfairness
  • Value conflicts

Transformational Leadership

  • Vision – Clear understanding of where to be in 5 years
  • Inspiring Communication – Says things that make employees proud to be part of the org
  • Intellectually Stimulates – Challenges my assumptions, makes me rethink principles
  • Supportive – Considers and acts to benefit my feelings
  • Personal Recognition – Commends me when I do a good job

 

Key Takeaways for Me:

  1. Most of the suggestions from other books I’ve read and that I had seen work myself were correct. The large survey conducted by these authors gives me the evidence to back up my opinions.
  2. Action: In my current work, we need to find a way to get the 3 critical measurements improved. Increased release frequency and lower-overhead change management would seem to offer the best reward for the effort.
  3. The importance of loosely-coupled architecture gives me a clearer way to conceptualise interactions between teams and why it’s important. (limited bandwidth)

 

Andrew Grove – High Output Management – Book

Overall 10/10. A short concise book with a number of good ideas that anyone working in a large/medium or perhaps even a small company would benefit from. So good, that I re-read parts of it twice already.

Book Notes:

Andy uses the analogy of making a breakfast to show how process->assemble->test is a common workflow at any scale. For me many of his points in manufacturing paralleled topics from lean/agile software development.

Inspection Methods:
In-process inspection e.g. thermometer for boiling eggs
Receiving inspection e.g. inspecting eggs on delivery; parallel in software: validating user requirements
Each step takes time/effort and adds value, so it is best to eliminate waste at the earliest stage.

Picking Good Indicators: Indicators chosen dictate our direction.
Pairing Indicators – prevent overfitting, e.g. pair scrum points done with business value delivered. Quantitative and qualitative indicators often pair well.
– black-box = Just measure in/output
– Cut a hole in the box to get leading indicators. Can look at a linear indicator (graph) or a stagger chart to ensure going at correct speed or to allow estimation.

Controlling Future Output – Goal should be to keep inventory at the earliest, lowest cost stage.
1. Build to order
2. Build to forecast

Assuring Quality
– Act as a gate – Inspect everything, push back rejects
– Monitoring step – Inspect some, stop the line on problems
– Variable inspection is best – Too regular = expected, too few = gaps. No problems = inspect less, problems = inspect more. Dive into one at random.
I thought some of this may be very applicable to software. Consider, for example, pull request reviews, user stories or releases: how often should senior developers check that requirements have been gathered adequately, that Jira tickets are well written, or inspect juniors’ code? Though on that last point of managing people, Andy very much suggests focussing on TRM – Task Relevant Maturity.

Misc
Nudges
– Most of a manager’s day is gathering information. After that there are a few direct decisions but often it will be a case of nudges: gently influencing items in the direction you think best. Note that, combined with Andy’s other point that (Manager’s Output) = (Output of his Org) + (Output of neighbouring Orgs), this nudging of nearby teams can be very powerful for the company overall.

Hybrid Organisations
Organisations come in two extreme forms:
Mission Oriented = Small hedge fund where everyone’s goal is to make money
Functional Oriented = IT Support within a bank, whose goal is to deliver IT assistance.
The hybrid organisational form is inevitable. The company I think of when considering these concepts is McDonald’s. McDonald’s will have a global marketing department ensuring consistency of branding but it will also have regional departments deciding which offers to run. Similarly some resources such as packaging will be produced in a shared department while regional specialities can be run at a much lower level. There will always be a conflict between the goals of these different groups but, similar to democracy being the least-worst form of government, hybrid organisations appear to be the least-worst method of organisation.

A related concept is Dual Reporting – Individuals can be individual contributors within one team e.g. coders but at other times contribute firm-wide as part of standards committees etc. This allows using their skill sets to the fullest.

…One-to-ones, meetings and a number of other topics were also covered in the book.

Key Takeaways for Me

I really like the idea of paired indicators. Once you’ve heard the concept it’s easy to see other teams doing it wrong and only looking at one indicator. e.g. Within large finance firms there are change management or support teams whose job is to ensure stability of all systems, and the metric they will almost always look at is outages and their severity. You will typically see a line or bar chart over time reported or perhaps a breakdown by team. A high number of outages by a particular team can result in their releases being frozen. I would suggest this metric should be weighted against business value delivered. Does it matter if a system crashes badly once a week if it prints money the rest of the time? Compared to, say, an accounting system that never fails but delivers little additional value.
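One possible way to pair the two indicators, as a sketch only: normalise outages by business value delivered and see whether the “worst” team changes. The team names, figures and the per-million normalisation are all invented for illustration; how a firm would actually put a number on business value is the hard part and is not addressed here.

```python
def outages_per_million(outages, value_delivered):
    """Outages normalised by business value delivered (in currency units)."""
    return outages / (value_delivered / 1_000_000)


# Hypothetical quarter for two teams.
teams = {
    "trading": {"outages": 12, "value_delivered": 60_000_000},
    "reporting": {"outages": 3, "value_delivered": 2_000_000},
}

# Single indicator: raw outage count.
worst_by_count = max(teams, key=lambda t: teams[t]["outages"])
# Paired indicator: outages relative to value delivered.
worst_by_paired = max(teams, key=lambda t: outages_per_million(**teams[t]))

print("worst by outage count:    ", worst_by_count)
print("worst by outages per value:", worst_by_paired)
```

On raw counts the trading team would be the one frozen; once value is taken into account the picture reverses.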

What work should a manager do? The manager is responsible for overall team delivery. Therefore a manager should work on the highest leverage item. I think Andy is right that training is an extremely high leverage activity given the return over time. Training someone now will lead to higher quality output later. Higher quality output = fewer quality checks are required and staff are happier. The importance of training does however make you wonder about the impact of the rapidly increasing employee turnover in some countries and the increased reluctance of firms to invest in training.

TRM – Task Relevant Maturity. Either someone refuses to work or can’t do the work. Their ability to do the work will depend on their Task Relevant Maturity. When considering how to assign and monitor tasks, the key metric to keep in mind is TRM.

Summary

As I said at the start, a really good book. Some of his ideas I will have to take time to digest and consider how they would change my approach to certain work. Andy even included a todo list for once you’ve finished the book that I have partly completed. The parts of the book that I was more sceptical of, I plan to force myself to consider more fully. Given how knowledgeable and accurate Andy was in some areas I have experience of, I shouldn’t dismiss him in the areas that seem to me more dubious.

US Political Books

On Holiday I took an unusual diversion to read three US political books:

Reagan – Was very biased in favour of Reagan and what a great job he had done.

James Comey – Felt like James’s honest version of the truth as he saw it, very anti-Trump. Some of the situations presented and how everyone tries to manage events to get what they want are interesting.

McCain – Was probably the most balanced book with a few more interesting stories. The best of the bunch.

The Five Dysfunctions of a Team – Book

Overall 5/10 – An OK book with little surprising content and an OK story.

I think the core content is true, but the narrative that the author tried to use to deliver his points was thin and didn’t resonate with me.

The one thing this book reminded me of was the interesting research Google performed analysing their teams. Over “two years we conducted 200+ interviews with Googlers (our employees) and looked at more than 250 attributes of 180+ active Google teams”. Interestingly, if you look at their five points (listed 1-5) they closely parallel this book (image below).

  1. Psychological safety: Can we take risks on this team without feeling insecure or embarrassed?
  2. Dependability: Can we count on each other to do high quality work on time?
  3. Structure & clarity: Are goals, roles, and execution plans on our team clear?
  4. Meaning of work: Are we working on something that is personally important for each of us?
  5. Impact of work: Do we fundamentally believe that the work we’re doing matters?

Key Takeaways

five-dysfunctions-of-a-team-levels.png

Each level relies on the one below. A team must first have trust, then no fear of conflict, then commitment to team goals, then hold themselves accountable, and finally be committed to team results.

The Checklist Manifesto – Atul Gawande – Book

Overall 6/10 – Good but the few good ideas didn’t justify the book size, some parts felt like filler.

I liked the idea of this book as a number of processes that I am responsible for involve a long complicated process with many steps of varying difficulty that a developer is likely to forget and I thought the ideas from this book may help. Unfortunately the takeaways do not seem to carry over from medicine to software development.

Soft Skills – The Software Developer’s Life Manual – Book

Overall 6/10 – Maybe a more junior/beginning developer would find this useful but for an experienced dev it’s mostly common sense.

I bought this book for the wrong reason. I saw it, looked at the index and thought that the content was exactly what I would have wanted as a beginning software developer. Problem being I’m no longer a beginning software developer so too much of the content is no longer useful.

Some Takeaways I did like:

  • Think of yourself as a business
    • Consider what company suits your long-term goals
    • Market yourself
  • Climbing the Corporate Ladder (Big Company)
    • Take Responsibility – If you take responsibility for something, credit will follow.
    • Become Visible
  • Quota System – Forcing yourself to regularly contribute small pieces towards a big goal

Quirky ideas I wouldn’t have considered but can’t discredit, or that I find interesting, include:

  • Hire a professional resume writer. His argument is that you only write one and that you are not an expert. I think that could be a sound argument. I once worked for a consulting firm where they “creatively wrote” the CVs for staff. They could remove parts that were factually true and replace them with what seemed like buzz-word bingo to a techy like me, but it worked!
  • Hard work is hard and boring. Sometimes there’s no silver bullet and you just need to put in the work. Sounds obvious but I agree with the author: often people delay or try to find a magical solution when what is really needed is hard work. Reminds me of: “..those silver bullets that you and Mike are looking for are fine and good, but our web server is five times slower. There is no silver bullet that’s going to fix that. No, we are going to have to use a lot of lead bullets.” – Bill Turpin
  • Any action is better than no action