Developers, Code Cowboys and Architecture Astronauts.
Slums, Skyscrapers and Ghost Cities.
Similar to constructing buildings, there are (at least) three approaches to software development:
Coders = Slums – Built fast from the materials and knowledge at hand, for a small audience. Good ideas are copy-pasted from one area to another and modified to suit the individual’s needs. We can cover a lot of ground quickly but it doesn’t scale: the plumbing and electricity break down.
Developers = Skyscrapers – Construction takes longer and the outcome tends towards uniformity, often piecing together existing architectural concepts or libraries into a fairly standard shape. We can scale to a higher level (density of people) but we need more upfront planning and accept less individuality.
Architecture Astronauts = Ghost Cities – Master architects devise grand schemes of hugely scalable systems but there are fundamental flaws in the plan and the needs of actual end users are often ignored.
If this conceptual metaphor holds, what could we learn from the building industry?
- Don’t employ a coder when you need an architect?
- Sometimes you need to clear a slum, displeasing those residents to replace it with an efficient residential building, which will take time and investment?
- Building quality needs to be enforced by external parties?
Similar to governmental building inspections.
- Always get the core plumbing right, the facade/paint can be changed later?
- Is there anything they could learn from software development?
Perhaps the most important thing is to decide which category you are aiming for.
I initially breezed through this book in a week, thinking it mostly contained nice stories and glib niceties. However, going back to write the Book Notes took 3 weeks and I found myself scribbling down page upon page, far more than my usual amount of notes. Upon reflection, it was a really good book with lots of good points and anecdotal stories to help remember them. If Bill was getting this many “easy” parts right, I can see how the overall impact would have been large.
My main take-away: Team First – A company is formed from teams: get the right people, build an envelope of trust, support and love them.
- Improve work roadmap meetings. Given we have the whole team present, anything but very effective use of that time is a waste.
- Re-read Project Aristotle
- Think about what makes a good meeting
- Always set a measurable goal, sometimes a Big-Hairy-Goal to stretch people
- Caddie and CEO
- Title makes you a Manager, your people make you a leader.
- Build an envelope of trust
- Team first
- The power of love
- The yardstick
- Caddie and CEO
- background and some hero-worshiping
- Teams are the building blocks of a company, not individuals.
- pg26 Raises the interesting possibility of coaches for managers.
Given the leverage, why are there not coaches providing real-time feedback?
- Mentor vs coach
- Title makes you a Manager, your people make you a leader.
- A manager’s authority emerges as they establish credibility with subordinates, peers and superiors.
- “It’s the people” – Support + Respect + Trust
- Support = tools, info, roadmap, training
- Respect = Career goals / Life choices
- Trust = Autonomy and Decision Making
- Lying in bed at night, the CEO should worry most about his staff
- One-to-ones and staff meetings are critical.
- Trip Reports – Having one person tell a personal story of their weekend at a Monday meeting
- Decision Making – “Making the right decision is important. Just as important is getting the whole team there.”
- The manager’s job is to run the decision-making process, ensure all voices are heard, break ties when stuck and remind everyone of purpose and root truths.
- Build an envelope of trust
- The first thing some managers focus on is building a product or getting people working. The priority should be to build trust.
- Bill saw the world as a network of people with different skills, learning to trust each other as a primary mechanism of achieving goals.
- Psychological Safety – The ability for a team member to voice crazy ideas and feel safe from negative repercussions has been found to be critical to success.
- Coach the coachable – pg86 “A coach is someone that tells you what you don’t want to hear, who has you see what you don’t want to see, so you can be who you have always known you can be.”
- Honesty, humility, perseverance and constant openness to learning.
- Leadership is about service to something that is bigger than you.
- Practice active-listening
- Diane Greene – “When I’m really annoyed or frustrated with what someone is doing, I step back to think about what they are doing well and what their value is”
- No gap between statements and facts. Give feedback close to the time, in public if good, in private if negative. Always give it from a place of love.
- Team first
- Work the team, then the problem.
When faced with a problem or opportunity, the first step is to ensure the right team is in place and working on it.
- Pick the right players. The ability to learn fast, a willingness to work hard, integrity, grit, empathy, and a team first attitude.
- Bill saw peer relationships as critical and instituted a regular survey amongst peers at Google to assess performance at job/relationships/meetings/leadership/innovation.
- Winning depends on having the best team and the best teams have more women.
- Identify the biggest problem, the elephant in the room, bring it front and centre and tackle it first.
- Listen, observe and fill the communication gaps
- Work the team, then the problem.
- The power of love
- Get to know and care for people as individuals
- Cheer people and their successes
Overall 5/10 – Having read and loved The Phoenix Project, I had high expectations for this book, perhaps too high. It felt like the same message and story regurgitated to sell another book. Perhaps if I hadn’t seen most of the ideas before elsewhere it would have felt newer and more impactful.
- Compared to the previous book there is a lot more emphasis on people skills. It’s great to see this highlighted in a book for programmers, where that kind of networking isn’t as common.
- The “rebellion” team was formed as a ragtag coalition of people that wanted to make a difference
- Kurt operated at the edge of permitted staff behaviour to get the resources the team needed
- Maxine visited people outside her own department in person to build alliances
- She asked how they completed their work and helped them find where they fitted into the overall flow to increase throughput overall
- Sarah’s toxic behaviour and the need for psychological safety
- Some problems seem highly exaggerated to reach foregone conclusions to point at fashionable technology.
- For example getting a working build takes weeks = containers.
- Concurrency issue = Immutability and functional programming save the day
- I wouldn’t disagree that those technologies are great for some problems, it just seemed they were thrown into the book to namedrop.
- Near the very end it proposes that large companies can outpace their smaller rivals as they have the relationships, resources and data.
I’m not sure I entirely buy that. One of the hardest things to change is values and perceptions.
- Project Shamu sounded interesting, taking 23 API calls that have their own SLAs and reducing them to one dependency without caching. I wondered what technology this was referring to but googling didn’t help. Any ideas?
The Five Ideals
These are the ideals presented at the back of the book. I can certainly agree on their importance:
- Locality and Simplicity
- Focus, Flow and Joy
- Improvement of Daily Work
- Psychological Safety
- Customer Focus
During this lockdown I was due to take some holidays, originally to visit Pisa with Elaine. Instead I took a week off to code. For me it was just as much fun, though I’m not sure Elaine agreed. This is the outcome of that week:
So far it’s extremely limited: casting, parsing, list definitions and a handful of operations. It has however been insightful. The first 2 days were spent hashing out code to make the pure fundamentals work in any way possible. On the 4th day I began to realise some very verbosely implemented operations could be done in a much simpler way. Then I began to see such savings again and again. Perhaps after the first decade I would have it whittled down to Arthur’s two-pager.
An inordinate amount of fun was had when I discovered I could host the application fully in browser as doppio provides a method of running a full JVM:
jq Online Sandbox – http://timestored.com/jq/
So far it’s useful for basic snippets but I really think such a safe and easily launched environment would be great for onboarding new users to the language.
Bad Code Accretes
Sometimes while reading code, I get the impression that the person:
- Kept throwing more code at the problem until it “worked”.
- They never for a moment stepped back and thought about making it simpler.
Great Code Simplifies
Contrast that approach to Ken Iverson in this video from 1974:
I went from application to application trying to use the same techniques. The most encouraging thing is that they would work. After 2-3 years during which time the language had grown by accretion, it grew and grew, eventually I found it was shrinking.
Essentially the idea was once you look at enough different applications you begin to see what is the general notion. So I came to generalisations that allowed me to take out whole chunks of special things I had put in.
Furthermore to my surprise it turns out the general ideas are usually much simpler to understand than any of the special cases.
Modern Languages are Simplifying Common Cases
An Example from KDB
I find it worth mentioning how KDB supplies the user with handles to send data. Here we open a handle h to send a query to a remote process and get the result.
q)h:hopen `:localhost:5000
q)h "2+2"
4
q)h
7
That last line shows that the handle is 7. Why is KDB using 7 for handles?
Because Linux maps files/sockets etc. using those exact same integers. In fact in kdb standard out/error can be used as 1/2. When people first encounter this, they find it confusing, possibly because they are coming from other languages that wrap handles ten layers deep in abstractions. I can’t help but imagine:
- Some coders take hours to work out what code can be removed
- Other developers like Arthur may never consider introducing unnecessary abstractions in the first place
Please, for the sake of your reviewers, take a moment before pushing code to ask yourself: can this be made simpler?
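To see that "a handle is just a small integer" is an operating system idea rather than a KDB quirk, here is a minimal Python sketch. It is only an illustration, not how KDB is implemented: the function name roundtrip_through_pipe is made up for this example, and it uses an OS pipe instead of a network socket so that it runs without a server.

```python
import os


def roundtrip_through_pipe(message: bytes) -> bytes:
    """Send bytes through an OS pipe using nothing but integer descriptors.

    Hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()  # both are plain ints handed out by the OS
    try:
        os.write(write_fd, message)            # "send" on one integer handle
        return os.read(read_fd, len(message))  # "receive" on the other
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    r, w = os.pipe()
    print(type(r).__name__, r, w)  # the exact numbers depend on what is already open
    os.close(r)
    os.close(w)
    print(roundtrip_through_pipe(b"2+2"))
```

There is no wrapper object anywhere in that code, the integer is the whole interface. That is the same design choice the q session above exposes.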
6/10 – Overall. 8/10 for early chapters, 4/10 for later chapters.
The first 100 pages were excellent but the later chapters were a mixed bag, partially due to rotating authors. I skim-read the later chapters as they mostly focussed on a broad spectrum of not closely related topics.
Chapters that covered topics I interact with were too shallow to interest me, while many chapters were not of interest to me. Perhaps if I was an SRE rather than a developer I would have found the entire book better.
Key Takeaways for Me
- Every large firm I’ve worked at has been structured incorrectly and had the wrong metrics for measuring stability.
In banks, the production support team has typically been tasked with “zero outages” whilst the developers are incentivised to develop and release as quickly as possible, with some front-office “quant-devs” not being held accountable for stability at all. The handover method looks like throwing the code over a wall.
- This book suggests a much better approach:
Rather than pace vs stability, agree a global “Error Budget” target for everyone, using SLOs/SLIs. If the target is not met, responsibilities can move back and forth between DEV-owned and SRE-owned. Importantly the target, e.g. between 99.8% and 99.9% uptime, should have an upper and lower bound; it should NOT be an absolute. If you go above it, developers should be taking more risks; below it, developers should work on stability.
- 100% is the wrong reliability target. I always intuitively knew this but the book provided useful arguments, e.g. if you build a 100% reliable service but the user’s wifi is 99% reliable, you wasted a lot of effort that users could never benefit from and that took time away from other work.
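The banded target can be sketched in a few lines of Python. This is my own illustration under stated assumptions, not code from the book: the function names, the request-based measurement and the default band of 99.8% to 99.9% (taken from the example above) are all assumptions.

```python
def error_budget_action(successes: int, total: int,
                        lower: float = 0.998, upper: float = 0.999) -> str:
    """Map measured availability onto the banded target (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    availability = successes / total
    if availability > upper:
        return "take more risks"      # over-achieving: budget is going unspent
    if availability < lower:
        return "work on stability"    # budget overspent
    return "on target"


def remaining_error_budget(failures: int, total: int, slo: float = 0.998) -> float:
    """Fraction of the allowed failures still unspent; negative means overspent."""
    if not 0 < slo < 1:
        # A 100% target leaves no budget at all, which is the point made above.
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = (1 - slo) * total
    return (allowed - failures) / allowed


if __name__ == "__main__":
    print(error_budget_action(99_950, 100_000))
    print(error_budget_action(99_700, 100_000))
    print(round(remaining_error_budget(50, 100_000), 2))
```

The useful property is that both teams read the same number: above the band it argues for faster releases, below it for stability work.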
Note the full book is actually available online here.
An outage is NOT a bad thing, it is an expected part of innovation.
- Alerts – Immediate human action required
- Ticket – Human action required within a few days to prevent damage
- Logging – For forensics/diagnostics only
- MTTF – Mean Time To Failure
- MTTR – Mean Time To Repair
- Humans add latency. MTTR speed critical to availability -> automation is best.
Google Specific Terms
- Campus > Data centre > cluster > row > rack > server
- Borg – Automates resources for applications
- Chubby – Uses paxos to provide global locks
- Users -> GFrontEnd -> AppFrontEnd -> AppBackEnd -> DB (all coordinate via Load Balancer / DNS)
- Time Availability = uptime / (uptime+downtime)
- Aggregate Availability = successful Requests / Total Requests
This metric is more useful when there are regional outages etc.
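A small Python sketch of the two availability formulas above, showing why they diverge during a regional outage. The traffic numbers (4 regions, 1000 requests per hour each, one region down for an hour) are invented for illustration.

```python
def time_availability(uptime: float, downtime: float) -> float:
    """Time-based availability: uptime / (uptime + downtime)."""
    return uptime / (uptime + downtime)


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Request-based availability: successful requests / total requests."""
    return successful_requests / total_requests


if __name__ == "__main__":
    # Hypothetical day: 4 regions, 1000 requests/hour each, 1 region down for 1 hour.
    hours, regions, per_hour = 24, 4, 1000
    total = hours * regions * per_hour
    failed = 1 * per_hour
    # Counting the whole service as "down" for that hour:
    print(f"time-based: {time_availability(23, 1):.4f}")
    # Counting only the requests that actually failed:
    print(f"aggregate:  {aggregate_availability(total - failed, total):.4f}")
```

With these made-up numbers the time-based figure is about 95.8% while the aggregate figure is about 99.0%, because three of the four regions kept serving requests during the outage.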
- There are different types of failure
- Global outages, regional outages
- Full outages, partial functionality
- Choose which you want
- Error Budget = Control loop to manage release velocity
- Error Budget – Aligns incentives
- SLI – Service Level Indicators – Measure a level of service e.g. latency/availability
- SLO – Service Level Objective – A range of values that is measured by an SLI e.g. average response <100ms
- SLA – Agreement – agreed with customers, including consequences for missed SLOs
- Choosing Targets:
- Don’t base it on current performance (it could be way off)
- keep it simple
- Have as few as possible
- Keep a safety margin (tighter internal number)
- Don’t overachieve, each “9” is costly
- Percentiles – are a better measurement than averages when there is a long tail
- Toil -> Manual repetitive work devoid of enduring value, that could be automated
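To make the point about percentiles concrete, here is a short Python sketch with invented latencies: 98 fast requests and 2 very slow ones. The nearest-rank percentile helper is my own choice of definition, other percentile definitions interpolate and would give slightly different values.

```python
import math
import statistics


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in ms: a long tail of two slow requests.
    latencies = [50] * 98 + [5000] * 2
    print("mean  :", statistics.mean(latencies))
    print("median:", statistics.median(latencies))
    print("p95   :", nearest_rank_percentile(latencies, 95))
    print("p99   :", nearest_rank_percentile(latencies, 99))
```

The mean lands at 149 ms, a latency no request actually experienced, while the median and p95 show the typical 50 ms and the p99 exposes the 5000 ms tail.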
- Toil = Lower morale, career stagnation, slower progress
- Some amount of toil is unavoidable and can even be calming
Automation allows super-linear scaling of users vs human effort.
Levels of automation:
- Fully automated – DB self identifies problem and preemptively resolves it
- Internally maintained – Generic – script shipped with database
- Externally maintained – Generic – shared DB recovery script
- Externally maintained – System specific – A script on someone’s desktop
- No Automation
- Less code = Less maintenance
- Simplicity = Stability
The later chapters held less of interest.
“You want a data recovery system NOT a data backup system.”
SRE Engagement Model – Not all services require SRE attention as they don’t need high reliability and availability. Those teams get given advice and documentation.
Overall 8/10 – Good book that presents good ideas and clear evidence for why.
I was aware of slightly over half the best practices from this book but not all of them have been adopted by large firms. I picked up a few actions I’d take away but really the usefulness in this book may be in presenting it as evidence to try and drive change in others.
- Use capabilities to measure performance not maturity levels as maturity suggests mission complete.
- (Scrum) Velocity is only a capacity planning tool
- Utilization isn’t the correct measure, it should not be 100%
- Should measure global outcome to ensure teams are not pitted against each other
- Software Delivery Performance Depends on:
- Lead time
- Deployment Frequency
- Mean Time To Restore
- Change Fail %
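The four measures above can be computed from a simple deployment log. The following Python sketch is my own illustration: the Deployment record, its field names and the use of the median for lead time are assumptions, not definitions taken from the book.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def delivery_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Summarise lead time, deployment frequency, time to restore and change fail %."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [_hours(d.committed_at, d.deployed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [_hours(d.deployed_at, d.restored_at)
                for d in failures if d.restored_at is not None]
    return {
        "median_lead_time_hours": median(lead_times),
        "deploys_per_week": len(deployments) / (period_days / 7),
        "mean_time_to_restore_hours": mean(restores) if restores else None,
        "change_fail_percent": 100 * len(failures) / len(deployments),
    }


if __name__ == "__main__":
    log = [
        Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13)),
        Deployment(datetime(2024, 1, 8, 9), datetime(2024, 1, 8, 11),
                   failed=True, restored_at=datetime(2024, 1, 8, 12)),
        Deployment(datetime(2024, 1, 15, 9), datetime(2024, 1, 15, 17)),
        Deployment(datetime(2024, 1, 22, 9), datetime(2024, 1, 22, 15)),
    ]
    print(delivery_metrics(log, period_days=28))
```

Reporting all four together matters: the first two measure tempo and the last two measure stability, so a team cannot look good by trading one for the other.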
Measuring and Changing Culture
- Don’t try to change how people think, first change what people do (or change the people :))
- Westrum Theory: Orgs with better information flow function more effectively
- Level 1 – Things we just know
- Level 2 – Culture – We can debate these within the team, e.g. importance of security
- Level 3 – Written artifacts and established processes
- Pathological – based on power
- Bureaucratic – based on rules
- Generative – based on performance
- Build quality in
- Work in small batches
- Automate repetition
- Relentlessly pursue continuous improvement
- Everyone is responsible
- Comprehensive config management
- Continuous Integration – Small daily branch merges
- Continuous Testing
- Version control
- Test Automation
- Test data management
- Trunk based development
Goal is loose coupling to ensure bandwidth between teams isn’t swamped with implementation details.
Can the team by itself without speaking to outsiders:
– Change architecture significantly
– Do a deployment? now? during business hours? anytime?
Critical = Testability and Deployability
Systems are loosely coupled and can be developed and validated independently.
Components of Lean Management
- Limit work in progress
- Visual Management
- Feedback from production
- Lightweight change approvals
CAB – doesn’t work to increase stability!
External approvals are negatively correlated with lead time, deploy freq. and restore time.
Lean Management <-> Software delivery performance, becomes a virtuous cycle.
Lean: Build -> Measure -> Learn
- Small batches
- flow of work from requirements to user known by team
- Actively seek user feedback
- Authority to create/change specs during dev without approval
- Invest in employee development
- Foster supportive work environment (no blame)
- Ask employees what’s preventing them from achieving their objectives
- Give time to experiment and learn
Factors Causing Employee Burnout:
- Work overload
- Lack of control
- Insufficient rewards
- Community breakdown
- Value conflicts
- Vision – Clear understanding of where to be in 5 years
- Inspiring Communication – Says things that make employee proud to be part of org
- Intellectually Stimulates – Challenges my assumptions, makes me rethink principles
- Supportive – Considers and acts to benefit my feelings
- Personal Recognition – Commends me when I do a good job
Key Takeaways for Me:
- Most of the suggestions from other books I’ve read and that I had seen work myself were correct. The large survey conducted by these authors gives me the evidence to back up my opinions.
- Action: In my current work, we need to find a way to get the critical measurements improved. Increased release frequency and lower-overhead change management would seem to offer the best reward for the effort.
- The importance of loosely-coupled architecture gives me a clearer way to conceptualise interactions between teams and why it’s important. (limited bandwidth)
Overall 10/10. A short concise book with a number of good ideas that anyone working in a large/medium or perhaps even a small company would benefit from. So good that I have already re-read parts of it twice.
Andy uses the analogy of making a breakfast to show how process->assemble->test is a common workflow at any scale. For me many of his points in manufacturing paralleled topics from lean/agile software development.
– In-process inspection e.g. thermometer for boiling eggs
– Receiving inspection e.g. Inspecting eggs on delivery, parallel in software: validating user requirements
Each step takes time/effort and adds value, so it is best to eliminate waste at the earliest stage.
Picking Good Indicators: Indicators chosen dictate our direction.
– Pairing Indicators – prevent overfitting e.g. pair scrum points done with business value delivered. Quantitative and qualitative often pair well.
– black-box = Just measure in/output
– Cut a hole in the box to get leading indicators. Can look at a linear indicator (graph) or a stagger chart to ensure going at correct speed or to allow estimation.
Controlling Future Output – Goal should be to keep inventory at the earliest, lowest cost stage.
1. Build to order
2. Build to forecast
– Act as a gate – Inspect everything, push back rejects
– Monitoring Step – Inspect some, stop line on problems
Variable Inspection Best – Too regular = expected, too few = gaps, no problems = inspect less, problems = inspect more, dive into one at random.
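Variable inspection can be sketched as a sampling rate that rises when problems are found and decays when they are not, with the actual pick left to chance. This Python sketch is my own reading of the idea: the class name, the doubling and 0.9 decay factors, and the rate limits are all arbitrary assumptions.

```python
import random


class VariableInspector:
    """Randomly samples items; inspects more after problems, less after clean results."""

    def __init__(self, rate=0.2, min_rate=0.05, max_rate=1.0, rng=None):
        self.rate = rate
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.rng = rng or random.Random()

    def should_inspect(self) -> bool:
        # Random choice keeps inspections unpredictable ("dive into one at random").
        return self.rng.random() < self.rate

    def record(self, problem_found: bool) -> None:
        if problem_found:
            self.rate = min(self.max_rate, self.rate * 2)    # problems = inspect more
        else:
            self.rate = max(self.min_rate, self.rate * 0.9)  # no problems = inspect less


if __name__ == "__main__":
    inspector = VariableInspector()
    for outcome in [True, True, False, False, False]:
        inspector.record(outcome)
        print(f"problem={outcome} -> inspect rate {inspector.rate:.2f}")
```

The floor on the rate means nothing ever goes completely uninspected, which covers the "too few = gaps" case.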
I thought some of this may be very applicable to software. Consider for example pull request reviews, user stories or releases: how often should senior developers check that requirements have been gathered adequately, that Jira tickets are well written, or inspect juniors’ code? Though on that last point of managing people, Andy very much suggests focussing on TRM – Task Relevant Maturity.
Nudges – Most of a manager’s day is gathering information. After that there are a few direct decisions but often it will be a case of nudges: gently influencing items in the direction you think best. Note that combined with Andy’s other point that (Manager’s Output) = (Output of his Org) + (Output of neighbouring Orgs), this nudging of nearby teams can be very powerful for the company overall.
Organisations come in two extreme forms:
Mission Oriented = Small hedge fund where everyone’s goal is to make money
Functional Oriented = IT Support within a bank, whose goal is to deliver IT assistance.
The hybrid organisational form is inevitable. The company I think of when considering these concepts is McDonald’s. McDonald’s will have a global marketing department ensuring consistency of branding but it will also have regional departments deciding which offers to run. Similarly some resources such as packaging will be produced in a shared department while regional specialities can be run at a much lower level. There will always be a conflict between the goals of these different groups but, similar to democracy being the least worst form of government, hybrid organisations appear to be the least-worst method of organisation.
A related concept is Dual Reporting – Individuals can be individual contributors within one team e.g. coders but at other times contribute firm-wide as part of standards committees etc. This allows using their skill sets to the fullest.
…One-to-ones, meetings and a number of other topics were also covered in the book.
Key Takeaways for Me
I really like the idea of paired indicators. Once you’ve heard the concept it’s easy to see other teams doing it wrong and only looking at one indicator. e.g. Within large finance firms there are change management or support teams whose job is to ensure stability of all systems, and the metric they will almost always look at is outages and their severity. You will typically see a line or bar chart over time reported or perhaps a breakdown by team. A high number of outages by a particular team can result in their releases being frozen. I would suggest this metric should be weighed against business value delivered. Does it matter if a system crashes badly once a week if it prints money the rest of the time? Compare that to, say, an accounting system that never fails but delivers little additional value.
What work should a manager do? The manager is responsible for overall team delivery. Therefore a manager should work on the highest leverage item. I think Andy is right that training is an extremely high leverage activity given the return over time. Training someone now will lead to higher quality output later. Higher quality output = fewer quality checks are required and staff are happier. The importance of training does however make you wonder about the impact of the rapidly increasing employee turnover in some countries and the increased reluctance of firms to invest in training.
TRM – Task Relevant Maturity. When work isn’t getting done, either someone won’t do the work or they can’t. Their ability to do the work will depend on their Task Relevant Maturity. When considering how to assign and monitor tasks, the key metric to keep in mind is TRM.
As I said at the start, a really good book. Some of his ideas I will have to take time to digest and consider how they would change my approach to certain work. Andy even included a todo list for once you’ve finished the book, which I have partly completed. The parts of the book that I was more sceptical of, I plan to force myself to consider more fully. Given how knowledgeable and accurate Andy was in some areas I have experience of, I shouldn’t dismiss him in the areas that seem to me more dubious.
On holiday I took an unusual diversion to read three US political books:
Reagan – Was very biased in favour of Reagan and what a great job he had done.
James Comey – Felt like James’s honest version of the truth as he saw it, very anti-Trump. Some of the situations presented and how everyone tries to manage events to get what they want are interesting.
McCain – Was probably the most balanced book with a few more interesting stories. The best of the bunch.