Friday, 9 March 2012

QCon 2012, Keynote - Resilient Response in Complex Systems

I was a bit late to Friday's keynote at London's QCon (thanks, Southern trains), but I managed to take some notes from John Allspaw's "Resilient Response in Complex Systems" talk. Here they are:

1. A number of things go wrong when teams are faced with the failure of a complex system. There is refusal to make decisions, caused by lack of authority (people want to make a decision but are not sure whether they are allowed to), lack of information (people can't make a decision, they can only guess) or bureaucracy and politics (the people who are able to make a decision can't make it quickly or without approvals). There is "heroism" - individuals who walk away from the team to focus on what they think is the solution to the problem. If they succeed, they send the wrong message to the team and the company: that issues are solved by individuals. If they fail, they have abandoned the team in the middle of a disaster. There are also distractions - you need to be able to cut distractions down to a minimum when dealing with a disaster. This means irrelevant emails, links and social events, but it can also mean isolating business owners from the team if all they add is distraction by "panicking" over the outage.

2.  SRK Framework:

3. A good example of how to deal with a disaster can be found in so-called High Reliability Organisations (HROs - for example, organisations whose work can affect human life or health, such as air traffic control or hospitals). They are very complex, they have to trade off efficiency against thoroughness, and they are usually engineer-driven.

4. "Managing the Unexpected: Resilient Performance in an Age of Uncertainty" by Karl E. Weick and Kathleen M. Sutcliffe

5. The ways HROs deal with disaster: the teams are close to each other, they share tools and information, there is overlap in skills and knowledge, and team members can be moved from one team to another. Over-communication is the norm. There is a safe environment in which teammates can point out errors and mistakes - reporting of errors and mistakes is rewarded. There is a high level of "unease" - everybody is used to the idea that a disaster will happen, and they are prepared for it. People have strong interpersonal skills. There are publicly available, detailed records of previous incidents. Patterns of authority change dynamically to meet the demands - the people dealing directly with the problem have the power to fix it.

6. Drill - practice troubleshooting under pressure, be comfortable with your tools, try to come up with new ways your system can break, and practice ways of dealing with them. There are actions that can be taken immediately when a disaster happens; make sure those are fast and automatic for your team.

7. The law of stretched systems

8. Near misses - communicate them and distribute them widely. They happen more often than actual disasters, so you get a much larger volume of incidents to learn from. They are also a reminder of the risks.

9. Spend as much time understanding why your team succeeded as why it failed. When faced with the choice of whether to analyse your success or your failure, choose to analyse both. Think about what goes right, and why.

10. Learn from other fields, train for outages, learn from mistakes (and avoid politics and bureaucracy), learn from successes as well as failures.  
