Here are the notes from the High Availability in Heroku talk by Mark McGranaghan at London's QCon.
1. There are two aspects Heroku team looked at when improving their availability - architecture and execution. First one relates to the designs of their system, latter is more about how they had to change their approach to implement the designs.
2. On the architecture level Mark covered number of techniques covering following problems: load balancing, supervision, error kernel, layered design, message passing.
3. The solution is based on the idea of having multiple load balancers running as internal apps between the front-end applications and the backend provided by Heroku. In addition to the balancer level, they have supervising applications that check on the instances and bring them back up when necessary. This already shows their use of layered design: different problems are handled at separate levels in the system, which makes it more robust and easier to monitor. Finally, they use well-defined, non-app-specific messages that can carry a growing number of parameters between layers and apps. Changes to the messages are handled either by explicit versioning or by graceful degradation.
4. Distributed call graphs: decide what to do when faced with a broken producer or worker (you can't read or you can't write data). When you can't write, consider local tickets that can be picked up once the system is available again.
5. On the execution front, the challenges were: jaded deploys, bad visibility and cascading feedback. Deploys were addressed with iterative deployment and so-called "canaries". Visibility is addressed by real-time monitoring systems and automated tools that flag issues as early as possible.
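
To make the "local tickets" idea from note 4 a bit more concrete, here is a minimal Python sketch. It is not how Heroku implemented it (the talk notes don't say). The names `TicketSpool` and `BackendUnavailable`, and the `write` callable, are hypothetical and only serve the illustration. The assumption is that a failed write raises an exception, at which point the record is kept locally and replayed in order once the backend is reachable again.

```python
from collections import deque


class BackendUnavailable(Exception):
    """Hypothetical error raised when the backend cannot accept a write."""


class TicketSpool:
    """Keep writes as local tickets while the backend is down; replay later."""

    def __init__(self, write):
        self._write = write      # callable performing the real write (assumed)
        self._pending = deque()  # local tickets waiting to be replayed

    def submit(self, record):
        """Write the record, or queue it locally. Returns True if written now."""
        if self._pending:
            # Older tickets are still waiting: queue behind them to keep order.
            self._pending.append(record)
            return False
        try:
            self._write(record)
            return True
        except BackendUnavailable:
            self._pending.append(record)
            return False

    def replay(self):
        """Retry pending tickets in order; stop at the first failure."""
        written = 0
        while self._pending:
            try:
                self._write(self._pending[0])
            except BackendUnavailable:
                break  # backend still down, keep the remaining tickets
            self._pending.popleft()
            written += 1
        return written

    def pending(self):
        """Number of tickets still waiting for the backend."""
        return len(self._pending)
```

A caller would call `submit` on every write and run `replay` periodically, or whenever monitoring reports the backend as healthy again. This sketch keeps the tickets in memory only; a real system would probably persist them to local disk so that they survive a process restart.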