I listened to the "Architecting for failure at the Guardian.co.uk" talk by Michael Brunton-Spall at London's QCon. The slides can be found here: http://speakerdeck.com/u/bruntonspall/p/architecting-for-failure and here are my notes from the talk:
1. The scale of both the userbase and the content at the Guardian is pretty big (3.5m unique users, 1.6m unique pieces of content, hundreds of editorial staff, different content types - text, audio, video...).
2. The old system, which they replaced in 2011, was a big monolith - one code base for rendering dynamic content, third-party app data and anything else you see on a news page. It was replaced with what they call the micro-apps framework. They divided the page into content (managed by the CMS) and other components, and the components were rendered based on responses from micro-apps. This was an SSI-like solution based on the HTTP protocol (for simplicity) - there's a small sketch of the composition idea after these notes.
3. This meant faster development, more innovation, and a diversity of systems, languages and frameworks. It also meant they have an in-house internal system (the CMS) plus external environments hosted in the cloud that are accessed from the main system, so they needed a good caching solution to sit in between.
4. The cost of the new system - supporting all the different apps was far more complicated, maintenance was harder as the applications were written in different languages and hosted on different platforms, and extra work had to be done to make it easier for people to work within the system.
5. Failure is not a problem - you can plan for failure and make sure your application knows how to behave when one of the apps returns a 500 response. Slow is a problem - things go wrong when the application is not aware there is an issue because an app is merely slow, or unresponsive for a long time. You need to think about what to do with slow responses and how to make the application aware of them. (The Guardian used some nice caching solutions for that - a timeout-plus-stale-cache sketch follows these notes.)
6. Things to think about when planning for the failure of a micro-app - do you always want dynamic content, or speed? Is your traffic peaky or flat? Is a small subset of functionality enough? In the Guardian's case speed is sometimes more important than dynamic content, so they came up with Emergency Mode. In Emergency Mode caches do not expire and "pressed pages" are served; if no pressed page is available the cached version is used, and only if no cached version is available is the page rendered as in normal mode (sketched after these notes).
7. Pressed pages - fully rendered HTML stored on disk as flat files and served like static files; a single server can serve up to 1k pages per second this way.
8. When caching, be smart - cache only what's important.
9. Monitoring is not Alerting (and vice versa). When monitoring, aggregate stats - you want to be able to group and compare behaviour per colo, per server, per app, per stage (a toy grouping example follows these notes). Choose what to monitor - make sure the metrics you're looking at are actually helpful when trying to understand and fix a problem.
10. Automatic release valve - if page load time is higher than some threshold x, automatically move the site into emergency mode. Stay in that mode for 60 seconds; if everything is fine afterwards there was no real issue and you can ignore the incident, but if the problem persists somebody needs to look at it (a sketch of the valve follows these notes). Use feature toggles so you can switch things off while fixing the main app.
11. Log only relevant and useful stuff, learn how to analyse your logs, and make them easy to look at and analyse with the tools you've got. Measure your Mean Time Between Failures (MTBF) and Mean Time To Recovery (MTTR) - there's a small worked example after these notes. If you fail rarely but need a long time to recover, that's bad - you want to recover faster.
12. Expect failure, plan for failure (plan for failure at 4am), keep things simple and keep things independent.
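A few Python sketches of the ideas above. First, the micro-app composition from note 2. This is a minimal illustration, not the Guardian's actual framework - the endpoint URLs and helper names are made up - but it shows the SSI-like idea: the page is assembled from HTML fragments fetched from independent micro-apps over plain HTTP, and a dead micro-app just means a missing component, not a dead page.

```python
import urllib.request

# Hypothetical micro-app endpoints; each returns an HTML fragment over HTTP.
MICRO_APPS = {
    "most-read": "http://most-read.internal/fragment",
    "related": "http://related.internal/fragment",
}

def fetch_fragment(url, timeout=1.0):
    """Fetch one component's HTML; the page must survive a failed micro-app."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError:
        return ""  # degrade gracefully: render the page without this component

def render_page(article_html):
    """Combine CMS-managed content with fragments from independent micro-apps."""
    fragments = {name: fetch_fragment(url) for name, url in MICRO_APPS.items()}
    return article_html + fragments["most-read"] + fragments["related"]
```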
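Note 5's distinction between failed and slow, sketched with a hard timeout and a stale-read fallback. The in-process dict cache is illustrative only; the point is that a deadline turns "mysteriously slow" into an explicit failure the application already knows how to handle.

```python
import urllib.request

stale_cache = {}  # url -> last good body; illustrative in-process cache

def fetch_with_deadline(url, timeout=0.5):
    """A fast 500 is easy; a hang is not. The timeout makes slowness
    look like failure, and failure is something we can plan for."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            stale_cache[url] = body  # remember the last good response
            return body
    except OSError:
        # Down or merely slow - either way, serve stale content if we have it.
        return stale_cache.get(url, "")
```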
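The Emergency Mode serving order from notes 6 and 7 is a straightforward fallback chain: pressed page, then cached copy, then (only as a last resort) a normal dynamic render. Everything here - the pressed-pages directory, the in-process cache, the stub renderer - is assumed for illustration.

```python
import os

PRESSED_DIR = "/var/cache/pressed"  # hypothetical location of pressed pages
page_cache = {}                     # path -> HTML; never expires in emergency mode

def load_pressed_page(path):
    """Pressed pages: fully rendered HTML stored on disk as flat files."""
    filename = os.path.join(PRESSED_DIR, path.strip("/") + ".html")
    if os.path.exists(filename):
        with open(filename, encoding="utf-8") as f:
            return f.read()
    return None

def render_dynamically(path):
    return "<html>dynamic render of %s</html>" % path  # stand-in renderer

def serve(path, emergency_mode):
    """Prefer the cheapest thing that can possibly work."""
    if emergency_mode:
        pressed = load_pressed_page(path)
        if pressed is not None:
            return pressed
        if path in page_cache:
            return page_cache[path]
    return render_dynamically(path)  # normal-mode behaviour
```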
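For note 9, a toy illustration of why metrics should carry dimensions: the same latency samples, grouped by different tags, answer different questions ("is one server bad?" vs "is one colo bad?"). The sample format is invented.

```python
from collections import defaultdict
from statistics import median

def aggregate(samples, group_by):
    """samples: dicts like {"colo": "eu-west", "server": "web3",
    "app": "most-read", "latency_ms": 140}. Group one metric by any tag."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_by]].append(sample["latency_ms"])
    return {key: median(values) for key, values in groups.items()}

samples = [
    {"colo": "eu-west", "server": "web1", "app": "most-read", "latency_ms": 40},
    {"colo": "eu-west", "server": "web2", "app": "related", "latency_ms": 900},
    {"colo": "us-east", "server": "web3", "app": "most-read", "latency_ms": 45},
]
print(aggregate(samples, "server"))  # pinpoints web2 as the outlier
print(aggregate(samples, "colo"))    # shows eu-west dragged up by one server
```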
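The automatic release valve from note 10, as a sketch. The threshold and the check cadence are placeholders; the one detail taken from the talk is the 60-second hold, after which a transient blip simply lapses while a persistent problem keeps re-tripping the valve.

```python
import time

LOAD_TIME_THRESHOLD = 3.0  # "x" from the talk - the actual value is a guess
EMERGENCY_HOLD_SECS = 60   # stay in emergency mode for 60 seconds

emergency_until = 0.0      # timestamp until which emergency mode is active

def record_page_load(duration_secs):
    """Trip the valve whenever a page load exceeds the threshold."""
    global emergency_until
    if duration_secs > LOAD_TIME_THRESHOLD:
        emergency_until = time.time() + EMERGENCY_HOLD_SECS

def in_emergency_mode():
    """Transient slowness lapses on its own; persistent slowness keeps
    re-tripping the valve, which is the signal for a human to look."""
    return time.time() < emergency_until
```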
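Finally, the MTBF/MTTR point from note 11 as plain arithmetic. Given (failure, recovery) timestamp pairs, MTBF is the average gap between successive failures and MTTR the average time spent broken; failing rarely but recovering slowly shows up as a huge MTBF with an ugly MTTR, which is exactly the bad combination the talk warns about.

```python
def mtbf_and_mttr(incidents):
    """incidents: list of (failed_at, recovered_at) timestamps in seconds."""
    starts = [start for start, _ in incidents]
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    mtbf = sum(gaps) / len(gaps) if gaps else float("inf")
    mttr = sum(end - start for start, end in incidents) / len(incidents)
    return mtbf, mttr

# Toy numbers: two incidents a day apart, 10 and 5 minutes to recover.
print(mtbf_and_mttr([(0, 600), (86_400, 86_700)]))  # (86400.0, 450.0)
```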