Monday, 5 March 2012

Cloud Architecture training at QCon (1)

I'm attending the Cloud Architecture training by Adrian Cockcroft today. Here are my notes from the morning session.
http://qconlondon.com/london-2012/presentations/show_presentation.jsp?oid=3889

We have around 40 people in the room, 3 girls including myself, two of whom are IT students. The other 90%+ are the usual white male programmer type (the session kicked off with introductions from the audience; I haven't been going around asking people what they do :)). So far the highlights for me were:

1. The Netflix dev team's slides & other resources are all available online, see: http://www.slideshare.net/netflix http://techblog.netflix.com/ and http://developer.netflix.com/

2. Learning about how the Netflix team works with Amazon as a provider: no sysops team, developers pushing builds from testing to production themselves. Fully in-cloud QA environments. Over-provisioned specs, almost non-existent capacity planning. Also, the out-of-the-box account/key management provided by Amazon wasn't good enough, so they had to implement their own.

3. Discussion about other providers. The reasoning behind why Netflix uses Amazon and not any other cloud provider is actually not that interesting: Amazon is the only one that can handle the scale. What is interesting is learning about the risks they had to look into before even deciding to go to any cloud. Apparently, if forced to, they can move out of Amazon in 3 months. At the moment they would move into a private cloud, simply because there is no other provider that can handle a company of that size.

4. It seems like different teams in Netflix own different services, so you wouldn't get two teams or more messing around with the same service. They do CI & they release to production a lot (just like we do :)), but they had to move to the cloud to do that.

5. Moving to the cloud took 3-4 years. The bulk of the work was actually refactoring the system so it's more manageable. It's amazing how many things were broken and needed fixing before the move. We saw a nice chart of all the components that were moved to the cloud. The process seems very similar to what we do at Shopzilla, but the scale of the system and of the changes introduced was much bigger.

6. Really good (and only 10 minutes long) explanation of the anti-architecture approach. A bit like the boxes method we use right now, but geared more towards developers. Will try that one out soon :).

I still want to hear more about how they detect issues in production. They use the canary method for releases, but what about the issues that can only be detected after some time? Do they use Splunk? Do they have some automated monitoring systems? Who owns those (no SysOps, no CM)?
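
As a rough illustration of the canary idea mentioned above (not Netflix's actual tooling), here is a minimal Python sketch. It assumes the new build gets a small share of traffic and is compared against the current build on error rate. The names `Stats` and `canary_is_healthy`, and the thresholds, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request counters for one group of servers (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid division by zero when a group has seen no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(
    baseline: Stats,
    canary: Stats,
    max_ratio: float = 1.5,      # assumed tolerance: canary may be 1.5x worse
    min_requests: int = 100,     # assumed minimum traffic before judging
    floor: float = 0.001,        # assumed absolute error rate always tolerated
) -> bool:
    """Return True if the canary's error rate is acceptable vs. the baseline."""
    if canary.requests < min_requests:
        # Not enough traffic to say anything; treat as "not yet healthy".
        return False
    allowed = max(baseline.error_rate * max_ratio, floor)
    return canary.error_rate <= allowed


if __name__ == "__main__":
    old_build = Stats(requests=10_000, errors=50)
    new_build = Stats(requests=500, errors=3)
    print("promote" if canary_is_healthy(old_build, new_build) else "roll back")
```

A check like this only catches problems that show up quickly in the canary's traffic, which is exactly why the question about slow-burning issues still stands.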
