Showing posts with label cloud architecture. Show all posts

Monday, 5 March 2012

Cloud Architecture training at QCon (3)

Notes from the last part of the Cloud Architecture training at QCon.

This part was mostly about Cassandra - why Netflix uses it & how they use it. There were bits about monitoring and about scalability of their system. Here are the most interesting points:

1. CAP Theorem - Choose Consistency or Availability when Partitioned. Master-slave vs Quorum models.

2. An overview of the current Netflix persistence stack. It included info on Memcache, Cassandra, MongoDB and MySQL.

3. Introduction of Netflix open source tools for Cassandra users: Priam (Cassandra automation) and Astyanax (a Cassandra Java client). Both can be found on GitHub. We also heard about Netflix's contributions to Cassandra.

4. Interesting background on why Netflix uses Cassandra. Makes me really wonder about our use of Riak... We also got very detailed Cassandra data flows for single- and multi-region apps, including the replication, backup, archive and logging mechanisms. It's all backed up to S3.

5. Cloud bring-up strategy: they used Shadow Traffic Redirection and worked in iterations, one page at a time, starting with the simplest. They managed to "sell" the cloud to all developers early on, at a development boot camp. Most of the issues they faced were around key management and an early lack of experience with AWS.

6. The monitoring is based on logs (they log everything almost all the time). The logs are processed in Hadoop and are used to generate reports. They use AppDynamics as a portal that gives them a deep look into what's running in production.
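The quorum model mentioned in point 1 comes down to simple arithmetic, so here is a minimal sketch of it. This is my own illustration and was not shown in the training; the function names are made up. The rule it encodes is the standard one: with N replicas, a read of R replicas is guaranteed to see the latest write to W replicas only when R + W > N.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def tolerated_failures(n: int, quorum: int) -> int:
    """How many replicas can be down while a quorum of this size is still reachable."""
    return n - quorum


if __name__ == "__main__":
    n = 3                      # replication factor (assumed for the example)
    q = majority(n)            # 2 of 3
    print("quorum size:", q)
    print("quorum reads see quorum writes:", quorums_overlap(n, q, q))
    print("read one / write one overlaps:", quorums_overlap(n, 1, 1))
    print("replicas that may fail:", tolerated_failures(n, q))
```

Reading and writing a single replica keeps the system available under more failures, but gives up the overlap guarantee, which is the consistency-versus-availability choice from point 1.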

The last slides covered the structure of the team: notably small sysops and DBA teams, and very strong Java dev teams.

Overall the training was very useful and packed with interesting info. I can see at least two more posts coming out of it. It also made me note some questions for Shopzilla's cloud migration in EU.

Cloud Architecture training at QCon (2)

Here are the notes from today's after-lunch session of Cloud Architecture training at QCon.

We've moved on to more detailed architecture description of what is currently running under Netflix customer-facing apps. Here are the most interesting bits I noted:

1. Finally I have good names for the client and site libraries that we use at Shopzilla. Netflix calls them SAL (Service Access Library) and ESL (External Service Library). I like the way they use the ESL to call a local cache, and that the service-side cache has its own SAL and can be accessed by the webapp via the ESL or from the service. I think our clients may still be too thick and our site libraries too thin at the moment.

2. The main problems Netflix developers tried to solve during the switch were the interaction between development teams and the kitchen sink objects (like Movie or Customer). The first one was fixed by a more service-oriented architecture, with fine-grained libraries and well-defined responsibilities; for the second they used the "facets pattern", which I'll describe later. Basically it's about understanding that objects can be represented differently when they are used differently. It gives developers a way of working with the same objects without breaking each other's code.

3. Lots of good practices on logging and monitoring, especially on how to use Hadoop/Hive and AWS for that. Note to self - figure out a way for us to collect usage analysis data.

4. We got a nice overview of the Amazon and Netflix tools that are used and were developed during the migration. We heard about S3, SQS, and some third-party tools like GeoIP and keystore/HSM. Some new open source tools are going to be released soon by Netflix - watch this space: https://github.com/Netflix

5. We also heard about how they deal with security: making sure that only the right people have access to the instances, but also limiting the ways services can call each other. Only services that are part of a certain security group can call a given service, which is also an easy way to find out who calls whom.
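To make the "facets pattern" from point 2 a bit more concrete, here is a rough sketch of how I understood the idea: the core object carries only identity, and each team attaches its own view (facet) of it. This is not Netflix code; `Movie`, `DisplayFacet`, `LicensingFacet` and all method names are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple, Type, TypeVar

T = TypeVar("T")


@dataclass
class Movie:
    """Core object: identity only. Everything else lives in facets."""
    movie_id: int
    _facets: Dict[type, Any] = field(default_factory=dict, repr=False)

    def add_facet(self, facet: Any) -> None:
        # Facets are keyed by their class, so teams cannot overwrite each other's view.
        self._facets[type(facet)] = facet

    def facet(self, facet_type: Type[T]) -> T:
        # Raises KeyError if nobody attached this facet.
        return self._facets[facet_type]


@dataclass
class DisplayFacet:
    """What a UI team might need (hypothetical fields)."""
    title: str
    box_art_url: str


@dataclass
class LicensingFacet:
    """What a licensing team might need (hypothetical fields)."""
    regions: Tuple[str, ...]


movie = Movie(42)
movie.add_facet(DisplayFacet("Example Title", "http://example.invalid/art.jpg"))
movie.add_facet(LicensingFacet(("US", "CA")))

if __name__ == "__main__":
    print(movie.facet(DisplayFacet).title)
    print(movie.facet(LicensingFacet).regions)
```

The point of the design is that the UI team can change `DisplayFacet` without touching anything the licensing team depends on, instead of everyone editing one giant `Movie` class.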

Cloud Architecture training at QCon (1)

I'm attending the Cloud Architecture training by Adrian Cockcroft today. Here are my notes from the morning session.
http://qconlondon.com/london-2012/presentations/show_presentation.jsp?oid=3889

We have around 40 people in the room, 3 girls including myself, two of them IT students. 90%+ is the usual white male programmer type (the session kicked off with introductions from the audience; I haven't been asking people what they do on my own :)). So far the highlights for me were:

1. Netflix dev team slides & other resources we all can find online, see: http://www.slideshare.net/netflix http://techblog.netflix.com/ and http://developer.netflix.com/

2. Learning about how the Netflix team works with Amazon: no sysops team, developers pushing builds over from testing to production. Fully in-cloud QA environments. Overreaching specs, almost non-existent capacity planning. Also, the out-of-the-box account/keys management provided by Amazon wasn't good enough, so they had to implement their own.

3. Discussion about other providers. Actually, the reasoning behind why Netflix uses Amazon and not any other cloud provider is not that interesting: Amazon is the only one that can handle the scale. What is interesting is to learn about the risks they had to look into before even deciding to go to any cloud. Apparently, if forced to, they can move out of Amazon in 3 months. At the moment they would move into a private cloud, simply because there is no other provider that can handle a company of that size.

4. It seems like different teams in Netflix own different services, so you wouldn't get two teams or more messing around with the same service. They do CI & they release to production a lot (just like we do :)), but they had to move to the cloud to do that.

5. Moving to the cloud took 3-4 years. The bulk of the work was to actually refactor the system so it's more manageable. It's amazing how many things were broken and needed fixing before the move. We saw a nice chart of all the components that were moved to the cloud. It seems like the process was very similar to what we do at Shopzilla; however, the scale of the system and the scale of the changes introduced were much bigger.

6. Really good (and only 10 minutes long) explanation of the anti-architecture approach. A bit like the boxes method we use right now, but geared more towards developers. Will try that one out soon :).

Still want to hear more about how they detect issues in production. They use the canary method for releases, but what about the issues that can only be detected after some time? Do they use Splunk? Do they have some automated monitoring systems? Who owns those (no SysOps, no CM)?
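For reference, the canary method itself can be sketched as a simple comparison of error rates between the old build and the canary. This is only my own illustration of the general idea, not how Netflix does it; the function names and the thresholds (`max_ratio`, `min_requests`, the 0.1% floor) are assumptions. It also does nothing for the slow-burning issues I'm asking about above.

```python
from typing import Tuple

Counts = Tuple[int, int]  # (errors, requests)


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def canary_verdict(baseline: Counts, canary: Counts,
                   max_ratio: float = 2.0, min_requests: int = 100) -> str:
    """Return 'wait', 'promote' or 'rollback' for a canary instance.

    baseline and canary are (errors, requests) pairs for the same time window.
    """
    if canary[1] < min_requests:
        return "wait"  # not enough traffic yet to judge
    base = error_rate(*baseline)
    can = error_rate(*canary)
    # Small absolute floor so a zero-error baseline does not fail on a single canary error.
    allowed = max(base * max_ratio, 0.001)
    return "promote" if can <= allowed else "rollback"


if __name__ == "__main__":
    print(canary_verdict((10, 10000), (1, 50)))
    print(canary_verdict((10, 10000), (1, 1000)))
    print(canary_verdict((10, 10000), (30, 1000)))
```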