
Friday, 9 March 2012

Qcon 2012, Where does Big Data meet Big Database

I've listened to the "Where does Big Data meet Big Database" byBenjamin Stopford talk at London's Qcon. The talk was mostly covering the differences between the "traditional" databases and the new technologies that were introduced in last decade. It provided a helpful insights into what questions one should answer when deciding on which technology to use, and unsurprisingly "which one is sexier" was not one of those questions. Here are my notes from the talk:

1. The situation in the data sources domain is much different than it was 10 years ago. The scale has changed, we have new, big sources of data that people and companies are interested in, and we have a range of new technologies, products and ideas that developers can leverage when architecting their applications.

2. MapReduce has been gaining popularity since 2004. It is simple, pragmatic, it solved a problem that didn't have a solution before, and it was novel and not limited by the tradition that stands behind older ideas. It introduced a split between the so-called "hacker culture" and "enterprise culture". Around 2009 it was criticised as "not really novel", operating on the wrong level (physical, not logical), incompatible with tooling, and lacking referential integrity and elegance.
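
The model itself is tiny - the framework-free sketch below (my own word-count illustration, not code from the talk, with made-up input data) shows the whole idea: a map function that turns each record into intermediate key/value pairs, and a reduce function that folds all values for a key together.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy, single-process word count, just to show the shape of the map and
// reduce phases (my own illustration, not code from the talk).
public class WordCount {

    // map: one input record in, a list of intermediate "words" (each counting as 1) out
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    // reduce: all values for one key are combined - here, simply summed
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        String[] input = {"big data big database", "big ideas"};
        // shuffle step: group intermediate pairs by key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (String word : map(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> result = new HashMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        System.out.println(result); // e.g. {big=3, data=1, database=1, ideas=1} (order may differ)
    }
}
```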

3. The new approach can be characterised as a bottom-up approach to accessing very large data sets, one that allows us to keep the data close to the code (close to where it matters) and to keep it in its natural form.

4. The relational ("old") approach - solid, consistent and safe data, but it limits developers through its language.

5. As early as 2001 some people closely connected with "traditional" solutions realised that there are problems with existing DBs - that they are good tools, but they force developers to "speak a different language"; they need to evolve and allow more flexibility. The NoSQL movement came about because developers wanted simple storage that works on diverse hardware, that scales, and that can be talked to in the developer's language of choice. "One size fits all" is not true anymore.

6. When choosing the tool you need to consider the following solutions: In-Memory DB, Shared Everything, Shared Nothing and Column Oriented Storage.

7. In-Memory DB - fast, but doesn't scale well and you can lose your data very easily. This will probably gain more popularity as distributed in-memory solutions evolve.

8. Shared Everything - all the data is located on shared disc space, all the client nodes have access to the full data set, and every node can handle every request. This is pretty slow, and every cache needs to sit above the whole database, which means the bigger the data set, the less effective your caching solution is.

9. Shared Nothing - each node has access to one shard of the full data set. This means you need to find the node that can handle your request - in the worst case scenario you'll need to iterate through all your nodes.
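
The usual way to dodge that worst case is to route requests deterministically, e.g. by hashing the key to pick the shard. The sketch below is my own illustration of the idea (class and node names are made up, and real systems tend to use consistent hashing rather than plain modulo so that adding nodes moves less data):

```java
import java.util.List;

// Minimal "hash the key, pick the shard" routing sketch - my own illustration
// of the shared-nothing idea, not code from the talk.
class ShardRouter {
    private final List<String> nodes; // e.g. ["db-0", "db-1", "db-2", "db-3"]

    ShardRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // Every client computes the same node for the same key, so nobody has to
    // iterate through all the nodes looking for the one holding the data.
    String nodeFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(bucket);
    }
}
```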

10. Column Oriented Storage - data organised in columns, not rows, laid out contiguously on disc. Very good compression. Indexing becomes less important; a single-record pull is quite expensive, but bulk reads/writes are much faster.
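
To make the trade-off concrete, here's my own toy illustration (not from the talk) of the two layouts - an aggregate over one column only touches that column's tightly packed values, while rebuilding a single record means one lookup per column:

```java
// My own toy illustration of row vs column layout, not from the talk.
public class ColumnStoreSketch {
    public static void main(String[] args) {
        // Row-oriented: whole records laid out one after another -
        // great for "give me record 1", wasteful for "average all ages".
        Object[][] rows = {
            {1, "Alice", 34},
            {2, "Bob",   28},
            {3, "Cara",  41},
        };
        System.out.println(java.util.Arrays.toString(rows[1])); // one contiguous read

        // Column-oriented: each column is contiguous - an aggregate scans one
        // tightly packed (and highly compressible) array, but rebuilding a
        // single record means touching every column.
        int[]    ids   = {1, 2, 3};
        String[] names = {"Alice", "Bob", "Cara"};
        int[]    ages  = {34, 28, 41};

        long sum = 0;
        for (int age : ages) sum += age;          // bulk read touches only one column
        System.out.println("avg age = " + (double) sum / ages.length);

        Object[] record1 = {ids[1], names[1], ages[1]}; // single-record pull: one lookup per column
        System.out.println(java.util.Arrays.toString(record1));
    }
}
```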

11. Additional factors: discs - older DBs were optimised for sequential access over magnetic drives, not random access over SSDs - and the growing speed of our networks. Mind that there are relational databases that leverage the latest technologies - they can achieve very good results in benchmarks.

12. When comparing and deciding on a solution don't think only about the size of your data set, its potential to grow, or even only about the abilities of each tool. Think about what you need/want. Do you mind SQL? Is your data changing constantly? Is the data isolated? Do you want to (and can you) keep your data in its natural state? Which solution can you afford?

13. It is good that we have a range of tools that provide different solutions for different problems; the trick is to know what your problem is, so you can pick the right solution.

Qcon 2012: Technology is your office

I've attended the "Technology is your office" talk by Horia Dragomir this morning. Most of the session covered good practices to be followed when working with distributed teams of developers. It could be summed up with talk with each other, be smart and create a culture people can share. What I found interesting was the list of tools Horia recommended for distributed teams, so here they are:

"Management" (as in management of time and workload, not resources):

  • Skype
  • Google apps
  • Google docs
  • Basecamp
  • Redmine
  • Pivotal Tracker
  • JIRA
  • Rally
Development:
  • Mercurial
  • GitHub
  • Yuuguu
  • Cloud9 IDE
  • everytimezone.com
  • TeamViewer
Design:
  • Mockingbird
  • Balsamiq
  • Axure
  • photoshopetiquette.com
  • LayerVault
Also, bonus points for the Eastern European accent joke at the beginning of the talk and for the neat idea of coming up with a catch phrase that sums up your team culture, so it can be propagated more easily.

Qcon 2012, Keynote - Resilient Response in Complex Systems

I was a bit late to the Friday keynote at London's QCon (thanks, Southern trains), but I managed to take some notes from John Allspaw's "Resilient Response in Complex Systems" talk. Here they are:

1. There are a number of things that go wrong when teams are faced with a failure of a complex system. There's refusal to make decisions, caused by either lack of authority (people want to make a decision but they are not sure if they can), lack of information (people can't make a decision, they can only guess) or bureaucracy and politics (people who are able to make a decision can't make it fast or without some approvals). There is "heroism" - individuals who walk away from the team to focus on what they think is the solution to the problem. If they succeed they send the wrong message to the team and the company - that issues are solved by individuals; if they fail, they abandoned the team when facing a disaster. There are also distractions - you need to be able to cut distractions down to a minimum when dealing with a disaster. This means irrelevant emails/links/social events, but can also mean isolating business owners from the team if they only add distractions when "panicking" over the outage.

2.  SRK Framework: http://en.wikipedia.org/wiki/Ecological_interface_design#The_Skills.2C_Rules.2C_Knowledge_.28SRK.29_framework

3. A good example of how to deal with a disaster can be found in so-called High Reliability Organisations (HROs - for example organisations whose work can affect human life or health, places like air traffic control or hospitals). They are very complex, they have to trade off between efficiency and thoroughness, and they are usually engineer-driven.

4. "Managing the Unexpected: Resilient Performance in an Age of Uncertainty" by Karl E. Weick 

5. The ways HROs deal with disaster: the teams are close to each other, they share tools and information, there is overlap in skills and knowledge, and team members can be moved from one team to another. Over-communication is the norm. There is a safe environment in which teammates can point out errors and mistakes - reporting of errors and mistakes is rewarded. There is a high level of "unease" - everybody is used to the idea that disaster will happen, and they are prepared for it. People have strong interpersonal skills. There are publicly available detailed records of previous incidents. Patterns of authority change dynamically to meet the demands - people directly dealing with the problem have the power to fix it.

6. Drill - practice troubleshooting under pressure, be comfortable with your tools, try to come up with new ways your system can break, and practice the ways of dealing with those. There are actions that can be taken immediately when a disaster happens; make sure those are fast and automatic for your team.

7. The law of stretched systems

8. Near misses - communicate them and distribute them widely. They happen more often than actual disasters, so you get a much greater volume of incidents to improve on. They are a reminder of the risks.

9. Spend as much time understanding why your team succeeded as on why the team failed. When faced with the choice of whether to analyse your success or your failure, choose to analyse both. Think about what goes right and why.

10. Learn from other fields, train for outages, learn from mistakes (and avoid politics and bureaucracy), learn from successes as well as failures.  

Thursday, 8 March 2012

Qcon 2012, 230 Iterations later

I've wandered off to the Agile track at London's QCon today, and listened to the "230 Iterations later" talk by Suki Bains and Kris Lander. Here are my notes from the talk:

The main idea was to use the example of the team both speakers have worked with for the past 4 years to show how being agile and focused on good process and delivery can lead to "delivery zombies" - teams that only focus on delivering stories without asking how? and why?

1. They did the right thing at the beginning and it worked well. The team was piloting the agile approach in the company; it took on a new project and started off with the brave decision of building a new platform as a backend for the new site. It managed to deliver successfully, but also to build a "perfect" agile environment, a strong foundation for future projects, and a team that felt it was working towards a common goal.

2. Reflecting on this time now, the speakers wondered what they meant by "good" when they say the team was in a good place back then. Does "good team" mean a team that produces code that provides the required features? Is it a feeling that team members have?

3. Thanks to the success the team built up a reputation and a good relationship with business owners, and delivery of features became easy. Shipping became a measure of success. Some early signs of future problems started showing up: the goals the team had in front of it felt small, technical debt started building up, with the "how?" question already answered the process felt too easy/boring, and the team started making mistakes - slipping. Tasks weren't always picked up in priority order ("This is first in the queue but it's a lot of front end, I can't work on it"). Sounds scary when I think about my team... :S

4. After 160 iterations the team became a "delivery zombie". Focused only on shipping features, not interested in innovation. The office became quiet. Mind that the code was still well maintained, and there was a high level of automation in the platform. The external tests covered most of the functionality (even though they were a "nightmare to manage"), and the cost of adding new features was kept down. The code is still considered to be an asset.

5. 176 iterations in, a new project is introduced. It's in line with what the team was doing before and they feel confident they can deliver again. The situation changes when the business owners change. The team loses its 1-to-1 relationship with the product owners, the new structure brings new complexity, and the team can't adapt quickly enough. The feedback loop between development and business is broken and the team loses its identity.

6. 190 iterations in, there are conflicts in the team, there is no common purpose, and people are not working together anymore. The team realises that delivery of features can't be its measure of success anymore - they are still delivering, but they can see that's not enough. They struggle to define new goals and values.

7. In the end the project is still shipped on time, and there is positive feedback both from the business and from users. But during the retrospective the team doesn't feel successful. The takeaway: vision is important, and it needs to be reinforced often and on different levels. The team needs to share common values and goals, and needs to understand (and ask) why? as well as how? Keep on asking what it means for your team to be "good", "successful" and "right".

Qcon 2012 Architecting for failure

I've listened to "Architecting for failure at the Guardian.co.uk" talk by Michael Brunton-Spall at London's QCon. The slides cn be found here: http://speakerdeck.com/u/bruntonspall/p/architecting-for-failure and here are my notes from the talk:

1. The scale of both the userbase and the content at the Guardian is pretty big (3.5m unique users, 1.6m unique pieces of content, hundreds of editorial staff, different content types - text, audio, video...).

2. The old system, which they replaced in 2011, was a big monolith - one code base for rendering dynamic content, third party app data and anything else you see on a news page. The system was replaced with what they call the micro-apps framework. They divided the page into content (managed by the CMS) and other components. The components were rendered based on responses from micro-apps. This was an SSI-like solution based on the HTTP protocol (simplicity).

3. This meant faster development, more innovation, and diversity of systems, languages and frameworks used. This also meant they have an in-house internal system (the CMS) and external environments hosted in the cloud that are accessed from the main system. They needed a good caching solution to be placed in between them.

4. The cost of the new system - support for all the different apps was far more complicated, maintenance was harder as the applications were written in different languages and hosted on different platforms, and extra work had to be done to make it easier for people to work within the system.

5. Failure is not a problem - you can plan for a failure and make sure that your application knows how to behave when one of the apps returns a 500 response. Slow is a problem - things go wrong when the application is not aware there is an issue because the app is just slow, or unresponsive for a long time. You need to think about what to do with slow responses and how to make the application aware of them. (The Guardian used some nice caching solutions for that.)
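
One common way to turn "slow" back into a visible failure is to give every micro-app call an explicit time budget; the sketch below is my own illustration of that idea (class names and the timeout value are made up, not the Guardian's code):

```java
import java.util.Optional;
import java.util.concurrent.*;

// Sketch of "turn slow into an explicit failure": give every micro-app call a
// time budget, and treat a timeout exactly like a 500. Names and values are my
// illustration, not the Guardian's code.
class MicroAppClient {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    Optional<String> fetchComponent(Callable<String> call, long budgetMillis) {
        Future<String> future = pool.submit(call);
        try {
            return Optional.of(future.get(budgetMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException slow) {
            future.cancel(true);          // a slow app is now a known failure...
            return Optional.empty();      // ...so the page can fall back to a cached component
        } catch (Exception failed) {      // plain failures (e.g. a 500) end up here too
            return Optional.empty();
        }
    }
}
```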

6. Things to think about when planning for failure of a micro-app - do you always want dynamic traffic, or speed? Do you have peaky traffic or flat traffic? Is a small subset of functionality enough? In the case of the Guardian speed is sometimes more important than dynamic content, so they came up with Emergency Mode. In Emergency Mode caches do not expire and "pressed pages" are served; if no pressed page is available the cached version is used, and only if no cached version is available is the page rendered as in normal mode.

7. Pressed pages - fully rendered HTML stored on disc as flat files, served like static files; they can serve up to 1k pages per second per server.
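
Put together, my reading of the emergency-mode serving logic looks roughly like the following sketch (method and type names are mine, not the Guardian's):

```java
import java.util.Optional;

// The emergency-mode fallback order as I understood it: pressed page first,
// then any cached copy (expiry ignored), and only then a normal render.
// Method names and types here are my own sketch, not the Guardian's code.
class EmergencyModeRenderer {
    private final PageStore pressedPages;   // fully rendered HTML flat files on disc
    private final PageCache cache;          // normal page cache, expiry ignored in emergency mode

    EmergencyModeRenderer(PageStore pressedPages, PageCache cache) {
        this.pressedPages = pressedPages;
        this.cache = cache;
    }

    String serve(String path) {
        return pressedPages.find(path)                   // 1. pressed page, served like a static file
                .or(() -> cache.findEvenIfExpired(path))  // 2. a stale cached page beats no page at all
                .orElseGet(() -> renderNormally(path));   // 3. last resort: render as in normal mode
    }

    private String renderNormally(String path) { return "<html>...rendered " + path + "</html>"; }

    interface PageStore { Optional<String> find(String path); }
    interface PageCache { Optional<String> findEvenIfExpired(String path); }
}
```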

8. When caching be smart, cache only what's important.

9. Monitoring is not Alerting (and vice versa). When monitoring, aggregate stats - you want to be able to group and compare behaviour per colo, per server, per app, per stage. Choose what to monitor (make sure the metrics you're looking at are helpful when trying to fix/understand the problem).

10. Automatic release valve - if page load time is higher than x, automatically move the site into emergency mode. Stay in the mode for 60 seconds; if after that everything is OK, there was no real issue and you can ignore the incident. If the problem persists, somebody needs to look at it. Use feature toggles, so you can switch things off to fix the main app.
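
A hedged sketch of what such a release valve could look like - the threshold, window and names are made up for illustration, not taken from the talk:

```java
// Sketch of the automatic release valve: if pages get slow, flip into
// emergency mode for a fixed window, then re-evaluate. Thresholds, names and
// the overall shape are my illustration, not the Guardian's code.
class ReleaseValve {
    private static final long THRESHOLD_MILLIS = 2_000;        // the "x" from the notes
    private static final long EMERGENCY_WINDOW_MILLIS = 60_000; // stay in the mode for 60 seconds

    private volatile boolean emergencyMode = false;
    private volatile long emergencyUntil = 0;

    // Called with the load time of every page we serve.
    void recordPageLoad(long loadTimeMillis) {
        long now = System.currentTimeMillis();
        if (!emergencyMode && loadTimeMillis > THRESHOLD_MILLIS) {
            emergencyMode = true;                               // move the site into emergency mode
            emergencyUntil = now + EMERGENCY_WINDOW_MILLIS;
        } else if (emergencyMode && now >= emergencyUntil) {
            if (loadTimeMillis <= THRESHOLD_MILLIS) {
                emergencyMode = false;                          // blip is over, nothing to investigate
            } else {
                emergencyUntil = now + EMERGENCY_WINDOW_MILLIS; // still slow: stay in, get a human to look
            }
        }
    }

    boolean isEmergencyMode() { return emergencyMode; }
}
```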

11. Log only relevant and useful stuff, learn how to analyse your logs, and make them easy to look at and analyse with the tools you've got. Measure your Mean Time Between Failures (MTBF) and Mean Time To Recovery (MTTR). If you fail rarely but need a long time to recover, it's bad. You want to recover faster.

12. Expect failure, plan for failure (plan for failure at 4am), keep things simple and keep things independent.

Qcon 2012 The Evolution of PaaS

I've attended "The Evolution of PaaS" talk by Paul Fremantle at QCon. Here are my notes from the talk:

1. There are still problems we face in application development teams: randomness (random libraries are used, there are random ways of doing things), project infrastructure takes too long to set up, we use unknown, unapproved libraries, there is no easy way to have a clear idea of which project is in what state at any given moment, there's too little automation, and there are few or no metrics on code quality, test coverage and re-use. PaaS could give us a way of fixing some, if not all, of those issues.

2. What does it mean to be cloud-native? To be distributed (works in the cloud), elastic (uses the cloud efficiently), multi-tenant (only costs when we use it), self-service (in the hands of users), granularly billed and metered (pay just for what you use), incrementally deployed & tested (seamless live updates).

3. PaaS itself is a step in the evolution - either from IaaS (when you decide to start using services provided by PaaS in addition to infrastructure you're already using) or from SaaS (when you decide you want to enhance the services you're using more deeply).

4. PaaS is about taking what developers care about and making it self-provisioned, managed, metered and paid per use in the cloud. The developers care about the code, but also about tools outside of the codebase (messages, databases, logs, workflows) and about non-runtime "things" - svn, git, build, CI, code coverage, automated tests.

5. The evolution of PaaS so far has been about moving from just allowing developers to deploy applications (while they still need to care about what the infrastructure looks like), through speciation and expanding services, towards dealing with multi-tenancy. PaaS will obviously keep evolving.

6. What does it mean to be open? Open to run in different places, open to run on different IaaS providers, open to run different types of applications, open to new services, maybe open source? Types of PaaS right now: public PaaS (Heroku, Amazon Elastic Beanstalk), private/public (Tibco, Microsoft), open private/public IaaS.

Qcon 2012 Sustainable speed is king

I've attended the "Zero to ten million users in four weeks: sustainable speed is the king" by Jodi Moran. Here are my notes (and bonus here are the slides from the talk: http://www.slideshare.net/plumbee/sustainable-speed-is-king-qconlondon-2012)

The session was a high-level look at how maintaining a sustainable speed of delivery can be the key to success for teams working in fast-paced, high-traffic, web-based industries.

1. Speed - end to end time for each change. Sustainability - maintaining the speed for a long time period. Long probably means the lifetime of the project/app/game :).

2. Sustainable speed is desired because it gives your project responsiveness - you can react to changes in your users' behaviour, business model changes or any external factors; it means greater returns - as you are able to deliver more features (especially in the social gaming business this means you can earn more from your users); and your investment is lower - you deliver fast, so you work less (time-wise, but not only).

3. A couple of methods Jodi used to achieve sustainable speed in her projects: Iterate & Automate, Use commodity technologies, Analyse and Improve, Create a high-speed culture.

4. Iterate & Automate - be agile, break things down, focus on principles, reflect on what you did and adjust. Automate all routine work - building, testing, provisioning, deployment. Isolate your changes, high-risk components need to be separated from low-risk ones, so you can deliver faster.Launch with minimal product, minimal process and minimal technology. Prepare for the technical debt. Decide early on how you'll pay for it and when. Take it on intentionally when needed.

5. Use commodity technologies. This means it's easier for you to find people who will work on your product. It will be faster to get your development process up and running. Your team will understand the different components of your system, and will be able to change them/work on them faster. Even complicated systems can be built on commodity technology, as was proved in the presented case study (a read/write game properties management tool that allowed the product owner to tweak properties of the game and deploy them to a test environment in "one step").

6. Analyse and Improve. You never have "too much" data - you need this information for reporting, monitoring, data mining, predictive analytics, personalisation and split testing (A/B testing). We should collect both user and system data so we can track both application state and user behaviour, as both are important. Split testing is especially important when maintaining sustainable speed - if you can analyse your data quickly you can react quickly. A good test means you don't invest time implementing things that do not work for your product.

7. Create a high-speed culture. The arrangement of teams is similar to the arrangement of your app's components - you want small, well-specified, modularised components that are responsible for providing a service and are independent from others. To manage communications, keep the teams that need to talk to each other close to each other. You can then "reuse" your teams like you would reuse your services. Hire the best people, trust them, and make them responsible for and excited about your product.

8. "Source Code Is A Liability, Not An Asset" (http://blogs.msdn.com/b/elee/archive/2009/03/11/source-code-is-a-liability-not-an-asset.aspx).

9. Testing: think about why you test your application. Shouldn't the developer be responsible for code quality (does the code do what the developer intended)? Shouldn't the product owner be responsible for the features implemented (is the application doing what they intended)? Won't user behaviour verify whether the feature implemented is what users wanted (react fast, A/B test)?

10. Load testing: why do we perform load tests? If we're testing capacity for new users, mind that hardware is cheap now and scaling is easy with the cloud around. If we're testing new feature capacity, we can do the same more effectively and safely with a dark launch.

11. Creators are responsible for quality, as well as for the product being alive in production - you don't need a separate operations team to be "on call", taking care of production environment. Engineers are responsible for the environments as well.

Wednesday, 7 March 2012

Qcon 2012 Cloud 2017: Cloud Architecture in 5 years

I've attended a panel discussion on the future of cloud architecture between Martijn Verburg, Mark Holdsworth, Patrick Debois and Richard Davies, moderated by Andrew Phillips. Here are some notes from the discussion. Note that I didn't stick around for the open discussion with the audience.

Q: Will there be cloud teams in companies that decide to use cloud as PaaS or IaaS?
It depends how we define a cloud team. Companies will need a team of people with an in-depth understanding of the technology behind the cloud so they can support it. They will also need a more broadly shared idea of what it means to work in the cloud, which will be at a much higher level but will have to reach many more people. For companies that go for a "private cloud", old dev ops teams will probably be replaced by cloud teams, but their characteristics won't change much. Depending on how many companies decide to stay in a private cloud, or decide to treat the cloud as just a different way of provisioning hardware, the cloud computing approach as an idea may be called a failure at some point in the future.

Q: In 5 years will the choice still be between "public" cloud, "private" cloud and in-house data center?
Depending on the requirements of the system, some companies may require a public cloud that's not the standard everybody uses at the moment. Currently a company is more likely to change its code/architecture to fit into the cloud than the other way round. This is caused by the nature of the companies that are moving to the cloud and the reasons behind the move. The more big companies make the switch, the more apparent the need for custom set-ups in the cloud will become.

Q: What technical breakthroughs do we need to unlock the cloud and make it commonly used across companies?
Companies need to start developing software that is smart enough to understand the platform it's deployed to. Currently only 20% of applications deployed in the cloud understand they are running on multiple and changing instances and can figure out when to expand/contract their hardware requirements. Developers, especially Java ones, need to learn how to write applications for the cloud.

Q: Will there be companies that use the cloud alongside an in-house data center, or maybe other solutions?
It will depend on the company and on the team. The most common pattern right now is teams working on greenfield projects for a bigger company in the cloud, alongside bigger and older projects running in in-house data centers.
The nature of this movement is that the shift usually happens outside of official paths within the company, and the approach - experimented with separately - is then ported back to fit the enterprise model.

Q: How big is the gap between reality and the hype for the cloud at the moment? How will it change in 5 years?
At the moment the gap is really small, but the more widespread the cloud gets, the bigger the gap will be. The hype will probably peak around 2017, and we'll see a backlash caused by disappointed adopters who didn't understand how to make the concept work for them.

Q: How will the cloud integrate with everything social?
Very well :). There are already existing services that allow for very quick and easy, almost code-less integration. This will only improve in the next few years.

Q: How will companies continue to interact with each other and with the cloud providers after migration?
Some ideas were thrown around about peer-to-peer, ad-hoc data/resource sharing between small groups of companies. The idea of brokers for PaaS, SaaS and IaaS was discussed as well.

Overall it was interesting to see what the people who are so involved with the cloud think about where the business is and where it is going. It was mostly speculation, but if I'm going to listen to anybody's speculation on this topic, these guys are pretty high on my list.

Qcon 2012 High Availability in Heroku

Here are the notes from the High Availability in Heroku talk by Mark McGranaghan at London's QCon.

1. There are two aspects the Heroku team looked at when improving their availability - architecture and execution. The first relates to the design of their system; the latter is more about how they had to change their approach to implement the designs.

2. On the architecture level Mark covered a number of techniques addressing the following problems: load balancing, supervision, error kernel, layered design, message passing.

3. The solution is based on the idea of having multiple load balancers running as internal apps between the front end applications and the backend provided by Heroku. In addition to the balancer level, they have supervising applications that check on the instances and bring them up when necessary. This already shows that they use a layered design - different problems are handled at separate levels in the system, which makes it more robust and easier to monitor. Finally, they use well defined, non-app-specific messages that can pass an increasing number of parameters between layers and apps. The messages are versioned - explicitly, or handled with graceful degradation.
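
As an illustration of that last point, a versioned message with graceful degradation might look roughly like the sketch below (the message shape, field names and version numbers are my own assumptions, not Heroku's actual protocol):

```java
import java.util.Map;

// A sketch of explicit message versioning with graceful degradation:
// consumers read what they understand and ignore the rest. This is my
// illustration of the idea, not Heroku's actual message format.
class InstanceEvent {
    final int version;                 // explicit version, bumped on incompatible changes
    final Map<String, String> fields;  // open map of parameters, so new fields can be added freely

    InstanceEvent(int version, Map<String, String> fields) {
        this.version = version;
        this.fields = fields;
    }
}

class SupervisorConsumer {
    private static final int NEWEST_UNDERSTOOD = 2;

    void handle(InstanceEvent event) {
        if (event.version > NEWEST_UNDERSTOOD) {
            // graceful degradation: fall back to the fields every version shares
            restart(event.fields.get("instanceId"));
            return;
        }
        String reason = event.fields.getOrDefault("reason", "unknown"); // newer, optional field
        System.out.println("restarting " + event.fields.get("instanceId") + " because " + reason);
        restart(event.fields.get("instanceId"));
    }

    private void restart(String instanceId) { /* bring the instance back up */ }
}
```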

4. Distributed call graphs - decide what to do when faced with a broken producer or worker (you can't read or you can't write data). When you can't write, consider local tickets that can be picked up when the system is available again.

5. On the execution front the challenges were: jaded deploys, bad visibility and cascading feedback. The deploys were covered by iterative deployment and so-called "canaries". Visibility is covered by real-time monitoring systems and automated tools that allow them to flag any issues as early as possible.

Qcon 2012 The future of Java Platform: Java SE8 and Beyond

Here are my notes from "The future of Java Platform: Java SE8 and Beyond" talk by Simon Ritter (Java Platform track).

The Java 8 release is planned for the middle of 2013; after that they plan a new release every two years, with plans being made for Java 12 already.


1. The talk started with a brief overview of Java's history, and a list of priorities that were defined for Java and are still considered valid: readability, simplicity, features not weighing down the language, continuous evolution.

2. We proceeded to talk about the most important features planned for Java 8, the first one being Lambda Expressions - http://openjdk.java.net/projects/lambda/. They will make writing parallel map/reduce code much easier (although we'll see if it will be much easier to read). They will replace the use of anonymous inner classes, which gets a gold star from me.
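
To make the difference concrete, here's a small example of my own (not one from the talk) comparing today's anonymous inner class with the lambda syntax proposed for Java 8:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LambdaSketch {
    public static void main(String[] args) {
        List<String> talks = Arrays.asList("Big Data", "PaaS", "Heroku");

        // Pre-Java-8 style: a verbose anonymous inner class
        Collections.sort(talks, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });

        // The lambda style proposed for Java 8: same behaviour, far less noise
        Collections.sort(talks, (a, b) -> Integer.compare(a.length(), b.length()));

        System.out.println(talks);
    }
}
```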

3. There's a new way to extend old classes with new functionality - extension methods will allow you to add a method to an interface that won't have to be implemented by its children. If a class doesn't implement the method, the default implementation defined in the extension method will be used. That way the Collection classes can gain map, reduce, filter and parallel methods that work with lambda expressions.
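
A quick sketch of what such an extension ("default") method looks like - the example is mine, not one shown in the talk:

```java
// Sketch of an extension method: the interface supplies a default body, so
// existing implementations don't have to change. My example, not the talk's.
interface Greeter {
    String name();

    // New method added to the interface; classes that don't override it
    // get this default implementation instead of a compile error.
    default String greeting() {
        return "Hello, " + name() + "!";
    }
}

class ConferenceGreeter implements Greeter {
    @Override
    public String name() { return "QCon"; }
    // no greeting() override needed - the default from the interface is used
}

class Demo {
    public static void main(String[] args) {
        System.out.println(new ConferenceGreeter().greeting()); // Hello, QCon!
    }
}
```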

4. The module-info.java file (http://openjdk.java.net/projects/jigsaw/) is the result of a trend towards modularisation of Java apps and Java as a language. The idea is that you can get rid of the classpath and define your dependencies in a more flexible way.

5. Summary of the features planned for Java 8: the Lambda project, the Jigsaw project, JVM convergence (HotSpot VM and JRockit VM), JavaFX 3.0 (as part of Java 8), JavaScript support, support for multiple devices, and a focus on developer productivity.

6. The trends for further future: Interoperability, Cloud, Ease of Use, Advanced Optimisation, "Works Everywhere and with everything".

Qcon 2012 Keynote

This week I'm attending Qcon in London. The morning started off with a "The Data Panorama" keynote by Martin Fowler and Rebecca Parsons. Here are my notes from the talk.

The talk started off with some funny lines and Martin going crazy about how big the data is these days - bonus points in my book. Also bonus points for a female speaker in the first keynote in the conference.

1. The data is big, like really big. And it's not only a problem for companies like Google and Amazon, but for everybody, because we can't correctly estimate how big our data will get.

2. Other than big, the data is distributed (as in spread across different data sources, but also created by distributed contributors and on a range of different tools that introduce new behaviours), valuable, urgent (as in it needs to be analysed almost as it arrives) and connected.

3. Martin tried to define NoSQL, or rather its characteristics, and came up with: non-relational, open source, cluster-friendly, 21st century web ready, schema-less. He proceeded to divide the more commonly used data stores into Document, Column, Graph and Key-Value. He noted that what's common to the Document, Column and Key-Value ones is the approach to aggregates. He concluded by saying that when you add relational DBs to the mix you get Polyglot Persistence - you have various tools that can be picked depending on what your needs are - an improvement over years of trying to use the same tool (relational DBs) to do everything.

4. Another shift noted in the way we use data sources is on the application level. Traditionally we would use an Enterprise Data Model, where one data source would be accessed by many different applications. The shift is towards all the apps owning their own data sources, which are distributed and can be synchronised when needed. The responsibility for the data shifts towards the teams that own the applications.

5. There was a slide or two on event sourcing, and how this makes building up a DB from nothing much easier (it was also noted that this is a trick "stolen" from version control tools).

6. Cloud Storage impact on the data was summarised briefly - the separation of location of the data owner and location of the data & very cheap, almost unlimited processing power were mentioned.

7. http://www.visual-literacy.org/periodic_table/periodic_table.html

8. The last part of the talk covered how all of this will change the Data Scientist's work and the Data Warehouse function in companies. They put a lot of emphasis on how the process of analysing data will change in the future (a more agile, more intelligent approach, searching for answers to questions we don't know yet). They also talked about privacy and how it will become an even bigger concern in the future.

Overall the talk was a nice mix of "techy" and less "techy", all of this made a lot of sense & was quite useful.

Monday, 5 March 2012

Cloud Architecture training at QCon (3)

Notes from the last part of the Cloud Architecture training at QCon.

This part was mostly about Cassandra - why Netflix uses it & how they use it. There were bits about monitoring and about scalability of their system. Here are the most interesting points:

1. CAP Theorem - Choose Consistency or Availability when Partitioned. Master-slave vs Quorum models.
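
One bit of arithmetic worth remembering about the quorum model (my addition, not a slide from the training): with N replicas, reads of R nodes and writes to W nodes are guaranteed to overlap when R + W > N.

```java
// Tiny helper illustrating the quorum overlap rule (my addition, not from the
// training material): with N replicas, a read of R nodes is guaranteed to see
// the latest write to W nodes whenever R + W > N.
final class Quorum {
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        // Cassandra-style QUORUM on both sides with 3 replicas: 2 + 2 > 3
        System.out.println(readsSeeLatestWrite(3, 2, 2)); // true
        // Fast-but-weak setup: read one node, write one node, 1 + 1 <= 3
        System.out.println(readsSeeLatestWrite(3, 1, 1)); // false
    }
}
```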

2. An overview of current Netflix Persistence stack. It included info on Memcache, Cassandra, MongoDB and MySQL.

3. Introduction of Netflix's open source tools for Cassandra users: Priam - Cassandra automation, and Astyanax - a Cassandra Java client. Both can be found on GitHub. We also heard about Netflix's contributions to Cassandra.

4. Interesting background on why Netflix uses Cassandra. Makes me really wonder about our use of Riak... We also got very detailed Cassandra data flows for single and multi-region apps, including the replication, backup, archive and logging mechanisms. It's all backed up by S3.

5. Cloud bring-up strategy - they used Shadow Traffic Redirection, worked in iterations, one page at a time, starting with the simplest. They managed to "sell" the cloud to all developers early on with a development boot camp. Most of the issues they faced were around key management and an early lack of experience with AWS.

6. The monitoring is based on logs (they log everything almost all the time). The logs are processed in Hadoop and are used to generate reports. They use AppDynamics as a portal that gives them a deep look into what's running in production.

The last slides covered the structure of the team - notably small sys-ops and DBA teams and very strong Java dev teams.

Overall the training was very useful and packed with interesting info. I can see at least two more posts coming out of it. It also made me note some questions for Shopzilla's cloud migration in EU.

Cloud Architecture training at QCon (2)

Here are the notes from today's after-lunch session of Cloud Architecture training at QCon.

We've moved on to a more detailed description of the architecture that is currently running under Netflix's customer-facing apps. Here are the most interesting bits I noted:

1. Finally I have good names for the client and site libraries that we use in Shopzilla. Netflix calls them SAL - Service Access Library - and ESL - External Service Library. I like the way they use the ESL to call up a local cache, and that the service-side cache has its own SAL and can be accessed by the webapp via the ESL or from a service. I think our clients may still be too thick and our site libraries too thin at the moment.

2. The main problems Netflix developers tried to solve during the switch were: the development teams' interaction and the kitchen-sink objects (like Movie or Customer). The first one was fixed by a more service-oriented architecture, with fine-grained libraries and well defined responsibilities; for the second they used the "facets pattern", which I'll describe later. Basically it's about understanding that objects can be represented differently when they are used differently. It gives developers a way of working with the same objects without breaking each other's code.
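
I won't guess at Netflix's exact implementation, but the shape of the facets idea can be sketched like this (the interfaces and fields are my own illustration):

```java
// A sketch of the facets idea as I understood it: one underlying object, but
// each consumer works against the narrow view (facet) it needs, so teams can
// evolve their facet without breaking each other. Names are mine, not Netflix's.
interface PlayableFacet {          // what a playback service cares about
    String streamUrl();
}

interface PresentableFacet {       // what the UI/webapp cares about
    String title();
    String boxArtUrl();
}

class Movie implements PlayableFacet, PresentableFacet {
    private final String title;
    private final String streamUrl;
    private final String boxArtUrl;

    Movie(String title, String streamUrl, String boxArtUrl) {
        this.title = title;
        this.streamUrl = streamUrl;
        this.boxArtUrl = boxArtUrl;
    }

    @Override public String streamUrl() { return streamUrl; }
    @Override public String title()     { return title; }
    @Override public String boxArtUrl() { return boxArtUrl; }
}
```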

3. Lots of good practices on logging and monitoring. Especially on how to use hadoop/hive and AWS for that. Note to self - figure out a way for us to collect usage analysis data.

4. We got a nice overview of the Amazon and Netflix tools that are used and were developed during the migration. We heard about S3, SQS, and some third party tools like GeoIP and keystore/HSM. There are some new open source tools going to be released soon by Netflix - watch this space: https://github.com/Netflix

5. We also heard about how they deal with security, making sure that only the right people have access to the instances, but also limiting the ways services can call each other - only services that are part of a certain security group can call a service, which is also an easy way to find out who calls whom.

Cloud Architecture training at QCon (1)

I'm attending the Cloud Architecture training by Adrian Cockcroft today. Here are my notes from the morning session.
http://qconlondon.com/london-2012/presentations/show_presentation.jsp?oid=3889

We have around 40 people in the room, 3 girls including myself, two of them IT students. 90%+ is the usual white male programmer type (the session kicked off with introductions from the audience; I haven't been asking people what they do on my own :)). So far the highlights for me were:

1. Netflix dev team slides & other resources we all can find online, see: http://www.slideshare.net/netflix http://techblog.netflix.com/ and http://developer.netflix.com/

2. Learning about how the Netflix team works with Amazon as a provider: no sysops team, developers pushing builds over from testing to production. Fully in-cloud QA environments. Overreaching specs, almost non-existent capacity planning. Also, the out-of-the-box account/key management provided by Amazon wasn't good enough, so they had to implement their own.

3. Discussion about other providers - actually the reasoning behind why Netflix uses Amazon and not any other cloud provider is not that interesting: Amazon is the only one that can handle the scale. What is interesting is to learn about the risks they had to look into before even deciding to go to any cloud. Apparently, if forced to, they can move out of Amazon in 3 months. At the moment they would move into a private cloud, simply because there is no other provider that can handle a company of that size.

4. It seems like different teams in Netflix own different services, so you wouldn't get two teams or more messing around with the same service. They do CI & they release to production a lot (just like we do :)), but they had to move to the cloud to do that.

5. Moving to the cloud took 3-4 years. The bulk of the work was actually refactoring the system so it's more manageable. It's amazing how many things were broken and needed fixing before the move. We saw a nice chart of all the components that were moved to the cloud; it seems like the process was very similar to what we do in Shopzilla, however the scale of the system, and the scale of the changes introduced, was much bigger.

6. Really good (and only 10 minutes long) explanation of the anti-architecture approach. A bit like the boxes method we use right now, but geared more towards developers. Will try that one out soon :).

Still want to hear more about how they detect issues in production (they use the canary method for releases, but what about issues that can only be detected after some time? Do they use Splunk, do they have some automated monitoring systems? Who owns those (no SysOps, no CM)?)