Thursday, 8 March 2012

Qcon 2012 Architecting for failure

I've listened to the "Architecting for failure at the Guardian.co.uk" talk by Michael Brunton-Spall at London's QCon. The slides can be found here: http://speakerdeck.com/u/bruntonspall/p/architecting-for-failure and here are my notes from the talk:

1. The scale of both userbase and content at the Guardian is pretty big (3.5m unique users, 1.6m unique pieces of content, hundreds of editorial staff, different content types - text, audio, video...).

2. The old system, which they replaced in 2011, was a big monolith - one code base for rendering dynamic content, third-party app data and anything else you see on a news page. The system was replaced with what they call the micro-apps framework. They divided the page into content (managed by the CMS) and other components. The components were rendered based on responses from micro-apps. This was an SSI-like solution based on the HTTP protocol (simplicity).
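To make the micro-app idea concrete, here is a minimal sketch of SSI-like composition over HTTP - my illustration, not the Guardian's actual code; the endpoint URLs and slot names are invented:

```python
# SSI-style page composition over HTTP: fetch each component from its
# micro-app and stitch it into the page. Hypothetical URLs and slots.
import requests

MICRO_APPS = {
    "most-read": "http://most-read.internal/component",
    "related": "http://related.internal/component",
}

def render_page(article_html):
    page = article_html
    for slot, url in MICRO_APPS.items():
        try:
            resp = requests.get(url, timeout=1.0)  # keep micro-apps on a short leash
            resp.raise_for_status()
            fragment = resp.text
        except requests.RequestException:
            fragment = ""  # a failed component degrades to an empty slot
        page = page.replace("<!--#include app='%s'-->" % slot, fragment)
    return page
```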

3. This meant faster development, more innovation, and a diversity of systems, languages and frameworks used. It also meant they have an in-house internal system (the CMS) and external environments hosted in the cloud that are accessed from the main system. They needed a good caching solution to be placed in between them.

4. The cost of the new system - support for all the different apps was far more complicated, maintenance was harder as the applications were written in different languages and hosted on different platforms, and extra work had to be done to make it easier for people to work within the system.

5. Failure is not a problem - you can plan for failure and make sure that your application knows how to behave when one of the apps returns a 500 response. Slow is a problem - things go wrong when the application is not aware there is an issue because an app is just slow, or unresponsive for a long time. You need to think about what to do with slow responses and how to make the application aware of them. (The Guardian used some nice caching solutions for that.)
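A rough sketch of that idea - treat a slow micro-app as a failed one by enforcing a hard timeout, and fall back to the last good (stale) response; the timings and cache shape are my assumptions, not the Guardian's implementation:

```python
# Treat "slow" as "failed": hard timeout on the micro-app call,
# fall back to the last good response if the call fails or hangs.
import time
import requests

CACHE = {}        # url -> (fetched_at, body)
FRESH_FOR = 60.0  # seconds a cached fragment counts as fresh

def fetch_component(url):
    now = time.time()
    cached = CACHE.get(url)
    if cached and now - cached[0] < FRESH_FOR:
        return cached[1]
    try:
        resp = requests.get(url, timeout=0.5)  # a slow app is a failed app
        resp.raise_for_status()
        CACHE[url] = (now, resp.text)
        return resp.text
    except requests.RequestException:
        if cached:
            return cached[1]  # serve stale rather than block the page
        raise
```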

6. Things to think about when planning for the failure of a micro-app - do you always want dynamic content, or speed? Do you have peaky traffic or flat? Is a small subset of functionality enough? In the Guardian's case speed is sometimes more important than dynamic content, so they came up with Emergency Mode. In Emergency Mode caches do not expire, the "pressed pages" are served, if no pressed page is available the cached version is used, and only if no cached version is available is the page rendered as in normal mode.
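A hedged sketch of that fallback chain (pressed page, then non-expiring cache, then a normal render); the file layout and function names are placeholders, not the Guardian's implementation:

```python
# Emergency Mode lookup order: pressed page -> cached copy -> render.
import os

PRESSED_DIR = "/var/pressed"  # fully rendered HTML flat files (assumed path)

def serve_emergency(path, cache, render):
    pressed = os.path.join(PRESSED_DIR, path.strip("/") + ".html")
    if os.path.exists(pressed):
        with open(pressed) as f:  # 1. pressed page: a static file, fastest
            return f.read()
    if path in cache:             # 2. cached version, even if it has expired
        return cache[path]
    return render(path)           # 3. last resort: render as in normal mode
```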

7. Pressed pages - fully rendered HTML stored on disc as flat files and served like static files; they can serve up to 1k pages per second per server.

8. When caching, be smart - cache only what's important.

9. Monitoring is not Alerting (and vice versa). When monitoring, aggregate stats - you want to be able to group and compare behaviour per colo, per server, per app, per stage. Choose what to monitor (make sure the metrics you're looking at are helpful when trying to fix/understand the problem).
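One way to make that kind of grouping possible is to tag every data point with its dimensions at the point of recording - a small sketch, with invented metric names and tag keys:

```python
# Tag every data point with its dimensions so stats can later be
# grouped and compared per colo / server / app / stage.
import socket
from collections import defaultdict

TAGS = {
    "colo": "lon1",  # assumed deployment labels
    "server": socket.gethostname(),
    "app": "most-read",
    "stage": "prod",
}

STATS = defaultdict(list)  # (metric, tags...) -> values

def record(metric, value, **extra_tags):
    tags = dict(TAGS, **extra_tags)
    key = (metric,) + tuple(sorted(tags.items()))
    STATS[key].append(value)

record("response_ms", 42.0)
record("response_ms", 310.0, stage="staging")  # comparable across stages
```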

10. Automatic release valve - if page load time is higher than x, automatically move the site into emergency mode. Stay in that mode for 60 seconds; if after that everything is OK, there was no real issue and you can ignore the incident. If the problem persists, somebody needs to look at it. Use feature toggles, so you can switch things off to fix the main app.
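A minimal sketch of such a release valve; the 2-second threshold and the exact mechanics are my guesses:

```python
# Automatic release valve: trip into emergency mode when page load
# time crosses a threshold, re-check after 60 seconds.
import time

THRESHOLD_MS = 2000   # assumed "x"
EMERGENCY_FOR = 60    # seconds

emergency_until = 0.0

def observe_load_time(ms):
    global emergency_until
    if ms > THRESHOLD_MS:
        emergency_until = time.time() + EMERGENCY_FOR

def in_emergency_mode():
    return time.time() < emergency_until
```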

11. Log only relevant and useful stuff, learn how to analyse your logs, and make them easy to look at and analyse with the tools you've got. Measure your Mean Time Between Failures (MTBF) and Mean Time To Recovery (MTTR). If you fail rarely but need a long time to recover, that's bad. You want to recover faster.
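A tiny worked example of the two metrics, with made-up incident timestamps just to show the arithmetic:

```python
# MTBF / MTTR from (made-up) incident timestamps, in minutes.
failures = [100.0, 500.0, 900.0]    # when each failure started
recoveries = [110.0, 560.0, 905.0]  # when service was restored

mttr = sum(r - f for f, r in zip(failures, recoveries)) / len(failures)
mtbf = sum(b - a for a, b in zip(failures, failures[1:])) / (len(failures) - 1)

print("MTTR: %.1f min, MTBF: %.1f min" % (mttr, mtbf))
# -> MTTR: 25.0 min, MTBF: 400.0 min - rare failures but slow recovery, the bad case
```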

12. Expect failure, plan for failure (plan for failure at 4am), keep things simple and keep things independent.

Qcon 2012 The Evolution of PaaS

I've attended "The Evolution of PaaS" talk by Paul Fremantle at QCon. Here are my notes from the talk:

1. There are still problems we face in application development teams: randomness (random libraries are used, there are random ways of doing things), project infrastructure takes too long to set up, we use unknown, unapproved libraries, there is no easy way to have a clear idea of which project is in what state at any given moment, there's too little automation, and there are few or no metrics on code quality, test coverage and re-use. PaaS could give us a way of fixing some, if not all, of those issues.

2. What does it mean to be cloud-native? To be distributed (works in the cloud), elastic (uses the cloud efficiently), multi-tenant (only costs when we use it), self-service (in the hands of users), granularly billed and metered (pay just for what you use), incrementally deployed & tested (seamless live updates).

3. PaaS itself is a step in the evolution - either from IaaS (when you decide to start using services provided by PaaS in addition to the infrastructure you're already using) or from SaaS (when you decide you want to extend the services you're using more deeply).

4. PaaS is about taking what developers care about and making it self-provisioned, managed, metered and paid per use in the cloud. Developers care about the code, but also about tools outside the codebase (messages, databases, logs, workflows) and about non-runtime "things" - svn, git, builds, CI, code coverage, automated tests.

5. The evolution of PaaS so far has been about moving from just allowing developers to deploy applications (where they still need to care about what the infrastructure looks like), through speciation and expanding services, towards dealing with multi-tenancy. PaaS will obviously keep evolving.

6. What does it mean to be open: open to run in different places, open to run on different IaaS providers, open to run different types of applications, open to new services, maybe open source? Types of PaaS right now: public PaaS (Heroku, Amazon Elastic Beanstalk), private/public (Tibco, Microsoft), open private/public IaaS.

Qcon 2012 Sustainable speed is king

I've attended the "Zero to ten million users in four weeks: sustainable speed is the king" talk by Jodi Moran. Here are my notes (and as a bonus, here are the slides from the talk: http://www.slideshare.net/plumbee/sustainable-speed-is-king-qconlondon-2012).

The session was a high-level look at how maintaining a sustainable speed of delivery can be the key to success for teams working in fast-paced, high-traffic, web-based industries.

1. Speed - the end-to-end time for each change. Sustainability - maintaining that speed over a long time period. Long probably means the lifetime of the project/app/game :).

2. Sustainable speed is desirable because it gives your project responsiveness - you can react to changes in user behaviour, business model changes, or any external factors. It means greater returns, as you are able to deliver more features (especially in the social gaming business, where more features mean you can earn more from your users). And your investment is lower - you deliver fast, so you work less (time-wise, but not only).

3. A couple of methods Jodi used to achieve sustainable speed in her projects: Iterate & Automate, Use commodity technologies, Analyse and Improve, Create a high-speed culture.

4. Iterate & Automate - be agile, break things down, focus on principles, reflect on what you did and adjust. Automate all routine work - building, testing, provisioning, deployment. Isolate your changes; high-risk components need to be separated from low-risk ones, so you can deliver faster. Launch with a minimal product, minimal process and minimal technology. Prepare for technical debt: decide early on how you'll pay for it and when, and take it on intentionally when needed.

5. Use commodity technologies. This means it's easier for you to find people who will work on your product. It will be faster to get your development process up and running. Your team will understand the different components of your system, and will be able to change them/work on them faster. Even complicated systems can be built on commodity technology, as was proved in the presented case study (a read/write game properties management tool that allowed the product owner to tweak properties of the game and deploy them to a test environment in "one step").

6. Analyse and Improve. You never have "too much" data - you need this information for reporting, monitoring, data mining, predictive analytics, personalisation and split testing (a/b testing). We should collect both user and system data so we can track both application state and user behaviour, as both are important. Split testing is especially important when maintaining sustainable speed - if you can analyse your data quickly you can react quickly. A good test means you don't invest time implementing things that don't work for your product.
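A minimal split-testing sketch - deterministic variant assignment by hashing the user id, so a user always lands in the same arm; the experiment name and arms are invented:

```python
# Deterministic A/B assignment: hash the user id so the same user
# always lands in the same arm, then log the outcome for analysis.
import hashlib

def variant(user_id, experiment, arms=("control", "treatment")):
    digest = hashlib.sha256(("%s:%s" % (experiment, user_id)).encode()).digest()
    return arms[digest[0] % len(arms)]

print(variant("user-42", "new-checkout"))  # stable across requests
```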

7. Create a high-speed culture. Team arrangement is similar to your app's component arrangement - you want small, well-specified, modularised components that are responsible for providing a service and are independent from the others. To manage communication, keep teams that need to talk to each other close to each other. You can then "reuse" your teams like you would reuse your services. Hire the best people, trust them, and make them responsible for and excited about your product.

8. "Source Code Is A Liability, Not An Asset" (http://blogs.msdn.com/b/elee/archive/2009/03/11/source-code-is-a-liability-not-an-asset.aspx).

9. Testing: think about why you test your application. Shouldn't the developer be responsible for code quality (does the code do what the developer intended)? Shouldn't the product owner be responsible for the features implemented (is the application doing what they intended)? Won't user behaviour verify whether an implemented feature is what users wanted (react fast, a/b test)?

10. Load testing: why do we perform load tests? If we're testing capacity for new users, bear in mind that hardware is cheap now and scaling is easy with the cloud around. If we're testing new feature capacity, we can do the same more effectively and safely with a dark launch.
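A hedged sketch of a dark launch: run the new code path against a slice of real production traffic, but discard its output and keep serving the old path; the names and the 5% share are assumptions:

```python
# Dark launch: exercise the new code path with real production
# traffic, but discard its output and keep serving the old path.
import logging
import random

DARK_FRACTION = 0.05  # assumed: 5% of requests also hit the new path

def handle(request, old_path, new_path):
    response = old_path(request)  # users always get the old result
    if random.random() < DARK_FRACTION:
        try:
            new_path(request)     # measure load and latency, ignore output
        except Exception:
            logging.exception("dark-launched path failed")
    return response
```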

11. Creators are responsible for quality, as well as for keeping the product alive in production - you don't need a separate operations team to be "on call", taking care of the production environment. Engineers are responsible for the environments as well.

Wednesday, 7 March 2012

Qcon 2012 Cloud 2017: Cloud Architecture in 5 years

I've attended a panel discussion on the future of cloud architecture between Martijn Verburg, Mark Holdsworth, Patrick Debois and Richard Davies, moderated by Andrew Phillips. Here are some notes from the discussion. Note that I didn't stick around for the open discussion with the audience.

Q: Will there be cloud teams in companies that decide to use the cloud as PaaS or IaaS?
It depends how we define a cloud team. Companies will need a team of people with an in-depth understanding of the technology behind the cloud so they can support it. They will also need a broader, more widely shared idea of what it means to work in the cloud - much higher level, but one that has to reach many more people. For companies that go into a "private cloud", old dev ops teams will probably be replaced by cloud teams, but their characteristics won't change much. Depending on how many companies decide to stay in a private cloud or to treat the cloud as just a different way of provisioning hardware, cloud computing as an idea may be called a failure at some point in the future.

Q: In 5 years will the choice still be between "public" cloud, "private" cloud and in-house data center?
Depending on the requirements of the system, some companies may require a public cloud that differs from the standard everybody uses at the moment. Currently a company is more likely to change its code/architecture to fit into the cloud than the other way round. This is caused by the nature of the companies that are moving to the cloud and the reasons behind the move. The more big companies make the switch, the more apparent the need for custom set-ups in the cloud will become.

Q: What technical breakthroughs do we need to unlock the cloud and make it commonly used across companies?
Companies need to start developing software that is smart enough to understand the platform it's deployed to. Currently only 20% of applications deployed in the cloud understand that they are running on multiple, changing instances and can figure out when to expand/contract their hardware requirements. Developers, especially Java ones, need to learn how to write applications for the cloud.

Q: Will there be companies that use the cloud alongside an in-house data center, or maybe other solutions?
It will depend on the company and on the team. The most common pattern right now is some teams working on greenfield projects for a bigger company in the cloud, alongside bigger and older projects running in in-house data centers.
The nature of this movement is that the shift usually happens outside of official paths within the company, and the approach - experimented on separately - is then ported back to fit the enterprise model.

Q: How big is the gap between reality and the hype for the cloud at the moment? How will it change in 5 years?
At the moment the gap is really small, but the more widespread the cloud gets, the bigger the gap will become. The hype will probably peak around 2017, and we'll see a backlash caused by disappointed adopters who didn't understand how to make the concept work for them.

Q: How will the cloud integrate with everything social?
Very well :). There are already existing services that allow for very quick and easy, almost code-less integration. This will only improve in the coming years.

Q: How will companies continue to interact with each other and with the cloud providers after migration?
Some ideas were thrown around about peer-to-peer, ad-hoc data/resource sharing between small groups of companies. The idea of brokers for PaaS, SaaS and IaaS was discussed as well.

Overall it was interesting to see what people who are so involved with the cloud think about where the business is and where it is going. It was mostly speculation, but if I'm going to listen to anybody's speculation on this topic, these guys are pretty high on my list.

Qcon 2012 High Availability in Heroku

Here are the notes from the High Availability in Heroku talk by Mark McGranaghan at London's QCon.

1. There are two aspects the Heroku team looked at when improving their availability - architecture and execution. The first relates to the design of their system; the latter is more about how they had to change their approach to implement those designs.

2. On the architecture level Mark covered a number of techniques addressing the following problems: load balancing, supervision, error kernels, layered design and message passing.

3. The solution is based on the idea of having multiple load balancers running as internal apps between the front-end applications and the backend provided by Heroku. On top of the balancer level they have supervising applications that check on the instances and bring them up when necessary. This already shows that they use a layered design - different problems are handled on separate levels of the system, which makes it more robust and easier to monitor. Finally, they use well-defined, non-app-specific messages that can pass a growing number of parameters between layers and apps. The messages are versioned - explicitly, or handled with graceful degradation (sketched below).
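A small sketch of what versioned messages with graceful degradation could look like - the schema and field names are invented, not Heroku's actual protocol:

```python
# Versioned messages with graceful degradation: newer producers add
# fields, older consumers ignore them, unversioned messages are v1.
def handle_message(msg):
    version = msg.get("version", 1)  # missing version: assume the oldest
    payload = {"instance": msg["instance"], "status": msg["status"]}
    if version >= 2:
        payload["region"] = msg.get("region", "unknown")
    return payload

handle_message({"version": 2, "instance": "web.1", "status": "up", "region": "eu"})
handle_message({"instance": "web.2", "status": "down"})  # a v1 message still works
```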

4. Distributed call graphs - decide what to do when faced with a broken producer or worker (you can't read or you can't write data). When you can't write, consider local tickets that can be picked up when the system is available again.
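A hedged sketch of the local-tickets idea: when the downstream store won't accept writes, park the work locally and replay it later; the on-disk queue layout is my assumption, not Heroku's implementation:

```python
# Local tickets: when the downstream store won't take writes, park
# the record on local disk and replay it once the system is back.
import json
import os
import time
import uuid

TICKET_DIR = "/var/tickets"

def write_or_ticket(write, record):
    try:
        write(record)  # normal path
    except IOError:
        os.makedirs(TICKET_DIR, exist_ok=True)
        path = os.path.join(TICKET_DIR, "%s.json" % uuid.uuid4())
        with open(path, "w") as f:  # park the write as a local ticket
            json.dump({"at": time.time(), "record": record}, f)

def replay_tickets(write):
    if not os.path.isdir(TICKET_DIR):
        return
    for name in os.listdir(TICKET_DIR):  # picked up once writes succeed again
        path = os.path.join(TICKET_DIR, name)
        with open(path) as f:
            write(json.load(f)["record"])
        os.remove(path)
```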

5. On the execution front the challenges were: jaded deploys, bad visibility and cascading feedback. Deploys were addressed with iterative deployment and so-called "canaries". Visibility is covered by real-time monitoring systems and automated tools that flag any issues as early as possible.
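A rough sketch of a canary rollout: route a small share of traffic to the new version and promote it only if its error rate stays comparable to the stable one; the thresholds and names are assumptions, not Heroku's tooling:

```python
# Canary rollout: send a small share of traffic to the new version,
# promote it only if its error rate stays close to the stable one.
import random

CANARY_SHARE = 0.01  # assumed share

def pick_version(stable, canary):
    return canary if random.random() < CANARY_SHARE else stable

def should_promote(canary_errors, canary_reqs, stable_errors, stable_reqs):
    if canary_reqs < 1000:  # not enough signal yet
        return False
    canary_rate = canary_errors / canary_reqs
    stable_rate = stable_errors / max(stable_reqs, 1)
    return canary_rate <= 1.5 * stable_rate
```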