Friday, 9 March 2012

QCon 2012: Where does Big Data meet Big Database

I listened to Benjamin Stopford's talk "Where does Big Data meet Big Database" at QCon London. The talk mostly covered the differences between "traditional" databases and the new technologies introduced over the last decade. It provided helpful insights into the questions one should answer when deciding which technology to use, and unsurprisingly "which one is sexier" was not one of those questions. Here are my notes from the talk:

1. The data landscape is very different from 10 years ago. The scale has changed: we have big new sources of data that people and companies are interested in, and we have a range of new technologies, products and ideas that developers can leverage when architecting their applications.

2. MapReduce has been gaining popularity since 2004. It is simple and pragmatic, it solved a problem that didn't have a solution before, and it was novel and not limited by the tradition that stands behind older ideas. It introduced a split between the so-called "hacker culture" and "enterprise culture". Around 2009 it was criticised as "not really novel", as operating on the wrong level (physical rather than logical), incompatible with existing tooling, and lacking referential integrity and elegance.
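
To make the model concrete, here is a minimal word-count sketch in Python that mimics the map, shuffle and reduce phases in a single process. The function names and the in-memory "shuffle" are purely illustrative, not any particular framework's API:

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Sum all the counts emitted for a single word.
    return (word, sum(counts))

documents = ["Big Data meets Big Database", "big data big ideas"]

# Shuffle step: group the intermediate pairs by key.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(results)  # e.g. [('big', 4), ('data', 2), ('meets', 1), ...]
```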

3. The new approach can be characterised as a bottom-up approach to accessing very large data sets: it lets us keep the data close to the code (close to where it matters) and keep it in its natural form.

4. The relational ("old") approach offers solid, consistent and safe data, but limits developers by forcing its language on them.
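
The "different language" point in one snippet: even from Python, the relational model is driven through SQL strings embedded in the host language (sqlite3 here, purely as a stand-in for any relational store):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))

# The query logic lives in SQL, a second language inside the first one.
for row in conn.execute("SELECT id, name FROM users"):
    print(row)
```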

5. As early as 2001 some people closely connected with "traditional" solutions realised that there were problems with the existing databases: they are good tools, but they force developers to "speak a different language", and they need to evolve and allow more flexibility. The NoSQL movement came about because developers wanted simple storage that works on diverse hardware, that scales, and that can be talked to in the developer's language of choice. "One size fits all" is no longer true.

6. When choosing a tool you need to consider the following solutions: In-Memory DB, Shared Everything, Shared Nothing and Column Oriented Storage.

7. In-Memory DB - fast, but doesn't scale well, and you can lose your data very easily. This will probably gain more popularity as distributed in-memory solutions evolve.
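
A minimal sketch of why in-memory stores are fast but fragile: all state lives in a process-local dict, so anything written after the last snapshot is gone if the process dies. The snapshot helper below is a hypothetical illustration, not any real product's API:

```python
import json

class InMemoryStore:
    def __init__(self):
        self.data = {}  # everything lives in RAM; a crash wipes it

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def snapshot(self, path):
        # Persist a point-in-time copy; writes made after this call
        # are lost if the process dies before the next snapshot.
        with open(path, "w") as f:
            json.dump(self.data, f)

store = InMemoryStore()
store.put("user:1", {"name": "Ada"})
store.snapshot("/tmp/snapshot.json")
store.put("user:2", {"name": "Grace"})  # gone if we crash now
```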

8. Shared Everything - all the data is located on shared disc space, all the client nodes have access to the full data set, and every node can handle every request. This is pretty slow, and every cache needs to sit above the whole database, which means the bigger the data set the less effective your caching solution is.
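
The caching point can be shown with a fixed-size cache sitting in front of a simulated shared store: as the key space grows past the cache size, the hit rate collapses. This is a toy simulation of the effect, not a benchmark:

```python
import random
from functools import lru_cache

CACHE_SIZE = 1_000

@lru_cache(maxsize=CACHE_SIZE)
def read_row(key):
    return f"row-{key}"  # stands in for a disk read against the shared store

for dataset_size in (1_000, 10_000, 100_000):
    read_row.cache_clear()
    for _ in range(50_000):
        read_row(random.randrange(dataset_size))
    info = read_row.cache_info()
    hit_rate = info.hits / (info.hits + info.misses)
    print(f"dataset {dataset_size:>7}: hit rate {hit_rate:.0%}")
```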

9. Shared Nothing - each node has access to one shard of the full data set. This means you need to find the node that can handle your request - in the worst-case scenario you'll need to iterate through all your nodes.
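
Hash-based routing is the usual way to avoid that worst case: the key deterministically maps to exactly one shard. A minimal sketch using modulo hashing (real systems typically use consistent hashing so that adding a node doesn't reshuffle every key):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(key):
    # Deterministic hash -> shard index, so any client can locate
    # the owning node without iterating through the cluster.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for("user:42"))   # always resolves to the same node
print(node_for("order:99"))
```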

10. Column Oriented Storage - data is organised in columns rather than rows, laid out contiguously on disc. Very good compression. Indexing becomes less important; a single record pull is quite expensive, but bulk reads/writes are much faster.
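
Row versus column layout in a few lines of Python: a bulk aggregate touches only one contiguous (and well-compressing) column, while reconstructing a single record means gathering one value out of every column, which is the expensive case mentioned above:

```python
# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "city": "London", "amount": 10.0},
    {"id": 2, "city": "London", "amount": 25.0},
    {"id": 3, "city": "Berlin", "amount": 7.5},
]

# Column-oriented: each column is stored contiguously.
columns = {
    "id": [1, 2, 3],
    "city": ["London", "London", "Berlin"],  # repetitive -> compresses well
    "amount": [10.0, 25.0, 7.5],
}

# Bulk aggregate: scan exactly one contiguous column.
print(sum(columns["amount"]))  # 42.5

# Single-record pull: stitch one value out of every column.
record = {name: values[1] for name, values in columns.items()}
print(record)  # {'id': 2, 'city': 'London', 'amount': 25.0}
```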

11. Additional factors: discs - older DBs were optimised for sequential access over magnetic drives, not random access over SSDs - and the growing speed of our networks. Mind that there are relational databases that leverage the latest technologies - they can achieve very good results in benchmarks.
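
A rough way to feel the sequential-versus-random gap on your own hardware: read the same file once front to back and once at random offsets. On a freshly written file the OS page cache will mask most of the difference, so treat this as a curiosity probe rather than a benchmark:

```python
import os
import random
import time

PATH = "testfile.bin"
BLOCK = 4096
BLOCKS = 25_000  # roughly 100 MB

# Create a throwaway test file.
with open(PATH, "wb") as f:
    f.write(os.urandom(BLOCK * BLOCKS))

with open(PATH, "rb") as f:
    start = time.perf_counter()
    while f.read(BLOCK):      # sequential scan, front to back
        pass
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    offsets = [random.randrange(BLOCKS) * BLOCK for _ in range(BLOCKS)]
    start = time.perf_counter()
    for off in offsets:       # same volume of data, random order
        f.seek(off)
        f.read(BLOCK)
    print(f"random:     {time.perf_counter() - start:.2f}s")
```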

12. When comparing solutions, don't think only about the size of your data set, its potential to grow, or even the abilities of each tool. Think about what you need and want. Do you mind SQL? Is your data constantly changing a lot? Is the data isolated? Do you want to (and can you) keep your data in its natural state? Which solution can you afford?

13. It is good that we have a range of tools that provide different solutions to different problems; the trick is to know what your problem is, so you can pick the right solution.
