Sunday 19 August 2012

Big Data - The three Vs

A bit of taxonomy for this blog. I'm currently taking part in a Big Data initiative within my company, and I thought I'd share some of the key findings here.
Before doing that, let's define Big Data. I will ask Wikipedia for some help:
“Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set".

Basically, we are talking about massive volumes of data that commonly used software is not able to process within a reasonable time.
Is that all? Not quite.
Big Data is defined by three characteristics, which form what is better known as the three Vs: Volume, Variety and Velocity.


Volume 
We already talked about this in the definition: single data sets ranging from a few dozen terabytes up to many petabytes.

Variety
This is a key characteristic. Variety represents all types of data. We shift from traditional structured data (e.g. normal database tables), to semi-structured data (e.g. XML files with a free-text description), to completely unstructured data like web logs, tweets and social networking feeds, sensor logs, surveys, medical checks, genome mappings, etc. The sketch below shows the same piece of information in all three shapes.
Traditional analytics platforms can't handle this variety.
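
To make the three shapes concrete, here is a minimal Python sketch. The event, the field names and the sample text are made up purely for illustration; only the structured / semi-structured / unstructured distinction comes from the discussion above.

import json
import xml.etree.ElementTree as ET

# Structured: a fixed-schema record, as it would sit in an RDBMS row.
# (Hypothetical field names, chosen just for this example.)
structured_row = {"customer_id": 42, "rating": 4, "created": "2012-08-19"}

# Semi-structured: XML with a fixed skeleton plus a free-text description.
semi_structured = ET.fromstring(
    "<feedback customer_id='42' rating='4'>"
    "<description>Fast delivery, but the box was damaged.</description>"
    "</feedback>"
)

# Unstructured: raw text with no schema at all, e.g. a tweet.
unstructured = "Got my order from @shop today, box was a mess but #fastdelivery"

print(json.dumps(structured_row))                # queryable by column
print(semi_structured.find("description").text)  # needs parsing first
print(unstructured)                              # needs text analytics

A traditional platform copes well with the first line, partially with the second, and not at all with the third.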

Velocity

Commonly, in the traditional data world, velocity means how fast the data comes in and gets stored. Traditional analytics platforms deal with pretty much static data: RDBMS tables with a velocity that is well under control.
In the Big Data world, data is highly mutable, continuously changing and growing in volume. We are talking about real-time, streaming data. A web log or a tweet hashtag is continuously growing and flowing. In the Big Data world, velocity is the speed at which the data is flowing; the small sketch below shows the idea of processing records as they arrive.
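
Here is a minimal Python sketch of data in motion: counting hashtag frequencies as tweets flow by, instead of loading a static table first. The fake_stream generator is my own stand-in for a live feed such as a Twitter streaming API; everything else is standard library.

import re
from collections import Counter

def fake_stream():
    # Stand-in for an endless real-time feed; a real stream never ends.
    tweets = [
        "Loving this #BigData series!",
        "Volume, Variety, Velocity #BigData #3Vs",
        "Streaming web logs all day #BigData",
    ]
    for tweet in tweets:
        yield tweet

counts = Counter()
for tweet in fake_stream():
    # Each record is processed the moment it arrives and then discarded;
    # only the running counts are kept, never the full data set.
    counts.update(tag.lower() for tag in re.findall(r"#\w+", tweet))

print(counts.most_common(3))

The point of the design is that the work happens per record as it flows past, so the speed of the stream, not the size of a stored table, is what the system has to keep up with.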


Putting the three characteristics together, a Big Data initiative should be able to cope with massive volumes of fast-flowing, mostly unstructured data.
To achieve what? This will be the subject of my next few posts.
Antonio



