OMG, so much Data!

One of the buzzwords in today’s world is Big Data. As with all buzzwords, this is a vague and loosely defined term: it roughly refers to the ginormous, gargantuan, astronomical quantities of data that are generated every single day from a wide range of scattered sources across the globe, about anything from the temperature at 7 pm in Abu Dhabi to where you had your dinner last night.

The “impact” of Big Data can be understood from the fact that even Scott Adams made a comic on it [0]. (Of course, as soon as something gets mentioned in Dilbert, PHD Comics or xkcd, it has “arrived”. Okay, just kidding!)

Well, having got that out of the way: as mentioned above, Big Data is basically all the data that is collected on a daily basis from a variety of sources. The data can be technical, measuring physical attributes like temperature; it can be demographic, like the number of babies born in a particular region; it can be economic, like market statistics and transaction records; it can be something more abstract, like the changing “tastes” of a particular set of book lovers in a library; or it can even be something as banal as the traffic information at the local intersection.

One of the most interesting sources of big data is the Internet. Practically every click of your mouse while you’re on the Internet is recorded (scary thought, isn’t it?). Even when you have turned off History on Google or YouTube, they still maintain temporary logs of the pages you have viewed or your search terms. This data is then used to tweak their algorithms to give you more personalised results. For example, for someone in Mumbai, a search for “restaurant” on Google will show restaurants in Mumbai, and for someone at IIT Bombay, the top results for “gymkhana”, “hostel 4” or “CS 435” are likely to be those from IIT Bombay.

Of course, people have been collecting data since long before computers came to be. Big Data became a buzzword only now because it is only recently that capabilities have been developed for overcoming the daunting difficulties in handling all this data.

To give an example, one of the major issues with this data is its sheer size. According to IBM [1], around 2.5 quintillion bytes of data are generated every day. Only a very small fraction of all that data is relevant to any given application, but even that fraction can easily run into hundreds of gigabytes or even terabytes. Social networks alone generate humongous amounts of data: on Twitter, the 400 million tweets posted every day generate 12 terabytes of data by themselves, and then there is the data about searches, new connections made, connections broken, who is following whom, new accounts created, old accounts deleted, new connections coming in from blogs and other external websites, retweets, amount of activity per user, trending topics. See my point?

A 2001 research report [2] identified data growth challenges as three-pronged, which gave rise to another buzzword, the 3Vs: Volume, Velocity and Variety, referring to the quantity of data produced, the speed at which it is produced, and the messy mixture of structured and unstructured data generated. Check out this infographic [3] by Mashable on the same. These challenges have given a lot of impetus to research in Machine Learning and Data Mining, to name two fields, which constantly have to come up with faster, more efficient and more reliable algorithms to make sense of it all. Scalability is a big challenge, and so is accuracy, since the data is usually very, very noisy.

On the other hand, some interesting empirical laws have been discovered about large datasets. One of them is Benford’s Law [4], which is still used by auditors, among others, to detect inconsistencies in large datasets. One of the many cute results of this law is that, given a “large” raw dataset containing a “large” number of integer entries spread across a “large” range of values, the number of entries that start with 1 (that is, have 1 as the leading digit) is roughly six and a half times the number of entries that start with 9. Analysis of this has, in itself, led to a fairly large volume of research literature (see this post on Tao’s blog [5]) with some very intuitive explanations.
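To make the leading-digit claim concrete, here is a small Python sketch. It uses the standard statement of Benford’s Law, under which the expected share of entries with leading digit d is log10(1 + 1/d), and compares it with the leading digits of the first 1000 powers of 2. The powers of 2 are just an illustrative stand-in for a “large” dataset (they are not taken from any of the sources cited here), and the function names are made up for this example.

```python
import math
from collections import Counter


def benford_expected(digit: int) -> float:
    """Share of entries Benford's Law predicts for a leading digit (1-9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(n: int) -> int:
    """First decimal digit of a non-zero integer."""
    return int(str(abs(n))[0])


def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 among the non-zero values."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Illustrative dataset (an assumption for this sketch): 2^1 up to 2^1000.
sample = [2 ** k for k in range(1, 1001)]
observed = leading_digit_shares(sample)

# Print predicted vs. observed share for each leading digit.
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))

# How many times more often the law expects a leading 1 than a leading 9.
ratio_1_to_9 = benford_expected(1) / benford_expected(9)
print(round(ratio_1_to_9, 2))
```

The predicted shares work out to about 30% for a leading 1 and under 5% for a leading 9, which is where the ratio in the paragraph above comes from. An auditor would do the same comparison on real figures and look more closely at any dataset whose observed shares stray far from the predicted ones.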

So, considering how fast the problem is growing in complexity, how little has been achieved (in relative terms, compared to how much is still left to be done) and how interesting the problem is, Big Data is presumably here to stay for some time. After all, tomorrow is another day, and the world awaits the data about the millions of tweets, status updates, places checked in, items bought, pages viewed, etc. that we shall invariably generate for years to come.








