With all the devices available today to collect data, such as microphones, cameras,
sensors, and so on, we are seeing an explosion in data being collected worldwide.
Big Data is a term used to describe large collections of data that may be
unstructured and that grow so large, so quickly, that they are difficult to manage with regular database or statistics tools.
Early in 2011, Watson, a supercomputer developed by IBM, competed on “Jeopardy!”, the popular question-and-answer show.
Watson beat the two most popular players in that game.
Approximately 200 million pages of text were fed into Watson, with Hadoop used to distribute the workload of loading this information into memory.
Once the information was loaded, Watson used other technologies for advanced search and analysis.
In the telecommunications industry, we have China Mobile, a company that built a Hadoop cluster to perform data mining on call data records.
China Mobile was producing 5 to 8 TB of these records daily. By using a Hadoop-based system, they were able to process 10 times as much data as with their old system,
at one-fifth of the cost.
In the media, we have the New York Times, which wanted to host all of its public
domain articles from 1851 to 1922 on its website.
They converted the articles from 11 million image files to 1.5 TB of PDF documents. This was
implemented by one employee, who ran the job in 24 hours on a 100-instance Amazon EC2 Hadoop cluster at a very low cost.
Many internet and social network companies also use Hadoop, including Yahoo,
Facebook, Amazon, eBay, and Twitter.
Yahoo is, of course, the largest production user, with an application running on a Hadoop cluster of approximately 10,000 Linux machines.
Yahoo is also the largest contributor to the Hadoop open source project.