Imagine the scenario:
You have 1GB of data that you need to process.
The data are stored in a relational database on your desktop computer, and this computer
has no problem handling the load.
Then your data grows to 10GB.
And then 100GB.
And you start to reach the limits of your current desktop computer.
So you scale up by investing in a larger computer, and you are then OK for a few more months.
Then your data grows to 10TB, and then 100TB,
and you are fast approaching the limits of that computer.
Moreover, you are now asked to feed your application with unstructured data coming from sources
like Facebook, Twitter, sensors, and so on.
You want to derive information from both the relational data and the unstructured
data, and you want this information as soon as possible.
What should you do? Hadoop may be the answer!
Hadoop is an open source project of the Apache Foundation.
It is a framework written in Java, originally developed by Doug Cutting, who named it after his
son’s toy elephant.
Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.
It is optimized to handle massive quantities of data, which could be structured or unstructured.
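The MapReduce idea underlying Hadoop can be illustrated outside Hadoop itself. Below is a minimal Python sketch (not Hadoop's actual Java API) of the two phases for a word count: the map phase emits a (key, value) pair for each word, and the reduce phase groups pairs by key and sums the counts.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the pairs by key (word) and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data", "big elephant"]
print(reduce_phase(map_phase(docs)))  # {'big': 2, 'data': 1, 'elephant': 1}
```

In real Hadoop the same two functions are written as Java classes, and the framework runs the map phase in parallel on the machines where the data blocks live, shuffling the intermediate pairs to the reducers.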
Hadoop replicates its data across multiple computers, so that if one goes down, the data can be
processed on one of the machines holding a replica.