Hadoop: When to Use it and When not to?

This entry is part 6 of 8 in the series Hadoop Tutorial

In the past couple of years, Hadoop has earned the title of “THE Big Data Analytics Platform”. To a lot of organizations, it is equal to the term Big Data Technology. But it can only do so much and savvy organizations need to analyze their needs in order to see if it is a good fit to their Big Data related needs. Hadoop has ample power to process voluminous data sets, but organizations need to assess their needs in order to know when to use Hadoop and when to look for alternative solutions.

Example: Metamarkets

For instance, Hadoop has a lot of horsepower to process huge amounts of unstructured, semi-structured and structured data alike. But it falls short when it comes to processing smaller data sets. Metamarkets is one such company which has little use of Hadoop because of this limitation. Although they use Hadoop to process huge data sets where time isn’t of the essence, but when it comes to providing with real-time analytics they use other solutions. It is because Hadoop isn’t optimized to execute batch jobs which look at every single file in the database. All their requirements come down to a tradeoff: in order to make the connections between the data points, Hadoop detriments the speed.

They use Hadoop for the reports at the end of every day, which helps them review all the transactions of the day, or when they have to scan the historical data which dates back numerous months. Their CEO says that using Hadoop is like having a pen pal, you can write to him, but you won’t be getting an instant reply, unlike IMs.

Not a true replacement to the traditional database

While some organizations might be tempted to scrape their traditional databases and the warehouse in favor of Hadoop clusters, because of the lower technology costs, many experts say that this is like comparing apples to oranges. As they believe, the relational databases which power most of the deployed data warehouses are used to accommodate small amounts of data which trickles in at a very steady rate over a time span. Hadoop is more apt to process the stores of data which has been accumulated over a lot of time.

And because Hadoop is usually employed in huge projects which require clusters of service hardware with employees specialized to handle the programming and have ample data management skills, the implementation can amass quite a lot of expenses. Even though the cost-per-unit of data is lower than that of the relational databases, adding everything shows that it isn’t as cheap as it seemed.

Requirement based application

A great example application of Hadoop would be acting as the data integration area for executing the ETL (extract, transform, and load) tasks. Although, this application doesn’t live up to the up, but it makes perfect sense when your IT dept. needs to merge huge files. In such a case, the immense power of Hadoop can be very useful in processing.

Many experts believe that Hadoop can be very helpful when handling the ETL procedure because it can split the tasks amongst the numerous nodes, speeding up the process a lot. Also, Hadoop can be used to integrate the data and then stage it for later loading into a relational database or a data warehouse, which justifies the investment in this platform.

Final Words

Hadoop is a behemoth and it is definitely capable of everything that Apache claims, but getting it through the door for bigger projects which employ Hadoop’s flexibility and scalability to a bigger scale seems like the saner thing to do.

Series Navigation<< MapR: The Magic Which Drives HadoopDifferent modes of hadoop >>

Leave A Comment

Your email address will not be published. Required fields are marked *