
Nati Shalom's Blog: Making Hadoop Run Faster


Stashed in: Big Data!, Elephants!, Big Data Technologies


…awesome photo!

Great picture. Also, LOL:

"We can make our Hadoop system run faster by pre-processing some of the work before it gets into our Hadoop system. "

We make it faster by making someone else do some of the work!

Liking this bit... Speed Things Up Through Stream-Based Processing

The concept of stream-based processing is fairly simple. Instead of logging the data first and then processing it, we can process it as it comes in. 
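The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not code from the original post: the function names and the sample events are made up for the example, and a simple word count stands in for whatever processing is actually needed.

```python
from collections import Counter
from typing import Iterable, Iterator


def log_then_process(events: Iterable[str]) -> Counter:
    """Batch style: store every event first, then count in a second pass."""
    log = list(events)  # the whole input is held before any work starts
    return Counter(log)


def process_as_it_arrives(events: Iterable[str]) -> Iterator[Counter]:
    """Stream style: update the running counts as each event comes in."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield counts.copy()  # an up-to-date result exists after every event


# Hypothetical sample input, used only for illustration.
events = ["hadoop", "storm", "hadoop"]
batch_result = log_then_process(events)
stream_results = list(process_as_it_arrives(events))
```

Both styles end with the same counts. The stream version simply has a usable answer after each event instead of only at the end.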

A good analogy to explain the difference is a manufacturing pipeline. Think about a car manufacturing pipeline: compare a process in which you first gather all the loose parts in one place and then assemble them piece by piece with a process in which each unit is packaged at the manufacturer and only the pre-packaged parts are sent to the assembly line. Which method is faster?

Data processing is just like any pipeline. Putting stream-based processing at the front is analogous to pre-packaging our parts before they get to the assembly line, which in our case is the Hadoop batch processing system.

As in manufacturing, even if we pre-package the parts at the manufacturer we still need an assembly line to put all the parts together. In the same way, stream-based processing is not meant to replace our Hadoop system, but rather to reduce the amount of work that the system needs to deal with, and to make the work that does go into the Hadoop process easier, and thus faster, to process.
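One way to picture "reducing the amount of work" is pre-aggregation: the front-end step collapses raw events into per-window summaries, and only those summaries are handed to the batch system. The sketch below is an assumption-laden illustration rather than anything from the original post: the `(timestamp, key)` event shape, the 60-second window, and the function name are all hypothetical, and a real stream processor would flush each window as it closes instead of returning everything at the end.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp in seconds, key).
Event = Tuple[int, str]


def pre_aggregate(
    events: Iterable[Event], window_seconds: int = 60
) -> List[Tuple[int, str, int]]:
    """Collapse raw events into one (window_start, key, count) row per window and key."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return sorted((w, k, c) for (w, k), c in counts.items())


# Five raw events become three summary rows for the batch job to read.
raw_events = [(0, "a"), (10, "a"), (59, "b"), (60, "a"), (61, "a")]
summary_rows = pre_aggregate(raw_events)
```

The batch job still does the final assembly, but it reads the summary rows rather than every raw event.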

In-memory processing can make a good stream processing system; as Curt Monash points out in his research, traditional databases will eventually end up in RAM. An example of how this can work in the context of real-time analytics for Big Data is provided in this case study, where we demonstrate the processing of Twitter feeds using stream-based processing that then feeds a Big Data database for serving, which provides the historical aggregated view as described in the diagram below.

[Diagram not reproduced: Screen Shot 2012-08-21 at 2.23.49 PM]

- See more at: http://natishalom.typepad.com/nati_shaloms_blog/2012/08/making-hadoop-run-faster.html#sthash.jVYWvPLX.dpuf
