Infochimps Leap to Open Source Branches - Datanami

These challenges are not unlike what Infochimps faced as it sought to scale, add reliability, and deliver a dash of performance that its former Flume-based paradigm wasn't able to offer. The company's co-founder, Philip Kromer, told us that when they started with Flume, it was the only mature(ish) platform available for the purpose. However, as development continued, it became clear that the need to handle high volume at high speed, with reliability as a critical requirement, meant they had to look elsewhere. "It took us a while," he said, "but we found that Flume was not the right platform for streaming data analytics; it was great for high-speed data transport, but not for what we wanted…" He notes that Kafka's flexibility offered an "interesting, efficient take on the problem of reliable (stress reliable), high-speed data delivery into a lot of disparate systems."
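Kafka's "take on the problem" can be illustrated conceptually: a topic is an append-only log, and each consumer keeps its own read offset, so many disparate systems can read the same stream independently and resume after a failure. The sketch below is a toy model in plain Python, not the Kafka API; all class and variable names are illustrative.

```python
# Conceptual sketch of Kafka's log model (illustrative, not Kafka's API):
# an append-only log plus per-consumer offsets gives reliable delivery
# to many independent downstream systems.

class Log:
    def __init__(self):
        self.records = []              # append-only sequence of messages

    def append(self, msg):
        self.records.append(msg)
        return len(self.records) - 1   # offset of the new record

    def read(self, offset):
        return self.records[offset:]   # everything from offset onward

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0                # position is per-consumer state

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

log = Log()
for m in ("a", "b", "c"):
    log.append(m)

loader = Consumer(log)
dashboard = Consumer(log)
print(loader.poll())     # ['a', 'b', 'c'] -- reads the full stream
log.append("d")
print(loader.poll())     # ['d'] -- resumes from its own offset
print(dashboard.poll())  # ['a', 'b', 'c', 'd'] -- independent of the other reader
```

Because the log itself is durable and consumers track their own positions, a slow or crashed reader never loses data; it simply replays from its last offset.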

With that piece in place, he said, Storm was the right fit to snap into the analytics side of their platform. For some background, Storm spun out of Twitter, which announced it last year as its answer to "open source real-time Hadoop." The social media giant described it as a distributed, fault-tolerant, real-time computation system, responsible for its real-time stream processing of messages and related database updates. According to BackType, which developed it under the Twitter banner, "This is an alternative to managing your own cluster of queues and workers. Storm can be used for 'continuous computation,' doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for 'distributed RPC,' running an expensive computation in parallel on the fly."
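In Storm's model, a topology wires "spouts" (sources that emit streams of tuples) into "bolts" (stages that transform them), and a "continuous query" is just that pipeline running on every tuple as it arrives. The sketch below illustrates the idea in plain Python generators; it is not the Storm API, and the function names are made up for illustration.

```python
# Conceptual sketch of Storm's topology model (illustrative, not Storm's API):
# a spout emits a stream of tuples and each bolt applies one transformation,
# so a "continuous query" is the composed pipeline run on every tuple.

def word_spout(messages):
    """Spout: turns an incoming message stream into word tuples."""
    for msg in messages:
        for word in msg.split():
            yield word

def count_bolt(words):
    """Bolt: maintains rolling counts and streams out each updated result."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Running word counts are emitted continuously, not computed in one batch.
stream = ["storm is fast", "storm scales"]
for word, count in count_bolt(word_spout(stream)):
    print(word, count)
```

In real Storm the spouts and bolts run as parallel tasks across a cluster, but the shape of the computation is the same: results stream out as each tuple flows through.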

As Nathan Marz, the lead engineer of Storm, said, the project "makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language."
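The "every message will be processed" guarantee rests on a simple idea: the source keeps each message pending until the worker acknowledges it, and any message that fails before being acked is replayed. The sketch below shows that at-least-once pattern in plain Python; it is a conceptual illustration with hypothetical names, not Storm's acking implementation.

```python
# Conceptual at-least-once delivery: a message stays "pending" until the
# worker acks it; un-acked messages are replayed, so every message is
# processed at least once even if a worker fails mid-stream.

def process_all(messages, worker):
    pending = list(messages)
    processed = []
    while pending:
        msg = pending.pop(0)
        try:
            worker(msg)
        except Exception:
            pending.append(msg)      # no ack: replay the message later
        else:
            processed.append(msg)    # ack: message is done
    return processed

calls = {"n": 0}
def flaky_worker(msg):
    calls["n"] += 1
    if msg == "b" and calls["n"] < 3:    # simulate a crash on message "b"
        raise RuntimeError("worker crashed")

print(process_all(["a", "b", "c"], flaky_worker))   # ['a', 'c', 'b']
```

Note the trade-off this implies: a replayed message may be seen more than once, which is why "at least once" systems pair the guarantee with idempotent or deduplicating downstream processing.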

"Storm guarantees that every message will be processed. And it's fast."

What makes this possible?
