When to use Hadoop (and when not to)
When enterprises interested in leveraging big data and analytics ask how to get started, they often are advised to begin with Hadoop, the Apache Software Foundation's open source data storage and processing framework.
There are a number of reasons why Hadoop is an attractive option. Not only does the platform offer both distributed storage and computational capabilities at a relatively low cost, it can also scale to meet the anticipated exponential increase in data generated by mobile technology, social media, the Internet of Things, and other emerging digital technologies.
These advantages, along with strong word of mouth and high-profile implementations by companies such as Facebook, Yahoo, and numerous Fortune 50 giants, are driving adoption of Hadoop.
Research firm Researchbeam in March forecast the global Hadoop market to grow to $50 billion in 2020 from $1.5 billion in 2012. Most of that money will be spent on services provided by commercial Hadoop specialists such as Cloudera, Hortonworks, and MapR Technologies.
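To put that forecast in perspective, the two endpoints can be turned into an implied compound annual growth rate. The snippet below is a back-of-the-envelope sketch, not a figure from the research firm: the function name `implied_cagr` is made up for illustration, and it assumes smooth compounding over the eight years between 2012 and 2020.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1

# Endpoints from the forecast above, in billions of dollars.
# Assumption: growth compounds evenly across 2012-2020 (8 years).
rate = implied_cagr(1.5, 50.0, 2020 - 2012)
print(f"Implied annual growth: {rate:.1%}")
```

Under those assumptions the market would have to grow by roughly 55 percent every year, which gives a sense of how aggressive the projection is.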
But not all data scientists are climbing on board the Hadoop train. In fact, many have jumped off. In a recent survey of data scientists on the obstacles to big data analytics, vendor Paradigm4 reports that more than three-quarters (76 percent) of the scientists who said they have used Hadoop or Spark (a computational framework that is often run on top of the Hadoop Distributed File System) cite "significant limitations" to their use.
Specifically, 39 percent of respondents said Hadoop takes too much effort to program, while 37 percent said it was "too slow for interactive, ad hoc queries." Another 30 percent knocked Hadoop as being too slow for real-time analytics. And more than one-third (35 percent) of data scientists surveyed who have used Hadoop or Spark said they have stopped using them.
Granted, this survey is from a vendor that's offering "more" than Hadoop. But the reasons given by respondents explaining their dissatisfaction with Hadoop are grounded in real issues rather than vendor hype.
Take response time. If you're looking to produce complex analytics or real-time analytics, Hadoop probably isn't the platform for you, explains Claudia Perlich, chief scientist for Dstillery, a marketing company that crunches web browsing data to help brands target ads.
For the part of Dstillery's business that delivers ads online, real-time analytics are essential. "That part," Perlich says, "we can't do with Hadoop."
"If I have 30 milliseconds to look up information in a database that has 300 million people, there's no way Hadoop can do it," she says. "It's not the technology for quick access."
However, Dstillery also performs analytical services for which response time takes a back seat to accuracy and long-term insights.
"All of our incoming data is dumped into Hadoop to use for building analytics," Perlich says. "We do a lot of predictive modeling, and this is where Hadoop is phenomenal, particularly the cost at which you can store everything and access it in reasonable time -- not real time, but reasonable time."
Some of the scientists who stopped using Hadoop simply may have chosen it for the wrong job -- such as real-time analytics -- in the first place. For them, moving on only makes sense.
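The difference between "quick access" and batch-style analytics can be pictured with a toy comparison. The sketch below is not Hadoop code and does not reflect how Dstillery works; the data, the function names (`scan_lookup`, `keyed_lookup`, `timed_ms`), and the table size are all invented for illustration. It simply contrasts reading through an entire dataset to find one record with answering the same question from a prebuilt index.

```python
import time

# Hypothetical toy data: user id -> audience segment. A real table would hold
# hundreds of millions of rows and live in a dedicated serving store.
profiles = [(user_id, f"segment-{user_id % 7}") for user_id in range(200_000)]

def scan_lookup(rows, wanted_id):
    """Batch-style access: read rows in order until the wanted one turns up."""
    for user_id, segment in rows:
        if user_id == wanted_id:
            return segment
    return None

# Serving-style access: build an index once, then answer each request by key.
index = dict(profiles)

def keyed_lookup(idx, wanted_id):
    """Point lookup against the prebuilt index; returns None if absent."""
    return idx.get(wanted_id)

def timed_ms(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000

target = 199_999  # last row, so the scan has to read the whole list
scan_result, scan_ms = timed_ms(scan_lookup, profiles, target)
key_result, key_ms = timed_ms(keyed_lookup, index, target)
print(f"scan:  {scan_result} in {scan_ms:.3f} ms")
print(f"keyed: {key_result} in {key_ms:.3f} ms")
```

Both calls return the same answer, but the scan's cost grows with the size of the dataset while the keyed lookup's does not. Reading everything is a sensible pattern when the job is to build a model from all the data, and a poor one when a single record is needed within a fixed latency budget.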
Another potential source of dissatisfaction with Hadoop (one that wasn't reflected in the Paradigm4 survey) is cost. Enterprises that go into Hadoop thinking it's going to be free or cheap because it's open source usually get a big surprise. They usually end up paying one way or another: by contracting with a Hadoop services vendor, by hiring qualified Hadoop programmers and analysts to work in-house, or by launching misguided Hadoop projects that cause them to fall behind competitors.
Early adopters of Hadoop who became disillusioned may have been victims of the first wave of Hadoop hype. The gradual maturation of big data and analytics technologies, along with better-educated customers, should make it easier for enterprises to choose the best analytics solution.
As Perlich says, "It's really about what you're trying to do that determines whether the tool is sufficient for the job."