Big Data Analytics: Time For New Tools
So you're considering Hadoop as a big data platform. You'll probably need some new analytics and business intelligence tools if you're going to wring fresh insights out of your data.
Hadoop is steadily gaining adoption as an enterprise platform for capturing high-scale and highly variable data that's not easy or economically viable to store in relational databases. What's less clear is just how companies are going to analyze all this data.
A recent Forrester report declared that Hadoop is "no longer optional" for large enterprises. Our data suggests that train hasn't left the station just yet: Just 4% of companies use Hadoop extensively, while 18% say they use it on a limited basis, according to our just-released 2015 InformationWeek Analytics, Business Intelligence, and Information Management Survey. That is up from the 3% reporting extensive use and 12% reporting limited use of Hadoop in our survey last year. Another 20% plan to use Hadoop, though that still leaves 58% with no plans to use it.
But there's no doubt that interest in Hadoop is rising. The top draw is the platform's "ability to store and process semi-structured, unstructured, and variable data," cited by 31% of the 374 respondents to our survey involved with information management technology. Another 30% cited Hadoop's ability to handle "massive volumes of data," while 25% pointed to Hadoop's "lower hardware and storage scaling costs" compared with conventional relational database management systems.
That's the IT, data-management perspective on the need for Hadoop. But why is the business looking to capture and analyze big data in the first place? The top driver, cited by 48% of respondents using or planning to deploy data analytics, BI, or statistical analysis software, is finding correlations across multiple, disparate data sources, like Internet clickstreams, geospatial data, and customer-transaction data. Next in line are predicting customer behavior, cited by 46%, and predicting product or service sales, cited by 40% of respondents (multiple responses allowed, see chart below). Other motivations include predicting fraud and financial risks, analyzing social network comments for customer sentiment, and identifying security risks.
In each of these examples, companies are searching for insight by analyzing big data sets that they couldn't discover parsing the same old data they've long held in transactional systems alone. Capturing and analyzing clickstreams, server log files, social network streams, and geospatial data from mobile apps is a recent, big-data-era phenomenon for most organizations attempting it, and they're gaining insights and seeing correlations that just weren't available in the enterprise data warehouse.
But pulling insight out of this new data will require some new tools, ones that work alongside Hadoop -- which, at its core, pairs a highly distributed file system with a batch-processing framework. Here are the three categories of options associated with Hadoop, along with product examples.
Hadoop-native data-processing and analysis options: These include Apache Hive (provides SQL-like data access -- think data warehousing meets Hadoop); Apache Mahout (supports machine learning on top of Hadoop -- think finding patterns in data); Apache MapReduce (for searching, filtering, sorting, and other forms of processing large data sets in Hadoop -- ways to boil down really big data to find the useful nuggets); and Apache Pig (a high-level language for writing MapReduce jobs).
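The "boiling down" that MapReduce performs can be sketched in a few lines of plain Python. This is a toy word count showing the map/shuffle/reduce flow, not Hadoop's actual Java API -- Hadoop's contribution is distributing exactly this pattern across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between the two phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's list of values to a single count.
    return {word: sum(counts) for word, counts in groups.items()}

logs = ["error timeout", "error disk full", "warning disk"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["error"])  # 2
print(counts["disk"])   # 2
```

Pig's contribution is letting you express this kind of pipeline in a few declarative statements instead of writing the map and reduce functions by hand.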
Alternative SQL access/analysis options: Hive is slow by relational database standards, and it doesn't support all SQL-analysis capabilities. These alternatives are designed to make BI professionals feel more at home, giving them accustomed performance, SQL- or SQL-like querying, and compatibility with current BI tools. Examples include Actian Analytics Platform SQL Hadoop Edition, Apache Drill, Cloudera Impala, HP Vertica For SQL on Hadoop, IBM Big SQL, Microsoft SQL Server Polybase, Oracle Big Data SQL, Pivotal HAWQ, and Teradata Query Grid.
Analytics and BI options designed to run on Hadoop: These tools blend SQL and BI-type querying with big-data-oriented and advanced analytics capabilities. Examples include Apache Spark, Apache Storm, Datameer, Platfora, and SAS Visual Analytics. Many of these analysis engines now run on Hadoop 2.0's YARN resource-management system.
The first thing to note is that the SQL and SQL-like options -- including Hive, Impala, Drill, the various relational databases ported to run on Hadoop (Actian, HP, Pivotal), and the various SQL-access options (Microsoft, Oracle, Teradata) -- give you the basics of SQL query and analysis, but these are not alternatives to analytics workbenches or business intelligence suites. As noted, a key point of these query and access tools is making Hadoop compatible with incumbent SQL-connected products like BusinessObjects, Cognos, MicroStrategy, OBIEE, Tableau Software, and so on.
Businesses are demanding compatibility with tools that they already have on hand. This helps explain why there were so many SQL-on-Hadoop announcements from both Hadoop vendors (Cloudera, Hortonworks, MapR) and database incumbents (Actian, Hewlett-Packard, Oracle) over the last year.
But companies need more than SQL. The value in big data analysis is often in finding correlations among disparate data sets or insights hidden in semi-structured or highly variable data sources, such as social networks, log files, clickstreams, and even images or sound files. The very name "structured query language" underscores that SQL was born for analysis of data that fits neatly into columns and rows, and that's the stuff most businesses were already collecting in their data warehouses.
As for that interest in finding correlations among multiple, disparate data sources; predicting customer behavior; and predicting product or service sales -- those types of analyses aren't in SQL's wheelhouse.
Understanding such large and variable data sets and doing prediction requires non-SQL approaches such as machine learning, graph analysis, and various advanced analytical algorithms used in predictive modeling and text mining. Even if you have separate tools that support these techniques, the next question is whether they can handle Hadoop-scale data sets. If not, you'll have to rely on SQL connectors or laborious, two-step movement of boiled-down data sets from Hadoop into relational databases.
The downside of many tools that are part of Hadoop -- like MapReduce, Pig, and Mahout -- is that they're complex and unfamiliar to most IT pros, even data analysts who are used to working with advanced analytics workbenches. The third category of products noted earlier -- analytics and BI that run on Hadoop -- is designed not only to run on top of Hadoop but also to bring multiple analysis options to nontechnical users who aren't up to (or interested in) the rigors of coding MapReduce jobs from scratch.
Apache Spark, for example, is an open-source platform that aims to deliver multiple analysis libraries -- including machine learning, SQL analysis, R algorithms, graph analysis, and streaming analysis -- that can all run against a single, in-memory execution engine on top of Hadoop. Datameer combines data-integration and preparation tools (for bringing data into Hadoop) with a spreadsheet-style interface for analysis, basic SQL-like join and group-by functions, and even predictive analytics and machine learning options, including k-means clustering, decision trees, and so on. Platfora's strength is in providing fast, in-memory data-exploration and data-visualization on top of Hadoop. SAS, meanwhile, has ported its Visual Analytics product to run on Hadoop and as a standalone clustered-server platform.
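To give a sense of what "k-means clustering" does in these products -- grouping similar records without any SQL query -- here is a minimal, dependency-free sketch in Python. It illustrates the algorithm itself, not any vendor's implementation, and the energy-usage numbers are invented for the example:

```python
def kmeans_1d(values, k, iterations=10):
    """Toy one-dimensional k-means: settle k centroids among the values."""
    centroids = sorted(values)[:k]  # naive initialization: smallest k values
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its assigned values.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Hypothetical daily kWh readings: light users near 8, heavy users near 30.
usage = [7, 8, 9, 29, 30, 31]
print(sorted(kmeans_1d(usage, k=2)))  # [8.0, 30.0]
```

Spark's MLlib and the workbench products run this same idea against cluster-scale data, with smarter initialization and multidimensional records.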
The bottom line is that big data is not just data warehousing as we know it, scaled up. So don't expect to tap into new data sources and learn new things with SQL tools alone.
Big data example No. 1: Utility companies
As we await more widespread adoption of Hadoop, good places to look for innovative adoption of tools designed to work with Hadoop are relatively young companies that are using Hadoop extensively. Think of these as companies whose businesses are built on big data.
Opower is one such business -- a utility analytics company that gathers meter-reading data and household data from more than 100 gas and electric companies. It adds in weather data, property data, and more to get a clearer understanding of the size and character of homes and the number of people in each household, since those are big drivers of energy use. The idea is to use that analysis to help homeowners better understand their energy usage so they can reduce consumption and choose more attractive rate plans.
Opower uses Hadoop-native tools, including MapReduce and Hive, and it uses Datameer for data-transformation and summarization work. Datameer helps power users make clearly tagged and defined data sets available to internal users such as customer-engagement managers and product managers. These business users then use Platfora to build visualizations that help them analyze customer engagement, energy consumption, and other trends across various customer segments.
Opower wanted tools that work natively on top of Hadoop, so that the company could look at data across multiple utilities. "If it's a separate [database] environment, that's more infrastructure you have to maintain," says Mehgann Lomas, a product manager at Opower.
Another concern is ending up with multiple versions of the truth. Even when reporting on something as simple as how many customers received a home-energy report, "I don't want that stored in too many different places, because you run the risk of having different calculations depending on where you pull that data from," Lomas explains.
Big data example No. 2: The schema-on-read advantage
Vivint is in the home-automation business, selling a system that lets customers monitor and control security, safety, heating, and air conditioning. The system includes a touchscreen control panel inside the house and mobile apps through which customers can remotely adjust heating or air conditioning, lock and unlock doors, and control lights or appliances. Sensor options include security cameras, thermostats, electronic door locks, door and window sensors, appliance-control switches, motion detectors, and smoke and CO alarms.
Vivint uses Hadoop to store the data from the more than 800,000 customers it serves. It chose Datameer and Platfora because "Hadoop isn't user friendly and lacks an intuitive interface," says Brandon Bunker, Vivint's senior director of customer intelligence and analytics. Without this software, "we would need data scientist and PhD types" to access and make sense of the data.
Many traditional BI tools have connectors to Hadoop, but they're best "when you know what questions you're going to ask in advance," Bunker says. "When businesspeople ask me new questions, I get same-day answers without having to create a new schema."
This schema-on-read capability provides one of the game-changing advantages of Hadoop. In a traditional database environment, you have to build a data model in advance, picking and choosing what data to include that you think might be relevant based on the questions that you suspect you will want to ask. Hadoop lets you store everything without having to fit it to a predefined data model.
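The schema-on-read idea can be sketched in plain Python using invented sensor records (not Vivint's actual data): heterogeneous raw events are stored exactly as they arrive, and structure is imposed only when a question is finally asked.

```python
import json

# Raw events land in storage as-is -- no table definition required up front.
# Note each record carries different fields; a fixed schema would force a choice.
raw_events = [
    '{"device": "door", "event": "open", "ts": 1}',
    '{"device": "thermostat", "temp_f": 68, "ts": 2}',
    '{"device": "camera", "event": "motion", "zone": "porch", "ts": 3}',
]

def query(raw, predicate):
    # Schema-on-read: parse and shape each record only at query time.
    for line in raw:
        record = json.loads(line)
        if predicate(record):
            yield record

# A question nobody anticipated when the data was captured:
motion = list(query(raw_events, lambda r: r.get("event") == "motion"))
print(len(motion))  # 1
```

In a schema-on-write warehouse, answering a new question like this often means altering tables and reloading data; here the raw records simply get reinterpreted.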
Vivint has used this open-ended exploration of the data to help reduce false alarms, a problem that has dogged security companies for decades.
"You can imagine that there's a significant amount of data that can go into understanding when and why false alarms occur," Bunker says. With Hadoop, you can consider all the data, because you don't have to make assumptions about causes in advance.
Using Datameer to ingest, summarize, transform, join, and group data; Platfora for data visualization; and algorithms written in R and Matlab for prediction, Vivint can analyze all the data available in Hadoop, as opposed to limiting the scope of analysis by building a data model based on preconceived assumptions about the causes of false alarms.
"We're able to work with summary data as well as very granular data, so you can let the data do the talking, as opposed to relying on preconceived notions to solve that problem," Bunker says. Vivint considers its false-alarm analysis a trade secret and a source of competitive advantage.
Vivint has no shortage of tools at its disposal; it also uses Tableau Software and is experimenting with Spark (particularly its MLlib machine-learning component). Tableau makes data visualization "easy for any user," Bunker says, but it slows down with larger data sets. When Vivint is analyzing truly big data, it relies on Datameer for its efficient data transformation, and Platfora to provide the equivalent of "an incredibly large OLAP cube in memory."
It's not a surprise that companies like Opower and Vivint, which do the bulk of their data work in Hadoop, are more inclined to use software that's part of or designed to work with that platform. Companies that use Hadoop on a limited basis and that have big investments in relational database management systems and conventional, SQL-oriented BI and analytics suites will naturally want to make the most of those tools.
But even if you're in the second camp, key takeaways here should be that not all big data questions can be easily answered with SQL, and not all tools developed for small data can cope with data variety or data at high scale. Keep these points in mind before you draw the wrong conclusions about early big data failures. It could be that your failures are due to using the wrong tools or starting with preconceived assumptions about what the data might tell you.