Big Data Outliers: Friend or Foe?
The presence of outliers may be a sign of serious data quality issues, or it may not.
The bigger your dataset, the greater your chance of stumbling into an outlier. It’s practically a certainty you’ll find isolated, unexpected, and possibly bizarre values you never expected to see in your data. But how you respond to these outliers could mean the difference between big data success and failure.
How should you deal with data outliers? The answer is simple: It depends. On the one hand, the presence of outliers may be a sign of serious data quality issues, in which case the data scientist would be wise to throw out all the dirty data and then address the problem. The old computing adage of “garbage in, garbage out” is just as valid on small data volumes as it is on today’s massive ones. So you chuck it out and start fresh.
On the other hand, an outlier could indicate a signal that was previously unknown, which could lead to a competitive break or a novel scientific discovery. In some scenarios, finding the outliers is the whole point of the big data analytics process. It’s what signals the buying or selling opportunities, the likelihood of fraud, the presence of quality control issues in a factory, or a customer about to “churn.” Once you’ve actually found the needle in the haystack, you don’t throw it back. In fact, you can’t wait to find the next one.
It’s critical to take the right approach to dealing with outliers. Unfortunately, there are no easy or simple answers. But by carefully analyzing your big data goals and applying some critical thinking, along with a good mix of traditional statistical and visual data exploration tools, you have a better chance of keeping the outliers from tripping you up during your big data journey.
Three Standard Deviations
Dr. Kirk Borne, a professor at George Mason University, was exposed to a variety of statistical software packages during his early training as an astrophysicist, which provided a basis for his current work teaching and practicing data science.
“Every data analysis package I ever used since I started training in the early days of my research had what’s called a sigma clip algorithm. Basically anything that’s more than three standard deviations from the mean, you just remove it,” Borne says. “There was an attitude, at least in the early days of data analysis, that outliers are something to get rid of. They’re anomalous values, they’re signals of poor data quality, so just delete them.”
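The sigma clip Borne describes can be sketched in a few lines of Python. This is a minimal, single-pass illustration rather than the algorithm of any particular package: the function name `sigma_clip`, the `n_sigma` parameter, and the sample readings are all assumptions made for the example. It returns the clipped values alongside the kept ones, so they can be inspected rather than silently discarded.

```python
from statistics import mean, pstdev

def sigma_clip(values, n_sigma=3.0):
    """Split values into (kept, clipped) with one pass of sigma clipping.

    Hypothetical helper: a value is clipped when it lies more than
    n_sigma population standard deviations from the mean.
    """
    mu = mean(values)
    sigma = pstdev(values)
    kept, clipped = [], []
    for v in values:
        if sigma > 0 and abs(v - mu) > n_sigma * sigma:
            clipped.append(v)
        else:
            kept.append(v)
    return kept, clipped

# Made-up readings: nineteen values near 10 and one extreme value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0,
            10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.1, 55.0]

kept, clipped = sigma_clip(readings)
print("kept:", len(kept), "clipped:", clipped)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so in a small sample a single pass like this can miss outliers it would otherwise catch.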
Only later did Borne learn the pitfalls of that approach. “Once I started exploring big data and data science myself over 15 years ago, I realized the outliers are the novel new discoveries in your data. These are the things you didn’t expect to find,” Borne tells Datanami. “I’m thinking to myself, ‘Wow, who knows what I was throwing away when I was doing that!’ Scientists would never do that to their data. But it was standard practice.”
Today, Borne advises his students to ask themselves three questions when they encounter an outlier. First, ask if it’s a data quality issue. Second, ask if it’s a data pipeline or processing error. Third, ask yourself if your Nobel Prize is waiting. “It might really be the next big discovery,” he says.
Data Outliers: Stay or Go?
Helping data scientists deal with outliers is a regular part of the daily routine for Sean Kandel, co-founder and CTO at data quality software startup Trifacta. Every situation demands a different approach, whether it’s removing the outliers, capping the outliers’ values, masking them, or reverting the outliers to the mean. In other cases, data scientists ignore the core distribution of data and focus exclusively on the outliers.
“There isn’t a single source of truth in data quality anymore,” Kandel says. “It really depends on your task what it even means for something to be an outlier and what the appropriate responses might be. It’s very analysis-task dependent, and it’s very important to have somebody who understands what the goals are in the analysis task.”
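The treatments mentioned above (removing, capping, masking, reverting to the mean, or focusing only on the outliers) can be compared side by side. The sketch below is illustrative only: the helper names, the fixed `low`/`high` bounds, and the sample data are assumptions, and "reverting to the mean" is interpreted here as replacing each outlier with the mean of the remaining values.

```python
from statistics import mean

def is_outlier(v, low, high):
    """Hypothetical rule: anything outside [low, high] counts as an outlier."""
    return v < low or v > high

def remove(values, low, high):
    """Drop the outliers entirely."""
    return [v for v in values if not is_outlier(v, low, high)]

def cap(values, low, high):
    """Clamp each outlier to the nearest bound."""
    return [min(max(v, low), high) for v in values]

def mask(values, low, high):
    """Keep each position but replace outliers with None."""
    return [None if is_outlier(v, low, high) else v for v in values]

def revert_to_mean(values, low, high):
    """Replace each outlier with the mean of the non-outlying values."""
    inlier_mean = mean(remove(values, low, high))
    return [inlier_mean if is_outlier(v, low, high) else v for v in values]

def outliers_only(values, low, high):
    """Ignore the core distribution and keep just the outliers."""
    return [v for v in values if is_outlier(v, low, high)]

data = [12, 15, 14, 13, 90, 16, -40, 14]  # made-up values
low, high = 0, 30                         # assumed plausible range

for treatment in (remove, cap, mask, revert_to_mean, outliers_only):
    print(treatment.__name__, treatment(data, low, high))
```

Which of these outputs is "right" depends on the analysis task, which is Kandel's point: the code can apply any treatment, but it cannot choose one.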
While the data scientist may identify the outliers, it’s usually up to the data or business analyst to figure out what they mean and take action. It’s difficult to make any hard and fast rules about how to deal with outliers. Still, there seems to be a general consensus that outliers can be troublesome when building models but may be more useful when actually running the model in production, where an outlier may be the signal you’re looking for.
Borne says clipping outliers makes sense when doing counting statistics. For example, if you’re counting how many people shop at Starbucks in a given region during a given hour of the day, you’re going to get a cluster of values that is fairly homogeneous, say around 100 customers, plus or minus 10 percent. But then two outliers pop up unexpectedly (as they are prone to do) when one store reports 1,000 customers and another store reports zero customers. “That’s such a random statistical fluctuation from the mean that unless you’re looking for an extreme store, it’s better not to include those in your estimate of the mean number of people you’re seeing in the stores,” Borne says.
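Borne's store-count example is easy to reproduce with invented numbers. The sketch below assumes ten hypothetical stores, eight near 100 customers plus the two extremes, and keeps only counts within a tolerance band around the median. The band and its 15 percent width are assumptions chosen for illustration, not something Borne prescribes.

```python
from statistics import mean, median

# Made-up hourly customer counts for ten stores in one region.
counts = [95, 102, 108, 91, 100, 104, 97, 103, 1000, 0]

# Assumed rule: keep counts within 15 percent of the median count.
tolerance = 0.15
center = median(counts)
low, high = center * (1 - tolerance), center * (1 + tolerance)

typical = [c for c in counts if low <= c <= high]
excluded = [c for c in counts if not (low <= c <= high)]

print("mean of all stores:    ", mean(counts))
print("mean of typical stores:", mean(typical))
print("excluded for review:   ", excluded)
```

With these numbers the two extreme stores pull the overall mean far from the cluster the other eight form, while the mean of the typical stores stays close to it. The excluded counts are kept in their own list because, as the article argues, they may be exactly what someone looking for an extreme store wants to see.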
As the founder and CEO of RapidMiner, Ingo Mierswa is quite familiar with different approaches to outliers. The majority of the algorithms in his company’s advanced analytics package are focused on data exploration and preparing raw data for analysis. It’s all about relieving data scientists of the burden of data prep work so they can get on with the work of making discoveries.
For Mierswa, outliers often present a barrier to building a productive predictive model. “You have to find the outliers and remove them…because an outlier can totally destroy the quality of your predictive model, depending on the model,” the data scientist says.
Totally Extreme Outliers
Outliers will often pop up in machine-generated data, such as when thermometers give very high or low readings. “If the sensor goes berserk for a few minutes and all the readings are fundamentally wrong it makes sense to just get rid of them,” Trifacta’s Kandel says. “In other cases the outliers might be legitimate event of extreme behavior or an extreme reading where it’s important to keep it around and understand how often do those types of things occur, and what impact it would have on the overall system.”
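One way to keep extreme sensor readings around while still marking them is to flag them instead of deleting them. The sketch below is an assumption-laden illustration, not a method attributed to Kandel or Trifacta: it uses the median absolute deviation (MAD) as a robust yardstick, and the function name `flag_extremes`, the threshold `k`, and the temperature readings are all invented.

```python
from statistics import median

def flag_extremes(readings, k=5.0):
    """Return one boolean per reading: True if it is far from the median.

    Hypothetical rule: a reading is extreme when its distance from the
    median exceeds k times the median absolute deviation (MAD).
    """
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    if mad == 0:
        # No spread to measure against, so flag nothing.
        return [False] * len(readings)
    return [abs(r - med) > k * mad for r in readings]

# Made-up thermometer readings in degrees Celsius.
temps = [21.0, 21.2, 20.9, 21.1, 21.3, 85.0,
         21.0, 20.8, 21.2, -40.0, 21.1, 21.0]

flags = flag_extremes(temps)
flagged = [t for t, is_extreme in zip(temps, flags) if is_extreme]
rate = len(flagged) / len(temps)

print("flagged readings:", flagged)
print("share of readings flagged:", round(rate, 3))
```

Because the readings are flagged rather than removed, an analyst can still decide later whether they were a sensor fault to drop or a real extreme event worth counting.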
Care is required in how you approach outliers, since macro- and micro-level analyses are often intertwined in the same big data analytic project, according to Martin Hack, the CEO and president at machine learning analytics company Skytree. “The three biggest things you see driving machine learning adoption are recommendations, outlier detection, and predictive analytics,” Hack says. “Those are the three we see over and over again, sometimes in combination.”
As the data gets bigger, any errors you make with your handling of outliers can be magnified. According to Srikanth Velamakanni, co-founder and CEO of Fractal Analytics, organizations are not content to sample their data anymore and are running analytics on entire data sets, a trend that Hadoop is contributing to.
“I’ve seen big companies do a poor job of treating outliers, because sometimes outliers are genuine piece of information and they should not be deleted from the observation,” Velamakanni says. “If people use a very classical approach to statistical models and they have to cut out all these extremes of data, you might be making some very big mistakes in terms of how you’re analyzing and how you’re making conclusions. That’s definitely one of the problems I do see. People make several mistakes in how they treat missing values, how they treat outlier information, and then they make some very big decisions on that, and those decisions could be very faulty.”
The thin line separating signal from noise becomes even more blurry when one considers people and their behavior. Humans demonstrate a high degree of variance in their wants and actions, and being able to discern the important bits takes a bit more work.
“What we’re talking about as noise may be the unique and interesting feature of that particular person or that particular class of user,” Borne says. “The mean line through the population is important because that gives a trend. It’s important to know the trends. But if you’re dealing with individuals, it’s important to know how they deviate from the trend.”
So the next time you find an outlier in your data, don’t automatically delete it. While it may just be a random influx of noise into your sample, it may also indicate something hidden in the data that you didn’t expect to see.
How do you know whether an outlier is the result of a data glitch or a real data point, and perhaps not an outlier at all? It is a difficult question to answer, but the chart below shows that in some cases the outlier is not an error.
In this example, you could argue that we are not using the right metrics: comparing health expenditures in the US (twice the average among developed countries) when US after-tax salary is also twice the average among developed countries leads to a bias. When corrected for this salary bias, the US might not be an outlier anymore in the above chart.
Also, is life expectancy the right metric to use? What if a large group of people die very young because of gang membership, and another group (the majority) dies pretty old? What would be interesting to see is the impact over time, in the US, of increased health expenditures on life expectancy, after excluding people who die from gunshots or car accidents. Note that a more stressful life (typical in the US) can cause early death despite higher health expenditures.
Note the massive impact of the USA dot (the outlier) on R^2 (shown in the bottom right corner): it makes R^2 much smaller than it should be (R^2 = 0.51). But R^2 is a bad metric, sensitive to outliers, and should not be used. Use this metric instead, to measure quality of fit. Indeed, the entire black curve going through the cloud is bent too far toward the southeast because of this outlier.
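The sensitivity of R^2 to a single point can be demonstrated with a small sketch. The data below are invented and are not the values behind the chart, and the fit is a plain straight-line least-squares fit rather than the curve shown there. The helper `r_squared` is a hypothetical name, and it relies on the fact that for a straight-line fit with an intercept, R^2 equals the squared correlation between x and y.

```python
def r_squared(xs, ys):
    """R^2 of a least-squares straight-line fit of ys on xs.

    For simple linear regression with an intercept this equals the
    squared Pearson correlation, which is what is computed here.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Made-up points that follow a clear upward trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [71, 72, 73.5, 74, 75.5, 76, 77.5, 78]

# One hypothetical outlier: a very large x with an unremarkable y.
xs_with_outlier = xs + [16]
ys_with_outlier = ys + [74]

r2_clean = r_squared(xs, ys)
r2_all = r_squared(xs_with_outlier, ys_with_outlier)

print("R^2 without the outlier:", round(r2_clean, 3))
print("R^2 with the outlier:   ", round(r2_all, 3))
```

A single added point out of nine is enough to turn a strong fit into a weak one here, which is the kind of effect the USA dot is described as having on the chart.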