Beware of the giraffes in your data (hidden errors in the aggregations)
Giraffes are what I call portions of data which dominate the rest of the data – and hide important insights. Sometimes they even lead to wrong conclusions.
Let’s say you’re out watching animals in a nature reserve. Undoubtedly, when you spot a majestic giraffe in your binoculars, you’re going to take a good look at it. Meanwhile, the other, smaller animals will all just seem, well, small. Next to the giraffe, you won’t notice that the smaller animals also differ significantly in height from one another.
However, if you can take your eyes off the giraffe for a minute and zoom your binoculars in on the smaller animals on the plain, an amazing thing happens: you become aware that the differences in size between those animals are actually much larger than you had first realized.
A website analytics example
Web analytics “heatmaps” reveal the areas of a webpage with the most intensive visitor mouse movements and mouse clicks. Red areas indicate the most mouse activity, blue areas the least.
On the heatmap of a homepage, we might see only one dark red area, namely the login password field. Because many of the visitors to this page are already registered users of the site, it makes sense that such a large percentage of mouse activity is centered on the login area. However, because all the mouse activity data is aggregated here, important information about where non-registered visitors are looking and clicking is hidden from the analyst’s view.
Once the analyst drills down and removes the giraffe from the data (the registered users), he sees a view of the data that is much more revealing as to the visitors’ areas of interest. In our homepage example we would see a dozen red areas instead of only one. By separating out just one portion of the data (the registered users), the analyst uncovers the important information that will lead him to better decisions about how to improve the website.
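The drill-down described above can be sketched in a few lines of Python. This is only an illustration: the click log, the visitor types, and the page-area names are all invented, and a real heatmap tool would work from pixel coordinates rather than named areas.

```python
from collections import Counter

# Hypothetical click log of (visitor_type, page_area) pairs.
# Every value here is made up purely for illustration.
clicks = (
    [("registered", "login")] * 900
    + [("anonymous", "login")] * 5
    + [("anonymous", "pricing")] * 40
    + [("anonymous", "signup")] * 35
    + [("anonymous", "features")] * 20
)


def share_by_area(rows):
    """Return each page area's share of the clicks in rows."""
    counts = Counter(area for _, area in rows)
    total = sum(counts.values())
    return {area: count / total for area, count in counts.items()}


# Aggregated view: the login area (the giraffe) dominates everything else.
all_visitors = share_by_area(clicks)

# Segmented view: with registered users removed, other areas become visible.
non_registered = share_by_area([row for row in clicks if row[0] != "registered"])

print(all_visitors)
print(non_registered)
```

In the aggregated view of this made-up data, the login area accounts for roughly nine out of ten clicks; once the registered users are filtered out, the remaining clicks spread across several areas that were previously invisible.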
There are often giraffes in your data hiding important insights. They can even lead to erroneous decision making. The handful of examples here are only the tip of the iceberg; there are many more ways that aggregated data can hide insights and mislead marketers and analysts. Other common cases where it pays to look past the giraffe are:
- Understand the true effectiveness of your SEO efforts by eliminating all traffic due to searches which include your brand names.
- Make sure that data on the majority of e-commerce customers (one-time purchasers) is not concealing important insights about the more valuable repeat customers.
- Make sure that data on the 40 percent of iGaming players who churn after their first 24 hours is not leading you to incorrect conclusions about where the most valuable players are acquired.
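As a rough sketch of the first item in the list, the snippet below separates branded from non-branded search traffic. The queries, visit counts, and the brand name "acme" are hypothetical, and matching brand terms as case-insensitive substrings is a simplifying assumption that a real analysis would likely need to refine.

```python
def split_branded(visits_by_query, brand_terms):
    """Split search visits into (branded, non_branded) totals.

    A query counts as branded if it contains any brand term as a
    case-insensitive substring (a simplifying assumption).
    """
    terms = [term.lower() for term in brand_terms]
    branded = sum(
        visits
        for query, visits in visits_by_query.items()
        if any(term in query.lower() for term in terms)
    )
    total = sum(visits_by_query.values())
    return branded, total - branded


# Hypothetical search queries and visit counts, invented for illustration.
visits_by_query = {
    "acme login": 500,
    "Acme": 300,
    "best running shoes": 40,
    "trail shoe reviews": 25,
}

branded_visits, non_branded_visits = split_branded(visits_by_query, ["acme"])
print(branded_visits, non_branded_visits)
```

The same filter-then-compare pattern applies to the other two items: set aside the dominant segment (one-time purchasers, or players who churn within 24 hours) and look at what remains.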
Discovering whether there are any giraffes in your data is sometimes easy: an obviously dominant value will be like a huge giraffe eye staring you in the face. In these cases, it’s important not to ignore it. If you don’t see an obvious giraffe at the aggregated level, it’s important to look for one by slicing the data and checking each slice for dominant values. The most common way to do this is by adding an additional dimension or two.
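One way to look for giraffes systematically is to check, for each dimension you can slice by, whether a single value accounts for most of the rows. The sketch below assumes each record is a dictionary of dimension values; the function name `find_giraffes`, the sample sessions, and the 80 percent threshold are all illustrative choices rather than any established rule.

```python
from collections import Counter


def find_giraffes(rows, dimensions, threshold=0.8):
    """Return {dimension: (value, share)} for each dimension whose most
    common value accounts for more than threshold of all rows."""
    giraffes = {}
    if not rows:
        return giraffes
    for dim in dimensions:
        counts = Counter(row[dim] for row in rows)
        value, count = counts.most_common(1)[0]
        share = count / len(rows)
        if share > threshold:
            giraffes[dim] = (value, share)
    return giraffes


# Hypothetical sessions, invented for illustration.
sessions = (
    [{"visitor_type": "registered", "device": "desktop"}] * 6
    + [{"visitor_type": "registered", "device": "mobile"}] * 3
    + [{"visitor_type": "anonymous", "device": "mobile"}] * 1
)

print(find_giraffes(sessions, ["visitor_type", "device"]))
```

With this made-up data, only the visitor_type dimension is flagged, because registered users make up nine of the ten sessions while neither device type passes the threshold. A flagged dimension is a hint about where to segment, not a conclusion in itself.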