80% of total time spent on Analytics projects is for collating and cleaning datasets
Mo Data stashed this in Big Data Preparation
As analysts, we spend a lot of time collating and cleaning our datasets. Experts estimate this time to be somewhere between 50% - 80% of total time spent on Analytics projects.
At times, we get so involved in creating the dataset that we forget to step back and check whether it looks the way it should. The following article describes a framework for checking data sanity every time you work on a structured dataset.
Here are the 7 steps of the framework:
1. Check number of columns and rows against expectations
2. Check for duplicates at id level
3. Check for blank columns, large % of blank data, high % of same data
4. Look at the distribution across various segments – check against business understanding and use pivot tables
5. Check outliers on all key variables – especially the computed ones
6. Check if values of a few test cases are in sync
7. Pick up a few rows and check out their values in the underlying systems
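The automatable parts of the checklist above (steps 1, 2, 3, and 5) can be sketched as a small pandas helper. This is a minimal illustration, not the article's own code: the function name, the IQR outlier rule, and the report structure are my assumptions. Steps 4, 6, and 7 are inherently manual (pivot tables, business judgment, and spot-checking source systems), so they are left out.

```python
import numpy as np
import pandas as pd


def sanity_check(df: pd.DataFrame, id_col: str, expected_shape=None):
    """Run basic automated sanity checks on a structured dataset.

    Hypothetical helper; covers steps 1-3 and 5 of the framework.
    """
    report = {}

    # Step 1: rows/columns vs expectations
    report["shape"] = df.shape
    if expected_shape is not None:
        report["shape_ok"] = df.shape == expected_shape

    # Step 2: duplicates at id level
    report["duplicate_ids"] = int(df[id_col].duplicated().sum())

    # Step 3: % blank data and % of the single most common value per column
    report["pct_missing"] = df.isna().mean().round(3).to_dict()
    report["pct_top_value"] = {
        c: round(df[c].value_counts(normalize=True, dropna=False).iloc[0], 3)
        for c in df.columns
    }

    # Step 5: outliers on numeric columns, via a simple 1.5*IQR rule
    outliers = {}
    for c in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[c].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[c] < q1 - 1.5 * iqr) | (df[c] > q3 + 1.5 * iqr)
        outliers[c] = int(mask.sum())
    report["outlier_counts"] = outliers

    return report
```

A report like this won't replace eyeballing the data, but it makes the routine checks cheap enough to run on every refresh of the dataset.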
You can read the entire article here
I'm surprised it's only 80% of time. Normalization is hard.
If your $300-per-hour contract data scientist spends 80% of their time preparing data for the real value-adding analytics, that's $240 spent on cleansing for every $60 spent on analysis. That's like hiring Michelangelo to paint your ceiling and having him move the furniture and put that blue tape up everywhere before he starts.
That's where we at Mo-Data fit in. It's a dirty job, but someone has to do it, and we do it fast. Oddly enough, most of the work is pre-technology: a good amount of process and standardization can be applied at the very source of the data. If you can influence the source, normalize at the point of creation. That way it's far easier to tidy and clean the data as it comes in, whether streaming or in bulk uploads.
So the analyst has a beautiful data lake to draw from...