The lifecycle of a data science project
Mo Data stashed this in Analysis Tips and Tricks
A traditional business workflow, customized here for data science.
1. Identify the problem
- Identify metrics used to measure success over baseline (doing nothing)
- Identify type of problem: prototyping, proof of concept, root cause analysis, predictive analytics, prescriptive analytics, machine-to-machine implementation
- Identify key people within your organization and outside
- Get specifications, requirements, priorities, budgets
- How accurate does the solution need to be?
- Do we need all the data?
- Build internally versus use a vendor solution
- Vendor comparison, benchmarking
2. Identify available data sources
- Extract (or obtain) and check sample data (use sound sampling techniques); discuss fields to make sure you understand the data
- Perform EDA (exploratory data analysis; build a data dictionary)
- Assess quality of data, and value available in data
- Identify data glitches, find work-arounds
- Are data quality and field population consistent over time?
- Are some fields a blend of different things? (Example: a keyword field sometimes equal to the user query, sometimes to the advertiser keyword, with no way to tell which except via statistical analysis or by talking to business people.)
- How to improve data quality moving forward
- Do I need to create mini summary tables or a separate database?
- Which tools do I need (R, Excel, Tableau, Python, Perl, SAS, and so on)?
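The EDA and data-quality bullets above can be sketched as a small column profiler; the sample data and field names here are illustrative assumptions, not a prescribed schema:

```python
# Minimal EDA sketch: profile each column of a sample extract into a
# small data dictionary (type, missing rate, cardinality).
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, fraction missing, number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": df.isna().mean().round(3),
        "n_unique": df.nunique(),
    })

# Toy sample extract (placeholder fields).
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "keyword": ["shoes", None, "boots", "shoes"],
})
print(profile(sample))
```

A profile like this makes gaps (the 25% missing `keyword` values) visible before any modeling starts.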
3. Identify if additional data sources are needed
- What fields should be captured?
- How granular?
- How much historical data?
- Do we need real-time data?
- How to store or access the data (NoSQL? Map-Reduce?)
- Do we need experimental design?
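If the experimental-design question above is answered yes, one common pattern is deterministic A/B assignment by hashing user IDs, so the split stays stable across data pulls; the 50/50 share and the use of MD5 are assumptions for illustration:

```python
# Deterministic A/B assignment: hash the user ID, map it to [0, 1),
# and compare against the desired treatment share.
import hashlib

def assign_group(user_id: str, treatment_share: float = 0.5) -> str:
    """Same user_id always lands in the same group."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "treatment" if (h % 10000) / 10000 < treatment_share else "control"

groups = [assign_group(str(i)) for i in range(1000)]
print(groups.count("treatment"))  # roughly half of 1000
```

Because assignment depends only on the ID, the experiment needs no stored assignment table.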
4. Statistical Analyses
- Use imputation methods as needed
- Detect / remove outliers
- Variable selection (variable reduction)
- Is the data censored? (Hidden data, as in survival analysis or time-to-crime statistics.)
- Cross-correlation analysis
- Model selection (as needed, favor simple models)
- Sensitivity analysis
- Cross-validation, model fitting
- Measure accuracy, provide confidence intervals
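Two of the step-4 tasks, imputation and outlier removal, can be sketched as follows; median imputation and the 1.5 × IQR fence are common conventions chosen for illustration, not the only defensible choices:

```python
# Sketch of two statistical-cleaning steps: median imputation of
# missing values, then IQR-based outlier removal.
import numpy as np

def impute_median(x: np.ndarray) -> np.ndarray:
    """Replace NaNs with the median of the observed values."""
    x = x.copy()
    x[np.isnan(x)] = np.nanmedian(x)
    return x

def remove_outliers(x: np.ndarray) -> np.ndarray:
    """Keep values within 1.5 * IQR of the quartiles (Tukey's fence)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

data = np.array([1.0, 2.0, np.nan, 3.0, 100.0])
clean = remove_outliers(impute_median(data))
print(clean)  # the NaN becomes 2.5; 100.0 is dropped as an outlier
```

Order matters here: imputing first keeps the outlier from distorting the fill value only because the median is robust; with mean imputation the 100.0 would leak into the imputed value.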
5. Implementation, development
- FSSRR: Fast, simple, scalable, robust, re-usable
- How frequently do I need to update lookup tables, white lists, data uploads, and so on
- Need to create an API to communicate with other apps?
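The lookup-table update question above can be handled with a table that refreshes itself after a time-to-live; the loader callable, the whitelist contents, and the TTL values are hypothetical:

```python
# Sketch of a self-refreshing lookup table (e.g. a whitelist):
# reload from the source once the cached copy is older than the TTL.
import time

class RefreshingLookup:
    def __init__(self, loader, ttl_seconds=24 * 3600):
        self.loader = loader              # callable returning a fresh dict
        self.ttl = ttl_seconds
        self._table = loader()
        self._loaded_at = time.monotonic()

    def get(self, key, default=None):
        if time.monotonic() - self._loaded_at > self.ttl:
            self._table = self.loader()   # stale: reload and restamp
            self._loaded_at = time.monotonic()
        return self._table.get(key, default)

whitelist = RefreshingLookup(lambda: {"good.com": True}, ttl_seconds=3600)
print(whitelist.get("good.com"))  # True
```

Choosing the TTL is exactly the "how frequently do I update" question: short enough that stale entries don't matter, long enough that reloads stay cheap.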
6. Communicate results
- Need to integrate results in dashboard? Need to create an email alert system?
- Decide on dashboard architecture, with business people
- Discuss potential improvements (with cost estimates)
- Provide training
- Comment code; write a technical report explaining how the solution should be used, how parameters are fine-tuned, and how results are interpreted
- Test the model or implementation; stress tests
- Regular updates
- Finally, hand the solution off to engineering and business people in your company, once it is stable
- Help move solution to new platform or vendor
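The email-alert idea in step 6 reduces to a rule that fires when a metric drifts too far from its trailing mean; the metric name, 20% tolerance, and message format below are assumptions, and the SMTP wiring is omitted:

```python
# Sketch of an alert rule: compare today's metric to its trailing mean
# and build an alert message when the relative drift exceeds a tolerance.
def build_alert(metric_name, today, history, tolerance=0.2):
    """Return an alert string if drift exceeds tolerance, else None."""
    baseline = sum(history) / len(history)
    drift = abs(today - baseline) / baseline
    if drift > tolerance:
        return f"ALERT: {metric_name} moved {drift:.0%} vs trailing mean {baseline:.1f}"
    return None

msg = build_alert("daily_revenue", 50.0, [100.0, 98.0, 102.0])
print(msg)
```

The rule itself is deliberately dumb; the dashboard-architecture discussion with business people is where the right metrics and tolerances get decided.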
It's even easier:
0) I wonder what we're going to do with all our big data?
1) Yah! We can collect a bunch of data!
2) Oh shit, we don't know what to do with it, so we hire a bunch of data scientists
3) IT guys complain like malcontents
4) Data scientists make rational sense and domain knowledge out of it, then leave for higher-paying jobs than you can offer.
5) You are left with a bunch of malcontent IT guys, no data scientists, and no results
6) Go to step 0).