
Data Products: One-off analyses can be great, but a repeatable, reproducible analysis is much better.


Data Products Venn Diagram. This diagram is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

What does this mean? First of all, there are three sets of skills, directly paralleling Drew’s data science skill sets, all floating in a sea of data. When you combine Data with Domain Knowledge, you get Spreadsheets. With Statistics, Predictive Analytics, and Visualization, you get Exploratory Data Analysis and Statistical Programming. And with Software Engineering, you get Databases. Highly useful systems and products, but nothing particularly new.

Combining pairs of sets with this sea of data, you get more specific products:

Data + Software Engineering + Domain Knowledge = Business Rules and Expert Systems with implementations such as Drools and FICO’s Blaze.

Data + Software Engineering + EDA & Statistical Programming = BI and Statistics Tools, such as Tableau, SPSS, and many more general-purpose statistical systems.

Data + Domain Knowledge + EDA & Statistical Programming = One-Off Analyses, which may be a PDF article, or a data-driven Powerpoint presentation, or simply a chart showing a distribution sent via email.

And at the center of it all is a Data Product, which is a piece of software that includes both Domain Knowledge and Statistical components. These may be widgets in a larger web tool, such as LinkedIn’s People You May Know, or software systems designed for specific analytic purposes, with baked-in domain knowledge. Tools designed for statistical analysis of DNA sequences, or for optimizing truck routing for distributors, or for many other things, all fall into this category. In many cases, data products make it easy for regular people to get what they need without having to dive into a very complex set of data and a very complex set of algorithms.
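As a rough illustration, the regions of the diagram described above can be written down as a lookup from combinations of skill sets to the kind of product they yield. This is only a sketch of the framework, not anything taken from the original diagram's tooling: the constant names, the `REGIONS` table, and the `classify` helper are all hypothetical, and "sea of data" is assumed to be present in every region.

```python
# Hypothetical labels for the three skill sets; data is assumed in every region.
DOMAIN = "domain knowledge"
STATS = "statistics, predictive analytics, and visualization"
SOFTWARE = "software engineering"

# Each region of the Venn diagram, keyed by the skill sets combined with data.
REGIONS = {
    frozenset({DOMAIN}): "Spreadsheets",
    frozenset({STATS}): "Exploratory Data Analysis and Statistical Programming",
    frozenset({SOFTWARE}): "Databases",
    frozenset({SOFTWARE, DOMAIN}): "Business Rules and Expert Systems",
    frozenset({SOFTWARE, STATS}): "BI and Statistics Tools",
    frozenset({DOMAIN, STATS}): "One-Off Analyses",
    frozenset({DOMAIN, STATS, SOFTWARE}): "Data Product",
}


def classify(*skills):
    """Return the diagram region for the given skill sets (order does not matter)."""
    return REGIONS[frozenset(skills)]


print(classify(SOFTWARE, DOMAIN))
print(classify(DOMAIN, STATS, SOFTWARE))
```

Keying the table on `frozenset` means the order in which skill sets are listed does not matter, which matches how overlapping regions in a Venn diagram work.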

What are the consequences of this framework? I’d assert that a product that combines all three aspects of a data product, requiring all three skill sets of a data scientist to design and build, may be substantially more valuable than products that combine just one or two of the components.

One-off analyses can be great, but a repeatable, reproducible analysis is much better. Business rules can lead to maintainable software systems, but without statistical capabilities, they may be too rigid to work adequately in many real-world situations. (See the history of AI research prior to about the 1980s.) And general-purpose BI and Statistics tools are extremely useful, but may become even more powerful when the systems are designed for and incorporate particular domain knowledge.
