Leveraging Big Data Techniques for Big Data Quality, #Hadoop, #MapReduce #YARN
The orders-of-magnitude growth in data volumes for “big data” applications has many different types of data quality practitioners figuratively salivating at the opportunities to apply their favorite data quality techniques. For some practitioners, there is the perception that in a big data environment, “lots of data” means “lots of errors,” which of course also means “lots of cleansing.” For others, the concept of “big data quality” is as simple as prepending “big” onto a philosophy of “data quality” rooted in the quality techniques applied in a typical manufacturing environment.
Yet on reflection, the conventional approaches associated with manufacturing quality are less applicable in a big data world. Consider these examples:
- Customer requirements analysis is used to identify consumer data quality criteria to be applied at the point of data acquisition. In a big data environment, however, data sets are often repurposed in numerous ways, and there is an expectation that the original raw form of the data remains available. That makes it hard to apply a selected set of data quality rules effectively at the entry point into the enterprise.
- Similarly, supplier management is a process for communicating your own data quality requirements to data suppliers so that they can validate the data before it is exchanged. In a big data environment, though, the source of the data is often well beyond the organization’s administrative control or is completely unknown, which makes supplier management difficult if not impossible.
This suggests that assessing the characteristics of data quality management for big data requires a little adjustment in thinking. First, one must consider data consumption models for big data environments. These typically involve widespread data repurposing in which data analysts and data scientists prefer to see the data sets in raw form, free from the shackles of the typical dimensional models imposed by the IT department. Yet at the same time, the results of big data analytics need to be integrated with the data warehouse and business intelligence architecture already in place within the enterprise.
Second, the types of big data analyses applied to both structured and unstructured data require insight about data content in addition to its structure. This means that standard data quality tasks such as scanning and parsing text must go beyond validation against known formats to ascertain meaning in context. Ultimately, the value of discrete pieces of information varies in relation to content type, precision, timeliness and overall volume, yet there are limited opportunities for ensuring uniform “quality” of the data. The consumption orientation of big data analytics implies that rather than objective “quality,” the focus must be on information utility, and correspondingly, the onus of ensuring information utility is on the data consumer, not the data producer.
The consumer focus for big data quality means that we have to reconsider which dimensions of data quality are relevant for big data applications. Because there is limited control over quality at intake, the following dimensions take on greater importance:
- Temporal consistency, or ensuring that time-dependencies are observed;
- Completeness, ensuring that the data elements to be used are populated;
- Precision consistency, which monitors consistency of the units of measure and the precision of data values that are consumed by analytic applications;
- Currency, ensuring that data is as up to date as possible;
- Unique identifiability of entities (such as individuals, organizations and entities in any other domain);
- Timeliness in terms of presenting acquired data within a defined agreed-to time frame; and
- Semantic consistency, ensuring consistency in the interpretation of identified data concepts in similar contexts.
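A few of these dimensions can be checked mechanically on the consumer side. The following is a minimal sketch, assuming a small list of dictionary records; the field names (`id`, `opened`, `closed`, `amount`) and the sample values are hypothetical and not drawn from any particular system. It measures completeness, flags temporal inconsistencies, finds non-unique identifiers, and reports inconsistent numeric precision.

```python
from datetime import date

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"id": "a1", "opened": date(2014, 1, 5), "closed": date(2014, 2, 1), "amount": "10.50"},
    {"id": "a2", "opened": date(2014, 3, 9), "closed": date(2014, 3, 1), "amount": "7.2"},
    {"id": "a1", "opened": date(2014, 4, 2), "closed": None, "amount": "3.00"},
]

def completeness(rows, field):
    """Completeness: fraction of rows in which `field` is populated."""
    if not rows:
        return 0.0
    populated = sum(1 for r in rows if r.get(field) not in (None, ""))
    return populated / len(rows)

def temporal_violations(rows, start, end):
    """Temporal consistency: IDs of rows whose end date precedes the start date."""
    return [r["id"] for r in rows
            if r.get(start) and r.get(end) and r[end] < r[start]]

def duplicate_ids(rows, key):
    """Unique identifiability: key values that occur more than once."""
    seen, dupes = set(), set()
    for r in rows:
        if r[key] in seen:
            dupes.add(r[key])
        seen.add(r[key])
    return sorted(dupes)

def decimal_places(rows, field):
    """Precision consistency: distinct counts of digits after the decimal point."""
    return {len(r[field].partition(".")[2]) for r in rows}

print(completeness(records, "closed"))
print(temporal_violations(records, "opened", "closed"))
print(duplicate_ids(records, "id"))
print(decimal_places(records, "amount"))
```

Which thresholds count as acceptable (for example, how much missing data an analysis can tolerate) is left to the data consumer, in keeping with the focus on utility rather than a single objective standard.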
Even with a diminished need for manufacturing-style data quality methods, there will still be a need for the practical aspects of data quality that can be automated, especially with the need to scan both structured and unstructured text for reference data validation and entity identification and resolution. The archetypical applications used in support of these activities include:
- Data profiling to provide statistical analysis of value frequency and guidance for business rules for data validation;
- Data validation using data quality and business rules for verifying consistency and completeness, and to help in disambiguating semantic context;
- Identity resolution using advanced techniques for recognizing entities and linking records that refer to the same entity across structured and unstructured sources;
- Data cleansing, standardization and enhancement that apply parsing and standardization rules within context of end-use business analyses and applications; and
- Inspection, monitoring and remediation to empower data stewards to monitor quality and take proper actions for ensuring big data utility.
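The first two items in this list, profiling and rule-based validation, can be illustrated with a small sketch. It assumes a single column of string values and a hypothetical five-digit postal code rule; the helper names (`pattern_of`, `profile`, `validate`) are invented for illustration and do not belong to any specific data quality product.

```python
import re
from collections import Counter

def pattern_of(value):
    """Abstract a value to its shape: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def profile(values):
    """Profiling: value-frequency and shape-frequency statistics for one column."""
    return {
        "count": len(values),
        "distinct": len(set(values)),
        "top_values": Counter(values).most_common(3),
        "patterns": Counter(pattern_of(v) for v in values),
    }

def validate(values, rule):
    """Validation: return the values that do not fully match a compiled regex rule."""
    return [v for v in values if not rule.fullmatch(v)]

# Hypothetical column of postal codes containing a few defects.
zips = ["02139", "10001", "1000", "02139", "N/A"]
ZIP_RULE = re.compile(r"\d{5}")  # assumed business rule: exactly five digits

stats = profile(zips)
failures = validate(zips, ZIP_RULE)
print(stats["patterns"])
print(failures)
```

The shape frequencies produced by profiling are what suggest the validation rule in the first place: a dominant `99999` pattern with a few outliers points to a five-digit rule and to the values that a data steward would need to inspect.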
Interestingly, all of these methods and tools exhibit characteristics that make them well suited to implementation on a distributed, parallel data platform such as Hadoop with MapReduce and, increasingly, YARN (the resource management layer introduced in Hadoop 2.0). These characteristics include:
- The ability to process massive data volumes that can be split across numerous storage resources.
- The desire to apply different methods to a wide variety of data inputs.
- Performance that is usually limited by data latency, which can be improved by combining columnar data layouts, pipelined data streaming and in-memory processing.
- Computational performance that can be scaled in a linear relationship to the size of the data.
- Algorithms that are easily adapted to run in parallel.
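To see why such tasks parallelize easily, consider value-frequency profiling expressed as a map step over independent partitions followed by a reduce step that merges partial results. The sketch below is a single-machine stand-in written with the Python standard library, not Hadoop's API; the partitioning function and the thread pool are assumptions made purely to show the structure, and threads in CPython will not deliver the speedup that a real cluster would.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def split(values, n_parts):
    """Partition a list into n_parts slices (a stand-in for distributed blocks)."""
    return [values[i::n_parts] for i in range(n_parts)]

def map_partition(part):
    """Map step: count value frequencies using only the data in one partition."""
    return Counter(part)

def reduce_counts(left, right):
    """Reduce step: merge two partial frequency tables."""
    return left + right

def parallel_frequencies(values, n_parts=4):
    """Profile value frequencies by mapping over partitions and reducing the results."""
    parts = split(values, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_partition, parts))
    return reduce(reduce_counts, partials, Counter())

# Hypothetical column with inconsistent casing and an empty value.
states = ["MA", "NY", "ma", "NY", "CA", "NY", "MA", ""] * 1000
freq = parallel_frequencies(states)
print(freq.most_common(3))
```

Because each partition is counted without reference to any other, and merging frequency tables is associative, the same decomposition carries over to a MapReduce job in which the partitions live on different nodes.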
Hadoop’s parallel and distributed computing enables scalability in deploying key data quality tasks, and each of these algorithms can be adapted to improve performance when implemented on YARN. In other words, even as the nature of data quality management evolves for big data applications, the core technologies are eminently adaptable to exploit the computational fabric they are intended to support. This suggests the imminent emergence of companies that deploy extraction, transformation and data quality technologies on top of Hadoop using MapReduce, or that provide traditional data parsing and standardization on a massive scale.