Data Quality: Big Issue for Big Data ~ Future of CIO

Sunday, October 28, 2012

Data Quality: Big Issue for Big Data

9:03 PM Pearl Zhu

As the old saying goes, every journey begins with a first step. For Big Data or any BI project, data quality management is crucial to the project’s success. Industry analytics research shows that intuition trumps analytics among many business professionals. What are the main reasons? Lack of confidence in the data tops the list.

After walking through 5W+1H Big Data Navigation, pondering how to decode Big Data via Five Big Senses or use the right tool, and perceiving the Big Data picture, here we go back to the basic ingredient, data, and to the top concern in data analytics: data quality.

1. Data Quality is not for “A Single Version of Truth”

Of the data that comes from external sources, 43 percent comes from social networks, audio makes up 38 percent, and photo or video comprises 43 percent. Big Data is characterized by 3V+1C (Volume, Velocity, Variety, and Complexity).

Data experts also offer a better definition of data quality: “the extent to which the data actually represents what it purports to represent.” Data quality is a multi-dimensional concept:

Objective Data Quality Dimensions: Integrity, Accuracy, Validity, Completeness, Consistency, Existence

Subjective Data Quality Dimensions: Understandability, Objectivity, Timeliness, Relevance, Interpretability, Trust

Therefore, the goal of data quality management is not the pursuit of “a single version of truth” but the ability to enumerate, rationalize, and perceive the different versions of truth. The diagnosis and analysis effort needs to focus on specific problems in the relationship between the data and what it represents, and on the kinds of business puzzles that can be untangled through such data inter-relationships.
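To make two of the objective dimensions listed above concrete, here is a minimal Python sketch that scores completeness and validity for a handful of records. The sample records, field names, and validation rules are illustrative assumptions, not taken from any real data set or tool.

```python
import re

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "email": "ann@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},
    {"id": 3, "email": "bob-at-example.com", "age": -5},
    {"id": 4, "email": "cy@example.com", "age": None},
]

# A deliberately simple email pattern; real validation rules would be stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows in which the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def validity(rows, field, is_valid):
    """Share of the non-empty values that pass the given rule."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 0.0
    return sum(1 for v in values if is_valid(v)) / len(values)


email_completeness = completeness(records, "email")
email_validity = validity(records, "email", lambda v: bool(EMAIL_RE.match(v)))
age_validity = validity(records, "age", lambda v: 0 <= v <= 120)

print(f"email completeness: {email_completeness:.2f}")
print(f"email validity:     {email_validity:.2f}")
print(f"age validity:       {age_validity:.2f}")
```

Each score describes one aspect of the relationship between the data and what it represents. The subjective dimensions, such as relevance or trust, would need input from the people who use the data and are not something a script like this can score.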

2. Big Data Does Not Need to Be “Perfect”

There is no “perfect” data in the Big Data world; accuracy and compromise will continue to coexist across the span of information management. Big Data quality efforts need to be defined more as profiling and standards than as cleansing, which is better aligned with how big data is managed and processed.

“Good enough” data can be more useful than perfect data, as long as the information is good enough for the recipient to make sound business decisions or solve specific business problems from their best angles. It takes longer to make data more accurate, and that delay may actually diminish its value rather than improve it.

Think about data quality in the context of supporting preprocessing with Hadoop and MapR through profiling and standards, not cleansing. Or consider the value of shining a social light on data quality: using collaborative tools such as social media to crowd-source data quality improvements.
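As a rough illustration of "profiling and standards, not cleansing", the Python sketch below summarizes a column and lists the values that fall outside an agreed standard, without changing any of the data. The column values, the `STANDARD_COUNTRIES` set, and the function names are hypothetical.

```python
from collections import Counter

# Assumed standard for the column: the only codes considered conformant.
STANDARD_COUNTRIES = {"US", "DE"}


def profile_column(values):
    """Summarize a column without modifying it."""
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(2),
    }


def nonconforming(values, standard):
    """List the distinct non-empty values that are outside the standard."""
    present = [v for v in values if v is not None and v != ""]
    return sorted({v for v in present if v not in standard})


# Hypothetical raw column, left exactly as it arrived.
country = ["US", "us", "USA", "US", None, "DE", ""]

report = profile_column(country)
outliers = nonconforming(country, STANDARD_COUNTRIES)

print(report)
print("outside the standard:", outliers)
```

The raw values stay untouched. The profile only tells the recipient how far the column is from the standard, so they can judge whether it is good enough for the decision at hand.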

3. Embed Data Quality Management into Big Data Deployment

When creating a data quality strategy, there are six factors, or aspects, of an organization’s operations that need to be considered. The six factors are:

Storage: where the data resides
Context: the type of data being profiled and the purposes for which it is used
Data flow: how the data is accessed and flows through the organization
Workflow: how work activities interact with data and use the data
Stewardship: people responsible for managing the data
Continuous monitoring: processes for regularly validating the data

Flawed data management and information production processes introduce risks that can prevent the achievement of critical business goals. These flaws can be mitigated through data quality management and control: controlling the quality of the information production process from beginning to end, so that imperfections are identified early, prioritized, and remediated before they cause negative impacts.

Closed-Loop Data Quality Management cycle:

Identify suspected hot spots → establish quality metrics → profile & measure → expose metrics via reports and dashboards → stewards review metrics → design data quality controls → embed controls in data flow for BI → monitor and refocus data quality analysis.
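One way to picture the "embed controls in data flow" and "monitor" steps of this cycle is a small quality gate that measures each batch against agreed thresholds and flags hot spots for steward review. The Python sketch below is only an illustration; the field names, thresholds, and the `quality_gate` function are assumptions, not part of any specific BI tool.

```python
def measure(rows, field):
    """Completeness metric: share of rows with a non-empty value."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def quality_gate(rows, thresholds):
    """Compare each field's metric with its minimum; return (passed, report)."""
    report = {}
    for field, minimum in thresholds.items():
        score = measure(rows, field)
        report[field] = {"score": score, "minimum": minimum, "ok": score >= minimum}
    passed = all(entry["ok"] for entry in report.values())
    return passed, report


# Hypothetical incoming batch and steward-approved thresholds.
batch = [
    {"customer_id": "C1", "region": "EMEA"},
    {"customer_id": "C2", "region": ""},
    {"customer_id": "C3", "region": "APAC"},
    {"customer_id": "C4", "region": None},
]
thresholds = {"customer_id": 1.0, "region": 0.9}

passed, report = quality_gate(batch, thresholds)

# Fields that fail the gate become the next round's suspected hot spots.
hot_spots = [field for field, entry in report.items() if not entry["ok"]]

print("batch passed:", passed)
print("hot spots for steward review:", hot_spots)
```

The report could feed a dashboard, and the list of failing fields closes the loop by pointing the next round of analysis at the weakest spots.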