As the old saying goes, every journey begins with a single step. For Big Data, or any BI project, that first step is Data Quality Management, which is crucial to the project’s success. Industry analytics research shows that intuition still trumps analytics among many business professionals. What is the main reason? A lack of confidence in the data tops the list.
After walking through the 5W+1H Big Data Navigation, pondering how to decode Big Data via the Five Big Senses or the right tools, and perceiving the Big Data picture, we now go back to the basic ingredient, Data, and to the top concern in data analytics: Data Quality.
1. Data Quality Is Not About “A Single Version of Truth”
Data experts offer a better definition of data quality: “the extent to which the data actually represents what it purports to represent.” Data quality is a multi-dimensional concept:
- Objective data quality dimensions: integrity, accuracy, validity, completeness, consistency, existence
- Subjective data quality dimensions: understandability, objectivity, timeliness, relevance, interpretability, trust
Therefore, the goal of data quality management is not the pursuit of “a single version of truth,” but how we enumerate, rationalize, and perceive all the different versions of truth. Diagnosis and analysis efforts need to focus on the specific relationship between the data and what it represents, and on which business puzzles can be untangled through those relationships.
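As an illustration of the objective dimensions above, here is a minimal Python sketch that scores two “versions of truth” about the same customer on completeness, validity, and consistency; the record fields, the email rule, and the score_quality helper are all hypothetical, not taken from any particular tool:

```python
import re

# Two hypothetical "versions of truth" about the same customer,
# drawn from different source systems.
RECORDS = [
    {"id": "C001", "email": "pat@example.com", "age": 34},
    {"id": "C001", "email": "pat@example",     "age": None},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
FIELDS = ["id", "email", "age"]

def score_quality(record):
    """Score one record on two objective dimensions (0.0 to 1.0 each)."""
    # Completeness: share of fields that are actually populated.
    completeness = sum(record[f] is not None for f in FIELDS) / len(FIELDS)
    # Validity: does the email conform to its expected format?
    validity = 1.0 if record["email"] and EMAIL_RE.match(record["email"]) else 0.0
    return {"completeness": round(completeness, 2), "validity": validity}

for r in RECORDS:
    print(r["id"], score_quality(r))

# Consistency: how far do the two versions agree with each other?
agreeing = sum(RECORDS[0][f] == RECORDS[1][f] for f in FIELDS)
print("consistency:", round(agreeing / len(FIELDS), 2))
```

Note that neither record is declared “the truth”; each is scored on how well it represents what it purports to represent.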
2. Big Data Does Not Need to Be “Perfect”
There is no “perfect” data in the Big Data world; accuracy and compromise will continue to coexist across the span of information management. Big Data quality efforts therefore need to be defined more in terms of profiling and standards than of cleansing, which aligns better with how big data is managed and processed.
“Good enough” data can be more useful than perfect data: as long as the information is good enough for the recipient to make sound business decisions or solve specific business problems from the best available angle, it has done its job. Making data more accurate takes time, and that delay may actually diminish the data’s value rather than improve it.
Think about data quality in the context of supporting preprocessing with Hadoop and MapReduce through profiling and standards, not cleansing. Or consider the value of shining a social light on data quality: using collaborative tools like social media to crowd-source data quality improvements.
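As a sketch of the profiling-and-standards-over-cleansing idea, the toy Python pass below profiles columns (null counts, distinct values, conformance to a format standard) without rewriting a single record; the column names, sample rows, and date standard are assumptions for illustration:

```python
from collections import Counter
import re

# Hypothetical standard: order dates should look like YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

rows = [
    {"order_id": "1001", "order_date": "2014-03-02", "region": "EMEA"},
    {"order_id": "1002", "order_date": "03/02/2014", "region": None},
    {"order_id": "1003", "order_date": "2014-03-05", "region": "emea"},
]

def profile(rows, column, standard=None):
    """Profile one column: count and summarize, never rewrite values."""
    values = [r[column] for r in rows]
    stats = {
        "nulls": sum(v is None for v in values),
        "distinct": len(set(values)),
        "top": Counter(v for v in values if v is not None).most_common(1),
    }
    if standard:
        # Measure conformance to the standard instead of "fixing" rows.
        stats["conforming"] = sum(bool(v and standard.match(v)) for v in values)
    return stats

for col, std in [("order_date", DATE_RE), ("region", None)]:
    print(col, profile(rows, col, std))
```

The output surfaces the nonconforming date and the casing drift in region, so downstream consumers know exactly how “good enough” the data is, without anyone silently altering it.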
3. Embed Data Quality Management into Big Data Deployment
When creating a data quality strategy, there are six factors, or aspects, of an organization’s operations that need to be considered. The six factors are:
- Storage: where the data resides
- Context: the type of data being profiled and the purposes for which it is used
- Data flow: how the data enters and flows through the organization
- Workflow: how work activities interact with data and use the data
- Stewardship: people responsible for managing the data
- Continuous monitoring: processes for regularly validating the data (see the sketch after this list)
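To make the continuous-monitoring factor concrete, here is a minimal sketch of one validation rule that could run on every load, producing a pass/fail result for stewards to review; the freshness rule, column name, and threshold are hypothetical:

```python
import datetime

def freshness_check(rows, ts_column, max_age_hours=24):
    """Continuous-monitoring rule: flag the load as stale when its
    newest timestamp is older than the agreed maximum age."""
    newest = max(r[ts_column] for r in rows)
    age = datetime.datetime.utcnow() - newest
    return {
        "check": f"{ts_column}_freshness",
        "newest": newest.isoformat(),
        "passed": age <= datetime.timedelta(hours=max_age_hours),
    }

# A hypothetical nightly load; in practice this would run on a schedule.
load = [
    {"loaded_at": datetime.datetime(2014, 3, 2, 8, 0)},
    {"loaded_at": datetime.datetime(2014, 3, 2, 9, 30)},
]
print(freshness_check(load, "loaded_at"))
```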
Flawed data management and information production processes introduce risks that can prevent the achievement of critical business goals. These flaws can be mitigated through data quality management and control: managing the quality of the information production process from beginning to end, so that any imperfections are identified early, prioritized, and remediated before negative impacts are incurred.
The closed-loop Data Quality Management cycle:
Identify suspected hot spots → establish quality metrics → profile & measure → expose metrics via reports and dashboards → stewards review metrics → design data quality controls → embed controls in the data flow for BI → monitor and refocus data quality analysis.
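One way to read the loop is as a chain of small, repeatable stages. The hedged Python sketch below wires a few of them together (profile & measure, steward review, refocus); the metric name and target are assumptions, not prescriptions:

```python
def profile_and_measure(rows):
    """Profile the hot spot and compute a quality metric."""
    valid = sum(r.get("email") not in (None, "") for r in rows)
    return {"email_populated_pct": 100.0 * valid / len(rows)}

def steward_review(metrics, targets):
    """Compare measured metrics against steward-set targets."""
    return {name: metrics[name] >= target for name, target in targets.items()}

def refocus(review):
    """Failing metrics become the next cycle's suspected hot spots."""
    return [name for name, passed in review.items() if not passed]

rows = [{"email": "a@x.com"}, {"email": ""}, {"email": "b@y.com"}]
metrics = profile_and_measure(rows)                        # profile & measure
review = steward_review(metrics, {"email_populated_pct": 90.0})
print("next hot spots:", refocus(review))                  # monitor and refocus
```

Because the failing metric feeds back in as the next hot spot, the loop closes: each pass narrows the analysis to wherever quality is actually slipping.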