How to Overcome Data Cleansing Challenge ~ Future of CIO

Thursday, October 29, 2015

How to Overcome Data Cleansing Challenge

11:21 PM Pearl Zhu

Good governance is essential to overall Data Management.

Data is the lifeblood of digital organizations, and data cleansing has always been a challenge. The best data-driven organizations focus relentlessly on keeping their data clean, yet cleaning the data is often the most difficult and time-consuming part of data science. So what are the best principles and practices to overcome the data quality challenge?

Data left unmanaged is a huge liability, whether it's multi-structured, semi-structured or fully structured. Start with the basic questions: How do you determine whether the collected data is independent? How do you determine whether the data meets stationarity requirements, and which other requirements the various big data tools impose, such as homogeneity, identical distribution, Gaussian or non-Gaussian behavior, and sample size? Importantly, you need to know where your data came from and how it was collected. There is a notion that poor-quality data is the result of broken business processes. So when you start your investigation, are you considering the full scope of the business process, its architectural components and the associated information lifecycle, or just the focal point where poor-quality data manifests itself? Hadoop has provided a platform on which data can be collected and examined more rapidly, but the challenge of cleansing data remains: Are you collecting good data, and are you collecting it wisely? Are you trying to resolve problems early in the ecosystem rather than waiting and pushing them further downstream? Are you following data governance and doing the due diligence?
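As a rough illustration of the questions above, here is a minimal Python sketch that profiles a numeric series for sample size, dependence between consecutive observations, and a shift in the mean over time. The function names and the `min_sample_size` default of 30 are assumptions made for this example, and the checks are crude heuristics rather than formal statistical tests.

```python
import statistics


def lag1_autocorrelation(values):
    """Lag-1 autocorrelation of the series.

    Values near 0 are consistent with independent consecutive observations;
    values near +1 or -1 suggest dependence.
    """
    mean = statistics.fmean(values)
    numerator = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator if denominator else 0.0


def mean_shift(values):
    """Crude stationarity heuristic.

    Returns the gap between the means of the first and second halves,
    expressed in units of the overall standard deviation.
    """
    half = len(values) // 2
    spread = statistics.pstdev(values)
    if spread == 0:
        return 0.0
    first = statistics.fmean(values[:half])
    second = statistics.fmean(values[half:])
    return abs(first - second) / spread


def profile(values, min_sample_size=30):
    """Collect the checks into one report (the threshold is an assumption)."""
    return {
        "enough_samples": len(values) >= min_sample_size,
        "lag1_autocorrelation": lag1_autocorrelation(values),
        "mean_shift": mean_shift(values),
    }


# A steadily trending series: clearly dependent and clearly non-stationary.
trend = [float(i) for i in range(100)]
report = profile(trend)
print(report)
```

A high autocorrelation or a large mean shift does not say what is wrong with the data; it only flags that the independence or stationarity assumption of a downstream tool deserves a closer look.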

The old adage "garbage in, garbage out" holds true. Data is not really ugly; it is scattered, and it needs cleansing and improvement. Depending on the size of the data, this can be a major challenge. Collecting data first and trying to make sense of it later should not be the approach. Instead, understand your data needs, remove redundancy, require referential integrity, ensure synchronization and timing, and collect your data more responsibly. When sorting through unstructured data, build data rules to catch possible false positives, then use what you catch to further understand your data and tighten the rules. It is important to know your business and know your data. Ideally, you solve the problem as early as possible, at the source, but the reality is often different. Many analytical tools can help, though not with 100% accuracy: they will improve the data, but will not make it perfect.
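Two of the rules mentioned above, removing redundancy and requiring referential integrity, can be sketched in a few lines of Python. The record layout, the field names (`customer_id`, `order_id`) and the helper names are hypothetical, chosen only for illustration.

```python
def deduplicate(rows, key):
    """Keep the first row seen for each key value, dropping redundant records."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique


def orphaned_rows(children, parents, foreign_key, primary_key):
    """Return child rows whose foreign key matches no parent row.

    Each returned row is a referential integrity violation.
    """
    parent_ids = {parent[primary_key] for parent in parents}
    return [child for child in children if child[foreign_key] not in parent_ids]


# Hypothetical sample data.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 1, "name": "Acme"},   # redundant record
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},   # refers to a customer that does not exist
]

clean_customers = deduplicate(customers, "customer_id")
violations = orphaned_rows(orders, clean_customers, "customer_id", "customer_id")
print(clean_customers)
print(violations)
```

Running such rules at the point of collection, instead of after the data has landed, is one way to act on the advice to solve the problem as close to the source as possible.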

Data governance at the point of impact: Messy internal data is due to a lack of proper data governance; everybody knows that, yet not a single organization has solved it since the beginning of time. Good governance is as essential as storing raw data and overall data management. It is very important to know who and what is accessing the data, from where, and what changes they are making to it. Knowing the before-and-after picture of each change is also part of the governance plan. Transformation and sorting are vital in the data world: they put things in perspective so the business can read between the lines, with the accuracy and clarity of information needed to make an effective decision. Produce the data set as a clean (and often disposable) view of the data, because working to keep all data everywhere clean is impossible given the rate of change and the diversity of relevant data sources and streams. It's not only important to be able to categorize your data but also to be aware of the dangers of dark data with regard to legal and regulatory risk, intelligence risk and reputation risk. Dark data, being unstructured, uncategorized and potentially unmanaged and untapped, could contain confidential or sensitive information and as such may pose a significant risk to the organization. Left unstructured and unmanaged, dark data could be more of a liability to the business than a potential asset. Unmanaged data is a very bad thing to collect. For auditability and traceability, it is important to store historical snapshots of the facts, ensuring "re-constitution of the actual source system for a given point in time."
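The ideas of recording who changed what, keeping the before-and-after picture, and reconstituting the data for a given point in time can be sketched as a small change log. This is a minimal in-memory illustration, not a governance tool: the class names `Change` and `AuditedStore`, the `as_of` method and the sample values are all assumptions, and the sketch assumes changes are recorded in chronological order.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Change:
    at: datetime      # when the change happened
    who: str          # who or what made the change
    key: str          # which field was changed
    before: object    # value before the change (None if the key was new)
    after: object     # value after the change


@dataclass
class AuditedStore:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def set(self, key, value, who, at):
        """Apply a change and record its before-and-after picture."""
        self.log.append(Change(at, who, key, self.data.get(key), value))
        self.data[key] = value

    def as_of(self, moment):
        """Replay the log to rebuild the state as it was at `moment`."""
        state = {}
        for change in self.log:
            if change.at <= moment:
                state[change.key] = change.after
        return state


store = AuditedStore()
store.set("email", "a@example.com", who="etl_job", at=datetime(2015, 10, 1))
store.set("email", "b@example.com", who="analyst", at=datetime(2015, 10, 20))

snapshot = store.as_of(datetime(2015, 10, 10))
print(snapshot)
```

Because every change is appended rather than overwritten, the current view stays disposable: it can always be rebuilt from the log.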

An enterprise data hub is a powerful new platform. In the future, most enterprise data will land first in an enterprise data hub, and increasingly it will stay there. In the near term, an enterprise data hub delivers unprecedented flexibility to comprehensively and economically analyze and process data in new ways. Organizations that deploy an enterprise data hub alongside their existing infrastructure will continue to manage data:
-raw historical data storage (call it a warehouse, corporate memory, or enterprise data hub)
-turning data into information (the business side of making sense of the raw data, adding value and augmenting business systems). This is where organizations that understand the true nature of their data will begin to see huge value gains, provided they can ask the right questions.

Many organizations invest heavily in processing data in the hope that people will simply start creating value from it. Usually, this results in large operational and capital expenditures to create a vault of data that rarely gets used. Easy-to-use systems and automation are revolutionizing the way Big Data is being used. Software implemented to analyze data should make it easier to interpret results and help improve business processes, which can help eliminate the struggle of dealing with too much data. Most importantly, build a strong data governance discipline to overcome the data management challenge.
