Sunday, August 16, 2015

The Tale of Data Cleansing

Cleansing the data is often the most difficult and time-consuming part of data science.

Every forward-looking organization has an aggressive data analytics agenda. However, data quality problems often compromise the success of these projects. Cleansing data remains a tough task, even though more efficient tools are now available to help. Are you collecting good data, and are you collecting it wisely? Are you trying to resolve problems as early as possible in the ecosystem, rather than pushing them further downstream? Are you following data governance and doing the due diligence? What are your tales of data cleansing?


The best data-driven organizations focus relentlessly on keeping their data clean. Data cleansing, transformation and sorting are vital in the data world because they put things in perspective, so the business can read between the lines and get the accurate, clear information it needs to make effective decisions. Data is scattered, and it needs cleansing and improvement. Depending on the size of the data, this can be a major challenge. The '5 Whys' technique still applies to big data as a way to dig down to the root cause of data quality issues. Data cleansing has always been a challenge, so it's important to know your business and know your data. Ideally, you solve the problem as early as possible, at the source, but the reality is often different.


Easy-to-use systems and automation are revolutionizing the way big data is used. Data cleansing has been an issue for IT for as long as data has been collected. Without Hadoop and other tools, some data would be lost forever. Even with Hadoop or any other new technology, cleansing is still a challenge, although Hadoop does provide a platform where data can be collected and examined more rapidly. Software that analyzes data should make it easier to interpret results and improve processes, which can ease the struggle of dealing with too much data. However, manual data cleanup is, unfortunately, still a necessary evil for some.


The industry is changing, and most tools now seem to be online SaaS offerings. The old adage "garbage in, garbage out" still holds true. Collecting data first and trying to make sense of it later should not be the approach. Understand your data needs, remove redundancy, require referential integrity, ensure synchronization and timing, and collect your data more responsibly. When sorting through unstructured data, build your data rules to catch possible false positives, then use what they catch to understand your data better and tighten the rules. An enterprise data hub is a powerful new platform. In the future, most enterprise data will land first in an enterprise data hub, and increasingly it will stay there. In the near term, an enterprise data hub delivers the flexibility to analyze and process data comprehensively and economically in new ways. Organizations that deploy an enterprise data hub alongside their existing infrastructure will continue to lead in the world of modern data.
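
As a rough illustration of what such data rules might look like, here is a minimal Python sketch. The `Order` record, the `apply_rules` function and the three rules are all hypothetical, made up for this example. It rejects redundant rows, rows that break referential integrity, and rows with an implausible value, and keeps the reason for each rejection so the rules can be reviewed and tightened later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    # Hypothetical record layout, for illustration only.
    order_id: int
    customer_id: int
    amount: float


def apply_rules(orders, known_customer_ids):
    """Split records into clean rows and rejected rows with a reason."""
    clean, rejected, seen = [], [], set()
    for order in orders:
        if order.order_id in seen:
            # Redundancy rule: an order_id may be accepted only once.
            rejected.append((order, "duplicate order_id"))
        elif order.customer_id not in known_customer_ids:
            # Referential integrity rule: the customer must exist.
            rejected.append((order, "unknown customer_id"))
        elif order.amount < 0:
            # Plausibility rule: amounts cannot be negative.
            rejected.append((order, "negative amount"))
        else:
            seen.add(order.order_id)
            clean.append(order)
    return clean, rejected


orders = [
    Order(1, 10, 25.0),
    Order(1, 10, 25.0),   # redundant copy
    Order(2, 99, 40.0),   # customer 99 does not exist
    Order(3, 11, -5.0),   # fails the amount rule
    Order(4, 11, 12.5),
]
clean, rejected = apply_rules(orders, known_customer_ids={10, 11})

for order, reason in rejected:
    print(f"rejected order {order.order_id}: {reason}")
```

Reviewing the rejected rows is where possible false positives show up: if a rule keeps rejecting records that turn out to be valid, that is the signal to adjust it.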


When you clean up data, you also change it, so it's important to have an audit trail that shows what was done. There is a notion that poor-quality data is a result of broken business processes. So when you start your investigation, are you considering the full scope of the business process, its architectural components and the associated information lifecycle, or just the focal point where poor-quality data manifests itself? Some tools use an artificial intelligence platform to identify, analyze, categorize and classify the sensitive and useful information contained in an organization's dark data, and to enable in-place management of that content. Traceability has to do not only with how you save historical versions of the data that arrives in your system, but also with how the system sources that data. If the system sources internal data from "views" or interpretive extracts, you have already lost traceability. For auditability and traceability, it is important to store historical snapshots of the facts, for example in a Hadoop-based system, ensuring "re-constitution of the actual source system for a given point in time."
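
To make the idea of historical snapshots more concrete, here is a small in-memory Python sketch. It is not tied to Hadoop or to any particular product; the `SnapshotStore` class and its `load` and `as_of` methods are hypothetical names chosen for this example. Raw records are only ever appended, never overwritten, so the state of the source can be rebuilt for any point in time.

```python
from bisect import bisect_right
from datetime import datetime


class SnapshotStore:
    """Append-only history of raw records, keyed by record id."""

    def __init__(self):
        # record_id -> list of (loaded_at, raw_record), kept in load order
        self._history = {}

    def load(self, record_id, raw_record, loaded_at):
        """Append a new version of a record; older versions are kept."""
        versions = self._history.setdefault(record_id, [])
        if versions and loaded_at < versions[-1][0]:
            raise ValueError("snapshots must be loaded in time order")
        versions.append((loaded_at, dict(raw_record)))

    def as_of(self, when):
        """Rebuild the source as it looked at the moment `when`."""
        state = {}
        for record_id, versions in self._history.items():
            times = [loaded_at for loaded_at, _ in versions]
            idx = bisect_right(times, when)  # versions loaded at or before `when`
            if idx:
                state[record_id] = dict(versions[idx - 1][1])
        return state


store = SnapshotStore()
store.load("c1", {"city": "NYC"}, datetime(2015, 1, 1))
store.load("c2", {"city": "Boston"}, datetime(2015, 3, 1))
store.load("c1", {"city": "New York"}, datetime(2015, 6, 1))

february = store.as_of(datetime(2015, 2, 1))
august = store.as_of(datetime(2015, 8, 16))
print(february)
print(august)
```

The point of the design is that cleansing works on copies derived from the history, while the history itself stays exactly as it arrived from the source.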


Messy internal data is often due to a lack of proper data governance. Governance is crucial, but where do you start? Good governance is as essential as storing raw data. It is very important to know who is accessing the data, from where, and what changes they are making to it. Knowing the before and after pictures of those changes is also part of the governance plan. It’s important not only to be able to categorize your data, but also to be aware of the dangers it carries with regard to legal and regulatory risk, intelligence risk, and reputation risk. Everybody knows that, and yet not a single organization has solved it since the beginning of time. Trying to collect the data exactly as it 'should' be used and organized, now and in the future, is difficult. Data governance at the point of impact produces a clean (and often disposable) view of the data. Working to keep all data everywhere clean is impossible given the rate of change and the diversity of relevant data sources and streams. There are many analytical tools that can help, though maybe not with 100% accuracy. They will improve matters, but they will not fix everything.
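
One simple way to keep the before and after pictures of a change, along with who made it, is to have the cleanup step write an audit entry for every value it alters. The Python sketch below is only an illustration under assumed names: `clean_with_audit`, the rule names, the `cleanup_job` identity and the sample record are all made up for this post.

```python
from datetime import datetime, timezone


def clean_with_audit(record, rules, changed_by, audit_log):
    """Apply cleanup rules to a copy of `record`, logging every change."""
    cleaned = dict(record)  # the original record is left untouched
    for field, rule_name, fix in rules:
        before = cleaned.get(field)
        after = fix(before)
        if after != before:
            audit_log.append({
                "field": field,
                "rule": rule_name,
                "before": before,
                "after": after,
                "changed_by": changed_by,
                "changed_at": datetime.now(timezone.utc).isoformat(),
            })
            cleaned[field] = after
    return cleaned


# Hypothetical rules: (field, rule name, function that returns the fixed value)
rules = [
    ("email", "lowercase_email", lambda v: v.lower() if isinstance(v, str) else v),
    ("name", "trim_whitespace", lambda v: v.strip() if isinstance(v, str) else v),
]

audit_log = []
raw = {"name": "  Jane Doe ", "email": "Jane@Example.COM", "country": "UK"}
cleaned = clean_with_audit(raw, rules, changed_by="cleanup_job", audit_log=audit_log)

for entry in audit_log:
    print(entry["rule"], repr(entry["before"]), "->", repr(entry["after"]))
```

In a real system the log would go to durable storage rather than a Python list, but the shape of each entry (field, rule, before, after, who, when) is the part that matters for governance.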

Data cleaning and data management have a deep business purpose: turning data into information. This is the business side of making sense of the raw data, adding value and augmenting business systems. Organizations that understand this will begin to see huge value gains. In short, data quality doesn't mean pursuing perfect data; it means transforming good-enough data into useful information, business insight, and human wisdom.
