Data Refinery ~ Future of CIO

Tuesday, February 10, 2015

Data Refinery

11:44 PM Pearl Zhu

Operationalizing insights to make smarter decisions can have a huge business impact, and the data refinery is the key to that in a big data world.

The data refinery is the key step in transforming raw data into analytic insight that helps the business make smart decisions. What does it take to do this effectively?

Dedicated, contextual data refineries can add real value to raw data and can be consumed as a service. Companies should be able to leverage all the data in their reservoir for analytic insights, and the data refinery should act as a "treatment plant" for that reservoir. A subsection of the data reservoir is cleansed, transformed, and standardized to create a master analytic library (one version of the truth) for analytic development and, more importantly, for deploying analytic insights into production. The refinery gives companies confidence that data definitions are consistent across development and execution environments, which shortens time to market for predictive analytics.
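As a rough illustration of the "cleansed, transformed, and standardized" step, here is a minimal Python sketch. The field names (`customer_id`, `name`, `signup`), the two accepted date formats, and the `refine` function itself are hypothetical, chosen only to show the idea of producing one consistent version of each record; a real refinery would follow whatever definitions the business has agreed on.

```python
from datetime import datetime

# Hypothetical date formats that the raw sources are assumed to use.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def refine(raw_records):
    """Cleanse, transform, and standardize raw records.

    Returns one record per customer_id (the last one seen wins).
    """
    refined = {}
    for rec in raw_records:
        # Cleanse: drop records that have no usable key.
        cid = str(rec.get("customer_id", "")).strip()
        if not cid:
            continue

        # Transform: collapse stray whitespace and normalize the name's case.
        name = " ".join(str(rec.get("name", "")).split()).title()

        # Standardize: store every date as ISO 8601, or None if unparseable.
        signup = None
        raw_date = str(rec.get("signup", "")).strip()
        for fmt in DATE_FORMATS:
            try:
                signup = datetime.strptime(raw_date, fmt).date().isoformat()
                break
            except ValueError:
                continue

        refined[cid] = {"customer_id": cid, "name": name, "signup": signup}
    return list(refined.values())


raw = [
    {"customer_id": " 42 ", "name": "  ada   lovelace", "signup": "02/10/2015"},
    {"customer_id": "", "name": "no key"},
    {"customer_id": "42", "name": "Ada Lovelace", "signup": "2015-02-10"},
]
library = refine(raw)
```

Because development and production would both read from the same refined output, the two environments see the same definitions, which is the point the paragraph above makes.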

Data refineries should be understood as business processes, not as technologies. Everybody talks about the tools, but few ask the more basic question: what kind of data cleaning, formatting, and calculation do you want to perform? The issue of data quality is much more complex than algorithms for cleaning up data. You have to consider how error and uncertainty will propagate through your analysis models, and whether the output is fit for making decisions in the specific context of your application. It is easy to postulate raw data storage and conceptual ETL as the way to derive value, but that is not sufficient. The business process that creates the value still needs to be funded, staffed, and nurtured. The technology involved is trivial in comparison to the business process that needs to be institutionalized.

Data quality metrics are a form of metadata. They provide supporting information that helps you interpret or assess the raw data. In practice, you measure your data in some fashion and record the metrics. If you manage the quality of your data from these measurements, then the measurements need to be taken again once corrections have been made. The process of arriving at each measurement needs to be documented carefully, so that there is no misinterpretation of what the value means or how it is meant to be used. Statistics can easily be misused or abused. Data quality measurements can also highlight problems in other supporting metadata.
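To make the measure, correct, and re-measure cycle concrete, here is a small Python sketch. It uses completeness (the share of records with a non-empty value) as the quality metric; that choice, the `measure_quality` function, and the `email` field are assumptions for illustration, not a prescribed method. The metric is stored together with a written definition, so the number cannot easily be misread later.

```python
def measure_quality(records, fields):
    """Record completeness per field as metadata about the data set."""
    total = len(records)
    completeness = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        completeness[field] = filled / total if total else 0.0
    return {
        # Document how the number was produced, alongside the number itself.
        "definition": "completeness = non-empty values / total records",
        "row_count": total,
        "completeness": completeness,
    }


records = [{"email": "a@example.com"}, {"email": ""}]
before = measure_quality(records, ["email"])

# Apply a correction (here, a hypothetical value supplied by the data owner).
records[1]["email"] = "b@example.com"

# The earlier metric is now stale, so measure again after the correction.
after = measure_quality(records, ["email"])
```

Keeping both `before` and `after` gives a record of what the correction actually changed.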

Tool selection and cost effectiveness. When looking at the cost of improving data quality, people often ignore the cost of development. Yet a proper evaluation of the cost of each data cleaning or refinery platform leads to better choices. To evaluate cost, compare the price of the software against the time involved. Ask yourself how much that "free" software is really going to cost once your own time, valued at your own standard rate, is counted against the hours needed to get the less expensive tool to work.
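The comparison can be reduced to simple arithmetic: total cost is the price of the software plus the hours spent getting it to work, valued at your own rate. The Python sketch below uses entirely hypothetical figures (license prices, hours, and hourly rate) just to show how a "free" tool can end up costing more.

```python
def total_cost(license_price, setup_hours, hourly_rate):
    """Software price plus the cost of the time needed to make it work."""
    return license_price + setup_hours * hourly_rate


HOURLY_RATE = 75  # hypothetical value of one hour of your time

# Hypothetical "free" tool that needs a lot of development effort.
free_tool = total_cost(license_price=0, setup_hours=120, hourly_rate=HOURLY_RATE)

# Hypothetical commercial tool that works with far less effort.
paid_tool = total_cost(license_price=5000, setup_hours=20, hourly_rate=HOURLY_RATE)

cheaper = "paid" if paid_tool < free_tool else "free"
```

With these made-up numbers the free tool comes to 9000 and the paid tool to 6500; with different hours or rates the result can flip, which is why the time has to be estimated honestly.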

"Data are of high quality if, they are fit for their intended uses in operations, decision making and planning,” (Wikipedia),which includes all perspectives at a high level. With this, information entry and of no entry required and of logical data to involve its operations, decision making and planning. Information alone has limited business value. Operationalizing insights to make smarter decisions can have huge business impact, and data refinery is key to that in a big data world.