Friday, September 20, 2013

Does Big Data Lack Big Trust?

Though Big Data is hot, the overall success rate of analytics projects is low; as a result, some business leaders are holding back on investing in data analytics due to a lack of trust. Incidents of inaccurate analysis, caused by bad data or even by an incorrect model, have cost companies millions of dollars. So what are effective ways to initiate and manage Big Data projects? Is there a framework that governs trust in data analytics?

1. Understand the Objectives of Analysis and Tech Trends

It is critical to understand the objective of the analysis, with an in-depth understanding of business fundamentals: bottom-line performance and profitability, economies of scale, innovating for the future, etc. Start with what you can actually do and stay away from trying to solve all the world's problems in one go-round. Start with an immediate business need and work your way from there. The immediate need should offer a large ROI and should NOT revolve around a lot of data cleanup.

Capture technology trends: When putting together a framework, begin with fairly recent, forward-looking technological trends in data storage and computing that can generate realizable cost savings and efficiencies today. This gives C-level executives the big picture and a vision of where their world is headed. But collect and communicate the project requirements well; otherwise you will have SCOPE CREEP on your hands and the project will never get off the ground.

A framework of standards may accomplish two things:

1)     A deep understanding of the various relationships and patterns in the data. This will help identify what transformations are needed and how missing values should be treated.

2)     Identify data integrity issues. Note that when these are found, they need to be discussed with the IT folks, as there may be solid reasons for the data being as it is. If these issues are immediately flagged to the executives, the databases will quickly lose credibility and nobody wins. 
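As a rough illustration of the two points above, here is a minimal Python sketch that counts missing values per field and flags one kind of integrity problem. The records, the field names, and the rule "a last order cannot predate signup" are all hypothetical, chosen only to make the idea concrete; a real project would take its rules from the business and from IT.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"customer_id": 1, "age": 34, "signup_year": 2010, "last_order_year": 2012},
    {"customer_id": 2, "age": None, "signup_year": 2011, "last_order_year": 2013},
    {"customer_id": 3, "age": 29, "signup_year": 2012, "last_order_year": 2011},
    {"customer_id": 4, "age": 41, "signup_year": None, "last_order_year": 2013},
]


def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def integrity_issues(rows):
    """Return ids of rows whose last order predates signup (an assumed rule)."""
    return [
        row["customer_id"]
        for row in rows
        if row["signup_year"] is not None
        and row["last_order_year"] is not None
        and row["last_order_year"] < row["signup_year"]
    ]


print(missing_counts(records))    # fields that need a missing-value treatment
print(integrity_issues(records))  # rows to raise with IT before anyone else
```

The list of flagged ids is the kind of thing to take to the IT folks first, since there may be a solid reason the data looks the way it does.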

2. Data Quality Is a Key Factor in Project Success

Any data mining/analytics project can be roughly divided into two parts: cleaning the data, and analyzing the data. For many data analytics projects, half the battle is in the preparation of the data: not just dealing with missing values, but coding and normalizing as well.

The causes of data quality issues: Data quality can depend heavily on the system that captures data into computer files. It also depends on the quality of the methods used for duplicate detection (within or across files) and for filling in missing data or 'correcting' contradictory combinations of data. There are many questions to ponder, such as: How do you determine whether 5+% of your data have errors? If your data has 5+% errors, what analyses (if any) can you reliably do on the data?
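One possible way to approach the first question is to audit a random sample of records by hand and put a confidence interval around the observed error rate. The sketch below uses the Wilson score interval, which is a choice made for this illustration and not something the post prescribes; the function names, the sample sizes, and the 5% threshold are assumptions.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error rate, given `errors` bad records
    found in a random audit sample of size `n` (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def exceeds_threshold(errors, n, threshold=0.05):
    """True only when even the lower bound of the interval is above threshold."""
    low, _ = wilson_interval(errors, n)
    return low > threshold


# Hypothetical audit: 40 bad records found in a random sample of 400.
low, high = wilson_interval(40, 400)
print(f"error rate is plausibly between {low:.3f} and {high:.3f}")
print(exceeds_threshold(40, 400))
```

A small sample with an observed rate just above 5% will usually not clear this bar, which is the point: the audit has to be large enough to support the claim.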

Data cleansing is often overlooked: Although the data cleaning part is by far the most time-consuming, it is often left out of the planning stage. To obtain buy-in from senior management, it is important that they are educated up front about the data preparation phase. Sufficient time and resources must be budgeted to allow the data to be properly prepared in advance of any analysis. Without that, management will have unrealistic expectations about the timing and ROI of results. Disappointment will be inevitable, and future data mining projects will be jeopardized.

Set project priority: While you are doing the project with better ROI and less data cleanup, start identifying other projects that do require data cleanup, and do that cleanup while the first or second data project is under way. The key point is resources. Hard-core IT types do not like to do data cleanup, so always make sure the business users are on board and willing to do their own. The success of data analytics takes a collective, collaborative, cross-functional effort.

The criticality of data quality also depends on the business case: If there are systematic data issues that can be explained, the analysis might still be valid. For example, if you are interested in rank ordering predicted product performance (using the predicted values from some model), then the actual predicted values are not critical as long as the bias is spread evenly across all products. However, if you are developing a model to determine which customers to target for a specific promotion, knowing that you make money if the response rate is above X% and lose money otherwise, then the data issues could be terminal.
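The contrast can be made concrete with a small Python sketch. All numbers here are invented for illustration: three hypothetical segments, a uniform bias of 1.5 percentage points added to every prediction, and a break-even response rate of 3% standing in for X%.

```python
# Hypothetical "true" response rates and a uniform bias in the predictions.
true_response = {"A": 0.020, "B": 0.035, "C": 0.050}
bias = 0.015
predicted = {name: rate + bias for name, rate in true_response.items()}

break_even = 0.03  # assumed stand-in for the X% in the text


def rank_order(scores):
    """Names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def targets(scores, threshold):
    """Names whose score is above the break-even threshold."""
    return sorted(name for name, rate in scores.items() if rate > threshold)


# A uniform bias leaves the ranking untouched...
print(rank_order(true_response) == rank_order(predicted))

# ...but it changes who clears the break-even line.
print(targets(true_response, break_even))
print(targets(predicted, break_even))
```

In this made-up case the ranking survives the bias, while the threshold decision pulls in a segment that would actually lose money.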

Indeed, there are quite a few roadblocks in managing and analyzing data. Still, many forward-looking organizations are making continuous progress, accumulating enough experience to transform their businesses into data-driven, intelligent powerhouses.

