Wednesday, April 9, 2014

Outliers Have a Story: Should You Listen?

Outliers can be interesting starting points for asking questions and discovering stories!

In statistics, an outlier is an observation that lies far from the other observations. When data scientists and analysts collect data, some values sometimes fall "far away" from the main group of data. Are outliers real data or errors? If outliers have a story, should you listen to it?
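One common way to make "far away" concrete is Tukey's fences, which flag values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below is only an illustration under assumptions: the function name `tukey_outliers`, the multiplier `k=1.5`, and the sample numbers are all made up for the example, and other definitions of "distant" are equally reasonable.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return the values that fall outside Tukey's fences.

    k=1.5 is the conventional multiplier; it is a convention, not a law.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1                       # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements: eight similar values and one distant one.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged = tukey_outliers(sample)
print(flagged)
```

A rule like this only tells you which points are distant. Whether a flagged point is an error or real data is the question the rest of this post is about.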

Outliers have a story. Once you have reviewed and eliminated data entry, measurement, and collection errors, the remaining outliers are often direct pointers to deficiencies in your current understanding of the underlying "true" data-generating model: there are causal factors that you are not including, or that are invisible to you. It is the data scientist's and analyst's job to make sense of what the outliers are, which story they might tell, and what insight you can capture from them.

Outliers can represent real data. Keep in mind that excluding them tends to create different outliers, because once the most extreme points are gone, the spread of what remains shrinks and new points stand out. In trading, there is the idea of moving ahead of the crowd rather than following it; the action really begins and ends at the outlying points. For businesses, markets expand and collapse at the outliers, where risks and opportunities are greatest. The middle is a temporary, protected place.
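The point about exclusion creating different outliers can be seen in a small sketch. It assumes a simple z-score rule with a threshold of 2 sample standard deviations. The function name `flag_by_zscore`, the threshold, and the numbers are hypothetical choices for illustration, not a recommended procedure.

```python
from statistics import mean, stdev

def flag_by_zscore(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold * sample std."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold * s]

# Hypothetical data: a tight cluster, one moderate value (20), one extreme value (100).
data = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 20, 100]

first_pass = flag_by_zscore(data)                # the extreme value dominates the spread
remaining = [v for v in data if v not in first_pass]

second_pass = flag_by_zscore(remaining)          # with 100 gone, 20 now stands out
remaining = [v for v in remaining if v not in second_pass]

third_pass = flag_by_zscore(remaining)           # nothing left to flag

print(first_pass, second_pass, third_pass)
```

On the first pass the extreme value inflates the standard deviation so much that 20 looks ordinary. Only after 100 is excluded does 20 get flagged, which is why blind, repeated exclusion can keep peeling away data that may well be real.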

Outliers have a story, but you are not obligated to listen to every story you are told. That is why you need knowledge-based big data and analytical models, together with operational simulation and analysis, to find the what, why, and how, and the causes and consequences, in big data. Applying a knowledge-based model to data mining helps you kick out the bad data and keep the real signals, so that you can tell the difference between errors and real data.
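As a very small sketch of what "knowledge-based" can mean in practice, domain knowledge can supply two ranges: what is physically plausible and what is typical. Values outside the plausible range are treated as likely errors, while values that are plausible but atypical are kept as outliers worth investigating. The function name `triage`, the body-temperature scenario, and every bound below are hypothetical assumptions chosen for illustration only.

```python
def triage(readings, plausible, typical):
    """Split readings into likely errors, outliers to investigate, and ordinary values.

    plausible and typical are (low, high) tuples supplied from domain knowledge.
    """
    errors, investigate, ordinary = [], [], []
    for r in readings:
        if not plausible[0] <= r <= plausible[1]:
            errors.append(r)        # cannot be real: probably entry or sensor error
        elif not typical[0] <= r <= typical[1]:
            investigate.append(r)   # possible, but unusual: there may be a story here
        else:
            ordinary.append(r)
    return errors, investigate, ordinary

# Hypothetical body temperatures in degrees Celsius; 366.0 looks like a misplaced decimal point.
readings = [36.6, 36.9, 37.1, 40.2, 366.0, 36.4]
errors, investigate, ordinary = triage(readings, plausible=(25.0, 45.0), typical=(35.5, 38.0))
print(errors, investigate, ordinary)
```

The statistics alone cannot separate 366.0 from 40.2; both are far from the group. The knowledge of what the measurement means is what does the separating.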

Outliers are the most interesting points, and the relationship (curve) between X and Y is quite interesting too, as it can sometimes lead to a physical, and thus truly causal, interpretation. Begin with the system and ask whether the metrics are proper for what you are trying to get at. Otherwise, you can only hope that an anomaly will hit you over the head with a two-by-four to get your attention. And don't lie with your statistics.

Understanding outliers is important for forecasting. How should you deal with outliers when your focus is on forecasting or classification? If your focus is on forecasting, you need to understand the outliers and perhaps refine your model. If you are going to be "data-driven", you also have to pay attention to data quality control and to the reliability of the methods you propose to use. As for data-driven as a "fact", look no further than some of the open competitions in modeling.

Outliers are interesting points for making inquiries and discovering stories. On one hand, protect your analysis from the errors that outliers may introduce; on the other hand, be open to making sense of them if there is a true story behind them. Be curious and stay foolish.

