Correlation vs. Causation ~ Future of CIO

Saturday, January 17, 2015

Correlation vs. Causation

11:48 PM Pearl Zhu

A correlation between two variables does not necessarily imply that one causes the other.

"Correlation does not imply causation" is a phrase used in science and statistics to emphasize that a correlation between two variables does not necessarily imply that one causes the other. Many statistical tests calculate the correlation between variables. A few go further and calculate the likelihood of a true causal relationship (Wikipedia). So how do correlation and causation actually connect?
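One common way a correlation arises without causation is a shared hidden cause. The following is a minimal sketch, not a definitive method: the variables `z`, `x`, and `y`, the noise levels, and the helper names `pearson` and `residuals` are all assumptions made up for illustration. Neither `x` nor `y` influences the other, yet they correlate strongly because both depend on `z`; once `z` is accounted for, the correlation should largely disappear.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


def residuals(ys, zs):
    """What is left of ys after removing a least-squares line fit on zs."""
    n = len(ys)
    mz = sum(zs) / n
    my = sum(ys) / n
    slope = sum((c - mz) * (b - my) for c, b in zip(zs, ys)) / sum(
        (c - mz) ** 2 for c in zs
    )
    return [(b - my) - slope * (c - mz) for b, c in zip(ys, zs)]


# Hypothetical data: z is a hidden common cause; x and y never touch each other.
rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [c + rng.gauss(0, 0.5) for c in z]
y = [c + rng.gauss(0, 0.5) for c in z]

r_xy = pearson(x, y)  # strong, although x does not cause y
r_partial = pearson(residuals(x, z), residuals(y, z))  # with z held fixed

print(f"corr(x, y)            = {r_xy:.3f}")
print(f"corr(x, y | z removed) = {r_partial:.3f}")
```

In a real study the common cause is usually not measured, which is exactly why the raw correlation can mislead.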

The first step is to get the causal predictor into your data set. Start thinking about it from day one, before you even begin collecting data. If you miss it, you might still find a proxy metric that tracks the cause without being the real cause; in many cases, this is enough. A very important step in the scientific method is often treated with inadequate attention, perhaps purposely: the step that comes right before testing the hypothesis.

"Gather information." Only after extensive observation can you gather adequate data. And only after studying each independent process, by asking questions about each independent variable and about the samples you collect, can you adequately apply statistical measures to study relationships. Such questions include: "Does the sample consist of countable elements?", "What is the distribution of the sample?", "Does the sample accurately represent your population?", and perhaps domain-specific questions as well.

Correlation and causation are related, yet quite different in application. Causation, generally speaking, is the reason behind an effect. That doesn't mean that there has to be an exact correlation. For appropriate correlation analysis, a statistician has to be perfectly clear on how he/she is managing data, the methodology used, and the purpose and goal of the research. Managing data is one of the most important aspects of proper statistical research! If that is missing, then the whole research might come out wrong. Correlation does not imply causation, and that is why a data science professional should always work together with a domain expert to come up with any meaningful solution. This is also the reason why it is really tough to learn the topology of a Bayes net from data: correlations alone often cannot tell you which way the arrows point.

Linear transformations can be quite powerful, but they yield only another representation of the original data (same sample space, with a different ordered basis, and in this case, plotted into the same coordinate system). The real problem is trying to retrofit data to a required outcome. Scientifically, you should always start with your hypothesis before measuring; then you won't be tempted to be 'selective' with your data sets or your data processing. Path analysis in sociology and structural equation modeling in econometrics are two methods that were long ago thought to be able to find and quantify relationships between variables. But these methods presuppose that the underlying theories supply the right variables and the right structure for the equations. This is not necessarily so.

The problem of the domain expert is double-faced. Problems usually have many causes and can be very complex. If your domain expert has told you the causal predictor, wouldn't analytics then be able to run through all the various methods automatically, tell you which one gives the "best model," and thus select the right causal predictors? But what if other domain experts gave you completely different causal predictors? And what if another independent, representative data sample returned a completely different "best model," even with the same "causal" predictors provided to it? So the problem is that domain experts can be just as wrong about causes, explanations, and predictions, as cognitive science studies that blinded domain experts from one another, of the kind described in "Thinking, Fast and Slow," have famously shown. On one hand, domain experts (ideally ones who also understand research methodology) help to validate and support the plausibility of the findings; on the other hand, they can defend some preconceived idea and become opponents of innovation. Striking a balance between these two poles is very important.
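The worry about a different sample returning a different "best model" can be illustrated with a small simulation. This is a sketch under stated assumptions, not anyone's actual analytics pipeline: the names `proxy_a`, `proxy_b`, and `best_predictor` are hypothetical, and "best model" is reduced here to "the single predictor with the highest absolute correlation with the outcome." Both candidate predictors are equally noisy proxies of the same hidden cause, so neither deserves to win, yet each sample crowns one of them.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


def best_predictor(rng, n=40):
    """Draw one sample of size n and return the name of the winning predictor."""
    cause = [rng.gauss(0, 1) for _ in range(n)]  # hidden true cause
    proxy_a = [c + rng.gauss(0, 1) for c in cause]  # noisy proxy of the cause
    proxy_b = [c + rng.gauss(0, 1) for c in cause]  # equally noisy proxy
    outcome = [c + rng.gauss(0, 1) for c in cause]
    r_a = abs(pearson(proxy_a, outcome))
    r_b = abs(pearson(proxy_b, outcome))
    return "proxy_a" if r_a >= r_b else "proxy_b"


# Repeat the "study" on 200 independent samples from the same population.
rng = random.Random(1)
counts = {"proxy_a": 0, "proxy_b": 0}
for _ in range(200):
    counts[best_predictor(rng)] += 1

print(counts)  # how often each proxy was selected as "best"
```

Because the setup is symmetric, the winner in any one sample reflects sampling noise rather than causal structure, which is one reason a single "best model" deserves cautious reading.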

Correlation is a valuable type of scientific evidence in fields such as medicine, psychology, and sociology. But first, correlations must be confirmed as real, and then every possible causative relationship must be systematically explored, because it is easy and even tempting to come to premature conclusions based upon the preliminary appearance of a correlation.
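One way to check whether a correlation is "real" in the sense of being unlikely to arise by chance is a permutation test. The sketch below is only one possible approach, swapped in for illustration; the function name `permutation_p_value`, the number of shuffles, and the toy data are assumptions. It shuffles one variable many times to see how often a correlation at least as strong as the observed one appears when any real pairing has been destroyed. Note that a small p-value only supports the correlation being real; it says nothing about which variable, if either, is the cause.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between xs and ys."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: one related pair and one unrelated pair.
rng = random.Random(42)
xs = [float(i) for i in range(30)]
related = [2.0 * v + rng.gauss(0, 5) for v in xs]
unrelated = [rng.gauss(0, 5) for _ in xs]

p_related = permutation_p_value(xs, related)
p_unrelated = permutation_p_value(xs, unrelated)
print(f"p-value, related pair:   {p_related:.4f}")
print(f"p-value, unrelated pair: {p_unrelated:.4f}")
```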