Thursday, July 30, 2015

Data Scientist vs. Data Algorithm

Analytical algorithms and data scientists are not mutually exclusive; they are absolutely complementary.
An algorithm is a procedure or formula for solving a problem; in effect, it is a model of the real world. A data scientist is a job title for an employee or business intelligence (BI) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge (whatis.com). Is a data scientist an expert in behavior, physics, chemistry, biology, and so on? Do data scientists see themselves as experts in everything? Or is a subject matter scientist best equipped to author the algorithm?

Who is a good data scientist? A good data scientist must be able to define the problem, analyze the data collection systems, integrate that analysis into strategic analytic models, run simulation analyses, and implement the resulting algorithms to create value. Very few people would think that purchasing or using top-of-the-line power tools would turn them into a master carpenter, so why would anyone believe that applying algorithms or software alone could turn them into a data scientist? Scientists in the traditional sense deal with universal phenomena, such as gravity or light, that affect all humans; data scientists deal with subject areas that affect a limited number of humans, but the protocols, procedures, and everything else are practically the same.

In a data science scenario, the data scientist should also be subject to peer review if consumers expect the results to be applied in high-risk scenarios. To avoid bias and a lack of blinding, the subject matter expert should not be the person building the model. Teams should also gather more than one subject matter expert's opinion (as many as possible) about which candidate variables to consider, along with other expert input, and use more than one modeler if the algorithms require subjective and potentially biased human choices.

How do you define a good algorithm? A good algorithm is developed by integrating knowledge-based data into analytic models, testing those models through simulation, and implementing them for problem-solving. Some business models are never independently validated, for a defensible reason: their performance is considered sufficiently proven that independent validation is not economically justified, or the model is treated as the secret sauce. Algorithms are indeed useful tools for a data scientist. However, keep in mind that underlying these algorithms are models, and models come with their own assumptions, strengths, and weaknesses.

In addition, these algorithms require data, and understanding the idiosyncrasies of those data is critical to model performance. Understanding how to synthesize new predictors in a way that increases the predictive power of the data is equally critical to improving model performance. This is where the data scientist shows true value, and where algorithms alone fall flat on their faces.
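As a hypothetical sketch of what synthesizing a predictor can buy you (the field names and toy data below are illustrative, not from any real project): suppose a target actually depends on the ratio of two raw fields. Neither raw field alone correlates strongly with the target, but the derived ratio does, and no off-the-shelf linear algorithm would discover that ratio on its own.

```python
import random

random.seed(42)

# Toy data: the target depends on the RATIO of two raw fields,
# so neither raw field alone is a strong predictor.
rows = []
for _ in range(200):
    revenue = random.uniform(10, 100)
    cost = random.uniform(5, 50)
    margin = revenue / cost            # the synthesized predictor
    target = margin + random.gauss(0, 0.1)
    rows.append((revenue, cost, margin, target))

def corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = [r[3] for r in rows]
r_revenue = corr([r[0] for r in rows], target)   # raw field: moderate at best
r_margin = corr([r[2] for r in rows], target)    # synthesized ratio: near 1
```

The derived ratio correlates almost perfectly with the target while the raw field does not; knowing which ratios, lags, or interactions are physically meaningful is exactly the subject matter knowledge the algorithm lacks.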
- These models may predict the direction of an economy (usually large systems of simultaneous econometric equations), validated and independently reproduced.
- These models may describe and predict the outcomes of interactions between compounds and biological systems; such models often need to consider molecular structures as well as biological processes.
- These models can predict and simulate human behavior through a sequence of events.
- They may be physical models estimating flow through a very dynamic situation using only differential frequency.

Analytical algorithms and data scientists are not mutually exclusive; they are absolutely complementary. The challenge is finding the right expertise. People with both great mathematical knowledge and great subject knowledge are the most valuable experts. The perfect machine operator who understands little about the business is not likely to share all the metadata and peripheral insights that the model statistics expose. At least half the value of a good modeler comes from the insights gained and shared from the material that never made it into the final model, and you need to understand the subject and purpose (how value is created) to do that well.

When insights are found, you have to be careful that they are not an artifact of results confounded by human biases and sampling or data error. The entire reason for rigorous, blinded, controlled replication and validation in science is to rule out these confounding effects. Once you are fairly certain that predictions are not the result of obvious confounds like human bias and data error, it is important to have very creative scientists and subject matter experts interpret the results. The interpretation is often not obvious, so there is often disagreement at this stage. If there is disagreement, your scientists need to design new well-controlled experiments to resolve it.

Before you move to this second stage, it is important to have gone through the first stage and verified that the original predictions were valid and could be trusted within a tight confidence interval. If you skip the replication and validation stage, you are probably wasting your time trying to gain insights from predictions that cannot be trusted. A defining feature of prediction in science is that some objective attempt is made to estimate the confidence one has in the predictions, whether formally with a confidence interval or with blinded controlled tests. Unless you are generating a reasonable, and reasonably tight, estimate of the confidence in your predictions, you are probably gambling with today's machine learning and predictive modeling. That is not science.
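One common way to put a number on that confidence is a percentile bootstrap interval over held-out prediction errors. The sketch below is a generic illustration, not a prescription from the post; the simulated `errors` list stands in for the per-case errors you would collect from a blinded holdout set.

```python
import random
import statistics

random.seed(1)

# Stand-in for per-case prediction errors measured on a blinded holdout set.
errors = [random.gauss(0, 1) for _ in range(500)]

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for a statistic of `sample`."""
    n = len(sample)
    stats = sorted(stat([random.choice(sample) for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot)]
    return lo, hi

lo, hi = bootstrap_ci(errors)
# A tight interval around zero mean error is evidence the predictions can be
# trusted; a wide interval says you are still gambling.
```

The width of the interval, not just the point estimate, is what separates a validated prediction from a guess.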

The point is that humans should all have some humility, recognize the limitations of their expertise, and partner with other experts to apply analytical algorithms to problem-solving. Then there is the opportunity to make something great. And if a data scientist wants to be called a "scientist," remember that words mean things: the word "scientist" is invariably associated with the rigor of a quality outcome.
