

In the old days, “data mining” used to have a bad reputation because “if you torture data for long enough, they will confess to anything.” Although it is fairly easy to lie with statistics, I would like to point out that it is much easier to lie without them! Still, correlation does not equal causation. False causality is the mistaken belief that a correlation between two variables necessarily implies a causal relationship. We have come a long way in data science, and yet there is still lots and lots of ground to cover.

One problem that is fairly well understood, though, is “overfitting” of models, not least because so many colleagues (me too) have been stung by it! Overfitting occurs when idiosyncrasies in the training data become part of the model you use in production. Consequently, the final model winds up performing worse than you expected based on the “fit” to the training data. Predictive modeling can be an amazing tool, but if you point the gun down, it is eminently possible to shoot yourself in the foot. In extreme cases, these models might even perform worse than not using a model at all… When overfitting causes your model to suggest misleading relations in the data, and you find out only after deployment, money goes out the window.
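To make the gap between the “fit” on training data and performance on new data concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption chosen for illustration: the ground truth `y = 2x + noise`, the sample sizes, and the two toy models (a straight line and a 1-nearest-neighbour “memorizer” that copies the closest training point, idiosyncrasies included).

```python
import random


def make_data(n, rng):
    """Draw n points from y = 2x + Gaussian noise (a made-up ground truth)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns a predict function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a predict function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


overfit_rng = random.Random(0)
train_x, train_y = make_data(200, overfit_rng)
test_x, test_y = make_data(1000, overfit_rng)  # fresh data the models never saw

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE = {mse(model, train_x, train_y):.3f}  "
          f"test MSE = {mse(model, test_x, test_y):.3f}")
```

The memorizer reproduces its training data perfectly, so its training error is zero, yet on fresh data it does worse than the plain line because it has also memorized the noise. Judging a model only by its training fit would pick the wrong one here.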

Experimental research is thought to demonstrate causation and overcome the limitations of correlational studies. This means that it is designed to establish a cause-and-effect relationship between variables, rather than just a correlation. However, experimental research is not immune to data fallacies. These are errors or biases that can affect the results of a study, even if it is well designed and executed.

The Hawthorne or observer effect: This refers to the tendency of study participants to change their behavior or performance simply because they know they are being observed. This can lead to inflated or distorted results.

Regression toward the mean: This is a statistical phenomenon where extreme values on one measurement tend to be followed by less extreme values on subsequent measurements. This can make it appear as though a treatment or intervention is effective, when in fact it is just a natural fluctuation in the data.

Data dredging: This is the practice of analyzing data in multiple ways until a statistically significant result is found. This can lead to false positives and overestimation of effects.
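Data dredging, as described above, can be simulated in a few lines. The sketch below is illustrative only: the outcome and all 200 candidate features are pure random noise (an assumption made so that every “finding” is known to be false), and the significance cutoff uses the large-sample approximation |r| > 1.96 / sqrt(n) for a two-sided 5% test rather than an exact t-test.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


dredge_rng = random.Random(42)
n_rows, n_features = 100, 200

# The outcome is pure noise: no feature can truly be related to it.
outcome = [dredge_rng.gauss(0.0, 1.0) for _ in range(n_rows)]

# Approximate two-sided 5% critical value for a correlation (large-sample rule).
critical_r = 1.96 / math.sqrt(n_rows)

false_hits = 0
for _ in range(n_features):
    feature = [dredge_rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    if abs(pearson(feature, outcome)) > critical_r:
        false_hits += 1

print(f"{false_hits} of {n_features} noise features look 'significant' at the 5% level")
```

Roughly 5% of the noise features are expected to clear the threshold by chance alone, so testing enough variables practically guarantees a publishable-looking result. Checking any such finding against data that played no part in the search is one way to catch it.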

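Regression toward the mean can likewise be shown with a small simulation. This is a sketch under stated assumptions: each hypothetical subject has a stable “true” level, every measurement adds independent noise of the same size, and nothing at all is done to the subjects between the two measurements.

```python
import random

rtm_rng = random.Random(7)
n_subjects = 10_000

# Stable underlying level per subject, plus fresh noise on each measurement.
true_level = [rtm_rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
first = [t + rtm_rng.gauss(0.0, 1.0) for t in true_level]
second = [t + rtm_rng.gauss(0.0, 1.0) for t in true_level]

# Pick the subjects who scored in the top 10% on the first measurement,
# as one might when enrolling "extreme cases" in an intervention.
cutoff = sorted(first)[int(0.9 * n_subjects)]
selected = [i for i in range(n_subjects) if first[i] >= cutoff]


def mean(values):
    return sum(values) / len(values)


first_mean = mean([first[i] for i in selected])
second_mean = mean([second[i] for i in selected])
overall_mean = mean(second)

print(f"selected group, first measurement:  {first_mean:.2f}")
print(f"selected group, second measurement: {second_mean:.2f}")
print(f"everyone, second measurement:       {overall_mean:.2f}")
```

The selected group scores lower the second time even though no treatment was applied, because part of what made them extreme the first time was noise. Without a comparison group, that drop could easily be mistaken for the effect of an intervention.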