Overfitting is one of the main problems a data scientist faces when training a supervised model. It can degrade performance dramatically, and the consequences can be serious in a production environment.
But what is overfitting exactly? In this article, I explain how to identify and avoid it.
Why does it happen?
In machine learning, simplicity is key. We want the model to generalize the information it extracts from the training dataset, and the more complex the model, the higher the risk of overfitting.
A complex model tends to over-learn from the training data: it treats the random error that pushes the training data away from the underlying dynamics as if it were worth learning. That’s the exact point at which the model stops generalizing and starts overfitting.
Complexity is often measured by the number of parameters the model fits during its learning procedure: for example, the number of coefficients in a linear regression or the number of weights in a neural network, which grows with the number of neurons.
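To make the idea of counting parameters concrete, here is a minimal sketch. The function names and the example layer sizes are hypothetical, and the neural network is assumed to be a plain fully connected one with a bias on every unit.

```python
def linear_regression_params(n_features):
    # One coefficient per input feature, plus the intercept.
    return n_features + 1


def mlp_params(layer_sizes):
    # For each pair of consecutive layers: a weight for every connection
    # (n_in * n_out) plus one bias per unit in the receiving layer (n_out).
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


n_linear = linear_regression_params(10)   # 10 features -> 11 parameters
n_network = mlp_params([10, 32, 1])       # 10 inputs, 32 hidden units, 1 output
print(n_linear, n_network)
```

Even a small hidden layer multiplies the number of parameters compared with a linear model on the same inputs, which is why neural networks usually need more data or regularization to avoid overfitting.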
So, the fewer the parameters, the simpler the model and, as a rule of thumb, the lower the risk of overfitting.
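The link between parameter count and overfitting can be sketched with polynomial fits of increasing degree. This is only an illustration under stated assumptions: the sine-shaped signal, the noise level, the sample sizes and the chosen degrees are all made up for the example, and NumPy's `Polynomial.fit` stands in for whatever model you actually use.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_signal(x):
    # Hypothetical underlying dynamics: one period of a sine wave.
    return np.sin(2 * np.pi * x)


# A small noisy training set and a larger held-out test set.
x_train = np.sort(rng.uniform(0.0, 1.0, 20))
y_train = true_signal(x_train) + rng.normal(0.0, 0.3, x_train.size)
x_test = np.sort(rng.uniform(0.0, 1.0, 200))
y_test = true_signal(x_test) + rng.normal(0.0, 0.3, x_test.size)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 3, 12):
    # A polynomial of degree d has d + 1 parameters.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (
        mse(y_train, model(x_train)),  # error on data the model has seen
        mse(y_test, model(x_test)),    # error on data it has not seen
    )
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.4f}  "
          f"test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test error is the number to watch: when it starts rising while the training error keeps falling, the extra parameters are fitting noise.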
What is overfitting?
Overfitting occurs when your model learns too much from the training data and can’t generalize the underlying information. When this happens, the model describes the training data very accurately but loses accuracy on any dataset it has not been trained on. This defeats the purpose of training, because we want the model to perform reasonably well on data it has never seen before.
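An extreme case makes this definition tangible: a model that simply memorizes the training set. The sketch below uses a 1-nearest-neighbour predictor on made-up data (a sine wave plus Gaussian noise, both assumptions of this example) and compares the error on the training set with the error on held-out data.

```python
import math
import random

rng = random.Random(0)


def sample(n):
    # Hypothetical data: a sine wave plus Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


x_train, y_train = sample(30)
x_test, y_test = sample(300)


def predict(x):
    # 1-nearest-neighbour: return the target of the closest training point,
    # so every training point is reproduced exactly, noise included.
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]


def mse(xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_mse = mse(x_train, y_train)
test_mse = mse(x_test, y_test)
gap = test_mse - train_mse
print(f"train MSE={train_mse:.4f}  test MSE={test_mse:.4f}  gap={gap:.4f}")
```

The training error is zero because the model has memorized every point, yet the error on unseen data is not. A large gap between the two is the practical symptom to look for when you suspect overfitting.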