Overfitting or Underfitting the Training Data
Overfitting means that the model performs well on the training data but does not generalize to new data. Underfitting is the opposite: it occurs when your model is too simple to learn the underlying structure of the data.
Overfitting
Overfitting occurs when a model learns "too well" from the training data, including noise and outliers. This makes it perform excellently on the training dataset but poorly on new, unseen data.
Indication
High accuracy on training data but low accuracy on validation or other unseen data.
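This indication is easy to check directly: score the same model on the training set and on a held-out validation set and compare. A minimal sketch with scikit-learn (the synthetic dataset, the decision tree, and all parameter values are illustrative choices, not prescribed by these notes):

```python
# Hypothetical sketch: spotting overfitting by comparing train vs. validation
# accuracy. flip_y adds label noise so there is something to memorize.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# An unconstrained tree can memorize the training set, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
# A large gap between the two numbers is the classic overfitting signature.
```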
Causes of overfitting
- insufficient training data: a small training dataset may lead the model to learn specific patterns that do not generalize well.
- model complexity: highly complex models (e.g. DNNs with many parameters) have a greater capacity to memorize the training data, including noise.
- noisy data: training data that contains irrelevant information or outliers can mislead the model.
- excessive training: training a model for too many epochs can lead to overfitting, as the model keeps adjusting to the training data, including its noise.
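The last cause suggests a common counter-measure: stop training once validation performance stops improving (early stopping). As one possible sketch, scikit-learn's `MLPClassifier` exposes this via its `early_stopping` flag, which holds out a validation split and halts when the validation score plateaus (the dataset and hyperparameters here are illustrative):

```python
# Illustrative sketch: capping epochs with early stopping. The network
# monitors a 20% validation split and stops after 10 epochs without
# improvement, instead of running all 500 allowed epochs.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                      early_stopping=True, validation_fraction=0.2,
                      n_iter_no_change=10, random_state=0)
model.fit(X, y)
print(f"stopped after {model.n_iter_} of up to 500 epochs")
```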
How to mitigate overfitting:
- simplify the model: use a less complex model with fewer parameters.
- more data: collecting more data can help the model generalize better.
- regularization: techniques like Lasso (L1) or Ridge (L2) regularization add a penalty for large model weights, discouraging the model from fitting noise.
- cross validation: use techniques like k-fold cross-validation to ensure the model performs well on different subsets of data.
- pruning: in decision trees, pruning can reduce complexity by removing branches that have little importance.
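Two of the mitigations above, regularization and cross-validation, can be combined in a few lines. A hedged sketch using `LogisticRegression`, whose `C` parameter is the inverse L2 penalty strength (smaller `C` means stronger regularization); the dataset and `C` values are illustrative:

```python
# Sweep the L2 penalty strength and score each setting with
# 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

for C in (100.0, 1.0, 0.01):  # weaker -> stronger L2 penalty
    model = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"C={C:>6}: mean CV accuracy = {scores.mean():.3f}")
```

Because each fold's score is computed on data the model did not train on, the cross-validated accuracy reflects generalization rather than memorization.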
Underfitting
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It fails to represent even the training data well.
Indication
Low accuracy on both training and validation data.
How to mitigate underfitting:
- increase model complexity: use a more complex algorithm that can learn intricate patterns, such as decision trees, random forests, or neural networks.
- add features: include more relevant features that provide useful information.
- reduce regularization: if regularization is applied too aggressively, it may overly constrain the model and prevent learning.
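The first two mitigations can be seen in a small sketch: a plain linear model underfits a quadratic relationship, while the same model with added polynomial features captures it (the data-generating function and degree are chosen purely for illustration):

```python
# Illustrative sketch: fixing underfitting by adding features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # quadratic target

linear = LinearRegression().fit(X, y)               # too simple: underfits
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)  # with added x^2 feature

print(f"linear R^2: {linear.score(X, y):.2f}")  # poor even on training data
print(f"poly   R^2: {poly.score(X, y):.2f}")    # captures the curve
```

Note that the underfit model scores poorly on the *training* data itself, matching the indication above (low accuracy on both training and validation data).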