363. Splitting Datasets

Training Your Model

When training a model, the dataset is often divided into a training set, a validation set, and a test set. The ratio used to split the data depends on how large your dataset is, but in most cases a 60:20:20 proportion is used. Each set plays a specific role in training a useful model.
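As a minimal sketch, the 60:20:20 split can be done by shuffling indices and slicing. The helper name `split_dataset` is hypothetical, chosen here for illustration:

```python
import numpy as np

def split_dataset(data, seed=0):
    """Shuffle and split into 60% train, 20% validation, 20% test.
    (split_dataset is a hypothetical helper, not a library function.)"""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n = len(data)
    n_train = int(0.6 * n)
    n_val = int(0.2 * n)
    train = data[indices[:n_train]]
    val = data[indices[n_train:n_train + n_val]]
    test = data[indices[n_train + n_val:]]
    return train, val, test

X = np.arange(100).reshape(100, 1)
train, val, test = split_dataset(X)
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before slicing matters: if the data is ordered (e.g. by class), a plain slice would give the three sets different distributions.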

Training Set

This data is used to optimize the weights of the model. The loss calculated on the training set, the training loss, is used to evaluate whether the model is underfitting.

If the model is not performing well AND the training loss is high, the model is struggling with high bias. You either need more training time or a bigger network. Getting more data won’t help in this case.

Validation Set

This data is used to check whether the model is overfitting to the training set. The main goal with this dataset is to find the hyperparameters that help the model generalize well. This also means the validation set should come from the same distribution as the test set; it would be meaningless for the model to generalize to data completely different from the test set. The weights are not optimized on this dataset.

If the model is not performing well AND the training loss is low AND the validation loss is high, the model is struggling with high variance. You either need more data, a better regularization method, or a different network architecture.
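Picking hyperparameters with the validation set can be sketched like this: fit the model on the training set for each candidate value, then keep the one with the lowest validation loss. The ridge-regression setup and the toy data below are assumptions for illustration only:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    """Mean squared error of predictions X @ w against targets y."""
    return float(np.mean((X @ w - y) ** 2))

# Toy data: linear targets with a little noise (hypothetical example).
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_train = rng.normal(size=(50, 5))
y_train = X_train @ w_true + rng.normal(scale=0.1, size=50)
X_val = rng.normal(size=(20, 5))
y_val = X_val @ w_true + rng.normal(scale=0.1, size=20)

# Fit on the training set only; score candidates on the validation set.
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates,
               key=lambda lam: mse(X_val, y_val, ridge_fit(X_train, y_train, lam)))
print("selected regularization strength:", best_lam)
```

The key point is that the validation set influences which model is chosen, but never the weight updates themselves.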

Test Set

This data is used to check the following.

  1. Whether the model is overfitting to the Train/Val set.
  2. Whether the Test and Val sets come from different distributions.
  3. Whether there is too much diversity in the dataset being used (data mismatch).

As with the validation set, the weights are not optimized on this dataset.

If the model is not performing well AND the training/validation loss is low AND the test loss is high, it indicates that the test and validation sets might come from different distributions, that the model might be overfitting to the train/validation set, or that there is too much diversity in your dataset (data mismatch). You may need to reevaluate your dataset.
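The three diagnoses described in this note can be summarized as a simple rule of thumb over the loss pattern. The function and threshold below are hypothetical, meant only to make the decision logic explicit:

```python
def diagnose(train_loss, val_loss, test_loss, threshold=0.1):
    """Rule-of-thumb diagnosis from the train/val/test loss pattern.
    (Hypothetical helper; the threshold is an illustrative assumption.)"""
    if train_loss > threshold:
        # High training loss: the model cannot even fit the training set.
        return "high bias: train longer or use a bigger network"
    if val_loss > threshold:
        # Low training loss but high validation loss: overfitting.
        return "high variance: get more data or a better regularization method"
    if test_loss > threshold:
        # Low train/val loss but high test loss: distributions may differ.
        return "possible data mismatch or overfitting to the train/val set"
    return "ok"

print(diagnose(0.5, 0.6, 0.6))   # high bias case
print(diagnose(0.01, 0.5, 0.5))  # high variance case
print(diagnose(0.01, 0.02, 0.5)) # data mismatch case
```

In practice these checks are done by comparing the losses against each other and against human-level performance, not a fixed threshold; the fixed threshold here only keeps the sketch short.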