▮ Keeping Track
When developing an ML model, there is a lot of information to keep track of. Data scientists experiment constantly, and that information can easily become chaotic. Ideally, everything should be organized so that even after a while, say 6 months later, you can still immediately reproduce the model.
Here are the 8 types of artifacts to keep track of when developing an ML model.
1. Model Definition
This is the information about the network's architecture, along with training settings such as the loss function and batch size.
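As a rough sketch, the definition can live in a plain config file stored next to the rest of the artifacts; the field names and values below are only illustrative:

```python
import json

# Hypothetical model/training configuration for one experiment.
config = {
    "architecture": "resnet18",
    "loss_function": "cross_entropy",
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "batch_size": 64,
}

# Store the definition alongside the other experiment artifacts.
with open("model_definition.json", "w") as f:
    json.dump(config, f, indent=2)
```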
2. Model Parameters
You will need the model's weights to reproduce its predictions. Depending on how the model is saved, you can store both the architecture and the weights in a single file.
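In PyTorch, for example, both options look roughly like this (the tiny Sequential model is just a stand-in for your real network):

```python
import torch
import torch.nn as nn

# A tiny stand-in model; in practice this is your real network.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# Option 1: save only the weights (state_dict); you must rebuild the
# architecture in code before loading them back.
torch.save(model.state_dict(), "weights.pt")

# Option 2: save architecture and weights together in a single file.
torch.save(model, "model_full.pt")

# Restoring option 1: recreate the same architecture, then load the weights.
restored = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
restored.load_state_dict(torch.load("weights.pt"))
```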
3. Feature Extraction
The feature extraction pipeline should also be stored so that you can reuse it whenever you need to preprocess raw data before feeding it into the model.
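One common way to do this, sketched below with scikit-learn and joblib (the actual transforms and file name are placeholders), is to persist the fitted pipeline object itself:

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical raw training features.
X_raw = np.random.rand(100, 20)

# The exact preprocessing the model was trained with.
feature_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
])
feature_pipeline.fit(X_raw)

# Persist the fitted pipeline so raw data can be preprocessed identically later.
joblib.dump(feature_pipeline, "feature_pipeline.joblib")

# At inference time, load it back and transform new raw data the same way.
pipeline = joblib.load("feature_pipeline.joblib")
X_ready = pipeline.transform(np.random.rand(5, 20))
```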
4. Dependencies
In most cases, ML models are packaged into containers, so information such as the Python version and installed Python packages is usually captured within that container.
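If you also want an explicit record outside the container image, a minimal sketch like the following can snapshot the interpreter and package versions (the environment.json file name is just an example):

```python
import sys
import json
from importlib.metadata import distributions

# Capture the interpreter version and installed package versions so the
# environment can be rebuilt (or baked into a container image) later.
env_snapshot = {
    "python_version": sys.version,
    "packages": sorted(
        f"{d.metadata['Name']}=={d.version}" for d in distributions()
    ),
}

with open("environment.json", "w") as f:
    json.dump(env_snapshot, f, indent=2)
```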
5. Data
It is quite common to re-split the datasets (training, validation, and test sets). Keeping track of which data went into which split in which experiment can easily get confusing as you experiment repeatedly. You should name or version-control your data for reproducibility.
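A lightweight sketch of this idea, assuming the raw data sits in a local dataset.csv, is to record the split seed, the resulting indices, and a fingerprint of the data file:

```python
import hashlib
import json
import numpy as np

# Hypothetical raw dataset: 1,000 examples identified by row index.
num_examples = 1000
rng = np.random.default_rng(seed=42)  # fixed seed makes the split reproducible
indices = rng.permutation(num_examples)

split_record = {
    "seed": 42,
    "train": indices[:800].tolist(),
    "validation": indices[800:900].tolist(),
    "test": indices[900:].tolist(),
}

# Fingerprint the raw data file (assumed to exist locally) so you know
# exactly which version of the data was split.
with open("dataset.csv", "rb") as f:
    split_record["data_sha256"] = hashlib.sha256(f.read()).hexdigest()

with open("split_v1.json", "w") as f:
    json.dump(split_record, f)
```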
6. Model Generation Code
This information includes:
1. Frameworks (PyTorch, TensorFlow, etc.)
2. How it was trained
3. Dataset division method
The code can be a Jupyter notebook or a plain Python file.
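One lightweight way to capture this, assuming the training code lives in a git repository, is to record the commit hash and framework version next to the model (the entry point and split description below are placeholder values):

```python
import json
import subprocess
import torch  # or whichever framework was used

# Assumes the training code lives in a git repository; record exactly which
# revision of the code produced this model.
commit = subprocess.check_output(
    ["git", "rev-parse", "HEAD"], text=True
).strip()

code_record = {
    "git_commit": commit,
    "framework": f"torch=={torch.__version__}",
    "entry_point": "train.py",                            # hypothetical script
    "split_strategy": "80/10/10 random split, seed=42",   # example value
}

with open("code_record.json", "w") as f:
    json.dump(code_record, f, indent=2)
```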
7. Experiment Artifacts
These are the artifacts generated throughout the development process: loss-per-epoch curves, task-specific metric results, etc.
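Rather than only saving the plot image, it helps to persist the underlying numbers; here is a minimal sketch with made-up per-epoch values:

```python
import csv

# Hypothetical per-epoch results collected during training.
history = [
    {"epoch": 1, "train_loss": 0.92, "val_loss": 0.88, "val_accuracy": 0.61},
    {"epoch": 2, "train_loss": 0.71, "val_loss": 0.69, "val_accuracy": 0.73},
    {"epoch": 3, "train_loss": 0.55, "val_loss": 0.60, "val_accuracy": 0.79},
]

# Persist the curve data itself, not just a screenshot of the plot, so the
# metrics can be re-plotted and compared across experiments later.
with open("training_history.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=history[0].keys())
    writer.writeheader()
    writer.writerows(history)
```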
8. Tags
This will help with model discovery and filtering.
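For instance, tags can simply be part of a small metadata file saved with each model (the IDs and tag names below are made up):

```python
import json

# Hypothetical metadata entry for one trained model.
model_card = {
    "model_id": "churn-classifier-2024-06-01",
    "tags": ["churn", "tabular", "production-candidate"],
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)

# Later, tags make it easy to filter a directory of model cards,
# e.g. keep only entries whose "tags" list contains "production-candidate".
```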