▮ Labeling
Most ML models in production today are trained with supervised learning. This means they need labeled data to learn their task, and labeled data is rarely abundant.
Here are the main types of labels and the challenges of obtaining them.
Reference: Designing Machine Learning Systems
▮ Hand Labels
Hand labels are labels produced by human annotators.
Here are several challenges hand labeling faces.
Challenge 1. Cost
Labeling can easily become expensive, especially if subject matter expertise is required. You can’t just hire a random annotator to label chest X-rays.
Challenge 2. Data Privacy
The data you want to label may have strict privacy requirements. If the data is not allowed to be accessed by third-party services, you may need to hire contract annotators to label it on-premises.
Challenge 3. Speed
Labeling data by hand is usually slow. Slow labeling leads to slow iteration, making your model less adaptive to constantly changing environments and more prone to degradation.
Challenge 4. Label Multiplicity
When annotating at scale, you are going to need multiple human annotators. Annotators with different levels of expertise often produce conflicting labels for the same sample, and the more expertise a task requires, the more frequent these disagreements become.
To avoid these situations, it is important to establish clear labeling rules and share them with all annotators.
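Even with clear rules, some disagreement is inevitable, so you need a way to resolve conflicting labels. A minimal sketch, using simple majority voting with an agreement threshold (the `resolve` function and its threshold are illustrative assumptions, not a method from the book):

```python
from collections import Counter

# Illustrative sketch: resolve conflicting labels from multiple annotators
# by majority vote, flagging low-agreement samples for expert review.
def resolve(annotator_labels, min_agreement=0.5):
    counts = Counter(annotator_labels)
    label, votes = counts.most_common(1)[0]
    if votes / len(annotator_labels) <= min_agreement:
        return None  # no clear majority: send back for adjudication
    return label
```

For example, `resolve(["cat", "cat", "dog"])` returns `"cat"`, while a 50/50 split returns `None` so the sample can be escalated.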
▮ Natural Labels
Natural labels are labels the system can infer from real-world outcomes, so the model’s predictions can be evaluated without human annotation.
An easy example of this is Google Maps. When the model estimates the amount of time it takes to get to a certain destination, the model can get automatic feedback from the actual time it took to get there.
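This feedback loop can be sketched in a few lines. The function names below are hypothetical, not from any real ETA system:

```python
import datetime

# Hypothetical ETA service: when a trip finishes, the actual travel time
# becomes the label for the earlier prediction -- no annotator needed.
feedback_log = []  # (predicted_minutes, actual_minutes) pairs

def record_trip(predicted_minutes, departed_at, arrived_at):
    # The observed arrival time turns the prediction into a labeled example.
    actual = (arrived_at - departed_at).total_seconds() / 60
    feedback_log.append((predicted_minutes, actual))

def mean_absolute_error():
    # Evaluate the model purely from collected natural labels.
    return sum(abs(p - a) for p, a in feedback_log) / len(feedback_log)
```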
Even if your task doesn’t inherently generate natural labels, you can often design the system to collect them.
For example, if you are developing a translation model, you can let users submit a correction when the model’s prediction is wrong.
▮ Handling Lacking Labels
Here are four methods for handling the lack of labeled data.
Weak Supervision
One approach that avoids hand labels is weak supervision.
You can use libraries such as Snorkel to create labeling functions (LFs) that encode heuristics.
You can apply these LFs to the samples you want to label.
However, it is still better to keep a small set of hand labels to verify the accuracy of your LFs; removing humans from the loop entirely carries risks.
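A minimal Snorkel-style sketch, written without the library. The heuristics and the majority-vote combiner below are illustrative assumptions; Snorkel's `LabelModel` combines LF outputs more carefully by estimating each LF's accuracy:

```python
ABSTAIN = -1
SPAM, HAM = 1, 0

# Illustrative labeling functions encoding simple heuristics.
def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_all_caps(text):
    return SPAM if text.isupper() else ABSTAIN

def lf_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LFS = [lf_contains_link, lf_all_caps, lf_greeting]

def weak_label(text):
    # Majority vote over non-abstaining LFs; Snorkel's LabelModel instead
    # learns how much to trust each LF.
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN
```

Samples where every LF abstains stay unlabeled, which is exactly where a small hand-labeled set helps you measure LF coverage and accuracy.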
Semi-Supervised Learning
This method leverages structural assumptions about the data to generate new labels from a small initial set of labeled data.
One semi-supervised method that has gained popularity is the perturbation-based method. Assuming that small perturbations to a sample shouldn’t change its label, you apply such perturbations to labeled samples to generate new training instances.
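A minimal sketch of this idea for numeric features, assuming small Gaussian noise is too weak to change the true class:

```python
import random
random.seed(0)  # for reproducibility

# Sketch of the perturbation-based method: add small Gaussian noise to a
# labeled sample and reuse its label, on the assumption that such a tiny
# perturbation cannot flip the true class.
def perturb(features, label, n_copies=3, sigma=0.01):
    return [
        ([x + random.gauss(0, sigma) for x in features], label)
        for _ in range(n_copies)
    ]
```

Each labeled sample thus yields several nearly identical training instances, which also nudges the model toward being robust to small input noise.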
Transfer Learning
Transfer learning is a method where you reuse weights trained on a task with abundant data for a new, different task.
I’ve written a couple of blog posts related to this topic. You can check them out through the links below if you are interested.
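The core idea can be sketched with a toy linear model: freeze a "pretrained" feature extractor and train only a new head on the small target dataset. All weights and data below are made up for illustration:

```python
# A hypothetical "pretrained" feature extractor: weights learned on a large
# source task (hard-coded here purely for illustration).
PRETRAINED_W = [[0.9, -0.2], [0.1, 0.8]]

def extract_features(x):
    # Frozen layer: reuse the pretrained weights, never update them.
    return [sum(w * xi for w, xi in zip(row, x)) for row in PRETRAINED_W]

def train_head(data, labels, lr=0.1, epochs=50):
    # Train only a new linear head (perceptron updates) on the target task.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            f = extract_features(x)
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * fi for wi, fi in zip(w, f)]
            b += lr * err
    return w, b

def predict(w, b, x):
    f = extract_features(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
```

In practice you would do this with a framework like PyTorch (freeze the backbone's parameters, replace the final layer), but the division of labor is the same: the frozen part carries knowledge from the data-rich task, and only the small head needs your scarce labels.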
Active Learning
Active learning is a method for improving the efficiency of data labeling by choosing which samples are most worth labeling.
I’ve written a blog post on this topic before as well. If you are interested, you can check that out.
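The most common strategy, uncertainty sampling, sends annotators the samples the current model is least sure about. A minimal sketch for a binary classifier (the toy `predict_proba` usage in the example is illustrative):

```python
# Illustrative uncertainty sampling: pick the unlabeled samples whose
# predicted probability is closest to 0.5, i.e. where the model is least
# confident, and spend the labeling budget on those.
def select_for_labeling(unlabeled, predict_proba, budget=2):
    return sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))[:budget]
```

With a toy `predict_proba` that just returns the sample itself, `select_for_labeling([0.9, 0.52, 0.1, 0.48], lambda p: p)` picks `[0.52, 0.48]`, the two samples nearest the decision boundary.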