▮ Stable Diffusion
There are several fine-tuning methods for text-to-image Stable Diffusion models, and it is hard to intuitively understand the differences between them. So in this post, I’d like to share a visual representation of each fine-tuning method.
▮ Training Procedure
Before I share the methods, I would first like to review the training process for text-to-image Stable Diffusion models. Here is a simplified workflow.
1. Embed the prompt
2. Add noise to the input image at timestep “n”
3. Create the target noise for timestep “n-1”
4. Have the diffusion model predict the noise of timestep “n-1”
5. Compare 3 and 4 and calculate the loss
The difference between the fine-tuning methods is which component each method penalizes with this loss — that is, which parameters the loss gradient actually updates. A minimal sketch of a single training step is shown below.
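To make the workflow concrete, here is a small, self-contained sketch of one training step. The `text_encoder` and `unet` below are toy stand-ins (an embedding table and a linear layer), not the real Stable Diffusion modules, and the noise schedule is likewise simplified:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the real text encoder and U-Net.
text_encoder = torch.nn.Embedding(1000, 64)
unet = torch.nn.Linear(64 + 64, 64)

prompt_ids = torch.randint(0, 1000, (1, 8))   # tokenized prompt
image_latents = torch.randn(1, 64)            # stand-in for the encoded input image

cond = text_encoder(prompt_ids).mean(dim=1)   # 1. embed the prompt
t = torch.randint(0, 1000, (1,))              # random timestep "n"
alpha = (1 - t.float() / 1000).sqrt().unsqueeze(1)   # simplified noise schedule
noise = torch.randn_like(image_latents)              # 3. the target noise
noisy = alpha * image_latents + (1 - alpha**2).sqrt() * noise  # 2. add noise at timestep n

pred_noise = unet(torch.cat([noisy, cond], dim=1))   # 4. model predicts the noise
loss = F.mse_loss(pred_noise, noise)                 # 5. compare 3 and 4

# Which parameters this gradient is allowed to update is exactly
# what distinguishes the fine-tuning methods below.
loss.backward()
```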
▮ DreamBooth
This method may be the most popular and effective of the approaches I’m going to share. It penalizes the diffusion model itself, so the loss gradient updates the model’s own weights.
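In code, the distinction is simply which parameters the optimizer sees. A sketch with the same toy `unet` stand-in as above (a real run would use the full Stable Diffusion U-Net):

```python
import torch

# Toy stand-in for the full diffusion model.
unet = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.Linear(64, 64))

# DreamBooth-style: every weight of the diffusion model is trainable,
# so the noise-prediction loss updates the model itself.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-6)
```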
▮ Textual Inversion
Unlike DreamBooth, this method penalizes the embedding of the prompt. In other words, it keeps re-embedding the prompt until the frozen model generates the expected image. Since it does not change the diffusion model itself, it is relatively robust to catastrophic forgetting. Also, the output is a tiny embedding, so it is storage-friendly. It is often said that this method is slightly less effective than DreamBooth.
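A minimal sketch of the idea, again with toy stand-ins: everything is frozen except the embedding vector for one new pseudo-token, and that single vector is the entire training artifact:

```python
import torch

# Toy stand-ins: a 1000-token vocabulary with 64-dim embeddings,
# and a linear layer in place of the diffusion model.
text_encoder = torch.nn.Embedding(1000, 64)
unet = torch.nn.Linear(64, 64)

# Freeze both the text encoder and the diffusion model.
for p in list(text_encoder.parameters()) + list(unet.parameters()):
    p.requires_grad_(False)

# The only trainable parameter: the embedding for a new pseudo-token,
# e.g. "<my-concept>" (a hypothetical token name for illustration).
new_token_embedding = torch.nn.Parameter(torch.randn(64) * 0.01)
optimizer = torch.optim.AdamW([new_token_embedding], lr=5e-3)

# After training, the whole output is this one vector -- a few KB on disk.
torch.save(new_token_embedding, "my_concept.pt")
```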
▮ LoRA: Low-Rank Adaptation
Instead of training the whole diffusion model like DreamBooth, this method inserts small low-rank layers into the diffusion model and updates only the weights of those inserted layers. Since it constrains the number of trainable parameters, it trains much faster than DreamBooth (roughly a third of the time, depending on the GPU). A sketch of one inserted layer follows.
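Here is a sketch of a single inserted low-rank layer wrapping a frozen linear projection. The `rank` and `alpha` values are illustrative, not the settings of any particular trainer:

```python
import torch

class LoRALinear(torch.nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(Ax)."""

    def __init__(self, base: torch.nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # the original weights stay frozen
            p.requires_grad_(False)
        self.lora_a = torch.nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.lora_b.weight)  # training starts from the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Wrap a (toy) projection layer; only lora_a and lora_b receive gradients.
layer = LoRALinear(torch.nn.Linear(64, 64), rank=4)
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```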
▮ Hypernetworks
In this method, you train a separate model that outputs the layers to insert into the diffusion model. Unlike LoRA, the inserted layers are not updated directly; instead, the loss penalizes the hypernetwork that generates them.
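A toy sketch of that idea: a small generator network emits the weights of the inserted layer from a learned code, and the optimizer updates the generator, never the emitted weights directly. All sizes here are made up for illustration:

```python
import torch
import torch.nn.functional as F

class HyperNetwork(torch.nn.Module):
    """Generates the weights of an inserted layer instead of storing them."""

    def __init__(self, z_dim: int = 16, in_f: int = 64, out_f: int = 64):
        super().__init__()
        self.z = torch.nn.Parameter(torch.randn(z_dim))         # learned code
        self.generator = torch.nn.Linear(z_dim, in_f * out_f)   # emits the layer's weights
        self.in_f, self.out_f = in_f, out_f

    def forward(self, x):
        w = self.generator(self.z).view(self.out_f, self.in_f)
        # Apply the generated layer; gradients flow back into the generator.
        return F.linear(x, w)

hyper = HyperNetwork()
# The loss penalizes the hypernetwork's parameters, not the generated weights.
optimizer = torch.optim.AdamW(hyper.parameters(), lr=1e-4)
```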