I was finally able to fine-tune a model and get decent results.
I used a new dataset from Kaggle. It has more images of water surfaces, and simpler ones, which helped a lot.
Results
The model learned to the point where, if the shape of the river is relatively simple, it gets most of the segmentation correct (2nd Example). Still, if the image was too much of a close-up, it seemed to get confused (1st Example).
Is it even necessary..?
I had to one-hot-encode the label images to make it work.
The original mask label image comes in as a tensor of shape [batch_size, 1 (grayscale), height, width], but the model's output is [batch_size, number_of_classes (2 in this case), height, width], so its shape depends on how many classes you need. When the tensor shapes differ, you can't compute the loss by comparing the model's output against the ground-truth label.
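For reference, here's a minimal sketch of the kind of conversion I mean, assuming PyTorch and a 2-class mask. The shapes and the loss function are placeholders, not my exact training code:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 2 classes (e.g. water / background)
batch_size, num_classes, H, W = 4, 2, 256, 256

# Grayscale label batch: [batch_size, 1, H, W], each pixel an integer class id
mask = torch.randint(0, num_classes, (batch_size, 1, H, W))

# Drop the channel dim, one-hot the class ids, then move the new class
# axis to dim 1 so the target matches the model output's shape
target = F.one_hot(mask.squeeze(1), num_classes)   # [B, H, W, C]
target = target.permute(0, 3, 1, 2).float()        # [B, C, H, W]

logits = torch.randn(batch_size, num_classes, H, W)  # stand-in for model output
loss = F.binary_cross_entropy_with_logits(logits, target)
```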
The weird thing is, I found a blog that fine-tuned DeepLabv3 without any one-hot encoding and was still able to retrain the model. What's up with that?
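My best guess (and it is a guess): PyTorch's nn.CrossEntropyLoss accepts integer class-index targets of shape [batch_size, height, width] directly, so if that blog used it, no one-hot encoding would be needed. A quick sketch with the same hypothetical shapes as above:

```python
import torch
import torch.nn as nn

batch_size, num_classes, H, W = 4, 2, 256, 256

logits = torch.randn(batch_size, num_classes, H, W)      # model output [B, C, H, W]
mask = torch.randint(0, num_classes, (batch_size, 1, H, W))

# CrossEntropyLoss handles the K-dimensional case: logits are [B, C, H, W]
# and the target is [B, H, W] with raw class indices, so squeezing out the
# grayscale channel is all the preprocessing the label needs.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, mask.squeeze(1))
```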