To classify images, the traditional approach is to stack convolution and pooling layers that progressively shrink the spatial dimensions of the input while the model learns increasingly complex features. This is fine for classification, because you don’t care where in the image the object is. For semantic segmentation, however, you need spatial information as well, since the model has to assign a class to every pixel.
If you upsample from the deepest layer back to the input resolution in one step, the segmentation output comes out blurry: the fine location information was already discarded by the pooling layers, and interpolation alone can’t recover it.
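To make this concrete, here is a minimal sketch of the naive one-step approach, assuming PyTorch. The model name, layer sizes, and `num_classes=21` are hypothetical choices for illustration, not something from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneShotSegNet(nn.Module):
    """Hypothetical model: downsample 8x, then jump straight back up."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 1/2 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 1/4 resolution
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 1/8 resolution
        )
        self.classifier = nn.Conv2d(256, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        h, w = x.shape[2:]
        x = self.encoder(x)     # fine spatial detail is already lost here
        x = self.classifier(x)
        # one-step 8x upsample: interpolation can't recover the lost detail,
        # so object boundaries come out blurry
        return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
```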
So instead, you upsample gradually and fuse in feature maps from earlier, shallower layers along the way (skip connections), restoring the spatial information those layers still carry.
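Here is the same idea reworked with gradual upsampling and skip fusion, again a PyTorch sketch with hypothetical names and sizes. Fusing by addition follows the FCN style; U-Net concatenates instead, but the principle is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionSegNet(nn.Module):
    """Hypothetical model: upsample gradually, fusing shallower features."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.MaxPool2d(2),
                                    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.block3 = nn.Sequential(nn.MaxPool2d(2),
                                    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU())
        # 1x1 convs project each stage to class scores so they can be added
        self.score1 = nn.Conv2d(64, num_classes, 1)
        self.score2 = nn.Conv2d(128, num_classes, 1)
        self.score3 = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        f1 = self.block1(x)   # full resolution: fine spatial detail
        f2 = self.block2(f1)  # 1/2 resolution
        f3 = self.block3(f2)  # 1/4 resolution: most semantic, least spatial

        s = self.score3(f3)
        # upsample 2x, then fuse the shallower 1/2-resolution features
        s = F.interpolate(s, size=f2.shape[2:], mode="bilinear", align_corners=False)
        s = s + self.score2(f2)
        # upsample 2x again, then fuse the full-resolution features
        s = F.interpolate(s, size=f1.shape[2:], mode="bilinear", align_corners=False)
        return s + self.score1(f1)

model = SkipFusionSegNet(num_classes=21)
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 21, 224, 224]): one score map per class
```

Because each upsampling step is only 2x and is immediately corrected by features that never lost that resolution, the predicted boundaries stay much sharper than in the one-shot version above.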