U-Net is one of the foundational works on segmentation tasks.
It consists of 3 parts: Encoding Phase (apply convolutions and downsampling to extract features that identify the object) -> Bridge -> Decoding Phase (upsample to restore spatial information, so that a 572×572 input yields a 388×388 output).
During the decoding phase, skip connections copy feature maps from the shallower encoder layers into the decoder to restore spatial information lost during downsampling.
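In the original paper the encoder feature maps are larger than the decoder feature maps they are joined with, so they are center-cropped before being concatenated along the channel axis. Below is a minimal sketch of that crop-and-concatenate step using plain nested lists. The helper names `center_crop` and `skip_concat` and the toy sizes are assumptions for illustration, not code from the paper.

```python
def center_crop(channel, target_h, target_w):
    """Crop a 2D list (H x W) to target_h x target_w around its center."""
    h, w = len(channel), len(channel[0])
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return [row[left:left + target_w] for row in channel[top:top + target_h]]


def skip_concat(encoder_maps, decoder_maps):
    """Crop each encoder channel to the decoder size, then stack the channels.

    Both arguments are lists of channels; each channel is a 2D list.
    """
    target_h = len(decoder_maps[0])
    target_w = len(decoder_maps[0][0])
    cropped = [center_crop(ch, target_h, target_w) for ch in encoder_maps]
    return cropped + decoder_maps  # concatenation along the channel axis


# Toy example (hypothetical sizes): 2 encoder channels of 8x8,
# 3 decoder channels of 4x4.
encoder = [[[c * 100 + r * 8 + col for col in range(8)] for r in range(8)]
           for c in range(2)]
decoder = [[[0] * 4 for _ in range(4)] for _ in range(3)]

merged = skip_concat(encoder, decoder)
print(len(merged), len(merged[0]), len(merged[0][0]))  # channels, height, width
```

In a real framework the same step would be a tensor slice followed by a channel-wise concatenation. With padded convolutions the two sizes already match and the crop is unnecessary.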
Notice that the output is smaller than the input. Most segmentation models today output the same size as the input, but this paper uses unpadded ("valid") convolutions, so every 3×3 convolution trims one pixel from each border. You can avoid the shrinkage by using padded convolutions, for example with mirror (reflect) padding. The paper chose unpadded convolutions so that the output contains only pixels for which the full context is available in the input. To predict pixels near the image border, it mirrors the input to supply the missing context (the overlap-tile strategy).
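The size arithmetic can be traced by hand. The sketch below assumes the layout described in the paper: four downsampling stages, two 3×3 convolutions per block, 2×2 max pooling, and 2× up-convolutions. The function name `unet_output_size` and its parameters are hypothetical names chosen for this illustration.

```python
def unet_output_size(input_size, depth=4, convs_per_block=2, kernel=3,
                     padded=False):
    """Trace the spatial size through a U-Net-style encoder/bridge/decoder.

    Returns (output_size, skip_sizes), where skip_sizes are the encoder
    feature-map sizes that feed the skip connections.
    """
    # Each unpadded convolution removes (kernel - 1) pixels per dimension.
    shrink = 0 if padded else convs_per_block * (kernel - 1)

    size = input_size
    skip_sizes = []
    for _ in range(depth):
        size -= shrink                # two convolutions in the encoder block
        skip_sizes.append(size)
        if size % 2:
            raise ValueError(f"size {size} is odd, cannot pool 2x2 cleanly")
        size //= 2                    # 2x2 max pooling

    size -= shrink                    # bridge convolutions

    for _ in range(depth):
        size *= 2                     # 2x up-convolution
        size -= shrink                # two convolutions in the decoder block

    return size, skip_sizes


out, skips = unet_output_size(572)
print(out, skips)  # 388 [568, 280, 136, 64]

same, _ = unet_output_size(512, padded=True)
print(same)  # 512
```

With unpadded convolutions the 572×572 input shrinks to 388×388. With padded convolutions the size is preserved, provided the input is divisible by 2 once per pooling stage (by 16 for four stages).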
Paper: https://arxiv.org/abs/1505.04597