173. BiSeNet

Background

Most previous semantic segmentation architectures fall into one of two categories.

  1. Encoder-Decoder Backbone: (e.g. FCN, U-Net)
    This architecture forces all information to flow through a deep encoding-decoding structure, which leads to high latency, and the decoder struggles to restore low-level detail.
  2. Dilation Backbone: (e.g. DeepLab)
    This removes down-sampling to keep feature-map resolution high, but running convolutions on large feature maps requires heavy computation.
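To make the dilation backbone's cost concrete, here is a back-of-the-envelope comparison (my own illustrative arithmetic, not a figure from the paper): a convolution's multiply-accumulate count scales with the spatial size it runs at, so keeping full resolution instead of down-sampling to 1/8 multiplies per-layer cost dramatically.

```python
# Rough per-layer cost of a 3x3 convolution; the resolutions and channel
# count below are assumptions chosen only to illustrate the scaling.
def conv_macs(h, w, c_in, c_out, k=3):
    return h * w * c_in * c_out * k * k

full = conv_macs(512, 512, 256, 256)    # dilation backbone: full resolution
eighth = conv_macs(64, 64, 256, 256)    # encoder-style backbone at 1/8 scale
print(full // eighth)  # 64 -> 64x more compute per layer at full resolution
```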

Both of these types suffer from high latency. Latency matters especially when these models are used in applications like self-driving cars, and this is the challenge BiSeNet tackles.

BiSeNet

BiSeNet has a two-branch architecture, consisting of a Detail Branch and a Semantic Branch, and merges the information from these two branches in an Aggregation Layer.

Detail Branch

This branch is responsible for learning the low-level features of the image. As you can see from Fig. 3, in order to encode rich spatial detail, it has a high channel capacity with small receptive fields.
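A shape-level sketch makes the "high channels, small receptive field" trade-off concrete. The stage widths and strides below are assumptions for illustration in the spirit of Fig. 3, not the paper's exact configuration:

```python
# Hypothetical Detail Branch: a shallow stack of strided 3x3 conv stages,
# wide channels, stopping early at 1/8 resolution so spatial detail survives.
def stage_shape(h, w, stride):
    return h // stride, w // stride

h, w = 512, 512
channels = [64, 64, 128]   # high channel capacity (assumed values)
strides = [2, 2, 2]        # only three downsamplings -> small receptive field
for s in strides:
    h, w = stage_shape(h, w, s)
print(channels[-1], h, w)  # 128 64 64: wide features at 1/8 resolution
```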

Semantic Branch

On the other hand, this branch focuses on the high-level features of the image. It has a low channel capacity with larger receptive fields. This means the model won't FULLY capture spatial information in this branch, but since that information comes from the Detail Branch, it doesn't need to.
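The same kind of sketch shows the mirror-image design: fewer channels per stage, but much deeper down-sampling, so each output location sees a large receptive field. Again, the exact stage numbers here are assumptions, not the paper's configuration:

```python
# Hypothetical Semantic Branch: narrow channels, aggressive downsampling
# all the way to 1/32 resolution for a large receptive field.
h, w = 512, 512
semantic_stages = [(16, 4), (32, 2), (64, 2), (128, 2)]  # (channels, stride)
for c, s in semantic_stages:
    h, w = h // s, w // s
print(c, h, w)  # 128 16 16: coarse but context-rich features at 1/32 scale
```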

Aggregation Layer

Finally, the information from the two branches is merged here. Their feature maps differ in size, so the feature map from the Detail Branch is down-sampled and the one from the Semantic Branch is up-sampled before fusion.
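A minimal numpy sketch of the merging step, simplifying BiSeNet V2's bilateral guided aggregation down to its resize-and-fuse skeleton (the pooling, nearest-neighbour upsampling, and elementwise fusion here are illustrative stand-ins for the learned layers):

```python
import numpy as np

def avg_pool(x, f):            # down-size by average pooling
    c, h, w = x.shape
    return x.reshape(c, h // f, f, w // f, f).mean(axis=(2, 4))

def nearest_up(x, f):          # up-size by nearest-neighbour repetition
    return x.repeat(f, axis=1).repeat(f, axis=2)

detail = np.ones((128, 64, 64))      # Detail Branch output, 1/8 scale
semantic = np.ones((128, 16, 16))    # Semantic Branch output, 1/32 scale

low = avg_pool(detail, 4) * semantic      # fuse at 1/32: detail down-sized
high = detail * nearest_up(semantic, 4)   # fuse at 1/8: semantic up-sized
fused = high + nearest_up(low, 4)         # combine at the higher resolution
print(fused.shape)  # (128, 64, 64)
```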

Auxiliary Heads

By inserting auxiliary segmentation heads at different positions of the branch, you can boost the feature representation during training. These heads are discarded at inference, so they add no runtime cost.
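The train/inference asymmetry can be sketched as a simple branching forward pass. The function and head names below are my own placeholders, not the paper's code:

```python
# Auxiliary heads attach to intermediate features and contribute extra
# losses during training; at inference only the main head runs.
def forward(features, aux_heads, main_head, training):
    logits = main_head(features[-1])
    if not training:
        return logits, []                 # aux heads discarded at inference
    aux_logits = [head(f) for head, f in zip(aux_heads, features[:-1])]
    return logits, aux_logits             # each aux output gets its own loss

identity = lambda x: x                    # toy stand-in for a real head
main, aux = forward([1, 2, 3], [identity, identity], identity, training=False)
print(main, aux)  # 3 []
```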