DETR
Modern object detectors predict the set of bounding boxes and category labels for the objects of interest indirectly, by defining surrogate regression and classification problems on a large set of proposals. As a result, their performance relies heavily on post-processing steps that collapse near-duplicate predictions.
DETR (DEtection TRansformer) was designed to simplify these pipelines and output the final set of predictions directly.
Overall Architecture
DETR consists of four main components.
1. CNN Backbone
This stage generates a lower-resolution activation map from the input image.
2. Encoder
This is a standard transformer encoder, in which each layer consists of a multi-head self-attention module and a feed-forward network.
3. Decoder
Mostly the same as the standard transformer decoder, but it decodes all of the “object queries” in parallel rather than one at a time.
4. Heads
Makes N predictions, where N is set to be much larger than the typical number of objects in a picture. Each prediction is either A) a class and a bounding box, or B) “no object”.
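The four components above can be traced with a small shape walkthrough. This is only a sketch in plain Python, not the reference implementation: the function names `detr_shapes` and `decode_predictions` are hypothetical, and the default values (a backbone stride of 32 with 2048 output channels, a transformer width of 256, N = 100 object queries, 91 classes) are assumptions chosen to resemble a common ResNet-50 and COCO setup. Placing "no object" at the last class index is also an assumption made for illustration.

```python
import math


def detr_shapes(img_h, img_w, stride=32, backbone_channels=2048,
                d_model=256, num_queries=100, num_classes=91):
    """Trace tensor shapes through the four stages (batch dimension omitted).

    All default values are illustrative assumptions, not fixed by the text.
    """
    # 1. CNN backbone: a lower-resolution activation map of shape (C, H, W).
    h, w = math.ceil(img_h / stride), math.ceil(img_w / stride)
    backbone = (backbone_channels, h, w)
    # 2. Encoder: the map is flattened into a sequence of H*W feature vectors.
    encoder = (h * w, d_model)
    # 3. Decoder: N object queries are processed in parallel, one output each.
    decoder = (num_queries, d_model)
    # 4. Heads: per query, class scores (plus one "no object" slot) and a box.
    class_logits = (num_queries, num_classes + 1)
    boxes = (num_queries, 4)
    return {"backbone": backbone, "encoder": encoder, "decoder": decoder,
            "class_logits": class_logits, "boxes": boxes}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode_predictions(class_logits, boxes):
    """Keep only the slots whose most likely label is a real class.

    Assumes the last class index stands for "no object".
    Returns a list of (label, score, box) tuples.
    """
    kept = []
    for logits, box in zip(class_logits, boxes):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != len(probs) - 1:  # drop "no object" slots
            kept.append((label, probs[label], box))
    return kept


if __name__ == "__main__":
    for stage, shape in detr_shapes(800, 1216).items():
        print(stage, shape)
```

Because every one of the N slots is either a real detection or "no object", turning the head outputs into final detections is a simple filter over the slots, with no duplicate-removal pass over overlapping boxes.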