346. CaDDN

Depth Estimation

The main challenge in monocular 3D object detection lies in accurately predicting object depth. CaDDN (Categorical Depth Distribution Network) predicts a categorical depth distribution for each pixel and uses it to project image features to the appropriate depth interval in 3D space.

Approaches

There are several ways to use estimated depth when projecting image features into 3D space.

  1. The TOO BIG Approach
    Projects each pixel's features to all possible depths. Because the same features are repeated along the whole camera ray, it is difficult to locate objects accurately within the 3D projection.
  2. The TOO SMALL Approach
    Projects each pixel's features to a single depth within the 3D space. This suffers when the depth estimate is poor, because the features end up in the wrong location.

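Considering the above, the paper proposes a JUST RIGHT approach: each pixel's features are projected along its camera ray and weighted by the estimated depth probabilities, so confident depth estimates concentrate the features and uncertain ones spread them out.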
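
The three approaches can be viewed as three ways of weighting one pixel's feature across its depth bins. The following is a minimal NumPy sketch under that reading; the helper name `depth_weights`, the mode names, and the example logits are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def depth_weights(logits, mode):
    """Return per-depth-bin weights for one pixel's feature.

    logits: 1-D scores over depth bins for a single pixel (hypothetical values).
    mode:   "all_depths"  (TOO BIG: same feature copied to every bin),
            "single"      (TOO SMALL: feature placed only in the best bin),
            "categorical" (JUST RIGHT: feature weighted by softmax probabilities).
    """
    logits = np.asarray(logits, dtype=float)
    if mode == "all_depths":
        return np.ones_like(logits)
    if mode == "single":
        weights = np.zeros_like(logits)
        weights[np.argmax(logits)] = 1.0
        return weights
    if mode == "categorical":
        # Numerically stable softmax over the depth bins.
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()
    raise ValueError(f"unknown mode: {mode}")

# Example: three depth bins, with the middle bin scored highest.
example_logits = [0.0, 2.0, 1.0]
for mode in ("all_depths", "single", "categorical"):
    print(mode, depth_weights(example_logits, mode))
```

With the categorical weights, a wrong but uncertain estimate still leaves some feature mass in the neighbouring bins, which is the behaviour the single-depth approach lacks.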

Architecture

The network takes the following steps.

  1. Frustum Feature Network
    Predicts the image features and the per-pixel depth distributions using convolutional networks. The two outputs are then combined with an outer product, which weights each pixel's features by its depth probabilities, to create the Frustum Features.
  2. Frustum to Voxel Transformation
    The Frustum Features are sampled using trilinear interpolation to populate the Voxel Features.
  3. Voxel Collapse
    The Voxel Features are collapsed along the vertical axis into a Bird's-Eye-View (BEV) Feature to be used as input for the 3D Object Detector.
  4. 3D Object Detector
    Predicts the final 3D bounding boxes from the BEV Feature.
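
Steps 1 and 3 above can be sketched in NumPy on toy tensors. This is a shape-level illustration only: the tensor sizes, axis order, and random weights are assumptions, the trilinear sampling of step 2 is replaced by a random stand-in array, and the detector of step 4 is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: C channels, D depth bins, H x W image feature map.
C, D, H, W = 4, 8, 3, 5

# Step 1: stand-ins for the two convolutional outputs.
image_features = rng.normal(size=(C, H, W))
depth_logits = rng.normal(size=(D, H, W))

# Softmax over the depth axis gives a categorical distribution per pixel.
shifted = depth_logits - depth_logits.max(axis=0, keepdims=True)
depth_probs = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)

# Per-pixel outer product: each feature vector is weighted by its depth probabilities.
frustum_features = np.einsum("chw,dhw->cdhw", image_features, depth_probs)

# Step 2 (not implemented here): a random array stands in for the voxel
# features that trilinear sampling of the frustum would produce.
Z, Y, X = 2, 6, 6  # Z is taken as the vertical axis in this sketch
voxel_features = rng.normal(size=(C, Z, Y, X))

# Step 3: fold the vertical axis into the channels, then reduce the channels
# with a per-cell linear map (the equivalent of a 1x1 convolution).
stacked = voxel_features.reshape(C * Z, Y, X)
reduce_weights = rng.normal(size=(C, C * Z))  # hypothetical, untrained weights
bev_features = np.einsum("oc,cyx->oyx", reduce_weights, stacked)

print(frustum_features.shape)  # (C, D, H, W)
print(bev_features.shape)      # (C, Y, X)
```

Because the depth probabilities of each pixel sum to one, summing the frustum features over the depth axis recovers the original image features, which is a quick sanity check on the outer product.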

Reference

Categorical Depth Distribution Network for Monocular 3D Object Detection