Depth Estimation and Application

Lijun Wang

July 1, 2018

Topics

Architecture
Loss Function
Training Strategy
Application on RGB-D

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

By Eigen et al., NIPS 2014

Much of the error is explained by how well the mean depth is predicted
Substituting the ground-truth mean log depth into each prediction gives a 20% relative improvement
Scale invariant loss:

$D(y,y^{*}) = \sum \limits_{i,j} [(\log y_i - \log y_j) - (\log y^*_i - \log y^*_j)]^2$
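As a rough illustration of the scale-invariant loss above, the following NumPy sketch evaluates the pairwise sum directly and through an equivalent closed form, $2n\sum_i d_i^2 - 2(\sum_i d_i)^2$ with $d_i = \log y_i - \log y^*_i$. The function names are hypothetical, and the sum is left unnormalized as on the slide; dividing by $2n^2$ would give a per-pixel-pair average.

```python
import numpy as np

def scale_invariant_pairwise(y, y_star):
    """Direct O(n^2) evaluation of the pairwise sum on the slide."""
    d = np.log(y).ravel() - np.log(y_star).ravel()   # per-pixel log difference
    diff = d[:, None] - d[None, :]                   # d_i - d_j for every pair
    return np.sum(diff ** 2)

def scale_invariant_closed_form(y, y_star):
    """Equivalent O(n) form: 2n * sum(d^2) - 2 * (sum d)^2."""
    d = np.log(y).ravel() - np.log(y_star).ravel()
    n = d.size
    return 2.0 * n * np.sum(d ** 2) - 2.0 * np.sum(d) ** 2

rng = np.random.default_rng(0)
y = rng.uniform(0.5, 10.0, size=(4, 5))        # predicted depths (positive)
y_star = rng.uniform(0.5, 10.0, size=(4, 5))   # ground-truth depths (positive)

loss = scale_invariant_pairwise(y, y_star)
loss_scaled = scale_invariant_pairwise(3.0 * y, y_star)  # global rescaling of the prediction
```

Multiplying the prediction by a constant shifts every $d_i$ by the same amount, so the pairwise differences, and therefore the loss, stay the same.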

One Multi-Scale Architecture for Multi-Task

[1] Eigen et al, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015

Multi-scale architecture
Solve multiple tasks
Scale-invariant loss + Gradient loss

Deeper Depth Prediction with Fully Convolutional Residual Networks

By Laina et al, IEEE International Conference on 3D Vision 2016

Faster Up-Convolution

A Two-Stream Network for Depth Estimation

[2] Li et al, A Two-Streamed Network for Estimating Fine-Scaled Depth Maps from Single RGB Images, ICCV 2017

Set Loss

$L_{\textrm{single}} + \Omega_{\textrm{set}}$

Fusing Depth and Depth Gradient

End-to-end as refinement
Optimization

$D^* = \arg \min \limits_{D} \sum \limits _{p=1}^{N} \phi (D^p - D^p_{est}) \\ + \alpha \sum \limits_{p=1}^{N} [\phi (\nabla_x D^p - G_x^p) + \phi (\nabla_y D^p - G_y^p)]$
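To make the fusion objective above concrete, here is a minimal sketch that only evaluates the energy for a candidate depth map $D$. The slide does not define $\phi$, so a squared penalty is assumed by default; the forward-difference gradient with zero padding at the last row and column, and the function names, are also assumptions for illustration.

```python
import numpy as np

def forward_diff(D):
    """Forward differences along x (columns) and y (rows), zero at the far border."""
    gx = np.zeros_like(D)
    gy = np.zeros_like(D)
    gx[:, :-1] = D[:, 1:] - D[:, :-1]
    gy[:-1, :] = D[1:, :] - D[:-1, :]
    return gx, gy

def fusion_energy(D, D_est, Gx, Gy, alpha=1.0, phi=np.square):
    """Data term on depth plus alpha times the gradient-consistency term."""
    gx, gy = forward_diff(D)
    data_term = np.sum(phi(D - D_est))
    grad_term = np.sum(phi(gx - Gx) + phi(gy - Gy))
    return data_term + alpha * grad_term

rng = np.random.default_rng(0)
D_est = rng.random((4, 5))            # depth estimated by one stream
Gx, Gy = forward_diff(D_est)          # pretend the gradient stream agrees exactly

energy_consistent = fusion_energy(D_est, D_est, Gx, Gy)
energy_shifted = fusion_energy(D_est + 1.0, D_est, Gx, Gy)
```

A constant offset leaves the gradients untouched, so in the second call only the data term contributes (one unit per pixel under the squared penalty).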

How about training data?

Existing Depth Data Set

Dataset	Statistics	Annotation	Scene
NYUD-v2	1449 + 407K raw	Depth + Segmentation	Indoor
KITTI	94k frames	Depth aligned with raw data	Street
Make3D	500 low-resolution	Depth	Outdoor
SUNRGB-D	10k	Depth, Segmentation, 3D bounding box	Indoor

Drawbacks of Existing Data Sets

Very limited in terms of scene variety
Trained models struggle to generalize across scenes

Solution

Different training strategies:

Weakly supervised training
Unsupervised training
Semi-supervised training
Multi-task training

Single-Image Depth Perception in the Wild

By Chen et al, NIPS 2016

Motivation

Increase scene diversity with Internet images.

Challenge: How to Acquire Depth

Humans are better at judging relative depth:

“Is point A closer than point B?”

A relative depth data set

Data Collection

Gather 0.5M images from Flickr
Annotate relative depth for one pair of points per image

Learning with Relative Depth

Ranking Loss:

$L(I,R,z)=\sum \limits_k \psi(I, i_k, j_k, r, z)$

where the loss for the $k$-th query:

$\psi(I, i_k, j_k, r, z) = \begin{cases} \log (1+\exp (-z_{i_k} + z_{j_k})), & \mbox{if } r_k=+1\\ \log (1+\exp (z_{i_k} - z_{j_k})), & \mbox{if } r_k=-1 \\ (z_{i_k} - z_{j_k})^2, & \mbox{if } r_k=0 \end{cases}$
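The ranking loss above can be sketched in NumPy as follows. The function name and the representation of queries as lists of (row, col) pixel pairs are assumptions for illustration; `np.logaddexp(0, x)` is used as a numerically stable way to compute $\log(1+\exp(x))$.

```python
import numpy as np

def ranking_loss(z, pairs, relations):
    """Sum the per-query loss psi over all annotated point pairs.

    z         : 2D array of predicted depth values
    pairs     : list of ((row_i, col_i), (row_j, col_j)) pixel pairs
    relations : list of labels r_k in {+1, -1, 0}
    """
    total = 0.0
    for (i, j), r in zip(pairs, relations):
        zi, zj = z[i], z[j]
        if r == 1:
            total += np.logaddexp(0.0, -zi + zj)   # log(1 + exp(-z_i + z_j))
        elif r == -1:
            total += np.logaddexp(0.0, zi - zj)    # log(1 + exp(z_i - z_j))
        else:
            total += (zi - zj) ** 2                # equal depth: squared difference
    return total

z = np.array([[0.0, 2.0],
              [1.0, 1.0]])
pairs = [((0, 0), (0, 1)), ((1, 0), (1, 1))]

loss_a = ranking_loss(z, pairs, [+1, 0])
loss_b = ranking_loss(z, pairs, [-1, 0])
```

The two ordinal cases penalize the same pair differently: the loss is small only when the sign of $z_{i_k} - z_{j_k}$ agrees with the label.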

Generate GT Depth from Multi-view Internet Images

[3] MegaDepth: Learning Single-View Depth Prediction from Internet Photos, CVPR 2018

Landmarks10K data set with multi-view photos for each landmark
Build 3D model for each collection with SfM
Depth Reconstruction with MVS

Data Categorization: Euclidean vs. Ordinal Depth

If $\ge 30\%$ of an image's pixels have valid depth ⇒ Euclidean loss

Otherwise ⇒ ordinal loss

Determine foreground with semantic info
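The 30% rule above amounts to a simple per-image check. In this sketch the function name is hypothetical, and invalid pixels are assumed to be stored as 0 in the reconstructed depth map.

```python
import numpy as np

def choose_supervision(depth, threshold=0.30):
    """Pick the loss type from the fraction of pixels with valid depth."""
    valid_fraction = np.mean(depth > 0)   # assumption: 0 marks a missing depth value
    return "euclidean" if valid_fraction >= threshold else "ordinal"

dense = np.zeros((10, 10))
dense.flat[:40] = 1.0     # 40% of pixels carry depth

sparse = np.zeros((10, 10))
sparse.flat[:10] = 1.0    # 10% of pixels carry depth

dense_label = choose_supervision(dense)
sparse_label = choose_supervision(sparse)
```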

Loss Function

$L = L_{\mbox{data}} + \alpha L_{\mbox{grad}} + \beta L_{\mbox{ord}}$

$L_{\mbox{grad}}=\frac{1}{n} \sum \limits_k \sum \limits_i (|\nabla_x R_i^k| + |\nabla_y R_i^k|)$

where $R^k$ is the difference between predicted and ground-truth log-depth at scale $k$.
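A minimal sketch of a multi-scale gradient term of this kind is given below. It takes the absolute x and y gradients separately, as the MegaDepth paper does, with $R$ the difference between predicted and ground-truth log-depth. The function name, the use of strided subsampling for the scales, and the assumption that every pixel is valid are illustrative choices, not details from the slides.

```python
import numpy as np

def gradient_loss(pred_log_depth, gt_log_depth, num_scales=4):
    """Sum |grad_x R| + |grad_y R| over scales, divided by the pixel count n."""
    R = pred_log_depth - gt_log_depth
    n = R.size                                   # assumption: all pixels are valid
    total = 0.0
    for k in range(num_scales):
        Rk = R[::2 ** k, ::2 ** k]               # scale k by simple subsampling
        total += np.sum(np.abs(np.diff(Rk, axis=1)))   # x-direction differences
        total += np.sum(np.abs(np.diff(Rk, axis=0)))   # y-direction differences
    return total / n

gt = np.zeros((4, 4))
ramp = np.tile(np.arange(4.0), (4, 1))   # log-depth error growing linearly along x

loss_constant = gradient_loss(gt + 0.7, gt)            # constant offset in log-depth
loss_one_scale = gradient_loss(ramp, gt, num_scales=1)
loss_two_scales = gradient_loss(ramp, gt, num_scales=2)
```

A constant log-depth offset (a global scale change in depth) yields zero gradient loss, so the term penalizes only structural disagreement.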

A Similar Work

[4] Monocular Relative Depth Perception with Web Stereo Data Supervision, CVPR 2018

Collect stereo web images for depth estimation
Compute optical flow to infer disparity (depth)
Still use ordinal loss but sample point pairs online

(Absolute depth is unavailable?)

Learning Depth Estimation from Image Alignment Loss

[5] Semi-Supervised Deep Learning for Monocular Depth Map Prediction, CVPR 2017

Supervised: sparse depth supervision
Unsupervised: image alignment
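The two supervision signals above can be combined roughly as in the following sketch. It assumes a rectified stereo pair, a left-view disparity map, sparse ground truth stored with 0 for missing pixels, and L1 penalties; the function names and the weight `gamma` are hypothetical and not taken from the paper.

```python
import numpy as np

def warp_horizontal(src, disparity):
    """Sample src at x - disparity along each row, with linear interpolation."""
    h, w = src.shape
    xs = np.clip(np.arange(w)[None, :] - disparity, 0, w - 1)  # source x per pixel
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1 - frac) * src[rows, x0] + frac * src[rows, x1]

def alignment_loss(left, right, disp_left):
    """Unsupervised term: photometric error after warping right into the left view."""
    return np.mean(np.abs(left - warp_horizontal(right, disp_left)))

def semi_supervised_loss(pred_depth, sparse_depth, left, right, disp_left, gamma=1.0):
    """Supervised L1 on pixels with ground truth, plus the alignment term."""
    valid = sparse_depth > 0
    supervised = np.mean(np.abs(pred_depth[valid] - sparse_depth[valid])) if valid.any() else 0.0
    return supervised + gamma * alignment_loss(left, right, disp_left)

rng = np.random.default_rng(0)
left = rng.random((6, 8))
right = np.roll(left, -2, axis=1)          # synthetic pair with a constant disparity of 2
disp = np.full(left.shape, 2.0)
warped = warp_horizontal(right, disp)

pred = np.full(left.shape, 5.0)
sparse = np.zeros(left.shape)
sparse[0, 0] = 4.0                         # a single pixel with ground-truth depth
sup_only = semi_supervised_loss(pred, sparse, left, right, disp, gamma=0.0)
```

With the correct disparity, the warped right image reproduces the left image everywhere except near the border, where the source pixels fall outside the frame.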

Additional Works with New Training Strategies

[8] PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing, CVPR 2018

[9] AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation, CVPR 2018

[10] Salience Guided Depth Calibration for Perceptually Optimized Compressive Light Field 3D Display, CVPR 2018

Future Directions

Borrow ideas from saliency detection for depth estimation
Depth completion/refinement from sparse input
Application based on RGB-D data

Application on RGB-D

2D ⇒ 2.5D ⇒ 3D

Using depth as additional low-level cues
Solve 2D tasks via 3D reconstruction for better scene understanding

CS231A: Computer Vision, From 3D Reconstruction to Recognition

Similar Ideas using Monocular Videos

Aperture Supervision for Monocular Depth Estimation

Thank You

Any Questions?