What are neural networks (CNNs, RNNs, SAEs, DBNs)?
How do we distinguish deep models from shallow models?
Are SVMs, three-layer CNNs, and 20-layer CNNs deep or shallow models?
A neural network with at least one hidden layer can represent any function to an arbitrary degree of accuracy, so long as its hidden layer is permitted to have enough units. [1] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 1989.
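As a toy illustration of this statement, the sketch below fits a one-hidden-layer network to a 1-D target function. The target (sin), the hidden width, and the training budget are arbitrary choices for the example, not taken from the cited paper.

```python
import torch
import torch.nn as nn

# One hidden layer, many units: fit y = sin(x) on [-3, 3] to small error.
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x)

net = nn.Sequential(nn.Linear(1, 128), nn.Tanh(), nn.Linear(128, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()   # mean squared error
    loss.backward()
    opt.step()

print(loss.item())   # approaches zero as width and training increase
```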
5 Conv + 3 fully connected
Global average pooling[2] before fully connected layers.
Multiple auxiliary loss layers to strengthen gradient back-propagation and encourage discrimination in the lower layers.
Inception modules (see the sketch below).
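A minimal sketch combining the Inception-module idea with global average pooling before the classifier (a toy under assumptions, not the actual GoogLeNet configuration; channel and class counts are placeholders).

```python
import torch
import torch.nn as nn

class TinyInception(nn.Module):
    def __init__(self, in_ch, branch_ch=16, n_classes=10):
        super().__init__()
        # parallel convolutions at several receptive-field sizes, concatenated
        self.b1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.gap = nn.AdaptiveAvgPool2d(1)              # global average pooling
        self.fc = nn.Linear(3 * branch_ch, n_classes)   # single small classifier

    def forward(self, x):
        x = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.fc(self.gap(torch.relu(x)).flatten(1))

logits = TinyInception(3)(torch.randn(2, 3, 32, 32))    # -> shape (2, 10)
```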
Degradation problem: as network depth increases, accuracy gets saturated and then degrades.
Fit a residual mapping $\mathcal{F}(\mathbf{x}):=\mathcal{H}(\mathbf{x})-\mathbf{x}$, rather than fitting the original mapping $\mathcal{H}(\mathbf{x})$.
Identity mapping via shortcut connections (see the residual block sketch below).
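A minimal residual block sketch in the standard two-convolution form (an assumed generic variant, not tied to a specific ResNet configuration): the stacked layers learn $\mathcal{F}(\mathbf{x})$ and the identity shortcut adds $\mathbf{x}$ back, so the block outputs $\mathcal{F}(\mathbf{x})+\mathbf{x}\approx\mathcal{H}(\mathbf{x})$.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(                       # learns F(x) = H(x) - x
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)               # F(x) + x: identity shortcut

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))        # same shape in and out
```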
"In spite of the seemly different underlying principles, most of the well known neural network models are implicitly equivalent or similar to classical statistical pattern recognition methods" Jain TPAMI 2000.
"Neural networks are statistics for amateurs… Most neural networks conceal the statistics from the user" Anderson, 1990.
Given image $I$, predict pixel labels $X=\{x_0,\ldots,x_n\}$.
CNN models the distribution by $Q(X|\theta,I)=\prod_i q_i(x_i|\theta,I)$,
where $q_i(x_i|\theta,I)=\frac{1}{Z_i}\exp(f_i(x_i;\theta,I))$
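A minimal sketch of this factorized pixel-wise model, assuming the CNN scores $f_i(x_i;\theta,I)$ are given as an array of shape (classes, H, W); the function names here are illustrative only.

```python
import numpy as np

def pixelwise_q(f):
    """Per-pixel softmax q_i(x_i) = exp(f_i(x_i)) / Z_i over the class axis."""
    f = f - f.max(axis=0, keepdims=True)        # numerical stability
    e = np.exp(f)
    return e / e.sum(axis=0, keepdims=True)     # Z_i normalizes each pixel

def log_Q(q, X):
    """log Q(X | theta, I) = sum_i log q_i(x_i) for a labeling X of shape (H, W)."""
    H, W = X.shape
    return float(np.log(q[X, np.arange(H)[:, None], np.arange(W)[None, :]]).sum())

C, H, W = 3, 4, 5
q = pixelwise_q(np.random.randn(C, H, W))
print(log_Q(q, np.zeros((H, W), dtype=int)))    # score of the all-class-0 labeling
```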
For fully supervised training:
For weakly supervised training:
However, this is hard to optimize directly.
Introduce a latent probability distribution $P(X)$:
where the KL-divergence $D(p(x)\|q(x))=\sum_x p(x)\log \frac{p(x)}{q(x)}$ measures the distance between two distributions.
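A direct transcription of this definition as code (natural logarithm; assumes $p$ and $q$ are strictly positive and each sums to one):

```python
import numpy as np

def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) * log(p(x) / q(x))
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

print(kl_divergence([0.7, 0.3], [0.5, 0.5]))   # ≈ 0.082
```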
The suppression constraint suppresses any label $l$ that does not appear in the image:
The foreground constraint encourages labels that do appear in the image:
Compare with the multiple instance learning (MIL) paradigm.
The background constraint constrains the background region:
The size constraint puts an upper bound on classes that are guaranteed to be small:
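The constraints above are all linear in the expected label counts of the output distribution; in the method they are enforced by constrained optimization over the latent distribution $P(X)$ rather than checked directly. The sketch below only evaluates whether a given pixel-wise distribution satisfies them, and every threshold value is a placeholder assumption.

```python
import numpy as np

def check_constraints(Q, present, background=0, fg_lower=0.05,
                      bg_lower=0.3, bg_upper=0.7,
                      small_upper=0.1, small_classes=()):
    """Q: (n_pixels, n_classes) pixel-wise distribution; present: image-level labels."""
    counts = Q.sum(axis=0) / Q.shape[0]          # expected class counts as image fractions
    ok = True
    for l in range(Q.shape[1]):
        if l != background and l not in present:
            ok &= counts[l] <= 1e-3              # suppression: absent labels get ~no mass
        if l in present:
            ok &= counts[l] >= fg_lower          # foreground: present labels cover some area
        if l in small_classes:
            ok &= counts[l] <= small_upper       # size: small classes stay bounded
    ok &= bg_lower <= counts[background] <= bg_upper   # background bounds
    return bool(ok)
```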
VGG + Fully Connected CRF. Constrained optimization is performed on the coarse maps generated by VGG.
G. Papandreou (Google, Inc.), L. Chen (UCLA), K. Murphy (Google, Inc.), A. Yuille (UCLA).
Formulate training as a hard-EM approximation, with the complete-data log likelihood:
E-step: update the latent segmentation (sketched below).
M-step: maximize $\mathcal{Q}(\theta, \theta^{old})$ using stochastic gradient descent.
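A schematic hard E-step under assumptions: per-pixel CNN scores for an image are combined with its image-level labels by boosting the classes known to be present and disallowing all others, and the latent segmentation is the resulting per-pixel argmax. The additive boost and its value are placeholders for how the method biases observed classes, not the authors' exact rule.

```python
import numpy as np

def hard_e_step(scores, present, bias=3.0, background=0):
    """scores: float array (n_classes, H, W) of CNN scores; present: image-level labels."""
    adapted = np.full_like(scores, -np.inf)          # disallow classes not in the image
    for c in set(present) | {background}:
        boost = bias if c in present and c != background else 0.0
        adapted[c] = scores[c] + boost
    return adapted.argmax(axis=0)                    # latent segmentation, shape (H, W)

# M-step (not shown): treat this estimated segmentation as ground truth and take
# SGD steps on the usual per-pixel cross-entropy, i.e. maximize Q w.r.t. theta.
```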
In summary:
$\arg \max \limits_{\theta} \sum \limits_{k=1}^{K} P(\alpha=k|X,\theta^{old}) \log \left[ P(\alpha=k)\, P(Y|X,\alpha=k,\theta) \right]$
$\mathcal{Q}(\theta,\theta^{old}) = \sum \limits_{k=1}^{K} P(\alpha=k|X,\theta^{old}) \log \left[ P(\alpha=k)\, P(Y|X,\alpha=k,\theta) \right]$
E-step: compute $P(\alpha=k|X,\theta^{old})$ given $\theta^{old}$
M-step: maximize $\mathcal{Q}(\theta, \theta^{old})$ with respect to $\theta$ (a toy numeric example follows).
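To make the formulas above concrete, here is a runnable toy (an illustrative assumption, not from the slides): the latent $\alpha \in \{1,\ldots,K\}$ selects which of $K$ candidate means generated the observations $Y$, and $\theta$ is a shared noise scale. The E-step computes the posterior over $\alpha$, and the M-step maximizes $\mathcal{Q}$ by a simple grid search standing in for SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, scale=0.5, size=100)     # observations

K = 2
means = np.array([0.0, 2.0])                     # candidate means indexed by alpha
prior = np.full(K, 1.0 / K)                      # P(alpha = k)

def log_lik(theta):
    # log P(Y | X, alpha=k, theta) for each k, up to an additive constant
    return np.array([np.sum(-0.5 * ((Y - m) / theta) ** 2 - np.log(theta)) for m in means])

def e_step(theta_old):
    # posterior P(alpha=k | X, theta_old) by Bayes' rule
    log_joint = np.log(prior) + log_lik(theta_old)
    post = np.exp(log_joint - log_joint.max())
    return post / post.sum()

def q_function(theta, post):
    # Q(theta, theta_old) = sum_k P(alpha=k | X, theta_old) log[P(alpha=k) P(Y|X,alpha=k,theta)]
    return np.sum(post * (np.log(prior) + log_lik(theta)))

post = e_step(theta_old=1.0)                     # E-step
grid = np.linspace(0.1, 2.0, 200)                # M-step via grid search
theta_new = grid[np.argmax([q_function(t, post) for t in grid])]
print(theta_new)                                 # close to the true noise scale 0.5
```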
For dictionary learning:
and inference:
For network training:
and inference: