What are neural networks (CNNs, RNNs, SAEs, DBNs)?
How do we distinguish deep models from shallow models?
Are SVMs, three-layer CNNs, and 20-layer CNNs deep or shallow models?
A neural network with at least one hidden layer can represent any function to an arbitrary degree of accuracy, so long as its hidden layer is permitted to have enough units. [1] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 1989.
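As a toy illustration of this statement, the sketch below fits a one-hidden-layer network to a 1-D target function. The target (sin), the hidden width, and the training budget are arbitrary choices for the example, not taken from the cited paper.

```python
import torch
import torch.nn as nn

# One hidden layer, many units: fit y = sin(x) on [-3, 3] to small error.
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x)

net = nn.Sequential(nn.Linear(1, 128), nn.Tanh(), nn.Linear(128, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()   # mean squared error
    loss.backward()
    opt.step()

print(loss.item())   # approaches zero as width and training increase
```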
5 Conv + 3 fully connected
Global average pooling[2] before fully connected layers.
Multiple auxiliary loss layers to strengthen gradient back-propagation and encourage discrimination in the lower layers.
Inception modules (see the sketch below).
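A minimal sketch combining the Inception-module idea with global average pooling before the classifier (a toy under assumptions, not the actual GoogLeNet configuration; channel and class counts are placeholders).

```python
import torch
import torch.nn as nn

class TinyInception(nn.Module):
    def __init__(self, in_ch, branch_ch=16, n_classes=10):
        super().__init__()
        # parallel convolutions at several receptive-field sizes, concatenated
        self.b1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.gap = nn.AdaptiveAvgPool2d(1)              # global average pooling
        self.fc = nn.Linear(3 * branch_ch, n_classes)   # single small classifier

    def forward(self, x):
        x = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.fc(self.gap(torch.relu(x)).flatten(1))

logits = TinyInception(3)(torch.randn(2, 3, 32, 32))    # -> shape (2, 10)
```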
Degradation problem: as network depth increases, accuracy gets saturated and then degrades.
Fit a residual mapping $\mathcal{F}(\mathbf{x}):=\mathcal{H}(\mathbf{x})-\mathbf{x}$, rather than fitting the original mapping $\mathcal{H}(\mathbf{x})$.
Identity mapping via shortcut connections (see the residual block sketch below).
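A minimal residual block sketch in the standard two-convolution form (an assumed generic variant, not tied to a specific ResNet configuration): the stacked layers learn $\mathcal{F}(\mathbf{x})$ and the identity shortcut adds $\mathbf{x}$ back, so the block outputs $\mathcal{F}(\mathbf{x})+\mathbf{x}\approx\mathcal{H}(\mathbf{x})$.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(                       # learns F(x) = H(x) - x
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)               # F(x) + x: identity shortcut

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))        # same shape in and out
```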
"In spite of the seemly different underlying principles, most of the well known neural network models are implicitly equivalent or similar to classical statistical pattern recognition methods" Jain TPAMI 2000.
"Neural networks are statistics for amateurs… Most neural networks conceal the statistics from the user" Anderson, 1990.
Given image $I$, predict pixel labels $X=\{x_0,\ldots,x_n\}$.
CNN models the distribution by $Q(X|\theta,I)=\prod_i q_i(x_i|\theta,I)$,
where $q_i(x_i|\theta,I)=\frac{1}{Z_i}\exp(f_i(x_i;\theta,I))$
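A minimal sketch of this factorized pixel-wise model, assuming the CNN scores $f_i(x_i;\theta,I)$ are given as an array of shape (classes, H, W); the function names here are illustrative only.

```python
import numpy as np

def pixelwise_q(f):
    """Per-pixel softmax q_i(x_i) = exp(f_i(x_i)) / Z_i over the class axis."""
    f = f - f.max(axis=0, keepdims=True)        # numerical stability
    e = np.exp(f)
    return e / e.sum(axis=0, keepdims=True)     # Z_i normalizes each pixel

def log_Q(q, X):
    """log Q(X | theta, I) = sum_i log q_i(x_i) for a labeling X of shape (H, W)."""
    H, W = X.shape
    return float(np.log(q[X, np.arange(H)[:, None], np.arange(W)[None, :]]).sum())

C, H, W = 3, 4, 5
q = pixelwise_q(np.random.randn(C, H, W))
print(log_Q(q, np.zeros((H, W), dtype=int)))    # score of the all-class-0 labeling
```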
For fully supervised training:
For weakly supervised training:
However, this is hard to optimize directly.
Introduce a latent probability distribution $P(X)$:
where the KL-divergence $D(p(x)\|q(x))=\sum_x p(x)\log \frac{p(x)}{q(x)}$ measures the distance between two distributions.
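A direct transcription of this definition as code (natural logarithm; assumes $p$ and $q$ are strictly positive and each sums to one):

```python
import numpy as np

def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) * log(p(x) / q(x))
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

print(kl_divergence([0.7, 0.3], [0.5, 0.5]))   # ≈ 0.082
```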
The suppression constraint suppresses any label $l$ that does not appear in the image:
The foreground constraint encourages labels that do appear in the image:
Compare with the multiple instance learning (MIL) paradigm.
The background constraint constrains the background region:
The size constraint puts an upper bound on classes that are guaranteed to be small:
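The constraints above are all linear in the expected label counts of the output distribution; in the method they are enforced by constrained optimization over the latent distribution $P(X)$ rather than checked directly. The sketch below only evaluates whether a given pixel-wise distribution satisfies them, and every threshold value is a placeholder assumption.

```python
import numpy as np

def check_constraints(Q, present, background=0, fg_lower=0.05,
                      bg_lower=0.3, bg_upper=0.7,
                      small_upper=0.1, small_classes=()):
    """Q: (n_pixels, n_classes) pixel-wise distribution; present: image-level labels."""
    counts = Q.sum(axis=0) / Q.shape[0]          # expected class counts as image fractions
    ok = True
    for l in range(Q.shape[1]):
        if l != background and l not in present:
            ok &= counts[l] <= 1e-3              # suppression: absent labels get ~no mass
        if l in present:
            ok &= counts[l] >= fg_lower          # foreground: present labels cover some area
        if l in small_classes:
            ok &= counts[l] <= small_upper       # size: small classes stay bounded
    ok &= bg_lower <= counts[background] <= bg_upper   # background bounds
    return bool(ok)
```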
VGG + Fully Connected CRF. Constrained optimization is performed on the coarse maps generated by VGG.
G. Papandreou (Google, Inc.), L. Chen (UCLA), K. Murphy (Google, Inc.), A. Yuille (UCLA).
Formulate training as a hard-EM approximation, with the complete-data log likelihood:
E-step: update the latent segmentation (sketched below).
M-step: maximize $\mathcal{Q}(\theta, \theta^{old})$ using stochastic gradient descent.
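A schematic hard E-step under assumptions: per-pixel CNN scores for an image are combined with its image-level labels by boosting the classes known to be present and disallowing all others, and the latent segmentation is the resulting per-pixel argmax. The additive boost and its value are placeholders for how the method biases observed classes, not the authors' exact rule.

```python
import numpy as np

def hard_e_step(scores, present, bias=3.0, background=0):
    """scores: float array (n_classes, H, W) of CNN scores; present: image-level labels."""
    adapted = np.full_like(scores, -np.inf)          # disallow classes not in the image
    for c in set(present) | {background}:
        boost = bias if c in present and c != background else 0.0
        adapted[c] = scores[c] + boost
    return adapted.argmax(axis=0)                    # latent segmentation, shape (H, W)

# M-step (not shown): treat this estimated segmentation as ground truth and take
# SGD steps on the usual per-pixel cross-entropy, i.e. maximize Q w.r.t. theta.
```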
In summary:
$\arg \max \limits_{\theta} \sum \limits_{k=1}^{K} P(\alpha=k|X,\theta^{old}) \log \left[ P(\alpha=k)\, P(Y|X,\alpha=k,\theta) \right]$
$\mathcal{Q}(\theta,\theta^{old}) = \sum \limits_{k=1}^{K} P(\alpha=k|X,\theta^{old}) \log \left[ P(\alpha=k)\, P(Y|X,\alpha=k,\theta) \right]$
E-step: compute $P(\alpha=k|X,\theta^{old})$ given $\theta^{old}$
M-step: maximize $\mathcal{Q}(\theta, \theta^{old})$ with respect to $\theta$ (a toy numeric example follows).
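To make the formulas above concrete, here is a runnable toy (an illustrative assumption, not from the slides): the latent $\alpha \in \{1,\ldots,K\}$ selects which of $K$ candidate means generated the observations $Y$, and $\theta$ is a shared noise scale. The E-step computes the posterior over $\alpha$, and the M-step maximizes $\mathcal{Q}$ by a simple grid search standing in for SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, scale=0.5, size=100)     # observations

K = 2
means = np.array([0.0, 2.0])                     # candidate means indexed by alpha
prior = np.full(K, 1.0 / K)                      # P(alpha = k)

def log_lik(theta):
    # log P(Y | X, alpha=k, theta) for each k, up to an additive constant
    return np.array([np.sum(-0.5 * ((Y - m) / theta) ** 2 - np.log(theta)) for m in means])

def e_step(theta_old):
    # posterior P(alpha=k | X, theta_old) by Bayes' rule
    log_joint = np.log(prior) + log_lik(theta_old)
    post = np.exp(log_joint - log_joint.max())
    return post / post.sum()

def q_function(theta, post):
    # Q(theta, theta_old) = sum_k P(alpha=k | X, theta_old) log[P(alpha=k) P(Y|X,alpha=k,theta)]
    return np.sum(post * (np.log(prior) + log_lik(theta)))

post = e_step(theta_old=1.0)                     # E-step
grid = np.linspace(0.1, 2.0, 200)                # M-step via grid search
theta_new = grid[np.argmax([q_function(t, post) for t in grid])]
print(theta_new)                                 # close to the true noise scale 0.5
```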
For dictionary learning:
and inference:
For network training:
and inference: