Integration of Unsupervised and Supervised Criteria for DNNs Training
Integration of Unsupervised and Supervised Criteria for DNNs Training
International Conf. on Artificial Neural Networks
Francisco Zamora-Martínez, Francisco Javier Muñoz-Almaraz, Juan Pardo
Departamento de ciencias físicas, matemáticas y de la computación, Universidad CEU Cardenal Herrera
September 7th, 2016
Outline
1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results
   - MNIST
   - SML2010 temperature forecasting
5. Conclusions and Future Work
Motivation

Greedy layer-wise unsupervised pre-training is successful for training logistic MLPs. Two training stages:
1. Pre-training with unsupervised data (SAEs or RBMs)
2. Fine-tuning parameters with supervised data

Very useful when large unsupervised data is available. But…
- It is a greedy approach
- Not valid for on-line learning scenarios
- Not as useful with small data sets
Motivation

Goals:
- Train a supervised model, layer-wise conditioned by an unsupervised loss
- Improve gradient flow
- Learn better features

Every layer's parameters should be:
- Useful for the global supervised task
- Able to reconstruct their input (auto-encoders)
Motivation

Related works:
- Is Joint Training Better for Deep Auto-Encoders?, Y. Zhou et al. (2015), arXiv paper (fine-tuning stage for supervision)
- Preliminary work by P. Vincent et al. (2010), Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
- Deep learning via Semi-Supervised Embedding, Weston et al. (2008), ICML paper
Method description

How to do it: Risk = Supervised Loss + Sum over layers(Unsup. Loss)

R(θ, D) = (1 / |D|) ∑_{(x,y)∈D} [ λ_0 L_s(F(x; θ), y) + ∑_{k=1}^{H} λ_k U^{(k)} ] + ϵ Ω(θ)

U^{(k)} = L_u(A_k(h^{(k−1)}; θ), h^{(k−1)}) for 1 ≤ k ≤ H,   λ_k ≥ 0

where:
- F(x; θ) is the MLP model
- A_k(h^{(k−1)}; θ) is a denoising AE model
- H is the number of hidden layers
- h^{(0)} = x
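The risk above (a supervised loss plus λ-weighted per-layer reconstruction losses) can be sketched for a single sample as follows. This is a minimal NumPy illustration, not the authors' implementation: the tied-weight decoder, the 20% masking noise, and the function names are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combined_risk(x, y, Ws, bs, lambdas, rng):
    """R for one sample: lambdas[0] * supervised cross-entropy plus
    lambdas[k] * denoising-AE reconstruction loss at each hidden layer.
    Tied decoder weights (W.T) and masking noise are illustrative choices."""
    h = x  # h^(0) = x, assumed to lie in [0, 1]
    unsup = 0.0
    for k, (W, b) in enumerate(zip(Ws[:-1], bs[:-1]), start=1):
        noisy = h * (rng.random(h.shape) > 0.2)  # masking (denoising) noise
        code = sigmoid(noisy @ W + b)            # encoder shares the MLP layer
        recon = sigmoid(code @ W.T)              # tied-weight decoder
        unsup += lambdas[k] * np.mean(           # cross-entropy reconstruction
            -(h * np.log(recon + 1e-9) + (1 - h) * np.log(1 - recon + 1e-9)))
        h = sigmoid(h @ W + b)                   # clean forward pass continues
    logits = h @ Ws[-1] + bs[-1]                 # softmax output layer
    p = np.exp(logits - logits.max())
    p /= p.sum()
    sup = -np.log(p[y] + 1e-9)                   # supervised cross-entropy
    return lambdas[0] * sup + unsup
```

Setting every λ_k (k ≥ 1) to zero recovers a purely supervised MLP, so the unsupervised terms only ever add a non-negative penalty.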
Method description (continued)

The λ vector mixes all the components:
- It should be updated every iteration
- Starting focused on the unsupervised criteria
- Ending focused on the supervised criterion
Method description
λ Update Policies I

A λ update policy indicates how to change the λ vector every iteration:
- The supervised part (λ_0) can be fixed to 1
- The unsupervised part should be important during the first iterations, losing focus while training and becoming insignificant at the end

A greedy exponential decay (GED) will suffice:

λ_0(t) = 1 ;   λ_k(t) = Λ γ^t

with constants Λ > 0 and γ ∈ [0, 1].
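The GED schedule is simple enough to state in a few lines. A sketch, assuming (as the slides suggest) that the same decayed value is shared by all H hidden layers; the function name and defaults are illustrative:

```python
def ged_lambdas(t, Lam=1.0, gamma=0.999, H=3):
    """Greedy exponential decay: the supervised weight lambda_0 stays at 1,
    while every hidden layer's unsupervised weight decays as Lam * gamma**t."""
    return [1.0] + [Lam * gamma ** t] * H
```

With γ = 0.999 the unsupervised weight falls to roughly a third of Λ after 1000 iterations, so early training is dominated by reconstruction and late training by the supervised criterion.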
λ Update Policies II

Exponential decay is the simplest approach, but other policies are possible:
- Ratio between loss functions
- Ratio between gradients at each layer
- A combination of them…
Experiments and Results (MNIST) I

Benchmark with the MNIST dataset:
- Logistic activation functions, softmax output
- Cross-entropy for supervised and unsupervised losses
- Classification error as evaluation measure
- Effect of MLP topology and the initial Λ value of λ_k
- Sensitivity study of the γ exponential decay term
- Comparison with other literature models
Experiments and Results (MNIST) II

Test error (%) with 95% confidence intervals:

Data set   SAE-3       SDAE-3      GED-3
MNIST      1.40±0.23   1.28±0.22   1.22±0.22
basic      3.46±0.16   2.84±0.15   2.72±0.14

SAE-3 and SDAE-3 results taken from Vincent et al. (2010).
Experiments and Results (MNIST) III

Hyper-parameters grid search (validation set). [Figure: MNIST validation error (%), roughly 0.8–2%, vs. layer size (256, 512, 1024, 2048) at depths 1–5, for Λ values 0.00000, 0.00001, 0.00100, 0.20000, 0.60000, 1.00000, 3.00000, 5.00000]
Experiments and Results (MNIST) IV

γ exponential decay term (validation set). [Figure: MNIST validation error (%), roughly 1.0–1.5%, vs. decay γ ∈ [0.5, 1], with a detail view over γ ∈ [0.997, 1]]
Experiments and Results (MNIST) V

First layer filters (16 of 2048 units). [Figure: only supervised vs. γ = 0.999 vs. γ = 1.000]
Experiments and Results (SML2010) I

SML2010 UCI data set: indoor temperature forecasting
- Logistic hidden activation functions, linear output
- 48 inputs (12 hours) and 12 outputs (3 hours)
- Mean Square Error as supervised loss
- Cross-entropy unsupervised losses
- Mean Absolute Error (MAE) for evaluation
- Compared MLPs with/without unsupervised losses
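The 48-inputs/12-outputs setup corresponds to sliding windows over a 15-minute temperature series (48 steps = 12 h, 12 steps = 3 h). A minimal sketch of how such windows could be built; the function name and unit stride are assumptions, not taken from the paper:

```python
import numpy as np

def make_windows(series, n_in=48, n_out=12):
    """Build (past, future) sliding windows over a 1-D series:
    each row of X holds n_in consecutive readings, each row of Y
    the n_out readings that immediately follow."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i : i + n_in])
        Y.append(series[i + n_in : i + n_in + n_out])
    return np.array(X), np.array(Y)
```

Each window pair is then one supervised example for the MLP, while the inputs alone feed the per-layer unsupervised reconstruction losses.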
Experiments and Results (SML2010) II

Depth   Size   MAE (Λ = 0)   MAE (Λ = 1)
3       32     0.1322        0.1266
3       64     0.1350        0.1257
3       128    0.1308        0.1292
3       512    0.6160        0.1312

Validation set results; statistically significant improvements shown in red in the original slide.

Λ = 0: DNN trained only with the supervised loss. Λ = 1: DNN trained with supervised and unsupervised losses.
Experiments and Results (SML2010) III

Test results for the 3-layer model with 64 neurons per hidden layer:
- MAE 0.1274 when Λ = 0
- MAE 0.1177 when Λ = 1

With Λ = 1 it was possible to train DNNs of up to 10 layers with 64 hidden units per layer, with MAE in the range [0.1274, 0.1331].
Conclusions

- One-stage training of deep models combining supervised and unsupervised loss functions
- Comparable with greedy layer-wise unsupervised pre-training + fine-tuning
- The approach successfully trains deep MLPs with logistic activations
- Decaying the unsupervised loss during training is crucial
- Time-series results encourage further research of this idea in on-line learning scenarios
Future Work

- Better filters and models? Further research needed
- Study the effect using ReLU activations
- Study alternatives to exponential decay of the unsupervised loss: dynamic adaptation
The End

Thanks for your attention! Questions?