Integration of Unsupervised and Supervised Criteria for DNNs Training


Transcript of Integration of Unsupervised and Supervised Criteria for DNNs Training

Page 1: Integration of Unsupervised and Supervised Criteria for DNNs Training

Integration of Unsupervised and Supervised Criteria for DNNs Training

International Conf. on Artificial Neural Networks

Francisco Zamora-Martínez, Francisco Javier Muñoz-Almaraz, Juan Pardo

Department of Physical, Mathematical and Computer Sciences, Universidad CEU Cardenal Herrera

September 7th, 2016

Page 2: Integration of Unsupervised and Supervised Criteria for DNNs Training

Outline

1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results: MNIST, SML2010 temperature forecasting
5. Conclusions and Future Work

Page 3: Integration of Unsupervised and Supervised Criteria for DNNs Training

Outline

1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results: MNIST, SML2010 temperature forecasting
5. Conclusions and Future Work

Page 4: Integration of Unsupervised and Supervised Criteria for DNNs Training

Motivation

Greedy layer-wise unsupervised pre-training is successful for training logistic MLPs: two training stages

1. Pre-training with unsupervised data (SAEs or RBMs)
2. Fine-tuning parameters with supervised data

Very useful when large unsupervised data is available. But…

It is a greedy approach
Not valid for on-line learning scenarios
Not as useful with small data sets

Page 5: Integration of Unsupervised and Supervised Criteria for DNNs Training

Motivation

Goals

Train a supervised model, layer-wise conditioned by unsupervised losses:

Improving gradient flow
Learning better features

Every layer's parameters should be:

Useful for the global supervised task
Able to reconstruct their input (Auto-Encoders)

Page 6: Integration of Unsupervised and Supervised Criteria for DNNs Training

Motivation

Related works

Is Joint Training Better for Deep Auto-Encoders?, by Y. Zhou et al. (2015), paper at arXiv (fine-tuning stage for supervision)

Preliminary work done by P. Vincent et al. (2010), Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

Deep Learning via Semi-Supervised Embedding, by Weston et al. (2008), ICML paper

Page 7: Integration of Unsupervised and Supervised Criteria for DNNs Training

Outline

1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results: MNIST, SML2010 temperature forecasting
5. Conclusions and Future Work

Page 8: Integration of Unsupervised and Supervised Criteria for DNNs Training

Method description

How to do it: Risk = Supervised Loss + Sum over layers of Unsupervised Loss

$$R(\theta, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \Big[ \lambda_0 L_s(F(x;\theta), y) + \sum_{k=1}^{H} \lambda_k U^{(k)} \Big] + \epsilon\, \Omega(\theta)$$

$$U^{(k)} = L_u(A_k(h^{(k-1)};\theta),\, h^{(k-1)}) \quad \text{for } 1 \le k \le H, \qquad \lambda_k \ge 0$$

F(x; θ) is the MLP model
A_k(h^(k−1); θ) is a Denoising AE model
H is the number of hidden layers
h^(0) = x
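The following PyTorch-style sketch shows one way this risk could be assembled for a logistic MLP. It is illustrative only: the layer sizes, the Gaussian `noise_std` corruption, and detaching the reconstruction target are assumptions rather than details taken from the paper, and the weight-decay term εΩ(θ) is left to the optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointlyTrainedMLP(nn.Module):
    """Logistic MLP where every hidden layer also feeds a denoising
    auto-encoder loss, mirroring the risk R(theta, D) above."""

    def __init__(self, sizes=(784, 256, 256, 10), noise_std=0.3):
        super().__init__()
        dims = list(zip(sizes[:-2], sizes[1:-1]))
        self.encoders = nn.ModuleList(nn.Linear(a, b) for a, b in dims)
        self.decoders = nn.ModuleList(nn.Linear(b, a) for a, b in dims)
        self.output = nn.Linear(sizes[-2], sizes[-1])
        self.noise_std = noise_std

    def forward(self, x, y, lambdas):
        # lambdas = (lambda_0, lambda_1, ..., lambda_H), one weight per term
        h = x  # h^(0) = x
        risk = 0.0
        for k, (enc, dec) in enumerate(zip(self.encoders, self.decoders), 1):
            # U^(k): reconstruct h^(k-1) from a corrupted copy (denoising AE)
            noisy = h + self.noise_std * torch.randn_like(h)
            recon = torch.sigmoid(dec(torch.sigmoid(enc(noisy))))
            risk = risk + lambdas[k] * F.binary_cross_entropy(recon, h.detach())
            h = torch.sigmoid(enc(h))  # logistic activation, h^(k)
        # lambda_0 * Ls: supervised cross-entropy on the softmax output
        risk = risk + lambdas[0] * F.cross_entropy(self.output(h), y)
        return risk
```

A training step would then be `loss = model(x, y, lambdas)` followed by `loss.backward()`, with `lambdas` recomputed each iteration by one of the update policies described next.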

Page 9: Integration of Unsupervised and Supervised Criteria for DNNs Training

Method description

How to do it: Risk = Supervised Loss + Sum over layers of Unsupervised Loss

$$R(\theta, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \Big[ \lambda_0 L_s(F(x;\theta), y) + \sum_{k=1}^{H} \lambda_k U^{(k)} \Big] + \epsilon\, \Omega(\theta)$$

The λ vector mixes all the components
It should be updated every iteration
Starting focused on the unsupervised criteria
Ending focused on the supervised criterion

Page 10: Integration of Unsupervised and Supervised Criteria for DNNs Training

Method description

Page 11: Integration of Unsupervised and Supervised Criteria for DNNs Training

Outline

1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results: MNIST, SML2010 temperature forecasting
5. Conclusions and Future Work

Page 12: Integration of Unsupervised and Supervised Criteria for DNNs Training

λ Update Policies I

A λ update policy indicates how to change λvector every iterationThe supervised part (λ0) can be fixed to 1.The unsupervised part should be importantduring first iterations

Loosing focus while trainingBeing insignificant at the endA greedy exponential decay (GED) will suffice

λ0(t) = 1 ; λk(t) = Λγt

Being the constants Λ > 0 and γ ∈ [0, 1]
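As a sketch, the GED policy amounts to a couple of lines per training iteration; the defaults Λ = 1.0 and γ = 0.999 below are placeholders, not the tuned values from the experiments.

```python
def ged_lambdas(t, num_hidden, big_lambda=1.0, gamma=0.999):
    """Greedy exponential decay: lambda_0 stays at 1 while the
    unsupervised weights Lambda * gamma**t shrink toward 0."""
    return [1.0] + [big_lambda * gamma ** t] * num_hidden

# At t = 0 the unsupervised terms carry weight Lambda; as t grows
# (with gamma < 1) the risk converges to the purely supervised loss.
```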

Page 13: Integration of Unsupervised and Supervised Criteria for DNNs Training

λ Update Policies II

Exponential decay is the simplest approach, but other policies are possible:

Ratio between loss functions
Ratio between gradients at each layer
A combination of them…

Page 14: Integration of Unsupervised and Supervised Criteria for DNNs Training

Outline

1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results: MNIST, SML2010 temperature forecasting
5. Conclusions and Future Work

Page 15: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (MNIST) I

Benchmark with the MNIST dataset
Logistic activation functions, softmax output
Cross-entropy for both supervised and unsupervised losses
Classification error as evaluation measure
Effect of MLP topology and of the initial value Λ of λk
Sensitivity study of the γ exponential decay term
Comparison with other models from the literature

Page 16: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (MNIST) II

Test error (%) with 95% confidence intervals

Data Set   SAE-3        SDAE-3       GED-3
MNIST      1.40±0.23    1.28±0.22    1.22±0.22
basic      3.46±0.16    2.84±0.15    2.72±0.14

SAE-3 and SDAE-3 results taken from Vincent et al. (2010)

Page 17: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (MNIST) III

Hyper-parameters grid search (Validation set)

[Figure: validation error (%) on MNIST versus layer size (256, 512, 1024, 2048) for depths 1 to 5, with one curve per initial value Λ ∈ {0.00000, 0.00001, 0.00100, 0.20000, 0.60000, 1.00000, 3.00000, 5.00000}; error axis from 0.8 to 2%.]

Page 18: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (MNIST) IV

γ exponential decay term (Validation set)

[Figure: validation error (%) versus decay γ over [0.5, 1], with a detail panel zooming in on γ ∈ [0.997, 1]; error axis from 1.00 to 1.50%.]

Page 19: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (MNIST) V

First layer filters (16 of 2048 units)

[Figure: filter visualizations for three settings: only supervised, γ = 0.999, and γ = 1.000.]

Page 20: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (SML2010) I

SML2010 UCI data set: indoor temperature forecasting
Logistic hidden activation functions, linear output
48 inputs (12 hours) and 12 outputs (3 hours); a windowing sketch follows below
Mean Square Error as supervised loss
Cross-entropy unsupervised losses
Mean Absolute Error (MAE) for evaluation
Compared MLPs with/without unsupervised losses
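For illustration, the 48-input / 12-output windows could be built as below. This assumes 15-minute sampling (so 48 steps span 12 hours and 12 steps span 3 hours), and `make_windows` is a hypothetical helper, not the paper's preprocessing code.

```python
import numpy as np

def make_windows(series, n_in=48, n_out=12):
    """Slice a univariate temperature series into (input, target) pairs:
    n_in past steps are used to predict the next n_out steps."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        Y.append(series[i + n_in:i + n_in + n_out])
    return np.asarray(X), np.asarray(Y)
```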

Page 21: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (SML2010) II

Depth   Size   MAE (Λ = 0)   MAE (Λ = 1)
3       32     0.1322        0.1266
3       64     0.1350        0.1257
3       128    0.1308        0.1292
3       512    0.6160        0.1312

Validation set results; statistically significant improvements were highlighted (in red) on the original slide.

Λ = 0 trains the DNN only with the supervised loss.
Λ = 1 trains the DNN with both supervised and unsupervised losses.

Page 22: Integration of Unsupervised and Supervised Criteria for DNNs Training

Experiments and Results (SML2010) III

Test results for the 3-layer model with 64 neurons per hidden layer:

MAE 0.1274 when Λ = 0
MAE 0.1177 when Λ = 1

Able to train DNNs of up to 10 layers with 64 hidden units per layer when Λ = 1

MAE in the range [0.1274, 0.1331]

Λ = 0 trains the DNN only with the supervised loss.
Λ = 1 trains the DNN with both supervised and unsupervised losses.

Page 23: Integration of Unsupervised and Supervised Criteria for DNNs Training

Outline

1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results: MNIST, SML2010 temperature forecasting
5. Conclusions and Future Work

Page 24: Integration of Unsupervised and Supervised Criteria for DNNs Training

Conclusions

One-stage training of deep models combining supervised and unsupervised loss functions
Comparable with greedy layer-wise unsupervised pre-training + fine-tuning
The approach is successful at training deep MLPs with logistic activations
Decaying the unsupervised loss during training is crucial
Time-series results encourage further research of this idea in on-line learning scenarios

Page 25: Integration of Unsupervised and Supervised Criteria for DNNs Training

Future Work

Better filters and models? Further research needed
Study the effect of using ReLU activations
Study alternatives to exponential decay of the unsupervised loss: dynamic adaptation

Page 26: Integration of Unsupervised and Supervised Criteria for DNNs Training

The End

Thanks for your attention! Questions?