Integration of Unsupervised and Supervised Criteria for DNNs Training
Integration of Unsupervised and Supervised Criteria for DNNs Training
International Conf. on Artificial Neural Networks
Francisco Zamora-Martínez, Francisco Javier Muñoz-Almaraz, Juan Pardo
Departamento de ciencias físicas, matemáticas y de la computación, Universidad CEU Cardenal Herrera
September 7th, 2016
Outline
1. Motivation
2. Method description
3. λ Update Policies
4. Experiments and Results
   - MNIST
   - SML2010 temperature forecasting
5. Conclusions and Future Work
Motivation

Greedy layer-wise unsupervised pre-training is successful for training logistic MLPs. Two training stages:
1. Pre-training with unsupervised data (SAEs or RBMs)
2. Fine-tuning parameters with supervised data

Very useful when large unsupervised data is available. But…
- It is a greedy approach
- Not valid for on-line learning scenarios
- Not as useful with small data sets
Motivation

Goals:
- Train a supervised model, layer-wise conditioned by an unsupervised loss
- Improve gradient flow
- Learn better features

Every layer's parameters should be:
- Useful for the global supervised task
- Able to reconstruct their input (auto-encoders)
Motivation

Related works:
- Is Joint Training Better for Deep Auto-Encoders?, Y. Zhou et al. (2015), arXiv paper (fine-tuning stage for supervision)
- Preliminary work by P. Vincent et al. (2010), Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
- Deep learning via Semi-Supervised Embedding, Weston et al. (2008), ICML paper
Method description

How to do it: Risk = Supervised Loss + Sum over layers(Unsup. Loss)

R(θ, D) = (1 / |D|) ∑_{(x,y)∈D} [ λ_0 L_s(F(x; θ), y) + ∑_{k=1}^{H} λ_k U^{(k)} ] + ϵ Ω(θ)

U^{(k)} = L_u(A_k(h^{(k−1)}; θ), h^{(k−1)}) for 1 ≤ k ≤ H,   λ_k ≥ 0

where:
- F(x; θ) is the MLP model
- A_k(h^{(k−1)}; θ) is a denoising AE model
- H is the number of hidden layers
- h^{(0)} = x
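The risk above (a supervised loss plus λ-weighted per-layer reconstruction losses) can be sketched for a single sample as follows. This is a minimal NumPy illustration, not the authors' implementation: the tied-weight decoder, the 20% masking noise, and the function names are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combined_risk(x, y, Ws, bs, lambdas, rng):
    """R for one sample: lambdas[0] * supervised cross-entropy plus
    lambdas[k] * denoising-AE reconstruction loss at each hidden layer.
    Tied decoder weights (W.T) and masking noise are illustrative choices."""
    h = x  # h^(0) = x, assumed to lie in [0, 1]
    unsup = 0.0
    for k, (W, b) in enumerate(zip(Ws[:-1], bs[:-1]), start=1):
        noisy = h * (rng.random(h.shape) > 0.2)  # masking (denoising) noise
        code = sigmoid(noisy @ W + b)            # encoder shares the MLP layer
        recon = sigmoid(code @ W.T)              # tied-weight decoder
        unsup += lambdas[k] * np.mean(           # cross-entropy reconstruction
            -(h * np.log(recon + 1e-9) + (1 - h) * np.log(1 - recon + 1e-9)))
        h = sigmoid(h @ W + b)                   # clean forward pass continues
    logits = h @ Ws[-1] + bs[-1]                 # softmax output layer
    p = np.exp(logits - logits.max())
    p /= p.sum()
    sup = -np.log(p[y] + 1e-9)                   # supervised cross-entropy
    return lambdas[0] * sup + unsup
```

Setting every λ_k (k ≥ 1) to zero recovers a purely supervised MLP, so the unsupervised terms only ever add a non-negative penalty.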
Method description (continued)

The λ vector mixes all the components:
- It should be updated every iteration
- Starting focused on the unsupervised criteria
- Ending focused on the supervised criterion
Method description
λ Update Policies I

A λ update policy indicates how to change the λ vector every iteration:
- The supervised part (λ_0) can be fixed to 1
- The unsupervised part should be important during the first iterations, losing focus while training and becoming insignificant at the end

A greedy exponential decay (GED) will suffice:

λ_0(t) = 1 ;   λ_k(t) = Λ γ^t

with constants Λ > 0 and γ ∈ [0, 1].
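The GED schedule is simple enough to state in a few lines. A sketch, assuming (as the slides suggest) that the same decayed value is shared by all H hidden layers; the function name and defaults are illustrative:

```python
def ged_lambdas(t, Lam=1.0, gamma=0.999, H=3):
    """Greedy exponential decay: the supervised weight lambda_0 stays at 1,
    while every hidden layer's unsupervised weight decays as Lam * gamma**t."""
    return [1.0] + [Lam * gamma ** t] * H
```

With γ = 0.999 the unsupervised weight falls to roughly a third of Λ after 1000 iterations, so early training is dominated by reconstruction and late training by the supervised criterion.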
λ Update Policies II

Exponential decay is the simplest approach, but other policies are possible:
- Ratio between loss functions
- Ratio between gradients at each layer
- A combination of them…
Experiments and Results (MNIST) I

Benchmark with the MNIST dataset:
- Logistic activation functions, softmax output
- Cross-entropy for supervised and unsupervised losses
- Classification error as evaluation measure
- Effect of MLP topology and the initial Λ value of λ_k
- Sensitivity study of the γ exponential decay term
- Comparison with other literature models
Experiments and Results (MNIST) II

Test error (%) with 95% confidence intervals:

Data set   SAE-3       SDAE-3      GED-3
MNIST      1.40±0.23   1.28±0.22   1.22±0.22
basic      3.46±0.16   2.84±0.15   2.72±0.14

SAE-3 and SDAE-3 results taken from Vincent et al. (2010).
Experiments and Results (MNIST) III

Hyper-parameters grid search (validation set). [Figure: MNIST validation error (%), roughly 0.8–2%, vs. layer size (256, 512, 1024, 2048) at depths 1–5, for Λ values 0.00000, 0.00001, 0.00100, 0.20000, 0.60000, 1.00000, 3.00000, 5.00000]
Experiments and Results (MNIST) IV

γ exponential decay term (validation set). [Figure: MNIST validation error (%), roughly 1.0–1.5%, vs. decay γ ∈ [0.5, 1], with a detail view over γ ∈ [0.997, 1]]
Experiments and Results (MNIST) V

First layer filters (16 of 2048 units). [Figure: only supervised vs. γ = 0.999 vs. γ = 1.000]
Experiments and Results (SML2010) I

SML2010 UCI data set: indoor temperature forecasting
- Logistic hidden activation functions, linear output
- 48 inputs (12 hours) and 12 outputs (3 hours)
- Mean Square Error as supervised loss
- Cross-entropy unsupervised losses
- Mean Absolute Error (MAE) for evaluation
- Compared MLPs with/without unsupervised losses
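The 48-inputs/12-outputs setup corresponds to sliding windows over a 15-minute temperature series (48 steps = 12 h, 12 steps = 3 h). A minimal sketch of how such windows could be built; the function name and unit stride are assumptions, not taken from the paper:

```python
import numpy as np

def make_windows(series, n_in=48, n_out=12):
    """Build (past, future) sliding windows over a 1-D series:
    each row of X holds n_in consecutive readings, each row of Y
    the n_out readings that immediately follow."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i : i + n_in])
        Y.append(series[i + n_in : i + n_in + n_out])
    return np.array(X), np.array(Y)
```

Each window pair is then one supervised example for the MLP, while the inputs alone feed the per-layer unsupervised reconstruction losses.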
Experiments and Results (SML2010) II

Depth   Size   MAE (Λ = 0)   MAE (Λ = 1)
3       32     0.1322        0.1266
3       64     0.1350        0.1257
3       128    0.1308        0.1292
3       512    0.6160        0.1312

Validation set results; statistically significant improvements shown in red in the original slide.

Λ = 0: DNN trained only with the supervised loss. Λ = 1: DNN trained with supervised and unsupervised losses.
Experiments and Results (SML2010) III

Test results for the 3-layer model with 64 neurons per hidden layer:
- MAE 0.1274 when Λ = 0
- MAE 0.1177 when Λ = 1

With Λ = 1 it was possible to train DNNs of up to 10 layers with 64 hidden units per layer, with MAE in the range [0.1274, 0.1331].
Conclusions

- One-stage training of deep models combining supervised and unsupervised loss functions
- Comparable with greedy layer-wise unsupervised pre-training + fine-tuning
- The approach successfully trains deep MLPs with logistic activations
- Decaying the unsupervised loss during training is crucial
- Time-series results encourage further research of this idea in on-line learning scenarios
Future Work

- Better filters and models? Further research needed
- Study the effect using ReLU activations
- Study alternatives to exponential decay of the unsupervised loss: dynamic adaptation
The End

Thanks for your attention! Questions?