Integration of Unsupervised and Supervised Criteria for DNNs Training


Integration of Unsupervised and Supervised Criteria for DNNs Training

International Conf. on Artificial Neural Networks

Francisco Zamora-Martínez, Francisco Javier Muñoz-Almaraz, Juan Pardo

Department of Physical, Mathematical and Computer Sciences, Universidad CEU Cardenal Herrera

September 7th, 2016

Outline

1. Motivation

2. Method description

3. λ Update Policies

4. Experiments and Results: MNIST, SML2010 temperature forecasting

5. Conclusions and Future Work


Motivation

Greedy layer-wise unsupervised pre-training is successful for training logistic MLPs. Two training stages:

1. Pre-training with unsupervised data (SAEs or RBMs)

2. Fine-tuning parameters with supervised data

Very useful when large unsupervised data is available. But…

- It is a greedy approach
- Not valid for on-line learning scenarios
- Not as useful with small data sets

Motivation

Goals:
- Train a supervised model
- Layer-wise conditioned by an unsupervised loss
- Improving gradient flow
- Learning better features

Every layer's parameters should be:
- Useful for the global supervised task
- Able to reconstruct their input (Auto-Encoders)

Motivation

Related works:
- Is Joint Training Better for Deep Auto-Encoders?, by Y. Zhou et al. (2015), arXiv paper (fine-tuning stage for supervision)
- Preliminary work by P. Vincent et al. (2010), Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
- Deep learning via Semi-Supervised Embedding, by Weston et al. (2008), ICML paper


Method description

How to do it: Risk = Supervised Loss + Σ_layer(Unsup. Loss)

R(θ, D) = (1/|D|) Σ_{(x,y)∈D} [ λ_0 L_s(F(x; θ), y) + Σ_{k=1}^{H} λ_k U^(k) ] + ϵ Ω(θ)

U^(k) = L_u(A_k(h^(k−1); θ), h^(k−1))   for 1 ≤ k ≤ H

λ_k ≥ 0

where:
- F(x; θ) is the MLP model
- A_k(h^(k−1); θ) is a Denoising AE model
- H is the number of hidden layers
- h^(0) = x

Method description

How to do it: Risk = Supervised Loss + Σ_layer(Unsup. Loss)

(Same risk R(θ, D) as above, now focusing on the mixing weights.)

The λ vector mixes all the components:
- It should be updated every iteration
- Starting focused on the unsupervised criteria
- Ending focused on the supervised criterion

A minimal code sketch of this combined risk is given below.
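The following is a minimal NumPy sketch of the combined risk, assuming masking noise for the denoising auto-encoders, an L2 penalty for Ω(θ), and encoder weights shared between each DAE A_k and the corresponding MLP layer. All helper names, shapes, and default values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cat_cross_entropy(p, t):
    # supervised loss L_s: categorical cross-entropy, t is one-hot
    return -np.mean(np.sum(t * np.log(np.clip(p, 1e-7, 1.0)), axis=1))

def bin_cross_entropy(p, t):
    # unsupervised loss L_u: reconstruction cross-entropy for inputs in [0, 1]
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return -np.mean(np.sum(t * np.log(p) + (1 - t) * np.log(1 - p), axis=1))

def combined_risk(x, y, W, b, Wd, bd, lam, eps=1e-4, noise=0.2):
    """W, b: the H hidden layers plus the output layer of F(x; theta).
    Wd, bd: one decoder per hidden layer (each DAE A_k shares its encoder
    weights W[k] with the MLP).  lam = [lambda_0, lambda_1, ..., lambda_H]."""
    h = x                                                  # h^(0) = x
    unsup = 0.0
    for k in range(len(W) - 1):                            # hidden layers k = 1..H
        corrupted = h * (rng.random(h.shape) > noise)      # masking noise
        code = logistic(corrupted @ W[k] + b[k])           # encoder of A_k
        recon = logistic(code @ Wd[k] + bd[k])             # decoder of A_k
        unsup += lam[k + 1] * bin_cross_entropy(recon, h)  # lambda_k * U^(k)
        h = logistic(h @ W[k] + b[k])                      # clean forward pass
    y_hat = softmax(h @ W[-1] + b[-1])                     # F(x; theta)
    sup = lam[0] * cat_cross_entropy(y_hat, y)             # lambda_0 * L_s
    l2 = eps * sum((Wi ** 2).sum() for Wi in list(W) + list(Wd))  # eps * Omega(theta)
    return sup + unsup + l2
```

Note that each U^(k) is computed from the hidden representation h^(k−1) produced by the same network that computes F(x; θ), which is how the unsupervised losses condition every layer of the supervised model.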

Method description


λ Update Policies I

A λ update policy indicates how to change the λ vector at every iteration:
- The supervised part (λ_0) can be fixed to 1
- The unsupervised part should be important during the first iterations, losing focus while training and becoming insignificant at the end

A greedy exponential decay (GED) will suffice (a sketch follows below):

λ_0(t) = 1 ;   λ_k(t) = Λ γ^t

with constants Λ > 0 and γ ∈ [0, 1].
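As a quick illustration, here is a sketch of the GED schedule in Python; Λ = 1 and γ = 0.999 are simply values that appear in the experiments, not recommended defaults.

```python
# Greedy exponential decay (GED): lambda_0(t) = 1, lambda_k(t) = Lambda * gamma**t
def ged_lambdas(t, num_hidden, Lambda=1.0, gamma=0.999):
    """Return [lambda_0(t), lambda_1(t), ..., lambda_H(t)] at iteration t."""
    return [1.0] + [Lambda * gamma ** t] * num_hidden

print(ged_lambdas(0, 3))      # iteration 0: unsupervised terms weighted by Lambda
print(ged_lambdas(5000, 3))   # gamma**5000 ≈ 0.0067: supervised term dominates
```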

λ Update Policies II

Exponential decay is the simplest approach, but other policies are possible:

- Ratio between loss functions
- Ratio between gradients at each layer
- A combination of them…


Experiments and Results (MNIST) I

Benchmark with the MNIST dataset:
- Logistic activation functions, softmax output
- Cross-entropy for supervised and unsupervised losses
- Classification error as evaluation measure
- Effect of MLP topology and the initial value Λ of λ_k
- Sensitivity study of the γ exponential decay term
- Comparison with other models from the literature

Experiments and Results (MNIST) II

Test error (%) with 95% confidence intervals:

Data set   SAE-3        SDAE-3       GED-3
MNIST      1.40±0.23    1.28±0.22    1.22±0.22
basic      3.46±0.16    2.84±0.15    2.72±0.14

SAE-3 and SDAE-3 results taken from Vincent et al. (2010).

Experiments and Results (MNIST) III

Hyper-parameters grid search (validation set)

[Figure: validation error (%), roughly 0.8–2%, versus layer size (256, 512, 1024, 2048) for depths 1–5, with one curve per Λ value in {0.00000, 0.00001, 0.00100, 0.20000, 0.60000, 1.00000, 3.00000, 5.00000}.]

Experiments and Results (MNIST) IV

γ exponential decay term (validation set)

[Figure: validation error (%) versus decay γ ∈ [0.5, 1], with a detail panel for γ ∈ [0.997, 1].]

Experiments and Results (MNIST) V

First-layer filters (16 of 2048 units)

[Figure: filter visualizations for three settings: only supervised, γ = 0.999, and γ = 1.000.]

Experiments and Results (SML2010) I

SML2010 UCI data set: indoor temperature forecasting
- Logistic hidden activation functions, linear output
- 48 inputs (12 hours) and 12 outputs (3 hours)
- Mean Square Error as supervised loss
- Cross-entropy unsupervised losses
- Mean Absolute Error (MAE) for evaluation
- Compared MLPs with/without unsupervised losses

A sliding-window sketch of this setup is shown below.
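As an illustration of this setup, here is a minimal NumPy sketch of the sliding-window construction and the MAE measure, assuming a univariate temperature series at the 15-minute resolution implied by 48 samples = 12 hours; the function names are hypothetical, not the authors' preprocessing code.

```python
import numpy as np

def make_windows(series, n_in=48, n_out=12):
    """Split a temperature series into (past 12 h, next 3 h) input/output pairs."""
    X, Y = [], []
    for t in range(len(series) - n_in - n_out + 1):
        X.append(series[t : t + n_in])                 # 48 past samples (12 hours)
        Y.append(series[t + n_in : t + n_in + n_out])  # 12 future samples (3 hours)
    return np.array(X), np.array(Y)

def mae(pred, target):
    """Mean Absolute Error, the evaluation measure used in the slides."""
    return np.mean(np.abs(pred - target))
```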

Experiments and Results (SML2010) II

Depth   Size   MAE (Λ = 0)   MAE (Λ = 1)
3       32     0.1322        0.1266
3       64     0.1350        0.1257
3       128    0.1308        0.1292
3       512    0.6160        0.1312

Validation set results; statistically significant improvements were highlighted (in red) in the original slide.

Λ = 0: training the DNN with the supervised loss only.
Λ = 1: training the DNN with both supervised and unsupervised losses.

Experiments and Results (SML2010) III

Test results for the 3-layer model with 64 neurons per hidden layer:

- MAE 0.1274 when Λ = 0
- MAE 0.1177 when Λ = 1

With Λ = 1 it was possible to train DNNs of up to 10 layers with 64 hidden units per layer, with MAE in the range [0.1274, 0.1331].



Conclusions

- One-stage training of deep models combining supervised and unsupervised loss functions
- Comparable with greedy layer-wise unsupervised pre-training + fine-tuning
- The approach successfully trains deep MLPs with logistic activations
- Decaying the unsupervised loss during training is crucial
- Time-series results encourage further research of this idea in on-line learning scenarios

Future Work

- Better filters and models? Further research is needed
- Study the effect of using ReLU activations
- Study alternatives to the exponential decay of the unsupervised loss: dynamic adaptation

The End

Thanks for your attention! Questions?