Neural Networks I
Jia-Bin Huang
Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative
• HW 2 released!
• Jia-Bin out of town next week
• Chen: Deep Neural Networks II
• Shih-Yang: Diagnosing ML systems
• Midterm review
• Midterm: March 6th (Wed)
Hard-margin SVM formulation

$$\min_\theta \; \frac{1}{2}\sum_{j=1}^{n} \theta_j^2$$

s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$; $\quad \theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$
Soft-margin SVM formulation

$$\min_{\theta,\,\{\xi^{(i)}\}} \; \frac{1}{2}\sum_{j=1}^{n} \theta_j^2 + C \sum_i \xi^{(i)}$$

s.t. $\theta^\top x^{(i)} \ge 1 - \xi^{(i)}$ if $y^{(i)} = 1$; $\quad \theta^\top x^{(i)} \le -1 + \xi^{(i)}$ if $y^{(i)} = 0$; $\quad \xi^{(i)} \ge 0 \;\; \forall i$
Support Vector Machine
•Cost function
• Large margin classification
•Kernels
•Using an SVM
Non-linear classification
• How do we separate the two classes using a hyperplane?
$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0 \\ 0 & \text{if } \theta^\top x < 0 \end{cases}$$
Non-linear classification
[Figure: two classes plotted over $x_1$ and $x_2$ that no single hyperplane separates]
Kernel
• $K(\cdot,\cdot)$ is a legal inner product: $\exists\, \phi: X \to \mathbb{R}^N$ s.t. $K(x, z) = \phi(x)^\top \phi(z)$
Why do kernels matter?
• Many algorithms interact with data only via dot products
• Replace $x^\top z$ with $K(x, z) = \phi(x)^\top \phi(z)$
• The algorithm then acts implicitly as if the data were in the higher-dimensional $\phi$-space (see the sketch below)
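A minimal NumPy sketch (not from the lecture) of this kernel trick: the decision function below touches the data only through $K(x, z)$, so swapping in a different kernel implicitly changes the feature space without changing the algorithm. The stored examples and coefficients are hypothetical.

```python
import numpy as np

def score(x, z_list, alphas, K):
    # Decision function that touches the data ONLY through K(x, z):
    # changing K changes the implicit feature space, not the algorithm.
    return sum(a * K(x, z) for a, z in zip(alphas, z_list))

dot   = lambda x, z: x @ z          # ordinary dot product (linear)
poly2 = lambda x, z: (x @ z) ** 2   # quadratic kernel from the next slide

# Hypothetical stored examples and coefficients, just to run the sketch
z_list = [np.array([1.0, 2.0]), np.array([0.0, -1.0])]
alphas = [0.5, -0.2]
x = np.array([3.0, 1.0])

print(score(x, z_list, alphas, dot), score(x, z_list, alphas, poly2))
```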
Example
$K(x, z) = (x^\top z)^2$ corresponds to the feature map

$$(x_1, x_2) \mapsto \phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)$$

$$\phi(x)^\top \phi(z) = \left(x_1^2, x_2^2, \sqrt{2}\,x_1 x_2\right)\left(z_1^2, z_2^2, \sqrt{2}\,z_1 z_2\right)^\top = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = (x^\top z)^2 = K(x, z)$$
Example
$K(x, z) = (x^\top z)^2$ corresponds to $(x_1, x_2) \mapsto \phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)$
Slide credit: Maria-Florina Balcan
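A quick numerical check of the identity above, assuming 2-D inputs; the helper `phi` is just the explicit feature map written out.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the quadratic kernel (2-D input)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

print((x @ z) ** 2)      # kernel value: (1*2 + 3*(-1))^2 = 1
print(phi(x) @ phi(z))   # same value via the explicit feature map
```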
Example kernels
• Linear kernel: $K(x, z) = x^\top z$
• Gaussian (radial basis function) kernel: $K(x, z) = \exp\left(-\frac{1}{2}(x - z)^\top \Sigma^{-1} (x - z)\right)$
• Sigmoid kernel: $K(x, z) = \tanh(a \cdot x^\top z + b)$
Constructing new kernels
• Positive scaling: $K(x, z) = c\,K_1(x, z)$, $c > 0$
• Exponentiation: $K(x, z) = \exp\left(K_1(x, z)\right)$
• Addition: $K(x, z) = K_1(x, z) + K_2(x, z)$
• Multiplication with a function: $K(x, z) = f(x)\,K_1(x, z)\,f(z)$
• Multiplication: $K(x, z) = K_1(x, z)\,K_2(x, z)$
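A small sketch (not course code) of these closure rules in NumPy; the base kernels and test points are arbitrary.

```python
import numpy as np

# Base kernels
k_lin = lambda x, z: x @ z
k_rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

# New kernels built with the closure rules above
k_scaled = lambda x, z: 3.0 * k_lin(x, z)           # positive scaling (c = 3 > 0)
k_exp    = lambda x, z: np.exp(k_lin(x, z))         # exponentiation
k_sum    = lambda x, z: k_lin(x, z) + k_rbf(x, z)   # addition
k_prod   = lambda x, z: k_lin(x, z) * k_rbf(x, z)   # multiplication

x, z = np.array([1.0, 0.5]), np.array([0.0, 2.0])
print(k_scaled(x, z), k_exp(x, z), k_sum(x, z), k_prod(x, z))
```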
Non-linear decision boundary
Predict 𝑦 = 1 if
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$$

Equivalently, $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, $\ldots$

Is there a different/better choice of the features $f_1, f_2, f_3, \ldots$?

[Figure: a non-linear decision boundary in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
Kernel
Given $x$, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$:

$f_1 = \text{similarity}(x, l^{(1)})$, $\quad f_2 = \text{similarity}(x, l^{(2)})$, $\quad f_3 = \text{similarity}(x, l^{(3)})$

Gaussian kernel:
$$\text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$$

[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
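A short sketch of the Gaussian similarity feature, assuming three hypothetical 2-D landmarks and $\sigma^2 = 1$.

```python
import numpy as np

def gaussian_similarity(x, l, sigma2=1.0):
    # similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

# Hypothetical landmarks in 2-D
landmarks = [np.array([3.0, 2.0]), np.array([1.0, 4.0]), np.array([5.0, 5.0])]
x = np.array([3.2, 2.1])

f = np.array([gaussian_similarity(x, l) for l in landmarks])
print(f)   # f1 near 1 (x is close to l1); f2 and f3 near 0
```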
Predict 𝑦 = 1 if
$$\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$$

Ex: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$

$f_1 = \text{similarity}(x, l^{(1)})$, $\quad f_2 = \text{similarity}(x, l^{(2)})$, $\quad f_3 = \text{similarity}(x, l^{(3)})$

[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
Choosing the landmarks
• Given 𝑥
$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$$

Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$

Where to get $l^{(1)}, l^{(2)}, l^{(3)}, \ldots$?

[Figure: landmarks $l^{(1)}, \ldots, l^{(m)}$ placed in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
SVM with kernels
• Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$
• Choose $l^{(1)} = x^{(1)}$, $l^{(2)} = x^{(2)}$, $l^{(3)} = x^{(3)}$, $\ldots$, $l^{(m)} = x^{(m)}$
• Given example $x$:
  • $f_1 = \text{similarity}(x, l^{(1)})$
  • $f_2 = \text{similarity}(x, l^{(2)})$
  • $\cdots$
• For training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \to f^{(i)}$

$$f = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_m \end{bmatrix}$$
Slide credit: Andrew Ng
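One way this mapping could be written in NumPy, assuming a Gaussian similarity and landmarks equal to the training examples; `kernel_features` is a hypothetical helper, not part of any package.

```python
import numpy as np

def kernel_features(X, sigma2=1.0):
    # Landmarks are the training examples themselves: l(i) = x(i).
    # Returns F of shape (m, m+1): row i is f(i) = [f0 = 1, f1, ..., fm].
    m = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    F = np.exp(-sq_dists / (2 * sigma2))
    return np.hstack([np.ones((m, 1)), F])   # prepend f0 = 1 (bias feature)

X = np.random.randn(5, 2)   # 5 toy training examples
F = kernel_features(X)
print(F.shape)               # (5, 6)
```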
• Hypothesis: Given 𝑥, compute features 𝑓 ∈ ℝ𝑚+1
• Predict 𝑦 = 1 if 𝜃⊤𝑓 ≥ 0
• Training (original):
$$\min_\theta \; C\sum_{i=1}^{m} \left[\, y^{(i)} \,\text{cost}_1(\theta^\top x^{(i)}) + (1 - y^{(i)})\, \text{cost}_0(\theta^\top x^{(i)}) \,\right] + \frac{1}{2}\sum_{j=1}^{n} \theta_j^2$$

• Training (with kernel):
$$\min_\theta \; C\sum_{i=1}^{m} \left[\, y^{(i)} \,\text{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)})\, \text{cost}_0(\theta^\top f^{(i)}) \,\right] + \frac{1}{2}\sum_{j=1}^{m} \theta_j^2$$
SVM with kernels
Support vector machines (Primal/Dual)
• Primal form:
$$\min_{\theta,\,\{\xi^{(i)}\}} \; \frac{1}{2}\sum_{j=1}^{n} \theta_j^2 + C \sum_i \xi^{(i)}$$
s.t. $\theta^\top x^{(i)} \ge 1 - \xi^{(i)}$ if $y^{(i)} = 1$; $\quad \theta^\top x^{(i)} \le -1 + \xi^{(i)}$ if $y^{(i)} = 0$; $\quad \xi^{(i)} \ge 0 \;\; \forall i$

• Lagrangian dual form:
$$\min_\alpha \; \frac{1}{2}\sum_i \sum_j y^{(i)} y^{(j)} \alpha^{(i)} \alpha^{(j)}\, x^{(i)\top} x^{(j)} - \sum_i \alpha^{(i)}$$
s.t. $0 \le \alpha^{(i)} \le C \;\; \forall i$; $\quad \sum_i y^{(i)} \alpha^{(i)} = 0$
SVM (Lagrangian dual)
$$\min_\alpha \; \frac{1}{2}\sum_i \sum_j y^{(i)} y^{(j)} \alpha^{(i)} \alpha^{(j)}\, x^{(i)\top} x^{(j)} - \sum_i \alpha^{(i)}$$
s.t. $0 \le \alpha^{(i)} \le C \;\; \forall i$; $\quad \sum_i y^{(i)} \alpha^{(i)} = 0$

Classifier: $\theta = \sum_i \alpha^{(i)} y^{(i)} x^{(i)}$
• The points $x^{(i)}$ for which $\alpha^{(i)} \ne 0$ are the support vectors
• Kernel trick: replace $x^{(i)\top} x^{(j)}$ with $K(x^{(i)}, x^{(j)})$
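A sketch of how the dual only needs the Gram matrix $K(x^{(i)}, x^{(j)})$. The helpers `gram_matrix` and `dual_objective` are illustrative, and in practice the $\alpha$'s would come from a QP solver rather than being chosen by hand.

```python
import numpy as np

def gram_matrix(X, K):
    # G[i, j] = K(x(i), x(j)): the only way the dual touches the data.
    m = X.shape[0]
    G = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = K(X[i], X[j])
    return G

def dual_objective(alpha, y, G):
    # (1/2) * sum_ij y_i y_j alpha_i alpha_j K(x_i, x_j) - sum_i alpha_i
    return 0.5 * (alpha * y) @ G @ (alpha * y) - alpha.sum()

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

X = np.random.randn(6, 2)
y = np.array([1, 1, 1, -1, -1, -1])
alpha = np.full(6, 0.1)          # placeholder values, not a solution
print(dual_objective(alpha, y, gram_matrix(X, rbf)))
```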
SVM parameters
• $C = \frac{1}{\lambda}$
  • Large $C$: lower bias, higher variance
  • Small $C$: higher bias, lower variance
• $\sigma^2$
  • Large $\sigma^2$: features $f_i$ vary more smoothly → higher bias, lower variance
  • Small $\sigma^2$: features $f_i$ vary less smoothly → lower bias, higher variance
Slide credit: Andrew Ng
SVM Demo
• https://cs.stanford.edu/people/karpathy/svmjs/demo/
SVM song
Video source:
• https://www.youtube.com/watch?v=g15bqtyidZs
Support Vector Machine
•Cost function
• Large margin classification
•Kernels
•Using an SVM
Using an SVM
• Use an SVM software package (e.g., liblinear, libsvm) to solve for $\theta$
• Need to specify:
  • Choice of parameter $C$
  • Choice of kernel (similarity function):
    • Linear kernel: predict $y = 1$ if $\theta^\top x \ge 0$
    • Gaussian kernel: $f_i = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$, where $l^{(i)} = x^{(i)}$
      • Need to choose $\sigma^2$; need proper feature scaling
Slide credit: Andrew Ng
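For concreteness, a hedged sketch using scikit-learn (one such package; its `SVC` is built on libsvm). Note that sklearn's RBF kernel is $\exp(-\gamma\lVert x - z\rVert^2)$, so $\gamma = 1/(2\sigma^2)$ recovers the form on the slide; the toy data below is made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy non-linear data; in practice X, y come from your problem
X = np.random.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

sigma2 = 0.5
clf = make_pipeline(
    StandardScaler(),                                     # proper feature scaling
    SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma2)),   # Gaussian kernel
)
clf.fit(X, y)
print(clf.score(X, y))
```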
Kernel (similarity) functions
• Note: not all similarity functions make valid kernels.
• Many off-the-shelf kernels available:
  • Polynomial kernel
• String kernel
• Chi-square kernel
• Histogram intersection kernel
Slide credit: Andrew Ng
Multi-class classification
• Use the one-vs-all method: train $K$ SVMs, one to distinguish $y = i$ from the rest, to get $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)}$
• Pick the class $i$ with the largest $(\theta^{(i)})^\top x$ (see the sketch below)

[Figure: three classes in the $(x_1, x_2)$ plane, each separated from the rest]
Slide credit: Andrew Ng
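A minimal sketch of the one-vs-all prediction rule, assuming the $K$ per-class weight vectors have already been trained; the $\Theta$ values here are made up.

```python
import numpy as np

def one_vs_all_predict(Theta, x):
    # Theta: (K, n+1), row i holds theta(i) from the i-th binary SVM
    # x: (n+1,) feature vector including the intercept term x0 = 1
    scores = Theta @ x             # theta(i)^T x for every class i
    return int(np.argmax(scores))  # pick the class with the largest score

Theta = np.array([[ 0.5,  1.0, -2.0],
                  [-1.0,  0.3,  2.5],
                  [ 0.1, -1.2,  0.4]])   # 3 classes, hypothetical weights
x = np.array([1.0, 0.7, 1.4])
print(one_vs_all_predict(Theta, x))      # class 1 for these numbers
```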
Logistic regression vs. SVMs
• 𝑛 = number of features (𝑥 ∈ ℝ𝑛+1), 𝑚 = number of training examples
1. If $n$ is large (relative to $m$), e.g., $n = 10{,}000$, $m = 10$–$1000$: use logistic regression or an SVM without a kernel ("linear kernel")
2. If $n$ is small and $m$ is intermediate, e.g., $n = 1$–$1000$, $m = 10$–$10{,}000$: use an SVM with a Gaussian kernel
3. If $n$ is small and $m$ is large, e.g., $n = 1$–$1000$, $m = 50{,}000{+}$: create/add more features, then use logistic regression or a linear SVM

A neural network is likely to work well in most of these cases, but is slower to train.
Slide credit: Andrew Ng
Things to remember
• Cost function
$$\min_\theta \; C\sum_{i=1}^{m} \left[\, y^{(i)} \,\text{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)})\, \text{cost}_0(\theta^\top f^{(i)}) \,\right] + \frac{1}{2}\sum_{j=1}^{m} \theta_j^2$$
• Large margin classification
• Kernels
• Using an SVM
[Figure: large-margin decision boundary in the $(x_1, x_2)$ plane, with the margin marked]
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Non-linear classification
Predict $y = 1$ if
$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots) \ge 0.5$$

Example features (housing): $x_1 =$ size, $x_2 =$ #bedrooms, $x_3 =$ #floors, $x_4 =$ age, $\ldots$, $x_{100}$

[Figure: a non-linear decision boundary in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
Number of quadratic features? Number of cubic features?
What humans see
Slide credit: Larry Zitnick
What computers see

[Grid of raw pixel intensity values (0–255); e.g., the first row reads 243 239 240 225 206 185 188 218 211 206 216 225]
Slide credit: Larry Zitnick
Computer Vision: Car detection
Testing:
Slide credit: Andrew Ng
50×50 pixel images → 2500 pixels

$$x = \begin{bmatrix} \text{pixel 1 intensity} \\ \text{pixel 2 intensity} \\ \vdots \\ \text{pixel 2500 intensity} \end{bmatrix}$$

Quadratic features ($x_i x_j$): ~3 million features
Slide credit: Andrew Ng
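Where the ~3 million figure comes from: counting the distinct quadratic terms $x_i x_j$ with $i \le j$ for $n = 2500$ pixels,

$$\frac{n(n+1)}{2} = \frac{2500 \cdot 2501}{2} = 3{,}126{,}250 \approx 3 \times 10^6.$$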
Neural Networks
• Origins: Algorithms that try to mimic the brain.
• Very widely used in the 80s and early 90s; popularity diminished in the late 90s.
• Recent resurgence: State-of-the-art technique for many applications
Slide credit: Andrew Ng
https://www.slideshare.net/dlavenda/ai-and-productivity
The “one learning algorithm” hypothesis
[Roe et al. 1992] Slide credit: Andrew Ng
The “one learning algorithm” hypothesis
[Metin and Frost 1989] Slide credit: Andrew Ng
Sensor representations in the brain
Slide credit: Andrew Ng
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
A single neuron in the brain
Input
Output
Slide credit: Andrew Ng
http://histoweb.co.za/082/082img002.html
An artificial neuron: Logistic unit
$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}$$

• Sigmoid (logistic) activation function

[Figure: inputs $x_0$ (the "bias unit"), $x_1$, $x_2$, $x_3$ (the "input") feed a single unit whose "output" is $h_\theta(x)$; the $\theta$'s are the "weights" / "parameters"]

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
Slide credit: Andrew Ng
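A direct NumPy transcription of this unit; the weight values below are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(theta, x):
    # h_theta(x) = 1 / (1 + exp(-theta^T x)), with x0 = 1 as the bias unit
    return sigmoid(theta @ x)

theta = np.array([-1.0, 2.0, 0.5, -0.3])   # hypothetical weights
x = np.array([1.0, 0.4, 1.2, 2.0])         # x0 = 1, then x1, x2, x3
print(logistic_unit(theta, x))
```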
Neural network
[Figure: Layer 1: inputs $x_0, x_1, x_2, x_3$; Layer 2: hidden units $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$; Layer 3: the "output" $h_\Theta(x)$]
Slide credit: Andrew Ng
Neural network
[Figure: the same three-layer network]

$a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
$\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j + 1$

$$a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right)$$
$$a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right)$$
$$a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right)$$
$$h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right)$$

If there are $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j + 1$, what is the size of $\Theta^{(j)}$? It is $s_{j+1} \times (s_j + 1)$.
Slide credit: Andrew Ng
Neural network
[Figure: the same three-layer network]

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$$

$$a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right) = g(z_1^{(2)})$$
$$a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right) = g(z_2^{(2)})$$
$$a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right) = g(z_3^{(2)})$$
$$h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right) = g(z^{(3)})$$

The $z$'s are the "pre-activations".
Slide credit: Andrew Ng
Neural network
[Figure: the same three-layer network]

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$$

$a_1^{(2)} = g(z_1^{(2)})$, $\quad a_2^{(2)} = g(z_2^{(2)})$, $\quad a_3^{(2)} = g(z_3^{(2)})$, $\quad h_\Theta(x) = g(z^{(3)})$ ("pre-activation")

Vectorized forward propagation:
$$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$$
$$a^{(2)} = g(z^{(2)})$$
Add $a_0^{(2)} = 1$
$$z^{(3)} = \Theta^{(2)} a^{(2)}$$
$$h_\Theta(x) = a^{(3)} = g(z^{(3)})$$
Slide credit: Andrew Ng
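A compact NumPy sketch of these forward-propagation equations for the 3-3-1 network on the slide; the $\Theta$ matrices are random placeholders, with sizes following $s_{j+1} \times (s_j + 1)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))              # add bias unit x0 = 1
    z2 = Theta1 @ a1                             # z(2) = Theta(1) a(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # a(2) = g(z(2)), add a0(2) = 1
    z3 = Theta2 @ a2                             # z(3) = Theta(2) a(2)
    return sigmoid(z3)                           # h_Theta(x) = a(3) = g(z(3))

Theta1 = np.random.randn(3, 4)   # maps 3+1 inputs to 3 hidden units
Theta2 = np.random.randn(1, 4)   # maps 3+1 hidden units to 1 output
x = np.array([0.5, -1.0, 2.0])
print(forward(x, Theta1, Theta2))
```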
Neural network learning its own features
[Figure: the same network; the hidden units $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ act as features that the network learns from the inputs $x_1, x_2, x_3$]
Slide credit: Andrew Ng
Other network architectures
[Figure: a deeper network with two hidden layers, $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ and $a_1^{(3)}, a_2^{(3)}$, before the output $h_\Theta(x)$]
Slide credit: Andrew Ng
• Architecture
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Non-linear classification example: XOR/XNOR
• $x_1, x_2$ are binary (0 or 1)
• $y = \text{XOR}(x_1, x_2)$

[Figure: the four binary inputs in the $(x_1, x_2)$ plane, with labels $y = 1$ and $y = 0$ arranged so that no single line separates them]
Slide credit: Andrew Ng
Simple example: AND
• $x_1, x_2 \in \{0, 1\}$
• $y = x_1$ AND $x_2$

[Figure: a single unit with inputs $+1$ (bias), $x_1$, $x_2$ and weights $-30$, $+20$, $+20$]

$$h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)$$

x1  x2  hΘ(x)
0   0   g(−30) ≈ 0
0   1   g(−10) ≈ 0
1   0   g(−10) ≈ 0
1   1   g(10) ≈ 1

$h_\Theta(x) \approx x_1$ AND $x_2$
Slide credit: Andrew Ng
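A quick numerical check of this truth table (a sketch, using the weights from the slide):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def and_unit(x1, x2):
    # Single sigmoid unit with weights (-30, +20, +20) from the slide
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_unit(x1, x2), 4))
# 0 0 -> ~0, 0 1 -> ~0, 1 0 -> ~0, 1 1 -> ~1
```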
Simple example: OR
• $x_1, x_2 \in \{0, 1\}$
• $y = x_1$ OR $x_2$

[Figure: a single unit with inputs $+1$ (bias), $x_1$, $x_2$ and weights $-10$, $+20$, $+20$]

$$h_\Theta(x) = g(-10 + 20 x_1 + 20 x_2)$$

x1  x2  hΘ(x)
0   0   g(−10) ≈ 0
0   1   g(10) ≈ 1
1   0   g(10) ≈ 1
1   1   g(30) ≈ 1

$h_\Theta(x) \approx x_1$ OR $x_2$
Slide credit: Andrew Ng
Simple example: NOT
• $x_1 \in \{0, 1\}$
• $y =$ NOT $x_1$

[Figure: a single unit with inputs $+1$ (bias) and $x_1$, with weights $10$ and $-20$]

$$h_\Theta(x) = g(10 - 20 x_1)$$

x1  hΘ(x)
0   g(10) ≈ 1
1   g(−10) ≈ 0

$h_\Theta(x) \approx$ NOT $x_1$
Slide credit: Andrew Ng
Putting it together: $x_1$ XNOR $x_2$

Building blocks (each a single sigmoid unit with inputs $+1$, $x_1$, $x_2$):
• $x_1$ AND $x_2$: weights $-30, +20, +20$
• (NOT $x_1$) AND (NOT $x_2$): weights $10, -20, -20$
• $x_1$ OR $x_2$: weights $-10, +20, +20$

Combined network: hidden unit $a_1^{(2)}$ uses weights $-30, +20, +20$ (AND); hidden unit $a_2^{(2)}$ uses weights $10, -20, -20$ ((NOT $x_1$) AND (NOT $x_2$)); the output unit combines $+1, a_1^{(2)}, a_2^{(2)}$ with weights $-10, +20, +20$ (OR), giving $h_\Theta(x)$.

x1  x2  a1(2)  a2(2)  hΘ(x)
0   0   0      1      1
0   1   0      0      0
1   0   0      0      0
1   1   1      0      1

Slide credit: Andrew Ng
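A short forward-pass sketch that reproduces the XNOR truth table with the slide's weights:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([[-30,  20,  20],    # a1(2): x1 AND x2
                   [ 10, -20, -20]])   # a2(2): (NOT x1) AND (NOT x2)
Theta2 = np.array([[-10,  20,  20]])   # output: a1(2) OR a2(2)

def xnor(x1, x2):
    a1 = np.array([1, x1, x2])                        # add bias unit
    a2 = np.concatenate(([1], sigmoid(Theta1 @ a1)))  # hidden activations
    return sigmoid(Theta2 @ a2)[0]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
# 0 0 -> 1, 0 1 -> 0, 1 0 -> 0, 1 1 -> 1
```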
Layer 1
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Layer 2
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Layer 3
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Layer 4 and 5
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Multiple output units: One-vs-all
[Figure: a network with inputs $x_1, x_2, \ldots$, hidden units $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$, and four output units, one per class]

Pedestrian? Car? Motorcycle? Truck?
Multiple output units: One-vs-all
[Figure: the same multi-output network]

$$h_\Theta(x) \in \mathbb{R}^4$$

Training set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$

$y^{(i)}$ is one of
$$\begin{bmatrix}1\\0\\0\\0\end{bmatrix}, \; \begin{bmatrix}0\\1\\0\\0\end{bmatrix}, \; \begin{bmatrix}0\\0\\1\\0\end{bmatrix}, \; \begin{bmatrix}0\\0\\0\\1\end{bmatrix}$$
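A small helper (hypothetical, not from the lecture) for producing these one-hot target vectors from integer class labels:

```python
import numpy as np

def one_hot(y, num_classes=4):
    # Map class label i (0 .. K-1) to the i-th standard basis vector
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

y = np.array([0, 2, 3, 1])   # e.g., pedestrian, motorcycle, truck, car
print(one_hot(y))
```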
Things to remember
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
[Figure: the three-layer network; Layer 1: inputs $x_0, x_1, x_2, x_3$; Layer 2: $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$; Layer 3: the "output" $h_\Theta(x)$]

[Figure: the XNOR network, with hidden weights $(-30, +20, +20)$ and $(10, -20, -20)$ and output weights $(-10, +20, +20)$]