Neural Networks I
Jia-Bin Huang
Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative
• HW 2 released!
• Jia-Bin out of town next week
• Chen: Deep Neural Networks II
• Shih-Yang: Diagnosing ML systems
• Midterm review
• Midterm: March 6th (Wed)
Hard-margin SVM formulation

$$\min_\theta \; \frac{1}{2}\sum_{j=1}^{n} \theta_j^2$$

s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$; $\quad \theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$
Soft-margin SVM formulation

$$\min_{\theta,\,\{\xi^{(i)}\}} \; \frac{1}{2}\sum_{j=1}^{n} \theta_j^2 + C \sum_i \xi^{(i)}$$

s.t. $\theta^\top x^{(i)} \ge 1 - \xi^{(i)}$ if $y^{(i)} = 1$; $\quad \theta^\top x^{(i)} \le -1 + \xi^{(i)}$ if $y^{(i)} = 0$; $\quad \xi^{(i)} \ge 0 \;\; \forall i$
Support Vector Machine
•Cost function
• Large margin classification
•Kernels
•Using an SVM
Non-linear classification
• How do we separate the two classes using a hyperplane?
$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0 \\ 0 & \text{if } \theta^\top x < 0 \end{cases}$$
Non-linear classification
[Figure: two classes plotted over $x_1$ and $x_2$ that no single hyperplane separates]
Kernel
• $K(\cdot,\cdot)$ is a legal inner product: $\exists\, \phi: X \to \mathbb{R}^N$ s.t. $K(x, z) = \phi(x)^\top \phi(z)$
Why do kernels matter?
• Many algorithms interact with data only via dot products
• Replace $x^\top z$ with $K(x, z) = \phi(x)^\top \phi(z)$
• The algorithm then acts implicitly as if the data were in the higher-dimensional $\phi$-space (see the sketch below)
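A minimal NumPy sketch (not from the lecture) of this kernel trick: the decision function below touches the data only through $K(x, z)$, so swapping in a different kernel implicitly changes the feature space without changing the algorithm. The stored examples and coefficients are hypothetical.

```python
import numpy as np

def score(x, z_list, alphas, K):
    # Decision function that touches the data ONLY through K(x, z):
    # changing K changes the implicit feature space, not the algorithm.
    return sum(a * K(x, z) for a, z in zip(alphas, z_list))

dot   = lambda x, z: x @ z          # ordinary dot product (linear)
poly2 = lambda x, z: (x @ z) ** 2   # quadratic kernel from the next slide

# Hypothetical stored examples and coefficients, just to run the sketch
z_list = [np.array([1.0, 2.0]), np.array([0.0, -1.0])]
alphas = [0.5, -0.2]
x = np.array([3.0, 1.0])

print(score(x, z_list, alphas, dot), score(x, z_list, alphas, poly2))
```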
Example
$K(x, z) = (x^\top z)^2$ corresponds to the feature map

$$(x_1, x_2) \mapsto \phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)$$

$$\phi(x)^\top \phi(z) = \left(x_1^2, x_2^2, \sqrt{2}\,x_1 x_2\right)\left(z_1^2, z_2^2, \sqrt{2}\,z_1 z_2\right)^\top = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = (x^\top z)^2 = K(x, z)$$
Example
$K(x, z) = (x^\top z)^2$ corresponds to $(x_1, x_2) \mapsto \phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)$
Slide credit: Maria-Florina Balcan
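A quick numerical check of the identity above, assuming 2-D inputs; the helper `phi` is just the explicit feature map written out.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the quadratic kernel (2-D input)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

print((x @ z) ** 2)      # kernel value: (1*2 + 3*(-1))^2 = 1
print(phi(x) @ phi(z))   # same value via the explicit feature map
```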
Example kernels
• Linear kernel: $K(x, z) = x^\top z$
• Gaussian (radial basis function) kernel: $K(x, z) = \exp\left(-\frac{1}{2}(x - z)^\top \Sigma^{-1} (x - z)\right)$
• Sigmoid kernel: $K(x, z) = \tanh(a \cdot x^\top z + b)$
Constructing new kernels
• Positive scaling: $K(x, z) = c\,K_1(x, z)$, $c > 0$
• Exponentiation: $K(x, z) = \exp\left(K_1(x, z)\right)$
• Addition: $K(x, z) = K_1(x, z) + K_2(x, z)$
• Multiplication with a function: $K(x, z) = f(x)\,K_1(x, z)\,f(z)$
• Multiplication: $K(x, z) = K_1(x, z)\,K_2(x, z)$
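A small sketch (not course code) of these closure rules in NumPy; the base kernels and test points are arbitrary.

```python
import numpy as np

# Base kernels
k_lin = lambda x, z: x @ z
k_rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

# New kernels built with the closure rules above
k_scaled = lambda x, z: 3.0 * k_lin(x, z)           # positive scaling (c = 3 > 0)
k_exp    = lambda x, z: np.exp(k_lin(x, z))         # exponentiation
k_sum    = lambda x, z: k_lin(x, z) + k_rbf(x, z)   # addition
k_prod   = lambda x, z: k_lin(x, z) * k_rbf(x, z)   # multiplication

x, z = np.array([1.0, 0.5]), np.array([0.0, 2.0])
print(k_scaled(x, z), k_exp(x, z), k_sum(x, z), k_prod(x, z))
```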
Non-linear decision boundary
Predict 𝑦 = 1 if
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$$

Equivalently, $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, $\ldots$

Is there a different/better choice of the features $f_1, f_2, f_3, \ldots$?

[Figure: a non-linear decision boundary in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
Kernel
Given $x$, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$:

$f_1 = \text{similarity}(x, l^{(1)})$, $\quad f_2 = \text{similarity}(x, l^{(2)})$, $\quad f_3 = \text{similarity}(x, l^{(3)})$

Gaussian kernel:
$$\text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$$

[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
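A short sketch of the Gaussian similarity feature, assuming three hypothetical 2-D landmarks and $\sigma^2 = 1$.

```python
import numpy as np

def gaussian_similarity(x, l, sigma2=1.0):
    # similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

# Hypothetical landmarks in 2-D
landmarks = [np.array([3.0, 2.0]), np.array([1.0, 4.0]), np.array([5.0, 5.0])]
x = np.array([3.2, 2.1])

f = np.array([gaussian_similarity(x, l) for l in landmarks])
print(f)   # f1 near 1 (x is close to l1); f2 and f3 near 0
```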
Predict 𝑦 = 1 if
$$\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$$

Ex: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$

$f_1 = \text{similarity}(x, l^{(1)})$, $\quad f_2 = \text{similarity}(x, l^{(2)})$, $\quad f_3 = \text{similarity}(x, l^{(3)})$

[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
Choosing the landmarks
• Given 𝑥
$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$$

Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$

Where to get $l^{(1)}, l^{(2)}, l^{(3)}, \ldots$?

[Figure: landmarks $l^{(1)}, \ldots, l^{(m)}$ placed in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
SVM with kernels
• Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$
• Choose $l^{(1)} = x^{(1)}$, $l^{(2)} = x^{(2)}$, $l^{(3)} = x^{(3)}$, $\ldots$, $l^{(m)} = x^{(m)}$
• Given example $x$:
  • $f_1 = \text{similarity}(x, l^{(1)})$
  • $f_2 = \text{similarity}(x, l^{(2)})$
  • $\cdots$
• For training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \to f^{(i)}$

$$f = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_m \end{bmatrix}$$
Slide credit: Andrew Ng
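One way this mapping could be written in NumPy, assuming a Gaussian similarity and landmarks equal to the training examples; `kernel_features` is a hypothetical helper, not part of any package.

```python
import numpy as np

def kernel_features(X, sigma2=1.0):
    # Landmarks are the training examples themselves: l(i) = x(i).
    # Returns F of shape (m, m+1): row i is f(i) = [f0 = 1, f1, ..., fm].
    m = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    F = np.exp(-sq_dists / (2 * sigma2))
    return np.hstack([np.ones((m, 1)), F])   # prepend f0 = 1 (bias feature)

X = np.random.randn(5, 2)   # 5 toy training examples
F = kernel_features(X)
print(F.shape)               # (5, 6)
```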
• Hypothesis: Given 𝑥, compute features 𝑓 ∈ ℝ𝑚+1
• Predict 𝑦 = 1 if 𝜃⊤𝑓 ≥ 0
• Training (original):
$$\min_\theta \; C\sum_{i=1}^{m} \left[\, y^{(i)} \,\text{cost}_1(\theta^\top x^{(i)}) + (1 - y^{(i)})\, \text{cost}_0(\theta^\top x^{(i)}) \,\right] + \frac{1}{2}\sum_{j=1}^{n} \theta_j^2$$

• Training (with kernel):
$$\min_\theta \; C\sum_{i=1}^{m} \left[\, y^{(i)} \,\text{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)})\, \text{cost}_0(\theta^\top f^{(i)}) \,\right] + \frac{1}{2}\sum_{j=1}^{m} \theta_j^2$$
SVM with kernels
Support vector machines (Primal/Dual)
• Primal form:
$$\min_{\theta,\,\{\xi^{(i)}\}} \; \frac{1}{2}\sum_{j=1}^{n} \theta_j^2 + C \sum_i \xi^{(i)}$$
s.t. $\theta^\top x^{(i)} \ge 1 - \xi^{(i)}$ if $y^{(i)} = 1$; $\quad \theta^\top x^{(i)} \le -1 + \xi^{(i)}$ if $y^{(i)} = 0$; $\quad \xi^{(i)} \ge 0 \;\; \forall i$

• Lagrangian dual form:
$$\min_\alpha \; \frac{1}{2}\sum_i \sum_j y^{(i)} y^{(j)} \alpha^{(i)} \alpha^{(j)}\, x^{(i)\top} x^{(j)} - \sum_i \alpha^{(i)}$$
s.t. $0 \le \alpha^{(i)} \le C \;\; \forall i$; $\quad \sum_i y^{(i)} \alpha^{(i)} = 0$
SVM (Lagrangian dual)
$$\min_\alpha \; \frac{1}{2}\sum_i \sum_j y^{(i)} y^{(j)} \alpha^{(i)} \alpha^{(j)}\, x^{(i)\top} x^{(j)} - \sum_i \alpha^{(i)}$$
s.t. $0 \le \alpha^{(i)} \le C \;\; \forall i$; $\quad \sum_i y^{(i)} \alpha^{(i)} = 0$

Classifier: $\theta = \sum_i \alpha^{(i)} y^{(i)} x^{(i)}$
• The points $x^{(i)}$ for which $\alpha^{(i)} \ne 0$ are the support vectors
• Kernel trick: replace $x^{(i)\top} x^{(j)}$ with $K(x^{(i)}, x^{(j)})$
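A sketch of how the dual only needs the Gram matrix $K(x^{(i)}, x^{(j)})$. The helpers `gram_matrix` and `dual_objective` are illustrative, and in practice the $\alpha$'s would come from a QP solver rather than being chosen by hand.

```python
import numpy as np

def gram_matrix(X, K):
    # G[i, j] = K(x(i), x(j)): the only way the dual touches the data.
    m = X.shape[0]
    G = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = K(X[i], X[j])
    return G

def dual_objective(alpha, y, G):
    # (1/2) * sum_ij y_i y_j alpha_i alpha_j K(x_i, x_j) - sum_i alpha_i
    return 0.5 * (alpha * y) @ G @ (alpha * y) - alpha.sum()

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)

X = np.random.randn(6, 2)
y = np.array([1, 1, 1, -1, -1, -1])
alpha = np.full(6, 0.1)          # placeholder values, not a solution
print(dual_objective(alpha, y, gram_matrix(X, rbf)))
```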
SVM parameters
• $C = \frac{1}{\lambda}$
  • Large $C$: lower bias, higher variance
  • Small $C$: higher bias, lower variance
• $\sigma^2$
  • Large $\sigma^2$: features $f_i$ vary more smoothly → higher bias, lower variance
  • Small $\sigma^2$: features $f_i$ vary less smoothly → lower bias, higher variance
Slide credit: Andrew Ng
SVM Demo
• https://cs.stanford.edu/people/karpathy/svmjs/demo/
SVM song
Video source:
• https://www.youtube.com/watch?v=g15bqtyidZs
Support Vector Machine
•Cost function
• Large margin classification
•Kernels
•Using an SVM
Using an SVM
• Use an SVM software package (e.g., liblinear, libsvm) to solve for $\theta$
• Need to specify:
  • Choice of parameter $C$
  • Choice of kernel (similarity function):
    • Linear kernel: predict $y = 1$ if $\theta^\top x \ge 0$
    • Gaussian kernel: $f_i = \exp\left(-\frac{\lVert x - l^{(i)} \rVert^2}{2\sigma^2}\right)$, where $l^{(i)} = x^{(i)}$
      • Need to choose $\sigma^2$; need proper feature scaling
Slide credit: Andrew Ng
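For concreteness, a hedged sketch using scikit-learn (one such package; its `SVC` is built on libsvm). Note that sklearn's RBF kernel is $\exp(-\gamma\lVert x - z\rVert^2)$, so $\gamma = 1/(2\sigma^2)$ recovers the form on the slide; the toy data below is made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy non-linear data; in practice X, y come from your problem
X = np.random.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

sigma2 = 0.5
clf = make_pipeline(
    StandardScaler(),                                     # proper feature scaling
    SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma2)),   # Gaussian kernel
)
clf.fit(X, y)
print(clf.score(X, y))
```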
Kernel (similarity) functions
• Note: not all similarity functions make valid kernels.
• Many off-the-shelf kernels available:
  • Polynomial kernel
• String kernel
• Chi-square kernel
• Histogram intersection kernel
Slide credit: Andrew Ng
Multi-class classification
• Use the one-vs-all method: train $K$ SVMs, one to distinguish $y = i$ from the rest, to get $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)}$
• Pick the class $i$ with the largest $(\theta^{(i)})^\top x$ (see the sketch below)

[Figure: three classes in the $(x_1, x_2)$ plane, each separated from the rest]
Slide credit: Andrew Ng
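A minimal sketch of the one-vs-all prediction rule, assuming the $K$ per-class weight vectors have already been trained; the $\Theta$ values here are made up.

```python
import numpy as np

def one_vs_all_predict(Theta, x):
    # Theta: (K, n+1), row i holds theta(i) from the i-th binary SVM
    # x: (n+1,) feature vector including the intercept term x0 = 1
    scores = Theta @ x             # theta(i)^T x for every class i
    return int(np.argmax(scores))  # pick the class with the largest score

Theta = np.array([[ 0.5,  1.0, -2.0],
                  [-1.0,  0.3,  2.5],
                  [ 0.1, -1.2,  0.4]])   # 3 classes, hypothetical weights
x = np.array([1.0, 0.7, 1.4])
print(one_vs_all_predict(Theta, x))      # class 1 for these numbers
```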
Logistic regression vs. SVMs
• 𝑛 = number of features (𝑥 ∈ ℝ𝑛+1), 𝑚 = number of training examples
1. If $n$ is large (relative to $m$), e.g., $n = 10{,}000$, $m = 10$–$1000$: use logistic regression or an SVM without a kernel ("linear kernel")
2. If $n$ is small and $m$ is intermediate, e.g., $n = 1$–$1000$, $m = 10$–$10{,}000$: use an SVM with a Gaussian kernel
3. If $n$ is small and $m$ is large, e.g., $n = 1$–$1000$, $m = 50{,}000{+}$: create/add more features, then use logistic regression or a linear SVM

A neural network is likely to work well in most of these cases, but is slower to train.
Slide credit: Andrew Ng
Things to remember
• Cost function
$$\min_\theta \; C\sum_{i=1}^{m} \left[\, y^{(i)} \,\text{cost}_1(\theta^\top f^{(i)}) + (1 - y^{(i)})\, \text{cost}_0(\theta^\top f^{(i)}) \,\right] + \frac{1}{2}\sum_{j=1}^{m} \theta_j^2$$
• Large margin classification
• Kernels
• Using an SVM
[Figure: large-margin decision boundary in the $(x_1, x_2)$ plane, with the margin marked]
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Non-linear classification
Predict $y = 1$ if
$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots) \ge 0.5$$

Example features (housing): $x_1 =$ size, $x_2 =$ #bedrooms, $x_3 =$ #floors, $x_4 =$ age, $\ldots$, $x_{100}$

[Figure: a non-linear decision boundary in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
Number of quadratic features? Number of cubic features?
What humans see
Slide credit: Larry Zitnick
What computers see

[Grid of raw pixel intensity values (0–255); e.g., the first row reads 243 239 240 225 206 185 188 218 211 206 216 225]
Slide credit: Larry Zitnick
Computer Vision: Car detection
Testing:
Slide credit: Andrew Ng
50×50 pixel images → 2500 pixels

$$x = \begin{bmatrix} \text{pixel 1 intensity} \\ \text{pixel 2 intensity} \\ \vdots \\ \text{pixel 2500 intensity} \end{bmatrix}$$

Quadratic features ($x_i x_j$): ~3 million features
Slide credit: Andrew Ng
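Where the ~3 million figure comes from: counting the distinct quadratic terms $x_i x_j$ with $i \le j$ for $n = 2500$ pixels,

$$\frac{n(n+1)}{2} = \frac{2500 \cdot 2501}{2} = 3{,}126{,}250 \approx 3 \times 10^6.$$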
Neural Networks
• Origins: Algorithms that try to mimic the brain.
• Very widely used in the 80s and early 90s; popularity diminished in the late 90s.
• Recent resurgence: State-of-the-art technique for many applications
Slide credit: Andrew Ng
https://www.slideshare.net/dlavenda/ai-and-productivity
The “one learning algorithm” hypothesis
[Roe et al. 1992] Slide credit: Andrew Ng
The “one learning algorithm” hypothesis
[Metin and Frost 1989] Slide credit: Andrew Ng
Sensor representations in the brain
Slide credit: Andrew Ng
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
A single neuron in the brain
Input
Output
Slide credit: Andrew Ng
http://histoweb.co.za/082/082img002.html
An artificial neuron: Logistic unit
$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}$$

• Sigmoid (logistic) activation function

[Figure: inputs $x_0$ (the "bias unit"), $x_1$, $x_2$, $x_3$ (the "input") feed a single unit whose "output" is $h_\theta(x)$; the $\theta$'s are the "weights" / "parameters"]

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
Slide credit: Andrew Ng
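A direct NumPy transcription of this unit; the weight values below are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(theta, x):
    # h_theta(x) = 1 / (1 + exp(-theta^T x)), with x0 = 1 as the bias unit
    return sigmoid(theta @ x)

theta = np.array([-1.0, 2.0, 0.5, -0.3])   # hypothetical weights
x = np.array([1.0, 0.4, 1.2, 2.0])         # x0 = 1, then x1, x2, x3
print(logistic_unit(theta, x))
```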
Neural network
[Figure: Layer 1: inputs $x_0, x_1, x_2, x_3$; Layer 2: hidden units $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$; Layer 3: the "output" $h_\Theta(x)$]
Slide credit: Andrew Ng
Neural network
[Figure: the same three-layer network]

$a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
$\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j + 1$

$$a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right)$$
$$a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right)$$
$$a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right)$$
$$h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right)$$

If there are $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j + 1$, what is the size of $\Theta^{(j)}$? It is $s_{j+1} \times (s_j + 1)$.
Slide credit: Andrew Ng
Neural network
[Figure: the same three-layer network]

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$$

$$a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right) = g(z_1^{(2)})$$
$$a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right) = g(z_2^{(2)})$$
$$a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right) = g(z_3^{(2)})$$
$$h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right) = g(z^{(3)})$$

The $z$'s are the "pre-activations".
Slide credit: Andrew Ng
Neural network
[Figure: the same three-layer network]

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$$

$a_1^{(2)} = g(z_1^{(2)})$, $\quad a_2^{(2)} = g(z_2^{(2)})$, $\quad a_3^{(2)} = g(z_3^{(2)})$, $\quad h_\Theta(x) = g(z^{(3)})$ ("pre-activation")

Vectorized forward propagation:
$$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$$
$$a^{(2)} = g(z^{(2)})$$
Add $a_0^{(2)} = 1$
$$z^{(3)} = \Theta^{(2)} a^{(2)}$$
$$h_\Theta(x) = a^{(3)} = g(z^{(3)})$$
Slide credit: Andrew Ng
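A compact NumPy sketch of these forward-propagation equations for the 3-3-1 network on the slide; the $\Theta$ matrices are random placeholders, with sizes following $s_{j+1} \times (s_j + 1)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))              # add bias unit x0 = 1
    z2 = Theta1 @ a1                             # z(2) = Theta(1) a(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # a(2) = g(z(2)), add a0(2) = 1
    z3 = Theta2 @ a2                             # z(3) = Theta(2) a(2)
    return sigmoid(z3)                           # h_Theta(x) = a(3) = g(z(3))

Theta1 = np.random.randn(3, 4)   # maps 3+1 inputs to 3 hidden units
Theta2 = np.random.randn(1, 4)   # maps 3+1 hidden units to 1 output
x = np.array([0.5, -1.0, 2.0])
print(forward(x, Theta1, Theta2))
```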
Neural network learning its own features
[Figure: the same network; the hidden units $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ act as features that the network learns from the inputs $x_1, x_2, x_3$]
Slide credit: Andrew Ng
Other network architectures
[Figure: a deeper network with two hidden layers, $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ and $a_1^{(3)}, a_2^{(3)}$, before the output $h_\Theta(x)$]
Slide credit: Andrew Ng
• Architecture
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Non-linear classification example: XOR/XNOR
• $x_1, x_2$ are binary (0 or 1)
• $y = \text{XOR}(x_1, x_2)$

[Figure: the four binary inputs in the $(x_1, x_2)$ plane, with labels $y = 1$ and $y = 0$ arranged so that no single line separates them]
Slide credit: Andrew Ng
Simple example: AND
• $x_1, x_2 \in \{0, 1\}$
• $y = x_1$ AND $x_2$

[Figure: a single unit with inputs $+1$ (bias), $x_1$, $x_2$ and weights $-30$, $+20$, $+20$]

$$h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)$$

x1  x2  hΘ(x)
0   0   g(−30) ≈ 0
0   1   g(−10) ≈ 0
1   0   g(−10) ≈ 0
1   1   g(10) ≈ 1

$h_\Theta(x) \approx x_1$ AND $x_2$
Slide credit: Andrew Ng
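A quick numerical check of this truth table (a sketch, using the weights from the slide):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def and_unit(x1, x2):
    # Single sigmoid unit with weights (-30, +20, +20) from the slide
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_unit(x1, x2), 4))
# 0 0 -> ~0, 0 1 -> ~0, 1 0 -> ~0, 1 1 -> ~1
```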
Simple example: OR
• $x_1, x_2 \in \{0, 1\}$
• $y = x_1$ OR $x_2$

[Figure: a single unit with inputs $+1$ (bias), $x_1$, $x_2$ and weights $-10$, $+20$, $+20$]

$$h_\Theta(x) = g(-10 + 20 x_1 + 20 x_2)$$

x1  x2  hΘ(x)
0   0   g(−10) ≈ 0
0   1   g(10) ≈ 1
1   0   g(10) ≈ 1
1   1   g(30) ≈ 1

$h_\Theta(x) \approx x_1$ OR $x_2$
Slide credit: Andrew Ng
Simple example: NOT
• $x_1 \in \{0, 1\}$
• $y =$ NOT $x_1$

[Figure: a single unit with inputs $+1$ (bias) and $x_1$, with weights $10$ and $-20$]

$$h_\Theta(x) = g(10 - 20 x_1)$$

x1  hΘ(x)
0   g(10) ≈ 1
1   g(−10) ≈ 0

$h_\Theta(x) \approx$ NOT $x_1$
Slide credit: Andrew Ng
Putting it together: $x_1$ XNOR $x_2$

Building blocks (each a single sigmoid unit with inputs $+1$, $x_1$, $x_2$):
• $x_1$ AND $x_2$: weights $-30, +20, +20$
• (NOT $x_1$) AND (NOT $x_2$): weights $10, -20, -20$
• $x_1$ OR $x_2$: weights $-10, +20, +20$

Combined network: hidden unit $a_1^{(2)}$ uses weights $-30, +20, +20$ (AND); hidden unit $a_2^{(2)}$ uses weights $10, -20, -20$ ((NOT $x_1$) AND (NOT $x_2$)); the output unit combines $+1, a_1^{(2)}, a_2^{(2)}$ with weights $-10, +20, +20$ (OR), giving $h_\Theta(x)$.

x1  x2  a1(2)  a2(2)  hΘ(x)
0   0   0      1      1
0   1   0      0      0
1   0   0      0      0
1   1   1      0      1

Slide credit: Andrew Ng
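A short forward-pass sketch that reproduces the XNOR truth table with the slide's weights:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Theta1 = np.array([[-30,  20,  20],    # a1(2): x1 AND x2
                   [ 10, -20, -20]])   # a2(2): (NOT x1) AND (NOT x2)
Theta2 = np.array([[-10,  20,  20]])   # output: a1(2) OR a2(2)

def xnor(x1, x2):
    a1 = np.array([1, x1, x2])                        # add bias unit
    a2 = np.concatenate(([1], sigmoid(Theta1 @ a1)))  # hidden activations
    return sigmoid(Theta2 @ a2)[0]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
# 0 0 -> 1, 0 1 -> 0, 1 0 -> 0, 1 1 -> 1
```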
Layer 1
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Layer 2
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Layer 3
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Layer 4 and 5
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Neural Networks
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
Multiple output units: One-vs-all
[Figure: a network with inputs $x_1, x_2, \ldots$, hidden units $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$, and four output units, one per class]

Pedestrian? Car? Motorcycle? Truck?
Multiple output units: One-vs-all
[Figure: the same multi-output network]

$$h_\Theta(x) \in \mathbb{R}^4$$

Training set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$

$y^{(i)}$ is one of
$$\begin{bmatrix}1\\0\\0\\0\end{bmatrix}, \; \begin{bmatrix}0\\1\\0\\0\end{bmatrix}, \; \begin{bmatrix}0\\0\\1\\0\end{bmatrix}, \; \begin{bmatrix}0\\0\\0\\1\end{bmatrix}$$
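A small helper (hypothetical, not from the lecture) for producing these one-hot target vectors from integer class labels:

```python
import numpy as np

def one_hot(y, num_classes=4):
    # Map class label i (0 .. K-1) to the i-th standard basis vector
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

y = np.array([0, 2, 3, 1])   # e.g., pedestrian, motorcycle, truck, car
print(one_hot(y))
```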
Things to remember
•Why neural networks?
•Model representation
• Examples and intuitions
•Multi-class classification
[Figure: the three-layer network; Layer 1: inputs $x_0, x_1, x_2, x_3$; Layer 2: $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$; Layer 3: the "output" $h_\Theta(x)$]

[Figure: the XNOR network, with hidden weights $(-30, +20, +20)$ and $(10, -20, -20)$ and output weights $(-10, +20, +20)$]