Post on 19-May-2022
On Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features
Zhenmei Shi, Jenny Wei, Yingyu Liang
UW-Madison
1 / 30
Introduction
2 / 30
Deep Learning
• Remarkable success in applications.
• Advantage over traditional machine learning methods.
Figure 1: Computer Vision, Reinforcement Learning, Natural Language Processing
3 / 30
Neural Networks
Two-layer network: y′ = g(x) = a^T σ(Wx + b).
Train: gradient descent
θ^(t) = θ^(t−1) − η^(t) ∇_θ L(g^(t−1)),
where θ denotes W, b, and a.
4 / 30
Neural Networks
• Loss function ℓ(y, y′): measures the cost incurred by making decision y′ when y is the true label.
• Evaluate the risk function: L(g) = E_{(x,y)}[ℓ(y, g(x))].
• Regularization terms can be added.
5 / 30
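The training loop above can be sketched end to end. The following toy numpy example is a minimal sketch, not the paper's setup: the sizes, the toy data, the squared-loss surrogate, and the step size are all illustrative assumptions. It runs the update θ^(t) = θ^(t−1) − η∇_θ L and checks that the loss decreases.

```python
import numpy as np

# Minimal numpy sketch of gradient descent on a two-layer network
# g(x) = a^T sigma(Wx + b). Sizes, toy data, the squared-loss surrogate,
# and the step size are illustrative assumptions, not the paper's setup.
rng = np.random.default_rng(0)
d, m, n = 10, 32, 200                      # input dim, hidden width, samples

X = rng.standard_normal((n, d))
y = np.sign(X.sum(axis=1))                 # toy labels: sign of coordinate sum

W = 0.1 * rng.standard_normal((m, d))      # theta = (W, b, a), Gaussian init
b = np.zeros(m)
a = 0.1 * rng.standard_normal(m)

def forward(X):
    h = np.maximum(X @ W.T + b, 0.0)       # ReLU hidden layer
    return h @ a, h

def loss():
    out, _ = forward(X)
    return np.mean((out - y) ** 2)

loss_before = loss()
eta = 0.01                                 # eta^(t), held constant here
for t in range(300):                       # theta^(t) = theta^(t-1) - eta * grad
    out, h = forward(X)
    r = 2 * (out - y) / n                  # d(loss)/d(out)
    dh = np.outer(r, a) * (h > 0)          # backprop through the ReLU
    a -= eta * (h.T @ r)
    W -= eta * (dh.T @ X)
    b -= eta * dh.sum(axis=0)
loss_after = loss()
```

With a small enough step size, full-batch gradient descent on this smooth surrogate monotonically reduces the training loss.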
Existing Works
Why do neural networks succeed (in the overparameterized regime)?
Current theoretical understanding:
• Training can be approximated by the Neural Tangent Kernel (NTK regime, or lazy learning regime).
• However, practical training does not fit in the NTK regime. Also, NTK cannot explain the advantage of networks over traditional fixed-feature methods (random features, kernel methods).
• A recent line of work shows neural networks provably enjoy advantages over fixed-feature methods including NTK.
• However, these works have not investigated whether input structure is crucial for feature learning, have not analyzed how gradient descent learns effective features, or rely on strong assumptions (e.g., special networks, Gaussian data, etc.).
6 / 30
Empirical Observation
• Neural networks perform better on data with structures.
• Neural networks perform feature learning (feature learning regime).
Figure 2: Networks can learn neurons that correspond to different semantic patterns in the inputs. 1
1. From the paper "Visualizing and Understanding Convolutional Networks".
7 / 30
Question
Question 1: How can effective features emerge from inputs in the training dynamics of gradient descent?
Question 2: Is feature learning from inputs necessary for superior performance?
8 / 30
Roadmap
To answer the previous two questions:
Step 1: Choose input data distributions with and without structures.
Step 2: Show feature learning exists for inputs with structures. Analyze convergence of gradient descent with the aid of learned features.
Step 3: Show fixed feature methods under the same condition (data with structures) cannot learn efficiently.
Step 4: Show learning input data without structures is much harder for all methods.
9 / 30
Problem Setup
10 / 30
Pattern Counting Problem
Motivation:
• Dictionary learning and sparse coding.
• Images contain label-relevant or label-irrelevant patterns.
• If an image contains a sufficient number of label-relevant patterns, the image may belong to a certain class.
Figure 3: Systematically cover up different portions of the scene with a gray square and see how the top feature maps and classifier output change. (b): for each position of the gray square, record the activation in the feature map. (c): a visualization of this feature map projected down into the input image. (d): a map of correct class probability. (e): the most probable label as a function of occluder position.
11 / 30
Pattern Counting Problem
Hidden representation (pattern indicator vector):
• φ ∈ {0,1}^D: hidden vector indicating the presence of each pattern.
• D_φ: a distribution over φ.
• M: unknown dictionary of patterns.
Input:
• Given φ, D_φ and M, generate input x by: x = Mφ.
Label:
• A ⊆ [D]: subset of size k, corresponding to class-relevant patterns.
• P ⊆ [k]; generate label y by:
  y = +1 if Σ_{i∈A} φ_i ∈ P, and y = −1 otherwise.
• Intuition: y can be any binary function over the number of class-relevant patterns.
12 / 30
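The generative model on this slide can be sketched in a few lines. This is a hypothetical instantiation: the sizes D and k, the set P, the probability p_o, and the crude way the patterns in A are tied to y are all illustrative assumptions, not the paper's exact distribution family.

```python
import numpy as np

# Sketch of the pattern-counting data model: phi in {0,1}^D, x = M phi,
# y = +1 iff the number of class-relevant patterns (those in A) lies in P.
# Sizes and the simple correlated sampler for phi_A are assumptions.
rng = np.random.default_rng(0)
D = 50
M = np.linalg.qr(rng.standard_normal((D, D)))[0]  # dictionary, orthonormal atoms
A = np.arange(5)                                  # class-relevant patterns, k = 5
P = {1, 3, 5}                                     # counts mapped to label +1
p_o = 0.2                                         # background pattern probability

def sample(n):
    phi = (rng.random((n, D)) < p_o).astype(float)      # background patterns
    y = rng.choice([-1.0, 1.0], size=n)                 # equal class probability
    for i in range(n):
        cnt = rng.choice(sorted(P)) if y[i] > 0 else 0  # hit P, or miss it (0 not in P)
        phi[i, A] = 0.0
        phi[i, rng.choice(A, size=cnt, replace=False)] = 1.0
    return phi @ M.T, y, phi                            # x = M phi (rows are samples)

x, y, phi = sample(8)
```

By construction, the label of every sample agrees with whether its count of class-relevant patterns falls in P.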
Assumptions on Distribution over Hidden Representation
Input with structures (family of distributions F_Ξ):
(A0) Equal class probability.
(A1) The patterns in A are correlated with the labels: for any i ∈ A, γ = E[yφ_i] − E[y]E[φ_i] > 0.
(A2) Each pattern outside A is independent of all other patterns and identically distributed. Let p_o := Pr[φ_i = 1] ≤ 1/2 denote the probability that they appear.
Input without structures (family of distributions F_{Ξ0}):
(A1′) The patterns are independent, and each φ_i is uniform.
13 / 30
Network
• Two-layer network: g(x) = Σ_{i=1}^{2m} a_i σ(⟨w_i, x⟩ + b_i).
• σ(z) = min(1, max(z, 0)): the truncated rectified linear unit (ReLU) activation function.
• Hinge loss and ℓ2 regularization.
• Gaussian initialization and gradient descent.
14 / 30
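The model and objective on this slide are direct to write down. A minimal sketch follows; the sizes and the regularization weight lam are illustrative assumptions.

```python
import numpy as np

# Sketch of the model on this slide: 2m hidden units with the truncated ReLU
# sigma(z) = min(1, max(z, 0)), hinge loss, and an l2 penalty. Sizes and the
# regularization weight lam are illustrative assumptions.
def trunc_relu(z):
    return np.minimum(1.0, np.maximum(z, 0.0))

def g(x, W, b, a):
    # g(x) = sum_{i=1}^{2m} a_i sigma(<w_i, x> + b_i); x is a batch (n, d)
    return trunc_relu(x @ W.T + b) @ a

def objective(x, y, W, b, a, lam=1e-3):
    hinge = np.maximum(0.0, 1.0 - y * g(x, W, b, a)).mean()
    reg = lam * (np.sum(W ** 2) + np.sum(a ** 2))   # l2 regularization
    return hinge + reg

rng = np.random.default_rng(0)
d, m = 8, 4
W = rng.standard_normal((2 * m, d))                 # Gaussian initialization
b = np.zeros(2 * m)
a = rng.standard_normal(2 * m)
x = rng.standard_normal((5, d))
y = rng.choice([-1.0, 1.0], size=5)
val = objective(x, y, W, b, a)
```

Note the truncation: σ saturates at 1, so σ′(z) = 1 only for z ∈ (0, 1), which matters in the gradient analysis later in the talk.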
Main Results
15 / 30
Provable Guarantee for Neural Networks
Theorem 1 (Informal)
For any small positive δ and ε, if k = Ω(log^2(Dm/(δγ))), p_o = Ω(k^2/D) and m ≥ max{Ω(k^12/ε^{3/2}), D}, then with proper hyper-parameters, for any distribution D ∈ F_Ξ, with probability at least 1 − δ, there exists t ∈ [T] such that Pr[sign(g^{(t)}(x)) ≠ y] ≤ L_D(g^{(t)}) ≤ ε.
Message:
• For a wide range of the background pattern probability p_o and the number of class-relevant patterns k, given any input distribution with structure, neural networks can achieve small population risk with a polynomial number of neurons.
• The analysis shows the success comes from feature learning.
16 / 30
Lower Bound for Fixed Features Models
Fixed features model:
Suppose Ψ is a data-independent feature mapping of dimension N with bounded features, i.e., Ψ : X → [−1,1]^N. For B > 0, the family of linear models on Ψ with bounded norm B is H_B = {h : h(x) = ⟨Ψ(x), w⟩, ∥w∥₂ ≤ B}.
Theorem 2 (Informal)
With proper k, there exists D ∈ F_Ξ such that all h ∈ H_B have hinge-loss at least p_o(1 − √(2NB²/2^k)).
Message:
There exists an input distribution with structure such that no fixed feature method with a polynomial number of features can efficiently learn the task.
17 / 30
Lower Bound for Learning Without Input Structure
Statistical Query (SQ) model:
• The learner only receives information through statistical queries (Q, τ): a property predicate Q over labeled instances and a tolerance τ ∈ [0,1]. It receives a response P̃_Q ∈ [P_Q − τ, P_Q + τ], where P_Q = Pr[Q(x, y) is true].
• The SQ model captures almost all common learning algorithms, including mini-batch SGD.
Theorem 3
For any algorithm in the Statistical Query model with queries (Q, τ) that can learn over F_{Ξ0} to classification error less than 1/2 − 1/(D choose k)^3, either the number of queries or 1/τ must be at least (1/2)(D choose k)^{1/3}.
Message:
Without input structure, no polynomial SQ algorithm can do non-trivially better than random guessing.
18 / 30
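The SQ interaction described above can be sketched concretely. This is an illustrative toy oracle, not part of the paper: the learner never sees samples, only answers to queries that are accurate within the tolerance τ; the example data and query are assumptions.

```python
import numpy as np

# Sketch of the statistical-query interaction: the learner sees no samples,
# only answers to queries (Q, tau) accurate within tolerance tau.
# The oracle below and the example query are illustrative assumptions.
rng = np.random.default_rng(0)

def sq_oracle(data, Q, tau):
    # Return some value within tau of P_Q = Pr[Q(x, y) is true];
    # here, the empirical probability plus bounded noise.
    x, y = data
    p_q = np.mean([1.0 if Q(xi, yi) else 0.0 for xi, yi in zip(x, y)])
    return float(np.clip(p_q + rng.uniform(-tau, tau), 0.0, 1.0))

# Example: data whose label is the first bit; query "does bit 0 match y?"
x = rng.choice([0.0, 1.0], size=(1000, 5))
y = 2 * x[:, 0] - 1
ans = sq_oracle((x, y), lambda xi, yi: (2 * xi[0] - 1) == yi, tau=0.01)
```

The lower bound says that over F_{Ξ0} every such query answer is nearly uninformative unless the number of queries or 1/τ is superpolynomial.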
Proof Sketches
19 / 30
Existence of a Good Network
Intuition:
Find a "good" two-layer network that can represent the target labeling function, whose neurons are viewed as ground-truth features. Then focus on analyzing how the network learns such neuron weights.
Lemma 4 (Informal)
For any D ∈ F_Ξ, there exists a two-layer network g*(x) with zero loss. Furthermore, the hidden neurons' weights in g*(x) are all proportional to Σ_{j∈A} M_j.
20 / 30
Feature Emergence in the First Gradient Step
Intuition:
After the first gradient step, the hidden neurons of the trained network become close to the ground-truth features.
Lemma 5 (Informal)
∂/∂w_i L_D(g^{(0)}) = −a_i^{(0)} Σ_{j=1}^{D} M_j T_j, where T_j satisfies:
• if j ∈ A, then T_j ≈ O(γ);
• if j ∉ A, then |T_j| ≤ O(ε_{e1}) for a small ε_{e1}.
Message:
Gradients are updated uniformly in the class-relevant pattern directions. Their updates in the class-irrelevant pattern directions are relatively small.
21 / 30
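Lemma 5 can be illustrated numerically. The sketch below is under an assumed, simplified version of the data model (same crude correlated sampler as before; all sizes are illustrative): at a small random initialization, the population gradient of each w_i has much larger components along the class-relevant atoms M_j, j ∈ A, than along the background atoms.

```python
import numpy as np

# Numerical illustration of Lemma 5 under an assumed, simplified data model:
# at small random init, the gradient of each w_i is much larger along the
# class-relevant atoms M_j (j in A) than along the background atoms.
rng = np.random.default_rng(0)
D, n, m, p_o = 256, 20000, 8, 0.2
A = np.arange(5)
M = np.linalg.qr(rng.standard_normal((D, D)))[0]    # orthonormal dictionary

phi = (rng.random((n, D)) < p_o).astype(float)
y = rng.choice([-1.0, 1.0], size=n)
for i in range(n):
    cnt = rng.choice([1, 3, 5]) if y[i] > 0 else 0  # correlate phi_A with y
    phi[i, A] = 0.0
    phi[i, rng.choice(A, size=cnt, replace=False)] = 1.0
x = phi @ M.T

W = 0.01 * rng.standard_normal((m, D))              # small Gaussian init
a = rng.choice([-1.0, 1.0], size=m)
pre = x @ W.T                                       # biases taken as 0 here
act = (pre > 0) & (pre < 1)                         # sigma'(z) of truncated ReLU

# Hinge loss is in its linear regime at small init, so the gradient of w_i is
# -a_i E[ y x sigma'(<w_i, x>) ]; project it onto every dictionary atom M_j.
T = np.empty((m, D))
for i in range(m):
    grad_i = -(a[i] * (y[:, None] * x * act[:, [i]])).mean(axis=0)
    T[i] = M.T @ grad_i                             # T_j components of Lemma 5

rel = np.abs(T[:, A]).mean()                        # class-relevant directions
irr = np.abs(np.delete(T, A, axis=1)).mean()        # background directions
```

In this toy run the average class-relevant component dominates the average background component by a large factor, matching the lemma's message.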
Proof Ideas of Lemma 5
The gradient of w_i is: ∂L_D(g)/∂w_i = −a_i E_{(x,y)∼D}[ y x σ′(⟨w_i, x⟩ + b_i) ].
Let φ̃ = (φ − E[φ])/σ be the standardized pattern vector. Then the component of the gradient along M_j is:
⟨M_j, ∂L_D(g)/∂w_i⟩ = −a_i E[ y φ̃_j σ′( Σ_{ℓ∈[D]} φ̃_ℓ q_ℓ + b_i ) ].  (1)
If the set of class-relevant patterns A is relatively small, then
I_[D] := σ′( Σ_{ℓ∈[D]} φ̃_ℓ q_ℓ + b_i ) ≈ I_{−A} := σ′( Σ_{ℓ∉A} φ̃_ℓ q_ℓ + b_i ).  (2)
Since the background patterns are independent of the label and of φ_j for j ∈ A, the factor I_{−A} can be pulled out, so the component along each class-relevant pattern is nearly a constant:
⟨M_j, ∂L_D(g)/∂w_i⟩ ∝ E[ y φ̃_j I_[D] ] ≈ E[ y φ̃_j I_{−A} ] = E[ y φ̃_j ] E[ I_{−A} ].  (3)
Similarly, for background patterns, the component is close to 0.
22 / 30
Feature Improvement in the Second Gradient Step
Intuition:
After the second gradient step, these neurons improve to a sufficiently good level.
Lemma 6 (Informal)
∂/∂w_i L_D(g^{(1)}) = −a_i^{(1)} Σ_{j=1}^{D} M_j T_j, where T_j satisfies:
• if j ∈ A, then T_j ≈ O(γ);
• if j ∉ A, then |T_j| ≤ O(ε_{e2}) for a small ε_{e2}, where ε_{e2} is much smaller than ε_{e1}.
Message:
The signal-to-noise ratio improves in this step.
23 / 30
Experiments
24 / 30
Simulation: Test Accuracy vs. Steps
Setting:
Generate simulated data with or without input structure, with labels given by the parity function.
Figure 4: Test accuracy on simulated data with or without input structure.
25 / 30
Synthetic Data: Feature Learning in Networks
Setting:
Compute the cosine similarities between the weights w_i and visualize them by Multidimensional Scaling. Dots represent neurons and stars represent the effective features ±Σ_{j∈A} M_j.
Message:
All neurons converge to the effective features after two steps.
26 / 30
Real Data: Feature Learning in Networks
Setting:
Two-layer network trained on a subset of MNIST with labels 0/1.
Message:
A similar clustering effect is observed.
27 / 30
Take Home Message
Question 1: How can effective features emerge from inputs in the training dynamics of gradient descent?
Answer: Input structures provably influence effective feature learning.
Question 2: Is feature learning from inputs necessary for superior performance?
Answer: The feature-learning ability of neural networks provably leads to their success compared to fixed feature methods.
28 / 30
Thank you!
29 / 30
Q&A
30 / 30