A Fourier Analysis Perspective of Training Dynamics of ...

Page 1:

Frequency Principle

Yaoyu Zhang, Tao Luo

Page 2:

• Background

• Frequency Principle

• Implications of the F-Principle

• Quantitative theory for the F-Principle

Page 3:

Background: Why do DNNs remain a mystery?

Page 4:

Example 1 (mystery): deep learning

Supervised Learning Problem. Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ and $\mathcal{H} = \{f(\cdot\,;\Theta) \mid \Theta \in \mathbb{R}^m\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \dots, n$.

Deep learning finds $f$ by gradient flow on the loss,

$\dot{\Theta} = -\nabla_\Theta L(\Theta)$, initialized by a special $\Theta_0$, with $L(\Theta) = \frac{1}{2n} \sum_{i=1}^n \left( h(x_i; \Theta) - y_i \right)^2$.

Page 5:

Example 2 (well understood): polynomial interpolation

Supervised Learning Problem. Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ with $x_i, y_i \in \mathbb{R}$, and $\mathcal{H}: h(x; \Theta) = \theta_1 + \dots + \theta_m x^{m-1}$ with $m = n$, find $f$ by Newton's interpolation formula.

Q: Why do we think polynomial interpolation is well understood?
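To see why this case counts as well understood, here is a minimal runnable sketch of Newton's divided-difference interpolation (numpy is assumed; the helper names are mine, not from the slides):

```python
import numpy as np

def newton_coefficients(x, y):
    """Divided-difference coefficients c_j of the Newton form."""
    x, c = np.asarray(x, float), np.asarray(y, float).copy()
    for j in range(1, len(x)):
        # j-th order divided differences, computed in place.
        c[j:] = (c[j:] - c[j-1:-1]) / (x[j:] - x[:-j])
    return c

def newton_eval(c, x, t):
    """Evaluate p(t) = c0 + c1 (t-x0) + c2 (t-x0)(t-x1) + ... by nested multiplication."""
    p = np.full_like(np.asarray(t, float), c[-1])
    for k in range(len(c) - 2, -1, -1):
        p = p * (t - x[k]) + c[k]
    return p

# m = n: the interpolant matches all n data points exactly.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.sin(np.pi * x)
c = newton_coefficients(x, y)
assert np.allclose(newton_eval(c, x, x), y)
```

Existence, uniqueness, and the error of the interpolant are all explicit here; this is precisely what the deep-learning case above lacks.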

Page 6:

Example 3 (well understood): linear spline

Supervised Learning Problem. Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ with $x_i, y_i \in \mathbb{R}$, and $\mathcal{H}$: piecewise linear functions, find $f$ by the explicit solution (connect neighboring data points by line segments).
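The explicit solution here is even simpler; a sketch using numpy's built-in piecewise linear interpolant:

```python
import numpy as np

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.sin(np.pi * x)

# np.interp connects neighboring data points by straight line segments,
# which is exactly the linear-spline hypothesis class of this slide.
t = np.linspace(-1.0, 1.0, 201)
h = np.interp(t, x, y)
```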

Page 7:

Why is deep learning a mystery?

Supervised Learning Problem. Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ and $\mathcal{H} = \{f(\cdot\,;\Theta) \mid \Theta \in \mathbb{R}^m\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \dots, n$.

Deep learning (black box!):
• High-dimensional real data (e.g., d = 32*32*3)
• Deep neural network (#para >> #data)
• Gradient-based method with proper initialization

Conventional methods:
• Low-dimensional data (d ≤ 3)
• Hypothesis space spanned by simple basis functions (#para ≤ #data)
• Explicit formula for the solution

Page 8:

Is deep learning alchemy?

https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html

Page 9:

Golden ages of neural networks

1960-1969:
• Simple data (#data small)
• Single-layer NN (cannot solve XOR)
• Non-gradient-based training (non-differentiable activation)

1984-1996:
• Moderate data (e.g., MNIST)
• Multi-layer NN (universal approximation)
• Gradient-based training (backpropagation)

2010-now:
• Complex real data (e.g., ImageNet)
• Deep NN
• Gradient-based training (BP) with good initialization

NN is still a black box!

Page 10:

Leo Breiman 1995

1. Why don’t heavily parameterized neural networks overfit the data?

2. What is the effective number of parameters?

3. Why doesn't backpropagation head for a poor local minimum?

4. When should one stop the backpropagation and use the current parameters?

Leo Breiman Reflections After Refereeing Papers for NIPS

Page 11:

Frequency Principle

Page 12:

Conventional view of generalization

Page 13:

Conventional view of generalization

"With four parameters you can fit an elephant to a curve; with five you can make him wiggle his trunk.”

-- John von Neumann

Mayer et al., 2010

A model that can fit anything likely overfits the data.

Page 14:

Overparameterized DNNs often generalize well.

CIFAR10: 60,000 images (50,000 for training), in ten classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

Page 15:

Problem simplification

Gradient flow $\dot{\Theta} = -\nabla_\Theta L(\Theta)$, initialized by a special $\Theta_0$; we only observe the output function $f(x, t) := f(x; \Theta(t))$.

[Figure: 1-d training data on x ∈ [-1.0, 1.0], y ∈ [-1, 1]]

Page 16:

Overparameterized DNNs still generalize well

Lei Wu, Zhanxing Zhu, Weinan E, 2017

#para (~1000) >> #data (5)

Page 17:

Evolution of $f(x, t)$

tanh-DNN, 200-100-100-50

Page 18:

Through the lens of the Fourier transform $\hat{h}(\xi, t)$

Frequency Principle (F-Principle): DNNs often fit target functions from low to high frequencies during training.

Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018

Page 19:

Synthetic curve with equal amplitude
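The flavor of this experiment can be reproduced with a minimal numpy sketch; this is an illustration under simplified assumptions (a small two-layer tanh net, full-batch gradient descent, an arbitrary two-frequency target and arbitrary hyperparameters), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target with equal-amplitude low (freq 1) and high (freq 5) components.
x = np.linspace(-1.0, 1.0, 201)
y = np.sin(2 * np.pi * x) + np.sin(10 * np.pi * x)

# Two-layer tanh network h(x) = sum_k w2_k * tanh(w1_k * x + b1_k) + b2.
m = 200
w1 = rng.normal(0.0, 1.0, m)
b1 = rng.normal(0.0, 1.0, m)
w2 = rng.normal(0.0, 1.0 / np.sqrt(m), m)
b2 = 0.0

freqs = np.fft.rfftfreq(x.size, d=x[1] - x[0])
y_hat = np.fft.rfft(y)
k_lo = np.argmin(np.abs(freqs - 1.0))  # low-frequency peak of the target
k_hi = np.argmin(np.abs(freqs - 5.0))  # high-frequency peak of the target

lr = 0.05
for step in range(50001):
    z = np.tanh(np.outer(x, w1) + b1)      # (N, m) hidden activations
    h = z @ w2 + b2                        # network output on all samples
    g = (h - y) / x.size                   # dL/dh for L = sum((h-y)^2) / (2N)
    # Backpropagation, written out by hand.
    dz = np.outer(g, w2) * (1.0 - z ** 2)  # gradient at the pre-activations
    w2 -= lr * (z.T @ g)
    b2 -= lr * g.sum()
    w1 -= lr * (dz.T @ x)
    b1 -= lr * dz.sum(axis=0)
    if step % 10000 == 0:
        rel = np.abs(np.fft.rfft(h) - y_hat) / (np.abs(y_hat) + 1e-12)
        print(f"step {step:6d}: rel. error at low freq {rel[k_lo]:.3f}, "
              f"at high freq {rel[k_hi]:.3f}")
```

If the F-Principle holds in this setup, the low-frequency relative error should drop well before the high-frequency one.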

Page 20:

How does a DNN fit a 2-d image?

Target: image $I(\mathbf{x}): \mathbb{R}^2 \to \mathbb{R}$, where $\mathbf{x}$ is the location of a pixel and $I(\mathbf{x})$ is its grayscale value.

Page 21:

F-Principle

High-dimensional real data?

Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

[Figure: the 1-d fitting example on x ∈ [-1.0, 1.0] again; does the low-to-high picture carry over to high dimension?]

Page 22:

Frequency

Image frequency (NOT USED here):
• the rate of change of intensity across neighboring pixels.

Response frequency:
• the frequency of a general input-output mapping $f$,

$\hat{f}(\mathbf{k}) = \int f(\mathbf{x}) \, e^{-i 2\pi \mathbf{k} \cdot \mathbf{x}} \, \mathrm{d}\mathbf{x}$

MNIST: $f: \mathbb{R}^{784} \to \mathbb{R}^{10}$, $\mathbf{k} \in \mathbb{R}^{784}$.

Zero frequency ↔ uniform color; high frequency ↔ sharp edges, adversarial examples (Goodfellow et al.).

Page 23:

Examining the F-Principle for high-dimensional real problems

Nonuniform discrete Fourier transform (NUDFT) of the training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$:

$\hat{y}_{\mathbf{k}} = \frac{1}{n} \sum_{i=1}^n y_i \, e^{-i 2\pi \mathbf{k} \cdot \mathbf{x}_i}, \qquad \hat{h}_{\mathbf{k}}(t) = \frac{1}{n} \sum_{i=1}^n h(\mathbf{x}_i, t) \, e^{-i 2\pi \mathbf{k} \cdot \mathbf{x}_i}$

Difficulty:
• Curse of dimensionality, i.e., #k grows exponentially with the problem dimension d.

Our approaches (see the sketch below):
• Projection, i.e., choose $\mathbf{k} = k \mathbf{p}_1$
• Filtering
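A minimal sketch of the projection version of this computation (the direction p1 is not specified on the slide; taking the first principal component of the inputs is one natural, hypothetical choice):

```python
import numpy as np

def first_pc(X):
    """First principal component of the inputs (a hypothetical choice of p1)."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[0]

def nudft_projected(X, v, p1, ks):
    """NUDFT of sample values v at inputs X along direction p1:
    v_hat(k) = (1/n) sum_i v_i exp(-i 2 pi (k p1) . x_i)."""
    proj = X @ p1                               # scalar coordinate per sample
    return np.exp(-2j * np.pi * np.outer(ks, proj)) @ v / len(v)

def relative_error(X, y, h, p1, ks):
    """Delta_F(k) = |h_k - y_hat_k| / |y_hat_k|, as defined on the next slide."""
    y_hat = nudft_projected(X, y, p1, ks)
    h_hat = nudft_projected(X, h, p1, ks)
    return np.abs(h_hat - y_hat) / np.abs(y_hat)
```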

Page 24:

Projection approach

[Figures: MNIST and CIFAR10 experiments]

Relative error: $\Delta_F(k) = |\hat{h}_k - \hat{y}_k| \, / \, |\hat{y}_k|$

Page 25:

Decompose the frequency domain by filtering

[Figure: split $\hat{y}$ into a low-frequency part (|k| ≤ k₀) and a high-frequency part (|k| > k₀)]
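One simple way to realize such a split directly on scattered training points is a normalized Gaussian filter in input space; this is a hedged sketch of the idea, not necessarily the paper's exact construction (the width sigma plays roughly the role of 1/k₀):

```python
import numpy as np

def split_low_high(X, y, sigma):
    """Split sample values y into low/high-frequency parts with a normalized
    Gaussian filter of width sigma acting directly on the training inputs X.
    Larger sigma <-> smaller cutoff k0."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise |x_i - x_j|^2
    G = np.exp(-d2 / (2.0 * sigma ** 2))
    y_low = (G @ y) / G.sum(axis=1)     # smoothed, low-frequency part
    return y_low, y - y_low             # high-frequency remainder
```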

Page 26:

F-Principle in high-dimensional space

[Figures: MNIST and CIFAR10 experiments]

Page 27:

F-Principle

Implications of the F-Principle

Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018
Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

Page 28:

Why don't heavily parameterized neural networks overfit the data?

F-Principle: DNNs prefer low frequencies.

CIFAR10 (low frequencies dominate): test accuracy 72% >> 10% (random guess).

Parity function: for $\vec{x} \in \{-1, 1\}^n$, $f(\vec{x}) = \prod_{j=1}^n x_j$, i.e., an even number of -1 entries maps to 1 and an odd number to -1. This target concentrates on high frequencies: test accuracy ~50%, i.e., random guess.

Page 29:

When should one stop the backpropagation and use the current parameters?

Page 30:

Studies elicited by the F-Principle

• Theoretical study

• Rahaman, N., Arpit, D., Baratin, A., Draxler, F., Lin, M., Hamprecht, F. A., Bengio, Y. & Courville, A. (2018), ‘On the spectral bias of deep neural networks’.

• Jin, P., Lu, L., Tang, Y. & Karniadakis, G. E. (2019), ‘Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness’

• Basri, R., Jacobs, D., Kasten, Y. & Kritchman, S. (2019), ‘The convergence rate of neural networks for learned functions of different frequencies’

• Zhen, H.-L., Lin, X., Tang, A. Z., Li, Z., Zhang, Q. & Kwong, S. (2018), ‘Nonlinear collaborative scheme for deep neural networks’

• Wang, H., Wu, X., Yin, P. & Xing, E. P. (2019), ‘High frequency component helps explain the generalization of convolutional neural networks’

• Empirical study

• Jagtap, A. D. & Karniadakis, G. E. (2019), ‘Adaptive activation functions accelerate convergence in deep and physics-informed neural networks’

• Stamatescu, V. & McDonnell, M. D. (2018), ‘Diagnosing convolutional neural networks using their spectral response’

• Rabinowitz, N. C. (2019), ‘Meta-learners' learning dynamics are unlike learners'’

• Application

• Wang, F., Müller, J., Eljarrat, A., Henninen, T., Rolf, E. & Koch, C. (2018), ‘Solving inverse problems with multi-scale deep convolutional neural networks’

• Cai, W., Li, X. & Liu, L. (2019), ‘PhaseDNN - a parallel phase shift deep neural network for adaptive wideband learning’

Page 31:

Quantitative theory for F-Principle

Zhang, Xu, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019

Page 32:

The NTK regime

Loss $L(\Theta) = \frac{1}{2} \sum_{i=1}^n \left( h(x_i; \Theta) - y_i \right)^2$ with gradient flow $\dot{\Theta} = -\nabla_\Theta L(\Theta)$.

By the chain rule, $\partial_t h(x; \Theta) = \nabla_\Theta h(x; \Theta) \cdot \dot{\Theta}$, which gives

• $\partial_t h(x; \Theta) = -\sum_{i=1}^n K_\Theta(x, x_i) \left( h(x_i; \Theta) - y_i \right)$,

where $K_\Theta(x, x') = \nabla_\Theta h(x; \Theta) \cdot \nabla_\Theta h(x'; \Theta)$.

• Neural Tangent Kernel (NTK) regime: $K_{\Theta(t)}(x, x') \approx K_{\Theta(0)}(x, x')$ for any $t$.

Jacot et al., 2018
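The kernel $K_\Theta$ is directly computable for a small model. A minimal sketch for a two-layer scalar-input network (tanh is used here purely to keep the illustration smooth; all sizes are arbitrary):

```python
import numpy as np

def ntk_gram(X, w1, b1, w2):
    """Empirical NTK Gram matrix K(x_i, x_j) = grad_Theta h(x_i) . grad_Theta h(x_j)
    for h(x) = sum_k w2_k tanh(w1_k x + b1_k), with scalar inputs."""
    z = np.tanh(np.outer(X, w1) + b1)     # (N, m) hidden activations
    dz = 1.0 - z ** 2                     # tanh'
    g_w2 = z                              # dh/dw2_k
    g_b1 = dz * w2                        # dh/db1_k
    g_w1 = g_b1 * X[:, None]              # dh/dw1_k
    G = np.concatenate([g_w1, g_b1, g_w2], axis=1)  # per-sample gradients
    return G @ G.T

rng = np.random.default_rng(0)
m = 10_000                                # wide network: kernel nearly frozen
w1, b1 = rng.normal(size=m), rng.normal(size=m)
w2 = rng.normal(size=m) / np.sqrt(m)
X = np.linspace(-1.0, 1.0, 5)
K0 = ntk_gram(X, w1, b1, w2)              # K at initialization
```

For large width m, recomputing the Gram matrix after a stretch of training should return nearly the same matrix; that is the frozen-kernel statement above.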

Page 33:

Problem simplification

Gradient flow $\dot{\Theta} = -\nabla_\Theta L(\Theta)$, initialized by a special $\Theta_0$.

Two-layer ReLU NN: $h(x; \Theta) = \sum_{i=1}^m w_i \sigma(r_i(x + l_i))$

Kernel gradient flow:
$\partial_t f(x, t) = -\sum_{i=1}^n K_{\Theta_0}(x, x_i) \left( f(x_i, t) - y_i \right)$

[Figure: 1-d training data on x ∈ [-1.0, 1.0]]
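Once $K_{\Theta_0}$ is fixed, the kernel gradient flow is an ODE in the outputs alone and can be integrated by forward Euler; a minimal sketch (the kernel matrices could come, e.g., from the ntk_gram helper sketched above; step size and count are arbitrary):

```python
import numpy as np

def kernel_gradient_flow(K_train, K_eval, y, f0_train, f0_eval,
                         dt=1e-3, steps=50_000):
    """Forward-Euler integration of the kernel gradient flow
        d f(x, t) / dt = - sum_i K(x, x_i) (f(x_i, t) - y_i).
    K_train: (n, n) kernel on the training points x_i;
    K_eval:  (N, n) kernel between an evaluation grid and the x_i;
    f0_*:    network outputs at initialization."""
    f_train, f_eval = f0_train.copy(), f0_eval.copy()
    for _ in range(steps):
        residual = f_train - y
        f_train = f_train - dt * K_train @ residual
        f_eval = f_eval - dt * K_eval @ residual
    return f_eval
```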

Page 34:

Linear F-Principle (LFP) dynamics

2-layer NN: $h(x; \Theta) = \sum_{i=1}^m w_i \, \mathrm{ReLU}(r_i(x + l_i))$

$\partial_t \hat{h}(\xi, t) = -\left( \frac{4\pi^2 \langle r^2 w^2 \rangle}{\xi^2} + \frac{\langle r^2 + w^2 \rangle}{\xi^4} \right) \left( \widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t) \right)$

Notation: $\langle \cdot \rangle$: mean over all neurons at initialization; $f$: target function; $(\cdot)_p = (\cdot)\, p$, where $p(x) = \frac{1}{n} \sum_{i=1}^n \delta(x - x_i)$; $\hat{\cdot}$: Fourier transform; $\xi$: frequency.

Assumptions: (i) NTK regime, (ii) sufficiently wide distribution of $l_i$.
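One step the slide leaves implicit: if the sampling density p is dropped (replace $h_p$, $f_p$ by $h$, $f$), the dynamics decouple across frequencies and solve in closed form, which makes the low-frequency-first behavior explicit:

```latex
% Decoupled LFP dynamics (sampling measure p dropped) and the exact solution.
\[
\partial_t \hat{h}(\xi, t)
  = -\Gamma(\xi)\,\bigl(\hat{h}(\xi, t) - \hat{f}(\xi)\bigr),
\qquad
\Gamma(\xi) = \frac{4\pi^2 \langle r^2 w^2 \rangle}{\xi^2}
            + \frac{\langle r^2 + w^2 \rangle}{\xi^4},
\]
\[
\hat{h}(\xi, t) = \hat{f}(\xi)
  + \bigl(\hat{h}(\xi, 0) - \hat{f}(\xi)\bigr)\, e^{-\Gamma(\xi) t}.
\]
% Gamma(xi) decays in |xi|, so low frequencies relax exponentially faster:
% the F-Principle, now quantitative, in the linear regime.
```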

Page 35:

Preference induced by the LFP dynamics

$\partial_t \hat{h}(\xi, t) = -\left( \frac{4\pi^2 \langle r^2 w^2 \rangle}{\xi^2} + \frac{\langle r^2 + w^2 \rangle}{\xi^4} \right) \left( \widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t) \right)$

The long-time solution of this dynamics solves

$\min_{h \in F_\gamma} \int \left( \frac{4\pi^2 \langle r^2 w^2 \rangle}{\xi^2} + \frac{\langle r^2 + w^2 \rangle}{\xi^4} \right)^{-1} |\hat{h}(\xi)|^2 \, \mathrm{d}\xi \quad \text{s.t. } h(x_i) = y_i \text{ for } i = 1, \dots, n$

→ low-frequency preference.

Case 1: $\xi^{-2}$ dominant
• $\min \int \xi^2 |\hat{h}(\xi)|^2 \, \mathrm{d}\xi \sim \min \int |h'(x)|^2 \, \mathrm{d}x$ → linear spline

Case 2: $\xi^{-4}$ dominant
• $\min \int \xi^4 |\hat{h}(\xi)|^2 \, \mathrm{d}\xi \sim \min \int |h''(x)|^2 \, \mathrm{d}x$ → cubic spline
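The '~' steps follow from Plancherel's theorem and the differentiation rule of the Fourier transform (with the convention $\hat{h}(\xi) = \int h(x) e^{-i 2\pi \xi x} \mathrm{d}x$); the constants do not affect the minimizer:

```latex
\[
\widehat{h'}(\xi) = 2\pi i \xi \, \hat{h}(\xi)
\;\Longrightarrow\;
\int |h'(x)|^2 \, \mathrm{d}x
  = 4\pi^2 \int \xi^2 \, |\hat{h}(\xi)|^2 \, \mathrm{d}\xi,
\]
\[
\widehat{h''}(\xi) = -4\pi^2 \xi^2 \, \hat{h}(\xi)
\;\Longrightarrow\;
\int |h''(x)|^2 \, \mathrm{d}x
  = 16\pi^4 \int \xi^4 \, |\hat{h}(\xi)|^2 \, \mathrm{d}\xi.
\]
```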

Page 36:

Regularity can be changed through the initialization

Case 1: $\langle r^2 + w^2 \rangle \gg 4\pi^2 \langle r^2 w^2 \rangle$ (the $\xi^{-4}$ term dominates): $\min \int \xi^4 |\hat{h}(\xi)|^2 \, \mathrm{d}\xi$, i.e., cubic-spline-like regularity.

Case 2: $4\pi^2 \langle r^2 w^2 \rangle \gg \langle r^2 + w^2 \rangle$ (the $\xi^{-2}$ term dominates): $\min \int \xi^2 |\hat{h}(\xi)|^2 \, \mathrm{d}\xi$, i.e., linear-spline-like regularity.

Page 37:

High-dimensional case

$\partial_t \hat{h}(\xi, t) = -\left( \frac{\langle |r|^2 + w^2 \rangle}{|\xi|^{d+3}} + \frac{4\pi^2 \langle |r|^2 w^2 \rangle}{|\xi|^{d+1}} \right) \left( \widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t) \right)$

where $f$: target function; $(\cdot)_p = (\cdot)\, p$, with $p(x) = \frac{1}{n} \sum_{i=1}^n \delta(x - x_i)$; $\hat{\cdot}$: Fourier transform; $\xi$: frequency.

Theorem (informal). The solution of the LFP dynamics as $t \to \infty$ with initial value $h_{\mathrm{ini}}$ is the same as the solution of the following optimization problem:

$\min_{h - h_{\mathrm{ini}} \in F_\gamma} \int \left( \frac{\langle |r|^2 + w^2 \rangle}{|\xi|^{d+3}} + \frac{4\pi^2 \langle |r|^2 w^2 \rangle}{|\xi|^{d+1}} \right)^{-1} \left| \hat{h}(\xi) - \hat{h}_{\mathrm{ini}}(\xi) \right|^2 \mathrm{d}\xi \quad \text{s.t. } h(X) = Y.$

Page 38:

FP-norm and FP-space

We define the FP-norm for any function $h \in L^2(\Omega)$:

$\|h\|_\gamma := \|h\|_{H^\Gamma} = \left( \sum_{k \in \mathbb{Z}^d_*} \gamma^{-2}(k) \, |\hat{h}(k)|^2 \right)^{1/2}$

and the FP-space:

$F_\gamma(\Omega) = \{ h \in L^2(\Omega) : \|h\|_\gamma < \infty \}$

A priori generalization error bound. Theorem (informal). Suppose the real-valued target function $f \in F_\gamma(\Omega)$ and $h_n$ is the solution of the regularized model

$\min_{h \in F_\gamma} \|h\|_\gamma \quad \text{s.t. } h(X) = Y.$

Then for any $\delta \in (0,1)$, with probability at least $1 - \delta$ over the random training samples, the population risk satisfies

$L(h_n) \le \frac{2 \left( \|f\|_\infty + 2 \|f\|_\gamma \|\gamma\|_{\ell^2} \right)}{\sqrt{n}} + 4 \sqrt{\frac{2 \log(4/\delta)}{n}}.$
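For intuition, the FP-norm is straightforward to evaluate numerically on a uniform periodic grid; a sketch treating the weight $\gamma$ as a callable (in the theory, $\gamma$ is determined by the LFP kernel; the grid and example weight here are illustrative assumptions):

```python
import numpy as np

def fp_norm(h_values, gamma):
    """FP-norm of h sampled on a uniform grid over a unit-period domain:
    ||h||_gamma = ( sum_{k != 0} gamma(k)^{-2} |h_hat(k)|^2 )^{1/2}."""
    n = len(h_values)
    h_hat = np.fft.fft(h_values) / n        # Fourier coefficients of h
    k = np.fft.fftfreq(n, d=1.0 / n)        # integer frequencies
    mask = k != 0                           # Z^d_* excludes the zero mode
    return np.sqrt(np.sum(np.abs(h_hat[mask]) ** 2 / gamma(k[mask]) ** 2))

# Example: gamma(k) ~ |k|^{-2} penalizes high frequencies heavily.
x = np.linspace(0.0, 1.0, 256, endpoint=False)
print(fp_norm(np.sin(2 * np.pi * x), lambda k: np.abs(k) ** -2.0))
```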

Page 39:

Leo Breiman 1995

1. Why don’t heavily parameterized neural networks overfit the data?

2. What is the effective number of parameters?

3. Why doesn't backpropagation head for a poor local minimum?

4. When should one stop the backpropagation and use the current parameters?

Leo Breiman Reflections After Refereeing Papers for NIPS

Page 40:

A picture for the generalization mystery of DNNs

[Figure: the set of global minima; well-generalizing ones are picked out by unbiased initialization + F-Principle + ... (to be discovered)]

Page 41:

Conclusion

DNNs prefer low frequencies!

References:

• Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018.

• Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019.

• Zhang, Xu, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019.

• Zhang, Xu, Luo, Ma, A type of generalization error induced by initialization in deep neural networks, 2019.

• Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019.