A Fourier Analysis Perspective of Training Dynamics of ...


Frequency Principle

Yaoyu Zhang, Tao Luo

• Background

• Frequency Principle

• Implication of F-Principle

• Quantitative theory for F-Principle

Background—Why do DNNs remain a mystery?

Example 1 (mystery)—Deep Learning

Supervised Learning Problem. Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and $\mathcal{H} = \{f(\cdot\,;\Theta) \mid \Theta \in \mathbb{R}^{m}\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \cdots, n$.

Deep learning "finds" $f$ by gradient flow on the empirical loss,

$$\dot{\Theta} = -\nabla_\Theta L(\Theta), \qquad L(\Theta) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(h(x_i;\Theta) - y_i\bigr)^2,$$

initialized by a special $\Theta_0$.
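In practice the gradient flow is discretized. As a worked reminder (standard full-batch gradient descent, not specific to these slides), the update

$$\Theta_{k+1} = \Theta_k - \eta\, \nabla_\Theta L(\Theta_k)$$

recovers $\dot{\Theta} = -\nabla_\Theta L(\Theta)$ in the small-learning-rate limit $\eta \to 0$.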

Example 2 (well understood)—polynomial interpolation

Supervised Learning Problem. Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and $\mathcal{H} = \{f(\cdot\,;\Theta) \mid \Theta \in \mathbb{R}^{m}\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \cdots, n$.

$\mathcal{D}$: $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}$, $y_i \in \mathbb{R}$

$\mathcal{H}$: $h(x;\Theta) = \theta_1 + \theta_2 x + \cdots + \theta_m x^{m-1}$ with $m = n$

find: Newton's interpolation formula
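For concreteness, a minimal sketch of Newton's divided-difference interpolation in NumPy (illustrative only; `xs`, `ys` are hypothetical sample points):

```python
import numpy as np

def newton_coefficients(xs, ys):
    """Divided-difference coefficients of the Newton form of the interpolant."""
    c = np.array(ys, dtype=float)
    n = len(xs)
    for j in range(1, n):
        # c[i] becomes the divided difference f[x_{i-j}, ..., x_i]
        c[j:] = (c[j:] - c[j - 1:-1]) / (xs[j:] - xs[:n - j])
    return c

def newton_eval(xs, c, x):
    """Evaluate the Newton-form polynomial at x with a Horner-like scheme."""
    result = c[-1]
    for k in range(len(c) - 2, -1, -1):
        result = result * (x - xs[k]) + c[k]
    return result

xs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # m = n = 5 sample points
ys = np.sin(np.pi * xs)
c = newton_coefficients(xs, ys)
print(newton_eval(xs, c, 0.25))              # interpolant evaluated at a new point
```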

Q: Why do we think polynomial interpolation is well understood?

Example 3 (well understood)—linear spline

Supervised Learning Problem. Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and $\mathcal{H} = \{f(\cdot\,;\Theta) \mid \Theta \in \mathbb{R}^{m}\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \cdots, n$.

$\mathcal{D}$: $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}$, $y_i \in \mathbb{R}$

$\mathcal{H}$: piecewise linear functions

find: explicit solution
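The explicit solution here is just piecewise-linear interpolation between sorted sample points; as an aside, in NumPy it is a one-liner via `np.interp`:

```python
import numpy as np

xs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # sorted sample locations
ys = np.sin(np.pi * xs)
x_query = np.linspace(-1.0, 1.0, 9)
print(np.interp(x_query, xs, ys))            # linear-spline values at the query points
```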

Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and $\mathcal{H} = \{f(\cdot\,;\Theta) \mid \Theta \in \mathbb{R}^{m}\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \cdots, n$.

Why is deep learning a mystery?

Deep learning (black box!)

• High-dimensional real data (e.g., $d = 32 \times 32 \times 3$)
• Deep neural network (#para >> #data)
• Gradient-based method with proper initialization

Conventional methods

• Low-dimensional data ($d \le 3$)
• Hypothesis space spanned by simple basis functions (#para ≤ #data)
• Explicit formula for the solution

Is deep learning alchemy?

https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html

Golden ages of neural networks

1960–1969
• Simple data (#data small)
• Single-layer NN (cannot solve XOR)
• Non-gradient-based training (non-differentiable activations)

1984–1996
• Moderate data (e.g., MNIST)
• Multi-layer NN (universal approximation)
• Gradient-based training (BP)

2010–now
• Complex real data (e.g., ImageNet)
• Deep NN
• Gradient-based training (BP) with good initialization

NN is still a black box!

Leo Breiman 1995

1. Why don’t heavily parameterized neural networks overfit the data?

2. What is the effective number of parameters?

3. Why doesn't backpropagation head for a poor local minimum?

4. When should one stop the backpropagation and use the current parameters?

Leo Breiman Reflections After Refereeing Papers for NIPS

Frequency Principle

Conventional view of generalization

"With four parameters you can fit an elephant to a curve; with five you can make him wiggle his trunk.”

-- John von Neumann

Mayer et al., 2010

A model that can fit anything likely overfits the data.

Overparameterized DNNs often generalize well.

CIFAR-10: 60,000 images (50,000 for training) in 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

Problem simplification

$\mathcal{D}$: 1-d training samples with $x \in [-1, 1]$. [Figure: 1-d target curve, $x \in [-1, 1]$, $y \in [-1, 1]$]

find: $\dot{\Theta} = -\nabla_\Theta L(\Theta)$, initialized by a special $\Theta_0$; we only observe $f(x, t) := f(x;\Theta(t))$.

Overparameterized DNNs still generalize well (Lei Wu, Zhanxing Zhu, Weinan E, 2017).

#para (~1000) >> #data (5)

[Figure: evolution of $f(x, t)$ during training; tanh-DNN with hidden layers 200-100-100-50]

Through the lens of the Fourier transform: $\hat{h}(\xi, t)$

Frequency Principle (F-Principle): DNNs often fit target functions from low to high frequencies during training.

Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018

Synthetic curve with equal amplitude
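A minimal sketch of this 1-d experiment in pure NumPy, with hypothetical sizes, learning rate, and target (not the exact setup used in the papers): train a small two-layer tanh network on a two-frequency target and monitor the NUDFT of its output on the training points; the F-Principle predicts that the relative error at the low frequency shrinks first.

```python
import numpy as np

rng = np.random.default_rng(0)
n, width, lr, steps = 41, 200, 1e-2, 30000

x = np.linspace(-1.0, 1.0, n)[:, None]                 # training inputs in [-1, 1]
y = np.sin(2 * np.pi * x) + np.sin(6 * np.pi * x)      # low-frequency + high-frequency target

# two-layer tanh network h(x) = a^T tanh(w x + b)
w = rng.normal(0.0, 1.0, (1, width))
b = rng.normal(0.0, 1.0, (1, width))
a = rng.normal(0.0, 1.0 / np.sqrt(width), (width, 1))

freqs = np.array([1.0, 3.0])                           # the two frequencies present in y

def nudft(v):
    # nonuniform discrete Fourier transform of values v on the training points
    return (np.exp(-2j * np.pi * freqs[:, None] * x.T) @ v).ravel() / n

y_hat = nudft(y)
for step in range(steps + 1):
    z = np.tanh(x @ w + b)                             # hidden activations, (n, width)
    h = z @ a                                          # network output, (n, 1)
    err = h - y
    # full-batch gradients of L = sum(err^2) / (2n)
    ga = z.T @ err / n
    gz = (err @ a.T) * (1.0 - z**2)
    gw = x.T @ gz / n
    gb = gz.sum(axis=0, keepdims=True) / n
    a -= lr * ga; w -= lr * gw; b -= lr * gb
    if step % 5000 == 0:
        rel = np.abs(nudft(h) - y_hat) / np.abs(y_hat)  # Delta_F at each monitored frequency
        print(step, np.round(rel, 3))                   # F-Principle: the low-frequency entry drops first
```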

Target: an image $I(\mathbf{x}): \mathbb{R}^2 \to \mathbb{R}$, where $\mathbf{x}$ is the location of a pixel and $I(\mathbf{x})$ is its grayscale value.

How does a DNN fit a 2-d image?

F-Principle

High-dimensional real data?

Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019


Frequency

Image frequency (NOT USED)

• This frequency corresponds to the rate of change of intensity across neighboring pixels.

Response frequency

• Frequency of a general Input-Output mapping 𝑓.

$$\hat{f}(\mathbf{k}) = \int f(\mathbf{x})\, e^{-i 2\pi \mathbf{k} \cdot \mathbf{x}}\, d\mathbf{x}$$

MNIST: $f: \mathbb{R}^{784} \to \mathbb{R}^{10}$, $\mathbf{k} \in \mathbb{R}^{784}$

• Zero frequency: same color
• High frequency: sharp edges
• High frequency: adversarial examples (Goodfellow et al.)

Examining F-Principle for high dimensional real problems

Non-uniform discrete Fourier transform (NUDFT) of the training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$:

$$\hat{y}_{\mathbf{k}} = \frac{1}{n}\sum_{i=1}^{n} y_i\, e^{-i 2\pi \mathbf{k} \cdot \mathbf{x}_i}, \qquad \hat{h}_{\mathbf{k}}(t) = \frac{1}{n}\sum_{i=1}^{n} h(\mathbf{x}_i, t)\, e^{-i 2\pi \mathbf{k} \cdot \mathbf{x}_i}$$

Difficulty:

• Curse of dimensionality, i.e., #𝐤 grows exponentially with dimension of problem 𝑑.

Our approaches:

• Projection, i.e., choose $\mathbf{k} = k\mathbf{p}_1$ along a fixed direction $\mathbf{p}_1$
• Filtering

Projection approach

[Figures: MNIST and CIFAR10]

Relative error: $\Delta_F(k) = |\hat{h}_k - \hat{y}_k| / |\hat{y}_k|$, evaluated at selected low frequencies $k_0$ of $\hat{y}$.
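A sketch of the projection approach in NumPy (toy stand-in data and a hypothetical projection direction $\mathbf{p}_1$):

```python
import numpy as np

def projected_spectrum(X, v, p1, ks):
    """NUDFT of values v (shape (n,)) at frequencies ks along the unit direction p1."""
    proj = X @ p1                                        # projected coordinates, (n,)
    phase = np.exp(-2j * np.pi * np.outer(ks, proj))     # (len(ks), n)
    return phase @ v / len(v)

def relative_error(X, y, h, p1, ks):
    """Delta_F(k) = |h_k - y_k| / |y_k| at each frequency k in ks."""
    y_hat = projected_spectrum(X, y, p1, ks)
    h_hat = projected_spectrum(X, h, p1, ks)
    return np.abs(h_hat - y_hat) / (np.abs(y_hat) + 1e-12)

# toy usage: X (n, d) inputs, y (n,) labels, h (n,) current DNN outputs -- all stand-ins
rng = np.random.default_rng(0)
n, d = 1000, 784
X = rng.random((n, d))
y = rng.integers(0, 2, n).astype(float)
h = rng.random(n)
p1 = np.ones(d) / np.sqrt(d)                             # hypothetical projection direction
print(relative_error(X, y, h, p1, ks=np.arange(1, 6)))
```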

Decompose frequency domain by filtering

[Figure: F-Principle in high-dimensional space: MNIST and CIFAR10]
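One simple way to realize the filtering split (a hedged sketch, not necessarily the authors' exact procedure): low-pass the labels and the DNN outputs on the training set with a Gaussian kernel of a hypothetical width `delta`, take the residual as the high-frequency part, and track the relative error of each part during training.

```python
import numpy as np

def low_high_split(X, v, delta=5.0):
    """Gaussian low-pass of scalar values v (shape (n,)) on sample points X (shape (n, d))."""
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T       # pairwise squared distances
    G = np.exp(-np.maximum(d2, 0.0) / (2.0 * delta**2))  # Gaussian weights (low-pass)
    low = (G @ v) / G.sum(axis=1)                        # smoothed (low-frequency) part
    return low, v - low                                  # high-frequency part is the residual

def split_errors(X, y, h, delta=5.0):
    y_low, y_high = low_high_split(X, y, delta)
    h_low, h_high = low_high_split(X, h, delta)
    e_low = np.linalg.norm(h_low - y_low) / np.linalg.norm(y_low)
    e_high = np.linalg.norm(h_high - y_high) / np.linalg.norm(y_high)
    return e_low, e_high   # F-Principle: e_low is expected to drop before e_high
```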

F-Principle

Implication of F-Principle

Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018
Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

For $\mathbf{x} \in \{-1, 1\}^{d}$, the parity function $f(\mathbf{x}) = \prod_{j=1}^{d} x_j$: an even number of $-1$ entries $\to 1$; an odd number $\to -1$.

Test accuracy: CIFAR10: 72% >> 10% (random guess); parity: ~50%, i.e., random guess.

CIFAR10 targets are dominated by low frequencies, whereas the parity function is concentrated at the highest frequencies, so the same low-frequency preference that helps on CIFAR10 prevents learning parity.

F-Principle: DNNs prefer low frequencies.

Why don't heavily parameterized neural networks overfit the data?

When should one stop the backpropagation and use the current parameters?

Studies elicited by the F-Principle

• Theoretical study

• Rahaman, N., Arpit, D., Baratin, A., Draxler, F., Lin, M., Hamprecht, F. A., Bengio, Y. & Courville, A. (2018), ‘On the spectral bias of deep neural networks’.

• Jin, P., Lu, L., Tang, Y. & Karniadakis, G. E. (2019), ‘Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness’

• Basri, R., Jacobs, D., Kasten, Y. & Kritchman, S. (2019), ‘The convergence rate of neural networks for learned functions of different frequencies’

• Zhen, H.-L., Lin, X., Tang, A. Z., Li, Z., Zhang, Q. & Kwong, S. (2018), ‘Nonlinear collaborative scheme for deep neural networks’

• Wang, H., Wu, X., Yin, P. & Xing, E. P. (2019), ‘High frequency component helps explain the generalization of convolutional neural networks’

• Empirical study

• Jagtap, A. D. & Karniadakis, G. E. (2019), ‘Adaptive activation functions accelerate convergence in deep and physics-informed neural networks’

• Stamatescu, V. & McDonnell, M. D. (2018), ‘Diagnosing convolutional neural networks using their spectral response’

• Rabinowitz, N. C. (2019), 'Meta-learners' learning dynamics are unlike learners'

• Application

• Wang, F., Müller, J., Eljarrat, A., Henninen, T., Rolf, E. & Koch, C. (2018), ‘Solving inverse problems with multi-scale deep convolutional neural networks’

• Cai, W., Li, X. & Liu, L. (2019), ‘Phasednn-a parallel phase shift deep neural network for adaptive wideband learning’

Quantitative theory for F-Principle

Zhang, Xu, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019

The NTK regime

$$L(\Theta) = \frac{1}{2}\sum_{i=1}^{n}\bigl(h(x_i;\Theta) - y_i\bigr)^2, \qquad \dot{\Theta} = -\nabla_\Theta L(\Theta)$$

• $\partial_t h(x;\Theta) = -\sum_{i=1}^{n} K_\Theta(x, x_i)\,\bigl(h(x_i;\Theta) - y_i\bigr)$, where $K_\Theta(x, x') = \nabla_\Theta h(x;\Theta) \cdot \nabla_\Theta h(x';\Theta)$.

• Neural Tangent Kernel (NTK) regime: $K_{\Theta(t)}(x, x') \approx K_{\Theta(0)}(x, x')$ for any $t$ (Jacot et al., 2018).

Problem simplification

$\mathcal{D}$: 1-d training samples with $x \in [-1, 1]$. [Figure: 1-d target curve]

find: $\dot{\Theta} = -\nabla_\Theta L(\Theta)$, initialized by a special $\Theta_0$.

Kernel gradient flow: $\partial_t f(x, t) = -\sum_{i=1}^{n} K_{\Theta_0}(x, x_i)\,\bigl(f(x_i, t) - y_i\bigr)$
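Restricted to the training points, this is a linear ODE whose modes relax at rates given by the kernel's eigenvalues. A minimal sketch in NumPy, using a stand-in Gaussian kernel rather than the true NTK $K_{\Theta_0}$:

```python
import numpy as np

n = 60
x = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * x) + 0.5 * np.sin(5 * np.pi * x)     # multi-frequency target

K = np.exp(-(x[:, None] - x[None, :])**2 / 0.1)         # stand-in for K_{Theta_0}

lam, V = np.linalg.eigh(K)                              # eigenvalues (ascending) and eigenvectors
u0 = np.zeros(n)                                        # f(x_i, 0) = 0 for simplicity
for t in (0.0, 0.5, 5.0, 50.0):
    # closed-form solution of du/dt = -K (u - y): error decays mode by mode as exp(-lambda t)
    u_t = y + V @ (np.exp(-lam * t) * (V.T @ (u0 - y)))
    print(f"t = {t:5.1f}, residual ||u(t) - y|| = {np.linalg.norm(u_t - y):.3f}")
```

For smooth kernels the large eigenvalues correspond to smooth (low-frequency) eigenvectors, so the low-frequency part of the error disappears first.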

Two-layer ReLU NN: $h(x;\Theta) = \sum_{i=1}^{m} w_i\, \sigma\bigl(r_i(x + l_i)\bigr)$, with $\sigma = \mathrm{ReLU}$ and $m$ neurons.

Linear F-Principle (LFP) dynamics

Two-layer NN: $h(x;\Theta) = \sum_{i=1}^{m} w_i\, \mathrm{ReLU}\bigl(r_i(x + l_i)\bigr)$

Notation: $\langle\cdot\rangle$: mean over all neurons at initialization; $f$: target function; $(\cdot)_p := (\cdot)\,p$, where $p(x) = \frac{1}{n}\sum_{i=1}^{n}\delta(x - x_i)$; $\hat{\cdot}$: Fourier transform; $\xi$: frequency.

$$\partial_t \hat{h}(\xi, t) = -\left(\frac{4\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r^2 + w^2\rangle}{\xi^4}\right)\bigl(\widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t)\bigr)$$

Assumptions: (i) NTK regime, (ii) sufficiently wide distribution of 𝑙𝑖.
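To see why this dynamics produces the F-Principle, consider the idealized case in which the modes decouple (e.g., a uniform data distribution, so that $\widehat{h_p} - \widehat{f_p}$ can be replaced by $\hat{h} - \hat{f}$; this is a simplification of the slides' setting). Writing $\gamma(\xi)$ for the bracketed rate, each frequency then relaxes independently and exponentially:

$$\partial_t \hat{h}(\xi, t) = -\gamma(\xi)\bigl(\hat{h}(\xi, t) - \hat{f}(\xi)\bigr) \;\Longrightarrow\; \hat{h}(\xi, t) - \hat{f}(\xi) = e^{-\gamma(\xi) t}\bigl(\hat{h}(\xi, 0) - \hat{f}(\xi)\bigr),$$

with $\gamma(\xi) = \dfrac{4\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \dfrac{\langle r^2 + w^2\rangle}{\xi^4}$ decreasing in $|\xi|$: low frequencies converge first, and the convergence time scale $1/\gamma(\xi)$ grows with frequency.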


Preference induced by LFP dynamics

$$\min_{h \in F_\gamma} \int \left(\frac{4\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r^2 + w^2\rangle}{\xi^4}\right)^{-1} |\hat{h}(\xi)|^2\, d\xi \quad \text{s.t. } h(x_i) = y_i \text{ for } i = 1, \cdots, n$$

→ low-frequency preference

Case 1: $\xi^{-2}$ dominant
• $\min \int \xi^2 |\hat{h}(\xi)|^2\, d\xi \sim \min \int |h'(x)|^2\, dx$ → linear spline

Case 2: $\xi^{-4}$ dominant
• $\min \int \xi^4 |\hat{h}(\xi)|^2\, d\xi \sim \min \int |h''(x)|^2\, dx$ → cubic spline
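The identification with splines is a standard Plancherel argument (constants omitted on the slide): differentiation multiplies the Fourier transform by $2\pi i \xi$, so

$$\int |h'(x)|^2\, dx = 4\pi^2 \int \xi^2 |\hat{h}(\xi)|^2\, d\xi, \qquad \int |h''(x)|^2\, dx = 16\pi^4 \int \xi^4 |\hat{h}(\xi)|^2\, d\xi,$$

and minimizing $\int |h'|^2$ (resp. $\int |h''|^2$) subject to the interpolation constraints yields the piecewise-linear interpolant (resp. the natural cubic spline).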

Regularity can be changed through initialization

Case 1 ($\xi^{-2}$ dominant): $4\pi^2 \langle r^2 w^2\rangle \gg \langle r^2 + w^2\rangle$ ⟹ effectively $\min \int \xi^2 |\hat{h}(\xi)|^2\, d\xi$ (linear-spline-like).

Case 2 ($\xi^{-4}$ dominant): $\langle r^2 + w^2\rangle \gg 4\pi^2 \langle r^2 w^2\rangle$ ⟹ effectively $\min \int \xi^4 |\hat{h}(\xi)|^2\, d\xi$ (cubic-spline-like).

High-dimensional case:

$$\partial_t \hat{h}(\xi, t) = -\left(\frac{\langle |r|^2 + w^2\rangle}{|\xi|^{d+3}} + \frac{4\pi^2 \langle |r|^2 w^2\rangle}{|\xi|^{d+1}}\right)\bigl(\widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t)\bigr)$$

where $f$: target function; $(\cdot)_p := (\cdot)\,p$, with $p(x) = \frac{1}{n}\sum_{i=1}^{n}\delta(x - x_i)$; $\hat{\cdot}$: Fourier transform; $\xi$: frequency.

Theorem (informal). The solution of the LFP dynamics as $t \to \infty$ with initial value $h_{\mathrm{ini}}$ coincides with the solution of the following optimization problem:

$$\min_{h - h_{\mathrm{ini}} \in F_\gamma} \int \left(\frac{\langle |r|^2 + w^2\rangle}{|\xi|^{d+3}} + \frac{4\pi^2 \langle |r|^2 w^2\rangle}{|\xi|^{d+1}}\right)^{-1} \bigl|\hat{h}(\xi) - \hat{h}_{\mathrm{ini}}(\xi)\bigr|^2\, d\xi \quad \text{s.t. } h(X) = Y.$$


FP-norm and FP-space

We define the FP-norm for all functions $h \in L^2(\Omega)$:

$$\|h\|_\gamma := \|h\|_{\mathcal{H}^\Gamma} = \Bigl(\sum_{\mathbf{k} \in \mathbb{Z}^d_*} \gamma^{-2}(\mathbf{k})\, |\hat{h}(\mathbf{k})|^2\Bigr)^{1/2}$$

Next, we define the FP-space:

$$F_\gamma(\Omega) = \{h \in L^2(\Omega) : \|h\|_\gamma < \infty\}$$

A priori generalization error bound

Theorem (informal). Suppose that the real-valued target function $f \in F_\gamma(\Omega)$ and $h_n$ is the solution of the regularized model

$$\min_{h \in F_\gamma} \|h\|_\gamma \quad \text{s.t. } h(X) = Y.$$

Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random training samples, the population risk is bounded by

$$L(h_n) \le \frac{2\bigl(\|f\|_\infty + 2\|f\|_\gamma\, \|\gamma\|_{\ell^2}\bigr)}{\sqrt{n}} + 4\sqrt{\frac{2\log(4/\delta)}{n}}.$$
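A small numerical sketch of the FP-norm in NumPy (1-d, periodic grid over $[0, 1)$, and a hypothetical concrete choice of $\gamma^2(\mathbf{k})$ mirroring the LFP decay rate; the definition above only requires $\gamma(\mathbf{k})$ to decay with $|\mathbf{k}|$):

```python
import numpy as np

def fp_norm(h_vals, r2w2=1.0, r2_plus_w2=1.0, d=1):
    """FP-norm of samples of h on a uniform periodic grid over [0, 1)."""
    n = len(h_vals)
    h_hat = np.fft.fft(h_vals) / n                        # Fourier coefficients of h
    k = np.fft.fftfreq(n, d=1.0 / n)                      # integer frequencies
    mask = k != 0                                         # k in Z^d_* (drop k = 0)
    gamma2 = (r2_plus_w2 / np.abs(k[mask])**(d + 3)
              + 4.0 * np.pi**2 * r2w2 / np.abs(k[mask])**(d + 1))
    return np.sqrt(np.sum(np.abs(h_hat[mask])**2 / gamma2))

x = np.linspace(0.0, 1.0, 256, endpoint=False)
print(fp_norm(np.sin(2 * np.pi * x)))                     # smooth: small FP-norm
print(fp_norm(np.sign(np.sin(2 * np.pi * x))))            # discontinuous: much larger FP-norm
```

Functions with significant high-frequency content pay a large penalty $\gamma^{-2}(\mathbf{k})$, so a small FP-norm encodes exactly the low-frequency preference above.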

Leo Breiman 1995

1. Why don’t heavily parameterized neural networks overfit the data?

2. What is the effective number of parameters?

3. Why doesn't backpropagation head for a poor local minimum?

4. When should one stop the backpropagation and use the current parameters?

Leo Breiman Reflections After Refereeing Papers for NIPS

A picture for the generalization mystery of DNNs: among the many global minima, the one actually reached is selected by unbiased initialization + the F-Principle + ⋯ (to be discovered).

Conclusion

DNNs prefer low frequencies!

References:

• Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018

• Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

• Zhang, Xu, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019

• Zhang, Xu, Luo, Ma, A type of generalization error induced by initialization in deep neural networks, 2019

• Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019.