# Exponentiated Gradient versus Gradient Descent for Linear Predictors

### Transcript of Exponentiated Gradient versus Gradient Descent for Linear Predictors

Exponentiated Gradient versus Gradient Descent for Linear Predictors

Jyrki Kivinen and Manfred Warmuth

Presented By: Maitreyi N

The bounds can be improved to a form stated in terms of a distance between the comparison vector u and the start vector s (the exact formula was an image and is not recoverable from this transcript), where:

For GD, the distance measure is the squared Euclidean distance, whose gradient gives the GD update:

dsq(u, w) = (1/2) Σi (ui − wi)²

For EG, the distance measure is the relative entropy, whose gradient gives the EG update:

dre(u, w) = Σi ui ln(ui / wi)
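A useful way to see where these two distance measures come from: both updates can be derived, as in the paper, by approximately minimizing a trade-off between staying close to the old weight vector and fitting the new example:

```latex
w_{t+1} \approx \operatorname*{argmin}_{w} \; d(w, w_t) + \eta \, L_{y_t}(w \cdot x_t)
```

Taking d = dsq yields the additive GD update below; taking d = dre (together with the constraint Σi wi = 1) yields the multiplicative, renormalized EG update.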

Algorithm GDL(s, η)

Parameters: L: a loss function from R × R to [0, ∞), s: a start vector in R^N, and η: a learning rate in [0, ∞).

Initialization: Before the first trial, set w1 = s.

Prediction: Upon receiving the t-th instance xt, give the prediction ŷt = wt · xt.

Update: Upon receiving the t-th outcome yt, update the weights according to the rule

wt+1 = wt − η L′yt(ŷt) xt
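One GDL trial can be sketched in a few lines. The square loss Ly(ŷ) = (ŷ − y)², with L′y(ŷ) = 2(ŷ − y), is an illustrative assumption here, not the only loss the algorithm supports:

```python
def gd_trial(w, x, y, eta):
    """One trial of GD for the square loss: predict, then update.

    Square loss L_y(yhat) = (yhat - y)^2 has derivative L'_y(yhat) = 2*(yhat - y).
    """
    yhat = sum(wi * xi for wi, xi in zip(w, x))               # yhat_t = w_t . x_t
    dloss = 2.0 * (yhat - y)                                  # L'_{y_t}(yhat_t)
    w_next = [wi - eta * dloss * xi for wi, xi in zip(w, x)]  # additive GD update
    return yhat, w_next
```

The update moves each weight along the negative gradient of the loss, scaled by the learning rate.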

Algorithm EGL(s, η)

Parameters: L: a loss function from R × R to [0, ∞), s: a start vector with Σi si = 1, and η: a learning rate in [0, ∞).

Initialization: Before the first trial, set w1 = s.

Prediction: Upon receiving the t-th instance xt, give the prediction ŷt = wt · xt.

Update: Upon receiving the t-th outcome yt, update the weights according to the rule

wt+1,i = wt,i exp(−η L′yt(ŷt) xt,i) / Σj wt,j exp(−η L′yt(ŷt) xt,j)
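The EGL trial differs from GD only in the update step: multiplicative factors followed by renormalization. As above, the square loss is an illustrative assumption:

```python
import math

def eg_trial(w, x, y, eta):
    """One trial of EG for the square loss (loss choice assumed for illustration).

    Each weight is multiplied by the exponentiated negative gradient component,
    then the vector is renormalized so the weights still sum to 1.
    """
    yhat = sum(wi * xi for wi, xi in zip(w, x))               # yhat_t = w_t . x_t
    dloss = 2.0 * (yhat - y)                                  # L'_{y_t}(yhat_t)
    factors = [wi * math.exp(-eta * dloss * xi) for wi, xi in zip(w, x)]
    z = sum(factors)                                          # normalization constant
    return yhat, [f / z for f in factors]
```

Because the factors are always positive and the vector is renormalized, the weights remain a probability distribution after every trial.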

EG± : EG with negative weights

EG is analogous to the Weighted Majority algorithm:
- Uses multiplicative update rules
- Is based on minimizing relative entropy
- Unfortunately, it can represent only positive concepts

EG± can represent any concept in the entire sample space:
- It has proven relative bounds; absolute bounds are not proven.
- It works by splitting the weight vector into positive and negative weights, with separate update rules.

EG± Algorithm:

Maintain two weight vectors w⁺t and w⁻t, and predict with ŷt = (w⁺t − w⁻t) · xt. With total weight U, both vectors are updated multiplicatively with reciprocal factors and renormalized together:

r⁺t,i = exp(−η L′yt(ŷt) xt,i),  r⁻t,i = 1 / r⁺t,i

w⁺t+1,i = U w⁺t,i r⁺t,i / Σj (w⁺t,j r⁺t,j + w⁻t,j r⁻t,j)

w⁻t+1,i = U w⁻t,i r⁻t,i / Σj (w⁺t,j r⁺t,j + w⁻t,j r⁻t,j)
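The EG± update can be sketched as follows, assuming total weight U = 1 and the square loss (both illustrative choices):

```python
import math

def eg_pm_trial(wp, wm, x, y, eta, U=1.0):
    """One trial of EG± (total weight U, square loss assumed for illustration).

    The positive and negative weight vectors share one normalization, so
    sum(wp) + sum(wm) stays equal to U after every trial.
    """
    yhat = sum((p - m) * xi for p, m, xi in zip(wp, wm, x))  # yhat = (w+ - w-) . x
    dloss = 2.0 * (yhat - y)                                 # square-loss derivative
    rp = [math.exp(-eta * dloss * xi) for xi in x]           # factors for w+
    rm = [1.0 / r for r in rp]                               # reciprocal factors for w-
    z = sum(p * a + m * b for p, a, m, b in zip(wp, rp, wm, rm))
    wp_next = [U * p * a / z for p, a in zip(wp, rp)]
    wm_next = [U * m * b / z for m, b in zip(wm, rm)]
    return yhat, wp_next, wm_next
```

When the prediction undershoots the outcome, the positive part of each relevant weight grows while the matching negative part shrinks, and vice versa.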

EGV± : a variant of EG± that approximates the exponential update factors (the exact update formula was an image and is not recoverable from this transcript).

The approximation leads to oscillation of the weight vector for certain weight distributions.

Worst Case Loss Bounds

Other Algorithms

Gradient projection algorithm (GP):
- Has similar bounds to GD
- Uses the constraint that the weights must sum to 1
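One way to realize GP, assuming (for illustration) that the constraint is enforced by a Euclidean projection back onto the hyperplane Σi wi = 1 after a plain GD step with the square loss:

```python
def gp_trial(w, x, y, eta):
    """One trial of gradient projection: GD step, then project onto sum(w) = 1.

    The Euclidean projection onto the hyperplane sum(w) = 1 shifts every
    component by the same amount.
    """
    yhat = sum(wi * xi for wi, xi in zip(w, x))
    dloss = 2.0 * (yhat - y)                                  # square-loss derivative
    w2 = [wi - eta * dloss * xi for wi, xi in zip(w, x)]      # unconstrained GD step
    shift = (1.0 - sum(w2)) / len(w2)                         # projection onto hyperplane
    return yhat, [wi + shift for wi in w2]
```

Unlike EG, the projection allows individual weights to go negative; only their sum is constrained.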

Exponentiated Gradient algorithm with Unnormalized weights (EGU)

When all outcomes, inputs, and comparison vectors are positive, it has loss bounds stated in terms of an unnormalized relative entropy between the comparison vector u and the start vector s (the exact bound formula was an image and is not recoverable from this transcript).
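EGU drops the normalization step of EG, so the total weight is free to change. A minimal sketch, again assuming the square loss for illustration:

```python
import math

def egu_trial(w, x, y, eta):
    """One trial of EGU: the EG multiplicative step without renormalization.

    Weights stay positive, but their sum is free to grow or shrink.
    """
    yhat = sum(wi * xi for wi, xi in zip(w, x))
    dloss = 2.0 * (yhat - y)                                  # square-loss derivative
    return yhat, [wi * math.exp(-eta * dloss * xi) for wi, xi in zip(w, x)]
```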

- Have a fixed target concept u ∈ R^N; u gives the weight of each input variable.
- Use instances of input xt drawn from a probability measure on R^N.
- Random noise is added to the inputs.
- Run each algorithm on the (same) inputs.
- Plot cumulative losses for each algorithm.
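The setup above can be reproduced in a small self-contained simulation. All concrete choices here (k = 2 relevant variables out of N = 20, uniform inputs, Gaussian noise, the learning rate) are illustrative assumptions, not the paper's exact parameters:

```python
import math
import random

def simulate(T=300, N=20, k=2, eta=0.05, noise=0.1, seed=1):
    """Run GD and EG on the same noisy trial sequence; return cumulative square losses.

    The target concept u is sparse: only k of the N input variables are relevant.
    """
    rng = random.Random(seed)
    u = [1.0 / k] * k + [0.0] * (N - k)    # fixed target concept
    w_gd = [0.0] * N                        # GD start vector
    w_eg = [1.0 / N] * N                    # EG start vector (sums to 1)
    loss_gd = loss_eg = 0.0
    for _ in range(T):
        x = [rng.uniform(0.0, 1.0) for _ in range(N)]
        y = sum(ui * xi for ui, xi in zip(u, x)) + rng.gauss(0.0, noise)
        # GD trial
        yhat = sum(wi * xi for wi, xi in zip(w_gd, x))
        loss_gd += (yhat - y) ** 2
        g = 2.0 * (yhat - y)
        w_gd = [wi - eta * g * xi for wi, xi in zip(w_gd, x)]
        # EG trial (same instance and outcome)
        yhat = sum(wi * xi for wi, xi in zip(w_eg, x))
        loss_eg += (yhat - y) ** 2
        g = 2.0 * (yhat - y)
        f = [wi * math.exp(-eta * g * xi) for wi, xi in zip(w_eg, x)]
        z = sum(f)
        w_eg = [fi / z for fi in f]
    return loss_gd, loss_eg
```

Varying N while holding k fixed is the experiment behind the linear-versus-logarithmic loss growth claim discussed below.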

Results

(The results slides showed plots of cumulative loss for each algorithm; the plots are not reproduced in this transcript.)

GD vs. EG

- Random errors confuse GD much more.
- When the number of relevant variables is constant:
  - Loss(GD) grows linearly in N
  - Loss(EG) grows logarithmically in N
- GD does better when all variables are relevant and the input is consistent (few or no errors).

Conclusion

- Worst-case loss bounds exist only for the square loss; loss bounds for the relative entropy loss are still needed.
- GD has provably optimal bounds.
- Lower bounds for EG and EG± are still required.
- EG and EG± perform better in error-prone learning environments.

Exponentiated Gradient versus Gradient Descent for Linear Predictors

Linear Predictors

Gradient Descent

Exponentiated Gradient

Algorithm GDL(s, η)

Algorithm EGL(s, η)

EG± Algorithm:
