Exponentiated Gradient versus Gradient Descent for Linear PredictorsJyrki Kivinen and Manfred Warmuth
Presented By: Maitreyi N
The bounds can be improved, with a tuned learning rate, to bounds stated in terms of the loss of the comparison vector u, its distance from the start vector s, and the size of the instances.
This is the gradient of the squared Euclidean distance d(w, wt) = (1/2) ||w − wt||²:  ∇w d(w, wt) = w − wt.
This is the gradient of the relative entropy d_re(w, wt) = Σi wi ln(wi / wt,i):  ∂d_re/∂wi = ln(wi / wt,i) + 1.
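As a reading aid, here is a hedged sketch (in LaTeX, reconstructed from the paper's general framework as recalled here; check constants and normalization details against the original) of how each distance function produces its update: the new weight vector approximately minimizes a distance to the old one plus η times the current loss, with L′ evaluated at the old prediction ŷt (an approximation).

  w_{t+1} \approx \arg\min_{w}\; d(w, w_t) + \eta\, L(y_t,\, w \cdot x_t)

  d(w, w_t) = \tfrac{1}{2}\lVert w - w_t\rVert_2^2,\qquad
  \nabla_w d = w - w_t
  \;\Rightarrow\; w_{t+1} = w_t - \eta\, L'_{y_t}(\hat{y}_t)\, x_t

  d_{re}(w, w_t) = \sum_i w_i \ln\frac{w_i}{w_{t,i}},\qquad
  \frac{\partial d_{re}}{\partial w_i} = \ln\frac{w_i}{w_{t,i}} + 1
  \;\Rightarrow\; w_{t+1,i} \propto w_{t,i}\, \exp\!\bigl(-\eta\, L'_{y_t}(\hat{y}_t)\, x_{t,i}\bigr)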
Algorithm GDL(s, η)
Parameters: L: a loss function from R × R to [0, ∞), s: a start vector in R^N, and η: a learning rate in [0, ∞).
Initialization: Before the first trial, set w1=s.
Prediction: Upon receiving the t th instance xt, give the prediction ŷt = wt · xt.
Update: Upon receiving the t th outcome yt, update the weights according to the rule
wt+1 = wt − η L′yt(ŷt) xt.
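A minimal runnable sketch of one GDL trial in Python/NumPy, assuming the square loss L(y, ŷ) = (ŷ − y)²; the function name gd_trial and the concrete loss derivative are illustrative, not from the paper.

import numpy as np

def gd_trial(w, x, y, eta):
    # One trial of GDL with the square loss L(y, yhat) = (yhat - y)**2.
    yhat = w @ x                    # prediction: yhat_t = w_t . x_t
    grad = 2.0 * (yhat - y)         # L'_{y_t}(yhat_t) for the square loss
    w_next = w - eta * grad * x     # additive gradient-descent update
    return yhat, w_next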
Algorithm EGL(s, η)
Parameters: L: a loss function from R × R to [0, ∞), s: a start vector with Σi=1..N si = 1, and η: a learning rate in [0, ∞).
Initialization: Before the first trial, set w1=s.
Prediction: Upon receiving the t th instance xt, give the prediction ŷt = wt · xt.
Update: Upon receiving the t th outcome yt, update the weights according to the rule
wt+1,i = wt,i rt,i / Σj=1..N wt,j rt,j,   where rt,i = exp(−η L′yt(ŷt) xt,i).
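For comparison, a sketch of one EGL trial under the same square-loss assumption; eg_trial is an illustrative name. The weights stay on the probability simplex because of the final normalization.

import numpy as np

def eg_trial(w, x, y, eta):
    # One trial of EGL: multiplicative update followed by normalization.
    yhat = w @ x                          # prediction: yhat_t = w_t . x_t
    grad = 2.0 * (yhat - y)               # L'_{y_t}(yhat_t) for the square loss
    r = np.exp(-eta * grad * x)           # factors r_{t,i} = exp(-eta L' x_{t,i})
    w_next = w * r
    return yhat, w_next / w_next.sum()    # keep the weights summing to 1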
EG± : EG with negative weights
EG is analogous to the Weighted Majority Algorithm:
Uses multiplicative update rules
Is based on minimizing relative entropy
Unfortunately, it can represent only positive concepts
EG± can represent any concept in the entire sample space.
It has proven relative bounds; absolute bounds are not proven.
Works by splitting the weight vector into positive and negative weights, with separate update rules.
EG± Algorithm:
Prediction: ŷt = (w⁺t − w⁻t) · xt
Update: the positive and negative halves get reciprocal multiplicative factors and are normalized jointly:
w⁺t+1,i = w⁺t,i rt,i / Zt  and  w⁻t+1,i = (w⁻t,i / rt,i) / Zt,
where rt,i = exp(−η L′yt(ŷt) xt,i) and Zt = Σj (w⁺t,j rt,j + w⁻t,j / rt,j).
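A hedged Python/NumPy sketch of one EG± trial under the square-loss assumption used above, with the total weight fixed at 1 (the paper carries an explicit total-weight parameter U, whose exact placement is not reproduced here); eg_pm_trial is an illustrative name.

import numpy as np

def eg_pm_trial(w_pos, w_neg, x, y, eta):
    # One EG± trial: the effective weight vector is w_pos - w_neg.
    yhat = (w_pos - w_neg) @ x
    grad = 2.0 * (yhat - y)               # square-loss derivative L'_{y_t}(yhat_t)
    r = np.exp(-eta * grad * x)
    w_pos, w_neg = w_pos * r, w_neg / r   # reciprocal factors for the two halves
    z = w_pos.sum() + w_neg.sum()         # joint normalization
    return yhat, w_pos / z, w_neg / z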
EGV±
The approximation leads to oscillation of the weight vector for certain weight distributions
Worst Case Loss Bounds
Other Algorithms
Gradient projection algorithm (GP):
Has bounds similar to GD's
Uses the constraint that the weights must sum to 1 (see the sketch below)
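A hedged sketch of one GP trial, reading the sum-to-1 constraint as a GD step projected back onto the hyperplane of weight vectors summing to 1 (which amounts to mean-centering the instance in the update); gp_trial is an illustrative name, and the square loss is assumed as above.

import numpy as np

def gp_trial(w, x, y, eta):
    # One GP trial: a GD step projected onto {w : sum(w) = 1}.
    yhat = w @ x
    grad = 2.0 * (yhat - y)             # square-loss derivative
    step = eta * grad * (x - x.mean())  # mean-centering keeps sum(w) unchanged
    return yhat, w - step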
Exponentiated Gradient algorithm with Unnormalized weights (EGU)
When all outcomes, inputs, and comparison vectors are positive, it has worst-case loss bounds stated in terms of X, Y, the distance d(u, s) from the start vector to the comparison vector, and a free parameter c.
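Assuming, per the description above, that EGU is the EG update with the normalization step dropped, one trial can be sketched as follows (egu_trial is an illustrative name; square loss as above).

import numpy as np

def egu_trial(w, x, y, eta):
    # One EGU trial: multiplicative update, no normalization; weights stay positive.
    yhat = w @ x
    grad = 2.0 * (yhat - y)
    return yhat, w * np.exp(-eta * grad * x)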
Have a fixed target concept u ∈ R^N; u gives the weight of each input variable.
Instances xt are drawn from a probability measure on R^N.
Random noise is added to the inputs.
Run each algorithm on the (same) inputs.
Plot the cumulative loss of each algorithm (a sketch of such a run follows below).
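A self-contained sketch of such a simulation, comparing GD and EG on a sparse target; the dimensions, learning rates, noise level, and input distribution below are illustrative choices, not the paper's experimental settings.

import numpy as np

rng = np.random.default_rng(0)
N, T, k = 100, 1000, 5                    # N inputs, T trials, k relevant variables (illustrative)
eta_gd, eta_eg = 0.001, 0.05              # illustrative learning rates
u = np.zeros(N); u[:k] = 1.0 / k          # fixed target concept u in R^N

w_gd = np.zeros(N)                        # GD start vector
w_eg = np.full(N, 1.0 / N)                # EG start vector (uniform, sums to 1)
loss_gd = loss_eg = 0.0

for t in range(T):
    x_clean = rng.normal(size=N)                   # instance drawn from a distribution on R^N
    x = x_clean + rng.normal(scale=0.1, size=N)    # random noise added to the inputs
    y = u @ x_clean                                # outcome determined by the target concept
    p = w_gd @ x; loss_gd += (p - y) ** 2          # GD trial
    w_gd = w_gd - eta_gd * 2 * (p - y) * x
    p = w_eg @ x; loss_eg += (p - y) ** 2          # EG trial
    w_eg = w_eg * np.exp(-eta_eg * 2 * (p - y) * x)
    w_eg = w_eg / w_eg.sum()

print(f"cumulative square loss  GD: {loss_gd:.1f}  EG: {loss_eg:.1f}")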
Results
GD vs. EG
Random errors confuse GD much more.
When the number of relevant variables is constant:
Loss(GD) grows linearly in N
Loss(EG) grows logarithmically in N
GD does better when all variables are relevant and the input is consistent (few or no errors).
Conclusion
Worst-case loss bounds exist only for the square loss; loss bounds for the relative entropy loss are still needed.
GD has provably optimal bounds.
Lower bounds for EG and EG± are still required.
EG and EG± perform better in error-prone learning environments.