Short reading for Thursday • Job talk at 1:30pm in ETRL 101 • Kuka robotics

Page 1:

• Short reading for Thursday
• Job talk at 1:30pm in ETRL 101
• Kuka robotics
  – http://www.kuka-timoboll.com/en/home/

Page 2:

Unified View

Page 3:

On-line, Tabular TD(λ)

Page 4:

• Flappy Bird: state space?
  – http://sarvagyavaish.github.io/FlappyBirdRL/

Page 5:

Chapter 9: Generalization and Function Approximation

• How does experience in parts of the state space help us act over the entire state space?

• How can function approximation (supervised learning) merge with RL?

• Function approximator convergence

Page 6:

Chapter 9: Generalization and Function Approximation

• How does experience in parts of the state space help us act over the entire state space?

• How can function approximation (supervised learning) merge with RL?

• Function approximator convergence

• “I read it and it mostly makes sense.”
• “There are many methods to do [function approximation], most of which made very little sense as explained.”

Page 7:

• Instead of a lookup table for the values of V at time t (Vt), consider some kind of weight vector wt

• E.g., wt could be the weights in a neural network

• Instead of one value (weight) per state, now we update this vector

Page 8:

Insight: Steal from Existing Supervised Learning Methods!

• Training = {X, Y}
• Error = target output – actual output

Page 9:

TD Backups as Training Examples

• Recall the TD(0) backup:
  V(st) ← V(st) + α[rt+1 + γV(st+1) − V(st)]
• As a training example:
  – Input = features of st
  – Target output = rt+1 + γV(st+1)
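The (input, target) pair above can be sketched for a linear approximator. The feature vectors, γ, and α below are illustrative choices, not values from the slides:

```python
import numpy as np

def td0_training_example(phi_s, phi_s_next, reward, w, gamma=0.9):
    """Turn one TD(0) backup into a supervised (input, target) pair.

    phi_s, phi_s_next: feature vectors for s_t and s_{t+1}
    w: current weight vector (linear value estimate V(s) = w . phi(s))
    """
    x = phi_s                                        # input = features of s_t
    target = reward + gamma * np.dot(w, phi_s_next)  # r_{t+1} + gamma * V(s_{t+1})
    return x, target

# One supervised-style update toward the TD target:
w = np.zeros(3)
x, target = td0_training_example(np.array([1.0, 0.0, 0.0]),
                                 np.array([0.0, 1.0, 0.0]),
                                 reward=1.0, w=w)
alpha = 0.5
w += alpha * (target - np.dot(w, x)) * x             # error = target - actual output
```

The point of the sketch is that once the backup is phrased as (input, target), any supervised learner that accepts incremental examples can be plugged in.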

Page 10:

What FA methods can we use?

• In principle, anything!
  – Neural networks
  – Decision trees
  – Multivariate regression
  – Support Vector Machines
  – Gaussian Processes
  – Etc.

• But, we normally want to
  – Learn while interacting
  – Handle nonstationarity
  – Not take “too long” or use “too much” memory
  – Etc.

Page 11:

Perceptron

• Binary, linear classifier: Rosenblatt, 1957
• Eventual failure of the perceptron to do “everything” shifted the field of AI towards symbolic representations
• Sum = w1x1 + w2x2 + … + wnxn
• Output is +1 if sum > 0, −1 otherwise
• wj = wj + (target − output) xj
• Also, can use x0 = 1 so that w0 acts as a bias
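The update rule above can be sketched in a few lines. The OR dataset and epoch count are illustrative, not from the slides:

```python
def perceptron_output(w, x):
    """+1 if the weighted sum is positive, -1 otherwise (x[0] = 1 gives a bias)."""
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if s > 0 else -1

def perceptron_update(w, x, target):
    """Rosenblatt update: w_j = w_j + (target - output) * x_j."""
    out = perceptron_output(w, x)
    return [wj + (target - out) * xj for wj, xj in zip(w, x)]

# Learn OR, with inputs augmented by a bias feature x0 = 1:
data = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
for _ in range(10):
    for x, target in data:
        w = perceptron_update(w, x, target)
```

Since OR is linearly separable, the perceptron convergence theorem guarantees the loop settles on a separating weight vector.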

Page 12:

Perceptron

• Consider a Perceptron with 3 weights: x, y, and a bias

Page 13:

Spatial-based Perceptron Weights

Page 14:

Neural Networks

• How do we get around only linear solutions?

Page 15:

Neural Networks

• A multi-layer network of linear perceptrons is still linear.

• Non-linear (differentiable) units
• Logistic or tanh function

Page 16:

Page 17:

Page 18:

Page 20:

Gradient Descent

• w = (w1, w2, …, wn)T

• Assume Vt(s) is a sufficiently smooth, differentiable function of w for all states s in S

• Also, assume that training examples are of the form:
  – Input = features of st, target output = Vπ(st)

• Goal: minimize error on the observed samples

Page 21:

• wt+1 = wt + α[Vπ(st) − Vt(st)] ∇wt Vt(st)

• ∇wt Vt(st) is the vector of partial derivatives of Vt(st) with respect to the weights
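For a linear Vt, the gradient ∇wt Vt(st) is just the feature vector, so the update reduces to a few lines. The target value, features, and step size below are illustrative assumptions:

```python
import numpy as np

def gradient_value_update(w, phi, v_target, alpha=0.1):
    """One step of w_{t+1} = w_t + alpha * [V^pi(s_t) - V_t(s_t)] * grad_w V_t(s_t).

    For a linear approximator V_t(s) = w . phi(s), the gradient
    with respect to w is simply the feature vector phi(s).
    """
    v_hat = np.dot(w, phi)
    return w + alpha * (v_target - v_hat) * phi

w = np.zeros(2)
phi = np.array([1.0, 2.0])
for _ in range(100):
    w = gradient_value_update(w, phi, v_target=3.0)
# After repeated updates on this sample, w . phi approaches the target 3.0
```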

Page 22:

• Let J(w) be any function of the weight space
• The gradient at any point wt in this space is:
  ∇J(wt) = (∂J(wt)/∂w1, ∂J(wt)/∂w2, …, ∂J(wt)/∂wn)T

• Then, to iteratively move down the gradient:
  wt+1 = wt − α ∇J(wt)

• Why still do this iteratively? If you could just eliminate the error in one step, why could that be a bad idea?

Page 23:

• Common goal is to minimize mean-squared error (MSE) over distribution d:
  MSE(wt) = Σs∈S d(s) [Vπ(s) − Vt(s)]²

• Why does this make any sense?
• d is the distribution of states receiving backups
• on- or off-policy distribution

Page 24:

• (Dmitry)
• Motivation for choosing MSE:
  – MSE is the squared 2-norm of the error.
  – 1) The square of the norm is a sum of squares, and its derivative is a linear function, which is good.
  – 2) QR decomposition can be used to get a nice solution for linear approximation problems
• Find x which minimizes the 2-norm of (A*x − b)
• Other norms don't have such a simple solution

Page 25:

Gradient Descent

• Each sample gradient is an unbiased estimate of the true gradient
• This will converge to a local minimum of the MSE if α decreases “appropriately” over time
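The “appropriate” decrease is the standard stochastic-approximation condition on the step-size sequence, a detail the slide leaves implicit:

```latex
% Step-size conditions for convergence (stochastic approximation):
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
% e.g., \alpha_t = 1/t satisfies both conditions.
```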

Page 26:

• Unfortunately, we don’t actually have vπ(s)
• Instead, we just have an estimate of the target, Vt
• If Vt is an unbiased estimate of vπ(st), then we’ll converge to a local minimum (again with the α caveat)

Page 27:

• δ is our normal TD error
• e is the vector of eligibility traces
• θ is the weight vector
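A sketch of how δ, e, and θ fit together in one TD(λ) step with accumulating traces for a linear value function. The specific α, γ, λ, and feature vectors are assumptions for illustration:

```python
import numpy as np

def td_lambda_step(theta, e, phi_s, phi_s_next, reward,
                   alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) update for a linear value function V(s) = theta . phi(s).

    delta: the normal TD error
    e: eligibility trace vector (accumulating traces)
    theta: the weight vector
    """
    delta = reward + gamma * np.dot(theta, phi_s_next) - np.dot(theta, phi_s)
    e = gamma * lam * e + phi_s          # decay traces, then add current gradient
    theta = theta + alpha * delta * e    # update all recently visited features
    return theta, e

theta = np.zeros(3)
e = np.zeros(3)
theta, e = td_lambda_step(theta, e,
                          np.array([1.0, 0.0, 0.0]),
                          np.array([0.0, 1.0, 0.0]),
                          reward=1.0)
```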

Page 28:

• Note that TD(λ) targets are biased
• But… we do it anyway

Page 29:

Linear Methods

• Why are these a particularly important type of function approximation?

• Parameter vector θt

• Column vector of features φs for every state (same number of components as θt)
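For linear methods, Vt(s) = θtᵀφs and the gradient with respect to θ is simply φs. A minimal sketch, with made-up numbers:

```python
import numpy as np

def v(theta, phi_s):
    """Linear value estimate: V_t(s) = theta_t^T . phi_s."""
    return np.dot(theta, phi_s)

def grad_v(theta, phi_s):
    """Gradient of the linear value w.r.t. theta is just the feature vector."""
    return phi_s

theta = np.array([0.5, -1.0, 2.0])
phi_s = np.array([1.0, 0.0, 1.0])
print(v(theta, phi_s))   # -> 2.5
```

This trivially simple gradient is one reason linear methods are singled out on the next slide.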

Page 30:

Linear Methods

• Gradient is simple
• Error surface for MSE is simple (single minimum)

Page 31:

• Coarse coding
• Generalization based on features activating

Page 32:

Size Matters

Page 33:

Tile coding

Page 34:

Tile coding, view #2

• Consider a game of soccer
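A minimal one-dimensional tile-coding sketch, assuming a handful of offset tilings; the tile counts and offsets are illustrative, not the ones used in the slides:

```python
def tile_features(x, n_tilings=4, tiles_per_tiling=10, x_min=0.0, x_max=1.0):
    """Active-tile indices for a 1-D input under several offset tilings.

    Each tiling is shifted by a fraction of a tile width, so nearby
    inputs share many (but not all) active tiles -- that overlap is
    what gives coarse generalization.
    """
    width = (x_max - x_min) / tiles_per_tiling
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings           # each tiling shifted slightly
        idx = int((x - x_min + offset) / width)
        idx = min(idx, tiles_per_tiling)         # clamp the top edge
        active.append(t * (tiles_per_tiling + 1) + idx)
    return active

# Nearby states activate overlapping (but not identical) tile sets:
a = tile_features(0.50)
b = tile_features(0.56)
```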

Page 35:

But, how do you pick the “coarseness”?

• Adaptive tile coding
• IFSA

Page 36:

Irregular tilings

Page 37:

Radial Basis Functions

• Instead of binary, have degrees of activation

• Can combine with tile coding!
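A sketch of radial-basis features, assuming Gaussian activations around hand-picked centers (the centers and width are illustrative):

```python
import numpy as np

def rbf_features(s, centers, sigma=0.2):
    """Graded activations: each feature responds by distance to its center,
    instead of the 0/1 activation of a tile."""
    s = np.asarray(s, dtype=float)
    return np.array([np.exp(-np.sum((s - c) ** 2) / (2 * sigma ** 2))
                     for c in np.asarray(centers, dtype=float)])

centers = [[0.0], [0.5], [1.0]]
phi = rbf_features([0.5], centers)
# The center at 0.5 responds maximally; the others fall off smoothly.
```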

Page 38:

• Kanerva Coding: choose “prototype states” and consider distance from prototype states

• Now, updates depend on number of features, not number of dimensions

• Instance-based methods
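A minimal Kanerva-coding sketch over binary states, assuming Hamming distance and a small hand-picked prototype set:

```python
def kanerva_features(s, prototypes, threshold=1):
    """Binary features from distances to a fixed set of prototype states.

    A feature is active when the state is within `threshold` (Hamming
    distance here) of its prototype, so update cost scales with the
    number of prototypes, not the dimensionality of the state."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return [1 if hamming(s, p) <= threshold else 0 for p in prototypes]

prototypes = [(0, 0, 0), (1, 1, 1), (1, 0, 0)]
phi = kanerva_features((1, 0, 0), prototypes)
```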

Page 39:

Fitted R-Max [Jong and Stone, 2007]

(figure: recorded transitions over state variables x and y)

• Instance-based RL method [Ormoneit & Sen, 2002]
• Handles continuous state spaces
• Weights recorded transitions by distances
• Plans over discrete, abstract MDP
• Example: 2 state variables, 1 action

Page 40:

Page 41:

Page 42:

(Pages 40-42 repeat the Fitted R-Max slide with successive builds of the x-y figure.)
Page 43:

Mountain-Car Task

Page 44:

3D Mountain Car

• X: position and acceleration
• Y: position and acceleration

Page 45:

• Control with FA
• Bootstrapping

Page 46:

Efficiency in ML / AI

1. Data efficiency (rate of learning)
2. Computational efficiency (memory, computation, communication)
3. Researcher efficiency (autonomy, ease of setup, parameter tuning, priors, labels, expertise)

Page 47:

• Todd’s work with decision trees
• Course feedback