
### Transcript of CS344 : Introduction to Artificial Intelligence

• CS344 : Introduction to Artificial Intelligence

Pushpak Bhattacharyya CSE Dept., IIT Bombay

Lecture 28- PAC and Reinforcement Learning

IIT Bombay

• [Figure: universe U containing concept c and hypothesis h; the shaded region c Δ h is the error region, with probability P(c Δ h)]

• Learning means the following should happen:

Pr(P(c Δ h) ≤ ε) ≥ 1 − δ

PAC model of learning: Probably Approximately Correct.


• [Figure: + and − examples in the x–y plane, with a rectangle ABCD drawn around the + examples]


• Algo:
1. Ignore the −ve examples.

2. Find the closest-fitting axis-parallel rectangle for the remaining data.
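The two-step algorithm can be sketched in Python; the representation of examples and all function names here are illustrative, not from the lecture:

```python
def learn_rectangle(examples):
    """Learn the closest-fitting axis-parallel rectangle.

    examples: list of ((x, y), label) pairs, where label is +1 or -1.
    Returns (xmin, xmax, ymin, ymax).
    """
    # Step 1: ignore the -ve examples.
    pos = [p for p, label in examples if label == +1]
    # Step 2: tightest axis-parallel rectangle around the +ve examples.
    xs = [x for x, _ in pos]
    ys = [y for _, y in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    """Classify a point as +1 iff it falls inside the learned rectangle."""
    xmin, xmax, ymin, ymax = rect
    x, y = point
    return +1 if xmin <= x <= xmax and ymin <= y <= ymax else -1

rect = learn_rectangle([((1, 1), +1), ((3, 2), +1), ((2, 3), +1), ((5, 5), -1)])
print(rect)                   # (1, 3, 1, 3)
print(predict(rect, (2, 2)))  # 1
```

Because the learned rectangle always lies inside the true one, it can only err by excluding positives, which is what the case analysis on the following slides exploits.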


• Case 1: If P(□ABCD) < ε, then the Algo is PAC.

[Figure: the learned rectangle h inside the true rectangle c; the region between them is the error region c Δ h, so Pr(P(c Δ h) ≤ ε) ≥ 1 − δ holds]


• Case 2: P(□ABCD) > ε. Mark four strips Top, Bottom, Left and Right inside the rectangle, each chosen so that P(Top) = P(Bottom) = P(Right) = P(Left) = ε/4.

[Figure: rectangle ABCD in the x–y plane with the four strips marked]


• Let # of examples = m.

Probability that a point comes from Top = ε/4

Probability that none of the m examples comes from Top = (1 − ε/4)^m


• Probability that some one of Top/Bottom/Left/Right receives none of the m examples ≤ 4(1 − ε/4)^m (union bound)

Probability that at least one example comes from each of the 4 regions ≥ 1 − 4(1 − ε/4)^m


• This fact must have probability greater than or equal to 1 − δ:

1 − 4(1 − ε/4)^m ≥ 1 − δ

or 4(1 − ε/4)^m ≤ δ


• Since (1 − ε/4)^m ≤ e^(−mε/4),

we must have

4 e^(−mε/4) ≤ δ

or m ≥ (4/ε) ln(4/δ)


• Let's say we want 10% error with 90% confidence, i.e., ε = δ = 0.1:

m ≥ (4/0.1) ln(4/0.1)

which is approximately 148.
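Evaluating the bound m ≥ (4/ε) ln(4/δ) numerically (a minimal sketch; the function name is illustrative):

```python
import math

def sample_size(eps, delta):
    """Smallest integer m satisfying m >= (4/eps) * ln(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# 10% error with 90% confidence:
print(sample_size(0.1, 0.1))  # 148
```

Note how the bound grows only logarithmically in 1/δ: tightening the confidence is cheap compared with tightening the error ε.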


• VC-dimension

Gives a necessary and sufficient condition for PAC learnability.


• Def: Let C be a concept class, i.e., it has members c1, c2, c3, … as concepts in it.

[Figure: concepts c1, c2, c3 drawn as regions inside C]


• Let S be a subset of U (universe).

Now if every subset of S can be produced by intersecting S with the ci's, then we say C shatters S.


• The highest-cardinality set S that can be shattered gives the VC-dimension of C:

VC-dim(C) = |S|

VC-dim: Vapnik-Chervonenkis dimension.


• [Figure: a 2-dimensional surface (the x–y plane), with C = {half planes}]


• |S| = 1 can be shattered: S1 = {a}; the obtainable subsets are {a} and ∅.

[Figure: the point a in the x–y plane with a separating half plane]


• |S| = 2 can be shattered: S2 = {a, b}; the obtainable subsets are {a, b}, {a}, {b} and ∅.

[Figure: points a and b in the x–y plane]


• |S| = 3 can be shattered: S3 = {a, b, c}.

[Figure: points a, b and c in the x–y plane]



• |S| = 4 cannot be shattered: S4 = {a, b, c, d}.

[Figure: four points A, B, C, D in the x–y plane]
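These shattering claims for C = {half planes} can be checked by brute force. The sketch below uses the fact that a dichotomy is realizable by a half plane iff, for some direction, every positive point projects strictly above every negative point; the direction grid and all names are illustrative:

```python
import math
from itertools import combinations

def half_plane_separable(pos, neg, n_angles=3600):
    """True if some half plane contains all of pos and none of neg."""
    if not pos or not neg:
        return True  # the empty subset and the full set are always realizable
    for k in range(n_angles):
        t = 2 * math.pi * k / n_angles
        u = (math.cos(t), math.sin(t))
        lo = min(u[0] * x + u[1] * y for x, y in pos)
        hi = max(u[0] * x + u[1] * y for x, y in neg)
        if lo > hi:  # positives project strictly above negatives
            return True
    return False

def shattered(points):
    """True if half planes realize every subset of `points`."""
    pts = list(points)
    for r in range(len(pts) + 1):
        for subset in combinations(pts, r):
            pos = list(subset)
            neg = [p for p in pts if p not in subset]
            if not half_plane_separable(pos, neg):
                return False
    return True

triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(shattered(triangle))  # True  -> |S| = 3 can be shattered
print(shattered(square))    # False -> |S| = 4 cannot
```

For the square, the two diagonal corners cannot be cut off by any half plane (the XOR dichotomy), which is exactly why VC-dim({half planes}) = 3.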


• Fundamental Theorem of PAC learning (Ehrenfeucht et al., 1989): A concept class C is learnable for all probability distributions and all concepts in C if and only if the VC dimension of C is finite. If the VC dimension of C is d, then (next page)


• Fundamental theorem (contd): (a) for 0 …

• Book: Computational Learning Theory, M. H. G. Anthony, N. Biggs, Cambridge Tracts in Theoretical Computer Science, 1997.

Papers:

1. L. G. Valiant (1984), "A theory of the learnable", Communications of the ACM 27(11):1134-1142.
2. A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth (1989), "Learnability and the VC-dimension", Journal of the ACM.


• Introducing Reinforcement Learning


• Introduction

Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.

• Constituents

In RL, no correct/incorrect input/output pairs are given.

The feedback for the learning process is called 'reward' or 'reinforcement'.

In RL we examine how an agent can learn from success and failure, from reward and punishment.

• The RL framework

The environment is depicted as a finite-state Markov Decision Process (MDP).

The utility of a state, U(i), gives the usefulness of the state.

The agent can begin with knowledge of the environment and the effects of its actions, or it may have to learn this model as well as the utility information.

• The RL problem

Rewards can be received either in an intermediate or a terminal state.

Rewards can be a component of the actual utility (e.g., points in a TT match) or they can be hints to the actual utility (e.g., verbal reinforcements).

The agent can be a passive or an active learner.

• Passive Learning in a Known Environment

In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:

[Figure: a grid world whose states are referenced by coordinates such as [1,1] … [4,3]]

• Passive Learning in a Known Environment

The agent can move {North, East, South, West}. Episodes terminate on reaching [4,2] or [4,3].

• Passive Learning in a Known Environment

The agent is provided with a model M, where M_ij gives the probability of a transition from state i to state j.

• Passive Learning in a Known Environment

The object is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i. Utilities can be learned using 3 approaches:

1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)

• Passive Learning in a Known Environment

LMS (Least Mean Squares)

The agent makes random runs (sequences of random moves) through the environment:

[1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1
[1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1

• Passive Learning in a Known Environment

LMS: Collect statistics on the final payoff for each state (e.g., when in [2,3], how often is +1 reached vs −1?). The learner computes the average for each state. This provably converges to the true expected values (utilities).
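A minimal sketch of this LMS averaging, using the two sample runs from the earlier slide (the helper name is illustrative):

```python
from collections import defaultdict

def lms_estimate(runs):
    """Average the observed final payoff over every visit to each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, payoff in runs:
        for s in states:
            totals[s] += payoff
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

runs = [
    ([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1),
    ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1),
]
U = lms_estimate(runs)
print(U[(1, 1)])  # 0.0  (saw +1 once and -1 once)
print(U[(2, 3)])  # 1.0
```

With only two runs the averages are crude; the slow convergence noted on the next slide is exactly the need for very many such runs.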

• Passive Learning in a Known Environment

LMS: Main drawback is slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values.

• Passive Learning in a Known Environment

ADP (Adaptive Dynamic Programming): Uses the value iteration or policy iteration algorithm to calculate exact utilities of states given an estimated model.

• Passive Learning in a Known Environment

ADP: In general,

U_{n+1}(i) = R(i) + Σ_j M_ij · U_n(j)

- U_n(i) is the utility of state i after the nth iteration
- U is initially set to R(i)
- R(i) is the reward of being in state i (often non-zero for only a few end states)
- M_ij is the probability of a transition from state i to j

• Passive Learning in a Known Environment

ADP: Consider U(3,3) (0.33 here abbreviates 1/3):

U(3,3) = 0.33 × U(4,3) + 0.33 × U(2,3) + 0.33 × U(3,2)
       = 0.33 × 1.0 + 0.33 × 0.0886 + 0.33 × (−0.4430)
       = 0.2152
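The U(3,3) computation can be reproduced directly; here the transition probabilities are taken as exactly 1/3 and R(3,3) = 0 is assumed for the nonterminal state:

```python
# Neighbour utilities from the slide's example.
U = {(4, 3): 1.0, (2, 3): 0.0886, (3, 2): -0.4430}

# One ADP update for state (3,3): each of its three neighbours
# is reached with probability 1/3, and R(3,3) = 0.
u33 = sum(U[j] / 3 for j in [(4, 3), (2, 3), (3, 2)])
print(round(u33, 4))  # 0.2152
```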

• Passive Learning in a Known Environment

ADP makes optimal use of the local constraints on the utilities of states imposed by the neighborhood structure of the environment, but it is somewhat intractable for large state spaces.

• Passive Learning in a Known Environment

TD (Temporal Difference Learning)

The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations

• Passive Learning in a Known Environment

TD Learning: Suppose we observe a transition from state i to state j, with U(i) = −0.5 and U(j) = +0.5. This suggests that we should increase U(i) to make it agree better with its successor, which can be achieved using the following update rule:

U_{n+1}(i) = U_n(i) + α(R(i) + U_n(j) − U_n(i))

where α is the learning rate.
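The TD update rule as a one-step sketch; the values U(i) = −0.5 and U(j) = +0.5 come from the slide, while α = 0.5 and R(i) = 0 are assumptions for illustration:

```python
def td_update(U, i, j, R, alpha):
    """Move U[i] a fraction alpha toward R[i] + U[j]."""
    U[i] = U[i] + alpha * (R[i] + U[j] - U[i])

U = {'i': -0.5, 'j': +0.5}
R = {'i': 0.0}
td_update(U, 'i', 'j', R, alpha=0.5)
print(U['i'])  # 0.0 -- U(i) increased toward its successor
```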

• Passive Learning in a Known Environment

TD Learning performance: runs are noisier than LMS but show smaller error. TD deals only with the states observed during sample runs (not all instances, unlike ADP).

• Passive Learning in an Unknown Environment

The LMS approach and the TD approach operate unchanged in an initially unknown environment.

• Passive Learning in an Unknown Environment

ADP Approach: The environment model is learned by direct observation of transitions. The model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours.
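"Keeping track of the percentage of times" amounts to simple transition counting; this sketch and all names in it are illustrative:

```python
from collections import defaultdict

class TransitionModel:
    """Estimate M_ij as the observed frequency of i -> j transitions."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, i, j):
        self.counts[i][j] += 1

    def M(self, i, j):
        total = sum(self.counts[i].values())
        return self.counts[i][j] / total if total else 0.0

m = TransitionModel()
for j in ['b', 'b', 'c', 'b']:  # observed successors of state 'a'
    m.observe('a', j)
print(m.M('a', 'b'))  # 0.75
```

These frequency estimates converge to the true M_ij as more transitions are observed, which is all the ADP update needs.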

• Passive Learning in an Unknown Environment

ADP & TD Approaches: The ADP approach and the TD approach are closely related.

Both try to make local adjustments to the utility estimates in order to make each state agree with its successors

• Passive Learning in an Unknown Environment

Minor differences:
- TD adjusts a state to agree with its observed successor
- ADP adjusts the state to agree with all of the successors

Important differences:
- TD makes a single adjustment per observed transition
- ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M

• Passive Learning in an Unknown Environment

To make ADP more efficient:
- directly approximate the value iteration or policy iteration algorithm
- the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates

Advantages of approximate ADP:
- efficiency in terms of computation
- elimination of the long value iterations that occur in the early stages

• Active Learning in an Unknown Environment

An active agent must consider:

- what actions to take
- what their outcomes may be
- how they will affect the rewards received

• Active Learning in an Unknown Environment

Minor changes to the passive learning agent:

- the environment model now incorporates the probabilities of transitions to other states given a particular action
- the agent must maximize its expected utility
- the agent needs a performance element to choose an action at each step

• The framework


• Learning An Action Value-Function

The TD Q-learning update equation:

Q(a,i) <- Q(a,i) + α(R(i) + max_{a'} Q(a',j) − Q(a,i))

- requires no model
- is calculated after each transition from state i to state j

Thus, Q-values can be learned directly from reward feedback.
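A sketch of the standard tabular TD Q-learning update; the states, actions, rewards and α below are made-up values for illustration:

```python
def q_update(Q, a, i, j, R, actions, alpha):
    """TD Q-learning: needs no transition model, only the observed
    transition i -> j under action a and the reward R[i]."""
    best_next = max(Q[(ap, j)] for ap in actions)
    Q[(a, i)] += alpha * (R[i] + best_next - Q[(a, i)])

actions = ['L', 'R']
Q = {('L', 's'): 0.0, ('R', 's'): 0.0, ('L', 't'): 1.0, ('R', 't'): 0.5}
R = {'s': 0.0}
q_update(Q, 'R', 's', 't', R, actions, alpha=0.1)
print(Q[('R', 's')])  # 0.1
```

Unlike the ADP update, no M_ij appears anywhere: the max over successor Q-values plays the role of the utility of the next state.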

• Generalization In Reinforcement Learning

Explicit Representation: we have assumed that all the functions learned by the agents (U, M, R, Q) are represented in tabular form. Explicit representation involves one output value for each input tuple.