CS344 : Introduction to Artificial Intelligence
Pushpak Bhattacharyya CSE Dept., IIT Bombay
Lecture 28: PAC and Reinforcement Learning
IIT Bombay
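[Figure: universe U containing the target concept C and the hypothesis h]
$C \oplus h$ = error region
$P(C \oplus h)$ = probability of the error region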

Learning means the following should happen:
$\Pr\big(P(c \oplus h) \le \epsilon\big) \ge 1 - \delta$
PAC model of learning: Probably Approximately Correct.

[Figure: positive (+) examples in the x-y plane, enclosed by the axis-parallel rectangle ABCD]

Algo:
1. Ignore the -ve examples.
2. Find the closest-fitting axis-parallel rectangle for the data.
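The two steps of the algorithm can be sketched in Python. This is a minimal illustration, not the lecture's own code: the names `fit_rectangle` and `classify`, and the representation of examples as `((x, y), label)` pairs, are assumptions made for the example.

```python
def fit_rectangle(examples):
    """Return the tightest axis-parallel rectangle around the +ve examples.

    `examples` is a list of ((x, y), label) pairs; label is True for +ve.
    The result is (x_min, x_max, y_min, y_max), or None if no +ve example exists.
    """
    # Step 1: ignore the -ve examples.
    positives = [point for point, label in examples if label]
    if not positives:
        return None
    # Step 2: closest-fitting axis-parallel rectangle for the remaining data.
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def classify(rect, point):
    """Label a point +ve exactly when it lies inside the learned rectangle."""
    if rect is None:
        return False
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


sample = [((1, 1), True), ((3, 2), True), ((2, 4), True), ((5, 5), False)]
h = fit_rectangle(sample)
print(h)                    # (1, 3, 1, 4)
print(classify(h, (2, 2)))  # True
print(classify(h, (5, 5)))  # False
```

Because the learned rectangle is the tightest fit around the positives, it always lies inside the target rectangle, so its only errors are positive points it fails to cover.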

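Case 1: If $P(\square ABCD) < \epsilon$, then the Algo is PAC (the error region lies inside ABCD, so its probability is below $\epsilon$ for any sample).
[Figure: target rectangle c = ABCD in the x-y plane with the hypothesis rectangle h fitted around the + examples; the error region is $c \oplus h$]
$\Pr\big(P(c \oplus h) \le \epsilon\big) \ge 1 - \delta$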

Case 2: $P(\square ABCD) > \epsilon$
[Figure: rectangle ABCD in the x-y plane with four strips along its sides, labelled Top, Bottom, Left and Right]
P(Top) = P(Bottom) = P(Right) = P(Left) = $\epsilon/4$

Let # of examples = m.
Probability that a point comes from Top = $\epsilon/4$
Probability that none of the m examples comes from Top = $(1 - \epsilon/4)^m$

Probability that at least one of Top/Bottom/Left/Right receives none of the m examples $\le 4(1 - \epsilon/4)^m$ (union bound over the four strips)
Probability that at least one example comes from each of the 4 regions $\ge 1 - 4(1 - \epsilon/4)^m$

This fact must have probability greater than or equal to $1 - \delta$:
$1 - 4(1 - \epsilon/4)^m > 1 - \delta$
or $4(1 - \epsilon/4)^m < \delta$

$(1 - \epsilon/4)^m < e^{-\epsilon m/4}$
We must have
$4 e^{-\epsilon m/4} < \delta$
or $m > (4/\epsilon) \ln(4/\delta)$
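The final step follows by dividing by 4 and taking logarithms. Written out, with $\epsilon$ the error bound, $\delta$ the allowed failure probability and m the number of examples (standard algebra, shown only to fill in the intermediate lines):

```latex
\begin{align*}
4\,e^{-\epsilon m/4} < \delta
  &\iff e^{-\epsilon m/4} < \frac{\delta}{4} \\
  &\iff -\frac{\epsilon m}{4} < \ln\frac{\delta}{4} \\
  &\iff m > \frac{4}{\epsilon}\,\ln\frac{4}{\delta}
\end{align*}
```

The exponential bound itself comes from $1 - x \le e^{-x}$ applied with $x = \epsilon/4$ and raised to the power m.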

Let's say we want 10% error with 90% confidence ($\epsilon = 0.1$, $\delta = 0.1$):
$m > (4/0.1) \ln(4/0.1) = 40 \ln 40 \approx 147.6$
So about 150 examples suffice.
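The bound is easy to evaluate numerically. The helper below is a small sketch (the name `pac_sample_bound` is made up for this example); it returns the smallest integer m that strictly exceeds $(4/\epsilon) \ln(4/\delta)$. For $\epsilon = \delta = 0.1$ the expression evaluates to roughly 147.6.

```python
import math


def pac_sample_bound(epsilon, delta):
    """Smallest integer m with m > (4 / epsilon) * ln(4 / delta)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.floor((4 / epsilon) * math.log(4 / delta)) + 1


print(pac_sample_bound(0.1, 0.1))    # 148
print(pac_sample_bound(0.05, 0.01))  # tighter error and confidence need more examples
```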

VC-dimension
Gives a necessary and sufficient condition for PAC learnability.

Def: Let C be a concept class, i.e., it has members $c_1, c_2, c_3, \ldots$ as concepts in it.
[Figure: concept class C drawn as a set containing the concepts c1, c2, c3]

Let S be a subset of U (universe).
Now if all the subsets of S can be produced by intersecting S with the $c_i$'s, then we say C shatters S.

The highest-cardinality set S that can be shattered gives the VC-dimension of C.
VC-dim(C) = $|S|$
VC-dim: Vapnik-Chervonenkis dimension.
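For a finite universe, the definitions of shattering and VC-dimension can be checked by brute force. The sketch below is illustrative only: the function names and the choice of concept class (all intervals over the universe {0, ..., 4}) are assumptions for the example, not taken from the lecture.

```python
from itertools import combinations


def shatters(concepts, s):
    """True if intersecting s with the concepts yields every subset of s."""
    s = frozenset(s)
    traces = {frozenset(c) & s for c in concepts}
    return len(traces) == 2 ** len(s)


def vc_dimension(concepts, universe):
    """Largest k such that some k-element subset of the universe is shattered."""
    universe = list(universe)
    best = 0
    for k in range(1, len(universe) + 1):
        if any(shatters(concepts, s) for s in combinations(universe, k)):
            best = k
        else:
            break  # every subset of a shattered set is shattered, so stop here
    return best


# Hypothetical concept class: all intervals {i, ..., j} over the universe 0..4.
intervals = [set(range(i, j + 1)) for i in range(5) for j in range(i, 5)]

print(shatters(intervals, {1, 3}))        # True
print(shatters(intervals, {0, 2, 4}))     # False: {0, 4} cannot be cut out
print(vc_dimension(intervals, range(5)))  # 2
```

The search is exponential in the size of the universe, so it is only useful for small illustrative cases.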

Example: 2-dim surface (x-y plane), C = {half-planes}

$|S| = 1$ can be shattered
$S_1 = \{a\}$
Subsets: $\{a\}$, $\emptyset$
[Figure: point a in the x-y plane]

$|S| = 2$ can be shattered
$S_2 = \{a, b\}$
Subsets: $\{a, b\}$, $\{a\}$, $\{b\}$, $\emptyset$
[Figure: points a and b in the x-y plane]

$|S| = 3$ can be shattered
$S_3 = \{a, b, c\}$
[Figure: points a, b and c in the x-y plane]

$|S| = 4$ cannot be shattered
$S_4 = \{a, b, c, d\}$
[Figure: four points A, B, C, D in the x-y plane]
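The half-plane examples can be checked numerically for specific point sets. The sketch below is a heuristic, not a proof: it tries a fixed number of evenly spaced directions and looks for one along which the chosen points project strictly beyond the rest. The function names, the sampled-direction approach and the two point sets are all assumptions made for this example. It only shows that this particular 4-point set is not shattered; the general statement for every 4-point set needs a separate argument.

```python
import math
from itertools import product


def separable_by_halfplane(inside, outside, n_angles=720):
    """Heuristic: is there a half-plane containing `inside` but none of `outside`?"""
    if not inside or not outside:
        return True  # a half-plane missing everything, or covering everything, exists
    for k in range(n_angles):
        theta = 2 * math.pi * k / n_angles
        wx, wy = math.cos(theta), math.sin(theta)  # candidate normal direction
        lowest_in = min(wx * x + wy * y for x, y in inside)
        highest_out = max(wx * x + wy * y for x, y in outside)
        if lowest_in > highest_out:
            return True  # a threshold between the two values separates them
    return False


def shattered_by_halfplanes(points):
    """Check every one of the 2^n subsets of `points`."""
    for labels in product([False, True], repeat=len(points)):
        inside = [p for p, keep in zip(points, labels) if keep]
        outside = [p for p, keep in zip(points, labels) if not keep]
        if not separable_by_halfplane(inside, outside):
            return False
    return True


triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(shattered_by_halfplanes(triangle))  # True
print(shattered_by_halfplanes(square))    # False: opposite corners cannot be isolated
```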

Fundamental Theorem of PAC learning (Ehrenfeucht et al., 1989)
A concept class C is learnable for all probability distributions and all concepts in C if and only if the VC dimension of C is finite.
If the VC dimension of C is d, then (next page)
 Fundamental theorem (contd)(a) for 0

Book
Computational Learning Theory, M. H. G. Anthony, N. Biggs, Cambridge Tracts in Theoretical Computer Science, 1997.
Papers
1. A theory of the learnable, L. G. Valiant, Communications of the ACM 27(11):1134-1142, 1984.
2. Learnability and the VC-dimension, A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth, Journal of the ACM, 1989.

Introducing Reinforcement Learning

Introduction
Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.

Constituents
In RL no correct/incorrect input/output pairs are given.
Feedback for the learning process is called 'Reward' or 'Reinforcement'.
In RL we examine how an agent can learn from success and failure, reward and punishment.

The RL framework
The environment is depicted as a finite-state Markov Decision Process (MDP).
Utility of a state U[i] gives the usefulness of the state
The agent can begin with knowledge of the environment and the effects of its actions; or it will have to learn this model as well as utility information.

The RL problem
Rewards can be received either in an intermediate state or in a terminal state.
Rewards can be a component of the actual utility (e.g. points in a table tennis match) or they can be hints to the actual utility (e.g. verbal reinforcements).
The agent can be a passive or an active learner

Passive Learning in a Known Environment
In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:

Passive Learning in a Known Environment
Agent can move {North, East, South, West}
Terminates on reaching [4,2] or [4,3]

Passive Learning in a Known Environment
Agent is provided:
$M_{ij}$ = a model giving the probability of reaching state j from state i

Passive Learning in a Known Environment
The object is to use this information about rewards to learn the expected utility U(i) associated with each non-terminal state i
Utilities can be learned using 3 approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)

Passive Learning in a Known Environment
LMS (Least Mean Squares)
Agent makes random runs (sequences of random moves) through the environment
[1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1
[1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1

Passive Learning in a Known Environment
LMS
Collect statistics on the final payoff for each state (e.g. when on [2,3], how often was +1 reached vs -1?)
Learner computes the average for each state
Probably converges to the true expected values (utilities)

Passive Learning in a Known Environment
LMS
Main drawback: slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values
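The LMS bookkeeping amounts to averaging, for each state, the final payoffs of the runs that passed through it. A minimal sketch follows, using the two runs listed for LMS as data; the function name `lms_utilities` and the run format (a list of states plus the final payoff) are assumptions for the example, and intermediate rewards are taken to be zero.

```python
from collections import defaultdict


def lms_utilities(runs):
    """Average the final payoff over every visit to each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, payoff in runs:
        for state in states:
            totals[state] += payoff
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


runs = [
    ([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1),
    ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1),
]
utilities = lms_utilities(runs)
print(utilities[(1, 1)])  # 0.0  (one run ended at +1, one at -1)
print(utilities[(2, 3)])  # 1.0
print(utilities[(3, 2)])  # -1.0
```

With only two runs the estimates are crude, which illustrates why many training sequences are needed before the averages settle.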

Passive Learning in a Known Environment
ADP (Adaptive Dynamic Programming)
Uses the value or policy iteration algorithm to calculate exact utilities of states given an estimated model

Passive Learning in a Known Environment
ADP
In general:
$U_{n+1}(i) = R(i) + \sum_j M_{ij} \cdot U_n(j)$
$U_n(i)$ is the utility of state i after the nth iteration
Initially set to R(i)
R(i) is the reward of being in state i (often non-zero for only a few end states)
$M_{ij}$ is the probability of transition from state i to j
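The iteration can be sketched as repeated sweeps over all states. The code below assumes the update adds the reward R(i) to the probability-weighted sum of successor utilities, starting from U(i) = R(i). The function name `adp_utilities` and the four-state chain are hypothetical, chosen only so the fixed point is easy to verify by hand.

```python
def adp_utilities(rewards, model, sweeps=100):
    """Iterate U(i) <- R(i) + sum_j M[i][j] * U(j), starting from U = R."""
    utilities = dict(rewards)
    for _ in range(sweeps):
        utilities = {
            i: rewards[i] + sum(p * utilities[j] for j, p in model.get(i, {}).items())
            for i in rewards
        }
    return utilities


# Hypothetical chain: "win" and "lose" are terminal (no outgoing transitions).
rewards = {"a": 0.0, "b": 0.0, "win": 1.0, "lose": -1.0}
model = {
    "a": {"b": 0.5, "lose": 0.5},
    "b": {"win": 0.8, "a": 0.2},
}
u = adp_utilities(rewards, model)
print(round(u["a"], 4))  # -0.1111
print(round(u["b"], 4))  # 0.7778
```

Solving the two linear equations by hand gives U(a) = -1/9 and U(b) = 7/9, which the iteration reproduces.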

Passive Learning in a Known Environment
ADP
Consider U(3,3):
U(3,3) = 0.33 x U(4,3) + 0.33 x U(2,3) + 0.33 x U(3,2)
       = 0.33 x 1.0 + 0.33 x 0.0886 + 0.33 x (-0.4430)
       = 0.2152 (taking 0.33 as 1/3)
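The arithmetic for U(3,3) can be reproduced with a one-step backup. Two assumptions are made here and should be read as such: the reward in the non-terminal state (3,3) is taken to be 0, and U(3,2) is taken to be -0.4430, since only the negative value yields the stated result of 0.2152. The helper name `backup` is made up for the example.

```python
def backup(reward, transitions, utilities):
    """One-step backup: reward plus probability-weighted successor utilities."""
    return reward + sum(p * utilities[s] for s, p in transitions.items())


# Successor utilities for state (3,3); the sign of U(3,2) is an assumption.
successor_utilities = {(4, 3): 1.0, (2, 3): 0.0886, (3, 2): -0.4430}
# Each of the three neighbours is reached with probability 1/3 (written 0.33).
transitions = {(4, 3): 1 / 3, (2, 3): 1 / 3, (3, 2): 1 / 3}

u_33 = backup(0.0, transitions, successor_utilities)
print(round(u_33, 4))  # 0.2152
```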

Passive Learning in a Known Environment
ADP
Makes optimal use of the local constraints on utilities of states imposed by the neighborhood structure of the environment
Somewhat intractable for large state spaces

Passive Learning in a Known Environment
TD (Temporal Difference Learning)
The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations

Passive Learning in a Known Environment
TD Learning
Suppose we observe a transition from state i to state j
U(i) = -0.5 and U(j) = +0.5
Suggests that we should increase U(i) to make it agree better with its successor
Can be achieved using the following updating rule, where $\alpha$ is the learning rate:
$U_{n+1}(i) = U_n(i) + \alpha\,(R(i) + U_n(j) - U_n(i))$
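A single TD update for the transition described on this slide can be sketched as follows. The function name `td_update`, the learning rate of 0.1 and the zero reward in state i are assumptions for the example; the utilities are taken to be U(i) = -0.5 and U(j) = +0.5, so that increasing U(i) moves it toward its successor.

```python
def td_update(utilities, i, j, reward_i, alpha):
    """Move U(i) a fraction alpha toward R(i) + U(j) after observing i -> j."""
    utilities[i] += alpha * (reward_i + utilities[j] - utilities[i])
    return utilities[i]


utilities = {"i": -0.5, "j": 0.5}
new_value = td_update(utilities, "i", "j", reward_i=0.0, alpha=0.1)
print(round(new_value, 4))  # -0.4, i.e. U(i) has moved toward U(j)
```

Only the state just left is adjusted, which is why a TD step is so much cheaper than recomputing every utility.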

Passive Learning in a Known Environment
TD Learning
Performance:
Runs are noisier than LMS but with smaller error
Deals only with the states observed during sample runs (not all instances, unlike ADP)

Passive Learning in an Unknown Environment
The LMS approach and the TD approach operate unchanged in an initially unknown environment.
ADP approach adds a step that updates an estimated model of the environment.

Passive Learning in an Unknown Environment
ADP Approach
The environment model is learned by direct observation of transitions
The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours
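Keeping track of transition percentages can be sketched with simple counters. The function name `estimate_model` and the sample transitions are hypothetical; the result has the same shape as the model M, mapping each state i to the estimated probability of moving to each observed neighbour j.

```python
from collections import Counter, defaultdict


def estimate_model(observed_transitions):
    """Estimate M[i][j] as the fraction of observed moves out of i that went to j."""
    counts = defaultdict(Counter)
    for i, j in observed_transitions:
        counts[i][j] += 1
    return {
        i: {j: n / sum(successors.values()) for j, n in successors.items()}
        for i, successors in counts.items()
    }


observed = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "c")]
model = estimate_model(observed)
print(model["a"])  # {'b': 0.666..., 'c': 0.333...}
print(model["b"])  # {'c': 1.0}
```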

Passive Learning in an Unknown Environment
ADP & TD Approaches
The ADP approach and the TD approach are closely related
Both try to make local adjustments to the utility estimates in order to make each state agree with its successors

Passive Learning in an Unknown Environment
Minor differences:
TD adjusts a state to agree with its observed successor
ADP adjusts the state to agree with all of the successors
Important differences:
TD makes a single adjustment per observed transition
ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M

Passive Learning in an Unknown Environment
To make ADP more efficient:
directly approximate the algorithm for value iteration or policy iteration
the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates
Advantages of the approximate ADP:
efficient in terms of computation
eliminates the long value iterations that occur in the early stages

Active Learning in an Unknown Environment
An active agent must consider :
what actions to take
what their outcomes may be
how they will affect the rewards received

Active Learning in an Unknown Environment
Minor changes to the passive learning agent:
the environment model now incorporates the probabilities of transitions to other states given a particular action
the agent aims to maximize its expected utility
the agent needs a performance element to choose an action at each step
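A performance element of this kind can be sketched as picking the action with the highest expected utility under an action-conditioned model. Everything named here is an assumption for illustration: the nested-dictionary layout `model[state][action][next_state]`, the function names, and the toy numbers.

```python
def expected_utility(state, action, model, utilities):
    """Probability-weighted utility of the states that `action` may lead to."""
    return sum(p * utilities[nxt] for nxt, p in model[state][action].items())


def choose_action(state, model, utilities):
    """Greedy performance element: the action with the highest expected utility."""
    return max(
        model[state],
        key=lambda action: expected_utility(state, action, model, utilities),
    )


# Hypothetical model: transition probabilities given the state and the action.
model = {
    "s": {
        "North": {"good": 0.8, "bad": 0.2},
        "East": {"good": 0.3, "bad": 0.7},
    }
}
utilities = {"good": 1.0, "bad": -1.0}
print(choose_action("s", model, utilities))  # North
```

A purely greedy choice like this never explores, so an active learner would normally mix it with some exploration; that refinement is left out of the sketch.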

The framework

Learning an Action-Value Function
The TD Q-Learning update equation
requires no model
is calculated after each transition from state i to j
Thus, the action values can be learned directly from reward feedback
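The update equation itself did not survive in this transcript. The sketch below uses the commonly cited undiscounted textbook form, Q(a, i) <- Q(a, i) + alpha (R(i) + max over a' of Q(a', j) - Q(a, i)), as an assumption rather than a reconstruction of the slide. The function name `q_update`, the dictionary keyed by (state, action), and the numbers are hypothetical.

```python
def q_update(q, state, action, reward, next_state, actions, alpha):
    """One model-free Q-learning step after observing state -> next_state."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + best_next - old)
    return q[(state, action)]


actions = ["North", "East"]
q = {("j", "North"): 0.5, ("j", "East"): 0.2}
value = q_update(q, "i", "East", reward=0.0, next_state="j", actions=actions, alpha=0.5)
print(value)  # 0.25
```

No transition model appears anywhere in the update; only the observed transition and the stored Q-values are used.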

Generalization in Reinforcement Learning
Explicit Representation
We have assumed that all the functions learned by the agents (U, M, R, Q) are represented in tabular form
Explicit representation involves one output value for each input tuple.