CS344 : Introduction to Artificial Intelligence
Pushpak Bhattacharyya CSE Dept., IIT Bombay
Lecture 28- PAC and Reinforcement Learning
IIT Bombay
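-
[Figure: universe $U$ containing the target concept $C$ and the hypothesis $h$.]
$C \oplus h$ = error region (the symmetric difference of $C$ and $h$)
$P(C \oplus h)$ = probability of the error region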
-
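Learning means the following should happen:
$\Pr\big(P(c \oplus h) \le \epsilon\big) \ge 1 - \delta$
Here $\epsilon$ is the accuracy parameter and $\delta$ is the confidence parameter.
This is the PAC model of learning: Probably Approximately Correct.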
-
[Figure: axis-parallel rectangle ABCD in the x-y plane, with positive (+) examples inside it and negative (-) examples outside it.]
-
Algorithm:
1. Ignore the negative (-ve) examples.
2. Find the closest-fitting axis-parallel rectangle for the remaining (positive) data.
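The two steps of the algorithm can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the names `fit_rectangle` and `predict`, the `(point, label)` example format and the sample points are hypothetical, not part of the lecture.

```python
from typing import Iterable, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max)


def fit_rectangle(examples: Iterable[Tuple[Point, bool]]) -> Optional[Rect]:
    """Return the tightest axis-parallel rectangle around the positive examples."""
    positives = [p for p, label in examples if label]  # step 1: ignore negatives
    if not positives:
        return None  # no positive example seen: the hypothesis is empty
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))  # step 2: closest fit


def predict(rect: Optional[Rect], point: Point) -> bool:
    """Classify a point as positive iff it lies inside the learned rectangle."""
    if rect is None:
        return False
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


# Hypothetical sample: three positive and two negative points.
sample = [((1.0, 1.0), True), ((3.0, 2.0), True), ((2.0, 4.0), True),
          ((0.0, 0.0), False), ((5.0, 5.0), False)]
h = fit_rectangle(sample)
print(h)  # (1.0, 3.0, 1.0, 4.0)
```

Because the rectangle is built only from positive points, it can never extend beyond the target rectangle, which is the property the PAC analysis relies on.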
-
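Case 1: If $P(\square ABCD) < \epsilon$, then the algorithm is PAC.
The hypothesis $h$ always lies inside the target rectangle $c = ABCD$, so the error region $c \oplus h$ is contained in $ABCD$ and $\Pr\big(P(c \oplus h) \le \epsilon\big) = 1 \ge 1 - \delta$.
[Figure: target rectangle ABCD ($c$) in the x-y plane with the learned rectangle $h$ inside it; positive (+) examples inside $h$, negative (-) examples outside ABCD.]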
-
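Case 2: $P(\square ABCD) > \epsilon$
[Figure: rectangle ABCD in the x-y plane with four strips, Top, Bottom, Left and Right, along the inside of its edges; negative (-) examples outside.]
The strips are chosen so that $P(\text{Top}) = P(\text{Bottom}) = P(\text{Right}) = P(\text{Left}) = \epsilon/4$.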
-
Let the number of examples be $m$.
Probability that a point comes from Top = $\epsilon/4$
Probability that none of the $m$ examples comes from Top = $(1 - \epsilon/4)^m$
-
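Probability that at least one of Top/Bottom/Left/Right receives none of the $m$ examples $\le 4(1 - \epsilon/4)^m$ (union bound over the four strips)
Probability that each of the 4 regions receives at least one example $\ge 1 - 4(1 - \epsilon/4)^m$
In that event the error region of $h$ lies inside the four strips, so its probability is at most $\epsilon$.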
-
This event must have probability greater than or equal to $1 - \delta$:
$1 - 4(1 - \epsilon/4)^m > 1 - \delta$
or $4(1 - \epsilon/4)^m < \delta$
-
$(1 - \epsilon/4)^m < e^{-\epsilon m/4}$
We must have
$4 e^{-\epsilon m/4} < \delta$
or $m > (4/\epsilon) \ln(4/\delta)$
-
Let's say we want 10% error with 90% confidence ($\epsilon = 0.1$, $\delta = 0.1$):
$m > (4/0.1) \ln(4/0.1) = 40 \ln 40$
which is approximately 147.6, so 148 examples suffice.
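The bound $m > (4/\epsilon) \ln(4/\delta)$ is easy to evaluate numerically. The helper below is a small sketch; the function name `pac_sample_bound` is an assumption made for illustration.

```python
import math


def pac_sample_bound(epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (4/epsilon) * ln(4/delta)."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))


print(pac_sample_bound(0.1, 0.1))    # 148
print(pac_sample_bound(0.05, 0.01))  # 480
```

Note that the requirement grows linearly in $1/\epsilon$ but only logarithmically in $1/\delta$, so asking for higher confidence is much cheaper than asking for lower error.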
-
VC-dimension
Gives a necessary and sufficient condition for PAC learnability.
-
Def: Let $C$ be a concept class, i.e., it has members $c_1, c_2, c_3, \ldots$ as concepts in it.
[Figure: concept class $C$ containing the concepts $c_1$, $c_2$, $c_3$.]
-
Let $S$ be a subset of $U$ (the universe).
If every subset of $S$ can be produced by intersecting $S$ with some concept $c_i$ in $C$, then we say $C$ shatters $S$.
-
The highest-cardinality set $S$ that can be shattered gives the VC-dimension of $C$.
VC-dim($C$) = $|S|$
VC-dim: Vapnik-Chervonenkis dimension.
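For a finite universe and a finite concept class, shattering and the VC-dimension can be checked by brute force. The sketch below is illustrative only: the function names and the example concept class (integer intervals over a five-element universe) are assumptions, not taken from the lecture.

```python
from itertools import combinations


def shatters(concepts, subset):
    """True if intersecting `subset` with the concepts yields all 2^|S| subsets."""
    s = frozenset(subset)
    traces = {s & c for c in concepts}
    return len(traces) == 2 ** len(s)


def vc_dimension(concepts, universe):
    """Largest |S| over S within the universe that the class shatters (brute force)."""
    best = 0
    for size in range(1, len(universe) + 1):
        if any(shatters(concepts, s) for s in combinations(universe, size)):
            best = size
        else:
            break  # if no set of this size is shattered, no larger set is either
    return best


universe = [1, 2, 3, 4, 5]
# Hypothetical concept class: all integer intervals [a, b] in the universe, plus the empty set.
intervals = {frozenset(range(a, b + 1)) for a in universe for b in universe if a <= b}
intervals.add(frozenset())

print(shatters(intervals, {2, 4}))        # True
print(shatters(intervals, {1, 3, 5}))     # False: {1, 5} would have to skip 3
print(vc_dimension(intervals, universe))  # 2
```

The search is exponential in the size of the universe, so it is only useful for building intuition on toy examples.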
-
Example: 2-dimensional surface, $C$ = {half-planes}
[Figure: the x-y plane.]
-
$|S| = 1$ can be shattered
$S_1 = \{a\}$
Subsets: $\{a\}$, $\emptyset$
[Figure: point a in the x-y plane.]
-
$|S| = 2$ can be shattered
$S_2 = \{a, b\}$
Subsets: $\{a, b\}$, $\{a\}$, $\{b\}$, $\emptyset$
[Figure: points a and b in the x-y plane.]
-
$|S| = 3$ can be shattered (for three non-collinear points)
$S_3 = \{a, b, c\}$
[Figure: points a, b and c in the x-y plane.]
-
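$|S| = 4$ cannot be shattered
$S_4 = \{a, b, c, d\}$
For example, when the four points are the corners of a convex quadrilateral, no half-plane contains a diagonal pair of corners while excluding the other two.
[Figure: four points A, B, C, D in the x-y plane.]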
-
Fundamental Theorem of PAC learning (Ehrenfeucht et al., 1989)
A concept class $C$ is learnable for all probability distributions and all concepts in $C$ if and only if the VC dimension of $C$ is finite.
If the VC dimension of $C$ is $d$, then (next page)
-
Fundamental theorem (contd.)
(a) for 0
-
Book
Computational Learning Theory, M. H. G. Anthony, N. Biggs, Cambridge Tracts in Theoretical Computer Science, 1997.
Papers
1. A theory of the learnable, L. G. Valiant, Communications of the ACM 27(11):1134-1142, 1984.
2. Learnability and the VC-dimension, A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth, Journal of the ACM, 1989.
-
Introducing Reinforcement Learning
-
Introduction
Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.
-
Constituents
In RL no correct/incorrect input/output pairs are given.
Feedback for the learning process is called 'Reward' or 'Reinforcement'
In RL we examine how an agent can learn from success and failure, reward and punishment
-
The RL framework
The environment is depicted as a finite-state Markov Decision Process (MDP).
Utility of a state U[i] gives the usefulness of the state
The agent can begin with knowledge of the environment and the effects of its actions; or it will have to learn this model as well as utility information.
-
The RL problem
Rewards can be received in either an intermediate or a terminal state.
Rewards can be a component of the actual utility (e.g. points in a table-tennis match) or they can be hints to the actual utility (e.g. verbal reinforcements).
The agent can be a passive or an active learner
-
Passive Learning in a Known Environment
In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:
[Figure: grid of states.]
-
Passive Learning in a Known Environment
Agent can move {North, East, South, West}
Terminate on reaching [4,2] or [4,3]
-
Passive Learning in a Known Environment
Agent is provided:
$M_{ij}$ = a model giving the probability of a transition from state $i$ to state $j$
-
Passive Learning in a Known Environment
The objective is to use this information about rewards to learn the expected utility $U(i)$ associated with each nonterminal state $i$.
Utilities can be learned using 3 approaches:
1) LMS (least mean squares)
2) ADP (adaptive dynamic programming)
3) TD (temporal difference learning)
-
Passive Learning in a Known Environment
LMS (Least Mean Squares)
Agent makes random runs (sequences of random moves) through the environment:
[1,1]->[1,2]->[1,3]->[2,3]->[3,3]->[4,3] = +1
[1,1]->[2,1]->[3,1]->[3,2]->[4,2] = -1
-
Passive Learning in a Known Environment
LMS
Collect statistics on the final payoff for each state (e.g. when on [2,3], how often was +1 reached vs. -1?)
Learner computes the average for each state
Provably converges to the true expected values (utilities)
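The averaging step of LMS can be sketched as follows, using the two training runs listed on the earlier LMS slide. The function name `lms_utilities` and the tuple encoding of states are assumptions made for illustration.

```python
from collections import defaultdict

# Training runs: (sequence of visited states, final payoff).
runs = [
    ([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1.0),
    ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1.0),
]


def lms_utilities(runs):
    """Average the final payoff over every visit to each state."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for states, payoff in runs:
        for s in states:
            totals[s] += payoff
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}


U = lms_utilities(runs)
print(U[(1, 1)])  # 0.0  (one run ended in +1, one in -1)
print(U[(2, 3)])  # 1.0
print(U[(3, 2)])  # -1.0
```

With only two runs the estimates are very coarse, which illustrates why many training sequences are needed before the averages settle.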
-
Passive Learning in a Known Environment
LMS
Main drawback:
- slow convergence
- it takes the agent well over 1000 training sequences to get close to the correct values
-
Passive Learning in a Known Environment
ADP (Adaptive Dynamic Programming)
Uses the value iteration or policy iteration algorithm to calculate exact utilities of states given an estimated model
-
Passive Learning in a Known Environment
ADP
In general:
$U_{n+1}(i) = R(i) + \sum_j M_{ij} \, U_n(j)$
- $U_n(i)$ is the utility of state $i$ after the $n$th iteration
- Initially set to $R(i)$
- $R(i)$ is the reward of being in state $i$ (often nonzero for only a few end states)
- $M_{ij}$ is the probability of a transition from state $i$ to state $j$
-
Passive Learning in a Known Environment
ADP
Consider $U(3,3)$:
$U(3,3) = \tfrac{1}{3} U(4,3) + \tfrac{1}{3} U(2,3) + \tfrac{1}{3} U(3,2)$
$= \tfrac{1}{3}(1.0) + \tfrac{1}{3}(0.0886) + \tfrac{1}{3}(-0.4430)$
$= 0.2152$
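The $U(3,3)$ computation can be reproduced with a one-step backup that adds the reward of a state to the expected utility of its successors. This is a sketch under stated assumptions: the reward of the nonterminal state (3,3) is taken to be 0, its three successors are taken to be equally likely (probability 1/3 each), and the name `adp_backup` is hypothetical.

```python
def adp_backup(i, R, M, U):
    """One backup: reward of state i plus the expected utility of its successors."""
    return R[i] + sum(p * U[j] for j, p in M[i].items())


R = {(3, 3): 0.0}  # assumed: no reward in the nonterminal state (3,3)
M = {(3, 3): {(4, 3): 1 / 3, (2, 3): 1 / 3, (3, 2): 1 / 3}}
U = {(4, 3): 1.0, (2, 3): 0.0886, (3, 2): -0.4430}

print(round(adp_backup((3, 3), R, M, U), 4))  # 0.2152
```

Repeating this backup over all states until the values stop changing is what ties each utility to the utilities of its neighbours.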
-
Passive Learning in a Known Environment
ADP
- makes optimal use of the local constraints on utilities of states imposed by the neighborhood structure of the environment
- somewhat intractable for large state spaces
-
Passive Learning in a Known Environment
TD (Temporal Difference Learning)
The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations
-
Passive Learning in a Known Environment
TD Learning
Suppose we observe a transition from state $i$ to state $j$, with $U(i) = -0.5$ and $U(j) = +0.5$.
This suggests that we should increase $U(i)$ to make it agree better with its successor.
This can be achieved using the following updating rule:
$U_{n+1}(i) = U_n(i) + \alpha \big(R(i) + U_n(j) - U_n(i)\big)$
where $\alpha$ is the learning rate.
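A single TD update for the example transition can be sketched as below. The reward values and the learning rate of 0.1 are assumptions chosen for illustration, since the slide gives only $U(i)$ and $U(j)$; the name `td_update` is likewise hypothetical.

```python
def td_update(U, R, i, j, alpha):
    """Move U[i] a fraction alpha towards R[i] + U[j] after observing i -> j."""
    U[i] = U[i] + alpha * (R[i] + U[j] - U[i])
    return U[i]


U = {"i": -0.5, "j": 0.5}
R = {"i": 0.0, "j": 0.0}  # assumed rewards; the text does not give them
new_value = td_update(U, R, "i", "j", alpha=0.1)  # alpha = 0.1 is an arbitrary choice
print(round(new_value, 3))  # -0.4
```

The estimate for state i rises from -0.5 to -0.4, one tenth of the way towards the value suggested by its observed successor.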
-
Passive Learning in a Known Environment
TD Learning
Performance:
- Runs are noisier than LMS, but the error is smaller
- Deals only with states observed during sample runs (not all instances, unlike ADP)
-
Passive Learning in an Unknown Environment
The LMS approach and the TD approach operate unchanged in an initially unknown environment.
The ADP approach adds a step that updates an estimated model of the environment.
-
Passive Learning in an Unknown Environment
ADP Approach
The environment model is learned by direct observation of transitions.
The environment model $M$ can be updated by keeping track of the percentage of times each state transitions to each of its neighbours.
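Estimating $M$ from observed transitions amounts to counting and normalizing. The sketch below applies this to the two example runs from the LMS slide; the function name `estimate_model` and the nested-dictionary representation of $M$ are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_model(runs):
    """Estimate M[i][j] as the fraction of observed departures from i that went to j."""
    counts = defaultdict(Counter)
    for states in runs:
        for i, j in zip(states, states[1:]):
            counts[i][j] += 1
    return {i: {j: n / sum(c.values()) for j, n in c.items()}
            for i, c in counts.items()}


runs = [
    [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)],
    [(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)],
]
M = estimate_model(runs)
print(M[(1, 1)])  # {(1, 2): 0.5, (2, 1): 0.5}
```

Each row of the estimated model sums to 1, and the estimates sharpen as more runs are observed.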
-
Passive Learning in an Unknown Environment
ADP & TD Approaches
The ADP approach and the TD approach are closely related
Both try to make local adjustments to the utility estimates in order to make each state agree with its successors
-
Passive Learning in an Unknown Environment
Minor differences:
- TD adjusts a state to agree with its observed successor
- ADP adjusts the state to agree with all of the successors
Important differences:
- TD makes a single adjustment per observed transition
- ADP makes as many adjustments as it needs to restore consistency between the utility estimates $U$ and the environment model $M$
-
Passive Learning in an Unknown Environment
To make ADP more efficient:
- directly approximate the algorithm for value iteration or policy iteration
- the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates
Advantages of the approximate ADP:
- efficient in terms of computation
- eliminates the long value iterations that occur in the early stage
-
Active Learning in an Unknown Environment
An active agent must consider :
- what actions to take
- what their outcomes may be
- how they will affect the rewards received
-
Active Learning in an Unknown Environment
Minor changes to the passive learning agent:
- the environment model now incorporates the probabilities of transitions to other states given a particular action
- the agent acts to maximize its expected utility
- the agent needs a performance element to choose an action at each step
-
The framework
-
Learning An Action Value-Function
The TD Q-Learning Update Equation
- requires no model
- calculated after each transition from state $i$ to state $j$
Thus, Q-values can be learned directly from reward feedback
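The update equation itself is not reproduced in this transcript. The sketch below assumes a common textbook form of the TD Q-learning update, written without a discount factor to match the TD utility rule given earlier: the Q-value of the action taken in state i is moved towards the reward plus the best Q-value available in state j. The function name, the learning rate and the example values are all hypothetical.

```python
from collections import defaultdict


def q_update(Q, i, action, reward, j, actions, alpha=0.1):
    """Undiscounted TD Q-learning update after the transition i --action--> j."""
    best_next = max(Q[(j, a)] for a in actions)
    Q[(i, action)] += alpha * (reward + best_next - Q[(i, action)])
    return Q[(i, action)]


actions = ["North", "East", "South", "West"]
Q = defaultdict(float)       # unseen (state, action) pairs start at 0
Q[((4, 3), "East")] = 1.0    # hypothetical value already learned for the next state
value = q_update(Q, (3, 3), "East", 0.0, (4, 3), actions, alpha=0.5)
print(value)  # 0.5
```

No transition model appears anywhere in the update: only the observed reward and the stored Q-values are used, which is why the method is called model-free.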
-
Generalization In Reinforcement Learning
Explicit Representation
- we have assumed that all the functions learned by the agents ($U$, $M$, $R$, $Q$) are represented in tabular form
- explicit representation involves one output value for each input tuple