
Transcript
  • CS344 : Introduction to Artificial Intelligence

    Pushpak Bhattacharyya CSE Dept., IIT Bombay

Lecture 28 - PAC and Reinforcement Learning


  • [Figure: universe U containing concept c and hypothesis h; their symmetric difference c Δ h is the error region, with probability P(c Δ h)]
  • Learning means the following should happen:

    Pr( P(c Δ h) ≤ ε ) ≥ 1 - δ

    PAC model of learning: Probably (with probability at least 1 - δ) Approximately Correct (error at most ε)


  • [Figure: positive (+) and negative (-) training examples in the x-y plane, with an axis-parallel rectangle ABCD around the positive examples]


  • Algo:

    1. Ignore the -ve (negative) examples.

    2. Find the closest-fitting axis-parallel rectangle for the remaining (positive) data.
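
    A minimal Python sketch of this tightest-rectangle learner (the function and variable names are illustrative, not from the lecture):

        def fit_tightest_rectangle(examples):
            """examples: list of ((x, y), label) pairs with label '+' or '-'.
            Returns (xmin, xmax, ymin, ymax), the closest-fitting
            axis-parallel rectangle around the positive examples."""
            positives = [p for p, label in examples if label == '+']   # step 1: ignore -ve examples
            xs = [x for x, _ in positives]
            ys = [y for _, y in positives]
            return min(xs), max(xs), min(ys), max(ys)                  # step 2: tightest rectangle

        def predict(rect, point):
            """Hypothesis h: classify '+' iff the point lies inside the learned rectangle."""
            xmin, xmax, ymin, ymax = rect
            x, y = point
            return '+' if (xmin <= x <= xmax and ymin <= y <= ymax) else '-'

        data = [((1, 1), '+'), ((2, 3), '+'), ((3, 2), '+'), ((5, 5), '-'), ((0, 4), '-')]
        h = fit_tightest_rectangle(data)
        print(h)                     # (1, 3, 1, 3)
        print(predict(h, (2, 2)))    # '+'
        print(predict(h, (4, 4)))    # '-'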


  • Case 1: If P(□ABCD) < ε, then the algorithm is PAC: the learned rectangle h lies inside the true concept c = □ABCD, so the error region c Δ h is contained in □ABCD and P(c Δ h) < ε with certainty, giving Pr(P(c Δ h) ≤ ε) = 1 ≥ 1 - δ.

    [Figure: rectangle ABCD with + and - examples, hypothesis h inside c, and the error region c Δ h]


  • Case 2: P(□ABCD) > ε. Divide the true rectangle □ABCD into four strips Top, Bottom, Left and Right along its sides, each chosen so that P(Top) = P(Bottom) = P(Right) = P(Left) = ε/4. If at least one training example falls in each strip, the tightest-fitting rectangle can exclude at most these four strips, so its error is at most 4 · ε/4 = ε.

    [Figure: rectangle ABCD in the x-y plane with the four strips Top, Bottom, Left, Right marked]


  • Let # of examples = m.

    Probability that a point comes from Top = ε/4

    Probability that none of the m examples comes from Top = (1 - ε/4)^m


  • Probability that none of the m examples comes from some one of Top/Bottom/Left/Right ≤ 4(1 - ε/4)^m (union bound)

    Probability that every one of the 4 regions receives at least one example ≥ 1 - 4(1 - ε/4)^m


  • This must happen with probability greater than or equal to 1 - δ:

    1 - 4(1 - ε/4)^m ≥ 1 - δ

    or 4(1 - ε/4)^m ≤ δ


  • Since (1 - ε/4)^m ≤ e^(-mε/4),

    we must have

    4 e^(-mε/4) ≤ δ

    or m > (4/ε) ln(4/δ)


  • Let's say we want 10% error with 90% confidence (ε = 0.1, δ = 0.1).

    m > (4/0.1) ln(4/0.1) = 40 ln 40

    which is approximately 148, so about 150 examples suffice.
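
    A quick check of this arithmetic (a minimal sketch; the function name is an illustrative choice, not from the lecture):

        import math

        def pac_sample_bound(epsilon, delta):
            """Sample-size bound m > (4/epsilon) * ln(4/delta) for the
            tightest axis-parallel rectangle learner."""
            return (4.0 / epsilon) * math.log(4.0 / delta)

        # 10% error with 90% confidence
        print(pac_sample_bound(0.1, 0.1))   # ~147.6, so roughly 150 examples suffice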


  • VC-dimension

    Gives a necessary and sufficient condition for PAC learnability.


  • Def: Let C be a concept class, i.e., it has members c1, c2, c3, ... as concepts in it.

    [Figure: concept class C containing concepts C1, C2, C3]


  • Let S be a subset of U (the universe).

    Now if every subset of S can be produced by intersecting S with some Ci, then we say C shatters S.


  • The highest-cardinality set S that can be shattered gives the VC-dimension of C:

    VC-dim(C) = |S|

    VC-dim: Vapnik-Chervonenkis dimension.


  • Example: the universe is a 2-dimensional surface (the x-y plane) and C = { half planes }.

    [Figure: the x-y plane]


  • |S| = 1 can be shattered: S1 = { a }; both subsets { a } and ∅ can be obtained with a half plane.

    [Figure: point a in the x-y plane]


  • |S| = 2 can be shattered: S2 = { a, b }; the subsets { a, b }, { a }, { b } and ∅ can all be obtained with half planes.

    [Figure: points a and b in the x-y plane]


  • |S| = 3 can be shattered: S3 = { a, b, c }.

    [Figure: points a, b, c in the x-y plane]


  • |S| = 4 cannot be shattered: S4 = { a, b, c, d }. For four points placed like the corners A, B, C, D of a rectangle, no half plane can contain exactly the two diagonally opposite points, so not every subset is obtainable; hence the VC-dimension of half planes is 3.

    [Figure: points A, B, C, D in the x-y plane]
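
    This can also be verified by brute force. The sketch below (not from the lecture; the names are illustrative) tests whether every subset of a point set can be cut off by a half plane, using a small feasibility LP from scipy:

        from itertools import combinations
        from scipy.optimize import linprog

        def halfplane_separable(inside, outside):
            """True if some half plane w.x + b >= 0 contains every point in `inside`
            and no point in `outside`.  Feasibility LP over (w1, w2, b): require
            w.x + b >= 1 for inside points and w.x + b <= -1 for outside points
            (the margin of 1 is without loss of generality, by rescaling)."""
            A_ub, b_ub = [], []
            for (x, y) in inside:
                A_ub.append([-x, -y, -1.0])    # -(w.x + b) <= -1, i.e. w.x + b >= 1
                b_ub.append(-1.0)
            for (x, y) in outside:
                A_ub.append([x, y, 1.0])       # w.x + b <= -1
                b_ub.append(-1.0)
            res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                          bounds=[(None, None)] * 3)
            return res.success

        def shattered_by_halfplanes(S):
            """True if every subset of S equals S intersected with some half plane."""
            for k in range(len(S) + 1):
                for T in combinations(S, k):
                    rest = [p for p in S if p not in T]
                    if not halfplane_separable(list(T), rest):
                        return False
            return True

        print(shattered_by_halfplanes([(0, 0), (1, 2), (3, 1)]))          # True:  |S| = 3
        print(shattered_by_halfplanes([(0, 0), (0, 1), (1, 0), (1, 1)]))  # False: |S| = 4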


  • Fundamental Theorem of PAC learning (Ehrenfeucht et al., 1989):

    A concept class C is learnable for all probability distributions and all concepts in C if and only if the VC dimension of C is finite.

    If the VC dimension of C is d, then (next page)


  • Fundamental theorem (contd.): (a) for 0 < ...
  • Book: M. H. G. Anthony and N. Biggs, Computational Learning Theory, Cambridge Tracts in Theoretical Computer Science, 1997.

    Papers:

    1. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM 27(11):1134-1142.

    2. Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. (1989). Learnability and the VC-dimension. Journal of the ACM.


  • Introducing Reinforcement Learning


  • Introduction

    Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.

  • Constituents

    In RL, no correct/incorrect input/output pairs are given.

    Feedback for the learning process is called 'reward' or 'reinforcement'.

    In RL we examine how an agent can learn from success and failure, reward and punishment.

  • The RL framework

    The environment is depicted as a finite-state Markov Decision Process (MDP).

    The utility of a state, U(i), gives the usefulness of the state.

    The agent can begin with knowledge of the environment and the effects of its actions, or it may have to learn this model as well as the utility information.

  • The RL problem

    Rewards can be received either in intermediate states or in a terminal state.

    Rewards can be a component of the actual utility (e.g., points in a table-tennis match) or they can be hints to the actual utility (e.g., verbal reinforcements).

    The agent can be a passive or an active learner.

  • Passive Learning in a Known Environment

    In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states of the grid world shown below:

    [Figure: 4 x 3 grid world with terminal states [4,3] (reward +1) and [4,2] (reward -1)]

  • Passive Learning in a Known Environment

    The agent can move {North, East, South, West}; it terminates on reaching [4,2] or [4,3].

  • Passive Learning in a Known Environment

    The agent is provided with M_ij, a model giving the probability of reaching state j from state i.

  • Passive Learning in a Known Environment

    The object is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i. Utilities can be learned using 3 approaches:

    1) LMS (least mean squares)

    2) ADP (adaptive dynamic programming)

    3) TD (temporal difference learning)

  • Passive Learning in a Known Environment: LMS (Least Mean Squares)

    The agent makes random runs (sequences of random moves) through the environment:

    [1,1] -> [1,2] -> [1,3] -> [2,3] -> [3,3] -> [4,3] = +1

    [1,1] -> [2,1] -> [3,1] -> [3,2] -> [4,2] = -1

  • Passive Learning in a Known Environment: LMS

    Collect statistics on the final payoff for each state (e.g., when in [2,3], how often is +1 reached vs -1?). The learner computes the average for each state, which converges to the true expected value (utility) as the number of runs grows.

  • Passive Learning in a Known Environment: LMS

    Main drawback: slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values.
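
    A minimal sketch of the LMS idea (illustrative code, not from the lecture): treat every state visited in a run as having reward-to-go equal to that run's final payoff, and average over runs.

        from collections import defaultdict

        def lms_utilities(runs):
            """runs: list of (sequence_of_states, final_payoff).
            Estimates U(s) as the average final payoff observed after visiting s."""
            totals = defaultdict(float)
            counts = defaultdict(int)
            for states, payoff in runs:
                for s in states:
                    totals[s] += payoff
                    counts[s] += 1
            return {s: totals[s] / counts[s] for s in totals}

        runs = [
            ([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1),
            ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1),
        ]
        print(lms_utilities(runs))   # e.g. U(1,1) = average of +1 and -1 = 0.0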

  • Passive Learning in a Known Environment: ADP (Adaptive Dynamic Programming)

    Uses the value iteration or policy iteration algorithm to calculate exact utilities of states, given an estimated model.

  • Passive Learning in a Known Environment: ADP

    In general: U_{n+1}(i) = R(i) + Σ_j M_ij · U_n(j)

    - U_n(i) is the utility of state i after the nth iteration, initially set to R(i)
    - R(i) is the reward of being in state i (often non-zero for only a few end states)
    - M_ij is the probability of a transition from state i to state j

  • Passive Learning in a Known Environment: ADP

    Consider U(3,3):

    U(3,3) = 0.33 x U(4,3) + 0.33 x U(2,3) + 0.33 x U(3,2)
           = 0.33 x 1.0 + 0.33 x 0.0886 + 0.33 x (-0.4430)
           = 0.2152
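
    A small sketch of this passive-ADP iteration (illustrative code; the model M and rewards R below form a tiny fragment around state (3,3), not the lecture's full grid):

        def adp_utilities(states, R, M, iterations=50):
            """Iterates U_{n+1}(i) = R(i) + sum_j M[i][j] * U_n(j), with U_0 = R.
            M[i] maps successors j to transition probabilities M_ij; states
            missing from M (terminals, fixed neighbours) keep U = R."""
            U = dict(R)                                   # U_0(i) = R(i)
            for _ in range(iterations):
                U = {i: R[i] + sum(p * U[j] for j, p in M.get(i, {}).items())
                     for i in states}
            return U

        states = [(3, 3), (4, 3), (2, 3), (3, 2)]
        # neighbour values held fixed at the utilities quoted above
        R = {(3, 3): 0.0, (4, 3): 1.0, (2, 3): 0.0886, (3, 2): -0.4430}
        M = {(3, 3): {(4, 3): 1/3, (2, 3): 1/3, (3, 2): 1/3}}
        print(adp_utilities(states, R, M)[(3, 3)])        # ~0.2152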

  • Passive Learning in a Known Environment: ADP

    ADP makes optimal use of the local constraints on utilities of states imposed by the neighborhood structure of the environment, but it is somewhat intractable for large state spaces.

  • Passive Learning in a Known Environment: TD (Temporal Difference Learning)

    The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations.

  • Passive Learning in a Known Environment: TD Learning

    Suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5. This suggests that we should increase U(i) to make it agree better with its successor, which can be achieved using the following update rule (α is the learning rate):

    U_{n+1}(i) = U_n(i) + α (R(i) + U_n(j) - U_n(i))
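
    A sketch of this TD update for a single observed transition (illustrative names; fixed learning rate):

        def td_update(U, R, i, j, alpha=0.1):
            """One temporal-difference adjustment after observing i -> j:
            U(i) <- U(i) + alpha * (R(i) + U(j) - U(i))."""
            U[i] = U[i] + alpha * (R.get(i, 0.0) + U[j] - U[i])
            return U

        U = {'i': -0.5, 'j': +0.5}
        R = {}                              # no intermediate reward in this toy example
        print(td_update(U, R, 'i', 'j'))    # U('i') moves from -0.5 towards +0.5: -0.4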

  • Passive Learning in a Known Environment: TD Learning

    Performance: runs are noisier than LMS but the error is smaller. TD deals only with states observed during sample runs (not all states, unlike ADP).

  • Passive Learning in an Unknown Environment

    The LMS approach and the TD approach operate unchanged in an initially unknown environment.

    The ADP approach adds a step that updates an estimated model of the environment.

  • Passive Learning in an Unknown Environment: ADP Approach

    The environment model is learned by direct observation of transitions. The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours.
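
    A sketch of that model-update step (illustrative code): count the observed transitions and normalise.

        from collections import defaultdict

        def estimate_model(transitions):
            """transitions: observed (state_i, state_j) pairs.
            Returns M with M[i][j] = fraction of times state i was followed by j."""
            counts = defaultdict(lambda: defaultdict(int))
            for i, j in transitions:
                counts[i][j] += 1
            return {i: {j: c / sum(succ.values()) for j, c in succ.items()}
                    for i, succ in counts.items()}

        observed = [('A', 'B'), ('A', 'B'), ('A', 'C'), ('B', 'C')]
        print(estimate_model(observed))   # M['A'] = {'B': 2/3, 'C': 1/3}; M['B'] = {'C': 1.0}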

  • Passive Learning in an Unknown Environment: ADP & TD Approaches

    The ADP approach and the TD approach are closely related.

    Both try to make local adjustments to the utility estimates in order to make each state agree with its successors.

  • Passive Learning in an Unknown Environment

    Minor differences:
    - TD adjusts a state to agree with its observed successor.
    - ADP adjusts the state to agree with all of the successors.

    Important differences:
    - TD makes a single adjustment per observed transition.
    - ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M.

  • Passive Learning in an Unknown Environment

    To make ADP more efficient:
    - directly approximate the algorithm for value iteration or policy iteration
    - the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates

    Advantages of the approximate ADP:
    - efficient in terms of computation
    - eliminates the long value iterations that occur in the early stages

  • Active Learning in an Unknown Environment

    An active agent must consider:

    - what actions to take
    - what their outcomes may be
    - how they will affect the rewards received

  • Active Learning in an Unknown Environment

    Minor changes to the passive learning agent:

    - the environment model now incorporates the probabilities of transitions to other states given a particular action
    - the agent must maximize its expected utility
    - the agent needs a performance element to choose an action at each step

  • The framework

    [Figure: the reinforcement learning agent-environment framework]


  • Learning an Action-Value Function

    The TD Q-learning update equation:
    - requires no model
    - is calculated after each transition from state i to state j

    Thus, action values can be learned directly from reward feedback.
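
    A minimal sketch of the standard TD Q-learning update (as in Russell and Norvig, using the same α and R(i) notation as the TD rule above); the function and state names are illustrative:

        def q_update(Q, R, i, a, j, actions, alpha=0.1):
            """Standard TD Q-learning step after observing transition i --a--> j:
            Q(a, i) <- Q(a, i) + alpha * (R(i) + max_a' Q(a', j) - Q(a, i)).
            Needs no transition model M."""
            best_next = max(Q.get((a2, j), 0.0) for a2 in actions)
            Q[(a, i)] = Q.get((a, i), 0.0) + alpha * (R.get(i, 0.0) + best_next - Q.get((a, i), 0.0))
            return Q

        actions = ['North', 'East', 'South', 'West']
        Q, R = {}, {(3, 3): 0.0, (4, 3): 1.0}
        print(q_update(Q, R, (3, 3), 'East', (4, 3), actions))   # Q[('East', (3,3))] stays 0.0 here
        print(q_update(Q, R, (4, 3), 'East', (4, 3), actions))   # Q[('East', (4,3))] becomes 0.1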

  • Generalization in Reinforcement Learning: Explicit Representation

    We have assumed that all the functions learned by the agent (U, M, R, Q) are represented in tabular form. Explicit representation involves one output value for each input tuple.