Learning, then Compacting Visual Policies (Extended Abstract)
Sebastien Jodogne (S.Jodogne@ULg.ac.be), Justus H. Piater (Justus.Piater@ULg.ac.be), Montefiore Institute (B28), University of Liege, B-4000 Liege, Belgium
Abstract

We propose a way to fight overfitting in the recently-published Reinforcement Learning of Visual Classes (RLVC) algorithm. RLVC interactively learns mappings from images to actions. In RLVC, the visual space is incrementally partitioned into a smaller set of visual classes, by testing the presence of highly informative visual features. Since the splitting rule is local, useless features are often selected, which reduces the performance of the algorithm. We introduce a method to aggregate visual classes that are similar with respect to the theory of Markov Decision Processes, using Binary Decision Diagrams. We illustrate the practical interest of this approach on a complex visual navigation task.
1. Introduction

Constructing an optimal control policy through Reinforcement Learning (RL) becomes exponentially harder as the size of the perceptual (input) space grows. This is known as the curse of dimensionality. There is little existing work on perceptual spaces consisting of images, which are extremely high-dimensional and noisy. Likewise, the Computer Vision community has so far paid very little attention to what could be called the vision-for-action paradigm, which consists in computing image-to-action mappings suitable for a given visual task. However, in many robotic control tasks, the learning agent is equipped with cameras. We therefore argue that RL algorithms able to directly tackle visual spaces should be developed.
The issue of dealing with large, discrete perceptual spaces in RL was previously addressed by the G Algorithm (Chapman & Kaelbling, 1991) and by the so-called selective attention mechanism of U Tree (McCallum, 1996). These algorithms incrementally refine a decision tree, in a sequence of attempts to remove perceptual aliasing. The decision tree partitions the input space into several perceptual classes by testing the presence of domain-specific features that are highly relevant for solving the task. This idea was later extended to continuous input spaces, as in Continuous U Tree (Uther & Veloso, 1998) and Variable Resolution Grids (Munos & Moore, 2002). However, these papers do not answer the question of which type of features could be used in the case of visual tasks.
We have recently proposed to take advantage of the visual features that are used in local-appearance methods in Computer Vision (Jodogne & Piater, 2005). Such methods have had much success in applications such as image matching, image retrieval and object recognition. They are indeed powerful and flexible, as they are robust to partial occlusions, and do not require segmentation or 3D models of objects. They rely on the detection of discontinuities in the visual signal thanks to interest point detectors. The image patch around each interest point is then summarized as a vector of real numbers, which is called a visual feature (Mikolajczyk & Schmid, 2003). Once a suitable metric (e.g., Mahalanobis or Euclidean distance) is defined upon the visual features, two images can be matched if they share a sufficient number of matching visual features.
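As a concrete illustration of this matching step, the sketch below declares two images matched when enough of their descriptors lie within a Euclidean distance threshold. The function name, thresholds and array shapes are illustrative assumptions, not values from our experiments.

```python
import numpy as np

def match_images(feats_a, feats_b, dist_thresh=0.4, min_matches=10):
    """Declare two images matched when enough descriptor pairs lie
    within dist_thresh of each other (Euclidean distance).
    feats_a, feats_b: arrays of shape (n_features, descriptor_dim)."""
    matches = 0
    for f in feats_a:
        # Distance from this descriptor to every descriptor of the other image.
        dists = np.linalg.norm(feats_b - f, axis=1)
        if dists.min() < dist_thresh:
            matches += 1
    return matches >= min_matches
```

In practice one would also enforce a one-to-one matching and a ratio test, but the counting principle is the same.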
In our framework, which we call Reinforcement Learning of Visual Classes (RLVC), we build a decision tree that tests, at each of its internal nodes, the existence of one visual feature in the visual stimulus. The leaves of the decision tree define a set of visual classes, which is hopefully exponentially smaller than the original visual space, and upon which any usual RL algorithm can be applied directly. The construction of the decision tree is carried out incrementally, as in the G Algorithm and in U Tree, but using a different splitting rule that targets zero Bellman residuals and that relies on an information-theoretic measure. We have shown that this algorithm is of practical interest by solving a visual navigation task.
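The classification side of such a tree can be sketched as follows; the `Node` class and its field names are hypothetical, and the Bellman-residual splitting rule itself is omitted.

```python
class Node:
    """One node of the classification tree. Internal nodes test the
    presence of a visual feature; leaves carry a visual-class label."""
    def __init__(self, feature=None, present=None, absent=None, label=None):
        self.feature = feature    # feature tested here (None for a leaf)
        self.present = present    # subtree when the feature is detected
        self.absent = absent      # subtree when it is not
        self.label = label        # visual-class identifier (leaves only)

def classify(node, image_features):
    """Map an image (given as the set of features detected in it)
    to the visual class of the leaf it reaches."""
    while node.feature is not None:
        node = node.present if node.feature in image_features else node.absent
    return node.label
```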
Unfortunately, because of its greedy nature, RLVC is highly subject to overfitting. Splitting one visual class has an impact on the Bellman residuals of all the visual classes. Therefore, the splitting strategy can get stuck in local minima: once a split is made that subsequently proves useless, it cannot be undone in the original description of RLVC. In this work, we propose to provide RLVC with the possibility of aggregating visual classes that share similar properties. Doing so has at least three potential benefits: (i) useless features are discarded, which enhances generalization capabilities; (ii) RLVC can escape from local minima by re-initializing the search for good features; and (iii) the number of samples that the embedded RL algorithm has at its disposal for each visual class is increased, which results in better visual control policies.
2. Compacting Visual Policies
We cannot include here a thorough discussion of RLVC for space reasons. The reader is kindly invited to consult our previous papers for more details (Jodogne & Piater, 2005). For the purpose of this work, it is sufficient to know that RLVC builds a sequence (C_0, ..., C_k, ...) of decision trees until no further perceptual aliasing is detected. Each C_k induces a mapped Markov Decision Process (MDP), the states of which correspond to the leaves of C_k. Furthermore, for each k in the sequence, a traditional RL algorithm is applied, so that properties like the optimal value function V*_k(·), the optimal state-action value function Q*_k(·, ·) and the optimal control policy π*_k(·) are known for the MDP that is mapped through C_k. Using those properties, it is easy to define a whole range of equivalence relations between the visual classes. For instance, given a threshold ε ∈ ℝ⁺, we list hereunder three possible equivalence relations for a pair of visual classes (c_k, c'_k):
Optimal Value Equivalence: |V*_k(c_k) − V*_k(c'_k)| ≤ ε.
Optimal Policy Equivalence: |V*_k(c_k) − Q*_k(c_k, π*_k(c'_k))| ≤ ε and |V*_k(c'_k) − Q*_k(c'_k, π*_k(c_k))| ≤ ε.
Optimal State-Action Value Equivalence: (∀a ∈ A) |Q*_k(c_k, a) − Q*_k(c'_k, a)| ≤ ε.
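Assuming V*_k, Q*_k and π*_k are available as lookup tables after running the embedded RL algorithm, the three criteria can be sketched directly; the function and variable names below are our own.

```python
def value_equiv(V, c1, c2, eps):
    # Optimal Value Equivalence: the optimal values are within eps.
    return abs(V[c1] - V[c2]) <= eps

def policy_equiv(V, Q, pi, c1, c2, eps):
    # Optimal Policy Equivalence: applying the other class's optimal
    # action loses at most eps with respect to the optimal value.
    return (abs(V[c1] - Q[(c1, pi[c2])]) <= eps and
            abs(V[c2] - Q[(c2, pi[c1])]) <= eps)

def q_equiv(Q, actions, c1, c2, eps):
    # Optimal State-Action Value Equivalence: Q-values agree for every action.
    return all(abs(Q[(c1, a)] - Q[(c2, a)]) <= eps for a in actions)
```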
We therefore propose to modify RLVC so that, periodically, visual classes that are equivalent with respect to one of those criteria are merged together. We have experimentally observed that the conjunction of the first two criteria seems to lead to the best performance. In this way, RLVC alternately splits and merges visual classes. A visual class can therefore no longer be represented as a single branch of a decision tree. Rather, a visual class is characterized by a general Boolean function, the variables of which correspond to the tested visual features, so that we need a more expressive tool than decision trees to represent the visual classes.
One such suitable tool is the Binary Decision Diagram (BDD) (Bryant, 1992). A BDD is a graph-based symbolic representation for encoding arbitrary Boolean functions, and has had much success in the field of computer-aided verification. A BDD is unique once the ordering of its variables is fixed, but different variable orderings can lead to different sizes of the BDD, since some variables can be discarded by the reordering process. Although the problem of finding the optimal variable ordering is NP-complete, heuristics can in practice find orderings that are close to optimality. This is interesting in our case, since reducing the size of the BDD amounts to discarding irrelevant variables, which correspond to useless visual features.
To summarize, the modified RLVC does not use decision trees anymore, but assigns one BDD to each visual class. The process of refining, with a visual feature f, a visual class v that is labeled by the BDD B(v), consists in replacing v by two visual classes v1 and v2 such that B(v1) = B(v) ∧ f and B(v2) = B(v) ∧ ¬f. Merging a pair of visual classes (v1, v2) amounts to deleting v1 and v2, and adding a new visual class v such that B(v) = B(v1) ∨ B(v2). Every time a merging operation takes place, variable reordering is carried out.
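A real implementation would rely on a BDD package with reordering support; as a minimal stand-in, the sketch below represents each characteristic function B(v) by a plain Python predicate over the set of detected features, which is enough to express the split and merge operations (the names are ours, and this deliberately ignores the canonicity and reordering machinery that makes BDDs attractive).

```python
def refine(B, f):
    """Split a class with characteristic function B on feature f:
    returns the pair (B AND f, B AND NOT f)."""
    return (lambda feats: B(feats) and f in feats,
            lambda feats: B(feats) and f not in feats)

def merge(B1, B2):
    """Merge two classes: the new class is B1 OR B2."""
    return lambda feats: B1(feats) or B2(feats)
```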
3. Experimental Results
We have applied the modified version of RLVC to a simulated navigation task. In this task, the agent moves between 11 spots on our campus (Figure 1). Every time the agent is at one of the 11 locations, its body can aim at four possible orientations (north, south, west, east). The state space is therefore of size 11 × 4 = 44. The agent has three possible actions (turn left, turn right, go forward). However, the agent does not have direct access to its position and its orientation. Rather, it only perceives a picture of the area that is in front of it (Figure 2). The agent thus has to connect images to the appropriate reactions without knowing the underlying physical structure of the task.
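The hidden state space of this task can be sketched as follows; the adjacency structure of the campus graph is not reproduced here, so only the turning dynamics and the state count are shown, and all names are illustrative.

```python
from itertools import product

ORIENTATIONS = ["north", "east", "south", "west"]
SPOTS = range(11)

# The hidden state is (spot, orientation): 11 spots x 4 orientations = 44 states.
STATES = list(product(SPOTS, ORIENTATIONS))

def turn(state, action):
    """Apply 'turn left' or 'turn right'. The 'go forward' action would
    require the campus graph, which is not reproduced here."""
    spot, heading = state
    i = ORIENTATIONS.index(heading)
    i = (i + 1) % 4 if action == "turn right" else (i - 1) % 4
    return (spot, ORIENTATIONS[i])
```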
We have used SIFT keypoints as visual features (Lowe, 2004). There was a total of 2,591 distinct features. The original version of RLVC identified 644 visual classes by selecting 336 SIFT features. The error on the computed visual policy was 2% on the learning set and 22% on the test set¹. The modified RLVC was then applied, with one compacting operation every 5 steps. The results are shown in Figure 3 and are clearly superior: there is an improvement in the generalization abilities, as well as a reduction of the number of visual classes and selected features.
As a conclusion, compacting visual policies is probably a required step to deal with realistic visual tasks, if an
¹ With respect to the optimal policy when the agent has direct access to its position and viewing direction.
Figure 1. The Montefiore campus at Liege. Red spots correspond to the places between which the agent moves. The agent can only follow the links between the different spots. Its goal is to enter the Montefiore Institute, which is labeled by a red cross, and where it gets a reward of 100. Turning left or right induces