
Universal Value Function Approximators

Tom Schaul, Daniel Horgan, Karol Gregor, David Silver

Google DeepMind, 5 New Street Square, EC4A 3TW London

Abstract

Value functions are a core component of reinforcement learning systems. The main idea is to construct a single function approximator V(s; θ) that estimates the long-term reward from any state s, using parameters θ. In this paper we introduce universal value function approximators (UVFAs) V(s, g; θ) that generalise not just over states s but also over goals g. We develop an efficient technique for supervised learning of UVFAs, by factoring observed values into separate embedding vectors for state and goal, and then learning a mapping from s and g to these factored embedding vectors. We show how this technique may be incorporated into a reinforcement learning algorithm that updates the UVFA solely from observed rewards. Finally, we demonstrate that a UVFA can successfully generalise to previously unseen goals.

1. Introduction

Value functions are perhaps the most central idea in reinforcement learning (Sutton & Barto, 1998). The main idea is to cache knowledge in a single function V(s) that represents the utility of any state s in achieving the agent's overall goal or reward function. Storing this knowledge enables the agent to immediately assess and compare the utility of states and/or actions. The value function may be efficiently learned, even from partial trajectories or under off-policy evaluation, by bootstrapping from value estimates at a later state (Precup et al., 2001).

However, value functions may be used to represent knowledge beyond the agent's overall goal. General value functions Vg(s) (Sutton et al., 2011) represent the utility of any state s in achieving a given goal g (e.g. a waypoint), represented by a pseudo-reward function that takes the place of the real rewards in the problem. Each such value function represents a chunk of knowledge about the environment:

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

how to evaluate or control a specific aspect of the environment (e.g. progress toward a waypoint). A collection of general value functions provides a powerful form of knowledge representation that can be utilised in several ways. For example, the Horde architecture (Sutton et al., 2011) consists of a discrete set of value functions ('demons'), all of which may be learnt simultaneously from a single stream of experience, by bootstrapping off-policy from successive value estimates (Modayil et al., 2014). Each value function may also be used to generate a policy or option, for example by acting greedily with respect to the values, and terminating at goal states. Such a collection of options can be used to provide a temporally abstract action-space for learning or planning (Sutton et al., 1999). Finally, a collection of value functions can be used as a predictive representation of state, where the predicted values themselves are used as a feature vector (Sutton & Tanner, 2005; Schaul & Ring, 2013).

In large problems, the value function is typically represented by a function approximator V(s, θ), such as a linear combination of features or a neural network with parameters θ. The function approximator exploits the structure in the state space to efficiently learn the value of observed states and generalise to the value of similar, unseen states.

However, the goal space often contains just as much structure as the state space (Foster & Dayan, 2002). Consider for example the case where the agent's goal is described by a single desired state: it is clear that there is just as much similarity between the value of nearby goals as there is between the value of nearby states. Our main idea is to extend the idea of value function approximation to both states s and goals g, using a universal value function approximator (UVFA¹) V(s, g, θ). A sufficiently expressive function approximator can in principle identify and exploit structure across both s and g. By universal, we mean that the value function can generalise to any goal g in a set G of possible goals: for example a discrete set of goal states; their power set; a set of continuous goal regions; or a vector representation of arbitrary pseudo-reward functions.

This UVFA effectively represents an infinite Horde of

¹ Pronounce ‘YOU-fah’.


demons that summarizes a whole class of predictions in a single object. Any system that enumerates separate value functions and learns each individually (like the Horde) is hampered in its scalability, as it cannot take advantage of any shared structure (unless the demons share parameters). In contrast, UVFAs can exploit two kinds of structure between goals: similarity encoded a priori in the goal representations g, and the structure in the induced value functions discovered bottom-up. Also, the complexity of UVFA learning does not depend on the number of demons but on the inherent domain complexity. This complexity is larger than for standard value function approximation, and representing a UVFA may require a rich function approximator such as a deep neural network.

Learning a UVFA poses special challenges. In general, the agent will only see a small subset of possible combinations of states and goals (s, g), but we would like to generalise in several ways. Even in a supervised learning context, when the true value Vg(s) is provided, this is a challenging regression problem. We introduce a novel factorization approach that decomposes the regression into two stages. We view the data as a sparse table of values that contains one row for each observed state s and one column for each observed goal g, and find a low-rank factorization of the table into state embeddings φ(s) and goal embeddings ψ(g). We then learn non-linear mappings from states s to state embeddings φ(s), and from goals g to goal embeddings ψ(g), using standard regression techniques (e.g. gradient descent on a neural network). In our experiments, this factorized approach learned UVFAs an order of magnitude faster than naive regression.

Finally, we return to reinforcement learning, and provide two algorithms for learning UVFAs directly from rewards. The first algorithm maintains a finite Horde of general value functions Vg(s), and uses these values to seed the table and hence learn a UVFA V(s, g; θ) that generalizes to previously unseen goals. The second algorithm bootstraps directly from the value of the UVFA at successor states. On the Atari game of Ms Pacman, we then demonstrate that UVFAs can scale to larger visual input spaces and different types of goals, and show they generalize across policies for obtaining possible pellets.

2. Background

Consider a Markov Decision Process defined by a set of states s ∈ S, a set of actions a ∈ A, and transition probabilities T(s, a, s′) := P(s_{t+1} = s′ | s_t = s, a_t = a). For any goal g ∈ G, we define a pseudo-reward function Rg(s, a, s′) and a pseudo-discount function γg(s). The pseudo-discount γg takes the double role of state-dependent discounting, and of soft termination, in the sense that γg(s) = 0 if and only if s is a terminal state according to goal g (e.g. the waypoint is reached). For any policy


Figure 1. Diagram of the presented function approximation architectures and training setups. In blue dashed lines, we show the learning targets for the output of each network (cloud). Left: concatenated architecture. Center: two-stream architecture with two separate sub-networks φ and ψ combined at h. Right: decomposed view of the two-stream architecture when trained in two stages, where target embedding vectors are formed by matrix factorization (right sub-diagram) and the two embedding networks are trained with those as multi-variate regression targets (left and center sub-diagrams).

π : S ↦ A and each g, and under some technical regularity conditions, we define a general value function that represents the expected cumulative pseudo-discounted future pseudo-return, i.e.,

\[
V_{g,\pi}(s) := \mathbb{E}\left[ \sum_{t=0}^{\infty} R_g(s_{t+1}, a_t, s_t) \prod_{k=0}^{t} \gamma_g(s_k) \;\middle|\; s_0 = s \right]
\]

where the actions are generated according to π, as well as an action-value function

\[
Q_{g,\pi}(s, a) := \mathbb{E}_{s'}\left[ R_g(s, a, s') + \gamma_g(s') \cdot V_{g,\pi}(s') \right].
\]

Any goal admits an optimal policy π∗g(s) := argmax_a Qπ,g(s, a), and a corresponding optimal value function² V∗g := Vg,π∗g. Similarly, Q∗g := Qg,π∗g.

3. Universal Value Function Approximators

Our main idea is to represent a large set of optimal value functions by a single, unified function approximator that generalises over both states and goals. Specifically, we consider function approximators V(s, g; θ) ≈ V∗g(s) or Q(s, a, g; θ) ≈ Q∗g(s, a), parameterized by θ ∈ Rd, that approximate the optimal value function both over a potentially large state space s ∈ S, and also over a potentially large goal space g ∈ G.

Figure 1 schematically depicts possible function approximators. The most direct approach, F : S × G ↦ R, simply concatenates state and goal together as a joint input. The mapping from the concatenated input to the regression target can then be dealt with by a non-linear function approximator such as a multi-layer perceptron (MLP).

A two-stream architecture, on the other hand, assumes that the problem has a factorized structure and computes its output from two components φ : S ↦ Rn and ψ : G ↦ Rn

² Also known as the ideal value of an active forecast (Schaul & Ring, 2013).


Figure 2. Clustering structure of embedding vectors on a one-room LavaWorld, after processing them with t-SNE (Van der Maaten & Hinton, 2008) to map their similarities in 2D. Colors correspond. Left: room layout. Center: state embeddings. Right: goal embeddings. Note how both embeddings recover the cycle and dead-end structure of the original environment.

that both map into an n-dimensional vector space of embeddings, and an output function h : Rn × Rn ↦ R that maps two such embedding vectors to a single scalar regression output. We will focus on the case where the mappings φ and ψ are general function approximators, such as MLPs, whereas h is a simple function, such as the dot-product.

The two-stream architecture is capable of exploiting the common structure between states and goals. First, goals are often defined in terms of states, e.g. G ⊆ S. We exploit this property by sharing features between φ and ψ. Specifically, when using MLPs for φ and ψ, the parameters of their first layers may be shared, so that common features are learned for both states and goals. Second, the UVFA may be known to be symmetric (i.e. V∗g(s) = V∗s(g) ∀s, g), for example a UVFA that computes the distance between state s and goal g in a reversible environment. This symmetry can be exploited by using the same network φ = ψ, and a symmetric output function h (e.g. dot product). We will refer to these two cases as partially symmetric and symmetric architectures respectively. In particular, when choosing a symmetric architecture with a distance-based h, and G = S, the UVFA will learn an embedding such that small distances according to h imply nearby states, which may be a very useful representation of state (as discussed in Appendix D).³
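To make the two architectures of Figure 1 concrete, the sketch below shows one possible PyTorch rendering of a concatenated approximator F and a two-stream approximator with MLP embeddings φ and ψ combined by a dot-product h. The layer sizes, hidden widths, and the optional shared first layer are illustrative assumptions, not the exact networks used in the paper.

```python
# Illustrative sketch of the Figure 1 architectures (sizes and layers are assumptions).
import torch
import torch.nn as nn


class ConcatUVFA(nn.Module):
    """F : S x G -> R, state and goal concatenated into a single MLP input."""

    def __init__(self, state_dim, goal_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1)).squeeze(-1)


class TwoStreamUVFA(nn.Module):
    """phi : S -> R^n and psi : G -> R^n, combined by h = dot product.

    If share_first_layer is True and states and goals share the same features
    (G subset of S), the first layer is reused in both streams, giving the
    'partially symmetric' architecture described in Section 3.
    """

    def __init__(self, state_dim, goal_dim, n=7, hidden=64, share_first_layer=False):
        super().__init__()
        first_s = nn.Linear(state_dim, hidden)
        first_g = first_s if (share_first_layer and state_dim == goal_dim) \
            else nn.Linear(goal_dim, hidden)
        self.phi = nn.Sequential(first_s, nn.ReLU(), nn.Linear(hidden, n))
        self.psi = nn.Sequential(first_g, nn.ReLU(), nn.Linear(hidden, n))

    def forward(self, s, g):
        # h(phi(s), psi(g)) = dot product of the two n-dimensional embeddings.
        return (self.phi(s) * self.psi(g)).sum(dim=-1)
```

A fully symmetric variant would simply reuse one network for both streams (φ = ψ) together with a symmetric h such as the dot product or a distance-based kernel.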

3.1. Supervised Learning of UVFAs

We consider two approaches to learning UVFAs: first, a direct end-to-end training procedure; and second, a two-stage training procedure that exploits the factorised structure of a two-stream function approximator. We first consider the simpler, supervised learning setting, before returning to the reinforcement learning setting in Section 5.

The direct approach to learning the parameters θ is by end-to-end training. This is achieved by backpropagating a suitable loss function, such as the mean-squared error

³ Note that in the minimal case of a single goal, both architectures collapse to conventional function approximation, with ψ being a constant multiplier and φ(s) a linear function of the value.

(MSE) E[(V∗g(s) − V(s, g; θ))²], to compute a descent direction. Parameters θ are then updated in this direction by a variant of stochastic gradient descent (SGD). This approach may be applied to any of the architectures in Section 3. For two-stream architectures, we further introduce a two-stage training procedure based on matrix factorization, which proceeds as follows:

• Stage 1: lay out all the values V∗g(s) in a data matrix with one row for each observed state s and one column for each observed goal g, and factorize that matrix⁴, finding a low-rank approximation that defines n-dimensional embedding spaces for both states and goals. We denote by φs the resulting target embedding vector for the row of s, and by ψg the target embedding vector for the column of g.

• Stage 2: learn the parameters of the two networks φ and ψ via separate multivariate regression toward the target embedding vectors φs and ψg from stage 1, respectively (see Figure 1, right).

The first stage leverages the well-understood technique of matrix factorisation as a subprocedure. The factorisation identifies idealised row and column embeddings φ and ψ that can accurately reconstruct the observed values, ignoring the actual states s and goals g. The second stage then tries to achieve these idealised embeddings, i.e. it learns how to convert an actual state s into its idealised embedding φ(s) and an actual goal g into its idealised embedding ψ(g). The two stages are given in pseudo-code in lines 17 to 24 of Algorithm 1, below. An optional third stage fine-tunes the joint two-stream architecture with end-to-end training.

In general, the data matrix may be large, sparse and noisy, but there is a rich literature on dealing with these cases; concretely, we used OptSpace (Keshavan et al., 2009) for matrix factorization when the data matrix is available in batch form (and h is the dot-product), and SGD otherwise. As Figure 3 shows, training can be sped up by an order of magnitude when using the two-stage approach, compared to end-to-end training, on a 2-room 7x7 LavaWorld.
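As a concrete illustration of the two-stage procedure, the following sketch uses a plain SVD as a stand-in for OptSpace (which is only valid when the value table is dense and h is the dot product) and then regresses the two embedding networks toward the factorised targets. The function name, hyperparameters, and the assumption that phi_net and psi_net output n-dimensional embeddings are illustrative, not the paper's exact setup.

```python
# Sketch of two-stage supervised UVFA training (Section 3.1).
# SVD replaces OptSpace here; this is only valid for a dense table and dot-product h.
import numpy as np
import torch
import torch.nn.functional as F


def two_stage_training(values, state_inputs, goal_inputs, phi_net, psi_net,
                       n=7, steps=10_000, lr=1e-3):
    # Stage 1: factorise the |S| x |G| table of target values into rank-n embeddings.
    U, sing, Vt = np.linalg.svd(values, full_matrices=False)
    scale = np.sqrt(sing[:n])
    phi_targets = torch.as_tensor(U[:, :n] * scale, dtype=torch.float32)     # one row per state
    psi_targets = torch.as_tensor(Vt[:n, :].T * scale, dtype=torch.float32)  # one row per goal

    # Stage 2: regress the embedding networks toward the factorised targets.
    opt = torch.optim.Adam(list(phi_net.parameters()) + list(psi_net.parameters()), lr=lr)
    for _ in range(steps):
        s_idx = np.random.randint(len(state_inputs))
        g_idx = np.random.randint(len(goal_inputs))
        loss = (F.mse_loss(phi_net(state_inputs[s_idx]), phi_targets[s_idx])
                + F.mse_loss(psi_net(goal_inputs[g_idx]), psi_targets[g_idx]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Optional stage 3 (not shown): fine-tune the joint two-stream network end-to-end.
```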

4. Supervised Learning Experiments

We ran several experiments to investigate the generalisation capabilities of UVFAs. In each case, the scenario is one of supervised learning, where the ground truth values V∗g(s) or Q∗g(s, a) are only given for some training set of pairs (s, g). We trained a UVFA on that data, and evaluated its generalisation capability in two ways. First, we measured the prediction error (MSE) on the value of a held-out set of unseen (s, g) pairs. Second, we measured the policy

⁴ This factorization can be done with SGD for most choices of h, while the linear case (dot-product) also admits more efficient algorithms like singular value decomposition.


Figure 3. Comparative performance of a UVFA on LavaWorld (two rooms of 7x7) as a function of training steps, for two different architectures: the two-stream architecture with a dot-product on top ('dot') and the concatenated architecture ('concat'), and for two training modes: end-to-end supervised regression ('e2e') and two-stage training ('2stage'). Left: policy quality. Right: MSE. Note that two-stage training is impossible for the concatenated network. Dotted lines indicate minimum and maximum observed values among 10 random seeds. The computational cost of the matrix factorization is orders of magnitude smaller than that of the regression and is omitted from these plots.

Figure 4. Illustrations. Left: color-coded value function overlaid on the maze structure of the 4-rooms environment, for 5 different goal locations ('True Subgoal' panels). Middle: reconstruction of these same values, after a rank-3 factorization of the UVFA ('Reconstructed Subgoal' panels). Right: overlay of the three factors (embedding dimensions) from the low-rank factorization of the UVFA: the first one appears to correspond to an inside-vs-corners dimension, and the other two appear to jointly encode room identity.

Figure 5. Generalization to unseen values as a function of rank (max/median/min over 10 random bisections of the subgoal space). Left: policy quality. Right: prediction error (MSE).

quality of a value function approximator Q(s, a, g; θ), defined as the true expected discounted reward according to its goal g, averaged over all start states, when following the soft-max policy over these values with temperature τ, as compared to doing the same with the optimal value function. A non-zero temperature makes this evaluation criterion change smoothly with respect to the parameters, which gives partial credit to near-perfect policies⁵, as in Figure 8. We normalise the policy quality such that optimal behaviour has a score of 1, and the uniform random policy scores 0.

To help provide intuition, we use two small-scale example domains. The first is a classical grid-world with 4 rooms and an action space with the 4 cardinal directions (see Figure 4). The second is LavaWorld (see Figure 2, left), a grid-world with a couple of differences: it contains lava blocks that are deadly when touched, it has multiple rooms, each with a 'door' that teleports the agent to the next room, and the observation features show only the current room.

In this paper we side-step the thorny issue of where goals come from, and how they are represented; instead, unless mentioned otherwise, we explore the simple case where goals are states themselves, i.e. G ⊂ S, and entering a goal is rewarded. The resulting pseudo-reward and pseudo-discount functions can then be defined as:

\[
R_g(s, a, s') =
\begin{cases}
1, & s' = g \text{ and } \gamma_{ext}(s) \neq 0 \\
0, & \text{otherwise}
\end{cases}
\qquad
\gamma_g(s) =
\begin{cases}
0, & s = g \\
\gamma_{ext}(s), & \text{otherwise}
\end{cases}
\]

where γext is the external discount function.
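For the goal-as-state case above, these two functions are straightforward to write down. The sketch below is a minimal version, assuming states and goals are comparable objects (e.g. tuples) and that gamma_ext is supplied by the environment; the function names are illustrative.

```python
# Pseudo-reward and pseudo-discount for goals that are states (G subset of S); minimal sketch.
def pseudo_reward(s, a, s_next, g, gamma_ext):
    # Reward 1 exactly when the transition enters the goal state from a non-terminal state.
    return 1.0 if (s_next == g and gamma_ext(s) != 0.0) else 0.0


def pseudo_discount(s, g, gamma_ext):
    # Terminate (discount 0) at the goal, otherwise follow the external discount.
    return 0.0 if s == g else gamma_ext(s)
```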

4.1. Tabular Completion

Our initial experiments focus on the simplest case, using a tabular representation of the UVFA. Both states and goals are represented unambiguously by 1-hot unit vectors⁶, and the mappings φ and ψ are identity functions. We investigate how accurately unseen (s, g) pairs can be reconstructed

⁵ Compared to ε-greedy, a soft-max also prevents deadly actions (as in LavaWorld) from being selected too often.

⁶ A single dimension has value 1, all others are zero.


Figure 6. Generalization to unseen values as a function of sparsity, for rank n = 7 (max/median/min over 10 random seeds). The sparsity is the proportion of (s, g) values that are unobserved. Left: policy quality. Right: prediction error (MSE).

with a low-rank approximation. Figure 4 provides an illustration. We found that the policy quality saturated at an optimal behavior level already at a low rank, while the value error kept improving (see Figure 5). This suggests that a low-rank approximation can capture much of the structure of a universal value function, at least insofar as its induced policies are concerned. Another illustration is given in Figure 2, showing how the low-rank embeddings can recover the topological structure of policies in LavaWorld.

Next, we evaluate how resilient the process is with respect to missing or unreliable data. We represent the data as a sparse matrix M of (s, g) pairs, and apply OptSpace to perform sparse matrix completion (Keshavan et al., 2009) such that M ≈ φ⊤ψ, reconstructing V(s, g; θ) := φs⊤ψg. Figure 6 shows that the matrix completion procedure provides successful generalisation, and furthermore that policy quality degrades gracefully as less and less value information is provided. See also Figure 13 in the appendix for an illustration of this process.

In summary, the UVFA is learned jointly over states and goals, which lets it infer V(s, g) even if s and g have never before been encountered together; this form of generalization would be inconceivable with a single value function.
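When the table is sparse, one simple alternative to OptSpace, mentioned above for the general case, is to fit the rank-n factorisation by SGD on the observed entries only. The sketch below does exactly that; the function name, initialisation scale, and learning rate are illustrative assumptions.

```python
# SGD-based completion of a sparse value table M ~ Phi Psi^T (Section 4.1); illustrative sketch.
import numpy as np


def complete_table(observed, num_states, num_goals, n=7, steps=200_000, lr=0.05):
    """observed: list of (state_index, goal_index, value) triples for the visible entries."""
    rng = np.random.default_rng(0)
    Phi = 0.1 * rng.standard_normal((num_states, n))   # row (state) embeddings
    Psi = 0.1 * rng.standard_normal((num_goals, n))    # column (goal) embeddings
    for _ in range(steps):
        s, g, v = observed[rng.integers(len(observed))]
        err = Phi[s] @ Psi[g] - v                       # prediction error on one observed entry
        grad_s, grad_g = err * Psi[g], err * Phi[s]
        Phi[s] -= lr * grad_s
        Psi[g] -= lr * grad_g
    return Phi, Psi                                     # V_hat(s, g) = Phi[s] @ Psi[g]
```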

4.2. Interpolation

One use case for UVFAs is the continual learning setting, where the set of goals under consideration expands over time (Ring, 1994). Ideally, a UVFA trained to estimate the values on a training set of goals GT should give reasonable estimates on new, never-seen goals from a test set GV, if there is structure in the space of all goals that the mapping function ψ can capture. We investigate this form of generalization from a training set to a test set of goals in the same supervised setup as before, but with non-tabular function approximators.

Concretely, we represent states in the LavaWorld by a grid of pixels for the currently observed room. Each pixel is a binary vector indicating the presence of the agent, lava, empty space, or a door, respectively. We represent goals as a desired state grid, also represented in terms of pixels, i.e.

Figure 7. Policy quality (top) and prediction error (bottom) of the UVFA learned by the two-stream architecture (with two-stage training), as a function of training samples (log scale), on a 7x7 LavaWorld with 1 room (max/median/min curves). Left: on the training set of goals. Right: on the test set of goals.

G ⊂ S. The data matrix M is constructed from all states, but only the goals in the training set (half of all possible states, randomly selected); a separate three-layer MLP is used for φ and ψ, training follows our proposed two-stage approach (lines 17 to 24 in Algorithm 1 below; see also Section 3.1 and Appendix B), and we use a small rank of n = 7, which provides sufficient training performance (i.e., 90% policy quality; see Figure 5). Figure 7 summarizes the results, showing that it is possible to interpolate the value function to a useful level of quality on the test set of goals (the remaining half of G).

Furthermore, transfer learning is straightforward, by post-training the UVFA on goals outside its training set: Figure 12 shows that this leads to very quick learning, as compared to training a UVFA with the same architecture from scratch.

4.3. Extrapolation

Interpolation between goals is feasible because similar goals are represented with similar features, and because the training set is broad enough to cover the different parts of goal space; in our case, because we took a random subset of G. More challengingly, we may still be able to generalize to unseen goals in completely new parts of goal space. In particular, if states are represented with the same features as goals, and states in these new parts of space have already been encountered, then a partially symmetric architecture (see Section 3) allows knowledge transfer from φ to ψ. We conduct a concrete experiment to show the feasibility


Figure 8. A UVFA trained only on goals in the first 3 rooms of the maze generalizes such that it can achieve goals in the fourth room as well. Above: the learned UVFA for 3 test goals (marked as red dots) in the fourth room; the color encodes the value function, and the arrows indicate the actions of the greedy policy. Below: ground truth for the same 3 test goals.

of the idea on the 4-room environment, where the training set now includes only goals in three rooms, and the test set contains the goals in the fourth room. Figure 8 shows the resulting generalization of the UVFA, which does indeed extrapolate to induce a policy that is likely to reach those test goals, with a policy quality of 24% ± 8%.

5. Reinforcement Learning Experiments

Having established that UVFAs can generalize across goals in a supervised learning setting, we now turn to a more realistic reinforcement learning scenario, where only a stream of observations, actions and rewards is available, but there are no ground-truth target values. Fully observable states are also no longer assumed, as observations could be noisy or aliased. We consider two approaches: first, relying on a Horde to provide targets to the UVFA, and second, bootstrapping from the UVFA itself. Given the results from Section 4, we retain a two-stream architecture throughout, with relatively small embedding dimensions n.

5.1. Generalizing from Horde

One way to incorporate UVFAs into a reinforcement learning agent is to make use of the Horde architecture (Sutton et al., 2011). Each demon in the Horde approximates the value function for a single, specific goal. These values are learnt off-policy, in parallel, from a shared stream of interactions with the environment. The key new ingredient is to seed a data matrix with values from the Horde, and use the two-stream factorization to build a UVFA that generalises to other goals than those learned directly by the Horde.

Algorithm 1 provides pseudocode for this approach. Each demon learns a specific value function Qg(s, a) for its goal g (lines 2-10), off-policy, using the Horde architecture. After processing a certain number of transitions, we build a data matrix M from their estimates, where each column g corresponds to one goal/demon, and each row t corresponds to the time-index of one transition in the history,

Algorithm 1 UVFA learning from Horde targets

 1: Input: rank n, training goals GT, budgets b1, b2, b3
 2: Initialise transition history H
 3: for t = 1 to b1 do
 4:     H ← H ∪ (st, at, γext, st+1)
 5: end for
 6: for i = 1 to b2 do
 7:     Pick a random transition t from H
 8:     Pick a random goal g from GT
 9:     Update Qg given a transition t
10: end for
11: Initialise data matrix M
12: for (st, at, γext, st+1) in H do
13:     for g in GT do
14:         Mt,g ← Qg(st, at)
15:     end for
16: end for
17: Compute rank-n factorisation M ≈ φ⊤ψ
18: Initialise embedding networks φ and ψ
19: for i = 1 to b3 do
20:     Pick a random transition t from H
21:     Do regression update of φ(st, at) toward φt
22:     Pick a random goal g from GT
23:     Do regression update of ψ(g) toward ψg
24: end for
25: return Q(s, a, g) := h(φ(s, a), ψ(g))
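Line 9 of Algorithm 1 ('Update Qg given a transition t') is left abstract. One common choice, sketched below, is an off-policy Q-learning update with the pseudo-reward and pseudo-discount of Section 4; the tabular storage (e.g. a defaultdict keyed by state-action pairs) and the learning rate are illustrative assumptions rather than the paper's neural-network demons.

```python
# One possible instantiation of line 9 of Algorithm 1: off-policy Q-learning for a
# single demon with goal g, using the Section 4 pseudo-reward and pseudo-discount.
def update_demon(Q_g, transition, g, gamma_ext, actions, alpha=0.1):
    s, a, s_next = transition                           # one sampled transition from H
    r_g = 1.0 if (s_next == g and gamma_ext(s) != 0.0) else 0.0
    gamma_g = 0.0 if s_next == g else gamma_ext(s_next)  # soft termination at the goal
    target = r_g + gamma_g * max(Q_g[(s_next, b)] for b in actions)
    Q_g[(s, a)] += alpha * (target - Q_g[(s, a)])        # move estimate toward the target
```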

Figure 9. A UVFA with embedding dimension n = 6, learned with a Horde of 12 demons on a 3-room LavaWorld of size 7×7, following Algorithm 1. Heatmaps of prediction error and policy quality (training and test panels) with (log) data samples on the vertical axis and (log) learning updates on the horizontal axis; warmer is better. Left panels: learned UVFA performance on the 12 training goals estimated by the demons. Right panels: UVFA generalization to a test set of 25 unseen goals. Given the difficulty of data imbalance, the prediction error is adequate on the training set (MSE of 0.6) and generally better on the test set (MSE of 0.4), even though we already see an overfitting trend toward the larger number of updates.

from which we then produce target embeddings φt for each time-step and ψg for each goal (as in Section 3.1; lines 11-17 of Algorithm 1), and in the following stage the two networks φ and ψ are trained by regression to match these (lines 18-24).

The performance of this approach now depends on two factors: the amount of experience the Horde has accumulated, which directly affects the quality of the targets used in the


Figure 10. UVFA results on Ms Pacman. Top left: locations of the 29 training-set pellet goals highlighted in green. Top right: locations of the 120 test-set pellet goals in pink. Center row: color-coded value functions for 5 test-set pellet goals, as learned by the Horde. Bottom row: value functions predicted by a UVFA, trained only from the 29 training-set demons, for the same 5 test-set goals. The wrapping of values between far left and far right is due to hidden passages in the game map.

matrix factorization; and the amount of computation used to build the UVFA from this data (i.e. training the embeddings φ(s) and ψ(g)).

This reinforcement learning approach is more challenging than the simple supervised learning setting that we addressed in the previous section. One complication is that the data itself depends on the way in which the behaviour policy explores the environment. In many cases, we do not see much data relevant to goals that we might like to learn about (in the Horde), or generalise to (in the UVFA). Nevertheless, we found that this approach is sufficiently effective. Figure 9 visualizes performance results along both of these dimensions, showing that there is a tipping point in the quantity of experience after which the UVFA approximates the ground truth reasonably well; training iterations, on the other hand, lead to smooth improvement. As Section 4.2 promised, the UVFA also generalizes to those goals for which no demon was trained.

5.2. Ms Pacman

Scaling up, we applied our approach to learn a UVFA for the Atari 2600 game Ms. Pacman. We used a hand-crafted goal space G: for each pellet on the screen, we defined eating it as an individual goal g ∈ R², represented by the pellet's (x, y) coordinate on screen. Following Algorithm 1, a Horde with 150 demons was trained. Each demon processed the visual input directly from the screen (see Appendix C for further experimental details).

Figure 11. Policy quality (max/median/min over 10 random seeds) after learning with direct bootstrapping from a stream of transitions using Equation 1, as a function of the fraction of possible training pairs (s, g) that were held out, on a 2-room LavaWorld of size 10×10, after 10⁴ updates.

A subset of 29 'training' demons was used to seed the matrix and train a UVFA. We then investigated generalization performance on 5 unseen goal locations g, corresponding to 5 of the 150 pellet locations, randomly selected after excluding the 29 training locations. Figure 10 visualises the value functions of the UVFA at these 5 pellet locations, compared to the value functions learned directly by a larger Horde. These results demonstrate that a UVFA learned from a small Horde can successfully approximate the knowledge of a much larger Horde.

Preliminary, non-quantitative experiments further indicate that representing multiple value functions in a single UVFA reduces noise from the individual value functions, and that the resulting policy quality is decent.

5.3. Direct Bootstrapping

Instead of taking a detour via an explicit Horde, it is also possible to train a UVFA end-to-end with direct bootstrapping updates, using a variant of Q-learning:

\[
Q(s_t, a_t, g) \leftarrow \alpha \left( r_g + \gamma_g \max_{a'} Q(s_{t+1}, a', g) \right) + (1 - \alpha) \, Q(s_t, a_t, g) \qquad (1)
\]

applied at a randomly selected transition (st, at, st+1) and goal g, where rg = Rg(st, at, st+1).

When bootstrapping with function approximation at the same time as generalizing over goals, the learning process is prone to become unstable. This can sometimes be addressed by much smaller learning rates (at the cost of slower convergence), or by using a more well-behaved h function (at the cost of generality). In particular, here we exploit our prior knowledge that our pseudo-reward functions are bounded between 0 and 1, and use a distance-based h(a, b) = γ^‖a−b‖₂, where ‖·‖₂ denotes the Euclidean distance between embedding vectors. Figure 11 shows generalization performance after learning when only a fraction of state-goal pairs is present (as in Section 4.1). We do not recover 100% policy quality with this setup, yet the UVFA generalises well, as long as it is trained on about a quarter


Figure 12. Policy quality (left) and prediction error (right) as a function of additional training on the test set of goals, for the 3-room LavaWorld environment of size 7x7 and n = 15. 'Baseline' denotes training on the test set from scratch (end-to-end), and 'transfer' is the same setup, but with the UVFA initialised by training on the training set. We find that the embeddings learned on the training set are helpful, and permit the system to attain good performance on the test set with just 1000 samples, rather than the 100000 samples needed when training from scratch.

of the possible (s, g) pairs.
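The update of Equation 1, combined with the distance-based h above, can be written compactly. The sketch below shows one possible mini-batch version in PyTorch; Equation 1 is the tabular form, and with function approximation the same update is typically implemented as a gradient step on the squared error against the bootstrapped target, which is what the sketch does. The network shapes (phi_net producing one embedding per action), the optimiser, and all hyperparameters are illustrative assumptions rather than the exact setup of the experiments.

```python
# Sketch of the direct bootstrapping update (Equation 1) with a distance-based h,
# h(a, b) = gamma ** ||a - b||_2, so that Q values stay in [0, 1].
import torch
import torch.nn.functional as F


def q_values(phi_net, psi_net, s, g, gamma=0.95):
    # Assumes phi_net maps states to one embedding per action: shape (batch, num_actions, n),
    # and psi_net maps goals to shape (batch, n).
    dist = torch.norm(phi_net(s) - psi_net(g).unsqueeze(1), dim=-1)
    return gamma ** dist                                   # shape (batch, num_actions)


def bootstrap_update(phi_net, psi_net, opt, batch, gamma_ext=0.95):
    s, a, s_next, g = batch                                # sampled transitions and goals
    r_g = (s_next == g).all(dim=-1).float()                # pseudo-reward: 1 on reaching g
    gamma_g = torch.where(r_g > 0, torch.zeros_like(r_g), torch.full_like(r_g, gamma_ext))
    with torch.no_grad():                                  # bootstrap target from the UVFA itself
        target = r_g + gamma_g * q_values(phi_net, psi_net, s_next, g).max(dim=-1).values
    q_sa = q_values(phi_net, psi_net, s, g).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)                        # soft move of Q(s, a, g) toward target
    opt.zero_grad()
    loss.backward()
    opt.step()
```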

6. Related Work

A number of previous methods have focused on generalising across tasks rather than goals (Caruana, 1997). The distinction is that a task may have different MDP dynamics, whereas a goal only changes the reward function but not the transition structure of an MDP. These methods have almost exclusively been explored in the context of policy search, and fall into two classes: methods that combine local policies into a single situation-sensitive controller, for example parameterized skills (Da Silva et al., 2012) or parameterized motor primitives (Kober et al., 2012); and methods that directly parameterize a policy by the task as well as the state, e.g. using model-based RL (Deisenroth et al., 2014). In contrast, our approach is model-free, value-based, and exploits the special shared structure inherent to states and goals, which may not be present in arbitrary tasks. Furthermore, like the Horde (Sutton et al., 2011), but unlike policy search methods, the UVFA does not need to see complete separate episodes for each task or goal, as all goals can in principle be learnt by off-policy bootstrapping from any behaviour policy (see Section 5.3). A different line of work has explored generalizing across value functions in a relational context (van Otterlo, 2009).

Perhaps the closest prior work to our own is the fragment-based approach of Foster and Dayan (Foster & Dayan, 2002). This work also explored shared structure in value functions across many different goals. Their mechanism used a specific mixture-of-Gaussians approach that can be viewed as a probabilistic UVFA where the mean is represented by a mixture of constants. The mixture components were learned in an unsupervised fashion, and those mixtures were then recombined to give specific values. However, the mixture-of-constants representation was found to

be limited, and was augmented in their experiments by a tabular representation of states and goals. In contrast, we consider general function approximators that can represent the joint structure in a much more flexible and powerful way, for example by exploiting the representational capabilities of deep neural networks.

7. Discussion

This paper has developed a universal approximator for goal-directed knowledge. We have demonstrated that our UVFA model is learnable either from supervised targets, or directly from real experience, and that it generalises effectively to unseen goals. We conclude by discussing several ways in which UVFAs may be used.

First, UVFAs can be used for transfer learning to new tasks with the same dynamics but different goals. Specifically, the values V(s, g; θ) in a UVFA can be used to initialise a new, single value function Vg(s) for a new task with unseen goal g. Figure 12 demonstrates that an agent which starts from transferred values in this fashion can learn to solve the new task g considerably faster than with random value initialization.

Second, generalized value functions can be used as features to represent state (Schaul & Ring, 2013); this is a form of predictive representation (Singh et al., 2003). A UVFA compresses a large or infinite number of predictions into a short feature vector. Specifically, the state embedding φ(s) can be used as a feature vector that represents state s. Furthermore, the goal embedding ψ(g) can be used as a separate feature vector that represents goal g. Figure 2 illustrated that these features can capture non-trivial structure in the domain.

Third, a UVFA could be used to generate temporally abstract options (Sutton et al., 1999). For any goal g, a corresponding option may be constructed that acts (soft-)greedily with respect to V(s, g; θ) and terminates e.g. upon reaching g. The UVFA then effectively provides a universal option that represents (approximately) optimal behaviour towards any goal g ∈ G. This in turn allows a hierarchical policy to choose any goal g ∈ G as a temporally abstract action (Kaelbling, 1993).

Finally, a UVFA could also be used as a universal option model (Yao et al., 2014). Specifically, if pseudo-rewards are defined by goal achievement (as in Section 4), then V(s, g; θ) approximates the (discounted) probability of reaching g from s, under a policy that tries to reach it.

Acknowledgments

We thank Theophane Weber, Hado van Hasselt, Thomas Degris, Arthur Guez and Adria Puigdomenech for their many helpful comments, and the anonymous reviewers for their constructive feedback.


References

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012.

Caruana, Rich. Multitask learning. Machine Learning, 28(1):41–75, 1997.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clement. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011a.

Collobert, Ronan, Weston, Jason, Bottou, Leon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011b.

Da Silva, Bruno, Konidaris, George, and Barto, Andrew. Learning parameterized skills. arXiv preprint arXiv:1206.6398, 2012.

Deisenroth, Marc Peter, Englert, Peter, Peters, Jan, and Fox, Dieter. Multi-task policy search for robotics. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 3876–3881. IEEE, 2014.

Foster, David and Dayan, Peter. Structure in the space of value functions. Machine Learning, 49(2-3):325–346, 2002.

Kaelbling, Leslie Pack. Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, volume 951, pp. 167–173, 1993.

Keshavan, Raghunandan H., Oh, Sewoong, and Montanari, Andrea. Matrix completion from a few entries. CoRR, abs/0901.3150, 2009.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Kober, Jens, Wilhelm, Andreas, Oztop, Erhan, and Peters, Jan. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, 33(4):361–379, 2012.

Konidaris, George, Scheidwasser, Ilya, and Barto, Andrew G. Transfer in reinforcement learning via shared features. The Journal of Machine Learning Research, 13(1):1333–1371, 2012.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Modayil, Joseph, White, Adam, and Sutton, Richard S. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 22(2):146–160, 2014.

Precup, Doina, Sutton, Richard S., and Dasgupta, Sanjoy. Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424, 2001.

Ring, Mark Bishop. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, 1994.

Schaul, Tom and Ring, Mark. Better generalization with forecasts. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1656–1662. AAAI Press, 2013.

Singh, Satinder, Littman, Michael L., Jong, Nicholas K., Pardoe, David, and Stone, Peter. Learning predictive state representations. In ICML, pp. 712–719, 2003.

Sutton, Richard S. and Barto, Andrew G. Introduction to Reinforcement Learning. MIT Press, 1998.

Sutton, Richard S. and Tanner, Brian. Temporal-difference networks. In Advances in Neural Information Processing Systems 17, pp. 1377–1384. MIT Press, 2005.

Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

Sutton, Richard S., Modayil, Joseph, Delp, Michael, Degris, Thomas, Pilarski, Patrick M., White, Adam, and Precup, Doina. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pp. 761–768, 2011.

Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

van Otterlo, Martijn. The Logic of Adaptive Behavior: Knowledge Representation and Algorithms for Adaptive Sequential Decision Making under Uncertainty in First-Order and Relational Domains, volume 192. IOS Press, 2009.

Yao, Hengshuai, Szepesvari, Csaba, Sutton, Richard S., Modayil, Joseph, and Bhatnagar, Shalabh. Universal option models. In Advances in Neural Information Processing Systems, pp. 990–998, 2014.