Decision making
Blaise Pascal (1623–1662)
Probability in games of chance
How much should I bet on ’20’?
E[gain] = Σₓ gain(x)·Pr(x)
Decisions under uncertainty
Maximize expected value (Pascal)
Bets should be assessed according to
Σₓ p(x)·gain(x)
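As a worked illustration of Pascal's rule, here is a minimal Python sketch evaluating the bet on '20'. It assumes a European roulette wheel (37 equally likely pockets, a 35-to-1 payout on a single number); the wheel and payout are my assumptions, not stated on the slides.

```python
# Expected gain of a $1 straight-up bet on '20', assuming European
# roulette: 37 equally likely pockets, 35-to-1 payout on a win.
# E[gain] = sum over outcomes x of gain(x) * Pr(x)

def expected_gain(outcomes):
    """outcomes: list of (gain, probability) pairs."""
    return sum(gain * p for gain, p in outcomes)

bet_on_20 = [
    (35.0, 1 / 37),   # '20' comes up: win 35 times the stake
    (-1.0, 36 / 37),  # any other pocket: lose the stake
]

print(expected_gain(bet_on_20))  # ≈ -0.027: the bet loses ~2.7 cents per dollar
```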
Decisions under uncertainty
The value of an alternative is a monotonic function of:
• Probability of reward
• Magnitude of reward
Do Classical Decision Variables Influence Brain Activity in LIP?
[Figure: location of area LIP (lateral intraparietal area)]
Varying Movement Value
Platt and Glimcher 1999
What Influences LIP?
Related to Movement Desirability
• Value/Utility of Reward
• Probability of Reward
Varying Movement Probability
Decisions under uncertainty
Neural activity in area LIP depends on:
• Probability of reward
• Magnitude of reward
Dorris and Glimcher 2004
Relative or absolute reward?
[Figure: choice targets worth $X, $Y, $Z compared with targets worth $A, $B, $C, $D, $E]
Consider a set of alternatives X and a binary relation ≽ on it, interpreted as "preferred at least as".
Consider the following three axioms:
C1. Completeness: for every x, y ∈ X, x ≽ y or y ≽ x
C2. Transitivity: for every x, y, z ∈ X, x ≽ y and y ≽ z imply x ≽ z
C3. Separability
Maximization of utility
Theorem: a binary relation ≽ can be represented by a real-valued function u (that is, x ≽ y if and only if u(x) ≥ u(y)) if and only if it satisfies C1–C3
Under these conditions, the function u is unique up to increasing transformation (Cantor 1915)
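For a finite set of alternatives the representation can be built directly: score each alternative by how many alternatives it is weakly preferred to. A minimal sketch; the function names and the snack example are mine, not from the slides.

```python
# For a finite set X, a complete and transitive relation "weakly preferred"
# can be represented by a utility function u: count, for each x, how many
# alternatives it is weakly preferred to. Then x ≽ y  iff  u(x) >= u(y).

def utility(X, weakly_prefers):
    """X: list of alternatives; weakly_prefers(x, y) -> bool encodes x ≽ y."""
    return {x: sum(weakly_prefers(x, y) for y in X) for x in X}

# Example: preferences over snacks induced by a hidden score.
score = {"apple": 2, "cake": 5, "kale": 1}
prefers = lambda x, y: score[x] >= score[y]

u = utility(["apple", "cake", "kale"], prefers)
print(u)  # {'apple': 2, 'cake': 3, 'kale': 1} -- any increasing transform works too
```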
A face utility function?
Is there an explicit representation of the 'value' of a choice in the brain?
Neurons in the orbitofrontal cortex encode value
Padoa-Schioppa and Assad, 2006
Examples of neurons encoding the chosen value
A neuron encoding the value of A
A neuron encoding the value of B
A neuron encoding the chosen juice taste
Encoding takes place at different times: post-offer (a, d, e, blue), pre-juice (b, cyan), post-juice (c, f, black)
How does the brain learn the values?
The computational problem
The goal is to maximize the expected sum of rewards:
V = E[Σₜ rₜ], summed until the end of the trial
The value of the state S1 depends on the policy: if the animal chooses 'right' at S1, then
V(S1) = R_ice cream + V(S2)
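To make the recursion concrete, here is a toy sketch in Python. The state names, rewards, and downstream values are invented for illustration; only the rule r + V(S′) comes from the slide.

```python
# V(S1) depends on the policy: choosing an action at S1 yields its
# immediate reward plus the value of the next state, r + V(S').
# The numbers below are illustrative, not from the slides.

R = {("S1", "left"): 0.0, ("S1", "right"): 1.0}    # immediate rewards
next_state = {("S1", "left"): "S2a", ("S1", "right"): "S2b"}
V = {"S2a": 0.5, "S2b": 0.2}                       # known downstream values

def action_value(s, a):
    return R[(s, a)] + V[next_state[(s, a)]]

# The optimal policy at S1 picks the action with the larger r + V(S'):
best = max(["left", "right"], key=lambda a: action_value("S1", a))
print(best, action_value("S1", best))  # right 1.2
```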
How to find the optimal policy in a complicated world?
• If the values of the different states are known, then this task is easy:
V(Sₜ) = rₜ + V(Sₜ₊₁)
where
V(Sₜ) = the value of the state at time t
rₜ = the (average) reward delivered at time t
V(Sₜ₊₁) = the value of the state at time t+1
How can the values of the different states be learned?
Update each visited state by
V(Sₜ) ← V(Sₜ) + α·δₜ
where
δₜ = rₜ + V(Sₜ₊₁) − V(Sₜ)
is the TD error and α is the learning rate.
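A one-step version of this update as a Python sketch; `alpha` stands for the learning rate α, a notation choice of mine.

```python
# One step of TD(0) learning: delta_t = r_t + V(S_{t+1}) - V(S_t),
# then V(S_t) <- V(S_t) + alpha * delta_t.

def td_update(V, s, r, s_next, alpha=0.1):
    delta = r + V[s_next] - V[s]   # the TD (reward prediction) error
    V[s] += alpha * delta
    return delta

V = {"S7": 0.0, "S8": 0.0, "S9": 0.0}
print(td_update(V, "S8", r=1.0, s_next="S9"))  # delta = 1.0; V["S8"] becomes 0.1
```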
The TD (temporal difference) learning algorithm
Schultz, Dayan and Montague, Science, 1997
[Schematic: each trial steps through states 1–9; the CS appears at the start of the trial and a reward of size 1 is delivered in state 8]
Before trial 1:
V(S₁) = V(S₂) = … = V(S₉) = 0
In trial 1:
• no reward in states 1–7:
δₜ = rₜ + V(Sₜ₊₁) − V(Sₜ) = 0
V(Sₜ) ← V(Sₜ) + α·δₜ = 0
• reward of size 1 in state 8:
δₜ = rₜ + V(S₉) − V(S₈) = 1
V(S₈) ← V(S₈) + α·δₜ = α
Before trial 2:
V(S₁) = V(S₂) = … = V(S₇) = V(S₉) = 0, V(S₈) = α
In trial 2, for states 1–6:
δₜ = rₜ + V(Sₜ₊₁) − V(Sₜ) = 0
V(Sₜ) ← V(Sₜ) + α·δₜ = 0
For state 7:
δₜ = rₜ + V(S₈) − V(S₇) = α
V(S₇) ← V(S₇) + α·δₜ = α²
For state 8:
δₜ = rₜ + V(S₉) − V(S₈) = 1 − α
V(S₈) ← V(S₈) + α·δₜ = α + α(1 − α) = 1 − (1 − α)²
Before trial 3:
V(S₁) = … = V(S₆) = V(S₉) = 0, V(S₇) = α², V(S₈) = 2α − α² = 1 − (1 − α)²
In trial 3, for states 1–5:
δₜ = rₜ + V(Sₜ₊₁) − V(Sₜ) = 0
V(Sₜ) ← V(Sₜ) + α·δₜ = 0
For state 6:
δₜ = rₜ + V(S₇) − V(S₆) = α²
V(S₆) ← V(S₆) + α·δₜ = α³
For state 7:
δₜ = rₜ + V(S₈) − V(S₇) = (2α − α²) − α² = 2α − 2α²
V(S₇) ← V(S₇) + α·δₜ = 3α² − 2α³
For state 8:
δₜ = rₜ + V(S₉) − V(S₈) = 1 − (2α − α²) = (1 − α)²
V(S₈) ← V(S₈) + α·δₜ = 1 − (1 − α)³
After many trials:
V(S₁) = … = V(S₈) = 1, V(S₉) = 0
so that in every state
δₜ = rₜ + V(Sₜ₊₁) − V(Sₜ) = 0,
except at the CS, whose time is unknown: because the trial onset cannot be predicted, a positive TD error survives at the CS.
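The trial-by-trial numbers above can be reproduced with a short simulation: states 1–9, a reward of size 1 in state 8, and sequential TD updates within each trial. A sketch, assuming α = 0.1.

```python
# TD(0) on the 9-state trial from the slides: states 1..9, reward of
# size 1 in state 8, V(S9) fixed at 0, learning rate alpha.

alpha = 0.1
V = [0.0] * 10          # V[1]..V[9]; index 0 unused

def run_trial():
    for s in range(1, 9):             # states 1..8; state 9 ends the trial
        r = 1.0 if s == 8 else 0.0    # reward of size 1 in state 8
        delta = r + V[s + 1] - V[s]   # TD error
        V[s] += alpha * delta

run_trial()
print(V[8], alpha)                    # after trial 1: V(S8) = alpha
run_trial()
print(V[7], alpha**2)                 # after trial 2: V(S7) = alpha^2
print(V[8], 1 - (1 - alpha)**2)       # and V(S8) = 1 - (1 - alpha)^2

for _ in range(10_000):
    run_trial()
print([round(v, 3) for v in V[1:]])   # V(S1..S8) -> 1, V(S9) stays 0
```

After convergence δₜ = 0 at every within-trial transition; the remaining prediction error sits at the unpredictable CS onset, which is what shifts the response from the reward to the CS (Schultz, Dayan and Montague, 1997).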
Schultz, 1998
"We found that these neurons encoded the difference between the current reward and a weighted average of previous rewards, a reward prediction error, but only for outcomes that were better than expected."
Bayer and Glimcher, 2005