
Neural Encoding Models
Maneesh Sahani ([email protected])
Gatsby Computational Neuroscience Unit
University College London
Term 1, Autumn 2011

Studying sensory systems
x(t) y(t)
Decoding: x̂(t) = G[y(t)] (reconstruction)
Encoding: ŷ(t) = F[x(t)] (systems identification)

General approach
Goal: Estimate p(spike | x, H) [or λ(t | x[0, t), H(t))] from data.
• Naive approach: measure p(spike, H | x) directly for every setting of x – too hard: too little data and too many potential inputs.
• Estimate some functional f(p) instead (e.g. mutual information).
• Select stimuli efficiently.
• Fit models with smaller numbers of parameters.

Spikes, or rate?
Most neurons communicate using action potentials, statistically described by a point process:

P(spike ∈ [t, t + dt)) = λ(t | H(t), stimulus, network activity) dt

To model the response fully we need to identify λ, which in general depends on the spike history H(t) and on network activity. Three options:

• Ignore the history dependence and take network activity as a source of "noise" (i.e. assume firing is an inhomogeneous Poisson or Cox process, conditioned on the stimulus).
• Average multiple trials to estimate the mean intensity (or PSTH)

  λ(t, stimulus) = lim_{N→∞} (1/N) Σ_n λ(t | H_n(t), stimulus, network_n),

  and try to fit this.
• Attempt to capture history and network effects in simple models.
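The trial-averaging option can be sketched in a few lines of numpy; the bin width and spike counts below are invented purely for illustration:

```python
import numpy as np

def psth(spike_counts, bin_s):
    """Trial-averaged firing rate (PSTH) from a trials-by-bins array of counts."""
    counts = np.asarray(spike_counts, dtype=float)
    return counts.mean(axis=0) / bin_s   # spikes per second in each bin

# two repeated trials, 5 ms bins (toy numbers)
counts = np.array([[0, 1, 2, 1, 0],
                   [0, 1, 2, 3, 0]])
rate = psth(counts, bin_s=0.005)         # [0, 200, 400, 400, 0] spikes/s
```

With many trials this converges to the mean intensity above, averaging over both the history term and network "noise".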

Spike-triggered average

Decoding: mean of P(x | y = 1)
Encoding: predictive filter

Linear regression
y(t) = ∫₀ᵀ x(t − τ) w(τ) dτ

In the frequency domain:

W(ω) = X(ω)* Y(ω) / |X(ω)|²

In discrete time, collect the lagged stimulus values into a design matrix X, with row t holding the history (x_t, x_{t−1}, …, x_{t−T+1}), and the responses into a vector Y:

X W = Y

W = (XᵀX)⁻¹ (XᵀY)

where XᵀX = Σ_SS (the stimulus autocovariance) and XᵀY is the STA.
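A minimal numpy sketch of this regression: build the lagged design matrix X and solve W = (XᵀX)⁻¹XᵀY by least squares. The filter, noise level and seed are arbitrary choices:

```python
import numpy as np

def lagged_design(x, n_lags):
    """Row t holds the stimulus history [x(t), x(t-1), ..., x(t-n_lags+1)]."""
    T = len(x)
    return np.stack([x[t - n_lags + 1:t + 1][::-1]
                     for t in range(n_lags - 1, T)])

def linear_filter(x, y, n_lags):
    X = lagged_design(x, n_lags)
    Y = np.asarray(y[n_lags - 1:], dtype=float)
    # lstsq solves the normal equations W = (X^T X)^{-1} X^T Y stably
    return np.linalg.lstsq(X, Y, rcond=None)[0]

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)                      # white-noise stimulus
w_true = np.array([1.0, 0.5, -0.25])               # hypothetical filter
y = np.convolve(x, w_true)[:len(x)] + 0.1 * rng.standard_normal(len(x))
w_hat = linear_filter(x, y, n_lags=3)              # recovers w_true
```

For a white stimulus XᵀX ∝ I, so this reduces to the (scaled) spike-triggered average itself; for correlated stimuli the (XᵀX)⁻¹ factor performs the whitening.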

Linear models
So the (whitened) spike-triggered average gives the minimum-squared-error linear model.
Issues:
• overfitting and regularisation
  – standard methods for regression apply
• negative predicted rates
  – can model deviations from background instead
• real neurons aren't linear
  – linear models are still used extensively
  – they give interpretable suggestions of underlying sensitivity
  – they may provide unbiased estimates of cascade filters (see later)

How good are linear predictions?
We would like an absolute measure of model performance.
Measured responses can never be predicted perfectly:
• The measurements themselves are noisy.
Models may fail to predict because:
• They are the wrong model.
• Their parameters are misestimated due to noise.

Estimating predictable power
Decompose each trial's response into signal and noise:

r⁽ⁿ⁾ = signal + noise

P(r⁽ⁿ⁾) = P_signal + P_noise            (power of a single trial)
P(⟨r⁽ⁿ⁾⟩) = P_signal + (1/N) P_noise    (power of the N-trial mean)

⇒ P̂_signal = (1/(N − 1)) ( N P(⟨r⁽ⁿ⁾⟩) − ⟨P(r⁽ⁿ⁾)⟩ )
  P̂_noise = ⟨P(r⁽ⁿ⁾)⟩ − P̂_signal
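These estimators translate directly into numpy. Here "power" is taken to be variance about the mean, and the signal and noise levels are invented so the correct answer is known in advance:

```python
import numpy as np

def signal_noise_power(R):
    """Signal/noise power estimates from an N-trials x T-bins response array R."""
    R = np.asarray(R, dtype=float)
    N = R.shape[0]
    p_mean = np.var(R.mean(axis=0))             # P(<r^(n)>): power of the mean response
    p_trial = np.mean([np.var(r) for r in R])   # <P(r^(n))>: mean single-trial power
    p_signal = (N * p_mean - p_trial) / (N - 1)
    p_noise = p_trial - p_signal
    return p_signal, p_noise

# toy check: signal variance ~0.5, independent noise variance 0.25
rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, 4 * np.pi, 200))
R = signal + 0.5 * rng.standard_normal((20, 200))
p_sig, p_noi = signal_noise_power(R)
```

The correction factor N/(N − 1) removes the residual noise power that survives in the N-trial mean.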

Signal power in A1 responses
[Figure: estimated signal power vs. noise power (spikes²/bin) across ~150 recordings]

Testing a model
For a perfect prediction,

⟨P(trial) − P(residual)⟩ = P(signal)

Thus, we can judge the performance of a model by the normalised predictive power

(P(trial) − P(residual)) / P̂(signal)

Similar to the coefficient of determination (r²), but the denominator is the predictable variance.
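A sketch of the normalised predictive power for a perfect model (prediction = true signal), where the value should come out near 1; all simulation settings are arbitrary:

```python
import numpy as np

def normalised_predictive_power(R, pred):
    """(P(trial) - P(residual)) / P_signal-hat, for trials x bins R and a prediction."""
    R = np.asarray(R, dtype=float)
    N = R.shape[0]
    p_mean = np.var(R.mean(axis=0))
    p_trial = np.mean([np.var(r) for r in R])
    p_signal = (N * p_mean - p_trial) / (N - 1)      # predictable power estimate
    p_resid = np.mean([np.var(r - pred) for r in R]) # power left after prediction
    return (p_trial - p_resid) / p_signal

rng = np.random.default_rng(2)
signal = np.cos(np.linspace(0, 6 * np.pi, 300))
R = signal + 0.4 * rng.standard_normal((15, 300))
npp = normalised_predictive_power(R, signal)         # perfect model -> near 1
```

An imperfect model leaves extra residual power, pushing the ratio below 1; severe overfitting on training data can push it above 1.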

Predictive performance
[Figure: normalised Bayes predictive power vs. normalised STA predictive power, for training error and cross-validation error]

Extrapolating the model performance

Jackknifed estimates
[Figure: normalised linearly predictive power vs. normalised noise power, jackknifed estimates]

Extrapolated linearity
[Figure: normalised linearly predictive power vs. normalised noise power, with extrapolation. Extrapolated range: (0.19, 0.39); mean jackknife estimate: 0.29]

Simulated (almost) linear data
[Figure: normalised linearly predictive power vs. normalised noise power for simulated, almost-linear data. Extrapolated range: (0.95, 0.97); mean jackknife estimate: 0.97]

Linear fits to nonlinear functions
(Stimulus dependence does not always signal response adaptation)

Approximations are stimulus dependent
(Stimulus dependence does not always signal response adaptation)

Consequences
Local fitting can have counterintuitive consequences for the interpretation of a "receptive field".

“Independently distributed” stimuli
Knowing the stimulus power at any set of points in the analysis space provides no information about the stimulus power at any other point.
[Figure: DRC and ripple stimuli shown in spectrotemporal space]
Independence is a property of stimulus and analysis space.
Christianson, Sahani, and Linden (2008)

Nonlinearity & non-independence distort RF estimates

A stimulus may have higher-order correlations in other analysis spaces; interaction with nonlinearities can produce misleading "receptive fields."
Christianson, Sahani, and Linden (2008)

What about natural sounds?
[Figure: multiplicative RF and finch song spectrograms, frequency (kHz) vs. time (ms)]
Usually not independent in any space — so STRFs may not be conservative estimates of receptive fields.
Christianson, Sahani, and Linden (2008)

Beyond linearity
Linear models often fail to predict well. Alternatives?
• Wiener/Volterra functional expansions
  – M-series
  – Linearised estimation
  – Kernel formulations
• LN (Wiener) cascades
  – Spike-triggered covariance (STC) methods
  – "Maximally informative" dimensions (MID) ⇔ ML non-parametric LNP models
  – ML parametric GLM models
• NL (Hammerstein) cascades
  – Multilinear formulations

Nonlinear models
The LNP (Wiener) cascade

[Diagram: stimulus → linear filter k → static nonlinearity → Poisson spiking]

Rectification addresses negative firing rates, and has a possible biophysical justification.
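A toy LNP simulation, assuming a half-wave-rectifying nonlinearity and an arbitrary exponential filter k. For this spherical (Gaussian white) stimulus, the STA recovers the filter direction:

```python
import numpy as np

rng = np.random.default_rng(3)
dt, T = 0.001, 20000                        # 1 ms bins, 20 s of stimulus
x = rng.standard_normal(T)                  # white-noise stimulus
k = np.exp(-np.arange(10) / 3.0)            # assumed linear filter (L)
drive = np.convolve(x, k)[:T]               # filter output: sum_i k_i x(t-i)
rate = 20.0 * np.maximum(drive, 0.0)        # rectifying nonlinearity (N), spikes/s
spikes = rng.poisson(rate * dt)             # Poisson spike counts (P)

# spike-triggered average: mean stimulus history preceding each spike
spike_times = np.nonzero(spikes)[0]
sta = np.mean([x[t - 9:t + 1][::-1] for t in spike_times if t >= 9], axis=0)
```

Because the stimulus is spherical, the STA is proportional to k despite the nonlinearity, illustrating the unbiasedness claim on the next slide.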

LNP estimation – the spike-triggered ensemble

Single linear filter
• STA
• Nonlinearity
• STA unbiased for spherical (elliptical) data (Bussgang)
• Non-spherical inputs: biases

Multiple filters
The distribution changes along relevant directions (and, usually, along all linear combinations of relevant directions).
Proxies for the distribution:

• mean: STA (can only reveal a single direction)
• variance: STC
• binned (or kernel) KL: MID "maximally informative directions" (equivalent to ML in an LNP model with binned nonlinearity)

STC
Project out the STA:

X̃ = X − (X k_sta) k_staᵀ;   C_prior = X̃ᵀX̃ / N;   C_spike = X̃ᵀ diag(Y) X̃ / N_spike

Choose directions with the greatest change in variance:

k = argmax_{‖v‖=1} vᵀ (C_prior − C_spike) v

⇒ find eigenvectors of (C_prior − C_spike) with large (absolute) eigenvalues.
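A numpy sketch of these STC equations. The nonlinearity chosen here is symmetric, so the STA is ≈ 0 and the projection step is skipped; the dimension, rate scale and hidden direction are all arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 50000, 8
X = rng.standard_normal((N, D))             # spherical (Gaussian) stimulus
v = np.zeros(D); v[2] = 1.0                 # hidden variance-changing direction
rate = 0.1 * (X @ v) ** 2                   # symmetric nonlinearity => STA ~ 0
Y = rng.poisson(rate)                       # spike counts per stimulus frame

C_prior = X.T @ X / N                       # raw stimulus covariance
C_spike = (X * Y[:, None]).T @ X / Y.sum()  # spike-weighted covariance
evals, evecs = np.linalg.eigh(C_prior - C_spike)
# variance increases along v, so that direction carries the most negative eigenvalue
k_stc = evecs[:, np.argmin(evals)]
```

Directions along which the nonlinearity leaves the variance unchanged give eigenvalues near zero, so they are correctly excluded.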

STC
Reconstruct the nonlinearity (separability may be assumed).

Biases
STC (obviously) requires that the nonlinearity alter variance.
If it does, the subspace estimate is unbiased if the stimulus distribution is

• radially (elliptically) symmetric
• AND independent

⇒ Gaussian.

It may be possible to correct for biases by transformation, subsampling or weighting (the latter two at a cost in variance).

More LNP methods
• Non-parametric nonlinearities: "maximally informative dimensions" (MID) ⇔ "non-parametric" maximum likelihood.
  – Intuitively, this extends the variance-difference idea to arbitrary differences between the marginal and spike-conditioned stimulus distributions:

    k_MID = argmax_k KL[ P(k · x) ‖ P(k · x | spike) ]

  – Measuring the KL divergence requires binning or smoothing; this turns out to be equivalent to fitting a non-parametric nonlinearity by binning or smoothing.
  – Difficult to use for high-dimensional LNP models.
• Parametric nonlinearities: the “generalised linear model” (GLM).

Generalised linear models
LN models with specified nonlinearities and exponential family noise.
In general (for monotonic g):
y ∼ ExpFamily[µ(x)]; g(µ) = βx
For our purposes it is easier to write
y ∼ ExpFamily[f (βx)]
The (continuous-time) point-process likelihood with GLM-like dependence of λ on the covariates is approached, in the limit of bin width → 0, by either a Poisson or a Bernoulli GLM.

Mark Berman and T. Rolf Turner (1992). Approximating point process likelihoods with GLIM. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(1):31–38.

Generalised linear models
For the Poisson distribution, f = exp(·) is canonical (natural parameter = βx). Canonical link functions give concave likelihoods ⇒ unique maxima.

This generalises (for Poisson) to any f which is convex and log-concave:

log-likelihood = c − f(βx) + y log f(βx)

Includes:

• threshold-linear
• threshold-polynomial
• "soft-threshold": f(z) = α⁻¹ log(1 + e^{αz})
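A sketch of maximum-likelihood fitting for the canonical case f = exp, by plain gradient ascent on the concave log-likelihood. The true weights, bias, step size and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
T, D = 5000, 3
X = rng.standard_normal((T, D))                   # covariates per time bin
beta_true = np.array([0.8, -0.5, 0.3])            # hypothetical filter
y = rng.poisson(np.exp(X @ beta_true - 1.0))      # Poisson counts, canonical link
X1 = np.column_stack([X, np.ones(T)])             # append a bias column

beta = np.zeros(D + 1)
for _ in range(2000):
    mu = np.exp(X1 @ beta)                        # predicted rates
    beta += 0.1 * X1.T @ (y - mu) / T             # gradient of the log-likelihood
```

Because the likelihood is concave in β, this simple ascent finds the unique maximum; IRLS reaches the same point in far fewer iterations.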

Generalised linear models
ML parameters found by
• gradient ascent
• IRLS
Regularisation by L2 (quadratic) or L1 (absolute value – sparse) penalties (MAP with Gaussian/Laplacian priors) preserves concavity.

LinearNonlinearPoisson (GLM)
[Diagram: stimulus → stimulus filter k(t) → point nonlinearity → Poisson spiking]

GLM with historydependence
• rate is a product of stimulus- and spike-history-dependent terms
• output is no longer a Poisson process
• also known as the "soft-threshold" integrate-and-fire model
[Diagram: stimulus → stimulus filter k, with post-spike filter h(t) fed back; the exponential nonlinearity gives the conditional intensity (spike rate), driving Poisson spiking (Truccolo et al. 2004)]

GLM with history-dependence

• "soft-threshold" approximation to the integrate-and-fire model

[Diagram: spike rate vs. filter output for the traditional IF neuron ("hard threshold") and the "soft-threshold" IF; model structure as before: stimulus filter k, post-spike filter h(t), exponential nonlinearity, Poisson spiking]

GLM dynamic behaviors
[Figure: stimulus x(t), post-spike waveform, and stimulus-induced vs. spike-history-induced rate components: regular spiking]

GLM dynamic behaviors
[Figure: stimulus x(t), post-spike waveform, and stimulus- vs. spike-history filter outputs: regular and irregular spiking]

GLM dynamic behaviors
[Figure: stimulus x(t) and post-spike waveforms: bursting and adaptation]

Generalized Linear Model (GLM)
[Diagram: stimulus → stimulus filter + post-spike filter → exponential nonlinearity → probabilistic spiking]

multineuron GLM
[Diagram: two neurons, each with a stimulus filter and post-spike filter → exponential nonlinearity → probabilistic spiking]

multineuron GLM
[Diagram: two neurons, each with a stimulus filter and post-spike filter, now with coupling filters between neuron 1 and neuron 2 → exponential nonlinearity → probabilistic spiking]

GLM equivalent diagram:

[Diagram: conditional intensity (spike rate) as a function of the covariates at time t]

Multilinear models
Input nonlinearities (Hammerstein cascades) can be identified in a multilinear (Cartesian tensor) framework.

Input nonlinearities
The basic linear model (for sounds):

r̂(i) = Σ_{jk} w^{tf}_{jk} s(i − j, k),

with STRF weights w^{tf}_{jk} and stimulus power s(i − j, k).

How to measure s? (pressure, intensity, dB, thresholded, …)

We can learn an optimal representation g(·):

r̂(i) = Σ_{jk} w^{tf}_{jk} g(s(i − j, k)).

Define basis functions {g_l} such that g(s) = Σ_l w^l_l g_l(s), and a stimulus array M_{ijkl} = g_l(s(i − j, k)). Now the model is

r̂(i) = Σ_{jkl} w^{tf}_{jk} w^l_l M_{ijkl}   or   r̂ = (w^{tf} ⊗ w^l) • M.

Multilinear models
Multilinear forms are straightforward to optimise by alternating least squares.

Cost function:

E = ‖ r − (w^{tf} ⊗ w^l) • M ‖²

Minimise iteratively, defining the matrices

B = w^l • M   and   A = w^{tf} • M

and updating

w^{tf} = (BᵀB)⁻¹ Bᵀ r   and   w^l = (AᵀA)⁻¹ Aᵀ r.

Each linear regression step can be regularised by evidence optimisation (suboptimal), with uncertainty propagated approximately using variational methods.
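An alternating-least-squares sketch for a generic bilinear model r̂ = (u ⊗ v) • M, with u and v standing in for w^{tf} and w^l; the sizes, seed and noise level are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
I, A, B = 2000, 6, 4                         # samples and the two factor sizes
M = rng.standard_normal((I, A, B))           # stimulus tensor (stand-in for M_ijkl)
u_true = rng.standard_normal(A)              # plays the role of w^tf
v_true = rng.standard_normal(B)              # plays the role of w^l
r = np.einsum('iab,a,b->i', M, u_true, v_true) + 0.01 * rng.standard_normal(I)

u, v = rng.standard_normal(A), np.ones(B)
for _ in range(50):                          # alternating least squares
    Bmat = np.einsum('iab,b->ia', M, v)      # B = w^l • M
    u = np.linalg.lstsq(Bmat, r, rcond=None)[0]
    Amat = np.einsum('iab,a->ib', M, u)      # A = w^tf • M
    v = np.linalg.lstsq(Amat, r, rcond=None)[0]

pred = np.einsum('iab,a,b->i', M, u, v)
```

Note the overall scale is shared between u and v (only their outer product is identifiable), so convergence is judged by the fit of the prediction rather than by the individual factors.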

Some input nonlinearities
[Figure: learned input nonlinearity w^l vs. sound level l (dB SPL), 25–70 dB]

Parameter grouping
Separable STRF: (time) ⊗ (frequency). The input-nonlinearity model is separable in another sense: (time, frequency) ⊗ (sound level).

[Figure: weights over (time, frequency) and weights over intensity]

Other separations:

• (time, sound level) ⊗ (frequency): r̂ = (w^{tl} ⊗ w^f) • M
• (frequency, sound level) ⊗ (time): r̂ = (w^{fl} ⊗ w^t) • M
• (time) ⊗ (frequency) ⊗ (sound level): r̂ = (w^t ⊗ w^f ⊗ w^l) • M

(time, frequency) ⊗ (sound level)
[Figure: (time, frequency) weights, t (ms) vs. f (kHz), alongside sound-level weights w^l vs. l (dB SPL)]

(time, sound level) ⊗ (frequency)
[Figure: (time, sound level) weights, t (ms) vs. l (dB SPL), alongside frequency weights w^f vs. f (kHz)]

(frequency, sound level) ⊗ (time)
[Figure: (frequency, sound level) weights, f (kHz) vs. l (dB SPL), alongside time weights w^t vs. t (ms)]

Contextual influences
[Diagram: stimulus passed through a context field w^{τφ} and a principal field w^{tf} (over time and frequency) to predict the PSTH]

rate(i) = c + Σ_{jkl} w^t_j w^f_k w^l_l g_l(sound_{i−j, k}) · ( c₂ + Σ_{mnp} w^τ_m w^φ_n w^λ_p g_p(sound_{i−j−m, k+n}) )

Contextual influences
Introduce extra dimensions:

• τ: time difference between contextual and primary tone
• φ: frequency difference between contextual and primary tone
• λ: sound level of the contextual tone

Model the effective sound level of a primary tone by

Level(i, j) → Level(i, j) · (const + Context(i, j)),

and the context by

Context(i, j) = Σ_{m,n,p} w^τ_m w^φ_n w^λ_p h_p(s(i − m, j + n))

This leads to a multilinear model

r̂ = (w^t ⊗ w^f ⊗ w^l ⊗ w^τ ⊗ w^φ ⊗ w^λ) • M.

Inseparable contexts
We can also allow inseparable contexts (and principal fields), dropping the level nonlinearity to reduce the number of parameters:

r(i) = c + Σ_{jk} w^{tf}_{jk} sound_{i−j, k} · ( 1 + Σ_{mn} w^{τφ}_{mn} sound_{i−j−m, k+n} )

[Diagram: stimulus passed through context field w^{τφ} and principal field w^{tf} (over time and frequency) to predict the PSTH]

Performance
[Figure: extrapolated normalised predictive power for STRF, separable-context and inseparable-context models, on training and test sets]