
  • Neural Encoding Models

    Maneesh [email protected]

    Gatsby Computational Neuroscience Unit, University College London

    Term 1, Autumn 2011

  • Studying sensory systems

    [Diagram: stimulus x(t) → sensory system → response y(t)]

    Decoding: x̂(t) = G[y(t)] (reconstruction)
    Encoding: ŷ(t) = F[x(t)] (systems identification)

  • General approach

    Goal: Estimate p(spike|x,H) [or λ(t|x[0, t), H(t))] from data.

    • Naive approach: measure p(spike, H | x) directly for every setting of x.
      – too hard: too little data and too many potential inputs.

    • Estimate some functional f (p) instead (e.g. mutual information)

    • Select stimuli efficiently

    • Fit models with smaller numbers of parameters

  • Spikes, or rate?

    Most neurons communicate using action potentials, statistically described by a point process:

    P(spike ∈ [t, t + dt)) = λ(t | H(t), stimulus, network activity) dt

    To fully model the response we need to identify λ. In general this depends on spike history H(t) and network activity. Three options:

    • Ignore the history dependence, take network activity as source of “noise” (i.e. assume firing is inhomogeneous Poisson or Cox process, conditioned on the stimulus).

    • Average multiple trials to estimate the mean intensity (or PSTH),

        λ(t, stimulus) = lim_{N→∞} (1/N) Σ_n λ(t | H_n(t), stimulus, network_n),

      and try to fit this (see the sketch after this list).

    • Attempt to capture history and network effects in simple models.
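As a concrete illustration of the trial-averaging option above, here is a minimal sketch (Python; the bin size and variable names are illustrative assumptions, not from the lecture) of estimating the PSTH from repeated trials:

```python
import numpy as np

def psth(spike_times_per_trial, t_max, bin_size=0.01):
    """Estimate the mean intensity (PSTH) by averaging binned spike counts over trials.

    spike_times_per_trial : list of arrays of spike times (s), one array per trial
    t_max                 : trial duration (s)
    bin_size              : PSTH bin width (s)
    """
    edges = np.arange(0.0, t_max + bin_size, bin_size)
    counts = np.array([np.histogram(times, bins=edges)[0]
                       for times in spike_times_per_trial], dtype=float)
    rate = counts.mean(axis=0) / bin_size        # average count per bin -> spikes/s
    centres = edges[:-1] + bin_size / 2
    return centres, rate
```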

  • Spike-triggered average

    Decoding: mean of P(x | y = 1)
    Encoding: predictive filter

  • Linear regression

    y(t) = ∫₀ᵀ x(t − τ) w(τ) dτ

    In the frequency domain: W(ω) = X(ω)* Y(ω) / |X(ω)|²

    In discrete time, stack lagged stimulus vectors into a design matrix:

        [ x₁  x₂  x₃  …  x_T     ]     [ w_T ]     [ y_T     ]
        [ x₂  x₃  x₄  …  x_{T+1} ]  ×  [  ⋮  ]  =  [ y_{T+1} ]
        [  ⋮                     ]     [ w₁  ]     [   ⋮     ]

    i.e.  X W = Y,  so

    W = (XᵀX)⁻¹ (XᵀY)    with  XᵀX = Σ_SS (stimulus autocovariance)  and  XᵀY = STA
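A minimal sketch of this estimator for a one-dimensional stimulus (variable names are illustrative): `lagged_design` builds the matrix of lagged stimulus vectors and `whitened_sta` solves the normal equations W = (XᵀX)⁻¹(XᵀY).

```python
import numpy as np

def lagged_design(stim, n_lags):
    """Design matrix whose row t holds stim[t], stim[t-1], ..., stim[t-n_lags+1]."""
    T = len(stim)
    X = np.zeros((T, n_lags))
    for lag in range(n_lags):
        X[lag:, lag] = stim[:T - lag]            # column `lag` = stimulus delayed by `lag` bins
    return X

def whitened_sta(stim, spikes, n_lags):
    """Least-squares filter: the stimulus autocovariance (Sigma_SS) 'whitens' the STA."""
    X = lagged_design(stim, n_lags)
    sta = X.T @ spikes                           # spike-triggered (lagged) sum, X^T Y
    cov = X.T @ X                                # stimulus autocovariance, X^T X
    return np.linalg.solve(cov, sta)             # W = (X^T X)^{-1} X^T Y
```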

  • Linear models

    So the (whitened) spike-triggered average gives the minimum-squared-error linear model.

    Issues:

    • overfitting and regularisation
      – standard methods for regression

    • negative predicted rates
      – can model deviations from background

    • real neurons aren’t linear
      – models are still used extensively
      – interpretable suggestions of underlying sensitivity
      – may provide unbiased estimates of cascade filters (see later)

  • How good are linear predictions?

    We would like an absolute measure of model performance.

    Measured responses can never be predicted perfectly:

    • The measurements themselves are noisy.

    Models may fail to predict because:

    • They are the wrong model.
    • Their parameters are mis-estimated due to noise.

  • Estimating predictable power

    response r⁽ⁿ⁾ = signal + noise

    ⟨P(r⁽ⁿ⁾)⟩ = P_signal + P_noise              (mean power of single trials)
    P(r̄)      = P_signal + (1/N) P_noise        (power of the N-trial average response r̄)

    ⇒  P̂_signal = ( N P(r̄) − ⟨P(r⁽ⁿ⁾)⟩ ) / (N − 1)
       P̂_noise  = ⟨P(r⁽ⁿ⁾)⟩ − P̂_signal
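A minimal sketch of these estimators, assuming the responses from N repeated trials are stored in a trials × bins array and taking “power” to be the variance of a response across time bins (one common convention; the slide does not fix the definition):

```python
import numpy as np

def signal_noise_power(R):
    """R : (n_trials, n_bins) array of binned responses to repeats of one stimulus."""
    N = R.shape[0]
    power_trials = R.var(axis=1).mean()          # <P(r^(n))>: mean power of single trials
    power_mean = R.mean(axis=0).var()            # P(r_bar): power of the trial-averaged response
    p_signal = (N * power_mean - power_trials) / (N - 1)
    p_noise = power_trials - p_signal
    return p_signal, p_noise
```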

  • Signal power in A1 responses

    [Figure: estimated signal power vs. noise power (spikes²/bin) across A1 recordings, with a histogram over the number of recordings.]

  • Testing a model

    For a perfect prediction,

        ⟨P(trial) − P(residual)⟩ = P(signal).

    Thus, we can judge the performance of a model by the normalized predictive power

        ( P(trial) − P(residual) ) / P̂(signal)

    Similar to the coefficient of determination (r²), but the denominator is the predictable variance.
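Using the signal-power estimate from the sketch above (same variance convention, illustrative names), the normalised predictive power of a model prediction could be computed as:

```python
import numpy as np

def normalised_predictive_power(R, prediction):
    """R: (n_trials, n_bins) responses; prediction: (n_bins,) model prediction."""
    p_signal, _ = signal_noise_power(R)          # from the earlier sketch
    p_trial = R.var(axis=1).mean()               # mean power of single trials, P(trial)
    residuals = R - prediction[None, :]          # per-trial residuals
    p_residual = residuals.var(axis=1).mean()    # P(residual)
    return (p_trial - p_residual) / p_signal
```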

  • Predictive performance

    [Figure: normalised Bayes predictive power vs. normalised STA predictive power; left panel: training error, right panel: cross-validation error.]

  • Extrapolating the model performance

  • Jackknifed estimates

    [Figure: jackknifed estimates of normalized linearly predictive power vs. normalized noise power.]

  • Extrapolated linearity

    [Figure: normalized linearly predictive power vs. normalized noise power, with extrapolation to zero noise.]

    [extrapolated range: (0.19,0.39); mean Jackknife estimate: 0.29]

  • Simulated (almost) linear data

    [Figure: simulated (almost) linear data; normalized linearly predictive power vs. normalized noise power, with extrapolation.]

    [extrapolated range: (0.95,0.97); mean Jackknife estimate: 0.97]

  • Linear fits to non-linear functions

  • Linear fits to non-linear functions

    (Stimulus dependence does not always signal response adaptation)

  • Approximations are stimulus dependent

    (Stimulus dependence does not always signal response adaptation)

  • Consequences

    Local fitting can have counterintuitive consequences on the interpretation of a “receptive field”.

  • “Independently distributed” stimuli

    Knowing stimulus power at any set of points in the analysis space provides no information about stimulus power at any other point.

    [Figure: examples in spectrotemporal space: DRC and ripple stimuli.]

    Independence is a property of stimulus and analysis space.

    Christianson, Sahani, and Linden (2008)

  • Nonlinearity & non-independence distort RF estimates

    Stimulus may have higher-order correlations in other analysis spaces; interaction with nonlinearities can produce misleading “receptive fields.”

    Christianson, Sahani, and Linden (2008)

  • What about natural sounds?

    [Figure: multiplicative RF estimates and finch-song spectrograms; frequency (kHz) vs. time (ms).]

    Usually not independent in any space, so STRFs may not be conservative estimates of receptive fields.

    Christianson, Sahani, and Linden (2008)

  • Beyond linearity

  • Beyond linearity

    Linear models often fail to predict well. Alternatives?

    • Wiener/Volterra functional expansions
      – M-series
      – Linearised estimation
      – Kernel formulations

    • LN (Wiener) cascades
      – Spike-triggered covariance (STC) methods
      – “Maximally informative” dimensions (MID) ⇔ ML nonparametric LNP models
      – ML parametric GLM models

    • NL (Hammerstein) cascades
      – Multilinear formulations

  • Non-linear models

    The LNP (Wiener) cascade

    [Diagram: stimulus → linear filter k → static nonlinearity n → Poisson spiking]

    Rectification addresses negative firing rates. Possible biophysical justification.
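A minimal sketch of simulating such an LNP cascade (the filter shape, nonlinearity and rates are illustrative assumptions); it reuses `lagged_design` from the linear-regression sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_lags, dt = 20000, 20, 0.001            # 1 ms bins

k = np.exp(-np.arange(n_lags) / 5.0)             # assumed exponential-decay filter
k /= np.linalg.norm(k)
stim = rng.standard_normal(n_bins)               # white Gaussian stimulus

drive = lagged_design(stim, n_lags) @ k          # linear stage
rate = np.maximum(drive, 0.0) * 50.0             # threshold-linear nonlinearity (spikes/s)
spikes = rng.poisson(rate * dt)                  # Poisson spike counts per bin
```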

  • LNP estimation – the Spike-triggered ensemble

  • Single linear filter

    STA.

    Non-linearity.

    STA unbiased for spherical (elliptical) data.

    Bussgang.

    Non-spherical inputs.

    Biases.

  • Multiple filters

    Distribution changes along relevant directions (and, usually, along all linear combinations of relevant directions).

    Proxies for distribution:

    • mean: STA (can only reveal a single direction)
    • variance: STC
    • binned (or kernel) KL: MID “maximally informative directions” (equivalent to ML in LNP model with binned nonlinearity)

  • STC

    Project out the STA:

        X̃ = X − (X k_sta) k_staᵀ;   C_prior = X̃ᵀX̃ / N;   C_spike = X̃ᵀ diag(Y) X̃ / N_spike

    Choose directions with greatest change in variance:

        k = argmax_{‖v‖=1}  vᵀ (C_prior − C_spike) v

    ⇒ find eigenvectors of (C_prior − C_spike) with large (absolute) eigenvalues.
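A minimal sketch of this computation (illustrative names; X holds one stimulus vector per time bin, y the corresponding spike counts, k_sta the STA):

```python
import numpy as np

def stc_directions(X, y, k_sta, n_dirs=2):
    """Directions along which the spike-conditioned variance changes most."""
    k_sta = k_sta / np.linalg.norm(k_sta)
    X_t = X - np.outer(X @ k_sta, k_sta)         # project the STA out of every stimulus vector
    C_prior = X_t.T @ X_t / X_t.shape[0]
    C_spike = X_t.T @ (X_t * y[:, None]) / y.sum()
    evals, evecs = np.linalg.eigh(C_prior - C_spike)   # symmetric matrix, so eigh
    order = np.argsort(-np.abs(evals))                 # largest |eigenvalue| first
    return evecs[:, order[:n_dirs]], evals[order[:n_dirs]]
```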

  • STC

    Reconstruct nonlinearity (may assume separability)
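One simple reconstruction is a binned (histogram-ratio) estimate of the mean spike count as a function of the filter output; a minimal sketch, with the quantile binning an illustrative choice:

```python
import numpy as np

def binned_nonlinearity(X, y, k, n_bins=25):
    """Mean spike count as a function of the projection of the stimulus onto filter k."""
    proj = X @ k                                      # filter output for every time bin
    edges = np.quantile(proj, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(proj, edges) - 1, 0, n_bins - 1)
    rate = np.array([y[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(n_bins)])
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, rate
```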

  • Biases

    STC (obviously) requires that the nonlinearity alter variance.

    If so, the recovered subspace is unbiased if the stimulus distribution is

    • radially (elliptically) symmetric
    • AND independent

    ⇒ Gaussian.

    May be possible to correct by transformation, subsampling or weighting (the latter two at a cost in variance).

  • More LNP methods

    • Non-parametric non-linearities: “Maximally informative dimensions” (MID) ⇔ “non-parametric” maximum likelihood.
      – Intuitively, extends the variance-difference idea to arbitrary differences between the marginal and spike-conditioned stimulus distributions:

            k_MID = argmax_k  KL[ P(k · x) ‖ P(k · x | spike) ]

      – Measuring KL requires binning or smoothing; this turns out to be equivalent to fitting a non-parametric nonlinearity by binning or smoothing.
      – Difficult to use for high-dimensional LNP models.

    • Parametric non-linearities: the “generalised linear model” (GLM).

  • Generalised linear models

    LN models with specified nonlinearities and exponential family noise.

    In general (for monotonic g):

    y ∼ ExpFamily[µ(x)]; g(µ) = βx

    For our purposes easier to write

    y ∼ ExpFamily[f (βx)]

    The (continuous-time) point process likelihood with GLM-like dependence of λ on covariates is approached, in the limit of bin width → 0, by either the Poisson or the Bernoulli GLM.

    Mark Berman and T. Rolf Turner (1992). Approximating point process likelihoods with GLIM. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(1):31–38.

  • Generalised linear models

    Poisson distribution ⇒ f = exp(·) is canonical (natural params = βx). Canonical link functions give concave likelihoods ⇒ unique maxima.

    Generalises (for Poisson) to any f which is convex and log-concave:

    log-likelihood = c − f(βx) + y log f(βx)

    Includes:

    • threshold-linear
    • threshold-polynomial
    • “soft-threshold”: f(z) = α⁻¹ log(1 + e^{αz})

  • Generalised linear models

    ML parameters found by

    • gradient ascent
    • IRLS

    Regularisation by L2 (quadratic) or L1 (absolute value, giving sparseness) penalties (MAP with Gaussian/Laplacian priors) preserves concavity.
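A minimal sketch of maximum-likelihood fitting for a Poisson GLM with the canonical exponential nonlinearity, using plain gradient ascent and an optional L2 penalty (step size and iteration count are illustrative, not tuned; IRLS or a generic optimiser would normally be used):

```python
import numpy as np

def fit_poisson_glm(X, y, lam=0.0, lr=0.1, n_iter=5000):
    """X: (n_bins, n_params) covariates; y: spike counts per bin; lam: L2 weight."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        rate = np.exp(X @ beta)                       # canonical (exponential) inverse link
        grad = (X.T @ (y - rate) - lam * beta) / len(y)   # gradient of penalised log-likelihood
        beta += lr * grad                             # ascend: this likelihood is concave
    return beta
```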

  • Linear-Nonlinear-Poisson (GLM)

    [Diagram: stimulus → stimulus filter k(t) → point nonlinearity → Poisson spiking]

  • GLM with history-dependence

    • rate is a product of stim- and spike-history dependent terms

    • output no longer a Poisson process

    • also known as “soft-threshold” Integrate-and-Fire model

    [Diagram: stimulus → stimulus filter k; spike history → post-spike filter h(t); summed and passed through an exponential nonlinearity to give the conditional intensity (spike rate), which drives Poisson spiking. (Truccolo et al. 2004)]
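A minimal sketch of simulating this history-dependent GLM in discrete time (all filters, rates and durations are illustrative assumptions); the post-spike filter here simply suppresses firing briefly after each spike:

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins, dt = 10000, 0.001
n_k, n_h = 20, 30

k = np.exp(-np.arange(n_k) / 5.0)                 # assumed stimulus filter
k /= np.linalg.norm(k)
h = -5.0 * np.exp(-np.arange(1, n_h + 1) / 5.0)   # post-spike filter: transient suppression
b = np.log(20.0)                                  # baseline log-rate (20 spikes/s)

stim = rng.standard_normal(n_bins)
stim_drive = np.convolve(stim, k)[:n_bins]        # causally filtered stimulus

spikes = np.zeros(n_bins)
for t in range(n_bins):
    recent = spikes[max(0, t - n_h):t][::-1]      # most recent spike bins first
    rate = np.exp(b + stim_drive[t] + recent @ h[:len(recent)])   # conditional intensity
    spikes[t] = rng.poisson(rate * dt)            # spike count in this bin
```

Different post-spike filter shapes (refractory dips, rebound bumps, slow negative tails) produce the regular-spiking, bursting and adapting behaviours illustrated on the following slides.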

  • GLM with history-dependence

    [Figure: spike rate vs. filter output; the traditional IF model has a “hard threshold”, the “soft-threshold” IF rises smoothly.]

    [Diagram: stimulus → stimulus filter k; post-spike filter h(t); exponential nonlinearity; Poisson spiking.]

    • “soft-threshold” approximation to Integrate-and-Fire model

  • GLM dynamic behaviors

    [Figure: stimulus x(t), post-spike waveform, and stim-induced vs. spike-history-induced rate contributions; regular spiking.]

  • GLM dynamic behaviors

    [Figure: stimulus x(t), post-spike waveform, stim-filter and spike-history filter outputs; regular and irregular spiking.]

  • GLM dynamic behaviors

    [Figure: stimulus x(t) and post-spike waveforms producing bursting and adaptation.]

  • Generalized Linear Model (GLM)

    [Diagram: stimulus → stimulus filter, plus post-spike filter → exponential nonlinearity → probabilistic spiking.]

  • multi-neuron GLM

    [Diagram: two neurons, each with a stimulus filter and post-spike filter feeding an exponential nonlinearity and probabilistic spiking.]

  • multi-neuron GLM

    [Diagram: as above, with coupling filters between neuron 1 and neuron 2.]

  • GLM equivalent diagram

    [Diagram: the conditional intensity (spike rate) at time t, built from stimulus and spike-history terms.]

  • Multilinear models

    Input nonlinearities (Hammerstein cascades) can be identified in a multilinear (cartesian tensor) framework.

  • Input nonlinearities

    The basic linear model (for sounds):

        r̂(i) = Σ_{jk} w^tf_{jk} s(i − j, k),

    where r̂(i) is the predicted rate, w^tf_{jk} the STRF weights, and s(i − j, k) the stimulus power.

    How to measure s? (pressure, intensity, dB, thresholded, ...)

    We can learn an optimal representation g(·):

        r̂(i) = Σ_{jk} w^tf_{jk} g(s(i − j, k)).

    Define basis functions {g_l} such that g(s) = Σ_l w^l_l g_l(s), and a stimulus array M_{ijkl} = g_l(s(i − j, k)). Now the model is

        r̂(i) = Σ_{jkl} w^tf_{jk} w^l_l M_{ijkl}    or    r̂ = (w^tf ⊗ w^l) • M.
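A minimal sketch of constructing the stimulus array M_ijkl = g_l(s(i − j, k)), assuming piecewise-linear “tent” basis functions on a grid of sound levels (the basis choice is an assumption, not fixed by the slide):

```python
import numpy as np

def tent_basis(levels):
    """Piecewise-linear 'tent' functions centred on a grid of sound levels."""
    width = np.diff(levels).mean()
    def g(l, s):
        return np.clip(1.0 - np.abs(s - levels[l]) / width, 0.0, None)
    return g, len(levels)

def build_M(S, n_lags, levels):
    """S: (T, K) spectrogram (time x frequency); returns M of shape (T, n_lags*K, L)."""
    g, L = tent_basis(np.asarray(levels, float))
    T, K = S.shape
    M = np.zeros((T, n_lags * K, L))
    for j in range(n_lags):                       # time lag j
        lagged = np.vstack([np.zeros((j, K)), S[:T - j]])   # s(i - j, k), zero-padded
        for l in range(L):
            M[:, j * K:(j + 1) * K, l] = g(l, lagged)
    return M
```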

  • Multilinear models

    Multilinear forms are straightforward to optimise by alternating least squares.

    Cost function:

        E = ‖ r − (w^tf ⊗ w^l) • M ‖²

    Minimise iteratively, defining matrices

        B = w^l • M    and    A = w^tf • M

    and updating

        w^tf = (BᵀB)⁻¹ Bᵀr    and    w^l = (AᵀA)⁻¹ Aᵀr.

    Each linear regression step can be regularised by evidence optimisation (suboptimal), with uncertainty propagated approximately using variational methods.
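A minimal sketch of these alternating updates, with M stored as a (time × time-frequency × level) array (e.g. from the `build_M` sketch above) and plain least squares in place of the evidence-optimised regressions:

```python
import numpy as np

def fit_multilinear(M, r, n_iter=50):
    """Alternating least squares for r_hat = (w_tf ⊗ w_l) • M."""
    T, n_tf, n_level = M.shape
    w_l = np.ones(n_level)                        # initialise the level weights
    for _ in range(n_iter):
        B = M @ w_l                               # B_ia = sum_l M_ial w_l   -> (T, n_tf)
        w_tf = np.linalg.lstsq(B, r, rcond=None)[0]
        A = np.einsum('ial,a->il', M, w_tf)       # A_il = sum_a M_ial w_tf  -> (T, n_level)
        w_l = np.linalg.lstsq(A, r, rcond=None)[0]
    return w_tf, w_l
```

Note that the overall scale is shared between w^tf and w^l, so one of the two factors is usually normalised after fitting.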

  • Some input non-linearities

    [Figure: learned input nonlinearity w^l as a function of sound level l (dB SPL), 25–70 dB.]

  • Parameter grouping

    Separable STRF: (time) ⊗ (frequency). The input-nonlinearity model is separable in another sense: (time, frequency) ⊗ (sound level).

    [Diagram: the full weight array over (time, frequency, intensity) factored into a spectrotemporal weight matrix and an intensity weight curve.]

    Other separations:

    • (time, sound level) ⊗ (frequency): r̂ = (w^tl ⊗ w^f) • M
    • (frequency, sound level) ⊗ (time): r̂ = (w^fl ⊗ w^t) • M
    • (time) ⊗ (frequency) ⊗ (sound level): r̂ = (w^t ⊗ w^f ⊗ w^l) • M

  • (time, frequency) ⊗ (sound level)

    [Figure: fitted w^tf (frequency in kHz vs. time in ms) and w^l as a function of sound level l (dB SPL).]

  • (time, sound level) ⊗ (frequency)

    [Figure: fitted w^tl (sound level in dB SPL vs. time in ms) and w^f as a function of frequency f (kHz).]

  • (frequency, sound level) ⊗ (time)

    [Figure: fitted w^fl (sound level in dB SPL vs. frequency in kHz) and w^t as a function of time t (ms).]

  • Contextual influences

    [Diagram: stimulus spectrogram passing through a contextual field w^τφ and a principal field w^tf to predict the PSTH.]

        rate(i) = c + Σ_{jkl} w^t_j w^f_k w^l_l g_l(sound_{i−j,k}) · ( c₂ + Σ_{mnp} w^τ_m w^φ_n w^λ_p g_p(sound_{i−j−m,k+n}) )

  • Contextual influences

    Introduce extra dimensions:

    • τ: time difference between contextual and primary tone
    • φ: frequency difference between contextual and primary tone
    • λ: sound level of the contextual tone

    Model the effective sound level of a primary tone by

    Level(i, j)→ Level(i, j) · (const + Context(i, j)) .

    and the context by

        Context(i, j) = Σ_{m,n,p} w^τ_m w^φ_n w^λ_p h_p(s(i − m, j + n))

    This leads to a multilinear model

    r̂ = (wt ⊗ wf ⊗ wl ⊗ wτ ⊗ wφ ⊗ wλ) •M.

  • Inseparable contexts

    We can also allow inseparable contexts (and principal fields), dropping the level nonlinearity to reduce the number of parameters.

        r(i) = c + Σ_{jk} w^tf_{jk} sound_{i−j,k} · ( 1 + Σ_{mn} w^τφ_{mn} sound_{i−j−m,k+n} )

    [Diagram: stimulus spectrogram passing through an inseparable contextual field w^τφ and principal field w^tf to predict the PSTH.]

  • Performance

    [Figure: extrapolated normalised predictive power (training and test sets) for the STRF, separable-context, and inseparable-context models.]
