Adapting Generalized additive modelling methods for short term grid load forecasting


GAM for large datasets and load forecasting

Simon Wood, University of Bath, U.K.

in collaboration with

Yannig Goude, EDF

French Grid load

[Figure: French grid load (GW) against day. Left: the full series, days 0 to about 1800, roughly 30-70 GW. Right: a zoom on days 1740-1820.]

- Half hourly load data from 1st Sept 2002.
- Pierrot and Goude (2011) successfully predicted one day ahead using 48 separate models, one for each half hour period of the day.

Why 48 models?

- The 48 separate models each have a form something like

  MW_t = f(MW_{t−48}) + Σ_j g_j(x_{j,t}) + ε_t

  where f and the g_j are smooth functions to be estimated, and the x_j are other covariates.

- 48 separate models have some disadvantages:
  1. Continuity of effects across the day is not enforced.
  2. Information is wasted, as the correlation between neighbouring time points is not utilised.
  3. Fitting 48 models, each to 1/48 of the data, promotes estimate instability and creates a huge model checking task each time the model is updated (every day?).

- But existing methods for fitting such smooth additive models could not cope with fitting one model to all the data.

Smooth additive models

- The basic model is

  y_i = Σ_j f_j(x_{j,i}) + ε_i

  where the f_j are smooth functions to be estimated.

- Need to estimate the f_j (including how smooth they should be).

- Represent each f_j using a spline basis expansion

  f_j(x_j) = Σ_k b_{jk}(x_j) β_{jk}

  where the b_{jk} are basis functions and the β_{jk} are coefficients (maybe 10-100 of each).

- So the model becomes

  y_i = Σ_j Σ_k b_{jk}(x_{j,i}) β_{jk} + ε_i = X_i β + ε_i

  . . . a linear model. But it is much too flexible.

Penalizing overfit

- To avoid overfitting, we penalize function wiggliness (lack of smoothness) while fitting.

- In particular, let

  J_j(f_j) = β^T S_j β

  measure the wiggliness of f_j.

- Estimate β to minimize

  Σ_i (y_i − X_i β)² + Σ_j λ_j β^T S_j β

  where the λ_j are smoothing parameters.

- High λ_j ⇒ smooth f_j. Low λ_j ⇒ wiggly f_j. (A small numerical sketch of this penalized least squares step follows.)
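The following numpy sketch solves that penalized least squares problem directly, assuming the design matrix X, the penalty matrices S_j (each already embedded in a p × p matrix of zeros so it only acts on its own term's coefficients) and the smoothing parameters λ_j are given; the names and structure are illustrative, not code from the talk.

```python
import numpy as np

def penalized_ls(X, y, S_list, lambdas):
    """Minimize ||y - X b||^2 + sum_j lambda_j * b' S_j b over b."""
    # Total penalty matrix: sum_j lambda_j S_j.
    S = sum(lam * S for lam, S in zip(lambdas, S_list))
    # Penalized normal equations: (X'X + S) b = X'y.
    return np.linalg.solve(X.T @ X + S, X.T @ y)
```

For fixed λ_j this is just ridge-style linear algebra; everything interesting is in how the λ_j are chosen, which the later slides address.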

Example basis-penalty: P-splines

- Eilers and Marx have popularized the use of B-spline bases with discrete penalties.

- If b_k(x) is a B-spline and β_k an unknown coefficient, then

  f(x) = Σ_{k=1}^{K} β_k b_k(x).

- Wiggliness can be penalized by e.g.

  P = Σ_{k=2}^{K−1} (β_{k−1} − 2β_k + β_{k+1})² = β^T S β.

  (S is constructed explicitly in the sketch after the figure.)

[Figure: left, B-spline basis functions b_k(x) on [0, 1]; right, the curve f(x) built by weighting and summing those basis functions.]
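The penalty above has a simple matrix form: S = D^T D, where D is the second-order difference matrix. A short numpy sketch (illustrative, not from the talk) that builds S and checks it reproduces the sum of squared second differences:

```python
import numpy as np

K = 10                                   # number of B-spline coefficients
# D @ beta returns the second differences beta[k-1] - 2*beta[k] + beta[k+1].
D = np.diff(np.eye(K), n=2, axis=0)      # shape (K-2, K)
S = D.T @ D                              # penalty matrix, so P = beta' S beta

beta = np.random.default_rng(0).standard_normal(K)
assert np.isclose(beta @ S @ beta, np.sum(np.diff(beta, n=2) ** 2))
```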

Example varying P-spline penalty

[Figure: the B-spline basis functions, followed by three P-spline fits to the same (x, y) data with decreasing wiggliness; the penalty Σ_i (β_{i−1} − 2β_i + β_{i+1})² is 207 for the wiggliest fit, 16.4 for an intermediate fit, and 0.04 for the smoothest fit.]

Choosing how much to smooth

[Figure: three fits to the same (x, y) data illustrating λ too high (oversmoothed), λ about right, and λ too low (overfitted).]

- Choose the λ_j to optimize the model's ability to predict data it wasn't fitted to.

- Generalized cross validation chooses the λ_j to minimize

  V(λ) = ‖y − Xβ̂‖² / [n − tr(F)]²

  where β̂ is the penalized estimate for the given λ, F = (X^T X + Σ_j λ_j S_j)^{−1} X^T X, and tr(F) is the effective degrees of freedom. (A grid-search sketch follows.)
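A minimal numpy sketch of GCV for a single smoothing parameter, searched over a grid (illustrative; a real implementation optimises over log λ for several penalties simultaneously):

```python
import numpy as np

def gcv_score(X, y, S, lam):
    """V(lambda) = ||y - X beta_hat||^2 / (n - tr(F))^2 for one penalty matrix S."""
    n = len(y)
    XtX = X.T @ X
    A = XtX + lam * S
    beta = np.linalg.solve(A, X.T @ y)    # penalized estimate for this lambda
    F = np.linalg.solve(A, XtX)           # F = (X'X + lam*S)^{-1} X'X
    edf = np.trace(F)                     # effective degrees of freedom
    rss = np.sum((y - X @ beta) ** 2)
    return rss / (n - edf) ** 2

def choose_lambda(X, y, S, grid=np.logspace(-4, 4, 41)):
    return min(grid, key=lambda lam: gcv_score(X, y, S, lam))
```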

The main problem

- We estimate β as the minimizer of

  ‖y − Xβ‖² + Σ_j λ_j β^T S_j β

  and the λ_j to minimize

  V(λ) = ‖y − Xβ̂‖² / [n − tr(F)]².

- If X is too large, we can easily run out of computer memory when trying to fit the model.

- A simple approach lets us fit the model without ever forming X whole.

A smaller equivalent fitting problem

- Let X be n × p.

- Consider the QR decomposition X = QR, where R is p × p and Q has orthogonal columns.

- Let f = Q^T y and γ = ‖y‖² − ‖f‖².

- Then ‖y − Xβ‖² = ‖f − Rβ‖² + γ, and

  V(λ) = ‖y − Xβ̂‖² / [n − tr(F)]² = (‖f − Rβ̂‖² + γ) / [n − tr(F)]²

  where F = (R^T R + Σ_j λ_j S_j)^{−1} R^T R.

- So once we have R and f we can estimate β and λ without X. . . (the identity is checked numerically in the sketch below).
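A quick numpy check of the identity ‖y − Xβ‖² = ‖f − Rβ‖² + γ on random data (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)            # any coefficient vector

Q, R = np.linalg.qr(X)                   # thin QR: Q is n x p, R is p x p
f = Q.T @ y
gamma = y @ y - f @ f                    # ||y||^2 - ||f||^2

lhs = np.sum((y - X @ beta) ** 2)
rhs = np.sum((f - R @ beta) ** 2) + gamma
assert np.isclose(lhs, rhs)
```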

QR updating for R and f

- Partition X and y into row blocks: X = [X_0; X_1], y = [y_0; y_1] (blocks stacked vertically).

- Form X_0 = Q_0 R_0 and then [R_0; X_1] = Q_1 R.

- Then X = QR, where Q = [Q_0 0; 0 I] Q_1 (a block-diagonal matrix times Q_1).

- Also f = Q^T y = Q_1^T [Q_0^T y_0; y_1].

- Applying these results recursively, R and f can be accumulated from sub-blocks of X without forming X whole (see the sketch below).
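A numpy sketch of this recursive update, streaming (X block, y block) pairs; the block size and function name are illustrative:

```python
import numpy as np

def qr_accumulate(blocks):
    """Accumulate R and f = Q'y from row blocks of (X, y) without forming X."""
    R, f = None, None
    for Xb, yb in blocks:
        if R is None:
            Q, R = np.linalg.qr(Xb)                      # first block: thin QR
            f = Q.T @ yb
        else:
            Q1, R = np.linalg.qr(np.vstack([R, Xb]))     # QR of [R; X_block]
            f = Q1.T @ np.concatenate([f, yb])           # f = Q1' [f; y_block]
    return R, f

# Consistency check against a one-shot QR of the whole matrix.
rng = np.random.default_rng(1)
X, y = rng.standard_normal((1000, 8)), rng.standard_normal(1000)
blocks = [(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]
R, f = qr_accumulate(blocks)
Q0, R0 = np.linalg.qr(X)
assert np.allclose(np.abs(R), np.abs(R0))                # equal up to row signs
assert np.isclose(f @ f, (Q0.T @ y) @ (Q0.T @ y))        # same reduced problem
```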

X^T X updating and Choleski

- R is the Choleski factor of X^T X.

- Partitioning X row-wise into sub-matrices X_1, X_2, . . ., we have

  X^T X = Σ_j X_j^T X_j

  which can be used to accumulate X^T X without needing to form X whole.

- At the same time accumulate X^T y = Σ_j X_j^T y_j.

- The Choleski decomposition gives R^T R = X^T X, and then f = R^{−T} X^T y.

- This is twice as fast as the QR approach, but less numerically stable (sketch below).
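The same accumulation written as a numpy/scipy sketch (illustrative; forming X^T X squares the condition number, which is the stability cost mentioned above):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def chol_accumulate(blocks):
    """Accumulate X'X and X'y over row blocks, then recover R and f = R^{-T} X'y."""
    XtX, Xty = 0.0, 0.0
    for Xb, yb in blocks:
        XtX = XtX + Xb.T @ Xb
        Xty = Xty + Xb.T @ yb
    R = cholesky(XtX)                           # upper triangular, R'R = X'X
    f = solve_triangular(R, Xty, trans='T')     # solves R'f = X'y, i.e. f = R^{-T} X'y
    return R, f
```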

Generalizations

1. Generalized additive models (GAMs) allow distributions other than normal (and non-identity link functions).
2. GAMs can be estimated by iteratively estimating working linear models, using the R and f accumulation tricks on each (a penalized IRLS sketch follows this list).
3. REML, AIC etc. can also be used for λ_j estimation. REML can be made especially efficient in this context.
4. The accumulation methods are easy to parallelize.
5. Updating a model fit with new data is very easy.
6. AR1 residuals can also be handled easily in the non-generalized case.
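To make point 2 concrete, here is a minimal penalized IRLS sketch for a Poisson GAM with log link (illustrative only; in practice each weighted working model would itself be fitted via the block QR or Choleski accumulation above rather than by forming X whole):

```python
import numpy as np

def fit_poisson_gam(X, y, S, lam, n_iter=50, tol=1e-8):
    """Penalized IRLS for log(E[y]) = X @ beta with penalty lam * beta' S beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu                                   # Fisher weights for the log link
        z = eta + (y - mu) / mu                  # working response
        Xw = X * w[:, None]                      # rows of X scaled by the weights
        beta_new = np.linalg.solve(X.T @ Xw + lam * S, Xw.T @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```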

Air pollution example

- Around 5000 daily death rates for Chicago, along with time, ozone, pm10 and tmp (the last 3 averaged over the preceding 3 days). Peng and Welty (2004).

- An appropriate GAM is death_i ∼ Poi, with

  log{E(death_i)} = f_1(time_i) + f_2(ozone_i, tmp_i) + f_3(pm10_i).

- f_1 and f_3 are penalized cubic regression splines, f_2 is a tensor product spline.

- The results suggest a very strong ozone-temperature interaction.

Air pollution Chicago results

[Figure: estimated effects for Chicago: the smooth time trend s(time, 136.85), the tensor product interaction te(o3, tmp, 38.63) shown as a contour plot with −1 s.e. and +1 s.e. contours, and the pm10 effect s(pm10, 3.42).]

Air pollution Chicago results

- To test the reality of the ozone-temperature interaction it would be good to fit the equivalent data for around 100 other cities, simultaneously.

- The model is

  log{E(death_i)} = γ_j + α_k + f_k(o3_i, temp_i) + f_4(t_i)

  if observation i is from city j and age group k (there are 3 age groups recorded).

- The model has 802 coefficients and is estimated from 1.2M data points.

- Fitting takes 12.5 minutes using 4 cores of a $600 PC.

- Fitting the same model to just the Chicago data takes 11.5 minutes by the previous methods.

Air pollution all cities results

[Figure: all-cities results: the common time effect s(time, 277.36) and contour plots of the estimated ozone-temperature surfaces for the three age groups (<65, 65-74, >74).]

Back to grid load

[Figure: the French grid load series again: the full series and the zoomed panel, GW against day.]

- Now the 48 separate grid load models can be replaced by a single model

  L_i = γ_j + f_k(I_i, L_{i−48}) + g_1(t_i) + g_2(I_i, toy_i) + g_3(T_i, I_i)
        + g_4(T.24_i, T.48_i) + g_5(cloud_i) + S_i^T h(I_i) + e_i

  if observation i is from day of the week j and day class k.

- e_i = ρ e_{i−1} + ε_i with ε_i ∼ N(0, σ²) (AR1 errors).

Fitting grid load model

- Using QR-update fitting takes under 30 minutes, and we estimate ρ = 0.98.

- Updating with new data then takes under 2 minutes.

- We fit up to 31 August 2008, and then predicted the next year's data, one day at a time, with model updating.

- Predictive RMSE was 1156 MW (1024 MW for the fit). Predictive MAPE was 1.62% (1.46% for the fit).

- Setting ρ = 0 gives overfitting, and worse predictive performance. (A whitening sketch for the AR1 errors follows.)
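One standard way to handle AR1 errors in the Gaussian case is to whiten the rows of y and X before running the usual penalized fit; a minimal numpy sketch under that assumption (not necessarily the talk's exact implementation):

```python
import numpy as np

def ar1_whiten(X, y, rho):
    """Transform (X, y) so AR1(rho) errors become uncorrelated with constant variance."""
    Xw, yw = X.astype(float).copy(), y.astype(float).copy()
    # First row: scale so its error variance matches the innovations.
    Xw[0] *= np.sqrt(1.0 - rho ** 2)
    yw[0] *= np.sqrt(1.0 - rho ** 2)
    # Remaining rows: subtract rho times the previous row.
    Xw[1:] = X[1:] - rho * X[:-1]
    yw[1:] = y[1:] - rho * y[:-1]
    return Xw, yw
```

The whitened rows can then be streamed through the same QR accumulation as before, which is presumably what makes the AR1 case easy to handle in the non-generalized setting.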

Residuals

[Figure: residual diagnostics: MAPE (%) by half-hourly instant, broken down by day of week (Mo-Su); residuals over the fitting period (1 Sept 2002 to 31 Aug 2008); and residuals over the one-day-ahead prediction year (1 Sept 2008 to 31 Aug 2009).]

Temperature effect

[Figure: estimated temperature effect: perspective plot of predicted load L[t] against instant I[t] and temperature T[t].]

Lagged load Effects

[Figure: estimated lagged load effects: four perspective plots of L[t] against instant I[t] and lagged load L[t − 1], one for each of the day-class combinations WW, WH, HH and HW.]