
Artificial Neural Networks in Time Series Forecasting: A Comparative Analysis 1

Hector Allende 2, Claudio Moraga and Rodrigo Salas

Universidad Técnica Federico Santa María; Departamento de Informática; Casilla 110-V; Valparaíso-Chile; e-mail: (hallende,rsalas)@inf.utfsm.cl

University of Dortmund; Department of Computer Science; D-44221 Dortmund; Germany; e-mail: [email protected]

Abstract

Artificial neural networks (ANN) have received a great deal of attention in many fields of engineering and science. Inspired by the study of brain architecture, ANN represent a class of nonlinear models capable of learning from data. ANN have been applied in many areas where statistical methods are traditionally employed. They have been used in pattern recognition, classification, prediction and process control. The purpose of this paper is to discuss ANN and compare them to nonlinear time series models. We begin by exploring recent developments in time series forecasting with particular emphasis on the use of nonlinear models. Thereafter we include a review of recent results on the topic of ANN. The relevance of ANN models for statistical methods is considered using time series prediction problems. Finally we construct asymptotic prediction intervals for ANN and show how to use prediction intervals to choose the number of nodes in the ANN.

Keywords: Artificial neural networks; nonlinear time series models; prediction intervals; model specification; asymptotic properties.

1 Introduction

Artificial neural networks (ANN) have received a great deal of attention over the last years. They are being used in the areas of prediction and classification, areas where regression and other related statistical techniques have traditionally been used [CT94].

Forecasting in time series is a common problem. Using a statistical approach, Box and Jenkins [BJR94] have developed the integrated autoregressive moving average (ARIMA) methodology for fitting a class of linear time series models. Statisticians have addressed the restriction of linearity in the Box-Jenkins approach in a number of ways. Robust versions of various ARIMA models have been developed, and a large literature on inherently nonlinear time series models is available. The stochastic approach to nonlinear time series outlined by [Ton90] can not only fit nonlinear models to time series data, but also provides measures of uncertainty in the estimated model parameters as well as in the forecasts generated by these models. It is the stochastic approach that enables the specification of uncertainty in parameter estimates and forecasts.

More recently, ANN have been studied as an alternative to these nonlinear model-driven approaches. Because of their characteristics, ANN belong to the data-driven approach, i.e. the analysis depends on the available data, with little a priori rationalization about relationships between variables and about the models. The process of constructing the relationships between the input and output variables is addressed by certain general-purpose 'learning' algorithms [Fin99].

1 This research was supported in part by the Research Grant BMBF RCH99/023 (Germany), in part by a Research Fellowship of the German Academic Exchange Service (DAAD) and by the Research Grant DGIP-UTFSM 240022.

2 Address for correspondence.


Some drawbacks to the practical use of ANN are the possibly long time consumed in the modeling process and the large amount of data required by present ANN technology. Speed-ups are being achieved due to the impressive progress in increasing the clock rate of present processors. The demand on the number of observations remains, however, a hard open problem. One cause of both problems is the lack of a definite generic methodology that could be used to design a small structure. Most of the present methodologies use networks with a large number of parameters ("weights"). This means lengthy computations to set their values and a requirement for many observations. Unfortunately, in practice, a model's parameters must be estimated quickly and only a small amount of data is available. Moreover, part of the available data should be kept for the validation and performance-evaluation procedures.

This report reviews recent developments in one important class of nonlinear time series models, the ANN (model-free systems), and describes a methodology for the construction of prediction intervals for the resulting forecasts.

In the next section we provide a very brief review of the linear and nonlinear ARMA models and optimal prediction. Section 3 contains an overview of ANN terminology and describes a methodology for neural model identification. The multilayer feedforward ANN described can be conceptualized as a means of fitting a highly nonlinear regression and time series prediction problem. In Section 4 we use the results of [HD97] to construct confidence intervals and prediction intervals in nonlinear time series.

2 Time Series Analysis

2.1 Linear Models

The statistical approach to forecasting involves the construction of stochastic models to predict the value of an observation x_t using previous observations. This is often accomplished using linear stochastic difference equation models with random input. By far the most important class of such models is the linear autoregressive integrated moving average (ARIMA) model. Here we provide a very brief review of the linear ARIMA models and optimal prediction for these models. A more comprehensive treatment may be found, for example, in [BJR94]. The seasonal ARIMA (p,d,q) × (P,D,Q)_s model for such time series is represented by

    φ(B^S) φ_p(B) ∇_S^D ∇^d x_t = θ(B^S) θ_q(B) ε_t    (1)

where φ_p(B) is the nonseasonal autoregressive operator of order p, θ_q(B) is the nonseasonal moving average operator of order q, φ(B^S) and θ(B^S) are the seasonal autoregressive and moving average operators of orders P and Q, and x_t and ε_t are the time series and a sequence of random shocks, respectively. Moreover, it is assumed that E[ε_t | x_{t−1}, x_{t−2}, ...] = 0. This condition is satisfied, for example, when the ε_t are zero mean, independent and identically distributed, and independent of past x_t's. It is assumed throughout that ε_t has finite variance σ². The backshift operator B shifts the index of a time series observation backwards, e.g. Bx_t = x_{t−1} and B^k x_t = x_{t−k}. The order of the operators is selected by Akaike's information criterion (AIC) or by the Bayes information criterion (BIC) [BD91], and the parameters φ_1, ..., φ_p and θ_1, ..., θ_q are estimated from the time series data using optimization methods such as maximum likelihood [BJR94] or robust methods such as recursive Generalized Maximum likelihood [AH92]. The ARMA model is limited by the requirement of stationarity and invertibility of the time series, i.e. the system generating the time series must be time invariant and stable. Additionally, the residuals must be independent and identically distributed [BO93].
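For illustration, a seasonal ARIMA model of the form (1) can be fitted with standard software; the following minimal Python sketch uses the statsmodels library (assumed available; the data file name is hypothetical, and the orders shown would normally be chosen by AIC or BIC as described above):

    # Minimal sketch: fitting a seasonal ARIMA (p,d,q)x(P,D,Q)_s model as in eq. (1).
    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    y = np.log(np.loadtxt("airline.txt"))      # hypothetical data file; log makes seasonality additive

    model = SARIMAX(y, order=(0, 1, 1), seasonal_order=(0, 1, 1, 12))
    fit = model.fit(disp=False)                # maximum likelihood estimation

    print(fit.aic, fit.bic)                    # information criteria for order selection
    forecast = fit.forecast(steps=12)          # twelve one-step-ahead forecasts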

The ARMA models require a stationary time series in order to be useful for forecasting. The condition for a series to be weakly stationary is that for all t:

    E[x_t] = µ ;  V[x_t] = σ² ;  COV[x_t, x_{t+k}] = γ_k    (2)

Diagnostic checking of the overall ARMA model is done by examining the residuals. Several tests have been proposed; among them the most popular seems to be the so-called portmanteau test proposed by [LB78], together with its robust version by [AG96]. These tests are based on a sum of squared autocorrelations of the estimated residuals, suitably scaled.

2.2 Non-Linear Models

Theory and practice are mostly concerned with linear methods and models, such as ARIMA models and exponential smoothing methods. However, many time series exhibit features which cannot be explained in a linear framework. For example, some economic series show different properties when the economy is going into, rather than coming out of, recession. As a result, there has been increasing interest in nonlinear models.

Many types of nonlinear models have been proposed in the literature; see for example bilinear models [Rao81], classification and regression trees [BFOS84], threshold autoregressive models [Ton90] and projection pursuit regression [Fri91]. The rewards from using nonlinear models can occasionally be substantial. However, on the debit side, it is generally more difficult to compute forecasts more than one step ahead [LG94].

Another important class of nonlinear models is that of nonlinear ARMA models proposed by [CM94]. Natural generalizations of the linear ARMA models to the nonlinear case are the nonlinear ARMA models (NARMA)

    x_t = h(x_{t−1}, x_{t−2}, ..., x_{t−p}, ε_{t−1}, ..., ε_{t−q}) + ε_t    (3)

where h is an unknown smooth function, and as in 2.1 it is assumed that E[ε_t | x_{t−1}, x_{t−2}, ...] = 0 and that the variance of ε_t is σ². In this case the conditional mean predictor based on the infinite past of observations is

    x̂_t = E[h(x_{t−1}, x_{t−2}, ..., x_{t−p}, ε_{t−1}, ..., ε_{t−q}) | x_{t−1}, x_{t−2}, ...]    (4)

Suppose that the NARMA model is invertible in the sense that there exists a function ν such that

    x_t = ν(x_{t−1}, x_{t−2}, ...) + ε_t    (5)

Then, given the infinite past of observations x_{t−1}, x_{t−2}, ..., one can compute the ε_{t−j} in (3) exactly:

    ε_{t−j} = ε_{t−j}(x_{t−j}, x_{t−j−1}, ...),  j = 1, 2, ..., q    (6)

In this case the mean estimate is

    x̂_t = h(x_{t−1}, x_{t−2}, ..., x_{t−p}, ε_{t−1}, ..., ε_{t−q})    (7)

where the ε_{t−j} are specified in terms of present and past x_u's. The predictor (7) has mean-square error σ².

Since we have only a finite observation record, we cannot compute (6) and (7). It seems reasonable to approximate the conditional mean predictor (7) by the recursive algorithm

    x̂_t = h(x_{t−1}, x_{t−2}, ..., x_{t−p}, ε̂_{t−1}, ..., ε̂_{t−q})    (8)

    ε̂_{t−j} = x_{t−j} − x̂_{t−j},  j = 1, 2, ..., q    (9)

with the following initial conditions:

    x̂_0 = x̂_{−1} = ... = x̂_{−p+1} = ε̂_0 = ... = ε̂_{−q+1} = 0    (10)
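To make the recursion concrete, here is a minimal Python sketch of the predictor (8)-(10); it is our illustration, with h standing in for any fitted NARMA mean function:

    import numpy as np

    def narma_predict(x, h, p, q):
        """One-step predictions via the recursion (8)-(10).

        x : observed series (1-D array); h : fitted mean function taking
        (p past observations, q past residual estimates); p, q : model orders.
        """
        x = np.asarray(x, dtype=float)
        n = len(x)
        x_hat = np.zeros(n)            # initial conditions (10): predictions start at 0
        eps_hat = np.zeros(n)          # and so do the residual estimates
        for t in range(max(p, q), n):
            past_x = x[t - p:t][::-1]          # x_{t-1}, ..., x_{t-p}
            past_e = eps_hat[t - q:t][::-1]    # eps_{t-1}, ..., eps_{t-q}
            x_hat[t] = h(past_x, past_e)       # equation (8)
            eps_hat[t] = x[t] - x_hat[t]       # equation (9)
        return x_hat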

For the special case of the nonlinear autoregressive model (NAR), it is easy to check that (3) reduces to

    x_t = h(x_{t−1}, x_{t−2}, ..., x_{t−p}) + ε_t    (11)

In this case, the minimum mean square error (MSE) optimal predictor of x_t given x_{t−1}, x_{t−2}, ..., x_{t−p} is the conditional mean (for t ≥ p + 1)

    x̂_t = E[x_t | x_{t−1}, ..., x_{t−p}] = h(x_{t−1}, ..., x_{t−p})    (12)

This predictor has mean square error σ².

3 Artificial Neural Networks

The brain is an enormously complex system in which distributed information is processed in parallel by mutual dynamical interactions of neurons. It is still difficult and challenging to understand the mechanisms of the brain. The importance and effectiveness of brain-style computation has become a fundamental principle in the development of neural networks. There are three different research areas concerning neural networks. One is the experimental, based on physiology and molecular biology. The second area is engineering applications of neural networks inspired by brain-style computation, where information is distributed as analog pattern signals, parallel computations are dominant, and learning guarantees flexibility and robust computation. The third area is concerned with the mathematical foundations of neuro-computing, which searches for the fundamental principles of parallel distributed information systems with learning capabilities. Statistics has a close relation with the second application area of neural networks. This area has opened new practical methods of pattern recognition, time series analysis, image processing, etc.

Artificial neural networks (ANN) provide statistics with tractable multivariate nonlinear methods to be further studied. On the other hand, statistical science provides one of the crucial methods for constructing the theoretical foundations of neuro-computing.

From a statistical perspective ANN are interesting because of their use in various kinds of problems, for example prediction and classification. ANN have been used for a wide variety of applications where statistical methods are traditionally employed. They have been used in classification problems such as identifying underwater sonar contacts and predicting heart problems in patients [Bax90]. In time series applications they have been used in predicting stock market performance [Hut94]. ANN are currently the preferred tool in predicting protein secondary structures [FR90]. Statisticians would normally solve these problems through classical statistical models such as discriminant analysis, logistic regression, multiple regression, and time series models such as ARIMA and forecasting methods.

It is therefore time to recognize ANN as a potential tool for data analysis. Several authors have done comparison studies between statistical methods and ANN (see e.g. [WY92] and [Ste96]). These works tend to focus on performance comparisons and use specific problems as examples. ANN trained by error backpropagation are examples of nonparametric regression estimators. In this report we present the relations between nonparametric inference and ANN, and we use the statistical viewpoint to highlight strengths and weaknesses of neural models. There are a number of good introductory articles on ANN, usually located in various trade journals. For instance, [Lip87] provides an excellent overview of ANN for the signal processing community. There have also been papers relating ANN and statistical methods [Rip93], [Sar94]. One of the best general overviews for statisticians is [CT94].

3.1 Elements of Artificial Neural Networks

The three essential features of an artificial neural network (ANN) are the basic processing elements referred to as neurons or nodes; the network architecture describing the connections between nodes; and the training algorithm used to find values of the network parameters for performing a particular task.

Figure 1: A multilayer feedforward ANN for approximating the unknown function ϕ(x).

An ANN consists of elementary processing elements (neurons) organized in layers (see Figure 1). The layers between the input and the output layers are called "hidden". The number of input units is determined by the application. The architecture or topology A_λ of a network refers to the topological arrangement of the network connections. A class of neural models is specified by

    S_λ = { g_λ(x, w) : x ∈ R^m, w ∈ W },  W ⊆ R^p    (13)

where g_λ(x, w) is a nonlinear function of x with w being its parameter vector, λ is the number of hidden neurons, and p is the number of free parameters determined by A_λ, i.e., p = ρ(A_λ).

A class (or family) of neural models is a set of ANN models which share the same architecture and whose individual members are continuously parameterized by the vector w = (w_1, w_2, ..., w_p)^T. The elements of this vector are usually referred to as weights. For a single-hidden-layer architecture, the number of hidden units λ indexes the different classes of ANN models {S_λ}, since it is an unambiguous descriptor of the dimensionality p of the parameter vector (p = (m + 2)λ + 1).

m � 2� λ � 1� .Giventhesampleof observations,thetaskof neurallearningis to constructanestimatorg

�x � w�

of theunknown functionϕ�x�

gλ�x � w� � γ2

� λ

∑j ! 1

w" 2#j γ1� m

∑i ! 1

w" 1#i j xi � w" 1#m� 1 $ j ��� w" 2#λ � 1 � (14)

where w = (w_1, w_2, ..., w_p)^T is a parameter vector to be estimated, the γ's are linear or nonlinear transfer functions, and λ is a control parameter (the number of hidden units). An important factor in the specification of neural models is the choice of the base function γ. Otherwise known as 'activation' or 'squashing' functions, these can be any nonlinear functions as long as they are continuous, bounded and differentiable. Typically γ_1 is a sigmoidal or hyperbolic tangent function, and γ_2 is a linear function.
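In code, the model class (14) is a single hidden layer followed by a linear output. The following minimal Python sketch (our illustration; the packing of the weights into W1, b1, w2, b2 is an assumption of ours, one of several possible) evaluates g_λ(x, w) with γ_1 the logistic sigmoid and γ_2 the identity:

    import numpy as np

    def g_lambda(x, W1, b1, w2, b2):
        """Evaluate the single-hidden-layer model of equation (14).

        x : input vector of length m
        W1: (lambda_, m) input-to-hidden weights w^(1)_{ij}
        b1: (lambda_,)   hidden biases w^(1)_{m+1,j}
        w2: (lambda_,)   hidden-to-output weights w^(2)_j
        b2: scalar       output bias w^(2)_{lambda+1}
        """
        gamma1 = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoidal activation
        hidden = gamma1(W1 @ x + b1)                  # inner sums of (14)
        return w2 @ hidden + b2                       # gamma2 = identity

Counting entries gives λm + λ + λ + 1 = (m + 2)λ + 1 free parameters, in agreement with the dimensionality stated above.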

The estimated parameter ŵ is obtained by iteratively minimizing a cost functional L_n(w), i.e.

    ŵ = argmin { L_n(w) : w ∈ W },  W ⊆ R^p    (15)

where L_n(w) is, for example, the ordinary least squares function, i.e.

    L_n(w) = (1/2n) ∑_{i=1}^n (y_i − g_λ(x_i, w))²    (16)

The loss function in equation (16) gives us a measure of the accuracy with which an estimator A_λ fits the observed data, but it does not account for the estimator's (model) complexity. Given a sufficiently large number of free parameters, p = ρ(A_λ), a neural estimator A_λ can fit the data with arbitrary accuracy. Thus, from the perspective of selecting between candidate models, expression (16) is an inadequate measure. The usual approach to selection is the so-called discrimination approach, where the models are evaluated using a fitness criterion which penalizes the in-sample performance of the model as the complexity of the functional form increases and the degrees of freedom for error become fewer. Such criteria, commonly used in the context of regression analysis, are: the R-squared adjusted for degrees of freedom, Mallows' C_p criterion, Akaike's AIC criterion, etc.

The basic requirement of any method is convergence of the training algorithm to a locally unique minimum. By introducing the requirement that for any particular architecture A_λ the network has to be trained to convergence, we perform a restricted search in the function space, because from each class S_λ we select only one member, with its parameters estimated from equation (15). In this setting the actual training (estimation) algorithm used is of no consequence, provided that it satisfies the convergence requirement. The procedure is as follows:

1. Estimate the parameters w of the model by iteratively minimizing the empirical loss L_n(w) (see (16)). This stage must not be confused with model selection, which in this framework employs a different fitness criterion for selecting between fitted models.
2. Compute the error Hessian A_n (see Appendix A). This is used to facilitate the test of convergence.
3. Perform a test for convergence and uniqueness, basically by examining whether A_n has negative eigenvalues.
4. Estimate the prediction risk P_λ = E[L(ŵ_n)], which adjusts the empirical loss for complexity.
5. Select a model by employing the minimum prediction risk principle, which expresses the trade-off between the generalization ability of the network and its complexity. It has to be noted, however, that since the search is restricted, the selected network is the best among the alternatives considered and does not necessarily represent a global optimum.
6. Test the adequacy of the selected model. Satisfying these tests is a necessary but not sufficient condition for model adequacy. Failure to satisfy them indicates that either more hidden units are needed or some relevant variables were omitted.

3.2 Model-Free Forecast

Artificial neural networks are essentially devices for nonparametric statistical inference. From the statistical viewpoint, they have a simple interpretation: given a sample D_n = {(x_i, y_i)}_{i=1}^n generated by an unknown function f(x) with the addition of a stochastic component ε, i.e.

    y_i = f(x_i) + ε_i    (17)

the task of "neural learning" is to construct an estimator g(x, w, D_n) ≈ f(x) of f(x), where w = (w_1, ..., w_p)^T is a set of free parameters (known as "connection weights" in subsection 3.1), and D_n is a finite set of observations. Since no a priori assumptions are made regarding the functional form of f(x), the neural model g(x, w) is a nonparametric estimator of the conditional expectation E[y | x], as opposed to a parametric estimator where the functional form is assumed a priori, as, for example, in a linear model.

ANN for nonlinear Autoregressive Models (NAR)

An ANN topology and dynamics define an approximator from input to output. The unknown function g : R^m → R produces the observed sample pattern pairs (x_1, y_1), (x_2, y_2), .... The sample data modify the parameters in the neural estimator and bring the neural system's input-output responses closer to the input-output responses of the unknown function g. In psychological terms, the neural system learns from experience. In the neural estimation process, we do not ask the neural engineer to articulate, write down or guess the mathematical shape of the unknown function g. This is why we call the ANN estimation model-free.

A central problem of nonlinear autoregressive models (NAR) is to construct a function h : R^p → R in a dynamical system of the form

    x_t = h(x_{t−1}, x_{t−2}, ..., x_{t−p})    (18)

or possibly involving a mixture of chaos and randomness, x_t = h(x_{t−1}, x_{t−2}, ..., x_{t−p}) + ε_t, in which h is an unknown smooth function and ε_t denotes noise. Similarly to 2.1, we assume that E[ε_t | x_{t−1}, x_{t−2}, ...] = 0 and that ε_t has finite variance σ². Under these conditions the MSE optimal predictor of x_t, given x_{t−1}, x_{t−2}, ..., x_{t−p}, is as shown in equation (12).

Feedforward ANN were proposed as an NAR model for time series prediction by [CM94]. A feedforward ANN provides a nonlinear approximation to h given by

    x̂_t = ĥ(x_{t−1}, x_{t−2}, ..., x_{t−p}) = ∑_{j=1}^λ ŵ_j^(2) γ_1( ∑_{i=1}^p ŵ_{ij}^(1) x_{t−i} + ŵ_{p+1,j}^(1) )    (19)

where the function γ_1(·) is a smooth, bounded, monotonic function. (19) is similar to equation (14), where γ_1 is a sigmoid, γ_2 is the identity function, and the output node has no bias.

The parameters ŵ_j^(2) and ŵ_{ij}^(1) are estimated from a training set, yielding an estimate ĥ of h. Estimates are obtained by minimizing the sum of the squared residuals, similarly to (15). This is done, for example, by a gradient descent procedure known as "backpropagation", by super self-adapting backpropagation, or by second-order methods for learning (see [Fin99]).
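As an illustration of this estimation step, here is a minimal Python sketch (ours, not the authors') that fits the NAR network (19) by plain gradient descent on the loss (16), with the backpropagated gradients written out explicitly; the learning rate and epoch count are arbitrary placeholder values:

    import numpy as np

    def train_nar(x, p, lam, lr=0.01, epochs=2000, seed=0):
        """Sketch: fit the NAR network (19) by gradient descent on the loss (16).

        x : time series, p : number of lags, lam : hidden units.
        Returns weights (W1, b1, w2, b2) in the packing used by g_lambda above.
        """
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        # design matrix of lagged inputs: row t is (x_{t-1}, ..., x_{t-p})
        X = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
        y = x[p:]
        W1 = rng.normal(0, 0.1, (lam, p)); b1 = np.zeros(lam)
        w2 = rng.normal(0, 0.1, lam);      b2 = 0.0
        n = len(y)
        for _ in range(epochs):
            z = X @ W1.T + b1                  # (n, lam) pre-activations
            h = 1.0 / (1.0 + np.exp(-z))       # sigmoid hidden layer
            out = h @ w2 + b2
            err = out - y                      # residuals
            # backpropagated gradients of the loss (1/2n) * sum(err^2)
            g_out = err / n
            g_w2 = h.T @ g_out; g_b2 = g_out.sum()
            g_h = np.outer(g_out, w2) * h * (1 - h)
            g_W1 = g_h.T @ X;   g_b1 = g_h.sum(axis=0)
            W1 -= lr * g_W1; b1 -= lr * g_b1
            w2 -= lr * g_w2; b2 -= lr * g_b2
        return W1, b1, w2, b2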

4 Prediction Intervals for ANN

A theoretical problem of the ANN is the unidentifiability of its parameters; that is, there exist two sets of parameters such that the corresponding distributions of (x, y) are identical. In this section we concentrate on the case of only one hidden layer. Further, we assume a nonparametric statistical model that relates y and g(x, w) as follows:

    y = g(x, w) + ε    (20)

where the random component ε has a normal distribution with mean zero and variance σ². The function g(x, w) is a nonlinear function such as (13).

The network is trained on the data set D_n = {(x_i, y_i)}_{i=1}^n; that is, these data are used to predict the future output at a new input x_{n+1} by ŷ_{n+1} = g(x_{n+1}, ŵ). We assume that for every 1 ≤ i ≤ n+1, (14) and (20) are satisfied, that is, y_i = g(x_i, w) + ε_i, where y_{n+1} is the unobservable random variable that is the target of prediction. Further, we assume that the x_i's are independent of the ε_i's and that the (x_i, ε_i), 1 ≤ i ≤ n+1, are independent identically distributed (i.i.d.). Our aim in this section is to construct prediction intervals for y_{n+1} and confidence intervals for g(x_{n+1}, w), the conditional expectation of y_{n+1} given x_{n+1}.

To discuss the identifiability (or rather the unidentifiability) of the parameters, we first discuss two concepts (as in [Sus92]). We say that an ANN (with a fixed set of parameters) is "redundant" if there exists another ANN with fewer neurons that represents exactly the same relationship function g(·, w). A formal definition is the reducibility of w, which can be found in [Sus92].

Definition 4.1: For γ_1 chosen as a symmetric sigmoidal function and γ_2 a linear function, w is called "reducible" if one of the following three cases holds for j ≠ 0: (a) w_j^(2) = 0 for some j = 1, ..., λ; (b) w_j^(1) = (w_{1j}^(1), ..., w_{mj}^(1)) = 0 for some j = 1, ..., λ; or (c) (w_j^(1), w_{m+1,j}^(1)) = ±(w_l^(1), w_{m+1,l}^(1)) for some j ≠ l, where 0 denotes the zero vector of the appropriate size.

If w is reducible and γ_1 is a sigmoidal function, then the corresponding ANN relative to (20) is redundant [Sus92]. On the other hand, an irreducible w may not always lead to a nonredundant ANN, although a sufficient condition on γ can be given: [Sus92] proved that if the class of functions {γ_1(bx + b_0) : b > 0} ∪ {γ_1 ≡ 1} is linearly independent, then the irreducibility of w implies that the corresponding ANN is nonredundant.

In general, note that every ANN is unidentifiable. However, [HD97] showed that ANNs with certain activation functions leave the distribution of y invariant up to a certain family τ of transformations of w. That is, if there exists another w* such that g(·, w*) = g(·, w), then there is a transformation generated by τ that transforms w* to w. Further, under the assumption that γ_1 is continuously differentiable, the matrix

    Σ = E[∇_w g(x, w) ∇_w g(x, w)^T]    (21)

is non-singular. In this section we construct confidence intervals and prediction intervals based on an ANN and show these to be asymptotically valid using these results.

Following [Sus92] and [HD97], we first assume that the number of neurons λ is known. Specifically, we assume that our observations (x_i, y_i), 1 ≤ i ≤ n, satisfy (20), that is,

    y_i = g(x_i, w) + ε_i    (22)

Furthermore, let y_{n+1} denote a future unknown observation that satisfies

    y_{n+1} = g(x_{n+1}, w) + ε_{n+1}    (23)

We then construct a prediction interval for y_{n+1} and a confidence interval for g(x_{n+1}, w), the conditional mean of y_{n+1} given x_{n+1}.

Before doing so, we state general results about an invariant statistic, where g(·, w) may or may not correspond to an ANN. We write the parameter space W, a subset of R^p, as the union of sets W_i, which may or may not be disjoint. We assume that there exist differentiable functions T_i, 1 ≤ i ≤ λ, that map W_1 onto W_i, that is,

    T_i(W_1) = W_i    (24)

Let w_0 ∈ W denote the true parameter and let w_0^(1) be the point in W_1 that corresponds to w_0. Assume that ŵ^(1) is a consistent estimator for w_0^(1) based on a sample of size n and that

    √n (ŵ^(1) − w_0^(1)) → N(0, σ² V(w_0^(1)))    (25)

where σ² is a scale parameter that can be estimated consistently by an estimator σ̂², and V(w_0^(1)) is a square matrix (see [Fin99]). Let ŵ be an arbitrary estimator that takes values from {ŵ^(1), ..., ŵ^(λ)}, where ŵ^(i) = T_i(ŵ^(1)). This is to say that for every n and every data set, there exists an i such that ŵ = ŵ^(i). A real-valued function l(w) is said to be invariant with respect to all the transformations T_i if

    l(w) = l(T_i(w)) for every i    (26)

One can show that the asymptotic variance of an invariant statistic is also invariant, as stated in the following result (see [HD97]): assume that l(w) is differentiable and invariant; then, as n → ∞, √n (l(ŵ) − l(w_0)) converges to a normal distribution with mean zero and variance ν²(w_0), where ν²(w_0) = σ² [∇l(w_0)^T V(w_0) ∇l(w_0)]. Furthermore, the function ν²(w_0) is invariant with respect to all of the transformations T_i.

Under additional continuity assumptions it can be proved that the asymptotic variance can be estimated consistently by σ̂² [∇l(ŵ^(i))^T V(ŵ^(i)) ∇l(ŵ^(i))], which again by invariance equals

    σ̂² [∇l(ŵ)^T V(ŵ) ∇l(ŵ)]    (27)

Therefore, if ∇l(w) and V(w) are both continuous in w, then (27) is a consistent estimator for the asymptotic variance of √n (l(ŵ) − l(w_0)).

Returning to the neural network problem, we now apply these results to model (20) and assume that the true value w_0 of w is irreducible. We may make this assumption without loss of generality, because otherwise the neural network is redundant and we may drop one neuron without changing the input-output relationship at all. We may continue this process until a nonredundant network is obtained. This corresponds to an irreducible w_0.

Assume that W, the parameter space, is a compact set. Furthermore, make the following assumptions:

i) The ε_i are i.i.d. with mean zero and variance σ², and the ε_i's are statistically independent of the x_i's.

ii) The x_i are i.i.d. samples from an unknown distribution F(x) whose support is R^m.

iii) γ_1 in (14) is a symmetric sigmoidal function with continuous second-order derivative. Let γ'_1 denote its derivative. Furthermore, the class of functions {γ_1(bx + b_0) : b > 0} ∪ {γ'_1(bx + b_0) : b > 0} ∪ {xγ'_1(bx + b_0) : b > 0} ∪ {γ_1 ≡ 1} is linearly independent.

iv) γ_2 in (14) is a linear function.

v) w_0 is an interior point of W.

Let ŵ be a global minimizer of ∑_{i=1}^n (y_i − g(x_i, w))², which exists by the compactness of W and the continuity of g. Then

    g(x_{n+1}, ŵ) ± t_{(1−α/2); n−(m+2)λ−1} σ̂ √S(ŵ)    (28)

is a confidence interval for g(x_{n+1}, w) with asymptotic coverage probability 1 − α. Here t_{(1−α/2); n−(m+2)λ−1} denotes the 1 − α/2 quantile of a Student-t distribution with n − ((m + 2)λ + 1) degrees of freedom,

    σ̂² = (1 / (n − ((m + 2)λ + 1))) ∑_{i=1}^n (y_i − g(x_i, ŵ))²    (29)

and

    S(ŵ) = (1/n) [∇_w g(x_{n+1}, w)|_{w=ŵ}]^T Σ̂(ŵ)^{−1} [∇_w g(x_{n+1}, w)|_{w=ŵ}]    (30)

where

    Σ̂(ŵ) = (1/n) ∑_{i=1}^n [∇_w g(x_i, w) ∇_w g(x_i, w)^T]|_{w=ŵ}    (31)

Furthermore, assume that ε_{n+1} is normally distributed. Then

    I_ŵ(y_{n+1}) = g(x_{n+1}, ŵ) ± t_{(1−α/2); n−(m+2)λ−1} σ̂ √(1 + S(ŵ))    (32)

is an asymptotic prediction interval for y_{n+1}; that is,

    Pr[ y_{n+1} ∈ I_ŵ(y_{n+1}) ] → 1 − α    (33)

The proof of these results is given in [HD97].
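To make the construction concrete, the following Python sketch (an illustration under the stated assumptions; g and grad_g are placeholders for the fitted network and its gradient with respect to the weights, and scipy is assumed for the t quantile) computes the interval (32) from equations (29)-(31):

    import numpy as np
    from scipy import stats

    def prediction_interval(X, y, x_new, w_hat, g, grad_g, alpha=0.05):
        """Sketch of the asymptotic prediction interval (32) for y_{n+1}.

        g(x, w) evaluates the network; grad_g(x, w) returns the p-vector of
        derivatives of g with respect to the weights (both placeholders).
        """
        n, p = len(y), len(w_hat)                              # p = (m+2)*lambda + 1
        resid = y - np.array([g(xi, w_hat) for xi in X])
        sigma2 = resid @ resid / (n - p)                       # equation (29)
        G = np.array([grad_g(xi, w_hat) for xi in X])          # (n, p) gradients
        Sigma = G.T @ G / n                                    # equation (31)
        d = grad_g(x_new, w_hat)
        S = d @ np.linalg.solve(Sigma, d) / n                  # equation (30)
        t = stats.t.ppf(1 - alpha / 2, df=n - p)
        half = t * np.sqrt(sigma2) * np.sqrt(1 + S)            # half-width of (32)
        center = g(x_new, w_hat)
        return center - half, center + half

Dropping the "1 +" under the square root gives the confidence interval (28) for g(x_{n+1}, w) instead.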

A practical problem that occurs in many applications of ANNs is how to choose the network structure. When restricted to feedforward networks with only one hidden layer, this problem becomes how to choose the number of hidden neurons. One possible approach, which can be called the "prediction interval approach", is to choose the number of nodes so that the prediction interval has coverage probability close to the nominal level (e.g. 95% or 90%) and has the shortest expected length. Because both quantities are unknown, they must be estimated. The delete-one jackknife for the coverage probability can be used. Specifically, this involves deleting a pair (x_i, y_i) and using the rest of the data, together with x_i, to construct a prediction interval for y_i. By letting i vary, we have n intervals. The coverage probability can then be estimated by counting the proportion of times the intervals cover y_i. One can also calculate the average length of the n intervals and use it to estimate the expected length. Another possible approach, which can be called the "prediction error approach", is to choose the number of nodes to minimize the jackknife estimate of the prediction error.
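A sketch of the delete-one jackknife just described (fit and interval are placeholders for a training routine and an interval constructor such as the sketches above):

    import numpy as np

    def jackknife_coverage(X, y, fit, interval):
        """Delete-one jackknife estimates of coverage probability and length.

        fit(X, y) -> trained weights; interval(x_i, w) -> (lower, upper).
        X and y are assumed to be numpy arrays.
        """
        hits, lengths = [], []
        n = len(y)
        for i in range(n):
            keep = np.arange(n) != i                 # drop the i-th pair (x_i, y_i)
            w = fit(X[keep], y[keep])
            lo, hi = interval(X[i], w)               # interval for the held-out y_i
            hits.append(lo <= y[i] <= hi)
            lengths.append(hi - lo)
        return np.mean(hits), np.mean(lengths)       # coverage, expected length

The number of hidden nodes would then be chosen so that the estimated coverage is close to the nominal level with the shortest estimated expected length.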

Finally, another possible approach is bootstrap methods, the so-called "resampling techniques", which permit rather accurate estimates of the finite sample distribution of ŵ_n when {(x_i, y_i)}_{i=1}^n is a sequence of independent identically distributed (i.i.d.) random variables. The basic idea is to draw a large number N of random samples of size n with replacement from {(x_i, y_i)}_{i=1}^n, calculate ŵ_n for each of the N samples, say ŵ_n^(i), i = 1, 2, ..., N, and use the resulting empirical distribution of the estimates ŵ_n^(i) as an estimate of the sampling distribution of ŵ_n. The bootstrap methods are not recommended here, because computationally they are too time-consuming. These resampling techniques are beyond the scope of this report.
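For completeness, the resampling scheme reads as follows in outline (fit is again a placeholder for the training routine; the cost is one full retraining per replicate, which is the objection raised above):

    import numpy as np

    def bootstrap_weights(X, y, fit, N=200, seed=0):
        """Draw N bootstrap samples of size n with replacement and refit each,
        giving an empirical approximation to the sampling distribution of w_n."""
        rng = np.random.default_rng(seed)
        n = len(y)
        draws = []
        for _ in range(N):
            idx = rng.integers(0, n, size=n)   # sample with replacement
            draws.append(fit(X[idx], y[idx]))  # one full retraining per replicate
        return draws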

5 Application to Data

In this section the ANN will be applied to two examples: the first one is the well-known 'airline' data, and the second is the 'RESEX' data. Both time series are monthly observations, have been analyzed by many scientists, and are a baseline to compare different models.

The results reported in this paper were computed by separating each set of data into two subsets: the first n monthly observations, corresponding to times 1 to T, called the samples or training set, were used to fit the model, and the last 12, called the test set, corresponding to times T+1 to T+12, were used to make the forecasts. The data used to fit the model are also used for the training of the neural network; these data were rescaled to the interval [−1, 1]. The NN used to model the data and then to forecast is a feedforward network with one hidden layer and a bias in the hidden and output layers. The number of neurons m in the input layer is the same as the number of lags needed; these neurons do not perform any processing, they just distribute the input values to the hidden layer, serving as a sensory layer. In the hidden layer, different numbers of neurons are tried in order to choose the best architecture; the activation function used is the sigmoidal function z(w) = 1/(1 + e^{−w}). One neuron is used in the output, corresponding to the forecast, and it uses a linear activation function to obtain values in the real space. The forecasts were obtained using the data available (samples) and then giving a one-step forecast, bringing in the recently observed data one at a time. The model parameters were not re-estimated at each step when computing the forecasts.

The weights (parameters) to be used in the NN model are estimated from the data by minimizing the mean squared error mse = (1/n_effective) ∑_i (y_i − g(x_i, w))² of the within-sample one-step-ahead forecast errors, where n_effective denotes the number of effective observations used in fitting the model, because some data may be lost by differencing. To train the network, a backpropagation algorithm with momentum was used, which is an enhancement of the backpropagation algorithm. The network 'learns' by comparing the actual network output and the target; it then updates its weights by computing the first derivatives of the objective function, and uses momentum to escape from local minima.
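The momentum variant referred to here adds a fraction of the previous update to the current gradient step; a short sketch of the update rule (the names lr and mu are ours, and the coefficient values are placeholders):

    # Sketch of one backpropagation-with-momentum update for a weight vector w.
    # grad is dL/dw from backpropagation; delta_prev is the previous update.
    def momentum_step(w, grad, delta_prev, lr=0.01, mu=0.9):
        delta = mu * delta_prev - lr * grad   # momentum term helps escape local minima
        return w + delta, delta               # new weights and update to carry forward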

The statistics computed for each model were the following:

- S = ∑_i (e_i)², the sum of squared residuals up to time T, where e_i = y_i − g(x_i, ŵ) are the residuals, g(x_i, ŵ) is the output of the NN and y_i is the target (training set).
- The estimate of the residual standard deviation: σ̂ = √(S / (n_effective − p)), where p = (m + 2)λ + 1 is the number of parameters.
- The Akaike information criterion (AIC): AIC = n_effective ln(S / n_effective) + 2p
- The Bayesian information criterion (BIC): BIC = n_effective ln(S / n_effective) + p + p ln(n_effective)
- S_pred, the sum of squares of the one-step-ahead forecast errors of the test set.

To choose the architecture of the model that best fits the data, one could use the residual sum of squares S, but the larger the model is made (more neurons), the smaller S and the residual standard deviation become, and the more complicated the model gets. Instead, BIC and AIC are used as minimization criteria for choosing a 'best' model from candidate models having different numbers of parameters. In both criteria, the first term measures the fit and the rest is a penalty term to prevent overfitting, where BIC penalizes the extra parameters more severely than AIC does. Overfitting of the model is not wanted, because it produces a very poor forecast, giving another reason to choose AIC and BIC over S to select the best model. The lower the value obtained by these criteria, the better the model.
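These statistics are cheap to compute from the in-sample residuals; a short sketch matching the formulas listed above (n_eff and p as defined in the text):

    import numpy as np

    def selection_stats(resid, n_eff, p):
        """Model-selection statistics of Section 5 from in-sample residuals."""
        S = np.sum(resid ** 2)                         # sum of squared residuals
        sigma = np.sqrt(S / (n_eff - p))               # residual standard deviation
        aic = n_eff * np.log(S / n_eff) + 2 * p
        bic = n_eff * np.log(S / n_eff) + p + p * np.log(n_eff)
        return S, sigma, aic, bic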

To identify the different classes of neural models as expressed by equation (13), the notation NN(j_1, ..., j_k; λ) is used, which denotes a neural network with inputs at lags j_1, ..., j_k and with λ neurons in the hidden layer.

5.1 Artificial Neural Network for the Airline data

In this section we show an example on the well-known airline data, listed by Box et al. (1994), series G, and earlier by Brown (1962) (see Figure 2). The data of this series have an upward trend and a seasonal variation called multiplicative seasonality. The airline data comprise monthly totals of international airline passengers from January 1949 to December 1960.

The airline data was modeled by a special type of seasonal autoregressive integrated moving average (ARIMA) model, of order (0,1,1) × (0,1,1)_12 as described in Section 2.1, which has the form (1 − B^12)(1 − B) x_t = (1 − Θ_0 B^12)(1 − θB) a_t. After some operations, the following equation is obtained: x_t = x_{t−1} + x_{t−12} − x_{t−13} + a_t − θ a_{t−1} − Θ_0 a_{t−12} + θΘ_0 a_{t−13}, taking care to use the appropriate transformation to make the seasonality additive; in this case the natural logarithm is taken over the data.

We will use an ANN to fit and forecast the airline data because of the nonlinearity property of the NN models; this will allow us to deal with the multiplicative seasonality.

Choosing the Architecture

Different neural network architectures were evaluated with the statistics described in Section 5. The best model according to AIC and BIC is NN(1, 12, 13; 1), which also gives the best forecast with a minimum S_pred (see Table 1), so the NN(1, 12, 13; 1) model was selected for further results. Looking at the Box-Jenkins airline model, one can try to use the lags proposed there, i.e. (x_{t−1}, x_{t−12}, x_{t−13}), as the input to the neural network and then see its performance.

Lags     λ   p   S       std dev   AIC     BIC     S_pred   AIC pred.   BIC pred.
1,12,13  1   6   0.2583  0.0478    -715.8  -695.1  0.1556   -40.1       -31.2

Table 1: Results obtained for the NN model chosen for the airline data.

Forecasting and Prediction Intervals

After selecting the model, it is used to forecast the rest of the data (test data) using one-step-ahead forecasts. The result is shown in Table 2 and represented in Figure 2. By using equations (29) to (32), the asymptotic prediction interval is calculated for each one-step forecast. The prediction intervals computed for α = 0.05 and α = 0.10 are shown in Figure 3, and the values obtained are shown in Table 2.

Month   Target   Prediction (α = 0.05)   Prediction (α = 0.10)
133     417      428.9431 ± 26.9128      428.9431 ± 21.3724
134     391      398.4319 ± 26.9475      398.4319 ± 21.3999
135     419      461.1852 ± 26.8404      461.1852 ± 21.3152
136     461      417.7294 ± 28.4284      417.7294 ± 22.5760
137     472      482.4761 ± 29.9372      482.4761 ± 23.7742
138     535      517.8650 ± 29.8691      517.8650 ± 23.7202
139     622      573.9760 ± 29.9452      573.9760 ± 23.7806
140     606      581.7285 ± 31.6975      581.7285 ± 25.1721
141     508      598.4535 ± 32.0805      498.4535 ± 25.4763
142     461      450.3307 ± 32.0715      450.3307 ± 25.4692
143     390      414.2488 ± 32.0719      414.2488 ± 25.4695
144     432      442.6636 ± 32.3828      442.6636 ± 25.7163

Table 2: Prediction of the NN.

Figure 2: (left) Airline data and its NN model and prediction. (right) RESEX data and its NN model and prediction.

Figure 3: Asymptotic prediction interval for the Airline data with α = 0.05 (left) and α = 0.10 (right).

5.2 Artificial Neural Network for the RESEX data

In this section the procedure is applied to the Residence Telephone Extensions Inward Movement (Bell Canada) data, known as the RESEX data. The chosen series is a monthly series of "inward movement" of residential telephone extensions in a fixed geographic area in Canada from January 1966 to May 1973, a total of 89 data points. This series has two extremely large values in November and December 1972, as shown in Figure 2. The two obvious outliers have a known cause, namely a bargain month (November) in which residence extensions could be requested free of charge. Most of the orders were filled during December, with the remainder being filled in January.

Brubacher (1974) identified the stationary series as an ARIMA (2,0,0) × (0,1,0)_12 model, i.e., the RESEX data is represented by an AR(2) model after differencing. As described in 2.1, it has the form (1 − φ_1 B − φ_2 B²)(1 − B^12) x_t = a_t, and after some operations x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + x_{t−12} − φ_1 x_{t−13} − φ_2 x_{t−14} + a_t.

Choosing the Architecture

Different architectures were tested, and the best result is obtained by NN(1, 2, 12, 13, 14; 1) (see Table 3), the NN using the lags of the ARIMA model. So this model was chosen for further results.

Lags           λ   p   S       std dev   AIC     BIC     S_pred   AIC pred.   BIC pred.
1,2,12,13,14   1   8   0.6876  0.1118    -268.6  -243.5  25.6     25.1        37.0

Table 3: Results obtained for the NN model chosen for the RESEX data.

Forecasting and Prediction Intervals

After selecting the model, it is used to forecast the rest of the data (test data) using one-step-ahead forecasts. The result is shown in Table 4, and it is represented in Figure 2.

Month   Target   Prediction (α = 0.05)   Prediction (α = 0.10)
78      24309    28360.2 ± 6171.8        28360.2 ± 4976.9
79      24998    27717.7 ± 6616.9        27717.7 ± 5335.8
80      25996    25508.0 ± 6786.7        25508.0 ± 5472.7
81      27583    29374.2 ± 6718.8        29374.2 ± 5418.0
82      22068    25364.3 ± 6757.6        25364.3 ± 5449.3
83      75344    22993.4 ± 6997.7        22993.4 ± 5642.9
84      47365    39153.2 ± 31888.5       39153.2 ± 25714.7
85      18115    39670.5 ± 31993.2       39670.5 ± 25799.2
86      15184    28882.0 ± 34279.9       28882.0 ± 27643.1
87      19832    19117.6 ± 35184.9       19117.6 ± 28372.8
88      27597    29709.1 ± 34676.1       29709.1 ± 27962.6
89      34256    35221.7 ± 34300.1       35221.7 ± 27659.4

Table 4: Prediction of the NN for the RESEX data.

By using equations (29) to (32), the asymptotic prediction interval is calculated for each one-step forecast. The results are shown in Table 4 and in Figure 4 for α = 0.05 and α = 0.10. Both graphics in Figure 4 show large prediction intervals, because of the huge outliers present, giving a poor forecast with a lot of error and variance. But the NN model at least tried to follow the trend of the data in the outlier region.

Figure 4: Asymptotic prediction interval with α = 0.05 (left) and α = 0.10 (right) for the RESEX data.

6 Conclusions

The premise of this paper was that the learning methods in ANN are sophisticated statistical procedures, and that tools developed for the study of statistical procedures generally not only yield useful insights into the properties of specific learning procedures but also suggest valuable improvements in, alternatives to, and generalizations of existing learning procedures.

Particularly applicable are asymptotic analytical methods that describe the behavior of statistics when the size n of the training set is large. At present, there is no easy answer to the question of how large n must be for the approximators described earlier to be "good".

The advantage of the ANN technique proposed in this paper is that it provides a methodology for model-free approximation; i.e., the weight vector estimation is independent of any model. It has liberated us from the procedures of model-based selection and from assumptions on the sample data. While nonlinear systems theory is still in a state of development, we can conclude that the ANN approach suggests a competitive and robust method for system analysis, forecasting and control. The ANN presented is a superior technique for modeling compared with other nonlinear time series models such as bilinear models, threshold autoregressive models and regression trees. The connections between forecasting, data compression and neurocomputing shown in this report also seem very interesting for time series analysis.

To decide which architecture one may use to model some time series: first, it is possible to try traditional methods, using the simple autocorrelation function to find the kind of time series we are dealing with and, indeed, the lags that are to be used as input to the NN. Second, to select the number of hidden neurons, we start with one and then increase it until the performance evaluated by AIC and BIC becomes worse. Then we train the network with the first part of the data, and finally use the last part to forecast. Asymptotic prediction intervals are computed for each one-step forecast, to show the limits within which the data are moving.

The study of the stochastic convergence properties (consistency, limiting distribution) of any proposed new learning procedure is strongly recommended, in order to determine what it is that the ANN eventually learns and under what specific conditions. Derivation of the limiting distribution will generally reveal the statistical efficiency of the new procedure relative to existing procedures and may suggest modifications capable of improving statistical efficiency. Furthermore, the availability of the limiting distribution makes valid statistical inferences possible. Such inferences can be of great value in the search for optimal network architectures in particular applications. A wealth of applicable theory is already available in the statistics, engineering, system identification and optimization theory literatures.

It is also evident that the field of statistics has much to gain from neurocomputing techniques. Analyzing neural network learning procedures poses a host of interesting theoretical and practical challenges for statistical methods; all is not cut and dried. Most important, however, neural network models provide a novel, elegant and rich class of mathematical statistical tools for data analysis.

In spite of the robust forecast performance of ANN, some problems remain to be solved. For example: (i) How many input nodes are required for a seasonal time series? (ii) How should outlier data be treated? (iii) How can the problem of overfitting be avoided? (iv) How can the (1 − α)% confidence interval for the forecast be found? (v) How should missing data be treated?

In general, the following conclusions and guidelines can be stated concerning the use of statistical methods and ANN:

1. If the functional form linking inputs and output is unknown, only known to be extremely complex, or of no interest to the investigator, an analysis using ANN may be best. The availability of large training data sets and powerful computing facilities are requirements for this approach.

2. If the underlying physics of the data generating process are to be incorporated into the analysis, a statistical approach may be the best. Generally, fewer parameters need to be estimated and the training data sets can be substantially smaller. Also, if measures of uncertainty are desired, either in parameter estimates or forecasts, a statistical analysis is mandatory. If the models fit to data are to be used to delve into the underlying mechanisms, and if measures of uncertainty are sought, a statistical approach can give more insight. In this sense, statistics provides more value added to a data analysis; it will probably require a higher level of effort to ascertain the best fitting model, but errors in predictions, errors in parameter estimates, and assessment of model adequacy are available in statistical analysis. In addition to providing measures of parameter and prediction uncertainty, statistical models inherently possess more structure than ANN do, which are often regarded as "black boxes". This structure is manifested as the specification of a random component in statistical models. As such, statistical methods have more limited application. If a nonlinear relationship exists between inputs and outputs, then data of this complexity may best be modeled by an ANN. A summary of these considerations can be found in Table 7 (see Appendix C).


A Appendix: Asymptotic distribution of ŵ_n

Under certain mild regularity assumptions, it can be shown [HD97] that the asymptotic distribution of the standardized quantity √n (ŵ_n − w_0) is zero mean multivariate normal with covariance matrix C = A^{−1} B A^{−1}, where ŵ_n is the estimated and w_0 the true parameter vector, and

    A = E[∇∇ r(z, w_0)]  and  B = E[∇r(z, w_0) ∇r(z, w_0)^T]

The matrices A and B are non-singular, with ∇ and ∇∇ denoting the (p × 1) gradient and (p × p) Hessian operators with respect to w (p is the number of network parameters). However, since the true parameters w_0 are not known, the weakly consistent estimator Ĉ = A_n^{−1} B_n A_n^{−1} of the covariance matrix C has to be used instead, where

    A_n = n^{−1} ∑_{i=1}^n ∇∇ r(z_i, ŵ_n)    (34)

    B_n = n^{−1} ∑_{i=1}^n ∇r(z_i, ŵ_n) ∇r(z_i, ŵ_n)^T    (35)

    r(z_i, ŵ_n) = (1/n) (y_i − g(x_i; ŵ_n))²    (36)

This has no effect on the asymptotic distribution of the network's parameters, although a larger n will be needed to obtain an approximation as good as if C itself were available. The single most important assumption made is that ŵ_n is a locally unique solution, i.e. none of its parameters can be expressed in terms of the others, or equivalently, the network is not overparameterized. This is reflected in the natural requirement that the matrices A and B are non-singular.

The fact that √n (ŵ_n − w_0) ~ N(0, C) can be used to robustly estimate the standard error of any complex function of ŵ_n, i.e. θ = ρ(ŵ_n), without the need for an analytic derivation. By stochastically sampling from the distribution of ŵ_n, we can inexpensively create a sufficiently large number r of parameter vectors w_n^(s), where s = 1, 2, ..., r, and then compute the estimate σ̂_A of the standard error as follows:

    σ̂_A = [ (r − 1)^{−1} ∑_{s=1}^r (θ^(s) − θ^(0))² ]^{1/2}    (37)

where

    θ^(0) = r^{−1} ∑_{s=1}^r θ^(s) = r^{−1} ∑_{s=1}^r ρ(w_n^(s))    (38)

The scheme is independent of the functional ρ(·) and is much less computationally demanding, compared to the bootstrap for example, since the estimate ŵ_n has to be obtained only once (see [RZ99]).
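A sketch of this sampling scheme (our illustration; rho, w_hat and C_hat are inputs, and multivariate normal sampling stands in for the asymptotic law of ŵ_n):

    import numpy as np

    def sampled_std_error(w_hat, C_hat, rho, n, r=1000, seed=0):
        """Estimate the standard error of theta = rho(w_n) via (37)-(38):
        sample parameter vectors from the asymptotic N(w_hat, C_hat / n) law,
        push each through rho, and take the spread of the results."""
        rng = np.random.default_rng(seed)
        ws = rng.multivariate_normal(w_hat, C_hat / n, size=r)    # w_n^(s), s = 1..r
        thetas = np.array([rho(w) for w in ws])
        theta0 = thetas.mean()                                    # equation (38)
        return np.sqrt(np.sum((thetas - theta0) ** 2) / (r - 1))  # equation (37)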


B Appendix: Time Series Data

SERIES G: International Airline Passengers: monthly totals (thousands of passengers).

Year   JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC
1949   112  118  132  129  121  135  148  148  136  119  104  118
1950   115  126  141  135  125  149  170  170  158  133  114  140
1951   145  150  178  163  172  178  199  199  184  162  146  166
1952   171  180  193  181  183  218  230  242  209  191  172  194
1953   196  196  236  235  229  243  264  272  237  211  180  201
1954   204  188  235  227  234  264  302  293  259  229  203  229
1955   242  233  267  269  270  315  364  347  312  274  237  278
1956   284  277  317  313  318  374  413  405  355  306  271  306
1957   315  301  356  348  355  422  465  467  404  347  305  336
1958   340  318  362  348  363  435  491  505  404  359  310  337
1959   360  342  406  396  420  472  548  559  463  407  362  405
1960   417  391  419  461  472  535  622  606  508  461  390  432

Table 5: Series G

RESEX: Residence Telephone Extensions Inward Movement (Bell Canada).

Year   JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC
1966   10165  9279   10930  15876  16485  14075  14168  14535  15367  13396  12606  12932
1967   10545  10120  11877  14752  16932  14123  14777  14943  16573  15548  15838  14159
1968   12689  11791  12771  16952  21854  17028  16988  18797  18026  18045  16518  14425
1969   13335  12395  15450  19092  22301  18260  19427  18974  20180  18395  15596  14778
1970   13453  13086  14340  19714  20796  18183  17981  17706  20923  18380  17343  15416
1971   12465  12442  15448  21402  25437  20814  22066  21528  24418  20853  20673  18746
1972   15637  16074  18422  27326  32883  24309  24998  25996  27583  22068  75344  47365
1973   18115  15184  19832  27597  34256

Table 6: RESEX


C Appendix: Comparison between statistical methods (SM) and artificial neural networks (ANN)

Characteristics           SM                                    ANN
General                   Randomness; variability;              Complex, nonlinear input/output
                          structured model;                     relationships; multiple outputs
                          single or few outputs
Data required             Relatively small training data        Massive training data sets needed
                          sets; may require probability         to estimate weights
                          distributions
Model specifications      Physical law models;                  No process knowledge required;
                          linear discrimination                 nonlinear discrimination
Goodness-of-fit           Many possibilities;                   Few possibilities; least squares;
criterion                 best fit can be tested                no best-fit test
Parameter estimation      Relatively few; iterative training    Relatively many (weights);
                          for nonlinear, else noniterative      iterative training; severe demands
                          computer time                         on computer time
Outputs                   Uncertainties for parameter           Response surfaces (splines) can be
                          estimates and predicted values;       multivariate vectors; no uncertainty
                          residual diagnostics can provide      computations; minimal diagnostics
                          physical insight
Computer power required   Low                                   High; parallel processing possible
Trends                    Evolutionary techniques               Evolutionary design possible
                          not yet used

Table 7: Statistical Analysis versus ANN.

References

[AG96] H. Allende and J. Galbiati. Robust test in time series model. Journal of Interamerican Statistical Institute, 1(48):35-79, 1996.

[AH92] H. Allende and S. Heiler. Recursive generalized M-estimates for autoregressive moving average models. Journal of Time Series Analysis, (13):1-18, 1992.

[Bax90] W. G. Baxt. Use of an artificial neural network for data analysis in clinical decision making: The diagnosis of acute coronary occlusion. Neural Computation, (2):480-489, 1990.

[BD91] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer Verlag, 1991.

[BFOS84] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone. Classification and regression trees. Technical report, Belmont, C.A., Wadsworth, 1984.

[BJR94] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis, Forecasting and Control. Englewood Cliffs: Prentice Hall, 3rd edition, 1994.

[BO93] B. L. Bowerman and R. T. O'Connell. Forecasting and time series: an applied approach. Duxbury Press, 3rd edition, 1993.

[CM94] J. T. Connor and R. D. Martin. Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, 2(5):240-253, 1994.

[CT94] B. Cheng and D. M. Titterington. Neural networks: a review from a statistical perspective. Statistical Science, (1):2-54, 1994.

[Fin99] T. L. Fine. Feedforward Neural Network Methodology. Springer, 1999.

[FR90] B. Flury and H. Riedwyl. Multivariate Statistics: A practical Approach. Chapman Hall, 1990.

[Fri91] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, (19):1-141, 1991.

[HD97] J. T. G. Hwang and A. A. Ding. Prediction intervals for artificial neural networks. JASA, 92(438):748-757, 1997.

[Hut94] J. M. Hutchinson. A Radial Basis Function Approach to Financial Time Series Analysis. PhD thesis, Massachusetts Institute of Technology, 1994.

[LB78] G. M. Ljung and G. E. P. Box. On a measure of lack of fit in time series models. Biometrika, 65:297-303, 1978.

[LG94] J. L. Lin and C. W. Granger. Forecasting from non-linear models in practice. International Journal of Forecasting, 13:1-9, 1994.

[Lip87] R. P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, pages 4-22, 1987.

[Rao81] T. Subba Rao. On the theory of bilinear models. Journal of the Royal Statistical Society B, (43):244-255, 1981.

[Rip93] B. D. Ripley. Statistical aspects of neural networks. In Networks and Chaos - Statistical and Probabilistic Aspects. Ed. O. E. Barndorff-Nielsen, J. L. Jensen, W. S. Kendall, Chapman and Hall, 1993.

[RZ99] A. P. N. Refenes and A. D. Zapranis. Neural model identification, variable selection and model adequacy. Journal of Forecasting, 18:299-322, 1999.

[Sar94] W. S. Sarle. Neural networks and statistical methods. In Proc. of the 19th Annual SAS Users Group International Conference, 1994.

[Ste96] H. S. Stern. Neural networks in applied statistics. Technometrics, 38(3):205-214, 1996.

[Sus92] H. J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, (5):589-593, 1992.

[Ton90] H. Tong. Non-linear Time Series. Oxford University Press, 1990.

[WY92] F. Y. Wu and K. K. Yen. Application of neural network in regression analysis. In Proc. of the 14th Annual Conference on Computers and Industrial Engineering, 1992.
