MLHEP 2015: Introductory Lecture #4
MACHINE LEARNING IN HIGH ENERGY PHYSICS
LECTURE #4
Alex Rogozhnikov, 2015
WEIGHTED VOTING
The way to introduce the importance of classifiers:
D(x) = \sum_j \alpha_j d_j(x)
GENERAL CASE OF ENSEMBLING
D(x) = f(d_1(x), d_2(x), \dots, d_J(x))
COMPARISON OF DISTRIBUTIONS
good options to compare distributions in 1d
use the classifier's output to build a discriminating variable
ROC AUC + Mann-Whitney test to get the significance
RANDOM FOREST
composition of independent trees
ADABOOST
After building the j-th base classifier:
1. compute its voting weight:
\alpha_j = \frac{1}{2} \ln\left(\frac{w_{correct}}{w_{wrong}}\right)
2. increase the weights of misclassified events:
w_i \leftarrow w_i \, e^{-\alpha_j y_i d_j(x_i)}
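One AdaBoost iteration can be sketched in numpy; the labels, predictions and initial weights below are toy values chosen for illustration:

```python
import numpy as np

# One AdaBoost iteration: y_i in {-1, +1}, base classifier
# predictions d_j(x_i) in {-1, +1} (toy data, for illustration).
y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, 1, 1])    # d_j(x_i)
w = np.ones(len(y)) / len(y)          # current event weights

w_correct = w[pred == y].sum()
w_wrong = w[pred != y].sum()
alpha = 0.5 * np.log(w_correct / w_wrong)   # classifier weight

# misclassified events (y_i * d_j(x_i) = -1) get their weights increased
w = w * np.exp(-alpha * y * pred)
w /= w.sum()
```

After the update, the misclassified events carry a larger share of the total weight, so the next base classifier focuses on them.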
GRADIENT BOOSTING
D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x)
D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)
At the j-th iteration:
compute pseudo-residuals
z_i = -\left.\frac{\partial L}{\partial D(x_i)}\right|_{D(x) = D_{j-1}(x)}
train a regressor d_j to minimize MSE:
\sum_i (d_j(x_i) - z_i)^2 \to \min
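The iteration above can be sketched in a few lines for the simplest case, MSE loss, where the pseudo-residuals are just z_i = y_i - D_{j-1}(x_i). This is a minimal illustration on toy data, not the lecture's exact implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Minimal gradient boosting for regression with MSE loss (a sketch).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

eta, n_trees = 0.1, 100          # learning_rate and number of trees
D = np.zeros(len(y))             # current ensemble prediction D_{j-1}(x)
trees = []
for _ in range(n_trees):
    z = y - D                                     # pseudo-residuals for MSE
    tree = DecisionTreeRegressor(max_depth=3).fit(X, z)
    D += eta * tree.predict(X)                    # D_j = D_{j-1} + eta * d_j
    trees.append(tree)

mse = np.mean((D - y) ** 2)
```

For a different loss, only the line computing `z` changes: the pseudo-residuals become the negative gradient of that loss with respect to the current predictions.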
Mean Squared Error: \sum_i (d(x_i) - y_i)^2
Mean Absolute Error: \sum_i |d(x_i) - y_i|
binary classification (y_i = \pm 1):
ExpLoss (aka AdaLoss): \sum_i e^{-y_i d(x_i)}
LogLoss: \sum_i \log(1 + e^{-y_i d(x_i)})
ranking losses
boosting to uniformity: FlatnessLoss
TUNING GRADIENT BOOSTING OVER DECISION TREES
Parameters:
loss function
pre-pruning: maximal depth, minimal leaf size
subsample, max_features
\eta (learning_rate)
number of trees
LOSS FUNCTION
different combinations of losses may also be used
TUNING GBDT
1. set a high but feasible number of trees
2. find optimal parameters by checking combinations
3. decrease the learning rate, increase the number of trees
See also: GBDT tutorial
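The three-step recipe above might look like this with scikit-learn's GBDT; all parameter values and grids here are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# 1-2. fix a feasible number of trees, scan the other parameters
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.2),
    param_grid={'max_depth': [2, 3], 'subsample': [0.5, 1.0]},
    cv=3)
grid.fit(X, y)

# 3. decrease the learning rate, increase the number of trees
final = GradientBoostingClassifier(
    n_estimators=500, learning_rate=0.04, **grid.best_params_).fit(X, y)
```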
ENSEMBLES: BAGGING OVER BOOSTING
Different variations are claimed to outperform a single GBDT.
Training is much more expensive; quality is better if the GB estimators are overfitted.
Correcting the output of several classifiers with a new classifier:
D(x) = f(d_1(x), d_2(x), \dots, d_J(x))
To use unbiased predictions, use a holdout or kFolding.
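A minimal sketch of this stacking scheme with kFolding: the second-level classifier f is trained on out-of-fold predictions (`cross_val_predict`), so they are unbiased. The base models and data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=1)

base = [RandomForestClassifier(n_estimators=50, random_state=0),
        LogisticRegression(max_iter=1000)]

# out-of-fold predictions d_1(x), ..., d_J(x) via kFolding
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=3, method='predict_proba')[:, 1]
    for clf in base])

# second-level classifier: D(x) = f(d_1(x), ..., d_J(x))
f = LogisticRegression().fit(meta_features, y)
```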
REWEIGHTING
Given two distributions, target and original, find new weights for the original distribution such that the two distributions coincide.
REWEIGHTING
Solution for 1 or 2 variables: reweight using bins (so-called 'histogram division').
Typical application: Monte-Carlo reweighting.
REWEIGHTING WITH BINS
Splitting the whole space into bins, for each bin:
compute w_{target}, w_{original} in the bin
multiply the weights of the original distribution in the bin by w_{target} / w_{original} to compensate the difference

good in 1d, works sometimes in 2d, nearly impossible in dimensions > 2
too few events in a bin makes the reweighting rule unstable
we can reweight several times over 1d if the variables aren't correlated
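The 1d histogram-division procedure can be sketched with numpy; the Gaussian toy distributions and binning below are arbitrary:

```python
import numpy as np

# Toy original and target samples (illustrative).
rng = np.random.RandomState(0)
original = rng.normal(0.0, 1.0, 10000)
target = rng.normal(0.5, 1.0, 10000)

edges = np.linspace(-4, 5, 30)
w_orig, _ = np.histogram(original, bins=edges)
w_tgt, _ = np.histogram(target, bins=edges)

# per-bin ratio w_target / w_original (empty bins get weight 0)
ratio = np.where(w_orig > 0, w_tgt / np.maximum(w_orig, 1), 0.0)

# assign each original event the ratio of its bin
bin_idx = np.clip(np.searchsorted(edges, original) - 1, 0, len(ratio) - 1)
new_weights = ratio[bin_idx]
```

After reweighting, the weighted mean of the original sample moves close to the target mean; with few events per bin the ratio becomes noisy, which is exactly the instability noted above.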
reweighting using an enhanced version of the bin reweighter
reweighting using a gradient boosted reweighter
GRADIENT BOOSTED REWEIGHTER
works well in high dimensions
works with correlated variables
produces a more stable reweighting rule
HOW DOES THE GB REWEIGHTER WORK?
Iteratively:
1. find a tree which is able to discriminate the two distributions
2. correct the weights in each leaf
3. reweight the original distribution and repeat the process

Fewer bins, and each bin is guaranteed to have high statistics.
DISCRIMINATING TREE
When looking for a tree with good discrimination, optimize
\chi^2 = \sum_{l \in leaves} \frac{(w_{l,target} - w_{l,original})^2}{w_{l,target} + w_{l,original}}
As before, greedy optimization is used to build a tree which provides maximal \chi^2,
+ introducing all the heuristics from standard GB.
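The splitting criterion above is a symmetrized chi-squared over leaf weights; computing it for a candidate tree is a one-liner (the per-leaf weights here are toy numbers):

```python
import numpy as np

# Summed weights of target and original events in each leaf
# of a candidate tree (toy values, for illustration).
w_target = np.array([10.0, 40.0, 50.0])
w_original = np.array([30.0, 40.0, 30.0])

chi2 = np.sum((w_target - w_original) ** 2 / (w_target + w_original))
```

A leaf where the two distributions agree (here the middle one) contributes nothing; leaves where they disagree dominate the criterion.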
QUALITY ESTIMATION
The quality of a model can be estimated by the value of the optimized loss function. Remember, though, that different classifiers use different optimization targets:
with different targets of optimization, use general quality metrics to compare them.
TESTING HYPOTHESIS WITH CUT
Example: we test the hypothesis that there is a signal channel with fixed Br vs. no signal channel (Br = 0).
H_0: signal channel with Br = const
H_1: there is no signal channel
Putting a threshold: d(x) > threshold.
Estimate the numbers of signal / bkg events that will be selected: s \propto tpr, b \propto fpr
H_0: n_{obs} \sim Poiss(s + b)
H_1: n_{obs} \sim Poiss(b)
TESTING HYPOTHESIS WITH CUT
Select some appropriate metric:
AMS^2 = 2(s + b)\log\left(1 + \frac{s}{b}\right) - 2s
Maximize it by selecting the best threshold:
a special holdout shall be used, or use kFolding
this makes very poor use of the information from the classifier
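Scanning thresholds to maximize AMS can be sketched as follows; the expected totals S, B and the toy tpr/fpr curves are assumptions for illustration, not real efficiencies:

```python
import numpy as np

def ams(s, b):
    # AMS^2 = 2 (s + b) log(1 + s/b) - 2 s
    return np.sqrt(2 * ((s + b) * np.log(1 + s / b) - s))

S, B = 100.0, 10000.0                 # assumed expected totals
thresholds = np.linspace(0.1, 0.9, 9)
tpr = 1 - thresholds                  # toy efficiency curves
fpr = (1 - thresholds) ** 4           # background falls faster

scores = ams(S * tpr, B * fpr)
best_threshold = thresholds[np.argmax(scores)]
```

In practice s and b come from the classifier's ROC curve on a holdout, as noted above.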
TESTING HYPOTHESIS WITH BINS
Splitting all data into bins:
x \in k\text{-th bin} \iff thr_{k-1} < d(x) \le thr_k
Expected amounts of signal and background:
s_k = (tpr_{k-1} - tpr_k), \quad b_k = (fpr_{k-1} - fpr_k)
H_0: n_k \sim Poiss(s_k + b_k)
H_1: n_k \sim Poiss(b_k)
Optimal statistic:
\sum_k \theta_k n_k, \quad \theta_k = \log\left(1 + \frac{s_k}{b_k}\right)
Take some approximation to the test power and optimize the thresholds.
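Evaluating the binned statistic is straightforward once the per-bin expectations are known; the yields and observed counts below are toy numbers:

```python
import numpy as np

# Toy expected signal and background yields per bin (illustrative).
s_k = np.array([1.0, 5.0, 20.0])
b_k = np.array([100.0, 50.0, 10.0])

theta = np.log(1 + s_k / b_k)       # per-bin weights theta_k
n_obs = np.array([102, 54, 28])     # toy observed counts n_k

statistic = float(n_obs @ theta)    # sum_k theta_k n_k
```

Bins with a higher s/b ratio get exponentially larger weights, which is why this uses the classifier's information much better than a single cut.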
FINDING OPTIMAL PARAMETERS
some algorithms have many parameters
not all the parameters can be guessed
checking all combinations takes too long

FINDING OPTIMAL PARAMETERS
randomly picking parameters is a partial solution
given a target quality metric, we can optimize it, but:
no gradient with respect to the parameters
noisy results
reconstructing the function is a problem
Before running grid optimization make sure your metric is stable (e.g. by training/testing on different subsets).
Overfitting by using many attempts is a real issue.
OPTIMAL GRID SEARCH
stochastic optimization (Metropolis-Hastings, annealing)
regression techniques, reusing all known information (ML to optimize ML!)
OPTIMAL GRID SEARCH USING REGRESSION
General algorithm (a point of the grid = a set of parameters):
1. evaluate the quality at random points
2. build a regression model based on the known results
3. select the point with the best expected quality according to the trained model
4. evaluate the quality at this point
5. go to 2 if there are not enough evaluations
Why not use linear regression?
GAUSSIAN PROCESSES FOR REGRESSION
Some definitions: Y \sim GP(m, K), where m and K are the mean and covariance functions: m(x), K(x_1, x_2).
m(x) = \mathbb{E}\, Y(x)
represents our prior expectation of the quality (may be taken constant)
K(x_1, x_2) = \mathbb{E}\, Y(x_1) Y(x_2)
represents the influence of known results on the expectation at new points
The RBF kernel is the most useful:
K(x_1, x_2) = \exp(-||x_1 - x_2||^2)
We can model the posterior distribution of the results at each point.
Gaussian Process Demo on Mathematica
Also see: http://www.tmpl.fi/gp/
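The GP posterior with an RBF kernel can be computed in closed form; this is a minimal sketch assuming a zero prior mean, a small observation-noise term `sigma_n`, and toy 1d data:

```python
import numpy as np

def rbf(a, b, length=1.0):
    # RBF kernel K(x1, x2) = exp(-|x1 - x2|^2 / (2 length^2))
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length ** 2))

x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)                 # toy observed qualities
x_test = np.linspace(-3, 3, 50)
sigma_n = 1e-3                            # assumed observation noise

K = rbf(x_train, x_train) + sigma_n * np.eye(len(x_train))
K_s = rbf(x_test, x_train)

post_mean = K_s @ np.linalg.solve(K, y_train)
post_var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
```

The posterior variance shrinks near evaluated points and grows far from them, which is exactly what the regression-based grid search exploits when picking the next point.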
PREDICTION SPEED
Cuts vs classifier:
cuts are interpretable
cuts are applied really fast
ML classifiers are applied much slower
SPEED UP WAYS
Method-specific:
logistic regression: sparsity
neural networks: removing neurons
GBDT: pruning by selecting the best trees
Staging of classifiers (a la triggers)
SPEED UP: LOOKUP TABLES
Split the space of features into bins; a simple piecewise-constant function.

SPEED UP: LOOKUP TABLES
Training:
1. split each variable into bins
2. replace values with the index of the bin
3. train any classifier on the indices of bins
4. create a lookup table (evaluate the answer for each combination of bins)
Prediction:
1. convert features to bins' indices
2. take the answer from the lookup table
LOOKUP TABLES
speed is comparable to cuts
allows using any ML model behind
used in the LHCb topological trigger ('Bonsai BDT')

too many combinations when the number of features is large: 8 bins in 10 features give 8^{10} \approx 10^9 entries (\sim 1 Gb)
finding optimal thresholds of bins is a problem
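The train/predict procedure can be sketched in 2d; the data, the choice of logistic regression as the model behind the table, and the 8-bin grid are all illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2d data: signal when x0 + x1 > 1.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

n_bins = 8
edges = [np.linspace(0, 1, n_bins + 1)[1:-1] for _ in range(2)]

# training: replace values with bin indices, fit any classifier
X_idx = np.column_stack([np.digitize(X[:, f], edges[f]) for f in range(2)])
clf = LogisticRegression().fit(X_idx, y)

# evaluate once per bin combination -> an 8 x 8 table
grid = np.array([[i, j] for i in range(n_bins) for j in range(n_bins)])
table = clf.predict_proba(grid)[:, 1].reshape(n_bins, n_bins)

# prediction: convert features to bin indices, read from the table
x_new = np.array([[0.9, 0.8]])
idx = np.column_stack([np.digitize(x_new[:, f], edges[f]) for f in range(2)])
prob = table[idx[:, 0], idx[:, 1]]
```

Prediction is just binning plus one array lookup, which is why the speed is comparable to cuts.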
UNSUPERVISED LEARNING: PCA [PEARSON, 1901]
PCA finds the axes along which the variation is maximal (based on the principal axis theorem).

PCA DESCRIPTION
PCA is based on the decomposition of the covariance matrix Q:
Q = U \Lambda U^T
where U is an orthogonal matrix and \Lambda is a diagonal matrix:
\Lambda = diag(\lambda_1, \lambda_2, \dots, \lambda_n), \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n
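The decomposition above maps directly onto numpy; the correlated toy data below is generated only to have something to decompose:

```python
import numpy as np

# Toy correlated 2d data (illustrative).
rng = np.random.RandomState(0)
X = rng.randn(500, 2) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)

# Q = U Lambda U^T, components sorted by decreasing eigenvalue
Q = np.cov(Xc, rowvar=False)
eigvals, U = np.linalg.eigh(Q)           # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

projected = Xc @ U                        # data in the principal axes
```

The variance of the data along the first principal axis equals the largest eigenvalue, which is the "maximal variation" property stated above.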
Emotion = [scared] + [laughs] + [angry] + ...
Electron/jet recognition in ATLAS
autoencoding: the target is y = x, minimize the reconstruction error
\sum_i ||y_i - \hat{y}_i||^2 \to \min
the hidden layer is smaller than the input
CONNECTION TO PCA
When optimizing MSE:
\sum_i (\hat{y}_i - x_i)^2 \to \min
and the activation is linear, the optimal solution of the optimization problem is PCA.
There is an informational bottleneck: the algorithm will be eager to pass through the most valuable information.
MANIFOLD LEARNING TECHNIQUES
Digits samples from MNIST
PCA and autoencoder:
MARKOV CHAIN IN TEXT
TOPIC MODELLING (LDA)
bag of words
a generative model that describes the different probabilities
fitting by maximizing the log-likelihood
the parameters of this model are useful
COLLABORATIVE RESEARCH
problem: different environments
problem: data ping-pong
solution: a dedicated machine (server) for the team
restarting the research = running a script or notebook!
ping www.bbc.co.uk