
Cognitive Systems Group, Dept. of Computer Science, University of Hamburg, Vogt-Kölln-Str. 30, 22527 Hamburg, Germany

Diploma Thesis

Polygonal Approximation of Laser Range Scan Data Using Extended EM

Leonid Tcherniavski

Supervisors: Dr. rer. nat. U. Köthe, Prof. Dr.-Ing. H. Siegfried Stiehl

\int_x q(x) \int_z \log\big(p(x, z\,|\,\Theta)\big)\, q(z\,|\,x)\, dz\, dx \;\approx\; \sum_i sdd(x_i) \sum_l \log\big(\alpha_l\, p(x_i\,|\,z = l, \Theta)\big)\, p(z = l\,|\,x_i, \Theta)


Acknowledgements

I wish to thank my first supervisor, Dr. Ullrich Köthe, for his continual support and encouragement throughout the course of this project. I also wish to thank Hans Meine for providing images for the evaluation and for technical support.

I also would like to thank my wife, Natalja Tcherniavskaja, for her patience and love, which have been invaluable for the success of my studies in general and this project in particular.

Many thanks to George Harrison and Mario Krizanac for their help and feedback on this paper.


Abstract

Polygonal approximation requires not only the estimation of optimal model parameters but also the adjustment of the optimal number of model components. A new EM framework based on the Kullback-Leibler divergence was proposed to fit the model to a nonparametric estimate of the data density. The method extends the classical EM framework by Segment Fitting and Split and Merge steps that adjust the number of model components.

In this project we analyse the newly proposed method, give a detailed derivation of its statements and evaluate our reimplementation on real laser range data, noise-corrupted data and object contours in digital images.

We also analyse the method with respect to graph fitting and make propositions for such an extension.

Keywords: Polygonal Approximation, Expectation Maximization, Kullback-Leibler Divergence, Statistical Modelling, Nonparametric/Parametric Density Estimation


Contents

1 Introduction 1

2 Related Work 3
2.1 Line Fitting 4
2.1.1 One Line Fitting 4
2.1.2 Classification 5
2.2 Line Fitting with EM 6
2.3 Using penalty functions 7
2.4 Involving Splitting and Merging 9
2.4.1 Reversible Jump Markov Chain Monte Carlo 9
2.4.2 SMEM 10
2.5 Line Based Maps 11
2.5.1 Polyline Maps 11
2.5.2 Line Maps 13

3 Mathematical Preliminary 15
3.1 Least Squares Fit 15
3.2 Distance Functions 15
3.3 Matrix Calculus 17
3.3.1 Covariance Matrix with given Eigenvectors and Eigenvalues 17
3.3.2 First Order Derivation 17
3.3.3 Second Order Derivation 19
3.3.4 Derivative Of Determinant 19

4 Expectation Maximization 21
4.1 Maximum Likelihood Estimator 21
4.1.1 Example 21
4.1.2 The General Principle 23
4.1.3 The Multivariate Normal Case 24
4.2 Classification Problem 26
4.2.1 Bayes Rule 26
4.2.2 Parameter Estimation 27
4.3 Derivation of the EM algorithm 28
4.3.1 General Analysis 28
4.3.2 Parameter Estimation 29
4.3.3 Summary 33
4.3.4 Example 34
4.3.5 Convergence Property of EM 36

5 Extended EM derived from Kullback-Leibler Divergence 38
5.1 Example 39
5.2 Gaussian Like Distribution 41
5.3 Using Kullback-Leibler Divergence 41
5.3.1 Model Estimation 42
5.3.2 Adjusting the Number of Model Components 43
5.4 EM with Split and Merge 44
5.4.1 EM Derivation from KLD 45
5.4.2 Conditions for Model Modification 45
5.4.3 Splitting Failure 48
5.4.4 Merge by Overlapping 49
5.4.5 Example for EM Failure with Glike 51
5.5 Ground Truth Density Estimation 53
5.6 Extended EM with Split and Merge 56
5.6.1 Derivation from KLD 56
5.6.2 Conditions for Model Modification 57
5.6.3 Extended Split 58
5.6.4 Extended Merge 61

6 EMSFSM 64
6.1 Method 65
6.2 Initialization 66
6.3 EMSF 67
6.3.1 Undeveloped Regions 67
6.3.2 Weights Matrix 68
6.3.3 Line Computation and Trimming 69
6.4 Split 70
6.4.1 Assumed Sample Points 70
6.4.2 Estimating The Sample Point Density Value 70
6.5 Merge 72
6.5.1 Overlap 72
6.5.2 Merging Value 72
6.5.3 Checking for Merge Advantage with KLD 73

7 Evaluation 74
7.1 Simulating Data 74
7.2 Nearest Neighbour 75
7.3 EM 76
7.4 Intuitive SMEM 78
7.4.1 Intuitive Split 79
7.4.2 Intuitive Merge 80
7.4.3 Conclusion 81
7.5 EMSFSM 82
7.5.1 Sub Segment Length & Minimum Density 82
7.5.2 Radius & Sigma 83
7.6 Experiments 86
7.6.1 Noisy Data 86
7.6.2 Rome 87

8 Future Work 90
8.1 Extension to Graph Fitting 90
8.2 Graph Fitting as Post processing 91
8.3 Statistical Edge Completion 92

9 Visualisation 94
9.1 The User Interface 94
9.2 Simulating Data 96
9.3 EM & Intuitive SMEM 97
9.4 EMSFSM 100

10 Conclusion 105

A Rome Data Set i

B Gradient Images v

C Artificial Data Sets vii

Bibliography ix


List of Symbols

X: The domain of the ground truth density
X: The random variable denoting the data points; the data set
x, x_i, ~x: A variable denoting data/sample points, usually indexed by i
Y: The original data set enlarged by assumed data points
y, y_i: A variable denoting assumed sample points
Z: The set of model components
z, z_j: A variable denoting a model component, usually indexed by j
U: The set of undeveloped regions
X^u_i: Undeveloped region, a set of data points outside the reach of influence of any model component
S_j: Support set associated with model component z_j, the set of points most influenced by the model component z_j
Θ: The model parameters, a vector containing all component parameters
θ_j: The component parameters associated with component z_j
ω: Current state of nature; model
π_j, α_j: Prior probability of the model component z_j
~µ_j: The mean parameter, usually associated with the Gaussian density of the model component z_j
Σ_j: Covariance matrix, parameter usually associated with the Gaussian density of the model component z_j
σ: System parameter sigma, or standard deviation associated with a 1D Gaussian distribution
R: System parameter radius
l_sub: System parameter subsegment length
W: Weights matrix, probabilities of belonging to the model components for each data point
w_{i,j}: The (i, j)'th element of the weights matrix, the probability that the data point x_i belongs to the model component z_j
d_{p,p}(., .): Point-to-point distance function
d_{p,l}(., .): Point-to-line distance function
d_{p,s}(., .): Point-to-segment distance function
d^k_p: The Euclidean distance between a sample point and its k'th neighbour
l_j = (~l_1, ~l_2): Line, defined by two points ~l_1 and ~l_2
z_j = (~z_1, ~z_2): Line segment, defined by two points ~z_1 and ~z_2
∆: The gradient operator
L: Likelihood function
l(.) = log(L): Log-likelihood function
Q(Θ, Θ_n): Expectation function with variable parameter vector Θ associated with EM
E(.): Expectation value
δ_{i,j}: Indicator denoting that the data point x_i belongs to the model component z_j
δ_{z_1,z_2}: Merging value of the components z_1 and z_2


Gl(.|., .): The Gaussian-like density function
D(.||.): Kullback-Leibler Divergence (KLD)
q(.): The ground truth density
sdd(.): The nonparametric ground truth density estimate, the "smoothed data density"
p(.|.): The parametric ground truth density estimate
G(., ., .): Gaussian distribution
||~x||: Norm of a vector ~x
|x|: Absolute value of a scalar
|X|: Cardinality of a set
O(.): Computational complexity


Chapter 1

Introduction

Polygonal approximation of laser range data is an active research topic in robot navigation. Given a set of data, we want to reconstruct the original scene. The problems which arise are how to model each component in the scene, how to estimate the parameters of the model components and how to find the optimal number of model components.

The "Expectation Maximization" (EM) framework answers the first two questions. Assuming that the data set was generated by a mixture of Gaussians, EM finds an optimal solution by iteratively maximizing the log likelihood function. However, in the EM framework the number of model components must be known and fixed. This is due to the fact that the log likelihood function increases whenever the number of model components is increased.

In the work of [22] a new EM framework was introduced in which it is claimed to be possible to optimize both the model parameters and the number of model components. The key feature of the approach is the usage of a nonparametric density estimate. While standard EM estimates the model parameters by fitting the model onto the data itself, the new framework uses the Kullback-Leibler divergence (KLD) to fit the model onto a nonparametric estimate of the ground truth.

The new method is based on the standard EM framework, which is modified by fitting line segments instead of Gaussians. The EM framework is extended by splitting and merging the segments to determine the optimal number of model components.

In our project we intend to analyse the new method by reengineering the existing implementation coded up for MATLAB.

There exist several approaches to polygonal approximation which include the estimation of the optimal number of model components. All of them rely either on penalty functions or on hidden system parameters which in fact adjust the optimum. We are going to analyse the newly proposed method for such constraints and see whether the method is sensitive to their variation.

We begin our project with a study of related work on the subject of data approximation and finding the optimal number of model components (see chapter 2). We base the references on the literature named in the work of [23].

Since the new method is an extension of the EM framework, we give a detailed introduction to the issue. For the derivation of EM, and later of the new method, we need some mathematical prerequisites. Thus, before introducing the EM derivation, chapter 3 gives some mathematical preliminaries which would not necessarily be found in every math book.

The understanding of EM is crucial for the new method. Thus, we begin with a very basic introduction and proceed with a very detailed mathematical derivation, ending with experiments which show the advantages and disadvantages of the algorithm (chapter 4).

The derivation of the new method, based on the KLD and nonparametric density estimation, is given in chapter 5. We not only give a detailed introduction and derivation of the method but also uncover an inadequate assumption made in the method's implementation. Though the deviation from the statistical framework helps to balance the algorithm between splitting and merging, the convergence property is compromised. We give a proper merge step derived from the convergence property, thereby preserving it.

Given the theory on the issue, we give in chapter 6 a detailed description of how to code up the method. We take into account only the aspects present in the proposed code.

To evaluate our reimplementation, and so analyse the method, we use laser range data and object contours in digital images. Additionally, we developed a tool to artificially generate data sets which were used to illustrate the method development and for debugging. We state the evaluation results in chapter 7. Please find the additional imagery attached in the appendix.

The original intent of this project was to develop an extension of the newly proposed method of [22] to fit graphs to the data. We give the results of the method analysis on this account in chapter 8 and introduce ideas for such an extension.

For the detailed analysis and visualisation of the method implementation we developed an application tool. The application provides a data simulation tool, demonstrates the EM and intuitive SMEM algorithms by applying them to the data sets, and contains the reimplementation and visualisation of the new method, which is called EMSFSM in our framework. The description of the visualisation tool can be found in chapter 9.


Chapter 2

Related Work

In this chapter we discuss related work on the issue of line fitting. There is a very large body of literature on the problem of finding a compressed data representation. We concentrate on methods developed to fit a set of data points by a set of line segments. The data set is assumed to be drawn from a planar environment.

We keep the explanations of the methods brief; they serve as illustration only. For further details please consult the referenced literature, which we follow in our discussion. Please note that in some contexts the notation used in the literature deviates from the notation used in this work. This is due to the fact that we tried to find correspondences between the explained methods and the assumptions and notation made for this framework.

We begin the discussion with some basics provided by the standard textbook "Pattern Classification and Scene Analysis" [1], which proposes several methods for line fitting and data grouping.

In later chapters we introduce the framework of the "expectation maximization" (EM) algorithm and show how to derive the proposed method from it. Thus, we give only a very brief introduction to line fitting based on EM, following the standard references ([2], [3]).

We show the disadvantages of algorithms which use penalty functions to find a balance between the fitting error and the number of model components, as proposed by [4] using the BIC criterion. We give additional references on the issue of penalty functions and the handling of intractable integrals.

We shortly describe two methods using an iterative procedure involving splitting and merging steps to adjust the number of model components. The solution proposed by [11] is based on a fully Bayesian mixture analysis that makes use of "Jump Markov Chain Monte Carlo" (MCMC) methods. We give additional references for a better understanding of the method. The method proposed by [15] does not make use of a penalty function but is based on merging and splitting criteria which maximize the log likelihood.

Since our framework describes polygonal approximation of laser range data providing line-based maps, we introduce two methods which work under the same assumption ([20], [17]). In the work of [19] one can find an overview of several methods on that issue.

In the work of [21] a variety of techniques for polygonal approximation of curves is examined. A measure to classify the goodness of the techniques has been developed and the procedures tested.


2.1 Line Fitting

In this section we want to introduce some standard methods to fit a figure by one or more line segments. We follow in this section the book "Pattern Classification and Scene Analysis" (see chapter 9 in [1]).

Let us consider a figure. A figure is a digital representation of some scene fragment; in our case we assume a binarised representation of the scene. The figure content is a set of discrete points in the plane. The problem of line fitting can be divided into two problems: how to find the parts of the figure corresponding to the elements of the scene, and how to fit a line to each part.

2.1.1 One Line Fitting

Let us start with a figure containing only one such part. Consider that we digitized a scene with only one object in it. Thus, our problem reduces to fitting one single line to a point set X. The first method we want to present approaches line fitting by finding the minimum squared error (MSE).

Given a set of points \vec{x}_i = (x_{i_0}, x_{i_1})^t \in X_0 \times X_1 with i \in \{1, \ldots, N\} in the plane, find a vector \vec{c} = (c_0, c_1)^t such that the error function

\sum_{i=1}^{N} \left[ (c_0 + c_1 x_{i_0}) - x_{i_1} \right]^2 \qquad (2.1.1)

is minimized. The vector \vec{c} is the parameter vector of a line. If the MSE error is minimal, the sum of squares of the vertical distances from each point to the line is minimal.

The problem is solved by the "pseudoinverse of a matrix". To apply the theory to the line fitting problem we rewrite the MSE as:

\left\| A\vec{c} - \vec{b} \right\|^2
= \left\| \begin{pmatrix} 1 & x_{1_0} \\ 1 & x_{2_0} \\ \vdots & \vdots \\ 1 & x_{N_0} \end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \end{pmatrix}
- \begin{pmatrix} x_{1_1} \\ x_{2_1} \\ \vdots \\ x_{N_1} \end{pmatrix} \right\|^2
= \sum_{i=1}^{N} \left[ (c_0 + c_1 x_{i_0}) - x_{i_1} \right]^2

The solution of the MSE is obtained by:

\begin{pmatrix} c_0 \\ c_1 \end{pmatrix} = A^{\dagger}\vec{b},

where the matrix A^{\dagger} = (A^t A)^{-1} A^t is the pseudoinverse of A. Please consult [1], chapter 5, for the analytical derivation or chapter 9 for the geometrical derivation.
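As an illustration, the following minimal NumPy sketch (our own; not part of the original MATLAB code, and all names are hypothetical) computes the MSE fit via the pseudoinverse:

import numpy as np

def mse_line_fit(points):
    # Fit x1 = c0 + c1*x0 by minimizing the vertical squared error (2.1.1).
    x0, x1 = points[:, 0], points[:, 1]
    A = np.column_stack([np.ones_like(x0), x0])  # design matrix with a constant column
    c = np.linalg.pinv(A) @ x1                   # c = (A^t A)^{-1} A^t b
    return c                                     # c[0] = intercept c0, c[1] = slope c1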

Using the notation defined before, let us consider another error function. The MSE method minimizes the vertical distances from the points to the line; this makes it dependent on the choice of axes. The negative consequences are shown below. If we define the error as the sum of squared perpendicular distances from the data points to the line, we achieve independence of the axes.

The left panel of figure 2.1 presents the difference between the two error choices. Figure 2.1 has been taken from [1], figures 9.2 and 9.4. The solid lines connecting the points and the line demonstrate the vertical distance used for MSE, and the dotted connections are the perpendicular distances used in the new error definition.


Fig. 2.1: Left: The best fit criteria: solid MSE, dotted eigenvector fit. Right: Different fits: solid MSE fit, dotted eigenvector fit.

Let us denote the unit normal vector of the line as \vec{n} and the mean or centre of gravity as \vec{\mu}. Our task is to find a line minimizing the following error:

d^2 = \sum_{i=1}^{N} d_i^2 = \sum_{i=1}^{N} \left( \vec{n}^t (\vec{x}_i - \vec{\mu}) \right)^2 = \vec{n}^t \left[ \sum_{i=1}^{N} (\vec{x}_i - \vec{\mu})(\vec{x}_i - \vec{\mu})^t \right] \vec{n}

where d_i is the perpendicular distance from the i-th point to the line. The quadratic form is minimized by taking \vec{n} to be the eigenvector of the scatter matrix S = \sum_{i=1}^{N} (\vec{x}_i - \vec{\mu})(\vec{x}_i - \vec{\mu})^t associated with the smallest eigenvalue. In keeping with this derivation, the method is called the eigenvector fit. The scatter matrix S is symmetric, and the eigenvectors of symmetric matrices are orthogonal. Thus, the eigenvector associated with the largest eigenvalue is the orientation vector of the best fitting line.
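A corresponding sketch of the eigenvector fit (again our own illustration, assuming a two-dimensional NumPy point array) is:

import numpy as np

def eigenvector_line_fit(points):
    # Fit a line by minimizing the perpendicular distances (eigenvector fit).
    mu = points.mean(axis=0)                 # centre of gravity
    centered = points - mu
    S = centered.T @ centered                # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    normal = eigvecs[:, 0]                   # smallest eigenvalue -> line normal
    direction = eigvecs[:, -1]               # largest eigenvalue -> line orientation
    return mu, direction, normal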

As it is done in [1], we present the disadvantage of MSE. The right panel of figure 2.1 presents the different fits to the same data set. The dotted line demonstrates the fit of the eigenvector fitting method and the solid line corresponds to the MSE method. The result of MSE is erroneous. If the X_0 and X_1 axes were interchanged, then both fits would result in the identical correct fit.

2.1.2 Classification

Let us now turn our attention to the problem of data classification. In the preceding section we assumed that all data points are to be fitted by a single line. Now we want to investigate the problem of describing the figure by a set of lines. Thus, we have to find a method of mapping the data points to the corresponding lines. Notice that the number of lines is "yet to be determined". We have to "classify" the data points by the corresponding lines.

The first method we want to present is introduced by [1] as a point-to-curve transformation. Let us assume that the lines are specified by the angle between the normal of the line and the coordinate axis, and by the perpendicular distance from the origin to the line. The method then maps each point into a line in the parameter space in such a way that collinear points map into concurrent lines. Two collinear lines may be merged into one. A new point placed outside the line denotes the existence of another line.


Fig. 2.2: Iterative end point fits (points labelled A, B, C, D).

To ease the problem of noise corruption the parameter space has to be quantized. Thus, approximately collinear points are mapped into one line.

Another method of describing a figure by a number of lines is known as iterative end point fitting. We need to preset some error threshold, since without a threshold overfitting occurs; overfitting in this case would mean connecting every two neighbouring points by a line segment.

The first three iterations of the method are demonstrated in figure 2.2 (compare figure 9.6 in [1]). The algorithm is initialised with a line segment connecting the two most distant points, in the example AB. In each iteration the distances from the points to the nearest line segment are computed. If all distances are less than the threshold, the algorithm terminates. Otherwise the point furthest from the line breaks the line segment into two new ones, in the example AC and CB. This is done separately for each line for which too large distances have been detected. In the last iteration all distances corresponding to the line AC are less than the threshold; only CB is broken.

Obviously, to proceed with this algorithm the data points must be arranged in such a way that the starting and ending points can explicitly be found. The method is strongly influenced by single points, which makes it noise sensitive. Some pre- and postprocessing steps can be proposed to avoid outliers; a smoothing pre-process can ease the problem of noise.

The finally selected line segments may not be an appropriate fit. A more appropriate line fitting algorithm can be applied as postprocessing to adjust the placement of the predetermined line segments, since the point-to-line correspondences have already been found. The end point fit can then be used to break the lines.
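A minimal recursive sketch of the end point fit (our own; it assumes an ordered point sequence whose first and last entries are the end points, whereas the book version starts from the two most distant points):

import numpy as np

def point_segment_dist(p, a, b):
    # Distance from point p to the segment [a, b].
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def end_point_fit(points, threshold):
    # Recursively break the segment at the farthest point until all distances are below the threshold.
    a, b = points[0], points[-1]
    dists = np.array([point_segment_dist(p, a, b) for p in points])
    k = int(np.argmax(dists))
    if dists[k] <= threshold:
        return [a, b]
    left = end_point_fit(points[:k + 1], threshold)
    right = end_point_fit(points[k:], threshold)
    return left[:-1] + right                 # drop the duplicated break point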

2.2 Line Fitting with EM

The Expectation Maximization algorithm (EM; [2]) provides a particularly useful framework to solve the problem of data classification. In our framework we use the advantages of EM and extend it with split and merge steps as proposed by [22]. We give here a short introduction only; an in-depth introduction to the EM algorithm follows in chapter 4.

The problem to solve can be illustrated as finding a correspondence of data points to some latent or hidden variables, the labels. The labels denote the classes of the data. The solution provided by EM is based on the idea of treating the labels as random variables which "complete" the given data set.

The problem of fitting lines results in an already known solution proposed by [3] in 1956, long before the EM method was proposed. It uses the fact that "the fictitious values are actually the expected values of the missing units derived from the least squares estimates of the block" (compare [3], 1956). The iterative application of the Least Squares Fit (LSF) minimizes the residual sum of squares, which guarantees the convergence of the algorithm.

However, the polygonal approximation of point data requires not only the estimation of the parameters of the model components but also of the number of model components. For the classical EM the number of model components must be fixed. Moreover, EM produces an optimal solution only if the initial values of the component parameters are close to the global optimum. Finding the global optimum is not guaranteed even if the EM procedure achieves convergence.

2.3 Using penalty functions

This category of approaches to estimating the correct number of model components is based on using penalty functions. The procedures in such frameworks require that EM is run until convergence for some or every number of components, and then select the model with the highest criterion value.

The work of [4] gives a good introduction to the difficulty of the issue, discusses the existing methods and investigates a "novel application of the framework to scoring the structures of discrete graphical models" (compare sec. 2 in [4]).

The criterion used in this category of approaches is built to represent a "trade-off" between the likelihood of the data and a measure of model complexity. The model criterion value is compared iteratively for each number of model components. The success of this approach depends on convergence to the global optimum whatever the initial number of model components assumed.

Note that we chose a notation deviating from [4] to fit the formulas into our work. Suppose that we have a countable collection of models Ω. The k-th model ω_k ∈ Ω has a vector \vec{\theta}_k of unknown parameters.

Bayesian approaches treat the parameters \vec{\theta}_k of a model ω_k as unknown random variables; averaging the likelihood over different settings of \vec{\theta}_k we obtain:

p(X\,|\,\omega_k) = \int p(X\,|\,\omega_k, \vec{\theta}_k)\, p(\vec{\theta}_k\,|\,\omega_k)\, d\vec{\theta}_k \qquad (2.3.1)

Equation 2.3.1 is called the "Bayesian Integration", where p(X|ω_k) is the "marginal likelihood" for a data set X assuming model ω_k, and p(\vec{\theta}_k|ω_k) is the prior distribution over parameters. "Integrating out the parameters penalizes models with more degrees of freedom since these models can a priori model a larger range of data sets" (compare sec. 1.1 in [4]). This property of Bayesian Integration is called "Occam's razor"¹. It prevents the models from becoming too complex, since simpler explanations are preferred.

For most models of interest it is analytically and computationally intractable to perform the integral in 2.3.1. Thus, we need methods to approximate the integral.

One approach to Bayesian integration is the "Laplace Approximation" ([5], [6]), which makes a local Gaussian approximation around a "maximum a posteriori" parameter estimate (MAP). The covariance of the fitted Gaussian is determined by the Hessian matrix at the MAP:

p(X\,|\,\omega_k)_{\text{Laplace}} = \sqrt{|2\pi H^{-1}|}\; p(X\,|\,\omega_k, \vec{\theta}_{\text{MAP}_k})\, p(\vec{\theta}_{\text{MAP}_k}\,|\,\omega_k) \qquad (2.3.2)

¹ From Wikipedia: http://en.wikipedia.org/wiki/Occam's Razor: "The explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory."


Fig. 2.3: Left: Two model components obtained by EM. Right: The optimal approximation.

This can be justified by the fact that, under certain regularity conditions, the posterior distribution approaches a Gaussian distribution as the number of samples grows.

A much less costly approximation to the marginal likelihood is given by the Bayesian Information Criterion (BIC; [7]) and the Akaike Information Criterion (AIC; [8]). BIC and AIC can be derived from the Laplace Approximation by retaining only the terms that grow with the number of data samples N. For a model with d parameters, using the MAP estimate \vec{\theta}_{\text{MAP}_k}:

\log p(X\,|\,\omega_k)_{\text{BIC}} = \log p(X\,|\,\omega_k, \vec{\theta}_{\text{MAP}_k}) - \frac{d}{2} \log N \qquad (2.3.3)

and

\log p(X\,|\,\omega_k)_{\text{AIC}} = \log p(X\,|\,\omega_k, \vec{\theta}_{\text{MAP}_k}) - d \qquad (2.3.4)
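For illustration, both criteria reduce to a few lines of code (a sketch of ours, assuming the maximized log likelihood is already available):

import numpy as np

def bic_score(log_likelihood, d, N):
    # BIC approximation (2.3.3): log p(X|w_k) ~ log L - (d/2) log N
    return log_likelihood - 0.5 * d * np.log(N)

def aic_score(log_likelihood, d):
    # AIC approximation (2.3.4): log p(X|w_k) ~ log L - d
    return log_likelihood - d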

The AIC penalizes free parameters less strongly than does the BIC.

Another method of choice for approximating intractable expectations and integrals is "Markov Chain Monte Carlo" (MCMC; [9]). In contrast to the previously introduced non-sampling methods, MCMC methods are a class of algorithms for sampling from a probability distribution. They are based on constructing a Markov chain that has the desired distribution as its stationary distribution. "The state of the chain after a large number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps."² In these methods a set of "runners" moves randomly around the equilibrium distribution in relatively small steps. At each step the runner searches for places of high value contributing to the integral.

In [4] it is proposed to choose Annealed Importance Sampling (AIS; [10]) as "the gold standard" among the sampling method candidates.

The problem with the approaches from this category is their dependence on convergence. If, for some initial configuration, EM gets stuck in a local optimum, the chosen criterion will incorrectly estimate the number of model components.

For example, EM gets stuck in the local optimum presented in the left panel of figure 2.3. The likelihood for this configuration is very low, so the BIC values this number of components incorrectly, and consequently the right number of components is not selected. The right number of model components is actually two, but the estimated parameters should place the components as shown in the right panel of figure 2.3.

² A very good introduction can be found on Wikipedia: http://en.wikipedia.org/wiki/Markov chain Monte Carlo


2.4 Involving Splitting and Merging

In this section we introduce the category of approaches which estimate the number of model components using split and merge steps within the EM framework.

2.4.1 Reversible Jump Markov Chain Monte Carlo

Green's iterative merging and splitting ([11]) is based on a fully Bayesian mixture analysis that makes use of "Jump Markov Chain Monte Carlo" (MCMC) methods.

We first introduce the most popular MCMC method, the Metropolis-Hastings algorithm (MH; [12]). Later we will see that Green's method can be interpreted as a special case or an extension of this algorithm.

MCMC is a strategy for generating samples \vec{\theta} \in \Theta while exploring the state space \Theta using a Markov chain mechanism.

Let \vec{\theta} be a random parameter vector. The corresponding probability distribution is p(X|\vec{\theta}), describing the observations X. If we assume p(\vec{\theta}) to be a prior probability density describing the prior belief about \vec{\theta}, then according to Bayes' formula the posterior \pi in terms of likelihood and prior is given by:

\pi(\vec{\theta}) := p(\vec{\theta}\,|\,X) = \frac{p(X\,|\,\vec{\theta})\, p(\vec{\theta})}{\int p(X\,|\,\vec{\theta})\, p(\vec{\theta})\, d\vec{\theta}} \qquad (2.4.1)

In MCMC methods the Markov chain is constructed in such a way that the parameters \Theta_j are its states and \pi is its stationary distribution. A Markov chain is described by a transition kernel P(\vec{\theta}, d\vec{\theta}') that gives for each state \vec{\theta} the probability distribution for the chain to move into d\vec{\theta}' in the next step. Let us denote the transition density by p(\vec{\theta}, \vec{\theta}').

The transition kernel is aperiodic, irreducible and satisfies a reversibility condition, also called the "detailed balance equation":

\pi(\vec{\theta})\, p(\vec{\theta}, \vec{\theta}') = \pi(\vec{\theta}')\, p(\vec{\theta}', \vec{\theta}) \qquad (2.4.2)

In the MH method, a step with invariant distribution \pi(\vec{\theta}) and proposal distribution q(\vec{\theta}'|\vec{\theta}) involves sampling a candidate value \vec{\theta}' given the current value \vec{\theta} according to q(\vec{\theta}'|\vec{\theta}). Then, with probability:

\alpha(\vec{\theta}, \vec{\theta}') = \min\left\{ 1,\; \frac{\pi(\vec{\theta}')\, q(\vec{\theta}\,|\,\vec{\theta}')}{\pi(\vec{\theta})\, q(\vec{\theta}'\,|\,\vec{\theta})} \right\} \qquad (2.4.3)

the proposed value is accepted; otherwise, the existing value is retained. The Metropolis-Hastings algorithm is very simple, but it requires careful design of the proposal distribution q. The samples generated by the MH algorithm approach samples drawn from the target distribution asymptotically.
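The following sketch (ours, not from the referenced work) shows a random-walk Metropolis-Hastings sampler; with a symmetric Gaussian proposal the q-terms in 2.4.3 cancel, so only the ratio of target densities remains:

import numpy as np

def metropolis_hastings(log_target, theta0, n_steps, step=0.5, seed=0):
    # Random-walk MH with a symmetric Gaussian proposal.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    log_p = log_target(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.shape)
        log_p_new = log_target(proposal)
        if np.log(rng.uniform()) < log_p_new - log_p:   # accept with prob. min(1, pi'/pi)
            theta, log_p = proposal, log_p_new
        samples.append(theta.copy())
    return np.array(samples)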

Until now we have considered the problem of model selection for models of the same dimensionality. Now the previously described method is extended with the ability to adjust the number of model components.

Given a family of K models \omega_k \in \Omega with k \in \{1, \ldots, K\}, Markov chains are constructed admitting \pi(k, \vec{\theta}_k) as invariant distribution, where \vec{\theta}_k \in \Theta_k with \Theta_k the parameter space of the k-th model \omega_k. The dimension of \vec{\theta}_k can vary with k.

Up to here we have been comparing densities in the acceptance ratio 2.4.3. However, if we now perform model selection, comparing the densities of objects of different dimensions is no longer meaningful; we have to consider the measure of volume. To compare densities point-wise, the models have to be mapped to a common dimension.


The dimension matching is implemented by generating a vector of random variables independently of the parameters. The sum of the parameter dimension and the dimension of the newly generated random vector \vec{u}_{k,l} has to be equal to the sum of the dimension of the parameter vector in the next state and the dimension of the appropriately generated random vector \vec{u}_{l,k}.

The state is denoted by a pair containing the parameter vector \vec{\theta}_k of the current model \omega_k and the generated random vector \vec{u}_{k,l}. The proposal distribution now has to be defined as a distribution that takes the pair \vec{\theta}_i and \vec{u}_{i,j} as argument and samples a candidate pair \vec{\theta}_j and \vec{u}_{j,i}. The probability of choosing the move from state i to state j is denoted by p(i, j).

The choice of the proposal distribution q is problem dependent and needs to be addressed on a case-by-case basis.

The acceptance probability is then computed by

\alpha(\vec{\theta}_k, \vec{\theta}_l) = \min\left\{ 1,\; \frac{\pi(l, \vec{\theta}_l)}{\pi(k, \vec{\theta}_k)} \times \frac{p(l,k)}{p(k,l)} \times \frac{q(\vec{\theta}_k, \vec{u}_{k,l}\,|\,\vec{\theta}_l, \vec{u}_{l,k})}{q(\vec{\theta}_l, \vec{u}_{l,k}\,|\,\vec{\theta}_k, \vec{u}_{k,l})} \times \left| \frac{\partial(\vec{\theta}_l, \vec{u}_{l,k})}{\partial(\vec{\theta}_k, \vec{u}_{k,l})} \right| \right\} \qquad (2.4.4)

Green's method samples on the smaller union space \bigcup_{k=1}^{K} \{k\} \times \Theta_k instead of sampling over the model index and the product space \prod_{k=1}^{K} \Theta_k (compare sec. 3.7 in [13]). Thus, Green's method allows the sampler to "jump" between the different subspaces. For further investigation of splitting and merging moves for normal mixtures consult [14], section 3.2.

The proposal distribution q acts as a penalty function for the number of model components. The choice of the candidates depends on randomly generated numbers and is therefore counterintuitive. Due to this fact, the algorithm requires a huge number of iterations: the run of Green's method on image segmentation in [11] required 20,000 iterations, which took 260 seconds on a Sun Sparc 2 workstation, as Green reported.

2.4.2 SMEM

As we have seen in the previous section, random selection may cause inefficient search moves. The framework proposed by [15] extends the classical EM by split and merge steps. Their split and merge steps do not require any penalty as in Green's approach. A change of dimension is only accepted if it leads to a maximization of the log likelihood, which guarantees convergence.

The merge and split candidates are weighted to bring them into an appropriate order. A merge is assumed to be highly probable "when there are many data points each of which has almost equal posterior probability" (compare sec. 3.3 in [15]). The merging criterion (compare formula (15) in [15]) is defined as:

J_{\text{merge}}(i, j; \Theta^*) = \vec{P}_i(\Theta^*)^t\, \vec{P}_j(\Theta^*), \qquad (2.4.5)

"where \vec{P}_i(\Theta^*) = (P(i|\vec{x}_1; \Theta^*), \ldots, P(i|\vec{x}_N; \Theta^*)) is an N-dimensional vector consisting of the posterior probabilities for the i-th model".

According to this merging criterion, the model components as presented in the left panel of figure 2.4 would be highly prioritised for merging, and the merge would lead to a higher log likelihood value. The incorrect merge would result in one single model component that cannot provide a good approximation for two crossing data point clouds.

The split criterion is defined in [15] and [16] as the Kullback-Leibler Divergence (KLD) (compare formula 16 in [15] or 27 in [16]):

J_{\text{split}}(k; \Theta^*) = \int f_k(\vec{x}; \Theta^*) \log \frac{f_k(\vec{x}; \Theta^*)}{p_k(\vec{x}; \theta_k^*)}\, d\vec{x}, \qquad (2.4.6)


Fig. 2.4: Left: Two model components equally weighting the data. Right: Two obviously separated point clouds.

which is the distance measure between the local data density f_k(\vec{x}) around the k-th model component and the density of the k-th model component specified by the parameter estimate \Theta^*.

According to this criterion a split is highly probable if the component density differs significantly from the ground truth density. The ground truth estimate is computed using the actual component parameters, as can be seen in formula 17 in [15]:

f_k(\vec{x}; \Theta^*) = \frac{\sum_{n=1}^{N} \delta(\vec{x} - \vec{x}_n)\, P(k\,|\,\vec{x}_n; \Theta^*)}{\sum_{n=1}^{N} P(k\,|\,\vec{x}_n; \Theta^*)} \qquad (2.4.7)

In our application the model components are line segments. The line segment shown in the right panel of figure 2.4 matches the data perfectly. Since the KLD value in 2.4.6 is computed only with respect to the data points \vec{x} \in X, both densities are identical. The split criterion therefore fails to propose this component as a candidate for splitting.

2.5 Line Based Maps

In this section we give a short introduction to line fitting algorithms for mobile robot maps. Here the task is to build an environmental representation from laser range scan data. In our framework we create line-based maps.

2.5.1 Polyline Maps

In the first framework we want to present, the data approximation is done by polylines (compare [20]). The advantage of polylines compared to line-based maps is that polylines allow the representation of corners and line intersections, called vertices.

The method proposed by [20] consists of two parts. The first part computes an initial set of polylines, which are then iteratively optimized in the second part.

It is assumed that an accurate pose estimate is given for all laser scans. The input of the algorithm is a set of aligned laser range scans consisting of 180-361 beams, depending on the type of range scanner used. The first part of the algorithm proceeds in three steps; the first two steps perform a ray-casting operation for all laser beams.

Fig. 2.5: Operators applied to the polylines: overlap removal, merge, split, adjust, zigzag removal, add noise, extend removal. Compare [20], figure 4.

In the first step a grid map is computed, storing in every cell the probability of reflection by this cell. This probability is computed as the ratio of "hits" (reflections) over all beams. In the second step only those grid cells are labelled which have a high probability of belonging to a surface, i.e. the ratio of hits to beams is higher than 0.5. In the final step the polylines are computed. To achieve this, the endpoints of the surfaces are extracted; the endpoints become the starting points of the contours. The traversed cells are added as new polygon points. In the case of cyclic contours an arbitrary starting point is chosen.

After the initial state of the polylines has been generated, the optimization process starts. The task is now to compute a map in such a way that a value of goodness is maximized. The metric to compare the goodness of maps is given by the Bayesian Information Criterion (BIC).

We use in the following the notation and formulas defined in [20], chapter IV. Suppose the range data are denoted by d = d_1, \ldots, d_n and the polyline map by P. The error of approximating the data by the map is given by:

E(d\,|\,P) = \sum_{i=1}^{n} \text{dist}(l^*(i), d_i)^2 + \gamma \sum_{i=1}^{k} \big( 1 - p(\alpha_i\,|\,x_i, y_i) \big),

where

l^*(i) = \arg\min_{l \in P} \text{dist}(l, d_i)

and l^*(i) is the nearest line segment to the scan point d_i, k is the number of vertices in the current polyline, and the quantity p(\alpha_i\,|\,x_i, y_i) specifies the probability of the angle \alpha_i at the position (x_i, y_i) of vertex i in the scan data. "This probability is obtained according to a statistics about the angles between lines fitted to point clusters in the individual input range scans. The term \gamma is a weighing factor that computes the tradeoff between the distance of the data points from the polyline and the angle error." (Compare [20], description of formula 2.)

Since there is no analytical solution for the error minimization, a "local search" has been proposed to minimize the error. During the search several operators are applied to the map. The operators are shown in figure 2.5, which has been taken from [20] (compare figure 4).

Please consult [20] for implementation details. The number of vertices is adjusted by the BIC, which is defined as:

E_{\text{BIC}}(d\,|\,P) = \alpha E(d\,|\,P) + k \log n,

where k is the number of model components and n is the number of data points.

Fig. 2.6: Left: Representation of the uncertainty of sample points. Right: Line and data representation in polar coordinates.

2.5.2 Line Maps

The method we want to introduce here uses line segments to approximate sets of laser range scans [17]. We give only a short introduction; for any further explanation please consult the work of [17].

The lines are parameterized by polar coordinates. The input of the algorithm is sets of dense range data that are collected by a mobile robot from multiple poses. The algorithm weights each sample point's influence on the fit according to its uncertainty.

In this framework the uncertainty of the data is considered to be variable. The variability of the data uncertainty is demonstrated in figure 2.6 (compare [17], figure 1). Each scan point can be represented in polar form by the distance between the surface and the robot and the direction of the capturing device, denoted by an angle. The measurement noise is assumed to arise from a zero-mean Gaussian process. The Gaussians are not isotropic and their parameters vary depending on the direction of the capturing device. The displacement estimate may be partially or fully derived from the range data. A general model for measurement noise is presented in [18]; the work of [17] gives a very good review in its section 2.

The algorithm is divided into two parts. In the first part the scan data classification is performed, providing a first rough estimate of the corresponding lines. The second part is an iterative procedure based on maximum likelihood estimation. The last step of the second part merges similar line segments; their similarity is measured using a chi-square test. The transformation of line representations taken from different poses into a common representation is given.

The classification part of the algorithm groups the scan data. The line and the data are parameterized using polar coordinates. The right panel of figure 2.6 shows the data and a line. A Hough space is spanned by discretized parameters corresponding to the distance from the data point to the origin and the corresponding angle (see the right panel of figure 2.6). For each point the parameters are computed for all lines passing through that point, and the cells in the Hough space which correspond to these parameters are incremented. The peaks in the Hough space correspond to lines in the data space. Thus, the data points can be grouped according to their correspondence to lines.
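A compact sketch of such a Hough accumulator (ours; the quantization parameters are arbitrary choices):

import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    # Vote in a quantized (theta, rho) space; peaks in `acc` correspond to lines.
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.max(np.linalg.norm(points, axis=1))
    rho_edges = np.linspace(-rho_max, rho_max, n_rho + 1)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)   # normal form of all lines through (x, y)
        bins = np.digitize(rho, rho_edges) - 1
        valid = (bins >= 0) & (bins < n_rho)
        acc[np.arange(n_theta)[valid], bins[valid]] += 1
    return acc, thetas, rho_edges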

The second step of the algorithm weights the data with respect to the fit and the corresponding uncertainty. Since there is no exact closed-form formula to estimate the angle, an iterative solution is proposed to solve the non-linear problem. The lines are trimmed at the extreme endpoints to obtain the line segments as the optimal fit.

The final step of the algorithm merges similar line segments according to the chi-squared test. No splitting steps have been proposed.


Chapter 3

Mathematical Preliminary

In this chapter some mathematical prerequisites are introduced.

3.1 Least Squares Fit

In our context we extend the classical LSF algorithm to a weighted orthogonal regression. The solution is a set of pairs \mu_{lsf} and \Sigma_{lsf}. The algorithm takes two input parameters: the set of points X_{lsf} and the weights matrix W_{lsf} \in \mathbb{R}^{|X_{lsf}| \times |Z_{lsf}|}. The number of columns |Z_{lsf}| determines the number of iterations of the algorithm and the number of pairs in the solution set.

For each j \in \{0, \ldots, |Z_{lsf}|-1\} the approach performs a weighted orthogonal regression computing the corresponding pair of values:

\vec{\mu}_j = \frac{\sum_{i=0}^{|X_{lsf}|-1} \vec{x}_i\, w_{i,j}}{\sum_{i=0}^{|X_{lsf}|-1} w_{i,j}} \qquad (3.1.1)

\Sigma_j = \sum_{i=0}^{|X_{lsf}|-1} w_{i,j} (\vec{x}_i - \vec{\mu}_j)(\vec{x}_i - \vec{\mu}_j)^t \qquad (3.1.2)

where \vec{x}_i \in X_{lsf} and w_{i,j} are the elements of the weights matrix W_{lsf} in the i-th row and j-th column, with i \in \{0, 1, \ldots, |X_{lsf}|-1\} and j \in \{0, 1, \ldots, |Z_{lsf}|-1\}.
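A direct NumPy sketch of 3.1.1 and 3.1.2 (our own illustration; X is assumed to be an N x 2 array of data points and W the N x K weights matrix):

import numpy as np

def weighted_lsf(X, W):
    # Weighted mean (3.1.1) and weighted scatter (3.1.2) for every component j.
    mus, sigmas = [], []
    for j in range(W.shape[1]):
        w = W[:, j]
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu
        sigma = (w[:, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0)
        mus.append(mu)
        sigmas.append(sigma)
    return mus, sigmas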

3.2 Distance Functions

In our framework different distance functions are used. In the following we treat vectors as points in Euclidean space; thus the term "point" refers to a vector. We distinguish functions for computing the distance between two points in Euclidean space, the distance between a point and a line, and the distance between a point and a line segment.

We assume that a domain, or in our case the data set X, is given with \vec{x} \in X. We use ‖.‖ to denote the norm of a vector and |.| the absolute value of a scalar.


Fig. 3.1: Distance between a data point and a segment according to 3.2.4. Notice that only a part of the segment is shown. The σ circle demonstrates the standard deviation for G_{npdf} in 6.5.3.

The classical point-to-point distance function is denoted by:

d_{p,p}(\vec{x}_1, \vec{x}_2) = \| \vec{x}_2 - \vec{x}_1 \| \qquad (3.2.1)

The distance between a data point and its k'th neighbour is defined by the function:

d_p^k(\vec{x}) = \| \vec{x} - k(\vec{x}) \|, \qquad (3.2.2)

where k(\vec{x}) is a function that takes a data point and returns its k'th neighbour. This function serves demonstration purposes only and is not described further.

We define a line by two points located on it, l = (\vec{l}_1, \vec{l}_2), and assume that the unit normal vector \vec{n}_l is computed. The point-to-line distance function is then given by:

d_{p,l}(\vec{x}, l) = |(\vec{x} - \vec{l}_2)^t \vec{n}_l|, \quad \text{or} \qquad (3.2.3)
d_{p,l}(\vec{x}, l) = d_{p,p}\big(\vec{x},\; \vec{l}_2 + ((\vec{x} - \vec{l}_2)^t \vec{r}_l)\, \vec{r}_l\big)

with \vec{r}_l = \frac{1}{\|\vec{l}_2 - \vec{l}_1\|}(\vec{l}_2 - \vec{l}_1). The second definition is used if the normal vector is not given.

We define a line segment by s = (\vec{s}_1, \vec{s}_2), two points located on it and bounding it.

d_{p,s}(\vec{x}, s) =
\begin{cases}
d_{p,l}(\vec{x}, s), & \text{if } |(\vec{x} - \vec{\mu}_s)^t \vec{r}_s| < \frac{1}{2}\|\vec{s}_2 - \vec{s}_1\| \\
\sqrt{d_{p,l}^2(\vec{x}, s) + \big(|(\vec{x} - \vec{\mu}_s)^t \vec{r}_s| - \frac{1}{2}\|\vec{s}_2 - \vec{s}_1\|\big)^2}, & \text{else}
\end{cases}
\qquad (3.2.4)

with \vec{r}_s = \frac{1}{\|\vec{s}_2 - \vec{s}_1\|}(\vec{s}_2 - \vec{s}_1) and \vec{\mu}_s = \frac{1}{2}(\vec{s}_2 + \vec{s}_1).

The distance computation is demonstrated in figure 3.1. Notice that the figure shows only a part of a segment; the point \vec{s}_1 is not visible. The distances between the segment and the data points \vec{x}_0, \vec{x}_1 and \vec{x}_2 are computed with d_{p,l}, but for the data point \vec{x}_3 the distance between its projection and the segment mean \vec{\mu}_s is greater than half the segment length, and therefore the appropriate computation is \sqrt{d_{p,l}^2(\vec{x}_3, s) + \big(|(\vec{x}_3 - \vec{\mu}_s)^t \vec{r}_s| - \frac{1}{2}\|\vec{s}_2 - \vec{s}_1\|\big)^2}.


3.3 Matrix Calculus

3.3.1 Covariance Matrix with given Eigenvectors and Eigenvalues

Let \Sigma = (\sigma_{i,j}) with i, j \in \{0, 1\} be the covariance matrix, (x, y)^t an arbitrary vector, \lambda_0 and \lambda_1 the eigenvalues, and (x_0, y_0)^t and (x_1, y_1)^t the corresponding eigenvectors. Then the covariance matrix \Sigma can be computed from

\begin{pmatrix} \sigma_{00} & \sigma_{01} \\ \sigma_{10} & \sigma_{11} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \lambda_{0,1} \begin{pmatrix} x \\ y \end{pmatrix}

as

\sigma_{00} = \frac{\lambda_1 x_1 y_0 - \lambda_0 x_0 y_1}{x_1 y_0 - x_0 y_1}, \qquad
\sigma_{11} = \frac{\lambda_0 x_1 y_0 - \lambda_1 x_0 y_1}{x_1 y_0 - x_0 y_1},

\sigma_{01} = \frac{x_0 x_1 (\lambda_0 - \lambda_1)}{x_1 y_0 - x_0 y_1}, \qquad
\sigma_{10} = \frac{-y_0 y_1 (\lambda_0 - \lambda_1)}{x_1 y_0 - x_0 y_1}
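Equivalently, the closed-form entries above solve \Sigma V = V \mathrm{diag}(\lambda_0, \lambda_1); a short sketch of ours:

import numpy as np

def covariance_from_eigen(lam0, lam1, v0, v1):
    # Reconstruct the 2x2 covariance matrix from eigenvalues and eigenvectors.
    V = np.column_stack([v0, v1])
    return V @ np.diag([lam0, lam1]) @ np.linalg.inv(V)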

3.3.2 First Order Derivation

The trace of a square matrix, tr(A), is equal to the sum of A's diagonal elements. We can draw from a large literature on "matrix calculus" to find proofs of some elementary matrix calculations; "The Matrix Cookbook" [31] provides a straightforward list of formulas.

tr(AB) = tr(BA) (3.3.1)

tr(A+B) = tr(A)+ tr(B) (3.3.2)

In the following we are going to prove that:

\sum_i \vec{x}_i^{\,t} A \vec{x}_i = tr(AB), \quad \text{where } B = \sum_i \vec{x}_i \vec{x}_i^{\,t}   (3.3.3)

In our framework we deal with symmetric matrices. Let d be the dimension of the vectors \vec{x}_i. The matrices are then A, B ∈ ℝ^{d×d}.

\sum_i \vec{x}_i^{\,t} A \vec{x}_i
= \sum_i (x_{i_0}, x_{i_1}, \dots, x_{i_d}) \begin{pmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,d} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ a_{d,0} & a_{d,1} & \cdots & a_{d,d} \end{pmatrix} \begin{pmatrix} x_{i_0} \\ x_{i_1} \\ \vdots \\ x_{i_d} \end{pmatrix}
= \sum_i \big( x_{i_0}(a_{0,0}x_{i_0} + a_{0,1}x_{i_1} + \cdots + a_{0,d}x_{i_d}) + x_{i_1}(a_{1,0}x_{i_0} + a_{1,1}x_{i_1} + \cdots + a_{1,d}x_{i_d}) + \cdots + x_{i_d}(a_{d,0}x_{i_0} + a_{d,1}x_{i_1} + \cdots + a_{d,d}x_{i_d}) \big)   (3.3.4)

Page 30: Diploma Thesis Polygonal Approximation of Laser Range Scan ...tc... · Cognitive Systems Group Dept. of Computer Science, University of Hamburg Vogt-K¨olln-Str. 30, 22527 Hamburg,

18 CHAPTER 3. MATHEMATICAL PRELIMINARY

B = \sum_i \vec{x}_i \vec{x}_i^{\,t} = \sum_i \begin{pmatrix} x_{i_0} \\ x_{i_1} \\ \vdots \\ x_{i_d} \end{pmatrix} (x_{i_0}, x_{i_1}, \dots, x_{i_d}) = \sum_i \begin{pmatrix} x_{i_0}x_{i_0} & x_{i_0}x_{i_1} & \cdots & x_{i_0}x_{i_d} \\ x_{i_1}x_{i_0} & x_{i_1}x_{i_1} & \cdots & x_{i_1}x_{i_d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i_d}x_{i_0} & x_{i_d}x_{i_1} & \cdots & x_{i_d}x_{i_d} \end{pmatrix} = \sum_i D

tr(AB) = tr\Big(A \sum_i D\Big) = tr\Big(\sum_i AD\Big) = \sum_i tr(AD)

\sum_i tr(AD) = \sum_i tr\left( \begin{pmatrix} a_{0,0} & \cdots & a_{0,d} \\ \vdots & \ddots & \vdots \\ a_{d,0} & \cdots & a_{d,d} \end{pmatrix} \begin{pmatrix} x_{i_0}x_{i_0} & \cdots & x_{i_0}x_{i_d} \\ \vdots & \ddots & \vdots \\ x_{i_d}x_{i_0} & \cdots & x_{i_d}x_{i_d} \end{pmatrix} \right)
= \sum_i \big( (a_{0,0}x_{i_0}x_{i_0} + a_{0,1}x_{i_1}x_{i_0} + \cdots + a_{0,d}x_{i_d}x_{i_0}) + (a_{1,0}x_{i_0}x_{i_1} + a_{1,1}x_{i_1}x_{i_1} + \cdots + a_{1,d}x_{i_d}x_{i_1}) + \cdots + (a_{d,0}x_{i_0}x_{i_d} + a_{d,1}x_{i_1}x_{i_d} + \cdots + a_{d,d}x_{i_d}x_{i_d}) \big)   (3.3.5)

As we can see, 3.3.4 is equal to 3.3.5. □
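The identity 3.3.3 is also easy to verify numerically; a small Python/NumPy sketch with random values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(n, d))                # rows are the vectors x_i
A = rng.normal(size=(d, d))
A = 0.5 * (A + A.T)                        # symmetric A, as assumed in the text

lhs = sum(x @ A @ x for x in X)            # sum_i x_i^t A x_i
B = sum(np.outer(x, x) for x in X)         # B = sum_i x_i x_i^t
rhs = np.trace(A @ B)                      # tr(AB)
print(np.isclose(lhs, rhs))                # -> True
```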

Now we are going to show that:

\frac{δ\, tr(AD)}{δA} = D + D^t − diag(D).   (3.3.6)

\frac{δ\, tr(AD)}{δA} = \begin{pmatrix} \frac{δ\, tr(AD)}{δa_{0,0}} & \frac{δ\, tr(AD)}{δa_{0,1}} & \cdots & \frac{δ\, tr(AD)}{δa_{0,d}} \\ \frac{δ\, tr(AD)}{δa_{1,0}} & \frac{δ\, tr(AD)}{δa_{1,1}} & \cdots & \frac{δ\, tr(AD)}{δa_{1,d}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{δ\, tr(AD)}{δa_{d,0}} & \frac{δ\, tr(AD)}{δa_{d,1}} & \cdots & \frac{δ\, tr(AD)}{δa_{d,d}} \end{pmatrix},

where

\frac{δ\, tr(AD)}{δa_{k,l}} = x_{i_k} x_{i_l}, \ \text{if } k = l, \qquad \frac{δ\, tr(AD)}{δa_{k,l}} = x_{i_k} x_{i_l} + x_{i_l} x_{i_k}, \ \text{if } k \neq l.   (3.3.7)

Equation 3.3.7 arises from the fact that the matrix A is symmetric, i.e. a_{k,l} = a_{l,k}, so we have two places in 3.3.5 to differentiate: a_{k,l} x_{i_l} x_{i_k} and a_{l,k} x_{i_k} x_{i_l}.

Page 31: Diploma Thesis Polygonal Approximation of Laser Range Scan ...tc... · Cognitive Systems Group Dept. of Computer Science, University of Hamburg Vogt-K¨olln-Str. 30, 22527 Hamburg,

3.3. MATRIX CALCULUS 19

Using this result we get:

\frac{δ\, tr(AD)}{δA} = \begin{pmatrix} x_{i_0}x_{i_0} & x_{i_0}x_{i_1}+x_{i_1}x_{i_0} & \cdots & x_{i_0}x_{i_d}+x_{i_d}x_{i_0} \\ x_{i_1}x_{i_0}+x_{i_0}x_{i_1} & x_{i_1}x_{i_1} & \cdots & x_{i_1}x_{i_d}+x_{i_d}x_{i_1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i_d}x_{i_0}+x_{i_0}x_{i_d} & x_{i_d}x_{i_1}+x_{i_1}x_{i_d} & \cdots & x_{i_d}x_{i_d} \end{pmatrix} = D + D^t − diag(D) □

3.3.3 Second Order Derivation

In this section we just give the definition of the second order derivative with respect to a vector. This formula can be drawn from the large literature on matrix calculus and is presented in its original form. For further details we refer to [31].

\frac{δ\, \vec{x}^{\,t} A \vec{x}}{δ\vec{x}} = (A + A^t)\,\vec{x}   (3.3.8)

3.3.4 Derivative Of Determinant

We define the determinant of the symmetric matrix A using "Laplace's expansion". The Laplace expansion of the determinant of a d × d square matrix A expresses the determinant |A| as a sum of d determinants of (d−1) × (d−1) sub-matrices of A. There are 2d such expressions, one for each row and column of A.

Define the i, j minor matrix M_{i,j} of A as the (d−1) × (d−1) matrix that results from deleting the i-th row and the j-th column of A, and the i, j cofactor of A as:

C_{i,j} = (−1)^{i+j} |M_{i,j}|   (3.3.9)

Then the Laplace expansion is given by the following theorem: Suppose A = (a_{i,j}) is a d × d matrix and i, j ∈ {0,1,...,d−1}. Then the determinant:

|A| = ai,0Ci,0 +ai,1Ci,1 + · · ·+ai,dCi,d (3.3.10)

= a0, jC0, j +a1, jC1, j + · · ·+ad, jCd, j (3.3.11)

To find the derivative of the determinant of a symmetric matrix we need to perform the computation 3.3.10 for every row i ∈ {0,1,···,d−1} (or, respectively, for every column j in 3.3.11). Because of the symmetry a_{i,j} = a_{j,i}, the equations 3.3.10 contain the expression a_{i,j} C_{i,j} twice for every i ≠ j. Thus, the derivative is:

\frac{δ|A|}{δa_{i,j}} = \begin{cases} C_{i,j} & \text{if } i = j \\ 2C_{i,j} & \text{if } i \neq j \end{cases}   (3.3.12)

or:

\frac{δ|A|}{δA} = \begin{pmatrix} C_{0,0} & 2C_{0,1} & \cdots & 2C_{0,d} \\ 2C_{1,0} & C_{1,1} & \cdots & 2C_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ 2C_{d,0} & 2C_{d,1} & \cdots & C_{d,d} \end{pmatrix} = C + C^t − diag(C)   (3.3.13)

Page 32: Diploma Thesis Polygonal Approximation of Laser Range Scan ...tc... · Cognitive Systems Group Dept. of Computer Science, University of Hamburg Vogt-K¨olln-Str. 30, 22527 Hamburg,

20 CHAPTER 3. MATHEMATICAL PRELIMINARY

For a square matrix A, the inverse is written A^{-1}. When A is multiplied by A^{-1} the result is the identity matrix I. Using the "adjoint method" we get:

A^{-1} = \frac{C^t}{|A|},   (3.3.14)

where C = (C_{i,j}) is the cofactor matrix with the coefficients C_{i,j} introduced in 3.3.9. Given the above and the symmetry of C, we see that:

\frac{δ \log|A|}{δA} = \frac{1}{|A|}\, \frac{δ|A|}{δA} = \frac{2C}{|A|} − \frac{diag(C)}{|A|} = 2A^{-1} − diag(A^{-1})   (3.3.15)
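Equation 3.3.15 can be checked with a finite-difference approximation, perturbing the symmetric pairs a_{k,l}, a_{l,k} together; the following Python/NumPy sketch uses test values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
A = M @ M.T + 3 * np.eye(3)          # symmetric positive definite, so log|A| is defined

def logdet(S):
    return np.linalg.slogdet(S)[1]

# Finite-difference gradient with respect to the symmetric parameters a_{k,l}.
eps = 1e-6
num = np.zeros_like(A)
for k in range(3):
    for l in range(3):
        E = np.zeros_like(A)
        E[k, l] += eps
        if k != l:
            E[l, k] += eps
        num[k, l] = (logdet(A + E) - logdet(A)) / eps

Ainv = np.linalg.inv(A)
analytic = 2 * Ainv - np.diag(np.diag(Ainv))   # eq. 3.3.15
print(np.allclose(num, analytic, atol=1e-4))   # -> True
```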


Chapter 4

Expectation Maximization

4.1 Maximum Likelihood Estimator

4.1.1 Example

The expectation maximization (EM) algorithm is a very complex procedure. It is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models. Before we begin to derive the formulas for the application it is advisable to start with the basics. In the following considerations we follow the standard textbook on "Pattern Classification" [1].

To illustrate the types of problems we shall address, let us consider the following imaginary example. Suppose that a health food store wants to automate the process of examining fruits.

As the first project we are going to analyse the size of the fruits. Using optical sensing we take measurements and store them. The camera takes a picture of a fruit and passes the picture to a "feature extractor", whose purpose is to reduce the data by measuring certain properties.

Size is an obvious feature, and we might attempt to draw certain conclusions merely by seeing whether or not the measurement exceeds some critical value xc. Suppose the store manager wants to realise a certain profit, making a compromise between quality and quantity. The fruits are going to be divided into three price classes. The biggest ones are the most expensive, the fruits of nearly average size keep the price unchanged, and the smaller ones get a sales discount. The value xc determines the deviation from the size average.

To choose xc we could obtain some samples and inspect the results. Given the amount of gain to aim at and the fixed volume of fruits, we can determine the value xc to obtain a reasonable amount of additional charge and discount.

We proceed with N fruits of the same kind. Let us assume those are oranges. In that case we are given a set X = {x1, . . . , xN} of measurements. We call this set the "data set" and visualize the data as points on the x axis in figure 4.1. The data are distributed around the average. If we determine the properties of this distribution we can adjust the average size and the so-called critical value according to these properties once and for all. Let us assume that the data are normally distributed (see figure 4.1). In this way we need to determine the mean value µ and the standard deviation σ.


Fig. 4.1: The data set X visualized as points on the x axis; p(x|Θ) is the law by which the data have been generated.

The critical value can now be determined as a percentage of σ.

In the following we group the properties of the distribution in Θ. In our first project Θ consists of only one pair, Θ = <µ,σ>. We call Θ the "parameters" of the distribution.

While this rule describes the measurement variation and helps us to determine the average value and the critical value, we have no guarantee that it will result in the same values on new samples. It would be advisable to obtain some more samples and see how much the new results deviate from the old ones. This suggests that our problem has a statistical component and that our parameters are only estimates.

With the assumption of normally distributed measurements, we may draw a sample x1, . . . , xN of N values from this distribution. We denote this distribution by the known probability density function p(x) = p(x|µ,σ), or in our context p(x|Θ), given in 4.1.1.

p(x|Θ) = \frac{1}{σ\sqrt{2π}}\, e^{-\frac{(x−µ)^2}{2σ^2}}   (4.1.1)

The value p(xi|Θ) says how probable it is that the next measurement has the value xi. Notice that in thiscomputation the parameters of the distribution have to be known.

Our problem is of a different kind. What we are given are the measurements, while the parameters are unknown. Thus we need to compute the probability density associated with our observed data, L(Θ) = p(x1, . . . ,xN|Θ), as a function of Θ with x1, . . . ,xN fixed. L(Θ) is called the "likelihood function".

L(Θ) = p(x_1, . . . ,x_N|Θ) = \left(\frac{1}{2πσ^2}\right)^{N/2} e^{-\frac{\sum_{i=1}^{N}(x_i−µ)^2}{2σ^2}}   (4.1.2)

It is used to estimate the parameters by finding the value of Θ that maximizes L(Θ). This is the “maxi-mum likelihood estimator” (MLE) of Θ.

Notice that p(x|Θ) describes some law of nature, whereas the density function with the estimated parameters resulting from L(Θ) = p(x1, . . . ,xN|Θ) is only an approximation of that law.

Since we decided to make the parameters variable, the expression 4.1.2 describes a family of distributions with the two parameters µ and σ, so we maximize the likelihood L(Θ) over both parameters simultaneously, or if possible individually.

The logarithm is a continuous, strictly increasing function over the range of the likelihood, so the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often


requires simpler algebra, it is the logarithm which is maximized below.

0 = \frac{δ}{δµ} \log\left( \left(\frac{1}{2πσ^2}\right)^{N/2} e^{-\frac{\sum_{i=1}^{N}(x_i−µ)^2}{2σ^2}} \right)   (4.1.3)

= \frac{δ}{δµ} \left( \log\left(\frac{1}{2πσ^2}\right)^{N/2} − \frac{\sum_{i=1}^{N}(x_i−µ)^2}{2σ^2} \right) = −\frac{−2\sum_{i=1}^{N} x_i + 2Nµ}{2σ^2}

µ = \frac{1}{N}\sum_{i=1}^{N} x_i

This is the only turning point and the second derivative is strictly less than zero, thus this is indeed the maximum of the function.

Similarly we differentiate the log likelihood with respect to σ and equate to zero:

0 = \frac{δ}{δσ} \log\left( \left(\frac{1}{2πσ^2}\right)^{N/2} e^{-\frac{\sum_{i=1}^{N}(x_i−µ)^2}{2σ^2}} \right)   (4.1.4)

= \frac{δ}{δσ} \left( \log\left(\frac{1}{2πσ^2}\right)^{N/2} − \frac{\sum_{i=1}^{N}(x_i−µ)^2}{2σ^2} \right) = −\frac{N}{σ} + \frac{\sum_{i=1}^{N}(x_i−µ)^2}{σ^3}

σ^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i−µ)^2

The squared standard deviation, or variance, is the only solution, and the second derivative at this solution is less than zero, thus this is the maximum of the function.¹
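A short numerical illustration of the two estimators; the Python/NumPy sketch below uses synthetic measurements of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=9.0, scale=1.0, size=500)     # synthetic size measurements

mu_hat = x.mean()                                # ML estimate of mu: the sample average
var_hat = ((x - mu_hat) ** 2).mean()             # ML estimate of sigma^2 (biased, but consistent)
print(mu_hat, np.sqrt(var_hat))                  # close to the true values 9.0 and 1.0
```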

4.1.2 The General Principle

In this section we again refer to the textbook on "Pattern Classification" [1]. We follow its introduction of the topic "Parameter Estimation" (compare chapter 3).

Suppose the set of samples X = x1, . . . ,xN has been drawn independently according to the proba-bility law p(~x|Θ).

We assume that p(~x|Θ) has a known parametric form, and is therefore determined uniquely by thevalue of a parameter vector Θ. In our framework we have p(~x|Θ) ∼ N(~µ,Σ), where the componentsof Θ include the components of both ~µ and Σ. The problem is to use the information provided by thesamples to obtain good estimates for the unknown parameter vector Θ.

We simplified the problem definition by assumption that the samples contain information about Θ

only and no sample can be drawn from p(~x|Θ) which could contain information about any differentparameter vector.

¹We need to determine whether the solutions are "biased". If the expectation value of the estimator equals the true parameter, the estimator is unbiased. The estimator of µ is unbiased. The estimator of σ² is biased but consistent.


Suppose that X contains N samples, X = x1, . . . ,xN. Then, since the samples were drawn inde-pendently,

p(X|Θ) = \prod_{i=1}^{N} p(\vec{x}_i|Θ)   (4.1.5)

Viewed as a function of Θ, p(X |Θ) is called the “likelihood” of Θ with respect to the set of samples. The“maximum likelihood estimate” of Θ is that value Θ that maximizes p(X |Θ). Intuitively, it correspondsto the value of Θ that in some sense best agrees with the actually observed samples.

The logarithm is a monotonically increasing function, so the value Θ that maximizes the log-likelihood also maximizes the likelihood. To differentiate, it is easier to work with a sum of logarithms than with a product. If p(X|Θ) is a well-behaved, differentiable function of Θ, the maximum likelihood estimate can be found by the standard methods of differential calculus. Let Θ be the p-component vector Θ = (θ_1, . . . ,θ_p)^t, and let ∆_Θ be the gradient operator,

∆_Θ = \begin{pmatrix} \frac{δ}{δθ_1} \\ \vdots \\ \frac{δ}{δθ_p} \end{pmatrix},   (4.1.6)

and let l(Θ) be the log-likelihood function. Then

l(Θ) = \log p(X|Θ)   (4.1.7)
= \sum_{i=1}^{N} \log p(\vec{x}_i|Θ)   (4.1.8)

∆_Θ l = \sum_{i=1}^{N} ∆_Θ \log p(\vec{x}_i|Θ)   (4.1.9)

Thus, a set of necessary conditions for the maximum likelihood estimate for Θ can be obtained from theset of p equations ∆Θl = 0.

If we apply the log-likelihood function 4.1.7 without simplification step 4.1.8 onto the problemdescribed in the example, we get:

l(µ,σ) = \log(p(X|µ,σ)) = \log\left( \prod_{i=1}^{N} p(x_i|µ,σ) \right)
= \log\left( \prod_{i=1}^{N} \frac{1}{\sqrt{2π}\,σ}\, e^{-\frac{(x_i−µ)^2}{2σ^2}} \right)
= \log\left( \left(\frac{1}{2πσ^2}\right)^{N/2} e^{-\frac{\sum_{i=1}^{N}(x_i−µ)^2}{2σ^2}} \right)

4.1.3 The Multivariate Normal Case

To see how these results apply to a specific case, suppose that we obtain the samples and inspect them with respect to more than one property. In such a case we suppose that the samples have been drawn from a normal population with mean ~µ and a covariance matrix Σ. In this new case the mean ~µ = (µ_1 · · · µ_d) is a vector; d is the number of properties we have chosen to examine. Σ is the matrix which describes


the squared deviation of the samples from the mean and how these samples are related. The following derivation is adapted from [33] for parameter estimation without the classification problem.

Let

p(\vec{x}_i|\vec{µ},Σ) = \frac{1}{(2π)^{d/2}|Σ|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i−\vec{µ})^t Σ^{-1}(\vec{x}_i−\vec{µ})},   (4.1.10)

then the log likelihood is

l(\vec{µ},Σ) = \sum_{i=1}^{N} \left( −\frac{d}{2}\log(2π) − \frac{1}{2}\log(|Σ|) − \frac{1}{2}(\vec{x}_i−\vec{µ})^t Σ^{-1}(\vec{x}_i−\vec{µ}) \right),   (4.1.11)

Identifying Θ with ~µ and Σ, we determine the maximum likelihood estimate via 4.1.6. Taking the derivative of equation 4.1.11 with respect to ~µ using 3.3.8 and setting it equal to zero, we get:

\frac{δ}{δ\vec{µ}}\, l(\vec{µ},Σ) = \sum_{i=1}^{N} Σ^{-1}(\vec{x}_i−\vec{µ}) = 0

with which we can easily solve for ~µ to obtain:

\vec{µ} = \frac{1}{N}\sum_{i=1}^{N}\vec{x}_i   (4.1.12)

The result says that the maximum likelihood estimate for the unknown population mean is just the arithmetic average of the samples. Geometrically, if we think of the N samples as a cloud of points, the sample mean is the centroid of the cloud.

To find Σ, note that we can rewrite equation 4.1.11. Also note that if A is a matrix and |A| its determinant, then the following holds: |A| = 1/|A^{-1}|. Using this and the result 3.3.3 we get:

l(\vec{µ},Σ) = \sum_{i=1}^{N} \left( −\frac{d}{2}\log(2π) + \frac{1}{2}\log(|Σ^{-1}|) − \frac{1}{2}\, tr\big(Σ^{-1}(\vec{x}_i−\vec{µ})(\vec{x}_i−\vec{µ})^t\big) \right)
= \sum_{i=1}^{N} \left( −\frac{d}{2}\log(2π) + \frac{1}{2}\log(|Σ^{-1}|) − \frac{1}{2}\, tr(Σ^{-1} D_i) \right)

where D_i = (\vec{x}_i−\vec{µ})(\vec{x}_i−\vec{µ})^t. Taking the derivative with respect to Σ^{-1} using the results 3.3.6 and 3.3.15, we get:

\frac{δ}{δΣ^{-1}}\, l(\vec{µ},Σ) = \sum_{i=1}^{N} \left( \frac{1}{2}(2Σ − diag(Σ)) − \frac{1}{2}(2D_i − diag(D_i)) \right) = \frac{1}{2} \sum_{i=1}^{N} (2E_i − diag(E_i)) = 2F − diag(F)

where E_i = Σ − D_i and F = \frac{1}{2}\sum_{i=1}^{N} E_i. Setting the derivative to zero, i.e., 2F − diag(F) = 0, implies

that F = 0. This gives:

0 = \sum_{i=1}^{N} (Σ − D_i)

Σ = \frac{1}{N}\sum_{i=1}^{N} D_i = \frac{1}{N}\sum_{i=1}^{N} (\vec{x}_i−\vec{µ})(\vec{x}_i−\vec{µ})^t   (4.1.13)


The maximum likelihood estimate for the covariance matrix is the arithmetic average of the N matrices (\vec{x}_i−\vec{µ})(\vec{x}_i−\vec{µ})^t. If we think of the N samples as a cloud of points again, the covariance matrix describes the spread of the cloud.
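The multivariate estimators 4.1.12 and 4.1.13 can be illustrated in the same way; a Python/NumPy sketch with synthetic two-dimensional samples (values chosen by us for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D samples (two properties per fruit), drawn for illustration.
true_mu = np.array([9.0, 15.0])
true_Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)

mu_hat = X.mean(axis=0)                # eq. 4.1.12: centroid of the cloud
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)     # eq. 4.1.13: average of the outer products
print(mu_hat, Sigma_hat, sep="\n")     # close to true_mu and true_Sigma
```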

4.2 Classification Problem

4.2.1 Bayes Rule

To discuss our new problem we begin with the illustration given in the previous example. Let us reconsider the hypothetical problem. In our first application we assumed that the measurements are taken from fruits of one kind only. Suppose now that we put two different kinds of fruit under observation, oranges and grapefruits. We simplify the problem and assume that only the size is measured. Given the set of measurements we do not know whether the current value belongs to an orange or a grapefruit. We refer once again to the textbook on "Pattern Classification" [1]. We follow here the introduction of "Bayes Decision Theory" (compare chapter 2).

Suppose that our fruits are mixed together and that an observer watching the fruits emerging fromour machine finds it so hard to predict what type will emerge next, that the sequence of types of fruitsappears to be at random. Using the decision theoretic terminology, we say that as each fruit emerges,nature is in one or the other of the two possible states: either the fruit is an orange or the fruit is agrapefruit. We let ω denote the “state of nature”, with ω1 for orange and ω2 for grapefruit. Because thestate of nature is so unpredictable, we consider ω to be a random variable.

If our fruit store delivered as many oranges as grapefruits, we would say that the next fruit is equally likely to be an orange or a grapefruit. More generally, we assume that there is some "a priori probability" p(ω1) that the next fruit is an orange and some a priori probability p(ω2) that it is a grapefruit. These a priori probabilities reflect our prior knowledge of how likely we are to see an orange or a grapefruit before the fruit actually appears.

Suppose for a moment that we were forced to make a decision about the type of fruit that willappear without being allowed to see it. The only information we are allowed to use is the value of the apriori probabilities. If a decision must be made with so little information, it seems reasonable to use thefollowing “decision rule”: decide ω1 if p(ω1) > p(ω2), otherwise decide ω2.

We do not make decisions with so little evidence. In our example, we can use the size measurementx as evidence. Different samples of fruits will yield different size measurements. We express thisvariability in probabilistic terms. We consider x to be a continuous random variable whose distributiondepends on the state of nature. Let p(x|ω j) be the “state conditional probability density” function for x,the probability density function for x given that the state of nature is ω j. Then the difference betweenp(x|ω0) and p(x|ω1) describes the difference in size between oranges and grapefruits, see figure 4.2.Notice that we added the condition that the states are parameterized by Θ.

Suppose that we know both the a priori probabilities p(ω j) and the conditional densities p(x|ω j).Suppose further that we measure the size of the fruit and discover the value of x. How does that measure-ment influence our attitude concerning the true state of nature? The answer to this question is providedby “Bayes Rule”:

p(ω_j|x) = \frac{p(x|ω_j)\, p(ω_j)}{p(x)}   (4.2.1)

where

p(x) = \sum_{j=1}^{2} p(x|ω_j)\, p(ω_j)   (4.2.2)


Fig. 4.2: The data set X, visualized as points on the x axis, contains measurements of oranges (0) and grapefruits (1). p(x|0,Θ) and p(x|1,Θ) are the respective laws of nature by which the measurements are taken; p(x|Θ) is the joint probability function.

Bayes rule shows how observing the value of x changes the a priori probability p(ω_j) into the "a posteriori" probability p(ω_j|x).
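A small numerical illustration of Bayes rule 4.2.1 for the two fruit classes; the priors and class-conditional parameters below are invented for demonstration (Python/NumPy):

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Invented class parameters: omega_1 = orange, omega_2 = grapefruit.
priors = np.array([0.6, 0.4])                       # p(omega_1), p(omega_2)
means, sigmas = np.array([9.0, 15.0]), np.array([1.0, 1.6])

def posterior(x):
    joint = gauss(x, means, sigmas) * priors        # p(x|omega_j) p(omega_j)
    return joint / joint.sum()                      # eq. 4.2.1, with p(x) from eq. 4.2.2

print(posterior(11.0))   # roughly [0.88, 0.12]: the measurement favours omega_1
```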

4.2.2 Parameter Estimation

Let’s illustrate our new problem again. We have a device measuring size of fruits. The measurementsare taken from different types of fruits. The device receives the fruits in mixed order, such that theknowledge about the type is not given. The device produces a set of measurements (the data set) whichhas to be examined. Our goal is to sort the fruits of each type into three categories according to theautomatically determined critical values. Thus we have to determine the parameters for each of fruittypes.

From the previous considerations we know that the parameter estimation means to approximate alaw of nature. This law determines the variation of the measurements for each fruit type. There is onesuch law for each type. Thus we need to approximate as much laws as fruit types we have.

We assumed to have two fruit types only. In such a way our parameter vector is expanded by a newpair of values:

Θ = < µ1,σ1 >,< µ1,σ2 >In the previous section we considered the measurements as values produces by the nature. The state

of nature determines which law is taken into account to produce the values. Let the number of states ofnature to be M, in our specific case M = 2. We assume the following probabilistic model (compare thedefinition in 4.2.2):

p(x|Θ) = \sum_{j=1}^{M} p(x|ω_j,Θ)\, p(ω_j|Θ)

such that \sum_{j=1}^{M} p(ω_j|Θ) = 1, and each p(x|ω_j,Θ) is parameterized by θ_j. In other words, we assume we have M component densities mixed together with M mixing coefficients p(ω_j|Θ).

Since we are not given the hidden membership information, the data set X is incomplete. The log

likelihood expression for the density from the incomplete data X is given by:

\log(L(Θ|X)) = \log(p(X|Θ)) = \log\left( \prod_{i=1}^{N} p(x_i|Θ) \right) = \sum_{i=1}^{N} \log\left( \sum_{j=1}^{M} p(x_i|ω_j,Θ)\, p(ω_j|Θ) \right)

The "incomplete-data log-likelihood" is difficult to optimize because it contains the log of the sum.


4.3 Derivation of the EM algorithm

In this section we want to introduce the "Expectation Maximization algorithm" (EM). We first derive the expectation value and show its use in controlling the iterative parameter estimation. We demonstrate the convergence property of the algorithm. The parameter estimation procedure is then modified according to the new framework. The book "Expectation Maximization Theory" [32] by M.W. Mark, S.Y. Kung and S.H. Lin provides the statements and details for further investigation.

As we have seen in the section 4.2.2, lacking the hidden membership information results in a com-plicated parameter estimation procedure. The estimation procedure, however, can be greatly simplifiedif this membership information is assumed to be known.

As discussed before, we assume we have M component densities mixed together. We consider X as incomplete and posit the existence of unobserved or "hidden" data items Z = {z_i}_{i=1}^{N} whose values inform us which component density produced each data item. That is, we assume that z_i ∈ {1,···,M} for each i, and z_i = k if the ith sample was generated by the kth mixture component.

Note that the combination of the observations X and the hidden states Z constitutes the complete data. The likelihood of the complete data is instrumental in the EM formulation.

4.3.1 General Analysis

According to the probability theory the state conditional probability density is defined as:

p(X|Θ) = \frac{p(Z,X|Θ)}{p(Z|X,Θ)}   (4.3.1)

Using the equation 4.3.1 and equation 4.1.7, one can write the incomplete data log likelihood as follows:

l(Θ^n) = \log p(X|Θ^n)   (4.3.2)
= \log p(X|Θ^n) \sum_Z p(Z|X,Θ^n), \quad \text{since } \sum_Z p(Z|X,Θ^n) = 1
= \sum_Z p(Z|X,Θ^n)\, \log \frac{p(Z,X|Θ^n)}{p(Z|X,Θ^n)}
= \sum_Z p(Z|X,Θ^n)\, \log p(Z,X|Θ^n) − \sum_Z p(Z|X,Θ^n)\, \log p(Z|X,Θ^n)
= E_Z\big(\log p(Z,X|Θ^n)\,|\,X,Θ^n\big) − E_Z\big(\log p(Z|X,Θ^n)\,|\,X,Θ^n\big)
= Q(Θ^n|Θ^n) + R(Θ^n|Θ^n)

where EZ denotes expectation with respect to Z. Thus, denote

R(Θ|Θn)≡−EZ (log p(Z|X ,Θ)|X ,Θn) (4.3.3)

and

Q(Θ|Θ^n) ≡ E_Z\big(\log p(Z,X|Θ)\,|\,X,Θ^n\big)   (4.3.4)

where R(Θ|Θn) is an entropy term representing the difference between the incomplete data likelihoodand the expectation of the completed data likelihood. Q(Θ|Θn) is the expectation of the completed datalikelihood function.

We can estimate parameters by maximizing the expectation function Q(Θ|Θn) or by maximizingan incomplete data likelihood function R(Θ|Θn), leading to an entropy interpretation of the algorithm.


We follow the first suggestion, and drop the second, since maximizing the first term we maximize theincomplete data likelihood expression 4.3.2.

Notice that the expectation of the completed data likelihood in 4.3.4 is a function of Θ. Θ corresponds to the parameters that will be optimized in an attempt to maximize the likelihood. The second argument Θ^n corresponds to the parameters that we use to evaluate the expectation.

4.3.2 Parameter Estimation

In this section we are going to define the complete data log likelihood function and derive with its helpthe parameters maximizing the expectation value. We refer in this section to [33] which provides uswith theory and add some extra explanations.

As we assumed before, we have M component densities mixed together. We considered the data setX as incomplete. The unobserved data items Z inform us which component density generated each dataitem. Thus, zi ∈ 1, · · · ,M for each i ∈ 1, · · · ,N, and zi = k if the ith sample was generated by the kth

mixture component. If we know the values of Z, the likelihood becomes:

\log(L(Θ|X,Z)) = \log(p(X,Z|Θ)) = \sum_{i=1}^{N} \log\big(α_{z_i}\, p(x_i|z_i,θ_{z_i})\big)

where α_{z_i} are the mixing coefficients and θ_{z_i} ∈ Θ are the parameters of the mixture component z_i. We do not know the values of Z, but if we treat Z as a random vector we can proceed.

We first must derive an expression for the distribution of the unobserved data. Let us first guess atparameters for the mixture density. We guess that

Θ^n = \{α_1^n, ···, α_M^n, θ_1^n, ···, θ_M^n\}

are the appropriate parameters for the likelihood L(Θ^n|X,Z). Given Θ^n, we can easily compute p(x_i|j,θ_j^n) for each i and j. In addition, the mixing parameters α_j can be thought of as prior probabilities of each mixture component, that is, α_j = p(j|θ_j^n). Therefore, using Bayes' rule, we can compute:

p(z_i|x_i,Θ^n) = \frac{α_{z_i}^n\, p(x_i|z_i,θ_{z_i}^n)}{p(x_i|Θ^n)} = \frac{α_{z_i}^n\, p(x_i|z_i,θ_{z_i}^n)}{\sum_{k=1}^{M} α_k^n\, p(x_i|k,θ_k^n)}

and

p(Z|X,Θ^n) = \prod_{i=1}^{N} p(z_i|x_i,Θ^n)

In this case, the equation 4.3.4 takes the form:

Q(Θ,Θ^n) = \sum_Z p(Z|X,Θ^n)\, \log p(Z,X|Θ)
= \sum_Z \sum_{i=1}^{N} \log\big(α_{z_i}\, p(x_i|z_i,θ_{z_i})\big) \prod_{j=1}^{N} p(z_j|x_j,Θ^n)   (4.3.5)

Before we go any further, we would like to draw your attention to equation 4.3.4. The computation of the expectation function involves summing some terms over \sum_Z. What does that mean?

Let's consider that we have only one single sample point x_0 and M density components. This is not very meaningful for the statistical approach, but very useful for our illustration. In this way, we get z_0 ∈ {1,···,M} information units about the membership of the sample point.


Fig. 4.3: Membership of one sample point. Left: the sample point belongs to cluster (mixture) 1. Right: the sample point belongs to all clusters.

Figure 4.3 shows us two examples. The left illustration shows that the sample point belongs to mixture component 1. Summing over all Z would then become trivial, resulting in the term itself. The right illustration shows that the sample point might belong to different components at the same time. In this case we need to sum over all Z, because all terms contain nontrivial information. The computation 4.3.5 in this particular case would be:

\sum_Z p(Z|X,Θ^n)\, \log p(Z,X|Θ^n) = \sum_{z_1=1}^{M} \log\big(α_{z_1}\, p(x_1|z_1,θ_{z_1})\big)\, p(z_1|x_1,Θ^n)

where α_{z_1} = 1. We see that we need to sum over M values for every z_i. In the case of N samples the equation takes the form:

\sum_Z p(Z|X,Θ^n)\, \log p(Z,X|Θ^n) = \sum_{z_1=1}^{M} \sum_{z_2=1}^{M} \cdots \sum_{z_N=1}^{M} \sum_{i=1}^{N} \log\big(α_{z_i}\, p(x_i|z_i,θ_{z_i})\big) \prod_{j=1}^{N} p(z_j|x_j,Θ^n)

Now let us continue with our considerations and define a function which selects the status of thehidden states.

δ_{i,j} = δ(i,j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}

With this indicator function we can express:

\log\big(α_{z_i}\, p(x_i|z_i,θ_{z_i})\big) = \sum_{l=1}^{M} δ_{l,z_i}\, \log\big(α_l\, p(x_i|l,θ_l)\big)

Using this consideration we get:

Q(Θ,Θ^n) = \sum_{z_1=1}^{M} \sum_{z_2=1}^{M} \cdots \sum_{z_N=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{M} δ_{l,z_i}\, \log\big(α_l\, p(x_i|l,θ_l)\big) \prod_{j=1}^{N} p(z_j|x_j,Θ^n)

= \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big(α_l\, p(x_i|l,θ_l)\big) \sum_{z_1=1}^{M} \sum_{z_2=1}^{M} \cdots \sum_{z_N=1}^{M} δ_{l,z_i} \prod_{j=1}^{N} p(z_j|x_j,Θ^n)   (4.3.6)

In this form, Q(Θ,Θn) does not look very clear, yet it can be greatly simplified.


We first note that for l ∈ {1,···,M}

\sum_{z_1=1}^{M} \cdots \sum_{z_N=1}^{M} δ_{l,z_i} \prod_{j=1}^{N} p(z_j|x_j,Θ^n)
= \sum_{z_1=1}^{M} \cdots \sum_{z_{i-1}=1}^{M} \sum_{z_{i+1}=1}^{M} \cdots \sum_{z_N=1}^{M} \left( \sum_{z_i=1}^{M} δ_{l,z_i}\, p(z_i|x_i,Θ^n) \right) \prod_{j=1, j\neq i}^{N} p(z_j|x_j,Θ^n)
= \left( \sum_{z_1=1}^{M} \cdots \sum_{z_{i-1}=1}^{M} \sum_{z_{i+1}=1}^{M} \cdots \sum_{z_N=1}^{M} \prod_{j=1, j\neq i}^{N} p(z_j|x_j,Θ^n) \right) p(l|x_i,Θ^n)
= \prod_{j=1, j\neq i}^{N} \left( \sum_{z_j=1}^{M} p(z_j|x_j,Θ^n) \right) p(l|x_i,Θ^n) = p(l|x_i,Θ^n)

since \sum_{z_i=1}^{M} δ_{l,z_i}\, p(z_i|x_i,Θ^n) = p(l|x_i,Θ^n) and \sum_{z_j=1}^{M} p(z_j|x_j,Θ^n) = 1. Thus, we can write equation 4.3.6 as:

Q(Θ,Θ^n) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big(α_l\, p(x_i|l,θ_l)\big)\, p(l|x_i,Θ^n)

= \sum_{l=1}^{M} \sum_{i=1}^{N} \log(α_l)\, p(l|x_i,Θ^n) + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big(p(x_i|l,θ_l)\big)\, p(l|x_i,Θ^n)   (4.3.7)

To maximize this expression, we can maximize the term containing αl and the term containing θlindependently since they are not related.

To find the expression for α_l, we introduce the Lagrange multiplier λ with the constraint that \sum_l α_l = 1, and solve the following equation:

\frac{δ}{δα_l} \left[ \sum_{l=1}^{M} \sum_{i=1}^{N} \log(α_l)\, p(l|x_i,Θ^n) + λ\left(\sum_l α_l − 1\right) \right] = 0

or

\sum_{i=1}^{N} \frac{1}{α_l}\, p(l|x_i,Θ^n) + λ = 0

Summing both sides over l, we get λ = −N, resulting in:

α_l = \frac{1}{N}\sum_{i=1}^{N} p(l|x_i,Θ^n)

In our framework we deal with multivariate Gaussian distributions with mean µ and covariance matrix Σ, i.e. θ = <µ,Σ>; then

p(x|l,µ_l,Σ_l) = \frac{1}{(2π)^{d/2}|Σ_l|^{1/2}}\, e^{-\frac{1}{2}(x−µ_l)^t Σ_l^{-1}(x−µ_l)}   (4.3.8)

Please note that µ and x are vectors in this case. The considerations made before carry over to vectors without any new arguments; we will use the same notation as before.


Taking the log of the equation 4.3.8, and substituting into the right side of equation 4.3.7, we get:

\sum_{l=1}^{M} \sum_{i=1}^{N} \log\big(p(x_i|l,θ_l)\big)\, p(l|x_i,Θ^n)
= \sum_{l=1}^{M} \sum_{i=1}^{N} \left( −\frac{d}{2}\log(2π) − \frac{1}{2}\log(|Σ_l|) − \frac{1}{2}(\vec{x}_i−\vec{µ}_l)^t Σ_l^{-1}(\vec{x}_i−\vec{µ}_l) \right) p(l|x_i,Θ^n)   (4.3.9)

To derive the update equations for these distributions, we recall the computations made for theincomplete data log likelihood in 4.1.3. We see that the equation 4.1.11 is very similar to our equation4.3.9. We do the derivation here in the same way.

Taking the derivative of equation 4.3.9 with respect to µl and setting it equal to zero, we get:

\sum_{i=1}^{N} Σ_l^{-1}(x_i−µ_l)\, p(l|x_i,Θ^n) = 0

with which we can solve for µl to obtain:

µ_l = \frac{\sum_{i=1}^{N} x_i\, p(l|x_i,Θ^n)}{\sum_{i=1}^{N} p(l|x_i,Θ^n)}.

To find Σl we refer to the computations made in 4.1.3. We rewrite the equation 4.3.9.

\sum_{l=1}^{M} \left[ −\frac{d}{2}\log(2π) \sum_{i=1}^{N} p(l|x_i,Θ^n) + \frac{1}{2}\log(|Σ_l^{-1}|) \sum_{i=1}^{N} p(l|x_i,Θ^n) − \frac{1}{2} \sum_{i=1}^{N} p(l|x_i,Θ^n)\, tr\big(Σ_l^{-1}(x_i−µ_l)(x_i−µ_l)^t\big) \right]

= \sum_{l=1}^{M} \left[ −\frac{d}{2}\log(2π) \sum_{i=1}^{N} p(l|x_i,Θ^n) + \frac{1}{2}\log(|Σ_l^{-1}|) \sum_{i=1}^{N} p(l|x_i,Θ^n) − \frac{1}{2} \sum_{i=1}^{N} p(l|x_i,Θ^n)\, tr\big(Σ_l^{-1} D_{l,i}\big) \right]

where D_{l,i} = (x_i−µ_l)(x_i−µ_l)^t. Taking the derivative with respect to Σ_l^{-1} using the results 3.3.6 and 3.3.15, we get:

\frac{1}{2} \sum_{i=1}^{N} p(l|x_i,Θ^n)\big(2Σ_l − diag(Σ_l)\big) − \frac{1}{2} \sum_{i=1}^{N} p(l|x_i,Θ^n)\big(2D_{l,i} − diag(D_{l,i})\big)
= \frac{1}{2} \sum_{i=1}^{N} p(l|x_i,Θ^n)\big(2E_{l,i} − diag(E_{l,i})\big)
= 2F − diag(F)

where E_{l,i} = Σ_l − D_{l,i} and F = \frac{1}{2}\sum_{i=1}^{N} p(l|x_i,Θ^n)\, E_{l,i}. Setting the derivative to zero, i.e., 2F − diag(F) = 0, implies that F = 0. This gives:

\sum_{i=1}^{N} p(l|x_i,Θ^n)\,(Σ_l − D_{l,i}) = 0

Σ_l = \frac{\sum_{i=1}^{N} p(l|x_i,Θ^n)\, D_{l,i}}{\sum_{i=1}^{N} p(l|x_i,Θ^n)} = \frac{\sum_{i=1}^{N} p(l|x_i,Θ^n)\,(x_i−µ_l)(x_i−µ_l)^t}{\sum_{i=1}^{N} p(l|x_i,Θ^n)}


4.3.3 Summary

Let’s summarize our achievements. The target in our framework is, given a set of data, to estimatethe law of nature, which produced this set. The law of nature consists of mixed component. Thus weestimate the components separately. We use the maximum likelihood estimation on incomplete datamaximizing the expectation value of the complete data.

l(Θ,Θn) = Q(Θ|Θn)+R(Θ|Θn)

The expectation value is:

Q(Θ|Θ^n) = E_Z\big(\log p(Z,X|Θ)\,|\,X,Θ^n\big) = \sum_Z p(Z|X,Θ^n)\, \log p(Z,X|Θ)

Q(Θ,Θ^n) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(α_l)\, p(l|\vec{x}_i,Θ^n) + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big(p(\vec{x}_i|l,θ_l)\big)\, p(l|\vec{x}_i,Θ^n)

where the conditional density for mixture component is:

p(l|\vec{x}_i,Θ^n) = \frac{α_l^n\, p(\vec{x}_i|l,θ_l^n)}{\sum_{k=1}^{M} α_k^n\, p(\vec{x}_i|k,θ_k^n)}   (4.3.10)

and

p(\vec{x}_i|l,\vec{µ}_l,Σ_l) = \frac{1}{(2π)^{d/2}|Σ_l|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i−\vec{µ}_l)^t Σ_l^{-1}(\vec{x}_i−\vec{µ}_l)}   (4.3.11)

For which the parameters are:

α_l^{n+1} = \frac{1}{N}\sum_{i=1}^{N} p(l|\vec{x}_i,Θ^n)   (4.3.12)

and

\vec{µ}_l^{n+1} = \frac{\sum_{i=1}^{N} \vec{x}_i\, p(l|\vec{x}_i,Θ^n)}{\sum_{i=1}^{N} p(l|\vec{x}_i,Θ^n)}   (4.3.13)

Σ_l^{n+1} = \frac{\sum_{i=1}^{N} p(l|\vec{x}_i,Θ^n)\,(\vec{x}_i−\vec{µ}_l^{n+1})(\vec{x}_i−\vec{µ}_l^{n+1})^t}{\sum_{i=1}^{N} p(l|\vec{x}_i,Θ^n)}   (4.3.14)

At last the meaning of the index n becomes visible. As you might notice, when computing the expectation value we perform two steps with different Θ. The computation of 4.3.10 depends on α_l and the parameters of 4.3.11, µ_l and Σ_l. The computation of the parameters of 4.3.11 depends on the conditional density 4.3.10.

The computation of the expectation value cannot be performed in one go. Let us assume we know the conditional density, so that no further computation is needed for it. Say we mark this density and all parameters it depends on by n. Then the unknown quantities can be computed. We


mark the new quantities by n + 1. Knowing the terms with mark n and n + 1 the expectation value canbe computed. If the expectation value is computed and the parameters marked by n + 1 are given, wemay compute the conditional density and mark it with a new index.

These two steps, computation of the expectation value and estimation of new parameters to maximize it, result in the approach called the "Expectation Maximization algorithm". In the modified form of the "Maximization" (M) step, instead of maximizing Q(Θ,Θ^n), some Θ^{n+1} is found such that Q(Θ^{n+1},Θ^n) > Q(Θ^n,Θ^n). This form of the algorithm is called "Generalized EM" (GEM) and is also guaranteed to converge.

Only one issue remains open: how to begin? We need to initialize the algorithm with pseudo parameters.
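For illustration, one complete iteration of the two steps above can be sketched as follows (Python/NumPy; the function name, shapes and initialisation are ours, not the thesis implementation):

```python
import numpy as np

def em_step(X, alphas, mus, Sigmas):
    """One EM iteration for a Gaussian mixture, following eqs. 4.3.10 - 4.3.14."""
    N, d = X.shape
    M = len(alphas)
    # E step: responsibilities p(l | x_i, Theta^n), eq. 4.3.10
    resp = np.zeros((N, M))
    for l in range(M):
        diff = X - mus[l]
        inv = np.linalg.inv(Sigmas[l])
        norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigmas[l])))
        resp[:, l] = alphas[l] * norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: eqs. 4.3.12 - 4.3.14
    Nl = resp.sum(axis=0)
    alphas = Nl / N
    mus = (resp.T @ X) / Nl[:, None]
    new_Sigmas = []
    for l in range(M):
        diff = X - mus[l]
        new_Sigmas.append((resp[:, l, None] * diff).T @ diff / Nl[l])
    return alphas, mus, np.array(new_Sigmas)
```

One would call em_step repeatedly, starting from the pseudo parameters, until the expectation value no longer changes.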

4.3.4 Example

Let’s return to our example. We have a device capturing images of fruits and computing the size ofeach. We measure two types of fruits: oranges and grapefruits. The fruits are mixed, such that wedo not know, which fruit type is to be measured next. Our goal is to divide each type of fruits intothree price categories. Thus for each type we have to determine the mean size and some critical value.In our framework we assume that the nature is “producing” the fruits following the Gaussian normaldistribution as law.

Let’s derive the parameter estimators according to the equations 4.3.12, 4.3.13 and 4.3.14 for onedimensional case:

α_l^{n+1} = \frac{1}{N}\sum_{i=1}^{N} p(l|x_i,Θ^n)   (4.3.15)

µ_l^{n+1} = \frac{\sum_{i=1}^{N} x_i\, p(l|x_i,Θ^n)}{\sum_{i=1}^{N} p(l|x_i,Θ^n)}   (4.3.16)

σ_l^{n+1} = \sqrt{\frac{\sum_{i=1}^{N} (x_i−µ_l)^2\, p(l|x_i,Θ^n)}{\sum_{i=1}^{N} p(l|x_i,Θ^n)}}   (4.3.17)

The left hand side of figure 4.4 demonstrates the initial state of our algorithm. The bold line represents the law of nature, whose parameters we assume to know. Let Θ_nature be {α_1, α_2, θ_1, θ_2}, where the α_i are the priors and the θ_i are the parameters of each law. We consider:

         α1     α2     µ1     µ2      σ1     σ2     E
real     0.6    0.4    9      15      1      1.6    -
init     0.57   0.43   6.7    17      1.5    1.5    -
Iter 1   0.58   0.42   8.97   14.57   0.76   1.1    -120.775
Iter 3   0.59   0.41   9.01   14.67   0.77   0.92   -120.548
Iter 6   0.59   0.41   9.01   14.68   0.77   0.92   -120.500

Table 4.1: Example parameters.


Fig. 4.4: EM initialisation (left); components after the first iteration (right).

Fig. 4.5: Components after the third and sixth iteration.

where the parameters of the law of nature are in the first row. In probability theory the law of nature of the density we are going to estimate is called the "ground truth density"; we use this notation.

Let's have a look at the implementation. As considered before, our approach has to be initialized. It performs the expectation and the maximization step in each iteration until the procedure converges. Convergence is achieved when the expectation value does not change any more. We code up the algorithm in the "MATLAB" environment.

First: the initial parameters are chosen at random. The initial mean values are the minimum and the maximum of the data. The first prior is a random number between zero and one and the second alpha is the remainder to one. The standard deviations are random numbers between one and five. The data set is produced by a built-in MATLAB function generating normally distributed numbers for given mean and deviation values. We parameterize the function with the values presented in the first row of table 4.1. The resulting initial values are shown in the second row.

Second: since we are given the parameters, the state conditional distribution p(x|l,θ_l) can be computed (see 4.3.11). We compute the membership values with the conditional density p(l|x,θ_l) (see 4.3.10). With this information new parameters can be estimated. We determine the expectation value and maximize it by estimating new parameters with 4.3.15–4.3.17. This constitutes one iteration. At the end of every iteration a new expectation value can be determined and the ground truth estimated.
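The thesis implementation is written in MATLAB; as an illustration of the same loop, here is a Python/NumPy sketch of the one-dimensional two-component case (synthetic data and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data drawn from the ground truth 0.6*N(9,1) + 0.4*N(15,1.6).
x = np.concatenate([rng.normal(9, 1.0, 600), rng.normal(15, 1.6, 400)])

# Random initialisation in the spirit of the text: means at min/max of the data.
a0 = rng.uniform()
alpha = np.array([a0, 1.0 - a0])
mu = np.array([x.min(), x.max()])
sigma = rng.uniform(1, 5, size=2)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E step: membership values p(l | x_i, Theta^n), eq. 4.3.10
    joint = alpha * normal_pdf(x[:, None], mu, sigma)
    resp = joint / joint.sum(axis=1, keepdims=True)
    # M step: eqs. 4.3.15 - 4.3.17
    Nl = resp.sum(axis=0)
    alpha = Nl / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nl
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nl)

print(alpha, mu, sigma)   # typically close to the ground truth parameters
```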

The results of a successful procedure are shown in figure 4.4 and 4.5. The left top figure shows the


Fig. 4.6: EM initialisation (left); components after the fourth iteration (right).

ground truth, which is denoted by the bold line, and the two weighted normal distributions representing the current parameters. The title is the current expectation value. The top right figure is the result after the first iteration, the bottom left after the third iteration, and the bottom right shows the result after the sixth iteration, after which convergence is achieved.

This example is of course perfectly chosen for a good result, but not every data set can be estimated so well. The guaranteed convergence does not guarantee the global maximum.

Figure 4.6 demonstrates an unfortunate result of parameter estimation. The data set, represented by the spikes, has been produced in the same way as the first one.

         α1     α2     µ1      µ2    σ1     σ2
real     0.6    0.4    9       15    1      1.6
init     0.5    0.5    7       17    0.6    1.9
Iter 4   0.97   0.03   11.48   8.5   4.97   0

Table 4.2: Bad Example parameters.

The estimated parameters in the fourth iteration can be found in the bottom row of table 4.2; their representation is shown on the right of figure 4.6. The result can be interpreted as only one normal distribution, since the prior of the second one is much smaller than the first and the means are placed nearby. In the limit of convergence α_1 increases to 1 and α_2 goes to 0.

4.3.5 Convergence Property of EM

The following section demonstrates why the EM algorithm has a general convergence property. The basic idea is via Jensen's inequality. More precisely, it can be shown that if the Q-function in equation 4.3.7 is improved in each iteration (in the maximization step), then so is the likelihood function L.

The proof of convergence begins with the observation of the following relationship:

l(X|Θ) = \log p(X|Θ) = \log \sum_Z p(Z,X|Θ) = \log \sum_Z p(Z|X,Θ^n)\, \frac{p(Z,X|Θ)}{p(Z|X,Θ^n)}   (4.3.18)

Using equation 4.3.18 and Jensen’s inequality, this is obtained:


l(X|Θ) = \log p(X|Θ)   (4.3.19)
= \log \sum_Z p(Z|X,Θ^n)\, \frac{p(Z,X|Θ)}{p(Z|X,Θ^n)}
= \log E_Z\left[ \frac{p(Z,X|Θ)}{p(Z|X,Θ^n)}\, \Big|\, X,Θ^n \right] \quad \text{by definition of expectation}
≥ E_Z\left[ \log \frac{p(Z,X|Θ)}{p(Z|X,Θ^n)}\, \Big|\, X,Θ^n \right] \quad \text{by Jensen's inequality}
= \sum_Z p(Z|X,Θ^n)\, \log \frac{p(Z,X|Θ)}{p(Z|X,Θ^n)} \quad \text{by definition of expectation}
= \sum_Z p(Z|X,Θ^n)\, \log p(Z,X|Θ) − \sum_Z p(Z|X,Θ^n)\, \log p(Z|X,Θ^n)
= E_Z\big(\log p(Z,X|Θ)\,|\,X,Θ^n\big) − E_Z\big(\log p(Z|X,Θ^n)\,|\,X,Θ^n\big)
= Q(Θ|Θ^n) + R(Θ^n|Θ^n)

Maximizing Q in the nth iteration, Θn+1 is obtained according to

Θn+1 = arg maxΘ (Q(Θ|Θn)) (4.3.20)

This means one can always choose a Θn+1 at iteration n such that:

Q(Θn+1|Θn)≥ Q(Θn|Θn) (4.3.21)

The equation 4.3.21 ensures the convergence property of the EM algorithm, because according to equa-tions 4.3.2, 4.3.19 and 4.3.21

l(X|Θ^{n+1}) ≥ Q(Θ^{n+1}|Θ^n) + R(Θ^n|Θ^n)
≥ Q(Θ^n|Θ^n) + R(Θ^n|Θ^n)
= l(X|Θ^n)


Chapter 5

Extended EM derived from Kullback-Leibler Divergence

As we have seen in the previous chapter, the EM framework iteratively optimizes the model parameters. We assumed that some law of nature "generates" the data; we call this law the "ground truth density". Our task is to find a mathematical representation of this law, called a statistical model. Though it is not possible to derive the model analytically, we can approximate it iteratively by estimating its parameters. The EM algorithm has the task of estimating this statistical model.

However, in the EM framework the number of model components must be known and fixed. We want to investigate the reason and introduce an extension which allows us not only to optimize the model parameters but also to estimate the number of model components.

In the previous examples we assumed that the ground truth is a mixture of Gaussian distributions. With this assumption we could derive the procedure for estimating the parameters which maximize the likelihood function. In our new framework we deal with different distributions. We illustrate the new challenge with an example in section 5.1 and furthermore show the failure of the EM algorithm to approximate a data set with the appropriate number of model components.

Giving account of data capturing rules we introduce in the section 5.2 a new density function called“Gaussian Like”. In our following context we are going to analyse two extensions of the classical EMframework. Both extensions fit Gaussian Like mixtures onto the data to approximate the ground truthdensity and so to estimate its statistical model.

To derive the extensions of the classical EM framework we use the “Kullback-Leibler Divergence”(KLD) theory. Kullback-Leibler Divergence is a widely known and accepted metric to measure dissim-ilarity of density functions. In the section 5.3 we build on the basis of KLD a theoretical frameworkwhich provides us with rules for model estimation and model modification.

Model estimation is a method of estimating the model parameters by optimising some metric. As we have seen, in the classical EM framework we maximised the expectation value. To modify the number of model components we need to enlarge or reduce the set of latent variables which label the data. Such a modification takes us outside the model estimation framework. Thus a new metric for model modification is derived from the KLD.


Though the constructed framework is theoretical, we can use it to derive the extensions. In section 5.4 we explain how to derive the classical EM framework from the KLD and how to extend it with model modification rules. We modify the model estimation by splitting and merging model components. As we will see, the EM failure is connected with the Monte Carlo estimation of the ground truth density. We analyse the failure of this EM extension theoretically and by means of two examples.

Before we derive the new framework proposed in [22] we need to introduce a new method of nonparametric ground truth density estimation called the "smoothed data density" (sdd). This density estimation is continuous and provides us with the required "knowledge" about the ground truth density. This knowledge is in fact a method of estimating the ground truth density values at the data points and beyond them. We introduce the "smoothed data density" in section 5.5.

Using the smoothed data density for ground truth density estimation, we derive the new framework of model fitting proposed in [22]. In our context we call this framework "Extended EM". Analogously to the previous section, we derive the rules for model estimation and modification in section 5.6. Furthermore we introduce a method to make a splitting proposition. We will give evidence that such splitting is consistent with the modification rules and thus preserves the convergence property of the method. The merging rule proposed by the authors of [23] is introduced in the last part of the section. This rule is not consistent with the modification rules; to see this, please consult section 7.5.2 in chapter 7.

5.1 Example

Our target in this section is to illustrate the rules of data capturing which are assumed in this frameworkand used for model construction.

Let’s consider a wall with a very simple graffiti pattern. A worker has the task to remove thispainting. The only time he can do it is after his work day in the night. He takes his torch and triesto locate the edges of the pattern on the wall. The beam of light orthogonally directed onto the wallproduces a circle. Let’s consider the pattern to be two lines crossing each other. Even if a line doesnot cross the light circle, it will be visible if it is inside of it. Unfortunately the torch has a dischargedbattery. The torch is only shortly lightening if it has just been switched on. In this unfortunate case theworker has to “scan” the wall at some intervals to find the lines.

If we automatise the procedure and deliver our worker from his unfortunate destiny, we would needto build a device which locates the painting. The device captures the point as a coordinate if the line isvisible. As we have already seen the line might be visible even if it does not cross the centre of the lightcircle, but still only the centre is captured. The captured coordinate deviates from the real line point.Our device scans the wall and stores the coordinates if the line was visible. We get a set of points, eachof which is normally distributed around the real line point. Since the scanning is performed rigidly wemay say that the occurrence of captured points along the assumed line is uniformly distributed.

Let’s now say want to restore the pattern. Given the set of captured points we face again the problemof parameter estimation. If we assume that the captured points are distributed following the multivariatenormal distribution around the line middle, we may use our EM procedure to estimate the parameters.We would need to assume that the number of model components is known.

We demonstrate the example in the figure 5.1. The data set has been produced by visualization toolwhich had been developed for this project. The data set serves as demonstration only and will not beintroduced any further. The resulting approximation is shown as ellipses which are the cuts through themodel component being bivariate Gaussian distribution.

On the right hand side the model components are presented as lines. The line orientation follows the


Fig. 5.1: The result of the EM procedure. The left figure shows the components as cuts through the Gaussian distributions. The right figure represents the components as lines.

Fig. 5.2: The result of the EM procedure with only one component.

eigenvector corresponding to the larger eigenvalue of the covariance matrix, with the line centre being the mean of the model component. The length is four times the larger eigenvalue. The result presented in figure 5.1 shows us the disadvantage of the EM algorithm: depending on the initialisation, the EM procedure converges to this local optimum.

To simplify the example and present the problem we are going to solve, we construct a new situation on the basis of the previous data set, taking into account the bottom right corner of the figure only. The result is equivalent to the situation presented in figure 5.2: a data set in which two lines meeting each other at their end points have been sampled.

According to the EM result on the previous data set, we approximate the reduced data set with one component only. The problem is that the algorithm was initialized with a wrong number of components. But what is the right, or optimal, number of model components? The EM algorithm becomes semi-automatic because a specialist has to intervene. What we want, however, is a procedure which finds the lines automatically. We need a procedure to find an optimal number of model components.

Another issue is the model estimation. The Gaussian mixtures are no longer appropriate for modelconstruction. We need to develop a new density function which takes the sampling method into account.


5.2 Gaussian Like Distribution

In the previous section we introduced the method of data capturing. Obviously the Gaussian mixturesare no longer appropriate to model the data captured in that way.

Let’s consider first we sample one edge or line segment only. The edge is an element of a real scene.Thus we have some elements of the real scene and some digital representation of these elements. Thedigital representation is a set of sampled points corresponding to the points of the real scene. Eachsample point deviates from the corresponding real point according to the Gaussian distribution. Sincethe scanning is performed rigidly we may say that the occurrence of captured points along the edge isuniformly distributed.

Summarising, we may say that we have uniform distribution of Gaussians along a linear structure.To model this density we build a linear combination of equally weighted Gaussians. In theory if weassume the sampling interval to very small he density results in a function which can be modelled bya difference of two “Error functions”. In practice the Error function is very difficult to handle, and thesampling intervals mostly are much greater then Gaussian deviation parameter.

Thus, we model the density by:

Gl(~xi|z j,σ) = G(dp,s(~xi,z j),0,σ), (5.2.1)

where ~x_i is the sample point for which the component density value is to be computed, z_j denotes the model component, G is the normal distribution¹, and d_{p,s}(.,.) is the shortest point-to-segment distance function defined in 3.2.4. Figure 3.1 demonstrates the distance computation, with σ shown as the range of the standard deviation used for Gl in 5.2.1. In our following context we call this density "Gaussian Like".
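A minimal sketch of the Gaussian Like density 5.2.1 for illustration (Python/NumPy; the function names and test values are ours, and the point-to-segment distance of 3.2.4 is restated compactly so the snippet is self-contained):

```python
import numpy as np

def segment_distance(x, s1, s2):
    # Point-to-segment distance d_p,s from eq. 3.2.4 (compact re-statement).
    x, s1, s2 = map(np.asarray, (x, s1, s2))
    r = (s2 - s1) / np.linalg.norm(s2 - s1)
    mu = 0.5 * (s1 + s2)
    along, half = np.dot(x - mu, r), 0.5 * np.linalg.norm(s2 - s1)
    d_line = np.linalg.norm((x - mu) - along * r)
    return d_line if abs(along) < half else np.hypot(d_line, abs(along) - half)

def gaussian_like(x, s1, s2, sigma):
    """Gaussian Like density 5.2.1: a zero-mean normal evaluated at d_p,s(x, segment)."""
    d = segment_distance(x, s1, s2)
    return np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# The density is constant along the segment and decays away from it.
print(gaussian_like([0.5, 0.0], [0, 0], [1, 0], sigma=0.1))   # on the segment: maximal
print(gaussian_like([0.5, 0.3], [0, 0], [1, 0], sigma=0.1))   # three sigma away: much smaller
```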

5.3 Using Kullback-Leibler Divergence

In this section we want to derive the rules for model estimation and the conditions for model modificationto enable the method to estimate the optimal number of model components and its parameters.

Kullback-Leibler divergence (KLD) is used to measure dissimilarity between densities. If we as-sume we had a knowledge about the ground truth density q(x) then we could measure the “goodness”of our model estimation by KLD.

The law of nature is the ground truth density q(x). We want to model the law of nature by some mathematical representation p(x|Θ). Since no analytical evidence can be given to derive the model, we have to estimate it by a parametric density function p(x|Θ^n).

Please notice, that q(x) is continuous on the space X with x ∈X. It generates the data set xi ∈ X ⊂X

with i ∈ 1, . . . ,N which is used for model estimation. N = |X | is the size of the data set X . The modelpredicts the existence of x ∈ X by given xi ∈ X . Thus, the model can be regarded to be continuous tooand we can use it in KLD.

In the model the parameters vector Θ describes the edges placement and the model describes theweights distribution generated by these statistically modelled edges. p(x|Θn) is the model estimate inthe state n and Θn describes the estimation of the edges distribution in this particular state.

The density function p(x|Θn) is a member of a parametric family p(x|Θn) : Θn ∈ S of densities.[compare introduction in [23]].

1σ is a system parameter and depends on the settings of the data recording device


According to [23] it is possible to estimate the number of model components by viewing the KLD D(q(x)||p(x|Θ^n)) as a functional on the space p(x|Θ^n) of Gaussian mixtures. The KLD is convex and it is possible to "approximate the minimum with any required precision when we minimize KLD in the space of finite Gaussian mixtures" (see the introduction of [23]). The minimum is the wanted model p(x|Θ).

In the following we abandon the index n and consider the vector Θ to be variable. Thus, we measurethe dissimilarity between the ground truth density q(x) and its approximation with a density functionp(x|Θ). KLD is defined as:

D(q(x)\,\|\,p(x|Θ)) = \int \log\left(\frac{q(x)}{p(x|Θ)}\right) q(x)\,dx   (5.3.1)
= \int \log(q(x))\,q(x)\,dx − \int \log(p(x|Θ))\,q(x)\,dx

It can be easily derived that parameters Θ minimizing 5.3.1 are given by:

Θ = \arg\max_Θ \int \log(p(x|Θ))\,q(x)\,dx   (5.3.2)
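For intuition, 5.3.1 can be evaluated on a discretized grid. The following Python/NumPy sketch (grid and densities chosen by us, reusing the example parameters of table 4.1) shows that a better model estimate indeed yields a smaller KLD:

```python
import numpy as np

def kld(q, p, dx):
    """Discretized Kullback-Leibler divergence D(q || p) on a common grid."""
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dx

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

xs = np.linspace(-10, 30, 4000)
dx = xs[1] - xs[0]
q = 0.6 * normal_pdf(xs, 9, 1.0) + 0.4 * normal_pdf(xs, 15, 1.6)           # "ground truth"
p_good = 0.59 * normal_pdf(xs, 9.01, 0.77) + 0.41 * normal_pdf(xs, 14.68, 0.92)
p_bad = normal_pdf(xs, 11.5, 5.0)                                           # single wide Gaussian
print(kld(q, p_good, dx) < kld(q, p_bad, dx))   # -> True: the better fit has smaller KLD
```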

5.3.1 Model Estimation

To derive the method of parameter estimation for our framework we follow the EM derivation and introduce latent variables z_1, . . . , z_N which serve to label the respective data points x_1, . . . , x_N. Thus, z_i ∈ {1,···,M} for each i ∈ {1,···,N}, and z_i = k if the ith sample was generated by the kth model component.

In our framework, we need to compute the KLD between the joint density q(x,z) and a parametric density function p(x,z|Θ):

D(q(x,z)||p(x,z|Θ)) = ∫_x ∫_z log( q(x,z) / p(x,z|Θ) ) q(x,z) dz dx
                    = ∫_x ∫_z log(q(x,z)) q(x,z) dz dx − ∫_x ∫_z log(p(x,z|Θ)) q(x,z) dz dx
                    = ∫_x q(x) ∫_z log(q(x) q(z|x)) q(z|x) dz dx − ∫_x q(x) ∫_z log(p(x,z|Θ)) q(z|x) dz dx    (5.3.3)

with q(x,z) = q(x)q(z|x). We see with p(x,z|Θ) = p(x|Θ)p(z|x,Θ):

D(q(x,z)||p(x,z|Θ)) = D(q(x)q(z|x) || p(x|Θ)p(z|x,Θ))
                    = ∫_x ∫_z [ log( q(x) / p(x|Θ) ) + log( q(z|x) / p(z|x,Θ) ) ] q(x) q(z|x) dz dx
                    = ∫_x log( q(x) / p(x|Θ) ) q(x) dx + ∫_x q(x) ∫_z log( q(z|x) / p(z|x,Θ) ) q(z|x) dz dx    (5.3.4)

Since the KLD is always nonnegative, and the second summand in 5.3.4 is minimized for p(z|x,Θ) = q(z|x), we obtain from 5.3.3:

∫_x q(x) ∫_z log(p(x,z|Θ)) q(z|x) dz dx = ∫_x q(x) ∫_z log(p(x,z|Θ)) p(z|x,Θ) dz dx    (5.3.5)

Following the KLD introduction we derive that the parameters Θ minimizing 5.3.3 are given by:

Θ = argmax_Θ ∫_x q(x) ∫_z log(p(x,z|Θ)) p(z|x,Θ) dz dx    (5.3.6)


We still cannot use the formula derived in 5.3.6, since the ground truth density q(x) is not given. To proceed with our assumption of knowing the ground truth, we need to estimate it. In the following discussions we will give two different nonparametric density estimations. The first one leads us to the classical EM framework. The second was proposed by [22] and leads us to the Extended EM framework, which is able to adjust the number of model components.

5.3.2 Adjusting the Number of Model Components

Before we go any further, let us summarize the previous section. We have only one term, 5.3.6, which allows us to estimate the model. This one term consists of two steps: compute the density development for given model parameters, and find the settings of the model parameters for a given model density development. We can do both analytically in an iterative procedure.

The term 5.3.6 depends on the latent variable z. This means that the weighting depends on the constellation of the model components. In the first step of the iteration each component "generates" its region of influence and weights the data points. According to this weighting and this environment, the component's settings can be changed in the second step of the iteration. If the components do not correspond to such regions, then the information about the previous weighting is useless. It follows that the original model density distribution cannot be used for the analytical derivation of the parameter settings if the correspondence between the components and their regions of influence is destroyed.

Thus, we need to estimate the modified model density distribution before we modify the number of model components and derive their settings. This means we assume we already had the modified model estimation and knew the number and the constellation of the model components, and thus their regions of influence. According to that information the weights could be estimated. Given the weights and the modified number of model components, the parameters can be derived analytically.

As we have seen in the KLD definition, the KLD value is minimized by maximizing the right summand of 5.3.3 or by minimizing the left summand of 5.3.4. The left summand does not depend on the latent variable z and thus allows us to modify the model distribution directly. We minimize the left summand by estimating the weight development of the modified model estimation. Thus, we adjust the number of model components by minimizing:

∫_x log( q(x) / p(x|Θ) ) q(x) dx = ∫_x log(q(x)) q(x) dx − ∫_x log(p(x|Θ)) q(x) dx    (5.3.7)

Please notice that such a modification of the model estimation changes the dimensionality and naturally the coefficients of the parameter vector Θ. Thus, for fixed weights, Θ and its dimensionality become variable.

So we have the ground truth density q(x) and two model estimations: the original p_o(x|Θ_o), containing the previous set of model components and their parameters, and the modified model estimation p_m(x|Θ_m), containing a modified set of components and their parameters. The size of the modified set of model components will differ from the original one.

Thus our goal is to maximize the second summand, ∫_x log(p(x|Θ)) q(x) dx, of 5.3.7. Let p_o(x|Θ_o) be the current model estimation and p_m(x|Θ_m) be the modified model estimation. The modification is successful if the following holds:

∫_x q(x) log(p_o(x|Θ_o)) dx < ∫_x q(x) log(p_m(x|Θ_m)) dx    (5.3.8)

It means that the modification decreases the KLD by a greater amount than the original model estimation. We modify the model estimation by split and merge steps. The original model estimation is given


by p_o(x|Θ_o) = ∑_j p_o(z_j|Θ_o) p_o(x|z_j,θ_j) with z_j ∈ Z_o.
How do we estimate the modified density? Let us consider that we had a "proposal" for the modification: a new set of model components Z_m. According to that proposal we weight the data points. This weighting is the estimation of the modified density, given by p_m(x|Θ_m) = ∑_j p_m(z_j|Θ_m) p_m(x|z_j,θ_j) with z_j ∈ Z_m.

Thus, to increase the number of model components by splitting the component l into l_1 and l_2 the following must be valid:

∫_x q(x) log( ∑_{j≠l} p_o(z_j|Θ_o) p_o(x|z_j,θ_j) + p_o(z_l|Θ_o) p_o(x|z_l,θ_l) ) dx <
∫_x q(x) log( ∑_{j≠l} p_o(z_j|Θ_o) p_o(x|z_j,θ_j) + p_m(z_{l_1}|Θ_m) p_m(x|z_{l_1},θ_{l_1}) + p_m(z_{l_2}|Θ_m) p_m(x|z_{l_2},θ_{l_2}) ) dx    (5.3.9)

or, to decrease the number of model components by merging the components l_1 and l_2 into l, the following must be valid:

∫_x q(x) log( ∑_{j≠l_1, j≠l_2} p_o(z_j|Θ_o) p_o(x|z_j,θ_j) + p_o(z_{l_1}|Θ_o) p_o(x|z_{l_1},θ_{l_1}) + p_o(z_{l_2}|Θ_o) p_o(x|z_{l_2},θ_{l_2}) ) dx <
∫_x q(x) log( ∑_{j≠l_1, j≠l_2} p_o(z_j|Θ_o) p_o(x|z_j,θ_j) + p_m(z_l|Θ_m) p_m(x|z_l,θ_l) ) dx    (5.3.10)

As you can see, we estimate the density locally. The modification of the model estimation can be regarded as a sequence of local modifications, such as splitting one component or merging two components. Thus, we reuse the weights on data points not influenced by the model components which are to be modified.

Having the weights, we can apply the inequality 5.3.8. In case of success we consider the vector Θ_m to be variable but its dimensionality to be fixed, so we can derive its coefficients analytically. Thus the dimensionality of the vector Θ_m becomes variable in the proposition method only.

In fact, such a local modification is an iteration of the "sparse" EM method. The results of [24] justify this approach.

There are two more issues to consider. We begin with the second. Having the proposed set of model components, we have to ensure the following restriction: ∑_i p_o(x_i|Θ_o) = ∑_i p_m(x_i|Θ_m). Thus, the modified model estimation must be normalised. In the following sections we will analyse the development of the modified model estimation depending on the proposed set of model components in the cases of split and merge.

Now the first: how do we propose a modified set of model components? Please consult sections 5.6.3 and 5.6.4 as well as sections 6.4 and 6.5 of chapter 6 for further details.

Please notice that the inequality 5.3.8 and all its derivations cannot be applied yet. We have to estimate the ground truth density q(x) first.

5.4 EM with Split and Merge

In this section we will derive the classical EM framework from the KLD and extend it by model modification rules according to the theory derived in the previous section. Applying the Monte Carlo estimation we can derive the expectation value metric as it is known from the EM derivation introduced in


chapter 4. We will see that by applying the KLD theory we are also able to derive model modification rules which preserve the convergence property of the method (see 5.4.2).

We will give evidence of the EM failure to adjust the number of model components and analyse the development of the modification metric depending on the state of the model estimation (see 5.4.3 and 5.4.4). Finally, we give an arithmetic example to illustrate the splitting failure.

5.4.1 EM Derivation from KLD

The Monte Carlo (MC) estimation theory says: if x_1, . . . ,x_N are identically and independently distributed (i.i.d.) sample points from the probability density function (pdf) q(x), then we can approximate the integral of a continuous function f by its MC estimate:

∫ f(x) q(x) dx ≈ (1/N) ∑_i f(x_i)

In the usual approach to inference, it is a commonly accepted assumption that the sample points are distributed according to the estimated density. This assumption is the key to ensuring that maximum likelihood estimators are appropriate for the purpose of estimating the parameters of interest.

If we approximate the term 5.3.5 by its MC estimate with q(z|x) = p(z|x,Θ), p(x_i, z = l|Θ) = p(z = l|Θ) p(x_i|z = l,Θ) and α_l = p(z = l|Θ), we get:

∫_x q(x) ∫_z log(p(x,z|Θ)) q(z|x) dz dx ≈ (1/N) ∑_i ∑_l log(α_l p(x_i|z = l,Θ)) p(z = l|x_i,Θ) ∝ Q(Θ,Θ)    (5.4.1)

where x ∈ 𝕏 is the domain of the continuous density function q(x) and x_i ∈ X ⊂ 𝕏 is the domain of the discrete Monte Carlo estimate of the continuous density function q(x).

We now see precisely the derivation of the classical EM. The Monte Carlo estimation of the term 5.3.6 results in the computation of the expectation value according to the classical EM derivation. Thus, we can use the already derived expectation and maximization steps for model estimation (see chapter 4.3.3).
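As an illustration only, the following is a small sketch of the expectation value 5.4.1 for responsibilities that are already available; the function name and the way the densities are passed in are assumptions, not the thesis code.

import numpy as np

def expectation_value_Q(component_densities, responsibilities, alphas):
    # component_densities[i, l] = p(x_i | z = l, Theta)   (e.g. the Gaussian Like value)
    # responsibilities[i, l]    = p(z = l | x_i, Theta)   (E-step weights)
    # alphas[l]                 = p(z = l | Theta)         (component priors)
    n = component_densities.shape[0]
    log_terms = np.log(alphas[None, :] * component_densities)
    return (log_terms * responsibilities).sum() / n       # eq. 5.4.1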

What does this mean for the KLD? With the MC estimate we interpreted the ground truth density q(x) as a discrete distribution which uniformly weights the data points. We compare such a "uniform" distribution with the model estimation.

If we consider the left summand of 5.3.4 again, what becomes of it? With q(z|x) = p(z|x,Θ) the right summand becomes zero. The left summand is minimized by p(x|Θ) ≈ q(x). Thus, the best model is a density which uniformly weights the data points. That means that overfitting (producing a model component for each data point) or underfitting (estimating the model by one component only) might be the best model estimations.

5.4.2 Conditions for Model Modification

Any step towards the modification of the model estimation by adjusting the number of model components is an extension of the classical EM framework. In our context we simply call it EM with Split and Merge. We want to emphasize the difference between the classical EM and the newly proposed method in terms of the ground truth density estimation.

As derived in section 5.3.2, the modification of the model estimation by changing the number of model components, and naturally their constellation, is done by 5.3.8. If we follow the EM derivation


and estimate the ground truth density by its Monte Carlo approximation, we get the following condition:

∑_i^{|X|} log(p_o(x_i|Θ_o)) < ∑_i^{|X|} log(p_m(x_i|Θ_m))    (5.4.2)

where p_o(x_i|Θ_o) is the original model estimation and p_m(x_i|Θ_m) is the modified model estimation. What does this mean? The Monte Carlo approximation discretises the domain of possible data 𝕏, leaving only the available data set of sample points x ∈ X. So we have only the data set for further inference. The classical EM method fits the model onto the data points themselves. Thus the Monte Carlo estimate fails to predict the ground truth density development between the data points. So the KLD is no longer defined on continuous density functions but only on the weight distribution over the data points.

To increase the number of model components by splitting the component l into l_1 and l_2 the following must be valid:

∑_i^{|X|} log( ∑_{j≠l}^{|Z_o|} p_o(z_j|Θ_o) p_o(x_i|z_j,θ_j) + p_o(z_l|Θ_o) p_o(x_i|z_l,θ_l) ) <
∑_i^{|X|} log( ∑_{j≠l}^{|Z_o|} p_o(z_j|Θ_o) p_o(x_i|z_j,θ_j) + p_m(z_{l_1}|Θ_m) p_m(x_i|z_{l_1},θ_{l_1}) + p_m(z_{l_2}|Θ_m) p_m(x_i|z_{l_2},θ_{l_2}) )    (5.4.3)

where Z_o is the set of original model components. To decrease the number of model components by merging the components l_1 and l_2 into l the following must be valid:

∑_i^{|X|} log( ∑_{j≠l_1, j≠l_2}^{|Z_o|} p_o(z_j|Θ_o) p_o(x_i|z_j,θ_j) + p_o(z_{l_1}|Θ_o) p_o(x_i|z_{l_1},θ_{l_1}) + p_o(z_{l_2}|Θ_o) p_o(x_i|z_{l_2},θ_{l_2}) ) <
∑_i^{|X|} log( ∑_{j≠l_1, j≠l_2}^{|Z_o|} p_o(z_j|Θ_o) p_o(x_i|z_j,θ_j) + p_m(z_l|Θ_m) p_m(x_i|z_l,θ_l) )    (5.4.4)

where ∀i ∈ {1, . . . , |X|} : x_i ∈ X. Since the weights of the modified model are estimated, we have to ensure the following restriction: ∑_i p_m(x_i|Θ_m) = ∑_i p_o(x_i|Θ_o). If there are no model components other than the one to be modified, the sum over all weights will naturally be one. Thus, we have to normalize the estimate by:

norm = ∑_i p_o(x_i|Θ_o) / ∑_i p_m(x_i|Θ_m)    (5.4.5)
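As a sketch of how the conditions 5.4.2–5.4.5 can be checked in code (assuming the per-point densities of both models have already been evaluated; all names are illustrative), one might write:

import numpy as np

def normalisation_factor(p_orig, p_mod):
    # eq. 5.4.5: scale the modified weights so both models distribute the same total weight
    return p_orig.sum() / p_mod.sum()

def modification_is_advantageous(p_orig, p_mod):
    # eq. 5.4.2: accept the split/merge proposal if the normalised modified model
    # achieves a larger Monte Carlo log-likelihood than the original one
    norm = normalisation_factor(p_orig, p_mod)
    return np.sum(np.log(p_orig)) < np.sum(np.log(norm * p_mod))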

In our previous consideration we assumed the modified density to be normalised. We want to analyse the conditions of the modification depending on the model component constellations.

Before we introduce the possible component constellations, let us define some terms. Let z_j be a model component. The component weights the data points; we call this "influence". The influence is measurable in a limited region only. We call the set of data points within this region the "region of influence".

The model component density is assumed to be Gaussian Like. This density describes a noise-corrupted sampling of an edge. The data points assumed to be sampled from the corresponding edge are uniformly weighted by the component. The more distant from the component the data points are placed, the less they are affected by it.


Fig. 5.3: Diagram demonstrating the possible constellations of two model components denoted by Gaussian Like densities.

The Gaussian Like density weights the noise-corrupted data points according to the one-dimensional Gaussian distribution parameterized by the mean on the data point, a predefined standard deviation, and the shortest distance between the data point and the model component.

We say the data points with the maximum influence are "fully affected". The noise-corrupted data points with lower weight are "partly affected".

Let us see how we can differentiate between the values of influence if we have two neighbouring model components. We group the data points according to the kind of influence.

The diagram 5.3 shows two components. For simplicity we return to the one-dimensional representation. Note that our following considerations are not bound to the one-dimensional example.

We divide the set of data points X into subsets by the following conditions. Given two components labelled 1 and 2, the data sets X_1 and X_2 contain the data points fully affected by the respective components.

The model components may overlap. The partly affected sample points are combined into a set X_e. The sets X_{1−e} and X_{2−e} contain the sample points which are not affected by the other component.

The affected points in the set X_e have to be regarded differently according to the value of influence on the part of the components. The sets X_{je} are weighted by the model component j as belonging to itself, but are partly affected by the component z_i ≠ z_j. The data points in the set X_r are only partly affected by both components.

With these notations we can illustrate the modification of a model estimation. Let ∑_i p(x_i|Θ) be the model estimation. Given two model components labelled 1 and 2, the model estimation is constructed by:

∑_i p(x_i|Θ) = ∑_i^{X_{1−e}} α_1 p(x_i|z_1,θ_1) + ∑_i^{X_{1e}} α_1 p(x_i|z_1,θ_1) + ∑_i^{X_r} α_1 p(x_i|z_1,θ_1) + ∑_i^{X_{2−e}} α_1 p(x_i|z_1,θ_1) +
             + ∑_i^{X_{2−e}} α_2 p(x_i|z_2,θ_2) + ∑_i^{X_{2e}} α_2 p(x_i|z_2,θ_2) + ∑_i^{X_r} α_2 p(x_i|z_2,θ_2) + ∑_i^{X_{1−e}} α_2 p(x_i|z_2,θ_2)


In our following analysis we will regard model estimations with two or one components. The results of [24] justify such an approach. As we have seen in section 5.3.2, to adjust the number of model components we perform an EM iteration. According to the results of [24] we may perform the EM steps locally. Local computation is performed in the "environment" of the concerned model components only. This environment is built of the sample points which are weighted by these model components. Such an approach limits the set of data points so that the treatment of the other model components can be spared.

5.4.3 Splitting Failure

The EM method extended by split and merge fails to adjust the number of model components. We want to construct an example where splitting is required and demonstrate the EM splitting failure.

To spare the war of indices and statistical notation trimmings, we simplify the notation and concentrate on the theory. Let z_l be the model component we want to split. In our constructed example all data points x_i ∈ X are fully influenced by this model component; thus the component uniformly weights the data points. Let the weights be ∀i ∈ {1, . . . , |X|} : α_l p(x_i|z_l,θ_l) = p_i. Let N be the number of data points.

Let us assume that the data points are arranged in two clusters. The clusters are clearly parted, so splitting is required. We propose two model components corresponding to the clusters, z_{l_1} and z_{l_2}, such that X_{1−e} = X_1, X_{2−e} = X_2, X_e = ∅ and X_1 ∪ X_2 = X. The two components do not overlap: there are no data points influenced by the two components simultaneously. One possible visualisation of the problem can be found in figure 5.6.

Since we are given the proposition, we want to estimate the density. To do so we assume that ∀i ∈ {1, . . . , |X|} ∀j ∈ {1,2} : p(x_i|z_{l_j},θ_{l_j}) = p_i. Thus, we are able to present the model by:

∑_i p_m(x_i|Θ_m) = ∑_i^{X_1} α_1 p_i + ∑_i^{X_2} α_2 p_i

The notation ∑_i^{X} means the sum over all data points from the set X. The estimated density must satisfy the following restriction: ∑_i p_o(x_i|Θ_o) = ∑_i p_m(x_i|Θ_m). The normalisation factor is then computed by:

norm = ∑_i p_o(x_i|Θ_o) / ∑_i p_m(x_i|Θ_m) = ∑_i^{N} p_i / ( α_1 ∑_i^{X_1} p_i + α_2 ∑_i^{X_2} p_i )

The splitting condition according to the inequality 5.4.3 is given by:

∑_i log(p_o(x_i|Θ_o)) < ∑_i log(p_m(x_i|Θ_m))

∑_i^{X} log(p_i) < ∑_i^{X_1} log(α_1 · norm · p_i) + ∑_i^{X_2} log(α_2 · norm · p_i)

∑_i^{X} log(p_i) < ∑_i^{X} log(p_i) + |X_1| · log( α_1 ∑_i^{N} p_i / ( α_1 ∑_i^{X_1} p_i + α_2 ∑_i^{X_2} p_i ) )
                                  + |X_2| · log( α_2 ∑_i^{N} p_i / ( α_1 ∑_i^{X_1} p_i + α_2 ∑_i^{X_2} p_i ) )    (5.4.6)


Fig. 5.4: Diagram demonstrating the behaviour of the rest term of 5.4.6.

As we can see, the difference between the one-component model and the split model is the rest of the term containing the last two summands (see 5.4.6). We can display the behaviour of this rest depending on α_1. Figure 5.4 shows that for all α_1 the rest is less than or equal to zero. Consequently, splitting is not advantageous.
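This claim is easy to check numerically. The following sketch evaluates the last two summands of 5.4.6 under the assumptions of the example (uniform p_i, two clusters of sizes N_1 and N_2, α_2 = 1 − α_1); the function name is ours.

import numpy as np

def rest_term(alpha1, n1, n2):
    # last two summands of 5.4.6 for uniform p_i: |X1|*log(a1*norm) + |X2|*log(a2*norm)
    alpha2 = 1.0 - alpha1
    norm = (n1 + n2) / (alpha1 * n1 + alpha2 * n2)        # eq. 5.4.5 with all p_i equal
    return n1 * np.log(alpha1 * norm) + n2 * np.log(alpha2 * norm)

alphas = np.linspace(0.01, 0.99, 99)
print(max(rest_term(a, 40, 60) for a in alphas))          # 0.0 at alpha1 = 0.5, negative elsewhere:
                                                          # the split condition 5.4.6 is never strictly satisfied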

Why does the algorithm fail to split? The answer to this question lies in the discretisation of the target function. We lose the information about the ground truth density outside the sample points. We ignore the property of the ground truth density of being continuous, and so we are unable to detect the gap between the clusters. We need to compare the densities in between and outside the data. We introduce the continuous method of splitting, proposed by [23], in section 5.6.3.

5.4.4 Merge by Overlapping

To derive the merge step for the EM we have to draw your attention to the fact that the estimation of the density value p_m(x_i|Θ_m) on the data point x_i cannot be done by p_o(x_i|Θ_o). It is only possible in the case of collinear line segments.

As we have already seen, the data representation with one component only is more advantageous than with two. Of course this assumption is erroneous, but it is valid in the framework of EM with the Gaussian Like function as component density. We now want to show that a two-component representation is not better if the components overlap.

We consider merging as splitting the other way around. We assume that splitting the component z into two overlapping components z_1 and z_2 is more advantageous, and try to disprove it. Again, the sum over all weights of the old model must be equal to the sum over all weights of the new one. As assumed, the modified model estimation contains two overlapping components:

∑_i p_m(x_i|Θ_m) = ∑_i^{X_1} α_1 p(x_i|z_1,θ_1) + ∑_i^{X_2} α_2 p(x_i|z_2,θ_2) +
                 + ∑_i^{X_{2e}} α_1 G_{i1} + ∑_i^{X_{1e}} α_2 G_{i2} + ∑_i^{X_r} α_1 G_{i1} + ∑_i^{X_r} α_2 G_{i2}

The normalisation factor norm is computed analogously and can easily be derived. Analogously to the previous consideration, we split the notation into the uniformly weighted data points, which are fully affected by the model component z_j and weighted by p_m(x_i|z_j,θ_j), and the partly affected data points, weighted by G_{ij}.


Fig. 5.5: Left: development of the right term of the inequality 5.4.7 against overlap variation; the x-axis is the value of expansion of the components. Right: the constellation of components with the highest value of the right term of the inequality 5.4.7.

The notation G_{ij} stands for the Gaussian parameterised by the previously system-adjusted standard deviation, the mean on the data point x_i, and the shortest distance from the data point x_i to the model component z_{l_j}.

In fact the value of p(x_i|z_j,θ_j) is computed by N(0,0,σ), where N is the normal distribution and σ is again a system parameter adjusted to the sampling device. The merging condition according to the inequality 5.4.4 is given by:

∑_i log(p_o(x_i|Θ_o)) < ∑_i log(p_m(x_i|Θ_m))

                      < ∑_i^{X} log( ∑_j α_j · p(x_i|z_j,θ_j) · norm )

                      < ∑_i^{X_{1−e}} log(α_1 · p_m(x_i|z_1,θ_1)) + ∑_i^{X_{2−e}} log(α_2 · p_m(x_i|z_2,θ_2)) +
                        + |X| log(norm) +
                        + ∑_i^{X_{1e}} log(α_1 · p_m(x_i|z_1,θ_1) + α_2 · G_{i2}) + ∑_i^{X_{2e}} log(α_2 · p_m(x_i|z_2,θ_2) + α_1 · G_{i1}) +
                        + ∑_i^{X_r} log(α_1 · G_{i1} + α_2 · G_{i2})    (5.4.7)

We want to see the behaviour of the modified model estimation under overlapping. We need to hold the values of the model estimation not affected by the overlap constant and vary the overlapping parts.

The left plot of figure 5.5 demonstrates the development of the model estimation depending on the overlap. The highest value arises from the constellation presented in the right plot of figure 5.5.

To construct the diagram of figure 5.5 we built a data set of perfectly sampled data from one single edge. Thus, the model representation with one model component must be the most advantageous. First we approximate the data by two fully parted model components without any overlap. In each step we expand the components toward each other, such that the representation of the outer sample points is not concerned but the inner sample points are weighted by more influence of the components.


Fig. 5.6: The sequence of peaks denotes the data points. 1) Gaussian Like distribution normalised such that the sum of weights on the data points is one. 2) Model estimation with two distant components. The thin line denotes the component densities, the bold line the model density.

At the last step the two components completely overlap.
The result is: we get the highest value for the mixture model which best approximates the one-component model. The more overlap we have, the smaller the value.
Please note again that in the two- or higher-dimensional case the merged segment will weight the sample points differently compared with the original model estimation. The two components will not lie collinearly in space, and the merged line segment will not just be a connection of two collinear segments.

Our example did not say how to maximize the right term of 5.4.4. It only visualised the influence of the overlap on the right term of the inequality 5.4.7. In the framework of [22] the merged segment is proposed to be found by a least squares algorithm. The corresponding KLD value is to be compared with the appropriately normalized KLD value of the two components of the original model estimation. For further details please consult section 5.6.4.

5.4.5 Example for EM Failure with Glike

The Expectation Maximization method fails to adjust the number of model components if it uses Gaussian mixtures as model estimation. We will see in this section that the model estimation with Gaussian Like mixtures also results in error.

We illustrate the split failure on a hand-made data set. Let us assume a perfect one-dimensional sampling of two edges. We see the demonstration in figure 5.6. The sequence of peaks denotes the data points. We clearly see two clusters. We try to estimate the model with one component and then try to split it.

Let us consider the following: the data are clearly divided, and we have N/2 sample points for each line segment. The model density function must be normalized such that

∑_i^{N} p(x_i|Θ) = ∑_i^{N} ∑_j^{M} α_j p(x_i|z_j,θ_j) = 1

The model estimation with one component, M = 1, is normalised. Thus we get:

∑_i^{N} p_{(M=1)}(x_i|Θ) = 1


Please recall that the vector Θ = (θ_1, . . . ,θ_M) contains the parameters of all model components. θ_j is the vector of parameters of the model component z_j. We drop here the indices denoting the state of the vector Θ.

The model estimation with two components must be normalised too, such that ∑_i^{N} p_{(M=1)}(x_i|Θ) = ∑_i^{N} p_{(M=2)}(x_i|Θ) holds. Applying the equation to our example with a two-component model estimation, and keeping in mind that for each component the density value is 0 for N/2 of the data points, we get:

∑_i^{N} p_{(M=1)}(x_i|Θ) = ∑_i^{N} ∑_j^{M} α_j p_{(M=2)}(x_i|z_j,θ_j)
                        = ∑_i^{N} ( 0.5 p_{(M=2)}(x_i|z_1,θ_1) + 0.5 p_{(M=2)}(x_i|z_2,θ_2) )
                        = 0.5 ∑_i^{N} p_{(M=2)}(x_i|z_1,θ_1) + 0.5 ∑_i^{N} p_{(M=2)}(x_i|z_2,θ_2)
                        = 0.5 ( ∑_i^{N/2} p_{(M=2)}(x_i|z_1,θ_1) + ∑_i^{N/2} p_{(M=2)}(x_i|z_2,θ_2) ) = 0.5 · K · ∑_i^{N} p_{(M=1)}(x_i|Θ)

⇔ ∑_i^{N} p_{(M=1)}(x_i|Θ) = 0.5 K ∑_i^{N} p_{(M=1)}(x_i|Θ)  ⇒  K = 2

As you can see, we used the form of the one-component model estimation parameterised by K to act as the model estimation with two components. We can and want to do this because in this case all data points are uniformly weighted by the component densities, and thus we can use the one model estimation parameterised by some factor and let it act as the other. Since all data points are uniformly weighted we can say that ∀j ∈ {1, . . . ,M} : p_{(M=2)}(x_i|z_j,θ_j) = 2 p_{(M=1)}(x_i|Θ).

We use the newly gained knowledge and compute the corresponding expectation values to demonstrate the equivalence of the two model estimations. The expectation value Q (compare 4.3.7) with M = 1 components and N data points is

Q_{(M=1)} = ∑_j^{M} ∑_i^{N} log( α_j p(x_i|z_j,θ_j) ) p(z_j|x_i,Θ)
          = ∑_i^{N} log( p_{(M=1)}(x_i|Θ) )
          = N log( p_{(M=1)}(x_i|Θ) )

where α_j = 1 and p(z_j|x_i,Θ) = 1 for all x_i. Note that the data points are classified in such a way that N/2 points belong to one component and the other N/2 to the other.


Fig. 5.7: The sequence of peaks denotes the data points. 1) Model estimation with two overlapping components. The thin line denotes the component densities, the bold line the model density. 2) Gaussian Like distribution normalised such that the sum of weights on the data points is one.

Thus, for the model estimation with two components we get:

Q_{(M=2)} = ∑_j^{M} ∑_i^{N} log( α_j p_{(M=2)}(x_i|z_j,θ_j) ) p_{(M=2)}(z_j|x_i,Θ)
          = ∑_i^{N/2} log( α_1 p_{(M=2)}(x_i|z_1,θ_1) ) + ∑_i^{N/2} log( α_2 p_{(M=2)}(x_i|z_2,θ_2) )
          = (N/2) log( 0.5 · 2 · p_{(M=1)}(x_i|Θ) ) + (N/2) log( 0.5 · 2 · p_{(M=1)}(x_i|Θ) )
          = N log( p_{(M=1)}(x_i|Θ) )

where α_1 = α_2 = 0.5, since we have an equal number of points for each component and the density value is equal for all points.

The result of our consideration is: the log-likelihood function is not maximized if splitting is performed, even if the data are clearly parted. The result extends to the two- or n-dimensional case, since a collinear constellation of line segments can be given independently of the dimensionality.

We consider a new example where the lines are connected. We modify our last data set and fill the gap between the clusters with appropriate data points. We now have the reciprocal problem: we have a model estimation with two overlapping components (see figure 5.7.1) and want to estimate the model by one component as a result of a merge (see figure 5.7.2). We have a different data set now compared to the previous example; thus the normalisation of the density is done by a different factor.

The specialist sees that two collinear line segments result in one of doubled length if they are connected at their ends. As we have seen in our previous example, the approximation with one model component only is more advantageous than with two. Thus, we may assume that merging takes place too, even if the densities overlap. We will return to this example later and show that our assumption was legitimate.

5.5 Ground Truth Density Estimation

In the EM method derivation we have seen that the fitting algorithm has to be applied to the Monte Carlo estimate of the "ground truth density". We first want to analyse the justification of that method.


Suppose we have a small data set but want to estimate the original sampling statistics. According to the "bootstrap" method we can compute the statistics by resampling the data. This is called "sampling with replacement". The idea is to resample the data set N times, replacing the currently sampled data point back into the set (compare [25]). It is necessary to sample with replacement, because we would otherwise reproduce the original data set. Please consult the secondary literature on the issue, [26] and [27].
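For illustration, such bootstrap resampling with replacement is a one-liner in NumPy (a sketch; the variable names are ours):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([0.1, 0.4, 1.3, 2.2, 2.3])                    # some 1D sample points
resampled = rng.choice(data, size=data.size, replace=True)    # sampling with replacement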

The result of the method is a new data set of N sample points. The method only works if two assumptions are made. First, all data points are valid representatives of the population. Second, each subsampling is independent and identically distributed (i.i.d.) (compare [25]). Of course, according to these assumptions, resampling one and the same data point results in a uniformly distributed data set.

And now the confusing part: why should the distribution of the original population be uniform? Let us consider some arrangement of n points in a space, with n quite small. If we resample each point according to the bootstrap method, we get a point cloud for each data point in which every subsample is uniformly weighted. Obviously the EM method works with the assumption that the original arrangement of the points was a uniform distribution. Thus, we have a uniform distribution of uniform distributions, and the Monte Carlo estimate seems to be the appropriate estimation of the ground truth density.

In all real applications, the sampled data set is corrupted by a certain amount of noise and outliers. The difference between noise and outliers is obvious. Sampling a real scene point, we only get an approximation of it as a data point. In fact, the sample point will deviate from the real scene point according to the Gaussian distribution. This deviation is caused by noise. Thus we may expect a neighbouring data point in some environment of a data point.

Outliers occur independently of the sample points. An outlier does not represent a scene point; thus we may not expect a neighbour of an outlier. Usually the proportion of outliers does not decrease when the number of sample points is increased.

What does that mean for the EM assumption? It means that the bootstrap method was used to estimate the corrupted data density. But this is not what we want. In the work of [23] the corrupted data density was introduced as u(x) = αq(x) + (1−α)η(x), where α is the probability that an observation comes from the ground truth distribution and (1−α) is the probability that the observation is an outlier.

The key to the ground truth estimation proposed by the authors of [23] is to estimate the probability that a data point x has been selected from the ground truth density q(x). The conditional probability is then defined by:

P(q(x)|x) = αq(x) / ( αq(x) + (1−α)η(x) ) = αq(x) / u(x)    (5.5.1)

As you can see, the introduction of the corrupted data density as a composition of the ground truth density q(x) and the outlier density η(x) served us as a demonstration. A close look at the ratio provides us with the evidence of the Bayes rule:

P(q(x)|x) = P(x|q(x)) P(q(x)) / P(x)    (5.5.2)

with P(x) = u(x) = αq(x) + (1−α)η(x), P(q(x)) = α and P(x|q(x)) = q(x). Notice that the larger the value of P(q(x)|x), the more significant is the information for the inference.

According to the previous consideration the density u(x) is estimated by the bootstrap method. Thus for all data points the following is valid: ∀i ∈ {1, . . . ,N} : u(x_i) = 1/N.


Fig. 5.8: Every peak represents a 1D sample point. The rightmost peak represents an outlier. The x-axis represents the data distribution, the y-axis the data density estimation with a Gaussian kernel parameterised by 1) a constant standard deviation per data point, 2) a standard deviation scaled by the distance to the third neighbour.

Thus, assuming that x_1, x_2, . . . , x_N are identically and independently distributed (i.i.d.), we can estimate the ratio q(x)/u(x) by:

sdd(x_i) ∝ q(x_i)/u(x_i) ≈ N q(x_i) = N (1/N) ∑_{j=1}^{N} K( d_{p,p}(x_i,x_j) / h ) = N/(Nh) ∑_{j=1}^{N} G(d_{p,p}(x_i,x_j), 0, h),    (5.5.3)

where the proportionality refers to the fact that ∑_i sdd(x_i) = 1 and d_{p,p} is the Euclidean distance. As you can see, the ground truth estimation is done by kernel estimation using the Gaussian kernel G(d_{p,p}(x_i,x_j), 0, h) with zero mean and standard deviation h. To estimate the bandwidth parameter h, we can draw from a large literature on nonparametric density estimation [28], [29].

[22] proposed the following bandwidth estimation:

h = b · d_{3p}(x),    (5.5.4)

where b is a scaling parameter, in our implementation set to b = 0.5 (see footnote 2), and d_{3p}(x) is the Euclidean distance between the sample point x and its third neighbour³.
So we estimate the ground truth density by smoothing the data distribution with a Gaussian kernel.

In the work [22] this ground truth density estimation method was called "smoothed data density" (sdd). The parameterization of the Gaussian kernel with the third nearest neighbour allows us to model and weight the outliers "properly".
In figure 5.8 we demonstrate the difference between the Gaussian kernel parameterised by a constant bandwidth parameter (left) and the smoothed data density parameterised by a bandwidth scaled by the third nearest neighbour (right).

We must sort out the outliers when we estimate the model of the ground truth density. Thus the outliers must be weighted considerably lower compared with the real data points. As considered before, we cannot expect a neighbour if a data point is an outlier. Thus, by weighting the kernel by the third nearest neighbour, we achieve that data points without a significant number of neighbours are weighted lower.

With the smoothed data density function the following advantages are achieved. First, we can predict the ground truth density value between the data points. Second, we can sort out the outliers.

² The setting of b = 0.5 is a system parameter.
³ k = 3 is a system parameter. Using the third nearest neighbour has been proposed by [22].
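The following is a small sketch of this smoothed data density estimate (eq. 5.5.3 with the bandwidth rule 5.5.4); the brute-force neighbour search and all names are illustrative assumptions, not the thesis implementation.

import numpy as np

def smoothed_data_density(points, b=0.5, k=3):
    # points: array of shape (N, d); returns sdd values normalised so that they sum to 1
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)                  # pairwise Euclidean distances d_{p,p}
    h = b * np.sort(dists, axis=1)[:, k]                   # eq. 5.5.4: b * distance to the k-th neighbour
    kernel = np.exp(-0.5 * (dists / h[:, None]) ** 2) / (np.sqrt(2.0 * np.pi) * h[:, None])
    sdd = kernel.sum(axis=1)
    return sdd / sdd.sum()                                 # proportionality: sum_i sdd(x_i) = 1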


5.6 Extended EM with Split and Merge

We use in this section the method of ground truth estimation by sdd introduced before and derive the framework proposed in [22]. Conforming to the EM derivation, we use the KLD theory to derive the Expectation and Maximisation steps first and then develop the conditions for model modification with Split and Merge (see 5.6.2).

In the following subsection we introduce a method for finding the splitting proposition and give evidence that such splitting preserves convergence (see 5.6.3). In the final subsection we introduce the modification condition proposed by the authors of [23] and analyse its development with respect to the overlap.

5.6.1 Derivation from KLD

Conforming to the EM derivation, we are going to derive the expectation and maximization steps for the Extended EM method. As we have seen in the previous section, the Monte Carlo method is not appropriate for ground truth estimation if the data are corrupted by noise and outliers. The following estimate, proposed by [22], is substantially better:

∫ f(x) q(x) dx ≈ ∑_i f(x_i) sdd(x_i)

If we approximate the term 5.3.5 by the new estimate with sdd(x) ≈ q(x), q(z|x) = p(z|x,Θ), p(x_i, z = l|Θ) = p(z = l|Θ) p(x_i|z = l,Θ) and α_l = p(z = l|Θ), we get:

∫_x q(x) ∫_z log(p(x,z|Θ)) q(z|x) dz dx ≈ ∑_i sdd(x_i) ∑_l log(α_l p(x_i|z = l,Θ)) p(z = l|x_i,Θ)    (5.6.1)

where x ∈ 𝕏 is the domain of the continuous density function q(x). Please note that the sum over all indices i is not defined for a concrete size of a concrete data set. In fact the smoothed data density sdd is defined on the domain 𝕏, but to derive the algorithm steps analytically we have to discretise the domain. Of course, the current number of data points must be known before discretisation. As you will see in the following discussions, the data set may be enriched by "assumed sample points", such that the KLD computation will be performed on a different data set.

According to the derivation of the standard EM algorithm we may now give the required formulas. We introduce the data points as vectors to show the dimensionality independence.

Expectation Step:

p(z_i = l | \vec{x}_i, Θ) = α_l^n p(\vec{x}_i | l, θ_l^n) / ∑_{j=0}^{M−1} α_j^n p(\vec{x}_i | z_i = j, θ_j^n)

Maximization Step:

Since the derivation of the following formulas is done in the same way as in EM, we only present


them.

\vec{μ}_l^{n+1} = ∑_{i=0}^{N−1} \vec{x}_i sdd(\vec{x}_i) p(l|\vec{x}_i,Θ^n) / ∑_{i=0}^{N−1} sdd(\vec{x}_i) p(l|\vec{x}_i,Θ^n)    (5.6.2)

Σ_l^{n+1} = ∑_{i=0}^{N−1} sdd(\vec{x}_i) p(l|\vec{x}_i,Θ^n) (\vec{x}_i − \vec{μ}_l^{n+1})(\vec{x}_i − \vec{μ}_l^{n+1})^t / ∑_{i=0}^{N−1} sdd(\vec{x}_i) p(l|\vec{x}_i,Θ^n)    (5.6.3)

However, some simplifications may be made to the formulas for the implementation. In the derivation we assumed the model components to be Gaussian mixtures. Each component is weighted by its prior α. As we have seen in the section where we discussed the splitting failure, 5.4.3, the appropriate weighting of the model components is the uniform distribution. Thus, we can set α_j = 1/M, uniformly distributed for all j ∈ {0,1, . . . ,M−1}.

In our framework we want to fit line segments and not Gaussian mixtures. A line segment may be defined in several ways; given one definition we can compute the other. To find the orientation and placement of the line we perform an orthogonal regression weighted according to our conditions. The procedure is equivalent to the weighted fitting of the Gaussian distribution as given in 5.6.2 and 5.6.3. The solution is the orientation vector of the line, which is the eigenvector corresponding to the largest eigenvalue of the covariance matrix Σ.

Thus, we simplify the fitting of lines to estimating the parameters of a Gaussian mixture. Since we only need the orientation of the line given as a normalized eigenvector, we may drop the normalization in the computation of the covariance matrix. The resulting formulas are given in chapter 6, where the procedure of extended line fitting is introduced.
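A compact sketch of one such iteration (E-step responsibilities with uniform priors, followed by the sdd-weighted orthogonal regression of 5.6.2/5.6.3) might look as follows; the segment representation and function names are illustrative assumptions, not the original code.

import numpy as np

def e_step(component_densities):
    # responsibilities p(l | x_i, Theta) with uniform priors alpha_j = 1/M
    return component_densities / component_densities.sum(axis=1, keepdims=True)

def m_step_line(points, sdd, resp_l):
    # weighted orthogonal regression for one component l (eqs. 5.6.2/5.6.3):
    # mean and dominant eigenvector of the sdd- and responsibility-weighted covariance
    w = sdd * resp_l
    mu = (w[:, None] * points).sum(axis=0) / w.sum()                       # eq. 5.6.2
    centred = points - mu
    cov = (w[:, None, None] * centred[:, :, None] * centred[:, None, :]).sum(axis=0)
    eigvals, eigvecs = np.linalg.eigh(cov)                                 # symmetric matrix
    return mu, eigvecs[:, -1]                                              # largest eigenvalue -> line orientation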

5.6.2 Conditions for Model Modification

According to the EM derivation and analysis, we now give the conditions for modification of a model estimation for the EM framework extended by sdd.

As derived in section 5.3.2, the modification of the model estimation by changing the number of model components is done by 5.3.8. If we estimate the ground truth density by the smoothed data density sdd, we get the following condition:

∑_i sdd(x_i) log(p_o(x_i|Θ_o)) < ∑_i sdd(x_i) log(p_m(x_i|Θ_m))    (5.6.4)

where p_o(x_i|Θ_o) is the original model estimation and p_m(x_i|Θ_m) is the modified model estimation. Please note again that sdd is defined on the continuous domain 𝕏. In contrast to the EM framework, we can enlarge the data set X ⊂ 𝕏 and will still be able to check the condition.

As we will see in the following sections, the splitting is done by estimating the smoothed data density value in between the real data points. We try to predict the density value of a data point assumed to have been sampled from the scene. Thus we enlarge the data set X by such assumed sample points y. In the following context we denote the enlarged data set by Y. Thus the following is valid: X ⊂ Y ⊂ 𝕏.


Fig. 5.9: Gaussian Like distribution as a result of the extended EM procedure.

To increase the number of model components by splitting the component l into l_1 and l_2 the following must be valid:

∑_i^{|Y|} sdd(y_i) log( ∑_{j≠l}^{|Z_o|} p_o(z_j|Θ_o) p_o(y_i|z_j,θ_j) + p_o(z_l|Θ_o) p_o(y_i|z_l,θ_l) ) <    (5.6.5)
∑_i^{|Y|} sdd(y_i) log( ∑_{j≠l}^{|Z_o|} p_o(z_j|Θ_o) p_o(y_i|z_j,θ_j) + p_m(z_{l_1}|Θ_m) p_m(y_i|z_{l_1},θ_{l_1}) + p_m(z_{l_2}|Θ_m) p_m(y_i|z_{l_2},θ_{l_2}) )

where Z_o is the set of original model components. To decrease the number of model components by merging the components l_1 and l_2 into l the following must be valid:

∑_i^{|Y|} sdd(y_i) log( ∑_{j≠l_1, j≠l_2}^{|Z_o|} p_o(z_j|Θ_o) p_o(y_i|z_j,θ_j) + p_o(z_{l_1}|Θ_o) p_o(y_i|z_{l_1},θ_{l_1}) + p_o(z_{l_2}|Θ_o) p_o(y_i|z_{l_2},θ_{l_2}) ) <
∑_i^{|Y|} sdd(y_i) log( ∑_{j≠l_1, j≠l_2}^{|Z_o|} p_o(z_j|Θ_o) p_o(y_i|z_j,θ_j) + p_m(z_l|Θ_m) p_m(y_i|z_l,θ_l) )    (5.6.6)

where ∀i ∈ {1, . . . , |Y|} : y_i ∈ Y.
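In analogy to 5.4.2, a sketch of how the sdd-weighted condition 5.6.4 might be checked (the per-point model densities on the enlarged data set are assumed to be precomputed; names are illustrative):

import numpy as np

def sdd_weighted_check(sdd_vals, p_orig, p_mod):
    # eq. 5.6.4: compare sdd-weighted log-likelihoods of the original and modified models
    return np.sum(sdd_vals * np.log(p_orig)) < np.sum(sdd_vals * np.log(p_mod))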

5.6.3 Extended Split

In this section we are going to demonstrate the power of the EM extension with sdd. To demonstrate the split, and to impress our reader with some fancy plots, we choose a two-dimensional data set.

The data points have been sampled from two connected edges forming the letter "L". First we perform the EM step with one model component only and then try to split it. Obviously the optimal number of model components is two. Without giving any proofs, we state that the classical EM framework extended by split and merge fails to split the model component. We leave the proof as an exercise for the reader.

In figure 5.9 we demonstrate the model estimation with one model component. The arrangement of bold black dots shows the data set. To show the Gaussian Like density we use a 3D plot. The z-axis gives the values of the Gaussian Like density corresponding to the model component denoted by the line segment.


Fig. 5.10: Smoothed data density.

We use two points of view to show the data arrangement and the model component (left), and the Gaussian Like density (right).

As already mentioned, the EM extension with split and merge fails to split the model component. The reason is that the splitting mechanism fails to "see" the lack of data points in the middle of the model component. The result is that the model estimation predicts an edge where no edge is to be found.

Having knowledge about the ground truth density would allow us to "notice" the lack of points. The smoothed data density is defined on the whole domain 𝕏 and therefore estimates the data density in this region too. Thus, sdd is able to detect this gap.

In figure 5.10 we demonstrate the smoothed data density corresponding to the data set X ⊂ 𝕏. We again use two points of view: the data arrangement and the "back" of the sdd density (left), and the "front" side of the sdd density, demonstrating the gap (right).

The key of the extended split is to compare the density values of the model estimation with the sdd values along the edge. As we can foresee, there will be a contrast in the middle of the line segment.

There are two open questions. First, how do we compare the densities? Second, how do we estimate the sdd values outside the data set? The answer to the first question is straightforward: we use the KLD for densities defined on the appropriate domain. The answer to the second question is trickier. We assume that we are able to resample the scene along the line segment. If we do so at some intervals, we get a finite data set extension by "assumed sample points".

How do we estimate the sdd value at an assumed sample point? Given the data set X, and consequently the respective sdd density, and an arbitrary point y, sdd(y) is the expectation value of the data density given the point y. Recall that E[h(X)|Y = y] = ∫_x h(x) f_{X|Y}(x|y) dx. Thus

sdd(y) = E[sdd(X)|y] ≈ ∑_i sdd(x_i) G(x_i|y,h)    (5.6.7)

where f_{X|Y}(x|y) = G(x_i|y,h) is the Gaussian with mean at y and σ = h; h is the bandwidth parameter according to the bandwidth parameter of sdd(x_i). Note that in the multidimensional case x_i and y are vectors, and so, correspondingly, is the Gaussian. In the two-dimensional case the Gaussian has the form:

G(\vec{x}_i|\vec{y},h) = 1/(2πh²) · exp( −‖\vec{x}_i − \vec{y}‖² / (2h²) )
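A sketch of this interpolation step (eq. 5.6.7) with a 2D Gaussian kernel and, for simplicity, a single bandwidth h; all names are illustrative assumptions:

import numpy as np

def sdd_at_assumed_points(assumed_points, data_points, sdd_values, h):
    # eq. 5.6.7: sdd(y) ~= sum_i sdd(x_i) * G(x_i | y, h), evaluated for assumed
    # sample points y placed at regular intervals along a line segment
    diffs = assumed_points[:, None, :] - data_points[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    kernel = np.exp(-sq_dist / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2)   # 2D Gaussian
    return kernel @ sdd_values

# Example: assumed sample points at regular intervals along a segment from a to b
# a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])
# assumed = a + np.linspace(0.0, 1.0, 20)[:, None] * (b - a)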

In the left plot of figure 5.11 we demonstrate the development of the sdd values at some intervals corresponding to the assumed sample points. You will notice the familiar data arrangement, slightly rotated.


Fig. 5.11: Left: estimated sdd values on the line. Right: the split Gaussian Like density.

The almost invisible thin line segment denotes the model component. We again use a 3D plot to show the sdd values on the z-axis. We clearly see the gap in the middle of the line segment. The splitting algorithm is now able to detect the discontinuity of the line segment. The assumed sample points with too low sdd density values denote this discontinuity and lead us to splitting.

To prove the splitting assumption we refer to the modification condition 5.6.5. We check the condition by enlarging the data set X by the assumed sample points y. The enlarged data set Y contains the actual data set X and the assumed sample points Y_a on the line segment. The set of assumed sample points contains sample points with nearly the same sdd values as the data points in the environment of these assumed sample points; we call that set Y_x. The erroneously assumed sample points are grouped in the set Y_y. Thus the expanded data set is Y = X ∪ Y_x ∪ Y_y.

To check the splitting condition we need two model estimations, the old and the modified. To modify the model estimation we assume that the model component had already been split. To see how the propositions are constructed, please consult section 6.4 in the following chapter. The proposition and the corresponding modification of the model estimation are presented in the right plot of figure 5.11. We rotated the plot to show that the model components and the respective densities are parted.

Since the density values for all data points in the sets X and Y_x are nearly equal, we check the splitting condition for the data set Y_y only.

Checking the condition, we want to verify that the KLD value decreases. But there is a problem if we check the KLD value for the data set Y_y. The Kullback-Leibler divergence, defined as

KLD_{Y_y} = D(sdd(Y)||p(Y|Θ)) = ∫_y sdd(y) log( sdd(y) / p(y|Θ) ) dy,

is negative for the data in Y_y. The sdd(y) values are nearly zero, but the original model density values p(y|Θ) are greater than zero. Thus the following holds: sdd(y) < p(y|Θ) ⇒ log( sdd(y) / p(y|Θ) ) < 0. The greater the model density values, the greater the absolute value of the KLD. The consequence is: the smaller the model density, the nearer the KLD value is to zero, which is desired.

Thus we have to ensure the positive development of the KLD value and multiply both sides of the condition 5.6.5 by −1. In fact, for the data set Y_y we only need to invert the inequality, resulting in:

∑_i^{|Y_y|} sdd(y_i) log(p_o(y_i|Θ_o)) > ∑_i^{|Y_y|} sdd(y_i) log(p_m(y_i|Θ_m))

which is valid since ∀y_i ∈ Y_y : p_o(y_i|Θ_o) > p_m(y_i|Θ_m).


Fig. 5.12: Left: Data set for merging example. Right: The corresponding sdd density.

There is one issue to consider before we leave the splitting. The data set extension must not falsify

the actual sdd computation. If the assumed sample points are closer to their surrounding data points than those points' own nearest neighbours, the sdd density will rise inappropriately. The concerned samples obviously come from Y_x. Thus, if the distance from y to x_i is smaller than the distance from x_i to its nearest neighbours, we just take the respective neighbour of x_i instead of y for the estimation. In the framework of [22] it is proposed to estimate the bandwidth parameter h of the sdd density by the third nearest neighbour. Thus, we take for y just the third nearest neighbour of x_i to avoid falsification of the sdd density.

5.6.4 Extended Merge

As we have seen in the previous section, we need to ensure that the KLD computation is performed for densities over the same domain. To derive the merging step we demonstrate a new example. Figure 5.12 shows the data distribution (left) and its corresponding sdd density (right). As specialists we see that the data set may be approximated by a model estimation with one component only.

However, let us consider that two model components had been proposed by our algorithm (see the left plot of figure 5.13). Each model component uses a Gaussian Like density to weight the corresponding points. The data points can only be weighted if the corresponding weights are measurable. The range of measurable weights is determined by a system parameter, the "radius". Only the data points within the distance of the radius are weighted by the corresponding model component. We call this set of data points the "region of influence".

Summing over the Gaussian Like densities of the components results in the statistical model which approximates the data set. If two or more model components weight one and the same data point, we say the model components "overlap". The corresponding regions of influence share the data points.

In the left plot of figure 5.13 we show disconnected model components as bold line segments. Their regions of influence overlap, resulting in a composition of two densities which weights the data points in the range of the overlap twice. In the right plot of figure 5.13 we show the corresponding model density.

How do we compute the KLD value if some of the data points occur more than once? In the extension of the classical EM by split and merge steps we introduced the method of normalising the model estimation to perform the modification check. This modification check was derived from the convergence property of the method and so preserves it.


Fig. 5.13: Left: two model components proposed for merging. Right: the corresponding combined Gaussian Like density.

Let us consider two model components l_1 and l_2. The corresponding regions of influence are X_1 and X_2 with X_1, X_2 ⊆ X and X_1 ∩ X_2 ≠ ∅. We group the data points influenced by the model components into a multiset Y. The multiset Y is built by the composition of X_1 and X_2. Thus the data points in the range of the overlap occur in Y twice.

Using the non-normalised model estimation for the KLD computation would make this multiset the domain. Of course, some further handling would be needed to make the computation mathematically correct: the repeated data points would have to be labelled to ensure their influence in full quantity. But we are not going to compute the KLD value on the "domain" Y. Since the model estimation with one model component only is defined on the domain X, we have to compare the KLD on the same domain. Thus, we have to normalise the KLD computation.

The authors of [23] proposed the following method of normalisation. Let D_o be the KLD computation for the original model estimation with two model components and D_m be the KLD computation for the modified model estimation. We compare both densities with the ground truth density estimation sdd on the domain X. If D_m < D_o then the merging is "advantageous".

The computation of D_m is straightforward. We introduce the computation and normalisation of D_o only.

D_o(sdd(X)||p_o(X|Θ_o)) = ( |X_1| D(sdd(X_1)||p_o(X_1|l_1,θ_1)) + |X_2| D(sdd(X_2)||p_o(X_2|l_2,θ_2)) ) / ( |X_1| + |X_2| )    (5.6.8)

where p_o(X_j|l_j,θ_j) for j ∈ {1,2} are the component densities. We ignore the prior probabilities here because it is assumed that the components are uniformly distributed. Thus the merge condition proposed by the authors of [23] is:

D_o(sdd(X)||p_o(X|Θ_o)) > D_m(sdd(X)||p_m(X|Θ_m))    (5.6.9)
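A sketch of this normalised merge check (eqs. 5.6.8/5.6.9), with the KLD evaluated discretely on the weighted data points; the helper names and the discrete form of D(·||·) are illustrative assumptions:

import numpy as np

def discrete_kld(sdd_vals, model_vals):
    # discrete KLD between the sdd weights and a component density on the same points
    return np.sum(sdd_vals * np.log(sdd_vals / model_vals))

def merge_is_advantageous(sdd1, dens1, sdd2, dens2, sdd_all, dens_merged):
    # eq. 5.6.9: merge l1 and l2 if the KLD of the merged component (D_m) is smaller than
    # the size-weighted average KLD of the two original components (D_o, eq. 5.6.8);
    # sddj/densj are the sdd and component-density values on the region of influence X_j
    n1, n2 = len(sdd1), len(sdd2)
    d_o = (n1 * discrete_kld(sdd1, dens1) + n2 * discrete_kld(sdd2, dens2)) / (n1 + n2)
    d_m = discrete_kld(sdd_all, dens_merged)
    return d_m < d_o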

Let us consider the following, very improbable, example. The data set was generated by sampling one single edge. The first steps of the method estimated the model with two collinear line segments l_1 and l_2. In the merge step the proposition is the line segment which results if the original components are expanded towards each other until they connect.

We consider the multiset Y as a composition of three sets, Y = X_{1\O} ∪ X_{2\O} ∪ X_O, where X_O = X_1 ∩ X_2, and X_{1\O} = X_1 − X_O and X_{2\O} = X_2 − X_O are the difference sets. The set X_O contains the data points which are weighted by both components.


Fig. 5.14: Left: The merged model component. Right: The corresponding GaussianLike density.

are weighted by both components. The KLD computation according to 5.6.8 would be:

D_o(sdd(X) \| p_o(X|\Theta_o)) = \frac{|X_{1\O}|}{|Y|} D(sdd(X_{1\O}) \| p_o(X_{1\O}|\Theta_o)) + \frac{|X_{2\O}|}{|Y|} D(sdd(X_{2\O}) \| p_o(X_{2\O}|\Theta_o)) + 2 \cdot \frac{|X_O|}{|Y|} D(sdd(X_O) \| p_o(X_O|\Theta_o))    (5.6.10)

What do we see in formula 5.6.10? The difference between the KLD computation for the original model estimation and the one for the modified estimation is the KLD value on the domain X_O. If we assume that no overlapping takes place, the KLD values for both model estimations would be equal and the condition would not hold. The greater the overlap, the worse the KLD value and the more probable the merging.

The resulting model with one component only is presented in figure 5.14. We see that this result is a better approximation of the data set because it is more similar to sdd in 5.12.

We had to make some assumptions to derive the "extended merging". Firstly, we assumed that we knew how to estimate the density values and parameters of the resulting model component. The merged component is proposed to be computed by orthogonal regression over the data set X. Knowing the parameters of the model component we can compute its density.

Secondly, we assumed that the check on merge advantage has to be carried out if "overlapping" takes place. In chapter 6 we discuss the implementation; a method of overlap detection is presented there. Please consult 6.5.1 for further details on this issue.

The normalisation of the KLD value proposed by Latecki et al. has considerable advantages. It is much faster to implement. The model components j with corresponding data sets X_j are weighted the less, the fewer elements are in X_j. In such a way, components which approximate outliers can be detected. The disadvantage is that the convergence is no longer preserved. As we will see in the evaluation of the method in chapter 7, section 7.5.2, the method may oscillate if the system parameters are badly adjusted.


Chapter 6

EMSFSM

In the previous chapter we introduced and derived the Expectation Maximization Segment Fitting with Split and Merge method proposed by [22]. In this chapter we want to describe how to implement the algorithm. We obtained the formulas from the implementation provided by the authors of [22]. Since our task is to reimplement and analyse the method, we use these statements in our application too.

The data set is a digitized representation of some real world scene. Different from digital image capturing, our data are sampled edges found in the scene. The data can evidently come from processing digital images, for example by edge detection approaches. Laser range scan maps provide a digital scene representation in the form of sampled edges.

The robot scanning the environment collects at fixed intervals the distance measurements to the nearest surfaces within its range of view. Multiple scans from different directions are required to obtain information about the subject. In the procedure of reconstruction the shape of the subject is extrapolated from the set of scans.

Consider a data set which was captured along an edge. The scanner samples the edge at fixed intervals, but the captured sample points statistically vary from the real scene points. We assume that the statistical variation follows the normal distribution with the mean at the scene point and some standard deviation σ depending on the capturing equipment. In such a way we may consider that the data set has been produced by a linear combination of Gaussian distributions with the real world edge as the best approximation.

But, since we are only given the data set without any knowledge about the scene, how can we find the best or nearly best approximation of the edge? In other words, how can we find the best estimation of the model of the law of nature? The law of nature is the ground truth density q(x). It is estimated by the parametric density function p(x|Θ_n). In the model, the parameters vector Θ describes the placement of the edges, and the model describes the weights distribution generated by these statistically modelled edges. p(x|Θ_n) is the model estimate in the state n and Θ_n describes the estimation of the edges distribution in this particular state.

As we described in previous chapters, to improve the parametric density estimation in the EM framework we fit our model estimate onto the nonparametric ground truth estimation sdd (5.5) rather than the data set itself. Since we are not given any evidence about the ground truth density except for the data set,


we use it for the nonparametric estimation sdd. We measure the dissimilarity between the current model estimation p(x|Θ_n) and the ground truth estimate sdd with the Kullback-Leibler Divergence (KLD, 5.3).

The Gaussian distribution has infinite support, thus the function must be cut off. The range of measurable Gaussian values is proposed to be called the "radius". The value of the radius is a system parameter. We extracted the value of the radius from the implementation of the method provided by the authors of [22]. It depends of course on the data set.

The value of the sampling interval gives us a reason to assume a maximum distance l_sub between two neighbouring points belonging to one and the same edge. Distances exceeding this value give evidence of the existence of another edge. We call this value the subsegment length. We approximate the edges with line segments. In the following we use the notion "line segment" for the edge approximation described by the parameters vector Θ_n.

The distance between a data point and a line segment measures the probability of belonging to this particular line segment. This probability is not measurable for points outside the radius. Points outside the range of any line segment form regions which need further development. We describe the handling of such undeveloped regions in 6.3.1.

6.1 Method

In this section we want to give a brief overview of the method implementation. We begin with the initialisation of the algorithm and proceed with the several steps of the iteration, giving the definitions of the specific terms.

Let X be the set of data points. The algorithm initialisation is done by the diagonals of the bounding box of the data set. We proceed with the kernel estimation to compute the nonparametric density estimation sdd.

The algorithm iteration consists of several steps: computation of undeveloped regions, expectation maximisation step, split and merge.

In each iteration we need to ensure that all data points are developed. This means that all data points are in the radius range of some line segment. Only developed points provide us with information for inference. The undeveloped data points are grouped into regions.

We denote the set of undeveloped regions by U. We say regions, but in fact we speak of sets of data points. We perform the classical least squares line fitting on each undeveloped region and extend in this way the set of model components. The lines are then cut into segments. To do so, we perform the orthogonal projection of the data points from the corresponding undeveloped regions onto the lines. The outermost projections cut the lines.

Line segments indicate the possible edges of the real scene. For example, when sampling two neighbouring edges, data points occur which may belong to either of them or be noise. Thus we need a measurement procedure to compute the value of belonging to an edge.

In the expectation step the probability of belonging to an estimated edge or line segment is computed for each data point. We have to perform this computation for all data points with respect to all segments. Thus, the result of the computation is a "weights matrix" W (see 6.3.2).

Now we have all we need to compute and maximise the expectation value. The expectation value is maximized by a new estimation of the model components. We perform the weighted orthogonal regression on the data points.

The weights are given by the weights matrix W and sdd. In fact we adjust a model component onto its region of influence. The weights matrix is now fixed. Thus, the computation of the new estimation of the model component z can only be done on data points which were previously influenced by the old


estimation of the model component z. All other data points are weighted by zero. Thus, the weights matrix also determines the subsets of data on which the orthogonal regression is to be performed.

The model component weights the data points within a particular range. We call this set of data points the region of influence of the segment. It may be that some data points are also influenced by other segments. If we collect from the region of influence only such points which are influenced by this component the most, we obtain a new, smaller set. The weights on these points have the most influence in the orthogonal regression for the line computation. They "support" the line the most. Because of that, we call this set the "support" of the line (see 6.3.6).

It is very important to see the difference between a line and a line segment. Lines are cut into line segments by trimming. The lines are trimmed to their support. The data points from the support are orthogonally projected onto the line. The outermost projections cut the line into a segment (see 6.3.3).

The lines are defined by two points located on them. Analogously, the line segments are defined by two points on them, but those two points are the outermost ones, indicating the bounds of the segment.

We call the combination of the expectation and maximization steps with segment fitting the EMSF. To adjust the number of model components two additional steps have to be introduced: split 6.4 and merge 6.5.

For the splitting decision we assume sample points (see 6.4.2) lying on the segment at sampling intervals l_sub. We estimate the sdd value on these assumed sample points 6.4.4. In case of a too low sdd value we may decide that the assumption of this particular sample point was erroneous. An erroneously assumed sample point makes the positive split decision evident.

We need to perform an EMSF step after the set of model parameters has been modified.

The merge decision is made in three steps. First, the propositions for the merge have to be found and arranged in an advantageous order.

If there is a non-empty set of data points which are weighted by two line segments, we call the relation between the segments "overlap" (see 6.5.1). Only overlapping line segments are proposed for merging. The merging order is determined by the "merging value" 6.5.2. The merging value is recomputed after each merge.

Second, we need to check if after merging the resulting component might be split. No merge is possible if the result would be split afterwards.

The third and last step, performed after a successful split check, is the merge check. The merge is only performed if it is advantageous. The resulting model estimate is measured by the Kullback-Leibler Divergence.

Summarising the overview, we see that the approach, after being initialized, performs the following steps in each iteration:

EMSF → Split → EMSF → Merge → EMSF

6.2 Initialization

Before the algorithm starts, the nonparametric density estimation is performed on the data set according to 5.5.

We initialize our algorithm by two crossing line segments. The line segments connect the two diagonally opposite points of the bounding box of the data set. In the next step the undeveloped regions are found (see 6.3.1).

Notice that at this step the output of the undeveloped regions method is not yet trimmed to the data. We cannot perform the trimming because the weights matrix has not been computed yet. The initial model parameters are shown in figure 6.1.


Fig. 6.1: The set of data points has been initialized by two lines denoted by the diagonals of the bounding box. The undeveloped regions, denoted by five rectangles, are demonstrated. The line segments indicate the placement and orientation of the corresponding lines.

6.3 EMSF

The name of this step is an abbreviation of Expectation Maximization Segment Fitting (EMSF). The extension of the EM is due to the fact that the model parameters are line segments to be fitted.

Except for the first iteration, the procedure performs the undeveloped regions method first to ensure that all data points are used for inference. The E-step is performed by computing the dependences between the actual model parameters and the data set, combining them in a matrix 6.3.2.

We use the resulting matrix, weighted with the nonparametric density values at these points, to perform the M-step. In the M-step the model parameters are optimized by performing the weighted orthogonal regression on the data set with weights resulting from multiplying each row of the weights matrix 6.3.2 by the corresponding estimated nonparametric density value 5.5.3. The resulting lines are then trimmed onto the sets of points supporting the lines (see 6.3.6).

6.3.1 Undeveloped Regions

The radius parameter and a line segment give us the maximum range within which, for each data point, the correspondence to this component is measurable.

We call a data point x ∈ X undeveloped if the following is valid: ∀z ∈ Z : d_{p,s}(x, z) > R, where Z is the set of line segments and R is the system parameter radius. Thus we can propose a method of grouping by:

X^u_i = \{\vec{x} \mid \vec{x} \in X \wedge \forall j \in \{1,\dots,|U|\} : (\vec{x} \notin X^u_j) \wedge \forall z \in Z : (d_{p,s}(\vec{x},z) > R) \wedge d_{p,p}(\vec{x},\vec{\mu}^u_i) < 2R\} \subseteq X    (6.3.1)

where μ^u_i is the centre of the bounding box of the undeveloped region X^u_i, d_{p,s}(·,·) is the point-segment distance function 3.2.4 and d_{p,p}(·,·) is the point-to-point distance function 3.2.1. The set of all undeveloped regions is denoted by U.

By construction we add a data point to one undeveloped region only, such that for two different undeveloped regions the following holds: ∀i, j ∈ {1,…,|U|} : i ≠ j → X^u_i ∩ X^u_j = ∅. Thus the following expression applies:

\forall \vec{x} \in X : (\neg\exists z \in Z : d_{p,s}(\vec{x},z) < R) \rightarrow \exists i \in \{1,\dots,|U|\} : \vec{x} \in X^u_i

This means: all undeveloped data points are elements of an undeveloped region. Figure 6.1 demonstrates the result of the method. Five undeveloped regions have been found. Every region is indicated by its bounding box. Notice that the method computes regions with corresponding line segments of length no greater than 2R. Depending on the radius, a large number of lines is then proposed for the M-step. According to [22] an accurate initialization is not crucial. In the procedure of the M-step redundant model components are either deleted or merged into significant segments.

An example of such redundancy can be seen in the undeveloped regions 1 and 2 on the left hand side of the figure. The corresponding line segments have been computed by the classical least squares algorithm and have then been trimmed to the respective region.
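To make the grouping of 6.3.1 concrete, the following C++ sketch illustrates the idea under simplifying assumptions: the Point and Segment types, the distPointSegment helper and the use of the first undeveloped point of a region as its centre (instead of the bounding box centre μ^u_i) are our own illustrative choices, not the original implementation.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
struct Segment { Point a, b; };

// Euclidean distance between a point and a line segment (compare 3.2.4).
double distPointSegment(const Point& p, const Segment& s) {
    double dx = s.b.x - s.a.x, dy = s.b.y - s.a.y;
    double len2 = dx * dx + dy * dy;
    double t = len2 > 0.0 ? ((p.x - s.a.x) * dx + (p.y - s.a.y) * dy) / len2 : 0.0;
    t = std::max(0.0, std::min(1.0, t));
    double ex = s.a.x + t * dx - p.x, ey = s.a.y + t * dy - p.y;
    return std::sqrt(ex * ex + ey * ey);
}

// Greedy variant of the grouping 6.3.1: every point farther than R from all
// segments is assigned to the first region whose seed lies closer than 2R,
// otherwise it starts a new region.
std::vector<std::vector<Point>> undevelopedRegions(
        const std::vector<Point>& X, const std::vector<Segment>& Z, double R) {
    std::vector<std::vector<Point>> regions;
    std::vector<Point> seeds;                       // stands in for the region centres
    for (const Point& x : X) {
        bool developed = false;
        for (const Segment& z : Z)
            if (distPointSegment(x, z) <= R) { developed = true; break; }
        if (developed) continue;                    // point is already explained by a segment
        bool placed = false;
        for (std::size_t i = 0; i < regions.size(); ++i) {
            double dx = x.x - seeds[i].x, dy = x.y - seeds[i].y;
            if (std::sqrt(dx * dx + dy * dy) < 2.0 * R) {
                regions[i].push_back(x);
                placed = true;
                break;
            }
        }
        if (!placed) { regions.push_back({x}); seeds.push_back(x); }
    }
    return regions;
}

Each returned region would then be fitted by least squares and trimmed as described above.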

6.3.2 Weights Matrix

As we have learned from previous chapters, knowing the current model parameters, we can compute p(x|z, θ_z) and p(z|x, θ_z). As already introduced, x ∈ X are the data points, z ∈ Z are the model components with Z as the set of model components, and θ_z is the parameters vector corresponding to z. The model estimation computes the probability of occurrence of the data points by p(x|Θ). Those values are computed as the sum of the values of influence of all model components. What is this influence? The model component weights the point by its density p(x|z, θ_z). Only a part of this weight flows into the sum. This part is determined by the prior probability α_z of the model component. Please consult the EM chapter 4.2.1 for further details. The notation in this particular section deviates from our notation, since the general case is considered.

In the work of [22] it is proposed to assume equal prior probabilities for all model components. We can justify this proposition by the EM example, which shows the EM development on badly sampled data 4.3.4. The model estimation with one component only optimises the log likelihood better than the two-component model estimation. The answer to our puzzlement is given in chapter 5, where we have seen the development of the expectation value depending on the prior in figure 5.4. The log likelihood is maximized by equally distributed priors.

Thus, we can call p(x|z, θ_z) the weight of the model component z on the data point x. Collecting all weights in a matrix results in a "weights matrix".

What is the difference between the weights matrix and the matrix consisting of the belonging probabilities W considered before? The answer is the normalisation. The weights are normalized in such a way that the sum of the weights of all model components on one data point is equal to 1.

Now it becomes somewhat confusing: the values of the normalised weights matrix and the sdd density are going to be used in the maximization step as "weights". To simplify the matter, we use the term weights matrix for the normalised weights matrix W.

Computation of the weights matrix is in fact the expectation step of the algorithm. The weight w_{i,j} = w(x_i, z_j) represents the probability that point x_i corresponds to segment z_j. Let X be the set of data points and Z the actual set of model parameters. Notice that, depending on the iteration, the set of model parameters Θ is changing. The number |Z| of line segments and their settings vary. For simplification we state the following definition without an index of iteration.


The weights matrix is defined as:

W = \begin{pmatrix} w(\vec{x}_1,z_1) & w(\vec{x}_1,z_2) & \dots & w(\vec{x}_1,z_{|Z|}) \\ w(\vec{x}_2,z_1) & w(\vec{x}_2,z_2) & \dots & w(\vec{x}_2,z_{|Z|}) \\ \vdots & \vdots & \ddots & \vdots \\ w(\vec{x}_{|X|},z_1) & w(\vec{x}_{|X|},z_2) & \dots & w(\vec{x}_{|X|},z_{|Z|}) \end{pmatrix}    (6.3.2)

where x_i ∈ X are the data points with i ∈ {1,…,|X|} and z_j ∈ Z are the segments of the actual iteration with j ∈ {1,…,|Z|}.

w(\vec{x}_i, z_j) = e^{-\frac{d_{p,s}(\vec{x}_i, z_j)^2}{2\sigma^2}}    (6.3.3)

where d_{p,s}(x_i, z_j) is the point-segment distance function 3.2.4. The weights are normalized such that \sum_{j=1}^{|Z|} w(\vec{x}_i, z_j) = 1 for each i, or are all equal to zero if the norm is zero.
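A minimal sketch of this E-step, reusing the Point and Segment types and the distPointSegment helper assumed in the sketch for 6.3.1; the function name and interface are ours, not those of the original code.

#include <cmath>
#include <cstddef>
#include <vector>

// Row-normalised weights matrix W of 6.3.2/6.3.3: entry (i, j) is the weight of
// segment j on data point i, normalised so that each non-zero row sums to 1.
std::vector<std::vector<double>> weightsMatrix(
        const std::vector<Point>& X, const std::vector<Segment>& Z, double sigma) {
    std::vector<std::vector<double>> W(X.size(), std::vector<double>(Z.size(), 0.0));
    for (std::size_t i = 0; i < X.size(); ++i) {
        double rowSum = 0.0;
        for (std::size_t j = 0; j < Z.size(); ++j) {
            double d = distPointSegment(X[i], Z[j]);
            W[i][j] = std::exp(-d * d / (2.0 * sigma * sigma));   // equation 6.3.3
            rowSum += W[i][j];
        }
        if (rowSum > 0.0)                                         // rows with zero norm stay zero
            for (double& w : W[i]) w /= rowSum;
    }
    return W;
}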

6.3.3 Line Computation and Trimming

This step of our algorithm is the M-step. We optimize the model parameters by fitting the line segments onto the data. In fact we adjust the model components onto their region of influence. The weights matrix is now fixed. Thus, the computation of the new estimation of the model component z_j can only be done on data points which were previously weighted by the old estimation of the model component z_j. All other data points are weighted by zero. Thus, the weights matrix determines the subsets of data on which the weighted orthogonal regression is to be performed.

Thus, we perform the orthogonal regression |Z| times, weighted by the corresponding columns of the weights matrix W 6.3.2, in which each row has been multiplied by the nonparametric density estimation value at the corresponding data point.

The result is a set of size |Z| containing for each j ∈ {1,…,|Z|} a pair of the following values:

\vec{\mu}_j = \frac{\sum_{i=1}^{|X|} \vec{x}_i\, w_{i,j}\, sdd(\vec{x}_i)}{\sum_{i=1}^{|X|} w_{i,j}\, sdd(\vec{x}_i)}    (6.3.4)

\Sigma_j = \sum_{i=1}^{|X|} w_{i,j}\, sdd(\vec{x}_i)\, (\vec{x}_i - \vec{\mu}_j)(\vec{x}_i - \vec{\mu}_j)^t    (6.3.5)

where μ_j is the centre of gravity and Σ_j describes the spread of the data which have been weighted by the previously estimated model component z_j.

For each pair we compute the line going through the centre point and directed according to the spread. For this purpose the unit eigenvector e_j corresponding to the largest eigenvalue of the matrix Σ_j in 6.3.5 is computed.

The line is then defined by two points located on it:

l_j = (\vec{l}_{1_j}, \vec{l}_{2_j}), \quad \vec{l}_{1_j} = \vec{\mu}_j - \vec{e}_j, \quad \vec{l}_{2_j} = \vec{\mu}_j + \vec{e}_j

The computed line is then trimmed to its support set. The support set S_j is the set of points whose probability of supporting the segment z_j is the largest.

S_j = \{\vec{x}_i \mid \vec{x}_i \in X \wedge w_{i,j} \neq 0 \wedge \forall k \in \{1,\dots,|Z|\} : j \neq k \rightarrow w_{i,j} > w_{i,k}\}    (6.3.6)


The lines are cut by the orthogonal projection of the supporting points onto the line. The outermost points determine the bounds of the line segment. Segments whose length is smaller than half of the radius¹ are removed.

Thus we define the line segment as follows:

z_j = (\vec{z}_{1_j}, \vec{z}_{2_j}), \quad \text{with } \vec{z}_{1_j} = \vec{\mu}_j + d_{min}\vec{e}_j \text{ and } \vec{z}_{2_j} = \vec{\mu}_j + d_{max}\vec{e}_j,    (6.3.7)

where d_{min} = \min_{\vec{x}_i \in S_j} (\vec{x}_i - \vec{\mu}_j) \cdot \vec{e}_j and d_{max} = \max_{\vec{x}_i \in S_j} (\vec{x}_i - \vec{\mu}_j) \cdot \vec{e}_j.
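A possible M-step for one component, following equations 6.3.4-6.3.7. We assume the combined weights w_{i,j} · sdd(x_i) are passed in as a single array and, as a simplification, trim the line at all points with non-zero weight rather than at the stricter support set 6.3.6; Point and Segment are the types from the earlier sketches.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted orthogonal regression and trimming for one model component.
// weights[i] holds w_{i,j} * sdd(x_i) for the component at hand and is
// assumed to be not all zero.
Segment fitSegment(const std::vector<Point>& X, const std::vector<double>& weights) {
    // Weighted centre of gravity (6.3.4).
    double wsum = 0.0, mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        wsum += weights[i];
        mx += weights[i] * X[i].x;
        my += weights[i] * X[i].y;
    }
    mx /= wsum; my /= wsum;
    // Weighted scatter matrix (6.3.5).
    double sxx = 0.0, sxy = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        double dx = X[i].x - mx, dy = X[i].y - my;
        sxx += weights[i] * dx * dx;
        sxy += weights[i] * dx * dy;
        syy += weights[i] * dy * dy;
    }
    // Unit eigenvector of the largest eigenvalue of the symmetric 2x2 scatter matrix.
    double lambda = 0.5 * (sxx + syy + std::sqrt((sxx - syy) * (sxx - syy) + 4.0 * sxy * sxy));
    double ex = sxy, ey = lambda - sxx;
    if (std::abs(ex) < 1e-12 && std::abs(ey) < 1e-12) { ex = 1.0; ey = 0.0; }  // axis-aligned or isotropic case
    double norm = std::sqrt(ex * ex + ey * ey);
    ex /= norm; ey /= norm;
    // Trim the line at the outermost projections of the weighted points (6.3.7).
    double dmin = 1e300, dmax = -1e300;
    for (std::size_t i = 0; i < X.size(); ++i) {
        if (weights[i] <= 0.0) continue;
        double d = (X[i].x - mx) * ex + (X[i].y - my) * ey;
        dmin = std::min(dmin, d);
        dmax = std::max(dmax, d);
    }
    return Segment{ Point{mx + dmin * ex, my + dmin * ey},
                    Point{mx + dmax * ex, my + dmax * ey} };
}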

6.4 Split

The split step is performed on line segments when some of their subsegments cannot find any correspondence to the supposed edges in the real scene. For this purpose we assume a data point on that edge and estimate its smoothed data density (sdd) value. If this value is less than a minimally measurable value, then the assumption was not correct. We represent the line segment by such assumed data points and estimate their sdd values. If all of a line segment's representatives have a too small sdd value, the line segment obviously has no correspondence to any edge in the real scene and will be removed.

We divide the line segments by sampling them into sub-elements. The sampling interval is called the "subsegment length" l_sub². The assumed sample points located on the line segment represent the subsegments. In case of a too small sdd value estimation, the subsegment represented by the corresponding assumed sample point is removed and the segment is split. The remaining neighbouring subsegments are merged into one new model component.

6.4.1 Assumed Sample Points

The assumed sample points represent the subsegments of a line segment z_j to be split. The number of assumed sample points is computed by:

|Y_j| = \frac{\|\vec{z}_{2_j} - \vec{z}_{1_j}\|}{l_{sub}},    (6.4.1)

where l_sub is the subsegment length, z_j = (\vec{z}_{1_j}, \vec{z}_{2_j}) (see 6.3.7) is the segment to be split, and Y_j is the set of assumed sample points computed by:

Y_j = \{\vec{y}_{i,j}\}, \quad \forall i \in \{1,\dots,|Y_j|\} : \vec{y}_{i,j} = \vec{z}_{1_j} + i \cdot l_{sub} \cdot \vec{e}_j,    (6.4.2)

where e_j is the unit orientation vector of the line segment z_j. In figure 6.2 the sampling of a line segment is demonstrated. The line segment is sampled using the sampling interval l_sub. The corresponding sample points are denoted by equidistantly placed marks.
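A direct transcription of 6.4.1/6.4.2 into a small helper, reusing the Point and Segment types from the earlier sketches.

#include <cmath>
#include <vector>

// Assumed sample points placed along the segment z at intervals of l_sub.
std::vector<Point> assumedSamplePoints(const Segment& z, double lsub) {
    std::vector<Point> samples;
    double dx = z.b.x - z.a.x, dy = z.b.y - z.a.y;
    double len = std::sqrt(dx * dx + dy * dy);
    if (len <= 0.0 || lsub <= 0.0) return samples;
    double ex = dx / len, ey = dy / len;                  // unit orientation vector e_j
    int n = static_cast<int>(len / lsub);                 // |Y_j| of equation 6.4.1
    for (int i = 1; i <= n; ++i)                          // y_{i,j} = z_{1_j} + i * l_sub * e_j
        samples.push_back(Point{ z.a.x + i * lsub * ex, z.a.y + i * lsub * ey });
    return samples;
}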

6.4.2 Estimating The Sample Point Density Value

The estimated sample point density value computed in this section determines if the corresponding subsegment is to be removed and the line segment split. Let y_{i,j} be the i'th assumed sample point

¹The value "radius" is used to define the developed regions, which are sets of data points within the predefined radius of the line segment computed by the method in 6.3.3. The value of "radius" is a system parameter and emanates from the data recording device.

²The subsegment length value depends on the settings of the recording and measuring equipment.


Fig. 6.2: Sample points and their nearest neighbours used to estimate the nonparametric density value. The line segment is split into two new line segments.

of the line segment z_j. The set of supporting points X_{y_{i,j}} is constructed by finding the ten³ nearest neighbours. Figure 6.2 shows some of the nearest neighbours with the corresponding distances for the sample points y_2 and y_7.

We call the ten nearest neighbours of the assumed sample point the "supporting points" of the corresponding subsegment. For further computation we define the following distance function:

d_y(\vec{x},\vec{y}) = \begin{cases} \|\vec{x}-\vec{y}\| & \text{if } \|\vec{x}-\vec{y}\| > d^3_p(\vec{x}) \\ d^3_p(\vec{x}) & \text{else} \end{cases}    (6.4.3)

where d^k_p(x) is the distance between the data point x and its k'th nearest neighbour (compare 3.2.2). k in this case is a system parameter and, as proposed by the authors of [22], is set to k = 3.

d^k_p(x) is a correction of the distance function. Since we use a distance function for the kernel estimation in sdd, we have to set here a minimum distance to the nearest neighbour for each data point x depending on the bandwidth h (compare section 5.5). Otherwise we might get results of higher probabilities than the density values determined by the nonparametric ground truth estimation sdd. In such a way the distance function computes the distance between an assumed sample point y on the line segment and the data point x, or returns the minimum value for this data point.

With these considerations it is now possible to estimate the smoothed data density value of the current sample point:

sdd(\vec{y}_{i,j}) = \sum_{n=1}^{10} sdd(\vec{x}_n)\, G(d_y(\vec{x}_n, \vec{y}_{i,j}), 0, h),    (6.4.4)

with x_n ∈ X_{y_{i,j}}, where G is the normal probability density function and h is the bandwidth parameter according to the sdd bandwidth in 5.5.4. The splitting at y_{i,j} is performed if sdd(y_{i,j}) < MINDENS, where MINDENS⁴ is a system parameter.

³The number of nearest neighbours to be found for the support computation of a sample point is a system parameter and was proposed by [22].

⁴The MINDENS system parameter is set to 0.000001.
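The following sketch estimates the sdd value at one assumed sample point along the lines of 6.4.3/6.4.4. The interface is our own: the precomputed sdd values of the data points and their third-nearest-neighbour distances are passed in as plain arrays, and a brute-force search over all points stands in for a kd-tree based neighbour query.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian kernel G(d, 0, h).
double gaussianKernel(double d, double h) {
    const double kPi = 3.14159265358979323846;
    return std::exp(-d * d / (2.0 * h * h)) / (std::sqrt(2.0 * kPi) * h);
}

// Smoothed data density at an assumed sample point y (equation 6.4.4).
// sddX[i] is the precomputed sdd value of X[i]; d3[i] is the distance from X[i]
// to its third nearest neighbour, used as the floor in the corrected distance 6.4.3.
double sddAtSamplePoint(const Point& y, const std::vector<Point>& X,
                        const std::vector<double>& sddX,
                        const std::vector<double>& d3, double h) {
    auto dist = [&](std::size_t i) {
        double dx = X[i].x - y.x, dy = X[i].y - y.y;
        return std::sqrt(dx * dx + dy * dy);
    };
    // Indices of the ten nearest data points to y (the supporting points).
    std::vector<std::size_t> idx(X.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::size_t k = std::min<std::size_t>(10, X.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return dist(a) < dist(b); });
    double value = 0.0;
    for (std::size_t n = 0; n < k; ++n) {
        std::size_t i = idx[n];
        double d = std::max(dist(i), d3[i]);      // corrected distance d_y of 6.4.3
        value += sddX[i] * gaussianKernel(d, h);
    }
    return value;    // the split at y is triggered if this value falls below MINDENS
}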


6.5 Merge

A model component weights the data points by its parametric density. Given the model component and the radius parameter we determine the range in which the weighting is measurable. We call the set of data points within this range the "region of influence" of the model component.

The merge step is performed on line segments when their regions of influence overlap 6.5.1. We define below the "merging value" for two segments to determine the order of the merge iterations. The merging depends on two aspects. First, no merge is considered if a split is possible afterwards. Second, no merge is preferable if it leads to a disadvantageous state. A state is measurable with the Kullback-Leibler Divergence.

6.5.1 Overlap

The relation "overlap" determines whether two segments weight the same data points. The relation is defined as:

(z_k, z_l) \in \text{Overlap if}    (6.5.1)

\{\vec{x}_i \mid \vec{x}_i \in X \wedge w(\vec{x}_i, z_k) > ZERO \wedge w(\vec{x}_i, z_l) > ZERO\} \neq \emptyset,

where x_i ∈ X are the elements of the data set and w(x_i, z_j) with j ∈ {k, l} is the (i, j)-th element of the weights matrix 6.3.2. w(x_i, z_j) is the probability for the data point x_i to belong to the line segment z_j.

ZERO⁵ is a system parameter standing for a non-measurable influence of the model component on a data point.
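The overlap relation 6.5.1 reduces to a scan over the weights matrix; a minimal sketch, with the ZERO threshold passed in as a parameter:

#include <cstddef>
#include <vector>

// Two segments k and l overlap if at least one data point receives a weight
// above the ZERO threshold from both of them.  W is the weights matrix of 6.3.2
// (rows: data points, columns: segments).
bool segmentsOverlap(const std::vector<std::vector<double>>& W,
                     std::size_t k, std::size_t l, double zero) {
    for (const std::vector<double>& row : W)
        if (row[k] > zero && row[l] > zero) return true;
    return false;
}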

6.5.2 Merging Value

The merging value determines the precedence order of the merge iterations. To optimize the merge conditions we expand the segment to a line. Then we compute the distances between the data points in the region of influence of one segment and the line corresponding to the other segment. The minimum of the averages of these distances is the "merging value". In such a way collinear segments in the overlap relation are preferred by the merge procedure.

Let z_k and z_l be the line segments and X_{z_k} and X_{z_l} the corresponding regions of influence; then the distance averages are computed by:

\bar{d}_{x,z_k} = \frac{\sum_{i=1}^{|X_{z_l}|} d_{p,l}(\vec{x}_i, z_k)}{|X_{z_l}|},

where d_{p,l}(·,·) is the point-line distance function defined in 3.2.3. The merging value is defined as:

\delta_{z_k,z_l} = \min(\bar{d}_{x,z_k}, \bar{d}_{x,z_l});    (6.5.2)

With this new measure we are now able to sort the segment pairs in the overlap relation proposed for merging. The segment pairs with the lower merging value are preferred.

⁵This value is a system parameter and needs to be adjusted. In this implementation it is set to ZERO = 0.0000000001.
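A sketch of the merging value 6.5.2, assuming a point-to-line distance helper in place of 3.2.3 and the two regions of influence given as non-empty point lists; the names are ours.

#include <algorithm>
#include <cmath>
#include <vector>

// Distance from a point to the infinite line through the endpoints of s.
double distPointLine(const Point& p, const Segment& s) {
    double dx = s.b.x - s.a.x, dy = s.b.y - s.a.y;
    double len = std::sqrt(dx * dx + dy * dy);
    return std::abs((p.x - s.a.x) * dy - (p.y - s.a.y) * dx) / len;
}

// Merging value of two overlapping segments (6.5.2): the minimum of the two
// average distances between one segment's region of influence and the carrier
// line of the other segment.
double mergingValue(const Segment& zk, const std::vector<Point>& Xk,
                    const Segment& zl, const std::vector<Point>& Xl) {
    double dk = 0.0, dl = 0.0;
    for (const Point& x : Xl) dk += distPointLine(x, zk);   // points of z_l against the line of z_k
    for (const Point& x : Xk) dl += distPointLine(x, zl);   // points of z_k against the line of z_l
    dk /= Xl.size();
    dl /= Xk.size();
    return std::min(dk, dl);
}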


6.5.3 Checking for Merge Advantage with KLD

Before two segments are merged into a new one, we want to know if this operation would lead the approach to a "better" result. "Better" in our context means a model estimation which optimizes the KLD. We explain here the implementation proposed by the authors of [22] and introduced in the previous chapter in section 5.6. Notice that this implementation does not preserve the convergence (compare the method evaluation in 7.5.2).

It is assumed that the data points are distributed "Gaussian-like" around the line segment. The Gaussian-like (Gl) density value at the data point x_i for a given segment z_j is computed by:

Gl(\vec{x}_i|z_j,\sigma) = G(d_{p,s}(\vec{x}_i, z_j), 0, \sigma),    (6.5.3)

where G is the normal distribution⁶ and d_{p,s}(·,·) is the point-segment distance function defined in 3.2.4. Figure 3.1 demonstrates the distance computation according to σ as the range of the standard deviation which is used for Gl 6.5.3.

To check the merge advantage we compare the model estimates with the nonparametric density estimation sdd locally. This means that we only need to compute the KLD in the regions of influence of both segments to be merged. To do so we need to ensure the proper normalisation.

The KLD computation for the resulting segment is straightforward. Let z_r be the resulting line segment proposed by the merge. X_{z_r} = X_{z_k} ∪ X_{z_l} is the region of influence of this segment. To determine this set we compute the union of X_{z_k} and X_{z_l}.

To compute the KLD value on a collection of segments, we have to compute the KLD values for each model component separately and then combine them. Since the regions of influence of the segments overlap, we need a normalisation.

Before we can compute the KLD we have to ensure that the densities to be compared are normalised. Let X_{z_j}, with j ∈ {k, l, r}, be a region of influence; then the density normalisation is computed by:

GlNorm(\vec{x}_i|z_j,\sigma) = \frac{1}{\sum_{n=1}^{|X_{z_j}|} Gl(\vec{x}_n|z_j,\sigma)} \cdot Gl(\vec{x}_i|z_j,\sigma)

sddNorm(\vec{x}_i) = \frac{1}{\sum_{n=1}^{|X_{z_j}|} sdd(\vec{x}_n)} \cdot sdd(\vec{x}_i)

with x_i ∈ X_{z_j}.

The KLD values corresponding to the regions of influence are then computed by:

g_j = \sum_{i=1}^{|X_{z_j}|} sddNorm(\vec{x}_i) \log\left(\frac{sddNorm(\vec{x}_i)}{GlNorm(\vec{x}_i|z_j,\sigma)}\right),    (6.5.4)

with j ∈ {k, l, r}.

Normalising the combination of the KLD values we can compare the model estimations before merging and afterwards. The merge is disadvantageous if:

\frac{|X_{z_k}| \cdot g_k + |X_{z_l}| \cdot g_l}{|X_{z_k}| + |X_{z_l}|} < g_r;    (6.5.5)

⁶σ is a system parameter and depends on the settings of the data recording device.
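A sketch of this local comparison following 6.5.3-6.5.5. The inputs are precomputed per region of influence (the sdd values and the point-segment distances of its points) as our own interface choice; the Gaussian-like density is evaluated up to its constant factor, which cancels in the normalisation.

#include <cmath>
#include <cstddef>
#include <vector>

// Local KLD between the normalised sdd and the normalised Gaussian-like density
// on one region of influence (equation 6.5.4).  sddXz[i] is the sdd value and
// dSeg[i] the point-segment distance of the i-th point of that region.
double localKld(const std::vector<double>& sddXz,
                const std::vector<double>& dSeg, double sigma) {
    std::vector<double> gl(dSeg.size());
    double glSum = 0.0, sddSum = 0.0;
    for (std::size_t i = 0; i < dSeg.size(); ++i) {
        gl[i] = std::exp(-dSeg[i] * dSeg[i] / (2.0 * sigma * sigma));  // Gl of 6.5.3 up to a constant
        glSum += gl[i];
        sddSum += sddXz[i];
    }
    double kld = 0.0;
    for (std::size_t i = 0; i < dSeg.size(); ++i) {
        double p = sddXz[i] / sddSum;          // sddNorm
        double q = gl[i] / glSum;              // GlNorm
        if (p > 0.0 && q > 0.0) kld += p * std::log(p / q);
    }
    return kld;
}

// Condition 6.5.5: the merge of z_k and z_l into z_r is advantageous unless the
// size-weighted combination of the original local KLD values is smaller than
// the local KLD of the merged component.
bool mergeIsAdvantageous(double gk, std::size_t nk,
                         double gl, std::size_t nl, double gr) {
    return (nk * gk + nl * gl) / static_cast<double>(nk + nl) >= gr;
}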


Chapter 7

Evaluation

7.1 Simulating Data

For the purpose of testing and visualising the implementation we had to be supplied with a large number of different data sets. Since no laser scan device or any other capturing device was available, we had to improvise with a data generating machinery.

Before evaluating the random sample point generator we have to note that the data is not generated exactly according to our assumption. The assumption was that the sampling along the segment is done uniformly and that each sample point is then displaced according to an isotropic Gaussian around the original sample point.

Our generation method uses the C++ built-in random number generator to produce uniformly distributed data points. The one-dimensional normal distribution is computed using the Box-Muller method [34]. We produce uniformly distributed data along the segment and displace the samples orthogonally according to the 1D normal distribution. This is not equivalent to the assumption but provides good enough data sets for testing the method and debugging the implementation.
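A sketch of this generator; the Box-Muller transform on top of rand() is an assumption about the original tool, which we have not reproduced verbatim, and std::normal_distribution would be the modern alternative. Point and Segment are the types from the earlier sketches.

#include <cmath>
#include <cstdlib>
#include <vector>

// One standard normal variate via the basic Box-Muller transform.
double boxMuller() {
    const double kPi = 3.14159265358979323846;
    double u1 = (std::rand() + 1.0) / (RAND_MAX + 2.0);   // shifted to avoid log(0)
    double u2 = (std::rand() + 1.0) / (RAND_MAX + 2.0);
    return std::sqrt(-2.0 * std::log(u1)) * std::cos(2.0 * kPi * u2);
}

// Sample n points uniformly along the segment z and displace each one
// orthogonally by 1D Gaussian noise with standard deviation sigma.
std::vector<Point> sampleNoisySegment(const Segment& z, int n, double sigma) {
    double dx = z.b.x - z.a.x, dy = z.b.y - z.a.y;
    double len = std::sqrt(dx * dx + dy * dy);
    double ex = dx / len, ey = dy / len;      // direction along the segment
    double nx = -ey, ny = ex;                 // orthogonal unit normal
    std::vector<Point> data;
    for (int i = 0; i < n; ++i) {
        double t = len * (std::rand() / static_cast<double>(RAND_MAX));
        double d = sigma * boxMuller();
        data.push_back(Point{ z.a.x + t * ex + d * nx, z.a.y + t * ey + d * ny });
    }
    return data;
}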

Naturally, the final implementation is going to be evaluated using real data. We orient our experiments on the experiments done by [22] and use equivalent data sets.

To evaluate the artificial generation of the data we present two data sets in 7.1. The first is non-isotropic Gaussian noise. The ratio of the sigma in the x direction to the sigma in the y direction corresponds to the ratio of the bounding box rectangle. The second data set has been produced according to a uniform distribution in the x direction and a Gaussian in the y direction. We demonstrate the corresponding distributions in the central and right figures.

Since our tool does not provide plotting features, the data sets have been exported to MATLAB code. We see that we achieve the desired distributions.


Fig. 7.1: Left: artificially generated noise. Centre: distribution along the y-axis. Right: distribution along the x-axis.

7.2 Nearest Neighbour

The method relies on a solid nearest neighbour computation. The smoothed data density computation and the computation of the ten nearest neighbours of assumed sample points used in splitting rely on this procedure.

In the work of [22] a kd-tree algorithm has been used to perform the computation. We took a pedestrian approach, leaving the implementation of kd-trees for future work. Our procedure computes for each point the distances to each other point in the set. We obtain O(N²) complexity. The evaluation is carried out by visualization.

The procedure has to take a point from the set, or an arbitrary point, as input and return a sorted list of the k nearest neighbours. Since only three neighbours are needed, we proceed with k = 4, just to be sure. The result is shown in figure 7.2. The points denoted by a and b are not members of the data set.
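A sketch of such a brute-force query, reusing the Point type; the kd-tree variant used in [22] is not reproduced here.

#include <algorithm>
#include <cstddef>
#include <vector>

// Brute-force k-nearest-neighbour query.  Every distance is computed explicitly,
// hence O(N) per query and O(N^2) when run for all points of the set.
std::vector<std::size_t> kNearestNeighbours(const Point& q,
                                            const std::vector<Point>& X,
                                            std::size_t k) {
    std::vector<std::size_t> idx(X.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    auto dist2 = [&](std::size_t i) {
        double dx = X[i].x - q.x, dy = X[i].y - q.y;
        return dx * dx + dy * dy;
    };
    k = std::min(k, X.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return dist2(a) < dist2(b); });
    idx.resize(k);                  // sorted indices of the k nearest points
    return idx;
}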

Fig. 7.2: Visualization of the nearest neighbour computation. The points denoted by "a" and "b" are not contained in the data set. The rest are real data points.


Fig. 7.3: The EM demo with clearly parted clusters

Fig. 7.4: Second EM demo with overlapping clusters. Top: global optimal solutions. Bottom: globally not optimal solutions.

7.3 EM

The method explained and discussed in our framework is based on the structure of the "Expectation Maximization" (EM) algorithm. Before we delve into "Line Fitting" (LF) with the extended EM we need to understand the EM itself.

Our implementation of the classical EM is evaluated here. For this purpose we constructed two data sets. The first data set contains three clearly parted Gaussian point clouds 7.3. The three Gaussian point clouds of the second data set overlap 7.4.

We present in figure 7.3 the parted data set in the convergence to the global optimum. The top figure of 7.4 shows the fit on the second data set in the convergence to the global optimum. The bottom figure of 7.4 shows the unfortunate results if the algorithm converges towards a local optimum only. As can be seen, the fitting has been performed with different numbers of model components.

We want to investigate the behaviour of the method depending on the number of model components and the data set. We start the algorithm 20 times with the right number of model components and 10 times with the wrong number of model components. The results are demonstrated in the set of figures 7.5. The figures show the dependence of the likelihood value on the number of iterations. The top three


Fig. 7.5: Convergence of the EM, depending on the number of components and the data set. Top: parted data set. Bottom: overlapped data set. Left: 2, centre: 3, right: 4 components.

Fig. 7.6: EM L, µ , Σ Variation

figures show the development of the likelihood value for the fit on the parted data and the bottom three for the overlapped data. The left column is the fit with two model components, the centre with three, and the fit with four model components is in the right column.

As expected, we achieve the global optimum more often for parted data than for overlapped data. The more overlap we have, the smaller the probability of achieving the global solution. Another obvious observation is: the more model components we have, the higher the likelihood value at convergence.

We might be tempted to assume that the higher the likelihood value is at the first iteration, the fewer iterations are needed to achieve convergence. The fact is, we cannot make this assumption with the given information. Our model components are Gaussians, depending on mean and deviation. In the procedure the parameters are estimated. The speed of convergence may differ for each parameter. Thus, one well adjusted parameter and another badly adjusted one might result in the same likelihood but need a different number of iterations to find an optimum. We would need investigations for the parameters separately.


It is not the subject of this project to analyse EM. From the standard EM literature we know that the convergence towards the global optimum and the number of iterations depend on the initial state. The better the initial fit is, the more probable is the global solution and the fewer iterations are needed. As we can see in the figures 7.5 there is a tendency towards this assumption. But what is the "better initial state"?

There are examples in the figures which show that a lower initial likelihood may lead faster towards the global solution and that states with higher initial likelihood may not lead to a global solution at all. We show some examples to illustrate the dependence.

We work with the parted data only. The initialisation is done manually with chosen parameters; the covariance matrices are computed from given eigenvalues and eigenvectors according to 3.3.1. First, let us construct two initial states with nearly equal likelihood values. In the leftmost figures of 7.6 the initial model components of the top and the bottom states are different. Even though we have equal initial likelihood values for both states, the top converges after three iterations and the bottom one after fifteen. Obviously the speed of convergence depends on more than the likelihood value.

The second example shows that the speed of convergence strongly depends on the placement of the components, i.e. on their means. We take three model components which differ in their means only. If we first place the components over the point clouds, the global solution is found after three iterations. If we place the same components in between, the algorithm converges after seventeen iterations.

If the initial model components are placed fortunately, the algorithm does not react sensitively to deviations in the orientation or size of the model components. As we can see in the third example, the extremely different components converge correctly in both cases after three iterations, if the means are equally adjusted.

The variation of the size and orientation is reflected in the covariance matrix of the Gaussian mixtures. The fourth example shows how equally placed components of different size destroy the convergence towards the global solution. The top initial state correctly converges after eighteen iterations. The bottom initialisation does not lead the algorithm to a global solution at all.

Of course we would need an analytical derivation to establish the dependence rigorously. But the intention of this work is only to evaluate the implementation.

7.4 Intuitive SMEM

Having implemented the EM method, we want to extend the algorithm by Split and Merge steps. Let's consider how such steps might intuitively look.

First, the convergence must be preserved. Thus, the extension must increase the log likelihood. Second, we have to find a method to make a suggestion of how the new components might be parameterized. Third, we need to perform EM after each extended step. Fourth, our model components are no longer Gaussian mixtures, since line segments have to be fitted.

Thus, we need at least three "jokers". Having the machinery for fitting Gaussians we can determine the mean and the spread of the point clouds. The first joker to take is the assumption that the same parameters can be used to determine the mean and the normal vector of the line to be fitted. Starting from that point, we do not need a new derivation of the maximization step. We only need to consider how lines of infinite length can be cut into segments. We go the easiest way and postulate that there is a statistical value, some constant factor depending on the eigenvalues. Trying to determine this constant empirically, we found some value slightly less than 4 and took this constant to be √12.

Now we are able to propose, for two line segments, a new merged line segment by performing a least squares fit (LSF) on the previously weighted points. We can perform the LSF on the whole data set, because


the points are weighted by their classification. Thus, only the points previously classified as belonging to the two components influence the computation.

Making a suggestion for splitting is more difficult. We need to determine which parts of the segment are holes. It is time to take the second joker. We divide the segment into parts and count for each such part the points in some environment. If the number of points for the part is too small, then it is a hole. The joker is the assumption of the existence of such an environment, the sub segment length and the minimal number of points.

At this point we are able to make a suggestion for splitting and merging. How can we find the pairs of segments for a merge or the segments for a split? We go the easiest way again and just try to split all segments and try all pairs for merging. Our feeling says we do not need to compute the likelihood after each split or merge for all model components, since the likelihood value for the unchanged rest must stay unchanged. Thus, we take the third joker and perform the likelihood computation locally. To do so we need to compute the new weights for the concerned points. Thus, the EM steps must be performed locally too. If the resulting likelihood value is better than the previous value, then the new model components can be accepted.

As we have seen in the derivation and discussion of the line fitting method of [22], most of the taken jokers are justified. Parameters like the sub segment length and the radius of the environment of the nearest points belonging to the segment are, according to the work of [22], system parameters depending on the data capturing device. The local EM is justified by [24]. The existence of the minimal number of points supporting the sub segments is also justified by the assumption of system parameters of the apparatus.

In the following we are going to evaluate the intuitive extension.

7.4.1 Intuitive Split

In this step the segments are divided into sub segments of length r and the supports S^i_j of the sub segments are computed. r is the device-dependent parameter. The support S^i_j is the set of sample points which are located within the radius environment of the sub segment i of the component j. For this purpose a square with width and height r is computed. Squares containing a smaller number of samples than a previously computed threshold C do not have a big enough support and are removed. The remaining neighbouring sub segments are merged, producing new segments.

Let z_j ∈ Z be the segment to be split. Dividing the segment into sub segments results in a set of sub segments z^i_j ∈ Z_j for all i ∈ {1,…,|Z_j|}. The number of sub segments is computed by |Z_j| = ceil(l_j / r), where l_j is the length of the segment z_j.

Since the model components are first computed by fitting the Gaussian mixtures, the orientation of the line segments is given by the unit eigenvectors of the covariance matrix of the corresponding Gaussian. The unit eigenvector e^0_j corresponds to the largest eigenvalue of the matrix and the unit eigenvector e^1_j corresponds to the smallest.

Every sub segment radius environment is a square explicitly defined by r, the orientation e^0_j, e^1_j and the mean μ^i_j. The mean of the sub segment i of the component z_j is computed by:

\vec{\mu}^i_j = \vec{\mu}_j - \vec{e}^0_j \left(\frac{l_j}{2}\right) + \vec{e}^0_j\, r \left(\frac{2i-1}{2}\right)

where μ_j is the mean of the model component z_j.

The support S^i_j of the sub segment z^i_j is then computed by:

S^i_j = \left\{\vec{x} \,\middle|\, \vec{x} \in X \wedge \left|\vec{x}^{\,t} \cdot \vec{e}^1_j\right| < \frac{r}{2} \wedge \left|(\vec{x}-\vec{\mu}^i_j)^t \cdot \vec{e}^0_j\right| < \frac{r}{2}\right\}


Fig. 7.7: Intuitive Split Demo.

The threshold C is the mean over all supports of all sub segments of all segments:

C = \frac{\sum_{j=1}^{M} \left(\sum_{i=1}^{|Z_j|} |S^i_j|\right)}{\sum_{j=1}^{M} |Z_j|}

The figures 7.7 show splitting in action. The leftmost picture is the state after fitting only one segment, which is obviously not enough. The computation of the support sets is demonstrated in the second picture. Two pairs of neighbouring, well supported sub segments have been found. Thus the suggestion: the segment has to be divided into two segments placed at the means of the segments which result if the neighbouring sub segments are merged.

There is one problem: how do we compute the length and the orientation of the resulting segments? The solution is sparse EM. As assumed before, we need to determine if a split is advantageous. To do so, we perform a local EM initialised with two equal segments placed at the aforesaid means. The initialisation of the sparse EM can even be done by copying the original component and displacing the two copies accordingly. The outcome of the sparse EM is accepted if the resulting log likelihood value is higher. Indeed, the sparse EM fitted segments are advantageous, as shown in the third picture.

If we performed the classic EM from now on, the method would result in the optimum as shown in the rightmost picture of 7.7. But it does not. We will investigate the reason in the next section.

7.4.2 Intuitive Merge

The merge step is more intuitive than the split was. The suggestion is done by least squares fitting. The merge of collinear segments is highly probable. This is due to the fact of overlapping.

If we evaluate the merge on the data set presented in the left figure of 7.8 with different initial states, we see that the more overlap we have, the higher the likelihood value.

We always initialize with two collinear segments. We displace the components in such a way that they move towards each other until they overlap completely. We move them eight times and compute the log likelihood. We see the development in the right figure of 7.8.

The development converges towards the likelihood value of a fit with one segment only. Thus, the merge step is always going to be performed if the segments are collinear.

The merge of non-collinear segments on a different data set is straightforward. The higher the log likelihood value of the fit with a suggested segment, the higher the probability of a merge.



Fig. 7.8: Intuitive Merge Demo.

Fig. 7.9: Intuitive SMEM Demo.

7.4.3 Conclusion

The implementation of the intuitive SMEM raises a lot of questions. First, if we do not weight the samples by a Gaussian mixture, what is our new model? To answer this question we, analogously to the assumption of [22], build a segment model to fit noise corrupted lines. This component modelling depends on the placement or mean vector, the orientation, the segment length, and the assumption about the noise corruption reflected in the standard deviation of the Gaussian.

In our intuitive SMEM we tried to adjust that sigma value automatically, making it dependent on the smallest eigenvalue of the fit of the Gaussian mixture which we used to compute the line parameters. The risk and error is: it cannot be variable. We adequately call that parameter sigma. The algorithm will destroy the fit by enlarging sigma and make the following split step impossible, since too many erroneously classified samples have to be considered. The proposed segments would fit inadequately.

Figure 7.9 demonstrates the difficulties. The leftmost picture is the state of the segments after the classical EM, which has to precede. Depending on the initial state the sigma value can vary. Too large a sigma would lead to overgrown proposed segments as presented in the third picture, though it is not disastrous in this example.

Too small a sigma would not lead to a split at all and the two segments would be proposed for merging. The following step results in the completely erroneous state shown in the fourth picture. The optimal fit with the actual implementation could be achieved after 5-7 tries and is presented in the second picture.

Another difficulty in the implementation is handling too small values. Depending on the initialization, some weights on the samples cannot be computed because of too small values. Division by zero or the log of zero is the consequence.

The segment length parameter is not appropriate, as can be seen in the "optimal" fit in the second picture of 7.9. The segment length has to be trimmed by the sample points.

The merge proposal is made inadequately. No merge should be tried if the segments are not collinear,


or may be split afterwards, or lie too far from each other.

The threshold C should not depend on the number of supporting points. Consider the merge example shown in the previous section. The optimal fit with one component may result in splitting if one sub segment is supported by one point less than the others, even if the support is of significant size. A new threshold modelling must be considered.

These questions make our method not intuitive any more. We abandon this development and begin with the discussion of the Segment Fitting Algorithm proposed by [22].

7.5 EMSFSM

The method discussed in our project is based on the "Expectation Maximization" algorithm extended to "Segment Fitting" with "Split and Merge" (EMSFSM). The implementation of the EM step is straightforward. Since we already evaluated this step we won't repeat it here.

The initialisation is done in the same way in all experiments. We start by computing the bounding box and initialise the algorithm with its diagonals. The computation of the undeveloped regions ensures that no new segments cut the original segments. The new segments are at most twice the radius long. Settings which imply very low radius values result in an initialisation with too many segments. The procedure to fill the undeveloped regions may propose overlapping segments. But since the method is well balanced, the split and merge steps filter out the unsupported segments and connect the collinear ones. The first iteration takes the longest time.

The initialisation step has been performed more than 200 times with different parameter settings on different data sets. The speed of convergence and the quality of the fit can be varied by the system parameters, which directly manipulate the initial state, since the finding of the undeveloped regions depends on the radius parameter.

Thus, since the initialisation and the EM step are very straightforward in the implementation, we address ourselves to the Split and Merge steps. The ability of the method to find the optimal number of model components depends on the performance of these two steps. And the performance of split and merge depends on the parameter settings. Thus, we evaluate the method by investigating its sensitivity to parameter variation. And now we want to welcome you to the parameter game.

The method described in our work has four system parameters. Naturally they are adjusted to the device which captures the data, and only once. If no information about the device settings is given, the parameters have to be adjusted manually.

We divide the parameters into two classes according to the character of their influence. We begin with the demonstration that the parameters indeed directly influence the process of adjusting the "correct" number of model components. The terms "correct" or "optimal" lose their sense since they are bound to the system settings.

In the next section you will find the experimental results of the parameter variation.

7.5.1 Sub Segment Length & Minimum Density

The "Sub Segment Length" parameter is the prime parameter to control splitting. We sample the continuous sdd density function at discrete sample points along the line segment. The more sample points we have, the more probable it is that we detect a gap. In this section we want to demonstrate the correct behaviour of the algorithm.

To abstract the example and simplify the conditions we take a distribution of points which are not corrupted by noise. Since we want to investigate the splitting behaviour we sample two line segments


Fig. 7.10: Split Demo. a) The data set: two clearly parted clusters. b) Initialization: two line segments of full length. c) Two segments after the first iteration with small "Sub Segment Length". d) Only one segment after the first iteration with large "Sub Segment Length".

Fig. 7.11: Demonstration of the measurement environment drawn by the radius around a line segment.

with the sampling interval which is less then the distance between the segments. We obtain the data setdemonstrated in the figure 7.10 a).

The algorithm initialisation is done in the usual way. Two line segments connecting diagonally theangles of the bounding box. In this case we obtain two equal segments of the length of the bounding boxas it is shown in b) of figure 7.10. The circle shows the distance to the third neighbour and is irrelevantfor this example.

We proceed with one iteration only to achieve convergence. We choose a small and a to large settingof the parameter. The parameter length can be seen in the c) and d) of the figure as ticked lines underthe segment representations. The length of one interval is the sub segment length. In c) we see that twomarks land in the gap. The gap is detectable. In the example d) no sampling is done in the gap. The twoline segments are merged.

The meaning of the “minimum density” parameter is straightforward: it directly controls splitting. It determines whether a sample point is considered relevant or is regarded as noise. The lower the value of the parameter, the lower the probability that a sample point is labelled as noise, and the more splitting we get.
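To make the interplay of the two parameters concrete, the following Python sketch illustrates how such a gap test could look. The names sdd, sub_segment_length and min_density are our own illustrative names; the actual split step of the method follows the derivation in the EMSFSM chapter, so this is only a minimal model of the behaviour described above.

```python
import numpy as np

def split_positions(p0, p1, sdd, sub_segment_length, min_density):
    """Sample the nonparametric density sdd at points spaced roughly
    sub_segment_length apart along the segment p0-p1 and return the
    positions t in (0, 1) where the density drops below min_density,
    i.e. where a gap would be detected."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    length = np.linalg.norm(p1 - p0)
    n = max(int(length / sub_segment_length), 1)
    ts = (np.arange(n) + 0.5) / n                # centres of the sub segments
    samples = p0 + ts[:, None] * (p1 - p0)       # sample points on the segment
    densities = np.array([sdd(s) for s in samples])
    return ts[densities < min_density]           # candidate split positions

# Toy density: nearly zero in a gap around the middle of the segment.
sdd = lambda x: 1e-8 if 0.45 < x[0] < 0.55 else 1e-3
print(split_positions((0.0, 0.0), (1.0, 0.0), sdd, 0.04, 1e-6))  # gap detected
print(split_positions((0.0, 0.0), (1.0, 0.0), sdd, 0.40, 1e-6))  # gap missed
```

With a coarse sub segment length the gap falls between two sample points and is missed, exactly as in figure 7.10 d).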

7.5.2 Radius & Sigma

“Radius” and “Sigma” are clearly the parameters that control merging. Sigma parameterises the Gaussian-like distribution, i.e. the distribution which gives the evidence about the probability of a point occurring given this component. The larger sigma is, the smoother the weighting becomes, the more distant points can be reached, the more the segments overlap, and the more probable merging becomes.

The sigma parameter variation models the differences in the data density. The denser the data points occur, the higher are the sdd values for more distant points. To minimise the error, i.e. the difference between the component density and the sdd, we can vary the sigma parameter.

Radius parameterises the Gaussian-like function in an indirect way. It determines the environment in which the measurement or weighting is relevant: it draws an environment around the line segment of the form demonstrated in figure 7.11. Only the points inside this environment are considered during merging. The computation of the merging “advantage” collects all points inside the radius environment into a set and measures the KLD value over these points. Clearly, the larger the radius, the greater the environment, the more distant points can be considered to belong to this component, and the greater the error.
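As an illustration of how radius and sigma act on a single data point, the following sketch computes a truncated Gaussian-like weight around a segment. The exact density and normalisation of the method are given in the derivation chapters; the function below is only an assumed, simplified form that reproduces the qualitative behaviour: Gaussian falloff controlled by sigma, hard cut-off at the radius environment.

```python
import numpy as np

def segment_weight(x, a, b, sigma, radius):
    """Assumed Gaussian-like weight of a point x under a segment (a, b):
    Gaussian in the point-to-segment distance, cut off outside the
    radius environment sketched in figure 7.11 (unnormalised)."""
    x, a, b = (np.asarray(v, float) for v in (x, a, b))
    ab = b - a
    t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    d = np.linalg.norm(x - (a + t * ab))   # distance to the closest segment point
    if d > radius:                         # outside the measurement environment
        return 0.0
    return float(np.exp(-0.5 * (d / sigma) ** 2))

# A point close to the segment is weighted highly, a distant one is cut off.
print(segment_weight((0.5, 0.1), (0, 0), (1, 0), sigma=0.24, radius=3.6))
print(segment_weight((0.5, 5.0), (0, 0), (1, 0), sigma=0.24, radius=3.6))
```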

Why does the radius parameter not depend on the sigma parameter? First we want to analyse their


Fig. 7.12: Example of badly adjusted merge parameters.


Fig. 7.13: Left framebox: “A” merge example. Right framebox: overlap when approximating the “N” with an overgrown sigma.

influence separately. One might object that the density values at data points more distant than 3σ are nearly zero. The point lies in the merge condition proposed by the authors of the method:

$$\mathrm{KLD}_m = \frac{N_1 \cdot \mathrm{KLD}_1}{N_1 + N_2} + \frac{N_2 \cdot \mathrm{KLD}_2}{N_1 + N_2} < \mathrm{KLD}_o$$

where in the case of overlap $N_1 + N_2 > N$, with $N$ data points and respective regions of influence of sizes $N_1$ and $N_2$ (compare equations 5.6.8 and 5.6.9). The sigma parameter controls the number of data points entering the KLD, but the radius parameter controls the sizes of the regions of influence and thus $N_1$ and $N_2$. This means that even if most weights on data points are zero and thus not relevant for the KLD computation, the merge may still be controlled by the sizes $N_1$ and $N_2$.
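A minimal sketch of evaluating this condition is given below. Here kld1, kld2 and kld_o stand for KLD values computed according to equations 5.6.8 and 5.6.9, which are not repeated here, so the numbers in the example are purely illustrative.

```python
def merge_advantageous(kld1, n1, kld2, n2, kld_o):
    """Evaluate the quoted merge condition: the size-weighted combination
    of the two separate KLD values is compared with KLD_o of the proposed
    merged segment. n1 and n2 are the sizes of the regions of influence,
    which are controlled by the radius parameter; with overlap n1 + n2 > N."""
    kld_m = (n1 * kld1 + n2 * kld2) / (n1 + n2)
    return kld_m < kld_o

# Purely illustrative numbers; per the discussion above, a larger radius
# tends to lower the composed value and so makes the merge check succeed.
print(merge_advantageous(kld1=0.8, n1=120, kld2=0.9, n2=110, kld_o=1.0))  # True
```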

The merge is only performed if no splitting of the resulting component is possible. If the split check is not performed, the algorithm can get into an infinite loop. This is the consequence of the inappropriate merge check: a merge check consistent with convergence could not generate a splittable component.

In one of our experiments we found an example of an inappropriate merge. Before we go any further with this example we have to emphasise that the parameters had not been adjusted properly; no inappropriate merging occurs if the parameters are appropriately adjusted. We call the data set “NATA”. Please do not be worried about the mirrored “N” letter; it does not influence the result and has no relevance.

As experts we see that 11 line segments are required to approximate the data. But if we take a closer look at the result (see figure 7.12), we notice that the cross of the last letter “A” is missing. How could that happen?

We investigated the problem, built the machinery to look at the individual steps of each iteration, and cut off the merge step. We need the demonstration of the “overlap” and the proposed line segment. The results are shown in figure 7.13. Thus we have two states: the state before and the state after the erroneous merging. The first image shows the approximation of the data set corresponding to the letter “A” with four segments at the third iteration (the “before” state). The second image is the result of the fourth iteration with



Fig. 7.14: Left and centre: the model estimations between which the method oscillates; the left one is associated with the smaller KLD value. Rightmost picture: development of the KLD value depending on the iteration number for the “NATA” data set approximation with badly adjusted parameters.

only two segments (the “after” state). The overlap of the chamfer segment and the cross is demonstrated in the third picture.

We see that we have a large overlap, denoted by a circle in the rightmost picture in the left framebox of 7.13. The two segments, the chamfer and the cross, influence each other a great deal. This is the first evidence of unadjusted parameters: the two segments generate a large error because of the overlap. You may profit from a coloured copy here. The rightmost picture in the left framebox shows the influence of each segment separately. The supporting points of the chamfer are coloured red and the points corresponding to the cross green; the points in the overlap are framed black. The proposed segment is red and lies parallel to the chamfer. The merged segment approximates the union of both sets of supporting points. The circle at the bottom demonstrates the range of the third nearest neighbour and has no relevance here.

In the merge step, after the overlap has been detected, two checks are required to determine whether merging is to be performed. The first check tries to split the proposed segment; if splitting is possible, no merging can be considered. In our case the unfortunate combination of unadjusted splitting parameters performs the split in such a way that only one segment is obtained: the proposed segment would just be shortened. In such a case our method fails to distinguish between shortening and splitting. The difference is apparent in our case: in truth the segment is split. If a point on the cross and a point on the right chamfer could be sampled in the splitting step, we would obtain two resulting segments! But the sub segment length is not adjusted properly either.

Thus, the splitting is not detected and the next merge check has to be performed. At this stage we have already failed to take the right action. The next merge check tests whether the merge is advantageous. Of course, if the overlap is huge, the error is overgrown, and the proposed segment is indeed advantageous with these parameters.

The following procedure step merges the two collinear segments of the right chamfer, and the subsequent line fitting places the two segments as shown in the second picture of 7.13. In the next iteration the chamfer is shortened in the aforesaid way, leaving undeveloped regions. Adding new segments completes the loop: a new merge is proposed and performed. The method oscillates without converging; the extended EM convergence is corrupted. In the first two pictures of 7.14 we present both model estimations between which the method oscillates. The cross of the “A” letter is not detected. Again, the parameters are very badly adjusted. The development of the KLD value with the oscillation is presented in the rightmost plot of figure 7.14.

The left figure of 7.15 shows the result when the parameters are well adjusted. The radius parameter was reduced to half its value, and the sub segment length parameter was enlarged to avoid splitting on the right chamfer.


Fig. 7.15: Left: “NATA” data set approximation with adjusted parameters. Right: approximation with triple the sigma value.

The sigma parameter plays its role in the merging too, but not only in the merging. We achieve the same result as presented in the left figure of 7.15 if we increase the sigma value slightly. Increasing sigma smoothes the weighting of the points and increases the range of influence of the model components. If we triple the value of sigma, the overlap of the segments overgrows. What is the consequence? The right figure of 7.15 demonstrates the result: the segments seem to be pressed together.

The result becomes apparent if we investigate the overlap. The increased range of the components means that more inappropriate points are weighted by the wrong segments. The last figure in 7.13 shows the overlap of two neighbouring segments of the “N” letter in the “NATA” data set. Almost the whole set of supporting points of the sloped segment is influenced by the upright one and vice versa. Since a model component fits its supporting points, it has to fit the inappropriate ones too. Thus, the segments are moved towards the inappropriate points. The result is that the line segments intersect, and the line fitting cuts the lines in such a way that a part of each protrudes over the intersection point.

7.6 Experiments

In this section we are going to analyse the performance of the method with respect to noise and parameter variation. We first investigate the sensitivity of the method to noise. The data set provided by Lateki & co. will serve as the basis for variation and adjustment of the parameter settings. Since no information about the capturing device is given, we start with the parameter settings proposed by Lateki & co. and proceed by varying one parameter at a time.

For debugging and visualisation of the method implementation we artificially developed some data sets. The data sets “Cross” and “NATA” have already been discussed above. We present the fitting results on these two and two others in the appendix without any further discussion.

We also tested the fitting quality on gradient images. The method always results in very good fits and converges after 4-5 iterations. The results on four data sets are also presented in the appendix and will not be discussed here.

7.6.1 Noisy Data

As demonstrated in [23], the method can handle noisy data. We want to show that in our reimplementation even strongly corrupted data can be fitted robustly. For this purpose we construct a similar example: we sample three line segments with 150 points and add 1000 salt-and-pepper noise points. The whole data set is demonstrated in 7.16.

The data set is artificial. We produced the sample points using the visualisation tool developed for this project. The quality of the sampling method is limited, but we can produce a sizeable data set.
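For readers who want to reproduce a comparable data set, the following sketch generates three noisy segments plus uniformly scattered outlier points (our reading of “salt and pepper” noise for 2D point data). It is not the generator of our visualisation tool; the segment coordinates, noise level and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segment(a, b, n, sigma):
    """Sample n points along segment (a, b) with Gaussian perpendicular noise."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    t = rng.uniform(0.0, 1.0, n)
    pts = a + t[:, None] * (b - a)
    normal = np.array([-(b - a)[1], (b - a)[0]])
    normal /= np.linalg.norm(normal)
    return pts + rng.normal(0.0, sigma, n)[:, None] * normal

# 150 points on three segments plus 1000 uniformly scattered outliers
segments = [((0, 0), (10, 0)), ((10, 0), (10, 8)), ((10, 8), (0, 8))]
data = np.vstack([sample_segment(a, b, 50, 0.1) for a, b in segments])
outliers = rng.uniform([-2, -2], [12, 10], size=(1000, 2))
data = np.vstack([data, outliers])
print(data.shape)  # (1150, 2)
```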

Since the points had not been captured by any device, the parameter settings are not given; we had to adjust the parameters manually. The best result out of 10 tries is presented. We achieve convergence


Fig. 7.16: Approximating a data set strongly corrupted by noise (150 data points, 1000 noise points). Left: initialisation. Centre: second iteration. Right: fifth iteration, convergence.

after the fifth iteration. The approximation with six line segments after the second iteration is demonstrated in the central figure of 7.16. The final fit with six components is shown in the right figure.

We did not achieve an approximation with three components. We may have made too few tries to adjust the parameters, or the data may have been produced in a way that deviates from the original assumption about data capturing. Nevertheless, we see that the result approximates the correct data without being influenced by the noise, so the purpose of the experiment is achieved.

7.6.2 Rome

In this section we want to evaluate the method using the original outdoor map composed of 23,265 scan points obtained during the Rescue Robot Camp in Rome, 2004. The data set has been provided by the authors of [22]. We call this data set “Rome”.

We started the fitting method 21 times with different parameter settings. Starting from the parameter settings (radius=3.6, sigma=0.24, sub segment length=1.2) proposed by Lateki & co., we wanted to test the sensitivity of the method. We did not vary the parameter “minimum density”, since the split control can be performed by the sub segment length. The following table shows the parameter variety.

r; σ; sl                  r; σ; sl            r; σ; sl
3.6; 0.12; 1.2
3.6; 0.24; 0.8/1.2/2.4    5.4; 0.24; 1.2      7.2; 0.24; 0.8/1.2/2.4
3.6; 0.48; 1.2            5.4; 0.48; 1.2      7.2; 0.48; 1.2
3.6; 0.72; 0.8/1.2/2.4    5.4; 0.72; 1.2      7.2; 0.72; 0.8/1.2/2.0
10; 0.24; 5               10; 0.30; 5         30; 2; 10

Figure 7.17 shows the development of the KLD values and the number of model components depending on the iteration number and the parameters. As we have already seen in the discussion of the merge parameters, the method loses its guaranteed property to converge. We therefore proceeded with 10 iterations only. In most cases “quasi” convergence could be achieved after five to six iterations. The results after the tenth iteration can be found in the appendix.

The left plots of 7.17 visualise the development under variation of the merge parameters radius and sigma; the parameter sub segment length is constant. The graphs are represented by solid, dashed and dotted lines with circular, square or no marks. The mark variations demonstrate the sigma variation, whereas the variation in line style demonstrates the variation of the radius.

In the top left plot we can notice some clustering formations: the equally marked graphs build classes of KLD values. The sigma parameter smoothes the component densities. The smoother the


Fig. 7.17: Development of the KLD values (top row) and the number of model components, “Lines” (bottom row), over iterations 1-10 for differently parameterised fits on the “Rome” data set. The legend entries give the settings as radius/sigma/sub segment length (e.g. 3.60/0.24/1.20). The left column varies radius and sigma at constant sub segment length; the right column additionally varies the sub segment length.

density, the smaller the difference in weights between two differently distant samples. As can be seen in the map, the data occur in high density. The consequence is that the smoothed data density gives a high weight to more distant points: the points close to the edge and the more distant points are both highly weighted. To model this density we smooth the component density with a larger sigma parameter.

As we can see, the graphs with greater radius within the same clusters tend to produce a greater error, but fit with fewer model components. The dotted graphs run above all others in their clusters in the KLD plot, but below all others in their clusters in the Lines plot. This is a very intuitive result: the greater the radius, the more distant samples can be measured as belonging to the component, and the greater the error. At the same time, the larger the radius, the smaller the composed KLD value of two segments proposed for merging, and the greater the probability of merging.

The clustering formations according to the sigma and radius parameters imply that we are not able to measure the “goodness” of the method results across fits with different parameter settings.

As we can see in the resulting fits in the appendix, the doubled radius parameter merges some parallel segments in the inner formations of the map. The fit with an enlarged sigma, on the other hand, results in inappropriate segment formations.

The right plots of 7.17 demonstrate the influence of the sub segment length parameter. We again see the graph clusters according to the sigma variation. What we do not see is a systematic dependence on the sub segment length values: there is no significant influence on the KLD value within this sub segment length variety, even if the value grows to triple the original value. We see the


Fig. 7.18: Rome Data Best Fit

difference in the Lines plot at the bottom right. Again the dotted graphs tend to run below the others in their cluster; the clusters are labelled by graph mark. And again we obtain an intuitive result: the larger the sub segment length parameter, the less probable is splitting, the more probable is a successful merge test, and the fewer model components will be used.

The pictures in the appendix are arranged in such a way that the sub segment length parameter varies from top to bottom and the radius parameter varies from left to right. For the variation of sigma see the following page.

If we examine the data set, we may notice some structures which could not be found by the fitting method. Let us investigate the reason. In the left figure 7.18 we have manually emphasised such regions on the map in the initialisation step by ellipses. As specialists we can see such point formations and recognise structure.

The reason why the method fails to fit these regions is the smoothed data density: the data density is too low in these regions, and the sdd computation labels the corresponding samples as noise. To fit the regions we would need to start the method with parameter settings which would indeed be appropriate for them, but which would be completely inappropriate for the rest of the data. The method is not able to fit differently captured data in the same run.

The visualisation tool developed for this project loads the points from a text file, so a lot of string processing has to be done. Loading the Rome data set takes 219,426 ms. The initialisation and sdd computation take 503,364 ms. Every iteration takes 200,000 ms on average. The method was run on an Intel Pentium M 1.5 GHz machine with 768 MB RAM.


Chapter 8

Future Work

In the reimplementation of the EMSFSM method we came across some problems. The nearest neighbour computation is crucial for this method and takes the most time. We used the straightforward algorithm of complexity O(N²). An implementation based on KD-trees would reduce the computation to O(N log N).
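A sketch of such a computation for the distance to the third nearest neighbour, which underlies the sdd estimation, is given below using SciPy's cKDTree; our reimplementation does not use this library, so this only illustrates the proposed speed-up.

```python
import numpy as np
from scipy.spatial import cKDTree

def third_nn_distances(points):
    """Distance from every point to its third nearest neighbour using a
    k-d tree (O(N log N) overall instead of the brute-force O(N^2))."""
    points = np.asarray(points, float)
    tree = cKDTree(points)
    # k=4 because the nearest "neighbour" of a point is the point itself
    dists, _ = tree.query(points, k=4)
    return dists[:, 3]

pts = np.random.default_rng(1).uniform(0, 10, size=(1000, 2))
print(third_nn_distances(pts).mean())
```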

In our future work we intend to extend the framework with the ability to fit graphs to the data. In the next sections we give a brief excursion into the matter and suggest some next steps. First we want to show the ability of the method itself to find a fit of intersecting segments. The suggestion of graph construction as a post-processing step after the segment fitting is our second proposition. A more statistical, and thus more mathematically grounded, proposition is made in the last section.

8.1 Extension to Graph Fitting

In the previous chapters we have seen how the method depends on the system parameters and how these parameters influence the result. In our future work we want to direct the attention to graph fitting. What do we mean by the term “graph”, and what is the difference between segment fitting and graph fitting?

A graph is a set of connected edges. The smallest graph is a segment. A segment in this context is a relation between two points only. If a relation is a set of point pairs, and each point pair stands for a segment, then this relation describes a set of segments. We consider only coherent graphs. The statistical model is then given by a set of coherent graphs.

The question which arises from this is how to compute the statistical component density if the model component is a graph. In our context we have studied the Gaussian-like distribution, which is the sigma-dependent density around a linear structure. We have seen that the sampling around two overlapping segments results in inadequate sample formations.

Let us consider the problem of overlapping again. For simplicity we take two segments and connect them to build a corner. We see the density approximation before the connection and afterwards in picture 8.1. The assumed sampling mechanism produces a data set with more data points along the



Fig. 8.1: Overlapping segments density

segment than collinearly in front of or behind them. The assumed Gaussian error of the displaced samples is denser along the segment, as can be seen in the left picture.

In our context we have explicitly defined the meaning of overlapping segments: depending on the sigma value and the radius, one sample may be assumed to belong to two or more model components. The correspondence is uncertain, and the consequence is that two different components influence the weighting; these components overlap on this sample point. If two segments are connected, the overlap is certain.

We see the consequence of overlapping in the central picture. The ranges of the Gaussian errors of two different samples coming from two different edges overlap. The probability of samples being displaced into this region is the highest. The result is a dense formation of samples inside the corner, and the nonparametric ground truth estimation “sdd” gets a peak. We demonstrate the density variation around the segments on a grey-valued image in the right picture.

The theory of overlapping led the developers of the algorithm to the possibility of merging. The modelled overlap deviates from the nonexistent peaks in the sdd, and the merge becomes advantageous. But what does it look like if we have two perpendicular segments? The modelled overlap fits the inadequately built sdd peak best: the overlapping becomes advantageous, and the two segments are moved together. We saw in the evaluation of the method that the variation of the sigma parameter influences the size of the overlap. The sigma parameter is thus responsible for the detection of segment connections. With this inadequately built sdd we are able to detect the vertices in the graphs.

We might be tempted to assume an additional fitting step: a step which connects two segments to a graph by moving or enlarging the segments. There are several points to consider. First, by varying sigma we are already able to bring segments together, but without any guarantee for that move. Second, to guarantee the connection, we cannot use the same metric as before for sample weighting: if the segments are not moved closer as the result of previous steps, then every additional interfering movement would result in a greater overlap and a greater additional error. Third, if we develop a new metric for weighting samples as belonging to a graph, in which the samples inside the corner are not weighted higher, we have to consider the appropriate modification of the sdd. The already disturbed convergence of the method would need to be reassured.

8.2 Graph Fitting as Post processing

As we have seen in the previous section, graph fitting may not be very different from line fitting. One of the simplest ways to fit graphs to the data with the existing line fitting method is to do so as post-


Fig. 8.2: Adding edges by nearest neighbour method

processing. The assumption is that the original scene consists of a number of linear structures which may be connected. In our experiments we saw a few data sets captured from such scenes. The “Bloxworld” example in fact describes an arrangement of connected segments: a block may be considered as a graph. The EMSFSM method already finds a fit as a set of intersecting segments. In the post-processing step the intersection of segments may be considered as the target.

With the nearest neighbour algorithm and some threshold dmin, the graph merge could be achieved by forming a new edge between a segment's endpoint and the nearest point on a distant segment. The endpoints of the segments are considered as existing vertices and the points on the segments as potential vertices. A sketch of this idea is given below.

The expected result of such edge enhancement is demonstrated in figure 8.2: the segments are combined into graphs by suggested edges which connect the vertices.
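The following sketch illustrates this post-processing idea. The function names and the threshold value are our own; in particular, duplicate edge proposals from both directions are not filtered here, which a real implementation would have to handle.

```python
import numpy as np

def closest_point_on_segment(p, a, b):
    """Orthogonal projection of p onto segment (a, b), clipped to the segment."""
    p, a, b = (np.asarray(v, float) for v in (p, a, b))
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return a + t * ab

def propose_edges(segments, d_min):
    """For every segment endpoint, propose a connecting edge to the nearest
    point on any other segment if that point is closer than d_min."""
    edges = []
    for i, (a, b) in enumerate(segments):
        for endpoint in (np.asarray(a, float), np.asarray(b, float)):
            candidates = [closest_point_on_segment(endpoint, c, d)
                          for j, (c, d) in enumerate(segments) if j != i]
            nearest = min(candidates, key=lambda q: np.linalg.norm(q - endpoint))
            if np.linalg.norm(nearest - endpoint) < d_min:
                edges.append((tuple(endpoint), tuple(nearest)))
    return edges

# Two almost touching segments: both free endpoints get a connecting edge.
segments = [((0, 0), (4, 0)), ((4.3, 0.2), (4.3, 5))]
print(propose_edges(segments, d_min=0.5))
```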

8.3 Statistical Edge Completion

In [35] a method for contour completion was proposed. In this framework the method starts with a set of contours. The contours have been captured from a real scene and are assumed to be closed. To complete the contour extraction, the method uses a Constrained Delaunay Triangulation (CDT), which preserves existing curves and additionally proposes new linear components to close the interruptions.

The propositions are made by a “probabilistic model of continuity and junction frequency on the CDT”. Two models are introduced. The “local continuity model” uses logistic regression to fit a linear model to local features. The local features describe the pixels along the edge and whether it is an actual or a suggested edge. The logistic model gives an estimate of the posterior probability that this edge and the neighbouring edge are both real boundaries.

The “global continuity model” uses conditional random fields to build a joint probabilistic model over all edges.

The statistical character of this method and its ability to complete contours may be used to develop a method of graph fitting comprising line fitting and graph completion steps.


Fig. 8.3: Adding edges by Constrained Delaunay Triangulation.

The last step of such a fitting algorithm may be assumed to be as demonstrated in figure 8.3. In the picture we see the result of a constrained Delaunay triangulation with existing and suggested vertices. A probabilistic framework starting from such a state would be able to complete the graphs with new edges, as demonstrated before in the fitting attempt with the nearest neighbour metric (see 8.2).


Chapter 9

Visualisation

One part of this project was to reimplement the method and analyse its properties and behaviour. Before we started the reengineering process, we began with some intuitive developments. The reengineering would require the analysis of the existing implementation; we went a bit further and first tried to develop a method relying on our own intuition only.

The lack of data sets brought us to the need for a data simulation tool. Thus, the first task was to develop software which randomly generates an appropriate data set. As described in the chapter “Evaluation” 7.1, the data sets produced in this way deviate from the assumed data capturing. Therefore we call it artificial data set generation or data simulation.

The EMSFSM method is based on the framework of the “Expectation Maximization” algorithm. Thus, the implementation and analysis of EM is the first step towards the reengineering.

Starting with the EM theory, we moved towards the Split and Merge extension intuitively and developed an extended EM framework. Though erroneous, this development opened the new questions described in the chapter “Evaluation” 7.4. The visualisation of every single step became crucial for understanding and debugging.

We expected the answers to the open questions through the reimplementation of the MATLAB code provided by the authors of [22]. We call the reimplementation of the EMSFSM method just “Line Fitting” (LF) in our context, to separate this functionality from our own implementation.

The analysis of the LF implementation includes the understanding of the different steps of each iteration. We again visualised each single step of the iteration to proceed with the analysis. The evaluation of the method is based on parameter variation; thus, we additionally built in the parameter controls.

In the following section we introduce the visualisation tool according to the development described above.

9.1 The User Interface

We begin with the description of the main panels of the user interface. The frame is divided into the “control panel”, the “canvas panel”, the “parameters panel” and the “log panel”. The arrangement in the frame is demonstrated in picture 9.1.


Fig. 9.1: Visualisation tool panel description

The canvas panel paints the output onto the display. Since we have 2D data, the output is normalised to fit a 2D visualisation screen. Our application consists of three different functionalities; accordingly, we have three canvas tabs which control the canvas navigation.

There are two additional output possibilities. The method estimates a probabilistic model containing component parameters; the component parameter variation can be read on the parameters panel. If “help” is activated, additional information about the density value of a model component at a particular sample or the point-segment distance can be read on the “densities panel”. The “log panel” takes care of exception handling and string output.

The user controls the system through the functionality of the “control panel”. This panel is divided into three sub panels with controls for data simulation, the EM/SMEM method, and the LF method.

The system status information and the current mouse position can be read in the status bar. Conveniently for software development, the menu bar allows us to control the loading and saving of the data sets and the log content. The data sets are saved in two ways. First, a text file is written containing the kind of the data set, the bounding box parameters, the rotation angle, the size of the data set and the point coordinates. Second, an XML file is written containing the equivalent information and additionally the state of the model components. This XML file corresponds to the “IPE drawing”1 tool standard. In this way it is possible to reconstruct the last fit with a picture editor.

Additionally there is a possibility to display the density distribution in a histogram window, but we

1http://tclab.kaist.ac.kr/ipe/


give an account of it later.

9.2 Simulating Data

In this section we turn our attention to the data simulation control. This feature was developed first of all and is still very buggy. The intention is simple: we want to create a data set. The procedure is hard. How are the data points distributed? How are the clusters to be modelled? We can see the cut-out of the control panel and a part of the canvas in the picture below. It shows an arrangement of 4×4 controls.

The fourth column is slightly broader than the first three. This is due to the fact that the spin controls take more space than the buttons; thus, the spin controls are placed in the rightmost column.

The easiest way to produce a data set is to do it manually. Pressing the top left button “p” enables the user to generate hand-made data points by left-clicking with the mouse inside the canvas.

The two following buttons “e” and “r” are responsible for drawing an ellipse or a rectangle. The spin in the top row controls the rotation angle of the current figure. The current figure is not automatically accepted; if a new figure is drawn, the unaccepted figures are lost. To accept or “save” the figure to a data set, press the third button in the second row, “s”. To reset the data set and clear the figure path containing all accepted figures, press the first button in the second row, “c”. To reset the rotation angle press “rr”.

The help checkbox controls the enhanced visualisation of the method. To change a figure in the path the user has to “select” it on the canvas; press the button “se” to do so.

The random generation of a data set is the target of data simulation. The random data are only generated inside an ellipse or a rectangle and are bound by its size; no random points occur outside the figure. The size of the figure and the ratio of its edge lengths determine the parameters for the random number generation. In the case of Gaussian random numbers, the standard deviation parameter is determined by the edge length. Since the ellipse is drawn inside a rectangle, we do not need to build any further exceptions.

To generate an adequate data set the user has to keep the smaller edge length constant. The bottom right button “ra” controls the data generation. The number of points inside the figure is controlled by the spin in the third row. To generate a new, additional but equivalently distributed data point press “rp”. The data distribution is selected in the drop-down menu in the bottommost row. There are four different distributions: the uniform distribution “LAPLACE”, the Gaussian distribution “GAUSSIAN”, and the Gaussian-like distribution “LAPGAUSS”. The last distribution, “HORIZONTAL”, generates a perfectly arranged data set of equidistant data points. The “Gaussian like” distribution approximates the assumed data distribution considered in our project. For further details consult the chapter “Evaluation” 7.1.

As we can see in figure 9.2, the “Points” tab has to be selected to proceed. The cut-out shows an example of a rectangle rotated 30 degrees clockwise containing 10 uniformly distributed points. By double-clicking into the selected figure the 4 nearest data points are computed, as shown in the picture.


Fig. 9.2: EM Frame Visualisation

9.3 EM & Intuitive SMEM

We approach the extended EM by implementing and analysing the EM algorithm and then adding intuitive extensions. If a data set is given and converted into the correct format, it can be loaded by our tool. Otherwise a new data set can be generated using the previously described machinery. Once the data set is satisfactory, it can and must be transferred onto the EM or LF canvas to be initialised for the algorithm or to be saved. The data set transfer is launched by pressing the “∼>” button in the appropriate control group.

Transferring the data set into the algorithm environment is not the initialisation. The EM method requires a fixed number of model components at the initialisation. The number can be changed in the spin of the first row of the EM control group.

The initialisation of EM or Intuitive SMEM (ISMEM) can be done in two ways: randomly and manually. The random initialisation is launched by pressing the button “i”. The manual initialisation is done by enabling the ellipse button “e” of the simulated data group and drawing the appropriate figure. The placement, rotation and size give the evidence for the eigenvalues and eigenvectors of the components; the covariance matrix and mean computation then become straightforward. To manually control a particular component in the initialisation, change the value of the right spin in the fourth row of the EM control group. If the number of manually given components is less than the determined number, the rest are initialised randomly.

The algorithm proceeds in a number of iterations. There is no stop at convergence. It is possible to start an automatic run of a previously defined number of iterations by pressing the button “go”. To set up the number of iterations, the spin value to the right has to be changed; in figure 9.2 it is set to 20.


Fig. 9.3: Visualisation of the Intuitive Split.

To stop the automatic procedure press “st”. The manual control of the iterations is done by the buttons of the fifth row: the iteration “>”, the split “/>” and the merge “+>”. If the checkbox “smem” is disabled, the EM method without splitting and merging proceeds by fitting Gaussians onto the data set. The Gaussians are denoted by ellipses.

The parameters of the model components can be taken from the “theta” parameters panel. Choose the component by changing the value in the topmost spin. The first table shows the covariance matrix of the Gaussian, the second the corresponding eigenvalues, and the third contains the coordinates of the mean. The bottommost table of the “theta” group shows the determinant of the covariance matrix and the prior of the component.

As we can see, the checkbox “help” of the top control group is enabled. This gives the user additional information which can be read on the canvas and the parameters panel. The additional “density” group appears, containing the conditional densities, the sdd value and the weight at the particular sample point. The sample can be chosen by changing the value of the first spin in the fourth row of the EM control group. The bottommost table is not used by the EM procedure; it contains the smallest and the actual point-segment distances in the ISMEM or LF procedure.

The canvas in picture 9.2 demonstrates the result of the EM. The model components carry their identity numbers to the left of them. The log likelihood value and the iteration number can be found in the top left corner. The currently chosen sample point and the currently chosen component are denoted by fine green lines; the current sample and the current model component are connected by a fine line to show the dependency for the conditional probability in the density panel.

We tried to interactively manipulate the sdd computation by changing its bandwidth parameter manually; the slider in the third row served as the control. No significant differences in the SMEM procedures could be seen, so the slider control became rudimentary.

The ISMEM procedure is activated if the checkbox “smem” is enabled. If additionally the checkbox “help” is enabled, the EM step of the ISMEM, the splitting and the merging steps are performed separately. The EM step is launched by pressing the iteration button “>”. The splitting is divided into two sub steps: the computation of the supporting point sets, and the splitting with a sparse EM iteration. By pressing the split button “/>” each of the steps is shown (see figure 9.3).

In the left figure of 9.3 the segment is cut into sub segments of the length of the “radius” parameter, which can be changed by the spin value in the third row of the EM control group. The sub segment environment is denoted by squares; each square denotes the support of its sub segment. You can


Fig. 9.4: Visualisation of the Intuitive Merge.

read the number of contained samples in the square. The unsupported sub segments are coloured grey and the remaining sub segments green. The neighbouring sub segments are merged and proposed for the further sparse EM computation. The result of the sparse EM is shown in the right figure of 9.3; the resulting log likelihood is written in the top right corner. The third press of the split button makes the split decision.

The segments can be manually chosen for splitting by changing the right spin value in the fourth row of the EM control group.

A manual proposition for merging can be made by simultaneously changing both spin values. The first press of the merge button “+>” shows the merged segment proposed for the further sparse EM computation (see the left figure of 9.4). The second press shows the result of the sparse EM computation; the resulting log likelihood is written in the top right corner (see the right figure of 9.4).

In our attempts to understand the influence of the sdd on the method behaviour we developed the distribution histogram showing the sdd density, the model density, the $p(\vec{x}|z)$ conditional, and the model density weighted by the sdd. The histogram of the splitting example from above at convergence is demonstrated in figure 9.5.

Unfortunately no obvious changes in the behaviour could be seen in the histogram; thus, the histogram tool became quite redundant. The tool allows the user to interact with the data set: clicking in the histogram selects a particular sample point, and this selection is then coloured green in all histograms.

The $p(\vec{x}|z)$ conditional demonstrated in the third histogram obviously depends on the model component selection and shows its weighting only. Varying the values of the spins in the fourth row of the EM control group also influences the histogram appearance.


Fig. 9.5: Histogram for the intuitive split example. The topmost is the sdd distribution, the second the model density, the third the $p(\vec{x}|z)$ conditional, and the bottommost the model density weighted by the sdd.

9.4 EMSFSM

Motivated by the intuitive development of the SMEM method, we reimplemented the EMSFSM, again visualising each sub step of the iteration. Recall that in our framework we call the EMSFSM just line fitting or “LF”. The LF control group also consists of the data transfer button “∼>” and the initialisation button “i”. The data loading and saving happen in the same way as in the EM/SMEM control.

The initialisation is always done with the two diagonals of the bounding box corresponding to the data set. To set up the method with specific system parameters, the user manipulates the values of the spins and finally launches the initialisation.

The LF control group is again an arrangement of four columns and five rows of buttons and spins. The spin in the first row controls the “radius” value. Please note that “radius” in this framework differs slightly in its meaning from the SMEM framework. The spin in the second row controls the sigma value, the sub segment length is manipulated by the spin in the fourth row, and the bottommost spin is responsible for the minimum density value.

The spins can vary integer numbers only. Thus the values of the radius and sub segment length spins have to be divided by 10 and the value of the sigma spin by 100. The minimum density spin controls the power value $p$ of the term $1 \cdot 10^{p}$. The resulting parameter settings are written in the bottom left corner of the LF canvas as “radius”/“sigma”/“sub segment length”, with the “minimum density” value in the next line.
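The conversion from spin values to system parameters can be summarised in a few lines; the following sketch reproduces the scaling rules just described (the function name is ours), using the parameter settings proposed by Lateki & co. as the example.

```python
def spins_to_parameters(radius_spin, sigma_spin, sub_len_spin, density_power_spin):
    """Convert the integer spin values of the LF control group into the
    actual system parameters according to the scaling rules above."""
    return {
        "radius": radius_spin / 10.0,
        "sigma": sigma_spin / 100.0,
        "sub_segment_length": sub_len_spin / 10.0,
        "minimum_density": 1.0 * 10 ** density_power_spin,
    }

print(spins_to_parameters(36, 24, 12, -6))
# {'radius': 3.6, 'sigma': 0.24, 'sub_segment_length': 1.2, 'minimum_density': 1e-06}
```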

As with the EM controls, the iteration is launched by the button “>”. The iteration consists of several steps: “addSegments”, “createSegments”, “splitSegments”, “createSegments2”, “splitSegments2”, “createSegments3”, “mergeSegments”, “createSegments4”. The “createSegments” step is the


Fig. 9.6: NATA Undeveloped Regions

actual EM step of the method, where the weights and the model component parameters are computed. The “go” button launches the method procedure and stops after 10 iterations unless the “st” button

is pressed. Stopping the procedure is only possible once the current iteration is fully performed; thus, pressing the “st” button does not stop the procedure immediately. After each iteration the current model state is painted on the LF canvas. The current iteration number is written under the bounding box of the data set on the right side, the KLD value in the centre and the number of model components to the left (here denoted by “Lines”). The current state is also printed in the “log” text field. The output is written as MATLAB code, such that the components fill an array of points “L”, and the current KLD and the number of model components are written into a specifically defined array too. The values of the iterations are separated by an “if” conditional, so the values of each specific iteration can be selected separately.

The history of each procedure run is stored for the time the system is in use. To navigate through the model states according to the iteration number, the buttons back “<−” and next “−>” are used. Please note that the data presentation changes its appearance for better clarity.

In each iteration the data set is searched for undeveloped regions; new segments have to be added if such regions are found. The visualisation of this step separately from the iteration is possible in the “help” mode only. If the “help” checkbox is activated, pressing the “ud” button computes the undeveloped regions of the current model state and fills them with segments by the least squares algorithm. The undeveloped regions in the initialisation of the “NATA” data set are demonstrated in picture 9.6. To return to the iteration mode press the “ud” button again.

The help mode allows the user to look “into” the procedure process. The iteration is then divided into its steps; pressing the iteration button “>” launches the steps separately. The current state


is written in the top left corner of the LF canvas. In the help mode it is possible to “read” the point-segment dependence. The currently selected

sample point and segment are coloured green and connected by a green line showing the smallest point-segment distance. The selected sample point is drawn with a circle whose radius corresponds to the distance to its third nearest neighbour; the cross in the circle helps to locate the sample point itself. The selection of the sample point and the model component is done in the EM control group with the help of the spins in the fourth row: the left spin controls the selection of the sample points and the right spin the model components. Without selecting a model component, the segment nearest to the currently selected sample point can be found by enabling the “best index” mode. To do so, press the button “bi” to the right of the “ud” button in the fourth row of the LF control group.

The distance to the selected segment and the distance to the nearest segment can be read in the parameters panel in the bottommost table. The sdd value and the weight of the currently selected sample point can be read in the density parameter panel in the second table. The two points defining the segment can be read in the theta parameters group in the first table.

The specific feature of this method is its ability to split and merge; it was intended to automatically find the optimal number of model components with this feature. It is possible for the user to check the splitting and merging of the selected model components by pressing the buttons “cs” and “cm” respectively. Obviously no splitting or merging is performed; it is only a check whether the step could be performed. The result is written on the canvas in the top left corner; on failure the writing is red, otherwise green.

The splitting step is very straightforward in this method and its evaluation is always successful. The splitting is based on cutting the segment into sub segments and the computation of nearest neighbours. The nearest neighbour computation was evaluated and debugged in the context of simulating data. The sub segments and sample points in this method are computed in quite the same way as in the framework of the intuitive SMEM; thus we did not see any reason to visualise the split step in this framework.

The merge step, on the other hand, is very intricate. The merging is performed in a specific order on specific segments, which is determined by the overlap computation; the merge value determines the sorting order. Please consult the EMSFSM chapter 6.5 for further details. The computation and the maintenance of this information can be very bug-prone. In fact, we found a crucial bug through the visualisation of the overlap functionality.

We developed an overlap mode to show the relations in the adequate order. The mode is entered and left by pressing the “ov” button. The navigation through the different overlaps between the corresponding segments is done by increasing or decreasing the number in the spin to the right of the button. The overlaps are sorted according to the merge value. The samples whose weighting is influenced by both segments are coloured blue.

We found some inadequate mergings, especially on the “NATA” data set. They happened only if the parameter settings were not adjusted. To understand this behaviour and perform debugging if needed, we visualised the whole merging procedure separately from the iteration. This is only possible in the “help-overlap” mode if the next iteration step is “merge”. The activation, leaving and navigation are done with the merge button “+>” of the EM control group.

The screenshot is presented in picture 9.7. Each press of the merge button shows the merge step on two particular segments selected and arranged by the overlap functionality. As you can see in the picture, the sample points are coloured.

We enlarged the region of interest and put it into the margin. The colours correspond to segment membership: the red samples belong to the top segment, the green ones support the bottom


Fig. 9.7: NATA Merge Frame

one. The overlapped samples are framed and coloured according to the influence value of the corresponding segments. The circle denotes the currently selected sample. The segment proposed by this step, resulting from the least squares algorithm, is coloured red, but naturally it cannot be seen in this example because it is only barely displaced and of the same length.

These two segments would indeed be merged. The merge step performs two checks. The first checks whether the resulting segment could be split. If it cannot be split, the second check is performed: whether the merging is advantageous. The value of the advantage is determined by computing the KLD values of the segments separately, normalising them, and comparing them afterwards with the KLD value of the resulting segment.

As we already described in the evaluation chapter, the larger sigma and radius are, the more samples are inadequately influenced by the segments, and the worse the KLD value of the segments before the merge becomes. To visualise the KLD behaviour on these segments we made use of the histogram tool. As shown in picture 9.7, the histogram is only computed for the supporting points of the two segments proposed for the merge, and it is only computable when the merging advantage computation is performed, since only then are the KLD values computed. The topmost histogram shows the distribution of the original sdd. The second is the same distribution, but normalised to sum to 1. The third is the most interesting one: it shows the influence of the two segments separately. We enlarge the third histogram in the picture to the right. The yellow lines denote the largest influence on a sample, the red line shows the smallest influence, and the black lines show either the weighting outside the overlap or the normalised weight computed from both influences.


You can again navigate in the histogram by clicking onto the canvas or by selecting a sample via the spin. Currently sample number 36 in the set of supporting points is selected (see the spin), which is sample number 45 in the data set (see the number on the histogram). Thus, while navigating in the set of supporting points you can read off the index of the sample in the actual data set. We selected a sample point with a low sdd value, as can be seen in the topmost histogram in picture 9.7. If the sdd value is low, then the distance to the third nearest neighbour is large, which is confirmed by the large circle around the sample on the LF canvas. But this sample gains a lot of importance after merging, since it is very close to the segment, as can be seen in the bottommost histogram. The bottommost histogram is the weight distribution after the suggested merging.

The KLD values used in the computation of the merging advantage are printed onto the LF canvas in the bottom left corner in grey. In the top right corner of the LF canvas you will find the merge output containing the checking results, the resulting number of segments and the merge decision, which is coloured green in our case, denoting a successful merge.

The computation of the aforesaid KLD values is based on the normalised sdd values and the normalised probability weights on the samples. The original and the normalised values can also be read in the density parameters panel in the first and second tables.
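As a rough illustration of how such a KLD value could be computed on the supporting samples, the following sketch normalises both distributions and evaluates the discrete Kullback-Leibler divergence. The exact weighting and normalisation of the method are defined in equations 5.6.8 and 5.6.9; the function and the numbers below are only illustrative.

```python
import numpy as np

def sample_kld(sdd_values, weights):
    """Discrete KLD between the normalised sdd values and the normalised
    probability weights on the supporting samples (illustrative only)."""
    p = np.asarray(sdd_values, float)
    q = np.asarray(weights, float)
    p, q = p / p.sum(), q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(sample_kld([0.2, 0.5, 0.3, 0.1], [0.25, 0.4, 0.3, 0.05]))
```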

The histogram tool serves the visualisation only; we could not gain much more understanding by analysing it. But by developing the histograms we again discovered and confirmed the influence of the sigma and radius parameters on the KLD values.


Chapter 10

Conclusion

In the course of the method derivation and reengineering we discovered that the method outcome indeed depends on the settings of the system parameters. The balance between splitting and merging was achieved by an inadequate KLD computation at the cost of convergence. The inadequate KLD computation on a composite of segments is strongly dependent on the merging parameters, which control the size of the set of sample points whose weighting is influenced by more than one model component.

Even with the lost guarantee of convergence, a good approximation can be found after only a few iterations. The method results in very good solutions even if the initial model parameters are not close to the global optimum. We made propositions for a merging step that preserves convergence, though the balance between the split and our merge was only assumed and not analysed further.

Although dependent on system parameters, the method does not react very sensitively to their variation; in some cases even tripled parameter values make no difference to the outcome. Thus the approximation can be applied to data sets captured by devices with differently adjusted system parameters, and knowledge of such parameters can be used to adjust our system settings.

The time needed for each iteration does increase, however. The outcome of the method is, moreover, very insensitive to noise.

A visualisation tool was developed that includes a data simulation tool, an EM/SMEM application and the reimplementation and visualisation of the EMSFEM method. The reimplementation of the method was extensively tested and evaluated on real laser range data, noise-corrupted data and object contours in digital images.

The inadequate nonparametric density estimation, and therefore the inadequate treatment of the overlap, is the key feature for fitting the line segments in such a way that connections between segments are encouraged. The resulting fits may therefore allow the detection of crossing edges in the real scene and may thus lead us towards graph fitting.


Appendix A

Rome Data Set

We present here the results of fitting the “Rome” data set with varying parameter settings. We analysed the behaviour of the algorithm by varying one of the three parameters at a time. The parameter “Minimum Density” was kept constant at a value of 1·10⁻⁶. The following table lists the parameter combinations and is arranged in the same way as the following figures: the subsegment length parameter varies from top to bottom and the radius parameter varies from left to right. For the variation of sigma see the subsequent page.

r; σ; sl          r; σ; sl
10; 0.24; 5       10; 0.24; 5

3.6; 0.24; 0.8    7.2; 0.24; 0.8
3.6; 0.24; 1.2    7.2; 0.24; 1.2
3.6; 0.24; 2.4    7.2; 0.24; 2.4
3.6; 0.72; 0.8    7.2; 0.72; 0.8
3.6; 0.72; 1.2    7.2; 0.72; 1.2
3.6; 0.72; 2.4    7.2; 0.72; 2.0

10; 0.30; 5       30; 2; 10
3.6; 0.12; 1.2    3.6; 0.48; 1.2
5.4; 0.24; 1.2    5.4; 0.48; 1.2
5.4; 0.72; 1.2    7.2; 0.48; 1.2

We also started with very inaccurate parameter settings; the results can be found at the bottom of the second picture group. Even when starting from such unadjusted settings, a good fitting result could be found after only three attempts. This result is presented in the very first pictures.

The development of the KLD values and of the number of model components has already been presented in Section 7.6.2 of the chapter “Evaluation” and will not be repeated here.





Appendix B

Gradient Images

We present here the method performance on gradient images. The data sets have been generated with the “Vigra” library (“Vision with Generic Algorithms”)¹. The gradient images were produced with σ = 1.6 and the data points by “edgel finding” with a threshold of 5. We found it sensible to fit all data sets with the same parameter settings, since the capturing settings were constant. We chose the radius to be 3.6, sigma = 0.8, subsegment length = 3 and the minimum density 1·10⁻⁵.
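The following minimal Python sketch uses SciPy instead of the Vigra routines actually employed for the thesis data and a synthetic image instead of the real test images; it only illustrates the preprocessing, with simple thresholding standing in for the edgel-finding step.

```python
import numpy as np
from scipy import ndimage

# Synthetic grey-value image: a bright rectangle on a dark background
# (a stand-in for the real test images such as "Blox").
image = np.zeros((64, 64))
image[16:48, 20:44] = 200.0

# Gaussian gradient magnitude at scale sigma = 1.6, as in this appendix.
grad = ndimage.gaussian_gradient_magnitude(image, sigma=1.6)

# Keep pixel positions whose gradient magnitude exceeds the threshold of 5
# as 2D data points for the polygonal approximation.
ys, xs = np.nonzero(grad > 5.0)
points = np.column_stack([xs, ys]).astype(float)
print(points.shape)
```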

data set        samples    segments
Blox            2436       91
Blocs           1887       66
Blocks World    2951       55
House           2988       81

[Figures: development of the KLD value and of the number of line segments over iterations 1–10 for the data sets Blox, Blocs, Blocks World and House.]

¹ See http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/ for further details and documentation.



Appendix C

Artificial Data Sets

For debugging and visualising the method reimplementation we developed several artificial data sets. We present here the fitting results on the most critical ones. The data sets were produced with our visualisation tool and are therefore not consistent with the assumed data capturing devices, but they still served their purpose.
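A rough sketch of how such a data set, for example the “Cross” set listed below, could be simulated is given here; the noise level, the split of the point counts and the sampling scheme are assumptions for illustration, not the actual settings of our tool.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segment(a, b, n, noise):
    # Sample n points uniformly along the segment a-b with isotropic Gaussian noise.
    t = rng.uniform(0.0, 1.0, size=(n, 1))
    pts = (1 - t) * np.asarray(a, dtype=float) + t * np.asarray(b, dtype=float)
    return pts + rng.normal(scale=noise, size=pts.shape)

# Two crossing line segments, roughly like the "Cross" data set.
cross = np.vstack([
    sample_segment((0, 0), (100, 100), 380, noise=2.0),
    sample_segment((0, 100), (100, 0), 385, noise=2.0),
])
print(cross.shape)  # (765, 2); the total matches the sample count of "Cross" below
```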

The following table shows the parameter settings together with the starting and ending conditions.

data set    samples    segments    parameters
Cross       765        2           30/3/20/1·10⁻⁶
H           292        3           60/4/30/1·10⁻⁶
NATA        903        12          30/3/20/1·10⁻⁶
LEO         325        12          35/2/20/1·10⁻⁶

[Figures: development of the KLD value and of the number of line segments over iterations 1–10 for the data sets Cross, H, NATA and LEO.]



Bibliography

[1] Richard O. Duda, Peter E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons Inc., June 1973.

[2] A. Dempster, N. Laird, D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[3] M.J.R. Healy, M. Westmacott. Missing values in experiments analyzed on automatic computers. Appl. Statist., 5:203-206, 1956.

[4] M.J. Beal, Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data. Bayesian Statistics 7, Oxford University Press, 2003.

[5] R.E. Kass, A.E. Raftery. Bayes Factors. J. Amer. Statist. Assoc., 90:773-795, 1995.

[6] D.J.C. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6:469-505, 1995.

[7] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464, 1978.

[8] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723, 1974.

[9] C. Andrieu, N. de Freitas, A. Doucet, M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.

[10] R.M. Neal. Annealed Importance Sampling. Statistics and Computing, 11:125-139, 2001.

[11] Peter J. Green. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika, 82(4):711-732, 1995.

[12] W.K. Hastings. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, 57(1):97-109, 1970.

[13] C. Andrieu, N. de Freitas, A. Doucet, M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.

[14] Sylvia Richardson, Peter J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, 59:731-792, 1997.

[15] Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, Geoffrey E. Hinton. SMEM Algorithm for Mixture Models. Neural Computation, 12(9):2109-2128, 2000.


[16] Zhihua Zhang, Chibiao Chen, Jian Sun, Kap Luk Chan. EM algorithms for Gaussian mixtures with split-and-merge operation. Pattern Recognition, 36:1973-1983, 2003.

[17] Samuel T. Pfister, Stergios I. Roumeliotis, Joel W. Burdick. Weighted Line Fitting Algorithms for Mobile Robot Map Building and Efficient Data Representation. ICRA, 2003.

[18] S.T. Pfister, K.L. Kriechbaum, S.I. Roumeliotis, J.W. Burdick. Weighted range sensor matching algorithms for mobile robot displacement estimation. Proc. IEEE Int. Conf. on Robotics and Automation, Washington, D.C., May 2002.

[19] Daniel Sack, Wolfram Burgard. A comparison of methods for line extraction from range data. Proc. of the 5th IFAC Symposium on Intelligent Autonomous Vehicles (IAV), 2004.

[20] Michael Veeck, Wolfram Burgard. Learning Polyline Maps from Range Scan Data Acquired with Mobile Robots. Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2004.

[21] Paul L. Rosin. Techniques for Assessing Polygonal Approximations of Curves. IEEE Trans. PAMI, 19(6), June 1997.

[22] Longin Jan Latecki, Rolf Lakaemper (CIS Dept., Temple University, Philadelphia), Mark Sobel (Statistics Dept., Temple University, Philadelphia). Polygonal Approximation of Point Sets. Technical Report, pp. 159-173, Springer-Verlag Berlin Heidelberg, 2006.

[23] Longin Jan Latecki, Rolf Lakaemper (CIS Dept., Temple University, Philadelphia), Mark Sobel (Statistics Dept., Temple University, Philadelphia). New EM Derived from Kullback-Leibler Divergence. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), Philadelphia, August 2006.

[24] R. Neal, G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models, Kluwer, 1998.

[25] Kardi Teknomo. Bootstrap Sampling Tutorial. http://people.revoledu.com/kardi/tutorial/Bootstrap/index.html

[26] Bradley Efron, Robert Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.

[27] Susan Holmes (Stanford). Lecture notes. http://www-stat.stanford.edu/%7Esusan/courses/s208/

[28] D.W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley and Sons, 1992.

[29] B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

[30] T.W. Anderson. An Introduction to Multivariate Statistical Analysis. New York, Wiley, 1958.

[31] Kaare Brandt Petersen, Michael Syskind Pedersen. The Matrix Cookbook. Version of February 16, 2006.

[32] M.W. Mak, S.Y. Kung, S.H. Lin. Expectation-Maximization Theory. Prentice Hall PTR, January 3, 2005.


[33] Jeff A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. International Computer Science Institute, April 1998.

[34] G.E.P. Box, M.E. Muller. A Note on the Generation of Random Normal Deviates. Ann. Math. Stat., 29:610-611, 1958.

[35] X. Ren, C. Fowlkes, J. Malik. Scale-invariant contour completion using conditional random fields. ICCV, 2005.


Index

assumed sample points, 59, 70

bootstrap, 54
Box Muller method, 74

convergence
    EM convergence, 36, 45
    Extended EM convergence, 39, 61
    Extended EM convergence corruption, 63
    speed of convergence, 77

data set, 21, 64
    complete data set, 28, 33
    incomplete data set, 27

EMSF, 67
EMSFSM, 64
    iteration, 66
expectation value, 28, 33, 52

Gaussian like, 41, 46, 51, 73

indicator function, 30

Jensen’s inequality, 36

Kullback-Leibler Divergence, 41, 65
    inverted, 60
    oscillation, 85

Lagrange multiplier, 31
least squares, 68, 69
likelihood, 24
    complete-data log-likelihood, 28
    function, 22
    incomplete-data log-likelihood, 27, 28, 32
    likelihood function, 36
    likelihood value, 76
    log likelihood, 24
    log likelihood value, 80
    maximum likelihood estimate, 24

line, 69
line segment, 65, 70

merge decision, 66
merging value, 66, 72
model, 27, 41, 64

model estimation, 41
modified model estimation, 43, 57

Monte Carlo Estimation, 45, 46, 53

normalisation, 44, 51
    EM model modification, 46

orthogonal regression, 65, 68, 69
oscillation, 85
overfitting, 45
overlap, 51, 61, 66, 72, 83

parameter
    parameter estimation, 22, 43
    parametric family of densities, 41

prior, 68

radius, 65, 67, 70
region of influence, 46, 61, 73

sdd, 55–57, 66, 69, 70, 73
segment, 70
sigma, 41, 73
sparse EM, 80
split decision, 66
split failure, 51
subsegment length, 65, 66, 70
support, 66, 69, 79
system parameter
    MINDENS, 71
    radius, 61, 65, 83
    sdd bandwidth, 59
    sigma, 41, 73, 83
    subsegment length, 65, 66, 70, 82
    ZERO, 72

threshold, 79

underfitting, 45
undeveloped region, 65, 67

weights matrix, 65, 68, 69, 72


I hereby declare that I have produced this thesis independently and without outside assistance, and that I have used no aids other than those listed in the attached bibliography. All passages taken literally or in substance from published sources are marked as such.

I agree to this thesis being included in the holdings of the library of the Department of Informatics.

Hamburg, July 3, 2007

Leonid Tcherniavski