Fcv learn ramanan


Transcript of Fcv learn ramanan

Page 1: Fcv learn ramanan

Learning structured representations

Deva Ramanan, UC Irvine

Page 2: Fcv learn ramanan

Geometric models (1970s-1990s)

Learned model: fw(x) = w · Φ(x)

Training

• Training data consists of images with labeled bounding boxes

• Need to learn the model structure, filters, and deformation costs

[Figure: learned template, showing positive weights and negative weights]


[Figure: training examples — negatives and positives]

Statistical classifiers (1990s-present)

Hand-coded models
Appearance-based representations
Large-scale training

Visual representations

Page 3: Fcv learn ramanan

Learned visual representations

Viola-Jones, Dalal-Triggs


Representation (linear classifier, ...)

Features

Where is invariance built in?

Page 4: Fcv learn ramanan

Learned visual representations: Where is invariance built in?

(a) (b) (c)

Fig. 1. Detections obtained with a single component person model. The model is defined by a coarse root filter (a), several higher resolution part filters (b) and a spatial model for the location of each part relative to the root (c). The filters specify weights for histogram of oriented gradients features. Their visualization shows the positive weights at different orientations. The visualization of the spatial models reflects the "cost" of placing the center of a part at different locations relative to the root.

To train models using partially labeled data we use a latent variable formulation of MI-SVM

[3] that we call latent SVM (LSVM). In a latent SVM each example x is scored by a function

of the following form,

fβ(x) = max_{z ∈ Z(x)} β · Φ(x, z).   (1)

Here β is a vector of model parameters, z are latent values, and Φ(x, z) is a feature vector.

In the case of one of our star models β is the concatenation of the root filter, the part filters,

and deformation cost weights, z is a specification of the object configuration, and Φ(x, z) is a

concatenation of subwindows from a feature pyramid and part deformation features.

We note that (1) can handle very general forms of latent information. For example, z could

specify a derivation under a rich visual grammar.
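Equation (1) can be sketched directly: enumerate the latent values Z(x), score each with β · Φ(x, z), and keep the maximizer. The 1-D signal, offset set Z, and feature map below are toy stand-ins, not the paper's actual feature pyramid:

```python
import numpy as np

def latent_score(beta, x, Z, phi):
    """Latent SVM scoring: f_beta(x) = max_{z in Z(x)} beta · Φ(x, z).
    Returns the score and the maximizing latent value z."""
    best_z = max(Z(x), key=lambda z: beta @ phi(x, z))
    return float(beta @ phi(x, best_z)), best_z

# Toy instantiation: x is a 1-D "signal", z is a part offset, and
# Φ(x, z) concatenates a window at z with a deformation feature.
def Z(x):
    return range(len(x) - 3)

def phi(x, z):
    return np.concatenate([x[z:z + 3], [-(z - 2) ** 2]])  # appearance + deformation

x = np.array([0.0, 1.0, 5.0, 1.0, 0.0, 0.0])
beta = np.array([1.0, 1.0, 1.0, 0.1])   # filter weights + deformation weight
s, z = latent_score(beta, x, Z, phi)    # the best offset wins
```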

Our second class of models represents each object category by a mixture of star models.

The score of one of our mixture models at a given position and scale is the maximum over components of the score of that component model at the given location. In this case the latent

information, z, specifies a component label and a configuration for that component. Figure 2

shows a mixture model for the bicycle category.
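The mixture rule above — a max over components, with z extended to carry the component label — can be sketched as follows (the two lambda components are hypothetical placeholders for per-component star models):

```python
def mixture_score(components, x):
    """Mixture of star models: the score at a location is the maximum over
    components of that component's score; the latent information z is
    (component label, that component's configuration)."""
    best_score, best_z = max(
        (score, (label, cfg))
        for label, comp in enumerate(components)
        for score, cfg in [comp(x)]          # each component returns (score, config)
    )
    return best_score, best_z

# Hypothetical two-component mixture (e.g. side-view vs frontal bicycle).
components = [
    lambda x: (x - 1.0, "side"),
    lambda x: (0.5 * x, "frontal"),
]
```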

To obtain high performance using discriminative training it is often important to use large

training sets. In the case of object detection the training problem is highly unbalanced because

there is vastly more background than objects. This motivates a process of searching through

the background to find a relatively small number of potential false positives. A methodology of

June 1, 2009 DRAFT
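The background search motivated above can be sketched as a scoring pass that retains only the highest-scoring background windows (a simplified stand-in for the paper's hard-negative data mining; `phi` and the cache size are illustrative):

```python
import numpy as np

def mine_hard_negatives(w, backgrounds, phi, cache_size=1000):
    """One mining pass: score every background window with the current
    model w and keep the indices of the top-scoring ones (the potential
    false positives). Training alternates between refitting w on
    positives plus mined negatives and mining again."""
    scored = [(float(w @ phi(x)), i) for i, x in enumerate(backgrounds)]
    scored.sort(reverse=True)                 # highest-scoring background first
    return [i for _, i in scored[:cache_size]]
```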


Representation (latent-variable classifier)

Features


Felzenszwalb et al. 2009

Page 5: Fcv learn ramanan

Matching alg.

Alg. output

Ground truth

Tune parameters ( , ) until desired output on training set

'Graduate Student Descent' might take a while (phrase from Marshall Tappen)

Training images

Where does learning fit in?

[Fig. 9. Some of the models learned on the PASCAL 2007 dataset: person, bottle, cat, car.]


Page 6: Fcv learn ramanan

5 years of PASCAL people detection

[Chart: average precision for PASCAL people detection by year, 2005-2010, axis 0-50 — from 1% to 47% in 5 years]

Matching results

(after non-maximum suppression)

~1 second to search all scales
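The greedy non-maximum suppression used to produce these matching results can be sketched as: sort detections by score, keep each one unless it overlaps an already-kept box too much (the boxes and threshold below are illustrative):

```python
def nms(detections, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, score) boxes:
    keep the highest-scoring box, drop boxes that overlap it, repeat."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) < iou_thresh for k in kept):
            kept.append(det)
    return kept

dets = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (20, 20, 30, 30, 0.7)]
survivors = nms(dets)   # the 0.8 box overlaps the 0.9 box and is suppressed
```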

How do we move beyond the plateau?

Page 7: Fcv learn ramanan

How do we move beyond the plateau?

1. Develop more structured models with less invariant features

Page 8: Fcv learn ramanan

Invariance vs Search

Projective Invariants

View-Based Mixtures

Page 9: Fcv learn ramanan

Invariance vs Parametric Search

Part-Based Models

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

17

person bottle

cat

car

Fig. 9. Some of the models learned on the PASCAL 2007 dataset.

4

(a) (b) (c)

Fig. 1. Detections obtained with a single component person model. The model is defined by a coarse root filter (a), several

higher resolution part filters (b) and a spatial model for the location of each part relative to the root (c). The filters specify

weights for histogram of oriented gradients features. Their visualization show the positive weights at different orientations. The

visualization of the spatial models reflects the “cost” of placing the center of a part at different locations relative to the root.

To train models using partially labeled data we use a latent variable formulation of MI-SVM

[3] that we call latent SVM (LSVM). In a latent SVM each example x is scored by a function

of the following form,

fβ(x) = maxz∈Z(x)

β · Φ(x, z). (1)

Here β is a vector of model parameters, z are latent values, and Φ(x, z) is a feature vector.

In the case of one of our star models β is the concatenation of the root filter, the part filters,

and deformation cost weights, z is a specification of the object configuration, and Φ(x, z) is a

concatenation of subwindows from a feature pyramid and part deformation features.

We note that (1) can handle very general forms of latent information. For example, z could

specify a derivation under a rich visual grammar.

Our second class of models represents each object category by a mixture of star models.

The score of one of our mixture models at a given position and scale is the maximum over

components, of the score of that component model at the given location. In this case the latent

information, z, specifies a component label and a configuration for that component. Figure 2

shows a mixture model for the bicycle category.
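The max-over-components rule for mixtures can be sketched the same way (all names here are illustrative, not from any released code):

```python
import numpy as np

def mixture_score(betas, x, Z_sets, phi):
    """Mixture of star models: the score at a fixed position and scale is
    the max over components c of that component's latent score; the latent
    information is the pair (component label c, configuration z)."""
    best_score, best_z = -np.inf, None
    for c, beta in enumerate(betas):
        for z in Z_sets[c]:
            s = float(np.dot(beta, phi(c, x, z)))
            if s > best_score:
                best_score, best_z = s, (c, z)
    return best_score, best_z

phi = lambda c, x, z: np.array([x, z, 1.0 if c == 0 else -1.0])
betas = [np.array([1.0, 0.5, 0.2]), np.array([0.8, 1.0, 0.2])]
score, (c_star, z_star) = mixture_score(betas, 2.0, [[0, 1], [0, 1]], phi)
```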

To obtain high performance using discriminative training it is often important to use large

training sets. In the case of object detection the training problem is highly unbalanced because

there is vastly more background than objects. This motivates a process of searching through

the background to find a relatively small number of potential false positives. A methodology of

June 1, 2009 DRAFT
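The search through the background described above can be sketched as a hard-negative mining step; the threshold and cache values below are made up for illustration:

```python
import numpy as np

def mine_hard_negatives(w, b, windows, threshold=-1.0, cache_limit=1000):
    """Keep only background windows that score above a margin threshold --
    the small set of potential false positives worth training on, given the
    huge imbalance between background and objects. Names and values here
    are illustrative, not taken from any particular implementation."""
    scores = windows @ w + b
    idx = np.flatnonzero(scores > threshold)           # margin violations
    idx = idx[np.argsort(-scores[idx])][:cache_limit]  # worst offenders first
    return windows[idx]

rng = np.random.default_rng(0)
background = rng.normal(size=(5000, 4))                # stand-in feature windows
w, b = np.ones(4), -3.0
hard = mine_hard_negatives(w, b, background)
```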


Page 10: Fcv learn ramanan

Learned visual representations
Where is invariance built in?

Representation (latent-variable classifier)

Features Yi & Ramanan 11

Buffy performance: 88% vs 73%

Page 11: Fcv learn ramanan

Qualitative Results

Page 12: Fcv learn ramanan

How do we move beyond the plateau?

1. Develop more structured models with less invariant features

2. Score syntax as semantics

Page 13: Fcv learn ramanan

The forgotten challenge....

(Precision-recall plots, "All results"; axes: recall vs. precision)
Left: BCNPCL_HumanLayout (3.3), OXFORD_SBD (10.4)
Right: BCNPCL_HumanLayout (74.4), OXFORD_SBD (52.7)

Results
• Two methods, both using detectors trained on other data
• Oxford method does not attempt to detect feet

(Table: per-part detection rates for Head, Hand and Foot, comparing BCNPCL_HumanLayout and OXFORD_SBD.)

Page 14: Fcv learn ramanan

Structured classifiers

classifier

Estimated shape

shape

reflectance

To appear in the ACM SIGGRAPH conference proceedings

Figure 8: Top: heat equilibrium for two bones. Bottom: the result of rotating the right bone with the heat-based attachment

We treat the character volume as an insulated heat-conducting body and force the temperature of bone i to be 1 while keeping the temperature of all of the other bones at 0. Then we can take the equilibrium temperature at each vertex on the surface as the weight of bone i at that vertex. Figure 8 illustrates this in two dimensions.

Solving for heat equilibrium over a volume would require tessellating the volume and would be slow. Therefore, for simplicity, Pinocchio solves for equilibrium over the surface only, but at some vertices, it adds the heat transferred from the nearest bone.

The equilibrium over the surface for bone i is given by ∂w^i/∂t = Δw^i + H(p^i − w^i) = 0, which can be written as

−Δw^i + Hw^i = Hp^i, (1)

where Δ is the discrete surface Laplacian, calculated with the cotangent formula [Meyer et al. 2003], p^i is a vector with p^i_j = 1 if the nearest bone to vertex j is i and p^i_j = 0 otherwise, and H is the diagonal matrix with H_jj being the heat contribution weight of the nearest bone to vertex j. Because Δ has units of length^−2, so must H. Letting d(j) be the distance from vertex j to the nearest bone, Pinocchio uses H_jj = c/d(j)^2 if the shortest line segment from the vertex to the bone is contained in the character volume and H_jj = 0 if it is not. It uses the precomputed distance field to determine whether a line segment is entirely contained in the character volume. For c ≈ 0.22, this method gives weights with similar transitions to those computed by finding the equilibrium over the volume. Pinocchio uses c = 1 (corresponding to anisotropic heat diffusion) because the results look more natural. When k bones are equidistant from vertex j, heat contributions from all of them are used: p^i_j is 1/k for all of them, and H_jj = kc/d(j)^2.

Equation (1) is a sparse linear system, and the left hand side matrix −Δ + H does not depend on i, the bone we are interested in. Thus we can factor the system once and back-substitute to find the weights for each bone. Botsch et al. [2005] show how to use a sparse Cholesky solver to compute the factorization for this kind of system. Pinocchio uses the TAUCS [Toledo 2003] library for this computation. Note also that the weights w^i sum to 1 for each vertex: if we sum (1) over i, we get (−Δ + H) Σ_i w^i = H · 1, which yields Σ_i w^i = 1.

It is possible to speed up this method slightly by finding vertices that are unambiguously attached to a single bone and forcing their weight to 1. An earlier variant of our algorithm did this, but the improvement was negligible, and this introduced occasional artifacts.
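As a sanity check on the system above, here is a dense toy version of (−Δ + H) w^i = H p^i on a 4-vertex path graph; the real system is sparse and factored once with a Cholesky solver, and the numbers below are made up for illustration:

```python
import numpy as np

# Graph Laplacian of a 4-vertex path stands in for -Delta (PSD, rows sum to 0).
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
H = np.diag([4.0, 4.0, 1.0, 1.0])        # H_jj = c / d(j)^2 (toy values)
P = np.array([[1., 0.],                  # p^i_j = 1 iff bone i is nearest to j
              [1., 0.],
              [0., 1.],
              [0., 1.]])

A = L + H                                # same left-hand side for every bone
W = np.linalg.solve(A, H @ P)            # column i holds the weights w^i

row_sums = W.sum(axis=1)                 # summing (1) over i gives all ones
```

Because the left-hand side is shared across bones, one factorization serves every right-hand side, and the per-vertex weights sum to 1 exactly as the text derives.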

5 Results

We evaluate Pinocchio with respect to the three criteria stated in the introduction: generality, quality, and performance. To ensure an objective evaluation, we use inputs that were not used during development. To this end, once the development was complete, we tested Pinocchio on 16 biped Cosmic Blobs models that we had not previously tried.

Figure 10: A centaur pirate with a centaur skeleton embedded looks at a cat with a quadruped skeleton embedded

Figure 11: The human scan on the left is rigged by Pinocchio and is posed on the right by changing joint angles in the embedded skeleton. The well-known deficiencies of LBS can be seen in the right knee and hip areas.

5.1 Generality

Figure 9 shows our 16 test characters and the skeletons Pinocchio embedded. The skeleton was correctly embedded into 13 of these models (81% success). For Models 7, 10 and 13, a hint for a single joint was sufficient to produce a good embedding.

These tests demonstrate the range of proportions that our method can tolerate: we have a well-proportioned human (Models 1–4, 8), large arms and tiny legs (6; in 10, this causes problems), and large legs and small arms (15; in 13, the small arms cause problems). For other characters we tested, skeletons were almost always correctly embedded into well-proportioned characters whose pose matched the given skeleton. Pinocchio was even able to transfer a biped walk onto a human hand, a cat on its hind legs, and a donut.

The most common issues we ran into on other characters were:

• The thinnest limb into which we may hope to embed a bone has a radius of 2τ. Characters with extremely thin limbs often fail because the graph we extract is disconnected. Reducing τ, however, hurts performance.

• Degree 2 joints such as knees and elbows are often positioned incorrectly within a limb. We do not know of a reliable way to identify the right locations for them: on some characters they are thicker than the rest of the limb, and on others they are thinner.

Although most of our tests were done with the biped skeleton, we have also used other skeletons for other characters (Figure 10).

5.2 Quality

Figure 11 shows the results of manually posing a human scan using our attachment. Our video [Baran and Popovic 2007b] demonstrates the quality of the animation produced by Pinocchio.


Estimated reflectance

Page 15: Fcv learn ramanan

Structured object reports

Parsing the Appearance of People in Images

Lead: Jitendra Malik (UC Berkeley)Participants: Deva Ramanan (UC Irvine), Steve Seitz (U Washington)

Introduction/goal: Human detection and pose estimation are tasks with many applications, including next-generation human-computer interfaces and activity understanding. Detection is often cast as a classification problem (does this window contain a person or not?), while pose estimation is often cast as a regression problem, where given an image or sequence of frames, one must predict joint angles. This project will take a more general view and cast both tasks as one of "parsing," where a full syntactic parse will report the number of people present (if any), their body articulation, and most notably, visual attributes describing each person. Examples of such attributes include the presence (or lack) of hats, eyewear, jackets, backpacks, and types and colors of clothing (shorts versus pants versus skirts). Building on recent success in object recognition, this project will develop discriminative representations for encoding attributes and collect annotated benchmark data for training and evaluating such models.

Game changing potential: Contemporary vision systems tend to report image classification labels or bounding boxes. However, such reports are not sufficient for many applications. We would like to move beyond detection and generate more sophisticated output, including articulated human pose and visual attributes. Though human pose recovery is a classic problem with a large body of related work, it has typically been addressed in simple, uncluttered scenes. To process out-of-the-lab footage, we plan to build on recent successful work from the detection community that combines invariant descriptors with large-scale machine-learning methods. Finally, we argue that attribute-matching has potentially game-changing applications, including next-gen photo browsing of people in large photo collections based on attributes, pose, etc. ("photo tourism for people") and matching people in photo collections ("find all photos of me online"), described further below.

Relevant applications: Our focus on attributes will be guided by two concrete tasks. One task is that of robust pose estimation in cluttered scenes, useful for such applications as gaming interfaces.

“If you’re not winning the game, change the rules”

Page 16: Fcv learn ramanan

Caveat: we need more pixels

Multiresolution models for object detection
Dennis Park, Deva Ramanan, Charless Fowlkes

University of California, Irvine
Department of Computer Science

Motivation & Goal

• Objects in images come at various resolutions.
• Most recognition systems are scale-invariant, i.e. use a fixed-size template.
• More pixels mean more information! We want to use the information when it is available. (Test image)
• Goal:
  1. Use more pixels.
  2. Detect small instances as well.
  3. Address the correlation between resolution and the role of context.

Model

Building blocks
• HOG features [1]
• SVM

Evolution of ideas
Given LR & HR templates, how can we combine them into a unified model?

S1. Immediate thought: build separate models for each size (LR template, HR template).

Φ(x, s) = [φ0(x); 1; 0; 0] if s = 0,   Φ(x, s) = [0; 0; φ1(x); 1] if s = 1

• s is a binary observed variable determined by the size of the instance.
• φ0(x): features computed from a small instance; φ1(x): features computed from a large instance.
• Limitation: the objective function reduces into two separate loss functions, so the scores from LR/HR are not comparable.

S2. Simple idea : We can always use LR templates to score large instances.

Φ(x, s) = [φ0(x); 1; 0; 0] if s = 0,   Φ(x, s) = [φ0(x); 0; φ1(x); 1] if s = 1

• The two templates share parameters, and the bias terms make scores comparable.
• Limitation: we expected the LR template to capture coarse edges (e.g. silhouette) and the HR template to capture fine edges (e.g. eyes, nose). Instead, it worked like an LR template with finer cells.

S3. Now we replace the HR model with a part-based (star) model [2].
• Eliminates the blurring effect.
• LR global template + HR deformable part templates naturally fit into a multiresolution model: LR template = global template; HR template = global template + part templates.
• Trained by latent SVM: part locations, spring constants, and ground-truth boxes are latent variables.

Φ(x, s, z) = [φ0(x); 1; 0; 0] if s = 0,   Φ(x, s, z) = [φ0(x); 0; φ1(x, z); 1] if s = 1

Scoring function:

f(x, s) = w0 · φ0(x) + b0                           if s = 0
f(x, s) = w0 · φ0(x) + max_z w1 · φ1(x, z) + b1     if s = 1
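A minimal sketch of this two-case scoring rule, with features passed in precomputed and all names illustrative:

```python
import numpy as np

def multires_score(w0, b0, w1, b1, phi0, phi1_list, s):
    """s = 0 (small): w0 . phi0(x) + b0
       s = 1 (large): w0 . phi0(x) + max_z w1 . phi1(x, z) + b1
    phi1_list holds phi1(x, z) for each candidate part placement z."""
    root = float(np.dot(w0, phi0))
    if s == 0:
        return root + b0
    part = max(float(np.dot(w1, p)) for p in phi1_list)
    return root + part + b1

w0, b0 = np.array([1.0, 1.0]), 0.5
w1, b1 = np.array([2.0]), -0.5
phi0 = np.array([0.5, 0.5])
small = multires_score(w0, b0, w1, b1, phi0, [], 0)               # 1.5
large = multires_score(w0, b0, w1, b1, phi0,
                       [np.array([0.1]), np.array([0.4])], 1)     # 1.3
```

Note how the shared root term w0 · φ0(x) and the per-case bias terms are what make the two branches' scores comparable.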

S4. Final model
• The LR/HR boundary is chosen ad hoc (e.g., 100 pixels tall).
• To reduce this artifact, we implement a latent resolution for boundary instances: e.g., given a 110-pixel-tall instance, we choose the template that gives the higher score.
• We use a 50% overlap criterion to determine boundary instances.

Context feature
• Ground-plane assumption: people around the horizon look small.
• Linear model h = ay + b [3]; we penalize the squared distance from it:
  (h − (ay + b))² = w_p · φ_p(x), where φ_p(x) = [h² y² hy h y 1].
• We use separate sets of context features for LR/HR to observe the effect of context on each template.
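The quadratic penalty really is linear in the listed feature vector; a quick check, with made-up ground-plane parameters a and b:

```python
import numpy as np

def phi_p(h, y):
    """Context feature from the poster: [h^2, y^2, h*y, h, y, 1]."""
    return np.array([h * h, y * y, h * y, h, y, 1.0])

a, b = 0.5, 10.0                          # illustrative line h = a*y + b
# Expanding (h - (a*y + b))^2 gives these weights on phi_p:
w_p = np.array([1.0, a * a, -2 * a, -2 * b, 2 * a * b, b * b])

h, y = 40.0, 50.0
penalty = float(w_p @ phi_p(h, y))        # equals (h - (a*y + b))**2
```

This is why the ground-plane prior can be folded into the same linear scoring machinery as the templates: the learner is free to pick w_p (and hence an effective a, b) discriminatively.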

Experiment & Results

Benchmark
• Caltech pedestrian dataset [4]: 71 videos, urban scenes.
• Various sizes of instances, mostly following the ground-plane assumption.
• Provides scale-specific evaluation.

Experiment design
• 1 vs 4 cross-validation; LR: h < 88 pix, HR: h > 88 pix.
• Small modification of existing code [2].
• Two scale-specific evaluations: #1 mimics the benchmark; #2 shows cumulative detection rate w.r.t. candidates' sizes.

Results
(Panels: running example (LR, HR, MR) | comparing models | benchmark)

Detection curves by instance size (legend values in parentheses):
• small: MR+C (0.48), MR (0.53), LR (0.54), HR (0.80), HR+I (0.57)
• large: MR+C (0.17), MR (0.17), LR (0.39), HR (0.20), HR+I (0.21)
• overall: MR+C (0.40), MR (0.43), LR (0.52), HR (0.60), HR+I (0.47)

TPR vs. log10 FPPI, on "medium", "near" and "reasonable" subsets:
• LR: 30<h<88 (0.30), 30<h<480 (0.48)
• HR+Interpolation: 30<h<88 (0.29), 30<h<480 (0.53)
• HR: 30<h<88 (0.12), 30<h<480 (0.40)
• MR: 30<h<88 (0.30), 30<h<480 (0.57)

(context weight visualization)

• LR > HR for small instances; LR < HR for large instances.
• MR works like LR for small instances, and works like HR for large instances.
• Context plays a more important role in detecting small instances.
• MR+Context outperforms all other approaches in most experiments of the benchmark.

[1] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005) I: 886–893
[2] Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR, Anchorage, USA (2008)
[3] Hoiem, D., Efros, A., Hebert, M.: Putting objects in perspective. International Journal of Computer Vision 80 (2008) 3–15
[4] Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: CVPR (2009)

We should focus on high-resolution data


Fig. 1. Overview of the Daimler pedestrian detection benchmark dataset: Pedestrian training samples (top row), non-pedestrian

training images (center row), test images with annotations (bottom row).

size of approx. 8.5 GB¹.

IV. SELECTED PEDESTRIAN DETECTION APPROACHES

We select a diverse set of pedestrian detection approaches in terms of features (adaptive, non-

adaptive) and classifier architecture for evaluation (see Section V): Haar wavelet-based cascade

[74], neural network using LRF features [75], and histograms of oriented gradients combined

with a linear SVM [11]. In addition to these approaches, used in sliding window fashion, we

consider a system utilizing coarse-to-fine shape matching and texture-based classification, i.e. a

monocular variant of [23]. Temporal integration is incorporated by coupling all approaches with

a 2D bounding box tracker.

We acknowledge that besides the selected approaches there exist many other interesting

lines of research in the field of monocular pedestrian detection (see Section II). We encourage

other authors to report performances using the proposed dataset and evaluation criteria for

benchmarking. Here we focus on the most widely-used approaches².

Our experimental setup assigns the underlying system parameters (e.g. sample resolution,

feature layout, training) to the values reported to perform best in the original publications [11],

¹The dataset is made freely available to academic and non-academic entities for research purposes. See
http://www.science.uva.nl/research/isla/downloads/pedestrians/index.html or contact the second author.

²Total processing time for training, testing and evaluation was several months of CPU time on a 2.66 GHz Intel processor, using implementations in C/C++.

October 16, 2008 DRAFT


(in contrast to most learning methods)

Page 17: Fcv learn ramanan

Caltech Pedestrian Benchmark

Fig. 4. Benchmark results. From the upper left graph in clockwise direction, we show the results for reasonable, near, far and medium experiments, evaluated on test instances with various heights (h > 30, h > 80, h < 30, and 30 < h < 80, respectively). Our context-augmented multiresolution model, labeled as MultiresC, significantly outperforms all previous systems in 10 out of the 11 benchmark experiments (all but the 'far' experiment).

Overall: Overall, our multiresolution model outperforms baseline models. Our contextual model provides a small but noticeable improvement, reducing the missed detection rate from 43% to 40%. We shall see that the majority of this improvement comes from detecting small-scale instances. Somewhat surprisingly, we see that a simple rigid template outperforms a more sophisticated part model: 52% MD compared to 59%. One can attribute this to the fact that the part-based model has a fixed resolution of 88 pixels (selected through cross-validation), and so cannot detect any instances which are smaller. This significantly hurts performance as more than 80% of instances fall in this small category. However, one may suspect that the part model should perform better when evaluating results on test instances that are 88 pixels or taller.

Detecting large instances: When evaluating on large instances (> 90 pixels in height), our multiresolution model performs similarly to the high-resolution part-based model. Both of these models provide a stark improvement over a

Multiresolution representations decrease error by 2X compared to previous work

missed detections
Low-resolution model
Multiresolution model
High-resolution model

Fig. 3. On the left, we show the result of our low-resolution rigid-template baseline.

One can see it fails to detect large instances. On the right, we show detections of

our high-resolution, part-based baseline, which fails to find small instances. On the

bottom, we show detections of our multiresolution model that is able to detect both

large and small instances. The threshold of each model is set to yield the same rate of

FPPI of 1.2.

5.2 Diagnostic experiments

To further analyze the performance of our system, we construct a set of diagnostic experiments by splitting up the publicly-available Caltech Pedestrian training data into a disjoint set of training and validation videos. We defined this split pseudo-randomly, ensuring that similar numbers of people appeared in both sets. We compare to a high-resolution baseline (equivalent to the original part-based code [28]), a low-resolution baseline (equivalent to a root-only model [13]), and a version of our multiresolution model without context. We visualize our baseline models in Fig. 5. All methods are trained and evaluated on the exact same data. To better interpret results, we threw out instances that were very small (< 30 pixels in height) or abnormal in aspect ratio (i.e. h/w > 5), as we view the latter as an artifact of annotating video by interpolation.

Park et al. 2010

Page 18: Fcv learn ramanan

How do we move beyond the plateau?

1. Develop more structured models with less invariant features

2. Score syntax as semantics

3. Generate ground-truth datasets of structured labels

Page 19: Fcv learn ramanan

Case study: small or big parts

Skeleton Parts/Poselets Mini-parts

Page 20: Fcv learn ramanan

What are good representations?

ExemplarsParts

AttributesVisual Phrases

Grammars?

Page 21: Fcv learn ramanan

Even worse: what are the parts (if any)?

Is there any structure to label here?

Page 22: Fcv learn ramanan

Sharing surfaces?

Page 23: Fcv learn ramanan

Selective parameter sharing

v vv

Exemplars => Parts => Attributes => Grammars

Multi-task training of instance-specific classifiers

Page 24: Fcv learn ramanan

Human-in-the-loop structure learning

Page 25: Fcv learn ramanan

How do we move beyond the plateau?

1. Develop more structured models with less invariant features

2. Score “nuisance” variables as meaningful output

3. Generate ground-truth datasets of structured labels

Page 26: Fcv learn ramanan

Diagram for Eero

Vision

Machine Learning

Vision as applied machine learning

Page 27: Fcv learn ramanan

Diagram for Eero

Vision as structured pattern recognition

Vision

Machine LearningGraphics(shape & appearance)