NIPS 2009: Understanding Visual Scenes - Part 2


A car out of context … 

Modeling object co‐occurrences 


What are the hidden objects? 

Chance ~ 1/30000

p(O | I) ∝ p(I | O) p(O)

Here I is the image and O the set of objects: p(I | O) is the object (appearance) model, and p(O) is the context model. The context prior p(O) can be represented as the full joint, approximated through a scene model, or approximated directly (approximate joint). The scene model marginalizes over a scene variable S:

p(O) = Σ_s p(S = s) Π_i p(O_i | S = s)

where the scene categories s are, e.g., office and street, and objects are conditionally independent given the scene.
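A minimal sketch of this scene-based factorization in Python, with made-up scenes, objects, and probability tables (none of the numbers below come from the tutorial):

```python
# Hypothetical tables for p(S = s) and p(O_i = 1 | S = s); all numbers illustrative.
scenes = ["office", "street"]
p_scene = {"office": 0.5, "street": 0.5}
p_obj_given_scene = {
    "office": {"desk": 0.80, "car": 0.05},
    "street": {"desk": 0.05, "car": 0.70},
}

def context_prior(presence):
    """p(O) = sum_s p(S = s) * prod_i p(O_i | S = s), with binary O_i."""
    total = 0.0
    for s in scenes:
        prod = p_scene[s]
        for obj, present in presence.items():
            p1 = p_obj_given_scene[s][obj]
            prod *= p1 if present else (1.0 - p1)
        total += prod
    return total

print(context_prior({"desk": True, "car": False}))  # plausible: an office scene
print(context_prior({"desk": True, "car": True}))   # unlikely under either scene
```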

Pixel labeling using MRFs 

Enforce consistency between neighboring labels, and between labels and pixels 

Carbonetto, de Freitas & Barnard, ECCV’04
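To make the idea concrete, here is a toy sketch, not Carbonetto et al.'s actual model: per-pixel label costs plus a Potts smoothness term between neighbors, minimized with a few sweeps of iterated conditional modes (ICM):

```python
import numpy as np

def mrf_label(unary, beta=1.0, sweeps=5):
    """Label each pixel by minimizing unary costs (-log p(label | pixel)) plus a
    Potts penalty beta for every neighbor with a different label, using ICM."""
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)              # initialize from unaries alone
    for _ in range(sweeps):
        for y in range(H):
            for x in range(W):
                costs = unary[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        # penalize labels that disagree with this neighbor
                        costs += beta * (np.arange(K) != labels[ny, nx])
                labels[y, x] = costs.argmin()
    return labels

# Noisy two-label toy problem: smoothing cleans up isolated label flips.
rng = np.random.default_rng(0)
print(mrf_label(rng.random((8, 8, 2)), beta=0.75))
```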


Object-Object Relationships 

Use latent variables to induce long-distance correlations between labels in a Conditional Random Field (CRF) 

He, Zemel & Carreira-Perpinan (04)

Object-Object Relationships 

[Kumar Hebert 2005]

•  Fink & Perona (NIPS 03): use the output of boosting for other objects at previous iterations as input into boosting for the current iteration (see the sketch below).
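A schematic of that loop, assuming two correlated detectors whose previous-round scores are fed to each other as context features; the linear scorer below is only a stand-in for boosting, and all data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_scorer(X, y):
    # Stand-in for a boosted classifier: least-squares linear scorer.
    A = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.c_[Z, np.ones(len(Z))] @ w

# Synthetic detection windows: object b tends to appear where object a does.
X = rng.random((200, 5))                                  # appearance features
y_a = (X[:, 0] > 0.6).astype(float)                       # "object a present"
y_b = ((X[:, 1] + y_a) > 1.2).astype(float)               # co-occurs with a

scores_a = np.zeros(200)                                  # round-0 context
scores_b = np.zeros(200)
for _ in range(3):                                        # boosting-style rounds
    f_a = fit_scorer(np.c_[X, scores_b], y_a)             # context from b
    f_b = fit_scorer(np.c_[X, scores_a], y_b)             # context from a
    scores_a, scores_b = f_a(np.c_[X, scores_b]), f_b(np.c_[X, scores_a])

print(np.corrcoef(scores_b, y_b)[0, 1])                   # context-aided score
```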

Object-Object Relationships 

[Figure: candidate labelings of a coastal scene — "building, boat, motorbike"; "building, boat, person"; "water, sky"; "road". The most consistent labeling according to object co-occurrences and local label probabilities: boat, building, water, road.]

A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora and S. Belongie. Objects in Context. ICCV 2007


Objects in Context: Contextual Refinement 

Contextual model based on co-occurrences: find the most consistent labeling, i.e., one with high posterior probability and high mean pairwise interaction; a CRF is used for this purpose. The score combines independent segment classification with the mean interaction of all label pairs, where φ(i, j) is essentially the observed label co-occurrence in the training set.
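A sketch of the scoring idea with hypothetical numbers (the paper optimizes a CRF; here an exhaustive search over a toy problem stands in for inference):

```python
import itertools
import numpy as np

labels = ["boat", "building", "water", "road"]
# phi[i, j]: hypothetical normalized co-occurrence of labels i and j in training data.
phi = np.array([[1.0, 0.8, 0.9, 0.2],
                [0.8, 1.0, 0.6, 0.7],
                [0.9, 0.6, 1.0, 0.3],
                [0.2, 0.7, 0.3, 1.0]])

# Independent per-segment label posteriors (illustrative numbers).
seg_probs = [
    {0: 0.50, 1: 0.20, 2: 0.20, 3: 0.10},   # segment 0 looks like a boat
    {0: 0.10, 1: 0.60, 2: 0.10, 3: 0.20},   # segment 1 looks like a building
]

def score(assignment):
    """Independent segment classification + mean interaction of all label pairs."""
    post = sum(np.log(seg_probs[k][lab]) for k, lab in enumerate(assignment))
    pairs = list(itertools.combinations(assignment, 2))
    interaction = np.mean([phi[a, b] for a, b in pairs]) if pairs else 0.0
    return post + interaction

best = max(itertools.product(range(4), repeat=len(seg_probs)), key=score)
print([labels[i] for i in best])   # -> ['boat', 'building']
```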

Using stuff to find things. Heitz & Koller, ECCV 2008

In this work, there is no labeling for stuff. Instead, they look for clusters of textures and model how each cluster correlates with the target object.

What, where and who? Classifying events by scene and object recognition

L.-J. Li & L. Fei-Fei, ICCV 2007. Slide by Fei-Fei

what who where


Grammars

•  Guzman (SEE), 1968
•  Noton and Stark, 1971
•  Yakimovsky & Feldman, 1973
•  Hansen & Riseman (VISIONS), 1978
•  Barrow & Tenenbaum, 1978
•  Brooks (ACRONYM), 1979
•  Marr, 1982

[Ohta & Kanade 1978]

Grammars for objects and scenes

S.C. Zhu and D. Mumford. A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2006.

3D scenes

We are wired for 3D: our two eyes sit ~6 cm apart

We cannot shut down 3D perception

(c) 2006 Walt Anthony

3D drives perception of important object attributes

by Roger Shepard ("Turning the Tables")

Depth processing is automatic, and we cannot shut it down…

Manhattan World

Coughlan & Yuille, 2003. Slide by James Coughlan
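A crude sketch of the Manhattan-world idea, assuming we only want the camera's compass rotation from edge orientations (Coughlan & Yuille do full Bayesian inference over edge gradients; the mod-90° folding below is only an approximation of that idea):

```python
import numpy as np

def manhattan_angle(edge_angles):
    """Fold edge orientations modulo 90 degrees (Manhattan scenes have two
    dominant, perpendicular directions) and take the circular mean."""
    folded = np.deg2rad(4.0 * (np.asarray(edge_angles) % 90.0))
    mean = np.arctan2(np.sin(folded).mean(), np.cos(folded).mean())
    return np.rad2deg(mean) / 4.0 % 90.0

# Edges of a scene rotated ~15 degrees: mostly 15 and 105 degrees plus noise.
rng = np.random.default_rng(0)
angles = np.concatenate([15 + 3 * rng.standard_normal(100),
                         105 + 3 * rng.standard_normal(100)])
print(manhattan_angle(angles))   # close to 15
```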

Single view metrology. Criminisi et al., 1999

Need to recover:
•  Ground plane
•  Reference height
•  Horizon line
•  Where objects contact the ground
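For a level camera, these quantities tie together in one simple relation: image distances to the horizon scale like world heights, so an object's height follows from its top row, bottom row, the horizon row, and the camera height. A worked sketch with illustrative numbers:

```python
def object_height(v_top, v_bottom, v_horizon, camera_height):
    """Height of an object resting on the ground plane, for a level camera.
    v_* are image rows measured upward from the image bottom. The horizon row
    corresponds to camera height, so world heights scale like image distances
    to the horizon: h / h_camera = (v_top - v_bottom) / (v_horizon - v_bottom)."""
    return camera_height * (v_top - v_bottom) / (v_horizon - v_bottom)

# Feet at row 100, head at row 310, horizon at row 300, camera 1.6 m high:
print(object_height(310, 100, 300, 1.6))   # -> 1.68 (meters)
```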

3D scene context

[Figure: image vs. world coordinates (in meters), with pedestrians ("Ped") and a car placed on the ground plane]

Hoiem, Efros, Hebert, ICCV 2005

Qualitative Results

Initial: 2 TP / 3 FP Final: 7 TP / 4 FP

Local Detector from [Murphy-Torralba-Freeman 2003]

Car: TP / FP Ped: TP / FP

Slide by Derek Hoiem

3D City Modeling using Cognitive Loops

N. Cornelis, B. Leibe, K. Cornelis, L. Van Gool. CVPR'06

3D from pixel values. D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up". SIGGRAPH 2005.

A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image" In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007.

Surface Estimation

[Figure: input image with surface labels — support, vertical, sky; vertical subclasses: left, center, right, porous, solid]

[Hoiem, Efros, Hebert ICCV 2005]

Object Support

[Figure: which surface supports the object?]

Slide by Derek Hoiem

Gupta & Davis, ECCV 2008

Qualitative 3D relationships

Large databases: algorithms that rely on millions of images

Human vision
•  Many input modalities
•  Active
•  Supervised, unsupervised, and semi-supervised learning; it can look for supervision.

Robot vision
•  Many poor input modalities
•  Active, but it does not go far

Internet vision
•  Many input modalities
•  It can reach everywhere
•  Tons of data

Data

The two extremes of learning

[Figure: axis of the number of training samples, from 1 to 10^6 and ∞. Few samples: an extrapolation problem — generalization, diagnostic features. Many samples: an interpolation problem — correspondence, finding the differences.]

Transfer learning, classifiers, priors, label transfer.

Input image Nearest neighbors 

Hays & Efros, SIGGRAPH 2007; Russell, Liu, Torralba, Fergus, Freeman, NIPS 2007; Divvala, Efros, Hebert, 2008; Malisiewicz & Efros, 2008; Torralba, Fergus, Freeman, PAMI 2008; Liu, Yuen, Torralba, CVPR 2009 

What the neighbors transfer (see the sketch below):
•  Labels
•  Depth
•  Motion
•  …
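A minimal sketch of the retrieval-and-transfer recipe these papers share: a global descriptor (a crude stand-in for gist here), nearest-neighbor search, and metadata transfer by voting. The dataset, descriptor, and labels below are all synthetic:

```python
import numpy as np

def toy_descriptor(image):
    """Stand-in for a gist descriptor: a coarse 4x4 grid of mean intensities."""
    H, W = image.shape
    gh, gw = H // 4, W // 4
    return np.array([image[i*gh:(i+1)*gh, j*gw:(j+1)*gw].mean()
                     for i in range(4) for j in range(4)])

def nearest_neighbors(query, database, k=5):
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(1)
images = rng.random((1000, 32, 32))          # pretend collection of 1000 images
descs = np.stack([toy_descriptor(im) for im in images])
labels = rng.choice(["street", "office", "forest"], size=1000)

query = rng.random((32, 32))
nn = nearest_neighbors(toy_descriptor(query), descs, k=5)
# Transfer metadata (here, scene labels) from the neighbors to the query.
transferred, counts = np.unique(labels[nn], return_counts=True)
print(transferred[counts.argmax()])
```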

The power of large collections

Google Street View (controlled image capture); Photo Tourism / Photosynth [Snavely et al., 2006] (register images based on multi-view geometry)

Image completion

Instead, generate proposals using millions of images

Hays, Efros, 2007

Input 16 nearest neighbors (gist+color matching)

output

im2gps: instead of using object labels, the web provides other kinds of metadata associated with large collections of images

Hays & Efros. CVPR 2008

20 million geotagged and geographic text-labeled images

Hays & Efros. CVPR 2008 im2gps

Input image Nearest neighbors Geographic location of the nearest neighbors
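The geolocation step works the same way: let the GPS tags of the retrieved neighbors vote. A toy stand-in for the mode-finding (im2gps uses mean-shift over the neighbors' coordinates; the coarse grid voting below only approximates that idea, with made-up coordinates):

```python
import numpy as np

def geolocate(neighbor_coords, grid=5.0):
    """Estimate a location from the neighbors' (lat, lon): vote on a coarse
    grid, then average within the winning cell."""
    cells = np.floor(neighbor_coords / grid)
    keys, inverse, counts = np.unique(cells, axis=0,
                                      return_inverse=True, return_counts=True)
    best = counts.argmax()
    return neighbor_coords[inverse == best].mean(axis=0)

# Hypothetical GPS tags of retrieved neighbors: most cluster near Paris.
coords = np.array([[48.9, 2.3], [48.8, 2.4], [48.7, 2.2],   # Paris-ish
                   [40.7, -74.0],                           # one outlier (NYC)
                   [48.6, 2.5]])
print(geolocate(coords))   # -> approximately [48.75, 2.35]
```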

Predicting events

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

[Figure sequence: query image → retrieved video → synthesized video, repeated for several queries]

Datasets and the powers of 10

10^0 images

1972

10^1 images

Marr, 1976

10^2-10^4 images

In 1996, DARPA released 14,000 images from over 1,000 individuals (the FERET face database).

The faces and cars scale

The PASCAL Visual Object Classes  

M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, 2007

In 2007, the twenty selected object classes were:

Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

10^2-10^4 images

10^5 images

Caltech 101 and 256 

Fei-Fei, Fergus, Perona, 2004 (Caltech 101); Griffin, Holub, Perona, 2007 (Caltech 256)

10^5 images

Lotus Hill Research Institute image corpus 

Z.Y. Yao, X. Yang, and S.C. Zhu, 2007

LabelMe: 10^5 images

B.C. Russell, A. Torralba, K.P. Murphy, W.T. Freeman, IJCV 2008. labelme.csail.mit.edu

The tool went online July 1st, 2005; 530,000 object annotations collected.

Extreme labeling

The other extreme of extreme labeling

… things do not always look good…

Creative testing

10^5 images

10^6-10^7 images

Things start getting out of hand

Collecting big datasets

•  ESP game (CMU): Luis von Ahn and Laura Dabbish, 2004

•  LabelMe (MIT): Russell, Torralba, Freeman, 2005

•  StreetScenes (CBCL-MIT): Bileschi, Poggio, 2006

•  WhatWhere (Caltech): Perona et al., 2007

•  PASCAL challenge: 2006, 2007

•  Lotus Hill Institute: Song-Chun Zhu et al., 2007

•  80 million images: Torralba, Fergus, Freeman, 2007

10^6-10^7 images

80,000,000 images; 75,000 non-abstract nouns from WordNet; 7 online image search engines

Google: 80 million images

And after 1 year of downloading images…

A. Torralba, R. Fergus, W.T. Freeman. PAMI 2008

10^6-10^7 images

~10^5+ nodes, ~10^8+ images

[WordNet hierarchy excerpt: animal → shepherd dog, sheep dog → German shepherd, collie]

Deng, Dong, Socher, Li & Fei-Fei, CVPR 2009

10^6-10^7 images

Alexander Sorokin, David Forsyth, "Utility data annotation with Amazon Mechanical Turk", First IEEE Workshop on Internet Vision at CVPR 08.

Labeling for money

1 cent. Task: label one object in this image.

Why do people do this? 

From: John Smith <…@yahoo.co.in>
Date: August 22, 2009 10:18:23 AM EDT
To: Bryan Russell
Subject: Re: Regarding Amazon Mechanical Turk HIT RX5WVKGA9W

Dear Mr. Bryan,
I am awaiting for your HITS. Please help us with more.

Thanks & Regards

10^6-10^7 images

10^8-10^11 images

Canonical Perspective 

From Vision Science, Palmer

Examples of canonical perspective:

In a recognition task, reaction time correlated with the ratings.

Canonical views are recognized faster at the entry level.

3D object categorization 

by Greg Robbins

Although we can categorize all three pictures as views of a horse, they do not look like equally typical views of horses, and they do not seem equally easy to recognize.

Canonical Viewpoint 

Viewpoints are not sampled uniformly (some artificial datasets may contain non-natural viewpoint statistics)

10^8-10^11 images

Interesting biases…

Canonical Viewpoint 

Clocks are preferred in a purely frontal view

10^8-10^11 images

Interesting biases…

>10^11 images

? ?