NIPS2009: Understand Visual Scenes - Part 2

Transcript
Page 1: NIPS2009: Understand Visual Scenes - Part 2

A car out of context … 

Page 2: NIPS2009: Understand Visual Scenes - Part 2

Modeling object co‐occurrences 

Page 3: NIPS2009: Understand Visual Scenes - Part 2


What are the hidden objects? 

Page 4: NIPS2009: Understand Visual Scenes - Part 2

What are the hidden objects? 

Chance ~ 1/30000

Page 5: NIPS2009: Understand Visual Scenes - Part 2

p(O | I) ∝ p(I | O) p(O)

Object model Context model

image objects

Page 6: NIPS2009: Understand Visual Scenes - Part 2

p(O | I) ∝ p(I | O) p(O)

Object model Context model

Full joint Scene model Approx. joint

Page 7: NIPS2009: Understand Visual Scenes - Part 2

p(O | I) ∝ p(I | O) p(O)

Object model Context model

Full joint Scene model Approx. joint

Page 8: NIPS2009: Understand Visual Scenes - Part 2

p(O | I) ∝ p(I | O) p(O)

Object model Context model

Full joint Scene model

p(O) = Σ_s p(S=s) Π_i p(O_i | S=s)

Approx. joint

office street
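The scene-mixture prior, p(O) = Σ_s p(S=s) Π_i p(O_i | S=s), can be sketched in a few lines. The two scene categories, their priors, and the conditional probabilities below are invented illustrative numbers, not values from the tutorial.

```python
# Sketch of the scene-mixture approximation to the joint object prior:
# p(O) = sum_s p(S=s) * prod_i p(O_i | S=s)
# All numbers are illustrative assumptions.

scenes = {"office": 0.5, "street": 0.5}  # p(S=s)

# p(O_i = present | S=s) for each object, per scene
cond = {
    "office": {"desk": 0.9, "car": 0.05},
    "street": {"desk": 0.05, "car": 0.8},
}

def prior(objects):
    """p(O) for a configuration mapping object name -> present (True/False)."""
    total = 0.0
    for s, p_s in scenes.items():
        prod = 1.0
        for obj, present in objects.items():
            p = cond[s][obj]
            prod *= p if present else (1.0 - p)
        total += p_s * prod
    return total

# Objects that never share a scene get a low joint prior:
print(prior({"desk": True, "car": False}))  # likely configuration
print(prior({"desk": True, "car": True}))   # unlikely configuration
```

Conditioning on the scene makes the objects independent within each mixture component, so the full joint over objects is approximated with only per-scene marginals.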

Page 9: NIPS2009: Understand Visual Scenes - Part 2

p(O | I) ∝ p(I | O) p(O)

Object model Context model

Full joint Scene model Approx. joint

Page 10: NIPS2009: Understand Visual Scenes - Part 2

Pixel labeling using MRFs 

Enforce consistency between neighboring labels, and between labels and pixels 

Carbonetto, de Freitas & Barnard, ECCV’04

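A minimal sketch of the MRF idea (not the Carbonetto et al. model itself): iterated conditional modes on a 4-connected grid, trading agreement with per-pixel evidence against agreement with neighboring labels. The Potts weight `lam`, the toy unary scores, and the 8×8 example are assumptions for illustration.

```python
import numpy as np

def icm(unary, lam=1.0, iters=10):
    """Iterated conditional modes on a 4-connected grid MRF.
    unary: (H, W, K) array of per-pixel label scores (e.g. log-likelihoods).
    A Potts pairwise term rewards neighbors that share a label."""
    H, W, K = unary.shape
    labels = unary.argmax(axis=2)  # start from independent per-pixel decisions
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                scores = unary[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        scores[labels[ny, nx]] += lam  # neighbor agreement bonus
                labels[y, x] = scores.argmax()
    return labels

# A noisy 2-label image: smoothing flips isolated pixels to match neighbors.
rng = np.random.default_rng(0)
truth = np.zeros((8, 8), dtype=int)
truth[:, 4:] = 1
unary = np.stack([(truth == k) * 2.0 for k in (0, 1)], axis=2)
unary += rng.normal(0, 1.5, unary.shape)  # noisy per-pixel evidence
smoothed = icm(unary, lam=1.0)
```

ICM is the simplest coordinate-ascent inference for such models; the papers cited on this slide use richer potentials and approximate inference, but the consistency-between-neighbors idea is the same.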

Page 11: NIPS2009: Understand Visual Scenes - Part 2

Object‐Object Relationships 

Use latent variables to induce long-distance correlations between labels in a Conditional Random Field (CRF) 

He, Zemel & Carreira-Perpinan (04)

Page 12: NIPS2009: Understand Visual Scenes - Part 2

Object‐Object Relationships 

[Kumar Hebert 2005]

Page 13: NIPS2009: Understand Visual Scenes - Part 2

•  Fink & Perona (NIPS 03): use the output of boosting from other objects at previous iterations as input into boosting for this iteration 

Object‐Object Relationships 

Page 14: NIPS2009: Understand Visual Scenes - Part 2

Object‐Object Relationships 

Building, boat, motorbike

Building, boat, person

Water, sky

Road

Most consistent labeling according to object co-occurrences & local label probabilities.

Boat

Building

Water

Road

A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora and S. Belongie. Objects in Context. ICCV 2007

Page 15: NIPS2009: Understand Visual Scenes - Part 2


Objects in Context: Contextual Refinement 

Contextual model based on co-occurrences: try to find the most consistent labeling, one with high posterior probability and high mean pairwise interaction. A CRF is used for this purpose.

Boat

Building

Water

Road

Independent segment classification Mean interaction of all label pairs

Φ(i,j) is simply the observed label co-occurrence frequency in the training set.
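The refinement criterion can be sketched as follows, a simplification of the CRF objective: each candidate labeling is scored by its summed local label probabilities plus the mean pairwise co-occurrence Φ over all label pairs. The label set, Φ values, and segment probabilities below are invented for illustration.

```python
import itertools

labels = ["boat", "building", "water", "road", "motorbike"]

# Phi(i, j): label co-occurrence frequencies observed in a training set
# (illustrative values -- "boat" co-occurs with "water", rarely with "road").
phi = {frozenset(p): 0.05 for p in itertools.combinations(labels, 2)}
phi[frozenset(("boat", "water"))] = 0.9
phi[frozenset(("building", "road"))] = 0.8
phi[frozenset(("building", "water"))] = 0.6

def score(assignment, local_prob):
    """Local evidence plus mean pairwise interaction over all label pairs."""
    local = sum(local_prob[seg][lab] for seg, lab in assignment.items())
    pairs = list(itertools.combinations(assignment.values(), 2))
    pairwise = sum(phi.get(frozenset(p), 1.0 if p[0] == p[1] else 0.0)
                   for p in pairs) / max(len(pairs), 1)
    return local + pairwise

# Two candidate labelings of three segments: even though "motorbike" scores
# slightly higher locally, context prefers "boat" next to water and building.
local_prob = {
    "seg1": {"boat": 0.4, "motorbike": 0.45},
    "seg2": {"water": 0.9},
    "seg3": {"building": 0.8},
}
a = {"seg1": "boat", "seg2": "water", "seg3": "building"}
b = {"seg1": "motorbike", "seg2": "water", "seg3": "building"}
```

Exhaustively scoring all labelings is exponential in the number of segments; the actual system runs CRF inference, but this shows how context can overturn a locally preferred label.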

Page 16: NIPS2009: Understand Visual Scenes - Part 2

Using stuff to find things
Heitz and Koller, ECCV 2008

In this work, there is no labeling for "stuff". Instead, they look for clusters of textures and model how each cluster correlates with the target object.

Page 17: NIPS2009: Understand Visual Scenes - Part 2

What, where and who? Classifying events by scene and object recognition

L.-J. Li & L. Fei-Fei, ICCV 2007. Slide by Fei-Fei

Page 18: NIPS2009: Understand Visual Scenes - Part 2

what who where

L.-J. Li & L. Fei-Fei, ICCV 2007. Slide by Fei-Fei

Page 19: NIPS2009: Understand Visual Scenes - Part 2

Grammars

•  Guzman (SEE), 1968
•  Noton and Stark, 1971
•  Hansen & Riseman (VISIONS), 1978
•  Barrow & Tenenbaum, 1978
•  Brooks (ACRONYM), 1979
•  Marr, 1982
•  Yakimovsky & Feldman, 1973

[Ohta & Kanade 1978]

Page 20: NIPS2009: Understand Visual Scenes - Part 2

Grammars for objects and scenes

S.C. Zhu and D. Mumford. A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2006.

Page 21: NIPS2009: Understand Visual Scenes - Part 2

3D scenes

Page 22: NIPS2009: Understand Visual Scenes - Part 2

We are wired for 3D ~6cm

Page 23: NIPS2009: Understand Visual Scenes - Part 2

We cannot shut down 3D perception

(c) 2006 Walt Anthony

Page 24: NIPS2009: Understand Visual Scenes - Part 2

3D drives perception of important object attributes

by Roger Shepard (“Turning the Tables”)

Depth processing is automatic, and we cannot shut it down…

Page 25: NIPS2009: Understand Visual Scenes - Part 2

Coughlan, Yuille. 2003 Slide by James Coughlan

Manhattan World

Page 26: NIPS2009: Understand Visual Scenes - Part 2

Coughlan, Yuille. 2003 Slide by James Coughlan

Page 27: NIPS2009: Understand Visual Scenes - Part 2

Slide by James Coughlan Coughlan, Yuille. 2003

Page 28: NIPS2009: Understand Visual Scenes - Part 2

Single view metrology
Criminisi et al., 1999

Need to recover:
•  Ground plane
•  Reference height
•  Horizon line
•  Where objects contact the ground
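One payoff of recovering these elements is object height from a single view: for an object resting on the ground plane, the ratio of its height to the camera height equals the ratio of its image extent to its bottom-to-horizon distance (image v-coordinates measured downward). A sketch of that standard ratio, with invented camera height and coordinates:

```python
def object_height(v_top, v_bottom, v_horizon, camera_height):
    """Height of an object standing on the ground plane, from a single view.
    Image v-coordinates increase downward; the horizon is the image of the
    plane at camera height, so an object whose top reaches the horizon line
    is exactly as tall as the camera is high."""
    return camera_height * (v_bottom - v_top) / (v_bottom - v_horizon)

# A person whose top touches the horizon has the camera's height:
print(object_height(v_top=200, v_bottom=350, v_horizon=200,
                    camera_height=1.6))  # prints 1.6
# An object spanning half the distance from its base to the horizon:
print(object_height(v_top=275, v_bottom=350, v_horizon=200,
                    camera_height=1.6))  # prints 0.8
```

This is why the ground contact point matters: the same image extent implies a very different metric height depending on where the object touches the ground relative to the horizon.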

Page 29: NIPS2009: Understand Visual Scenes - Part 2

3d Scene Context

Image World

Hoiem, Efros, Hebert ICCV 2005

Page 30: NIPS2009: Understand Visual Scenes - Part 2

3D scene context

[Figure: detected pedestrians ("Ped") and a car placed in world coordinates, axes in meters]

Hoiem, Efros, Hebert ICCV 2005

Page 31: NIPS2009: Understand Visual Scenes - Part 2

Qualitative Results

Initial: 2 TP / 3 FP Final: 7 TP / 4 FP

Local Detector from [Murphy-Torralba-Freeman 2003]

Car: TP / FP Ped: TP / FP

Slide by Derek Hoiem

Page 32: NIPS2009: Understand Visual Scenes - Part 2

3D City Modeling using Cognitive Loops

N. Cornelis, B. Leibe, K. Cornelis, L. Van Gool. CVPR'06

Page 33: NIPS2009: Understand Visual Scenes - Part 2

3D from pixel values

D. Hoiem, A.A. Efros, and M. Hebert. "Automatic Photo Pop-up". SIGGRAPH 2005.

A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image" In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007.

Page 34: NIPS2009: Understand Visual Scenes - Part 2

Surface Estimation

Image Support Vertical Sky

V-Left V-Center V-Right V-Porous V-Solid

[Hoiem, Efros, Hebert ICCV 2005]

Object: Surface? Support?

Slide by Derek Hoiem

Page 35: NIPS2009: Understand Visual Scenes - Part 2

Object Support

Slide by Derek Hoiem

Page 36: NIPS2009: Understand Visual Scenes - Part 2

Gupta & Davis, ECCV 2008

Qualitative 3D relationships

Page 37: NIPS2009: Understand Visual Scenes - Part 2

Large databases

Algorithms that rely on millions of images

Page 38: NIPS2009: Understand Visual Scenes - Part 2

Human vision
•  Many input modalities
•  Active
•  Supervised, unsupervised, semi-supervised learning. It can look for supervision.

Robot vision
•  Many poor input modalities
•  Active, but it does not go far

Internet vision
•  Many input modalities
•  It can reach everywhere
•  Tons of data

Data

Page 39: NIPS2009: Understand Visual Scenes - Part 2

The two extremes of learning

Number of training samples

1  10  10²  10³  10⁴  10⁵

Extrapolation problem Generalization

Diagnostic features

Interpolation problem Correspondence

Finding the differences

10⁶  ∞

Transfer learning Classifiers

Priors Label transfer

Page 40: NIPS2009: Understand Visual Scenes - Part 2

Input image Nearest neighbors 

Hays, Efros, SIGGRAPH 2007
Russell, Liu, Torralba, Fergus, Freeman, NIPS 2007
Divvala, Efros, Hebert, 2008
Malisiewicz, Efros, 2008
Torralba, Fergus, Freeman, PAMI 2008
Liu, Yuen, Torralba, CVPR 2009

•  Labels
•  Depth
•  Motion
•  …

Page 41: NIPS2009: Understand Visual Scenes - Part 2

The power of large collections

Google Street View (controlled image capture)
PhotoTourism/PhotoSynth [Snavely et al., 2006] (register images based on multi-view geometry)

Page 42: NIPS2009: Understand Visual Scenes - Part 2

Image completion

Instead, generate proposals using millions of images

Hays, Efros, 2007

Input 16 nearest neighbors (gist+color matching)

output

Page 43: NIPS2009: Understand Visual Scenes - Part 2

im2gps: Instead of using object labels, the web provides other kinds of metadata associated with large collections of images

Hays & Efros. CVPR 2008

20 million geotagged and geographic text-labeled images

Page 44: NIPS2009: Understand Visual Scenes - Part 2

Hays & Efros. CVPR 2008 im2gps

Input image Nearest neighbors Geographic location of the nearest neighbors
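The transfer step can be sketched as nearest-neighbor retrieval plus geotag voting: match the query's global descriptor against the database and average the neighbors' GPS tags. The 4-D descriptors, coordinates, and plain Euclidean distance below are toy stand-ins for the gist and color features used over millions of geotagged photos.

```python
import math

# Toy database of (descriptor, (lat, lon)) pairs. Real im2gps uses global
# image descriptors (gist + color) over millions of geotagged images; these
# 4-D vectors and locations are invented for illustration.
database = [
    ([0.90, 0.10, 0.20, 0.70], (48.86, 2.35)),    # Paris-like street scene
    ([0.85, 0.15, 0.25, 0.65], (48.85, 2.34)),    # another Paris-like scene
    ([0.10, 0.90, 0.80, 0.20], (36.06, -112.14)), # canyon-like scene
]

def dist(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def im2gps_estimate(query, k=2):
    """Mean lat/lon of the k nearest neighbors of the query descriptor."""
    neighbors = sorted(database, key=lambda item: dist(query, item[0]))[:k]
    lats = [g[0] for _, g in neighbors]
    lons = [g[1] for _, g in neighbors]
    return sum(lats) / k, sum(lons) / k

# A query resembling the street scenes lands near their shared location:
lat, lon = im2gps_estimate([0.88, 0.12, 0.22, 0.68])
```

With millions of images, the neighbors' geotags form a multimodal distribution over the globe; averaging is the crudest possible vote, and the actual system uses mean-shift-style clustering of the neighbor locations instead.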

Page 45: NIPS2009: Understand Visual Scenes - Part 2

Predicting events

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 46: NIPS2009: Understand Visual Scenes - Part 2

Predicting events

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 47: NIPS2009: Understand Visual Scenes - Part 2

Query

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 48: NIPS2009: Understand Visual Scenes - Part 2

Retrieved video Query

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 49: NIPS2009: Understand Visual Scenes - Part 2

Retrieved video

Synthesized video

Query

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 50: NIPS2009: Understand Visual Scenes - Part 2

Retrieved video

Synthesized video

Query

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 51: NIPS2009: Understand Visual Scenes - Part 2

Synthesized video

Query

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 52: NIPS2009: Understand Visual Scenes - Part 2

Retrieved video

Synthesized video

Query

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008

Page 53: NIPS2009: Understand Visual Scenes - Part 2

Databases and the powers of 10

Page 54: NIPS2009: Understand Visual Scenes - Part 2

Datasets and Powers of 10

Page 55: NIPS2009: Understand Visual Scenes - Part 2

0 images

Page 56: NIPS2009: Understand Visual Scenes - Part 2

10⁰ images

1972

Page 57: NIPS2009: Understand Visual Scenes - Part 2

10¹ images

Page 58: NIPS2009: Understand Visual Scenes - Part 2

10¹ images

Marr, 1976

Page 59: NIPS2009: Understand Visual Scenes - Part 2

10²–10⁴ images

Page 60: NIPS2009: Understand Visual Scenes - Part 2

10²–10⁴ images

In 1996, DARPA released 14,000 images from over 1,000 individuals.

The faces and cars scale

Page 61: NIPS2009: Understand Visual Scenes - Part 2

The PASCAL Visual Object Classes  

M. Everingham, L. van Gool, C. Williams, J. Winn, A. Zisserman, 2007

In 2007, the twenty object classes selected were:

•  Person: person
•  Animal: bird, cat, cow, dog, horse, sheep
•  Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
•  Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

Page 62: NIPS2009: Understand Visual Scenes - Part 2

10²–10⁴ images

Page 63: NIPS2009: Understand Visual Scenes - Part 2

10⁵ images

Page 64: NIPS2009: Understand Visual Scenes - Part 2

Caltech 101 and 256 

Griffin, Holub, Perona, 2007
Fei-Fei, Fergus, Perona, 2004

10⁵ images

Page 65: NIPS2009: Understand Visual Scenes - Part 2

Lotus Hill Research Institute image corpus 

Z.Y. Yao, X. Yang, and S.C. Zhu, 2007

Page 66: NIPS2009: Understand Visual Scenes - Part 2

B.C. Russell, A. Torralba, K.P. Murphy, W.T. Freeman, IJCV 2008 Labelme.csail.mit.edu

The tool went online July 1st, 2005; 530,000 object annotations have been collected.

LabelMe: 10⁵ images

Page 67: NIPS2009: Understand Visual Scenes - Part 2

Extreme labeling

Page 68: NIPS2009: Understand Visual Scenes - Part 2

The other extreme of extreme labeling

… things do not always look good…

Page 69: NIPS2009: Understand Visual Scenes - Part 2

Creative testing

Page 70: NIPS2009: Understand Visual Scenes - Part 2
Page 71: NIPS2009: Understand Visual Scenes - Part 2

10⁵ images

Page 72: NIPS2009: Understand Visual Scenes - Part 2

10⁶–10⁷ images

Things start getting out of hand

Page 73: NIPS2009: Understand Visual Scenes - Part 2

Collecting big datasets

•  ESP game (CMU) Luis Von Ahn and Laura Dabbish 2004

•  LabelMe (MIT) Russell, Torralba, Freeman, 2005

•  StreetScenes (CBCL-MIT) Bileschi, Poggio, 2006

•  WhatWhere (Caltech) Perona et al, 2007

•  PASCAL challenge 2006, 2007

•  Lotus Hill Institute Song-Chun Zhu et al, 2007

•  80 million images Torralba, Fergus, Freeman, 2007

10⁶–10⁷ images

Page 74: NIPS2009: Understand Visual Scenes - Part 2

80,000,000 images
75,000 non-abstract nouns from WordNet
7 online image search engines

Google: 80 million images

And after 1 year of downloading images…

A. Torralba, R. Fergus, W.T. Freeman. PAMI 2008

10⁶–10⁷ images

Page 75: NIPS2009: Understand Visual Scenes - Part 2

~10⁵+ nodes, ~10⁸+ images

shepherd dog, sheep dog

German shepherd collie animal

Deng, Dong, Socher, Li & Fei-Fei, CVPR 2009

10⁶–10⁷ images

Page 76: NIPS2009: Understand Visual Scenes - Part 2

Alexander Sorokin, David Forsyth, "Utility data annotation with Amazon Mechanical Turk", First IEEE Workshop on Internet Vision at CVPR 08.

Labeling for money

Page 77: NIPS2009: Understand Visual Scenes - Part 2

1 cent Task: Label one object in this image

Page 78: NIPS2009: Understand Visual Scenes - Part 2

1 cent Task: Label one object in this image

Page 79: NIPS2009: Understand Visual Scenes - Part 2

Why do people do this? 

From: John Smith <…@yahoo.co.in>
Date: August 22, 2009 10:18:23 AM EDT
To: Bryan Russell
Subject: Re: Regarding Amazon Mechanical Turk HIT RX5WVKGA9W 

Dear Mr. Bryan, I am awaiting for your HITS. Please help us with more. 

Thanks & Regards 

Page 80: NIPS2009: Understand Visual Scenes - Part 2

10⁶–10⁷ images

Page 81: NIPS2009: Understand Visual Scenes - Part 2

10⁸–10¹¹ images

Page 82: NIPS2009: Understand Visual Scenes - Part 2

10⁸–10¹¹ images

Page 83: NIPS2009: Understand Visual Scenes - Part 2

10⁸–10¹¹ images

Page 84: NIPS2009: Understand Visual Scenes - Part 2

Canonical Perspective 

From Vision Science, Palmer

Examples of canonical perspective:

In a recognition task, reaction time correlated with the ratings.

Canonical views are recognized faster at the entry level.

Page 85: NIPS2009: Understand Visual Scenes - Part 2

3D object categorization 

by Greg Robbins

Although we can categorize all three pictures as views of a horse, the three pictures do not look equally typical as views of horses, nor do they seem equally easy to recognize.

Page 86: NIPS2009: Understand Visual Scenes - Part 2

Canonical Viewpoint 

Viewpoint sampling is not uniform (some artificial datasets might contain non-natural statistics)

10⁸–10¹¹ images

Interesting biases…

Page 87: NIPS2009: Understand Visual Scenes - Part 2

Canonical Viewpoint 

Clocks are preferred in a purely frontal view

10⁸–10¹¹ images

Interesting biases…

Page 88: NIPS2009: Understand Visual Scenes - Part 2

>10¹¹ images


Page 89: NIPS2009: Understand Visual Scenes - Part 2