A car out of context …
Modeling object co‐occurrences
What are the hidden objects?
Chance ~ 1/30000
p(O | I) ∝ p(I | O) p(O)
Object model Context model
image objects
p(O | I) ∝ p(I | O) p(O)
Object model Context model
Full joint Scene model Approx. joint
p(O | I) ∝ p(I | O) p(O)
Object model Context model
Full joint Scene model
p(O) = Σ_s [ Π_i p(O_i | S=s) ] p(S=s)
Approx. joint
office street
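The scene-marginal prior p(O) = Σ_s [ Π_i p(O_i | S=s) ] p(S=s) can be sketched numerically. A minimal example, with made-up probabilities for two scenes (office, street) and three hypothetical object classes; objects are conditionally independent given the scene, as on the slide:

```python
import numpy as np

# Scene-conditioned prior over object presence:
#   p(O) = sum_s [ prod_i p(O_i | S=s) ] p(S=s)
# Hypothetical numbers: two scenes, three objects (screen, car, keyboard).
p_scene = {"office": 0.5, "street": 0.5}          # p(S=s)
p_obj_given_scene = {                             # p(O_i=1 | S=s)
    "office": np.array([0.9, 0.05, 0.8]),
    "street": np.array([0.05, 0.7, 0.01]),
}

def prior(o):
    """p(O=o) for a binary presence vector o, marginalizing out the scene."""
    total = 0.0
    for s, ps in p_scene.items():
        p = p_obj_given_scene[s]
        # prod_i p(O_i = o_i | S = s); O_i are independent given S
        total += np.prod(np.where(o, p, 1 - p)) * ps
    return total

# An "office-like" configuration (screen + keyboard, no car) comes out
# far more likely than a mixed one (screen + car):
p_office_like = prior(np.array([1, 0, 1]))
p_mixed = prior(np.array([1, 1, 0]))
```

Even though the objects are independent within each scene, marginalizing over scenes induces strong correlations between them, which is the point of the approximation.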
Pixel labeling using MRFs
Enforce consistency between neighboring labels, and between labels and pixels
Carbonetto, de Freitas & Barnard, ECCV’04
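The consistency constraints above are usually written as an energy with unary (label vs. pixel evidence) and pairwise (neighboring label) terms. A toy sketch, not the model of Carbonetto et al.: a Potts-style grid MRF minimized with iterated conditional modes (ICM), with random numbers standing in for real pixel evidence:

```python
import numpy as np

# Toy grid MRF: each site pays a unary cost (label vs. pixel evidence)
# plus a Potts pairwise cost for disagreeing with each neighbor.
rng = np.random.default_rng(0)
H, W, L = 8, 8, 2
unary = rng.random((H, W, L))   # stand-in for -log p(label | pixel)
lam = 0.5                       # smoothness weight

def icm(unary, lam, iters=10):
    """Iterated conditional modes: greedily relabel one site at a time."""
    labels = unary.argmin(axis=2)       # start from the per-pixel best label
    H, W, L = unary.shape
    for _ in range(iters):
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # pay lam for each neighbor with a different label
                        costs += lam * (np.arange(L) != labels[ni, nj])
                labels[i, j] = costs.argmin()
    return labels

labels = icm(unary, lam)
```

ICM is the simplest possible inference here; real systems use graph cuts or loopy belief propagation, but the energy being minimized has the same two-term shape.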
Object-Object Relationships
Use latent variables to induce long-distance correlations between labels in a Conditional Random Field (CRF)
He, Zemel & Carreira-Perpinan (04)
Object-Object Relationships
[Kumar & Hebert 2005]
• Fink & Perona (NIPS 03): use the output of boosting from other objects at previous iterations as input into boosting for this iteration
Object-Object Relationships
Building, boat, motorbike
Building, boat, person
Water, sky
Road
Most consistent labeling according to object co-occurrences and local label probabilities.
Boat
Building
Water
Road
A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora and S. Belongie. Objects in Context. ICCV 2007
Objects in Context: Contextual Refinement
Contextual model based on co-occurrences: find the most consistent labeling with high posterior probability and high mean pairwise interaction. A CRF is used for this purpose.
Boat
Building
Water
Road
Independent segment classification Mean interaction of all label pairs
Φ(i,j) is essentially the observed label co-occurrences in the training set.
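One way to picture the search: score each candidate labeling by its independent-classification log-probability plus the log of the mean pairwise co-occurrence Φ(i,j), and keep the best. All numbers below are invented, and exhaustive search stands in for the CRF inference of the actual paper:

```python
import numpy as np
from itertools import product

# Hypothetical classes and co-occurrence counts from a made-up training set
classes = ["boat", "building", "water", "road"]
phi = np.array([
    [5, 20, 40, 2],
    [20, 30, 15, 25],
    [40, 15, 10, 8],
    [2, 25, 8, 12],
], dtype=float)
phi /= phi.sum()                # normalize counts into co-occurrence weights

# Local (independent) class probabilities for 3 image segments
local = np.array([
    [0.3, 0.4, 0.2, 0.1],       # segment 0: classifier slightly prefers building
    [0.5, 0.2, 0.2, 0.1],       # segment 1: prefers boat
    [0.2, 0.1, 0.6, 0.1],       # segment 2: prefers water
])

def score(labeling, alpha=1.0):
    """Independent log-probability + alpha * log mean pairwise interaction."""
    s = sum(np.log(local[i, c]) for i, c in enumerate(labeling))
    n = len(labeling)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_phi = np.mean([phi[labeling[i], labeling[j]] for i, j in pairs])
    return s + alpha * np.log(mean_phi)

# Exhaustive search over all 4^3 labelings (feasible only at toy scale)
best = max(product(range(4), repeat=3), key=score)
best_labels = [classes[c] for c in best]
```

The co-occurrence term can override a weak local classifier, which is exactly how "boat" survives next to "water" while an isolated "car on water" label would be suppressed.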
Using stuff to find things Heitz and Koller, ECCV 2008
In this work, there is no labeling for stuff. Instead, they look for clusters of textures and model how each cluster correlates with the target object.
What, where and who? Classifying events by scene and object recognition
L.-J. Li & L. Fei-Fei, ICCV 2007. Slide by Fei-Fei
what who where
Grammars
Guzman (SEE), 1968; Noton and Stark, 1971; Yakimovsky & Feldman, 1973; Hansen & Riseman (VISIONS), 1978; Barrow & Tenenbaum, 1978; Brooks (ACRONYM), 1979; Marr, 1982
[Ohta & Kanade 1978]
Grammars for objects and scenes
S.C. Zhu and D. Mumford. A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2006.
3D scenes
We are wired for 3D ~6cm
We cannot shut down 3D perception
(c) 2006 Walt Anthony
3D drives perception of important object attributes
by Roger Shepard (”Turning the Tables”)
Depth processing is automatic, and we cannot shut it down…
Coughlan, Yuille. 2003 Slide by James Coughlan
Manhattan World
Coughlan, Yuille. 2003 Slide by James Coughlan
Slide by James Coughlan Coughlan, Yuille. 2003
Single view metrology Criminisi, et al. 1999
Need to recover: • Ground plane • Reference height • Horizon line • Where objects contact the ground
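For a level camera, the listed quantities combine into a simple height estimate: an object resting on the ground plane with image top v_top and bottom v_bottom (rows measured upward from the image bottom) and horizon row v_horizon has world height h = h_cam · (v_top − v_bottom) / (v_horizon − v_bottom). A sketch with made-up numbers:

```python
def object_height(cam_height, v_horizon, v_bottom, v_top):
    """World height of a ground-supported object, assuming a level pinhole
    camera; image rows are measured upward from the image bottom."""
    return cam_height * (v_top - v_bottom) / (v_horizon - v_bottom)

# Camera 1.5 m above the ground, horizon at row 300; an object whose
# image spans rows 100 (bottom) to 220 (top):
h = object_height(1.5, 300, 100, 220)   # 1.5 * 120 / 200 = 0.9 m
```

Note how everything is a ratio of image measurements scaled by one known world length (the camera height), which is why a single reference height suffices.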
3d Scene Context
Image World
Hoiem, Efros, Hebert ICCV 2005
3D scene context
[Figure: pedestrian and car detections placed at metric distances in the scene]
Hoiem, Efros, Hebert ICCV 2005
Qualitative Results
Initial: 2 TP / 3 FP Final: 7 TP / 4 FP
Local Detector from [Murphy-Torralba-Freeman 2003]
Car: TP / FP Ped: TP / FP
Slide by Derek Hoiem
3D City Modeling using Cognitive Loops
N. Cornelis, B. Leibe, K. Cornelis, L. Van Gool. CVPR'06
3D from pixel values. D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up". SIGGRAPH 2005.
A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image" In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007.
Surface Estimation
Image Support Vertical Sky
V-Left V-Center V-Right V-Porous V-Solid
[Hoiem, Efros, Hebert ICCV 2005]
Object Surface?
Support? Slide by Derek Hoiem
Object Support
Slide by Derek Hoiem
Gupta & Davis, ECCV, 2008
Qualitative 3D relationships
Large databases: algorithms that rely on millions of images
Human vision • Many input modalities • Active • Supervised, unsupervised, semi-supervised learning. It can look for supervision.
Robot vision • Many poor input modalities • Active, but it does not go far
Internet vision • Many input modalities • It can reach everywhere • Tons of data
Data
The two extremes of learning
Number of training samples
1 10 10² 10³ 10⁴ 10⁵
Extrapolation problem Generalization
Diagnostic features
Interpolation problem Correspondence
Finding the differences
10⁶ ∞
Transfer learning Classifiers
Priors Label transfer
Input image Nearest neighbors
Hays & Efros, SIGGRAPH 2007; Russell, Liu, Torralba, Fergus, Freeman, NIPS 2007; Divvala, Efros, Hebert, 2008; Malisiewicz & Efros, 2008; Torralba, Fergus, Freeman, PAMI 2008; Liu, Yuen, Torralba, CVPR 2009
• Labels
• Depth
• Motion
• …
The power of large collections
Google Street View (controlled image capture)
Photo Tourism / PhotoSynth [Snavely et al., 2006] (register images based on multi-view geometry)
Image completion
Instead, generate proposals using millions of images
Hays, Efros, 2007
Input 16 nearest neighbors (gist+color matching)
output
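The retrieval step can be sketched as plain nearest-neighbor search over global descriptors. Here random vectors stand in for real gist+color features, and a tiny database stands in for the millions of images a real system would index:

```python
import numpy as np

# Nearest-neighbor scene matching: one global descriptor per image
# (random stand-ins for gist + color features), L2 distance, top k.
rng = np.random.default_rng(1)
database = rng.random((200, 512))   # 200 fake image descriptors
query = rng.random(512)

def nearest(query, database, k=16):
    """Indices of the k database descriptors closest to the query."""
    d = np.linalg.norm(database - query, axis=1)   # distance to every image
    return np.argsort(d)[:k]                       # k smallest distances

idx = nearest(query, database)    # the "16 nearest neighbors" of the slide
```

At the scale of millions of images, brute-force distances give way to approximate indexing, but the matching criterion stays this simple.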
im2gps: instead of using object labels, the web provides other kinds of metadata associated with large collections of images
Hays & Efros. CVPR 2008
20 million geotagged and geographic text-labeled images
Hays & Efros. CVPR 2008 im2gps
Input image Nearest neighbors Geographic location of the nearest neighbors
Predicting events
C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008
Query
Retrieved video
Synthesized video
Databases and the powers of 10
10⁰ images
1972
10¹ images
Marr, 1976
10²–10⁴ images
In 1996, DARPA released 14,000 images from over 1,000 individuals.
The faces and cars scale
The PASCAL Visual Object Classes
M. Everingham, Luc van Gool , C. Williams, J. Winn, A. Zisserman 2007
In 2007, the twenty selected object classes are:
Person: person Animal: bird, cat, cow, dog, horse, sheep Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
10²–10⁴ images
10⁵ images
Caltech 101 and 256
Griffin, Holub, Perona, 2007 Fei-Fei, Fergus, Perona, 2004
10⁵ images
Lotus Hill Research Institute image corpus
Z.Y. Yao, X. Yang, and S.C. Zhu, 2007
B.C. Russell, A. Torralba, K.P. Murphy, W.T. Freeman, IJCV 2008 Labelme.csail.mit.edu
The tool went online July 1st, 2005; 530,000 object annotations collected.
LabelMe 10⁵ images
Extreme labeling
The other extreme of extreme labeling
… things do not always look good…
Creative testing
10⁵ images
10⁶–10⁷ images
Things start getting out of hand
Collecting big datasets
• ESP game (CMU) Luis Von Ahn and Laura Dabbish 2004
• LabelMe (MIT) Russell, Torralba, Freeman, 2005
• StreetScenes (CBCL-MIT) Bileschi, Poggio, 2006
• WhatWhere (Caltech) Perona et al, 2007
• PASCAL challenge 2006, 2007
• Lotus Hill Institute Song-Chun Zhu et al, 2007
• 80 million images Torralba, Fergus, Freeman, 2007
10⁶–10⁷ images
80,000,000 images; 75,000 non-abstract nouns from WordNet; 7 online image search engines
Google: 80 million images
And after 1 year of downloading images
A. Torralba, R. Fergus, W.T. Freeman. PAMI 2008
10⁶–10⁷ images
~10⁵+ nodes, ~10⁸+ images
shepherd dog, sheep dog
German shepherd collie animal
Deng, Dong, Socher, Li & Fei-Fei, CVPR 2009
10⁶–10⁷ images
Alexander Sorokin, David Forsyth, "Utility data annotation with Amazon Mechanical Turk", First IEEE Workshop on Internet Vision at CVPR 08.
Labeling for money
1 cent Task: Label one object in this image
Why do people do this?
From: John Smith <…@yahoo.co.in>Date: August 22, 2009 10:18:23 AM EDT
To: Bryan Russell Subject: Re: Regarding Amazon Mechanical Turk HIT RX5WVKGA9W
Dear Mr. Bryan, I am awaiting for your HITS. Please help us with more.
Thanks & Regards
10⁶–10⁷ images
10⁸–10¹¹ images
Canonical Perspective
From Vision Science, Palmer
Examples of canonical perspective:
In a recognition task, reaction time correlated with the ratings.
Canonical views are recognized faster at the entry level.
3D object categorization
by Greg Robbins
Although we can categorize all three pictures as views of a horse, they do not look like equally typical views of horses, and they do not seem equally easy to recognize.
Canonical Viewpoint
Viewpoints are not sampled uniformly (some artificial datasets may contain non-natural statistics)
10⁸–10¹¹ images
Interesting biases…
Canonical Viewpoint
Clocks are preferred as purely frontal
10⁸–10¹¹ images
Interesting biases…
>10¹¹ images