NIPS2009: Understand Visual Scenes - Part 2

Transcript of NIPS2009: Understand Visual Scenes - Part 2

1. A car out of context
2. Modeling object co-occurrences
3. What are the hidden objects? (1, 2)
4. What are the hidden objects? Chance ~ 1/30,000
5. p(O | I) ∝ p(I | O) p(O). Object model; context model.
6. p(O | I) ∝ p(I | O) p(O). Object model; context model. Full joint, scene model, approximate joint.
7. p(O | I) ∝ p(I | O) p(O). Object model; context model. Full joint, scene model, approximate joint.
8. p(O | I) ∝ p(I | O) p(O). Object model; context model. Scene model: p(O) = Σ_s Π_i p(O_i | S = s) p(S = s). Office, street.
9. p(O | I) ∝ p(I | O) p(O). Object model; context model. Full joint, scene model, approximate joint.
10. Pixel labeling using MRFs: enforce consistency between neighboring labels, and between labels and pixels. Carbonetto, de Freitas & Barnard, ECCV 2004.
11. Object-object relationships: use latent variables to induce long-distance correlations between labels in a Conditional Random Field (CRF). He, Zemel & Carreira-Perpiñán, 2004.
12. Object-object relationships. [Kumar & Hebert 2005]
13. Object-object relationships. Fink & Perona (NIPS 2003): use the output of boosting from other objects at previous iterations as input into boosting for this iteration.
14. Object-object relationships: building, boat, person; road; building, boat, motorbike; water, sky. The most consistent labeling according to object co-occurrences & local label probabilities. A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora and S. Belongie. Objects in Context. ICCV 2007.
15. Objects in Context: contextual refinement. A contextual model based on co-occurrences: find the most consistent labeling, with high posterior probability and high mean pairwise interaction, using a CRF. Start from independent segment classification; the mean interaction of a label pair (i, j) is basically the observed label co-occurrence in the training set.
16. Using stuff to find things. Heitz and Koller, ECCV 2008. In this work there is no labeling for stuff; instead, they look for clusters of textures and model how each cluster correlates with the target object.
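The scene-conditioned prior on slide 8, p(O) = Σ_s Π_i p(O_i | S = s) p(S = s), can be sketched numerically. A minimal toy example — the scene names, object names, and every probability below are invented for illustration:

```python
import numpy as np

# Toy context model: p(O) = sum_s prod_i p(O_i | S=s) p(S=s).
# Scene prior p(S=s) over two hypothetical scenes, office and street.
p_scene = np.array([0.6, 0.4])

# p(O_i = present | S=s); rows are scenes, columns are objects
# [monitor, car, keyboard]. All numbers are made up.
p_obj_given_scene = np.array([
    [0.9, 0.05, 0.8],   # office
    [0.1, 0.7,  0.05],  # street
])

def context_prior(present):
    """Marginal prior of a binary object-presence vector O."""
    p = p_obj_given_scene
    per_scene = np.where(present, p, 1.0 - p).prod(axis=1)  # prod_i p(O_i|S=s)
    return float(per_scene @ p_scene)                       # sum over scenes

# A coherent office configuration (monitor + keyboard, no car) gets a much
# higher prior than one that mixes office and street objects.
print(context_prior([1, 0, 1]))  # ≈ 0.411
print(context_prior([1, 1, 1]))  # ≈ 0.023
```

Multiplying this prior by a detector's likelihood p(I | O) implements the p(O | I) ∝ p(I | O) p(O) factorization of slides 5–9.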
17. What, where and who? Classifying events by scene and object recognition. L.-J. Li & L. Fei-Fei, ICCV 2007. (Slide by Fei-Fei)
18. What, who, where. L.-J. Li & L. Fei-Fei, ICCV 2007. (Slide by Fei-Fei)
19. Grammars: Guzman (SEE), 1968; Noton and Stark, 1971; Yakimovsky & Feldman, 1973; Hansen & Riseman (VISIONS), 1978; Barrow & Tenenbaum, 1978; [Ohta & Kanade 1978]; Brooks (ACRONYM), 1979; Marr, 1982.
20. Grammars for objects and scenes. S.C. Zhu and D. Mumford. A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2006.
21. 3D scenes.
22. We are wired for 3D (~6 cm).
23. We cannot shut down 3D perception. © 2006 Walt Anthony.
24. 3D drives perception of important object attributes. "Turning the Tables," by Roger Shepard. Depth processing is automatic, and we cannot shut it down.
25. Manhattan World. Coughlan, Yuille, 2003. (Slide by James Coughlan)
26.–27. Coughlan, Yuille, 2003. (Slides by James Coughlan)
28. Single view metrology. Criminisi et al., 1999. Need to recover: the ground plane, a reference height, the horizon line, and where objects contact the ground.
29. 3D scene context: image ↔ world. Hoiem, Efros, Hebert, ICCV 2005.
30. 3D scene context (meters): ped, ped, car. Hoiem, Efros, Hebert, ICCV 2005.
31. Qualitative results. Car: TP/FP; ped: TP/FP. Initial: 2 TP / 3 FP; final: 7 TP / 4 FP. Local detector from [Murphy-Torralba-Freeman 2003]. (Slide by Derek Hoiem)
32. 3D City Modeling using Cognitive Loops. N. Cornelis, B. Leibe, K. Cornelis, L. Van Gool. CVPR 2006.
33. 3D from pixel values. D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up," SIGGRAPH 2005. A. Saxena, M. Sun, A.Y. Ng, "Learning 3-D Scene Structure from a Single Still Image," ICCV Workshop on 3D Representation for Recognition (3dRR-07), 2007.
34. Surface estimation: image → support, vertical (left, center, right, porous, solid), sky. Object? Surface? Support? [Hoiem, Efros, Hebert ICCV 2005] (Slide by Derek Hoiem)
35. Object support. (Slide by Derek Hoiem)
36. Qualitative 3D relationships. Gupta & Davis, ECCV 2008.
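The "need to recover" list on slide 28 turns into a one-line computation once those quantities are known. Under the usual single-view-metrology assumptions (an upright object resting on the ground plane, image rows increasing downward), world height follows from the horizon row, the camera height, and the object's top and bottom rows. A sketch with made-up numbers:

```python
def object_height(cam_height_m, horizon_row, top_row, bottom_row):
    """World height of an upright object standing on the ground plane.

    A ground point at depth z projects at (bottom - horizon) proportional to
    cam_height / z; the object's top at height H shifts the projection by
    H / z, giving H = cam_height * (bottom - top) / (bottom - horizon).
    """
    assert bottom_row > horizon_row, "ground contact must be below the horizon"
    return cam_height_m * (bottom_row - top_row) / (bottom_row - horizon_row)

# Camera 1.6 m above the ground, horizon at row 200; an object spanning
# rows 170..350 comes out roughly pedestrian-sized.
print(object_height(1.6, 200, 170, 350))  # ≈ 1.92 m
```

This is the geometric core that lets the scene-context work of slides 29–31 couple detection with the horizon and camera height: candidate detections implying implausible world heights can be suppressed.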
37. Large databases: algorithms that rely on millions of images.
38. Data. Human vision: many input modalities; active; supervised, unsupervised, and semi-supervised learning; it can look for supervision. Robot vision: many poor input modalities; active, but it does not go far. Internet vision: many input modalities; it can reach everywhere; tons of data.
39. The two extremes of learning. Extrapolation problem vs. interpolation problem; generalization vs. correspondence; diagnostic features vs. finding the differences. Number of training samples: 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6. Transfer learning, classifiers, label transfer, priors.
40. Nearest neighbors: input image → labels, motion, depth. Hays, Efros, SIGGRAPH 2007; Russell, Liu, Torralba, Fergus, Freeman, NIPS 2007; Divvala, Efros, Hebert, 2008; Malisiewicz, Efros, 2008; Torralba, Fergus, Freeman, PAMI 2008; Liu, Yuen, Torralba, CVPR 2009.
41. The power of large collections. Google Street View (controlled image capture); PhotoTourism/PhotoSynth [Snavely et al., 2006] (register images based on multi-view geometry).
42. Image completion: instead, generate proposals using millions of images. Input → 16 nearest neighbors (gist + color matching) → output. Hays, Efros, 2007.
43. im2gps. Instead of using object labels, the web provides other kinds of metadata associated with large collections of images: 20 million geotagged and geographic text-labeled images. Hays & Efros, CVPR 2008.
44. im2gps: input image → nearest neighbors → geographic location of the nearest neighbors. Hays & Efros, CVPR 2008.
45.–52. Predicting events: query → retrieved video → synthesized video. C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, ECCV 2008.
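The nearest-neighbor recipe behind slides 40–44 (scene completion, im2gps, and related label-transfer work) is simple enough to sketch: describe every image with one global descriptor (gist plus color in the papers; random vectors stand in for them here), retrieve the k closest database images, and transfer whatever metadata they carry. All features and tags below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a database of global image descriptors and their metadata
# (object labels, GPS tags, motion fields, ...).
db_feats = rng.normal(size=(1000, 64))
db_tags = [f"tag_{i % 10}" for i in range(1000)]

def knn_transfer(query_feat, k=16):
    """Return the metadata of the k nearest database images (L2 distance)."""
    d = np.linalg.norm(db_feats - query_feat, axis=1)
    idx = np.argsort(d)[:k]
    return [db_tags[i] for i in idx]

# A near-duplicate of database image 3 retrieves that image first, so its
# metadata transfers to the query.
query = db_feats[3] + 0.01 * rng.normal(size=64)
print(knn_transfer(query)[0])
```

With enough images, this purely non-parametric step replaces model fitting: the "interpolation" end of the spectrum on slide 39.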
53.–54. Datasets and the powers of 10.
55.–56. 10^0 images (1972).
57.–58. 10^1 images. Marr, 1976.
59.–60. 10^2–10^4 images: the faces-and-cars scale. In 1996 DARPA released 14,000 images, from over 1,000 individuals.
61. The PASCAL Visual Object Classes. In 2007 the twenty selected object classes are — Person: person. Animal: bird, cat, cow, dog, horse, sheep. Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train. Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor. M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, 2007.
62. 10^2–10^4 images.
63. 10^5 images.
64. Caltech 101 and 256: 10^5 images. Griffin, Holub, Perona, 2007; Fei-Fei, Fergus, Perona, 2004.
65. Lotus Hill Research Institute image corpus. Z.Y. Yao, X. Yang, and S.C. Zhu, 2007.
66. LabelMe: 10^5 images. The tool went online July 1st, 2005; 530,000 object annotations collected. labelme.csail.mit.edu. B.C. Russell, A. Torralba, K.P. Murphy, W.T. Freeman, IJCV 2008.
67. Extreme labeling.
68. The other extreme of extreme labeling: things do not always look good.
69. Creative testing.
70. 10^5 images.
71. 10^6–10^7 images: things start getting out of hand.
72. Collecting big datasets (10^6–10^7 images): ESP game (CMU), Luis von Ahn and Laura Dabbish, 2004; LabelMe (MIT), Russell, Torralba, Freeman, 2005; StreetScenes (CBCL-MIT), Bileschi, Poggio, 2006; WhatWhere (Caltech), Perona et al., 2007; PASCAL challenge, 2006, 2007; Lotus Hill Institute, Song-Chun Zhu et al., 2007; 80 million images, Torralba, Fergus, Freeman, 2007.
73. 80,000,000 images (10^6–10^7): 75,000 non-abstract nouns from WordNet, 7 online image search engines, and after 1 year of downloading images, Google: 80 million images. A. Torralba, R. Fergus, W.T. Freeman. PAMI 2008.
74. (10^6–10^7 images) Shepherd dog, sheep dog, animal, collie, German shepherd. ~10^5+ nodes; ~10^8+ images. Deng, Dong, Socher, Li & Fei-Fei, CVPR 2009.
75. Labeling for money. Alexander Sorokin, David Forsyth, "Utility data annotation with Amazon Mechanical Turk," First IEEE Workshop on Internet Vision at CVPR 2008.
76.–77. 1 cent. Task: label one object in this image.
78. Why do people do this? "From: John Smith. Date: August 22, 2009, 10:18:23 AM EDT. To: Bryan Russell. Subject: Re: Regarding Amazon Mechanical Turk HIT RX5WVKGA9W. Dear Mr. Bryan, I am awaiting for your HITs. Please help us with more. Thanks & Regards."
79. 10^6–10^7 images.
80.–82. 10^8–10^11 images.
83. Canonical perspective. Examples of canonical perspective: in a recognition task, reaction time correlated with the ratings; canonical views are recognized faster at the entry level. From Vision Science, Palmer.
84. 3D object categorization. Although we can categorize all three pictures as views of a horse, they do not look like equally typical views of horses, and they do not seem to be recognizable with the same ease. (Photos by Greg Robbins.)
85. Canonical viewpoint (10^8–10^11 images). Interesting biases: viewpoint sampling is not uniform (some artificial datasets might contain non-natural statistics).
86. Canonical viewpoint. Interesting biases: clocks are preferred as purely frontal.
87. >10^11 images: ???