
PROJECTS: Digital Image/Video Processing (ΨΗΦΙΑΚΗ ΕΠΕΞΕΡΓΑΣΙΑ ΕΙΚΟΝΑΣ/VIDEO), Σπύρος Φωτόπουλος

Aggregating local descriptors into a compact image representation - VLAD descriptor

Images and corresponding VLAD descriptors, for K=16 centroids. The components of the descriptor are represented like SIFT, with negative components in red.

Related material:

Paper: Aggregating local descriptors into a compact image representation. H. Jegou, M. Douze, C. Schmid, P. Perez (INRIA Rennes, France).
http://lear.inrialpes.fr/pubs/2010/JDSP10/jegou_compactimagerepresentation.pdf
https://hal.inria.fr/inria-00633013/PDF/jegou_aggregate.pdf

This is a simple procedure for extracting an image feature, based on k-means clustering: the well-known VLAD algorithm. The task is to implement it and apply it to image "distance", i.e. image retrieval.
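A minimal sketch of the VLAD computation, assuming local descriptors (e.g. SIFT, extracted with any tool) are already available as a NumPy array and the codebook is fit with scikit-learn's KMeans; the power and L2 normalizations follow common practice and may differ from the exact variant in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad(descriptors, kmeans):
    """descriptors: (n, d) local descriptors of one image.
    kmeans: KMeans fitted on descriptors pooled from a training set."""
    k, d = kmeans.n_clusters, descriptors.shape[1]
    words = kmeans.predict(descriptors)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[words == i]
        if len(members):
            # Accumulate residuals to the i-th centroid.
            v[i] = (members - kmeans.cluster_centers_[i]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))      # power normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v       # L2 normalization

# Usage: fit the codebook once (e.g. K=16 as in the figure above), then
# compare images by Euclidean distance between their K*d VLAD vectors.
# codebook = KMeans(n_clusters=16, n_init=4).fit(pooled_descriptors)
```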


Efficiently Searching for Similar Images

Related paper:

http://cacm.acm.org/magazines/2010/6/92473-efficiently-searching-for-similar-images/fulltext

The task is to construct the histogram-based Pyramid Match Algorithm, as described in the related paper (Section 2.2), and to apply it to image "distance/similarity".
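A rough sketch of the pyramid match similarity of Section 2.2, under the assumption that feature points have been rescaled to the unit hypercube and that the feature dimension is small (2-3), since the histogram grid grows exponentially with dimension; the 1/2^i level weighting follows the paper:

```python
import numpy as np

def pyramid_match(x, y, levels=5):
    """x, y: (n, d) and (m, d) feature sets with coordinates in [0, 1].
    Returns the weighted count of new matches across pyramid levels."""
    score, prev_matches = 0.0, 0.0
    for i in range(levels):                  # i = 0 is the finest level
        bins = 2 ** (levels - 1 - i)         # bin side doubles each level
        ranges = [(0.0, 1.0)] * x.shape[1]
        hx, _ = np.histogramdd(x, bins=bins, range=ranges)
        hy, _ = np.histogramdd(y, bins=bins, range=ranges)
        matches = np.minimum(hx, hy).sum()   # histogram intersection
        score += (matches - prev_matches) / 2 ** i  # new matches, weighted
        prev_matches = matches
    return score
```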


Depth Maps for Object Recognition

Summary: As depth data becomes increasingly common in vision systems through the widespread availability of RGB-D cameras, deriving useful information from that data is becoming more important. There now exist large datasets of RGB-D data as well as algorithms that create features from depth data ([1, 2]). We would like to further explore the usefulness of depth images by experimenting with an important phase of any vision pipeline: defining features.

Technical Details

RGB-D dataset provided by [1] should be used

http://rgbd-dataset.cs.washington.edu/dataset.html

http://rgbd-dataset.cs.washington.edu/index.html

Create a recognition system (HOG features with an SVM classifier); any type of classifier may be used. A sketch follows below.
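A minimal sketch of such a recognition system in Python (scikit-image HOG plus a linear SVM from scikit-learn); the window size and HOG parameters are arbitrary choices, and depth channels can be fed in the same way as grayscale images:

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def hog_descriptor(image, size=(128, 128)):
    # Resize so every sample yields a fixed-length feature vector.
    im = resize(image, size, anti_aliasing=True)
    return hog(im, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_recognizer(images, labels):
    """images: list of 2-D arrays (grayscale or depth); labels: classes."""
    X = np.array([hog_descriptor(im) for im in images])
    return LinearSVC(C=1.0).fit(X, labels)
```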

References
[1] K. Lai, L. Bo, X. Ren, and D. Fox, "A Large-Scale Hierarchical Multi-View RGB-D Object Dataset", IEEE International Conference on Robotics and Automation, 2011.
[2] L. Bo, K. Lai, X. Ren, and D. Fox, "Object Recognition with Hierarchical Kernel Descriptors".
[3] M. Koppel, M. Ben Makhlouf, and P. Ndjiki-Nya, "Optimized Adaptive Depth Map Filtering".
[4] Y.-W. Kim, K. Kim, and J.-I. Park, "Refinement of Depth Map by Combining Coarse Depth and Surface Normal".


Image segmentation via the "Laplacian Eigenmaps" procedure

Related papers:

I. Tziakos, C. Theoharatos, N. Laskaris, and G. Economou, "Color image segmentation using Laplacian eigenmaps," Journal of Electronic Imaging, vol. 18, iss. 2, 2009.
Ioannis Tziakos, Nikolaos Laskaris and Spiros Fotopoulos, "Multivariate Image Segmentation using Laplacian Eigenmaps", in Proc. of European Signal Processing Conference (EUSIPCO), Sept. 2004, Vienna, Austria, pp. 945-948.

The task is to implement the algorithm as described in the above papers and apply it to an RGB input image.
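A minimal stand-in sketch (not the papers' exact construction) using scikit-learn's SpectralEmbedding, which computes Laplacian eigenmaps on a nearest-neighbor affinity graph; the color-plus-position pixel features and the cluster count are assumptions, and the image should be small since the embedding scales poorly with pixel count:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

def segment_laplacian_eigenmaps(image, n_segments=4, n_components=3):
    """image: small (h, w, 3) RGB array, e.g. 64x64."""
    h, w, c = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Per-pixel feature: color plus weakly weighted spatial position.
    feats = np.column_stack([image.reshape(-1, c) / 255.0,
                             0.5 * yy.ravel() / h,
                             0.5 * xx.ravel() / w])
    emb = SpectralEmbedding(n_components=n_components,
                            affinity='nearest_neighbors',
                            n_neighbors=10).fit_transform(feats)
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(emb)
    return labels.reshape(h, w)
```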


Create Pointillism Art with 3 Primary Colors from Natural Images

Description: Pointillism is a branch of impressionism that dates back to the late 19th century. It is a painting technique that uses only tiny, distinct dots to form patterns of color. It enjoys a duality in being both discrete up close (the dots) and continuous from a distance (the patterns). Combining this artistic inspiration with the techniques of digital image processing, we want to develop a special image filter that creates pointillism art from ordinary digital images.

Plan

The process of creating pointillism art will be as follows:

(1) We take an image.

(2) We will apply various image processing techniques to our image, such as blurring it with a kernel whose size is proportional to the size of the dots we want to use in our pointillism art.

(3) We will then downsample the image to reduce the resolution of the image. By blurring the image and

downsampling, we want to create an image that allows us to more easily

deconstruct the image into Pointillism art composed of 3 primary colors.

(4) After the preprocessing, we will determine the intensity map of the image by converting it to

grayscale. The intensity map of the image will be used to determine a density map of the dots in our

subsequent pointillism art.

(5) We will apply histogram normalization to the image to make it more saturated so that the

distribution of colors in the image can be better mapped to the limited gamut of colors we can create

with three primary colors. Then we can create a linear mapping between the colors in the image to the

colors we want to use for the dots.

(6) Once the conversion is done, we will analyze pixels in the original image to create regions of dots

that represent each pixel of the original image. We will determine the distribution of dots and color in the

regions to closely represent the original pixel in an artistic manner. We will overlap the resulting regions

of dots, to reduce artifacts and create a more blended distribution of dots across the entire resulting

pointillism image.

(7) Once our pointillism art is created, we will display the image on a printout. The goal is that, standing close to the art, it looks like a collection of dots, while from farther away it looks like an artistic representation of the image.

The procedures here are tentative and subject to changes for the pursuit of aesthetics.
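A rough OpenCV sketch of steps (2)-(6): blur, derive a dot-density map from intensity, and stamp dots quantized to a primary palette on a white canvas. The palette, dot count, and radius are hypothetical choices, and the histogram normalization of step (5) is omitted:

```python
import numpy as np
import cv2

# Hypothetical three-primary palette, in BGR order (OpenCV convention).
PRIMARIES = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]], float)

def pointillize(image, n_dots=20000, radius=3, seed=0):
    rng = np.random.default_rng(seed)
    k = 2 * radius + 1
    blurred = cv2.GaussianBlur(image, (k, k), 0)
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY).astype(float)
    density = np.maximum(1.0 - gray / 255.0, 1e-6)  # darker -> more dots
    idx = rng.choice(density.size, size=n_dots,
                     p=density.ravel() / density.sum())
    ys, xs = np.unravel_index(idx, density.shape)
    canvas = np.full_like(image, 255)
    for x, y in zip(xs, ys):
        # Snap the local color to the nearest primary.
        px = blurred[y, x].astype(float)
        c = PRIMARIES[np.argmin(((px - PRIMARIES) ** 2).sum(axis=1))]
        cv2.circle(canvas, (int(x), int(y)), radius,
                   tuple(int(v) for v in c), -1)
    return canvas
```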

References

Dongxiang Chi, "A Natural Image Pointillism with Controlled Ellipse Dots," Advances in Multimedia, vol. 2014, Article ID 567846, 16 pages, 2014. doi:10.1155/2014/567846
Greenberg, Ira, Dianna Xu, and Deepak Kumar. "Image SpecialFX: Pointillism." Processing: Creative Coding and Generative Art in Processing 2. Dordrecht: Springer, 2013. 399-401. Print.
Lansdown, John, and Simon Schofield. "Expressive Rendering: A Review of Nonphotorealistic Techniques." Computer Graphics and Applications, IEEE 15.3 (1995): 29-37. IEEE. Web. 3 Nov. 2015.
Luong, Tran-Quan, Ankush Seth, Allison Klein, and Jason Lawrence. "Isoluminant Color Picking for Non-Photorealistic Rendering." Princeton, 2005. Web. 3 Nov. 2015. <http://www.cs.virginia.edu/~jdl/papers/isolum/luong_gi05.pdf>.

Details for the project are in the following paper:
http://web.stanford.edu/class/ee368/Project_Autumn_1516/Reports/Hong_Liu.pdf
The task is to write the corresponding code.


Mobile Haze Removal

Motivation: In the last couple of decades, China has developed its economy by largely expanding heavily polluting industries. Public concern over the environmental consequences of this growth has exploded in recent years. China's hyperactive microblogs logged 2.5m posts on "smog" in a single month in 2013 [1]. Hazy images of Chinese cities went viral online and have come to represent the severe pollution in China. The generation born after 2000 has probably never seen an unpolluted China.

In this project, we would like to build a mobile haze removal application that gives users a different

angle to see China. By removing the haze and reconstructing a clear sky, the application will present

an image of an unpolluted China.

Goals: The goal of this project is to implement an application that removes haze from an image with an unclear/polluted scene. The scene can range from heavily polluted cities to foggy mornings. The

expected output will be a clearer image with higher contrast ratio as well as consistent and relatively

realistic appearance.

Approach: Most regions of an image that are not part of the sky contain a few pixels that have a low intensity

in one of the RGB channels. Using this concept we can estimate the thickness of haze from the

intensity of the dark channels[2]. We need a more efficient algorithm for implementation, so we will

detect dark pixels whose intensity is close to zero, use their values to calculate the transmission at

that point, and use the transmission to estimate the values of nearby pixels. We then segment the

image and use a 2D linear fitting of a transmission equation[3]. Based on the quality of our resulting

image, we may also implement denoising[4].
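A compact sketch of dark-channel-prior dehazing in the spirit of [2], without the segmentation, linear-fitting, and denoising refinements described above; the patch size, omega, and the transmission floor are the usual choices from that paper:

```python
import numpy as np
import cv2

def dark_channel(img, patch=15):
    # Per-pixel minimum over color channels, then a local min-filter.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(img.min(axis=2), kernel)

def dehaze(image, omega=0.95, t0=0.1, patch=15):
    img = image.astype(np.float64) / 255.0
    dark = dark_channel(img, patch)
    # Atmospheric light: mean color of the brightest 0.1% dark-channel pixels.
    n = max(1, int(dark.size * 0.001))
    bright = np.argsort(dark.ravel())[-n:]
    A = img.reshape(-1, 3)[bright].mean(axis=0)
    t = np.clip(1.0 - omega * dark_channel(img / A, patch), t0, 1.0)
    J = (img - A) / t[..., None] + A          # invert the haze model
    return np.clip(J * 255, 0, 255).astype(np.uint8)
```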

References
[1] http://www.economist.com/news/briefing/21583245-china-worlds-worst-polluter-largest-investor-green-energy-its-rise-will-have
[2] http://research.microsoft.com/en-us/um/people/jiansun/papers/dehaze_cvpr2009.pdf
[3] http://www-scf.usc.edu/~qiyu/Papers/ivcnz2011.pdf
[4] https://users.soe.ucsc.edu/~milanfar/publications/conf/SPIEHaze2012.pdf

Helpful: http://web.stanford.edu/class/ee368/Project_Spring_1415/Reports/Chiang_Ge.pdf


Pinna Feature Extraction from Handheld Device Capture

Overview

Humans are capable of localizing the elevation position of sound sources relative to them.

Identifying the exact location, however, is a difficult perceptual task, and even more difficult

to model. The pinna (outer ear) plays an important role in this process as it generates a series

of elevation cues while filtering the acoustic signal. This can be described via a frequency

response function called the head related transfer function (HRTF) [2]. Different

individuals have distinctive HRTFs since the biometric parameters vary significantly in

relation to size, shape, and orientation. Fast and convenient extraction of pinna edges from 2D

images is beneficial for mapping subjective pinna features to a filter model databank for aural

spatialization (specifically, elevation correction based on the pinna impulse response) [3].

Database: V. R. Algazi, R. O. Duda, D. M. Thompson and C. Avendano, 2001, The CIPIC HRTF

Database, Proc. 2001 IEEE Workshop on Applications of Signal Processing to Audio and

Electroacoustics, pp. 99-102, Mohonk Mountain House, New Paltz, NY, Oct. 21-24

Figure 1: A simple reflection model for the pinna spectral notches. The direct wave incident at

an angle phi gets reflected from the concha. The time delay corresponds to a length of 2d. [2]

Goals

The project can be decomposed into the following parts:

1. Image binarization and denoising

We plan to capture ear images using a handheld device. Depending on the collection environment (e.g. low illumination) and subject-to-subject differences such as skin tone, we will first binarize the image with appropriate thresholding. Then, we will denoise the image to obtain a clean result for the next processing stage.

2. Edge detection:

Before analyzing the biometric characteristics, a crucial step is to obtain the contour of the

pinna from the collected image. Since the images of the outer ear tend to have low contrast,

challenges in edge detection are expected. We will survey different detection algorithms, such as the Canny and Sobel methods, to develop a robust algorithm customized for pinna edge detection. [1]
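A minimal sketch of stages 1-2 (binarization plus edge detection) in OpenCV; Otsu thresholding and the Canny thresholds are assumptions to be tuned on real captures:

```python
import cv2

def pinna_preprocess(gray, blur_ksize=5, canny_low=50, canny_high=150):
    """gray: uint8 grayscale ear image from the handheld capture."""
    smoothed = cv2.GaussianBlur(gray, (blur_ksize, blur_ksize), 0)
    # Global Otsu threshold; adaptive thresholding may suit uneven light.
    _, binary = cv2.threshold(smoothed, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(smoothed, canny_low, canny_high)
    return binary, edges
```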


Figure 2: Example of ear contour extraction [3]

3. Biometric parameter extraction and spectral analysis:

This step aims to use the generated pinna contour image to extract the relevant parameters (e.g. radius, centroid location) which determine the pinna impulse response.

References
[1] Michał Choraś. Ear biometrics based on geometrical method of feature extraction. In Francisco J. Perales and Bruce A. Draper, editors, Articulated Motion and Deformable Objects, volume 3179 of Lecture Notes in Computer Science, pages 51-61. Springer Berlin Heidelberg, 2004.
[2] Vikas C. Raykar, Ramani Duraiswami. Extracting the frequencies of the pinna spectral notches in measured head related impulse responses. Acoustical Society of America, 118:364, 2005.
[3] Simone Spagnol, Michele Geronazzo, Davide Rocchesso, and Federico Avanzini. Extraction of pinna features for customized binaural audio delivery on mobile devices. In Proceedings of International Conference on Advances in Mobile Computing & Multimedia, MoMM '13, pages 514:514-514:517, New York, NY, USA, 2013. ACM.


Baby Face Generator

Project description: Detecting faces and extracting key facial features remains an active research area with a wide range of

applications. This project seeks to leverage these facial feature detection methods and morphing methods

for an entertaining application: to intelligently combine the faces of two individuals to form a

composite baby image. The algorithm will involve a number of topics covered in class lectures

including color balancing, image segmentation, face detection, eigenimages, and edge detection. The first portion of the project involves implementing an algorithm for detecting human

faces and localizing the key facial features. The implementation will build upon several face

detection and facial feature extraction algorithms including Turk and Pentland’s work using

eigenfaces to locate faces [1], Huang and Chen’s work using active contour models for facial feature

extraction [2], and Saber and Tekalp’s work using color, shape, and symmetry-based cost functions for

facial detection and feature extraction [3]. Additionally, the input faces’ ethnicities will be classified

using eigenimages or fisher images and combined to select the baby’s ethnicity from a predefined

database. The second portion of the project involves combining the identified facial features of the two

individuals to form a composite image. This algorithm incorporates facial morphing techniques

including Beier's field-morphing algorithm [4] to properly weight and combine the input faces. A

randomized weighting method will be used for selecting which features from the input images will

appear in the output image so that a different baby image is generated each time the program runs.

The final implementation of the baby face generator will include a user-friendly interface

implemented in MATLAB.
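Not the Beier-Neely field morphing itself, but a simpler stand-in illustrating the weighting-and-combining idea: align one face to the other with an affine transform from three landmark pairs (a hypothetical choice of eye centers and mouth center), then cross-dissolve:

```python
import numpy as np
import cv2

def blend_faces(face_a, face_b, pts_a, pts_b, alpha=0.5):
    """pts_a, pts_b: (3, 2) float arrays of matching landmarks,
    e.g. left eye, right eye, mouth center."""
    M = cv2.getAffineTransform(np.float32(pts_a), np.float32(pts_b))
    h, w = face_b.shape[:2]
    warped = cv2.warpAffine(face_a, M, (w, h))
    # alpha weights face_a's appearance; randomizing it per feature
    # would give the varied outputs the proposal describes.
    return cv2.addWeighted(warped, alpha, face_b, 1 - alpha, 0)
```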

REFERENCES

[1] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive

neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[2] C.-L. Huang and C.-W. Chen, “Human facial feature extraction for face interpretation and

recognition,” Pattern Recognition, vol. 25, no. 12, pp. 1435 – 1444, 1992.

[3] E. Saber and A. M. Tekalp, “Frontal-view face detection and facial feature extraction using

color, shape and symmetry based cost functions,” Pattern Recognition Letters, vol. 19, no. 8, pp.

669-680, 1998.
[4] T. Beier and S. Neely, "Feature-based image metamorphosis," in ACM SIGGRAPH Computer Graphics, vol. 26, no. 2. ACM, 1992, pp. 35-42.


Speech Spectrogram Segmentation

The task is to segment a speech spectrogram using image segmentation procedures. Methods 1, 2, 3, and 4 from the list below are suggested (a minimal sketch of method 1 follows the list). MATLAB code for all four methods is readily available (MathWorks).

Image segmentation is the process of dividing an image into multiple parts. This is typically

used to identify objects or other relevant information in digital images. There are many

different ways to perform image segmentation, including:

1. K-means clustering

2. Region growing https://en.wikipedia.org/wiki/Region_growing

3. Image segmentation - Split and Merge (Quadtrees)

4. Connected component labeling

5. Transform methods such as watershed segmentation

6. Gaussian mixture models

etc.
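A minimal sketch of method 1 on a spectrogram, using SciPy/scikit-learn as an alternative to the ready MATLAB code: compute the log-magnitude spectrogram and k-means-cluster its pixels, here by magnitude alone (a deliberately simple feature choice):

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.cluster import KMeans

def segment_spectrogram(signal, fs, n_clusters=3):
    f, t, S = spectrogram(signal, fs=fs, nperseg=256, noverlap=192)
    log_mag = np.log1p(S)
    # Cluster each time-frequency pixel by its log magnitude.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        log_mag.reshape(-1, 1))
    return labels.reshape(log_mag.shape), f, t
```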

Related papers on spectrogram segmentation are provided.


Beer label recognition and classification

Project description:

The goal of this project is to develop an image-processing algorithm that can recognize

and classify beer bottle labels in near real-time. Mobile applications that implement such

an algorithm could be used to provide information not available on the bottle (e.g. consumer

ratings and reviews) in different on-the-go situations like grocery shopping.

Goals and

Implementation:

Since this project revolves around machine learning, we will first need to collect images for

the training and testing datasets. For the training set, we will collect “clean” beer labels using

images available online (Figure 1, left). For the testing set, we will use photographs of

bottles with corresponding labels, similar to those that would be acquired with a mobile

phone (Figure 1, right). We plan to use a training set of at least 100 images/beer labels,

including some that come from the same brewery to make the classification more difficult.

Figure 1: Example training image (left) and matching test image (right).

For the test images, we will perform the following pre-processing steps before feature

extraction:

1. Conversion from RGB to grayscale (for some features)

2. 8:1 or 4:1 downsampling with low-pass filtering to improve runtime speed

3. Segmentation to extract the label from the rest of the photograph
4. De-warping to make the label flat using an approximate cylindrical projection [1]

Three sets of features will be investigated:

1. Scale-invariant features, identified by finding local extrema in difference-of-Gaussian image pyramids using the SIFT algorithm [2].

2. RGB color histograms of the extracted label. Although this method is sensitive to subtle

changes in noise and lighting, it can be useful for labels that are largely one color (for

example, the label shown in Figure 1). Feature comparisons can be computed using the histogram intersection kernel for support vector machines [3].


3. Prevalence of text. Since labels can include many different fonts and sizes of letters, any attempt to identify specific characters via morphological image processing would probably not succeed. Therefore, instead of character recognition, we will perform text detection –

that is, detecting regions where there is text versus regions where there is no text [4]. How well a

test label matches with each image in the training set can then be quantified by the fraction of the

label that is taken up by text.

We will perform the final label classification taking each of these features into account and

weighting them according to their individual success.
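A small sketch of feature set 2 above: a joint RGB histogram of the extracted label, with the histogram intersection used as the SVM kernel in [3]; the bin count is an assumption:

```python
import numpy as np
import cv2

def rgb_histogram(label_image, bins=8):
    """label_image: BGR crop of the (de-warped) label."""
    hist = cv2.calcHist([label_image], [0, 1, 2], None,
                        [bins] * 3, [0, 256] * 3).ravel()
    return hist / hist.sum()                  # L1-normalize

def histogram_intersection(h1, h2):
    # Kernel value for the SVM in [3]; also a similarity score on its own.
    return np.minimum(h1, h2).sum()
```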

References:
[1] N. Stamatopoulos, et al., "A two-step dewarping of camera document images," in Document Analysis Systems, 2008. DAS'08. The Eighth IAPR International Workshop on, 2008, pp. 209-216.
[2] D. G. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, 1999, pp. 1150-1157.
[3] A. Barla, et al., "Histogram intersection kernel for image classification," in Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, 2003, pp. III-513-16 vol. 2.
[4] C. Liu, et al., "Text detection in images based on unsupervised classification of edge-based features," in Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, 2005, pp. 610-614.

Helpful attached paper: Beer Label Classification for Mobile Applications


Automated Estimation of Human Age

Motivation

The human face, as a window to the soul, conveys a significant amount of nonverbal information that facilitates real-world human-to-human communication. Recall the first time you meet a person: you have the ability, probably developed early in life, to accurately determine facial attributes such as identity, age and gender. We want a machine that can do the same job, as in some science-fiction films.

Specifications: For this project, we aim to estimate human age only; this could be extended to identity or gender. We plan to develop a MATLAB application that can perform the following tasks on an input human face image:

1. Accurately recognize the human face as well as important facial aging features and

patterns.

2. Given aging features/patterns, estimate the age range.
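A minimal sketch of the pipeline's front end, using OpenCV's stock Haar cascade for face localization and a generic support vector regressor on raw pixels as a placeholder for the BIF/manifold features discussed below; it assumes a face is found in every training image:

```python
import numpy as np
import cv2
from sklearn.svm import SVR

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def crop_face(gray, size=(64, 64)):
    faces = cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
    return cv2.resize(gray[y:y + h, x:x + w], size)

def train_age_regressor(gray_images, ages):
    X = np.array([crop_face(im).ravel() / 255.0 for im in gray_images])
    return SVR(kernel='rbf').fit(X, ages)
```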

Prior works

Age estimation by machine has been a challenging problem for a long time. Different people age at different rates, determined not only by their genes but also by many factors such as health condition, lifestyle, working environment, and sociality.
Papers [1]-[4] introduce a few techniques to tackle this problem:

The system in [1] is evaluated on a large database, the Yamaha gender and age (YGA) database, which can be downloaded from the web. Biologically-inspired features (BIF) were investigated for age estimation and showed good performance. Estimation performance can be improved significantly further when manifold learning uses BIF features. This could be a promising technique for us in terms of efficiency and manageability.

Paper [2] uses a similar approach to that in [1]. It improves the BIF models and shows better age estimation results. We will implement the improvements if time permits.
Paper [3] proposes a novel scheme for aging feature extraction and automatic age estimation. A new method (locally adjusted robust regressor) is introduced for robust learning and prediction of aging patterns. Experiments are performed on the FG-NET database. The complexity of this method is higher.
Paper [4] introduces an estimation method called AGES. It models the aging pattern, defined as the sequence of a particular individual's face images sorted in time order, by constructing a representative subspace.

References

[1] Guodong Guo; Guowang Mu; Yun Fu; Dyer, C.; Huang, T., "A study on automatic age

estimation using a large database," Computer Vision, 2009 IEEE 12th International Conference on

, vol., no., pp.1986,1991, Sept. 29 2009-Oct. 2 2009

[2] Guodong Guo; Guowang Mu; Yun Fu; Huang, T.S., "Human age estimation using bio-

inspired features," Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE

Conference on , vol., no., pp.112,119, 20-25 June 2009


[3] Guodong Guo; Yun Fu; Dyer, C.R.; Huang, T.S., "Image-Based Human Age Estimation by

Manifold Learning and Locally Adjusted Robust Regression," Image Processing, IEEE

Transactions on , vol.17, no.7, pp.1178,1188, July 2008

[4] Xin Geng; Zhi-Hua Zhou; Smith-Miles, K., "Automatic Age Estimation Based on Facial

Aging Patterns," Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.29,

no.12, pp.2234,2240, Dec. 2007

Main paper (attached):
Automated Estimation of Human Age, Gender and Expression


Vehicle Detection and Distance Estimation

Distance information for potential obstacles in the path of a vehicle is critical for intelligent automotive systems in order to avoid collisions. We propose to develop a real-time vehicle detection and distance measurement algorithm using temporally correlated sequential images from monocular vision.

Many distance determination algorithms have been proposed and developed in the last decade [1-3]. Active detection systems are widely implemented commercially in vehicles today because of their immunity to changing ambient light conditions; however, the cost is usually higher than passive systems because they involve transmitters and receivers. Among the passive detection systems, vision-based methods are the most common and effective techniques, which can be roughly classified into two categories: monocular and stereo vision. Stereo vision utilizes images from two or more cameras to construct the 3D space, which usually provides better accuracy than monocular vision; however, the cost is also higher, and it is prone to error if the parameters of the cameras are unknown or change due to vibration on the road. Therefore, we choose to focus on monocular vision.

A complete distance measurement system includes two steps: vehicle detection and distance calculation. Motion-based and appearance-based methods are the two main approaches for vehicle detection [3].

We will first apply the combination of Histogram of Oriented Gradients (HOG) descriptors and Support Vector Machine (SVM) classification for vehicle detection, because it has shown promise in many previous works [4]. There are also databases online for training the SVM classifier [5]. The optical flow method, one of the popular motion-based methods, is another candidate.

Once the vehicles are detected, their location in the 3D world needs to be computed. Because monocular vision is used, reference features with known dimensions are required. We will start from simple cases. For example, the parallel lines on the road can be used to calculate the vanishing points as a reference; the widths of vehicles are roughly on the same scale, which can also be used as references; license plates are rectangular and have a fixed size, serving as a great reference, though detecting license plates from a distance may be very challenging. While these techniques should be enough to measure the distance in the simplest cases, e.g. detecting a vehicle driving straight right ahead, we are also interested in investigating some edge cases, including occlusions, driving at night or on rainy days, presence of vibration, being lit by the headlights of vehicles driving in the opposite direction, etc. We will combine different techniques to tackle these cases.

[1] Sun, Zehang, et al., "On-road vehicle detection: A review." Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.5 (2006): 694-711.
[2] Cualain, Diarmaid O., et al., "Distance detection systems for the automotive environment: a review." In Irish Signals and Systems Conf. 2007.
[3] Sivaraman, Sayanan, and Mohan Manubhai Trivedi. "Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis." Intelligent Transportation Systems, IEEE Transactions on 14, no. 4 (2013): 1773-1795.
[4] H. Tehrani Niknejad, et al., "On-road multivehicle tracking using deformable object model and particle filter with improved likelihood estimation," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 2, pp. 748-758, Jun. 2012.
[5] http://pascallin.ecs.soton.ac.uk/challenges/VOC/

Helpful (attached) paper: On-Road Vehicle and Lane Detection
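As a minimal illustration of the monocular geometry above, the range to a detected vehicle follows from the pinhole model once a real-world reference width is assumed (for example, a nominal car width of about 1.8 m):

```python
def distance_from_width(focal_px, real_width_m, bbox_width_px):
    """Pinhole-camera range estimate: Z = f * W / w, where f is the
    focal length in pixels, W the assumed real width of the reference
    feature, and w its width in the image."""
    return focal_px * real_width_m / bbox_width_px

# e.g. distance_from_width(focal_px=1000, real_width_m=1.8,
#                          bbox_width_px=90) -> 20.0 meters
```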


Hand-drawn electronic circuits recognition

Description

In real life, it is often very handy to draw an electronic circuit with various components on paper. However, paper is not a reliable medium for storing information. On the other hand, sometimes we want to try things out and test whether the sketched circuit is functional, which is impossible to realize on paper. To solve this problem, we propose to scan the circuit sketch on paper with our Android device, translate it into standard layouts, and run circuit simulations.

Challenges in sketched symbol recognition lie in the different sketch styles with regard to stroke order, direction, etc. We plan to adopt the following approach:

1. Solve the correspondence problem between reference image and sketch.
2. Use the correspondence to solve the alignment problem.
3. Compute the distance between the corresponding points of the two shapes.

4. Find the reference image in our database that has the lowest distance score.
We will also continue looking for other promising methods.

Plan

1. Build image databases of standard circuits.

2. Test on printed circuits. Start with recognizing single circuit components. Then test on multiple

components connected together as a circuit.

3. Test on sketched circuits.
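A simple stand-in for the correspondence-and-distance steps above: compare the outer contours of two binarized symbols with OpenCV's Hu-moment shape distance rather than full shape contexts [2]:

```python
import cv2

def symbol_distance(binary_a, binary_b):
    """binary_a, binary_b: uint8 masks of a sketched and a reference symbol.
    Lower return values mean more similar shapes."""
    def main_contour(binary):
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        return max(contours, key=cv2.contourArea)
    return cv2.matchShapes(main_contour(binary_a), main_contour(binary_b),
                           cv2.CONTOURS_MATCH_I1, 0.0)
```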

References
[1] Calhoun C, Stahovich TF, Kurtoglu T, Kara LB. Recognizing multi-stroke symbols. In AAAI Spring Symposium on Sketch Understanding. 2002. p. 15-23.
[2] Belongie S, Malik J, Puzicha J. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002;24(4): 509-22.
[3] Hse H, Newton AR. Sketched symbol recognition using Zernike moments. Technical report, EECS, University of California, 2003.
Helpful material: Circuit Sketch Recognition (attached)


Eye Detection and Tracking: A Platform For Novel Computer Interfaces

Our long-term goal for this project is to develop a robust eye-detection and tracking platform that would allow unique interfaces for a computer equipped with a webcam.

We plan to develop a MATLAB application that can perform the following tasks on a fixed high-resolution image with controlled lighting conditions:
1. Accurately recognize locations of important facial features, namely pupils/irises, nose, lips.

2. Estimate gaze angle based on the relative positions and shapes of the above facial

features. Our aim is to detect gaze with sufficient accuracy to be able to control a computer

interface.

3. Recognize whether each eye is open.

Once the core functionality above is accomplished, we would attempt to make improvements that will allow for a more robust and performant platform, including:
Low resolution images
Uneven or low lighting
Motion and Gaussian blur
Obstructions, such as glasses
Extracting information about user intent from gestures and blinking
Operation on low-power computing platforms (which could enable this to run quickly on a mobile device's interface)

Prior work (much of which is described in the references below) includes techniques such as:

Determining eye position relative to other facial features, thus estimating gaze. This could be a promising technique for us. [1]
Inspecting the shape of the iris to estimate gaze angle, based on the assumption that the iris underwent a rotational transform. This technique requires high resolution images that a webcam might not be able to provide. [2]
Use of the Hough transform to detect circular irises. [3]
Use of a neural network to localize eyes, followed by a mean-shift algorithm to detect gaze changes. [4]
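A small sketch of technique [3], the circular Hough transform, via OpenCV; the radius bounds and accumulator thresholds are assumptions that depend on image resolution:

```python
import numpy as np
import cv2

def detect_irises(gray):
    """gray: uint8 face/eye-region image. Returns (x, y, r) circles."""
    blurred = cv2.medianBlur(gray, 5)
    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1.5,
                               minDist=gray.shape[1] // 8,
                               param1=100, param2=30,
                               minRadius=5, maxRadius=40)
    return [] if circles is None else np.uint16(np.around(circles[0]))
```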


References
[1] A. C. Varchmin, R. Rae, and H. Ritter: "Image Based Recognition of Gaze Direction Using Adaptive Methods." In AG Neuroinformatik, 1997. [Link]
[2] J. G. Wang, E. Sung, R. Venkateswarlu: "Eye Gaze Estimation from a Single Image of One Eye." In IEEE Computer Vision, 2003. [Link]
[3] Y. Ito, W. Ohyama, T. Wakabayashi, and F. Kimura: "Detection of Eyes by Circular Hough Transform and Histogram of Gradient." In 21st International Conference on Pattern Recognition. Tsukuba, Japan, 2012. [Link]
[4] E. Y. Kim and S. K. Kang: "Eye Tracking Using Neural Network and Mean-Shift." Springer-Verlag, Berlin, Heidelberg, 2006. [Link]


Brain tumor segmentation

The aim of the algorithm is to extract the tumour from an MRI brain image (a widely used task); the tumour is marked in red in the accompanying image.
A survey of the algorithms used for this application: Link: Elsevier
A dataset of MRI brain tumour images: Page on oasis-brains.org
Some ready code you can use: MRI Brain Segmentation - File Exchange - MATLAB Central
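A crude Python baseline, as an alternative to the ready MATLAB code: Otsu-threshold a (skull-stripped) slice, clean up the mask, and keep the largest bright component; real tumour segmentation usually needs more than intensity alone:

```python
from skimage import filters, measure, morphology

def segment_tumor(mri_slice):
    """mri_slice: 2-D intensity array of a skull-stripped MRI slice."""
    mask = mri_slice > filters.threshold_otsu(mri_slice)
    mask = morphology.remove_small_objects(mask, min_size=64)
    labels = measure.label(mask)
    if labels.max() == 0:
        return mask                           # nothing found
    regions = measure.regionprops(labels)
    largest = max(regions, key=lambda r: r.area)
    return labels == largest.label
```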


Recognition of a small set of sign language using Hand Gesture

The aim of this project is to build a model for gesture recognition. The main aim is to count the number of fingers denoted by a gesture, using image processing techniques only. A further challenge is to devise the counting algorithm itself; an edge detection algorithm designed specifically for this problem is used, and is discussed in detail in the 'Approach' section. The full description (not the code) is provided; a heuristic sketch follows.
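A common heuristic sketch for the finger count (not the papers' HMM pipelines): count deep convexity defects of the hand contour, each of which corresponds to a gap between extended fingers; the depth threshold is scale-dependent and assumed here:

```python
import cv2

def count_fingers(binary_hand, depth_thresh=20.0):
    """binary_hand: uint8 mask of the segmented hand."""
    contours, _ = cv2.findContours(binary_hand, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)
    if defects is None:
        return 0
    # defects[i, 0, 3] is the defect depth, fixed-point scaled by 256.
    deep = sum(1 for i in range(defects.shape[0])
               if defects[i, 0, 3] / 256.0 > depth_thresh)
    return deep + 1 if deep else 0
```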

Related papers:
[1] Zhong Yang, Yi Li, Weidong Chen, Yang Zheng, "Dynamic Hand Gesture Recognition Using Hidden Markov Models", 7th International Conference on Computer Science and Education, 2012, pp. 360-365, IEEE Conference Publication.
[2] Rautaray, S.S., Agrawal, A., "Design of Gesture Recognition System for Dynamic User Interface", 2012 IEEE International Conference on Technology Enhanced Education (ICTEE), pp. 1-6, IEEE Conference Publication.
[3] Panwar, M., "Hand Gesture Recognition based on Shape Parameters", 2012 International Conference on Computing, Communication and Applications (ICCCA), pp. 1-6, IEEE Conference Publication.


Smile Detection

The algorithm is given in [1], Section 4, "Smile Detection Scheme".
Details for a database are given in paper [2].

Related papers:

[1] Y.-H. Huang, C.-S. Fuh, “Face detection and smile detection”.

http://www.csie.ntu.edu.tw/~fuh/personal/FaceDetectionandSmileDetection.pdf

[2] Jacob Whitehill, Gwen Littlewort, Ian Fasel, Marian Bartlett, and Javier Movellan,

“Developing a Practical Smile Detector”.

http://mplab.ucsd.edu/wp-content/uploads/smiledetectionreport.pdf


Image Inpainting

The task is to select and implement an algorithm for image inpainting, i.e. the process of reconstructing lost or damaged parts of an image or video.
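Before implementing one of the papers' methods, OpenCV's built-in classical inpainting makes a useful one-call baseline for comparison:

```python
import cv2

def inpaint_baseline(image, mask, radius=3):
    """mask: uint8 array, nonzero where pixels are missing or damaged.
    cv2.INPAINT_NS (Navier-Stokes) is the other built-in option."""
    return cv2.inpaint(image, mask, radius, cv2.INPAINT_TELEA)
```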

Related papers:

[1] Y. Matsushita, E. Ofek, W. Ge, X. Tang and H.-Y. Shum, “Full-Frame Video

Stabilization with Motion Inpainting”, IEEE Trans. on Pattern Analysis and Machine

Intelligence, vol. 28, no. 7, July 2006.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.221.4938&rep=rep1&type=pdf

https://en.wikipedia.org/wiki/Inpainting


Photometric normalization for face verification – Lighting invariant processing

The task is to implement the steps described in paper [1], Section 5, "Illumination Normalization", as well as the corresponding steps in [2].
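A compact sketch of the preprocessing chain from [1] (gamma correction, difference-of-Gaussian filtering, two-stage contrast equalization); the parameter values are the defaults reported in the paper:

```python
import numpy as np
import cv2

def illumination_normalize(gray, gamma=0.2, sigma0=1.0, sigma1=2.0,
                           alpha=0.1, tau=10.0):
    """gray: uint8 face image. Returns a float illumination-normalized map."""
    x = np.power(gray.astype(np.float64) / 255.0, gamma)   # gamma correction
    x = (cv2.GaussianBlur(x, (0, 0), sigma0)
         - cv2.GaussianBlur(x, (0, 0), sigma1))            # DoG filtering
    # Two-stage contrast equalization, then tanh squashing to [-tau, tau].
    x /= np.mean(np.abs(x) ** alpha) ** (1.0 / alpha)
    x /= np.mean(np.minimum(np.abs(x), tau) ** alpha) ** (1.0 / alpha)
    return tau * np.tanh(x / tau)
```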

Related papers:

[1] X. Tan and B. Triggs, “Enhanced Local Texture Feature Sets for Face Recognition

Under Difficult Lighting Conditions”, AMFG 2007, pp. 168–182, 2007.

http://lear.inrialpes.fr/pubs/2007/TT07/Tan-amfg07a.pdf

[2] J. Short, J. Kittler and K. Messer, “A Comparison of Photometric Normalisation

Algorithms for Face Verification”, Proc. of the 6th IEEE Int. Conf. on Automatic Face

and Gesture Recognition (FGR’04).

http://www.ee.surrey.ac.uk/CVSSP/Publications-/papers/short-fgr-2004.pdf


Object Detection

Brief

For the demos please download this set of files: pedestrian_demo.zip

Introduction

The goal of this project is to implement a simple, effective method for detecting

pedestrians in an image. You will be working off of the technique of Dalal and Triggs

(PDF) from 2005. This technique has four main components:

1. A feature descriptor. We first need a way to describe an image region with a

high-dimensional descriptor. For this project, you will be implementing two

descriptors: tiny images and histogram of gradients (HOG) features.

2. A learning method. Next, we need a way to learn to classify an image region

(described using one of the features above) as a pedestrian or not. For this, we

will be using support vector machines (SVMs) and a large training dataset of

image regions containing pedestrians (positive examples) or not containing

pedestrians (negative examples).

3. A sliding window detector. Using our classifier, we can tell if an image region

looks like a pedestrian or not. The next step is to run this classifier as a sliding

window detector on an input image in order to detect all instances of

pedestrians in that image. In order to detect pedestrians at multiple scales we

run our sliding window detector at multiple scales to form a pyramid of

detector responses.

4. Non-maxima suppression. Given the pyramid generated by the sliding

window detector the final step is to find the best detections in each region by

selecting the strongest responses within a neighborhood within an image and

across scales.

Using our skeleton code as a starting point, you'll be implementing parts of all four of

these components, and evaluating your methods by creating precision-recall (PR)

curves.
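Component 4 is the most self-contained; here is a sketch of greedy non-maxima suppression over boxes already mapped back to the original image coordinates (the skeleton code's exact interface will differ):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """boxes: (n, 4) [x1, y1, x2, y2]; returns indices of kept detections."""
    order = np.argsort(scores)[::-1]          # strongest responses first
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0])
                 * (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]       # suppress strong overlaps
    return keep
```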

Downloads


Cropped Pedestrian Dataset (18 MB). You will use this dataset for training

and testing your detector.

Full Image Pedestrian Dataset (10 MB). You will use this dataset to test your

sliding windows and non-maxima suppression code.

Full negatives set (87 MB, only for extra credit)

Here the left side, in red, shows a visualization of negative weights; these are edge

orientations that should not be present in an image region containing a pedestrian. For

instance, observe the horizontal edges in the region of the legs. On the right, in green,

are the positive weights showing edge orientations that should be present in images of

pedestrians.

Further Reading

A recent survey on the best methods for pedestrian detection.
Rodrigo Benenson et al. recently presented a collection of improvements to the detector implemented in this project that give it a significant boost in performance. You can use this as inspiration for extra credit.
HOGgles: a better way to visualize HOGs (with MATLAB code available).
Rujikietgumjorn and Collins discuss a new way of handling occlusion.

A disadvantage of the method implemented in this project is that the classifier

will recognize a single view of an object (in our case a frontal or back view of

a pedestrian). Nevertheless Malisiewicz et al. show that by combining multiple

classifiers into an ensemble, each one trained on a different view of an object,

we can actually detect objects in any configuration. Furthermore, they show

that each of the individual classifiers can be trained with a single image of the

object/view of interest, as long as a large collection of negative examples is

supplied.

Another approach for generalizing this classifier to objects in general pose and

view is the work of Pedro Felzenszwalb et al. At a high level these algorithms

combine multiple classifiers similar to the one you implemented in this

project, but now each one is specialized at detecting one part of the object

(e.g., leg or arm or head).


Matching of SIFT Feature-BoW

For the BoW representation-based approach [2], the similarity between SIFT

features can be measured by matching their corresponding visual words via

histogram matching [3]. Typically, the computational complexity of the direct

keypoint matching approach is higher than that of the BoW-based approach.

Nevertheless, the outcomes of the direct keypoint matching approach are usually more reliable than those of the BoW-based approach, which suffers from quantization loss.
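A minimal sketch of the BoW side: quantize each SIFT descriptor to its nearest visual word and compare images by histogram intersection (one simple choice; [3] proposes a more refined histogram metric):

```python
import numpy as np

def bow_histogram(descriptors, centroids):
    """descriptors: (n, d) SIFT descriptors; centroids: (k, d) visual words."""
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                # nearest visual word per feature
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

def intersection(h1, h2):
    return np.minimum(h1, h2).sum()
```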

[1] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”

Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.

[2] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to

object matching in videos,” in Proc. IEEE Int. Conf. Computer Vision,

Nice, France, Oct. 2003, vol. 2, pp. 1470–1477.

[3] O. Pele and M. Werman, "A linear time histogram metric for improved

SIFT matching,” in Proc. Eur. Conf. Computer Vision, 2008, vol. 5304,

pp. 495–508.


Biologically Inspired Object Recognition using Gabor Filters

The related paper gives the processing sequence. Gabor filters are used for feature extraction. Classification is done with a specific classifier that is not mandatory to implement in this project; any classifier can be chosen, e.g. neural networks, k-NN, SVM, etc. For data, images from the Caltech-101 dataset (Fei-Fei et al., 2004) are used.
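A small sketch of the feature-extraction stage with an OpenCV Gabor filter bank; the scales, orientation count, and mean/std pooling are assumptions rather than the paper's exact model:

```python
import numpy as np
import cv2

def gabor_features(gray, ksizes=(7, 11, 15), n_orientations=8):
    """gray: uint8 grayscale image. Returns one pooled feature vector."""
    feats = []
    for ksize in ksizes:
        for j in range(n_orientations):
            theta = j * np.pi / n_orientations
            kern = cv2.getGaborKernel((ksize, ksize), sigma=ksize / 4.0,
                                      theta=theta, lambd=ksize / 2.0,
                                      gamma=0.5, psi=0)
            resp = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kern)
            feats += [resp.mean(), resp.std()]  # simple pooling per filter
    return np.array(feats)
```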

Related paper:

http://www.cim.mcgill.ca/~siddiqi/COMP-558-2012/willhamilton.pdf


Scene recognition with bag of words

An example of a typical bag of words classification pipeline. Figure by Chatfield et al.

http://cs.brown.edu/courses/csci1430/ http://cs.brown.edu/courses/csci1430/proj3/

Brief

Data: /course/cs143/asgn/proj3/data/

download VLFeat 0.9.17 binary package

VL Feat Matlab reference: http://www.vlfeat.org/matlab/matlab.html

Overview

The goal of this project is to introduce you to image recognition. Specifically, we will

examine the task of scene recognition starting with very simple methods -- tiny

images and nearest neighbor classification -- and then move on to techniques that

resemble the state-of-the-art -- bags of quantized local features and linear classifiers

learned by support vector machines.

Bag of words models are a popular technique for image classification inspired by

models used in natural language processing. The model ignores or downplays word

arrangement (spatial information in the image) and classifies based on a histogram of

the frequency of visual words. The visual word "vocabulary" is established by

clustering a large corpus of local features. See Szeliski chapter 14.4.1 for more details


on category recognition with quantized features. In addition, 14.3.2 discusses

vocabulary creation and 14.1 covers classification techniques.

For this project you will be implementing a basic bag of words model. You will

classify scenes into one of 15 categories by training and testing on the 15 scene

database (introduced in Lazebnik et al. 2006, although built on top of previously

published datasets). Lazebnik et al. 2006 is a great paper to read, although we will be

implementing the baseline method the paper discusses (equivalent to the zero level

pyramid) and not the more sophisticated spatial pyramid. For an excellent survey of

modern feature encoding methods for bag of words models see Chatfield et al, 2011.

Example scenes from each category in the 15 scene dataset. Figure from Lazebnik

et al. 2006.

Details

You are required to implement image representations -- bags of SIFT features -- and nearest neighbor or linear SVM classification techniques. In the writeup, you are asked to report performance for the following combinations:

Bag of SIFT representation and nearest neighbor classifier (accuracy of about

50-60%).

or

Bag of SIFT representation and linear SVM classifier (accuracy of about 60-

70%).

The nearest neighbor classifier is equally simple to understand. When tasked with

classifying a test feature into a particular category, one simply finds the "nearest"


training example (L2 distance is a sufficient metric) and assigns the test case the label

of that nearest training example. The nearest neighbor classifier has many desirable

features -- it requires no training, it can learn arbitrarily complex decision boundaries,

and it trivially supports multiclass problems. It is quite vulnerable to training noise,

though, which can be alleviated by voting based on the K nearest neighbors (but you

are not required to do so). Nearest neighbor classifiers also suffer as the feature

dimensionality increases, because the classifier has no mechanism to learn which

dimensions are irrelevant for the decision.

Before we can represent our training and testing images as bag of feature histograms,

we first need to establish a vocabulary of visual words. We will form this vocabulary

by sampling many local features from our training set (10's or 100's of thousands) and

then clustering them with kmeans. The number of kmeans clusters is the size of our

vocabulary and the size of our features. For example, you might start by clustering

many SIFT descriptors into k=50 clusters. This partitions the continuous, 128

dimensional SIFT feature space into 50 regions. For any new SIFT feature we

observe, we can figure out which region it belongs to as long as we save the centroids

of our original clusters. Those centroids are our visual word vocabulary. Because it

can be slow to sample and cluster many local features, the starter code saves the

cluster centroids and avoids recomputing them on future runs.

Now we are ready to represent our training and testing images as histograms of visual

words. For each image we will densely sample many SIFT descriptors. Instead of

storing hundreds of SIFT descriptors, we simply count how many SIFT descriptors

fall into each cluster in our visual word vocabulary. This is done by finding the

nearest neighbor kmeans centroid for every SIFT feature. Thus, if we have a

vocabulary of 50 visual words, and we detect 220 SIFT features in an image, our bag

of SIFT representation will be a histogram of 50 dimensions where each bin counts

how many times a SIFT descriptor was assigned to that cluster and sums to 220. The

histogram should be normalized so that image size does not dramatically change the

bag of feature magnitude.
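If you prototype outside the MATLAB starter code, the vocabulary and histogram steps look roughly like this in Python (scikit-learn's KMeans standing in for vl_kmeans):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(pooled_descriptors, vocab_size=50):
    """pooled_descriptors: (N, 128) SIFT samples from many training images.
    The fitted cluster centroids are the visual-word vocabulary."""
    return KMeans(n_clusters=vocab_size, n_init=4).fit(pooled_descriptors)

def bag_of_sift(descriptors, vocab):
    words = vocab.predict(descriptors)        # nearest centroid per feature
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()                  # normalize away image size
```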

You should now measure how well your bag of SIFT representation works when

paired with a nearest neighbor classifier. There are many design decisions and free

parameters for the bag of SIFT representation (number of clusters, sampling density,

sampling scales, SIFT parameters, etc.) so performance might vary from 50% to 60%

accuracy.

The last task is to train 1-vs-all linear SVMs to operate in the bag of SIFT feature

space. Linear classifiers are one of the simplest possible learning models. The feature

space is partitioned by a learned hyperplane and test cases are categorized based on

which side of that hyperplane they fall on. Despite this model being far less

expressive than the nearest neighbor classifier, it will often perform better. For

example, maybe in our bag of SIFT representation 40 of the 50 visual words are

uninformative. They simply don't help us make a decision about whether an image is

a 'forest' or a 'bedroom'. Perhaps they represent smooth patches, gradients, or step

edges which occur in all types of scenes. The prediction from a nearest neighbor

classifier will still be heavily influenced by these frequent visual words, whereas a

linear classifier can learn that those dimensions of the feature vector are less relevant

and thus downweight them when making a decision. There are numerous methods to


learn linear classifiers but we will find linear decision boundaries with a support

vector machine. You do not have to implement the support vector machine. However,

linear classifiers are inherently binary and we have a 15-way classification problem.

To decide which of 15 categories a test case belongs to, you will train 15 binary, 1-vs-

all SVMs. 1-vs-all means that each classifier will be trained to recognize 'forest' vs

'non-forest', 'kitchen' vs 'non-kitchen', etc. All 15 classifiers will be evaluated on each

test case and the classifier which is most confidently positive "wins". E.g. if the

'kitchen' classifier returns a score of -0.2 (where 0 is on the decision boundary), and

the 'forest' classifier returns a score of -0.3, and all of the other classifiers are even

more negative, the test case would be classified as a kitchen even though none of the

classifiers put the test case on the positive side of the decision boundary. When

learning an SVM, you have a free parameter 'lambda' which controls how strongly

regularized the model is. Your accuracy will be very sensitive to lambda, so be sure to

test many values.
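For reference, the 1-vs-all scheme above in Python (the assignment itself uses vl_svmtrain): scikit-learn's LinearSVC trains one-vs-rest binary SVMs by default, and its C parameter plays the role of 'lambda' (roughly C ~ 1/lambda), so sweep it widely:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_and_classify(X_train, y_train, X_test, C=10.0):
    clf = LinearSVC(C=C).fit(X_train, y_train)      # 15 binary 1-vs-all SVMs
    scores = clf.decision_function(X_test)          # one score per category
    return clf.classes_[np.argmax(scores, axis=1)]  # most positive wins
```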

Evaluation and Visualization

Build a confusion matrix and visualize your classification decisions by producing a

table of true positives, false positives, and false negatives as a webpage.

Data

You are expected to test performance on random splits of the data into training and test sets.

Useful Functions

Useful functions from VLFeat. Keep in mind that while MATLAB represents points as row vectors, VLFeat uses column vectors. Thus you might need to transpose your matrices/vectors frequently.

vl_dsift(). This function returns SIFT descriptors sampled at a regular step size

from an image. You are allowed to use this function because you have already

implemented the SIFT descriptor in project 2. You can use your own code if you

want, but it is probably too slow.

vl_kmeans(). This function performs kmeans clustering and you can use it when

building the bag of SIFT vocabulary. Matlab also has a built in kmeans function, but

it is slow.

vl_svmtrain(). This function returns the parameters of a linear decision boundary (a

hyperplane in your feature space). You will use the distance from this hyperplane as a

measure of confidence in your 1-vs-all classifier.

vl_alldist2(). This function returns all pairs of distances between the columns of

two matrices. This is useful in your nearest neighbor classifier and when assigning SIFT features to the nearest cluster center. You can use this function because you have already written a similar distance computation as part of the project 2 matching step.


Video Detection and Enhancement of Small Unmanned Aircraft

This project is aimed at being able to successfully detect, track, and highlight the motion of a small unmanned aerial vehicle (UAV). We would like to explore both a ground-based observer location and an in-flight observer location. The work for this project focuses on the detection of small aircraft (UAVs) and is an extension of [1].

A proposed starting point for a ground-based observer method is a technique called sky segmentation, which splits sky and ground elements in the frame and looks for non-sky elements in the sky portion [2]. This would assist us in the detection of an aircraft on a near-static background, as there is a good probability that the UAV will be in the sky portion of the video. For an in-flight observer viewing from above the aircraft, the entire field of view will be non-sky, requiring a different approach. Here we can start to leverage techniques that compare the velocity of different elements of the video (by comparing sequential frames) and look for elements that are traveling at different rates than the background.

UAVs pose an additional problem in that some types of vehicles have the ability to hover and therefore move along with the background. To address this issue, we plan on using a navigation-like approach, similar to the approach in [3], where a Kalman filter was used. Using a statistical model of the dynamics of a UAV and measured dynamics information from the video, we plan on decreasing the false positive results and improving tracking performance. After detection, we would like to use image processing techniques from class to process aircraft and background pixels separately, to highlight the aircraft and display its path in the video for an enhanced user experience.

References
Previous work: [1] https://stacks.stanford.edu/file/druid:my512gb2187/Hammond_Padial_Obstacle_Classification_and_Segmentation.pdf
Sky segmentation: [2] http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1570842
Distance and velocity determination: [3] http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6205034
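A tiny frame-differencing sketch for the near-static ground-observer case, flagging pixels that move between consecutive frames; the threshold and morphology settings are assumptions:

```python
import cv2

def motion_mask(prev_gray, curr_gray, thresh=25):
    """prev_gray, curr_gray: consecutive uint8 grayscale video frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    # Opening removes isolated noise pixels before blob analysis.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```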