VizWiz::LocateIt - Enabling Blind People to Locate Objects in Their Environment

Jeffrey P. Bigham†, Chandrika Jayant∗, Andrew Millerγ, Brandyn White‡, and Tom Yeh‡

†University of Rochester CS, Rochester, NY 14627 USA, [email protected]

‡University of Maryland CS, College Park, MD 20742 USA, {bwhite, tomyeh}@umiacs.umd.edu

γUniversity of Central Florida CS, Orlando, FL 32816 USA, [email protected]

∗University of Washington CSE, Seattle, WA 98195 USA, [email protected]

Abstract

Blind people face a number of challenges when interacting with their environments because so much information is encoded visually. Text is pervasively used to label objects, colors carry special significance, and items can easily become lost in surroundings that cannot be quickly scanned. Many tools seek to help blind people solve these problems by enabling them to query for additional information, such as color or text shown on the object. In this paper we argue that many useful problems may be better solved by directly modeling them as search problems, and present a solution called VizWiz::LocateIt that directly supports this type of interaction. VizWiz::LocateIt enables blind people to take a picture and ask for assistance in finding a specific object. The request is first forwarded to remote workers who outline the object, enabling efficient and accurate automatic computer vision on the user's existing cellphone. A two-stage algorithm then uses this information to interactively guide users to the appropriate object from their phone.

1. Introduction and Motivation

Blind people face challenges when interacting with their environments because so much information is encoded visually. For example, to find a specific object, a blind person may use various applications that attempt to translate encoded visual information such as text and color; these can verify an object once it is found, but do not help establish a starting point for the search (Figure 1). Items can easily become lost in surroundings that cannot be quickly scanned.

Figure 1. Many real world problems that blind people face can be framed as visual search problems. For instance, finding the "cancel call" button on an elevator otherwise labeled with Braille, finding the right soup from two tactually identical boxes, or finding a favorite DVD from among many on a shelf.


Existing technology reveals information that blind people may be missing, but many of the problems that blind people face may be better framed as search problems. The question "What does this can say?" is usually asked as part of the process of finding the desired can. A blind person may serially ask for the label from one can after another until they hear the answer they want (e.g., corn). Similarly, "What color is this button?" may be asked because technology does not yet directly support the underlying information need: "Which button do I press to record?" In this paper, we present VizWiz::LocateIt, a system that can help blind people find arbitrary items in their environments.

VizWiz::LocateIt has three logical components. The first is an iPhone application that lets blind users take a picture and ask for an item that they would like help finding. Both the picture and the audio of the requested item are sent to the second component: a remote server that puts the picture and question on a web page and recruits human workers to outline the object in the picture (Figure 2). The server component interfaces with an existing service called Mechanical Turk, provided by Amazon, Inc. [1], and is specially designed to get answers back quickly (generally in less than a minute). The third component is an interface, again built into the iPhone application, that uses relatively simple computer vision running in realtime on the iPhone, along with the description of the region selected by the remote worker, to help interactively guide users to the desired object. VizWiz effectively outsources the as-yet-unsolved parts of computer vision to remote workers and uses highly accurate recognition on the phone to automatically and interactively guide human users.
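To make this request flow concrete, the following is a minimal sketch of how a client could talk to such a server: upload the overview photo and the recorded question, then poll until a worker's outline comes back. This is an illustration only; the server URL, endpoint paths, and field names are hypothetical and do not describe the actual VizWiz service or its API.

import time
import requests

SERVER = "https://vizwiz-server.example.org"   # hypothetical server, not the real service

def request_outline(photo_path, question_audio_path):
    # Upload the overview photo and spoken question; the server relays them
    # to remote workers (e.g., on Mechanical Turk) who outline the object.
    with open(photo_path, "rb") as photo, open(question_audio_path, "rb") as audio:
        job = requests.post(SERVER + "/locate",
                            files={"photo": photo, "audio": audio}).json()

    # Poll until a worker's outline is available; in the authors' experience
    # answers generally come back in less than a minute.
    while True:
        result = requests.get(SERVER + "/locate/" + job["id"]).json()
        if result["status"] == "done":
            return result["outline"]   # region around the object, in image coordinates
        time.sleep(2)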

Although not all of the information needs of blind people can be framed as a search problem and solved by VizWiz, its unique architecture and clever reframing of problems mean that many can. Because humans identify the objects within the photographs, users can ask questions that require both visual acumen and intelligence. For instance, in Figure 2, humans need to know that passports are generally little navy booklets and are likely to be lying on a desk. Once they have outlined this example of a passport, many techniques can be used to locate that example again under the same lighting and orientation.

1.1. Motivating Scenario

Consider the following motivating scenario. Julie, a blind woman, goes to her local grocery store to pick up, among other things, cereal for breakfast. From prior visits, she has learned where the cereal aisle is. Once in the aisle, she wants to find Frosted Flakes, her favorite cereal, but does not know exactly where it is located because its precise location often changes as new products are added. She remembers that cereal is on the left side and takes a picture with her cell phone. She says "Frosted Flakes?" to VizWiz::LocateIt, and moments later her phone vibrates to tell her that it is ready to guide her to the appropriate box. The application first guides her toward the direction of the correct cereal box using audible clicks that increase in frequency when the phone is pointed in the right direction. Once she gets close to the shelf, the phone switches to a similar audible clicking feedback to tell her when her phone is over the right cereal box. Julie picks up the cereal, puts it into her cart, and continues shopping.

Figure 2. The interface for remote workers to identify and outline objects. Because real people are tasked with finding the object, the object specification can rely on both their intelligence and contextual knowledge. In this example, the worker needs to know what a passport is and about what size it would be. The worker may also benefit from knowing that passports in the United States are a deep navy color.

1.2. Contributions

This paper offers the following contributions:

• We motivate and present VizWiz::LocateIt, an accessible mobile system that lets blind people take a picture and request an item that they would like to find, and that interactively helps them find that item.

• We show how human-powered services, such as Mechanical Turk, can be used in concert with automatic computer vision to enable advanced computer vision in realtime.

• We present a two-stage algorithm, running on the iPhone, that uses both the accelerometer and realtime computer vision to help users locate a desired object.


2. Related Work

Work related to VizWiz generally falls into two categories: (i) the mobile devices that blind people already carry, and (ii) work on leveraging humans remotely as part of computational processes. We briefly survey these areas to sketch the landscape in which cheap, mobile applications like VizWiz have the potential to be very useful.

2.1. Mobile Devices for Blind People

Most mainstream cellphones are not accessible to blind people. Smartphones often provide the best access through separate screen-reading software like Mobile Speak Pocket (MSP) [13]. Though somewhat popular, such software has seen limited uptake among blind users due to its high price (an additional $500 after the cost of the phone). Google's Android platform and the Apple iPhone 3GS now include free screen readers [20, 5]. The iPhone has proven particularly popular among blind users, which motivated us to concentrate on it for VizWiz. Apple's stringent controls on the applications available in its online store and the tighter integration of its screen reader (VoiceOver) with the operating system have resulted in a large number of accessible applications. Touchscreen devices like the iPhone were once assumed to be inaccessible to blind users, but well-designed multitouch interfaces leverage the spatial layout of the screen and can even be preferred by blind people [29].

Applications for general-purpose smartphones are beginning to replace special-purpose devices, but blind people still carry devices such as GPS-powered navigation aids, barcode readers, light detectors, color identifiers, and compasses [30]. Some accessible applications that use the camera on existing phones include currency-reading applications and color identifiers [33, 19]. Because VizWiz connects users to real people, it can potentially answer all of the questions answerable by many costly special-purpose applications and devices.

Talking OCR Devices: Of particular interest to blind people is the ability to read text, which pervasively labels objects and provides information. Both the knfbReader [10] and the Intel Reader [7] are talking mobile OCR tools. Human computation could have an advantage over tools such as these because humans can still read more text, written in more variations, than can automatic approaches. When OCR works, however, it is faster and can be used to transcribe large text passages. Human workers are slower, but this may be partially offset by their ability to take instructions that require intelligence. For example, an OCR program can read an entire menu, but cannot be asked, "What is the price of the cheapest salad?"

Other Automatic Computer Vision for Mobile Devices: Several research projects and products expose automatic computer vision on mobile devices. Photo-based Question Answering enables users to ask questions that reference an included photograph, and tackles the very difficult problems of automatic computer vision and question answering [41]. Google Goggles enables users to take a picture and returns related search results based on object recognition and OCR [6]. Although these projects have made compelling progress, the state of the art in automatic approaches is still far from being able to answer arbitrary questions about photographs and video. Although not on mobile phones, GroZi is another project using vision; it is a grocery shopping assistant for the visually impaired that uses Wiimotes, gloves with LED lights, a laptop, a wireless camera, and spatialized audio and feedback cues [25].

Interfacing with Remote Services: Most mobile tools are implemented solely as local software, but more applications are starting to use remote resources. For instance, TextScout [17] provides an accessible OCR interface, and Talking Points delivers contextually relevant navigation information in urban settings [26]. VizWiz also sends questions off for remote processing, and these services suggest that people are becoming familiar with outsourcing questions to remote services.

2.2. Human-Powered Services

VizWiz builds on prior work in using human computation to improve accessibility [22]. The ESP Game was originally motivated (in part) by the desire to provide descriptions of web images for blind people [40]. The Social Accessibility project connects blind web users who experience web accessibility problems to volunteers who can help resolve them, but 75% of requests remain unsolved after a day [39]. Solona started as a CAPTCHA-solving service, and now lets registered blind people submit images for description [15]. According to its website, "Users normally receive a response within 30 minutes." VizWiz's nearly real-time approach could be applied to other problems in the accessibility space, including improving web accessibility. Locating objects, as in VizWiz::LocateIt, is one such application.

Prior work has explored how people ask and answer questions on their online social networks [35]. While answers were often observed to come back within a few minutes, response time varied quite a lot. The "Social Search Engine" Aardvark adds explicit support for asking questions of your social network, but advertises that answers come back "within a few minutes." [37]


Mechanical Turk has made outsourcing small paid jobs practical [1]. It has been used for a wide variety of tasks, including gathering data for user studies [31], labeling image data sets used in computer vision research [38], and determining political sentiments in blog snippets [28]. The Amazon Remembers feature of Amazon's iPhone application lets users take pictures of objects and later emails them similar products that Amazon sells [2]. It is widely suspected that Amazon outsources some of these questions to Mechanical Turk. The TurKit library enables programmers to easily employ multiple Turk workers using common programming paradigms [32].

2.3. Connecting Remote Workers to Mobile Devices

Some human-powered services provide an expectation of latency. ChaCha and KGB employees answer questions asked via the phone or by text message in just a few minutes [4, 8]. Other common remote services include relay services for deaf and hard of hearing people (which require trained employees) [36], and the retroactive, nearly real-time audio captioning by dedicated workers in Scribe4Me [14]. A user study of Scribe4Me found that participants felt waiting the required 3-5 minutes was too long because it "leaves one as an observer rather than an active participant."

Existing Use of Photos and Video for Assistance: Several of the blind consultants whom we interviewed mentioned using digital cameras and email to informally consult sighted friends or family in particularly frustrating or important situations (e.g., checking one's appearance before a job interview). Back in 1992, remote reading services for the blind were proposed using low-cost fax equipment and sighted remote readers; compressed video technology allowed very low frame-rate, high-resolution video transmission over ordinary telephone lines [23]. oMoby is an iPhone application similar to Google Goggles, but instead of an automated database lookup, human computation is used. The Soylent Grid CAPTCHA-based image labeling system relies on remote human annotation of CAPTCHA images, which are then included in a searchable database [24].

LookTel is a soon-to-be-released talking mobile application that can connect blind people to friends and family members via a live video feed [11]. Although future versions of VizWiz may similarly employ video, we chose to focus on photos for two reasons. First, mobile streaming is not possible in much of the world because of slow connections. Even in areas with 3G coverage, our experience has been that the resolution and reliability of existing video services like UStream [18] and Knocking [9] is too low for many of the questions important to blind people.

[Figure 3 diagram labels: Stage 1; Stage 2; "Outline the Wheaties"; Remote Worker on Mechanical Turk.]
Figure 3. To use VizWiz::LocateIt, users first take a picture of an area in which they believe the desired item is located, and this is sent to remote workers on Mechanical Turk who outline the item in the photograph. During this stage, VizWiz uses the accelerometer and compass to direct the user in the right direction. Once users are closer to the objects, VizWiz switches to a color-histogram approach to help users home in on a specific item.

Second, using video removes the abstraction between user and provider that VizWiz currently provides. With photos, questions can be asked quickly, workers can be employed for short amounts of time, and multiple redundant answers can be returned.

3. VizWiz::LocateIt

Here we present our work on VizWiz::LocateIt, a prototype system that combines remote human work with automatic computer vision to help blind people locate arbitrary items in their environments (Figure 3). To support object localization we created the following two components: a web interface that lets remote workers outline objects, and the VizWiz::LocateIt mobile interface, consisting of the Sensor (zoom and filter) and Sonification modules.



Figure 4. Frames captured by blind users during the Filter stage, exemplifying computer vision challenges.

3.0.1 Sensor Module

In the zoom stage (stage 1), the Sensor module estimates how much the user needs to turn in the direction of the target object (left, right, up, or down). It first uses the object's image location (u, v), indicated by the remote worker, to calculate the 3D position (x, y, z) of the object relative to the user's current position. Constructing such a mapping typically requires knowledge of a set of camera parameters that are extrinsic (e.g., camera orientation) and intrinsic (e.g., focal length, lens distortion). We estimate the intrinsic parameters by camera calibration once per device and compute the extrinsic parameters directly from the device's built-in compass (heading angle) and accelerometer (gravity vector) once per camera movement. Note that the extrinsic parameters change whenever the camera moves, whereas the intrinsic parameters stay constant and only need to be computed once per device. Once the 3D position of the target object is known, we can also compute the 3D position (x', y', z') toward which the camera is currently pointing using a similar procedure. The angular cosine distance between the two resulting vectors indicates how much the user needs to turn, and is passed to the Sonification module to generate appropriate audio cues.
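To make the geometry concrete, here is a minimal NumPy sketch of this computation under our own simplifying assumptions: an ideal pinhole camera, rotation matrices already assembled from the compass heading and gravity vector, and viewing directions in place of full 3D positions (which is enough to produce the angular cue). The helper names are ours; this is not the authors' iPhone implementation.

import numpy as np

def pixel_to_world_ray(u, v, K, R):
    # Back-project pixel (u, v) into a unit viewing direction in world coordinates.
    # K is the 3x3 intrinsic matrix (calibrated once per device); R is the
    # camera-to-world rotation built from the compass heading and the
    # accelerometer's gravity vector at the time the overview photo was taken.
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray in camera coordinates
    ray_world = R @ ray_cam
    return ray_world / np.linalg.norm(ray_world)

def pointing_error(target_ray, R_current):
    # Angular cosine distance between the direction of the outlined object and
    # the direction the camera currently points (its optical axis); 0 means on target.
    current_ray = R_current @ np.array([0.0, 0.0, 1.0])  # optical axis in world coordinates
    cos_sim = float(np.clip(np.dot(target_ray, current_ray), -1.0, 1.0))
    return 1.0 - cos_sim

def turn_hint(target_ray, R_current):
    # Rough left/right and up/down hint: express the target direction in the
    # current camera frame and read off the signs of its x and y components.
    t = R_current.T @ target_ray
    return ("right" if t[0] > 0 else "left"), ("down" if t[1] > 0 else "up")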

In the filter stage (stage 2), the Sensor module uses computer vision to determine how close the current camera view is to the object outlined by the remote worker. This task is non-trivial because input images are often blurred, tilted, varied in scale, and improperly framed, since blind users cannot see the image they are capturing (Figure 4). We implemented two visual matching schemes, based on speeded-up robust features (SURF) and on color histograms, respectively. In the first scheme, a homography between each captured frame and the overview image is computed from correspondences of SURF features [21]. Using the homography, the object's location (u, v) in the overview image is mapped to a location (u', v') in the current frame, and the distance between (u', v') and the center of the current frame is computed. The smaller this distance, the more "centered" the target object is in the current frame.

We found that local features were quite susceptible to problems with lighting and blur, so we also created a visual matching scheme based on color histograms that is more robust to these conditions. A color histogram h of the object outlined by the remote helper is computed. Each input frame is then divided into N blocks, and a color histogram h_i is computed for each block i, which improves robustness to improper framing. Each computed histogram h_i is compared to the target histogram h using the L1 distance d_i. The total distance D is the minimum over contiguous subsets of the N individual block distances. The smaller D is, the more "similar" the object in the current frame is to the target object. To give users a consistent sense of distance, D is normalized by the smallest value observed during the k most recent interactions with the system. The normalized distance is then passed to the Sonification module to generate audible feedback.
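The color-histogram scheme can be sketched in a few lines of OpenCV and NumPy. This is an illustrative reading of the description above rather than the authors' code: the bin counts, the number of blocks, the use of vertical strips, and averaging the per-block L1 distances over each contiguous run are all our assumptions.

import numpy as np
import cv2

BINS = (8, 8, 8)   # assumed color-histogram bin counts; not specified in the paper
N_BLOCKS = 4       # assumed number of blocks each frame is divided into

def color_hist(region_bgr):
    # Normalized color histogram of an image region.
    hist = cv2.calcHist([region_bgr], [0, 1, 2], None, list(BINS),
                        [0, 256, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-9)

def frame_distance(frame_bgr, target_hist, n_blocks=N_BLOCKS):
    # Distance D for one frame: per-block L1 distances d_i to the target
    # histogram h, then the minimum over contiguous runs of blocks (here the
    # run average), which tolerates improper framing.
    height, width = frame_bgr.shape[:2]
    bw = width // n_blocks
    d = [np.abs(color_hist(frame_bgr[:, i * bw:(i + 1) * bw]) - target_hist).sum()
         for i in range(n_blocks)]
    return min(np.mean(d[i:j])
               for i in range(n_blocks) for j in range(i + 1, n_blocks + 1))

class RecentMinNormalizer:
    # Normalize D by the smallest value observed over the k most recent frames,
    # so the feedback keeps a consistent sense of scale.
    def __init__(self, k=20):
        self.k, self.recent = k, []

    def __call__(self, D):
        self.recent = (self.recent + [D])[-self.k:]
        return D / (min(self.recent) + 1e-9)

In this sketch, the target histogram h would be computed once from the region of the overview image enclosed by the worker's outline, and frame_distance would then be applied to each live camera frame.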

3.0.2 Sonification Module

The Sonification module takes the distances computed by the Sensor module and generates audio feedback to inform the user how close she is to the goal. In the zoom stage, the goal is a particular direction to "zoom in" on (i.e., walk closer to). In the filter stage, the goal is a particular object, for which we implemented three different sonification options. The first two are based on the pitch of a tone and the frequency of clicking, respectively. The third scheme is a voice that announces a number between one and four, which maps to how close the user is to the goal.
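As an illustration of how such mappings might look, here is a small sketch of the clicking-rate and spoken-number schemes; the interval bounds and the exact mapping from the normalized distance are our own choices, not values reported in the paper.

def click_interval(norm_distance, fastest=0.1, slowest=1.0):
    # Clicking-rate scheme: map the normalized distance (1.0 means "as close as
    # recently observed") onto the pause, in seconds, between clicks; smaller
    # distances produce faster clicking.
    d = min(max(norm_distance - 1.0, 0.0), 1.0)   # clamp to [0, 1] above the best seen
    return fastest + d * (slowest - fastest)

def spoken_level(norm_distance, levels=4):
    # Voice scheme: announce a number from 1 (closest) to 4 (farthest).
    d = min(max(norm_distance - 1.0, 0.0), 1.0)
    return 1 + min(int(d * levels), levels - 1)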

3.1. User Study

We conducted a preliminary within-subjects lab-based user study of VizWiz::LocateIt in which we asked participants to find a desired cereal box using (i) LocateIt (color-histogram version) and (ii) a commercially available barcode scanner with a talking interface (Figure 5(b)). We prepared three shelves, each with five cereal boxes (Figure 5(a)). All cereal boxes were unique and unopened, and they reflected a diversity of sizes, colors, and weights. We recruited seven participants (two female; four totally blind, three low vision) aged 48 years on average (SD=8.7). Only one participant owned an iPhone, four others had experience with an iPhone, and five had previously taken photographs on inaccessible cell phone cameras.

Participants were trained using both methods (approximately 10 minutes). Participants then completed three timed trials using each method.


Figure 5. (a) Our mock grocery store shelf stocked with 15 different cereal boxes. (b) The ID Mate II talking barcode scanner from Envision America.

For the LocateIt trials, the zoom and filter stages were timed separately. For the purposes of this study, researchers answered requests, in order to concentrate on the user's interaction with the application, although our experience has been that workers on Mechanical Turk can quickly answer questions requiring them to outline objects. For all six trials, participants started 10 feet in front of the shelves, and boxes were randomized after each trial.

3.2. Results

Participants used LocateIt and the barcode scanner very differently. LocateIt enabled users to zero in on the right part of the shelf quickly, like a visual scan, whereas the barcode scanner required each box to be scanned serially. The time required for each tool was similar, although LocateIt produced many more errors. LocateIt took an average of 92.2 seconds (SD=37.7), whereas the barcode scanner took an average of 85.7 seconds (SD=55.0), although the researchers answered questions in approximately 10 seconds, compared to the almost 30 seconds that we would expect workers on Mechanical Turk to require. Participants found the correct box in all cases using the barcode scanner (since it clearly spoke the name of each box), but found the correct box using LocateIt on their first try in 12 of 21 cases and on their second try in 7 of 21 cases.

Interestingly, the zoom stage of LocateIt led users to the correct area of the wall in only 30.7 seconds on average (SD=15.9). We informally tried using the first stage of LocateIt to direct users to approximately the right part of the wall, and then had them switch to the barcode scanner for identification. This ended up being slower, primarily because of how cumbersome it was to switch between devices. In future work, we will explore how to better integrate human-powered and automatic services. None of the participants wanted to carry around a bulky barcode reader, or even a smaller portable one, because of their high prices and inconvenience. All participants said, however, that they would use an accessible barcode reader on their phone if one were available.

In summary, our first LocateIt prototype was comparable to the barcode scanner in terms of task completion time but produced more errors. However, LocateIt's advantages include being useful for general visual search problems, not requiring objects to be tagged in advance, and potentially scaling much better. From the observations and results we draw three lessons related to cues and orientation that will inform this work as we go forward.

First, participants used many cues other than the audio information from LocateIt and the barcode scanner: shaking the boxes, drawing on prior knowledge of box size, or using colors (low vision participants).

Second, all participants liked the clicks used in the zoom stage of our application. For the second stage, many alternatives were brought up, including vibration, pitch, more familiar sounds (e.g., the chirps and cuckoos of crosswalk signals), verbal instructions, or a combination of output methods, many of which are used in other applications for blind people [27, 16, 12].

Finally, three participants had difficulty walking in a straight line from their starting position to the shelf once a direction was indicated, and wanted a more continuous clicking noise to keep them on track. Most of the participants had trouble keeping the phone perpendicular to the ground. In the up-close stage, all fully blind participants had trouble judging how far back from each cereal box they should hold the phone in order to frame the cereal boxes.

To design with human values in mind, we asked participants how comfortable they would feel using an application such as LocateIt in public. All participants said they would likely be comfortable with such an application if it worked in nearly real-time, but wondered about the reaction of bystanders. In practice, LocateIt feedback could be delivered through a headset, although vibrational output might be preferred so as not to interfere with the user's primary sense.

4. Discussion

We have motivated modeling the problems that blind people face in their environments as visual search problems, and proposed a novel two-stage algorithm that uses on-board sensors and input from remote workers. This approach lets blind people start benefiting from the technology before automatic computer vision can achieve all of the necessary functionality, and it lets us start building a corpus of the types of tasks blind people want solved, which can help motivate future research.

VizWiz::LocateIt combines automatic computer vision with human-powered vision, offloading the vision not yet possible to do automatically to humans while retaining the quick response times offered by automatic services.


This allowed us to prototype an interaction that would not otherwise have been possible and to easily begin a participatory design process to see whether this type of interaction is desirable or even useful. This project highlights the potential of nearly real-time human computation to influence early designs.

5. Conclusion and Future Work

We have presented VizWiz::LocateIt, a mobile system that enables blind people to locate objects in their environment using a unique combination of remote human computation and local automatic computer vision. This project represents a novel change in how assistive technology works, directly inspired by how blind people overcome many accessibility shortcomings today: ask a sighted person. Our approach can make this easier while keeping the users in control.

As we move forward, we plan to directly engage with blind and low-vision people to help test our approach after another iteration of design informed by our current work. We will start with formal lab studies to gauge its effectiveness and then release the application to many users to get results from the field. Engaging the user population in this way is as important to the success of the project as getting the technology right. As just one example of why this is vital, users may need to collectively help one another imagine uses for VizWiz::LocateIt and learn how to restructure the problems they face as search problems.

In future work, we plan to study in more depth how best to enable blind people to take pictures appropriate to the questions they seek to ask, not just for object location but for general questions about the environment as well. Taking good photographs is of general interest, as blind people often want to share photographs with friends and family, just like everyone else. Taking pictures, and in particular framing and focusing photographs, can be difficult. This, however, has not stopped blind photographers from taking and sharing photographs [3]. We might be able to provide software support to help them take pictures more easily. We also plan to explore the limits of what can be done to receive answers even more quickly from remote human computation. For example, active techniques might attract Turk workers to our Human Intelligence Tasks (HITs) before they are needed and keep them busy with other work until a realtime task comes in that can be inserted at the front of the queue.

VizWiz presents an excellent platform for new research. We believe that low-cost, readily available human computation can be applied to many problems faced by disabled users on the go.

6. Acknowledgements

We would like to thank Jennifer Asuscion, Rob Miller, Greg Little, and Aubrey Tatarowicz.

References

[1] Amazon Mechanical Turk. http://www.mturk.com/.

[2] Amazon Remembers, 2010. http://www.amazon.com/gp/.

[3] Blind With Camera: Changing lives with photography, 2009. http://blindwithcamera.org/.

[4] ChaCha, 2010. http://www.chacha.com/.

[5] Eyes-Free Shell for Android, 2009. http://code.google.com/p/eyes-free/.

[6] Google Goggles, 2010. http://www.google.com/mobile/goggles/.

[7] Intel Reader, 2009. http://www.intel.com/healthcare/reader/.

[8] kgb, 2010. http://www.kgb.com/.

[9] Knocking Live Video, 2010. http://knockinglive.com/.

[10] knfbReader. knfb Reading Technology, Inc., 2008. http://www.knfbreader.com/.

[11] LookTel, 2010. http://www.looktel.com/.

[12] Miniguide US. http://www.gdp-research.com.au/minig_4.htm.

[13] Mobile Speak screen readers. Code Factory, 2008. http://www.codefactory.es/en/products.asp?id=16.

[14] Scribe4Me: Evaluating a mobile sound transcription tool for the deaf. In UbiComp 2006: Ubiquitous Computing, 2006.

[15] Solona, 2010. http://www.solona.net/.

[16] Talking Signs, 2008. http://www.talkingsigns.com/.

[17] TextScout - your mobile reader, 2010. http://www.textscout.eu/en/.

[18] Ustream, 2010. http://www.ustream.tv/.

[19] The vOICe for Android, 2010. http://www.seeingwithsound.com/android.htm.

[20] VoiceOver: Mac OS X, 2007. http://www.apple.com/accessibility/voiceover/.

[21] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding (CVIU), 110(3):346-359, 2008.


[22] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. VizWiz: Nearly realtime answers to visual questions. In Proceedings of the 7th International Cross-Disciplinary Conference on Web Accessibility (W4A 2010), submitted. Available at http://www.cs.rochester.edu/u/jbigham/papers/challengevizwiz/vizwiz.pdf.

[23] J. Brabyn, W. Mendec, and W. Gerrey. Remote reading systems for the blind: A potential application of virtual presence. In Engineering in Medicine and Biology Society, volume 14, 1992.

[24] P. Faymonville, K. Wang, J. Miller, and S. Belongie. CAPTCHA-based image labeling on the Soylent Grid. In HCOMP '09: Proceedings of the ACM SIGKDD Workshop on Human Computation, 2009.

[25] G. Sze-en Foo. Summary 2009: Grocery shopping for the blind/visually impaired. Client: National Federation of the Blind, 2009.

[26] S. Gifford, J. Knox, J. James, and A. Prakash. Introduction to the Talking Points project. In Assets '06: Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility, pages 271-272, New York, NY, USA, 2006. ACM.

[27] D. Hong, S. Kimmel, R. Boehling, N. Camoriano, W. Cardwell, G. Jannaman, A. Purcell, D. Ross, and E. Russel. Development of a semi-autonomous vehicle operable by the visually-impaired. In IEEE Intl. Conf. on Multisensor Fusion and Integration for Intelligent Systems, pages 539-544, 2008.

[28] P.-Y. Hsueh, P. Melville, and V. Sindhwani. Data quality from crowdsourcing: A study of annotation selection criteria. In HLT '09: Proc. of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pages 27-35, Morristown, NJ, 2009.

[29] S. K. Kane, J. P. Bigham, and J. O. Wobbrock. Slide Rule: Making mobile touch screens accessible to blind people using multi-touch interaction techniques. In Assets '08: Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility, pages 73-80, New York, NY, USA, 2008. ACM.

[30] S. K. Kane, C. Jayant, J. O. Wobbrock, and R. E. Ladner. Freedom to roam: A study of mobile device adoption and accessibility for people with visual and motor disabilities. In Assets '09: Proc. of the 11th Intl. ACM SIGACCESS Conf. on Computers and Accessibility, pages 115-122, New York, NY, 2009.

[31] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with Mechanical Turk. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems (CHI 2008), pages 453-456, 2008.

[32] G. Little, L. Chilton, R. C. Miller, and M. Goldman. TurKit: Tools for iterative tasks on Mechanical Turk. http://glittle.org/Papers/TurKit.pdf, 2009.

[33] X. Liu. A camera phone based currency reader for the visually impaired. In Assets '08: Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility, pages 305-306, New York, NY, USA, 2008. ACM.

[34] T. Matthews, J. Fong, F. W.-L. Ho-Ching, and J. Mankoff. Evaluating visualizations of non-speech sounds for the deaf. Behaviour & Information Technology, 25(4):333-351, 2006.

[35] M. R. Morris, J. Teevan, and K. Panovich. What do people ask their social networks, and why? A survey study of status message Q&A behavior. In Proc. of the Conf. on Human Factors in Computing Systems (CHI 2010), to appear.

[36] M. Power. Deaf people communicating via SMS, TTY, relay service, fax, and computers in Australia. Journal of Deaf Studies and Deaf Education, volume 12, 2006.

[37] D. Horowitz and S. D. Kamvar. The anatomy of a large-scale social search engine. In Proc. of the World Wide Web Conf. (WWW 2010), to appear, Raleigh, NC, 2010.

[38] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In CVPRW '08: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops, pages 1-8, June 2008.

[39] H. Takagi, S. Kawanaka, M. Kobayashi, T. Itoh, and C. Asakawa. Social accessibility: Achieving accessibility through collaborative metadata authoring. In Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2008), pages 193-200, Halifax, Nova Scotia, Canada, 2008.

[40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04), April 2004.

[41] T. Yeh, J. J. Lee, and T. Darrell. Photo-based question answering. In Proc. of the Intl. Conf. on Multimedia (MM '08), pages 389-398, Vancouver, BC, Canada, 2008.