
Touchless Typing Using Head Movement-based Gestures

Shivam Rustagi∗†, Aakash Garg∗†, Pranay Raj Anand‡, Rajesh Kumarβγ, Yaman Kumar‡α and Rajiv Ratn Shah‡
†Delhi Technological University, India  ‡Indraprastha Institute of Information Technology Delhi, India

αAdobe, India  βHaverford College, USA  γSyracuse University, USA
[email protected], [email protected], [email protected],

[email protected], {[email protected], [email protected]}, [email protected]

Abstract—In this paper, we propose a novel touchless typing interface that makes use of an on-screen QWERTY keyboard and a smartphone camera. The keyboard was divided into nine color-coded clusters. The user moved their head toward clusters, which contained the letters that they wanted to type. A front-facing smartphone camera recorded the head movements. A bidirectional GRU based model which used pre-trained embedding rich in head pose features was employed to translate the recordings into cluster sequences. The model achieved an accuracy of 96.78% and 86.81% under intra- and inter-user scenarios, respectively, over a dataset of 2234 video sequences collected from 22 users.

Keywords-touchless typing, contactless typing, gesture recognition, head movement, deep learning, and accessibility

I. INTRODUCTION

Assistive technologies have attracted the attention of several Human-Computer Interaction (HCI) researchers over the past decade. People with limited physical abilities find it difficult to interact with traditional equipment, including keyboards, mice, and joysticks, as these require consistent physical-contact-based interaction. For example, a person with a physical condition such as quadriplegia (all four limbs are paralyzed) cannot use any of the off-the-shelf devices for typing. Recent advances in motion tracking, computer vision, and natural language processing have enabled researchers to develop touchless techniques that assist people in interacting with smart devices. Examples include lip-reading [1], speech recognition [2], Brain-Computer Interfaces (BCI) [3], [4], eye tracking [5], and head-operated interfaces [6], [7].

The previously proposed touchless technologies suffer from issues such as high error rates, intrusiveness, the need for expensive equipment, and suitability for only very specific application scenarios [1], [3]–[7]. For example, speech recognition involves the use of voice cues, which are not helpful for a person with speaking disabilities [2]. Although one of the most convenient methods, the speech recognition system

∗These authors contributed equally.

The Sixth IEEE International Conference on Multimedia Big Data (BigMM), 2020 ©2020 IEEE

is language-dependent and affected by surrounding noise. Another example is lip-reading systems, which show a significant performance gap relative to speech recognition due to the ambiguous nature of lip actuation, which makes it very challenging to extract useful information [8]. Similarly, the vocabulary and language dependence of lip-reading systems is still an active area of research [9]. These problems thus necessitate other types of accessibility devices that overcome these limitations.

BCI-based methods utilize EEG signals generated from a person's brain activity to infer what word/phrase a person was thinking [3]. Capturing EEG signals is a tedious task as it requires the user to wear a sophisticated device with several electrodes [3], [4]. In addition, BCI devices are expensive, limited to research labs, and not yet available for common use [3], [4]. There are some BCI headbands available in the market, such as Muse2¹, but they contain very few electrodes (four in the case of Muse2) and are generally used for monitoring sleep. Eye tracking-based typing methods [5] require consistent eye movements, which can be strenuous for the user. Head movement-based interfaces have also been explored in the past for controlling a cursor on the screen [10] as well as for typing [6]. Similar studies utilize face-movement patterns for controlling a cursor, selecting keys, and/or scrolling over rows of the keyboard [11]. The majority of these techniques use key selection on modified keyboards. Typing on a modified keyboard poses usability concerns, as it can be tedious at times because it generally requires more than one click.

The limitations and drawbacks of previous studies motivated us to explore a touchless typing interface that utilizes head movement patterns while the user looks at an on-screen QWERTY keypad. Since typing is mostly done on QWERTY keyboards, users did not require much extra effort to get accustomed to the proposed touchless typing interface. This also makes the work extensible to other use cases derived from it. The primary motivation behind our interface choices was usability and cost. Therefore, to capture the head-movement

¹Muse2 is a multi-sensor meditation device that provides real-time feedback on your brain activity, heart rate, breathing, and body movements to help you build a consistent meditation practice. https://choosemuse.com/muse-2/


patterns, we used a readily available and inexpensive smartphone camera (a Samsung M10 smartphone). Today, this type of camera is both widely accepted and widely available. In addition, similar cameras are almost always built into other devices like laptops, notepads, and smartphones [12]. The QWERTY keypad was divided into nine color-coded clusters (see the keyboard in Figure 1) and was displayed on a 17-inch monitor. In other words, the design of our interface does not require any expensive tracker or device.

To type a letter sequence, the user makes a series of gestures by looking, one by one, at the clusters to which the letters belong on the virtual QWERTY keyboard. The sequence of gestures captured by the camera is then mapped to a sequence of clusters using deep learning models for sequential data. This predicted sequence of clusters can be used to suggest valid words that can be formed from it. The process is comparable to the touch-enabled swipe keyboards [13] present in modern smartphones. One major difference is that swipe keyboards require the user to touch the screen, while our interface only requires the user to look at the virtual QWERTY keypad. Requiring no physical touch makes our interface suitable for people with upper limb disabilities. The proposed model in its current form could be used as assistive technology. However, we believe that the presented interface has the potential to become a mainstream typing technology.

The complete pipeline of our typing system is shown in Figure 1. Our main contributions are listed as follows:

• We develop a dataset of head movement patterns recorded via a central-view camera. The data was collected from a total of 22 users in a lab environment. Each user typed twenty words, ten phrases, and five sentences with average lengths (number of characters) of 4.33, 10.6, and 18.6, respectively. The dataset and code will be shared publicly to foster further research in this field.

• We propose a bidirectional GRU-based model for translating the recorded videos into a sequence of clusters. The proposed model uses pre-trained embedding rich in head pose features. The performance of the proposed model was evaluated on the aforementioned dataset under inter- and intra-user scenarios using accuracy and Modified Dynamic Time Warping (M-DTW) as performance measures.

The rest of the paper is organized as follows. Related work, the dataset, the proposed system, and the training and evaluation design are presented in Sections II, III, IV, and V, respectively. Section VI presents the results, and Section VII concludes the work and lists future research directions.

II. RELATED WORK

The closely related works can be categorized into selection-based and gesture-based text entry [14].

Selection-based text entry mechanisms generally utilize a camera mouse (with some additional mechanism) and an on-screen keyboard [15]–[17]. The additional mechanism, such as an eye blink or an open mouth, complements the selection process. In addition, these mechanisms use modified or unconventional keypads. Moreover, they use limited directional movements (e.g., left, right, up, and down). As a result, these systems exhibit high error rates and are time-consuming. An attempt has been made to reduce the time consumption by enabling users to enter a letter in at most three steps [18]. Each step consisted of a head movement in one of four directions followed by a return. The text-entry process makes use of a modified keyboard in which the English letters are arranged alphabetically. The author further improved the system with a two-step entry process [15] and later used a thermal camera to reduce the effect of lighting conditions on the text entry process [6].

Word-gesture keyboards (WGK) [19] enable faster text entry by allowing users to draw the shape of a word on an input surface. Such keyboards have been used extensively for touch devices. For example, swipe keyboards have been incorporated in smartphones for a long time and work well. Along the same lines, [14] presented a mid-air (touchless) typing technique that is relatively faster than selection-based methods. The technique offered performance comparable to gesture-based text entry on touch-enabled input surfaces. The user of the system, however, had to wear a glove with reflective markers that tracked the position of the hands and fingers. To write a word, a user placed the cursor on the first letter, made a pinch gesture using the index finger and the thumb, then traced the remaining letters of the word, and finally released the pinch. The authors suggested that the user could select a word from the list of suggestions or continue typing, implicitly confirming that the highlighted word was a match. The user could also undo the suggested word in the text input field by selecting backspace, or delete previously typed words by multiple selections of backspace. In other words, they used four basic interaction steps: match, select, undo, and delete. The authors report that the touchless gesture entry process was 40% slower than gesture-based text entry on touch surfaces and mentally demanding. Hand movement-based gestures have also been studied in the context of smartphone security. Researchers have established that hand movements recorded by a front-facing camera can be used to reconstruct smartphone users' PINs, patterns, and passwords [20]–[22].

Besides the aforementioned techniques, one can also use a Virtual Reality Headset (VRH) based method [7], which does not require wearing a pair of hand gloves.



Figure 1. An end-to-end overview of the proposed touchless typing process. Raw video frames were extracted, preprocessed, and fed to the deep-learning-based generative model, which generated a cluster sequence that was translated to a possible set of words.

Users of a VRH generally control a pointer on a virtual keyboard using head rotation. Another interesting way of touchless typing is through eye tracking. Eye tracking-based systems detect and track the movement of the pupil to move a cursor [23] or control a key selector. For example, [24] proposed a system in which eye gaze is used to select keys on a T9 keyboard. Accurate eye gaze systems require sophisticated eye trackers and are also not suitable for long typing sessions, as the eyes need to remain open for long periods.

Our work differs from these works as it uses a standard QWERTY keyboard, a single mobile camera for recording the head movement-based gestures, and a deep learning-based sequence-to-sequence model that translates the recorded gestures into a cluster sequence.


Figure 2. Average number of gestures per minute for each user.

III. DATASET

Figure 3 represents the collection setup for multi-modal data consisting of a series of videos and inertial sensor readings. In this section, we describe the dataset setup and collection procedure for the video modality.

A. Data Description

The dataset consists of 2234 recordings of participants while they typed a number of words, phrases, and sentences by moving their head towards a clustered virtual keyboard displayed on a monitor (see Figure 3). A total of 25 volunteers participated in the data collection, of which 19 were male and six were female university students. Out of these, data for three participants were discarded after manual inspection. Each participant typed twenty words,

ten phrases, and five sentences (see Table I). The average lengths of words, phrases, and sentences were 4.33, 10.6, and 18.6 letters, respectively. The words were chosen carefully, ranging from three to six characters long, such that each of the eight clusters covering the 26 English letters is included in at least one word and the unique transitions between the clusters in a given word are maximized (see Figure 4). The exercise was repeated three times for each participant, which resulted in 105 recordings per user. Although this would amount to 2310 recordings for the 22 retained users, some of the samples of two users were not recorded fully and were discarded, leaving 2234 processed recordings. The phrases were chosen from the OuluVS dataset [25] and the sentences from TIMIT [26]. The dataset is a mix of short words like "ice", "old", "fly", and "leg" and longer phrases like "hear a voice within you". Considering every head movement from one cluster to another as a gesture (or a letter entry), the users on average entered 49.26 gestures per minute with a standard deviation of 5.3. The gesture entry rate for each user is shown in Figure 2. The gesture entry rate would likely increase with more practice on the proposed system. Some examples of mapping from text to clusters are presented in Table II.

B. Data Collection Environment

The dataset used in this paper was collected as part of a larger data collection exercise. Figure 3 depicts the overall data collection environment. The goal was to capture the head movements of the participants via cameras and motion sensors (accelerometer and gyroscope) built into a headband while the users instinctively looked at the clusters one by one (see the keyboard in Figure 1). Specifically, the data collection setup consisted of three cameras placed on tripods at -45°, 0°, and 45°, a virtual QWERTY keyboard displayed on a 17-inch screen, and a headband (Muse 2) worn by the participant. The camera placed at 0° was facing the participant. The three cameras captured the visual aspect of the participant's head movement, whereas the headband's accelerometer and gyroscope sensors captured the acceleration and rotation of the head, respectively.

The recording devices (Samsung M10 smartphones and the Muse 2) were connected and controlled using another


Table I
LIST OF 20 WORDS, 10 PHRASES, AND 5 SENTENCES THAT WERE TYPED BY EACH PARTICIPANT. THE EXERCISE WAS REPEATED THREE TIMES.

Category | Text | Avg. number of letters per entry
Words | locate, single, family, would, place, large, work, take, live, box, method, listen, house, learn, come, some, ice, old, fly, leg | 4.33
Phrases | hello, excuse me, i am sorry, thank you, good bye, see you, nice to meet you, you are welcome, how are you, have a good time | 10.6
Sentences | i never gave up, best time to live, catch the trade winds, hear a voice within you, he will forget it | 18.6

Table II
EXAMPLES OF WORD, PHRASE, SENTENCE, AND CORRESPONDING CLUSTER SEQUENCES.

Category | Text | Cluster sequence
Word | live/love | 6, 3, 7, 1
Phrase | thank you | 2, 5, 4, 9, 6, 8, 2, 3, 2
Sentence | i never gave up | 3, 8, 9, 1, 7, 1, 2, 8, 5, 4, 7, 1, 8, 2, 3

laptop (see moderator's laptop in Figure 3). The videos were recorded at 30 frames per second with a resolution of 1920 × 1080 pixels. The focus was to record mostly the torso of the participant. A script running on the moderator's laptop broadcast a message to the OpenCamera Remote apps installed on each of the phones to start and stop the recording simultaneously.


Figure 3. The data collection environment, consisting of a monitor on which the virtual keyboard was displayed, three cameras placed on tripods recording the head movement-based gestures of the participant, a Muse 2 headband worn by the participant on his/her forehead, a moderator's laptop, and a laser pointer. All three cameras were facing the participant.

This work utilizes only the information recorded via a central-view camera (the one placed at 0° and labelled as Camera-2). We believe that the Camera-2 placement (see Figure 3) represents a realistic setup, as most laptops, notebooks, and phones have at least one central-view camera nowadays. On the other hand, the use of multiple cameras and the headband represents a futuristic setup, as we envision that the use of multiple cameras and Brain-Computer Interface (BCI) headbands would become common for tech-savvy users.

The keyboard in Figure 1 demonstrates the color-coded clusters. Each color represents a cluster (a group of nearby keys). The clusters were numbered from 1 to 9. To type a sequence of letters, the participant pointed to the corresponding clusters containing those letters by moving his/her head. For example, if the participant wanted to type the word 'god', the participant would point to the cluster sequence [5, 3, 4]. The use of a virtually clustered QWERTY keypad had multiple advantages. First, the participants were already familiar with the keyboard. Second, they did not have to spend much time locating the keys; they only had to point to the clusters by moving their head. Finally, it simplified the problem from predicting 27 keys to predicting nine clusters.
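As an illustration, the following Python sketch maps text to its cluster sequence. The exact letter grouping is not listed explicitly in the text, so the mapping below is inferred from the keyboard in Figure 1 and the examples in Table II and should be treated as an assumption.

```python
# Assumed letter-to-cluster grouping, inferred from the paper's examples
# ('god' -> [5, 3, 4], 'thank you' -> [2, 5, 4, 9, 6, 8, 2, 3, 2]).
CLUSTERS = {
    1: "qwe", 2: "rtyu", 3: "iop",
    4: "asdf", 5: "gh",  6: "jkl",
    7: "zxcv", 8: " ",   9: "bnm",
}
LETTER_TO_CLUSTER = {ch: cid for cid, chars in CLUSTERS.items() for ch in chars}

def to_cluster_sequence(text):
    """Map a word/phrase to the sequence of clusters the user points to."""
    return [LETTER_TO_CLUSTER[ch] for ch in text.lower() if ch in LETTER_TO_CLUSTER]

print(to_cluster_sequence("god"))        # [5, 3, 4]
print(to_cluster_sequence("thank you"))  # [2, 5, 4, 9, 6, 8, 2, 3, 2]
```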

An analysis was conducted to find out how many unique words can be formed from each possible cluster sequence. It was observed that for the 10,000 most common English words² there are 8529 unique cluster sequences, with each sequence corresponding to 1.17 different words on average. So once we predict the cluster sequence, it can be translated to one or two valid words on average, which is comparable to character-level prediction.
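A minimal sketch of this analysis is given below. It assumes a local copy of the word list from footnote 2 and reuses the to_cluster_sequence helper from the previous sketch; the reported figures (8529 sequences, 1.17 words per sequence) are the paper's, not reproduced here.

```python
from collections import defaultdict

def cluster_sequence_stats(words):
    """Count unique cluster sequences and how many words collapse onto each."""
    by_sequence = defaultdict(list)
    for word in words:
        by_sequence[tuple(to_cluster_sequence(word))].append(word)
    unique_sequences = len(by_sequence)
    avg_words_per_sequence = len(words) / unique_sequences
    return unique_sequences, avg_words_per_sequence

# with open("google-10000-english.txt") as f:           # hypothetical local copy of footnote 2
#     print(cluster_sequence_stats(f.read().split()))
```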

The entire data collection exercise was moderated by the research team members, who guided the participants as needed. Specifically, the participants were asked to sit on a chair and rehearse to make themselves familiar with the keyboard and the text entry process. The rehearsal time varied from participant to participant, as they were told to make themselves comfortable with the setup. The moderator reminded the user of the word, phrase, or sentence that was to be typed. The start and stop of the recording was controlled by pressing the "Enter" key on the laptop (see Figure 3). To begin, the participants looked at the centre of the virtual keyboard and then started moving their head in the direction of the subsequent letters to be typed. The recording was stopped once the participant finished moving their head. The process was repeated three times for each of the words, phrases, and sentences.

IV. PROPOSED SYSTEM

In this section, we present the entire pipeline of our system, the CNN model used for extracting head pose

²https://github.com/first20hours/google-10000-english


[Figure 4 data: cluster 1: 17%, cluster 2: 17%, cluster 3: 16%, cluster 4: 12%, cluster 5: 7%, cluster 6: 7%, cluster 7: 6%, cluster 8: 11%, cluster 9: 7%.]

Figure 4. The coverage (share) of each cluster across the dataset.

embeddings, and the GRU-based model for sequence translation.³

A. System Overview

The system overview is presented in Figure 1. First, the regions of interest are extracted from the video frames using the face detection DNN provided in OpenCV [27]. Second, the processed video frames are fed into the deep learning models that output the cluster sequence the user was looking at. A set of valid words is then recommended based on the predicted cluster sequence.
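The following sketch illustrates the ROI-extraction step. The paper only states that OpenCV's face detection DNN was used, so the specific res10 SSD Caffe model files and the face_roi helper named below are assumptions.

```python
import cv2

# Assumed model files for OpenCV's DNN face detector (downloaded separately).
net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")

def face_roi(frame, conf_threshold=0.5):
    """Return the cropped face region of a BGR frame, or None if no face is found."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()                       # shape: (1, 1, N, 7)
    best = max(range(detections.shape[2]),
               key=lambda i: detections[0, 0, i, 2])  # highest confidence
    if detections[0, 0, best, 2] < conf_threshold:
        return None
    x1, y1, x2, y2 = (detections[0, 0, best, 3:7] * [w, h, w, h]).astype(int)
    return frame[max(0, y1):y2, max(0, x1):x2]
```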

B. Head Pose Feature Embedding

For our proposed model, we precompute the features denoting the head pose of the user for each frame separately and train only the RNN model. HopeNet [28] is a CNN-based, landmark-free head pose estimation model for computing the intrinsic Euler angles (yaw, pitch, and roll) from an RGB image of a person's face in an unconstrained environment (Figure 5(b)). The Euler angles are the three degrees of freedom used to represent the orientation of a rigid body in 3-dimensional Euclidean space. For predicting the Euler angles, a classification as well as a regression approach is applied. The angles in the range of ±99° are divided into 66 bins. The network outputs a probability distribution over these bins, and the expected angle value of this distribution is taken as the predicted value for the regression loss. As illustrated in Figure 5(a), the method uses a ResNet50 backbone network augmented with three fully connected layers of size 66, one per angle, sharing the same ResNet50 backbone. HopeNet uses three loss functions, one for each angle, each composed of a coarse bin-classification loss and a regression loss. The ground truth for each angle is prepared from the actual Euler values.

For bin classification, softmax cross-entropy loss is used, and for regression loss, the regular mean squared error loss is computed. The final loss function becomes

L = H(y, ŷ) + β · MSE(y, ŷ)    (1)

³Code: https://github.com/midas-research/muse-touchless-typing

where L represents the total loss, H the softmax cross-entropy loss, MSE the mean squared error loss, y the target value, ŷ the predicted value, and β the weight of the regression loss.

Since this model is well suited to form the feature embedding for our task, we use the softmax outputs of size 66 (Figure 5(a)) for all three Euler angles and concatenate them to form our embedding of dimension 3 × 66 = 198.
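A minimal PyTorch sketch of this embedding step, assuming per-frame yaw, pitch, and roll logits of size 66 obtained from a pretrained HopeNet forward pass:

```python
import torch
import torch.nn.functional as F

def headpose_embedding(yaw_logits, pitch_logits, roll_logits):
    """Concatenate the softmaxed 66-bin outputs of the three angle heads
    into a single 198-dimensional per-frame embedding.
    Inputs are assumed to be (T, 66) tensors from a pretrained HopeNet."""
    return torch.cat([F.softmax(yaw_logits, dim=-1),
                      F.softmax(pitch_logits, dim=-1),
                      F.softmax(roll_logits, dim=-1)], dim=-1)  # (T, 198)
```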

C. Model Architecture

The task of generating cluster sequences from head movement videos can be seen as a sequence-to-sequence modelling task in which the input sequence is a sequence of RGB frames and the output sequence is a sequence of clusters the user was looking at. Recurrent Neural Networks (RNNs) such as LSTM [29] and GRU [30] are widely used for sequence-to-sequence modelling tasks. However, the number of video frames is generally not equal to the number of clusters in the target sequence, and hence an input frame cannot be aligned to a corresponding cluster. For this, we use the Connectionist Temporal Classification (CTC) loss [31], which solves the non-alignment problem between input and output. To form the feature embedding, we use the output of the HopeNet model as described in the previous section. Our model consists of a twenty-layered Bi-Directional GRU network. The embedding dimension is 198 (concatenation of the 66-dimensional softmax outputs for all three Euler angles), and the hidden dimension of the GRU cell used is 512.
The model architecture is shown in Figure 6. Features from the forward and backward directions, each of size 512, are added to give a T × 512 feature matrix. We apply 1D Batch Normalization followed by a fully connected layer with softmax activation, which reduces the 512 features to ten features. At each timestep, the final output of the model is a softmax vector of size ten, consisting of the probabilities of the nine clusters and a blank (for the CTC loss). To generate the cluster sequence, we use Beam Search Decoding [32], a decoding algorithm widely used in the field of natural language processing.
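A minimal PyTorch sketch of this architecture is given below; the input is the 198-dimensional HopeNet embedding described above, while the number of stacked GRU layers and other unreported details are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterSequenceModel(nn.Module):
    """Bi-directional GRU mapping per-frame head-pose embeddings (T x 198)
    to per-timestep class scores (T x 10): nine clusters plus a CTC blank."""
    def __init__(self, embed_dim=198, hidden_dim=512, num_classes=10, num_layers=4):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers,
                          bidirectional=True, batch_first=True)
        self.bn = nn.BatchNorm1d(hidden_dim)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):              # x: (batch, T, 198)
        out, _ = self.gru(x)           # (batch, T, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)
        summed = fwd + bwd             # add forward and backward features -> (batch, T, hidden_dim)
        summed = self.bn(summed.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d over the feature dim
        return F.log_softmax(self.fc(summed), dim=-1)             # log-probabilities for CTC

# CTC usage sketch: targets are cluster indices 1..9, blank index = 0.
# model = ClusterSequenceModel()
# log_probs = model(frame_embeddings)                        # (batch, T, 10)
# loss = nn.CTCLoss(blank=0)(log_probs.transpose(0, 1),      # CTC expects (T, batch, C)
#                            targets, input_lengths, target_lengths)
```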

V. TRAINING AND PERFORMANCE EVALUATION

In this section, we describe the training as well as performance evaluation processes.

A. Training

Instead of training on all N frames of a video, we select every tenth frame, which helps in better capturing the actual directional changes during head motion. We use the SGD optimizer with Nesterov momentum. We use a learning rate of 0.0025 and a momentum of 0.9, and set a max norm of 400 for tackling gradient explosion. We train the model for 300 epochs with a batch size of 20.
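The sketch below mirrors these reported hyperparameters. It builds on the model sketch above; the train_loader and the tensor shapes it yields are assumptions rather than the authors' exact implementation.

```python
import torch

model = ClusterSequenceModel()                     # from the earlier sketch
criterion = torch.nn.CTCLoss(blank=0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, nesterov=True)

def subsample(frames, step=10):
    """Keep every tenth frame of a (T, ...) tensor, as described in the paper."""
    return frames[::step]

for epoch in range(300):
    for embeddings, targets, in_lens, tgt_lens in train_loader:   # assumed DataLoader, batch size 20
        optimizer.zero_grad()
        log_probs = model(embeddings)                              # (batch, T, 10)
        loss = criterion(log_probs.transpose(0, 1), targets, in_lens, tgt_lens)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=400)  # tackle gradient explosion
        optimizer.step()
```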



(a) HopeNet architecture for predicting yaw, pitch and roll. (b) Pose estimation using HopeNet.

Figure 5. Head pose estimation.

[Figure 6 summary: input frames (T × 224 × 224 × 3) pass through a pretrained HopeNet model to obtain yaw, pitch, and roll features (T × 198, i.e., 66 × 3), then through a 4-layered bidirectional GRU network (T × 512) and a fully connected network trained with the CTC loss (T × 10); at inference, beam search decodes the predicted cluster sequence (e.g., [6, 3, 7, 1]), which is translated to suggested words such as "LOVE" and "LIVE".]

Figure 6. Proposed GRU-based architecture

B. Performance Evaluation

The performance of the proposed model was evaluated under two different scenarios, which are described as follows:

1) Inter-user: In this scenario, the models are trained on a set of users S1 and tested on a different set of users S2 such that S1 and S2 are mutually exclusive. The cluster sequences are kept the same for the train and test sets.

2) Intra-user: Since we recorded three iterations of each cluster sequence from each user, we trained the models on the first two iterations of each sequence and tested on the third, keeping the users the same in the train and test sets.
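For illustration, a minimal sketch of the two splits over a list of per-recording records; the field names (user_id, iteration) are assumptions, not the authors' data format.

```python
def inter_user_split(records, test_users):
    """Train on one set of users, test on a disjoint set (inter-user scenario)."""
    train = [r for r in records if r["user_id"] not in test_users]
    test = [r for r in records if r["user_id"] in test_users]
    return train, test

def intra_user_split(records):
    """Train on iterations 1 and 2 of every user, test on iteration 3 (intra-user)."""
    train = [r for r in records if r["iteration"] in (1, 2)]
    test = [r for r in records if r["iteration"] == 3]
    return train, test
```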

Table III
PERFORMANCE OF THE PROPOSED MODEL.

Scenario | Accuracy (%) | Modified-DTW
Inter-user | 86.81 | 0.38
Intra-user | 96.78 | 0.07

We evaluated the performance of the model using accuracy as well as the Modified Dynamic Time Warping distance, which are described below:

3) Accuracy: Accuracy is defined as the ratio of the number of correctly decoded cluster sequences to the total number of decoded cluster sequences. A cluster sequence is considered correctly decoded if it matches the target sequence element by element. For example, [3, 4, 7, 1, 2] will be considered a correct match to the target sequence [3, 4, 7, 1, 2]. Anything other than [3, 4, 7, 1, 2] will be considered a mismatch.
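A minimal sketch of this metric:

```python
def sequence_accuracy(predicted, targets):
    """Share of sequences decoded exactly right (element-by-element match)."""
    correct = sum(1 for p, t in zip(predicted, targets) if list(p) == list(t))
    return correct / len(targets)

# e.g. sequence_accuracy([[3, 4, 7, 1, 2]], [[3, 4, 7, 1, 2]]) == 1.0
```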

4) Modified DTW: Besides accuracy, we present the Modified Dynamic Time Warping (M-DTW) distance [33], which is specifically designed to compare sequences of gestures. M-DTW calculates the distance using a linear combination of Euclidean and directional distances. M-DTW was preferred over standard DTW [34] because standard DTW does not consider the directional changes in 2D space. We consider the keyboard as a 2D space and the clusters as coordinates, as shown in Figure 7. Let P = [7, 5, 1] and Q = [9, 5, 3] be two predicted sequences and A = [8, 4, 2] be the actual sequence. The standard DTW distances AP and AQ turn out to be the same. However, in reality, AQ should be smaller than AP, as A is more similar to Q than to P (see Figure 7).
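For reference, the sketch below implements standard DTW over the 2D cluster coordinates of Figure 7; the directional term that M-DTW [33] adds is only indicated, since its exact weighting is not reproduced here.

```python
import math

# Map cluster ids 1..9 to (row, col) coordinates on the 3x3 grid of Figure 7.
COORD = {c: ((c - 1) // 3, (c - 1) % 3) for c in range(1, 10)}

def euclidean(a, b):
    (r1, c1), (r2, c2) = COORD[a], COORD[b]
    return math.hypot(r1 - r2, c1 - c2)

def dtw(seq_a, seq_b, cost=euclidean):
    """Standard DTW between two cluster sequences with a pluggable local cost;
    M-DTW replaces the cost with a weighted sum of Euclidean and directional
    distances (weights as in [33], not reproduced here)."""
    n, m = len(seq_a), len(seq_b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Example from Figure 7: both P and Q are at standard-DTW distance 3.0 from A.
# dtw([8, 4, 2], [7, 5, 1]) == dtw([8, 4, 2], [9, 5, 3]) == 3.0
```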

[Figure 7 data: the clusters are laid out on a 3 × 3 grid (1 2 3 / 4 5 6 / 7 8 9); the standard DTW distance is 3.00 for both comparisons, whereas the M-DTW distances are 2.70 and 2.50.]

Figure 7. Comparison of the DTW and Modified DTW (M-DTW) algorithms. The smaller the distance, the better the match.

VI. RESULTS AND DISCUSSION

The performance of the model is presented in Table III under the two experimental scenarios. The model performs exceptionally well when the same set of users is used in the train and test sets, giving an accuracy of 96.78%. For the user-invariant (inter-user) scenario, the model gives an accuracy of 86.81%. The presented results are preliminary as they are based on data collected from a limited number of users in a controlled lab environment. The results are encouraging, and it would be interesting to see how the presented models perform in an unconstrained environment for a more diverse population of users. A factor that would likely improve the performance of the system is crisper head movements by the users during typing. Initial observations suggest that the extent of the head movements depends on the size of the on-screen keyboard and the distance of the user's head from the screen.

VII. CONCLUSION AND FUTURE WORK

This paper presented a touchless typing interface that uses head movement-based gestures. The proposed interface uses a single camera and a QWERTY keypad displayed on a screen. The gestures captured by the camera are mapped to a sequence of clusters using a GRU-based deep learning model that uses pre-trained embeddings rich in head pose features. The performance of the interface was evaluated on 2234 video recordings collected from twenty-two users. The presented interface achieved accuracies of 96.78% and 86.81% under the intra-user and inter-user scenarios, respectively. In the future, the aim is to improve the performance by (1) using more training data containing a variety of meaningful sequences, and (2) combining video feeds from multiple cameras, brainwaves recorded via EEG sensors, and the acceleration and rotation of the user's head recorded via the accelerometer and gyroscope built into Muse 2, all of which were collected concurrently during data collection.

We also believe that a new dataset brings with it new challenges. A few of the challenges and opportunities that we observed during the collection and experimentation process are given here. Many of the users asked us for a "click" functionality, which they could potentially indicate by a movement like a long stare or by focusing the eyes on the point of interest. This could also help us simulate a mouse. Interestingly, this also avoids the time a mouse takes in moving from one point of the screen to another, since motor movements using the hand are known to be much slower than eye movements. A significant challenge impacting the usage of the presented interfaces is view and glance dependence. Although we tried to address this by capturing the video from multiple angles, we feel more can be done through 3D modelling of the user viewpoints.

On another note, parallel work can be pursued to define eye movements and head gestures and map them to various digital actions like right click, left click, crop, copy, paste, etc. Other future applications could also work in the direction of integrating the interface with wearable devices and mobile computing. This would bring together a newer set of applications, like browsing from wearable glasses.

REFERENCES

[1] K. M. Salik, S. Aggarwal, Y. Kumar, R. R. Shah, R. Jain, and R. Zimmermann, "Lipper: Speaker independent speech synthesis using multi-view lipreading," in AAAI, 2019.

[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.

[3] P. Saha and S. S. Fels, "Hierarchical deep feature learning for decoding imagined speech from EEG," CoRR, vol. abs/1904.04352, 2019. [Online]. Available: http://arxiv.org/abs/1904.04352

[4] C. Nguyen, G. Karavas, and P. Artemiadis, "Inferring imagined speech using EEG signals: A new approach using Riemannian manifold features," J. of Neural Eng., 2018.

[5] D. W. Hansen and A. Pece, "Eye typing off the shelf," in IEEE CVPR, 2004.

[6] A. Nowosielski and P. Forczmanski, "Touchless typing with head movements captured in thermal spectrum," Pattern Analysis and Applications, Jul. 2018.

[7] C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, "Tap, dwell or gesture? Exploring head-based text entry techniques for HMDs," in CHI '17, 2017.

[8] Y. Zhao, R. Xu, X. Wang, P. Hou, H. Tang, and M. Song, "Hearing lips: Improving lip reading by distilling speech recognizers," 2019.

[9] Y. Kumar, S. Maheshwari, D. Sahrawat, P. Jhanwar, V. Chaudhary, R. R. Shah, and D. Mahata, "Harnessing GANs for addition of new classes in VSR," arXiv preprint arXiv:1901.10139, 2019.

[10] O. Takami, N. Irie, C. Kang, T. Ishimatsu, and T. Ochiai, "Computer interface to use head movement for handicapped people," in Proceedings of Digital Processing Applications (TENCON '96), 1996.

[11] Y. Gizatdinova, O. Spakov, and V. Surakka, "Face typing: Vision-based perceptual interface for hands-free text entry with a scrollable virtual keyboard," in WACV '12, Jan. 2012, pp. 81–87.

[12] R. Shah and R. Zimmermann, Multimodal Analysis of User-Generated Multimedia Content. Springer, 2017.

[13] O. Alsharif, T. Ouyang, F. Beaufays, S. Zhai, T. Breuel, and J. Schalkwyk, "Long short term memory neural network for keyboard gesture decoding," 2015.

[14] A. Markussen, M. R. Jakobsen, and K. Hornbæk, "Vulture: A mid-air word-gesture keyboard," in CHI '14, 2014.

[15] A. Nowosielski, "Two-letters-key keyboard for predictive touchless typing with head movements," Jul. 2017, pp. 68–79.

[16] J. Tu, H. Tao, and T. Huang, "Face as mouse through visual face tracking," Comput. Vis. Image Underst., vol. 108, no. 1–2, pp. 35–40, Oct. 2007. [Online]. Available: https://doi.org/10.1016/j.cviu.2006.11.007

[17] M. Nabati and A. Behrad, "3D head pose estimation and camera mouse implementation using a monocular video camera," Signal, Image and Video Processing, vol. 9, Jan. 2012.

[18] A. Nowosielski, "3-steps keyboard: Reduced interaction interface for touchless typing with head movements," May 2017, pp. 229–237.

[19] S. Zhai and P. O. Kristensson, "The word-gesture keyboard: Reimagining keyboard interaction," Commun. ACM, 2012.

[20] D. Shukla, R. Kumar, A. Serwadda, and V. V. Phoha, "Beware, your hands reveal your secrets!" ser. ACM CCS, 2014.

[21] G. Ye, Z. Tang, D. Fang, X. Chen, K. I. Kim, B. Taylor, and Z. Wang, "Cracking Android pattern lock in five attempts," in NDSS, 2017.

[22] D. Shukla and V. V. Phoha, "Stealing passwords by observing hands movement," IEEE TIFS, Dec. 2019.

[23] E. Caceres, M. Carrasco, and S. Ríos, "Evaluation of an eye-pointer interaction device for human-computer interaction," Heliyon, vol. 4, no. 3, p. e00574, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2405844017305261

[24] C. Zhang, R. Yao, and J. Cai, "Efficient eye typing with 9-direction gaze estimation," Multimedia Tools Appl., 2018.

[25] G. Zhao, M. Barnard, and M. Pietikainen, "Lipreading with local spatiotemporal descriptors," IEEE Transactions on Multimedia, vol. 11, no. 7, pp. 1254–1265, Nov. 2009.

[26] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, no. 4, pp. 351–356, 1990. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0167639390900107

[27] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.

[28] N. Ruiz, E. Chong, and J. M. Rehg, "Fine-grained head pose estimation without keypoints," 2017.

[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735

[30] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," 2014.

[31] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML '06, 2006.

[32] S. Wiseman and A. M. Rush, "Sequence-to-sequence learning as beam-search optimization," 2016.

[33] H.-r. Choi and T. Kim, "Modified dynamic time warping based on direction similarity for fast gesture recognition," Mathematical Problems in Engineering, vol. 2018, pp. 1–9, Jan. 2018.

[34] R. Bellman and R. Kalaba, "On adaptive control processes," IRE Transactions on Automatic Control, vol. 4, no. 2, pp. 1–9, Nov. 1959.