
SCANViz: Interpreting the Symbol-Concept Association Captured by Deep Neural Networks through Visual Analytics

Junpeng Wang*        Wei Zhang†        Hao Yang‡

Visa Research


Figure 1: SCANViz: (a) the Overview view provides users with an overview of the training statistics of SCAN; (b) the SCAN view shows the architecture of βVAE+SCAN and allows users to switch between different architecture modes, e.g., img2sym, to check the inference power; (c) the Latent view presents the latent space of βVAE and SCAN for tracking and comparison.

ABSTRACT

Two fundamental problems in machine learning are recognition and generation. Apart from the tremendous amount of research effort devoted to these two problems individually, finding the association between them has attracted increasing attention recently. The Symbol-Concept Association Network (SCAN), recently proposed by Google DeepMind, is one of the most popular models for this problem; it integrates an unsupervised concept abstraction process and a supervised symbol-concept association process. Despite the outstanding performance of this deep neural network, interpreting and evaluating it remain challenging. Guided by the practical needs of deep learning experts, this paper proposes a visual analytics attempt, i.e., SCANViz, to address this challenge in the visual domain. Specifically, SCANViz evaluates the performance of SCAN through its power of recognition and generation, facilitates the exploration of the latent space derived from both the unsupervised extraction and supervised association processes, and empowers interactive training of SCAN to interpret the model's understanding of a particular visual concept. Through concrete case studies with multiple deep learning experts, we validate the effectiveness of SCANViz.

*e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]

1 INTRODUCTION

Two fundamental problems in machine learning are recognition and generation, which, to some extent, can be considered as two reverse processes. The process of recognizing objects from an image is to convert visual concepts in the image to their corresponding symbolic descriptions, e.g., classifying a hat image to {hat} (Fig. 2, left). Conversely, the process of generation converts symbolic descriptions back to the corresponding visual concepts, e.g., from the instruction {hat} to a hat image. The essence of learning these two processes is to capture the association between visual concepts and their corresponding symbolic instructions. This problem, however, is extremely difficult when dealing with complex symbols and concepts, e.g., associating the symbol {lovely, dog} with a precise visual representation (i.e., concrete pixel locations and colors).

Recently, deep neural networks (DNNs) came onto the stage in handling this problem. The Symbol-Concept Association Network (SCAN [13]) is the state of the art, and it is trained in two separate steps. First, SCAN extracts the visual concepts from a target dataset through unsupervised learning. For example, Fig. 2 (right) shows the


DeepMind-Lab dataset [5] and its four orthogonal visual concepts, i.e., object color, wall color, floor color, and object id. Well-adjusted unsupervised learning models, e.g., InfoGAN [9] or βVAE [12], can effectively extract these visual concepts and encode them into a disentangled latent space. Second, SCAN transforms the symbolic instructions into the same latent space and pushes them towards the representations of their corresponding visual concepts through supervised learning. A well-trained SCAN model could enable both recognition (from image/concept to symbol, short for img2sym) and generation (sym2img), as it captures the essential association between the visual concepts and symbolic instructions (Fig. 2, left). Interpreting the disentangled representations and the associations in the latent space is of paramount importance in understanding SCAN.


Figure 2: Left: bi-directional transformations between symbols and concepts. Right: the DeepMind-Lab dataset [5] with 4 visual concepts.

Despite the impressive results presented by SCAN, understanding, interpreting, and diagnosing this model remain challenging for deep learning experts. First, there are no quantitative metrics to unbiasedly evaluate the quality of SCAN, especially the sym2img inference, as it is impossible to eyeball all generated images and verify their association with the symbols. Second, it is nontrivial to interpret the latent space and reveal how the concepts and symbols are associated there over time. Different concepts extracted from the unsupervised learning step are likely to be entangled together. Disentangling them, associating individual ones with the corresponding symbols, and evaluating the strength of association are extremely challenging tasks for deep learning experts. Third, the lack of flexibility in interacting with the model training process often prevents the experts from closely monitoring the model evolution and investigating what the model has really learned at different training stages.

We believe visual analytics is a promising solution to the aforementioned challenges, as it has contributed significantly to Explainable AI [10]. For example, visual analytics approaches can encode DNN data into visual channels that are easier for humans to perceive [18, 23], enable integrated analysis pipelines to help deep learning experts more conveniently explore their models [17, 28], and reveal multiple data facets simultaneously to build connections among them [27]. The insight provided by these visual analytics attempts has successfully helped deep learning experts interpret, diagnose, and even improve their models [7, 26].

In this work, we propose SCANViz, a visual analytics system to verify the association (between visual concepts and symbolic instructions) learned by SCAN and to interpret how the association is formed over model training. As a preliminary prototype, we focus on interpreting the SCAN model itself through relatively small image datasets with orthogonal visual concepts. Targeting three concrete analytical tasks, SCANViz helps domain experts (1) evaluate SCAN at a fixed training stage and over its entire training process, (2) explore and interpret the latent space of βVAE+SCAN, as well as the disentangled semantics in this space, and (3) interactively train SCAN under different configurations to better understand its quality evolution. Through comprehensive studies with multiple deep learning experts, we validate the effectiveness of SCANViz. To sum up, the contributions of this work are threefold:

1. We design and develop SCANViz, an integrated visual analytics system that helps deep learning experts evaluate, interpret, and dynamically interact with SCAN and its training process.

2. We propose a solution to visually explore the latent space of the βVAE+SCAN model, which has great potential to be reused for the interpretation of latent representations from other DNNs.

3. Our work presents a typical human-in-the-loop training, which allows humans to teach a model (SCAN) designated concepts and validate the model's intelligence dynamically over time.

2 RELATED WORK

Model Interpretation. Despite the outstanding performance of deep learning models, the interpretability of their complicated working mechanisms remains a problem. The emerging field of Explainable AI [10, 15, 24] is dedicated to addressing this problem, in which visual analytics is a component of paramount importance [10]. From the literature, the visual analytics attempts at model interpretation can roughly be categorized into two groups: model-specific [7, 17, 18, 26, 27] and model-agnostic [20, 24, 28, 30] interpretations.

The model-specific interpretation focuses on a particular deep learning model to expose its internal architecture and interpret how data are transformed internally. For example, CNNVis [18] formulated the architecture of a CNN as a directed acyclic graph (DAG). Through visualization, it disclosed different facets of the neurons in the DAG and allowed users to cluster neurons to augment their common features. GANViz [27] collected the intermediate training results of GANs [11], computed statistics from them, and used them to reproduce the adversarial training over time. Similar visualization attempts have helped deep learning experts effectively understand [18, 27], diagnose [17, 22, 28], and even improve [7, 26] their models. However, because they are designed for a specific model, the generalization of model-specific approaches is usually limited.

The model-agnostic interpretation [24] considers deep learning models as black boxes and probes the models' behavior without opening them. For example, DeepVID [28] used generated images with perturbed semantics to probe the behavior of image classifiers. Manifold [30] compared different machine learning models through the agreement and disagreement of their classification results. RuleMatrix [20] transformed predictive machine learning models into more interpretable rule-based representations to interpret and explore them. This group of approaches is applicable to a wide range of models, but may not be able to provide the models' internal details in depth. We design SCANViz particularly for the SCAN model in this work. However, our analysis does not require SCAN's internal network architecture for interpretation. Thus, it can be easily adapted to other SCAN models with different model architectures.

Interactive Model Training. Our work also contributes to dynamic and interactive model training, which is an effective way of understanding a complicated model. TensorFlow Playground [1] is one of the most popular demonstrations of enabling interactive training for DNNs. It built a small DNN classifier, which could be interactively trained using a set of pre-defined 2D data points. The playground allowed users to dynamically specify the architecture of the DNN through interactively adding/removing neurons. Another example is GAN Lab [14], which focused on GANs [11] to interpret how the adversarial game between the generator and discriminator plays out in a web browser. Additionally, ConvNetJS [3] exposed all internal details in the training of a CNN, making the model more accessible and understandable to non-experts. These works have contributed significantly to DNN-related education and promoted deep learning to a much broader audience.

Our work distinguishes itself from previous interactive training works in two respects. First, SCANViz allows users to flexibly select and dynamically change the training data to investigate SCAN's learning capability. It encodes humans' understanding of a concept into a feature selection process and enables humans to use image features to probe the learning process. Second, SCANViz not only focuses on training, but also interprets the model over its training, i.e., it reveals what the model really learned over time.


3 BACKGROUND AND MOTIVATION

As briefly mentioned in Sect. 1, there are two separate steps in training SCAN. We explain them in detail in this section.

3.1 Variational Auto-Encoder (VAE) and βVAE

The first step is the extraction of visual concepts from images through unsupervised learning models, for which βVAE [12] is used. To explain βVAE, we first introduce its ancestor, VAE (Fig. 3).


Figure 3: The structure of VAE (with a 4D latent space).

VAE [16] is an unsupervised deep learning model used to learn the underlying distribution of a given dataset X. It contains two sub-networks: an encoder (q) and a decoder (p). The encoder with its model parameters θ transforms each data instance x ∈ X to a lower dimensional space, i.e., the latent space. In Fig. 3, the input image is transferred to a 4D latent space, and the values on each latent dimension follow a Gaussian distribution. Note that VAE does not guarantee the independence of the four latent dimensions. A noisy latent representation of x, denoted as z, can be derived by sampling from the latent space, i.e., z = encoder(x) ∼ q_θ(z|x). The decoder with its parameters φ takes this representation as input and strives to reconstruct the original instance, i.e., x̄ = decoder(z) ∼ p_φ(x|z). The training of VAE is conducted by minimizing the difference between x and x̄, and enforcing the latent representation to follow unit Gaussian distributions. Thus, the loss function of VAE is:

L(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z|x)}[\log p_\phi(x|z)] + D_{KL}(q_\theta(z|x) \,\|\, \mathcal{N}(0, 1)).   (1)

βVAE [12] is a variant of VAE, which aims to disentangle the latent space of VAE, i.e., it tries to encode data into a factorized and interpretable latent space, in which each latent dimension controls certain semantics of the data (e.g., one latent dimension in Fig. 3 may control the color of the hat). It achieves this by adding a hyperparameter β (β > 1) to strengthen the constraint on the latent space. The loss function of βVAE then becomes:

L(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z|x)}[\log p_\phi(x|z)] + \beta D_{KL}(q_\theta(z|x) \,\|\, \mathcal{N}(0, 1)).   (2)
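To make the objective concrete, the following is a minimal sketch of the β-weighted loss in Equation 2 (setting β = 1 recovers the plain VAE loss of Equation 1). It is written against TensorFlow, which powers SCANViz's back-end; the tensor names (x, x_recon, z_mean, z_log_var), the default β value, and the Bernoulli reconstruction likelihood are assumptions for illustration, not details taken from the paper.

import tensorflow as tf

# Hedged sketch of the beta-VAE objective (Equation 2); beta=1 gives Equation 1.
# x: input image batch, x_recon: decoder output, z_mean/z_log_var: parameters
# of the diagonal Gaussian posterior q_theta(z|x) produced by the encoder.
def beta_vae_loss(x, x_recon, z_mean, z_log_var, beta=4.0):
    # Reconstruction term: -E_q[log p(x|z)], here with a Bernoulli likelihood
    # (per-pixel binary cross-entropy), summed over image pixels.
    recon = tf.reduce_sum(
        tf.keras.losses.binary_crossentropy(x, x_recon), axis=[1, 2])
    # KL term: closed-form D_KL(q_theta(z|x) || N(0, I)) for a diagonal Gaussian.
    kl = 0.5 * tf.reduce_sum(
        tf.exp(z_log_var) + tf.square(z_mean) - 1.0 - z_log_var, axis=1)
    return tf.reduce_mean(recon + beta * kl)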

This simple modification works remarkably well, as demonstrated on several image datasets [12]. The authors of βVAE also provided theoretical support for their experimental results, detailed in [8]. For an ideal, well-trained βVAE, each dimension should control one and only one visual concept (i.e., be well-disentangled), though this is very difficult to accomplish.

3.2 Symbol-Concept Association Network (SCAN)

SCAN is used in the second step to build the association between the disentangled visual concepts (extracted using the pre-trained βVAE in the first step) and the corresponding symbolic descriptions. As shown in Fig. 4 (components in yellow), SCAN can also be considered a variant of VAE, which reconstructs the symbolic instructions (i.e., from y to ȳ). Its training loss is defined as follows:

L(\theta_y, \phi_y) = -\mathbb{E}_{z_y \sim q_{\theta_y}(z_y|y)}[\log p_{\phi_y}(y|z_y)] + \beta D_{KL}(q_{\theta_y}(z_y|y) \,\|\, \mathcal{N}(0, 1)) + \gamma D_{KL}(q_{\theta_x}(z_x|x) \,\|\, q_{\theta_y}(z_y|y)).   (3)

The third term enforces an extra constraint to push the latent space of SCAN towards the latent space of the pre-trained βVAE, i.e., from q_θy(z_y|y) to q_θx(z_x|x). β and γ are two hyperparameters, which are set to 1 and 10, respectively. During training, it is very important to monitor how similar these two latent spaces are, to understand the progress of SCAN, as well as the association between the extracted visual concepts and the corresponding symbolic instructions.
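Analogously, a minimal sketch of the full SCAN objective in Equation 3 is given below, again assuming TensorFlow tensors: (zy_mean, zy_log_var) come from SCAN's encoder, (zx_mean, zx_log_var) from the frozen pre-trained βVAE encoder, and y is the k-hot symbol vector. The helper gaussian_kl and the Bernoulli reconstruction term are illustrative assumptions rather than the authors' implementation.

import tensorflow as tf

# Closed-form D_KL(N(mu_p, var_p) || N(mu_q, var_q)) per latent dimension.
def gaussian_kl(mu_p, log_var_p, mu_q, log_var_q):
    return 0.5 * (log_var_q - log_var_p
                  + (tf.exp(log_var_p) + tf.square(mu_p - mu_q))
                  / tf.exp(log_var_q) - 1.0)

# Hedged sketch of SCAN's loss (Equation 3); beta=1 and gamma=10 follow the text.
def scan_loss(y, y_recon, zy_mean, zy_log_var, zx_mean, zx_log_var,
              beta=1.0, gamma=10.0):
    # First term: -E_q[log p(y|z_y)], a Bernoulli likelihood over the k-hot vector.
    recon = -tf.reduce_sum(
        y * tf.math.log(y_recon + 1e-8)
        + (1.0 - y) * tf.math.log(1.0 - y_recon + 1e-8), axis=-1)
    # Second term: push q(z_y|y) towards a unit Gaussian.
    kl_unit = 0.5 * tf.reduce_sum(
        tf.exp(zy_log_var) + tf.square(zy_mean) - 1.0 - zy_log_var, axis=-1)
    # Third term: push q(z_y|y) towards the frozen beta-VAE posterior q(z_x|x),
    # summed over the latent dimensions.
    kl_assoc = tf.reduce_sum(
        gaussian_kl(zx_mean, zx_log_var, zy_mean, zy_log_var), axis=-1)
    return tf.reduce_mean(recon + beta * kl_unit + gamma * kl_assoc)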

In Fig. 4, the img2sym inference is conducted through the data path: x → θx → zx → φy → ȳ. It is similar to a recognition process, but


Figure 4: Architecture of βVAE+SCAN (they are trained separately).

also captures meaningful intermediate representations of the image. The sym2img inference is conducted through the path: y → θy → zy → φx → x̄. This makes SCAN particularly attractive, as it can generate images based on its own understanding of a real-world concept.

Motivation for SCANViz. Although there are many DNN visual analytics tools, we could not find one that is readily applicable to SCAN, as those tools are usually designed for specific models, e.g., CNNVis [18] and GANViz [27]. TensorBoard [6] is the most promising candidate, as it is a general tool applicable to all DNN models. However, it falls short of tackling the particular challenges introduced by SCAN in three respects. First, there is no component that allows domain experts to conveniently investigate and compare the two latent spaces, i.e., q_θy(z_y|y) and q_θx(z_x|x), as well as their evolution. Second, similar to most DNN visualization tools, TensorBoard is a post-hoc analysis tool, i.e., users can only analyze the model after its training process. The capability of directly interacting with the model training process is still missing. Specifically, in our problem, we need to empower domain experts to interactively train SCAN with images of different features to probe the concept-learning process. Third, to analyze SCAN, domain experts also need to frequently switch between different architecture modes (e.g., from sym2img to img2sym) and examine the statistical samples dynamically taken from the latent space. These reasons motivated us to propose new visualizations tailored for SCAN.

4 DESIGN REQUIREMENTS

We worked closely with a group of deep learning researchers to develop SCANViz. These researchers all have 5+ years of experience in deep learning. Through regular meetings over months, we distilled the following three main themes of requirements in designing SCANViz:

• R1: support Model Evaluation over the entire training of SCAN and at a particular training stage. Monitoring and evaluating the performance of a deep learning model is always the first step in understanding it. Based on the daily practice of the experts, this requirement needs SCANViz to:
  – R1.1: demonstrate the evolution of different quality metrics (e.g., reconstruction loss, latent space similarity) over time.
  – R1.2: evaluate the bi-directional inference over training to reveal the quality of sym2img and img2sym.
  – R1.3: enable comprehensive examinations of the quality of the symbol-concept association at a particular training stage.

• R2: empower Latent Space Exploration for both SCAN and βVAE, as well as the comparison between them. The training of SCAN involves two latent spaces, and the similarity between them reflects how strong the association (between symbols and concepts) is. We design SCANViz to be able to:
  – R2.1: explore the latent space of βVAE to identify the disentangled latent dimensions and their associated semantics (i.e., what visual concepts of the images are encoded in them).
  – R2.2: compare the two latent distributions, i.e., q_θx(z_x|x) and q_θy(z_y|y), and the reconstructed samples from them to understand SCAN and investigate its power in generation.
  – R2.3: track the evolution of SCAN's latent space, i.e., q_θy(z_y|y), in comparison with the latent space from βVAE, i.e., q_θx(z_x|x), over time to interpret which dimensions are associated well (and which are not) at different training stages.

• R3: facilitate Dynamic and Interactive Training of SCAN. Given that SCAN can quickly grasp the association between symbols and concepts with only a few symbol-concept pairs [13], training it can be finished in minutes (for trainings with numerous symbol-concept pairs, it could still take hours). With dynamic training, SCANViz empowers its users to:
  – R3.1: teach SCAN a specific concept using images with that concept but with increasingly diverse features, to investigate how SCAN gradually understands the concept.
  – R3.2: get immediate feedback on SCAN's capability in generating symbols or images at different training stages, to check what the model really learns and how it evolves.

5 SYSTEM OVERVIEW

Fig. 5 shows an overview of SCANViz developed under the three themes of design requirements. The system contains a back-end powered by Python TensorFlow and a web-based front-end for user interactions. The library Flask [4] manages the communication between the two ends (i.e., between Python and JavaScript).

!"#$

%&#'

()*+,-.

/0+0()*)+0+-ℎ,*)-*2+)

+)-34.*+2-*,34

56)+6,)7 &,)7 !"#$ &,)7 80*)4* &,)7

90*)4*.*0*,.*,-.

349,4) :0*0 /+3-)..,4; )4;,4) (=)4.3+>937)

@

A

!"#$&,B

,4/2* 04: ,4*)+():,0*) :0*0

,4*)+0-*,6):0*0

0409A.,. 933/

CD:(3:)9 )60920*,34 C2: 90*)4* )@/93+0*,34C3: ,4*)+0-*,6) *+0,4,4;

(3:)9

*+0,4,4;:0*0.)*

)4:

H0-I−

)4:

K+34*−

Figure 5: The overview of SCANViz.

The workflow of SCANViz can be described as follows (Fig. 5). First, we train the βVAE+SCAN model and collect its model architecture (i.e., computational graph), as well as the trained model parameters over time. Second, with these two types of data, the back-end of SCANViz can reproduce the model's state at any training step on demand and generate statistics of interest, e.g., the variance of individual latent dimensions, the quality metrics for sym2img, etc. The back-end of SCANViz also allows users to feed any data instance of interest to the model (at a particular training stage) to investigate its behavior. Multiple types of intermediate data are generated at this stage, such as the latent distributions q_θx(z_x|x) and q_θy(z_y|y) of the given data instance (x, y), and the corresponding x̄ and ȳ reconstructed from the stochastic samples z_x and z_y. Third, the front-end of SCANViz takes the intermediate data as input and presents them in three integrated visualization views: the Overview view, the SCAN view, and the Latent view. These views allow users to examine the model's quality at a specific training stage or over the entire training (model evaluation); facilitate convenient interaction with the latent space of βVAE and SCAN (latent exploration); and empower users to train the SCAN model from scratch with images containing designated features and on-demand model configurations (interactive training). Lastly, these visualization views also provide users an interface to interact with the back-end, which forms an interactive data analysis loop.
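As a concrete illustration of this loop, the following is a minimal sketch of how such a back-end could expose the model's state to the front-end via Flask. The endpoint name, the load_scan_checkpoint helper, and the returned fields are all hypothetical; they are not taken from SCANViz's actual implementation.

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical endpoint: the front-end asks for the latent distributions of a
# data instance at a given training step; the back-end restores the checkpoint
# and returns the Gaussian parameters of q(z_x|x) and q(z_y|y) as JSON.
@app.route("/latent", methods=["GET"])
def latent():
    step = int(request.args.get("step", 0))    # training step to restore
    index = int(request.args.get("index", 0))  # index of the data instance
    model = load_scan_checkpoint(step)         # assumed helper (not shown here)
    zx_mean, zx_var, zy_mean, zy_var = model.encode(index)
    return jsonify({"zx_mean": zx_mean, "zx_var": zx_var,
                    "zy_mean": zy_mean, "zy_var": zy_var})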

6 VISUAL ANALYTICS SYSTEM: SCANViz

As shown in Fig. 5, the front-end of SCANViz is composed of three coordinated visualization views, which together serve to satisfy the three design requirements. This section explains them in detail.

6.1 Overview View

The Overview view (Fig. 1a) has two display modes, presenting an overview of (1) the training statistics over time (Statistics-Overview) and (2) the subset of training images representing different visual concepts (Image-Overview). Switching between them is enabled through the drop-down widget in the header of this view.

6.1.1 Statistics-Overview

The Statistics-Overview is presented through three line charts (R1.1). The first one (Fig. 1-a1) shows three curves reflecting how the three terms in SCAN's loss evolve over time, i.e., the reconstruction loss (in purple), the latent0 loss (D_KL(q_θy(z_y|y) || N(0,1)), in orange), and the latent1 loss (D_KL(q_θx(z_x|x) || q_θy(z_y|y)), in green). The second one (Fig. 1-a2) tracks how the Gaussian variance of SCAN's latent dimensions changes over time (R1.1). The logic behind this metric is that the latent dimensions associated with certain visual concepts will have a constrained variance that is smaller than 1. For example, we trained SCAN on DeepMind-Lab with 32 latent dimensions, so there should be 32 curves in this plot. However, as the dataset contains only four visual concepts, a perfectly disentangled space should have four latent dimensions associated with the four concepts, while the remaining 28 dimensions have no effect on the images (i.e., the 28 irrelevant dimensions should follow unit Gaussian distributions, as enforced by latent0). Therefore, tracking the variance of different latent dimensions can help users identify important ones and reveal the evolution of the latent space. The vertical axis of this plot is 1 − σ², and we can see many curves (in gray) drop to 0 quickly, i.e., the variance of those dimensions converges to 1.
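The metric itself is simple; a minimal sketch follows, assuming sigma_per_epoch is an array of shape (num_epochs, 32) holding the posterior standard deviation of each latent dimension at every epoch (a name introduced only for illustration).

import numpy as np

# 1 - sigma^2 per latent dimension and epoch: near 0 for "irrelevant"
# dimensions (unit Gaussian), clearly above 0 for concept-carrying ones.
def variance_curves(sigma_per_epoch):
    return 1.0 - np.square(sigma_per_epoch)

# Example: dimensions whose final-epoch value exceeds a small threshold
# are candidates for carrying a visual concept.
def relevant_dimensions(sigma_per_epoch, threshold=0.1):
    return np.where(variance_curves(sigma_per_epoch)[-1] > threshold)[0]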

The third line chart (Fig. 1-a3) presents two curves tracking the quality of img2sym and sym2img over time (R1.2). As the process of img2sym is similar to classification, its quality can be measured by the inference accuracy, i.e., comparing the result of img2sym(x) with the corresponding y. The quality of sym2img is harder to measure, as it is impossible to eyeball all generated images and verify whether they follow the symbolic instructions. We therefore train a high-quality classifier C on the image dataset X and feed the output from sym2img to C to measure the accuracy, i.e., comparing the output of C(sym2img(y)) with the corresponding y. The evaluation can be done by focusing on a single visual concept, e.g., considering only the inference correctness on object color, or on all concepts. As a result, there are five (4+1) pairs of curves to be shown in this plot. The drop-down widget in the top-left corner of this plot controls the switch among them. Note that we used all data instances when computing these inference accuracy curves.
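The classifier-based check described above could look roughly like the following sketch; scan_sym2img, classifier_c, and the array shapes are assumptions for illustration rather than the paper's actual code.

import numpy as np

# symbols: (N, k) k-hot vectors; labels: (N, 4) ground-truth concept values.
# scan_sym2img(symbols) is assumed to return N generated images, and
# classifier_c(images) to return an (N, 4) array of predicted concept values.
def sym2img_accuracy(scan_sym2img, classifier_c, symbols, labels, concept=None):
    images = scan_sym2img(symbols)          # generate images from the symbols
    preds = classifier_c(images)            # predict the four concepts with C
    correct = (preds == labels)             # compare with the ground-truth y
    if concept is not None:                 # accuracy on a single concept,
        return correct[:, concept].mean()   # e.g., object color only
    return correct.all(axis=1).mean()       # accuracy on all four concepts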

6.1.2 Image-Overview

The Image-Overview (Fig. 6) provides an interface for users to explore images representing different visual concepts. Each row of color circles in Fig. 6b represents the possible values for one visual concept. By selecting these circles, the images with the corresponding concepts will be presented in Fig. 6c for overview. This view is also the interface that enables users to teach SCAN a particular visual concept using a group of images with diverse features (R3: interactive training). For example, in Fig. 6a, the target concept for SCAN to learn is specified as {gray, floor}. Since the values for the other three visual concepts are not specified (indicated by the circles with a cross), users can select images with diverse values on these three dimensions. The selection is conducted through the interface in Fig. 6b, 6c, and the selected images will be used to train SCAN from scratch with the interface in Fig. 6d. A detailed interactive training example will be demonstrated in Sect. 7.2.1.

Figure 6: Image exploration (b, c) and interactive training (a, b, and d).

6.2 SCAN View

The SCAN view contains two sub-views: the Architecture view (Fig. 1-b1) showing the structure of βVAE+SCAN, and the Samples view (Fig. 1-b2) presenting the stochastic samples reconstructed from the latent space of both βVAE (Fig. 1-b2, bottom) and SCAN (Fig. 1-b2, top), as well as their distributions (i.e., the pie charts).

6.2.1 Architecture View

The Architecture view (Fig. 1-b1) shows the joint architecture of βVAE (blue) and SCAN (yellow). It presents the same layout as Fig. 4 and has ten components: x, θx, zx, φx, x̄, y, θy, zy, φy, and ȳ. By default, all components are presented, and the data flow for a selected data instance (x, y) is x → θx → zx → φx → x̄ and y → θy → zy → φy → ȳ, i.e., two parallel data flows for x and y respectively; this mode is indicated by the “all components” option in Fig. 1-b3. Note that we use the means of q_θx(z_x|x) and q_θy(z_y|y) as z_x and z_y in this view to reconstruct x̄ and ȳ, as they are the most representative samples. The text between the two latent spaces shows the sum of the KL-divergences over all latent dimensions when taking the current instance (x, y) as input. Besides this mode, there are four more modes for users to select, i.e., img2img, sym2sym, img2sym, and sym2img. Different modes will disable some of the components to help users better focus on the remaining ones. Switching between modes will also change the data flow. For example, Fig. 9c shows the sym2img mode, and the data flow is y → θy → zy → φx → x̄. The animated path between the components (see our associated video) helps to show the data-flow direction.
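Conceptually, these modes are simple compositions of the two encoders and two decoders. The sketch below illustrates this; the callables enc_x/dec_x (βVAE) and enc_y/dec_y (SCAN) are hypothetical names, and each encoder is assumed to return the mean of its Gaussian posterior, as in the view above.

# Hypothetical sketch: the inference modes as compositions of the (assumed)
# encoder/decoder callables of the pre-trained beta-VAE and of SCAN.
def make_inference_modes(enc_x, dec_x, enc_y, dec_y):
    return {
        "img2img": lambda x: dec_x(enc_x(x)),  # x -> theta_x -> z_x -> phi_x -> x_bar
        "sym2sym": lambda y: dec_y(enc_y(y)),  # y -> theta_y -> z_y -> phi_y -> y_bar
        "img2sym": lambda x: dec_y(enc_x(x)),  # recognition: image to symbol
        "sym2img": lambda y: dec_x(enc_y(y)),  # generation: symbol to image
    }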

In different architecture modes, users can flexibly select images representing different visual concepts, or select different symbolic instructions, into the Architecture view to investigate how the image/symbol flows through the architecture. This selection can be performed in two ways. The first way is to directly specify from the component y. As shown in Fig. 9-c1, when clicking the color circle representing the current value of each visual concept, a pop-up window will show all the available choices of that concept for users to switch. The last circle with a cross means not specifying any value on that dimension. The change of selections will trigger the data update in the current data flow automatically. The second way of selecting different data instances is by providing the index of the instance through the widget in Fig. 1-b4. This helps users to quickly get an instance from their previous analysis back into the system.

The Architecture view also shows the confidence level of the inference result ȳ (R1.3). For example, when clicking any circle representing the inferred value of a visual dimension, a bar chart will pop up to show the output from SCAN's decoder φy (Fig. 9-a1). Note that the values over the bars are the output of a sigmoid layer (not softmax); therefore, they do not necessarily sum to 1.

6.2.2 Samples View

As shown in Fig. 4, z_x and z_y are not deterministic for the input data instance (x, y). Instead, they are stochastic samples from q_θx(z_x|x) and q_θy(z_y|y) respectively. The x̄ and ȳ in the Architecture view show only one possible pair of reconstructions (using the mean values). The Samples view allows users to specify how many samples they want to generate, and SCANViz will sample that many times from q_θx(z_x|x) and q_θy(z_y|y) to reconstruct multiple x̄ and ȳ.

The bottom 60 images in Fig. 1-b2 show the 60 x̄ reconstructed from the 60 z_x sampled from q_θx(z_x|x), given that the current number of samples is specified as 60 (Fig. 1-b5). As we can see, most reconstructions present consistent visual concepts, which is often observed when well-trained model parameters are loaded. However, when the model is in an early training stage, one may observe very diverse reconstructions. This diversity level is a good reflection of the model quality at the current training stage (R1.3).

The four pie charts above the 60 images show the distribution of the images over the four visual dimensions. They are computed by feeding the 60 reconstructed x̄ into the pre-trained classifier C (explained in Sect. 6.1.1) and counting the number of reconstructions falling into each category based on C's output. Hovering over each portion of a pie chart will highlight the corresponding images, as shown in Fig. 1-b2 (in the bottom half, images with blue floor color are highlighted). The outer ring of each pie chart shows the possible values for the corresponding visual dimension, which works as the legend for the pie chart. A small arc in the outermost ring indicates the portion of the pie with the maximum count.

The top half of the Samples view (Fig. 1-b2) is symmetric to the bottom half; it shows 60 symbol reconstructions (ȳ) and their distributions. Each reconstruction is presented by four circles, with colors indicating the inferred value of the corresponding dimension. For the object id of DeepMind-Lab, the colors red, green, and blue represent the three possible objects, i.e., ice lolly, hat, and suitcase.

The two halves of the Samples view will be enabled/disabled when users change their selection of the architecture mode in the Architecture view. For example, when the mode is sym2img, the top half of this view will be disabled and faded out to help users better focus on the reconstructed images (Fig. 9c).

6.3 Latent View

The Latent view (Fig. 1c) presents the high-dimensional latent space of both βVAE and SCAN using a design similar to a parallel coordinates plot (PCP [29]). Each parallel axis in this view represents one latent dimension. For a given instance x, its latent representation from βVAE, i.e., q_θx(z_x|x), on each dimension is a Gaussian density function. Connecting the means of the Gaussians across axes forms a polyline, which we draw as a cubic Bézier curve (in blue) and allow users to adjust its curvature (through the widget in Fig. 1-c1). The image x̄ in the Architecture view (Sect. 6.2.1) is reconstructed from this curve. The standard deviation on each latent dimension is a value range, and connecting the ranges across axes forms the standard deviation band (in transparent blue) of βVAE.

Similarly, the latent representation from SCAN, i.e., q_θy(z_y|y), can be visualized in the same PCP (the yellow curve and band in Fig. 1c). This superimposition allows users to easily compare the two latent representations on/across individual dimensions to check their agreement (R2.2). In Fig. 1c, it is obvious that the two latent representations have the biggest divergence in “d21”, as the two bands and curves have little overlap in that dimension. Because it can be difficult to visually quantify the level of overlap, we use a small rectangle above each axis to indicate the divergence level between q_θx(z_x|x) and q_θy(z_y|y) on the corresponding dimension (the more the rectangle is filled with pink, the bigger the divergence). Hovering over each small rectangle will show the KL-divergence value.
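Because each dimension carries a univariate Gaussian on both sides, such a per-dimension KL-divergence has the standard closed form (a textbook identity, not a formula stated in the paper):

D_{KL}\big(\mathcal{N}(\mu_x, \sigma_x^2) \,\|\, \mathcal{N}(\mu_y, \sigma_y^2)\big) = \log\frac{\sigma_y}{\sigma_x} + \frac{\sigma_x^2 + (\mu_x - \mu_y)^2}{2\sigma_y^2} - \frac{1}{2}.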

Forward and Backward Analysis. One important task for the Latent view is to reveal which latent dimensions captured the visual concepts from the images and what they have really captured (R2.1). This can be conducted through both forward (from latent to image space) and backward (from image to latent space) analysis.

The forward analysis is performed by simply dragging the red control point on each axis of the PCP (Fig. 1c) and observing the dimension's effects on the reconstructed x̄ in the Architecture view.


Figure 7: The forward (a) and backward (b) analysis of the Latent view in revealing the connection between latent dimensions and concepts.

Fig. 7a shows an example of exploring dimension “d20”, demonstrating how this dimension controls the floor color.

The backward analysis allows users to select, from the Image-Overview view (Fig. 6), multiple images that vary in a visual concept of interest, to investigate their aggregated latent representations. Continuing our previous case, to explore which other latent dimensions also affect the floor color, we can select a group of images with different floor colors (the top row of images in Fig. 7b). The red band in Fig. 7b (defined by the min and max of all the mean values on individual axes) shows the aggregated mean curves of those images. From this, we know that “d11”, “d18”, “d21”, and “d31” also have different levels of control over the floor color.

Note that not all axes are equally important in this view. Still, we present them all for the following reasons. First, domain experts need an overview of all latent dimensions to quickly distinguish the relevant dimensions from irrelevant ones. Relevant dimensions that control certain concepts of the image usually have a smaller standard deviation, i.e., a more constrained value range, which can be easily identified from the width of the band. Second, the irrelevant dimensions provide an intuitive context for the whole latent space, and allow users to interact with them to verify that they really have no effect on the reconstructed image. Third, presenting all latent dimensions also allows domain experts to track their evolution, as the relevance of certain dimensions may change over training.

7 CASE STUDIES WITH DEEP LEARNING EXPERTS

We worked with five deep learning experts to explore the power of SCANViz. Among them, three are full-time researchers focusing on disentangling the embedding space of DNNs, and two are full-time senior researchers who have worked with machine learning for more than 15 years. All have 5+ years of experience in deep learning.

7.1 Datasets and Network Structure


Figure 8: Exploring the CartoonSet images [2].

Two image datasets were used in our studies: DeepMind-Lab [5] and CartoonSet [2]. For DeepMind-Lab, the numbers of possible values for object color, wall color, floor color, and object id are 16, 16, 16, and 3. So, there are 12,288 RGB images in total, each of size 80×80. For CartoonSet (Fig. 8a), we extracted 4 visual concepts (hair color, hair style, eye color, and face color) from the original cartoon images and remixed them (through color mapping) to generate a smaller dataset. The

number of possible values for each concept is 10, 10, 10, and 11. So, there are 11,000 images (10×10×10×11) in total. The original size of the cartoon images is 500×500. We cropped the center 350×350 pixels and scaled them down to 80×80 for our usage.

We used a βVAE+SCAN architecture similar to the original SCAN paper [13]. Specifically, the encoder of βVAE contains four convolutional layers, each halving the size of the input image. Two fully connected layers then transform the convolutional output into 32 Gaussian density functions (i.e., a 32D mean and a 32D standard deviation vector). The βVAE's decoder has a symmetric architecture, i.e., two fully connected layers followed by four transposed convolutional layers. The encoder of SCAN contains two fully connected layers transforming the k-hot vector (i.e., y) into a 32D latent space; k=51 (16+16+16+3) for DeepMind-Lab and k=41 (10+10+10+11) for CartoonSet. The decoder of SCAN, with a symmetric network architecture, transforms z_y ∼ q_θy(z_y|y) from the 32D latent space back to a k-hot vector to generate ȳ. Note that SCANViz focuses more on interpreting SCAN. The concept extractor, βVAE, was pre-trained and was not changed over our studies.
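For concreteness, a minimal Keras-style sketch of such a βVAE encoder is given below. The filter counts, kernel sizes, and the 256-unit hidden layer are assumptions; the paper only specifies four convolutional layers (each halving the input) followed by fully connected layers that output a 32D mean and a 32D (log-)standard-deviation vector.

import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of the beta-VAE encoder for 80x80 RGB inputs; the decoder
# would mirror it with Dense and Conv2DTranspose layers.
def build_bvae_encoder(latent_dim=32):
    inputs = layers.Input(shape=(80, 80, 3))
    h = inputs
    for filters in (32, 32, 64, 64):              # four conv layers, each
        h = layers.Conv2D(filters, 4, strides=2,  # halving height and width
                          padding="same", activation="relu")(h)
    h = layers.Flatten()(h)
    h = layers.Dense(256, activation="relu")(h)   # fully connected layer
    z_mean = layers.Dense(latent_dim)(h)          # 32D mean vector
    z_log_var = layers.Dense(latent_dim)(h)       # 32D log-variance vector
    return tf.keras.Model(inputs, [z_mean, z_log_var], name="bvae_encoder")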

7.2 Case Studies with Domain Experts

We organize the findings from our explorations with the experts into two subsections. Sect. 7.2.1 focuses on DeepMind-Lab to show how SCANViz satisfies the three design themes. Sect. 7.2.2 uses CartoonSet to further verify our findings and reveal more insight.

7.2.1 Exploring the DeepMind-Lab dataset

R1: Model Evaluation. From the Overview view, we can easily summarize the model training statistics. First, the latent0 and latent1 losses (in orange and green, Fig. 1-a1) converged very quickly, i.e., after around five epochs. However, the reconstruction loss (in purple) took around 70 epochs to become stable (R1.1). This observation led the experts to hypothesize that SCAN's decoder may be more difficult to train than its encoder. The logic behind this is that the latent losses control the parameter updates of SCAN's encoder only, whereas the reconstruction loss reflects the quality of both the encoder and decoder. Since the latent losses converged but the reconstruction loss was still going down, it is very likely that the loss decrease came from the improvement of the decoder. There was also some discussion of the increase of latent0, which pushes the distribution of z_y on each latent dimension towards a unit Gaussian. Its increase results from the stronger, contradictory constraint of latent1, which pushes the distribution of z_y towards the pre-trained latent distribution of z_x, i.e., γ=10 > β=1 in Equation 3.

Second, the variance of the 32 latent dimensions also converged very quickly, as reflected by the 32 curves in Fig. 1-a2 (R1.1). We sorted those curves by their values in decreasing order, and only 7 out of the 32 curves show an obvious value increase over the training. This indicates that these 7 dimensions have a variance smaller than 1, since the vertical axis of the plot is 1 − σ². It also led the experts to believe that they (i.e., “d20”, “d4”, “d11”, “d14”, “d31”, “d18”, and “d21”) are the dimensions that captured the four visual concepts of the data, which can be verified by interacting with them in the Latent view.

Figure 9: Evaluating sym2sym inference on reconstructing yellow (a) and red (b) wall color, and sym2img on yellow wall color, at epoch 10.

Third, the five pairs of accuracy curves in Fig. 1-a3 all show a consistent trend: the sym2img curve (in orange) converges faster than the img2sym curve (in purple) (R1.2). sym2img uses the encoder of SCAN and the decoder of βVAE. Since βVAE is pre-trained, the sym2img curve mainly reflects the quality of SCAN's encoder. Similarly, the img2sym curve reveals the quality of SCAN's decoder. The consistent result from the five pairs of curves supports the hypothesis that SCAN's encoder is trained faster than its decoder.

Additionally, at a particular training stage, the experts investigated how good SCAN is at different inference tasks (R1.3). Fig. 9 shows SCANViz at epoch 10. When performing sym2sym on a data instance with yellow wall color, as shown in Fig. 9a, we found the yellow wall color is incorrectly reconstructed as red. Clicking the red circle shows a bar chart presenting the output from SCAN's decoder (Fig. 9-a1). It can be seen that SCAN is not confident in this reconstruction, as no single value dominates the output. The distribution of the 60 reconstructed symbols shown in Fig. 9-a2 also reflects this. However, we also found that sym2sym works well on some other instances at this epoch. For example, in Fig. 9b, SCAN confidently reconstructed the instance with red wall color (Fig. 9-b1, b2). The comparison between Fig. 9a and Fig. 9b indicates that reconstructing different symbols may have different levels of difficulty for SCAN.

As we already knew (from the losses) that the decoder of SCAN is not well trained at epoch 10, the poor reconstruction performance in Fig. 9a is most probably due to the low-quality decoder. To verify this, the experts explored the sym2img inference (Fig. 9c). The reconstructed image (Fig. 9-c2), the 60 reconstructed samples (Fig. 9-c3), and their distribution (Fig. 9-c4) all present good reconstructions of the yellow wall color, which verifies that SCAN's encoder is good enough to extract the essence of this symbol (sym2img only uses SCAN's encoder). These explorations motivated the experts to consider some changes to SCAN: whether one could stop training the encoder early (since it is easier to train) or increase the weight of the reconstruction loss in later training stages (i.e., decay the values of β and γ in Equation 3).

R2: Latent Exploration. With an overview of SCAN's quality, we then explore the latent space of βVAE and SCAN.

First, using the Latent view, the experts explored which latent dimensions of the βVAE captured the visual concepts of DeepMind-Lab and what they captured (R2.1). The ideal disentanglement result is that one latent dimension controls only one visual concept and, conversely, each visual concept is controlled by only that dimension, which is really hard to accomplish. Through both forward and backward analysis, the experts found five dimensions that have very clear connections with the visual concepts, i.e., “d20” and “d18” mainly control the floor color, “d4” mainly controls the wall color, “d31” mainly controls the object id, and “d11” mainly controls the object color. Besides these, “d21” shows a minor effect on object id and object color, and a very minor effect on floor color; “d14” shows a minor effect on object color. The remaining 25 dimensions show no effect on the reconstructed image at all. These 25 “irrelevant” dimensions quickly become unit Gaussian distributions during training, as constrained by latent0. This observation also inspired the experts to train βVAE with fewer latent dimensions.

Next, the experts investigated how SCAN learns to associate its latent space with the pre-trained latent space of βVAE (R2.2, R2.3). They found that most of SCAN's latent dimensions achieved a representation similar to that of the βVAE very quickly, except “d21” and “d31”. Fig. 10 (left) shows the evolution of SCAN's latent space (yellow band) in comparison with the latent space of βVAE (blue band) over epochs 1, 2, and 3. Fig. 10 (right) shows the performance of sym2img at the corresponding epoch when given the symbol gray object color, yellow wall color, green floor color, and hat (hat is represented by green). We present the quality of sym2img, as it is directly affected by SCAN's latent space.

At the first epoch, all latent dimensions of SCAN roughly follow a unit Gaussian distribution (which depends on the initialization). The distributions of the reconstructed images from sym2img look very random across the four visual attributes, as shown by the first row of pie charts in Fig. 10 (right). By the second epoch, the variance of dimensions “d4”, “d11”, “d14”, “d18”, “d20”, “d21”, and “d31” becomes smaller, and the means of these dimensions also shift. The portions for the correct floor color and object id in the pie charts become bigger, i.e., the third and fourth pie charts in the second row. By the third epoch, the latent space of SCAN becomes very similar to that of βVAE, except for dimensions “d21” and “d31”. The portions of the correct concept values dominate all four pie charts. Going forward, the experts found that the latent space of SCAN still moves slowly towards that of the βVAE. At the same time, the quality of sym2img becomes better and better. Note that the blue band did not change over time, as the βVAE is pre-trained.

However, the improvement on “d21” is marginal, even after 100 training epochs, as shown in Fig. 1c. Connecting this with our previous analysis, “d21” from the pre-trained βVAE affects multiple visual concepts simultaneously at different scales. This concept entanglement confused SCAN when learning to associate symbols with their corresponding concepts, causing the training problem.

R3: Interactive Training. As shown in the original SCAN paper [13], progressive training can effectively reveal SCAN's learning behavior. Here, we present an example of learning the concept {gray, floor} from increasingly diverse images to show how conveniently interactive training (R3) can be done with SCANViz.



Figure 10: Evolution of SCAN's latent space (left) and the quality of sym2img (right). The same instance as in Fig. 1 was selected in these figures.

Before training, the experts set the target floor color to gray (Fig. 6a). Values for the other concepts were not specified, as indicated by the circles with a cross. The training was divided into three stages. First, images with gray floor color that differ only in object color were selected from the Image-Overview for training. Second, the diversity of the training images was increased by adding images with different wall colors. The third stage further increased the data diversity by adding images with different object ids. Fig. 11b, 11c, and 11d show some selected images used in the three stages. Each stage lasted for 2000 epochs, which can be changed on demand.

The evolution of the 32 latent dimensions of SCAN is monitored in Fig. 11a (the vertical axis is 1 − σ²). It is obvious that only the seven previously identified dimensions have a variance less than 1 over the three stages. The variance of the other 25 dimensions quickly converges to 1. Five dimensions were of primary interest as they show clear concept associations (“d4”, “d11”, “d18”, “d20”, “d31”).

In the first stage, the curves for “d4”, “d18”, “d20”, and “d31” show obvious increasing trends. These four dimensions control the wall color, floor color, and object id. Their increase indicates that SCAN restricted the values of these three concepts to a very small range. Specifically, SCAN believes a {gray, floor} image must have a gray wall color and the object id must be ice lolly, which is biased by the training data. This can be verified by asking SCAN to generate {gray, floor} images through sym2img. Ideally, it should generate images with diversity in all concept dimensions except floor color, which must be gray. However, as shown in Fig. 11e, the diversity of wall color and object id in the generated images is very low.

In the second stage, SCAN starts to see images with different wall colors. This immediately results in the drop of “d4” (which is associated with wall color). This drop indicates the relaxation of the constraint on the wall color dimension, i.e., SCAN starts to realize that images with different wall colors can also be {gray, floor}. From the generated images (Fig. 11f), we confirm that the model gains the ability to generate {gray, floor} images with different wall colors.

The most obvious change in the third stage is the drop of “d31”, which controls object id. Since the model starts to see {gray, floor} images with different objects, it learns that the object type is irrelevant to floor color. By the end of this stage, SCAN is very robust on the target concept, and gains the ability to generate {gray, floor} images with diversity in all three other visual dimensions (Fig. 11g).

Over the three stages, the curves for “d20” and “d18” remained high, indicating a consistently strong constraint on floor color, i.e., it must always be gray. The curve for “d11” (in green, Fig. 11a) stayed in the range of 0.3–0.4 over the three stages, reflecting the model's ability to generate diverse object colors across the three stages.

During training, curves in the Statistics-Overview are dynamically updated, providing the experts with immediate feedback on SCAN's quality evolution (please see our associated video). Through the configuration file of SCANViz, the experts can also synchronize the updates to the SCAN and Latent views on demand during training.

7.2.2 Exploring CartoonSet with SCANViz

Improving βVAE through Visual Feedback. Switching to CartoonSet, we first used the Latent view to identify which latent dimensions of the βVAE captured the visual concepts in CartoonSet. Through comprehensive analysis, especially the backward analysis, we found that the concept eye color was not captured by any latent dimension, i.e., the aggregated mean band for images varying in eye color degenerated to a curve. Most probably, this training failure results from the small area of the eye region. In βVAE, the difference in eye color between the original and reconstructed images is measured by the first term of Equation 2. However, due to the small size of the eyes, the relatively small loss introduced by the reconstruction term might be ignored by the model. To verify this, we trained the βVAE again with a smaller β, i.e., deemphasizing the latent loss. Exploring the new βVAE with the Latent view again, we found the problem was fixed and “d3” has a strong association with the eye color. Meanwhile, “d7” and “d9” control hair style, “d1” and “d2” control hair color, and “d30” controls face color.

Exploring “d22”. Besides these six dimensions, the Statistics-Overview of SCANViz also identified one interesting dimension, i.e., “d22” (Fig. 8b). This dimension has a variance less than 1, but much larger than those of the other six dimensions. Through exploration, we found it controls the hair style of a subset of images. For example, in Fig. 8c, the latent representation of the selected image on “d22” is a unit Gaussian, and “d22” has no effect on the reconstructed image at all. However, “d22” can change the hair style of the image in Fig. 8d, though the change is very limited, i.e., it can only switch between style 1 and style 9 (Fig. 8d). Before this exploration, the experts did not know that one latent dimension can control features in a subset of images, and it is extremely difficult for them to find this through traditional statistical measurements. One expert believes “d22” may learn a local distribution and help “d7” and “d9” in controlling the hair style on this subset of images. She has initiated some theoretical investigations into this.

Interactive Training. We also performed interactive training on CartoonSet through SCANViz. Fig. 11h shows the variance change of SCAN's 32 latent dimensions. The target concept to learn is {black eye, black face}. In the first 2000 epochs, images with this concept and diverse hair colors were fed to SCAN for training. The curves for “d3” (in orange) and “d30” (in purple) increased quickly in this stage to learn the strong specification of eye color and face color. Meanwhile, as the training images in this stage have no variance in hair style, the curves for “d7” (in green) and “d9” (in yellow), which control the hair style, also increased. From epoch 2000 to 4000, we added images with varying hair styles into the training, and the drop of “d7” and “d9” indicates the relaxation of the constraint on hair style, i.e., the model realized that images with the target concept can have different hair styles. Since the green and yellow curves had not converged by the end of 4000 epochs, we continued to train the model for more epochs. Finally, the green and yellow curves converged around 6000 and 8000 epochs respectively. With SCANViz, users can conveniently perform this continued training on demand.

Apart from the above findings, CartoonSet shows results consistent with DeepMind-Lab. For example, we still see that the latent losses converge much faster than the reconstruction loss, and sym2img achieves good performance much sooner than img2sym.

Figure 11: Interactive training of SCAN on DeepMind-Lab in 3 stages: (b, c, d) selected input of the 3 stages; (e, f, g) the output of sym2img from the 3 stages. The variance changes of the 32 latent dimensions when training SCAN on DeepMind-Lab and CartoonSet are shown in (a) and (h).

7.3 Feedback from Domain Experts

We conducted open-ended exit interviews with the five experts (E1–E5) to think aloud on the derived insights, discuss the pros and cons of our design, and envision future improvements of SCANViz.

Overall, their feedback was positive. Most of them liked the interactive training the most. E1 believed this function “has a big potential for model debugging”. E2 connected this with lifelong learning [21], and commented that “the function may guide a model to maintain specific knowledge and one can intentionally control the evolution of certain knowledge” in incremental learning. E3 appreciated SCANViz for demonstrating the evolution details of SCAN’s intelligence. He also mentioned that “information theories could be borrowed to quantitatively measure the volume of knowledge”, and suggested to “monitor the entropy in the training and generated images to measure the information flow”.
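E3’s entropy suggestion could, for instance, be prototyped as the Shannon entropy of each image’s intensity histogram. The function below is a rough sketch of that idea under the assumption of pixel values normalized to [0, 1]; it is not a feature of SCANViz.

```python
import numpy as np

def image_entropy(img, bins=256):
    """Rough sketch of E3's suggestion: Shannon entropy (in bits) of an
    image's pixel-intensity histogram, which could be tracked for both
    training and generated images to gauge information flow."""
    hist, _ = np.histogram(np.asarray(img), bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)  # normalize to a probability distribution
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```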

The model evaluation part also attracted a lot of attention from the experts. E2 believed the strong emphasis on the latent losses forced the encoder’s parameters into a “comfort zone” quickly. E4 was surprised to see that the two sets of curves (in Fig. 1-a1, Fig. 1-a3) could reflect “the unbalanced training of the balanced (symmetric) architecture” of SCAN. E3 and E5 analyzed how the 3 terms in SCAN’s loss affected the back-propagation process. E2 liked the idea of using a pre-trained classifier to evaluate sym2img. She used “similar approaches to evaluate data generated from GANs [11] and it is interesting that people are evaluating DNNs through DNNs”.

E1, E2, and E5 enjoyed exploring the latent dimensions through the forward and backward analysis. After noticing that both “d18” and “d20” have clear associations with floor color, E1 wondered “if the effects of two latent dimensions could cancel each other”. For example, after dragging the control point on “d18” to change the floor color from gray to red, can we change the values on “d20” to get the floor color back? We did not find such cases in DeepMind-Lab, but it deserves more exploration to confirm.
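E1’s cancellation question could be tested with a simple latent intervention sweep, sketched below. `decode` and `distance` are assumed callables (the βVAE decoder and an image distance); all names here are hypothetical.

```python
import numpy as np

def test_cancellation(decode, z, dim_a, new_a, dim_b, candidates_b, distance):
    """Hypothetical sketch of E1's question: after dragging `dim_a`
    (e.g., "d18") to a new value, can some value on `dim_b` (e.g., "d20")
    pull the decoded image back toward the original?"""
    original = decode(z)
    z_mod = np.array(z, dtype=float)
    z_mod[dim_a] = new_a                      # intervene on the first dimension
    results = []
    for v in candidates_b:
        z_try = z_mod.copy()
        z_try[dim_b] = v                      # sweep the second dimension
        results.append((v, distance(decode(z_try), original)))
    return min(results, key=lambda r: r[1])   # best value on dim_b and its residual
```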

Additionally, the experts also mentioned several desired features and suggested some improvements. Compared to E2–E5, E1 was more conservative in drawing the conclusion that the encoder is trained faster than the decoder. He stated that “visualizing the gradient changes in the encoder and decoder would be more convincing”. E1–E3 and E5 expressed their concerns about extending SCAN to more general datasets, and they all agreed that the bottleneck lies in the unsupervised feature extraction process. Conversely, the experts also discussed the benefits of using even simpler datasets to understand the concept extraction and symbol-concept association processes in depth (we added such a dataset in our supplementary material).

Based on the insights from SCANViz, the experts proposed some changes to the SCAN model. E5 commented that the second term of SCAN’s loss works like a regularizer. However, since βVAE’s latent representations (the target representations) have already been regularized (by the second term of Equation 2), “it might be okay to remove this part (the second term of Equation 3) from SCAN’s loss”.
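For reference, SCAN’s objective (following [13]) has a three-term structure along the lines sketched below; the notation and signs are our own reconstruction and may differ from Equation 3, but the middle KL term is the prior regularizer the experts proposed dropping, since the βVAE latents matched by the last term are already regularized.

```latex
\mathcal{L}_{\mathrm{SCAN}}
  = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}_y \mid \mathbf{y})}\big[\log p_\theta(\mathbf{y} \mid \mathbf{z}_y)\big]}_{\text{symbol reconstruction}}
  \; - \; \underbrace{\beta\, D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_y \mid \mathbf{y}) \,\|\, p(\mathbf{z}_y)\big)}_{\text{second term: prior regularizer}}
  \; - \; \underbrace{\lambda\, D_{\mathrm{KL}}\big(q(\mathbf{z}_x \mid \mathbf{x}) \,\|\, q_\phi(\mathbf{z}_y \mid \mathbf{y})\big)}_{\text{match the }\beta\text{VAE latents}}
```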

E1, E2, and E3 shared the same thought after seeing the increase of latent0 in Fig. 1-a1. E4 and E5 were very interested in the joint architecture of βVAE+SCAN. After seeing that “d21” entangled multiple visual concepts (in the Latent view), E4 proposed the idea of “training βVAE and SCAN jointly to change the unsupervised concept extraction into a semi-supervised learning process”.

8 DISCUSSION, LIMITATIONS, AND FUTURE WORK

The architecture details of βVAE and SCAN were not provided, as we can explain SCANViz without these details. When implementing βVAE, a denoising auto-encoder (DAE [25]) was used to reduce the noise in the generated images. We have included the architecture details of DAE, βVAE, and SCAN in our supplementary material.

The 60 reconstructed symbols are not as informative as the 60 reconstructed images in Fig. 1-b2. For example, two reconstructed images of the same concept can present very different pixel values (e.g., darker/lighter gray), but two reconstructed symbols may be presented as two sets of identical circles. When playing with SCANViz, the experts mostly used the pie charts to check the model’s confidence in reconstruction, and rarely examined the individual reconstructed symbols. One optimization we plan to adopt in the future is to present only the unique symbol reconstructions and their counts. Additionally, to simplify our illustration, the number of samples taken from the latent distribution is always 60. However, users can definitely sample a much larger number of reconstructions.
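The planned optimization is straightforward to prototype: collapse the sampled symbol reconstructions into unique symbols with counts. The helper below is a small illustrative sketch; the symbol encoding it assumes (hashable tuples of attribute values) is hypothetical.

```python
from collections import Counter

def summarize_symbol_samples(symbol_samples):
    """Illustrative sketch: group the (e.g., 60) sampled symbol
    reconstructions into unique symbols with their counts and share,
    so only distinct reconstructions need to be displayed."""
    counts = Counter(tuple(s) for s in symbol_samples)
    total = len(symbol_samples)
    return [(sym, n, n / total) for sym, n in counts.most_common()]
```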

Currently, scalability is a big limitation of SCANViz, e.g., it does not scale well when the studied dataset has a large number of visual concepts. For example, we cannot directly use SCANViz to study the CelebA dataset [19], as it contains 40 binary visual concepts. We did not emphasize the scalability problem too much in this work, as we currently focus more on interpreting the SCAN model itself (with relatively small datasets). In the future, we would like to extend SCANViz to handle datasets with more, and even non-orthogonal, visual concepts. When working with those datasets, we can present a summary view of all visual concepts first, and ask users to select the top-4 most interesting ones for detailed exploration later. Through carefully-designed interactions, users would still have the ability to explore all visual concepts within the limited screen space. Also, we plan to use glyph visualizations in the SCAN view to represent the values of individual concepts, especially when those values cannot be represented by colored circles. For datasets with non-orthogonal concepts, the unsupervised βVAE may not perform well in the concept extraction process. In that case, we expect our current forward/backward analysis (Sect. 7.2.1) to play a more important role in revealing what concepts the model has really captured.

The generalization of SCANViz is the major focus of our future work, and we plan to start from the following two directions. First, we would like to extend our method of latent space exploration to other DNN models. Specifically, the Latent view of SCANViz can directly be used to present the latent space of the traditional VAE and its variants. With this view, model designers can investigate and compare different VAEs by directly comparing their latent spaces. Separate trainings of the same VAE can also be compared to diagnose the model’s training dynamics. This extension will promote our work to have a broader impact on the more general problem of embedding disentanglement. Second, the progressive training idea (with the human-in-the-loop analysis pipeline) can also be borrowed to investigate other models. For example, one can train a CNN classifier using images with increasingly diverse features to see how the classifier’s decision boundary evolves over time. This is similar to our problem of observing how SCAN captured a particular concept and deepened its understanding of the concept through data instances of varying features. Similar experiments can also be extended to GAN models to monitor their generative power.

We also identified several interesting future directions from the experts’ feedback. First, we want to disclose the gradient changes of SCAN’s encoder and decoder, as suggested by E1. Second, using entropy to monitor the quality of the reconstructed images over interactive trainings (suggested by E5) will also be adopted in SCANViz soon. Third, we will work with the experts to see how their proposed model-level changes affect SCAN’s performance.

9 CONCLUSION

In this work, we proposed SCANViz, a visual analytics system to interpret the symbol-concept association captured by SCAN. Targeting three objectives, i.e., model evaluation, latent exploration, and interactive training, SCANViz integrates three visualization views. These views allow users to comprehensively evaluate the performance of SCAN and explore/compare the latent representations of data instances from both the unsupervised concept extraction process and the supervised symbol-concept association process. SCANViz also empowers interactive model training (through a human-in-the-loop analysis pipeline) to assist users in progressively interpreting the model evolution process. Through case studies with multiple deep learning experts, we validated the effectiveness of SCANViz.

REFERENCES

[1] A Neural Network Playground - TensorFlow (accessed: 2018-08-13). https://playground.tensorflow.org.
[2] CartoonSet dataset (accessed: 2018-09-10). https://google.github.io/cartoonset/index.html. CartoonSet is licensed by Google LLC under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
[3] ConvNetJS (accessed: 2019-08-10). https://cs.stanford.edu/people/karpathy/convnetjs/.
[4] Flask (accessed: 2019-08-30). https://palletsprojects.com/p/flask/. Flask is licensed under the BSD 3-Clause "New" or "Revised" License (https://github.com/pallets/flask/blob/master/LICENSE.rst).
[5] Rodent (accessed: 2019-08-23). https://github.com/miyosuda/rodent/tree/master/examples/04_texture_replace. Rodent is licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
[6] TensorBoard: TensorFlow's visualization toolkit (accessed: 2019-12-10). https://www.tensorflow.org/tensorboard.
[7] A. Bilal, A. Jourabloo, M. Ye, X. Liu, and L. Ren. Do convolutional neural networks learn class hierarchy? IEEE Transactions on Visualization and Computer Graphics, 24(1):152–162, 2017.
[8] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
[9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.
[10] J. Choo and S. Liu. Visual analytics for explainable deep learning. IEEE Computer Graphics and Applications, 38(4):84–92, 2018.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[12] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations (ICLR), 2017.
[13] I. Higgins, N. Sonnerat, L. Matthey, A. Pal, C. P. Burgess, M. Bosnjak, M. Shanahan, M. Botvinick, D. Hassabis, and A. Lerchner. SCAN: Learning hierarchical compositional visual concepts. International Conference on Learning Representations (ICLR), 2018.
[14] M. Kahng, N. Thorat, D. H. P. Chau, F. B. Viegas, and M. Wattenberg. GAN Lab: Understanding complex deep generative models using interactive visual experimentation. IEEE Transactions on Visualization and Computer Graphics, 25(1):1–11, 2018.
[15] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning (ICML), pp. 2673–2682, 2018.
[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.
[17] M. Liu, J. Shi, K. Cao, J. Zhu, and S. Liu. Analyzing the training processes of deep generative models. IEEE Transactions on Visualization and Computer Graphics, 24(1):77–87, 2017.
[18] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu. Towards better analysis of deep convolutional neural networks. IEEE Transactions on Visualization and Computer Graphics, 23(1):91–100, 2016.
[19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[20] Y. Ming, H. Qu, and E. Bertini. RuleMatrix: Visualizing and understanding classifiers with rules. IEEE Transactions on Visualization and Computer Graphics, 25(1):342–352, 2018.
[21] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[22] N. Pezzotti, T. Hollt, J. Van Gemert, B. P. Lelieveldt, E. Eisemann, and A. Vilanova. DeepEyes: Progressive visual analytics for designing deep neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):98–108, 2017.
[23] D. Ren, S. Amershi, B. Lee, J. Suh, and J. D. Williams. Squares: Supporting interactive performance analysis for multiclass classifiers. IEEE Transactions on Visualization and Computer Graphics, 23(1):61–70, 2016.
[24] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.
[25] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM, 2008.
[26] J. Wang, L. Gou, H.-W. Shen, and H. Yang. DQNViz: A visual analytics approach to understand deep Q-networks. IEEE Transactions on Visualization and Computer Graphics, 25(1):288–298, 2018.
[27] J. Wang, L. Gou, H. Yang, and H.-W. Shen. GANViz: A visual analytics approach to understand the adversarial game. IEEE Transactions on Visualization and Computer Graphics, 24(6):1905–1917, 2018.
[28] J. Wang, L. Gou, W. Zhang, H. Yang, and H.-W. Shen. DeepVID: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation. IEEE Transactions on Visualization and Computer Graphics, 25(6):2168–2180, 2019.
[29] J. Wang, X. Liu, H.-W. Shen, and G. Lin. Multi-resolution climate ensemble parameter analysis with nested parallel coordinates plots. IEEE Transactions on Visualization and Computer Graphics, 23(1):81–90, 2016.
[30] J. Zhang, Y. Wang, P. Molino, L. Li, and D. S. Ebert. Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 25(1):364–373, 2018.