DEEP RECURSIVE NEURAL NETWORKS FOR COMPOSITIONALITY IN LANGUAGE
Ozan İrsoy, Claire Cardie (Cornell University)
Poster: …oirsoy/files/drsv/drsv_poster.pdf




RECURSIVE NEURAL NETWORK Representation at each node is a nonlinear transformation of the two children:


[Figure: binary parse tree over the sentence "that movie was cool"; each node's representation is built from its two children.]

$x_\eta = f(W_L x_{l(\eta)} + W_R x_{r(\eta)} + b)$
$y_\eta = g(U x_\eta + c)$
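To make the composition rule concrete, here is a minimal NumPy sketch of a single (tied) recursive layer. It is an illustrative sketch, not the authors' code: the tree encoding and function names are assumptions, and f and g are taken to be the rectifier and softmax mentioned under NETWORK TRAINING.

```python
import numpy as np

def relu(z):                      # rectifier f for hidden activations
    return np.maximum(0.0, z)

def softmax(z):                   # g for the per-node output y_eta
    e = np.exp(z - z.max())
    return e / e.sum()

# A tree is either a leaf word vector (np.ndarray) or a pair (left_subtree, right_subtree).
def rnn_forward(tree, W_L, W_R, b, U, c, outputs):
    """x_eta = f(W_L x_{l(eta)} + W_R x_{r(eta)} + b);  y_eta = g(U x_eta + c)."""
    if isinstance(tree, np.ndarray):
        x = tree                                   # tied model: a leaf's representation is
    else:                                          # its word vector (same size as hidden)
        x_l = rnn_forward(tree[0], W_L, W_R, b, U, c, outputs)
        x_r = rnn_forward(tree[1], W_L, W_R, b, U, c, outputs)
        x = relu(W_L @ x_l + W_R @ x_r + b)
    outputs.append(softmax(U @ x + c))             # sentiment distribution at this node
    return x
```

In this tied form, word vectors and hidden vectors share one space, so W_L and W_R are |x| by |x|; that is the quadratic cost the untied variant below avoids.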

UNTIED RECURSIVE NET Recursive connections are parametrized according to whether the incoming edge comes from a leaf or from an internal node. Two advantages of untying:
1. Complexity is linear in |x|, not quadratic: use small models with large word vectors.
2. Activation of rectifiers is more natural: word vectors are dense, hidden layers are sparse.

[Figure: untied recursive net over the same parse tree of "that movie was cool".]

$h_\eta = f(W_L^{l(\eta)} h_{l(\eta)} + W_R^{r(\eta)} h_{r(\eta)} + b)$
$y_\eta = g(U^{(\eta)} h_\eta + c)$
(superscripts indicate that a different weight matrix is used depending on whether the corresponding node is a leaf or an internal node)
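A sketch of the untied layer under the same assumptions (the tree encoding and parameter names such as W_L_leaf are hypothetical). The weight applied to a child is chosen by whether that child is a leaf (a d-dimensional word vector) or an internal node (an h-dimensional hidden vector).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A tree is either a leaf word vector (np.ndarray) or a pair (left_subtree, right_subtree).
def untied_forward(tree, params):
    """h_eta = f(W_L^{l(eta)} h_{l(eta)} + W_R^{r(eta)} h_{r(eta)} + b)."""
    if isinstance(tree, np.ndarray):
        return tree, True                          # (representation, is_leaf)
    h_l, l_is_leaf = untied_forward(tree[0], params)
    h_r, r_is_leaf = untied_forward(tree[1], params)
    W_L = params["W_L_leaf"] if l_is_leaf else params["W_L_int"]   # h x d  or  h x h
    W_R = params["W_R_leaf"] if r_is_leaf else params["W_R_int"]
    h = relu(W_L @ h_l + W_R @ h_r + params["b"])
    return h, False
```

With, say, d = 300 word2vec vectors and h = 50 hidden units, only the leaf matrices are h by d, so the cost grows linearly in |x| rather than quadratically, which is the "small models with large word vectors" point above.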

DEEP RECURSIVE NET We construct deep recursive nets by stacking multiple recursive layers: each recursive layer processes a tree and passes it to the next layer.

$h_\eta^{(i)} = f(W_L^{(i)} h_{l(\eta)}^{(i)} + W_R^{(i)} h_{r(\eta)}^{(i)} + V^{(i)} h_\eta^{(i-1)} + b)$
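A sketch of how one stacked layer (i > 1) can consume the tree produced by the layer below. The encoding of a layer's output and the treatment of leaves at higher layers (receiving only the V connection) are assumptions made for illustration; the output layer from the earlier panels is omitted here.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A layer's output mirrors the parse tree: a node is (h_vec, left_output, right_output),
# with left_output = right_output = None at the leaves.
def deep_layer(prev, W_L, W_R, V, b):
    """h_eta^(i) = f(W_L^(i) h_l^(i) + W_R^(i) h_r^(i) + V^(i) h_eta^(i-1) + b)."""
    h_below, left, right = prev
    if left is None:                               # leaf: no children at this layer,
        h = relu(V @ h_below + b)                  # so only the connection from below (assumption)
        return (h, None, None)
    left_out = deep_layer(left, W_L, W_R, V, b)
    right_out = deep_layer(right, W_L, W_R, V, b)
    h = relu(W_L @ left_out[0] + W_R @ right_out[0] + V @ h_below + b)
    return (h, left_out, right_out)
```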

NETWORK TRAINING
• Softmax and rectifier for output and hidden layer activations, respectively.
• Dropout regularization.
• Stochastic gradient descent with cross-entropy classification objective.
• Model selection (with early stopping) is done over the development set.
• No pre-training, no fine-tuning.
• Pretrained word2vec vectors.
• Stanford Sentiment Treebank as the data.
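The pieces listed above can be tied together roughly as follows; this is a toy sketch (function names, the manual output-layer gradient, and the learning rate are illustrative), not the training code behind the reported results.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dropout(h, p=0.5, train=True):
    """Inverted dropout on a hidden vector."""
    if not train:
        return h
    mask = (rng.random(h.shape) > p) / (1.0 - p)
    return h * mask

def output_loss_and_grads(h, label, U, c):
    """Cross-entropy at one node with a softmax output layer."""
    h = dropout(h)
    y = softmax(U @ h + c)
    loss = -np.log(y[label])
    delta = y.copy()
    delta[label] -= 1.0                            # d loss / d logits for softmax + cross-entropy
    return loss, np.outer(delta, h), delta         # gradients for U and c

# One SGD step on the output parameters (gradients for the recursive weights
# would come from backpropagation through the tree, omitted here):
# loss, gU, gc = output_loss_and_grads(h, label, U, c)
# U -= 0.01 * gU
# c -= 0.01 * gc
```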

DEEP VS SHALLOW RNNS
[Figure: fine-grained (42-50%) and binary (82-87%) accuracy of deep RNNs across (Depth, Width) settings of (1, 50), (2, 45), (3, 40) and (1, 340), (2, 242), (3, 200), (4, 174), (5, 157).]

DEEP RNN VS BASELINES
[Figure: fine-grained and binary accuracy of BiNB, RNN, MVRNN, RNTN, DCNN, ParVec, and dRNN.]

INPUT PERTURBATION
[Figure: predicted sentiment over the sentence "Roger Dodger is one of the [best] variations on this theme ." (word positions 1-8) as [best] is replaced with coolest / good / average / bad / worst.]

NEAREST PHRASE NEIGHBORS
Nearest neighbors of example query phrases under the learned representations:

Query phrase: "charming results"
Neighbors: charming , interesting results; charming chemistry; charming and riveting performances; perfect ingredients; appealingly manic and energetic; gripping performances; brilliantly played; refreshingly adult take on adultery; joyous documentary; perfect medium; unpretentious , sociologically pointed; an amazing slapstick instrument; engaging film

Query phrase: "not great"
Neighbors: as great; nothing good; not very informative; a great; not compelling; not really funny; is great; only good; not quite satisfying; Is n't it great; too great; thrashy fun; be great; completely numbing experience; fake fun
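A small sketch of how such neighbor lists can be produced from a trained model: collect the hidden vector of every phrase (tree node) at a chosen layer and rank candidates by similarity to a query phrase. Cosine similarity is an assumption here; the poster does not state the exact measure used.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def nearest_neighbors(query_vec, phrase_vecs, k=5):
    """phrase_vecs maps a phrase string to its hidden vector at some layer."""
    ranked = sorted(phrase_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [phrase for phrase, _ in ranked[:k]]
```

Running this per layer is what surfaces the layer-specific notions of similarity noted in the conclusions.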

CONCLUSIONS
• Untying + pretrained word2vec + rectifier + dropout gives a boost (48.1 vs. 43.2).
• Deep recursive nets perform better than their shallow counterparts in fine-grained sentiment detection.
• Deep recursive nets outperform existing baselines, achieving a new state of the art on the Stanford Sentiment Treebank.
• Qualitative evaluations show that the multiple layers indeed capture different things: they have different notions of similarity.

BASELINES
Baselines from Socher et al. (2013):
• Bigram Naïve Bayes (BiNB)
• Recursive Net (RNN)
• Matrix-Vector Recursive Net (MVRNN)
• Recursive Neural Tensor Net (RNTN)
Additional baselines:
• Dynamic Convolutional Net (DCNN) (Kalchbrenner et al., 2014)
• Paragraph Vectors (ParVec) (Le & Mikolov, 2014)

ABSTRACT
Recursive neural networks comprise a class of architectures that can operate on structured input. They have previously been successfully applied to model compositionality in natural language using parse-tree-based structural representations. Even though these architectures are deep in structure, they lack the capacity for hierarchical representation that exists in conventional deep feed-forward networks as well as in recently investigated deep recurrent neural networks. In this work we introduce a new architecture, a deep recursive neural network (deep RNN), constructed by stacking multiple recursive layers. We evaluate the proposed model on the task of fine-grained sentiment classification. Our results show that deep RNNs outperform associated shallow counterparts that employ the same number of parameters. Furthermore, our approach outperforms previous baselines on the sentiment analysis task, including a multiplicative RNN variant as well as the recently introduced paragraph vectors, achieving new state-of-the-art results. We provide exploratory analyses of the effect of multiple layers and show that they capture different aspects of compositionality in language.