
## Evaluation

We partition the Quora question pairs into a 90/10 train/test split. We run training for 25 epochs with a further 90/10 train/validation split, saving the weights from the model checkpoint with the maximum validation accuracy. Training takes approximately 120 seconds/epoch using TensorFlow as a backend for Keras on an Amazon Web Services EC2 p2.xlarge GPU compute instance. Finally, we evaluate the best checkpointed model to obtain a test set accuracy of **0.8291**. The table below places this result in the context of other work on the dataset reported to date; a sketch of the evaluation protocol follows the table.

| Model | Source of Word Embeddings | Accuracy |
| --- | --- | --- |
| "LSTM with concatenation" [[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) | "Quora's text corpus" | 0.87 |
| "LSTM with distance and angle" [[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) | "Quora's text corpus" | 0.87 |
| "Decomposable attention" [[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) | "Quora's text corpus" | 0.86 |
| Max bag-of-embeddings (*this model*) | GloVe (840B tokens, 300D) | 0.83 |
| "Neural bag-of-words" (max) [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.83 |
| "Neural bag-of-words" (max & mean) [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.83 |
| "Max-out Window Encoding" with depth 2 [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.83 |
| "Neural bag-of-words" (mean) [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.81 |
| "Spacy + TD-IDF + Siamese" [[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/) | GloVe (6B tokens, 300D) | 0.79 |


## Discussion

An initial pass at hyperparameter tuning, evaluating possible settings one hyperparameter at a time, led to the following observations (a model sketch reflecting these settings follows the list):

* Computing the question representation by applying the max operator to the word embeddings slightly outperformed using mean and sum, which is consistent with what is reported in [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification).
* Computing the question representation using max also slightly outperformed the use of bidirectional LSTM and GRU recurrent layers, again as discussed in [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification).
* Batch normalization improved accuracy, as observed by [[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/).
* Any amount of dropout decreased accuracy, as also observed by [[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/).
* Four hidden layers in the fully-connected component gave the best accuracy among the zero to six hidden layers evaluated.
* Using 200 dimensions for the layers in the fully-connected component gave the best accuracy among the tested dimensions of 50, 100, 200, and 300.
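
As a concrete illustration of these settings taken together, here is a hedged sketch (Keras 2 functional API; the constants and the loading of GloVe weights into `glove_matrix` are assumed rather than taken from the repository) of a max bag-of-embeddings model with batch normalization, no dropout, and four 200-dimensional hidden layers:

```python
# Illustrative sketch of a model reflecting the observations above
# (assumed constants and names; not the repository's exact code).
from keras.layers import Input, Embedding, Lambda, Dense, BatchNormalization, concatenate
from keras.models import Model
import keras.backend as K

MAX_LEN, VOCAB, EMBED_DIM = 25, 100000, 300  # assumed values

# Shared embedding layer and element-wise max over the sequence axis.
embed = Embedding(VOCAB, EMBED_DIM)  # weights=[glove_matrix] in practice
max_pool = Lambda(lambda x: K.max(x, axis=1))

q1 = Input(shape=(MAX_LEN,), dtype='int32')
q2 = Input(shape=(MAX_LEN,), dtype='int32')
merged = concatenate([max_pool(embed(q1)), max_pool(embed(q2))])

# Four 200-dimensional fully-connected layers with batch normalization
# and no dropout, per the tuning observations.
for _ in range(4):
    merged = Dense(200, activation='relu')(merged)
    merged = BatchNormalization()(merged)

prediction = Dense(1, activation='sigmoid')(merged)
model = Model(inputs=[q1, q2], outputs=prediction)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```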

Setting aside the Quora baselines, more complex architectures have yet to outperform bag-of-embeddings approaches on this dataset. As noted in [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification), this is an encouraging result for simple bag-of-embeddings approaches to dyadic prediction over textual data. Among these results, using max as the operator for combining word embeddings seems to yield the best accuracy, at around 0.83.

How, then, to account for the superior performance of the Quora baselines? The simplest Quora architecture is essentially the same as the other bag-of-embeddings architectures, modulo its use of a recurrent LSTM layer to combine the word embeddings into a question representation; the tuning investigation above showed no improvement from using an LSTM. One hypothesis is that training embeddings directly on the Quora text corpus, as opposed to using relatively more generic, publicly available embeddings such as GloVe, contributes to the difference in performance [[8]](#popescu-private-communication).
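
One inexpensive way to probe this hypothesis would be to train embeddings on the question text itself and substitute them for the GloVe weights. The sketch below uses gensim's Word2Vec (gensim 4.x API); `questions`, `word_index`, and `VOCAB` are assumed names, and this is illustrative rather than a description of Quora's actual method.

```python
# Train 300-dimensional word2vec embeddings on the question text itself
# (gensim 4.x API; illustrative only, not Quora's method).
import numpy as np
from gensim.models import Word2Vec

# `questions` is an assumed iterable of tokenized questions, e.g.
# [['how', 'do', 'i', 'learn', 'python'], ...]
w2v = Word2Vec(sentences=questions, vector_size=300, window=5,
               min_count=2, workers=4)

# Build a weight matrix aligned with the tokenizer's ids (`word_index`
# is an assumed token -> id mapping), for use in the Keras Embedding
# layer in place of the GloVe matrix.
embedding_matrix = np.zeros((VOCAB, 300))
for word, idx in word_index.items():
    if idx < VOCAB and word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]
```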

## Future work

A more principled (and computationally intensive) campaign of randomized search over the space of hyperparameter configurations is planned.
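
A minimal sketch of what such a search might look like, assuming a hypothetical helper `build_and_evaluate(config)` that trains a model under a given configuration and returns its validation accuracy:

```python
# Hedged sketch of randomized search over the hyperparameters explored
# above; `build_and_evaluate` is a hypothetical helper, not repo code.
import random

space = {
    'pooling': ['max', 'mean', 'sum'],
    'hidden_layers': [0, 1, 2, 3, 4, 5, 6],
    'hidden_dim': [50, 100, 200, 300],
    'dropout': [0.0, 0.1, 0.2],
    'batch_norm': [True, False],
}

best_acc, best_config = 0.0, None
for _ in range(50):  # number of sampled configurations
    config = {name: random.choice(values) for name, values in space.items()}
    acc = build_and_evaluate(config)
    if acc > best_acc:
        best_acc, best_config = acc, config
print(best_acc, best_config)
```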

## Requirements


[[4]](http://nlp.stanford.edu/pubs/glove.pdf) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October 2014.

[[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) Lili Jiang, Shuo Chang, and Nikhil Dandekar. "Semantic Question Matching with Deep Learning," 13 February 2017. Retrieved at https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning on 13 February 2017.

[[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) Matthew Honnibal. "Deep text-pair classification with Quora's 2017 question dataset," 13 February 2017. Retrieved at https://explosion.ai/blog/quora-deep-text-pair-classification on 13 February 2017.

[[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/) Eren Golge. "Duplicate Question Detection with Deep Learning on Quora Dataset," 12 February 2017. Retrieved at http://www.erogol.com/duplicate-question-detection-deep-learning/ on 13 February 2017.

[[8]](#popescu-private-communication) Ana-Maria Popescu. Private communication, 13 February 2017.

## License

MIT. See the LICENSE file for the copyright notice.
