
## Evaluation

We partition the Quora question pairs into a 90/10 train/test split. We run training for 25 epochs with a further 90/10 train/validation split, saving the weights from the model checkpoint with the maximum validation accuracy. Training takes approximately 120 seconds/epoch using TensorFlow as a backend for Keras on an Amazon Web Services EC2 p2.xlarge GPU compute instance. Finally, we evaluate the best checkpointed model to obtain a test set accuracy of **0.8291**. The table below places this result in the context of other work on the dataset reported to date; a sketch of the evaluation protocol follows the table.

| Model | Source of Word Embeddings | Accuracy |
| --- | --- | --- |
| "LSTM with concatenation" [[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) | "Quora's text corpus" | 0.87 |
| "LSTM with distance and angle" [[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) | "Quora's text corpus" | 0.87 |
| "Decomposable attention" [[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) | "Quora's text corpus" | 0.86 |
| Max bag-of-embeddings (*this model*) | GloVe (840B tokens, 300D) | 0.83 |
| "Neural bag-of-words" (max) [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.83 |
| "Neural bag-of-words" (max & mean) [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.83 |
| "Max-out Window Encoding" with depth 2 [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.83 |
| "Neural bag-of-words" (mean) [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) | --- | 0.81 |
| "Spacy + TD-IDF + Siamese" [[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/) | GloVe (6B tokens, 300D) | 0.79 |


## Discussion

An initial pass at hyperparameter tuning, evaluating possible settings one hyperparameter at a time, led to the following observations (a model sketch reflecting these settings follows the list):

* Computing the question representation by applying the max operator to the word embeddings slightly outperformed using mean and sum, which is consistent with what is reported in [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification).
* Computing the question representation using max also slightly outperformed the use of bidirectional LSTM and GRU recurrent layers, again as discussed in [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification).
* Batch normalization improved accuracy, as observed by [[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/).
* Any amount of dropout decreased accuracy, as also observed by [[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/).
* Four hidden layers in the fully-connected component gave the best accuracy among the zero to six hidden layers evaluated.
* Using 200 dimensions for the layers in the fully-connected component gave the best accuracy among the tested dimensions of 50, 100, 200, and 300.
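
As a concrete illustration of these settings taken together, here is a hedged sketch (Keras 2 functional API; the constants and the loading of GloVe weights into `glove_matrix` are assumed rather than taken from the repository) of a max bag-of-embeddings model with batch normalization, no dropout, and four 200-dimensional hidden layers:

```python
# Illustrative sketch of a model reflecting the observations above
# (assumed constants and names; not the repository's exact code).
from keras.layers import Input, Embedding, Lambda, Dense, BatchNormalization, concatenate
from keras.models import Model
import keras.backend as K

MAX_LEN, VOCAB, EMBED_DIM = 25, 100000, 300  # assumed values

# Shared embedding layer and element-wise max over the sequence axis.
embed = Embedding(VOCAB, EMBED_DIM)  # weights=[glove_matrix] in practice
max_pool = Lambda(lambda x: K.max(x, axis=1))

q1 = Input(shape=(MAX_LEN,), dtype='int32')
q2 = Input(shape=(MAX_LEN,), dtype='int32')
merged = concatenate([max_pool(embed(q1)), max_pool(embed(q2))])

# Four 200-dimensional fully-connected layers with batch normalization
# and no dropout, per the tuning observations.
for _ in range(4):
    merged = Dense(200, activation='relu')(merged)
    merged = BatchNormalization()(merged)

prediction = Dense(1, activation='sigmoid')(merged)
model = Model(inputs=[q1, q2], outputs=prediction)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```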

Setting aside the Quora baselines, more complex architectures have yet to outperform bag-of-embeddings approaches on this dataset. As noted in [[6]](https://explosion.ai/blog/quora-deep-text-pair-classification), this is an encouraging result for simple bag-of-embeddings approaches to dyadic prediction over textual data. Among these results, using max as the operator for combining word embeddings seems to yield the best accuracy, at around 0.83.

How, then, to account for the superior performance of the Quora baselines? The simplest Quora architecture is essentially the same as the other bag-of-embeddings architectures, modulo its use of a recurrent LSTM layer to combine the word embeddings into a question representation; the tuning investigation above showed no improvement from using an LSTM. One hypothesis is that training embeddings directly on the Quora text corpus, as opposed to using relatively more generic, publicly available embeddings such as GloVe, contributes to the difference in performance [[8]](#popescu-private-communication).
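
One inexpensive way to probe this hypothesis would be to train embeddings on the question text itself and substitute them for the GloVe weights. The sketch below uses gensim's Word2Vec (gensim 4.x API); `questions`, `word_index`, and `VOCAB` are assumed names, and this is illustrative rather than a description of Quora's actual method.

```python
# Train 300-dimensional word2vec embeddings on the question text itself
# (gensim 4.x API; illustrative only, not Quora's method).
import numpy as np
from gensim.models import Word2Vec

# `questions` is an assumed iterable of tokenized questions, e.g.
# [['how', 'do', 'i', 'learn', 'python'], ...]
w2v = Word2Vec(sentences=questions, vector_size=300, window=5,
               min_count=2, workers=4)

# Build a weight matrix aligned with the tokenizer's ids (`word_index`
# is an assumed token -> id mapping), for use in the Keras Embedding
# layer in place of the GloVe matrix.
embedding_matrix = np.zeros((VOCAB, 300))
for word, idx in word_index.items():
    if idx < VOCAB and word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]
```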

## Future work

A more principled (and computationally intensive) campaign of randomized search over the space of hyperparameter configurations is planned.
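
A minimal sketch of what such a search might look like, assuming a hypothetical helper `build_and_evaluate(config)` that trains a model under a given configuration and returns its validation accuracy:

```python
# Hedged sketch of randomized search over the hyperparameters explored
# above; `build_and_evaluate` is a hypothetical helper, not repo code.
import random

space = {
    'pooling': ['max', 'mean', 'sum'],
    'hidden_layers': [0, 1, 2, 3, 4, 5, 6],
    'hidden_dim': [50, 100, 200, 300],
    'dropout': [0.0, 0.1, 0.2],
    'batch_norm': [True, False],
}

best_acc, best_config = 0.0, None
for _ in range(50):  # number of sampled configurations
    config = {name: random.choice(values) for name, values in space.items()}
    acc = build_and_evaluate(config)
    if acc > best_acc:
        best_acc, best_config = acc, config
print(best_acc, best_config)
```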

## Requirements


[[4]](http://nlp.stanford.edu/pubs/glove.pdf) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October 2014.

[[5]](https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning) Lili Jiang, Shuo Chang, and Nikhil Dandekar. "Semantic Question Matching with Deep Learning," 13 February 2017. Retrieved at https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning on 13 February 2017.

[[6]](https://explosion.ai/blog/quora-deep-text-pair-classification) Matthew Honnibal. "Deep text-pair classification with Quora's 2017 question dataset," 13 February 2017. Retrieved at https://explosion.ai/blog/quora-deep-text-pair-classification on 13 February 2017.

[[7]](http://www.erogol.com/duplicate-question-detection-deep-learning/) Eren Golge. "Duplicate Question Detection with Deep Learning on Quora Dataset," 12 February 2017. Retrieved at http://www.erogol.com/duplicate-question-detection-deep-learning/ on 13 February 2017.

[[8]](#popescu-private-communication) Ana-Maria Popescu. Private communication, 13 February 2017.

## License

MIT. See the LICENSE file for the copyright notice.
