This repository contains code for training deep learning systems to do question answering tasks. Our primary focus is on Aristo's science questions, though we can run various models on several popular datasets.
This code is a mix of scala (for data processing / pipeline management) and python (for actually training and executing deep models with Keras / Theano / TensorFlow).
This repository implements several variants of memory networks, including the models found in these papers:
- The original MemNN, from Memory Networks, by Weston, Chopra and Bordes
- End-to-end memory networks, by Sukhbaatar and others (close, but still in progress)
- Dynamic memory networks, by Kumar and others (close, but still in progress)
- DMN+, from Dynamic Memory Networks for Visual and Textual Question Answering, by Xiong, Merity and Socher (close, but still in progress)
- The attentive reader, from Teaching Machines to Read and Comprehend, by Hermann and others
- Windowed-memory MemNNs, from The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations (in progress)
We also implement some of our own, as-yet-unpublished variants. There is a lot of similarity between the models in these papers, and our code is structured so that switching between them is easy. For a description of how we've built an extensible memory network architecture in this library, see this readme.
This code allows for easy experimentation with the following datasets:
- AI2 Elementary school science questions (no diagrams)
- The Facebook Children's Book Test dataset
- The Facebook bAbI dataset
- The NewsQA dataset
- The Stanford Question Answering Dataset (SQuAD)
- The Who Did What dataset
And more to come... In the near future, we hope to also include easy experimentation with CNN/Daily Mail and SimpleQuestions.
This code is a mix of scala and python. The intent is that the data processing and experiment pipeline code is in scala, and the deep learning code is in python. The recommended approach is to set up your experiments in scala code, then run them through sbt. Some documentation on how to do this is found in the README for the org.allenai.deep_qa.experiments package.
If for whatever reason you don't want to gain the benefits of the scala pipeline when running experiments, you can run the python code manually. To do this, from the base directory, you run the command python src/main/python/run_solver.py [model_config]. You must use python >= 3.5, as we make heavy use of the type annotations introduced in python 3.5 to aid in code readability (I recommend using anaconda to set up python 3, if you don't have it set up already).
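For scripted runs, the manual invocation above could be wrapped roughly as follows. This is only a sketch: the helper name build_solver_command and the config file name in the usage note are hypothetical and not part of the repository; only the script path and the python >= 3.5 requirement come from the text above.

```python
import sys

# Script path taken from the command above, relative to the repository base directory.
RUN_SOLVER = "src/main/python/run_solver.py"


def build_solver_command(model_config, version=None):
    """Return the argument list for a manual run (hypothetical helper)."""
    # Fall back to the running interpreter's version when none is given.
    version = tuple(version or sys.version_info)[:2]
    if version < (3, 5):
        raise RuntimeError("python >= 3.5 is required, found %d.%d" % version)
    # Reuse the current interpreter so the version check and the run agree.
    return [sys.executable or "python", RUN_SOLVER, model_config]
```

From the base directory, something like subprocess.run(build_solver_command("my_config.json"), check=True) would then start the run, where my_config.json stands in for your own model configuration file.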
You can see some examples of what model configuration files look like in the example experiments directory. We try to keep these up to date, but the way parameters are specified is still sometimes in a state of flux, so we make no promises that these are actually usable with the current master (and you'll have to provide your own input files to use them, in any event). Looking at the most recently added or changed example experiment should be your best bet to get an accurate format. And if you find one that's out of date, submitting a pull request to fix it would be really nice!
The best way currently to get an idea for what options are available in this configuration file, and what those options mean, is to look at the class mentioned in the solver_class field. Looking at the dynamic_memory_network.json example, we can see that it's using a MultipleTrueFalseMemoryNetworkSolver as its solver_class.
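If you want to find the starting point for this lookup programmatically, something like the following might do. It is a sketch under assumptions: the configuration text is a made-up stand-in (not a copy of dynamic_memory_network.json), the value given for num_memory_layers is invented, and it assumes solver_class sits at the top level of a JSON object.

```python
import json

# Made-up stand-in for a model configuration file; real files will differ.
EXAMPLE_CONFIG = """
{
    "solver_class": "MultipleTrueFalseMemoryNetworkSolver",
    "num_memory_layers": 1
}
"""


def solver_class_name(config_text):
    """Return the name of the class whose __init__ documents the options."""
    params = json.loads(config_text)
    return params["solver_class"]


print(solver_class_name(EXAMPLE_CONFIG))
```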
If we go to that class's __init__ method in the code, we don't see any parameters, because MultipleTrueFalseMemoryNetworkSolver has no unique parameters of its own. So, we continue up the class hierarchy to MemoryNetworkSolver, and we can see the parameters that it takes: things like num_memory_layers, knowledge_encoder, entailment_model, and so on. If you continue on to its super class, TextTrainer, you'll find more parameters, this time for things that deal with word embeddings and sentence encoders. Finally, you can continue to the base class, Trainer, to see parameters for things like whether and where models should be saved, how to run training, specifying debug output, running pre-training, and other things. It would be nice to automatically generate some website to document all of these parameters, but I don't know how to do that and don't have the time to dedicate to making it happen. So for now, just read the comments that are in the code.
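Until such documentation exists, the walk up the class hierarchy can be partly automated with the standard inspect module. The sketch below uses toy classes that only borrow the names from the text; their docstrings and constructor signatures are illustrative assumptions, and init_docs is a hypothetical helper rather than something the repository provides.

```python
import inspect


def init_docs(cls):
    """Yield (class name, __init__ docstring) for every class in the
    hierarchy that defines its own __init__ (hypothetical helper)."""
    for klass in inspect.getmro(cls):
        if klass is object or "__init__" not in vars(klass):
            continue
        yield klass.__name__, inspect.getdoc(klass.__init__) or "(no docstring)"


# Toy stand-ins for the real classes; only the names come from the text.
class Trainer:
    def __init__(self, params):
        """Parameters for saving models and running training."""
        self.params = params


class TextTrainer(Trainer):
    def __init__(self, params):
        """Parameters for word embeddings and sentence encoders."""
        super().__init__(params)


class MemoryNetworkSolver(TextTrainer):
    def __init__(self, params):
        """Parameters such as num_memory_layers and knowledge_encoder."""
        super().__init__(params)


class MultipleTrueFalseMemoryNetworkSolver(MemoryNetworkSolver):
    pass  # adds no parameters of its own in this toy version


for name, doc in init_docs(MultipleTrueFalseMemoryNetworkSolver):
    print(name + ": " + doc)
```

Pointing init_docs at the real solver class named in your configuration file should list the relevant docstrings from most specific to most general, assuming the parameters are documented in each __init__ docstring.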
There are several places where we give lists of available choices for particular options. For example, there is a list of concrete solver classes that are valid options for the solver_class parameter in a model config file. One way to find lists of available options for these parameters (other than just by tracing the handling of parameters in the code) is by searching github for get_choice or get_choice_with_default. This might point you, for instance, to the knowledge_encoders field in memory_network.py, which is imported from layers/knowledge_encoders.py, where it is defined at the bottom of the file.
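The pattern these searches turn up is, roughly, a dictionary of named options plus a lookup that validates the configured name. The sketch below illustrates that idea only: the function signatures and bodies, the ConfigurationError class, and the option names encoder_a and encoder_b are all assumptions made for illustration, not the repository's implementation. Only the names get_choice, get_choice_with_default, and knowledge_encoders come from the text.

```python
class ConfigurationError(Exception):
    """Raised when a configured value is not among the available choices."""


def _check(name, value, choices):
    if value not in choices:
        raise ConfigurationError(
            "%s must be one of %s, got %r" % (name, sorted(choices), value))
    return value


def get_choice(params, name, choices):
    """Read a required option from params and validate it (assumed behavior)."""
    if name not in params:
        raise ConfigurationError("missing required parameter: " + name)
    return _check(name, params.pop(name), choices)


def get_choice_with_default(params, name, choices, default):
    """Read an optional option, using default when absent (assumed behavior)."""
    return _check(name, params.pop(name, default), choices)


# Placeholder classes and option table, in the style of a list defined at
# the bottom of a layer file. The names are hypothetical.
class EncoderA:
    pass


class EncoderB:
    pass


knowledge_encoders = {"encoder_a": EncoderA, "encoder_b": EncoderB}

params = {"knowledge_encoder": "encoder_b"}
choice = get_choice_with_default(
    params, "knowledge_encoder", knowledge_encoders, "encoder_a")
encoder_class = knowledge_encoders[choice]
```

Keeping the option table next to the classes it names means the keys of that dictionary are the list of valid values for the corresponding configuration parameter, which is why reading the bottom of the file is usually enough.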
In general, the places where there are these kinds of options are in the solver class (already mentioned), and the various layers we have implemented - each kind of Layer will typically specify a list of options either at the bottom of the corresponding file, or in an associated __init__.py file (as is done with the sentence encoders).
We've tried to also give reasonable documentation throughout the code, both in docstring comments and in READMEs distributed throughout the code packages, so browsing github should be pretty informative if you're confused about something. If you're still confused about how something works, open an issue asking to improve documentation of a particular piece of the code (or, if you've figured it out after searching a bit, submit a pull request containing documentation improvements that would have helped you).
If you use this code and think something could be improved, pull requests are very welcome. Opening an issue is ok, too, but we're a lot more likely to respond to a PR. The primary maintainer of this code is Matt Gardner, with a lot of help from Pradeep Dasigi (who was the initial author of this codebase) and Mark Neumann.
This code is released under the terms of the Apache 2 license.