…taset-creator into lemmAndStem

Conflicts:
	README.md
Petr Belohlavek committed Feb 26, 2016
2 parents 6ef6d0c + 5fdcff7 commit 3f8181b
Showing 3 changed files with 122 additions and 19 deletions.
98 changes: 94 additions & 4 deletions README.md
@@ -6,16 +6,17 @@ We describe the files for generating the Ubuntu Dialogue Corpus, and the dataset

There are several updates and bug fixes that are present in v2.0. The updates are significant enough that results on the two datasets will not be equivalent, and should not be compared. However, models that do well on the first dataset should transfer to the second dataset (with perhaps a new hyperparameter search).

- Separated the train/validation/test sets by time. The training set goes from the beginning (2004) to about April 27, 2012, the validation set goes from April 27 to August 7, 2012, and the test set goes from August 7 to December 1, 2012. This more closely mimics real life implementation, where you are training a model on past data to predict future data.
- Changed the sampling procedure for the context length in the validation and test sets, from an inverse distribution to a uniform distribution (between 2 and the max context size). This increases the average context length, which we consider desirable since we would like to model long-term dependencies.
- Changed the tokenization and entity replacement procedure. After complaints stating v1 was too aggressive, we've decided to remove these steps. It is up to each person using the dataset to come up with their own tokenization/entity replacement scheme. We plan to use the tokenization internally.
- Added differentiation between the end of an utterance (`__eou__`) and end of turn (`__eot__`). In the original dataset, we concatenated all consecutive utterances by the same user into one utterance, and put `__EOS__` at the end. Here, we also denote where the original utterances were (with `__eou__`). Also, the terminology should now be consistent between the training and test set (instead of both `__EOS__` and `</s>`).
- Fixed a bug that caused the distribution of false responses in the test and validation sets to be different from the true responses. In particular, the false responses contained fewer words on average than the true responses, which some models could have exploited.
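
As a rough illustration of the `__eou__`/`__eot__` markers described in the list above, the following Python 3 sketch splits a context string into turns and utterances. It assumes that every utterance is terminated by `__eou__` and every turn by `__eot__`, with whitespace around the markers. The helper name `split_context` and the sample string are made up for illustration and are not part of the generation scripts.

```python
EOU = "__eou__"  # end of utterance
EOT = "__eot__"  # end of turn


def split_context(context):
    """Split a context string into a list of turns, each a list of utterances.

    Assumption: utterances are terminated by __eou__ and turns by __eot__.
    """
    turns = []
    for turn in context.split(EOT):
        # Drop the empty fragments left behind by trailing markers.
        utterances = [u.strip() for u in turn.split(EOU) if u.strip()]
        if utterances:
            turns.append(utterances)
    return turns


# Hypothetical context: one user writes two utterances, the other answers once.
sample = "hi all __eou__ my wifi is down __eou__ __eot__ which card ? __eou__ __eot__"
turns = split_context(sample)
print(turns)
```

Keeping both levels lets a model choose between treating a whole turn as one unit, as v1 did, and working with the individual utterances.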

##UBUNTU CORPUS GENERATION FILES:

###generate.sh:
####DESCRIPTION:
Script that calls `create_ubuntu_dataset.py`
This is the script you should run in order to download the dataset
Script that calls `create_ubuntu_dataset.py`. This is the script you should run in order to download the dataset. The parameters passed to this script will be passed to `create_ubuntu_dataset.py`. Example usage: `./generate.sh -t -s -l`.

###create_ubuntu_dataset.py:
####DESCRIPTION:
@@ -53,11 +54,100 @@ Maps the original dialogue files to the training, validation, and test sets.
##UBUNTU CORPUS FILES (after generating):

###train.csv:
Contains the training set. It is separated into 3 columns: the context of the conversation, the candidate response or 'utterance', and a flag or 'label' (= 0 or 1) denoting whether the response is a 'true response' to the context (flag = 1), or a randomly drawn response from elsewhere in the dataset (flag = 0). This triples format is described in the paper. When generated with the default settings, train.csv is 463Mb, with 898,143 lines (ie. examples, which corresponds to 449,071 dialogues) and with a vocabulary size of 1,344,621. Note that, to generate the full dataset, you should use the `--examples` argument for the `create_ubuntu_dataset.py` file.
Contains the training set. It is separated into 3 columns: the context of the conversation, the candidate response or 'utterance', and a flag or 'label' (= 0 or 1) denoting whether the response is a 'true response' to the context (flag = 1), or a randomly drawn response from elsewhere in the dataset (flag = 0). This triples format is described in the paper. When generated with the default settings, train.csv is 463Mb, with 1,000,000 lines (ie. examples, which corresponds to 449,071 dialogues) and with a vocabulary size of 1,344,621. Note that, to generate the full dataset, you should use the `--examples` argument for the `create_ubuntu_dataset.py` file.
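
The triples format can be read with the standard `csv` module. The sketch below is written for Python 3 (the generation scripts themselves target Python 2) and makes two assumptions that should be checked against a generated file: that the first row is a header, and that the columns appear in the order context, utterance, label. The function name `read_triples` and the in-memory sample rows are hypothetical.

```python
import csv
import io


def read_triples(csv_file, has_header=True):
    """Yield (context, response, label) tuples from a train.csv-style file."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # assumption: the first row is a header
    for row in reader:
        context, response, label = row
        # Accept both "1" and "1.0" style labels.
        yield context, response, int(float(label))


# Tiny in-memory stand-in for train.csv (made-up rows, not real corpus data).
sample = io.StringIO(
    "Context,Utterance,Label\n"
    '"my wifi is down __eou__ __eot__","which card ? __eou__",1\n'
    '"my wifi is down __eou__ __eot__","try apt-get update __eou__",0\n'
)
triples = list(read_triples(sample))
positives = sum(label for _, _, label in triples)
print(len(triples), positives)
```

For a real file, pass `open("train.csv", newline="", encoding="utf-8")` instead of the `io.StringIO` object.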

###valid.csv:
Contains the validation set. Each row represents a question. Separated into 11 columns: the context, the true response or 'ground truth utterance', and 9 false responses or 'distractors' that were randomly sampled from elsewhere in the dataset. Your model gets a question correct if it selects the ground truth utterance from amongst the 10 possible responses. When generated with the default settings, `valid.csv` is 27Mb, with 19,561 lines and a vocabulary size of 115,688.
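
The scoring rule above generalizes to recall@k: a question counts as correct if the ground truth is among the model's top k candidates. The Python 3 sketch below shows one common way to compute it. It is not taken from the baseline code, and the exact procedure behind the reported numbers is not specified here. It assumes each row holds one model score per candidate, with the ground truth at index 0, and it resolves ties in favour of the ground truth. The name `recall_at_k` and the scores are illustrative only.

```python
def recall_at_k(score_rows, k, n_candidates=10):
    """Fraction of questions whose ground truth ranks in the top k.

    Assumption: index 0 of every row is the score of the ground-truth
    response and the remaining entries are distractor scores.
    """
    if not score_rows:
        raise ValueError("score_rows must not be empty")
    hits = 0
    for scores in score_rows:
        candidates = scores[:n_candidates]
        # Rank = number of distractors scored strictly higher than the truth.
        rank = sum(1 for s in candidates[1:] if s > candidates[0])
        if rank < k:
            hits += 1
    return hits / len(score_rows)


# Made-up scores for three questions with four candidates each.
rows = [
    [0.9, 0.1, 0.3, 0.2],  # ground truth ranked first
    [0.4, 0.8, 0.1, 0.2],  # ground truth ranked second
    [0.2, 0.5, 0.6, 0.7],  # ground truth ranked last
]
r1 = recall_at_k(rows, k=1, n_candidates=4)
r2 = recall_at_k(rows, k=2, n_candidates=4)
r1_in_2 = recall_at_k(rows, k=1, n_candidates=2)
print(r1, r2, r1_in_2)
```

Under these assumptions, a "1 in 2" evaluation corresponds to `n_candidates=2` (the ground truth plus one distractor) and "1 in 10" to the full row.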

###test.csv:
Contains the test set. Formatted in the same way as the validation set. When generated with the default settings, test.csv is 27Mb, with 18,921 lines and a vocabulary size of 115,623.

##BASELINE RESULTS

####Dual Encoder LSTM model:
1 in 2:
recall@1: 0.868730970907
1 in 10:
recall@1: 0.552213717862
recall@2: 0.72099120433
recall@5: 0.924285351827

####Dual Encoder RNN model:
1 in 2:
recall@1: 0.776539210705
1 in 10:
recall@1: 0.379139142954
recall@2: 0.560689786585
recall@5: 0.836350355691

####TF-IDF model:
1 in 2:
recall@1: 0.749260042283
1 in 10:
recall@1: 0.48810782241
recall@2: 0.587315010571
recall@5: 0.763054968288


##HYPERPARAMETERS USED

Code for the model can be found here (might not be up to date with the new dataset): https://github.com/npow/ubottu

####Dual Encoder LSTM model:

act_penalty=500
batch_size=256
conv_attn=False
corr_penalty=0.0
emb_penalty=0.001
fine_tune_M=True
fine_tune_W=False
forget_gate_bias=2.0
hidden_size=200
is_bidirectional=False
lr=0.001
lr_decay=0.95
max_seqlen=160
n_epochs=100
n_recurrent_layers=1
optimizer='adam'
penalize_activations=False
penalize_emb_drift=False
penalize_emb_norm=False
pv_ndims=100
seed=42
shuffle_batch=False
sort_by_len=False
sqr_norm_lim=1
use_pv=False
xcov_penalty=0.0

####Dual Encoder RNN model:

act_penalty=500
batch_size=512
conv_attn=False
corr_penalty=0.0
emb_penalty=0.001
fine_tune_M=False
fine_tune_W=False
forget_gate_bias=2.0
hidden_size=100
is_bidirectional=False
lr=0.0001
lr_decay=0.95
max_seqlen=160
n_epochs=100
n_recurrent_layers=1
optimizer='adam'
penalize_activations=False
penalize_emb_drift=False
penalize_emb_norm=False
pv_ndims=100
seed=42
shuffle_batch=False
sort_by_len=False
sqr_norm_lim=1
use_pv=False
xcov_penalty=0.0

37 changes: 25 additions & 12 deletions src/create_ubuntu_dataset.py
@@ -1,5 +1,4 @@
import argparse
import cPickle as pickle
import os
import unicodecsv
import random
@@ -14,7 +13,9 @@

"""
Script for generation of train, test and valid datasets from Ubuntu Corpus 1 on 1 dialogs.
Copyright IBM 2015
Copyright IBM Corporation 2016
LICENSE: Apache License 2.0 URL: http://www.apache.org/licenses/LICENSE-2.0
Contact: Rudolf Kadlec ([email protected])
"""

dialog_end_symbol = "__dialog_end__"
@@ -166,7 +167,7 @@ def create_single_dialog_train_example(context_dialog_path, candidate_dialog_pat
return context_str, response, label


def create_single_dialog_test_example(context_dialog_path, candidate_dialog_paths, rng, distractors_num):
def create_single_dialog_test_example(context_dialog_path, candidate_dialog_paths, rng, distractors_num, max_context_length):
"""
Creates a single example for testing or validation. Each line contains a context, one positive example and N negative examples.
:param context_dialog_path:
@@ -178,7 +179,7 @@ def create_single_dialog_test_example(context_dialog_path, candidate_dialog_path

dialog = translate_dialog_to_lists(context_dialog_path)

context_str, next_utterance_ix = create_random_context(dialog, rng)
context_str, next_utterance_ix = create_random_context(dialog, rng, max_context_length=max_context_length)

# use the next utterance as positive example
positive_response = singe_user_utterances_to_string(dialog[next_utterance_ix])
@@ -187,7 +188,7 @@ def create_single_dialog_test_example(context_dialog_path, candidate_dialog_path
return context_str, positive_response, negative_responses


def create_examples_train(candidate_dialog_paths, rng, positive_probability=0.5):
def create_examples_train(candidate_dialog_paths, rng, positive_probability=0.5, max_context_length=20):
"""
Creates single training example.
:param candidate_dialog_paths:
@@ -201,11 +202,12 @@ def create_examples_train(candidate_dialog_paths, rng, positive_probability=0.5)
if i % 1000 == 0:
print str(i)
dialog_path = candidate_dialog_paths[i]
examples.append(create_single_dialog_train_example(dialog_path, candidate_dialog_paths, rng, positive_probability))
examples.append(create_single_dialog_train_example(dialog_path, candidate_dialog_paths, rng, positive_probability,
max_context_length=max_context_length))
i+=1
#return map(lambda dialog_path : create_single_dialog_train_example(dialog_path, candidate_dialog_paths, rng, positive_probability), candidate_dialog_paths)

def create_examples(candidate_dialog_paths, creator_function):
def create_examples(candidate_dialog_paths, examples_num, creator_function):
"""
Creates a list of training examples from a list of dialogs and function that transforms a dialog to an example.
:param candidate_dialog_paths:
@@ -214,7 +216,10 @@ def create_examples(candidate_dialog_paths, creator_function):
"""
i = 0
examples = []
for context_dialog in candidate_dialog_paths:
unique_dialogs_num = len(candidate_dialog_paths)

while i < examples_num:
context_dialog = candidate_dialog_paths[i % unique_dialogs_num]
# counter for tracking progress
if i % 1000 == 0:
print str(i)
@@ -278,7 +283,9 @@ def create_eval_dataset(args, file_list_csv):
dialog_paths = map(lambda path: os.path.join(args.data_root, "dialogs", path), convert_csv_with_dialog_paths(f))

data_set = create_examples(dialog_paths,
lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng, args.n))
len(dialog_paths),
lambda context_dialog, candidates : create_single_dialog_test_example(context_dialog, candidates, rng,
args.n, args.max_context_length))
# output the dataset
w = unicodecsv.writer(open(args.output, 'w'), encoding='utf-8')
# header
@@ -313,9 +320,12 @@ def train_cmd(args):

f = open(os.path.join("meta", "trainfiles.csv"), 'r')
dialog_paths = map(lambda path: os.path.join(args.data_root, "dialogs", path), convert_csv_with_dialog_paths(f))
dialog_paths = dialog_paths[:args.examples]

train_set = create_examples(dialog_paths, lambda context_dialog, candidates : create_single_dialog_train_example(context_dialog, candidates, rng, args.p))
train_set = create_examples(dialog_paths,
args.examples,
lambda context_dialog, candidates :
create_single_dialog_train_example(context_dialog, candidates, rng,
args.p, max_context_length=args.max_context_length))

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()
@@ -353,11 +363,14 @@ def test_cmd(args):
"The script downloads 1on1 dialogs from internet and then it randomly samples all the datasets with positive and negative examples.")

parser.add_argument('--data_root', default='.',
help='directory where 1on1 dialogs will downloaded and extracted, the data will be downloaded from cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz')
help='directory where 1on1 dialogs will be downloaded and extracted, the data will be downloaded from cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz')

parser.add_argument('--seed', type=int, default=1234,
help='seed for random number generator')

parser.add_argument('--max_context_length', type=int, default=20,
help='maximum number of dialog turns in the context')

parser.add_argument('-o', '--output', default=None,
help='output csv')

6 changes: 3 additions & 3 deletions src/generate.sh
@@ -1,3 +1,3 @@
python create_ubuntu_dataset.py -t --output 'train.csv' 'train'
python create_ubuntu_dataset.py -t --output 'test.csv' 'test'
python create_ubuntu_dataset.py -t --output 'valid.csv' 'valid'
python create_ubuntu_dataset.py "$@" --output 'train.csv' 'train'
python create_ubuntu_dataset.py "$@" --output 'test.csv' 'test'
python create_ubuntu_dataset.py "$@" --output 'valid.csv' 'valid'
