TensorFlow implementation of [Generating Sentences from a Continuous Space](https://arxiv.org/abs/1511.06349).
- Python packages:
  - Python 3.4 or higher
  - TensorFlow r0.12
  - NumPy
- Clone this repository:

  ```bash
  git clone https://github.com/Chung-I/Variational-Recurrent-Autoencoder-Tensorflow.git
  ```

- Set up a conda environment:

  ```bash
  conda create -n vrae python=3.6
  conda activate vrae
  ```

- Install the Python package requirements:

  ```bash
  pip install -r requirements.txt
  ```
Training:

```bash
python vrae.py --model_dir models --do train --new True
```
Reconstruct:

```bash
python vrae.py --model_dir models --do reconstruct --new False --input input.txt --output output.txt
```
Sample (this reads only the first line of `input.txt`, generates `num_pts` samples, and writes them to `output.txt`):

```bash
python vrae.py --model_dir models --do sample --new False --input input.txt --output output.txt
```
Interpolate (this requires `input.txt` to consist of exactly two sentences; it generates `num_pts` interpolations between them and writes the interpolated sentences to `output.txt`):

```bash
python vrae.py --model_dir models --do interpolate --new False --input input.txt --output output.txt
```
- `model_dir`: the directory holding the config file `config.json` and the checkpoint files.
- `do`: accepts 4 values: `train`, `reconstruct`, `sample`, or `interpolate`.
- `new`: create a model with fresh parameters if set to `True`; otherwise read model parameters from the checkpoints in `model_dir`.
Hyperparameters are not passed on the command line as in tensorflow/models/rnn/translate/translate.py. Instead, `vrae.py` reads them from `config.json` in `model_dir`. The hyperparameters in `config.json` are listed below; a sketch of a complete config follows the list.
- `model`:
  - `size`: embedding size, and encoder/decoder state size.
  - `latent_dim`: latent space size.
  - `in_vocab_size`: source vocabulary size.
  - `out_vocab_size`: target vocabulary size.
  - `data_dir`: path to the corpus.
  - `num_layers`: number of layers for the encoder and decoder.
  - `use_lstm`: whether to use LSTM cells for the encoder and decoder. `BasicLSTMCell` is used if set to `True`; otherwise `GRUCell` is used.
  - `buckets`: a list of [input size, output size] pairs, one per bucket.
  - `bidirectional`: `bidirectional_rnn` is used if set to `True`.
  - `probablistic`: variance is set to zero if set to `False`.
  - `orthogonal_initializer`: `orthogonal_initializer` is used if set to `True`; otherwise `uniform_unit_scaling_initializer` is used.
  - `iaf`: inverse autoregressive flow is used if set to `True`.
  - `activation`: activation for the encoder-to-latent and latent-to-decoder layers.
    - `elu`: exponential linear unit.
    - `prelu`: parametric rectified linear unit (default).
    - `None`: linear.
- `train`:
  - `batch_size`: training batch size.
  - `beam_size`: beam size for decoding. Warning: beam search is still under implementation; a `NotImplementedError` is raised if `beam_size` is set greater than 1.
  - `learning_rate`: learning rate passed to `AdamOptimizer`.
  - `steps_per_checkpoint`: save a checkpoint every `steps_per_checkpoint` steps.
  - `anneal`: do KL cost annealing if set to `True` (see the annealing sketch after this list).
  - `kl_rate_rise_factor`: the KL term weight is increased by this amount every `steps_per_checkpoint` steps.
  - `max_train_data_size`: limit on the size of the training data (0: no limit).
  - `feed_previous`: if `True`, only the first decoder input (the "GO" symbol) is used; every other decoder input is generated by `next = embedding_lookup(embedding, argmax(previous_output))`. In effect, this implements a greedy decoder (see the decoding sketch after this list). It can also be used during training to emulate http://arxiv.org/abs/1506.03099. If `False`, `decoder_inputs` are used as given (the standard decoder case).
  - `kl_min`: the minimum information constraint; a non-negative float (0 means no constraint).
  - `max_gradient_norm`: gradients are clipped to at most this norm.
  - `word_dropout_keep_prob`: the probability of keeping each conditioned-on word token; dropped tokens are replaced with the generic unknown token `UNK` (see the word-dropout sketch after this list). When set to 0, the decoder sees no input.
- `reconstruct`:
  - `feed_previous`
  - `word_dropout_keep_prob`
- `sample`:
  - `feed_previous`
  - `word_dropout_keep_prob`
  - `num_pts`: sample `num_pts` points.
- `interpolate`:
  - `feed_previous`
  - `word_dropout_keep_prob`
  - `num_pts`: generate `num_pts` interpolation points.
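For concreteness, below is a sketch of a complete `config.json`, written out with Python's `json` module. The field names come from the list above; every value is an illustrative guess, not a default shipped with this repository.

```python
# Sketch: write a minimal config.json into a model directory.
# Field names follow the hyperparameter list above; all values are
# illustrative guesses, not defaults shipped with this repository.
import json
import os

config = {
    "model": {
        "size": 512,                  # embedding and encoder/decoder state size
        "latent_dim": 32,             # latent space size
        "in_vocab_size": 20000,       # source vocabulary size
        "out_vocab_size": 20000,      # target vocabulary size
        "data_dir": "data",           # path to the corpus
        "num_layers": 1,              # encoder/decoder layers
        "use_lstm": False,            # False -> GRUCell, True -> BasicLSTMCell
        "buckets": [[10, 10], [25, 25], [50, 50]],
        "bidirectional": False,
        "probablistic": True,         # False -> variance fixed to zero
        "orthogonal_initializer": True,
        "iaf": False,                 # inverse autoregressive flow
        "activation": "prelu",        # "elu", "prelu", or None (linear)
    },
    "train": {
        "batch_size": 32,
        "beam_size": 1,               # >1 raises NotImplementedError
        "learning_rate": 0.001,
        "steps_per_checkpoint": 200,
        "anneal": True,               # KL cost annealing
        "kl_rate_rise_factor": 0.01,
        "max_train_data_size": 0,     # 0 = no limit
        "feed_previous": False,
        "kl_min": 0.0,
        "max_gradient_norm": 5.0,
        "word_dropout_keep_prob": 1.0,
    },
    "reconstruct": {"feed_previous": True, "word_dropout_keep_prob": 1.0},
    "sample": {"feed_previous": True, "word_dropout_keep_prob": 1.0, "num_pts": 10},
    "interpolate": {"feed_previous": True, "word_dropout_keep_prob": 1.0, "num_pts": 10},
}

os.makedirs("models", exist_ok=True)
with open(os.path.join("models", "config.json"), "w") as f:
    json.dump(config, f, indent=2)
```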
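The interaction of `anneal`, `kl_rate_rise_factor`, and `steps_per_checkpoint` can be pictured as a simple schedule. A minimal sketch, assuming the KL weight starts at 0 and is capped at 1; the function and variable names are illustrative, not taken from `vrae.py`:

```python
# Sketch of KL cost annealing: the KL term's weight starts at 0 and rises
# by kl_rate_rise_factor at every checkpoint, capped at 1.
def kl_weight_at(step, steps_per_checkpoint=200, kl_rate_rise_factor=0.01):
    checkpoints_passed = step // steps_per_checkpoint
    return min(1.0, checkpoints_passed * kl_rate_rise_factor)

for step in (0, 200, 5000, 50000):
    print(step, kl_weight_at(step))  # -> 0.0, 0.01, 0.25, 1.0
```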
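`feed_previous=True` amounts to the greedy loop below. This is a NumPy sketch of the idea, not the repository's graph code; `decoder_step` is a hypothetical stand-in for one RNN step plus the output projection.

```python
import numpy as np

def greedy_decode(decoder_step, embedding, go_id, max_len):
    """Greedy decoding: feed only the GO symbol, then feed back each argmax."""
    inputs = embedding[go_id]           # first decoder input: the "GO" symbol
    state, tokens = None, []
    for _ in range(max_len):
        logits, state = decoder_step(inputs, state)
        token = int(np.argmax(logits))  # most probable next word
        tokens.append(token)
        inputs = embedding[token]       # next = embedding_lookup(embedding, argmax(...))
    return tokens
```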
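`word_dropout_keep_prob` follows the word-dropout scheme of the paper: each decoder input token is kept with that probability and otherwise replaced by `UNK`. A minimal sketch, where the token IDs and the `unk_id` value are illustrative:

```python
import random

def word_dropout(token_ids, keep_prob, unk_id):
    """Keep each token with probability keep_prob; otherwise replace it with UNK.
    With keep_prob == 0 every token becomes UNK, so the decoder sees no input."""
    return [t if random.random() < keep_prob else unk_id for t in token_ids]

print(word_dropout([4, 17, 9, 2], keep_prob=0.5, unk_id=3))
```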
The Penn TreeBank corpus is included in the repo. We also provide a Chinese poem corpus, its preprocessed version (point `{"model": {"data_dir": "<corpus_dir>"}}` in `<model_dir>/config.json` to it), and a pretrained model (point `model_dir` to it), all of which can be found here.