DataLoader for Encoder to Decoder Model

An efficient data loader for text datasets, built from torch.utils.data.Dataset, a custom collate_fn, and torch.utils.data.DataLoader.

This is an updated version of the Seq2Seq data loader from yunjey/seq2seq-dataloader.

Seq2Seq model image from the "Seq2Seq model in TensorFlow" post.

Add a <start> token to the decoder input and an <end> token to the model's target output.

I am a Student => Je suis etudiant
encoder input : 'I', 'am', 'a', 'Student'
decoder input : '<start>', 'Je', 'suis', 'etudiant'
target        : 'Je', 'suis', 'etudiant', '<end>'

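For example, take the following target sentence. In the tensors printed after it, trg_seqs[0] is the decoder input and target[0] is the target output; 1 appears to be the <start> id, 2 the <end> id, and 0 the padding id.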

Der weltweit zweitgrößte Anbieter von Besucherattraktionen zielt darauf ab , seinen 30 Millionen Besuchern auf der ganzen Welt durch seine globalen und lokalen Marken sowie das Engagement und die Leidenschaft seiner Führungskräfte und Mitarbeiter ein einzigartiges , unvergessliches und lohnenswertes Erlebnis zu bieten .

print(trg_seqs[0])
tensor([   1,   49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,
         229,  524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357,
        2454,  117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,
           3,    8,    3,   14,    3, 2676,  127, 1207,   28,    0,    0,    0,
           0])

print(target[0])
tensor([  49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,  229,
         524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357, 2454,
         117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,    3,
           8,    3,   14,    3, 2676,  127, 1207,   28,    2,    0,    0,    0,
           0])
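The relationship between the two tensors can be sketched in plain Python. This is a minimal illustration, not the repository's code: the ids 0, 1 and 2 for padding, <start> and <end> are inferred from the printed tensors, and the function name make_decoder_pair is hypothetical.

```python
PAD_ID, START_ID, END_ID = 0, 1, 2  # assumed ids, inferred from the printed tensors


def make_decoder_pair(token_ids, padded_len):
    """Return (decoder_input, target), both right-padded to padded_len."""
    decoder_input = [START_ID] + list(token_ids)  # <start> goes in front
    target = list(token_ids) + [END_ID]           # <end> goes at the back
    if len(decoder_input) > padded_len:
        raise ValueError("padded_len is too small for this sequence")
    pad = [PAD_ID] * (padded_len - len(decoder_input))
    return decoder_input + pad, target + pad


dec_in, tgt = make_decoder_pair([49, 2267, 3], padded_len=6)
print(dec_in)  # [1, 49, 2267, 3, 0, 0]
print(tgt)     # [49, 2267, 3, 2, 0, 0]
```

Both lists have the same length, so position i of the decoder input lines up with the token the model should predict at position i of the target.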

Add an <unk> replacement mechanism for the out-of-vocabulary (OOV) problem: tokens missing from word2id are mapped to the <unk> id.

sequence.extend([word2id[token] if token in word2id else word2id['<unk>'] for token in tokens])
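The same lookup can be tried in isolation. The sketch below is only an illustration: the helper name tokens_to_ids and the toy vocabulary with its ids are made up for the example, and dict.get is used as an equivalent of the conditional expression above.

```python
def tokens_to_ids(tokens, word2id):
    """Map tokens to ids, falling back to the <unk> id for unknown words."""
    unk_id = word2id['<unk>']
    return [word2id.get(token, unk_id) for token in tokens]


# Toy vocabulary; these ids are illustrative only.
word2id = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3, 'je': 4, 'suis': 5}
ids = tokens_to_ids(['je', 'suis', 'etudiant'], word2id)
print(ids)  # [4, 5, 3]
```

Here 'etudiant' is not in the toy vocabulary, so it is encoded with the <unk> id instead of raising a KeyError.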

Add src_max and trg_max to avoid running out of CUDA memory.

src_max : maximum sequence length in the source domain.
trg_max : maximum sequence length in the target domain.

Capping the sequence length keeps memory use bounded when an input sequence is very long.
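One way to apply such limits is to truncate each token list before it is converted to ids. The sketch below assumes truncation (rather than dropping long pairs) and truncates before any <start> or <end> token is added; neither detail is stated above, and the function name truncate_pair is hypothetical.

```python
def truncate_pair(src_tokens, trg_tokens, src_max=None, trg_max=None):
    """Clip each token list to its maximum length; None means no limit."""
    if src_max is not None:
        src_tokens = src_tokens[:src_max]
    if trg_max is not None:
        trg_tokens = trg_tokens[:trg_max]
    return src_tokens, trg_tokens


src, trg = truncate_pair(list('abcdefgh'), list('xyz'), src_max=5, trg_max=10)
print(len(src), len(trg))  # 5 3
```

Because a batch is padded to its longest sequence, a single very long sentence otherwise sets the tensor size for the whole batch.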

Prerequisites

Usage

1. Clone the repository

$ git clone https://github.com/graykode/enc2dec-dataloader.git
$ cd enc2dec-dataloader

2. Download nltk tokenizer

$ pip install nltk
$ python
>>> import nltk
>>> nltk.download('punkt')

3. Build word2id dictionary

$ python build_vocab.py
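For orientation, here is a rough sketch of what a vocabulary-building step can look like. It is not the contents of build_vocab.py: str.split stands in for the nltk tokenizer, and the special-token order, the lowercasing, the min_count parameter and the function name build_word2id are all assumptions made for the example.

```python
from collections import Counter

SPECIAL_TOKENS = ['<pad>', '<start>', '<end>', '<unk>']  # assumed order


def build_word2id(sentences, min_count=1):
    """Build a word-to-id mapping from whitespace-tokenized sentences."""
    counter = Counter()
    for sentence in sentences:
        counter.update(sentence.lower().split())
    # Special tokens take the first ids.
    word2id = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    # Most frequent words get the smallest ids; ties break alphabetically.
    for word, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            word2id[word] = len(word2id)
    return word2id


word2id = build_word2id(['I am a student', 'I am here'], min_count=2)
print(word2id)
```

Words seen fewer than min_count times are left out of the mapping, so they are encoded as <unk> when the data is loaded.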

4. Check DataLoader

For usage, please see example.ipynb.
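As a rough idea of how Dataset, collate_fn and DataLoader fit together, the sketch below batches two toy sentence pairs. It is an illustration under stated assumptions, not the code in data_loader.py: the class ToyPairDataset, the helper pad_batch, the tuple returned by collate_fn and the ids 0, 1, 2 for padding, <start> and <end> are all hypothetical.

```python
import torch
from torch.utils.data import Dataset, DataLoader

PAD_ID, START_ID, END_ID = 0, 1, 2  # assumed special-token ids


class ToyPairDataset(Dataset):
    """Hypothetical dataset holding (source ids, target ids) pairs."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        src, trg = self.pairs[index]
        return torch.tensor(src), torch.tensor(trg)


def pad_batch(seqs):
    """Right-pad a list of 1-D tensors with PAD_ID to a common length."""
    lengths = [len(s) for s in seqs]
    padded = torch.full((len(seqs), max(lengths)), PAD_ID, dtype=torch.long)
    for i, seq in enumerate(seqs):
        padded[i, :lengths[i]] = seq
    return padded, lengths


def collate_fn(batch):
    """Build encoder input, decoder input and target for one batch."""
    src_seqs, trg_seqs = zip(*batch)
    start = torch.tensor([START_ID])
    end = torch.tensor([END_ID])
    src, src_lengths = pad_batch(src_seqs)
    dec_in, trg_lengths = pad_batch([torch.cat([start, t]) for t in trg_seqs])
    target, _ = pad_batch([torch.cat([t, end]) for t in trg_seqs])
    return src, src_lengths, dec_in, target, trg_lengths


pairs = [([5, 6, 7, 8], [9, 10, 11]), ([5, 6], [9])]
loader = DataLoader(ToyPairDataset(pairs), batch_size=2, collate_fn=collate_fn)
src, src_lengths, dec_in, target, trg_lengths = next(iter(loader))
print(dec_in)  # rows: [1, 9, 10, 11] and [1, 9, 0, 0]
print(target)  # rows: [9, 10, 11, 2] and [9, 2, 0, 0]
```

The custom collate_fn is needed because the default collation stacks tensors of equal size, while sentences in a batch usually differ in length.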
