PyTorch DataLoader for Encoder to Decoder Model
batman-do/enc2dec-dataloader
DataLoader for Encoder to Decoder Model

An efficient data loader for text datasets, built on torch.utils.data.Dataset, a custom collate_fn, and torch.utils.data.DataLoader.

This is an updated version of the Seq2Seq DataLoader from yunjey/seq2seq-dataloader.

Figure: Seq2Seq model (image from the "Seq2Seq model in TensorFlow" post).
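
As a rough illustration of what a collate_fn does for variable-length text, the sketch below sorts a batch of (source, target) id lists by source length and pads each side to its longest sequence. It uses plain Python lists so that it runs without PyTorch. The pad id 0 and the names pad_batch and collate_fn are assumptions for illustration, not the repository's code.

```python
def pad_batch(seqs, pad_id=0):
    """Pad every sequence to the longest one; return padded lists and true lengths."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    return padded, lengths


def collate_fn(batch):
    """batch: list of (src_ids, trg_ids) pairs, as a Dataset's __getitem__ might return."""
    # Sort by source length, longest first (commonly done before packing sequences).
    batch = sorted(batch, key=lambda pair: len(pair[0]), reverse=True)
    src_seqs, trg_seqs = zip(*batch)
    src_padded, src_lengths = pad_batch(list(src_seqs))
    trg_padded, trg_lengths = pad_batch(list(trg_seqs))
    return src_padded, src_lengths, trg_padded, trg_lengths


# Two toy examples with different lengths (ids are made up).
batch = [([5, 6], [7, 8, 9]), ([1, 2, 3, 4], [10, 11])]
src, src_len, trg, trg_len = collate_fn(batch)
```

With PyTorch, the padded lists would typically be converted with torch.tensor(...) and the function passed to torch.utils.data.DataLoader through its collate_fn argument.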

  1. Add a <start> token to the decoder input and an <end> token to the target output of the model.
I am a Student => Je suis etudiant
encoder input : 'I', 'am', 'a', 'Student'
decoder input : '<start>', 'Je', 'suis', 'etudiant'
target        : 'Je', 'suis', 'etudiant', '<end>'

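Please see this example. In the output below, trg_seqs[0] is the decoder input and starts with id 1, while target[0] is the same sequence shifted by one position and ends with id 2. Both are padded with 0, so the ids 1, 2, and 0 appear to correspond to <start>, <end>, and padding.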

Der weltweit zweitgrößte Anbieter von Besucherattraktionen zielt darauf ab , seinen 30 Millionen Besuchern auf der ganzen Welt durch seine globalen und lokalen Marken sowie das Engagement und die Leidenschaft seiner Führungskräfte und Mitarbeiter ein einzigartiges , unvergessliches und lohnenswertes Erlebnis zu bieten .
print(trg_seqs[0])
tensor([   1,   49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,
         229,  524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357,
        2454,  117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,
           3,    8,    3,   14,    3, 2676,  127, 1207,   28,    0,    0,    0,
           0])

print(target[0])
tensor([  49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,  229,
         524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357, 2454,
         117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,    3,
           8,    3,   14,    3, 2676,  127, 1207,   28,    2,    0,    0,    0,
           0])
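
The shift between decoder input and target shown above can be sketched in a few lines. This is a minimal illustration on tokens rather than ids; the helper name make_decoder_pair is hypothetical and not taken from the repository.

```python
START, END = "<start>", "<end>"


def make_decoder_pair(target_tokens):
    """Return (decoder_input, target) for one tokenized target sentence."""
    decoder_input = [START] + target_tokens  # the decoder sees <start> first
    target = target_tokens + [END]           # the decoder must finish with <end>
    return decoder_input, target


decoder_input, target = make_decoder_pair(["Je", "suis", "etudiant"])
```

At every position the decoder input holds the previous target word, which is what teacher forcing requires.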
  2. Replace out-of-vocabulary (OOV) words with the <unk> token.
sequence.extend([word2id[token] if token in word2id else word2id['<unk>'] for token in tokens])
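
A small self-contained version of the same lookup is sketched below, using dict.get instead of the conditional expression. The toy vocabulary and its id assignments are assumptions for illustration only.

```python
# Toy vocabulary; the ids are illustrative, not the ones build_vocab.py produces.
word2id = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3, "i": 4, "am": 5, "a": 6}


def encode(tokens, word2id):
    """Map tokens to ids, sending any unknown token to the <unk> id."""
    unk_id = word2id["<unk>"]
    return [word2id.get(token, unk_id) for token in tokens]


# "student" is not in the vocabulary, so it becomes the <unk> id.
ids = encode(["i", "am", "a", "student"], word2id)
```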
  3. Add src_max and trg_max to cap sequence lengths and avoid running out of CUDA memory.
  • src_max : maximum length of a source sequence.
  • trg_max : maximum length of a target sequence.

Capping the lengths keeps very long input sequences from exhausting GPU memory, since each batch is padded to its longest sequence.
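
One way to apply such limits is to truncate each sequence before batching, as sketched below. Whether the repository truncates long sequences or skips them is not stated here, so treat the truncation and the helper name truncate_pair as assumptions.

```python
def truncate_pair(src_ids, trg_ids, src_max=None, trg_max=None):
    """Clip source and target id lists to the given maximum lengths (None = no limit)."""
    if src_max is not None:
        src_ids = src_ids[:src_max]
    if trg_max is not None:
        trg_ids = trg_ids[:trg_max]
    return src_ids, trg_ids


# A 10-token source and an 8-token target, clipped to 4 and 5 tokens.
src, trg = truncate_pair(list(range(10)), list(range(8)), src_max=4, trg_max=5)
```

Because a padded batch has shape (batch size × longest sequence), a single very long sentence inflates the whole batch; a cap bounds that cost.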


Prerequisites


Usage

1. Clone the repository

$ git clone https://github.com/graykode/enc2dec-dataloader.git
$ cd enc2dec-dataloader

2. Download the NLTK tokenizer

$ pip install nltk
$ python
>>> import nltk
>>> nltk.download('punkt')

3. Build the word2id dictionary

$ python build_vocab.py
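
For orientation, building a word2id dictionary usually amounts to counting tokens and assigning ids to the frequent ones. The sketch below shows that idea only; it is not the contents of build_vocab.py, and the special-token ids, the min_count parameter, and the function name are assumptions.

```python
from collections import Counter


def build_word2id(sentences, min_count=1):
    """Assign ids to tokens seen at least min_count times, after the special tokens."""
    counter = Counter(token for sentence in sentences for token in sentence)
    # Special tokens first (assumed layout).
    word2id = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for token, count in counter.most_common():
        if count >= min_count and token not in word2id:
            word2id[token] = len(word2id)
    return word2id


# Tokens seen fewer than twice are left out and would later map to <unk>.
word2id = build_word2id([["i", "am", "a", "student"], ["i", "am", "here"]], min_count=2)
```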

4. Check DataLoader

For usage, please see example.ipynb.
