MohammedAlsayed/encoder-decoder-ALFRED

Introduction

This repository implements a vanilla and an attention-based encoder-decoder to learn the ALFRED (Action Learning From Realistic Environments and Directives) dataset. The dataset contains sets of instructions given to a human to achieve a specific task, such as grabbing a cup or finding a mat. For each instruction it also includes the target object (mat, cup, knife, etc.) and the action performed (gotolocation, putobject, etc.).
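
For illustration, one training pair could look like the following (a hypothetical sample; the field names and exact strings are my assumptions, not the dataset's schema):

      # Hypothetical instruction/label pair (illustrative only).
      sample = {
          "instruction": "walk to the counter and pick up the cup",
          "action": "gotolocation",  # one of the 11 action classes used below
          "target": "cup",           # one of the 83 target classes used below
      }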

Implementation

  • Vanilla Encoder Decoder

      Encoder(
       (embedding): Embedding(1000, 300)
       (lstm): LSTM(300, 256, batch_first=True, dropout=0.33)
      )
      Decoder(
       (embedding_action): Embedding(11, 300, padding_idx=0)
       (embedding_target): Embedding(83, 300, padding_idx=0)
       (lstm): LSTM(300, 256, batch_first=True)
       (fc_action): Linear(in_features=256, out_features=11, bias=True)
       (fc_target): Linear(in_features=256, out_features=83, bias=True)
      )   
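
    The printout above maps to the following PyTorch sketch (the hyperparameters come from the printout; the forward logic is a minimal assumption of a typical seq2seq setup, not necessarily the repo's exact code):

      import torch
      import torch.nn as nn

      class Encoder(nn.Module):
          def __init__(self, vocab_size=1000, emb_dim=300, hidden=256):
              super().__init__()
              self.embedding = nn.Embedding(vocab_size, emb_dim)
              self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, dropout=0.33)

          def forward(self, tokens):
              # tokens: (batch, seq_len) token ids -> per-step encoder states
              return self.lstm(self.embedding(tokens))

      class Decoder(nn.Module):
          def __init__(self, n_actions=11, n_targets=83, emb_dim=300, hidden=256):
              super().__init__()
              self.embedding_action = nn.Embedding(n_actions, emb_dim, padding_idx=0)
              self.embedding_target = nn.Embedding(n_targets, emb_dim, padding_idx=0)
              self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
              self.fc_action = nn.Linear(hidden, n_actions)
              self.fc_target = nn.Linear(hidden, n_targets)

          def forward(self, prev_action, prev_target, state):
              # Summing the two label embeddings is an assumption about how
              # the decoder combines them into a single LSTM input.
              x = self.embedding_action(prev_action) + self.embedding_target(prev_target)
              out, state = self.lstm(x, state)
              return self.fc_action(out), self.fc_target(out), state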
    
  • Encoder Decoder with Global & Local Attention & GloVe

      Encoder(
       (embedding): Embedding(1000, 300)
       (lstm): LSTM(300, 256, batch_first=True, dropout=0.33)
      )
      AttentionDecoder(
       (embedding_action): Embedding(11, 256, padding_idx=0)
       (embedding_target): Embedding(83, 256, padding_idx=0)
       (lstm): LSTM(256, 256, batch_first=True)
       (dropout): Dropout(p=0.33, inplace=False)
       (a): Linear(in_features=512, out_features=1, bias=True)
       (fc_action): Linear(in_features=256, out_features=11, bias=True)
       (fc_target): Linear(in_features=256, out_features=83, bias=True)
      )
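
    The (a): Linear(in_features=512, out_features=1) layer suggests concat-style scoring over the concatenated [decoder state; encoder output]. A minimal sketch of one global-attention step under that reading (an assumption, not necessarily the repo's exact forward pass):

      import torch
      import torch.nn as nn

      def global_attention(dec_h, enc_out, a):
          # dec_h: (batch, 256) current decoder hidden state
          # enc_out: (batch, seq_len, 256) encoder outputs
          # a: the nn.Linear(512, 1) scoring layer from the printout
          seq_len = enc_out.size(1)
          dec_exp = dec_h.unsqueeze(1).expand(-1, seq_len, -1)  # (batch, seq_len, 256)
          scores = a(torch.cat([dec_exp, enc_out], dim=-1))     # (batch, seq_len, 1)
          weights = torch.softmax(scores, dim=1)                # distribution over tokens
          context = (weights * enc_out).sum(dim=1)              # (batch, 256)
          return context, weights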
    

Discussion

  • The vanilla encoder decoder didn't perform well in terms of accuracy: it never passed 0.03 on either the training or the validation set.

    Running: train.py

  • The encoder decoder with global attention performed much worse than the vanilla model, because the long instruction sequences made the attention harder to learn. As the plots below show, accuracy didn't exceed 0.008, far lower than the vanilla model. I used pack_padded_sequence, which packs each input up to the longest sequence in the batch (see the sketch after this item). One thing I noticed: with a batch size greater than one, performance degrades a lot, because episodes shorter than the longest one carry extra padding, which makes learning attention over the input sequence more difficult.

    Running: train.py --attention
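
    A minimal sketch of the pack/pad round trip (variable names are illustrative; masking padded positions out of the attention softmax is the fix implied above):

      import torch
      import torch.nn as nn
      from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

      lstm = nn.LSTM(300, 256, batch_first=True)
      embedded = torch.randn(4, 12, 300)     # batch of 4 episodes, padded to length 12
      lengths = torch.tensor([12, 9, 7, 5])  # true lengths before padding

      packed = pack_padded_sequence(embedded, lengths, batch_first=True,
                                    enforce_sorted=False)
      packed_out, _ = lstm(packed)
      enc_out, _ = pad_packed_sequence(packed_out, batch_first=True)
      # Positions past each episode's true length come back zero-padded;
      # excluding them from the attention softmax avoids attending to padding.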

  • (BONUS) To mitigate the previous issue I used local attention with a window of 2 to the right and left, which made a huge jump in performance, as the plots below show. Training accuracy passed 0.8 and validation accuracy was around 0.58, much better than the previous models (a sketch of the windowing follows this item). A better approach to consider is attending over the whole instruction at each decoding step.

    Running: train.py --attention --local_attention
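
    A minimal sketch of the window masking, assuming a monotonic alignment where decoding step t attends around source position t (a common choice; the repo may align differently):

      import torch

      def local_attention_weights(scores, t, window=2):
          # scores: (batch, seq_len) raw attention scores at decoding step t.
          # Keep only source positions within `window` of position t.
          seq_len = scores.size(1)
          keep = (torch.arange(seq_len) - t).abs() <= window
          scores = scores.masked_fill(~keep, float("-inf"))
          return torch.softmax(scores, dim=1)

      weights = local_attention_weights(torch.randn(4, 12), t=5)  # keeps positions 3..7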

  • (BONUS) To see how pre-trained word embeddings can contribute to downstream tasks, I used GloVe as the word embedding and froze its weights. I didn't notice a learning gain over the previous model (the freezing is sketched after this item).

    Running: train.py --attention --local_attention --glove
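
    Freezing a pre-trained embedding in PyTorch looks like the following (the GloVe matrix here is a placeholder; loading and vocabulary alignment are omitted):

      import torch
      import torch.nn as nn

      glove_vectors = torch.randn(1000, 300)  # stand-in for the real 300-d GloVe rows
      embedding = nn.Embedding.from_pretrained(glove_vectors, freeze=True,
                                               padding_idx=0)
      # freeze=True keeps the weights out of the optimizer's gradient updates.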

  • I implemented a Transformer encoder decoder but didn't have time to run it and compare it against the attention encoder decoder.
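
    For reference, the general shape of such a model in PyTorch (the hyperparameters here are assumptions chosen to mirror the LSTM models above; this is not the repo's implementation):

      import torch.nn as nn

      # Generic encoder-decoder Transformer with assumed hyperparameters.
      transformer = nn.Transformer(d_model=256, nhead=8,
                                   num_encoder_layers=2, num_decoder_layers=2,
                                   batch_first=True)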
