xlstm/mLSTM.ipynb (Line 66 in f0f54bf)
The training of traditional LSTMs benefits from initializing the forget gate bias to 1. It prevents the LSTM from forgetting until it has learned to do so, speeding up training.
It seems to me that the sLSTM is essentially the same as a traditional LSTM in this regard, so initializing its forget gate biases to 1 should speed up training as well. Don't take my word for it, though: test, don't trust.
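For reference, this is roughly how the trick looks for a stock PyTorch `nn.LSTM` (a minimal sketch of the general idea, not this repo's sLSTM code; the helper name is mine, and for a custom cell you would instead set whatever parameter plays the forget-gate bias role):

```python
import torch
import torch.nn as nn

def init_forget_gate_bias(lstm: nn.LSTM, value: float = 1.0) -> None:
    """Bias the forget gates of a (unidirectional) nn.LSTM towards 'remember'.

    nn.LSTM packs gates in the order [input, forget, cell, output], each slice
    of length hidden_size. The effective gate bias is bias_ih + bias_hh, so we
    put `value` in one vector and zero the other to get exactly `value`.
    """
    hs = lstm.hidden_size
    with torch.no_grad():
        for layer in range(lstm.num_layers):
            getattr(lstm, f"bias_ih_l{layer}")[hs:2 * hs].fill_(value)
            getattr(lstm, f"bias_hh_l{layer}")[hs:2 * hs].zero_()

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2)
init_forget_gate_bias(lstm, 1.0)
```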
I amend my comment: this is useful only if the sigmoid is used as the forget gate's activation function (one of the options proposed in the paper). If the exponential is used, the forget gate will already be close to 1 when the parameters are close to zero.
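A quick numeric check of that point, assuming the forget-gate pre-activation starts near 0 under a standard near-zero initialization:

```python
import torch

z = torch.tensor(0.0)  # forget-gate pre-activation at init
print(torch.sigmoid(z))        # 0.5  -> gate forgets half; biasing z towards 1 helps
print(torch.sigmoid(z + 1.0))  # ~0.73 -> closer to "remember"
print(torch.exp(z))            # 1.0  -> already a fully "remembering" gate, no bias trick needed
```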