This repository contains the models, dataset, helpers, and systems comparison for our paper on Arabic text diacritization:
"Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh and Mahmoud Al-Ayyoub, EMNLP-IJCNLP 2019.
- predict.py - General script that can be used to predict using any model in this repository
- sample_input - Sample input file
- extra_train.zip - Contains the extra training dataset used to train the models
- abandah - Abandah et al., 2015
- belinkov - Belinkov et al., 2015
- shakkala - Barqawi et al., 2017
Each of these folders contains the generated dataset, system output, and DER/WER statistics used to compare our system with each of the three systems above.
- constants (see the loading sketch after this list)
  - ARABIC_LETTERS_LIST.pickle - Contains a list of Arabic letters
  - DIACRITICS_LIST.pickle - Contains a list of all diacritics
  - FFNN_CLASSES_MAPPING.pickle - Contains a dictionary mapping each class to its unique integer (FFNN)
  - FFNN_REV_CLASSES_MAPPING.pickle - Contains a dictionary mapping each integer to its unique class (FFNN)
  - FFNN_SMALL_CHARACTERS_MAPPING.pickle - Contains a dictionary mapping each character to its unique integer (without using the extra training dataset for FFNN)
  - RNN_CLASSES_MAPPING.pickle - Contains a dictionary mapping each class to its unique integer (RNN)
  - RNN_REV_CLASSES_MAPPING.pickle - Contains a dictionary mapping each integer to its unique class (RNN)
  - RNN_SMALL_CHARACTERS_MAPPING.pickle - Contains a dictionary mapping each character to its unique integer (without using the extra training dataset for RNN)
  - RNN_BIG_CHARACTERS_MAPPING.pickle - Contains a dictionary mapping each character to its unique integer (using the extra training dataset for RNN)
- avg_checkpoints.py - Creates weight-averaged models from the last epochs' checkpoints of the training phase (see the averaging sketch after this list)
- build_confusion_matrix.py - Builds and plots a confusion matrix from the gold data and the predicted output
- build_der_figure.py - Restores and plots the diacritic error rate (DER) progress during training for each model from the Keras training log files
- plot_character_embeddings.py - Plots the character embeddings extracted from any epoch checkpoint using the t-SNE technique
- count_error_frequency.py - Counts the frequency of errors in each diacritized word
- prepare_feed_forward_data.py - Prepares the data for the FFNN models
- restore_model_accuracy_and_loss.py - Restores and plots the accuracy and loss values for the FFNN models from the Keras training log files
- optimizer.py - An implementation of the "Block-Normalized Gradient Method: An Empirical Study for Training Deep Neural Network" paper, copied from an external implementation
- ffnn_models - Contains all feed-forward neural network code, models and statistics
  - 1_basic_model - Contains the basic FFNN model's training and prediction code, model weights and DER/WER statistics
  - 2_100_hot_model - Contains the 100-hot FFNN model's training and prediction code, model weights and DER/WER statistics
  - 3_embeddings_model - Contains the embeddings FFNN model's training and prediction code, model weights and DER/WER statistics
- rnn_models - Contains all recurrent neural network code, models and statistics
  - 1_basic_model - Contains the basic RNN model's training code, model weights, averaged models and DER/WER statistics. The model was trained both with and without the extra training dataset
  - 2_crf_model - Contains the CRF-RNN model's training code, model weights, averaged models and DER/WER statistics. The model was trained both with and without the extra training dataset
  - 3_normalized_model - Contains the normalized RNN model's training code, model weights, averaged models and DER/WER statistics. The model was trained both with and without the extra training dataset
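The pickled constants listed under constants above can be loaded directly with Python's `pickle` module. A minimal loading sketch (the variable names are illustrative):

```python
# Minimal sketch: loading the pickled constants from the constants folder.
import pickle

with open('constants/ARABIC_LETTERS_LIST.pickle', 'rb') as f:
    arabic_letters = pickle.load(f)    # list of Arabic letters

with open('constants/RNN_CLASSES_MAPPING.pickle', 'rb') as f:
    classes_mapping = pickle.load(f)   # dict: diacritic class -> unique integer
```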
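The averaging done by avg_checkpoints.py boils down to an element-wise mean over the weight tensors of the last few checkpoints. A minimal sketch of the idea, assuming whole-model Keras checkpoints with identical architectures (the file names are hypothetical):

```python
# Minimal sketch of checkpoint weight averaging, assuming whole-model
# Keras checkpoints; the file names below are hypothetical.
import numpy as np
from keras.models import load_model

checkpoint_paths = ['epoch_048.h5', 'epoch_049.h5', 'epoch_050.h5']
models = [load_model(path) for path in checkpoint_paths]

# Element-wise mean of every weight tensor across the checkpoints.
weight_lists = [model.get_weights() for model in models]
averaged = [np.mean(tensors, axis=0) for tensors in zip(*weight_lists)]

averaged_model = models[0]
averaged_model.set_weights(averaged)
averaged_model.save('averaged_model.h5')
```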
- Tested with Python 3.6.8
- Install the required packages listed in the `requirements.txt` file:

```
pip install -r requirements.txt
```
To predict diacritized text using any model provided in this repository, the `predict.py` script can be used. Example:

```
python predict.py --input-file-path sample_input \
                  --model-type rnn \
                  --model-number 3 \
                  --model-size small \
                  --model-average 20 \
                  --output-file-path sample_output
```
The previous command diacritizes the text inside the `sample_input` file using RNN model number `3`, trained on the `small` dataset (i.e., without the extra training dataset), after averaging the last `20` epochs' checkpoints, and writes the diacritized text to `sample_output`.
The allowed options are:

- `--model-type`: ffnn, rnn
- `--model-number`:
  - ffnn: 1, 2, 3
  - rnn: 1, 2, 3
- `--model-size`: small, big
- `--model-average`:
  - rnn: 1, 5, 10, 20
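For instance, a hypothetical FFNN invocation mirroring the RNN example above (`--model-average` applies only to the RNN models):

```
python predict.py --input-file-path sample_input \
                  --model-type ffnn \
                  --model-number 3 \
                  --model-size small \
                  --output-file-path sample_output
```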
Before training any FFNN model, you need to prepare the dataset using the `prepare_feed_forward_data.py` script. After that, to train any FFNN model you can use the `model.ipynb` notebooks that exist under `models/ffnn_models/*/`.

There is no need to prepare any data to train the RNN models; to train any RNN model you can use the `model.ipynb` notebooks that exist under `models/rnn_models/*/`.
Note that the RNN models use `CuDNNLSTM` layers, which must run on a GPU. To train the models or predict from them using only a CPU, you can use regular `LSTM` layers instead. Moreover, all RNN model checkpoints under `models/rnn_models/*/*/` use `CuDNNLSTM` layers, so those checkpoints must be loaded on a GPU; however, under `models/rnn_models/*/*/lstm/` you can find the same checkpoints with the same weights and structure, but with regular `LSTM` layers used instead of `CuDNNLSTM` layers.
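For example, a minimal sketch of loading one of the `LSTM`-variant checkpoints on a CPU-only machine (the checkpoint filename is hypothetical, and whole-model checkpoint files are assumed):

```python
# Minimal sketch: loading an LSTM-variant checkpoint on CPU with Keras.
# The checkpoint filename is hypothetical; a whole-model file is assumed.
from keras.models import load_model

# The lstm/ copies use regular LSTM layers with the same weights as the
# CuDNNLSTM checkpoints, so no GPU is required to load or run them.
model = load_model('models/rnn_models/1_basic_model/small/lstm/checkpoint.h5')
model.summary()
```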
*Figure: Basic RNN model structure*
All results reported below are on the test set of (Fadel et al., 2019) (best results shown in bold).

There are three feed-forward neural network models; the following table shows the results of each:
| DER/WER | With case ending (including no diacritic) | Without case ending (including no diacritic) | With case ending (excluding no diacritic) | Without case ending (excluding no diacritic) |
|---|---|---|---|---|
| Basic Model | 9.33%/25.93% | 6.58%/13.89% | 10.85%/25.39% | 7.51%/13.53% |
| 100-Hot Model | 6.57%/20.21% | 4.83%/11.14% | 7.75%/19.83% | 5.62%/10.93% |
| Embeddings Model | **5.52%/17.12%** | **4.06%/9.38%** | **6.44%/16.63%** | **4.67%/9.10%** |

*DER/WER statistics for FFNN models*
There are three recurrent neural network models; each was trained twice, with and without the extra training dataset. The following tables show the results of each:
| DER/WER | With case ending (including no diacritic) | Without case ending (including no diacritic) | With case ending (excluding no diacritic) | Without case ending (excluding no diacritic) |
|---|---|---|---|---|
| Basic Model | 2.68%/7.91% | 2.19%/4.79% | 3.09%/7.61% | 2.51%/4.66% |
| CRF Model | 2.67%/7.73% | 2.19%/4.69% | 3.08%/7.46% | 2.52%/4.60% |
| Normalized Model | **2.60%/7.69%** | **2.11%/4.57%** | **3.00%/7.39%** | **2.42%/4.44%** |

*DER/WER statistics for RNN models without training on the extra training dataset*
| DER/WER | With case ending (including no diacritic) | Without case ending (including no diacritic) | With case ending (excluding no diacritic) | Without case ending (excluding no diacritic) |
|---|---|---|---|---|
| Basic Model | 1.72%/5.16% | 1.37%/2.98% | 1.99%/4.96% | 1.59%/2.92% |
| CRF Model | 1.84%/5.42% | 1.47%/3.17% | 2.13%/5.22% | 1.69%/3.09% |
| Normalized Model | **1.69%/5.09%** | **1.34%/2.91%** | **1.95%/4.89%** | **1.54%/2.83%** |

*DER/WER statistics for RNN models with training on the extra training dataset*
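As a rough illustration of what these metrics measure (a simplified sketch, not the repository's evaluation code; it assumes the gold and predicted diacritics are already aligned letter by letter):

```python
# Simplified sketch of DER/WER: DER is the percentage of letters carrying a
# wrong diacritic, WER the percentage of words with at least one such error.
def der_wer(gold_words, pred_words):
    # gold_words/pred_words: lists of words, each word a list of diacritic
    # labels (one label per Arabic letter), aligned position by position.
    letter_errors = letter_total = word_errors = 0
    for gold, pred in zip(gold_words, pred_words):
        errors = sum(g != p for g, p in zip(gold, pred))
        letter_errors += errors
        letter_total += len(gold)
        word_errors += errors > 0
    return (100.0 * letter_errors / letter_total,
            100.0 * word_errors / len(gold_words))

# Example: two words, one wrong diacritic in the second word.
print(der_wer([['a', 'u'], ['i']], [['a', 'u'], ['o']]))  # ~(33.33, 50.0)
```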
The following figure shows the validation DER of each model during training, reported every 5 epochs.
*Figure: RNN models validation DER while training*
Note: All code in this repository was tested on Ubuntu 18.04.
The project is available as open source under the terms of the MIT License.