This directory contains the code developed for the SIGTYP 2020 Shared Task, which involves predicting typological properties of languages given a handful of observed features. The typological features are taken from the World Atlas of Language Structures (WALS).
Technical details can be found in:
Alexander Gutkin and Richard Sproat: "NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task". Proceedings of the Second Workshop on Computational Research in Linguistic Typology (SIGTYP 2020) at EMNLP 2020, pp. 17–28, Association for Computational Linguistics (ACL), Online, November, 2020 (preprint).
Please check requirements.txt for a list of basic dependencies for most of the tools in this directory.
The following steps assume that all commands are run from the current directory, where all the Python code resides.
The original shared task data is distributed in a GitHub repository. Download it to a local directory (${WORK_DIR} in the examples below):
WORK_DIR=/tmp/workspace
mkdir -p ${WORK_DIR}/sigtyp
git clone https://github.com/sigtyp/ST2020 ${WORK_DIR}/sigtyp
Internally, the pipeline uses a CSV format that differs from the original distribution. In particular, each WALS feature has its own dedicated column. To generate data in the internal format, run sigtyp_reader_main.py:
mkdir -p ${WORK_DIR}/internal_data
python3 sigtyp_reader_main.py \
--sigtyp_dir ${WORK_DIR}/sigtyp/data \
--output_dir ${WORK_DIR}/internal_data
The above command generates several files in the ${WORK_DIR}/internal_data directory:
- CSV files containing various combinations of training, development and test data.
- Compressed dictionaries in JSON format that contain miscellaneous WALS feature information.
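To illustrate what "each WALS feature has its own dedicated column" means, here is a minimal sketch of such a conversion. It is not the code of sigtyp_reader_main.py. It assumes (this README does not confirm it) that the original file is tab-separated and packs all features of a language into a single `features` column as `Name=Value|Name=Value`. The sample rows, the column names and the helper `to_wide` are hypothetical.

```python
import csv
import io

# Assumption: tab-separated input with one packed "features" column.
# The two rows below are toy data, not taken from the shared task files.
SAMPLE_TSV = (
    "wals_code\tname\tfamily\tfeatures\n"
    "aaa\tLanguage A\tFamily X\t"
    "Order_of_Subject,_Object_and_Verb=1 SOV|"
    "Order_of_Adjective_and_Noun=2 Noun-Adjective\n"
    "bbb\tLanguage B\tFamily Y\t"
    "Order_of_Subject,_Object_and_Verb=2 SVO\n"
)


def to_wide(rows, packed_column="features"):
    """Expands the packed feature column into one column per feature."""
    rows = list(rows)
    feature_names = []  # Keeps first-seen order of feature names.
    parsed = []
    for row in rows:
        features = {}
        for item in filter(None, row[packed_column].split("|")):
            name, _, value = item.partition("=")
            features[name] = value
            if name not in feature_names:
                feature_names.append(name)
        parsed.append(features)

    wide = []
    for row, features in zip(rows, parsed):
        out = {k: v for k, v in row.items() if k != packed_column}
        for name in feature_names:
            out[name] = features.get(name, "")  # Empty cell: not observed.
        wide.append(out)
    return wide


reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
wide_rows = to_wide(reader)
for wide_row in wide_rows:
    print(wide_row)
```

Every language ends up with the same set of columns, and features that were not observed for a language are left empty.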
Compute the feature associations (such as implicational universals) using the compute_associations_main.py tool:
mkdir -p ${WORK_DIR}/associations/train
python3 compute_associations_main.py \
--training_data ${WORK_DIR}/internal_data/train.csv \
--dev_data ${WORK_DIR}/internal_data/dev.csv \
--association_dir ${WORK_DIR}/associations/train
The above generates several feature association files under ${WORK_DIR}/associations/train:
- raw_proportions_by_family.tsv: Maximum likelihood estimates (MLE) for language families.
- raw_proportions_by_genus.tsv: Maximum likelihood estimates for language genera.
- raw_proportions_by_neighborhood.tsv: Maximum likelihood estimates based on geographic neighborhood.
- implicational_universals.tsv: Implicational universals.
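As a rough illustration of the kind of statistics involved, the sketch below computes relative frequencies of feature values within a group (the maximum likelihood estimate of a categorical distribution) and a conditional proportion that can serve as the strength of an implication between two feature values. It is not the implementation of compute_associations_main.py, whose exact statistics and file layout are not described here. The rows, column names and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy rows in a wide layout (one column per feature). Column names and
# values are hypothetical and only illustrate the computation.
ROWS = [
    {"family": "X", "VO": "OV", "Adposition": "Postpositions"},
    {"family": "X", "VO": "OV", "Adposition": "Postpositions"},
    {"family": "X", "VO": "VO", "Adposition": "Prepositions"},
    {"family": "Y", "VO": "VO", "Adposition": "Prepositions"},
    {"family": "Y", "VO": "OV", "Adposition": "Prepositions"},
]


def proportions_by_group(rows, group_column, feature):
    """Relative frequency of each value of `feature` within each group."""
    counts = defaultdict(Counter)
    for row in rows:
        value = row.get(feature, "")
        if value:  # Skip languages where the feature is not observed.
            counts[row[group_column]][value] += 1
    return {
        group: {v: n / sum(c.values()) for v, n in c.items()}
        for group, c in counts.items()
    }


def implication_strength(rows, antecedent, consequent):
    """Estimates P(consequent | antecedent) for (feature, value) pairs."""
    (f1, v1), (f2, v2) = antecedent, consequent
    # Only languages that match the antecedent and have the consequent
    # feature observed count as evidence.
    support = [r for r in rows if r.get(f1) == v1 and r.get(f2)]
    if not support:
        return None
    return sum(r[f2] == v2 for r in support) / len(support)


by_family = proportions_by_group(ROWS, "family", "VO")
strength = implication_strength(
    ROWS, ("VO", "OV"), ("Adposition", "Postpositions"))
print(by_family)
print(strength)
```

Grouping by genus works the same way with a different `group_column`. A neighborhood-based estimate would additionally need a distance computation over language coordinates, which this sketch omits.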
The following command uses the general-purpose evaluation tool evaluate_main.py to evaluate random forest models, trained on the training data, against the development set:
python3 evaluate_main.py \
--sigtyp_dir ${WORK_DIR}/sigtyp/data \
--training_data_dir ${WORK_DIR}/internal_data \
--train_set_name train --test_set_name dev \
--association_dir ${WORK_DIR}/associations/train \
--algorithm NemoModel \
--num_workers 10 \
--force_classifier RandomForest
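For intuition about what evaluating on the development set amounts to, here is a minimal sketch that scores predictions against gold values, overall and per feature. It assumes plain accuracy as the metric and counts a missing prediction as an error. The metric actually reported by evaluate_main.py is not specified in this README, and the data and function names below are hypothetical.

```python
from collections import defaultdict

SOV = "Order_of_Subject,_Object_and_Verb"
ADJ = "Order_of_Adjective_and_Noun"

# Hypothetical gold and predicted values keyed by (language code, feature).
GOLD = {
    ("aaa", SOV): "1 SOV",
    ("bbb", SOV): "2 SVO",
    ("aaa", ADJ): "2 Noun-Adjective",
    ("bbb", ADJ): "1 Adjective-Noun",
}
PREDICTED = {
    ("aaa", SOV): "1 SOV",
    ("bbb", SOV): "1 SOV",  # Wrong prediction.
    ("aaa", ADJ): "2 Noun-Adjective",
    ("bbb", ADJ): "1 Adjective-Noun",
}


def accuracy(gold, predicted):
    """Fraction of gold cells predicted correctly (missing counts as wrong)."""
    if not gold:
        return 0.0
    correct = sum(predicted.get(key) == value for key, value in gold.items())
    return correct / len(gold)


def accuracy_by_feature(gold, predicted):
    """Accuracy computed separately for each feature."""
    per_feature = defaultdict(dict)
    for (language, feature), value in gold.items():
        per_feature[feature][(language, feature)] = value
    return {f: accuracy(cells, predicted) for f, cells in per_feature.items()}


overall = accuracy(GOLD, PREDICTED)
by_feature = accuracy_by_feature(GOLD, PREDICTED)
print(overall)
print(by_feature)
```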
Alternatively, individual features or groups of features can be evaluated with scikit_classifier_main.py, a specialized tool that operates on sklearn-based classifiers:
python3 scikit_classifier_main.py \
--training_data_file ${WORK_DIR}/internal_data/train.csv \
--dev_data_file ${WORK_DIR}/internal_data/dev.csv \
--data_info_file ${WORK_DIR}/internal_data/data_info_train_dev.json.gz \
--association_dir ${WORK_DIR}/associations/train \
--classifiers=SVM,DNN,LogisticRegression,AdaBoost,RandomForest \
--target_feature Order_of_Subject,_Object_and_Verb \
--nocatch_exceptions
The above evaluates the classifiers listed in --classifiers, each trained for the Order_of_Subject,_Object_and_Verb WALS feature.
To predict unknown (missing) WALS features for the test data, where they are marked with ?, first generate the associations for the combined training and development set using compute_associations_main.py:
mkdir -p ${WORK_DIR}/associations/train_dev
python3 compute_associations_main.py \
--training_data ${WORK_DIR}/internal_data/train_dev.csv \
--dev_data ${WORK_DIR}/internal_data/test_blinded.csv \
--association_dir ${WORK_DIR}/associations/train_dev
Then run prediction by enabling the prediction mode of evaluate_main.py as follows:
python3 evaluate_main.py \
--sigtyp_dir ${WORK_DIR}/sigtyp/data \
--training_data_dir ${WORK_DIR}/internal_data \
--train_set_name train_dev \
--test_set_name test_blinded \
--association_dir ${WORK_DIR}/associations/train_dev \
--algorithm NemoModel \
--num_workers 10 \
--force_classifier RandomForest \
--prediction_mode \
--output_sigtyp_predictions_file ${WORK_DIR}/test_results.csv
This fills in the missing WALS features and writes the result, in the original SIGTYP 2020 CSV format, to ${WORK_DIR}/test_results.csv.
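The fill-in step can be pictured with the following minimal sketch, which replaces ? placeholders with predicted values. It is not the code path used by evaluate_main.py. It assumes (this README does not confirm it) that the original format packs features as `Name=Value|Name=Value`. The helper `fill_missing` and the example values are hypothetical.

```python
def fill_missing(packed_features, predictions, missing_marker="?"):
    """Replaces missing values in a packed feature string with predictions.

    Features without a prediction keep the missing marker, and observed
    values are never overwritten.
    """
    filled = []
    for item in packed_features.split("|"):
        name, _, value = item.partition("=")
        if value == missing_marker and name in predictions:
            value = predictions[name]
        filled.append(f"{name}={value}")
    return "|".join(filled)


# Hypothetical blinded test row and model output.
blinded = (
    "Order_of_Subject,_Object_and_Verb=?|"
    "Order_of_Adjective_and_Noun=2 Noun-Adjective"
)
predicted = {"Order_of_Subject,_Object_and_Verb": "1 SOV"}
print(fill_missing(blinded, predicted))
```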