This repo shows how to train a BERT model on the Kaggle Jigsaw Unintended Bias in Toxicity Classification competition.
Star the repo and I will keep updating the code.
The code is modified from Google's open-source BERT code; thanks to Jon Mischo for the advice here.
- 2019-04-06: 0.91216
- 2019-04-07: 0.91455 (added a text cleaning method; reference here)
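The exact cleaning method is the one referenced above and is not reproduced here. As a rough illustration only, a common Kaggle-style cleaning step for this competition isolates punctuation and normalizes curly quotes before tokenization; everything below is an assumption, not the repo's code, and whether it actually helps BERT's WordPiece tokenizer is worth checking.

```python
# Illustrative text cleaning, NOT the repo's actual method (see the reference above).
import re

PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

def clean_text(text):
    text = str(text)
    # map a few common curly quotes to ASCII (assumption: these occur in the data)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # put a space around each punctuation mark so it becomes its own token
    for p in PUNCT:
        text = text.replace(p, " " + p + " ")
    # collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Don\u2019t be toxic!!!"))  # -> "Don ' t be toxic ! ! !"
```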
- download the pretrained BERT model (uncased_L-12_H-768_A-12, as used in the command below)
- download the competition data and unzip it into the input folder
- split the train and dev data (for convenience I just typed the command below; this split is not a recommended one)
```
cat train.csv | tail -n 1000 > dev_1000.csv
```
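Note that the one-liner above only copies the last 1000 rows into dev_1000.csv (without a header, and without removing them from train.csv). As an alternative, here is a minimal sketch of a random split with pandas; the output file name is made up for illustration, and you may need to adjust headers and columns to whatever the repo's csv handler expects:

```python
# Sketch of a random train/dev split with pandas (assumes the standard Kaggle
# train.csv is in input/); the repo itself uses the tail one-liner above instead.
import pandas as pd

df = pd.read_csv("input/train.csv")

# hold out ~1000 random rows for dev, keep the rest for training
dev = df.sample(n=1000, random_state=42)
train = df.drop(dev.index)

train.to_csv("input/train_split.csv", index=False)  # hypothetical file name
dev.to_csv("input/dev_1000.csv", index=False)
print(len(train), len(dev))
```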
- run run_classifier.py
```
python run_classifier.py \
  --data_dir=input/ --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt \
  --task_name=toxic \
  --do_train=True \
  --do_eval=True \
  --do_predict=True \
  --output_dir=model_output/
```
- the model will train for 10 epochs, but you can stop it early depending on how much time you have
- the checkpoints will be saved in model_output, along with the predictions on the test data (see model_output/test_result.tsv)
- run encode.py to build the submission file (a rough sketch of this step appears after this list)
- upload output/sub.csv to Kaggle
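The repo's encode.py is the script that turns the raw predictions into output/sub.csv; its actual contents are not shown here. A minimal sketch of such a step, assuming model_output/test_result.tsv holds tab-separated class probabilities in the same row order as input/test.csv and that the second column is the toxic class:

```python
# Rough sketch of building the submission file; the repo's real encode.py may differ.
# Assumes test_result.tsv has tab-separated probabilities (non-toxic, toxic), one
# row per test example, in the same order as input/test.csv.
import os
import pandas as pd

test = pd.read_csv("input/test.csv")
probs = pd.read_csv("model_output/test_result.tsv", sep="\t", header=None)

sub = pd.DataFrame({
    "id": test["id"],
    "prediction": probs.iloc[:, 1],  # probability of the toxic class (assumed column order)
})

os.makedirs("output", exist_ok=True)
sub.to_csv("output/sub.csv", index=False)
```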
- added a csv handler (line 243 in run_classifier.py)
- added a ToxicProcessor (line 264 in run_classifier.py)
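The actual ToxicProcessor lives at the line number above and is not reproduced here. As an illustrative sketch only, a processor for this task following the DataProcessor/InputExample interface from Google's run_classifier.py might look like the following; column names, header handling, and the 0.5 binarization of the fractional toxicity target are assumptions:

```python
# Illustrative sketch only; the repo's real ToxicProcessor (line 264) may differ.
# DataProcessor and InputExample are the classes defined in Google's run_classifier.py.
import csv
import os

class ToxicProcessor(DataProcessor):
  def get_train_examples(self, data_dir):
    return self._create_examples(os.path.join(data_dir, "train.csv"), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(os.path.join(data_dir, "dev_1000.csv"), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(os.path.join(data_dir, "test.csv"), "test")

  def get_labels(self):
    return ["0", "1"]

  def _create_examples(self, path, set_type):
    examples = []
    with open(path, encoding="utf-8") as f:
      for i, row in enumerate(csv.DictReader(f)):
        guid = "%s-%d" % (set_type, i)
        text = row["comment_text"]
        if set_type == "test":
          label = "0"  # dummy label; only needed to satisfy the interface at predict time
        else:
          # binarize the competition's fractional target at 0.5 (assumption)
          label = "1" if float(row["target"]) >= 0.5 else "0"
        examples.append(InputExample(guid=guid, text_a=text, text_b=None, label=label))
    return examples
```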
- text cleaning and OOV handling
- cross-validation (CV)
- averaging the predictions from different checkpoints
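For the last item, a minimal sketch of averaging is shown below, assuming each checkpoint has already been run through prediction and encode.py so that there is one id,prediction csv per checkpoint (the file names are made up):

```python
# Sketch of averaging submission files produced by different checkpoints.
# The file names below are hypothetical; point them at your own per-checkpoint outputs.
import pandas as pd

files = [
    "output/sub_ckpt_10000.csv",
    "output/sub_ckpt_20000.csv",
    "output/sub_ckpt_30000.csv",
]

subs = [pd.read_csv(f) for f in files]
avg = subs[0][["id"]].copy()
avg["prediction"] = sum(s["prediction"] for s in subs) / len(subs)
avg.to_csv("output/sub_avg.csv", index=False)
```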