Commit

Add Vietnamese intent detection and slot filling into task list (sebastianruder#570)

* update vietnamese NER result

* update PhoNER_COVID19 + PhoATIS

* update PhoNER_COVID19 + PhoATIS

* update PhoNER_COVID19 + PhoATIS

* add intent detection and slot filling section for Vietnamese in README.md, reformat

* update intent detection section for Vietnamese in README.md/reformat layout

* update intent detection section for Vietnamese in README.md/reformat layout

* reformat README.md

Co-authored-by: Thinh Truong <[email protected]>
joey234 and Thinh Truong authored Aug 13, 2021
1 parent 3b1a6dc commit d3d317c
Showing 2 changed files with 28 additions and 27 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -46,6 +46,7 @@
### Vietnamese

- [Dependency parsing](vietnamese/vietnamese.md#dependency-parsing)
- [Intent detection and Slot filling](vietnamese/vietnamese.md#intent-detection-and-slot-filling)
- [Machine translation](vietnamese/vietnamese.md#machine-translation)
- [Named entity recognition](vietnamese/vietnamese.md#named-entity-recognition)
- [Part-of-speech tagging](vietnamese/vietnamese.md#part-of-speech-tagging)
54 changes: 27 additions & 27 deletions vietnamese/vietnamese.md
@@ -32,6 +32,18 @@

* Results for the BIST graph/transition-based parsers, MSTparser and MaltParser are reported in "[An empirical study for Vietnamese dependency parsing](http://www.aclweb.org/anthology/U16-1017)."

## Intent detection and Slot filling
### [PhoATIS](https://github.com/VinAIResearch/JointIDSF)
* The first dataset for intent detection and slot filling for Vietnamese, based on the common ATIS benchmark in the flight booking domain. Data is localized (e.g. replacing slot values with Vietnamese-specific entities) to fit the context of flight booking in Vietnam.
* Training set: 4478 sentences
* Development set: 500 sentences
* Test set: 893 sentences

| Model | Intent Acc. | Slot F1 | Sentence Acc. | Paper | Code | Note |
| ------------- | :-----:| --- |--- |--- | --- | --- |
| JointIDSF (2021) | 97.62 | 94.98 | 86.25 | [Intent Detection and Slot Filling for Vietnamese](https://arxiv.org/abs/2104.02021) | [Official](https://github.com/VinAIResearch/JointIDSF) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |
| JointBERT (2019) with PhoBERT encoder | 97.40 | 94.75 | 85.55 | [Intent Detection and Slot Filling for Vietnamese](https://arxiv.org/abs/2104.02021) | [Official](https://github.com/VinAIResearch/JointIDSF) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |

## Machine translation

### English-Vietnamese translation
@@ -72,6 +84,21 @@


## Named entity recognition
### [PhoNER_COVID19](https://github.com/VinAIResearch/PhoNER_COVID19)
* A named entity recognition dataset for Vietnamese with 10 newly defined entity types in the context of the COVID-19 pandemic. Data is extracted from news articles and manually annotated. In total, there are 34,984 entities over 10,027 sentences.
* Training set: 5027 sentences
* Development set: 2000 sentences
* Test set: 3000 sentences

| Model | F1 | Paper | Code | Note |
| ------------- | :-----:| --- | --- | --- |
| PhoBERT-large (2020) | 94.5 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) |
| PhoBERT-base (2020) | 94.2 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) |
| XLM-R-large (2019) | 93.8 | [Unsupervised Cross-lingual Representation Learning at Scale](https://aclanthology.org/2020.acl-main.747/) | [Official](https://github.com/facebookresearch/XLM) |
| XLM-R-base (2019) | 92.5 | [Unsupervised Cross-lingual Representation Learning at Scale](https://aclanthology.org/2020.acl-main.747/) | [Official](https://github.com/facebookresearch/XLM) |
| BiLSTM-CRF + CNN-char (2016) + Word Segmentation | 91 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |
| BiLSTM-CRF + CNN-char (2016) | 90.6 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | No word segmentation |

### VLSP
* 16,861 sentences for training and development from the VLSP 2016 NER shared task:
* 14,861 sentences are used for training.
@@ -99,21 +126,6 @@
* [1] denotes that scores are reported in "[ETNLP: a visual-aided systematic approach to select pre-trained embeddings for a downstream task](https://arxiv.org/abs/1903.04433)"
* [2] denotes that BiLSTM-CRF-based scores are reported in "[VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012)"

### [PhoNER_COVID19](https://github.com/VinAIResearch/PhoNER_COVID19)
* A named entity recognition dataset for Vietnamese with 10 newly defined entity types in the context of the COVID-19 pandemic. Data is extracted from news articles and manually annotated. In total, there are 34,984 entities over 10,027 sentences.
* Training set: 5027 sentences
* Development set: 2000 sentences
* Test set: 3000 sentences

| Model | F1 | Paper | Code | Note |
| ------------- | :-----:| --- | --- | --- |
| PhoBERT-large (2020) | 94.5 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) |
| PhoBERT-base (2020) | 94.2 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) |
| XLM-R-large (2019) | 93.8 | [Unsupervised Cross-lingual Representation Learning at Scale](https://aclanthology.org/2020.acl-main.747/) | [Official](https://github.com/facebookresearch/XLM) |
| XLM-R-base (2019) | 92.5 | [Unsupervised Cross-lingual Representation Learning at Scale](https://aclanthology.org/2020.acl-main.747/) | [Official](https://github.com/facebookresearch/XLM) |
| BiLSTM-CRF + CNN-char (2016) + Word Segmentation | 91 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |
| BiLSTM-CRF + CNN-char (2016) | 90.6 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | No word segmentation |


## Part-of-speech tagging

@@ -155,15 +167,3 @@
| DongDu (2012) | 96.90 | [Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt](https://tiengvietmenyeu.wordpress.com/2013/02/16/ung%C2%B7dung-phuong%C2%B7phap-pointwise-vao-bai%C2%B7toan-tach-tu-cho-tieng%C2%B7viet/) | |

* Results for VnTokenizer, JVnSegmenter and DongDu are reported in "[A hybrid approach to Vietnamese word segmentation](http://doi.org/10.1109/RIVF.2016.7800279)."

## Intent Detection and Slot Filling
### [PhoATIS](https://github.com/VinAIResearch/JointIDSF)
* The first dataset for intent detection and slot filling for Vietnamese, based on the common ATIS benchmark in the flight booking domain. Data is localized (e.g. replacing slot values with Vietnamese-specific entities) to fit the context of flight booking in Vietnam.
* Training set: 4478 sentences
* Development set: 500 sentences
* Test set: 893 sentences

| Model | Intent Acc. | Slot F1 | Sentence Acc. | Paper | Code | Note |
| ------------- | :-----:| --- |--- |--- | --- | --- |
| JointIDSF (2021) | 97.62 | 94.98 | 86.25 | [Intent Detection and Slot Filling for Vietnamese](https://arxiv.org/abs/2104.02021) | [Official](https://github.com/VinAIResearch/JointIDSF) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |
| JointBERT (2019) with PhoBERT encoder | 97.40 | 94.75 | 85.55 | [Intent Detection and Slot Filling for Vietnamese](https://arxiv.org/abs/2104.02021) | [Official](https://github.com/VinAIResearch/JointIDSF) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |
