Traditional Chinese segmentation is supported now
hankcs committed Jan 10, 2020
1 parent 05d0a0a commit c61c0db
Showing 4 changed files with 13 additions and 11 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -44,9 +44,11 @@ However, you can predict much faster. In the era of deep learning, batched compu

```python
>>> tokenizer(['萨哈夫说,伊拉克将同联合国销毁伊拉克大规模杀伤性武器特别委员会继续保持合作。',
-               '上海华安工业(集团)公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。'])
-[['萨哈夫', '说', ',', '伊拉克', '将', '同', '联合国', '销毁', '伊拉克', '大规模', '杀伤性', '武器', '特别', '委员会', '继续', '保持', '合作', '。'],
- ['上海', '华安', '工业', '(', '集团', ')', '公司', '董事长', '谭旭光', '和', '秘书', '张晚霞', '来到', '美国', '纽约', '现代', '艺术', '博物馆', '参观', '。']]
+               '上海华安工业(集团)公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。',
+               'HanLP支援臺灣正體、香港繁體,具有新詞辨識能力的中文斷詞系統'])
+[['萨哈夫', '说', ',', '伊拉克', '将', '同', '联合国', '销毁', '伊拉克', '大', '规模', '杀伤性', '武器', '特别', '委员会', '继续', '保持', '合作', '。'],
+ ['上海', '华安', '工业', '(', '集团', ')', '公司', '董事长', '谭旭光', '和', '秘书', '张晚霞', '来到', '美国', '纽约', '现代', '艺术', '博物馆', '参观', '。'],
+ ['HanLP', '支援', '臺灣', '正體', '、', '香港', '繁體', ',', '具有', '新詞', '辨識', '能力', '的', '中文', '斷詞', '系統']]
```

That's it! You're now ready to employ the latest DL models from HanLP in your research and work. Here are some tips if you want to go further.
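For readers new to Chinese word segmentation, the list-of-tokens output above can be illustrated with a toy forward-maximum-matching baseline. This is a sketch for intuition only (the function name, vocabulary, and `max_len` are invented here); it is not the neural model HanLP ships:

```python
def fmm_segment(text, vocab, max_len=4):
    """Toy forward-maximum-matching segmenter: at each position, greedily
    take the longest dictionary word; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in vocab:
                tokens.append(word)
                i += length
                break
    return tokens

vocab = {'商品', '服务'}
print(fmm_segment('商品和服务', vocab))  # → ['商品', '和', '服务']
```

Unlike this greedy baseline, a statistical or neural model can resolve ambiguities such as 和服 ("kimono") vs. 和/服务 ("and"/"service") from context.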
10 changes: 5 additions & 5 deletions hanlp/pretrained/cws.py
@@ -3,12 +3,12 @@
# Date: 2019-12-28 21:12
from hanlp.common.constant import HANLP_URL

-SIGHAN2005_PKU_CONVSEG = HANLP_URL + 'cws/sighan2005-pku-convseg_20191229_035326.zip'
-SIGHAN2005_MSR_CONVSEG = HANLP_URL + 'cws/sighan2005-msr-convseg_20191229_014345.zip'
-SIGHAN2005_MSR_BERT_BASE = HANLP_URL + 'cws/cws_bert_base_msra_20191230_194627.zip'
+SIGHAN2005_PKU_CONVSEG = HANLP_URL + 'cws/sighan2005-pku-convseg_20200110_153722.zip'
+SIGHAN2005_MSR_CONVSEG = HANLP_URL + 'cws/convseg-msr-nocrf-noembed_20200110_153524.zip'
+# SIGHAN2005_MSR_BERT_BASE = HANLP_URL + 'cws/cws_bert_base_msra_20191230_194627.zip'
CTB6_CONVSEG = HANLP_URL + 'cws/ctb6_convseg_nowe_nocrf_20200110_004046.zip'
-CTB6_BERT_BASE = HANLP_URL + 'cws/cws_bert_base_ctb6_20191230_185536.zip'
-PKU_NAME_MERGED_SIX_MONTHS_CONVSEG = HANLP_URL + 'cws/pku98_6m_conv_ngram_20200103_232809.zip'
+# CTB6_BERT_BASE = HANLP_URL + 'cws/cws_bert_base_ctb6_20191230_185536.zip'
+PKU_NAME_MERGED_SIX_MONTHS_CONVSEG = HANLP_URL + 'cws/pku98_6m_conv_ngram_20200110_134736.zip'

# Will be filled up during runtime
ALL = {}
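The comment says the `ALL` registry is filled at runtime. One plausible mechanism (a sketch under stated assumptions, not necessarily HanLP's actual code; `collect_pretrained` and the placeholder URL are invented here) is to sweep the module's uppercase string constants into a name-to-URL mapping at import time:

```python
def collect_pretrained(namespace):
    """Collect uppercase module-level string constants into a registry,
    skipping the base-URL constant itself. Hypothetical helper."""
    return {name: url for name, url in namespace.items()
            if name.isupper() and isinstance(url, str) and name != 'HANLP_URL'}

HANLP_URL = 'https://example.com/hanlp/'  # placeholder, not the real base URL
SIGHAN2005_PKU_CONVSEG = HANLP_URL + 'cws/sighan2005-pku-convseg_20200110_153722.zip'

ALL = collect_pretrained(dict(globals()))
print(sorted(ALL))  # → ['SIGHAN2005_PKU_CONVSEG']
```

The commented-out BERT entries above would simply drop out of such a registry, which is one way to retire a model without deleting its URL from the source.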
2 changes: 1 addition & 1 deletion hanlp/version.py
@@ -2,4 +2,4 @@
# Author: hankcs
# Date: 2019-12-28 19:26

-__version__ = '2.0.0-alpha.18'
+__version__ = '2.0.0-alpha.19'
4 changes: 2 additions & 2 deletions tests/demo/zh/demo_cws.py
@@ -3,11 +3,11 @@
# Date: 2019-12-28 21:25
import hanlp

-tokenizer = hanlp.load(hanlp.pretrained.cws.CTB6_CONVSEG)
+tokenizer = hanlp.load(hanlp.pretrained.cws.PKU_NAME_MERGED_SIX_MONTHS_CONVSEG)
print(tokenizer('商品和服务'))
print(tokenizer(['萨哈夫说,伊拉克将同联合国销毁伊拉克大规模杀伤性武器特别委员会继续保持合作。',
'上海华安工业(集团)公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。',
-               'HanLP支援臺灣正體、香港繁體']))
+               'HanLP支援臺灣正體、香港繁體,具有新詞辨識能力的中文斷詞系統']))

text = 'NLP统计模型没有加规则,聪明人知道自己加。英文、数字、自定义词典统统都是规则。'
print(tokenizer(text))
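The demo string above notes that the statistical model carries no hand-written rules ("smart people add their own" — English, digits, and custom dictionaries are all rules). A toy post-processing pass that merges model output against a user dictionary (an illustration only; `merge_by_dict` is invented here and is not HanLP's implementation) could look like:

```python
def merge_by_dict(tokens, user_dict, max_words=3):
    """Greedily merge runs of adjacent tokens whose concatenation
    appears in user_dict; leave everything else untouched."""
    out, i = [], 0
    while i < len(tokens):
        for k in range(min(max_words, len(tokens) - i), 1, -1):
            candidate = ''.join(tokens[i:i + k])
            if candidate in user_dict:
                out.append(candidate)
                i += k
                break
        else:  # no merge found starting at i
            out.append(tokens[i])
            i += 1
    return out

print(merge_by_dict(['自定义', '词典'], {'自定义词典'}))  # → ['自定义词典']
```

Running such a pass after the tokenizer keeps the statistical model rule-free while still letting users enforce domain vocabulary.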
