pytorch_bert_entity_linking

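BERT-based Chinese entity linking.
Download the pretrained weights from Hugging Face first: chinese-bert-wwm-ext
At prediction time, candidate entities are obtained with a modified jieba segmenter plus a custom entity dictionary. One thing to watch is how English phrases get segmented.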
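To illustrate dictionary-driven candidate lookup, here is a minimal sketch that uses plain forward maximum matching instead of the modified jieba in `my_jieba`. The dictionary entries, the ids, and the function names are hypothetical. The boundary check shows one way to avoid cutting through an English word or phrase.

```python
# Hypothetical alias dictionary: lowercase surface form -> knowledge-base ids.
# These entries and ids are made up for illustration.
ENTITY_DICT = {
    "魔兽世界": ["kb_001"],
    "恶魔猎手": ["kb_002", "kb_003"],
    "new york": ["kb_004"],
}
MAX_LEN = max(len(alias) for alias in ENTITY_DICT)


def _is_ascii_alnum(ch):
    return ch.isascii() and ch.isalnum()


def _breaks_ascii_word(text, start, end):
    """True if the span [start, end) cuts through an English word or number."""
    left_cut = (start > 0 and _is_ascii_alnum(text[start - 1])
                and _is_ascii_alnum(text[start]))
    right_cut = (end < len(text) and _is_ascii_alnum(text[end - 1])
                 and _is_ascii_alnum(text[end]))
    return left_cut or right_cut


def find_candidates(text):
    """Return (start, end, mention, kb_ids) tuples via forward maximum matching."""
    results = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest window first so multi-word aliases win.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece.lower() in ENTITY_DICT and not _breaks_ascii_word(text, i, i + size):
                match = (i, i + size, piece, ENTITY_DICT[piece.lower()])
                break
        if match:
            results.append(match)
            i = match[1]  # continue after the matched mention
        else:
            i += 1
    return results


candidates = find_candidates("《魔兽世界》恶魔猎手 in New York")
for start, end, mention, kb_ids in candidates:
    print(start, end, mention, kb_ids)
```

Each mention can map to several knowledge-base ids, which is where the ranking model comes in.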

Directory layout

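--checkpoints: saved models
--data: data
--logs: logs
--my_jieba: a modified jieba segmenter, which fixes stock jieba failing to segment the knowledge base's kg ls correctly
--utils: helper functions. The one worth noting is tokenization, which mainly makes sure that English letters, digits and the like are split apart during tokenization.
--el_config.py: configuration
--el_dataset.py: converts the data into the format PyTorch expects
--el_main.py: main entry point for training, validation, testing, and prediction
--el_main.sh: run command
--el_models.py: model
--el_preprocess.py: converts the data into the format BERT expects
--el_process.py: processes the raw training data and the knowledge base into intermediate files
--el_processor.py: tests the processor in el_preprocess
--el_service.py: starts the service
--service.log: service log
--service_main.py: the main program, factored out for serving
--start_service.sh: starts the service
--stop_service.sh: stops the service
--test_jieba.py: tests my_jieba
--test_service.py: tests calls to the running service
--test_tokenizer.py: tests the tokenizer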
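The tokenization helper in utils is described only briefly. A minimal sketch of one possible reading follows, assuming the goal is one token per character so that token positions line up with character offsets in the text. A WordPiece tokenizer would typically merge runs of Latin letters or digits into multi-character pieces, which breaks that alignment. The function name, the toy vocabulary, and the `[BLANK]` placeholder are hypothetical and not taken from the repository.

```python
def fine_grained_tokenize(text, vocab, blank_token="[BLANK]", unk_token="[UNK]"):
    """Emit exactly one token per character of `text`."""
    tokens = []
    for ch in text:
        if ch.isspace():
            tokens.append(blank_token)  # hypothetical placeholder for whitespace
        elif ch.lower() in vocab:
            tokens.append(ch.lower())   # letters and digits stay separate tokens
        else:
            tokens.append(unk_token)    # character missing from the vocabulary
    return tokens


# Toy vocabulary for illustration; a real one would come from the BERT vocab file.
toy_vocab = {"魔", "兽", "i", "p", "a", "d", "2"}
text = "魔兽 iPad2"
tokens = fine_grained_tokenize(text, toy_vocab)
print(tokens)
```

Because the token list has the same length as the input string, a mention's character span can be used directly as a token span.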
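Also note the data files under /data/ccks2019/:
alias_and_subjects.txt: entity names in the knowledge base
develop.json: used for prediction
entity_to_ids.json: entities and their ids in the knowledge base
entity_type.txt: entity types
kb_data: the knowledge base
subject_id_with_info.json: knowledge-base entity ids and their associated information
test.pkl: binary test file
train.json: training data
train.pkl: binary training file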

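Workflow

First, el_process.py generates the intermediate files we need. Next, el_processor.py tests the data processor. Then el_preprocess.py converts the data into the format BERT expects, splits it into training and test sets, and stores them as binary files. el_dataset.py converts those into the format PyTorch expects, and finally el_main.py calls everything.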
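The preprocessing and split steps can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes each training example pairs the mention's sentence with one candidate entity's description and carries a binary label, which is one plausible reading of `--num_tags=2`. The function names, field names, and entity descriptions are hypothetical.

```python
import os
import pickle
import random
import tempfile


def build_pair_examples(text, mention, gold_id, candidate_ids, kb_descriptions):
    """Pair the sentence with each candidate description; label 1 for the gold entity."""
    examples = []
    for kb_id in candidate_ids:
        examples.append({
            "text_a": text,                    # sentence containing the mention
            "text_b": kb_descriptions[kb_id],  # candidate entity description
            "mention": mention,
            "label": 1 if kb_id == gold_id else 0,
        })
    return examples


def split_and_save(examples, out_dir, test_ratio=0.2, seed=123):
    """Shuffle deterministically, split, and pickle both parts."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    test, train = shuffled[:n_test], shuffled[n_test:]
    for name, part in (("train.pkl", train), ("test.pkl", test)):
        with open(os.path.join(out_dir, name), "wb") as f:
            pickle.dump(part, f)
    return train, test


# Made-up knowledge-base descriptions for illustration only.
kb = {"kb_002": "魔兽世界中的职业", "kb_003": "一部小说"}
examples = build_pair_examples(
    "恶魔猎手吧-百度贴吧", "恶魔猎手", "kb_002", ["kb_002", "kb_003"], kb
)
with tempfile.TemporaryDirectory() as tmp:
    train, test = split_and_save(examples, tmp)
    with open(os.path.join(tmp, "train.pkl"), "rb") as f:
        reloaded = pickle.load(f)
print(len(train), len(test))
```

A dataset class would then read the pickled lists and turn each pair into token ids, attention masks, and a label tensor.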

Dependencies

pytorch==1.6
transformers
scikit-learn (imported as sklearn)

Command

python el_main.py \
--bert_dir="../model_hub/chinese-bert-wwm-ext/" \
--data_dir="./data/ccks2019/" \
--log_dir="./logs/" \
--output_dir="./checkpoints" \
--num_tags=2 \
--seed=123 \
--gpu_ids="0" \
--max_seq_len=256 \
--lr=2e-5 \
--other_lr=2e-4 \
--train_batch_size=32 \
--train_epochs=1 \
--eval_batch_size=32
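For readers who want to see how these flags might be declared, here is a hedged argparse sketch. The flag names come from the command above. The types and the help texts are assumptions, in particular the guess that `--lr` applies to the BERT layers and `--other_lr` to the remaining layers. The actual definitions live in el_config.py and may differ.

```python
import argparse


def build_parser():
    """Sketch of a parser accepting the flags used in the command above."""
    parser = argparse.ArgumentParser(description="Sketch of the el_main.py flags")
    parser.add_argument("--bert_dir", type=str, help="pretrained BERT directory")
    parser.add_argument("--data_dir", type=str, help="dataset directory")
    parser.add_argument("--log_dir", type=str, help="where logs are written")
    parser.add_argument("--output_dir", type=str, help="where checkpoints are saved")
    parser.add_argument("--num_tags", type=int, help="number of output classes")
    parser.add_argument("--seed", type=int, help="random seed")
    parser.add_argument("--gpu_ids", type=str, help="GPU ids, kept as a string")
    parser.add_argument("--max_seq_len", type=int, help="maximum sequence length")
    parser.add_argument("--lr", type=float, help="assumed: learning rate for BERT layers")
    parser.add_argument("--other_lr", type=float, help="assumed: learning rate for other layers")
    parser.add_argument("--train_batch_size", type=int)
    parser.add_argument("--train_epochs", type=int)
    parser.add_argument("--eval_batch_size", type=int)
    return parser


# Parse the same values as the command above, passed as a list instead of sys.argv.
args = build_parser().parse_args([
    "--bert_dir=../model_hub/chinese-bert-wwm-ext/",
    "--data_dir=./data/ccks2019/",
    "--log_dir=./logs/",
    "--output_dir=./checkpoints",
    "--num_tags=2",
    "--seed=123",
    "--gpu_ids=0",
    "--max_seq_len=256",
    "--lr=2e-5",
    "--other_lr=2e-4",
    "--train_batch_size=32",
    "--train_epochs=1",
    "--eval_batch_size=32",
])
print(args.lr, args.other_lr, args.max_seq_len)
```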

Start the service

nohup python -u el_service.py --ip '0.0.0.0' --port '1080' > service.log 2>&1 &

Test the service

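import requests

# Send the text as the raw UTF-8 request body.
text = '恶魔猎手吧-百度贴吧--《魔兽世界》恶魔猎手职业贴吧...'
# 0.0.0.0 is the bind address; a client should connect to a concrete host such as 127.0.0.1.
url = 'http://127.0.0.1:1080/entity_linking'
response = requests.post(url, data=text.encode('utf-8'))
print(response.text)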
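To try the client-side call without a trained model, a stand-in server can be sketched with the standard library alone. This is not how el_service.py is implemented. The handler name, the JSON response shape, and the empty `entities` list are hypothetical. Only the `/entity_linking` path and the raw UTF-8 POST body follow the client code above.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubLinkingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/entity_linking":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        # Placeholder reply: a real service would run the linking model here.
        body = json.dumps({"text": text, "entities": []}, ensure_ascii=False).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def post_text(url, text):
    """POST raw UTF-8 text and decode the JSON reply."""
    request = urllib.request.Request(url, data=text.encode("utf-8"), method="POST")
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))


server = HTTPServer(("127.0.0.1", 0), StubLinkingHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/entity_linking" % server.server_port
reply = post_text(url, "恶魔猎手吧-百度贴吧")
server.shutdown()
server.server_close()
print(reply)
```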

Stop the service

Note: an extra space is required at the very end.

ps -ef|grep "el_service.py --ip ${ip}"|grep -v grep|awk '{print $2}'|xargs kill -9
