📚 Chinese Emergency Corpus (CEC), Semantic Intelligence Laboratory, Shanghai University

jy007/CEC-Corpus

Chinese Emergency Corpus

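The Chinese Emergency Corpus (CEC) was built by the Semantic Intelligence Laboratory at Shanghai University. Following the classification system of the National Overall Emergency Response Plan for Public Emergencies (《国家突发公共事件总体应急预案》) issued by the State Council, news reports on five categories of emergency (earthquake, fire, traffic accident, terrorist attack, and food poisoning) were collected from the Internet as raw text. The raw text then went through text preprocessing, text analysis, event annotation, and consistency checking, and the annotation results were saved to the corpus. CEC contains 332 texts in total.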

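CEC uses XML as its annotation format, built around six core data structures (tags): Event, Denoter, Time, Location, Participant, and Object. Event describes an event; Denoter, Time, Location, Participant, and Object describe the word that denotes the event and the event's elements. Each tag also has its own set of attributes. Compared with the ACE and TimeBank corpora, CEC is smaller in scale, but its annotation of events and event elements is the most comprehensive.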
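As a rough illustration of how annotations of this kind might be read, here is a minimal Python sketch using the standard-library `xml.etree.ElementTree`. The sample sentence, the `Sentence` wrapper element, and the `extract_events` helper are all hypothetical; real CEC files may nest elements differently and carry attributes that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence, made up for illustration only.
# Only the tag names (Event, Denoter, Time, Location, ...) come from the
# corpus description; the layout and text are assumptions.
SAMPLE = (
    "<Sentence><Event>"
    "<Time>Yesterday afternoon</Time>, an <Denoter>earthquake</Denoter> "
    "struck <Location>the city center</Location>."
    "</Event></Sentence>"
)

# Tags that describe the denoter and the elements of an event.
ELEMENT_TAGS = ("Denoter", "Time", "Location", "Participant", "Object")


def extract_events(xml_text):
    """Return one dict per Event, mapping each element tag to its text spans."""
    root = ET.fromstring(xml_text)
    events = []
    for event in root.iter("Event"):
        record = {tag: [] for tag in ELEMENT_TAGS}
        # iter() walks all descendants, so elements of nested events
        # (if any) are also collected under the enclosing event.
        for node in event.iter():
            if node.tag in record:
                record[node.tag].append("".join(node.itertext()).strip())
        events.append(record)
    return events


if __name__ == "__main__":
    for event in extract_events(SAMPLE):
        print(event)
```

Returning plain dictionaries keeps the sketch independent of any particular attribute scheme; attributes could be added to each record once their names are confirmed against the actual files.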

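For details, see the master's and doctoral theses published by Shanghai University, as well as the related journal and conference papers.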

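The research and development of this corpus was funded by the National Natural Science Foundation of China projects "Research on Key Issues of Event Reasoning Based on Description Logic" (No. 61305053) and "Event Ontology Model and Application Technology" (No. 60975033).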

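We thank the master's and doctoral students of the Semantic Intelligence Laboratory at Shanghai University who contributed to the annotation of CEC.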

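Research papers:
[1] Liu Wei, Wang Dong, Liu Zongtian, Liu Feijing. A method for extracting event elements from text based on event ontology (基于事件本体的文本事件要素抽取方法). Journal of Chinese Information Processing (中文信息学报) (accepted).
[2] Fu Jianfeng, Liu Zongtian, Liu Wei, Zhou Wen. Event causal relation extraction based on cascaded conditional random fields (基于层叠条件随机场的事件因果关系抽取)[J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2011, 24(4):567-573.
[3] Zhu Shasha, Liu Zongtian, Fu Jianfeng, Zhu Fang. Chinese temporal phrase recognition based on conditional random fields (基于条件随机场的中文时间短语识别)[J]. Computer Engineering (计算机工程), 2011, 37(15):164-167.
[4] Fu Jianfeng, Liu Zongtian, Liu Wei. Event element recognition based on feature weighting (基于特征加权的事件要素识别)[J]. Computer Science (计算机科学), 2010, No. 3.
[5] Liu Zongtian, Huang Meili, et al. Research on event-oriented ontology (面向事件的本体研究)[J]. Computer Science (计算机科学), 2009, No. 11.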
[6] Xu-jie Zhang, Zong-tian Liu, Wei Liu, Jian-feng Fu. Research on event-based semantic annotation of Chinese[C]. Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on: 1883-1888.
[7] Fang Zhu, Zongtian Liu, Juanli Yang, Ping Zhu. Chinese event place phrase recognition of emergency event using Maximum Entropy[C]. Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE International Conference on: 614-618.
[8] Jian-feng Fu, Wei Liu, Zong-tian Liu, Sha-sha Zhu. A Study of Chinese Event Taggability[C]. Communication Software and Networks, 2010. ICCSN '10. Second International Conference on: 400-404.
[9] Jianfeng Fu, Zongtian Liu, Wei Liu. Using dual-layer CRFs for event causal relation extraction. IEICE Electronics Express, 2011, Vol. 8, No. 5, 306–310.
[10] Xujie Zhang, Zongtian Liu, Wei Liu, Junhui Yang, Shengnan Fei. Chinese Event Classification for Event Ontology Construction. Journal of Computational Information Systems (JCIS), 9: 9 (2013) 3511–3519.

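Doctoral theses:
[1] Fu Jianfeng. Research on Event-Oriented Knowledge Processing (面向事件的知识处理研究)[D]. Shanghai: Shanghai University, 2010.
[2] Shan Jianfang. Research on Event-Oriented Text Representation (面向事件的文本表示研究)[D]. Shanghai: Shanghai University, 2011.
[3] Zhong Zhaoman. Event Ontology and Its Application in Query Expansion (事件本体及其在查询扩展中的应用)[D]. Shanghai: Shanghai University, 2011.
[4] Zhang Xujie. A Study of Several Key Problems in Construction of Event Ontology (事件本体构建中几个关键问题的研究)[D]. Shanghai: Shanghai University, 2012.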

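Master's theses:
[1] Fei Shengnan. Research on Mental Events (意念事件研究)[D]. Shanghai: Shanghai University, 2013.
[2] Zhu Shasha. Research on Extraction and Reasoning of Event Time Elements in the Emergency Domain (面向突发事件领域的事件时间要素抽取与推理研究)[D]. Shanghai: Shanghai University, 2011.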

===============================================================Timeline===============================================================

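On September 18, 2015, we added an annotated environmental pollution category. It comprises six subcategories: marine pollution, air pollution, social effects, water pollution, soil pollution, and noise pollution, for a total of 106 texts.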

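This round of annotation was carried out mainly by Wang Xu, Ding Ning, and others; Wang Xu handled the formatting of the annotation results, encoding conversion, and error correction.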

Chinese Emergency Corpus (CEC)

The Chinese Emergency Corpus (CEC) was built by the Data Semantic Laboratory at Shanghai University. The corpus covers five categories of emergency: earthquake, fire, traffic accident, terrorist attack, and food poisoning. It contains 332 texts in total, collected from the Internet and processed in several steps.

CEC uses XML as its annotation format, with 6 tags (Denoter, Time, Location, Participant, Mean, and Object) that describe the elements of an event (Event). Each tag also has its own attributes. Compared with the ACE and TimeBank corpora, CEC is smaller in scale, but it annotates events and event elements more comprehensively.
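To see which tags actually occur in a local copy of the corpus, one option is to tally element names across all XML files. The sketch below is a minimal example under stated assumptions: the `count_tags` helper and the `"CEC"` directory name are hypothetical, and the files are assumed to be well-formed XML with an `.xml` extension.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_tags(corpus_dir):
    """Count how often each XML element name appears under corpus_dir."""
    counts = Counter()
    for path in sorted(Path(corpus_dir).rglob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            # Skip files that are not well-formed XML instead of aborting.
            continue
        counts.update(node.tag for node in root.iter())
    return counts


if __name__ == "__main__":
    # "CEC" is a placeholder path; point it at wherever the corpus is stored.
    for tag, n in count_tags("CEC").most_common():
        print(f"{tag}\t{n}")
```

Counting every element name, instead of filtering on a fixed list, makes no assumption about the tag inventory and shows directly which tags a given set of files uses.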

To learn more about CEC, see the related dissertations and papers, such as
Research on Event-Oriented Knowledge Processing by Jianfeng Fu
A Study of Several Key Problems in Construction of Event Ontology by Xujie Zhang.

Thanks to all the master's and doctoral students of the Data Semantic Laboratory at Shanghai University who contributed to CEC.

===============================================================Timeline===============================================================

On September 18, 2015, we added an annotated environmental pollution corpus. It comprises six subcategories: marine pollution, air pollution, social effects, water pollution, soil pollution, and noise pollution, for a total of 106 texts.

The annotation was carried out mainly by Wang Xu, Ding Ning, and others; Wang Xu handled the formatting of the annotation results, encoding conversion, and error correction.
