Merge pull request HIT-SCIR#1 from HIT-SCIR/master
merge updates from HIT-SCIR/ltp
icycandy committed Aug 10, 2015
2 parents a275989 + 5693a9b commit 1b282c5
Showing 11 changed files with 208 additions and 238 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -31,6 +31,8 @@ tools/train/otpos
tools/train/otner
tools/train/maxent
tools/train/nndepparser
tools/train/Release/
tools/train/Debug/

###############
# data file #
2 changes: 1 addition & 1 deletion doc/api.rst
@@ -460,7 +460,7 @@ Linux
+------------------------------------------+--------------------------------------------------------------------+
| const std::vector<std::string> & words   | Word sequence to be analyzed                                       |
+------------------------------------------+--------------------------------------------------------------------+
| const std::vector<std::string> & postags | POS tag sequence of the words to be analyzed                       |
| const std::vector<std::string> & postags | POS tag sequence of the words to be analyzed                       |
+------------------------------------------+--------------------------------------------------------------------+
| std::vector<int> & heads                 | Resulting dependency arcs; heads[i] is the parent index of word i  |
+------------------------------------------+--------------------------------------------------------------------+
81 changes: 50 additions & 31 deletions doc/ltpserver.rst
@@ -52,6 +52,7 @@ LTP Server is built on the lightweight server program mongoose. When compiling the LTP source


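The more important parameters are:

- port: the port LTP server listens on
- threads: the number of threads LTP server runs; the thread count determines how many requests can be handled concurrently
- log-level: the log level. TRACE is the lowest level and produces the most detailed output; INFO is the highest level and produces the coarsest output. WARN and ERROR messages are shown by default.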
@@ -75,8 +76,6 @@ The POST request submitted by the client mainly has the following fields.
+--------+--------------------------------------------------------------------+
| x      | Indicates whether to use XML                                       |
+--------+--------------------------------------------------------------------+
| c      | Indicates the input encoding                                       |
+--------+--------------------------------------------------------------------+
| t      | Analysis target: word segmentation (ws), POS tagging (pos), named  |
|        | entity recognition (ner), dependency parsing (dp), semantic role   |
|        | labeling (srl), or all tasks (all)                                 |
+--------+--------------------------------------------------------------------+
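
The fields above could be assembled into a form-encoded POST body roughly as follows. This is a minimal sketch, not the client shipped with LTP: the helper names `url_encode` and `build_ltp_body` are hypothetical, the field name `s` for the input text is an assumption (it is not visible in this hunk), and so is the use of `y`/`n` as the values of `x`.

```cpp
#include <cstdio>
#include <string>

// Percent-encode a byte string for an application/x-www-form-urlencoded body.
// Unreserved ASCII characters pass through; every other byte becomes %XX.
std::string url_encode(const std::string& raw) {
  std::string out;
  for (unsigned char ch : raw) {
    const bool unreserved =
        (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') ||
        (ch >= '0' && ch <= '9') || ch == '-' || ch == '_' ||
        ch == '.' || ch == '~';
    if (unreserved) {
      out += static_cast<char>(ch);
    } else {
      char buf[4];  // '%' + two hex digits + terminating NUL
      std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned>(ch));
      out += buf;
    }
  }
  return out;
}

// Assemble the request body from the fields in the table.
// Assumption: "s" carries the text, and x takes "y" or "n".
std::string build_ltp_body(const std::string& text, bool use_xml,
                           const std::string& target) {
  std::string body = "s=" + url_encode(text);
  body += use_xml ? "&x=y" : "&x=n";
  body += "&t=" + url_encode(target);
  return body;
}
```

The resulting string would be sent as the body of an HTTP POST to the server's listening port; the transport itself is left out to keep the sketch free of third-party dependencies.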

@@ -110,24 +109,47 @@ The LTML standard requires the following:

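There are seven node tags: xml4nlp, note, doc, para, sent, word, and arg:

1. xml4nlp is the root node and has no attributes;
2. note is the marker node with the attributes sent, word, pos, ne, parser, srl, which stand for sentence splitting, word segmentation, POS tagging, named entity recognition, dependency parsing, word sense disambiguation, and semantic role labeling; the value ”n” means not done and ”y” means done, e.g. pos=”y” means POS tagging has been completed;
3. doc is the document node, which holds the text content in units of paragraphs; it has no attributes;
4. para is the paragraph node and requires an id attribute, whose value starts from 0;
5. sent is the sentence node and requires the attributes id and cont; id is the index of the sentence within the paragraph, starting from 0; cont is the sentence content;
6. word is the word node and requires the attributes id and cont; id is the index of the word within the sentence, starting from 0, and cont is the word content; the optional attributes are pos, ne, parent, relate; pos is the POS tag; ne is the named entity label; parent and relate appear as a pair: parent is the id of the parent node in the dependency parse and relate is the corresponding relation;
7. arg is the semantic role node, and every predicate carries several of them; its attributes are id, type, beg, end; id is the index, starting from 0; type is the role name; beg is the index of the first word and end is the index of the last word;
1. xml4nlp is the root node and has no attributes;

2. note is the marker node with the attributes sent, word, pos, ne, parser, srl;
these stand for sentence splitting, word segmentation, POS tagging, named entity recognition, dependency parsing, word sense disambiguation, and semantic role labeling;
the value "n" means not done and "y" means done, e.g. pos="y" means POS tagging has been completed;

3. doc is the document node, which holds the text content in units of paragraphs; it has no attributes;

4. para is the paragraph node and requires an id attribute, whose value starts from 0;

5. sent is the sentence node and requires the attributes id and cont;

a) id is the index of the sentence within the paragraph, starting from 0;
b) cont is the sentence content;

6. word is the word node and requires the attributes id and cont;

a) id is the index of the word within the sentence, starting from 0,
b) cont is the word content; the optional attributes are pos, ne, parent, relate;

I) pos is the POS tag;
II) ne is the named entity label;
III) parent and relate appear as a pair: parent is the id of the parent node in the dependency parse and relate is the corresponding relation;

7. arg is the semantic role node, and every predicate carries several of them; its attributes are id, type, beg, end;

a) id is the index, starting from 0;
b) type is the role name;
c) beg is the index of the first word and end is the index of the last word;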

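The logical relationships among the nodes and attributes are as follows:

1. The node hierarchy can be read directly from the figure; any node that has an id attribute may occur multiple times;
2. If sent=”n”, i.e. sentence splitting has not been done, sent nodes and their descendants must not be present;
3. If sent=”y” word=”n”, i.e. sentence splitting is done but word segmentation is not, word nodes and their descendants must not be present;
4. All remaining cases assume sent=”y” word=”y”:
1. If pos=”y”, word nodes must carry the pos attribute;
2. If ne=”y”, word nodes must carry the ne attribute;
3. If parser=”y”, word nodes must carry the parent and relate attributes;
4. If srl=”y”, every word that is a predicate contains several arg nodes;
1. The node hierarchy can be read directly from the figure; any node that has an id attribute may occur multiple times;
2. If sent="n", i.e. sentence splitting has not been done, sent nodes and their descendants must not be present;
3. If sent="y" word="n", i.e. sentence splitting is done but word segmentation is not, word nodes and their descendants must not be present;
4. All remaining cases assume sent="y" word="y":

a) If pos="y", word nodes must carry the pos attribute;
b) If ne="y", word nodes must carry the ne attribute;
c) If parser="y", word nodes must carry the parent and relate attributes;
d) If srl="y", every word that is a predicate contains several arg nodes;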
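
One way to read the rules above is as a dependency check on the note flags before any node is inspected. The sketch below is an interpretation, not LTP's own validator: `NoteFlags` and `check_note_flags` are hypothetical names, and the rule that word="y" requires sent="y" is inferred from rule 2 rather than stated outright.

```cpp
#include <string>

// Hypothetical mirror of the attributes on the LTML note node;
// true stands for "y" and false for "n".
struct NoteFlags {
  bool sent = false;
  bool word = false;
  bool pos = false;
  bool ne = false;
  bool parser = false;
  bool srl = false;
};

// Returns an empty string when the flags are mutually consistent,
// otherwise a short description of the first violated rule.
std::string check_note_flags(const NoteFlags& f) {
  // Inferred from rule 2: without sentence splitting there are no sent
  // nodes, so there can be no word nodes beneath them either.
  if (f.word && !f.sent) {
    return "word=\"y\" requires sent=\"y\"";
  }
  // Rule 4: all word-level annotations assume sent="y" and word="y".
  const bool any_word_level = f.pos || f.ne || f.parser || f.srl;
  if (any_word_level && !(f.sent && f.word)) {
    return "pos/ne/parser/srl require sent=\"y\" and word=\"y\"";
  }
  return "";
}
```

A fuller check would also walk the document and confirm that each word node carries the attributes its flags promise; that part depends on the XML parser in use and is omitted here.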

Example programs
~~~~~~~~~~~~~~~~
@@ -156,15 +178,17 @@ The LTML standard requires the following:

If a request does not meet the format requirements, LTP Server returns a 400 error. The table below lists the error types LTP Server returns and their causes.

+-------+----------------------+----------------------------------------+
| code  | reason               | Explanation                            |
+=======+======================+========================================+
| 400   | EMPTY SENTENCE       | Input sentence is empty                |
+-------+----------------------+----------------------------------------+
| 400   | ENCODING NOT IN UTF8 | Input sentence is not UTF-8 encoded    |
+-------+----------------------+----------------------------------------+
| 400   | BAD XML FORMAT       | Input sentence is not in LTML format   |
+-------+----------------------+----------------------------------------+
+-------+----------------------+------------------------------------------------------------+
| code  | reason               | Explanation                                                |
+=======+======================+============================================================+
| 400   | EMPTY SENTENCE       | Input sentence is empty                                    |
+-------+----------------------+------------------------------------------------------------+
| 400   | ENCODING NOT IN UTF8 | Input sentence is not UTF-8 encoded                        |
+-------+----------------------+------------------------------------------------------------+
| 400   | SENTENCE TOO LONG    | Input sentence violates :ref:`ltprestrict-reference-label` |
+-------+----------------------+------------------------------------------------------------+
| 400   | BAD XML FORMAT       | Input sentence is not in LTML format                       |
+-------+----------------------+------------------------------------------------------------+

Current service performance
---------------------------
@@ -188,8 +212,3 @@ Number of agents = 10
+------------+----------------------+----------------------+
| srl/all | 0.036 | 266.094 |
+------------+----------------------+----------------------+

.. rubric::

.. [#f1] To listen on a different port, set the macro `LISTENING_PORT "12345"` in :file:`src/server/ltp_server.cpp` to another integer.
7 changes: 7 additions & 0 deletions doc/ltptest.rst
@@ -136,6 +136,13 @@ The main goal of xxx_cmdline is to provide, unlike XML, freely combinable language
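Details
-------

.. _ltprestrict-reference-label:

Length limits
~~~~~~~~~~~~~

To keep overly long input sentences from affecting stability, LTP limits user input to fewer than 1024 characters and segmentation results to fewer than 256 words.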
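
A caller could pre-check input against these limits before submitting it. The sketch below assumes that a "character" means one UTF-8 code point, which may not match how LTP itself counts; the constant and function names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Limits taken from the text: strictly fewer than 1024 characters
// and strictly fewer than 256 words.
constexpr std::size_t kMaxChars = 1024;
constexpr std::size_t kMaxWords = 256;

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is already valid UTF-8.
std::size_t count_utf8_chars(const std::string& s) {
  std::size_t n = 0;
  for (unsigned char ch : s) {
    if ((ch & 0xC0) != 0x80) {
      ++n;
    }
  }
  return n;
}

// True when the raw sentence is under the character limit.
bool within_char_limit(const std::string& sentence) {
  return count_utf8_chars(sentence) < kMaxChars;
}

// True when a segmentation result is under the word limit.
bool within_word_limit(const std::vector<std::string>& words) {
  return words.size() < kMaxWords;
}
```

Counting code points rather than bytes matters for Chinese text, where each character typically occupies three bytes in UTF-8.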

.. _ltpexlex-reference-label:

External dictionary
2 changes: 1 addition & 1 deletion doc/theory.rst
@@ -229,7 +229,7 @@
-----------------

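The dependency parsing module is based on the neural network dependency parsing algorithm of Chen and Manning (2014), extended with rich global features and cluster features. For model training we also drew on the work of Yoav et al. on dynamic oracles.
On the `Chinese Dependency Treebank(CDT) <https://catalog.ldc.upenn.edu/LDC2012T05>`_ dataset, the performance of the three decoding methods is shown in the table below; running speed and memory usage were measured on the CDT test set (29.13 words per sentence on average)
On the `Chinese Dependency Treebank(CDT) <https://catalog.ldc.upenn.edu/LDC2012T05>`_ dataset, with running speed and memory usage measured on the CDT test set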

+------------+-------+-------+
| | UAS | LAS |
30 changes: 11 additions & 19 deletions src/ner/otner.cpp
@@ -26,11 +26,11 @@ int learn(int argc, const char* argv[]) {
"The prefix of the model file, model will be stored as model.$iter.")
("reference", value<std::string>(), "The path to the reference file.")
("development", value<std::string>(), "The path to the development file.")
("algorithm", value<std::string>(), "The learning algorithm\n"
("algorithm", value<std::string>()->default_value("pa"), "The learning algorithm\n"
" - ap: averaged perceptron\n"
" - pa: passive aggressive [default]")
("max-iter", value<int>(), "The number of iteration [default=10].")
("rare-feature-threshold", value<int>(),
("max-iter", value<int>()->default_value(10), "The number of iteration [default=10].")
("rare-feature-threshold", value<int>()->default_value(0),
"The threshold for rare feature, used in model truncation. [default=0]")
("help,h", "Show help information");

@@ -67,21 +67,14 @@ int learn(int argc, const char* argv[]) {
development = vm["development"].as<std::string>();
}

std::string algorithm = "pa";
if (vm.count("algorithm")) {
algorithm = vm["algorithm"].as<std::string>();
if (algorithm != "pa" && algorithm != "ap") {
WARNING_LOG("algorithm should either be ap or pa, set as default [pa].");
algorithm = "pa";
}
std::string algorithm = vm["algorithm"].as<std::string>();
if (algorithm != "pa" && algorithm != "ap") {
WARNING_LOG("algorithm should either be ap or pa, set as default [pa].");
algorithm = "pa";
}

int max_iter = 10;
if (vm.count("max-iter")) { max_iter = vm["max-iter"].as<int>(); }

int rare_feature_threshold = 0;
if (vm.count("rare-feature-threshold")) {
rare_feature_threshold= vm["rare-feature-threshold"].as<int>(); }
int max_iter = vm["max-iter"].as<int>();
int rare_feature_threshold = vm["rare-feature-threshold"].as<int>();

NamedEntityRecognizerFrontend frontend(reference, development, model_name,
algorithm, max_iter, rare_feature_threshold);
@@ -99,7 +92,7 @@ int test(int argc, const char* argv[]) {
optparser.add_options()
("model", value<std::string>(), "The path to the model file.")
("input", value<std::string>(), "The path to the reference file.")
("evaluate", value<bool>(),
("evaluate", value<bool>()->default_value(false),
"if configured, perform evaluation, input should contain '#' concatenated tag")
("help,h", "Show help information");

@@ -134,8 +127,7 @@ int test(int argc, const char* argv[]) {
output_file = vm["output"].as<std::string>();
}

bool evaluate = false;
if (vm.count("evaluate")) { evaluate = vm["evaluate"].as<bool>(); }
bool evaluate = vm["evaluate"].as<bool>();

NamedEntityRecognizerFrontend frontend(input_file, model_name, evaluate);
frontend.test();
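
The `learn` hunks above move the defaults into the option declarations and keep only the validation of `algorithm` in the body. The effect can be sketched without Boost as below. This is an illustration under assumptions, not the code in `otner.cpp`: `LearnOptions` and `resolve_learn_options` are hypothetical, and a plain `std::map` stands in for the parsed `variables_map`.

```cpp
#include <map>
#include <string>

// Hypothetical container for the three options with defaults in the diff.
struct LearnOptions {
  std::string algorithm;
  int max_iter;
  int rare_feature_threshold;
};

// Start from the defaults (pa, 10, 0), overlay whatever the user supplied,
// then fall back to "pa" when the algorithm is neither "ap" nor "pa".
// std::stoi throws if a numeric option is not a valid integer.
LearnOptions resolve_learn_options(
    const std::map<std::string, std::string>& given) {
  LearnOptions opts{"pa", 10, 0};

  auto it = given.find("algorithm");
  if (it != given.end()) {
    opts.algorithm = it->second;
  }
  if (opts.algorithm != "pa" && opts.algorithm != "ap") {
    opts.algorithm = "pa";  // same fallback as the WARNING_LOG branch
  }

  it = given.find("max-iter");
  if (it != given.end()) {
    opts.max_iter = std::stoi(it->second);
  }

  it = given.find("rare-feature-threshold");
  if (it != given.end()) {
    opts.rare_feature_threshold = std::stoi(it->second);
  }
  return opts;
}
```

Declaring the defaults once, next to the option definitions, means the values cannot drift apart from the help text the way separately initialized local variables can.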