Revise document
hankcs committed Jan 11, 2021
1 parent 67be9d1 commit 08995a5
Showing 7 changed files with 87 additions and 62 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -75,7 +75,7 @@ In particular, the Python `HanLPClient` can also be used as a callable function

## Train Your Own Models

Writing DL models is not hard; the hard part is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 9 minutes.
Writing DL models is not hard; the hard part is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.

```python
tokenizer = TransformerTaggingTokenizer()
100 changes: 50 additions & 50 deletions docs/annotations/dep/sd.md
@@ -23,6 +23,55 @@

See also [Stanford typed dependencies manual](https://nlp.stanford.edu/software/dependencies_manual.pdf).

## Chinese

|Tag|Description|中文简称|例句|依存弧|
| ---- | ---- | ---- | ---- | ---- |
|nn|noun compound modifier|复合名词修饰|服务中心|nn(中心,服务)|
|punct|punctuation|标点符号|海关统计表明,|punct(表明,,)|
|nsubj|nominal subject|名词性主语|梅花盛开|nsubj (盛开,梅花)|
|conj|conjunct (links two conjuncts)|连接性状语|设备和原材料|conj(原材料,设备)|
|dobj|direct object|直接宾语|浦东颁布了七十一件文件|dobj(颁布,文件)|
|advmod|adverbial modifier|副词性状语|部门先送上文件|advmod(送上,先)|
|prep|prepositional modifier|介词性修饰语|在实践中逐步完善|prep(完善,在)|
|nummod|number modifier|数词修饰语|七十一件文件|nummod(件,七十一)|
|amod|adjectival modifier|形容词修饰语|跨世纪工程|amod(工程,跨世纪)|
|pobj|prepositional object|介词性宾语|根据有关规定|pobj (根据,规定)|
|rcmod|relative clause modifier|相关关系|不曾遇到过的情况|rcmod(情况,遇到)|
|cpm|complementizer|补语|开发浦东的经济活动|cpm(开发,的)|
|assm|associative marker|关联标记|企业的商品|assm(企业,的)|
|assmod|associative modifier|关联修饰|企业的商品|assmod(商品,企业)|
|cc|coordinating conjunction|并列关系|设备和原材料|cc(原材料,和)|
|clf|classifier modifier|类别修饰|七十一件文件|clf(文件,件)|
|ccomp|clausal complement|从句补充|银行决定先取得信用评级|ccomp(决定,取得)|
|det|determiner|限定语|这些经济活动|det(活动,这些)|
|lobj|localizer object|时间介词|近年来|lobj(来,近年)|
|range|dative object that is a quantifier phrase|数量词间接宾语|成交药品一亿多元|range(成交,元)|
|asp|aspect marker|时态标记|发挥了作用|asp(发挥,了)|
|tmod|temporal modifier|时间修饰语|以前不曾遇到过|tmod(遇到,以前)|
|plmod|localizer modifier of a preposition|介词性地点修饰|在这片热土上|plmod(在,上)|
|attr|attributive|属性|贸易额为二百亿美元|attr(为,美元)|
|mmod|modal verb modifier|情态动词|利益能得到保障|mmod(得到,能)|
|loc|localizer|位置补语|占九成以上|loc(占,以上)|
|top|topic|主题|建筑是主要活动|top(是,建筑)|
|pccomp|clausal complement of a preposition|介词补语|据有关部门介绍|pccomp(据,介绍)|
|etc|etc modifier|省略关系|科技、文教等领域|etc(文教,等)|
|lccomp|clausal complement of a localizer|位置补语|中国对外开放中升起的明星|lccomp(中,开放)|
|ordmod|ordinal number modifier|量词修饰|第七个机构|ordmod(个,第七)|
|xsubj|controlling subject|控制主语|银行决定先取得信用评级|xsubj (取得,银行)|
|neg|negative modifier|否定修饰|以前不曾遇到过|neg(遇到,不)|
|rcomp|resultative complement|结果补语|研究成功|rcomp(研究,成功)|
|comod|coordinated verb compound modifier|并列联合动词|颁布实行|comod(颁布,实行)|
|vmod|verb modifier|动词修饰|其在支持外商企业方面的作用|vmod(方面,支持)|
|prtmod|particles such as 所,以,来,而|小品词|在产业化所取得的成就|prtmod(取得,所)|
|ba|“ba” construction|把字关系|把注意力转向市场|ba(转向,把)|
|dvpm|manner DE(地)modifier|地字修饰|有效地防止流失|dvpm(有效,地)|
|dvpmod|a "XP+DEV", phrase that modifies VP|地字动词短语|有效地防止流失|dvpmod(防止,有效)|
|prnmod|parenthetical modifier|插入词修饰|八五期间(1990-1995)|prnmod(期间,1995)|
|cop|copular|系动词|原是自给自足的经济|cop(自给自足,是)|
|pass|passive marker|被动标记|被认定为高技术产业|pass(认定,被)|
|nsubjpass|nominal passive subject|被动名词主语|镍被称作现代工业的维生素|nsubjpass(称作,镍)|

## English

| Tag | Description |
@@ -87,53 +136,4 @@ See also [Stanford typed dependencies manual](https://nlp.stanford.edu/software/
| tmod | temporal modifier |
| vmod | verb modifier |
| xcomp | open clausal complement |
| xsubj | controlling subject |

## Chinese

|Tag|Description|中文简称|例句|依存弧|
| ---- | ---- | ---- | ---- | ---- |
|nn|noun compound modifier|复合名词修饰|服务中心|nn(中心,服务)|
|punct|punctuation|标点符号|海关统计表明,|punct(表明,,)|
|nsubj|nominal subject|名词性主语|梅花盛开|nsubj (盛开,梅花)|
|conj|conjunct (links two conjuncts)|连接性状语|设备和原材料|conj(原材料,设备)|
|dobj|direct object|直接宾语|浦东颁布了七十一件文件|dobj(颁布,文件)|
|advmod|adverbial modifier|副词性状语|部门先送上文件|advmod(送上,先)|
|prep|prepositional modifier|介词性修饰语|在实践中逐步完善|prep(完善,在)|
|nummod|number modifier|数词修饰语|七十一件文件|nummod(件,七十一)|
|amod|adjectival modifier|形容词修饰语|跨世纪工程|amod(工程,跨世纪)|
|pobj|prepositional object|介词性宾语|根据有关规定|pobj (根据,规定)|
|rcmod|relative clause modifier|相关关系|不曾遇到过的情况|rcmod(情况,遇到)|
|cpm|complementizer|补语|开发浦东的经济活动|cpm(开发,的)|
|assm|associative marker|关联标记|企业的商品|assm(企业,的)|
|assmod|associative modifier|关联修饰|企业的商品|assmod(商品,企业)|
|cc|coordinating conjunction|并列关系|设备和原材料|cc(原材料,和)|
|clf|classifier modifier|类别修饰|七十一件文件|clf(文件,件)|
|ccomp|clausal complement|从句补充|银行决定先取得信用评级|ccomp(决定,取得)|
|det|determiner|限定语|这些经济活动|det(活动,这些)|
|lobj|localizer object|时间介词|近年来|lobj(来,近年)|
|range|dative object that is a quantifier phrase|数量词间接宾语|成交药品一亿多元|range(成交,元)|
|asp|aspect marker|时态标记|发挥了作用|asp(发挥,了)|
|tmod|temporal modifier|时间修饰语|以前不曾遇到过|tmod(遇到,以前)|
|plmod|localizer modifier of a preposition|介词性地点修饰|在这片热土上|plmod(在,上)|
|attr|attributive|属性|贸易额为二百亿美元|attr(为,美元)|
|mmod|modal verb modifier|情态动词|利益能得到保障|mmod(得到,能)|
|loc|localizer|位置补语|占九成以上|loc(占,以上)|
|top|topic|主题|建筑是主要活动|top(是,建筑)|
|pccomp|clausal complement of a preposition|介词补语|据有关部门介绍|pccomp(据,介绍)|
|etc|etc modifier|省略关系|科技、文教等领域|etc(文教,等)|
|lccomp|clausal complement of a localizer|位置补语|中国对外开放中升起的明星|lccomp(中,开放)|
|ordmod|ordinal number modifier|量词修饰|第七个机构|ordmod(个,第七)|
|xsubj|controlling subject|控制主语|银行决定先取得信用评级|xsubj (取得,银行)|
|neg|negative modifier|否定修饰|以前不曾遇到过|neg(遇到,不)|
|rcomp|resultative complement|结果补语|研究成功|rcomp(研究,成功)|
|comod|coordinated verb compound modifier|并列联合动词|颁布实行|comod(颁布,实行)|
|vmod|verb modifier|动词修饰|其在支持外商企业方面的作用|vmod(方面,支持)|
|prtmod|particles such as 所,以,来,而|小品词|在产业化所取得的成就|prtmod(取得,所)|
|ba|“ba” construction|把字关系|把注意力转向市场|ba(转向,把)|
|dvpm|manner DE(地)modifier|地字修饰|有效地防止流失|dvpm(有效,地)|
|dvpmod|a “XP+DEV(i^),,phrase that modifies VP|地字动词短语|有效地防止流失|dvpmod(防止,有效)|
|prnmod|parenthetical modifier|插入词修饰|八五期间(1990-1995)|prnmod(期间,1995)|
|cop|copular|系动词|原是自给自足的经济|cop(自给自足,是)|
|pass|passive marker|被动标记|被认定为高技术产业|pass(认定,被)|
|nsubjpass|nominal passive subject|被动名词主语|镍被称作现代工业的维生素|nsubjpass(称作,镍)|
| xsubj | controlling subject |
2 changes: 2 additions & 0 deletions docs/api/hanlp/components/mtl/mtl.md
@@ -5,5 +5,7 @@
.. autoclass:: hanlp.components.mtl.multi_task_learning.MultiTaskLearning
:members:
:special-members:
:exclude-members: __init__, __repr__
```
12 changes: 6 additions & 6 deletions docs/data_format.md
@@ -31,7 +31,7 @@ field or a `tokens` field. The input to RESTful API is very flexible. It can be
```{eval-rst}
Additionally, fine-grained controls are performed with the arguments defined in
:meth:`hanlp_restful.HanLPClient.parse`.
```
```


#### Examples
@@ -44,7 +44,7 @@ curl -X POST "https://hanlp.hankcs.com/api/parse" \

### Model Input

The input format to models is specified per model and per tasks. Generally speaking, if a model has no tokenizer built in, then its input is
The input format to models is specified per model and per task. Generally speaking, if a model has no tokenizer built in, then its input is
a sentence in `list[str]` form (a list of tokens), or multiple such sentences nested in a `list`.

If a model has a tokenizer built in, each sentence is in `str` form.
@@ -74,8 +74,8 @@ HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None) # Fill in your a
print(HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。英首相与特朗普通电话讨论华为与苹果公司。'))
```

The output above is represented as a `json` dictionary where each key is a model name and its value is
the output of the corresponding model.
The output above is represented as a `json` dictionary where each key is a task name and its value is
the output of the corresponding task.
For each output, a nested `list` contains multiple sentences; otherwise, it is a single sentence.
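
The nested-versus-flat distinction can be checked mechanically. The following is a minimal sketch, assuming a tokenization output in which a single sentence is a flat list of strings and multiple sentences are a list of such lists. `as_sentences` is a hypothetical helper written for illustration and is not part of HanLP.

```python
from typing import List, Union

Sentence = List[str]


def as_sentences(tok: Union[Sentence, List[Sentence]]) -> List[Sentence]:
    """Return a tokenization output as a list of sentences.

    Assumption: a flat list of strings is one sentence, while a list of
    lists holds several sentences.
    """
    if tok and isinstance(tok[0], list):
        return tok  # already nested: multiple sentences
    return [tok]  # flat: wrap the single sentence


single = ['商品', '和', '服务']
multiple = [['商品', '和', '服务'], ['梅花', '盛开']]
print(as_sentences(single))
print(as_sentences(multiple))
```

Normalizing to one shape up front lets downstream code loop over sentences without branching on the input form. Other tasks nest differently (their elements are tuples rather than strings), so this check would need adapting for them.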

We adopt the following naming convention for NLP tasks: each task name consists of 3 letters.
@@ -95,10 +95,10 @@ Each NLP task can exploit multiple datasets with their annotations, see our [ann
| lem | Lemmatization. Each element is a lemma. | 词干提取 |
| fea | Features of Universal Dependencies. Each element is a feature. | 词法语法特征 |
| ner | Named Entity Recognition. Each element is a tuple of `(entity, type, begin, end)`, where `begin` and `end` are exclusive offsets. | 命名实体识别 |
| dep | Dependency Parsing. Each element is a tuple of `(head, relation)` where `head` starts with index `0` and `ROOT` has index `-1`. | 依存句法分析 |
| dep | Dependency Parsing. Each element is a tuple of `(head, relation)` where `head` starts with index `1` and `ROOT` has index `0`. | 依存句法分析 |
| con | Constituency Parsing. Each list is a bracketed constituent. | 短语成分分析 |
| srl | Semantic Role Labeling. Similar to `ner`, each element is a tuple of `(arg/pred, label, begin, end)`, where the predicate is labeled as `PRED`. | 语义角色标注 |
| sdp | Semantic Dependency Parsing. Similar to `dep`, however each token can have zero or zero or multiple heads and corresponding relations. | 语义依存分析 |
| sdp | Semantic Dependency Parsing. Similar to `dep`, however each token can have any number (including zero) of heads and corresponding relations. | 语义依存分析 |
| amr | Abstract Meaning Representation. Each AMR graph is represented as a list of logical triples. See [AMR guidelines](https://github.com/amrisi/amr-guidelines/blob/master/amr.md#example). | 抽象意义表示 |

When there are multiple models performing the same task, the keys are appended with a secondary identifier. For example, `tok/fine` and `tok/coarse` denote a fine-grained tokenization model and a coarse-grained one, respectively.
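
To make the `dep` convention concrete, here is a minimal sketch that pairs tokens with their `(head, relation)` tuples. The sample dictionary is hand-written for illustration and is not real model output; it follows the revised convention in this commit, where `head` is 1-based and `ROOT` has index `0`. `dep_arcs` is a hypothetical helper, not a HanLP function.

```python
# Hand-written sample in the documented shape; NOT real model output.
doc = {
    'tok': ['梅花', '盛开'],
    'dep': [(2, 'nsubj'), (0, 'root')],
}


def dep_arcs(tokens, dep):
    """Render (head, relation) tuples as relation(head, dependent) strings.

    Assumes 1-based head indices, with 0 denoting ROOT.
    """
    arcs = []
    for dependent, (head, relation) in zip(tokens, dep):
        head_word = 'ROOT' if head == 0 else tokens[head - 1]
        arcs.append(f'{relation}({head_word}, {dependent})')
    return arcs


print(dep_arcs(doc['tok'], doc['dep']))
# → ['nsubj(盛开, 梅花)', 'root(ROOT, 盛开)']
```

The rendered strings use the same `relation(head, dependent)` notation as the dependency-arc column of the Stanford Dependencies annotation table.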
10 changes: 6 additions & 4 deletions hanlp/common/dataset.py
@@ -29,7 +29,7 @@

class Transformable(ABC):
def __init__(self, transform: Union[Callable, List] = None) -> None:
"""An object which can be transformed with a list of functions. It can be imaged as an object being passed
"""An object which can be transformed with a list of functions. It can be treated as an object being passed
through a list of functions, while these functions are kept in a list.
Args:
@@ -46,7 +46,8 @@ def append_transform(self, transform: Callable):
Args:
transform: A new transform to be appended.
Returns: Itself.
Returns:
Itself.
"""
assert transform is not None, 'None transform not allowed'
@@ -67,7 +68,8 @@ def insert_transform(self, index: int, transform: Callable):
index: A certain position.
transform: A new transform.
Returns: Dataset itself.
Returns:
Itself.
"""
assert transform is not None, 'None transform not allowed'
@@ -95,7 +97,7 @@ def transform_sample(self, sample: dict, inplace=False) -> dict:
then 2 ``BOS`` will be inserted which might not be an intended result.
Returns:
Transformed sample.
"""
if not inplace:
sample = copy(sample)
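
The docstrings in this file describe a pipeline in which a sample is passed through a list of functions in order. The following is a minimal sketch of that idea, not HanLP's implementation: `TransformPipeline`, `lowercase`, and `add_length` are hypothetical names, and only the method names and the `None` check mirror the hunks above.

```python
from copy import copy
from typing import Callable, List


class TransformPipeline:
    """Toy stand-in illustrating the idea; NOT HanLP's Transformable."""

    def __init__(self, transform: List[Callable] = None) -> None:
        self.transform: List[Callable] = list(transform or [])

    def append_transform(self, transform: Callable):
        assert transform is not None, 'None transform not allowed'
        self.transform.append(transform)
        return self  # returning itself allows chaining

    def insert_transform(self, index: int, transform: Callable):
        assert transform is not None, 'None transform not allowed'
        self.transform.insert(index, transform)
        return self

    def transform_sample(self, sample: dict, inplace: bool = False) -> dict:
        if not inplace:
            sample = copy(sample)  # shallow copy keeps the caller's dict intact
        for each in self.transform:
            sample = each(sample)
        return sample


def lowercase(sample: dict) -> dict:
    # Rebinds the field to a new list instead of mutating the old one.
    sample['token'] = [t.lower() for t in sample['token']]
    return sample


def add_length(sample: dict) -> dict:
    sample['length'] = len(sample['token'])
    return sample


pipeline = TransformPipeline().append_transform(add_length).insert_transform(0, lowercase)
original = {'token': ['Hello', 'World']}
print(pipeline.transform_sample(original))
print(original)
```

Because the copy is shallow, a transform that mutates a nested value in place (for example, appending to the `token` list) would still affect the caller's sample. The transforms in this sketch avoid that by assigning new values.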
2 changes: 1 addition & 1 deletion hanlp/common/transform.py
@@ -99,7 +99,7 @@ def load_vocab(self, save_dir):
class VocabDict(SerializableDict):

def __init__(self, *args, **kwargs) -> None:
"""A dict holding :class:`hanlp.common.vocab.Vocab` instances. When used a transform, it transforms the field
"""A dict holding :class:`hanlp.common.vocab.Vocab` instances. When used as a transform, it transforms the field
corresponding to each :class:`hanlp.common.vocab.Vocab` into indices.
Args:
21 changes: 21 additions & 0 deletions hanlp/components/mtl/multi_task_learning.py
@@ -460,6 +460,18 @@ def predict(self,
tasks: Optional[Union[str, List[str]]] = None,
skip_tasks: Optional[Union[str, List[str]]] = None,
**kwargs) -> Document:
"""Predict on data.
Args:
data: A sentence or a list of sentences.
batch_size: Decoding batch size.
tasks: The tasks to predict.
skip_tasks: The tasks to skip.
**kwargs: Not used.
Returns:
A :class:`~hanlp_common.document.Document`.
"""
doc = Document()
if not data:
return doc
@@ -755,6 +767,15 @@ def __getitem__(self, task_name: str) -> Task:
return self.tasks[task_name]

def __delitem__(self, task_name: str):
"""Delete a task (and every resource it owns) from this component.
Args:
task_name: The name of the task to be deleted.
Examples:
>>> del mtl['dep'] # Delete dep from MTL
"""
del self.tasks[task_name]
del self.model.decoders[task_name]
del self._computation_graph[task_name]
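
The `__delitem__` hunk above removes a task from three containers at once so that no stale entry survives. Below is a minimal sketch of that pattern using plain dictionaries. `TaskRegistry` is a hypothetical class, and storing decoders and graph entries as dictionaries keyed by task name is an assumption made for illustration, not a description of HanLP's internals.

```python
class TaskRegistry:
    """Toy container; NOT HanLP's MultiTaskLearning."""

    def __init__(self, tasks: dict):
        self.tasks = dict(tasks)
        # Assumption for illustration: one decoder and one graph entry per task.
        self.decoders = {name: f'decoder:{name}' for name in tasks}
        self.computation_graph = {name: [] for name in tasks}

    def __getitem__(self, task_name: str):
        return self.tasks[task_name]

    def __delitem__(self, task_name: str):
        # Remove the task from every container so no stale entry remains.
        del self.tasks[task_name]
        del self.decoders[task_name]
        del self.computation_graph[task_name]


registry = TaskRegistry({'tok': 'tokenizer', 'dep': 'parser'})
del registry['dep']  # same syntax as the docstring example: del mtl['dep']
print(sorted(registry.tasks))
```

Implementing `__delitem__` lets callers use the built-in `del obj[key]` syntax while the object keeps its internal bookkeeping consistent.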
