Revise document

yaocl · Jan 11, 2021 · 08995a5
1 parent 67be9d1
commit 08995a5
Showing 7 changed files with 87 additions and 62 deletions.
diff --git a/README.md b/README.md
@@ -75,7 +75,7 @@ In particular, the Python `HanLPClient` can also be used as a callable function
 
 ## Train Your Own Models
 
-To write DL models is not hard, the real hard thing is to write a model able to reproduce the scores in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 9 minutes.
+To write DL models is not hard, the real hard thing is to write a model able to reproduce the scores in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.
 
 ```python
 tokenizer = TransformerTaggingTokenizer()

diff --git a/docs/annotations/dep/sd.md b/docs/annotations/dep/sd.md
@@ -23,6 +23,55 @@
 
 See also [Stanford typed dependencies manual](https://nlp.stanford.edu/software/dependencies_manual.pdf).
 
+## Chinese
+
+|Tag|Description|中文简称|例句|依存弧|
+| ---- | ---- | ---- | ---- | ---- |
+|nn|noun compound modifier|复合名词修饰|服务中心|nn(中心，服务）|
+|punct|punctuation|标点符号|海关统计表明，|punct(表明，，)|
+|nsubj|nominal subject|名词性主语|梅花盛开|nsubj (盛开，梅花）|
+|conj|conjunct (links two conjuncts)|连接性状语|设备和原材料|conj(原材料，设备）|
+|dobj|direct object|直接宾语|浦东颁布了七十一件文件|dobj(颁布，文件）|
+|advmod|adverbial modifier|副词性状语|部门先送上文件|advmod(送上，先）|
+|prep|prepositional modifier|介词性修饰语|在实践中逐步完善|prep(完善，在）|
+|nummod|number modifier|数词修饰语|七十一件文件|nummod(件，七十一）|
+|amod|adjectival modifier|形容词修饰语|跨世纪工程|amod(工程，跨世纪）|
+|pobj|prepositional object|介词性宾语|根据有关规定|pobj (根据，规定）|
+|rcmod|relative clause modifier|相关关系|不曾遇到过的情况|rcmod(情况，遇到）|
+|cpm|complementizer|补语|开发浦东的经济活动|cpm(开发，的）|
+|assm|associative marker|关联标记|企业的商品|assm(企业，的）|
+|assmod|associative modifier|关联修饰|企业的商品|assmod(商品，企业）|
+|cc|coordinating conjunction|并列关系|设备和原材料|cc(原材料，和）|
+|clf|classifier modifier|类别修饰|七十一件文件|clf(文件，件）|
+|ccomp|clausal complement|从句补充|银行决定先取得信用评级|ccomp(决定，取得）|
+|det|determiner|限定语|这些经济活动|det(活动，这些）|
+|lobj|localizer object|时间介词|近年来|lobj(来，近年）|
+|range|dative object that is a quantifier phrase|数量词间接宾语|成交药品一亿多元|range(成交，元）|
+|asp|aspect marker|时态标记|发挥了作用|asp(发挥，了）|
+|tmod|temporal modifier|时间修饰语|以前不曾遇到过|tmod(遇到，以前）|
+|plmod|localizer modifier of a preposition|介词性地点修饰|在这片热土上|plmod(在，上）|
+|attr|attributive|属性|贸易额为二百亿美元|attr(为，美元）|
+|mmod|modal verb modifier|情态动词|利益能得到保障|mmod(得到，能）|
+|loc|localizer|位置补语|占九成以上|loc(占，以上）|
+|top|topic|主题|建筑是主要活动|top(是，建筑）|
+|pccomp|clausal complement of a preposition|介词补语|据有关部门介绍|pccomp(据，介绍）|
+|etc|etc modifier|省略关系|科技、文教等领域|etc(文教，等）|
+|lccomp|clausal complement of a localizer|位置补语|中国对外开放中升起的明星|lccomp(中，开放）|
+|ordmod|ordinal number modifier|量词修饰|第七个机构|ordmod(个，第七）|
+|xsubj|controlling subject|控制主语|银行决定先取得信用评级|xsubj (取得，银行）|
+|neg|negative modifier|否定修饰|以前不曾遇到过|neg(遇到，不）|
+|rcomp|resultative complement|结果补语|研究成功|rcomp(研究，成功）|
+|comod|coordinated verb compound modifier|并列联合动词|颁布实行|comod(颁布，实行）|
+|vmod|verb modifier|动词修饰|其在支持外商企业方面的作用|vmod(方面，支持）|
+|prtmod|particles such as 所，以，来，而|小品词|在产业化所取得的成就|prtmod(取得，所）|
+|ba|“ba” construction|把字关系|把注意力转向市场|ba(转向，把）|
+|dvpm|manner DE(地）modifier|地字修饰|有效地防止流失|dvpm(有效，地）|
+|dvpmod|a "XP+DEV", phrase that modifies VP|地字动词短语|有效地防止流失|dvpmod(防止，有效）|
+|prnmod|parenthetical modifier|插入词修饰|八五期间（1990-1995）|prnmod(期间，1995)|
+|cop|copular|系动词|原是自给自足的经济|cop(自给自足，是）|
+|pass|passive marker|被动标记|被认定为高技术产业|pass(认定，被）|
+|nsubjpass|nominal passive subject|被动名词主语|镍被称作现代工业的维生素|nsubjpass(称作，镍）|
+
 ## English
 
 | Tag        | Description                       |
@@ -87,53 +136,4 @@ See also [Stanford typed dependencies manual](https://nlp.stanford.edu/software/
 | tmod       | temporal modifier                 |
 | vmod       | verb modifier                     |
 | xcomp      | open clausal complement           |
-| xsubj      | controlling subject               |
-
-## Chinese
-
-|Tag|Description|中文简称|例句|依存弧|
-| ---- | ---- | ---- | ---- | ---- |
-|nn|noun compound modifier|复合名词修饰|服务中心|nn(中心，服务）|
-|punct|punctuation|标点符号|海关统计表明，|punct(表明，，)|
-|nsubj|nominal subject|名词性主语|梅花盛开|nsubj (盛开，梅花）|
-|conj|conjunct (links two conjuncts)|连接性状语|设备和原材料|conj(原材料，设备）|
-|dobj|direct object|直接宾语|浦东颁布了七十一件文件|dobj(颁布，文件）|
-|advmod|adverbial modifier|副词性状语|部门先送上文件|advmod(送上，先）|
-|prep|prepositional modifier|介词性修饰语|在实践中逐步完善|prep(完善，在）|
-|nummod|number modifier|数词修饰语|七十一件文件|nummod(件，七十一）|
-|amod|adjectival modifier|形容词修饰语|跨世纪工程|amod(工程，跨世纪）|
-|pobj|prepositional object|介词性宾语|根据有关规定|pobj (根据，规定）|
-|rcmod|relative clause modifier|相关关系|不曾遇到过的情况|rcmod(情况，遇到）|
-|cpm|complementizer|补语|开发浦东的经济活动|cpm(开发，的）|
-|assm|associative marker|关联标记|企业的商品|assm(企业，的）|
-|assmod|associative modifier|关联修饰|企业的商品|assmod(商品，企业）|
-|cc|coordinating conjunction|并列关系|设备和原材料|cc(原材料，和）|
-|clf|classifier modifier|类别修饰|七十一件文件|clf(文件，件）|
-|ccomp|clausal complement|从句补充|银行决定先取得信用评级|ccomp(决定，取得）|
-|det|determiner|限定语|这些经济活动|det(活动，这些）|
-|lobj|localizer object|时间介词|近年来|lobj(来，近年）|
-|range|dative object that is a quantifier phrase|数量词间接宾语|成交药品一亿多元|range(成交，元）|
-|asp|aspect marker|时态标记|发挥了作用|asp(发挥，了）|
-|tmod|temporal modifier|时间修饰语|以前不曾遇到过|tmod(遇到，以前）|
-|plmod|localizer modifier of a preposition|介词性地点修饰|在这片热土上|plmod(在，上）|
-|attr|attributive|属性|贸易额为二百亿美元|attr(为，美元）|
-|mmod|modal verb modifier|情态动词|利益能得到保障|mmod(得到，能）|
-|loc|localizer|位置补语|占九成以上|loc(占，以上）|
-|top|topic|主题|建筑是主要活动|top(是，建筑）|
-|pccomp|clausal complement of a preposition|介词补语|据有关部门介绍|pccomp(据，介绍）|
-|etc|etc modifier|省略关系|科技、文教等领域|etc(文教，等）|
-|lccomp|clausal complement of a localizer|位置补语|中国对外开放中升起的明星|lccomp(中，开放）|
-|ordmod|ordinal number modifier|量词修饰|第七个机构|ordmod(个，第七）|
-|xsubj|controlling subject|控制主语|银行决定先取得信用评级|xsubj (取得，银行）|
-|neg|negative modifier|否定修饰|以前不曾遇到过|neg(遇到，不）|
-|rcomp|resultative complement|结果补语|研究成功|rcomp(研究，成功）|
-|comod|coordinated verb compound modifier|并列联合动词|颁布实行|comod(颁布，实行）|
-|vmod|verb modifier|动词修饰|其在支持外商企业方面的作用|vmod(方面，支持）|
-|prtmod|particles such as 所，以，来，而|小品词|在产业化所取得的成就|prtmod(取得，所）|
-|ba|“ba” construction|把字关系|把注意力转向市场|ba(转向，把）|
-|dvpm|manner DE(地）modifier|地字修饰|有效地防止流失|dvpm(有效，地）|
-|dvpmod|a “XP+DEV(i^)，，phrase that modifies VP|地字动词短语|有效地防止流失|dvpmod(防止，有效）|
-|prnmod|parenthetical modifier|插入词修饰|八五期间（1990-1995）|prnmod(期间，1995)|
-|cop|copular|系动词|原是自给自足的经济|cop(自给自足，是）|
-|pass|passive marker|被动标记|被认定为高技术产业|pass(认定，被）|
-|nsubjpass|nominal passive subject|被动名词主语|镍被称作现代工业的维生素|nsubjpass(称作，镍）|
+| xsubj      | controlling subject               |
diff --git a/docs/api/hanlp/components/mtl/mtl.md b/docs/api/hanlp/components/mtl/mtl.md
@@ -5,5 +5,7 @@
 
 .. autoclass:: hanlp.components.mtl.multi_task_learning.MultiTaskLearning
 	:members:
+	:special-members:
+	:exclude-members: __init__, __repr__
 
 ```
diff --git a/docs/data_format.md b/docs/data_format.md
@@ -31,7 +31,7 @@ field or a `tokens` field. The input to RESTful API is very flexible. It can be
 ```{eval-rst}
 Additionally, fine-grained controls are performed with the arguments defined in 
 :meth:`hanlp_restful.HanLPClient.parse`.
-``` 
+```
 
 
 #### Examples
@@ -44,7 +44,7 @@ curl -X POST "https://hanlp.hankcs.com/api/parse" \
 
 ### Model Input
 
-The input format to models is specified per model and per tasks. Generally speaking, if a model has no tokenizer built in, then its input is
+The input format to models is specified per model and per task. Generally speaking, if a model has no tokenizer built in, then its input is
 a sentence in `list[str]` form (a list of tokens), or multiple such sentences nested in a `list`.
 
 If a model has a tokenizer built in, each sentence is in `str` form. 
@@ -74,8 +74,8 @@ HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None)  # Fill in your a
 print(HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。英首相与特朗普通电话讨论华为与苹果公司。'))
 ```
 
-The outputs above is represented as a `json` dictionary where each key is a model name and its value is 
-the output of the corresponding model.
+The output above is represented as a `json` dictionary where each key is a task name and its value is
+the output of the corresponding task.
 For each output, if it's a nested `list` then it contains multiple sentences otherwise it's just one single sentence.
 
 We adopt the following naming convention for NLP tasks: each name consists of 3 letters.
@@ -95,10 +95,10 @@ Each NLP task can exploit multiple datasets with their annotations, see our [ann
 | lem  | Lemmatization. Each element is a lemma.                      | 词干提取     |
 | fea  | Features of Universal Dependencies. Each element is a feature. | 词法语法特征 |
 | ner  | Named Entity Recognition. Each element is a tuple of `(entity, type, begin, end)`, where `begin` and `end` are exclusive offsets. | 命名实体识别 |
-| dep  | Dependency Parsing. Each element is a tuple of `(head, relation)` where `head` starts with index `0` and `ROOT` has index `-1`. | 依存句法分析 |
+| dep  | Dependency Parsing. Each element is a tuple of `(head, relation)` where `head` starts with index `1` and `ROOT` has index `0`. | 依存句法分析 |
 | con  | Constituency Parsing. Each list is a bracketed constituent.  | 短语成分分析 |
 | srl  | Semantic Role Labeling. Similar to `ner`, each element is tuple (arg/pred, label, begin, end), where the predicate is labeled as `PRED`. | 语义角色标注 |
-| sdp  | Semantic Dependency Parsing. Similar to `dep`, however each token can have zero or zero or multiple heads and corresponding relations. | 语义依存分析 |
+| sdp  | Semantic Dependency Parsing. Similar to `dep`, however each token can have any number (including zero) of heads and corresponding relations. | 语义依存分析 |
 | amr  | Abstract Meaning Representation. Each AMR graph is represented as list of logical triples. See [AMR guidelines](https://github.com/amrisi/amr-guidelines/blob/master/amr.md#example). | 抽象意义表示 |
 
 When there are multiple models performing the same task, the keys are appended with a secondary identifier. For example, `tok/fine` and `tok/coarse` mean a fine-grained tokenization model and a coarse-grained one.
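
The revised `dep` row above states that `head` is 1-based and `ROOT` has index `0`. The sketch below shows how such output could be read under that convention. The helper name `heads_to_arcs` and the hand-written `tokens` and `dep` values are assumptions for illustration only; they are not real model output and not part of HanLP's API.

```python
from typing import List, Tuple


def heads_to_arcs(tokens: List[str],
                  dep: List[Tuple[int, str]]) -> List[Tuple[str, str, str]]:
    """Convert per-token (head, relation) pairs into (dependent, relation, head) triples.

    Assumes the convention described in the table: heads are 1-based token
    indices and a head of 0 denotes ROOT.
    """
    arcs = []
    for token, (head, relation) in zip(tokens, dep):
        # Index 0 is reserved for ROOT, so real tokens are shifted by one.
        head_word = "ROOT" if head == 0 else tokens[head - 1]
        arcs.append((token, relation, head_word))
    return arcs


# Hand-written illustrative data (not produced by a model).
tokens = ["梅花", "盛开"]
dep = [(2, "nsubj"), (0, "root")]
arcs = heads_to_arcs(tokens, dep)
print(arcs)
```

With the old 0-based convention the same lookup would have been `tokens[head]` with `-1` for `ROOT`, so code consuming `dep` output may need the one-position shift shown here.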
diff --git a/hanlp/common/dataset.py b/hanlp/common/dataset.py
@@ -29,7 +29,7 @@
 
 class Transformable(ABC):
     def __init__(self, transform: Union[Callable, List] = None) -> None:
-        """An object which can be transformed with a list of functions. It can be imaged as an objected being passed
+        """An object which can be transformed with a list of functions. It can be treated as an object being passed
         through a list of functions, while these functions are kept in a list.
 
         Args:
@@ -46,7 +46,8 @@ def append_transform(self, transform: Callable):
         Args:
             transform: A new transform to be appended.
 
-        Returns: Itself.
+        Returns:
+            Itself.
 
         """
         assert transform is not None, 'None transform not allowed'
@@ -67,7 +68,8 @@ def insert_transform(self, index: int, transform: Callable):
             index: A certain position.
             transform: A new transform.
 
-        Returns: Dataset itself.
+        Returns:
+            Itself.
 
         """
         assert transform is not None, 'None transform not allowed'
@@ -95,7 +97,7 @@ def transform_sample(self, sample: dict, inplace=False) -> dict:
             then 2 ``BOS`` will be inserted which might not be an intended result.
 
         Returns:
-
+            Transformed sample.
         """
         if not inplace:
             sample = copy(sample)

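The docstrings touched in this file describe a chain of transforms in which `append_transform` and `insert_transform` return the object itself and `transform_sample` returns the transformed sample. A minimal sketch of that pattern follows. The class `TransformPipeline` and the two toy transforms are hypothetical stand-ins, not HanLP's `Transformable`; only the assertion message and the `copy` on `inplace=False` are taken from the diff context.

```python
from copy import copy
from typing import Callable, List, Optional


class TransformPipeline:
    """Hypothetical stand-in that mimics the documented behaviour."""

    def __init__(self, transforms: Optional[List[Callable[[dict], dict]]] = None) -> None:
        self.transforms: List[Callable[[dict], dict]] = list(transforms or [])

    def append_transform(self, transform: Callable[[dict], dict]) -> "TransformPipeline":
        assert transform is not None, 'None transform not allowed'
        self.transforms.append(transform)
        return self  # returning itself allows call chaining

    def insert_transform(self, index: int, transform: Callable[[dict], dict]) -> "TransformPipeline":
        assert transform is not None, 'None transform not allowed'
        self.transforms.insert(index, transform)
        return self

    def transform_sample(self, sample: dict, inplace: bool = False) -> dict:
        if not inplace:
            sample = copy(sample)  # shallow copy so the caller's dict is untouched
        for transform in self.transforms:
            sample = transform(sample)
        return sample


def lowercase(sample: dict) -> dict:
    # Rebinds the field to a new list instead of mutating the old one.
    sample['token'] = [t.lower() for t in sample['token']]
    return sample


def add_bos(sample: dict) -> dict:
    sample['token'] = ['<bos>'] + sample['token']
    return sample


pipeline = TransformPipeline().append_transform(add_bos).insert_transform(0, lowercase)
original = {'token': ['Hello', 'World']}
result = pipeline.transform_sample(original)
print(result)
print(original)
```

Because the copy is shallow, a transform that mutated a list in place would still affect the caller's sample; the toy transforms avoid this by assigning new lists.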
diff --git a/hanlp/common/transform.py b/hanlp/common/transform.py
@@ -99,7 +99,7 @@ def load_vocab(self, save_dir):
 class VocabDict(SerializableDict):
 
     def __init__(self, *args, **kwargs) -> None:
-        """A dict holding :class:`hanlp.common.vocab.Vocab` instances. When used a transform, it transforms the field
+        """A dict holding :class:`hanlp.common.vocab.Vocab` instances. When used as a transform, it transforms the field
         corresponding to each :class:`hanlp.common.vocab.Vocab` into indices.
 
         Args:

diff --git a/hanlp/components/mtl/multi_task_learning.py b/hanlp/components/mtl/multi_task_learning.py
@@ -460,6 +460,18 @@ def predict(self,
                 tasks: Optional[Union[str, List[str]]] = None,
                 skip_tasks: Optional[Union[str, List[str]]] = None,
                 **kwargs) -> Document:
+        """Predict on data.
+
+        Args:
+            data: A sentence or a list of sentences.
+            batch_size: Decoding batch size.
+            tasks: The tasks to predict.
+            skip_tasks: The tasks to skip.
+            **kwargs: Not used.
+
+        Returns:
+            A :class:`~hanlp_common.document.Document`.
+        """
         doc = Document()
         if not data:
             return doc
@@ -755,6 +767,15 @@ def __getitem__(self, task_name: str) -> Task:
         return self.tasks[task_name]
 
     def __delitem__(self, task_name: str):
+        """Delete a task (and every resource it owns) from this component.
+
+        Args:
+            task_name: The name of the task to be deleted.
+
+        Examples:
+            >>> del mtl['dep']  # Delete dep from MTL
+
+        """
         del self.tasks[task_name]
         del self.model.decoders[task_name]
         del self._computation_graph[task_name]
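
The new docstrings describe two behaviours: `del mtl['dep']` removes a task together with the resources it owns, and `predict` accepts `tasks` and `skip_tasks` as either a string or a list of strings. The sketch below illustrates both ideas on a toy container. `TinyMTL`, its `decoders` dict, and `select_tasks` are hypothetical names chosen for this example; the real component also handles dependencies between tasks, which this sketch ignores.

```python
from typing import Dict, List, Optional, Union


def _as_list(value: Optional[Union[str, List[str]]]) -> List[str]:
    """Normalise None, a single name, or a list of names into a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


class TinyMTL:
    """Toy stand-in for a multi-task component (not HanLP's implementation)."""

    def __init__(self, task_names: List[str]) -> None:
        # Each task owns one entry in every per-task registry.
        self.tasks: Dict[str, str] = {n: f"task:{n}" for n in task_names}
        self.decoders: Dict[str, str] = {n: f"decoder:{n}" for n in task_names}

    def __getitem__(self, task_name: str) -> str:
        return self.tasks[task_name]

    def __delitem__(self, task_name: str) -> None:
        # Remove the task from every registry so no resource is left behind.
        del self.tasks[task_name]
        del self.decoders[task_name]

    def select_tasks(self,
                     tasks: Optional[Union[str, List[str]]] = None,
                     skip_tasks: Optional[Union[str, List[str]]] = None) -> List[str]:
        # No explicit request means "run everything that is registered".
        wanted = _as_list(tasks) or list(self.tasks)
        skipped = set(_as_list(skip_tasks))
        return [name for name in wanted if name not in skipped]


mtl = TinyMTL(['tok', 'pos', 'dep'])
del mtl['dep']  # mirrors the docstring example
remaining = list(mtl.tasks)
selected = mtl.select_tasks(skip_tasks='pos')
print(remaining, selected)
```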