update dataset..
shibing624 committed Mar 1, 2022
1 parent 4317052 commit 74e38ba
Showing 3 changed files with 37 additions and 3 deletions.
1 change: 1 addition & 0 deletions CITATION.cff
@@ -3,6 +3,7 @@ message: "If you use this software, please cite it as below."
 authors:
 - family-names: "Xu"
   given-names: "Ming"
+  orcid: "https://orcid.org/0000-0003-3402-7159"
 title: "Text2vec: Text to vector toolkit"
 url: "https://github.com/shibing624/text2vec"
 data-released: 2022-02-27
35 changes: 34 additions & 1 deletion README.md
@@ -179,9 +179,42 @@ python3 setup.py install
```

### Datasets
The Chinese semantic matching datasets have been uploaded to Hugging Face Datasets: [https://huggingface.co/datasets/shibing624/nli_zh](https://huggingface.co/datasets/shibing624/nli_zh)

Example usage:
```shell
pip3 install datasets
```

```python
from datasets import load_dataset

dataset = load_dataset("shibing624/nli_zh", "STS-B")
print(dataset)
print(dataset['test'][0])
```

output:
```shell
DatasetDict({
train: Dataset({
features: ['sentence1', 'sentence2', 'label'],
num_rows: 5231
})
validation: Dataset({
features: ['sentence1', 'sentence2', 'label'],
num_rows: 1458
})
test: Dataset({
features: ['sentence1', 'sentence2', 'label'],
num_rows: 1361
})
})
{'sentence1': '一个女孩在给她的头发做发型。', 'sentence2': '一个女孩在梳头。', 'label': 2}
```
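
Each row has the shape shown in the output above: `sentence1`, `sentence2`, and an integer `label`. As a minimal sketch, assuming the STS-B labels lie on a 0 to 5 scale (the sample only shows `label: 2`; the scale is an assumption here), the hypothetical helpers `normalize_label` and `to_pairs` below rescale labels to [0, 1] so they can be compared with cosine-similarity scores. The rows are plain dicts, so nothing needs to be downloaded to try it.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[str, int]]

# Assumption: STS-B labels range from 0 (unrelated) to 5 (equivalent).
MAX_LABEL = 5


def normalize_label(label: int, max_label: int = MAX_LABEL) -> float:
    """Rescale an integer label in [0, max_label] to a float in [0, 1]."""
    if not 0 <= label <= max_label:
        raise ValueError(f"label {label} outside [0, {max_label}]")
    return label / max_label


def to_pairs(rows: List[Row]) -> List[Tuple[str, str, float]]:
    """Turn dataset rows into (sentence1, sentence2, normalized_label) tuples."""
    return [
        (str(row["sentence1"]), str(row["sentence2"]), normalize_label(int(row["label"])))
        for row in rows
    ]


# The row printed in the output above.
sample = {"sentence1": "一个女孩在给她的头发做发型。", "sentence2": "一个女孩在梳头。", "label": 2}
pairs = to_pairs([sample])
print(pairs[0][2])  # 0.4
```

The same `to_pairs` call should work on any iterable of rows with these three fields, for example a list built from `dataset['test']`.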

Common Chinese semantic matching datasets, covering five tasks: [ATEC](https://github.com/IceFlameWorm/NLP_Datasets/tree/master/ATEC), [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm), [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html), [PAWSX](https://arxiv.org/abs/1908.11828), and [STS-B](https://github.com/pluto-junzeng/CNSD).
You can download each dataset from its link above, or from [Baidu Netdisk (extraction code: qkt6)](https://pan.baidu.com/s/1d6jSiU1wHQAEMWJi7JJWCQ).

The senteval_cn directory collects the evaluation datasets, and senteval_cn.zip is an archive of the senteval directory; downloading either one is enough.

# Usage
4 changes: 2 additions & 2 deletions text2vec/similarity.py
@@ -102,7 +102,7 @@ def get_score(self, sentence1: str, sentence2: str) -> float:

     def get_scores(
             self, sentences1: List[str], sentences2: List[str], only_aligned: bool = False
-    ) -> Union[List[Tensor], ndarray, Tensor, None]:
+    ) -> ndarray:
         """
         Get similarity scores between sentences1 and sentences2
         :param sentences1: list, sentence1 list
@@ -111,7 +111,7 @@ def get_scores(
         :return: return: Matrix with res[i][j] = cos_sim(a[i], b[j])
         """
         if not sentences1 or not sentences2:
-            return None
+            return np.array([])
         if only_aligned and len(sentences1) != len(sentences2):
             logger.warning('Sentences size not equal, auto set is_aligned=False')
             only_aligned = False
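
With this change, `get_scores` returns an empty array instead of `None` when either input list is empty, so a caller's `is None` check would no longer fire. The sketch below shows one way a caller might test for the empty case with `ndarray.size`. `best_match` is a hypothetical helper, not part of text2vec, and the score matrix is built by hand rather than by calling the library.

```python
from typing import Optional

import numpy as np


def best_match(scores: np.ndarray) -> Optional[int]:
    """Return the column index of the highest score in row 0, or None if empty.

    `scores` stands in for the matrix described in the docstring above,
    where scores[i][j] is the similarity of sentences1[i] and sentences2[j].
    """
    # An empty result is now an ndarray with zero elements, not None,
    # so check the element count rather than identity with None.
    if scores.size == 0:
        return None
    return int(np.argmax(scores[0]))


print(best_match(np.array([])))                 # None
print(best_match(np.array([[0.1, 0.9, 0.3]])))  # 1
```

Checking `size` also avoids relying on the truth value of an array, which NumPy treats as ambiguous for arrays that do not hold exactly one element.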