Commit

Fix splitting of reduplicated characters in multilingual text HIT-SCIR#478
AlongWY committed Feb 27, 2021
1 parent e5eef92 commit 1f147a0
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion ltp/frontend.py
@@ -281,7 +281,11 @@ def seg(self, inputs: Union[List[str], List[List[str]]], truncation: bool = True
 for source_text, length, encoding, seg_tag, preffix in \
         zip(inputs, lengths, tokenized.encodings, segment_output, batch_prefix):
     offsets = encoding.offsets[1:length + 1]
-    text = [source_text[start:end] for start, end in offsets]
+    text = []
+    last_offset = None
+    for start, end in offsets:
+        text.append('' if last_offset == (start, end) else source_text[start:end])
+        last_offset = (start, end)

     for idx in range(1, length):
         current_beg = offsets[idx][0]
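
The effect of the change can be tried in isolation. The following is a minimal sketch, not code from the repository: it assumes the tokenizer sometimes emits several consecutive tokens that share the same (start, end) offset for a single source character. The function names `slice_naive` and `slice_dedup`, the sample string, and the offsets are all hypothetical and were not produced by a real tokenizer.

```python
from typing import List, Tuple

Offset = Tuple[int, int]


def slice_naive(source_text: str, offsets: List[Offset]) -> List[str]:
    """Old behaviour: slice the source once per token offset."""
    return [source_text[start:end] for start, end in offsets]


def slice_dedup(source_text: str, offsets: List[Offset]) -> List[str]:
    """New behaviour: emit '' when an offset repeats the previous one."""
    text = []
    last_offset = None
    for start, end in offsets:
        # A token whose span equals the previous token's span would copy
        # the same source characters again, so it contributes nothing.
        text.append('' if last_offset == (start, end) else source_text[start:end])
        last_offset = (start, end)
    return text


# Hypothetical offsets: the middle character is covered by two tokens.
source = "axb"
offsets = [(0, 1), (1, 2), (1, 2), (2, 3)]

naive = slice_naive(source, offsets)
dedup = slice_dedup(source, offsets)

print("".join(naive))  # the middle character appears twice
print("".join(dedup))  # matches the source text
```

Both lists keep one entry per token, so indices into `offsets` still line up with indices into `text`; only the duplicated slice is replaced by an empty string.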