Ellipses and design decision #29
Comments
Digging a bit deeper into this, it seems that some symbols do not map to equivalent characters, e.g., dec=172 unicode=¬ ascii=! (the mathematical NOT symbol). A better approach would be to convert only the Unicode characters that influence segmentation to their ASCII equivalents, e.g., the applicable FULL STOP variants, as you do in punctuation_replacer.rb. Thanks ahead of time for any feedback.
Hi Christian, Thanks for checking out the library and thanks for the feedback and ideas. If you could provide some sample sentences where the ellipsis is failing that would be helpful. I'll add those to the test suite and update the gem. More generally, to answer your question:
My goal with this gem is/was to use it on common texts (i.e. things you would find on Wikipedia, but not necessarily things you would find on Twitter), so I only went so far down the rabbit hole. Characters that are not so commonly used are not yet accounted for. I tried to be pragmatic ;-) Happy to accept any PR, though, that would help improve the gem so that it can handle a wider range of Unicode characters that might influence segmenting decisions. Even just a list of failing test cases that I can add to the test suite and then get to when I can would be helpful and appreciated.
Do you have any paper describing the techniques used in the tool?
No. It is mainly regular expressions. The README is the best resource for information.
I've been testing the ellipsis rules with . . . replaced with U+2026 (…) and find that pragmatic segmenter fails when given the actual ellipsis character. I'm probably missing something but shouldn't ellipsis.rb contain rules for the actual ellipsis character?
This brings up a bigger question of how all the variants of symbols are covered. I notice that certain end punctuation characters are explicitly defined, e.g., U+FF1F (？) in punctuation_replacer.rb. However, there are many Unicode characters that could stand in for their ASCII equivalents, e.g., U+FE56 (﹖), U+FE16 (︖), etc. for question marks or U+2047 (⁇), U+2048 (⁈), etc. for double end punctuation and so on for all symbols that are used in segmenting decisions, e.g., (), [], -, ., ...
Chasing all these down seems like a nightmare!
Couldn't it make sense to convert everything to ASCII, i.e., unidecode, segment, and then replace the decoded characters with their original characters? This assumes that all 'equivalent' characters have the same meaning, but I believe they do, e.g., ፧ is the Ethiopic question mark, which carries the same linguistic meaning as in English. If not, those could be the exceptions rather than the rule.
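To make the proposal concrete, here is a minimal Ruby sketch of the convert, segment, restore round trip. Everything in it is an assumption for illustration: the `ASCII_STAND_INS` table, the helper names, and the naive regex segmenter are hypothetical and are not part of pragmatic_segmenter. The table is deliberately restricted to one-character-to-one-character replacements so that character offsets in the ASCII copy line up with the original text.

```ruby
# Hypothetical table of Unicode punctuation => single ASCII stand-in.
# Each entry maps one character to one character, so offsets are preserved.
ASCII_STAND_INS = {
  "\u2026" => ".", # HORIZONTAL ELLIPSIS
  "\uFF1F" => "?", # FULLWIDTH QUESTION MARK
  "\uFE56" => "?", # SMALL QUESTION MARK
  "\uFE16" => "?", # PRESENTATION FORM FOR VERTICAL QUESTION MARK
  "\u2047" => "?", # DOUBLE QUESTION MARK
  "\u2048" => "?", # QUESTION EXCLAMATION MARK
  "\u1367" => "?"  # ETHIOPIC QUESTION MARK
}.freeze

# Replace every character listed in the table with its ASCII stand-in.
def to_ascii(text)
  text.gsub(Regexp.union(ASCII_STAND_INS.keys), ASCII_STAND_INS)
end

# Stand-in for a real segmenter (NOT pragmatic_segmenter's rules):
# a sentence starts at a non-space character and ends at a run of
# ., ! or ? that is followed by whitespace or the end of the input.
# Returns character ranges instead of strings.
def naive_sentence_ranges(ascii_text)
  ranges = []
  ascii_text.scan(/\S.*?(?:[.!?]+(?=\s|\z)|\z)/m) do
    match = Regexp.last_match
    ranges << (match.begin(0)...match.end(0))
  end
  ranges
end

# Segment on the ASCII copy, then slice the ORIGINAL text with the same
# ranges so the original Unicode characters come back untouched.
def segment_preserving_original(text)
  naive_sentence_ranges(to_ascii(text)).map { |range| text[range] }
end

puts segment_preserving_original("Is it true\uFF1F Wait\u2026 Yes.")
```

The one-to-one restriction is the main simplification here. A full unidecode-style conversion can expand a single character into several (for example, an ellipsis into three full stops), in which case the restore step would need an explicit offset map between the two strings rather than shared ranges.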
I would love to hear your thoughts.
Thanks for the great library. From my testing, it performs better than spaCy, segtok, CoreNLP, and Punkt on English Wikipedia data.