Ellipses and design decision #29

Open
christian-storm opened this issue Apr 20, 2017 · 4 comments

Comments

@christian-storm

I've been testing the ellipsis rules with . . . replaced by U+2026 (…), and pragmatic segmenter fails when given the actual ellipsis character. I'm probably missing something, but shouldn't ellipsis.rb contain rules for that character?

This raises a bigger question: how are all the variants of each symbol covered? I notice that certain end punctuation characters are explicitly defined, e.g., U+FF1F (？) in punctuation_replacer.rb. However, there are many Unicode characters that could stand in for their ASCII equivalents, e.g., U+FE56 (﹖), U+FE16 (︖), etc. for question marks or U+2047 (⁇), U+2048 (⁈), etc. for double end punctuation, and so on for all symbols that are used in segmenting decisions, e.g., (), [], -, ., ...
Chasing all these down seems like a nightmare!

Wouldn't it make sense to convert everything to ASCII (i.e., unidecode), segment, and then replace the decoded characters with their original characters? This assumes that all 'equivalent' characters have the same meaning, but I believe they do, e.g., ፧ is the Ethiopic question mark, which carries the same linguistic meaning as the question mark in English. If some do not, those could be the exceptions rather than the rule.
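The convert, segment, restore idea above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not how the gem works: `ASCII_STAND_INS`, `to_ascii_copy`, `toy_boundaries`, and `segment_preserving_original` are hypothetical names, the map is a tiny hand-picked sample rather than a full unidecode table, and `toy_boundaries` is a toy regex stand-in for a real segmenter. The point it shows is that if every replacement is exactly one character long, the offsets found in the ASCII copy can be used to slice the original text directly.

```ruby
# Hypothetical 1:1 map: each key is replaced by exactly one ASCII character,
# so character offsets in the ASCII copy match offsets in the original.
ASCII_STAND_INS = {
  "\uFF1F" => "?", # FULLWIDTH QUESTION MARK
  "\uFE56" => "?", # SMALL QUESTION MARK
  "\uFE16" => "?", # PRESENTATION FORM FOR VERTICAL QUESTION MARK
  "\u1367" => "?", # ETHIOPIC QUESTION MARK
  "\uFF01" => "!", # FULLWIDTH EXCLAMATION MARK
  "\uFF0E" => "."  # FULLWIDTH FULL STOP
}.freeze

# Build an ASCII-punctuation copy with the same number of characters.
def to_ascii_copy(text)
  text.each_char.map { |ch| ASCII_STAND_INS.fetch(ch, ch) }.join
end

# Toy stand-in for a real segmenter (an assumption for this sketch):
# returns [start, length] character ranges, ending a segment at a run of
# . ? ! that is followed by whitespace or the end of the string.
def toy_boundaries(ascii_text)
  ranges = []
  ascii_text.scan(/\S.*?(?:[.?!]+(?=\s|\z)|\z)/m) do
    match = Regexp.last_match
    ranges << [match.begin(0), match[0].length]
  end
  ranges
end

# Segment the ASCII copy, then slice the ORIGINAL text at the same offsets,
# so the original characters come back without a separate restore pass.
def segment_preserving_original(text)
  ascii = to_ascii_copy(text)
  toy_boundaries(ascii).map { |start, len| text[start, len] }
end

text = "Is this a question\uFF1F Yes, it is\uFF0E Fine"
puts segment_preserving_original(text)
```

Slicing the original by offset avoids having to track which characters were decoded. It only works while the map stays strictly one-to-one in length; a character such as U+2026, whose natural ASCII form is three periods, would need an offset table instead.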

I would love to hear your thoughts.

Thanks for the great library. In my testing it performs better than spaCy, segtok, CoreNLP, and Punkt on English Wikipedia data.

@christian-storm
Author

Digging a bit deeper into this, it seems that some symbols do not map to equivalent characters, e.g., dec=172 unicode=¬ ascii=! (the mathematical not symbol). A better approach would be to convert only the Unicode characters that influence segmentation to their ASCII equivalents, e.g., the applicable FULL STOP variants, as you do in punctuation_replacer.rb.
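A minimal sketch of that narrower approach, assuming a hand-maintained whitelist: only characters that can affect a segmentation decision are rewritten, and everything else (including ¬) is left alone. `SEGMENTATION_PUNCTUATION` and `normalize_segmentation_punctuation` are hypothetical names for illustration, the list is deliberately incomplete, and none of this is taken from punctuation_replacer.rb.

```ruby
# Hypothetical whitelist: Unicode punctuation that can influence segmentation,
# mapped to an ASCII equivalent. Anything not listed is left untouched.
SEGMENTATION_PUNCTUATION = {
  "\u2026" => "...", # HORIZONTAL ELLIPSIS
  "\uFF1F" => "?",   # FULLWIDTH QUESTION MARK
  "\uFE56" => "?",   # SMALL QUESTION MARK
  "\uFE16" => "?",   # PRESENTATION FORM FOR VERTICAL QUESTION MARK
  "\u2047" => "??",  # DOUBLE QUESTION MARK
  "\u2048" => "?!",  # QUESTION EXCLAMATION MARK
  "\uFF0E" => "."    # FULLWIDTH FULL STOP
}.freeze

# One regex that matches any whitelisted character.
SEGMENTATION_PATTERN = Regexp.union(SEGMENTATION_PUNCTUATION.keys)

# Replace only whitelisted characters; gsub looks each match up in the hash.
def normalize_segmentation_punctuation(text)
  text.gsub(SEGMENTATION_PATTERN, SEGMENTATION_PUNCTUATION)
end

sample = "Wait\u2026 what\u2047 \u00ACp holds"
puts normalize_segmentation_punctuation(sample)
```

Some replacements change the string length (U+2026 becomes three characters), so mapping segment boundaries back onto the original text would need an offset table; that step is omitted here.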

Thanks ahead of time for any feedback.

@diasks2
Owner

diasks2 commented Apr 20, 2017

Hi Christian,

Thanks for checking out the library and thanks for the feedback and ideas. If you could provide some sample sentences where the ellipsis is failing that would be helpful. I'll add those to the test suite and update the gem.

More generally, to answer your question:

"However, there are many Unicode characters that could stand in for their ASCII equivalents, e.g., U+FE56 (﹖), U+FE16 (︖), etc. for question marks or U+2047 (⁇), U+2048 (⁈), etc. for double end punctuation and so on for all symbols that are used in segmenting decisions, e.g., (), [], -, ., ... Chasing all these down seems like a nightmare!"

My goal with this gem is/was to use it on common texts (i.e., things you would find on Wikipedia, but not necessarily things you would find on Twitter), so I only went so far down the rabbit hole. Characters that are not so commonly used are not yet accounted for. I tried to be pragmatic ;-)

That said, I'm happy to accept any PR that would help the gem handle a wider range of Unicode characters that might influence segmenting decisions. Even just a list of failing test cases that I can add to the test suite and then get to when I can would be helpful and appreciated.

@arademaker

Do you have any paper describing the techniques used in the tool?

@diasks2
Owner

diasks2 commented Feb 21, 2020

"Do you have any paper describing the techniques used in the tool?"

No. It is mainly regular expressions. The README is the best resource for information.


3 participants