Another Unicode issue #15
Comments
The proper, fully compatible solution that would resolve most if not all issues with Unicode is to rewrite all substring handling to use the […] However, the public […]

[EDIT] It looks like this wouldn't be too difficult a change with the Ukkonen trie, if you go about it naively and just replace regular […] The downsides are that it would probably murder at least construction performance, and that the […]

Might be better off converting all strings once into a dedicated data structure operating at the grapheme level, though. That would certainly keep the code more maintainable.
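To make the "dedicated data structure operating at the grapheme level" idea concrete, here is a rough sketch in Python (not the library's language, and not its API). The function `to_clusters` is a hypothetical helper, and it assumes a cluster is simply a base character plus any combining marks that follow it. Real grapheme segmentation (UAX #29) has more rules than this.

```python
import unicodedata


def to_clusters(text):
    """Split text into approximate grapheme clusters.

    Assumption: a cluster is one base character plus any combining
    marks that follow it. Full UAX #29 segmentation also covers emoji
    sequences, Hangul jamo and more, which this sketch ignores.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the preceding cluster.
            clusters[-1] += ch
        else:
            # Base character (or a stray leading mark): start a new cluster.
            clusters.append(ch)
    return tuple(clusters)


word = to_clusters("i\u0300x")
print(len("i\u0300x"), len(word))  # 3 code points, 2 clusters
```

A trie keyed on the elements of such a tuple would never split an accent from its base character, at the cost of one conversion pass per string.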
For me the thing throws OutOfBoundsExceptions as soon as I try to construct something that has any special characters in it. And all my sources are in ISO-8859-1.
@holopoj, preparing the text for the trie before adding and before searching is a good workaround. I will work with the Basic Multilingual Plane, which contains characters for almost all modern languages and a large number of symbols:
Now this code works: it shows 2 and 1. The second item still has the double code point grapheme:
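As an illustration of the "prepare the text before adding and before searching" workaround, here is a minimal Python sketch. It is not the commenter's code and does not reproduce their output. `PrefixIndex` is a hypothetical dictionary-based stand-in for the trie, and NFC normalization is assumed to be the preparation step.

```python
import unicodedata


def prepare(text):
    # NFC composes base + combining mark into a single code point
    # wherever a precomposed form exists.
    return unicodedata.normalize("NFC", text)


class PrefixIndex:
    """Hypothetical stand-in for the trie: maps every prefix to its values."""

    def __init__(self):
        self._values = {}

    def add(self, key, value):
        key = prepare(key)  # normalize on the way in
        for end in range(1, len(key) + 1):
            self._values.setdefault(key[:end], []).append(value)

    def retrieve(self, query):
        # Normalize the query the same way before looking it up.
        return list(self._values.get(prepare(query), []))


index = PrefixIndex()
index.add("\u00ecabc", 1)   # precomposed i with grave accent
index.add("i\u0300abd", 2)  # "i" followed by combining grave accent
print(len(index.retrieve("i\u0300ab")))  # 2: both keys match after NFC
print(len(prepare("q\u0300")))           # 2: no precomposed form exists
```

The last line shows the limit of this approach: a sequence without a precomposed equivalent stays two code points long even after normalization, so a code-point-level trie can still split it.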
Ran into an issue with Unicode 0x300 (the combining grave accent, U+0300). This can be reproduced with the code below:
This will print 0. Note that the second item added is not a byte-equal prefix of s: their Unicode sequences are different, even though a.StartsWith(b) returns true, presumably because of culture settings. The second one uses two characters, a normal 'i' followed by Unicode 0x300 to add the accent, while the first one uses a single accented i character.
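The difference between the two spellings can be shown in isolation. The snippet below is a Python sketch, not the original reproduction. Python compares strings code point by code point, which is assumed here to behave like a byte-level (ordinal) comparison rather than a culture-sensitive one.

```python
import unicodedata

composed = "\u00ec"      # one code point: LATIN SMALL LETTER I WITH GRAVE
decomposed = "i\u0300"   # "i" followed by COMBINING GRAVE ACCENT

# The two strings render the same but are different sequences.
print(composed == decomposed)                                # False
print([hex(ord(c)) for c in decomposed])                     # ['0x69', '0x300']

# Normalizing either side to a common form makes them equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

A structure that matches code point by code point sees these as unrelated keys unless both are normalized to the same form first.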