Another Unicode issue #15

holopoj · 2019-08-04T21:57:30Z

I ran into an issue with the Unicode combining character U+0300. It can be reproduced with the code below:

var a= "rosalía castro";
var b= "rosalía";
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
Console.WriteLine(t.Retrieve(a).Count());

This prints 0. Note that the second string added is not a byte-equal prefix of a; their Unicode sequences differ, even though a.StartsWith(b) returns true, presumably because of culture settings. The second string uses two characters, a plain 'i' followed by the combining mark U+0300 to add the accent, while the first uses a single precomposed accented i character.
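The mismatch described above can be made visible with a small sketch. The class and method names here are hypothetical and not part of the trie library, and the escapes assume the combining mark is U+0300 as reported, paired with its precomposed counterpart U+00EC. Ordinal comparison looks only at the raw UTF-16 code units, which is presumably how the trie sees the strings as well.

```csharp
using System;
using System.Linq;
using System.Text;

public static class PrefixCheck
{
    // Hypothetical helper: lists the UTF-16 code units of a string as hex.
    public static string CodeUnits(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString("X4")));

    // Ordinal comparison: compares raw code units, ignoring culture rules.
    public static bool IsOrdinalPrefix(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    // Normalizing both sides to form C first makes the two spellings agree.
    public static bool IsNormalizedPrefix(string text, string prefix) =>
        text.Normalize(NormalizationForm.FormC)
            .StartsWith(prefix.Normalize(NormalizationForm.FormC), StringComparison.Ordinal);
}
```

With a = "rosal\u00ECa castro" and b = "rosali\u0300a", IsOrdinalPrefix(a, b) is false while IsNormalizedPrefix(a, b) is true, which matches the behaviour reported here.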


rjgotten · 2019-09-10T19:00:29Z

The proper, fully compatible solution that would resolve most if not all issues with Unicode is to rewrite all substring handling to use the StringInfo class, so that it works with 'real' characters, i.e. graphemes, rather than individual char values (UTF-16 code units).

However, the public StringInfo API is very uncomfortable. E.g. you have to manually pump a non-generic IEnumerator with MoveNext() to iterate over a string's graphemes. There's no IEnumerable<> support and thus also no foreach support.
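As a rough illustration of the manual pumping described above, the enumerator can be wrapped once in an iterator so that foreach and LINQ work. GraphemeHelper and Graphemes are hypothetical names; only StringInfo.GetTextElementEnumerator and TextElementEnumerator.GetTextElement come from the framework.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class GraphemeHelper
{
    // Hypothetical wrapper: exposes the text elements (graphemes) of a string
    // as an IEnumerable<string> by pumping the non-generic enumerator.
    public static IEnumerable<string> Graphemes(string text)
    {
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // Each text element is a base character plus any combining marks.
            yield return enumerator.GetTextElement();
        }
    }
}
```

For example, GraphemeHelper.Graphemes("i\u0300a") yields two elements: "i\u0300" and "a".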

[EDIT]

It looks like this wouldn't be too difficult a change for the Ukkonen trie, if you go about it naively and just replace the regular Substring() calls and Length accesses with StringInfo-driven equivalents.

The downsides are that it would probably murder at least construction performance, and that the Node class would need to hold an IDictionary<string,Edge>, as a grapheme may not fit in a single char. That last bit means an increase in space taken as well, but luckily it's still bounded. Unicode graphemes aren't endlessly long, iirc.

It might be better to convert all strings once into a dedicated data structure that operates at the grapheme level, though. That would certainly keep the code more maintainable.
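A minimal sketch of such a one-time conversion is given below. GraphemeString is a hypothetical type, not something the library provides; it splits the string once using StringInfo.ParseCombiningCharacters and then offers Length and Substring in grapheme units, which are the two operations mentioned above.

```csharp
using System;
using System.Globalization;

public sealed class GraphemeString
{
    private readonly string[] _elements;

    public GraphemeString(string text)
    {
        // Start index of every text element (base character plus combining marks).
        int[] starts = StringInfo.ParseCombiningCharacters(text);
        _elements = new string[starts.Length];
        for (int i = 0; i < starts.Length; i++)
        {
            int end = i + 1 < starts.Length ? starts[i + 1] : text.Length;
            _elements[i] = text.Substring(starts[i], end - starts[i]);
        }
    }

    // Number of graphemes, not chars.
    public int Length => _elements.Length;

    // The grapheme at the given position.
    public string this[int index] => _elements[index];

    // Substring measured in graphemes.
    public string Substring(int start, int length) =>
        string.Join(string.Empty, _elements, start, length);
}
```

The splitting cost is paid once per string. A trie built on top of this would still need to key its edges by string rather than char, as noted above.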

prj · 2020-04-30T09:03:57Z

For me it throws OutOfBoundsExceptions as soon as I try to construct anything that has special characters in it. And all my sources are in ISO-8859-1.
So it seems this project is useless in any real world application, unless you're dealing with plain ASCII.

jesuslpm · 2021-01-31T11:29:29Z

@holopoj ,

Preparing the text for the trie before adding and before searching is a good workaround. I will work with the Basic Multilingual Plane, which contains characters for almost all modern languages and a large number of symbols:

using System.Globalization;
using System.Text;

/// <summary>
/// Removes diacritics from the text, converts it to lower case, removes surrogate
/// characters and normalizes it, preparing the text for accent- and case-insensitive search.
/// </summary>
/// <param name="text">The raw text to add to the trie or to search for.</param>
/// <returns>The prepared text, in normalization form C.</returns>
static string PrepareForTrie(string text)
{
    // Decompose first, so that each accent becomes a separate combining mark.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (char c in normalizedString)
    {
        // Skip surrogates: only the Basic Multilingual Plane is kept.
        if (char.IsSurrogate(c)) continue;
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        // Drop combining marks (the diacritics) and control characters.
        if (unicodeCategory != UnicodeCategory.NonSpacingMark && unicodeCategory != UnicodeCategory.Control)
        {
            stringBuilder.Append(char.ToLower(c));
        }
    }
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

Now this code works; it prints 2 and 1. The second item still contains the two-code-point grapheme:

var a = PrepareForTrie("Rosalia de Castro");
var b = PrepareForTrie("rosalía");
var t = new UkkonenTrie<int>(3);
t.Add(a, 1);
t.Add(b, 2);
foreach (var value in t.Retrieve(b))
{
    Console.WriteLine(value);
}
