twitter-text/unicode_regex at master · Shockator/twitter-text

README
decimal_numbers_java_1_7_and_ruby_1_9.txt
decimal_numbers_objc.txt
letters_and_marks_java_1_7_and_ruby_1_9.txt
letters_and_marks_objc.txt
unicode_regex_groups.scala

README

Why manually generate regex groups for Unicode character classes?

Even though there are Unicode regex groups such as \p{L} (all Unicode letters), \p{M} (all Unicode marks), and \p{Nd} (Unicode decimal digits), the exact set each one represents is inconsistent between languages and versions. JavaScript doesn't support these groups at all. Even if it did, it couldn't match the astral ranges (code points above U+FFFF) without resorting to explicit regex groupings of surrogate pairs, because its regex engine treats strings as UCS-2 code units rather than decoding UTF-16.
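
To make the surrogate-pair point concrete, here is a minimal Ruby sketch of how one astral code point turns into the two UTF-16 code units that a UCS-2 engine would have to match separately. The helper names surrogate_pair and js_escape are hypothetical and exist only for this illustration.

```ruby
# Split an astral code point (above U+FFFF) into its UTF-16 surrogate pair.
# Hypothetical helper, not part of twitter-text.
def surrogate_pair(code_point)
  unless (0x10000..0x10FFFF).cover?(code_point)
    raise ArgumentError, "not an astral code point: #{code_point.to_s(16)}"
  end
  offset = code_point - 0x10000
  high = 0xD800 + (offset >> 10)   # top 10 bits select the high surrogate
  low  = 0xDC00 + (offset & 0x3FF) # bottom 10 bits select the low surrogate
  [high, low]
end

# Render the pair as two \uXXXX escapes, the form a UCS-2 based engine sees.
def js_escape(code_point)
  surrogate_pair(code_point).map { |unit| format('\\u%04X', unit) }.join
end

puts js_escape(0x1D400) # U+1D400 MATHEMATICAL BOLD CAPITAL A
```

A range of astral code points therefore has to be written as a group over pairs of code units, which is why the generated JavaScript regexes cannot simply list code point ranges.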

First, the inconsistency issue. Take letters, \p{L}. The regex engines in Java (and Scala) and Ruby 1.9.3 define it to contain 100520 code points. Ruby 2.2.3 has about two thousand more at 102725, and the latest Objective-C has slightly more again at 102754. Now what about marks, \p{M}? Java and Ruby 1.9.3 define it to contain 1498 code points. Ruby 2.2.3 has a few hundred more at 1830, and the latest Objective-C has 1869. Surely decimal digits, \p{Nd}, are consistent? Nope. Java and Ruby 1.9.3 define the class to contain 420 code points. Ruby 2.2.3 and Objective-C have 540.

The best way to reach every person on the planet is to support all of Unicode. Therefore, we should augment each language's base Unicode sets with the code points they are missing. Since Objective-C has the most up-to-date definitions, we can use it as our standard. Ruby 1.9 and Java 1.7 share a common baseline.
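
The augmentation step amounts to a set difference between a standard list and a baseline list. Below is a rough Ruby sketch of that idea. It assumes each list has one hexadecimal code point per line (upper or lower case) plus optional "#" comment lines, which is what the generator snippets in this README print. The function names and the tiny sample inputs are made up for illustration and are not the contents of the real .txt files.

```ruby
require 'set'

# Parse one generated list: skip blank lines and "#" comments, then read
# every remaining line as a hexadecimal code point (either case).
def parse_code_points(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?('#') }
      .map { |line| line.to_i(16) }
      .to_set
end

# Code points present in the standard list but absent from the baseline,
# i.e. the items the baseline language would need to add.
def missing_from_baseline(baseline_text, standard_text)
  (parse_code_points(standard_text) - parse_code_points(baseline_text)).sort
end

# Toy inputs, not real data from the repository.
baseline = "# 3 code points matched for \\p{Nd}\n30\n31\n32\n"
standard = "# 4 code points matched for \\p{Nd}\n30\n31\n32\n1E950\n"
puts missing_from_baseline(baseline, standard).map { |cp| cp.to_s(16) }
```

With the real files, the two arguments could be read with File.read, for example from decimal_numbers_java_1_7_and_ruby_1_9.txt and decimal_numbers_objc.txt, provided they follow the assumed format.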

Here is some Scala code you can run in the REPL to create the baseline text files:

// Collect every code point (surrogates excluded) whose one-character string matches the regex.
def allMatches(re: scala.util.matching.Regex) = ((0 to 0xD7FF) ++ (0xE000 to 0x10FFFF)).filterNot(i => re.findFirstIn(new String(Array(i), 0, 1)).isEmpty)
// Print a "# count" header followed by one lowercase hex code point per line.
def report(pattern: String): Unit = {
  val matches = allMatches(pattern.r) // scan once and reuse the result
  val hex = matches.map(n => f"$n%x").mkString("\n")
  println(s"# ${matches.length} code points matched for $pattern\n$hex")
}
report("[\\p{L}\\p{M}]")
report("\\p{Nd}")


Here is the same thing for Ruby 1.9; it generates the same output as Scala/Java 1.7:

# Collect every code point (surrogates excluded) whose one-character string matches the regex.
def allMatches(re)
  [*0..0xD7FF, *0xE000..0x10FFFF].select { |x| [x].pack('U') =~ re }
end
pattern = "[\\p{L}\\p{M}]"
matches = allMatches(/#{pattern}/)
puts "# #{matches.length} code points matched for #{pattern}\n#{matches.map { |n| n.to_s(16) }.join("\n")}"
pattern = "\\p{Nd}"
matches = allMatches(/#{pattern}/)
puts "# #{matches.length} code points matched for #{pattern}\n#{matches.map { |n| n.to_s(16) }.join("\n")}"


Here is some code you can run in Xcode to generate the current lists of code points:

#import <Foundation/Foundation.h>

void logMatches(NSString *pattern) {
    NSMutableArray* array = [[NSMutableArray alloc] init];
    NSError *error = nil;
    NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern: pattern options:0 error:&error];

    for (int i = 0; i <= 0x10FFFF; ++i) {
        if (i >= 0xd800 && i <= 0xdfff) continue;
        NSData* data = [[NSData alloc] initWithBytes:&i length:sizeof(int)];
        NSString *ts = [[NSString alloc] initWithData:data encoding:NSUTF32LittleEndianStringEncoding];
        if ([regex numberOfMatchesInString:ts options:0 range:NSMakeRange(0, [ts length])] > 0) {
            [array addObject: [NSString stringWithFormat:@"%X", i]];
        }
    }
    NSLog(@"\n# %d code points matched for %@ \n%@\n", (int)[array count], pattern, [array componentsJoinedByString:(@"\n")]);
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        logMatches(@"[\\p{L}\\p{M}]");
        logMatches(@"\\p{Nd}");
    }
    return 0;
}

To generate the compact form of the regexes for Java, Ruby, and JavaScript, run
scala unicode_regex_groups.scala
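
For a feel of what a "compact form" can look like, here is a small Ruby sketch that collapses consecutive code points into ranges and prints them as the body of a character class. It is an illustration only and does not claim to reproduce what unicode_regex_groups.scala does. The helper names to_ranges and to_char_class, the \u{...} escape style, and the sample input are all assumptions.

```ruby
# Collapse a list of code points into inclusive [first, last] runs.
def to_ranges(code_points)
  code_points.sort.uniq.each_with_object([]) do |cp, runs|
    if runs.any? && runs.last[1] + 1 == cp
      runs.last[1] = cp   # extend the current run
    else
      runs << [cp, cp]    # start a new run
    end
  end
end

# Format the runs as a character class using \u{...} escapes.
def to_char_class(code_points)
  body = to_ranges(code_points).map do |first, last|
    if first == last
      format('\\u{%x}', first)
    else
      format('\\u{%x}-\\u{%x}', first, last)
    end
  end.join
  "[#{body}]"
end

# Sample input: ASCII digits, Arabic-Indic digits, and one astral digit.
digits = (0x30..0x39).to_a + (0x660..0x669).to_a + [0x1d7ce]
puts to_char_class(digits)
```

Other targets would need their own escape format, and JavaScript in particular would need astral runs rewritten as surrogate pairs.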
