unicode_regex
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
Why manually generate regex groups for unicode char classes? Even though there are unicode regex groups like \p{L} (all unicode letters) \p{M} (all unicode marks), and \p{Nd} (unicode decimal digits), the exact set they represent is inconsistent between languages and versions. And JavaScript doesn't support these groups at all. Event if JavaScript did, it doesn't support the astral ranges without resorting to using RegEx groupings because its regex engine uses UCS-2 rather than UTF-16 encoding. First, the inconsistency issue. Let's take letters, \p{L}. Java (and Scala)'s regex engine and Ruby 1.9.3 define it to contain 100520 code points. Ruby 2.2.3 has about two thousand more at 102725, and the latest Objective C has even more at 102754. Now what about marks, \p{M}? Java and Ruby 1.9.3 define it to contain 1498 code points. Ruby 2.2.3 has a few hundred more at 1830, and the latest Objective C even more at 1869. Surely decimal digits, \p{Nd}, is consistent? Nope. Java and Ruby 1.9.3 define it to contain 420 code points. Ruby 2.2.3 and Objective C have 540. The best way to reach every person on the planet is to support all of unicode. Therefore, we should augment each language's base unicode sets with additional items. Since Objective C has the most up to date definitions, we can use that as our standard. And Ruby 1.9 and Java 1.7 have a common baseline. Here is some scala code you can run in the repl to create the baseline text files def allMatches(re: scala.util.matching.Regex) = ((0 to 0xD7FF)++(0xE000 to 0x10FFFF)).filterNot(i => re.findFirstIn(new String(Array(i), 0, 1)).isEmpty) val pattern = "[\\p{L}\\p{M}]" println(s"# ${allMatches(pattern.r).length} code points matched for ${pattern}\n${allMatches(pattern.r).map { case n => f"$n%x"}.mkString("\n")}") val pattern = "\\p{Nd}" println(s"# ${allMatches(pattern.r).length} code points matched for ${pattern}\n${allMatches(pattern.r).map { case n => f"$n%x"}.mkString("\n")}") Here is the same thing for ruby 1.9, although it generates the same output as scala /java 1.7: def allMatches(re) [*0..0xD7FF,*0xE000..0x10FFFF].select {|x| [x].pack('U') =~ re } end pattern = "[\\p{L}\\p{M}]" puts "# #{allMatches(/#{pattern}/).length} code points matched for #{pattern}\n${allMatches(/#{pattern}/).map { |n| => n.to_s(16) }.join("\n")}" pattern = "\\p{Nd}" puts "# #{allMatches(/#{pattern}/).length} code points matched for #{pattern}\n${allMatches(/#{pattern}/).map { |n| => n.to_s(16) }.join("\n")}" Here is some code you can run in XCode to generate the current lists of code points #import <Foundation/Foundation.h> void logMatches(NSString *pattern) { NSMutableArray* array = [[NSMutableArray alloc] init]; NSError *error = nil; NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern: pattern options:0 error:&error]; for (int i = 0; i <= 0x10FFFF; ++i) { if (i >= 0xd800 && i <= 0xdfff) continue; NSData* data = [[NSData alloc] initWithBytes:&i length:sizeof(int)]; NSString *ts = [[NSString alloc] initWithData:data encoding:NSUTF32LittleEndianStringEncoding]; if ([regex numberOfMatchesInString:ts options:0 range:NSMakeRange(0, [ts length])] > 0) { [array addObject: [NSString stringWithFormat:@"%X", i]]; } } NSLog(@"\n# %d code points matched for %@ \n%@\n", (int)[array count], pattern, [array componentsJoinedByString:(@"\n")]); } int main(int argc, const char * argv[]) { @autoreleasepool { logMatches(@"[\\p{L}\\p{M}]"); logMatches(@"\\p{Nd}"); } return 0; } To generate the compact form of the regexes for Java, Ruby, and JavaScript, run scala unicode_regex_groups.scala