[libc++][format] Switches to Unicode 15.1. (llvm#86543)
In addition to the changes in the tables, the extended grapheme
clustering algorithm has been overhauled. I had previously considered
using a separate state machine to implement the rules. The new rule
GB9c made this more attractive, so the design has changed.

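This change initially had a large impact on performance: the Unicode,
Cyrillic, and Japanese text benchmarks below became roughly 2x to 3x
slower. Making the state machine persistent recovered most of that
loss. Note that it is still slower than before (roughly 20% to 45% in
these benchmarks) due to the larger Unicode tables.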

Before
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             1891 ns         1889 ns       369504
BM_unicode_text<char>         106642 ns       106397 ns         6576
BM_cyrillic_text<char>         73420 ns        73277 ns         9445
BM_japanese_text<char>         62485 ns        62387 ns        11153
BM_emoji_text<char>             1895 ns         1893 ns       369525
BM_ascii_text<wchar_t>          2015 ns         2013 ns       346887
BM_unicode_text<wchar_t>       92119 ns        92017 ns         7598
BM_cyrillic_text<wchar_t>      62637 ns        62568 ns        11117
BM_japanese_text<wchar_t>      53850 ns        53785 ns        12803
BM_emoji_text<wchar_t>          2016 ns         2014 ns       347325

After
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             1906 ns         1904 ns       369409
BM_unicode_text<char>         265462 ns       265175 ns         2628
BM_cyrillic_text<char>        181063 ns       180865 ns         3871
BM_japanese_text<char>        130927 ns       130789 ns         5324
BM_emoji_text<char>             1892 ns         1890 ns       370537
BM_ascii_text<wchar_t>          2038 ns         2035 ns       343689
BM_unicode_text<wchar_t>      277603 ns       277282 ns         2526
BM_cyrillic_text<wchar_t>     188558 ns       188339 ns         3727
BM_japanese_text<wchar_t>     133084 ns       132943 ns         5262
BM_emoji_text<wchar_t>          2012 ns         2010 ns       348015

Persistent
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             1904 ns         1899 ns       367472
BM_unicode_text<char>         133609 ns       133287 ns         5246
BM_cyrillic_text<char>         90185 ns        89941 ns         7796
BM_japanese_text<char>         75137 ns        74946 ns         9316
BM_emoji_text<char>             1906 ns         1901 ns       368081
BM_ascii_text<wchar_t>          2703 ns         2696 ns       259153
BM_unicode_text<wchar_t>      131497 ns       131168 ns         5341
BM_cyrillic_text<wchar_t>      87071 ns        86840 ns         8076
BM_japanese_text<wchar_t>      72279 ns        72099 ns         9682
BM_emoji_text<wchar_t>          2021 ns         2016 ns       346767
mordante authored Apr 9, 2024
1 parent cf6feff commit 59e66c5
Showing 19 changed files with 6,412 additions and 3,008 deletions.
3 changes: 3 additions & 0 deletions libcxx/docs/ReleaseNotes/19.rst
@@ -56,6 +56,7 @@ Improvements and New Features
resulting in a performance increase of up to 1400x.
- The ``std::mismatch`` algorithm has been optimized for integral types, which can lead up to 40x performance
improvements.

- The ``std::ranges::minmax`` algorithm has been optimized for integral types, resulting in a performance increase of
up to 100x.

@@ -64,6 +65,8 @@ Improvements and New Features
- The ``_LIBCPP_ENABLE_CXX26_REMOVED_WSTRING_CONVERT`` macro has been added to make the declarations in ``<locale>``
available.

+- The formatting library is updated to Unicode 15.1.0.

Deprecations and Removals
-------------------------

1 change: 1 addition & 0 deletions libcxx/include/CMakeLists.txt
@@ -395,6 +395,7 @@ set(files
__format/formatter_pointer.h
__format/formatter_string.h
__format/formatter_tuple.h
+__format/indic_conjunct_break_table.h
__format/parser_std_format_spec.h
__format/range_default_formatter.h
__format/range_formatter.h
11 changes: 6 additions & 5 deletions libcxx/include/__format/escaped_output_table.h
@@ -110,7 +110,7 @@ namespace __escaped_output_table {
/// - bits [0, 10] The size of the range, allowing 2048 elements.
/// - bits [11, 31] The lower bound code point of the range. The upper bound of
/// the range is lower bound + size.
-_LIBCPP_HIDE_FROM_ABI inline constexpr uint32_t __entries[893] = {
+_LIBCPP_HIDE_FROM_ABI inline constexpr uint32_t __entries[894] = {
0x00000020,
0x0003f821,
0x00056800,
@@ -464,14 +464,14 @@ _LIBCPP_HIDE_FROM_ABI inline constexpr uint32_t __entries[893] = {
0x0174d000,
0x0177a00b,
0x017eb019,
-0x017fe004,
+0x01800000,
0x01815005,
0x01820000,
0x0184b803,
0x01880004,
0x01898000,
0x018c7800,
-0x018f200b,
+0x018f200a,
0x0190f800,
0x05246802,
0x05263808,
@@ -1000,8 +1000,9 @@ _LIBCPP_HIDE_FROM_ABI inline constexpr uint32_t __entries[893] = {
0x15b9d005,
0x15c0f001,
0x1675100d,
-0x175f0fff,
-0x179f0c1e,
+0x175f080e,
+0x1772f7ff,
+0x17b2f1a1,
0x17d0f5e1,
0x189a5804};
