In order to update the Unicode data files, follow these steps:
- Download the following files into your current working directory (e.g. `graal/regex`). If updating to another version, replace 12.1.0 with the version you are aiming for.
  - `UnicodeData.txt` (https://www.unicode.org/Public/12.1.0/ucd/UnicodeData.txt)
  - `CaseFolding.txt` (https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt)
  - `SpecialCasing.txt` (https://www.unicode.org/Public/12.1.0/ucd/SpecialCasing.txt)
  - `PropertyAliases.txt` (https://www.unicode.org/Public/12.1.0/ucd/PropertyAliases.txt)
  - `PropertyValueAliases.txt` (https://www.unicode.org/Public/12.1.0/ucd/PropertyValueAliases.txt)
  - `ucd.nounihan.flat.xml` (https://www.unicode.org/Public/12.1.0/ucdxml/ucd.nounihan.flat.zip)
    - You will need to unzip the archive.
  - `emoji-data.txt` (https://unicode.org/Public/emoji/12.0/emoji-data.txt)
- Run `src/com.oracle.truffle.regex/tools/unicode-script.sh`. This generates the following files in your current working directory:
  - `UnicodeFoldTable.txt`
  - `NonUnicodeFoldTable.txt`
  - `PythonSimpleCasing.txt`
  - `PythonExtendedCasing.txt`
- Run `src/com.oracle.truffle.regex/tools/generate_case_fold_table.clj >> src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/tregex/parser/CaseFoldTable.java` to generate the new case fold tables and append them to `CaseFoldTable.java`. Then open `CaseFoldTable.java` in an editor and replace the old character data with the new definitions.
  - To run this script, you need a way to run Clojure scripts.
    - You can use Boot (https://boot-clj.com/), which lets you execute the script directly. Boot can usually be installed from your distribution's package manager.
    - Alternatively, you can use a Clojure jar file directly, as in `java -jar clojure-1.8.0.jar --init src/com.oracle.truffle.regex/tools/generate_case_fold_table.clj --eval '(-main)'`.
- Run `src/com.oracle.truffle.regex/tools/generate_unicode_properties.py > src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/charset/UnicodePropertyData.java`. This rewrites `UnicodePropertyData.java` to contain the new definitions of Unicode properties.
- Run the `main` method of `com.oracle.truffle.regex.charset.UnicodeGeneralCategoriesGenerator` and replace `src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/charset/UnicodeGeneralCategories.java` with its output.
- Run `mx eclipseformat` to fix any code formatting issues.
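
The download in the first step can be sketched in shell as below. This is an illustration only, not the contents of any script in the repository. The variable names `UNICODE_VERSION` and `EMOJI_VERSION` and the function names are hypothetical, and the sketch assumes the URL layout listed above also holds for the version you are targeting, which may not be true for every release.

```shell
#!/usr/bin/env bash
# Sketch only: build the download URLs for the first step and fetch them.
# UNICODE_VERSION and EMOJI_VERSION are hypothetical names; the defaults
# match the versions used in the URLs listed above.
UNICODE_VERSION="${UNICODE_VERSION:-12.1.0}"
EMOJI_VERSION="${EMOJI_VERSION:-12.0}"

# Print the URL of one file in the UCD directory of the chosen version.
ucd_url() {
    printf 'https://www.unicode.org/Public/%s/ucd/%s\n' "$UNICODE_VERSION" "$1"
}

# Print every URL from the first step, one per line.
list_urls() {
    for f in UnicodeData.txt CaseFolding.txt SpecialCasing.txt \
             PropertyAliases.txt PropertyValueAliases.txt; do
        ucd_url "$f"
    done
    printf 'https://www.unicode.org/Public/%s/ucdxml/ucd.nounihan.flat.zip\n' "$UNICODE_VERSION"
    printf 'https://unicode.org/Public/emoji/%s/emoji-data.txt\n' "$EMOJI_VERSION"
}

# Download everything into the current directory, then unpack the XML archive.
download_all() {
    for url in $(list_urls); do
        wget -q "$url" || return 1
    done
    unzip -o ucd.nounihan.flat.zip
}
```

Calling `download_all` from your working directory (e.g. `graal/regex`) would fetch the seven files and unzip the archive. Running `list_urls` first is a cheap way to check the URLs before downloading anything.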
Steps 1-4 are automated by `run_scripts.sh`. This script assumes you have the following things installed: `clojure`, `python3`, `wget`, and `unzip`.
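
Before running the automated script, it may help to confirm that the required tools are on your `PATH`. The following is a minimal sketch of such a check. The helper name `missing_tools` is hypothetical and is not part of `run_scripts.sh`.

```shell
#!/usr/bin/env bash
# Sketch only: print each of the given commands that is not found on PATH.
# missing_tools is a hypothetical helper, not something the repository provides.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example use (commented out so the block has no side effects):
#   missing="$(missing_tools clojure python3 wget unzip)"
#   [ -z "$missing" ] || { echo "Missing tools: $missing" >&2; exit 1; }
```

An empty result means all four tools were found. Otherwise, install whatever is listed before starting.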