janlelis/unicode-scripts

Unicode::Scripts

Retrieve all Unicode scripts a string belongs to. Can also return the Script_Extensions property (scx), which covers characters that are "commonly used with more than one script, but with a limited number of scripts".

Based on Script_Extensions, this library can also return the augmented script set, which is used to decide whether a string is mixed-script or single-script. Mixed scripts can be an indicator of suspicious user input.

Unicode version: 16.0.0 (September 2024)

Supported Rubies: 3.x (might work: 2.x)

Gemfile

gem "unicode-scripts"

Usage - Scripts and Script Extensions

require "unicode/scripts"

Unicode::Scripts.scripts("СC") # => ["Cyrillic", "Latin"]

# 4-letter script aliases
Unicode::Scripts.scripts("СC", format: :short) # => ["Cyrl", "Latn"]

# Single character
Unicode::Scripts.script("ᴦ") # => "Greek"

# Script_Extensions property
Unicode::Scripts.script_extensions("॥")
# => ["Bengali", "Devanagari", "Dogra", "Grantha", "Gujarati", "Gunjala_Gondi", "Gurmukhi", "Gurung_Khema",
#     "Kannada", "Khudawadi", "Limbu", "Mahajani", "Malayalam", "Masaram_Gondi", "Nandinagari", "Ol_Onal",
#     "Oriya", "Sinhala", "Syloti_Nagri", "Takri", "Tamil", "Telugu", "Tirhuta"]

Usage - Augmented Scripts

Like script extensions, but adds meta scripts for Asian languages and treats Common/Inherited values as ALL scripts.

require "unicode/scripts"

Unicode::Scripts.augmented_scripts("ねガ") # => ["Hira", "Kana", "Jpan"]
Unicode::Scripts.augmented_scripts("1") # => ["Adlm", "Aghb", "Ahom", … ]

Usage - Resolved Script

Intersection of all augmented scripts per character.

require "unicode/scripts"

Unicode::Scripts.resolved_scripts("СігсӀе") # => ["Cyrl"]
Unicode::Scripts.resolved_scripts("Сirсlе") # => []
Unicode::Scripts.resolved_scripts("𝖢𝗂𝗋𝖼𝗅𝖾") # => ["Adlm", "Aghb", "Ahom", … ]
Unicode::Scripts.resolved_scripts("1") # => ["Adlm", "Aghb", "Ahom", … ]
Unicode::Scripts.resolved_scripts("ねガ") # => ["Hira", "Kana", "Jpan"]

Please note that the resolved script set can contain multiple scripts, as per the standard.
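
To make the intersection concrete, here is a minimal sketch that does not use the gem. The table `TOY_AUGMENTED`, the marker `ALL_SCRIPTS`, and the method `toy_resolved_scripts` are hypothetical names invented for this illustration; the gem derives its sets from Unicode data and may be implemented differently.

```ruby
# Hypothetical stand-in for "every script" (Common/Inherited characters).
ALL_SCRIPTS = :all

# Hand-written augmented script sets for a few characters.
# Illustration only: this table is an assumption, not data taken from the gem.
TOY_AUGMENTED = {
  "\u0421" => ["Cyrl"], # CYRILLIC CAPITAL LETTER ES
  "\u0441" => ["Cyrl"], # CYRILLIC SMALL LETTER ES
  "\u0435" => ["Cyrl"], # CYRILLIC SMALL LETTER IE
  "i"      => ["Latn"],
  "r"      => ["Latn"],
  "l"      => ["Latn"],
  "1"      => ALL_SCRIPTS,
}.freeze

# Intersect the per-character sets. Characters marked ALL_SCRIPTS never
# narrow the result, so they are skipped before intersecting.
# Raises KeyError for characters missing from the toy table.
def toy_resolved_scripts(string)
  sets = string.each_char.map { |char| TOY_AUGMENTED.fetch(char) }
  specific = sets.reject { |set| set == ALL_SCRIPTS }
  return ALL_SCRIPTS if specific.empty?

  specific.reduce(:&)
end
```

With this table, `toy_resolved_scripts("\u0421\u0441\u0435")` gives `["Cyrl"]`, while the mixed Latin/Cyrillic string `"\u0421ir\u0441l\u0435"` gives `[]`.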

Usage - Mixed-Script Detection

A string is mixed-script if its resolved script set is empty, and single-script otherwise.

require "unicode/scripts"

Unicode::Scripts.mixed?("СігсӀе") # => false
Unicode::Scripts.mixed?("Сirсlе") # => true
Unicode::Scripts.mixed?("𝖢𝗂𝗋𝖼𝗅𝖾") # => false
Unicode::Scripts.mixed?("1") # => false
Unicode::Scripts.mixed?("ねガ") # => false

Unicode::Scripts.single?("СігсӀе") # => true
Unicode::Scripts.single?("Сirсlе") # => false
Unicode::Scripts.single?("𝖢𝗂𝗋𝖼𝗅𝖾") # => true
Unicode::Scripts.single?("1") # => true
Unicode::Scripts.single?("ねガ") # => true

Please note that a single-script string might actually contain multiple scripts, as per the standard (e.g. for Asian languages).

List of All Scripts

You can extract all script names from the gem like this:

require "unicode/scripts"
puts Unicode::Scripts.names # list of scripts

To get all 4-letter script codes (ISO 15924):

require "unicode/scripts"
puts Unicode::Scripts.names(format: :short) # list of 4-letter script codes

Augmented scripts:

require "unicode/scripts"
puts Unicode::Scripts.names(format: :short, augmented: :only)

You can find a list of all scripts in Unicode, with links to Wikipedia, on character.construction/scripts.

Hints

Regex Matching

If you have a string and want to match a substring/character from a specific Unicode script, you actually won't need this gem. Instead, you can use the Regexp Unicode Property Syntax \p{}:

"Coptic letter: ⲁ".scan(/\p{Coptic}/) # => ["ⲁ"]

See Idiosyncratic Ruby: Proper Unicoding for more info.
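
As a further sketch of the same `\p{}` syntax, the snippet below splits a string into its Latin and Cyrillic characters without the gem. The method names `split_latin_cyrillic` and `latin_and_cyrillic?` are made up for this example. Note that this only looks at the plain Script property of two hard-coded scripts, so it is a rough heuristic and not a replacement for the resolved-script logic described above.

```ruby
# Escapes keep the look-alikes unambiguous: U+0421, U+0441 and U+0435 are
# Cyrillic letters, while "i", "r" and "l" are Latin.
def split_latin_cyrillic(string)
  {
    latin:    string.scan(/\p{Latin}/),    # characters with Script=Latin
    cyrillic: string.scan(/\p{Cyrillic}/), # characters with Script=Cyrillic
  }
end

# Rough heuristic: true if the string contains at least one Latin and at
# least one Cyrillic character. Other script combinations are not checked.
def latin_and_cyrillic?(string)
  string.match?(/\p{Latin}/) && string.match?(/\p{Cyrillic}/)
end

parts = split_latin_cyrillic("\u0421ir\u0441l\u0435")
parts[:latin]    # => ["i", "r", "l"]
parts[:cyrillic] # => ["\u0421", "\u0441", "\u0435"]
```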

MIT License