Provides Unicode operations for JavaScript, such as decomposition and diacritical mark removal.
To handle the large set of Unicode data and operations with a minimal memory and network footprint, this library makes data loading optional, so that only the functions the application needs have their data loaded.
For instance, calling the lowercase_nomark() function requires loading the following scripts, so that only the small dataset used by that operation is loaded:
* <script type="text/javascript" src="src/normalizer_lowercase_nomark.js"></script>
* <script type="text/javascript" src="src/unicode.js"></script>
Note that the unicode.js file must be loaded last.
The library sets itself in the net.kornr.unicode namespace to avoid collision.
Algorithms are designed to perform more efficiently when the looked-up characters are part of the same character set. The loaded datasets (such as normalizer_lowercase_nomark.js) are pre-processed and compacted so that several operations are performed in a single pass (for instance case-changing, decomposition, and diacritical marks removal).
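As a rough illustration of the single-pass idea described above, the sketch below uses a small hand-written table in which each entry already holds the final lower-cased, mark-free form. The table contents, the FOLD_TABLE name and the foldSinglePass function are hypothetical; they are not the library's actual data format or code.

```javascript
// Hypothetical, hand-written table: each entry maps a source character
// straight to its final lower-cased, mark-free replacement.
const FOLD_TABLE = new Map([
  ["\u00C7", "c"], // Ç
  ["\u00E7", "c"], // ç
  ["\u00DB", "u"], // Û
  ["\u00FB", "u"], // û
  ["\u00C9", "e"], // É
  ["\u00E9", "e"], // é
]);

// One pass over the string: a table hit yields the final form directly,
// anything else falls back to plain lower-casing.
function foldSinglePass(str) {
  let out = "";
  for (const ch of str) {
    const mapped = FOLD_TABLE.get(ch);
    out += mapped !== undefined ? mapped : ch.toLowerCase();
  }
  return out;
}

console.log(foldSinglePass("\u00C7a br\u00FBle")); // "ca brule"
```

Because case-changing, decomposition and mark removal are folded into the table ahead of time, the loop does a single lookup per character instead of three separate transformations.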
This library is provided under the terms of the Apache License, Version 2.0. http://www.apache.org/licenses/LICENSE-2.0
Returns a lower-cased, decomposed form, with all the diacritical marks removed. This operation is performed as follows:
- The characters are converted into lower case if needed
- The characters are converted into their decomposed canonical form. For instance, the char é (U+00E9) is decomposed into the chars U+0065 + U+0301 (hexadecimal code points).
- All characters identified as diacritical marks by the unicode database are removed from the string
* <script type="text/javascript" src="normalizer_lowercase_nomark.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
var UC = net.kornr.unicode;
var mystring = UC.lowercase_nomark("Ça brûle"); // returns "ca brule"
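For comparison, the three steps listed above can be approximated with standard JavaScript built-ins. This is only a sketch, assuming an engine that supports String.prototype.normalize and Unicode property escapes in regular expressions; the lowercaseNomarkSketch name is hypothetical, and this is not how the library itself is implemented.

```javascript
// Approximation of the three steps using built-ins only.
function lowercaseNomarkSketch(str) {
  return str
    .toLowerCase()            // step 1: lower case
    .normalize("NFD")         // step 2: canonical decomposition
    .replace(/\p{M}/gu, "");  // step 3: drop characters in the Mark categories
}

console.log(lowercaseNomarkSketch("\u00C7a br\u00FBle")); // "ca brule"
```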
Returns a lower-cased, decomposed form. The string may become longer, because decomposition converts single characters into several characters (i.e. a letter followed by its diacritical marks).
* <script type="text/javascript" src="normalizer_lowercase.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
var mystring = net.kornr.unicode.lowercase("Ça brûle"); // returns "ça brûle"
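The growth in length caused by decomposition can be observed with the built-in String.prototype.normalize. The sketch below does not use the library; it assumes the input is in precomposed form, and the codePointsHex helper is a hypothetical name introduced for illustration.

```javascript
// "Ça brûle" written with precomposed characters (Ç = U+00C7, û = U+00FB).
const composed = "\u00C7a br\u00FBle";
const decomposed = composed.toLowerCase().normalize("NFD");

// Hypothetical helper: list the code points of a string in hexadecimal.
function codePointsHex(str) {
  return Array.from(str, (ch) => ch.codePointAt(0).toString(16));
}

console.log(composed.length);   // 8
console.log(decomposed.length); // 10 (ç and û each became letter + mark)
console.log(codePointsHex("\u00E9".normalize("NFD"))); // [ '65', '301' ]
```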
Returns an upper-cased and decomposed form, with all the diacritical marks removed. This operation is performed as follows:
- The characters are converted into upper case if needed
- The characters are converted into their decomposed canonical form. For instance, the char é (U+00E9) is decomposed into the chars U+0065 + U+0301 (hexadecimal code points).
- All characters identified as diacritical marks by the unicode database are removed from the string
* <script type="text/javascript" src="normalizer_uppercase_nomark.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
var UC = net.kornr.unicode;
var mystring = UC.uppercase_nomark("Ça brûle"); // returns "CA BRULE"
Returns an upper-cased, decomposed form. The string may become longer, because decomposition converts single characters into several characters (i.e. a letter followed by its diacritical marks).
* <script type="text/javascript" src="normalizer_uppercase.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
var mystring = net.kornr.unicode.uppercase("Ça brûle"); // returns "ÇA BRÛLE"
Returns true if str is either a charCode for a letter, or a string that only contains letters, false otherwise.
* <script type="text/javascript" src="categ_letters.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
var a = net.kornr.unicode.is_letter(" "); // returns false
var b = net.kornr.unicode.is_letter("A"); // returns true
var c = net.kornr.unicode.is_letter("1"); // returns false
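The "charCode or string" contract described above can be sketched with a Unicode property escape. This is an illustration only, not the library's code: the isLetterSketch name is hypothetical, numeric input is assumed to be a single char code, and the empty string is rejected here although the library's behaviour for it is not documented.

```javascript
function isLetterSketch(input) {
  // Assumption: a number is a single char code, anything else is a string.
  const str = typeof input === "number"
    ? String.fromCharCode(input)
    : String(input);
  // True only if every character belongs to a Letter category (L*).
  return /^\p{L}+$/u.test(str);
}

console.log(isLetterSketch(" "));            // false
console.log(isLetterSketch("A"));            // true
console.log(isLetterSketch("1"));            // false
console.log(isLetterSketch(0xE9));           // true (char code of é)
console.log(isLetterSketch("\u00E9t\u00E9")); // true ("été")
```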
Returns true if str is either a charCode for a number, or a string that only contains numbers, false otherwise.
* <script type="text/javascript" src="categ_numbers.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
var a = net.kornr.unicode.is_number(" "); // returns false
var b = net.kornr.unicode.is_number("A"); // returns false
var c = net.kornr.unicode.is_number("1"); // returns true
Returns true if str is either a charCode for a letter or a number, or a string that only contains letters and numbers, false otherwise.
* <script type="text/javascript" src="categ_letters_numbers.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
var a = net.kornr.unicode.is_letter_number(" "); // returns false
var b = net.kornr.unicode.is_letter_number("A"); // returns true
var c = net.kornr.unicode.is_letter_number("1"); // returns true
Returns true if some_string is either a charCode for a punctuation sign, or a string that only contains punctuation signs. This includes the following Unicode categories:
- Pc (Connector_Punctuation): a connecting punctuation mark, like a tie
- Pd (Dash_Punctuation): a dash or hyphen punctuation mark
- Ps (Open_Punctuation): an opening punctuation mark (of a pair)
- Pe (Close_Punctuation): a closing punctuation mark (of a pair)
- Pi (Initial_Punctuation): an initial quotation mark
- Pf (Final_Punctuation): a final quotation mark
- Po (Other_Punctuation): a punctuation mark of other type
* <script type="text/javascript" src="categ_puncts.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
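To see which characters fall into the punctuation categories listed above, the built-in property escape \p{P} (which covers Pc, Pd, Ps, Pe, Pi, Pf and Po) can be used. The isPunctSketch function below is a hypothetical stand-in written for illustration; it does not call the library, and the library's dataset may come from a different Unicode version than the JavaScript engine's.

```javascript
function isPunctSketch(input) {
  // Assumption: a number is a single char code, anything else is a string.
  const str = typeof input === "number"
    ? String.fromCharCode(input)
    : String(input);
  return /^\p{P}+$/u.test(str);
}

console.log(isPunctSketch("_"));      // true  (Pc)
console.log(isPunctSketch("-"));      // true  (Pd)
console.log(isPunctSketch("()"));     // true  (Ps, Pe)
console.log(isPunctSketch("\u00AB")); // true  (Pi, left guillemet)
console.log(isPunctSketch("!?"));     // true  (Po)
console.log(isPunctSketch("+"));      // false (Sm, a math symbol)
console.log(isPunctSketch("$"));      // false (Sc, a currency symbol)
```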
Returns true if some_string is either a charCode for a separator, or a string that only contains separators. This includes the following Unicode categories:
- Zs (Space_Separator): a space character (of various non-zero widths)
- Zl (Line_Separator): U+2028 LINE SEPARATOR only
- Zp (Paragraph_Separator): U+2029 PARAGRAPH SEPARATOR only
* <script type="text/javascript" src="categ_separators.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
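One consequence of the category list above is that tabs and newlines are not separators in the Unicode sense: they belong to the control category (Cc). The sketch below shows this with the built-in \p{Z} property escape; isSeparatorSketch is a hypothetical name and the code does not use the library.

```javascript
function isSeparatorSketch(input) {
  // Assumption: a number is a single char code, anything else is a string.
  const str = typeof input === "number"
    ? String.fromCharCode(input)
    : String(input);
  return /^\p{Z}+$/u.test(str);
}

console.log(isSeparatorSketch(" "));      // true  (Zs, U+0020)
console.log(isSeparatorSketch("\u00A0")); // true  (Zs, no-break space)
console.log(isSeparatorSketch("\u2028")); // true  (Zl)
console.log(isSeparatorSketch("\u2029")); // true  (Zp)
console.log(isSeparatorSketch("\t"));     // false (Cc, a control code)
console.log(isSeparatorSketch("\n"));     // false (Cc, a control code)
```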
Returns true if some_string is either a charCode for a control code, or a string that only contains control codes.
* <script type="text/javascript" src="categ_controls.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
Returns true if some_string is either a charCode for a separator or a punctuation sign, or a string that only contains one of those characters.
* <script type="text/javascript" src="categ_puncts_separators.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
Returns true if some_string is either a charCode for a separator, a punctuation or a control sign, or a string that only contains one of those characters.
* <script type="text/javascript" src="categ_puncts_separators_controls.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
Returns true if some_string is a math sign or a string only composed of math signs.
* <script type="text/javascript" src="categ_maths.js"></script>
* <script type="text/javascript" src="unicode.js"></script>
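Which characters count as math signs can be surprising. The sketch below assumes the math dataset corresponds to the Unicode category Sm (Math_Symbol), which the text above does not state explicitly; isMathSketch is a hypothetical name and the code relies on the engine's built-in \p{Sm} rather than on the library.

```javascript
function isMathSketch(input) {
  // Assumption: a number is a single char code, anything else is a string.
  const str = typeof input === "number"
    ? String.fromCharCode(input)
    : String(input);
  return /^\p{Sm}+$/u.test(str);
}

console.log(isMathSketch("+"));      // true
console.log(isMathSketch("<=>"));    // true
console.log(isMathSketch("\u00D7")); // true  (multiplication sign)
console.log(isMathSketch("\u2212")); // true  (minus sign, U+2212)
console.log(isMathSketch("-"));      // false (hyphen-minus is Pd)
console.log(isMathSketch("*"));      // false (asterisk is Po)
```

Under that assumption, the ASCII hyphen-minus and asterisk are punctuation rather than math signs, so an expression such as "a-b" would need the punctuation check as well.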
Returns true if some_string is a currency symbol or a string composed only of currency symbols.
* <script type="text/javascript" src="categ_currencies.js"></script>
* <script type="text/javascript" src="unicode.js"></script>