transphone
is a grapheme-to-phoneme (G2P) conversion toolkit. It provides phoneme tokenizers as well as approximate G2P models for roughly 8,000 languages.
This repo contains our code and pretrained models, roughly following our paper accepted at Findings of ACL 2022:
Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble
It is a multilingual G2P (grapheme-to-phoneme) model that can be applied to all ~8,000 languages registered in the Glottolog database. You can read our paper at OpenReview.
Our approach:
- We first train a supervised multilingual model on ~900 languages, using lexicons from Wiktionary.
- If a target language (any of the 8k languages) has no training set, we approximate its G2P model by selecting similar languages from the supervised set and ensembling their inference results to obtain the target pronunciation.
transphone is available from pip:
pip install transphone
Alternatively, you can clone this repository and install from source:
python setup.py install
The tokenizer has an interface similar to HuggingFace tokenizers: it converts a string into the target language's phonemes.
The tokenizer first looks up the pronunciation in a lexicon dictionary and falls back to the G2P engine when the word is not in the lexicon. Currently, more than 200 languages ship with a built-in lexicon; all other languages use the G2P model directly.
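The lookup-then-fallback dispatch can be sketched as follows. This is a toy illustration of the control flow described above, not transphone's actual internals; the `lexicon` dict and `DummyG2P` stand-in are hypothetical names introduced here.

```python
# Toy sketch of the tokenizer's dispatch: prefer the curated lexicon entry,
# fall back to G2P inference when the word is unknown.
def phonemize(word, lang, lexicon, g2p):
    if word in lexicon:                  # lexicon hit: use curated pronunciation
        return lexicon[word]
    return g2p.inference(word, lang)     # miss: approximate with the G2P model

# Minimal stand-in G2P engine, just to demonstrate the control flow
class DummyG2P:
    def inference(self, word, lang):
        return list(word)  # pretend each letter maps to one phoneme

lexicon = {'hello': ['h', 'ʌ', 'l', 'o', 'w']}
print(phonemize('hello', 'eng', lexicon, DummyG2P()))  # lexicon hit
print(phonemize('abc', 'eng', lexicon, DummyG2P()))    # G2P fallback
```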
In [1]: from transphone import read_tokenizer
In [2]: eng = read_tokenizer('eng')
In [3]: lst = eng.tokenize('hello world')
In [4]: lst
Out[4]: ['h', 'ʌ', 'l', 'o', 'w', 'w', 'ɹ̩', 'l', 'd']
In [5]: ids = eng.convert_tokens_to_ids(lst)
In [6]: ids
Out[6]: [7, 36, 11, 14, 21, 21, 33, 11, 3]
In [7]: eng.convert_ids_to_tokens(ids)
Out[7]: ['h', 'ʌ', 'l', 'o', 'w', 'w', 'ɹ̩', 'l', 'd']
In [8]: jpn = read_tokenizer('jpn')
In [9]: jpn.tokenize('こんにちは世界')
Out[9]: ['k', 'o', 'N', 'n', 'i', 'ch', 'i', 'w', 'a', 's', 'e', 'k', 'a', 'i']
In [10]: cmn = read_tokenizer('cmn')
In [11]: cmn.tokenize('你好世界')
Out[11]: ['n', 'i', 'x', 'a', 'o', 'ʂ', 'ɻ̩', 't͡ɕ', 'i', 'e']
In [12]: deu = read_tokenizer('deu')
In [13]: deu.tokenize('Hallo Welt')
Out[13]: ['h', 'a', 'l', 'o', 'v', 'e', 'l', 't']
A command line tool is also available:
# compute pronunciation for every word in input file
$ python -m transphone.run --lang eng --input sample.txt
h ɛ l o ʊ
w ə l d
t ɹ æ n s f ə ʊ n
# by specifying combine flag, you can get word + pronunciation per line
$ python -m transphone.run --lang eng --input sample.txt --combine=True
hello h ɛ l o ʊ
world w ə l d
transphone t ɹ æ n s f ə ʊ n
A simple Python usage example follows:
In [1]: from transphone.g2p import read_g2p
# read a pretrained model; it will be downloaded automatically into repo_root/data/model
In [2]: model = read_g2p()
# infer the pronunciation of a word given its ISO 639-3 language id
# for any of the ~900 pretrained languages, the pretrained model is used directly without approximation
In [3]: model.inference('transphone', 'eng')
Out[3]: ['t', 'ɹ', 'æ', 'n', 's', 'f', 'ə', 'ʊ', 'n']
# If the specified language is not available, it is approximated using the nearest languages.
# Here, aaa (Ghotuo) is not one of the training languages, so we fetch the 10 nearest languages to approximate it.
In [4]: model.inference('transphone', 'aaa')
lang aaa is not available directly, use ['bin', 'bja', 'bkh', 'bvx', 'dua', 'eto', 'gwe', 'ibo', 'kam', 'kik'] instead
Out[4]: ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
# For deeper insight, you can set the debug flag to see each language's output
In [5]: model.inference('transphone', 'aaa', debug=True)
bin ['s', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
bja ['s', 'l', 'a', 'n', 's', 'f', 'o', 'n']
bkh ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
bvx ['t', 'r', 'a', 'n', 's', 'f', 'o', 'n', 'e']
dua ['t', 'r', 'n', 's', 'f', 'n']
eto ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
gwe ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
ibo ['t', 'l', 'a', 'n', 's', 'p', 'o', 'n', 'e']
kam ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
kik ['t', 'l', 'a', 'n', 's', 'f', 'ɔ', 'n', 'ɛ']
Out[5]: ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
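To see how the per-language outputs combine, note that a simple whole-sequence majority vote over the ten candidates above already recovers the final result. This is a toy illustration only; the actual model may combine hypotheses differently (e.g. at decoding time).

```python
from collections import Counter

def majority_vote(candidate_seqs):
    """Return the most frequent phoneme sequence among the candidates."""
    counts = Counter(tuple(seq) for seq in candidate_seqs)
    best_seq, _ = counts.most_common(1)[0]
    return list(best_seq)

# Per-language outputs from the debug run above
candidates = [
    ['s','l','a','n','s','f','o','n','e'],  # bin
    ['s','l','a','n','s','f','o','n'],      # bja
    ['t','l','a','n','s','f','o','n','e'],  # bkh
    ['t','r','a','n','s','f','o','n','e'],  # bvx
    ['t','r','n','s','f','n'],              # dua
    ['t','l','a','n','s','f','o','n','e'],  # eto
    ['t','l','a','n','s','f','o','n','e'],  # gwe
    ['t','l','a','n','s','p','o','n','e'],  # ibo
    ['t','l','a','n','s','f','o','n','e'],  # kam
    ['t','l','a','n','s','f','ɔ','n','ɛ'],  # kik
]
print(majority_vote(candidates))  # ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
```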
model | # supported languages | description
---|---|---
latest | ~8k | based on our work at [1]
- [1] Li, Xinjian, et al. "Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble." Findings of the Association for Computational Linguistics: ACL 2022. 2022.
- [2] Li, Xinjian, et al. "Phone Inventories and Recognition for Every Language." LREC 2022. 2022.