transphone

transphone is a grapheme-to-phoneme conversion toolkit. It provides phoneme tokenizers as well as approximation G2P model for 8000 languages.

This repo contains our code and pretrained models roughly following our paper accepted at Findings of ACL 2022

Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble

It is a multilingual G2P (grapheme-to-phoneme) model that can be applied to all 8k languages registered in the Glottolog database. You can read our papers at Open Review

Our approach:

We first trained our supervised multilingual model with ~900 languages using lexicons from Wikitionary
if the target language (any language from the 8k languages) does not have any training set, we approximate its g2p model by using similar lanuguages from the supervised language sets and ensemble their inference results to obtain the target g2p.

Install

transphone is available from pip

pip install transphone

You can clone this repository and install

python setup.py install

Usage

Tokenizer interface

The tokenizer has a similar interface as HuggingFace tokenizer, which converts a string into each languages' phonemes

The tokenizer will first lookup lexicon dictionary for pronunciation, it will fall back to the G2P engine if lexicon is not available. Currently, more than 200 languages have lexicon available inside. Other languages will use G2P instead.

In [1]: from transphone import read_tokenizer                                                                                                  

In [2]: eng = read_tokenizer('eng')                                                                                                            

In [3]: lst = eng.tokenize('hello world')                                                                                                      

In [4]: lst                                                                                                                                    
Out[4]: ['h', 'ʌ', 'l', 'o', 'w', 'w', 'ɹ̩', 'l', 'd']

In [5]: ids = eng.convert_tokens_to_ids(lst)                                                                                                   

In [6]: ids                                                                                                                                    
Out[6]: [7, 36, 11, 14, 21, 21, 33, 11, 3]

In [7]: eng.convert_ids_to_tokens(ids)                                                                                                         
Out[7]: ['h', 'ʌ', 'l', 'o', 'w', 'w', 'ɹ̩', 'l', 'd']

In [8]: jpn = read_tokenizer('jpn')                                                                                                            

In [9]: jpn.tokenize('こんにちは世界')                                                                                                         
Out[9]: ['k', 'o', 'N', 'n', 'i', 'ch', 'i', 'w', 'a', 's', 'e', 'k', 'a', 'i']

In [10]: cmn = read_tokenizer('cmn')                                                                                                           

In [11]: cmn.tokenize('你好世界')                                                                                                              
Out[11]: ['n', 'i', 'x', 'a', 'o', 'ʂ', 'ɻ̩', 't͡ɕ', 'i', 'e']

In [12]: deu = read_tokenizer('deu')                                    

In [13]: deu.tokenize('Hallo Welt')                                     
Out[13]: ['h', 'a', 'l', 'o', 'v', 'e', 'l', 't']

G2P Command line

A command line tool is available

# compute pronunciation for every word in input file
$ python -m transphone.run --lang eng --input sample.txt 
h ɛ l o ʊ
w ə l d
t ɹ æ n s f ə ʊ n

# by specifying combine flag, you can get word + pronunciation per line
$ python -m transphone.run --lang eng --input sample.txt --combine=True
hello h ɛ l o ʊ
world w ə l d
transphone t ɹ æ n s f ə ʊ n

python G2P interface

A simple python usage is as follows:

In [1]: from transphone.g2p import read_g2p                                                                                                     

# read a pretrained model. It will download the pretrained model automatically into repo_root/data/model
In [2]: model = read_g2p()                                                                                                                      

# to infer pronunciation for a word with ISO 639-3 id
# For any pretrained languages (~900 languages), it will use the pretrained model without approximation
In [3]: model.inference('transphone', 'eng')                                                                                                    
Out[3]: ['t', 'ɹ', 'æ', 'n', 's', 'f', 'ə', 'ʊ', 'n']

# If the specified language is not available, then it will approximate it using nearest languages
# in this case, aaa (Ghotuo language) is not one of the training languages, we fetch 10 nearest languages to approximate it 
In [4]: model.inference('transphone', 'aaa')                                                                                                    
lang  aaa  is not available directly, use  ['bin', 'bja', 'bkh', 'bvx', 'dua', 'eto', 'gwe', 'ibo', 'kam', 'kik']  instead
Out[4]: ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']

# To gain deeper insights, you can also specify debug flag to see output of each language
In [5]: model.inference('transphone', 'aaa', debug=True)                                                                                        
bin   ['s', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
bja   ['s', 'l', 'a', 'n', 's', 'f', 'o', 'n']
bkh   ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
bvx   ['t', 'r', 'a', 'n', 's', 'f', 'o', 'n', 'e']
dua   ['t', 'r', 'n', 's', 'f', 'n']
eto   ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
gwe   ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
ibo   ['t', 'l', 'a', 'n', 's', 'p', 'o', 'n', 'e']
kam   ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']
kik   ['t', 'l', 'a', 'n', 's', 'f', 'ɔ', 'n', 'ɛ']
Out[5]: ['t', 'l', 'a', 'n', 's', 'f', 'o', 'n', 'e']

Models

model	# supported languages	description
latest	~8k	based on our work at [1]

Reference

[1] Li, Xinjian, et al. "Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble." Findings of the Association for Computational Linguistics: ACL 2022. 2022.
[2] Li, Xinjian, et al. "Phone Inventories and Recognition for Every Language" LREC 2022. 2022

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github/workflows		.github/workflows
test		test
transphone		transphone
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
sample.txt		sample.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

transphone

Install

Usage

Tokenizer interface

G2P Command line

python G2P interface

Models

Reference

About

Releases

Packages

Languages

License

entn-at/transphone

Folders and files

Latest commit

History

Repository files navigation

transphone

Install

Usage

Tokenizer interface

G2P Command line

python G2P interface

Models

Reference

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages