===== Coding Standard =====
https://isocpp.org/wiki/faq/coding-standards

===== File description =====
zhy_symbol_map.cpp/h, zh_symbol_map.cpp/h: a perfect hash from phonetic symbol string to integer code, generated by gen_gperf_symbol_pcm.pl.

===== Data Structure =====
ZHY_PHash { char* symbol, short code }; // code ranges from 1 to 5012
short * ZHY_PHash::in_word_set (register const char *str, register unsigned int len);

ZH_PHash { char* symbol, short code }; // code ranges from 10001 to 12474
short * ZH_PHash::in_word_set (register const char *str, register unsigned int len);

===== Data Flow =====
Look up a Chinese string: Dict.lookup(string)
1. Split the string into parts (words and English text) with friso: ...
2. Query the phonetic symbol of each part
  2.1 English text is kept as is
  2.2 A word is split into characters: Character::split(string)
  2.3 Get the DictItem from the first character: Dict.mDictItemArray[c.code]
  2.4 Check for a matching word: DictItem.wordList
  2.5 Get the phonetic symbol: Character.phonSymbol
3. If the part is English, synthesize it with eSpeak or Festival
4. Else if a recording exists for the word's phonSymbols, use it
5. Else synthesize each phonSymbol

===== Dictionary Format =====
unsigned short/unsigned int characterCode; // if the first 2 bytes of characterCode equal 1,
                                           // then the next 4 bytes hold the real characterCode
unsigned short symbolCode; // to get the symbol string: Dict.symbolArray[symbolCode].symbol
unsigned short wordCount;
  unsigned byte charCount; // number of characters in the first word
  unsigned short/unsigned int characterCode; // code of the first character of the first word
  unsigned short symbolCode; // symbol code of the first character
  .....

notes:
sizeof(short) == 2 bytes
sizeof(int) == 4 bytes
endian = little

===== Voice Data Index File Format =====
// 5 bytes
unsigned short samplerate;
unsigned byte; // 1 for 16bit wav, 2 for 16bit gsm
unsigned short symbolCount;

// for a single character, one pinyin entry takes 9 bytes
unsigned byte codeCount; // number of symbols. >1 means the wave file is a word
unsigned short symbolCode;
... // repeat codeCount times
unsigned int frameOffset; // wave file offset frames; // takes 3 bytes
... // repeat symbolCount times

===== memory check =====
Reference: http://valgrind.org/docs/manual/ms-manual.html

valgrind --tool=massif ./ekho 123
ms_print massif.out.xxx
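
===== Example: reading a dictionary entry head =====
The escape rule for characterCode in the Dictionary Format section above may be easier to follow in code. The following is a minimal sketch, not the actual Ekho reader. It assumes the rule means "read 2 bytes; if they equal 1, read 4 more bytes as the real code", and that all fields are little-endian as stated in the notes. The names read_u16le, read_u32le, read_character_code, EntryHead and read_entry_head are hypothetical and do not come from the source tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helpers for illustration only; not taken from the Ekho sources.

// Read a little-endian 16-bit value at pos and advance pos by 2.
std::uint16_t read_u16le(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    assert(pos <= buf.size());  // a larger pos would be a caller bug
    if (buf.size() - pos < 2) throw std::out_of_range("truncated 16-bit field");
    std::uint16_t v = static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
    pos += 2;
    return v;
}

// Read a little-endian 32-bit value at pos and advance pos by 4.
std::uint32_t read_u32le(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    assert(pos <= buf.size());
    if (buf.size() - pos < 4) throw std::out_of_range("truncated 32-bit field");
    std::uint32_t v = static_cast<std::uint32_t>(buf[pos])
                    | (static_cast<std::uint32_t>(buf[pos + 1]) << 8)
                    | (static_cast<std::uint32_t>(buf[pos + 2]) << 16)
                    | (static_cast<std::uint32_t>(buf[pos + 3]) << 24);
    pos += 4;
    return v;
}

// Assumed reading of the format: characterCode is normally 2 bytes, but when
// those 2 bytes equal 1 the following 4 bytes carry the real code.
std::uint32_t read_character_code(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    std::uint16_t first = read_u16le(buf, pos);
    if (first == 1) return read_u32le(buf, pos);
    return first;
}

// The three leading fields of one dictionary entry.
struct EntryHead {
    std::uint32_t characterCode;
    std::uint16_t symbolCode;
    std::uint16_t wordCount;
};

// Read characterCode, symbolCode and wordCount in file order.
EntryHead read_entry_head(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    EntryHead h;
    h.characterCode = read_character_code(buf, pos);
    h.symbolCode = read_u16le(buf, pos);
    h.wordCount = read_u16le(buf, pos);
    return h;
}
```

With these assumptions, the bytes 2D 4E 05 00 02 00 decode to characterCode 0x4E2D, symbolCode 5 and wordCount 2, while an entry starting with 01 00 00 00 02 00 takes the escape path and yields characterCode 0x20000. Assembling values byte by byte keeps the reader independent of the host's endianness and of the platform's sizeof(short) and sizeof(int).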