===== Coding Standard =====
https://isocpp.org/wiki/faq/coding-standards

===== File description =====
zhy_symbol_map.cpp/h, zh_symbol_map.cpp/h:
  perfect hashes mapping phonetic symbol strings to integer codes, generated by gen_gperf_symbol_pcm.pl.

===== Data Structure =====
ZHY_PHash { char* symbol, short code }; // code ranges from 1 to 5012
short * ZHY_PHash::in_word_set (register const char *str, register unsigned int len);

ZH_PHash { char* symbol, short code };  // code ranges from 10001 to 12474
short * ZH_PHash::in_word_set (register const char *str, register unsigned int len);
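
Since the two code ranges above do not overlap, a symbol code on its own is enough to tell which table it belongs to. A minimal sketch of that check follows. The enum, constants and function name are hypothetical and do not appear in the source; only the range bounds are taken from the comments above.

```cpp
// Hypothetical helper for illustration; not part of the source tree.
enum class SymbolTable { None, ZHY, ZH };

// Range bounds copied from the comments on ZHY_PHash and ZH_PHash.
constexpr int kZhyFirst = 1;
constexpr int kZhyLast = 5012;
constexpr int kZhFirst = 10001;
constexpr int kZhLast = 12474;

// Return the table whose code range contains `code`,
// or SymbolTable::None if the code falls outside both ranges.
inline SymbolTable table_for_code(int code) {
  if (code >= kZhyFirst && code <= kZhyLast) return SymbolTable::ZHY;
  if (code >= kZhFirst && code <= kZhLast) return SymbolTable::ZH;
  return SymbolTable::None;
}
```

For example, table_for_code(5012) yields SymbolTable::ZHY, while 5013 falls in the gap between the ranges and yields SymbolTable::None.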

===== Data Flow =====
Look up a Chinese string: Dict.lookup(string)
1. Split the string into parts (words and English text) with friso: ...
2. Query the phonetic symbols of each part
2.1 English text is kept as-is
2.2 A word is split into characters: Character::split(string)
2.3 Get the DictItem for the first character: Dict.mDictItemArray[c.code]
2.4 Check for a matching word: DictItem.wordList
2.5 Get the phonetic symbol: Character.phonSymbol
3. If the part is English, synthesize it with eSpeak or Festival
4. Else if a recording exists for the word's phonSymbols, use it
5. Else synthesize each phonSymbol separately
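
Steps 3 to 5 form a three-way decision per part, which can be sketched as below. The types Part and SynthAction and the function choose_action are hypothetical stand-ins for illustration only; the real Dict and Character classes are not reproduced here.

```cpp
#include <vector>

// Hypothetical types for illustration; they do not mirror the real classes.
enum class SynthAction { ExternalEnglish, RecordedWord, PerSymbol };

struct Part {
  bool isEnglish;                // step 2.1: English text is kept as-is
  bool hasWordRecording;         // assumed flag: a recording of the whole word exists
  std::vector<int> symbolCodes;  // phonetic symbol codes of the characters
};

// Pick how a part is synthesized, following steps 3 to 5 of the data flow.
inline SynthAction choose_action(const Part& part) {
  if (part.isEnglish) return SynthAction::ExternalEnglish;      // step 3
  if (part.hasWordRecording) return SynthAction::RecordedWord;  // step 4
  return SynthAction::PerSymbol;                                // step 5
}
```

The order of the checks matters: a whole-word recording is preferred over per-symbol synthesis whenever one is available.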

===== Dictionary Format =====
unsigned short/unsigned int characterCode; // if the first 2 bytes of characterCode equal 1,
                        // the next 4 bytes hold the real characterCode
unsigned short symbolCode; // to get symbol string: Dict.symbolArray[symbolCode].symbol
unsigned short wordCount;
unsigned byte charCount; // number of characters of the first word
unsigned short/unsigned int characterCode; // the first character code of the first word
unsigned short symbolCode; // symbol code of first character
.....
notes:
sizeof(short) == 2 bytes
sizeof(int) == 4 bytes
endian = little
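
The variable-width characterCode field is the least obvious part of this format. The sketch below decodes one such field from a byte buffer, assuming the little-endian layout stated in the notes. All function names are hypothetical, and the rest of the record (symbolCode, wordCount, and so on) is not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Read a little-endian unsigned 16-bit value at `pos` and advance `pos`.
inline std::uint16_t read_u16(const std::vector<std::uint8_t>& buf,
                              std::size_t& pos) {
  if (pos + 2 > buf.size()) throw std::out_of_range("truncated 16-bit field");
  std::uint16_t v = static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
  pos += 2;
  return v;
}

// Read a little-endian unsigned 32-bit value at `pos` and advance `pos`.
inline std::uint32_t read_u32(const std::vector<std::uint8_t>& buf,
                              std::size_t& pos) {
  if (pos + 4 > buf.size()) throw std::out_of_range("truncated 32-bit field");
  std::uint32_t v = static_cast<std::uint32_t>(buf[pos]) |
                    (static_cast<std::uint32_t>(buf[pos + 1]) << 8) |
                    (static_cast<std::uint32_t>(buf[pos + 2]) << 16) |
                    (static_cast<std::uint32_t>(buf[pos + 3]) << 24);
  pos += 4;
  return v;
}

// Hypothetical helper: decode one characterCode field.
// A 2-byte value of 1 is an escape marker; the real code follows in 4 bytes.
inline std::uint32_t read_character_code(const std::vector<std::uint8_t>& buf,
                                         std::size_t& pos) {
  std::uint16_t first = read_u16(buf, pos);
  if (first == 1) return read_u32(buf, pos);
  return first;
}
```

With this layout, the bytes 2D 4E decode to 0x4E2D in two bytes, while 01 00 00 00 02 00 decode to 0x20000 and consume six bytes.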

===== Voice Data Index File Format =====
// 5 bytes
unsigned short samplerate;
unsigned byte; // 1 for 16bit wav, 2 for 16bit gsm
unsigned short symbolCount;

// for a single character, one pinyin entry takes 9 bytes
unsigned byte codeCount; // number of symbols; >1 means the wave file is a word
unsigned short symbolCode;
... // repeat codeCount times
unsigned int frameOffset; // wave file offset
frames; // takes 3 bytes
... // repeat symbolCount times
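
The 5-byte header at the start of the index file can be parsed as sketched below. This assumes the same little-endian byte order as the dictionary format, which this section does not state explicitly. The struct, the field name `format` (the field is unnamed in the layout above) and the function name are hypothetical. The per-symbol entries are not parsed here.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory view of the 5-byte header.
struct VoiceIndexHeader {
  std::uint16_t samplerate;
  std::uint8_t format;        // 1 = 16-bit WAV, 2 = 16-bit GSM
  std::uint16_t symbolCount;  // number of index entries that follow
};

// Parse the header from the first 5 bytes of `buf`.
// Byte order is assumed to be little-endian.
inline VoiceIndexHeader parse_voice_index_header(
    const std::vector<std::uint8_t>& buf) {
  if (buf.size() < 5) throw std::runtime_error("header shorter than 5 bytes");
  VoiceIndexHeader h;
  h.samplerate = static_cast<std::uint16_t>(buf[0] | (buf[1] << 8));
  h.format = buf[2];
  h.symbolCount = static_cast<std::uint16_t>(buf[3] | (buf[4] << 8));
  return h;
}
```

For instance, the bytes 44 AC 01 10 00 would describe a 44100 Hz, 16-bit WAV voice with 16 index entries.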

===== memory check =====
Reference: http://valgrind.org/docs/manual/ms-manual.html
valgrind --tool=massif ./ekho 123
ms_print massif.out.xxx   (xxx is the process ID of the profiled run)