L. Ardaillon and A. Roebel, "Fully-Convolutional Network for Pitch Estimation of Speech Signals", Proc. Interspeech, 2019.
We kindly request that academic publications making use of our FCN models cite the aforementioned paper.
The code provided in this repository performs monophonic pitch (f0) estimation. It is partly based on code from the CREPE repository: https://github.com/marl/crepe
Two different pre-trained fully-convolutional models are provided. These models have been trained exclusively on speech data and may therefore not perform as well on other types of sounds.
The code currently provided only runs pitch estimation on given sound files using the provided pretrained models (no code is currently provided for training the model on new data).
The models, algorithm, training, and evaluation procedures have been described in a publication entitled "Fully-Convolutional Network for Pitch Estimation of Speech Signals", to be presented at the Interspeech 2019 conference.
Below are the results of our evaluation comparing our models to the SWIPE algorithm and CREPE model:
| | FCN-1953 | FCN-993 | FCN-929 | CREPE | CREPE-speech | SWIPE |
|---|---|---|---|---|---|---|
| PAN-synth (25 cents) | 93.62 ± 3.34% | 94.31 ± 3.15% | 93.50 ± 3.43% | 77.62 ± 9.31% | 86.92 ± 8.28% | 84.56 ± 11.68% |
| PAN-synth (50 cents) | 98.37 ± 1.62% | 98.53 ± 1.54% | 98.27 ± 1.73% | 91.23 ± 6.00% | 97.27 ± 2.09% | 93.10 ± 7.26% |
| PAN-synth (200 cents) | 99.81 ± 0.64% | 99.79 ± 0.65% | 99.77 ± 0.73% | 95.65 ± 5.17% | 99.25 ± 1.07% | 97.51 ± 4.90% |
| manual (50 cents) | 88.32 ± 6.33% | 88.57 ± 5.77% | 88.88 ± 5.73% | 87.03 ± 7.35% | 88.45 ± 5.70% | 85.93 ± 7.62% |
| manual (200 cents) | 97.35 ± 3.02% | 97.31 ± 2.56% | 97.36 ± 2.51% | 92.57 ± 5.22% | 96.63 ± 2.91% | 95.03 ± 4.04% |
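The accuracies above count a frame as correct when the estimated f0 lies within the stated threshold (in cents) of the reference. A minimal sketch of this metric follows; the function names and the sample values are illustrative, not taken from this repository:

```python
import math

def cents_deviation(f_est, f_ref):
    """Deviation of an f0 estimate from the reference, in cents
    (1200 cents = one octave)."""
    return 1200.0 * math.log2(f_est / f_ref)

def raw_pitch_accuracy(est, ref, threshold_cents=50.0):
    """Fraction of frames whose estimate falls within the given
    cents threshold of the reference f0."""
    hits = sum(1 for fe, fr in zip(est, ref)
               if abs(cents_deviation(fe, fr)) <= threshold_cents)
    return hits / len(ref)

# Toy example: two estimates are within 50 cents, two are not.
ref = [220.0, 220.0, 440.0, 440.0]
est = [221.0, 230.0, 441.0, 500.0]
print(raw_pitch_accuracy(est, ref, 50.0))  # → 0.5
```

Loosening the threshold to 200 cents admits the 230 Hz estimate (about 77 cents off) as well, which is why accuracies rise monotonically with the threshold in the table above.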
Below is a comparison of latencies and computation times for the different models and SWIPE:
| | FCN-1953 | FCN-993 | FCN-929 | CREPE | SWIPE |
|---|---|---|---|---|---|
| latency | 0.122s | 0.062s | 0.058s | 0.032s | 0.128s |
| GPU | 0.016s | 0.010s | 0.021s | 0.092s | X |
| CPU | 1.65s | 0.89s | 3.34s | 14.79s | 0.63s |
To run pitch estimation with the FCN-1953 model, provide both the model architecture (json) and weights files:

```shell
python /path_to/FCN-f0/prediction.py -i /path_to/test.wav -o /path_to/test-FCN_1953.f0.csv -m /path_to/FCN-f0/models/FCN_1953/model.json -w /path_to/FCN-f0/models/FCN_1953/weights.h5 --use_single_core --verbose --plot
```

or provide only the weights file together with the input size (`-is`):

```shell
python /path_to/FCN-f0/prediction.py -i /path_to/test.wav -o /path_to/test-FCN_1953-no_json.f0.csv -w /path_to/FCN-f0/models/FCN_1953/weights.h5 -is 1953 --use_single_core --verbose --plot
```

Similarly, for the FCN-929 model:

```shell
python /path_to/FCN-f0/prediction.py -i /path_to/test.wav -o /path_to/test-FCN_929.f0.csv -m /path_to/FCN-f0/models/FCN_929/model.json -w /path_to/FCN-f0/models/FCN_929/weights.h5 --use_single_core --verbose --plot
```

or

```shell
python /path_to/FCN-f0/prediction.py -i /path_to/test.wav -o /path_to/test-FCN_929-no_json.f0.csv -w /path_to/FCN-f0/models/FCN_929/weights.h5 -is 929 --use_single_core --verbose --plot
```
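The `-o` flag writes the estimated f0 curve to a CSV file. As a rough sketch of post-processing, the snippet below reads such a file and computes the mean f0; the two-column (time, frequency) layout is assumed here by analogy with CREPE's output, and the inlined sample values are hypothetical:

```python
import csv
import io

# Hypothetical excerpt of a prediction output file; a real run would
# open the CSV written via the -o flag instead of this inline string.
sample = """0.000,219.84
0.008,220.11
0.016,220.47
0.024,221.02
"""

def load_f0(fileobj):
    """Read (time, f0) pairs from a two-column CSV file object."""
    reader = csv.reader(fileobj)
    return [(float(t), float(f)) for t, f in reader]

curve = load_f0(io.StringIO(sample))
mean_f0 = sum(f for _, f in curve) / len(curve)
print(f"{len(curve)} frames, mean f0 = {mean_f0:.2f} Hz")
```

If the file produced by `prediction.py` includes a header row or extra columns, the reader above would need to be adjusted accordingly.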
[1] Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello. "CREPE: A Convolutional Representation for Pitch Estimation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.