This is my own implementation of the Rawformer model
(Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection - Xiaohui Liu, Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Jianwu Dang)
WARNING
- This code may not be exactly the same as what is described in the paper.
- If you find any bugs and want to fix them, please open a pull request.
- Using pre-emphasis for preprocessing showed superior performance (a minimal sketch follows this list).
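Pre-emphasis is a first-order high-pass filter, `y[n] = x[n] - a * x[n-1]`. Below is a minimal sketch of how it could be applied to a raw waveform before it enters the model; the function name and the coefficient `0.97` (a conventional default) are my own choices, not values taken from the paper.

```python
import torch

def pre_emphasis(waveform: torch.Tensor, coeff: float = 0.97) -> torch.Tensor:
    """Apply a first-order pre-emphasis filter: y[n] = x[n] - coeff * x[n-1].

    waveform: (batch, num_samples) raw audio.
    coeff: 0.97 is a conventional default, not a value from the paper.
    """
    # Keep the first sample as-is; subtract the scaled previous sample elsewhere.
    return torch.cat(
        [waveform[:, :1], waveform[:, 1:] - coeff * waveform[:, :-1]],
        dim=1,
    )
```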
In the paper, the authors developed three variants of Rawformer: Rawformer-S, Rawformer-L, and SE-Rawformer.
I implemented all of these models only with 1-dimensional positional encoding.
`N` is the number of Conv2D-based blocks and `M` is the number of Transformer encoders; a minimal sketch of how they fit together follows the list below.

- Rawformer-S
  - `N` = 4, `M` = 2
  - Conv2D-based block: same as the ResNet block used in AASIST
- Rawformer-L
  - `N` = 6, `M` = 3
  - Conv2D-based block: same as the ResNet block used in AASIST
- SE-Rawformer
  - `N` = 4, `M` = 2
  - Same as Rawformer-S, but the last three Conv2D-based blocks are replaced with Res-SERes2Net blocks
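To make the structure above concrete, here is a minimal, illustrative PyTorch skeleton of how `N` Conv2D-based blocks, 1-dimensional sinusoidal positional encoding, and `M` Transformer encoders could be wired together. All names here (`SinusoidalPositionalEncoding`, `RawformerSketch`, `conv_block`, `d_model`, `n_heads`) are my own placeholders; this is a sketch under those assumptions, not the paper's or this repo's exact implementation.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Standard 1-D sinusoidal positional encoding (Vaswani et al., 2017)."""

    def __init__(self, d_model: int, max_len: int = 10000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); add the encoding for the first `time` steps.
        return x + self.pe[: x.size(1)]

class RawformerSketch(nn.Module):
    """Illustrative skeleton only: N Conv2D-based blocks -> flatten to a
    1-D token sequence -> positional encoding -> M Transformer encoders.
    `conv_block` is a factory standing in for the AASIST ResNet block
    (or a Res-SERes2Net block for SE-Rawformer)."""

    def __init__(self, conv_block, n_blocks: int = 4, m_encoders: int = 2,
                 d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.conv_blocks = nn.ModuleList(conv_block() for _ in range(n_blocks))
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoders = nn.TransformerEncoder(layer, num_layers=m_encoders)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) feature map from the front-end.
        for block in self.conv_blocks:
            x = block(x)
        # Collapse the channel and frequency axes into one d_model-sized
        # token dimension, so each time step becomes one token (assumes
        # channels * freq == d_model after the conv stack).
        batch, channels, freq, time = x.shape
        x = x.reshape(batch, channels * freq, time).transpose(1, 2)
        x = self.pos_enc(x)      # add 1-D positional information
        return self.encoders(x)  # (batch, time, d_model)
```

Under this sketch, Rawformer-L would use `n_blocks=6, m_encoders=3`, and SE-Rawformer would have the factory return Res-SERes2Net blocks for the last three positions.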