optimize Binarize() performance when `onset == offset` #1721

benniekiss · 2024-06-15T16:15:53Z

While processing long audio files in the SpeakerDiarization pipeline, I noticed that the to_annotation() method was slow. I tracked the slowdown to pyannote.audio.utils.signal.Binarize.__call__(), which loops frame by frame over a numpy array that can become quite large.

In my tests, the original implementation took about 60 seconds for a 9-hour audio file. With this new implementation, it takes about 0.5 seconds.

I've only tested this with the SpeakerDiarization pipeline, but the new implementation returns the same results as the original.
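For readers unfamiliar with the bottleneck, here is a minimal sketch of per-frame hysteresis thresholding for a single speaker. It is an illustration written for this discussion, not code copied from pyannote.audio: the function name `hysteresis_regions`, its arguments, and the plain `(start, end)` tuples it returns are all assumptions. The point is that the Python-level loop runs once per frame, so its cost grows with the length of the audio.

```python
import numpy as np


def hysteresis_regions(scores, timestamps, onset=0.5, offset=0.5):
    """Per-frame hysteresis thresholding (illustrative, hypothetical helper).

    A region opens when a score rises above `onset` and closes at the first
    frame whose score falls below `offset`. Returns (start, end) time pairs.
    """
    regions = []
    is_active = scores[0] > onset
    start = timestamps[0]
    # One Python-level iteration per frame: this is the slow part.
    for t, y in zip(timestamps[1:], scores[1:]):
        if is_active:
            if y < offset:
                regions.append((float(start), float(t)))
                is_active = False
        elif y > onset:
            start = t
            is_active = True
    # Close a region that is still open at the last frame.
    if is_active:
        regions.append((float(start), float(timestamps[-1])))
    return regions


scores = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
timestamps = np.arange(5, dtype=float)
print(hysteresis_regions(scores, timestamps))  # [(1.0, 3.0), (4.0, 4.0)]
```

With `onset` and `offset` set to different values, the current state depends on the history of the signal, which is why this form is harder to vectorize.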

benniekiss · 2024-07-14T18:42:54Z

Fixed an off-by-one error in the new method.

I also made a google colab notebook showcasing the improvements: https://colab.research.google.com/drive/1Me3GgQUPXxjuEn06DNVco_GIxlUoYPTE?usp=sharing

In summary, the new method gives a slight speedup for fully synthetic data, a roughly 2x speedup for discrete (0s and 1s) synthetic data, and an almost 100x speedup for real data in the SpeakerDiarization pipeline.

The notebook also lets you extend the real data sample under the TEST WITH REAL DATA section by setting AUDIO_LENGTH to the desired number of hours.

| Data Type | Original Method | V2 Method |
| --- | --- | --- |
| Synthetic = `np.random.randn(100000, 50)` | 00:00:08.781 | 00:00:07.972 |
| Synthetic Discrete = `np.random.randint(0, 2, size=(100000, 50))` | 00:00:19.085 | 00:00:10.724 |
| Real Data - huggingface datasets (01:02:27.300 long audio) | 00:00:00.755 | 00:00:00.008 |

EDIT: I realized that I had not tested this with different onset and offset values when initializing the Binarize class. After doing so, I found that the two implementations do not return the same results. I will keep working on this to see if there's a way to make any improvements.

benniekiss · 2024-11-30T17:02:23Z

I've updated this patch to only use the optimized method when onset == offset, otherwise it uses the original method. I spent some time trying to find a suitable replacement for cases where onset != offset, but most of my attempts actually took longer.

The speed improvements appear in the speaker-diarization pipeline because the to_annotation() method hardcodes onset and offset to both be 0.5.
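To illustrate why the `onset == offset` case is easy to speed up, here is a minimal sketch of single-threshold binarization using only NumPy array operations. It is not the code in this patch: the function name `threshold_regions`, its arguments, and the convention of closing a region at the first inactive frame are assumptions made for the example. With a single threshold, each frame's state depends only on its own score, so region boundaries can be found with `np.diff` instead of a per-frame loop.

```python
import numpy as np


def threshold_regions(scores, timestamps, threshold=0.5):
    """Return (start, end) time pairs where `scores` exceeds `threshold`.

    Hypothetical helper for one speaker: `scores` and `timestamps` are 1-D
    arrays of equal length. A region ends at the first inactive frame, or
    at the last frame if the region is still open there.
    """
    active = np.asarray(scores) > threshold
    # Pad with False on both sides so regions touching either edge
    # still produce one rising and one falling edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # index of first active frame
    ends = np.flatnonzero(edges == -1)    # index of first inactive frame
    ends = np.minimum(ends, len(active) - 1)  # clamp a region open at the end
    return [(float(timestamps[s]), float(timestamps[e]))
            for s, e in zip(starts, ends)]


scores = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
timestamps = np.arange(5, dtype=float)
print(threshold_regions(scores, timestamps))  # [(1.0, 3.0), (4.0, 4.0)]
```

The remaining Python loop runs once per region rather than once per frame, which is where the savings on long recordings would come from in a sketch like this.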

I regularly process audio files that are 5+ hours long, so the speed improvements here are beneficial for me. I've been running this patch regularly and have not noticed any issues. With shorter audio, the improvements are negligible.

This patch is ready for review, but I understand if you don't want to merge it, since the benefits only show up with long audio files, and I don't know how many people are processing 5+ hour audio samples. If others are experiencing this bottleneck, I'd be interested in hearing about it.


benniekiss force-pushed the binarize_annotation_opt branch from c6d73b0 to dc6b1a8 Compare June 21, 2024 11:08

benniekiss force-pushed the binarize_annotation_opt branch from dc6b1a8 to 3f1a482 Compare July 6, 2024 15:01

benniekiss changed the title ~~improve Binarize() performance~~ [WIP] improve Binarize() performance Jul 14, 2024

benniekiss marked this pull request as draft July 31, 2024 21:12

benniekiss force-pushed the binarize_annotation_opt branch from 5666ab8 to afef580 Compare August 27, 2024 21:12

benniekiss force-pushed the binarize_annotation_opt branch 2 times, most recently from 20395b9 to c3f1df8 Compare November 30, 2024 16:32

benniekiss marked this pull request as ready for review November 30, 2024 17:03

benniekiss changed the title ~~[WIP] improve Binarize() performance~~ optimize Binarize() performance when onset == offset Nov 30, 2024

conditionally use optimized method

649c060

benniekiss force-pushed the binarize_annotation_opt branch from c3f1df8 to 649c060 Compare January 29, 2025 22:25

dont rename tracks when generating annotation

0f3b2fe

* let the user decide how to rename tracks, if necessary
* reduces a costly step for long audios
