Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] feat: use native Go package for decoding Ogg Vorbis audio #118

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

aykevl
Copy link
Contributor

@aykevl aykevl commented Oct 6, 2024

This is something I'm experimenting with. It's nowhere near finished, though it does work. The main thing that's missing is seek indices (which makes seeking a bit slower than it could be).

Advantages:

  • No need to have libogg and libvorbis installed. This means it's easier to cross compile and easier to port go-librespot to a different OS.
  • Less code to maintain (this PR removes a lot of code!).
  • Maybe faster (see below)

I did some benchmarking, to see whether using a native Go version would perhaps be faster than using libvorbis through cgo because of the CGo call overhead. What I found was that the native version is a lot slower, but not for the reason you might think: the audio decryptor actually becomes twice as slow while the decoder itself is only about 8% slower. So what I'm suspecting is that the new decoder does a lot of unnecessary seeking or reads more data than it actually needs.

I wanted to share this as an experiment. Maybe I'll continue working on it to see whether I can fix the decryptor slowdowns and of course to implement seeking.

@aykevl aykevl force-pushed the vorbis-purego branch 4 times, most recently from 5d70cac to f7c46d7 Compare October 7, 2024 10:23
@aykevl
Copy link
Contributor Author

aykevl commented Oct 7, 2024

Together with #121 this new Vorbis decoder is in fact more efficient than the existing libvorbis implementation, by about 20% on my laptop (Asahi Linux, which is arm64 like modern Raspberry Pis but perhaps not very comparable in terms of performance).

The main thing it's missing now is seek indices. That will likely require an addition to the oggvorbis package.

@aykevl
Copy link
Contributor Author

aykevl commented Oct 7, 2024

Added support for seeking. This requires a patch to the oggvorbis package: jfreymuth/oggvorbis@master...aykevl:oggvorbis:add-SetPositionAfter

In fact, with this change seeking becomes precise: it will seek to exactly the sample calculated in SetPositionMs instead of approximating it (usually off by a few hundred ms). This usually requires skipping through 2-3 Ogg pages which should be relatively fast.

I'll be testing this PR for a while, to see whether there are any regressions.

@aykevl
Copy link
Contributor Author

aykevl commented Oct 8, 2024

Seems to be working well.
@devgianlu if you agree with this change, I'll make this PR ready for merging.

(While testing I found a bug though: if you play a bit of a loud part of a song, pause, seek to a quiet part, and then resume, you'll hear a short bit of the loud part. But that bug was already present before this change).

@aykevl
Copy link
Contributor Author

aykevl commented Oct 8, 2024

Did some benchmarking on a Raspberry Pi 3, and unfortunately the new decoder is about 37% slower there. So maybe this is not such a good idea after all.
I profiled the code, but there's nothing in particular that springs out - it's all just loads, stores and a lot of floating point math.

@devgianlu
Copy link
Owner

This is great for removing a lot of Vorbis-specific code by offloading it to a library and also for making cross-compiling easier.

It is peculiar that it is that much slower than the native implementation, but floating-point math has always been killing performance on librespot-java too. I am surprised floating-point in Go is slower than in C, perhaps the native library is just less optimized?

@aykevl
Copy link
Contributor Author

aykevl commented Oct 14, 2024

I am surprised floating-point in Go is slower than in C, perhaps the native library is just less optimized?

My guess is that Go is missing some optimizations that are part of C compilers. The Go compiler hasn't had nearly the amount of performance work that GCC and Clang have had. And the Apple M1 Pro is probably good enough to detect all that and make it fast regardless, but the Cortex-A53 (in the Raspberry Pi 3) isn't.

Specifically, I saw that a C compiler is able to use the ldp and stp instructions to load/store two floats at a time: https://godbolt.org/z/hcvcr3sMK
But the Go compiler isn't: https://godbolt.org/z/fdsEYr1h4

So that would probably be a significant performance hit.

(Even loading a complex64 is done as two separate float32 loads, instead of ldp!)

@aykevl
Copy link
Contributor Author

aykevl commented Oct 14, 2024

Apparently this is a known issue: golang/go#19715
It seems like such a trivial issue to solve, with such obvious benefits.

@aykevl
Copy link
Contributor Author

aykevl commented Oct 17, 2024

Went way too deep into this rabbit hole and made a Go patch to support stp at least: https://go-review.googlesource.com/c/go/+/620535 (ldp is a bit more difficult).

Unfortunately this won't help the vorbis package directly, some changes are needed so that the compiler can recognize the stores are in fact next to each other in memory. But it probably won't help that much anyway, and I suspect there's lower hanging fruit (like bounds check elimination).

@devgianlu
Copy link
Owner

Went way too deep into this rabbit hole and made a Go patch to support stp at least: https://go-review.googlesource.com/c/go/+/620535 (ldp is a bit more difficult).

This is great! 😄

Unfortunately this won't help the vorbis package directly, some changes are needed so that the compiler can recognize the stores are in fact next to each other in memory. But it probably won't help that much anyway, and I suspect there's lower hanging fruit (like bounds check elimination).

Yes, it's not probably a matter of a few instructions.

@aykevl
Copy link
Contributor Author

aykevl commented Oct 17, 2024

I did a quick test with TinyGo (which uses LLVM, same as Clang and Rust), and the function that seems to be most important (imdct) is 29% faster (25% when optimizing for code size) So I'd say the slowdown is mostly caused by Go gc being a not-so-great compiler.

When I disable bounds checking in TinyGo, the difference is even bigger: it's now 54% faster (or 50% when optimizing for code size). Of course that's an unfair comparison, but it shows that the Go code could be much faster with some optimizations (mostly to avoid bounds checks) and a more advanced optimizer. Disabling bounds checking in the Go compiler has a much smaller effect.

Honestly I probably shouldn't put much more time in this. The existing vorbis decoder works fine. But if you're willing to deal with the slowdown, this new decoder could be worth it.

@devgianlu
Copy link
Owner

Honestly I probably shouldn't put much more time in this. The existing vorbis decoder works fine. But if you're willing to deal with the slowdown, this new decoder could be worth it.

IMO, the new decoder is great for reducing the amount of code to maintain and for removing a native dependency, but if we have to then maintain the native library too it doesn't make much sense. It may be worth reconsidering this in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants