-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized Paeth filtering (multiple pixels at the same time) #157
Comments
That's an interesting idea. Do you have any numbers on how often "the Paeth filter is used for 4 or 8 adjacent rows" is true?
Are you able to link to or share your test code? |
Here's some test code just testing the raw filter, assuming that all rows use Paeth: https://gitlab.com/-/snippets/4781267 I don't necessarily have good (representative) numbers on how often it happens that 4 or 8 adjacent rows use the Paeth filter, but across some files from the qoi benchmarking dataset, I have seen some distribution that in many files it does not happen at all, but then in the remaining files it happens a significant number of times. Here are some stats by qoibench, stbi2 is a modified version of stb_image where I applied this filtering technique:
Here's a histogram of in how many files, a certain percentage of rows were covered by the 4-at-a-time technique, from 1% to 100% but without the bar for 0% which would ruin the scale of the y axis |
OK, stbi2 is faster than stb_image, but stb_image does not do SIMD Paeth filtering, right? (It does have some SIMD code, for JPEG decoding, but not PNG.) Wuffs already does SIMD Paeth filtering (example), even if it's only 1 row at a time and not your 4-at-a-time technique. Do you have a sense of how much faster it would be (compared to a SIMD-using baseline) to adopt your algorithm? |
The 4-rows-at-a-time technique also wins in the pure filter-vs-filter (which are both SIMD filters but one is the classic pixel-by-pixel one and the other is the 4-rows-at-a-time one) test by up to 3x on my PC, but I don't have an "integrated" test that compares them in-context |
I've got the same idea as @IJzerbaard about a week ago when improving PNG decoding performance in Blend2D. I think it's a great idea to use multiple rows to filter Paeth, but I think the most important is to optimize the DEFLATE decompressor first, because it consumes the most cycles, at least according to Here are my own benchmarks that compare decoding of libpng, wuffs, spng, stbimage, and blend2d (these are from my local branch, still working on improvements). I'm using Ryzen 79950X.
These use testing images from QOI project and some other testing images for PNG I found people use to benchmark PNG decoders. The first images are especially interesting as they contain dozens of dynamic Huffman tables so the decompressor can bottleneck on Huffman table building if fast-bits is too high. My aim is to match libdeflate in DEFLATE performance - then I think further filter optimizations make sense. For comparison, here is a comparison with QOI:
The article I wrote here is already outdated with numbers (the perf of blend2d improved since then): In general when both PNG and QOI decoders are optimized to the bone the average decoding time is not that far. Both PNG and QOI are annoyingly scalar when it comes to decoding (impossible to do more work ahead of time as everything depends on past work). |
Most PNG decoders (including the one in Wuffs from what I can see) use a pixel-by-pixel approach to filtering, and may use SIMD for parallelism across the channels of a pixel. However, it is possible to use SIMD to apply the Paeth filter to multiple pixels at once. Not 4 sequential pixels, which would be the neatest case for SIMD but is prevented by dependencies between the pixels, but 4 anti-diagonal pixels. 4 anti-diagonal pixels can be collected together in a vector relatively efficiently by loading 4 rows of pixels, staggered so that the first row has an x-offset of 3 pixels compared to the last row, then transpose those 4 rows (similar to _MM_TRANSPOSE4_PS, but with integer vectors). That produces 4 of those sets of 4 anti-diagonal pixels for the price of 8 shuffles, and after applying the filter another 8 shuffles are needed to un-transpose them (some additional shuffles are needed to update the "top" aka B vector between columns). An anti-diagonal group of pixels does not have dependencies between the pixels in the same group (as a horizontal group of pixels would) so that kind of group can be filtered at the same time using SIMD. All the shuffling is not free but in my tests it was well worth doing.
Some diagrams:
Filtering 4 or 8 rows at the same time (filtering 8 rows at once exposes more ILP) does not fit very naturally onto the "possibly different filter for each row" nature of the PNG format, but it is still possible in cases where the Paeth filter is used for 4 or 8 adjacent rows. The Avg filter can be implemented in a similar manner.
If the Wuffs project is interested in this approach, I'm willing to help integrate it into Wuffs, but I'm really not familiar with the language.
The text was updated successfully, but these errors were encountered: