SSE2 SIMD implementation of Huffman encoding
Full-color compression speedups relative to libjpeg-turbo 1.4.2:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  2.2-18% (avg. 9.5%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  10-25% (avg. 17%)

2.3 GHz AMD A10-4600M APU, Linux, 64-bit:  4.9-17% (avg. 11%)
2.3 GHz AMD A10-4600M APU, Linux, 32-bit:  8.8-19% (avg. 15%)

3.0 GHz Intel Core i7, OS X, 64-bit:  3.5-16% (avg. 10%)
3.0 GHz Intel Core i7, OS X, 32-bit:  4.8-14% (avg. 11%)

2.6 GHz AMD Athlon 64 X2 5050e:
Performance-neutral (give or take a few percent)

Full-color compression speedups relative to IPP:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  4.8-34% (avg. 19%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  -19% to 7.0% (avg. -7.0%)

Refer to libjpeg-turbo#42 for discussion.  Numerous other approaches were attempted,
but this one proved to be the most performant across all platforms.

This commit also fixes libjpeg-turbo#3. Strictly speaking, it works around the
issue: the clang-compiled version of jchuff.c still performs 20% worse than its
GCC-compiled counterpart, but that code is now bypassed by the new SSE2 Huffman
algorithm on SSE2-capable x86 and x86-64 platforms.

Based on:
mayeut@2cb4d41
mayeut@36c94e0
dcommander committed Jan 12, 2016
1 parent eb59b6e commit f3a8684
Showing 18 changed files with 5,157 additions and 84 deletions.
90 changes: 39 additions & 51 deletions BUILDING.md
@@ -38,19 +38,7 @@ Build Requirements

NOTE: the NASM build will fail if texinfo is not installed.

-- GCC v4.1 or later recommended for best performance
-  * Beginning with Xcode 4, Apple stopped distributing GCC and switched to
-    the LLVM compiler. Xcode v4.0 through v4.6 provides a GCC front end
-    called LLVM-GCC. Unfortunately, as of this writing, neither LLVM-GCC nor
-    the LLVM (clang) compiler produces optimal performance with libjpeg-turbo.
-    Building libjpeg-turbo with LLVM-GCC v4.2 results in a 10% performance
-    degradation when compressing using 64-bit code, relative to building
-    libjpeg-turbo with GCC v4.2. Building libjpeg-turbo with LLVM (clang)
-    results in a 20% performance degradation when compressing using 64-bit
-    code, relative to building libjpeg-turbo with GCC v4.2. If you are
-    running Snow Leopard or earlier, it is suggested that you continue to use
-    Xcode v3.2.6, which provides GCC v4.2. If you are using Lion or later, it
-    is suggested that you install Apple GCC v4.2 or GCC v5 through MacPorts.
+- GCC v4.1 (or later) or clang recommended for best performance

- If building the TurboJPEG Java wrapper, JDK or OpenJDK 1.5 or later is
required. Some systems, such as Solaris 10 and later and Red Hat Enterprise
@@ -89,38 +77,38 @@ for 64-bit build instructions.)

This will generate the following files under .libs/:

-**libjpeg.a**
+**libjpeg.a**
Static link library for the libjpeg API

-**libjpeg.so.{version}** (Linux, Unix)
-**libjpeg.{version}.dylib** (OS X)
-**cygjpeg-{version}.dll** (Cygwin)
+**libjpeg.so.{version}** (Linux, Unix)
+**libjpeg.{version}.dylib** (OS X)
+**cygjpeg-{version}.dll** (Cygwin)
Shared library for the libjpeg API

By default, *{version}* is 62.1.0, 7.1.0, or 8.0.2, depending on whether
libjpeg v6b (default), v7, or v8 emulation is enabled. If using Cygwin,
*{version}* is 62, 7, or 8.

-**libjpeg.so** (Linux, Unix)
-**libjpeg.dylib** (OS X)
+**libjpeg.so** (Linux, Unix)
+**libjpeg.dylib** (OS X)
Development symlink for the libjpeg API

-**libjpeg.dll.a** (Cygwin)
+**libjpeg.dll.a** (Cygwin)
Import library for the libjpeg API

-**libturbojpeg.a**
+**libturbojpeg.a**
Static link library for the TurboJPEG API

-**libturbojpeg.so.0.1.0** (Linux, Unix)
-**libturbojpeg.0.1.0.dylib** (OS X)
-**cygturbojpeg-0.dll** (Cygwin)
+**libturbojpeg.so.0.1.0** (Linux, Unix)
+**libturbojpeg.0.1.0.dylib** (OS X)
+**cygturbojpeg-0.dll** (Cygwin)
Shared library for the TurboJPEG API

-**libturbojpeg.so** (Linux, Unix)
-**libturbojpeg.dylib** (OS X)
+**libturbojpeg.so** (Linux, Unix)
+**libturbojpeg.dylib** (OS X)
Development symlink for the TurboJPEG API

-**libturbojpeg.dll.a** (Cygwin)
+**libturbojpeg.dll.a** (Cygwin)
Import library for the TurboJPEG API


@@ -333,16 +321,16 @@ Set the following shell variables for simplicity:
IOS_SYSROOT=$IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk
IOS_GCC=$IOS_PLATFORMDIR/Developer/usr/bin/arm-apple-darwin10-llvm-gcc-4.2

-*ARMv6 (code will run on all iOS devices, not SIMD-accelerated)*
+*ARMv6 (code will run on all iOS devices, not SIMD-accelerated)*
[NOTE: Requires Xcode 4.4.x or earlier]

IOS_CFLAGS="-march=armv6 -mcpu=arm1176jzf-s -mfpu=vfp"

-*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
+*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*

IOS_CFLAGS="-march=armv7 -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon"

-*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
+*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
[NOTE: Requires Xcode 4.5 or later]

IOS_CFLAGS="-march=armv7s -mcpu=swift -mtune=swift -mfpu=neon"
@@ -365,11 +353,11 @@ Set the following shell variables for simplicity:
IOS_SYSROOT=$IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk
IOS_GCC=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang

-*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
+*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*

IOS_CFLAGS="-arch armv7"

-*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
+*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*

IOS_CFLAGS="-arch armv7s"

@@ -527,22 +515,22 @@ on which version of cl.exe is in the `PATH`.

The following files will be generated under *{build_directory}*:

-**jpeg-static.lib**
+**jpeg-static.lib**
Static link library for the libjpeg API

-**sharedlib/jpeg{version}.dll**
+**sharedlib/jpeg{version}.dll**
DLL for the libjpeg API

-**sharedlib/jpeg.lib**
+**sharedlib/jpeg.lib**
Import library for the libjpeg API
-**turbojpeg-static.lib**
+
+**turbojpeg-static.lib**
Static link library for the TurboJPEG API

-**turbojpeg.dll**
+**turbojpeg.dll**
DLL for the TurboJPEG API

-**turbojpeg.lib**
+**turbojpeg.lib**
Import library for the TurboJPEG API

*{version}* is 62, 7, or 8, depending on whether libjpeg v6b (default), v7, or
@@ -569,22 +557,22 @@ build of libjpeg-turbo.

This will generate the following files under *{build_directory}*:

-**{configuration}/jpeg-static.lib**
+**{configuration}/jpeg-static.lib**
Static link library for the libjpeg API

-**sharedlib/{configuration}/jpeg{version}.dll**
+**sharedlib/{configuration}/jpeg{version}.dll**
DLL for the libjpeg API

-**sharedlib/{configuration}/jpeg.lib**
+**sharedlib/{configuration}/jpeg.lib**
Import library for the libjpeg API

-**{configuration}/turbojpeg-static.lib**
+**{configuration}/turbojpeg-static.lib**
Static link library for the TurboJPEG API

-**{configuration}/turbojpeg.dll**
+**{configuration}/turbojpeg.dll**
DLL for the TurboJPEG API

-**{configuration}/turbojpeg.lib**
+**{configuration}/turbojpeg.lib**
Import library for the TurboJPEG API

*{configuration}* is Debug, Release, RelWithDebInfo, or MinSizeRel, depending
@@ -603,22 +591,22 @@ cross-compiling on a Linux/Unix machine, then see "Build Recipes" below.

This will generate the following files under *{build_directory}*:

-**libjpeg.a**
+**libjpeg.a**
Static link library for the libjpeg API

-**sharedlib/libjpeg-{version}.dll**
+**sharedlib/libjpeg-{version}.dll**
DLL for the libjpeg API

-**sharedlib/libjpeg.dll.a**
+**sharedlib/libjpeg.dll.a**
Import library for the libjpeg API

-**libturbojpeg.a**
+**libturbojpeg.a**
Static link library for the TurboJPEG API

-**libturbojpeg.dll**
+**libturbojpeg.dll**
DLL for the TurboJPEG API

-**libturbojpeg.dll.a**
+**libturbojpeg.dll.a**
Import library for the TurboJPEG API

*{version}* is 62, 7, or 8, depending on whether libjpeg v6b (default), v7, or
10 changes: 10 additions & 0 deletions ChangeLog.txt
@@ -57,6 +57,16 @@ benchmark from outputting any images. This removes any potential operating
system overhead that might be caused by lazy writes to disk and thus improves
the consistency of the performance measurements.

+[12] Added SIMD acceleration for Huffman encoding on SSE2-capable x86 and
+x86-64 platforms. This speeds up the compression of full-color JPEGs by about
+10-15% on average (relative to libjpeg-turbo 1.4.x) when using modern Intel and
+AMD CPUs. Additionally, this works around an issue in the clang optimizer that
+prevents it (as of this writing) from achieving the same performance as GCC
+when compiling the C version of the Huffman encoder
+(https://llvm.org/bugs/show_bug.cgi?id=16035). For the purposes of benchmarking
+or regression testing, SIMD-accelerated Huffman encoding can be disabled by
+setting the JSIMD_NOHUFFENC environment variable to 1.
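
The ChangeLog entry above describes an environment-variable switch for the SIMD Huffman encoder. The part of the commit that reads the variable is not shown in this excerpt, so the following is only a minimal sketch of how such a switch could be combined with CPU detection. The function names and the `cpu_has_sse2` parameter are hypothetical, and treating only the exact string "1" as "disabled" is an assumption based on the wording of the entry.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not libjpeg-turbo API): decide whether the SIMD
 * Huffman encoder may be used.  cpu_has_sse2 stands in for the result of
 * CPU feature detection; nohuffenc is the value of JSIMD_NOHUFFENC, or
 * NULL if the variable is unset. */
static int huff_simd_allowed(int cpu_has_sse2, const char *nohuffenc)
{
  if (!cpu_has_sse2)
    return 0;
  /* Assumption: only the exact value "1" disables the SIMD path. */
  if (nohuffenc != NULL && strcmp(nohuffenc, "1") == 0)
    return 0;
  return 1;
}

/* Thin wrapper that reads the real environment. */
static int huff_simd_enabled(int cpu_has_sse2)
{
  return huff_simd_allowed(cpu_has_sse2, getenv("JSIMD_NOHUFFENC"));
}
```

Keeping the decision in a function that takes the variable's value as a parameter makes both settings easy to exercise in a regression test without modifying the process environment.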


1.4.2
=====
58 changes: 47 additions & 11 deletions jchuff.c
@@ -5,6 +5,7 @@
* Copyright (C) 1991-1997, Thomas G. Lane.
* libjpeg-turbo Modifications:
* Copyright (C) 2009-2011, 2014-2016 D. R. Commander.
+ * Copyright (C) 2015 Matthieu Darbois.
* For conditions of distribution and use, see the accompanying README.ijg
* file.
*
@@ -20,7 +21,7 @@
#define JPEG_INTERNALS
#include "jinclude.h"
#include "jpeglib.h"
-#include "jchuff.h"             /* Declarations shared with jcphuff.c */
+#include "jsimd.h"
#include "jconfigint.h"
#include <limits.h>

@@ -108,6 +109,8 @@ typedef struct {
long * dc_count_ptrs[NUM_HUFF_TBLS];
long * ac_count_ptrs[NUM_HUFF_TBLS];
#endif

+  int simd;
} huff_entropy_encoder;

typedef huff_entropy_encoder * huff_entropy_ptr;
@@ -159,6 +162,8 @@ start_pass_huff (j_compress_ptr cinfo, boolean gather_statistics)
entropy->pub.finish_pass = finish_pass_huff;
}

+  entropy->simd = jsimd_can_huff_encode_one_block();

for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
compptr = cinfo->cur_comp_info[ci];
dctbl = compptr->dc_tbl_no;
@@ -480,6 +485,23 @@ flush_bits (working_state * state)

/* Encode a single block's worth of coefficients */

+LOCAL(boolean)
+encode_one_block_simd (working_state * state, JCOEFPTR block, int last_dc_val,
+                       c_derived_tbl *dctbl, c_derived_tbl *actbl)
+{
+  JOCTET _buffer[BUFSIZE], *buffer;
+  size_t bytes, bytestocopy; int localbuf = 0;
+
+  LOAD_BUFFER()
+
+  buffer = jsimd_huff_encode_one_block(state, buffer, block, last_dc_val,
+                                       dctbl, actbl);
+
+  STORE_BUFFER()
+
+  return TRUE;
+}

LOCAL(boolean)
encode_one_block (working_state * state, JCOEFPTR block, int last_dc_val,
c_derived_tbl *dctbl, c_derived_tbl *actbl)
@@ -640,16 +662,30 @@ encode_mcu_huff (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
}

/* Encode the MCU data blocks */
-  for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
-    ci = cinfo->MCU_membership[blkn];
-    compptr = cinfo->cur_comp_info[ci];
-    if (! encode_one_block(&state,
-                           MCU_data[blkn][0], state.cur.last_dc_val[ci],
-                           entropy->dc_derived_tbls[compptr->dc_tbl_no],
-                           entropy->ac_derived_tbls[compptr->ac_tbl_no]))
-      return FALSE;
-    /* Update last_dc_val */
-    state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
+  if (entropy->simd) {
+    for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
+      ci = cinfo->MCU_membership[blkn];
+      compptr = cinfo->cur_comp_info[ci];
+      if (! encode_one_block_simd(&state,
+                                  MCU_data[blkn][0], state.cur.last_dc_val[ci],
+                                  entropy->dc_derived_tbls[compptr->dc_tbl_no],
+                                  entropy->ac_derived_tbls[compptr->ac_tbl_no]))
+        return FALSE;
+      /* Update last_dc_val */
+      state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
+    }
+  } else {
+    for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
+      ci = cinfo->MCU_membership[blkn];
+      compptr = cinfo->cur_comp_info[ci];
+      if (! encode_one_block(&state,
+                             MCU_data[blkn][0], state.cur.last_dc_val[ci],
+                             entropy->dc_derived_tbls[compptr->dc_tbl_no],
+                             entropy->ac_derived_tbls[compptr->ac_tbl_no]))
+        return FALSE;
+      /* Update last_dc_val */
+      state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
+    }
}

/* Completed MCU, so update state */
9 changes: 9 additions & 0 deletions jsimd.h
@@ -3,13 +3,16 @@
*
* Copyright 2009 Pierre Ossman <[email protected]> for Cendio AB
* Copyright 2011, 2014 D. R. Commander
+ * Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
* For conditions of distribution and use, see copyright notice in jsimdext.inc
*
*/

+#include "jchuff.h"             /* Declarations shared with jcphuff.c */

EXTERN(int) jsimd_can_rgb_ycc (void);
EXTERN(int) jsimd_can_rgb_gray (void);
EXTERN(int) jsimd_can_ycc_rgb (void);
@@ -82,3 +85,9 @@ EXTERN(void) jsimd_h2v2_merged_upsample
EXTERN(void) jsimd_h2v1_merged_upsample
(j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf);

+EXTERN(int) jsimd_can_huff_encode_one_block (void);
+
+EXTERN(JOCTET*) jsimd_huff_encode_one_block
+        (void * state, JOCTET *buffer, JCOEFPTR block, int last_dc_val,
+         c_derived_tbl *dctbl, c_derived_tbl *actbl);
14 changes: 14 additions & 0 deletions jsimd_none.c
@@ -3,6 +3,7 @@
*
* Copyright 2009 Pierre Ossman <[email protected]> for Cendio AB
* Copyright 2009-2011, 2014 D. R. Commander
+ * Copyright 2015 Matthieu Darbois
*
* Based on the x86 SIMD extension for IJG JPEG library,
* Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -387,3 +388,16 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
{
}

+GLOBAL(int)
+jsimd_can_huff_encode_one_block (void)
+{
+  return 0;
+}
+
+GLOBAL(JOCTET*)
+jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
+                             int last_dc_val, c_derived_tbl *dctbl,
+                             c_derived_tbl *actbl)
+{
+  return NULL;
+}
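
The jsimd_none.c stubs above are safe only because the capability query returns 0: start_pass_huff() caches that result in `entropy->simd`, so encode_mcu_huff() never calls the stub that returns NULL. The sketch below illustrates this query-then-dispatch pattern in isolation. Every name in it is a hypothetical stand-in rather than libjpeg-turbo code, and it branches inside the loop for brevity, whereas the commit keeps two separate loops.

```c
#include <stddef.h>

typedef unsigned char octet;  /* stand-in for JOCTET */

typedef struct {
  int simd;        /* cached result of the capability query */
  int simd_calls;  /* counters exist only to make the sketch observable */
  int c_calls;
} encoder_sketch;

/* Stubs in the style of jsimd_none.c: no SIMD support is reported, and the
 * encoder stub returns NULL because it is never expected to be reached. */
static int sketch_can_huff_encode(void) { return 0; }
static octet *sketch_simd_encode(octet *buffer) { (void)buffer; return NULL; }

/* Placeholder for the plain C path: just advances the output pointer. */
static octet *sketch_c_encode(octet *buffer) { return buffer + 1; }

/* Query the capability once per pass and cache it. */
static void sketch_start_pass(encoder_sketch *enc)
{
  enc->simd = sketch_can_huff_encode();
  enc->simd_calls = 0;
  enc->c_calls = 0;
}

/* Encode nblocks blocks, choosing the path from the cached flag. */
static octet *sketch_encode_mcu(encoder_sketch *enc, octet *buffer,
                                int nblocks)
{
  int blkn;

  for (blkn = 0; blkn < nblocks; blkn++) {
    if (enc->simd) {
      buffer = sketch_simd_encode(buffer);
      enc->simd_calls++;
    } else {
      buffer = sketch_c_encode(buffer);
      enc->c_calls++;
    }
  }
  return buffer;
}
```

With the stub query in place, every block takes the C path and the NULL-returning function is dead code, which is the behavior a non-SIMD build relies on.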
3 changes: 2 additions & 1 deletion jversion.h
@@ -32,6 +32,7 @@
"Copyright (C) 2009-2016 D. R. Commander\n" \
"Copyright (C) 2009-2011 Nokia Corporation and/or its subsidiary(-ies)\n" \
"Copyright (C) 2013-2014 MIPS Technologies, Inc.\n" \
-"Copyright (C) 2013 Linaro Limited"
+"Copyright (C) 2013 Linaro Limited\n" \
+"Copyright (C) 2015 Matthieu Darbois"

#define JCOPYRIGHT_SHORT "Copyright (C) 1991-2016 The libjpeg-turbo Project and many others"
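
The jversion.h hunk has to modify the Linaro line as well as add a new one, because the copyright text is a single macro built from adjacent string literals joined by backslash continuations: the previously final literal needs a trailing `\n` and a continuation before another literal can follow. A small sketch of the mechanism, using a hypothetical macro name and only the last two entries:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical macro for illustration; the real definition is in
 * jversion.h.  Adjacent string literals are concatenated by the compiler,
 * and each backslash continues the macro onto the next source line. */
#define COPYRIGHT_SKETCH \
  "Copyright (C) 2013 Linaro Limited\n" \
  "Copyright (C) 2015 Matthieu Darbois"

/* Count the text lines in the concatenated string. */
static size_t copyright_sketch_lines(void)
{
  const char *s = COPYRIGHT_SKETCH;
  size_t lines = 1;

  for (; *s != '\0'; s++) {
    if (*s == '\n')
      lines++;
  }
  return lines;
}
```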
14 changes: 8 additions & 6 deletions simd/CMakeLists.txt
@@ -22,17 +22,19 @@ endif()

if(SIMD_X86_64)
set(SIMD_BASENAMES jfdctflt-sse-64 jccolor-sse2-64 jcgray-sse2-64
-    jcsample-sse2-64 jdcolor-sse2-64 jdmerge-sse2-64 jdsample-sse2-64
-    jfdctfst-sse2-64 jfdctint-sse2-64 jidctflt-sse2-64 jidctfst-sse2-64
-    jidctint-sse2-64 jidctred-sse2-64 jquantf-sse2-64 jquanti-sse2-64)
+    jchuff-sse2-64 jcsample-sse2-64 jdcolor-sse2-64 jdmerge-sse2-64
+    jdsample-sse2-64 jfdctfst-sse2-64 jfdctint-sse2-64 jidctflt-sse2-64
+    jidctfst-sse2-64 jidctint-sse2-64 jidctred-sse2-64 jquantf-sse2-64
+    jquanti-sse2-64)
message(STATUS "Building x86_64 SIMD extensions")
else()
set(SIMD_BASENAMES jsimdcpu jfdctflt-3dn jidctflt-3dn jquant-3dn jccolor-mmx
jcgray-mmx jcsample-mmx jdcolor-mmx jdmerge-mmx jdsample-mmx jfdctfst-mmx
jfdctint-mmx jidctfst-mmx jidctint-mmx jidctred-mmx jquant-mmx jfdctflt-sse
-    jidctflt-sse jquant-sse jccolor-sse2 jcgray-sse2 jcsample-sse2 jdcolor-sse2
-    jdmerge-sse2 jdsample-sse2 jfdctfst-sse2 jfdctint-sse2 jidctflt-sse2
-    jidctfst-sse2 jidctint-sse2 jidctred-sse2 jquantf-sse2 jquanti-sse2)
+    jidctflt-sse jquant-sse jccolor-sse2 jcgray-sse2 jchuff-sse2 jcsample-sse2
+    jdcolor-sse2 jdmerge-sse2 jdsample-sse2 jfdctfst-sse2 jfdctint-sse2
+    jidctflt-sse2 jidctfst-sse2 jidctint-sse2 jidctred-sse2 jquantf-sse2
+    jquanti-sse2)
message(STATUS "Building i386 SIMD extensions")
endif()
