//-----------------------------------------------------------------------------
1.3.2 07/28/2014
- Bug fixes:
- Fix for cub::DeviceReduce where reductions of small problems
(small enough to only dispatch a single threadblock) would run in
the default stream (stream zero) regardless of whether an alternate
stream was specified.
//-----------------------------------------------------------------------------
1.3.1 05/23/2014
- Bug fixes:
- Workaround for a benign WAW race warning reported by cuda-memcheck
in BlockScan specialized for the BLOCK_SCAN_WARP_SCANS algorithm.
- Fix for bug in DeviceRadixSort where the algorithm may sort more
key bits than the caller specified (up to the nearest radix digit).
- Fix for ~3% DeviceRadixSort performance regression on Kepler and
Fermi that was introduced in v1.3.0.
//-----------------------------------------------------------------------------
1.3.0 05/12/2014
- New features:
- CUB's collective (block-wide, warp-wide) primitives underwent a minor
interface refactoring:
- To provide the appropriate support for multidimensional thread blocks,
the interfaces for collective classes are now template-parameterized
by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being
optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the
constructors that accept remapped linear thread-identifiers have been
removed: all primitives now assume a row-major thread-ranking for
multidimensional thread blocks.
- To allow the host program (compiled by the host-pass) to
accurately determine the device-specific storage requirements for
a given collective (compiled for each device-pass), the interfaces
for collective classes are now (optionally) template-parameterized
by the desired PTX compute capability. This is useful when
aliasing collective storage to shared memory that has been
allocated dynamically by the host at the kernel call site.
- Most CUB programs having typical 1D usage should not require any
changes to accommodate these updates; a sketch of the refactored
interface follows this list.
- Added new "combination" WarpScan methods for efficiently computing
both inclusive and exclusive prefix scans (and sums).
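An illustrative sketch of the refactored interface (a minimal example
assuming the documented template-parameter order, with the algorithm
variant preceding the optional Y/Z dimensions; the optional trailing
PTX-version parameter is omitted here):

    #include <cub/cub.cuh>

    // Kernel for a 32x2 (2D) thread block: BLOCK_DIM_X replaces the old
    // BLOCK_THREADS parameter, and BLOCK_DIM_Y/BLOCK_DIM_Z default to 1
    __global__ void ScanKernel(int *d_data)
    {
        // Specialize BlockScan for int across a 32x2 thread block
        typedef cub::BlockScan<int, 32, cub::BLOCK_SCAN_RAKING, 2> BlockScan;
        __shared__ typename BlockScan::TempStorage temp_storage;

        // Row-major thread ranking is now assumed for 2D/3D blocks
        int linear_tid  = threadIdx.y * blockDim.x + threadIdx.x;
        int thread_data = d_data[linear_tid];

        // Block-wide exclusive prefix sum
        BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
        d_data[linear_tid] = thread_data;
    }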
- Bug fixes:
- Fixed bug in cub::WarpScan (which affected cub::BlockScan and
cub::DeviceScan) where incorrect results (e.g., NAN) would often be
returned when parameterized for floating-point types (fp32, fp64).
- Workaround-fix for ptxas error when compiling with the -G flag on Linux
(for debug instrumentation)
- Misc. workaround-fixes for certain scan scenarios (using custom
scan operators) where code compiled for SM1x is run on newer
GPUs of higher compute capability: the compiler could not tell
which memory space was being used by collective operations and was
mistakenly using global ops instead of shared ops.
//-----------------------------------------------------------------------------
1.2.3 04/01/2014
- Bug fixes:
- Fixed access violation bug in DeviceReduce::ReduceByKey for non-primitive value types
- Fixed code-snippet bug in ArgIndexInputIterator documentation
//-----------------------------------------------------------------------------
1.2.2 03/03/2014
- New features:
- Added MS VC++ project solutions for device-wide and block-wide examples
- Performance:
- Added a third algorithmic variant of cub::BlockReduce for improved performance
when using commutative operators (e.g., numeric addition)
- Bug fixes:
- Fixed bug where inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly
//-----------------------------------------------------------------------------
1.2.0 02/25/2014
- New features:
- Added device-wide reduce-by-key (DeviceReduce::ReduceByKey, DeviceReduce::RunLengthEncode); a usage sketch follows below
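A minimal host-side sketch of the two-phase usage pattern shared by
CUB's device-wide primitives (argument names follow the current
documentation and may differ slightly in the v1.2.0 headers):

    #include <cub/cub.cuh>

    // All pointer arguments are assumed to be previously allocated
    // device arrays
    void SumByKey(int *d_keys_in, int *d_unique_out, int *d_values_in,
                  int *d_aggregates_out, int *d_num_runs_out, int num_items)
    {
        void *d_temp_storage = NULL;
        size_t temp_storage_bytes = 0;

        // Pass 1: d_temp_storage is NULL, so only the required temporary
        // storage size is written to temp_storage_bytes
        cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
            d_keys_in, d_unique_out, d_values_in, d_aggregates_out,
            d_num_runs_out, cub::Sum(), num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Pass 2: run the reduce-by-key for real
        cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
            d_keys_in, d_unique_out, d_values_in, d_aggregates_out,
            d_num_runs_out, cub::Sum(), num_items);

        cudaFree(d_temp_storage);
    }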
- Performance:
- Improved DeviceScan, DeviceSelect, DevicePartition performance
- Documentation and testing:
- Compatible with CUDA 6.0
- Added performance-portability plots for many device-wide primitives to doc
- Update doc and tests to reflect iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Bug fixes:
- Revised the operation of temporary tile status bookkeeping for DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences)
- Fixed DeviceScan bug where Win32 alignment disagreements between host and device regarding user-defined data types would corrupt tile status
- Fixed BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block
- Added workaround for TexRefInputIterator to work with CUDA 6.0
//-----------------------------------------------------------------------------
1.1.1 12/11/2013
- New features:
- Added TexObjInputIterator, TexRefInputIterator, CacheModifiedInputIterator, and CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. Compatible with Thrust API.
- Added descending sorting to DeviceRadixSort and BlockRadixSort
- Added min, max, arg-min, and arg-max to DeviceReduce
- Added DeviceSelect (select-unique, select-if, and select-flagged)
- Added DevicePartition (partition-if, partition-flagged)
- Added generic cub::ShuffleUp(), cub::ShuffleDown(), and cub::ShuffleBroadcast() for warp-wide communication of arbitrary data types (SM3x+)
- Added cub::MaxSmOccupancy() for accurately determining SM occupancy for any given kernel function pointer
- Performance:
- Improved DeviceScan and DeviceRadixSort performance for older architectures (SM10-SM30)
- Interface changes:
- Refactored block-wide I/O (BlockLoad and BlockStore), removing cache-modifiers from their interfaces. The CacheModifiedInputIterator and CacheModifiedOutputIterator should now be used with BlockLoad and BlockStore to effect that behavior (see the sketch after this list).
- Rename device-wide "stream_synchronous" param to "debug_synchronous" to avoid confusion about usage
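A minimal sketch of the new composition (assuming the v1.1.1-era
BlockLoad signature, where the first template parameter is the input
iterator type; later releases take the value type instead):

    #include <cub/cub.cuh>

    __global__ void LoadKernel(int *d_in)
    {
        // Wrap the raw pointer so loads are issued with the "cache-global"
        // (CG) modifier, instead of passing a cache-modifier flag to BlockLoad
        typedef cub::CacheModifiedInputIterator<cub::LOAD_CG, int> InputItr;
        typedef cub::BlockLoad<InputItr, 128, 4> BlockLoad;
        __shared__ typename BlockLoad::TempStorage temp_storage;

        int items[4];
        BlockLoad(temp_storage).Load(InputItr(d_in), items);
        // ... items now hold this thread's 4 loaded elements ...
    }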
- Documentation and testing:
- Added simple examples of device-wide methods
- Improved doxygen documentation and example snippets
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform)
- Bug fixes:
- Fixed misc DeviceScan, BlockScan, DeviceReduce, and BlockReduce bugs when operating on non-primitive types for older architectures SM10-SM13
- Fixed DeviceScan / WarpReduction bug: SHFL-based segmented reduction producing incorrect results for multi-word types (size > 4B) on Linux
- Fixed BlockScan bug: For warpscan-based scans, not all threads in the first warp were entering the prefix callback functor
- Fixed DeviceRadixSort bug: race condition with key-value pairs for pre-SM35 architectures
- Fixed DeviceRadixSort bug: incorrect bitfield-extract behavior with long keys on 64-bit Linux
- Fixed BlockDiscontinuity bug: compilation error for types other than int32/uint32
- CDP (device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts
//-----------------------------------------------------------------------------
1.0.2 08/23/2013
- Corrections to code snippet examples for BlockLoad, BlockStore, and BlockDiscontinuity
- Cleaned up unnecessary/missing header includes. You can now safely #include a specific .cuh instead of cub.cuh (see the example at the end of this entry)
- Bug/compilation fixes for BlockHistogram
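For example (header path taken from the project's source layout, assumed
stable since this release):

    // Pull in only block-wide reduction, rather than all of CUB via cub.cuh
    #include <cub/block/block_reduce.cuh>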
//-----------------------------------------------------------------------------
1.0.1 08/08/2013
- New collective interface idiom (specialize::construct::invoke); a sketch follows at the end of this entry.
- Added best-in-class DeviceRadixSort. Implements short-circuiting for homogeneous digit passes.
- Added best-in-class DeviceScan. Implements single-pass "adaptive-lookback" strategy.
- Significantly improved documentation (with example code snippets)
- More extensive regression test suite for aggressively testing collective variants
- Allow non-trivially-constructed types (previously unions had prevented aliasing temporary storage of those types)
- Improved support for Kepler SHFL (collective ops now use SHFL for types larger than 32b)
- Better code generation for 64-bit addressing within BlockLoad/BlockStore
- DeviceHistogram now supports histograms with an arbitrary number of bins
- Misc. fixes:
- Workarounds for SM10 codegen issues in uncommonly-used WarpScan/Reduce specializations
- Updates to accommodate CUDA 5.5 dynamic parallelism
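A minimal sketch of the specialize::construct::invoke idiom (written
against the currently documented BlockReduce interface, which this
release introduced in essentially this form):

    #include <cub/cub.cuh>

    __global__ void SumKernel(int *d_in, int *d_out)
    {
        // Specialize: bind the collective to its compile-time configuration
        typedef cub::BlockReduce<int, 128> BlockReduce;

        // Construct: bind an instance to its shared temporary storage
        __shared__ typename BlockReduce::TempStorage temp_storage;

        // Invoke: perform the block-wide reduction (result valid in thread0)
        int sum = BlockReduce(temp_storage).Sum(d_in[threadIdx.x]);
        if (threadIdx.x == 0) *d_out = sum;
    }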
//-----------------------------------------------------------------------------
0.9.4 05/07/2013
- Fixed compilation errors for SM10-SM13
- Fixed compilation errors for some WarpScan entrypoints on SM30+
- Added block-wide histogram (BlockHistogram256)
- Added device-wide histogram (DeviceHistogram256)
- Added new BlockScan algorithm variant BLOCK_SCAN_RAKING_MEMOIZE, which
trades increased register pressure for reduced shared memory I/O (see
the sketch at the end of this entry)
- Updates to BlockRadixRank to use BlockScan (which improves performance
on Kepler due to SHFL instruction)
- Allow types other than C++ primitives to be used in WarpScan::*Sum methods
if they only have operator+ overloaded. (Previously they were also
required to support assignment from int(0).)
- Update BlockReduce's BLOCK_REDUCE_WARP_REDUCTIONS algorithm to work even
when the block size is not an even multiple of the warp size
- Added work management utility descriptors (GridQueue, GridEvenShare)
- Refactoring of DeviceAllocator interface and CachingDeviceAllocator
implementation
- Misc. documentation updates and corrections.
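Minimal sketches of the BLOCK_SCAN_RAKING_MEMOIZE variant and of a
user-defined type that only overloads operator+ (written against the
currently documented template-parameter order, assumed compatible with
this release):

    #include <cub/cub.cuh>

    // Custom type usable with WarpScan::*Sum: only operator+ is needed;
    // assignment from int(0) is no longer required
    struct Vec2
    {
        float x, y;
        __host__ __device__ Vec2 operator+(const Vec2 &other) const
        {
            Vec2 r;
            r.x = x + other.x;
            r.y = y + other.y;
            return r;
        }
    };

    __global__ void MemoizeScanKernel(int *d_data)
    {
        // Select the new variant via the algorithm template parameter
        typedef cub::BlockScan<int, 128, cub::BLOCK_SCAN_RAKING_MEMOIZE> BlockScan;
        __shared__ typename BlockScan::TempStorage temp_storage;

        int item = d_data[threadIdx.x];
        BlockScan(temp_storage).InclusiveSum(item, item);
        d_data[threadIdx.x] = item;
    }

    // Launched with a single 32-thread warp
    __global__ void WarpSumKernel(Vec2 *d_data)
    {
        typedef cub::WarpScan<Vec2> WarpScan;
        __shared__ typename WarpScan::TempStorage temp_storage;

        Vec2 item = d_data[threadIdx.x];
        WarpScan(temp_storage).InclusiveSum(item, item);
        d_data[threadIdx.x] = item;
    }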
//-----------------------------------------------------------------------------
0.9.2 04/04/2013
- Added WarpReduce. WarpReduce uses the SHFL instruction when applicable.
BlockReduce now uses this WarpReduce instead of implementing its own.
- Misc. fixes for 64-bit Linux compilation warnings and errors.
- Misc. documentation updates and corrections.
//-----------------------------------------------------------------------------
0.9.1 03/09/2013
- Fix for ambiguity in BlockReduce::Reduce() between generic reduction and
summation. Summation entrypoints are now called ::Sum(), similar to the
convention in BlockScan.
- Small edits to mainpage documentation and download tracking
//-----------------------------------------------------------------------------
0.9.0 03/07/2013
- Initial "preview" release. CUB is the first durable, high-performance library
of cooperative block-level, warp-level, and thread-level primitives for CUDA
kernel programming. More primitives and examples coming soon!