[SPARK-13325][SQL] Create a 64-bit hashcode expression

This PR introduces a 64-bit hashcode expression. Such an expression is especially usefull for HyperLogLog++ and other probabilistic datastructures. I have implemented xxHash64 which is a 64-bit hashing algorithm created by Yann Colet and Mathias Westerdahl. This is a high speed (C implementation runs at memory bandwidth) and high quality hashcode. It exploits both Instruction Level Parralellism (for speed) and the multiplication and rotation techniques (for quality) like MurMurHash does. The initial results are promising. I have added a CG'ed test to the `HashBenchmark`, and this results in the following results (running from SBT): Running benchmark: Hash For simple Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For simple: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- interpreted version 1011 / 1016 132.8 7.5 1.0X codegen version 1864 / 1869 72.0 13.9 0.5X codegen version 64-bit 1614 / 1644 83.2 12.0 0.6X Running benchmark: Hash For normal Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For normal: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- interpreted version 2467 / 2475 0.9 1176.1 1.0X codegen version 2008 / 2115 1.0 957.5 1.2X codegen version 64-bit 728 / 758 2.9 347.0 3.4X Running benchmark: Hash For array Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- interpreted version 1544 / 1707 0.1 11779.6 1.0X codegen version 2728 / 2745 0.0 20815.5 0.6X codegen version 64-bit 2508 / 2549 0.1 19132.8 0.6X Running benchmark: Hash For map Running case: interpreted version Running case: codegen version Running case: codegen version 64-bit Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz Hash For map: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- interpreted version 1819 / 1826 0.0 444014.3 1.0X codegen version 183 / 194 0.0 44642.9 9.9X codegen version 64-bit 173 / 174 0.0 42120.9 10.5X This shows that algorithm is consistently faster than MurMurHash32 in all cases and up to 3x (!) in the normal case. I have also added this to HyperLogLog++ and it cuts the processing time of the following code in half: val df = sqlContext.range(1<<25).agg(approxCountDistinct("id")) df.explain() val t = System.nanoTime() df.show() val ns = System.nanoTime() - t // Before ns: Long = 5821524302 // After ns: Long = 2836418963 cc cloud-fan (you have been working on hashcodes) / rxin Author: Herman van Hovell <[email protected]> Closes apache#11209 from hvanhovell/xxHash.
jjallaire · Mar 23, 2016 · 919bf32 · 919bf32
1 parent 8c82688
commit 919bf32
Show file tree

Hide file tree

Showing 7 changed files with 713 additions and 110 deletions.
diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.expressions;
+
+import org.apache.spark.unsafe.Platform;
+import org.apache.spark.util.SystemClock;
+
+// scalastyle: off
+
+/**
+ * xxHash64. A high quality and fast 64 bit hash code by Yann Colet and Mathias Westerdahl. The
+ * class below is modelled like its Murmur3_x86_32 cousin.
+ * <p/>
+ * This was largely based on the following (original) C and Java implementations:
+ * https://github.com/Cyan4973/xxHash/blob/master/xxhash.c
+ * https://github.com/OpenHFT/Zero-Allocation-Hashing/blob/master/src/main/java/net/openhft/hashing/XxHash_r39.java
+ * https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/XxHash64.java
+ */
+// scalastyle: on
+public final class XXH64 {
+
+  private static final long PRIME64_1 = 0x9E3779B185EBCA87L;
+  private static final long PRIME64_2 = 0xC2B2AE3D27D4EB4FL;
+  private static final long PRIME64_3 = 0x165667B19E3779F9L;
+  private static final long PRIME64_4 = 0x85EBCA77C2B2AE63L;
+  private static final long PRIME64_5 = 0x27D4EB2F165667C5L;
+
+  private final long seed;
+
+  public XXH64(long seed) {
+    super();
+    this.seed = seed;
+  }
+
+  @Override
+  public String toString() {
+    return "xxHash64(seed=" + seed + ")";
+  }
+
+  public long hashInt(int input) {
+    return hashInt(input, seed);
+  }
+
+  public static long hashInt(int input, long seed) {
+    long hash = seed + PRIME64_5 + 4L;
+    hash ^= (input & 0xFFFFFFFFL) * PRIME64_1;
+    hash = Long.rotateLeft(hash, 23) * PRIME64_2 + PRIME64_3;
+    return fmix(hash);
+  }
+
+  public long hashLong(long input) {
+    return hashLong(input, seed);
+  }
+
+  public static long hashLong(long input, long seed) {
+    long hash = seed + PRIME64_5 + 8L;
+    hash ^= Long.rotateLeft(input * PRIME64_2, 31) * PRIME64_1;
+    hash = Long.rotateLeft(hash, 27) * PRIME64_1 + PRIME64_4;
+    return fmix(hash);
+  }
+
+  public long hashUnsafeWords(Object base, long offset, int length) {
+    return hashUnsafeWords(base, offset, length, seed);
+  }
+
+  public static long hashUnsafeWords(Object base, long offset, int length, long seed) {
+    assert (length % 8 == 0) : "lengthInBytes must be a multiple of 8 (word-aligned)";
+    long hash = hashBytesByWords(base, offset, length, seed);
+    return fmix(hash);
+  }
+
+  public long hashUnsafeBytes(Object base, long offset, int length) {
+    return hashUnsafeBytes(base, offset, length, seed);
+  }
+
+  public static long hashUnsafeBytes(Object base, long offset, int length, long seed) {
+    assert (length >= 0) : "lengthInBytes cannot be negative";
+    long hash = hashBytesByWords(base, offset, length, seed);
+    long end = offset + length;
+    offset += length & -8;
+
+    if (offset + 4L <= end) {
+      hash ^= (Platform.getInt(base, offset) & 0xFFFFFFFFL) * PRIME64_1;
+      hash = Long.rotateLeft(hash, 23) * PRIME64_2 + PRIME64_3;
+      offset += 4L;
+    }
+
+    while (offset < end) {
+      hash ^= (Platform.getByte(base, offset) & 0xFFL) * PRIME64_5;
+      hash = Long.rotateLeft(hash, 11) * PRIME64_1;
+      offset++;
+    }
+    return fmix(hash);
+  }
+
+  private static long fmix(long hash) {
+    hash ^= hash >>> 33;
+    hash *= PRIME64_2;
+    hash ^= hash >>> 29;
+    hash *= PRIME64_3;
+    hash ^= hash >>> 32;
+    return hash;
+  }
+
+  private static long hashBytesByWords(Object base, long offset, int length, long seed) {
+    long end = offset + length;
+    long hash;
+    if (length >= 32) {
+      long limit = end - 32;
+      long v1 = seed + PRIME64_1 + PRIME64_2;
+      long v2 = seed + PRIME64_2;
+      long v3 = seed;
+      long v4 = seed - PRIME64_1;
+
+      do {
+        v1 += Platform.getLong(base, offset) * PRIME64_2;
+        v1 = Long.rotateLeft(v1, 31);
+        v1 *= PRIME64_1;
+
+        v2 += Platform.getLong(base, offset + 8) * PRIME64_2;
+        v2 = Long.rotateLeft(v2, 31);
+        v2 *= PRIME64_1;
+
+        v3 += Platform.getLong(base, offset + 16) * PRIME64_2;
+        v3 = Long.rotateLeft(v3, 31);
+        v3 *= PRIME64_1;
+
+        v4 += Platform.getLong(base, offset + 24) * PRIME64_2;
+        v4 = Long.rotateLeft(v4, 31);
+        v4 *= PRIME64_1;
+
+        offset += 32L;
+      } while (offset <= limit);
+
+      hash = Long.rotateLeft(v1, 1)
+              + Long.rotateLeft(v2, 7)
+              + Long.rotateLeft(v3, 12)
+              + Long.rotateLeft(v4, 18);
+
+      v1 *= PRIME64_2;
+      v1 = Long.rotateLeft(v1, 31);
+      v1 *= PRIME64_1;
+      hash ^= v1;
+      hash = hash * PRIME64_1 + PRIME64_4;
+
+      v2 *= PRIME64_2;
+      v2 = Long.rotateLeft(v2, 31);
+      v2 *= PRIME64_1;
+      hash ^= v2;
+      hash = hash * PRIME64_1 + PRIME64_4;
+
+      v3 *= PRIME64_2;
+      v3 = Long.rotateLeft(v3, 31);
+      v3 *= PRIME64_1;
+      hash ^= v3;
+      hash = hash * PRIME64_1 + PRIME64_4;
+
+      v4 *= PRIME64_2;
+      v4 = Long.rotateLeft(v4, 31);
+      v4 *= PRIME64_1;
+      hash ^= v4;
+      hash = hash * PRIME64_1 + PRIME64_4;
+    } else {
+      hash = seed + PRIME64_5;
+    }
+
+    hash += length;
+
+    long limit = end - 8;
+    while (offset <= limit) {
+      long k1 = Platform.getLong(base, offset);
+      hash ^= Long.rotateLeft(k1 * PRIME64_2, 31) * PRIME64_1;
+      hash = Long.rotateLeft(hash, 27) * PRIME64_1 + PRIME64_4;
+      offset += 8L;
+    }
+    return hash;
+  }
+}
diff --git a/.../main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala b/.../main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala
@@ -169,7 +169,7 @@ case class HyperLogLogPlusPlus(
     val v = child.eval(input)
     if (v != null) {
       // Create the hashed value 'x'.
-      val x = MurmurHash.hash64(v)
+      val x = XxHash64Function.hash(v, child.dataType, 42L)
 
       // Determine the index of the register we are going to use.
       val idx = (x >>> idxShift).toInt