Home
Go to the repository -> click here
This page describes a prototype of Apache Spark that stores partition data efficiently as a columnar RDD in a binary format. The motivation is to accelerate Spark workloads by using GPUs and SIMD instructions. An RDD in the original Apache Spark keeps each row as a Scala sequence on the Java heap. This prototype instead keeps data in a binary representation off-heap, like the Dataset API introduced in Spark 1.6. It also keeps the data in columnar storage, which is suitable for GPUs and SIMD.
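To make the storage idea concrete, here is a minimal sketch in plain Java of packing row-like records into contiguous off-heap columns. It is not the prototype's implementation: the class `ColumnarSketch` and the helpers `toColumns` and `sumColumn` are hypothetical names chosen for illustration, and only the standard `java.nio.ByteBuffer` API is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: hypothetical names, not part of the prototype's API.
class ColumnarSketch {
    // Packs (id, value) rows into two contiguous off-heap buffers, one per column.
    static ByteBuffer[] toColumns(int[] ids, double[] values) {
        int n = ids.length;
        // allocateDirect places the bytes outside the Java heap.
        ByteBuffer idCol = ByteBuffer.allocateDirect(n * Integer.BYTES)
                                     .order(ByteOrder.nativeOrder());
        ByteBuffer valCol = ByteBuffer.allocateDirect(n * Double.BYTES)
                                      .order(ByteOrder.nativeOrder());
        for (int i = 0; i < n; i++) {
            idCol.putInt(i * Integer.BYTES, ids[i]);       // absolute write, column 0
            valCol.putDouble(i * Double.BYTES, values[i]); // absolute write, column 1
        }
        return new ByteBuffer[] { idCol, valCol };
    }

    // Scans a single column sequentially; the other column is never touched.
    static double sumColumn(ByteBuffer valCol, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += valCol.getDouble(i * Double.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer[] cols = toColumns(new int[] {1, 2, 3},
                                      new double[] {1.5, 2.5, 4.0});
        System.out.println("off-heap: " + cols[0].isDirect());
        System.out.println("sum of value column: " + sumColumn(cols[1], 3));
    }
}
```

Because each column sits in one contiguous block of primitive values, it could in principle be handed to a GPU or processed with SIMD instructions without first unpacking per-row objects.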
You can see our current performance improvement (more than 3x) in the benchmark section.
You can run our prototype on your own machine with an NVIDIA GPU card, or on AWS EC2, by following the procedure described here
You can download pre-built binaries from (http://github.com/kiszk/spark-gpu/wiki/Downloads).
The current version has several limitations:
- supports only x86_64 and ppc64le
- supports OpenJDK and IBM JDK
- supports only NVIDIA GPUs with CUDA (we confirmed with CUDA 7.0)
- supports CUDA 7.0 and 7.5 (should work with CUDA 6.0 and 6.5)
- supports scalar variables of primitive types and primitive arrays in an RDD
- supports a new column format for map and reduce functions
Future plan
- Generate GPU and SIMD code from a Spark application program
- Currently, a programmer has to provide CUDA functions for the GPU kernels used in Spark functions. Alternatively, limited code generation for map() and reduce() functions can be enabled with "spark.gpu.codegen=true".
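One possible way to set the option from application code is sketched below. It assumes the prototype reads "spark.gpu.codegen" like any ordinary Spark configuration property, which this page does not state explicitly. In stock Spark, a `SparkConf` created with its default constructor picks up `spark.*` Java system properties, so the sketch only sets the system property. The class `GpuCodegenConfigSketch` and the method `enableCodegen` are hypothetical names.

```java
// Illustrative only: hypothetical class and method names.
class GpuCodegenConfigSketch {
    // Option name taken from this page.
    static final String KEY = "spark.gpu.codegen";

    // Sets the option as a JVM system property before any SparkConf is created.
    static void enableCodegen() {
        System.setProperty(KEY, "true");
    }

    public static void main(String[] args) {
        enableCodegen();
        // A SparkConf / SparkContext would be created after this point, assuming
        // the prototype honors the property as a normal Spark setting.
        System.out.println(KEY + "=" + System.getProperty(KEY));
    }
}
```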