Home

Welcome to the spark-gpu wiki!

Go to the repository -> click here

This page describes a prototype of Apache Spark that efficiently stores partition data in a columnar RDD in a binary format. The motivation is to accelerate Spark workloads by using GPUs and SIMD instructions in Apache Spark. An RDD in the original Apache Spark keeps the data of each row as a Scala sequence on the Java heap. This prototype keeps data in a binary representation off-heap, like the Dataset introduced in Spark 1.6. It also keeps data in columnar storage, which is suitable for GPU and SIMD processing.
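
The difference between row objects on the heap and off-heap columnar storage can be illustrated with a small Java sketch. This is not the prototype's implementation: the class `ColumnarPartitionSketch` and its methods are hypothetical names, and the sketch assumes a partition with just one int column and one double column. Each column is held in its own direct (off-heap) `ByteBuffer`, so the values of one column sit contiguously in memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not code from the spark-gpu prototype.
final class ColumnarPartitionSketch {
    private final ByteBuffer xs; // int column, stored off-heap
    private final ByteBuffer ys; // double column, stored off-heap
    private final int rows;

    ColumnarPartitionSketch(int[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("column lengths differ");
        }
        rows = x.length;
        // allocateDirect places the bytes outside the Java heap.
        xs = ByteBuffer.allocateDirect(rows * Integer.BYTES).order(ByteOrder.nativeOrder());
        ys = ByteBuffer.allocateDirect(rows * Double.BYTES).order(ByteOrder.nativeOrder());
        for (int i = 0; i < rows; i++) {
            // Absolute puts: value i of a column lives at offset i * elementSize.
            xs.putInt(i * Integer.BYTES, x[i]);
            ys.putDouble(i * Double.BYTES, y[i]);
        }
    }

    // Scans a single column; the other column is never touched.
    long sumX() {
        long total = 0;
        for (int i = 0; i < rows; i++) {
            total += xs.getInt(i * Integer.BYTES);
        }
        return total;
    }

    double sumY() {
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            total += ys.getDouble(i * Double.BYTES);
        }
        return total;
    }

    public static void main(String[] args) {
        ColumnarPartitionSketch p =
            new ColumnarPartitionSketch(new int[] {1, 2, 3}, new double[] {0.5, 1.5, 2.0});
        System.out.println(p.sumX());
        System.out.println(p.sumY());
    }
}
```

Because each column is one contiguous block of primitive values, a buffer like `xs` could in principle be handed to native code or copied to a device in a single transfer, whereas a row-wise layout would first require gathering the field out of every row object.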

You can see our current performance improvement (more than 3x) in the benchmark section.

You can run our prototype on your own machine with an NVIDIA GPU card, or on AWS EC2, by following the procedure described here

You can download pre-built binaries from (http://github.com/kiszk/spark-gpu/wiki/Downloads).

Please also visit the other pages from the menu on the right-hand side.

The current version has several limitations

Supports only x86_64 and ppc64le
Supports OpenJDK and IBM JDK
Supports only NVIDIA GPUs with CUDA (we confirmed with CUDA 7.0)
Supports CUDA 7.0 and 7.5 (should work with CUDA 6.0 and 6.5)
Supports scalar variables of primitive types and a primitive array in an RDD
Supports a new column format for map and reduce functions

Future plan

Generate GPU and SIMD code from a Spark application program
Currently, a programmer has to provide CUDA functions for GPU kernels in Spark functions. Alternatively, limited code generation for map() and reduce() functions can be enabled with "spark.gpu.codegen=true".
