File Format Benchmarks

These big data file format benchmarks compare:

  • Avro
  • JSON
  • ORC
  • Parquet

The project is split into three sub-modules to mitigate dependency conflicts:

  • core - the shared part of the benchmarks
  • hive - the Hive benchmarks
  • spark - the Spark benchmarks

To build this library:

% mvn clean package

To fetch the source data:

% ./fetch-data.sh

To generate the derived data:

% java -jar core/target/orc-benchmarks-core-*-uber.jar generate data

To run a scan of all of the data:

% java -jar core/target/orc-benchmarks-core-*-uber.jar scan data

To run the full read benchmark:

% java -jar hive/target/orc-benchmarks-hive-*-uber.jar read-all data

To run the column projection benchmark:

% java -jar hive/target/orc-benchmarks-hive-*-uber.jar read-some data

To run the decimal/decimal64 benchmark:

% java -jar hive/target/orc-benchmarks-hive-*-uber.jar decimal data

To run the Spark benchmark:

% java -jar spark/target/orc-benchmarks-spark-*.jar spark data
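
The steps above can be chained into a single script. The following is a minimal sketch, not something shipped with this repository: the `run` and `run_all` helpers and the `DRY_RUN` variable are hypothetical names introduced here for illustration. It assumes it is started from the directory that contains `fetch-data.sh` and the `core`, `hive`, and `spark` sub-modules, and it defaults to printing the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the repository): runs the README steps in order.
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to execute them.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run_all() {
  # Each step runs only if the previous one succeeded.
  # The jar globs are left unquoted on purpose so the shell expands them,
  # exactly as in the individual commands above.
  run mvn clean package &&
  run ./fetch-data.sh &&
  run java -jar core/target/orc-benchmarks-core-*-uber.jar generate data &&
  run java -jar core/target/orc-benchmarks-core-*-uber.jar scan data &&
  run java -jar hive/target/orc-benchmarks-hive-*-uber.jar read-all data &&
  run java -jar hive/target/orc-benchmarks-hive-*-uber.jar read-some data &&
  run java -jar hive/target/orc-benchmarks-hive-*-uber.jar decimal data &&
  run java -jar spark/target/orc-benchmarks-spark-*.jar spark data
}

run_all
```

Running the script as-is lists the eight commands in order. Running it with `DRY_RUN=0` executes them and stops at the first failure, which is usually what you want because the later benchmarks depend on the build and on the generated data.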