A Fair Benchmark for Cloudera Impala

Based on the TPC-DS benchmark kit for Cloudera Impala, which we found at https://github.com/cloudera/impala-tpcds-kit

NOTICE: This repo contains modifications to the official TPC-DS specification so any results from this are not comparable to officially audited results.

Environment Setup Steps

These steps set up your environment to perform distributed data generation for a given scale factor.

Prerequisites

The scripts assume that you have passwordless SSH from the master node (where you will clone the repos) to every DataNode in your cluster.

These scripts also assume that your $HOME directory is the same path on all DataNodes.
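Both assumptions can be checked before starting. The following is a minimal sketch and not part of the kit: the function name check_nodes is hypothetical, and the hostnames in the usage line are placeholders.

```shell
# Sketch: verify passwordless SSH and a matching $HOME on each DataNode.
# check_nodes is a hypothetical helper, not a script shipped with this repo.
check_nodes() {
    status=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ! remote_home=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" 'echo "$HOME"' 2>/dev/null); then
            echo "FAIL $host: passwordless SSH not working"
            status=1
        elif [ "$remote_home" != "$HOME" ]; then
            echo "FAIL $host: remote HOME is $remote_home, expected $HOME"
            status=1
        else
            echo "OK   $host"
        fi
    done
    return $status
}

# Usage (placeholder hostnames):
#   check_nodes datanode1.example.com datanode2.example.com
```

The function returns non-zero if any node fails either check, so it can gate the later steps.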

Download and build the modified TPC-DS tools

sudo yum -y install gcc make flex bison byacc git
cd $HOME (use your $HOME directory, as it is hard-coded in some scripts for now)
git clone https://github.com/grahn/tpcds-kit.git
cd tpcds-kit/tools
make -f Makefile.suite
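After the build, it can help to confirm that the data and query generators exist before pushing anything to the DataNodes. This is a rough sketch, assuming the build leaves the dsdgen and dsqgen binaries in tpcds-kit/tools; check_tools is a hypothetical helper name.

```shell
# Sketch: confirm the TPC-DS generator binaries were built.
# Assumption: dsdgen and dsqgen end up in the tools directory after make.
check_tools() {
    tools_dir=${1:-$HOME/tpcds-kit/tools}
    missing=0
    for bin in dsdgen dsqgen; do
        if [ -x "$tools_dir/$bin" ]; then
            echo "found $tools_dir/$bin"
        else
            echo "missing $tools_dir/$bin"
            missing=1
        fi
    done
    return $missing
}

# Usage:
#   check_tools                 # checks $HOME/tpcds-kit/tools
#   check_tools /some/other/dir
```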

Clone the Impala TPC-DS tools repo & Configure the HDFS directories

cd $HOME (use your $HOME directory, as it is hard-coded in some scripts for now)
Clone this repo: git clone https://github.com/cloudera/impala-tpcds-kit
cd impala-tpcds-kit
Edit tpcds-env.sh and modify as needed. The defaults assume you have a /user/$USER directory in HDFS. If you don't, run these commands:
- sudo -u hdfs hdfs dfs -mkdir /user/$USER
- sudo -u hdfs hdfs dfs -chown $USER /user/$USER
- sudo -u hdfs hdfs dfs -chmod 777 /user/$USER
Edit dn.txt and put one DataNode hostname per line - no blank lines.
Run push-bits.sh which will scp tpcds-kit and impala-tpcds-kit to each DataNode listed in dn.txt.
Run set-nodenum.sh. This will create impala-tpcds-kit/nodenum.sh on every DataNode and set the value accordingly. This is used to determine what portion of the distributed data generation is done on each node.
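Because a blank line in dn.txt would throw off the node list, a quick check before running push-bits.sh and set-nodenum.sh may save a failed run. The sketch below is not part of the kit. It assumes nodes are numbered 1..N in file order, which is a guess; the actual assignment is whatever set-nodenum.sh does. The name preview_nodenums is hypothetical.

```shell
# Sketch: validate a dn.txt-style file and preview a node numbering.
# Assumption: node numbers follow file order starting at 1; the real
# logic lives in set-nodenum.sh and may differ.
preview_nodenums() {
    dn_file=${1:-dn.txt}
    # Reject empty or whitespace-only lines.
    if grep -q '^[[:space:]]*$' "$dn_file"; then
        echo "error: $dn_file contains blank lines" >&2
        return 1
    fi
    # Print "<number> <hostname>" for each line.
    awk '{ print NR, $1 }' "$dn_file"
}

# Usage:
#   preview_nodenums dn.txt
```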

Preparation and Data Generation

./push-bits.sh && ./set-nodenum.sh && ./run-gen-facts.sh > /tmp/tpcds.log && tail -f /tmp/tpcds.log

Data loading

./returns-move.sh && ./hdfs-load.sh && ./impala-drop-db.sh && ./impala-create-external-tables.sh && ./impala-load-all.sh && ./impala-compute-stats.sh
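With a long && chain it is not always obvious which script stopped the run. One way to make that visible is a small wrapper, sketched below. The function run_steps is hypothetical and not shipped with this repo; it runs each command in order and reports the first one that fails.

```shell
# Sketch: run steps in order, stop at the first failure and name it.
# run_steps is a hypothetical helper, not part of the kit.
run_steps() {
    for step in "$@"; do
        echo "[$(date '+%H:%M:%S')] starting $step"
        if ! "$step"; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
    echo "all steps completed"
}

# Usage, mirroring the chain above:
#   run_steps ./returns-move.sh ./hdfs-load.sh ./impala-drop-db.sh \
#       ./impala-create-external-tables.sh ./impala-load-all.sh \
#       ./impala-compute-stats.sh
```

The behaviour matches the && chain (later steps do not run after a failure), with the addition of a timestamp per step and an explicit failure message.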

Queries

impala-tpcds-kit/benchmark.py is the Python script that executes all queries. Its run_benchmark() method lets you specify different parameters for benchmarking.

The queries themselves can be found in impala-tpcds-kit/queries. For each Scale Factor, 10 different query streams are generated. You can generate more using the TPC-DS toolkit.
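For a quick spot check of individual query files outside benchmark.py, something like the following could be used. It is an independent sketch and does not reflect how benchmark.py works. It assumes impala-shell is on the PATH and that each file can be run with its -d (database) and -f (query file) options; the function name time_queries, the IMPALA_SHELL override variable, and the database name in the usage line are all hypothetical.

```shell
# Sketch: time individual query files with impala-shell.
# Not how benchmark.py works; time_queries and IMPALA_SHELL are
# hypothetical names introduced for this example.
time_queries() {
    database=$1
    shift
    shell_cmd=${IMPALA_SHELL:-impala-shell}
    for q in "$@"; do
        start=$(date +%s)
        if $shell_cmd -d "$database" -f "$q" > /dev/null 2>&1; then
            result=ok
        else
            result=failed
        fi
        end=$(date +%s)
        # One line per query: file name, status, wall-clock seconds.
        echo "$(basename "$q") $result $((end - start))s"
    done
}

# Usage (database name is a placeholder):
#   time_queries my_tpcds_db queries/*
```

Timing is at whole-second resolution, which is coarse but usually enough to spot outliers.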

peeterskris/impala-tpcds-kit

Folders and files

Latest commit

History

Repository files navigation

A Fair benchmark for Cloudera impala

Environment Setup Steps

Prerequisites

Download and build the modified TPC-DS tools

Clone the Impala TPC-DS tools repo & Configure the HDFS directories

Preparation and Data Generation

Data loading

Queries

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages