Testing Automation Framework for BlazingSQL
- pyspark
- Drill and Excel support
- Google Sheets support
Inside a conda environment:
conda install --yes -c conda-forge openjdk=8.0 maven pyspark=3.0.0 pytest
pip install pydrill openpyxl pymysql gitpython pynvml gspread oauth2client sql_metadata pyyaml
You will also need Apache Drill.
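If you do not already have a Drill instance available, one simple option is the embedded mode that ships with the Apache Drill distribution; the version and download URL below are only an example:
# Example only: fetch a Drill release and start it in embedded mode
wget https://archive.apache.org/dist/drill/drill-1.17.0/apache-drill-1.17.0.tar.gz
tar -xzf apache-drill-1.17.0.tar.gz
cd apache-drill-1.17.0
./bin/drill-embedded    # opens a local Drill SQL prompt; leave it running while testing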
cd blazingsql
# help
./test.sh -h
# run all tests: I/O, communication, engine and end to end tests (e2e)
./test.sh
By default the end to end tests:
- Run in single node only (nrals: 1)
- Compare against parquet result files instead of Drill or pySpark (execution mode: gpuci)
- The log directory is the CONDA_PREFIX folder.
- Automatically download and use the testing files (the data folder and the parquet result files) into your CONDA_PREFIX folder (see https://github.com/BlazingDB/blazingsql-testing-files)
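For example, after the first run you can check the downloaded testing files at their default locations:
# Default locations of the testing data and of the stored parquet results
ls $CONDA_PREFIX/blazingsql-testing-files/data
ls $CONDA_PREFIX/blazingsql-testing-files/results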
cd blazingsql
# Run all e2e tests based on your current env settings.
./test.sh e2e
# Run only the round end to end test group.
./test.sh e2e tests=roundSuite
# Run the round and orderby end to end test groups.
./test.sh e2e tests=roundSuite,orderbySuite
- Fork https://github.com/BlazingDB/blazingsql and create a new branch (example: feat/my-new-test); a sketch of the full setup follows this list
- Then fork https://github.com/BlazingDB/blazingsql-testing-files as well and create a new branch with the same name as above (example: feat/my-new-test)
- Write a new test file in blazingsql/tests/BlazingSQLTest/
- Add new files in blazingsql-testing-files/data/
- Push your changes to both repositories (example: git push origin feat/my-new-test)
- Finally, open a new Pull Request at https://github.com/BlazingDB/blazingsql/compare
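The repository setup for both forks could look like the sketch below (replace <your-user> with your GitHub username; the branch name is just the example used above):
# Clone both forks and create the same branch in each
git clone https://github.com/<your-user>/blazingsql.git
git clone https://github.com/<your-user>/blazingsql-testing-files.git
(cd blazingsql && git checkout -b feat/my-new-test)
(cd blazingsql-testing-files && git checkout -b feat/my-new-test)
# ...add your test and data files, commit, then push both branches
(cd blazingsql && git push origin feat/my-new-test)
(cd blazingsql-testing-files && git push origin feat/my-new-test)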
Then you can run your new tests:
cd blazingsql
# Run all e2e tests based on your current env settings.
./test.sh e2e
# Run only the round end to end test group.
./test.sh e2e tests=roundSuite
# Run the round and orderby end to end test groups.
./test.sh e2e tests=roundSuite,orderbySuite
All the behaviour of the end to end tests is driven by environment variables. If you want more control, override the default values by exporting or defining the target environment variables before running the tests.
Examples:
# Run all e2e tests in full mode
BLAZINGSQL_E2E_EXEC_MODE=full ./test.sh e2e
# Run all e2e tests with 2 rals/workers using LocalCUDACluster
BLAZINGSQL_E2E_N_RALS=2 ./test.sh e2e
# Run all e2e tests with 2 rals/workers using an online dask-scheduler IP:PORT
export BLAZINGSQL_E2E_N_RALS=2
BLAZINGSQL_E2E_DASK_CONNECTION="127.0.0.1:8786" ./test.sh e2e
Here are all the environment variables with their default values:
#TestSettings
export BLAZINGSQL_E2E_DATA_DIRECTORY=$CONDA_PREFIX/blazingsql-testing-files/data/
export BLAZINGSQL_E2E_LOG_DIRECTORY=$CONDA_PREFIX/
export BLAZINGSQL_E2E_FILE_RESULT_DIRECTORY=$CONDA_PREFIX/blazingsql-testing-files/results/
export BLAZINGSQL_E2E_DATA_SIZE="100MB2Part"
export BLAZINGSQL_E2E_EXECUTION_ENV="local"
export BLAZINGSQL_E2E_DASK_CONNECTION="local" # values: "dask-scheduler-ip:port", "local"
# AWS S3 env vars
export BLAZINGSQL_E2E_AWS_S3_BUCKET_NAME=''
export BLAZINGSQL_E2E_AWS_S3_ACCESS_KEY_ID=''
export BLAZINGSQL_E2E_AWS_S3_SECRET_KEY=''
# Google Storage env vars
export BLAZINGSQL_E2E_GOOGLE_STORAGE_PROJECT_ID=''
export BLAZINGSQL_E2E_GOOGLE_STORAGE_BUCKET_NAME=''
export BLAZINGSQL_E2E_GOOGLE_STORAGE_ADC_JSON_FILE=""
#RunSettings
export BLAZINGSQL_E2E_EXEC_MODE="gpuci" # values: gpuci, full, generator
export BLAZINGSQL_E2E_N_RALS=1
export BLAZINGSQL_E2E_N_GPUS=1
export BLAZINGSQL_E2E_NETWORK_INTERFACE="lo"
export BLAZINGSQL_E2E_SAVE_LOG=false
export BLAZINGSQL_E2E_WORKSHEET="BSQL Log Results" # or "BSQL Performance Results"
export BLAZINGSQL_E2E_LOG_INFO=''
export BLAZINGSQL_E2E_COMPARE_RESULTS=true
export BLAZINGSQL_E2E_TARGET_TEST_GROUPS=""
export BLAZINGSQL_E2E_TEST_WITH_NULLS="false" # use this when you want to use the dataset with nulls
#ComparissonTest
export BLAZINGSQL_E2E_COMPARE_BY_PERCENTAJE=false
export BLAZINGSQL_E2E_ACCEPTABLE_DIFERENCE=0.01
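For instance, following the same one-off override pattern shown above, you could loosen the result comparison for a single run (the values here are only illustrative):
# Compare by percentage with a 5% tolerance, for this run only
BLAZINGSQL_E2E_COMPARE_BY_PERCENTAJE=true BLAZINGSQL_E2E_ACCEPTABLE_DIFERENCE=0.05 ./test.sh e2e tests=roundSuite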
If you don't want to use the stored parquet results (BLAZINGSQL_E2E_FILE_RESULT_DIRECTORY) and you want to compare directly against Drill or Spark, change the execution mode variable BLAZINGSQL_E2E_EXEC_MODE from "gpuci" to "full". Please note that if you want to run in full mode you must have a Drill instance running.
If you want to run a test with n rals/workers where n > 1, you need to change the environment variable BLAZINGSQL_E2E_N_RALS. Also, remember that when running distributed tests the variable BLAZINGSQL_E2E_DASK_CONNECTION must be set to either "local" or "dask-scheduler-ip:port".
- "local": uses a Dask LocalCUDACluster to simulate the n rals/workers on a single GPU.
- "dask-scheduler-ip:port": the connection information of your dask-scheduler, in case you run your local Dask cluster manually (see the sketch below).
Finally, there is sensitive data that must never be public, so to get the values for these variables please ask the QA & DevOps teams:
- AWS S3 connection settings: BLAZINGSQL_E2E_AWS_S3_BUCKET_NAME, BLAZINGSQL_E2E_AWS_S3_ACCESS_KEY_ID, BLAZINGSQL_E2E_AWS_S3_SECRET_KEY
- Google Storage connection settings: BLAZINGSQL_E2E_GOOGLE_STORAGE_PROJECT_ID, BLAZINGSQL_E2E_GOOGLE_STORAGE_BUCKET_NAME, BLAZINGSQL_E2E_GOOGLE_STORAGE_ADC_JSON_FILE
- Google Docs spreadsheet access: BLAZINGSQL_E2E_LOG_INFO
When running the end to end tests in their default mode (BLAZINGSQL_E2E_EXEC_MODE="gpuci"), they use the result files from the blazingsql-testing-files repository, which should live in the root of your conda environment directory. To generate new result files for new E2E test queries, you will need to run the testing framework for that test set in generator mode, once for regular data and once for data with nulls. You will have to do something like this:
BLAZINGSQL_E2E_EXEC_MODE="generator" ./test.sh e2e tests=windowNoPartitionTest
BLAZINGSQL_E2E_TEST_WITH_NULLS=true BLAZINGSQL_E2E_EXEC_MODE="generator" ./test.sh e2e tests=windowNoPartitionTest
Note: if you need to modify an already existing result file, you will need to delete it before you try to regenerate it.
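For example, assuming the result files live under the default results directory and are named after the test (the naming pattern below is a guess, not a documented layout), the delete-then-regenerate flow could look like:
# Hypothetical: remove stale result files for the suite, then regenerate them
find $CONDA_PREFIX/blazingsql-testing-files/results -iname "*windowNoPartition*" -delete
BLAZINGSQL_E2E_EXEC_MODE="generator" ./test.sh e2e tests=windowNoPartitionTest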
- For development/debugging it is recommended to set the env var BLAZINGSQL_E2E_DASK_CONNECTION to your dask-scheduler IP:PORT.
- Do not touch the bash files; if you need a feature, please talk with the QA & DevOps teams.
- Only add/modify end to end tests once you have coordinated with the QA team.
cd blazingsql
# engine/ral tests
./test.sh libengine
# I/O unit tests
./test.sh io
The values of the following variables must be set.
# AWS S3 env vars
export BLAZINGSQL_E2E_AWS_S3_BUCKET_NAME=''
export BLAZINGSQL_E2E_AWS_S3_ACCESS_KEY_ID=''
export BLAZINGSQL_E2E_AWS_S3_SECRET_KEY=''
# Google Storage env vars
export BLAZINGSQL_E2E_GOOGLE_STORAGE_PROJECT_ID=''
export BLAZINGSQL_E2E_GOOGLE_STORAGE_BUCKET_NAME=''
export BLAZINGSQL_E2E_GOOGLE_STORAGE_ADC_JSON_FILE=""
Then execute the following:
cd blazingsql/tests/BlazingSQLTest
python manualTesting.py
To support testing queries against an HDFS filesystem authenticated via Kerberos, we provide a containerized environment that starts a fully kerberized HDFS server.
- We use Docker as our container engine; it can be installed following the steps at https://docs.docker.com/v17.09/engine/installation/linux/docker-ce/ubuntu/#os-requirements
- The Docker Compose utility is used to start the multi-container environment composed of the HDFS container and the Kerberos container. To install it via pip:
$ pip install docker-compose
- Download and extract a compatible Hadoop distribution that contains the client libraries (tested with versions 2.7.3 and 2.7.4).
- Some environment variables are needed to find the right paths to the Java dependencies; you can set them by loading the script below, passing the root of your Hadoop distribution (a rough sketch of the variables it is expected to set follows this list):
$ cd KrbHDFS
$ source ./load_hdfs_env_vars.sh /PATH/TO/HADOOP
- By running the script below, the Hadoop and Kerberos Docker containers are started automatically; in addition, the data located under the dataDirectory entry of the config file configE2ETest.json is copied into the HDFS filesystem. Finally, the E2E framework executes the tests and generates the summary report.
$ cd BlazingSQLTest
$ python -m EndToEndTests.fileSystemHdfsTest configE2ETest.json
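As noted in the environment-variables step above, load_hdfs_env_vars.sh wires up the Java/Hadoop paths; the sketch below only illustrates the kind of variables involved (the names and paths are assumptions, the script itself is authoritative):
# Assumed sketch of the kind of variables load_hdfs_env_vars.sh sets (check the script for the real values)
export HADOOP_HOME=/PATH/TO/HADOOP
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)               # Hadoop client jars for libhdfs
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH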
To run other tests beyond the E2E tests (ad hoc scripts, local Hadoop tests), you can also start the Docker + Kerberos containers by following these steps:
- Set the environment variables required by Hadoop:
$ cd KrbHDFS
$ source ./load_hdfs_env_vars.sh /PATH/TO/HADOOP
- Run the script we provide to start the containers. Pass the root of your Hadoop distribution as the first argument and the path to the data to be copied into the HDFS instance as the second:
$ cd KrbHDFS
$ ./start_hdfs.sh /PATH/TO/HADOOP /PATH/TO/YOUR/DATA
- Once your tests are finished, you can stop the containers:
$ cd KrbHDFS
$ ./stop_hdfs.sh
- For now, the starting script will require your superuser credentials.
- You must pass a valid Kerberos ticket to the filesystem register command of BlazingSQL. For your convenience, when you start the Docker instances, the script will copy a valid ticket into the path below:
./KrbHDFS/myconf/krb5cc_0
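Standard Kerberos tooling can be used to check that this ticket is valid before registering the filesystem (klist and KRB5CCNAME are generic Kerberos utilities/variables, not part of these scripts):
$ klist -c ./KrbHDFS/myconf/krb5cc_0               # inspect the copied ticket
$ export KRB5CCNAME=$PWD/KrbHDFS/myconf/krb5cc_0   # optionally make it the default credential cache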
We also provide a copy of Apache Hive (tested with version 1.2.2) inside the HDFS container, so the steps to run the E2E tests using Hive are similar to the instructions for HDFS.
- Set the environment variables needed by running the script below, passing the root of your Hadoop distribution:
$ cd KrbHDFS
$ source ./load_hdfs_env_vars.sh /PATH/TO/HADOOP
- Behind the scenes, an instance of Hive linked to the HDFS server will be ready to ingest your data, and the E2E framework will execute the tests and generate the summary report:
$ cd BlazingSQLTest
$ python -m EndToEndTests.fileSystemHiveTest configE2ETest.json
For MySQL and PostgreSQL you will need to install these libraries:
conda install -c conda-forge mysql-connector-python
conda install -c conda-forge psycopg2
and run the following line in the MySQL console:
SET GLOBAL local_infile = 'ON';
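If you prefer to do this from the shell rather than an interactive console, the same statement can be issued through the mysql client (the root user is only an example):
# Enable LOAD DATA LOCAL INFILE from the command line
mysql -u root -p -e "SET GLOBAL local_infile = 'ON';"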
To run the tests for tables from other SQL databases, just define these env vars before running the tests:
# for MySQL
BLAZINGSQL_E2E_MYSQL_HOSTNAME
BLAZINGSQL_E2E_MYSQL_PORT
BLAZINGSQL_E2E_MYSQL_USERNAME
BLAZINGSQL_E2E_MYSQL_PASSWORD
BLAZINGSQL_E2E_MYSQL_DATABASE
# for PostgreSQL
BLAZINGSQL_E2E_POSTGRESQL_HOSTNAME
BLAZINGSQL_E2E_POSTGRESQL_PORT
BLAZINGSQL_E2E_POSTGRESQL_USERNAME
BLAZINGSQL_E2E_POSTGRESQL_PASSWORD
BLAZINGSQL_E2E_POSTGRESQL_DATABASE
# for SQLite
BLAZINGSQL_E2E_SQLITE_DATABASE
Note: the *_PORT variables are numbers and the other vars are strings!
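Following the same override pattern as the other settings, a MySQL run could look like the sketch below; the host, port, credentials and database are placeholders:
# Placeholder values; use your own MySQL connection settings
export BLAZINGSQL_E2E_MYSQL_HOSTNAME="127.0.0.1"
export BLAZINGSQL_E2E_MYSQL_PORT=3306
export BLAZINGSQL_E2E_MYSQL_USERNAME="<user>"
export BLAZINGSQL_E2E_MYSQL_PASSWORD="<password>"
export BLAZINGSQL_E2E_MYSQL_DATABASE="<database>"
./test.sh e2e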
Sometimes, for various reasons, the E2E test script may raise an error and leave the containers in an invalid state. Before trying again, please check that there aren't any HDFS or Kerberos containers still running by stopping the containers explicitly:
$ cd KrbHDFS
$ docker-compose down
- To generate the TPC-H dataset, use the GenerateTpchDataFiles script.
- In configurationFile.json set "dataDirectory": "/path_to_dataset/100MB2Part/". You should have two folders inside dataDirectory: a tpch folder and a tpcx folder (see https://github.com/BlazingDB/blazingsql-testing-files/blob/master/data.tar.gz).
- CreationDatabases
- Example:
To run all end to end tests:
$ python -m EndToEndTests.allE2ETest configurationFile.json
To run the performance tests:
$ python allE2ETest2.py configurationFile.json
To run a subset of tests that covers a particular type of query:
$ python -m EndToEndTests.whereClauseTest configurationFile.json