Data Generation/Workload Execution Pipeline Enhancements (cmu-db#50)
This diff mainly targets infrastructure changes and tooling related to training data collection and workload execution.

This diff introduces the following new `doit` tasks:
- `tscout_init` attaches `TScout` to the currently running postgres instance for data collection. `tscout_shutdown` is the counterpart that shuts down `TScout` and flushes its buffers to the CSV files on disk.
- `noisepage_swap_config` switches the `postgresql.conf` file without wiping `pgdata`. Currently this always restarts the database instance, under the assumption that some changes in `postgresql.conf` require a restart to take effect. `noisepage_swap_config` does not touch `postgresql.auto.conf`.
- `benchbase_prewarm_install` installs the extension `pg_prewarm` for benchbase.
- `behavior_pg_prewarm_benchmark` and `behavior_pg_analyze_benchmark` prewarm and analyze (respectively) a given benchmark's tables, which are assumed to reside in the `benchbase` database.
- `behavior_generate_workloads` generates workloads from the config file.
- `behavior_execute_workloads` executes workloads (via `behavior/datagen/run_workloads.sh`) to generate data.
- `behavior_perform_plan_diff` performs plan differencing.

This diff also introduces some other data pipelining improvements:
- Directories are restructured so that `train/eval` data are now stored under the given experiment run.
- `behavior_microservice` now accepts a path to a specific model directory.
- `behavior_execute_workloads` can execute multiple benchbase runs with different configs using the same database.
- Plan differencing accepts a glob pattern to select a particular experiment.
- Training accepts parameters to control which experiments/benchmark runs to incorporate into the model.

Two known caveats for executing workloads (both documented in-line):
- Loading multiple databases and alternating the workload (e.g., execute TPC-C then TATP) in a single experiment is not currently supported. However, adding support for this is not difficult if needed. With the training script's argument updates, we can coalesce training data from different experiments too.
- `run_workloads.sh` allows switching the `postgresql.conf` and executing a BenchBase run within the same experiment/loaded database. However, `run_workloads.sh` does not respect changes to the scale factor in subsequent benchbase configurations.

New dependencies:
- `niet` in `requirements.txt` for parsing a YAML file in a bash script.
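
For readers unfamiliar with the config-swap behavior, here is a minimal sketch of how a task like `noisepage_swap_config` could be structured. It is not the code in this diff. The helper `swap_config`, the task name `task_noisepage_swap_config_sketch`, the parameter names `pgdata` and `config`, and the use of `pg_ctl restart` are all assumptions made for illustration. The sketch only shows the three properties described in the task list: `pgdata` is not wiped, `postgresql.auto.conf` is left alone, and the instance is always restarted.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def swap_config(pgdata: Path, new_conf: Path, restart: Callable[[], None]) -> None:
    """Replace pgdata/postgresql.conf with new_conf, then restart the server.

    Hypothetical helper: the name, signature, and restart hook are assumptions.
    """
    # A PostgreSQL data directory contains a PG_VERSION file; refuse anything else.
    if not (pgdata / "PG_VERSION").exists():
        raise FileNotFoundError(f"{pgdata} does not look like a data directory")
    # Overwrite only postgresql.conf. postgresql.auto.conf and the rest of
    # pgdata are never touched, so loaded data survives the swap.
    shutil.copyfile(new_conf, pgdata / "postgresql.conf")
    # Always restart, since some settings only take effect after a restart.
    restart()


def task_noisepage_swap_config_sketch():
    """doit collects functions named task_*; each returns a task dictionary."""

    def action(pgdata, config):
        # Assumption: pg_ctl is on PATH and manages this instance.
        def restart():
            subprocess.run(["pg_ctl", "restart", "-D", pgdata, "-w"], check=True)

        swap_config(Path(pgdata), Path(config), restart)

    return {
        "actions": [action],
        # doit passes these values to the action as keyword arguments.
        "params": [
            {"name": "pgdata", "long": "pgdata", "default": "pgdata"},
            {"name": "config", "long": "config", "default": "postgresql.conf"},
        ],
        "verbosity": 2,
    }
```

Under these assumptions the sketch would be invoked as `doit noisepage_swap_config_sketch --config=path/to/postgresql.conf`. Passing the restart step in as a callable keeps the file-swap logic testable without a running database.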