Data Generation/Workload Execution Pipeline Enhancements (cmu-db#50)
This diff mainly targets infrastructure changes and tooling for training data collection and workload execution. It introduces the following new `doit` tasks:

- `tscout_init` allows attaching `TScout` to the currently running postgres instance for data collection. `tscout_shutdown` is the counterpart that handles shutting down `TScout` and flushing out its buffers to the CSV files on disk.
- `noisepage_swap_config` allows switching the `postgresql.conf` file without wiping `pgdata`. Currently this will always restart the database instance under the assumption that some changes in `postgresql.conf` require a restart to take effect. `noisepage_swap_config` does not touch `postgresql.auto.conf`.
- `benchbase_prewarm_install` installs the `pg_prewarm` extension for BenchBase.
- `behavior_pg_prewarm_benchmark` and `behavior_pg_analyze_benchmark` prewarm and analyze (respectively) a given benchmark's tables that are assumed to reside in the `benchbase` database.
- `behavior_generate_workloads` generates workloads from the config file.
- `behavior_execute_workloads` executes workloads (via `behavior/datagen/run_workloads.sh`) to generate data.
- `behavior_perform_plan_diff` performs plan differencing.
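
As a rough illustration of the `noisepage_swap_config` behavior described above, here is a minimal sketch of what such a `doit` task could look like. It is not the code from this commit: the helper name `swap_postgresql_conf`, the task name, the parameter names, and the default paths are all hypothetical, and it assumes `pg_ctl` is on the `PATH`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def swap_postgresql_conf(pgdata: Path, new_conf: Path) -> List[str]:
    """Replace pgdata/postgresql.conf with new_conf and return a restart command.

    Only postgresql.conf is overwritten; postgresql.auto.conf and the rest of
    pgdata are left untouched, so the loaded data survives the swap.
    """
    shutil.copyfile(new_conf, pgdata / "postgresql.conf")
    # Always restart, on the assumption that some settings require it.
    return ["pg_ctl", "restart", "-D", str(pgdata)]


def task_swap_config_sketch():
    """Hypothetical doit task wrapping the helper above (names are assumptions)."""

    def swap(pgdata, config):
        cmd = swap_postgresql_conf(Path(pgdata), Path(config))
        subprocess.run(cmd, check=True)

    return {
        "actions": [swap],
        "params": [
            {"name": "pgdata", "long": "pgdata", "default": "pgdata"},
            {"name": "config", "long": "config", "default": "postgresql.conf"},
        ],
        "verbosity": 2,
    }
```

Keeping the file copy separate from the restart makes the "does not wipe `pgdata`" property easy to check without a running postgres instance.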

This diff also introduces some other data pipelining improvements:
- The directory layout is restructured so that `train/eval` data are now stored under the given experiment run.
- `behavior_microservice` now accepts a path to a specific model directory.
- `behavior_execute_workloads` can execute multiple BenchBase runs with different configs using the same database.
- Plan differencing accepts a glob pattern to select a particular experiment.
- Training accepts parameters to control which experiments/benchmark runs to incorporate into the model.
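
To make the glob-based experiment selection concrete, the following is a small sketch of how experiment directories might be picked by pattern. The function names and the `<root>/<experiment>/train` layout are assumptions made for illustration only; the actual directory names used by this commit are not shown here.

```python
from pathlib import Path
from typing import List


def select_experiment_dirs(root: Path, pattern: str) -> List[Path]:
    """Return the directories directly under root whose names match pattern."""
    return sorted(p for p in root.glob(pattern) if p.is_dir())


def collect_train_dirs(root: Path, pattern: str) -> List[Path]:
    """Return <experiment>/train for each selected experiment that has one.

    The 'train' subdirectory name is a hypothetical layout, not taken from
    the commit.
    """
    return [
        exp / "train"
        for exp in select_experiment_dirs(root, pattern)
        if (exp / "train").is_dir()
    ]
```

A pattern such as `"experiment-*"` would then select every run, while a fully spelled-out directory name would select exactly one.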

Two known caveats for executing workloads (that are documented in-line):
- Loading multiple databases and alternating the workload (e.g., executing TPC-C and then TATP) in a single experiment is not currently supported. However, adding support for this is not difficult if needed. With the training script's argument updates, we can coalesce training data from different experiments too.
- `run_workloads.sh` allows switching the `postgresql.conf` and executing a BenchBase run within the same experiment/loaded database. However, `run_workloads.sh` does not respect changes to the scale factor in subsequent BenchBase configurations.
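
The second caveat can be guarded against before an experiment starts. The sketch below is not part of this commit: it assumes each BenchBase configuration is an XML document containing a `<scalefactor>` element and that the database is loaded using the first configuration. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def read_scalefactor(config_xml: str) -> Optional[str]:
    """Return the text of the first <scalefactor> element, or None if absent."""
    node = ET.fromstring(config_xml).find(".//scalefactor")
    if node is None or node.text is None:
        return None
    return node.text.strip()


def configs_with_ignored_scalefactor(configs: List[str]) -> List[int]:
    """Return indexes of configs whose scale factor differs from the first one.

    Assumes the database was loaded with configs[0], so any later config with
    a different scale factor would silently run against the original data.
    """
    if not configs:
        return []
    loaded = read_scalefactor(configs[0])
    return [
        i
        for i, cfg in enumerate(configs[1:], start=1)
        if read_scalefactor(cfg) != loaded
    ]
```

A non-empty result means at least one later run requests a scale factor that would not take effect.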

New Dependencies:
- `niet` in `requirements.txt` for parsing a YAML file in a bash script
17zhangw authored Mar 3, 2022
1 parent fb4b2da commit 118871d
Showing 17 changed files with 979 additions and 759 deletions.
4 changes: 2 additions & 2 deletions behavior/__main__.py
@@ -2,7 +2,7 @@

 from plumbum import cli

-from behavior.datagen import gen
+from behavior.datagen import generate_workloads
 from behavior.microservice import app
 from behavior.modeling import train
 from behavior.plans import diff
@@ -17,7 +17,7 @@ def main(self) -> None:

 if __name__ == "__main__":
     logging.basicConfig(format="%(levelname)s:%(asctime)s %(message)s", level=logging.INFO)
-    BehaviorCLI.subcommand("datagen", gen.DataGeneratorCLI)
+    BehaviorCLI.subcommand("generate_workloads", generate_workloads.GenerateWorkloadsCLI)
     BehaviorCLI.subcommand("datadiff", diff.DataDiffCLI)
     BehaviorCLI.subcommand("train", train.TrainCLI)
     BehaviorCLI.subcommand("microservice", app.ModelMicroserviceCLI)