Data Generation/Workload Execution Pipeline Enhancements (#50) · mbutrovich/noisepage-pilot@118871d

Data Generation/Workload Execution Pipeline Enhancements (cmu-db#50)

This diff mainly targets infrastructure changes and tooling related to training data collection and workload execution. It introduces the following new `doit` tasks:

- `tscout_init` attaches `TScout` to the currently running postgres instance for data collection. `tscout_shutdown` is its counterpart: it shuts down `TScout` and flushes its buffers to the CSV files on disk.
- `noisepage_swap_config` switches the `postgresql.conf` file without wiping `pgdata`. It currently always restarts the database instance, on the assumption that some changes in `postgresql.conf` require a restart to take effect. `noisepage_swap_config` does not touch `postgresql.auto.conf`.
- `benchbase_prewarm_install` installs the extension `pg_prewarm` for BenchBase.
- `behavior_pg_prewarm_benchmark` and `behavior_pg_analyze_benchmark` prewarm and analyze (respectively) a given benchmark's tables that are assumed to reside in the `benchbase` database.
- `behavior_generate_workloads` generates workloads from the config file.
- `behavior_execute_workloads` executes workloads (via `behavior/datagen/run_workloads.sh`) to generate data.
- `behavior_perform_plan_diff` performs plan differencing.
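
As an illustration of the `noisepage_swap_config` behavior described above, here is a minimal sketch, not the repository's implementation. The function name `swap_config`, the task name `task_swap_config_sketch`, and the parameter names are hypothetical; the sketch assumes the server is restarted with `pg_ctl restart -D <pgdata>` and only prints that command instead of running it.

```python
import shutil
from pathlib import Path
from typing import List


def swap_config(pgdata: Path, new_conf: Path) -> List[str]:
    """Replace pgdata/postgresql.conf and return the restart command to run."""
    # Overwrite only postgresql.conf; postgresql.auto.conf is left untouched.
    shutil.copyfile(new_conf, pgdata / "postgresql.conf")
    # Always restart, since some settings only take effect after a restart.
    return ["pg_ctl", "restart", "-D", str(pgdata)]


def task_swap_config_sketch():
    """Hypothetical doit task wrapping swap_config (names are assumptions)."""

    def run(pgdata, config):
        command = swap_config(Path(pgdata), Path(config))
        # The sketch prints the command; a real task would execute it.
        print("would run:", " ".join(command))

    return {
        "actions": [run],
        "params": [
            {"name": "pgdata", "long": "pgdata", "default": "pgdata"},
            {"name": "config", "long": "config", "default": "postgresql.conf"},
        ],
        "verbosity": 2,
    }
```

Returning the restart command rather than executing it keeps the file-copy step easy to check in isolation.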

This diff also introduces some other data pipelining improvements:
- The directory layout is restructured so that `train/eval` data are now stored under the given experiment run.
- `behavior_microservice` now accepts a path to a specific model directory.
- `behavior_execute_workloads` can execute multiple BenchBase runs with different configs using the same database.
- Plan differencing accepts a glob pattern to select a particular experiment.
- Training accepts parameters to control which experiments/benchmark runs to incorporate into the model.
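
To make the glob-based experiment selection above concrete, the following is a small sketch using only the standard library. The experiment names and the helper `select_experiments` are hypothetical and do not reflect the repository's actual directory naming or code.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_experiments(names: Iterable[str], pattern: str) -> List[str]:
    """Return the experiment names matching a shell-style glob, sorted."""
    return sorted(name for name in names if fnmatchcase(name, pattern))


# Hypothetical experiment directory names.
experiments = ["experiment-2022-03-01", "experiment-2022-03-02", "scratch"]
print(select_experiments(experiments, "experiment-2022-03-*"))
```

A pattern of `*` would select every experiment, which is one plausible way to coalesce data across runs.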

Two known caveats for executing workloads (that are documented in-line):
- Loading multiple databases and alternating the workload (e.g., executing TPC-C then TATP) in a single experiment is not currently supported. However, adding support for this is not difficult if needed. With the training script's argument updates, we can coalesce training data from different experiments too.
- `run_workloads.sh` allows switching the `postgresql.conf` and executing a BenchBase run within the same experiment/loaded database. However, `run_workloads.sh` does not respect changes to the scale factor in subsequent BenchBase configurations.
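
Because of the scale factor caveat above, it can help to check a sequence of BenchBase configurations before running them. The sketch below is not part of this commit; it assumes each config is an XML document whose root element has a direct `<scalefactor>` child, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def scale_factors(config_xmls: List[str]) -> List[str]:
    """Extract the <scalefactor> text from each config (empty if absent)."""
    factors = []
    for text in config_xmls:
        root = ET.fromstring(text)
        # Assumption: <scalefactor> is a direct child of the root element.
        factors.append(root.findtext("scalefactor", default="").strip())
    return factors


def has_scale_factor_change(config_xmls: List[str]) -> bool:
    """True if the configs do not all share the same scale factor."""
    return len(set(scale_factors(config_xmls))) > 1


first = "<parameters><scalefactor>1</scalefactor></parameters>"
second = "<parameters><scalefactor>10</scalefactor></parameters>"
if has_scale_factor_change([first, second]):
    print("warning: scale factor differs between configs")
```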

New Dependencies:
- `niet` in `requirements.txt` for parsing a YAML file in a bash script


17zhangw authored Mar 3, 2022

1 parent fb4b2da commit 118871d

behavior/__main__.py

            
                  
@@ -2,7 +2,7 @@

 from plumbum import cli

-from behavior.datagen import gen
+from behavior.datagen import generate_workloads
 from behavior.microservice import app
 from behavior.modeling import train
 from behavior.plans import diff
@@ -17,7 +17,7 @@ def main(self) -> None:

 if __name__ == "__main__":
     logging.basicConfig(format="%(levelname)s:%(asctime)s %(message)s", level=logging.INFO)
-    BehaviorCLI.subcommand("datagen", gen.DataGeneratorCLI)
+    BehaviorCLI.subcommand("generate_workloads", generate_workloads.GenerateWorkloadsCLI)
     BehaviorCLI.subcommand("datadiff", diff.DataDiffCLI)
     BehaviorCLI.subcommand("train", train.TrainCLI)
     BehaviorCLI.subcommand("microservice", app.ModelMicroserviceCLI)
