compare_nano_micro
README.md
awkward_flashgg.ipynb
custom_types.ipynb
environment.yaml
micro_aod.ipynb
nano_aod.ipynb
numpy_flashgg.ipynb
prepare_data_with_sys.py
study_flip_systematics.ipynb
study_uproot_performance.ipynb
test_jitting.py
test_pure_xgb.py
test_xgboost_distributed.ipynb
test_xgboost_uproot.ipynb
train_and_save_models.py

README.md

Prepare

Running this tutorial requires a working installation of conda. Once you have one, follow the instructions below to set up the environment.

1. Clone this repository

$ git clone https://github.com/maxgalli/UsefulHEPScripts
$ cd UsefulHEPScripts/flashgg_investigate

2. Create a conda environment

$ conda env create -f environment.yaml

Content

This folder includes studies performed to find an optimal structure for the new flashgg framework.

XGBoost -> RDataFrame

Here we investigate whether tags and systematics can be applied efficiently in an RDataFrame analysis flow, with the new flashgg framework as the intended application. The idea is to train a model with XGBoost on a set of variables and then apply it, within an RDataFrame flow, to a new dataset that contains the same variables the model was trained on plus two "modified" versions of each (Up and Down, which represent the systematics).

Workflow

prepare_data_with_sys.py: this program fetches two remote datasets, SMHiggsToZZTo4L.root and ZZTo2e2mu.root, which serve as signal and background respectively. The data are processed to flatten the variables of interest (Muon_pt_1, Muon_pt_2, Electron_pt_1, Electron_pt_2), and Define is applied to produce fake branches of systematic variations (variable + _Up and variable + _Down), for a total of 12 branches in each dataset (signal and background). Both datasets are then split into training and test samples, producing train_sys_signal.root, train_sys_background.root, test_sys_signal.root and test_sys_background.root.
train_and_save_models.py: here we use XGBoost to train a classifier that separates signal events from background events. The classifier is saved in two formats: plain XGBoost in classifier.pkl, and a TMVA-digestible format as myBDT inside classifier.root.
test_jitting.py: here we set up a trick (necessary because RDataFrame.Define accepts strings of C++ code as arguments) to conveniently make signal-background predictions for the events stored in test_sys_signal.root; the results end up in the branches y, y_up and y_down. The BDT is set up only once for all three predictions. The predictions for the first 200 events are printed (run python test_jitting.py > test_jitting.log to store them in a log file).
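
The string-based approach described above can be illustrated with a small sketch. It is not the code in test_jitting.py: the helper names, the evaluator name bdt_predict and the scale factors 1.1 and 0.9 are assumptions made for illustration; only the branch names come from this README. It shows how the 8 fake systematic branches and the three prediction expressions could be generated as C++ strings suitable for RDataFrame.Define.

```python
# Branch names taken from the README; everything else here is hypothetical.
VARIABLES = ["Muon_pt_1", "Muon_pt_2", "Electron_pt_1", "Electron_pt_2"]

# Output branch -> suffix of the input branches it is computed from.
SUFFIXES = {"y": "", "y_up": "_Up", "y_down": "_Down"}


def variation_definitions(variables, up=1.1, down=0.9):
    """Map each fake systematic branch to the C++ expression defining it.

    The scale factors are assumed values, not the ones used in the repository.
    """
    defs = {}
    for var in variables:
        defs[f"{var}_Up"] = f"{var} * {up}"
        defs[f"{var}_Down"] = f"{var} * {down}"
    return defs


def prediction_expressions(variables, evaluator="bdt_predict"):
    """Build one C++ call string per output branch.

    All three strings call the same (hypothetical) evaluator, so the model
    only needs to be set up once.
    """
    exprs = {}
    for out, suffix in SUFFIXES.items():
        args = ", ".join(f"{var}{suffix}" for var in variables)
        exprs[out] = f"{evaluator}({args})"
    return exprs


if __name__ == "__main__":
    for name, expr in variation_definitions(VARIABLES).items():
        print(f"Define({name!r}, {expr!r})")
    for name, expr in prediction_expressions(VARIABLES).items():
        print(f"Define({name!r}, {expr!r})")
```

With ROOT available, each (name, expression) pair could be passed to df.Define(name, expression), provided a C++ function matching the evaluator name has first been declared to the interpreter, for example through ROOT.gInterpreter.Declare. The 4 nominal variables plus the 8 generated variations give the 12 branches mentioned above.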

Double-check

Run python test_pure_xgb.py > test_pure_xgb.log to repeat the test on the signal dataset using only XGBoost from Python, then compare the results with those in test_jitting.log.
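
The comparison itself could be automated along the following lines. This is a minimal sketch, assuming the predictions from both runs have already been read into two lists of floats; the README does not specify the log format, so no parsing is shown, and the function name and tolerance are hypothetical. A small absolute tolerance is used because tiny floating-point differences between the two evaluation paths are plausible.

```python
import math


def compare_predictions(reference, candidate, abs_tol=1e-6):
    """Return the indices where two prediction lists differ by more than abs_tol."""
    if len(reference) != len(candidate):
        raise ValueError("prediction lists have different lengths")
    return [
        i
        for i, (ref, cand) in enumerate(zip(reference, candidate))
        if not math.isclose(ref, cand, rel_tol=0.0, abs_tol=abs_tol)
    ]


if __name__ == "__main__":
    # Toy values for illustration only, not real classifier outputs.
    pure_xgb = [0.12, 0.87, 0.45]
    jitted = [0.12, 0.87, 0.46]
    print("mismatching events:", compare_predictions(pure_xgb, jitted))
```

An empty list means the two approaches agree within the chosen tolerance for every event compared.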
