The case study considered in this paper is implemented using the OhioT1DM dataset for blood glucose level prediction. The data contains continuous measurements for six patients with type 1 diabetes over eight weeks. The objective is to learn an optimal policy that maps patients’ time-varying covariates into the amount of insulin injected at each time to maximize patients’ health status.
The OhioT1DM dataset is available from http://smarthealth.cs.ohio.edu/OhioT1DM-dataset.html. However, a Data Use Agreement (DUA) is required to protect the data and ensure that it is used only for research purposes.
In our experiment, we divide each day of follow-up into one-hour intervals, and a treatment decision is made every hour. We consider three important time-varying state variables, including the average blood glucose level within each one-hour interval.
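As an illustration of this hourly discretization, the snippet below averages raw CGM readings within each one-hour decision interval. It is a minimal sketch, not the repository's preprocessing code; the column names (`ts`, `glucose`) and the toy data are assumptions.

```python
import pandas as pd

def hourly_average_glucose(cgm: pd.DataFrame) -> pd.Series:
    """Average blood glucose level within each one-hour decision interval."""
    cgm = cgm.set_index(pd.to_datetime(cgm["ts"]))
    return cgm["glucose"].resample("1H").mean()

# Toy example: 12 readings taken every 5 minutes collapse into one hourly state.
toy = pd.DataFrame({
    "ts": pd.date_range("2021-01-01 00:00", periods=12, freq="5min"),
    "glucose": [110, 112, 115, 118, 120, 119, 117, 116, 114, 113, 112, 111],
})
print(hourly_average_glucose(toy))
```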
The data used in the paper is called `trajs_mh.pkl`. However, for confidentiality considerations, we do not put it in this repository. The code to generate this data is placed in `generate_trajs_mh.py` in the `data` folder. Once you have downloaded the raw data, put it in the same folder as the code, and then run the code to get the data used in this paper.
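Once generated, the file can be inspected with a few lines of Python. This is a minimal sketch assuming `trajs_mh.pkl` is a standard pickle file sitting in the `data` folder; the internal structure of the stored trajectories is defined by `generate_trajs_mh.py`.

```python
import pickle

# Load the generated trajectories (path assumes the file is in the data folder).
with open("data/trajs_mh.pkl", "rb") as f:
    trajs = pickle.load(f)

print(type(trajs))  # quick sanity check on the generated object
```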
Two OpenAI Gym environments, `LunarLander-v2` and `Qbert-ram-v0`, are used to generate the synthetic data. You can run `qr_dqn_online.ipynb` in the `data` folder and change `env_name` and `num_actions` to get `trajs_qr_dqn_lunar.pkl` and `trajs_qr_dqn_qbert.pkl`, respectively. These two files are zipped in `data/synthetic_data.rar`.
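For reference, the settings changed in the notebook look roughly like the following. The variable names `env_name` and `num_actions` come from the text above; reading the action count from the environment itself is simply a convenient way to get the right value for each game.

```python
import gym

env_name = "LunarLander-v2"        # or "Qbert-ram-v0"
env = gym.make(env_name)
num_actions = env.action_space.n   # 4 for LunarLander-v2
print(env_name, num_actions)
```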
The `seal` folder contains the core code to implement the proposed method and various utility functions.

- `models`: network structures.
- `agents`: DQN, QR-DQN, MultiHeadDQN (REM), DiscreteBCQ, and BEAR (MMD replaced by KL control) agents.
- `replay_buffers`: basic and prioritized replay buffers.
- `algos`: behavior cloning, density estimator, advantage learner, fitted Q evaluation, etc.
- `utils`: utility functions.
We assume that the complexity of one forward and backward pass of a network is S (a back-of-the-envelope cost calculator is sketched after this list).

- Step 2 (Policy optimization): training L DQN agents with batch size B_1 and training steps I_1; total O(L * I_1 * B_1 * S).
- Step 3 (Estimation of the density ratio): training L density estimators with batch size B_2 and training steps I_2; total O(L * I_2 * B_2^4 * S).
- Step 4 (Construction of pseudo outcomes): batch size B_3; total O(B_3 * N * T * A * S), where N is the number of trajectories, T the average length of trajectories, and A the number of actions.
- Step 5 (Training $\tau$): batch size B_4 and training steps I_4; total O(I_4 * B_4 * S).
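The following calculator evaluates these costs for concrete settings. The numbers passed in are placeholders, not the configurations used in the paper; S is left as an abstract per-pass cost.

```python
def seal_cost(L, I1, B1, I2, B2, B3, N, T, A, I4, B4, S=1):
    """Rough operation counts for steps 2-5, mirroring the bounds above."""
    return {
        "step2_policy_optimization": L * I1 * B1 * S,
        "step3_density_ratio":       L * I2 * B2 ** 4 * S,
        "step4_pseudo_outcomes":     B3 * N * T * A * S,
        "step5_train_tau":           I4 * B4 * S,
    }

# Placeholder settings only, to show the relative magnitude of each step.
print(seal_cost(L=3, I1=10_000, B1=64, I2=5_000, B2=32,
                B3=64, N=200, T=50, A=4, I4=10_000, B4=64))
```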
All experiments run on a single computer instance with 40 Intel(R) Xeon(R) 2.20GHz CPUs.
- Python version: Python 3.6.8 :: Anaconda custom (64-bit)
- numpy == 1.18.1
- pandas == 1.0.3
- sklearn == 0.22.1
- tensorflow == 2.1.0
- tensorflow-probability == 0.9
- gym == 0.17.3
- Copy `data/data/synthetic_data.rar/trajs_qr_dqn_lunar.pkl` to the `lunarlander-v2/dqn_2_200/random/` folder and rename it to `trajs_qr_dqn.pkl` (a copy-and-rename sketch is given after this list). Use the `cd` command to switch to the `lunarlander-v2` directory and run the Python file `batch_seal_vs_dqn.py` (around 20 hours without GPU support). This will generate DQN vs. SEAL offline training results, based on 200 trajectories randomly sampled out of the total trajectories, in `.csv` files. Similarly, we can obtain DDQN, QR-DQN, REM, Discrete-BCQ, and Discrete-BEAR results. The same procedure applies to `Qbert-ram-v0`.
- We aggregate all the results in the `synthetic_results` folder, which contains the `plots_lunar` and `plots_qbert` folders. Each folder contains `dqn.csv`, `ddqn.csv`, `qrdqn.csv`, and `4_methods.csv`.
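The copy-and-rename step in the first bullet can be scripted as below. This is a sketch only: it assumes `synthetic_data.rar` has already been extracted, and the extraction path is an assumption rather than something fixed by the repository.

```python
import shutil

# Assumed location of the extracted archive; adjust to wherever the .rar was unpacked.
src = "data/synthetic_data/trajs_qr_dqn_lunar.pkl"
dst = "lunarlander-v2/dqn_2_200/random/trajs_qr_dqn.pkl"  # renamed as required
shutil.copyfile(src, dst)
```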
- Run `DQN_mh.ipynb` under the `realdata` folder after putting `trajs_mh.pkl` into the `realdata/data/mh/dqn` folder. This will generate DQN vs. SEAL training results in a `.pkl` file (a sketch for inspecting these files follows this list). Similarly, we can obtain DDQN, QR-DQN, REM, Discrete-BCQ, and Discrete-BEAR results.
- We aggregate all the results in the `real_data_results` folder, which contains `dqn.csv`, `ddqn.csv`, `qrdqn.csv`, and `4_methods.csv`.
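A quick way to look at the generated `.pkl` result files is sketched below; the internal structure of each file is an assumption, so only the object type is printed.

```python
import glob
import pickle

# Walk the realdata folder and peek at every results pickle found there.
for path in glob.glob("realdata/**/*.pkl", recursive=True):
    with open(path, "rb") as f:
        results = pickle.load(f)
    print(path, type(results))
```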
You can reproduce Figure 2, Figure 3, and Figure 4 in the article by running `plot_figures.py`. The `figs` folder contains these figures, named `Figure_2.png`, `Figure_3.png`, and `Figure_4.png`.
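If you want to build a similar figure from the aggregated CSVs directly, a minimal sketch is below. It is not `plot_figures.py`; the CSV is assumed to hold one column of returns per method, which is an assumption about the file layout.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot one learning curve per column of the aggregated results file.
df = pd.read_csv("synthetic_results/plots_lunar/4_methods.csv")
df.plot()
plt.xlabel("training iteration")
plt.ylabel("average return")
plt.savefig("figs/example_figure.png")  # hypothetical output name
```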