This repository contains code for our ICML 2024 paper, When Do Skills Help Reinforcement Learning? A Theoretical Analysis of Temporal Abstractions. If you use this code, please cite:
```bibtex
@inproceedings{li2022rlskilltheory,
  title={When Do Skills Help Reinforcement Learning? A Theoretical Analysis of Temporal Abstractions},
  author={Li, Zhening and Poesia, Gabriel and Solar-Lezama, Armando},
  booktitle={Proceedings of the 41st International Conference on Machine Learning},
  year={2024}
}
```
We use Python 3.9. Run

```bash
pip install -r requirements.txt
```

to install all dependencies other than PyTorch. To install PyTorch, follow the instructions on its official webpage.
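For reference, a typical end-to-end setup might look like the sketch below. The virtual-environment name and the plain `pip install torch` command are our assumptions, not part of the repo; pick the PyTorch build (CPU vs. CUDA) that matches your system.

```bash
# Create and activate a Python 3.9 virtual environment (the name "venv" is arbitrary)
python3.9 -m venv venv
source venv/bin/activate

# Install all dependencies except PyTorch
pip install -r requirements.txt

# Install PyTorch; the exact command depends on your platform and CUDA version,
# so prefer the command generated at https://pytorch.org
pip install torch
```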
## Experiments determining the correlation between RL difficulty and RL sample efficiency (`train_rl.py`)
Experiments were conducted on 4 environments:

- `CliffWalking` [1]: a simple grid world (implementation by Gymnasium)
- `CompILE2` [2]: the CompILE grid world with visit length 2
- `8Puzzle`: the 8-puzzle
- `RubiksCube222`: the 2-by-2 Rubik's cube (implementation by `rubiks_cube_gym` [3])
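For orientation, the base `CliffWalking` environment can be instantiated directly through Gymnasium. The snippet below is an illustrative sketch only, not this repo's interface; the experiments themselves go through `train_rl.py`, which handles the macroaction augmentations.

```python
# Illustrative sketch (not from this repo): the base CliffWalking environment
# as provided by Gymnasium, which this repository builds on.
import gymnasium as gym

env = gym.make("CliffWalking-v0")
obs, info = env.reset(seed=0)

# Take one random action; step() returns the usual 5-tuple
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```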
We applied 4 RL algorithms:
- Q-learning
- Value iteration (modified to the RL setting)
- REINFORCE
- DQN
There are 3 sets of experiments:

- Testing how well $p$-learning difficulty and $p$-exploration difficulty capture RL sample complexity on various macroaction augmentations of the same environment. For each environment, we conduct RL on the base environment and 31 macroaction augmentations. 6 of the 31 macroaction augmentations (located under `abs_examples/`) are manually derived from the optimal macroactions (located under `abs_optimal/`). To reproduce these optimal macroactions with LEMMA [4], use the bash scripts located under `abs_scripts/` for existing configs.
- Testing how well $p$-learning difficulty captures the complexity of planning algorithms (state/action value iteration) on various macroaction augmentations of the same environment.
- Testing how well unmerged $p$-incompressibility captures the difficulty of learning useful skills for hierarchical RL. In this set of experiments, we test two skill-learning algorithms: LEMMA [4] for macroactions and LOVE [5] for neural skills.
Follow these steps to reproduce our experimental results.
1. First, use the bash scripts located under `envinfo_scripts_main/` to compute information (e.g., RL difficulty metrics) about each environment, including both the base environment and the 31 macroaction augmentations. The expected running time per script varies from a few seconds (`CliffWalking`) to about an hour (`RubiksCube222`). One way to batch these runs is sketched below.
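   Assuming each script is self-contained (check individual scripts for required arguments or GPU settings before batching them), they can be run sequentially with a loop like this:

   ```bash
   # Run every environment-info script in sequence
   for script in envinfo_scripts_main/*.sh; do
       bash "$script"
   done
   ```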
2. Run the scripts `scripts_main/QLearning/ENV_true-Q_few-abs-extra[_sN].sh`, `scripts_main/ValueIteration/ENV_true-V_few-abs-extra[_sN].sh`, and `scripts_main/REINFORCE/ENV_policy-from-Q_few-abs-extra[_sN].sh` for each `ENV` (`CliffWalking`, `CompILE2`, `8Puzzle`, `RubiksCube222`). The `_sN` suffix in the script name denotes the seed (absence of the suffix denotes seed 0). The expected running time per script varies from a few seconds (`CliffWalking`) to a few minutes (`RubiksCube222`). These scripts calculate the ground-truth state and action values and the optimal policies of all 32 variants of each environment. An example of expanding the placeholders is shown below.
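   For instance, substituting `ENV = CliffWalking` into the Q-learning pattern (seed 0 has no suffix; seed 1 uses `_s1`):

   ```bash
   # Seed 0 (no suffix)
   bash scripts_main/QLearning/CliffWalking_true-Q_few-abs-extra.sh

   # Seed 1
   bash scripts_main/QLearning/CliffWalking_true-Q_few-abs-extra_s1.sh
   ```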
3. For the experiments with the planning algorithms (state/action value iteration), run the scripts `scripts_main/QLearning/ENV_no-expl_few-abs-extra[_sN].sh` and `scripts_main/ValueIteration/ENV_no-expl_few-abs-extra[_sN].sh` for each `ENV` (`CliffWalking`, `CompILE2`, `8Puzzle`, `RubiksCube222`). As before, the `_sN` suffix denotes the seed (absence of the suffix denotes seed 0). The expected running time per script varies from several seconds (`CliffWalking`) to a few tens of minutes (`RubiksCube222`). Note that `RubiksCube222` uses the most GPU memory; if you're using 8 GPUs at once, having 15 GB of memory available on each is sufficient.
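   If you need to pin a run to specific GPUs, the standard `CUDA_VISIBLE_DEVICES` mechanism should apply, since these are PyTorch jobs (we have not verified how the scripts select devices internally):

   ```bash
   # Expose only GPUs 0-3 to the run
   CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts_main/ValueIteration/RubiksCube222_no-expl_few-abs-extra.sh
   ```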
4. Then use the following scripts to conduct all training runs for vanilla RL and hRL with LEMMA macroactions:
   - Q-learning: `scripts_main/QLearning/ENV_few-abs-extra-trunc-replay-adapteps[_sN].sh`
   - Value iteration: `scripts_main/ValueIteration/ENV_few-abs-extra-trunc-replay-adapteps[_sN].sh`
   - REINFORCE: `scripts_main/REINFORCE/ENV_few-abs-extra-trunc[_sN].sh`
   - DQN: `scripts_main/QLearning/ENV_deep_few-abs-extra-trunc-replay-adapteps[_sN].sh`
   The expected running time per script varies from a couple of hours (`CliffWalking`) to several days (`RubiksCube222`). Since there are over a hundred scripts to run, we use the script `track_unfinished.py` to track which runs have completed, are in progress, or have not begun; a sketch of driving the batch this way follows this step.

   We also provide scripts for the following deep RL algorithms that we did not get a chance to experiment with. They use the same neural state embedders as DQN:
   - Deep value iteration: `scripts_main/ValueIteration/ENV_deep_few-abs-extra-trunc-replay-adapteps[_sN].sh` (not yet implemented, so these scripts will not run properly)
   - Deep REINFORCE: `scripts_main/REINFORCE/ENV_deep_few-abs-extra-trunc[_sN].sh`
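   A hedged sketch of driving the batch: the loop below is our own construction, and we have not verified the command-line interface of `track_unfinished.py`, so check that script for its actual arguments.

   ```bash
   # Launch the seed-0 training runs under QLearning/ (tabular and deep),
   # one after another; in practice, runs would be parallelized across GPUs
   for script in scripts_main/QLearning/*_few-abs-extra-trunc-replay-adapteps.sh; do
       bash "$script"
   done

   # Report which runs have completed, are in progress, or have not begun
   # (assumed invocation; see the script itself for its actual arguments)
   python track_unfinished.py
   ```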
5. For hRL with LOVE options, first download this zip file to the root of the repository. It contains LOVE options trained on offline trajectory data from each base environment. Extract its contents into `abs_optimal/`:

   ```bash
   unzip love_ckpts.zip -d abs_optimal
   ```

   Next, download this zip file to the root of the repository. It contains the offline trajectory data that the LOVE options were trained on. Extract its contents:

   ```bash
   unzip trajectories.zip
   ```

   Finally, run the training scripts `scripts_main/QLearning/ENV_deep_love-trunc-replay-adapteps[_sN].sh`. The expected running time per script varies from a few seconds (`CliffWalking`) to a couple of days (`RubiksCube222`). As with the training runs for LEMMA abstractions, the script `track_unfinished.py` can be used to track progress.
6. To analyze the experimental results, use the Jupyter notebook `notebooks/analysis.ipynb`.
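   For example, assuming Jupyter is available in your environment (we have not checked whether it is listed in `requirements.txt`), the notebook can be opened with:

   ```bash
   jupyter notebook notebooks/analysis.ipynb
   ```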
We have provided the learnt LEMMA macroactions under `abs_optimal/` and the LOVE options in this zip file (which you should have downloaded and extracted to `abs_optimal/` in Step 5 above). To reproduce all of these abstractions yourself, use the scripts located under `abs_scripts/`.
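Assuming each script under `abs_scripts/` is self-contained (check individual scripts for required arguments first), all abstractions can be re-learnt in one pass:

```bash
# Re-learn all abstractions (LEMMA macroactions and LOVE options)
for script in abs_scripts/*.sh; do
    bash "$script"
done
```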
## References

1. Sutton, R. S. and Barto, A. G. Temporal difference learning. In *Reinforcement Learning: An Introduction*, chapter 6. MIT Press, 2018.
2. Kipf, T., Li, Y., Dai, H., Zambaldi, V., Sanchez-Gonzalez, A., Grefenstette, E., Kohli, P., and Battaglia, P. CompILE: Compositional imitation learning and execution. In *International Conference on Machine Learning*, pp. 3418-3428. PMLR, 2019.
3. Hukmani, K., Kolekar, S., and Vobugari, S. Solving twisty puzzles using parallel Q-learning. *Engineering Letters*, 29(4), 2021.
4. Li, Z., Poesia, G., Costilla-Reyes, O., Goodman, N., and Solar-Lezama, A. LEMMA: Bootstrapping high-level mathematical reasoning with learned symbolic abstractions. NeurIPS'22 MATH-AI Workshop, 2022.
5. Jiang, Y., Liu, E., Eysenbach, B., Kolter, J. Z., and Finn, C. Learning options via compression. *Advances in Neural Information Processing Systems*, 35:21184-21199, 2022.