This repository contains code for our ICML 2024 paper, When Do Skills Help Reinforcement Learning? A Theoretical Analysis of Temporal Abstractions. If you use this code, please cite:
```bibtex
@inproceedings{li2022rlskilltheory,
  title={When Do Skills Help Reinforcement Learning? A Theoretical Analysis of Temporal Abstractions},
  author={Li, Zhening and Poesia, Gabriel and Solar-Lezama, Armando},
  booktitle={Proceedings of the 41st International Conference on Machine Learning},
  year={2024}
}
```
We use Python 3.9. Run

```bash
pip install -r requirements.txt
```

to install all dependencies other than PyTorch. To install PyTorch, follow the instructions on its official webpage.
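For reference, a typical end-to-end setup might look like the sketch below. The virtual-environment name and the plain `pip install torch` command are our assumptions, not part of the repo; pick the PyTorch build (CPU vs. CUDA) that matches your system.

```bash
# Create and activate a Python 3.9 virtual environment (the name "venv" is arbitrary)
python3.9 -m venv venv
source venv/bin/activate

# Install all dependencies except PyTorch
pip install -r requirements.txt

# Install PyTorch; the exact command depends on your platform and CUDA version,
# so prefer the command generated at https://pytorch.org
pip install torch
```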
## Experiments determining the correlation between RL difficulty and RL sample efficiency (`train_rl.py`)
Experiments were conducted on 4 environments:

- `CliffWalking` [1]: a simple grid world (implementation by Gymnasium)
- `CompILE2` [2]: the CompILE grid world with visit length 2
- `8Puzzle`: the 8-puzzle
- `RubiksCube222`: the 2-by-2 Rubik's cube (implementation by `rubiks_cube_gym` [3])
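For orientation, the base `CliffWalking` environment can be instantiated directly through Gymnasium. The snippet below is an illustrative sketch only, not this repo's interface; the experiments themselves go through `train_rl.py`, which handles the macroaction augmentations.

```python
# Illustrative sketch (not from this repo): the base CliffWalking environment
# as provided by Gymnasium, which this repository builds on.
import gymnasium as gym

env = gym.make("CliffWalking-v0")
obs, info = env.reset(seed=0)

# Take one random action; step() returns the usual 5-tuple
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```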
We applied 4 RL algorithms:
- Q-learning
- Value iteration (modified to the RL setting)
- REINFORCE
- DQN
There are 3 sets of experiments:

- Testing how well $p$-learning difficulty and $p$-exploration difficulty capture RL sample complexity on various macroaction augmentations of the same environment. For each environment, we conduct RL on the base environment and 31 macroaction augmentations. 6 of the 31 macroaction augmentations (located under `abs_examples/`) are manually derived from the optimal macroactions (located under `abs_optimal/`). To reproduce these optimal macroactions with LEMMA [4], use the bash scripts located under `abs_scripts/` for existing configs.
- Testing how well $p$-learning difficulty captures the complexity of planning algorithms (state/action value iteration) on various macroaction augmentations of the same environment.
- Testing how well unmerged $p$-incompressibility captures the difficulty of learning useful skills for hierarchical RL. In this set of experiments, we test two skill-learning algorithms: LEMMA [4] for macroactions and LOVE [5] for neural skills.
Follow these steps to reproduce our experimental results.
1. First, use the bash scripts located under `envinfo_scripts_main/` to compute information (e.g., RL difficulty metrics) about each environment, including both the base environment and the 31 macroaction augmentations. The expected running time per script varies from a few seconds (`CliffWalking`) to about an hour (`RubiksCube222`). One way to batch these runs is sketched below.
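   Assuming each script is self-contained (check individual scripts for required arguments or GPU settings before batching them), they can be run sequentially with a loop like this:

   ```bash
   # Run every environment-info script in sequence
   for script in envinfo_scripts_main/*.sh; do
       bash "$script"
   done
   ```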
2. Run the scripts `scripts_main/QLearning/ENV_true-Q_few-abs-extra[_sN].sh`, `scripts_main/ValueIteration/ENV_true-V_few-abs-extra[_sN].sh`, and `scripts_main/REINFORCE/ENV_policy-from-Q_few-abs-extra[_sN].sh` for each `ENV` (`CliffWalking`, `CompILE2`, `8Puzzle`, `RubiksCube222`). The `_sN` suffix in the script name denotes the seed (absence of the suffix denotes seed 0). The expected running time per script varies from a few seconds (`CliffWalking`) to a few minutes (`RubiksCube222`). These scripts calculate the ground-truth state and action values and the optimal policies of all 32 variants of each environment. An example of expanding the placeholders is shown below.
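   For instance, substituting `ENV = CliffWalking` into the Q-learning pattern (seed 0 has no suffix; seed 1 uses `_s1`):

   ```bash
   # Seed 0 (no suffix)
   bash scripts_main/QLearning/CliffWalking_true-Q_few-abs-extra.sh

   # Seed 1
   bash scripts_main/QLearning/CliffWalking_true-Q_few-abs-extra_s1.sh
   ```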
3. For the experiments with the planning algorithms (state/action value iteration), run the scripts `scripts_main/QLearning/ENV_no-expl_few-abs-extra[_sN].sh` and `scripts_main/ValueIteration/ENV_no-expl_few-abs-extra[_sN].sh` for each `ENV` (`CliffWalking`, `CompILE2`, `8Puzzle`, `RubiksCube222`). As before, the `_sN` suffix denotes the seed (absence of the suffix denotes seed 0). The expected running time per script varies from several seconds (`CliffWalking`) to a few tens of minutes (`RubiksCube222`). Note that `RubiksCube222` uses the most GPU memory; if you're using 8 GPUs at once, having 15 GB of memory available on each is sufficient.
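   If you need to pin a run to specific GPUs, the standard `CUDA_VISIBLE_DEVICES` mechanism should apply, since these are PyTorch jobs (we have not verified how the scripts select devices internally):

   ```bash
   # Expose only GPUs 0-3 to the run
   CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts_main/ValueIteration/RubiksCube222_no-expl_few-abs-extra.sh
   ```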
4. Then use the following scripts to conduct all training runs for vanilla RL and hRL with LEMMA macroactions:
   - Q-learning: `scripts_main/QLearning/ENV_few-abs-extra-trunc-replay-adapteps[_sN].sh`
   - Value iteration: `scripts_main/ValueIteration/ENV_few-abs-extra-trunc-replay-adapteps[_sN].sh`
   - REINFORCE: `scripts_main/REINFORCE/ENV_few-abs-extra-trunc[_sN].sh`
   - DQN: `scripts_main/QLearning/ENV_deep_few-abs-extra-trunc-replay-adapteps[_sN].sh`
   The expected running time per script varies from a couple of hours (`CliffWalking`) to several days (`RubiksCube222`). Since there are over a hundred scripts to run, we use the script `track_unfinished.py` to track which runs have completed, are in progress, or have not begun; a sketch of driving the batch this way follows this step.

   We also provide scripts for the following deep RL algorithms that we did not get a chance to experiment with. They use the same neural state embedders as DQN:
   - Deep value iteration: `scripts_main/ValueIteration/ENV_deep_few-abs-extra-trunc-replay-adapteps[_sN].sh` (not yet implemented, so these scripts will not run properly)
   - Deep REINFORCE: `scripts_main/REINFORCE/ENV_deep_few-abs-extra-trunc[_sN].sh`
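   A hedged sketch of driving the batch: the loop below is our own construction, and we have not verified the command-line interface of `track_unfinished.py`, so check that script for its actual arguments.

   ```bash
   # Launch the seed-0 training runs under QLearning/ (tabular and deep),
   # one after another; in practice, runs would be parallelized across GPUs
   for script in scripts_main/QLearning/*_few-abs-extra-trunc-replay-adapteps.sh; do
       bash "$script"
   done

   # Report which runs have completed, are in progress, or have not begun
   # (assumed invocation; see the script itself for its actual arguments)
   python track_unfinished.py
   ```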
5. For hRL with LOVE options, first download this zip file to the root of the repository. It contains LOVE options trained on offline trajectory data from each base environment. Extract its contents into `abs_optimal/`:

   ```bash
   unzip love_ckpts.zip -d abs_optimal
   ```

   Next, download this zip file to the root of the repository. It contains the offline trajectory data that the LOVE options were trained on. Extract its contents:

   ```bash
   unzip trajectories.zip
   ```

   Finally, run the training scripts `scripts_main/QLearning/ENV_deep_love-trunc-replay-adapteps[_sN].sh`. The expected running time per script varies from a few seconds (`CliffWalking`) to a couple of days (`RubiksCube222`). As with the training runs for LEMMA abstractions, the script `track_unfinished.py` can be used to track progress.
6. To analyze the experimental results, use the Jupyter notebook `notebooks/analysis.ipynb`.
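   For example, assuming Jupyter is available in your environment (we have not checked whether it is listed in `requirements.txt`), the notebook can be opened with:

   ```bash
   jupyter notebook notebooks/analysis.ipynb
   ```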
We have provided the learnt LEMMA macroactions under `abs_optimal/` and the LOVE options in this zip file (which you should have downloaded and extracted to `abs_optimal/` in Step 5 above). To reproduce all of these abstractions yourself, use the scripts located under `abs_scripts/`.
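Assuming each script under `abs_scripts/` is self-contained (check individual scripts for required arguments first), all abstractions can be re-learnt in one pass:

```bash
# Re-learn all abstractions (LEMMA macroactions and LOVE options)
for script in abs_scripts/*.sh; do
    bash "$script"
done
```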
## References

1. Sutton, R. S. and Barto, A. G. Temporal difference learning. In *Reinforcement Learning: An Introduction*, chapter 6. MIT Press, 2018.
2. Kipf, T., Li, Y., Dai, H., Zambaldi, V., Sanchez-Gonzalez, A., Grefenstette, E., Kohli, P., and Battaglia, P. CompILE: Compositional imitation learning and execution. In *International Conference on Machine Learning*, pp. 3418-3428. PMLR, 2019.
3. Hukmani, K., Kolekar, S., and Vobugari, S. Solving twisty puzzles using parallel Q-learning. *Engineering Letters*, 29(4), 2021.
4. Li, Z., Poesia, G., Costilla-Reyes, O., Goodman, N., and Solar-Lezama, A. LEMMA: Bootstrapping high-level mathematical reasoning with learned symbolic abstractions. NeurIPS'22 MATH-AI Workshop, 2022.
5. Jiang, Y., Liu, E., Eysenbach, B., Kolter, J. Z., and Finn, C. Learning options via compression. *Advances in Neural Information Processing Systems*, 35:21184-21199, 2022.