Unofficial implementation of the paper Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control, using PyTorch and GPyTorch.
Trial-and-error based reinforcement learning (RL) has seen rapid advancements in recent times, especially with the advent of deep neural networks.
However, the majority of autonomous RL algorithms either rely on engineered features or a large number of interactions with the environment.
Such a large number of interactions may be impractical in many real-world applications.
For example, robots are subject to wear and tear and, hence, millions of interactions may change or damage the system.
Moreover, practical systems have limitations in the form of the maximum torque that can be safely applied.
To reduce the number of system interactions while naturally handling constraints, we propose a model-based RL framework based on Model Predictive Control (MPC).
In particular, we propose to learn a probabilistic transition model using Gaussian Processes (GPs) to incorporate model uncertainties into long-term predictions, thereby,
reducing the impact of model errors. We then use MPC to find a control sequence that minimises the expected long-term cost.
We provide theoretical guarantees for the first-order optimality in the GP-based transition models with deterministic approximate inference for long-term planning.
The proposed framework demonstrates superior data efficiency and learning rates compared to the current state of the art.
For each experiment, two plots help to visualize and understand the control:
- 2d plots showing the states, actions and costs during control:
  - The top graph shows the states along with the predicted states and uncertainty from n time steps earlier. The value of n is specified in the legend.
  - The middle graph shows the actions.
  - The bottom graph shows the real cost alongside the predicted trajectory cost, which is the mean of the future predicted costs, and its uncertainty.
- 3d plots showing the Gaussian process models and the points in memory. In this plot, each graph in the top row represents the change in one state for the next step as a function of the current states and actions. The indices in the x and y axis names refer to either states or actions. For example, the input with index 3 represents the action for the pendulum. Action indices are higher than state indices. Since not every input of the GP can be shown on a 3d graph, the axes are chosen to represent the two inputs (state or action) with the smallest lengthscales, so the x-y axes may differ between graphs. The graphs in the bottom row represent the predicted uncertainty, and the points are the prediction errors. Points stored in the memory of the Gaussian process model are shown in green; points that were not stored because they were too similar to points already in memory are shown in black.
During the control, a dynamic graph similar to the 2d plot described above shows the evolution of the states, actions and costs, together with the predicted states, actions and costs computed by the model for the MPC. The predicted future states, actions and loss are drawn as dashed lines, along with their confidence intervals (2 standard deviations).
The following figure shows the mean cost over 10 runs:
We can see that the model learns to control the environment from scratch in fewer than one hundred interactions. As a comparison, the state-of-the-art model-free reinforcement learning algorithms in https://github.com/quantumiracle/SOTA-RL-Algorithms need more than 15 episodes of 200 interactions with the environment to solve the problem.
The following figures and animation show an example of control.
The following figure shows the 2d graphs for the inverted pendulum that is shown in the animation.
And the Gaussian process models and the points in memory:
The dynamic graph updated in real-time:
The mountain car problem is different in that more time steps must be planned ahead in order to control the environment. To keep the planning horizon manageable, the parameter that repeats actions has been set to 5: in the example shown, 1 control time step corresponds to 5 environment time steps during which the action is held constant. Without this trick, the control either fails or the computation times become too high.
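The action-repeat idea can be sketched as follows. This is a minimal illustration, not the code of this repository: `ToyEnv` is a hypothetical stand-in for a gym environment, `step_with_repeat` and `num_repeat` are names chosen for the example, and the 4-tuple returned by `step` assumes the classic gym API.

```python
class ToyEnv:
    """Hypothetical stand-in for a gym environment: the observation is a step counter."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        observation = self.t
        reward = -abs(action)
        done = self.t >= 20
        return observation, reward, done, {}


def step_with_repeat(env, action, num_repeat):
    """Hold the same action for num_repeat environment steps.

    Returns the last observation and the summed reward, so that the
    controller sees one (longer) control step.
    """
    total_reward = 0.0
    observation, done, info = None, False, {}
    for _ in range(num_repeat):
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:  # stop early if the episode ends during the repeat
            break
    return observation, total_reward, done, info


env = ToyEnv()
obs, reward, done, _ = step_with_repeat(env, action=0.5, num_repeat=5)
```

With a repeat of 5, a planning horizon of N control steps covers 5 * N environment steps, which is why the horizon seen by the optimizer stays short.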
The mean costs over 10 runs can be seen in the following figure:
As for the pendulum, the optimal control is obtained in very few steps compared to state-of-the-art model-free reinforcement learning agents.
The following figures and animation show an example of control.
numpy, gym, pytorch, gpytorch, matplotlib, scikit-learn, ffmpeg
Download Anaconda and open an Anaconda prompt window:
git clone https://github.com/SimonRennotte/Data-Efficient-Reinforcement-Learning-with-Probabilistic-Model-Predictive-Control
cd Data-Efficient-Reinforcement-Learning-with-Probabilistic-Model-Predictive-Control
conda env create -f environment.yml
conda activate gp_rl_env
python main.py
To run the script with your own environment, you must first define it as a gym environment, then create two json files inside the folder params that contain all the parameters related to the control. For an example of such a definition, see the custom env defined in the file utils/env_example
- The parameters of the main script are stored in main_parameters_env.json, which specifies:
  - Which gym environment to use.
  - The parameters related to visualizations.
  - The number of runs to perform for the computation of mean losses. If it is set to 1, the mean losses will not be computed.
- For each gym environment, a json file contains all the parameters of the control used for that environment. The file name follows the pattern parameters_"gym_env_name".json
The plots and animations will be saved in the folder "folder_save", with the following structure: folder_save => environment name => time and date of the run
For more information about the parameters, see PARAMETERS.md
The approach uses a model to control the environment. This family of methods is called Model Predictive Control (MPC). At each interaction with the real environment, the optimal action is obtained through an iterative approach: the model is used to evaluate candidate actions over a fixed time horizon by simulating the environment. This simulation is run several times with different actions at each interaction with the real world to find the optimal actions within the time horizon window. The first control of the time horizon is then applied as the next action in the real world. In traditional control theory, the model is a mathematical model obtained from theory. Here, the model is a Gaussian process that learns from observed data.
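The receding-horizon loop described above can be sketched in a few lines. This is only an illustration under strong assumptions: the model is a hand-written deterministic function instead of a Gaussian process, the "real" system is taken to be identical to the model, and the planner is an exhaustive search over a small hypothetical discrete action set, swapped in for a gradient-based optimizer. All names are chosen for the example.

```python
import itertools

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def model(state, action):
    """Hypothetical one-step model: the action shifts the state."""
    return state + 0.1 * action


def trajectory_cost(state, actions, target):
    """Simulate the model over the horizon and sum the squared distances to the target."""
    cost = 0.0
    for action in actions:
        state = model(state, action)
        cost += (state - target) ** 2
    return cost


def plan(state, target, horizon):
    """Evaluate every action sequence of the given length and return the cheapest."""
    candidates = itertools.product(ACTIONS, repeat=horizon)
    return min(candidates, key=lambda seq: trajectory_cost(state, seq, target))


state, target = 0.0, 1.0
history = []
for _ in range(20):
    planned = plan(state, target, horizon=3)
    # Only the first planned action is applied; the plan is recomputed at the next step.
    state = model(state, planned[0])
    history.append(state)
```

Replanning at every step is what lets the controller correct for model errors: the rest of each plan is discarded once a new observation is available.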
Gaussian processes are used to predict the change of states as a function of states and actions. The predictions take the form of a distribution, which also provides the uncertainty of these predictions. Gaussian processes are defined by a mean and a covariance function, and store previous points (states(t), actions(t), (states(t+1) - states(t))) in memory. To compute new predictions, the covariance between the new points and the points stored in memory is calculated, from which the predicted distribution follows with a little mathematics. Conceptually, Gaussian processes can be seen as looking at nearby points in memory to compute predictions at new points. The uncertainty grows with the distance between the new point and the points stored in memory. In our case, one Gaussian process is used for each state; it has n (number of states) + m (number of actions) inputs and 1 output, which predicts the variation of that state.
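To make the "little mathematics" concrete, here is a dependency-free sketch of the standard Gaussian process posterior for one state dimension. It is not the GPyTorch model used in this repository: the zero mean, the squared-exponential kernel with fixed hyperparameters, the noise value and the toy data are all assumptions made for the example.

```python
import math


def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two input vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return variance * math.exp(-0.5 * sq_dist / lengthscale ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def gp_predict(inputs, targets, query, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at a single query point."""
    n = len(inputs)
    gram = [[rbf(inputs[i], inputs[j]) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    k_star = [rbf(x, query) for x in inputs]
    alpha = solve(gram, targets)   # K^-1 y
    weights = solve(gram, k_star)  # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(query, query) - sum(k * w for k, w in zip(k_star, weights))
    return mean, max(var, 0.0)


# Memory for one state dimension: inputs are (state, action),
# targets are the observed change of that state, state(t+1) - state(t).
inputs = [(0.0, 0.0), (0.5, 1.0), (1.0, -1.0), (1.5, 0.5)]
targets = [0.0, 0.1, -0.1, 0.05]

mean_near, var_near = gp_predict(inputs, targets, (0.5, 1.0))  # a point in memory
mean_far, var_far = gp_predict(inputs, targets, (5.0, 5.0))    # far from all points
```

Querying a point that is already in memory returns a mean close to the stored target with a small variance, while a query far from every stored point falls back to the prior (mean near 0, variance near the kernel variance).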
One specificity of the paper is that uncertainties are propagated during trajectory calculations, which makes it possible to compute the uncertainty of the loss over the simulation horizon. This makes it possible to explore more efficiently by rewarding states where the uncertainty of the loss is high. It can also be used to get a real-time idea of the model's certainty about the future. Uncertainty can also be used to impose safety constraints, by prohibiting visits to states where the uncertainty is too high through constraints on the lower or upper limit of the state confidence interval. This method is already used for safe Bayesian optimization. For example, it has been used to optimize UAV controllers to avoid crashes during optimization.
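A constraint on the state confidence interval could look like the sketch below. It assumes the predicted means and standard deviations along the horizon are already available (for instance from the propagated state distribution); the function name, the bounds and the numbers are hypothetical and only illustrate the check itself.

```python
def violates_bounds(means, stds, lower, upper, num_std=2.0):
    """Return True if any confidence interval along the horizon leaves [lower, upper]."""
    for mean, std in zip(means, stds):
        if mean - num_std * std < lower or mean + num_std * std > upper:
            return True
    return False


predicted_means = [0.0, 0.2, 0.4]
confident_stds = [0.05, 0.05, 0.05]  # model is sure about the trajectory
uncertain_stds = [0.05, 0.2, 0.4]    # uncertainty grows along the horizon

safe = not violates_bounds(predicted_means, confident_stds, lower=-1.0, upper=1.0)
risky = violates_bounds(predicted_means, uncertain_stds, lower=-1.0, upper=1.0)
```

Both trajectories have the same predicted means; only the second one is rejected, because its upper confidence limit (0.4 + 2 * 0.4) exceeds the bound.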
This approach allows learning fast enough to enable online learning from scratch, which opens up many possibilities for Reinforcement Learning in new applications with some more research.
Currently, real-world applications of model-free reinforcement learning algorithms are limited due to the number of interactions they require with the environment.
Despite its limitations, this method shows that, for the applications where it can be used, the same results as state-of-the-art model-free algorithms (to the extent of my knowledge) can be obtained with approximately 20 times fewer interactions with the environment.
Understanding the reasons for this increased efficiency would open the search for algorithms with the same improvement in sample efficiency but without the limitations of this method.
For example, the future predicted rewards (or cost) are predicted as a distribution. By maximizing the upper confidence bound of future rewards, future states with high reward uncertainty are encouraged, allowing for effective exploration.
Maximizing future state uncertainty could also be used to explore environments without rewards.
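The effect of a confidence bound on the choice of actions can be illustrated with a toy example. Maximizing the upper confidence bound of the reward is the same as minimizing the lower confidence bound of the cost when the cost is the negative reward. The candidate names, the numbers and the `exploration_weight` parameter are hypothetical.

```python
def lower_confidence_bound(cost_mean, cost_std, exploration_weight):
    """Optimistic estimate of the cost: uncertain outcomes look cheaper."""
    return cost_mean - exploration_weight * cost_std


def best_candidate(candidates, exploration_weight):
    """Pick the candidate with the smallest lower confidence bound of the cost."""
    return min(
        candidates,
        key=lambda name: lower_confidence_bound(*candidates[name], exploration_weight),
    )


# name -> (predicted cost mean, predicted cost standard deviation)
candidates = {
    "well_known": (1.0, 0.1),
    "uncertain": (1.2, 0.5),
}

greedy_choice = best_candidate(candidates, exploration_weight=0.0)
exploring_choice = best_candidate(candidates, exploration_weight=1.0)
```

With a weight of 0 the planner is purely greedy and picks the candidate with the lowest mean cost. With a weight of 1 the uncertain candidate wins (1.2 - 0.5 < 1.0 - 0.1), which is the exploration behaviour described above.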
If future research removes the limitations of this method, this type of data efficiency could be used for real world applications where real-time learning is required and thus open many new applications for reinforcement learning.
Compared to the implementation in the paper, the scripts have been designed to perform the control over a long time without any reset, which means:
- The optimized function in the mpc is the lower confidence bound of the expected long-term cost to reward exploration and avoid getting stuck in a local minimum.
- The environment is not reset; learning is done in one go. Thus, hyperparameter training cannot be done between trials. The learning of the hyperparameters and the storage of the visualizations are performed in a parallel process at regular time intervals in order to minimize the computation time at each control iteration.
- The optimizer for actions is LBFGS
- An option has been added to decide to include a point in the model memory depending on the prediction error at that point and the predicted uncertainty to avoid having too many points in memory. Only points with a predicted uncertainty or a prediction error greater than a threshold are stored in memory.
- An option has been added to include the time of observation as an input to the Gaussian process models. This can be useful when the environment changes over time, as the model will learn to rely more on recent points than on older points in memory.
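The rule for deciding whether a point enters the model memory can be sketched as follows. The function name, the threshold names and the values are assumptions made for the example; the actual parameters are described in PARAMETERS.md.

```python
def should_store(prediction_error, predicted_std, error_threshold, std_threshold):
    """Keep a point only if the model was wrong about it or unsure about it."""
    return abs(prediction_error) > error_threshold or predicted_std > std_threshold


memory = []
observations = [
    # (input, observed change, predicted change, predicted std)
    ((0.0, 0.0), 0.02, 0.00, 0.90),  # accurate but very uncertain: stored
    ((0.1, 0.0), 0.11, 0.10, 0.02),  # accurate and confident: skipped
    ((2.0, 1.0), 0.50, 0.10, 0.03),  # confident but wrong: stored
]
for point, observed, predicted, std in observations:
    if should_store(observed - predicted, std, error_threshold=0.05, std_threshold=0.1):
        memory.append((point, observed))
```

Points that the model already predicts accurately and confidently add little information, so skipping them keeps the memory, and therefore the cost of each prediction, small.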
- The cost function must be clearly defined as a squared distance function of the states and actions from a reference point.
- The length of the prediction horizon of the mpc will impact computation times. This can be a problem when the dimensionality of the observation space and/or action space is also high.
- The dimension of the input and output of the Gaussian process must stay low, which limits application to cases with low dimensionality of the states and actions.
- If too many points are stored in the memory of the Gaussian process, the computation time per iteration might become too high.
- The current implementation will not work for gym environments with discrete states or actions.
- No guarantee is given for the time per iteration.
- Actions must have an effect on the observation of the next observed step. Delays are not supported in the model. Observation must unequivocally describe the system states.
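Regarding the squared-distance cost mentioned in the list above: one reason such a cost is convenient is that its expectation under a Gaussian state distribution has a closed form, E[(x - t)^T W (x - t)] = (mu - t)^T W (mu - t) + trace(W Sigma). The sketch below evaluates it for a diagonal weight matrix; the diagonal weights and the numbers are assumptions for the example, not necessarily the form used in this repository.

```python
def expected_quadratic_cost(mean, covariance, target, weights):
    """E[(x - target)^T W (x - target)] for x ~ N(mean, covariance), W = diag(weights)."""
    # Cost of the mean prediction.
    distance_term = sum(w * (m - t) ** 2 for w, m, t in zip(weights, mean, target))
    # Extra cost caused by the uncertainty of the prediction: trace(W @ covariance).
    trace_term = sum(w * covariance[i][i] for i, w in enumerate(weights))
    return distance_term + trace_term


mean = [1.0, 0.0]
covariance = [[0.04, 0.0], [0.0, 0.01]]
target = [0.0, 0.0]
weights = [1.0, 2.0]

cost = expected_quadratic_cost(mean, covariance, target, weights)
```

Here the distance term is 1.0 and the trace term is 0.04 + 2 * 0.01 = 0.06, so the expected cost is 1.06: the uncertainty of the predicted state directly increases the expected cost.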
Gaussian processes: https://www.youtube.com/watch?v=92-98SYOdlY&list=PL93aLKqThq4hbACqDja5QFuFKDcNtkPXz&index=2
Presentation of PILCO by Marc Deisenroth: https://www.youtube.com/watch?v=AVdx2hbcsfI (method that uses the same gaussian process model, but without an MPC controller)
Safe Bayesian optimization: https://www.youtube.com/watch?v=sMfPRLtrob4
Original paper: https://deepai.org/publication/data-efficient-reinforcement-learning-with-probabilistic-model-predictive-control
PILCO paper that describes the moment matching approximation used for states uncertainty propagation: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6654139
Marc Deisenroth thesis: https://deisenroth.cc/pdf/thesis.pdf
http://www.gaussianprocess.org/gpml/
https://github.com/nrontsis/PILCO
You can contact me on Linkedin: https://www.linkedin.com/in/simon-rennotte-96aa04169/ or by email: [email protected]