This project is the first one in Udacity's Deep Reinforcement Learning Nanodegree. In this project, I have trained an agent to navigate and collect bananas in a large square world.
A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. Thus, the goal of your agent is to collect as many yellow bananas as possible while avoiding blue bananas.
The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction. Given this information, the agent has to learn how to best select actions. Four discrete actions are available, corresponding to:
- 0 - move forward.
- 1 - move backward.
- 2 - turn left.
- 3 - turn right.
The task is episodic, and in order to solve the environment, your agent must get an average score of +13 over 100 consecutive episodes.
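As a quick orientation, here is a minimal interaction loop that steps the environment with random actions. It assumes the unityagents package bundled with the project; the file name "Banana.app" is a placeholder that depends on your operating system.

```python
# Minimal random-action loop for the Banana environment (sketch).
# Assumes the unityagents package provided with the Udacity project;
# adjust file_name to the build that matches your OS.
import numpy as np
from unityagents import UnityEnvironment

env = UnityEnvironment(file_name="Banana.app")   # placeholder path
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=False)[brain_name]
state = env_info.vector_observations[0]          # 37-dimensional state
score = 0

while True:
    action = np.random.randint(brain.vector_action_space_size)  # 4 discrete actions
    env_info = env.step(action)[brain_name]
    state = env_info.vector_observations[0]
    reward = env_info.rewards[0]                 # +1 yellow banana, -1 blue banana
    done = env_info.local_done[0]
    score += reward
    if done:
        break

print("Score:", score)
env.close()
```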
The things we did in Navigation.ipynb are:
- Initialize the agent
- Evaluate the state and action space
- Learn from it using a Deep Q-Network (DQN); the model is a simple 3-layer neural network (see the sketch after this list)
- Iterate until the agent reaches a threshold score of 15.0
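As a reference, the sketch below shows what such a 3-layer Q-network can look like in PyTorch. The hidden-layer width of 64 is an assumption for illustration, not necessarily what the notebook uses.

```python
# Sketch of a simple 3-layer Q-network in PyTorch.
# State size (37) and action size (4) come from the environment above;
# the hidden width (64) is an assumed placeholder.
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)   # one Q-value per action
```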
I am planning to try one or more of the following improvements in the upcoming days:
One issue with Deep Q-Networks is that they can overestimate Q-values (see Thrun & Schwartz, 1993). The accuracy of the Q-values depends on which actions have been tried and which states have been explored. If the agent hasn't gathered enough experience, the Q-function will end up selecting the maximum value from a noisy set of reward estimates. Early in the learning process, this can cause the algorithm to propagate incidentally high rewards that were obtained by chance (exploding Q-values), and it can also result in fluctuating Q-values later in the process.
We can address this issue using Double Q-Learning, where one set of parameters w is used to select the best action, and another set of parameters w' is used to evaluate that action.
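A minimal sketch of the Double DQN target under these assumptions is shown below; the function name and argument layout are illustrative, with qnetwork_local playing the role of w and qnetwork_target the role of w'.

```python
# Sketch of the Double DQN target computation (PyTorch).
# rewards, next_states, dones are assumed to be batched tensors,
# with dones encoded as 0/1 floats.
import torch

def double_dqn_targets(qnetwork_local, qnetwork_target,
                       rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Select the greedy next action with the online network (parameters w)...
        best_actions = qnetwork_local(next_states).argmax(dim=1, keepdim=True)
        # ...but evaluate that action with the target network (parameters w').
        q_next = qnetwork_target(next_states).gather(1, best_actions)
    # Terminal transitions contribute only the immediate reward.
    return rewards + gamma * q_next * (1 - dones)
```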
Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. To replay important transitions more frequently, and therefore learn more efficiently, we use Prioritized Experience Replay.
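Below is a minimal sketch of proportional prioritized sampling. The flat priority array, the hyperparameters alpha and beta, and the helper name are assumptions for illustration; an efficient implementation would store priorities in a sum tree.

```python
# Sketch of proportional prioritized sampling with importance-sampling weights.
# priorities are assumed to hold |TD error| + a small epsilon for each stored transition.
import numpy as np

def sample_prioritized(buffer, priorities, batch_size, alpha=0.6, beta=0.4):
    # Sampling probability is proportional to priority^alpha.
    probs = priorities ** alpha
    probs /= probs.sum()
    indices = np.random.choice(len(buffer), batch_size, p=probs)
    # Importance-sampling weights correct the bias introduced by non-uniform sampling.
    weights = (len(buffer) * probs[indices]) ** (-beta)
    weights /= weights.max()
    samples = [buffer[i] for i in indices]
    return samples, indices, weights
```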
Dueling networks utilize two streams: one that estimates the state value function V(s), and another that estimates the advantage for each action A(s,a). These two values are then combined to obtain the desired Q-values.
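The sketch below shows one way to wire up a dueling head, reusing the same assumed sizes as the earlier network; subtracting the mean advantage is the usual way to keep the value and advantage streams identifiable.

```python
# Sketch of a dueling Q-network head (PyTorch); hidden width is an assumed placeholder.
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNetwork(nn.Module):
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.feature = nn.Linear(state_size, hidden)
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, action_size)   # A(s, a)

    def forward(self, state):
        x = F.relu(self.feature(state))
        v = self.value(x)
        a = self.advantage(x)
        # Combine the streams: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
        return v + a - a.mean(dim=1, keepdim=True)
```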
- Download the environment from one of the links below. You need only select the environment that matches your operating system:
    - Linux: click here
    - Mac OSX: click here
    - Windows (32-bit): click here
    - Windows (64-bit): click here

    (For Windows users) Check out this link if you need help with determining whether your computer is running a 32-bit or 64-bit version of the Windows operating system.

    (For AWS) If you'd like to train the agent on AWS (and have not enabled a virtual screen), then please use this link to obtain the environment.

- Place the file in the working folder and unzip it.
- Start working with Navigation.ipynb.