Project 2: Continuous Control
Train an agent to hold a set of balls
Background
Deep Reinforcement Learning is a trending algorithmic approach for solving complex learning tasks. The purpose of this project is to train an agent to solve real-world "Robotic Control Systems" problems like the one shown below, but in an artificial (simulated) environment.
The Environment
This project is based on the Reacher environment.
In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.
The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.
Udacity created two environment options for this project, one with a single arm and another with twenty arms. After a fair number of attempts at mastering the first environment, and a lot of hyperparameter optimization, I decided to implement the second environment. The barrier for solving the second version of the environment is slightly different, to take into account the presence of many agents. In particular, your agents must get an average score of +30 (over 100 consecutive episodes, and over all agents).
Specifically,
- After each episode, we add up the rewards that each agent received (without discounting) to get a score for each agent.
- This yields 20 (potentially different) scores. We then take the average of these 20 scores. This yields an average score for each episode (where the average is over all 20 agents), as sketched below.
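As a rough illustration of this scoring rule (the array names below are made up for the example and are not taken from the project code):

```python
import numpy as np

# Hypothetical example: episode_rewards[t][i] is the (undiscounted) reward
# agent i received at time step t of one episode, for 20 agents.
episode_rewards = np.random.rand(1000, 20) * 0.1   # placeholder data

# Sum rewards over time steps -> one score per agent (20 scores).
agent_scores = episode_rewards.sum(axis=0)

# Average over the 20 agents -> the score for this episode.
episode_score = agent_scores.mean()

# The environment counts as solved once the mean of the last
# 100 such episode scores reaches +30.
print(episode_score)
```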
Methodology
The experimental setup is as follows:
- Set up the environment.
- Establish a baseline with a random walk.
- Implement a reinforcement learning algorithm.
- Display Results.
- Conclusion of Results.
- Ideas for future work.
1. Environment Setup
Download the environment from one of the links below. You need only select the environment that matches your operating system:
- Linux: click here
- Mac OSX: click here
- Windows (32-bit): click here
- Windows (64-bit): click here
Import the Unity environment and create an env object:
```python
from unityagents import UnityEnvironment

env = UnityEnvironment(file_name='location of reacher.exe')
```
Info about the environment is printed out through the Info() class, found here, as seen below:
```
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
                goal_speed -> 1.0
                goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , ,
created Info
Number of agents: 1
Number of actions: 4
States look like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
States have length: 33
```
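For reference, the pieces of this printout that the rest of the code relies on (the brain name, number of agents, state size, and action size) can be pulled out of the environment roughly like this; the variable names are my own:

```python
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment and inspect it
env_info = env.reset(train_mode=True)[brain_name]

num_agents = len(env_info.agents)             # 20 in the multi-arm version
states = env_info.vector_observations         # shape: (num_agents, 33)
state_size = states.shape[1]                  # 33
action_size = brain.vector_action_space_size  # 4

print(num_agents, state_size, action_size)
```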
2. Establish a baseline
To evaluate the difficulty of the environment, a random walk was scored before any reinforcement learning agent was implemented. This was done by making an agent that selects random actions at each time step.
```python
import numpy as np

env_info = env.reset(train_mode=False)[brain_name]       # reset the environment
states = env_info.vector_observations                    # get the current state (for each agent)
scores = np.zeros(num_agents)                            # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size)   # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                    # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]             # send all actions to the environment
    next_states = env_info.vector_observations           # get next state (for each agent)
    rewards = env_info.rewards                           # get reward (for each agent)
    dones = env_info.local_done                          # see if episode finished
    scores += env_info.rewards                           # update the score (for each agent)
    states = next_states                                 # roll over states to next time step
    if np.any(dones):                                    # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
```
This resulted in a score of 0.03, which is quite poor when you keep in mind that the implemented agent needs to average a score of 30, consistently, over 100 episodes.
3. Implemented Algorithm
Because the environment is a continuous control problem, the reinforcement learning algorithm needs to be able to work in a continuous action space. This hard requirement means we have to use a deep learning approach where neural networks are used for continuous function approximation. When choosing between policy-based and value-based methods, policy-based methods are better suited for continuous action spaces. Udacity suggests using PPO, A3C, or D4PG. I chose to implement the Deep Deterministic Policy Gradient (DDPG), which is described as an extension of Deep Q-learning to continuous tasks.
I based my code off of this repository, an implementation of DDPG with OpenAI Gym's Pendulum environment.
I copied the Actor and Critic models, as found here, but I adapted the number of hidden units to 256 and added another layer of batch normalization. I copied the agent code, found here, then changed it to accommodate 20 environments.
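For context, here is a minimal sketch of what an Actor adapted in this way could look like (256 hidden units per layer, batch normalization after the first layer). It is an illustration in the spirit of the ddpg-pendulum models, not a copy of the exact code in this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a 33-dimensional state to a 4-dimensional action in [-1, 1]."""

    def __init__(self, state_size=33, action_size=4, hidden_units=256):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_units)
        self.bn1 = nn.BatchNorm1d(hidden_units)        # batch normalization layer
        self.fc2 = nn.Linear(hidden_units, hidden_units)
        self.fc3 = nn.Linear(hidden_units, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))                 # keep actions in [-1, 1]
```

The Critic follows the same pattern but concatenates the action into a hidden layer and outputs a single Q-value.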
The Agent() code can be found here. The resulting DDPG training loop can be seen below:
```python
import time
from collections import deque

import numpy as np
import torch


def train(env, agent, n_episodes=500, max_t=int(1000), train_mode=True,
          solved_score=30.0, consec_episodes=100, print_every=1,
          actor_path='actor_ckpt.pth', critic_path='critic_ckpt.pth'):
    """Deep Deterministic Policy Gradient (DDPG)

    Params
    ======
        n_episodes (int)      : maximum number of training episodes
        max_t (int)           : maximum number of timesteps per episode
        train_mode (bool)     : if 'True' set environment to training mode
        solved_score (float)  : min avg score over consecutive episodes
        consec_episodes (int) : number of consecutive episodes used to calculate score
        print_every (int)     : interval to display results
        actor_path (str)      : directory to store actor network weights
        critic_path (str)     : directory to store critic network weights
    """
    # get env information
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]

    # set the best score to negative infinity
    best_score = -np.inf
    # list containing scores from each episode
    mean_scores = []
    # last 100 scores
    mean_score_window = deque(maxlen=100)
    # average of last 100 scores
    mean_scores_window_avg = []

    # for each episode
    for i_episode in range(1, n_episodes + 1):
        # reset environment
        env_info = env.reset(train_mode=True)[brain_name]
        # get current state for each agent
        states = env_info.vector_observations
        num_agents = len(env_info.agents)
        agent.reset()
        # initialize score for each agent
        scores = np.zeros(num_agents)
        start_time = time.time()

        # for each iteration
        for t in range(max_t):
            # select an action
            actions = agent.act(states)
            # send actions to environment
            env_info = env.step(actions)[brain_name]
            # get next state
            next_states = env_info.vector_observations
            # get reward
            rewards = env_info.rewards
            # check if episode has finished
            dones = env_info.local_done
            # save experience to replay buffer
            for state, action, reward, next_state, done in \
                    zip(states, actions, rewards, next_states, dones):
                # step through the experiences given by each environment
                agent.step(state, action, reward, next_state, done, t)
            # set the next states
            states = next_states
            # save the agents' performance
            scores += rewards
            # break if any environment is completed
            if np.any(dones):
                break

        # record the mean score over all agents for this episode,
        # plus the running average over the last 100 episodes
        mean_scores.append(np.mean(scores))
        mean_score_window.append(mean_scores[-1])
        mean_scores_window_avg.append(np.mean(mean_score_window))

        # report progress
        duration = time.time() - start_time
        if i_episode % print_every == 0:
            print('Episode {} Duration: {:.1f} Mean Episode Scores: {:.2f} '
                  'Mean Consecutive Scores: {:.2f}'.format(
                      i_episode, duration, mean_scores[-1], mean_scores_window_avg[-1]))

        # if the average score of the last 100 episodes, over all environments,
        # is greater than the goal, and at least 100 episodes have been played
        if mean_scores_window_avg[-1] >= solved_score and i_episode >= consec_episodes:
            if train_mode:
                torch.save(agent.actor_local.state_dict(), actor_path)
                torch.save(agent.critic_local.state_dict(), critic_path)
            break

    return mean_scores, mean_scores_window_avg
```
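The learning step inside Agent() is not reproduced above. As a condensed sketch of the core DDPG update, written as a standalone function over an agent that exposes the actor_local/actor_target/critic_local/critic_target networks and their optimizers (the naming used by the reference repository); the gamma and tau values here are assumptions, not the project's exact hyperparameters:

```python
import torch.nn.functional as F

def ddpg_learn(agent, experiences, gamma=0.99, tau=1e-3):
    """One DDPG update from a sampled batch of (torch tensor) experiences."""
    states, actions, rewards, next_states, dones = experiences

    # update the critic: minimise the TD error against the target networks
    next_actions = agent.actor_target(next_states)
    q_targets = rewards + gamma * agent.critic_target(next_states, next_actions) * (1 - dones)
    critic_loss = F.mse_loss(agent.critic_local(states, actions), q_targets)
    agent.critic_optimizer.zero_grad()
    critic_loss.backward()
    agent.critic_optimizer.step()

    # update the actor: ascend the critic's estimate of the chosen action's value
    actor_loss = -agent.critic_local(states, agent.actor_local(states)).mean()
    agent.actor_optimizer.zero_grad()
    actor_loss.backward()
    agent.actor_optimizer.step()

    # soft-update the target networks towards the local networks
    for target, local in ((agent.actor_target, agent.actor_local),
                          (agent.critic_target, agent.critic_local)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```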
4. Results
My algorithm was able to solve the environment in 23 episodes, with an average score of 31.8 over the first 100 episodes. Check the graph below to see how it trained.
```
Episode 90  Duration: 290.0  Mean Episode Scores: 38.52  Mean Consecutive Scores: 31.10
Episode 91  Duration: 290.0  Mean Episode Scores: 38.89  Mean Consecutive Scores: 31.19
Episode 92  Duration: 289.0  Mean Episode Scores: 39.22  Mean Consecutive Scores: 31.27
Episode 93  Duration: 289.0  Mean Episode Scores: 38.32  Mean Consecutive Scores: 31.35
Episode 94  Duration: 291.0  Mean Episode Scores: 35.74  Mean Consecutive Scores: 31.39
Episode 95  Duration: 287.0  Mean Episode Scores: 38.17  Mean Consecutive Scores: 31.47
Episode 96  Duration: 292.0  Mean Episode Scores: 39.21  Mean Consecutive Scores: 31.55
Episode 97  Duration: 294.0  Mean Episode Scores: 38.78  Mean Consecutive Scores: 31.62
Episode 98  Duration: 296.0  Mean Episode Scores: 38.03  Mean Consecutive Scores: 31.69
Episode 99  Duration: 301.0  Mean Episode Scores: 39.09  Mean Consecutive Scores: 31.76
Episode 100 Duration: 301.0  Mean Episode Scores: 38.99  Mean Consecutive Scores: 31.83
```
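A training curve like the one referenced above can be produced from the mean_scores and mean_scores_window_avg lists returned by train(). This plotting snippet is my own sketch, not the project's code:

```python
import matplotlib.pyplot as plt

def plot_scores(mean_scores, mean_scores_window_avg):
    """Plot the per-episode mean score and its 100-episode moving average."""
    plt.figure(figsize=(8, 4))
    plt.plot(mean_scores, label='mean score per episode (over 20 agents)')
    plt.plot(mean_scores_window_avg, label='average of last 100 episodes')
    plt.axhline(30.0, linestyle='--', color='grey', label='solved threshold (+30)')
    plt.xlabel('Episode')
    plt.ylabel('Score')
    plt.legend()
    plt.show()
```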
The final agent at work:
5. Ideas for Future Work
- Hyperparameter optimization - Most algorithms can be tweaked to perform better on a specific environment by changing the various hyperparameters. This could be investigated to find a more effective agent.
- Prioritized Experience Replay - Prioritized experience replay selects experiences based on a priority value that is correlated with the magnitude of the TD error. This replaces the random selection of experiences with a more intelligent approach, as described in this paper and sketched after this list.
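As a rough sketch of the idea (not part of this project's code), the proportional variant of prioritized replay samples transition i with probability proportional to its TD-error-based priority:

```python
import numpy as np

def sample_indices(td_errors, batch_size, alpha=0.6, eps=1e-5):
    """Sample replay-buffer indices in proportion to priority p_i = (|TD error| + eps)^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```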
Get Started
- Install Anaconda from here.
- Create a new environment from the environment file in this repository with the command `conda env create -f environment.yml`.
- Run `python main.py`. Remove the comments in main.py to train and run the baseline.
- Watch the agent hold its balls.
Source: https://github.com/MrDaubinet/Continuous-Control