
Commit dd5bf4a

[Draft] PettingZoo Support (LucasAlegre#45)
PettingZoo env and parallel_env support!

Co-authored-by: Lucas Alegre <[email protected]>
1 parent f0b387f commit dd5bf4a

16 files changed: +409 −101 lines

.github/workflows/linux-test.yml

+32
@@ -0,0 +1,32 @@
+name: Python tests
+
+on:
+  push:
+    branches: [ master ]
+  pull_request:
+    branches: [ master ]
+
+jobs:
+  linux-test:
+    runs-on: ubuntu-20.04
+    strategy:
+      matrix:
+        python-version: ['3.6', '3.7', '3.8', '3.9']
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        sudo add-apt-repository ppa:sumo/stable
+        sudo apt-get update
+        sudo apt-get install sumo sumo-tools sumo-doc
+        pip install pytest
+        pip install -e .[all]
+    - name: Full Python tests
+      run: |
+        export SUMO_HOME="/usr/share/sumo"
+        export LIBSUMO_AS_TRACI=1
+        pytest ./tests/pz_test.py

README.md

+18 −3
@@ -8,8 +8,9 @@
 
 SUMO-RL provides a simple interface to instantiate Reinforcement Learning environments with [SUMO](https://github.com/eclipse/sumo) for Traffic Signal Control.
 
-The main class [SumoEnvironment](https://github.com/LucasAlegre/sumo-rl/blob/master/environment/env.py) inherits [MultiAgentEnv](https://github.com/ray-project/ray/blob/master/python/ray/rllib/env/multi_agent_env.py) from [RLlib](https://github.com/ray-project/ray/tree/master/python/ray/rllib).
+The main class [SumoEnvironment](https://github.com/LucasAlegre/sumo-rl/blob/master/sumo_rl/environment/env.py) behaves like a [MultiAgentEnv](https://github.com/ray-project/ray/blob/master/python/ray/rllib/env/multi_agent_env.py) from [RLlib](https://github.com/ray-project/ray/tree/master/python/ray/rllib).
 If instantiated with parameter 'single-agent=True', it behaves like a regular [Gym Env](https://github.com/openai/gym/blob/master/gym/core.py) from [OpenAI](https://github.com/openai).
+Call [env](https://github.com/LucasAlegre/sumo-rl/blob/master/sumo_rl/environment/env.py) or [parallel_env](https://github.com/LucasAlegre/sumo-rl/blob/master/sumo_rl/environment/env.py) to instantiate a [PettingZoo](https://github.com/PettingZoo-Team/PettingZoo) environment.
 [TrafficSignal](https://github.com/LucasAlegre/sumo-rl/blob/master/sumo_rl/environment/traffic_signal.py) is responsible for retrieving information and actuating on traffic lights using the [TraCI](https://sumo.dlr.de/wiki/TraCI) API.
 
 Goals of this repository:
@@ -57,9 +58,10 @@ pip install -e .
 ### Observation
 The default observation for each traffic signal agent is a vector:
 ```
-obs = [phase_one_hot, lane_1_density,...,lane_n_density, lane_1_queue,...,lane_n_queue]
+obs = [phase_one_hot, min_green_elapsed, lane_1_density,...,lane_n_density, lane_1_queue,...,lane_n_queue]
 ```
 - ```phase_one_hot``` is a one-hot encoded vector indicating the current active green phase
+- ```min_green_elapsed``` is a binary variable indicating whether min_green seconds have already passed in the current phase
 - ```lane_i_density``` is the number of vehicles in incoming lane i divided by the total capacity of the lane
 - ```lane_i_queue``` is the number of queued (speed below 0.1 m/s) vehicles in incoming lane i divided by the total capacity of the lane
 
@@ -73,7 +75,7 @@ E.g.: In the [2-way single intersection](https://github.com/DLR-RM/stable-baseli
 
 <img src="outputs/actions.png" align="center" width="75%"/>
 
-Obs: Every time a phase change occurs, the next phase is preceded by a yellow phase lasting ```yellow_time``` seconds.
+Important: every time a phase change occurs, the next phase is preceded by a yellow phase lasting ```yellow_time``` seconds.
 
 ### Rewards
 The default reward function is the change in cumulative vehicle delay:
@@ -86,6 +88,19 @@ You can define your own reward function changing the method 'compute_reward' of
 
 ## Examples
 
+### PettingZoo API
+```python
+env = sumo_rl.env(net_file='sumo_net_file.net.xml',
+                  route_file='sumo_route_file.rou.xml',
+                  use_gui=True,
+                  num_seconds=3600)
+env.reset()
+for agent in env.agent_iter():
+    observation, reward, done, info = env.last()
+    action = policy(observation) if not done else None
+    env.step(action)
+```
+
 Check [experiments](https://github.com/LucasAlegre/sumo-rl/tree/master/experiments) to see how to instantiate a SumoEnvironment and use it with your RL algorithm.
 
 ### [Q-learning](https://github.com/LucasAlegre/sumo-rl/blob/master/agents/ql_agent.py) in a one-way single intersection:
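For completeness, here is a minimal sketch of the `parallel_env` counterpart that `experiments/sb3.py` below builds on. File names are placeholders, random actions stand in for a trained policy, and the per-agent `action_spaces` dict is assumed to be exposed the same way `experiments/ql_4x4grid_pz.py` uses it:

```python
import sumo_rl

# Parallel API sketch: every traffic signal acts simultaneously at each step.
env = sumo_rl.parallel_env(net_file='sumo_net_file.net.xml',
                           route_file='sumo_route_file.rou.xml',
                           use_gui=False,
                           num_seconds=3600)
observations = env.reset()
done = False
while not done:
    # Random actions as a stand-in for a learned policy.
    actions = {agent: env.action_spaces[agent].sample() for agent in env.agents}
    observations, rewards, dones, infos = env.step(actions)
    done = all(dones.values())
env.close()
```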

experiments/a2c_2way-single-intersection.py

+1 −3
@@ -18,16 +18,14 @@
 
 write_route_file('nets/2way-single-intersection/single-intersection-gen.rou.xml', 400000, 100000)
 
-# multiprocess environment
-n_cpu = 1
 env = SubprocVecEnv([lambda: SumoEnvironment(net_file='nets/2way-single-intersection/single-intersection.net.xml',
                                              route_file='nets/2way-single-intersection/single-intersection-gen.rou.xml',
                                              out_csv_name='outputs/2way-single-intersection/a2c',
                                              single_agent=True,
                                              use_gui=False,
                                              num_seconds=100000,
                                              min_green=5,
-                                             max_depart_delay=0) for _ in range(n_cpu)])
+                                             max_depart_delay=0)])
 
 model = A2C(MlpPolicy, env, verbose=1, learning_rate=0.001, lr_schedule='constant')
 model.learn(total_timesteps=100000)

experiments/a3c_4x4grid.py

+5 −5
@@ -1,4 +1,3 @@
-import argparse
 import os
 import sys
 if 'SUMO_HOME' in os.environ:
@@ -10,27 +9,28 @@
 import ray
 from ray.rllib.agents.a3c.a3c import A3CTrainer
 from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy
+from ray.rllib.env import PettingZooEnv
 from ray.tune.registry import register_env
 from gym import spaces
 import numpy as np
-from sumo_rl import SumoEnvironment
+import sumo_rl
 import traci
 
 
 if __name__ == '__main__':
     ray.init()
 
-    register_env("4x4grid", lambda _: SumoEnvironment(net_file='nets/4x4-Lucas/4x4.net.xml',
+    register_env("4x4grid", lambda _: PettingZooEnv(sumo_rl.env(net_file='nets/4x4-Lucas/4x4.net.xml',
                                                       route_file='nets/4x4-Lucas/4x4c1c2c1c2.rou.xml',
                                                       out_csv_name='outputs/4x4grid/a3c',
                                                       use_gui=False,
                                                       num_seconds=80000,
-                                                      max_depart_delay=0))
+                                                      max_depart_delay=0)))
 
     trainer = A3CTrainer(env="4x4grid", config={
         "multiagent": {
             "policies": {
-                '0': (A3CTFPolicy, spaces.Box(low=np.zeros(10), high=np.ones(10)), spaces.Discrete(2), {})
+                '0': (A3CTFPolicy, spaces.Box(low=np.zeros(11), high=np.ones(11)), spaces.Discrete(2), {})
             },
             "policy_mapping_fn": (lambda id: '0')  # Traffic lights are always controlled by this policy
         },
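Note that the policy's observation space grows here from a 10- to an 11-dimensional Box, which lines up with the new `min_green_elapsed` entry added to the observation vector in the README. A hedged alternative to hard-coding the size is to read the spaces off the PettingZoo env itself; this sketch assumes the per-agent `observation_spaces`/`action_spaces` dicts used in `experiments/ql_4x4grid_pz.py`:

```python
# Sketch: derive the policy spaces from a reset env instead of hard-coding Box(0, 1, (11,)).
pz_env = sumo_rl.env(net_file='nets/4x4-Lucas/4x4.net.xml',
                     route_file='nets/4x4-Lucas/4x4c1c2c1c2.rou.xml',
                     num_seconds=80000)
pz_env.reset()
some_ts = pz_env.agents[0]  # all traffic signals share the same spaces here
policies = {'0': (A3CTFPolicy,
                  pz_env.observation_spaces[some_ts],
                  pz_env.action_spaces[some_ts],
                  {})}
pz_env.close()
```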

experiments/ql_4x4grid.py

+5 −3
@@ -24,8 +24,10 @@
 
     env = SumoEnvironment(net_file='nets/4x4-Lucas/4x4.net.xml',
                           route_file='nets/4x4-Lucas/4x4c1c2c1c2.rou.xml',
-                          use_gui=True,
+                          use_gui=False,
                           num_seconds=80000,
+                          min_green=8,
+                          delta_time=5,
                           max_depart_delay=0)
 
     for run in range(1, runs+1):
@@ -42,11 +44,11 @@
             actions = {ts: ql_agents[ts].act() for ts in ql_agents.keys()}
 
             s, r, done, info = env.step(action=actions)
 
             for agent_id in s.keys():
                 ql_agents[agent_id].learn(next_state=env.encode(s[agent_id], agent_id), reward=r[agent_id])
 
-        env.save_csv('outputs/4x4/ql_test', run)
+        env.save_csv('outputs/4x4/ql-test!', run)
         env.close()
 
 
experiments/ql_4x4grid_pz.py

+52
@@ -0,0 +1,52 @@
+import argparse
+import os
+import sys
+import pandas as pd
+
+if 'SUMO_HOME' in os.environ:
+    tools = os.path.join(os.environ['SUMO_HOME'], 'tools')
+    sys.path.append(tools)
+else:
+    sys.exit("Please declare the environment variable 'SUMO_HOME'")
+
+import traci
+import sumo_rl
+from sumo_rl.agents import QLAgent
+from sumo_rl.exploration import EpsilonGreedy
+
+
+if __name__ == '__main__':
+
+    alpha = 0.1
+    gamma = 0.99
+    decay = 1
+    runs = 1
+
+    env = sumo_rl.env(net_file='nets/4x4-Lucas/4x4.net.xml',
+                      route_file='nets/4x4-Lucas/4x4c1c2c1c2.rou.xml',
+                      use_gui=False,
+                      min_green=8,
+                      delta_time=5,
+                      num_seconds=80000,
+                      max_depart_delay=0)
+
+    for run in range(1, runs+1):
+        env.reset()
+        initial_states = {ts: env.observe(ts) for ts in env.agents}
+        ql_agents = {ts: QLAgent(starting_state=env.unwrapped.env.encode(initial_states[ts], ts),
+                                 state_space=env.observation_spaces[ts],
+                                 action_space=env.action_spaces[ts],
+                                 alpha=alpha,
+                                 gamma=gamma,
+                                 exploration_strategy=EpsilonGreedy(initial_epsilon=0.05, min_epsilon=0.005, decay=decay)) for ts in env.agents}
+        infos = []
+        for agent in env.agent_iter():
+            s, r, done, info = env.last()
+            if ql_agents[agent].action is not None:
+                ql_agents[agent].learn(next_state=env.unwrapped.env.encode(s, agent), reward=r)
+
+            action = ql_agents[agent].act() if not done else None
+            env.step(action)
+
+        env.unwrapped.env.save_csv('outputs/4x4/pz_ql', run)
+        env.close()
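Two details in this new script are easy to miss: the `ql_agents[agent].action is not None` check skips the learning update on each agent's first turn (there is no previous action to credit yet), and stepping with `None` is what PettingZoo expects for an agent whose episode is done. Also, `env.unwrapped.env` reaches through the PettingZoo wrappers to the underlying `SumoEnvironment`, where `encode()` and `save_csv()` live; a small readability sketch, assuming the same wrapper layout, is to alias it once:

```python
# Sketch: alias the wrapped SumoEnvironment once instead of repeating env.unwrapped.env.
sumo_env = env.unwrapped.env
starting_states = {ts: sumo_env.encode(initial_states[ts], ts) for ts in env.agents}
# ... and later in the loop:
# ql_agents[agent].learn(next_state=sumo_env.encode(s, agent), reward=r)
# sumo_env.save_csv('outputs/4x4/pz_ql', run)
```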

experiments/sarsa_2way-single-intersection.py

+1 −2
@@ -41,8 +41,7 @@
                           use_gui=args.gui,
                           num_seconds=args.seconds,
                           min_green=args.min_green,
-                          max_green=args.max_green,
-                          max_depart_delay=0)
+                          max_green=args.max_green)
 
     for run in range(1, args.runs+1):
         obs = env.reset()

experiments/sarsa_double.py

+1 −3
@@ -25,9 +25,7 @@ def run(use_gui=True, runs=1):
                           num_seconds=86400,
                           yellow_time=3,
                           min_green=5,
-                          max_green=60,
-                          max_depart_delay=300,
-                          time_to_load_vehicles=0)
+                          max_green=60)
 
     fixed_tl = False
     agents = {ts_id: TrueOnlineSarsaLambda(env.observation_spaces(ts_id), env.action_spaces(ts_id), alpha=0.000000001, gamma=0.95, epsilon=0.05, lamb=0.1, fourier_order=7)

experiments/sb3.py

+80
@@ -0,0 +1,80 @@
+from stable_baselines3 import PPO
+import sumo_rl
+import supersuit as ss
+from stable_baselines3.common.vec_env import VecMonitor
+from stable_baselines3.common.evaluation import evaluate_policy
+from stable_baselines3.common.callbacks import EvalCallback
+import numpy as np
+from array2gif import write_gif
+
+n_evaluations = 20
+n_agents = 2
+n_envs = 1
+n_timesteps = 8000000
+
+env = sumo_rl.parallel_env(net_file='nets/4x4-Lucas/4x4.net.xml',
+                           route_file='nets/4x4-Lucas/4x4c1c2c1c2.rou.xml',
+                           out_csv_name='outputs/4x4grid/test',
+                           use_gui=False,
+                           num_seconds=80000)
+
+env = ss.frame_stack_v1(env, 3)
+env = ss.pettingzoo_env_to_vec_env_v0(env)
+env = ss.concat_vec_envs_v0(env, n_envs, num_cpus=1, base_class='stable_baselines3')
+env = VecMonitor(env)
+
+""" eval_env = sumo_rl.parallel_env(net_file='nets/4x4-Lucas/4x4.net.xml',
+                                    route_file='nets/4x4-Lucas/4x4c1c2c1c2.rou.xml',
+                                    out_csv_name='outputs/4x4grid/test',
+                                    use_gui=False,
+                                    num_seconds=80000)
+
+eval_env = ss.frame_stack_v1(eval_env, 3)
+eval_env = ss.pettingzoo_env_to_vec_env_v0(eval_env)
+eval_env = ss.concat_vec_envs_v0(eval_env, 1, num_cpus=1, base_class='stable_baselines3')
+eval_env = VecMonitor(eval_env) """
+
+eval_freq = int(n_timesteps / n_evaluations)
+eval_freq = max(eval_freq // (n_envs*n_agents), 1)
+
+model = PPO("MlpPolicy", env, verbose=3, gamma=0.95, n_steps=256, ent_coef=0.0905168, learning_rate=0.00062211, vf_coef=0.042202, max_grad_norm=0.9, gae_lambda=0.99, n_epochs=5, clip_range=0.3, batch_size=256)
+#eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/', log_path='./logs/', eval_freq=eval_freq, deterministic=True, render=False)
+model.learn(total_timesteps=n_timesteps) #callback=eval_callback)
+
+model = PPO.load("./logs/best_model")
+
+mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
+
+print(mean_reward)
+print(std_reward)
+
+""" render_env = sumo_rl.env(net_file='nets/4x4-Lucas/4x4.net.xml',
+                            route_file='nets/4x4-Lucas/4x4c1c2c1c2.rou.xml',
+                            out_csv_name='outputs/4x4grid/test',
+                            use_gui=False,
+                            num_seconds=80000)
+
+render_env = render_env.parallel_env()
+render_env = ss.color_reduction_v0(render_env, mode='B')
+render_env = ss.resize_v0(render_env, x_size=84, y_size=84)
+render_env = ss.frame_stack_v1(render_env, 3)
+
+obs_list = []
+i = 0
+render_env.reset()
+
+
+while True:
+    for agent in render_env.agent_iter():
+        observation, _, done, _ = render_env.last()
+        action = model.predict(observation, deterministic=True)[0] if not done else None
+
+        render_env.step(action)
+        i += 1
+        if i % (len(render_env.possible_agents)) == 0:
+            obs_list.append(np.transpose(render_env.render(mode='rgb_array'), axes=(1, 0, 2)))
+    render_env.close()
+    break
+
+print('Writing gif')
+write_gif(obs_list, 'kaz.gif', fps=15) """
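One caveat in `experiments/sb3.py` as committed: `PPO.load("./logs/best_model")` only succeeds if the commented-out `EvalCallback` has previously saved a best model to `./logs/`. A minimal sketch for evaluating the freshly trained weights instead, with a hypothetical save path, could be:

```python
# Sketch: evaluate the model that was just trained, rather than reloading a
# checkpoint that only exists when the EvalCallback above is enabled.
model.save("./logs/ppo_4x4grid")  # hypothetical path; keeps a copy on disk
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(mean_reward, std_reward)
```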

setup.py

+8 −5
@@ -1,21 +1,24 @@
 from setuptools import setup, find_packages
 
-REQUIRED = ['gym', 'numpy', 'pandas', 'ray[rllib]']
+REQUIRED = ['gym', 'numpy', 'pandas', 'pillow']
 
-with open("README.md", "r") as fh:
-    long_description = fh.read()
+extras = {
+    "pettingzoo": ["pettingzoo"],
+}
+extras["all"] = extras["pettingzoo"]
 
 setup(
     name='sumo-rl',
     version='1.0',
-    packages=['sumo_rl',],
+    packages=['sumo_rl'],
     install_requires=REQUIRED,
+    extras_require=extras,
     author='LucasAlegre',
     author_email='[email protected]',
     url='https://github.com/LucasAlegre/sumo-rl',
     download_url='https://github.com/LucasAlegre/sumo-rl/archive/v1.0.tar.gz',
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
     license="MIT",
-    description='Environments inheriting OpenAI Gym Env and RL algorithms for Traffic Signal Control on SUMO.'
+    description='RL environments and learning code for traffic signal control in SUMO.'
 )
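With these packaging changes, the base install no longer pulls in `ray[rllib]`; PettingZoo becomes an optional extra, installed via `pip install -e .[pettingzoo]` or `pip install -e .[all]` (the latter is what the new CI workflow above uses).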

sumo_rl/.DS_Store

6 KB
Binary file not shown.

sumo_rl/__init__.py

+2 −1
@@ -1 +1,2 @@
-from sumo_rl.environment.env import SumoEnvironment
+from sumo_rl.environment.env import SumoEnvironment
+from sumo_rl.environment.env import env, parallel_env
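With this re-export, both entry points are importable from the package root; a two-line import sketch:

```python
from sumo_rl import SumoEnvironment    # RLlib/Gym-style environment class
from sumo_rl import env, parallel_env  # PettingZoo AEC and parallel factories
```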

sumo_rl/environment/.DS_Store

6 KB
Binary file not shown.
