Official implementation of the paper:
Teaching as Iterative Correction and Evaluation: A Bi-Level Reinforcement Learning Approach
In this paper, we adopt a bi-level reinforcement learning framework to model the interaction among the teachers, the students, and the task environment.
- Lower-level RL: the student agent interacts with the task environment, like a standard RL problem.
- Higher-level RL: the teacher agent observes the lower-level interaction and offers instructions to improve the student’s policy.
Depending on whether the teacher can provide timely suggestions to students during the interaction, two basic problem formulations are considered:
[Figure: Instant Teaching | Delayed Teaching]
- Instant Teaching: In turn-based games (such as Go), students can report their intended actions to the teacher before acting, so the instruction can be adopted and evaluated instantly.
- Delayed Teaching: If the teacher provides instruction after the student’s action is executed, for example in tennis training, the effectiveness can only be evaluated by delaying the student’s adoption of this instruction until the next occurrence of the same task state.
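The difference between the two formulations can be illustrated with a minimal sketch. It is not the repository's implementation: `instant_step`, `DelayedAdvice`, and the callable signatures `student_policy(state)` and `teacher(state, action)` are hypothetical names chosen for illustration, and the teacher is assumed to return `None` when it has no correction to offer. Updating the student's policy from the instruction is deliberately left out.

```python
def instant_step(state, student_policy, teacher):
    """Instant teaching: the student reports its intended action first,
    and the teacher's correction (if any) is adopted before execution."""
    intended = student_policy(state)
    suggestion = teacher(state, intended)  # assumed: None means "no correction"
    return suggestion if suggestion is not None else intended


class DelayedAdvice:
    """Delayed teaching: the instruction arrives after the action is executed,
    so it is stored and adopted at the next visit to the same state."""

    def __init__(self):
        self.pending = {}  # state -> instruction waiting for the next visit

    def step(self, state, student_policy, teacher):
        # Adopt an instruction stored at an earlier visit, if there is one.
        action = self.pending.pop(state, None)
        if action is None:
            action = student_policy(state)
        # The teacher comments on the executed action; the advice is kept
        # for the next occurrence of this state (states must be hashable).
        suggestion = teacher(state, action)
        if suggestion is not None:
            self.pending[state] = suggestion
        return action
```

In this sketch a stored instruction is used once and then discarded; how the student learns from it is outside the scope of the example.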
The Gridworld game features a bounded area divided into unit squares. The agent’s objective is to navigate from the start square to the goal square, with available actions {Up, Down, Left, Right}. Attempts to move beyond the boundary do not change the agent’s position. The game is made harder by unknown wind forces under each column, which influence the agent’s movement. An optimal path on the designed map, marked with a blue line, serves as a reference.
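The dynamics described above can be sketched as a small environment class. This is an illustrative sketch only, not the code in this repository: the class name `WindyGridworld`, the map size, the wind values, the choice that wind pushes the agent upward, the per-axis clipping at the boundary, and the reward of -1 per step are all assumptions.

```python
from dataclasses import dataclass

# Action name -> (row delta, column delta); row 0 is the top row.
ACTIONS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}


@dataclass
class WindyGridworld:
    rows: int
    cols: int
    start: tuple
    goal: tuple
    wind: tuple  # assumed: upward push (in squares) for each column

    def __post_init__(self):
        self.reset()

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        dr, dc = ACTIONS[action]
        r, c = self.pos
        # Assumption: the wind of the column being left pushes the agent up.
        new_r = r + dr - self.wind[c]
        new_c = c + dc
        # Moves past the boundary are clipped, so the agent stays on the grid.
        new_r = min(max(new_r, 0), self.rows - 1)
        new_c = min(max(new_c, 0), self.cols - 1)
        self.pos = (new_r, new_c)
        done = self.pos == self.goal
        reward = 0.0 if done else -1.0  # assumed step cost
        return self.pos, reward, done
```

For example, `WindyGridworld(rows=4, cols=5, start=(3, 0), goal=(0, 4), wind=(0, 0, 1, 1, 0))` creates a small map in which the two middle columns push the agent one square upward; the student agent does not observe the `wind` tuple and must discover its effect through interaction.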
Results demonstrate that corrective teachers in both scenarios outperform the elite-player baseline and better facilitate student learning.
[Figure: Instant Scenario | Delayed Scenario]
Even for students initialized with varying skill levels, the proposed corrective teacher can help students achieve better efficiency than self-study.
[Figure: Instant Scenario | Delayed Scenario]