Lab 6: Reinforcement Learning¶
0. Environment setup¶
from tqdm import trange
from IPython.core.display_functions import clear_output
from collections import deque
import random
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
torch.set_printoptions(precision=3)
n_s_dim = 4
n_a_dim = 2
1. Markov Decision Processes¶
A Markov Decision Process (MDP) is a dynamical system $f$ with state trajectory $s_t$ which we control through the choice of actions $a_t$. The evolution of the dynamical system is given by
$$ s_{t+1} = f \big (\, s_t, \, a_t \,\big) $$We further postulate that whenever we visit the state $s$ and take action $a$, we collect a reward $r(s, a)$. This state evolution equation along with the rewards $r(s, a)$ defines the MDP. The use of the word “Markov” indicates that the evolution of the process depends on the state $s_t$ and the action $a_t$ but does not depend on past state-action pairs $(s_u, a_u)$ for $u<t$.
An MDP that models walking on a line is shown in Figure 1. The states $s = 0, 1, \ldots, N$ are scalar numbers that indicate the walker’s current position. When we are in a state other than $s = 0$ or $s = N$, the possible actions are $a = +1$ and $a = -1$, which signify taking a step forward or backward, respectively. When in state $s = 0$, the possible actions are $a = +1$ and $a = 0$, representing a step forward or staying in place. When in state $s = N$, the possible actions are $a = 0$ and $a = -1$, staying in place or walking backwards. It follows from this description that the dynamical system is defined by the state evolution equation
$$ s_{t+1} = s_t + a_t $$To complete the description of the MDP, we assign rewards to each state-action pair. We choose to make $r(s,a)=0$ for all states $s$ other than $s = N$ and $r(N,a)=1$ for $s=N$, irrespective of the value of the action $a$. This reward structure signifies that $s = N$ is a desirable state. It incentivizes the walker to move forward.
Figure 1: Markov Decision Process (MDP). A walker moves forward when choosing action $a = +1$ and backwards when choosing action $a = -1$. Action $a = 0$ at the border states keeps the walker in place. The rewards $r(N,a) = 1$ for $s = N$ and $r(s,a) = 0$ for other states $s$ incentivize the walker to move forward.
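To make the walker concrete, here is a minimal sketch of its transition and reward functions in Python. The horizon N_walker, the function names f_walker and r_walker, and the clipping at the border states are our own choices for illustration, not part of the lab's code.

N_walker = 10  # rightmost state of the walker (example value)

def f_walker(s, a):
    # Deterministic transition s_{t+1} = s_t + a_t, clipped to the states 0, ..., N_walker.
    return min(max(s + a, 0), N_walker)

def r_walker(s, a):
    # Reward 1 when the walker is at the goal state s = N_walker, 0 elsewhere.
    return 1.0 if s == N_walker else 0.0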
1.1. Policies¶
A policy $\pi$ is a mapping from states $s$ to actions $a = \pi(s)$ so that whenever we visit state $s$, we take action $a = \pi(s)$. Thus, the system’s state $s_t$ when executing policy $\pi$ evolves according to
$$ s_{t+1} = f \big (\, s_t, \, a_t \,\big) = f \big (\, s_t, \, \pi(s_t) \,\big) ~. $$In Figure 2, we illustrate two policies for the walker in Figure 1. The forward policy selects $\pi_{\text{F}}(N) = 0$ at the terminal state $s = N$ and $\pi_{\text{F}}(s) = 1$ for other states $s \neq N$. The backward policy selects actions $\pi_{\text{B}}(0) = 0$ and $\pi_{\text{B}}(s) = -1$ for $s \neq 0$.
Notice that different policies accumulate different amounts of rewards. Our goal when working with MDPs is to find the policy $\pi$ that results in trajectories that accumulate as much reward as possible. Defining this goal mathematically requires some effort (see Section 2), but it is clear that for the walker in Figure 1, the optimal policy is to walk forward to reach state $s = N$, which is the only state with nonzero rewards.
Figure 2: Policies. Policies map states to actions. For the walker in Figure 1, the forward policy is defined as $\pi_{\text{F}}(N) = 0$ and $\pi_{\text{F}}(s) = 1$ for $s \neq N$, and the backward policy as $\pi_{\text{B}}(0) = 0$ and $\pi_{\text{B}}(s) = -1$ for $s \neq 0$. Different policies accumulate different total rewards. Optimal policies accumulate as much reward as possible.
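The two policies translate directly into code, and rolling either one out shows the difference in accumulated reward. This sketch reuses the hypothetical f_walker and r_walker from above.

def pi_forward(s):
    # Forward policy: step forward everywhere except at the goal state, where it stays in place.
    return 0 if s == N_walker else +1

def pi_backward(s):
    # Backward policy: step backward everywhere except at s = 0, where it stays in place.
    return 0 if s == 0 else -1

def accumulate_reward(pi, s0, T=50):
    # Roll out policy pi for T steps from s0 and add up the collected rewards.
    s, total = s0, 0.0
    for _ in range(T):
        a = pi(s)
        total += r_walker(s, a)
        s = f_walker(s, a)
    return total

print(accumulate_reward(pi_forward, s0=0), accumulate_reward(pi_backward, s0=0))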
1.2. Stochastic Policies and MDPs¶
The MDP in equation (1) is deterministic. The state-action pair $(s_t, a_t)$ completely determines the next state $s_{t+1}= f (s_t, a_t)$. The policy in equation (2) is also deterministic: the current state $s_t$ completely determines the choice of action $a_t=\pi(s_t)$. Both of these can be stochastic.
In a stochastic MDP, the state-action pair $(s_t, a_t)$ controls the probability distribution of the next state. That is, the state $s_{t+1}$ is drawn from the conditional probability distribution
$$ s_{t+1} \sim p \big (\, s_{t+1} \, | \, s_t, \, a_t \,\big), $$Instead of dictating the next state $s_{t+1}$, the choice of action $a_t$ dictates the likelihood that the next state is $s_{t+1}$.
Figure 3 shows a stochastic variation of the walker in Figure 1. This walker moves up with probability $p+a$ and down with probability $1-p-a$. Thus, the MDP is characterized by the stochastic evolution,
$$ p \big (\, s_{t+1} = s_t +1 \, | \, s_t, \, a_t \,\big) = p+a, $$$$ p \big (\, s_{t+1} = s_t -1 \, | \, s_t, \, a_t \,\big) = 1 - p - a . $$In this example, actions $a$ must be chosen so that $p+a$ and $1 - p - a$ are valid probabilities. For instance, we may have $p = 1/2$ and $a = +1/4$ or $-1/4$. With these values, the model represents a random walker that can bias its walk up or down. It can make the probability of walking forward as high as $3/4$ by choosing $a = +1/4$ and the probability of walking backwards as high as $3/4$ by choosing $a = -1/4$.
In addition to working with stochastic MDPs, we can also choose to work with stochastic policies. A stochastic policy $\pi$ does not determine the action $a_t$ exactly but provides the probability of choosing a particular action,
$$ a \sim \pi \big (a \, | \, s \big). $$For example, the random walker in Figure 3 may decide to bias its walk up with probability $q$ and to bias its walk down with probability $1-q$,
$$ \pi \big (a = +1/4 \, | \, s \big) = q , $$$$ \pi \big (a = -1/4 \, | \, s \big) = 1 - q . $$In this lab, we will work with deterministic MDPs and deterministic policies as they are easier to explain and train. However, we need to be aware of stochastic policies and MDPs, as most of the literature on reinforcement learning works with these. The actor-critic method that we will introduce later also works directly with stochastic MDPs.
Figure 3: Stochastic MDP.
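Simulating the stochastic walker amounts to sampling the action from the stochastic policy and then sampling the next state from the transition probabilities. A minimal sketch with $p = 1/2$ and an example bias probability $q$ (both values are illustrative):

rng = np.random.default_rng(0)
p_up, q_bias = 0.5, 0.8  # base step-up probability and policy bias probability (example values)

def sample_action(s):
    # Stochastic policy: bias the walk up (a = +1/4) with probability q_bias, down otherwise.
    return +0.25 if rng.random() < q_bias else -0.25

def sample_next_state(s, a):
    # Stochastic MDP: step up with probability p + a, down with probability 1 - p - a.
    return s + 1 if rng.random() < p_up + a else s - 1

s = 0
a = sample_action(s)
s_next = sample_next_state(s, a)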
2. Q-Functions and Value Functions¶
We are given a discount factor $\gamma<1$, we fix a policy $\pi$, and we initialize the system with state-action pair $(s,a) =(s_0,a_0)$. We define the Q-function $Q (s, a; \pi)$ as the reward accumulated by executing policy $\pi$ when the system is initialized at state-action pair $(s,a)$,
$$ Q (s, a; \pi) = \sum_{t=0}^\infty \gamma^t \, r(s_t, a_t) = r(s_0, a_0) + \gamma r(s_1, a_1) + \gamma^2 r(s_2, a_2) + \ldots $$In this reward accumulation, the purpose of the discount factor $\gamma$ is to give more weight to rewards collected earlier. Notice that the MDP $f$ and the policy $\pi$ enter the right-hand side of the equation through the state evolution. This is because actions $a_t = \pi(s_t)$ follow the policy $\pi$, and state transitions $s_{t+1} = f(s_t, a_t) = f(s_t, \pi(s_t))$ follow the MDP.
Indeed, the system starts at the state-action pair $(s,a)=(s_0,a_0)$ and collects the reward $r(s,a)=r(s_0, a_0)$. As per equation (1), the system moves to state $s_1 = f(s_0,a_0) = f(s,a)$. At this point in time, we choose action $a_1 = \pi(s_1)$ to collect the reward $\gamma r(s_1, a_1)$ and cause the system to transition into state $s_2 = f(s_1,a_1) = f(s_1,\pi(s_1))$. We then execute action $a_2 = \pi(s_2)$ to collect reward $\gamma^2r(s_2, a_2)$ and transition into state $s_3 = f(s_2,a_2) = f(s_2,\pi(s_2))$. We keep executing actions as dictated by the policy $\pi$ so that at arbitrary time $t$ we execute action $a_t = \pi(s_t)$ to collect reward $\gamma^t r(s_t, a_t)$ and transition into state $s_{t+1} = f(s_t,a_t)$.
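Because both the walker and the policy are deterministic, the Q-function can be approximated by rolling the policy out and truncating the discounted sum at a finite horizon. A sketch using the hypothetical f_walker, r_walker, and pi_forward from above:

def q_function(s, a, pi, gamma=0.9, T=200):
    # Discounted reward accumulated when starting at (s, a) and following policy pi afterwards.
    total = r_walker(s, a)
    s = f_walker(s, a)
    for t in range(1, T):
        a = pi(s)
        total += gamma ** t * r_walker(s, a)
        s = f_walker(s, a)
    return total

print(q_function(0, +1, pi_forward))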
Notice that in the definition of the Q-function $Q (s, a; \pi)$, the action $a$ may or may not be the policy action $\pi(s)$.
The Q-function describes the reward that we collect when the system is initialized at state-action pair $(s,a) = (s_0,a_0)$. In general, we care about the system when it is initialized at a number of different states $s$. To capture this interest, we consider a distribution $p(s)$ of initial states $s = s_0$ and define the value function of policy $\pi$ as
$$ V(\pi) = \mathbb{E}_{p(s)} \Big[\, Q \Big(\, s, \, \pi(s); \, \pi \,\Big) \, \Big] . $$Observe that in the definition of the value function, we particularize $a$ to $\pi(s)$ in the Q-function. The value function considers the effect of running policy $\pi$ from $t=0$ onward averaged over an exogenous choice of the distribution of initial states $s_0=s$.
2.1. Q-Function Recursion¶
The form of the Q-function is such that it can be computed recursively. This result is simple to derive but important enough to be highlighted as a proposition.
Proposition¶
Consider an MDP $f$ as defined in equation (1) along with a policy $\pi$ as defined in equation (2). The $Q$-function $Q (s, a; \pi)$ satisfies the recursion
$$ Q \Big(\, s, \pi(s); \pi \, \Big) = r(s, \pi(s)) + \gamma Q\Big( s^{+}, \pi(s^{+}); \pi\, \Big) , $$for any initial state $s$ and next state $s^{+} = f(s, \pi(s))$.
Proof¶
To prove this recursion, we write $Q (s, a; \pi) = Q (s_0, a_0; \pi)$ for clarity and recall that in the definition of the Q-function, actions $a_t = \pi(s_t)$ follow the policy $\pi$ and state transitions $s_{t+1} = f(s_t, a_t) = f(s_t, \pi(s_t))$ follow the MDP. In particular, $s_{1} = f(s_0, a_0)$ and $a_1 = \pi(s_1)$.
Begin by separating the term corresponding to $t=0$ in the Q-function definition:
$$ Q (s_0, a_0; \pi) = r(s_0, a_0) + \gamma \sum_{t=1}^\infty \gamma^{t-1} \, r(s_t, a_t) . $$Observe that the summation on the right-hand side is the Q-function $Q(s_1, a_1; \pi)$. The sum starts at $t=1$, but this is just a summation index shift. For clarity, we expand the sum:
$$ \sum_{t=1}^\infty \gamma^{t-1} \, r(s_t, a_t) = r(s_1, a_1) + \gamma r(s_2, a_2) + \gamma^2 r(s_3, a_3) + \ldots , $$which is the same as the Q-function but with the initial condition $(s,a) =(s_1,a_1)$. Therefore, we conclude that
$$ Q (s_0, a_0; \pi) = r(s_0, a_0) + \gamma Q(s_1, a_1; \pi) $$This is the same as the recursive equation above because, as noted, $s_{1} = f(s_0, a_0)$ and $a_1 = \pi(s_1)$. We can substitute $(s_0, a_0) = (s, \pi(s))$ and $(s_1, a_1) = (s^{+}, \pi(s^{+}))$, completing the proof.
The claim in Proposition 1 is a straightforward statement. It says that the reward we accumulate from $t=0$ onwards is the sum of the reward at $t=0$ and the reward we accumulate from $t=1$ onwards, with the reward accumulated from $t=1$ onwards being discounted.
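The recursion is easy to verify numerically on the walker, using the hypothetical q_function, f_walker, and r_walker sketched above: up to the truncation of the infinite sum, both sides of the recursion agree.

gamma_walker = 0.9
s = 3
a = pi_forward(s)
s_plus = f_walker(s, a)
lhs = q_function(s, a, pi_forward, gamma_walker)
rhs = r_walker(s, a) + gamma_walker * q_function(s_plus, pi_forward(s_plus), pi_forward, gamma_walker)
print(abs(lhs - rhs))  # close to zero, up to truncation error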
2.2. Optimal Policy¶
A first approach to define an optimal policy is to maximize the Q-function for a certain initial condition $(s,\pi(s))$,
$$ \pi^* = \argmax_{\pi} Q (s, \pi(s); \pi) $$Although it seems that the optimal policies $\pi^*$ depend on the initial condition $(s,a) = (s_0,a_0)$, this is not quite the case. There exists an optimal policy $\pi^*$ that is the same for all initial conditions. This is true because an optimal policy that changed with the initial condition would contradict Proposition 1. To see this, suppose that policy $\pi^*$ is optimal for the Q-function $Q (s, \pi(s); \pi)$ but not for the Q-function $Q (s^{+}, \pi(s^{+}); \pi)$, which is maximized by policy $\pi^{**}$. If this is the case,
$$ Q \Big(\, s, \pi^*(s); \pi^* \, \Big) = r(s, \pi^*(s)) + \gamma Q\Big( s^{+}, \pi^*(s^{+}); \pi^*\, \Big) < r(s, \pi^*(s)) + \gamma Q\Big( s^{+}, \pi^{**}(s^{+}); \pi^{**}\, \Big), $$where the equality follows from equation (3), and the inequality holds because $\pi^*$ is not optimal for $Q (s^{+}, \pi(s^{+}); \pi)$. This inequality contradicts the statement that $\pi^*$ is optimal for $Q (s, \pi(s); \pi)$ because we can change the policy to $\pi^{**}$ from time $t=1$ and increase the accumulated reward.
A second approach to defining an optimal policy $\pi^*$ is to maximize the value function $V(\pi)$.
$$ \pi^* = \argmax_{\pi} V(\pi) $$Although it may seem that this approach yields a different optimal policy, this is not the case. Since a policy that is optimal for all initial conditions exists, this policy is also optimal for the average in the value function. However, working with value functions is preferable in practice because finding policies that maximize value functions is often easier than maximizing Q-functions.
To solve the above optimization problem over the policy $\pi$, we introduce a learning parameterization $\pi(s, \alpha)$. This leads to the optimization problem
$$ \alpha^* = \argmax_\alpha V(\pi(\alpha)) = \argmax_\alpha \mathbb{E}_{p(s)} \Big[\, Q \Big(\, s, \, \pi(s,\alpha); \, \pi(\alpha) \,\Big) \, \Big]. $$The advantage of introducing a learning parameterization is that after solving this optimization, we can execute optimal actions $\pi(s_t; \alpha^*)$ through simple evaluations of the learning parameterization. This is our customary reason for using machine learning.
2.3. Circuit Navigation¶
The trajectory control problem we addressed in Lab 8 can be written as a particular case of equation (6). Consider a reference trajectory specified here as a set $\mathcal{S}$ of vectors $s_{\text{R}} = [p_{\text{R}}; v_{\text{R}}]$, where $p_{\text{R}}$ and $v_{\text{R}}$ are coordinates and velocities that we want to track. The pair $[p_{\text{R}}; v_{\text{R}}]$ represents an optimal pair of positions and velocities as determined, e.g., by an expert driver. The state of the system is a vector $s = [p; v]$ that contains the position and velocity of the car. We control the car by choosing accelerations $a_t$. A first approach to capture the goal of following the expert is the reward function,
$$ r_1(s_t, a_t) = - \min_{[p_{\text{R}}; v_{\text{R}}] \in \mathcal{S}} \bigg[\, \frac{a}{2} \| p_t - p_{\text{R}} \|^2 + \frac{b}{2} \| v_t - v_{\text{R}} \|^2 \bigg] . $$The term $-(a/2) \| p_t - p_{\text{R}} \|^2$ keeps the position of the car $p_t$ close to the reference trajectory $p_{\text{R}}$. The term $-(b/2) \| v_t - v_{\text{R}} \|^2$ keeps the car’s velocity $v_t$ close to the velocity of the expert $v_{\text{R}}$. The minimum in the reward $r_1(s_t, a_t)$ is taken over all vectors $s_{\text{R}} = [p_{\text{R}}; v_{\text{R}}]$ in the expert trajectory $\mathcal{S}$. No matter the car’s position and velocity, we want to incentivize it to move toward the nearest point in the expert trajectory.
If the car is initialized at state $s$ with action $a$ and we execute policy $\pi$, the Q-function reduces to
$$ Q(s, a; \pi) = - \sum_{t=0}^\infty \gamma^t \min_{[p_{\text{R}}; v_{\text{R}}] \in \mathcal{S}} \bigg[\, \frac{a}{2} \| p_t - p_{\text{R}} \|^2 + \frac{b}{2} \| v_t - v_{\text{R}} \|^2 \bigg] , $$which we obtain by substituting equation (7) into equation (4). Recall that the policy $\pi$ enters this Q-function in the generation of the trajectory. States $s_t = [p_t; v_t]$ are the consequence of executing actions $a_u = \pi(s_u)$ and the evolution $s_{u+1} = f (s_u, a_u)$ of the dynamical system for all times $u < t$. Except for the discount factor $\gamma$, this is the same loss we used to design MPC controllers in Lab 8.
Further consider a set of initial states $s_i$ drawn from a set of $N$ possible initial states. Using this set of states as the initial state distribution $p(s)$ and substituting the Q-function from above into equation (5) yields the value function
$$ V(\pi) = - \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^\infty \gamma^t \min_{[p_{\text{R}}; v_{\text{R}}] \in \mathcal{S}} \bigg[ \frac{a}{2} \| p_{it} - p_{\text{R}} \|^2 + \frac{b}{2} \| v_{it} - v_{\text{R}} \|^2 \bigg] , $$where $p_{it}$ and $v_{it}$ are the trajectories generated on the dynamical system $f$ when we execute policy $\pi$ and start the system at initial condition $s_i$. We see here that maximizing the Q-function and the value function are indeed equivalent problems, as is generally true for MDPs. We also see why working with the value function is preferable in practice. Having a set of initial conditions $s_i$ instead of a single initial condition $s$ yields a richer set of states to explore as we learn an optimal policy.
A more refined reward combines the reward above with the norm of the acceleration
$$ r(s_t, a_t) = r_1(s_t, a_t) - \frac{c}{2} \| a_t \|^2 . $$The term $-(c/2) \| a_t \|^2$ penalizes large accelerations to prevent the car from imitating the expert with swerving trajectories. Penalizing large accelerations moves the car closer to the expert trajectory with smooth movements. The use of different scaling constants $a$, $b$, and $c$ gives different importance to positions, velocities, and accelerations. For example, it is clear that keeping positions $p_t$ and $p_{\text{R}}$ close is more important than keeping velocities close, so we should make $a > b$.
Henceforth, we use the reward $r(s_t, a_t)$ defined above.
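For concreteness, this reward can be evaluated in PyTorch by taking the minimum over the reference set. The sketch below only illustrates the formula: the weights a_pos, b_vel, c_acc and the state layout [x, y, vx, vy] are assumptions of ours, and the Car class defined later in this lab implements its own distance-based variant of this reward.

a_pos, b_vel, c_acc = 1.0, 0.5, 0.1  # example weights with a > b, as discussed

def navigation_reward(s, accel, ref):
    # s is a 4-dimensional state [x, y, vx, vy]; ref is an (M, 4) tensor of expert states [p_R; v_R].
    pos_err = torch.sum((ref[:, :2] - s[:2]) ** 2, dim=-1)  # ||p - p_R||^2 for every reference state
    vel_err = torch.sum((ref[:, 2:] - s[2:]) ** 2, dim=-1)  # ||v - v_R||^2 for every reference state
    r1 = -torch.min(0.5 * a_pos * pos_err + 0.5 * b_vel * vel_err)
    return r1 - 0.5 * c_acc * torch.sum(accel ** 2)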
Task 9.1¶
Code a Car class whose attributes are an expert trajectory $\mathcal{S}$, the state of the car $s = [p; v]$, and its acceleration $a$. Make the car simulator a parent of this class. We take this simulator as the system $f$, so that calling simulator.step(s, a) computes the state $s' = f(s, a)$. Although not mathematically necessary, it may help with interpretations to know that a step of the car simulator corresponds to $T_s = 50$ ms.
Add a reward method that computes the reward $r(s, a)$ collected by the car given the state of its attributes, as specified in equation (11).
Instantiate the car and initialize its expert trajectory. To solve this task, you need to load the car simulator and the expert trajectory; both are used in the code below.
Ts = 0.05
drag_coefficient = 0.1
d_max = 1
v_max = 20
a_max = 1
c = 0.075
def dynamics_ca(dt: float):
    # Constant-acceleration (double-integrator) dynamics with sampling time dt.
A = np.array(
[
[1, 0, dt, 0],
[0, 1, 0, dt],
[0, 0, 1, 0],
[0, 0, 0, 1],
]
)
B = np.array(
[
[dt**2 / 2, 0.0],
[0.0, dt**2 / 2],
[dt, 0.0],
[0.0, dt],
]
)
return A, B
def dynamics_ca_drag(dt: float, mu: float):
    # Constant-acceleration dynamics with a linear drag term of coefficient mu.
A, B = dynamics_ca(dt)
A[0, 2] -= mu * dt**2 / 2
A[1, 3] -= mu * dt**2 / 2
A[2, 2] -= mu * dt
A[3, 3] -= mu * dt
return A, B
class Car:
def __init__(self) -> None:
self.A, self.B = dynamics_ca_drag(Ts, drag_coefficient)
ref = np.load('data/lerp_expert_sample.npy')
self.ref = torch.from_numpy(ref).double()
self.rng = np.random.default_rng()
def get_ref_distance(self, s):
if len(s.shape) == 1:
s = s.unsqueeze(0)
idx = torch.argmin(torch.sqrt(torch.sum((self.ref.unsqueeze(0) - s.unsqueeze(1)) ** 2, dim=-1)), dim=-1)
s_ref = self.ref[idx]
d = torch.sqrt(torch.sum((s_ref - s) ** 2, dim=-1))
return d
def get_reward(self, s, a):
max_reward = 1.5
d = self.get_ref_distance(s)
if d > d_max:
return torch.tensor([-1])
return max_reward - d - c * torch.sqrt(torch.sum(a) ** 2)
def reset(self):
return self.ref[int(torch.rand(1).item() * self.ref.shape[0])].unsqueeze(0)
def step(self, s, a):
r = self.get_reward(s.detach(), a.detach())
s_next = s.detach() @ self.A.T + a.detach() @ self.B.T
d = self.get_ref_distance(s_next)
done = d > d_max
return s_next, r, done
2.4. Observations¶
Agents in an MDP are assumed to know their states $s_t$. This is how they can execute the action $a_t = \pi(s_t)$ dictated by their policy. In addition to states, agents may also be assumed to have access to observations $o_t$. When these observations are available, the policy $\pi$ is a function of the state and the observation,
$$ a_t = \pi(s_t, o_t). $$In circuit navigation, the expert trajectory $\mathcal{S}$ is a possible example of an observation that the agent leverages to make acceleration decisions.
Observations $o_t$ do not affect the evolution of the dynamical system except for their effect in the choice of policy. When choosing actions as in equation (12), the state of the system still evolves according to the same MDP,
$$ s_{t+1} = f \big (\, s_t, \, a_t \,\big) = f \big (\, s_t, \, \pi(s_t, o_t) \,\big). $$Observations $o_t = h(s_t)$ are also completely determined by the state $s_t$ through the observation function $h$. The observation function $h$, like the dynamical system $f$, is unknown.
Since observations $o_t$ are determined by states $s_t$ and state evolution is independent of observations — save for their effect on the action choice — we think of observations $o_t$ as side information that is attached to the state $s_t$. This information is available to make better decisions $a_t = \pi(s_t, o_t)$ but does not otherwise impact the evolution of the MDP. This is different from states $s_t$ that affect both the choice of action $a_t$ and the next state $s_{t+1}$.
In subsequent sections, we write policies as $a_t = \pi(s_t)$ even though we most often mean $a_t = \pi(s_t, o_t)$. We do this to avoid complicating notation.
Task 9.2¶
Modify the Car class in Task 9.1 to add an observation attribute $o$. Add an observation method that, given the state of the car $s = [p; v]$ and the reference trajectory $\mathcal{S}$, computes the observation $o$ consisting of the $n_o$ reference states $s_R$ whose positions $p_R$ are closest to the position of the agent $p$. For subsequent experiments, we recommend setting $n_o = 10$.
Modify the parent method simulator.step(s, a) to include an update of the observation attribute using the observation method. Modify the __init__ method to initialize the observation attribute using the observation method.
n_obs = 10
n_s_obs_dim = n_s_dim * (n_obs + 1)
class Car:
def __init__(self) -> None:
self.A, self.B = dynamics_ca_drag(Ts, drag_coefficient)
ref = np.load('data/lerp_expert_sample.npy')
self.ref = torch.from_numpy(ref).double()
self.all_indices = torch.arange(self.ref.shape[0] * 2).unsqueeze(0)
self.rng = np.random.default_rng()
def get_ref_distance(self, s):
if len(s.shape) == 1:
s = s.unsqueeze(0)
idx = torch.argmin(torch.sqrt(torch.sum((self.ref.unsqueeze(0) - s.unsqueeze(1)) ** 2, dim=-1)), dim=-1)
s_ref = self.ref[idx]
d = torch.sqrt(torch.sum((s_ref - s) ** 2, dim=-1))
return d, idx
def get_reward(self, s, a):
max_reward = 1.5
d, _ = self.get_ref_distance(s)
if d > d_max:
return torch.tensor([-1])
return max_reward - d - c * torch.sqrt(torch.sum(a) ** 2)
    def get_obs(self, s):
        # Observation: the n_obs reference states starting at the nearest reference point (wrapping around the trajectory), flattened into a single vector.
if len(s.shape) == 1:
s = s.unsqueeze(0)
_, idx = self.get_ref_distance(s)
if len(idx.shape) == 0:
idx = idx.unsqueeze(0)
mask = self.all_indices.repeat(s.shape[0], 1)
mask = (idx.unsqueeze(-1) <= mask) & (mask < idx.unsqueeze(-1) + n_obs)
look_ahead = self.ref.unsqueeze(0).repeat(s.shape[0], 2, 1)[mask]
if len(look_ahead.shape) == 2:
look_ahead = look_ahead.unsqueeze(0)
return look_ahead.reshape(s.shape[0], n_s_obs_dim - n_s_dim)
def reset(self):
s = self.ref[int(torch.rand(1).item() * self.ref.shape[0])].unsqueeze(0)
obs = self.get_obs(s)
return s, obs
    def step(self, s, a):
        # One simulator step: compute the reward, advance the linear dynamics, update the observation, and flag termination when the car drifts too far from the reference.
r = self.get_reward(s.detach(), a.detach()).double()
s_next = s.detach() @ self.A.T + a.detach() @ self.B.T
obs_next = self.get_obs(s_next)
d, _ = self.get_ref_distance(s_next)
done = d > d_max
return s_next, obs_next, r, done
simulator = Car()
3. Reinforcement Learning¶
In Section 2, we defined optimal policies of Markov decision processes (MDPs). If the MDP transition function $f$ and the reward function $r$ are known, the optimization problems formulated in Section 2 are regular optimization problems. The formulation in equation (6), in particular, is a standard machine learning problem. However, the function $f$ and the reward $r$ are not known. A possible approach to solve equation (6) is to design or learn models of $f$ and $r$ and use them to find the policy $\alpha^*$. This is model predictive control (MPC).
Alternatively, we can learn the policy $\alpha^*$ by probing the MDP with different policies $\pi$ and observing the collected rewards $V(\pi)$. This is reinforcement learning (RL). We can then think of RL as the process of learning and maximizing the value function. This is in contrast to MPC, where we build or learn models of the transition function $f$ and the reward function $r$ as an intermediate step to finding the optimal policy $\alpha^*$.
4. Policy Optimization¶
To solve equation (6) without learning a model of the MDP, we run gradient ascent on the policy parameter $\alpha$. Introduce an iteration index $k$, a step size $\epsilon$, an initial value $\alpha_0$, and proceed through iterations that update $\alpha$ by following gradients $g(\alpha) := \partial V(\pi(\alpha)) / \partial \alpha$,
$$ \alpha_{k+1} = \alpha_k + \epsilon\, g(\alpha_k) = \alpha_k + \epsilon\, \frac{\partial}{\partial \alpha} V(\pi(\alpha_k)) . $$In contrast to gradient descent, we move in the direction of the gradient, not its opposite. This is because we are searching for the maximum of $V(\pi(\alpha))$, not its minimum.
To implement this gradient ascent, we need to compute gradients of the value function. This computation is more challenging than it seems, but it can be carried out. To give a gradient expression, it is convenient to write the value function as
$$ V(\pi(\alpha)) = \mathbb{E}_{p(s)} \Big[\, Q \big(\, s, \, \pi(s,\alpha); \, \pi(\alpha) \,\big) \, \Big] = \mathbb{E}_{p(s)} \Big[\, Q \big(\, s, \, a; \, \pi(\alpha) \,\big) \, \Big], $$where in the second equality it is implicit that $a = \pi(s,\alpha)$. This is simply a notation simplification.
Now consider a trajectory $s_t$ generated by policy $\pi$. This is a trajectory in which we choose $a_t = \pi(s_t)$ from time $t=0$ onward. The gradient of the value function can then be written as
$$ g(\alpha) = \mathbb{E}_{p(s)} \Bigg[\, \sum_{t=0}^\infty \gamma^t \times \frac{\partial}{\partial a_t} Q \Big(\, s_t, \, a_t; \, \pi(\alpha) \,\Big) \times \frac{\partial}{\partial \alpha} \pi(s_t, \alpha) \, \Bigg] . $$In this expression, the gradients of the Q-function are taken with respect to the choice of action $a_t$ and are evaluated at $a_t = \pi(s_t,\alpha)$. That is, it is implicit that $a_t = \pi(s_t,\alpha)$, as it is implicit in the previous equation that $a = \pi(s, \alpha)$.
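In code, the two factors in this product are rarely formed explicitly: backpropagating through the composition $Q(s_t, \pi(s_t, \alpha))$ produces exactly the product of $\partial Q / \partial a_t$ and $\partial \pi / \partial \alpha$. A toy sketch with stand-in functions (q_toy and pi_toy are illustrative, not the lab's models):

alpha_toy = torch.tensor(0.3, requires_grad=True)  # toy scalar policy parameter
s_toy = torch.tensor(2.0)                          # toy scalar state

def pi_toy(s, alpha):
    return alpha * s                               # toy linear policy

def q_toy(s, a):
    return -(a - 1.0) ** 2                         # toy concave Q-function of the action

q_val = q_toy(s_toy, pi_toy(s_toy, alpha_toy))
q_val.backward()                                   # chain rule: (dQ/da) * (dpi/dalpha)
print(alpha_toy.grad)                              # -2 * (alpha * s - 1) * s = 1.6 here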
4.1. Critics¶
In equation (13), the gradients ${\partial \pi(s_t,\alpha)}/{\partial \alpha}$ are with respect to the learning parameterization and can be computed with automated differentiation. The gradients ${\partial Q(s_t, a_t; \pi(\alpha) )}/{\partial a_t}$ are more difficult to evaluate because we do not know the Q-function. We solve the latter problem by introducing a critic.
A critic of a given policy $\pi$ is a function $\tilde{Q}(s, a; \beta)$ that maps the state-action pair $(s,a) = (s, \pi(s))$ and the parameter $\beta$ to an estimate of the Q-function’s value $Q(s,a; \pi)$. With a critic available, we can replace the gradient of the Q-function in equation (13) with the gradient of the critic to compute gradient estimates,
$$ \tilde{g}(\alpha) = \mathbb{E}_{p(s)} \Bigg[\, \sum_{t=0}^\infty \gamma^t \times \frac{\partial}{\partial a_t} \tilde{Q} \Big(\, s_t, \, a_t; \, \beta \,\Big) \times \frac{\partial}{\partial \alpha} \pi(s_t, \alpha) \, \Bigg] . $$The vector $\tilde{g}(\alpha)$ is an estimate of the value function gradient based on derivatives of the critic function. We call it a critic gradient for short.
The advantage of using the critic $\tilde{Q} (s, a; \beta)$ in the gradient computation is that the critic is a parametric function. Thus, the gradients ${\partial \tilde{Q}(s, a; \beta)}/{\partial a}$ can be computed with automated differentiation. We can introduce an iteration index $k$, a step size $\epsilon$, and an initial value $\alpha_0$ to update $\alpha_k$ by following the critic gradients $\tilde{g}(\alpha_k)$,
$$ \alpha_{k+1} = \alpha_k + \epsilon\, \tilde{g}(\alpha_k) . $$In this recursion, the vectors $\tilde{g}(\alpha_k)$ are the critic gradients defined in equation (14). This contrasts with the true gradients $g(\alpha_k)$ of the value function $V(\pi)$ used in equation (9) and expressed in equation (13).
The downside of using a critic is that it is an approximation, not the true Q-function. For this to be effective, we need to learn a good critic with close function values $\tilde{Q}(s, a; \beta) \approx Q(s, a; \pi(\alpha))$ and close derivatives ${\partial \tilde{Q}(s, a; \beta)}/{\partial a} \approx {\partial Q(s, a; \pi(\alpha))}/{\partial a}$. In a sense, equation (14) does not solve the problem of not knowing the Q-function in equation (13); it merely postpones it. We assume in the following that a critic is available and return to its computation in Section 6.
4.2. Actors¶
An actor is an agent that, given a critic $\tilde{Q}(s, a; \beta)$, solves the optimization problem
$$ \alpha^* (\beta) = \argmax_\alpha \mathbb{E}_{p(s)} \Bigg[\, \sum_{t=0}^\infty \gamma^t \times \tilde{Q} \Big(\, s_t, \, \pi(s_t; \alpha); \, \beta \,\Big) \, \Bigg] = \argmax_\alpha \ell_{\text{A}} (\alpha, \beta) , $$where in the second equality we define the actor loss $\ell_{\text{A}} (\alpha, \beta)$, which we call a loss despite the fact that it is a reward we want to maximize. In this loss, we draw initial samples $s_0 = s$ from the distribution $p(s)$. We then generate trajectories $s_t$ from these initial conditions by executing the policy $\pi(\alpha)$. The critic $\tilde{Q}(s, a; \beta)$ is evaluated at states and actions that correspond to this trajectory.
The optimization in this equation is justified by the fact that the gradients of the loss $\ell_{\text{A}} (\alpha, \beta)$ have the same form as the critic gradients $\tilde{g}(\alpha)$ in equation (14),
$$ \frac{\partial}{\partial \alpha} \ell_{\text{A}} (\alpha, \beta) = \mathbb{E}_{p(s)} \Bigg[\, \sum_{t=0}^\infty \gamma^t \times \frac{\partial}{\partial a_t} \tilde{Q} \Big(\, s_t, \, a_t; \, \beta \,\Big) \times \frac{\partial}{\partial \alpha} \pi(s_t, \alpha) \, \Bigg] . $$There is, however, a subtle difference between equations (14) and (17). In equation (14), the critic $\tilde{Q}(s, a; \beta)$ approximates the Q-function of policy $\pi(\alpha)$—the same policy whose gradients we are evaluating. In equation (17), the critic $\tilde{Q}(s, a; \beta)$ is fixed, and we evaluate gradients at an arbitrary policy parameter $\alpha$. This is why in equation (16), the optimal actor parameter is written as a function of $\beta$.
That the critic parameter in equation (16) is fixed is not an insignificant issue. Let us, however, set it aside for the time being and work on its solution. We revisit this challenge in Section 7.
To solve equation (16), we consider stochastic approximations of the loss $\ell_{\text{A}} (\alpha, \beta)$. Formally, consider an initial state $s \sim p(s)$ sampled from the initial state distribution and compute
$$ \hat{\ell}_{\text{A}} (\alpha, \beta) = \sum_{t=0}^\infty \gamma^t \times \tilde{Q} \Big(\, s_t, \, \pi(s_t; \alpha); \, \beta \,\Big). $$As in equation (16), the trajectory $s_t$ is generated from the initial condition $s$ by executing the policy $\pi(\alpha)$. We refer to this as a rollout. We generate a trajectory by executing (or rolling out) the policy $\pi(\alpha_k)$ from the initial condition $s$. Although we keep a sum running to $t=\infty$ for simplicity, in practical implementations, we limit the rollout to a finite time $t=T$.
Instead of a single initial state sample $s$, we may work with a batch of $B$ samples $s_i \sim p(s)$ drawn from the initial state distribution. In this case, the stochastic approximation of the loss $\ell_{\text{A}} (\alpha, \beta)$ becomes
$$ \hat{\ell}_{\text{A}} (\alpha, \beta) = \frac{1}{B} \sum_{i=1}^{B} \sum_{t=0}^\infty \gamma^t \times \tilde{Q} \Big(\, s_{it}, \, \pi(s_{it}; \alpha); \, \beta \,\Big). $$As in equations (16) and (18), the trajectories $s_{it}$ are generated from the initial conditions $s_i$ by executing the policy $\pi(\alpha)$.
Task 9.3¶
state_scale = 20
action_scale = 35
class Critic:
def __init__(self, model_path):
self.model = torch.jit.load(model_path).double()
def evaluate(self, s, obs, a):
if len(s.shape) == 1:
s = s.unsqueeze(0)
s_aug = torch.cat([s, obs], dim=-1) / state_scale
a = a / action_scale
return self.model(s_aug, a)
critic = Critic("data/q_model.pth")
class Policy(torch.nn.Module):
def __init__(self):
super(Policy, self).__init__()
self.fc1 = torch.nn.Linear(n_s_obs_dim, 128)
self.fc2 = torch.nn.Linear(128, 128)
self.fc3 = torch.nn.Linear(128, 128)
self.fc4 = torch.nn.Linear(128, 2)
torch.nn.init.uniform_(self.fc4.weight, a=-1e-6, b=1e-6)
def forward(self, s):
x = s
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
x = F.tanh(self.fc4(x))
return x
pi = Policy().double()
class Actor:
def __init__(self, model):
self.model = model
def act(self, s, obs):
if len(s.shape) == 1:
s = s.unsqueeze(0)
s_aug = torch.cat([s, obs], dim=-1) / state_scale
return self.model(s_aug) * action_scale
actor = Actor(pi)
def get_returns(rewards, gamma):
    # Discounted reward-to-go R_t = r_t + gamma * R_{t+1}, computed with a backward recursion.
    returns = torch.zeros_like(rewards)
    running = 0.0
    for i in reversed(range(rewards.shape[0])):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns
T = 250
gamma = 0.99
def rollout(actor, random_action=False):
states, observations, actions, rewards = [], [], [], []
s, obs = simulator.reset()
for _ in range(T):
if random_action:
a = torch.rand((1, n_a_dim)) * action_scale * 2 - action_scale
else:
a = actor.act(s, obs)
states.append(s)
observations.append(obs)
actions.append(a)
s, obs, r, done = simulator.step(s, a)
rewards.append(r)
if done:
break
states = torch.stack(states).squeeze(1)
actions = torch.stack(actions).squeeze(1)
observations = torch.stack(observations).squeeze(1).detach()
rewards = torch.stack(rewards).squeeze(1).detach()
returns = get_returns(rewards, gamma)
return states, observations, actions, rewards, returns
episodes = 20_000
gamma = 0.99
pi_loss = 0
pi_opt = torch.optim.Adam(pi.parameters(), lr=2e-5, maximize=True)
track = load_track("data/track.npz")  # load_track is a plotting helper assumed to be provided with the lab materials
pi.train()
for episode in trange(episodes):
pi_opt.zero_grad()
s, obs, a, _, _ = rollout(actor)
pi_loss = critic.evaluate(s, obs, a).mean()
pi_loss.backward()
pi_opt.step()
    if episode % 100 == 0:  # logging interval (assumed)
        clear_output()
        print(f'Episode {episode+1}\t Policy loss: {pi_loss}\t Trajectory size: {s.shape[0]}')
Task 9.4¶
Implement the optimal policy $\pi(\alpha^*)$ computed in Task 9.3. Record the sequence of states $s_t$ and actions $a_t$ as the trajectory unfolds. Evaluate the performance by comparing the generated trajectory with an expert trajectory. Plot both trajectories.
s, _, _, _, _ = rollout(actor)
plt.figure()
track.plot()
plt.grid(True)
plt.xlabel("x (m)")
plt.ylabel("y (m)")
plt.plot(simulator.ref[:, 0], simulator.ref[:, 1], "--", label="Expert")
plt.plot(s[:, 0], s[:, 1], "-", label='RL')
plt.legend(loc="upper right", framealpha=1.0)
plt.show()
5. Exploratory Policies¶
RL problems are unique among the problems we have studied in that the training data is chosen. In both the original formulation of the optimal policy in equation (6) and the actor formulation in equation (16), we optimize over the initial state distribution $p(s)$. In Task 9.3, we choose $p(s)$ as the reference trajectory $\mathcal{S}$, but this is an arbitrary choice. We can also, as we did in Lab 8, use dithering of initial states around the reference trajectory.
In RL, it is standard to use random exploration around policy trajectories. We describe this formally by introducing an exploratory policy in which actions are chosen at random with probability $\delta$ and according to $\pi(\alpha)$ with probability $1 - \delta$,
$$ \pi_{\text{E}}(s, \alpha) \sim \mathcal{N}(0,\sigma^2\mathbb{I}), \qquad \text{with probability } \delta , $$$$ \pi_{\text{E}}(s, \alpha) = \pi(s, \alpha), \qquad \text{with probability } 1 - \delta . $$Policy rollouts in equation (18), or equation (19) if using batches of initial states, proceed according to the exploratory policy $\pi_{\text{E}}(s, \alpha)$.
The idea of exploratory policies is to let the actor visit states around the trajectory generated by the current policy. The rationale for these random visits is to explore during training a set of states that is more representative of the set of states that will be seen during execution. We use a combination of states that we expect to see a priori—drawn from $p(s)$—and states that we expect to see a posteriori—drawn from policy rollouts.
Task 9.5¶
Repeat Task 9.3 using an exploratory policy. Set $\delta = 0.01$ for the probability of choosing a random acceleration. When choosing random accelerations, draw them from a white normal distribution with mean $0$ and variance $\sigma^2 = 0.1$.
Implement the optimal policy $\pi(\alpha^*)$ obtained with the exploratory policy. Note that we use exploration during training but not during execution. Record the sequence of states $s_t$ and actions $a_t$ as the trajectory unfolds. Evaluate the performance by comparing the generated trajectory with an expert trajectory. Plot both trajectories.
delta = 0.01
mu = torch.tensor([0.0, 0.0]).double()
sigma = torch.tensor([0.1, 0.1]).double()
class Actor:
def __init__(self, model):
self.model = model
def act(self, s, obs):
if len(s.shape) == 1:
s = s.unsqueeze(0)
s_aug = torch.cat([s, obs], dim=-1) / state_scale
if np.random.rand() < delta:
return mu + sigma * torch.randn((s.shape[0], n_a_dim)).double()
return self.model(s_aug) * action_scale
actor = Actor(pi)
6. Critic Training¶
As stated in Section 4.1, a critic $\tilde{Q}(s, a; \beta)$ is a parametric approximation of the true Q-function $Q(s, a; \pi)$ of a policy $\pi$. To train this approximation, we use policy rollouts to evaluate the Q-function. Starting from the initial condition $(s, a) = (s, \pi(s))$, we execute policy $\pi$ and follow the MDP’s dynamics to estimate the Q-function as
$$ Q(s, a; \pi) = Q(s, \pi(s); \pi) = \sum_{t=0}^\infty \gamma^t \, r(s_t, a_t) $$This is the same as equation (4), except that we set $a = \pi(s)$, as this choice of initial action is required in the definition of the critic.
To train the critic, we evaluate the squared loss of the critic and Q-function difference, averaged over an initial distribution of states $p(s)$,
$$ \beta^*(\pi) = \argmin_\beta \mathbb{E}_{p(s)} \Bigg[\, \frac{1}{2} \Big\| Q(s, a; \pi) - \tilde{Q}(s, a; \beta) \Big\|^2 \, \Bigg], $$where it is implicit that $a = \pi(s)$ in both the Q-function and its critic.
The loss above can be difficult to train because we need to acquire several estimates of the Q-function, each requiring a rollout of the policy $\pi$ to evaluate the discounted sum of rewards.
A simpler training approach can be derived from the recursion in Proposition 1. This recursion claims that for any pair of states $s$ and $s^{+} = f(s, a)$, with corresponding actions $a = \pi(s)$ and $a^{+} = \pi(s^{+})$,
$$ Q(s, a; \pi) = r(s, a) + \gamma Q(s^{+}, a^{+}; \pi) . $$If we have a critic $\tilde{Q}(s, a; \beta)$ that approximates the Q-function well, this critic must satisfy the same recursion. That is,
$$ \tilde{Q}(s, a; \beta) = r(s, a) + \gamma \tilde{Q}(s^{+}, a^{+}; \beta) + \Delta(s, a, \beta), $$for some function $\Delta(s, a, \beta)$ that is small for all state-action pairs $(s, a) = (s, \pi(s))$ and their subsequent pairs $(s^{+}, a^{+}) = (s^{+}, \pi(s^{+}))$.
The smaller the function $\Delta(s, a, \beta)$ is, the closer $\tilde{Q}(s, a; \beta)$ is to satisfying the Q-function recursion and, by extension, the closer $\tilde{Q}(s, a; \beta)$ is to the true Q-function $Q(s, a; \pi)$. We can therefore train the critic $\tilde{Q}(s, a; \beta)$ by searching for a parameter $\beta$ that makes the residuals $\Delta(s, a, \beta)$ small over the state-action pairs of interest.
6.1. Critic Rollouts¶
To solve the statistical risk minimization (SRM) problem in equation (17), we consider samples $s_i \sim p(s)$ to formulate and solve the empirical risk minimization (ERM) problem defined by the loss
$$ \hat{\ell}_{\text{C}} (\alpha, \beta) = \frac{1}{B} \sum_{i=1}^B \frac{1}{2} \Big\| \Big(\tilde{Q}(s_{i}, a_{i}; \beta) - \gamma \tilde{Q}(s^{+}_{i}, a^{+}_{i}; \beta) \Big) - r(s_{i}, a_{i}) \Big\|^2 . $$As should go without saying by now, actions in this expression (as they are in equation (17)) are chosen as $a_i = \pi(s_i, \alpha)$ and as $a^{+}_i = \pi(s^{+}_i, \alpha)$. The next states follow (as they do in equation (17)) from the execution of the MDP dynamics, $s^{+}_i = f(s_i, a_i)$.
An alternative approach to training the critic is to consider policy rollouts. Let $s$ be an initial state and $s_t$ be the trajectory generated by rolling out the policy $\pi$. We train the critic $\tilde{Q}(s, a; \beta)$ by minimizing the empirical risk,
$$ \hat{\ell}_{\text{C}} (\alpha, \beta) = \sum_{t=1}^\infty \frac{1}{2} \Big\| \Big(\tilde{Q}(s_{t}, a_{t}; \beta) - \gamma \tilde{Q}(s^{+}_{t}, a^{+}_{t}; \beta) \Big) - r(s_{t}, a_{t}) \Big\|^2 . $$The ERM problems defined by the losses above are not equivalent. They evaluate the critic fit over different datasets. In the first case, we measure fit over the initial state distribution $p(s)$. In the second case, we measure fit over the distribution of states that are visited when we execute the policy $\pi$.
A third alternative is to combine both approaches. We consider initial state samples $s_i \sim p(s)$ along with trajectories $s_{it}$ obtained by rolling out the policy $\pi$ starting at initial condition $s_i$. We then train the critic to minimize the loss,
$$ \hat{\ell}_{\text{C}} (\alpha, \beta) = \frac{1}{B} \sum_{i=1}^B \sum_{t=0}^{\infty} \frac{1}{2} \Big\| \Big(\tilde{Q}(s_{it}, a_{it}; \beta) - \gamma \tilde{Q}(s^{+}_{it}, a^{+}_{it}; \beta) \Big) - r(s_{it}, a_{it}) \Big\|^2 . $$It is worth comparing these critic losses with the actor losses in Section 4.2. In both cases we can estimate risks with a single policy rollout or by combining policy rollouts with an exogenous distribution of initial states, and in both cases the latter is the preferred training method.
In both cases, we encounter the exploration challenge. Since training states in RL are chosen, we must choose states during training that are representative of states we expect to see during execution. This is why we use a combination of states that we expect to see a priori—drawn from $p(s)$—and states that we expect to see a posteriori—drawn from a policy rollout. As discussed in Section 5, the exploration challenge is unique among the problems we have studied.
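Task 9.6 below trains the critic by regressing on sampled discounted returns, but the temporal-difference losses above are also easy to implement. A minimal sketch of one gradient step on the one-step loss, assuming a critic network q_net(s, a) and batched transition tensors s, a, r, s_next, a_next (all names are placeholders); the bootstrapped target is treated as a constant, which is common practice even though the loss above differentiates through both terms.

def td_critic_step(q_net, optimizer, s, a, r, s_next, a_next, gamma=0.99):
    # One gradient step on (1/2) || (Q(s, a) - gamma * Q(s+, a+)) - r ||^2.
    optimizer.zero_grad()
    with torch.no_grad():
        target = r + gamma * q_net(s_next, a_next)  # bootstrapped target, held fixed
    td_loss = 0.5 * ((q_net(s, a) - target) ** 2).mean()
    td_loss.backward()
    optimizer.step()
    return td_loss.item()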
Task 9.6¶
For a fixed policy $\pi$, train a critic to estimate the Q-function corresponding to this policy. Train this critic by minimizing equation (22). Use $\gamma = 0.99$ as a discount factor, $B = 256$ for the batch size, and $T = 250$ as a large enough number to approximate infinity. Sample initial states from the reference trajectory $\mathcal{S}$.
The critic requires a learning parameterization. We suggest using a fully connected neural network (FCNN) where the inputs are the state ($s$), observation ($o$), and acceleration attributes of a car object instantiated from the class in Task 9.2. The output of this FCNN is the critic value,
$$ \tilde{Q}(s, a; \beta) = \text{FCNN}(s, o, a). $$An FCNN with 4 hidden layers containing $n_\ell = 128$ hidden neurons each has worked well in our experiments.
Train a critic to estimate the Q-function associated with the optimal policy $\pi(\alpha^*)$ learned in Task 9.3. Compute the loss $\Delta(s, a, \beta)$ over a trajectory generated by $\alpha^*$ to evaluate the quality of the critic’s estimates of the Q-function.
class QHat(torch.nn.Module):
def __init__(self):
super(QHat, self).__init__()
self.fc1 = torch.nn.Linear(n_s_obs_dim + n_a_dim, 128)
self.fc2 = torch.nn.Linear(128, 128)
self.fc3 = torch.nn.Linear(128, 128)
self.fc4 = torch.nn.Linear(128, 1)
def forward(self, s_aug, a):
x = torch.cat([s_aug, a], dim=-1)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
q = self.fc4(x)
return q
Q = QHat().double()
class Critic:
def __init__(self, model):
self.model = model
def evaluate(self, s, obs, a):
if len(s.shape) == 1:
s = s.unsqueeze(0)
s_aug = torch.cat([s, obs], dim=-1) / state_scale
a = a / action_scale
return self.model(s_aug, a)
critic = Critic(Q)
rollouts = 1_000
states, observations, actions, rewards, returns = [], [], [], [], []
pi.eval()
for _ in range(rollouts):
s, obs, a, r, R = rollout(actor)
states.append(s)
observations.append(obs)
actions.append(a)
rewards.append(r)
returns.append(R)
states = torch.vstack(states).detach()
observations = torch.vstack(observations).detach()
actions = torch.vstack(actions).detach()
rewards = torch.concat(rewards).detach()
returns = torch.concat(returns).detach()
B = 256
episodes = 1_000
Q.train()
Q_opt = torch.optim.Adam(Q.parameters(), lr=1e-4)
loss = []
for episode in trange(episodes):
Q_opt.zero_grad()
indices = torch.randperm(states.shape[0])[:B]
b_states = states[indices]
b_observations = observations[indices]
b_actions = actions[indices]
b_returns = returns[indices].unsqueeze(1)
Q_hat = critic.evaluate(b_states, b_observations, b_actions)
Q_loss = torch.nn.MSELoss()(Q_hat, b_returns).mean()
Q_loss.backward()
Q_opt.step()
loss.append(Q_loss.detach().item())
plt.plot(loss)
plt.xlim(0, episodes)
plt.xlabel('Episodes')
plt.ylabel('Loss')
plt.show()
7. Actor-Critic Reinforcement Learning¶
In Task 9.3, we assume that a critic is available and use it to train a policy. In Task 9.6, we assume a policy is available and use it to train a critic. There seems to be some circular reasoning here that we have not fully explained.
To clarify, recall that the goal of an actor is to maximize the actor loss defined in equation (16) when a critic is given,
$$ \alpha^*(\beta) = \argmax_\alpha \ell_\text{A} (\alpha, \beta) $$Similarly, the goal of a critic is to minimize the critic loss defined in equation (17) when a policy is given,
$$ \beta^*(\alpha) = \argmin_\beta \ell_\text{C} (\alpha, \beta) $$This presents a chicken-and-egg situation. The problem we would like to solve is finding the actor $\alpha^*(\beta^*(\alpha^*))$. That is, knowing the optimal policy $\alpha^*$, we find the corresponding optimal critic $\beta^*(\alpha^*)$. This is the “egg” in Task 9.6. Then, knowing this optimal critic, we find the optimal policy $\alpha^*(\beta^*(\alpha^*))$. This is the “chicken” in Task 9.3. We need the chicken $\alpha^*$ to lay the egg $\beta^*(\alpha^*)$ to birth the chicken $\alpha^*(\beta^*(\alpha^*))$.
As with chickens and eggs, the answer is that they evolve together. Thus, for an iteration index $k$ and a policy initialization $\alpha_0$, we proceed through the recursive updates,
$$ \beta_k = \argmin_\beta ~\ell_\text{C} (\alpha_{k}, \beta), $$$$ \alpha_{k+1} = \argmax_\alpha ~\ell_\text{A} (\alpha, \beta_k). $$For a given policy $\alpha_{k}$, we compute the critic $\beta_k$ (the chicken $\alpha_k$ lays the egg $\beta_k$). With this critic available, we update the actor policy to $\alpha_{k+1}$ (the egg $\beta_k$ births a better chicken $\alpha_{k+1}$). We then repeat this process with updated critics and policies.
As in Sections 4.2 and 6.1, we use empirical approximations of the statistical losses. Thus, instead of the iterations above, we use the iterations
$$ \beta_k = \argmin_\beta ~ \hat{\ell}_\text{C} (\alpha_{k}, \beta), $$$$ \alpha_{k+1} = \argmax_\alpha ~ \hat{\ell}_\text{A} (\alpha, \beta_k). $$The empirical actor loss $\hat{\ell}_\text{A} (\alpha, \beta_k)$ is given in equation (18). The empirical critic loss $\hat{\ell}_\text{C} (\alpha_{k}, \beta)$ is given in equation (22). The iteration described by these equations is the actor-critic method.
Task 9.7¶
Train the actor and the critic using the iterative actor-critic method described by equation (23). Use the same FCNN parameterization from Task 9.3 for the actor and the same FCNN parameterization from Task 9.6 for the critic. As in Tasks 9.3 and 9.6, sample initial conditions from the expert trajectory $\mathcal{S}$. Also, use the same hyperparameters $\gamma$, $B$, and $T$ as in Tasks 9.3 and 9.6.
Although equation (23) calls for iterative critic minimization and actor maximization, in practice, we run $K_\text{C}$ stochastic gradient descent steps on the critic loss and $K_\text{A}$ stochastic gradient ascent steps on the actor loss. Set $K_\text{C} = K_\text{A} = 20$.
Actor-critic training is more computationally expensive than separate actor and critic training. To reduce computational cost, use experience replay buffers. This means storing all states $s_{it}$, actions $a_{it}$, rewards $r(s_{it}, a_{it})$, and the next states $s^{+}_{it}$ and next actions $a^{+}_{it}$ generated by policy rollouts. When implementing gradient steps in the actor and critic losses, draw batches from these stored histories instead of generating new trajectories each time.
Evaluate the accumulated reward per episode and the critic’s loss during training. Comment on your observations.
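Before the full implementation, here is a schematic of the alternation in equation (23) with $K_\text{C}$ critic steps and $K_\text{A}$ actor steps per iteration. The helpers critic_loss_fn and actor_loss_fn are placeholders for the empirical losses; the working code below interleaves the updates slightly differently.

def actor_critic_iteration(critic_opt, actor_opt, critic_loss_fn, actor_loss_fn, K_C=20, K_A=20):
    # One outer iteration of equation (23): K_C gradient steps on the critic loss,
    # followed by K_A gradient steps on the actor loss (actor_opt is created with maximize=True).
    for _ in range(K_C):
        critic_opt.zero_grad()
        loss_c = critic_loss_fn()
        loss_c.backward()
        critic_opt.step()
    for _ in range(K_A):
        actor_opt.zero_grad()
        loss_a = actor_loss_fn()
        loss_a.backward()
        actor_opt.step()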
class ReplayBuffer:
def __init__(self, size):
self.buffer = deque(maxlen=size)
def __len__(self):
return len(self.buffer)
def push(self, s, obs, a, R):
experiences = torch.cat([s.detach(), obs.detach(), a.detach(), R.unsqueeze(1).detach()], dim=-1)
for experience in experiences:
self.buffer.append(experience)
def sample(self, batch_size):
batch = torch.stack(random.sample(self.buffer, batch_size))
s = batch[:, :n_s_dim]
obs = batch[:, n_s_dim :n_s_obs_dim]
a = batch[:, n_s_obs_dim :n_s_obs_dim + n_a_dim]
R = batch[:, n_s_obs_dim + n_a_dim :n_s_obs_dim + n_a_dim + 1]
return s, obs, a, R
buffer = ReplayBuffer(100_000)
Q = QHat().double()
pi = Policy().double()
critic = Critic(Q)
actor = Actor(pi)
Q_opt = torch.optim.Adam(Q.parameters(), lr=1e-4)
pi_opt = torch.optim.Adam(pi.parameters(), lr=2e-5, maximize=True)
q_iter = 3
pi_iter = 1
B_q = 256
B_pi = 1
episodes = 100_000
episodes_expl = 2500
for episode in trange(episodes):
# Critic
Q.train()
pi.eval()
is_exploration = episode < episodes_expl
s, obs, a, _, R = rollout(actor, random_action=is_exploration)
buffer.push(s, obs, a, R)
if len(buffer) > B_q:
for i in range(q_iter):
Q_opt.zero_grad()
b_states, b_observations, b_actions, b_returns = buffer.sample(B_q)
Q_hat = critic.evaluate(b_states, b_observations, b_actions)
Q_loss = torch.nn.MSELoss()(b_returns, Q_hat)
Q_loss.backward()
Q_opt.step()
if is_exploration:
continue
# Actor
Q.eval()
pi.train()
s, obs, _, _, _ = rollout(actor)
pi_opt.zero_grad()
a = actor.act(s, obs)
Q_hat = critic.evaluate(s, obs, a).mean()
Q_hat.backward()
pi_opt.step()
    if episode % 100 == 0:  # logging interval (assumed)
        clear_output()
        print(f'Episode {episode+1}\t Policy loss: {Q_hat}\t Trajectory size: {s.shape[0]}')
Task 9.8¶
Implement the policy $\pi(\alpha^*)$ computed in Task 9.7. Record the sequence of states $s_t$ and actions $a_t$ as the trajectory unfolds. Evaluate the performance by comparing the generated trajectory with an expert trajectory. Plot both trajectories.
s, _, _, _, _ = rollout(actor)
track = load_track("data/track.npz")
plt.figure()
track.plot()
plt.grid(True)
plt.xlabel("x (m)")
plt.ylabel("y (m)")
plt.plot(simulator.ref[:, 0], simulator.ref[:, 1], "--", label="Expert")
plt.plot(s[:, 0], s[:, 1], "-", label='RL')
plt.legend(loc="upper right", framealpha=1.0)
plt.show()