Imitation Learning: How well does it perform?
We share a recent paper that provides a taxonomic framework and theoretical performance bounds for IL algorithms, then dive into the craft of proving such bounds.
By Surya Vengadesan
Imitation learning (IL) broadly encompasses any algorithm (e.g. BC) that attempts to learn a policy for an MDP given expert demonstrations. In this blog post, we will discuss how this task can be divided, with some mathematical rigor, into three subclasses of IL algorithms. In particular, we will describe ideas presented in this recent paper, which provides a taxonomic framework for imitation learning algorithms.
A key idea behind this paper is to define a metric that generalizes the goal of IL algorithms at large. Simply put: find a policy that minimizes the imitation gap J(πE)−J(π) between our learned policy π and an expert policy πE. To do so, the paper works with what it calls a moment, which is the expectation of some function over trajectories; J(⋅) itself is the moment taken with respect to the true reward.
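Concretely (using trajectory notation of our own rather than the paper's exact symbols), the value of a policy over a horizon T, and the moment of a policy with respect to a function f, can be written as:

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} r(s_t, a_t)\right],
\qquad
\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} f(s_t, a_t)\right],
$$

where the second expression is the moment of π with respect to f.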
The goal is then to match the expert's and the learner's moments, which the paper refers to as moment matching. In particular, each type of IL algorithm can be defined by the type of moments being matched, such as reward or Q-value moments. Through this lens, IL can be viewed as applying optimization techniques to minimize the imitation gap.
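To make moment matching concrete, here is a minimal numerical sketch (our own toy construction, not the paper's algorithm; the trajectory format and names such as `moment_gap` are assumptions) of measuring how far apart the expert's and learner's moments are for a candidate reward function f:

```python
import numpy as np

def moment_gap(expert_trajs, learner_trajs, f):
    """Empirical gap between expert and learner moments for a candidate reward f.

    Each trajectory is a list of (state, action) pairs, and f maps a
    (state, action) pair to a scalar in [-1, 1]. The returned quantity is the
    difference between the expert's and the learner's average per-trajectory
    sum of f, which is the quantity a moment-matching learner tries to drive
    toward zero.
    """
    expert_moment = np.mean([sum(f(s, a) for s, a in traj) for traj in expert_trajs])
    learner_moment = np.mean([sum(f(s, a) for s, a in traj) for traj in learner_trajs])
    return expert_moment - learner_moment

# A (very schematic) adversarial moment-matching loop would alternate:
#   1. find an f in the discriminator class on which moment_gap is largest,
#   2. update the policy (e.g. by running RL against -f) to shrink that gap.
```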
The paper then goes on to clearly define the three types of IL algorithms and provides theoretical guarantees for each, proof after proof. Specifically, the author delineates between (1) matching rewards with access to the environment, (2) matching Q-values with access to the environment, and (3) matching Q-values without access to the environment.
Now, we’ll attempt to deconstruct proofs from the paper that determine whether an algorithm’s imitation gap grows linearly or quadratically with the time horizon.
In particular, we will cover the paper’s first lemma, which proves a linear upper bound on the imitation gap for type (1) algorithms that moment-match reward functions, like GAIL or SQIL. In the process, we will break down the craft of proving algorithmic bounds, then welcome the reader to decode the remaining proofs.
Lemma 1 (Reward Upper Bound): If Fr spans F, then for all MDPs, πE, and π ← Ψϵ(U1):
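The inequality that completes the statement is the linear-in-horizon bound; since we are not reproducing the paper's exact constants here, we state it in big-O form:

$$
J(\pi_E) - J(\pi) \;\leq\; O(\epsilon T).
$$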
That sounds like a handful; now, let’s break this down.
The author precedes this lemma with a general framework that formalizes the goal of imitation learning algorithms: we should look at imitation learning as a two-player game. We define the payoff function U1 of this game as follows:
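A reconstruction of that payoff consistent with the description below (the 1/T normalization is our assumption, made so that U1 stays bounded independently of the horizon):

$$
U_1(\pi, f) \;=\; \frac{1}{T}\,\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} f(s_t, a_t)\right]
\;-\; \frac{1}{T}\,\mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^{T} f(s_t, a_t)\right].
$$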
Above, we see the difference between two moments: the first is taken over the sum of the candidate reward f up to a time horizon T while following the policy π, and the second over the same sum up to the same horizon while following the expert policy πE. Therefore, given some expert policy πE, the game is to choose a policy π that best matches the moments of the expert, while being robust to all possible reward functions f.
The paper defines Player 1 as the IL algorithm, which attempts to select some policy π from the set of all possible policies, Π, and whose goal is to minimize the payoff U1. Player 2 is a discriminating reward function f selected from the class of bounded reward functions Fr = {f : S × A → [−1,1]}, whose goal is to maximize U1.
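Putting the two players together, the learner is (approximately) solving the following minimax problem:

$$
\min_{\pi \in \Pi}\; \sup_{f \in F_r}\; U_1(\pi, f).
$$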
The notion of a δ-approximate solution to this game is also introduced. Given a payoff Uj, a pair (f^, π^) is a δ-approximate equilibrium solution if:
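Spelling this out (our rendering of the standard approximate-equilibrium requirement, with the supremum running over the relevant discriminator class), neither player can gain more than δ by deviating unilaterally:

$$
U_j(\hat{\pi}, \hat{f}) \;\geq\; \sup_{f} U_j(\hat{\pi}, f) - \delta
\qquad \text{and} \qquad
U_j(\hat{\pi}, \hat{f}) \;\leq\; \inf_{\pi \in \Pi} U_j(\pi, \hat{f}) + \delta.
$$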
In addition, we define a function that magically solves this game for us, which we call the imitation game δ-oracle Ψδ(⋅). It takes in a payoff function Uj and returns a (kδ)-approximate equilibrium solution (f^, π^) that satisfies the condition above.
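In symbols, a single call to the oracle looks like:

$$
(\hat{f}, \hat{\pi}) \;\leftarrow\; \Psi_{\delta}(U_j).
$$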
Using the tools built up, we can now prove Lemma 1. First, one needs to expand the imitation gap, based on its definition in the paper:
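A sketch of that step in our notation: expanding the gap and treating the true reward as an element of Fr for simplicity (handling a reward that is merely in the span of Fr is where extra constant factors can enter), we can pass to a supremum over the discriminator class:

$$
J(\pi_E) - J(\pi)
= \mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^{T} r(s_t, a_t)\right] - \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]
\;\leq\; \sup_{f \in F_r}\left(\mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^{T} f(s_t, a_t)\right] - \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} f(s_t, a_t)\right]\right).
$$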
Next, since the supremum is always non-negative, we can multiply it by two and the result will still upper-bound the expression above. In addition, by the definition of U1 (together with the fact that Fr is closed under negation), the term we are taking the supremum over is simply T multiplied by U1. Therefore, we have:
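In symbols (flipping the sign of f inside the supremum leaves its value unchanged, since Fr is closed under negation):

$$
J(\pi_E) - J(\pi)
\;\leq\; 2\,\sup_{f \in F_r}\left(\mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^{T} f(s_t, a_t)\right] - \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{T} f(s_t, a_t)\right]\right)
\;=\; 2T\,\sup_{f \in F_r} U_1(\pi, f).
$$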
Next, if we apply the δ-approximate equilibrium definition, we can see that:
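Writing out both halves of the approximate-equilibrium condition for the returned pair (f^, π^), and writing δ for the oracle's overall approximation level (absorbing the constant k):

$$
\sup_{f \in F_r} U_1(\hat{\pi}, f)
\;\leq\; U_1(\hat{\pi}, \hat{f}) + \delta
\;\leq\; \inf_{\pi \in \Pi} U_1(\pi, \hat{f}) + 2\delta,
$$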
where f^ and π^, the values returned by the oracle, can be compared against the result of performing the minimax over the policy class Π:
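Concretely, the infimum at the particular discriminator f^ is at most the value of the full minimax problem:

$$
\inf_{\pi \in \Pi} U_1(\pi, \hat{f}) \;\leq\; \min_{\pi \in \Pi}\, \sup_{f \in F_r} U_1(\pi, f),
$$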
which is equal to 0, since the expert policy πE ∈ Π and U1(πE, f) (as defined above) collapses to 0 for every f: the two moments coincide when π = πE.
Finally, we have that
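Chaining the previous displays together (the factor of 2 in front of δ comes from combining both halves of the equilibrium condition):

$$
\frac{1}{2T}\big(J(\pi_E) - J(\hat{\pi})\big) - 2\delta
\;\leq\; \sup_{f \in F_r} U_1(\hat{\pi}, f) - 2\delta
\;\leq\; 0,
$$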
and if you multiply this inequality by 2T, and bring the δ term to the right, you get
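In our notation (with the particular constants coming from the reconstruction above):

$$
J(\pi_E) - J(\hat{\pi}) \;\leq\; 4\delta T.
$$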
This matches the ϵ definition from Lemma 1, so we finally get:
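Up to the constant factors we have been suppressing (including the oracle's factor k relating δ to ϵ), this is exactly the linear-in-horizon guarantee stated in Lemma 1:

$$
J(\pi_E) - J(\pi) \;\leq\; O(\epsilon T).
$$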
The next lemma goes on to prove that the lower bound is also linear.
Lemma 2: There exists an MDP, πE, and π ← Ψϵ(U1) such that
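The inequality that completes this statement is a matching lower bound, which (again up to constant factors) says the gap can genuinely grow linearly in the horizon:

$$
J(\pi_E) - J(\pi) \;\geq\; \Omega(\epsilon T).
$$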
However, it takes different techniques to prove this statement, requiring an in-depth understanding of more definitions in order to creatively chain together multiple logical tricks. Together, Lemmas 1 and 2 show that for any IL algorithm that attempts to match reward moments, the imitation gap grows linearly in the horizon in both the best and worst case.
The paper goes on to prove bounds for the other two taxonomic categories, introduces a mathematical concept for MDPs called “recoverability”, and proposes two new algorithms based on the proven bounds, one of which, Adversarial Reward-moment Imitation Learning, is claimed to come with better performance guarantees than SQIL and GAIL.
If this material sounds interesting, check out the further readings below to get a glimpse into the world of Imitation Learning and how researchers today are incorporating rigor into the field in creative ways.
Also check out this YouTube playlist created by Swamy et al., which gives an excellent overview of the work discussed in this blog post.