Evaluating trained RL policies offline is extremely important in real-world production: a trained policy with unexpected behaviors or unsuccessful learning would cause the system to regress online, so the safe thing to do is to evaluate its performance on the offline training data and, based on that, decide whether to deploy. Evaluating policies offline is an ongoing research topic called “counterfactual policy evaluation” (CPE): how would a policy perform when we only have trajectories that were generated by some other policy?
CPE for contextual bandit
We first look at CPE on contextual bandit problems. The contextual bandit problem is to take an action $a$ at each state $s$, where $s$ is drawn from a distribution $d(s)$. We can only observe the reward associated to that specific pair of $s$ and $a$: $r(s, a)$. Rewards associated to other actions not taken can’t be observed. The next state, at which you would take the next action, is not affected by the current state or action. Essentially, contextual bandit problems are Markov Decision Problems with horizon (the number of steps) equal to one. Suppose we have a dataset $D$ with $n$ samples, each sample being a tuple $(s_i, a_i, r_i)$. $a_i$ is sampled from a behavior policy $\pi_b(\cdot|s_i)$. $r_i$ is calculated based on an underlying but non-observable reward function $r(s, a)$. For now, we ignore any noise that could exist in reward collection. To simplify our calculation, we will assume that the decisions of $\pi_b$ are independent across samples: this assumption bypasses the possibility that $\pi_b$ may maintain an internal memory (the history of observations) used to facilitate its future decisions. We will also assume that during data collection we can access the true probability $\pi_b(a_i|s_i)$, the true action propensity of the behavior policy. This is not difficult to achieve in practice because we usually have direct access to the current policy’s model. We will call the policy that we want to evaluate the target policy, denoted as $\pi_e$.
Given all these notations defined, the value function of $\pi_e$ is:

$$V^{\pi_e} = \mathbb{E}_{s \sim d(s),\, a \sim \pi_e(\cdot|s)}\left[ r(s, a) \right]$$
A straightforward way to estimate $V^{\pi_e}$, called the Direct Method (DM), is to train an approximated reward function $\hat{r}(s, a)$ and plug it into the definition of $V^{\pi_e}$:

$$\hat{V}_{DM} = \frac{1}{n}\sum_{i=1}^{n} \sum_{a} \pi_e(a|s_i)\, \hat{r}(s_i, a)$$
The bias of $\hat{r}$ directly determines the bias of $\hat{V}_{DM}$. The problem is that if $\hat{r}$ is only trained on samples generated by $\pi_b$, these samples may not sufficiently cover the state/action pairs relevant to $\pi_e$, so $\hat{r}$ could be very biased and inaccurate.
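To make DM concrete, here is a minimal Python sketch for a discrete action space. The function and argument names (dm_estimate, target_prob, reward_model, etc.) are our own illustrative assumptions; reward_model stands for any $\hat{r}(s, a)$ fitted on the logged data:

```python
import numpy as np

def dm_estimate(states, action_space, target_prob, reward_model):
    """Direct Method sketch: for each logged state, average the model-predicted
    reward over the target policy's action distribution, then average over states."""
    values = []
    for s in states:
        # expected reward of the target policy at state s, according to the model
        values.append(sum(target_prob(s, a) * reward_model(s, a) for a in action_space))
    return float(np.mean(values))
```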
The second method is called Inverse Propensity Scoring (IPS). Its formulation is basically importance sampling on rewards:

$$\hat{V}_{IPS} = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_e(a_i|s_i)}{\pi_b(a_i|s_i)}\, r_i$$
We can prove that $\hat{V}_{IPS}$ is an unbiased estimator of $V^{\pi_e}$ (I get some ideas from [4]):

$$\begin{aligned}
\mathbb{E}\left[\hat{V}_{IPS}\right] &= \mathbb{E}_{s \sim d(s),\, a \sim \pi_b(\cdot|s)}\left[\frac{\pi_e(a|s)}{\pi_b(a|s)}\, r(s, a)\right] \\
&= \mathbb{E}_{s \sim d(s)}\left[\sum_{a} \pi_b(a|s)\, \frac{\pi_e(a|s)}{\pi_b(a|s)}\, r(s, a)\right] \\
&= \mathbb{E}_{s \sim d(s)}\left[\sum_{a} \pi_e(a|s)\, r(s, a)\right] = V^{\pi_e}
\end{aligned}$$
There is one condition that must be satisfied for the proof to hold: if $\pi_e(a|s) > 0$ then $\pi_b(a|s) > 0$. First, we don’t need to worry about a zero $\pi_b(a_i|s_i)$ appearing in the denominator of $\hat{V}_{IPS}$ because such samples would never be collected in $D$ in the first place. However, if $\pi_b(a|s) = 0$ while $\pi_e(a|s) > 0$ for some $(s, a)$, there would never be data to evaluate the outcome of taking action $a$ at state $s$, which means $\hat{V}_{IPS}$ becomes a biased estimator of $V^{\pi_e}$. (Part of these ideas comes from [7].)
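A minimal Python sketch of the IPS estimator, assuming the true propensities $\pi_b(a_i|s_i)$ were logged at collection time (all names are hypothetical):

```python
import numpy as np

def ips_estimate(rewards, behavior_props, target_props):
    """IPS sketch: reweight each logged reward by pi_e(a_i|s_i) / pi_b(a_i|s_i)."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_props, dtype=float) / np.asarray(behavior_props, dtype=float)
    return float(np.mean(weights * rewards))

# e.g., ips_estimate([1.0, 0.0, 1.0], [0.5, 0.9, 0.2], [0.7, 0.3, 0.6])
```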
The intuition behind $\hat{V}_{IPS}$ is that:
- If $r_i$ is a large (good) reward, and if $\pi_e(a_i|s_i) > \pi_b(a_i|s_i)$, then we guess that $\pi_e$ is probably better than $\pi_b$ because $\pi_e$ may give larger probabilities than $\pi_b$ to actions leading to good rewards.
- If $r_i$ is a small (bad) reward, and if $\pi_e(a_i|s_i) > \pi_b(a_i|s_i)$, then we guess that $\pi_e$ is probably worse than $\pi_b$ because $\pi_e$ may give larger probabilities than $\pi_b$ to actions leading to bad rewards.
- We can reverse the relationship such that if $\pi_e(a_i|s_i) < \pi_b(a_i|s_i)$, we can guess $\pi_e$ is better than $\pi_b$ if $r_i$ itself is a bad reward: $\pi_e$ is probably a better policy trying to avoid actions leading to bad rewards.
- If $\pi_e(a_i|s_i) < \pi_b(a_i|s_i)$ and $r_i$ is a good reward, $\pi_b$ is probably a better policy trying to take actions leading to good rewards.
We can also relate the intuition of $\hat{V}_{IPS}$ to Inverse Probability of Treatment Weighting (IPTW) in observational studies [5]. In observational studies, researchers want to determine the effect of a certain treatment while the treatment and control selection of subjects is not totally randomized. IPTW is one kind of adjustment to observational data for measuring the treatment effect correctly. One intuitive explanation of IPTW is to weight each subject by how much information that subject could reveal [2]. For example, if a subject who has a low probability of receiving treatment and a high probability of staying in the control group happens to receive treatment, his treatment result should be very informative because he represents the characteristics mostly associated with the control group. In a similar way, we can consider the importance sampling ratio $\frac{\pi_e(a_i|s_i)}{\pi_b(a_i|s_i)}$ of each sample as an indicator of how much information that sample reveals. When the behavior policy neglects some action that is instead emphasized by $\pi_e$ (i.e., $\pi_b(a_i|s_i) \ll \pi_e(a_i|s_i)$), the sample carries valuable counterfactual information, hence we need to “amplify” $r_i$.
The third method to estimate $V^{\pi_e}$, called Doubly Robust (DR) [1], combines the direct method and the IPS method:

$$\hat{V}_{DR} = \frac{1}{n}\sum_{i=1}^{n} \left[ \sum_{a} \pi_e(a|s_i)\, \hat{r}(s_i, a) + \frac{\pi_e(a_i|s_i)}{\pi_b(a_i|s_i)} \left( r_i - \hat{r}(s_i, a_i) \right) \right]$$
Given our assumption that we know the true action propensity $\pi_b(a_i|s_i)$, $\hat{V}_{DR}$ is still an unbiased estimator:

$$\begin{aligned}
\mathbb{E}\left[\hat{V}_{DR}\right] &= \mathbb{E}_{s \sim d(s)}\left[\sum_{a} \pi_e(a|s)\, \hat{r}(s, a)\right] + \mathbb{E}_{s \sim d(s),\, a \sim \pi_b(\cdot|s)}\left[\frac{\pi_e(a|s)}{\pi_b(a|s)}\left( r(s, a) - \hat{r}(s, a) \right)\right] \\
&= \mathbb{E}_{s \sim d(s)}\left[\sum_{a} \pi_e(a|s)\, \hat{r}(s, a)\right] + \mathbb{E}_{s \sim d(s)}\left[\sum_{a} \pi_e(a|s)\left( r(s, a) - \hat{r}(s, a) \right)\right] \\
&= \mathbb{E}_{s \sim d(s)}\left[\sum_{a} \pi_e(a|s)\, r(s, a)\right] = V^{\pi_e}
\end{aligned}$$
However, in the original DR paper [1], the authors don’t assume that the true action propensity can be accurately obtained. This means $\hat{V}_{DR}$ can be a biased estimator; however, if the estimated propensity or the estimated reward function has bias within certain bounds, $\hat{V}_{DR}$ would have far lower bias than either of the other two estimators.
Under the same assumption that the true action propensity is accessible, we can compare the variances of the DM, DR, and IPS estimators ($\hat{V}_{DM}$, $\hat{V}_{DR}$, and $\hat{V}_{IPS}$) based on the formulas given in Theorem 2 in [1], with the propensity error set to zero because we assume we know the true action propensity. Under some reasonable assumptions stated in [1], the variance of $\hat{V}_{DR}$ should fall between that of $\hat{V}_{DM}$ and that of $\hat{V}_{IPS}$. Therefore, $\hat{V}_{DR}$ should be overall favored because of its zero bias and intermediate variance.
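Putting the two pieces together, here is a minimal Python sketch of the DR estimator for contextual bandits; as before, the names are illustrative and reward_model is any fitted $\hat{r}(s, a)$:

```python
import numpy as np

def dr_estimate(samples, action_space, behavior_prop, target_prob, reward_model):
    """Doubly Robust sketch: DM term plus an importance-weighted correction
    of the reward model's residual on the logged action.

    samples: iterable of logged (s, a, r) tuples
    behavior_prop(s, a): logged true propensity pi_b(a|s)
    target_prob(s, a):   pi_e(a|s) of the policy being evaluated
    reward_model(s, a):  r_hat(s, a) fitted on the logged data
    """
    values = []
    for s, a, r in samples:
        dm_term = sum(target_prob(s, a_) * reward_model(s, a_) for a_ in action_space)
        correction = target_prob(s, a) / behavior_prop(s, a) * (r - reward_model(s, a))
        values.append(dm_term + correction)
    return float(np.mean(values))
```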
CPE for MDP
We have completed introducing several CPE methods on contextual bandit problems. Now we move our focus to CPE on Markov Decision Problems with horizon larger than 1. The value function of a policy $\pi_e$ is:

$$V^{\pi_e} = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t\, r(s_t, a_t) \;\middle|\; s_0 \sim d_0,\; a_t \sim \pi_e(\cdot|s_t),\; s_{t+1} \sim P(\cdot|s_t, a_t) \right]$$

where $T$ is the number of steps in one trajectory. Simplifying the notation of this formula, we get:

$$V^{\pi_e} = \mathbb{E}_{\tau \sim (d_0, \pi_e, P)}\left[ G(\tau) \right]$$

where $\tau = (s_0, a_0, r_0, \dots, s_{T-1}, a_{T-1}, r_{T-1})$ is a single trajectory, $P$ denotes the transition dynamics, and $G(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$ is the accumulated discounted reward of the trajectory $\tau$.
Following the idea of IPS, we can build importance-sampling based CPE of $V^{\pi_e}$ from trajectories generated by $\pi_b$. With the per-step ratio $\rho_t^i = \frac{\pi_e(a_t^i|s_t^i)}{\pi_b(a_t^i|s_t^i)}$ and the cumulative weight $w_t^i = \prod_{t'=0}^{t} \rho_{t'}^i$, the trajectory-wise IS estimator and the per-decision IS estimator are:

$$\hat{V}_{IS} = \frac{1}{n}\sum_{i=1}^{n} w_{T-1}^i\, G(\tau_i), \qquad \hat{V}_{PDIS} = \frac{1}{n}\sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t\, w_t^i\, r_t^i$$
$\hat{V}_{IS}$ is an unbiased estimator of $V^{\pi_e}$ under the same condition required for $\hat{V}_{IPS}$: if $\pi_e(a|s) > 0$ then $\pi_b(a|s) > 0$. $\hat{V}_{PDIS}$ is also an unbiased estimator of $V^{\pi_e}$, as proved in [6]. However, $\hat{V}_{PDIS}$ has a lower upper bound than $\hat{V}_{IS}$, therefore in practice $\hat{V}_{PDIS}$ usually enjoys lower variance than $\hat{V}_{IS}$ (compare Chapters 3.5 and 3.6 in [9]).
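A minimal Python sketch of both estimators, assuming equal-length logged trajectories stored as lists of (s, a, r) steps (all names are our own):

```python
import numpy as np

def is_and_pdis_estimates(trajectories, gamma, behavior_prob, target_prob):
    """Trajectory-wise IS and per-decision IS sketches.

    trajectories: list of trajectories, each a list of (s, a, r) steps
    behavior_prob(s, a), target_prob(s, a): action probabilities of pi_b and pi_e
    """
    is_vals, pdis_vals = [], []
    for traj in trajectories:
        ratios = np.array([target_prob(s, a) / behavior_prob(s, a) for s, a, _ in traj])
        rewards = np.array([r for _, _, r in traj])
        discounts = gamma ** np.arange(len(traj))
        w = np.cumprod(ratios)                               # w_t = prod_{t' <= t} rho_{t'}
        is_vals.append(w[-1] * np.sum(discounts * rewards))  # one weight per trajectory
        pdis_vals.append(np.sum(discounts * w * rewards))    # per-step weights
    return float(np.mean(is_vals)), float(np.mean(pdis_vals))
```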
Following the idea of DR, [3] proposed an unbiased DR-based CPE for finite horizons, and later [8] extended it to the infinite-horizon case:

$$\hat{V}_{DR} = \frac{1}{n}\sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \left[ w_t^i \left( r_t^i - \hat{q}(s_t^i, a_t^i) \right) + w_{t-1}^i\, \hat{v}(s_t^i) \right]$$
where $\hat{q}(s, a)$ and $\hat{v}(s)$ are the learned q-value and state-value functions (fit on the data collected by the behavior policy $\pi_b$), $w_t^i$ is the cumulative importance weight defined above, and we define the corner case $w_{-1}^i = 1$.
Based on Theorem 1 in [3], $\hat{V}_{DR}$ should have lower variance than $\hat{V}_{PDIS}$, after realizing that per-decision-IS CPE is a special case of DR CPE (with $\hat{q} \equiv 0$ and $\hat{v} \equiv 0$).
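A minimal Python sketch of the trajectory DR estimator under the same assumptions as above, with hypothetical q_hat and v_hat callables standing in for the learned value functions:

```python
import numpy as np

def dr_mdp_estimate(trajectories, gamma, behavior_prob, target_prob, q_hat, v_hat):
    """Doubly Robust sketch for finite-horizon trajectories.

    q_hat(s, a) and v_hat(s) are the learned value functions used as the
    model-based baseline; w_{-1} = 1 is the corner case from the text.
    """
    values = []
    for traj in trajectories:
        total, w_prev = 0.0, 1.0                      # w_{-1} = 1
        for t, (s, a, r) in enumerate(traj):
            w = w_prev * target_prob(s, a) / behavior_prob(s, a)
            total += gamma ** t * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
        values.append(total)
    return float(np.mean(values))
```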
Another CPE method, also proposed by [6], is called Weighted Doubly Robust (WDR). It normalizes the cumulative importance weights $w_t^i$ in $\hat{V}_{DR}$:

$$\hat{V}_{WDR} = \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \left[ \tilde{w}_t^i \left( r_t^i - \hat{q}(s_t^i, a_t^i) \right) + \tilde{w}_{t-1}^i\, \hat{v}(s_t^i) \right]$$

where $\tilde{w}_t^i = \frac{w_t^i}{\sum_{j=1}^{n} w_t^j}$, with the corner case $\tilde{w}_{-1}^i = \frac{1}{n}$.
$\hat{V}_{WDR}$ is a biased estimator, but since the normalization truncates the range of $\tilde{w}_t^i$ in $\hat{V}_{WDR}$ to $[0, 1]$, WDR can have lower variance than DR when fewer samples are available. The variance reduction of WDR typically produces a larger reduction in expected squared error than the additional error incurred by its bias (see Chapter 3.8 of [9]). Furthermore, under reasonable assumptions WDR is a consistent estimator, so the bias becomes less of an issue as the sample size increases. (As a side note, an estimator can be any combination of biased/unbiased and inconsistent/consistent [10].)
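A minimal Python sketch of WDR, again under the simplifying assumption of equal-length trajectories; note that the normalized weights already sum to one over trajectories, so no 1/n factor is applied:

```python
import numpy as np

def wdr_estimate(trajectories, gamma, behavior_prob, target_prob, q_hat, v_hat):
    """Weighted Doubly Robust sketch: DR with the cumulative importance weights
    normalized across trajectories at every step."""
    n, horizon = len(trajectories), len(trajectories[0])
    w = np.ones((n, horizon))
    for i, traj in enumerate(trajectories):
        w[i] = np.cumprod([target_prob(s, a) / behavior_prob(s, a) for s, a, _ in traj])
    w_tilde = w / w.sum(axis=0, keepdims=True)                             # tilde w_t^i
    w_tilde_prev = np.hstack([np.full((n, 1), 1.0 / n), w_tilde[:, :-1]])  # tilde w_{-1}^i = 1/n
    total = 0.0
    for i, traj in enumerate(trajectories):
        for t, (s, a, r) in enumerate(traj):
            total += gamma ** t * (w_tilde[i, t] * (r - q_hat(s, a))
                                   + w_tilde_prev[i, t] * v_hat(s))
    return float(total)
```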
Direct Method CPE on trajectories learns to reconstruct the MDP. This means DM needs to learn the transition function and the reward function in order to simulate episodes. The biggest concern is whether the collected samples are sufficient to train a low-bias model. If they are, then DM can actually achieve very good performance. Section 6 of [8] lists a situation in which DM can outperform DR and WDR: DM has low variance and low bias (even no bias), while the stochasticity of rewards and state transitions produces high variance in DR and WDR.
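For a small tabular MDP, a model-based DM estimator can be sketched as follows; p_hat, r_hat, and d0_hat stand for any learned transition, reward, and start-state models (all names are illustrative):

```python
import numpy as np

def dm_mdp_estimate(p_hat, r_hat, d0_hat, target_policy, gamma, horizon,
                    n_rollouts=1000, seed=0):
    """Model-based Direct Method sketch: roll out the target policy inside the
    learned model and average the simulated discounted returns.

    p_hat:         (S, A, S) estimated transition probabilities
    r_hat:         (S, A)    estimated rewards
    d0_hat:        (S,)      estimated start-state distribution
    target_policy: (S, A)    pi_e(a|s)
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = r_hat.shape
    returns = []
    for _ in range(n_rollouts):
        s = rng.choice(n_states, p=d0_hat)
        g = 0.0
        for t in range(horizon):
            a = rng.choice(n_actions, p=target_policy[s])
            g += gamma ** t * r_hat[s, a]
            s = rng.choice(n_states, p=p_hat[s, a])
        returns.append(g)
    return float(np.mean(returns))
```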
The last CPE method is called Model And Guided Importance sampling Combining (MAGIC) [8]. It is considered the best CPE method because it adopts the advantages of both WDR and DM. The motivation given by [8] is that CPE methods involving importance sampling are generally consistent (i.e., as the number of samples increases, the estimate converges to the true value) but still suffer from high variance, while model-based methods (DM) have low variance but are biased. MAGIC combines the ideas of DM and WDR such that it can track the performance of whichever is better between DM and WDR. The MAGIC method is heavily math-involved. Without diving deep into it yet, we only outline its main steps for now:
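(The exact pseudocode is given in [8]; the outline below is only our high-level paraphrase of it.)

- For each $j$ in a chosen set $J$, compute a blended return $g^{(j)}$ that uses WDR-style importance sampling for the first $j$ steps of each trajectory and the learned model for the remaining steps; at one extreme the return is purely model-based (DM-like), at the other it equals the WDR estimate.
- Estimate the covariance matrix $\hat{\Omega}_n$ of the vector of blended returns $(g^{(j)})_{j \in J}$, e.g., with the bootstrap.
- Estimate a bias vector $\hat{b}_n$, where each entry measures how far $g^{(j)}$ is from a bootstrap confidence interval around the consistent WDR estimate.
- Find the weight vector $x^*$ on the probability simplex that minimizes $x^\top (\hat{\Omega}_n + \hat{b}_n \hat{b}_n^\top) x$, and return the weighted combination $\sum_{j \in J} x_j^*\, g^{(j)}$ as the MAGIC estimate.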
Below is a table concluding this post, showing whether each CPE method (for evaluating trajectories) is biased or unbiased, consistent or inconsistent, and its relative variance rank.
| CPE method | Bias | Consistency | Variance rank |
|---|---|---|---|
| DM | biased | inconsistent | 1 |
| trajectory-IS | unbiased | consistent | 5 |
| per-decision-IS | unbiased | consistent | 4 |
| DR | unbiased | consistent | 3 |
| WDR | biased | consistent | 2 |
| MAGIC | biased | consistent | Not sure, possibly between 1 and 2 |
(Table notes: (1) a variance rank of 1 means the lowest variance and 5 the highest; (2) when generating this table, we use the reasonable assumptions we mentioned above and those used in the cited papers.)