As a simple example, consider the game Pong: one might like to predict whether a new strategy (the target policy) increases the chance of winning, using only historical data collected from previous strategies (behavior policies) and without actually playing the game. If one were interested only in the performance of the behavior policy, a reasonable metric would be the average reward across all time steps in the historical data. However, since the historical data was generated by actions chosen under the behavior policy rather than the target policy, this simple average of rewards in the off-policy data would not yield a good estimate of the target policy’s long-term reward. Instead, a proper correction must be applied to remove the bias resulting from having two different policies (i.e., the difference in data distribution).
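To make the distinction concrete, below is a minimal sketch of the classic trajectory-wise importance sampling correction, contrasted with the naive average. The `target_policy` and `behavior_policy` callables (returning action probabilities given a state) and the trajectory format are assumptions for illustration; this is the textbook correction, not the specific infinite-horizon estimator discussed in the linked post.

```python
import numpy as np

def naive_value_estimate(trajectories):
    """Average reward over all logged time steps.
    This estimates the BEHAVIOR policy's performance, not the target policy's."""
    return np.mean([r for traj in trajectories for (_, _, r) in traj])

def importance_sampled_value(trajectories, target_policy, behavior_policy):
    """Trajectory-wise importance sampling: weight each trajectory's return by the
    product of per-step probability ratios pi_target(a|s) / pi_behavior(a|s),
    which corrects for the mismatch between the two data distributions."""
    estimates = []
    for traj in trajectories:  # each traj is a list of (state, action, reward) tuples
        ratio = np.prod([target_policy(s, a) / behavior_policy(s, a)
                         for (s, a, _) in traj])
        ret = sum(r for (_, _, r) in traj)
        estimates.append(ratio * ret)
    return np.mean(estimates)
```

Note that the product of per-step ratios can have very high variance for long trajectories, which is one motivation for the stationary-distribution-based corrections used in the infinite-horizon setting.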

Source: https://ai.googleblog.com/2020/04/off-policy-estimation-for-infinite.html