Position: PhD Candidate
Current Institution: Princeton University
Abstract: Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
This work focuses on the statistical theory of batch reinforcement learning with function approximation. We consider the off-policy evaluation problem: estimating the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q-iteration method and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal, with nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample, instance-dependent error upper bound and a nearly matching minimax lower bound. The policy evaluation error depends sharply on a restricted chi-square divergence, taken over the function class, between the long-term distribution of the target policy and the distribution of past data. This restricted chi-square divergence is both instance-dependent and function-class-dependent, and it characterizes the statistical limit of off-policy evaluation. Furthermore, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
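The equivalence between regression-based fitted Q-iteration and a model-based estimate can be illustrated with a small numerical sketch. This is not the speaker's implementation: it assumes a finite-horizon setting with undiscounted rewards, one-hot (tabular) features, a uniform behavior policy, and a small ridge term, and every name in it (`make_logged_data`, `fitted_q_evaluation`, `model_based_evaluation`, `horizon`, `ridge`) is hypothetical. Under these assumptions, both routines should return the same Q-function weights up to floating-point error.

```python
import numpy as np


def make_logged_data(n=500, n_states=4, n_actions=2, seed=0):
    """Simulate logged transitions from a small random MDP (illustrative only)."""
    rng = np.random.default_rng(seed)
    d = n_states * n_actions
    # Random transition kernel P[s, a, s'] and mean rewards R[s, a].
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(size=(n_states, n_actions))
    # Random target policy pi[s, a]; the behavior policy is uniform (assumption).
    target_pi = rng.dirichlet(np.ones(n_actions), size=n_states)
    s = rng.integers(n_states, size=n)
    a = rng.integers(n_actions, size=n)
    s_next = np.array([rng.choice(n_states, p=P[si, ai]) for si, ai in zip(s, a)])
    # One-hot features of the logged state-action pairs.
    phi = np.eye(d)[s * n_actions + a]
    rewards = R[s, a] + 0.1 * rng.standard_normal(n)
    # Expected next feature: average of phi(s', a') over a' ~ target policy.
    pi_features = np.zeros((n_states, d))
    for sp in range(n_states):
        pi_features[sp, sp * n_actions:(sp + 1) * n_actions] = target_pi[sp]
    phi_next_pi = pi_features[s_next]
    return phi, rewards, phi_next_pi


def fitted_q_evaluation(phi, rewards, phi_next_pi, horizon, ridge=1e-6):
    """Regression view: repeatedly regress Bellman targets onto the features."""
    d = phi.shape[1]
    cov = phi.T @ phi + ridge * np.eye(d)
    w = np.zeros(d)
    for _ in range(horizon):
        targets = rewards + phi_next_pi @ w
        w = np.linalg.solve(cov, phi.T @ targets)
    return w


def model_based_evaluation(phi, rewards, phi_next_pi, horizon, ridge=1e-6):
    """Model view: estimate reward weights and a feature-transition matrix once,
    then run the Bellman recursion inside the estimated model."""
    d = phi.shape[1]
    cov = phi.T @ phi + ridge * np.eye(d)
    r_hat = np.linalg.solve(cov, phi.T @ rewards)      # estimated reward weights
    m_hat = np.linalg.solve(cov, phi.T @ phi_next_pi)  # estimated transition in feature space
    w = np.zeros(d)
    for _ in range(horizon):
        w = r_hat + m_hat @ w
    return w


phi, rewards, phi_next_pi = make_logged_data()
w_fqe = fitted_q_evaluation(phi, rewards, phi_next_pi, horizon=5)
w_model = model_based_evaluation(phi, rewards, phi_next_pi, horizon=5)
print(np.max(np.abs(w_fqe - w_model)))
```

The two routines agree because each least-squares regression step is linear in its targets, so regressing `rewards + phi_next_pi @ w` gives exactly `r_hat + m_hat @ w`.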
Yaqi Duan is a final-year Ph.D. student in the Department of Operations Research and Financial Engineering at Princeton University, advised by Professor Mengdi Wang. Her research interests lie in machine learning, particularly reinforcement learning. She works on the statistical analysis of batch reinforcement learning and on dimensionality reduction methods for large stochastic systems. Prior to Princeton, Yaqi obtained a B.S. in Mathematics from Peking University.