
Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference

2025-01-12

Lars van der Laan, David Hubbard, Allen Tran, Nathan Kallus, Aurélien Bibaut

Abstract

Estimating long-term causal effects from short-term data is essential for decision-making in healthcare, economics, and industry, where long-term follow-up is often infeasible. Markov Decision Processes (MDPs) offer a principled framework for modeling outcomes as sequences of states, actions, and rewards over time. We introduce a semiparametric extension of Double Reinforcement Learning (DRL) for statistically efficient, model-robust inference on linear functionals of the Q-function, such as policy values, in infinite-horizon, time-homogeneous MDPs. By imposing semiparametric structure on the Q-function, our method relaxes the strong state overlap assumptions required by fully nonparametric approaches, improving efficiency and stability. To address computational and robustness challenges of minimax nuisance estimation, we develop a novel debiased plug-in estimator based on isotonic Bellman calibration, which integrates fitted Q-iteration with an isotonic regression step. This procedure leverages the Q-function as a data-driven dimension reduction, debiases all linear functionals of interest simultaneously, and enables nonparametric inference without explicit nuisance function estimation. Bellman calibration generalizes isotonic calibration to MDPs and may be of independent interest for prediction in reinforcement learning. Finally, we show that model selection for the Q-function incurs only second-order bias and extend the adaptive debiased machine learning (ADML) framework to MDPs for data-driven learning of semiparametric structure.
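The abstract describes isotonic Bellman calibration only at a high level: fitted Q-iteration combined with an isotonic regression step. The following is a minimal sketch of one plausible reading of that idea, not the authors' algorithm. The synthetic MDP, the linear feature map, the always-treat target policy, the use of observed states as initial-state draws, and every function and variable name are assumptions made for illustration. It fits an initial Q-function by linear fitted Q-evaluation, then repeatedly regresses the Bellman target on the fitted Q-values with a monotone (isotonic) fit, so the calibrated Q-function is a monotone transform of the initial one.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to an ordered sequence."""
    sums, counts = [], []
    for v in y:
        sums.append(float(v))
        counts.append(1)
        # Merge the last two blocks while their means violate monotonicity.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def isotonic_fit(x, y):
    """Return a nondecreasing step function of x fitted to (x, y)."""
    order = np.argsort(x, kind="stable")
    xs, fitted = x[order], pava(y[order])
    def predict(x_new):
        idx = np.clip(np.searchsorted(xs, x_new, side="right") - 1, 0, len(xs) - 1)
        return fitted[idx]
    return predict

# Hypothetical synthetic transitions (S, A, R, S_next) under a uniformly random behavior policy.
rng = np.random.default_rng(0)
n, gamma = 2000, 0.9
S = rng.uniform(-1, 1, n)
A = rng.integers(0, 2, n)
R = S + 0.5 * A + rng.normal(0, 0.1, n)
S_next = np.clip(0.8 * S + 0.2 * (2 * A - 1) + rng.normal(0, 0.1, n), -1, 1)

def features(s, a):
    # Assumed linear Q-model with basis (1, s, a, s*a).
    return np.column_stack([np.ones_like(s), s, a, s * a])

def target_action(s):
    # Assumed target policy: always take action 1.
    return np.ones_like(s)

# Step 1: linear fitted Q-evaluation for the target policy.
X = features(S, A)
X_next = features(S_next, target_action(S_next))
beta = np.zeros(X.shape[1])
for _ in range(200):
    beta, *_ = np.linalg.lstsq(X, R + gamma * X_next @ beta, rcond=None)
q, q_next = X @ beta, X_next @ beta

# Step 2: calibration loop. Seek a monotone f with
# f(q(S, A)) close to E[R + gamma * f(q(S', pi(S'))) | q(S, A)].
f = lambda v: v
for _ in range(50):
    f = isotonic_fit(q, R + gamma * f(q_next))
q_cal = f(q)

# Plug-in policy value, treating the observed states as initial-state draws (an assumption).
value_plugin = float(np.mean(f(features(S, target_action(S)) @ beta)))
print(value_plugin)
```

The isotonic step only ever sees the one-dimensional fitted Q-value, which is one way to read the abstract's phrase about using the Q-function as a data-driven dimension reduction. Whether this loop matches the paper's estimator, or inherits its debiasing guarantees, is not something the abstract alone settles.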
