Reinforcement learning policies in Markov decision processes (MDPs) often behave unexpectedly, especially in sparse-reward environments, which complicates debugging and verification. We propose a general framework for discrete MDPs that generates two complementary one-step explanations for single-action anomalies: (1) minimal counterfactual states, the smallest factored-state perturbations that flip a chosen action, and (2) robustness regions, contiguous state neighborhoods over which the original action remains invariant.
Our black-box technique requires no access to the policy's internals: it relies only on action feedback, applies to any discrete RL setting, and has been validated on several Gymnasium environments, providing actionable insight into when and why policies change their decisions.
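As a rough illustration of this black-box setup (a sketch, not the paper's implementation), the code below assumes a factored discrete state given as an integer tuple and a policy exposed only through an `act(state) -> action` callback. The helper names `minimal_counterfactual` and `robustness_region`, the unit-step perturbation set, and the per-axis interval form of the region are illustrative assumptions; bounds checking against the environment's state space is omitted.

```python
from itertools import combinations, product

def minimal_counterfactual(act, state, deltas=(-1, 1), max_features=None):
    """Breadth-first search over factored-state perturbations, smallest first.

    `act` is the only access to the policy: a black-box callback mapping a
    state tuple to a discrete action. Returns the first perturbed state whose
    action differs from the original, or None if nothing within the search
    budget flips it.
    """
    base_action = act(state)
    n = len(state)
    max_features = max_features or n
    # Perturb 1 feature, then 2, ... so the first hit is minimal in the
    # number of changed features (per-feature step size limited to `deltas`).
    for k in range(1, max_features + 1):
        for idx in combinations(range(n), k):
            for steps in product(deltas, repeat=k):
                candidate = list(state)
                for i, d in zip(idx, steps):
                    candidate[i] += d
                candidate = tuple(candidate)
                if act(candidate) != base_action:
                    return candidate
    return None

def robustness_region(act, state, max_radius=10):
    """Per-feature interval around `state` on which the action is unchanged.

    Expands each feature independently, one unit at a time, and stops in a
    direction as soon as the black-box action changes. Returns a list of
    (low, high) bounds, one per state feature.
    """
    base_action = act(state)
    region = []
    for i in range(len(state)):
        low = high = state[i]
        for _ in range(max_radius):  # grow downward
            probe = state[:i] + (low - 1,) + state[i + 1:]
            if act(probe) != base_action:
                break
            low -= 1
        for _ in range(max_radius):  # grow upward
            probe = state[:i] + (high + 1,) + state[i + 1:]
            if act(probe) != base_action:
                break
            high += 1
        region.append((low, high))
    return region
```

For a Gymnasium policy, `act` could simply wrap the trained agent, e.g. `act = lambda s: int(policy(s))`; filtering out states that are invalid in the environment is left out of this sketch.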