It's not clear to me how this is interestingly different from model-based RL, where you learn the transition (dynamics) function and the reward function, and then use various forms of simulation against the learned model to learn a value function. I guess I'll have to read more than the abstract...
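For concreteness, the model-based recipe I have in mind can be sketched in a few lines. This is a toy tabular example of my own (the chain MDP, variable names, and deterministic setup are all made up for illustration, not anything from the paper): fit a model of dynamics and reward from experience, then learn a value function purely by simulating with that model.

```python
import random

# Toy chain MDP: states 0..4, action 0 moves left, action 1 moves right.
# Arriving at state 4 gives reward 1; everything else gives 0.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9

def true_step(s, a):
    """The real environment (only used for data collection)."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, 1.0 if s2 == N_STATES - 1 else 0.0

# 1) Collect experience and fit a tabular model of dynamics and reward.
trans = {}   # (s, a) -> observed next state
rew = {}     # (s, a) -> observed reward
random.seed(0)
for _ in range(500):
    s, a = random.randrange(N_STATES), random.randrange(N_ACTIONS)
    s2, r = true_step(s, a)
    trans[(s, a)], rew[(s, a)] = s2, r  # deterministic, so one sample suffices

# 2) Learn a value function by simulation: value iteration run entirely
#    inside the learned model, never touching the real environment again.
V = [0.0] * N_STATES
for _ in range(100):
    V = [max(rew[(s, a)] + GAMMA * V[trans[(s, a)]] for a in range(N_ACTIONS))
         for s in range(N_STATES)]
```

In this sketch, `V` converges toward the optimal values of the true MDP because the learned tables happen to be exact; with function approximators for `trans` and `rew`, step 2 becomes rollouts or planning against an imperfect model, which is where the interesting differences between methods usually live.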