MDP and Dynamic Programming in C++
Bellman optimality and value iteration
The Bellman equation rewrites a sequential decision problem as a fixed-point problem. Once the transition kernel and reward function are known, dynamic programming turns policy search into repeated value updates.
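Concretely, the fixed point in question is the Bellman optimality equation. For a discounted MDP with transition kernel P, reward function R, and discount factor gamma in (0, 1):

```latex
V^*(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
```

Value iteration repeatedly applies the right-hand side as an operator; because that operator is a gamma-contraction in the sup norm, the iterates converge to the unique fixed point V*.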
Core object
The optimal value function V*(s) is the maximal expected discounted return attainable from state s. An optimal policy follows by acting greedily: at each state, choose the action that maximizes the right-hand side of the Bellman optimality equation.
Why it matters
This is the cleanest way to separate modeling assumptions from optimization. If the model is explicit, policy computation becomes transparent, testable, and reproducible.
Research reflex
Before running value iteration, I check whether the state space is small enough for exact dynamic programming, whether rewards are well scaled, and whether gamma sits far enough below 1 that the geometric contraction rate yields convergence in a reasonable number of sweeps.