Theory

Explore the theory behind the dynamic programming algorithm implemented in DynaProg

Introduction

DynaProg solves finite horizon, deterministic multi-stage decision problems.
A multi-stage decision problem is a control problem where decisions must be made in stages in order to minimize a certain cost. A system, which is characterized by its state, evolves through the stages and this evolution is influenced by the decisions themselves.
Deterministic problems in particular refer to the fact that the system's evolution at each stage is fully predictable: given the current state x_k and the decision variables u_k, the state of the system at the next stage can be evaluated with no uncertainty as x_(k+1) = f_k(x_k, u_k).
Finally, a finite-horizon problem is a control problem where decisions must be made for a finite number of stages.
DynaProg solves these problems using a Dynamic Programming algorithm. Dynamic Programming was first introduced by Richard Bellman [1], and it has since been applied in a wide variety of contexts and its theoretical basis extended by many other contributors.

Basics

Given the initial state of the system x_0, our control optimization problem consists in selecting the sequence of control variables over N stages (u_0, u_1, ..., u_(N-1)) that minimizes some total cost
J(x_0) = F(x_N) + sum_{k=0..N-1} L_k(x_k, u_k),
where L_k(x_k, u_k) is the stage cost incurred in advancing by one stage and F(x_N) is a terminal cost associated to the terminal system state.
In other words, the goal of the optimization problem is to find the optimal cost J^*(x_0), that is the minimum total cost that can be incurred:
J^*(x_0) = min_{u_k in U_k(x_k), k=0..N-1} J(x_0),
and the control sequence (u^*_0, u^*_1, ..., u^*_(N-1)) that minimizes it:
(u^*_0, u^*_1, ..., u^*_(N-1)) = argmin_{u_k in U_k(x_k), k=0..N-1} J(x_0).
Here, U_k(x_k) is the set of feasible control variables that can be selected at stage k.
The foundation of Dynamic Programming-based optimization algorithms is Bellman's principle of optimality:
"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision" [1].
In practice, the idea is to divide the original control problem into several tail subproblems, each defined as the optimal control problem starting from a certain stage k and ending at the final stage N, and to iteratively solve increasingly long tail subproblems by exploiting the solutions of the previously solved, shorter ones.
The optimization is divided into two phases: a so-called "backward" phase and a "forward" phase.

Backward Phase

Start by setting:
J^*_N(x_N) = F(x_N).
Set k = N-1 and solve:
J^*_k(x_k) = min_{u_k in U_k(x_k)} [ L_k(x_k, u_k) + J^*_(k+1)(f_k(x_k, u_k)) ].
Set k back by one stage and repeat the last step until J^*_0(x_0) is obtained.
Because J^*_k(x_k) represents the minimum cost that can be incurred if the system must evolve from stage k to stage N given the initial state x_k, it is also called the cost-to-go. Some sources also refer to this quantity as the value function.
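As an illustration, the following minimal MATLAB sketch implements the backward phase for a scalar-state, scalar-control problem on discretized grids (see the Discretization section below). The dynamics f, the costs L and F, the grids and the horizon are illustrative placeholders, not DynaProg's internal code; also note that MATLAB's stage index k runs from 1 to N rather than from 0 to N-1.

    N = 50;                           % number of stages
    x_grid = linspace(0, 1, 101);     % discretized state grid
    u_grid = linspace(-1, 1, 21);     % discretized control grid
    f = @(x, u) x + 0.02*u;           % illustrative state dynamics f_k
    L = @(x, u) 0.02*u.^2;            % illustrative stage cost L_k
    F = @(x) (x - 0.5).^2;            % illustrative terminal cost

    J = F(x_grid);                                % cost-to-go at the final stage
    J_next_store = zeros(N, numel(x_grid));       % J^*_(k+1), stored for the forward phase
    [X, U] = ndgrid(x_grid, u_grid);              % all state-control combinations
    for k = N:-1:1
        J_next_store(k, :) = J;                   % J currently holds J^*_(k+1)
        % Interpolate the stored cost-to-go at the next states; points that
        % leave the state grid get an infinite cost (treated as infeasible)
        J_interp = interp1(x_grid, J, f(X, U), 'linear', Inf);
        J = min(L(X, U) + J_interp, [], 2)';      % J^*_k: minimum over the control grid
    end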

Forward Phase

Start from k=0, and evaluate
u^*_k(x_k) = argmin_{u_k in U_k(x_k)} [ L_k(x_k, u_k) + J^*_(k+1)(f_k(x_k, u_k)) ].
Advance the simulation by updating the state variables:
x_(k+1) = f_k(x_k, u^*_k(x_k)).
Advance k by one stage and repeat these two steps until the last stage.
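Continuing the sketch above, the forward phase reuses x_grid, u_grid, f, L and the stored J_next_store array to simulate the optimal trajectory from an illustrative initial state:

    x = 0.2;                                  % initial state x_0
    x_traj = zeros(1, N+1); x_traj(1) = x;    % optimal state trajectory
    u_traj = zeros(1, N);                     % optimal control sequence
    for k = 1:N
        % argmin over the control grid, interpolating J^*_(k+1) at the next states
        J_next = interp1(x_grid, J_next_store(k, :), f(x, u_grid), 'linear', Inf);
        [~, idx] = min(L(x, u_grid) + J_next);
        u_traj(k) = u_grid(idx);              % u^*_k
        x = f(x, u_traj(k));                  % x_(k+1) = f_k(x_k, u^*_k(x_k))
        x_traj(k+1) = x;
    end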

Discretization

DynaProg applies the outlined algorithm to derive a numerical solution to the control optimization problem. For a multi-stage process with discrete state and control variables, the solution is straightforward and does not present any particular numerical hazard. If instead the process evolves in continuous time and/or the state and control variables are continuous, a numerical solution can still be obtained, but some care must be taken.

Time discretization

If the process considers a system which evolves in continuous time, then time must be discretized with a certain time step into a sequence of stages. An optimal solution can then be obtained for the discrete-time equivalent process.
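For example, a discrete-time equivalent of continuous-time dynamics dx/dt = g(x, u) can be obtained with a forward Euler scheme; the dynamics g and the time step dt below are illustrative placeholders, and the choice of integration scheme is left to the user's model function:

    g  = @(x, u) -0.5*x + u;        % illustrative continuous-time dynamics dx/dt = g(x, u)
    dt = 0.1;                       % discretization time step
    f  = @(x, u) x + dt*g(x, u);    % discrete-time equivalent: x_(k+1) = f(x_k, u_k)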

Control variables discretization

If one or more of the control variables are continuous, they must be discretized. This simplifies the numerical solution, as the required min and argmin operations reduce to finding minimum values over finite sets. The optimal control sequence found by the optimization algorithm is therefore restricted to the discretized control values.
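For instance, with an illustrative scalar control bounded in [-1, 1] and an illustrative stage cost, the min and argmin operations become a simple search over the grid values:

    u_grid = linspace(-1, 1, 21);        % 21 admissible control values
    stage_cost = (u_grid - 0.33).^2;     % illustrative stage cost for some fixed state
    [c_min, idx] = min(stage_cost);      % min and argmin over a finite set
    u_best = u_grid(idx);                % restricted to the grid: 0.3 rather than 0.33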

State variables discretization

If the system has one or more continuous state variables, they do not need to be discretized for the purpose of simulating the system's evolution.
However, the backward phase of the optimization algorithm relies on storing, at each iteration, J^*_k(x_k) (the value of the cost-to-go as a function of x_k). In order to do this, DynaProg evaluates J^*_k(x_k) for a finite set of sampled values of x_k: for this reason, a discretized computational grid for the state variables is needed. Then, when values of J^*_k(x_k) must be evaluated for values of x_k that do not belong to the computational grid, they are obtained by linear interpolation.
The discretization of this computational grid for the state variables affects the computation of the costs-to-go and, as such, it is a source of sub-optimality. Selecting the proper discretization level for the state variables grid usually requires some understanding of the physics behind the system under analysis and possibly some trial-and-error.
To wrap up, in the presence of continuous state variables, a discrete computational grid must be created for the optimization algorithm. This affects the optimality of the solution, even though the state is not actually discretized when simulating the system's evolution: the trajectory of the state variables under the optimal control sequence is continuous and is not tied to the discrete computational grid.
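A minimal sketch of this interpolation step, with an illustrative cost-to-go sampled on a coarse grid:

    x_grid = linspace(0, 1, 11);                 % state computational grid
    J_k = (x_grid - 0.4).^2;                     % cost-to-go sampled at the grid points
    x_q = 0.137;                                 % an off-grid state value
    J_q = interp1(x_grid, J_k, x_q, 'linear');   % linearly interpolated cost-to-go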

The terminal cost

DynaProg allows you to specify the terminal cost as a function handle by setting the TerminalCost property. Read the Syntax guide for more information.
In addition to that, DynaProg adds a penalty term to the terminal cost in order to enforce terminal state constraints (if present). In other words, the cost-to-go at stage N is initialized to:
J^*_N(x_N) = F(x_N) + p(x_N),
where p(x_N) is the penalty term.
Currently, two alternatives can be selected to define this penalty term: rift penalization and linear penalization.
Rift penalization means that the penalty term has the form:
p(x_N) = 0, if x_N,lb <= x_N <= x_N,ub
p(x_N) = myInf, otherwise,
where x_N,ub and x_N,lb are the upper and lower bounds of the terminal state constraints. The myInf term is a very large constant which can be modified by the user using the myInf property.
Linear penalization means that the penalty term has the form:
p(x_N) = b * |x_N - x_N,closest|,
i.e. the penalty is proportional to the distance between the terminal state and the closest bound x_N,closest of the terminal state constraints, and b is the proportionality factor. Here, |v| denotes the element-wise absolute value of all elements of the vector v.
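The following sketch shows the two penalty shapes for a scalar terminal state; the bounds, the b factor and the large constant standing in for myInf are illustrative values, and this is not DynaProg's internal code:

    x_lb = 0.45;  x_ub = 0.55;   % terminal state bounds (illustrative)
    big  = 1e6;                  % very large constant, playing the role of myInf
    b    = 100;                  % linear penalty proportionality factor (illustrative)
    % Rift penalization: zero inside the bounds, "infinite" outside
    rift_penalty = @(xN) big * (xN < x_lb | xN > x_ub);
    % Linear penalization: proportional to the distance from the closest bound
    lin_penalty  = @(xN) b * min(abs(xN - x_lb), abs(xN - x_ub));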
The linear penalization can fail to enforce terminal state constraints if the b factors are not set properly. Rift penalization on the other hand can cause the optimization to fail if the terminal state bounds are too tight or if the state grids are too coarse. This is caused by numerical issues related to the interpolations performed by the algorithm to obtain values for the cost-to-go function in the proximity of the unfeasible region.
You can specify what method to use by using the VFPenalty property and the proportionality factors b by using the VFFactors property. Upper and lower bounds for the terminal state constraints are specified with the StateFinal property. Read the Syntax guide for more information.

Notes on computational time

Dynamic Programming is an extremely powerful optimization tool in that it allows solving problems even when the state dynamics and the cost function are highly nonlinear. However, this comes at the cost of relatively high computational times. By far, the most time-consuming task is building the cost-to-go function in the backward phase. Consider the cost-to-go update that is repeated for each stage:
J^*_k(x_k) = min_{u_k in U_k(x_k)} [ L_k(x_k, u_k) + J^*_(k+1)(f_k(x_k, u_k)) ].
Note that this requires evaluating f_k(x_k, u_k) for each combination of state and control variables defined by their discrete computational grids. This means that, for each stage, the state dynamics must be evaluated N_(SV_1) x N_(SV_2) x ... x N_(SV_NS) x N_(CV_1) x N_(CV_2) x ... x N_(CV_NC) times, where N_(SV_i) and N_(CV_j) are the numbers of grid points that discretize state variable SV_i and control variable CV_j, and NS and NC are the numbers of state and control variables.
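As a concrete, illustrative example, a problem with two state variables and two control variables with the grid sizes below requires roughly 6.9 million model evaluations per stage:

    N_sv = [201, 101];                     % grid points for state variables SV_1, SV_2
    N_cv = [31, 11];                       % grid points for control variables CV_1, CV_2
    evals_per_stage = prod([N_sv, N_cv])   % 201*101*31*11 = 6,922,641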
Not only is the number of function evaluations very high, but it also increases dramatically when adding state and/or control variables. For this reason, with complex system dynamics and cost functions, it is important that the user does not define any unnecessary computations in the system and cost function. DynaProg provides several ways to help reduce them to a minimum, with simple features such as exogenous inputs and additional inputs, and the more advanced system and cost function split.

References

[1] Bellman, R.E. (2003) [1957]. Dynamic Programming. Dover. ISBN 0-486-42809-5.
Contact: federico.miretti@polito.it