CRAAM
2.0.0
Robust and Approximate Markov Decision Processes
Main namespace for algorithms that operate on MDPs and RMDPs.
Namespaces
internal
    Internal helper functions.
Classes
class PolicyDeterministic
class PolicyNature
    The class abstracts some operations of value / policy iteration in order to generalize to various types of robust MDPs.
struct Solution
    A solution to a plain MDP.
struct SolutionRobust
    A robust solution to a robust or regular MDP.
Typedefs
template<class T>
using NatureResponse = vec_scal_t(*)(numvec const &v, numvec const &p, T threshold)
    Function representing constraints on nature.
template<class T>
using NatureInstance = pair<NatureResponse<T>, T>
    Represents an instance of nature that can be used to directly compute the response.
Functions
template<typename SType, typename Policies>
MatrixXd transition_mat(const GRMDP<SType> &rmdp, const Policies &policies, bool transpose=false)
    Constructs the transition matrix (or its transpose) for the policy.
template<typename SType, typename Policy>
numvec rewards_vec(const GRMDP<SType> &rmdp, const Policy &policies)
    Constructs the vector of per-state rewards for the RMDP.
template<typename SType, typename Policies>
numvec occfreq_mat(const GRMDP<SType> &rmdp, const Transition &init, prec_t discount, const Policies &policies)
    Computes occupancy frequencies using the matrix representation of transition probabilities.
vec_scal_t robust_l1(const numvec &v, const numvec &p, prec_t threshold)
    L1 robust response.
vec_scal_t optimistic_l1(const numvec &v, const numvec &p, prec_t threshold)
    L1 optimistic response.
template<class T>
vec_scal_t robust_unbounded(const numvec &v, const numvec &p, T)
    Worst outcome; the threshold is ignored.
template<class T>
vec_scal_t optimistic_unbounded(const numvec &v, const numvec &p, T)
    Best outcome; the threshold is ignored.
template<class T>
vec_scal_t value_action(const RegularAction &action, const numvec &valuefunction, prec_t discount, const NatureInstance<T> &nature)
    Computes an ambiguous value (e.g. robust) of the action, depending on the type of nature that is provided.
template<class T>
vec_scal_t value_action(const WeightedOutcomeAction &action, numvec const &valuefunction, prec_t discount, const NatureInstance<T> nature)
    Computes the optimal outcome distribution subject to the constraints on the nature's distribution.
template<class AType, class T>
vec_scal_t value_fix_state(const SAState<AType> &state, numvec const &valuefunction, prec_t discount, long actionid, const NatureInstance<T> &nature)
    Computes the value of a fixed action and any response of nature.
template<typename AType, typename T>
ind_vec_scal_t value_max_state(const SAState<AType> &state, const numvec &valuefunction, prec_t discount, const NatureInstance<T> &nature)
    Finds the greedy action and its value for the given value function.
template<class T>
PolicyNature<T> uniform_nature(size_t statecount, NatureResponse<T> nature, T threshold)
    A helper function that simply copies a nature specification across all states.
template<class Model, class T>
PolicyNature<T> uniform_nature(const Model &m, NatureResponse<T> nature, T threshold)
    A helper function that simply copies a nature specification across all states.
template<class SType, class T = prec_t>
auto rsolve_vi(const GRMDP<SType> &mdp, prec_t discount, const vector<NatureResponse<T>> &nature, const vector<T> &thresholds, numvec valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
    Gauss-Seidel variant of value iteration (not parallelized).
template<class SType, class T = prec_t>
auto rsolve_vi(const GRMDP<SType> &mdp, prec_t discount, const NatureResponse<T> &nature, const vector<T> &thresholds, numvec valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
    Simplified function call with a single nature for all states.
template<class SType, class T = prec_t>
auto rsolve_mpi(const GRMDP<SType> &mdp, prec_t discount, const vector<NatureResponse<T>> &nature, const vector<T> &thresholds, const numvec &valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
    Modified policy iteration using Jacobi value iteration in the inner loop.
template<class SType, class T = prec_t>
auto rsolve_mpi(const GRMDP<SType> &mdp, prec_t discount, const NatureResponse<T> &nature, const vector<T> &thresholds, const numvec &valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
    Simplified function call with a single nature for all states.
NatureResponse<prec_t> string_to_nature(string nature)
    Converts a string representation of a nature response to the appropriate nature response function.
prec_t value_action(const RegularAction &action, const numvec &valuefunction, prec_t discount)
    Computes the average value of the action.
prec_t value_action(const RegularAction &action, const numvec &valuefunction, prec_t discount, numvec distribution)
    Computes the value of the action for a given distribution.
prec_t value_action(const WeightedOutcomeAction &action, numvec const &valuefunction, prec_t discount)
    Computes the average outcome using the provided distribution.
prec_t value_action(const WeightedOutcomeAction &action, numvec const &valuefunction, prec_t discount, const numvec &distribution)
    Computes the action value for a fixed index outcome.
template<class AType>
pair<long, prec_t> value_max_state(const SAState<AType> &state, const numvec &valuefunction, prec_t discount)
    Finds the action with the maximal average return.
template<class AType>
prec_t value_fix_state(const SAState<AType> &state, numvec const &valuefunction, prec_t discount, long actionid)
    Computes the value of a fixed (and valid) action.
template<class AType>
prec_t value_fix_state(const SAState<AType> &state, numvec const &valuefunction, prec_t discount, long actionid, numvec distribution)
    Computes the value of a fixed action and a fixed response of nature.
template<class SType, class ResponseType = PolicyDeterministic>
auto vi_gs(const GRMDP<SType> &mdp, prec_t discount, numvec valuefunction=numvec(0), const ResponseType &response=PolicyDeterministic(), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
    Gauss-Seidel variant of value iteration (not parallelized).
template<class SType, class ResponseType = PolicyDeterministic>
auto mpi_jac(const GRMDP<SType> &mdp, prec_t discount, const numvec &valuefunction=numvec(0), const ResponseType &response=PolicyDeterministic(), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
    Modified policy iteration using Jacobi value iteration in the inner loop.
template<class SType>
auto solve_vi(const GRMDP<SType> &mdp, prec_t discount, numvec valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
    Gauss-Seidel variant of value iteration (not parallelized).
template<class SType>
auto solve_mpi(const GRMDP<SType> &mdp, prec_t discount, const numvec &valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
    Modified policy iteration using Jacobi value iteration in the inner loop.
Detailed Description
Main namespace for algorithms that operate on MDPs and RMDPs.
template<class T>
using craam::algorithms::NatureResponse = vec_scal_t (*)(numvec const &v, numvec const &p, T threshold)
Function representing constraints on nature.
The function computes the best response of nature and can be used in value iteration.
This function represents a nature which computes (in general) a randomized policy (response). If the response is always deterministic, it may be better to define and use a nature that computes and uses a deterministic response.
The parameters are the q-values v, the reference distribution p, and the threshold. The function returns the worst-case solution and the objective value. The threshold can be used to determine the desired robustness of the solution.
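To make the expected shape concrete, the following is a minimal sketch of a custom nature response. The local aliases mirror what the signatures on this page suggest (numvec, prec_t, vec_scal_t); their exact definitions are not given here and are assumptions, not library code.

    // Sketch only: assumed aliases (numvec = vector<double>, prec_t = double,
    // vec_scal_t = pair<numvec, prec_t>); they are not taken from the library headers.
    #include <numeric>
    #include <utility>
    #include <vector>

    using numvec     = std::vector<double>;
    using prec_t     = double;
    using vec_scal_t = std::pair<numvec, prec_t>;

    // A "neutral" nature: returns the reference distribution p unchanged and the
    // plain expectation of v under p; the threshold is ignored.
    vec_scal_t neutral_nature(numvec const& v, numvec const& p, prec_t /*threshold*/) {
        prec_t expectation = std::inner_product(v.cbegin(), v.cend(), p.cbegin(), prec_t(0.0));
        return {p, expectation};
    }

    // neutral_nature now has the NatureResponse<prec_t> shape and could be paired
    // with a threshold in a NatureInstance<prec_t>, e.g. {&neutral_nature, 0.0}.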
mpi_jac() [inline]
Modified policy iteration using Jacobi value iteration in the inner loop.
See solve_mpi for a simplified interface. This method generalizes modified policy iteration to robust MDPs. In the value iteration step, both the action and the outcome are fixed.
Note that the total number of iterations is bounded by iterations_pi * iterations_vi.
Template Parameters
    type: Type of realization of the uncertainty
Parameters
    discount: Discount factor
    valuefunction: Initial value function
    response: The response argument makes it possible to specify a partial policy; only the actions not provided by the partial policy are included in the optimization. Using a class of a different type enables computing other objectives, such as robust or risk-averse ones
    iterations_pi: Maximal number of policy iteration steps
    maxresidual_pi: Stop the outer policy iteration when the residual drops below this threshold
    iterations_vi: Maximal number of inner-loop value iterations
    maxresidual_vi: Stop the inner value iteration when the residual drops below this threshold; this value should be smaller than maxresidual_pi
    print_progress: Whether to report progress during the computation
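A usage sketch under the assumption that a model (GRMDP<SType>) has already been constructed and that the listed types live in the craam and craam::algorithms namespaces; model construction and the exact header paths are not covered on this page.

    // Sketch: solving an already constructed model with mpi_jac, first with the
    // documented defaults and then with explicit stopping criteria.
    template <class SType>
    void solve_with_mpi(const craam::GRMDP<SType>& mdp) {
        using namespace craam;             // numvec, prec_t (assumed location)
        using namespace craam::algorithms; // mpi_jac, PolicyDeterministic

        const prec_t discount = 0.95;

        // Defaults: empty initial value function, PolicyDeterministic response.
        auto solution = mpi_jac(mdp, discount);

        // Explicit arguments: tighter inner loop and progress reporting.
        auto solution2 = mpi_jac(mdp, discount, numvec(0), PolicyDeterministic(),
                                 1000, 1e-4, 50, 1e-5, true);
        (void)solution; (void)solution2;
    }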
occfreq_mat() [inline]
Computes occupancy frequencies using the matrix representation of transition probabilities.
This method may not scale well.
Template Parameters
    SType: Type of the state in the MDP (regular vs robust)
    Policy: Type of the policy. Either a single policy for standard MDP evaluation, or a pair of a deterministic policy and a randomized policy of nature
Parameters
    init: Initial distribution (alpha)
    discount: Discount factor (gamma)
    policies: The policy (indvec), or the pair of the policy and the policy of nature (pair<indvec, vector<numvec>>). The nature is typically a randomized policy
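A sketch combining occfreq_mat and rewards_vec to recover the expected discounted return of a fixed deterministic policy; it assumes the model, initial distribution, and policy already exist, and that Transition, indvec, numvec, and prec_t live in the craam namespace.

    // Sketch: return of a fixed deterministic policy as the dot product of the
    // occupancy frequencies with the per-state rewards.
    #include <cstddef>

    template <class SType>
    craam::prec_t policy_return(const craam::GRMDP<SType>& mdp,
                                const craam::Transition& init,
                                const craam::indvec& policy,
                                craam::prec_t discount) {
        using namespace craam::algorithms;

        craam::numvec freq = occfreq_mat(mdp, init, discount, policy); // occupancy frequencies
        craam::numvec rew  = rewards_vec(mdp, policy);                 // expected reward per state

        craam::prec_t total = 0.0;
        for (std::size_t s = 0; s < rew.size(); ++s)
            total += freq[s] * rew[s];
        return total;
    }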
rewards_vec() [inline]
Constructs the vector of per-state rewards for the RMDP.
Template Parameters
    Policy: Type of the policy. Either a single policy for standard MDP evaluation, or a pair of a deterministic policy and a randomized policy of nature
Parameters
    rmdp: Regular or robust MDP
    policies: The policy (indvec), or the pair of the policy and the policy of nature (pair<indvec, vector<numvec>>). The nature is typically a randomized policy
rsolve_mpi() [inline]
Modified policy iteration using Jacobi value iteration in the inner loop.
This method generalizes modified policy iteration to robust MDPs. In the value iteration step, both the action and the outcome are fixed.
This is a simplified method interface. Use mpi_jac with PolicyNature for full functionality.
Note that the total number of iterations is bounded by iterations_pi * iterations_vi.
Template Parameters
    type: Type of realization of the uncertainty
Parameters
    discount: Discount factor
    nature: Response of nature; one function per state
    thresholds: Parameters passed to the nature response functions; one value per state
    valuefunction: Initial value function
    policy: Partial policy specification. The action is optimized only in states where policy[state] = -1
    iterations_pi: Maximal number of policy iteration steps
    maxresidual_pi: Stop the outer policy iteration when the residual drops below this threshold
    iterations_vi: Maximal number of inner-loop value iterations
    maxresidual_vi: Stop the inner value iteration when the residual drops below this threshold; this value should be smaller than maxresidual_pi
    print_progress: Whether to report progress during the computation
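A sketch of the robust solver with an L1-constrained worst case (robust_l1) applied uniformly to every state; obtaining the number of states from the model is not shown here, and the namespace placement of the types is an assumption.

    // Sketch: every state uses robust_l1 with an L1 budget of 0.1.
    #include <cstddef>
    #include <vector>

    template <class SType>
    void robust_solve(const craam::GRMDP<SType>& mdp, std::size_t nstates) {
        using namespace craam;
        using namespace craam::algorithms;

        std::vector<NatureResponse<prec_t>> natures(nstates, robust_l1); // one nature per state
        std::vector<prec_t> thresholds(nstates, 0.1);                    // one budget per state

        auto solution = rsolve_mpi(mdp, /*discount=*/0.95, natures, thresholds);
        (void)solution;
    }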
rsolve_vi() [inline]
Gauss-Seidel variant of value iteration (not parallelized).
This function is suitable for computing the value function of a finite-state MDP. If the states are ordered correctly, one iteration is enough to compute the optimal value function. Since the value function is updated from the last state to the first one, the states should be ordered in temporal order.
This is a simplified method interface. Use vi_gs with PolicyNature for full functionality.
Parameters
    mdp: The MDP to solve
    discount: Discount factor
    nature: Response of nature; one function per state
    thresholds: Parameters passed to the nature response functions; one value per state
    valuefunction: Initial value function. Passed by value because it is modified. Optional; all zeros are used when not provided. Ignored when its size is 0
    policy: Partial policy specification. The action is optimized only in states where policy[state] = -1
    iterations: Maximal number of iterations to run
    maxresidual: Stop when the maximal residual falls below this value
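A sketch of the simplified rsolve_vi overload listed in the summary above, where a single nature response is shared by all states while the thresholds remain per state; the namespace placement of the types is an assumption.

    // Sketch: one shared nature response, per-state thresholds.
    #include <cstddef>
    #include <vector>

    template <class SType>
    void robust_value_iteration(const craam::GRMDP<SType>& mdp, std::size_t nstates) {
        using namespace craam;
        using namespace craam::algorithms;

        NatureResponse<prec_t> nature = robust_l1;      // shared by all states
        std::vector<prec_t> thresholds(nstates, 0.2);   // per-state L1 budget

        auto solution = rsolve_vi(mdp, /*discount=*/0.99, nature, thresholds);
        (void)solution;
    }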
solve_mpi() [inline]
Modified policy iteration using Jacobi value iteration in the inner loop.
This method generalizes modified policy iteration to robust MDPs. In the value iteration step, both the action and the outcome are fixed.
Note that the total number of iterations is bounded by iterations_pi * iterations_vi.
Template Parameters
    type: Type of realization of the uncertainty
Parameters
    discount: Discount factor
    valuefunction: Initial value function
    policy: Partial policy specification. The action is optimized only in states where policy[state] = -1
    iterations_pi: Maximal number of policy iteration steps
    maxresidual_pi: Stop the outer policy iteration when the residual drops below this threshold
    iterations_vi: Maximal number of inner-loop value iterations
    maxresidual_vi: Stop the inner value iteration when the residual drops below this threshold; this value should be smaller than maxresidual_pi
    print_progress: Whether to report progress during the computation
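A sketch illustrating the partial-policy convention (policy[state] = -1 marks states whose action should be optimized); the model and its state count are assumed to exist already.

    // Sketch: pin action 0 in state 0, optimize all remaining states.
    #include <cstddef>

    template <class SType>
    void solve_with_partial_policy(const craam::GRMDP<SType>& mdp, std::size_t nstates) {
        using namespace craam;
        using namespace craam::algorithms;

        indvec partial(nstates, -1);        // -1: optimize the action in this state
        if (nstates > 0) partial[0] = 0;    // fixed action 0 in state 0

        auto solution = solve_mpi(mdp, /*discount=*/0.9, numvec(0), partial);
        (void)solution;
    }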
solve_vi() [inline]
Gauss-Seidel variant of value iteration (not parallelized).
This function is suitable for computing the value function of a finite-state MDP. If the states are ordered correctly, one iteration is enough to compute the optimal value function. Since the value function is updated from the last state to the first one, the states should be ordered in temporal order.
Parameters
    mdp: The MDP to solve
    discount: Discount factor
    valuefunction: Initial value function. Passed by value because it is modified. Optional; all zeros are used when not provided. Ignored when its size is 0
    policy: Partial policy specification. The action is optimized only in states where policy[state] = -1
    iterations: Maximal number of iterations to run
    maxresidual: Stop when the maximal residual falls below this value
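The most basic call relies entirely on the documented defaults; a minimal sketch assuming an already constructed model.

    // Sketch: plain value iteration with the default stopping criteria.
    template <class SType>
    auto solve_basic(const craam::GRMDP<SType>& mdp) {
        return craam::algorithms::solve_vi(mdp, /*discount=*/0.95);
    }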
string_to_nature() [inline]
Converts a string representation of a nature response to the appropriate nature response function.
This function is useful when the code is used within Python or R libraries. The accepted values correspond to the nature response functions defined in this namespace.
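A sketch of use from a binding layer; the accepted strings are not listed on this page, so the value "robust_l1" below is an assumption based on the function of the same name in this namespace.

    // Sketch: select the nature response from a configuration string.
    #include <string>

    craam::algorithms::NatureResponse<craam::prec_t>
    nature_from_config(const std::string& name) {
        return craam::algorithms::string_to_nature(name);
    }

    // e.g. auto nat = nature_from_config("robust_l1");  // assumed to be a supported value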
transition_mat() [inline]
Constructs the transition matrix (or its transpose) for the policy.
Template Parameters
    SType: Type of the state in the MDP (regular vs robust)
    Policy: Type of the policy. Either a single policy for standard MDP evaluation, or a pair of a deterministic policy and a randomized policy of nature
Parameters
    rmdp: Regular or robust MDP
    policies: The policy (indvec), or the pair of the policy and the policy of nature (pair<indvec, vector<numvec>>). The nature is typically a randomized policy
    transpose: (optional, default false) Whether to return the transpose of the transition matrix. This is useful for computing occupancy frequencies
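A sketch of exact policy evaluation from the matrix form, v = (I - discount * P)^{-1} r, using Eigen's dense solver (the MatrixXd return type suggests Eigen is already a dependency); it assumes the rows of the returned matrix index the source states.

    // Sketch: value of a fixed deterministic policy by solving (I - discount*P) v = r.
    #include <Eigen/Dense>

    template <class SType>
    Eigen::VectorXd evaluate_policy(const craam::GRMDP<SType>& mdp,
                                    const craam::indvec& policy,
                                    craam::prec_t discount) {
        using namespace craam::algorithms;

        Eigen::MatrixXd P = transition_mat(mdp, policy);  // rows assumed to index source states
        craam::numvec  r  = rewards_vec(mdp, policy);     // expected reward per state

        Eigen::VectorXd rv = Eigen::Map<const Eigen::VectorXd>(
            r.data(), static_cast<Eigen::Index>(r.size()));
        Eigen::MatrixXd A  = Eigen::MatrixXd::Identity(P.rows(), P.cols()) - discount * P;

        return A.colPivHouseholderQr().solve(rv);  // value function of the policy
    }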
value_action(const RegularAction&, const numvec&, prec_t) [inline]
Computes the average value of the action.
Parameters
    action: Action for which to compute the value
    valuefunction: State value function to use
    discount: Discount factor
value_action(const RegularAction&, const numvec&, prec_t, numvec distribution) [inline]
Computes the value of the action for a given distribution.
This function can be used to evaluate a robust solution which may modify the transition probabilities.
The new distribution may be non-zero only for states for which the original distribution is non-zero.
Parameters
    action: Action for which to compute the value
    valuefunction: State value function to use
    discount: Discount factor
    distribution: New distribution. Its length must match the number of states with strictly positive original transition probabilities. The order of states is the same as in the underlying transition
value_action(const WeightedOutcomeAction&, const numvec&, prec_t) [inline]
Computes the average outcome using the provided distribution.
Parameters
    action: Action for which the value is computed
    valuefunction: Updated value function
    discount: Discount factor
value_action(const RegularAction&, const numvec&, prec_t, const NatureInstance<T>&) [inline]
Computes an ambiguous value (e.g. robust) of the action, depending on the type of nature that is provided.
Parameters
    action: Action for which to compute the value
    valuefunction: State value function to use
    discount: Discount factor
    nature: Method used to compute the response of nature
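A sketch that evaluates one action under an L1-constrained adversarial nature; obtaining the RegularAction and the value function is not shown, and the namespace placement of RegularAction is an assumption.

    // Sketch: NatureInstance pairs a nature response with its threshold (here an
    // L1 budget of 0.15).
    auto robust_action_value(const craam::RegularAction& action,
                             const craam::numvec& valuefunction,
                             craam::prec_t discount) {
        using namespace craam;
        using namespace craam::algorithms;

        NatureInstance<prec_t> nature{&robust_l1, 0.15};
        // vec_scal_t presumably pairs the chosen distribution with the resulting value.
        return value_action(action, valuefunction, discount, nature);
    }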
value_action(const WeightedOutcomeAction&, const numvec&, prec_t, const numvec &distribution) [inline]
Computes the action value for a fixed index outcome.
Parameters
    action: Action for which the value is computed
    valuefunction: Updated value function
    discount: Discount factor
    distribution: Custom distribution that is selected by nature
value_action(const WeightedOutcomeAction&, const numvec&, prec_t, const NatureInstance<T>) [inline]
Computes the optimal outcome distribution subject to the constraints on the nature's distribution.
Does not work when the number of outcomes is zero.
Parameters
    action: Action for which the value is computed
    valuefunction: Value function reference
    discount: Discount factor
    nature: Method used to compute the response of nature
value_fix_state(const SAState<AType>&, const numvec&, prec_t, long actionid) [inline]
Computes the value of a fixed (and valid) action.
Performs validity checks.
Parameters
    state: State to compute the value for
    valuefunction: Value function to use for the following states
    discount: Discount factor
value_fix_state(const SAState<AType>&, const numvec&, prec_t, long actionid, const NatureInstance<T>&) [inline]
Computes the value of a fixed action and any response of nature.
Parameters
    state: State to compute the value for
    valuefunction: Value function to use in computing the value of states
    discount: Discount factor
    nature: Instance of a nature optimizer
value_fix_state(const SAState<AType>&, const numvec&, prec_t, long actionid, numvec distribution) [inline]
Computes the value of a fixed action and a fixed response of nature.
Parameters
    state: State to compute the value for
    valuefunction: Value function to use in computing the value of states
    discount: Discount factor
    distribution: New distribution over states with non-zero nominal probabilities
value_max_state(const SAState<AType>&, const numvec&, prec_t) [inline]
Finds the action with the maximal average return.
The return is 0 when there are no actions; such a state is assumed to be terminal.
Parameters
    state: State to compute the value for
    valuefunction: Value function to use for the following states
    discount: Discount factor
value_max_state(const SAState<AType>&, const numvec&, prec_t, const NatureInstance<T>&) [inline]
Finds the greedy action and its value for the given value function.
This function assumes a robust or optimistic response by nature, depending on the provided ambiguity.
When there are no actions, the state is assumed to be terminal and the return is 0.
Parameters
    state: State to compute the value for
    valuefunction: Value function to use in computing the value of states
    discount: Discount factor
    nature: Method used to compute the response of nature
vi_gs() [inline]
Gauss-Seidel variant of value iteration (not parallelized).
See solve_vi for a simplified interface.
This function is suitable for computing the value function of a finite-state MDP. If the states are ordered correctly, one iteration is enough to compute the optimal value function. Since the value function is updated from the last state to the first one, the states should be ordered in temporal order.
Parameters
    mdp: The MDP to solve
    discount: Discount factor
    valuefunction: Initial value function. Passed by value because it is modified. Optional; all zeros are used when not provided. Ignored when its size is 0
    response: The response argument makes it possible to specify a partial policy; only the actions not provided by the partial policy are included in the optimization. Using a class of a different type enables computing other objectives, such as robust or risk-averse ones
    iterations: Maximal number of iterations to run
    maxresidual: Stop when the maximal residual falls below this value
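A sketch of vi_gs with the default PolicyDeterministic response (plain MDP objective) and explicit stopping criteria; for robust objectives the response would be a PolicyNature instance instead (its construction is not shown on this page).

    // Sketch: Gauss-Seidel value iteration with explicit iteration/residual limits.
    template <class SType>
    auto value_iterate(const craam::GRMDP<SType>& mdp) {
        using namespace craam;
        using namespace craam::algorithms;
        return vi_gs(mdp, /*discount=*/0.95, numvec(0), PolicyDeterministic(),
                     /*iterations=*/2000, /*maxresidual=*/1e-5);
    }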