CRAAM  2.0.0
Robust and Approximate Markov Decision Processes
craam::algorithms Namespace Reference

Main namespace for algorithms that operate on MDPs and RMDPs. More...

Namespaces

 internal
 Internal helper functions.
 

Classes

class  PolicyDeterministic
 
class  PolicyNature
 The class abstracts some operations of value / policy iteration in order to generalize to various types of robust MDPs. More...
 
struct  Solution
 A solution to a plain MDP. More...
 
struct  SolutionRobust
 A robust solution to a robust or regular MDP. More...
 

Typedefs

template<class T >
using NatureResponse = vec_scal_t(*)(numvec const &v, numvec const &p, T threshold)
 Function representing constraints on nature. More...
 
template<class T >
using NatureInstance = pair< NatureResponse< T >, T >
 Represents an instance of nature that can be used to directly compute the response.
 

Functions

template<typename SType , typename Policies >
MatrixXd transition_mat (const GRMDP< SType > &rmdp, const Policies &policies, bool transpose=false)
 Constructs the transition (or its transpose) matrix for the policy. More...
 
template<typename SType , typename Policy >
numvec rewards_vec (const GRMDP< SType > &rmdp, const Policy &policies)
 Constructs the rewards vector for each state for the RMDP. More...
 
template<typename SType , typename Policies >
numvec occfreq_mat (const GRMDP< SType > &rmdp, const Transition &init, prec_t discount, const Policies &policies)
 Computes occupancy frequencies using matrix representation of transition probabilities. More...
 
vec_scal_t robust_l1 (const numvec &v, const numvec &p, prec_t threshold)
 L1 robust response.
 
vec_scal_t optimistic_l1 (const numvec &v, const numvec &p, prec_t threshold)
 L1 optimistic response.
 
template<class T >
vec_scal_t robust_unbounded (const numvec &v, const numvec &p, T)
 Worst-case outcome; the threshold is ignored.
 
template<class T >
vec_scal_t optimistic_unbounded (const numvec &v, const numvec &p, T)
 Best-case outcome; the threshold is ignored.
 
template<class T >
vec_scal_t value_action (const RegularAction &action, const numvec &valuefunction, prec_t discount, const NatureInstance< T > &nature)
 Computes an ambiguous value (e.g. robust) of the action. More...
 
template<class T >
vec_scal_t value_action (const WeightedOutcomeAction &action, numvec const &valuefunction, prec_t discount, const NatureInstance< T > nature)
 Computes the outcome distribution chosen by nature, subject to constraints on nature's distribution. More...
 
template<class AType , class T >
vec_scal_t value_fix_state (const SAState< AType > &state, numvec const &valuefunction, prec_t discount, long actionid, const NatureInstance< T > &nature)
 Computes the value of a fixed action and any response of nature. More...
 
template<typename AType , typename T >
ind_vec_scal_t value_max_state (const SAState< AType > &state, const numvec &valuefunction, prec_t discount, const NatureInstance< T > &nature)
 Finds the greedy action and its value for the given value function. More...
 
template<class T >
PolicyNature< T > uniform_nature (size_t statecount, NatureResponse< T > nature, T threshold)
 A helper function that simply copies a nature specification across all states.
 
template<class Model , class T >
PolicyNature< T > uniform_nature (const Model &m, NatureResponse< T > nature, T threshold)
 A helper function that simply copies a nature specification across all states.
 
template<class SType , class T = prec_t>
auto rsolve_vi (const GRMDP< SType > &mdp, prec_t discount, const vector< NatureResponse< T >> &nature, const vector< T > &thresholds, numvec valuefunction=numvec(0), const indvec &policy=numvec(0), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
 Gauss-Seidel variant of value iteration (not parallelized). More...
 
template<class SType , class T = prec_t>
auto rsolve_vi (const GRMDP< SType > &mdp, prec_t discount, const NatureResponse< T > &nature, const vector< T > &thresholds, numvec valuefunction=numvec(0), const indvec &policy=numvec(0), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
 Simplified function call with a single nature for all states.
 
template<class SType , class T = prec_t>
auto rsolve_mpi (const GRMDP< SType > &mdp, prec_t discount, const vector< NatureResponse< T >> &nature, const vector< T > &thresholds, const numvec &valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
 Modified policy iteration using Jacobi value iteration in the inner loop. More...
 
template<class SType , class T = prec_t>
auto rsolve_mpi (const GRMDP< SType > &mdp, prec_t discount, const NatureResponse< T > &nature, const vector< T > &thresholds, const numvec &valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
 Simplified function call with a single nature for all states.
 
NatureResponse< prec_t > string_to_nature (string nature)
 Converts a string representation of nature response to the appropriate nature response call. More...
 
prec_t value_action (const RegularAction &action, const numvec &valuefunction, prec_t discount)
 Computes the average value of the action. More...
 
prec_t value_action (const RegularAction &action, const numvec &valuefunction, prec_t discount, numvec distribution)
 Computes a value of the action for a given distribution. More...
 
prec_t value_action (const WeightedOutcomeAction &action, numvec const &valuefunction, prec_t discount)
 Computes the average outcome using the provided distribution. More...
 
prec_t value_action (const WeightedOutcomeAction &action, numvec const &valuefunction, prec_t discount, const numvec &distribution)
 Computes the action value for a fixed outcome distribution. More...
 
template<class AType >
pair< long, prec_t > value_max_state (const SAState< AType > &state, const numvec &valuefunction, prec_t discount)
 Finds the action with the maximal average return. More...
 
template<class AType >
prec_t value_fix_state (const SAState< AType > &state, numvec const &valuefunction, prec_t discount, long actionid)
 Computes the value of a fixed (and valid) action. More...
 
template<class AType >
prec_t value_fix_state (const SAState< AType > &state, numvec const &valuefunction, prec_t discount, long actionid, numvec distribution)
 Computes the value of a fixed action and fixed response of nature. More...
 
template<class SType , class ResponseType = PolicyDeterministic>
auto vi_gs (const GRMDP< SType > &mdp, prec_t discount, numvec valuefunction=numvec(0), const ResponseType &response=PolicyDeterministic(), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
 Gauss-Seidel variant of value iteration (not parallelized). More...
 
template<class SType , class ResponseType = PolicyDeterministic>
auto mpi_jac (const GRMDP< SType > &mdp, prec_t discount, const numvec &valuefunction=numvec(0), const ResponseType &response=PolicyDeterministic(), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
 Modified policy iteration using Jacobi value iteration in the inner loop. More...
 
template<class SType >
auto solve_vi (const GRMDP< SType > &mdp, prec_t discount, numvec valuefunction=numvec(0), const indvec &policy=numvec(0), unsigned long iterations=MAXITER, prec_t maxresidual=SOLPREC)
 Gauss-Seidel variant of value iteration (not parallelized). More...
 
template<class SType >
auto solve_mpi (const GRMDP< SType > &mdp, prec_t discount, const numvec &valuefunction=numvec(0), const indvec &policy=indvec(0), unsigned long iterations_pi=MAXITER, prec_t maxresidual_pi=SOLPREC, unsigned long iterations_vi=MAXITER, prec_t maxresidual_vi=SOLPREC/2, bool print_progress=false)
 Modified policy iteration using Jacobi value iteration in the inner loop. More...
 

Detailed Description

Main namespace for algorithms that operate on MDPs and RMDPs.

Typedef Documentation

◆ NatureResponse

template<class T >
using craam::algorithms::NatureResponse = typedef vec_scal_t (*)(numvec const& v, numvec const& p, T threshold)

Function representing constraints on nature.

The function computes the best response of nature and can be used in value iteration.

This function represents a nature that computes (in general) a randomized response. If the response is always deterministic, it may be more efficient to define and use a nature that computes a deterministic response directly.

The parameters are the q-values v, the reference distribution p, and the threshold. The function returns the worst-case solution and the objective value. The threshold can be used to determine the desired robustness of the solution.
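
For illustration, a custom response matching this typedef could look like the following minimal sketch. It assumes that vec_scal_t is a pair of the chosen distribution and the corresponding objective value (an assumption, not documented on this page); the function name neutral_response is hypothetical, and the CRAAM headers are assumed to be included.

    using namespace craam;

    // Hypothetical nature that ignores the threshold and keeps the nominal
    // distribution p; the returned scalar is the expectation of v under p
    // (v and p are assumed to have the same length).
    vec_scal_t neutral_response(numvec const& v, numvec const& p, prec_t threshold) {
        prec_t expected = 0.0;
        for (size_t i = 0; i < v.size(); ++i)
            expected += p[i] * v[i];
        return {p, expected};
    }

    // It can then be wrapped in a NatureInstance<prec_t>:
    //   craam::algorithms::NatureInstance<prec_t> nature = {neutral_response, 0.0};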

Function Documentation

◆ mpi_jac()

template<class SType , class ResponseType = PolicyDeterministic>
auto craam::algorithms::mpi_jac ( const GRMDP< SType > &  mdp,
prec_t  discount,
const numvec &  valuefunction = numvec(0),
const ResponseType &  response = PolicyDeterministic(),
unsigned long  iterations_pi = MAXITER,
prec_t  maxresidual_pi = SOLPREC,
unsigned long  iterations_vi = MAXITER,
prec_t  maxresidual_vi = SOLPREC/2,
bool  print_progress = false 
)
inline

Modified policy iteration using Jacobi value iteration in the inner loop.

See solve_mpi for a simplified interface. This method generalizes modified policy iteration to robust MDPs. In the value iteration step, both the action and the outcome are fixed.

Note that the total number of iterations will be bounded by iterations_pi * iterations_vi

Parameters
type: Type of realization of the uncertainty
discount: Discount factor
valuefunction: Initial value function
response: The response class (PolicyDeterministic by default) allows specifying a partial policy; only the actions not provided by the partial policy are included in the optimization. Using a class of a different type enables computing other objectives, such as robust or risk-averse ones.
iterations_pi: Maximal number of policy iteration steps
maxresidual_pi: Stop the outer policy iteration when the residual drops below this threshold.
iterations_vi: Maximal number of inner-loop value iterations
maxresidual_vi: Stop the inner value iteration when the residual drops below this threshold. This value should be smaller than maxresidual_pi.
print_progress: Whether to report progress during the computation
Returns
Computed (approximate) solution
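
A minimal sketch of the full interface, using uniform_nature to apply the same L1-constrained nature in every state (this assumes the CRAAM headers are included and that the returned solution object exposes the value function and policy; the member names are not documented on this page):

    using namespace craam;
    using namespace craam::algorithms;

    // Robust modified policy iteration with an L1 ambiguity budget of 0.5
    // shared by all states.
    template <class SType>
    auto robust_mpi_example(const GRMDP<SType>& mdp) {
        PolicyNature<prec_t> nature = uniform_nature(mdp, robust_l1, prec_t(0.5));
        // discount = 0.95; the empty numvec requests the default all-zero start.
        return mpi_jac(mdp, 0.95, numvec(0), nature);
    }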

◆ occfreq_mat()

template<typename SType , typename Policies >
numvec craam::algorithms::occfreq_mat ( const GRMDP< SType > &  rmdp,
const Transition &  init,
prec_t  discount,
const Policies &  policies 
)
inline

Computes occupancy frequencies using matrix representation of transition probabilities.

This method may not scale well because it relies on a dense matrix representation of the transition probabilities.

Template Parameters
SType: Type of the state in the MDP (regular vs robust)
Policies: Type of the policy. Either a single policy for the standard MDP evaluation, or a pair of a deterministic policy and a randomized policy of the nature
Parameters
init: Initial distribution (alpha)
discount: Discount factor (gamma)
policies: The policy (indvec) or the pair of the policy and the policy of nature (pair<indvec,vector<numvec> >). The nature is typically a randomized policy
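
As a sketch of typical use (assuming the CRAAM headers are included; the helper name is hypothetical), the occupancy frequencies can be combined with rewards_vec to evaluate a deterministic policy, since the total discounted return is the dot product of the two vectors:

    using namespace craam;
    using namespace craam::algorithms;

    // Evaluate a deterministic policy on a plain MDP via occupancy frequencies.
    template <class SType>
    prec_t policy_return_example(const GRMDP<SType>& rmdp, const Transition& init,
                                 prec_t discount, const indvec& policy) {
        numvec freq = occfreq_mat(rmdp, init, discount, policy);
        numvec rew  = rewards_vec(rmdp, policy);
        prec_t total = 0.0;
        for (size_t s = 0; s < freq.size(); ++s)
            total += freq[s] * rew[s];   // dot product = total discounted return
        return total;
    }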

◆ rewards_vec()

template<typename SType , typename Policy >
numvec craam::algorithms::rewards_vec ( const GRMDP< SType > &  rmdp,
const Policy &  policies 
)
inline

Constructs the rewards vector for each state for the RMDP.

Template Parameters
Policy: Type of the policy. Either a single policy for the standard MDP evaluation, or a pair of a deterministic policy and a randomized policy of the nature
Parameters
rmdp: Regular or robust MDP
policies: The policy (indvec) or the pair of the policy and the policy of nature (pair<indvec,vector<numvec> >). The nature is typically a randomized policy

◆ rsolve_mpi()

template<class SType , class T = prec_t>
auto craam::algorithms::rsolve_mpi ( const GRMDP< SType > &  mdp,
prec_t  discount,
const vector< NatureResponse< T >> &  nature,
const vector< T > &  thresholds,
const numvec &  valuefunction = numvec(0),
const indvec &  policy = indvec(0),
unsigned long  iterations_pi = MAXITER,
prec_t  maxresidual_pi = SOLPREC,
unsigned long  iterations_vi = MAXITER,
prec_t  maxresidual_vi = SOLPREC/2,
bool  print_progress = false 
)
inline

Modified policy iteration using Jacobi value iteration in the inner loop.

This method generalizes modified policy iteration to robust MDPs. In the value iteration step, both the action and the outcome are fixed.

This is a simplified method interface. Use mpi_jac with PolicyNature for full functionality.

Note that the total number of iterations will be bounded by iterations_pi * iterations_vi

Parameters
type: Type of realization of the uncertainty
discount: Discount factor
nature: Response of nature, one function per state.
thresholds: Parameters passed to the nature response functions, one value per state.
valuefunction: Initial value function
policy: Partial policy specification. Only the actions of states with policy[state] = -1 are optimized.
iterations_pi: Maximal number of policy iteration steps
maxresidual_pi: Stop the outer policy iteration when the residual drops below this threshold.
iterations_vi: Maximal number of inner-loop value iterations
maxresidual_vi: Stop the inner value iteration when the residual drops below this threshold. This value should be smaller than maxresidual_pi.
print_progress: Whether to report progress during the computation
Returns
Computed (approximate) solution
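
A sketch of a call with state-dependent ambiguity (the state count is taken as an argument here because the GRMDP accessor for it is not documented on this page; CRAAM headers are assumed to be included):

    using namespace craam;
    using namespace craam::algorithms;

    // Robust MPI where every state uses the L1-constrained nature, but the
    // budget (threshold) may differ from state to state.
    template <class SType>
    auto rsolve_mpi_example(const GRMDP<SType>& mdp, size_t statecount) {
        std::vector<NatureResponse<prec_t>> natures(statecount, robust_l1);
        std::vector<prec_t> thresholds(statecount, 0.1);   // e.g. budget 0.1 everywhere
        return rsolve_mpi(mdp, 0.95, natures, thresholds);
    }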

◆ rsolve_vi()

template<class SType , class T = prec_t>
auto craam::algorithms::rsolve_vi ( const GRMDP< SType > &  mdp,
prec_t  discount,
const vector< NatureResponse< T >> &  nature,
const vector< T > &  thresholds,
numvec  valuefunction = numvec(0),
const indvec &  policy = numvec(0),
unsigned long  iterations = MAXITER,
prec_t  maxresidual = SOLPREC 
)
inline

Gauss-Seidel variant of value iteration (not parallelized).

This function is suitable for computing the value function of a finite state MDP. If the states are ordered correctly, one iteration is enough to compute the optimal value function. Since the value function is updated from the last state to the first one, the states should be ordered in the temporal order.

This is a simplified method interface. Use vi_gs with PolicyNature for full functionality.

Parameters
mdp: The MDP to solve
discount: Discount factor.
nature: Response of nature, one function per state.
thresholds: Parameters passed to the nature response functions, one value per state.
valuefunction: Initial value function. Passed by value because it is modified. Optional; when its size is 0 it is ignored and all zeros are used.
policy: Partial policy specification. Only the actions of states with policy[state] = -1 are optimized.
iterations: Maximal number of iterations to run
maxresidual: Stop when the maximal residual falls below this value.
Returns
Solution that can be used to compute the total return, or the optimal policy.
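
A sketch of the single-nature overload listed in the summary above (same assumptions as in the previous examples):

    using namespace craam;
    using namespace craam::algorithms;

    // Robust Gauss-Seidel value iteration: one nature function for all states,
    // with a per-state vector of thresholds.
    template <class SType>
    auto rsolve_vi_example(const GRMDP<SType>& mdp, size_t statecount) {
        NatureResponse<prec_t> nature = robust_l1;        // decays to a function pointer
        std::vector<prec_t> thresholds(statecount, 0.2);
        return rsolve_vi(mdp, 0.9, nature, thresholds);
    }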

◆ solve_mpi()

template<class SType >
auto craam::algorithms::solve_mpi ( const GRMDP< SType > &  mdp,
prec_t  discount,
const numvec &  valuefunction = numvec(0),
const indvec &  policy = indvec(0),
unsigned long  iterations_pi = MAXITER,
prec_t  maxresidual_pi = SOLPREC,
unsigned long  iterations_vi = MAXITER,
prec_t  maxresidual_vi = SOLPREC/2,
bool  print_progress = false 
)
inline

Modified policy iteration using Jacobi value iteration in the inner loop.

This method generalizes modified policy iteration to robust MDPs. In the value iteration step, both the action and the outcome are fixed.

Note that the total number of iterations will be bounded by iterations_pi * iterations_vi

Parameters
type: Type of realization of the uncertainty
discount: Discount factor
valuefunction: Initial value function
policy: Partial policy specification. Only the actions of states with policy[state] = -1 are optimized.
iterations_pi: Maximal number of policy iteration steps
maxresidual_pi: Stop the outer policy iteration when the residual drops below this threshold.
iterations_vi: Maximal number of inner-loop value iterations
maxresidual_vi: Stop the inner value iteration when the residual drops below this threshold. This value should be smaller than maxresidual_pi.
print_progress: Whether to report progress during the computation
Returns
Computed (approximate) solution
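
A sketch of a call with a partial policy (assumptions as above; the state count is supplied by the caller and the helper name is hypothetical):

    using namespace craam;
    using namespace craam::algorithms;

    // Plain modified policy iteration where action 0 is pinned in state 0 and
    // all states marked -1 are optimized.
    template <class SType>
    auto solve_mpi_example(const GRMDP<SType>& mdp, size_t statecount) {
        indvec policy(statecount, -1);
        if (statecount > 0) policy[0] = 0;
        return solve_mpi(mdp, 0.99, numvec(0), policy);
    }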

◆ solve_vi()

template<class SType >
auto craam::algorithms::solve_vi ( const GRMDP< SType > &  mdp,
prec_t  discount,
numvec  valuefunction = numvec(0),
const indvec &  policy = numvec(0),
unsigned long  iterations = MAXITER,
prec_t  maxresidual = SOLPREC 
)
inline

Gauss-Seidel variant of value iteration (not parallelized).

This function is suitable for computing the value function of a finite state MDP. If the states are ordered correctly, one iteration is enough to compute the optimal value function. Since the value function is updated from the last state to the first one, the states should be ordered in the temporal order.

Parameters
mdp: The MDP to solve
discount: Discount factor.
valuefunction: Initial value function. Passed by value because it is modified. Optional; when its size is 0 it is ignored and all zeros are used.
policy: Partial policy specification. Only the actions of states with policy[state] = -1 are optimized.
iterations: Maximal number of iterations to run
maxresidual: Stop when the maximal residual falls below this value.
Returns
Solution that can be used to compute the total return, or the optimal policy.
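
For example (a sketch; the warm-start vector may be empty, in which case it is ignored as described above):

    using namespace craam;
    using namespace craam::algorithms;

    // Gauss-Seidel value iteration, optionally warm-started with a previously
    // computed value function.
    template <class SType>
    auto solve_vi_example(const GRMDP<SType>& mdp, numvec warmstart = numvec(0)) {
        return solve_vi(mdp, 0.99, warmstart);
        // The returned Solution can then be queried for the value function,
        // policy, and residual (member names are not documented on this page).
    }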

◆ string_to_nature()

NatureResponse<prec_t> craam::algorithms::string_to_nature ( string  nature)
inline

Converts a string representation of nature response to the appropriate nature response call.

This function is useful when the code is used within Python or R libraries. The values correspond to the function definitions; the ones currently supported are:

  • robust_unbounded
  • optimistic_unbounded
  • robust_l1
  • optimistic_l1
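
For example (a sketch with the same assumptions as the earlier examples), the returned response can be passed directly to the simplified robust solvers:

    using namespace craam;
    using namespace craam::algorithms;

    // Pick the nature by name, e.g. when the name comes from a Python or R wrapper,
    // and solve with a uniform budget of 0.5 in every state.
    template <class SType>
    auto named_nature_example(const GRMDP<SType>& mdp, size_t statecount) {
        NatureResponse<prec_t> nature = string_to_nature("robust_l1");
        std::vector<prec_t> thresholds(statecount, 0.5);
        return rsolve_vi(mdp, 0.95, nature, thresholds);
    }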

◆ transition_mat()

template<typename SType , typename Policies >
MatrixXd craam::algorithms::transition_mat ( const GRMDP< SType > &  rmdp,
const Policies &  policies,
bool  transpose = false 
)
inline

Constructs the transition (or its transpose) matrix for the policy.

Template Parameters
SType: Type of the state in the MDP (regular vs robust)
Policies: Type of the policy. Either a single policy for the standard MDP evaluation, or a pair of a deterministic policy and a randomized policy of the nature
Parameters
rmdp: Regular or robust MDP
policies: The policy (indvec) or the pair of the policy and the policy of nature (pair<indvec,vector<numvec> >). The nature is typically a randomized policy
transpose: (optional, false) Whether to return the transpose of the transition matrix. This is useful for computing occupancy frequencies.

◆ value_action() [1/6]

prec_t craam::algorithms::value_action ( const RegularAction &  action,
const numvec &  valuefunction,
prec_t  discount 
)
inline

Computes the average value of the action.

Parameters
action: Action for which to compute the value
valuefunction: State value function to use
discount: Discount factor
Returns
Action value

◆ value_action() [2/6]

prec_t craam::algorithms::value_action ( const RegularAction &  action,
const numvec &  valuefunction,
prec_t  discount,
numvec  distribution 
)
inline

Computes a value of the action for a given distribution.

This function can be used to evaluate a robust solution which may modify the transition probabilities.

The new distribution may be non-zero only for states for which the original distribution is not zero.

Parameters
action: Action for which to compute the value
valuefunction: State value function to use
discount: Discount factor
distribution: New distribution. Its length must match the number of states for which the original transition probabilities are strictly positive. The order of states is the same as in the underlying transition.
Returns
Action value

◆ value_action() [3/6]

prec_t craam::algorithms::value_action ( const WeightedOutcomeAction &  action,
numvec const &  valuefunction,
prec_t  discount 
)
inline

Computes the average outcome using the provided distribution.

Parameters
action: Action for which the value is computed
valuefunction: Updated value function
discount: Discount factor
Returns
Mean value of the action

◆ value_action() [4/6]

template<class T >
vec_scal_t craam::algorithms::value_action ( const RegularAction &  action,
const numvec &  valuefunction,
prec_t  discount,
const NatureInstance< T > &  nature 
)
inline

Computes an ambiguous value (e.g. robust) of the action, depending on the type of nature that is provided.

Parameters
action: Action for which to compute the value
valuefunction: State value function to use
discount: Discount factor
nature: Method used to compute the response of nature.

◆ value_action() [5/6]

prec_t craam::algorithms::value_action ( const WeightedOutcomeAction &  action,
numvec const &  valuefunction,
prec_t  discount,
const numvec &  distribution 
)
inline

Computes the action value for a fixed outcome distribution.

Parameters
action: Action for which the value is computed
valuefunction: Updated value function
discount: Discount factor
distribution: Custom distribution that is selected by nature.
Returns
Value of the action

◆ value_action() [6/6]

template<class T >
vec_scal_t craam::algorithms::value_action ( const WeightedOutcomeAction &  action,
numvec const &  valuefunction,
prec_t  discount,
const NatureInstance< T >  nature 
)
inline

Computes the outcome distribution chosen by nature, subject to constraints on nature's distribution.

Does not work when the number of outcomes is zero.

Parameters
action: Action for which the value is computed
valuefunction: Value function reference
discount: Discount factor
nature: Method used to compute the response of nature.
Returns
Outcome distribution and the mean value for the choice of the nature

◆ value_fix_state() [1/3]

template<class AType >
prec_t craam::algorithms::value_fix_state ( const SAState< AType > &  state,
numvec const &  valuefunction,
prec_t  discount,
long  actionid 
)
inline

Computes the value of a fixed (and valid) action.

Performs validity checks.

Parameters
state: State to compute the value for
valuefunction: Value function to use for the following states
discount: Discount factor
Returns
Value of state, 0 if it's terminal regardless of the action index

◆ value_fix_state() [2/3]

template<class AType , class T >
vec_scal_t craam::algorithms::value_fix_state ( const SAState< AType > &  state,
numvec const &  valuefunction,
prec_t  discount,
long  actionid,
const NatureInstance< T > &  nature 
)
inline

Computes the value of a fixed action and any response of nature.

Parameters
state: State to compute the value for
valuefunction: Value function to use in computing value of states.
discount: Discount factor
nature: Instance of a nature optimizer
Returns
Value of state, 0 if it's terminal regardless of the action index

◆ value_fix_state() [3/3]

template<class AType >
prec_t craam::algorithms::value_fix_state ( const SAState< AType > &  state,
numvec const &  valuefunction,
prec_t  discount,
long  actionid,
numvec  distribution 
)
inline

Computes the value of a fixed action and fixed response of nature.

Parameters
state: State to compute the value for
valuefunction: Value function to use in computing value of states.
discount: Discount factor
distribution: New distribution over states with non-zero nominal probabilities
Returns
Value of state, 0 if it's terminal regardless of the action index

◆ value_max_state() [1/2]

template<class AType >
pair<long,prec_t> craam::algorithms::value_max_state ( const SAState< AType > &  state,
const numvec &  valuefunction,
prec_t  discount 
)
inline

Finds the action with the maximal average return.

The return is 0 when there are no actions; such a state is assumed to be terminal.

Parameters
state: State to compute the value for
valuefunction: Value function to use for the following states
discount: Discount factor
Returns
(Index of best action, value), returns 0 if the state is terminal.

◆ value_max_state() [2/2]

template<typename AType , typename T >
ind_vec_scal_t craam::algorithms::value_max_state ( const SAState< AType > &  state,
const numvec &  valuefunction,
prec_t  discount,
const NatureInstance< T > &  nature 
)
inline

Finds the greedy action and its value for the given value function.

This function assumes a robust or optimistic response by nature depending on the provided ambiguity.

When there are no actions, the state is assumed to be terminal and the return is 0.

Parameters
state: State to compute the value for
valuefunction: Value function to use in computing value of states.
discount: Discount factor
nature: Method used to compute the response of nature.
Returns
(Action index, outcome index, value), 0 if it's terminal regardless of the action index

◆ vi_gs()

template<class SType , class ResponseType = PolicyDeterministic>
auto craam::algorithms::vi_gs ( const GRMDP< SType > &  mdp,
prec_t  discount,
numvec  valuefunction = numvec(0),
const ResponseType &  response = PolicyDeterministic(),
unsigned long  iterations = MAXITER,
prec_t  maxresidual = SOLPREC 
)
inline

Gauss-Seidel variant of value iteration (not parallelized).

See solve_vi for a simplified interface.

This function is suitable for computing the value function of a finite state MDP. If the states are ordered correctly, one iteration is enough to compute the optimal value function. Since the value function is updated from the last state to the first one, the states should be ordered in the temporal order.

Parameters
mdp: The MDP to solve
discount: Discount factor.
valuefunction: Initial value function. Passed by value because it is modified. Optional; when its size is 0 it is ignored and all zeros are used.
response: The response class (PolicyDeterministic by default) allows specifying a partial policy; only the actions not provided by the partial policy are included in the optimization. Using a class of a different type enables computing other objectives, such as robust or risk-averse ones.
iterations: Maximal number of iterations to run
maxresidual: Stop when the maximal residual falls below this value.
Returns
Solution that can be used to compute the total return, or the optimal policy.