CRAAM 2.0.0: Robust and Approximate Markov Decision Processes

SampledMDP Class Reference

Constructs an MDP from integer samples.
#include <Samples.hpp>
Public Member Functions

SampledMDP ()
    Constructs an empty MDP from discrete samples.
void add_samples (const DiscreteSamples &samples)
    Constructs or adds states and actions based on the provided samples.
shared_ptr< const MDP > get_mdp () const
shared_ptr< MDP > get_mdp_mod ()
Transition get_initial () const
vector< vector< prec_t > > get_state_action_weights ()
long state_count ()
    Returns the number of states in the samples (the highest observed index; some indices may be missing).
Protected Attributes

shared_ptr< MDP > mdp
    Internal MDP representation.
Transition initial
    Initial distribution.
vector< vector< prec_t > > state_action_weights
    Cumulative sample weights for each state and action.
Constructs an MDP from integer samples.
Integer samples: Each decision state, expectation state, and action is identified by an integer.
Input: Sample set \( \Sigma = (s_i, a_i, s_i', r_i, w_i)_{i=0}^{m-1} \)
Output: An MDP such that:
\[ P(s,a,s') = \frac{\sum_{i=0}^{m-1} w_i 1\{ s = s_i, a = a_i, s' = s_i' \} } { \sum_{i=0}^{m-1} w_i 1\{ s = s_i, a = a_i \} } \]
\[ r(s,a,s') = \frac{\sum_{i=0}^{m-1} r_i w_i 1\{ s = s_i, a = a_i, s' = s_i' \} } { \sum_{i=0}^{m-1} w_i 1\{ s = s_i, a = a_i, s' = s_i' \} } \]
The class also tracks cumulative weights of state-action samples \( z \):
\[ z(s,a) = \sum_{i=0}^{m-1} w_i 1\{ s = s_i, a = a_i \} \]
If \( z(s,a) = 0 \), then the action \( a \) is marked as invalid. Storing these weights incurs a small additional memory cost.
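As a concrete illustration of these formulas, here is a minimal self-contained sketch (not CRAAM's implementation; the Sample struct and map-based storage are assumptions made for brevity) that computes \( z \), \( P \), and \( r \) from a list of weighted samples:

```cpp
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical sample record (s, a, s', r, w); purely illustrative.
struct Sample { long s, a, snext; double r, w; };

using SA  = std::pair<long, long>;          // (s, a)
using SAS = std::tuple<long, long, long>;   // (s, a, s')

// Computes z(s,a), P(s,a,s'), and r(s,a,s') exactly as defined above.
void estimate(const std::vector<Sample>& samples,
              std::map<SA, double>& z,
              std::map<SAS, double>& P,
              std::map<SAS, double>& r) {
    std::map<SAS, double> wsum, rwsum;  // sums of w_i and r_i * w_i per (s,a,s')
    for (const auto& e : samples) {
        z[{e.s, e.a}]              += e.w;
        wsum[{e.s, e.a, e.snext}]  += e.w;
        rwsum[{e.s, e.a, e.snext}] += e.r * e.w;
    }
    for (const auto& [key, w] : wsum) {
        const auto [s, a, snext] = key;
        P[key] = w / z[{s, a}];   // normalize by cumulative state-action weight
        r[key] = rwsum[key] / w;  // weight-averaged reward
    }
    // State-action pairs that never occur have no entries here; in SampledMDP
    // the corresponding actions are created but marked invalid.
}
```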
Important: Actions that are never sampled (no samples for the given state-action pair) are labeled as invalid and are excluded from the computation of the value function and the solution. For example, if state 0 has samples for action 1 but none for action 0, then action 0 is still created, but it is ignored when computing the value function.
When sample sets are added by multiple calls of SampledMDP::add_samples, the result is the same as if all the individual sample sets were combined and added together. See SampledMDP::add_samples for more details.
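A minimal usage sketch of the public interface documented on this page follows. How a DiscreteSamples object is populated is an assumption here (the commented-out add_sample call is hypothetical), so consult Samples.hpp for the actual interface:

```cpp
#include "Samples.hpp"

using namespace craam;        // assumption: types live in these namespaces
using namespace craam::msen;

int main() {
    DiscreteSamples samples;
    // Hypothetical insertion call; the exact signature may differ:
    // samples.add_sample(/*from*/ 0, /*action*/ 0, /*to*/ 1,
    //                    /*reward*/ 1.0, /*weight*/ 1.0, /*step*/ 0, /*run*/ 0);

    SampledMDP smdp;                        // empty MDP
    smdp.add_samples(samples);              // build states and actions
    smdp.add_samples(samples);              // later calls merge incrementally

    shared_ptr<const MDP> mdp = smdp.get_mdp();
    Transition initial = smdp.get_initial();
    long nstates = smdp.state_count();      // highest observed state index
    return 0;
}
```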
Member Function Documentation

add_samples()

void add_samples (const DiscreteSamples &samples)    [inline]
Constructs or adds states and actions based on the provided samples.
Sample sets can be added iteratively. Assume that the current transition probabilities are constructed from a sample set \( \Sigma = (s_i, a_i, s_i', r_i, w_i)_{i=0}^{m-1} \) and that add_samples is called with a sample set \( \Sigma' = (s_j, a_j, s_j', r_j, w_j)_{j=m}^{n-1} \). The result is the same as if all samples \( 0 \ldots (n-1) \) were added simultaneously.
The MDP values are updated as follows:
\[ z'(s,a) = z(s,a) + \sum_{j=m}^{n-1} w_j 1\{ s = s_j, a = a_j \} \]
\begin{align*} P'(s,a,s') &= \frac{z(s,a)\, P(s,a,s') + \sum_{j=m}^{n-1} w_j 1\{ s = s_j, a = a_j, s' = s_j' \} }{ z'(s,a) } \\ &= \frac{P(s,a,s') + (1/z(s,a)) \sum_{j=m}^{n-1} w_j 1\{ s = s_j, a = a_j, s' = s_j' \} }{ z'(s,a) / z(s,a) } \end{align*}
The denominator is computed implicitly by normalizing the transition probabilities.
\begin{align*} r'(s,a,s') &= \frac{r(s,a,s')\, z(s,a)\, P(s,a,s') + \sum_{j=m}^{n-1} r_j w_j 1\{ s = s_j, a = a_j, s' = s_j' \}}{z'(s,a)\, P'(s,a,s')} \\ &= \frac{r(s,a,s')\, P(s,a,s') + \sum_{j=m}^{n-1} r_j (w_j/z(s,a)) 1\{ s = s_j, a = a_j, s' = s_j' \}}{z'(s,a)\, P'(s,a,s') / z(s,a)} \\ &= \frac{r(s,a,s')\, P(s,a,s') + \sum_{j=m}^{n-1} r_j (w_j/z(s,a)) 1\{ s = s_j, a = a_j, s' = s_j' \}}{P(s,a,s') + \sum_{j=m}^{n-1} (w_j/z(s,a)) 1\{ s = s_j, a = a_j, s' = s_j' \}} \end{align*}
The last line follows from the update formula for \( P'(s,a,s') \) above. This corresponds to calling Transition::add_sample repeatedly for \( j = m \ldots (n-1) \) with \begin{align*} p &= (w_j/z(s,a))\, 1\{ s = s_j, a = a_j, s' = s_j' \} \\ r &= r_j. \end{align*}
Parameters

    samples    New sample set to add to the transition probabilities and rewards.
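To make the update above concrete, here is an illustrative sketch (not CRAAM's code; the Row struct and function names are hypothetical) of how repeated add_sample-style merges with \( p = w_j / z(s,a) \), followed by a final normalization, reproduce \( P' \) and \( r' \):

```cpp
#include <tuple>
#include <vector>

// One (s,a) row of the MDP: transition probabilities and rewards indexed by
// the next state s'. A hypothetical stand-in for CRAAM's Transition class.
struct Row { std::vector<double> p, r; };

// Mimics the effect of Transition::add_sample: merge probability weight p and
// reward `reward` into target state snext, keeping r a p-weighted average.
void add_sample(Row& row, long snext, double p, double reward) {
    if (static_cast<long>(row.p.size()) <= snext) {
        row.p.resize(snext + 1, 0.0);
        row.r.resize(snext + 1, 0.0);
    }
    row.r[snext] = (row.r[snext] * row.p[snext] + reward * p)
                   / (row.p[snext] + p);
    row.p[snext] += p;
}

// Incremental update of one row with new samples (s'_j, r_j, w_j); assumes
// z > 0, i.e. the state-action pair has been seen before. The division by
// z'(s,a)/z(s,a) happens implicitly in the final normalization.
void update_row(Row& row, double& z,
                const std::vector<std::tuple<long, double, double>>& news) {
    const double zold = z;
    for (const auto& [snext, reward, w] : news) {
        add_sample(row, snext, w / zold, reward);
        z += w;                              // z'(s,a) = z(s,a) + sum_j w_j
    }
    double total = 0.0;                      // row now sums to z'(s,a)/z(s,a)
    for (double v : row.p) total += v;
    for (double& v : row.p) v /= total;      // normalize: yields P'(s,a,.)
}
```

Note that the rewards are untouched by the final normalization, which matches the last line of the derivation: \( r' \) is a ratio in which the scaling of the probabilities cancels.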
get_mdp()

shared_ptr< const MDP > get_mdp () const    [inline]

get_mdp_mod()

shared_ptr< MDP > get_mdp_mod ()    [inline]

get_initial()

Transition get_initial () const    [inline]

get_state_action_weights()

vector< vector< prec_t > > get_state_action_weights ()    [inline]

state_count()

long state_count ()    [inline]
Returns the number of states in the samples (the highest observed state index; some lower indices may not be observed).