
Mathematical Foundations of Reinforcement Learning (1): Basic Concepts

1. Basic Concepts

  • State

    • The status of the agent with respect to the environment.
    • A state is the configuration observed in the environment.
    • State Space ($S$): The set of all states, $S = \{s_i\}_{i=1}^n$.
  • Action

    • The possible actions that can be taken in each state.
    • An action is what the agent executes.
    • Action Space of a state ($A(s)$): $A(s_i) = \{a_j\}_{j=1}^m$.
  • State Transition

    • $s_1 \xrightarrow{a_1} s_2$.
    • It defines how the agent interacts with the environment.
    • State Transition Probability: describes state transitions with probabilities.
      • Example (deterministic environment): $p(s_2|s_1, a_1)=1$ and $p(s_i|s_1, a_1)=0$ for all $i \neq 2$ (see the sketch below).
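A minimal sketch (in Python, with hypothetical state/action names such as "s1" and "a1") of how a deterministic state-transition probability $p(s'|s,a)$ can be stored as a table and queried; entries that are not listed are taken to have probability 0:

```python
# A deterministic state-transition table: p(s'|s, a) = 1 for exactly one s'.
# The state/action names ("s1", "a1", ...) are hypothetical placeholders.
transition = {
    ("s1", "a1"): {"s2": 1.0},   # p(s2|s1, a1) = 1
    # p(s_i|s1, a1) = 0 for every other i: missing entries mean probability 0
}

def p_next(s_prime, s, a):
    """Return p(s'|s, a); unspecified transitions have probability 0."""
    return transition.get((s, a), {}).get(s_prime, 0.0)

assert p_next("s2", "s1", "a1") == 1.0
assert p_next("s3", "s1", "a1") == 0.0
# For every (s, a), the probabilities over next states must sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in transition.values())
```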
  • Policy ($\pi$)

    • Tells the agent which action to take in each state.
    • The goal of reinforcement learning is to find the optimal policy.
    • Example (deterministic policy): $\pi(a_k|s_j)=1$ and $\pi(a|s_j)=0$ for $a \neq a_k$ (see the sketch below).
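A minimal sketch of a policy $\pi(a|s)$ stored as a table; the deterministic case is the special case where one action has probability 1. The state/action names and the stochastic entry are hypothetical:

```python
import random

# policy[s][a] = pi(a|s), the probability of choosing action a in state s.
policy = {
    "s_j": {"a_k": 1.0},              # deterministic: pi(a_k|s_j) = 1, 0 otherwise
    "s_1": {"a_1": 0.5, "a_2": 0.5},  # a stochastic policy for comparison
}

def sample_action(pi, s):
    """Sample an action according to pi(.|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s_j"))  # always "a_k"
print(sample_action(policy, "s_1"))  # "a_1" or "a_2", each with probability 0.5
```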
  • Reward ($R$)

    • A real number obtained after taking an action.
    • It is used to evaluate the action taken.
    • Example: $p(r=-1|s_1, a_1)=1$ and $p(r=k|s_1, a_1)=0$ for all $k \neq -1$.
    • Personal note: the reward depends on the state and the action, but not on the next state.
  • Summary of personal notes

    • Model-based vs. model-free: in the model-based setting, the state transition probabilities are known; in the model-free setting, they are unknown.
    • On rewards: in a given state, taking an action may lead to different next states with different probabilities and yield different rewards. In some states, different actions yield different rewards. In a given state, different actions may lead to the same next state, yet the rewards can still differ (see the sketch below).
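A minimal sketch of the reward distribution $p(r|s,a)$ stored as a table; it matches the example $p(r=-1|s_1,a_1)=1$ above and adds a hypothetical stochastic entry to illustrate that one (state, action) pair may yield different rewards with different probabilities:

```python
import random

# reward_dist[(s, a)] maps each possible reward r to p(r|s, a); the reward
# distribution depends only on the current state and action, not on the next state.
reward_dist = {
    ("s1", "a1"): {-1: 1.0},         # p(r=-1|s1, a1) = 1, every other r has probability 0
    ("s2", "a2"): {0: 0.8, -1: 0.2}, # hypothetical: one (s, a) pair, two possible rewards
}

def sample_reward(s, a):
    """Sample a reward r with probability p(r|s, a)."""
    rewards, probs = zip(*reward_dist[(s, a)].items())
    return random.choices(rewards, weights=probs, k=1)[0]

print(sample_reward("s1", "a1"))  # always -1
print(sample_reward("s2", "a2"))  # 0 or -1, depending on chance
```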
  • Trajectory

    • A trajectory is a state-action-reward chain.
    • $s_1 \xrightarrow{a_1, r=0} s_2 \xrightarrow{a_2, r=0} s_3 \xrightarrow{a_3, r=0} s_8 \xrightarrow{a_4, r=1} s_9$
  • Return

    • The return of this trajectory is the sum of all the rewards collected along the trajectory.
    • Example: return $= 0 + 0 + 0 + 1 = 1$.
  • Discounted Return

    • Discount rate $\gamma \in [0, 1)$.
    • Discounted Return = $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
    • Example: $0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 = \gamma^3$.
    • Purpose:
      1. The sum converges (stays finite).
      2. Balances near-future and far-future rewards.
      3. The smaller $\gamma$ is, the more short-sighted the agent becomes (see the sketch below).
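A short worked sketch that computes both the plain return and the discounted return of the example trajectory above (rewards 0, 0, 0, 1); for a finite trajectory, $\gamma = 1$ recovers the plain return, and a smaller $\gamma$ shrinks the weight of the far-future reward:

```python
def discounted_return(rewards, gamma):
    """Compute sum_{k>=0} gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 1]                  # rewards along s1 -> s2 -> s3 -> s8 -> s9

print(discounted_return(rewards, 1.0))  # plain return: 0 + 0 + 0 + 1 = 1
print(discounted_return(rewards, 0.9))  # 0.9**3 ≈ 0.729
print(discounted_return(rewards, 0.1))  # 0.1**3 ≈ 0.001: a small gamma is short-sighted
```

The only nonzero reward sits three steps in the future, so it is scaled by $\gamma^3$; the smaller $\gamma$ is, the less that far-future reward contributes.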
  • Episode

    • The resulting trajectory is called an episode (a trial).

    • An episode is usually assumed to be a finite trajectory.

    • Episodic Tasks: tasks that consist of episodes, i.e., the interaction ends at a terminal state.

    • Continuing Tasks: tasks without a terminal state; to keep the return finite, discounting is used.

    • A more general formulation: convert episodic tasks to continuing tasks.

      • Option 1: Treat the target state as a special absorbing state. Once the agent reaches it, it never leaves, and the reward is $r = 0$ from then on.
      • Option 2: Treat the target state as a normal state governed by the policy. The agent may leave it, and every time it enters the target state it receives $r = +1$.
      • Note: Option 2 is more general (see the sketch below).
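A minimal sketch contrasting the two options, using hypothetical grid-world entries with s9 as the target state; each table maps $(s, a)$ to a next-state distribution and a reward:

```python
# Each table maps (s, a) -> (distribution over next states, reward).
# The target state s9 and the other states/actions are hypothetical placeholders.

# Option 1: s9 is a special absorbing state; every action keeps the agent in s9
# and yields reward 0, so the rest of the (now infinite) trajectory adds nothing.
option1 = {("s9", a): ({"s9": 1.0}, 0) for a in ("a1", "a2", "a3", "a4", "a5")}

# Option 2: s9 is a normal state governed by the policy; the agent may leave it,
# and every transition into s9 yields r = +1.
option2 = {
    ("s8", "a2"): ({"s9": 1.0}, +1),  # entering the target state
    ("s9", "a1"): ({"s6": 1.0}, 0),   # the agent can leave the target state again
}

print(option1[("s9", "a3")])  # ({'s9': 1.0}, 0): stuck in s9, collecting zero reward
```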
  • Markov Decision Process (MDP)

    • Sets:

      • State: The set of states $S$.
      • Action: The set of actions $A(s)$ associated with state $s \in S$.
      • Reward: The set of rewards $R(s, a)$.
    • Probability distributions:

      • State transition probability: At state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s,a)$.
      • Reward probability: At state $s$, taking action $a$, the probability of obtaining reward $r$ is $p(r|s,a)$.
    • Policy ($\pi$): At state $s$, the probability of choosing action $a$ is $\pi(a|s)$.

    • Markov Property

      • The memoryless property (see the sketch after this list).
      • $p(S_{t+1} | A_t, S_t, \dots, A_0, S_0) = p(S_{t+1} | A_t, S_t)$.
      • $p(R_{t+1} | A_t, S_t, \dots, A_0, S_0) = p(R_{t+1} | A_t, S_t)$.
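A minimal sketch that puts the pieces together: tables for $p(s'|s,a)$, $p(r|s,a)$, and $\pi(a|s)$ (all values hypothetical) and a loop that samples a state-action-reward chain. Each step reads only the current state and action, which is exactly the memoryless property stated above:

```python
import random

def sample(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

# A hypothetical two-state MDP: S = {s1, s2}, A(s) = {stay, move}.
P = {  # state transition probabilities p(s'|s, a)
    ("s1", "move"): {"s2": 0.9, "s1": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s2", "move"): {"s1": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}
R = {  # reward probabilities p(r|s, a)
    ("s1", "move"): {0: 1.0},
    ("s1", "stay"): {0: 1.0},
    ("s2", "move"): {0: 1.0},
    ("s2", "stay"): {1: 1.0},
}
pi = {  # policy pi(a|s)
    "s1": {"move": 0.8, "stay": 0.2},
    "s2": {"stay": 0.7, "move": 0.3},
}

def rollout(s, steps):
    """Sample a state-action-reward chain; each step reads only (S_t, A_t)."""
    trajectory = []
    for _ in range(steps):
        a = sample(pi[s])           # A_t ~ pi(.|S_t)
        r = sample(R[(s, a)])       # R_{t+1} ~ p(.|S_t, A_t)
        s_next = sample(P[(s, a)])  # S_{t+1} ~ p(.|S_t, A_t): the Markov property
        trajectory.append((s, a, r, s_next))
        s = s_next                  # no history beyond the current state is kept
    return trajectory

print(rollout("s1", 5))
```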