
Mathematical Foundations of Reinforcement Learning (1): Basic Concepts

1. Basic Concepts

  • State

    • The status of the agent with respect to the environment.
    • A state is the configuration observed in the environment.
    • State Space ($S$): The set of all states, $S = \{s_i\}_{i=1}^n$.
  • Action

    • The possible actions that can be taken in each state.
    • An action is what the agent executes.
    • Action Space of a state ($A(s)$): $A(s_i) = \{a_j\}_{j=1}^m$.
  • State Transition

    • $s_1 \xrightarrow{a_1} s_2$.
    • It defines how the agent interacts with the environment.
    • State Transition Probability: describes state transitions with probabilities.
      • Example (deterministic environment): $p(s_2|s_1, a_1)=1$ and $p(s_i|s_1, a_1)=0$ for all $i \neq 2$ (see the sketch below).
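A minimal sketch (in Python, with hypothetical state/action names such as "s1" and "a1") of how a deterministic state-transition probability $p(s'|s,a)$ can be stored as a table and queried; entries that are not listed are taken to have probability 0:

```python
# A deterministic state-transition table: p(s'|s, a) = 1 for exactly one s'.
# The state/action names ("s1", "a1", ...) are hypothetical placeholders.
transition = {
    ("s1", "a1"): {"s2": 1.0},   # p(s2|s1, a1) = 1
    # p(s_i|s1, a1) = 0 for every other i: missing entries mean probability 0
}

def p_next(s_prime, s, a):
    """Return p(s'|s, a); unspecified transitions have probability 0."""
    return transition.get((s, a), {}).get(s_prime, 0.0)

assert p_next("s2", "s1", "a1") == 1.0
assert p_next("s3", "s1", "a1") == 0.0
# For every (s, a), the probabilities over next states must sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in transition.values())
```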
  • Policy ($\pi$)

    • Tells the agent which action to take in each state.
    • The goal of reinforcement learning is to find the optimal policy.
    • Example (deterministic policy): $\pi(a_k|s_j)=1$ and $\pi(a|s_j)=0$ for $a \neq a_k$ (see the sketch below).
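A minimal sketch of a policy $\pi(a|s)$ stored as a table; the deterministic case is the special case where one action has probability 1. The state/action names and the stochastic entry are hypothetical:

```python
import random

# policy[s][a] = pi(a|s), the probability of choosing action a in state s.
policy = {
    "s_j": {"a_k": 1.0},              # deterministic: pi(a_k|s_j) = 1, 0 otherwise
    "s_1": {"a_1": 0.5, "a_2": 0.5},  # a stochastic policy for comparison
}

def sample_action(pi, s):
    """Sample an action according to pi(.|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s_j"))  # always "a_k"
print(sample_action(policy, "s_1"))  # "a_1" or "a_2", each with probability 0.5
```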
  • Reward ($R$)

    • A real number obtained after taking an action.
    • It is used to evaluate the action taken.
    • Example: $p(r=-1|s_1, a_1)=1$ and $p(r=k|s_1, a_1)=0$ for all $k \neq -1$.
    • Personal note: the reward depends on the state and the action, but not on the next state.
  • Summary of personal notes

    • Model-based vs. model-free: in the model-based setting, the state transition probabilities are known; in the model-free setting, they are unknown.
    • On rewards: in a given state, taking an action may lead to different next states with different probabilities and yield different rewards. In some states, different actions yield different rewards. In a given state, different actions may lead to the same next state, yet the rewards can still differ (see the sketch below).
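A minimal sketch of the reward distribution $p(r|s,a)$ stored as a table; it matches the example $p(r=-1|s_1,a_1)=1$ above and adds a hypothetical stochastic entry to illustrate that one (state, action) pair may yield different rewards with different probabilities:

```python
import random

# reward_dist[(s, a)] maps each possible reward r to p(r|s, a); the reward
# distribution depends only on the current state and action, not on the next state.
reward_dist = {
    ("s1", "a1"): {-1: 1.0},         # p(r=-1|s1, a1) = 1, every other r has probability 0
    ("s2", "a2"): {0: 0.8, -1: 0.2}, # hypothetical: one (s, a) pair, two possible rewards
}

def sample_reward(s, a):
    """Sample a reward r with probability p(r|s, a)."""
    rewards, probs = zip(*reward_dist[(s, a)].items())
    return random.choices(rewards, weights=probs, k=1)[0]

print(sample_reward("s1", "a1"))  # always -1
print(sample_reward("s2", "a2"))  # 0 or -1, depending on chance
```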
  • Trajectory

    • A trajectory is a state-action-reward chain.
    • $s_1 \xrightarrow{a_1, r=0} s_2 \xrightarrow{a_2, r=0} s_3 \xrightarrow{a_3, r=0} s_8 \xrightarrow{a_4, r=1} s_9$
  • Return

    • The return of this trajectory is the sum of all the rewards collected along the trajectory.
    • Example: return $= 0 + 0 + 0 + 1 = 1$.
  • Discounted Return

    • Discount rate $\gamma \in [0, 1)$.
    • Discounted Return = $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
    • Example: $0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 = \gamma^3$.
    • Purpose:
      1. The sum converges (stays finite).
      2. Balances near-future and far-future rewards.
      3. The smaller $\gamma$ is, the more short-sighted the agent becomes (see the sketch below).
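A short worked sketch that computes both the plain return and the discounted return of the example trajectory above (rewards 0, 0, 0, 1); for a finite trajectory, $\gamma = 1$ recovers the plain return, and a smaller $\gamma$ shrinks the weight of the far-future reward:

```python
def discounted_return(rewards, gamma):
    """Compute sum_{k>=0} gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 1]                  # rewards along s1 -> s2 -> s3 -> s8 -> s9

print(discounted_return(rewards, 1.0))  # plain return: 0 + 0 + 0 + 1 = 1
print(discounted_return(rewards, 0.9))  # 0.9**3 ≈ 0.729
print(discounted_return(rewards, 0.1))  # 0.1**3 ≈ 0.001: a small gamma is short-sighted
```

The only nonzero reward sits three steps in the future, so it is scaled by $\gamma^3$; the smaller $\gamma$ is, the less that far-future reward contributes.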
  • Episode

    • The resulting trajectory is called an episode (a trial).

    • An episode is usually assumed to be a finite trajectory.

    • Episodic Tasks: tasks that consist of episodes, i.e., the interaction ends at a terminal state.

    • Continuing Tasks: tasks without a terminal state; to keep the return finite, discounting is used.

    • A more general formulation: convert episodic tasks to continuing tasks.

      • Option 1: Treat the target state as a special absorbing state. Once the agent reaches it, it never leaves, and the reward is $r = 0$ from then on.
      • Option 2: Treat the target state as a normal state governed by the policy. The agent may leave it, and every time it enters the target state it receives $r = +1$.
      • Note: Option 2 is more general (see the sketch below).
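A minimal sketch contrasting the two options, using hypothetical grid-world entries with s9 as the target state; each table maps $(s, a)$ to a next-state distribution and a reward:

```python
# Each table maps (s, a) -> (distribution over next states, reward).
# The target state s9 and the other states/actions are hypothetical placeholders.

# Option 1: s9 is a special absorbing state; every action keeps the agent in s9
# and yields reward 0, so the rest of the (now infinite) trajectory adds nothing.
option1 = {("s9", a): ({"s9": 1.0}, 0) for a in ("a1", "a2", "a3", "a4", "a5")}

# Option 2: s9 is a normal state governed by the policy; the agent may leave it,
# and every transition into s9 yields r = +1.
option2 = {
    ("s8", "a2"): ({"s9": 1.0}, +1),  # entering the target state
    ("s9", "a1"): ({"s6": 1.0}, 0),   # the agent can leave the target state again
}

print(option1[("s9", "a3")])  # ({'s9': 1.0}, 0): stuck in s9, collecting zero reward
```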
  • Markov Decision Process (MDP)

    • Sets:

      • State: The set of states $S$.
      • Action: The set of actions $A(s)$ associated with state $s \in S$.
      • Reward: The set of rewards $R(s, a)$.
    • Probability distributions:

      • State transition probability: At state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s,a)$.
      • Reward probability: At state $s$, taking action $a$, the probability of obtaining reward $r$ is $p(r|s,a)$.
    • Policy ($\pi$): At state $s$, the probability of choosing action $a$ is $\pi(a|s)$.

    • Markov Property

      • The memoryless property (see the sketch after this list).
      • $p(S_{t+1} | A_t, S_t, \dots, A_0, S_0) = p(S_{t+1} | A_t, S_t)$.
      • $p(R_{t+1} | A_t, S_t, \dots, A_0, S_0) = p(R_{t+1} | A_t, S_t)$.
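A minimal sketch that puts the pieces together: tables for $p(s'|s,a)$, $p(r|s,a)$, and $\pi(a|s)$ (all values hypothetical) and a loop that samples a state-action-reward chain. Each step reads only the current state and action, which is exactly the memoryless property stated above:

```python
import random

def sample(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

# A hypothetical two-state MDP: S = {s1, s2}, A(s) = {stay, move}.
P = {  # state transition probabilities p(s'|s, a)
    ("s1", "move"): {"s2": 0.9, "s1": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s2", "move"): {"s1": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}
R = {  # reward probabilities p(r|s, a)
    ("s1", "move"): {0: 1.0},
    ("s1", "stay"): {0: 1.0},
    ("s2", "move"): {0: 1.0},
    ("s2", "stay"): {1: 1.0},
}
pi = {  # policy pi(a|s)
    "s1": {"move": 0.8, "stay": 0.2},
    "s2": {"stay": 0.7, "move": 0.3},
}

def rollout(s, steps):
    """Sample a state-action-reward chain; each step reads only (S_t, A_t)."""
    trajectory = []
    for _ in range(steps):
        a = sample(pi[s])           # A_t ~ pi(.|S_t)
        r = sample(R[(s, a)])       # R_{t+1} ~ p(.|S_t, A_t)
        s_next = sample(P[(s, a)])  # S_{t+1} ~ p(.|S_t, A_t): the Markov property
        trajectory.append((s, a, r, s_next))
        s = s_next                  # no history beyond the current state is kept
    return trajectory

print(rollout("s1", 5))
```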