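Solving a Maze Game with the Sarsa Reinforcement Learning Algorithm: A Line-by-Line Code Walkthrough

This article is compiled from my notes on Baidu's 7-day introductory reinforcement learning course.
Thanks to Li Kejiao (李科浇) of the Baidu PARL team for teaching the course.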


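1. Install dependencies

Install Gym, the environment library used by the reinforcement learning algorithm. The code in this article is written against the classic Gym API (env.step returns four values and the environment ID is FrozenLake-v0); newer Gym releases changed both, so an older Gym version may be needed to run it unmodified.

pip install gym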

2. Import dependencies

import gym
import numpy as np
import time # used to pause the program so the rendered frames are easier to follow

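3. The Agent's algorithm: Sarsa

  • The Agent is the entity that interacts with the environment
    • it observes the current state
    • it chooses an action based on the current state
    • it updates the Q table based on the outcome of that action
  • predict() method: takes an observation (in other words, a state) and outputs the "predicted" action (the optimal action)
    • look up the Q values of all actions available in the current state
    • collect every action whose Q value equals the maximum into a list
    • this list holds the candidate optimal actions
    • pick one action from the list uniformly at random, which breaks ties between equally good actions
  • sample() method: adds ε-greedy exploration on top of predict() and outputs the action that is "actually" executed
    • uses the epsilon-greedy strategy
    • with probability 1 - ε (90% when ε = 0.1), choose the optimal action
    • with probability ε (10% when ε = 0.1), choose a random action
  • learn() method: takes one transition of training data and performs one update of the Q table
    • what gets updated is the Q value of taking action in the previous state obs
    • if the episode has ended, the target Q value is just the reward
    • if the episode has not ended, the target combines the reward with the discounted Q value of the next state and next action
    • the learning rate lr limits how far the Q value moves toward the target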
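
The learn() rule described in the last bullet group can be written compactly as the standard Sarsa update. In the notation below, α stands for the learning rate lr, γ for gamma, and y for the target value (target_Q in the code); the symbol y is introduced here only for illustration.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ y - Q(s_t, a_t) \bigr],
\qquad
y =
\begin{cases}
r, & \text{if the episode ended at this step} \\
r + \gamma \, Q(s_{t+1}, a_{t+1}), & \text{otherwise}
\end{cases}
```

Because a_{t+1} is the action the agent will actually take next (chosen by sample()), the update follows the behaviour policy itself, which is why Sarsa is called on-policy.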
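class SarsaAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # number of actions to choose from
        self.lr = learning_rate # learning rate
        self.gamma = gamma      # discount factor: how much later Q values influence earlier ones
        self.epsilon = e_greed  # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n)) # Q table: one row per state, one column per action

    # Sample an action for the given observation, with epsilon-greedy exploration (10% when e_greed=0.1)
    def sample(self, obs):
        if (np.random.uniform(0, 1) < 1 - self.epsilon): # with probability 1 - epsilon (90% here)
            action = self.predict(obs) # take the optimal action
        else: # with probability epsilon (10% here)
            action = np.random.choice(self.act_n) # take a random action
        return action

    # Predict the optimal action for the given observation
    def predict(self, obs):
        Q_list = self.Q[obs, :] # Q values of all actions in the current state
        maxQ = np.max(Q_list) # largest Q value in the list
        action_list = np.where(Q_list == maxQ)[0] # actions with the largest Q value are the optimal actions
        action = np.random.choice(action_list) # pick one of the optimal actions at random
        return action

    # Learning method, i.e. how the Q table is updated
    def learn(self, obs, action, reward, next_obs, next_action, done):
        """ on-policy
            obs: observation before the interaction, s_t
            action: action chosen for this interaction, a_t
            reward: reward r received for this action
            next_obs: observation after the interaction, s_t+1
            next_action: action that will be taken at next_obs according to the current Q table, a_t+1
            done: whether the episode has ended
        """
        predict_Q = self.Q[obs,action] # current Q value of the chosen action in the state before the interaction
        if (done): # the episode has ended
            target_Q = reward # the target Q value is just the reward
        else: # the episode has not ended
            target_Q = reward + self.gamma * self.Q[next_obs, next_action]
            # combine the reward with the discounted Q value of the next action in the next state
        self.Q[obs,action] += self.lr * (target_Q - predict_Q) # lr scales the size of the correction

    # Save the Q table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load the Q table from a file
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')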
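
To make the learn() update concrete, here is a minimal standalone sketch that applies the same rule to two hand-picked transitions. The helper sarsa_update and the two transitions are illustrative assumptions and not part of the course code; lr=0.1 and gamma=0.9 match the hyperparameters used later in this article.

```python
import numpy as np


def sarsa_update(Q, obs, action, reward, next_obs, next_action, done,
                 lr=0.1, gamma=0.9):
    """Hypothetical standalone version of the update performed in learn()."""
    if done:
        target_Q = reward  # terminal step: no bootstrapping
    else:
        target_Q = reward + gamma * Q[next_obs, next_action]
    Q[obs, action] += lr * (target_Q - Q[obs, action])
    return Q


Q = np.zeros((16, 4))

# Terminal transition: from cell 14, action 2 (right) reaches the goal, reward 1.
sarsa_update(Q, obs=14, action=2, reward=1.0, next_obs=15, next_action=0, done=True)
# Q[14, 2] = 0 + 0.1 * (1 - 0) = 0.1

# Non-terminal transition: from cell 13, action 2 (right) leads to cell 14,
# where the next action is again 2 (right); the reward is 0.
sarsa_update(Q, obs=13, action=2, reward=0.0, next_obs=14, next_action=2, done=False)
# target = 0 + 0.9 * 0.1 = 0.09, so Q[13, 2] = 0 + 0.1 * 0.09 = 0.009

print(round(Q[14, 2], 6), round(Q[13, 2], 6))
```

The second update shows how the reward at the goal propagates backwards: cell 13 only gains value after cell 14 has, and each step back is scaled by gamma and lr.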

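4. Training and test functions

For each episode, record the number of steps total_steps and the cumulative reward total_reward.

The Q table is updated at every step.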

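def run_episode(env, agent, render=False):
    total_steps = 0 # number of steps taken in this episode
    total_reward = 0 # cumulative reward of this episode

    obs = env.reset() # reset the environment and start a new episode
    action = agent.sample(obs) # choose an action with the algorithm

    while True:
        next_obs, reward, done, _ = env.step(action) # interact with the environment by executing the action
        next_action = agent.sample(next_obs) # choose the next action with the algorithm
        # train the Sarsa algorithm
        agent.learn(obs, action, reward, next_obs, next_action, done)
        # obs (state before the action) and action give the current estimate Q(obs, action)
        # reward, next_obs (state after the action) and next_action give the target used to update Q(obs, action)
        # done tells whether the episode has ended

        action = next_action # move on to the new action
        obs = next_obs  # move on to the new state
        total_reward += reward # accumulate the reward
        total_steps += 1 # count the step
        if render: # check whether rendering is requested
            env.render() # render a new frame
        if done: # the episode has ended
            break # leave the loop, which ends this episode
    return total_reward, total_steps # return the cumulative reward and the number of steps

def test_episode(env, agent):
    total_reward = 0 # cumulative reward
    obs = env.reset() # reset the environment; obs is the initial observation, i.e. the initial state
    while True:
        action = agent.predict(obs) # greedy: always take the optimal action
        next_obs, reward, done, _ = env.step(action) # after the interaction, get the new state, the reward, and whether the episode has ended
        total_reward += reward # accumulate the reward
        obs = next_obs # move on to the new state
        time.sleep(0.5) # pause so that the rendered frame can be observed
        env.render() # render the frame
        if done: # the episode has ended
            break # leave the loop
    return total_reward # return the final cumulative reward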
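
The two functions above assume the classic Gym API, where env.reset() returns only the observation and env.step() returns four values. Newer Gym and Gymnasium releases return (obs, info) from reset() and (obs, reward, terminated, truncated, info) from step(). The sketch below shows one possible way to bridge the two; the helper names reset_compat and step_compat are hypothetical and not part of Gym or of the course code.

```python
def reset_compat(env):
    """Return only the observation, whichever API the environment follows."""
    out = env.reset()
    # Assumption: a 2-tuple means the newer (obs, info) form. This holds for
    # FrozenLake, whose observations are plain integers, but not for
    # environments whose observation is itself a 2-tuple.
    if isinstance(out, tuple) and len(out) == 2:
        return out[0]
    return out


def step_compat(env, action):
    """Return (next_obs, reward, done, info) for both the 4- and 5-value forms."""
    out = env.step(action)
    if len(out) == 5:
        next_obs, reward, terminated, truncated, info = out
        return next_obs, reward, terminated or truncated, info
    return out


# Usage inside run_episode / test_episode (sketch):
#   obs = reset_compat(env)
#   next_obs, reward, done, _ = step_compat(env, action)
```

Folding truncated into done is a simplification: learn() then treats a time-limit cutoff like a true terminal state and does not bootstrap from the next state.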

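5. Create the environment, instantiate the Agent, and run training and testing

Use the Gym library to create the environment we need.

Instantiate the SarsaAgent class to create an Agent object, setting the hyperparameters at the same time.

Train for 500 episodes and print the result of each episode.

Run a test once training has finished.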

# Create the maze environment with gym; is_slippery=False makes the environment easier
env = gym.make("FrozenLake-v0", is_slippery=False)  # 0 left, 1 down, 2 right, 3 up
# gym.make builds the requested environment

# Create an agent instance and pass in the hyperparameters
agent = SarsaAgent(
        obs_n=env.observation_space.n, # 16 states, one for each cell of the 4*4 grid
        act_n=env.action_space.n, # 4 actions: 0 left, 1 down, 2 right, 3 up
        learning_rate=0.1, # learning rate
        gamma=0.9, # discount factor applied to the next step's Q value
        e_greed=0.1) # probability of choosing a random action


# Train for 500 episodes and print the score of each episode
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training is finished; check how well the algorithm performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))
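
A per-episode log of 500 lines is hard to read at a glance. Since each episode's reward here is either 0 or 1, the mean reward over a block of episodes is the fraction of episodes that reached the goal. The helper below is a hypothetical sketch of such a summary; the reward list in it is made up for illustration and is not taken from the training run.

```python
import numpy as np


def success_rate_by_block(rewards, block=100):
    """Mean reward over consecutive blocks of `block` episodes."""
    rewards = np.asarray(rewards, dtype=float)
    return [float(rewards[i:i + block].mean())
            for i in range(0, len(rewards), block)]


# Made-up rewards for illustration only
demo_rewards = [0, 0, 1, 0, 1, 1, 1, 1]
print(success_rate_by_block(demo_rewards, block=4))  # [0.25, 1.0]
```

To use it with the training loop, one option is to append ep_reward to a list inside the loop and call success_rate_by_block on that list after training.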

Output:

Episode 0: steps = 6 , reward = 0.0
Episode 1: steps = 17 , reward = 0.0
Episode 2: steps = 9 , reward = 0.0
Episode 3: steps = 2 , reward = 0.0
Episode 4: steps = 8 , reward = 0.0
Episode 5: steps = 8 , reward = 0.0
Episode 6: steps = 14 , reward = 0.0
Episode 7: steps = 7 , reward = 0.0
Episode 8: steps = 7 , reward = 0.0
Episode 9: steps = 2 , reward = 0.0
Episode 10: steps = 3 , reward = 0.0
Episode 11: steps = 8 , reward = 0.0
Episode 12: steps = 3 , reward = 0.0
Episode 13: steps = 8 , reward = 0.0
Episode 14: steps = 6 , reward = 0.0
Episode 15: steps = 5 , reward = 0.0
Episode 16: steps = 5 , reward = 0.0
Episode 17: steps = 7 , reward = 0.0
Episode 18: steps = 2 , reward = 0.0
Episode 19: steps = 7 , reward = 0.0
Episode 20: steps = 2 , reward = 0.0
Episode 21: steps = 7 , reward = 0.0
Episode 22: steps = 6 , reward = 0.0
Episode 23: steps = 3 , reward = 0.0
Episode 24: steps = 4 , reward = 0.0
Episode 25: steps = 4 , reward = 0.0
Episode 26: steps = 17 , reward = 0.0
Episode 27: steps = 11 , reward = 0.0
Episode 28: steps = 4 , reward = 0.0
Episode 29: steps = 9 , reward = 0.0
Episode 30: steps = 3 , reward = 0.0
Episode 31: steps = 11 , reward = 0.0
Episode 32: steps = 7 , reward = 0.0
Episode 33: steps = 3 , reward = 0.0
Episode 34: steps = 16 , reward = 0.0
Episode 35: steps = 10 , reward = 0.0
Episode 36: steps = 2 , reward = 0.0
Episode 37: steps = 9 , reward = 0.0
Episode 38: steps = 9 , reward = 0.0
Episode 39: steps = 19 , reward = 1.0
Episode 40: steps = 6 , reward = 0.0
Episode 41: steps = 6 , reward = 0.0
Episode 42: steps = 7 , reward = 0.0
Episode 43: steps = 4 , reward = 0.0
Episode 44: steps = 4 , reward = 0.0
Episode 45: steps = 5 , reward = 0.0
Episode 46: steps = 4 , reward = 0.0
Episode 47: steps = 22 , reward = 1.0
Episode 48: steps = 2 , reward = 0.0
Episode 49: steps = 2 , reward = 0.0
Episode 50: steps = 2 , reward = 0.0
Episode 51: steps = 17 , reward = 0.0
Episode 52: steps = 14 , reward = 0.0
Episode 53: steps = 6 , reward = 0.0
Episode 54: steps = 8 , reward = 0.0
Episode 55: steps = 18 , reward = 0.0
Episode 56: steps = 5 , reward = 0.0
Episode 57: steps = 2 , reward = 0.0
Episode 58: steps = 8 , reward = 0.0
Episode 59: steps = 4 , reward = 0.0
Episode 60: steps = 10 , reward = 0.0
Episode 61: steps = 2 , reward = 0.0
Episode 62: steps = 11 , reward = 0.0
Episode 63: steps = 21 , reward = 0.0
Episode 64: steps = 4 , reward = 0.0
Episode 65: steps = 2 , reward = 0.0
Episode 66: steps = 3 , reward = 0.0
Episode 67: steps = 3 , reward = 0.0
Episode 68: steps = 18 , reward = 1.0
Episode 69: steps = 6 , reward = 0.0
Episode 70: steps = 8 , reward = 0.0
Episode 71: steps = 8 , reward = 0.0
Episode 72: steps = 4 , reward = 0.0
Episode 73: steps = 13 , reward = 0.0
Episode 74: steps = 3 , reward = 0.0
Episode 75: steps = 7 , reward = 0.0
Episode 76: steps = 8 , reward = 0.0
Episode 77: steps = 3 , reward = 0.0
Episode 78: steps = 7 , reward = 0.0
Episode 79: steps = 8 , reward = 0.0
Episode 80: steps = 7 , reward = 0.0
Episode 81: steps = 10 , reward = 1.0
Episode 82: steps = 6 , reward = 1.0
Episode 83: steps = 9 , reward = 1.0
Episode 84: steps = 6 , reward = 0.0
Episode 85: steps = 6 , reward = 1.0
Episode 86: steps = 3 , reward = 0.0
Episode 87: steps = 7 , reward = 1.0
Episode 88: steps = 6 , reward = 1.0
Episode 89: steps = 7 , reward = 1.0
Episode 90: steps = 6 , reward = 1.0
Episode 91: steps = 6 , reward = 1.0
Episode 92: steps = 10 , reward = 1.0
Episode 93: steps = 6 , reward = 1.0
Episode 94: steps = 8 , reward = 1.0
Episode 95: steps = 6 , reward = 1.0
Episode 96: steps = 7 , reward = 1.0
Episode 97: steps = 6 , reward = 1.0
Episode 98: steps = 6 , reward = 1.0
Episode 99: steps = 8 , reward = 1.0
Episode 100: steps = 6 , reward = 1.0
Episode 101: steps = 8 , reward = 1.0
Episode 102: steps = 6 , reward = 1.0
Episode 103: steps = 6 , reward = 1.0
Episode 104: steps = 6 , reward = 1.0
Episode 105: steps = 8 , reward = 1.0
Episode 106: steps = 6 , reward = 1.0
Episode 107: steps = 6 , reward = 1.0
Episode 108: steps = 6 , reward = 1.0
Episode 109: steps = 6 , reward = 1.0
Episode 110: steps = 4 , reward = 0.0
Episode 111: steps = 6 , reward = 1.0
Episode 112: steps = 6 , reward = 1.0
Episode 113: steps = 6 , reward = 1.0
Episode 114: steps = 6 , reward = 1.0
Episode 115: steps = 7 , reward = 1.0
Episode 116: steps = 7 , reward = 1.0
Episode 117: steps = 10 , reward = 1.0
Episode 118: steps = 5 , reward = 0.0
Episode 119: steps = 6 , reward = 1.0
Episode 120: steps = 3 , reward = 0.0
Episode 121: steps = 6 , reward = 1.0
Episode 122: steps = 6 , reward = 1.0
Episode 123: steps = 9 , reward = 1.0
Episode 124: steps = 6 , reward = 1.0
Episode 125: steps = 5 , reward = 0.0
Episode 126: steps = 6 , reward = 1.0
Episode 127: steps = 6 , reward = 1.0
Episode 128: steps = 8 , reward = 1.0
Episode 129: steps = 6 , reward = 1.0
Episode 130: steps = 6 , reward = 1.0
Episode 131: steps = 8 , reward = 1.0
Episode 132: steps = 8 , reward = 1.0
Episode 133: steps = 6 , reward = 1.0
Episode 134: steps = 6 , reward = 1.0
Episode 135: steps = 6 , reward = 1.0
Episode 136: steps = 6 , reward = 1.0
Episode 137: steps = 6 , reward = 1.0
Episode 138: steps = 6 , reward = 1.0
Episode 139: steps = 4 , reward = 0.0
Episode 140: steps = 6 , reward = 1.0
Episode 141: steps = 6 , reward = 1.0
Episode 142: steps = 6 , reward = 1.0
Episode 143: steps = 9 , reward = 1.0
Episode 144: steps = 6 , reward = 1.0
Episode 145: steps = 6 , reward = 1.0
Episode 146: steps = 6 , reward = 1.0
Episode 147: steps = 7 , reward = 1.0
Episode 148: steps = 7 , reward = 1.0
Episode 149: steps = 6 , reward = 1.0
Episode 150: steps = 6 , reward = 1.0
Episode 151: steps = 6 , reward = 1.0
Episode 152: steps = 7 , reward = 1.0
Episode 153: steps = 6 , reward = 1.0
Episode 154: steps = 6 , reward = 1.0
Episode 155: steps = 7 , reward = 1.0
Episode 156: steps = 7 , reward = 1.0
Episode 157: steps = 7 , reward = 1.0
Episode 158: steps = 6 , reward = 1.0
Episode 159: steps = 6 , reward = 1.0
Episode 160: steps = 6 , reward = 1.0
Episode 161: steps = 4 , reward = 0.0
Episode 162: steps = 6 , reward = 1.0
Episode 163: steps = 5 , reward = 0.0
Episode 164: steps = 6 , reward = 1.0
Episode 165: steps = 6 , reward = 1.0
Episode 166: steps = 6 , reward = 1.0
Episode 167: steps = 6 , reward = 1.0
Episode 168: steps = 9 , reward = 1.0
Episode 169: steps = 6 , reward = 1.0
Episode 170: steps = 8 , reward = 1.0
Episode 171: steps = 6 , reward = 1.0
Episode 172: steps = 6 , reward = 1.0
Episode 173: steps = 6 , reward = 1.0
Episode 174: steps = 6 , reward = 1.0
Episode 175: steps = 6 , reward = 1.0
Episode 176: steps = 6 , reward = 1.0
Episode 177: steps = 6 , reward = 1.0
Episode 178: steps = 8 , reward = 1.0
Episode 179: steps = 6 , reward = 1.0
Episode 180: steps = 6 , reward = 1.0
Episode 181: steps = 3 , reward = 0.0
Episode 182: steps = 6 , reward = 1.0
Episode 183: steps = 6 , reward = 1.0
Episode 184: steps = 6 , reward = 1.0
Episode 185: steps = 8 , reward = 1.0
Episode 186: steps = 10 , reward = 1.0
Episode 187: steps = 8 , reward = 1.0
Episode 188: steps = 6 , reward = 1.0
Episode 189: steps = 6 , reward = 1.0
Episode 190: steps = 6 , reward = 1.0
Episode 191: steps = 6 , reward = 1.0
Episode 192: steps = 7 , reward = 1.0
Episode 193: steps = 6 , reward = 1.0
Episode 194: steps = 6 , reward = 1.0
Episode 195: steps = 8 , reward = 1.0
Episode 196: steps = 6 , reward = 1.0
Episode 197: steps = 4 , reward = 0.0
Episode 198: steps = 5 , reward = 0.0
Episode 199: steps = 6 , reward = 1.0
Episode 200: steps = 6 , reward = 1.0
Episode 201: steps = 6 , reward = 1.0
Episode 202: steps = 4 , reward = 0.0
Episode 203: steps = 8 , reward = 1.0
Episode 204: steps = 8 , reward = 1.0
Episode 205: steps = 7 , reward = 1.0
Episode 206: steps = 6 , reward = 1.0
Episode 207: steps = 6 , reward = 1.0
Episode 208: steps = 6 , reward = 1.0
Episode 209: steps = 8 , reward = 1.0
Episode 210: steps = 7 , reward = 1.0
Episode 211: steps = 6 , reward = 1.0
Episode 212: steps = 6 , reward = 1.0
Episode 213: steps = 10 , reward = 1.0
Episode 214: steps = 6 , reward = 1.0
Episode 215: steps = 6 , reward = 1.0
Episode 216: steps = 6 , reward = 1.0
Episode 217: steps = 6 , reward = 1.0
Episode 218: steps = 6 , reward = 1.0
Episode 219: steps = 6 , reward = 1.0
Episode 220: steps = 6 , reward = 1.0
Episode 221: steps = 7 , reward = 1.0
Episode 222: steps = 6 , reward = 1.0
Episode 223: steps = 6 , reward = 1.0
Episode 224: steps = 6 , reward = 1.0
Episode 225: steps = 6 , reward = 1.0
Episode 226: steps = 6 , reward = 1.0
Episode 227: steps = 6 , reward = 1.0
Episode 228: steps = 7 , reward = 1.0
Episode 229: steps = 6 , reward = 1.0
Episode 230: steps = 6 , reward = 1.0
Episode 231: steps = 10 , reward = 1.0
Episode 232: steps = 6 , reward = 1.0
Episode 233: steps = 6 , reward = 1.0
Episode 234: steps = 6 , reward = 1.0
Episode 235: steps = 8 , reward = 1.0
Episode 236: steps = 6 , reward = 1.0
Episode 237: steps = 6 , reward = 1.0
Episode 238: steps = 6 , reward = 1.0
Episode 239: steps = 8 , reward = 1.0
Episode 240: steps = 6 , reward = 1.0
Episode 241: steps = 6 , reward = 1.0
Episode 242: steps = 8 , reward = 1.0
Episode 243: steps = 2 , reward = 0.0
Episode 244: steps = 6 , reward = 1.0
Episode 245: steps = 6 , reward = 1.0
Episode 246: steps = 6 , reward = 1.0
Episode 247: steps = 6 , reward = 1.0
Episode 248: steps = 6 , reward = 1.0
Episode 249: steps = 6 , reward = 1.0
Episode 250: steps = 7 , reward = 1.0
Episode 251: steps = 6 , reward = 1.0
Episode 252: steps = 2 , reward = 0.0
Episode 253: steps = 6 , reward = 1.0
Episode 254: steps = 6 , reward = 1.0
Episode 255: steps = 6 , reward = 1.0
Episode 256: steps = 8 , reward = 1.0
Episode 257: steps = 6 , reward = 1.0
Episode 258: steps = 6 , reward = 1.0
Episode 259: steps = 7 , reward = 1.0
Episode 260: steps = 6 , reward = 1.0
Episode 261: steps = 6 , reward = 1.0
Episode 262: steps = 7 , reward = 1.0
Episode 263: steps = 6 , reward = 1.0
Episode 264: steps = 6 , reward = 1.0
Episode 265: steps = 6 , reward = 1.0
Episode 266: steps = 6 , reward = 1.0
Episode 267: steps = 7 , reward = 1.0
Episode 268: steps = 6 , reward = 1.0
Episode 269: steps = 6 , reward = 1.0
Episode 270: steps = 6 , reward = 1.0
Episode 271: steps = 6 , reward = 1.0
Episode 272: steps = 6 , reward = 1.0
Episode 273: steps = 7 , reward = 1.0
Episode 274: steps = 3 , reward = 0.0
Episode 275: steps = 8 , reward = 1.0
Episode 276: steps = 7 , reward = 1.0
Episode 277: steps = 4 , reward = 0.0
Episode 278: steps = 6 , reward = 1.0
Episode 279: steps = 4 , reward = 0.0
Episode 280: steps = 7 , reward = 1.0
Episode 281: steps = 6 , reward = 1.0
Episode 282: steps = 6 , reward = 1.0
Episode 283: steps = 6 , reward = 1.0
Episode 284: steps = 6 , reward = 1.0
Episode 285: steps = 7 , reward = 1.0
Episode 286: steps = 8 , reward = 1.0
Episode 287: steps = 6 , reward = 1.0
Episode 288: steps = 5 , reward = 0.0
Episode 289: steps = 8 , reward = 1.0
Episode 290: steps = 7 , reward = 1.0
Episode 291: steps = 8 , reward = 1.0
Episode 292: steps = 4 , reward = 0.0
Episode 293: steps = 6 , reward = 1.0
Episode 294: steps = 9 , reward = 1.0
Episode 295: steps = 6 , reward = 1.0
Episode 296: steps = 6 , reward = 1.0
Episode 297: steps = 6 , reward = 0.0
Episode 298: steps = 6 , reward = 1.0
Episode 299: steps = 6 , reward = 1.0
Episode 300: steps = 6 , reward = 1.0
Episode 301: steps = 5 , reward = 0.0
Episode 302: steps = 6 , reward = 1.0
Episode 303: steps = 7 , reward = 1.0
Episode 304: steps = 6 , reward = 1.0
Episode 305: steps = 8 , reward = 1.0
Episode 306: steps = 6 , reward = 1.0
Episode 307: steps = 6 , reward = 1.0
Episode 308: steps = 6 , reward = 1.0
Episode 309: steps = 6 , reward = 1.0
Episode 310: steps = 4 , reward = 0.0
Episode 311: steps = 7 , reward = 1.0
Episode 312: steps = 8 , reward = 1.0
Episode 313: steps = 7 , reward = 1.0
Episode 314: steps = 6 , reward = 1.0
Episode 315: steps = 6 , reward = 1.0
Episode 316: steps = 7 , reward = 1.0
Episode 317: steps = 6 , reward = 1.0
Episode 318: steps = 6 , reward = 1.0
Episode 319: steps = 6 , reward = 1.0
Episode 320: steps = 6 , reward = 1.0
Episode 321: steps = 6 , reward = 1.0
Episode 322: steps = 7 , reward = 1.0
Episode 323: steps = 6 , reward = 1.0
Episode 324: steps = 6 , reward = 1.0
Episode 325: steps = 6 , reward = 1.0
Episode 326: steps = 6 , reward = 1.0
Episode 327: steps = 6 , reward = 1.0
Episode 328: steps = 6 , reward = 1.0
Episode 329: steps = 6 , reward = 1.0
Episode 330: steps = 6 , reward = 1.0
Episode 331: steps = 6 , reward = 1.0
Episode 332: steps = 6 , reward = 1.0
Episode 333: steps = 6 , reward = 1.0
Episode 334: steps = 3 , reward = 0.0
Episode 335: steps = 6 , reward = 1.0
Episode 336: steps = 6 , reward = 1.0
Episode 337: steps = 4 , reward = 0.0
Episode 338: steps = 6 , reward = 1.0
Episode 339: steps = 8 , reward = 1.0
Episode 340: steps = 6 , reward = 1.0
Episode 341: steps = 6 , reward = 1.0
Episode 342: steps = 6 , reward = 1.0
Episode 343: steps = 6 , reward = 1.0
Episode 344: steps = 6 , reward = 1.0
Episode 345: steps = 6 , reward = 1.0
Episode 346: steps = 6 , reward = 1.0
Episode 347: steps = 6 , reward = 1.0
Episode 348: steps = 6 , reward = 1.0
Episode 349: steps = 6 , reward = 1.0
Episode 350: steps = 6 , reward = 1.0
Episode 351: steps = 7 , reward = 1.0
Episode 352: steps = 6 , reward = 1.0
Episode 353: steps = 10 , reward = 1.0
Episode 354: steps = 3 , reward = 0.0
Episode 355: steps = 7 , reward = 1.0
Episode 356: steps = 7 , reward = 1.0
Episode 357: steps = 6 , reward = 1.0
Episode 358: steps = 2 , reward = 0.0
Episode 359: steps = 6 , reward = 1.0
Episode 360: steps = 6 , reward = 1.0
Episode 361: steps = 6 , reward = 1.0
Episode 362: steps = 7 , reward = 1.0
Episode 363: steps = 8 , reward = 1.0
Episode 364: steps = 6 , reward = 1.0
Episode 365: steps = 2 , reward = 0.0
Episode 366: steps = 6 , reward = 1.0
Episode 367: steps = 5 , reward = 0.0
Episode 368: steps = 6 , reward = 1.0
Episode 369: steps = 6 , reward = 1.0
Episode 370: steps = 6 , reward = 1.0
Episode 371: steps = 6 , reward = 1.0
Episode 372: steps = 6 , reward = 1.0
Episode 373: steps = 6 , reward = 1.0
Episode 374: steps = 8 , reward = 1.0
Episode 375: steps = 9 , reward = 1.0
Episode 376: steps = 6 , reward = 0.0
Episode 377: steps = 6 , reward = 1.0
Episode 378: steps = 6 , reward = 1.0
Episode 379: steps = 8 , reward = 1.0
Episode 380: steps = 6 , reward = 1.0
Episode 381: steps = 6 , reward = 1.0
Episode 382: steps = 6 , reward = 1.0
Episode 383: steps = 6 , reward = 1.0
Episode 384: steps = 6 , reward = 1.0
Episode 385: steps = 6 , reward = 1.0
Episode 386: steps = 8 , reward = 1.0
Episode 387: steps = 6 , reward = 1.0
Episode 388: steps = 6 , reward = 1.0
Episode 389: steps = 2 , reward = 0.0
Episode 390: steps = 6 , reward = 1.0
Episode 391: steps = 6 , reward = 1.0
Episode 392: steps = 6 , reward = 1.0
Episode 393: steps = 6 , reward = 1.0
Episode 394: steps = 7 , reward = 1.0
Episode 395: steps = 6 , reward = 1.0
Episode 396: steps = 6 , reward = 1.0
Episode 397: steps = 6 , reward = 1.0
Episode 398: steps = 6 , reward = 1.0
Episode 399: steps = 7 , reward = 1.0
Episode 400: steps = 6 , reward = 1.0
Episode 401: steps = 6 , reward = 1.0
Episode 402: steps = 6 , reward = 1.0
Episode 403: steps = 6 , reward = 1.0
Episode 404: steps = 8 , reward = 1.0
Episode 405: steps = 6 , reward = 1.0
Episode 406: steps = 6 , reward = 1.0
Episode 407: steps = 6 , reward = 1.0
Episode 408: steps = 6 , reward = 1.0
Episode 409: steps = 6 , reward = 1.0
Episode 410: steps = 6 , reward = 1.0
Episode 411: steps = 6 , reward = 1.0
Episode 412: steps = 6 , reward = 1.0
Episode 413: steps = 6 , reward = 1.0
Episode 414: steps = 6 , reward = 1.0
Episode 415: steps = 9 , reward = 1.0
Episode 416: steps = 6 , reward = 1.0
Episode 417: steps = 4 , reward = 0.0
Episode 418: steps = 6 , reward = 1.0
Episode 419: steps = 6 , reward = 1.0
Episode 420: steps = 7 , reward = 1.0
Episode 421: steps = 6 , reward = 1.0
Episode 422: steps = 6 , reward = 1.0
Episode 423: steps = 10 , reward = 1.0
Episode 424: steps = 6 , reward = 1.0
Episode 425: steps = 6 , reward = 1.0
Episode 426: steps = 8 , reward = 1.0
Episode 427: steps = 6 , reward = 1.0
Episode 428: steps = 9 , reward = 1.0
Episode 429: steps = 6 , reward = 1.0
Episode 430: steps = 4 , reward = 0.0
Episode 431: steps = 6 , reward = 1.0
Episode 432: steps = 6 , reward = 1.0
Episode 433: steps = 6 , reward = 1.0
Episode 434: steps = 6 , reward = 1.0
Episode 435: steps = 8 , reward = 1.0
Episode 436: steps = 6 , reward = 1.0
Episode 437: steps = 6 , reward = 1.0
Episode 438: steps = 6 , reward = 1.0
Episode 439: steps = 8 , reward = 1.0
Episode 440: steps = 2 , reward = 0.0
Episode 441: steps = 6 , reward = 1.0
Episode 442: steps = 10 , reward = 1.0
Episode 443: steps = 6 , reward = 1.0
Episode 444: steps = 6 , reward = 1.0
Episode 445: steps = 8 , reward = 1.0
Episode 446: steps = 6 , reward = 1.0
Episode 447: steps = 6 , reward = 1.0
Episode 448: steps = 5 , reward = 0.0
Episode 449: steps = 6 , reward = 1.0
Episode 450: steps = 8 , reward = 1.0
Episode 451: steps = 6 , reward = 1.0
Episode 452: steps = 8 , reward = 1.0
Episode 453: steps = 8 , reward = 1.0
Episode 454: steps = 7 , reward = 1.0
Episode 455: steps = 5 , reward = 0.0
Episode 456: steps = 6 , reward = 1.0
Episode 457: steps = 6 , reward = 1.0
Episode 458: steps = 8 , reward = 1.0
Episode 459: steps = 8 , reward = 1.0
Episode 460: steps = 10 , reward = 1.0
Episode 461: steps = 8 , reward = 1.0
Episode 462: steps = 7 , reward = 1.0
Episode 463: steps = 7 , reward = 1.0
Episode 464: steps = 6 , reward = 1.0
Episode 465: steps = 6 , reward = 1.0
Episode 466: steps = 6 , reward = 1.0
Episode 467: steps = 6 , reward = 1.0
Episode 468: steps = 6 , reward = 1.0
Episode 469: steps = 6 , reward = 1.0
Episode 470: steps = 3 , reward = 0.0
Episode 471: steps = 7 , reward = 1.0
Episode 472: steps = 6 , reward = 1.0
Episode 473: steps = 6 , reward = 1.0
Episode 474: steps = 7 , reward = 1.0
Episode 475: steps = 6 , reward = 1.0
Episode 476: steps = 8 , reward = 1.0
Episode 477: steps = 6 , reward = 1.0
Episode 478: steps = 6 , reward = 1.0
Episode 479: steps = 6 , reward = 1.0
Episode 480: steps = 6 , reward = 1.0
Episode 481: steps = 6 , reward = 1.0
Episode 482: steps = 6 , reward = 1.0
Episode 483: steps = 6 , reward = 1.0
Episode 484: steps = 5 , reward = 0.0
Episode 485: steps = 6 , reward = 1.0
Episode 486: steps = 9 , reward = 1.0
Episode 487: steps = 7 , reward = 1.0
Episode 488: steps = 6 , reward = 1.0
Episode 489: steps = 6 , reward = 1.0
Episode 490: steps = 6 , reward = 1.0
Episode 491: steps = 6 , reward = 1.0
Episode 492: steps = 9 , reward = 1.0
Episode 493: steps = 6 , reward = 1.0
Episode 494: steps = 6 , reward = 1.0
Episode 495: steps = 9 , reward = 1.0
Episode 496: steps = 6 , reward = 1.0
Episode 497: steps = 6 , reward = 1.0
Episode 498: steps = 6 , reward = 1.0
Episode 499: steps = 7 , reward = 1.0
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
test reward = 1.0

6. Analysis of results

We can inspect the final trained Q table:

print(agent.Q)

Output:

[[0.27140285 0.4364344  0.09145568 0.15201279]
 [0.26813138 0.         0.         0.00945424]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.26636559 0.51632351 0.         0.13684245]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.33346755 0.         0.68004322 0.31572772]
 [0.26970648 0.77477987 0.35436455 0.        ]
 [0.04662094 0.73217092 0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.39939922 0.88159607 0.11581402]
 [0.4472322  0.72976712 1.         0.40947544]
 [0.         0.         0.         0.        ]]

The 16 cells are laid out as follows:

SFFF
FHFH
FFFH
HFFG

Here S is the start, F is flat ground, H is a hole (falling in ends the episode), and G is the goal (reaching it wins).

The index of each cell:

0  1  2  3
4  5  6  7
8  9  10 11
12 13 14 15
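
The numbering above is row-major, so a state index can be turned into a (row, column) position with a single division. The following is a minimal sketch; the names DESC, N_COLS and state_to_cell are chosen here for illustration, and the map strings are copied from the layout shown in this article.

```python
DESC = ["SFFF", "FHFH", "FFFH", "HFFG"]  # map layout copied from the article
N_COLS = 4


def state_to_cell(state):
    """Convert a state index into (row, col, cell type)."""
    row, col = divmod(state, N_COLS)
    return row, col, DESC[row][col]


for s in (0, 5, 9, 15):
    print(s, state_to_cell(s))

# Indices of all holes on the map
holes = [s for s in range(16) if state_to_cell(s)[2] == "H"]
print(holes)  # [5, 7, 11, 12]
```

The holes and the goal are terminal cells, so they never appear as obs in learn() and their rows in the Q table stay at zero, which matches rows 5, 7, 11, 12 and 15 of the printed table.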

So when the test starts, the agent is in cell 0, where the Q values of the 4 actions are:

[0.27140285 0.4364344  0.09145568 0.15201279]

These 4 Q values correspond to: 0 left, 1 down, 2 right, 3 up

The maximum, 0.4364344, belongs to action 1, so the agent moves down one cell

It is now in cell 4:

[0.26636559 0.51632351 0.         0.13684245]

It chooses 1 (down) and reaches cell 8:

[0.33346755 0.         0.68004322 0.31572772]

It chooses 2 (right) and reaches cell 9:

[0.26970648 0.77477987 0.35436455 0.        ]

It chooses 1 (down) and reaches cell 13:

[0.         0.39939922 0.88159607 0.11581402]

It chooses 2 (right) and reaches cell 14:

[0.4472322  0.72976712 1.         0.40947544]

It chooses 2 (right) and reaches cell 15: the goal!
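
The walk through the Q table can also be reproduced in code without Gym. The sketch below copies the printed Q table and follows the greedy action from cell 0. The move function is an assumption that models the non-slippery 4*4 grid (moves into a wall leave the agent in place), and np.argmax is used instead of the random tie-breaking in predict(), which makes no difference here because no row on the path has a tie.

```python
import numpy as np

# Q table copied from the output printed in this article
Q = np.array([
    [0.27140285, 0.4364344,  0.09145568, 0.15201279],
    [0.26813138, 0.0,        0.0,        0.00945424],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.26636559, 0.51632351, 0.0,        0.13684245],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.33346755, 0.0,        0.68004322, 0.31572772],
    [0.26970648, 0.77477987, 0.35436455, 0.0],
    [0.04662094, 0.73217092, 0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.39939922, 0.88159607, 0.11581402],
    [0.4472322,  0.72976712, 1.0,        0.40947544],
    [0.0,        0.0,        0.0,        0.0],
])


def move(state, action, n=4):
    """Deterministic grid move: 0 left, 1 down, 2 right, 3 up."""
    row, col = divmod(state, n)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, n - 1)
    elif action == 2:
        col = min(col + 1, n - 1)
    elif action == 3:
        row = max(row - 1, 0)
    return row * n + col


state, path, actions = 0, [0], []
while state != 15 and len(path) <= 16:  # guard against endless loops
    action = int(np.argmax(Q[state]))   # greedy action for the current cell
    state = move(state, action)
    actions.append(action)
    path.append(state)

print(path)     # [0, 4, 8, 9, 13, 14, 15]
print(actions)  # [1, 1, 2, 1, 2, 2]
```

The action sequence down, down, right, down, right, right is the same one shown in the rendered test episode.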