Reinforcement Learning: An Introduction – Exercise Solutions, Chapters 1 & 2

Reinforcement Learning: An Introduction (second edition) – Chapter 1,2

Chapter 1

1.1
Self-Play Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

  • In a two-player, perfect-information game such as tic-tac-toe, a self-play algorithm converges to a Nash-equilibrium strategy (this convergence can be proven theoretically). The policy learned this way differs from the one learned against a random opponent: when playing a random opponent, the learner eventually exploits that opponent's weaknesses (its exploitability) and ends up with a policy tailored specifically to that opponent.

1.2
Symmetries Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

  • We can apply the corresponding rotations and reflections to states and actions (data augmentation); symmetrically equivalent states can then share the same labels, e.g. value labels or action-probability labels. This reduces the number of interactions with the environment and improves exploration efficiency, since a single interaction yields several samples. However, if the opponent's policy does not respect the symmetries, data augmentation no longer improves learning: symmetric states no longer share the same value or probability labels, so each state has to be treated individually, which in turn lets us exploit that particular opponent's exploitability and obtain a better policy against it.

1.3
Greedy Play Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

  • It would learn a worse policy. A purely greedy agent cannot explore unknown states effectively; it only picks the action that looks best under its current, inaccurate estimates, and therefore tends to converge to a local optimum.

1.4
Learning from Exploration Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

  • If "not learning from exploratory moves" means only that we skip the update for states reached by an exploratory move, while the other states along the trajectory, reached by the greedy policy, are still updated, then the probabilities learned without exploratory updates converge to the optimal values: the trajectories still contain exploration, so in principle every state can be visited infinitely often. Learning from exploratory moves, on the other hand, yields values with the exploration probability folded in, i.e. a suboptimal set of probabilities. If exploration happens only with a small probability, e.g. in an \(\epsilon\)-greedy fashion, both variants may still converge to the optimal decisions: the values learned from exploratory moves may differ slightly, but the resulting greedy decisions can be exactly the same as those of the variant that ignores exploratory moves. If exploratory moves continue forever, the variant that does not learn from them learns the better policy and wins more often.

1.5
Other Improvements Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

  • States: the state space can be given a better representation to simplify the problem. Rewards: if the reward magnitudes vary over orders of magnitude, some normalization can help. Exploration: design an exploration scheme suited to the problem; for example, with sparse rewards one can introduce a curiosity mechanism or intrinsic rewards based on information gain to explore more efficiently. Parameters: tune parameters such as the discount rate and the step size for each problem. Algorithms: different problems call for different algorithm families, e.g. model-based vs. model-free, value-based vs. policy-based. For a two-player, perfect-information game like tic-tac-toe, heuristic search methods such as Monte Carlo tree search (MCTS) or an AlphaZero-style algorithm are probably the best existing choice.

Chapter 2

2.1
In \(\epsilon\)-greedy action selection, for the case of two actions and \(\epsilon\)= 0.5, what is the probability that the greedy action is selected?

  • With \(\epsilon = 0.5\), the greedy action is selected by exploiting with probability \(1-\epsilon = 0.5\); since there are only two actions, exploring also picks the greedy action with probability \(\epsilon/2 = 0.25\). In total the probability is 0.75 (see the quick check below).
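
A quick Monte Carlo check of this number (a minimal Python sketch; the function name and seed are arbitrary):

```python
import random

def eps_greedy_pick(greedy_action, n_actions, eps):
    """Return an action: explore uniformly with probability eps, else exploit."""
    if random.random() < eps:
        return random.randrange(n_actions)   # exploration may also pick the greedy action
    return greedy_action

random.seed(0)
trials = 100_000
hits = sum(eps_greedy_pick(0, n_actions=2, eps=0.5) == 0 for _ in range(trials))
print(hits / trials)   # ~0.75 = (1 - eps) + eps / 2
```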

2.2
Bandit example Consider a k-armed bandit problem with \(k = 4\) actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using \(\epsilon\)-greedy action selection, sample-average action-value estimates, and initial estimates of \(Q_1(a) = 0\), for all \(a\). Suppose the initial sequence of actions and rewards is \(A_1 = 1,R_1=-1,A_2=2,R_2=1,A_3=2,R_3=-2,A_4=2,R_4=2,A_5=3,R_5=0\). On some of these time steps the \(\epsilon\) case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?

  • A table of the Q-value estimates (sample averages at the start of each time step) makes this easy to see:
| time step | \(Q(1)\) | \(Q(2)\) | \(Q(3)\) | \(Q(4)\) | action | reward |
|-----------|----------|----------|----------|----------|--------|--------|
| 1 | 0 | 0 | 0 | 0 | 1 | -1 |
| 2 | -1 | 0 | 0 | 0 | 2 | 1 |
| 3 | -1 | 1 | 0 | 0 | 2 | -2 |
| 4 | -1 | -0.5 | 0 | 0 | 2 | 2 |
| 5 | -1 | 1/3 | 0 | 0 | 3 | 0 |

At each time step, simply check whether the chosen action has the largest current Q value; if it does not, the selection must have been random. By this criterion, the actions at time steps 4 and 5 were definitely selected at random; on the other steps the action may have been chosen at random or may have been the greedy action, so the exploratory case could possibly have occurred on any of the steps.
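
The table and the "definitely random" judgement can be reproduced by replaying the given sequence with incremental sample averages (a small Python sketch; variable names are made up):

```python
# Replay the sequence from Exercise 2.2 with sample-average estimates.
actions = [1, 2, 2, 2, 3]          # A_1..A_5 (arms numbered 1..4)
rewards = [-1, 1, -2, 2, 0]        # R_1..R_5

Q = {a: 0.0 for a in range(1, 5)}  # initial estimates Q_1(a) = 0
N = {a: 0 for a in range(1, 5)}    # visit counts

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy_value = max(Q.values())
    definitely_random = Q[a] < greedy_value   # chosen action not among the argmax set
    print(f"step {t}: Q={Q}, action {a}, reward {r}, "
          f"{'definitely random' if definitely_random else 'possibly greedy'}")
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update
```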

2.3
In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.

  • First rule out the greedy method: finding the globally optimal action requires that every action can be tried infinitely often, which the greedy method cannot guarantee. The other two methods both can, and in the long run a smaller exploration rate gives a higher probability of selecting the best action. In theory, with \(\epsilon=0.01\) the asymptotic probability of selecting the best action is \(1-\epsilon+\epsilon/10 = 99.1\%\) (10-armed testbed), while with \(\epsilon=0.1\) it is \(1-\epsilon+\epsilon/10 = 91\%\). So \(\epsilon=0.01\) is better in the long run, by 8.1 percentage points, and its long-run cumulative reward is correspondingly higher.

2.4
If the step-size parameters, \(\alpha_n\), are not constant, then the estimate \(Q_n\) is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?

  • Rewriting the derivation of (2.6) for a step size that varies with \(n\):
\[\begin{aligned}
Q_{n+1} &= Q_n+\alpha_n[R_n-Q_n]\\
&= \alpha_n R_n+(1-\alpha_n)Q_n \\
&= \alpha_n R_n+(1-\alpha_n)[\alpha_{n-1}R_{n-1}+(1-\alpha_{n-1})Q_{n-1}] \\
&= \alpha_n R_n + (1-\alpha_n)\alpha_{n-1}R_{n-1}+(1-\alpha_n)(1-\alpha_{n-1})\alpha_{n-2}R_{n-2}+ \cdots \\
&\qquad\quad +(1-\alpha_n)(1-\alpha_{n-1})\cdots(1-\alpha_2)\alpha_1R_1+\prod_{j=1}^{n}(1-\alpha_j)\,Q_1 \\
&= \prod_{j=1}^{n}(1-\alpha_j)\,Q_1+\sum_{i=1}^{n}\alpha_iR_i\prod_{j=i+1}^n(1-\alpha_j)
\end{aligned}
\]

So the weight on each prior reward \(R_i\) is \(\alpha_i\prod_{j=i+1}^{n}(1-\alpha_j)\), and the weight on the initial estimate \(Q_1\) is \(\prod_{j=1}^{n}(1-\alpha_j)\).
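
A quick numerical sanity check of this closed form against the recursive update, with arbitrary made-up step sizes and rewards (a sketch, not part of the book):

```python
import math, random

random.seed(1)
n = 6
alphas = [random.uniform(0.1, 0.9) for _ in range(n)]   # alpha_1 .. alpha_n
rewards = [random.gauss(0, 1) for _ in range(n)]        # R_1 .. R_n
Q1 = 5.0                                                # arbitrary initial estimate

# Recursive update: Q_{k+1} = Q_k + alpha_k (R_k - Q_k)
Q = Q1
for a, r in zip(alphas, rewards):
    Q += a * (r - Q)

# Closed form: prod_j (1 - alpha_j) Q_1 + sum_i alpha_i R_i prod_{j>i} (1 - alpha_j)
closed = math.prod(1 - a for a in alphas) * Q1
for i in range(n):
    closed += alphas[i] * rewards[i] * math.prod(1 - a for a in alphas[i + 1:])

print(Q, closed)   # the two values agree up to floating-point error
```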

2.5
(programming) Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed testbed in which all the \(q_*(a)\) start out equal and then take independent random walks (say by adding a normally distributed increment with mean zero and standard deviation 0.01 to all the \(q_*(a)\) on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, \(\alpha\)= 0.1. Use \(\epsilon\)= 0.1 and longer runs, say of 10,000 steps.
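
A minimal sketch of this experiment in Python (assumptions: far fewer runs than the 2000 used in the book's figures, and only the average-reward measure is printed rather than plotted; all names are hypothetical):

```python
import numpy as np

def run_bandit(steps=10_000, k=10, eps=0.1, alpha=None, rng=None):
    """One run on a nonstationary k-armed testbed; alpha=None means sample averages."""
    rng = rng or np.random.default_rng()
    q_true = np.zeros(k)                      # all q*(a) start out equal
    Q = np.zeros(k)
    N = np.zeros(k)
    rewards = np.empty(steps)
    for t in range(steps):
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        step = alpha if alpha is not None else 1.0 / N[a]
        Q[a] += step * (r - Q[a])
        rewards[t] = r
        q_true += rng.normal(0.0, 0.01, size=k)   # independent random walks of q*(a)
    return rewards

runs, steps = 100, 10_000                     # fewer runs than the book's 2000, for speed
avg_sample = np.zeros(steps)
avg_const = np.zeros(steps)
for i in range(runs):
    avg_sample += run_bandit(steps, rng=np.random.default_rng(i))
    avg_const += run_bandit(steps, alpha=0.1, rng=np.random.default_rng(10_000 + i))
print("avg reward over last 1000 steps, sample averages:", (avg_sample[-1000:] / runs).mean())
print("avg reward over last 1000 steps, alpha = 0.1    :", (avg_const[-1000:] / runs).mean())
```

The constant-step-size learner tracks the drifting \(q_*(a)\) and should end up with a clearly higher average reward; plotting the two accumulated curves reproduces a figure like Figure 2.2.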

2.6
Mysterious Spikes The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps?

  • Because the initial Q values are set high, if the true values are lower than the initial values then, early on, every action that gets tried has its estimate pushed down. Consequently every action ends up being tried, which makes the early behaviour highly exploratory. After a period of updates the estimates shrink and become accurate, and eventually the best action is chosen with high probability. So in the early phase the optimistic initial values make action selection very exploratory and the curve oscillates; later the estimates become accurate, the influence of the initial values fades, and the Q values converge. How pronounced this effect is depends on the relation between the initial and the true Q values. Since there is no \(\epsilon\)-exploration, if the true values are higher than the initial values the agent may lock onto a single action and never escape that local optimum; the curve then shows no oscillation but stays at a low level and never improves. If the true values are lower than the initial values, we get the behaviour of Figure 2.3. If the true values are close to the initial values, the curve is relatively smooth, and how much the algorithm explores is then driven mostly by the variance of the rewards.
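
The early oscillation and the eventual dominance of the optimal action can be seen in a small simulation of optimistic initial values with greedy selection (a sketch roughly following the setup of Figure 2.3: \(Q_1(a)=+5\), \(\alpha=0.1\), \(\epsilon=0\); run and step counts are arbitrary):

```python
import numpy as np

def optimistic_run(steps=200, k=10, q0=5.0, alpha=0.1, eps=0.0, seed=0):
    """Greedy (eps=0) constant-step-size run with optimistic initial estimates."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)
    Q = np.full(k, q0)
    optimal = int(np.argmax(q_true))
    picked_optimal = np.zeros(steps, dtype=bool)
    for t in range(steps):
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)
        Q[a] += alpha * (r - Q[a])          # each visit pulls the optimistic estimate down
        picked_optimal[t] = (a == optimal)
    return picked_optimal

# Average over many tasks: early steps oscillate, then the optimal action dominates.
runs = 500
frac = np.mean([optimistic_run(seed=s) for s in range(runs)], axis=0)
print("optimal-action frequency, steps 1-10   :", frac[:10].round(2))
print("optimal-action frequency, steps 191-200:", frac[-10:].round(2))
```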

2.7
Unbiased Constant-Step-Size Trick In most of this chapter we have used sample averages to estimate action values because sample averages do not produce the initial bias that constant step sizes do (see the analysis leading to (2.6)). However, sample averages are not a completely satisfactory solution because they may perform poorly on nonstationary problems. Is it possible to avoid the bias of constant step sizes while retaining their advantages on nonstationary problems? One way is to use a step size of

\[\beta_n \overset{.}{=} \alpha/ \bar o_n ,
\]

to process the \(n\)th reward for a particular action, where \(\alpha>0\) is a conventional constant step size, and \(\bar o_n\) is a trace of one that starts at 0:

\[\bar o_n \overset{.}{=} \bar o_{n-1} + \alpha(1-\bar o_{n-1}) , \quad \text{for } n \ge 0, \quad \text{with } \bar o_0 \overset{.}{=} 0 .
\]

Carry out an analysis like that in (2.6) to show that \(Q_n\) is an exponential recency-weighted average without initial bias.

  • We want to show that \(Q_n\) is an exponential recency-weighted average and at the same time has no initial bias. Since the step size \(\beta_n\) here also varies with \(n\), we can reuse the derivation from Exercise 2.4 and simply replace \(\alpha_n\) with \(\beta_n\):
\[\begin{aligned}
Q_{n+1} &= Q_n+\beta_n[R_n-Q_n]\\
&= \beta_n R_n+(1-\beta_n)Q_n \\
&= \beta_n R_n+(1-\beta_n)[\beta_{n-1}R_{n-1}+(1-\beta_{n-1})Q_{n-1}] \\
&= \beta_n R_n + (1-\beta_n)\beta_{n-1}R_{n-1}+(1-\beta_n)(1-\beta_{n-1})\beta_{n-2}R_{n-2}+ \cdots \\
&\qquad\quad +(1-\beta_n)(1-\beta_{n-1})\cdots(1-\beta_2)\beta_1R_1+\prod_{j=1}^{n}(1-\beta_j)\,Q_1 \\
&= \prod_{j=1}^{n}(1-\beta_j)\,Q_1+\sum_{i=1}^{n}\beta_iR_i\prod_{j=i+1}^n(1-\beta_j)
\end{aligned}
\]

The reward weights decay with recency: since \(\bar o_n \to 1\) and hence \(\beta_n \to \alpha\), for large \(n\) the weights behave like those of the constant-\(\alpha\) exponential recency-weighted average. To show that there is no initial bias, it suffices to show that the expression does not depend on \(Q_1\), i.e. that the coefficient \(\prod_{j=1}^{n}(1-\beta_j)\) of \(Q_1\) vanishes. Note that

\[\begin{aligned}
\beta_1 &= \alpha/ \bar o_1 \\
&= \alpha / \big(\bar o_0 + \alpha(1- \bar o_0)\big) \\
&= \alpha/ \big(0+\alpha(1-0)\big) \\
&= 1 ,
\end{aligned}
\]

so \(\prod_{j=1}^{n}(1-\beta_j)\,Q_1 = (1-\beta_1)\prod_{j=2}^{n}(1-\beta_j)\,Q_1 = 0\); the estimate does not depend on \(Q_1\), i.e. there is no initial bias.
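
A small numerical check that the trick removes the initial bias while the plain constant-\(\alpha\) update does not (a sketch; the reward stream and \(\alpha\) are arbitrary):

```python
def run_updates(rewards, Q1, alpha=0.1, unbiased=True):
    """Apply constant-alpha updates, optionally with the beta_n = alpha / o_n trick."""
    Q, o_bar = Q1, 0.0
    for r in rewards:
        if unbiased:
            o_bar += alpha * (1.0 - o_bar)   # o_n = o_{n-1} + alpha (1 - o_{n-1})
            step = alpha / o_bar             # beta_n; beta_1 = 1, so Q_1 is wiped out
        else:
            step = alpha
        Q += step * (r - Q)
    return Q

rewards = [1.0, -2.0, 0.5, 3.0, -1.0]
for Q1 in (0.0, 100.0):
    print(Q1,
          run_updates(rewards, Q1, unbiased=True),    # same value for both choices of Q1
          run_updates(rewards, Q1, unbiased=False))   # depends on Q1 (initial bias)
```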

2.8
UCB Spikes In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the 11th step. Why is this? Note that for your answer to be fully satisfactory it must explain both why the reward increases on the 11th step and why it decreases on the subsequent steps. Hint: If \(c = 1\), then the spike is less prominent.

  • Note that the testbed is 10-armed, so the first 10 steps try each of the 10 actions once (if \(N_t(a) = 0\), then \(a\) is considered to be a maximizing action). After those steps every action has been visited exactly once, so the square-root terms \(c\sqrt{\frac{\ln t}{N_t(a)}}\) are identical for all actions, and on step 11 the algorithm picks the action with the highest reward observed in the first 10 steps; hence the average reward jumps on the 11th step. Once that action has been visited twice, its square-root term becomes noticeably smaller than those of the actions visited only once, so on the following step some other action with a (typically) lower reward is chosen, and the average reward drops again. If the parameter is changed from \(c = 2\) to \(c = 1\), the square-root bonus carries less weight; if the reward of the action chosen on step 11 is clearly larger than the others', that action keeps being selected until its bonus term is clearly smaller than the others', so the oscillation is milder than with \(c = 2\).
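
The spike and the subsequent dip can be reproduced with a direct implementation of UCB selection on the 10-armed testbed (a sketch assuming \(c=2\) and sample-average estimates; run counts are arbitrary):

```python
import numpy as np

def ucb_run(steps=30, k=10, c=2.0, seed=0):
    """UCB action selection on one 10-armed task; unvisited actions are maximizing."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)
    Q = np.zeros(k)
    N = np.zeros(k)
    rewards = np.empty(steps)
    for t in range(1, steps + 1):
        if np.any(N == 0):
            a = int(np.argmax(N == 0))                  # N_t(a) = 0 => a is maximizing
        else:
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                       # sample-average estimate
        rewards[t - 1] = r
    return rewards

# Averaged over many tasks, step 11 stands out, then the reward dips again.
avg = np.mean([ucb_run(seed=s) for s in range(2000)], axis=0)
print("average reward on steps 9-13:", avg[8:13].round(2))
```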

2.9
Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.

  • Let the two actions be \(a_0\) and \(a_1\) and expand the definition directly:
\[\Pr\{A_t=a\} \overset{.}{=} \frac{e^{H_t(a)}}{e^{H_t(a_0)}+e^{H_t(a_1)}}
=\begin{cases}
\dfrac{1}{1+e^{-(H_t(a_0)-H_t(a_1))}} & \text{if } a=a_0 \\[2ex]
\dfrac{1}{1+e^{-(H_t(a_1)-H_t(a_0))}} & \text{if } a=a_1
\end{cases}
\]

This is exactly the sigmoid (logistic) form; the probability of each action depends on the difference between the two preferences \(H_t(a)\).
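
A two-action numerical check that the softmax equals the logistic of the preference difference (arbitrary preference values):

```python
import math

def softmax2(h0, h1):
    """Two-action softmax probability of action a0."""
    e0, e1 = math.exp(h0), math.exp(h1)
    return e0 / (e0 + e1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h0, h1 = 0.7, -1.3                          # arbitrary preferences H_t(a0), H_t(a1)
print(softmax2(h0, h1), sigmoid(h0 - h1))   # identical values
```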

2.10
Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expectation of success you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don’t know the true action values). This is an associative search task. What is the best expectation of success you can achieve in this task, and how should you behave to achieve it?

  • If the two cases cannot be distinguished, the expected values of the two actions are \(Q(a_1)=0.5\times 0.1+0.5\times 0.9=0.5\) and \(Q(a_2)=0.5\times 0.2+0.5\times 0.8=0.5\), so the best achievable expectation of success is 0.5, and it is reached no matter how actions are chosen: the reinforcement learning methods described earlier, or even a random policy, all attain it. If the two cases can be distinguished, the best achievable expectation is \(Q=0.5\times Q(a_2\mid A)+0.5\times Q(a_1\mid B)=0.5\times 0.2+0.5\times 0.9=0.55\), i.e. choose action 2 in case A and action 1 in case B. This behaviour can be learned, separately for each case, with the methods described earlier, such as \(\epsilon\)-greedy, UCB, or the gradient bandit algorithm.
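
The two expectations can be spelled out in a few lines (just the arithmetic above, not a learning algorithm):

```python
# Case probabilities and true action values for Exercise 2.10.
p_A, p_B = 0.5, 0.5
q = {("A", 1): 0.1, ("A", 2): 0.2, ("B", 1): 0.9, ("B", 2): 0.8}

# Cannot tell the cases apart: any fixed (or random) choice gives 0.5.
best_blind = max(p_A * q[("A", a)] + p_B * q[("B", a)] for a in (1, 2))
# Told the case each step: pick action 2 in case A and action 1 in case B.
best_informed = p_A * max(q[("A", a)] for a in (1, 2)) + p_B * max(q[("B", a)] for a in (1, 2))

print(best_blind, best_informed)   # 0.5 and 0.55
```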

2.11
(programming) Make a figure analogous to Figure 2.6 for the nonstationary case outlined in Exercise 2.5. Include the constant-step-size \(\epsilon\)-greedy algorithm with \(\alpha=0.1\). Use runs of 200,000 steps and, as a performance measure for each algorithm and parameter setting, use the average reward over the last 100,000 steps.
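
A minimal sketch of this parameter study, covering only the constant-step-size \(\epsilon\)-greedy curve (assumptions: a handful of runs per setting instead of the book's 2000; the other methods of Figure 2.6, such as UCB, gradient bandit, and optimistic greedy, would be added analogously):

```python
import numpy as np

def eps_greedy_nonstationary(eps, steps=200_000, k=10, alpha=0.1, seed=0):
    """Constant-step-size eps-greedy on the random-walk testbed of Exercise 2.5."""
    rng = np.random.default_rng(seed)
    q_true = np.zeros(k)
    Q = np.zeros(k)
    late_reward = 0.0
    for t in range(steps):
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)
        Q[a] += alpha * (r - Q[a])
        if t >= steps // 2:
            late_reward += r                       # performance measure: last 100,000 steps
        q_true += rng.normal(0.0, 0.01, size=k)    # random walk of the true values
    return late_reward / (steps - steps // 2)

# Parameter study over eps (powers of two), a few runs per setting.
for eps in (1/128, 1/64, 1/32, 1/16, 1/8, 1/4):
    scores = [eps_greedy_nonstationary(eps, seed=s) for s in range(3)]
    print(f"eps = {eps:.4f}: avg reward over last 100,000 steps = {np.mean(scores):.3f}")
```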