Introduction to Reinforcement Learning – Exercise Solutions, Chapters 1 and 2

Reinforcement Learning: An Introduction (second edition) – Chapters 1 and 2

Chapter 1

1.1
Self-Play Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

  • In a two-player perfect-information game such as tic-tac-toe, a self-play algorithm converges toward a Nash-equilibrium strategy of the game (Nash equilibrium can be defined formally, and convergence can be argued theoretically). The policy learned this way differs from the one learned against a random opponent: against a fixed random opponent, the learner eventually exploits that opponent's weaknesses (its exploitability) and ends up with a policy specialized against that particular opponent.

1.2
Symmetries Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

  • We can apply the corresponding rotations and reflections to states and actions (data augmentation), and all symmetric copies of a state can share the same labels, e.g. the same value label or action-probability label. This reduces the number of interactions with the environment and makes exploration more efficient, since a single interaction yields several training samples (see the sketch below). However, if the opponent's policy does not respect the symmetries, data augmentation no longer helps: symmetrically equivalent positions then call for different responses and no longer share a value or probability label, so each position has to be treated separately in order to exploit that particular opponent's weaknesses and learn a better policy against it.
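
As a concrete sketch of the data-augmentation idea (the 3×3 board encoding, the value label, and the function name symmetric_copies are illustrative assumptions, not anything from the book):

```python
import numpy as np

def symmetric_copies(board):
    """Return the 8 dihedral variants (4 rotations, each optionally mirrored)
    of a 3x3 tic-tac-toe board."""
    variants = []
    b = board.copy()
    for _ in range(4):
        b = np.rot90(b)
        variants.append(b)
        variants.append(np.fliplr(b))
    return variants

# One observed position and its value estimate yield up to 8 training samples.
board = np.array([[1, 0, 0],
                  [0, -1, 0],
                  [0, 0, 0]])   # 1 = X, -1 = O, 0 = empty
value = 0.6                     # hypothetical value label for this position
augmented = [(b, value) for b in symmetric_copies(board)]
print(len(augmented))           # 8 (some copies coincide for highly symmetric boards)
```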

1.3
Greedy Play Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

  • It would learn a worse policy. A purely greedy agent cannot explore unknown states effectively; it only picks the action that looks best under its current, still inaccurate estimates, and therefore tends to converge to a local optimum.

1.4
Learning from Exploration Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

  • If "not learning from exploratory moves" means that we only skip the update for states reached by exploratory moves, while the states on the same trajectory reached by greedy moves are still updated, then the probabilities learned without exploratory updates converge to the optimal values: the trajectories still contain exploration, so in principle every state can be visited infinitely often. Learning from exploratory moves instead converges to a suboptimal set of values that has the exploration probability mixed in. Conceptually, the first set is the probability of winning under subsequent greedy (optimal) play, while the second is the probability of winning under the exploring behaviour policy itself. If exploration happens only with a small probability, as in \(\epsilon\)-greedy, both ways of learning may still lead to the same decisions: even though the probabilities learned with exploratory updates are slightly off, the greedy choices derived from them can be exactly the same as those derived without. If exploratory moves continue forever, the policy that does not learn from them learns the better set of values and wins more games.

1.5
Other Improvements Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

  • States: the state space can be given a better representation to simplify the problem. Rewards: if the rewards span very different orders of magnitude, they can be normalized. Exploration: design an exploration scheme suited to the problem; with sparse rewards, for instance, a curiosity mechanism or an intrinsic reward based on information gain can make exploration more efficient. Parameters: tune parameters such as the discount rate and the step size to the problem at hand. Algorithms: different problems call for different algorithm families, e.g. model-based vs. model-free, value-based vs. policy-based. For a two-player perfect-information game such as tic-tac-toe, heuristic search methods such as Monte Carlo tree search (MCTS) or an AlphaZero-style algorithm are probably the best options available today.

Chapter 2

2.1
In \(\epsilon\)-greedy action selection, for the case of two actions and \(\epsilon\)= 0.5, what is the probability that the greedy action is selected?

  • With \(\epsilon= 0.5\), the greedy action is selected by exploiting with probability \(1-\epsilon= 0.5\); since there are only two actions, exploring also selects the greedy action with probability \(\epsilon/2=0.25\). The total probability is therefore 0.75.

2.2
Bandit example Consider a k-armed bandit problem with \(k = 4\) actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using \(\epsilon\)-greedy action selection, sample-average action-value estimates, and initial estimates of \(Q_1(a) = 0\), for all \(a\). Suppose the initial sequence of actions and rewards is \(A_1 = 1,R_1=-1,A_2=2,R_2=1,A_3=2,R_3=-2,A_4=2,R_4=2,A_5=3,R_5=0\). On some of these time steps the \(\epsilon\) case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?

  • A table of how the Q estimates evolve makes this easy to see (each row shows the sample-average estimates at the moment the action is selected):
| time step | \(Q(1)\) | \(Q(2)\) | \(Q(3)\) | \(Q(4)\) | action | reward |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 0 | 1 | -1 |
| 2 | -1 | 0 | 0 | 0 | 2 | 1 |
| 3 | -1 | 1 | 0 | 0 | 2 | -2 |
| 4 | -1 | -0.5 | 0 | 0 | 2 | 2 |
| 5 | -1 | 1/3 | 0 | 0 | 3 | 0 |

Just check whether the action chosen at each step is (one of) the actions with the largest Q value; if it is not, the selection must have been random. So time steps 4 and 5 were definitely random selections, while on all the other steps the \(\epsilon\) case could possibly have occurred as well, since a random selection can also land on the greedy action. (The short script below recomputes the table.)
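
The table and the definitely-random steps can be reproduced with a short script (a minimal sketch with sample-average updates; variable names are my own, and ties are treated as allowing a greedy choice):

```python
# Recompute the sample-average estimates at the moment each action is selected.
actions = [1, 2, 2, 2, 3]
rewards = [-1, 1, -2, 2, 0]

Q = {a: 0.0 for a in range(1, 5)}   # Q_1(a) = 0 for all a
N = {a: 0 for a in range(1, 5)}

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = [b for b in Q if Q[b] == max(Q.values())]
    print(f"step {t}: Q={Q}, chose {a}, definitely exploratory: {a not in greedy}")
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]       # incremental sample-average update
```

Running it flags steps 4 and 5 as definitely exploratory, matching the table.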

2.3
In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.

  • The greedy method can be ruled out first: to find the best action reliably, every action must continue to be sampled, which the greedy method does not guarantee; the two \(\epsilon\)-greedy methods do. In the long run, the smaller the exploration rate, the higher the probability of selecting the best action. Asymptotically, \(\epsilon=0.01\) selects the best action with probability at most \(1-\epsilon+\epsilon /10 = 99.1\%\) (on the 10-armed testbed), whereas \(\epsilon=0.1\) reaches at most \(1-\epsilon+\epsilon /10 = 91\%\). So \(\epsilon=0.01\) is better in the long run: its asymptotic probability of choosing the best action is 8.1 percentage points higher, and its cumulative reward is correspondingly larger.

2.4
If the step-size parameters, \(\alpha_n\), are not constant, then the estimate \(Q_n\) is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?

  • Unrolling the update as in (2.6), but with a step size \(\alpha_n\) that varies with \(n\) (a quick numerical check follows the derivation):
\[\begin{array}{l}
Q_{n+1} = Q_n+\alpha_n[R_n-Q_n]\\
\qquad \ \ = \alpha_n R_n+(1-\alpha_n)Q_n \\
\qquad \ \ = \alpha_n R_n+(1-\alpha_n)[\alpha_{n-1}R_{n-1}+(1-\alpha_{n-1})Q_{n-1}] \\
\qquad \ \ = \alpha_n R_n + (1-\alpha_n)\alpha_{n-1}R_{n-1}+(1-\alpha_n)(1-\alpha_{n-1})\alpha_{n-2}R_{n-2}+ \\ \qquad \qquad \cdots +(1-\alpha_n)(1-\alpha_{n-1})\cdots(1-\alpha_2)\alpha_1R_1+\prod_{i=1}^n(1-\alpha_i)\,Q_1 \\
\qquad \ \ = \prod_{i=1}^n(1-\alpha_i)\,Q_1+\sum_{i=1}^{n}\alpha_iR_i\prod_{j=i+1}^n(1-\alpha_j)
\end{array}
\]

The weight on each prior reward \(R_i\) is therefore \(\alpha_i\prod_{j=i+1}^n(1-\alpha_j)\), and the initial estimate \(Q_1\) carries weight \(\prod_{i=1}^n(1-\alpha_i)\).
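
As a quick numerical check of the closed form (a sketch; the step-size sequence, rewards, and initial estimate are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
alphas = rng.uniform(0.1, 0.9, size=n)   # an arbitrary step-size sequence
rewards = rng.normal(size=n)
Q1 = 5.0                                 # a deliberately biased initial estimate

# Iterative form: Q_{k+1} = Q_k + alpha_k (R_k - Q_k)
Q = Q1
for a, r in zip(alphas, rewards):
    Q += a * (r - Q)

# Closed form: prod_i (1 - alpha_i) Q1 + sum_i alpha_i R_i prod_{j>i} (1 - alpha_j)
closed = np.prod(1 - alphas) * Q1 + sum(
    alphas[i] * rewards[i] * np.prod(1 - alphas[i + 1:]) for i in range(n)
)
print(Q, closed)                         # the two values agree up to rounding
```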

2.5
(programming) Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed testbed in which all the \(q_*(a)\) start out equal and then take independent random walks (say by adding a normally distributed increment with mean zero and standard deviation 0.01 to all the \(q_*(a)\) on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, \(\alpha\)= 0.1. Use \(\epsilon\)= 0.1 and longer runs, say of 10,000 steps.
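
A minimal sketch of the experiment (the function name and structure are my own; only 200 runs are averaged instead of the 2000 behind Figure 2.2, to keep the pure-Python loop cheap):

```python
import numpy as np

def run_testbed(steps=10_000, k=10, eps=0.1, alpha=None, runs=200, seed=0):
    """Nonstationary k-armed testbed: all q*(a) start equal and take independent
    random walks (sigma = 0.01 per step).  alpha=None uses incrementally computed
    sample averages; otherwise a constant step size is used."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    pct_optimal = np.zeros(steps)
    for _ in range(runs):
        q_true = np.zeros(k)
        Q = np.zeros(k)
        N = np.zeros(k)
        for t in range(steps):
            a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
            r = rng.normal(q_true[a], 1.0)
            N[a] += 1
            step = 1.0 / N[a] if alpha is None else alpha
            Q[a] += step * (r - Q[a])
            avg_reward[t] += r
            pct_optimal[t] += (a == int(np.argmax(q_true)))
            q_true += rng.normal(0.0, 0.01, size=k)   # random walk of the true values
    return avg_reward / runs, pct_optimal / runs

r_sample, o_sample = run_testbed(alpha=None)    # incrementally computed sample averages
r_const, o_const = run_testbed(alpha=0.1)       # constant step size 0.1
print(r_sample[-1000:].mean(), r_const[-1000:].mean())
```

Plotting the two pairs of curves as in Figure 2.2 should show the constant-step-size method tracking the drifting \(q_*(a)\) while the sample-average method falls further and further behind.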

2.6
Mysterious Spikes The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps?

  • Because the initial Q values are optimistic (larger than the true values), every action tried early on has its estimate revised downward, so all actions get tried and the policy is highly exploratory at first. After a period of updates the estimates shrink toward the true values and become accurate, and the best action is then selected with high probability. The early oscillations and spikes are therefore caused by the strong exploration induced by the optimistic initial values; as the estimates become accurate the influence of the initial values fades and the Q values settle down. How well this works depends on the relation between the initial values and the true values. Since there is no \(\epsilon\)-exploration, if the true values are larger than the initial values the agent may lock onto the first action it tries and never escape that local optimum; the curve then shows no oscillation but stays at a low level. If, on the contrary, the true values are smaller than the initial values, we get the behaviour of Figure 2.3. If the initial values are close to the true values, learning is comparatively smooth, and how much the agent explores is then driven mostly by the variance of the rewards.

2.7
Unbiased Constant-Step-Size Trick In most of this chapter we have used sample averages to estimate action values because sample averages do not produce the initial bias that constant step sizes do (see the analysis leading to (2.6)). However, sample averages are not a completely satisfactory solution because they may perform poorly on nonstationary problems. Is it possible to avoid the bias of constant step sizes while retaining their advantages on nonstationary problems? One way is to use a step size of

\[\beta_n \overset{.}{=} \alpha/ \bar o_n ,
\]

to process the \(n\)th reward for a particular action, where \(\alpha>0\) is a conventional constant step size, and \(\bar o_n\) is a trace of one that starts at 0:

\[\bar o_n \overset{\text{.}}{=} \bar o_{n-1} + \alpha(1-\bar o_{n-1}) , \ \ for \ \ n \ge 0, \ \ with \ \ \bar o_0 \overset{.}{=} 0 .
\]

Carry out an analysis like that in (2.6) to show that \(Q_n\) is an exponential recency-weighted average without initial bias.

  • We want to show that \(Q_n\) is an exponential recency-weighted average and, at the same time, free of initial bias. Since the step size \(\beta_n\) also varies with \(n\), we can reuse the derivation from Exercise 2.4, replacing \(\alpha_n\) with \(\beta_n\):
\[\begin{array}{l}
Q_{n+1} = Q_n+\beta_n[R_n-Q_n]\\
\qquad \ \ = \beta_n R_n+(1-\beta_n)Q_n \\
\qquad \ \ = \beta_n R_n+(1-\beta_n)[\beta_{n-1}R_{n-1}+(1-\beta_{n-1})Q_{n-1}] \\
\qquad \ \ = \beta_n R_n + (1-\beta_n)\beta_{n-1}R_{n-1}+(1-\beta_n)(1-\beta_{n-1})\beta_{n-2}R_{n-2}+ \\ \qquad \qquad \cdots +(1-\beta_n)(1-\beta_{n-1})\cdots(1-\beta_2)\beta_1R_1+\prod_{i=1}^n(1-\beta_i)\,Q_1 \\
\qquad \ \ = \prod_{i=1}^n(1-\beta_i)\,Q_1+\sum_{i=1}^{n}\beta_iR_i\prod_{j=i+1}^n(1-\beta_j)
\end{array}
\]

This is an exponential recency-weighted average. To show that there is no initial bias, it suffices to show that the expression does not depend on \(Q_1\), i.e. that the coefficient \(\prod_{i=1}^n(1-\beta_i)\) of \(Q_1\) vanishes. Looking at \(\beta_1\):

\[\begin{array}{l}
\beta_1 = \alpha/ \bar o_1 \\
\qquad \ \ = \alpha / (\bar o_0 + \alpha(1- \bar o_0)) \\
\qquad \ \ = \alpha/ (0+\alpha(1-0)) \\
\qquad \ \ = 1
\end{array}
\]

so \(1-\beta_1=0\) and the coefficient \(\prod_{i=1}^n(1-\beta_i)\) of \(Q_1\) is zero: the estimate does not depend on \(Q_1\), i.e. it is an exponential recency-weighted average without initial bias. (The short check below confirms this numerically.)
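
A small numerical check of the absence of initial bias (a sketch; the reward sequence and the function name are illustrative):

```python
import numpy as np

def trace_estimate(rewards, alpha=0.1, Q1=0.0):
    """Estimate built with the unbiased constant-step-size trick beta_n = alpha / o_n."""
    Q, o = Q1, 0.0                    # o_0 = 0
    for r in rewards:
        o = o + alpha * (1.0 - o)     # o_n = o_{n-1} + alpha (1 - o_{n-1})
        beta = alpha / o              # beta_1 = 1, so Q1 is overwritten on the first step
        Q += beta * (r - Q)
    return Q

rng = np.random.default_rng(1)
rewards = rng.normal(size=50)
print(trace_estimate(rewards, Q1=0.0), trace_estimate(rewards, Q1=100.0))
# The two printed values are identical: the estimate does not depend on Q1.
```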

2.8
UCB Spikes In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the 11th step. Why is this? Note that for your answer to be fully satisfactory it must explain both why the reward increases on the 11th step and why it decreases on the subsequent steps. Hint: If \(c = 1\), then the spike is less prominent.

  • Note that the testbed is 10-armed, so the first 10 steps try each of the 10 actions once (if \(N_t(a) = 0\), then \(a\) is considered to be a maximizing action). After that, every action has been selected exactly once and the uncertainty terms \(c\sqrt{\frac{\ln t}{N_t(a)}}\) are all equal, so on the 11th step the algorithm picks the action with the highest observed reward so far, which produces the jump in average reward. Once that action has been selected twice, its uncertainty term becomes noticeably smaller than that of the actions selected only once, so on the next step a different, typically lower-reward, action is chosen and the average reward drops again. If \(c = 2\) is changed to \(c = 1\), the uncertainty term carries less weight: when the reward of the action chosen on step 11 is clearly larger than the others, that action keeps being selected until its uncertainty term is markedly smaller than the rest, so the spike and the subsequent dip are less pronounced. (A small simulation below illustrates this.)
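
A small simulation on the standard (stationary) 10-armed testbed, which should reproduce the spike at step 11 and its dampening for \(c = 1\) (a sketch; the deterministic tie-breaking for untried actions is my own simplification):

```python
import numpy as np

def ucb_early_rewards(c=2.0, k=10, steps=15, tasks=2000, seed=0):
    """Average reward of UCB over the first few steps, averaged over many tasks."""
    rng = np.random.default_rng(seed)
    avg = np.zeros(steps)
    for _ in range(tasks):
        q_true = rng.normal(size=k)          # one randomly drawn 10-armed task
        Q = np.zeros(k)
        N = np.zeros(k)
        for t in range(steps):
            untried = np.flatnonzero(N == 0)
            if untried.size > 0:
                a = int(untried[0])          # N_t(a) = 0  =>  a is considered maximizing
            else:
                a = int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / N)))
            r = rng.normal(q_true[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]
            avg[t] += r
    return avg / tasks

print(ucb_early_rewards(c=2.0)[9:13])   # steps 10-13: jump at step 11, dip afterwards
print(ucb_early_rewards(c=1.0)[9:13])   # with c = 1 the spike is less prominent
```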

2.9
Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.

  • Let the two actions be \(a_0\) and \(a_1\) and write the soft-max distribution out directly:
\[\Pr\{A_t=a\} \overset{.}{=} \frac{e^{H_t(a)}}{e^{H_t(a_0)}+e^{H_t(a_1)}}
=\left\{\begin{array}{ll}
\dfrac{1}{1+e^{-(H_t(a_0)-H_t(a_1))}} & \text{if } a=a_0 \\[2ex]
\dfrac{1}{1+e^{-(H_t(a_1)-H_t(a_0))}} & \text{if } a=a_1
\end{array}\right.
\]

This is exactly the sigmoid (logistic) form; the probability of each action depends only on the difference of the two preferences, \(H_t(a_0)-H_t(a_1)\).

2.10
Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expectation of success you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don't know the true action values). This is an associative search task. What is the best expectation of success you can achieve in this task, and how should you behave to achieve it?

  • If the two cases cannot be distinguished, the expected reward of each action is \(Q(a_1)=0.5\times 0.1+0.5\times 0.9=0.5\) and \(Q(a_2)=0.5\times 0.2+0.5\times 0.8=0.5\), so the best achievable expectation is 0.5, and it is achieved no matter how the actions are chosen: any of the earlier reinforcement learning methods, or even a uniformly random policy, attains it. If the two cases can be distinguished, the best achievable expectation is \(Q=0.5\,Q(a_2\mid A)+0.5\,Q(a_1\mid B)=0.5\times 0.2+0.5\times 0.9=0.55\), obtained by learning a separate action choice for each case with any of the methods discussed earlier, e.g. \(\epsilon\)-greedy, UCB, or the gradient bandit algorithm.

2.11
(programming) Make a figure analogous to Figure 2.6 for the nonstationary case outlined in Exercise 2.5. Include the constant-step-size \(\epsilon\)-greedy algorithm with \(\alpha=0.1\). Use runs of 200,000 steps and, as a performance measure for each algorithm and parameter setting, use the average reward over the last 100,000 steps.
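
A sketch of the measurement for the constant-step-size \(\epsilon\)-greedy algorithm alone (the other methods of Figure 2.6 would be added analogously; the function name and the very small number of runs are my own concessions to a pure-Python loop, whereas the book averages over many more runs):

```python
import numpy as np

def last_half_reward(eps, alpha=0.1, steps=200_000, k=10, runs=5, seed=0):
    """Average reward over the last 100,000 steps of a 200,000-step run on the
    nonstationary testbed of Exercise 2.5, using constant-step-size eps-greedy."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(runs):
        q_true = np.zeros(k)
        Q = np.zeros(k)
        for t in range(steps):
            a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
            r = rng.normal(q_true[a], 1.0)
            Q[a] += alpha * (r - Q[a])
            if t >= steps // 2:
                total += r
            q_true += rng.normal(0.0, 0.01, size=k)   # random walk of the q*(a)
    return total / (runs * (steps - steps // 2))

# Sweep epsilon on a log scale, as on the x-axis of Figure 2.6 (alpha fixed at 0.1).
for eps in (1/128, 1/64, 1/32, 1/16, 1/8, 1/4):
    print(eps, last_half_reward(eps))
```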