A Detailed Look at TensorFlow 2.0's New Features Applied to Deep Reinforcement Learning

November 20, 2019

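Since TensorFlow officially announced the new features of its 2.0 release, quite a few people may have been left somewhat confused. Blogger Roman Ring therefore wrote an overview article that demonstrates the TensorFlow 2.0 features concretely by implementing a deep reinforcement learning algorithm.

As the saying goes, real knowledge comes from practice.

TensorFlow 2.0's features were announced some time ago, but many people are probably still unclear about them.

In this tutorial, the author showcases the features of the upcoming TensorFlow 2.0 through deep reinforcement learning (DRL), specifically by implementing an advantage actor-critic (A2C) agent to solve the classic CartPole-v0 environment.

Although the goal of the article is to showcase TensorFlow 2.0, the author first covers the DRL side, including a brief overview of the field.

In fact, since the main focus of the 2.0 release is making developers' lives easier, that is, ease of use, now is a great time to get into DRL with TensorFlow.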

The full source code for this article is available here:

GitHub: https://github.com/inoryy/tensorflow2-deep-reinforcement-learning

Google Colab: https://colab.research.google.com/drive/12QvW7VZSzoaF-Org-u-N6aiTdBN5ohNA

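Installation

As TensorFlow 2.0 is still in an experimental stage, I recommend installing it in a separate (virtual) environment. I prefer Anaconda, so I'll use it to illustrate: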

> conda create -n tf2 python=3.6
> source activate tf2
> pip install tf-nightly-2.0-preview # tf-nightly-gpu-2.0-preview for GPU version

Let's quickly verify that everything works as expected:

>>> import tensorflow as tf
>>> print(tf.__version__)
1.13.0-dev20190117
>>> print(tf.executing_eagerly())
True

Don't worry about the 1.13.x version; it only shows up because this is an early preview. What matters here is that we are in eager mode by default!

>>> print(tf.reduce_sum([1, 2, 3, 4, 5]))
tf.Tensor(15, shape=(), dtype=int32)

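If you are not yet familiar with eager mode, the essence of it is that computation is executed at runtime rather than through a pre-compiled graph. The TensorFlow documentation covers it in more depth: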

https://www.tensorflow.org/tutorials/eager/eager_basics

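Deep Reinforcement Learning

Generally speaking, reinforcement learning is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment by taking actions based on some observations, and receives rewards as a result. Most RL algorithms work by maximizing the sum of rewards the agent collects over a trajectory.

The output of an RL-based algorithm is typically a policy: a function that maps states to actions. A valid policy can be as simple as a hard-coded no-op action. A stochastic policy is represented as a conditional probability distribution over actions given a state.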
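
To make the distinction concrete, here is a minimal sketch of both kinds of policy in plain Python. The state layout follows CartPole's four-value observation, but the probability rule inside `stochastic_policy` is an arbitrary illustration rather than anything from the original code, and both function names are hypothetical.

```python
import random

# Hypothetical two-action setup, loosely modelled on CartPole (0 = left, 1 = right).
ACTIONS = [0, 1]

def noop_policy(state):
    """Deterministic policy: ignores the state and always returns action 0."""
    return 0

def stochastic_policy(state, rng):
    """Stochastic policy: samples an action from pi(a | s).

    The probabilities are made up for illustration: the further the pole
    leans to the right (positive angle), the likelier 'right' becomes.
    """
    pole_angle = state[2]
    p_right = min(max(0.5 + pole_angle, 0.0), 1.0)
    probs = [1.0 - p_right, p_right]
    return rng.choices(ACTIONS, weights=probs, k=1)[0]

rng = random.Random(0)
# cart position, cart velocity, pole angle, pole angular velocity
state = (0.0, 0.0, 0.1, 0.0)
print(noop_policy(state))
print([stochastic_policy(state, rng) for _ in range(10)])
```

The deterministic policy returns the same action every time, while repeated calls to the stochastic one return a mix of actions whose frequencies follow the conditional distribution.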

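Actor-Critic Methods

RL algorithms are often grouped by the objective function they optimize. Value-based methods such as DQN work by reducing the error of the expected state-action values.

Policy gradient methods directly optimize the policy itself by adjusting its parameters, typically via gradient descent. Computing the gradients exactly is usually intractable, so they are instead often estimated with Monte Carlo methods.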
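
For readers who want the estimator spelled out, the Monte Carlo approximation is usually written in the following REINFORCE form. The notation is an assumption of this note rather than something defined in the original post: \pi_\theta is the policy with parameters \theta, \tau^{(i)} is the i-th of N sampled trajectories, and R(\tau) is the total reward of a trajectory.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1}
      \nabla_\theta \log \pi_\theta\!\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)
```

The expectation over all trajectories is what makes the exact gradient intractable; the right-hand side replaces it with an average over trajectories the agent has actually experienced.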

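The most popular approach is a hybrid of the two: actor-critic methods, in which the agent's policy is optimized through policy gradients while a value-based method is used as a bootstrap for the expected value estimates.

Deep Actor-Critic Methods

While much of the fundamental RL theory was developed for the tabular case, modern RL is almost exclusively done with function approximators such as artificial neural networks. Specifically, an RL algorithm is considered "deep" if the policy and value functions are approximated with deep neural networks.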

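(Asynchronous) Advantage Actor-Critic

Over the years, a number of improvements have been made to address the sample efficiency and stability of the learning process.

First, gradients are weighted with returns: discounted future rewards. This somewhat alleviates the credit assignment problem and resolves theoretical issues with infinite time steps.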
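
A small sketch may help show why discounting takes care of the infinite-horizon issue. The helper name `discounted_return` and the reward sequences are illustrative assumptions; the discount factor 0.99 matches the `gamma` used in this post's agent.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

gamma = 0.99
# With +1 reward per step, the undiscounted sum grows without bound as the
# horizon grows, while the discounted sum approaches 1 / (1 - gamma).
for horizon in (10, 100, 10_000):
    print(horizon, round(discounted_return([1.0] * horizon, gamma), 3))
print("limit:", 1.0 / (1.0 - gamma))
```

Because later rewards are scaled down geometrically, the return stays finite no matter how long the trajectory runs, and rewards close to an action count for more than distant ones.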

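Second, an advantage function is used instead of raw returns. The advantage is formed as the difference between the returns and some baseline (e.g. a state-action estimate), and can be thought of as a measure of how good a given action is compared to some average.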
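
In symbols, and taking the baseline to be a state-value estimate V(s_t) (an assumption of this note, chosen because it is what the code later in this post uses), the return and the advantage can be written as:

```latex
\begin{aligned}
G_t &= \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad 0 \le \gamma < 1 \\
A(s_t, a_t) &= G_t - V(s_t)
\end{aligned}
```

A positive advantage means the action turned out better than the baseline predicted for that state, so its probability should be increased; a negative advantage pushes the probability down.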

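Third, an additional entropy maximization term is used in the objective function to ensure the agent sufficiently explores various policies. In essence, entropy measures how random a probability distribution is, and it is maximized by the uniform distribution.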
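
The claim about the uniform distribution is easy to check numerically. This is a minimal sketch with made-up two-action distributions; the `entropy` helper is hypothetical and not part of the original code.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution; 0 * log(0) is treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

uniform = [0.5, 0.5]        # maximally random over two actions
skewed = [0.9, 0.1]         # mostly commits to one action
deterministic = [1.0, 0.0]  # no exploration at all

for name, dist in [("uniform", uniform), ("skewed", skewed), ("deterministic", deterministic)]:
    print(name, round(entropy(dist), 4))
```

The uniform policy scores log(2), the largest value possible with two actions, and the deterministic policy scores zero, which is why rewarding entropy discourages the agent from committing to a single action too early.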

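Finally, multiple workers are used in parallel to speed up sample gathering, while also helping to decorrelate the samples during training.

Incorporating all of these changes with deep neural networks, we arrive at two of the most popular modern algorithms: (asynchronous) advantage actor-critic, or A3C/A2C for short. The difference between the two is more technical than theoretical: as the name suggests, it boils down to how the parallel workers estimate their gradients and propagate them to the model.

With this I will wrap up our tour of DRL methods, as the focus of this blog post is on the TensorFlow 2.0 features. Don't worry if you're still unsure about the subject; things should become clearer with the code examples.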

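Advantage Actor-Critic with TensorFlow 2.0

Let's see what it takes to implement the basis of many modern DRL algorithms: an actor-critic agent, as described in the previous section. For simplicity, we won't implement parallel workers, though most of the code supports them. Interested readers can treat this as an exercise.

As a testbed we will use the CartPole-v0 environment. Although somewhat simplistic, it is still a good choice to start with.

Policy and Value via the Keras Model API

First, let's create the policy and value estimate neural networks under a single model class: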

import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl

class ProbabilityDistribution(tf.keras.Model):
    def call(self, logits):
        # sample a random categorical action from given logits
        return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)

class Model(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__('mlp_policy')
        # no tf.get_variable(), just simple Keras API
        self.hidden1 = kl.Dense(128, activation='relu')
        self.hidden2 = kl.Dense(128, activation='relu')
        self.value = kl.Dense(1, name='value')
        # logits are unnormalized log probabilities
        self.logits = kl.Dense(num_actions, name='policy_logits')
        self.dist = ProbabilityDistribution()

    def call(self, inputs):
        # inputs is a numpy array, convert to Tensor
        x = tf.convert_to_tensor(inputs, dtype=tf.float32)
        # separate hidden layers from the same input tensor
        hidden_logs = self.hidden1(x)
        hidden_vals = self.hidden2(x)
        return self.logits(hidden_logs), self.value(hidden_vals)

    def action_value(self, obs):
        # executes call() under the hood
        logits, value = self.predict(obs)
        action = self.dist.predict(logits)
        # a simpler option, will become clear later why we don't use it
        # action = tf.random.categorical(logits, 1)
        return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)

Then let's verify that the model works as expected:

import gym

env = gym.make('CartPole-v0')
model = Model(num_actions=env.action_space.n)

obs = env.reset()
# no feed_dict or tf.Session() needed at all
action, value = model.action_value(obs[None, :])
print(action, value) # [1] [-0.00145713]

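Things to note here:

Model layers and the execution path are defined separately
There is no "input" layer; the model accepts raw numpy arrays
Two computation paths can be defined in one model via the functional API
A model can contain helper methods such as action sampling
In eager mode, everything works from raw numpy arrays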
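
Since the action-sampling helper relies on logits being unnormalized log probabilities, a plain-Python sketch of what sampling a categorical action from logits means may be useful. It is only a conceptual stand-in for the `tf.random.categorical` call in the model; the `softmax` and `sample_action` helpers and the logits values are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized log probabilities into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Draw one action index with probability softmax(logits)[index]."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.0]  # hypothetical policy output for a two-action environment
probs = softmax(logits)
counts = [0, 0]
for _ in range(10_000):
    counts[sample_action(logits, rng)] += 1
print([round(p, 3) for p in probs], counts)
```

The empirical counts track the softmax probabilities, so a larger logit makes an action more likely without ever making the alternative impossible.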

Random Agent

Now let's move on to the A2CAgent class. First, let's add a test method that runs through a full episode and returns the sum of rewards.

class A2CAgent:
    def __init__(self, model):
        self.model = model

    def test(self, env, render=True):
        obs, done, ep_reward = env.reset(), False, 0
        while not done:
            action, _ = self.model.action_value(obs[None, :])
            obs, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
        return ep_reward

Let's see how the model scores with randomly initialized weights:

agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum) # 18 out of 200

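Still a long way from optimal; on to the training part!

Loss / Objective Function

As I described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) function. In actor-critic we train on three objectives: improving the policy with advantage-weighted gradients plus entropy maximization, and minimizing the value estimate error.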
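
Before turning to Keras, it may help to see the three objectives combined for a single transition in plain Python. This sketch uses the textbook form of each term and is not a line-by-line port of the Keras implementation; the function names are hypothetical, and the default coefficients mirror the `value` and `entropy` hyperparameters used in this post's agent.

```python
import math

def log_softmax(logits):
    """Log probabilities corresponding to a list of logits."""
    shift = max(logits)
    log_total = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return [x - log_total for x in logits]

def a2c_loss(logits, action, advantage, value, ret, value_coef=0.5, entropy_coef=0.0001):
    """Combined loss for one transition (illustrative, not the Keras code)."""
    log_probs = log_softmax(logits)
    probs = [math.exp(lp) for lp in log_probs]
    # Policy term: cross-entropy on the action actually taken, weighted by its advantage.
    policy_loss = -advantage * log_probs[action]
    # Entropy of the policy; it is subtracted because the optimizer minimizes.
    entropy = sum(-p * lp for p, lp in zip(probs, log_probs))
    # Value term: squared error between the value estimate and the return.
    value_loss = (ret - value) ** 2
    return policy_loss - entropy_coef * entropy + value_coef * value_loss

undecided = a2c_loss([0.0, 0.0], action=1, advantage=2.0, value=1.0, ret=3.0)
leaning = a2c_loss([0.0, 1.0], action=1, advantage=2.0, value=1.0, ret=3.0)
print(round(undecided, 4), round(leaning, 4))
```

With a positive advantage, the logits that already favor the chosen action yield the lower loss, which is the direction gradient descent will push the policy.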

import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001}
        self.model = model
        self.model.compile(
            optimizer=ko.RMSprop(lr=0.0007),
            # define separate losses for policy logits and value estimate
            loss=[self._logits_loss, self._value_loss]
        )

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # value loss is typically MSE between value estimates and returns
        return self.params['value']*kls.mean_squared_error(returns, value)

    def _logits_loss(self, acts_and_advs, logits):
        # a trick to input actions and advantages through same API
        actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
        # polymorphic CE loss function that supports sparse and weighted options
        # from_logits argument ensures transformation into normalized probabilities
        cross_entropy = kls.CategoricalCrossentropy(from_logits=True)
        # policy loss is defined by policy gradients, weighted by advantages
        # note: we only calculate the loss on the actions we've actually taken
        # thus under the hood a sparse version of CE loss will be executed
        actions = tf.cast(actions, tf.int32)
        policy_loss = cross_entropy(actions, logits, sample_weight=advantages)
        # entropy loss can be calculated via CE over itself
        entropy_loss = cross_entropy(logits, logits)
        # here signs are flipped because optimizer minimizes
        return policy_loss - self.params['entropy']*entropy_loss

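And we're done with the objective functions! Note how compact the code is: there are almost more comment lines than code.

Agent Training Loop

Finally, there's the training loop itself. It is relatively long, but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.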

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001, 'gamma': 0.99}
        # unchanged from previous section
        ...

    def train(self, env, batch_sz=32, updates=1000):
        # storage helpers for a single batch of data
        actions = np.empty((batch_sz,), dtype=np.int32)
        rewards, dones, values = np.empty((3, batch_sz))
        observations = np.empty((batch_sz,) + env.observation_space.shape)
        # training loop: collect samples, send to optimizer, repeat updates times
        ep_rews = [0.0]
        next_obs = env.reset()
        for update in range(updates):
            for step in range(batch_sz):
                observations[step] = next_obs.copy()
                actions[step], values[step] = self.model.action_value(next_obs[None, :])
                next_obs, rewards[step], dones[step], _ = env.step(actions[step])

                ep_rews[-1] += rewards[step]
                if dones[step]:
                    ep_rews.append(0.0)
                    next_obs = env.reset()

            _, next_value = self.model.action_value(next_obs[None, :])
            returns, advs = self._returns_advantages(rewards, dones, values, next_value)
            # a trick to input actions and advantages through same API
            acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
            # performs a full training step on the collected batch
            # note: no need to mess around with gradients, Keras API handles it
            losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
        return ep_rews

    def _returns_advantages(self, rewards, dones, values, next_value):
        # next_value is the bootstrap value estimate of a future state (the critic)
        returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
        # returns are calculated as discounted sum of future rewards
        for t in reversed(range(rewards.shape[0])):
            returns[t] = rewards[t] + self.params['gamma'] * returns[t+1] * (1-dones[t])
        returns = returns[:-1]
        # advantages are returns - baseline, value estimates in our case
        advantages = returns - values
        return returns, advantages

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # unchanged from previous section
        ...

    def _logits_loss(self, acts_and_advs, logits):
        # unchanged from previous section
        ...
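
The role of the `dones` mask and the bootstrap value in `_returns_advantages` is easiest to see on a tiny batch. The following is a pure-Python counterpart of that method with made-up numbers; the rewards, value estimates, bootstrap value, and the discount of 0.5 are chosen only to keep the arithmetic easy to follow.

```python
def returns_and_advantages(rewards, dones, values, next_value, gamma=0.99):
    """Bootstrapped discounted returns and advantages for one batch.

    A done flag at step t cuts the recursion, so rewards from the next episode
    in the same batch never leak backwards; next_value stands in for everything
    that happens after the batch ends.
    """
    returns = [0.0] * len(rewards) + [next_value]
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1]
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages

# Hypothetical 4-step batch: one episode ends at step 1, a new one starts at step 2.
rewards = [1.0, 1.0, 1.0, 1.0]
dones = [0, 1, 0, 0]
values = [1.5, 1.0, 2.0, 1.0]  # made-up critic estimates
returns, advantages = returns_and_advantages(
    rewards, dones, values, next_value=10.0, gamma=0.5)
print(returns)     # [1.5, 1.0, 4.0, 6.0]
print(advantages)  # [0.0, 0.0, 2.0, 5.0]
```

The return at step 1 is just its own reward because the episode ended there, while steps 2 and 3 inherit part of the bootstrap value from the critic.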

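Training & Results

We're now all set to train our single-worker A2C agent on CartPole-v0! Training should take only a few minutes. Once it completes, you should see an agent that successfully reaches the target score of 200.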

rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env)) # 200 out of 200

In the source code I include some additional helpers that print out the running episode rewards and losses, as well as rewards_history.
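
Those helpers are not reproduced here, but a rough idea of how one might summarize the reward history is sketched below. The `moving_average` function is a hypothetical stand-in rather than the author's helper, and the reward values are invented to resemble a training run.

```python
def moving_average(values, window):
    """Mean of the last `window` entries at each position (fewer at the start)."""
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        averages.append(running / min(i + 1, window))
    return averages

# Made-up episode rewards standing in for the list returned by agent.train(env).
rewards_history = [12.0, 20.0, 35.0, 80.0, 150.0, 200.0, 200.0]
for episode, avg in enumerate(moving_average(rewards_history, window=3)):
    print("episode %d: mean of last 3 = %.1f" % (episode, avg))
```

Smoothing over a few episodes makes the learning trend easier to read, since individual CartPole episodes vary a lot early in training.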

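Static Computational Graph

With eager mode working this well, you might wonder whether static graph execution is still possible. Of course it is! Moreover, it takes just one additional line of code to enable it.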

with tf.Graph().as_default():
    print(tf.executing_eagerly()) # False

    model = Model(num_actions=env.action_space.n)
    agent = A2CAgent(model)

    rewards_history = agent.train(env)
    print("Finished training, testing...")
    print("%d out of 200" % agent.test(env)) # 200 out of 200

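There is one caveat: during static graph execution we can't just have Tensors lying around, which is why we needed the trick with the ProbabilityDistribution class during model definition.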

One More Thing…

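Remember when I said TensorFlow runs in eager mode by default, and even proved it with a code snippet? Well, I lied.

If you use the Keras API to build and manage your models, it will attempt to compile them into static graphs under the hood. So what you end up with is the performance of static computational graphs combined with the flexibility of eager execution.

You can check a model's status via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, though most of the time you probably don't need to: if Keras detects that there is no way around eager mode, it will fall back to it on its own.

To illustrate that the model is indeed running as a static graph, here is a simple benchmark: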

# create a 100000 samples batch
env = gym.make('CartPole-v0')
obs = np.repeat(env.reset()[None, :], 100000, axis=0)

Eager Benchmark

%%time

model = Model(env.action_space.n)
model.run_eagerly = True

print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model(obs)

######## Results #######

Eager Execution:   True
Eager Keras Model: True
CPU times: user 639 ms, sys: 736 ms, total: 1.38 s

Static Benchmark

%%time

with tf.Graph().as_default():
    model = Model(env.action_space.n)

    print("Eager Execution:  ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)

    _ = model.predict(obs)

######## Results #######

Eager Execution:   False
Eager Keras Model: False
CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms

Default Benchmark

%%time

model = Model(env.action_space.n)

print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model.predict(obs)

######## Results #######

Eager Execution:   True
Eager Keras Model: False
CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s

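As you can see, eager mode (1.38 s total CPU time) lags behind static mode (873 ms), and by default the model was indeed executing statically (1.02 s).

Conclusion

Hopefully this article has been helpful in understanding both DRL and the upcoming TensorFlow 2.0. Note that TensorFlow 2.0 is still only a preview and everything is subject to change; if there is something about TensorFlow you especially dislike (or like :)), let the developers know.

A recurring question is whether TensorFlow is better than PyTorch. Maybe, maybe not. Both are great libraries, so it is hard to say which one is better. If you're familiar with PyTorch, you may have noticed that TensorFlow 2.0 has not only caught up with it but also avoided some of the PyTorch API's pitfalls.

Whoever comes out ahead, this competition has been a net positive for developers on both sides, and I look forward to seeing what these frameworks become.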