The ReinforceJS Library (Live Demos of the DP, TD, and DQN Algorithms in Action)

November 21, 2019
Notes

Deep Reinforcement Learning Report

Source: REINFORCEjs

Editor: DeepRL

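Progress in deep reinforcement learning has opened up new approaches to many hard control problems, yet most learners still find the fundamentals difficult, especially dynamic programming (DP), temporal-difference (TD) learning, and Monte Carlo (MC) methods. ReinforceJS is a reinforcement learning library that implements several common RL algorithms and backs them with fun web demos. It is currently maintained by [@karpathy](https://twitter.com/karpathy). The library currently includes dynamic programming for solving finite (and not too large) deterministic MDPs.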

ReinforceJS demonstrates its algorithms live in the GridWorld, PuckWorld, and WaterWorld environments:

Part 1

DP Demo

To use ReinforceJS dynamic programming on an MDP, you have to define an environment object env that provides the methods the DP agent needs:

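env.getNumStates() returns an integer: the total number of states.
env.getMaxNumActions() returns an integer: the maximum number of actions available in any state.
env.allowedActions(s) takes an integer s and returns a list of the available actions, which should be integers from zero to maxNumActions.
env.nextStateDistribution(s, a) is a misnomer, because the library currently assumes a deterministic MDP in which every (state, action) pair has a unique next state. The function should therefore return a single integer that identifies the next state of the world.
env.reward(s, a, ns) returns a float: the reward the agent receives for the s, a, ns transition. In the simplest case, the reward depends only on the state s.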
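As an illustration of this interface, here is a minimal sketch of a hypothetical five-state corridor environment. The name `chainEnv`, its layout, and its reward values are assumptions made up for this example and are not part of ReinforceJS; only the five method names come from the list above.

```javascript
// Hypothetical 5-state corridor: states 0..4, state 4 is the goal.
// Actions: 0 = move left, 1 = move right. Walls clamp the movement.
var chainEnv = {
  numStates: 5,
  getNumStates: function() { return this.numStates; },
  getMaxNumActions: function() { return 2; },
  allowedActions: function(s) {
    return [0, 1]; // both actions are available in every state
  },
  // Deterministic transition: returns the single next state as an integer.
  nextStateDistribution: function(s, a) {
    if (a === 0) { return Math.max(0, s - 1); }
    return Math.min(this.numStates - 1, s + 1);
  },
  // Reward 1.0 for entering the goal from another state, 0.0 otherwise
  // (an assumed reward scheme for this sketch).
  reward: function(s, a, ns) {
    var goal = this.numStates - 1;
    return (ns === goal && s !== goal) ? 1.0 : 0.0;
  }
};
```

An object of this shape should be usable wherever the DP agent expects an environment, for example in place of `Gridworld` when constructing `RL.DPAgent`.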

// create environment
env = new Gridworld();
// create the agent, yay! Discount factor 0.9
agent = new RL.DPAgent(env, {'gamma':0.9});

// call this function repeatedly until convergence:
agent.learn();

// once trained, get the agent's behavior with:
var action = agent.act(s); // s is the integer index of the current state; returns the index of the chosen action

evaluatePolicy: function() {
  // perform a synchronous update of the value function
  var Vnew = zeros(this.ns); // initialize new value function array for each state
  for(var s=0;s < this.ns;s++) {
    var v = 0.0;
    var poss = this.env.allowedActions(s); // fetch all possible actions
    for(var i=0,n=poss.length;i < n;i++) {
      var a = poss[i];
      var prob = this.P[a*this.ns+s]; // probability of taking action under current policy
      var ns = this.env.nextStateDistribution(s,a); // look up the next state
      var rs = this.env.reward(s,a,ns); // get reward for s->a->ns transition
      v += prob * (rs + this.gamma * this.V[ns]);
    }
    Vnew[s] = v;
  }
  this.V = Vnew; // swap
},

updatePolicy: function() {
  // update policy to be greedy w.r.t. learned Value function
  // iterate over all states...
  for(var s=0;s < this.ns;s++) {
    var poss = this.env.allowedActions(s);
    // compute value of taking each allowed action
    var vmax, nmax;
    var vs = [];
    for(var i=0,n=poss.length;i < n;i++) {
      var a = poss[i];
      // compute the value of taking action a
      var ns = this.env.nextStateDistribution(s,a);
      var rs = this.env.reward(s,a,ns);
      var v = rs + this.gamma * this.V[ns];
      // bookkeeping: store it and maintain max
      vs.push(v);
      if(i === 0 || v > vmax) { vmax = v; nmax = 1; }
      else if(v === vmax) { nmax += 1; }
    }
    // update policy smoothly across all argmaxy actions
    for(var i=0,n=poss.length;i < n;i++) {
      var a = poss[i];
      this.P[a*this.ns+s] = (vs[i] === vmax) ? 1.0/nmax : 0.0;
    }
  }
},
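In equation form, the two routines above correspond to the standard policy evaluation and greedy policy improvement steps for a deterministic MDP. The notation is introduced here for illustration: π is the policy stored in `P`, V is the value function, γ is `gamma`, s' is the state returned by `nextStateDistribution(s, a)`, 𝒜(s) is the set returned by `allowedActions(s)`, and 𝒜*(s) is the set of maximizing actions.

```latex
% Policy evaluation (one synchronous sweep), with s' the unique next state of (s, a)
V_{\text{new}}(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\,\bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr]

% Greedy policy improvement, splitting probability evenly across ties
\mathcal{A}^{*}(s) = \arg\max_{a \in \mathcal{A}(s)} \bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr],
\qquad
\pi(a \mid s) =
\begin{cases}
  1 / |\mathcal{A}^{*}(s)| & \text{if } a \in \mathcal{A}^{*}(s) \\
  0 & \text{otherwise}
\end{cases}
```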

Part 2

TD Demo

// agent parameter spec to play with (this gets eval()'d on Agent reset)
var spec = {}
spec.update = 'qlearn'; // 'qlearn' or 'sarsa'
spec.gamma = 0.9; // discount factor, [0, 1)
spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
spec.alpha = 0.1; // value function learning rate
spec.lambda = 0; // eligibility trace decay, [0,1). 0 = no eligibility traces
spec.replacing_traces = true; // use replacing or accumulating traces
spec.planN = 50; // number of planning steps per iteration. 0 = no planning

spec.smooth_policy_update = true; // non-standard, updates policy smoothly to follow max_a Q
spec.beta = 0.1; // learning rate for smooth policy update

// create environment
env = new Gridworld();
// create the agent, yay!
var spec = { alpha: 0.01 }; // see full options on top of this page
agent = new RL.TDAgent(env, spec);

setInterval(function(){ // start the learning loop
  var action = agent.act(s); // s is an integer, action is integer
  // execute action in environment and get the reward
  agent.learn(reward); // the agent improves its Q,policy,model, etc.
}, 0);
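For reference, the core of tabular Q-learning, which the `'qlearn'` update option selects, can be sketched in a few lines. This is a standalone illustration and not the library's internal code: the function name `qLearningUpdate` and the flat `Q[s * numActions + a]` layout are assumptions, and eligibility traces, planning, and the smooth policy update are left out.

```javascript
// One tabular Q-learning step:
//   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
// Q is a flat array indexed as Q[s * numActions + a] (layout assumed here).
function qLearningUpdate(Q, numActions, s, a, r, sNext, alpha, gamma) {
  // value of the best action in the next state
  var best = -Infinity;
  for (var i = 0; i < numActions; i++) {
    best = Math.max(best, Q[sNext * numActions + i]);
  }
  var tdError = r + gamma * best - Q[s * numActions + a];
  Q[s * numActions + a] += alpha * tdError;
  return tdError;
}

// Example: 2 states, 2 actions, all values start at zero.
var Q = [0, 0, 0, 0];
var err = qLearningUpdate(Q, 2, 0, 1, 1.0, 1, 0.1, 0.9);
// err is 1 and Q[1] moves from 0 to 0.1
```

SARSA differs only in the target: instead of the maximum over next actions, it uses the Q-value of the action actually taken in the next state.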

Part 3

DQN Demo

(1) This demo uses PuckWorld:

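The state space is now large and continuous: the agent observes its own position (x, y), its velocity (vx, vy), and the positions of the green target and the red target (8 numbers in total).
The agent can take 4 actions: firing its thrusters to the left, right, up, or down. This gives the agent control over its velocity.
The PuckWorld dynamics integrate the agent's velocity to change its position. The green target occasionally moves to a random location. The red target always slowly follows the agent.
The agent's reward is based on its distance to the green target (lower is better). However, if the agent is near the red target (inside its disk), it receives a negative reward proportional to its distance to the red target.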
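The reward described in the last item can be sketched as a small function. The function name `puckReward`, the `badRadius` parameter, and the exact scaling of the penalty are assumptions made for illustration; the demo's actual constants and formula may differ.

```javascript
// Hypothetical PuckWorld-style reward; positions are {x, y} objects.
function puckReward(agent, green, red, badRadius) {
  var dGreen = Math.hypot(agent.x - green.x, agent.y - green.y);
  var dRed = Math.hypot(agent.x - red.x, agent.y - red.y);
  var r = -dGreen; // being closer to the green target is better
  if (dRed < badRadius) {
    // inside the red disk: a negative term that grows linearly
    // toward the disk center (assumed form of the penalty)
    r += (dRed - badRadius) / badRadius;
  }
  return r;
}

// Agent far from the red target: only the green distance counts.
var safe = puckReward({x: 0, y: 0}, {x: 0.3, y: 0.4}, {x: 1, y: 1}, 0.25);
// Agent sitting on the red target: the full penalty is added.
var cornered = puckReward({x: 0, y: 0}, {x: 0.3, y: 0.4}, {x: 0, y: 0}, 0.25);
```

With this form, `cornered` is always lower than `safe` for the same green distance, which is what pushes the agent to leave the red disk.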

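The agent's optimal strategy is to always head toward the green target (this is the regular PuckWorld) while also avoiding the red target's area of effect. This makes things more interesting, because the agent has to learn to avoid it. It is also fun to watch the red target corner the agent. In that case, the best thing to do is to temporarily pay the price and zoom past quickly, rather than stay cornered and pay a much larger penalty.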

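Interface: the reward the agent currently experiences is shown as its color (green = high, red = low). The action taken by the agent (the medium-sized circle moving around) is shown by an arrow. The DQN agent is created and trained as follows: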

// create environment
var env = {};
env.getNumStates = function() { return 8; }
env.getMaxNumActions = function() { return 4; }
// create the agent, yay!
var spec = { alpha: 0.01 }; // see full options on top of this page
agent = new RL.DQNAgent(env, spec);

setInterval(function(){ // start the learning loop
  var action = agent.act(s); // s is an array of length 8
  // execute action in environment and get the reward
  agent.learn(reward); // the agent improves its Q,policy,model, etc. reward is a float
}, 0);

(2) WaterWorld deep Q-learning demo:

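The state space is even larger and still continuous: the agent has 30 eye sensors pointing in all directions, and each one observes 5 variables: the range, the type of the sensed object (green or red), and the velocity of the sensed object. The agent's proprioception adds two more sensors for its own velocity in the x and y directions. In total this is a 152-dimensional state space (30 × 5 + 2 = 152).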
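The dimension count can be checked with a short sketch that flattens the sensor readings into one state array. The function name `buildState` and the ordering of the 5 values inside each eye are assumptions of this example, and the readings are dummy data.

```javascript
var NUM_EYES = 30;
var VALUES_PER_EYE = 5;

// eyes: array of NUM_EYES arrays, each holding VALUES_PER_EYE numbers
// (how the 5 values are ordered within an eye is assumed, not specified here)
function buildState(eyes, vx, vy) {
  var state = [];
  for (var i = 0; i < eyes.length; i++) {
    for (var j = 0; j < eyes[i].length; j++) {
      state.push(eyes[i][j]);
    }
  }
  state.push(vx, vy); // proprioception: the agent's own velocity
  return state;
}

// Example with dummy readings for every eye
var eyes = [];
for (var k = 0; k < NUM_EYES; k++) {
  eyes.push([1, 0, 0, 0, 0]);
}
var state = buildState(eyes, 0.1, -0.2);
// state.length is 30 * 5 + 2 = 152
```

An array like this is what would be passed as `s` to the agent's `act` call in the learning loop.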

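The agent can take 4 actions: firing its thrusters to the left, right, up, or down. This gives the agent control over its velocity.
The dynamics integrate the velocity of each object to change its position. The green and red targets bounce around.
Contact with any red target (these are apples) gives a reward of +1, and contact with any green target (these are poison) gives a reward of -1.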

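The agent's optimal strategy is to cruise around, stay away from green targets, and eat red targets. What makes this demo interesting is that the state space is so high-dimensional and that the sensed variables are relative to the agent. They are not just the toy x, y coordinates of a fixed number of targets, as in the previous demo.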

// agent parameter spec to play with (this gets eval()'d on Agent reset)
var spec = {}
spec.update = 'qlearn'; // qlearn | sarsa
spec.gamma = 0.9; // discount factor, [0, 1)
spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
spec.alpha = 0.005; // value function learning rate
spec.experience_add_every = 5; // number of time steps before we add another experience to replay memory
spec.experience_size = 10000; // size of experience
spec.learning_steps_per_iteration = 5;
spec.tderror_clamp = 1.0; // for robustness
spec.num_hidden_units = 100; // number of neurons in hidden layer
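Two of these options, `experience_size` and `tderror_clamp`, can be illustrated with a small standalone sketch of a fixed-size replay memory and a clamping helper. The names `ReplayMemory` and `clampTdError` are hypothetical and this is not the library's internal implementation; it only shows the general idea of overwriting the oldest experience once the buffer is full and of limiting the magnitude of the TD error.

```javascript
// Fixed-capacity replay memory that overwrites the oldest entry when full.
function ReplayMemory(capacity) {
  this.capacity = capacity;
  this.items = [];
  this.next = 0; // index of the slot that will be written next
}
ReplayMemory.prototype.add = function(experience) {
  this.items[this.next] = experience;
  this.next = (this.next + 1) % this.capacity;
};
// Draw one stored experience uniformly at random.
ReplayMemory.prototype.sample = function() {
  var i = Math.floor(Math.random() * this.items.length);
  return this.items[i];
};

// Limit the TD error to the interval [-clamp, clamp].
function clampTdError(tdError, clamp) {
  return Math.max(-clamp, Math.min(clamp, tdError));
}

// Example: capacity 3, five experiences added.
var memory = new ReplayMemory(3);
for (var t = 0; t < 5; t++) {
  memory.add({ s: t, a: 0, r: 0.0, sNext: t + 1 });
}
// the slots now hold the experiences from t = 3, 4, 2
```

Clamping keeps a single surprising transition from producing a very large gradient step, which is the robustness the `tderror_clamp` comment refers to.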

Online demo page:

https://cs.stanford.edu/people/karpathy/reinforcejs/index.html

GitHub: https://github.com/karpathy/reinforcejs
