Chapter 8 Planning and Learning with Tabular Methods
Model-based methods rely on planning: they take a model as input and produce or improve a policy by interacting with the simulated environment.
Model-free methods rely on learning: they take real experience with the environment as input.
This chapter combines the two approaches.
Integrated Architectures
8.1 Models and Planning
Distribution model: a model that describes all possible outcomes and their probabilities.
Sample model: a model that produces just one of the possible outcomes, sampled according to its probability.
The MDP assumed in dynamic programming is a distribution model, while the blackjack example from the Monte Carlo chapter used a sample model.
In either case, the model is used to simulate the environment and produce simulated experience.
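To make the distinction concrete, here is a tiny sketch of the two model interfaces (the function names and the toy transition table are purely illustrative, not from the book): a distribution model returns every possible next state and reward together with its probability, while a sample model returns a single next state and reward drawn from those probabilities.

import random

def distribution_model(s, a):
    # Distribution model: every possible (s', r) outcome with its probability.
    # The entries below are a made-up example.
    return {("s1", 0.0): 0.7, ("s2", 1.0): 0.3}

def sample_model(s, a):
    # Sample model: a single (s', r) outcome drawn according to those probabilities.
    outcomes = distribution_model(s, a)
    s_next, r = random.choices(list(outcomes.keys()), weights=list(outcomes.values()))[0]
    return s_next, r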
Planning is any computational process that takes a model as input and produces or improves a policy by interacting with the modeled (simulated) environment:
The following is adapted from David Silver's reinforcement learning lecture slides.
The model, in more detail
- A model $\mathcal{M}_\eta$ is a representation of an MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}\rangle$, parametrized by $\eta$
- Assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
- So a model $\mathcal{M} = \langle\mathcal{P}_\eta, \mathcal{R}_\eta\rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$:
  $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$
  $R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$
- We also typically assume that state transitions and rewards are conditionally independent:
  $\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$
The goal of model learning is to estimate a model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$; this is a supervised learning problem.
- Learning $s, a \rightarrow r$ is a regression problem
- Learning $s, a \rightarrow s'$ is a density estimation problem
- Pick a loss function, e.g. mean-squared error, KL divergence, ...
- Find the parameters $\eta$ that minimize the empirical loss
Examples of models:
- Table Lookup Model (the model used in the book and in the paper discussed below)
- Linear Expectation Model
- Linear Gaussian Model
- Gaussian Process Model
- Deep Belief Network Model
…
A brief look at the Table Lookup Model
The model is an explicit MDP $\hat{\mathcal{P}}, \hat{\mathcal{R}}$; with $N(s, a)$ the total number of visits to each state-action pair,
$\hat{\mathcal{P}}^a_{ss'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')$
$\hat{\mathcal{R}}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t$
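A minimal table-lookup model in Python (the class and method names are illustrative, not from the book or the slides): it counts observed transitions, and during planning it can be queried as a sample model.

import random
from collections import defaultdict

class TableLookupModel:
    # Counts observed (s, a) -> (r, s') transitions and samples from the
    # empirical distribution during planning.
    def __init__(self):
        self.successor_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                           # (s, a) -> sum of rewards
        self.visits = defaultdict(int)                                 # (s, a) -> N(s, a)

    def update(self, s, a, r, s_next):
        self.successor_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def sample(self, s, a):
        # Return (r, s') with s' drawn from the empirical transition probabilities
        # and r the empirical mean reward for (s, a).
        successors, counts = zip(*self.successor_counts[(s, a)].items())
        s_next = random.choices(successors, weights=counts)[0]
        r = self.reward_sum[(s, a)] / self.visits[(s, a)]
        return r, s_next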
Planning, in more detail
Planning with a Model
- Given a model $\mathcal{M}_\eta = \langle\mathcal{P}_\eta, \mathcal{R}_\eta\rangle$
- Solve the MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta\rangle$
- Using your favourite planning algorithm:
- Value iteration
- Policy iteration
- Tree search (the MCTS introduced in the book and below)
Sample-based planning: use the model only to generate sample experience, then apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, or Q-learning.
The methods discussed in the book are mainly state-space planning: a search through the state space for an optimal policy or an optimal path to a goal.
The methods that follow share two basic ideas:
- all state-space planning methods involve computing value functions as a key intermediate step toward improving the policy
- they compute value functions by applying updates or backup operations to simulated experience
8.2 Dyna: Integrated Planning, Acting, and Learning
An architecture that integrates the two within reinforcement learning:
Direct RL: learn and improve the policy directly from real experience with the environment.
Indirect RL: use experience to learn a model, then feed the model to planning to learn and improve the policy.
In Dyna-Q, acting, model learning, and direct RL each require very little computation; most of the remaining time is consumed by the planning process.
Steps (d), (e), and (f) perform direct reinforcement learning, model learning, and planning, respectively.
The model is first learned from real experience, and planning then uses the model to further update the Q values.
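A minimal tabular Dyna-Q sketch (illustrative names; it assumes Q is a dict of dicts initialized for every state and action, and a deterministic environment so the model can simply remember the last observed outcome). Each call performs one direct Q-learning update on the real transition, updates the model, and then runs n planning updates on transitions replayed from the model.

import random

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.95, n_planning=10):
    # (d) direct RL: one-step tabular Q-learning on the real transition
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
    # (e) model learning: for a deterministic environment, remember the last outcome
    model[(s, a)] = (r, s_next)
    # (f) planning: n Q-learning updates on previously observed (s, a) pairs
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next].values()) - Q[ps][pa])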
Here is an example showing the effect of planning.
The more planning updates per real step, the faster the convergence.
8.3 When the Model Is Wrong
So far we have assumed that the environment never changes. If it does change, the model becomes wrong, and the policy derived from it will rarely explore the parts of the environment that have changed.
A heuristic is proposed to address this problem; the agent that implements it is Dyna-Q+.
It gives extra reward to transitions that have not been tried for a long time.
Concretely:
In particular, if the modeled reward for a transition is $r$, and the transition has not been tried in $\tau$ time steps, then planning updates are done as if that transition produced a reward of $r + \kappa\sqrt{\tau}$, for some small $\kappa$.
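A sketch of how the planning step changes under Dyna-Q+ (only the bonus line differs from the Dyna-Q sketch above; last_tried and t are assumed bookkeeping that records when each transition was last tried for real):

import random
from math import sqrt

def dyna_q_plus_planning(Q, model, last_tried, t, kappa=1e-3,
                         alpha=0.1, gamma=0.95, n_planning=10):
    for _ in range(n_planning):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        tau = t - last_tried[(s, a)]        # time steps since (s, a) was last tried for real
        bonus_r = r + kappa * sqrt(tau)     # plan as if the transition yielded the bonused reward
        Q[s][a] += alpha * (bonus_r + gamma * max(Q[s_next].values()) - Q[s][a])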
8.4 Prioritized Sweeping
So far, planning has selected state-action pairs from the model uniformly at random, which is inefficient. Instead, keep a queue of state-action pairs: the sizes of the value changes are far from uniform, so preferentially update the pairs whose values would change the most.
A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change.
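A minimal prioritized-sweeping sketch, assuming a deterministic table-lookup model model[(s, a)] -> (r, s_next) and a predecessors map from each state to the (s, a) pairs predicted to lead to it (all names are illustrative):

import heapq

def prioritized_sweeping(Q, model, predecessors, s, a,
                         alpha=0.1, gamma=0.95, theta=1e-4, n_planning=10):
    pq = []
    r, s_next = model[(s, a)]
    p = abs(r + gamma * max(Q[s_next].values()) - Q[s][a])
    if p > theta:
        heapq.heappush(pq, (-p, (s, a)))      # max-heap via negated priorities
    for _ in range(n_planning):
        if not pq:
            break
        _, (ps, pa) = heapq.heappop(pq)
        pr, ps_next = model[(ps, pa)]
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next].values()) - Q[ps][pa])
        # push predecessors whose estimated values would now change nontrivially
        for (pre_s, pre_a) in predecessors.get(ps, []):
            pre_r, _ = model[(pre_s, pre_a)]
            pp = abs(pre_r + gamma * max(Q[ps].values()) - Q[pre_s][pre_a])
            if pp > theta:
                heapq.heappush(pq, (-pp, (pre_s, pre_a)))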
8.5 Expected vs. Sample Updates
Considering one-step updates, the work so far has varied along three dimensions (a fourth will be introduced later):
- whether they update state values or action values (i.e. whether they approximate $v$ or $q$)
- whether they estimate the value for the optimal policy or for an arbitrary given policy (i.e. $v_*, q_*$ versus $v_\pi, q_\pi$)
- whether the updates are expected updates, considering all possible events that might happen, or sample updates, considering a single sample of what might happen.
Expected update: its accuracy is limited only by the accuracy of the successor values $Q(s', a')$.
Sample update: additionally affected by sampling error, but computationally much cheaper.
Experiments show that on problems with large branching factors, sample updates outperform expected updates for a given amount of computation.
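For the $q_*$ case, the two update forms look roughly like this (with $\hat{p}$ the model's estimated dynamics and $S', R$ a sampled next state and reward):

$\text{expected: } Q(s,a) \leftarrow \sum_{s', r} \hat{p}(s', r \mid s, a)\,\big[r + \gamma \max_{a'} Q(s', a')\big]$

$\text{sample: } Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[R + \gamma \max_{a'} Q(S', a') - Q(s,a)\big]$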
8.6 Trajectory Sampling
In other words, one simulates explicit individual trajectories and performs updates at the state or state-action pairs encountered along the way. We call this way of generating experience and updates trajectory sampling.
The experiments compare two ways of distributing updates on randomly generated problems with branching factor $b$ (each state-action pair has $b$ possible next states): under the on-policy distribution, updates follow trajectories simulated from the start state under the current policy; under the uniform distribution, every state-action pair is updated in turn with equal probability.
In the short run, on-policy sampling converges faster; in the long run, the uniform distribution catches up and can do better (the advantage of on-policy focusing is larger and longer-lasting on large problems).
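A sketch contrasting the two ways of distributing one-step updates, reusing the sample(s, a) interface of the table-lookup model sketched earlier (all other names are illustrative):

import random

def uniform_updates(Q, model, states, actions, n_updates, alpha=0.1, gamma=0.95):
    # Sweep state-action pairs uniformly at random.
    for _ in range(n_updates):
        s, a = random.choice(states), random.choice(actions)
        r, s_next = model.sample(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])

def on_policy_trajectory_updates(Q, model, start_state, actions, n_updates,
                                 alpha=0.1, gamma=0.95, eps=0.1):
    # Update along a trajectory generated by the current eps-greedy policy.
    s = start_state
    for _ in range(n_updates):
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(Q[s], key=Q[s].get)
        r, s_next = model.sample(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
        s = s_next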
8.7 Real-time Dynamic Programming
RTDP updates the values of states visited in actual or simulated trajectories by means of expected tabular value-iteration updates
The update rule is essentially the value-iteration update: $V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V(s')\big]$
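A minimal RTDP sketch (assuming a distribution model given as model[(s, a)] -> list of (prob, reward, s_next) triples; names are illustrative and terminal-state handling is omitted):

import random

def rtdp_trajectory(V, model, start_state, actions, gamma=0.95, max_steps=100):
    # Run one trajectory, acting greedily w.r.t. V and applying an expected
    # value-iteration backup to each visited state.
    s = start_state
    for _ in range(max_steps):
        backups = {a: sum(p * (r + gamma * V[s_next])
                          for p, r, s_next in model[(s, a)]) for a in actions}
        a_greedy = max(backups, key=backups.get)
        V[s] = backups[a_greedy]            # update only the state actually visited
        probs, next_states = zip(*[(p, s_next) for p, r, s_next in model[(s, a_greedy)]])
        s = random.choices(next_states, weights=probs)[0]
    return V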
Simulation-Based Search
8.8 Planning at Decision Time
Background planning: planning is not focused on the current state.
Decision-time planning: from the current state, plan, select an action, execute it, and repeat from the resulting state until a terminal state is reached. More generally, planning used in this way can look much deeper than one step ahead and evaluate action choices leading to many different predicted states and reward trajectories. Unlike background planning, planning here is focused on a particular state.
Decision-time planning is most useful in applications in which fast responses are not required.
8.9 Heuristic Search
The classical state-space planning methods of artificial intelligence are decision-time planning methods collectively known as heuristic search.
With a perfect model and an imperfect action-value function, deeper search in fact usually yields better policies.
The backgammon example shows that looking even a few steps ahead noticeably improves play.
A human player wants their evaluations to be accurate but cannot keep track of that many states; a computer can, which is part of why heuristic search is so effective.
By focusing computation on the relevant candidate states, decision-time planning can produce better policies than updating states indiscriminately.
8.10 Rollout Algorithms
Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. For a given policy, the action values are estimated by averaging the returns of the simulated trajectories.
The action with the highest estimated value is executed, a next state is reached, and the whole procedure is repeated from that state.
Given a rollout policy, rollout algorithms produce Monte Carlo estimates of action values only for the current state.
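A minimal rollout sketch. It assumes a sample model exposed as simulate(s, a) -> (reward, next_state, done) and a fixed rollout_policy(state) -> action; both names are illustrative.

def rollout_value(simulate, s, a, rollout_policy, gamma=1.0, n_rollouts=50, max_depth=200):
    # Monte Carlo estimate of q(s, a): average return over n_rollouts trajectories
    # that start with action a and then follow the rollout policy.
    total = 0.0
    for _ in range(n_rollouts):
        g, discount = 0.0, 1.0
        r, state, done = simulate(s, a)
        g += discount * r
        depth = 0
        while not done and depth < max_depth:
            discount *= gamma
            r, state, done = simulate(state, rollout_policy(state))
            g += discount * r
            depth += 1
        total += g
    return total / n_rollouts

def rollout_action(simulate, s, actions, rollout_policy, **kw):
    # Decision-time planning: execute the action with the highest rollout estimate.
    return max(actions, key=lambda a: rollout_value(simulate, s, a, rollout_policy, **kw))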
8.11 Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS)
Monte Carlo tree search (MCTS) is a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results.
Monte Carlo Methods
The value of an action $a$ in state $s$ can be estimated as $Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{N(s)} \mathbb{I}_i(s, a)\, z_i$, where $N(s, a)$ is the number of times action $a$ has been selected from state $s$, $N(s)$ is the number of times a game has been played out through state $s$, $z_i$ is the result of the $i$th simulation played out from $s$, and $\mathbb{I}_i(s, a)$ is 1 if action $a$ was selected from state $s$ on the $i$th play-out from $s$ (and 0 otherwise).
For a large class of reward distributions, there is no policy whose regret grows more slowly than $O(\ln n)$.
Upper Confidence Bounds (UCBs)
The simplest of these policies is called UCB1.
For the multi-armed bandit problem, UCB1 selects the arm $j$ that maximizes $\bar{X}_j + \sqrt{\frac{2 \ln n}{n_j}}$,
where $\bar{X}_j$ is the average reward obtained from arm $j$, $n_j$ is the number of times arm $j$ has been played, and $n$ is the total number of plays so far.
The term $\bar{X}_j$ encourages exploitation, while the term $\sqrt{2 \ln n / n_j}$ encourages exploration.
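A minimal UCB1 arm-selection sketch (names are illustrative; arms that have never been played are tried first so the confidence term is well defined):

from math import log, sqrt

def ucb1_select(mean_reward, counts, total_plays):
    # mean_reward[j] is the average reward of arm j, counts[j] how often it was played.
    for j, n_j in enumerate(counts):
        if n_j == 0:                      # play each arm once before using the bound
            return j
    return max(range(len(counts)),
               key=lambda j: mean_reward[j] + sqrt(2 * log(total_plays) / counts[j]))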
A. Algorithm
The basic algorithm involves iteratively building a search tree until some predefined computational budget–typically a time, memory, or iteration constraint–is reached, at which point the search is halted and the best performing root action returned.
- Selection: Starting at the root node, a child selection policy is recursively applied to descend through the tree until the most urgent expandable node is reached. A node is expandable if it represents a nonterminal state and has unvisited (i.e., unexpanded) children.
- Expansion: One (or more) child nodes are added to expand the tree, according to the available actions.
- Simulation: A simulation is run from the new node(s) according to the default policy to produce an outcome.
- Backpropagation: The simulation result is “backed up” (i.e., backpropagated) through the selected nodes to update their statistics.
These steps can be grouped into two policies:
- Tree policy:Select or create a leaf node from the nodes already contained within the search tree (selection and expansion).
- Default policy: Play out the domain from a given nonterminal state to produce a value estimate (simulation).
The result of the overall search is the action that leads to the best child of the root node.
(Figure: one iteration of the general MCTS approach.)
Schadd describes four criteria for selecting the winning action, based on the work of Chaslot et al.:
- max child: select the root child with the highest reward;
- robust child: select the most visited root child;
- max-robust child: select the root child with both the highest visit count and the highest reward; if none exists, then continue searching until an acceptable visit count is achieved;
- secure child: select the child which maximizes a lower confidence bound.
B. Development
Plain Monte Carlo methods give no game-theoretic guarantee on the selected action: even after running for a long time, the action finally chosen may not be the optimal one.
Monte Carlo tree search adds the selection mechanism of tree search: methods such as UCB1 can be used to balance exploitation and exploration, so that the probability of selecting a suboptimal action is reduced as quickly as possible.
C. Upper Confidence Bounds for Trees (UCT)
1) The UCT Algorithm: the goal of MCTS is to approximate the true game-theoretic values of the actions available in the current state. UCB1 can be chosen as the tree policy.
UCB1 is simple, and it is guaranteed to be within a constant factor of the best possible bound on the growth of regret.
Every time a node (action) is to be selected within the existing tree, the choice may be modeled as an independent multi-armed bandit problem. A child node $j$ is selected to maximize $UCT = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}}$,
where $n$ is the number of times the current (parent) node has been visited, $n_j$ is the number of times child $j$ has been visited, and $C_p > 0$ is a constant.
The constant $C_p$ can be adjusted to lower or increase the amount of exploration; Kocsis and Szepesvári showed that $C_p = 1/\sqrt{2}$ satisfies the Hoeffding inequality for rewards in the range $[0, 1]$.
For the rest of the iteration: if a node selected by UCB descent has children that are not yet part of the tree, one of them is chosen at random and added to the tree. The default policy is then followed from that node until a terminal state is reached; in the simplest case the default policy is uniformly random. Finally, the value of the terminal state is backpropagated to update every node visited during this iteration, from the newly added node up to the root.
Each node stores two quantities: the number of times it has been visited, $N(v)$, and the total reward of all playouts that passed through it, $Q(v)$; the ratio $Q(v)/N(v)$ is then an approximation of the node's game-theoretic value.
Algorithm 3 shows an alternative and more efficient backup method for two-player, zero-sum games with alternating moves, which is typically found in the literature.
Given enough time and memory, UCT allows MCTS to converge to the minimax tree and is thus optimal.
Below is example UCT code, lightly modified to run on Python 3, including three example games. It is taken from www.mcts.ai.
# This is a very simple implementation of the UCT Monte Carlo Tree Search algorithm (originally written for Python 2.7, lightly adapted here for Python 3).
# The function UCT(rootstate, itermax, verbose = False) is towards the bottom of the code.
# It aims to have the clearest and simplest possible code, and for the sake of clarity, the code
# is orders of magnitude less efficient than it could be made, particularly by using a
# state.GetRandomMove() or state.DoRandomRollout() function.
#
# Example GameState classes for Nim, OXO and Othello are included to give some idea of how you
# can write your own GameState class and use UCT in your 2-player game. Change the game to be played in
# the UCTPlayGame() function at the bottom of the code.
#
# Written by Peter Cowling, Ed Powley, Daniel Whitehouse (University of York, UK) September 2012.
#
# Licence is granted to freely use and distribute for any sensible/legal purpose so long as this comment
# remains in any distributed code.
#
# For more information about Monte Carlo Tree Search check out our web site at www.mcts.ai
from math import *
import random
class GameState:
""" A state of the game, i.e. the game board. These are the only functions which are
absolutely necessary to implement UCT in any 2-player complete information deterministic
zero-sum game, although they can be enhanced and made quicker, for example by using a
GetRandomMove() function to generate a random move during rollout.
By convention the players are numbered 1 and 2.
"""
def __init__(self):
self.playerJustMoved = 2 # At the root pretend the player just moved is player 2 - player 1 has the first move
def Clone(self):
""" Create a deep clone of this game state.
"""
st = GameState()
st.playerJustMoved = self.playerJustMoved
return st
def DoMove(self, move):
""" Update a state by carrying out the given move.
Must update playerJustMoved.
"""
self.playerJustMoved = 3 - self.playerJustMoved
def GetMoves(self):
""" Get all possible moves from this state.
"""
def GetResult(self, playerjm):
""" Get the game result from the viewpoint of playerjm.
"""
def __repr__(self):
""" Don't need this - but good style.
"""
pass
class NimState:
""" A state of the game Nim. In Nim, players alternately take 1,2 or 3 chips with the
winner being the player to take the last chip.
In Nim any initial state of the form 4n+k for k = 1,2,3 is a win for player 1
(by choosing k) chips.
Any initial state of the form 4n is a win for player 2.
"""
def __init__(self, ch):
self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
self.chips = ch
def Clone(self):
""" Create a deep clone of this game state.
"""
st = NimState(self.chips)
st.playerJustMoved = self.playerJustMoved
return st
def DoMove(self, move):
""" Update a state by carrying out the given move.
Must update playerJustMoved.
"""
assert move >= 1 and move <= 3 and move == int(move)
self.chips -= move
self.playerJustMoved = 3 - self.playerJustMoved
def GetMoves(self):
""" Get all possible moves from this state.
"""
return [i for i in range(1,min([4, self.chips + 1]))]
def GetResult(self, playerjm):
""" Get the game result from the viewpoint of playerjm.
"""
assert self.chips == 0
if self.playerJustMoved == playerjm:
return 1.0 # playerjm took the last chip and has won
else:
return 0.0 # playerjm's opponent took the last chip and has won
def __repr__(self):
s = "Chips:" + str(self.chips) + " JustPlayed:" + str(self.playerJustMoved)
return s
class OXOState:
""" A state of the game, i.e. the game board.
Squares in the board are in this arrangement
012
345
678
where 0 = empty, 1 = player 1 (X), 2 = player 2 (O)
"""
def __init__(self):
self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
self.board = [0,0,0,0,0,0,0,0,0] # 0 = empty, 1 = player 1, 2 = player 2
def Clone(self):
""" Create a deep clone of this game state.
"""
st = OXOState()
st.playerJustMoved = self.playerJustMoved
st.board = self.board[:]
return st
def DoMove(self, move):
""" Update a state by carrying out the given move.
Must update playerToMove.
"""
assert move >= 0 and move <= 8 and move == int(move) and self.board[move] == 0
self.playerJustMoved = 3 - self.playerJustMoved
self.board[move] = self.playerJustMoved
def GetMoves(self):
""" Get all possible moves from this state.
"""
return [i for i in range(9) if self.board[i] == 0]
def GetResult(self, playerjm):
""" Get the game result from the viewpoint of playerjm.
"""
for (x,y,z) in [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]:
if self.board[x] == self.board[y] == self.board[z]:
if self.board[x] == playerjm:
return 1.0
else:
return 0.0
if self.GetMoves() == []: return 0.5 # draw
assert False # Should not be possible to get here
def __repr__(self):
s= ""
for i in range(9):
s += ".XO"[self.board[i]]
if i % 3 == 2: s += "\n"
return s
class OthelloState:
""" A state of the game of Othello, i.e. the game board.
The board is a 2D array where 0 = empty (.), 1 = player 1 (X), 2 = player 2 (O).
In Othello players alternately place pieces on a square board - each piece played
has to sandwich opponent pieces between the piece played and pieces already on the
board. Sandwiched pieces are flipped.
This implementation modifies the rules to allow variable sized square boards and
terminates the game as soon as the player about to move cannot make a move (whereas
the standard game allows for a pass move).
"""
def __init__(self,sz = 8):
self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
self.board = [] # 0 = empty, 1 = player 1, 2 = player 2
self.size = sz
assert sz == int(sz) and sz % 2 == 0 # size must be integral and even
for y in range(sz):
self.board.append([0]*sz)
self.board[sz//2][sz//2] = self.board[sz//2-1][sz//2-1] = 1
self.board[sz//2][sz//2-1] = self.board[sz//2-1][sz//2] = 2
def Clone(self):
""" Create a deep clone of this game state.
"""
st = OthelloState()
st.playerJustMoved = self.playerJustMoved
st.board = [self.board[i][:] for i in range(self.size)]
st.size = self.size
return st
def DoMove(self, move):
""" Update a state by carrying out the given move.
Must update playerToMove.
"""
(x,y)=(move[0],move[1])
assert x == int(x) and y == int(y) and self.IsOnBoard(x,y) and self.board[x][y] == 0
m = self.GetAllSandwichedCounters(x,y)
self.playerJustMoved = 3 - self.playerJustMoved
self.board[x][y] = self.playerJustMoved
for (a,b) in m:
self.board[a][b] = self.playerJustMoved
def GetMoves(self):
""" Get all possible moves from this state.
"""
return [(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 0 and self.ExistsSandwichedCounter(x,y)]
def AdjacentToEnemy(self,x,y):
""" Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
"""
for (dx,dy) in [(0,+1),(+1,+1),(+1,0),(+1,-1),(0,-1),(-1,-1),(-1,0),(-1,+1)]:
if self.IsOnBoard(x+dx,y+dy) and self.board[x+dx][y+dy] == self.playerJustMoved:
return True
return False
def AdjacentEnemyDirections(self,x,y):
""" Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
"""
es = []
for (dx,dy) in [(0,+1),(+1,+1),(+1,0),(+1,-1),(0,-1),(-1,-1),(-1,0),(-1,+1)]:
if self.IsOnBoard(x+dx,y+dy) and self.board[x+dx][y+dy] == self.playerJustMoved:
es.append((dx,dy))
return es
def ExistsSandwichedCounter(self,x,y):
""" Does there exist at least one counter which would be flipped if my counter was placed at (x,y)?
"""
for (dx,dy) in self.AdjacentEnemyDirections(x,y):
if len(self.SandwichedCounters(x,y,dx,dy)) > 0:
return True
return False
def GetAllSandwichedCounters(self, x, y):
""" Is (x,y) a possible move (i.e. opponent counters are sandwiched between (x,y) and my counter in some direction)?
"""
sandwiched = []
for (dx,dy) in self.AdjacentEnemyDirections(x,y):
sandwiched.extend(self.SandwichedCounters(x,y,dx,dy))
return sandwiched
def SandwichedCounters(self, x, y, dx, dy):
""" Return the coordinates of all opponent counters sandwiched between (x,y) and my counter.
"""
x += dx
y += dy
sandwiched = []
while self.IsOnBoard(x,y) and self.board[x][y] == self.playerJustMoved:
sandwiched.append((x,y))
x += dx
y += dy
if self.IsOnBoard(x,y) and self.board[x][y] == 3 - self.playerJustMoved:
return sandwiched
else:
return [] # nothing sandwiched
def IsOnBoard(self, x, y):
return x >= 0 and x < self.size and y >= 0 and y < self.size
def GetResult(self, playerjm):
""" Get the game result from the viewpoint of playerjm.
"""
jmcount = len([(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == playerjm])
notjmcount = len([(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 3 - playerjm])
if jmcount > notjmcount: return 1.0
elif notjmcount > jmcount: return 0.0
else: return 0.5 # draw
def __repr__(self):
s= ""
for y in range(self.size-1,-1,-1):
for x in range(self.size):
s += ".XO"[self.board[x][y]]
s += "\n"
return s
class Node:
""" A node in the game tree. Note wins is always from the viewpoint of playerJustMoved.
Crashes if state not specified.
"""
def __init__(self, move = None, parent = None, state = None):
self.move = move # the move that got us to this node - "None" for the root node
self.parentNode = parent # "None" for the root node
self.childNodes = []
self.wins = 0
self.visits = 0
self.untriedMoves = state.GetMoves() # future child nodes
self.playerJustMoved = state.playerJustMoved # the only part of the state that the Node needs later
def UCTSelectChild(self):
""" Use the UCB1 formula to select a child node. Often a constant UCTK is applied so we have
lambda c: c.wins/c.visits + UCTK * sqrt(2*log(self.visits)/c.visits to vary the amount of
exploration versus exploitation.
"""
s = sorted(self.childNodes, key = lambda c: c.wins/c.visits + sqrt(2*log(self.visits)/c.visits))[-1]
return s
def AddChild(self, m, s):
""" Remove m from untriedMoves and add a new child node for this move.
Return the added child node
"""
n = Node(move = m, parent = self, state = s)
self.untriedMoves.remove(m)
self.childNodes.append(n)
return n
def Update(self, result):
""" Update this node - one additional visit and result additional wins. result must be from the viewpoint of playerJustmoved.
"""
self.visits += 1
self.wins += result
def __repr__(self):
return "[M:" + str(self.move) + " W/V:" + str(self.wins) + "/" + str(self.visits) + " U:" + str(self.untriedMoves) + "]"
def TreeToString(self, indent):
s = self.IndentString(indent) + str(self)
for c in self.childNodes:
s += c.TreeToString(indent+1)
return s
def IndentString(self,indent):
s = "\n"
for i in range (1,indent+1):
s += "| "
return s
def ChildrenToString(self):
s = ""
for c in self.childNodes:
s += str(c) + "\n"
return s
def UCT(rootstate, itermax, verbose = False):
""" Conduct a UCT search for itermax iterations starting from rootstate.
Return the best move from the rootstate.
Assumes 2 alternating players (player 1 starts), with game results in the range [0.0, 1.0]."""
rootnode = Node(state = rootstate)
for i in range(itermax):
node = rootnode
state = rootstate.Clone()
# Select
while node.untriedMoves == [] and node.childNodes != []: # node is fully expanded and non-terminal
node = node.UCTSelectChild()
state.DoMove(node.move)
# Expand
if node.untriedMoves != []: # if we can expand (i.e. state/node is non-terminal)
m = random.choice(node.untriedMoves)
state.DoMove(m)
node = node.AddChild(m,state) # add child and descend tree
# Rollout - this can often be made orders of magnitude quicker using a state.GetRandomMove() function
while state.GetMoves() != []: # while state is non-terminal
state.DoMove(random.choice(state.GetMoves()))
# Backpropagate
while node != None: # backpropagate from the expanded node and work back to the root node
node.Update(state.GetResult(node.playerJustMoved)) # state is terminal. Update node with result from POV of node.playerJustMoved
node = node.parentNode
# Output some information about the tree - can be omitted
if (verbose): print (rootnode.TreeToString(0))
else: print (rootnode.ChildrenToString())
return sorted(rootnode.childNodes, key = lambda c: c.visits)[-1].move # return the move that was most visited
def UCTPlayGame():
""" Play a sample game between two UCT players where each player gets a different number
of UCT iterations (= simulations = tree nodes).
"""
# state = OthelloState(4) # uncomment to play Othello on a square board of the given size
state = OXOState() # uncomment to play OXO
# state = NimState(15) # uncomment to play Nim with the given number of starting chips
while (state.GetMoves() != []):
print (str(state))
if state.playerJustMoved == 1:
m = UCT(rootstate = state, itermax = 1000, verbose = False) # play with values for itermax and verbose = True
else:
m = UCT(rootstate = state, itermax = 100, verbose = False)
print ("Best Move: " + str(m) + "\n")
state.DoMove(m)
if state.GetResult(state.playerJustMoved) == 1.0:
print ("Player " + str(state.playerJustMoved) + " wins!")
elif state.GetResult(state.playerJustMoved) == 0.0:
print ("Player " + str(3 - state.playerJustMoved) + " wins!")
else: print ("Nobody wins!")
if __name__ == "__main__":
""" Play a single game to the end using UCT for both players.
"""
UCTPlayGame()