
Commit edfdd1c

JaworMichal and dniku authored
Upgrade atari_wrapper to tf2 (#452)
* add actor-critic theory
* update tf functions for tf2 api
* bug + code style fix
* revert the import of tf in a method
* remove TFSummaries
* add tf2 summaries
* remove log_dir from TFSummaries
* bring back links to a2c algo description
* unify notation in formulas
* Add add_summary_scalar() stub to SummariesBase
* Replace default False value with a more idiomatic None

Co-authored-by: MichaelSolotky <>
Co-authored-by: Lionel Miller <[email protected]>
1 parent 5461e46 commit edfdd1c

2 files changed: +107 −23 lines changed

week06_policy_based/a2c-optional.ipynb

Lines changed: 69 additions & 5 deletions
@@ -144,10 +144,10 @@
 "To train the part of the model that predicts state values you will need to compute the value targets. \n",
 "Any callable could be passed to `EnvRunner` to be applied to each partial trajectory after it is collected. \n",
 "Thus, we can implement and use `ComputeValueTargets` callable. \n",
-"The formula for the value targets is simple:\n",
+"The formula for the value targets is simple, it's the right side of the following equation:\n",
 "\n",
 "$$\n",
-"\\hat v(s_t) = \\left( \\sum_{t'=0}^{T - 1 - t} \\gamma^{t'}r_{t+t'} \\right) + \\gamma^T \\hat{v}(s_{t+T}),\n",
+"V(s_t) = \\left( \\sum_{t'=0}^{T - 1 - t} \\gamma^{t'} \\cdot r (s_{t+t'}, a_{t + t'}) \\right) + \\gamma^T \\cdot V(s_{t+T}),\n",
 "$$\n",
 "\n",
 "In implementation, however, do not forget to use \n",
@@ -165,7 +165,7 @@
 "class ComputeValueTargets:\n",
 "    def __init__(self, policy, gamma=0.99):\n",
 "        self.policy = policy\n",
-"    \n",
+"\n",
 "    def __call__(self, trajectory):\n",
 "        # This method should modify trajectory inplace by adding\n",
 "        # an item with key 'value_targets' to it.\n",
@@ -214,7 +214,58 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now is the time to implement the advantage actor critic algorithm itself. You can look into your lecture,\n",
+"# Actor-critic objective\n",
+"\n",
+"Here we define a loss function that uses rollout above to train advantage actor-critic agent.\n",
+"\n",
+"\n",
+"Our loss consists of three components:\n",
+"\n",
+"* __The policy \"loss\"__\n",
+" $$ \\hat J = {1 \\over T} \\cdot \\sum_t { \\log \\pi(a_t | s_t) } \\cdot A_{const}(s,a) $$\n",
+" * This function has no meaning in and of itself, but it was built such that\n",
+" * $ \\nabla \\hat J = {1 \\over N} \\cdot \\sum_t { \\nabla \\log \\pi(a_t | s_t) } \\cdot A(s,a) \\approx \\nabla E_{s, a \\sim \\pi} R(s,a) $\n",
+" * Therefore if we __maximize__ J_hat with gradient descent we will maximize expected reward\n",
+" \n",
+" \n",
+"* __The value \"loss\"__\n",
+" $$ L_{td} = {1 \\over T} \\cdot \\sum_t { [r + \\gamma \\cdot V_{const}(s_{t+1}) - V(s_t)] ^ 2 }$$\n",
+" * Ye Olde TD_loss from q-learning and alike\n",
+" * If we minimize this loss, V(s) will converge to $V_\\pi(s) = E_{a \\sim \\pi(a | s)} R(s,a) $\n",
+"\n",
+"\n",
+"* __Entropy Regularizer__\n",
+" $$ H = - {1 \\over T} \\sum_t \\sum_a {\\pi(a|s_t) \\cdot \\log \\pi (a|s_t)}$$\n",
+" * If we __maximize__ entropy we discourage agent from predicting zero probability to actions\n",
+" prematurely (a.k.a. exploration)\n",
+" \n",
+" \n",
+"So we optimize a linear combination of $L_{td}$ $- \\hat J$, $-H$\n",
+" \n",
+"```\n",
+"\n",
+"```\n",
+"\n",
+"```\n",
+"\n",
+"```\n",
+"\n",
+"```\n",
+"\n",
+"```\n",
+"\n",
+"\n",
+"__One more thing:__ since we train on T-step rollouts, we can use N-step formula for advantage for free:\n",
+" * At the last step, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot V(s_{t+1}) - V(s) $\n",
+" * One step earlier, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot r(s_{t+1}, a_{t+1}) + \\gamma ^ 2 \\cdot V(s_{t+2}) - V(s) $\n",
+" * Et cetera, et cetera. This way agent starts training much faster since it's estimate of A(s,a) depends less on his (imperfect) value function and more on actual rewards. There's also a [nice generalization](https://arxiv.org/abs/1506.02438) of this."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"You can also look into your lecture,\n",
 "[Mnih et al. 2016](https://arxiv.org/abs/1602.01783) paper, and [lecture](https://www.youtube.com/watch?v=Tol_jw5hWnI&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=20) by Sergey Levine."
 ]
 },
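As a rough TF2-style sketch, the combined objective described in the new cell can be written as below. The `policy(observations)` call returning a dict with `'logits'` and `'values'`, and the default loss coefficients, are assumptions made for illustration rather than the notebook's fixed API.

```python
import tensorflow as tf


def a2c_loss(policy, observations, actions, value_targets,
             value_loss_coef=0.25, entropy_coef=0.01):
    out = policy(observations)  # assumed to return {'logits': ..., 'values': ...}
    logits, values = out['logits'], out['values']  # values assumed shape (batch,)

    log_probs_all = tf.nn.log_softmax(logits)
    chosen = tf.one_hot(actions, logits.shape[-1])
    log_probs = tf.reduce_sum(log_probs_all * chosen, axis=-1)

    # Advantages act as constants for the policy gradient, hence stop_gradient.
    advantages = tf.stop_gradient(value_targets - values)
    policy_loss = -tf.reduce_mean(log_probs * advantages)  # equals -J_hat

    # TD-style regression of V(s_t) onto the precomputed value targets.
    value_loss = tf.reduce_mean(tf.square(value_targets - values))

    # Entropy bonus discourages collapsing onto a single action too early.
    probs = tf.nn.softmax(logits)
    entropy = -tf.reduce_mean(tf.reduce_sum(probs * log_probs_all, axis=-1))

    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy
```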
@@ -288,9 +339,22 @@
 }
 ],
 "metadata": {
+"kernelspec": {
+"display_name": "Python 3",
+"language": "python",
+"name": "python3"
+},
 "language_info": {
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
 "name": "python",
-"pygments_lexer": "ipython3"
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.7.3"
 }
 },
 "nbformat": 4,

week06_policy_based/atari_wrappers.py

Lines changed: 38 additions & 18 deletions
@@ -213,12 +213,16 @@ def __init__(self, env, prefix=None, running_mean_size=100):
         self.episode_counter = 0
         self.prefix = prefix or self.env.spec.id

-        nenvs = getattr(self.env.unwrapped, "nenvs", 1)
-        self.rewards = np.zeros(nenvs)
-        self.had_ended_episodes = np.zeros(nenvs, dtype=np.bool)
-        self.episode_lengths = np.zeros(nenvs)
+        self.nenvs = getattr(self.env.unwrapped, "nenvs", 1)
+        self.rewards = np.zeros(self.nenvs)
+        self.had_ended_episodes = np.zeros(self.nenvs, dtype=np.bool)
+        self.episode_lengths = np.zeros(self.nenvs)
         self.reward_queues = [deque([], maxlen=running_mean_size)
-                              for _ in range(nenvs)]
+                              for _ in range(self.nenvs)]
+        self.global_step = 0
+
+    def add_summary_scalar(self, name, value):
+        raise NotImplementedError

     def should_write_summaries(self):
         """ Returns true if it's time to write summaries. """
@@ -260,6 +264,8 @@ def step(self, action):
                 self.reward_queues[i].append(self.rewards[i])
                 self.rewards[i] = 0

+        self.global_step += self.nenvs
+
         if self.should_write_summaries():
             self.add_summaries()
         return obs, rew, done, info
@@ -272,19 +278,22 @@ def reset(self, **kwargs):


 class TFSummaries(SummariesBase):
-    """ Writes env summaries using TensorFlow."""
+    """ Writes env summaries using TensorFlow.
+    In order to write summaries in a specific directory,
+    you may define a writer and set it as default just before
+    training loop as in an example here
+    https://www.tensorflow.org/api_docs/python/tf/summary
+    Other summaries could be added in A2C class or elsewhere
+    """

-    def __init__(self, env, prefix=None, running_mean_size=100, step_var=None):
+    def __init__(self, env, prefix=None,
+                 running_mean_size=100, step_var=None):

         super().__init__(env, prefix, running_mean_size)

-        import tensorflow as tf
-        self.step_var = (step_var if step_var is not None
-                         else tf.train.get_global_step())
-
     def add_summary_scalar(self, name, value):
         import tensorflow as tf
-        tf.contrib.summary.scalar(name, value, step = self.step_var)
+        tf.summary.scalar(name, value, self.global_step)


 class NumpySummaries(SummariesBase):
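The rewritten docstring points at the TF2 summary workflow: rather than passing a `step_var`, you create a file writer and set it as the default before the training loop; `add_summary_scalar` then records through `tf.summary.scalar` at the wrapper's `global_step`. A minimal sketch (the log directory name is arbitrary):

```python
import tensorflow as tf

# Point TensorBoard at this directory with `tensorboard --logdir logs`.
writer = tf.summary.create_file_writer("logs/a2c_run")
writer.set_as_default()

# From here on, every TFSummaries.add_summary_scalar(name, value) call
# forwards to tf.summary.scalar(name, value, step), which writes to the
# default writer configured above.
```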
@@ -304,7 +313,7 @@ def get_values(cls, name):
     def clear(cls):
         cls._summaries = defaultdict(list)

-    def __init__(self, env, prefix = None, running_mean_size = 100):
+    def __init__(self, env, prefix=None, running_mean_size=100):
         super().__init__(env, prefix, running_mean_size)

     def add_summary_scalar(self, name, value):
@@ -316,6 +325,7 @@ def nature_dqn_env(env_id, nenvs=None, seed=None,
     """ Wraps env as in Nature DQN paper. """
     if "NoFrameskip" not in env_id:
         raise ValueError(f"env_id must have 'NoFrameskip' but is {env_id}")
+
     if nenvs is not None:
         if seed is None:
             seed = list(range(nenvs))
@@ -327,20 +337,30 @@ def nature_dqn_env(env_id, nenvs=None, seed=None,

         env = ParallelEnvBatch([
             lambda i=i, env_seed=env_seed: nature_dqn_env(
-                env_id, seed=env_seed, summaries=False, clip_reward=False)
+                env_id, seed=env_seed, summaries=None, clip_reward=False)
             for i, env_seed in enumerate(seed)
         ])
-        if summaries:
-            summaries_class = NumpySummaries if summaries == 'Numpy' else TFSummaries
-            env = summaries_class(env, prefix=env_id)
+        if summaries is not None:
+            if summaries == 'Numpy':
+                env = NumpySummaries(env, prefix=env_id)
+            elif summaries == 'TensorFlow':
+                env = TFSummaries(env, prefix=env_id)
+            else:
+                raise ValueError(
+                    f"Unknown `summaries` value: expected either 'Numpy' or 'TensorFlow', got {summaries}")
         if clip_reward:
             env = ClipReward(env)
         return env

     env = gym.make(env_id)
     env.seed(seed)
-    if summaries:
+    if summaries == 'Numpy':
+        env = NumpySummaries(env)
+    elif summaries == 'TensorFlow':
         env = TFSummaries(env)
+    elif summaries:
+        raise ValueError(f"summaries must be either Numpy, "
+                         f"or TensorFlow, or a falsy value, but is {summaries}")
     env = EpisodicLife(env)
     if "FIRE" in env.unwrapped.get_action_meanings():
         env = FireReset(env)
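With the stricter `summaries` handling above, callers now select the backend by name, pass `None` to disable summaries, and get a `ValueError` for anything else. A usage sketch (the env id and `nenvs` value are just examples):

```python
# Batch of parallel envs logging to TensorBoard via TFSummaries:
env = nature_dqn_env("BreakoutNoFrameskip-v4", nenvs=8, summaries='TensorFlow')

# Single env keeping summaries in memory via NumpySummaries:
env = nature_dqn_env("BreakoutNoFrameskip-v4", summaries='Numpy')

# Single env with summaries disabled:
env = nature_dqn_env("BreakoutNoFrameskip-v4", summaries=None)
```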
