DeepMind: From Generative Models to Generative Agents - Transcript
00:00
Good morning. Hi, my name is Amelie, and I'm going to be the session chair for the morning, so it's my great pleasure to introduce Koray Kavukcuoglu, who is going to give an invited talk. Koray is a Director of Research at DeepMind and one of the star researchers in our community. He has contributed to many highly influential projects at DeepMind, such as spatial transformer networks, autoregressive generative models such as pixel recurrent networks and WaveNet, and deep reinforcement learning for playing Atari games and AlphaGo. Today he will talk about going from generative models to generative agents, so let's welcome Koray.
00:47
[Applause] [Music]
00:53
Thank you very much for the very nice introduction, and thanks everyone for being here; it's an absolute pleasure. As was just mentioned, I'm going to try to talk about unsupervised learning in general, starting from generative models in maybe the classical way, and then give another view that I think is quite interesting, which we have been working on recently. When I think about the important things for us to do as a community, I think everyone here sort of agrees that in the end what is important is to be doing unsupervised learning. We realize that supervised learning has had all sorts of successes, but in the end unsupervised learning is kind of the next frontier. When I think about unsupervised learning, there are different explanations that come to my mind, and when talking to people I think we all have slightly different opinions on this.
01:54
One of the things that I think is a common explanation is: we have an unsupervised learning algorithm, we run it on our data, and what we expect is for the algorithm to understand our data, to explain our data or our environment. What we expect from this is that the algorithm is going to learn the intrinsic properties of our data, of our environment, and then be able to explain them through those properties. But most of the time what happens, because of the kinds of models that we use, is that we resort in the end to looking at samples, and when we look at the samples we try to see whether our model really understood the environment; if it understood the environment, then the samples should be meaningful. Of course we look at all sorts of objective measures that we use during training, like Inception scores, log-likelihoods and such, but in the end we always resort to the samples to understand whether our model really can explain what's going on in the environment.
02:51
The other kind of general explanation that we all use is that the goal of unsupervised learning is to learn rich representations. It's already embedded in the name of this conference: the main goal of deep learning, of unsupervised learning, is learning those representations. But then when we think about those representations, this explanation again doesn't give us an objective measure. What we think about is how we are going to judge those representations in terms of being great and useful, and to me the most important bit is that if we have good and rich representations, then they are useful for generalization, for transfer. If you have a good unsupervised learning model and it can give us good representations, then we can get generalization. So what I'm going to do today is also tie this together with something else that I think is very important. As I've mentioned, a big chunk of the work that we have been doing at DeepMind, that I've been doing, is about agents and reinforcement learning, and in this talk I'm going to take a look at unsupervised learning both in the classical sense of learning a generative model and also as learning an agent that can do unsupervised learning.
04:03
So I'm going to start from the WaveNet model. Hopefully, as many of you know, it is a generative model of audio; it's a pure deep learning model, and with it you can model any audio signal, like speech and music, and get really realistic samples out of it. The next thing I'm going to do is explain this other, new approach to unsupervised learning that I find really interesting, which is based on deep reinforcement learning: learning an agent that does unsupervised learning. This model, called SPIRAL, is based on a new agent architecture that we have been working on and published recently, called IMPALA. It's a very large, highly scalable, efficient off-policy learning agent architecture, which we use in SPIRAL to do unsupervised learning. The interesting bit about the SPIRAL work is that it gets generalization through using a sort of tool space, tools that we as people have created so that we can solve not one specific problem but many different problems. Using these tools, using the interface of a tool, and having an agent, you can actually now learn a generative model of your environment.
05:19
All right, so without more delay, the first thing I'm going to introduce, quickly, is the WaveNet model. WaveNet is a generative model of audio. As I said, it models the raw audio signal; it doesn't use any sort of interface to model the audio. Audio in general is very high dimensional: the standard audio signal we started with at the beginning was 16,000 samples per second. If you compare that with our usual language modeling and machine translation kinds of tasks, it is several orders of magnitude more data, so the dependencies that one needs to capture to model good audio signals are very long. What this model does is model one sample at a time, using a softmax distribution over each sample, dependent on all the previous samples of the signal. When you look at it more closely, it is an architecture that has quite a bit of resemblance to the PixelCNN model, which maybe some of you are also familiar with. In the end it is a stack of multiple convolutional layers; to be a little more specific, it has these residual blocks, you use multiples of those residual blocks, and in each residual block there are dilated convolutional layers stacked on top of each other. Through those dilated convolutional layers, which are causal convolutions, we can model very long dependencies, so through that we can model the dependencies in time.
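To make the dilation idea concrete, here is a minimal sketch, in plain NumPy, of a stack of dilated causal 1-D convolutions. The layer widths, the dilation schedule, and the simple ReLU nonlinearity are illustrative assumptions only; the real WaveNet also uses gated activations, residual and skip connections, and a softmax output over quantized samples.

import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """x: (T, C_in), w: (kernel=2, C_in, C_out). Causal: the output at time t
    only sees inputs at t and t - dilation (the past is left-padded with zeros)."""
    T = x.shape[0]
    pad = np.zeros((dilation, x.shape[1]))
    x_prev = np.vstack([pad, x])[:T]            # x shifted right by `dilation`
    return x_prev @ w[0] + x @ w[1]             # (T, C_out)

def wavenet_like_stack(x, weights, dilations=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Toy stack: each layer doubles the dilation, so the receptive field grows
    exponentially with depth (here 1 + sum(dilations) = 256 past samples)."""
    h = x
    for w, d in zip(weights, dilations):
        h = np.maximum(0.0, causal_dilated_conv1d(h, w, d))   # ReLU stand-in
    return h

# Tiny usage example on a random "audio feature" sequence of 1000 timesteps.
rng = np.random.default_rng(0)
channels = 16
weights = [rng.normal(scale=0.1, size=(2, channels, channels)) for _ in range(8)]
audio_features = rng.normal(size=(1000, channels))
out = wavenet_like_stack(audio_features, weights)
print(out.shape)   # (1000, 16); every output depends only on past inputs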
07:06
Now, one of the biggest design considerations about WaveNet is that it is designed to be very efficient during training, because during training all the targets are known: you process the whole signal at once, just run it like a convolutional net, you get your predictions, and because you have the targets you get your error signal and propagate it back. So training is very efficient. But when it comes to sampling time, in the end this is an autoregressive model, and through those causal convolutions you need to run one sample at a time. If you are sampling at, say, 24 kilohertz, 24,000 samples per second, you need to generate one sample at a time, just like you see in this animation, and of course this is painful. But in the end it works quite well, and we can generate very high quality audio with this.
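As a rough illustration of why training is parallel but sampling is sequential, here is a schematic sketch. The `DummyModel` stand-in, the 256-way categorical output (as in a mu-law-quantized signal), and the function names are all assumptions for illustration, not the actual WaveNet interface.

import numpy as np

class DummyModel:
    """Stand-in network: maps a length-T sequence of quantized samples to
    (T, 256) logits; here each output trivially depends only on its own input."""
    def __init__(self, dim=256, seed=1):
        self.W = np.random.default_rng(seed).normal(scale=0.01, size=(dim, dim))
    def __call__(self, samples):
        return np.eye(256)[np.asarray(samples, dtype=int)] @ self.W

def train_step(model, waveform):
    """Teacher forcing: the net sees the true past everywhere, so all T
    predictions and their losses are computed in one parallel pass."""
    logits = model(waveform[:-1])                               # (T-1, 256) at once
    targets = waveform[1:]                                      # next-sample targets
    log_probs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    return -log_probs[np.arange(len(targets)), targets].mean()

def generate(model, n_samples, seed=(128,)):
    """Ancestral sampling: each new sample needs a fresh forward pass that
    conditions on everything generated so far, hence one step at a time."""
    samples, rng = list(seed), np.random.default_rng(0)
    for _ in range(n_samples):
        logits = model(np.array(samples))[-1]                   # newest prediction only
        p = np.exp(logits - logits.max()); p /= p.sum()
        samples.append(rng.choice(256, p=p))
    return np.array(samples)

wave = np.random.default_rng(2).integers(0, 256, size=400)
m = DummyModel()
print(train_step(m, wave), generate(m, 5))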
07:58
So what I want to do now is make you listen to unconditional samples from this model. We model the speech signal without any conditioning on text or anything: just take the audio signal, model it with WaveNet, and then sample. This is the kind of thing you get.
08:30
As you can hear, hopefully, the quality is very high, and this is modeling really the raw audio signal, completely unconditionally. Sometimes you even hear short words, like "okay" or "from", and if you listen, the intonation and everything sounds quite natural; sometimes it feels like you are listening to someone speaking in a language that you don't know. So the main characteristics of the signal are all captured there. In terms of dependencies, we are looking at something like several thousand samples of dependencies being properly and correctly modelled.
09:12
And then, of course, what you can do is augment this model by conditioning on a text signal that is associated with the audio that you want to generate. By conditioning on the text signal you now have a conditional generative model that actually solves a real-world problem just by itself, end to end, with deep learning. From the text you create the linguistic embeddings, using those linguistic embeddings you can generate the signal, and then it starts, it's now talking. So it's a solution to the whole text-to-speech synthesis problem which, as you know, is very commonly used in the real world.
10:03
All right, so when we did the WaveNet model, and this was around almost two years ago now, we looked at the quality when we used it as a TTS model. In green what you see is the quality of human speech, which we can measure through these mean opinion scores; in blue you see WaveNet, and the other colors are the other models that were the best models around at the time. You can see that WaveNet closed the gap between human-quality speech and the other models by a big margin. At the time this really got us excited, because now we actually had a deep learning model that comes with all the flexibilities and advantages of doing deep learning, and at the same time it's modeling raw audio and it is very high quality.
10:50
I could play text-to-speech samples generated by this model, but actually, as I'm going to go into next, if you are using Google Assistant right now you are already hearing WaveNet there, because this is already in production. So for anyone who's using Google Assistant, querying Wikipedia and things like that, the speech that is generated there is actually coming from the WaveNet model. What I want to do is explain how we did that, and that brings me to our next project in the WaveNet domain: the Parallel WaveNet project.
11:24
So of course, when you have a research project and at some point you realize that it actually lends itself to the solution of a real-world problem, and you want to put it into production in a very challenging environment, then it requires much more than our little research group. This was a big collaboration between the DeepMind research and applied teams and the Google speech team. In this slide, what I show is basically the basic ingredients of how we turned the WaveNet architecture into a feed-forward and parallel architecture. What we realized pretty soon, when we attempted putting a system like this into production, was that speed is of course very important and quality is very important, but the constraint on speed is not just being able to run in real time: the kinds of constraints we face are orders of magnitude faster than real time, even being able to run in essentially constant time. And when the constraint becomes being able to run in constant time, the only thing you can do is create a feed-forward network and parallelize the signal generation. So that is what we did.
12:43
In this slide, at the top what you see is the usual WaveNet model; we call it the teacher. In this setting, the WaveNet model is pre-trained, it is fixed, and it is used as a scoring function. At the bottom what you see is the generator, which we call the student, and this student model is again an architecture that is very close to WaveNet, but it is run as a feed-forward convolutional network. The way it is run and trained has two components: one component is coming from WaveNet, which we know is very efficient in training, as I said, but slow in sampling; the other is based on the inverse autoregressive flow work that was done by Kingma and colleagues at OpenAI last year. This structure gives us the capability to take an input noise signal and slowly transform that noise into a proper distribution that is going to be the speech signal. The way we train this is: random noise goes in, together with the linguistic features, through layers and layers of these flows; that random noise gets transformed into a speech signal; that speech signal goes into WaveNet. WaveNet is already the best kind of scoring function we can use, because it's a density model, so WaveNet scores it, from that score we get the gradients back into the generator, and then we update the generator. We call this process probability density distillation.
14:18 | |
things are very challenging like speed | |
14:19 | |
signals that is by itself not enough so | |
14:21 | |
I have highlighted two components here | |
14:23 | |
one of them as I said is the magnet | |
14:25 | |
scoring function the other thing that we | |
14:27 | |
use is a power loss because what happens | |
14:30 | |
is when we train the model in this | |
14:32 | |
manner the signal tends to be very low | |
14:35 | |
energy sort of like whispering someone | |
14:38 | |
speaks but they are like whispering so | |
14:39 | |
during training we sort of edit this | |
14:41 | |
extra loss that tries to conserve the | |
14:43 | |
energy of the generated speech and with | |
14:47 | |
these two the the wavenet scoring and | |
14:49 | |
the power loss we were already getting | |
14:51 | |
very high called speed signal but of | |
14:54 | |
course like the constraints are very | |
14:55 | |
very tough and what we did was we | |
14:58 | |
trained another wave net model so we | |
15:00 | |
sort of used wavenet everywhere right | |
15:01 | |
that we are generating through a leg net | |
15:03 | |
through convolution we are using very | |
15:04 | |
net as a scoring function we again | |
15:07 | |
trained another very net model this time | |
15:08 | |
we used it as a speech recognition | |
15:10 | |
system and that is the perceptual loss | |
15:12 | |
that you see there so we train the wave | |
15:14 | |
net again as a speech recognition system | |
15:16 | |
what we do is during training of course | |
15:18 | |
you have the text and the corresponding | |
15:21 | |
speech signal we generate the we | |
15:25 | |
generate the corresponding speech | |
15:27 | |
through our generator we get the text | |
15:29 | |
give that the speech recognition system | |
15:30 | |
the speech recognition system of course | |
15:32 | |
not needs to decode we generated signal | |
15:35 | |
into those into that text right and we | |
15:37 | |
get the error from there propagate back | |
15:39 | |
into our generator so that's another | |
15:41 | |
sort of quality improvement that we get | |
15:42 | |
by using speech recognition as a | |
15:45 | |
perceptual loss in our generation system | |
15:47 | |
and the last thing that we did was using | |
15:51 | |
a contrasting term that basically uses | |
15:53 | |
okay we generate a signal conditioned on | |
15:55 | |
some text you can you can create a | |
15:58 | |
contrast applause we're saying that the | |
16:01 | |
signal that is generated with the | |
16:02 | |
corresponding text is it should be | |
16:05 | |
different than the same signal if it if | |
16:07 | |
it was conditioned on a separate text | |
16:09 | |
right | |
16:10 | |
there's a contrasting luster so more | |
16:12 | |
specifically what we have is in the end | |
16:14 | |
we end up with these four terms at the | |
16:18 | |
top we see that the the original sort of | |
16:22 | |
using vena there's a scoring function | |
16:24 | |
the problem with advances the | |
16:25 | |
distillation idea then we have the power | |
16:28 | |
loss that that uses Fourier transforms | |
16:31 | |
eternal to to conserve the energy and | |
16:34 | |
the contrastive term and find out the | |
16:36 | |
perceptual was that does the that does | |
16:40 | |
the speech of cognition and when we all | |
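Here is a heavily simplified sketch of how those four terms might be combined into one training objective for the student. The function names, frame size, and relative weights are illustrative assumptions, not the actual Parallel WaveNet implementation, and each sub-loss is only stubbed at the level described in the talk.

import numpy as np

def power_loss(generated, reference, frame=256):
    """Match average spectral energy so the student does not 'whisper'."""
    def avg_power(x):
        frames = x[: len(x) // frame * frame].reshape(-1, frame)
        return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return float(np.mean((avg_power(generated) - avg_power(reference)) ** 2))

def distillation_loss(student_logp, teacher_logp):
    """Probability-density-distillation term: evaluated on student samples, the
    student should put its mass where the fixed teacher WaveNet assigns high
    density (a KL-style surrogate, E[log q - log p])."""
    return float(np.mean(student_logp - teacher_logp))

def parallel_wavenet_loss(generated, reference, student_logp, teacher_logp,
                          perceptual, contrastive, w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms from the talk; `perceptual` (ASR error on
    the generated audio) and `contrastive` are assumed to be computed by
    separate networks and passed in here as scalars."""
    terms = (distillation_loss(student_logp, teacher_logp),
             power_loss(generated, reference),
             perceptual, contrastive)
    return sum(wi * ti for wi, ti in zip(w, terms))

rng = np.random.default_rng(0)
gen, ref = rng.normal(size=4096), rng.normal(size=4096)
print(parallel_wavenet_loss(gen, ref, rng.normal(size=100), rng.normal(size=100),
                            perceptual=0.3, contrastive=0.1))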
16:42
When we put all of these together, what we did, of course, was look at the quality. What I'm showing here is the quality with respect to, again, the best non-WaveNet model. This is roughly a year after the original research, pretty much exactly a year, and during that time the best speech synthesis models also improved, but WaveNet was still better than anything else, and the new one, the Parallel WaveNet, exactly matches the quality of the original WaveNet. What I'm showing here is three different US English voices and also Japanese, and this is the kind of thing that we always want from deep learning: the ability to generalize to new datasets, to new domains. We developed this model on practically one single US English voice, and it was just a matter of collecting or getting another dataset from another speaker or another language, like a speaker speaking Japanese; you just get that, run it, and there you go, you have a production-quality speech synthesis system just by doing that. This is the kind of thing that we really like from deep learning, and if you are thinking about deep learning and about unsupervised learning, I think this is a very good demonstration of that.
17:59
Before switching to the next topic, I also want to mention that we have done some further work on this, called WaveRNN, which was recently published, and I encourage you to look into that one too; it's a very interesting piece of work, also for generating speech at very high speed.
18:18
The next thing I want to talk about is the IMPALA architecture, the new agent architecture I mentioned. As I said, WaveNet is an unsupervised model in the classical sense that can actually solve a real-world problem; the next thing I want to start talking about is this new, different way of doing unsupervised learning, but for that the other exciting bit is being able to do deep reinforcement learning at scale.
18:54
All right, so I want to motivate why we want to push our deep reinforcement learning models further and further. Most of the time what we do, because this is a new area, is take very simple tasks in some simple environments, and try to train an agent that solves a single task in that environment. What we want to do is go further than that, again going back to the point of generalization and being able to solve multiple tasks. We have created a new task set; this is an open-source task set: we have an open-source environment called DeepMind Lab, and as part of that we have created this new task set, DMLab-30. It is 30 environments covering tasks around language, memory, navigation and those kinds of things, and the goal is not to solve each one of them individually; the goal is to have one single agent, one single network, solving all those tasks at the same time. There is nothing custom in that agent that is specific to any single one of these environments. When you look at those environments, and I'm showing some of them here, the agent has a first-person view: it is in a maze-like environment, the agent has a first-person camera input, and it can navigate around, go forwards and backwards, rotate, look up and down, jump and those kinds of things, and it is solving all different kinds of tasks that are catered to test different kinds of abilities. But the goal, as I said, is to solve all of them at the same time.
20:24
One thing that becomes really important in this case is of course the stability of our algorithms, because now we are not solving one single task, we are solving 30 of them, and we want really stable models because we don't have the chance to tune hyperparameters on one single task anymore. And of course what also becomes really important is task interference: hopefully what we expect, again by using deep learning, is that in this multi-task setting we see positive transfer rather than task interference, and we hope to demonstrate this in this challenging reinforcement learning domain too.
21:03
OK, I sort of realized that I needed to put in a slide about why deep reinforcement learning, because, a little bit to my surprise, there was actually not much reinforcement learning at this conference this year, and I wanted to touch a little on why I think it is important for the deep learning community, for this community, to actually do deep reinforcement learning. To me, if one of the goals that we work towards here is AI, then it is at the core of all of it. Reinforcement learning is a very general framework for doing sequential decision-making, for learning sequential decision-making tasks, and deep learning, on the other hand, is of course the best set of models and algorithms we have for learning representations. The combination of these two is the best answer we have so far for learning very good state representations for very challenging tasks, not just for solving toy domains but actually for solving challenging real-world problems. Of course there are many open problems there; some that are interesting, at least for me, are the idea of separating the computational power of a model from the number of weights or the number of layers it has, and, going back to unsupervised learning again, learning to transfer. So we should build these deep reinforcement learning models with the aim to actually generalize, to transfer.
22:39
OK, so the IMPALA agent is based on another piece of work that we did a couple of years ago, called the asynchronous advantage actor-critic, the A3C model. In the end it's a policy gradient method; I've tried to explain it somewhat cartoonishly in the figure. At every time step the agent sees the environment, and at that time step the agent outputs a policy distribution and also a value function. The value function is the agent's expectation of the total amount of reward it is going to get until the end of the episode, being in that state, and the policy is the distribution over the actions that the agent has. At every time step the agent looks at the environment, updates its policy so that it can act in the environment, and updates its value function. The way you train this is with the policy gradient. Intuitively, it's actually very simple: the gradient of the policy is scaled by the difference between the total reward that the agent actually gets in the environment and the baseline, and the baseline is the value function. What that means is that if the agent ends up doing better than what the value function, its own assumption, predicted, then it's a good thing, you have a positive gradient, and you are going to reinforce your understanding of the environment; if the agent does worse than expected, so the value was higher than the total reward you got, then you have a negative gradient and you need to shuffle things around. And the way you learn the value function is with the usual n-step TD error.
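As a rough sketch of that update, here is a minimal advantage actor-critic loss over one n-step trajectory segment, written in plain NumPy. The discount, the segment layout, and treating the final bootstrap value as given are illustrative assumptions rather than the exact A3C implementation (which also adds an entropy bonus and runs asynchronously).

import numpy as np

def a2c_targets(rewards, values, bootstrap_value, gamma=0.99):
    """n-step returns R_t = r_t + gamma * R_{t+1}, bootstrapped from the value
    estimate of the state that follows the segment."""
    returns = np.zeros_like(rewards, dtype=float)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values          # (R_t - V(s_t)): return minus baseline
    return returns, advantages

def a2c_loss(log_pi_taken, values, rewards, bootstrap_value, gamma=0.99):
    """Policy term: -log pi(a_t|s_t) * advantage (advantage held fixed);
    value term: squared n-step TD error."""
    returns, adv = a2c_targets(rewards, values, bootstrap_value, gamma)
    policy_loss = -(log_pi_taken * adv).mean()
    value_loss = 0.5 * ((returns - values) ** 2).mean()
    return policy_loss + 0.5 * value_loss

# Toy usage on a 5-step segment.
rng = np.random.default_rng(0)
print(a2c_loss(log_pi_taken=rng.normal(size=5), values=rng.normal(size=5),
               rewards=np.array([0., 0., 1., 0., 1.]), bootstrap_value=0.2))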
24:17
Now, the A3C algorithm: that was the actor-critic part. The asynchronous part is that the A3C algorithm is composed of multiple actors, and each actor independently operates in the environment, collecting observations, acting in the environment, and computing the policy gradients with respect to the parameters of its network. Then it sends those gradients back to the parameter server; the parameter server collects these gradients from all the different actors, combines them, and then shares the updated parameters with all the actors. Now what happens in this case, as you increase the number of actors, is the usual asynchronous stochastic gradient descent setup: as the number of actors increases, the staleness of the gradients becomes a problem. So in the end, distributing the experience collection is actually very advantageous, it's very good, but communicating gradients might become a bottleneck as you try to really scale things up.
25:21
So for that, what we tried was a different architecture. The idea of a centralized server is actually quite useful, but rather than using it just to accumulate the parameter updates, the idea is to make the centralized component into a learner, so the whole learning algorithm is contained in it. What the actors do is only act in the environment, not compute gradients or anything; they send the observations back to the learner, and the learner sends the parameters back. In this way, you are completely decoupling your experience collection in your environments from your learning algorithm, and you are gaining a lot of robustness to noise in your environments: sometimes rendering times vary, some environments are slow, some environments are fast, and all of that is completely decoupled from your learning algorithm. But of course, what you need then is a good learning algorithm that can deal with that kind of variation.
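To make the decoupling concrete, here is a toy, single-process sketch of the IMPALA-style data flow: actors only act under a possibly stale parameter snapshot and enqueue trajectories, while one learner consumes them centrally. The linear "policy", the queue, and the reward-weighted update at the end are all stand-ins for illustration; a real learner would apply V-trace-corrected policy gradients.

import numpy as np
from collections import deque

rng = np.random.default_rng(0)
OBS_DIM, ACTION_DIM = 8, 4

def act_in_env(params, steps=20):
    """Stand-in for an actor rollout: observations, sampled actions, rewards,
    and the log-probs under the actor's (possibly stale) parameter snapshot."""
    obs = rng.normal(size=(steps, OBS_DIM))
    logits = obs @ params
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    actions = np.array([rng.choice(ACTION_DIM, p=p) for p in probs])
    return dict(obs=obs, actions=actions, rewards=rng.normal(size=steps),
                behaviour_logp=np.log(probs[np.arange(steps), actions]))

learner_params = rng.normal(scale=0.1, size=(OBS_DIM, ACTION_DIM))
queue = deque()

for _ in range(4):   # four "actors": no backward pass, they only send trajectories
    stale = learner_params + rng.normal(scale=0.05, size=learner_params.shape)
    queue.append(act_in_env(stale))

while queue:         # the single learner does all the gradient computation
    batch = queue.popleft()
    onehot = np.eye(ACTION_DIM)[batch["actions"]]
    # Stand-in reward-weighted update; behaviour_logp would feed the
    # importance ratios of an off-policy correction such as V-trace.
    learner_params += 1e-3 * batch["obs"].T @ (onehot * batch["rewards"][:, None])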
26:27
So, in the end, in IMPALA what we have is a very efficient, decoupled backward pass, if you will. Actors generate trajectories, as I said, but that decoupling creates this off-policyness: the policy in the actors, the behaviour policy if you will, is separate from the policy in the learner, the target policy. So what we need is off-policy learning. Of course there are many off-policy learning algorithms, but we really wanted a policy gradient method, and for that we developed this new method called V-trace; it's an off-policy advantage actor-critic algorithm. The advantage of V-trace is that it uses truncated importance sampling ratios to come up with an estimate for the value: because there is this imbalance between the learner and the actors, you need to correct for that difference. The good thing about this algorithm is that it transitions smoothly between the on-policy case and the off-policy case. When the actors and the learner are completely in sync, so you are in the on-policy case, the algorithm boils down to the usual A3C update with the n-step Bellman equation; if they become more separate, then the correction of the algorithm kicks in and you have the corrected estimate. The algorithm has two main components, two truncation factors, to control two different aspects of off-policy learning. One of them is rho-bar, which controls which value function the algorithm is going to converge towards: the value function that corresponds to the behaviour policy, or the value function that corresponds to the target policy in the learner. The other one, the c factor, controls the speed of convergence: by controlling the truncation it can increase or decrease the variance in learning, and it can have an effect on the speed of convergence.
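For reference, here is a small NumPy sketch of the V-trace targets as I understand them from the IMPALA paper, written in the recursive form. The bootstrap handling at the end of the segment and the default truncation levels (rho-bar = c-bar = 1) follow a common choice, but this is an illustrative re-implementation, not DeepMind's code.

import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_s and policy-gradient advantages for one
    trajectory segment collected under the (stale) behaviour policy."""
    ratios = np.exp(target_logp - behaviour_logp)
    rhos = np.minimum(rho_bar, ratios)           # truncated IS ratio for the TD term
    cs = np.minimum(c_bar, ratios)               # truncated IS ratio for the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)   # rho_t * TD error

    vs = np.zeros_like(values)
    acc = 0.0                                    # carries v_{s+1} - V(x_{s+1})
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc    # recursive V-trace correction
        vs[t] = values[t] + acc

    vs_next = np.append(vs[1:], bootstrap_value)
    pg_advantages = rhos * (rewards + gamma * vs_next - values)
    return vs, pg_advantages

# Toy usage: 6-step segment where behaviour and target policies differ slightly.
rng = np.random.default_rng(0)
vs, adv = vtrace_targets(behaviour_logp=rng.normal(size=6) - 1.0,
                         target_logp=rng.normal(size=6) - 1.0,
                         rewards=rng.normal(size=6), values=rng.normal(size=6),
                         bootstrap_value=0.0)
print(vs, adv)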
28:24
Now, when we tested this, of course the goal is to test on all the environments at once, but what we wanted to do first was look at the single-task case, so we looked at five different environments, and we see that in these environments the IMPALA algorithm is always very stable and performs at the top. The comparisons here are the IMPALA algorithm, the batched A3C method, the batched A2C method, and then different versions of A3C algorithms, and you can see that IMPALA and batched A2C are always performing at the top; IMPALA is doing fine there, the dark blue curve, and this gives us the feeling that, OK, we have a nice algorithm.
29:08
Now, of course, the other thing that is very important, and that is discussed a lot, is the stability of these algorithms. I actually really like these plots; since the A3C work we keep looking at these plots and we always put them in the papers. In the plot here, on the x-axis we have the hyperparameter combinations: when you train any model, what we all do is some sort of hyperparameter sweep, and here we are looking at the final score achieved with every single hyperparameter setting, sorted. In this kind of plot, the curves that are at the top and that are most flat are the better-performing and most stable algorithms. What we see here is that IMPALA is always, of course, achieving better results, but it's not achieving those results because of one lucky hyperparameter setting; it is consistently at the top, and you can see that it's not completely flat, of course, because in the end we are searching over three orders of magnitude in parameter settings, but we can see that the algorithm is actually quite stable.
30:18
Now, when we look at our main goal here: on the x-axis we have wall-clock time and on the y-axis we have the normalized score, and the red line that you see there is A3C. You can see that IMPALA not only achieves much better scores, of course, it achieves them much, much faster. The other thing is comparing the green and the orange curves: that is the comparison between training IMPALA in an expert setting versus a multi-task setting, and we see that it achieves better scores, and faster, which again suggests that we are actually seeing positive transfer. It's a like-for-like setting: all the details of the network and the agent are the same; in one case you have one network per task, and in the other case you train the same network on all the tasks, and what you achieve is a better result because of the positive transfer between those tasks. And what happens if you give IMPALA more resources is that you end up with this almost vertical takeoff there: you can actually solve this challenging thirty-task domain in under 24 hours given the resources, and that is the kind of algorithmic power that we want, to be able to train these very highly scalable agents.
31:38
Now, why do we want to do that? That is the point I want to come to next, and in the final part this is the new SPIRAL algorithm that I want to talk about. Just quickly going back to the original ideas I talked about: unsupervised learning is also about explaining environments and generating samples, but maybe about generating samples by explaining environments. We talked about the fact that when we have these deep learning models, like WaveNet, we can generate amazing samples, but maybe there's a different, less implicit way we can do these things, in the sense that when we generate these samples they come with some explanation, and that explanation can go through using some tools. In this particular case what we are going to do is use a painting tool, and we are going to learn to control this painting tool; it's a real drawing program, and we are going to basically generate a program that the painting tool will use to generate the image. The main idea that I want to convey is that by using tools, by learning how to use tools that are already available, we can start thinking about different kinds of generalization, which I'll try to demonstrate.
32:50
In the real world we have a lot of examples of programs, their executions, and the results of those programs; they can be arithmetic programs, drawing programs, or even architectural blueprints. Because we have information about that generation process, when we see the results we can go and try to infer what the program was, what the blueprint was that generated that particular input. We can do this, and the goal is to be able to do this with our agents too.
33:22
Specifically, we are going to use this environment called libmypaint. It is actually a professional-grade, open-source drawing library, and it's used worldwide by many artists. We are using a limited interface, basically learning to draw brushstrokes, and we are going to have an agent that does that. The agent, in the end called SPIRAL, has three main components. First of all there is the agent that generates the brushstrokes; I like to see that as writing the program. The second one is the environment, libmypaint: the brushstrokes come in and the environment turns them into brushstrokes on the canvas. That canvas then goes into a discriminator, and the discriminator is trained like a GAN: it looks at the generated image, says "does this look like a real drawing", and then gives a score. As opposed to the usual GAN training, rather than propagating the gradients back, we take that score and we train our agent with that score as a reward. So when you think about these three components coming together, you have an unsupervised learning model similar to GANs, but rather than generating in pixel space we generate in this program space, and the training is done through a reward that the agent itself also learns. We are trusting another neural net, just like in the GAN setup, to guide learning, but not through its gradients, only through the score function, and in my opinion that, in certain cases, makes it very capable of using different kinds of tools.
34:52
As I said, the reinforcement learning part of this agent is exactly the same as IMPALA. Now that we have an agent that can solve really challenging reinforcement learning setups, we take it and put it into this environment, augmented with the ability to learn a discriminator function to provide the reward. To emphasize again, the important thing here is: yes, we have an agent, but there is no environment that says, OK, this is the reward that the agent should get; the reward generation is also inside the agent, thanks again to all the unsupervised learning models that are being studied here. So we specifically use a GAN setup there.
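Here is a highly condensed sketch of that loop as described in the talk, with the policy update reduced to a plain REINFORCE-style step. The stroke parameterization, the linear "renderer" standing in for libmypaint, and the logistic discriminator are all illustrative assumptions, not the actual SPIRAL architecture.

import numpy as np

rng = np.random.default_rng(0)
CANVAS = 16 * 16      # flattened toy canvas
STROKE_DIM = 8        # toy "brushstroke command" (position, pressure, ...)

render_matrix = rng.normal(scale=0.3, size=(STROKE_DIM, CANVAS))  # libmypaint stand-in
disc_w = np.zeros(CANVAS)                                         # logistic discriminator
policy_mean = np.zeros(STROKE_DIM)                                # Gaussian stroke policy

def render(strokes):
    """Environment stand-in: turn stroke commands into a canvas image."""
    return np.tanh(strokes @ render_matrix)

def discriminator_score(canvas):
    return 1.0 / (1.0 + np.exp(-canvas @ disc_w))   # P(looks like a real drawing)

for step in range(200):
    # Agent "writes the program": sample a few stroke commands.
    strokes = policy_mean + rng.normal(size=(5, STROKE_DIM))
    fake = render(strokes.sum(axis=0, keepdims=True))[0]
    real = np.sign(rng.normal(size=CANVAS))         # stand-in for a real drawing

    # Discriminator update (standard GAN-style logistic loss gradient).
    disc_w += 0.05 * ((1 - discriminator_score(real)) * real
                      - discriminator_score(fake) * fake)

    # Agent update: the discriminator's score is used as the REWARD,
    # not as a gradient path back through the renderer.
    reward = discriminator_score(fake)
    policy_mean += 0.01 * reward * (strokes - policy_mean).mean(axis=0)  # REINFORCE-style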
35:31
So, can we generate? The first thing we try, of course, when doing unsupervised learning from scratch, is to go back to MNIST: you start from MNIST, and initially of course it generates various scratch-like things, but through training it becomes better and better. Here in the middle you see what the agent has learned; these are completely unconditional samples again, the ones that you see in the middle, and it has learned to create these strokes that generate these digits. To emphasize: this agent has never seen strokes coming from real people, how we draw digits; it learned by experimenting with these strokes and built its own policy to create strokes that would generate these images. Of course you can also train the whole setup as a conditional generation process, to recreate a given image. I think the main thing about this is that it's learning in an unsupervised way to draw the strokes; I see the libmypaint environment as giving us a grounded bottleneck to create a meaningful representation space.
36:38
Of course, the next thing we tried was Omniglot, and again you see the same things: it can generate unconditional, meaningful, Omniglot-looking samples, or it can recreate Omniglot samples. But then, generalization: here what we tried was to train the model on Omniglot and then ask it to generate MNIST digits; that is what you see in the middle row there. Can it draw MNIST digits? It has never seen MNIST digits before, but we all know that Omniglot is more general than MNIST, and it can do it: given an MNIST digit, it can actually draw it, even though the network itself has never seen any MNIST digits during its training. Then we tried smileys, which are line drawings, and given a smiley it can also draw smileys, which is great.
37:25
So can we do more? We took this cartoon drawing, chopped it up into 64-by-64 pieces, and it's a general line drawing. Again, this is the agent that was trained using Omniglot, and now you can see that it can actually recreate that drawing. Certain areas are a bit rough, like around the eyes, where the insides are really complicated, but in general you can see that it is capable of generating those drawings. So this gives you an idea of generalization: I can train on one domain and generalize to new ones.
38:01
So can I push it further? The next thing that we tried was this: the advantage of using a tool is that you have a meaningful representation space that we can hopefully transfer into a new environment. So here what we do is take the same agent that was trained using Omniglot and transfer it from that simulated environment into the real world. The way we do that is we took that same program, and our friends at the robotics group at DeepMind wrote a controller for a robotic arm to take that program and draw it. This whole experiment happened in under a week, really, and what we ended up with was the same agent, not fine-tuned for the new setup or anything; the same agent generates its brushstroke programs, and then that program goes into a controller that can be realized by a real robotic arm. The reason we can do this is that the environment we used is a real environment; we didn't create that environment ourselves. The latent space, if you will, is not some arbitrary latent space that we invented; it's a latent space that is defined by us as a meaningful tool space, and the reason we create those tools is to solve many different problems anyway. This is an example of that: using that tool space gives us the ability to actually transfer its capability.
39:32
So with that I want to conclude. I tried to give an explanation of how to think about generative models and unsupervised learning, and of course I'm a hundred percent sure everyone agrees that our aim is not just to look at images; our aim is to do much more than that. I tried to give two different aspects. One of them is that the kinds of generative models we can build right now can actually solve real-world problems, as we have seen with WaveNet. And we can also think about a different kind of setup where we have agents training and generating interpretable programs. That is an important aspect; we have seen that conversation coming up here through several of the talks: being able to generate interpretable programs is one of the bottlenecks that we face right now, because there are many critical applications that we want to solve and many tools that we want to utilize, and this is one step towards that, at least the way I see it. Being able to do these things requires us to create these very capable reinforcement learning agents that rely on new algorithms that we need to work on. With that, thank you very much; I want to thank all my collaborators for their help on this. Thank you very much.
40:50
[Applause] [Music] [Applause]
41:06
We have time for maybe one or two questions.
41:24
Q: OK, so I have one. How do you think about scaling to more general domains, beyond simple strokes? How do you generate, say, realistic scenes?
A: Right, so one thing that I haven't shown here: yes, creating realistic scenes is one case. One thing that I haven't talked about, which is actually in the paper as part of this work, is something the team did. By the way, I have to mention that this was worked on mostly by Yaroslav Ganin; he's actually a PhD student at MILA and he spent his summer with us doing his internship, so it's an amazing job for an internship, big congratulations to him. What we did was actually try to generate images: we took the CelebA dataset and used the same drawing program to actually draw those, and in that case our setup just scales towards that. The same setup actually scales, because it's a general drawing tool and you can control the color, so we can do that, but it requires a little bit more; it was one of the last experiments that we did, but it is in the works.
42:42
Q: Thanks for a great talk. I had a question about the IMPALA results. You had a slide with a curve where all workers are learning versus having one centralized, sorry, centralized learner, and the all-workers setting actually does better than the centralized learner. I found that somewhat surprising, but it's great to see the positive transfer between tasks. Have you tried that on other suites of tasks? Do you think it's just because the tasks in this suite are very similar to each other?
A: It definitely depends on that, but the reason we created those tasks is exactly that. In the real world, the visual structure of our world is unified, so the setup that we have in DeepMind Lab, in that task set, is a unified visual environment: you have one kind of agent with a unified action space, and now you can focus on solving different kinds of tasks. That is exactly the kind of thing we were testing: given all of these, is it possible to get the multi-task positive transfer that we see in supervised learning cases? And we were able to see that in reinforcement learning, yeah.
44:01
Q: Hello, this is exciting. I have a question about extending this to maybe more open domains. Is the challenge the number of actions to pick, because the stroke space is maybe smaller? What are the other challenges in extending it to open domains?
A: What do you have in mind as open domains? The number of actions is definitely a challenge; it is definitely one of the big challenges, and as far as I know a lot of research in RL goes into that. But that is, I think, only one of the main challenges. The other challenge, of course, is the state representation; that is mainly why we used deep learning, because we expect that with deep learning we are going to be able to learn better representations, and that still remains a challenge, because being able to learn representations is not only an architectural problem, it is also about finding the right training setup. SPIRAL was an example of that, where we can get that reward function, that reward signal, in an unsupervised way, and in many different domains there are many different ways we can do this, but actually finding those solutions is also part of it.
45:20
OK, so let's thank Koray again.
45:24
[Music] [Applause]