DeepMind: From Generative Models to Generative Agents - Transcript
00:00
Good morning. Hi, my name is Amelie, and I'm going to be the session chair for the morning, so it's my great pleasure to introduce Koray Kavukcuoglu, who is going to give an invited talk. Koray is a Director of Research at DeepMind and one of the star researchers in our community. He has contributed to many highly influential projects at DeepMind, such as spatial transformer networks, autoregressive generative models such as pixel recurrent networks and WaveNet, and deep reinforcement learning for playing Atari games and AlphaGo. Today he will talk about going from generative models to generative agents, so let's welcome Koray.
00:47
[Applause] [Music]
00:53
Thank you very much for the very nice introduction, and thanks everyone for being here; it's an absolute pleasure. As was just mentioned, I'm going to try to talk about unsupervised learning in general, starting from generative models in maybe the classical way, and then give another view that I think is quite interesting, which we have been working on recently. When I think about the important things for us to do as a community, I think everyone here sort of agrees that in the end what is important is to be doing unsupervised learning. We realize that supervised learning has had all sorts of successes, but in the end unsupervised learning is kind of the next frontier. When I think about unsupervised learning, there are different explanations that come to my mind, and when talking to people I think we all have slightly different opinions on this.
01:54
One of the things that I think is a common explanation is: we have an unsupervised learning algorithm, we run it on our data, and what we expect is for the algorithm to understand our data, to explain our data or our environment. What we expect from this is that the algorithm is going to learn the intrinsic properties of our data, of our environment, and then be able to explain them through those properties. But most of the time what happens, because of the kinds of models that we use, is that we resort in the end to looking at samples, and when we look at the samples we try to see whether our model really understood the environment; if it understood the environment, then the samples should be meaningful. Of course we look at all sorts of objective measures that we use during training, like Inception scores, log-likelihoods and such, but in the end we always resort to the samples to understand whether our model really can explain what's going on in the environment.
02:51
The other kind of general explanation that we all use is that the goal of unsupervised learning is to learn rich representations. It's already embedded in the name of this conference: the main goal of deep learning, of unsupervised learning, is learning those representations. But then when we think about those representations, this explanation again doesn't give us an objective measure. What we think about is how we are going to judge those representations in terms of being great and useful, and to me the most important bit is that if we have good and rich representations, then they are useful for generalization, for transfer. If you have a good unsupervised learning model and it can give us good representations, then we can get generalization. So what I'm going to do today is also tie this together with something else that I think is very important. As I've mentioned, a big chunk of the work that we have been doing at DeepMind, that I've been doing, is about agents and reinforcement learning, and in this talk I'm going to take a look at unsupervised learning both in the classical sense of learning a generative model and also as learning an agent that can do unsupervised learning.
04:03
So I'm going to start from the WaveNet model. Hopefully, as many of you know, it is a generative model of audio; it's a pure deep learning model, and with it you can model any audio signal, like speech and music, and get really realistic samples out of it. The next thing I'm going to do is explain this other, new approach to unsupervised learning that I find really interesting, which is based on deep reinforcement learning: learning an agent that does unsupervised learning. This model, called SPIRAL, is based on a new agent architecture that we have been working on and published recently, called IMPALA. It's a very large, highly scalable, efficient off-policy learning agent architecture, which we use in SPIRAL to do unsupervised learning. The interesting bit about the SPIRAL work is that it gets generalization through using a sort of tool space, tools that we as people have created so that we can solve not one specific problem but many different problems. Using these tools, using the interface of a tool, and having an agent, you can actually now learn a generative model of your environment.
05:19
All right, so without more delay, the first thing I'm going to introduce, quickly, is the WaveNet model. WaveNet is a generative model of audio. As I said, it models the raw audio signal; it doesn't use any sort of interface to model the audio. Audio in general is very high dimensional: the standard audio signal we started with at the beginning was 16,000 samples per second. If you compare that with our usual language modeling and machine translation kinds of tasks, it is several orders of magnitude more data, so the dependencies that one needs to capture to model good audio signals are very long. What this model does is model one sample at a time, using a softmax distribution over each sample, dependent on all the previous samples of the signal. When you look at it more closely, it is an architecture that has quite a bit of resemblance to the PixelCNN model, which maybe some of you are also familiar with. In the end it is a stack of multiple convolutional layers; to be a little more specific, it has these residual blocks, you use multiples of those residual blocks, and in each residual block there are dilated convolutional layers stacked on top of each other. Through those dilated convolutional layers, which are causal convolutions, we can model very long dependencies, so through that we can model the dependencies in time.
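To make the dilation idea concrete, here is a minimal sketch, in plain NumPy, of a stack of dilated causal 1-D convolutions. The layer widths, the dilation schedule, and the simple ReLU nonlinearity are illustrative assumptions only; the real WaveNet also uses gated activations, residual and skip connections, and a softmax output over quantized samples.

import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """x: (T, C_in), w: (kernel=2, C_in, C_out). Causal: the output at time t
    only sees inputs at t and t - dilation (the past is left-padded with zeros)."""
    T = x.shape[0]
    pad = np.zeros((dilation, x.shape[1]))
    x_prev = np.vstack([pad, x])[:T]            # x shifted right by `dilation`
    return x_prev @ w[0] + x @ w[1]             # (T, C_out)

def wavenet_like_stack(x, weights, dilations=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Toy stack: each layer doubles the dilation, so the receptive field grows
    exponentially with depth (here 1 + sum(dilations) = 256 past samples)."""
    h = x
    for w, d in zip(weights, dilations):
        h = np.maximum(0.0, causal_dilated_conv1d(h, w, d))   # ReLU stand-in
    return h

# Tiny usage example on a random "audio feature" sequence of 1000 timesteps.
rng = np.random.default_rng(0)
channels = 16
weights = [rng.normal(scale=0.1, size=(2, channels, channels)) for _ in range(8)]
audio_features = rng.normal(size=(1000, channels))
out = wavenet_like_stack(audio_features, weights)
print(out.shape)   # (1000, 16); every output depends only on past inputs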
07:06
Now, one of the biggest design considerations about WaveNet is that it is designed to be very efficient during training, because during training all the targets are known: you process the whole signal at once, just run it like a convolutional net, you get your predictions, and because you have the targets you get your error signal and propagate it back. So training is very efficient. But when it comes to sampling time, in the end this is an autoregressive model, and through those causal convolutions you need to run one sample at a time. If you are sampling at, say, 24 kilohertz, 24,000 samples per second, you need to generate one sample at a time, just like you see in this animation, and of course this is painful. But in the end it works quite well, and we can generate very high quality audio with this.
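As a rough illustration of why training is parallel but sampling is sequential, here is a schematic sketch. The `DummyModel` stand-in, the 256-way categorical output (as in a mu-law-quantized signal), and the function names are all assumptions for illustration, not the actual WaveNet interface.

import numpy as np

class DummyModel:
    """Stand-in network: maps a length-T sequence of quantized samples to
    (T, 256) logits; here each output trivially depends only on its own input."""
    def __init__(self, dim=256, seed=1):
        self.W = np.random.default_rng(seed).normal(scale=0.01, size=(dim, dim))
    def __call__(self, samples):
        return np.eye(256)[np.asarray(samples, dtype=int)] @ self.W

def train_step(model, waveform):
    """Teacher forcing: the net sees the true past everywhere, so all T
    predictions and their losses are computed in one parallel pass."""
    logits = model(waveform[:-1])                               # (T-1, 256) at once
    targets = waveform[1:]                                      # next-sample targets
    log_probs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    return -log_probs[np.arange(len(targets)), targets].mean()

def generate(model, n_samples, seed=(128,)):
    """Ancestral sampling: each new sample needs a fresh forward pass that
    conditions on everything generated so far, hence one step at a time."""
    samples, rng = list(seed), np.random.default_rng(0)
    for _ in range(n_samples):
        logits = model(np.array(samples))[-1]                   # newest prediction only
        p = np.exp(logits - logits.max()); p /= p.sum()
        samples.append(rng.choice(256, p=p))
    return np.array(samples)

wave = np.random.default_rng(2).integers(0, 256, size=400)
m = DummyModel()
print(train_step(m, wave), generate(m, 5))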
07:58
So what I want to do now is make you listen to unconditional samples from this model. We model the speech signal without any conditioning on text or anything: just take the audio signal, model it with WaveNet, and then sample. This is the kind of thing you get.
08:30
As you can hear, hopefully, the quality is very high, and this is modeling really the raw audio signal, completely unconditionally. Sometimes you even hear short words, like "okay" or "from", and if you listen, the intonation and everything sounds quite natural; sometimes it feels like you are listening to someone speaking in a language that you don't know. So the main characteristics of the signal are all captured there. In terms of dependencies, we are looking at something like several thousand samples of dependencies being properly and correctly modelled.
09:12
And then, of course, what you can do is augment this model by conditioning on a text signal that is associated with the audio that you want to generate. By conditioning on the text signal you now have a conditional generative model that actually solves a real-world problem just by itself, end to end, with deep learning. From the text you create the linguistic embeddings, using those linguistic embeddings you can generate the signal, and then it starts, it's now talking. So it's a solution to the whole text-to-speech synthesis problem which, as you know, is very commonly used in the real world.
10:03
All right, so when we did the WaveNet model, and this was around almost two years ago now, we looked at the quality when we used it as a TTS model. In green what you see is the quality of human speech, which we can measure through these mean opinion scores; in blue you see WaveNet, and the other colors are the other models that were the best models around at the time. You can see that WaveNet closed the gap between human-quality speech and the other models by a big margin. At the time this really got us excited, because now we actually had a deep learning model that comes with all the flexibilities and advantages of doing deep learning, and at the same time it's modeling raw audio and it is very high quality.
10:50
I could play text-to-speech samples generated by this model, but actually, as I'm going to go into next, if you are using Google Assistant right now you are already hearing WaveNet there, because this is already in production. So for anyone who's using Google Assistant, querying Wikipedia and things like that, the speech that is generated there is actually coming from the WaveNet model. What I want to do is explain how we did that, and that brings me to our next project in the WaveNet domain: the Parallel WaveNet project.
11:24
So of course, when you have a research project and at some point you realize that it actually lends itself to the solution of a real-world problem, and you want to put it into production in a very challenging environment, then it requires much more than our little research group. This was a big collaboration between the DeepMind research and applied teams and the Google speech team. In this slide, what I show is basically the basic ingredients of how we turned the WaveNet architecture into a feed-forward and parallel architecture. What we realized pretty soon, when we attempted putting a system like this into production, was that speed is of course very important and quality is very important, but the constraint on speed is not just being able to run in real time: the kinds of constraints we face are orders of magnitude faster than real time, even being able to run in essentially constant time. And when the constraint becomes being able to run in constant time, the only thing you can do is create a feed-forward network and parallelize the signal generation. So that is what we did.
12:43
In this slide, at the top what you see is the usual WaveNet model; we call it the teacher. In this setting, the WaveNet model is pre-trained, it is fixed, and it is used as a scoring function. At the bottom what you see is the generator, which we call the student, and this student model is again an architecture that is very close to WaveNet, but it is run as a feed-forward convolutional network. The way it is run and trained has two components: one component is coming from WaveNet, which we know is very efficient in training, as I said, but slow in sampling; the other is based on the inverse autoregressive flow work that was done by Kingma and colleagues at OpenAI last year. This structure gives us the capability to take an input noise signal and slowly transform that noise into a proper distribution that is going to be the speech signal. The way we train this is: random noise goes in, together with the linguistic features, through layers and layers of these flows; that random noise gets transformed into a speech signal; that speech signal goes into WaveNet. WaveNet is already the best kind of scoring function we can use, because it's a density model, so WaveNet scores it, from that score we get the gradients back into the generator, and then we update the generator. We call this process probability density distillation.
14:18 | |
things are very challenging like speed | |
14:19 | |
signals that is by itself not enough so | |
14:21 | |
I have highlighted two components here | |
14:23 | |
one of them as I said is the magnet | |
14:25 | |
scoring function the other thing that we | |
14:27 | |
use is a power loss because what happens | |
14:30 | |
is when we train the model in this | |
14:32 | |
manner the signal tends to be very low | |
14:35 | |
energy sort of like whispering someone | |
14:38 | |
speaks but they are like whispering so | |
14:39 | |
during training we sort of edit this | |
14:41 | |
extra loss that tries to conserve the | |
14:43 | |
energy of the generated speech and with | |
14:47 | |
these two the the wavenet scoring and | |
14:49 | |
the power loss we were already getting | |
14:51 | |
very high called speed signal but of | |
14:54 | |
course like the constraints are very | |
14:55 | |
very tough and what we did was we | |
14:58 | |
trained another wave net model so we | |
15:00 | |
sort of used wavenet everywhere right | |
15:01 | |
that we are generating through a leg net | |
15:03 | |
through convolution we are using very | |
15:04 | |
net as a scoring function we again | |
15:07 | |
trained another very net model this time | |
15:08 | |
we used it as a speech recognition | |
15:10 | |
system and that is the perceptual loss | |
15:12 | |
that you see there so we train the wave | |
15:14 | |
net again as a speech recognition system | |
15:16 | |
what we do is during training of course | |
15:18 | |
you have the text and the corresponding | |
15:21 | |
speech signal we generate the we | |
15:25 | |
generate the corresponding speech | |
15:27 | |
through our generator we get the text | |
15:29 | |
give that the speech recognition system | |
15:30 | |
the speech recognition system of course | |
15:32 | |
not needs to decode we generated signal | |
15:35 | |
into those into that text right and we | |
15:37 | |
get the error from there propagate back | |
15:39 | |
into our generator so that's another | |
15:41 | |
sort of quality improvement that we get | |
15:42 | |
by using speech recognition as a | |
15:45 | |
perceptual loss in our generation system | |
15:47 | |
and the last thing that we did was using | |
15:51 | |
a contrasting term that basically uses | |
15:53 | |
okay we generate a signal conditioned on | |
15:55 | |
some text you can you can create a | |
15:58 | |
contrast applause we're saying that the | |
16:01 | |
signal that is generated with the | |
16:02 | |
corresponding text is it should be | |
16:05 | |
different than the same signal if it if | |
16:07 | |
it was conditioned on a separate text | |
16:09 | |
right | |
16:10 | |
there's a contrasting luster so more | |
16:12 | |
specifically what we have is in the end | |
16:14 | |
we end up with these four terms at the | |
16:18 | |
top we see that the the original sort of | |
16:22 | |
using vena there's a scoring function | |
16:24 | |
the problem with advances the | |
16:25 | |
distillation idea then we have the power | |
16:28 | |
loss that that uses Fourier transforms | |
16:31 | |
eternal to to conserve the energy and | |
16:34 | |
the contrastive term and find out the | |
16:36 | |
perceptual was that does the that does | |
16:40 | |
the speech of cognition and when we all | |
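Here is a heavily simplified sketch of how those four terms might be combined into one training objective for the student. The function names, frame size, and relative weights are illustrative assumptions, not the actual Parallel WaveNet implementation, and each sub-loss is only stubbed at the level described in the talk.

import numpy as np

def power_loss(generated, reference, frame=256):
    """Match average spectral energy so the student does not 'whisper'."""
    def avg_power(x):
        frames = x[: len(x) // frame * frame].reshape(-1, frame)
        return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return float(np.mean((avg_power(generated) - avg_power(reference)) ** 2))

def distillation_loss(student_logp, teacher_logp):
    """Probability-density-distillation term: evaluated on student samples, the
    student should put its mass where the fixed teacher WaveNet assigns high
    density (a KL-style surrogate, E[log q - log p])."""
    return float(np.mean(student_logp - teacher_logp))

def parallel_wavenet_loss(generated, reference, student_logp, teacher_logp,
                          perceptual, contrastive, w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms from the talk; `perceptual` (ASR error on
    the generated audio) and `contrastive` are assumed to be computed by
    separate networks and passed in here as scalars."""
    terms = (distillation_loss(student_logp, teacher_logp),
             power_loss(generated, reference),
             perceptual, contrastive)
    return sum(wi * ti for wi, ti in zip(w, terms))

rng = np.random.default_rng(0)
gen, ref = rng.normal(size=4096), rng.normal(size=4096)
print(parallel_wavenet_loss(gen, ref, rng.normal(size=100), rng.normal(size=100),
                            perceptual=0.3, contrastive=0.1))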
16:42
When we put all of these together, what we did, of course, was look at the quality. What I'm showing here is the quality with respect to, again, the best non-WaveNet model. This is roughly a year after the original research, pretty much exactly a year, and during that time the best speech synthesis models also improved, but WaveNet was still better than anything else, and the new one, the Parallel WaveNet, exactly matches the quality of the original WaveNet. What I'm showing here is three different US English voices and also Japanese, and this is the kind of thing that we always want from deep learning: the ability to generalize to new datasets, to new domains. We developed this model on practically one single US English voice, and it was just a matter of collecting or getting another dataset from another speaker or another language, like a speaker speaking Japanese; you just get that, run it, and there you go, you have a production-quality speech synthesis system just by doing that. This is the kind of thing that we really like from deep learning, and if you are thinking about deep learning and about unsupervised learning, I think this is a very good demonstration of that.
17:59
Before switching to the next topic, I also want to mention that we have done some further work on this, called WaveRNN, which was recently published, and I encourage you to look into that one too; it's a very interesting piece of work, also for generating speech at very high speed.
18:18
The next thing I want to talk about is the IMPALA architecture, the new agent architecture I mentioned. As I said, WaveNet is an unsupervised model in the classical sense that can actually solve a real-world problem; the next thing I want to start talking about is this new, different way of doing unsupervised learning, but for that the other exciting bit is being able to do deep reinforcement learning at scale.
18:54
All right, so I want to motivate why we want to push our deep reinforcement learning models further and further. Most of the time what we do, because this is a new area, is take very simple tasks in some simple environments, and try to train an agent that solves a single task in that environment. What we want to do is go further than that, again going back to the point of generalization and being able to solve multiple tasks. We have created a new task set; this is an open-source task set: we have an open-source environment called DeepMind Lab, and as part of that we have created this new task set, DMLab-30. It is 30 environments covering tasks around language, memory, navigation and those kinds of things, and the goal is not to solve each one of them individually; the goal is to have one single agent, one single network, solving all those tasks at the same time. There is nothing custom in that agent that is specific to any single one of these environments. When you look at those environments, and I'm showing some of them here, the agent has a first-person view: it is in a maze-like environment, the agent has a first-person camera input, and it can navigate around, go forwards and backwards, rotate, look up and down, jump and those kinds of things, and it is solving all different kinds of tasks that are catered to test different kinds of abilities. But the goal, as I said, is to solve all of them at the same time.
20:24
One thing that becomes really important in this case is of course the stability of our algorithms, because now we are not solving one single task, we are solving 30 of them, and we want really stable models because we don't have the chance to tune hyperparameters on one single task anymore. And of course what also becomes really important is task interference: hopefully what we expect, again by using deep learning, is that in this multi-task setting we see positive transfer rather than task interference, and we hope to demonstrate this in this challenging reinforcement learning domain too.
21:03
OK, I sort of realized that I needed to put in a slide about why deep reinforcement learning, because, a little bit to my surprise, there was actually not much reinforcement learning at this conference this year, and I wanted to touch a little on why I think it is important for the deep learning community, for this community, to actually do deep reinforcement learning. To me, if one of the goals that we work towards here is AI, then it is at the core of all of it. Reinforcement learning is a very general framework for doing sequential decision-making, for learning sequential decision-making tasks, and deep learning, on the other hand, is of course the best set of models and algorithms we have for learning representations. The combination of these two is the best answer we have so far for learning very good state representations for very challenging tasks, not just for solving toy domains but actually for solving challenging real-world problems. Of course there are many open problems there; some that are interesting, at least for me, are the idea of separating the computational power of a model from the number of weights or the number of layers it has, and, going back to unsupervised learning again, learning to transfer. So we should build these deep reinforcement learning models with the aim to actually generalize, to transfer.
22:39
OK, so the IMPALA agent is based on another piece of work that we did a couple of years ago, called the asynchronous advantage actor-critic, the A3C model. In the end it's a policy gradient method; I've tried to explain it somewhat cartoonishly in the figure. At every time step the agent sees the environment, and at that time step the agent outputs a policy distribution and also a value function. The value function is the agent's expectation of the total amount of reward it is going to get until the end of the episode, being in that state, and the policy is the distribution over the actions that the agent has. At every time step the agent looks at the environment, updates its policy so that it can act in the environment, and updates its value function. The way you train this is with the policy gradient. Intuitively, it's actually very simple: the gradient of the policy is scaled by the difference between the total reward that the agent actually gets in the environment and the baseline, and the baseline is the value function. What that means is that if the agent ends up doing better than what the value function, its own assumption, predicted, then it's a good thing, you have a positive gradient, and you are going to reinforce your understanding of the environment; if the agent does worse than expected, so the value was higher than the total reward you got, then you have a negative gradient and you need to shuffle things around. And the way you learn the value function is with the usual n-step TD error.
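As a rough sketch of that update, here is a minimal advantage actor-critic loss over one n-step trajectory segment, written in plain NumPy. The discount, the segment layout, and treating the final bootstrap value as given are illustrative assumptions rather than the exact A3C implementation (which also adds an entropy bonus and runs asynchronously).

import numpy as np

def a2c_targets(rewards, values, bootstrap_value, gamma=0.99):
    """n-step returns R_t = r_t + gamma * R_{t+1}, bootstrapped from the value
    estimate of the state that follows the segment."""
    returns = np.zeros_like(rewards, dtype=float)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values          # (R_t - V(s_t)): return minus baseline
    return returns, advantages

def a2c_loss(log_pi_taken, values, rewards, bootstrap_value, gamma=0.99):
    """Policy term: -log pi(a_t|s_t) * advantage (advantage held fixed);
    value term: squared n-step TD error."""
    returns, adv = a2c_targets(rewards, values, bootstrap_value, gamma)
    policy_loss = -(log_pi_taken * adv).mean()
    value_loss = 0.5 * ((returns - values) ** 2).mean()
    return policy_loss + 0.5 * value_loss

# Toy usage on a 5-step segment.
rng = np.random.default_rng(0)
print(a2c_loss(log_pi_taken=rng.normal(size=5), values=rng.normal(size=5),
               rewards=np.array([0., 0., 1., 0., 1.]), bootstrap_value=0.2))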
24:17
Now, the A3C algorithm: that was the actor-critic part. The asynchronous part is that the A3C algorithm is composed of multiple actors, and each actor independently operates in the environment, collecting observations, acting in the environment, and computing the policy gradients with respect to the parameters of its network. Then it sends those gradients back to the parameter server; the parameter server collects these gradients from all the different actors, combines them, and then shares the updated parameters with all the actors. Now what happens in this case, as you increase the number of actors, is the usual asynchronous stochastic gradient descent setup: as the number of actors increases, the staleness of the gradients becomes a problem. So in the end, distributing the experience collection is actually very advantageous, it's very good, but communicating gradients might become a bottleneck as you try to really scale things up.
25:21
So for that, what we tried was a different architecture. The idea of a centralized server is actually quite useful, but rather than using it just to accumulate the parameter updates, the idea is to make the centralized component into a learner, so the whole learning algorithm is contained in it. What the actors do is only act in the environment, not compute gradients or anything; they send the observations back to the learner, and the learner sends the parameters back. In this way, you are completely decoupling your experience collection in your environments from your learning algorithm, and you are gaining a lot of robustness to noise in your environments: sometimes rendering times vary, some environments are slow, some environments are fast, and all of that is completely decoupled from your learning algorithm. But of course, what you need then is a good learning algorithm that can deal with that kind of variation.
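To make the decoupling concrete, here is a toy, single-process sketch of the IMPALA-style data flow: actors only act under a possibly stale parameter snapshot and enqueue trajectories, while one learner consumes them centrally. The linear "policy", the queue, and the reward-weighted update at the end are all stand-ins for illustration; a real learner would apply V-trace-corrected policy gradients.

import numpy as np
from collections import deque

rng = np.random.default_rng(0)
OBS_DIM, ACTION_DIM = 8, 4

def act_in_env(params, steps=20):
    """Stand-in for an actor rollout: observations, sampled actions, rewards,
    and the log-probs under the actor's (possibly stale) parameter snapshot."""
    obs = rng.normal(size=(steps, OBS_DIM))
    logits = obs @ params
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    actions = np.array([rng.choice(ACTION_DIM, p=p) for p in probs])
    return dict(obs=obs, actions=actions, rewards=rng.normal(size=steps),
                behaviour_logp=np.log(probs[np.arange(steps), actions]))

learner_params = rng.normal(scale=0.1, size=(OBS_DIM, ACTION_DIM))
queue = deque()

for _ in range(4):   # four "actors": no backward pass, they only send trajectories
    stale = learner_params + rng.normal(scale=0.05, size=learner_params.shape)
    queue.append(act_in_env(stale))

while queue:         # the single learner does all the gradient computation
    batch = queue.popleft()
    onehot = np.eye(ACTION_DIM)[batch["actions"]]
    # Stand-in reward-weighted update; behaviour_logp would feed the
    # importance ratios of an off-policy correction such as V-trace.
    learner_params += 1e-3 * batch["obs"].T @ (onehot * batch["rewards"][:, None])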
26:27
So, in the end, in IMPALA what we have is a very efficient, decoupled backward pass, if you will. Actors generate trajectories, as I said, but that decoupling creates this off-policyness: the policy in the actors, the behaviour policy if you will, is separate from the policy in the learner, the target policy. So what we need is off-policy learning. Of course there are many off-policy learning algorithms, but we really wanted a policy gradient method, and for that we developed this new method called V-trace; it's an off-policy advantage actor-critic algorithm. The advantage of V-trace is that it uses truncated importance sampling ratios to come up with an estimate for the value: because there is this imbalance between the learner and the actors, you need to correct for that difference. The good thing about this algorithm is that it transitions smoothly between the on-policy case and the off-policy case. When the actors and the learner are completely in sync, so you are in the on-policy case, the algorithm boils down to the usual A3C update with the n-step Bellman equation; if they become more separate, then the correction of the algorithm kicks in and you have the corrected estimate. The algorithm has two main components, two truncation factors, to control two different aspects of off-policy learning. One of them is rho-bar, which controls which value function the algorithm is going to converge towards: the value function that corresponds to the behaviour policy, or the value function that corresponds to the target policy in the learner. The other one, the c factor, controls the speed of convergence: by controlling the truncation it can increase or decrease the variance in learning, and it can have an effect on the speed of convergence.
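For reference, here is a small NumPy sketch of the V-trace targets as I understand them from the IMPALA paper, written in the recursive form. The bootstrap handling at the end of the segment and the default truncation levels (rho-bar = c-bar = 1) follow a common choice, but this is an illustrative re-implementation, not DeepMind's code.

import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_s and policy-gradient advantages for one
    trajectory segment collected under the (stale) behaviour policy."""
    ratios = np.exp(target_logp - behaviour_logp)
    rhos = np.minimum(rho_bar, ratios)           # truncated IS ratio for the TD term
    cs = np.minimum(c_bar, ratios)               # truncated IS ratio for the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)   # rho_t * TD error

    vs = np.zeros_like(values)
    acc = 0.0                                    # carries v_{s+1} - V(x_{s+1})
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc    # recursive V-trace correction
        vs[t] = values[t] + acc

    vs_next = np.append(vs[1:], bootstrap_value)
    pg_advantages = rhos * (rewards + gamma * vs_next - values)
    return vs, pg_advantages

# Toy usage: 6-step segment where behaviour and target policies differ slightly.
rng = np.random.default_rng(0)
vs, adv = vtrace_targets(behaviour_logp=rng.normal(size=6) - 1.0,
                         target_logp=rng.normal(size=6) - 1.0,
                         rewards=rng.normal(size=6), values=rng.normal(size=6),
                         bootstrap_value=0.0)
print(vs, adv)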
28:24
Now, when we tested this, of course the goal is to test on all the environments at once, but what we wanted to do first was look at the single-task case, so we looked at five different environments, and we see that in these environments the IMPALA algorithm is always very stable and performs at the top. The comparisons here are the IMPALA algorithm, the batched A3C method, the batched A2C method, and then different versions of A3C algorithms, and you can see that IMPALA and batched A2C are always performing at the top; IMPALA is doing fine there, the dark blue curve, and this gives us the feeling that, OK, we have a nice algorithm.
29:08
Now, of course, the other thing that is very important, and that is discussed a lot, is the stability of these algorithms. I actually really like these plots; since the A3C work we keep looking at these plots and we always put them in the papers. In the plot here, on the x-axis we have the hyperparameter combinations: when you train any model, what we all do is some sort of hyperparameter sweep, and here we are looking at the final score achieved with every single hyperparameter setting, sorted. In this kind of plot, the curves that are at the top and that are most flat are the better-performing and most stable algorithms. What we see here is that IMPALA is always, of course, achieving better results, but it's not achieving those results because of one lucky hyperparameter setting; it is consistently at the top, and you can see that it's not completely flat, of course, because in the end we are searching over three orders of magnitude in parameter settings, but we can see that the algorithm is actually quite stable.
30:18
Now, when we look at our main goal here: on the x-axis we have wall-clock time and on the y-axis we have the normalized score, and the red line that you see there is A3C. You can see that IMPALA not only achieves much better scores, of course, it achieves them much, much faster. The other thing is comparing the green and the orange curves: that is the comparison between training IMPALA in an expert setting versus a multi-task setting, and we see that it achieves better scores, and faster, which again suggests that we are actually seeing positive transfer. It's a like-for-like setting: all the details of the network and the agent are the same; in one case you have one network per task, and in the other case you train the same network on all the tasks, and what you achieve is a better result because of the positive transfer between those tasks. And what happens if you give IMPALA more resources is that you end up with this almost vertical takeoff there: you can actually solve this challenging thirty-task domain in under 24 hours given the resources, and that is the kind of algorithmic power that we want, to be able to train these very highly scalable agents.
31:38
Now, why do we want to do that? That is the point I want to come to next, and in the final part this is the new SPIRAL algorithm that I want to talk about. Just quickly going back to the original ideas I talked about: unsupervised learning is also about explaining environments and generating samples, but maybe about generating samples by explaining environments. We talked about the fact that when we have these deep learning models, like WaveNet, we can generate amazing samples, but maybe there's a different, less implicit way we can do these things, in the sense that when we generate these samples they come with some explanation, and that explanation can go through using some tools. In this particular case what we are going to do is use a painting tool, and we are going to learn to control this painting tool; it's a real drawing program, and we are going to basically generate a program that the painting tool will use to generate the image. The main idea that I want to convey is that by using tools, by learning how to use tools that are already available, we can start thinking about different kinds of generalization, which I'll try to demonstrate.
32:50
In the real world we have a lot of examples of programs, their executions, and the results of those programs; they can be arithmetic programs, drawing programs, or even architectural blueprints. Because we have information about that generation process, when we see the results we can go and try to infer what the program was, what the blueprint was that generated that particular input. We can do this, and the goal is to be able to do this with our agents too.
33:22
Specifically, we are going to use this environment called libmypaint. It is actually a professional-grade, open-source drawing library, and it's used worldwide by many artists. We are using a limited interface, basically learning to draw brushstrokes, and we are going to have an agent that does that. The agent, in the end called SPIRAL, has three main components. First of all there is the agent that generates the brushstrokes; I like to see that as writing the program. The second one is the environment, libmypaint: the brushstrokes come in and the environment turns them into brushstrokes on the canvas. That canvas then goes into a discriminator, and the discriminator is trained like a GAN: it looks at the generated image, says "does this look like a real drawing", and then gives a score. As opposed to the usual GAN training, rather than propagating the gradients back, we take that score and we train our agent with that score as a reward. So when you think about these three components coming together, you have an unsupervised learning model similar to GANs, but rather than generating in pixel space we generate in this program space, and the training is done through a reward that the agent itself also learns. We are trusting another neural net, just like in the GAN setup, to guide learning, but not through its gradients, only through the score function, and in my opinion that, in certain cases, makes it very capable of using different kinds of tools.
34:52
As I said, the reinforcement learning part of this agent is exactly the same as IMPALA. Now that we have an agent that can solve really challenging reinforcement learning setups, we take it and put it into this environment, augmented with the ability to learn a discriminator function to provide the reward. To emphasize again, the important thing here is: yes, we have an agent, but there is no environment that says, OK, this is the reward that the agent should get; the reward generation is also inside the agent, thanks again to all the unsupervised learning models that are being studied here. So we specifically use a GAN setup there.
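Here is a highly condensed sketch of that loop as described in the talk, with the policy update reduced to a plain REINFORCE-style step. The stroke parameterization, the linear "renderer" standing in for libmypaint, and the logistic discriminator are all illustrative assumptions, not the actual SPIRAL architecture.

import numpy as np

rng = np.random.default_rng(0)
CANVAS = 16 * 16      # flattened toy canvas
STROKE_DIM = 8        # toy "brushstroke command" (position, pressure, ...)

render_matrix = rng.normal(scale=0.3, size=(STROKE_DIM, CANVAS))  # libmypaint stand-in
disc_w = np.zeros(CANVAS)                                         # logistic discriminator
policy_mean = np.zeros(STROKE_DIM)                                # Gaussian stroke policy

def render(strokes):
    """Environment stand-in: turn stroke commands into a canvas image."""
    return np.tanh(strokes @ render_matrix)

def discriminator_score(canvas):
    return 1.0 / (1.0 + np.exp(-canvas @ disc_w))   # P(looks like a real drawing)

for step in range(200):
    # Agent "writes the program": sample a few stroke commands.
    strokes = policy_mean + rng.normal(size=(5, STROKE_DIM))
    fake = render(strokes.sum(axis=0, keepdims=True))[0]
    real = np.sign(rng.normal(size=CANVAS))         # stand-in for a real drawing

    # Discriminator update (standard GAN-style logistic loss gradient).
    disc_w += 0.05 * ((1 - discriminator_score(real)) * real
                      - discriminator_score(fake) * fake)

    # Agent update: the discriminator's score is used as the REWARD,
    # not as a gradient path back through the renderer.
    reward = discriminator_score(fake)
    policy_mean += 0.01 * reward * (strokes - policy_mean).mean(axis=0)  # REINFORCE-style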
35:31
So, can we generate? The first thing we try, of course, when doing unsupervised learning from scratch, is to go back to MNIST: you start from MNIST, and initially of course it generates various scratch-like things, but through training it becomes better and better. Here in the middle you see what the agent has learned; these are completely unconditional samples again, the ones that you see in the middle, and it has learned to create these strokes that generate these digits. To emphasize: this agent has never seen strokes coming from real people, how we draw digits; it learned by experimenting with these strokes and built its own policy to create strokes that would generate these images. Of course you can also train the whole setup as a conditional generation process, to recreate a given image. I think the main thing about this is that it's learning in an unsupervised way to draw the strokes; I see the libmypaint environment as giving us a grounded bottleneck to create a meaningful representation space.
36:38
Of course, the next thing we tried was Omniglot, and again you see the same things: it can generate unconditional, meaningful, Omniglot-looking samples, or it can recreate Omniglot samples. But then, generalization: here what we tried was to train the model on Omniglot and then ask it to generate MNIST digits; that is what you see in the middle row there. Can it draw MNIST digits? It has never seen MNIST digits before, but we all know that Omniglot is more general than MNIST, and it can do it: given an MNIST digit, it can actually draw it, even though the network itself has never seen any MNIST digits during its training. Then we tried smileys, which are line drawings, and given a smiley it can also draw smileys, which is great.
37:25
So can we do more? We took this cartoon drawing, chopped it up into 64-by-64 pieces, and it's a general line drawing. Again, this is the agent that was trained using Omniglot, and now you can see that it can actually recreate that drawing. Certain areas are a bit rough, like around the eyes, where the insides are really complicated, but in general you can see that it is capable of generating those drawings. So this gives you an idea of generalization: I can train on one domain and generalize to new ones.
38:01
So can I push it further? The next thing that we tried was this: the advantage of using a tool is that you have a meaningful representation space that we can hopefully transfer into a new environment. So here what we do is take the same agent that was trained using Omniglot and transfer it from that simulated environment into the real world. The way we do that is we took that same program, and our friends at the robotics group at DeepMind wrote a controller for a robotic arm to take that program and draw it. This whole experiment happened in under a week, really, and what we ended up with was the same agent, not fine-tuned for the new setup or anything; the same agent generates its brushstroke programs, and then that program goes into a controller that can be realized by a real robotic arm. The reason we can do this is that the environment we used is a real environment; we didn't create that environment ourselves. The latent space, if you will, is not some arbitrary latent space that we invented; it's a latent space that is defined by us as a meaningful tool space, and the reason we create those tools is to solve many different problems anyway. This is an example of that: using that tool space gives us the ability to actually transfer its capability.
39:32
So with that I want to conclude. I tried to give an explanation of how to think about generative models and unsupervised learning, and of course I'm a hundred percent sure everyone agrees that our aim is not just to look at images; our aim is to do much more than that. I tried to give two different aspects. One of them is that the kinds of generative models we can build right now can actually solve real-world problems, as we have seen with WaveNet. And we can also think about a different kind of setup where we have agents training and generating interpretable programs. That is an important aspect; we have seen that conversation coming up here through several of the talks: being able to generate interpretable programs is one of the bottlenecks that we face right now, because there are many critical applications that we want to solve and many tools that we want to utilize, and this is one step towards that, at least the way I see it. Being able to do these things requires us to create these very capable reinforcement learning agents that rely on new algorithms that we need to work on. With that, thank you very much; I want to thank all my collaborators for their help on this. Thank you very much.
40:50
[Applause] [Music] [Applause]
41:06
We have time for maybe one or two questions.
41:24
Q: OK, so I have one. How do you think about scaling to more general domains, beyond simple strokes? How do you generate, say, realistic scenes?
A: Right, so one thing that I haven't shown here: yes, creating realistic scenes is one case. One thing that I haven't talked about, which is actually in the paper as part of this work, is something the team did. By the way, I have to mention that this was worked on mostly by Yaroslav Ganin; he's actually a PhD student at MILA and he spent his summer with us doing his internship, so it's an amazing job for an internship, big congratulations to him. What we did was actually try to generate images: we took the CelebA dataset and used the same drawing program to actually draw those, and in that case our setup just scales towards that. The same setup actually scales, because it's a general drawing tool and you can control the color, so we can do that, but it requires a little bit more; it was one of the last experiments that we did, but it is in the works.
42:42
Q: Thanks for a great talk. I had a question about the IMPALA results. You had a slide with a curve where all workers are learning versus having one centralized, sorry, centralized learner, and the all-workers setting actually does better than the centralized learner. I found that somewhat surprising, but it's great to see the positive transfer between tasks. Have you tried that on other suites of tasks? Do you think it's just because the tasks in this suite are very similar to each other?
A: It definitely depends on that, but the reason we created those tasks is exactly that. In the real world, the visual structure of our world is unified, so the setup that we have in DeepMind Lab, in that task set, is a unified visual environment: you have one kind of agent with a unified action space, and now you can focus on solving different kinds of tasks. That is exactly the kind of thing we were testing: given all of these, is it possible to get the multi-task positive transfer that we see in supervised learning cases? And we were able to see that in reinforcement learning, yeah.
44:01
Q: Hello, this is exciting. I have a question about extending this to maybe more open domains. Is the challenge the number of actions to pick, because the stroke space is maybe smaller? What are the other challenges in extending it to open domains?
A: What do you have in mind as open domains? The number of actions is definitely a challenge; it is definitely one of the big challenges, and as far as I know a lot of research in RL goes into that. But that is, I think, only one of the main challenges. The other challenge, of course, is the state representation; that is mainly why we used deep learning, because we expect that with deep learning we are going to be able to learn better representations, and that still remains a challenge, because being able to learn representations is not only an architectural problem, it is also about finding the right training setup. SPIRAL was an example of that, where we can get that reward function, that reward signal, in an unsupervised way, and in many different domains there are many different ways we can do this, but actually finding those solutions is also part of it.
45:20
OK, so let's thank Koray again.
45:24
[Music] [Applause]