Machine Learning: Mixture Density Network (MDN)-RNN (41)
Make sure that you have already gone through the RNN blog.
Once done, let’s start.
A baseball player gets about 0.39 seconds to hit a ball, whereas human reaction time is roughly 0.25 seconds.
Even with those 0.14 seconds to spare, it’s not easy to react: you have to move your hands and the bat with them, and there are so many things to do at once. To solve this, baseball players start moving the bat earlier, saving time so they can focus on the ball and its direction the moment it appears.
That’s what we can do using an MDN-RNN: predict what’s coming so we can act ahead of time.
Here we can see the RNN receiving the hidden state (h) at each step and passing it on to the next step, and so on.
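To make that hand-off concrete, here is a minimal sketch of one vanilla RNN step in NumPy. All sizes and weight values are made up purely for illustration:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: mix the current input with the previous
    hidden state and squash through tanh to get the new hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(10, 16))  # input -> hidden weights
W_hh = 0.1 * rng.normal(size=(16, 16))  # hidden -> hidden weights
b_h = np.zeros(16)

h = np.zeros(16)                        # initial hidden state
for x_t in rng.normal(size=(5, 10)):    # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # h is handed to the next step
```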
Let’s play a game
Predict the next word:
Can’t guess? Let’s add more words.
Now you can predict some words here.
Many words could fit, but at least now the sentence makes sense, right?
So, just like that, we need words from the past (“Too cool”) and the present (“for”) to predict the future (“school”).
But why MDN in this architecture?
An MDN helps us predict non-deterministic values, which means we don’t want to fix a single answer. In the example above, suppose you used “school” to fill the gap and your friend said it should be “office”.
Is that wrong? Nope!!
That answer works too! If we fix a deterministic output, we cut off those other valid options.
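Here is a toy way to see “don’t fix the answer” in code, with made-up words and probabilities. Instead of always taking the single most likely word, we keep a whole distribution and sample from it:

```python
import numpy as np

rng = np.random.default_rng(42)
words = ["school", "office", "work"]   # hypothetical candidates
probs = [0.6, 0.3, 0.1]                # hypothetical model probabilities

# Deterministic: always "school" -- the other valid answers are lost
print(words[int(np.argmax(probs))])

# Non-deterministic: usually "school", sometimes "office" or "work"
for _ in range(5):
    print(rng.choice(words, p=probs))
```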
So, where does this fit when we want to use it in a World Model?
The Variational Autoencoder (V) is the spatial representation: it deals with how the environment looks and what the space could look like, and it gives us many different possibilities for the environment through the latent vector and the distribution we mapped it onto.
The MDN-RNN (M), on the other hand, gives us options and variations over time: how the future might look and what different things might happen next.
So together they model space-time: V is space, M is time, and both of them output distributions. They are non-deterministic the way we built them, so they give us different possibilities for space and time, and that is what lets us train our agent inside them.
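Putting the two together, one rollout step of a World Model agent flows roughly like this. This is only a data-flow sketch: the stub functions stand in for the real convolutional VAE (V), LSTM-based MDN-RNN (M), and controller (C), and every size here is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(obs):                 # V: observation -> latent vector z
    return obs[:32]                  # stub: pretend the first 32 dims are z

def mdn_rnn_step(z, action, h):      # M: update the belief about the future
    return np.tanh(0.5 * h + 0.1 * z.sum() + 0.1 * action.sum())

def controller(z, h):                # C: act on space (z) plus time (h)
    return np.tanh(z[:3] * h.mean()) # stub 3-dim action (steer/gas/brake)

h = np.zeros(64)                     # M's hidden state: the "time" part
for _ in range(10):                  # a 10-step rollout
    obs = rng.normal(size=1024)      # fake environment frame
    z = vae_encode(obs)              # space: what the world looks like now
    action = controller(z, h)        # decide using space + time together
    h = mdn_rnn_step(z, action, h)   # time: what might happen next
```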
Let’s see a demo:
If we only use V (the Variational Autoencoder), we can watch a racing car drive.
As you can see, the car drives jerkily, wobbling from left to right, and it misses the sharp corners.
But once we add the MDN-RNN (Model M) on top of Model V, the result is much smoother: it’s not as shaky or wobbly, and at the same time it takes the sharp corners better, because it can predict what’s about to happen in the future.
Let’s understand the theory now:
Firstly, let’s remember how a neural network works:
Let’s assume that we are passing this image of a dog through the neural network.
Now we get the dog’s predicted weight from the neural network.
But it’s not correct!
Then backpropagation happens, and more and more images are used for training.
Gradually, the model predicts more accurate results. In the last image you can see that the model predicted 5.5 lb, and the dog actually weighs around 5.7 lb.
Great!!
This is how a model works.
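In code, that predict → compare → backpropagate loop might look like this minimal PyTorch sketch. The “dog weight” data is fabricated, and a single number stands in for the image:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Fabricated toy data: one input feature -> weight in pounds
x = torch.rand(64, 1) * 10
y = 0.5 * x + 1.0 + 0.2 * torch.randn(64, 1)  # noisy "true" weights

model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # how wrong is the prediction?
    loss.backward()              # backpropagation
    optimizer.step()             # nudge the weights a little

print(model(torch.tensor([[5.0]])).item())  # prediction for a new dog
```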
But what if, instead of one exact value, the network gave us a range?
Here the network gives us mu (μ = 5.5) as the weight and sigma (σ = 0.4) to define the range. Then we can say with 68% confidence that the weight of this dog is between 5.1 and 5.9 pounds (μ ± σ). If we go even further and take two standard deviations each way, we can say with 95% confidence that the weight of this dog is between 4.7 and 6.3 pounds (μ ± 2σ).
So, we are asking our neural network to output two values, μ and σ.
Using those two values, we can turn a single prediction into a range of likely results.
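A sketch of that two-output network in PyTorch: one head for μ and one for σ (kept positive by exponentiating), trained with the Gaussian negative log-likelihood instead of plain MSE. All names and shapes are illustrative:

```python
import torch
import torch.nn as nn

class MuSigmaNet(nn.Module):
    def __init__(self, in_dim=1, hidden=16):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)         # center of the range
        self.log_sigma = nn.Linear(hidden, 1)  # log keeps sigma positive

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), torch.exp(self.log_sigma(h))

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2), up to a constant."""
    return (torch.log(sigma) + 0.5 * ((y - mu) / sigma) ** 2).mean()

net = MuSigmaNet()
mu, sigma = net(torch.rand(8, 1))
loss = gaussian_nll(torch.rand(8, 1) * 10, mu, sigma)  # train on this
# 68% band: mu ± sigma; 95% band: mu ± 2*sigma (e.g. 5.5 ± 0.4 -> 5.1..5.9)
```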
In an MDN, we assume that any general distribution (purple) can be broken down into a mixture (red and blue) of Gaussian (normal) distributions.
Since we can divide one distribution into multiple distributions, here is what we can do to our output.
Earlier we had two output nodes (μ and σ), and now we have two values for each component (μ1, μ2, σ1, σ2). What about alpha1 and alpha2?
They are the weights with which the distributions are added together to get the general distribution: the mixture model is alpha1 times the first distribution plus alpha2 times the second, p(y) = α1·N(y; μ1, σ1) + α2·N(y; μ2, σ2). And we could have many more distributions, each with its own weight.
So, this is how our input image ends up as a mixture density graph.
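Here is a sketch of an MDN head in PyTorch, assuming K = 2 components as in the example: the network outputs α (through a softmax, so the weights sum to 1), one μ, and one σ per Gaussian, and is trained by minimizing the negative log-likelihood of the mixture:

```python
import math
import torch
import torch.nn as nn

K = 2  # number of Gaussians in the mixture

class MDNHead(nn.Module):
    def __init__(self, in_dim=16, k=K):
        super().__init__()
        self.alpha = nn.Linear(in_dim, k)      # mixture weights
        self.mu = nn.Linear(in_dim, k)         # one mean per Gaussian
        self.log_sigma = nn.Linear(in_dim, k)  # one spread per Gaussian

    def forward(self, h):
        alpha = torch.softmax(self.alpha(h), dim=-1)  # weights sum to 1
        return alpha, self.mu(h), torch.exp(self.log_sigma(h))

def mdn_nll(y, alpha, mu, sigma):
    """-log p(y), with p(y) = sum_k alpha_k * N(y; mu_k, sigma_k)."""
    log_n = (-torch.log(sigma)
             - 0.5 * ((y - mu) / sigma) ** 2
             - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(torch.log(alpha) + log_n, dim=-1).mean()

head = MDNHead()
alpha, mu, sigma = head(torch.randn(8, 16))  # 8 hidden states of size 16
loss = mdn_nll(torch.randn(8, 1), alpha, mu, sigma)
```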
Read more:
Mixture Density Networks
Probabilistic Forecasting of Household Electrical Load Using Artificial Neural Networks
Let’s draw something
Go to the page
We have several demos here (sketch-rnn demo, interpolation demo, variational autoencoder demo, etc.).
Go here and choose anything, like cat, book, or something else. Draw a small portion, and the AI will try to complete the image.
Yes, it’s not perfect, but it uses an RNN to complete the drawing.
I just chose cat from the dropdown and drew a circle (I know it’s not a perfect circle, haha). Look what the model drew for me!
This page has a paper associated with it as well; it’s worth a read.