TinyML (Part 6): Introducing Convolutions
So far, we've learned about basic neural networks.
But with images like these, things can get much more complex. We're zoomed in on the horse, for example. The image may not take up the full frame, either, or it could be off center. Using convolutions, or filters, can help with this problem, and we'll look at how they work next.
So for example, if you take a look at this image from Fashion-MNIST, it's a 28-by-28 grayscale image of an ankle boot. We can use this to demonstrate a very simple filter. For every pixel in the image, we'll take its immediate neighbors. So for example, if the current pixel has value 192, we can see that its neighbors above are 0, 64, and 128, and its neighbors below are 142, 226, and 168. We can then define a filter where we multiply each pixel by the respective filter value. So the pixel above and to the left, which is zero, is multiplied by the top-left filter value, which is minus one. That gives us zero. We do the same for each pixel, and we sum up the values. This sum becomes the new value for our current pixel instead of 192.
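Here's a minimal NumPy sketch of that single-pixel calculation. The left and right neighbors and most of the filter values aren't given above, so the ones shown here are made up just to illustrate the multiply-and-sum step.

```python
import numpy as np

# A minimal sketch of the single-pixel filter computation described above.
# The left/right neighbors and the filter values (other than the -1 in the
# top-left corner) are not given in the text, so they are assumptions here.
neighborhood = np.array([
    [  0,  64, 128],   # row above the current pixel
    [ 48, 192, 144],   # current pixel (192) with assumed left/right neighbors
    [142, 226, 168],   # row below the current pixel
])

filter_3x3 = np.array([
    [-1, 0, 1],        # top-left value is -1, as in the example
    [-2, 0, 2],        # remaining values are illustrative only
    [-1, 0, 1],
])

# Multiply element-wise and sum: this sum replaces the current pixel's value.
new_value = int(np.sum(neighborhood * filter_3x3))
print(new_value)
```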
This might seem really unusual, but it's actually a common tool in image processing. And if you've ever done any kind of image processing such as what you might do on Instagram or Photoshop, that's pretty much how it works, by having a filter and applying that filter to the image. Here's a simple example.
Consider the image on the left. If I apply the filter shown to it, I'll get the results on the right. It greatly enhances the vertical lines and dims everything else. So you could consider this to be a vertical line detector.
And similarly, this filter can spot horizontal lines, darkening almost everything else in the image that isn't a horizontal line. By applying filters like these, you can remove almost everything but a distinguishable feature. And this process is called feature extraction.
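To see this in code, here's a rough sketch that applies hand-written vertical and horizontal line detectors to a Fashion MNIST image. The exact filter values used in the figures aren't given, so Sobel-style kernels stand in for them.

```python
import numpy as np
import tensorflow as tf
from scipy.signal import convolve2d

# A rough sketch of filtering one Fashion MNIST image with hand-written
# vertical and horizontal line detectors. The filter values here are
# stand-ins, not the exact ones used in the figures.
(train_images, _), _ = tf.keras.datasets.fashion_mnist.load_data()
image = train_images[0].astype(np.float32)

vertical_filter = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]], dtype=np.float32)
horizontal_filter = vertical_filter.T   # transposing detects horizontal lines

# "same" keeps the output the same size as the input image.
vertical_lines = convolve2d(image, vertical_filter, mode="same")
horizontal_lines = convolve2d(image, horizontal_filter, mode="same")
```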
When combined with a process called pooling, we can also do something really powerful. Pooling is simply the process of removing pixels while maintaining important information. So for example, if you look at the four-by-four square of values on the left of the screen, these emulate what pixels in an image might look like. We can break this down into four two-by-two blocks of pixels, and then we can see that in the center of the screen. So our 0, 65, 48, 192 from the top left becomes the first block. The 128, 128, 144, 144 from the top right becomes the second block, and so on.
From each of these blocks, we pick the biggest value and we throw the rest away. We then reassemble these, and we have a new set of four pixels, which was pooled from the original set of 16. Now, where this gets really interesting and powerful is if we apply it to an image after filtering the image. It can have the effect of enhancing the features that we extracted. And not only that, if you're applying many filters to your image in a layer, you're in effect making many copies of your image. So if it's a large image, you end up with a lot of data flowing through your network. So it's good to have a way to compress that data without losing the important features. And this is just one layer. When you apply multiple layers of filters, then really complex features, such as faces or hands instead of the vertical or horizontal lines I've shown here, could be spotted and extracted.
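Here's a minimal NumPy sketch of that 2-by-2 max pooling on the four-by-four example. Only the top two blocks of values are given above, so the bottom rows are made up.

```python
import numpy as np

# A minimal sketch of 2x2 max pooling on the 4x4 example. Only the top two
# blocks (0, 65, 48, 192 and 128, 128, 144, 144) come from the text; the
# bottom two rows are assumed values to complete the example.
pixels = np.array([
    [  0,  65, 128, 128],
    [ 48, 192, 144, 144],
    [ 30,  10,  70,  20],   # assumed values
    [  5,  25,  90,  60],   # assumed values
])

# Split into 2x2 blocks and keep only the largest value from each block.
pooled = pixels.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[192 144]
#  [ 30  90]]
```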
Recall the image earlier where we had that filter that extracted vertical lines. After pooling, the image will look like the one on the right. We can see that those lines pop a lot more. And importantly, the image, of course, is now one quarter the size of the original.
How does this impact computer vision? Well, let's consider a scenario. Here's a picture, and you don't know its content. I've simulated that by making the image really blurry.
So say there's a filter or a set of filters that can be passed over the image that extracts these pixels. It's two vertical shapes, and they look a little bit like human legs.
And then maybe other filters can extract features like these. They're blobs with cylinders that stick out of them. Now our human brain sees these instantly as hands, but an untrained network doesn't know that yet. It just knows that a filter can extract something that looks like this.
And yet another filter extracts this, roughly a circle with another two circles on it as well as some parallel features. Our human brain recognizes this as a face with eyes and a mouth. But, again, the computer has no context for what these are. It only knows that a particular filter could extract this.
So then the computer could match the features that were extracted by the first filter-- in other words, the legs-- with those from the second filter, also known as the hands, and those from the third filter, also known as the face. And if all of these features are present in a picture that is labeled human, then when training a neural network, we could have it learn the filters that extract information that's consistently present in images that are labeled human. So then these could be used to spot future pictures to see if they contain a human.
By learning sets of filters that can then spot human or horse, we now have the beginnings of a computer vision model that can handle sophisticated images and predict what's within them.
So to recap, filters or stacks of filters can extract features like hands or ears. And then during training, they can be combined with the labels of the known images to help us train a network that can predict image contents.
So our model, instead of just weights and biases, can now also learn filters. So when we pass in data like an image, we can predict its contents. Remember, the network learned filter values such that, if those filters extract features and the features are present in the image, it can match them to a label. So now our inferences are going way beyond just y equals mx plus b. So in this case, we know we have a human.
If we return to the machine learning paradigm diagram, we remember it's looping through, making a guess, measuring our accuracy, and then optimizing our guess repeatedly.
In this case, instead of weights and biases, we initialize the filter values randomly, and applying these to the images will give us our guesses.
By measuring our guesses as to the class of the image based on the extracted features against the ground truth, we can now measure our accuracy.
And as before, we can use an optimizer to tweak our filter values, knowing that we'll be stepping in the right direction before we repeat the process.
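As a rough illustration of that loop, here's a conceptual sketch using TensorFlow with randomly generated placeholder data. The layer sizes, optimizer, and data here are arbitrary choices for illustration, not the course's model.

```python
import tensorflow as tf

# A conceptual sketch of the guess/measure/optimize loop with learned filters.
# The data and layer sizes are placeholders, not the course's actual model.
images = tf.random.uniform((32, 28, 28, 1))                   # pretend batch of images
labels = tf.random.uniform((32,), maxval=10, dtype=tf.int32)  # pretend labels

conv = tf.keras.layers.Conv2D(8, (3, 3), activation="relu")   # filters start random
flatten = tf.keras.layers.Flatten()
head = tf.keras.layers.Dense(10, activation="softmax")

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

for step in range(10):
    with tf.GradientTape() as tape:
        guesses = head(flatten(conv(images)))         # guess: apply filters, classify
        loss = loss_fn(labels, guesses)               # measure: compare with ground truth
    variables = conv.trainable_variables + head.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))  # optimize: tweak the filter values
```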
So if my filters end up turning my image into these three images, where I have extracted features that look like legs, hands, and a head, and this image is labeled human, then these filters will be associated with human. They can then be used to predict the contents of new pictures. In this case, we can see that he's human, but the computer can only recognize that by applying the filters, extracting features from the image, and checking whether the expected features are there.
Check out this code
From DNN to CNN
For a recap, here's our DNN code that we've been using to classify MNIST or Fashion MNIST images. For this example, I'll use the Fashion MNIST version, which is a little bit more sophisticated than the handwritten digits and makes it harder for us to get a high level of accuracy.
The model architecture is here. It's a simple sequential model, which flattens the input and passes it through dense layers.
When we train a model with it, we'll be able to see the accuracy and validation accuracy reported back to us. And in this case, after 20 epochs, I got an accuracy of 89.53% and a validation accuracy of 86.77%. These results are great.
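For reference, here's a sketch of a DNN along those lines. The hidden-layer size and optimizer are assumptions rather than the course's exact code.

```python
import tensorflow as tf

# A sketch of the DNN described above; the 128-neuron hidden layer and the
# choice of optimizer are assumptions, not necessarily the course's exact code.
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=20,
          validation_data=(test_images, test_labels))
```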
But we might be able to improve on them using convolutions. So let's take a look.
Now, here's the complete code for an updated neural architecture that uses a convolutional neural network instead of a straight-up DNN.
First, we can see that on our input layer, we need to use an input shape, but it's a little different. It's now 28 by 28 by 1, instead of 28 by 28. This is because a convolutional layer expects the image dimensions to be fully defined. So a color image will have three channels for red, green, and blue, making it something by something by 3. But a monochrome image only has one channel, so it's specified here. Our image is 28 by 28 with one channel, so it's 28 by 28 by 1.
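For example, continuing from the Fashion MNIST arrays loaded in the earlier sketch, the extra channel dimension can be added with a simple reshape (a hypothetical snippet, assuming NumPy arrays named train_images and test_images):

```python
# Conv2D expects an explicit channel dimension, so the 28x28 grayscale
# images are reshaped to 28x28x1 before training.
train_images = train_images.reshape(-1, 28, 28, 1)
test_images = test_images.reshape(-1, 28, 28, 1)
```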
Our output layer will be 10 neurons as before, because we have 10 classes.
We'll also have a flatten before we get into the DNN. This will occur after the filters have done their job. The filters will act on the image as we saw previously, so their output will still be a rectangular image. Once they've given us updated pixel values, we'll flatten that out before we get into the DNN. And then the network can behave as before.
So let's now look at how to define the convolutional neural network. First, we'll specify the convolutional layers using Conv2D. It's 2D because the images have a width and a height, and we'll pass filters across them to change the image and extract features.
Remember, these filter values will be learned over time.
The pooling that we described previously is implemented by specifying a max pooling layer. There are different types of pooling, such as max, min, and average. But for this, we'll use max pooling, where we take the largest value from the pool.
Looking at the convolutional layers, we can explore the parameters. The first parameter in the convolutional layer is the number of filters to apply to the image at this layer. So in this case, we have 64 filters that will be applied to the image. Remember that the filter values are going to be randomly initialized as our guess, and then they will be learned over time.
The (3, 3) here indicates the size of the filter. Recall that when we were talking about filters, we defined a 3-by-3 filter that multiplied across the current pixel and each of its neighbors. This could be a different value, for example, 5 by 5 to look at neighboring pixels up to two pixels away, and so on. But it will always be an odd number: 3 by 3, 5 by 5, 7 by 7.
And it's similar for the pools, where the size of the pool is specified as a parameter, which in this case is a 2-by-2 pool. So we'll be picking one pixel out of every four. If we want different-size pools, we can define them here. Remember that by pooling, we're also reducing the size of our image, helping us to emphasize and summarize the features that we extracted. The convolutional layer had 64 filters in it, which means 64 copies of the image are going to be made. So reducing the size of them will reduce the amount of data flowing through our network.
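Putting the pieces together, here's a sketch of the convolutional model as described: 64 learned 3-by-3 filters, 2-by-2 max pooling, then flatten and dense layers. The hidden dense-layer size and optimizer are assumptions, and it reuses the reshaped train_images and test_images from the earlier snippets.

```python
import tensorflow as tf

# A sketch of the CNN as described above: 64 learned 3x3 filters, 2x2 max
# pooling, then the same flatten-and-dense classifier as before. The hidden
# layer size and optimizer are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=20,
          validation_data=(test_images, test_labels))
```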
Try the code
In this Colab you’ll explore the power of Convolutional Neural Networks (CNNs). You’ll train both a traditional DNN and a CNN and see how CNNs can far outperform standard networks on computer vision tasks. You’ll then dive into both the convolutional and max pooling layers that power the CNN.
Try this one
In this Colab you’ll build off of the prior Colab exercise using CNNs to learn the Fashion MNIST dataset, and you’ll start to visualize the convolution and pooling layers to better understand how the model “sees” the world. Hopefully this will give you more insight into how neural networks see the world as compared to how you see the world!