Machine Learning : Deep Learning - CNN (Part 27)
What do you guess from this?
Basically, our brain looks for features and, depending on them, tries to decide whether the person is looking to the right or towards us.
Again, this image
your brain might think the girl is looking to the side, or
you might see it as an old lady's face.
Also, this image shows both a duck and a rabbit.
Again, with this image,
our brain can't decide which interpretation is correct.
So, to summarize: our brain looks for features and judges based on them.
Let's now see how a computer classifies these.
The left one is classified as "cheetah" by the computer.
The "cheetah" class got the highest probability.
The second one was classified as "bullet train" by the computer.
What's the last one?
Here, "scissors" was chosen with high probability, but it should have been a hand glass. As you can see, the computer could not classify it correctly because the image was not clear.
CNNs are getting famous because we need them in self-driving cars, photo detection on Facebook and many more applications!!
The godfather of deep learning (ANNs), Geoffrey Hinton, works at Google, and the pioneer of CNNs, Yann LeCun, works at Facebook.
So, how does it work?
It takes an input and gives an output.
For example,
But sometimes it's tough to detect whether a person is smiling or sad. It all depends on the features.
Let's take an image. For example, a black and white one here:
Well, neural networks leverage the fact that a black and white image is a two-dimensional array. What we see on the left is just the visual representation, and for simplicity's sake it's just a two by two picture. In computer terms, it's a two-dimensional array where every single pixel has a value between 0 and 255. That's eight bits of information (two to the power of eight is 256), and the value is the intensity of the color white: 0 is a completely black pixel, 255 is a completely white pixel, and in between you have the grayscale range of possible values for that pixel. Based on that information, computers are able to work with the image. That's the starting point: any image has a digital form, which is basically ones and zeros forming a number from 0 to 255 for every single pixel. The computer doesn't actually work with colors; it works with those numbers at the end of the day. That's the foundation of it all.
And in a colored image, it's actually a three-dimensional array. You've got a red layer, a green layer and a blue layer, which stands for RGB, and each of those colors has its own intensity. So, basically, a pixel has three values assigned to it, each between 0 and 255, and by combining those three values you can find out exactly what color that pixel is.
So that's the foundation of it all, that's the red channel, the green channel, the blue channel.
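To make this concrete, here is a minimal NumPy sketch (with made-up pixel values) of how a grayscale and a colored image look to the computer:
import numpy as np

# a tiny 2x2 grayscale image: one 0-255 intensity value per pixel (0 = black, 255 = white)
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)

# a tiny 2x2 color image: three 0-255 values (R, G, B) per pixel,
# so the array is three-dimensional with shape (height, width, 3)
color = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

print(gray.shape)   # (2, 2)
print(color.shape)  # (2, 2, 3)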
Finally, a very trivial example: a smiling face in computer terms, if we really simplify things.
Instead of having values from 0 to 255, just so that we can understand things better and really grasp the concepts, we're going to say 0 is white and 1 is black.
So we simplify things to the extreme, and you will see that the image can be represented like that.
And the steps that we're going to go through with these images are: Convolution, ReLU, Max Pooling, Flattening and Full Connection.
Convolution Operation
It's a combined operation of two functions; in our case, those are the input image and the feature detector.
We have 2 matrices (the input image and the feature detector matrix).
Now we multiply them, position by position:
From the input image, we take a sub-matrix of the same size as the feature detector. Now we look for 1's in the input image, but they have to be in the same positions as the 1's in the feature detector.
As you can see, in these positions we don't have any matching 1's, so there are 0 matches.
So, we write down 0.
Then we slide to the next box of the input image (same size as the feature detector).
Here we have 1 match, and we write it down on the Feature Map.
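Here is a minimal NumPy sketch of this sliding-window counting (both the input image and the feature detector below are made up for illustration; they are not the exact matrices from the figures):
import numpy as np

# hypothetical 0/1 input image (7x7) and 3x3 feature detector
image = np.array([[0, 0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0, 0],
                  [0, 0, 0, 1, 0, 0, 0],
                  [0, 0, 1, 0, 1, 0, 0],
                  [0, 1, 0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0, 0]])
detector = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

# slide the detector over the image one pixel at a time (stride 1)
h = image.shape[0] - detector.shape[0] + 1
w = image.shape[1] - detector.shape[1] + 1
feature_map = np.zeros((h, w), dtype=int)
for i in range(h):
    for j in range(w):
        patch = image[i:i+3, j:j+3]
        feature_map[i, j] = np.sum(patch * detector)  # count the 1's that line up with the detector's 1's

print(feature_map)  # the largest numbers appear where the pattern matches best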
Are we losing information when we're applying the Feature Detector?
Well, we are losing some information, of course, because we have fewer values in our resulting matrix. But at the same time, the purpose of the Feature Detector is to detect certain features, certain parts of the image that are integral. Think of it this way: the Feature Detector has a certain pattern on it, and the highest number in your Feature Map is where that pattern matches up.
In fact, in our simplified example, the highest number you can get is when the feature matches exactly, and you can see that with the number 4.
Features are how we see things and how we recognize things. We don't look at every single pixel, so to speak, of an image or of what we see in real life; we look at features: the nose, the hat, the feather, the eyes, or the little black marks under the cheetah's eyes that distinguish a cheetah from a leopard.
So, to give an overall picture, this is how it works
The first layer is what we just created by computing the feature map.
So let's say the front one is the feature map we just created. But then how come there are so many of them? We create multiple Feature Maps because we use different filters, and that's another way we preserve lots of the information: we don't have just one Feature Map, we look for several different features, or rather, the network decides through its training which features to look for.
Again, with a different filter or feature detector, we get a new feature map
and that goes on....
So, here is an example image of the Taj Mahal, to which we will apply filters (feature detectors).
Once we apply the Sharpen filter to this image, it sharpens.
It's quite intuitive if you think about it: 5 is the weight for the main pixel in the middle of the filter (the Feature Detector), and the minus 1's around it reduce the surrounding pixels, so the center pixel stands out and the image looks sharper.
Then Blur: it gives equal significance to all of the pixels around the one in the center, so it combines them together and you get a blur.
Edge Enhance: here you can see a -1 and a 1 with 0s around them, so you remove the pixels around the main one in the middle and keep only that contrast, which gives you an edge.
Edge Detect: this one probably makes more sense. You reduce the strength of the middle pixel and increase the strength of the pixels marked with 1's around it, and that gives you the Edge Detection result you can see.
Emboss: the key here is that the filter is asymmetrical, and you can see the image becomes asymmetrical as well, giving the feeling that it's standing out towards you. That's what you get when you have minuses on one side and pluses on the other.
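As a rough sketch, here is how such kernels can be applied to any grayscale image with a plain convolution. The kernel values below are the commonly used ones and the file name tajmahal.jpg is just a placeholder; both may differ from the exact figures above.
import numpy as np
from PIL import Image               # assumes Pillow is installed
from scipy.ndimage import convolve  # assumes SciPy is installed

# typical 3x3 kernels (illustrative values, not copied from the figure)
kernels = {
    "sharpen":     np.array([[ 0, -1,  0], [-1,  5, -1], [ 0, -1,  0]]),
    "blur":        np.ones((3, 3)) / 9.0,
    "edge_detect": np.array([[-1, -1, -1], [-1,  8, -1], [-1, -1, -1]]),
    "emboss":      np.array([[-2, -1,  0], [-1,  1,  1], [ 0,  1,  2]]),
}

img = np.array(Image.open("tajmahal.jpg").convert("L"), dtype=float)  # placeholder file name
for name, k in kernels.items():
    filtered = convolve(img, k)                                # slide the kernel over the image
    filtered = np.clip(filtered, 0, 255).astype(np.uint8)      # keep values in the valid 0-255 range
    Image.fromarray(filtered).save("tajmahal_" + name + ".jpg")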
ReLU
Now, we will apply the rectifier.
And the reason why we're applying the Rectifier is that we want to increase non-linearity in our image, or in our network, in our Convolutional Neural Network. The Rectifier acts as that filter, or that function, which breaks up linearity.
And the reason why we want to increase non-linearity in our network is that images themselves are highly non-linear, especially if you're recognizing different objects next to each other, or objects on different backgrounds, and so on.
Like the image is going to have lots of non-linear elements, and the transition between pixels, adjacent pixels, is often gonna be non-linear. That's because of borders, there's different colors, there's different elements in your images.
But at the same time, when we're applying mathematical operations such as convolution, and running this feature detection to create our feature maps, we risk that we might create something linear, and therefore we need to break up the linearity.
Here is an original image.
Now when we apply a feature detector to this image we get something like this.
Well, when you apply a feature detector to a real image, which is not just zeros and ones but has lots of different values, and since, as we saw previously, feature detectors can have negative values in them, sometimes you will get negative values in the feature map. Here, the black pixels are negative values and the white ones are positive.
And what the Rectified Linear Unit function does is remove all the black: anything below zero, it turns into zero,
and so from this, it turns to this
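A minimal sketch of the rectifier in NumPy, applied to a made-up feature map that contains negative values:
import numpy as np

feature_map = np.array([[ 2, -1,  0],
                        [-3,  4, -2],
                        [ 1, -5,  3]])   # hypothetical values; the negatives are the "black" pixels

rectified = np.maximum(0, feature_map)   # ReLU: anything below zero becomes zero
print(rectified)
# [[2 0 0]
#  [0 4 0]
#  [1 0 3]]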
But why ReLU? Why do we need to break up the linearity?
Here you can see bright, then darker, darker, darker, and so on. This part looks like it's linear (a gradual progression).
Then you break it up like that.
So, linearity gets broken.
Max Pooling
On the 1st image, the cheetah is positioned properly and looking straight at you. On the 2nd image, it's a bit rotated, and the 3rd image is a bit squashed. The thing here is that we want the neural network to be able to recognize the cheetah in every single one of these images.
There are lots of little differences, and so if the neural network looks for exactly a certain feature, it can miss it. For instance, a distinctive feature of the cheetah is the tear marks on its face: the dark pattern that runs from its eyes down the sides of its nose, which looks like tears.
But if it's looking for that feature which it learned from certain cheetahs, in an exact location or an exact shape or form or texture, it will never find these other cheetahs. So, we have to make sure that our neural network has a property called spatial invariance, meaning that it doesn't care where the features are located, not so much as in which part of the image because we've kind of taken that into consideration with our map, with our convolution layer.
But it doesn't have to care if the features are a bit tilted, if the features are a bit different in texture, if the features are a bit closer or if the features are a bit further apart, relative to each other.
So, if the feature itself is a bit distorted, our neural network has to have some level of flexibility to be able to still find that feature and that is what pooling is all about. So, let's have a look at how pooling works.
How does it work?
We have got the feature map, and now we take a 2*2 box and check what the max value in it is. We got 1 as the max value, so let's add that to the Pooled Feature Map.
Then it does the same for the next box,
And finally,
First of all, we were still able to preserve the features. Because we know how the convolution layer works, we know that the largest numbers in the feature map represent where we actually found the closest similarity to our feature.
But by then pooling these features, we are, first of all, getting rid of 75% of the information that is not the feature, not the important thing we're looking out for, because we're disregarding 3 values out of 4 and only keeping 25%.
Then also because we are taking the maximum of the pixels or the values that we have, we are therefore accounting for any distortion.
And in addition to all of that, we're reducing the size, so there's another benefit.
So, we're preserving the features, we're introducing spatial invariance, and we're reducing the size by 75%, which is huge and will really help in terms of processing. Moreover, by reducing the number of parameters, pooling also helps to prevent overfitting, which is a very important benefit.
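A minimal NumPy sketch of 2*2 max pooling with a stride of 2, on a made-up feature map:
import numpy as np

feature_map = np.array([[0, 1, 0, 0],
                        [1, 1, 0, 1],
                        [0, 0, 1, 0],
                        [0, 1, 0, 4]])   # hypothetical values

h, w = feature_map.shape
pooled = np.zeros((h // 2, w // 2), dtype=feature_map.dtype)
for i in range(0, h, 2):                 # step of 2 = stride of 2
    for j in range(0, w, 2):
        pooled[i // 2, j // 2] = feature_map[i:i+2, j:j+2].max()  # keep only the max of each 2*2 box

print(pooled)
# [[1 1]
#  [1 4]]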
Here is a website you can use to draw numbers between 0 and 9 and see all of the layers.
Here, the pooling step is shown as "Downsampling".
Flattening
Now, using our pooled feature maps, we can flatten them into a single row.
And the reason for that is because we want to later input this into an artificial neural network (ANN) for further processing.
And with lots of maps, we get this
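Below is a tiny sketch of what flattening does, with made-up numbers:
import numpy as np

pooled_map = np.array([[1, 1],
                       [1, 4]])                       # hypothetical pooled feature map

print(pooled_map.flatten())                           # [1 1 1 4] - one long vector for the ANN

# with many pooled feature maps, each one is flattened and they are all joined into one input vector
maps = [pooled_map, pooled_map * 2]
print(np.concatenate([m.flatten() for m in maps]))    # [1 1 1 4 2 2 2 8]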
So, to sum up, this is the process we have covered so far:
We take an image and apply filters (feature detectors) to get one feature map, and then we use multiple filters to get lots of maps and create the convolution layer.
Then we apply max pooling to each feature map and downsize them individually; that's why we see a much smaller size after pooling. Finally, we flatten them to create a vector of inputs.
After that, we apply an ANN to all of the inputs.
So, we get an output, and then we readjust the weights and the filters (feature detectors).
Take an example of detecting an animal, where the model thinks it can be either a dog or a cat.
Assume it detects that the animal is a dog.
The neurons there, with values from 0 to 1, indicate how strongly they fire and which output they vote for.
Here 1, 1, 0.9 are the highest values, and these neurons are the ones working to make dog the output.
When the output is Cat, these 3 neurons contribute the most.
So, we got this
Let's see what happens with an image of a dog.
Now, we don't have a final output yet, but we have values at the neurons indicating which output to trigger. The dog output listens to its 3 neurons, which give it high values: 1, 1 and 0.8. Combining them, the dog output ends up at 0.95.
And the cat output gets 0.2, 0.8 and 0.1 as its 3 values
So, combining them, the cat output ends up at only 0.05.
So, it finally detects a Dog, with the highest probability.
Again, for an image of Cat, we get this
The cat output's 3 neurons give 1, 1 and 0.4, and combining them it ends up at 0.79.
So, this is how the network turns the image we shared earlier into probabilities and detects an object.
So, we can summarize the process like this
You may read this: The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3).
Applying Softmax:
What we have here is the convolutional neural network that we built in the main part of the section, and at the end it pops out some probabilities: 0.95, or 95%, for a dog and 0.05, or 5%, for a cat, given the photo on the left as an input.
This is after training has been conducted: the network is running and classifying a certain image. And so the question here is, how come these two values add up to one? Because, as far as we know from everything we've learned about neural networks, there is nothing to say that these two final neurons are connected to each other.
So how would they know what the value of the other one is, and how would they know to make their values add up to one? Well, the answer is they wouldn't, in the classic version of an artificial neural network. The only reason they do is that we introduce a special function, called the softmax function, to help us out of this situation.
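As a quick sketch, softmax exponentiates each raw output and divides by the sum of the exponentials, so the outputs always add up to 1 (the raw scores below are made up):
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

raw_scores = np.array([2.0, -1.0])   # hypothetical raw outputs for [dog, cat]
probs = softmax(raw_scores)
print(probs)                         # approximately [0.95 0.05]
print(probs.sum())                   # 1.0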
Applying cross entropy to the dog image result:
We pass 1, 0 as the target, since we know this should be a dog image. So, to mean dog, we give 1 (and 0 for cat).
We apply cross-entropy with it: the predicted 0.9 goes to q and the target 1 goes to p; likewise, the predicted 0.1 goes to q and the target 0 goes to p.
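A minimal sketch of that cross-entropy calculation, H(p, q) = -sum(p * log(q)), for the dog example above:
import numpy as np

p = np.array([1.0, 0.0])    # target distribution: [dog, cat]
q = np.array([0.9, 0.1])    # the network's predicted probabilities

cross_entropy = -np.sum(p * np.log(q))
print(cross_entropy)        # about 0.105; a perfect prediction would give 0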
Let's give another example, with 3 images
Two of our neural networks give us these results:
You can see that for the first 2 images, both neural networks (NN1 and NN2) guessed right.
But for the last image, both networks' values were poor and the guess was wrong.
Let's put the results in a table for each neural network.
Now, if we want to see how many they guessed wrong using the classification error,
it gives 33% for both, but we can see that NN1 always assigned a higher probability to the correct result. For example, when the image was a dog (the 1st one),
NN1 gave 0.9 for dog and 0.1 for cat, whereas NN2 gave 0.6 for dog and 0.4 for cat. So, they both put more probability on dog, but 0.9 > 0.6.
So, NN1 outperformed NN2.
Using Mean squared error and cross entropy
Now both metrics capture the difference and show that NN1 performed better.
Cross-entropy in particular gives us a better signal for judging which neural network performed well.
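A sketch of how the three measures compare. Only the first row of predictions (0.9/0.1 for NN1 and 0.6/0.4 for NN2) comes from the text above; the other rows are made-up placeholders just so the code runs:
import numpy as np

# targets for 3 images as [dog, cat]; assume the 3rd image is the one both networks get wrong
targets = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)

nn1 = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]])  # only the 1st row is from the text
nn2 = np.array([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])  # only the 1st row is from the text

def classification_error(t, q):
    return np.mean(np.argmax(q, axis=1) != np.argmax(t, axis=1))

def mean_squared_error(t, q):
    return np.mean((t - q) ** 2)

def cross_entropy(t, q):
    return -np.mean(np.sum(t * np.log(q), axis=1))

for name, q in [("NN1", nn1), ("NN2", nn2)]:
    print(name, classification_error(targets, q), mean_squared_error(targets, q), cross_entropy(targets, q))
# both show 33% classification error, but MSE and cross-entropy come out lower (better) for NN1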
Read more about cross entropy loss
Let's now code this up.
Problem statement
We have 3 folders
training_set consists of lots of cat and dog images.
test_set has some cat and dog images to check whether, and how well, we classify them.
single_prediction has two images: 1 cat and 1 dog. It's basically the production folder, used at the end to detect whether a single image is a cat or a dog.
Download the dataset folder
Importing the libraries
import tensorflow as tf #for deeplearning
from tensorflow.keras.preprocessing.image import ImageDataGenerator #for image processing
Data Pre-processing
We will apply transformations to avoid overfitting.
Otherwise, we would get very high accuracy on the training set, close to 98%, and much lower accuracy on the test set. That is called overfitting, and that's something we absolutely need to avoid.
So basically we're going to apply some geometrical transformations, like transvections (shears) to shift some of the pixels; we're going to rotate the images a bit, do some horizontal flips, and zoom in and out a little. We're going to apply a series of transformations to modify the images and get them, as we say, augmented. In fact, the technical term for what we're doing here, with all these transformations, is image augmentation.
Let's check Keras API
Check this out. It has been deprecated now!!
We take a part of sample code from this blog
Now let's copy and paste this part
Let's copy this part as well
Let's set img_height and img_width to 64 and batch_size to 32.
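For reference, a sketch of what that copied snippet ends up looking like with our values plugged in. The folder path 'dataset/training_set' follows the folder structure described above, and the augmentation parameters are the usual ones from the Keras example, so they may differ slightly from the blog's exact code:
train_datagen = ImageDataGenerator(rescale=1./255,       # scale pixel values to [0, 1]
                                   shear_range=0.2,      # transvections (shears)
                                   zoom_range=0.2,
                                   horizontal_flip=True)
training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size=(64, 64),   # img_height, img_width
                                                 batch_size=32,
                                                 class_mode='binary')    # two classes: cats and dogs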
We are done with the preprocessing for the training data.
Now, let's do it for the test data.
Note: We won't do the image augmentation for the test data, as our goal is to check what results we get on untouched test data. If we made those changes, the results would surely be in our favor. That's cheating!
We will just rescale the pixels, but not apply the data augmentation.
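A sketch of the corresponding test-set code, under the same assumption about the folder path:
test_datagen = ImageDataGenerator(rescale=1./255)        # only rescaling, no augmentation
test_set = test_datagen.flow_from_directory('dataset/test_set',
                                            target_size=(64, 64),
                                            batch_size=32,
                                            class_mode='binary')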
Building the CNN
Initializing a cnn object of the Sequential class, just like we did for the ANN.
cnn = tf.keras.models.Sequential() # cnn object created as an instance of the Sequential class
Now we will add the convolutional layer. For that we will use the Conv2D class.
cnn.add(tf.keras.layers.Conv2D(filters=32,kernel_size=3,activation='relu',input_shape=[64,64,3]))
There are various CNN architectures, but following the classical one, we will use 32 filters in this first layer.
Also, we set kernel_size to 3, which means each filter (feature detector) will be of size 3*3.
We also use an activation function after that, which is 'relu'. Finally, we set the image height and width to 64, and as this is a color image, the last value is 3 (the RGB channels).
so, input_shape=[64,64,3]
Pooling
This time we are using the MaxPool2D class.
Just a reminder
Here we used a 2*2 box to form the pooled feature map, so pool_size will be 2.
Again, we shifted 2 pixels each time to move to a new box.
So, strides will be 2
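So the pooling layer for this first block looks like this (it's the same line that reappears for the second block below):
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2,strides=2))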
Now, adding the second convolutional layer.
We can just copy-paste the code for the 1st convolutional layer and its pooling,
but since we are no longer taking the raw input (we already took the input in the 1st layer),
we will remove input_shape.
cnn.add(tf.keras.layers.Conv2D(filters=32,kernel_size=3,activation='relu'))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2,strides=2))
Flattening
We will use the Flatten class now.
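Following the same pattern as the layers above, that gives:
cnn.add(tf.keras.layers.Flatten())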
Full connection
Now we will apply the ANN
In an ANN, we use Dense layers.
Here we are taking a larger number of neurons, as we need more computational power. So, units=128.
Again, we will apply the rectifier activation function, which we write as 'relu'.
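Putting those together, the fully connected layer is added like this:
cnn.add(tf.keras.layers.Dense(units=128,activation='relu'))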
Final output layer
We will use the Dense class again, as the output layer is fully connected to the previous fully connected layer.
So, we just copied that
Now, since the output is binary (dog or cat), we only need one output neuron,
so we will set units=1.
And for binary classification, we use the sigmoid activation function.
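So the output layer becomes:
cnn.add(tf.keras.layers.Dense(units=1,activation='sigmoid'))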
Train the CNN
Compiling the CNN:
We are using the adam optimizer, the 'binary_crossentropy' loss function (we explained earlier why we use cross-entropy), and 'accuracy' as the metric.
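In code, that is:
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])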
Now training
We have provided training_set for training, passed test_set as validation_data, and set epochs=25 (to keep training reasonably fast).
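Given how the generators were named above, the training call looks like this:
cnn.fit(x=training_set, validation_data=test_set, epochs=25)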
Single prediction
We have 2 images, and we will now see what the results are.
import numpy as np
from tensorflow.keras.preprocessing import image # importing the image module
#load image from the folder
test_image = image.load_img('dataset/single_prediction/cat_or_dog_1.jpg', target_size = (64, 64)) # the image location, with target_size = (64, 64) [Note: the image size has to be the same as the one we trained on]
# the predict method expects an array as input, so we convert the PIL image to a NumPy array
test_image = image.img_to_array(test_image)
We have trained on batches: batch number 1 had 32 images, batch number 2 had 32 images, and so on. So, we need to add an extra batch dimension.
test_image = np.expand_dims(test_image, axis = 0) # adding the batch dimension as the 1st dimension
result = cnn.predict(test_image/255.0) # normalize the pixel values the same way as during training
training_set.class_indices # tells us which index corresponds to cat and which to dog
The result now has the batch dimension, so we access the first (and only) element of the batch with result[0], and then the single prediction inside it with result[0][0]. We are looking at the prediction as a probability: if it is more than 50%, it will be a dog.
if result[0][0] > 0.5: # prediction on the basis of probability
prediction = 'dog'
else:
prediction = 'cat'
Let's download Anaconda now,
then install TensorFlow in the Anaconda prompt.
Let's download the code from Google Colab and keep the dataset in the same folder. Then let's open a Jupyter notebook.
So, with this image
we get this one
Also, for the cat image, we get this:
Done!!
Try the code