Machine Learning : Deep Learning - CNN (Part 27)
What do you guess from this?
Basically, our brain looks for features and, depending on them, tries to decide whether the person is looking to the right or towards us.
Again, this image
your brain might think the girl is looking to the side, or
you might see it as an old lady's face.
Also, this image shows both a duck and a rabbit.
Again, with this image,
our brain can't decide which interpretation is correct.
So, to summarize: our brain looks for features and judges based on them.
Let's now see how a computer classifies these.
The left one is classified as "cheetah" by the computer.
The "cheetah" class got the highest probability.
The second one was classified as "bullet train" by the computer.
What's the last one?
Here, "scissors" was chosen with high probability, but it should have been a hand glass. As you can see, the computer could not classify it correctly because the image was not clear.
CNNs are getting famous because we need them in self-driving cars, photo detection on Facebook and many more applications!!
The godfather of deep learning (ANNs), Geoffrey Hinton, works at Google, and the pioneer of CNNs, Yann LeCun, works at Facebook.
So, how does it work?
It takes an input and gives an output.
For example,
But sometimes it's tough to detect whether a person is smiling or sad. It all depends on the features.
Let's take an image. For example, a black and white one here:
Well, neural networks leverage the fact that a black and white image is a two-dimensional array. What we see on the left is just the visual representation, and for simplicity's sake it's just a two by two picture. In computer terms, it's a two-dimensional array where every single pixel has a value between 0 and 255. That's eight bits of information (two to the power of eight is 256), and the value is the intensity of the color white: 0 is a completely black pixel, 255 is a completely white pixel, and in between you have the grayscale range of possible values for that pixel. Based on that information, computers are able to work with the image. That's the starting point: any image has a digital form, which is basically ones and zeros forming a number from 0 to 255 for every single pixel. The computer doesn't actually work with colors; it works with those numbers at the end of the day. That's the foundation of it all.
And in a colored image, it's actually a three-dimensional array. You've got a red layer, a green layer and a blue layer, which stands for RGB, and each of those colors has its own intensity. So, basically, a pixel has three values assigned to it, each between 0 and 255, and by combining those three values you can find out exactly what color that pixel is.
So that's the foundation of it all, that's the red channel, the green channel, the blue channel.
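To make this concrete, here is a minimal NumPy sketch (with made-up pixel values) of how a grayscale and a colored image look to the computer:
import numpy as np

# a tiny 2x2 grayscale image: one 0-255 intensity value per pixel (0 = black, 255 = white)
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)

# a tiny 2x2 color image: three 0-255 values (R, G, B) per pixel,
# so the array is three-dimensional with shape (height, width, 3)
color = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

print(gray.shape)   # (2, 2)
print(color.shape)  # (2, 2, 3)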
Finally, a very trivial example: a smiling face in computer terms, if we really simplify things.
Instead of having values from 0 to 255, just so that we can understand things better and really grasp the concepts, we're going to say 0 is white and 1 is black.
So we simplify things to the extreme, and you will see that the image can be represented like that.
And the steps that we're going to go through with these images are: Convolution, ReLU, Max Pooling, Flattening and Full Connection.
Convolution Operation
It's a combined operation of two functions; in our case, those are the input image and the feature detector.
We have 2 matrices (the input image and the feature detector matrix).
Now we multiply them, position by position:
From the input image, we take a sub-matrix of the same size as the feature detector. Now we look for 1's in the input image, but they have to be in the same positions as the 1's in the feature detector.
As you can see, in these positions we don't have any matching 1's, so there are 0 matches.
So, we write down 0.
Then we slide to the next box of the input image (same size as the feature detector).
Here we have 1 match, and we write it down on the Feature Map.
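Here is a minimal NumPy sketch of this sliding-window counting (both the input image and the feature detector below are made up for illustration; they are not the exact matrices from the figures):
import numpy as np

# hypothetical 0/1 input image (7x7) and 3x3 feature detector
image = np.array([[0, 0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0, 0],
                  [0, 0, 0, 1, 0, 0, 0],
                  [0, 0, 1, 0, 1, 0, 0],
                  [0, 1, 0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0, 0]])
detector = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

# slide the detector over the image one pixel at a time (stride 1)
h = image.shape[0] - detector.shape[0] + 1
w = image.shape[1] - detector.shape[1] + 1
feature_map = np.zeros((h, w), dtype=int)
for i in range(h):
    for j in range(w):
        patch = image[i:i+3, j:j+3]
        feature_map[i, j] = np.sum(patch * detector)  # count the 1's that line up with the detector's 1's

print(feature_map)  # the largest numbers appear where the pattern matches best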
Are we losing information when we're applying the Feature Detector?
Well, we are losing some information, of course, because we have fewer values in our resulting matrix. But at the same time, the purpose of the Feature Detector is to detect certain features, certain parts of the image that are integral. Think of it this way: the Feature Detector has a certain pattern on it, and the highest number in your Feature Map is where that pattern matches up.
In fact, in our simplified example, the highest number you can get is when the feature matches exactly, and you can see that with the number 4.
Features are how we see things and how we recognize things. We don't look at every single pixel, so to speak, of an image or of what we see in real life; we look at features: the nose, the hat, the feather, the eyes, or the little black marks under the cheetah's eyes that distinguish a cheetah from a leopard.
So, to give an overall picture, this is how it works
The first layer is what we just created by computing the feature map.
So let's say the front one is the feature map we just created. But then how come there are so many of them? We create multiple Feature Maps because we use different filters, and that's another way we preserve lots of the information: we don't have just one Feature Map, we look for several different features, or rather, the network decides through its training which features to look for.
Again, with a different filter or feature detector, we get a new feature map
and that goes on....
So, here is an example image of the Taj Mahal, to which we will apply filters (feature detectors).
Once we apply the Sharpen filter to this image, it sharpens.
It's quite intuitive if you think about it: 5 is the weight for the main pixel in the middle of the filter (the Feature Detector), and the minus 1's around it reduce the surrounding pixels, so the center pixel stands out and the image looks sharper.
Then Blur: it gives equal significance to all of the pixels around the one in the center, so it combines them together and you get a blur.
Edge Enhance: here you can see a -1 and a 1 with 0s around them, so you remove the pixels around the main one in the middle and keep only that contrast, which gives you an edge.
Edge Detect: this one probably makes more sense. You reduce the strength of the middle pixel and increase the strength of the pixels marked with 1's around it, and that gives you the Edge Detection result you can see.
Emboss: the key here is that the filter is asymmetrical, and you can see the image becomes asymmetrical as well, giving the feeling that it's standing out towards you. That's what you get when you have minuses on one side and pluses on the other.
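As a rough sketch, here is how such kernels can be applied to any grayscale image with a plain convolution. The kernel values below are the commonly used ones and the file name tajmahal.jpg is just a placeholder; both may differ from the exact figures above.
import numpy as np
from PIL import Image               # assumes Pillow is installed
from scipy.ndimage import convolve  # assumes SciPy is installed

# typical 3x3 kernels (illustrative values, not copied from the figure)
kernels = {
    "sharpen":     np.array([[ 0, -1,  0], [-1,  5, -1], [ 0, -1,  0]]),
    "blur":        np.ones((3, 3)) / 9.0,
    "edge_detect": np.array([[-1, -1, -1], [-1,  8, -1], [-1, -1, -1]]),
    "emboss":      np.array([[-2, -1,  0], [-1,  1,  1], [ 0,  1,  2]]),
}

img = np.array(Image.open("tajmahal.jpg").convert("L"), dtype=float)  # placeholder file name
for name, k in kernels.items():
    filtered = convolve(img, k)                                # slide the kernel over the image
    filtered = np.clip(filtered, 0, 255).astype(np.uint8)      # keep values in the valid 0-255 range
    Image.fromarray(filtered).save("tajmahal_" + name + ".jpg")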
ReLU
Now, we will apply the rectifier.
And the reason why we're applying the Rectifier is that we want to increase non-linearity in our image, or in our network, in our Convolutional Neural Network. The Rectifier acts as that filter, or that function, which breaks up linearity.
And the reason why we want to increase non-linearity in our network is that images themselves are highly non-linear, especially if you're recognizing different objects next to each other, or objects on different backgrounds, and so on.
Like the image is going to have lots of non-linear elements, and the transition between pixels, adjacent pixels, is often gonna be non-linear. That's because of borders, there's different colors, there's different elements in your images.
But at the same time, when we're applying mathematical operations such as convolution, and running this feature detection to create our feature maps, we risk that we might create something linear, and therefore we need to break up the linearity.
Here is an original image.
Now when we apply a feature detector to this image we get something like this.
Well, when you apply a feature detector to a real image, which is not just zeros and ones but has lots of different values, and since, as we saw previously, feature detectors can have negative values in them, sometimes you will get negative values in the feature map. Here, the black pixels are negative values and the white ones are positive.
And what the Rectified Linear Unit function does is remove all the black: anything below zero, it turns into zero,
and so from this, it turns to this
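A minimal sketch of the rectifier in NumPy, applied to a made-up feature map that contains negative values:
import numpy as np

feature_map = np.array([[ 2, -1,  0],
                        [-3,  4, -2],
                        [ 1, -5,  3]])   # hypothetical values; the negatives are the "black" pixels

rectified = np.maximum(0, feature_map)   # ReLU: anything below zero becomes zero
print(rectified)
# [[2 0 0]
#  [0 4 0]
#  [1 0 3]]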
But why ReLU? Why do we need to break up the linearity?
Here you can see bright, then darker, darker, darker, and so on. This part looks like it's linear (a gradual progression).
Then you break it up like that.
So, linearity gets broken.
Max Pooling
On the 1st image, the cheetah is positioned properly and looking straight at you. On the 2nd image, it's a bit rotated, and the 3rd image is a bit squashed. The thing here is that we want the neural network to be able to recognize the cheetah in every single one of these images.
There are lots of little differences, and so if the neural network looks for exactly a certain feature, it can miss it. For instance, a distinctive feature of the cheetah is the tear marks on its face: the dark pattern that runs from its eyes down the sides of its nose, which looks like tears.
But if it's looking for that feature which it learned from certain cheetahs, in an exact location or an exact shape or form or texture, it will never find these other cheetahs. So, we have to make sure that our neural network has a property called spatial invariance, meaning that it doesn't care where the features are located, not so much as in which part of the image because we've kind of taken that into consideration with our map, with our convolution layer.
But it doesn't have to care if the features are a bit tilted, if the features are a bit different in texture, if the features are a bit closer or if the features are a bit further apart, relative to each other.
So, if the feature itself is a bit distorted, our neural network has to have some level of flexibility to be able to still find that feature and that is what pooling is all about. So, let's have a look at how pooling works.
How does it work?
We have got the feature map, and now we take a 2*2 box and check what the max value in it is. We got 1 as the max value, so let's add that to the Pooled Feature Map.
Then it does the same for the next box,
And finally,
First of all, we were still able to preserve the features. Because we know how the convolution layer works, we know that the largest numbers in the feature map represent where we actually found the closest similarity to our feature.
But by then pooling these features, we are, first of all, getting rid of 75% of the information that is not the feature, not the important thing we're looking out for, because we're disregarding 3 values out of 4 and only keeping 25%.
Then also because we are taking the maximum of the pixels or the values that we have, we are therefore accounting for any distortion.
And in addition to all of that, we're reducing the size, so there's another benefit.
So, we're preserving the features, we're introducing spatial invariance, and we're reducing the size by 75%, which is huge and will really help in terms of processing. Moreover, by reducing the number of parameters, pooling also helps to prevent overfitting, which is a very important benefit.
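A minimal NumPy sketch of 2*2 max pooling with a stride of 2, on a made-up feature map:
import numpy as np

feature_map = np.array([[0, 1, 0, 0],
                        [1, 1, 0, 1],
                        [0, 0, 1, 0],
                        [0, 1, 0, 4]])   # hypothetical values

h, w = feature_map.shape
pooled = np.zeros((h // 2, w // 2), dtype=feature_map.dtype)
for i in range(0, h, 2):                 # step of 2 = stride of 2
    for j in range(0, w, 2):
        pooled[i // 2, j // 2] = feature_map[i:i+2, j:j+2].max()  # keep only the max of each 2*2 box

print(pooled)
# [[1 1]
#  [1 4]]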
Here is a website you can use to draw numbers between 0 and 9 and see all of the layers.
Here, the pooling step is shown as "Downsampling".
Flattening
Now, using our pooled feature maps, we can flatten them into a single row.
And the reason for that is because we want to later input this into an artificial neural network (ANN) for further processing.
And with lots of maps, we get this
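Below is a tiny sketch of what flattening does, with made-up numbers:
import numpy as np

pooled_map = np.array([[1, 1],
                       [1, 4]])                       # hypothetical pooled feature map

print(pooled_map.flatten())                           # [1 1 1 4] - one long vector for the ANN

# with many pooled feature maps, each one is flattened and they are all joined into one input vector
maps = [pooled_map, pooled_map * 2]
print(np.concatenate([m.flatten() for m in maps]))    # [1 1 1 4 2 2 2 8]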
So, to sum up, this is the process we have covered so far:
We take an image and apply filters (feature detectors) to get one feature map, and then we use multiple filters to get lots of maps and create the convolution layer.
Then we apply max pooling to each feature map and downsize them individually; that's why we see a much smaller size after pooling. Finally, we flatten them to create a vector of inputs.
After that, we apply an ANN to all of the inputs.
So, we get an output, and then we readjust the weights and the filters (feature detectors).
Take an example of detecting an animal, where the model thinks it can be either a dog or a cat.
Assume it detects that the animal is a dog.
The neurons there, with values from 0 to 1, indicate how strongly they fire and which output they vote for.
Here 1, 1, 0.9 are the highest values, and these neurons are the ones working to make dog the output.
When the output is Cat, these 3 neurons contribute the most.
So, we got this
Let's see what happens with an image of a dog.
Now, we don't have a final output yet, but we have values at the neurons indicating which output to trigger. The dog output listens to its 3 neurons, which give it high values: 1, 1 and 0.8. Combining them, the dog output ends up at 0.95.
And the cat output gets 0.2, 0.8 and 0.1 as its 3 values
So, combining them, the cat output ends up at only 0.05.
So, it finally detects a Dog, with the highest probability.
Again, for an image of Cat, we get this
The cat output's 3 neurons give 1, 1 and 0.4, and combining them it ends up at 0.79.
So, this is how the network turns the image we shared earlier into probabilities and detects an object.
So, we can summarize the process like this
You may read this: The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3).
Applying Softmax:
What we have here is the convolutional neural network that we built in the main part of the section, and at the end it pops out some probabilities: 0.95, or 95%, for a dog and 0.05, or 5%, for a cat, given the photo on the left as an input.
This is after training has been conducted: the network is running and classifying a certain image. And so the question here is, how come these two values add up to one? Because, as far as we know from everything we've learned about neural networks, there is nothing to say that these two final neurons are connected to each other.
So how would they know what the value of the other one is, and how would they know to make their values add up to one? Well, the answer is they wouldn't, in the classic version of an artificial neural network. The only reason they do is that we introduce a special function, called the softmax function, to help us out of this situation.
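As a quick sketch, softmax exponentiates each raw output and divides by the sum of the exponentials, so the outputs always add up to 1 (the raw scores below are made up):
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

raw_scores = np.array([2.0, -1.0])   # hypothetical raw outputs for [dog, cat]
probs = softmax(raw_scores)
print(probs)                         # approximately [0.95 0.05]
print(probs.sum())                   # 1.0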
Applying cross entropy to the dog image result:
We pass 1, 0 as the target, since we know this should be a dog image. So, to mean dog, we give 1 (and 0 for cat).
We apply cross-entropy with it: the predicted 0.9 goes to q and the target 1 goes to p; likewise, the predicted 0.1 goes to q and the target 0 goes to p.
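A minimal sketch of that cross-entropy calculation, H(p, q) = -sum(p * log(q)), for the dog example above:
import numpy as np

p = np.array([1.0, 0.0])    # target distribution: [dog, cat]
q = np.array([0.9, 0.1])    # the network's predicted probabilities

cross_entropy = -np.sum(p * np.log(q))
print(cross_entropy)        # about 0.105; a perfect prediction would give 0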
Let's give another example, with 3 images
Two of our neural networks give us these results:
You can see that for the first 2 images, both neural networks (NN1 and NN2) guessed right.
But for the last image, both networks' values were poor and the guess was wrong.
Let's put the results in a table for each neural network.
Now, if we want to see how many they guessed wrong using the classification error,
it gives 33% for both, but we can see that NN1 always assigned a higher probability to the correct result. For example, when the image was a dog (the 1st one),
NN1 gave 0.9 for dog and 0.1 for cat, whereas NN2 gave 0.6 for dog and 0.4 for cat. So, they both put more probability on dog, but 0.9 > 0.6.
So, NN1 outperformed NN2.
Using Mean squared error and cross entropy
Now both metrics capture the difference and show that NN1 performed better.
Cross-entropy in particular gives us a better signal for judging which neural network performed well.
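A sketch of how the three measures compare. Only the first row of predictions (0.9/0.1 for NN1 and 0.6/0.4 for NN2) comes from the text above; the other rows are made-up placeholders just so the code runs:
import numpy as np

# targets for 3 images as [dog, cat]; assume the 3rd image is the one both networks get wrong
targets = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)

nn1 = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]])  # only the 1st row is from the text
nn2 = np.array([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])  # only the 1st row is from the text

def classification_error(t, q):
    return np.mean(np.argmax(q, axis=1) != np.argmax(t, axis=1))

def mean_squared_error(t, q):
    return np.mean((t - q) ** 2)

def cross_entropy(t, q):
    return -np.mean(np.sum(t * np.log(q), axis=1))

for name, q in [("NN1", nn1), ("NN2", nn2)]:
    print(name, classification_error(targets, q), mean_squared_error(targets, q), cross_entropy(targets, q))
# both show 33% classification error, but MSE and cross-entropy come out lower (better) for NN1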
Read more about cross entropy loss
Let's now code this up.
Problem statement
We have 3 folders
training_set consists of lots of cat and dog images.
test_set has some cat and dog images to check whether, and how well, we classify them.
single_prediction has two images: 1 cat and 1 dog. It's basically the production folder, used at the end to detect whether a single image is a cat or a dog.
Download the dataset folder
Importing the libraries
import tensorflow as tf #for deeplearning
from tensorflow.keras.preprocessing.image import ImageDataGenerator #for image processing
Data Pre-processing
We will apply transformations to avoid overfitting.
Otherwise, we would get very high accuracy on the training set, close to 98%, and much lower accuracy on the test set. That is called overfitting, and that's something we absolutely need to avoid.
So basically we're going to apply some geometrical transformations, like transvections (shears) to shift some of the pixels; we're going to rotate the images a bit, do some horizontal flips, and zoom in and out a little. We're going to apply a series of transformations to modify the images and get them, as we say, augmented. In fact, the technical term for what we're doing here, with all these transformations, is image augmentation.
Let's check Keras API
Check this out. It has been deprecated now!!
We take a part of sample code from this blog
Now let's copy and paste this part
Let's copy this part as well
Let's set img_height and img_width to 64 and batch_size to 32.
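For reference, a sketch of what that copied snippet ends up looking like with our values plugged in. The folder path 'dataset/training_set' follows the folder structure described above, and the augmentation parameters are the usual ones from the Keras example, so they may differ slightly from the blog's exact code:
train_datagen = ImageDataGenerator(rescale=1./255,       # scale pixel values to [0, 1]
                                   shear_range=0.2,      # transvections (shears)
                                   zoom_range=0.2,
                                   horizontal_flip=True)
training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size=(64, 64),   # img_height, img_width
                                                 batch_size=32,
                                                 class_mode='binary')    # two classes: cats and dogs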
We are done with the preprocessing for the training data.
Now, let's do it for the test data.
Note: We won't do the image augmentation for the test data, as our goal is to check what results we get on untouched test data. If we made those changes, the results would surely be in our favor. That's cheating!
We will just rescale the pixels, but not apply the data augmentation.
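A sketch of the corresponding test-set code, under the same assumption about the folder path:
test_datagen = ImageDataGenerator(rescale=1./255)        # only rescaling, no augmentation
test_set = test_datagen.flow_from_directory('dataset/test_set',
                                            target_size=(64, 64),
                                            batch_size=32,
                                            class_mode='binary')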
Building the CNN
Initializing a cnn object of the Sequential class, just like we did for the ANN.
cnn = tf.keras.models.Sequential() # cnn object created as an instance of the Sequential class
Now we will add the convolutional layer. For that we will use the Conv2D class.
cnn.add(tf.keras.layers.Conv2D(filters=32,kernel_size=3,activation='relu',input_shape=[64,64,3]))
There are various CNN architectures, but following the classical one, we will use 32 filters in this first layer.
Also, we set kernel_size to 3, which means each filter (feature detector) will be of size 3*3.
We also use an activation function after that, which is 'relu'. Finally, we set the image height and width to 64, and as this is a color image, the last value is 3 (the RGB channels).
so, input_shape=[64,64,3]
Pooling
This time we are using the MaxPool2D class.
Just a reminder
Here we used a 2*2 box to form the pooled feature map, so pool_size will be 2.
Again, we shifted 2 pixels each time to move to a new box.
So, strides will be 2
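So the pooling layer for this first block looks like this (it's the same line that reappears for the second block below):
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2,strides=2))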
Now, adding the second convolutional layer.
We can just copy-paste the code for the 1st convolutional layer and its pooling,
but since we are no longer taking the raw input (we already took the input in the 1st layer),
we will remove input_shape.
cnn.add(tf.keras.layers.Conv2D(filters=32,kernel_size=3,activation='relu'))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2,strides=2))
Flattening
We will use the Flatten class now.
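Following the same pattern as the layers above, that gives:
cnn.add(tf.keras.layers.Flatten())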
Full connection
Now we will apply the ANN
In an ANN, we use Dense layers.
Here we are taking a larger number of neurons, as we need more computational power. So, units=128.
Again, we will apply the rectifier activation function, which we write as 'relu'.
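Putting those together, the fully connected layer is added like this:
cnn.add(tf.keras.layers.Dense(units=128,activation='relu'))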
Final output layer
We will use the Dense class again, as the output layer is fully connected to the previous fully connected layer.
So, we just copied that
Now, since the output is binary (dog or cat), we only need one output neuron,
so we will set units=1.
And for binary classification, we use the sigmoid activation function.
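So the output layer becomes:
cnn.add(tf.keras.layers.Dense(units=1,activation='sigmoid'))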
Train the CNN
Compiling the CNN:
We are using the adam optimizer, the 'binary_crossentropy' loss function (we explained earlier why we use cross-entropy), and 'accuracy' as the metric.
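In code, that is:
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])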
Now training
We have provided training_set for training, passed test_set as validation_data, and set epochs=25 (to keep training reasonably fast).
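Given how the generators were named above, the training call looks like this:
cnn.fit(x=training_set, validation_data=test_set, epochs=25)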
Single prediction
We have 2 images, and we will now see what the results are.
import numpy as np
from tensorflow.keras.preprocessing import image # importing the image module
#load image from the folder
test_image = image.load_img('dataset/single_prediction/cat_or_dog_1.jpg', target_size = (64, 64)) # the image location, with target_size = (64, 64) [Note: the image size has to be the same as the one we trained on]
# the predict method expects an array as input, so we convert the PIL image to a NumPy array
test_image = image.img_to_array(test_image)
We have trained on batches: batch number 1 had 32 images, batch number 2 had 32 images, and so on. So, we need to add an extra batch dimension.
test_image = np.expand_dims(test_image, axis = 0) # adding the batch dimension as the 1st dimension
result = cnn.predict(test_image/255.0) # normalize the pixel values the same way as during training
training_set.class_indices # tells us which index corresponds to cat and which to dog
The result now has the batch dimension, so we access the first (and only) element of the batch with result[0], and then the single prediction inside it with result[0][0]. We are looking at the prediction as a probability: if it is more than 50%, it will be a dog.
if result[0][0] > 0.5: # prediction on the basis of probability
prediction = 'dog'
else:
prediction = 'cat'
Let's download Anaconda now,
then install TensorFlow in the Anaconda prompt.
Let's download the code from Google Colab and keep the dataset in the same folder. Then let's open a Jupyter notebook.
So, with this image
we get this one
Also, for the cat image, we get this:
Done!!
Try the code