Machine Learning : Logistic regression (Part 10)
Logistic regression predicts a category (a class label) from a number of independent variables.
Let's take an example case. We will check at which ages people take health insurance, and based on that, we will predict the probability of taking health insurance for people of different ages.
Here you can see the people who did not take the insurance marked with red dots, and the people who did take it marked with blue dots.
Now, if we want to see whether a person aged 35 or 45 will actually take the insurance, we can see that the 35-year-old has a 42% chance of taking it.
Likewise, the 45-year-old has an 81% chance of taking the insurance.
But logistic regression needs a categorical answer like Yes or No.
So, we set 50% as the threshold. If the probability is greater than or equal to 50%, the answer will be "Yes".
If it is less, we will select "No".
The curve is not linear; it is called a "Sigmoid" curve. The sigmoid function is p = 1 / (1 + e^-z).
So, for one independent variable (X1), we have p = 1 / (1 + e^-(b0 + b1*X1)).
The same form holds when we have more independent variables (X1, X2, X3, ...): the exponent simply becomes b0 + b1*X1 + b2*X2 + b3*X3 + ...
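To make this concrete, here is a minimal Python sketch (not taken from the original example); b0 and b1 are made-up coefficients chosen only so the probabilities roughly match the 42% and 81% figures above.

import numpy as np

def sigmoid(z):
    #squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

b0, b1 = -6.52, 0.177  #hypothetical intercept and age coefficient, for illustration only
for age in (35, 45):
    p = sigmoid(b0 + b1 * age)  #predicted probability of taking the insurance
    print(age, round(p, 2), "Yes" if p >= 0.5 else "No")  #apply the 50% threshold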
Maximum Likelihood
Here, we can calculate the probability of "Yes" for the blue dots. You can see the first dot has a 3% probability according to our prediction (the sigmoid), which means it should be "No" (as 3% < 50%). But we know this person actually took the health insurance, since the dot sits on the Yes line.
The same goes for the other points.
On the right side we have the red dots (the "No" cases); for these we take one minus the predicted probability of Yes, i.e. 1 - 0.01, 1 - 0.04, and so on.
We multiply all of these values together.
This gives us the likelihood.
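As a rough sketch of that multiplication (the probabilities below are placeholder values, reusing the 0.03, 0.01 and 0.04 mentioned above):

import numpy as np

#predicted P(Yes) from the sigmoid for people who actually said Yes (placeholder values)
p_yes_people = np.array([0.03, 0.55, 0.90])
#predicted P(Yes) from the sigmoid for people who actually said No (placeholder values)
p_no_people = np.array([0.01, 0.04, 0.20])

#likelihood = product of P(Yes) for the Yes people * product of (1 - P(Yes)) for the No people
likelihood = np.prod(p_yes_people) * np.prod(1 - p_no_people)
print(likelihood)  #the curve giving the highest value is the best fit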
Which is the best curve?
The best curve is the one with the maximum likelihood.
Let's code
Problem statement: We are launching a new SUV in the market and we want to know which people are likely to buy it. We have a list of people with their age and estimated salary, along with data on whether or not they bought an SUV before.
Let's code now
First, import the data and split it into X_train, y_train, X_test, and y_test.
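A minimal sketch of this step (the file name Social_Network_Ads.csv, its column layout, and the 80/20 split are my assumptions; adjust them to your own data):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('Social_Network_Ads.csv')  #assumed file name
X = dataset.iloc[:, :-1].values  #assuming Age and EstimatedSalary are the feature columns
y = dataset.iloc[:, -1].values   #assuming Purchased (0 or 1) is the last column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)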
Now we will apply feature scaling, since age and salary are on very different scales. Let's bring them to a comparable range.
Note: we won't apply this to the y vector, since y only holds the values 0 and 1; feature scaling is not applied to such a column.
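A sketch of the scaling step, using the same sc object name that the later prediction and plotting code relies on:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  #fit the scaler on the training set and scale it
X_test = sc.transform(X_test)        #reuse the same scaling for the test set
#y_train and y_test are left untouched: they only contain the 0/1 labels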
Training the Logistic Regression model on the Training set
from sklearn.linear_model import LogisticRegression
#importing the LogisticRegression class
classifier = LogisticRegression(random_state=0)
#created an object
classifier.fit(X_train, y_train)
#fitted the model on the training set
Predicting a new result
Testing for a 30-year-old person with an estimated salary of 87,000:
classifier.predict([[30, 87000]]) #predict expects a 2D array, but these raw values are not scaled yet
Since we applied feature scaling to the training data, we need to scale these values too. We will use the sc object, because it was fitted on the training data and we need the exact same scaling.
print(classifier.predict(sc.transform([[30,87000]])))
Predicting the Test set results
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
Here the output shows rows like [0 0], where the left 0 is our predicted result and the right 0 is the real purchase data from the test set.
Making the Confusion Matrix
#a 2D matrix which shows the number of correct and incorrect predictions
from sklearn.metrics import confusion_matrix,accuracy_score
cm= confusion_matrix(y_test, y_pred)
print(cm)
#accuracy score
print(accuracy_score(y_test, y_pred)) #from the documentation: accuracy_score(y_true, y_pred)
#26 correct predictions: y_pred = 1 and y_test = 1 (the customer bought the SUV and the model predicted the purchase)
#7 incorrect predictions: y_test = 1 but y_pred = 0 (the customer bought the SUV but the model could not detect that)
#6 incorrect predictions: y_test = 0 but y_pred = 1 (the customer did not buy the SUV but was predicted to buy)
#41 correct predictions: y_test = 0 and y_pred = 0 (the customer did not buy and the model predicted no purchase)
Visualizing the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Here we can see the whole red region as the predicted "No" and the green region as the predicted "Yes".
The red dots mean a real No (the person actually did not buy),
and the green dots mean a real Yes.
Visualizing the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Done!
Try this code
Here is the dataset