Machine Learning: Logistic Regression (Part 10)

Logistic regression predicts a category from one or more independent variables.

Let's take an example case. We will check at which ages people take health insurance, and based on that, we will predict the probability of taking health insurance for people of different ages.

In the plot, the people who did not take the insurance are marked with red dots, and the people who took it are marked with blue dots.

Now, if we want to see whether people aged 35 and 45 will actually take the insurance, we can see that the 35-year-old has a 42% chance of taking it.

Likewise, the 45-year-old has an 81% chance of taking the insurance.

But logistic regression needs a categorical answer, like Yes or No.

So we set 50% as the threshold. If the probability is greater than or equal to 50%, the answer will be "Yes".

If it is less, we select "No".

The curve is not linear; it is called a "sigmoid" curve. The function is shown below.

So, for one independent variable (X1), we have the first form below.

The same form holds when we have more and more independent variables (X1, X2, X3, ...).
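In standard notation (the coefficient names b_0, b_1, ... are assumed here, not taken from the original), the sigmoid and the 50% decision rule can be written as:

p = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}

p = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n)}}

\hat{y} = \text{Yes if } p \ge 0.5, \text{ otherwise No}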

Maximum Likelihood

Here, we can calculate the probability of "Yes" for the blue dots. You can see the first dot has a 3% probability according to our prediction (the sigmoid), which means it should be "No" (as 3% < 50%). But we know the person took the health insurance, as the dot sits on the Yes line.

The same goes for the other points.

On the right side we have the red dots. For these, the probability of the observed outcome ("No") is 1 minus the predicted probability of "Yes": 1 - 0.01, 1 - 0.04, and so on.

We multiply all of these values together.

That product gives us the likelihood.
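As a minimal sketch of that computation (0.03, 0.01, and 0.04 are the probabilities quoted above; the other values are made up for illustration):

import numpy as np

# Predicted P(Yes) for the blue dots (people who DID take the insurance).
# 0.03 is the 3% dot from the text; the rest are made-up examples.
p_blue = np.array([0.03, 0.40, 0.85])

# Predicted P(Yes) for the red dots (people who did NOT take it).
# 0.01 and 0.04 come from the text; 0.60 is made up.
p_red = np.array([0.01, 0.04, 0.60])

# Each red dot contributes 1 - P(Yes); the likelihood is the product of all terms.
likelihood = np.prod(p_blue) * np.prod(1 - p_red)
print(likelihood)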

Which is the best curve?

The best curve is the one with the maximum likelihood.

Let's code

Problem statement: We are launching a new SUV in the market and want to know which age group will buy it. We have a list of people with their age and salary, along with previous data on whether or not they bought an SUV before.


First, import the data and split it into X_train, y_train, X_test, and y_test.
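A minimal sketch of this step; the CSV file name, the column layout, and the test_size are assumptions, not from the original:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('suv_purchases.csv')  # assumed file name
X = dataset.iloc[:, :-1].values             # Age and EstimatedSalary columns
y = dataset.iloc[:, -1].values              # Purchased (0 or 1)

# An 80/20 split (assumed); random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)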

Now we will apply feature scaling, since age and salary are on very different scales. Let's bring them into the same range.

Note: We won't apply this to the y vector, since y only holds the values 0 and 1. We don't apply feature scaling to such a column.
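A sketch of the scaling step using scikit-learn's StandardScaler; this sc object is the same one reused later when predicting for new inputs:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean/std on the training set and scale it
X_test = sc.transform(X_test)        # reuse the same mean/std for the test set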

Training the Logistic Regression model on the Training set

from sklearn.linear_model import LogisticRegression  # importing the LogisticRegression class
classifier = LogisticRegression(random_state=0)      # create the classifier object
classifier.fit(X_train, y_train)                     # fit the model on the training set

Predicting a new result

Testing for a 30-year-old person with a salary of 87,000:

classifier.predict([[30, 87000]])  # predict expects a 2D array

Since we applied feature scaling, we need to scale these values too. We will use the sc object we created for scaling, because new inputs must be transformed with the same scaling parameters.

print(classifier.predict(sc.transform([[30,87000]])))

Predicting the Test set results

y_pred = classifier.predict(X_test)

print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

Here our code prints rows like [0 0], where the left 0 is our predicted result and the right 0 is the real purchase data.

Making the Confusion Matrix

# a 2D matrix that shows the number of correct and incorrect predictions

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
print(cm)

# accuracy score; from the documentation: accuracy_score(y_true, y_pred)
print(accuracy_score(y_test, y_pred))

# 26 correct predictions: y_pred = 1 for customers who did buy the SUV (y_test = 1)
# 7 incorrect predictions: y_pred = 0 for customers who bought the SUV, but the model missed them (y_test = 1)
# 6 customers did not buy the SUV (y_test = 0) but were predicted to buy (y_pred = 1)
# 41 customers did not buy (y_test = 0) and were predicted not to buy (y_pred = 0)
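To make that mapping explicit, the four counts can be unpacked from cm; scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns:

# For binary labels 0/1, cm is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)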

Visualizing the Training set results

from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Here we can see the whole red region is predicted "No" and the green region is predicted "Yes".

But the red dots mean a real No (the person actually did not buy) and the green dots mean a real Yes.

Visualizing the Test set results

from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Done!

Try this code

Here is the dataset