Machine learning - Multiple Linear Regression Model (Part 3)

Here, y is the dependent variable and x1, x2, …, xn are the independent variables, giving the equation y = b0 + b1x1 + b2x2 + ... + bnxn.

Assumptions of linear regression

Here you can see that, apart from the first one, none of the other models works well.

These are the assumptions we need to keep in mind. The green ticks indicate the cases where we can use linear regression, and the red crosses indicate the cases where we can't.

  1. Linearity: if Y and X do not have a linear relationship, we will see something like this:

So, it's better not to apply linear regression there.

  2. Homoscedasticity: the spread of the data around the linear line should stay constant; here you won't find an equal distribution of the data around the line, so this assumption is violated (see the sketch after this list).

  3. Multivariate Normality: the errors (the distances of the points from the linear line) should be normally distributed; if they are not, you will see a skewed pattern like this.

  4. Independence: the observations (and their errors) should be independent of each other; if one row depends on another, as in autocorrelated time-series data, you will see this:

  5. Lack of Multicollinearity: the independent variables should not be highly correlated with (or derivable from) each other; if they are, the individual coefficients become unreliable.

  6. The outlier check: if some data points lie far away from the line, they can pull the model toward themselves, so decide deliberately whether to keep them.
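Before moving on, here is a small self-contained sketch (the numbers are made up, purely for illustration) of the classic way to eyeball linearity and homoscedasticity: plot the residuals against the predictions and look for an even band around zero.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_pred = rng.uniform(50_000, 200_000, 40)  # pretend predictions
residuals = rng.normal(0, 5_000, 40)       # constant spread = homoscedastic
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.title('An even band around zero suggests the assumptions hold')
plt.show()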

Let's start working with this CSV file (50_Startups.csv):

Here, the Profit column goes to y (the dependent variable).

x1 is for R&D Spend, x2 for Administration, x3 for Marketing Spend, but what about State?

The State column does not have numerical values, but can we create some from it?

Yes!!

Remember, earlier in the Data Pre-processing blog, we made dummy variables from the Country column. Check that out!

Let's continue again!

So, State holds categorical values, and based on them we can create 2 columns: New York and California.

But rather than taking 2 columns, can't we take only the New York column, where 1 means New York and 0 means California?
Yes! Can you guess when we would need to take more columns into account?

Yes, you guessed it right. When we have more than 2 categories, a single column is no longer enough, because a 0 would be ambiguous; we then need one dummy column per category, minus one (we always leave one category out, for the reason explained below).

If you did not understand, kindly check the "Data Pre-processing" blog and see what we did to the Country column and how we used those dummy variables in the linear equation.

So, D1 means the New York column. Again, if we use both columns, it is a trap, because here New York + California always add up to 1 (D2 = 1 − D1), so the second column adds no new information. Only one column can easily represent what we need here.
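Here is a minimal sketch of the idea with pandas (the tiny DataFrame is made up; the Colab code below uses scikit-learn's OneHotEncoder instead, which can do the same thing with drop='first'):

import pandas as pd

# hypothetical mini-frame standing in for the State column
df = pd.DataFrame({'State': ['New York', 'California', 'New York']})
print(pd.get_dummies(df['State']))                   # 2 columns: the dummy variable trap
print(pd.get_dummies(df['State'], drop_first=True))  # 1 column is enough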

Let's build a model now

For our dependent variable y we now have many independent variables, and we need to keep some for the model and throw some away (remember the dummy variable we did not use earlier). There are several ways to do this:

  1. All-in: when you already know that all independent variables (x1, x2, ...) matter for the dependent variable (y).

  2. Backward Elimination: start with all variables and repeatedly remove the least significant one (the highest p-value) until every remaining variable is significant (see the sketch after this list).

  3. Forward Selection: start with no variables and, at each step, add the most significant one.

  4. Bidirectional Elimination: combine the two, adding and removing variables at each step.

  5. All Possible Models: build every combination of variables and pick the best one by a goodness-of-fit score.
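Backward elimination isn't built into scikit-learn's LinearRegression, but here is a minimal sketch of the idea using statsmodels (the function name and the 0.05 significance level are my own choices, for illustration only):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    X = sm.add_constant(X.astype(float))  # add the intercept column b0
    while True:
        model = sm.OLS(y, X).fit()
        if model.pvalues.max() <= significance_level:
            return model                  # every remaining variable is significant
        X = np.delete(X, model.pvalues.argmax(), axis=1)  # drop the least significant one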

Let's work on Google Colab then

Let's work on this CSV file.

Let's go for data pre-processing. We can see we have no missing data, but we do have categorical data in the State column. We will create dummy variables for it.

Import the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Import the dataset

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values  # every column except the last one (the features)
y = dataset.iloc[:, -1].values   # the last column (Profit)

Encoding categorical data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')  # encode column index 3, which holds the state names
X = np.array(ct.fit_transform(X))

So, after the transform, the single State column is replaced by three dummy columns (one per state), placed at the front of X.
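If you want to verify this in Colab, a quick peek at the first rows will do:

print(X[:3])  # the three one-hot state columns now come first, then the numeric features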

Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Note: we don't need feature scaling in a multiple linear regression model, because each coefficient (b1, b2, ...) adapts to the scale of its variable (x1, x2, ...) and balances it out. So, there is no need to scale anything.
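To convince yourself, here is a small self-contained sketch (synthetic data, for illustration only): multiplying a feature by 1000 just shrinks its learned coefficient by 1000, and the predictions stay identical.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, (50, 2))
y_demo = 3 * X_demo[:, 0] + 7 * X_demo[:, 1] + 5
X_big = X_demo * [1000, 1]            # blow up the first feature's scale
a = LinearRegression().fit(X_demo, y_demo)
b = LinearRegression().fit(X_big, y_demo)
print(a.coef_, b.coef_)               # ~[3, 7] vs ~[0.003, 7]: the coefficient compensates
print(np.allclose(a.predict(X_demo), b.predict(X_big)))  # True: identical predictions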

Training the Multiple Linear Regression model on the Training set

from sklearn.linear_model import LinearRegression  # import the LinearRegression class from sklearn
regressor = LinearRegression()                     # create an object of the class
regressor.fit(X_train, y_train)                    # fit the model with the training set (X_train, y_train)

Done!!

Predicting the Test set results with real profit

y_pred = regressor.predict(X_test)  # predictions for the test set (20% of the data)
np.set_printoptions(precision=2)    # display numbers up to 2 decimal places

Now we want to put the real profit from the CSV file next to it, to compare the error.

Here we will keep the 2 matrices side by side, thus concatenating them.

The first one is our prediction, y_pred, but as we want to see it vertically, we reshape it into a single column with as many rows as it has values: reshape(len(y_pred), 1).

Secondly, we add the real profit from the test set, which we have not used so far; now we can use it to compare against our prediction: y_test.reshape(len(y_test), 1).

Then we set axis=1 to join them horizontally, side by side.

print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))

So, the output shows the predicted profits in the first column and the real profits in the second.

But if we set axis=0, the values are stacked vertically instead.
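If axis=1 vs axis=0 still feels abstract, this tiny example (made-up values) shows the difference:

import numpy as np

a = np.array([[1], [2]])
b = np.array([[3], [4]])
print(np.concatenate((a, b), axis=1))  # side by side (columns): [[1 3] [2 4]]
print(np.concatenate((a, b), axis=0))  # stacked vertically: [[1] [2] [3] [4]]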

Now, the question might arise:

How do I use my multiple linear regression model to make a single prediction, for example, the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California?

print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

For California, we used 1, 0, 0 (its one-hot encoding); then R&D Spend = 160000, Administration Spend = 130000, and Marketing Spend = 300000.

And we got the predicted profit!

Therefore, our model predicts that the profit of a Californian startup which spent 160000 in R&D, 130000 in Administration and 300000 in Marketing is $181,566.92.

Important note 1: Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

1, 0, 0, 160000, 130000, 300000 → scalars

[1, 0, 0, 160000, 130000, 300000] → 1D array

[[1, 0, 0, 160000, 130000, 300000]] → 2D array
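A quick way to see this is to check the shapes:

import numpy as np

print(np.array([1, 0, 0, 160000, 130000, 300000]).shape)    # (6,)   -> 1D array
print(np.array([[1, 0, 0, 160000, 130000, 300000]]).shape)  # (1, 6) -> 2D array: what predict expects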

Important note 2: Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because of course the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0". And be careful to include these values in the first three columns, not the last three ones, because the dummy variables are always created in the first columns.

Again, for another question: how do I get the final regression equation y = b0 + b1x1 + b2x2 + ... with the final values of the coefficients?

print(regressor.coef_)
print(regressor.intercept_)

Therefore, the equation of our multiple linear regression model is:

Profit = 86.6 × DummyState1 − 873 × DummyState2 + 786 × DummyState3 + 0.773 × R&D Spend + 0.0329 × Administration + 0.0366 × Marketing Spend + 42467.53

Important Note: to get these coefficients we called the "coef_" and "intercept_" attributes of our regressor object. Attributes in Python are different from methods and usually return a simple value or an array of values.
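If you don't want to copy the numbers by hand, you can also assemble the equation from those attributes. A minimal sketch (the feature names list is my own, matching the column order of X after encoding):

names = ['DummyState1', 'DummyState2', 'DummyState3',
         'R&D Spend', 'Administration', 'Marketing Spend']
terms = ' + '.join(f'({c:.4g} x {n})' for c, n in zip(regressor.coef_, names))
print(f'Profit = {terms} + {regressor.intercept_:.2f}')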

Done!!

Here is the final ipynb file for your practice.