Machine learning - Multiple Linear Regression Model (Part 3)
In multiple linear regression, the equation is y = b0 + b1x1 + b2x2 + ... + bnxn. Here, y is the dependent variable and x1, x2, ..., xn are the independent variables.
Assumptions of linear regression
Here you can see that, apart from the first one, none of the models fits well.
These are the assumptions we need to keep in mind: the green ticks indicate the cases where we can use linear regression, and the red crosses indicate where we can't.
- Linearity: If Y and X do not have a linear relationship, the points will look scattered with no trend around the line. So it's better not to apply linear regression there.
- Homoscedasticity: The data should be spread equally around the regression line. If the spread grows or shrinks along the line (a cone shape), this assumption is violated.
- Multivariate Normality: The errors (the distances of the points from the regression line) should follow a normal distribution.
- Independence: The observations should be independent of each other. If you see a pattern (like waves) in the data, the rows are influencing each other.
- Lack of Multicollinearity: The independent variables should not be strongly correlated with (or equal to) each other.
- The outlier check: If some data points sit far away from the line, they can drag the line towards them, so check them before fitting.
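To get a feel for these checks, here is a minimal sketch (made-up data; this is my own illustration, not part of the original lesson): fit a line and plot the residuals. An even band around zero suggests linearity and homoscedasticity hold.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 1, 100)  # made-up linear data with constant-variance noise
b1, b0 = np.polyfit(x, y, 1)           # fit a straight line (returns slope, intercept)
residuals = y - (b0 + b1 * x)          # errors around the fitted line
plt.scatter(x, residuals)              # the band should stay evenly spread around 0
plt.axhline(0, color='red')
plt.xlabel('x')
plt.ylabel('residual')
plt.show()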
Let's start working with this CSV file, 50_Startups.csv.
Here, the Profit column goes to y (the dependent variable).
x1 for R&D Spend, x2 for Administration, x3 for Marketing Spend, but what about State?
The State column does not hold numerical values, but can we create some from it?
Yes!!
Remember, earlier in the Data Pre-processing blog, we made dummy variables from the Country column. Check that out!
Let's continue again!
So, State has categorical values and, based on them, we can create 2 columns: New York and California.
But rather than taking both columns, can't we take only the New York column, where 1 means New York and 0 means California?
Yes! Can you guess when we would need more than one dummy column? Can you guess??
Yes, you guessed it right. When we have more than 2 categories, a single column is no longer enough: with k categories we need k − 1 dummy columns, because otherwise it would be impossible to judge which category a 0 means.
If you did not understand, kindly check the "Data Pre-processing blog" and see what we did to the Country column and how we used those dummy variables in the linear equation.
So, D1 means the New York column. Again, if we use both columns, it is a trap, because
the New York and California columns combined always sum to 1, so one of them is completely predictable from the other. Only one column can easily represent what we need here.
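As a tiny illustration (a sketch using pandas' get_dummies, which is a different tool from the OneHotEncoder we use below; the rows are made up):

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'New York']})
# with only 2 categories, one dummy column is enough:
dummies = pd.get_dummies(df['State'], drop_first=True)  # drops 'California'
print(dummies)  # a single 'New York' column: 1 = New York, 0 = California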
Let's build a model now
For our dependent variable y, we now have many independent variables, and we need to keep some for the model and throw some away (remember the dummy variable we did not use earlier). There are 5 classic ways of building a model:
- All in: When you know that all the independent variables (x1, x2, ...) matter for the dependent variable (y), so you simply keep every one of them.
- Backward Elimination: Start with all the variables, then repeatedly remove the least significant one (the highest p-value) until every remaining variable is significant; a sketch of this follows the list.
- Forward Selection: Start with no variables, then repeatedly add the most significant one until no new variable improves the model.
- Bidirectional Elimination: A combination of the two; after each forward step, run backward steps to drop variables that became insignificant.
- All Possible Models: Fit every combination of variables (2^n − 1 models) and pick the best one by some criterion; accurate but very costly.
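For example, here is a minimal sketch of Backward Elimination (my own illustration, assuming statsmodels is installed; X is a numeric feature matrix and sl is the significance level):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    X = sm.add_constant(X.astype(float))  # prepend the intercept column b0
    cols = list(range(X.shape[1]))        # start with all columns ("all in")
    while True:
        model = sm.OLS(y, X[:, cols]).fit()
        worst = int(np.argmax(model.pvalues))  # least significant variable
        if model.pvalues[worst] > sl:
            cols.pop(worst)               # drop it and refit
        else:
            return model, cols            # every remaining p-value <= sl

Calling model.summary() on the returned model shows the surviving variables and their p-values.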
Let's work on Google Colab then, using this CSV file.
Let's go for data pre-processing. We can see we have no missing data, but we do have categorical data in the State column, so we will create dummy variables for it.
Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Import the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
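If you want a quick peek at what we just loaded (assuming the file sits in the working directory):

print(dataset.head())  # columns: R&D Spend, Administration, Marketing Spend, State, Profit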
Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder="passthrough")  # encode column index 3 (the 4th column), which holds the state name
X = np.array(ct.fit_transform(X))
So, the single State string column is turned into three one-hot encoded dummy columns (one per state), placed at the front of X.
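You can verify the new layout directly (the exact numbers depend on the row):

print(X[0])  # the three state dummies come first, then R&D, Administration, Marketing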
Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
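A quick sanity check of the split (the shapes assume our 50-row dataset with 6 columns after encoding):

print(X_train.shape, X_test.shape)  # (40, 6) and (10, 6): an 80/20 split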
Note: We don't need feature scaling in a multiple linear regression model.
That's because each coefficient (b1, b2, ...) adapts to the scale of its variable (x1, x2, ...) and keeps things balanced, so no variable can dominate just because of its units. No need to scale anything.
Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
imported the LinearRegression class from sklearn
regressor = LinearRegression()
created an object
regressor.fit(X_train, y_train)
fitted the model with training set (X_train, y_train)
Done!!
Predicting the Test set results and comparing them with the real profit
# 1st one: the predicted profits from the test set (20% of the data)
y_pred = regressor.predict(X_test)
# show numbers up to 2 decimal places
np.set_printoptions(precision=2)
Now we want to add the real profit we had in the CSV file, to compare the error.
Here we will keep the 2 matrices side by side, and for that we concatenate them.
The first one is our prediction, y_pred, but as we want to see it vertically, we reshape it into a single column using the length of the vector: y_pred.reshape(len(y_pred), 1).
Secondly, we add the real profit from the test set, which we did not use so far; now we can use it to compare with our prediction: y_test.reshape(len(y_test), 1).
Then we set axis=1 for a side-by-side (horizontal) view.
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),axis=1))
So, the output shows our predicted profits in the left column and the real profits from the test set in the right column.
But if we set axis=0, we get the values stacked vertically instead: one long column with the predictions followed by the real values.
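A tiny illustration of the axis parameter (made-up numbers, reusing the numpy import from above):

a = np.array([[1], [2]])
b = np.array([[3], [4]])
print(np.concatenate((a, b), axis=1))  # [[1 3] [2 4]] -> columns side by side
print(np.concatenate((a, b), axis=0))  # [[1] [2] [3] [4]] -> stacked vertically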
Now, the question might arise:
How do I use my multiple linear regression model to make a single prediction, for example, the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California?
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))
For California, we used 1, 0, 0;
for R&D Spend we used 160000, for Administration Spend 130000, and for Marketing Spend 300000.
And we got the profit!
Therefore, our model predicts that the profit of a Californian startup which spent 160000 in R&D, 130000 in Administration and 300000 in Marketing is $181,566.92.
Important note 1: Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:
1, 0, 0, 160000, 130000, 300000 → scalars
[1, 0, 0, 160000, 130000, 300000] → 1D array
[[1, 0, 0, 160000, 130000, 300000]] → 2D array
Important note 2: Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because of course the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0". And be careful to include these values in the first three columns, not the last three ones, because the dummy variables are always created in the first columns.
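If you ever forget the dummy order, you can ask the fitted encoder itself (using the ct object we created earlier):

print(ct.named_transformers_['encoder'].categories_)  # order of the state dummy columns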
Again, for another question:
How do I get the final regression equation y = b0 + b1 x1 + b2 x2 + ... with the final values of the coefficients?
print(regressor.coef_)
print(regressor.intercept_)
Therefore, the equation of our multiple linear regression model is:
Profit = 86.6 × DummyState1 − 873 × DummyState2 + 786 × DummyState3 + 0.773 × R&D Spend + 0.0329 × Administration + 0.0366 × Marketing Spend + 42467.53
Important Note: To get these coefficients we called the "coef_" and "intercept_" attributes of our regressor object. Attributes in Python are different from methods and usually return a simple value or an array of values.
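For instance, a quick contrast using our regressor object:

print(regressor.coef_)            # attribute: no parentheses, just stored values
print(regressor.predict(X_test))  # method: called with parentheses, computes a result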
Done!!
Here is the final ipynb file for your practice.