Machine Learning: Data Preprocessing, Part 1
In this blog, we are using a dataset called 'Data.csv':
Country | Age | Salary | Purchased |
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | | Yes |
France | 35 | 58000 | Yes |
Spain | | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset
Here we will read the dataset first using
dataset = pd.read_csv('Data.csv')
Create Feature matrix and Dependent matrix
Then we will create the feature matrix X and the dependent variable vector y from it.
Here Country, Age and Salary are features, and Purchased depends on them. So, let's create one matrix with Country, Age and Salary:
X = dataset.iloc[:, :-1].values
here iloc[rows, columns] stands for "locate by index". As we need all rows for the feature matrix, we first use : for all rows; then for the columns we need every column except the last, so we use :-1
Again, for the dependent variable vector, we need just the Purchased column:
y = dataset.iloc[:, -1].values
here we need all rows but only for the last column.
We can check the matrices then:
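To make the steps above runnable without the CSV file, here is a minimal sketch that rebuilds the same table inline with pandas (np.nan standing in for the two blank cells in the table) and then creates X and y the same way:

```python
import numpy as np
import pandas as pd

# Rebuild the blog's table inline so the example runs without Data.csv.
data = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan, 58000, 52000,
               79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No',
                  'Yes', 'No', 'Yes'],
}
dataset = pd.DataFrame(data)

X = dataset.iloc[:, :-1].values   # all rows, every column but the last
y = dataset.iloc[:, -1].values    # all rows, only the last column

print(X.shape)  # (10, 3) -> 10 rows, 3 feature columns
print(y.shape)  # (10,)   -> 10 labels
```

Printing the shapes is a quick sanity check that the slicing picked up the intended columns.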
Taking care of the missing Data
Here we can see missing data, which we can simply delete if the affected rows are few in number.
Or, we can replace each missing value with the average of all the values in its column.
we will use sklearn's SimpleImputer like this:
from sklearn.impute import SimpleImputer
then create an object of SimpleImputer
imputer = SimpleImputer(argument 1, argument 2)
In the first argument, we say what counts as a missing value, which is missing_values=np.nan.
The second argument says what to do with the missing values. Here we are going to take the mean value, so we use strategy='mean' as argument 2.
So,
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Now we will use the fit method to connect our imputer to the X matrix (feature matrix):
imputer.fit()
Within the fit method, we pass the matrix and the columns which have the missing data. So,
imputer.fit(X[:, 1:3])
here : means all the rows, and 1:3 means the 2nd and 3rd columns (index 1 and 2), which are the ones with missing data
Now, we will use the transform method, which will return the updated columns:
X[:, 1:3] = imputer.transform(X[:, 1:3])
We have then assigned the result back to the X matrix, so it's updated now!
Now , let's check the new X matrix
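The imputer workflow above can be sketched end to end on a tiny hypothetical array (two numeric columns with one np.nan each), to see the mean filling in action:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A tiny stand-in for the Age/Salary columns, one missing value each.
X = np.array([[44.0, 72000.0],
              [np.nan, 48000.0],
              [30.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)             # learns each column's mean: 37.0 and 60000.0
X = imputer.transform(X)   # fills each nan with its column's mean

print(X)
```

After the transform, the nan in the first column becomes 37.0 (the mean of 44 and 30) and the nan in the second becomes 60000.0 (the mean of 72000 and 48000).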
Encoding Categorical Data
Here we can see the same countries multiple times: France, Spain and Germany each appear repeatedly.
To encode this categorical data, we can create 3 columns (independent variables) for the 3 different countries.
Again, we have the Purchased column, which has only Yes and No (the dependent variable). So, we can encode that column too.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
Here
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
are for importing the necessary tools, like the OneHotEncoder which we will use to transform our column.
then,
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
means we are creating an object of ColumnTransformer, within which we have transformers and remainder.
The remainder is set to 'passthrough' so that we only work with the columns mentioned in transformers and don't touch the others.
Within transformers, we have set 'encoder' as the name since we are applying an encoding step, then OneHotEncoder, which is the encoding technique used, and then [0], which refers to the Country column, the one we will transform into 3 different columns.
then we will apply it by X = np.array(ct.fit_transform(X))
Now, we can see three different columns instead of the single Country column.
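Here is a minimal sketch of the same one-hot encoding on three hypothetical rows (country in column 0, one numeric column after it), so the column expansion is visible:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Country in column 0, Age in column 1.
X = np.array([['France', 44], ['Spain', 27], ['Germany', 30]], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')        # leave the other columns untouched
X = np.array(ct.fit_transform(X))

print(X)
```

OneHotEncoder orders the categories alphabetically (France, Germany, Spain), so the France row starts with [1.0, 0.0, 0.0] and the Age value is passed through as the 4th column.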
Now, let's work on the dependent variable (Purchased).
As these are labels, we will import LabelEncoder:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Now just transform the vector y:
y = le.fit_transform(y)
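On a small hypothetical label list, the label encoding looks like this:

```python
from sklearn.preprocessing import LabelEncoder

y = ['No', 'Yes', 'No', 'Yes']
le = LabelEncoder()
y = le.fit_transform(y)   # 'No' -> 0, 'Yes' -> 1 (alphabetical order)

print(y)  # [0 1 0 1]
```

LabelEncoder assigns integers to the sorted unique labels, which is exactly what we need for a binary Yes/No target.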
Splitting the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
We are going to split our data into 4 matrices: 2 (X_test, y_test) for testing and 2 (X_train, y_train) for training.
from sklearn.model_selection import train_test_split
for importing
and X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
means: the existing X and y matrices are the ones to be split;
as we want 20% of our data to be the test set, we pass test_size=0.2;
and random_state=1 fixes the random seed so that the random split is reproducible.
So, if we check them out, here it is:
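A quick sketch on hypothetical dummy data shows the 80/20 split in the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 sample rows, 2 features
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

print(X_train.shape)  # (8, 2) -> 80% of the rows
print(X_test.shape)   # (2, 2) -> 20% of the rows
```

Rerunning with the same random_state always produces the same rows in each split.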
Feature Scaling
Standardization
After doing that, most of the values will fall roughly between -3 and +3.
Normalization
After doing that, all of the values will be between 0 and 1.
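The two formulas can be computed by hand with numpy on a small hypothetical column, to see the difference: standardization is x' = (x - mean) / std, while min-max normalization is x' = (x - min) / (max - min).

```python
import numpy as np

x = np.array([44.0, 27.0, 30.0, 38.0, 40.0])

# Standardization: center on the mean, scale by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): squeeze everything into [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.round(2))
print(normalized.round(2))
```

After standardization the column has mean 0 and unit variance; after normalization the smallest value maps to exactly 0 and the largest to exactly 1.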
Let's import few things
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Remember, we won't apply feature scaling to the dummy variables (the 3 dummy variables we created in place of the Country column).
Then we will fit our scaler to the remaining columns of X_train:
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
Here we have selected all rows of column 3 and onward.
The fit method will calculate the mean and standard deviation, and the transform method will apply the standardization formula using them.
The features of the test data need to be scaled by the same scaler that was used on the training data (X_train).
If we use fit_transform, we will get a new scaler fitted to the test data, which is not what we want. We will apply the same scaler used on X_train to our X_test:
X_test[:, 3:] = sc.transform(X_test[:, 3:])
We are just using the transform method, to reuse the same scaler that was fitted on the training data.
So, this is how our X_train and X_test will look now.
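The fit-on-train / transform-on-test pattern above can be sketched on hypothetical numeric columns like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0]])
X_test = np.array([[38.0, 61000.0]])

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learn mean/std from the training data
X_test = sc.transform(X_test)         # reuse the SAME scaler on the test data

print(X_train.mean(axis=0).round(6))  # training columns now center on 0
```

The test row is shifted and scaled by the training statistics, so its scaled values are comparable to the training set's, without leaking any test information into the scaler.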
Check out this file to practice with data_processing_tools & the data.csv
Thank you