Machine Learning: Data Preprocessing Part 1

In this blog, we are using the dataset called 'Data.csv':

Country    Age        Salary     Purchased
France     44         72000      No
Spain      27         48000      Yes
Germany    30         54000      No
Spain      38         61000      No
Germany    40         (missing)  Yes
France     35         58000      Yes
Spain      (missing)  52000      No
France     48         79000      Yes
Germany    50         83000      No
France     37         67000      Yes

Importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

Import the dataset

First, we will read the dataset using

dataset = pd.read_csv('Data.csv')

Create the feature matrix and dependent variable vector

Then we will create the feature matrix X and the dependent variable vector y from the dataset.

Here Country, Age and Salary are the features, and Purchased depends on them. So let's create one matrix with Country, Age and Salary.

X = dataset.iloc[:, :-1].values

Here iloc[rows, columns] means locate by index. As we need all rows for the feature matrix, we use : for the rows; for the columns we need everything except the last one, so we use :-1.

Again, for the dependent variable vector we need just the Purchased column.

y = dataset.iloc[:, -1].values

Here we need all rows, but only the last column.

We can then check both of them.
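Putting these steps together, here is a runnable sketch. Since you may not have Data.csv at hand, it builds the same table inline with a DataFrame literal; that inline table is just a stand-in for pd.read_csv('Data.csv'):

```python
import numpy as np
import pandas as pd

# Same table as Data.csv, built inline so the snippet runs standalone
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
y = dataset.iloc[:, -1].values    # all rows, only the last column

print(X.shape)  # (10, 3)
print(y.shape)  # (10,)
```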

Taking care of the missing Data

Here we can see missing data. If the rows with missing values are very few in number, we can simply delete them,

or we can replace each missing value with the average of all the values in its column.

We will use sklearn's SimpleImputer for this:

from sklearn.impute import SimpleImputer

then create an object of SimpleImputer

imputer = SimpleImputer(argument 1, argument 2)

In the first argument, we say what counts as a missing value: missing_values=np.nan.

In the second argument, we say what to do with those missing values. Here we are going to take the mean value, so we use strategy='mean'.

So,

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Now we will use the fit method to fit our imputer to the feature matrix X:

imputer.fit()

Within the fit method, we pass the part of the matrix with the columns that have missing data. So,

imputer.fit(X[:, 1:3])

Here : means all the rows, and 1:3 means the columns at index 1 and 2 (Age and Salary), the ones that have missing data; remember that Python slices exclude the upper bound.

Now we will use the transform method, which returns the updated columns:

X[:, 1:3] = imputer.transform(X[:, 1:3])

We assign the result back into X, so the matrix is now updated!

Now, let's check the new X matrix.
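Here is the whole imputation step as one runnable sketch; the feature matrix is written out inline to stand in for the X we extracted from Data.csv:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Feature matrix with the two gaps from Data.csv (missing age, missing salary)
X = np.array([
    ['France', 44, 72000], ['Spain', 27, 48000], ['Germany', 30, 54000],
    ['Spain', 38, 61000], ['Germany', 40, np.nan], ['France', 35, 58000],
    ['Spain', np.nan, 52000], ['France', 48, 79000], ['Germany', 50, 83000],
    ['France', 37, 67000],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])                     # learn the mean of Age and Salary
X[:, 1:3] = imputer.transform(X[:, 1:3])   # fill the gaps with those means
```

After this, the missing age becomes the mean of the nine known ages and the missing salary the mean of the nine known salaries.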

Encoding Categorical Data

Here we can see the same countries multiple times: basically France, Spain and Germany repeat.

A model cannot work with these strings directly, and simply numbering the countries would imply an order that doesn't exist. To avoid this, we can create 3 columns (independent variables) for the 3 different countries.

Again, we have the Purchased column, which contains only Yes and No (the dependent variable). We can encode that column too.

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X))

Here

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder

are for importing the necessary tools, such as the OneHotEncoder we will use and the ColumnTransformer that applies it to our column.

then,

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

means we are creating an object of ColumnTransformer, within which we have transformers and remainder. The remainder is set to 'passthrough' so that we only work with the columns mentioned in transformers and don't touch the others.

Within transformers, we have set 'encoder' as the name of the step, then OneHotEncoder() as the encoding technique used, and then [0], which is the index of the Country column that we will transform into 3 different columns.

Then we will apply it with X = np.array(ct.fit_transform(X)); the np.array() call makes sure the result is a NumPy array.

Now we can see three different columns instead of the single Country column.
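As a quick, self-contained illustration, here is the same transformer applied to just the first three rows of the data, written out inline:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Country column (index 0) plus the two numeric features
X = np.array([
    ['France', 44, 72000],
    ['Spain', 27, 48000],
    ['Germany', 30, 54000],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')          # leave Age and Salary untouched
X = np.array(ct.fit_transform(X))

# Each row now starts with 3 one-hot columns
# (alphabetical order: France, Germany, Spain), followed by Age and Salary
print(X)
```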

Now, let's work on the dependent variable (Purchased).

As these are labels, we will import LabelEncoder:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

Now just transform the matrix y

y = le.fit_transform(y)
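A self-contained sketch, with the Purchased values written out inline in place of the y we extracted earlier:

```python
from sklearn.preprocessing import LabelEncoder

# The Purchased column from Data.csv
y = ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']

le = LabelEncoder()
y = le.fit_transform(y)   # 'No' -> 0, 'Yes' -> 1 (alphabetical order)

print(y)  # [0 1 0 0 1 1 0 1 0 1]
```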

Splitting the dataset to training set and Testing set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

We are going to split our data into 4 parts: 2 (X_test, y_test) for testing and 2 (X_train, y_train) for training.

from sklearn.model_selection import train_test_split for importing

and X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1) means the existing X and y matrices are the data being split.

We want 20% of our data to be the test set, so test_size=0.2; random_state=1 fixes the random shuffle, so we get the same split every time we run it.

So, if we check them out, here it is:
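The split can be tried on any toy data; here dummy arrays with 10 samples stand in for our X and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 10 samples, 2 features each, as a stand-in for our matrices
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

print(len(X_train), len(X_test))  # 8 2
```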

Feature Scaling

Standardization

After doing that, almost all of the values will lie roughly between -3 and +3.

Normalization

After doing that, all of the values will be between 0 and 1.
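The two formulas can be tried side by side on a toy salary column; this is a plain NumPy sketch of the math, not the sklearn code we use below:

```python
import numpy as np

x = np.array([48000., 52000., 54000., 58000., 61000.,
              67000., 72000., 79000., 83000.])

# Standardization: (x - mean) / standard deviation
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): (x - min) / (max - min)
normalized = (x - x.min()) / (x.max() - x.min())
```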

Let's import few things

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

Remember, we won't apply feature scaling to the dummy variables (the 3 dummy variables we created from the Country column), since scaling them would destroy their 0/1 meaning.

Again, we will fit our scaler to the remaining columns of X_train:

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

We have selected all rows of column 3 onwards, i.e. the remaining Age and Salary columns.

The fit method calculates the mean and standard deviation of each column, and the transform method applies the standardization formula using them.

The features of the test data need to be scaled by the same scaler that was fitted on the training data (X_train).

If we used fit_transform here, we would get a new scaler fitted on the test data, which is not what we want. We will apply the same scaler fitted on X_train to X_test:

X_test[:, 3:] = sc.transform(X_test[:, 3:])

We are just using the transform method, so the same scaling that was applied to the training data is applied here.

So this is how X_train and X_test look now.
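The whole scaling step in one runnable sketch; a small hand-written X_train/X_test (3 dummy columns plus Age and Salary) stands in for the real split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy split: columns 0-2 are the country dummies (left alone),
# columns 3-4 are Age and Salary (scaled)
X_train = np.array([
    [1., 0., 0., 44., 72000.],
    [0., 0., 1., 27., 48000.],
    [0., 1., 0., 30., 54000.],
    [0., 0., 1., 38., 61000.],
])
X_test = np.array([
    [1., 0., 0., 35., 58000.],
])

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # learn mean/std on training data
X_test[:, 3:] = sc.transform(X_test[:, 3:])        # reuse the same scaler
```

After this, the Age and Salary columns of X_train have mean 0, while the dummy columns are untouched.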

Check out this file to practice with data_processing_tools & the data.csv

Thank you