Data Preprocessing

Kunal Chhikara
4 min read · Jun 15, 2021

Data preprocessing is a technique in which we convert raw data into a clean, usable form. Most of the time, real-world data is inconsistent, incomplete, and lacking in certain behavioral trends. Data preprocessing techniques are used to overcome these problems.

Data Science lifecycle

Data preprocessing is the most time-consuming step in the data science lifecycle. On average, a data scientist spends about 80% of their time just cleaning the data to make it usable in a model.

Steps involved in preprocessing —

1. Importing the libraries — First, we need to import the libraries required to start any project. These include NumPy, the fundamental package for scientific computing in Python; Pandas, for reading, manipulating, and analyzing data and creating data frames; and Matplotlib, for data visualization.
import numpy as np                 # numerical computations
import pandas as pd                # data frames and data manipulation
import matplotlib.pyplot as plt    # plotting and visualization

2. Importing the dataset —

df = pd.read_csv('data.csv')   # load the dataset into a data frame
df.head()                      # preview the first few rows
df.info()                      # column data types and non-null counts

In this step we load the data into our working environment and use the data frame's head() and info() methods to get basic insights about the columns, their data types, and any missing values.
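A quick way to see how much data is missing in each column is shown below (a minimal sketch, assuming the data frame loaded above):

df.isna().sum()   # count of missing values per column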

3. Handling the missing data — We can use different strategies for handling the missing data.

The first strategy is to drop the rows containing missing data. This is usually not preferred, because throwing away observations can lead to underfitting when the dataset is already small.
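A minimal sketch of this strategy with pandas, which drops every row that contains at least one missing value:

df_clean = df.dropna()   # drop rows with any missing value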

Alternatively, we can replace the missing data with a statistical value such as the mean, median, or most frequent value of the column, using the SimpleImputer class from scikit-learn's impute module.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # replace NaNs with the column mean
imputed_data = imputer.fit_transform(df)                          # fit on the data and transform it

4. Categorical Encoding

Categorical data refers to features that take values from a fixed set of categories rather than numbers. Machine learning models are based on mathematical equations, so keeping categorical values as text in those equations causes problems; we need to encode them as numbers first.

The three most common categorical encoding techniques are:

Integer Encoding: Where each unique label is mapped to an integer.

One Hot Encoding: Where each label is mapped to a binary vector.

Learned Embedding: Where a distributed representation of the categories is learned.
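A minimal sketch of the first two techniques using pandas and scikit-learn; the 'color' column below is a made-up example, not part of the original dataset:

from sklearn.preprocessing import LabelEncoder

# hypothetical example column
colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Integer encoding: each unique label is mapped to an integer
label_encoder = LabelEncoder()
colors['color_int'] = label_encoder.fit_transform(colors['color'])

# One hot encoding: each label is mapped to a binary vector
color_onehot = pd.get_dummies(colors['color'], prefix='color')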

5. Splitting the Dataset into training and test set

Now we divide our data into two sets: one for training the model, called the training set, and the other for testing the model's performance, called the test set. The split is usually 80/20. To implement this we import the train_test_split function from sklearn.model_selection.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
  • X_train — independent features for the training data
  • X_test — independent features for the test data
  • y_train — dependent variable for the training data
  • y_test — dependent variable for the test data
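Note that X and y must be defined before calling train_test_split. A minimal sketch, assuming the dependent variable lives in a column named 'target' (a made-up name for illustration):

X = df.drop(columns=['target'])   # independent features ('target' is an assumed column name)
y = df['target']                  # dependent variable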

Feature scaling

Feature scaling marks the end of data preprocessing in machine learning. It is a way of standardizing the independent variables of a dataset so that they fall within a similar range. Many machine learning algorithms use the Euclidean distance between data points in their computations; because of this, features with high magnitudes weigh more in the distance calculations than features with low magnitudes. Feature scaling limits the range of the variables so that they can be compared on common ground.

We can perform feature scaling in two ways —

Standardization: rescales a feature to have zero mean and unit variance, i.e. x' = (x − mean) / standard deviation.

Normalization (min-max scaling): rescales a feature to the [0, 1] range, i.e. x' = (x − min) / (max − min).

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)   # fit the scaler on the training data and transform it
X_test = sc_X.transform(X_test)         # transform the test data with the same parameters
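The code above performs standardization. Normalization can be done the same way with scikit-learn's MinMaxScaler, used in place of (not after) the StandardScaler above; a minimal sketch:

from sklearn.preprocessing import MinMaxScaler
mm_X = MinMaxScaler()                   # scales each feature to the [0, 1] range
X_train = mm_X.fit_transform(X_train)   # fit on the training data and transform it
X_test = mm_X.transform(X_test)         # apply the same scaling to the test data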
