Data Visualization using Matplotlib and Seaborn libraries.

Kunal Chhikara
5 min read · Jun 11, 2021

What is Data Visualization?

Data visualization is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand and pull insights from. The main goal of data visualization is to make it easier to identify patterns, trends and outliers in large data sets.

Different types of analysis:

Univariate (U): Univariate analysis examines a single feature to study its properties.

Bivariate (B): Bivariate analysis compares the data between exactly two features.

Multivariate (M): Multivariate analysis compares more than two features.
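To make the three kinds of analysis concrete, here is a minimal sketch. The toy DataFrame and its column names (height, weight, group) are made up for illustration and are not part of the iris data used later.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, made up for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "height": rng.normal(170, 10, 100),
    "weight": rng.normal(70, 8, 100),
    "group": rng.choice(["a", "b"], 100),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Univariate: the distribution of a single feature
axes[0].hist(toy["height"], bins=15)
axes[0].set_title("Univariate")

# Bivariate: the relationship between exactly two features
axes[1].scatter(toy["height"], toy["weight"])
axes[1].set_title("Bivariate")

# Multivariate: two features plus a third one encoded as color
for name, part in toy.groupby("group"):
    axes[2].scatter(part["height"], part["weight"], label=name)
axes[2].set_title("Multivariate")
axes[2].legend(title="group")

fig.tight_layout()
```

In a script, call plt.show() at the end to display the figure; in a notebook it renders automatically.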

Some common plots used in data visualization:

Scatter plot

Pair plot

Heat maps

Box plot

Pie Charts

Violin plot

Distribution plot

Joint plot

Bar chart

Line plot

Visualization libraries:

Matplotlib: A Python plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, Jupyter notebooks, web application servers, and several graphical user interface toolkits.

Seaborn: A library for making statistical graphics in Python. It is built on top of Matplotlib and closely integrated with pandas data structures. It aims to make visualization a central part of exploring and understanding data.

Installing and importing the libraries in the IDE

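# Run in a terminal (or prefix with ! in a notebook cell)
pip install matplotlib seaborn pandas

# Then, in Python (pandas is needed to load the dataset)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns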

Loading the dataset

df = pd.read_csv('iris.csv')

Scatter Plot

It is one of the most commonly used plots for simple data visualization. It shows where each point in the dataset lies with respect to any 2 or 3 features (or columns). Scatter plots are available in 2D as well as 3D.

# Draw each species in its own color, one scatter call per species
colors = {'Iris-setosa': 'red', 'Iris-versicolor': 'blue', 'Iris-virginica': 'green'}
for species, group in df.groupby('Species'):
    plt.scatter(group['SepalLengthCm'], group['SepalWidthCm'],
                color = colors[species], label = species)
# Axis labels and legend only need to be set once
plt.xlabel('SepalLengthCm')
plt.ylabel('SepalWidthCm')
plt.legend()
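The description above mentions 3D scatter plots, so here is a minimal sketch of one using Matplotlib's 3D axes. The random points are hypothetical and only keep the example self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical random points, used only to keep the sketch self-contained
rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 50))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # request a 3D axes
ax.scatter(x, y, z, color="red")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```

With the iris data, three numeric columns of df (for example SepalLengthCm, SepalWidthCm and PetalLengthCm) could be passed in place of x, y and z.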

Pair Plot

Say the data has n numeric features. A pair plot creates an (n x n) grid of plots: each diagonal plot shows the distribution of the feature for that row (a histogram by default, or a density curve when hue is set), and every other plot is a scatter plot with the row's feature on the y axis and the column's feature on the x axis.

sns.pairplot(df, hue = 'Species')

Box Plot

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution.

plt.style.use('ggplot')
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
# One box plot per feature, arranged in a 2 x 2 grid
for i, feature in enumerate(features, start = 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x = 'Species', y = feature, data = df)
# Keep axis labels of neighbouring subplots from overlapping
plt.tight_layout()
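To see what the box and whiskers encode, the numbers can be computed by hand. This is a sketch that assumes the default whisker rule (whiskers reach the most extreme points within 1.5 times the interquartile range of the box); the sample values are hypothetical.

```python
import numpy as np

# Hypothetical measurements, with one deliberately extreme value
values = np.array([4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 9.5])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = values[(values >= lower_limit) & (values <= upper_limit)]
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("box:", q1, median, q3)
print("whiskers:", inside.min(), inside.max())
print("outliers:", outliers)
```

Points outside the whisker limits are the ones a box plot draws as individual markers.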

Violin Plot

A violin plot can be understood as a combination of a box plot in the middle and distribution plots (kernel density estimation) on both sides of the data. This reveals details of the distribution, such as whether it is multimodal or skewed.

plt.style.use('ggplot')
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
# One violin plot per feature, arranged in a 2 x 2 grid
for i, feature in enumerate(features, start = 1):
    plt.subplot(2, 2, i)
    sns.violinplot(x = 'Species', y = feature, data = df)
# Keep axis labels of neighbouring subplots from overlapping
plt.tight_layout()

Joint Plot

Joint plots support both univariate and bivariate analysis. The main plot gives a bivariate view, while the top and right margins show univariate plots of the two variables. This makes our job easier by combining a scatter plot for the bivariate view and distribution plots for the univariate view in a single figure.

sns.jointplot(x = 'SepalLengthCm', y = 'SepalWidthCm', data = df)

Strip Plot

A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution. It is a graphical data analysis technique for summarizing a univariate data set, and is typically used for small data sets (histograms and density plots are typically preferred for larger data sets).

features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
# One strip plot per feature; jitter spreads overlapping points sideways
for i, feature in enumerate(features, start = 1):
    plt.subplot(2, 2, i)
    sns.stripplot(x = 'Species', y = feature, data = df, jitter = True)
# Keep axis labels of neighbouring subplots from overlapping
plt.tight_layout()
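Since a strip plot is described as a good complement to a box plot, here is a minimal sketch of drawing both on the same axes. The toy DataFrame and its column names (group, value) are hypothetical stand-ins for a real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data standing in for a real dataset
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "value": np.concatenate([rng.normal(m, 1.0, 20) for m in (5, 6, 7)]),
})

fig, ax = plt.subplots()
# Box plot first, in a light color and without outlier markers,
# because the strip plot already draws every point
sns.boxplot(x="group", y="value", data=toy, color="lightgray",
            showfliers=False, ax=ax)
# Strip plot on the same axes shows each individual observation
sns.stripplot(x="group", y="value", data=toy, color="black",
              size=3, jitter=True, ax=ax)
```

The same pattern should work with the iris columns, for example x = 'Species' and y = 'SepalLengthCm' with data = df.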

Scatter plot with regression line

Seaborn's lmplot draws a 2D scatter plot with an optional overlaid regression line. Logistic regression for binary classification is also supported. lmplot is intended as a convenient interface for fitting regression models across conditional subsets of a dataset. It draws a scatter plot of two variables, x and y, fits the regression model y ~ x, and plots the resulting regression line with a 95% confidence interval.

sns.lmplot(x = 'SepalLengthCm', y = 'SepalWidthCm', data = df, hue = 'Species', col = 'Species')
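To get a feel for what fitting y ~ x means in the default linear case, the slope and intercept of a least-squares line can be computed directly with NumPy. This is only a sketch on hypothetical data, not seaborn's internal code.

```python
import numpy as np

# Hypothetical data: a noisy straight line with slope 2 and intercept 1
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# Least-squares fit of a degree-1 polynomial: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted values should land close to the slope and intercept used to generate the data; the line drawn over the scatter plot is the visual counterpart of these two numbers.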

Heat Map

A heatmap is a graphical representation of a matrix that uses color to encode each value. Which colors mark high and low values depends on the chosen colormap; in the BuPu colormap used below, darker purples mark higher values and pale blues mark lower ones. A heatmap is also known as a shading matrix.

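# Correlation is only defined for numeric columns, so drop 'Species' first
sns.heatmap(df.select_dtypes('number').corr(), annot = True, cmap = 'BuPu')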
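To show what the heatmap receives as input, here is a minimal sketch that builds a correlation matrix from a toy DataFrame. The columns a, b, c and label are hypothetical; fixing the color scale with vmin and vmax is an optional choice, not something the iris example requires.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: b depends on a, c is unrelated noise
rng = np.random.default_rng(4)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=200),
    "c": rng.normal(size=200),
    "label": ["x", "y"] * 100,  # non-numeric column
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes("number").corr()

# A fixed color scale of [-1, 1] keeps colors comparable across plots
ax = sns.heatmap(corr, annot=True, cmap="BuPu", vmin=-1, vmax=1)
```

The diagonal is always 1 because every column is perfectly correlated with itself, and the matrix is symmetric, so each pair appears twice.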
