Pandas is a data manipulation and analysis package for the Python programming language. It provides data structures and methods for manipulating numerical tables and time series. It is open-source software released under the three-clause BSD licence. The name pandas is derived from “panel data”, an econometrics term for data sets that contain observations of the same individuals over multiple time periods.

How does pandas work?

Pandas builds on NumPy and works alongside Matplotlib, which lets us:

  • Work efficiently with large n-dimensional arrays (NumPy)
  • Take slices of the data and transpose them into different shapes (NumPy)
  • Draw charts using Matplotlib
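
As a rough illustration of the two NumPy-related points, the sketch below wraps a small NumPy array in a data frame, takes a slice, and transposes it. The array values and the column names a, b and c are made up for the example.

```python
import numpy as np
import pandas as pd

# A 2x3 NumPy array with illustrative values: [[0, 1, 2], [3, 4, 5]]
arr = np.arange(6).reshape(2, 3)

# Wrap the array in a DataFrame; the column names are hypothetical
frame = pd.DataFrame(arr, columns=['a', 'b', 'c'])

# Slice out the first two columns by position
sliced = frame.iloc[:, :2]

# Transpose the whole frame: rows become columns and vice versa
transposed = frame.T

print(sliced.shape)            # (2, 2)
print(transposed.shape)        # (3, 2)
print(frame.to_numpy().sum())  # 15
```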

Installing Pandas

pip install pandas

After installing pandas, we can import it in our notebook with:

import pandas as pd

The Data Frame object

Creating a data frame from a dictionary

# Ages are stored as integers so that they can be used in numeric operations
df = pd.DataFrame({'name': ['Kunal', 'Ritvik', 'Vineet'], 'age': [20, 20, 21]})
df

This creates a 3x2 data frame (three rows, two columns) with the column names ‘name’ and ‘age’.

Adding another column

df['Course'] = ['B.Tech', 'B.Arc', 'B.Tech']

This adds another column to the data frame. The assignment modifies the data frame in place: nothing is returned, and df now contains the new column. The list must have one value per row.

Deleting a column

df.drop(['age'], axis=1, inplace=True)

This deletes the column ‘age’ from the data frame. We specify axis=1 because we want to delete a column; to delete a row instead, we would use axis=0 together with the row label. Passing inplace=True modifies df directly rather than returning a new data frame.
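
For comparison, here is a minimal sketch of deleting a row with axis=0. It rebuilds a small frame under the hypothetical name people so that it runs on its own; the row label 1 is simply the default integer index of the second row.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Kunal', 'Ritvik', 'Vineet'],
                       'age': [20, 20, 21]})

# axis=0 targets rows; 1 is the index label of the second row
remaining = people.drop([1], axis=0)

print(remaining['name'].tolist())  # ['Kunal', 'Vineet']
print(len(people))                 # 3 (without inplace=True the original is unchanged)
```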

Reading the Data

dataFrame = pd.read_csv('sample-data.csv')
dataFrame

This command reads the contents of the sample-data.csv file into a pandas data frame object.
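
Since sample-data.csv is not included here, the sketch below reads CSV text from an in-memory buffer instead, which pd.read_csv accepts in place of a file path. The column names and values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "time,accX,accY\n0,0.1,0.2\n1,0.3,0.4\n"

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (2, 3)
print(list(sample.columns))  # ['time', 'accX', 'accY']
```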

Important Data Frame Functions

By default, the head() method returns the first 5 rows of the data frame, whereas the tail() method returns the last 5 rows.

dataFrame.head()
dataFrame.tail()
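
Both methods also accept an optional row count. A small sketch with a made-up ten-row frame (the name numbers is hypothetical):

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(10)})

print(numbers.head(3)['value'].tolist())  # [0, 1, 2]
print(numbers.tail(2)['value'].tolist())  # [8, 9]
print(len(numbers.head()))                # 5
```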

Selecting from the data frame

We can make selections from the data frame using the following indexers:

loc - label based; can access several elements at the same time
Accepts 1-D arrays as indexers. The arrays can be slices (subsets) of the index or columns, or boolean arrays equal in length to the index or columns.

iloc - position based; can access several elements at the same time
Similar to loc, except that it uses positions rather than index values. However, it cannot be used to assign new columns or indices.

at - label based
Works much like loc for scalar indexers. It cannot operate on array indexers, but it can assign new indices and columns.

iat - position based
Works much like iloc for scalar indexers. It cannot operate on array indexers and cannot assign new indices or columns.
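
The four indexers can be compared side by side in a short sketch. The frame and its row labels a, b and c are assumptions made for the example.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Kunal', 'Ritvik', 'Vineet'],
                       'age': [20, 20, 21]},
                      index=['a', 'b', 'c'])  # hypothetical row labels

# loc: label based, accepts slices and boolean arrays
older = people.loc[people['age'] > 20, 'name']
first_two = people.loc['a':'b', ['name', 'age']]  # label slices include the end

# iloc: position based
same_two = people.iloc[0:2, 0:2]                  # position slices exclude the end

# at / iat: a single scalar value, by label or by position
age_b = people.at['b', 'age']
name_0 = people.iat[0, 0]

print(older.tolist())              # ['Vineet']
print(first_two.equals(same_two))  # True
print(age_b, name_0)               # 20 Kunal
```

Note that a label slice such as 'a':'b' includes its end point, while a position slice such as 0:2 does not.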

Finding the information about the dataframe

The shape attribute gives the dimensions of the data frame as (rows, columns).

The info() method reports the data type and the number of non-null values of each column.

The describe() method gives basic statistical details such as the count, mean and standard deviation of each numeric column.

dataFrame.shape 
(10455, 10)

dataFrame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10455 entries, 0 to 10454
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   time    10455 non-null  int64
 1   accX    10455 non-null  float64
 2   accY    10455 non-null  float64
 3   accZ    10455 non-null  float64
 4   gyrX    10455 non-null  float64
 5   gyrY    10455 non-null  float64
 6   gyrZ    10455 non-null  float64
 7   magX    10455 non-null  float64
 8   magY    10455 non-null  float64
 9   magZ    10455 non-null  float64
dtypes: float64(9), int64(1)
memory usage: 816.9 KB

dataFrame.describe()
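
The output of describe() is itself a data frame, indexed by statistic name. As a hedged illustration on a tiny made-up frame (the name readings and its values are not from the sample data), individual statistics can be read out with loc:

```python
import pandas as pd

readings = pd.DataFrame({'accX': [1.0, 2.0, 3.0, 4.0]})

# describe() returns one row per statistic (count, mean, std, min, ...)
summary = readings.describe()

print(readings.shape)                # (4, 1)
print(summary.loc['count', 'accX'])  # 4.0
print(summary.loc['mean', 'accX'])   # 2.5
print(summary.loc['max', 'accX'])    # 4.0
```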