Python Unit 3

UNIT-III

Introduction to NumPy, Pandas, Matplotlib:


Data Analysis:
Data Analysis is a process of inspecting, cleaning, transforming and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision-making.
Steps for Data Analysis, Data Manipulation and Data Visualization:

1. Transform Raw Data in a Desired Format


2. Clean the Transformed Data (Step 1 and 2 also called as a Pre-processing of Data)
3. Prepare a Model
4. Analyze Trends and Make Decisions
To perform the above 4 steps we require NumPy, Pandas, Matplotlib libraries in python.

NumPy:
NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. NumPy is an open-source Python library
that facilitates efficient numerical operations on large quantities of data. The main data structure in
this library is the powerful NumPy array, ndarray, which can have any number of dimensions. The
NumPy library contains many useful features for performing mathematical and logical operations
on these special arrays. NumPy is a part of a set of Python libraries that are used for scientific
computing due to its efficient data analysis capabilities.

Features:

NumPy library is having the following features:

1) OPEN SOURCE
2) EASY TO USE
3) PROVIDES HUGE NUMERICAL COMPUTING TOOLS
4) INTEROPERABLE
5) PERFORMANT
6) POWERFUL N-DIMENSIONAL ARRAYS

Installing NumPy:

• The only prerequisite for installing NumPy is Python itself. If you don’t have Python yet and
want the simplest way to get started, we recommend you use the Anaconda Distribution.
• You can download Anaconda from: https://www.anaconda.com/products/individual



• NumPy can be installed with conda, with pip, with a package manager on macOS and Linux,
or from source.

CONDA

If you use conda, you can install NumPy from the defaults or conda-forge channels:

# Best practice, use an environment rather than install in the base env

conda create -n my-env

conda activate my-env

# If you want to install from conda-forge

conda config --env --add channels conda-forge

# The actual install command

conda install numpy

PIP

If you use pip, you can install NumPy with:

pip install numpy

How to import NumPy

To access NumPy and its functions import it in your Python code like this:

import numpy as np
We shorten the imported name to np for better readability of code using NumPy.



What’s the difference between a Python list and a NumPy array?

• NumPy gives you an enormous range of fast and efficient ways of creating arrays and
manipulating numerical data inside them.
• While a Python list can contain different data types within a single list, all of the elements in
a NumPy array should be homogeneous.
• The mathematical operations that are meant to be performed on arrays would be extremely
inefficient if the arrays weren’t homogeneous.
• NumPy arrays are faster and more compact than Python lists. An array consumes less
memory and is convenient to use.
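The difference is easy to see with element-wise arithmetic (a minimal sketch using standard list and array semantics):

>>> lst = [1, 2, 3]
>>> lst * 2                    # a list is repeated, not multiplied element-wise
[1, 2, 3, 1, 2, 3]
>>> np.array([1, 2, 3]) * 2    # a NumPy array applies the operation to each element
array([2, 4, 6])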

What is an array?

• An array is a central data structure of the NumPy library.
• An array is a grid of values and it contains information about the raw data, how to locate an
element, and how to interpret an element.
• An array has a grid of elements that can be indexed in various ways. The elements are all of the
same type, referred to as the array dtype.
• An array can be indexed by a tuple of nonnegative integers, by booleans, by another array,
or by integers.
• The rank of the array is the number of dimensions.
• The shape of the array is a tuple of integers giving the size of the array along each
dimension.
• One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or
higher-dimensional data.

For example:

>>> a = np.array([1, 2, 3, 4, 5, 6])


or:

>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])


We can access the elements in the array using square brackets. When you’re accessing elements,
remember that indexing in NumPy starts at 0. That means that if you want to access the first
element in your array, you’ll be accessing element “0”.

>>> print(a[0])   # the first row of the 2-D array above
[1 2 3 4]
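The rank, shape and dtype described above can be inspected directly as attributes (a small sketch; the exact integer dtype shown is platform-dependent):

>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> a.ndim    # rank: number of dimensions
2
>>> a.shape   # size along each dimension
(3, 4)
>>> a.dtype   # element type shared by all entries
dtype('int64')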



NumPy array creation functions
• NumPy has over 40 built-in functions for creating arrays.

• These functions can be split into roughly three categories, based on the dimension of the
array they create:

1) 1D arrays
2) 2D arrays
3) nD arrays

1) 1-D array creation functions:


The 1-D array creation functions include numpy.linspace and numpy.arange.

a) numpy.linspace:

numpy.linspace will create arrays with a specified number of elements, and spaced equally between
the specified beginning and end values. For example:

>>> np.linspace(1., 4., 6)


array([ 1. , 1.6, 2.2, 2.8, 3.4, 4. ])

The advantage of this creation function is that you guarantee the number of elements and the
starting and end point.

b) numpy.arange:

numpy.arange creates arrays with regularly incrementing values. Check the documentation for
complete information and examples. A few examples are shown:

>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.arange(2, 10, dtype=float)   # dtype is used for specifying the data type
array([ 2., 3., 4., 5., 6., 7., 8., 9.])
>>> np.arange(2, 3, 0.1)
array([ 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9])



2) 2-D array creation functions
The 2-D array creation functions include numpy.eye and numpy.diag.

a) numpy.eye:

np.eye(n, m) defines a 2-D array with ones where the row index and column index are equal (i = j)
and zeros elsewhere; for a square shape this is the identity matrix:

>>> np.eye(3)
array([ [1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])

>>> np.eye(3, 5)
array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.]])

b) numpy.diag:

numpy.diag can define a square 2D array with given values along the diagonal (it can also extract the diagonal of an existing 2D array).

>>> np.diag([1, 2, 3])


array([[1, 0, 0],
[0, 2, 0],
[0, 0, 3]])

3) n-D array creation functions:


The n-D array creation functions, e.g. numpy.ones, numpy.zeros, and the numpy.random functions,
define arrays based upon the desired shape.

a) numpy.zeros will create an array filled with 0 values with the specified shape. The default
dtype is float64:



>>> np.zeros((2, 3))
array([ [0., 0., 0.],
[0., 0., 0.]])
b) numpy.ones will create an array filled with 1 values.

>>> np.ones((2, 3))


array([ [ 1., 1., 1.],
[ 1., 1., 1.]])

Indexing:
Array indexing refers to any use of the square brackets ([]) to index array values. There are many
options to indexing, which give NumPy indexing great power.

Single element indexing

Single element indexing for a 1-D array is what one expects. It works exactly like indexing other
standard Python sequences. It is 0-based, and accepts negative indices for indexing from the end of
the array.

>>> x = np.arange(10)
>>> x[2]
2
>>> x[-2]
8

Unlike lists and tuples, NumPy arrays support multidimensional indexing for multidimensional
arrays. That means that it is not necessary to separate each dimension’s index into its own set of
square brackets.

>>> x.shape = (2,5) # now x is 2-dimensional


>>> x[1,3]
8
>>> x[1,-1]
9

Slicing:

Slicing and striding work exactly the same way they do for lists and tuples, except that they can
be applied to multiple dimensions as well. A few examples illustrate this best:

>>> x = np.arange(10)
>>> x[2:5]
array([2, 3, 4])
>>> x[:-7]
array([0, 1, 2])
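Slices can also be given per dimension; a small sketch on a 2-D reshape of the same range:

>>> y = np.arange(10).reshape(2, 5)
>>> y[0, 1:4]    # columns 1 to 3 of the first row
array([1, 2, 3])
>>> y[:, 2]      # the third column of every row
array([2, 7])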

Assigning values to indexed arrays:

As mentioned, one can select a subset of an array to assign to using a single index, slices, and index
and mask arrays. The value being assigned to the indexed array must be shape consistent (the same
shape or broadcastable to the shape the index produces). For example, it is permitted to assign a
constant to a slice:

>>> x = np.arange(10)
>>> x[2:7] = 1
or an array of the right size:

>>> x[2:7] = np.arange(5)

Array mathematics
When standard mathematical operations are used with arrays, they are applied on an element-
by-element basis. This means that the arrays should be the same size during addition,
subtraction, etc.:



>>> a = np.array([1,2,3], float)

>>> b = np.array([5,2,6], float)

>>> a + b
array([6., 4., 9.])

>>> a - b

array([-4., 0., -3.])

>>> a * b

array([5., 4., 18.])

For two-dimensional arrays, multiplication remains element-wise and does not correspond
to matrix multiplication. There are special functions for matrix math, as shown below.

>>> a = np.array([[1,2], [3,4]], float)

>>> b = np.array([[2,0], [1,3]], float)

>>> a * b

array([[2., 0.], [3., 12.]])
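For true matrix multiplication, NumPy provides numpy.dot and the @ operator; for the same a and b as above:

>>> np.dot(a, b)
array([[ 4.,  6.],
       [10., 12.]])

>>> a @ b
array([[ 4.,  6.],
       [10., 12.]])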

Some array functions in NumPy:


NumPy provides many functions for working with arrays and performing various operations on
them. Some of these are listed below:

1) flatten():

flatten() can be used for converting a 2-D array into a 1-D array.

Ex:
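>>> a = np.array([[1, 2], [3, 4]], float)
>>> a.flatten()
array([1., 2., 3., 4.])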



2) transpose():

transpose() can be used to turn rows into columns and columns into rows.

Ex:

>>> a = np.array(range(6), float).reshape((2, 3))


>>> a
array([[ 0., 1., 2.],
[ 3., 4., 5.]])
>>> a.transpose()

array([[ 0., 3.],


[ 1., 4.],
[ 2., 5.]])

3) concatenate():
concatenate() can be used to combine two or more arrays into one array.

Ex:

>>> a = np.array([1,2], float)

>>> b = np.array([3,4,5,6], float)

>>> c = np.array([7,8,9], float)

>>> np.concatenate((a, b, c))

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

4) reshape():
reshape() can be used to change the shape (the rows and columns) of an array.

Ex:
import numpy as np
a=np.arange(6)
a.reshape((2,3))

output: array([[0, 1, 2],


[3, 4, 5]])

Generally arange() creates a 1-D array, but after applying reshape() the array is converted into a
2-D array.



Introduction to pandas:
• Pandas is a Python library used to analyze, organize, and structure data.
• Pandas has a wide variety of use-cases and is hugely flexible for preparing your input data for
machine learning, deep learning, and neural network models.
• Pandas is a powerful tool that lets you:
a. Convert JSON, CSV, array, dictionaries, and other data to row and column format
b. Work with them using names instead of indexes (you can still opt for indexes)
• The pandas package is the most important tool at the disposal of Data Scientists
and Analysts working in Python today.
• The biggest benefit of Pandas is that it makes extremely complicated data transformations
easy and natural.

How pandas work


Pandas is built on top of NumPy and integrates closely with Matplotlib. So, Pandas can:

• Efficiently work with large n-dimensional arrays (NumPy)
• Take slices and transpose those into different shapes (NumPy)
• Draw charts (Matplotlib)

Install and import

Pandas is an easy package to install. Open up your terminal program (for Mac users) or
command line (for PC users) and install it using either of the following commands:

conda install pandas OR pip install pandas


Alternatively, if you're working in a Jupyter notebook you can run this cell:

!pip install pandas

The ! at the beginning runs cells as if they were in a terminal.


To import pandas we usually import it with a shorter name since it's used so much:

import pandas as pd



Core components of pandas: Series and DataFrames

The two primary components of pandas are the Series and the DataFrame.
• A Series is essentially a column, and a DataFrame is a multi-dimensional table made up
of a collection of Series.

• DataFrames and Series are quite similar in that many operations that you can do with
one you can do with the other, such as filling in null values and calculating the mean.
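A minimal sketch of a Series on its own (the values are illustrative):

Ex: import pandas as pd
grades = pd.Series([80, 90], index=["Fred", "Jill"], name="grade")
print(grades)

Output:
Fred    80
Jill    90
Name: grade, dtype: int64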

1) Basic pandas operations:


• The fundamental Pandas object is called a DataFrame: a 2-dimensional, size-mutable,
potentially heterogeneous, tabular data structure.
• A DataFrame can be created in multiple ways: by passing a dictionary or a list of lists to the
pd.DataFrame() method, or by reading data from a CSV file.
• First, create a dataframe from an array.

Ex: import pandas as pd


df = pd.DataFrame([["Fred",80],["Jill",90]],columns=["student", "grade"])
df
output:
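  student  grade
0    Fred     80
1    Jill     90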

The dataframe index is just the row count, 0 and 1. If you want to use the student name as the
index, use set_index to do that.
Normally Pandas dataframe operations create a new dataframe. But we can use inplace=True in
some operations to update the existing dataframe without having to make a new one.



Ex: df.set_index("student",inplace=True)

Output:
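         grade
student
Fred        80
Jill        90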

Add a column to a Pandas dataframe:

Just use dataframe[‘new column name’] = values to add the new column. It inserts the new column
into the existing dataframe.

EX: df['birthdate']=['1970-01-12', '1972-05-12']

Output:
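         grade   birthdate
student
Fred        80  1970-01-12
Jill        90  1972-05-12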

Filter dataframe by column value:


Here we select all students born on 1970-01-12:

Ex: df[df['birthdate']=='1970-01-12']

Output:
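         grade   birthdate
student
Fred        80  1970-01-12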

Select 1 column from dataframe:

Here we select one column. This is not called a dataframe, but a series. It’s basically a dataframe of
one column.

Ex: grade=df['grade']



Output:
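student
Fred    80
Jill    90
Name: grade, dtype: int64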

Add rows to a pandas dataframe:


To add rows to the data frame we create a new dataframe and append it to the existing one.

Ex: df3=df.append(df2)
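For example, assuming a second dataframe df2 with the same columns (a sketch; note that DataFrame.append was removed in pandas 2.0, so pd.concat is the current way to combine dataframes):

Ex: df2 = pd.DataFrame([["Alice", 85]], columns=["student", "grade"]).set_index("student")
df3 = pd.concat([df, df2])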

Select Pandas dataframe rows by index position:


Here we select the first two rows using iloc, which selects by integer position.

EX: df3.iloc[0:2]

Output:

2) How to read in data:


It’s quite simple to load data from various file formats into a DataFrame.

Reading data from CSVs:


A simple way to store big data sets is to use CSV (comma-separated values) files.

CSV files contain plain text and are a well-known format that can be read by everyone, including
Pandas. With CSV files all you need is a single line to load in the data:

EX: df = pd.read_csv('purchases.csv')

df



OUT:

  Unnamed: 0  apples  oranges
0       June       3        0
1     Robert       2        3
2       Lily       0        7
3      David       1        2

CSVs don't have indexes like our DataFrames, so all we need to do is just designate
the index_col when reading:
df = pd.read_csv('purchases.csv', index_col=0)

df
OUT:

apples oranges
June 3 0
Robert 2 3
Lily 0 7
David 1 2

Reading data from JSON


Big data sets are often stored, or extracted as JSON.

JSON is plain text, but has the format of an object, and is well known in the world of programming,
including Pandas. If you have a JSON file — which is essentially a stored Python dict — pandas
can read this just as easily:
EX: df = pd.read_json('purchases.json')

df
OUT:

        apples  oranges
David        1        2
June         3        0
Lily         0        7
Robert       2        3



Reading data from a SQL database

First, we need a driver for the database. For SQLite no installation is required, because the
sqlite3 module ships with the Python standard library; for other databases you would first
install the appropriate Python driver library.

If you’re working with data from a SQL database you need to first establish a connection using
an appropriate Python library, then pass a query to pandas.

import sqlite3
con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)
df

output:

Index apples oranges


0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2

3) Pandas - Analyzing DataFrames

Viewing the Data:

One of the most used methods for getting a quick overview of the DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

Example
Get a quick overview by printing the first 10 rows of the DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))



Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0

In our examples we will be using a CSV file called 'data.csv'.

Note: if the number of rows is not specified, the head() method will return the top 5 rows.

Example
Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())

Duration Pulse Maxpulse Calories


0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the bottom.

Example
Print the last 5 rows of the DataFrame:

print(df.tail())

Duration Pulse Maxpulse Calories


164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4



Introduction to Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a multi-
platform data visualization library built on NumPy arrays and designed to work with the broader
SciPy stack. It was introduced by John Hunter in the year 2002.

One of the greatest benefits of visualization is that it allows us visual access to huge amounts of
data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram
etc.

Installation:
Windows, Linux and macOS distributions have matplotlib and most of its dependencies as wheel
packages. Run the following command to install matplotlib package :

python -m pip install -U matplotlib

Importing matplotlib :
from matplotlib import pyplot as plt
or
import matplotlib.pyplot as plt

Basic plots in Matplotlib :

Matplotlib comes with a wide variety of plots. Plots help in understanding trends and patterns, and
in making correlations. They’re typically instruments for reasoning about quantitative information.
Some of the sample plots are covered here.

Line plot :

# importing matplotlib module


from matplotlib import pyplot as plt

# x-axis values
x = [5, 2, 9, 4, 7]

# Y-axis values
y = [10, 5, 8, 4, 2]

# Function to plot
plt.plot(x,y)

# function to show the plot


plt.show()



Output :

Bar plot :

# importing matplotlib module


from matplotlib import pyplot as plt

# x-axis values
x = [5, 2, 9, 4, 7]

# Y-axis values
y = [10, 5, 8, 4, 2]

# Function to plot the bar


plt.bar(x,y)

# function to show the plot


plt.show()

Output:



Histogram :

# importing matplotlib module


from matplotlib import pyplot as plt

# Y-axis values
y = [10, 5, 8, 4, 2]

# Function to plot histogram


plt.hist(y)

# Function to show the plot


plt.show()

Output :

Scatter Plot :
# importing matplotlib module
from matplotlib import pyplot as plt

# x-axis values
x = [5, 2, 9, 4, 7]

# Y-axis values
y = [10, 5, 8, 4, 2]

# Function to plot scatter


plt.scatter(x, y)

# function to show the plot



plt.show()

Output :

Box Plot

A Box Plot, also known as a box-and-whisker plot, displays a summary of a set of data values,
showing properties like the minimum, first quartile, median, third quartile and maximum. In the box
plot, a box is drawn from the first quartile to the third quartile, and a line through the box marks
the median. In a vertical box plot such as the one produced below, the y-axis shows the data values
and each box is positioned along the x-axis.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)
data = np.random.normal(100, 20, 200)

fig = plt.figure(figsize =(10, 7))

# Creating plot
plt.boxplot(data)

# show plot
plt.show()



Exploratory Data Analysis (EDA):
• Exploratory data analysis is one of the best practices used in data science today.
• Exploratory Data Analysis (EDA) means understanding data sets by summarizing their main
characteristics, often plotting them visually. Plotting in EDA consists of histograms, box plots,
scatter plots and many more.
• EDA involves looking at and describing the data set from different angles and then
summarizing it.
• There is no single prescribed method for performing EDA, but there are common methods
and plots that are typically used in the EDA process.

1) Importing the required libraries for EDA:


Below are some of the libraries that are used in order to perform EDA (exploratory data
analysis).

# Importing required libraries.


import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)



2) Loading the data into the data frame.
Loading the data into the pandas data frame is certainly one of the most important steps in EDA. Here
the values in the data set are comma-separated, so all we have to do is read the CSV into a data
frame and pandas does the job.

EX:
df = pd.read_csv("data.csv")
# To display the top 5 rows
df.head(5)

3) Checking the types of data:


Here we check the data types because sometimes a column stored as a string has to be converted to
a numeric type before we can plot the data on a graph.

Ex:
# Checking the data type
df.dtypes

4) Dropping irrelevant columns:


This step is needed in almost every EDA because there are often many columns that we
never use; in such cases dropping them is the simplest solution.
EX:
# Dropping irrelevant columns
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors',
'Vehicle Size'], axis=1)

5) Dropping the duplicate rows


While performing data analysis we need distinct data; only then can we perform accurate
data analysis.

EX:
# Dropping the duplicates
df = df.drop_duplicates()

6) Dropping the missing or null values


Here all the missing values are detected and then dropped. Dropping is not always a good approach,
because many people instead replace the missing values with the mean or the average of that
column.

Ex:

# Finding the null values.


print(df.isnull().sum())



# Dropping the missing values.
df = df.dropna()
df.count()

# After dropping the values


print(df.isnull().sum())

7) Detecting Outliers
An outlier is a point or set of points that are different from other points. Sometimes they can be very
high or very low. It’s often a good idea to detect and remove the outliers, because outliers are one of
the primary reasons for a less accurate model.
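One common way to detect outliers is the interquartile-range (IQR) rule. A minimal sketch, assuming a numeric column named 'Price' (the column name is illustrative):

# Quartiles and interquartile range of the column
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows that fall within 1.5 * IQR of the quartiles
df = df[(df['Price'] >= Q1 - 1.5 * IQR) & (df['Price'] <= Q3 + 1.5 * IQR)]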

8) Plot different features against one another (scatter), against frequency (histogram)

The above are some of the general steps involved in exploratory data analysis that you should
follow in order to perform EDA.

Data Science life cycle:


The Data Science lifecycle revolves around the use of machine learning and different analytical
strategies to produce insights and predictions from information in order to achieve a business
objective. The complete method includes a number of steps like data cleaning, preparation,
modelling, model evaluation, etc.

There are various steps involved in the Data Science life cycle, as described below.



1) Business Understanding:

The complete cycle revolves around the enterprise goal. You need to understand whether the customer
desires to minimize savings loss, or prefers to predict the rate of a commodity, etc. This is a
very important and basic step that needs to be performed.

2) Data Understanding:
After enterprise understanding, the subsequent step is data understanding. This step includes
describing the data, their structure, their relevance and their record types. Explore the information
using graphical plots. Basically, extract any insight that you can about the information through
simply exploring the data.

3) Preparation of Data:
Next comes the data preparation stage. This consists of steps like choosing the applicable data,
integrating the data by merging the data sets, cleaning the data, treating missing values by
either eliminating or imputing them, treating inaccurate data by eliminating it, and additionally
testing for outliers with box plots and handling them.

4) Exploratory Data Analysis:


This step involves forming an idea of the solution and the factors affecting it. Many data
visualization strategies are used extensively to explore each and every feature individually
and in combination with other features.

5) Data Modeling:
Data modeling is the heart of data analysis. A model takes the prepared data as input and gives
the desired output. This step includes selecting the suitable kind of model, depending on whether the
problem is a classification problem, a regression problem or a clustering problem.

6) Model Evaluation:
Here the model is evaluated to check whether it is ready to be deployed. The model is tested
on unseen data and evaluated on a carefully chosen set of assessment metrics. The model
assessment helps us select and construct an ideal model. If the model is not evaluated properly, it
will fail in the real world.

7) Model Deployment:
The model, after a rigorous assessment, is finally deployed in the preferred structure and
channel. This is the last step in the data science life cycle.

Each step in the data science life cycle defined above must be performed carefully. If any
step is performed improperly, it affects the next step and the complete effort goes to waste.



Descriptive statistics:
Descriptive statistics is the type of statistics which is used to summarize and describe the dataset. It
is used to describe the characteristics of data. Descriptive statistics are generally used to determine
if the sample is normally distributed. It is displayed through tables, charts, frequency distributions
and is generally reported as a measure of central tendency.

Descriptive statistics include the following details about the data


1) Central Tendency
2) Statistical Dispersion
3) The Bell Curve

• Central Tendency
Central tendency is a basic but very useful statistical measure that represents a central point or
typical value of the dataset. It can be found in the following ways:
1. Mean – also known as the average
2. Median – the centermost value of the given dataset
3. Mode – The value which appears most frequently in the given dataset
Depending on what exactly you’re trying to describe, you will use a different measure of central
tendency. The mean and median can only be used for numerical data; the mode can be used with
both numerical and nominal data.
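A small sketch of these three measures using pandas (the data values are illustrative):

import pandas as pd
s = pd.Series([2, 3, 3, 5, 7])
print(s.mean())     # 4.0 (the average)
print(s.median())   # 3.0 (the centermost value)
print(s.mode()[0])  # 3 (the most frequent value)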
• Statistical Dispersion
Dispersion in statistics is a way of describing how spread out a set of data is. When dispersion is
large, the values in the set are widely scattered; when it is small, they are tightly clustered.
Dispersion can be measured in the following ways:
1. Range – Range gives us the understanding of how spread out the given data is
2. Variance – It gives us an understanding of how far the measurements are from the
mean.
3. Standard deviation – Square root of the variance is standard deviation, also the
measurement of how far the data deviate from the mean
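Continuing the illustrative series above, pandas computes these measures directly (var() and std() use the sample definitions, ddof=1):

print(s.max() - s.min())  # range: 5
print(s.var())            # variance: 4.0
print(s.std())            # standard deviation: 2.0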

• The Bell Curve – It is a graph of a normal distribution of a variable; it is called a bell curve
because of its shape.
1. Skewness – It is the measure of the asymmetry of a distribution of a variable about its
mean
2. Kurtosis – It is the measure of the “tailedness” of a distribution of a variable. It gives us
the understanding of how closely the data is spread out.

Descriptive statistics is extremely useful in examining the given data. We can gain a complete
understanding of the data with the use of descriptive statistics.



Basic tools (plots, graphs and summary statistics) of EDA:
A graph is a mathematical diagram that depicts the relationship between two or more sets of
numerical data over a period of time. Basic data is mainly 2-dimensional with a focus on raw data
represented through lines, curves, etc.

Charts, on the other hand, are a representation of datasets with the intent of making the user
understand the information in a better manner. Graphs are a good example of charts used for data
visualization.

Types of Graphs and Charts

There are various types of graphs and charts used in data visualization.

1) Bar Chart/Graph:

A bar chart is a graph represented by spaced rectangular bars that describe the data points in a set
of data. It is usually used to plot discrete and categorical data.

Types of Bar Chart

• Grouped Bar Chart

Grouped bar charts are used when the datasets have subgroups that need to be visualized on the
graph. Each subgroup is usually differentiated from the other by shading them with distinct colors.



• Stacked Bar Chart

The stacked bar graphs are also used to show subgroups in a dataset. But in this case, the
rectangular bars defining each group are stacked on top of each other.

Advantages of a Bar Chart

• Summarizes a large amount of data in an understandable form.

• Easily accessible to a wide audience.

Disadvantages of a Bar Chart

• It does not reveal key assumptions like causes, effects, patterns, etc.
• May require further explanation.

2) Pie Chart:

A pie chart is a circular graph used to illustrate numerical proportions in a dataset. This graph is
usually divided into various sectors, where each sector represents the proportion of a particular
numerical element in the set.

Types of Pie Chart:

• Simple Pie Chart

This is the most basic type of pie chart and can also be simply called a pie chart.



• Exploded Pie Chart

In an exploded pie chart, one of the sectors of the circle is separated (or exploded) from the chart. It
is used to lay emphasis on a particular element in the data set.

Uses of Pie Chart

• It summarizes data into a visually appealing form.

• It is quite simple compared to many graph types.

Disadvantages of Pie Chart

• It is inapplicable for large datasets.

• It cannot visualize groups of data.

3) Line Graph or Chart

Line graphs are represented by a group of data points joined together by a straight line. Each of
these data points describes the relationship between the horizontal and the vertical axis on the
graph.

When constructing a line chart, you may decide to include the data points or not.

Types of Line Graph

• Simple Line Graph

In a simple line graph, only one line is plotted on the graph. One of the axes defines the
independent variables while the other axis contains dependent variables.



• Multiple Line Graph

Multiple line graphs contain two or more lines representing more than one variable in a dataset.
This type of graph can be used to study two or more variables over the same period of time.

Uses of a Line Graph

• It helps in studying data trends over a period of time.

• They are easy to read and plot.

Disadvantages of a Line Graph

• It can only be used to visualize data over a short period of time.

• It is not convenient to plot when dealing with fractions and decimals.

4) Histogram Chart:

A histogram chart visualizes the frequency of discrete and continuous data in a dataset using joined
rectangular bars. Each rectangular bar defines the number of elements that fall into a predefined
class interval.

Advantages of Histogram Chart:

• It helps in visualizing large amounts of data.

• It reveals the variation, centering, and distribution of the data.



Disadvantages of Histogram Chart:

• It does not visualize the exact values in a dataset.

• It only visualizes continuous data.

5) Scatter Plot

Scatter plots are charts used to visualize random variables with dot-like markers that represent
each data point. These markers are usually scattered across the chart area of the plot.

Types of Scatter Plot

Scatter plots are grouped into different types according to the correlation of the data points. These
correlation types are highlighted below

• Positive Correlation

Two groups of data visualized on a scatter plot are said to be positively correlated if an increase in
one implies an increase in the other. A scatter plot diagram can be said to have a high or low
positive correlation.

• Negative Correlation

Two groups of data visualized on a scatter plot are said to be negatively correlated if an increase in
one implies a decrease in the other. A scatter plot diagram can be said to have a high or low
negative correlation.

• No Correlation

Two groups of data visualized on a scatter plot are said to have no correlation if there is no clear
correlation between them.



Advantages:

• It clearly shows data spread.

• It is usually colorful and visually appealing.

Disadvantages:

• It cannot give the exact extent of correlation.

• It can only be used to study the relationship between 2 variables.

6) Box and Whisker Chart:

A box and whisker chart is a statistical graph for displaying sets of numerical data through their
quartiles. It displays a frequency distribution of the data.

The box and whisker chart helps you to display the spread and skewness for a given set of data
using the five number summary principle: minimum, maximum, median, lower and upper
quartiles. The ‘five-number summary’ principle allows providing a statistical summary for a
particular set of numbers. It shows you the range (minimum and maximum numbers), the spread
(upper and lower quartiles), and the center (median) for the set of data numbers.


Box and Whisker Chart Uses:

• When you want to observe the upper and lower quartiles, mean, median, deviations, etc. for a
large set of data.
• When you want a quick view of the dataset distribution.



• When you have multiple data sets that come from independent sources and relate to each
other in some way.
• When you need to compare data from different categories.

7) Dot Plot:

Dot plot or dot graph is just one of the many types of graphs and charts to organize statistical
data. It uses dots to represent data. A Dot Plot is used for relatively small sets of data and the values
fall into a number of discrete categories.

If a value appears more than one time, the dots are ordered one above the other. That way the
column height of dots shows the frequency for that value.

Dot Plot Uses:

• To plot frequency counts when you have a small number of categories.

• Dot plots are very useful when the variable is quantitative or categorical.
• Dot graphs are also used for univariate data.

8) Heat map:

A heat map is a two-dimensional representation of data in which values are represented by colors.
A simple heat map provides an immediate visual summary of information. More elaborate heat
maps allow the viewer to understand complex data sets.

In this election heat map, the red states are Republican and the blue states are Democrat.



9) Summary statistics:

Summary statistics are used to summarize a set of observations, in order to communicate the
largest amount of information as simply as possible. Statisticians commonly try to describe the
observations in

• a measure of location, or central tendency, such as the arithmetic mean
• a measure of statistical dispersion like the standard deviation
• a measure of the shape of the distribution like skewness or kurtosis
• if more than one variable is measured, a measure of statistical dependence such as a
correlation coefficient

A common collection of order statistics used as summary statistics are the five-number summary,
sometimes extended to a seven-number summary.

The five-number summary is a set of descriptive statistics that provides information about a
dataset. It consists of the five most important sample percentiles:

1. the sample minimum (smallest observation)
2. the lower quartile or first quartile
3. the median (the middle value)
4. the upper quartile or third quartile
5. the sample maximum (largest observation)
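NumPy can compute the five-number summary directly; a minimal sketch with illustrative data:

import numpy as np
data = np.array([1, 3, 5, 7, 9, 11, 13])
print(np.min(data))             # 1    (sample minimum)
print(np.percentile(data, 25))  # 4.0  (lower quartile)
print(np.median(data))          # 7.0  (median)
print(np.percentile(data, 75))  # 10.0 (upper quartile)
print(np.max(data))             # 13   (sample maximum)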

10) Heat map (plotting example):

A heatmap contains values representing various shades of the same colour for each value to be
plotted. Usually the darker shades of the chart represent higher values than the lighter shade. For
a very different value a completely different colour can also be used.
The below example is a two-dimensional plot of values which are mapped to the indices and
columns of the chart.



from pandas import DataFrame
import matplotlib.pyplot as plt

data = [[2,3,4,1],[6,3,5,2],[6,3,5,4],[3,7,5,4],[2,8,1,5]]
Index= ['I1', 'I2','I3','I4','I5']
Cols = ['C1', 'C2', 'C3','C4']
df = DataFrame(data, index=Index, columns=Cols)

plt.pcolor(df)
plt.show()

OUTPUT: a grid of coloured cells, one cell per value in the dataframe.

Philosophy of EDA
EDA is done for some of the same reasons it’s done with smaller datasets, but there are additional
reasons to do it with data that has been generated from logs. There are important reasons anyone
working with data should do EDA.
EDA means to gain intuition about the data; to make comparisons between distributions; for sanity
checking to find out where data is missing or if there are outliers; and to summarize the data. In the
context of data generated from logs, EDA also helps with debugging the logging process.
EDA helps you make sure the product is performing as intended. Although there’s lots of
visualization involved in EDA, we distinguish between EDA and data visualization in that EDA is
done toward the beginning of analysis, and data visualization is done toward the end to
communicate one’s findings. With EDA, the graphics are done solely for you to understand what’s
going on.
With EDA, we can also use the understanding we get to inform and improve the development of
algorithms.
Plotting data and making comparisons can get you extremely far, and is far better to do than getting
a dataset and immediately running a regression just because you know how.



Analysts and data scientists should treat EDA as a critical part of the process of working
with data and make it part of their standard workflow.

Data Visualization:

1) Scatter plot

Scatterplots use a collection of points placed using Cartesian Coordinates to display values from two
variables. By displaying a variable in each axis, you can detect if a relationship or correlation
between the two variables exists.

Various types of correlation can be interpreted through the patterns displayed on Scatterplots.
These are: positive (values increase together), negative (one value decreases as the other
increases), null (no correlation), linear, exponential and U-shaped. The strength of the correlation
can be determined by how closely packed the points are to each other on the graph. Points that
end up far outside the general cluster of points are known as outliers.



2) Bar Chart

A bar chart displays categorical data with rectangular bars whose length or height corresponds to
the value of each data point.

Bar charts can be visualized using vertical or horizontal bars. Bar charts are best used to compare a
single category of data or several. When comparing more than one category of data, the bars can
be grouped together to create a grouped bar chart.

Bar charts use volume to demonstrate differences between each bar. Because of this, bar charts
should always start at zero. When bar charts do not start at zero, it risks users misjudging the
difference between data values.


3) Histogram

A histogram is a chart that displays numeric data in ranges, where each bar represents how
frequently numbers fall into a particular range.

Like a bar chart, histograms consist of a series of vertical bars along the x-axis. Histograms are most
commonly used to depict what a set of data looks like in aggregate. At a quick glance, histograms
tell whether a dataset has values that are clustered around a small number of ranges or are more
spread out.



4) Boxplot:

A box and whisker plot is a graph that presents the distribution of a category of data.

Typically, box and whisker plots break the data into four or five points. Four point, or quartile
boxplots, present the “box” as defined by the first and third quartile. The median value is also
depicted in the box and the “whiskers” represent the minimum and maximum values in the data.

In a five point, or quintile, boxplot the points are based on quintiles instead. Presenting data in this
way is useful for indicating whether a distribution is skewed and whether there are potential
outliers in the data.
Box and whisker plots are also useful for comparing two or more datasets and for representing a
large number of observations.

Box and whisker plots can be displayed horizontally or vertically and displayed side-by-side for
comparisons.



5) Heat Maps:

You can quickly grasp the state and impact of a large number of variables at one time by
displaying your data with a heat map visualization. A heat map visualization is a combination of
nested, colored rectangles, each representing an attribute element. Heat Maps are often used
in the financial services industry to review the status of a portfolio.

The rectangles contain a wide variety and many shadings of colors, which emphasize the weight
of the various components. In a heat map visualization:

• The size of each rectangle represents its relative weight. The legend provides
information about the minimum and maximum values.

• The color of each rectangle represents its relative value. The legend provides the range
of values for each color.
• Data is grouped based on the order of the attributes in the Grouping area of the Editor
panel.

