
UNIT - III

Introduction to NumPy, Pandas, Matplotlib.

Exploratory Data Analysis (EDA), Data Science life cycle, Descriptive Statistics, Basic tools
(plots, graphs and summary statistics) of EDA, Philosophy of EDA. Data Visualization:
Scatter plot, bar chart, histogram, boxplot, heat maps, etc

NumPy :
 NumPy is a python library used for working with arrays.
 NumPy stands for Numerical Python.
 It is the core library for scientific computing, which contains a powerful n-dimensional array
object.

Operations using NumPy


Using NumPy, a developer can perform the following operations −

 Mathematical and logical operations on arrays.


 Fourier transforms and routines for shape manipulation.
 Operations related to linear algebra. NumPy has in-built functions for linear algebra and
random number generation.
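
A minimal sketch of these operations (the arrays and values here are only illustrative):

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.add(a, b))          # element-wise addition  -> [5 7 9]
print(a > 2)                 # logical operation       -> [False False  True]
print(np.fft.fft(a))         # one-dimensional Fourier transform
m = np.array([[1, 2], [3, 4]])
print(np.linalg.inv(m))      # inverse of a 2x2 matrix (linear algebra)
print(np.random.rand(2, 2))  # 2x2 array of random numbers in [0, 1)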

Why Use NumPy?


 In Python we have lists that serve the purpose of arrays, but they are slow to process.
 NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
 The array object in NumPy is called ndarray, it provides a lot of supporting functions that
make working with ndarray very easy.
 Arrays are very frequently used in data science, where speed and resources are very
important.
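
The speed difference can be checked informally with a rough, machine-dependent timing sketch (not a rigorous benchmark; the exact numbers will vary):

import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.time()
squares_list = [x * x for x in py_list]   # element-wise square with a Python list
print("list loop:", time.time() - start)

start = time.time()
squares_arr = np_arr * np_arr             # vectorised square with a NumPy ndarray
print("ndarray  :", time.time() - start)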

Install NumPy Package?

Before NumPy's functions and methods can be used, NumPy must be installed. Depending on
which distribution of Python you use, the installation method is slightly different.

Install NumPy with pip

To install NumPy with pip, bring up a terminal window and type:

> pip install numpy

This command installs NumPy in the current working Python environment.

Verify NumPy installation


 To verify NumPy is installed, invoke NumPy's version using the Python REPL. Import
NumPy and call the .__version__ attribute common to most Python packages.
 To get started with NumPy, let's adopt the standard convention and import it using the
name.
import numpy
Now, it is ready to use.

 Usually, numpy is imported with np alias. alias is alternate name for referencing the same
thing.
import numpy as np

>>>np.__version__

'1.16.4'

 A version number like '1.16.4' indicates a successful NumPy installation.

NumPy Creating Arrays


Create a NumPy ndarray Object

 NumPy is used to work with arrays. The array object in NumPy is called ndarray.
 We can create a NumPy ndarray object by using the array() function.
 To create an ndarray, we can pass a list, tuple or any array-like object into the array() method,
and it will be converted into an ndarray:

Example:

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

output: [1 2 3 4 5]

type(): This built-in Python function tells us the type of the object passed to it. In the code below, it shows that arr is of type numpy.ndarray.

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))

output: [1 2 3 4 5]
<class 'numpy.ndarray'>

Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).

0-D Arrays

0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
import numpy as np
arr = np.array(42)
print(arr)

Output: 42

1-D Arrays

 An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
 These are the most common and basic arrays.

Example

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

output: [1 2 3 4 5]

2-D Arrays

 An array that has 1-D arrays as its elements is called a 2-D array.
 These are often used to represent matrix or 2nd order tensors.
NumPy has a whole submodule dedicated to matrix operations called numpy.matlib

Example

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)

output:

[[1 2 3]

[4 5 6]]

3-D arrays

 An array that has 2-D arrays (matrices) as its elements is called 3-D array.
 These are often used to represent a 3rd order tensor.

Example

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output:

[[[1 2 3]

[4 5 6]]

[[1 2 3]

[4 5 6]]]

Check Number of Dimensions


NumPy arrays provide the ndim attribute that returns an integer that tells us how many dimensions
the array has.

Example

import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

Output:

0
1
2
3

NumPy Array Indexing


Access Array Elements

 Array indexing is the same as accessing an array element.


 You can access an array element by referring to its index number.
 The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.

Example

import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
print(arr[2] + arr[3])

Output:

1
7

Access 2-D Arrays

 To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
 Think of 2-D arrays like a table with rows and columns, where the row represents the
dimension and the index represents the column.

Example

Access the element on the first row, second column:

import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print(arr[0, 1])

Output:

2

Access 3-D Arrays

To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.

Example

Access the third element of the second array of the first array:

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])

Output:

6

Example Explained

arr[0, 1, 2] prints the value 6.

And this is why:

The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]

The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]

The third number represents the third dimension, which contains three values:
4
5
6
Since we selected 2, we end up with the third value:
6

Negative Indexing
Use negative indexing to access an array from the end.

Example

Print the last element from the 2nd dim:

import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print( arr[1, -1])

Output:

10

NumPy Array Slicing


Slicing arrays
 Slicing in python means taking elements from one given index to another given index.
 We pass slice instead of index like this: [start:end].
 We can also define the step, like this: [start:end:step].
If we don't pass start, it is considered 0
If we don't pass end, it is considered the length of the array in that dimension
If we don't pass step, it is considered 1

Example

Slice elements from index 1 to index 5 from the following array:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])

Output:

[2 3 4 5]
Note: The result includes the start index, but excludes the end index.

Example

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])

Output:

[5 6 7]

Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[:4])

Output:

[1 2 3 4]

Negative Slicing

Use the minus operator to refer to an index from the end:

Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])

Output:

[5 6]

STEP
Use the step value to determine the step of the slicing:

Example

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])

Output:
[2 4]

Slicing 2-D Arrays


Example

From the second element, slice elements from index 1 to index 4 (not included):

import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])

Output:

[7 8 9]

Note: Remember that second element has index 1.

Example

From both elements, return index 2:

import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 2])

Output:

[3 8]

Example

From both elements, slice index 1 to index 4 (not included), this will return a 2-D array:

import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])

Output:

[[2 3 4]
[7 8 9]]

Data Types in Python


By default Python have these data types:

 strings - used to represent text data, the text is given under quote marks. e.g. "ABCD"
 integer - used to represent integer numbers. e.g. -1, -2, -3
 float - used to represent real numbers. e.g. 1.2, 42.42
 boolean - used to represent True or False.
 complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j

Data Types in NumPy


NumPy has some extra data types, and it refers to data types with one character, like i for integers, u for
unsigned integers, etc. Below is a list of all data types in NumPy and the characters used to represent them.

 i - integer
 b - boolean
 u - unsigned integer
 f - float
 c - complex float
 m - timedelta
 M - datetime
 O - object
 S - string
 U - unicode string
 V - fixed chunk of memory for other type ( void )

Checking the Data Type of an Array


The NumPy array object has a property called dtype that returns the data type of the array:

Example

Get the data type of an array object:

import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)

Output:

int32

For i, u, f, S and U we can define size as well.

Example

Create an array with data type 4 bytes integer:

import numpy as np
arr = np.array([1, 2, 3, 4], dtype='i4')
print(arr)
print(arr.dtype)

Output:

[1 2 3 4]
int32

What if a Value Can Not Be Converted?


If a type is given in which elements can't be cast, then NumPy will raise a ValueError.

ValueError: In Python ValueError is raised when the type of passed argument to a function is
unexpected/incorrect.

Example
A non integer string like 'a' cannot be converted to integer (will raise an error):

import numpy as np
arr = np.array(['a', '2', '3'], dtype='i')

print(arr)

Output:

ValueError: invalid literal for int() with base 10: 'a'

Converting Data Type on Existing Arrays

 The best way to change the data type of an existing array, is to make a copy of the array with
the astype() method.
 The astype() function creates a copy of the array, and allows you to specify the data type as a
parameter.
 The data type can be specified using a string, like 'f' for float, 'i' for integer etc. or you can use the
data type directly like float for float and int for integer.

Example

Change data type from float to integer by using 'i' as parameter value:

import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)

Output:

[1 2 3]
int32

Example

Change data type from float to integer by using int as parameter value:

import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype(int)
print(newarr)
print(newarr.dtype)

Output:

[1 2 3]
int32
Example

Change data type from integer to boolean:

import numpy as np
arr = np.array([1, 0, 3])
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)

Output:

[ True False True]

bool

NumPy Array Shape


Shape of an Array

The shape of an array is the number of elements in each dimension.

Get the Shape of an Array

NumPy arrays have an attribute called shape that returns a tuple with each index having the number of
corresponding elements.

Example

Print the shape of a 2-D array:

import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)

Output:

(2, 4)

The example above returns (2, 4), which means that the array has 2 dimensions, where the first
dimension has 2 elements and the second has 4.

Example

Create an array with 5 dimensions using ndmin using a vector with values 1,2,3,4 and verify that last
dimension has value 4:

import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)

Output:

[[[[[1 2 3 4]]]]]

shape of array : (1, 1, 1, 1, 4)

Reshaping arrays
 Reshaping means changing the shape of an array.
 The shape of an array is the number of elements in each dimension.
 By reshaping we can add or remove dimensions or change number of elements in each
dimension.

Reshape From 1-D to 2-D


Example

Convert the following 1-D array with 12 elements into a 2-D array. The outermost dimension will
have 4 arrays, each with 3 elements:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)

Output:

[[ 1 2 3]

[ 4 5 6]

[ 7 8 9]

[10 11 12]]

Reshape From 1-D to 3-D


Example

Convert the following 1-D array with 12 elements into a 3-D array. The outermost dimension will
have 2 arrays that contains 3 arrays, each with 2 elements:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)

Output:

[[[ 1  2]
  [ 3  4]
  [ 5  6]]

 [[ 7  8]
  [ 9 10]
  [11 12]]]

Reshape Into any Shape?

 Yes, as long as the elements required for reshaping are equal in both shapes.
We can reshape an 8-element 1-D array into a 2-D array with 2 rows of 4 elements, but we cannot
reshape it into a 2-D array with 3 rows of 3 elements, as that would require 3x3 = 9 elements.

Example

Try converting 1D array with 8 elements to a 2D array with 3 elements in each dimension (will raise
an error):

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
newarr = arr.reshape(3, 3)
print(newarr)

Output:

Traceback (most recent call last):

File "C:\Users\DELL\Desktop\array1.py", line 3, in <module>

newarr = arr.reshape(3, 3)

ValueError: cannot reshape array of size 8 into shape (3,3)

Unknown Dimension
 You are allowed to have one "unknown" dimension.
 Meaning that you do not have to specify an exact number for one of the dimensions in the
reshape method.
 Pass -1 as the value, and NumPy will calculate this number for you.

Example

Convert 1D array with 8 elements to 3D array with 2x2 elements:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
newarr = arr.reshape(2, 2, -1)
print(newarr)

Output:

[[[1 2]

[3 4]]

[[5 6]

[7 8]]]
Note: We cannot pass -1 to more than one dimension.
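
For instance, an illustrative snippet (the shapes are arbitrary) showing that passing -1 for two dimensions is ambiguous and raises an error:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
# Only one dimension may be left "unknown"; this line raises
# ValueError: can only specify one unknown dimension
newarr = arr.reshape(2, -1, -1)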

Flattening the arrays


 Flattening array means converting a multidimensional array into a 1D array.
 We can use reshape(-1) to do this.

Example

Convert the array into a 1D array:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
newarr = arr.reshape(-1)
print(newarr)

Output:

[1 2 3 4 5 6]

Iterating Arrays
 Iterating means going through elements one by one.
 As we deal with multi-dimensional arrays in numpy, we can do this using basic for loop of
python.
 If we iterate on a 1-D array it will go through each element one by one.

Example

Iterate on the elements of the following 1-D array:

import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
    print(x)

Output:

1
2
3

Iterating 2-D Arrays


In a 2-D array it will go through all the rows.

Example

Iterate on the elements of the following 2-D array:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    print(x)

Output:

[1 2 3]

[4 5 6]

If we iterate on an n-D array, it will go through the (n-1)-D arrays one by one. To return the actual values,
the scalars, we have to iterate the arrays in each dimension.

Example

Iterate on each scalar element of the 2-D array:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    for y in x:
        print(y)

Output:

1
2
3
4
5
6

Iterating 3-D Arrays


In a 3-D array it will go through all the 2-D arrays. To return the actual values, the scalars, we have to
iterate the arrays in each dimension.

Example

Iterate on the elements of the following 3-D array:

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)

Output:

[[1 2 3]

[4 5 6]]

[[ 7 8 9]

[10 11 12]]
Example

Iterate down to the scalars. To return the actual values, the scalars, we have to iterate the arrays in
each dimension.

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    for y in x:
        for z in y:
            print(z)

Output:

1
2
3
4
5
6
7
8
9
10
11
12

Iterating Array With Different Data Types


We can use op_dtypes argument and pass it the expected datatype to change the datatype of elements
while iterating.

NumPy does not change the data type of the element in-place (where the element is in array) so it
needs some other space to perform this action, that extra space is called buffer, and in order to enable
it in nditer() we pass flags=['buffered'].

Example

import numpy as np
arr = np.array([1, 2, 3])
for x in np.nditer(arr, flags=['buffered'], op_dtypes=['S']):
    print(x)

Output:

b'1'

b'2'

b'3'

Enumerated Iteration Using ndenumerate()


 Enumeration means mentioning sequence number of some things one by one.
 Sometimes we require corresponding index of the element while iterating,
the ndenumerate() method can be used for those use cases.

Example

Enumerate on following 1D arrays elements:

import numpy as np
arr = np.array([1, 2, 3])
for idx, x in np.ndenumerate(arr):
    print(idx, x)

Output:

(0,) 1

(1,) 2

(2,) 3

Enumerate on following 2D array's elements:

import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for idx, x in np.ndenumerate(arr):
    print(idx, x)

Output:

(0, 0) 1

(0, 1) 2

(0, 2) 3

(0, 3) 4

(1, 0) 5

(1, 1) 6

(1, 2) 7

(1, 3) 8
Joining NumPy Arrays
 Joining means putting contents of two or more arrays in a single array.
 In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.
 We pass a sequence of arrays that we want to join to the concatenate() function, along with
the axis. If axis is not explicitly passed, it is taken as 0.

Example

Join two arrays

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)

Output:

[1 2 3 4 5 6]

Example

Join two 2-D arrays along rows (axis=1):

import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr)

Output:

[[1 2 5 6]

[3 4 7 8]]

Splitting NumPy Arrays


 Splitting is reverse operation of Joining.
 Joining merges multiple arrays into one and Splitting breaks one array into multiple.
 We use array_split() for splitting arrays, we pass it the array we want to split and the number
of splits.

Example

Split the array in 3 parts:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)

Output:
[array([1, 2]), array([3, 4]), array([5, 6])]

Note: The return value is an array containing three arrays.

If the array has fewer elements than required, array_split() will adjust the sizes from the end accordingly.

Example

Split the array in 4 parts:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)
print(newarr)

Output:

[array([1, 2]), array([3, 4]), array([5]), array([6])]

Note: We also have the split() method available, but it will not adjust the sizes when the source array
has fewer elements than needed for an even split. In the example above, array_split() worked properly,
but split() would fail.
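
A minimal sketch of the difference, using the same uneven split as above:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# array_split() tolerates an uneven division and shrinks the last parts:
print(np.array_split(arr, 4))
# split() requires an equal division and raises
# ValueError: array split does not result in an equal division
np.split(arr, 4)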

Split Into Arrays


The return value of the array_split() method is a list containing each of the splits as an array.
 If you split an array into 3 arrays, you can access them from the result just like any array
element:

Example

Access the split arrays:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr[0])
print(newarr[1])
print(newarr[2])

Output:

[1 2]

[3 4]

[5 6]

Splitting 2-D Arrays


 Use the same syntax when splitting 2-D arrays.
 Use the array_split() method, pass in the array you want to split and the number of splits you
want to do.

Example

Split the 2-D array into three 2-D arrays.

import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)

Output:

[array([[1, 2], [3, 4]]),

array([[5, 6], [7, 8]]),

array([[ 9, 10],[11, 12]])]

 The example above returns three 2-D arrays.


 In addition, you can specify which axis you want to do the split around.
 The example below also returns three 2-D arrays, but they are split along the row (axis=1).

Example

Split the 2-D array into three 2-D arrays along rows.

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3, axis=1)
print(newarr)

Output:

[array([[ 1],[ 4],

[ 7],

[10],

[13],

[16]]), array([[ 2],

[ 5],

[ 8],

[11],

[14],

[17]]), array([[ 3],

[ 6],

[ 9],

[12],
[15],

[18]])]


An alternate solution is to use hsplit(), which is the opposite of hstack().

Example

Use the hsplit() method to split the 2-D array into three 2-D arrays along rows.

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.hsplit(arr, 3)
print(newarr)

Output:

[array([[ 1],

[ 4],

[ 7],

[10],

[13],

[16]]), array([[ 2],

[ 5],

[ 8],

[11],

[14],

[17]]), array([[ 3],

[ 6],

[ 9],

[12],

[15],

[18]])]

Searching Arrays
 You can search an array for a certain value, and return the indexes that get a match.
 To search an array, use the where() method.
Example

Find the indexes where the value is 4:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)

Output:

(array([3, 5, 6], dtype=int32),)

The example above returns a tuple: (array([3, 5, 6]),), which means that the value 4 is
present at indexes 3, 5, and 6.

Example

Find the indexes where the values are even:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 0)
print(x)

Output:

(array([1, 3, 5, 7], dtype=int32),)

Example

Find the indexes where the values are odd:

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 1)
print(x)

Output:

(array([0, 2, 4, 6], dtype=int32),)

Sorting Arrays

 Sorting means putting elements in an ordered sequence.


 Ordered sequence is any sequence that has an order corresponding to elements, like numeric
or alphabetical, ascending or descending.
 The NumPy ndarray object has a function called sort(), that will sort a specified array.
Example

Sort the array:

import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))

Output:

[0 1 2 3]

Note: This method returns a copy of the array, leaving the original array unchanged. You can also sort
arrays of strings, or any other data type:

Example

Sort the array alphabetically:

import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))

Output:

['apple' 'banana' 'cherry']

Example

Sort a boolean array:

import numpy as np
arr = np.array([True, False, True])
print(np.sort(arr))

Output:

[False True True]

Sorting a 2-D Array

If you use the sort() method on a 2-D array, each row (inner array) will be sorted:

Example

Sort a 2-D array:


import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))

Output:

[[2 3 4]

[0 1 5]]
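
As noted earlier, np.sort() returns a sorted copy and leaves the original untouched. The ndarray object also has its own sort() method that sorts in place; a small sketch contrasting the two:

import numpy as np

arr = np.array([3, 2, 0, 1])

print(np.sort(arr))   # [0 1 2 3] -- a sorted copy is returned
print(arr)            # [3 2 0 1] -- the original is unchanged

arr.sort()            # the ndarray method sorts the array in place
print(arr)            # [0 1 2 3]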

Pandas
What is Pandas?
 Pandas is a Python library used for working with data sets.
 Pandas is used for data analysis in Python and developed by Wes McKinney in 2008.
Pandas is an open-source library that provides high-performance tools for analyzing,
cleaning, exploring, and manipulating data and for machine learning tasks in Python.
The name Pandas is derived from "Panel Data", an econometrics term for multidimensional
data sets.

Why Use Pandas?


Pandas is used in Python for the following advantages:

 Pandas allow us to analyze big data and make conclusions based on statistical theories.
 Pandas can clean messy data sets, and make them readable and relevant.
 Relevant data is very important in data science.
 Easily handles missing data
 It uses Series for one-dimensional data structure and DataFrame for multi-dimensional data
structure.
 It provides an efficient way to slice the data
 It provides a flexible way to merge, concatenate or reshape the data

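A small illustrative sketch of the slicing and merge/concatenate points above (the DataFrames and column names are only examples):

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'id': [1, 2, 4], 'marks': [90, 80, 70]})

print(left[0:2])                        # slice the first two rows
print(pd.concat([left, right]))         # stack one DataFrame below the other
print(pd.merge(left, right, on='id'))   # SQL-style merge on the 'id' column
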
How to Install Pandas?


 The first step of working in pandas is to ensure whether it is installed in the Python folder or
not.
 If not then we need to install it in our system using pip command.
Open the Command Prompt (type cmd in the search box) and, if necessary, use the cd command
to locate the folder where pip has been installed.
 After locating it, type the command:
pip install pandas
 After the pandas have been installed into the system, you need to import the library. This
module is generally imported as:
import pandas

Pandas as pd
 Pandas is usually imported under the pd alias.
 alias: In Python alias are an alternate name for referring to the same thing.
 Create an alias with the as keyword while importing:
 Now the Pandas package can be referred to as pd instead of pandas.
import pandas as pd

Checking Pandas Version


The version string is stored under __version__ attribute.

Example

import pandas as pd
print(pd.__version__)

Python Pandas Data Structure

The Pandas provides two data structures for processing the data, i.e., Series and DataFrame, which
are discussed below:

1) Pandas Series
 A Pandas Series is like a column in a table.
 It is defined as a one-dimensional array that is capable of storing various data types.
 The row labels of series are called the index.
We can easily convert a list, tuple, or dictionary into a Series using the Series() method, whose
main parameter is the data.
 A Series cannot contain multiple columns.

Syntax:

pandas.Series( data, index, dtype, copy)


The parameters of the constructor are as follows −
 data : data takes various forms like ndarray, list, constants
 index : Index values must be unique and hashable, same length as data.
 Dtype: It refers to the data type of series.
 Copy: It is used for copying the data
Create an Empty Series

A basic series, which can be created is an Empty Series.

Example

#import the pandas library and aliasing as pd


import pandas as pd
s = pd.Series()
print(s)

Output:

Warning (from warnings module):

File "C:/Users/DELL/Desktop/panda.py", line 3

s = pd.Series()

DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future
version. Specify a dtype explicitly to silence this warning.

Series([], dtype: float64)

Create a Series from ndarray

 If data is an ndarray, then index passed must be of the same length.


 If no index is passed, then by default index will be range(n) where n is array length, i.e.,
[0,1,2,3…. range(len(array))-1].
Example 1:

Create a simple Pandas Series from an ndarray:

import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)

Output:

0 P

1 a

2 n

3 d
4 a

5 s

dtype: object

Example 2:

Create a Pandas Series from an ndarray with a custom index:

import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info, index = [100, 101, 102, 103, 104, 105])
print(a)

Output:

100 P

101 a

102 n

103 d

104 a

105 s

dtype: object

Create a Series from Scalar

 If data is a scalar value, an index must be provided. The value will be repeated to match the
length of index.

Example:

#import the pandas library and aliasing as pd


import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Output:

0 5

1 5

2 5

3 5

dtype: int64
Accessing Data from Series with Position

Data in the series can be accessed similar to that in an ndarray.

Example 1:

Retrieve the first element. As we already know, the counting starts from zero for the array, which
means the first element is stored at zeroth position and so on.

import pandas as pd
s = pd.Series([1,2,3,4,5])
#retrieve the first element
print(s[0])

Output:

1

Example 2 :

Retrieve the first three elements in the Series. If a : is used with only one index, all items from (or up to)
that index are extracted. If two indexes (with : between them) are used, the items between the two
indexes are retrieved (the stop index is excluded).

import pandas as pd
s = pd.Series([1,2,3,4,5])
#retrieve the first three elements
print(s[:3])

Output:

0 1

1 2

2 3

dtype: int64

Example 3:

Retrieve the last three elements.

import pandas as pd
s = pd.Series([1,2,3,4,5] )
#retrieve the last three elements
print(s[-3:])

Output:

2 3

3 4

4 5

dtype: int64

Retrieve Data Using Label (Index)


A Series is like a fixed-size dict in that you can get and set values by index label.

Example 1:

Retrieve a single element using index label value.

import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
#retrieve a single element using its label
print(s['a'])

Output:

1

Example 2

Retrieve multiple elements using a list of index label values.

import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
#retrieve multiple elements using a list of labels
print(s[['a', 'b', 'c']])

Output:

a 1

b 2

c 3

dtype: int64

Example 3

If a label is not contained, an exception is raised.

import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
#retrieve an element whose label is not present
print(s['f'])

Output:

KeyError: 'f'

2) Pandas DataFrame:

 Pandas DataFrame is a widely used data structure which works with a two-dimensional array
with labeled axes (rows and columns).
 DataFrame is defined as a standard way to store data that has two different indexes, i.e., row
index and column index.
 It consists of the following properties:
o The columns can be heterogeneous types like int, bool, and so on.
o It can be seen as a dictionary of Series structure where both the rows and columns are
indexed. It is denoted as "columns" in case of columns and "index" in case of rows.

Syntax:

pandas.DataFrame( data, index, columns, dtype, copy)

 The parameters of the constructor are as follows –

 data: It consists of different forms like ndarray, series, map, constants, lists, array.

index: The default np.arange(n) index is used for the row labels if no index is passed.

columns: For column labels, the default is np.arange(n). This is only true if no column
labels are passed.

 dtype: It refers to the data type of each column.

copy: It is used for copying the data.

Create a DataFrame

We can create a DataFrame using following ways:

 dict
 Lists
NumPy ndarrays
 Series

Create an empty DataFrame


To create an empty DataFrame in Pandas:

# importing the pandas library


import pandas as pd
df = pd.DataFrame()
print (df)

Output:

Empty DataFrame

Columns: []

Index: []

Create a DataFrame using List:


The DataFrame can be created using a single list or a list of lists.

Example 1:

# importing the pandas library


import pandas as pd
# a list of strings
x = ['CIVIL', 'EEE', 'MECH','ECE','CSE','AIDS']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)

Output:

0 CIVIL

1 EEE

2 MECH

3 ECE

4 CSE

5 AIDS

Example 2:

# importing the pandas library


import pandas as pd
# a list of strings
x = [[101,'CIVIL'], [201,'EEE'], [301,'MECH'],[401,'ECE'],[501,'CSE'],[3001,'AIDS']]
# Calling DataFrame constructor on list
df = pd.DataFrame (x,columns = ['CODE','NAME'])
print(df)

Output:
CODE NAME

0 101 CIVIL

1 201 EEE

2 301 MECH

3 401 ECE

4 501 CSE

5 3001 AIDS

Example 3:

# importing the pandas library


import pandas as pd
# a list of strings
x = [[101,'CIVIL'], [201,'EEE'], [301,'MECH'],[401,'ECE'],[501,'CSE'],[3001,'AIDS']]
# Calling DataFrame constructor on list
df = pd.DataFrame(x, columns = ['CODE','NAME'], dtype = 'float')
print(df)

Output:

CODE NAME

0 101.0 CIVIL

1 201.0 EEE

2 301.0 MECH

3 401.0 ECE

4 501.0 CSE

5 3001.0 AIDS

Create a DataFrame from Dict of ndarrays / Lists

 All the ndarrays must be of same length. If index is passed, then the length of the index
should equal to the length of the arrays.
 If no index is passed, then by default, index will be range(n), where n is the array length.
Example 1:

# importing the pandas library

import pandas as pd
x = {'DEPTCODE':[101,201, 301, 401,501,3001],'DEPARTMENT NAME':['CIVIL', 'EEE',
'MECH','ECE','CSE','AIDS']}

df = pd.DataFrame(x)

print(df)

Output:

DEPTCODE DEPARTMENT NAME

0 101 CIVIL

1 201 EEE

2 301 MECH

3 401 ECE

4 501 CSE

5 3001 AIDS

Create a DataFrame from List of Dicts

List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by
default taken as column names.
Example 1:

import pandas as pd

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data, index = ['row1', 'row2'])

print(df)

Output:

a b c

row1 1 2 NaN

row2 5 10 20.0

Column Selection, Addition, and Deletion

Column Selection:

We can select any column from the DataFrame. Here is the code that demonstrates how to select a
column from the DataFrame.

Example:

import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1 ['one'])
Output:

a 1.0

b 2.0

c 3.0

d 4.0

e 5.0

f 6.0

g NaN

h NaN

Name: one, dtype: float64

Column Addition

We can add a new column to an existing DataFrame. The code below demonstrates how to add a new
column to an existing DataFrame:

Example:

# importing the pandas library


import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
# Add a new column to an existing DataFrame object
print ("Add new column by passing series")
df['three']=pd.Series([20,40,60],index=['a','b','c'])
print (df)
print ("Add new column using existing DataFrame columns")
df['four']=df['one']+df['three']
print (df)

Output:

Add new column by passing series

one two three

a 1.0 1 20.0

b 2.0 2 40.0

c 3.0 3 60.0
d 4.0 4 NaN

e 5.0 5 NaN

f NaN 6 NaN

Add new column using existing DataFrame columns

one two three four

a 1.0 1 20.0 21.0

b 2.0 2 40.0 42.0

c 3.0 3 60.0 63.0

d 4.0 4 NaN NaN

e 5.0 5 NaN NaN

f NaN 6 NaN NaN

Column Deletion:

We can delete any column from an existing DataFrame. This code demonstrates how a column
can be deleted from an existing DataFrame:

Example:

# importing the pandas library


import pandas as pd
info = {'one' : pd.Series([1, 2], index= ['a', 'b']),
'two' : pd.Series([1, 2, 3], index=['a', 'b', 'c'])}
df = pd.DataFrame(info)
print ("The DataFrame:")
print (df)
# using del function
print ("Delete the first column:")
del df['one']
print (df)

Output:

The DataFrame:
one two
a 1.0 1
b 2.0 2
c NaN 3
Delete the first column:
two
a 1
b 2
c 3

Row Selection, Addition, and Deletion


Row Selection:

We can select, add, or delete any row at any time. First of all, we will understand row selection.
Let's see how we can select a row in the following different ways:

Selection by Label:

We can select any row by passing the row label to a loc function.

Example:

# importing the pandas library


import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df.loc['b'])
Output:

one 2.0
two 2.0
Name: b, dtype: float64

Selection by integer location:

The rows can also be selected by passing the integer location to an iloc function.

Example:

# importing the pandas library

import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df.iloc[3])

Output:

one 4.0
two 4.0
Name: d, dtype: float64

Slice Rows

Multiple rows can also be selected using the ':' operator.


Example:

# importing the pandas library


import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df[2:5])
Output:

one two

c 3.0 3

d 4.0 4

e 5.0 5

Addition of rows:

We can easily add new rows to the DataFrame using append function. It adds the new rows at the end.

Example:

# importing the pandas library


import pandas as pd
d = pd.DataFrame([[7, 8], [9, 10]], columns = ['x','y'])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns = ['x','y'])
d = d.append(d2)
print (d)
Output:

x y
0 7 8
1 9 10
0 11 12
1 13 14

Deletion of rows:

We can delete or drop any rows from a DataFrame using the index label. If in case, the label is
duplicate then multiple rows will be deleted.

Example:

# importing the pandas library


import pandas as pd
a_info = pd.DataFrame([[4, 5], [6, 7]], columns = ['x','y'])
b_info = pd.DataFrame([[8, 9], [10, 11]], columns = ['x','y'])
a_info = a_info.append(b_info)
# Drop rows with label 0
a_info = a_info.drop(0)
print (a_info)

DataFrame Basic Functionality


The following tables lists down the important attributes or methods that help in DataFrame Basic
Functionality.
Sr.No. Attribute or Method & Description
1 T : Transposes rows and columns.
2 axes : Returns a list with the row axis labels and column axis labels as the only members.

3 dtypes : Returns the dtypes in this object.


4 empty : True if the NDFrame is entirely empty (no items), i.e., if any of the axes are of length 0.

5 ndim : Number of axes / array dimensions.

6 shape : Returns a tuple representing the dimensionality of the DataFrame.

7 size : Number of elements in the NDFrame.

8 values : Numpy representation of NDFrame.

9 head() : Returns the first n rows.

10 tail() : Returns last n rows.

DataFrame Functions
There are lots of functions used in DataFrame which are as follows:

Functions Description
Pandas DataFrame.append() Add the rows of other dataframe to the end of the given
dataframe.
Pandas DataFrame.apply() Allows the user to pass a function and apply it to every single
value of the Pandas series.
Pandas DataFrame.assign() Add new column into a dataframe.
Pandas DataFrame.astype() Cast the Pandas object to a specified dtype.astype() function.
Pandas DataFrame.concat() Perform concatenation operation along an axis in the
DataFrame.
Pandas DataFrame.count() Count the number of non-NA cells for each column or row.
Pandas DataFrame.describe() Calculate some statistical data like percentile, mean and std
of the numerical values of the Series or DataFrame.
Pandas DataFrame.drop_duplicates() Remove duplicate values from the DataFrame.
Pandas DataFrame.groupby() Split the data into various groups.
Pandas DataFrame.head() Returns the first n rows for the object based on position.
Pandas DataFrame.hist() Divide the values within a numerical variable into "bins".
Pandas DataFrame.iterrows() Iterate over the rows as (index, series) pairs.
Pandas DataFrame.mean() Return the mean of the values for the requested axis.
Pandas DataFrame.melt() Unpivots the DataFrame from a wide format to a long format.
Pandas DataFrame.merge() Merge the two datasets together into one.
Pandas DataFrame.pivot_table() Aggregate data with calculations such as Sum, Count,
Average, Max, and Min.
Pandas DataFrame.query() Filter the dataframe.
Pandas DataFrame.sample() Select the rows and columns from the dataframe randomly.
Pandas DataFrame.shift() Shift column or subtract the column value with the previous
row value from the dataframe.
Pandas DataFrame.sort() Sort the dataframe.
Pandas DataFrame.sum() Return the sum of the values for the requested axis by the
user.
Pandas DataFrame.to_excel() Export the dataframe to the excel file.
Pandas DataFrame.transpose() Transpose the index and columns of the dataframe.
Pandas DataFrame.where() Check the dataframe for one or more conditions.
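
A short illustrative sketch exercising a few of the attributes and functions listed above (the DataFrame used here is only an example):

import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]})

print(df.shape)       # (4, 2) -- dimensionality of the DataFrame
print(df.ndim)        # 2      -- number of axes
print(df.head(2))     # first two rows
print(df.describe())  # count, mean, std, min, quartiles, max
print(df.mean())      # column-wise mean
print(df.T)           # transpose of the DataFrame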

Working with CSV files


What is CSV file:
 CSV (Comma Separated Values) is a simple file format used to store tabular data, such as
a spreadsheet or database.
 A CSV file stores tabular data (numbers and text) in plain text.
 Each line of the file is a data record.
 Each record consists of one or more fields, separated by commas.
 Each cell in the spreadsheet is separated by commas, hence the name.
 The use of the comma as a field separator is the source of the name for this file format.

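For example, a small CSV file holding the student marks used later in this unit could look like this in plain text, with the first line holding the column names and each following line holding one record:

Student,Economics,Fine Arts,Mathematics
Kamal,10,7,7
Arun,8,8,3
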
Python DataFrame to CSV File


A CSV (comma-separated values) file is a text file that allows data to be stored in a table
format.
 Using .to_csv() method in Python Pandas we can convert DataFrame to CSV file.
 Syntax of pandas.DataFrame.to_csv() Function

Example:
import pandas as pd
mid_term_marks = {"Student": ["Kamal", "Arun", "David", "Thomas", "Steven"],
"Economics": [10, 8, 6, 5, 8],
"Fine Arts": [7, 8, 5, 9, 6],
"Mathematics": [7, 3, 5, 8, 5]}
mid_term_marks_df = pd.DataFrame(mid_term_marks)
print(mid_term_marks_df)
mid_term_marks_df.to_csv("D:\midterm.csv")
print(pd.read_csv('D:\midterm.csv'))

Output:

Student Economics Fine Arts Mathematics

0 Kamal 10 7 7

1 Arun 8 8 3

2 David 6 5 5
3 Thomas 5 9 8

4 Steven 8 6 5

Python Read CSV file :

 CSV stands for comma-separated values. A CSV file is a delimited text file that uses a
comma to separate values.

 CSV file to store tabular data in plain text.

 The CSV file format is quite popular and supported by many software applications such as
Notepad, Microsoft Excel and Google Spreadsheet.

 We can create a CSV file using the following ways:

1. Using Notepad: We can create a CSV file using Notepad. In the Notepad, open a new
file in which separate the values by comma and save the file with .csv extension.

2. Using Excel: We can also create a CSV file using Excel. In Excel, open a new file in
which specify each value in a different cell and save it with filetype CSV.

To read data row-wise from a CSV file in Python, we can use the reader object from the csv module,
which allows us to fetch the data row by row.
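
A minimal sketch using the csv module's reader (the path reuses the midterm.csv file written with to_csv() above; any CSV path would do):

import csv

with open('D:\\midterm.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)   # each row is returned as a list of strings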

Pandas read_csv() Method


Pandas is an open-source library that allows you to import CSV data in Python and perform data
manipulation. Pandas provides an easy way to create, manipulate and delete the data.
You must install the pandas library with the command pip install pandas. In
Windows, you will execute this command in the Command Prompt, while in Linux you will use the
Terminal.
 To import a CSV dataset, you can use the object pd.read_csv().

Syntax
pandas.read_csv(filepath_or_buffer,sep=',',`names=None`,`index_col=None`,
`skipinitialspace=False`)

filepath_or_buffer: Path or URL with the data

sep=',': Defines the delimiter to use
names=None: Names the columns. If the dataset has ten columns, you need to pass ten names
index_col=None: If set, the given column is used as the row index
skipinitialspace=False: Skips spaces after the delimiter.

Example:

import pandas
result = pandas.read_csv('D:\data.csv')
print(result)
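
A small sketch combining the parameters described above (the file path and options are only an example):

import pandas as pd

result = pd.read_csv('D:\\data.csv',
                     sep=',',                # field delimiter
                     index_col=0,            # use the first column as the row index
                     skipinitialspace=True)  # drop spaces that follow the delimiter
print(result.head())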

Pandas - Cleaning Data


Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.

Bad data could be:

1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates

1) Cleaning Empty Cells


Empty cells can potentially give you a wrong result when you analyze data.

Remove Rows

One way to deal with empty cells is to remove rows that contain empty cells.

Example

#Return a new Data Frame with no empty cells:

Example:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())

Note: By default, the dropna() method returns a new DataFrame, and will not change the original.

If you want to change the original DataFrame, use the inplace = True argument:

Example

#Remove all rows with NULL values:

import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())

Note: Now, dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows
containing NULL values from the original DataFrame.

Replace Empty Values

 Another way of dealing with empty cells is to insert a new value instead.
 This way you do not have to delete entire rows just because of some empty cells.
 The fillna() method allows us to replace empty cells with a value:

#Replace NULL values with the number 130:

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)

2) Cleaning Data of Wrong Format:


 Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
 To fix it, you have two options: remove the rows, or convert all cells in the columns into the
same format.

Convert Into a Correct Format example data:


 In our Data Frame, we have two cells with the wrong format.
 The 'Date' column should be a string that represents a date:
 Let's try to convert all cells in the 'Date' column into dates.
 Pandas has a to_datetime() method for this:

Example
#Convert to date:

import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

As you can see from the result, the date in row 26 was fixed, but the empty date in row 22 got a NaT
(Not a Time) value, in other words an empty value. One way to deal with empty values is simply
removing the entire row.
Removing Rows
The result from the converting in the example above gave us a NaT value, which can be handled as a
NULL value, and we can remove the row by using the dropna() method.

Example

#Remove rows with a NULL value in the "Date" column:

df.dropna(subset=['Date'], inplace = True)

3) Fixing Wrong Data

 "Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like
if someone registered "199" instead of "1.99".
 Sometimes you can spot wrong data by looking at the data set.
 If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the
other rows the duration is between 30 and 60.
Replacing Values
 One way to fix wrong values is to replace them with something else.
 In our example, it is most likely a typo, and the value should be "45" instead of "450", and we
could just insert "45" in row 7:

Example

#Set "Duration" = 45 in row 7:

df.loc[7, 'Duration'] = 45

 For small data sets you might be able to replace the wrong data one by one, but not for big
data sets.
 To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries
for legal values, and replace any values that are outside of the boundaries.

Example

 Loop through all values in the "Duration" column.


 If the value is higher than 120, set it to 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

Removing Rows

 Another way of handling wrong data is to remove the rows that contain wrong data.
 This way you do not have to find out what to replace them with, and there is a good chance
you do not need them to do your analyses.

Example

#Delete rows where "Duration" is higher than 120:

for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)

4) Removing Duplicates
Discovering Duplicates
 Duplicate rows are rows that have been registered more than one time.
 By taking a look at our test data set
 To discover duplicates, we can use the duplicated() method.
 The duplicated() method returns a Boolean values for each row:

Example

Returns True for every row that is a duplicate, otherwise False:

print(df.duplicated())

Removing Duplicates
To remove duplicates, use the drop_duplicates() method.

Example

#Remove all duplicates:

df.drop_duplicates(inplace = True)

The (inplace = True) will make sure that the method does NOT return a new DataFrame, but it will
remove all duplicates from the original DataFrame.

Matplotlib
 Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.
 Matplotlib was created by John D. Hunter in 2002
 Its first version was released by 2003.
 Matplotlib is open source and we can use it freely.
 Matplotlib is mostly written in python, a few segments are written in C, Objective-C and
Javascript for Platform compatibility.

Features of matplotlib
 Matplotlib is used as a data visualization library for the Python programming language.
Matplotlib provides a procedural interface called Pylab, which is designed to make it
work like MATLAB, a programming language used by scientists and researchers. MATLAB is
paid application software and is not open source.
 It is similar to plotting in MATLAB, as it allows users to have a full control over fonts, lines,
colors, styles, and axes properties like MATLAB.
 It provides excellent way to produce quality static-visualizations that can be used for
publications and professional presentations.
Matplotlib is a cross-platform library that can be used in python scripts, in any python
shell (IDLE, PyCharm, etc.) and IPython shells (conda, jupyter notebook), in web
application servers (Django, Flask), and with various GUI toolkits (Tkinter, PyQt).

Installation of Matplotlib
The python package manager pip is also used to install matplotlib. Open the command prompt
window, and type the following command:

pip install matplotlib

Verify the Installation

To verify that matplotlib is installed properly, type the following commands, which call the
.__version__ attribute, in the Python shell:

import matplotlib
matplotlib.__version__

Output:
'3.1.1'

Matplotlib Pyplot
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported
under the plt alias.
matplotlib.pyplot is a collection of command-style functions that make matplotlib feel
like working with MATLAB.
Each pyplot function makes some change to the plot (figure).
The pyplot module provides the plot() function, which is frequently used to plot a graph.
One function creates a figure: matplotlib.pyplot.figure(); another function creates a
plotting area in a figure: matplotlib.pyplot.plot().
 Plots some lines in a plotting area.
 Decorates the plot with labels, annotations, etc.

 You can import the pyplot API in python by the following code:

import matplotlib.pyplot as plt

-- OR

from matplotlib import pyplot as plt

 In the above code, the pyplot API from the matplotlib library is imported into the program
and referenced as the alias name plt. You can give any name, but plt is standard and most
commonly used.

Types of plots in matplotlib


There are a variety of plots available in matplotlib, the following are some most commonly used plots:

S.No. Plot functions Description

1 plot() You can plot markers and/or lines to the axes.

2 scatter() It creates a scatter plot of x VS y.

3 bar() It creates a bar plot.

4 barh() It creates a horizontal bar plot.

5 hist() It plots a histogram.

6 hist2d() It creates a 2D histogram plot.

7 boxplot() It creates a box-with-whisker plot.

8 pie() It plots a pie chart.

9 stackplot() It creates a stacked area plot.

10 polar() It creates a polar plot.

11 stem() It creates a stem plot.

12 step() It creates a step plot.


S.No. Plot functions Description

13 quiver() It plots a 2D field of arrows.

Plot():
The plot() function is used to draw a line graph. A line graph is a chart that shows
information as a series of data points connected by straight lines.
 By default, the plot() function draws a line from point to point.
 The function takes parameters for specifying points in the diagram.

Syntax :
matplotlib.pyplot.plot()
Parameters: This function accepts parameters that enable us to set axes scales and format
the graphs. These parameters are mentioned below:
plot(x, y): plots x and y using the default line style and color.
plot.axis([xmin, xmax, ymin, ymax]): scales the x-axis and y-axis from minimum to
maximum values.
plot(x, y, color='green', marker='o', linestyle='dashed', linewidth=2, markersize=12): the x
and y coordinates are marked using circular markers of size 12 and a green dashed line
of width 2.
plot.xlabel('X-axis'): names the x-axis.
plot.ylabel('Y-axis'): names the y-axis.
plot.title('Title name'): gives a title to your plot.
plot(x, y, label = 'Sample line'): the plotted Sample line will be displayed as a legend.

Example
 If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3, 10] to
the plot function
 Draw a line in a diagram from position (1, 3) to position (8, 10):

import matplotlib.pyplot as plt


import numpy as np

xpoints = np.array([1, 8])


ypoints = np.array([3, 10])

plt.plot(xpoints, ypoints)
plt.show()
Output:

Let's have a look at a simple example:

from matplotlib import pyplot as plt


plt.plot([1,2,3,4,5])
plt.ylabel("y axis")
plt.xlabel('x axis')
plt.show()

Output:
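
The plot() parameters listed earlier can be combined in one call; a small illustrative sketch:

import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])

plt.plot(x, y, color='green', marker='o', linestyle='dashed',
         linewidth=2, markersize=12, label='Sample line')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Title name')
plt.legend()    # displays the label as a legend
plt.show()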
What are Plots (Graphics):
Plots (graphics), also known as charts, are a visual representation of data, usually in the form of
colored graphics.

Plot Types

The six most commonly used Plots come under Matplotlib. These are:

 Line Plot
 Bar Plot
 Scatter Plot
 Pie Plot
 Area Plot
 Histogram Plot

Line Plot:

Line plots are drawn by joining data points, located where the x-axis and y-axis values intersect,
with straight lines.

Example:

from matplotlib import pyplot as plt

from matplotlib import style

style.use('ggplot')

x = [5,8,10]

y = [12,16,6]

x2 = [6,9,11]

y2 = [6,15,7]

plt.plot(x,y,'g',label='line one', linewidth=5)

plt.plot(x2,y2,'c',label='line two',linewidth=5)

plt.title('Epic Info')

plt.ylabel('Y axis')

plt.xlabel('X axis')

plt.legend()

plt.grid(True,color='k')

plt.show()
Output:

Bar Plot:

Bar plots are vertical/horizontal rectangular graphs that are used to compare data.

Example:

from matplotlib import pyplot as plt

plt.bar([0.25,1.25,2.25,3.25,4.25],[50,40,70,80,20],

label="BMW",width=.5)

plt.bar([.75,1.75,2.75,3.75,4.75],[80,20,20,50,60],

label="Audi", color='r',width=.5)

plt.legend()

plt.xlabel('Days')

plt.ylabel('Distance (kms)')

plt.title('Information')

plt.show()
Output:

Histogram Plot:

Histograms are used to show a distribution whereas a bar chart is used to compare different entities.

Example:

import matplotlib.pyplot as plt

population_age = [22,55,62,45,21,22,34,42,42,4,2,102,95,85,55,110,120,70,65,55,111,115,80,75,65,54,44,43,42,48]

bins = [0,10,20,30,40,50,60,70,80,90,100]

plt.hist(population_age, bins, histtype='bar', rwidth=0.8)

plt.xlabel('age groups')

plt.ylabel('Number of people')

plt.title('Histogram')

plt.show()
Output:

Scatter Plot:

Scatter plots compare various data variables to determine the connection between
dependent and independent variables.

import matplotlib.pyplot as pyplot


x1 = [1, 2.5,3,4.5,5,6.5,7]
y1 = [1,2, 3, 2, 1, 3, 4]
x2=[8, 8.5, 9, 9.5, 10, 10.5, 11]
y2=[3,3.5, 3.7, 4,4.5, 5, 5.2]
pyplot.scatter(x1, y1, label = 'high bp low heartrate', color='c')
pyplot.scatter(x2,y2,label='low bp high heartrate',color='g')
pyplot.title('Smart Band Data Report')
pyplot.xlabel('x')
pyplot.ylabel('y')
pyplot.legend()
pyplot.show()

Output:
Pie Plot:

A pie plot is a circular graph in which the data are represented as components/segments, or
slices, of a pie.

import matplotlib.pyplot as pyplot

slice = [12, 25, 50, 36, 19]


activities = ['NLP','Neural Network', 'Data analytics', 'Quantum Computing', 'Machine Learning']
cols = ['r','b','c','g', 'orange']
pyplot.pie(slice,
labels =activities,
colors = cols,
startangle = 90,
shadow = True,
explode =(0,0.1,0,0,0),
autopct ='%1.1f%%')
pyplot.title('Training Subjects')

# Print the chart


pyplot.show()

Output:
Area Plot:

Area plots spread across certain areas with bumps and drops (highs and lows) and
are also known as stack plots.

import matplotlib.pyplot as plt

days = [1,2,3,4,5]

sleeping =[7,8,6,11,7]

eating = [2,3,4,3,2]

working =[7,8,7,2,2]

playing = [8,5,7,8,13]

plt.plot([],[],color='m', label='Sleeping', linewidth=5)

plt.plot([],[],color='c', label='Eating', linewidth=5)

plt.plot([],[],color='r', label='Working', linewidth=5)

plt.plot([],[],color='k', label='Playing', linewidth=5)

plt.stackplot(days, sleeping,eating,working,playing, colors=['m','c','r','k'])

plt.xlabel('x')

plt.ylabel('y')

plt.title('Stack Plot')

plt.legend()

plt.show()

Output:
Data Science
What is Data Science?
 Data science is a deep study of the massive amount of data, which involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using the
scientific method, different technologies, and algorithms.
 It is a multidisciplinary field that uses tools and techniques to manipulate the data so that we
can find something new and meaningful.
 In short, we can say that data science is all about:
 Asking the correct questions and analyzing the raw data.
 Modeling the data using various complex and efficient algorithms.
 Visualizing the data to get a better perspective.
 Understanding the data to make better decisions and finding the final result.

Data Science Lifecycle :


 The life-cycle of data science consists of 6 stages.

1. Discovery:

 The first phase is discovery, which involves asking the right questions.
 When we start any data science project, we need to determine what are the basic
requirements, priorities, and project budget.
 In this phase, we need to determine all the requirements of the project such as the number of
people, technology, time, data, an end goal, and then we can frame the business problem on
first hypothesis level.
2. Data preparation:

 Data preparation is also known as Data Munging (Transformation).


 In this phase, we need to perform the following tasks:
1. Data cleaning
2. Data Reduction
3. Data integration
4. Data transformation
 After performing all the above tasks, we can easily use this data for our further processes.
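
Below is a minimal pandas sketch of these four preparation tasks (the DataFrames raw_df and extra_df and their columns are hypothetical and only illustrate the idea):

import pandas as pd

# hypothetical raw data coming from two sources
raw_df = pd.DataFrame({'id': [1, 2, 2, 3], 'salary': ['50,000', '60,000', '60,000', None]})
extra_df = pd.DataFrame({'id': [1, 2, 3], 'team': ['HR', 'IT', 'Sales']})

# 1. Data cleaning: remove duplicate rows and rows with missing values
clean_df = raw_df.drop_duplicates().dropna()

# 2. Data reduction: keep only the columns needed for the analysis
clean_df = clean_df[['id', 'salary']]

# 3. Data integration: combine the two sources on a common key
merged_df = clean_df.merge(extra_df, on='id', how='left')

# 4. Data transformation: convert salary strings like '50,000' into numbers
merged_df['salary'] = merged_df['salary'].str.replace(',', '').astype(int)
print(merged_df)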

3. Model Planning:

 In this phase, we need to determine the various methods and techniques to establish the
relation between input variables.
 We will apply Exploratory Data Analysis (EDA) by using various statistical formulas and
visualization tools to understand the relations between variables and to see what the data can
tell us.
 Common tools used for model planning are:
1. SQL Analysis Services
2. R
3. SAS
4. Python

4. Model-building:

 In this phase, the process of model building starts.


 We will create datasets for training and testing purposes (a minimal split sketch follows the tool list below).
 We will apply different techniques such as association, classification, and clustering, to build
the model.
 Following are some common Model building tools:
1. SAS Enterprise Miner
2. WEKA
3. SPSS Modeler
4. MATLAB
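
Below is a minimal sketch of the train/test split mentioned above, using plain pandas (df is assumed to be the prepared DataFrame and the 80/20 ratio is only illustrative):

# take a random 80% sample of the rows for training
train_df = df.sample(frac=0.8, random_state=42)
# use the remaining 20% of the rows for testing
test_df = df.drop(train_df.index)
print(train_df.shape, test_df.shape)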

5. Operationalize:

 In this phase, we will deliver the final reports of the project, along with briefings, code, and
technical documents.
 This phase provides us a clear overview of complete project performance and other
components on a small scale before the full deployment.

6. Communicate results:

 In this phase, we will check whether we have reached the goal that we set in the initial phase. We
will communicate the findings and the final result to the business team.
Exploratory Data Analysis(EDA)
What is Data Analysis:

Data Analysis is a process of inspecting, cleaning, transforming, and modeling data to


discover useful information for business decision-making.

Steps for Data Analysis, Data Manipulation and Data Visualization:

 Transform Raw Data in a Desired Format


 Clean the Transformed Data (Steps 1 and 2 are also called pre-processing of the data)
 Prepare a Model
 Analyze Trends and Make Decisions

Exploratory Data Analysis(EDA)

 Exploratory Data Analysis (EDA) is the first step in data analysis process .
 Exploratory Data Analysis (EDA) was developed by John Tukey in the 1970s.

What is Exploratory Data Analysis ?

 Exploratory Data Analysis (EDA) is understanding the data sets by summarizing their main
characteristics, often by plotting them visually.
 This step is very important especially when we arrive at modeling the data in order to apply
Machine learning.
 Plotting in EDA consists of Histograms, Box plot, Scatter plot and many more.
 It often takes much time to explore the data.
 Exploratory Data Analysis helps us to –
 Give insight into a data set.
 Understand the underlying structure.
 Extract important parameters and relationships that hold between them.
 Test underlying assumptions

How to perform Exploratory Data Analysis ?


 There is no single or common method for performing EDA.
 How EDA is performed depends on the dataset that we are working with.
Example:
Consider a data set related to employees.
This dataset contains 1000 rows and 8 columns.

1. Importing the required libraries for EDA

 import pandas as pd
 import numpy as np
 import seaborn as sns
 import matplotlib.pyplot as plt
 %matplotlib inline
 sns.set(color_codes=True)
2. Loading the data into the data frame.

 Loading the data into the pandas data frame is certainly one of the most important steps in
EDA.
 The values in the data set are comma-separated, so we read the CSV file into a data frame.
 A pandas data frame is used to hold the data.
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/employees.csv")
 df.head() # To display the top 5 rows

 df.tail() # To display the bottom 5 rows

3. Checking the types of data

 Here we check the data types because sometimes the Salary of the employees would be stored as a
string. If that is the case, we have to convert that string to integer data; only then can we plot
the data on a graph.
 Here, in this case, the data is already in integer format, so there is nothing to worry about.

 df.dtypes
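
If a numeric column had instead been read as strings, a small conversion sketch like the following could be used (hypothetical here, since Salary is already numeric in this dataset):

# hypothetical: convert Salary to integers if it was read as text
if df['Salary'].dtype == 'object':
    df['Salary'] = df['Salary'].astype(int)
print(df['Salary'].dtype)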
 Let’s see the shape of the data using the shape.
df.shape
(1000,8)
This means that this dataset has 1000 rows and 8 columns.

 Let's get a quick summary of the dataset using the describe() method. The describe() function
applies basic statistical computations on the dataset, such as extreme values, count of data points,
standard deviation, etc. Any missing value or NaN value is automatically skipped. The describe()
function gives a good picture of the distribution of the data.
df.describe()

 Now, let's also look at the columns and their data types. For this, we will use the info() method.
df.info()
4. Dropping irrelevant columns.

 This step is certainly needed in every EDA because sometimes there would be many columns
that we never use. In such cases drop the irrelevant columns.
 For example, in this case, columns such as Last Login Time and Senior Management don't add any
meaning to the analysis, so we just drop them.

df = df.drop(['Last Login Time', 'Senior Management'], axis=1)

df.head(5)

5. Renaming the columns

 In this instance, most of the column names are confusing to read, so we rename them.
 This is a good approach, as it improves the readability of the data set.
df = df.rename(columns={"Start Date": "SDate"})
df.head(5)

6. Dropping the duplicate rows

 First finding the no of rows & columns.


df.shape
df.count() # Used to count the number of rows
 Finding no of duplicate data.
duplicate_rows_df = df[df.duplicated()] # rows that are exact duplicates of earlier rows
print("number of duplicate rows: ", duplicate_rows_df.shape)
number of duplicate rows: (0, 6)
 removing rows of duplicate data
df = df.drop_duplicates()
df.count() # Used to count the number of rows

7. Dropping the missing or null values.

 Missing data is a very big problem in real-life scenarios. Missing data can also be referred to as
NA (Not Available) values in pandas. There are several useful functions for detecting,
removing, and replacing null values in a Pandas DataFrame:
print(df.isnull().sum())

 We can see that every column has a different number of missing values. For example, Gender has 145
missing values and Salary has 0. For handling these missing values there can be several
approaches, such as dropping the rows containing NaN or replacing NaN with the mean, median,
mode, or some other value.
 Now, let’s try to fill the missing values of gender with the string “No Gender”.
df["Gender"].fillna("No Gender", inplace = True)
df.isnull().sum()

 Now, Let’s fill the senior management with the mode value.
mode = df['Senior Management'].mode().values[0]
df['Senior Management']= df['Senior Management'].replace(np.nan, mode)
df.isnull().sum()

 Now for the first name and team, we cannot fill the missing values with arbitrary data, so,
let’s drop all the rows containing these missing values.
df = df.dropna(axis = 0, how ='any')
print(df.isnull().sum())
df.shape
8. Detecting Outliers:

 Outliers are nothing but an extreme value that deviates from the other observations in the
dataset.
 An outlier is a point or set of points that are different from other points.
 Sometimes they can be very high or very low.
 It's often a good idea to detect and remove the outliers, because outliers are one of the
primary reasons for a less accurate model.
 Hence it's a good idea to remove them.
 IQR (Inter-Quartile Range) score technique is used to detect and remove outlier.
 Outliers can be seen with visualizations using a box plot.
 sns.boxplot(x=df['Salary'])

 Here, in the plot, we can find some points outside the box; they are none other than
outliers.
 num_df = df.select_dtypes(include='number') # quartiles are computed on the numeric columns only
 Q1 = num_df.quantile(0.25)
 Q3 = num_df.quantile(0.75)
 IQR = Q3 - Q1
 print(IQR)
 df = df[~((num_df < (Q1 - 1.5 * IQR)) | (num_df > (Q3 + 1.5 * IQR))).any(axis=1)]
 df.shape

9. Data Visualization

Data Visualization is the process of analyzing data in the form of graphs or maps, making it a lot
easier to understand the trends or patterns in the data. There are various types of visualizations –

Histogram Plot

 A histogram is a graphical representation commonly used to visualize the distribution of


numerical data
 A histogram divides the values within a numerical variable into “bins”, and counts the
number of observations that fall into each bin

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df, )
plt.show()
Box Plot

A box plot is the visual representation of groups of numerical data through their quartiles. A
boxplot is also used to detect outliers in a data set. It captures the summary of the data
efficiently with a simple box and whiskers and allows us to compare easily across groups.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot( x="Salary", y='Team', data=df, )
plt.show()

Heat Maps Plot

 A heat map is a type of plot that is useful when we need to find how the variables depend on
one another.

 One of the best ways to find the relationships between the features is by using a heat
map.
Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
c = df.corr()
sns.heatmap(c, cmap="BrBG", annot=True)
plt.show()

Scatter Plot

 A Scatter plot is a type of data visualization technique that shows the relationship between
two numerical variables.
 Calling the scatter() method on the plot member draws a plot between two variables or two
columns of pandas DataFrame.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()
Descriptive Statistics
 Descriptive Statistics is the default process in Data analysis.
 Exploratory Data Analysis (EDA) is not complete without a Descriptive Statistic analysis.
 Descriptive Statistics is divided into two parts:
1. Measure of Central Data points and
2. Measure of Dispersion.

1. Measure of Central Data points:

The following operations are performed under Measure of Central Data points. Each of these
measures describes a different indication of the typical or central value in the distribution.

1. Count
2. Mean
3. Mode
4. Median

2. Measure of Dispersion

The following operations are performed under Measure of Dispersion. Measures of dispersion can
be defined as positive real numbers that measure how homogeneous or heterogeneous the given data is.

1. Range
2. Percentiles (or) Quartiles
3. Standard deviation
4. Variance
5. Skewness

Example:

 Consider a file:

https://media.geeksforgeeks.org/wp-content/uploads/employees.csv

 Before starting descriptive statistics analysis complete the data collection and cleaning
process.

Data Collection:

 # loading data set as Pandas dataframe


import pandas as pd
df = pd.DataFrame()
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/employees.csv")

 df.head() # To display the top 5 rows


 df.tail() # To display the last 5 rows
 Here we check for the datatypes
df.dtypes
 Let’s see the shape of the data using the shape.
df.shape
df.count() # Used to count the number of rows
 Now, let's also look at the columns and their data types. For this, we will use the info() method.
df.info()

Data Cleaning :

Data cleaning means fixing bad data in your data set before data analysis.

 Empty cells or Null values


 Data in wrong format
 Wrong data
 Duplicates

Describe() method :

 Let’s get a quick summary of the dataset using the describe() method.
 The describe() function applies basic statistical computations on the dataset like extreme
values, count of data points standard deviation, etc.
 Any missing value or NaN value is automatically skipped.
 describe() function gives a good picture of the distribution of data.

Measure of Central Data points:

Count

It calculates the total number of values in a numerical column, or the count of each category of a
categorical variable.

# get column height from df

height = df["height"]

# total count of values in the column
print(height.count())

Mean

 The sum of the values present in a column divided by the total number of rows in that column is
known as the mean.

 It is also known as average.


Median

 The center value of an attribute is known as a median.

 The median value divides the data points into two parts. That means 50% of data points are
present above the median and 50% below.

Mode

 The mode is the data point whose count is maximum in a column.

 There is only one mean and one median value for each column, but an attribute can have more
than one mode value.

#Calculating mean, median, mode of dataset height


mean = height.mean()
median =height.median()
mode = height.mode()
print(mean , median, mode)

Output:

53.73152709359609 54.1 0 50.8 dtype: float64

Measures of Dispersion

Range

 The difference between the maximum value and the minimum value in a column is known as the range.
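
A small sketch using the same height column as in the Count example above:

# range of the column = maximum value minus minimum value
height_range = height.max() - height.min()
print(height_range)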

Standard Deviation

 The standard deviation value tells us how much all data points deviate from the mean value.

 The standard deviation is affected by outliers because it uses the mean for its calculation.

 Formula for Standard Deviation

σ = √( Σ (xi − x̄)² / n )

 Notations for Standard Deviation

 σ = Standard Deviation
 xi = Terms Given in the Data
 x̄ = Mean
 n = Total number of Terms

 Example for standard deviation

#standard deviation of data set using std() function


std_dev = df.std(numeric_only=True)
print(std_dev)
# standard deviation of the specific column
sv_height=df.loc[:,"height"].std()
print(sv_height)

Output:

2.442525704031867

Variance

 Variance is the square of standard deviation.

Variance = (Standard deviation)² = σ²

 In the case of outliers, the variance value becomes large and noticeable.

 Example for standard variance

# variance of data set using var() function
variance = df.var(numeric_only=True)


print(variance)
#variance of the specific column
var_height=df.loc[:,"height"].var()
print(var_height)

Output:

5.965931814856368

Skewness

 Ideally, the distribution of data should be in the shape of Gaussian (bell curve).

 But practically, data shapes are skewed or have asymmetry. This is known as skewness in
data.

 Formula for skewness is:

Skew = 3 * (Mean – Median) / Standard Deviation

 Skewness value can be negative (left) skew or positive (right) skew. Its value should be close
to zero.
 Example for skewness

 df.skew()

# skewness of the specific column

df.loc[:,"height"].skew()

output:

0.06413448813322854

Percentiles or Quartiles

 The spread of a column's values can be described by calculating a summary of several percentiles.

 Median is also known as the 50th percentile of data.

 Here is a different percentile.

1. The minimum value equals the 0th percentile.


2. The maximum value equals the 100th percentile.
3. The first quartile equals the 25th percentile.
4. The third quartile equals the 75th percentile.

Quartiles

 It divides the data set into four equal points.

 First quartile = 25th percentile

 Second quartile = 50th percentile (Median)

 Third quartile = 75th percentile

 Based on the quartiles, there is another measure called the inter-quartile range (IQR) that also
measures the variability in the dataset. It is defined as:

IQR = Q3 - Q1

 IQR is not affected by the presence of outliers.


 Example :
price = df.price.sort_values()
Q1 = np.percentile(price, 25)
Q2 = np.percentile(price, 50)
Q3 = np.percentile(price, 75)
IQR = Q3 - Q1
IQR

Output:

8718.5
Basic tools (plots, graphs and summary statistics) of EDA
 Exploratory data analysis or “EDA” is a critical first step in analyzing the data
 The uses of EDA are:
1. Detection of mistakes
2. Checking of assumptions
3. Preliminary selection of appropriate models
4. Determining relationships among the explanatory variables

Typical data format (Data Types) and the types of EDA

Data Types:

Data types are mainly classified into two types:

Categorical Data

 Categorical data represents characteristics.


 It can represent things like a person’s gender, language etc.
 Categorical data can also take on numerical values.
Example: 1 for female and 0 for male.
 These numbers don’t have mathematical meaning
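
As a small illustration, such an encoding can be produced with pandas (the Gender column and its labels are assumed from the employees data used earlier):

# map the category labels to numbers: 1 for Female and 0 for Male
gender_code = df['Gender'].map({'Female': 1, 'Male': 0})
print(gender_code.head())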

Nominal Data

 Nominal values represent discrete units and are used to label variables that have no
quantitative value.
 Nominal data has no order.
 If the order of the values changed their meaning would not change.

Examples: gender, languages spoken, nationality.

Ordinal Data

 Ordinal values represent discrete and ordered units.


 It is the same as nominal data, except that its ordering matters.
Example: education level (elementary, high school, bachelor's, master's) or a satisfaction rating from 1 to 5.

Numerical Data

1. Discrete Data

 Discrete data contains values as distinct and separate.


 This type of data can’t be measured but it can be counted.
 It basically represents information that can be categorized into a classification.
 An example is the number of heads in 100 coin flips.
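
A quick NumPy sketch of this example, simulating the number of heads in 100 fair coin flips (purely illustrative):

import numpy as np

# number of heads in 100 coin flips - a discrete, countable value
heads = np.random.binomial(n=100, p=0.5)
print(heads)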

2. Continuous Data

 Continuous Data represents measurements and therefore their values can’t be counted but
they can be measured.
 An example would be the height of a person, which can be described by using intervals on the real
number line.

Interval Data

 Interval values represent ordered units that have the same difference.

Example: temperature in degrees Celsius; the differences between values are meaningful, but there is no absolute zero.

Ratio Data
 Ratio values are also ordered units that have the same difference.
 Ratio values are the same as interval values, with the difference that they do have an absolute
zero.
 Good examples are height, weight, length etc.

Example: height, weight, and length all have a true zero (0 cm means no height at all).

Types of EDA

 The EDA types of techniques are either graphical or quantitative (non-graphical).


 The graphical methods involve summarizing the data in a diagrammatic or visual way.
 The quantitative methods, on the other hand, involve the calculation of summary statistics.
 These two types of methods are further divided into univariate and multivariate methods.
 Univariate methods consider one variable (data column) at a time.
 Multivariate methods consider two or more variables at a time to explore relationships.
 Totally there are four types of EDA .
1. Univariate graphical,
2. Multivariate graphical,
3. Univariate non-graphical, and
4. Multivariate non-graphical.

Univariate non-graphical:

 This is the simplest form of data analysis among the four options.
 In this type of analysis, the data that is being analysed consists of just a single variable.
 The main purpose of this analysis is to describe the data and to find patterns.

Univariate graphical:

 The graphical method provides the full picture of the data.


 The three main methods of analysis under this type are histogram, stem and leaf plot, and box
plots.
 The histogram represents the total count of cases for a range of values.
 Along with the data values, the stem and leaf plot shows the shape of the distribution.
 The box plots graphically depict a summary of the minimum, first quartile, median, third quartile,
and maximum.

Multivariate non-graphical:
 The multivariate non-graphical type of EDA generally depicts the relationship between
multiple variables of data through cross-tabulation or statistics.
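
For example, a cross-tabulation of two categorical columns can be produced with pandas (the column names are assumed from the employees data used earlier):

import pandas as pd

# count how many records fall into each Team / Gender combination
print(pd.crosstab(df['Team'], df['Gender']))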

Multivariate graphical:

 This type of EDA displays the relationship between two or more sets of data.
 An example is a grouped bar chart, where each group represents a level of one of the variables and
each bar within the group represents levels of the other variables.

Summary Statistics of EDA


 One purpose of EDA is to spot problems in data (as part of data wrangling) and to understand
variable properties like:
1. central trends (mean)
2. spread (variance)
3. skew
 Suggest possible modeling strategies (e.g., probability distributions)
 EDA is used to understand relationship between pairs of variables, e.g. their correlation or
covariance.

COVARIANCE

 Covariance is a measure of the relationship between 2 variables that is scale dependent, i.e. how
much one variable changes when another variable changes.
 This can be represented with the following equation:

cov(x, y) = Σ (xi − x̄)(yi − ȳ) / N

 xi is the ith observation in variable x,


 x¯ is the mean for variable x,
 yi is the ith observation in variable y,
 y¯ is the mean for variable y, and
 N is the number of observations
 This can be calculated easily within Python - particularly when using Pandas:
import pandas as pd
df = pd.DataFrame()
df = pd.read_csv("data.csv")
df.cov()

CORRELATION

 It is a statistical metric to measure to what extent different variables are interdependent.


 In short, if one variable changes, how does it affect the other variable?

r = Σ (xi − x̄)(yi − ȳ) / (N · sx · sy)

 Where, xi is the ith observation in variable x,


 x¯ is the mean for variable x,
 yi is the ith observation in variable y,
 y¯ is the mean for variable y, and
 N is the number of observations
 sx is the standard deviation for variable x
 sy is the standard deviation for variable y

 This can be calculated easily within Python - particularly when using Pandas:

import pandas as pd
df = pd.DataFrame()
df = pd.read_csv("data.csv")
df.corr()

Philosophy of Exploratory Data Analysis

 The important reasons to implement EDA when working with data are:
1. To gain intuition about the data;
2. To make comparisons between distributions;
3. For sanity checking (making sure the data is on the scale we expect, in the
format we thought it should be);
4. To find out where data is missing or if there are outliers; and to summarize the data.
 In the context of data generated from logs, EDA also helps with debugging the logging
process.
 In the end, EDA helps us to make sure the product is performing as intended.
 There’s lots of visualization involved in EDA.
 The distinction between EDA and data visualization is that EDA is done toward the beginning of
the analysis, whereas data visualization is done toward the end to communicate one's
findings.
 With EDA, the graphics are solely done for us to understand what’s going on.
 EDA is also used to improve the development of algorithms.
Data Visualization
What is Data Visualization :

 Data Visualization is the presentation of data in graphical format.


 Presenting a huge amount of data in a simple and easy-to-understand format helps
communicate information clearly and effectively.
 Data in a graphical format allows us to identify new trends and patterns easily.
 The main benefits of data visualization are as follows:
1. It simplifies the complex quantitative information
2. It helps analyze and explore big data easily
3. It identifies the areas that need attention or improvement
4. It identifies the relationship between data points and variables
5. It explores new patterns and reveals hidden patterns in the data
 Three major considerations for Data Visualization:
 Clarity
 Accuracy
 Efficiency

Clarity - Clarity ensures that the data set is complete and relevant.

Accuracy – Accuracy ensures using appropriate graphical representation to convey the right
message.

Efficiency - Efficiency uses efficient visualization technique which highlights all the data
points

 Some basic factors to be aware of before visualizing the data.

 Visual effect
 Coordinate System
 Data Types and Scale
 Informative Interpretation
Visual effect - Visual Effect includes the usage of appropriate shapes, colors, and
size to represent the analyzed data.
Coordinate System - The Coordinate System helps to organize the data points
within the provided coordinates.
Data Types and Scale - The Data Types and Scale choose the type of data such as
numeric or categorical.
Informative Interpretation - The Informative Interpretation helps create visuals in
an effective and easily interpretable manner using labels, titles, legends, and pointers.

 Python offers multiple great graphing libraries.

 Some popular plotting libraries:

 Matplotlib
 Pandas Visualization
 Seaborn
 ggplot
 Plotly
 Plots (graphics), also known as charts, are a visual representation of data in the form of
colored (mostly) graphics.

Histogram Plot

 A histogram is a graphical representation commonly used to visualize the distribution of


numerical data
 A histogram divides the values within a numerical variable into “bins”, and counts the
number of observations that fall into each bin

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df, )
plt.show()

Box Plot

A box plot is the visual representation of groups of numerical data through their quartiles. A
boxplot is also used to detect outliers in a data set. It captures the summary of the data
efficiently with a simple box and whiskers and allows us to compare easily across groups.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot( x="Salary", y='Team', data=df, )
plt.show()
Heat Maps Plot

 A heat map is a type of plot that is useful when we need to find how the variables depend on
one another.

 One of the best ways to find the relationships between the features is by using a heat
map.

Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
c = df.corr()
sns.heatmap(c, cmap="BrBG", annot=True)
plt.show()

Scatter Plot

 A Scatter plot is a type of data visualization technique that shows the relationship between
two numerical variables.
 Calling the scatter() method on the plot member draws a plot between two variables or two
columns of pandas DataFrame.
Example:

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()
