UNIT - III
Exploratory Data Analysis (EDA), Data Science life cycle, Descriptive Statistics, Basic tools
(plots, graphs and summary statistics) of EDA, Philosophy of EDA. Data Visualization:
Scatter plot, bar chart, histogram, boxplot, heat maps, etc
NumPy :
NumPy is a python library used for working with arrays.
NumPy stands for Numerical Python.
It is the core library for scientific computing, which contains a powerful n-dimensional array
object.
Before NumPy's functions and methods can be used, NumPy must be installed. Depending on
which distribution of Python you use, the installation method is slightly different.
Usually, NumPy is imported under the np alias. An alias is an alternate name for referencing the same thing.
import numpy as np
>>>np.__version__
'1.16.4'
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.
To create an ndarray, we can pass a list, tuple or any array-like object into the array() method,
and it will be converted into an ndarray:
Example:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
output: [1 2 3 4 5]
type(): This built-in Python function tells us the type of the object passed to it. As the code below shows, arr is of numpy.ndarray type.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
output: [1 2 3 4 5]
<class 'numpy.ndarray'>
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
import numpy as np
arr = np.array(42)
print(arr)
Output: 42
1-D Arrays
An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array.
These are the most common and basic arrays.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
output: [1 2 3 4 5]
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent a matrix or a 2nd-order tensor.
NumPy has a whole submodule dedicated to matrix operations called numpy.matlib.
Example
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
output:
[[1 2 3]
[4 5 6]]
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
These are often used to represent a 3rd-order tensor.
Example
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output:
[[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]]]
Example
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
Output:
0
1
2
3
NumPy Array Indexing
Array indexing is the same as accessing an array element: you can access an array element by referring to its index number. The indexes in NumPy arrays start with 0.
Example
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
print(arr[2] + arr[3])
Output:
1
7
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
Think of 2-D arrays like a table with rows and columns, where the row represents the
dimension and the index represents the column.
Example
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print(arr[0, 1])
Output:
2
To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.
Example
Access the third element of the second array of the first array:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
Output:
6
Example Explained
The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]
The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]
The third number represents the third dimension, which contains three values:
4
5
6
Since we selected 2, we end up with the third value:
6
Negative Indexing
Use negative indexing to access an array from the end.
Example
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print( arr[1, -1])
Output:
10
Slicing Arrays
Slicing in Python means taking elements from one given index to another given index. We pass a slice instead of an index like this: [start:end].
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
Output:
[2 3 4 5]
Note: The result includes the start index, but excludes the end index.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
Output:
[5 6 7]
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[:4])
Output:
[1 2 3 4]
Negative Slicing
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])
Output:
[5 6]
STEP
Use the step value to determine the step of the slicing:
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
Output:
[2 4]
From the second element, slice elements from index 1 to index 4 (not included):
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
Output:
[7 8 9]
Example
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 2])
Output:
[3 8]
Example
From both elements, slice index 1 to index 4 (not included), this will return a 2-D array:
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])
Output:
[[2 3 4]
[7 8 9]]
Data Types in Python
By default Python has these data types:
strings - used to represent text data, the text is given under quote marks. e.g. "ABCD"
integer - used to represent integer numbers. e.g. -1, -2, -3
float - used to represent real numbers. e.g. 1.2, 42.42
boolean - used to represent True or False.
complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j
Data Types in NumPy
NumPy has some extra data types and refers to data types with one character, like 'i' for integers and 'u' for unsigned integers. Below is a list of all data types in NumPy and the characters used to represent them:
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type ( void )
Checking the Data Type of an Array
The NumPy array object has a property called dtype that returns the data type of the array.
Example
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)
Output:
int32
Example
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='i4')
print(arr)
print(arr.dtype)
Output:
[1 2 3 4]
int32
ValueError: In Python, a ValueError is raised when the type of an argument passed to a function is unexpected or incorrect.
Example
A non integer string like 'a' cannot be converted to integer (will raise an error):
import numpy as np
arr = np.array(['a', '2', '3'], dtype='i')
print(arr)
Output:
ValueError: invalid literal for int() with base 10: 'a'
The best way to change the data type of an existing array, is to make a copy of the array with
the astype() method.
The astype() function creates a copy of the array, and allows you to specify the data type as a
parameter.
The data type can be specified using a string, like 'f' for float, 'i' for integer etc. or you can use the
data type directly like float for float and int for integer.
Example
Change data type from float to integer by using 'i' as parameter value:
import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
Output:
[1 2 3]
int32
Example
Change data type from float to integer by using int as parameter value:
import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype(int)
print(newarr)
print(newarr.dtype)
Output:
[1 2 3]
int32
Example
import numpy as np
arr = np.array([1, 0, 3])
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)
Output:
[ True False  True]
bool
NumPy arrays have an attribute called shape that returns a tuple with each index having the number of
corresponding elements.
Example
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
Output:
(2, 4)
The example above returns (2, 4), which means that the array has 2 dimensions, where the first
dimension has 2 elements and the second has 4.
Example
Create an array with 5 dimensions using ndmin using a vector with values 1,2,3,4 and verify that last
dimension has value 4:
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)
Output:
[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
Reshaping arrays
Reshaping means changing the shape of an array.
The shape of an array is the number of elements in each dimension.
By reshaping we can add or remove dimensions or change number of elements in each
dimension.
Convert the following 1-D array with 12 elements into a 2-D array. The outermost dimension will
have 4 arrays, each with 3 elements:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
Output:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
Convert the following 1-D array with 12 elements into a 3-D array. The outermost dimension will
have 2 arrays that contain 3 arrays, each with 2 elements:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
Output:
[[[ 1  2]
[ 3  4]
[ 5  6]]
[[ 7  8]
[ 9 10]
[11 12]]]
Can We Reshape Into Any Shape?
Yes, as long as the elements required for reshaping are equal in both shapes.
We can reshape an 8 elements 1D array into 4 elements in 2 rows 2D array but we cannot
reshape it into a 3 elements 3 rows 2D array as that would require 3x3 = 9 elements.
Example
Try converting 1D array with 8 elements to a 2D array with 3 elements in each dimension (will raise
an error):
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
newarr = arr.reshape(3, 3)
print(newarr)
Output:
newarr = arr.reshape(3, 3)
ValueError: cannot reshape array of size 8 into shape (3,3)
Unknown Dimension
You are allowed to have one "unknown" dimension.
Meaning that you do not have to specify an exact number for one of the dimensions in the
reshape method.
Pass -1 as the value, and NumPy will calculate this number for you.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
newarr = arr.reshape(2, 2, -1)
print(newarr)
Output:
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
Note: We can not pass -1 to more than one dimension.
Flattening the Arrays
Flattening an array means converting a multidimensional array into a 1-D array. We can use reshape(-1) to do this.
Example
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
newarr = arr.reshape(-1)
print(newarr)
Output:
[1 2 3 4 5 6]
Iterating Arrays
Iterating means going through elements one by one.
As we deal with multi-dimensional arrays in NumPy, we can do this using the basic for loop of Python.
If we iterate on a 1-D array it will go through each element one by one.
Example
import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
print(x)
Output:
1
2
3
Iterating 2-D Arrays
If we iterate on a 2-D array it will go through all the rows.
Example
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
print(x)
Output:
[1 2 3]
[4 5 6]
If we iterate on a n-D array it will go through n-1th dimension one by one. To return the actual values,
the scalars, we have to iterate the arrays in each dimension.
Example
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
for y in x:
print(y)
Output:
1
2
3
4
5
6
If we iterate on a 3-D array it will go through all the 2-D arrays.
Example
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
print(x)
Output:
[[1 2 3]
[4 5 6]]
[[ 7 8 9]
[10 11 12]]
Example
Iterate down to the scalars. To return the actual values, the scalars, we have to iterate the arrays in
each dimension.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
for y in x:
for z in y:
print(z)
Output:
1
2
3
4
5
6
7
8
9
10
11
12
Iterating Arrays With Different Data Types
We can use the op_dtypes argument of nditer() and pass it the expected data type to change the data type of the elements while iterating.
NumPy does not change the data type of the element in-place (where the element is in the array), so it needs some other space to perform this action; that extra space is called a buffer, and in order to enable it in nditer() we pass flags=['buffered'].
Example
import numpy as np
arr = np.array([1, 2, 3])
for x in np.nditer(arr, flags=['buffered'], op_dtypes=['S']):
print(x)
Output:
b'1'
b'2'
b'3'
Enumerated Iteration Using ndenumerate()
Enumeration means mentioning the sequence number of something one by one. Sometimes we require the corresponding index of an element while iterating; the ndenumerate() method can be used for those cases.
Example
import numpy as np
arr = np.array([1, 2, 3])
for idx, x in np.ndenumerate(arr):
print(idx, x)
Output:
(0,) 1
(1,) 2
(2,) 3
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
for idx, x in np.ndenumerate(arr):
print(idx, x)
Output:
(0, 0) 1
(0, 1) 2
(0, 2) 3
(0, 3) 4
(1, 0) 5
(1, 1) 6
(1, 2) 7
(1, 3) 8
Joining NumPy Arrays
Joining means putting contents of two or more arrays in a single array.
In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.
We pass a sequence of arrays that we want to join to the concatenate() function, along with
the axis. If axis is not explicitly passed, it is taken as 0.
Example
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Output:
[1 2 3 4 5 6]
Example
import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr)
Output:
[[1 2 5 6]
[3 4 7 8]]
Splitting NumPy Arrays
Splitting is the reverse operation of joining: it breaks one array into multiple arrays. We use the array_split() method, passing the array we want to split and the number of splits.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5, 6])]
If the array has fewer elements than required, it will adjust from the end accordingly.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)
print(newarr)
Output:
[array([1, 2]), array([3, 4]), array([5]), array([6])]
Note: We also have the split() method available, but it will not adjust the elements when there are fewer elements in the source array than required for splitting; in the example above array_split() worked properly, but split() would fail.
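As a quick check of that note, here is a small sketch (not part of the original example) contrasting array_split() and split() on an array whose length is not evenly divisible by the number of splits:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
# array_split() adjusts the chunk sizes when 6 elements cannot be divided evenly into 4 parts
print(np.array_split(arr, 4))
# split() requires an exact division and raises an error here
try:
    np.split(arr, 4)
except ValueError as e:
    print("split() failed:", e)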
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr[0])
print(newarr[1])
print(newarr[2])
Output:
[1 2]
[3 4]
[5 6]
Example
import numpy as np
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
newarr = np.array_split(arr, 3)
print(newarr)
Output:
[array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([[ 9, 10],
[11, 12]])]
Example
Split the 2-D array into three 2-D arrays along columns (axis=1).
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3, axis=1)
print(newarr)
Output:
[array([[ 1],
[ 4],
[ 7],
[10],
[13],
[16]]), array([[ 2],
[ 5],
[ 8],
[11],
[14],
[17]]), array([[ 3],
[ 6],
[ 9],
[12],
[15],
[18]])]
Example
Use the hsplit() method to split the 2-D array into three 2-D arrays along columns.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.hsplit(arr, 3)
print(newarr)
Output:
[array([[ 1],
[ 4],
[ 7],
[10],
[13],
[16]]), array([[ 2],
[ 5],
[ 8],
[11],
[14],
[17]]), array([[ 3],
[ 6],
[ 9],
[12],
[15],
[18]])]
Searching Arrays
You can search an array for a certain value, and return the indexes that get a match.
To search an array, use the where() method.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
Output:
(array([3, 5, 6]),)
The example above returns a tuple: (array([3, 5, 6]),), which means that the value 4 is present at indexes 3, 5, and 6.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 0)
print(x)
Output:
(array([1, 3, 5, 7]),)
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 1)
print(x)
Output:
(array([0, 2, 4, 6]),)
Sorting Arrays
Sorting means putting elements in an ordered sequence. NumPy provides the sort() function, which returns a sorted copy of a specified array.
Example
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
Output:
[0 1 2 3]
Note: This method returns a copy of the array, leaving the original array unchanged. You can also sort
arrays of strings, or any other data type:
Example
import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))
Output:
['apple' 'banana' 'cherry']
Example
import numpy as np
arr = np.array([True, False, True])
print(np.sort(arr))
Output:
[False  True  True]
If you use the sort() method on a 2-D array, both arrays will be sorted:
Example
import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))
Output:
[[2 3 4]
[0 1 5]]
Pandas
What is Pandas?
Pandas is a Python library used for working with data sets.
Pandas was developed by Wes McKinney in 2008 and is used for data analysis in Python.
Pandas is an open-source library that provides high-performance tools for analyzing, cleaning, exploring, and manipulating data, and it supports machine learning tasks in Python.
The name Pandas is derived from "Panel Data", an econometrics term for multidimensional structured data sets.
Pandas allow us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Features of Pandas:
Easily handles missing data
It uses Series for one-dimensional data structure and DataFrame for multi-dimensional data
structure.
It provides an efficient way to slice the data
It provides a flexible way to merge, concatenate or reshape the data
Pandas as pd
Pandas is usually imported under the pd alias.
alias: In Python alias are an alternate name for referring to the same thing.
Create an alias with the as keyword while importing:
Now the Pandas package can be referred to as pd instead of pandas.
import pandas as pd
Example
import pandas as pd
print(pd.__version__)
The Pandas provides two data structures for processing the data, i.e., Series and DataFrame, which
are discussed below:
1) Pandas Series
A Pandas Series is like a column in a table.
It is defined as a one-dimensional array that is capable of storing various data types.
The row labels of series are called the index.
We can easily convert a list, tuple, or dictionary into a Series using the Series() method; its main parameter is the data to be stored.
A Series cannot contain multiple columns.
Syntax:
pandas.Series(data, index, dtype, copy)
Example
import pandas as pd
s = pd.Series()
print(s)
Output:
Series([], dtype: float64)
DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output:
0 P
1 a
2 n
3 d
4 a
5 s
dtype: object
Example 2:
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info, index = [100, 101, 102, 103, 104, 105])
print(a)
Output:
100 P
101 a
102 n
103 d
104 a
105 s
dtype: object
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.
Example:
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Output:
0    5
1    5
2    5
3    5
dtype: int64
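The notes above also mention that a dictionary can be converted into a Series; the original text does not show that case, so the following is a minimal sketch (the keys and values are made up for illustration). When a dict is passed and no index is given, the dictionary keys are used as the index labels:
import pandas as pd
info = {'x': 0.0, 'y': 1.0, 'z': 2.0}
s = pd.Series(info)
print(s)
Output:
x    0.0
y    1.0
z    2.0
dtype: float64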
Accessing Data from Series with Position
Example 1:
Retrieve the first element. As we already know, the counting starts from zero for the array, which
means the first element is stored at zeroth position and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5])
#retrieve the first element
print(s[0])
Output:
1
Example 2 :
Retrieve the first three elements in the Series. If a : is inserted after an index, all items from that index onwards are extracted; if two indexes (with : between them) are used, the items between the two indexes (excluding the stop index) are extracted.
import pandas as pd
s = pd.Series([1,2,3,4,5])
#retrieve the first three elements
print(s[:3])
Output:
0 1
1 2
2 3
dtype: int64
Example 3:
import pandas as pd
s = pd.Series([1,2,3,4,5] )
#retrieve the last three elements
print(s[-3:])
Output:
2 3
3 4
4 5
dtype: int64
Accessing Data from Series with Label (Index)
Example 1:
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
#retrieve a single element using the index label
print(s['a'])
Output:
1
Example 2
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
#retrieve multiple elements using a list of index labels
print(s[['a', 'b', 'c']])
Output:
a 1
b 2
c 3
dtype: int64
Example 3
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'] )
#retrieve an element using an index label that does not exist (raises an error)
print(s['f'])
Output:
KeyError: 'f'
2) Pandas DataFrame:
Pandas DataFrame is a widely used data structure which works with a two-dimensional array
with labeled axes (rows and columns).
DataFrame is defined as a standard way to store data that has two different indexes, i.e., row
index and column index.
It consists of the following properties:
o The columns can be heterogeneous types like int, bool, and so on.
o It can be seen as a dictionary of Series structure where both the rows and columns are
indexed. It is denoted as "columns" in case of columns and "index" in case of rows.
Syntax:
pandas.DataFrame(data, index, columns, dtype, copy)
data: It can take different forms like ndarray, Series, map, lists, dict, constants.
index: For the row labels, the default index np.arange(n) is used if no index is passed.
columns: For the column labels, the default is also np.arange(n); this is used only if no column labels are passed.
Create a DataFrame
A pandas DataFrame can be created from:
dict
Lists
NumPy ndarrays
Series
Example (empty DataFrame):
import pandas as pd
df = pd.DataFrame()
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
Example 1 (from a list):
import pandas as pd
x = ['CIVIL', 'EEE', 'MECH', 'ECE', 'CSE', 'AIDS']
df = pd.DataFrame(x)
print(df)
Output:
       0
0  CIVIL
1    EEE
2   MECH
3    ECE
4    CSE
5   AIDS
Example 2 (from a list of lists, with column names):
import pandas as pd
data = [[101, 'CIVIL'], [201, 'EEE'], [301, 'MECH'], [401, 'ECE'], [501, 'CSE'], [3001, 'AIDS']]
df = pd.DataFrame(data, columns=['CODE', 'NAME'])
print(df)
Output:
   CODE   NAME
0   101  CIVIL
1   201    EEE
2   301   MECH
3   401    ECE
4   501    CSE
5  3001   AIDS
Example 3 (same data, with dtype=float for the numeric column):
import pandas as pd
data = [[101, 'CIVIL'], [201, 'EEE'], [301, 'MECH'], [401, 'ECE'], [501, 'CSE'], [3001, 'AIDS']]
df = pd.DataFrame(data, columns=['CODE', 'NAME'], dtype=float)
print(df)
Output:
     CODE   NAME
0   101.0  CIVIL
1   201.0    EEE
2   301.0   MECH
3   401.0    ECE
4   501.0    CSE
5  3001.0   AIDS
All the ndarrays must be of same length. If index is passed, then the length of the index
should equal to the length of the arrays.
If no index is passed, then by default, index will be range(n), where n is the array length.
Example 1:
import pandas as pd
x = {'DEPTCODE':[101,201, 301, 401,501,3001],'DEPARTMENT NAME':['CIVIL', 'EEE',
'MECH','ECE','CSE','AIDS']}
df = pd.DataFrame(x)
print(df)
Output:
   DEPTCODE DEPARTMENT NAME
0       101           CIVIL
1       201             EEE
2       301            MECH
3       401             ECE
4       501             CSE
5      3001            AIDS
List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by
default taken as column names.
Example 1:
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['row1', 'row2'])
print(df)
Output:
      a   b     c
row1  1   2   NaN
row2  5  10  20.0
Column Selection:
We can select any column from the DataFrame. Here is the code that demonstrates how to select a
column from the DataFrame.
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print(d1['one'])
Output:
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
f 6.0
g NaN
h NaN
Column Addition
We can add a new column to an existing DataFrame. The code below demonstrates how to add a new column to an existing DataFrame:
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
# add a new column by passing a Series
df['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
print(df)
Output:
   one  two  three
a  1.0    1   20.0
b  2.0    2   40.0
c  3.0    3   60.0
d  4.0    4    NaN
e  5.0    5    NaN
f  NaN    6    NaN
Column Deletion:
We can delete a column from an existing DataFrame. This code demonstrates how a column can be deleted from an existing DataFrame:
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2], index=['a', 'b']),
'two' : pd.Series([1, 2, 3], index=['a', 'b', 'c'])}
df = pd.DataFrame(info)
print("The DataFrame:")
print(df)
# delete the first column using del
del df['one']
print("Delete the first column:")
print(df)
Output:
The DataFrame:
   one  two
a  1.0    1
b  2.0    2
c  NaN    3
Delete the first column:
   two
a    1
b    2
c    3
We can select, add, or delete any row at anytime. First of all, we will understand the row selection.
Let's see how we can select a row using different ways that are as follows:
Selection by Label:
We can select a row by passing the row label to the loc function.
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print(df.loc['b'])
Output:
one    2.0
two    2.0
Name: b, dtype: float64
Selection by Integer Location:
The rows can also be selected by passing the integer location to the iloc function.
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print (df.iloc[3])
Output:
one 4.0
two 4.0
Name: d, dtype: float64
Slice Rows
Multiple rows can be selected using the ':' operator.
Example:
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
df = pd.DataFrame(info)
print(df[2:5])
Output:
   one  two
c  3.0    3
d  4.0    4
e  5.0    5
Addition of rows:
We can easily add new rows to a DataFrame using the append() function. It adds the new rows at the end.
Example:
import pandas as pd
d = pd.DataFrame([[7, 8], [9, 10]], columns=['x', 'y'])
d2 = pd.DataFrame([[11, 12], [13, 14]], columns=['x', 'y'])
d = d.append(d2)
print(d)
Output:
    x   y
0   7   8
1   9  10
0  11  12
1  13  14
Deletion of rows:
We can delete or drop rows from a DataFrame using the index label. If the label is duplicated, then multiple rows will be deleted.
Example:
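The original notes do not include the code for this example, so the following is a minimal sketch; the column names x and y and the duplicated index label 0 are assumptions made for illustration:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['x', 'y'], index=[0, 1, 0])
# drop all rows with index label 0; the label is duplicated, so two rows are deleted
df = df.drop(0)
print(df)
Output:
   x  y
1  3  4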
DataFrame Functions
There are lots of functions used in DataFrame which are as follows:
Functions Description
Pandas DataFrame.append() Add the rows of other dataframe to the end of the given
dataframe.
Pandas DataFrame.apply() Allows the user to pass a function and apply it to every single
value of the Pandas series.
Pandas DataFrame.assign() Add new column into a dataframe.
Pandas DataFrame.astype() Cast the Pandas object to a specified dtype.astype() function.
Pandas DataFrame.concat() Perform concatenation operation along an axis in the
DataFrame.
Pandas DataFrame.count() Count the number of non-NA cells for each column or row.
Pandas DataFrame.describe() Calculate some statistical data like percentile, mean and std
of the numerical values of the Series or DataFrame.
Pandas DataFrame.drop_duplicates() Remove duplicate values from the DataFrame.
Pandas DataFrame.groupby() Split the data into various groups.
Pandas DataFrame.head() Returns the first n rows for the object based on position.
Pandas DataFrame.hist() Divide the values within a numerical variable into "bins".
Pandas DataFrame.iterrows() Iterate over the rows as (index, series) pairs.
Pandas DataFrame.mean() Return the mean of the values for the requested axis.
Pandas DataFrame.melt() Unpivots the DataFrame from a wide format to a long format.
Pandas DataFrame.merge() Merge the two datasets together into one.
Pandas DataFrame.pivot_table() Aggregate data with calculations such as Sum, Count,
Average, Max, and Min.
Pandas DataFrame.query() Filter the dataframe.
Pandas DataFrame.sample() Select the rows and columns from the dataframe randomly.
Pandas DataFrame.shift() Shift column or subtract the column value with the previous
row value from the dataframe.
Pandas DataFrame.sort() Sort the dataframe.
Pandas DataFrame.sum() Return the sum of the values for the requested axis by the
user.
Pandas DataFrame.to_excel() Export the dataframe to the excel file.
Pandas DataFrame.transpose() Transpose the index and columns of the dataframe.
Pandas DataFrame.where() Check the dataframe for one or more conditions.
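To illustrate a few of the functions listed above, here is a short sketch on a small made-up DataFrame (the column names and values are assumptions for illustration only; sort_values() is the modern replacement for the sort() listed in the table):
import pandas as pd
df = pd.DataFrame({'Dept': ['CSE', 'ECE', 'CSE', 'EEE'],
                   'Marks': [78, 65, 91, 70]})
print(df.head(2))                          # first 2 rows
print(df.describe())                       # count, mean, std, min, quartiles, max
print(df.sort_values('Marks'))             # rows ordered by the Marks column
print(df.groupby('Dept')['Marks'].mean())  # average marks per department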
Example:
import pandas as pd
mid_term_marks = {"Student": ["Kamal", "Arun", "David", "Thomas", "Steven"],
"Economics": [10, 8, 6, 5, 8],
"Fine Arts": [7, 8, 5, 9, 6],
"Mathematics": [7, 3, 5, 8, 5]}
mid_term_marks_df = pd.DataFrame(mid_term_marks)
print(mid_term_marks_df)
mid_term_marks_df.to_csv(r"D:\midterm.csv")
print(pd.read_csv(r"D:\midterm.csv"))
Output:
  Student  Economics  Fine Arts  Mathematics
0   Kamal         10          7            7
1    Arun          8          8            3
2   David          6          5            5
3  Thomas          5          9            8
4  Steven          8          6            5
CSV stands for comma-separated values. A CSV file is a delimited text file that uses a
comma to separate values.
The CSV file format is quite popular and supported by many software applications such as
Notepad, Microsoft Excel and Google Spreadsheet.
1. Using Notepad: We can create a CSV file using Notepad. In the Notepad, open a new
file in which separate the values by comma and save the file with .csv extension.
2. Using Excel: We can also create a CSV file using Excel. In Excel, open a new file in
which specify each value in a different cell and save it with filetype CSV.
To read data row-wise from a CSV file in Python, we can use the reader object of the csv module, which allows us to fetch data row-wise; pandas also provides the read_csv() function for loading a CSV file into a DataFrame.
Syntax
pandas.read_csv(filepath_or_buffer, sep=',', names=None, index_col=None, skipinitialspace=False)
Example:
import pandas
result = pandas.read_csv(r'D:\data.csv')
print(result)
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. Bad data could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
1) Empty Cells
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
Example:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Note: By default, the dropna() method returns a new DataFrame, and will not change the original.
If you want to change the original DataFrame, use the inplace = True argument:
Example
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Note: Now, the dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
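fillna() can also be applied to a single column only, so that the rest of the DataFrame is left untouched. A minimal sketch, assuming data.csv has a numeric column named "Salary" (replace the name with a column that actually exists in your file):
import pandas as pd
df = pd.read_csv('data.csv')
# replace empty cells only in the "Salary" column (an assumed column name)
df["Salary"].fillna(130, inplace = True)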
2) Data in Wrong Format
Cells with data of the wrong format can make it difficult, or even impossible, to analyze the data. One way to fix it is to convert all cells in the column into the same format, in this case into dates:
Example
#Convert to date:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
As you can see from the result, the date in row 26 was fixed, but the empty date in row 22 got a NaT
(Not a Time) value, in other words an empty value. One way to deal with empty values is simply
removing the entire row.
Removing Rows
The result of the conversion in the example above gave us a NaT value, which can be handled as a NULL value, and we can remove the row by using the dropna() method.
Example
Remove rows with a NULL value in the "Date" column:
df.dropna(subset=['Date'], inplace = True)
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like
if someone registered "199" instead of "1.99".
Sometimes you can spot wrong data by looking at the data set.
If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the
other rows the duration is between 30 and 60.
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we
could just insert "45" in row 7:
Example
df.loc[7, 'Duration'] = 45
For small data sets you might be able to replace the wrong data one by one, but not for big
data sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries
for legal values, and replace any values that are outside of the boundaries.
Example
Loop through all values in the "Duration" column; if the value is higher than 120, set it to 120:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data.
This way you do not have to find out what to replace them with, and there is a good chance
you do not need them to do your analyses.
Example
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
4) Removing Duplicates
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
By taking a look at our test data set
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:
Example
print(df.duplicated())
Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
Example
df.drop_duplicates(inplace = True)
The (inplace = True) will make sure that the method does NOT return a new DataFrame, but it will
remove all duplicates from the original DataFrame.
Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.
Matplotlib was created by John D. Hunter in 2002.
Its first version was released in 2003.
Matplotlib is open source and we can use it freely.
Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript for platform compatibility.
Features of matplotlib
Matplotlib is used as a data visualization library for the Python programming language.
Matplotlib provides a procedural interface called Pylab, which is designed to make it work like MATLAB, a programming language used by scientists and researchers. MATLAB is paid application software and is not open source.
It is similar to plotting in MATLAB, as it allows users to have a full control over fonts, lines,
colors, styles, and axes properties like MATLAB.
It provides excellent way to produce quality static-visualizations that can be used for
publications and professional presentations.
Matplotlib is a cross-platform library that can be used in Python scripts, any Python shell (available in IDLE, PyCharm, etc.), IPython shells (conda, Jupyter Notebook), web application servers (Django, Flask), and various GUI toolkits (Tkinter, PyQt).
Installation of Matplotlib
The Python package manager pip is used to install Matplotlib. Open the command prompt window and type the following command:
pip install matplotlib
To verify that Matplotlib is installed properly, import it and print its __version__ attribute in the terminal:
import matplotlib
matplotlib.__version__
Output:
'3.1.1'
Matplotlib Pyplot
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule and are usually imported under the plt alias.
matplotlib.pyplot is a collection of command-style functions that make Matplotlib feel like working with MATLAB.
Each pyplot function makes some change to the plot (figure).
The pyplot module provides the plot() function, which is frequently used to plot a graph.
One function creates a figure: matplotlib.pyplot.figure(); another function creates a plotting area in a figure: matplotlib.pyplot.plot(). In general, pyplot:
Plots some lines in a plotting area.
Decorates the plot with labels, annotations, etc.
You can import the pyplot API in Python with the following code:
from matplotlib import pyplot as plt
-- OR --
import matplotlib.pyplot as plt
In the above code, the pyplot API from the matplotlib library is imported into the program
and referenced as the alias name plt. You can give any name, but plt is standard and most
commonly used.
Plot():
The plot() function is used to draw a line graph. A line graph is a chart that shows information as a series of data points connected by straight lines.
By default, the plot() function draws a line from point to point.
The function takes parameters for specifying points in the diagram.
Syntax :
matplotlib.pyplot.plot()
Parameters: This function accepts parameters that enable us to set axes scales and format the graphs. These parameters are mentioned below:
plot(x, y): plot x and y using the default line style and color.
plot.axis([xmin, xmax, ymin, ymax]): scales the x-axis and y-axis from minimum to maximum values.
plot(x, y, color='green', marker='o', linestyle='dashed', linewidth=2, markersize=12): x and y coordinates are marked using circular markers of size 12, with a green dashed line of width 2.
plot.xlabel('X-axis'): names the x-axis.
plot.ylabel('Y-axis'): names the y-axis.
plot.title('Title name'): gives a title to your plot.
plot(x, y, label='Sample line'): the plotted line will be displayed in the legend as 'Sample line'.
Example
If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3, 10] to
the plot function
Draw a line in a diagram from position (1, 3) to position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Output:
What are Plots (Graphics):
Plots (graphics), also known as charts, are a visual representation of data in the form of colored
(mostly) graphics.
Plot Types
The six most commonly used Plots come under Matplotlib. These are:
Line Plot
Bar Plot
Scatter Plot
Pie Plot
Area Plot
Histogram Plot
Line Plot:
Line plots are drawn by joining data points with straight lines, where the points are the intersections of the x-axis and y-axis values.
Example:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
x = [5,8,10]
y = [12,16,6]
x2 = [6,9,11]
y2 = [6,15,7]
plt.plot(x, y, 'g', label='line one', linewidth=5)
plt.plot(x2, y2, 'c', label='line two', linewidth=5)
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend()
plt.grid(True,color='k')
plt.show()
Output:
Bar Plot:
The bar plots are vertical/horizontal rectangular graphs that show data comparison
Example:
import matplotlib.pyplot as plt
plt.bar([0.25,1.25,2.25,3.25,4.25],[50,40,70,80,20],
label="BMW",width=.5)
plt.bar([.75,1.75,2.75,3.75,4.75],[80,20,20,50,60],
label="Audi", color='r',width=.5)
plt.legend()
plt.xlabel('Days')
plt.ylabel('Distance (kms)')
plt.title('Information')
plt.show()
Output:
Histogram Plot:
Histograms are used to show a distribution whereas a bar chart is used to compare different entities.
Example:
import matplotlib.pyplot as plt
population_age = [22,55,62,45,21,22,34,42,42,4,2,102,95,85,55,110,120,70,65,55,111,115,80,75,65,54,44,43,42,48]
bins = [0,10,20,30,40,50,60,70,80,90,100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()
Output:
Scatter Plot:
Scatter plots are used while comparing various data variables to determine the connection between dependent and independent variables.
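The original notes show only the output for this plot, so the code below is a minimal sketch; the data values are assumptions made for illustration:
import matplotlib.pyplot as plt
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78]
plt.scatter(x, y, color='k', s=25)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot')
plt.show()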
Output:
Pie Plot:
A pie plot is a circular graph in which the data is represented as components/segments or slices of a pie.
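The code for this example is missing in the original notes; here is a minimal sketch with assumed slice labels and values:
import matplotlib.pyplot as plt
slices = [7, 2, 2, 13]
activities = ['sleeping', 'eating', 'working', 'playing']
plt.pie(slices, labels=activities, startangle=90, autopct='%1.1f%%')
plt.title('Pie Plot')
plt.show()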
Output:
Area Plot:
The area plots spread across certain areas with bumps and drops (highs and lows) and
are also known as stack plots.
Example:
import matplotlib.pyplot as plt
days = [1,2,3,4,5]
sleeping = [7,8,6,11,7]
eating = [2,3,4,3,2]
working = [7,8,7,2,2]
playing = [8,5,7,8,13]
plt.stackplot(days, sleeping, eating, working, playing,
labels=['Sleeping','Eating','Working','Playing'],
colors=['m','c','r','k'])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Stack Plot')
plt.legend()
plt.show()
Output:
Data Science
What is Data Science?
Data science is a deep study of the massive amount of data, which involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using the
scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate the data so that we
can find something new and meaningful.
In short, we can say that data science is all about:
Asking the correct questions and analyzing the raw data.
Modeling the data using various complex and efficient algorithms.
Visualizing the data to get a better perspective.
Understanding the data to make better decisions and finding the final result.
1. Discovery:
The first phase is discovery, which involves asking the right questions.
When we start any data science project, we need to determine what are the basic
requirements, priorities, and project budget.
In this phase, we need to determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at the first hypothesis level.
2. Data preparation:
In this phase, we collect the data and prepare it for analysis by performing tasks such as data cleaning, transformation, and integration (extract, transform and load).
3. Model Planning:
In this phase, we need to determine the various methods and techniques to establish the
relation between input variables.
We will apply Exploratory Data Analysis (EDA), using various statistical formulas and visualization tools, to understand the relations between variables and to see what the data can tell us.
Common tools used for model planning are:
1. SQL Analysis Services
2. R
3. SAS
4. Python
4. Model-building:
In this phase, we start building the model: datasets are prepared for training and testing, and techniques such as classification, clustering, and regression are applied.
5. Operationalize:
In this phase, we will deliver the final reports of the project, along with briefings, code, and
technical documents.
This phase provides us a clear overview of complete project performance and other
components on a small scale before the full deployment.
6. Communicate results:
In this phase, we check whether we have reached the goal that was set in the initial phase, and we communicate the findings and final result to the business team.
Exploratory Data Analysis(EDA)
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the first step in the data analysis process.
EDA was developed by John Tukey in the 1970s.
Exploratory Data Analysis (EDA) means understanding data sets by summarizing their main characteristics, often by plotting them visually.
This step is very important especially when we arrive at modeling the data in order to apply
Machine learning.
Plotting in EDA consists of Histograms, Box plot, Scatter plot and many more.
It often takes much time to explore the data.
Exploratory Data Analysis helps us to –
Give insight into a data set.
Understand the underlying structure.
Extract important parameters and relationships that hold between them.
Test underlying assumptions
1. Importing the required libraries for EDA.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
2. Loading the data into the data frame.
Loading the data into the pandas data frame is certainly one of the most important steps in
EDA.
The values in the data set are comma-separated, so we read the CSV file into a pandas data frame.
df =pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/employees.csv")
df.head() # To display the top 5 rows
3. Checking the types of data.
Here we check the data types because sometimes a column such as Salary might be stored as strings; in that case, we would have to convert those strings to integers before we could plot the data in a graph.
Here, in this case, the data is already in integer format, so there is nothing to worry about.
df.dtypes
Let’s see the shape of the data using the shape.
df.shape
(1000,8)
This means that this dataset has 1000 rows and 8 columns.
Let's get a quick summary of the dataset using the describe() method. The describe() function applies basic statistical computations to the dataset, such as extreme values, count of data points, standard deviation, etc. Any missing or NaN value is automatically skipped. The describe() function gives a good picture of the distribution of the data.
df.describe()
Now, let's also look at the columns and their data types. For this, we will use the info() method.
df.info()
4. Dropping irrelevant columns.
This step is certainly needed in every EDA because sometimes there are many columns that we never use. In such cases, drop the irrelevant columns.
For example, in this case, columns such as Last Login Time and Senior Management do not add much to the analysis, so we could drop them (a sketch follows below).
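A minimal sketch of this step, continuing with the df loaded above and assuming we drop only the 'Last Login Time' column here (the 'Senior Management' column is kept because it is reused below when handling missing values):
# drop a column that is not needed for the analysis
df = df.drop(['Last Login Time'], axis=1)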
df.head(5)
5. Renaming the columns.
In this instance, some of the column names are confusing to read, so we rename them.
This is a good approach since it improves the readability of the data set.
df = df.rename(columns={"Start Date": "SDate"})
df.head(5)
6. Handling missing values.
Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas. There are several useful functions for detecting, removing, and replacing null values in a pandas DataFrame:
print(df.isnull().sum())
We can see that every column has a different number of missing values; for example, Gender has 145 missing values while Salary has 0. To handle these missing values there are several options, such as dropping the rows containing NaN or replacing NaN with the mean, median, mode, or some other value.
Now, let’s try to fill the missing values of gender with the string “No Gender”.
df["Gender"].fillna("No Gender", inplace = True)
df.isnull().sum()
Now, Let’s fill the senior management with the mode value.
mode = df['Senior Management'].mode().values[0]
df['Senior Management']= df['Senior Management'].replace(np.nan, mode)
df.isnull().sum()
Now for the first name and team, we cannot fill the missing values with arbitrary data, so,
let’s drop all the rows containing these missing values.
df = df.dropna(axis = 0, how ='any')
print(df.isnull().sum())
df.shape
7. Detecting Outliers:
Outliers are nothing but an extreme value that deviates from the other observations in the
dataset.
An outlier is a point or set of points that are different from other points.
Sometimes they can be very high or very low.
It's often a good idea to detect and remove the outliers. Because outliers are one of the
primary reasons for resulting in a less accurate model.
Hence it's a good idea to remove them.
The IQR (Inter-Quartile Range) score technique is used to detect and remove outliers.
Outliers can be seen with visualizations using a box plot.
sns.boxplot(x=df['Salary'])
Here, in the plot, we can see that some points lie outside the box; they are none other than outliers.
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
8. Data Visualization
Data Visualization is the process of analyzing data in the form of graphs or maps, making it a lot
easier to understand the trends or patterns in the data. There are various types of visualizations –
Histogram Plot
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df, )
plt.show()
Box Plot
Box Plot is the visual representation of groups of numerical data through their quartiles. A boxplot is also used to detect outliers in a data set. It captures the summary of the data efficiently with a simple box and whiskers and allows us to compare easily across groups.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot( x="Salary", y='Team', data=df, )
plt.show()
Heat Map Plot
A heat map is a type of plot that is useful when we need to find how the variables depend on each other.
One of the best ways to find the relationship between the features can be done using heat
maps.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
c = df.corr()
sns.heatmap(c, cmap="BrBG", annot=True)
plt.show()
A Scatter plot is a type of data visualization technique that shows the relationship between
two numerical variables.
Calling the scatter() method on the plot member draws a plot between two variables or two
columns of pandas DataFrame.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()
Descriptive Statistics
Descriptive Statistics is the default process in data analysis.
Exploratory Data Analysis (EDA) is not complete without a descriptive statistics analysis.
Descriptive Statistics is divided into two parts:
1. Measures of Central Tendency (central data points) and
2. Measures of Dispersion.
1. Measure of Central Tendency
The following operations are performed under Measures of Central Tendency. Each of these measures describes a different indication of the typical or central value in the distribution.
1. Count
2. Mean
3. Mode
4. Median
2. Measure of Dispersion
The following operations are performed under Measures of Dispersion. Measures of dispersion can be defined as positive real numbers that measure how homogeneous or heterogeneous the given data is.
1. Range
2. Percentiles (or) Quartiles
3. Standard deviation
4. Variance
5. Skewness
Example:
Consider a file:
https://media.geeksforgeeks.org/wp-content/uploads/employees.csv
Before starting the descriptive statistics analysis, complete the data collection and cleaning process.
Data Collection:
Here, data collection means reading the CSV file referenced above into a pandas data frame.
Data Cleaning:
Data cleaning means fixing bad data in your data set before data analysis.
Describe() method :
Let’s get a quick summary of the dataset using the describe() method.
The describe() function applies basic statistical computations on the dataset like extreme
values, count of data points standard deviation, etc.
Any missing value or NaN value is automatically skipped.
describe() function gives a good picture of the distribution of data.
Count
It calculates the total count of values in a numerical column, or of each category of a categorical variable.
height =df["height"]
print(height)
Mean
The sum of the values present in a column divided by the total number of rows in that column is known as the mean.
Median
The median value divides the data points into two parts: 50% of the data points lie above the median and 50% lie below it.
Mode
There is only one mean and one median value for each column, but an attribute can have more than one mode value.
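A short sketch of these measures using pandas, continuing with the data frame loaded above (the column name "height" follows the earlier example and is an assumption about the data set being used):
print(df["height"].count())    # number of non-missing values
print(df["height"].mean())     # arithmetic mean
print(df["height"].median())   # middle value
print(df["height"].mode())     # most frequent value(s); there can be more than one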
Measures of Dispersion
Range
The difference between the maximum value and the minimum value in a column is known as the range.
Standard Deviation
The standard deviation value tells us how much all data points deviate from the mean value.
The standard deviation is affected by outliers because it uses the mean in its calculation:
σ = √( Σ (xᵢ - x̄)² / n )
where
σ = Standard Deviation
xᵢ = Terms Given in the Data
x̄ = Mean
n = Total number of Terms
Output:
2.442525704031867
Variance
Variance is the square of the standard deviation. In the case of outliers, the variance value becomes large and noticeable.
Output:
5.965931814856368
Skewness
Ideally, the distribution of data should be in the shape of a Gaussian (bell curve).
But practically, data shapes are often skewed or asymmetric; this is known as skewness in the data.
The skewness value can be negative (left skew) or positive (right skew), and ideally it should be close to zero.
Example for skewness
df.skew()
df.loc[:,"height"].skew()
output:
0.06413448813322854
Percentiles or Quartiles
Quartiles divide the sorted data into four equal parts: Q1 (25th percentile), Q2 (50th percentile, the median) and Q3 (75th percentile).
Based on the quartiles, there is another measure called the inter-quartile range that also measures the variability in the dataset. It is defined as:
IQR = Q3 - Q1
Output:
8718.5
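A short sketch showing how these dispersion measures can be computed with pandas, continuing with the same data frame (the column name "height" again follows the earlier example and is an assumption):
col = df["height"]
print(col.max() - col.min())   # range
print(col.std())               # standard deviation
print(col.var())               # variance
print(col.skew())              # skewness
Q1 = col.quantile(0.25)
Q3 = col.quantile(0.75)
print(Q3 - Q1)                 # inter-quartile range (IQR)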
Basic tools (plots, graphs and summary statistics) of EDA
Exploratory data analysis or “EDA” is a critical first step in analyzing the data
The uses of EDA are:
1. Detection of mistakes
2. Checking of assumptions
3. Preliminary selection of appropriate models
4. Determining relationships among the exploratory variables
Data Types:
Categorical Data
Nominal Data
Nominal values represent discrete units and are used to label variables that have no quantitative value.
Nominal data has no order; if the order of the values were changed, their meaning would not change.
Examples: gender, languages spoken, nationality.
Ordinal Data
Ordinal values represent discrete and ordered units; they are like nominal values except that their ordering matters (for example, education level: primary, secondary, graduate).
Numerical Data
1. Discrete Data
Discrete data has values that are distinct and separate, i.e., values that can be counted (for example, the number of students in a class).
2. Continuous Data
Continuous Data represents measurements and therefore their values can’t be counted but
they can be measured.
An example would be the height of a person, which can describe by using intervals on the real
number line.
Interval Data
Interval values represent ordered units that have the same difference, but no absolute zero.
Example: temperature in degrees Celsius.
Ratio Data
Ratio values are also ordered units that have the same difference.
Ratio values are the same as interval values, with the difference that they do have an absolute
zero.
Good examples are height, weight, length etc.
Example: a weight of 0 kg means no weight at all, which is a true (absolute) zero.
Types of EDA
Univariate non-graphical:
This is the simplest form of data analysis among the four options.
In this type of analysis, the data that is being analysed consists of just a single variable.
The main purpose of this analysis is to describe the data and to find patterns.
Univariate graphical:
This type of analysis also uses a single variable, but describes it graphically; common examples are histograms, box plots, and stem-and-leaf plots.
Multivariate non-graphical:
The multivariate non-graphical type of EDA generally depicts the relationship between
multiple variables of data through cross-tabulation or statistics.
Multivariate graphical:
This type of EDA displays the relationship between two or more set of data.
A bar chart, where each group represents a level of one of the variables and each bar within
the group represents levels of other variables.
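As a sketch of the grouped bar chart described above (the group names and values are assumptions made for illustration):
import numpy as np
import matplotlib.pyplot as plt
groups = ['Group A', 'Group B', 'Group C']   # levels of the first variable
men = [20, 34, 30]                           # levels of the second variable (assumed counts)
women = [25, 32, 34]
x = np.arange(len(groups))                   # bar positions
width = 0.35                                 # width of each bar
plt.bar(x - width/2, men, width, label='Men')
plt.bar(x + width/2, women, width, label='Women')
plt.xticks(x, groups)
plt.legend()
plt.show()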
COVARIANCE
Covariance is a measure of the relationship between 2 variables that is scale dependent, i.e., how much one variable changes when another variable changes.
For two variables X and Y with n observations, the sample covariance can be represented with the following equation:
cov(X, Y) = Σ (xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
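A small sketch of computing covariance in Python (the values are made up for illustration); the pandas cov() method returns the pairwise covariance:
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2, 4, 5, 4, 5]})
print(df['x'].cov(df['y']))   # covariance between the two columns
print(df.cov())               # full covariance matrix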
CORRELATION
Correlation is a normalized form of covariance that measures both the strength and the direction of the linear relationship between two variables and is scale independent (its value always lies between -1 and +1).
This can be calculated easily within Python, particularly when using pandas:
import pandas as pd
df = pd.read_csv("data.csv")
df.corr()
The important reasons to implement EDA when working with data are:
1. To gain intuition about the data;
2. To make comparisons between distributions;
3. For sanity checking (making sure the data is on the scale we expect, in the
format we thought it should be);
4. To find out where data is missing or if there are outliers; and to summarize the data.
In the context of data generated from logs, EDA also helps with debugging the logging
process.
In the end, EDA helps us to make sure the product is performing as intended.
There’s lots of visualization involved in EDA.
The distinction between EDA and data visualization is that EDA is done toward the beginning of the analysis, while data visualization is done toward the end to communicate one's findings.
With EDA, the graphics are solely done for us to understand what’s going on.
EDA is also used to improve the development of algorithms.
Data Visualization
What is Data Visualization?
Data visualization is the graphical representation of data (charts, graphs, and maps) that makes it easier to see trends and patterns. A good visualization has the following characteristics:
Clarity - Clarity ensures that the data set is complete and relevant.
Accuracy – Accuracy ensures using appropriate graphical representation to convey the right
message.
Efficiency - Efficiency uses efficient visualization technique which highlights all the data
points
The key elements of data visualization are:
Visual effect
Coordination System
Data Types and Scale
Informative Interpretation
Visual effect - Visual Effect includes the usage of appropriate shapes, colors, and
size to represent the analyzed data.
Coordination System - The Coordinate System helps to organize the data points
within the provided coordinates.
Data Types and Scale - The Data Types and Scale choose the type of data such as
numeric or categorical.
Informative Interpretation – The Informative Interpretation helps create visuals in an effective and easily interpretable manner using labels, titles, legends, and pointers.
Popular Python libraries used for data visualization include:
Matplotlib
Pandas Visualization
Seaborn
ggplot
Plotly
Plots (graphics), also known as charts, are a visual representation of data in the form of
colored (mostly) graphics.
Histogram Plot
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df, )
plt.show()
Box Plot
Box Plot is the visual representation of groups of numerical data through their quartiles. A boxplot is also used to detect outliers in a data set. It captures the summary of the data efficiently with a simple box and whiskers and allows us to compare easily across groups.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot( x="Salary", y='Team', data=df, )
plt.show()
Heat Maps Plot
A heat map is a type of plot that is useful when we need to find how the variables depend on each other.
One of the best ways to find the relationship between the features can be done using heat
maps.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
c = df.corr()
sns.heatmap(c, cmap="BrBG", annot=True)
plt.show()
A Scatter plot is a type of data visualization technique that shows the relationship between
two numerical variables.
Calling the scatter() method on the plot member draws a plot between two variables or two
columns of pandas DataFrame.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()