
Exploring Data with Graphs and Numerical

Summaries
Intro to AI and Data Science
NGN 112 – Fall 2024

Amer S. Zakaria
Department of Electrical Engineering
College of Engineering

American University of Sharjah

Prepared by Dr. Hussam Alshraideh, INE

Last Updated on: 28th of Oct. 2024


Table of Contents
2

Introduction to Statistics using Python

Data Loading, Visualization and Preprocessing

Data Summarization for Data Science Applications


3 Introduction to Statistics using Python
4 What Are the Types of Data?
Variable
5

 A variable is any characteristic that is recorded for the subjects in a study
  Examples: Marital status, Height, Weight, IQ

 A variable can be classified as either
  Categorical or
  Quantitative
   ◼ Discrete or
   ◼ Continuous
Categorical Variable
6

A variable is categorical if each observation belongs to one of a set of


categories.
 Examples:

1. Gender
2. Religion
3. Type of residence (Apt, Villa, …)
4. Belief in Aliens (Yes or No)
Quantitative Variable
7

A variable is called quantitative if observations take numerical values for


different magnitudes of the variable.

 Examples:
1. Age
2. Number of siblings
3. Annual Income
Quantitative vs. Categorical
8

 For Quantitative variables, key features are the center (a


representative value) and spread (variability).

 Example: average exam grade is 77.8% and spread (min grade 57% and
highest 96%)

 For Categorical variables, a key feature is the percentage of observations in each of the categories.

 Example: 45% male students and 55% female students


Discrete Quantitative Variable
9

 A quantitative variable
is discrete if its possible
values form a set of
separate numbers:
0,1,2,3,….
 Examples:
1. Number of pets in a
household
2. Number of children in a
family
3. Number of foreign
languages spoken by an
individual
Continuous Quantitative Variable
10

 A quantitative variable is
continuous if its possible values
form an interval
 Examples:
1. Height/Weight
2. Age
3. Blood pressure
4. Measurements

11 Describe the Center of Quantitative Data
Mean
12

 The mean is the sum of the observations divided by the number of observations
 It is the center of mass
Python: mean()
13

import numpy as np

X = np.array([210, 260, 125, 140])

np.mean(X)

#or

X.mean()
Median
14

The median is the midpoint of the observations when ordered from least to greatest:
1. Order the observations
2. If the number of observations is:
   a) Odd, the median is the middle observation
   b) Even, the median is the average of the two middle observations

Example (odd, n = 9): 78, 91, 94, 98, 99, 101, 103, 105, 114 → median = 99
Example (even, n = 10): 78, 91, 94, 98, 99, 101, 103, 105, 114, 121 → median = (99 + 101)/2 = 100
Python: median()
15

import numpy as np

X = np.array([ 210, 260, 125, 140])

np.median(X)
Example: Data & Histograms (1/2)
16

Example: The scores of 30 students are as follows:

[85, 92, 78, 88, 95, 90, 88, 72, 68, 98, 84, 91, 88, 75, 92, 89, 79, 83, 87, 94, 86, 88, 76, 81, 90, 92, 70, 85, 89, 93]

• To create a histogram for this data, you would first group the scores into bins or intervals (e.g., 60-69, 70-79, 80-89, 90-99).

• Now, you count how many students scored within each of these ranges.

60-69: 1 student
70-79: 6 students
80-89: 13 students
90-99: 10 students
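These bin counts can be reproduced with NumPy's histogram function; a minimal sketch (the bin edges below are chosen to match the 60-69 … 90-99 intervals):

```python
import numpy as np

scores = np.array([85, 92, 78, 88, 95, 90, 88, 72, 68, 98, 84, 91, 88, 75, 92,
                   89, 79, 83, 87, 94, 86, 88, 76, 81, 90, 92, 70, 85, 89, 93])

# Edges define the bins 60-69, 70-79, 80-89, 90-99
counts, edges = np.histogram(scores, bins=[60, 70, 80, 90, 100])
print(counts.tolist())  # [1, 6, 13, 10]
```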
Example: Data & Histograms (2/2)
17

Source:
https://www.techtarget.com/searchsoftwarequality/definition/histogram

[Figure: three example histograms — most student grades around average, most low, and most high]
Comparing the Mean and Median
18

 Mean and median of a symmetric distribution are close
  The mean is often preferred because it uses all values in its calculations
 In a skewed distribution, the mean is farther out in the skewed tail than the median
  The median is preferred because it is a better representative of a typical observation
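A quick illustration with made-up numbers: one extreme value pulls the mean into the tail while the median barely moves.

```python
import numpy as np

symmetric = np.array([70, 75, 80, 85, 90])
skewed = np.array([70, 75, 80, 85, 300])  # one extreme high value

print(np.mean(symmetric), np.median(symmetric))  # mean 80.0, median 80.0
print(np.mean(skewed), np.median(skewed))        # mean 122.0, median still 80.0
```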
Mode
19

 Value that occurs most often (e.g., what is the most frequent major of students in NGN 112-04?)
 Highest bar in the histogram
 Mode is most often used with categorical data
Python: st.mode()
20

#run this on colab

import numpy as np
from scipy import stats as st

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

st.mode(X)
21 Describe the Spread of Quantitative Data
Range
22

Range = max - min


Advantage: simple description of the spread of the data
Disadvantage: the range is strongly affected by outliers.
Python: Range
23

import numpy as np

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

Range=np.max(X)-np.min(X) #or X.max()–X.min()

print(Range)
Standard Deviation
24

 Each data value has an associated deviation from the mean, x − x̄
 A deviation is positive if the value falls above the mean and negative if it falls below the mean
 The sum of the deviations is always zero
Standard Deviation
25

The standard deviation gives a measure of variation by summarizing the deviations of each observation from the mean and calculating an adjusted average of these deviations:
1. Find the mean
2. Find each deviation
3. Square the deviations
4. Sum the squared deviations
5. Divide the sum by n − 1
6. Take the square root
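The six steps can be traced directly in NumPy; a sketch using the small dataset from the earlier slides (note np.std divides by n by default, so ddof=1 is needed to match the n − 1 in step 5):

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140], dtype=float)

mean = X.mean()                   # 1. find the mean
deviations = X - mean             # 2. find each deviation (these sum to zero)
squared = deviations ** 2         # 3. square the deviations
total = squared.sum()             # 4. sum the squared deviations
variance = total / (len(X) - 1)   # 5. divide the sum by n - 1
std = np.sqrt(variance)           # 6. take the square root

print(std)
print(np.std(X, ddof=1))  # same value
```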
Example: Standard Deviation
26

Metabolic rates of 7 men (calories/24 hours)

Python: Standard deviation std()
27

import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

np.std(X)  # population standard deviation (divides by n)

#or

X.std()

# For the sample standard deviation (divides by n - 1, as in the steps), use:
np.std(X, ddof=1)
Measures of Position: Percentiles
28

The kth percentile, denoted Pk, of a set of data is a value such that k percent of the observations are less than or equal to that value.
Again: Percentile
29

The pth percentile is a value such that p percent of the observations fall below or at that value.
Quartiles
30

Quartiles divide data sets into four equal parts:

• The 1st quartile, Q1, divides the bottom 25% of the data from the top 75%. Equivalent to the 25th percentile.
• The 2nd quartile, Q2, divides the bottom 50% of the data from the top 50%. Equivalent to the 50th percentile, which is the median.
• The 3rd quartile, Q3, divides the bottom 75% of the data from the top 25%. Equivalent to the 75th percentile.
Finding Quartiles
31

Splitting the data into four parts:
1. Arrange the data in order
2. The median is the second quartile, Q2
3. Q1 is the median of the lower half of the observations
4. Q3 is the median of the upper half of the observations
Measure of Spread: Quartiles
32

Quartiles divide a ranked data set into four equal parts. For the example data:

1. 25% of the data fall at or below Q1 and 75% above: Q1 = first quartile = 2.2
2. 50% of the data fall at or below the median and 50% above: M = median = 3.4
3. 75% of the data fall at or below Q3 and 25% above: Q3 = third quartile = 4.35
Numeric Summarization of Data:
The 5 Number Summary
33

The five-number summary of a


dataset consists of:
1. Minimum value
2. First Quartile
3. Median
4. Third Quartile
5. Maximum value
Python: Percentiles and Quartiles
34

import numpy as np

# random.normal function produces a list of random


numbers with a Normal Gaussian Distribution. 170 is
the mean, 10 is the standard deviation, and 250 is the
number of generated samples.
x = np.random.normal(170, 10, 250)

np.min(x)
np.percentile(x, 25)
np.percentile(x, 50)
np.percentile(x, 75)
np.max(x)
The full code

import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

Range = np.max(X) - np.min(X)
print('Range = ', Range)

std = np.std(X)
print('std = ', std)

n1 = np.min(X)
n2 = np.percentile(X, 25)
n3 = np.percentile(X, 50)
n4 = np.percentile(X, 75)
n5 = np.max(X)

print('Five number summary: ', n1, ' ', n2, ' ', n3, ' ', n4, ' ', n5)

print('------------------------')

Output:
Range = 135
std = 45.753244420477984
Five number summary: 125 192.5 210.0 222.5 260
------------------------
35
36 Describe Categorical Variables
Proportion & Percentage (Rel. Freq.)
37

Proportions and percentages are also called relative


frequencies.
Frequency Table
38

A frequency table is a
listing of possible values
for a variable, together
with the number of
observations or relative
frequencies for each
value.
Python: Frequency Tables
39

import pandas as pd

data = {'Fruit': ['apple', 'apple', 'banana', 'orange', 'apple',
                  'apple', 'banana', 'banana', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)

#or

#df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
#                          'apple', 'banana', 'banana', 'orange', 'banana', 'apple'],
#                  columns=['Fruit'])  #columns: means the headers of the columns

print(df)

# Calculate absolute frequencies
absolute_frequencies = df['Fruit'].value_counts()
print(absolute_frequencies)  # which is a series

print()

# Calculate relative frequencies
relative_frequencies = df['Fruit'].value_counts(normalize=True)
#normalize means divide by the total, which is len(df) or 11
print(relative_frequencies)
The output:

Fruit
0 apple
1 apple
2 banana
3 orange
4 apple
5 apple
6 banana
7 banana
8 orange
9 banana
10 apple

apple 5
banana 4
orange 2
Name: Fruit, dtype: int64

apple 0.454545
banana 0.363636
orange 0.181818
Name: Fruit, dtype: float64

40
41 Describe Data Using Graphical Summaries
Pie Charts
42

 Summarize categorical
variable
 Drawn as circle where each
category is a slice
 The size of each slice is
proportional to the
percentage in that category
Python: Pie Chart
43

import pandas as pd

df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
                          'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])

#First: Find the absolute frequencies
absolute_frequencies = df['Fruit'].value_counts()

#Second: Create a one-column dataframe
df2 = pd.DataFrame({'Fruit': absolute_frequencies})

print(df2)

df2.plot.pie(y='Fruit', figsize=(5,5), autopct='%1.1f%%')
Bar Graphs
44

 Summarizes categorical
variable
 Vertical bars for each category

 Height of each bar represents

either counts or percentages


 Easier to compare categories
with bar graph than with pie
chart
 Called Pareto Charts when

ordered from tallest to shortest


Python: Bar chart
45

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
                          'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])

plt.figure(figsize=(5,5))  #width and height in inches

sns.countplot(x='Fruit', data=df, hue=df['Fruit'])

#or
ax = sns.countplot(x='Fruit', data=df)
Histograms
46

A graph that uses bars to show frequencies (counts) or relative frequencies of possible outcomes for a quantitative variable
Python: Histogram
47

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

ax = sns.histplot(data = mydata)

# Set labels and title


ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
Python: Histogram (with figure size)
48

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

plt.figure(figsize=(10,8))

ax = sns.histplot(data = mydata)
# Set labels and title
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")

# Show the plot
plt.show()  #optional

Summary of steps:
1. Create a figure: plt.figure…
2. Create a histogram: sns.histplot…
3. Show the plot: plt.show…
Interpreting Histograms
49

 Assess where a distribution is centered by finding the median
 Assess the spread of a distribution
 Shape of a distribution: roughly symmetric (left and right sides are mirror images), skewed to the right, or skewed to the left
Examples of Skewness
50
Outlier
51

An outlier falls far from the rest of the data


Boxplot
52

1. Box goes from the Q1 to Q3


2. Line is drawn inside the box at
the median
3. Line goes from lower end of
box (Q1) to smallest
observation not a potential
outlier
4. Line goes from upper end of
box (Q3) to largest
observation not a potential
outlier
5. Potential outliers are shown
separately, often with * or +
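The slides do not define "potential outlier" here; a common convention (an assumption on my part, and the default whis=1.5 behavior of matplotlib boxplots) is the 1.5 × IQR rule, sketched below on the dataset used earlier:

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

q1, q3 = np.percentile(X, [25, 75])
iqr = q3 - q1  # interquartile range

lower_fence = q1 - 1.5 * iqr  # values below this are potential outliers
upper_fence = q3 + 1.5 * iqr  # values above this are potential outliers

outliers = X[(X < lower_fence) | (X > upper_fence)]
print(outliers.tolist())  # [125, 140]
```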
Comparing Distributions
53

Boxplots do not display the shape of the distribution as clearly as histograms, but they are useful for making graphical comparisons of two or more datasets (or distributions)
Python: Boxplot
54

import numpy as np
import seaborn as sns

#random normal data(mean, std, size)
mydata = np.random.normal(170, 10, 250)

#Default: Vertical orientation
ax = sns.boxplot(data=mydata)

#Horizontal orientation
ax = sns.boxplot(data=mydata, orient='h')
Python: Boxplot (with figure options)
55

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata = np.random.normal(170, 10, 250)

plt.figure(figsize=(4,5))

ax = sns.boxplot(mydata)

#or
ax = sns.boxplot(y = mydata)

plt.show() #optional
56 Data Preprocessing
Data Normalization
57

 In machine learning, introduced in the next chapter, the data


needs to be normalized prior to training the machine learning
model.
 Normalization means that all data variables will have the
same range, for example, [0 to 1] or [-1 to 1] or [-3.4 to 3.4]
 This is needed as different variables have different ranges.

 For example, to predict a GPA, we need to know 3 variables:


 the number of hours a student studies,
 their IQ,
 and their attendance record.
 All these variables have different ranges of values; normalization
guarantees that they will have the same range, e.g., [0,1]
 We will look at 2 normalizations: Z-scores and min-max.
Data Normalization: Z-Scores
58

The z-score of an observation x is z = (x − mean) / (standard deviation).

An observation from a bell-shaped distribution is a potential outlier if its z-score < -3 or z-score > +3

• Suppose that the average and standard deviation values for attribute income are $55,000 and $10,000, respectively.
• An income of 60,000 would have a z-score of (60,000 − 55,000)/10,000 = 0.5
• We say that 60,000 is above the average by 0.5 standard deviations
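The arithmetic of the income example can be checked in a couple of lines:

```python
mean, std = 55_000, 10_000
income = 60_000

z = (income - mean) / std  # deviation from the mean, in standard deviations
print(z)  # 0.5
```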
Data Normalization: Z-Score
59

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])

Z_score_scaler = preprocessing.StandardScaler()
Z_score_scaler.fit(X_train)
# The fit method computes the mean and standard deviation of each feature/column in
# X_train, which will be used for scaling later
print('means per column:', Z_score_scaler.mean_)
print('variances per column: ', Z_score_scaler.var_)

# The transform method applies the scaling transformation to X_train, standardizing
# each feature/column by subtracting the mean and dividing by the standard deviation
X_scaled = Z_score_scaler.transform(X_train)

print('Original Data:\n', X_train)
print('Scaled Data:\n', X_scaled)

Output:
means per column: [1. 0. 0.33333333]
variances per column: [0.66666667 0.66666667 1.55555556]
Original Data:
[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
Scaled Data:
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
Data Normalization: min-max
60

Min-max scaling maps each value v to v' = (v − min) / (max − min), so the scaled data falls in the [0, 1] range.

Suppose that the minimum and maximum values for attribute income are $12,000 and $98,000, respectively.
An income of 60,000 would have a scaled value of (60,000 − 12,000)/(98,000 − 12,000) ≈ 0.558
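The min-max arithmetic of the income example, checked directly:

```python
vmin, vmax = 12_000, 98_000
income = 60_000

scaled = (income - vmin) / (vmax - vmin)
print(round(scaled, 3))  # 0.558
```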
Data Normalization: min-max
61

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
# The fit method finds the minimum and maximum of the data X_train

# The transform method applies the min-max scaling transformation to X_train to
# scale it within the range of [0, 1]. Each feature/column is transformed individually
X_scaled = min_max_scaler.transform(X_train)

print('Original Data:\n', X_train)
print('Scaled Data:\n', X_scaled)

Output:
Original Data:
[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
Scaled Data:
[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]
Preprocessing of data:
Discretization or quantization
62

• Discretization (otherwise known as quantization or binning) provides a way to partition a continuous variable/feature into discrete values.
• Certain datasets with continuous features may benefit from discretization because discretization can transform a dataset of continuous attributes into one with only nominal attributes.

Example: The GPA is a continuous variable; it can take values between 0 and 4.
We can discretize it into 4 bins as follows:

0 <= GPA <= 1.0 replace with 1
1 < GPA <= 2.0 replace with 2
2 < GPA <= 3.0 replace with 3
3 < GPA <= 4.0 replace with 4
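The GPA binning above can also be sketched with plain NumPy; this is an illustration only, not the scikit-learn KBinsDiscretizer approach shown on the next slide:

```python
import numpy as np

gpas = np.array([0.5, 1.0, 1.5, 2.7, 3.0, 3.9])

# right=True makes each interval closed on the right, matching the table above
# (e.g., 1 < GPA <= 2 -> bin index 1); adding 1 shifts to the labels 1..4
bins = np.digitize(gpas, bins=[1.0, 2.0, 3.0], right=True) + 1
print(bins.tolist())  # [1, 1, 2, 3, 3, 4]
```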
Preprocessing of data: Discretization
(TS, example: 3 variables: GPA, IQ and attendance score)
63

from sklearn import preprocessing
import numpy as np

x = np.array([[ 3.5, 132, 100 ], [ 2.9, 119, 95 ], [ 1.9, 99, 65 ]])

est = preprocessing.KBinsDiscretizer(n_bins=[4, 3, 2], encode='ordinal')
est.fit(x)

x_bins = est.transform(x)
print('Original Data:\n', x)
print('Binned Data:\n', x_bins)

Output:
Original Data:
[[  3.5 132.  100. ]
 [  2.9 119.   95. ]
 [  1.9  99.   65. ]]
Binned Data:
[[3. 2. 1.]
 [2. 1. 1.]
 [0. 0. 0.]]

• Each row represents a sample (or feature vector), and each column represents a variable/feature.
• Next, you create an instance of KBinsDiscretizer with n_bins=[4, 3, 2]. This means that you want to divide the first variable/feature into 4 bins (one of 4 options), the second feature into 3 bins (one of 3 options), and the third feature into 2 bins (one of 2 options).
• The encode='ordinal' parameter indicates that you want to encode the bins with ordinal integers. An ordinal number is a number that indicates the position, like 1st, 2nd, … or in zero indexing 0, 1, …
• Then, you fit the KBinsDiscretizer object ‘est’ to the data X using the fit method.
• Finally, you transform the data X using the transform method of ‘est’, which discretizes the values in X into
the specified number of bins. The result is a transformed array with the same shape as X.
Preprocessing of data:
Encoding categorical features
64

• Often, features are not given as continuous values but as categorical ones.
These need to be converted into numbers prior to using machine learning

• For example, a person could have features:


• ["male", "female"], Encoded to: [1, 0]
• ["from Europe", "from US", "from Asia"], Encoded to: [1, 2, 0]
• ["uses Firefox", "uses Chrome", "uses Safari", "uses Edge"]. Encode to: [2, 0, 3, 1]

• Such features can be efficiently coded as integers; for instance:
• ["male", "from US", "uses Edge"] could be expressed as [1, 2, 1]
• while ["female", "from Asia", "uses Chrome"] could be expressed as [0, 0, 0].

• Types of Encoders:
• Ordinal Encoders {1st,2nd,3rd,…} or {0, 1, 2,…} for multi-dimensional data as in
the example above
• Label Encoders (similar to ordinal encoders but for 1-D row arrays).
• One Hot Encoding: Binary Encoding 0 or 1
Preprocessing of data: Ordinal Encoding
An ordinal number is a number that indicates the position like 1st, 2nd,…
65

from sklearn import preprocessing


import numpy as np

X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses


Firefox'],['male', 'from Asia', 'uses Chrome'],['male', 'from US', 'uses Edge']]

enc = preprocessing.OrdinalEncoder()
enc.fit(X)

# notice the input to transform is a 2D array hence the [[..]]


X_sample = [['female', 'from US', 'uses Safari']]

rst = enc.transform(X_sample)

print(rst) Output:
[[0. 2. 3.]]

 The encoding is determined by sorting: the categories of every feature (column) are sorted lexicographically, and each category is replaced by its position in the sorted order.
66
Data summarization for data science
applications
67 Case Study 1: Iris Dataset
Iris Flowers Dataset
68

 In this case study, we will use the Iris sample data, which contains information on 150
Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica.
Each flower is characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica) — this is the label

Data is available online at: https://archive.ics.uci.edu/dataset/53/iris


https://www.youtube.com/watch?v=pTjsr_0YWas
Iris Flowers Dataset
69
Step 1: Reading the data
70

import pandas as pd

#In Colab, upload the iris.data file and place it under the “Sample Data” folder. Then
#right click on the uploaded “iris.data” file and copy the path
data = pd.read_csv('iris.data', header=None)
# or

# data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-
databases/iris/iris.data',header=None)

# Add column headers


data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width',
'species']

# Display the five rows


print(data.head())

Output: sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Step 2: Numerical summaries
71

absolute_frequencies = data['species'].value_counts()
print(absolute_frequencies)

Output: species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: count, dtype: int64

print(data.info())
Output: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length 150 non-null float64
1 sepal width 150 non-null float64
2 petal length 150 non-null float64
3 petal width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
Step 2: Numerical summaries
72

print(data.describe(include='all')) # Includes categorical data

Output:
sepal length sepal width petal length petal width species
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN Iris-setosa
freq NaN NaN NaN NaN 50
mean 5.843333 3.054000 3.758667 1.198667 NaN
std 0.828066 0.433594 1.764420 0.763161 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 NaN
Step 2: Numerical summaries
73

print(data.describe()) # Excludes categorical data

Output:
sepal length sepal width petal length petal width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Step 3: Visual Summaries - Histogram
74

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal
Length')

sns.histplot(data['sepal length'],
bins = 8)

plt.show()

# or
# data['sepal length'].hist(bins=8)
Step 3: Visual Summaries – Box Plots
75

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()
plt.xlabel('Feature')
plt.ylabel('Value (cm)')
plt.title('Data Boxplot')
sns.boxplot(data)
plt.show()

# or

plt.figure()
plt.xlabel('Feature')
plt.ylabel('Value (cm)')
plt.title('Data Boxplot')
data.boxplot()
plt.show()
Step 3: Visual Summaries – Scatter Plots
76
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
# Select two columns for the scatter plot
# Create a scatter plot of the selected columns
sns.scatterplot(data=data[0:150], x='sepal length', y='sepal width', hue='species')
plt.title('Iris Flowers')
plt.show()

In Scatter Plots:

 Each data point is represented by a dot on the graph
 Allowing you to see how one variable behaves in relation to another
 Scatterplots are helpful for identifying relationships between the two variables being compared (sepal length vs. sepal width in this example)
Step 3: Visual Summaries – Pair Plots
77

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()
sns.pairplot(data,hue='species')
plt.show()

 The pairplot is a scatter plot for every combination of the variables
 Here we have 4 variables, hence 4 × 4 = 16 plots
 The diagonal is a plot of a variable with itself; hence it shows the distribution of the variable for each class (we have 3 classes here)
Step 3: Visual Summaries – Pair Plots
78

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()

# Change the diagonal plots to


histograms
sns.pairplot(data,hue='species',
diag_kind ='hist')

plt.show()
Step 3: Visual Summaries – Heat Map
79

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()

data_numerical_columns = data.select_dtypes(include=['number'])

sns.heatmap(data_numerical_columns.corr(),annot=True)

plt.show()

 Another interesting case of data visualization is to use a heatmap to visualize the correlation matrix of the dataset (1 strong +ve linear correlation, 0 no correlation, -1 strong -ve linear correlation).
 This type of visualization helps to identify which variables are positively correlated (tend to change together in the same direction), negatively correlated (tend to change in opposite directions), or have no significant correlation.
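As a small sanity check of what the correlation matrix contains, here is a sketch with made-up data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],   # y = 2x: perfect positive linear correlation
    'z': [8.0, 6.0, 4.0, 2.0],   # z falls as x rises: perfect negative correlation
})

corr = df.corr()
print(corr.round(2))  # the x-y entry is 1.0 and the x-z entry is -1.0
```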
Filter a DataFrame
80

 You can filter DataFrames to obtain a subset of the data prior to plotting if needed.
 For example, assume that you want to filter the iris dataset for flowers with a class type of ‘setosa’.
 You can write one of the following:
data = pd.read_csv('iris.data', header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width',
'species']

filtered = data[(data.species == "Iris-setosa")]


print(filtered.head())

# or

filtered = data.query('species == "Iris-setosa"')


print(filtered.head())

Output: sepal length sepal width petal length petal width species

0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Learning Outcomes
81

Upon completion of the course, students will be able to:


1. Identify the importance of AI and Data Science for society
2. Perform data loading, preprocessing, summarization and
visualization
3. Apply machine learning methods to solve basic regression
and classification problems
4. Apply artificial neural networks to solve simple engineering
problems
5. Implement basic data science and machine learning tasks
using programming tools
Extra: Preprocessing of data: One-Hot-Encoding
Location: After slide#65
82
 Machine learning algorithms typically work with numerical data, so you need to convert categorical
values into numbers. One-hot encoding does this by creating binary (0 or 1) columns for each category.
 Example: consider a dataset with a column called "Color" with categorical values like "Red," "Green," and
"Blue."
 Original Categorical Data values: [Red,Green,Blue] can be represented in one column
Color # name of variable or column
Red # color of first observation
Red # color of second observation
Blue # color of third observation

 One-hot encoding would convert this into three binary columns, one for each color. Note that the columns are ordered alphabetically (Blue, Green, Red):
One-Hot Encoded Data: Blue Green Red # 3 columns = the number of possible values
Red = [0, 0, 1]
Green = [0, 1, 0]
Blue = [1, 0, 0]
 Note that the variable color had 3 values Red, Green, Blue hence 3 columns are needed to represent this
one variable
 In other words, the categorical variable color is one column and One-Hot is 3 columns
Extra: Preprocessing of data: One-Hot-Encoding
83

from sklearn import preprocessing


import numpy as np
#2 feature vectors (rows) and 3 variables (columns)
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OneHotEncoder()
enc.fit(X)
rst = enc.transform([['female', 'from US', 'uses Safari'], ['male', 'from Europe',
'uses Firefox']])
print(rst.toarray())

Output:
[[1. 0. 0. 1. 0. 1.]    # female, from US, uses Safari
 [0. 1. 1. 0. 1. 0.]]   # male, from Europe, uses Firefox

Column order (alphabetical within each feature): female, male | from Europe, from US | uses Firefox, uses Safari

Note: In the output, the number of binary columns is equal to the number of values of a variable.
Extra: Step 2: Numerical summaries
Location: After slide#73
84

from pandas.api.types import is_numeric_dtype

for col in data.columns:
    if is_numeric_dtype(data[col]):
        print('%s:' % (col))
        print('\t Mean = %.2f' % data[col].mean())
        print('\t Standard deviation = %.2f' % data[col].std())
        print('\t Minimum = %.2f' % data[col].min())
        print('\t Maximum = %.2f' % data[col].max())
Extra: Step 3: Visual Summaries
Location: After slide#78
85
86 Extra: Case Study 2: House Prices

Location: After slide#80


Extra: House prices in Melbourne
87

 Data description and analysis is available in google colab notebook


at:

 https://colab.research.google.com/drive/1FKJldbBKkBNELM_28y6l0
gRvUHZRLim8?usp=sharing
