5_Data Summaries and Visualization

The document provides an overview of data types, statistics, and visualization techniques relevant to AI and data science. It covers categorical and quantitative variables, methods for summarizing data such as mean, median, mode, and standard deviation, as well as graphical representations like pie charts and bar charts. The content is structured for a course at the American University of Sharjah, focusing on practical applications in data science.


Exploring Data with Graphs and Numerical

Summaries
Intro to AI and Data Science
NGN 112 – Fall 2024

Ammar Hasan
Department of Electrical Engineering
College of Engineering

American University of Sharjah

Prepared by Dr. Hussam Alshraideh, INE

Last Updated on: 15th of October 2024


Table of Contents
2

Types of Data

Data Statistics

Data Visualization

Data Preprocessing

Case study of Data Summarization for Data Science Applications


3

Types of Data
Variable
4

A variable is any characteristic that is recorded for the subjects in a study
 Examples: Marital status, Height, Weight, IQ

 A variable can be classified as either
 Categorical or
 Quantitative
◼ Discrete or
◼ Continuous

Categorical Variable
5

A variable is categorical if each observation belongs to one of a set of categories.
 Examples:

1. Gender
2. Religion
3. Type of residence (Apt, Villa, …)
4. Belief in Aliens (Yes or No)
Quantitative Variable
6

A variable is called quantitative if observations take numerical values for different magnitudes of the variable.

 Examples:
1. Age
2. Number of siblings
3. Annual Income
Quantitative vs. Categorical
7

 For Quantitative variables, key features are the center (a representative value) and spread (variability).
 Example: average exam grade is 77.8% and spread (min grade 57% and highest 96%)

 For Categorical variables, a key feature is the percentage of observations in each of the categories.
 Example: 45% male students and 55% female students


Discrete Quantitative Variable
8

 A quantitative variable
is discrete if its possible
values form a set of
separate numbers:
0,1,2,3,….
 Examples:
1. Number of pets in a
household
2. Number of children in a
family
3. Number of foreign
languages spoken by an
individual
Continuous Quantitative Variable
9

 A quantitative variable is
continuous if its possible values
form an interval
 Examples:
1. Height/Weight
2. Age
3. Blood pressure
4. Measurements

10

Data Statistics: Describe data using numerical summaries

Center of Quantitative Data

Spread of Quantitative Data

Frequency Table of categorical data


11

Center of Quantitative Data


Mean
12

 The mean is the sum of the observations divided by the number of observations
 It is the center of mass
Python: mean()
13

import numpy as np

X = np.array([210, 260, 125, 140])

np.mean(X) #or X.mean()


Median
14

The median is the midpoint of the observations when ordered from least to greatest.
1. Order observations
2. If the number of observations is:
a) Odd, the median is the middle observation
b) Even, the median is the average of the two middle observations

Example (ordered data): 78, 91, 94, 98, 99, 101, 103, 105, 114 → 9 observations (odd), median = 99
After adding a 10th value, 121: 78, 91, 94, 98, 99, 101, 103, 105, 114, 121 → 10 observations (even), median = (99 + 101)/2 = 100
Python: median()
15

import numpy as np

X = np.array([ 210, 260, 125, 140])

np.median(X)
Example: Data & Histograms (1/2)
16

Example: The scores of 30 students are as follows:

[85,92,78,88,95,90,88,72,68,98,84,91,88,75,92,89,79,83,87,94,86,88,76,81,90,92,70,85,89,93]

• To create a histogram for this data, you would first group the scores into bins or intervals (e.g.,
60-69, 70-79, 80-89, 90-99).

• Now, you count how many students scored within each of these ranges.

60-69: 1 student
70-79: 6 students
80-89: 13 students
90-99: 10 students
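The bin counts can be checked programmatically; a minimal sketch using np.histogram with the 30 scores listed above:

```python
import numpy as np

# The 30 student scores from the example above
scores = np.array([85, 92, 78, 88, 95, 90, 88, 72, 68, 98, 84, 91, 88, 75, 92,
                   89, 79, 83, 87, 94, 86, 88, 76, 81, 90, 92, 70, 85, 89, 93])

# Bin edges give the intervals 60-69, 70-79, 80-89, 90-99
# (np.histogram includes the right edge in the last bin only)
counts, edges = np.histogram(scores, bins=[60, 70, 80, 90, 100])
print(counts)
```

This is the same counting step done by hand above, delegated to NumPy.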
Example: Data & Histograms (2/2)
17

Source:
https://www.techtarget.com/searchsoftwarequality/definition/histogram

Three example histogram shapes: most student grades around average; most student grades low; most student grades high.
Comparing the Mean and Median
18

 Mean and median of a symmetric distribution are close


 Mean is often preferred because it uses all values in its calculations
 In a skewed distribution, the mean is farther out in the
skewed tail than the median
Median is preferred because it is a better
representative of a typical observation
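A tiny illustration of this effect, using made-up numbers with one value far out in the right tail:

```python
import numpy as np

# Hypothetical salaries (in thousands); 300 is a right-tail outlier
salaries = np.array([30, 32, 35, 38, 300])

mean = np.mean(salaries)      # pulled toward the tail
median = np.median(salaries)  # stays at a typical observation
print(mean, median)
```

The mean lands far above every typical value, while the median is unaffected by how extreme the outlier is.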
Mode
19

 Value that occurs most often (like what is the most frequent
major of students in NGN112?)
 Highest bar in the histogram
 Mode is most often used with categorical data
Python: st.mode()
20

#run this on colab

import numpy as np
from scipy import stats as st

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

st.mode(X)
21

Spread of Quantitative Data


Range
22

Range = max - min

Advantage: a simple description of the spread of the data
Disadvantage: the range is strongly affected by outliers.
Python: Range
23

import numpy as np

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

Range=np.max(X)-np.min(X) #or X.max()-X.min()


Range
Standard Deviation
24

 Each data value has an associated deviation from the mean, x - x̄
 A deviation is positive if the value falls above the mean and negative if it falls below the mean
 The sum of the deviations is always zero
Standard Deviation
25

Standard deviation gives a measure of variation by summarizing the deviations of each observation from the mean and calculating an adjusted average of these deviations:
1. Find the mean
2. Find each deviation
3. Square the deviations
4. Sum the squared deviations
5. Divide the sum by n-1 (for a sample) or n (for a population)
6. Take the square root
Example: Standard Deviation
26

Metabolic rates of 7 men (calories/24 hours)

Python: Standard deviation std()
27

import numpy as np

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

np.std(X)

# or X.std()

27
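Note that np.std divides by n by default (the population version of step 5); to divide by n-1 (the sample version) pass ddof=1. A quick sketch with the same data:

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

pop_std = np.std(X)           # population std: divides by n
samp_std = np.std(X, ddof=1)  # sample std: divides by n-1
print(pop_std, samp_std)
```

The sample version is always slightly larger, and the difference shrinks as n grows.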
Measures of Position: Percentiles
28

The kth percentile, denoted Pk, of a set of data is a value such that k percent of the observations are less than or equal to that value.
Again: Percentile
29

The pth percentile is a value such that p percent of the observations fall below or at that value.
Quartiles
30

Quartiles divide data sets into four equal parts:

• The 1st quartile, Q1, divides the bottom 25% of the data from the top 75%. Equivalent to the 25th percentile.
• The 2nd quartile divides the bottom 50% of the data from the top 50%. Equivalent to the 50th percentile, which is the median.
• The 3rd quartile divides the bottom 75% of the data from the top 25%. Equivalent to the 75th percentile.
Finding Quartiles
31

Splits the data into four parts


1. Arrange data in order
2. The median is the
second quartile, Q2
3. Q1 is the median of the
lower half of the
observations
4. Q3 is the median of the
upper half of the
observations
Measure of Spread: Quartiles
32

Quartiles divide a ranked data set into four equal parts. In the example shown:

Q1 = first quartile = 2.2
1. 25% of the data are at or below Q1 and 75% above

M = median = 3.4
2. 50% of the data are above the median and 50% are below

Q3 = third quartile = 4.35
3. 75% of the data are at or below Q3 and 25% above
Numeric Summarization of Data:
The 5 Number Summary
33

The five-number summary of a dataset consists of:
1. Minimum value
2. First Quartile
3. Median
4. Third Quartile
5. Maximum value
Python: Percentiles and Quartiles
34

import numpy as np

# random.normal produces random numbers with a normal (Gaussian)
# distribution: 170 is the mean, 10 the standard deviation, and
# 250 the number of generated samples.
x = np.random.normal(170, 10, 250)

np.min(x)
np.percentile(x, 25)
np.percentile(x, 50)
np.percentile(x, 75)
np.max(x)
The full code:

import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

Range = np.max(X) - np.min(X)
print('Range = ', Range)

std = np.std(X)
print('std = ', std)

n1 = np.min(X)
n2 = np.percentile(X, 25)
n3 = np.percentile(X, 50)
n4 = np.percentile(X, 75)
n5 = np.max(X)

print('Five number summary: ', n1, ' ', n2, ' ', n3, ' ', n4, ' ', n5)

print('------------------------')

Output:
Range =  135
std =  45.753244420477984
Five number summary:  125   192.5   210.0   222.5   260
------------------------
36

Frequency Table of Categorical Data


Proportion & Percentage (Rel. Freq.)
37

Proportions and percentages are also called relative frequencies.
Frequency Table
38

A frequency table is a
listing of possible values
for a variable, together
with the number of
observations or relative
frequencies for each
value.
Python: Frequency Tables
39

import pandas as pd

#df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple', 'apple', 'banana', 'banana', 'orange', 'banana', 'apple'], columns=['Fruit']) #columns: the headers of the columns
#or
data = {'Fruit': ['apple', 'apple', 'banana', 'orange', 'apple', 'apple', 'banana', 'banana', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)
print(df)
print()

absolute_frequencies = df['Fruit'].value_counts()
print(absolute_frequencies) # which is a Series

print()
relative_frequencies = df['Fruit'].value_counts(normalize=True)
#normalize means divide by the total, which is len(df) or 11
print(relative_frequencies)
The output:

    Fruit
0   apple
1   apple
2   banana
3   orange
4   apple
5   apple
6   banana
7   banana
8   orange
9   banana
10  apple

apple     5
banana    4
orange    2
Name: Fruit, dtype: int64

apple     0.454545
banana    0.363636
orange    0.181818
Name: Fruit, dtype: float64

40
41

Data Visualization: Describe Data using graphical summaries

Pie Charts

Bar Charts

Histograms

Box Plots
42

Pie Charts
Pie Charts
43

 Summarize categorical
variable
 Drawn as circle where each
category is a slice
 The size of each slice is
proportional to the
percentage in that category
Python: Pie Chart
44

import pandas as pd
df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
                          'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])

absolute_frequencies = df['Fruit'].value_counts()

df2 = pd.DataFrame({'Fruit': absolute_frequencies},
                   index = ['apple', 'banana', 'orange'])
df2
df2.plot.pie(y='Fruit', figsize=(5,5), autopct='%1.1f%%')
45

Bar Charts
Bar Charts
46

 Summarizes a categorical variable
 Vertical bars for each category
 Height of each bar represents either counts or percentages
 Easier to compare categories with a bar graph than with a pie chart
 Called Pareto Charts when ordered from tallest to shortest


Python: Bar chart
47

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange',
                          'apple', 'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])
sns.set(style='darkgrid') #optional
plt.figure(figsize=(5,5)) #width and height in inches
sns.countplot(x='Fruit', data=df, hue=df['Fruit'])
#or ax=sns.countplot(x='Fruit', data=df)
Python: Pie and Bar Chart Exercise

In Google Colab, plot a pie chart and a bar chart for the following data, which is the list of Majors of all the students in this section of NGN112

df = pd.DataFrame(data = ['CS', 'CoE', 'CS', 'CoE', 'ME', 'INE',
                          'ME', 'ChE', 'CvE', 'CS', 'CoE', 'CS', 'CoE', 'ELE', 'INE', 'ME',
                          'ChE', 'CvE', 'CS', 'CoE', 'CS', 'CS', 'ELE', 'INE', 'ME', 'ChE',
                          'ELE', 'CS', 'CoE', 'CS', 'CS', 'ELE', 'ME', 'ME', 'CS', 'CoE'],
                  columns=['Major'])
49

Histograms
Histograms
50

A graph that uses bars to show frequencies (counts) or relative frequencies of possible outcomes for a quantitative variable
Python: Histogram
51

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

ax = sns.histplot(data = mydata)

# Set labels and title


ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
Python: Histogram (with figure size)
52

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

sns.set(style="darkgrid") #optional
plt.figure(figsize=(10,8))

ax = sns.histplot(data = mydata)
# Set labels and title
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
# Show the plot
plt.show() #optional in Colab

Summary of steps:
1. Create a figure: plt.figure…
2. Create a histogram: sns.histplot…
3. Show the plot: plt.show…
Interpreting Histograms
53

 Assess where a distribution is centered by finding the median
 Assess the spread of a distribution
 Shape of a distribution: roughly symmetric (left and right sides are mirror images), skewed to the right, or skewed to the left
Examples of Skewness
54
Outlier
55

An outlier falls far from the rest of the data


56

Boxplots
Boxplot
57

1. Box goes from the Q1 to Q3


2. Line is drawn inside the box at
the median
3. Line goes from lower end of
box (Q1) to smallest
observation not a potential
outlier
4. Line goes from upper end of
box (Q3) to largest
observation not a potential
outlier
5. Potential outliers are shown
separately, often with * or +
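The slides do not define "potential outlier" numerically; a common convention (the one used by matplotlib and seaborn boxplots) flags points beyond 1.5×IQR from the quartiles. A sketch under that assumption, using the student-age data from the exercise below with 16.0 appended:

```python
import numpy as np

# Student ages, with 16.0 appended as in the exercise
data = np.array([18.5, 19.2, 19.8, 19.0, 18.4, 18.1, 18.6, 19.3,
                 20.4, 19.1, 18.5, 18.2, 18.3, 18.9, 19.7, 18.7,
                 18.1, 17.8, 16.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)
```

Points outside the fences are the ones a boxplot would draw separately.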
Comparing Distributions
58

Boxplots do not display the shape of the distribution as clearly as histograms, but they are useful for making graphical comparisons of two or more datasets (or distributions)
Python: Boxplot
59

import numpy as np
import seaborn as sns

#random normal data(mean, std, size)


mydata = np.random.normal(170, 10, 250)

ax = sns.boxplot(data=mydata)
Python: Boxplot (with figure options)
60

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata = np.random.normal(170, 10, 250)

#optional
sns.set(style="darkgrid")
plt.figure(figsize=(4,5))

ax = sns.boxplot(y=mydata, orient="v")

plt.show() #optional in Colab


Python: Histogram and Box Plot Exercise

In Google Colab, plot a histogram and box plot of the following data, which is the age of students in this section

mydata = np.array([18.5, 19.2, 19.8, 19.0, 18.4,
                   18.1, 18.6, 19.3, 20.4, 19.1, 18.5, 18.2, 18.3,
                   18.9, 19.7, 18.7, 18.1, 17.8])

What is the shape of the distribution? From the box plot, find the values of max, min, median, Q1, Q2, Q3, and any outliers

Repeat after appending 16.0 to the data

62

Data Preprocessing

Z-Score Normalization of quantitative data

Min-Max Normalization of quantitative data

Discretization of quantitative data

Encoding of categorical data


Data Normalization
63

 In machine learning, introduced in the next chapter, the data


needs to be normalized prior to training the machine learning
model.
 Normalization means that all data variables will have the
same range, for example, [0 to 1] or [-1 to 1] or [-3.4 to 3.4]
 This is needed as different variables have different ranges.

 For example, to predict a GPA, we need to know 3 variables:


 the number of hours a student studies,
 their IQ,
 and their attendance record.
 All these variables have different ranges of values; normalization
guarantees that they will have the same range, e.g., [0,1]
 We will look at 2 normalizations: Z-scores and min-max.
64

Z-Score Normalization
of quantitative data
Data Normalization: Z-Scores
65

An observation from a bell-shaped distribution is a potential outlier if its z-score < -3 or z-score > +3
• Suppose that the average and standard deviation values for the attribute income are $55,000 and $10,000, respectively.
• An income of 60,000 would have a z-score of (60,000 - 55,000)/10,000 = 0.5
• We say that 60,000 is above the average by 0.5 standard deviations
65
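The income example above can be reproduced in a couple of lines:

```python
# z-score of a single observation: (x - mean) / std
mean, std = 55_000, 10_000
x = 60_000

z = (x - mean) / std
print(z)
```

A positive z means the value lies above the mean; the magnitude says by how many standard deviations.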
Data Normalization: Z-Score
66

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])

#The fit method computes the mean and standard deviation of each
#feature/column in X_train, which will be used for scaling later
scaler = preprocessing.StandardScaler().fit(X_train)
print(X_train)
print('means per column: ', scaler.mean_)
print('variances per column: ', scaler.var_)

#The transform method applies the scaling transformation to X_train,
#standardizing each feature/column by subtracting the mean and
#dividing by the standard deviation
X_scaled = scaler.transform(X_train) #computes the z-scores

X_scaled #Output
66
Data Normalization: Z-Score

Exercise
Repeat for data of hours of study, IQ, attendance percentage

X_train = np.array([
    [ 6, 150, 95],
    [ 3, 120, 89],
    [ 2, 130, 98],
    [ 4, 143, 87]])
68

Min-Max Normalization
of quantitative data
Data Normalization: min-max
69

Scaled data falls in the [0, 1] range.

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively.
An income of 60,000 would have a scaled value of (60,000 - 12,000)/(98,000 - 12,000) = 0.558
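The income example can again be checked in a couple of lines:

```python
# min-max scaling of a single observation: (x - min) / (max - min)
x_min, x_max = 12_000, 98_000
x = 60_000

scaled = (x - x_min) / (x_max - x_min)
print(round(scaled, 3))
```

By construction the minimum maps to 0, the maximum to 1, and everything else falls in between.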
Data Normalization: min-max
70

from sklearn import preprocessing


import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()

X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax #Output

• Initializes a MinMaxScaler object named min_max_scaler.


• Applies the min-max normalization to the training data.
• The fit_transform method computes the minimum and maximum values of each
feature/column in X_train and scales the features within the range [0, 1].
• Each feature/column is transformed individually.

70
Data Normalization: min-max

Exercise
Repeat for data of hours of study, IQ, attendance percentage

X_train = np.array([
    [ 6, 150, 95],
    [ 3, 120, 89],
    [ 2, 130, 98],
    [ 4, 143, 87]])
72

Discretization or Quantization
of quantitative data
Preprocessing of data:
Discretization or quantization
73

• Discretization (otherwise known as quantization or binning) provides a way to partition a continuous variable/feature into discrete values.
• Certain datasets with continuous features may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes.

Example: The GPA is a continuous variable; it can have values between 0 and 4.

We can discretize it into 4 bins as follows:
0 <= GPA <= 1.0 replace with 1
1 < GPA <= 2.0 replace with 2
2 < GPA <= 3.0 replace with 3
3 < GPA <= 4.0 replace with 4
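The GPA binning above can be sketched with pandas; pd.cut is one option (sklearn's KBinsDiscretizer, shown on the next slide, is another). The GPA values here are made up for illustration:

```python
import pandas as pd

# Hypothetical GPA values to discretize
gpa = pd.Series([0.5, 1.5, 2.7, 3.9, 2.0])

# Bin edges match the table: [0,1], (1,2], (2,3], (3,4]
# include_lowest=True makes the first interval closed at 0
bins = pd.cut(gpa, bins=[0, 1, 2, 3, 4], labels=[1, 2, 3, 4],
              include_lowest=True)
print(bins.tolist())
```

Each GPA is replaced by the label of the bin it falls in, exactly as in the table.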
Preprocessing of data: Discretization
(TS, example: 3 variables: GPA, IQ and attendance score)
74

from sklearn import preprocessing


import numpy as np
X = np.array([[ 3.5, 132, 100 ], [ 2.9, 119, 95 ], [ 1.9, 99, 65 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[4, 3, 2], encode='ordinal').fit(X)
est.transform(X)

• Each row represents a sample (or feature vector), and each column represents a variable/feature.
• Next, you create an instance of KBinsDiscretizer with n_bins=[4, 3, 2]. This means
that you want to divide the first variable/feature into 4 bins (one of 4 options), the second feature into
3 bins (one of 3 options), and the third feature into 2 bins(one of 2 options).
• The encode='ordinal' parameter indicates that you want to encode the bins with ordinal
integers. An ordinal number is a number that indicates the position like 1st, 2nd,…or in zero indexing 0,
1,…
• Then, you fit the KBinsDiscretizer object ‘est’ to the data X using the fit method.
• Finally, you transform the data X using the transform method of ‘est’, which discretizes the values in
X into the specified number of bins. The result is a transformed array with the same shape as X.
75

Encoding
of categorical data
Preprocessing of data:
Encoding categorical features
76

• Often, features are not given as continuous values but as categorical ones.
These need to be converted into numbers prior to using machine learning

• For example, a person could have features:


• ["male", "female"],
• ["from Europe", "from US", "from Asia"],
• ["uses Firefox", "uses Chrome", "uses Safari", "uses Edge"].

• Such features can be efficiently coded as integers, for instance:
• ["male", "from US", "uses Edge"] could be expressed as [1, 2, 1]
• while ["female", "from Asia", "uses Chrome"] could be expressed as [0, 0, 0].

• Types of Encoders:
• Ordinal Encoders {1st,2nd,3rd,…} or {0, 1, 2,…} for multi-dimensional data as in
the example above
• Label Encoders (similar to ordinal encoders but for 1-D row arrays).
• One Hot Encoding: Binary Encoding 0 or 1
Preprocessing of data: Ordinal Encoding
An ordinal number is a number that indicates the position like 1st, 2nd,…
77

from sklearn import preprocessing
import numpy as np

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OrdinalEncoder()
enc.fit(X)
#notice the input to transform is a 2D array, hence the [[..]]
rst = enc.transform([['female', 'from US', 'uses Safari']])
print(rst)

The encoding is done by sorting: for every feature (column) of the data, categories are sorted (capital letters first, then alphabetically) and assigned integer codes in that order.
78

Case study of Data Summarization for Data


Science Applications
79

Case Study 1: Iris Dataset


Iris Flowers Dataset
80

 In this case study, we will use the Iris sample data, which contains information on 150
Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica.
Each flower is characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica); these are the labels

Data is available online at: https://archive.ics.uci.edu/dataset/53/iris


https://www.youtube.com/watch?v=pTjsr_0YWas
Iris Flowers Dataset
81
Step 1: Reading the data
82

import pandas as pd
#data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None) #or
#In Colab, upload the iris.data file and locate it under the "Sample Data" folder. Then right click on the uploaded "iris.data" file and copy its path
data = pd.read_csv('sample_data/iris.csv', header=None)

#ts: add column headers
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

#display the first few rows
data.head()
Step 2: Numerical summaries
83

data['class'].value_counts()

data.info()
Step 2: Numerical summaries
84

data.describe(include='all')
Step 2: Numerical summaries
85

data.describe() # excludes NaN

sepal length sepal width petal length petal width


count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Step 3: Visual Summaries
86

import matplotlib.pyplot as plt
import seaborn as sns

plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
#plt.ylim(0,30)

data['sepal length'].hist(bins=8)

plt.show()

#Or
#sns.histplot(data['sepal length'])
Step 3: Visual Summaries
87

plt.ylabel('Value (cm)')
plt.xlabel('Attribute')
plt.title('Data Boxplot')

data.boxplot()

#or
sns.boxplot(data)
Step 3: Visual Summaries
88
# Select two columns for the scatter plot
# Create a scatter plot of the selected columns
sns.scatterplot(data=data[0:150], x='sepal length', y='sepal width',
                hue='class') # hue: the diff in colors is based on the class labels

plt.title('Iris Flowers')
plt.show()

TS: In scatter plots:

 Each data point is represented by a dot on the graph, allowing you to see how one variable behaves in relation to another.
 Scatterplots are helpful for identifying relationships between the two variables being compared (sepal length vs. sepal width in this example)
Step 3: Visual Summaries (continued)
89

sns.pairplot(data, hue='class')
plt.show()

 The pairplot is a scatter plot for every combination of the variables
 Here we have 4 variables, hence 4x4 = 16 plots
 The diagonal is a plot of a variable with itself, hence it shows the distribution of that variable for each class (we have 3 classes here)
Step 3: Visual Summaries (continued)
90

sns.pairplot(data, hue='class', diag_kind='hist')

plt.show()
Step 3: Visual Summaries (continued)
91

Another interesting case of data visualization is to use a heatmap to visualize the correlation matrix of the dataset (1 strong +ve linear correlation, 0 no correlation, -1 strong -ve linear correlation)

data_numerical_columns = data.select_dtypes(include=['number'])

sns.heatmap(data_numerical_columns.corr(), annot=True)
plt.show()

This type of visualization helps to identify which variables are positively correlated (tend to change together in the same direction), negatively correlated (tend to change in opposite directions), or have no significant correlation.
Filter a DataFrame
92

 You can filter DataFrames to obtain a subset of the data prior to plotting if needed.
 For example, assume that you want to filter the Iris dataset for flowers with a class type of setosa.
 You can write one of the following:

data = pd.read_csv('sample_data/iris.csv', header=None)


data.columns=['sepal_length','sepal_width','petal_length','petal_width', 'Class']

filtered = data[(data.Class == "Iris-setosa")]


filtered.head()
#OR
filtered = data.query('Class == "Iris-setosa"')
filtered.head()
93

Case Study 2: House Prices Dataset


House prices in Melbourne
94

 Data description and analysis are available in a Google Colab notebook at:
 https://colab.research.google.com/drive/1FKJldbBKkBNELM_28y6l0gRvUHZRLim8?usp=sharing
Learning Outcomes
95

Upon completion of the course, students will be able to:


1. Identify the importance of AI and Data Science for society
2. Perform data loading, preprocessing, summarization and
visualization
3. Apply machine learning methods to solve basic regression
and classification problems
4. Apply artificial neural networks to solve simple engineering
problems
5. Implement basic data science and machine learning tasks
using programming tools
Extra slide Not included in curriculum
Preprocessing of data: One-Hot-Encoding
96
 Machine learning algorithms typically work with numerical data, so you need to convert categorical
values into numbers. One-hot encoding does this by creating binary (0 or 1) columns for each category.
 Example: consider a dataset with a column called "Color" with categorical values like "Red," "Green," and
"Blue."
 Original Categorical Data values: [Red,Green,Blue] can be represented in one column
Color # name of variable or column
Red # color of first observation
Red # color of second observation
Blue # color of third observation

 One-hot encoding would convert this into three binary columns, one for each color. With the columns ordered alphabetically (Blue, Green, Red), as sklearn does:
One-Hot Encoded Data: Blue Green Red # 3 columns = the number of possible values
Red = [0, 0, 1]
Green = [0, 1, 0]
Blue = [1, 0, 0]
 Note that the variable color had 3 values (Red, Green, Blue), hence 3 columns are needed to represent this one variable
 In other words, the categorical variable color is one column and its One-Hot encoding is 3 columns
Extra slide Not included in curriculum
Preprocessing of data: One-Hot-Encoding
97

from sklearn import preprocessing
import numpy as np

#2 feature vectors (rows) and 3 variables (columns)
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OneHotEncoder()
enc.fit(X)
rst = enc.transform([['female', 'from US', 'uses Safari'], ['male', 'from Europe',
                     'uses Firefox']])
print(rst.toarray())

Output (columns per feature, sorted alphabetically: female, male | from Europe, from US | uses Firefox, uses Safari):
[[1. 0. 0. 1. 0. 1.]
 [0. 1. 1. 0. 1. 0.]]

Note: In the output, the number of binary columns is equal to the number of values of a variable.
