5_Data Summaries and Visualization

The document provides an overview of data types, statistics, and visualization techniques relevant to AI and data science. It covers categorical and quantitative variables, methods for summarizing data such as mean, median, mode, and standard deviation, as well as graphical representations like pie charts and bar charts. The content is structured for a course at the American University of Sharjah, focusing on practical applications in data science.


Exploring Data with Graphs and Numerical

Summaries
Intro to AI and Data Science
NGN 112 – Fall 2024

Ammar Hasan
Department of Electrical Engineering
College of Engineering

American University of Sharjah

Prepared by Dr. Hussam Alshraideh, INE

Last Updated on: 15th of October 2024


Table of Contents
2

Types of Data

Data Statistics

Data Visualization

Data Preprocessing

Case study of Data Summarization for Data Science Applications


3

Types of Data
Variable
4

A variable is any characteristic that is recorded for the subjects in a study
 Examples: Marital status, Height, Weight, IQ

 A variable can be classified as either
 Categorical or
 Quantitative
◼ Discrete or
◼ Continuous

Categorical Variable
5

A variable is categorical if each observation belongs to one of a set of categories.
 Examples:

1. Gender
2. Religion
3. Type of residence (Apt, Villa, …)
4. Belief in Aliens (Yes or No)
Quantitative Variable
6

A variable is called quantitative if observations take numerical values for different magnitudes of the variable.

 Examples:
1. Age
2. Number of siblings
3. Annual Income
Quantitative vs. Categorical
7

 For Quantitative variables, key features are the center (a representative value) and spread (variability).
 Example: average exam grade is 77.8% and spread (min grade 57% and highest 96%)

 For Categorical variables, a key feature is the percentage of observations in each of the categories.
 Example: 45% male students and 55% female students


Discrete Quantitative Variable
8

 A quantitative variable
is discrete if its possible
values form a set of
separate numbers:
0,1,2,3,….
 Examples:
1. Number of pets in a
household
2. Number of children in a
family
3. Number of foreign
languages spoken by an
individual
Continuous Quantitative Variable
9

 A quantitative variable is
continuous if its possible values
form an interval
 Examples:
1. Height/Weight
2. Age
3. Blood pressure
4. Measurements

10

Data Statistics: Describe data using numerical summaries

Center of Quantitative Data

Spread of Quantitative Data

Frequency Table of categorical data


11

Center of Quantitative Data


Mean
12

 The mean is the sum of the observations divided by the number of observations
 It is the center of mass
Python: mean()
13

import numpy as np

X = np.array([210, 260, 125, 140])

np.mean(X) #or X.mean()


Median
14

The median is the midpoint of the observations when ordered from least to greatest.
1. Order observations
2. If the number of observations is:
a) Odd, the median is the middle observation
b) Even, the median is the average of the two middle observations

Example (ordered data): 78, 91, 94, 98, 99, 101, 103, 105, 114 → 9 observations (odd), median = 99
After adding a 10th value, 121: 78, 91, 94, 98, 99, 101, 103, 105, 114, 121 → 10 observations (even), median = (99 + 101)/2 = 100
Python: median()
15

import numpy as np

X = np.array([ 210, 260, 125, 140])

np.median(X)
Example: Data & Histograms (1/2)
16

Example: The scores of 30 students are as follows:

[85,92,78,88,95,90,88,72,68,98,84,91,88,75,92,89,79,83,87,94,86,88,76,81,90,92,70,85,89,93]

• To create a histogram for this data, you would first group the scores into bins or intervals (e.g.,
60-69, 70-79, 80-89, 90-99).

• Now, you count how many students scored within each of these ranges.

60-69: 1 student
70-79: 6 students
80-89: 13 students
90-99: 10 students
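The bin counts can be checked programmatically; a minimal sketch using np.histogram with the 30 scores listed above:

```python
import numpy as np

# The 30 student scores from the example above
scores = np.array([85, 92, 78, 88, 95, 90, 88, 72, 68, 98, 84, 91, 88, 75, 92,
                   89, 79, 83, 87, 94, 86, 88, 76, 81, 90, 92, 70, 85, 89, 93])

# Bin edges give the intervals 60-69, 70-79, 80-89, 90-99
# (np.histogram includes the right edge in the last bin only)
counts, edges = np.histogram(scores, bins=[60, 70, 80, 90, 100])
print(counts)
```

This is the same counting step done by hand above, delegated to NumPy.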
Example: Data & Histograms (2/2)
17

Source:
https://www.techtarget.com/searchsoftwarequality/definition/histogram

Three example histogram shapes: most student grades around average; most student grades low; most student grades high.
Comparing the Mean and Median
18

 Mean and median of a symmetric distribution are close


 Mean is often preferred because it uses all values in its calculations
 In a skewed distribution, the mean is farther out in the
skewed tail than the median
Median is preferred because it is a better
representative of a typical observation
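A tiny illustration of this effect, using made-up numbers with one value far out in the right tail:

```python
import numpy as np

# Hypothetical salaries (in thousands); 300 is a right-tail outlier
salaries = np.array([30, 32, 35, 38, 300])

mean = np.mean(salaries)      # pulled toward the tail
median = np.median(salaries)  # stays at a typical observation
print(mean, median)
```

The mean lands far above every typical value, while the median is unaffected by how extreme the outlier is.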
Mode
19

 Value that occurs most often (like what is the most frequent
major of students in NGN112?)
 Highest bar in the histogram
 Mode is most often used with categorical data
Python: st.mode()
20

#run this on colab

import numpy as np
from scipy import stats as st

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

st.mode(X)
21

Spread of Quantitative Data


Range
22

Range = max - min

Advantage: a simple description of the spread of the data
Disadvantage: the range is strongly affected by outliers.
Python: Range
23

import numpy as np

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

Range=np.max(X)-np.min(X) #or X.max()-X.min()


Range
Standard Deviation
24

 Each data value has an associated deviation from the mean, x - x̄
 A deviation is positive if the value falls above the mean and negative if it falls below the mean
 The sum of the deviations is always zero
Standard Deviation
25

Standard deviation gives a measure of variation by summarizing the deviations of each observation from the mean and calculating an adjusted average of these deviations:
1. Find the mean
2. Find each deviation
3. Square the deviations
4. Sum the squared deviations
5. Divide the sum by n-1 (for a sample) or n (for a population)
6. Take the square root
Example: Standard Deviation
26

Metabolic rates of 7 men (calories/24 hours)

Python: Standard deviation std()
27

import numpy as np

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

np.std(X)

# or X.std()

27
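Note that np.std divides by n by default (the population version of step 5); to divide by n-1 (the sample version) pass ddof=1. A quick sketch with the same data:

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

pop_std = np.std(X)           # population std: divides by n
samp_std = np.std(X, ddof=1)  # sample std: divides by n-1
print(pop_std, samp_std)
```

The sample version is always slightly larger, and the difference shrinks as n grows.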
Measures of Position: Percentiles
28

The kth percentile, denoted Pk, of a set of data is a value such that k percent of the observations are less than or equal to that value.
Again: Percentile
29

The pth percentile is a value such that p percent of the observations fall below or at that value.
Quartiles
30

Quartiles divide data sets into four equal parts:

• The 1st quartile, Q1, divides the bottom 25% of the data from the top 75%. Equivalent to the 25th percentile.
• The 2nd quartile divides the bottom 50% of the data from the top 50%. Equivalent to the 50th percentile, which is the median.
• The 3rd quartile divides the bottom 75% of the data from the top 25%. Equivalent to the 75th percentile.
Finding Quartiles
31

Splits the data into four parts


1. Arrange data in order
2. The median is the
second quartile, Q2
3. Q1 is the median of the
lower half of the
observations
4. Q3 is the median of the
upper half of the
observations
Measure of Spread: Quartiles
32

Quartiles divide a ranked data set into four equal parts. In the example shown:

Q1 = first quartile = 2.2
1. 25% of the data are at or below Q1 and 75% above

M = median = 3.4
2. 50% of the data are above the median and 50% are below

Q3 = third quartile = 4.35
3. 75% of the data are at or below Q3 and 25% above
Numeric Summarization of Data:
The 5 Number Summary
33

The five-number summary of a dataset consists of:
1. Minimum value
2. First Quartile
3. Median
4. Third Quartile
5. Maximum value
Python: Percentiles and Quartiles
34

import numpy as np

# random.normal produces random numbers with a normal (Gaussian)
# distribution: 170 is the mean, 10 the standard deviation, and
# 250 the number of generated samples.
x = np.random.normal(170, 10, 250)

np.min(x)
np.percentile(x, 25)
np.percentile(x, 50)
np.percentile(x, 75)
np.max(x)
The full code:

import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

Range = np.max(X) - np.min(X)
print('Range = ', Range)

std = np.std(X)
print('std = ', std)

n1 = np.min(X)
n2 = np.percentile(X, 25)
n3 = np.percentile(X, 50)
n4 = np.percentile(X, 75)
n5 = np.max(X)

print('Five number summary: ', n1, ' ', n2, ' ', n3, ' ', n4, ' ', n5)

print('------------------------')

Output:
Range =  135
std =  45.753244420477984
Five number summary:  125   192.5   210.0   222.5   260
------------------------
36

Frequency Table of Categorical Data


Proportion & Percentage (Rel. Freq.)
37

Proportions and percentages are also called relative frequencies.
Frequency Table
38

A frequency table is a
listing of possible values
for a variable, together
with the number of
observations or relative
frequencies for each
value.
Python: Frequency Tables
39

import pandas as pd

#df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple', 'apple', 'banana', 'banana', 'orange', 'banana', 'apple'], columns=['Fruit']) #columns: the headers of the columns
#or
data = {'Fruit': ['apple', 'apple', 'banana', 'orange', 'apple', 'apple', 'banana', 'banana', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)
print(df)
print()

absolute_frequencies = df['Fruit'].value_counts()
print(absolute_frequencies) # which is a Series

print()
relative_frequencies = df['Fruit'].value_counts(normalize=True)
#normalize means divide by the total, which is len(df) or 11
print(relative_frequencies)
The output:

    Fruit
0   apple
1   apple
2   banana
3   orange
4   apple
5   apple
6   banana
7   banana
8   orange
9   banana
10  apple

apple     5
banana    4
orange    2
Name: Fruit, dtype: int64

apple     0.454545
banana    0.363636
orange    0.181818
Name: Fruit, dtype: float64

40
41

Data Visualization: Describe Data using graphical summaries

Pie Charts

Bar Charts

Histograms

Box Plots
42

Pie Charts
Pie Charts
43

 Summarize categorical
variable
 Drawn as circle where each
category is a slice
 The size of each slice is
proportional to the
percentage in that category
Python: Pie Chart
44

import pandas as pd
df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
                          'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])

absolute_frequencies = df['Fruit'].value_counts()

df2 = pd.DataFrame({'Fruit': absolute_frequencies},
                   index = ['apple', 'banana', 'orange'])
df2
df2.plot.pie(y='Fruit', figsize=(5,5), autopct='%1.1f%%')
45

Bar Charts
Bar Charts
46

 Summarizes a categorical variable
 Vertical bars for each category
 Height of each bar represents either counts or percentages
 Easier to compare categories with a bar graph than with a pie chart
 Called Pareto Charts when ordered from tallest to shortest


Python: Bar chart
47

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange',
                          'apple', 'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])
sns.set(style='darkgrid') #optional
plt.figure(figsize=(5,5)) #width and height in inches
sns.countplot(x='Fruit', data=df, hue=df['Fruit'])
#or ax=sns.countplot(x='Fruit', data=df)
Python: Pie and Bar Chart Exercise

In Google Colab, plot a pie chart and a bar chart for the following data, which is the list of Majors of all the students in this section of NGN112

df = pd.DataFrame(data = ['CS', 'CoE', 'CS', 'CoE', 'ME', 'INE',
                          'ME', 'ChE', 'CvE', 'CS', 'CoE', 'CS', 'CoE', 'ELE', 'INE', 'ME',
                          'ChE', 'CvE', 'CS', 'CoE', 'CS', 'CS', 'ELE', 'INE', 'ME', 'ChE',
                          'ELE', 'CS', 'CoE', 'CS', 'CS', 'ELE', 'ME', 'ME', 'CS', 'CoE'],
                  columns=['Major'])
49

Histograms
Histograms
50

A graph that uses bars to show frequencies (counts) or relative frequencies of possible outcomes for a quantitative variable
Python: Histogram
51

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

ax = sns.histplot(data = mydata)

# Set labels and title


ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
Python: Histogram (with figure size)
52

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

sns.set(style="darkgrid") #optional
plt.figure(figsize=(10,8))

ax = sns.histplot(data = mydata)
# Set labels and title
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
# Show the plot
plt.show() #optional in Colab

Summary of steps:
1. Create a figure: plt.figure…
2. Create a histogram: sns.histplot…
3. Show the plot: plt.show…
Interpreting Histograms
53

 Assess where a distribution is centered by finding the median
 Assess the spread of a distribution
 Shape of a distribution: roughly symmetric (left and right sides are mirror images), skewed to the right, or skewed to the left
Examples of Skewness
54
Outlier
55

An outlier falls far from the rest of the data


56

Boxplots
Boxplot
57

1. Box goes from the Q1 to Q3


2. Line is drawn inside the box at
the median
3. Line goes from lower end of
box (Q1) to smallest
observation not a potential
outlier
4. Line goes from upper end of
box (Q3) to largest
observation not a potential
outlier
5. Potential outliers are shown
separately, often with * or +
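The slides do not define "potential outlier" numerically; a common convention (the one used by matplotlib and seaborn boxplots) flags points beyond 1.5×IQR from the quartiles. A sketch under that assumption, using the student-age data from the exercise below with 16.0 appended:

```python
import numpy as np

# Student ages, with 16.0 appended as in the exercise
data = np.array([18.5, 19.2, 19.8, 19.0, 18.4, 18.1, 18.6, 19.3,
                 20.4, 19.1, 18.5, 18.2, 18.3, 18.9, 19.7, 18.7,
                 18.1, 17.8, 16.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)
```

Points outside the fences are the ones a boxplot would draw separately.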
Comparing Distributions
58

Boxplots do not display the shape of the distribution as clearly as histograms, but they are useful for making graphical comparisons of two or more datasets (or distributions)
Python: Boxplot
59

import numpy as np
import seaborn as sns

#random normal data(mean, std, size)


mydata = np.random.normal(170, 10, 250)

ax = sns.boxplot(data=mydata)
Python: Boxplot (with figure options)
60

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata = np.random.normal(170, 10, 250)

#optional
sns.set(style="darkgrid")
plt.figure(figsize=(4,5))

ax = sns.boxplot(y=mydata, orient="v")

plt.show() #optional in Colab


Python: Histogram and Box Plot Exercise

In Google Colab, plot a histogram and box plot of the following data, which is the age of students in this section

mydata = np.array([18.5, 19.2, 19.8, 19.0, 18.4,
                   18.1, 18.6, 19.3, 20.4, 19.1, 18.5, 18.2, 18.3,
                   18.9, 19.7, 18.7, 18.1, 17.8])

What is the shape of the distribution? From the box plot, find the values of max, min, median, Q1, Q2, Q3, and any outliers

Repeat after appending 16.0 to the data

62

Data Preprocessing

Z-Score Normalization of quantitative data

Min-Max Normalization of quantitative data

Discretization of quantitative data

Encoding of categorical data


Data Normalization
63

 In machine learning, introduced in the next chapter, the data


needs to be normalized prior to training the machine learning
model.
 Normalization means that all data variables will have the
same range, for example, [0 to 1] or [-1 to 1] or [-3.4 to 3.4]
 This is needed as different variables have different ranges.

 For example, to predict a GPA, we need to know 3 variables:


 the number of hours a student studies,
 their IQ,
 and their attendance record.
 All these variables have different ranges of values; normalization
guarantees that they will have the same range, e.g., [0,1]
 We will look at 2 normalizations: Z-scores and min-max.
64

Z-Score Normalization
of quantitative data
Data Normalization: Z-Scores
65

An observation from a bell-shaped distribution is a potential outlier if its z-score < -3 or z-score > +3
• Suppose that the average and standard deviation values for the attribute income are $55,000 and $10,000, respectively.
• An income of 60,000 would have a z-score of (60,000 - 55,000)/10,000 = 0.5
• We say that 60,000 is above the average by 0.5 standard deviations
65
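The income example above can be reproduced in a couple of lines:

```python
# z-score of a single observation: (x - mean) / std
mean, std = 55_000, 10_000
x = 60_000

z = (x - mean) / std
print(z)
```

A positive z means the value lies above the mean; the magnitude says by how many standard deviations.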
Data Normalization: Z-Score
66

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])

#The fit method computes the mean and standard deviation of each
#feature/column in X_train, which will be used for scaling later
scaler = preprocessing.StandardScaler().fit(X_train)
print(X_train)
print('means per column: ', scaler.mean_)
print('variances per column: ', scaler.var_)

#The transform method applies the scaling transformation to X_train,
#standardizing each feature/column by subtracting the mean and
#dividing by the standard deviation
X_scaled = scaler.transform(X_train) #computes the z-scores

X_scaled #Output
66
Data Normalization: Z-Score

Exercise
Repeat for data of hours of study, IQ, attendance percentage

X_train = np.array([
    [ 6, 150, 95],
    [ 3, 120, 89],
    [ 2, 130, 98],
    [ 4, 143, 87]])
68

Min-Max Normalization
of quantitative data
Data Normalization: min-max
69

Scaled data falls in the [0, 1] range.

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively.
An income of 60,000 would have a scaled value of (60,000 - 12,000)/(98,000 - 12,000) = 0.558
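The income example can again be checked in a couple of lines:

```python
# min-max scaling of a single observation: (x - min) / (max - min)
x_min, x_max = 12_000, 98_000
x = 60_000

scaled = (x - x_min) / (x_max - x_min)
print(round(scaled, 3))
```

By construction the minimum maps to 0, the maximum to 1, and everything else falls in between.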
Data Normalization: min-max
70

from sklearn import preprocessing


import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()

X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax #Output

• Initializes a MinMaxScaler object named min_max_scaler.


• Applies the min-max normalization to the training data.
• The fit_transform method computes the minimum and maximum values of each
feature/column in X_train and scales the features within the range [0, 1].
• Each feature/column is transformed individually.

70
Data Normalization: min-max

Exercise
Repeat for data of hours of study, IQ, attendance percentage

X_train = np.array([
    [ 6, 150, 95],
    [ 3, 120, 89],
    [ 2, 130, 98],
    [ 4, 143, 87]])
72

Discretization or Quantization
of quantitative data
Preprocessing of data:
Discretization or quantization
73

• Discretization (otherwise known as quantization or binning) provides a way to partition a continuous variable/feature into discrete values.
• Certain datasets with continuous features may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes.

Example: The GPA is a continuous variable; it can have values between 0 and 4.

We can discretize it into 4 bins as follows:
0 <= GPA <= 1.0 replace with 1
1 < GPA <= 2.0 replace with 2
2 < GPA <= 3.0 replace with 3
3 < GPA <= 4.0 replace with 4
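The GPA binning above can be sketched with pandas; pd.cut is one option (sklearn's KBinsDiscretizer, shown on the next slide, is another). The GPA values here are made up for illustration:

```python
import pandas as pd

# Hypothetical GPA values to discretize
gpa = pd.Series([0.5, 1.5, 2.7, 3.9, 2.0])

# Bin edges match the table: [0,1], (1,2], (2,3], (3,4]
# include_lowest=True makes the first interval closed at 0
bins = pd.cut(gpa, bins=[0, 1, 2, 3, 4], labels=[1, 2, 3, 4],
              include_lowest=True)
print(bins.tolist())
```

Each GPA is replaced by the label of the bin it falls in, exactly as in the table.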
Preprocessing of data: Discretization
(TS, example: 3 variables: GPA, IQ and attendance score)
74

from sklearn import preprocessing


import numpy as np
X = np.array([[ 3.5, 132, 100 ], [ 2.9, 119, 95 ], [ 1.9, 99, 65 ]])
est = preprocessing.KBinsDiscretizer(n_bins=[4, 3, 2], encode='ordinal').fit(X)
est.transform(X)

• Each row represents a sample (or feature vector), and each column represents a variable/feature.
• Next, you create an instance of KBinsDiscretizer with n_bins=[4, 3, 2]. This means
that you want to divide the first variable/feature into 4 bins (one of 4 options), the second feature into
3 bins (one of 3 options), and the third feature into 2 bins(one of 2 options).
• The encode='ordinal' parameter indicates that you want to encode the bins with ordinal
integers. An ordinal number is a number that indicates the position like 1st, 2nd,…or in zero indexing 0,
1,…
• Then, you fit the KBinsDiscretizer object ‘est’ to the data X using the fit method.
• Finally, you transform the data X using the transform method of ‘est’, which discretizes the values in
X into the specified number of bins. The result is a transformed array with the same shape as X.
75

Encoding
of categorical data
Preprocessing of data:
Encoding categorical features
76

• Often, features are not given as continuous values but as categorical ones.
These need to be converted into numbers prior to using machine learning

• For example, a person could have features:


• ["male", "female"],
• ["from Europe", "from US", "from Asia"],
• ["uses Firefox", "uses Chrome", "uses Safari", "uses Edge"].

• Such features can be efficiently coded as integers, for instance:
• ["male", "from US", "uses Edge"] could be expressed as [1, 2, 1]
• while ["female", "from Asia", "uses Chrome"] could be expressed as [0, 0, 0].

• Types of Encoders:
• Ordinal Encoders {1st,2nd,3rd,…} or {0, 1, 2,…} for multi-dimensional data as in
the example above
• Label Encoders (similar to ordinal encoders but for 1-D row arrays).
• One Hot Encoding: Binary Encoding 0 or 1
Preprocessing of data: Ordinal Encoding
An ordinal number is a number that indicates the position like 1st, 2nd,…
77

from sklearn import preprocessing
import numpy as np

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OrdinalEncoder()
enc.fit(X)
#notice the input to transform is a 2D array, hence the [[..]]
rst = enc.transform([['female', 'from US', 'uses Safari']])
print(rst)

The encoding is done by sorting: for every feature (column) of the data, categories are sorted (capital letters first, then alphabetically) and assigned integer codes in that order.
78

Case study of Data Summarization for Data


Science Applications
79

Case Study 1: Iris Dataset


Iris Flowers Dataset
80

 In this case study, we will use the Iris sample data, which contains information on 150
Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica.
Each flower is characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica); these are the labels

Data is available online at: https://archive.ics.uci.edu/dataset/53/iris


https://www.youtube.com/watch?v=pTjsr_0YWas
Iris Flowers Dataset
81
Step 1: Reading the data
82

import pandas as pd
#data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None) #or
#In Colab, upload the iris.data file and locate it under the "Sample Data" folder. Then right click on the uploaded "iris.data" file and copy its path
data = pd.read_csv('sample_data/iris.csv', header=None)

#ts: add column headers
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

#display the first few rows
data.head()
Step 2: Numerical summaries
83

data['class'].value_counts()

data.info()
Step 2: Numerical summaries
84

data.describe(include='all')
Step 2: Numerical summaries
85

data.describe() # excludes NaN

sepal length sepal width petal length petal width


count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Step 3: Visual Summaries
86

import matplotlib.pyplot as plt
import seaborn as sns

plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
#plt.ylim(0,30)

data['sepal length'].hist(bins=8)

plt.show()

#Or
#sns.histplot(data['sepal length'])
Step 3: Visual Summaries
87

plt.ylabel('Value (cm)')
plt.xlabel('Attribute')
plt.title('Data Boxplot')

data.boxplot()

#or
sns.boxplot(data)
Step 3: Visual Summaries
88
# Select two columns for the scatter plot
# Create a scatter plot of the selected columns
sns.scatterplot(data=data[0:150], x='sepal length', y='sepal width',
                hue='class') # hue: the diff in colors is based on the class labels

plt.title('Iris Flowers')
plt.show()

TS: In scatter plots:

 Each data point is represented by a dot on the graph, allowing you to see how one variable behaves in relation to another.
 Scatterplots are helpful for identifying relationships between the two variables being compared (sepal length vs. sepal width in this example)
Step 3: Visual Summaries (continued)
89

sns.pairplot(data, hue='class')
plt.show()

 The pairplot is a scatter plot for every combination of the variables
 Here we have 4 variables, hence 4x4 = 16 plots
 The diagonal is a plot of a variable with itself, hence it shows the distribution of that variable for each class (we have 3 classes here)
Step 3: Visual Summaries (continued)
90

sns.pairplot(data, hue='class', diag_kind='hist')

plt.show()
Step 3: Visual Summaries (continued)
91

Another interesting case of data visualization is to use a heatmap to visualize the correlation matrix of the dataset (1 strong +ve linear correlation, 0 no correlation, -1 strong -ve linear correlation)

data_numerical_columns = data.select_dtypes(include=['number'])

sns.heatmap(data_numerical_columns.corr(), annot=True)
plt.show()

This type of visualization helps to identify which variables are positively correlated (tend to change together in the same direction), negatively correlated (tend to change in opposite directions), or have no significant correlation.
Filter a DataFrame
92

 You can filter DataFrames to obtain a subset of the data prior to plotting if needed.
 For example, assume that you want to filter the Iris dataset for flowers with a class type of setosa.
 You can write one of the following:

data = pd.read_csv('sample_data/iris.csv', header=None)


data.columns=['sepal_length','sepal_width','petal_length','petal_width', 'Class']

filtered = data[(data.Class == "Iris-setosa")]


filtered.head()
#OR
filtered = data.query('Class == "Iris-setosa"')
filtered.head()
93

Case Study 2: House Prices Dataset


House prices in Melbourne
94

 Data description and analysis are available in a Google Colab notebook at:
 https://colab.research.google.com/drive/1FKJldbBKkBNELM_28y6l0gRvUHZRLim8?usp=sharing
Learning Outcomes
95

Upon completion of the course, students will be able to:


1. Identify the importance of AI and Data Science for society
2. Perform data loading, preprocessing, summarization and
visualization
3. Apply machine learning methods to solve basic regression
and classification problems
4. Apply artificial neural networks to solve simple engineering
problems
5. Implement basic data science and machine learning tasks
using programming tools
Extra slide Not included in curriculum
Preprocessing of data: One-Hot-Encoding
96
 Machine learning algorithms typically work with numerical data, so you need to convert categorical
values into numbers. One-hot encoding does this by creating binary (0 or 1) columns for each category.
 Example: consider a dataset with a column called "Color" with categorical values like "Red," "Green," and
"Blue."
 Original Categorical Data values: [Red,Green,Blue] can be represented in one column
Color # name of variable or column
Red # color of first observation
Red # color of second observation
Blue # color of third observation

 One-hot encoding would convert this into three binary columns, one for each color. With the columns ordered alphabetically (Blue, Green, Red), as sklearn does:
One-Hot Encoded Data: Blue Green Red # 3 columns = the number of possible values
Red = [0, 0, 1]
Green = [0, 1, 0]
Blue = [1, 0, 0]
 Note that the variable color had 3 values (Red, Green, Blue), hence 3 columns are needed to represent this one variable
 In other words, the categorical variable color is one column and its One-Hot encoding is 3 columns
Extra slide Not included in curriculum
Preprocessing of data: One-Hot-Encoding
97

from sklearn import preprocessing
import numpy as np

#2 feature vectors (rows) and 3 variables (columns)
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OneHotEncoder()
enc.fit(X)
rst = enc.transform([['female', 'from US', 'uses Safari'], ['male', 'from Europe',
                     'uses Firefox']])
print(rst.toarray())

Output (columns per feature, sorted alphabetically: female, male | from Europe, from US | uses Firefox, uses Safari):
[[1. 0. 0. 1. 0. 1.]
 [0. 1. 1. 0. 1. 0.]]

Note: In the output, the number of binary columns is equal to the number of values of a variable.
