
Exploring Data with Graphs and Numerical

Summaries
Intro to AI and Data Science
NGN 112 – Fall 2024

Amer S. Zakaria
Department of Electrical Engineering
College of Engineering

American University of Sharjah

Prepared by Dr. Hussam Alshraideh, INE

Last Updated on: 28th of Oct. 2024


Table of Contents
2

Introduction to Statistics using Python

Data Loading, Visualization and Preprocessing

Data Summarization for Data Science Applications


3 Introduction to Statistics using Python
4 What Are the Types of Data?
Variable
5

 A variable is any characteristic that is recorded for the subjects in a study
  Examples: Marital status, Height, Weight, IQ

 A variable can be classified as either
  Categorical or
  Quantitative
   ◼ Discrete or
   ◼ Continuous
Categorical Variable
6

A variable is categorical if each observation belongs to one of a set of


categories.
 Examples:

1. Gender
2. Religion
3. Type of residence (Apt, Villa, …)
4. Belief in Aliens (Yes or No)
Quantitative Variable
7

A variable is called quantitative if observations take numerical values for


different magnitudes of the variable.

 Examples:
1. Age
2. Number of siblings
3. Annual Income
Quantitative vs. Categorical
8

 For Quantitative variables, key features are the center (a


representative value) and spread (variability).

 Example: average exam grade is 77.8% and spread (min grade 57% and
highest 96%)

 For Categorical variables, a key feature is the percentage of observations in each of the categories.

 Example: 45% male students and 55% female students


Discrete Quantitative Variable
9

 A quantitative variable
is discrete if its possible
values form a set of
separate numbers:
0,1,2,3,….
 Examples:
1. Number of pets in a
household
2. Number of children in a
family
3. Number of foreign
languages spoken by an
individual
Continuous Quantitative Variable
10

 A quantitative variable is
continuous if its possible values
form an interval
 Examples:
1. Height/Weight
2. Age
3. Blood pressure
4. Measurements

11 Describe the Center of Quantitative Data
Mean
12

 The mean is the sum of the observations divided by the number of observations
 It is the center of mass
Python: mean()
13

import numpy as np

X = np.array([210, 260, 125, 140])

np.mean(X)

#or

X.mean()
Median
14

The median is the midpoint of the observations when ordered from least to greatest:
1. Order the observations
2. If the number of observations is:
   a) Odd, the median is the middle observation
   b) Even, the median is the average of the two middle observations

Example (odd, n = 9): 78, 91, 94, 98, 99, 101, 103, 105, 114 → median = 99
Example (even, n = 10): 78, 91, 94, 98, 99, 101, 103, 105, 114, 121 → median = (99 + 101)/2 = 100
Python: median()
15

import numpy as np

X = np.array([ 210, 260, 125, 140])

np.median(X)
Example: Data & Histograms (1/2)
16

Example: The scores of 30 students are as follows:

[85, 92, 78, 88, 95, 90, 88, 72, 68, 98, 84, 91, 88, 75, 92, 89, 79, 83, 87, 94, 86, 88, 76, 81, 90, 92, 70, 85, 89, 93]

• To create a histogram for this data, you would first group the scores into bins or intervals (e.g., 60-69, 70-79, 80-89, 90-99).

• Now, you count how many students scored within each of these ranges.

60-69: 1 student
70-79: 6 students
80-89: 13 students
90-99: 10 students
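These bin counts can be reproduced with NumPy's histogram function; a minimal sketch (the bin edges below are chosen to match the 60-69 … 90-99 intervals):

```python
import numpy as np

scores = np.array([85, 92, 78, 88, 95, 90, 88, 72, 68, 98, 84, 91, 88, 75, 92,
                   89, 79, 83, 87, 94, 86, 88, 76, 81, 90, 92, 70, 85, 89, 93])

# Edges define the bins 60-69, 70-79, 80-89, 90-99
counts, edges = np.histogram(scores, bins=[60, 70, 80, 90, 100])
print(counts.tolist())  # [1, 6, 13, 10]
```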
Example: Data & Histograms (2/2)
17

Source:
https://www.techtarget.com/searchsoftwarequality/definition/histogram

[Figure: three example histograms — most student grades around average, most low, and most high]
Comparing the Mean and Median
18

 Mean and median of a symmetric distribution are close
  The mean is often preferred because it uses all values in its calculations
 In a skewed distribution, the mean is farther out in the skewed tail than the median
  The median is preferred because it is a better representative of a typical observation
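A quick illustration with made-up numbers: one extreme value pulls the mean into the tail while the median barely moves.

```python
import numpy as np

symmetric = np.array([70, 75, 80, 85, 90])
skewed = np.array([70, 75, 80, 85, 300])  # one extreme high value

print(np.mean(symmetric), np.median(symmetric))  # mean 80.0, median 80.0
print(np.mean(skewed), np.median(skewed))        # mean 122.0, median still 80.0
```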
Mode
19

 Value that occurs most often (e.g., what is the most frequent major of students in NGN 112-04?)
 Highest bar in the histogram
 Mode is most often used with categorical data
Python: st.mode()
20

#run this on colab

import numpy as np
from scipy import stats as st

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

st.mode(X)
21 Describe the Spread of Quantitative Data
Range
22

Range = max - min


Advantage: simple description of the spread of the data
Disadvantage: the range is strongly affected by outliers.
Python: Range
23

import numpy as np

X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])

Range=np.max(X)-np.min(X) #or X.max()–X.min()

print(Range)
Standard Deviation
24

 Each data value has an associated deviation from the mean, x − x̄
 A deviation is positive if the value falls above the mean and negative if it falls below the mean
 The sum of the deviations is always zero
Standard Deviation
25

The standard deviation gives a measure of variation by summarizing the deviations of each observation from the mean and calculating an adjusted average of these deviations:
1. Find the mean
2. Find each deviation
3. Square the deviations
4. Sum the squared deviations
5. Divide the sum by n − 1
6. Take the square root
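The six steps can be traced directly in NumPy; a sketch using the small dataset from the earlier slides (note np.std divides by n by default, so ddof=1 is needed to match the n − 1 in step 5):

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140], dtype=float)

mean = X.mean()                   # 1. find the mean
deviations = X - mean             # 2. find each deviation (these sum to zero)
squared = deviations ** 2         # 3. square the deviations
total = squared.sum()             # 4. sum the squared deviations
variance = total / (len(X) - 1)   # 5. divide the sum by n - 1
std = np.sqrt(variance)           # 6. take the square root

print(std)
print(np.std(X, ddof=1))  # same value
```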
Example: Standard Deviation
26

Metabolic rates of 7 men (calories/24 hours)

Python: Standard deviation std()
27

import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

np.std(X)  # population standard deviation (divides by n)

#or

X.std()

# For the sample standard deviation (divides by n - 1, as in the steps), use:
np.std(X, ddof=1)
Measures of Position: Percentiles
28

The kth percentile, denoted Pk, of a set of data is a value such that k percent of the observations are less than or equal to that value.
Again: Percentile
29

The pth percentile is a value such that p percent of the observations fall below or at that value.
Quartiles
30

Quartiles divide data sets into four equal parts:

• The 1st quartile, Q1, divides the bottom 25% of the data from the top 75%. Equivalent to the 25th percentile.
• The 2nd quartile, Q2, divides the bottom 50% of the data from the top 50%. Equivalent to the 50th percentile, which is the median.
• The 3rd quartile, Q3, divides the bottom 75% of the data from the top 25%. Equivalent to the 75th percentile.
Finding Quartiles
31

Splitting the data into four parts:
1. Arrange the data in order
2. The median is the second quartile, Q2
3. Q1 is the median of the lower half of the observations
4. Q3 is the median of the upper half of the observations
Measure of Spread: Quartiles
32

Quartiles divide a ranked data set into four equal parts. For the example data:

1. 25% of the data fall at or below Q1 and 75% above: Q1 = first quartile = 2.2
2. 50% of the data fall at or below the median and 50% above: M = median = 3.4
3. 75% of the data fall at or below Q3 and 25% above: Q3 = third quartile = 4.35
Numeric Summarization of Data:
The 5 Number Summary
33

The five-number summary of a


dataset consists of:
1. Minimum value
2. First Quartile
3. Median
4. Third Quartile
5. Maximum value
Python: Percentiles and Quartiles
34

import numpy as np

# random.normal function produces a list of random


numbers with a Normal Gaussian Distribution. 170 is
the mean, 10 is the standard deviation, and 250 is the
number of generated samples.
x = np.random.normal(170, 10, 250)

np.min(x)
np.percentile(x, 25)
np.percentile(x, 50)
np.percentile(x, 75)
np.max(x)
The full code

import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

Range = np.max(X) - np.min(X)
print('Range = ', Range)

std = np.std(X)
print('std = ', std)

n1 = np.min(X)
n2 = np.percentile(X, 25)
n3 = np.percentile(X, 50)
n4 = np.percentile(X, 75)
n5 = np.max(X)

print('Five number summary: ', n1, ' ', n2, ' ', n3, ' ', n4, ' ', n5)

print('------------------------')

Output:
Range = 135
std = 45.753244420477984
Five number summary: 125 192.5 210.0 222.5 260
------------------------
35
36 Describe Categorical Variables
Proportion & Percentage (Rel. Freq.)
37

Proportions and percentages are also called relative


frequencies.
Frequency Table
38

A frequency table is a
listing of possible values
for a variable, together
with the number of
observations or relative
frequencies for each
value.
Python: Frequency Tables
39

import pandas as pd

data = {'Fruit': ['apple', 'apple', 'banana', 'orange', 'apple',
                  'apple', 'banana', 'banana', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)

#or

#df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
#                          'apple', 'banana', 'banana', 'orange', 'banana', 'apple'],
#                  columns=['Fruit'])  #columns: means the headers of the columns

print(df)

# Calculate absolute frequencies
absolute_frequencies = df['Fruit'].value_counts()
print(absolute_frequencies)  # which is a series

print()

# Calculate relative frequencies
relative_frequencies = df['Fruit'].value_counts(normalize=True)
#normalize means divide by the total, which is len(df) or 11
print(relative_frequencies)
The output:

Fruit
0 apple
1 apple
2 banana
3 orange
4 apple
5 apple
6 banana
7 banana
8 orange
9 banana
10 apple

apple 5
banana 4
orange 2
Name: Fruit, dtype: int64

apple 0.454545
banana 0.363636
orange 0.181818
Name: Fruit, dtype: float64

40
41 Describe Data Using Graphical Summaries
Pie Charts
42

 Summarize categorical
variable
 Drawn as circle where each
category is a slice
 The size of each slice is
proportional to the
percentage in that category
Python: Pie Chart
43

import pandas as pd

df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
                          'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])

#First: Find the absolute frequencies
absolute_frequencies = df['Fruit'].value_counts()

#Second: Create a one-column dataframe
df2 = pd.DataFrame({'Fruit': absolute_frequencies})

print(df2)

df2.plot.pie(y='Fruit', figsize=(5,5), autopct='%1.1f%%')
Bar Graphs
44

 Summarizes categorical
variable
 Vertical bars for each category

 Height of each bar represents

either counts or percentages


 Easier to compare categories
with bar graph than with pie
chart
 Called Pareto Charts when

ordered from tallest to shortest


Python: Bar chart
45

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange', 'apple',
                          'apple', 'banana', 'banana', 'orange', 'banana',
                          'apple'], columns=['Fruit'])

plt.figure(figsize=(5,5))  #width and height in inches

sns.countplot(x='Fruit', data=df, hue=df['Fruit'])

#or
ax = sns.countplot(x='Fruit', data=df)
Histograms
46

A graph that uses bars to show frequencies (counts) or relative frequencies of possible outcomes for a quantitative variable
Python: Histogram
47

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

ax = sns.histplot(data = mydata)

# Set labels and title


ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
Python: Histogram (with figure size)
48

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata=np.random.normal(170, 10, 250)

plt.figure(figsize=(10,8))

ax = sns.histplot(data = mydata)
# Set labels and title
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")

# Show the plot
plt.show()  #optional

Summary of steps:
1. Create a figure: plt.figure…
2. Create a histogram: sns.histplot…
3. Show the plot: plt.show…
Interpreting Histograms
49

 Assess where a distribution is centered by finding the median
 Assess the spread of a distribution
 Shape of a distribution: roughly symmetric (left and right sides are mirror images), skewed to the right, or skewed to the left
Examples of Skewness
50
Outlier
51

An outlier falls far from the rest of the data


Boxplot
52

1. Box goes from the Q1 to Q3


2. Line is drawn inside the box at
the median
3. Line goes from lower end of
box (Q1) to smallest
observation not a potential
outlier
4. Line goes from upper end of
box (Q3) to largest
observation not a potential
outlier
5. Potential outliers are shown
separately, often with * or +
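The slides do not define "potential outlier" here; a common convention (an assumption on my part, and the default whis=1.5 behavior of matplotlib boxplots) is the 1.5 × IQR rule, sketched below on the dataset used earlier:

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

q1, q3 = np.percentile(X, [25, 75])
iqr = q3 - q1  # interquartile range

lower_fence = q1 - 1.5 * iqr  # values below this are potential outliers
upper_fence = q3 + 1.5 * iqr  # values above this are potential outliers

outliers = X[(X < lower_fence) | (X > upper_fence)]
print(outliers.tolist())  # [125, 140]
```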
Comparing Distributions
53

Boxplots do not display the shape of the distribution as clearly as histograms, but they are useful for making graphical comparisons of two or more datasets (or distributions)
Python: Boxplot
54

import numpy as np
import seaborn as sns

#random normal data(mean, std, size)
mydata = np.random.normal(170, 10, 250)

#Default: Vertical orientation
ax = sns.boxplot(data=mydata)

#Horizontal orientation
ax = sns.boxplot(data=mydata, orient='h')
Python: Boxplot (with figure options)
55

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random normal data(mean, std, size)


mydata = np.random.normal(170, 10, 250)

plt.figure(figsize=(4,5))

ax = sns.boxplot(mydata)

#or
ax = sns.boxplot(y = mydata)

plt.show() #optional
56 Data Preprocessing
Data Normalization
57

 In machine learning, introduced in the next chapter, the data


needs to be normalized prior to training the machine learning
model.
 Normalization means that all data variables will have the
same range, for example, [0 to 1] or [-1 to 1] or [-3.4 to 3.4]
 This is needed as different variables have different ranges.

 For example, to predict a GPA, we need to know 3 variables:


 the number of hours a student studies,
 their IQ,
 and their attendance record.
 All these variables have different ranges of values; normalization
guarantees that they will have the same range, e.g., [0,1]
 We will look at 2 normalizations: Z-scores and min-max.
Data Normalization: Z-Scores
58

The z-score of an observation x is z = (x − mean) / (standard deviation).

An observation from a bell-shaped distribution is a potential outlier if its z-score < -3 or z-score > +3

• Suppose that the average and standard deviation values for attribute income are $55,000 and $10,000, respectively.
• An income of 60,000 would have a z-score of (60,000 − 55,000)/10,000 = 0.5
• We say that 60,000 is above the average by 0.5 standard deviations
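The arithmetic of the income example can be checked in a couple of lines:

```python
mean, std = 55_000, 10_000
income = 60_000

z = (income - mean) / std  # deviation from the mean, in standard deviations
print(z)  # 0.5
```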
Data Normalization: Z-Score
59

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])

Z_score_scaler = preprocessing.StandardScaler()
Z_score_scaler.fit(X_train)
# The fit method computes the mean and standard deviation of each feature/column in
# X_train, which will be used for scaling later
print('means per column:', Z_score_scaler.mean_)
print('variances per column: ', Z_score_scaler.var_)

# The transform method applies the scaling transformation to X_train, standardizing
# each feature/column by subtracting the mean and dividing by the standard deviation
X_scaled = Z_score_scaler.transform(X_train)

print('Original Data:\n', X_train)
print('Scaled Data:\n', X_scaled)

Output:
means per column: [1. 0. 0.33333333]
variances per column: [0.66666667 0.66666667 1.55555556]
Original Data:
[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
Scaled Data:
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
Data Normalization: min-max
60

Min-max scaling maps each value v to v' = (v − min) / (max − min), so the scaled data falls in the [0, 1] range.

Suppose that the minimum and maximum values for attribute income are $12,000 and $98,000, respectively.
An income of 60,000 would have a scaled value of (60,000 − 12,000)/(98,000 − 12,000) ≈ 0.558
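The min-max arithmetic of the income example, checked directly:

```python
vmin, vmax = 12_000, 98_000
income = 60_000

scaled = (income - vmin) / (vmax - vmin)
print(round(scaled, 3))  # 0.558
```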
Data Normalization: min-max
61

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
# The fit method finds the minimum and maximum of the data X_train

# The transform method applies the min-max scaling transformation to X_train to
# scale it within the range of [0, 1]. Each feature/column is transformed individually
X_scaled = min_max_scaler.transform(X_train)

print('Original Data:\n', X_train)
print('Scaled Data:\n', X_scaled)

Output:
Original Data:
[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
Scaled Data:
[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]
Preprocessing of data:
Discretization or quantization
62

• Discretization (otherwise known as quantization or binning) provides a way to partition a continuous variable/feature into discrete values.
• Certain datasets with continuous features may benefit from discretization because discretization can transform a dataset of continuous attributes into one with only nominal attributes.

Example: The GPA is a continuous variable; it can take values between 0 and 4.
We can discretize it into 4 bins as follows:

0 <= GPA <= 1.0 replace with 1
1 < GPA <= 2.0 replace with 2
2 < GPA <= 3.0 replace with 3
3 < GPA <= 4.0 replace with 4
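The GPA binning above can also be sketched with plain NumPy; this is an illustration only, not the scikit-learn KBinsDiscretizer approach shown on the next slide:

```python
import numpy as np

gpas = np.array([0.5, 1.0, 1.5, 2.7, 3.0, 3.9])

# right=True makes each interval closed on the right, matching the table above
# (e.g., 1 < GPA <= 2 -> bin index 1); adding 1 shifts to the labels 1..4
bins = np.digitize(gpas, bins=[1.0, 2.0, 3.0], right=True) + 1
print(bins.tolist())  # [1, 1, 2, 3, 3, 4]
```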
Preprocessing of data: Discretization
(TS, example: 3 variables: GPA, IQ and attendance score)
63

from sklearn import preprocessing
import numpy as np

x = np.array([[ 3.5, 132, 100 ], [ 2.9, 119, 95 ], [ 1.9, 99, 65 ]])

est = preprocessing.KBinsDiscretizer(n_bins=[4, 3, 2], encode='ordinal')
est.fit(x)

x_bins = est.transform(x)
print('Original Data:\n', x)
print('Binned Data:\n', x_bins)

Output:
Original Data:
[[  3.5 132.  100. ]
 [  2.9 119.   95. ]
 [  1.9  99.   65. ]]
Binned Data:
[[3. 2. 1.]
 [2. 1. 1.]
 [0. 0. 0.]]

• Each row represents a sample (or feature vector), and each column represents a variable/feature.
• Next, you create an instance of KBinsDiscretizer with n_bins=[4, 3, 2]. This means that you want to divide the first variable/feature into 4 bins (one of 4 options), the second feature into 3 bins (one of 3 options), and the third feature into 2 bins (one of 2 options).
• The encode='ordinal' parameter indicates that you want to encode the bins with ordinal integers. An ordinal number is a number that indicates the position, like 1st, 2nd, … or in zero indexing 0, 1, …
• Then, you fit the KBinsDiscretizer object ‘est’ to the data X using the fit method.
• Finally, you transform the data X using the transform method of ‘est’, which discretizes the values in X into
the specified number of bins. The result is a transformed array with the same shape as X.
Preprocessing of data:
Encoding categorical features
64

• Often, features are not given as continuous values but as categorical ones.
These need to be converted into numbers prior to using machine learning

• For example, a person could have features:


• ["male", "female"], Encoded to: [1, 0]
• ["from Europe", "from US", "from Asia"], Encoded to: [1, 2, 0]
• ["uses Firefox", "uses Chrome", "uses Safari", "uses Edge"]. Encode to: [2, 0, 3, 1]

• Such features can be efficiently coded as integers; for instance:
• ["male", "from US", "uses Edge"] could be expressed as [1, 2, 1]
• while ["female", "from Asia", "uses Chrome"] could be expressed as [0, 0, 0].

• Types of Encoders:
• Ordinal Encoders {1st,2nd,3rd,…} or {0, 1, 2,…} for multi-dimensional data as in
the example above
• Label Encoders (similar to ordinal encoders but for 1-D row arrays).
• One Hot Encoding: Binary Encoding 0 or 1
Preprocessing of data: Ordinal Encoding
An ordinal number is a number that indicates the position like 1st, 2nd,…
65

from sklearn import preprocessing


import numpy as np

X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses


Firefox'],['male', 'from Asia', 'uses Chrome'],['male', 'from US', 'uses Edge']]

enc = preprocessing.OrdinalEncoder()
enc.fit(X)

# notice the input to transform is a 2D array hence the [[..]]


X_sample = [['female', 'from US', 'uses Safari']]

rst = enc.transform(X_sample)

print(rst) Output:
[[0. 2. 3.]]

 The encoding is determined by sorting: the categories of every feature (column) are sorted lexicographically, and each category is replaced by its position in the sorted order.
66
Data summarization for data science
applications
67 Case Study 1: Iris Dataset
Iris Flowers Dataset
68

 In this case study, we will use the Iris sample data, which contains information on 150
Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica.
Each flower is characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica) — this is the label

Data is available online at: https://archive.ics.uci.edu/dataset/53/iris


https://www.youtube.com/watch?v=pTjsr_0YWas
Iris Flowers Dataset
69
Step 1: Reading the data
70

import pandas as pd

#In Colab, upload the iris.data file and place it under the “Sample Data” folder. Then
#right click on the uploaded “iris.data” file and copy the path
data = pd.read_csv('iris.data', header=None)
# or

# data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-
databases/iris/iris.data',header=None)

# Add column headers


data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width',
'species']

# Display the five rows


print(data.head())

Output: sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Step 2: Numerical summaries
71

absolute_frequencies = data['species'].value_counts()
print(absolute_frequencies)

Output: species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: count, dtype: int64

print(data.info())
Output: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length 150 non-null float64
1 sepal width 150 non-null float64
2 petal length 150 non-null float64
3 petal width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
Step 2: Numerical summaries
72

print(data.describe(include='all')) # Includes categorical data

Output:
sepal length sepal width petal length petal width species
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN Iris-setosa
freq NaN NaN NaN NaN 50
mean 5.843333 3.054000 3.758667 1.198667 NaN
std 0.828066 0.433594 1.764420 0.763161 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 NaN
Step 2: Numerical summaries
73

print(data.describe()) # Excludes categorical data

Output:
sepal length sepal width petal length petal width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Step 3: Visual Summaries - Histogram
74

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal
Length')

sns.histplot(data['sepal length'],
bins = 8)

plt.show()

# or
# data['sepal length'].hist(bins=8)
Step 3: Visual Summaries – Box Plots
75

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()
plt.xlabel('Feature')
plt.ylabel('Value (cm)')
plt.title('Data Boxplot')
sns.boxplot(data)
plt.show()

# or

plt.figure()
plt.xlabel('Feature')
plt.ylabel('Value (cm)')
plt.title('Data Boxplot')
data.boxplot()
plt.show()
Step 3: Visual Summaries – Scatter Plots
76
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
# Select two columns for the scatter plot
# Create a scatter plot of the selected columns
sns.scatterplot(data=data[0:150], x='sepal length', y='sepal width', hue='species')
plt.title('Iris Flowers')
plt.show()

In Scatter Plots:

 Each data point is represented by a dot on the graph
 Allowing you to see how one variable behaves in relation to another
 Scatterplots are helpful for identifying relationships between the two variables being compared (sepal length vs. sepal width in this example)
Step 3: Visual Summaries – Pair Plots
77

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()
sns.pairplot(data,hue='species')
plt.show()

 The pairplot is a scatter plot for every combination of the variables
 Here we have 4 variables, hence 4 × 4 = 16 plots
 The diagonal is a plot of a variable with itself; hence it shows the distribution of the variable for each class (we have 3 classes here)
Step 3: Visual Summaries – Pair Plots
78

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()

# Change the diagonal plots to


histograms
sns.pairplot(data,hue='species',
diag_kind ='hist')

plt.show()
Step 3: Visual Summaries – Heat Map
79

import matplotlib.pyplot as plt


import seaborn as sns

plt.figure()

data_numerical_columns = data.select_dtypes(include=['number'])

sns.heatmap(data_numerical_columns.corr(),annot=True)

plt.show()

 Another interesting case of data visualization is to use a heatmap to visualize the correlation matrix of the dataset (1 strong +ve linear correlation, 0 no correlation, -1 strong -ve linear correlation).
 This type of visualization helps to identify which variables are positively correlated (tend to change together in the same direction), negatively correlated (tend to change in opposite directions), or have no significant correlation.
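As a small sanity check of what the correlation matrix contains, here is a sketch with made-up data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],   # y = 2x: perfect positive linear correlation
    'z': [8.0, 6.0, 4.0, 2.0],   # z falls as x rises: perfect negative correlation
})

corr = df.corr()
print(corr.round(2))  # the x-y entry is 1.0 and the x-z entry is -1.0
```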
Filter a DataFrame
80

 You can filter DataFrames to obtain a subset of the data prior to plotting if needed.
 For example, assume that you want to filter the iris dataset for flowers with a class type of ‘setosa’.
 You can write one of the following:
data = pd.read_csv('iris.data', header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width',
'species']

filtered = data[(data.species == "Iris-setosa")]


print(filtered.head())

# or

filtered = data.query('species == "Iris-setosa"')


print(filtered.head())

Output: sepal length sepal width petal length petal width species

0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Learning Outcomes
81

Upon completion of the course, students will be able to:


1. Identify the importance of AI and Data Science for society
2. Perform data loading, preprocessing, summarization and
visualization
3. Apply machine learning methods to solve basic regression
and classification problems
4. Apply artificial neural networks to solve simple engineering
problems
5. Implement basic data science and machine learning tasks
using programming tools
Extra: Preprocessing of data: One-Hot-Encoding
Location: After slide#65
82
 Machine learning algorithms typically work with numerical data, so you need to convert categorical
values into numbers. One-hot encoding does this by creating binary (0 or 1) columns for each category.
 Example: consider a dataset with a column called "Color" with categorical values like "Red," "Green," and
"Blue."
 Original Categorical Data values: [Red,Green,Blue] can be represented in one column
Color # name of variable or column
Red # color of first observation
Red # color of second observation
Blue # color of third observation

 One-hot encoding would convert this into three binary columns, one for each color. Note that the columns are ordered alphabetically (Blue, Green, Red):
One-Hot Encoded Data: Blue Green Red # 3 columns = the number of possible values
Red = [0, 0, 1]
Green = [0, 1, 0]
Blue = [1, 0, 0]
 Note that the variable color had 3 values Red, Green, Blue hence 3 columns are needed to represent this
one variable
 In other words, the categorical variable color is one column and One-Hot is 3 columns
Extra: Preprocessing of data: One-Hot-Encoding
83

from sklearn import preprocessing


import numpy as np
#2 feature vectors (rows) and 3 variables (columns)
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OneHotEncoder()
enc.fit(X)
rst = enc.transform([['female', 'from US', 'uses Safari'], ['male', 'from Europe',
'uses Firefox']])
print(rst.toarray())

Output:
[[1. 0. 0. 1. 0. 1.]    # female, from US, uses Safari
 [0. 1. 1. 0. 1. 0.]]   # male, from Europe, uses Firefox

Column order (alphabetical within each feature): female, male | from Europe, from US | uses Firefox, uses Safari

Note: In the output, the number of binary columns is equal to the number of values of a variable.
Extra: Step 2: Numerical summaries
Location: After slide#73
84

from pandas.api.types import is_numeric_dtype

for col in data.columns:
    if is_numeric_dtype(data[col]):
        print('%s:' % (col))
        print('\t Mean = %.2f' % data[col].mean())
        print('\t Standard deviation = %.2f' % data[col].std())
        print('\t Minimum = %.2f' % data[col].min())
        print('\t Maximum = %.2f' % data[col].max())
Extra: Step 3: Visual Summaries
Location: After slide#78
85
86 Extra: Case Study 2: House Prices

Location: After slide#80


Extra: House prices in Melbourne
87

 Data description and analysis is available in google colab notebook


at:

 https://colab.research.google.com/drive/1FKJldbBKkBNELM_28y6l0
gRvUHZRLim8?usp=sharing
