5_Data Summaries and Visualization
Summaries
Intro to AI and Data Science
NGN 112 – Fall 2024
Ammar Hasan
Department of Electrical Engineering
College of Engineering
Types of Data
Data Statistics
Data Visualization
Data Preprocessing
Types of Data
Variable
Categorical or
Quantitative
◼ Discrete or
◼ Continuous
Categorical Variable
1. Gender
2. Religion
3. Type of residence (Apt, Villa, …)
4. Belief in Aliens (Yes or No)
Quantitative Variable
Examples:
1. Age
2. Number of siblings
3. Annual Income
Quantitative vs. Categorical
Example: a quantitative variable such as an exam grade can be summarized by its center (average grade 77.8%) and its spread (lowest grade 57%, highest grade 96%).
Discrete Quantitative Variable
A quantitative variable is discrete if its possible values form a set of separate numbers: 0, 1, 2, 3, …
Examples:
1. Number of pets in a household
2. Number of children in a family
3. Number of foreign languages spoken by an individual
Continuous Quantitative Variable
A quantitative variable is continuous if its possible values form an interval
Examples:
1. Height/Weight
2. Age
3. Blood pressure
4. Measurements
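Median
The median is the middle value of the ordered observations. If the number of observations is:
a) Odd, the median is the middle observation
b) Even, the median is the average of the two middle observations
[Table residue: ordered observations 5 to 9 of one data set (99, 101, 103, 105, 114) and observations 4 to 10 of another (98, 99, 101, 103, 105, 114, 121)]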
Python: median()
import numpy as np
np.median(X)
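The odd/even rule for the median can be checked with a small sketch. The two arrays below are made up for illustration (they are not the slide's data); the second one simply adds one more observation so that the count becomes even.

```python
import numpy as np

# Hypothetical data: 5 observations (odd) and 6 observations (even)
odd = np.array([114, 99, 105, 101, 103])
even = np.array([114, 99, 105, 101, 103, 121])

# np.median sorts the values internally, so the input order does not matter
median_odd = np.median(odd)    # middle value of 99, 101, 103, 105, 114
median_even = np.median(even)  # average of the two middle values 103 and 105

print(np.sort(odd), "median =", median_odd)
print(np.sort(even), "median =", median_even)
```

With an odd count the median is one of the observations; with an even count it may be a value that does not appear in the data.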
Example: Data & Histograms (1/2)
[85,92,78,88,95,90,88,72,68,98,84,91,88,75,92,89,79,83,87,94,86,88,76,81,90,92,70,85,89,93,85,92,
78,88,95,90,88,72,68,98,84,91,88,75,92,89,79,83,87,94,86,88,76,81,90,92,70,85,89,93]
• To create a histogram for this data, first group the scores into bins or intervals (e.g., 60-69, 70-79, 80-89, 90-99).
• Then count how many students scored within each of these ranges:
60-69: 2 students
70-79: 12 students
80-89: 26 students
90-99: 20 students
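The binning step can be reproduced in code. The following is a minimal sketch that assumes the 60 scores listed above (the same 30 scores appear twice) and uses np.histogram with explicit bin edges; the variable names are chosen here for illustration.

```python
import numpy as np

# The 30 distinct entries of the list; the slide repeats them twice (60 scores)
base = [85, 92, 78, 88, 95, 90, 88, 72, 68, 98,
        84, 91, 88, 75, 92, 89, 79, 83, 87, 94,
        86, 88, 76, 81, 90, 92, 70, 85, 89, 93]
scores = np.array(base * 2)

# Bin edges: [60, 70), [70, 80), [80, 90), [90, 100]
edges = [60, 70, 80, 90, 100]
counts, _ = np.histogram(scores, bins=edges)

for low, count in zip(edges[:-1], counts):
    print(f"{low}-{low + 9}: {count} students")
```

Counting in code is a useful check on a hand tally, because it is easy to miscount a long list by eye.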
Example: Data & Histograms (2/2)
Source: https://www.techtarget.com/searchsoftwarequality/definition/histogram
[Figure: three histograms. Left: most student grades are around average. Middle: most student grades are low. Right: most student grades are high.]
Comparing the Mean and Median
Mode
Value that occurs most often (for example, the most frequent major of students in NGN112)
Highest bar in the histogram
Mode is most often used with categorical data
Python: st.mode()
import numpy as np
from scipy import stats as st
st.mode(X)
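Since the mode is mostly used with categorical data, here is a small sketch for string values. It uses np.unique instead of st.mode, as an alternative that works directly on strings; the list of majors is hypothetical.

```python
import numpy as np

# Hypothetical list of student majors (categorical data)
majors = np.array(["EE", "CS", "ME", "CS", "CE", "CS", "EE"])

# Count how often each distinct value occurs
values, counts = np.unique(majors, return_counts=True)

# The mode is the value with the largest count
mode_value = values[np.argmax(counts)]

for value, count in zip(values, counts):
    print(f"{value}: {count}")
print("mode:", mode_value)
```

If two values tie for the largest count, np.argmax returns the first one in sorted order, so a tie should be checked separately when it matters.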
Each data value has an associated deviation from the mean, x − x̄
A deviation is positive if it falls above the mean and negative if it
falls below the mean
The sum of the deviations is always zero
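A short sketch can confirm that the deviations sum to zero. It assumes the same eight values that appear in the full code example of this section.

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

mean = X.mean()
deviations = X - mean  # positive above the mean, negative below it

print("mean =", mean)
print("deviations =", deviations)
print("sum of deviations =", deviations.sum())
```

Because positive and negative deviations cancel, the plain sum cannot measure spread; this is why the standard deviation squares the deviations first.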
Standard Deviation
Python: Standard deviation std()
import numpy as np
np.std(X)
# or X.std()
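The sketch below computes the standard deviation step by step, assuming the eight values used in the full code example of this section. By default np.std divides by n (population form); passing ddof=1 divides by n − 1 (sample form), which is the definition many textbooks use.

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])
n = len(X)

# Sum of squared deviations from the mean
squared_dev_sum = ((X - X.mean()) ** 2).sum()

pop_std = np.sqrt(squared_dev_sum / n)           # same as np.std(X)
sample_std = np.sqrt(squared_dev_sum / (n - 1))  # same as np.std(X, ddof=1)

print("population std:", pop_std, np.std(X))
print("sample std:    ", sample_std, np.std(X, ddof=1))
```

The two values differ noticeably for small data sets, so it is worth checking which form a formula or a library call uses.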
Measures of Position: Percentiles
Finding Quartiles
import numpy as np
np.min(x)
np.percentile(x, 25)
np.percentile(x, 50)
np.percentile(x, 75)
np.max(x)
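As a runnable version of the calls above, the sketch below defines x (using the eight values from the full code example of this section) and also computes the interquartile range, IQR = Q3 − Q1. Note that np.percentile uses linear interpolation by default, so other quartile conventions may give slightly different values.

```python
import numpy as np

x = np.array([210, 210, 260, 210, 260, 210, 125, 140])

# Q1, Q2 (the median) and Q3 in a single call
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1  # spread of the middle 50% of the data

print("min =", np.min(x), " max =", np.max(x))
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3)
print("IQR =", iqr)
```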
The full code
import numpy as np
X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])
Range=np.max(X)-np.min(X)
print('Range = ', Range)
std = np.std(X)
print('std = ', std)
n1 =np.min(X)
n2 =np.percentile(X, 25)
n3= np.percentile(X, 50)
n4 =np.percentile(X, 75)
n5= np.max(X)
print('Five number summary: ',n1,' ',n2,' ', n3, ' ', n4, ' ',n5)
print('------------------------')
Output:
Range = 135
std = 45.753244420477984
Five number summary: 125 192.5 210.0 222.5 260
------------------------
A frequency table is a listing of possible values for a variable, together with the number of observations or relative frequencies for each value.
Python: Frequency Tables
import pandas as pd
# Option 1: build the DataFrame from a list; columns sets the column headers
#df = pd.DataFrame(data = ['apple', 'apple', 'banana', 'orange',
#    'apple', 'apple', 'banana', 'banana', 'orange', 'banana', 'apple'],
#    columns=['Fruit'])
# Option 2: build the DataFrame from a dictionary
data = {'Fruit': ['apple', 'apple', 'banana', 'orange', 'apple',
'apple', 'banana', 'banana', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)
print(df)
print()
absolute_frequencies = df['Fruit'].value_counts()
print(absolute_frequencies) # which is a series
print()
relative_frequencies = df['Fruit'].value_counts(normalize=True)
#normalize means divide by the total which is len(df) or 11
print(relative_frequencies)
The output:
    Fruit
0 apple
1 apple
2 banana
3 orange
4 apple
5 apple
6 banana
7 banana
8 orange
9 banana
10 apple
apple 5
banana 4
orange 2
Name: Fruit, dtype: int64
apple 0.454545
banana 0.363636
orange 0.181818
Name: Fruit, dtype: float64
Pie Charts
Bar Charts
Histograms
Box Plots
Pie Charts
Summarize a categorical variable
Drawn as a circle where each category is a slice
The size of each slice is proportional to the percentage in that category
Python: Pie Chart
import pandas as pd
df = pd.DataFrame(data=['apple', 'apple', 'banana', 'orange', 'apple', 'apple',
                        'banana', 'banana', 'orange', 'banana', 'apple'],
                  columns=['Fruit'])
# Count how many times each fruit appears
absolute_frequencies = df['Fruit'].value_counts()
# One row per fruit, with its count in the 'Fruit' column
df2 = pd.DataFrame({'Fruit': absolute_frequencies},
                   index=['apple', 'banana', 'orange'])
df2
# autopct prints the percentage on each slice
df2.plot.pie(y='Fruit', figsize=(5,5), autopct='%1.1f%%')
Bar Charts
Summarize a categorical variable
One vertical bar for each category
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
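A bar chart for the fruit data used in the pie chart example could be sketched as follows. This is a minimal version that uses the pandas plotting interface rather than seaborn, so it is not necessarily the code from the original slide; the axis labels and title are chosen here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Fruit': ['apple', 'apple', 'banana', 'orange', 'apple',
                             'apple', 'banana', 'banana', 'orange', 'banana',
                             'apple']})

# Absolute frequency of each category
counts = df['Fruit'].value_counts()

# One vertical bar per category; rot=0 keeps the category labels horizontal
ax = counts.plot.bar(rot=0)
ax.set_xlabel('Fruit')
ax.set_ylabel('Count')
ax.set_title('Bar chart of fruit counts')
plt.show()  # optional in Colab
```

Unlike a pie chart, the bar heights can be read directly against the count axis, which makes close categories easier to compare.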
In Google Colab, plot a pie chart and a bar chart for the
following data, which is the list of majors of all the students in
this section of NGN112
Histograms
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
ax = sns.histplot(data = mydata)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid") #optional
plt.figure(figsize=(10,8))
ax = sns.histplot(data = mydata)
# Set labels and title
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
# Show the plot
plt.show() #optional in Colab
Summary of steps:
1. Create a figure: plt.figure…
2. Create a histogram: sns.histplot…
3. Show the plot: plt.show…
Interpreting Histograms
Boxplots
Boxplot
import numpy as np
import seaborn as sns
ax = sns.boxplot(data=mydata)
Python: Boxplot (with figure options)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#optional
sns.set(style="darkgrid")
plt.figure(figsize=(4,5))
ax = sns.boxplot(y=mydata, orient="v")
What is the shape of the distribution? From the box plot, find the
values of max, min, median, Q1, Q2, Q3, and any outliers.
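The values read from a box plot can be cross-checked numerically. The sketch below uses a hypothetical data set and the common 1.5 × IQR rule for outliers, which is also the default whisker setting in seaborn and matplotlib box plots.

```python
import numpy as np

# Hypothetical data with one unusually large value
mydata = np.array([52, 55, 57, 58, 60, 61, 63, 64, 66, 95])

q1, median, q3 = np.percentile(mydata, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = mydata[(mydata < lower_fence) | (mydata > upper_fence)]

print("Q1 =", q1, " median =", median, " Q3 =", q3)
print("fences =", lower_fence, upper_fence)
print("outliers =", outliers)
```

The whiskers of the plot end at the most extreme observations that are still inside the fences, so the plotted maximum can be smaller than np.max of the data.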
Data Preprocessing
Z-Score Normalization
of quantitative data
Data Normalization: Z-Scores
import numpy as np
from sklearn import preprocessing
X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
# The fit method computes the mean and standard deviation of each feature/column in
# X_train, which will be used for scaling later
print(X_train)
print('means per column: ',scaler.mean_)
print('variances per column: ',scaler.var_)
# The transform method applies the z-score scaling to each column
X_scaled = scaler.transform(X_train)
X_scaled #Output
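To see what the scaler computes, the z-score z = (x − mean) / std can be applied by hand with NumPy. This is a sketch using the same X_train array; it uses the population standard deviation (division by n), which is the convention StandardScaler follows.

```python
import numpy as np

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Column-wise mean and (population) standard deviation
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# z-score: subtract the column mean, then divide by the column std
Z = (X_train - mean) / std

print(Z)
print("column means after scaling:", Z.mean(axis=0))
print("column stds after scaling: ", Z.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in different units become comparable.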
Min-Max Normalization
of quantitative data
Data Normalization: min-max
Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively.
An income of $60,000 would then have a scaled value of (60,000 − 12,000)/(98,000 − 12,000) = 0.558
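The income calculation follows the min-max formula x_scaled = (x − min) / (max − min). The sketch below applies it to the income example and then column by column to a small array; the helper name min_max_scale is introduced here for illustration only.

```python
import numpy as np

def min_max_scale(x, x_min, x_max):
    """Map x from the range [x_min, x_max] to the range [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# Income example: minimum 12,000, maximum 98,000
scaled_income = min_max_scale(60_000, 12_000, 98_000)
print(round(scaled_income, 3))

# Column-wise scaling of a small 2D array
X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_minmax = min_max_scale(X_train, X_train.min(axis=0), X_train.max(axis=0))
print(X_minmax)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1.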
Data Normalization: min-max
import numpy as np
from sklearn import preprocessing
X_train = np.array([[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]])
# Rescale each column to the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax #Output
Discretization or Quantization
of quantitative data
Preprocessing of data:
Discretization or quantization
• Each row represents a sample (or feature vector), and each column represents a variable/feature.
• Next, you create an instance of KBinsDiscretizer with n_bins=[4, 3, 2]. This means
that you want to divide the first variable/feature into 4 bins (one of 4 options), the second feature into
3 bins (one of 3 options), and the third feature into 2 bins (one of 2 options).
• The encode='ordinal' parameter indicates that you want to encode the bins with ordinal
integers. An ordinal number indicates a position, such as 1st, 2nd, …; with zero indexing the bins are numbered 0,
1, …
• Then, you fit the KBinsDiscretizer object ‘est’ to the data X using the fit method.
• Finally, you transform the data X using the transform method of ‘est’, which discretizes the values in
X into the specified number of bins. The result is a transformed array with the same shape as X.
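The steps described above can be sketched as follows. The array X is hypothetical, and strategy='uniform' (equal-width bins) is an assumption made here so that the bin edges are easy to verify by hand; the text above does not say which strategy is used.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: 4 samples (rows) and 3 features (columns)
X = np.array([[0., 10., 100.],
              [3., 15., 130.],
              [5., 19., 170.],
              [8., 22., 200.]])

# 4 bins for feature 1, 3 bins for feature 2, 2 bins for feature 3
est = KBinsDiscretizer(n_bins=[4, 3, 2], encode='ordinal', strategy='uniform')
est.fit(X)                  # learns the bin edges of each column
X_binned = est.transform(X) # replaces each value by its bin index

print(est.bin_edges_)
print(X_binned)
```

Each entry of X_binned is a zero-based bin index, so the first column takes values from 0 to 3, the second from 0 to 2, and the third 0 or 1.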
Encoding
of categorical data
Preprocessing of data:
Encoding categorical features
• Often, features are not given as continuous values but as categorical ones.
These need to be converted into numbers prior to using machine learning
• Types of Encoders:
• Ordinal Encoders {1st,2nd,3rd,…} or {0, 1, 2,…} for multi-dimensional data as in
the example above
• Label Encoders (similar to ordinal encoders but for 1-D row arrays).
• One Hot Encoding: Binary Encoding 0 or 1
Preprocessing of data: Ordinal Encoding
An ordinal number is a number that indicates the position like 1st, 2nd,…
enc = preprocessing.OrdinalEncoder()
enc.fit(X)
#notice the input to transform is a 2D array hence the [[..]]
rst = enc.transform([['female', 'from US', 'uses Safari']])
print(rst)
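A self-contained version of the ordinal encoding example is sketched below. The training data X is an assumption: two hypothetical samples chosen so that each of the three categorical features has two possible values.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data: gender, region, browser (assumed for illustration)
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = OrdinalEncoder()
enc.fit(X)  # learns the sorted list of categories for each column

# The input to transform is a 2D array, hence the [[..]]
rst = enc.transform([['female', 'from US', 'uses Safari']])

print(enc.categories_)
print(rst)
```

Each category is replaced by its position in the sorted category list of its column, so 'female' becomes 0 and 'male' becomes 1.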
In this case study, we will use the Iris sample data, which contains information on 150
Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica.
Each flower is characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica). These are the labels.
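import pandas as pd
# Option 1: read the file directly from the UCI repository
#data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
# Option 2: in Colab, upload the iris.data file and locate it under the “Sample Data”
# folder. Then right-click on the uploaded “iris.data” file and copy its path
data = pd.read_csv('sample_data/iris.csv', header=None)
# The file has no header row, so name the columns before using them
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
data['class'].value_counts()
data.info()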
Step 2: Numerical summaries
data.describe(include='all')
Step 2: Numerical summaries
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
#plt.ylim(0,30)
data['sepal length'].hist(bins=8)
plt.show()
#Or
#sns.histplot(data['sepal length'])
Step 3: Visual Summaries
plt.ylabel('Value (cm)')
plt.xlabel('Attribute')
plt.title('Data Boxplot')
data.boxplot()
#or
#sns.boxplot(data=data)
Step 3: Visual Summaries
# Select two columns for the scatter plot
# Create a scatter plot of the selected columns
sns.scatterplot(data=data[0:150], x='sepal length', y='sepal width',
hue='class') # hue: the diff in colors is based on the class labels
plt.title('Iris Flowers')
plt.show()
sns.pairplot(data, hue='class')
plt.show()
Here we have 4 variables, hence 4x4=16 plots
sns.pairplot(data, hue='class', diag_kind='hist')
plt.show()
Step 3: Visual Summaries (continued)
data_numerical_columns = data.select_dtypes(include=['number'])
sns.heatmap(data_numerical_columns.corr(), annot=True)
plt.show()
You can filter DataFrames to obtain a subset of the data prior to plotting if
needed.
For example, assume that you want to filter the Iris dataset for flowers with
a class type of setosa.
You can write one of the following:
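Two common ways to write such a filter are sketched below on a tiny hypothetical DataFrame. The column name 'class' follows the case study, while the label text 'Iris-setosa' is an assumption about how the class is spelled in the file and should be checked against data['class'].unique().

```python
import pandas as pd

# Tiny hypothetical stand-in for the Iris DataFrame
data = pd.DataFrame({
    'sepal length': [5.1, 7.0, 4.9, 6.3],
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica'],
})

# Option 1: boolean mask inside square brackets
setosa_a = data[data['class'] == 'Iris-setosa']

# Option 2: the same mask with .loc (rows selected by the mask, all columns kept)
setosa_b = data.loc[data['class'] == 'Iris-setosa', :]

print(setosa_a)
print(setosa_a.equals(setosa_b))
```

The filtered DataFrame can then be passed to the plotting calls in place of the full data.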
https://colab.research.google.com/drive/1FKJldbBKkBNELM_28y6l0gRvUHZRLim8?usp=sharing
Learning Outcomes
enc = preprocessing.OneHotEncoder()
enc.fit(X)
rst = enc.transform([['female', 'from US', 'uses Safari'], ['male', 'from Europe',
'uses Firefox']])
print(rst.toarray())
Output:
[Figure: output array whose binary columns are labeled male / female, from US / from Europe, uses Safari / uses Firefox]
Note: In the output, the number of binary columns for each variable is equal to the number of distinct values of that variable.
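A self-contained version of the one-hot encoding example is sketched below. As in the snippet above, the encoder is fitted on X; here X is an assumption, made of two hypothetical samples with two possible values per feature.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: gender, region, browser (assumed for illustration)
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = OneHotEncoder()
enc.fit(X)  # learns the sorted list of categories for each column

rst = enc.transform([['female', 'from US', 'uses Safari'],
                     ['male', 'from Europe', 'uses Firefox']])
onehot = rst.toarray()  # transform returns a sparse matrix by default

print(enc.categories_)
print(onehot)
```

Three features with two values each give 2 + 2 + 2 = 6 binary columns, and every row contains exactly one 1 per feature.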