5 - Data Summaries and Visualization
Intro to AI and Data Science
NGN 112 – Fall 2024
Amer S. Zakaria
Department of Electrical Engineering
College of Engineering
Categorical or
Quantitative
◼ Discrete or
◼ Continuous
Categorical Variable
6
1. Gender
2. Religion
3. Type of residence (Apt, Villa, …)
4. Belief in Aliens (Yes or No)
Quantitative Variable
7
Examples:
1. Age
2. Number of siblings
3. Annual Income
Quantitative vs. Categorical
8
Example: the average exam grade is 77.8% (center), and the grades range from a
minimum of 57% to a maximum of 96% (spread)
A quantitative variable
is discrete if its possible
values form a set of
separate numbers:
0,1,2,3,….
Examples:
1. Number of pets in a
household
2. Number of children in a
family
3. Number of foreign
languages spoken by an
individual
Continuous Quantitative Variable
10
A quantitative variable is
continuous if its possible values
form an interval
Examples:
1. Height/Weight
2. Age
3. Blood pressure
4. Measurements
11 Describe the Center of Quantitative Data
Mean
12
import numpy as np
np.mean(X)
#or
X.mean()
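As a minimal sketch of what `np.mean` computes, the snippet below adds up the observations and divides by how many there are, then compares the result with both NumPy spellings. The array `X` holds made-up exam grades chosen only for illustration; it is not course data.

```python
import numpy as np

# Hypothetical sample (illustrative values, not taken from the slides)
X = np.array([57, 70, 78, 85, 96])

# Mean by definition: sum of the observations divided by their count
manual_mean = X.sum() / len(X)

# Both NumPy spellings should agree with the manual calculation
print(manual_mean)
print(np.mean(X))
print(X.mean())
```

Note that `X.mean()` only works when `X` is a NumPy array; for a plain Python list, use `np.mean(X)`.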
Median
14
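The median is the middle value of the ordered observations. If the number of observations is:
a) Odd, the median is the middle observation
b) Even, the median is the average of the two middle observations
Surviving rows of the two example tables (position, value):
Table 1: (5, 99), (6, 101), (7, 103), (8, 105), (9, 114)
Table 2: (4, 98), (5, 99), (6, 101), (7, 103), (8, 105), (9, 114), (10, 121)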
Python: median()
15
import numpy as np
np.median(X)
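The odd/even rule for the median can be sketched by hand and checked against `np.median`. The helper name `median_by_hand` and the two small samples are hypothetical and used only for illustration.

```python
import numpy as np

def median_by_hand(values):
    """Median following the odd/even rule: sort, then pick the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle observation
        return ordered[mid]
    # Even count: average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical samples (unsorted on purpose)
odd_sample = [103, 99, 114, 101, 105]
even_sample = [103, 99, 114, 101, 105, 121]

print(median_by_hand(odd_sample), np.median(odd_sample))
print(median_by_hand(even_sample), np.median(even_sample))
```

The data must be ordered before picking the middle; `np.median` does this internally.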
Example: Data & Histograms (1/2)
16
[85,92,78,88,95,90,88,72,68,98,84,91,88,75,92,89,79,83,87,94,86,88,76,81,90,92,70,85,89,93,85,92,
78,88,95,90,88,72,68,98,84,91,88,75,92,89,79,83,87,94,86,88,76,81,90,92,70,85,89,93]
• To create a histogram for this data, you would first group the scores into bins or intervals. (e.g.,
60-69, 70-79, 80-89, 90-99).
• Now, you count how many students scored within each of these ranges.
60-69: 2 students
70-79: 12 students
80-89: 26 students
90-99: 20 students
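The binning and counting steps can be reproduced with `np.histogram`. This is a sketch that recomputes the counts directly from the scores as listed on this slide (the listing repeats the same 30 scores twice, so the code does the same); the variable names are illustrative.

```python
import numpy as np

# The 30 scores from the slide; the listing repeats them a second time
first_half = [85, 92, 78, 88, 95, 90, 88, 72, 68, 98,
              84, 91, 88, 75, 92, 89, 79, 83, 87, 94,
              86, 88, 76, 81, 90, 92, 70, 85, 89, 93]
scores = np.array(first_half * 2)

# Bin edges for 60-69, 70-79, 80-89, 90-99
# (each bin includes its left edge; the last bin also includes 100)
edges = [60, 70, 80, 90, 100]
counts, _ = np.histogram(scores, bins=edges)

for low, count in zip(edges[:-1], counts):
    print(f"{low}-{low + 9}: {count} students")
```

The bar heights of the histogram are exactly these counts.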
Example: Data & Histograms (2/2)
17
Source: https://www.techtarget.com/searchsoftwarequality/definition/histogram
Most student grades are around average
Most student grades are low
Most student grades are high
Comparing the Mean and Median
18
Mode
Value that occurs most often (for example, the most frequent
major among students in NGN112-04)
Highest bar in the histogram
Mode is most often used with categorical data
Python: st.mode()
20
import numpy as np
from scipy import stats as st
st.mode(X)
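Since the mode is mostly used with categorical data, here is a small sketch that finds it for text labels using `np.unique`. This is offered as an alternative to `st.mode`, whose handling of non-numeric input may differ across SciPy versions. The list of majors is hypothetical.

```python
import numpy as np

# Hypothetical majors of six students (illustrative only)
majors = np.array(['EE', 'CS', 'ME', 'CS', 'EE', 'CS'])

# Count how often each distinct value occurs
values, counts = np.unique(majors, return_counts=True)

# The mode is the value with the largest count
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print(mode_value, mode_count)
```

If two values tie for the largest count, `np.argmax` returns the first one in sorted order, so only one of the modes is reported.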
21 Describe the Spread of Quantitative Data
Range
22
import numpy as np
# Range = largest observation minus smallest observation
Range = np.max(X) - np.min(X)
print(Range)
Standard Deviation
24
Each data value has an associated deviation from the mean, x − x̄
A deviation is positive if it falls above the mean and negative if it
falls below the mean
The sum of the deviations is always zero
Standard Deviation
25
26
Python: Standard deviation std()
27
import numpy as np
np.std(X)
#or
X.std()
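The sketch below ties the deviations to the standard deviation: it checks that the deviations sum to zero, then builds the standard deviation from them. By default `np.std` divides by n (`ddof=0`); passing `ddof=1` divides by n − 1 instead. Which of the two formulas the slides intend is not recoverable from the text here, so both are shown. The data are the values used in the full code later in this section.

```python
import numpy as np

X = np.array([210, 210, 260, 210, 260, 210, 125, 140])

# Deviation of each observation from the mean
deviations = X - X.mean()
print(deviations.sum())  # zero, up to floating-point rounding

# Divide the sum of squared deviations by n (NumPy's default, ddof=0)
pop_std = np.sqrt(np.sum(deviations ** 2) / len(X))

# Divide by n - 1 instead (ddof=1)
sample_std = np.sqrt(np.sum(deviations ** 2) / (len(X) - 1))

print(pop_std, np.std(X))
print(sample_std, np.std(X, ddof=1))
```

For large n the two versions are nearly identical; for small samples the n − 1 version is noticeably larger.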
Measures of Position: Percentiles
28
Finding Quartiles
31
import numpy as np
np.min(x)
np.percentile(x, 25)
np.percentile(x, 50)
np.percentile(x, 75)
np.max(x)
The full code
import numpy as np
X = np.array([ 210,210, 260, 210, 260, 210, 125, 140])
Range=np.max(X)-np.min(X)
print('Range = ', Range)
std = np.std(X)
print('std = ', std)
n1 =np.min(X)
n2 =np.percentile(X, 25)
n3= np.percentile(X, 50)
n4 =np.percentile(X, 75)
n5= np.max(X)
print('Five number summary: ',n1,' ',n2,' ', n3, ' ', n4, ' ',n5)
print('------------------------')
Output:
Range = 135
std = 45.753244420477984
Five number summary: 125 192.5 210.0 222.5 260
------------------------
35
36 Describe Categorical Variables
Proportion & Percentage (Rel. Freq.)
37
A frequency table is a
listing of possible values
for a variable, together
with the number of
observations or relative
frequencies for each
value.
Python: Frequency Tables
39
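import pandas as pd
df = pd.DataFrame(data=['apple', 'apple', 'banana', 'orange', 'apple', 'apple',
                        'banana', 'banana', 'orange', 'banana', 'apple'], columns=['Fruit'])
print(df)
print()
absolute_frequencies = df['Fruit'].value_counts()
print(absolute_frequencies)
print()
relative_frequencies = df['Fruit'].value_counts(normalize=True)
# normalize=True divides each count by the total, which is len(df) or 11
print(relative_frequencies)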
The output:
Fruit
0 apple
1 apple
2 banana
3 orange
4 apple
5 apple
6 banana
7 banana
8 orange
9 banana
10 apple
apple 5
banana 4
orange 2
Name: Fruit, dtype: int64
apple 0.454545
banana 0.363636
orange 0.181818
Name: Fruit, dtype: float64
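To make the link between counts, proportions, and percentages explicit, the sketch below computes each step by hand for the same fruit data instead of relying on `normalize=True`. The variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(['apple', 'apple', 'banana', 'orange', 'apple', 'apple',
                    'banana', 'banana', 'orange', 'banana', 'apple'])

# Frequency: number of observations in each category
counts = fruits.value_counts()

# Proportion (relative frequency): count divided by the total
proportions = counts / counts.sum()

# Percentage: proportion multiplied by 100
percentages = (100 * proportions).round(1)

print(counts)
print(proportions)
print(percentages)
```

The proportions always add up to 1, and the percentages to 100 (apart from rounding).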
40
41 Describe Data Using Graphical Summaries
Pie Charts
42
Summarize categorical
variable
Drawn as circle where each
category is a slice
The size of each slice is
proportional to the
percentage in that category
Python: Pie Chart
43
import pandas as pd
df = pd.DataFrame(data = ['apple',
'apple', 'banana', 'orange', 'apple',
'apple', 'banana', 'banana', 'orange',
'banana', 'apple'], columns=['Fruit'])
absolute_frequencies = df['Fruit'].value_counts()
print(absolute_frequencies)
# Plot the counts Series directly; each slice is labeled with its category
absolute_frequencies.plot.pie(figsize=(5,5), autopct='%1.1f%%')
Bar Graphs
44
Summarizes categorical
variable
Vertical bars for each category
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(data = ['apple',
'apple', 'banana', 'orange', 'apple',
'apple', 'banana', 'banana', 'orange',
'banana', 'apple'], columns=['Fruit'])
sns.countplot(x='Fruit', data=df,
hue=df['Fruit'])
#or
ax=sns.countplot(x='Fruit',data=df)
Histograms
46
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
ax = sns.histplot(data = mydata)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,8))
ax = sns.histplot(data = mydata)
# Set labels and title
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of data")
import numpy as np
import seaborn as sns
# Box plot with horizontal orientation
ax = sns.boxplot(data=mydata, orient='h')
Python: Boxplot (with figure options)
55
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(4,5))
ax = sns.boxplot(mydata)
#or
ax = sns.boxplot(y = mydata)
plt.show() #optional
56 Data Preprocessing
Data Normalization
57
from sklearn import preprocessing
Z_score_scaler = preprocessing.StandardScaler()
Z_score_scaler.fit(X_train)
# The fit method computes the mean and standard deviation of each feature/column in
# X_train, which will be used for scaling later
print('means per column:', Z_score_scaler.mean_)
print('variances per column: ', Z_score_scaler.var_)
Suppose that the minimum and maximum values for the attribute income are
$12,000 and $98,000, respectively.
An income of $60,000 would have a min-max scaled value of
(60,000 - 12,000)/(98,000 - 12,000) ≈ 0.558
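The income calculation above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `min_max_scale` is hypothetical and not part of any library.

```python
def min_max_scale(value, old_min, old_max):
    """Map value from [old_min, old_max] onto [0, 1]."""
    return (value - old_min) / (old_max - old_min)

# Income example: minimum $12,000, maximum $98,000
scaled = min_max_scale(60_000, 12_000, 98_000)
print(round(scaled, 3))

# The minimum maps to 0 and the maximum maps to 1
print(min_max_scale(12_000, 12_000, 98_000))
print(min_max_scale(98_000, 12_000, 98_000))
```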
60
Data Normalization: min-max
61
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
# The fit method finds the minimum and maximum of each feature/column in X_train
• Each row represents a sample (or feature vector), and each column represents a variable/feature.
• Next, you create an instance of KBinsDiscretizer with n_bins=[4, 3, 2]. This means that you
want to divide the first variable/feature into 4 bins (one of 4 options), the second feature into 3 bins (one of 3
options), and the third feature into 2 bins(one of 2 options).
• The encode='ordinal' parameter indicates that you want to encode the bins with ordinal integers. An
ordinal number is a number that indicates the position like 1st, 2nd,…or in zero indexing 0, 1,…
• Then, you fit the KBinsDiscretizer object ‘est’ to the data X using the fit method.
• Finally, you transform the data X using the transform method of ‘est’, which discretizes the values in X into
the specified number of bins. The result is a transformed array with the same shape as X.
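The steps described above can be sketched as follows. The array `X` is hypothetical (the data used on the original slide is not shown here), and the binning strategy is left at scikit-learn's default, so the exact bin assignments depend on the data and library version.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: 6 samples (rows) x 3 features (columns)
X = np.array([[-3.0, 5.0, 15.0],
              [ 0.0, 6.0, 14.0],
              [ 6.0, 3.0, 11.0],
              [ 2.0, 4.0, 12.0],
              [-1.0, 7.0, 16.0],
              [ 4.0, 2.0, 13.0]])

# 4 bins for the first feature, 3 for the second, 2 for the third;
# encode='ordinal' labels the bins 0, 1, 2, ...
est = KBinsDiscretizer(n_bins=[4, 3, 2], encode='ordinal')

# fit learns the bin edges of each column; transform assigns each value to a bin
est.fit(X)
X_binned = est.transform(X)

print(X_binned)
print(X_binned.shape == X.shape)  # same shape as the input
```

After the transform, the first column only contains values from 0 to 3, the second from 0 to 2, and the third 0 or 1.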
Preprocessing of data:
Encoding categorical features
64
• Often, features are not given as continuous values but as categorical ones.
These need to be converted into numbers prior to using machine learning
• Types of Encoders:
• Ordinal Encoders {1st,2nd,3rd,…} or {0, 1, 2,…} for multi-dimensional data as in
the example above
• Label Encoders (similar to ordinal encoders but for 1-D row arrays).
• One Hot Encoding: Binary Encoding 0 or 1
Preprocessing of data: Ordinal Encoding
An ordinal number is a number that indicates the position like 1st, 2nd,…
65
enc = preprocessing.OrdinalEncoder()
enc.fit(X)
rst = enc.transform(X_sample)
print(rst)
Output:
[[0. 2. 3.]]
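Because the training data `X` and the sample `X_sample` are not shown on the slide, here is a self-contained sketch with a hypothetical two-row `X`. Its output therefore differs from the slide's output; it only illustrates how the encoder assigns integers.

```python
from sklearn import preprocessing

# Hypothetical training data: each row is a sample, each column a categorical feature
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OrdinalEncoder()
enc.fit(X)

# Categories are sorted within each column, then numbered 0, 1, 2, ...
print(enc.categories_)

rst = enc.transform([['female', 'from US', 'uses Safari']])
print(rst)
```

In this sketch 'female' becomes 0 (it sorts before 'male'), while 'from US' and 'uses Safari' each become 1.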
In this case study, we will use the Iris sample data, which contains information on 150
Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica.
Each flower is characterized by five attributes:
1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica); this is the label
import pandas as pd
# In Colab, upload the iris.data file and locate it under the "Sample Data" folder. Then
# right-click on the uploaded "iris.data" file and copy its path
data = pd.read_csv('iris.data', header=None)
# or
# data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
print(data.head())
Output:
sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setos
Step 2: Numerical summaries
71
absolute_frequencies = data['species'].value_counts()
print(absolute_frequencies)
Output:
species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: count, dtype: int64
print(data.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length 150 non-null float64
1 sepal width 150 non-null float64
2 petal length 150 non-null float64
3 petal width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
Step 2: Numerical summaries
72
print(data.describe(include='all'))
Output:
sepal length sepal width petal length petal width species
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN Iris-setosa
freq NaN NaN NaN NaN 50
mean 5.843333 3.054000 3.758667 1.198667 NaN
std 0.828066 0.433594 1.764420 0.763161 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 Na
Step 2: Numerical summaries
73
print(data.describe())
Output:
sepal length sepal width petal length petal width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Step 3: Visual Summaries - Histogram
74
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
sns.histplot(data['sepal length'], bins=8)
plt.show()
# or
# data['sepal length'].hist(bins=8)
Step 3: Visual Summaries – Box Plots
75
plt.figure()
plt.xlabel('Feature')
plt.ylabel('Value (cm)')
plt.title('Data Boxplot')
sns.boxplot(data)
plt.show()
# or
plt.figure()
plt.xlabel('Feature')
plt.ylabel('Value (cm)')
plt.title('Data Boxplot')
data.boxplot()
plt.show()
Step 3: Visual Summaries – Scatter Plots
76
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
# Select two columns for the scatter plot
# Create a scatter plot of the selected columns
sns.scatterplot(data=data[0:150], x='sepal length', y='sepal width', hue='species')
plt.title('Iris Flowers')
plt.show()
In Scatter Plots:
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(data, hue='species')
plt.show()
plt.figure()
plt.show()
Step 3: Visual Summaries – Heat Map
79
plt.figure()
data_numerical_columns = data.select_dtypes(include=['number'])
sns.heatmap(data_numerical_columns.corr(),annot=True)
plt.show()
You can filter DataFrames to obtain a subset of the data prior to plotting if needed.
For example, assume that you want to filter the iris dataset for flowers with a class type of ‘setosa’.
You can write one of the following:
data = pd.read_csv('iris.data', header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width',
'species']
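The filtering described above can be sketched in two equivalent ways: a boolean mask and `DataFrame.query`. To keep the example self-contained, it uses a tiny stand-in DataFrame instead of the full `iris.data` file; the rows and the variable names `setosa` and `setosa_alt` are illustrative.

```python
import pandas as pd

# Tiny stand-in for the iris DataFrame (illustrative rows only)
data = pd.DataFrame({
    'sepal length': [5.1, 7.0, 6.3, 4.9],
    'sepal width': [3.5, 3.2, 3.3, 3.0],
    'species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'],
})

# Option 1: boolean mask, keeps the rows where the condition is True
setosa = data[data['species'] == 'Iris-setosa']

# Option 2: query string, same result
setosa_alt = data.query("species == 'Iris-setosa'")

print(setosa)
```

Note that the labels in the file are spelled 'Iris-setosa', not 'setosa', so the filter value has to match that spelling exactly.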
Preprocessing of data: One Hot Encoding
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit(X)
rst = enc.transform([['female', 'from US', 'uses Safari'],
                     ['male', 'from Europe', 'uses Firefox']])
print(rst.toarray())
Output:
[Figure: binary output matrix with its columns labeled male / female, From US / From Europe, Uses Safari / Uses Firefox]
Note: In the output, the number of binary columns is equal to the number of values of a variable.
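A self-contained sketch of one-hot encoding follows. The training data `X` is an assumption (it is not shown on the slide); it is chosen so that each of the three variables has two possible values, which gives 2 + 2 + 2 = 6 binary columns.

```python
from sklearn import preprocessing

# Assumed training data: 3 categorical variables with 2 values each
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = preprocessing.OneHotEncoder()
enc.fit(X)

# Column order follows the sorted categories of each variable
print(enc.categories_)

rst = enc.transform([['female', 'from US', 'uses Safari'],
                     ['male', 'from Europe', 'uses Firefox']])

# transform returns a sparse matrix; toarray() makes it a regular array
print(rst.toarray())
```

Each row contains exactly one 1 per variable, in the column that matches the sample's value.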
Extra: Step 2: Numerical summaries
Location: After slide#73
84
https://colab.research.google.com/drive/1FKJldbBKkBNELM_28y6l0gRvUHZRLim8?usp=sharing