HCI - Notes-Ch3
A Data Science Project
• Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms and systems to extract
knowledge and insights from structured and
unstructured data. It unifies statistical data analysis, machine
learning-based data analysis, domain knowledge and their
related methods in order to understand and analyze actual
phenomena and to test hypotheses with data.
A Data Science Project
A DS project follows these steps:
Feature Engineering:
1. Create new features from the existing dataset that can potentially enhance the
predictive power of the model, such as calculating the average usage over a specific
time period, frequency of interactions, or customer tenure.
2. Transform and preprocess data to make it suitable for the model, such as one-hot
encoding categorical variables and scaling numerical features.
Model Creation:
1. Build a predictive model, such as a machine learning algorithm (e.g., logistic
regression, random forest, or neural network), to predict the likelihood of customer
churn based on the engineered features.
2. Train the model using historical data and evaluate its performance using appropriate
metrics, such as accuracy, precision, recall, and F1-score.
Impact Analysis:
1. Analyze the model's predictions and assess its effectiveness in identifying potential
churners.
2. Calculate the projected impact of implementing retention strategies based on the
model's predictions, such as estimated patient survival rate.
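A minimal sketch of the feature engineering and evaluation steps above, in plain Python. The customer records, field names, labels, and predictions are hypothetical values invented for illustration. No model is trained here; the predictions are simply assumed so that the metrics can be computed.

```python
# Hypothetical customer records (all names and values are illustrative assumptions).
customers = [
    {"plan": "basic", "tenure_months": 2, "monthly_usage": [10, 14, 12]},
    {"plan": "premium", "tenure_months": 26, "monthly_usage": [40, 38, 45]},
    {"plan": "basic", "tenure_months": 14, "monthly_usage": [20, 25, 21]},
]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector, one position per category."""
    return [1 if value == category else 0 for category in categories]


def min_max_scale(values):
    """Rescale numerical values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = churner)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Feature engineering: a new feature (average usage), one-hot encoding, scaling.
plans = sorted({c["plan"] for c in customers})
avg_usage = [sum(c["monthly_usage"]) / len(c["monthly_usage"]) for c in customers]
scaled_tenure = min_max_scale([c["tenure_months"] for c in customers])
features = [
    one_hot(c["plan"], plans) + [usage, tenure]
    for c, usage, tenure in zip(customers, avg_usage, scaled_tenure)
]

# Evaluation: assumed true labels and assumed model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)

print(features)
print(metrics)
```

In practice a library would usually handle the encoding, scaling, and metrics; the functions are written out here only to make each step visible.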
Data Cleaning
• Data cleaning (or data cleansing) is the process of detecting
and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database. (Source: Wikipedia)
• The main issues in data cleaning are:
– Missing Values: some data can be absent. For example, the blood pressure of some
patients can be unknown.
– Outliers: some data can be highly atypical. For example, some patients may be older
than 100.
– Errors: some data can be corrupt. For example, heart rate can become null because the
connected machine disconnects while moving the patient.
– Duplicated Data: some data can be redundant. For example, we could have birth date,
age, and admission date (birthdate = admission_date – age).
– Pre-Calculation (dependent variables): some required data can be calculated from the
available data. For example, body mass index can be calculated from the patient’s height
and weight (BMI = weight (in kg) / height (in m)^2).
– Useless Features: some features can be irrelevant to the current DS project. For
example, some studies may not need the patient’s gender.
– Useless Cases: some cases can be irrelevant to the current DS project. For example,
pediatric studies may remove patients older than 18.
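A few of the issues above can be sketched in plain Python. The patient records, field names, and thresholds below are hypothetical and only illustrate the checks; real projects would typically run them on a full table.

```python
# Hypothetical patient records (field names and values are illustrative assumptions).
patients = [
    {"id": 1, "age": 45, "systolic_bp": 120, "weight_kg": 70.0, "height_m": 1.75},
    {"id": 2, "age": 104, "systolic_bp": 135, "weight_kg": 58.0, "height_m": 1.60},
    {"id": 3, "age": 12, "systolic_bp": None, "weight_kg": 40.0, "height_m": 1.50},
    {"id": 4, "age": 16, "systolic_bp": 110, "weight_kg": 55.0, "height_m": 1.65},
]

# Missing values: records whose blood pressure is unknown.
missing_bp = [p["id"] for p in patients if p["systolic_bp"] is None]

# Outliers: records with a highly atypical age (older than 100).
age_outliers = [p["id"] for p in patients if p["age"] > 100]

# Pre-calculation: derive BMI = weight (kg) / height (m) squared.
for p in patients:
    p["bmi"] = p["weight_kg"] / p["height_m"] ** 2

# Useless cases: a pediatric study drops patients older than 18.
pediatric = [p for p in patients if p["age"] <= 18]

print(missing_bp, age_outliers, [p["id"] for p in pediatric])
print(round(patients[0]["bmi"], 2))
```

Whether a flagged record is corrected, imputed, or removed is a project decision; the sketch only detects the cases.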
Feature Engineering and Model Creation
• Definition (feature engineering): the process of using domain knowledge to
extract features from raw data via data mining techniques. (Source: Wikipedia)
Statistical Analysis of Health Care Data
• Sample Tools
– MS Excel
– Python
Sample Tool: MS Excel
• Rationale:
– Accessibility: “everybody” has MS Excel on their laptop.
– Applicability: health care professionals are used to working with
MS Excel to store and process their databases.
– Simplicity: all the statistics involved in the course are easily
computed with MS Excel.
Sample Tool: Python
• Rationale:
– Accessibility: it is free and open source.
– Programmable: analyses can be embedded within
computer programs.
– Simple: with a little guidance, statistics in Python are
easy.
– Powerful: statistical functions are fast.
– Complete: Python provides a wide variety of statistical
functions that are implemented and ready to use.
Types of variables
• The types of variables used in a study influence both the
descriptive and the inferential statistics used during the analysis phase.
Categorical Variables (two or more groups or “categories” being
measured)
Numerical Variables
• Numerical: the measurement can be quantified as a number
• Continuous – uninterrupted; any number is possible (e.g., 1.25)
– Examples:
• Age
• Temperature
• Discrete – integers; only some numbers are possible
– Examples:
• Number of children
• Number of strokes a patient has had
How do you summarize your data?
• Descriptive statistics: used to describe, organize, and summarize data
– Categorical variables: number (N), frequency (%)
– Numerical variables:
• mean + standard deviation
• median + range
• mode
• Graphs, charts:
– Categorical: contingency tables, bar charts
• Note: Not pie charts—relative areas of the pie are
difficult to distinguish!
– Numerical: shape of the distribution, box plots, histograms
Summarizing Categorical Data
Summarizing Numerical Data
Key questions to ask:
• How are the data distributed?
– Where is the center?
– What is the range?
– What’s the shape of the distribution? (e.g.,
Gaussian (normal), right- or left-skewed)
“Where is the center?”
Measures of central tendency
• Mean
• Median
• Mode
“Where is the center?”
Measures of central tendency: Mean
• Mean – the average; the balancing point
– Calculation: the sum of values divided by the sample size
– In math shorthand:
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
• Example Mean calculation:
– Age of participants: 17 19 21 22 23 23 23 38
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{17 + 19 + 21 + 22 + 23 + 23 + 23 + 38}{8} = 23.25$$
“Where is the center?”
Measures of central tendency: Mean
[Figure: two dot plots on a 0 to 10 axis.
Left: mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3.
Right: mean = (1 + 2 + 3 + 4 + 10) / 5 = 20 / 5 = 4.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
“Where is the center?”
Measures of central tendency: Median
• Median – the exact middle value
Calculation:
– If there are an odd number of observations, find the middle value.
– If there are an even number of observations, find the middle two
values and average them.
• Example:
– Age of participants: 17 19 21 22 23 23 23 38
Median = (22 + 23) / 2 = 22.5
“Where is the center?”
Measures of central tendency: Median
[Figure: two dot plots on a 0 to 10 axis, Scenario One and Scenario Two. Median = 3 in both scenarios.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
“Where is the center?”
Measures of central tendency: Mode
• Mode – the value that occurs most often
• Example:
– Age of participants: 17 19 21 22 23 23 23 38
Mode = 23
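The three measures of central tendency can be checked on the participants' ages with Python's standard `statistics` module. This is a small sketch; the variable names are arbitrary.

```python
import statistics

# Age of participants, as in the example above.
ages = [17, 19, 21, 22, 23, 23, 23, 38]

mean_age = statistics.mean(ages)      # sum of values divided by the sample size
median_age = statistics.median(ages)  # average of the two middle values (n is even)
mode_age = statistics.mode(ages)      # most frequent value

print(mean_age, median_age, mode_age)
```

The single large value (38) pulls the mean above the median, which is one reason the choice of measure matters.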
Deciding which measure of central tendency to use
Which measures of central tendency to
use? Shape of the distribution
Shape describes how numerical data are distributed:
1. Symmetric: same shape on both sides of the mean.
– Mean = Median
2. Skewed: outlying observations occur in only one direction
– Left skewed: outlying values are small (left of center)
• Mean < Median
– Right skewed: outlying values are large (right of center)
• Mean > Median
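The relation between mean and median can be illustrated with three small made-up samples. The function name and the data are assumptions for this sketch, and comparing mean with median is only a rough hint about shape, not a formal test of skewness.

```python
import statistics


def compare_center(values):
    """Return (mean, median, hint), where hint is a rough guess at the shape."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        hint = "right-skewed"
    elif mean < median:
        hint = "left-skewed"
    else:
        hint = "symmetric"
    return mean, median, hint


samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "large outlier": [1, 2, 3, 4, 10],
    "small outlier": [0, 6, 7, 8, 9],
}

for name, values in samples.items():
    print(name, compare_center(values))
```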
Shape of the distribution
Which measure of central tendency to use? General guidelines
“What is the range?”
Measures of Variation/Spread
Measures of variation give information on the spread or variability
of the data values.
– Range
– Percentiles/quartiles
– Interquartile range (IQR)
– Standard deviation/Variance
[Figure: two distributions with the same center and different variation.]
Quartiles
Interquartile Range
• Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1
• IQR contains the central 50% of the observations.
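As a sketch, the range and the IQR can be computed for the participants' ages with the standard library. Several conventions exist for interpolating quartiles, so other tools may return slightly different Q1 and Q3; the `"inclusive"` method is an assumption made here.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Range: distance between the largest and the smallest value.
value_range = max(ages) - min(ages)

# Quartiles with linear interpolation over the sorted data ("inclusive" method).
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

print(value_range)      # 21
print(q1, q2, q3, iqr)  # 20.5 22.5 23.0 2.5
```

The range is dominated by the single value 38, while the IQR only reflects the central part of the data.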
Variance
• Average (roughly) of squared deviations of values from the mean
(Sample)
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$
(Population)
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n}$$
In both formulas $\bar{X}$ is the mean.
Standard Deviation
• Most commonly used measure of variation
• Is the square root of the variance; has same units as original data
(Sample)
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$$
(Population)
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n}}$$
Standard Deviation vs Variance
Calculation Example:
Sample Standard Deviation
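One way to work through the calculation step by step is sketched below, using the participants' ages from the earlier slides as the data (an assumption, since the original worked example is not reproduced here). The hand computation is compared against the `statistics` module.

```python
import math
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]
n = len(ages)

# Step 1: the mean.
mean = sum(ages) / n  # 23.25

# Step 2: squared deviations of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in ages]
sum_sq = sum(squared_deviations)  # 281.5

# Step 3: divide by n - 1 (sample) or n (population).
sample_variance = sum_sq / (n - 1)
population_variance = sum_sq / n  # 35.1875

# Step 4: the standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print(round(sample_variance, 4), round(sample_sd, 4))
print(round(population_variance, 4), round(population_sd, 4))

# Cross-check with the library functions.
print(statistics.variance(ages), statistics.stdev(ages))
print(statistics.pvariance(ages), statistics.pstdev(ages))
```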
Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 axis, all with mean = 15.5.
Data A: SD = 3.338
Data B: SD = 0.926
Data C: SD = 4.570]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
The Normal Distribution
The beauty of the normal curve
Normal distribution: 68-95-99.7 Rule
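The rule says that, for a normal distribution, about 68%, 95%, and 99.7% of the values lie within 1, 2, and 3 standard deviations of the mean. A short sketch with `statistics.NormalDist` (standard library, Python 3.8+) reproduces these percentages; the helper name `prob_within` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def prob_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.1%}")
```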
Graphical displays of numerical data
Displaying numerical data
Box plot (aka box-and-whisker plot)
• Graphically shows the quartiles, mean, median,
maximum, and minimum of the data.
– The ends of the box are drawn at the first and third quartiles
(cut-offs for the lowest and highest 25%)
– A line is drawn inside the box at the median (exact middle value)
– Lines, called whiskers, are extended from the ends of the box
out to the minimum and maximum
– Often, the mean (the average) is indicated with an asterisk,
cross, or dotted line
Box Plot for a symmetric distribution
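[Figure: box plot of a symmetric distribution, labeled from top to bottom:
upper whisker at the maximum or Q3 + (1.5 × IQR);
box spanning the interquartile range, with the median (Q2) drawn inside and an asterisk marker;
lower whisker at the minimum or Q1 − (1.5 × IQR).
Axis label: variable name.]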
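The quantities a box plot draws can be computed directly. The sketch below assumes the common convention in which each whisker stops at the most extreme observation inside the fences Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, and anything beyond is reported as an outlier. The function name and the quartile method are assumptions.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, fences, whisker ends and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


summary = box_plot_summary([17, 19, 21, 22, 23, 23, 23, 38])
for name, value in summary.items():
    print(name, value)
```

For the participants' ages, the value 38 falls above the upper fence, so the upper whisker ends at 23 and 38 is drawn as a separate point.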
Displaying numerical data:
Histogram
• Gives the percentage (proportion) of the study population in
ranges of the continuous variable or in categories
• X-axis: measure of interest
• Y-axis: number or percentage of observations
[Figure: histogram of Age (x-axis, 20.0 to 30.0) against Count (y-axis, 0.0 to 25.0).]
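The numbers behind a histogram are counts (or percentages) of observations per range. A minimal sketch follows; the ages, the bin width, and the function name are hypothetical choices, not the data behind the figure.

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values falling in n_bins consecutive ranges of equal width.

    Each bin includes its left edge; the last bin also includes its right edge.
    Values outside [start, start + width * n_bins] are ignored.
    """
    stop = start + width * n_bins
    counts = [0] * n_bins
    for v in values:
        if not (start <= v <= stop):
            continue
        index = min(int((v - start) // width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical ages between 20 and 30.
ages = [20, 21, 21, 22, 23, 24, 24, 25, 26, 27, 29, 30]

counts = histogram_counts(ages, start=20, width=2, n_bins=5)
percentages = [100 * c / len(ages) for c in counts]

for i, (c, pct) in enumerate(zip(counts, percentages)):
    low = 20 + 2 * i
    print(f"{low}-{low + 2}: count={c}, {pct:.1f}%")
```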
Distribution shape and box plot
[Figure: three box plots, each marked with Q1, Q2 and Q3, for different distribution shapes.]
• Interquartile range: obtain the first and third quartiles, Q1 and Q3; the interval [Q1, Q3] is
the interquartile range and contains the central 50% of the data.
Statistical Description of Data
Population Description (N = 451)
Variable              Type       Mean      St. Dev.  95% CI
Age                   Numeric    46.4080   16.4298   44.8916 to 47.9243
Sex                   Categoric  male 44.79%, female 55.21%
Height                Numeric    166.1353  37.1946   162.7025 to 169.5681
Weight                Numeric    68.1441   16.5998   66.6121 to 69.6762
QRS duration          Numeric    88.9224   15.3814   87.5028 to 90.3420
P-R interval          Numeric    155.0953  44.8755   150.9537 to 159.2370
Q-T interval          Numeric    367.2239  33.4208   364.1395 to 370.3084
T interval            Numeric    169.9335  35.6711   166.6413 to 173.2257
P interval            Numeric    89.9756   25.8480   87.5900 to 92.3612
Heart Rate            Numeric    74.4634   13.8707   73.1833 to 75.7436
Ragged R wave         Categoric  exists 0.22%, not exists 99.78%
Diphasic der. R wave  Categoric  exists 1.11%, not exists 98.89%
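The 95% CI column appears consistent with the normal approximation, mean ± z × SD / √N. The sketch below applies that formula to the Age row; the function name is an assumption, and the formula is inferred from the numbers rather than stated in the table.

```python
import math
from statistics import NormalDist


def normal_ci(mean, sd, n, confidence=0.95):
    """Confidence interval for a mean using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Age row of the table: mean 46.4080, St. Dev. 16.4298, N = 451.
low, high = normal_ci(46.4080, 16.4298, 451)
print(round(low, 3), round(high, 3))
```

The result agrees with the tabulated interval for Age to about three decimal places; the small difference comes from rounding in the table.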
Descriptive Statistics with Python
• Context
– NumPy: Python library adding support for large, multi-dimensional arrays and matrices, along
with a large collection of high-level mathematical functions to operate on these arrays.
– SciPy: Python library built on NumPy that provides additional functions for optimization, linear
algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE
solvers and other tasks common in science and engineering.
– Imports:
    import numpy as np
    import statistics
    import scipy
    import math
    import collections
• Qualitative Variables
    data = ['a','a','a','e','e','e','e','e','e','i','o','o','o','o','o','u','u','u','u','u']
    Frequency of all categories: collections.Counter(data)
    Frequency of one category: collections.Counter(data)['a']
    Percentage: collections.Counter(data)['a'] / len(data) * 100
• Quantitative Variables
    data = [1,2,3,3,3,4,4,4,4,5,5,5,6,6,7,7,7,7,7,8,10,10,10]
    N: n = len(data)
    Mean: mean = statistics.mean(data)
    Median: statistics.median(data)
    Mode: statistics.mode(data)
    Variance (sample, n - 1): statistics.variance(data)
    Std. Deviation (sample): stdev = statistics.stdev(data)
    95% CI (normal): from scipy import stats
                     cv = stats.norm.ppf(0.975)
                     error = cv * stdev / math.sqrt(n)
                     CI = (mean - error, mean + error)
    Quartiles: q1 = np.quantile(data, 0.25)
               q2 = np.quantile(data, 0.5)
               q3 = np.quantile(data, 0.75)
• Results for the quantitative data
    n = 23
    mean = 5.565217391304348
    median = 5
    mode = 7
    variance = 6.3478260869565215
    standard deviation = 2.5194892512087685
    CI (95%) = (4.5355506551396925, 6.594884127469003)
    Quartiles = 4.0 5.0 7.0
See the presentation on data description and visualization in Python.
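The expressions on this slide can be gathered into one runnable script. To keep it dependency-free, this sketch swaps in the standard library for the two external calls: `statistics.NormalDist().inv_cdf` stands in for `scipy.stats.norm.ppf`, and `statistics.quantiles(..., method="inclusive")` stands in for `np.quantile`. Both substitutions are assumptions of this sketch, not the slide's own code.

```python
import collections
import math
import statistics
from statistics import NormalDist

# Qualitative variable: frequencies and percentages.
vowels = ['a', 'a', 'a', 'e', 'e', 'e', 'e', 'e', 'e', 'i',
          'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'u']
counts = collections.Counter(vowels)
pct_a = counts['a'] / len(vowels) * 100

# Quantitative variable: the usual descriptive statistics.
data = [1, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 10, 10, 10]
n = len(data)
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)  # sample variance (n - 1)
stdev = statistics.stdev(data)        # sample standard deviation

# 95% confidence interval for the mean (normal approximation).
cv = NormalDist().inv_cdf(0.975)
error = cv * stdev / math.sqrt(n)
ci = (mean - error, mean + error)

# Quartiles with linear interpolation.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(counts, pct_a)
print(n, mean, median, mode)
print(variance, stdev)
print(ci)
print(q1, q2, q3)
```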