HCI - Notes-Ch3

This document discusses data analysis techniques for clinical data, including descriptive statistics, inferential statistics, and machine learning methods. It covers summarizing categorical data using contingency tables and frequencies, summarizing numerical data using measures of central tendency and distribution shape. Key steps in a data science project like goal setting, data extraction, cleaning, feature engineering, model creation, and impact analysis are also outlined. Statistical analysis tools like Excel and Python are presented.

3. Clinical Data Analysis


• A Data Science Project
• Statistical Analysis of Health Care Data
– Descriptive Statistics
– Inferential Statistics
– Regression
• Artificial Intelligence Analysis of Health Care Data
– Unsupervised Machine Learning
– Supervised Machine Learning

1
A Data Science Project
• Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms and systems to extract
knowledge and insights from structured and
unstructured data. It unifies statistical data analysis, machine
learning-based data analysis, domain knowledge and their
related methods in order to understand and analyze actual
phenomena, and to test hypotheses, with data.

• A data science (DS) project follows these steps:


1. Goals and Objectives Setting
2. Data Extraction
3. Data Cleaning
4. Feature Engineering
5. Model Creation
6. Impact Analysis

2
A Data Science Project

A DS project follows these steps:

Goals and Objectives Setting:


1. Define the specific objectives of the project, such as reducing the death rate by a
certain percentage within a defined time frame.
2. Set the goal of building a predictive model to estimate patient survival
and implement proactive intervention strategies.
Data Extraction:
1. Gather relevant data from various sources, such as patients' electronic health records (EHR).
2. Extract data related to patients' demographics, diseases, and clinical patterns.
Data Cleaning:
1. Identify and handle missing values, outliers, and inconsistencies in the dataset.
2. Standardize data formats, resolve discrepancies, and ensure data integrity.

3
A Data Science Project
A DS project follows these steps:

Feature Engineering:
1. Create new features from the existing dataset that can potentially enhance the
predictive power of the model, such as the average of a measurement over a specific
time period, the frequency of hospital visits, or the length of stay.
2. Transform and preprocess data to make it suitable for the model, such as one-hot
encoding categorical variables and scaling numerical features.
Model Creation:
1. Build a predictive model, such as a machine learning algorithm (e.g., logistic
regression, random forest, or neural network), to predict the likelihood of patient
survival based on the engineered features.
2. Train the model using historical data and evaluate its performance using appropriate
metrics, such as accuracy, precision, recall, and F1-score.
Impact Analysis:
1. Analyze the model's predictions and assess its effectiveness in identifying patients
at risk.
2. Calculate the projected impact of implementing intervention strategies based on the
model's predictions, such as the estimated patient survival rate.
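
The evaluation metrics named under Model Creation can be computed directly from predicted and true labels. The following is a minimal sketch, not part of the original slides; the function name `classification_metrics` and the example labels are hypothetical, and binary labels with 1 as the positive class are assumed.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In clinical settings recall (sensitivity) often matters more than accuracy, because a missed positive case is usually costlier than a false alarm.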

4
Data Cleaning
• Data cleaning (or data cleansing) is the process of detecting
and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database (Wikipedia).
• The main issues in data cleaning are:
– Missing Values: some data can be absent. For example, the blood pressure of some
patients can be unknown.
– Outliers: some data can be highly atypical. For example, some patients may be older
than 100.
– Errors: some data can be corrupt. For example, heart rate can become null because the
connected machine disconnects while moving the patient.
– Duplicated Data: some data can be redundant. For example, we could have birth date,
age, and admission date (birthdate = admission_date – age).
– Pre-Calculation (dependent variables): some required data can be calculated from the
available data. For example, body mass index can be calculated from the patient's height
and weight: $\mathrm{BMI} = \mathrm{weight\ (kg)} / \mathrm{height\ (m)}^2$.
– Useless Features: some features can be irrelevant to the current DS project. For
example, some studies may not need the patient's gender.
– Useless Cases: some cases can be irrelevant to the current DS project. For example,
pediatric studies may remove patients older than 18.
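
Several of the cleaning issues listed above can be handled in a few lines. This is a minimal sketch under stated assumptions: the records, the field names, the age limits and the choice of median imputation are all hypothetical and are not taken from the slides. Whether an atypical value should be dropped or kept is a study-specific decision.

```python
import statistics

# Hypothetical patient records; None marks a missing value
patients = [
    {"id": 1, "age": 34, "height_m": 1.70, "weight_kg": 68.0, "systolic_bp": 120},
    {"id": 2, "age": 51, "height_m": 1.62, "weight_kg": 80.5, "systolic_bp": None},
    {"id": 3, "age": 104, "height_m": 1.55, "weight_kg": 49.0, "systolic_bp": 135},
    {"id": 4, "age": 12, "height_m": 1.50, "weight_kg": 41.0, "systolic_bp": 105},
]


def clean(records, min_age=18, max_age=100):
    """Drop out-of-scope cases, impute missing blood pressure, add BMI."""
    known_bp = [r["systolic_bp"] for r in records if r["systolic_bp"] is not None]
    bp_median = statistics.median(known_bp)  # one simple imputation choice

    cleaned = []
    for r in records:
        # Useless cases / outliers: keep only ages inside the study range
        if not (min_age <= r["age"] <= max_age):
            continue
        row = dict(r)
        # Missing values: replace an unknown blood pressure by the median
        if row["systolic_bp"] is None:
            row["systolic_bp"] = bp_median
        # Pre-calculation: BMI = weight (kg) / height (m)^2
        row["bmi"] = row["weight_kg"] / row["height_m"] ** 2
        cleaned.append(row)
    return cleaned


cleaned = clean(patients)
for row in cleaned:
    print(row["id"], row["systolic_bp"], round(row["bmi"], 1))
```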
5
Feature Engineering and Model Creation
• Def (Feature engineering): the process of using domain knowledge to
extract features from raw data via data mining techniques (Wikipedia).

• Def (Scientific modelling): the process of making a particular part or
feature of the world easier to understand, define, quantify, visualize, or
simulate by referencing it to existing and usually commonly accepted
knowledge. It identifies relevant aspects of a situation in the real world
and then uses different types of models for different aims, such as
conceptual models to better understand, operational models to
operationalize, mathematical models to quantify, and graphical models to
visualize the subject.

• In this course, Model Creation refers to the use of artificial intelligence
(AI) algorithms to process available data and produce mathematical
models (or black-box models) that can be used to infer facts about new
data (to be considered later).
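
Two common feature engineering transformations, one-hot encoding of a categorical variable and scaling of a numerical one, can be sketched as follows. The helper names `one_hot` and `min_max_scale` and the sample values are hypothetical; min-max scaling is only one of several scaling options.

```python
def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def min_max_scale(values):
    """Rescale numbers linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature columns
sex = ["male", "female", "female", "male"]
heart_rate = [60, 75, 90, 120]

encoded, categories = one_hot(sex)
scaled = min_max_scale(heart_rate)
print(categories, encoded)
print(scaled)
```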

6
Statistical Analysis of Health Care Data
• Sample Tools
– MS Excel
– Python

• Statistical Data Analysis: Descriptive Statistics


• Descriptive Statistics with Python
• Case Study 1: Data Description with Python

• Statistical Data Analysis: Inferential Statistics


• Inferential Statistics with Python
• Case Study 2: Data Analysis with Python

7
Sample Tool: MS Excel

• Rationale:
– Accessibility: almost everybody has MS Excel on their laptop.
– Applicability: health care professionals are used to working with
MS Excel to store and process their databases.
– Simplicity: all the statistics involved in the course are easily
performed with MS Excel.

8
Sample Tool: Python

• Rationale:
– Accessibility: it is free and open source.
– Programmable: analyses can be embedded within
computer programs.
– Simple: with minimal guidance, statistics with Python is
easy.
– Powerful: statistical functions are fast.
– Complete: Python provides a wide variety of statistical
functions that are implemented and ready to be used.

9
Types of variables
• The types of variables used in a study influence both the
descriptive and inferential statistics during the analysis phase.

• However, you need to pay attention to this already in the planning phase.

1. What type of data (i.e., variable) is it?


• Categorical: nominal or ordinal
• Numerical: continuous or discrete

2. How do you summarize and present it?


• Numerical summary statistics
• Graphs, charts

10
Categorical Variables (two or more groups or “categories” being
measured)

• Nominal variables – i.e. “names” → descriptive only, no natural order
– Examples: Gender, Race/ethnicity
• Ordinal variables
– The sequence of categories is meaningful – it is “ordered”
– We assign numbers to the categories
– But the intervals between those numbers are not
meaningful (they are not equally spaced)
– Examples: Likert scales (1-5: strongly disagree, disagree,
neutral, agree, strongly agree)

11
Numerical Variables
Numerical – the measurement can be quantified as a number
• Continuous – uninterrupted; any number is possible (e.g.
1.25)
– Examples:
• Age
• Temperature
• Discrete – integers; only some numbers are possible
– Examples:
• Number of children
• Number of strokes a patient has had

12
How do you summarize your data?
• Descriptive statistics: used to describe, organize, and summarize data
– Categorical variables: number (N), frequency (%)
– Numerical variables:
• mean + standard deviation
• median + range
• mode
• Graphs, charts:
– Categorical: contingency tables, bar charts
• Note: Not pie charts—relative areas of the pie are
difficult to distinguish!
– Numerical: shape of the distribution, box plots, histograms

13
Summarizing Categorical Data

• Contingency tables: number, frequency


Table 1. Barcelona residency match results, 2010

Residency Site   N. patients   %
Barcelona        23            24.7%
Tarragona        15            16.1%
Other            55            59.1%
Total            93            100.0%
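
A frequency table like Table 1 can be produced from raw categorical values. The sketch below rebuilds the raw list from the counts in the table, so the list itself is an assumption made for illustration.

```python
from collections import Counter

# Raw categorical observations, reconstructed from the counts in Table 1
sites = ["Barcelona"] * 23 + ["Tarragona"] * 15 + ["Other"] * 55

counts = Counter(sites)
total = sum(counts.values())

# For each category: (number of cases, percentage rounded to one decimal)
table = {site: (n, round(100 * n / total, 1)) for site, n in counts.items()}

for site, (n, pct) in table.items():
    print(f"{site:<10} {n:>3} {pct:>5.1f}%")
print(f"{'Total':<10} {total:>3} {100.0:>5.1f}%")
```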

14
Summarizing Numerical Data
Key questions to ask:
• How are the data distributed?
– Where is the center?
– What is the range?
– What’s the shape of the distribution? (e.g.,
Gaussian (normal), right- or left-skewed)

• Are there “outliers”?

• Are there data points that don’t make sense?

15
“Where is the center”?
Measures of central tendency
• Mean
• Median
• Mode

16
“Where is the center?”
Measures of central tendency: Mean
• Mean – the average; the balancing point
– Calculation: the sum of values divided by the sample size
– In math shorthand: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
• Example Mean calculation:
– Age of participants: 17 19 21 22 23 23 23 38
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{17 + 19 + 21 + 22 + 23 + 23 + 23 + 38}{8} = \frac{186}{8} = 23.25$

17
“Where is the center?”
Measures of central tendency: Mean

• Should not be used with ordinal data


• The mean is affected by extreme values (outliers)
Scenario One (values 1, 2, 3, 4, 5): Mean = $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
Scenario Two (values 1, 2, 3, 4, 10): Mean = $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

18
“Where is the center?”
Measures of central tendency: Median
• Median – the exact middle value
Calculation:
– If there are an odd number of observations, find the middle value.
– If there are an even number of observations, find the middle two
values and average them.
• Example:
– Age of participants: 17 19 21 22 23 23 23 38

Median = (22+23)/2 = 22.5

19
“Where is the center?”
Measures of central tendency: Median

• The median is NOT affected by extreme values (outliers).

Scenario One (values 1, 2, 3, 4, 5): Median = 3
Scenario Two (values 1, 2, 3, 4, 10): Median = 3

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

20
“Where is the center?”
Measures of central tendency: Mode

• Mode – the value that occurs most frequently

• Example:
– Age of participants: 17 19 21 22 23 23 23 38

Mode = 23
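
The three measures of central tendency can be checked with Python's standard `statistics` module. The sketch uses the age data from the slides and the two five-value scenarios to show that an extreme value moves the mean but not the median.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # average of the two middle values (n is even)
mode_age = statistics.mode(ages)      # most frequent value

print(mean_age, median_age, mode_age)

# Effect of an extreme value: only the last observation differs
scenario_one = [1, 2, 3, 4, 5]
scenario_two = [1, 2, 3, 4, 10]

print(statistics.mean(scenario_one), statistics.median(scenario_one))
print(statistics.mean(scenario_two), statistics.median(scenario_two))
```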

21
Deciding which measure of central
tendency to use

• Need to see the shape of the distribution
– Describes how data are distributed
– Helps us decide whether to use the mean or the median to describe numerical data

22
Which measures of central tendency to
use? Shape of the distribution
Shape describes how numerical data are distributed:
1. Symmetric: same shape on both sides of the mean.
– Mean = Median
2. Skewed: outlying observations occur in only one direction
– Left skewed: outlying values are small (left of center)
• Mean < Median
– Right skewed: outlying values are large (right of center)
• Mean > Median
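
The mean-versus-median comparison can be turned into a quick check. This is only a rule-of-thumb sketch: the function name `skew_direction` and its `tolerance` parameter are hypothetical, and equal mean and median do not by themselves prove that a distribution is symmetric.

```python
import statistics


def skew_direction(values, tolerance=0.0):
    """Guess the skew direction by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean - median > tolerance:
        return "right-skewed"      # large values pull the mean up
    if median - mean > tolerance:
        return "left-skewed"       # small values pull the mean down
    return "no skew detected"


print(skew_direction([1, 2, 3, 4, 10]))   # mean 4 > median 3
print(skew_direction([1, 7, 8, 9, 10]))   # mean 7 < median 8
print(skew_direction([1, 2, 3, 4, 5]))    # mean 3 = median 3
```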

23
Shape of the distribution

• Symmetric: Mean = Median
• Left-skewed: outliers to the LEFT of center pull the mean LEFT, so the mean is LESS than the median (Mean < Median)
• Right-skewed: outliers to the RIGHT of center pull the mean to the RIGHT, so the mean is MORE than the median (Mean > Median)
• The median is the CENTER (and it doesn't change due to outliers), so think of whether outliers pull the mean to the left or right

24
Which measure of central tendency to
use? General guidelines

1. Mean: numerical data and symmetric distribution

2. Median: ordinal data, or numerical data if the distribution is skewed

3. Mode: bimodal distributions

25
“What is the range?”
Measures of Variation/Spread
Measures of variation give information on the spread or variability
of the data values.
– Range
– Percentiles/quartiles
– Interquartile range (IQR)
– Standard deviation/Variance
[Figure: two distributions with the same center but different variation]

26
Quartiles

[Figure: the quartiles Q1, Q2 and Q3 split the ordered data into four groups of 25% each]
• The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger.
• Q2 is the same as the median
• 50% are smaller, 50% are larger
• Only 25% of the observations are greater than the third
quartile Q3

27
Interquartile Range
• Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1
• IQR contains the central 50% of the observations.
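
Quartiles and the IQR can be computed with `statistics.quantiles` (Python 3.8 or later). Several conventions for interpolating quartiles exist; the sketch assumes the `"inclusive"` method, which corresponds to NumPy's default linear interpolation, so other tools may give slightly different values on small samples.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# n=4 asks for the three cut points that split the data into four groups
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
print("IQR =", iqr)
print("Q2 equals the median:", q2 == statistics.median(ages))
```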

28
Variance
• Average (roughly) of the squared deviations of the values from the mean $\bar{X}$
– Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
– Population variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n}$

• Why square the deviations?
– The deviations themselves always add up to 0 (squaring eliminates the negatives)
– Values further from the mean contribute increasingly more to the variance
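
Both points can be verified numerically. The sketch below computes the deviations by hand for the age data used in these slides and compares the results with the `statistics` module.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]
n = len(ages)
mean = sum(ages) / n

deviations = [x - mean for x in ages]
sum_of_deviations = sum(deviations)              # positives and negatives cancel
sum_of_squares = sum(d ** 2 for d in deviations)

sample_variance = sum_of_squares / (n - 1)       # divide by n - 1
population_variance = sum_of_squares / n         # divide by n

print(sum_of_deviations, sum_of_squares)
print(sample_variance, statistics.variance(ages))
print(population_variance, statistics.pvariance(ages))
```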

29
Standard Deviation
• Most commonly used measure of variation

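• Shows variation about the mean: approximately the average distance of each observation from the mean

• Is the square root of the variance; has the same units as the original data
– Sample standard deviation: $S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$
– Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n}}$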

30
Standard Deviation vs Variance

31
Calculation Example:
Sample Standard Deviation

Age data (N = 8): 17 19 21 22 23 23 23 38

N = 8, Mean = $\bar{X}$ = 23.25

$S = \sqrt{\frac{(17 - 23.25)^2 + (19 - 23.25)^2 + \cdots + (38 - 23.25)^2}{8 - 1}} = \sqrt{\frac{281.5}{7}} \approx 6.34$
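
The worked example can be reproduced in Python. In this sketch `statistics.stdev` is the sample version (divides by n − 1) and `statistics.pstdev` is the population version (divides by n).

```python
import math
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]
mean = statistics.mean(ages)

# Sample standard deviation written out from the formula
squared_deviations = sum((x - mean) ** 2 for x in ages)
s_manual = math.sqrt(squared_deviations / (len(ages) - 1))

s = statistics.stdev(ages)        # sample standard deviation
sigma = statistics.pstdev(ages)   # population standard deviation

print(round(s_manual, 2), round(s, 2), round(sigma, 2))
```

The sample value is always at least as large as the population value, because it divides the same sum by a smaller number.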

32
Comparing Standard Deviations

[Figure: three dot plots on a common axis from 11 to 21]
Data A: Mean = 15.5, SD = 3.338
Data B: Mean = 15.5, SD = 0.926
Data C: Mean = 15.5, SD = 4.570

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

33
The Normal Distribution

• Most common data distribution: bell curve, normal curve, Gaussian distribution.
• Centered on the “mean” value (also the most frequently occurring value)
• Symmetric on either side of the mean
• Can be narrow or wide. The width is described by the standard deviation (how values deviate from the mean)
• Changing σ (SD) increases or decreases the spread.

34
The beauty of the normal curve
Normal distribution: 68-95-99.7 Rule

For any normal distribution with mean $\mu$ and standard deviation $\sigma$:
• 68% of the observations fall within one standard
deviation of the mean (in the interval $[\mu - \sigma, \mu + \sigma]$)
• 95% of the observations fall within two standard
deviations of the mean (in the interval $[\mu - 2\sigma, \mu + 2\sigma]$)
• 99.7% of the observations fall within three
standard deviations of the mean (in the interval $[\mu - 3\sigma, \mu + 3\sigma]$)
• Almost all values fall within 3 standard deviations.
[Figure: normal curve with the x-axis marked at -3 SD, -2 SD, -1 SD, mean, +1 SD, +2 SD, +3 SD]
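
The 68-95-99.7 percentages follow from the normal cumulative distribution function. The sketch below checks them with `statistics.NormalDist` from the standard library; it uses the standard normal (mean 0, standard deviation 1), since the proportions are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, p in coverage.items():
    print(f"within {k} SD: {100 * p:.1f}%")
```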

Confidence intervals: e.g., 95% CI = $\left[\text{mean} \pm 1.96 \cdot \text{st.dev} / \sqrt{n}\right]$
99% CI = $\left[\text{mean} \pm 2.58 \cdot \text{st.dev} / \sqrt{n}\right]$
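
A normal-approximation confidence interval for the mean can be sketched as follows. The function name `normal_ci` is hypothetical. The critical values 1.96 and 2.58 come from the inverse normal CDF; for small samples such as the eight ages used here, a t-based interval is usually preferred, so treat the result as an illustration of the formula.

```python
import math
import statistics
from statistics import NormalDist


def normal_ci(values, confidence=0.95):
    """Confidence interval for the mean: mean ± z * st.dev / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)                    # sample st.dev
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    error = z * stdev / math.sqrt(n)
    return mean - error, mean + error


ages = [17, 19, 21, 22, 23, 23, 23, 38]
ci_95 = normal_ci(ages, 0.95)
ci_99 = normal_ci(ages, 0.99)
print("95% CI:", ci_95)
print("99% CI:", ci_99)
```

A higher confidence level gives a wider interval for the same data.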
35
Which measures of variability to use?
General guidelines
• Standard deviation
• When the mean is used (numerical data and symmetric
distribution)
• Percentiles and IQR:
• When median is used (i.e., ordinal data or skewed numerical data)
• When mean is used but objective is to compare individual
observations with a set of norms
• IQR
• To describe the central 50% of a distribution, regardless of shape
• Range
• Used with numerical data to emphasize extreme values

36
Graphical displays of numerical data

• To show the distribution (shape, center, range, variation) of continuous variables.

1. The Box Plot
2. The Histogram

37
Displaying numerical data
Box plot (aka box-and-whisker plot)
• Graphically shows the quartiles, mean, median,
maximum, and minimum of the data.
– The ends of the box are drawn at the first and third quartiles
(cut-offs for the lowest and highest 25%)
– A line is drawn inside the box at the median (exact middle value)
– Lines, called whiskers, are extended from the ends of the box
out to the minimum and maximum
– Often, the mean (the average) is indicated with an asterisk,
cross, or dotted line
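
The numbers behind a box plot can be computed without drawing it. In this sketch the function name `box_plot_stats` is hypothetical, and the whiskers are assumed to stop at the most extreme observations inside the fences Q1 − 1.5·IQR and Q3 + 1.5·IQR, with points beyond them reported as outliers. This is one common convention; other tools simply extend the whiskers to the minimum and maximum.

```python
import statistics


def box_plot_stats(values):
    """Quartiles, mean, whisker ends and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "mean": statistics.mean(values),
        "whisker_low": min(inside),    # most extreme points within the fences
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


ages = [17, 19, 21, 22, 23, 23, 23, 38]
stats_for_plot = box_plot_stats(ages)
print(stats_for_plot)
```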

38
Box Plot for a symmetric distribution

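[Figure: box plot of a symmetric distribution; y-axis: value, x-axis: variable name. From top to bottom: upper whisker at the maximum or Q3 + (1.5*IQR); top of the box at the 75th percentile (Q3); line inside the box at the median (Q2), with the mean marked by *; bottom of the box at the 25th percentile (Q1); lower whisker at the minimum or Q1 - (1.5*IQR). The height of the box is the interquartile range.]
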
Displaying numerical data:
Histogram
• Gives the percentage (proportion) of the study population in
ranges of the continuous variable or in categories
• X-axis: measure of interest
• Y-axis: number or percentage of observations
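
The counts behind a histogram come from assigning each observation to a range (bin). In this sketch the function name `histogram_counts`, the bin start and the bin width are hypothetical choices; the age data are the ones used earlier in these slides.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width bins beginning at start."""
    counts = [0] * n_bins
    end = start + width * n_bins
    for v in values:
        if v == end:
            index = n_bins - 1               # the last bin includes its right edge
        else:
            index = int((v - start) // width)
        if 0 <= index < n_bins:              # values outside the range are ignored
            counts[index] += 1
    return counts


ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Bins: [15, 20), [20, 25), [25, 30), [30, 35), [35, 40]
counts = histogram_counts(ages, start=15, width=5, n_bins=5)
percentages = [100 * c / len(ages) for c in counts]

print(counts)
print(percentages)
```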

[Figure: histogram of Age; x-axis: Age (20.0 to 30.0), y-axis: Count (0.0 to 25.0)]
Distribution shape and box plot

[Figure: three box plots, each showing Q1, Q2 and Q3, with the mean marked by *]
• Left-skewed: Mean < Median
• Symmetric: Mean = Median
• Right-skewed: Mean > Median



Statistical Data Analysis: Descriptive Statistics
Descriptive statistics is a branch of statistics that aims to quantitatively
describe or summarize the features of a collection of data.
• Qualitative variables: proportion or percentage of occurrence of
each variable value (e.g., percentage of patients taking one drug).
• Quantitative variables:
– Measures of central tendency
• Mean: arithmetic average of the values, $\text{mean} = \frac{\sum_{i=1}^{n} x_i}{n}$.
• Median: middle value of the set of values.
• Mode: most commonly observed value of the set of values.
– Measures of dispersion or variability
• Variance and standard deviation: st. dev = square-root(variance), $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
– ~68% of the cases are in the interval [mean ± st.dev] (for normally distributed data)
– ~95% of the cases are in the interval [mean ± 2*st.dev] (for normally distributed data)

• Confidence intervals: e.g., 95% CI = $\left[\text{mean} \pm 1.96 \cdot \text{st.dev} / \sqrt{n}\right]$
99% CI = $\left[\text{mean} \pm 2.58 \cdot \text{st.dev} / \sqrt{n}\right]$

• Interquartile range: obtain the first and third quartiles Q1 and Q3; the interval [Q1, Q3]
contains the central 50% of the data, and its width Q3 – Q1 is the interquartile range.

42
Statistical Description of Data
Population Description (N = 451)

Variable        Type        Mean      St. Dev.  95% CI
Age             Numerical    46.4080  16.4298   [44.8916, 47.9243]
Height          Numerical   166.1353  37.1946   [162.7025, 169.5681]
Weight          Numerical    68.1441  16.5998   [66.6121, 69.6762]
QRS duration    Numerical    88.9224  15.3814   [87.5028, 90.3420]
P-R interval    Numerical   155.0953  44.8755   [150.9537, 159.2370]
Q-T interval    Numerical   367.2239  33.4208   [364.1395, 370.3084]
T interval      Numerical   169.9335  35.6711   [166.6413, 173.2257]
P interval      Numerical    89.9756  25.8480   [87.5900, 92.3612]
Heart Rate      Numerical    74.4634  13.8707   [73.1833, 75.7436]

Variable                        Type         Frequencies
Sex                             Categorical  male 44.79%, female 55.21%
Ragged R wave                   Categorical  exists 0.22%, does not exist 99.78%
Diphasic derivation of R wave   Categorical  exists 1.11%, does not exist 98.89%

Comparative Population Description

Group sizes: Male N = 202 (44.79%), Female N = 249 (55.21%), All N = 451 (100%)

Variable        Group    Mean      St. Dev.   95% CI
Age             Male      47.4109   16.4466   [45.1428, 49.6790]
                Female    45.5944   16.4042   [43.5568, 47.6319]
                All       46.4080   16.4298   [44.8916, 47.9243]
Height          Male     171.2228   72.6881   [147.6103, 194.8353]
                Female   162.0080   39.8710   [157.0557, 166.9604]
                All      166.1353   37.1946   [162.7025, 169.5681]
Weight          Male      72.6881  171.2228   [59.6308, 85.7454]
                Female    64.4578   14.7585   [62.6247, 66.2910]
                All       68.1441   16.5998   [66.6121, 69.6762]
QRS duration    Male      94.6832   94.6832   [72.9829, 116.3834]
                Female    84.2490   14.4695   [82.4517, 86.0462]
                All       88.9224   15.3814   [87.5028, 90.3420]
P-R interval    Male     157.3564  157.3564   [107.0805, 207.6324]
                Female   153.2610   45.9316   [147.5559, 158.9662]
                All      155.0953   44.8755   [150.9537, 159.2370]
Q-T interval    Male     364.5693  364.5693   [340.1280, 389.0106]
                Female   369.3775   33.4351   [365.2245, 373.5305]
                All      367.2239   33.4208   [364.1395, 370.3084]
T interval      Male     177.2327  177.2327   [164.5085, 189.9568]
                Female   164.0120   35.4926   [159.6035, 168.4206]
                All      169.9335   35.6711   [166.6413, 173.2257]
P interval      Male      92.2673   92.2673   [82.1306, 102.4040]
                Female    88.1165   25.5852   [84.9385, 91.2944]
                All       89.9756   25.8480   [87.5900, 92.3612]
Heart Rate      Male      73.5050   73.5050   [63.3682, 83.6417]
                Female    75.2410   12.8728   [73.6420, 76.8399]
                All       74.4634   13.8707   [73.1833, 75.7436]

Variable (Categorical)          Group    exists   does not exist
Ragged R wave                   Male     0.00%    100.00%
                                Female   0.40%    99.60%
                                All      0.22%    99.78%
Diphasic derivation of R wave   Male     0.99%    99.01%
                                Female   1.20%    98.80%
                                All      1.11%    98.89%

43
Descriptive Statistics with Python
• Context
– NumPy: Python library adding support for large, multi-dimensional arrays and matrices, along
with a large collection of high-level mathematical functions to operate on these arrays.
– SciPy: Python library built on NumPy that provides additional functions for optimization, linear
algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE
solvers and other tasks common in science and engineering.
– Imports:
    import collections
    import math
    import statistics
    import numpy as np
    from scipy import stats
• Qualitative Variables
    data = ['a','a','a','e','e','e','e','e','e','i','o','o','o','o','o','u','u','u','u','u']
    Frequency of all categories: collections.Counter(data)
    Frequency of one category:   collections.Counter(data)['a']
    Percentage:                  collections.Counter(data)['a'] / len(data) * 100
• Quantitative Variables
    data = [1,2,3,3,3,4,4,4,4,5,5,5,6,6,7,7,7,7,7,8,10,10,10]
    N:               n = len(data)
    Mean:            mean = statistics.mean(data)
    Median:          median = statistics.median(data)
    Mode:            mode = statistics.mode(data)
    Variance:        variance = statistics.variance(data)    # sample variance (n - 1)
    Std. Deviation:  stdev = statistics.stdev(data)          # sample standard deviation
    95% CI (normal): cv = stats.norm.ppf(0.975)              # critical value, about 1.96
                     error = cv * stdev / math.sqrt(n)
                     CI = (mean - error, mean + error)
    Quartiles:       q1 = np.quantile(data, 0.25)
                     q2 = np.quantile(data, 0.5)             # same as the median
                     q3 = np.quantile(data, 0.75)
• Results
    n = 23
    mean = 5.565217391304348
    median = 5
    mode = 7
    variance = 6.3478260869565215
    standard deviation = 2.5194892512087685
    CI (95%) = (4.5355506551396925, 6.594884127469003)
    Quartiles = 4.0 5.0 7.0

See also the presentation on data description and visualization in Python.
44
