Cofactor Statistics
(a) To describe the stages in the design of a clinical trial, taking into account the
research question and hypothesis
- Literature review
- Statistical advice
- Ideal study protocol to minimise the risk of bias and to achieve optimum power of the study
- Ethical issues and informed consent
- Data collection and processing
(b) To explain the concepts in statistics such as distribution of data and frequency
distributions, measures of central tendency and dispersion of data, and the
appropriate selection and application of non-parametric and parametric tests in
statistical inference
(c) To explain the principles of errors of statistical inference and describe techniques to
minimise such errors through good study design
(d) Have an understanding of sources of bias and confounding in medical research and
methods available that can reduce such bias
(e) To describe the features of a diagnostic test, including the concepts of sensitivity,
specificity, positive and negative predictive value and how these are affected by the
prevalence of the disease in question
(b) Histogram
Displays a frequency distribution of "class intervals" (along the x-axis), where the height of the vertical rectangle for each "class interval" is proportional to its frequency
Shows distribution of data according to its overall shape (Ie.
symmetric, skewed, multimodal) and location of central tendency
of data
(d) Dot plots – Individual interval data points are presented (cf.
subdividing them into class intervals). Useful with relatively small data sets
(e) Scatterplots (See “Association” below)
- (2) Categorical data
o Data that is discrete and qualitative. Either:
(a) Nominal – Data stratified into groups with no rank order and each
group is arbitrarily labelled (Eg. gender, hair colour)
(b) Ordinal – Data stratified into groups with rank order, where the intervals between groups are not uniform or directly quantifiable (Eg. pain score, where a score of 5 is not necessarily half the pain intensity of a score of 10)
o Presenting categorical data:
(a) Table
“Summary table” of data (Eg. frequencies of the data set)
(d) Line graph – Similar to bar chart except bars replaced by points joined
by a line. Useful when categories depicted on x-axis exist as a continuum
(Eg. sedation score)
- The strength of the association or correlation between two sets of paired interval data (Ie.
closeness to a straight line) can be evaluated by the “Correlation coefficient”:
o If data sets are normally distributed, a “Pearson correlation coefficient (r)” is
calculated:
“r” exists in a range between -1 (all data points exist on a straight line with
a –ve correlation) and +1 (all data points exist on a straight line with a
+ve correlation). An "r" of 0 means no correlation (Eg. random scatterplot pattern)
P-value can be determined to assess the statistical significance of “r” (Ie.
whether a true linear correlation between the two sample data exists) and
whether linear regression analysis can be undertaken to derive an equation
for the straight linear relationship
"r2" can be calculated as a value between 0 and +1 – When this is expressed as a percentage, it is the proportion of variance that the two sample variables share (Eg. r2 of 0.8 between x and y means 80% of the variance in y is explained by variation in x, and vice versa)
o If data set is not normally distributed, a "Spearman or Kendall correlation coefficient" is calculated instead
- Note that a strong association or correlation does NOT imply causality between the two
variables – All other potential reasons for the association or correlation must be
considered and excluded first!
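A minimal Python sketch of these calculations (using scipy.stats; the paired height/weight values are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical paired interval data (Eg. height in cm vs weight in kg)
height = np.array([150, 160, 165, 170, 175, 180, 185, 190], dtype=float)
weight = np.array([55, 60, 66, 68, 74, 80, 83, 90], dtype=float)

# Pearson correlation coefficient (r) and its P-value (for normally distributed data)
r, p = stats.pearsonr(height, weight)
print(f"Pearson r = {r:.3f}, P = {p:.4f}, r^2 = {r**2:.3f}")   # r^2 = shared variance

# Spearman rank correlation (rho) for non-normally distributed data
rho, p_s = stats.spearmanr(height, weight)
print(f"Spearman rho = {rho:.3f}, P = {p_s:.4f}")
```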
Linear regression:
- If a statistically significant correlation exists between two variables, the nature of the
correlation or association between them can be examined by calculating the equation for
the straight linear relationship
Y = a + bX + E (where a = intercept, b = slope or regression coefficient, and E = random error term)
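A brief sketch of fitting this equation with scipy.stats.linregress (the x/y values are invented):

```python
import numpy as np
from scipy import stats

# Hypothetical paired data with a roughly linear relationship
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1, 14.2, 15.8])

res = stats.linregress(x, y)                       # least-squares fit of Y = a + bX
print(f"intercept a = {res.intercept:.2f}, slope b = {res.slope:.2f}")
print(f"r = {res.rvalue:.3f}, P = {res.pvalue:.4f}")

# Residuals (E) = observed Y minus the Y predicted by the fitted line
residuals = y - (res.intercept + res.slope * x)
```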
Bland-Altman plot:
- Correlation coefficient analysis is valid only when comparing two independent (or
completely different) variables that may be associated (Eg. patient height and weight) –
BUT when comparing two different methods of measuring the same variable (Eg. PA
catheter vs transoesophageal ECHO for CO monitor), this form of analysis may be
misleading as a statistically significant correlation may still mean clinically unacceptable
differences between the two methods of measurement
- Bland-Altman plot is more relevant in this situation. This involves:
o Plotting the mean of each individual data pair (x-axis; Eg. PA catheter CO
measurement) against the difference between each data pair (y-axis; Eg. difference
in CO measurement between the two methods)
o Correlation coefficient (r) between the two methods can be determined
o “Bias” or mean difference between all the data pairs is shown by a solid line
o "Limits of agreement" or a 95% range for the differences between individual data pairs is derived (equal to the bias +/- approximately twice the standard deviation of the distribution of the differences)
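A minimal numpy sketch of the Bland-Altman calculations (the paired cardiac output readings are invented; plotting is omitted):

```python
import numpy as np

# Hypothetical paired CO measurements (L/min) by two methods in the same patients
pac = np.array([4.5, 5.0, 5.5, 6.0, 4.8, 5.2, 6.3, 5.7])    # PA catheter
echo = np.array([4.8, 4.9, 5.9, 5.7, 5.1, 5.0, 6.6, 5.5])   # transoesophageal ECHO

means = (pac + echo) / 2        # x-axis: mean of each data pair
diffs = pac - echo              # y-axis: difference between each data pair

bias = diffs.mean()                                      # mean difference (solid line)
sd_diff = diffs.std(ddof=1)                              # SD of the differences
loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)     # 95% limits of agreement

print(f"bias = {bias:.2f} L/min, limits of agreement = {loa[0]:.2f} to {loa[1]:.2f} L/min")
```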
Distribution of data:
- (1) Normal (or “parametric”) distribution
o Characterised by:
A unimodal, symmetrical, bell-shaped curve whereby all 3 measures of
central tendency (mean, median, mode) are equal and described at the
zenith of the curve
The “mean” positions the curve on the x-axis and is its best measure of
central tendency
The “standard deviation” (SD) determines the shape (or width) of the
curve as it is the measure of data variability – Note that 68% of data lie
within +/- 1 SD of the mean; 95% data lie within 2 SD of the mean;
99.7% data lie within +/- 3 SD of the mean
o The proper distribution to be used for a given sample size is specified by the
“degrees of freedom” (which is equal to n – 1)
- (4) Binomial distribution:
o Describes any situation where there are "n" independent trials, each with two mutually exclusive outcomes (Eg. "n" number of coin tossing events), and the outcome of interest occurs with a probability of "p" on each trial (Eg. "p" of tossing tails and "1-p" of tossing heads)
o It approximates a normal distribution provided "n" is reasonably large and "p" does not take too extreme a value (near 0 or 1), such that:
Mean of binomial distribution = n x p
SD of binomial distribution = √(np(1-p))
- (5) Poisson distribution:
o Discrete distribution where there is no strict upper limit to the possible values of
the variable. The variable is the count of a number of independent events that
occur randomly in a fixed interval of time or space (Eg. radioactive emissions
from a source over time)
o Used as the limiting form of a binomial distribution when “n” is large, and “p” is
small
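A short scipy.stats sketch illustrating these properties (the parameter values chosen are arbitrary examples):

```python
from math import sqrt
from scipy import stats

# Normal distribution: proportion of data within +/- 1, 2 and 3 SD of the mean
for k in (1, 2, 3):
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within +/- {k} SD: {p:.3%}")        # ~68%, ~95%, ~99.7%

# Binomial distribution: mean = n x p, SD = sqrt(np(1-p))
n, p = 100, 0.5                                 # Eg. 100 coin tosses
print(stats.binom.mean(n, p), n * p)            # both 50.0
print(stats.binom.std(n, p), sqrt(n * p * (1 - p)))

# Poisson as the limiting form of the binomial when n is large and p is small
n, p = 10000, 0.0005
print(stats.binom.pmf(5, n, p), stats.poisson.pmf(5, n * p))   # nearly identical
```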
The choice of a specific statistical testing for inference will depend on:
- (1) Type of data being tested (Eg. interval or categorical)
- (2) Number of groups of data being tested (one group, two groups or multiple groups)
- (3) Distribution of the data (Eg. normal distribution)
o The type of statistical test employed (parametric vs non-parametric) will depend
on the distribution pattern of the (i) population parameter being studied, and (ii)
sample data
o The “normality” of the sample data distribution is assessed by:
(a) Observing the distribution of the histogram or frequency curve –
However, with small sample sizes (n < 20) this may not be obvious
(b) When sample sizes are small (n < 20), normality can be assessed by formal statistical analysis using the "Shapiro-Wilk test" (see the sketch after this list)
o A “parametric test” can be used IF:
(a) Population parameter being studied follows a normal distribution
(regardless of the study’s sample size or distribution of sample data)
(b) Distribution of the population parameter under study is unclear BUT
the sample data obtained appears to follow a normal distribution (given a
large enough sample size)
(c) Non parametric data is converted into one that follows a normal
distribution by logarithmic, square root, or reciprocal transformation
o A “non-parametric test” is used if there is any doubt regarding the distribution of
the population parameter or the sample data (esp when “n” is very small)
o Note that in order to avoid choosing the incorrect type of statistical test, a large number of subjects should be recruited (n > 100) – This allows both types of analysis to produce similar results with similar power (otherwise misleading results occur when the wrong test is applied)
- (4) Pairing of data (Eg. paired vs unpaired)
o “Unpaired statistical analysis” should be used for unpaired (or unmatched or
independent) data – This describes data obtained from studies where treatment
groups (and their respective outcomes) are INDEPENDENT of one another
(Eg. RCT where subjects are allocated randomly into treatment groups such that
each group should be as similar to each other as possible EXCEPT for the
intervention they receive)
o “Paired statistical analysis” should be used for paired (or matched or dependent)
data – This describes data typically obtained from “Crossover studies” where all
subjects recruited receive one treatment, followed by the other after a suitable
“washout” period (Ie. to allow the effects of the first treatment to subside). The
effectiveness of data pairing can be assessed by “correlation coefficient” (and its
corresponding P-value)
o Note that "paired statistical tests" are MORE powerful and require fewer subjects to be recruited to prove a difference exists between groups – BUT more time
is required for the “crossover study” to occur, and there is risk that the “washout
period” is inadequate
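A brief Python sketch of checking normality as described above (the data are randomly generated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=18)     # hypothetical small sample (n < 20)

# Shapiro-Wilk test: null hypothesis is that the data come from a normal distribution
stat, p = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {stat:.3f}, P = {p:.3f}")
# If P > 0.05, normality is not rejected and a parametric test may be reasonable;
# if P < 0.05 (or there is any doubt), a non-parametric test is the safer choice

# Logarithmic transformation can shift positively skewed data towards normality
skewed = rng.lognormal(mean=0, sigma=0.5, size=18)
print(stats.shapiro(skewed).pvalue, stats.shapiro(np.log(skewed)).pvalue)
```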
SEM = √(s2/n) = SD/√n
- SEM relates (a) inversely with sample size (Ie. SEM decreases with larger sample sizes), and (b) proportionately with SD (Ie. SEM increases with larger SD)
o (b) Determining if the sample data differs from a population parameter
(a) Student’s one sample t-test (for parametric data)
“t-value” is calculated as follows:
t = (x̄ – μ) / SEM
P-value for this test statistic is looked up on the relevant t-
distribution (with degrees of freedom equal to n-1) to assess if the
difference between the sample data and population parameter is
statistically significant
(b) Wilcoxon signed-rank test (for non-parametric data)
Each sample datum is assigned a signed rank according to how far it is from the hypothesised median (Ie. datum values lower than the median receive –ve signs; those higher than the median receive +ve signs)
These signed ranks are then summed to produce a "W-value" – If the null hypothesis is true, W is near zero
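A minimal sketch of both one-sample tests above (the sample values and the reference mean of 120 are invented):

```python
import numpy as np
from scipy import stats

# Hypothetical sample (Eg. systolic BP in mmHg) compared to a population mean of 120
sample = np.array([118, 125, 130, 122, 128, 135, 121, 127, 132, 124], dtype=float)
mu = 120

# Student's one sample t-test: t = (sample mean - mu) / SEM, with df = n - 1
sem = sample.std(ddof=1) / np.sqrt(len(sample))
t_manual = (sample.mean() - mu) / sem
t, p = stats.ttest_1samp(sample, popmean=mu)
print(f"t = {t:.3f} (manual {t_manual:.3f}), P = {p:.4f}")

# Wilcoxon signed-rank test against the hypothesised value (non-parametric analogue)
w, p_w = stats.wilcoxon(sample - mu)
print(f"W = {w:.1f}, P = {p_w:.4f}")
```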
- (2) Comparing two groups of interval data:
o (a) Student’s two sample t-test (for parametric data)
Can be used to compare results of two unpaired (independent) or paired (dependent) groups of parametric data
This is done by calculating either:
(i) A confidence interval (CI) for the difference in means between
the two sets of data
(ii) A P-value for the difference in means between the two sets of
data, which involves determining the t-value:
t = (x̄A – x̄B) / √(s2/nA + s2/nB)
Note that for this t-test to be valid, variance of both groups must be
SIMILAR for pooling of the variance to occur (otherwise Welch’s
correction to the t-test must be applied)
o (b) Non-parametric interval data
(i) Mann-Whitney U test is used to compare results of two unpaired (or
independent) sets of non-parametric data
(ii) Wilcoxon matched pairs test is used to compare results of two paired
(or dependent) sets of non-parametric data
Note that both tests analyse data by comparing the medians (cf. means)
and by considering the data in rank order values (cf. absolute values)
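A minimal sketch of these two-group tests (the group values are invented):

```python
import numpy as np
from scipy import stats

# Hypothetical outcome data (Eg. pain-free time in hours) in two groups of subjects
group_a = np.array([4.1, 5.2, 6.0, 5.5, 4.8, 6.3, 5.1, 5.9])
group_b = np.array([3.2, 4.0, 4.5, 3.8, 4.2, 3.6, 4.4, 3.9])

# Student's two sample t-test (pooled variance; assumes the variances are similar)
t, p = stats.ttest_ind(group_a, group_b)
# Welch's correction when the variances are NOT similar
t_w, p_w = stats.ttest_ind(group_a, group_b, equal_var=False)

# Non-parametric equivalents
u, p_u = stats.mannwhitneyu(group_a, group_b)   # unpaired (independent) data
w, p_wil = stats.wilcoxon(group_a, group_b)     # paired data (shown only for syntax here)

print(p, p_w, p_u, p_wil)
```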
- (3) Comparing three or more groups of interval data:
o (a) For parametric data:
(i) Analysis of variance (ANOVA)
Compares the means of three or more unpaired (or independent) sets of parametric data to determine if they are statistically significantly different from one another
If the analysis is statistically significant, this means at least one of the data sets has a different mean from the others (but this does NOT mean that all of them do!) – Thus, several "post-hoc" tests can be done to determine which differences between certain datasets are significant
(ii) Repeated measures ANOVA test
Similar to ANOVA except used for datasets that are paired (or
dependent)
o (b) For non-parametric data
(i) Kruskal-Wallis ANOVA by ranks test (for unpaired or unmatched
datasets)
(ii) Friedman test (for paired or matched datasets)
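A minimal sketch of these multi-group tests (the three groups of values are invented; a post-hoc test would then be needed to localise any difference):

```python
import numpy as np
from scipy import stats

# Hypothetical outcome data in three groups (Eg. three different analgesic regimens)
g1 = np.array([5.1, 6.2, 5.8, 6.5, 5.9])
g2 = np.array([4.2, 4.8, 5.0, 4.5, 4.9])
g3 = np.array([6.8, 7.1, 6.5, 7.4, 6.9])

# One-way ANOVA (parametric, unpaired): is at least one group mean different?
f, p = stats.f_oneway(g1, g2, g3)
print(f"ANOVA F = {f:.2f}, P = {p:.4f}")

# Kruskal-Wallis ANOVA by ranks (non-parametric, unpaired)
h, p_kw = stats.kruskal(g1, g2, g3)

# Friedman test (non-parametric, paired/repeated measures on the same subjects)
chi2, p_fr = stats.friedmanchisquare(g1, g2, g3)
print(p_kw, p_fr)
```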
2 x 2 contingency table          Data type 1 (Eg. death)   Data type 2 (Eg. survival)   Totals
Category 1 (Eg. treatment A)     A                         B                            A+B
Category 2 (Eg. treatment B)     C                         D                            C+D
Totals                           A+C                       B+D                          N = A+B+C+D
- (a) For two sample proportions of categorical data grouped in a 2x2 contingency table:
o “Fisher exact test” is used when data are unpaired, while the “McNemar’s test” is
used when data are paired
o Unlike the X2 test, these tests do not assume random sampling and can be used with small sample sizes (n < 20) and small "expected" frequencies (E < 5)
o These tests calculate the probability of all tables that would produce the same
observed marginal totals (Ie. A+B, C+D, A+C, B+D)
- (b) For three or more sample categorical data grouped in a contingency table, a “Chi-
squared test” is used:
o This test assumes:
(i) Random sampling of data from the population
(ii) Sufficient sample size (n > 20)
(iii) “Expected” frequency at least > 5 (unless Yates’ correction applied)
(iv) Observations independent of each other (Ie. unpaired)
o Test statistic (X2) is calculated as X2 = Σ [(Oi – Ei)2 / Ei], summed over all n cells, where:
Oi = Observed frequency
Ei = Expected frequency
n = Number of cells in table
Where the “expected” frequency for each cell is the “row total” x “column total” divided by N:
[(A+B)x(A+C)] / N [(A+B)x(B+D)] / N
[(C+D)x(A+C)] / N [(C+D)x(B+D)] / N
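A minimal sketch of these tests on a hypothetical 2 x 2 table (the counts are invented; note scipy applies Yates' correction by default for 2 x 2 tables):

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 2 contingency table: rows = treatment A/B, columns = death/survival
table = np.array([[10, 40],     # A, B
                  [20, 30]])    # C, D

# Chi-squared test (expected frequencies derived from the row/column totals)
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"X^2 = {chi2:.2f}, df = {dof}, P = {p:.4f}")
print(expected)                 # (row total x column total) / N for each cell

# Fisher exact test for small samples / small expected frequencies (unpaired data)
odds_ratio, p_fisher = stats.fisher_exact(table)
print(f"Fisher exact P = {p_fisher:.4f}")
```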
- "Relative risk" (RR) = [a/(a+b)] / [c/(c+d)] (Ie. incidence of the outcome in the exposed group divided by the incidence in the unexposed group)
o Requires an estimate of incidence – It can be directly estimated from RCTs and
cohort studies
o Can be used to determine:
(i) Relative protection (RP) = (1 / RR)
(ii) Relative risk reduction (RRR) = (1 – RR)
- "Absolute risk reduction" (ARR) and "Number needed to treat" (NNT):
o ARR is the difference in incidence of an outcome between exposed and
unexposed groups
o NNT is the number of patients who need to be treated to prevent one additional outcome (or disease). It is calculated as the reciprocal of the ARR
NNT = (1 / ARR)
- "Odds ratio" (OR) = (a/c) / (b/d) = ad/bc
o A measure of strength of an association in retrospective case-control studies (but
can also be used for RCTs and cohort studies)
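A minimal worked sketch of these measures, assuming the usual lower-case convention where a/b = outcome/no outcome in the exposed (treatment) group and c/d = outcome/no outcome in the unexposed (control) group (the counts are invented):

```python
# Hypothetical 2 x 2 table counts
a, b = 10, 90     # exposed:   10 with the outcome, 90 without
c, d = 20, 80     # unexposed: 20 with the outcome, 80 without

risk_exposed = a / (a + b)            # incidence in the exposed group
risk_unexposed = c / (c + d)          # incidence in the unexposed group

rr = risk_exposed / risk_unexposed    # relative risk
rrr = 1 - rr                          # relative risk reduction
arr = risk_unexposed - risk_exposed   # absolute risk reduction
nnt = 1 / arr                         # number needed to treat
odds_ratio = (a / c) / (b / d)        # odds ratio = ad/bc

print(f"RR = {rr:.2f}, RRR = {rrr:.0%}, ARR = {arr:.2f}, NNT = {nnt:.0f}, OR = {odds_ratio:.2f}")
```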
- "Sensitivity" is the proportion of reference test +ve patients (diseased) who test +ve with the screening test
Sensitivity = a/(a+c) = TP/(TP + FN)
- “Specificity” is the proportion of reference test –ve patients (no disease) who test –ve
with the screening test
Specificity = d/(b+d) = TN/(TN + FP)
- “Positive predictive value” is the chance a subject has the disease after being tested +ve
by the screening test
PPV = a/(a+b) = TP/(TP + FP)
- “Negative predictive value” is the chance a subject is not diseased after being tested –ve
by the screening test
NPV = d/(c+d) = TN/(TN + FN)
Note that interpretation of +ve (and –ve) test results by the PPV (and NPV) is influenced by the prevalence of the disease (or pre-test probability)
- “Likelihood ratio” (LR) is defined as the likelihood that a given test result would be
expected in a patient with the target illness compared to the likelihood that the same
result would be expected in a patient without the target illness
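A minimal sketch of these diagnostic test measures (the 2 x 2 counts are invented; the LR formulas shown are the standard ones, not stated explicitly above):

```python
# Hypothetical diagnostic 2 x 2 table: a = TP, b = FP, c = FN, d = TN
tp, fp, fn, tn = 90, 50, 10, 850

sens = tp / (tp + fn)          # sensitivity = a/(a+c)
spec = tn / (tn + fp)          # specificity = d/(b+d)
ppv = tp / (tp + fp)           # positive predictive value = a/(a+b)
npv = tn / (tn + fn)           # negative predictive value = d/(c+d)
lr_pos = sens / (1 - spec)     # likelihood ratio of a +ve test result
lr_neg = (1 - sens) / spec     # likelihood ratio of a -ve test result
print(f"sens {sens:.2f}, spec {spec:.2f}, PPV {ppv:.2f}, NPV {npv:.2f}, "
      f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")

# PPV depends on prevalence (pre-test probability): recompute at 1% prevalence
prev = 0.01
ppv_low = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
print(f"PPV at 1% prevalence = {ppv_low:.2f}")     # falls sharply as prevalence falls
```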
Probability Theory:
- Probability theory is vital to inferential statistical analysis as it helps predict population parameters from sample data on the assumption that the sample data are "typical" of the population data
- It determines the probability of an event mathematically – This probability is expressed as
a relative frequency that the event occurs in an infinite number of trials. This frequency
ranges from 0 (event never occurs) to 1 (event always occurs)
- Scenarios:
o For mutually exclusive events (Ie. the occurrence of one event precludes the other event occurring):
P(A and B) = 0
P(A or B) = P(A) + P(B)
o If the events are exhaustive, the probability of the events will add to 1 (Eg.
binomial probabilities such as coin flipping) – Thus, P(A or B) = 1
o If the two events are independent and not mutually exclusive (Ie. the occurrence of one event does not affect the probability of the other event occurring):
P(A and B) = P(A) x P(B)
P(A or B) = P(A) + P(B) – P(A and B)
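A short enumeration sketch with two fair dice to illustrate these rules (the events chosen are arbitrary examples):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def p(event):
    """Probability of an event as its relative frequency over all outcomes."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Mutually exclusive events: A = "first die is 1", B = "first die is 6"
A = lambda o: o[0] == 1
B = lambda o: o[0] == 6
print(p(lambda o: A(o) and B(o)))                  # 0.0
print(p(lambda o: A(o) or B(o)), p(A) + p(B))      # both 1/3

# Independent events: C = "first die is even", D = "second die is even"
C = lambda o: o[0] % 2 == 0
D = lambda o: o[1] % 2 == 0
print(p(lambda o: C(o) and D(o)), p(C) * p(D))                               # both 0.25
print(p(lambda o: C(o) or D(o)), p(C) + p(D) - p(lambda o: C(o) and D(o)))   # both 0.75
```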
Significance Testing:
- (1) State the null hypothesis (HO) that any difference observed between the groups is due to chance alone
- (2) Calculate the difference between the groups with respect to the outcome being measured ("point estimate")
- (3) Calculate the probability of obtaining the observed data assuming the difference
observed is generated by random chance alone (Ie. HO is true):
o (a) This is accomplished by selecting an appropriate statistical method to produce
a “test statistic” (Eg. t-test with t-value)
o (b) The appropriate probability distribution for this test statistic is looked up
according to sample size and variability of the data (Eg. t-distribution for t-value
with proper degrees of freedom), and the probability value for it is determined
o (c) If the probability of obtaining the observed data under HO is smaller than a set "critical value" (Ie. P < 0.05), then HO can be confidently rejected (Ie. conclude the difference is likely real); if this probability is higher than the "critical value", then HO cannot be rejected (Ie. conclude the difference may be due to chance alone)
Note that it is usually more appropriate to perform a two-tailed test – This is because analysis of data by a one-tailed test may incorrectly accept HO (Ie. conclude differences between groups are due to chance alone) when the opposite result to that expected is revealed
Overview of P-value:
- "P-value" is defined as a probability number between 0 and 1 indicating the likelihood of observing a difference at least as large as that seen between the two groups if it were due to chance alone (Ie. if the HO were true)
- The threshold P-value for statistical significance is equivalent to “Type I error” (α), which
is the probability of inappropriately rejecting the “null hypothesis”:
o α (or the threshold P-value for statistical significance) is usually set at 0.05 by
researchers – This means that there is a 1 in 20 chance for erroneously rejecting
the null hypothesis when in fact the difference seen is really due to chance
o A lower α (or threshold P-value for statistical significance) means less chance of
erroneously rejecting the null hypothesis – However, this decreases the power of
the study (Ie. reduced ability to detect a difference between two groups when one
exists) unless a compensatory rise in sample size is made
o As a result, if the P-value is LESS than α (or < 0.05), then the null hypothesis can
be rejected (as there is a very small chance that it is true), and the difference
between the two groups is likely to be real and unlikely to be due to chance alone
- Issues with using P-values:
o (1) Only indicates whether the difference between the two group parameters being compared is statistically significant
o (2) Lacks clinical applicability – P-values do not indicate clinical significance as there is no indication of the magnitude or direction of the difference between the two group parameters being compared
o (3) Does not provide any information about the precision of the results
o (4) Does not provide an estimation of the population parameter
Thus, a "95% confidence interval" is the range of values derived from a sample of the population of interest that has a 95% probability of encompassing the true population parameter
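A minimal sketch of deriving a 95% CI for a sample mean using the t-distribution (the measurements are invented):

```python
import numpy as np
from scipy import stats

# Hypothetical sample measurements (Eg. fasting glucose in mmol/L)
sample = np.array([5.1, 5.6, 4.9, 5.3, 5.8, 5.2, 5.5, 5.0, 5.4, 5.7])

mean = sample.mean()
sem = stats.sem(sample)                       # SD / sqrt(n)

# 95% CI for the population mean, using the t-distribution with df = n - 1
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = {ci[0]:.2f} to {ci[1]:.2f}")
```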
Nb. β is set larger than α – This is because it is safer and more ethical to falsely accept HO than to falsely reject it (Ie. more risk of harm from incorrectly changing medical practice!)
- Power
o Defined as the ability of the study to detect a difference between two groups
when a difference actually exists, thus allowing the null hypothesis to be rejected
appropriately
o Power can be calculated as follows:
Power = 1 – β
Thus, power can be decreased by a rise in type II error, which is generally
caused by the following factors:
- (i) Recruiting an inadequate sample size for the study
- (ii) Decreasing type I error (lowering α) – UNLESS a compensatory
rise in sample size is made
- (iii) Small effect size (Ie. little difference exists between the two
groups)
- (iv) Imprecise means of measuring the study and outcome factors
- (v) Employing a two-sided test (rather than a one-sided one)
Note that the sample size (n) for a given study will depend on:
- (1) Power required (MAIN) (Z1-β)
- (2) Size of the minimum actual effect important enough to require detection (Δ)
- (3) Two-sided significance level (Z1-α/2)
- (4) Variance in the population (σ2)
These are related by: n = [(Z1-α/2 + Z1-β)2 x σ2] / Δ2
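A brief sketch of this sample size calculation (the inputs are invented; the factor of 2 shown is the usual addition when comparing the means of two groups rather than one):

```python
from scipy.stats import norm

# Hypothetical inputs: two-sided alpha = 0.05, power = 0.8, SD = 10 units,
# minimum clinically important difference (delta) = 5 units
alpha, power, sd, delta = 0.05, 0.80, 10.0, 5.0

z_alpha = norm.ppf(1 - alpha / 2)     # Z for the two-sided significance level
z_beta = norm.ppf(power)              # Z for the required power (1 - beta)

# n per group for comparing two means
n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2
print(f"n per group ~ {n:.0f}")       # ~63 per group
```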
Study Design:
- (1) Randomised controlled trial (RCT):
o In order to minimise systematic bias (esp confounding factors) that can obscure the true relation between the study and outcome factors, a good RCT will employ – (i) Randomisation, (ii) Double blinding, and (iii) Allocation concealment
o Advantages:
(i) Prospective nature of study – Study factor (Eg. intervention) can be
administered in a precise and controlled manner, and the outcome factor
can be measured over time. Also allows the size, funding and data analysis
of the study to be determined prior to commencing the study
(ii) Randomisation of subjects to treatment groups – This minimises
allocation bias and confounder bias (caused by unequal distribution of
confounding factors between treatment groups)
(iii) Double blinding – Experimenters and subjects are blinded to the
designation of study factor to subjects within the study. This decreases
subject-observer bias
(iv) Control group – Allows more meaningful conclusions that can be
made of the studied treatment
(v) Measurements (esp parametric data) can be chosen precisely, thus
making it easier to make observations consistently and for statistical
methods to be used
(vi) RCTs can have a high power – This is ideal for detecting small but
clinically relevant conclusions
(vii) RCTs allow for subgroup analysis – This enhances their usefulness for clinical practice
(viii) Large, multi-centre RCTs have greater applicability and afford higher
level of evidence (level I)
(ix) Even RCTs with inconclusive results can be eminently publishable
o Issues:
(i) Increased expense and time consumption (Ie. very difficult and costly
to organise and supervise study at multiple research sites)
(ii) Results may not mimic real life treatment situation
(iii) Risk of choosing subjects whose consent is invalid or treatments that
are unethical (Ie. denying treatment to one group)
(iv) Recruitment/selection bias can occur (Ie. patients too ill or declined) – This causes patient selection to be too specific, which decreases the variance and reduces the applicability of the study results to the general population
(v) Inaccurate results due to systematic bias with poor RCT design (Ie. lack of blinding, poor randomisation process)
- (2) Cohort study:
o Used to provide evidence of causality – Used to study aetiology, therapy and
prognosis clinical questions
o Less ideal than an RCT – Thus used when an RCT is NOT feasible (Eg. testing the effects of smoking)
o Study group is selected from a population of interest and the participants are
separated according to the exposure or lack of exposure to the study factor (but
not allocated per se). The outcome of interest is measured in a prospective
fashion from the time of exposure to the study factor
o Advantages:
(i) Allows causality as there is temporal sequence between exposure and
outcome
(ii) Allows assessment of multiple outcomes and of other factors that may
influence outcome
(iii) Permits measurement of incidence
(iv) Exposure can be measured without bias
o Issues:
(i) Not always cost/time efficient (can require very large sample sizes)
(ii) May be difficult to accurately define and measure exposure at times
(iii) Bias can occur due to loss to follow-up
- (3) Case-control study:
o Used to explore potential causes of an outcome – Used to study aetiology, therapy and prognosis clinical questions
o Less ideal than an RCT – Thus used when an RCT is NOT feasible (Eg. testing the effects of smoking)
o Studies begin with identification of +ve cases (Ie. those with the outcome of interest), then controls are selected to be compared with these cases retrospectively for their exposure to the study factor
o Issues:
(i) Difficulty in ensuring that the cases and the controls are from the same population (minimised if the study is population-based rather than hospital-based)
(ii) Appropriateness of the control group (need to question whether a control would have been identified as a case had they developed the outcome of interest)
(iii) Potential for bias (esp selection bias of cases/controls, recall, survivor
and misclassification biases)
(iv) Cannot infer causality due to lack of temporal sequence of events
(v) Cannot derive incidence or prevalence data
(vi) Inefficient type of study if study factor (or exposure) is rare
o Advantages:
(i) Capable of detecting an effect with much smaller numbers (cf. cohort
study)
(ii) Can explore the importance of several study or aetiological factors
simultaneously (“Hypothesis generation”)
(iii) Relatively quick and inexpensive type of study
(iv) Good for studying causes of rare outcomes or outcomes with long
latency periods
- (4) Cross-sectional study
o Used to examine the association between study and outcome factors (Eg. aetiology and prevalence questions) – Cannot establish causality as there is no temporal sequence
o Study population is selected and then each participant is examined for the presence of the study factors and outcome factors. The relationship between the study and outcome factors is examined as it exists at one point in time
- (5) Prospective diagnostic study
o Used to study diagnostic clinical question
o Involves a blinded comparison where all subjects receive both the test of interest and a definitive reference test ("gold standard")
Overview of bias:
- "Bias" is defined as a systematic disposition of a trial design that causes an estimated measure of association or frequency to deviate from the truth (Ie. produces results that are consistently better or worse than they actually are)
- It can occur in either direction, and does not necessarily get smaller with an increase in
sample size
- Types of bias:
o (1) Recruitment/Selection bias
Bias in choosing the subjects to participate in the study. This can lead to
certain subjects being included (or excluded) from the study
Includes – Sampling bias, Volunteer bias, Prevalence bias
o (2) Information bias
Bias occurs during taking of measurements or recording of data
Includes – Recall bias, Misclassification bias, Subject-Observer bias,
Ascertainment bias, Measurement bias
o (3) Allocation bias
Situation where recruitment and allocation are performed by the
experimenter. This can lead to removal (or not recruiting) a patient after
knowledge of allocation
o (4) Confounding bias (See below)
o (5) Publication bias – Tendency for only positive studies to be published
o (6) Language bias – Tendency to limit searches to English
- “Confounding bias” is the largest source of bias (esp in RCTs):
o It is defined as a situation in which a measure of the effect of the study factor is
distorted because of the association of the study factor with other factors (aka.
“confounders”) that influence the outcome
o Must fulfil three criteria:
(i) Independent risk factor for the outcome of interest
(ii) Not be an intervening variable on the causal pathway between study
and outcome factor
(iii) Associated with the study factor in the data being analysed
o Confounding can be minimised by:
(i) Omitting the confounding factor during recruitment (Ie. exclude
certain age group or gender)
(ii) Randomisation – This ensures that potential confounders are equally
distributed between treatment groups
(iii) Collate potential confounders in a table (Eg. Table 1) so any
maldistribution can be evaluated – Note that the P-value of the size of
difference of the confounding factor between treatment groups does
NOT indicate whether confounding is likely or not. It really depends on
how powerful the confounding factor is!
(iv) If there is maldistribution of a potential confounder between
treatment groups, it can be corrected for mathematically using statistical
methods (Ie. analysis for the potential confounder can be performed to
see if the treatment outcome under study is similar in subjects with and
without the potential confounder)
- Removing bias from a study:
o In order to eliminate all possible bias, the study should employ:
(i) Allocation concealment – Experimenters are unaware of the randomisation sequence and the assignment of each subject to a particular treatment group. This removes allocation bias
(ii) Randomisation – Subjects are randomly assigned to treatment groups
(Ie. by computer). This removes confounder bias by ensuring that the
groups are similar and the only difference between them is the study
factor
(iii) Double blinding – Experimenter and subjects are not aware of which
treatment the subject has received. This removes “observer bias”
o By doing this, the only confounding that arises within the study is that which
occurs by random chance (or non-systematic bias) – Statistics is then employed to
quantify that degree of random chance!
Note (features of a forest plot in a meta-analysis):
- Squares represent the point estimate of the difference between the groups, with the size of each square being proportional to the weight of the study (which is determined by sample size)
- The horizontal lines from the square represent the 95% CI of the
estimate
- The diamond represents the overall result pooled from all the studies
- The vertical line represents the line of no effect (Eg. RR or OR = 1)