MD51 Lecture 1
Lecturer: Dr K. Mukherjee
Office: S16-06-100 Tel: 874 2764 Email: [email protected]
Objectives
To train practitioners of the biomedical sciences in the use and interpretation of statistical data analysis.
Teaching approach
nonmathematical introduction
explanation of concepts rather than proofs
emphasis on methodology and procedures
emphasis on use of a statistical package rather than manual calculation
emphasis on choosing the right procedure
emphasis on correct interpretation of results
examples from clinical research literature
Sampling
biomedical research projects are mostly carried out on small numbers of study subjects
it is a challenging problem to project results from small-sample studies to individuals at large
Biological Variation
Necessitates the use of statistical methods in biomedicine to put numerical data into a context by which we can better judge their meaning
[Diagram: population and sample, linked by inductive statistical methods]
Altman (1991) Practical Statistics for Medical Research, Chapman and Hall.
Lesson: CARE must be exercised when reading scientific papers in biomedical journals! Knowledge of basic biostatistics is required.
"There are three kinds of lies: lies, damned lies and statistics." (Benjamin Disraeli)
"It is easy to lie with statistics, but it is easier to lie without them." (Frederick Mosteller)
"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." (H.G. Wells)
Any characteristic that can be measured or classified into categories is called a variable
Types of variables
(1) Qualitative variables
    cannot be measured numerically
    categorical in nature, e.g., gender
    categories must not overlap and must cover all possibilities
    Nominal variables (no inherent ordering of categories)
        M/F, Yes/No
        Blood group (A, B, AB, O)
        Ethnic group (Chinese, Malay, Indian, Others)
    Ordinal variables (categories are ordered in some sense)
        response to treatment: unimproved, improved, much improved
        pain severity: no pain, slight pain, moderate pain, severe pain
(2) Quantitative variables
    can be measured numerically, e.g., weight, height, concentration
    can be continuous or discrete
    a continuous variable can take on any value (subject to precision of measuring instrument) within some range or interval, e.g., weight, height, blood pressure, cholesterol level
    a discrete variable is usually a count of something and hence takes on integer values only, e.g., number of admissions to NUH
Variable types and measurement types
    have implications for how data should be displayed or summarized
    determine the kind of statistical procedures that should be used
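As a rough illustration of how the variable type guides the choice of summary, the sketch below returns category counts for qualitative variables and mean and median for quantitative ones. The function name `summarize` and all sample values are hypothetical, chosen only for illustration.

```python
from collections import Counter
from statistics import mean, median


def summarize(values, kind, order=None):
    """Return a summary suited to the variable type.

    kind  -- "nominal", "ordinal", "discrete" or "continuous"
    order -- for ordinal data, the categories listed from lowest to highest
    """
    if kind == "nominal":
        # No ordering: only the count per category is meaningful.
        return dict(Counter(values))
    if kind == "ordinal":
        # Ordered categories: counts reported in category order.
        counts = Counter(values)
        return {category: counts[category] for category in order}
    if kind in ("discrete", "continuous"):
        # Numeric data: the centre can be summarised by mean and median.
        return {"mean": mean(values), "median": median(values)}
    raise ValueError(f"unknown variable type: {kind}")


# Hypothetical sample values (not taken from the lecture).
blood_group = ["A", "O", "B", "O", "AB", "O"]
pain = ["slight pain", "no pain", "severe pain", "slight pain"]
pain_scale = ["no pain", "slight pain", "moderate pain", "severe pain"]
admissions = [3, 5, 4, 6, 2]

print(summarize(blood_group, "nominal"))
print(summarize(pain, "ordinal", order=pain_scale))
print(summarize(admissions, "discrete"))
```

Averaging blood groups would be meaningless, which is why the nominal branch only counts.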
SUMMARY
[Diagram: a variable is classified by type of variable (qualitative or categorical; quantitative measurement) and by measurement scale]
Let the data speak for itself
Get a good feel for the data before formal analysis
Graphs and plots are easier to understand and interpret
They reveal patterns in the data which may shed light on the appropriate model/analysis to use, e.g.,
    skewed or symmetric distribution
    multiple peaks / modes
    are there any outliers?
    relationship between variables
[Figures: bar charts and pie charts of % of world spending by region (Africa, Australasia, Canada, Europe, Japan, Latin America, Middle East, SE Asia & China, USA); labelled values include Canada 2.0% and USA 34.0%]
Comparison of methods
Bar charts can be read more accurately and offer better distinction between close-together values
Pie charts are especially useful for showing percentage distribution
Pie charts can display large and small % simultaneously without a scale break
A single bar chart is preferable to a single segmented bar chart
A series of segmented bar charts is easier to read than a series of pie charts or ordinary bar charts
[Figures: bar charts of number of workers by profession, and of percent by sector by profession, with private and public sectors shown separately]
Plotting by sector rather than by profession
Look at the data from a different angle
Highlight different aspects of the data
Clustered bar charts of number of health professionals
[Figures: bar charts of number of workers, and of percent, plotted by sector (Private, Public)]
Comparison of methods
A stacked bar chart is also a bar chart for the combined data
Some of the bars in a stacked bar chart are not aligned
Bars in clustered bar charts are aligned, but it is harder to visualize how the component bars would stack up
Back-to-back bar charts are applicable when there are 2 groups only; the aggregated bars are not aligned
A series of stacked or segmented bar charts is useful in showing time trend
Time Trend
Starting the vertical axis at 8 rather than 0 visually exaggerates the increase in the number of prescriptions written per person
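The size of this distortion can be put into numbers. The sketch below compares the ratio of two drawn bar heights when the axis starts at 0 and when it starts at 8. The helper `apparent_ratio` and the two prescription figures are hypothetical, since the plotted values are not given here.

```python
def apparent_ratio(earlier, later, baseline=0.0):
    """Ratio of the drawn bar heights when the axis starts at `baseline`."""
    if baseline >= min(earlier, later):
        raise ValueError("baseline must lie below both values")
    return (later - baseline) / (earlier - baseline)


# Hypothetical values: prescriptions written per person in two years.
earlier, later = 9.0, 11.0

true_ratio = apparent_ratio(earlier, later, baseline=0.0)   # 11 / 9
drawn_ratio = apparent_ratio(earlier, later, baseline=8.0)  # (11 - 8) / (9 - 8)

print(f"axis from 0: later bar is {true_ratio:.2f} times as tall")
print(f"axis from 8: later bar is {drawn_ratio:.2f} times as tall")
```

With these assumed figures a rise of about 22% is drawn as a bar three times as tall.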
Response to treatment
Treatment   None   Partial   Complete   Total
A              3        15          9      27
B              2        22         30      54
[Figure: bar chart of frequency by response to treatment, for treatments A and B]
Within-treatment percentages allow the response types to be compared between the two treatments
[Figure: segmented bar chart of within-treatment percentage (None, Partial, Complete) for treatments A and B]
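The within-treatment percentages can be worked out directly from the counts for treatments A and B. This is a minimal sketch: the counts come from the treatment table, but matching the three count columns to None, Partial and Complete in that order is an assumption based on the chart legend, and the function name is hypothetical.

```python
# Counts from the treatment table. The response labels are an ASSUMED
# column order (None, Partial, Complete) taken from the chart legend.
counts = {
    "A": {"None": 3, "Partial": 15, "Complete": 9},
    "B": {"None": 2, "Partial": 22, "Complete": 30},
}


def within_treatment_percent(row):
    """Convert one treatment's counts into percentages of that treatment's total."""
    total = sum(row.values())
    return {response: 100.0 * n / total for response, n in row.items()}


percent = {t: within_treatment_percent(row) for t, row in counts.items()}

for treatment, row in percent.items():
    line = ", ".join(f"{response}: {p:.1f}%" for response, p in row.items())
    print(f"Treatment {treatment}: {line}")
```

Because treatment B has twice as many subjects as A, raw frequencies are hard to compare; percentages put both treatments on the same 0 to 100 scale.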
Histogram
Divide the range of the data into a suitably chosen number of intervals/bins, all of the same width
Plot the number of observations that fall within each interval
Relative frequency histogram
Plot the proportions of observations that fall within the class intervals
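The binning step behind both kinds of histogram can be sketched as follows. The function names and the sample measurements are hypothetical; placing the maximum value in the last bin is one common convention, assumed here.

```python
def histogram(data, n_bins):
    """Count observations in n_bins equal-width intervals spanning the data range."""
    low, high = min(data), max(data)
    if high == low:
        raise ValueError("data must contain at least two distinct values")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts


def relative_frequencies(counts):
    """Proportion of observations in each bin."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical measurements (not the SysVol data).
data = [52, 58, 61, 67, 70, 74, 75, 79, 83, 88, 95, 110]

edges, counts = histogram(data, n_bins=4)
proportions = relative_frequencies(counts)

for left, right, c, p in zip(edges, edges[1:], counts, proportions):
    print(f"{left:6.1f} to {right:6.1f}: {c:2d} observations ({100 * p:.1f}%)")
```

The two histograms have the same shape; only the vertical scale changes from counts to proportions.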
[Figures: frequency histogram and relative frequency (percent) histogram of SysVol]
Comparison of methods
Histograms
    good at revealing distributional shape such as symmetry, skewness, number of peaks, etc.
    difficult to superimpose or draw side by side
Frequency polygons
    can be superimposed for easy comparison
The median is the middle value (if n is odd) or the average of the two middle values (if n is even). It is a measure of the center of the data.
Quartiles divide the set of ordered values into 4 equal parts:
first 25% | Q1 | second 25% | Q2 | third 25% | Q3 | fourth 25%
Q2 = second quartile = median
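The definitions of the median and quartiles can be sketched in code. The median follows the odd/even rule as stated. For Q1 and Q3 several conventions exist and statistical packages may differ; this sketch assumes the "inclusive" interpolation method of Python's `statistics.quantiles`. The sample values are hypothetical.

```python
from statistics import quantiles


def median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the 'inclusive' interpolation convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


# Hypothetical data.
odd_sample = [7, 1, 5]
even_sample = [1, 2, 3, 4]
nine_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(median(odd_sample))      # middle of 1, 5, 7
print(median(even_sample))     # average of 2 and 3
print(quartiles(nine_values))  # Q2 coincides with the median
```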
Box plot
Draw a box from the lower quartile to the upper quartile, with a line to mark the position of the median
Extend a line from each edge of the box by 1.5 IQR (IQR = interquartile range = upper quartile minus lower quartile), then pull each line back until it hits an observation
Observations more than 1.5 IQR below the lower quartile or above the upper quartile are marked as outside values for further investigation and checking
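The quantities needed to draw a box plot by this recipe can be computed as in the sketch below. The function name `box_plot_summary`, the sample data, and the "inclusive" quartile convention of `statistics.quantiles` are assumptions; other software may place the quartiles slightly differently.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Compute box edges, whisker ends and outside values for a box plot."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers are pulled back to the most extreme observations within the limits.
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    outside = [x for x in ordered if x < lower_limit or x > upper_limit]
    return {
        "lower_quartile": q1,
        "median": q2,
        "upper_quartile": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outside_values": outside,
    }


# Hypothetical data with one unusually large observation.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(data)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the upper limit is 7.75 + 1.5 × 4.5 = 14.5, so the upper whisker is pulled back to 9 and the value 30 is flagged as an outside value.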
Dotplot for SysVol (end-systolic volume, a measure of the size of the heart)
[Figures: plots of SysVol]
Box plot: quick visual summary of a data set
captures prominent features like location, spread, skewness and outliers
a series of box plots can easily be drawn side by side; not so for histograms
Dataset Hotdogs
Name                          Type     Taste     $/oz  $/lbProt  Cal  Sod  Prot/Fat
Happy Hill Supers             Beef     Bland     0.11  14.23     186  495  1
Georgies Skinless Beef        Beef     Bland     0.17  21.7      181  477  2
Special Market's Premium B    Beef     Bland     0.11  14.49     176  425  1
Spike's Beef                  Beef     Medium    0.15  20.49     149  322  1
Hungry Hugh's Jumbo Beef      Beef     Medium    0.1   14.47     184  482  1
Great Dinner Beef             Beef     Medium    0.11  15.45     190  587  1
RJB Kosher Beef               Beef     Medium    0.21  25.25     158  370  2
Wonder Kosher Skinless Bee    Beef     Medium    0.2   24.02     139  322  2
Happy Fats Jumbo Beef         Beef     Medium    0.14  18.86     175  479  1
Midwest Beef                  Beef     Medium    0.14  18.86     148  375  1
General Kosher Beef           Beef     Medium    0.23  30.65     152  330  1
Wall's Kosher Beef Lower F    Beef     Medium    0.25  25.62     111  300  3
Hickory Natural Smoke         Beef     Medium    0.07  8.12      141  386  2
Smith Beef                    Beef     Medium    0.09  12.74     153  401  1
Premium Beef                  Beef     Medium    0.1   14.21     190  645  1
Family Store Skinless Beef    Beef     Medium    0.1   13.39     157  440  1
Sam's Kosher Beef             Beef     Medium    0.19  22.31     131  317  2
Hammer Beef                   Beef     Medium    0.11  19.95     149  319  1
Athens Beef                   Beef     Medium    0.19  22.9      135  298  2
Regents Kosher Beef           Beef     Scrumpt.  0.17  19.78     132  253  2
Really Big                    Meat     Bland     0.12  14.86     173  458  2
Biggest Jumbo                 Meat     Bland     0.12  17.32     191  506  1
Home Made                     Meat     Bland     0.12  15.2      182  473  1
Martha's Jumbo Dinner         Meat     Bland     0.1   14.01     190  545  1
Hammer Premium                Meat     Bland     0.11  13.92     172  496  2
Willie's Wieners              Meat     Bland     0.13  18.24     147  360  1
Premium Hot Dogs              Meat     Medium    0.1   14.12     146  387  1
Airport Wieners               Meat     Medium    0.09  11.83     139  386  2
Judy's Favorite Jumbos        Meat     Medium    0.11  15.41     175  507  1
Stick Lean Supreme Jumbo      Meat     Medium    0.15  17.4      136  393  3
Stick Jumbo                   Meat     Medium    0.13  17.32     179  405  1
Fat Jack Jumbo                Meat     Medium    0.1   15.61     153  372  1
Thin Jack Veal                Meat     Medium    0.18  20.4      107  144  3
Top Grade Hot Dogs            Meat     Medium    0.09  12.65     195  511  1
Blended w/Chicken&Beef        Meat     Scrumpt.  0.07  11.17     135  405  1
Heaven Made                   Meat     Scrumpt.  0.08  11.75     140  428  1
Baked and Smoked              Meat     Scrumpt.  0.06  9.49      138  339  1
Smart Person Chicken          Poultry  Bland     0.08  10.21     129  430  2
Woods Park Chicken            Poultry  Medium    0.05  6.37      132  375  2
Tony Turkey                   Poultry  Medium    0.07  8.42      102  396  3
Rose Garden Turkey            Poultry  Medium    0.08  9.37      106  383  3
Low Fat Turkey                Poultry  Medium    0.08  9         94   387  4
Special Market's Turkey       Poultry  Medium    0.07  8.07      102  542  5
Caloryless Turkey             Poultry  Medium    0.09  9.39      90   359  5
Heaven Made Lower Fat         Poultry  Medium    0.06  6.59      99   357  4
McDowell's Jumbo Chicken      Poultry  Medium    0.07  8.43      107  528  2
[Figure: time series plot of concentration, 1969 to 1972]
Reduction in concentration through time
Higher during winter months
Skewed toward higher values
Spread increases with level