Data Management
Statistics
Definition: the practice or science of collecting and
analyzing numerical data in large quantities, especially
for the purpose of inferring proportions in a whole
from those in a representative sample
• Variable: a characteristic or attribute that can assume different values
• Variables whose values are determined by chance are called random variables
• Data: values (measurements or observations) that variables can assume
• Data is the information collected; the group of information forms a data set
• Each value in the set is a data point or datum
“Language of Statistics”
• Descriptive Statistics involves the collection, organization, summarization, and presentation of data
• Inferential Statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions
Two Branches of
Statistics
Population
• ALL subjects (human or otherwise) that are being studied
• Examples
  • All citizens of the United States
  • All students enrolled at GHC during Fall 2009
  • The governors of the 50 United States
Sample
• “Small” group of subjects (human or otherwise) selected from the population
• Examples
  • 1000 adult Americans surveyed to determine whether they favor the legalization of marijuana
  • 21 GHC students in Mr. Griffin’s statistics class surveyed to determine height
Population vs Sample
Qualitative Variables
• Can be placed into distinct categories, according to some characteristic or attribute (typically non-numeric)
• Examples: Eye Color, Gender, Religious Preference, Yes/No
Quantitative Variables
• Numerical
• Can be ordered or ranked
• Examples: Heights, Weights, Pulse Rate, Age, Body Temperatures, Credit Hours
Variable Classifications
Discrete Variables
• Can be assigned values such as 0, 1, 2, 3
• “Countable”
• Examples:
  • Number of children
  • Number of credit cards
  • Number of calls received by switchboard
  • Number of students
Continuous Variables
• Can assume an infinite number of values between any two specific values
• Obtained by measuring
• Often include fractions and decimals
• Examples: Temperature, Height, Weight
Quantitative Variables
Data
• Quantitative
  • Discrete
  • Continuous*
• Qualitative
Measurement Scales
Interval
• Ranks data
• Precise differences between units of measure do exist
• No meaningful zero
• Examples:
  • Temperature (0° does not mean no heat at all)
  • IQ Scores (0 does not imply no intelligence)
Ratio
• Ranks data
• Precise differences exist
• True zero exists
• Examples: Height, Weight, Area, Number of phone calls received, Salary
Measurement Scales
Summary Measures
Central Tendency, Mean, Median, Mode, Minimum, Maximum, Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation, Percentile, Decile, Quartile, Kurtosis
Data Description
Chapter 3
• 3-1 Introduction
• 3-2 Measures of Central Tendency
• 3-3 Measures of Variation
• 3-4 Measures of Position
• 3-5 Exploratory Data Analysis
• 3-6 Summary
Outline
Center: a representative or average value that indicates
where the middle of the data set is located
Variation: a measure of the amount that the values vary
among themselves
Distribution: the nature or shape of the distribution of
data (such as bell-shaped, uniform, or skewed)
Outliers: Sample values that lie very far away from the
majority of other sample values
Time: Changing characteristics of data over time
Important
Characteristics of Data
• The most common characteristic to measure is the center of the
data set. Often people talk about the AVERAGE.
• Examples
• The average American man is five feet, nine inches tall; the average
woman is five feet, 3.6 inches
• On the average day, 24 million people receive animal bites
Measures of Variation
• We also need to know the Measures of Position
• Percentiles, Deciles, and Quartiles
• Used extensively in Psychology and Education, referred to as
“Norms”
• These tell us where a specific data value falls within the
data set or its relative position in comparison with other
data values
Measures of Position
• Objective(s)
• Summarize data using measures of central tendency, such as
the mean, median, mode, and midrange
Statistic: represented by ROMAN (English) letters
Parameter: represented by GREEK letters
• When calculating the measures of central tendency, variation,
or position, do NOT round intermediately. Round only the
final answer
• Rounding intermediately tends to increase the difference
between the calculated value and the actual “exact” value
• Round measures of central tendency and variation to one
more decimal place than occurs in the raw data
• For example, if the raw data are given in whole numbers, then
measures should be rounded to nearest tenth. If raw data are
given in tenths, then measures should be rounded to nearest
hundredth.
General Rounding
Guidelines
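The rounding guideline can be sketched in a few lines of Python. This is only an illustration: the helper name `round_measure` and both small data sets are hypothetical and do not come from the slides.

```python
def round_measure(value, raw_decimals):
    """Round a FINAL result to one more decimal place than the raw data."""
    return round(value, raw_decimals + 1)

# Hypothetical raw data in whole numbers (0 decimals): report to the nearest tenth.
whole_numbers = [2, 3, 3]
mean_whole = sum(whole_numbers) / len(whole_numbers)   # kept unrounded
reported_whole = round_measure(mean_whole, 0)

# Hypothetical raw data in tenths (1 decimal): report to the nearest hundredth.
tenths = [1.2, 1.3, 1.3]
mean_tenths = sum(tenths) / len(tenths)                # kept unrounded
reported_tenths = round_measure(mean_tenths, 1)

print(reported_whole, reported_tenths)
```

The intermediate values (`mean_whole`, `mean_tenths`) stay at full precision; only the reported figure is rounded.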
• A measure of center is the data value (or values) at the center or
middle of a data set
• Mean
• Median
• Mode
• Midrange
Mean
Mean ----Formula
• Notation
  • ∑ (sigma) denotes the sum of a set of values
  • x is the variable usually used to represent the individual data values
  • n represents the number of values in a sample
  • N represents the number of values in a population
• Mean of a set of sample values (read as x-bar): $\bar{x} = \frac{\sum x}{n}$
• Mean of all values in a population (read as “mu”): $\mu = \frac{\sum x}{N}$
• The number of highway miles per gallon of the 10 worst
vehicles is given:
12 15 13 14
15 16 17 16
17 18
• Find the mean.
Mean ---Example
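One way to check this example is a short Python sketch that applies the sample-mean formula (sum of the values divided by n) to the ten highway mpg values listed above. The variable names are illustrative only.

```python
# Highway miles per gallon of the 10 vehicles from the example.
mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]

n = len(mpg)          # number of values in the sample
total = sum(mpg)      # the "sum of x" term in the formula
x_bar = total / n     # sample mean

# Raw data are whole numbers, so report the mean to the nearest tenth.
print(round(x_bar, 1))
```

The values sum to 153, so the mean is 153 / 10 = 15.3 miles per gallon.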
• Is the middle value when the raw data values are arranged in
order from smallest to largest or vice versa
• Is used when one must find the center or midpoint of a data set
• Is used when one must determine whether the data values fall
into the upper half or lower half of the distribution
• Is affected less than the mean by extremely high or low values
• Does not have to be an original data value
• Various notations: MD, Med, $\tilde{x}$
Median
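• Odd number of data values (n is odd): the median is the single middle value of the ordered data
• Even number of data values (n is even): the median is the mean of the two middle values of the ordered data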
12 15 13 14
15 16 17 16
17 18
• Find the median.
Median ---Example
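The median of these ten values can be worked out with a small Python sketch. The function below is a hand-written illustration of the odd/even rule, not a library routine; Python's standard `statistics.median` would give the same result.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair

mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]
md = median(mpg)
print(md)
```

Sorted, the data are 12 13 14 15 15 16 16 17 17 18. Because n = 10 is even, the median is (15 + 16) / 2 = 15.5, which is not one of the original data values.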
• Measured amounts of lead (in mg/m3) in the air are given:
Median – Example #2
• Is the data value(s) that occurs most often in a data set
• Sometimes said to be the most typical case
• Is the easiest average to compute
• Can be used when the data are nominal, such as religious
preference, gender, or political affiliation
• Is not always unique. A data set can have more than one
mode, or the mode may not exist for a data set
• Has no “special” symbol
• Look for the number(s) that occur the most often in the data
set
Mode
• The number of highway miles per gallon of the 10 worst
vehicles is given:
12 15 13 14
15 16 17 16
17 18
• Find the mode.
Mode ---Example
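A quick way to find the mode(s) is to count how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]

counts = Counter(mpg)            # value -> number of occurrences
highest = max(counts.values())   # the largest frequency in the data set

# Every value that reaches the largest frequency is a mode.
modes = sorted(value for value, count in counts.items() if count == highest)
print(modes)
```

Here 15, 16, and 17 each occur twice and every other value occurs once, so this data set has three modes. It is an instance of the mode not being unique.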
• Measured amounts of lead (in mg/m3) in the air are given:
Mode – Example #2
• Is a rough estimate of the midpoint for the data set
• Is found by adding the lowest and highest data values and
dividing by 2
• Is easy to compute
• Gives the midpoint
• Is affected by extremely high or low data values
• Is rarely used
• Is denoted by MR
12 15 13 14
15 16 17 16
17 18
• Find the midrange.
Midrange ---Example
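The midrange calculation (lowest value plus highest value, divided by 2) can be sketched in Python as follows, using the same ten values.

```python
mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]

lowest = min(mpg)
highest = max(mpg)
mr = (lowest + highest) / 2   # midrange
print(mr)
```

With a lowest value of 12 and a highest value of 18, MR = (12 + 18) / 2 = 15. Only the two extreme values enter the calculation, which is why the midrange is so sensitive to extremely high or low data values.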
• There is no single best answer to that question because
there are no objective criteria for determining the most
representative measure for all data sets
• Avoid the term “average”; instead, use the actual measure
of central tendency that is calculated (mean, median,
mode, or midrange)
• Use the advantages and disadvantages stated above to
decide which measure of central tendency is best.
FREQUENCY
DISTRIBUTION
Introduction
• After organizing the data, we must present them in a way that is easily understandable.
• Statistical charts and graphs are the most useful method for presenting data
• We will be discussing the following statistical charts and graphs
  • Histograms
  • Frequency Polygons
  • Ogives
  • Pareto Charts
  • Time Series Graphs
  • Stem & Leaf Plot
Introduction
• A frequency distribution is the organization of raw data in
table form, using classes and frequencies
• Class is a quantitative or qualitative category
• Examples: Political
affiliations, religious
affiliations, major field of study
$\% = \frac{f}{n} \cdot 100$
Find the grand totals for
frequency & percent
Class Limits   Class Boundaries   Frequency
20-29          19.5-29.5          23
30-39          29.5-39.5          21
40-49          39.5-49.5          21
50-59          49.5-59.5           4
60-69          59.5-69.5           1
70-79          69.5-79.5           1
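The percent for each class, and the grand totals for frequency and percent, can be checked with a short Python sketch based on the class frequencies in the age table above. The dictionary layout is just one convenient, assumed way to hold the table.

```python
# Class limits (ages in years) -> frequency, taken from the table above.
frequencies = {
    "20-29": 23,
    "30-39": 21,
    "40-49": 21,
    "50-59": 4,
    "60-69": 1,
    "70-79": 1,
}

n = sum(frequencies.values())   # grand total of the frequencies

# percent = f / n * 100 for each class
percents = {cls: f / n * 100 for cls, f in frequencies.items()}

for cls, pct in percents.items():
    print(cls, frequencies[cls], round(pct, 1))
print("Total", n, round(sum(percents.values()), 1))
```

The frequencies total 71, and the percents total 100 apart from floating-point rounding.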
Class Limits (Ages in Years)   Frequencies   Midpoint   Frequency * Midpoint
20-29                          23            24.5       563.5
30-39                          21            34.5       724.5
40-49                          21            44.5       934.5
50-59                           4            54.5       218
60-69                           1            64.5       64.5
70-79                           1            74.5       74.5
Total                          71                       2579.5
Computation of Grouped Mean
mean = 2579.5 / 71 = 36.33
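The grouped-mean computation above can be reproduced with a short Python sketch: take each class midpoint, multiply by the class frequency, sum the products, and divide by the total frequency. The tuple layout and variable names are assumptions made for illustration.

```python
# (lower class limit, upper class limit, frequency) for each age class.
classes = [
    (20, 29, 23),
    (30, 39, 21),
    (40, 49, 21),
    (50, 59, 4),
    (60, 69, 1),
    (70, 79, 1),
]

total_f = 0      # running sum of the frequencies
total_fx = 0.0   # running sum of frequency * midpoint

for lower, upper, f in classes:
    midpoint = (lower + upper) / 2   # e.g. (20 + 29) / 2 = 24.5
    total_f += f
    total_fx += f * midpoint

grouped_mean = total_fx / total_f
print(total_f, total_fx, round(grouped_mean, 2))
```

This gives totals of 71 and 2579.5 and a grouped mean of about 36.33 years, matching the table.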
Class Limits (Ages in Years)   Frequencies   Less-Than Cumulative Frequency
20-29                          23            23
30-39                          21            44
40-49                          21            65
50-59                           4            69
60-69                           1            70
70-79                           1            71
Total                          71
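The less-than cumulative frequencies are running totals of the class frequencies. A minimal Python sketch using `itertools.accumulate` from the standard library:

```python
from itertools import accumulate

class_limits = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
frequencies = [23, 21, 21, 4, 1, 1]

# Running total: each entry adds the current class frequency to all earlier ones.
cumulative = list(accumulate(frequencies))

for limits, f, cf in zip(class_limits, frequencies, cumulative):
    print(limits, f, cf)
```

The last cumulative frequency equals the total number of observations (71), which is a useful check on the table.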