
Data Management

Data management involves collecting, organizing, and analyzing numerical data. Statistics is the science of analyzing quantitative data. There are two main branches of statistics: descriptive statistics, which involves summarizing and presenting data; and inferential statistics, which involves generalizing from samples to populations. Key concepts in statistics include variables, data, populations, samples, and different types of variables such as quantitative vs qualitative, discrete vs continuous, and different measurement scales. Measures of central tendency like the mean, median, and mode; and measures of variation like range and standard deviation are used to describe important characteristics of data.

Uploaded by

Jarren Basilan

Data

Management
Statistics
Definition: the practice or science of collecting and
analyzing numerical data in large quantities, especially
for the purpose of inferring proportions in a whole
from those in a representative sample
• Variable: a characteristic or attribute that can assume different values
• Variables whose values are determined by chance are called random variables
• Data: values (measurements or observations) that variables can assume
• Data is the information collected; the group of information forms a data set
• Each value in the set is a data point or datum
“Language of Statistics”
• Descriptive Statistics involves the collection, organization, summarization, and presentation of data
• Inferential Statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions

Two Branches of
Statistics
Population
• ALL subjects (human or otherwise) that are being studied
• Examples
• All citizens of the United States
• All students enrolled at GHC during Fall 2009
• The governors of the 50 United States
Sample
• “Small” group of subjects (human or otherwise) selected from the population
• Examples
• 1000 adult Americans surveyed to determine whether they favor the legalization of marijuana
• 21 GHC students in Mr. Griffin’s statistics class surveyed to determine height

Population vs Sample
Qualitative Variables
• Can be placed into distinct categories, according to some characteristic or attribute (typically non-numeric)
• Examples:
• Eye Color
• Gender
• Religious Preference
• Yes/No
Quantitative Variables
• Numerical
• Can be ordered or ranked
• Examples:
• Heights
• Weights
• Pulse Rate
• Age
• Body Temperatures
• Credit Hours

Variable Classifications
Discrete Variables
• Can be assigned values such as 0, 1, 2, 3
• “Countable”
• Examples:
• Number of children
• Number of credit cards
• Number of calls received by switchboard
• Number of students
Continuous Variables
• Can assume an infinite number of values between any two specific values
• Obtained by measuring
• Often include fractions and decimals
• Examples:
• Temperature
• Height
• Weight

Quantitative Variables
Data
• Quantitative
• Discrete
• Continuous*
• Qualitative

* Since continuous data are measured, answers are rounded to the nearest given unit; however, the boundaries (possible values) are understood to be x ± 0.5
Nominal
• Classifies data into mutually exclusive (nonoverlapping), exhaustive categories
• No order or ranking can be imposed
• Examples:
• Gender
• Zip Codes
• Political Affiliation
• Religion
Ordinal
• Classifies data into categories that can be RANKED, but precise differences between ranks do not exist
• Examples:
• Letter grades (A, B, C, D, F)
• Judging contest (1st, 2nd, 3rd)
• Ratings (Above Avg, Avg, Below Avg, Poor)

Measurement Scales
Interval
• Ranks data
• PRECISE DIFFERENCES between units of measure do exist
• No meaningful zero
• Examples:
• Temperature (0° does not mean no heat at all)
• IQ Scores (0 does not imply no intelligence)
Ratio
• Ranks data
• Precise differences exist
• TRUE ZERO exists
• Examples:
• Height
• Weight
• Area
• Number of phone calls received
• Salary

Measurement Scales
Summary Measures
• Location: Maximum, Minimum, Central Tendency (Mean, Median, Mode), Percentile, Quartile, Decile
• Variation: Range, Variance, Standard Deviation, Coefficient of Variation, Interquartile Range
• Skewness
• Kurtosis
Data Description
Chapter 3
• 3-1 Introduction
• 3-2 Measures of Central Tendency
• 3-3 Measures of Variation
• 3-4 Measures of Position
• 3-5 Exploratory Data Analysis
• 3-6 Summary

Outline
Center: a representative or average value that indicates
where the middle of the data set is located
Variation: a measure of the amount that the values vary
among themselves
Distribution: the nature or shape of the distribution of
data (such as bell-shaped, uniform, or skewed)
Outliers: Sample values that lie very far away from the
majority of other sample values
Time: Changing characteristics of data over time

Important
Characteristics of Data
• The most common characteristic to measure is the center of the
data set. Often people talk about the AVERAGE.

• “‘Average’ when you stop to think about it is a funny concept.
Although it describes all of us it describes none of us…. While
none of us wants to be the average American, we all want to know
about him or her.” Mike Feinsilber & William Mead, American
Averages

• Examples
• The average American man is five feet, nine inches tall; the average
woman is five feet, 3.6 inches
• On the average day, 24 million people receive animal bites

Section 3-1 Introduction


“Average” is ambiguous, since several different methods can
be used to obtain an average
Loosely stated, the average means the center of the
distribution or the most typical case

Measures of Average are also called the Measures of Central Tendency
Mean
Median
Mode
Midrange
Is an average enough to describe a data set?
NO!
Consider: A shoe store owner knows that the average size of a man’s
shoe is size 10, but she would not be in business very long if she
ordered only size 10 shoes
So, what else do we need to know?
We need to know how the data are dispersed: do they cluster around
the center, or are they spread more evenly throughout the distribution?
How spread out are the data points?
Measures of Variation or Measures of Dispersion
Range
Variance
Standard Deviation

Measures of Variation
• We also need to know the Measures of Position
• Percentiles, Deciles, and Quartiles
• Used extensively in Psychology and Education, referred to as
“Norms”
• These tell us where a specific data value falls within the
data set or its relative position in comparison with other
data values

Measures of Position
• Objective(s)
• Summarize data using measures of central tendency, such as
the mean, median, mode, and midrange

Section 3-2 Measures of Central Tendency
Population
• Consists of all subjects (human or otherwise) that are being studied
Sample
• A group (subgroup) of subjects randomly selected from a population

RECALL from Chapter 1


Parameter
• A characteristic or numerical measurement obtained by using all the data values from a specified population
• Represented by GREEK letters
Statistic
• A characteristic or numerical measurement obtained by using the data values from a sample
• Represented by ROMAN (English) letters
• When calculating the measures of central tendency, variation,
or position, do NOT round intermediately. Round only the
final answer
• Rounding intermediately tends to increase the difference
between the calculated value and the actual “exact” value
• Round measures of central tendency and variation to one
more decimal place than occurs in the raw data
• For example, if the raw data are given in whole numbers, then
measures should be rounded to nearest tenth. If raw data are
given in tenths, then measures should be rounded to nearest
hundredth.

General Rounding
Guidelines
• A measure of center is a data value at the center or
middle of a data set

• Mean
• Median
• Mode
• Midrange

• We will consider the definition, calculation (formula),
advantages, disadvantages, properties, and uses for each
measure of central tendency

Measures of Central Tendency


• AKA Arithmetic Average
• Is found by adding the data values and dividing by the total
number of values
• In general, mean is the most important of all numerical
measurements used to describe data
• Is what most people call an “average”
• Is unique, but is not necessarily an actual data value
• Varies less than the median or mode when samples are taken from the
same population and all three measures are computed for those
samples
• Is used in computing other statistics, such as variance
• Is affected by extremely high or low values (outliers) and may not be
the appropriate average to use in those situations

Mean
Mean ----Formula
• Notation
• ∑ (sigma) denotes the sum of a set of values
• x is the variable usually used to represent the individual data values
• n represents the number of values in a sample
• N represents the number of values in a population
• Mean of a set of sample values (read as x-bar): $\bar{x} = \frac{\sum x}{n}$
• Mean of all values in a population (read as “mu”): $\mu = \frac{\sum x}{N}$
• The number of highway miles per gallon of the 10 worst
vehicles is given:

12 15 13 14
15 16 17 16
17 18
• Find the mean.

Mean ---Example
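A minimal Python sketch of this example, applying the sample mean formula (sum of the values divided by n) to the ten mileage figures. The names `mpg` and `sample_mean` are chosen here for illustration only.

```python
# Highway miles per gallon for the 10 vehicles in the example.
mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


mean_mpg = sample_mean(mpg)
print(round(mean_mpg, 1))  # 15.3 (sum is 153, n is 10)
```

The result is reported to one more decimal place than the whole-number raw data.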
• Is the middle value when the raw data values are arranged in
order from smallest to largest or vice versa
• Is used when one must find the center or midpoint of a data set
• Is used when one must determine whether the data values fall
into the upper half or lower half of the distribution
• Is affected less than the mean by extremely high or low values
• Does not have to be an original data value
• Various notations:
• MD, Med, $\tilde{x}$
Median
Odd Number of Data Values (n is odd)
• Arrange data in order from smallest to largest
• Find the data value in the “exact” middle
Even Number of Data Values (n is even)
• Arrange data in order from smallest to largest
• Find the mean of the TWO middle numbers (there is no “exact” middle)

Finding the Median


• The number of highway miles per gallon of the 10 worst
vehicles is given:

12 15 13 14
15 16 17 16
17 18
• Find the median.

Median ---Example
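The odd/even rule can be sketched in Python as follows. The function name `median_of` is a hypothetical helper written for this example; with n = 10 (even) it averages the two middle values.

```python
# Highway miles per gallon for the 10 vehicles in the example.
mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]


def median_of(values):
    """Return the middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single "exact" middle value exists.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


median_mpg = median_of(mpg)
print(median_mpg)  # 15.5 (the two middle values are 15 and 16)
```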
• Measured amounts of lead (in mg/m³) in the air are given:

5.40 1.10 0.42 0.73
0.48 1.10 0.66
• Find the median

Median – Example #2
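As a rough check of this example, the sketch below uses Python's standard `statistics` module. It also computes the mean, to illustrate the earlier point that the median is less affected by an extreme value (here 5.40) than the mean is.

```python
import statistics

# Measured amounts of lead in the air, as listed in the example.
lead = [5.40, 1.10, 0.42, 0.73, 0.48, 1.10, 0.66]

# n = 7 is odd, so the median is the 4th value of the sorted data.
median_lead = statistics.median(lead)
mean_lead = statistics.mean(lead)

print(median_lead)          # 0.73
print(round(mean_lead, 2))  # about 1.41, pulled upward by the value 5.40
```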
• Is the data value(s) that occurs most often in a data set
• Sometimes said to be the most typical case
• Is the easiest average to compute
• Can be used when the data are nominal, such as religious
preference, gender, or political affiliation
• Is not always unique. A data set can have more than one
mode, or the mode may not exist for a data set
• Has no “special” symbol
• Look for the number(s) that occur the most often in the data
set

Mode
• The number of highway miles per gallon of the 10 worst
vehicles is given:

12 15 13 14
15 16 17 16
17 18
• Find the mode.

Mode ---Example
• Measured amounts of lead (in mg/m³) in the air are given:

5.40 1.10 0.42 0.73
0.48 1.10 0.66
• Find the mode.

Mode – Example #2
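Both mode examples can be checked with `statistics.multimode` from the Python standard library (available in Python 3.8 and later), which returns every value tied for the highest count. This is a sketch for illustration; it shows that the mileage data have more than one mode while the lead data have a single mode.

```python
from statistics import multimode

# Data from the two mode examples.
mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]
lead = [5.40, 1.10, 0.42, 0.73, 0.48, 1.10, 0.66]

# multimode lists all most-frequent values, in order of first appearance.
modes_mpg = multimode(mpg)
modes_lead = multimode(lead)

print(modes_mpg)   # [15, 16, 17] since each occurs twice
print(modes_lead)  # [1.1]
```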
• Is a rough estimate of the midpoint for the data set
• Is found by adding the lowest and highest data values and
dividing by 2
• Is easy to compute
• Gives the midpoint
• Is affected by extremely high or low data values
• Is rarely used
• Is denoted by MR

$MR = \frac{\text{lowest value} + \text{highest value}}{2}$
Midrange
• The number of highway miles per gallon of the 10 worst
vehicles is given:

12 15 13 14
15 16 17 16
17 18
• Find the midrange.

Midrange ---Example
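A short Python sketch of the midrange for this example, using the built-in `min` and `max`. The function name `midrange` is an illustrative choice.

```python
# Highway miles per gallon for the 10 vehicles in the example.
mpg = [12, 15, 13, 14, 15, 16, 17, 16, 17, 18]


def midrange(values):
    """Return the midrange: (lowest value + highest value) / 2."""
    return (min(values) + max(values)) / 2


mr_mpg = midrange(mpg)
print(mr_mpg)  # 15.0, from (12 + 18) / 2
```

Because only the two extreme values enter the calculation, a single outlier changes the midrange directly.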
• There is no single best answer to that question because
there are no objective criteria for determining the most
representative measure for all data sets
• Avoid the term “average”; instead, name the actual measure
of central tendency that is calculated (mean, median,
mode, or midrange)
• Use the advantages and disadvantages stated above to
decide which measure of central tendency is best.

Which Measure of Central Tendency is best?


Frequency
Distributions &
Graphs
To conduct a statistical study, we must gather data (values (measurements or observations) that variables can assume).
◦ Data collected in its original form is called RAW DATA
To describe situations, draw conclusions, or make inferences about events, we must organize the data in some meaningful way.
◦ Most convenient method for organizing data is a FREQUENCY DISTRIBUTION

Introduction
• After organizing the data, we must present them in a way that is easily understandable.
• STATISTICAL CHARTS & GRAPHS are the most useful method for presenting data
• We will be discussing the following statistical charts and graphs
• Histograms
• Frequency Polygons
• Ogives
• Pareto Charts
• Time Series Graphs
• Stem & Leaf Plot

Introduction
• A frequency distribution is the organization of raw data in
table form, using classes and frequencies
• Class is a quantitative or qualitative category
• Frequency of a class is the number of data values contained in a
specific class

What is a Frequency Distribution?


Categorical Frequency Distribution
• Used for data that can be placed in specific categories, such as nominal or ordinal level data.
• Examples: Political affiliations, religious affiliations, major field of study
Grouped Frequency Distribution
• Used with quantitative data
• Classes (groups) include more than one unit of measurement

Types of Frequency Distributions


• Make a table with the columns Class, Tally, Frequency, %
• Tally the data
• Count the tallies
• Find percentage of values in each class using the following formula: $\% = \frac{f}{n} \cdot 100$
• Find the grand totals for frequency & percent

About Categorical Frequency Distributions
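The steps for a categorical frequency distribution might be sketched in Python as below. The list `responses` is hypothetical data invented for this illustration (it is not from the slides), and `categorical_distribution` is a helper name chosen here.

```python
from collections import Counter

# Hypothetical survey responses, invented for illustration only.
responses = [
    "Democrat", "Republican", "Independent", "Democrat", "Republican",
    "Democrat", "Independent", "Democrat", "Republican", "Democrat",
]


def categorical_distribution(data):
    """Map each class to (frequency f, percent = f * 100 / n)."""
    counts = Counter(data)  # tallies the data and counts the tallies
    n = len(data)
    return {cls: (f, f * 100 / n) for cls, f in counts.items()}


table = categorical_distribution(responses)
for cls, (f, pct) in table.items():
    print(f"{cls:<12}{f:>4}{pct:>8.1f}%")

# Grand totals for frequency and percent.
total_f = sum(f for f, _ in table.values())
total_pct = sum(pct for _, pct in table.values())
print(f"{'Total':<12}{total_f:>4}{total_pct:>8.1f}%")
```

The grand totals serve as a check: the frequencies should add up to n and the percents to 100.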
• Definitions
◦ Lower Class Limit (LCL) is the smallest data value that can be
included in the class
◦ Upper Class Limit (UCL) is the largest data value that can be
included in the class
◦ Class Boundaries are used to separate the classes so that there
are no gaps in the classes included in the frequency distribution
◦ Class Width is the difference between two consecutive LCLs
• Find by subtracting LCL2 – LCL1

About Grouped Frequency Distributions


• We must decide how many classes to use and the width of
each class using the following guidelines:
◦ There should be between 5 and 20 classes.
◦ It is preferable, but not absolutely necessary that the class width
be an odd number
◦ The classes must be mutually exclusive (nonoverlapping values)
◦ The classes must be continuous (no gaps, even if frequency is 0)
◦ The classes must be exhaustive (use all the data)
◦ The classes must be equal in width

Grouped Frequency Distribution


Decide on the number of classes (given)
Determine the class width (given)
Select a starting point (this is the first LCL) (given)
Determine the remaining LCLs by adding the class width to the first LCL to obtain the next LCL, and so on
Determine the first UCL by subtracting 1 from the second LCL (for whole-number data), then add the class width to obtain the next UCL, and so on
Tally the data

Grouped Frequency Distribution


Ages of NASCAR Nextel Cup Drivers in Years (NASCAR.com) (Data is
ranked---Collected Spring 2008)
21 21 21 23 23 23 24 25
25 26 26 26 26 27 27 28
28 28 28 29 29 29 29 30
30 30 30 31 31 31 31 31
32 34 35 35 35 36 36 37
37 38 38 39 41 42 42 42
43 43 43 44 44 44 44 45
45 46 47 48 48 48 49 49
49 50 50 51 51 65 72

Example: Construct a frequency distribution of the ages of Cup Drivers. Use 6 classes beginning with a lower class limit of 20 and a class width of 10
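One way to carry out this construction is sketched below in Python. The function `grouped_frequencies` is a hypothetical helper written for this example; it assumes whole-number data, so each upper class limit is the next lower class limit minus 1 and the class boundaries lie 0.5 beyond the limits.

```python
# Ages of the 71 drivers, as listed in the example.
ages = [
    21, 21, 21, 23, 23, 23, 24, 25,
    25, 26, 26, 26, 26, 27, 27, 28,
    28, 28, 28, 29, 29, 29, 29, 30,
    30, 30, 30, 31, 31, 31, 31, 31,
    32, 34, 35, 35, 35, 36, 36, 37,
    37, 38, 38, 39, 41, 42, 42, 42,
    43, 43, 43, 44, 44, 44, 44, 45,
    45, 46, 47, 48, 48, 48, 49, 49,
    49, 50, 50, 51, 51, 65, 72,
]


def grouped_frequencies(data, first_lcl, width, n_classes):
    """Return rows of (LCL, UCL, lower boundary, upper boundary, frequency)."""
    rows = []
    for i in range(n_classes):
        lcl = first_lcl + i * width  # next LCL = previous LCL + class width
        ucl = lcl + width - 1        # whole-number data: UCL = next LCL - 1
        freq = sum(1 for x in data if lcl <= x <= ucl)
        rows.append((lcl, ucl, lcl - 0.5, ucl + 0.5, freq))
    return rows


table = grouped_frequencies(ages, first_lcl=20, width=10, n_classes=6)
for lcl, ucl, lower, upper, freq in table:
    print(f"{lcl}-{ucl}  {lower}-{upper}  {freq}")
```

The classes produced this way are mutually exclusive, continuous, exhaustive, and equal in width, as the guidelines require.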
To organize data in a meaningful, intelligible way
To enable the reader to determine the nature or shape of the
distribution
To facilitate computational procedures for measures of
average and spread
To enable us to draw charts and graphs for the presentation of
data
To enable the reader to make comparisons among different
data sets

Reasons for Constructing a Frequency Distribution
Class Limits (Ages in Years)    Class Boundaries    Frequencies
20-29                           19.5-29.5           23
30-39                           29.5-39.5           21
40-49                           39.5-49.5           21
50-59                           49.5-59.5           4
60-69                           59.5-69.5           1
70-79                           69.5-79.5           1
Class Limits (Ages in Years)    Frequencies    Midpoint    Frequency * Midpoint
20-29                           23             24.5        563.5
30-39                           21             34.5        724.5
40-49                           21             44.5        934.5
50-59                           4              54.5        218
60-69                           1              64.5        64.5
70-79                           1              74.5        74.5
Total                           71                         2579.5

Computation of Grouped Mean
mean = 2579.5/71 = 36.33
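The grouped mean computation can be reproduced with a short Python sketch. Each class is represented by its midpoint, (LCL + UCL) / 2, weighted by its frequency; the name `grouped_mean` and the tuple layout are assumptions made for this example.

```python
# (lower class limit, upper class limit, frequency) for each class in the table.
classes = [
    (20, 29, 23),
    (30, 39, 21),
    (40, 49, 21),
    (50, 59, 4),
    (60, 69, 1),
    (70, 79, 1),
]


def grouped_mean(rows):
    """Return sum(frequency * midpoint) / sum(frequency)."""
    n = sum(f for _, _, f in rows)
    weighted_total = sum(f * (lcl + ucl) / 2 for lcl, ucl, f in rows)
    return weighted_total / n


mean_age = grouped_mean(classes)
print(round(mean_age, 2))  # 36.33, from 2579.5 / 71
```

Because every value in a class is treated as if it equaled the class midpoint, the grouped mean is an approximation of the mean of the raw data.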
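Class Limits (Ages in Years)    Frequencies    Less-than cumulative frequency
20-29                           23             23
30-39                           21             44
40-49                           21             65
50-59                           4              69
60-69                           1              70
70-79                           1              71
Total                           71

How to compute the median

1. Divide the total by 2: 71/2 = 35.5
2. Determine the median class: the first class whose cumulative frequency reaches 35.5, here 30-39 (lower boundary 29.5, frequency 21, cumulative frequency before it 23)
3. Median = 29.5 + [(35.5 - 23)/21] * 10 = 35.45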
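The median computation can be sketched in Python as below, assuming the usual interpolation formula Median = L + [(n/2 - cf) / f] * w, where L is the lower boundary of the median class, cf is the cumulative frequency before that class, f is its frequency, and w is the class width. The function name `grouped_median` is illustrative.

```python
# (lower class boundary, frequency) for each class in the table.
classes = [(19.5, 23), (29.5, 21), (39.5, 21), (49.5, 4), (59.5, 1), (69.5, 1)]


def grouped_median(rows, width):
    """Interpolate the median inside the first class whose cumulative frequency reaches n/2."""
    n = sum(f for _, f in rows)
    half = n / 2
    cumulative = 0  # cumulative frequency before the current class
    for lower_boundary, f in rows:
        if cumulative + f >= half:
            return lower_boundary + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("no classes supplied")


median_age = grouped_median(classes, width=10)
print(round(median_age, 2))  # 35.45
```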
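Class Limits (Ages in Years)    Frequencies    Less-than cumulative frequency
20-29                           23             23
30-39                           21             44
40-49                           21             65
50-59                           4              69
60-69                           1              70
70-79                           1              71
Total                           71

How to compute the mode

1. Identify the highest frequency in the frequency column: 23
2. Determine the modal class: 20-29 (lower boundary 19.5)
3. Mode = 19.5 + [d1/(d1 + d2)] * 10, where d1 = 23 - 0 = 23 is the modal frequency minus the frequency of the preceding class (none, so 0) and d2 = 23 - 21 = 2 is the modal frequency minus the frequency of the following class
   Mode = 19.5 + (23/25) * 10 = 28.7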
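A hedged Python sketch of the grouped mode follows. It assumes the common interpolation formula Mode = L + [d1 / (d1 + d2)] * w, where L is the lower boundary of the modal class, d1 is the modal frequency minus the frequency of the preceding class, and d2 is the modal frequency minus the frequency of the following class; a missing neighbor is treated as frequency 0. Under that convention the table gives a mode of about 28.7. The function name `grouped_mode` is illustrative.

```python
# (lower class boundary, frequency) for each class in the table.
classes = [(19.5, 23), (29.5, 21), (39.5, 21), (49.5, 4), (59.5, 1), (69.5, 1)]


def grouped_mode(rows, width):
    """Interpolate the mode inside the class with the highest frequency."""
    freqs = [f for _, f in rows]
    i = freqs.index(max(freqs))  # modal class (first one if tied)
    prev_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - prev_f
    d2 = freqs[i] - next_f
    # Note: d1 + d2 is zero only if a single class holds no data or all
    # neighbors tie with zero excess; that edge case is not handled here.
    return rows[i][0] + d1 / (d1 + d2) * width


mode_age = grouped_mode(classes, width=10)
print(round(mode_age, 1))  # 28.7, from 19.5 + (23 / 25) * 10
```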
