MMW Chap 3 Data Management Statistics Part 1

Uploaded by

paduajeanu
Copyright
© © All Rights Reserved

CHAPTER 3

Introduction to Data Management


OBJECTIVES

➢ To know the history of Statistics.
➢ To define Statistics.
➢ To be able to differentiate Statistics from Statistic.
➢ To enumerate the functions and types of Statistics.
➢ To enumerate the levels of measurement.
➢ To enumerate the different kinds of Sampling Techniques.
➢ To characterize a good questionnaire.
➢ To identify the steps in a statistical study.
➢ To describe the different methods of presenting data.
➢ To determine the different types of graphs.

INTRODUCTION

History of Statistics

Historical records show that since the beginning of civilization simple
forms of statistics had already been used. This is manifested by pictorial
representations and other symbols used to record numbers of people, animals,
and inanimate objects on skins, slabs, sticks of wood, or the walls of caves.
Records show that even before 3000 B.C., the Babylonians used small clay
tablets to record tabulations of agricultural harvests and of commodities
bartered or sold. The Egyptians analyzed the population and material wealth of
their country before they began building the pyramids in the 31st century B.C.
Even the biblical books of Numbers and 1 Chronicles show statistical works.
Numbers contained two separate censuses of the Israelites while 1 Chronicles
described the material wealth of various Jewish tribes. In China, similar
numerical records existed before 2000 B.C. As early as 594 B.C., the ancient
Greeks held censuses used as bases for taxation.

Records also show that the Roman Empire was the first government to
gather extensive data about the population, area, and wealth of the territories
that it controlled. In Europe, few comprehensive censuses were made during
the Middle Ages. In the early 16th century, registration of deaths and births
began in England. Then in 1662 the first noteworthy statistical study of
population was made. In 1691, a similar study of mortality made in Breslau,
Germany was used by the English astronomer Edmond Halley as a basis for the
earliest mortality table. In the 19th century, investigators recognized the need
to reduce information to numerical values to avoid the ambiguity of verbal
description.

At present, statistics is a reliable means of describing accurately the
values of economic, political, social, psychological, biological, and physical data.
Statistics serves as a tool to correlate and analyze collected data. It is no longer
confined to gathering and tabulating data. Now, it is also a process of
interpreting the information that serves as a basis for preparing plans.

Notable Figures in the Development of Statistics

1. Cardan, in 1550, made the first crude definition of probability.
2. Pierre de Fermat (1601-1665), known for Fermat's Last Theorem, who with Blaise
Pascal laid the mathematical foundations of probability theory.
3. Blaise Pascal (1623-1662), French philosopher and mathematician, who with Pierre de
Fermat developed a mathematical model of probability.
4. Christiaan Huygens (1629-1695), Dutch scientist and mathematician, wrote the first book
on probability theory, "The Value of all Chances in Games of Fortune". This theory was
applied to vital statistics for humans, which led to its use in annuities.
5. Thomas Bayes (1702-1761), developer of Bayes' theorem, which established the
mathematical foundation for probability inference.
6. Pafnuty Chebyshev (1821-1894), Russian mathematician who developed what is known
as Chebyshev's Theorem, a method for calculating probabilities when the probability
distribution of a population is unknown.
7. Karl Pearson (1857-1936), applied statistics to biology and medicine. In 1900, he
developed the chi-square test and popularized the concept that populations may have
skewed distributions, as opposed to normal distributions. He worked in the fields of
heredity and evolution.
8. Charles Spearman (1863-1945), English psychologist who created Spearman's rank
correlation coefficient and pioneered factor analysis.
9. Ronald Aylmer Fisher (1890-1962), developed analysis of variance (ANOVA) and
made important contributions regarding experimental design. Other contributions
included the development of extreme value theory and the P-values for determining the
reliability of statistical predictions. His work demonstrated that "uncertainty may be
capable of precise quantitative assessment."
10. Andrey Kolmogorov (1903-1987), mathematician, published a monograph which laid
the foundation for advanced probability theory and random processes.
11. John Wilder Tukey (1915-2000), developed methods for robust data analysis, time
series analysis methods, graphical methods for exploratory data analysis (stem-and-leaf
diagrams and box-and-whisker plots), paired and multiple comparisons, and, in
cooperative work with James Cooley, developed what is known as the Fast Fourier
Transform.

Proper Discussion
Meaning of Statistics

Statistics is a branch of mathematics that deals with the
collection, classification, analysis, and interpretation of numerical facts, which mainly
leads to drawing inferences on the basis of their quantifiable likelihood.

There are other meanings attached to the word Statistics.

In its plural sense, the word Statistics refers to numerical facts and figures collected in a
systematic manner with a definite purpose in any field of study. In this sense, statistics are also
aggregates of facts which are expressed in numerical form.

In its singular sense, it refers to the science comprising methods which are used in
collection, analysis, interpretation and presentation of numerical data.

Note:
“Statistic” refers to a numerical quantity, such as the mean, median,
or variance, calculated from sample values.
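
As a small illustration of the note above, the sketch below computes three statistics from one sample. The sample values are hypothetical (they do not come from this chapter), and the functions used are from Python's standard `statistics` module.

```python
import statistics

# Hypothetical sample values, made up for illustration only.
sample = [4, 8, 6, 5, 7]

sample_mean = statistics.mean(sample)          # arithmetic average of the sample
sample_median = statistics.median(sample)      # middle value of the sorted sample
sample_variance = statistics.variance(sample)  # sample variance (divides by n - 1)

print(sample_mean, sample_median, sample_variance)
```

Each printed number is a statistic: it describes this particular sample, and a different sample from the same population would generally give different values.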

Functions or Uses of Statistics

• Statistics helps in providing a better understanding and exact description of a
phenomenon of nature.
• Statistics helps in proper and efficient planning of a statistical inquiry in any field
of study.
• Statistics helps in collecting appropriate quantitative data.
• Statistics helps in presenting complex data in a suitable tabular, diagrammatic and
graphic form for an easy and clear comprehension of the data.
• Statistics helps in understanding the nature and pattern of variability of a
phenomenon through quantitative observations.
• Statistics helps in drawing valid inferences, along with a measure of their
reliability, about the population parameters from the sample data.
• Statistics presents facts and figures in a definite form. That makes a statement
more logical and convincing than mere description.
• Comparison between different sets of observations is an important function of
statistics. Statistical devices like averages, ratios, coefficients etc. are used for the
purpose of comparison.
• Formulating and testing hypotheses is an important function of statistics. So
statistics examines the truth and helps in innovating new ideas.
• Statistics helps in formulating plans and policies in different fields. Statistical
analysis of data forms the beginning of policy formulations. Hence, statistics is
essential for planners, economists, scientists and administrators to prepare
different plans and programs.
• Statistics helps in forecasting trends and tendencies. Statistical techniques are
used for predicting the future values of a variable.

Two Branches of Statistics


Statistics is divided into two areas: Descriptive Statistics and Inferential Statistics.

Note:
Population refers to the total set of observations that can be made.
Sample refers to a set of observations drawn from a population.

➢ Descriptive Statistics

Descriptive statistics summarizes a given data set, which can either
be a representation of the entire population or a sample. It describes patterns
and general trends in a data set. In most cases, descriptive statistics are used
to examine or explore one variable at a time.

Example:

On the last 3 Sundays, Henry D. Carsalesman sold 2, 1, and 0 new cars
respectively. An example of descriptive statistics is the following statement:

"Henry averaged 1 new car sold for the last 3 Sundays."

➢ Inferential Statistics

Inferential statistics provides ways of testing the reliability of the
findings of a study and of generalizing characteristics observed in a small group of
participants or people (the sample) to a much larger group of people (the
population).

Example:

Of 350 randomly selected people in the town of Bangued, Abra, 280 people had
the last name Racsa. An example of inferential statistics is the following statement:

"80% of all people living in Abra have the last name Racsa."

We have no information about all people living in Abra, just about the 350 living
in Bangued. We have taken that information and generalized it to talk about all people
living in Abra. The easiest way to tell that this statement is not descriptive is to try to
verify it from the information provided: the sample alone cannot confirm it.
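
To make the distinction concrete, the short sketch below computes the one figure that the data actually support, the proportion within the sample. The variable names are illustrative choices; the numbers are those of the example.

```python
# Figures taken from the example above.
sample_size = 350    # randomly selected people in Bangued
with_surname = 280   # of those, people with the last name Racsa

sample_proportion = with_surname / sample_size
print(f"{sample_proportion:.0%} of the sample has the last name Racsa")
```

Saying that 80% of the 350 sampled people are named Racsa is descriptive. Extending the same 80% to everyone in Abra is the inferential step, and it goes beyond what the sample from Bangued can verify.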

Levels of Measurement

The level of measurement refers to the relationship among the values that
are assigned to the attributes for a variable.

Why is Level of Measurement Important?

First, knowing the level of measurement helps you decide how to interpret the data from
that variable. When you know that a measure is nominal, then you
know that the numerical values are just short codes for the longer names. Second, knowing the
level of measurement helps you decide what statistical analysis is appropriate on the values that
were assigned. If a measure is nominal, then you know that you would never average the data
values or do a t-test on the data.

There are typically four levels of measurement that are defined:

• Nominal
• Ordinal
• Interval
• Ratio

A nominal variable contains categorical data for mutually exclusive, but
not-ordered, categories.

Nominal measurement is simply concerned with sorting observations into categories.
Because the single property of nominal data is classification, it tells us nothing about differences
in degree or amount. Numbers assigned to categories (as identification codes) have no numeric
value (we cannot add, subtract, divide or multiply nominal data) and any ordering of categories
is arbitrary. This is the most primitive form of measurement. The presence vs. absence of
something is a form of nominal measurement (“do you smoke?” YES, NO). Although it is
considered a form of measurement, the collection of nominal data is more easily thought of as a
sorting method.

Examples:

• Religion (Protestant, Catholic, Jewish, Buddhist, etc.)
• Race (Caucasian, African-American, Hispanic, Asian, etc.)
• Linguistic Group
• Marital Status (Married, Single, Divorced)
• Gender (Male or Female)

In ordinal measurement the attributes can be rank-ordered. Here,
distances between attributes do not have any meaning.

Examples:

• Rankings (1st, 2nd, 3rd, etc.)
• Academic Letter Grades (A, B, C, D, F)
• Evaluations
o High, Medium, Low
o Likert Scales
▪ 5 pt (Strongly Agree, Agree, Neither Agree nor Disagree, Disagree,
Strongly Disagree)
▪ 7 pt liberalism scale (Strongly Liberal, Liberal, Weakly Liberal, Moderate,
Weakly Conservative, Conservative, Strongly Conservative)
• Stages in development

An interval measurement is a scale that represents quantity and has
equal units but for which zero represents simply an additional point of
measurement.

Essentially, interval data are ordinal, but they have an extra property - the ability to
meaningfully add and subtract measurements. In interval-scaled data, the gaps between the
numbers are comparable, unlike with ordinal data. Any interval has the same meaning regardless
of its location on the scale. "X is five inches longer than Y" has meaning regardless of the values
of X and Y. However, ratios are meaningless on an interval scale because an interval scale has
no true zero. Temperature scales are an example of this, so are decibel scales. Zero degrees
Fahrenheit does not mean the total absence of temperature. Zero decibels do not mean there is no
sound. Furthermore, if it is 80 degrees outside today and it was only 40 degrees outside yesterday
we cannot say that today is twice as hot as yesterday. Similarly a sound level of 80 dB is not
twice as loud as a sound level of 40 dB. In short, if the data can be ordered and the arithmetic
difference is meaningful, then the data are at least interval data.
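
The temperature example can be checked with a short calculation. The sketch below assumes the 80 and 40 degree readings are in Fahrenheit and converts them to Kelvin, a scale with an absolute zero, to show why "twice as hot" does not follow. The function name is an illustrative choice.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit reading to Kelvin, a scale with a true zero."""
    return (deg_f - 32) * 5 / 9 + 273.15

today_f, yesterday_f = 80, 40  # readings from the example, assumed to be Fahrenheit

difference_f = today_f - yesterday_f  # differences are meaningful on an interval scale
naive_ratio = today_f / yesterday_f   # 2.0, but this ratio has no physical meaning
kelvin_ratio = fahrenheit_to_kelvin(today_f) / fahrenheit_to_kelvin(yesterday_f)

print(difference_f, naive_ratio, round(kelvin_ratio, 2))
```

On the Kelvin scale the ratio is about 1.08, not 2, so the 40-degree difference is real but the apparent doubling comes only from where the Fahrenheit scale places its zero.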

Examples:

• Temperature scale (Fahrenheit or Celsius)
• Calendar dates and years
• Elevation measured relative to sea level

In ratio measurement there is always an absolute zero that is
meaningful. This means that you can construct a meaningful fraction (or ratio)
with a ratio variable.

Ratio data are the highest form of data measurement and the form we are most familiar
with. For ratio data both differences and ratios are interpretable. Ratio data have a natural zero.
Ratio data look a lot like interval data. However, the zero point has a special meaning in ratio-
scaled data: it indicates the absence of whatever property is being measured. Ratio data always
have the flavor of counting: when you measure the amount of money that you have, you are
counting up coins and bills. When you are measuring your height, you are counting the number
of inches off the ground to the top of your head. Both ratio and interval data make use of a wide
range of statistical analysis tools.

Examples:

• Physical Measures (Height or Weight)
• Kelvin temperature (includes absolute zero)
• Population
• Annual income
• Family size

It's important to recognize that there is a hierarchy implied in the level of measurement
idea. At lower levels of measurement, assumptions tend to be less restrictive and data analyses
tend to be less sensitive. At each level up the hierarchy, the current level includes all of the
qualities of the one below it and adds something new. In general, it is desirable to have a higher
level of measurement (e.g., interval or ratio) rather than a lower one (nominal or ordinal).
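
The cumulative nature of the hierarchy can be sketched as a small lookup. The property labels below are paraphrased from the descriptions in this section, and the function name is an illustrative choice.

```python
# Levels listed from lowest to highest, each with the property it adds.
LEVELS = [
    ("nominal", "classification into categories"),
    ("ordinal", "rank order"),
    ("interval", "meaningful differences (equal units)"),
    ("ratio", "true zero, so ratios are meaningful"),
]

def properties_of(level):
    """Return every property a level has: its own plus those of all lower levels."""
    names = [name for name, _ in LEVELS]
    position = names.index(level)
    return [prop for _, prop in LEVELS[: position + 1]]

for name, _ in LEVELS:
    print(f"{name}: {properties_of(name)}")
```

Reading the output from top to bottom shows each level keeping everything from the level before it and adding one new property.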

Collecting Data:

A questionnaire is a form containing a set of questions, especially one
addressed to a statistically significant number of subjects as a way of gathering
information for a survey.

Types of Questions

All researchers must make two basic decisions when designing a survey. They must
decide: 1) whether they are going to employ an oral or written method, and 2) whether they are
going to choose questions that are open-ended or closed-ended.

• Closed-Ended Questions. Closed-ended questions limit respondents' answers to the
options provided in the survey. The participants choose from a pre-existing set of
answers: dichotomous answers such as yes/no or true/false, multiple choice with an
option for "other" to be filled in, or ranking scale response options.
• Open-Ended Questions. Open-ended questions do not give respondents answers to
choose from, but rather are phrased so that the respondents are encouraged to explain
their answers and reactions to the question with a sentence, a paragraph, or even a page or
more, depending on the survey. If you would like to find out what respondents would
come up with on their own, rather than have them pick from listed answers, you might
choose an open-ended question.

Characteristics of a Good Questionnaire

• Questions are worded simply and clearly, are not ambiguous or vague, and are objective
• Attractive in appearance (questions spaced out and neatly arranged)
• Write a descriptive title for the questionnaire
• Write an introduction to the questionnaire
• Order questions in logical sequence
• Keep questionnaire uncluttered and easy to complete
• Delicate questions last (especially demographic questions)
• Design for easy tabulation
• Design to achieve objectives
• Define terms
• Avoid double negatives (I haven't no money)
• Avoid double-barreled questions (this AND that)
• Avoid loaded questions ("Have you stopped beating your wife?")
• Phrase questions for all respondents

Methods in Collecting Data

Survey. Statistical surveys are used to collect quantitative information from a specific
population. A survey may focus on opinions or factual information depending upon the purpose
of the study. Surveys may involve answering a questionnaire or being interviewed by a
researcher. The census is a type of survey.

Advantages:

• Can be administered in a variety of forms (telephone, mail, on-line, mall
interview, etc.)
• Are efficient for collecting data from a large population
• Can be designed to focus only on the needed response questions
• Are applicable to a wide range of topics

Disadvantages:

• Are dependent upon the respondent's honesty and motivation when answering
• Can be flawed by non-response
• Can possess questions or answer choices that may be interpreted differently by
different respondents (such as the choice "agree slightly")

Experimental Study. In an experimental study, the researcher takes measurements, or surveys,
the sample population. The researcher then manipulates the sample population in some
manner. After the manipulation, the researcher re-measures, or re-surveys, using the same
procedures to determine if the manipulation possibly changed the measurements.

During a "controlled" experiment, the researcher will separate the sample population into
groups with one group established as the control group. All groups will be manipulated in some
manner, except for the control group which will remain the same.

Observational Study. In an observational study, the sample population being studied is
measured, or surveyed, as it is. The researcher does not influence the population in any way or
attempt to intervene in the study. There is no experimental manipulation. Instead, data are
simply gathered and correlations are investigated.

Analyze the Data:

Sampling Methods

Sampling methods are used to select a sample from within a general
population.

Proper sampling methods are important for eliminating bias in the selection process.
They can also allow for the reduction of cost or effort in gathering samples.

Probability Sampling Methods

A probability sampling method is any method of sampling that utilizes
some form of random selection.

In order to have a random selection method, you must set up some process or procedure
that assures that the different units in your population have equal probabilities of being chosen.
Humans have long practiced various forms of random selection, such as picking a name out of a
hat, or choosing the short straw. The key benefit of probability sampling methods is that they
remove the researcher's judgment from the selection, which makes the sample far more likely
to be representative of the population. This supports valid statistical conclusions, although
no sampling method can guarantee a perfectly representative sample.

Main types of Probability Sampling Methods

Simple Random sampling. Simple random sampling refers to
any sampling method that has the following properties: the population
consists of N objects, the sample consists of n objects, and every possible
sample of n objects is equally likely to be selected.

There are many ways to obtain a simple random sample. One way would be the lottery
method. Each of the N population members is assigned a unique number. The numbers are
placed in a bowl and thoroughly mixed. Then, a blind-folded researcher selects n numbers.
Population members having the selected numbers are included in the sample.

Examples:

1. A group of 25 employees is chosen out of a hat from a company of 250 employees.
In this case, the population is all 250 employees, and the sample is random
because each employee has an equal chance of being chosen.
2. Six numbers are drawn at random in a raffle from the set of numbers from 1 to 49.
Each number ball has the same weight, so each has an equal chance of being drawn.
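
The first example can be imitated with Python's standard `random` module. This is a minimal sketch: the employee numbering and the seed value are assumptions made for illustration.

```python
import random

population = list(range(1, 251))  # employees numbered 1..250, as in Example 1
rng = random.Random(42)           # fixed seed (arbitrary) so the draw can be repeated

# Draw 25 distinct employees; every group of 25 is equally likely to be chosen.
sample = rng.sample(population, 25)
print(sorted(sample))
```

`random.sample` draws without replacement, which matches pulling names out of a hat: no employee can be selected twice.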

Stratified sampling. With stratified sampling, the population
is divided into groups, based on some characteristic. Then, within
each group, a probability sample (often a simple random sample) is
selected. In stratified sampling, the groups are called strata.

Examples:

1. Suppose we conduct a national survey. We might divide the population into
groups or strata, based on geography - north, east, south, and west. Then, within
each stratum, we might randomly select survey respondents.

2. A campus survey is conducted. The population may be divided into strata based on the
colleges – College of Arts, College of Sciences, College of Engineering, College
of Accountancy, etc. Within each stratum, random selection of respondents may be
conducted.
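
The campus example can be sketched as follows. The rosters, the student ID format, and the choice of five students per college are all hypothetical; the function simply draws a simple random sample inside every stratum.

```python
import random

def stratified_sample(strata, per_stratum, seed=0):
    """Draw a simple random sample of `per_stratum` members from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_stratum) for name, members in strata.items()}

# Hypothetical rosters: student IDs grouped by college.
strata = {
    "Arts": [f"A{i}" for i in range(1, 41)],
    "Sciences": [f"S{i}" for i in range(1, 61)],
    "Engineering": [f"E{i}" for i in range(1, 81)],
}

sample = stratified_sample(strata, per_stratum=5)
for college, chosen in sample.items():
    print(college, chosen)
```

Every college appears in the result, which is the defining feature of stratified sampling: no stratum is left out of the sample.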

Cluster sampling. With cluster sampling, every member of the
population is assigned to one, and only one, group. Each group is
called a cluster.

A sample of clusters is chosen, using a probability method (often simple random
sampling). Only individuals within sampled clusters are surveyed.

Note the difference between cluster sampling and stratified sampling. With stratified sampling,
the sample includes elements from each stratum. With cluster sampling, in contrast, the sample
includes elements only from sampled clusters.

Examples:

1. In a study of the opinions of homeless people across a country, rather than study a few
homeless people in all towns, a number of towns are selected and a significant
number of homeless people are interviewed in each one.
2. In studying the common experiences of failing students on the campus, rather than
sampling failing students from every college, a number of colleges are selected at
random. In each selected college, a number of students with failing marks are studied.
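
A minimal sketch of the first example follows. The town names and members are hypothetical, and for simplicity the sketch keeps every member of each selected cluster, whereas the example interviews only a number of people within each selected town.

```python
import random

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick `n_clusters` whole clusters and return the members of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical towns, each listing the people who could be interviewed there.
towns = {
    "Town A": ["a1", "a2", "a3"],
    "Town B": ["b1", "b2"],
    "Town C": ["c1", "c2", "c3", "c4"],
    "Town D": ["d1", "d2"],
}

sample = cluster_sample(towns, n_clusters=2)
print(sample)
```

Only two of the four towns appear in the result. Unlike stratified sampling, the unselected clusters contribute nobody to the sample.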

Systematic random sampling. With systematic random
sampling, we create a list of every member of the population. From the
list, we randomly select the first sample element from the
first k elements on the population list. Thereafter, we select
every kth element on the list.

This method is different from simple random sampling since not every possible sample of n
elements is equally likely.

Example:

1. Suppose you want to sample 8 houses from a street of 120 houses.

120/8=15, so every 15th house is chosen after a random starting point between 1
and 15. If the random starting point is 11, then the houses selected are 11, 26, 41,
56, 71, 86, 101, and 116.

If there were 125 houses, 125/8=15.625, so should you take every 15th house or
every 16th house? If you take every 16th house, 8*16=128 so there is a risk that
the last house chosen does not exist. To overcome this, the random starting point
should be between 1 and 13, since 13+7*16=125. On the other hand if you take every
15th house, 8*15=120 so the last five houses will never be selected. The random starting
point should now be between 1 and 20, since 20+7*15=125, to ensure that every house
has some chance of being selected.
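
The 120-house case can be reproduced with a few lines of code. This sketch assumes the sampling interval k is the whole-number part of the population size divided by the sample size; the function name and parameters are illustrative.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Pick every k-th unit after a start in 1..k, where k = population_size // sample_size."""
    k = population_size // sample_size
    if start is None:
        start = random.Random(seed).randint(1, k)  # random starting point between 1 and k
    return [start + i * k for i in range(sample_size)]

houses = systematic_sample(120, 8, start=11)
print(houses)  # [11, 26, 41, 56, 71, 86, 101, 116]
```

With 125 houses the same function would use k = 15 and a start between 1 and 15, so houses 121 to 125 could never be chosen. That is the situation the wider starting range described above is meant to address.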

Non-probability Sampling Methods

Non-probability sampling is a sampling technique where the samples
are gathered in a process that does not give all the individuals in the population
equal chances of being selected.

The difference between non-probability and probability sampling is that non-probability
sampling does not involve random selection and probability sampling does.

Main types of Non-probability Sampling Methods

Convenience or Accidental Sampling. Convenience sampling is
a non-probability sampling technique where subjects are selected
because of their convenient accessibility and proximity to the researcher.

Purposive Sampling. In purposive sampling, we sample with
a purpose in mind. We usually would have one or more specific
predefined groups we are seeking.

Example:

1. A study of rehabilitation after stroke collected a small sample for a focus group
of patients, care givers, and health care providers with unique expertise.

Quota sampling. A non-probability sampling technique wherein
the researcher ensures equal or proportionate representation of subjects
depending on which trait is considered as basis of the quota.

Examples:

1. If the basis of the quota is college year level and the researcher needs equal
representation, with a sample size of 100, he must select 25 1st year students,
another 25 2nd year students, 25 3rd year and 25 4th year students. The bases of
the quota are usually age, gender, education, race, religion and socioeconomic
status.
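
The arithmetic behind the equal quotas in this example can be sketched as below. The function name is an illustrative choice, and the sketch covers only the size of each quota; how the researcher fills each quota is not random and is not modeled here.

```python
def equal_quotas(sample_size, groups):
    """Split `sample_size` evenly across `groups`; assumes it divides with no remainder."""
    share, remainder = divmod(sample_size, len(groups))
    if remainder:
        raise ValueError("sample size does not divide evenly among the groups")
    return {group: share for group in groups}

quotas = equal_quotas(100, ["1st year", "2nd year", "3rd year", "4th year"])
print(quotas)
```

For a sample of 100 and four year levels, each quota is 25 students, matching the example.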

Presentation of Data
A major part of Statistics is the display of summarized data. Data are first
collected from a given source, whether an experiment, a survey, or observation, and are
then presented using one of three methods:

Textual Method. The reader acquires information through reading the
gathered data. The textual presentation combines text and figures in a statistical
report.

In the presentation of the text, the writer can emphasize the importance of some figures.
This method is not particularly effective, however, because it can make for dull
reading and may not give a clear and concise picture of the quantitative relationships
indicated in a report.

Tabular Method. Provides a more precise, systematic and orderly
presentation of data in rows or columns. Tabulation is the process of condensing
classified data and arranging them in a table.

Tables are constructed to facilitate analysis of relationship and are made possible by the
orderly arrangement of numerical facts in columns and rows.

Major Functional Parts of Statistical Table

a. Table Number. Each table must be given a number. The table number helps in
distinguishing one table from other tables. Usually tables are numbered according to
the order of their appearance in a chapter. For example, the first table in the first
chapter of a book should be given number 1.1 and the second table of the same chapter
should be given 1.2. The table number should be placed at the top or towards the left
of the table.

b. Title of the Table. Every table should have a suitable title. It should be short and clear.
The title should let the reader know the nature of the data contained in the table as
well as where and when the data were collected. It is placed either just below the
table number or at its right. The title is the main heading, written in capital letters at
the top of the table. It must explain the contents of the table and throw light on the table
as a whole. Different parts of the heading can be separated by commas; no full
stop is used in the title.

c. The Box Head (column captions). Caption refers to the headings of the columns. The
vertical heading of each column, together with any subheading, is called a column
caption, and the section of the table that contains the column captions is called the
box head. A caption should be brief, concise and self-explanatory. It is written in the
middle of its column, with only the first letter in capital and the remaining letters in
small letters.

d. Stub (Row captions). Stub refers to the headings of rows. The horizontal headings
and subheadings of the rows are called row captions, and the space where these row
headings are written is called the stub.

e. Body. This is the most important part of a table. It contains a number of cells. Cells
are formed by the intersection of rows and columns. Data are entered in these cells.
It is the main part of the table, which contains the numerical information classified
with respect to row and column captions.

f. Head Note. The head note (or prefatory note) states the unit of measurement of the
data. It is usually placed just below the title or at the top right-hand corner of the
table, and is usually enclosed in brackets.

g. Foot Note. A foot note is given at the bottom of a table, immediately below the body.
It clarifies any point that is not clear in the table. A foot note may be keyed to the
title or to any column or row heading. It is identified by symbols such as *, +, @, £ etc.

h. Source Note. The source note shows the source of the data presented in the table.
Reliability and accuracy of data can be tested to some extent from the source note. It
shows the name of the author, title, volume, page, publisher’s name, year and place of
publication of the book or journal from which the data are compiled, or the compiling
agency. The source note is given at the end of the table.

----THE TITLE----
----Prefatory Notes----

----Box Head----
----Row Captions----    |    ----Column Captions----
----Stub Entries----    |    ----The Body----

Foot Notes…
Source Notes…

Graphical Method. The use of graphs is the most effective method of
visually presenting statistical results or findings.

Types of Graphs

Example: The following are the radii of water tanks that are available on the market,
with their corresponding quantities. This table serves as the general data used for the following
graphs.

Radius (decimeter)     f      x
0.98-1.01              7      0.995
0.94-0.97              5      0.955
0.90-0.93              2      0.915
0.86-0.89              6      0.875
0.82-0.85              4      0.835
0.78-0.81              5      0.795
0.74-0.77              3      0.755
0.70-0.73              8      0.715
0.66-0.69              4      0.675
0.62-0.65              2      0.635
0.58-0.61              1      0.595
0.54-0.57              3      0.555
                       N=50
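
The x and N entries of the table can be recomputed from the class limits and frequencies. The sketch below assumes that x is the class mark, taken as the midpoint of the two class limits, and that N is the total frequency. The percentage column is an addition that shows the share each class would occupy in a pie chart.

```python
# (lower limit, upper limit, frequency) for each class, copied from the table above.
classes = [
    (0.98, 1.01, 7), (0.94, 0.97, 5), (0.90, 0.93, 2), (0.86, 0.89, 6),
    (0.82, 0.85, 4), (0.78, 0.81, 5), (0.74, 0.77, 3), (0.70, 0.73, 8),
    (0.66, 0.69, 4), (0.62, 0.65, 2), (0.58, 0.61, 1), (0.54, 0.57, 3),
]

total = sum(f for _, _, f in classes)  # N, the total frequency
class_marks = [round((lower + upper) / 2, 3) for lower, upper, _ in classes]  # x values

for (lower, upper, f), x in zip(classes, class_marks):
    percent = 100 * f / total  # share of the whole for this class
    print(f"{lower:.2f}-{upper:.2f}  f={f}  x={x:.3f}  {percent:.0f}%")
```

The computed class marks and the total of 50 agree with the x column and N in the table.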

• Pictograph
A pictograph uses an icon to represent a quantity of data values in order to decrease the
size of the graph. A key must be used to explain the icon.

• Pie chart
A pie chart displays data as a percentage of the whole. Each pie section should have a
label and percentage. A total data number should be included.

• Histogram
A histogram displays continuous data in ordered columns. Categories are of continuous
measure such as time, inches, temperature, etc.

• Bar graph
A bar graph displays discrete data in separate columns. A double bar graph can be used to
compare two data sets. Categories are considered unordered and can be rearranged
alphabetically, by size, etc.

• Frequency Polygon
A frequency polygon can be made from a line graph by shading in the area beneath the
graph. It can be made from a histogram by joining midpoints of each column.

• Scatter plot
A scatter plot displays the relationship between two factors of the experiment. A trend
line is used to determine positive, negative, or no correlation.
