
BUP 03 Computing Descriptive Statistics


Dr. Md. Abdus Salam Akanda Website: http://du.ac.bd
Professor of Statistics, DU E-mail: [email protected]

Computing Descriptive Statistics-I

How can you summarize the values of a variable?


The values of a variable can be summarized with a set of statistical techniques collectively
known as descriptive statistics. SPSS offers the following procedures under Descriptive
Statistics:
• Frequencies
• Descriptives
• Explore
• Crosstabs

Frequency distribution: A set of classes together with the frequencies of occurrence of
values in each class in a given set of data, presented in tabular form, is referred to as a
frequency distribution. The classes should be non-overlapping or mutually exclusive.
Frequency tables are essential for getting acquainted with the data. A frequency distribution
is one of the most useful ways of describing, summarizing and condensing a set of data.

A frequency distribution usually takes one of three forms, depending on the type of data
available. When it is constructed from categorical data, we call it a categorical frequency
distribution. When it is based on discrete data, we call it a discrete frequency distribution.
When it is based on continuous data, we call it a continuous frequency distribution.

Procedures for obtaining a frequency distribution: Obtaining a frequency table in SPSS is
simple. Open the data file and follow the instructions given below.
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box.
3. Click OK.

Data file: gssnet.sav (obtained from the General Social Survey in the USA, 2000)
Example:
Analyze
Descriptive Statistics
Frequencies … (usenet, agecat)
OK

Table: Frequency table of Internet use

Use                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid     No                 734      51.7            52.9                 52.9
          Yes                654      46.1            47.1                100.0
          Total             1388      97.8           100.0
Missing   No answer           31       2.2
Total                       1419     100.0

Explanation: From the last row of the frequency table, we see that a total of 1419 people
participated in the survey. Of these, 31 had a response identified as Missing. The other 1388
people selected one of the two possible valid responses. Note that the people who gave the
response ‘Yes’ are 46.1% of the 1419 people in the survey. The 734 people who gave the
response ‘No’ are 51.7% of the sample. The 31 people whose responses are not available are
2.2% of the total sample. The Valid Percent column uses only the 1388 valid responses as
its base, so ‘No’ is 52.9% and ‘Yes’ is 47.1% of those who answered.
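
The difference between the Percent and Valid Percent columns can be checked by hand. The following is a minimal Python sketch, not SPSS output, that recomputes both columns from the counts in the Internet-use table; the variable names are illustrative assumptions.

```python
# Counts taken from the Internet-use frequency table above.
counts = {"No": 734, "Yes": 654}
missing = 31

valid_total = sum(counts.values())    # 1388 valid responses
grand_total = valid_total + missing   # 1419 respondents in all

rows = []
cumulative = 0.0
for label, freq in counts.items():
    percent = 100 * freq / grand_total        # base: everyone surveyed
    valid_percent = 100 * freq / valid_total  # base: valid responses only
    cumulative += valid_percent               # running total of valid percent
    rows.append((label, freq, round(percent, 1),
                 round(valid_percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

The two columns differ only because their denominators differ; when a variable has no missing values they coincide.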

Table: Frequency table of age category

Age                Frequency   Percent   Valid Percent   Cumulative Percent
Valid   18-29            251      17.7            17.7                 17.7
        30-39            311      21.9            21.9                 39.6
        40-49            306      21.6            21.6                 61.2
        50-59            212      14.9            14.9                 76.1
        60-89            339      23.9            23.9                100.0
        Total           1419     100.0           100.0

What do you notice about the Percent and Valid Percent columns in this table?

Sorting a Frequency Table: A frequency table can be sorted in either ascending or
descending order of frequencies.

Procedures for sorting a frequency table: Open the data file and follow the instructions
given below.
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box.
3. Click on Format and then, in the Frequencies: Format box, select Descending
counts.
4. Click Continue and then click OK.
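
Outside SPSS, the same idea of ordering a frequency table by counts can be sketched in Python. The responses below are hypothetical values invented for illustration; they do not come from gssnet.sav.

```python
from collections import Counter

# Hypothetical responses, invented for illustration only.
responses = ["Yes", "No", "No", "Yes", "No", "Maybe", "No", "Yes"]

counts = Counter(responses)

# most_common() returns (value, count) pairs from highest to lowest count.
descending = counts.most_common()

# Sorting the pairs by count gives the ascending order instead.
ascending = sorted(counts.items(), key=lambda pair: pair[1])

print(descending)
print(ascending)
```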

How can you sort the frequency table of the Search engine variable?


Measures of central tendency:


Mean: The most commonly used measure of central tendency is the arithmetic mean, also
known as the average.

It is important to note that the mean can be calculated only for numerical data. Categorical
(nominal or ordinal) data such as sex, opinion or health status do not permit calculation of
the arithmetic mean.

Procedures for obtaining the mean: Open the data file and follow the instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Mean.
4. Click Continue and then click OK.

What is the mean of the variable age?

Median: The median is the middlemost observation when the observations of a particular
study are arranged in (ascending or descending) order of magnitude. That is, the number of
observations above the median is equal to the number of observations below it.

The median cannot be calculated for nominal measurements, since ranking of the observations is
not possible. When numeric values are attached to categories, they are merely identifiers, and
hence none of the properties of numbers can be applied to these numerically coded categories.
For ordinal data, however, the median is usually a good measure, since it uses the ranking
information.
• If n is odd, the median is the $\frac{n+1}{2}$th observation.
• If n is even, the median is the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations.

The median of 5, 10, 15, 18, 23 is the $\frac{5+1}{2} = 3$rd observation, i.e., 15. The median of 5,
15, 20, 10 is the average of the $\frac{4}{2} = 2$nd and $\left(\frac{4}{2}+1\right) = 3$rd observations of the
ordered values 5, 10, 15, 20, i.e., $\frac{10+15}{2} = 12.5$.

The median ignores much of the available information. The median is 30 for both of the data
sets 28, 29, 30, 35, 37 and 28, 29, 30, 98, 170.
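
The odd/even rule can be written out as a short Python sketch. The function name `median` and the result variables are illustrative choices; the data are the examples given in the text.

```python
def median(values):
    """Median by the odd/even rule: sort, then pick or average the middle."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1) / 2-th observation (index mid, counting from 0).
        return ordered[mid]
    # n even: average of the n/2-th and (n/2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


m_odd = median([5, 10, 15, 18, 23])
m_even = median([5, 15, 20, 10])       # unsorted input is ordered first
m_a = median([28, 29, 30, 35, 37])
m_b = median([28, 29, 30, 98, 170])    # extreme values leave the median unchanged

print(m_odd, m_even, m_a, m_b)
```

The last two calls illustrate why the median is said to ignore much of the information: replacing 35 and 37 by 98 and 170 changes the mean considerably but not the median.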
Procedures for obtaining the median: Open the data file and follow the instructions given
below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Median.
4. Click Continue and then click OK.

What is the median of the variable “nethrs”?

Mode: The mode is the value of a distribution for which the frequency is maximum. In other
words, the mode is the value of a variable that occurs with the highest frequency. For
example, if a population consists of 87% Muslims, 11% Hindus, and the remaining 2% are
followers of other religions, the modal category is Muslim, since it contains the most people.

Procedures for obtaining the mode: Open the data file and follow the instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Mode.
4. Click Continue and then click OK.

What is the mode for the Search engine variable?

Finding Mean, Median and Mode for the variables age and education: Open the data file
and follow the instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Select the variables age and educ from the variable list and move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Mean, Median and Mode.
4. Click Continue and then click OK.
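
For comparison with the SPSS output, the three measures can also be computed with Python's standard `statistics` module. This is only a sketch: the values below are hypothetical and are not taken from gssnet.sav.

```python
import statistics

# Hypothetical values, invented for illustration; not taken from gssnet.sav.
age = [23, 35, 35, 41, 52, 35, 60, 41]
search_engine = ["Google", "Yahoo", "Google", "Other", "Google"]

age_mean = statistics.mean(age)
age_median = statistics.median(age)
age_mode = statistics.mode(age)

# The mode is also defined for categorical data, where the mean is not.
engine_mode = statistics.mode(search_engine)

print(age_mean, age_median, age_mode, engine_mode)
```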

Quartiles, Deciles and Percentiles: We know that the median divides the items arranged in
order of magnitude into two equal parts. Other measures allied to the median include
the quartiles, deciles, and percentiles, because they are also based on position in a series
of observations. Together they are called quantiles, fractiles or partition values.


Quartiles: There are three quartiles in a data series, usually denoted by $Q_1$, $Q_2$ and $Q_3$,
which divide the whole distribution into four equal parts. The second quartile, $Q_2$, is identical
with the median. The first quartile, $Q_1$, is the value at or below which one-fourth (25%) of all
items in the series fall; the third quartile, $Q_3$, is the value at or below which three-fourths
(75%) of the items lie.
If n is divisible by 4, the first quartile $Q_1$ has the value half-way between the $\frac{n}{4}$th and
$\left(\frac{n}{4}+1\right)$th observations. If n is not exactly divisible by 4, i.e., $\frac{n}{4}$ is not an integer, the first
quartile is the observation whose position is the next higher integer. To find the third
quartile, $Q_3$, we replace $\frac{n}{4}$ by $\frac{3n}{4}$.

Consider the following series of 12 values arranged in ascending order:
14, 17, 19, 23, 27, 32, 40, 49, 54, 59, 71, 80.
Here $n = 12$, which is divisible by 4. The quotient is 3. Thus the first quartile is the
average of the 3rd and 4th items, which in this case is $\frac{19+23}{2} = 21$. Now add a new
value, 94, so that $n = 13$ and $\frac{n}{4}$ is not an integer. Here $\frac{n}{4} = \frac{13}{4} = 3.25$. The next
higher integer is 4. Thus the 4th value, 23, is the first quartile.

With $n = 12$, the third quartile is the value mid-way between the $\frac{3n}{4}$th and $\left(\frac{3n}{4}+1\right)$th
observations, since $\frac{3n}{4} = 9$ is an integer. Thus $Q_3$ is the average of the 9th and 10th
observations, which is $\frac{54+59}{2} = 56.5$. If the value 94 is added to the series as before, $n = 13$
and the third quartile is the 10th observation, since $\frac{3n}{4} = \frac{39}{4} = 9.75$ is not an integer.
The next higher integer is 10. Thus $Q_3$ is the tenth value, which is equal to 59.
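
The positional rule used in this worked example can be sketched in Python as follows. The function name `quartile` is an illustrative assumption, and the rule coded here is the one described in the text; SPSS may use a different interpolation rule, so its output can differ slightly on the same data.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) by the positional rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or k not in (1, 2, 3):
        raise ValueError("need a non-empty data set and k in {1, 2, 3}")
    # Split k*n/4 into its whole part and remainder using integer arithmetic.
    position, remainder = divmod(k * n, 4)
    if remainder == 0:
        # k*n/4 is an integer: average that observation and the next one.
        return (ordered[position - 1] + ordered[position]) / 2
    # Otherwise take the observation at the next higher integer position.
    return ordered[position]


series12 = [14, 17, 19, 23, 27, 32, 40, 49, 54, 59, 71, 80]
series13 = series12 + [94]

q1_12, q3_12 = quartile(series12, 1), quartile(series12, 3)
q1_13, q3_13 = quartile(series13, 1), quartile(series13, 3)

print(q1_12, q3_12, q1_13, q3_13)
```

Integer division is used instead of floating-point arithmetic so that the test for "is an integer" is exact.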

Procedures for obtaining quartiles: Open the data file and follow the instructions given
below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Quartiles.
4. Click Continue and then click OK.

Percentiles: Percentiles are the values that divide the distribution into 100 equal parts.
Thus there are 99 percentiles in a distribution, conventionally denoted by $P_1, P_2, \ldots, P_{99}$.
The median is the 50th percentile, the first quartile, $Q_1$, is the 25th percentile and the third
quartile, $Q_3$, is the 75th percentile.

Procedures for obtaining percentiles: Open the data file and follow the instructions given
below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Percentile(s). Then enter the
percentile you want to know and click on Add.
4. Click Continue and then click OK.

Deciles: When a distribution is divided into ten equal parts, each division is called a decile.
Thus, there are 9 deciles in a distribution, which are denoted by $D_1, D_2, \ldots, D_9$.
Obviously $D_5 = Me = P_{50}$.
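
That the fifth decile, the median and the 50th percentile coincide can be illustrated with a small Python sketch. It assumes that the positional rule given earlier for quartiles carries over unchanged to deciles and percentiles, which the text does not state explicitly; the function name `quantile` and its parameters are hypothetical.

```python
def quantile(values, k, parts):
    """k-th of the values cutting the ordered data into `parts` equal groups.

    Assumption: the same positional rule as for quartiles is applied.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or not 0 < k < parts:
        raise ValueError("need a non-empty data set and 0 < k < parts")
    position, remainder = divmod(k * n, parts)
    if remainder == 0:
        # Exact position: average that observation and the next one.
        return (ordered[position - 1] + ordered[position]) / 2
    # Fractional position: take the next higher observation.
    return ordered[position]


series = [14, 17, 19, 23, 27, 32, 40, 49, 54, 59, 71, 80]
d5 = quantile(series, 5, 10)      # fifth decile
p50 = quantile(series, 50, 100)   # fiftieth percentile
me = quantile(series, 1, 2)       # median

print(d5, p50, me)
```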

Procedures for obtaining deciles: Open the data file and follow the instructions given
below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Cut points for 10 equal groups.
4. Click Continue and then click OK.

Finding quartiles for the variables nethrs, emailhrs, webhrs: Open the data file and
follow the instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on
Descriptive Statistics, then Frequencies…
2. Select the variables nethrs, emailhrs and webhrs from the variable list and move these
into the Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Quartiles.
4. Click Continue and then click OK.

Problem:
The variable tvhours tells you how many hours per day GSS respondents say they watch
television.
1. Make a frequency table of the hours of television watched. Do any of the values strike
you as strange? Explain.
2. Based on the frequency table, answer the following questions: Of the people who
answered the question, what percentage don’t watch any television? What percentage
watch two hours or less? Five hours or more? Of the people who watch television,
what percentage watch one hour? What percentage watch four hours or less?
3. From the frequency table, estimate the 25th, 50th, 75th, and 95th percentiles.
4. What is the value for the median? The mode?

Harmonic and geometric means:

Harmonic and geometric means are both available in the MEANS command in SPSS. In the
SPSS menus, select Analyze>Compare Means>Means, then click on the Options button
and select them from the list of available statistics on the left.

Computing Descriptive Statistics-II

You have already learned how to summarize the information by computing summary
statistics that describe the ‘typical’ value, or the central tendency. Very often you might want
to know how the data spread out around a typical value. This can be done through measures
of variability that attempt to quantify the spread of observations.


Measures of Dispersion: A measure of dispersion quantifies the magnitude of the
variation in a set of observations. Measures of dispersion fall into two major
categories:
a. The absolute measures of dispersion
b. The relative measures of dispersion

Absolute measures of dispersion: When dispersion is measured in the original units, it is
known as absolute dispersion. The four important absolute measures of dispersion are as
follows:
i. Range
ii. Quartile deviation
iii. Mean or average deviation
iv. Standard deviation

Relative measures of dispersion: It is difficult to compare different data sets using absolute
dispersion, especially when the data are in different units. In such a situation we may use
relative dispersion. A relative measure of dispersion is independent of the original units.
Generally, relative measures of dispersion are expressed as a ratio, a percentage, etc. The
relative measures of dispersion are as follows:
i. Coefficient of range
ii. Coefficient of quartile deviation
iii. Coefficient of mean deviation
iv. Coefficient of variation

Range: The range R  of a set of observations is the difference between two extreme values,
i.e., the difference between the maximum and minimum values. Therefore, it indicates the
limits within which all the observations fall. In the form of an equation:
Range  Highest value  Lowest value

Procedures for obtaining the range: Open the data file and follow the instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on Descriptive
Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Range, Maximum and Minimum.
4. Click Continue and then click OK.

What are the ranges of the variables age and education?

Variance and Standard Deviation: The most important and commonly used measure of
dispersion is the standard deviation. The standard deviation is the positive square root of the
variance, i.e., of the mean of the squared deviations of the observations from their arithmetic
mean.
The population standard deviation of a population data set of N entries is
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
The sample standard deviation of a sample data set of n entries is
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
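
The two formulas differ only in the divisor, which the following Python sketch makes explicit. The function names and the data are hypothetical, chosen so that the arithmetic is easy to follow.

```python
import math


def population_sd(values):
    """Square root of the mean squared deviation from the mean (divisor N)."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def sample_sd(values):
    """Same sum of squared deviations, but with divisor n - 1."""
    if len(values) < 2:
        raise ValueError("sample standard deviation needs at least two values")
    x_bar = sum(values) / len(values)
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5
sigma = population_sd(data)        # sum of squares 32, divided by 8
s = sample_sd(data)                # sum of squares 32, divided by 7

print(sigma, s)
```

The variance is the square of either quantity, so `sigma ** 2` and `s ** 2` give the population and sample variances.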


Procedures for obtaining variance and standard deviation: Open the data file and follow
the instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on Descriptive
Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Variance and Std. deviation.
4. Click Continue and then click OK.

What are the variance and standard deviation of the variables age and education?

Standard score / z-score: The standard score, or z-score, represents the number of standard
deviations a given value x falls from the mean. To find the z-score for a given value, use the
following formula:
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$

The mean of the standard scores of a variable is always 0 and their standard deviation is always 1.
A z-score can be negative, positive or zero. If z is negative, the corresponding x-value is
below the mean. If z is positive, the corresponding x-value is above the mean. And if $z = 0$,
the corresponding x-value is equal to the mean.

If a z-score is 1.5, we can conclude that the variable value is 1.5 standard deviations
above the mean; a z-score of -2.25 implies that it is 2.25 standard deviations below the mean.
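
The claim that standard scores always have mean 0 and standard deviation 1 can be checked numerically. The sketch below uses hypothetical data and the sample standard deviation (divisor n - 1); that choice of divisor is an assumption made for the illustration.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
mean = statistics.mean(values)
sd = statistics.stdev(values)        # sample standard deviation (divisor n - 1)

# z = (value - mean) / standard deviation, applied to every observation.
z_scores = [(x - mean) / sd for x in values]

z_mean = statistics.mean(z_scores)
z_sd = statistics.stdev(z_scores)

print(z_scores)
print(z_mean, z_sd)
```

Because of floating-point rounding, `z_mean` and `z_sd` may differ from 0 and 1 in the last few decimal places.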


Procedures for obtaining z-scores: Open the data file and follow the instructions given
below:
1. From the menu at the top of the screen click on Analyze, then click on Descriptive
Statistics, then Descriptives…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then select Save standardized values as variables.
3. Click OK.

What are the z-scores of the variables age and education?

Standard Error: The standard deviation of the sampling distribution of a statistic is known
as its Standard Error, abbreviated as S.E.

The standard deviation of the sampling distribution of the sample means is called the standard
error of the mean and is defined as $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.

It can be used to roughly compare the observed mean to a hypothesized value.
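
In practice the population standard deviation is usually unknown and is replaced by the sample standard deviation s. The following Python sketch computes the standard error of the mean on that basis; the data are hypothetical.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
n = len(sample)

s = statistics.stdev(sample)         # sample standard deviation, used in place of sigma
se_mean = s / math.sqrt(n)           # standard error of the mean

print(se_mean)
```

The standard error shrinks as n grows, which is why means of larger samples vary less from sample to sample.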

Procedures for obtaining the standard error: Open the data file and follow the instructions
given below:
1. From the menu at the top of the screen click on Analyze, then click on Descriptive
Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select S.E. mean.
4. Click Continue and then click OK.

What is the standard error of the variable age?

Shape characteristics: The study of central tendency and dispersion provides us with
valuable information relating to the central value as well as the variability of the distribution.
Unfortunately, these measures fail to demonstrate how the observations are arranged and
accumulated about the central value of the distribution. The arrangement and accumulation of
the observations determine the characteristics of the distribution with respect to its shape and
pattern. The study of these shape characteristics of a distribution is of crucial importance in
comparing a distribution with other distributions. By shape characteristic of a distribution, we
refer to the extent of its asymmetry and peakedness relative to an agreed upon standard and
the study of these two characteristics is accomplished through what is known as the measures
of skewness and kurtosis respectively.

The term skewness refers to the lack of symmetry. The lack of symmetry in a distribution is
always determined with reference to a normal distribution.
• When the skewness is positive, the distribution is positively skewed.
• When the skewness is negative, the distribution is negatively skewed.
• Absence of skewness makes the distribution symmetrical.

Kurtosis, on the other hand, refers to the degree of peakedness of a distribution, usually
taken in relation to a normal distribution. A curve with a relatively higher peak than the
normal curve is known as leptokurtic. If the curve is more flat-topped than the normal
curve, it is called platykurtic. A normal curve itself is called mesokurtic,
which is neither too peaked nor too flat-topped.
• If kurtosis is positive, the distribution is leptokurtic.
• If kurtosis is negative, the distribution is platykurtic.
• If kurtosis is zero, the distribution is mesokurtic.
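
The sign conventions above can be illustrated with a short Python sketch. It uses simple moment-based formulas, with kurtosis measured as the excess over the normal value of 3; this is an assumption made for the illustration, and SPSS may apply small-sample adjustments, so its figures can differ from these. The function name and data are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are equal; shape measures are undefined")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # zero for a normal distribution
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]            # hypothetical, symmetric about 3
right_tailed = [1, 1, 2, 2, 3, 10]     # hypothetical, long right tail

skew_sym, kurt_sym = skewness_and_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_kurtosis(right_tailed)

print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

The symmetric data give zero skewness and negative excess kurtosis (platykurtic), while the right-tailed data give positive skewness.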


Procedures for obtaining skewness and kurtosis: Open the data file and follow the
instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on Descriptive
Statistics, then Frequencies…
2. Choose and highlight the variables you are interested in. Move these into the
Variable(s) box. Then click Statistics…
3. In the Frequencies: Statistics dialog box select Skewness and Kurtosis.
4. Click Continue and then click OK.

What are the skewness and kurtosis of the variables age, education, and nethrs?

Comparing Groups: You can determine if the values of the summary statistics for a variable
differ for subgroups of cases.

Procedures for comparing groups: Open the data file and follow the instructions given
below:
1. From the menu at the top of the screen click on Analyze, then click on Compare
Means, then Means…
2. Choose a dependent variable and an independent variable. Move the dependent
variable into the Dependent list and the independent variable into the Independent
list. Then click Options…
3. In the Means: Options dialog box, move Mean, Median and Number of cases to the
Cell Statistics box.
4. Click Continue and then click OK.
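
The idea behind the Means procedure, computing the same summary statistics separately within each subgroup, can be sketched in Python. The group labels and ages below are hypothetical and are not taken from gssnet.sav.

```python
import statistics
from collections import defaultdict

# Hypothetical (group, age) pairs, invented for illustration only.
cases = [
    ("low", 62), ("low", 55),
    ("medium", 41), ("medium", 47), ("medium", 38),
    ("high", 29), ("high", 33),
]

# Collect the dependent variable (age) under each level of the grouping variable.
groups = defaultdict(list)
for group, age in cases:
    groups[group].append(age)

# Mean, median and number of cases per group, as in the Cell Statistics box.
summary = {
    group: {
        "mean": statistics.mean(ages),
        "median": statistics.median(ages),
        "n": len(ages),
    }
    for group, ages in groups.items()
}

for group, stats in summary.items():
    print(group, stats)
```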

How can you find the mean and median age of people in each of the five Internet usage
categories (netcat variable)?

Explore: You can obtain most of the descriptive statistics with a single command.

Procedures for exploring: Open the data file and follow the instructions given below:
1. From the menu at the top of the screen click on Analyze, then click on Descriptive
Statistics, then Explore…
2. Choose and highlight the variables you are interested in. Move these into the Dependent
list. Then, under Display, select Statistics.
3. Click OK.

How can we explore the variable age?

Problem: Use the electric.sav data file to answer the following questions:
1. Calculate the mean, median, and mode for cholesterol values and diastolic blood
pressures in 1958 (variables chol158 and dbp58). For each variable, compare the
values of these statistics. Which measure of central tendency do you think best
summarizes each variable?
2. Consider the number of cigarettes smoked in 1958 (variable cgt58). Describe the
smoking habits of the men in 1958. (Be sure to include what percentage of men in
your sample were nonsmokers in 1958 as well as the mean and median number of
cigarettes smoked by the smokers.)
3. Compute standardized scores for cholesterol values and diastolic blood pressure.
4. What is the smallest standardized score for diastolic blood pressure? The largest?
5. What is the smallest standardized score for cholesterol values? The largest?
6. Compute the means and standard deviations of the standardized scores. What are they?
Compute the quartiles for the standardized scores.
