Eeman S Qureshi SDA Lab 2

SDA Lab 2: Reading Datasets, Tabulation, Sum, IF function

In this lab we will learn the following:

1) How to identify string, numeric, and labelled variables
2) How to read the data
3) Discrete vs continuous variables
4) How to visualize data through tabulate, histogram, and sum
5) How to generate string variables

Q1: Open the dta file pslm_inc_edu.dta and provide the command in your do-file

The PSLM survey is conducted annually in Pakistan and measures living and social
standards. Its purpose is to investigate the social issues, behavioral practices, and health
status of households. It stores information on individuals within households, such as their
employment status and earnings.
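As a minimal sketch, the file could be opened as shown here, assuming pslm_inc_edu.dta sits in the current working directory. The folder path in the comment is a hypothetical placeholder, not part of the lab files.

```stata
* Hypothetical folder; uncomment and point it at the location of your own files
* cd "C:/SDA/lab2"

* Load the dataset, discarding any data currently in memory
use "pslm_inc_edu.dta", clear
```

The clear option is needed whenever another dataset with unsaved changes is already in memory.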
Q2: What is the first row of this dataset telling us?
Using the browse command, we can see that the first row stores information about a
24-year-old male who lives in the urban region of KPK. It also tells us that the person
interviewed is the brother of the household head. The dataset stores each person's
demographic information as well as health, social status, and income information.
To better understand and read the data, we also use the ‘describe’ command, or its short
form ‘desc’, to see what each variable name stands for.
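The commands mentioned above might be typed as follows. This is only a sketch and assumes the dataset is already in memory.

```stata
* Open the Data Browser on the first observation only
browse in 1

* Print the same observation in the Results window instead
list in 1

* Show each variable's name, storage type, display format, and label
describe
```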
There are two types of variables in STATA:
1) String variables: variables that can contain letters and other characters, not only
numbers
2) Numeric variables
Types of numeric variables (storage types):

• byte
• int (integer)
• long
• float (can hold non-integer values)
• double (can hold non-integer values, with more precision than float)
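To see these storage types side by side, we could build a small toy dataset. The sketch below is illustrative only: the variable names (small, medium, big, frac, precise, rowname) are made up for this example, and the clear command wipes whatever data is in memory, so run it before loading the lab files.

```stata
* Start from an empty dataset with three observations
clear
set obs 3

* One variable per numeric storage type
generate byte   small   = _n
generate int    medium  = _n * 1000
generate long   big     = _n * 100000
generate float  frac    = _n / 3
generate double precise = _n / 3

* A string variable holding up to 10 characters
generate str10 rowname = "row " + string(_n)

* The "storage type" column lists byte, int, long, float, double, str10
describe
```

In the browse window, rowname would appear in red because it is a string variable.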

The following shows how the data appears in the browse window:

• Black represents numeric data
• Red represents a string variable
• Blue also represents a numeric variable, but one with value labels attached, so it is
known as a labelled variable.

Q3) Import the gradebook.xls dataset again and provide the command in your do-files.
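One possible form of the import command is sketched here. It assumes gradebook.xls is in the current working directory and that the first row of the sheet holds the variable names, which is why the firstrow option is used.

```stata
* Read the Excel sheet, treating its first row as variable names,
* and replace any data currently in memory
import excel "gradebook.xls", firstrow clear
```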

Q4) What is the first row of this dataset telling us? Type the answer as a comment in your
do-file.
We will use the browse command for this. If we need more information about our variables, we
use the ‘describe’ command to check the properties of the dataset as well as what each variable
name represents.

Distribution of Variables

There are two types of variables:

Discrete Variables: These are in count form and can only take values in jumps, e.g.
number of siblings (1, 2, 3).
Continuous Variables: These can take an infinite number of values within a given interval,
e.g. height, speed, exam marks.

Q6: Identify the variable types as a comment in your do-files:


Variable Name    Variable Type (Categorical/Continuous)
School
Female
Mid
Final
Attendance

Visualizing Categorical Variables


Tabular Data

Q7: To visualise a categorical variable we use the tabulate command. Open
the ‘gradebook’ file and tabulate the variable 'female'. What does the
command show us?
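A one-way tabulation could look like the sketch below, assuming the gradebook data is already in memory.

```stata
* Frequency table: counts, percentages, and cumulative percentages
tabulate female

* Same table, but listing missing values as their own category
tabulate female, missing
```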

CROSS-TABULATION/TWO-WAY TABULATION:
How can we see the school-wise distribution of females? We can do this through
cross-tabulation, or by using relational operators with the tab command. Which of the
two tables below is easier to interpret?

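. tab school female

           |        female
    school |         0          1 |     Total
-----------+----------------------+----------
         1 |         5          4 |         9
         2 |         7          3 |        10
         3 |         5          5 |        10
         4 |         5          5 |        10
-----------+----------------------+----------
     Total |        22         17 |        39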

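. tab female school

           |                   school
    female |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         0 |         5          7          5          5 |        22
         1 |         4          3          5          5 |        17
-----------+--------------------------------------------+----------
     Total |         9         10         10         10 |        39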
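Raw counts can be easier to read once percentages are added. The sketch below uses the row and column options of tabulate, and also shows the relational-operator alternative mentioned earlier. It assumes the gradebook data is in memory.

```stata
* Row percentages: the share of each sex within every school
tabulate school female, row

* Column percentages: the share of each school within every sex
tabulate school female, column

* Relational-operator alternative: a one-way table for school 1 only
tabulate female if school == 1
```

The last command has to be repeated once per school, which is one reason a single cross-tabulation is often the more convenient choice.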

Q8: Interpret the following table and provide the interpretation as a comment in your
do-file

Visualizing Continuous Variables


To visualize a continuous variable, we use the histogram command. Histograms provide a view
of the data density. A histogram breaks the range of values of a variable into classes and
displays only the count or percent of the observations that fall into each class. Higher bars
show where the data are relatively more common. Plot a histogram in STATA for students'
Mid scores:

• histogram Mid (the vertical scale of a density histogram uses units that make the
total area of all the bars add up to 1)
• histogram Mid, normal (overlays a normal density curve on the histogram)
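A few variations on the histogram command are sketched below. The choice of 8 bins is an arbitrary value picked for illustration, not something the lab prescribes.

```stata
* Vertical axis as counts rather than density
histogram Mid, frequency

* Vertical axis as percentages, with a hand-picked number of bins
histogram Mid, percent bin(8)

* The same kind of plot for Final, with a normal curve overlaid
histogram Final, normal
```

Note that STATA commands are case-sensitive, so histogram must be typed in lowercase, while the variable name Mid keeps its capital letter.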

We can also summarize continuous variables numerically through the ‘sum’ (summarize) command.



Q9: How do we get the average/mean of Mids? Type the command in your
do-files.
If we want the summary statistics for all variables, we simply type sum and it shows us the following
table:
. sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          Id |         39    184.7436     143.3211        19        419
   studentno |         39    10.33333     5.474022         1         20
        Name |          0
      school |         39    2.538462     1.120295         1          4
      female |         39    .4358974     .5023561         0          1
-------------+---------------------------------------------------------
         Mid |         36    29.33472     5.383855        18         38
       Final |         39    75.92308     9.426683        48        100
  attendance |         39        .825     .1588031       .51          1

Summary statistics of Name are not calculated because it is a string variable.
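To summarize a single variable rather than the whole dataset, the variable name is added after the command. The following is a sketch, assuming the gradebook data is in memory.

```stata
* Observations, mean, standard deviation, minimum, and maximum of Mid
summarize Mid

* summarize stores its results; r(mean) holds the mean just computed
display r(mean)

* Adds percentiles, variance, skewness, and kurtosis
summarize Mid, detail
```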

Q10: Why does Mid show only 36 observations instead of 39?

Missing values can be present within a variable in our dataset. We use the following command to
check for missing values in a variable:

codebook Mid
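Besides codebook, there are other ways to look at missing values. The sketch below assumes the gradebook data is in memory and uses the variable names shown in the sum output.

```stata
* Count the observations where Mid is missing
count if missing(Mid)

* List which students they are
list Id Name if missing(Mid)

* Report missing values for Mid in a small table
misstable summarize Mid
```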

Now suppose we want to find the mean Mid score for females only. To do
this we use the ‘if’ qualifier. We can give STATA the if condition using
relational and logical operators. Both can be used together.

Relational Operators in STATA:

• == (equal to)
• <= (less than or equal to)
• >= (greater than or equal to)
• != (not equal to)
• > (greater than)
• < (less than)

Logical Operators
• & (‘AND’: both conditions must hold true simultaneously)
• | (‘OR’: at least one of the conditions must hold true)

To check the average Mid score for females, we type the following
command:

sum Mid if female==1
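Relational and logical operators can be combined in a single condition. The sketch below assumes the gradebook data is in memory; the cut-off values 80 and 30 are arbitrary numbers chosen for illustration.

```stata
* Females in school 2: both conditions must hold
summarize Mid if female == 1 & school == 2

* Students who are female or scored at least 80 in Final
summarize Mid if female == 1 | Final >= 80

* STATA treats a missing number as larger than any value,
* so missing scores are excluded explicitly here
count if Mid > 30 & !missing(Mid)
```

Variable names are case-sensitive, so Mid and mid would refer to different variables.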

We can also use the if qualifier in conjunction with the following functions:

• inlist(variable_name, x, y): used with the if qualifier, STATA keeps
only the observations where the variable equals either x or y. More
than two values can be listed.
• inrange(variable_name, x, y): used with the if qualifier, STATA keeps
only the observations where the variable lies between x and y, with x
being the lower value and y being the higher value. The range includes
both x and y.
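Both functions also work on continuous variables and can be negated. The sketch below assumes the gradebook data is in memory; the bounds 60 and 80 are arbitrary illustration values.

```stata
* inrange() includes both end points, so these two counts agree
count if inrange(Final, 60, 80)
count if Final >= 60 & Final <= 80

* The ! operator negates the condition: every school except 1 and 3
summarize Mid if !inlist(school, 1, 3)
```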

What if you want to get summary statistics of math scores for school 1 and
school 3?

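. *1) Logical & Relational Operators

. sum Mid if school ==1 | school ==3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         Mid |         16    31.38312     4.168468     22.89      36.33

. *2) Using inlist

. sum Mid if inlist(school, 1,3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         Mid |         16    31.38312     4.168468     22.89      36.33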



What about summary statistics of math scores for schools 2, 3, and 4?

There are three ways to do this:

. *1) Logical and Relational Operators:

. sum Mid if school >= 2 & school <= 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         Mid |         27    28.55667     5.594136        18         38

. *2) Inlist function

. sum Mid if inlist(school, 2,3,4)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         Mid |         27    28.55667     5.594136        18         38

. *3) Inrange function

. sum Mid if inrange(school, 2,4)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         Mid |         27    28.55667     5.594136        18         38
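A quick way to confirm that different conditions select the same observations is to compare their counts. This sketch assumes the gradebook data is in memory. Note that a strict inequality such as school < 4 would leave out school 4, so <= is needed to include it.

```stata
* All three counts should be identical
count if inlist(school, 2, 3, 4)
count if inrange(school, 2, 4)
count if school >= 2 & school <= 4

* assert stops with an error if the two conditions ever disagree
assert inlist(school, 2, 3, 4) == inrange(school, 2, 4)
```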

Q11: Find the summary statistics for Mid from school 1,3,4?
