SOCS0075 Lecture5

The document outlines essential practices for handling data in a BSc SSQM dissertation, emphasizing reproducibility, proper project setup, and the importance of addressing missing values. It discusses techniques for reality checks, ensuring consistency in conceptualization and measurement, and the classification of variables. Recommendations for handling missing data include listwise deletion and careful consideration of the implications of missing observations.

BSc SSQM Dissertation (SOCS0075)

Data handling, missing values, and reality checks

Tobias Rüttenauer
Social Research Institute
UCL Institute of Education

January 8, 2024

1 / 23
Happy new year!
Any questions? Anything to discuss?

2 / 23
Topics today

Today:
Set-up
Reality checks
Missing values
Results

3 / 23
Reproducibility

Main aim: Your dissertation should be perfectly reproducible!

Set up a reproducible project
Keep track of things (comments in your code!): data cleaning, recoding, statistical method / analysis
Report your research transparently
DO NOT (never!) copy-paste single numbers / columns from a table in R or Stata into Excel or Word.

4 / 23
Project set-up

Keep things in order, using a folder system such as 01 Script, 02 Data, 03 Output, 04 Document.
Comment your code - future you will be grateful.
Use automated outputs for tables and figures!
Use bibliography management software: Zotero (+ Better BibTeX), EndNote.
Here is an example.
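
Separately from the example linked above, here is a minimal sketch of how such a folder system could be created from within R. The root directory name and the underscores in the folder names are assumptions made for illustration. A temporary directory is used so that the sketch does not touch any real project.

```r
# Hypothetical project root inside a temporary directory (assumption for illustration)
project_root <- file.path(tempdir(), "dissertation")

# Folder names follow the scheme on the slide (underscores instead of spaces)
folders <- c("01_Script", "02_Data", "03_Output", "04_Document")

# Create each folder; recursive = TRUE also creates the project root if needed
for (f in folders) {
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

# Build paths relative to the project root instead of hard-coding absolute paths
output_dir <- file.path(project_root, "03_Output")
list.files(project_root)
```

Building every path with file.path() from one root is one way to keep scripts portable across computers.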

5 / 23
Reality checks

6 / 23
Dubious Values and Incomplete Data

Start by looking at descriptive statistics: means, variances, and ranges for each of the variables in our analysis
Examine scatterplots for outliers and nonlinearities
How are missing values coded?
Make sure you declare them as missing!!!
It would be a shame if your marital status variable looked like
0 – Single
1 – Married
-999 – ???
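
A minimal sketch of declaring such a code as missing in R. The data frame and the variable name marital are hypothetical, and -999 is assumed to be the only missing-value code.

```r
# Hypothetical toy data: marital status with -999 used as the missing-value code
df <- data.frame(marital = c(0, 1, -999, 1, 0))

# Before declaring the code as missing, -999 is treated as a real value
mean_before <- mean(df$marital)

# Declare -999 as missing (%in% is safe even if the column already contains NA)
df$marital[df$marital %in% -999] <- NA

# Check the result: the table should now show only 0, 1 and NA
table(df$marital, useNA = "ifany")
mean_after <- mean(df$marital, na.rm = TRUE)
```

Comparing mean_before and mean_after shows how strongly an undeclared missing code can distort a simple descriptive statistic.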

7 / 23
Dubious Values and Incomplete Data

Using the R package stargazer (Hlavac, 2022):

library(stargazer)
# Table of summary statistics for the variables in the data frame `data`
stargazer(data, style = "asr", digits = 3, type = "html",
          summary.stat = c("n", "mean", "sd", "min", "max"))

The HTML format can be read by Word.
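
To inspect the same statistics before exporting anything, a base R sketch such as the following may help. The data frame and its variables are hypothetical, and this is not part of the stargazer workflow above.

```r
# Hypothetical toy data with one missing value
df <- data.frame(
  income = c(2000, 3100, NA, 2700),
  age    = c(34, 51, 45, 29)
)

# For every variable: number of non-missing values, mean, sd, min and max
summary_stats <- t(sapply(df, function(x) c(
  n    = sum(!is.na(x)),
  mean = mean(x, na.rm = TRUE),
  sd   = sd(x, na.rm = TRUE),
  min  = min(x, na.rm = TRUE),
  max  = max(x, na.rm = TRUE)
)))

round(summary_stats, 3)
```

Implausible minima or maxima (for example a negative age or an income of -999) are usually visible at a glance in such a table.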

8 / 23
Consistency in Conceptualization and Measurement

Make sure that your variables align with your theory
e.g. wealth ≠ labour income
Keep the direction straight
poverty vs income
equality vs inequality (e.g. Gini)
reversing the sign complicates reading
Communicate clearly and simply
Be concise in your coding and with your labels
sex (0, 1) vs. female (0, 1)
income vs. net monthly income (in EUR)
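
A small sketch of what self-explanatory names can look like in R. The raw variable names and the coding 1 = male, 2 = female are assumptions for illustration. Check the codebook of your own data.

```r
# Hypothetical raw data: sex coded 1 = male, 2 = female (assumed coding)
df <- data.frame(
  sex = c(1, 2, 2, 1),
  inc = c(1800, 2500, NA, 3100)
)

# A dummy named after the category coded 1 is unambiguous in a regression table
df$female <- as.integer(df$sex == 2)

# A name that states concept and unit avoids guessing later on
names(df)[names(df) == "inc"] <- "net_monthly_income_eur"

str(df)
```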

9 / 23
Correct classification of your variables

The classification of your variables determines what you can do with them.
Continuous / interval-ratio variables / numeric
E.g., income, age, weight, years of work experience
Ordinal variables: there is a natural ordering of the categories
Nominal variables: there is no natural ordering of the categories
e.g. education level (ordinal), gender, vote choice (nominal)
make sure to declare that they are categorical (as.factor())
consider the choice of the omitted reference category, e.g. largest category (relevel())
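
The two functions named above can be used as in the following sketch. The data and category labels are hypothetical.

```r
# Hypothetical toy data
df <- data.frame(
  vote = c("Labour", "Conservative", "Labour", "LibDem", "Labour", "Conservative"),
  educ = c("low", "high", "medium", "medium", "high", "low")
)

# Nominal variable: declare as (unordered) factor
df$vote <- as.factor(df$vote)
levels(df$vote)  # by default alphabetical; the first level is the reference category

# Use the largest category as the omitted reference category
largest <- names(which.max(table(df$vote)))
df$vote <- relevel(df$vote, ref = largest)

# Ordinal variable: state the order of the categories explicitly
# Note: lm() applies polynomial contrasts to ordered factors by default
df$educ <- factor(df$educ, levels = c("low", "medium", "high"), ordered = TRUE)
```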

10 / 23
Correct classification of your variables

11 / 23
Correct classification of your variables

nominal vs. numeric


The data had been collected across numerous countries, e.g. United
States, Canada, Turkey, etc. and the country information had been
coded as “1, 2, 3...” Although Decety’s paper had reported that they
had controlled for country, they had accidentally not controlled for each
country, but just treated it as a single continuous variable so that, for
example “Canada” (coded as 2) was twice the “United States” (coded
as 1). Regardless of what one might think about the relative merits and
rankings of countries, this is obviously not the right way to analyze data.

Retraction Watch 2019
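
The mistake described in the quote can be illustrated with simulated data. This is a sketch with made-up numbers, not a reconstruction of the original analysis.

```r
set.seed(1)

# Hypothetical data: three countries coded 1, 2, 3 with unrelated outcome means
df <- data.frame(country = rep(1:3, each = 20))
df$y <- c(5, 1, 3)[df$country] + rnorm(60)

# Wrong: country enters as one continuous variable (a single slope)
m_wrong <- lm(y ~ country, data = df)

# Right: country enters as a set of dummies (one coefficient per non-reference country)
m_right <- lm(y ~ factor(country), data = df)

length(coef(m_wrong))  # intercept + 1 slope
length(coef(m_right))  # intercept + 2 country dummies
```

The first model forces the difference between country 1 and 2 to equal the difference between country 2 and 3, which has no substantive meaning for arbitrary country codes.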

12 / 23
Consistency in Conceptualization and Measurement

Check if correlations are plausible
use a correlation matrix of your key variables
if something is odd, it may be the result of miscoding
For composite index variables
Do single items measure the same thing?
Check direction of coding
Check and report Cronbach's alpha
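
A sketch of these checks in base R with simulated items. The item names and the helper cronbach_alpha() are hypothetical. The helper uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score). Dedicated packages offer more complete implementations.

```r
set.seed(42)
n <- 200
latent <- rnorm(n)

# Three hypothetical items measuring the same concept; item3 is reverse-coded
items <- data.frame(
  item1 = latent + rnorm(n),
  item2 = latent + rnorm(n),
  item3 = -latent + rnorm(n)
)

# Negative correlations with the other items flag the reversed direction
round(cor(items), 2)

# Reverse item3 (items are centred here; for a 1-5 scale one would use 6 - x)
items$item3 <- -items$item3

# Hypothetical helper: Cronbach's alpha from item variances and sum-score variance
cronbach_alpha <- function(x) {
  x <- na.omit(x)
  k <- ncol(x)
  (k / (k - 1)) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}

alpha <- cronbach_alpha(items)
```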

13 / 23
Consistency with common sense

Interpret the size / magnitude of your effects
Is the result logically possible and plausible?
A 10,000 EUR increase in income with every year of age sounds unlikely to be true.
Do external reality checks
Compare to results of previous studies
Why is there a difference? Provide (potential) explanations.

14 / 23
External reality checks - a voting example

15 / 23
External reality checks

First estimate: 10,000 lost votes for Bush within the last ten minutes!
Is this plausible?
There are overall 300,000 panhandle voters.
1/12 of votes usually happen in the last hour. Then, approx. 1/6 × 1/12 = 1/72 usually vote within the last 10 minutes.
1/72 × 300,000 ≈ 4,200 overall votes within the relevant period.
An estimate of 10,000 lost votes exceeds the total number of votes expected in that period, so it is not plausible.
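
The back-of-the-envelope calculation above, written out in R. All numbers are taken from the slide. Only the variable names are invented.

```r
panhandle_voters   <- 300000
share_last_hour    <- 1 / 12                     # share of votes cast in the last hour
share_last_10_min  <- share_last_hour * (1 / 6)  # ten minutes are 1/6 of an hour

# Expected number of votes in the last ten minutes (about 4,200)
expected_votes <- panhandle_voters * share_last_10_min

# The first estimate claims more lost votes than votes expected in total
claimed_lost_votes <- 10000
claimed_lost_votes > expected_votes
```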

Brady, H. E. (2010). Data-set versus causal-process observations: The 2000 U.S. Presidential election. In H. E. Brady & D. Collier (Eds.), Rethinking social inquiry: Diverse tools, shared standards (2nd ed., pp. 267–271). Rowman & Littlefield.

16 / 23
Missing values

17 / 23
How to handle missing values

Most common approach
Listwise Deletion
Every observation with a missing value on any variable in the model is omitted
Can sometimes lead to the omission of many cases
Other approaches
Mean imputation
Pairwise deletion
Multiple imputation
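
A minimal sketch of listwise deletion in base R with hypothetical data. Note that lm() drops incomplete observations by default anyway (na.action = na.omit), so doing it explicitly mainly helps to see how many cases are lost and why.

```r
# Hypothetical toy data with missing values on different variables
df <- data.frame(
  income = c(2000, NA, 3100, 2700, 1500),
  age    = c(34, 51, NA, 45, 29),
  female = c(1, 0, 1, NA, 0)
)

# How many missing values per variable?
colSums(is.na(df))

# Listwise deletion: keep only rows that are complete on all variables
df_complete <- df[complete.cases(df), ]

c(original = nrow(df), complete = nrow(df_complete))
```

Each variable has only one missing value here, yet three of five observations are lost, because the missing values fall on different rows.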

18 / 23
Suggestion on missing values

My suggestion:
Use Listwise Deletion
Most common method
Unbiased under standard assumptions (e.g. values missing completely at random)
Keep a cautious eye on your number of observations (N)
If you compare across multiple models, your N should be constant
Discuss potential limitations if you have strong reasons to believe that missing values are systematic
If you’re losing a large proportion of your observations, check where this is coming from!
Could this be a coding mistake?
Do you really need the variables that have a large number of missing values? Trade-off between losing observations or losing a control variable
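
One way to keep N constant across models is to define the analysis sample once, using all variables of the largest model, and to estimate every model on that sample. The following is a sketch with simulated data. All variable names are hypothetical.

```r
set.seed(7)
n <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 0.5 * df$x1 + rnorm(n)
df$x2[1:15] <- NA  # the control variable x2 has 15 missing values

# Naive approach: N differs between the models
m1_naive <- lm(y ~ x1, data = df)
m2_naive <- lm(y ~ x1 + x2, data = df)
c(nobs(m1_naive), nobs(m2_naive))

# Constant N: restrict to cases complete on all variables of the largest model
vars <- c("y", "x1", "x2")
sample_df <- df[complete.cases(df[, vars]), ]

m1 <- lm(y ~ x1, data = sample_df)
m2 <- lm(y ~ x1 + x2, data = sample_df)
c(nobs(m1), nobs(m2))
```

With a constant sample, differences between the models cannot be driven by a change in the set of observations.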

19 / 23
Always check your N!

20 / 23
Always check your N!

Using the R package texreg (Leifeld, 2013):

library(texreg)
# Export the three fitted models to one Word table, including the number of observations
wordreg(l = list(reg1, reg2, reg3),
        file = "Regression_output1.doc",
        custom.model.names = c("Mod 1", "Mod 2", "Mod 3"),
        dcolumn = TRUE, digits = 3, include.nobs = TRUE)

texreg() and htmlreg() export to formats other than Word. The functions are very flexible and can be easily customized. Via the option custom.coef.map = list("flood af" = "Flood affected") you can choose and rename the variables to display in the table.

21 / 23
The proposition that low-dose alcohol use protects against all-cause mortality in general populations continues to be controversial. Observational studies tend to show that people classified as “moderate drinkers” have longer life expectancy and are less likely to die from heart disease than those classified as abstainers.
Why is this simply wrong?
See Zhao et al. 2023

22 / 23
Example code

Here is an example (03 Data-handling.R)


Declare missings
Recode variables
Index Construction
Exporting summary statistics
Exporting Correlation matrix
Export Coefficient tables
Compared to the original data (44,100 obs), we have lost 5,357
observations (12%) in our final model. What is the main reason? Is this
a problem?

23 / 23
