SOCS0075 Lecture5

The document outlines essential practices for handling data in a BSc SSQM dissertation, emphasizing reproducibility, proper project setup, and the importance of addressing missing values. It discusses techniques for reality checks, ensuring consistency in conceptualization and measurement, and the classification of variables. Recommendations for handling missing data include listwise deletion and careful consideration of the implications of missing observations.

BSc SSQM Dissertation (SOCS0075)

Data handling, missing values, and reality checks

Tobias Rüttenauer
Social Research Institute
UCL Institute of Education

January 8, 2024

1 / 23
Happy new year!
Any questions? Anything to discuss?

2 / 23
Topics today

Today:
Set-up
Reality checks
Missing values
Results

3 / 23
Reproducibility

Main aim: Your dissertation should be perfectly reproducible!

Set up a reproducible project
Keep track of things (comments in your code!)
data cleaning,
recoding,
statistical method / analysis
Report your research transparently
DO NOT (never!) copy-paste single numbers / columns from a table in R or Stata into Excel or Word.

4 / 23
Project set-up

Keep things in order, using a folder system such as

01 Script, 02 Data, 03 Output, 04 Document.
Comment your code - future you will be grateful.
Use automated outputs for tables and figures!
Use bibliography management software:
Zotero (+ Better BibTeX), EndNote.
Here is an example.
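
The folder system above could be created directly from R. This is only a minimal sketch: the underscores in the folder names and the project root (a temporary directory, so nothing on your machine is overwritten) are assumptions for illustration.

```r
# Hypothetical project root; replace with your own dissertation folder
project_root <- file.path(tempdir(), "dissertation")

# Folder names follow the scheme above (the underscores are an assumption)
folders <- c("01_Script", "02_Data", "03_Output", "04_Document")

# Create each folder; recursive = TRUE also creates the project root
for (f in folders) {
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

# Check the result
list.files(project_root)
```

Scripts can then refer to data and output with relative paths inside the project, which keeps the analysis portable.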

5 / 23
Reality checks

6 / 23
Dubious Values and Incomplete Data

Start by looking at descriptive statistics

means, variances, and ranges
for each of the variables in our analysis
Examine scatterplots for outliers and nonlinearities
How are those missing values coded?
Make sure you declare them as missing!!!
It would be a shame if your marital status variable looked like
0 – Single
1 – Married
-999 – ???
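
A minimal sketch of declaring such a code as missing in base R. The toy vector and the meaning of -999 ("no answer") are assumptions for illustration; check the codebook of your own data for the actual missing codes.

```r
# Toy data: -999 is the (assumed) code for a missing answer
marital <- c(0, 1, 1, -999, 0, -999, 1)

# Before recoding, -999 shows up as if it were a third category
table(marital)

# Declare the missing code as NA
marital[marital == -999] <- NA

# Label the remaining categories
marital_f <- factor(marital, levels = c(0, 1), labels = c("Single", "Married"))

# Missing values are now reported separately
table(marital_f, useNA = "ifany")
```

Without this step, a mean or a regression would silently treat -999 as a real value.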

7 / 23
Dubious Values and Incomplete Data

Using the R package stargazer (Hlavac, 2022):

library(stargazer)
stargazer(data, style = "asr", digits = 3, type = "html",
          summary.stat = c("n", "mean", "sd", "min", "max"))

The HTML format is readable with Word.

8 / 23
Consistency in Conceptualization and Measurement

Make sure that your variables align with your theory

e.g. wealth ≠ labour income
Keep the direction straight
poverty vs income
equality vs inequality (e.g. Gini)
reversing the sign complicates reading
Communicate clearly and simply
Be concise in your coding and with your labels
sex (0, 1) vs. female (0, 1)
income vs. net monthly income (in EUR)

9 / 23
Correct classification of your variables

The classification of your variables determines what you can do with them.
Continuous / interval-ratio / numeric variables
e.g. income, age, weight, years of work experience
Ordinal variables: there is a natural ordering of the categories
Nominal variables: there is no natural ordering of the categories
e.g. education level (ordinal), gender, vote choice (nominal)
make sure to declare that they are categorical (as.factor())
consider the choice of the omitted reference category, e.g. the largest category (relevel())
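
As an illustration of as.factor() and relevel(), here is a short sketch with made-up vote choices (the party names and data are assumptions, not from the lecture data):

```r
# Toy data: a nominal variable stored as text
vote <- c("Labour", "Conservative", "Labour", "Green", "Labour", "Conservative")

# Declare it as categorical
vote_f <- as.factor(vote)
levels(vote_f)   # levels are sorted alphabetically; the first is the reference

# Make the largest category the reference category
largest <- names(which.max(table(vote_f)))
vote_f <- relevel(vote_f, ref = largest)
levels(vote_f)   # the largest category now comes first
```

In a regression, the first level is the omitted category, so all coefficients of vote_f are then interpreted relative to the largest group.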

10 / 23
Correct classification of your variables

11 / 23
Correct classification of your variables

nominal vs. numeric

The data had been collected across numerous countries, e.g. United
States, Canada, Turkey, etc. and the country information had been
coded as “1, 2, 3…” Although Decety’s paper had reported that they
had controlled for country, they had accidentally not controlled for each
country, but just treated it as a single continuous variable so that, for
example “Canada” (coded as 2) was twice the “United States” (coded
as 1). Regardless of what one might think about the relative merits and
rankings of countries, this is obviously not the right way to analyze data.

Retraction Watch 2019
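
The mistake described in the quote is easy to reproduce. The sketch below uses simulated data (country codes, outcome values, and variable names are all assumptions) to show how the two specifications differ:

```r
set.seed(1)

# Simulated toy data: three countries coded 1, 2, 3 (the codes are arbitrary)
d <- data.frame(country = rep(1:3, each = 20))
d$y <- c(5, 1, 3)[d$country] + rnorm(60)

# Wrong: country enters as a single continuous variable (one slope)
m_numeric <- lm(y ~ country, data = d)

# Right: country enters as a set of dummy variables
m_factor <- lm(y ~ factor(country), data = d)

length(coef(m_numeric))  # intercept + one slope
length(coef(m_factor))   # intercept + one dummy per non-reference country
```

The numeric version forces the difference between country 1 and 2 to equal the difference between country 2 and 3, which is meaningless for arbitrary codes.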

12 / 23
Consistency in Conceptualization and Measurement

Check if correlations are plausible

use a correlation matrix of your key variables
if something looks odd, it may be the result of miscoding
For composite index variables
Do single items measure the same thing?
Check direction of coding
Check and report Cronbach's alpha
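
A sketch of these checks in base R, using simulated items (the data, the item names, and the helper function cronbach_alpha() are assumptions for illustration). The helper implements the standard formula alpha = k / (k - 1) * (1 - sum of item variances / variance of the sum score).

```r
set.seed(42)
n <- 200
latent <- rnorm(n)

# Three simulated items; item3 is reverse-coded on purpose
items <- data.frame(
  item1 = latent + rnorm(n),
  item2 = latent + rnorm(n),
  item3 = -latent + rnorm(n)
)

# Correlation matrix: the reverse-coded item correlates negatively
round(cor(items), 2)

# Hypothetical helper: Cronbach's alpha from the standard formula
cronbach_alpha <- function(x) {
  k <- ncol(x)
  item_var <- sum(apply(x, 2, var))   # sum of the item variances
  total_var <- var(rowSums(x))        # variance of the sum score
  k / (k - 1) * (1 - item_var / total_var)
}

alpha_raw <- cronbach_alpha(items)    # before fixing the direction

# Fix the direction of coding, then recompute
items$item3 <- -items$item3
alpha_fixed <- cronbach_alpha(items)

c(raw = alpha_raw, fixed = alpha_fixed)
```

An unexpectedly low (or negative) alpha is often a sign that one item still points in the wrong direction.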

13 / 23
Consistency with common sense

Interpret the size / magnitude of your effects

Is the result logically possible and plausible?
a 10,000 EUR increase in income with every year of age sounds
unlikely to be true.
Do external reality checks
Compare to results of previous studies
Why is there a difference? Provide (potential) explanations.

14 / 23
External reality checks - a voting example

15 / 23
External reality checks

First estimate: 10,000 lost votes for Bush within the last ten minutes!
Is this plausible?
There are 300,000 panhandle voters overall.
1/12 of votes are usually cast in the last hour. Then approx. 1/6 × 1/12 = 1/72 of voters usually vote within the last 10 minutes.
1/72 × 300,000 ≈ 4,200 votes overall within the relevant period.
An estimate of 10,000 lost votes therefore exceeds the total number of votes expected in that period.

Brady, H. E. (2010). Data-set versus causal-process observations: The 2000 U.S. Presidential election. In H. E. Brady & D. Collier (Eds.), Rethinking social inquiry: Diverse tools, shared standards (2nd ed., pp. 267–271). Rowman & Littlefield.
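
The back-of-the-envelope calculation from the voting example can be written out in a few lines of R. The numbers are those given on the slide; the variable names are only illustrative.

```r
panhandle_voters <- 300000   # total voters in the panhandle
share_last_hour  <- 1 / 12   # share of votes usually cast in the last hour
share_of_hour    <- 1 / 6    # ten minutes as a share of one hour

# Expected number of votes cast in the last ten minutes
votes_last_10_min <- panhandle_voters * share_last_hour * share_of_hour
votes_last_10_min            # roughly 4,200, as on the slide

# Compare with the first estimate of lost votes
claimed_lost_votes <- 10000
claimed_lost_votes > votes_last_10_min
```

The comparison makes the reality check explicit: an effect cannot be larger than the total number of cases it could possibly apply to.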

16 / 23
Missing values

17 / 23
How to handle missing values

Most common approach

Listwise Deletion
Every observation with a missing value on any variable in the analysis is omitted
This can sometimes lead to the omission of many cases
Other approaches
Mean imputation
Pairwise deletion
Multiple imputation
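
To make the difference between these approaches concrete, here is a small base R sketch with made-up data (all values and variable names are assumptions). It shows listwise deletion and, for comparison only, mean imputation.

```r
# Toy data with missing values on income and age
d <- data.frame(
  income = c(2000, NA, 3500, 2800, NA, 4100, 3000, 2600),
  age    = c(25, 31, NA, 45, 52, 38, 29, 60),
  female = c(1, 0, 0, 1, 1, 0, 1, 0)
)

# Listwise deletion: keep only rows that are complete on all variables
d_complete <- d[complete.cases(d), ]
c(original = nrow(d), complete = nrow(d_complete))

# lm() applies listwise deletion under R's default na.action (na.omit)
m <- lm(income ~ age + female, data = d)
nobs(m)

# Mean imputation (for comparison): keeps N but shrinks the variance
income_imp <- ifelse(is.na(d$income), mean(d$income, na.rm = TRUE), d$income)
var(income_imp) < var(d$income, na.rm = TRUE)
```

The last line illustrates one reason mean imputation is usually discouraged: it makes the variable look less variable than it is.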

18 / 23
Suggestion on missing values

My suggestion:
Use Listwise Deletion
Most common method
Unbiased under standard assumptions (e.g. when values are missing completely at random)
Keep a cautious eye on your number of observations (N)
If you compare across multiple models, your N should be constant
Discuss potential limitations if you have strong reasons to believe that missing values are systematic
If you’re losing a large proportion of your observations, check where this is coming from!
Could this be a coding mistake?
Do you really need the variables that have a large number of missing values? Trade-off between losing observations and losing a control variable
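
One way to follow these suggestions in practice is sketched below with simulated data (the variable names and the missingness pattern are assumptions). It first shows where observations are lost and then estimates all models on the same complete-case sample so that N stays constant.

```r
set.seed(7)
n <- 100
d <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
d$x2[1:30] <- NA   # x2 has many missing values

# Where are the observations lost?
colSums(is.na(d))

# Naive approach: each model uses its own sample, so N differs
m1_naive <- lm(y ~ x1, data = d)
m2_naive <- lm(y ~ x1 + x2, data = d)
c(nobs(m1_naive), nobs(m2_naive))

# Constant sample: restrict to cases complete on all variables of the largest model
vars <- c("y", "x1", "x2")
d_cc <- d[complete.cases(d[, vars]), ]

m1 <- lm(y ~ x1, data = d_cc)
m2 <- lm(y ~ x1 + x2, data = d_cc)
c(nobs(m1), nobs(m2))
```

With a constant sample, changes in coefficients across models cannot be driven by a changing set of observations.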

19 / 23
Always check your N!

20 / 23
Always check your N!

Using the R package texreg (Leifeld, 2013):

library(texreg)
wordreg(l = list(reg1, reg2, reg3),
        file = "Regression_output1.doc",
        custom.model.names = c("Mod 1", "Mod 2", "Mod 3"),
        dcolumn = TRUE, digits = 3, include.nobs = TRUE)

texreg() and htmlreg() export to formats other than Word. The functions are very flexible and can be easily customized. Via the option custom.coef.map = list("flood af" = "Flood affected") you can choose and rename the variables to display in the table.

21 / 23
The proposition that low-dose alcohol use protects against all-cause
mortality in general populations continues to be controversial.
Observational studies tend to show that people classified as “moderate
drinkers” have longer life expectancy and are less likely to die from heart
disease than those classified as abstainers.
Why is this simply wrong?
See Zhao et al. 2023

22 / 23
Example code

Here is an example (03 Data-handling.R)

Declare missings
Recode variables
Index Construction
Exporting summary statistics
Exporting Correlation matrix
Export Coefficient tables
Compared to the original data (44,100 obs), we have lost 5,357
observations (12%) in our final model. What is the main reason? Is this
a problem?

23 / 23
