1 - Statistical Programming 101

The document introduces 7 general principles for reproducible research, including treating code as an output, knowing your data structure, tracking changes with version control, writing code that is easy for others to understand, thinking critically about assumptions and potential errors, asking for help from others, and continuously improving programming skills. It provides examples of each principle and recommends best practices like using style guides, code linters, and help files to implement reproducible workflows.

Uploaded by Rafael Monteiro
Basic principles of statistical programming

Reproducible Research Fundamentals
September 2023
Development Impact (DIME)
The World Bank

• During the training, find all materials in our shared Dropbox: https://bit.ly/rrf-materials
• Permanent link to final materials: https://osf.io/pszy3/
Overview
Session overview

Before we dig deep into specific aspects of research reproducibility, this session
will introduce you to 7 general principles that you should always keep in mind:

1. Your code is an output
2. Know your data
3. Track your changes
4. Write code for others to read
5. Think critically
6. Ask for help
7. Keep improving your skills

Principle 1: Your code is an output
Your code is an output

Do not treat code as a means to an end.

In fact, your code is just as much an end in itself as the paper or the report
you are writing!

This is absolutely fundamental to making research transparent, reproducible
and credible.


During this course, we will focus on using Stata and R. Why are we not using
Excel?

The main reason why we code

• In Excel you make changes directly to the data and save new versions of
the dataset.

• In Stata and R you make changes to the instructions on how to get from the
original data to the final analysis and save new versions of the instructions.
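
The difference can be sketched in a few lines. The course itself uses Stata and R; this is a Python illustration only, and the data, the column names, and the rule that -99 codes a missing answer are all made up for the example. The raw data is never edited: the instructions are what you change and save.

```python
import csv
import io

# Hypothetical raw data; in a real project this would be a read-only file.
RAW_CSV = """household_id,income
1,1200
2,-99
3,800
"""


def clean(raw_text):
    """Return cleaned rows without modifying the raw input."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    cleaned = []
    for row in rows:
        income = int(row["income"])
        # Assumed survey convention: -99 codes a missing answer.
        cleaned.append({
            "household_id": int(row["household_id"]),
            "income": None if income == -99 else income,
        })
    return cleaned


cleaned_rows = clean(RAW_CSV)
print(cleaned_rows)
```

If the cleaning rule turns out to be wrong, you edit `clean` and rerun it; the original data is still there to start from.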

Create recipes and not meals

We are rarely trained in writing recipes

• What we are trained to do in school:
• Cook a delicious meal!
• Econometrics/statistics assignments tend to grade only the final results, not
necessarily how we got to them

• What is expected from us in the workplace:
• Delicious meals (correct results) are just as important in the workplace
• But as a research assistant, your task is to write the recipe (code scripts, folder
structures, data documentation, etc.) that creates those delicious meals

Principle 2: Know your data
Know your data

• To write a good recipe you need to know your ingredients very well
• The ingredients for a data work recipe are contained in the datasets
• Let’s discuss a framework to understand and communicate how your data is
structured

Exploring a new dataset

What is the first thing you want to look for whenever you open a new dataset
for the first time?

1. Unit of observation
2. Uniquely and fully identifying ID variable
ID variables

• ID variables are crucial to understanding and handling data
• Make sure all your datasets have an ID variable
• If the dataset that you have received does not have one, then creating it is
your first task
• The session on data cleaning will discuss in more detail the desired
properties of an ID variable
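
As a rough illustration of what "uniquely and fully identifying" means in practice, here is a minimal Python sketch (the course uses Stata and R; the function name `check_id`, the variable `hh_id`, and the data are hypothetical). It reports duplicated IDs, which break uniqueness, and missing IDs, which break full identification.

```python
from collections import Counter


def check_id(rows, id_var):
    """Report whether id_var uniquely and fully identifies the rows."""
    values = [row.get(id_var) for row in rows]
    # An ID is "fully identifying" only if no observation lacks it.
    missing = sum(1 for v in values if v is None or v == "")
    counts = Counter(v for v in values if v is not None and v != "")
    # An ID is "uniquely identifying" only if no value appears twice.
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return {
        "fully_identifying": missing == 0,
        "uniquely_identifying": not duplicates,
        "duplicates": duplicates,
        "missing": missing,
    }


# Hypothetical household data with one duplicated and one missing ID.
households = [
    {"hh_id": "A1", "size": 4},
    {"hh_id": "A2", "size": 2},
    {"hh_id": "A2", "size": 3},
    {"hh_id": "", "size": 5},
]
report = check_id(households, "hh_id")
print(report)
```

Running a check like this right after loading a dataset tells you whether the ID variable can be trusted before any merging or reshaping.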

Understand project data

• It is easy to remember information about one or two datasets while you are
working with them
• However, in your role as a research assistant, you will need to keep track of
multiple datasets, explain to other team members how they are organized,
and hand them to other researchers
• To communicate our understanding of datasets, we use data maps. We will
learn about this tool in the next session

Principle 3: Track your changes
Track your changes

• Your code will constantly change, but if you use a version control tool like
Git/GitHub, you can access all previous versions of your code.

• If your original data is backed up, as well as all versions of your code, then
all versions of your outputs are also backed up.

• Being able to reproduce all past outputs is central to credibility and
transparency.

How can you track changes?

• Using file naming conventions (such as adding dates and initials as
suffixes) is better than no version control, but it can get unwieldy very
quickly
• Syncing software (such as OneDrive and Dropbox) allows teams to revert to
old versions of a document, but not to track specific changes
• Git is currently the best version control system out there, as one can track
changes and revert to old versions easily

Recommended practices for version control

• DIME projects are required to use git for version control of code
• Anything can be version-controlled through git, but it is only suitable for
code and outputs in plain text formats such as .csv, .do, .R, .tex
• The World Bank does not allow us to store data on GitHub, but you can track
changes to it by saving metadata such as codebooks in plain text format
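
One way to picture a plain-text codebook is the Python sketch below (the course uses Stata and R; the function names, the variables, and the chosen summary statistics are assumptions for illustration, and what metadata may be shared depends on your project's rules). It summarizes each variable and writes the summary as CSV text, which a version control tool can compare line by line.

```python
import csv
import io


def build_codebook(rows):
    """Summarize each variable: non-missing count and number of distinct values."""
    codebook = []
    for var in rows[0].keys():
        values = [row[var] for row in rows if row[var] not in (None, "")]
        codebook.append({
            "variable": var,
            "non_missing": len(values),
            "distinct": len(set(values)),
        })
    return codebook


def codebook_to_csv(codebook):
    """Render the codebook as CSV text, a format that can be diffed line by line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["variable", "non_missing", "distinct"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(codebook)
    return buffer.getvalue()


# Hypothetical dataset; only the summary below would be version-controlled.
data = [
    {"hh_id": "A1", "district": "North", "income": "1200"},
    {"hh_id": "A2", "district": "North", "income": ""},
    {"hh_id": "A3", "district": "South", "income": "800"},
]
codebook_text = codebook_to_csv(build_codebook(data))
print(codebook_text)
```

If a later version of the data gains a variable or loses observations, the codebook changes and that change shows up in the version history, even though the data itself is stored elsewhere.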

Principle 4: Write code that others can read
How to write good recipes

A recipe only has any value if someone else can follow it

How do you write code that is useful to others?

Is this slide easy to read?

White Space. Stata does not distinguish between one empty space and many empty spaces,
or one line break or many line breaks. It makes a big difference to the human eye and we
would never share a Word document, an Excel sheet or a PowerPoint presentation without
thinking about white space - although we call it formatting.

White Space

• Stata does not distinguish between one empty space and many empty
spaces, or one line break or many line breaks

• It makes a big difference to the human eye and we would never share a Word
document, an Excel sheet or a PowerPoint presentation without thinking
about white space – although we call it formatting

Vertical spacing

Horizontal spacing
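
The effect of spacing can be shown with two versions of the same function. This is a Python sketch, not the example from the original slides (the function names are made up). Note one difference from Stata: Python treats indentation as significant, but blank lines and spaces around operators are still ignored by the interpreter, so both versions below behave identically.

```python
# Cramped version: valid, but hard to scan.
def share_cramped(values,total):
    if total==0:return None
    result=[v/total for v in values];return result


# Spaced version: blank lines separate steps (vertical spacing),
# and spaces separate operators and arguments (horizontal spacing).
def share_spaced(values, total):
    if total == 0:
        return None

    result = [v / total for v in values]

    return result


amounts = [10, 30, 60]
print(share_cramped(amounts, 100) == share_spaced(amounts, 100))
```

The computer sees no difference between the two; only the human reader does.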
Style Guides

Style guides are common in most programming languages. Following a style guide
will make your code much more readable, and it will reduce the risk of errors.

• Stata: See appendix A in DIME Analytics' Data Handbook -
https://worldbank.github.io/dime-data-handbook/coding.html
• R: https://style.tidyverse.org

Code linters

Linters are tools that flag style errors and possible bugs in software.

• Stata: Install the Stata linter (proudly developed by DIME Analytics!) from
SSC with: ssc install stata_linter. More information is available here.
• R: Use the package lintr, available on CRAN. More information in this link.

Don’t repeat yourself
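
A minimal Python sketch of the idea (the course uses Stata and R; the variable names and the assumption that -99 codes a missing answer are hypothetical). In the repetitive version the same rule is typed once per dataset, so any later change must be made in several places. In the second version the rule lives in one function.

```python
# Repetitive version: the same rule is typed twice,
# so a later change must be made in two places.
income_2019 = [1200, -99, 800]
income_2020 = [1300, 900, -99]
income_2019_clean = [None if x == -99 else x for x in income_2019]
income_2020_clean = [None if x == -99 else x for x in income_2020]

# Non-repetitive version: the rule is written once and applied in a loop.
MISSING_CODE = -99  # assumed survey code for a missing answer


def recode_missing(values, missing_code=MISSING_CODE):
    """Replace the missing-value code with None."""
    return [None if x == missing_code else x for x in values]


waves = {"2019": income_2019, "2020": income_2020}
clean_waves = {year: recode_missing(values) for year, values in waves.items()}
print(clean_waves)
```

Adding a third survey wave to the second version means adding one entry to `waves`, not copying another line of logic.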

Principle 5: Think critically about the data work
Critical thinking about data work

Do I believe this number?

Critical thinking about data work

• What does my data look like?
• What can go wrong in my code?
• How will missing values be treated in this command?
• What would happen if more observations were added to the dataset?
• What would happen if some observations were removed from the dataset?
• We will cover this in the lecture Best practices for reproducible outputs
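
The question about missing values is worth a concrete look. The Python sketch below (hypothetical function names and data; the course itself uses Stata and R) computes a mean in two ways that a command might plausibly handle missing values, and returns the number of observations used alongside the result.

```python
def mean_skip_missing(values):
    """Average over observed values only; also return how many were used."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present), len(present)


def mean_missing_as_zero(values):
    """Average after treating missing values as zero; all rows are used."""
    filled = [0 if v is None else v for v in values]
    return sum(filled) / len(filled), len(filled)


# Hypothetical incomes with two missing answers.
incomes = [1000, None, 2000, None]
print(mean_skip_missing(incomes))
print(mean_missing_as_zero(incomes))
```

The two results differ by a factor of two on the same data, which is why "Do I believe this number?" starts with checking how many observations went into it.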

Principle 6: Ask for help
Help file usage and coding knowledge

The Dunning-Kruger effect

Help files

• In Stata, type: help command_name
• In R, type: ?command_name
• Get in the habit of using the help file as often as possible!
• Even with familiar commands, there is always more to learn
• Help files are not the only place to learn
• Follow blogs and Twitter accounts that discuss best practices
• Follow the tag for your programming language on
https://stackoverflow.com/
• In Stata, there is a reference manual that you access by clicking [R]
command_name in the help file, where the developers at StataCorp discuss coding
practices, common mistakes, alternative approaches, etc.

Asking for help

The quality of the help you will get depends on how well you asked your question

This is always the case, no matter who you ask: DIME Analytics, Stack Overflow, a
friend from grad school etc.

How to ask for help

• You will never get a good answer if you only say “my code is not working”
• In good code question etiquette, include at least:
• Error message or description of unexpected behavior
• Software language, and point to the part of your code that breaks
• Describe what you have tested so far and what you have learned

• The more of the following you include, the better the answer:
• Your version of the software and your operating system (Mac/Windows)
• Show that it is indeed that part of the code that causes the error, and not
just the place where the code crashes
• Provide a minimal reproducible example

Much more detail and advice on this topic at https://git.io/JtQTb and http://tinyurl.com/stack-hints
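
To make "minimal reproducible example" concrete, here is a Python sketch (the function names and the bug are made up; the same structure applies in Stata or R). It contains the smallest input that triggers the problem, needs no external files, and reports the software version and operating system together with the full error message.

```python
import platform
import sys
import traceback


def environment_info():
    """Software version and operating system, as a helper would need them."""
    return f"Python {sys.version.split()[0]} on {platform.system()}"


def average(values):
    """The function under suspicion, reduced to its essentials."""
    return sum(values) / len(values)


def reproduce():
    """Trigger the problem with the smallest possible input and capture the error."""
    try:
        average([])  # an empty list is enough to cause the failure
    except ZeroDivisionError:
        return traceback.format_exc()
    return "no error"


error_report = reproduce()
print(environment_info())
print(error_report)
```

Anyone can paste this into their own session and see the same error, which is what makes it answerable.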

Principle 7: Keep improving your skills
When your code works you are only half done.

- Ancient proverb

Re-write your own code

• Can this code be made simpler?

• Can I generalize this code so I can use it in other projects?

• Read your own code as a recipe. Would you be able to follow the instructions
if you were a new person joining the team?

Read other people's code

• Look for code on GitHub
• https://github.com/vikjam/mostly-harmless-replication - all examples in
the book coded in Stata, R, Python and Julia

• Read our book https://worldbank.github.io/dime-data-handbook/

• Google code, but before using it, ask yourself critical questions about the
code you found
• Why did this person code this way?
• Does this apply to my context?

Have someone else read your own code

• Swap code with someone and discuss differences in coding style. Think of
each other’s code as recipes: can you follow the instructions?

• Have you ever asked someone to help you proofread your Word document?
Ask people to proofread your code

• In DIME, we hold structured peer code review sessions every quarter

Wrapping up
Wrapping up

1. Your code is an output
2. Know your data
3. Track your changes
4. Write code that others can read
5. Think critically about data work
6. Ask for help
7. Keep improving your skills

We will see these principles in practice during the rest of this training.

Thank you! Gracias!
