Reproducible Research

Reproducible Research
• Reproducible research is the practice of publishing data analyses and scientific claims together with the data and software code behind them.
• It allows people to focus on the actual content of a data analysis.
• It makes an analysis more useful to others.
• Replicability: the act of repeating a scientific methodology to reach similar conclusions.
Why do reproducible research?
• A. Reproducible research benefits those who do it:
• You can repeat the same analysis multiple times with the same results.
• Researchers are the primary beneficiaries of this practice.
• It helps researchers remember how and why they performed specific analyses during the course of a project.
• It makes work easier to explain to collaborators, supervisors, and reviewers, and it allows collaborators to conduct supplementary analyses more quickly and more efficiently.
Why do reproducible research?
• enables researchers to quickly and simply
modify analyses and figures.
• When analyses are reproducible, creating a new figure may be as easy as changing one value in a line of code and re-running a script, as in the sketch below.
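For instance, a minimal sketch in Python (file, column, and variable names are hypothetical): the figure depends on a single parameter defined at the top of the script, so changing that one value and re-running regenerates the figure.

import pandas as pd
import matplotlib.pyplot as plt

CUTOFF_YEAR = 2015  # change this one value and re-run to redraw the figure

data = pd.read_csv("clean_data.csv")        # analysis-ready data (hypothetical file)
subset = data[data["year"] >= CUTOFF_YEAR]  # filter driven by the parameter

plt.scatter(subset["year"], subset["value"])
plt.xlabel("Year")
plt.ylabel("Value")
plt.savefig("figure1.png", dpi=300)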
Why do reproducible research?
• Quick reconfiguration of previously conducted research tasks makes new projects much simpler and easier.
• Researchers can re-use earlier materials (e.g., analysis code, file organization systems) to execute these common research tasks more efficiently in subsequent iterations.
Why do reproducible research?
• It is a strong indicator to fellow researchers of rigor, trustworthiness, and transparency in scientific research.
• It increases the quality and speed of peer review.
• It protects researchers from accusations of research misconduct due to analytical errors.
• It increases paper citation rates.
• For example, a second team of researchers can re-use code from a paper with similar methods and cite that code.
• A third team may conduct a meta-analysis on the phenomenon described in these two research papers and thus use and cite both papers and their data.
B. Reproducible research benefits the research community-
• Sharing data, code, and detailed research methods and results leads to faster progress in methodological development and innovation, because research is more accessible to more scientists.
• It allows others to learn from your work and gives them a head start on performing similar analyses.
• It saves time and effort in finding current “best practice” through trial and error.
• makes it easier to perform follow-up studies to
increase the strength of evidence for the
phenomenon of interest.
• increases the likelihood that similar studies are
compatible, and that a group of studies can
together provide evidence in support of or in
opposition to a concept.
• increases the utility of these studies for meta-
analyses that are important for generalizing and
contextualizing the findings of studies on a topic.
• enhance the likelihood that data can be used in
future meta-analyses
• allows others to protect themselves from your
mistakes.
• gives them a better chance to critically analyze
the work, which can lead to co-authors or
reviewers discovering mistakes during the
revision process, or other scientists discovering
mistakes after publication.
• This prevents mistakes from compounding over
time and provides protection for collaborators,
research institutions, funding organizations,
journals, and others who may be affected when
such mistakes happen.
Barriers to reproducible research
• Barriers can be simplified into four primary themes: (1) complexity, (2) technological change, (3) human error, and (4) concerns over intellectual property rights.
Barriers to reproducible research
• Complexity-
• Science is difficult, and scientific research requires
specialized (and often proprietary) knowledge and
tools that may not be available to everyone who
would like to reproduce research.
• Some analyses may require high-performance
computing clusters that use several different
programming languages and software packages, or
that are designed for specific hardware
configurations.
Barriers to reproducible research
• Analyses may be performed using expensive proprietary software programs, such as SAS statistical software.
• Lack of knowledge, lack of institutional infrastructure, and lack of funding all make research less reproducible.
Barriers to reproducible research
• Solutions:
• Researchers can cite primers on complex
subjects or analyses to reduce knowledge
barriers. They can also thoroughly annotate
analytical code with comments explaining
each step in an analysis or provide extensive
documentation on research software.
• Using open-source software lowers cost barriers.
Technological change
• When old tools become obsolete, research becomes less reproducible.
• This can be mitigated by using established tools in reproducible research.
• Careful documentation of the versions of all software used also helps.
• A more advanced tool: software containers.
Human error
• People forget small details of how they performed analyses.
• They fail to describe data collection protocols or analyses completely.
• They fail to collect or thoroughly document data.
• Science is performed by fallible humans, and a wide variety of common events can render research less reproducible.
• For example, carefully recording details such as when and
where data were collected, what decisions were made
during data collection, and what labelling conventions
were used can make a huge difference in making sure that
those data can later be used appropriately or re-purposed.
• Unintentional errors often occur during the data wrangling (converting raw data into a usable form) stage of a project. These can be mitigated by keeping multiple copies of data to prevent data loss, carefully documenting the process for converting raw data into clean data, and double-checking a small test set of data before manipulating the data set as a whole, as in the sketch below.
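A minimal sketch of this practice in Python (file and column names are hypothetical): the raw file is read but never overwritten, the cleaning steps are recorded in code, and a small test set is inspected before the full data set is trusted.

import pandas as pd

# Read the raw data; never overwrite this file, so it survives as a backup.
raw = pd.read_csv("data/raw/survey_raw.csv")

# Documented, repeatable cleaning steps.
clean = raw.dropna(subset=["site", "count"])           # drop incomplete records
clean["site"] = clean["site"].str.strip().str.lower()  # standardize site labels

# Double-check a small test set before manipulating the data set as a whole.
print(clean.head())

# Write the clean copy to a separate file, leaving the raw data untouched.
clean.to_csv("data/clean/survey_clean.csv", index=False)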
Intellectual property rights-
• Researchers often hesitate to share data and code because doing so may allow
other researchers to use data and code incorrectly or unethically.
• Use without notifying authors, leading to incorrect assumptions about the data
that result in invalid analyses.
• Use without citing the original data owners or code writers,
• Researchers may want to conceal data from others so that they can perform new
analyses on those data in the future without worrying about others scooping
them using the shared data.
• Rational self-interest can lead to hesitation to share data and code via many pathways, and we acknowledge that making data openly available is likely the most controversial aspect of reproducible research.
• However, new tools for sharing data and code (outlined below and in Table 1) are
making it easier for researchers to receive credit for doing so and to prevent
others from using their data during an embargo period (a temporary stop or ban
on disclosing information contained in a research paper.)
A three-step framework for conducting
reproducible research
• Research is often not reproducible.
• Some basic steps can make research more reproducible at three stages of a research project:
• (1) before data analysis,
• (2) during analysis, and
• (3) after analysis.
A three-step framework for conducting
reproducible research
• Before data analysis: data storage and organization-
• Reproducibility starts in the planning stage, with sound data
management practices.
• It is difficult to reproduce research when data are disorganized
or missing, or when it is impossible to determine where or
how data originated.
• First, data should be backed up at every stage of the research
process and stored in multiple locations.
• This includes raw data (e.g., physical data sheets or initial
spreadsheets), clean analysis-ready data (i.e., final data sets),
and steps in between. Because it is entirely possible that
researchers unintentionally alter or corrupt data while
cleaning it up, raw data should always be kept as a backup.
• Scan and save data sheets or laboratory notebook pages associated with a data set to ensure that these are kept paired with the digital data set.
• Different copies should be stored in different locations and on different storage media.
• Digital data files should be stored in useful, flexible,
portable, non-proprietary formats. Storing data digitally
in a “flat” file format is almost always a good idea.
• Flat file formats are those that store data as plain text
with one record per line and are the most portable
formats across platforms, as they can be opened by
anyone without proprietary software programs.
• For more complex data types, multi-dimensional
relational formats or other discipline-specific formats
may be appropriate.
• the complexity of these formats makes them difficult
for many researchers to access and use appropriately,
so it is best to stick with simpler file formats when
possible.
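As a small illustration (hypothetical file name and values), writing and reading a flat CSV file needs nothing beyond a language's standard tools:

import csv

# One record per line, stored as plain text: portable across platforms
# and readable without proprietary software.
rows = [["site", "year", "count"], ["a1", 2020, 14], ["a2", 2020, 9]]
with open("observations.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open("observations.csv", newline="") as f:
    for record in csv.reader(f):
        print(record)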
• It is often useful to transform data into a “tidy”
format (Wickham 2014) when cleaning up and
standardizing raw data.
• Tidy data are in long format (i.e., variables in columns,
observations in rows), have consistent data structure
(e.g., character data are not mixed with numeric data
for a single variable),
• and have informative and appropriately formatted
headers (e.g., reasonably short variable names that
do not include problematic characters like spaces,
commas, and parentheses).
• Data in this format are easy to manipulate, model,
and visualize during analysis.
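A minimal sketch, assuming the pandas library and a small hypothetical table: a wide table with one column per year is reshaped into tidy long format.

import pandas as pd

# Wide format: one column per year (hypothetical example data).
wide = pd.DataFrame({
    "site": ["a1", "a2"],
    "2019": [12, 7],
    "2020": [14, 9],
})

# Long/tidy format: variables in columns, observations in rows.
tidy = wide.melt(id_vars="site", var_name="year", value_name="count")
print(tidy)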
• Metadata explaining what was done to clean up the data and what each
of the variables means should be stored along with the data.
• Data are useless unless they can be interpreted (Roche et al. 2015), and
metadata is how we maximize data interpretability across potential users.
• At a minimum, all data sets should include informative metadata that
explains how and why data were collected, what variable names mean,
whether a variable consists of raw or transformed data, and how
observations are coded.
• Metadata should be placed in a sensible location that pairs it with the
data set it describes.
• A few rows of metadata above a table of observations within the same
file may work in some cases, or a paired text file can be included in the
same directory as the data if the metadata must be more detailed.
• In the latter case, it is best to stick with a simple text file for metadata to maximize portability.
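For example, a few comment-style metadata rows placed above the observations in the same file might look like this (all contents are hypothetical):

# Data: small-mammal trapping counts, May-Aug 2020; collected by J. Smith
# count = raw number of captures per trap-night (not transformed)
# site codes are defined in sites_metadata.txt
site,year,count
a1,2020,14
a2,2020,9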
• Finally, researchers should organize files in a sensible, user-
friendly structure and make sure that all files have informative
names.
• It should be easy to tell what is in a file or directory from its
name, and a consistent naming protocol (e.g., ending the
filename with the date created or version number) provides even
more information when searching through files in a directory.
• A consistent naming protocol for both directories and files also
makes coding simpler by placing data, analyses, and products in
logical locations with logical names.
• It is often more useful to organize files in small blocks of similar
files, rather than having one large directory full of hundreds of
files.
• For example, Noble (2009) suggests organizing computational projects within a main directory for each project, with sub-directories for the manuscript, data files, analyses, and analysis products within that directory.
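One possible layout in this spirit (the directory names are illustrative, not prescribed):

project/
  doc/       manuscript drafts
  data/      raw and clean data files
  scripts/   analysis code
  results/   figures, tables, and other analysis products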
• Throughout the research process, from data acquisition
to publication, version control can be used to record a
project’s history and provide a log of changes that have
occurred over the life of a project or research group.
• Version control systems record changes to a file or set
of files over time so that you can recall specific versions
later, compare differences between versions of files, and
even revert files back to previous states in the event of
mistakes.
• Version control also enables a specific snapshot of data or code to be easily shared, so that the code used for analyses at a specific point in time can be documented, even if that code is later updated.
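For example, with Git, one widely used version control system, a basic workflow looks like this (file names are hypothetical):

git init                                   # start tracking a project directory
git add survey_clean.csv analysis.py
git commit -m "Add cleaned data and first analysis script"
git diff HEAD~1 analysis.py                # compare with the previous version
git checkout <commit-hash> -- analysis.py  # restore a file from an earlier snapshot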
During analysis: best coding practices-
• When possible, all data wrangling and analysis should be performed using coding scripts, as opposed to interactive or point-and-click tools, so that every step is documented and repeatable by yourself and others.
• Code serves as a log of analytical activities.
• Code (unlike point-and-click programs) is inherently reproducible. Most errors are unintentional mistakes made during data wrangling or analysis, so having a record of these steps ensures that analyses can be checked for errors and are repeatable on future data sets.
• If operations are not possible to script, they should be well-documented in a log file that is kept in the appropriate directory.
• Analytical code should be thoroughly annotated with
comments. Comments embedded within code serve as
metadata for that code, substantially increasing its
usefulness.
• Comments should contain enough information for an informed stranger to easily understand what the code does, but not so much that sorting through comments is a chore. Asking a friend to read the code is a good way to test for this balance.
• In most scripting languages, the first few lines of a script
should include a description of what the script does and
who wrote it, followed by small blocks that import data,
packages, and external functions.
• Data cleaning and analytical code then follows those
sections, and sections are demarcated using a consistent
protocol and sufficient comments to explain what function
each section of code performs.
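A skeletal sketch of this structure in Python (file names, author line, and analysis steps are all hypothetical):

# analyze_survey.py -- summarizes 2020 survey counts by site.
# Written by: <author>, 2021. Input: data/clean/survey_clean.csv
# Output: results/site_means.csv

# --- Imports ----------------------------------------------------------
import pandas as pd

# --- Load data --------------------------------------------------------
data = pd.read_csv("data/clean/survey_clean.csv")

# --- Clean / prepare --------------------------------------------------
data = data[data["count"] >= 0]  # drop impossible negative counts

# --- Analysis ---------------------------------------------------------
summary = data.groupby("site")["count"].mean()
summary.to_csv("results/site_means.csv")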
• Following a clean, consistent coding style makes code easier to read.
• This includes using a consistent convention for naming objects and embedding meaningful information in object names.
• Code should also be written in relatively short lines and grouped into blocks.
• There are several ways to prevent coding
mistakes and make code easier to use.
• First, researchers should automate repetitive tasks. For example, if a set of analysis steps is used repeatedly, those steps can be saved as a function and loaded at the top of the script.
• Second, researchers can use loops, though nesting too many loops inside one another can quickly make code incomprehensible.
• A third way to reduce mistakes is to reduce the
number of hard-coded values that must be
changed to replicate analyses on an updated or
new data set.
• It is often best to read in the data file(s) and
assign parameter values at the beginning of a
script, so that those variables can then be used
throughout the rest of the script.
• When operating on new data, these variables
can then be changed once at the beginning of a
script rather than multiple times in locations
littered throughout the script.
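A minimal sketch combining all three practices (hypothetical files, column, and parameter): repeated steps live in one function, a simple loop automates them, and every value that might change sits at the top of the script.

import pandas as pd

# Input files and parameters are assigned once, at the top: a new data
# set requires changing only these lines.
INPUT_FILES = ["data/site_a.csv", "data/site_b.csv"]
THRESHOLD = 10

def frac_above(path, threshold):
    """Repeated analysis steps are saved as one function, not copied."""
    df = pd.read_csv(path)
    return (df["count"] >= threshold).mean()

# A simple, shallow loop automates the repetitive task.
for path in INPUT_FILES:
    print(path, frac_above(path, THRESHOLD))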
• Because incompatibility between operating systems or
program versions can inhibit the reproducibility of research,
the current gold standard for ensuring that analyses can be
used in the future is to create a software container.
• Containers are standalone, portable environments that
contain the entire computing environment used in an analysis:
• software, all of its dependencies, libraries, binaries, and
configuration files, all bundled into one package.
• Containers can then be archived or shared, allowing them to be used in the future, even as packages, functions, or libraries change or become obsolete.
• If creating a software container is infeasible or a larger step
than researchers are willing to take, it is important to
thoroughly report all software packages used, including version
numbers.
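At a minimum, a short snippet can record the versions used (a sketch assuming Python 3.8+ and that the named packages are installed):

import sys
from importlib.metadata import version

# Record the interpreter and package versions used in the analysis.
print("python", sys.version)
for pkg in ["pandas", "numpy", "matplotlib"]:
    print(pkg, version(pkg))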
After data analysis: finalizing results and sharing-
• All input data, scripts, program versions, parameters, and important intermediate results should be made publicly and easily accessible.
• It is better to produce tables and figures directly from code, so that they can be regenerated by simply re-running a script, creating a “dynamic” document. For example, documents written in LaTeX incorporate figures directly from a directory.
• It is possible to make data wrangling, analysis, and creation of figures, tables, and manuscripts a “one-button” process using GNU Make (https://www.gnu.org/software/make/).
• Make is a simple yet powerful tool that can be used to coordinate and automate command-line processes, such as a series of independent scripts, as in the sketch below.
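A minimal sketch of a Makefile (file names are hypothetical; recipe lines must be indented with tabs): each rule names an output, the inputs it depends on, and the command that rebuilds it, so running `make` re-runs only the steps whose inputs have changed.

results/figure1.png: clean_data.csv make_figure.py
	python make_figure.py

clean_data.csv: raw_data.csv clean_data.py
	python clean_data.py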
• Currently, code and data that can be used to replicate research are often found in the supplementary material of journal articles; some journals include them in the articles themselves.
• To increase access to publications, authors can post preprints of final (but pre-acceptance) versions of manuscripts on a preprint server, or postprints of manuscripts on postprint servers.
• To make research accessible to everyone, it is better to use tools like data and code repositories than personal websites.
• Data repositories are large databases that
collect, manage, and store data sets for
analysis, sharing, and reporting.
• When data, code, software, and products of a
research project are archived together, these
are termed a “research compendium”
(Gentleman and Lang 2007).
• They provide a standardized and easily
recognizable way to organize the digital
materials of a research project, which enables
other researchers to inspect, reproduce, and
extend research
