Reproducible Research

Reproducible Research
• Reproducible research is the practice of publishing data analyses and scientific claims together with the data and software code behind them.
• It allows people to focus on the actual content of a data analysis.
• It makes an analysis more useful to others.
• Replicability: the act of repeating a scientific methodology to reach similar conclusions.
Why do reproducible research?
• A. Reproducible research benefits those who do it:
• You can repeat the same analysis multiple times with the same results.
• Researchers are the primary beneficiaries of this practice.
• It helps researchers remember how and why they performed specific analyses during the course of a project.
• It makes work easier to explain to collaborators, supervisors, and reviewers, and it allows collaborators to conduct supplementary analyses more quickly and more efficiently.
Why do reproducible research?
• enables researchers to quickly and simply
modify analyses and figures.
• When analyses are reproducible, creating a new figure may be as easy as changing one value in a line of code and re-running a script, as in the sketch below.
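For instance, a minimal sketch in Python (file, column, and variable names are hypothetical): the figure depends on a single parameter defined at the top of the script, so changing that one value and re-running regenerates the figure.

import pandas as pd
import matplotlib.pyplot as plt

CUTOFF_YEAR = 2015  # change this one value and re-run to redraw the figure

data = pd.read_csv("clean_data.csv")        # analysis-ready data (hypothetical file)
subset = data[data["year"] >= CUTOFF_YEAR]  # filter driven by the parameter

plt.scatter(subset["year"], subset["value"])
plt.xlabel("Year")
plt.ylabel("Value")
plt.savefig("figure1.png", dpi=300)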
Why do reproducible research?
• Quick reconfiguration of previously conducted research tasks makes new projects much simpler and easier.
• Researchers can re-use earlier materials (e.g., analysis code, file organization systems) to execute these common research tasks more efficiently in subsequent iterations.
Why do reproducible research?
• It is a strong indicator to fellow researchers of rigor, trustworthiness, and transparency in scientific research.
• It increases the quality and speed of peer review.
• It protects researchers from accusations of research misconduct due to analytical errors.
• It increases paper citation rates.
• For example, a second team of researchers can re-use code from a paper with similar methods and cite that code.
• A third team may conduct a meta-analysis on the phenomenon described in these two research papers and thus use and cite both papers and their data.
B. Reproducible research benefits the research community-
• Sharing data, code, and detailed research methods and results leads to faster progress in methodological development and innovation, because research is more accessible to more scientists.
• It allows others to learn from your work and gives them a head start on performing similar analyses.
• It saves time and effort in finding current “best practice” through trial and error.
• makes it easier to perform follow-up studies to
increase the strength of evidence for the
phenomenon of interest.
• increases the likelihood that similar studies are
compatible, and that a group of studies can
together provide evidence in support of or in
opposition to a concept.
• increases the utility of these studies for meta-
analyses that are important for generalizing and
contextualizing the findings of studies on a topic.
• enhance the likelihood that data can be used in
future meta-analyses
• allows others to protect themselves from your
mistakes.
• gives them a better chance to critically analyze
the work, which can lead to co-authors or
reviewers discovering mistakes during the
revision process, or other scientists discovering
mistakes after publication.
• This prevents mistakes from compounding over
time and provides protection for collaborators,
research institutions, funding organizations,
journals, and others who may be affected when
such mistakes happen.
Barriers to reproducible research
• Barriers can be simplified into four primary themes: (1) complexity, (2) technological change, (3) human error, and (4) concerns over intellectual property rights.
Barriers to reproducible research
• Complexity-
• Science is difficult, and scientific research requires
specialized (and often proprietary) knowledge and
tools that may not be available to everyone who
would like to reproduce research.
• Some analyses may require high-performance
computing clusters that use several different
programming languages and software packages, or
that are designed for specific hardware
configurations.
Barriers to reproducible research
• Analyses may be performed using expensive proprietary software programs, such as SAS statistical software.
• Lack of knowledge, lack of institutional infrastructure, and lack of funding all make research less reproducible.
Barriers to reproducible research
• Solutions:
• Researchers can cite primers on complex
subjects or analyses to reduce knowledge
barriers. They can also thoroughly annotate
analytical code with comments explaining
each step in an analysis or provide extensive
documentation on research software.
• Using open-source software lowers cost barriers.
Technological change
• When old tools become obsolete, research becomes less reproducible.
• This can be mitigated by using established tools in reproducible research.
• Careful documentation of the versions of all software used also helps.
• A more advanced tool: software containers.
Human error
• People forget small details of how they performed analyses.
• They fail to describe data collection protocols or analyses completely.
• They fail to collect or thoroughly document data.
• Science is performed by fallible humans, and a wide variety of common events can render research less reproducible.
• For example, carefully recording details such as when and
where data were collected, what decisions were made
during data collection, and what labelling conventions
were used can make a huge difference in making sure that
those data can later be used appropriately or re-purposed.
• Unintentional errors often occur during the data wrangling (converting raw data into a usable form) stage of a project. These can be mitigated by keeping multiple copies of data to prevent data loss, carefully documenting the process for converting raw data into clean data, and double-checking a small test set of data before manipulating the data set as a whole, as in the sketch below.
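A minimal sketch of this practice in Python (file and column names are hypothetical): the raw file is read but never overwritten, the cleaning steps are recorded in code, and a small test set is inspected before the full data set is trusted.

import pandas as pd

# Read the raw data; never overwrite this file, so it survives as a backup.
raw = pd.read_csv("data/raw/survey_raw.csv")

# Documented, repeatable cleaning steps.
clean = raw.dropna(subset=["site", "count"])           # drop incomplete records
clean["site"] = clean["site"].str.strip().str.lower()  # standardize site labels

# Double-check a small test set before manipulating the data set as a whole.
print(clean.head())

# Write the clean copy to a separate file, leaving the raw data untouched.
clean.to_csv("data/clean/survey_clean.csv", index=False)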
Intellectual property rights-
• Researchers often hesitate to share data and code because doing so may allow
other researchers to use data and code incorrectly or unethically.
• Use without notifying authors, leading to incorrect assumptions about the data
that result in invalid analyses.
• Use without citing the original data owners or code writers,
• Researchers may want to conceal data from others so that they can perform new
analyses on those data in the future without worrying about others scooping
them using the shared data.
• Rational self-interest can lead to hesitation to share data and code via many pathways, and we acknowledge that making data openly available is likely the most controversial aspect of reproducible research.
• However, new tools for sharing data and code (outlined below and in Table 1) are
making it easier for researchers to receive credit for doing so and to prevent
others from using their data during an embargo period (a temporary stop or ban
on disclosing information contained in a research paper.)
A three-step framework for conducting
reproducible research
• Research is often not reproducible.
• Some basic steps can make research more reproducible at three stages of a research project:
• (1) before data analysis,
• (2) during analysis, and
• (3) after analysis.
A three-step framework for conducting
reproducible research
• Before data analysis: data storage and organization-
• Reproducibility starts in the planning stage, with sound data
management practices.
• It is difficult to reproduce research when data are disorganized
or missing, or when it is impossible to determine where or
how data originated.
• First, data should be backed up at every stage of the research
process and stored in multiple locations.
• This includes raw data (e.g., physical data sheets or initial
spreadsheets), clean analysis-ready data (i.e., final data sets),
and steps in between. Because it is entirely possible that
researchers unintentionally alter or corrupt data while
cleaning it up, raw data should always be kept as a backup.
• Scan and save data sheets or laboratory notebook pages associated with a data set to ensure that these are kept paired with the digital data set.
• Different copies should be stored in different locations and on different storage media.
• Digital data files should be stored in useful, flexible,
portable, non-proprietary formats. Storing data digitally
in a “flat” file format is almost always a good idea.
• Flat file formats are those that store data as plain text
with one record per line and are the most portable
formats across platforms, as they can be opened by
anyone without proprietary software programs.
• For more complex data types, multi-dimensional
relational formats or other discipline-specific formats
may be appropriate.
• the complexity of these formats makes them difficult
for many researchers to access and use appropriately,
so it is best to stick with simpler file formats when
possible.
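As a small illustration (hypothetical file name and values), writing and reading a flat CSV file needs nothing beyond a language's standard tools:

import csv

# One record per line, stored as plain text: portable across platforms
# and readable without proprietary software.
rows = [["site", "year", "count"], ["a1", 2020, 14], ["a2", 2020, 9]]
with open("observations.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open("observations.csv", newline="") as f:
    for record in csv.reader(f):
        print(record)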
• It is often useful to transform data into a “tidy”
format (Wickham 2014) when cleaning up and
standardizing raw data.
• Tidy data are in long format (i.e., variables in columns,
observations in rows), have consistent data structure
(e.g., character data are not mixed with numeric data
for a single variable),
• and have informative and appropriately formatted
headers (e.g., reasonably short variable names that
do not include problematic characters like spaces,
commas, and parentheses).
• Data in this format are easy to manipulate, model,
and visualize during analysis.
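A minimal sketch, assuming the pandas library and a small hypothetical table: a wide table with one column per year is reshaped into tidy long format.

import pandas as pd

# Wide format: one column per year (hypothetical example data).
wide = pd.DataFrame({
    "site": ["a1", "a2"],
    "2019": [12, 7],
    "2020": [14, 9],
})

# Long/tidy format: variables in columns, observations in rows.
tidy = wide.melt(id_vars="site", var_name="year", value_name="count")
print(tidy)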
• Metadata explaining what was done to clean up the data and what each
of the variables means should be stored along with the data.
• Data are useless unless they can be interpreted (Roche et al. 2015), and
metadata is how we maximize data interpretability across potential users.
• At a minimum, all data sets should include informative metadata that
explains how and why data were collected, what variable names mean,
whether a variable consists of raw or transformed data, and how
observations are coded.
• Metadata should be placed in a sensible location that pairs it with the
data set it describes.
• A few rows of metadata above a table of observations within the same
file may work in some cases, or a paired text file can be included in the
same directory as the data if the metadata must be more detailed.
• In the latter case, it is best to stick with a simple text file for metadata to maximize portability.
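For example, a few comment-style metadata rows placed above the observations in the same file might look like this (all contents are hypothetical):

# Data: small-mammal trapping counts, May-Aug 2020; collected by J. Smith
# count = raw number of captures per trap-night (not transformed)
# site codes are defined in sites_metadata.txt
site,year,count
a1,2020,14
a2,2020,9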
• Finally, researchers should organize files in a sensible, user-
friendly structure and make sure that all files have informative
names.
• It should be easy to tell what is in a file or directory from its
name, and a consistent naming protocol (e.g., ending the
filename with the date created or version number) provides even
more information when searching through files in a directory.
• A consistent naming protocol for both directories and files also
makes coding simpler by placing data, analyses, and products in
logical locations with logical names.
• It is often more useful to organize files in small blocks of similar
files, rather than having one large directory full of hundreds of
files.
• For example, Noble (2009) suggests organizing computational projects within a main directory for each project, with sub-directories for the manuscript, data files, analyses, and analysis products within that directory.
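One possible layout in this spirit (the directory names are illustrative, not prescribed):

project/
  doc/       manuscript drafts
  data/      raw and clean data files
  scripts/   analysis code
  results/   figures, tables, and other analysis products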
• Throughout the research process, from data acquisition
to publication, version control can be used to record a
project’s history and provide a log of changes that have
occurred over the life of a project or research group.
• Version control systems record changes to a file or set
of files over time so that you can recall specific versions
later, compare differences between versions of files, and
even revert files back to previous states in the event of
mistakes.
• Version control also enables a specific snapshot of data or code to be easily shared, so that the code used for analyses at a specific point in time can be documented, even if that code is later updated.
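For example, with Git, one widely used version control system, a basic workflow looks like this (file names are hypothetical):

git init                                   # start tracking a project directory
git add survey_clean.csv analysis.py
git commit -m "Add cleaned data and first analysis script"
git diff HEAD~1 analysis.py                # compare with the previous version
git checkout <commit-hash> -- analysis.py  # restore a file from an earlier snapshot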
During analysis: best coding practices-
• When possible, all data wrangling and analysis should be performed using coding scripts, as opposed to interactive or point-and-click tools, so that every step is documented and repeatable by yourself and others.
• Code serves as a log of analytical activities.
• Code (unlike point-and-click programs) is inherently reproducible. Most errors are unintentional mistakes made during data wrangling or analysis, so having a record of these steps ensures that analyses can be checked for errors and are repeatable on future data sets.
• If operations are not possible to script, they should be well-documented in a log file that is kept in the appropriate directory.
• Analytical code should be thoroughly annotated with
comments. Comments embedded within code serve as
metadata for that code, substantially increasing its
usefulness.
• Comments should contain enough information for an informed stranger to easily understand what the code does, but not so much that sorting through comments is a chore. Asking a friend to read the code is a good way to test for this balance.
• In most scripting languages, the first few lines of a script
should include a description of what the script does and
who wrote it, followed by small blocks that import data,
packages, and external functions.
• Data cleaning and analytical code then follows those
sections, and sections are demarcated using a consistent
protocol and sufficient comments to explain what function
each section of code performs.
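A skeletal sketch of this structure in Python (file names, author line, and analysis steps are all hypothetical):

# analyze_survey.py -- summarizes 2020 survey counts by site.
# Written by: <author>, 2021. Input: data/clean/survey_clean.csv
# Output: results/site_means.csv

# --- Imports ----------------------------------------------------------
import pandas as pd

# --- Load data --------------------------------------------------------
data = pd.read_csv("data/clean/survey_clean.csv")

# --- Clean / prepare --------------------------------------------------
data = data[data["count"] >= 0]  # drop impossible negative counts

# --- Analysis ---------------------------------------------------------
summary = data.groupby("site")["count"].mean()
summary.to_csv("results/site_means.csv")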
• Following a clean, consistent coding style makes code easier to read.
• This includes using a consistent convention for naming objects and embedding meaningful information in object names.
• Code should also be written in relatively short lines and grouped into blocks.
• There are several ways to prevent coding
mistakes and make code easier to use.
• First, researchers should automate repetitive tasks. For example, if a set of analysis steps is used repeatedly, those steps can be saved as a function and loaded at the top of the script.
• Second, researchers can use loops, though nesting too many loops inside one another can quickly make code incomprehensible.
• A third way to reduce mistakes is to reduce the
number of hard-coded values that must be
changed to replicate analyses on an updated or
new data set.
• It is often best to read in the data file(s) and
assign parameter values at the beginning of a
script, so that those variables can then be used
throughout the rest of the script.
• When operating on new data, these variables
can then be changed once at the beginning of a
script rather than multiple times in locations
littered throughout the script.
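A minimal sketch combining all three practices (hypothetical files, column, and parameter): repeated steps live in one function, a simple loop automates them, and every value that might change sits at the top of the script.

import pandas as pd

# Input files and parameters are assigned once, at the top: a new data
# set requires changing only these lines.
INPUT_FILES = ["data/site_a.csv", "data/site_b.csv"]
THRESHOLD = 10

def frac_above(path, threshold):
    """Repeated analysis steps are saved as one function, not copied."""
    df = pd.read_csv(path)
    return (df["count"] >= threshold).mean()

# A simple, shallow loop automates the repetitive task.
for path in INPUT_FILES:
    print(path, frac_above(path, THRESHOLD))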
• Because incompatibility between operating systems or
program versions can inhibit the reproducibility of research,
the current gold standard for ensuring that analyses can be
used in the future is to create a software container.
• Containers are standalone, portable environments that
contain the entire computing environment used in an analysis:
• software, all of its dependencies, libraries, binaries, and
configuration files, all bundled into one package.
• Containers can then be archived or shared, allowing them to be used in the future, even as packages, functions, or libraries change or become obsolete.
• If creating a software container is infeasible or a larger step
than researchers are willing to take, it is important to
thoroughly report all software packages used, including version
numbers.
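At a minimum, a short snippet can record the versions used (a sketch assuming Python 3.8+ and that the named packages are installed):

import sys
from importlib.metadata import version

# Record the interpreter and package versions used in the analysis.
print("python", sys.version)
for pkg in ["pandas", "numpy", "matplotlib"]:
    print(pkg, version(pkg))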
After data analysis: finalizing results and sharing-
• All input data, scripts, program versions, parameters, and important intermediate results should be made publicly and easily accessible.
• It is better to produce tables and figures directly from code, so that they can be regenerated by simply re-running a script, creating a “dynamic” document. For example, documents written in LaTeX incorporate figures directly from a directory.
• It is possible to make data wrangling, analysis, and creation of figures, tables, and manuscripts a “one-button” process using GNU Make (https://www.gnu.org/software/make/).
• Make is a simple yet powerful tool that can be used to coordinate and automate command-line processes, such as a series of independent scripts, as in the sketch below.
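A minimal sketch of a Makefile (file names are hypothetical; recipe lines must be indented with tabs): each rule names an output, the inputs it depends on, and the command that rebuilds it, so running `make` re-runs only the steps whose inputs have changed.

results/figure1.png: clean_data.csv make_figure.py
	python make_figure.py

clean_data.csv: raw_data.csv clean_data.py
	python clean_data.py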
• Currently, code and data that can be used to replicate research are often found in the supplementary material of journal articles; some journals include them in the articles themselves.
• To increase access to publications, authors can post preprints of final (but pre-acceptance) versions of manuscripts on a preprint server, or postprints of manuscripts on postprint servers.
• To make research accessible to everyone, it is better to use tools like data and code repositories than personal websites.
• Data repositories are large databases that
collect, manage, and store data sets for
analysis, sharing, and reporting.
• When data, code, software, and products of a
research project are archived together, these
are termed a “research compendium”
(Gentleman and Lang 2007).
• They provide a standardized and easily
recognizable way to organize the digital
materials of a research project, which enables
other researchers to inspect, reproduce, and
extend research
