Automating Scientific Data Analysis Part 1 - by Peter Grant - Towards Data Science
https://towardsdatascience.com/automating-scientific-data-analysis-part-1-c9979cd0817e 1/7
4/16/23, 11:20 PM Automating Scientific Data Analysis Part 1: | by Peter Grant | Towards Data Science
Many people are familiar with the typical application of data science techniques. A
company with an incredibly large data set asks somebody to mine the data set for
understanding, develop algorithms trained to the data set, and let the company use
their models to drive business decisions. Data science writing typically focuses on this
valuable application, but other fields can benefit from these techniques and mindsets
too. Scientific research is one of them.
Scientific research has a lot in common with data science. There are often large data
sets to study. Those data sets typically contain the answers to important questions.
Those answers are often important in decision making. The main difference is that
scientific researchers typically do their data analysis manually in spreadsheets,
whereas data scientists typically leverage the many powerful packages available in
Python.
The purpose of this post is to introduce scientists to some of the ways data science
techniques and mindsets can improve scientific research, and why scientists should
consider using these techniques over their current methods. The fundamental
principle is simple: the data analysis portion of most scientific research is routine
and can be automated with Python scripts. That automation enables the scientist to
process larger data sets than their competition, with fewer mistakes, in a fraction of the
time.
Faster processing of data: Analyzing scientific data sets can consume weeks or
months of every year. Each project, whether it includes lab experiments, field
studies, or simulation studies, can yield hundreds if not thousands of data files.
Each of these files must be opened, studied to ensure that the
test/monitoring/simulation proceeded correctly, and analyzed to find the result
contained in that file. Then the result must be added to another file and saved for
later analysis. Manually doing this takes a lot of time. It’s expensive. It’s repetitive
and boring. Automation solves all of those problems. If the project is planned out
in advance, scientists can write a Python script that performs all of these tasks on
every data file automatically. Then this process can be performed in minutes
instead of months.
Reduced error potential: Humans make mistakes. That’s simply part of being
human. Analyzing hundreds of test files requires thousands of calculations. It
involves creating hundreds of plots. It requires saving hundreds of data points in
the right location. Each of these actions has the potential for typos, for incorrectly
remembered constants, for files to be saved in the wrong location, for inconsistent
plot axis labels, and so on. These risks have always been part of the process, and avoiding
them requires significant care and time. Again, automation has the
potential to avoid this issue completely. Instead of ensuring that all calculations
and plots in hundreds of data files are correct individually, a scientist only needs to
ensure that a single Python script is correct. Then that script is applied to each file.
And if there’s a mistake in the script there’s no need to dig through hundreds of
files checking to see where else the mistake was made; simply update the script
and re-run it on all files. While getting a cup of coffee.
Access to Python packages: There are many Python packages designed specifically
to make life easier for scientists. Scikit-learn is an excellent package for scientists
who need to fit regressions or implement machine learning. NumPy is a
numerical package capable of performing most calculations that scientists would
need. Matplotlib and Bokeh both offer plotting options with different features,
allowing flexibility in plot creation. Pandas replaces the Excel table with
DataFrames, enabling the data to be structured and manipulated in a familiar
manner.
Time available for other purposes: Since automated data analysis allows you to
complete that part of your job in less time, suddenly you have time available for
other activities. Maybe you’d rather spend the time on business development and
proposal writing. Or maybe you have a staff member that you’d like to be
mentoring. Or customer relationships that you’d like to spend more time on.
Regardless of what activity you find more meaningful, automating your data analysis
will help you spend more time there.
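To make the idea of "one script applied to every file" concrete, here is a minimal sketch of what such a script might look like. It is only an illustration under stated assumptions, not the method developed later in this series: it assumes each test is stored as a CSV file with a `temperature` column, and the function names `analyze_file` and `analyze_folder`, the `min_rows` validity check, and the summary file layout are all hypothetical. It uses only the Python standard library so that it runs anywhere; for real projects pandas would likely be the more convenient choice.

```python
import csv
import statistics
from pathlib import Path


def analyze_file(path, min_rows=3):
    """Open one data file, sanity-check it, and return a one-row summary.

    Assumes (hypothetically) a CSV file with a 'temperature' column.
    """
    with open(path, newline="") as f:
        values = [float(row["temperature"]) for row in csv.DictReader(f)]
    return {
        "file": Path(path).name,
        "rows": len(values),
        # Hypothetical check: a test with too few rows is flagged as invalid.
        "valid": len(values) >= min_rows,
        "mean_temperature": statistics.mean(values) if values else None,
    }


def analyze_folder(data_dir, summary_path):
    """Run analyze_file on every CSV file in data_dir and save one summary file."""
    data_dir = Path(data_dir)
    summary_path = Path(summary_path)
    # Skip the summary file itself in case it lives in the same folder.
    files = sorted(
        p for p in data_dir.glob("*.csv") if p.resolve() != summary_path.resolve()
    )
    results = [analyze_file(p) for p in files]
    with open(summary_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["file", "rows", "valid", "mean_temperature"]
        )
        writer.writeheader()
        writer.writerows(results)
    return results
```

Calling `analyze_folder("data", "data/summary.csv")` would process every CSV file in a `data` folder and collect the results in one place. If a mistake turns up in `analyze_file`, fixing it once and re-running the same call updates the result for every file.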
I believe that these reasons provide a solid justification for learning to automate data
analysis, and that it would be wise for any scientist to do so. But I’m sure that these
aren’t all of the reasons. What additional benefits do you think that you could gain?
Since laboratory experimentation and the associated data analysis are a common part of
scientific research, this series of posts will focus on how to automate that process.
more comprehensive data set. If the model is accurate enough, publish detailed
descriptions of its strengths and weaknesses so that future users understand the
situations when the model should/should not be used.
Next Steps
This post presented the concept of, motivation for, and procedure for automating
scientific data analysis using Python scripts. The remaining posts in the series will
guide you through the 9 steps presented above. The next post will discuss steps 1
through 6, leaving you with a firm understanding of how to automate analysis of
individual laboratory tests. The third and final post will discuss ways to store your data
from each test, and combine it to form regressions. When the topics covered in those two
posts are combined, you’ll be able to write scripts that automatically perform the
entire data analysis process for a particular project.
I hope to see you there, and I hope you find the posts useful.