
Automating Scientific Data Analysis Part 1 - by Peter Grant - Towards Data Science

This document summarizes an article about automating scientific data analysis with Python. It discusses how scientific data analysis is often done manually in spreadsheets, which can be time-consuming and error-prone when dealing with large datasets. Automating the process with Python scripts can speed up analysis, reduce errors, and free up scientists' time for other tasks. The document outlines the general steps to take: 1) create a test plan, 2) design the dataset for automation, 3) establish a clear file naming system, 4) store data files in a specific folder, and 5) write a script to analyze individual test results from the files. Automating routine data analysis enables processing larger datasets faster and with fewer mistakes.

4/16/23, 11:20 PM Automating Scientific Data Analysis Part 1: | by Peter Grant | Towards Data Science


Published in Towards Data Science

Peter Grant

Apr 1, 2019 · 7 min read

Automating Scientific Data Analysis Part 1:


Why and How You Can Write Python Programs that Automatically
Analyze Scientific Data Sets


https://towardsdatascience.com/automating-scientific-data-analysis-part-1-c9979cd0817e 1/7

Many people are familiar with the typical application of data science techniques. A
company with an incredibly large data set asks somebody to mine the data set for
understanding, develop algorithms trained to the data set, and let the company use
their models to drive business decisions. Data science writing typically focuses on this
valuable application, but there are other applications where people can benefit from
these techniques and mindsets. Scientific research is one of them.

Scientific research has a lot in common with data science. There are often large data
sets to study. Those data sets typically contain the answers to important questions.
Those answers are often important in decision making. The main difference is that
scientific researchers typically do their data analysis manually in spreadsheets,
whereas data scientists typically leverage the many powerful packages available in
Python.

The purpose of this post is to introduce scientists to some of the ways data science
techniques and mindsets can improve scientific research, and why scientists should
consider using these techniques over their current methods. The fundamental
principle is simple: the data analysis portion of most scientific research is routine
and can be automated with Python scripts. That automation enables the scientist to
process larger data sets than their competition, with fewer mistakes, in a fraction of the
time.

Why would I want to automate my data analysis?


This is perhaps the most important question. Nobody is going to learn a new skill, in
this case the two new skills of Python programming and data analysis automation, if
they don’t think it will benefit them. Fortunately, there are many reasons scientists
should automate data analysis, including the following:

Faster processing of data: Analyzing scientific data sets can consume weeks or
months of every year. Each project, whether it includes lab experiments, field
studies, or simulation studies, can yield hundreds if not thousands of data files.
Each of these files must be opened, studied to ensure that the
test/monitoring/simulation proceeded correctly, and analyzed to find the result
contained in that file. Then the result must be added to another file and saved for
later analysis. Manually doing this takes a lot of time. It’s expensive. It’s repetitive
and boring. Automation solves all of those problems. If the project is planned out
in advance, scientists can write a Python script that performs all of these tasks on
every data file automatically. Then this process can be performed in minutes
instead of months.

Reduced error potential: Humans make mistakes. That’s simply part of being
human. Analyzing hundreds of test files requires thousands of calculations. It
involves creating hundreds of plots. It requires saving hundreds of data points in
the right location. Each of these actions has the potential for typos, for incorrectly
remembered constants, for files to be saved in the wrong location, for inconsistent
plot axis labels, and so on. This has always been part of the process, and it requires
significant care and time to avoid. Again, automation has the
potential to avoid this issue completely. Instead of ensuring that all calculations
and plots in hundreds of data files are correct individually, a scientist only needs to
ensure that a single Python script is correct. Then that script is applied to each file.
And if there’s a mistake in the script there’s no need to dig through hundreds of
files checking to see where else the mistake was made; simply update the script
and re-run it on all files. While getting a cup of coffee.

Access to Python packages: There are many Python packages designed specifically
to make life easier for scientists. Scikit-learn is an excellent package for scientists
needing to make regressions or implement machine learning. NumPy is a
numerical package capable of performing most calculations that scientists would
need. Matplotlib and Bokeh both offer plotting options with different features,
allowing flexibility in plot creation. Pandas replaces the Excel table with
DataFrames, enabling the data to be structured and manipulated in a familiar
manner.

Time available for other purposes: Since automated data analysis allows you to
complete that part of your job in less time, suddenly you have time available for
other activities. Maybe you’d rather spend the time on business development and
proposal writing. Or maybe you have a staff member that you’d like to be
mentoring. Or customer relationships that you’d like to spend more time on.
Regardless of what activity you find more meaningful, automating your data analysis
will help you spend more time there.


I believe that these reasons provide a solid justification for learning to automate data
analysis, and that it would be wise for any scientist to do so. But I’m sure that these
aren’t all of the reasons. What additional benefits do you think that you could gain?

Since laboratory experimentation and the associated data analysis are a common part of
scientific research, this series of posts will focus on how to automate this process.

What steps do I need to take to automate laboratory data analysis?


First, we’ll present the structure and big-picture design of a project before moving on
to discuss several of the topics in significantly more depth. This series of posts will
focus on the planning and data analysis aspects of the process.

Unfortunately, each project must be approached individually, and a detailed yet
generic solution doesn’t exist. However, there is a fundamental approach that can be
applied to every project, with the specific programming (primarily the calculations)
changing between projects. The following general procedure provides the structure of
an automated data analysis project.

1. CREATE THE TEST PLAN


Determine what tests need to be performed to generate the data set needed to answer
the research question. This ensures that a satisfactory data set is available when
generating regressions at the end of the project, and avoids needing to perform extra
tests.
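
One way to make a test plan concrete is to write it down as a table of conditions before any test is run. The sketch below is only an illustration: the two variables (flow rate and inlet temperature), their values, and the full-factorial layout are assumptions made for this example, not part of the original procedure.

```python
from itertools import product

# Hypothetical test conditions; replace these with the variables in your own study.
flow_rates_lpm = [4.0, 8.0, 12.0]      # liters per minute
inlet_temperatures_c = [10.0, 20.0]    # degrees Celsius


def build_test_plan(flow_rates, inlet_temperatures):
    """Return one dict per test, covering every combination of the conditions."""
    combinations = product(flow_rates, inlet_temperatures)
    return [
        {"test_id": test_id, "flow_rate_lpm": flow, "inlet_temperature_c": temp}
        for test_id, (flow, temp) in enumerate(combinations, start=1)
    ]


test_plan = build_test_plan(flow_rates_lpm, inlet_temperatures_c)
for test in test_plan:
    print(test)
```

Listing every planned test up front makes it easy to confirm, before entering the lab, that the data set will cover the range needed for the regressions.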

2. DESIGN THE DATA SET TO ALLOW AUTOMATION


This includes specifying what signals will be used to identify the most important
sections of the tests, or the sections that will be analyzed by the script. This ensures
that there will be an easy way to structure the script to identify the results of each
individual test.
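
As a minimal sketch of this idea, suppose each data file records a marker signal that is 1 while the test is active and 0 otherwise. The column names ("valve_open", "temperature_c") and the values below are made up for illustration; the point is only that a dedicated signal lets pandas pick out the relevant rows in one line.

```python
import pandas as pd


def select_test_section(data, signal_column="valve_open"):
    """Keep only the rows where the marker signal shows the test is active."""
    return data[data[signal_column] == 1].reset_index(drop=True)


# Small made-up data set: the valve signal is 1 only during the test itself.
raw = pd.DataFrame({
    "time_s": [0, 1, 2, 3, 4, 5],
    "valve_open": [0, 0, 1, 1, 1, 0],
    "temperature_c": [20.0, 20.1, 35.2, 35.4, 35.3, 21.0],
})

test_section = select_test_section(raw)
print(test_section)
```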

3. CREATE A CLEAR FILE NAMING SYSTEM


Either create a data printing method that makes identification of the test conditions in
each test straightforward or collaborate with the lab tester to do so. This ensures that
the program will be able to identify the conditions of each test, which is necessary for
analyzing the data and storing the results.
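
For example, if the test conditions are written into each file name, a script can read them back with a regular expression. The naming convention used here (Name=value pairs separated by underscores) and the condition names are assumptions for this sketch; any convention works as long as it is applied consistently.

```python
import re
from pathlib import Path

# Assumed convention: <Name>=<value> pairs separated by underscores,
# e.g. "FlowRate=8.0_InletTemp=20.0.csv".
CONDITION_PATTERN = re.compile(r"([A-Za-z]+)=(-?\d+(?:\.\d+)?)")


def parse_test_conditions(path):
    """Read the test conditions out of a data file's name."""
    stem = Path(path).stem  # file name without folder or extension
    return {name: float(value) for name, value in CONDITION_PATTERN.findall(stem)}


conditions = parse_test_conditions("data/FlowRate=8.0_InletTemp=20.0.csv")
print(conditions)
```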


4. STORE THE RESULTING DATA FILES IN A SPECIFIC FOLDER


This allows use of the Python module “glob” to sequentially open and analyze the data
from each individual test.
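
A minimal sketch of how glob can list the files in one folder is shown below. The folder name "test_data" and the ".csv" extension are placeholders, not details from the original project.

```python
import glob
import os


def list_data_files(folder, pattern="*.csv"):
    """Return the matching files in `folder`, sorted so the order is repeatable."""
    return sorted(glob.glob(os.path.join(folder, pattern)))


# "test_data" is a placeholder folder name; an absent folder simply yields no files.
for path in list_data_files("test_data"):
    print(path)
```

Sorting the result is a small design choice: glob does not guarantee any particular order, and a repeatable order makes the script's output easier to compare between runs.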

5. ANALYZE THE RESULTS OF INDIVIDUAL TESTS


Create a program to automatically cycle through all of the data files and analyze each
data set. This program will likely use a for loop and glob to automatically analyze every
data file. It will likely use pandas to perform the calculations to identify the desired
result of the test, and create checks to ensure that the test was performed correctly. It
will also likely include plotting features with either Bokeh or Matplotlib.
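
The loop described above might look roughly like the following. This is a sketch under stated assumptions: the files are CSVs with a "temperature_c" column, the folder is called "test_data", and the "result" of each test is simply the mean temperature. A real project would substitute its own calculations and add plotting.

```python
import glob
import os

import pandas as pd


def analyze_test(path):
    """Load one test file and return its summary result as a dict."""
    data = pd.read_csv(path)
    return {
        "file": os.path.basename(path),
        "mean_temperature_c": data["temperature_c"].mean(),
        "n_points": len(data),
    }


def analyze_folder(folder):
    """Analyze every CSV file in `folder` and collect the results in one table."""
    results = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        results.append(analyze_test(path))
    return pd.DataFrame(results, columns=["file", "mean_temperature_c", "n_points"])


# "test_data" is a placeholder folder name; an empty table is returned if it has no files.
print(analyze_folder("test_data"))
```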

6. INCLUDE ERROR CHECKING OPTIONS


Any number of errors can occur in this process. Maybe some of the tests had errors.
Maybe there was a mistake in the programmed calculations. Make your life easier by
ensuring that the program provides ample outputs to check the quality of the test
results and the following data analysis. This could mean printing plots from the test
that allow visual inspection, or adding an algorithm that compares the measured data
and calculations to expectations and reports errors.
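
As one possible form of such a check, the sketch below compares the measured flow rate in a test against the value the test plan called for and reports any test that misses by more than a tolerance. The quantity checked, the 5% tolerance, and the function name are all assumptions chosen for illustration.

```python
def check_test(measured_flow_lpm, expected_flow_lpm, tolerance=0.05):
    """Return a list of problems found in one test; an empty list means it passed."""
    if not measured_flow_lpm:
        return ["no data recorded"]

    problems = []
    mean_flow = sum(measured_flow_lpm) / len(measured_flow_lpm)
    relative_error = abs(mean_flow - expected_flow_lpm) / expected_flow_lpm
    if relative_error > tolerance:
        problems.append(
            f"mean flow {mean_flow:.2f} L/min differs from expected "
            f"{expected_flow_lpm:.2f} L/min by {relative_error:.0%}"
        )
    return problems


# Made-up measurements: the first test is on target, the second is not.
print(check_test([8.0, 8.1, 7.9], expected_flow_lpm=8.0))
print(check_test([6.0, 6.0], expected_flow_lpm=8.0))
```

Returning a list of messages, rather than raising an error, lets the script keep processing the remaining files and report every problem at the end.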

7. STORE THE DATA LOGICALLY


The calculated values from each test need to be stored in tables and data files for later
use. How these values are stored can make the remaining steps either easy or
impossible. The data should often be stored in different tables that provide the data set
needed to later perform regressions.
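
One simple layout, shown here only as a sketch, is a table with one row per test: the test conditions in some columns and the calculated results in others. The column names and numbers are made up, and a temporary folder stands in for wherever a real project would keep its results.

```python
import os
import tempfile

import pandas as pd


def save_results(results, path):
    """Write a list of per-test result dicts to a CSV file, one row per test."""
    table = pd.DataFrame(results)
    table.to_csv(path, index=False)
    return table


# Made-up results: test conditions plus one calculated value per test.
results = [
    {"flow_rate_lpm": 4.0, "inlet_temperature_c": 10.0, "effectiveness": 0.61},
    {"flow_rate_lpm": 8.0, "inlet_temperature_c": 10.0, "effectiveness": 0.58},
    {"flow_rate_lpm": 8.0, "inlet_temperature_c": 20.0, "effectiveness": 0.66},
]

# A temporary folder keeps this example from writing into the working directory.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "results.csv")
    save_results(results, path)
    reloaded = pd.read_csv(path)

print(reloaded)
```

Keeping the conditions and the results in the same row means the regression script can load a single file and have everything it needs.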

8. GENERATE REGRESSIONS FROM THE RESULTING DATA SET


Create a program that will open the stored data from Step 7 and create regressions. It
should include an algorithm to create each desired regression, matching the data
storage structure determined in Step 7. Ensure that this program provides adequate
outputs, both statistical and visual, to allow thorough validation of the results.
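
As a rough illustration of pairing a regression with a statistical output, the sketch below fits a straight line with NumPy and reports the coefficient of determination (R²). A linear model and NumPy's polyfit are assumptions made to keep the example short; Scikit-learn or a different model form may suit a real data set better, and a visual check such as a plot of the fit would normally accompany the number.

```python
import numpy as np


def fit_linear(x, y):
    """Fit y = slope * x + intercept by least squares and return the fit with its R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_residual = np.sum((y - predicted) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_residual / ss_total
    return slope, intercept, r_squared


# Made-up data that lies exactly on the line y = 2x + 1.
slope, intercept, r_squared = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```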

9. VALIDATE THE RESULTS


Validate the resulting regressions using the statistical and visual outputs provided in
Step 8. Determine whether the model is accurate enough. If not, either return to
Step 8 and generate different regressions, or return to Step 1 and add tests to create a
more comprehensive data set. If the model is accurate enough, publish detailed
descriptions of its strengths and weaknesses so that future users understand the
situations when the model should or should not be used.

Next Steps
This post presented the concept of, motivation for, and procedure for automating
scientific data analysis using Python scripts. The remaining posts in the series will
guide you through the 9 steps presented above. The next post will discuss steps 1
through 6, leaving you with a firm understanding of how to automate analysis of
individual laboratory tests. The third and final post will discuss ways to store your data
from each test, and combine it to form regressions. When the topics covered in those two
posts are combined, you’ll be able to write scripts that automatically perform the
entire data analysis process for a particular project.

I hope to see you there, and I hope you find the posts useful.
