Role of Statistics in Data Science
Abstract
Statistics is one of the most important disciplines for providing tools and methods
to find structure in and to gain deeper insight into data, and the most important
discipline for analyzing and quantifying uncertainty. Here I give an overview of
different proposed structures of Data Science and address the impact of
statistics on steps such as data acquisition and enrichment, data exploration,
data analysis and modeling, validation, and representation and reporting.
Introduction:
In the real world, statistics is used to work through complicated problems so that
data scientists and analysts can look for meaningful trends and patterns in data.
In simple words, statistics can be used to extract concrete knowledge from data
by performing mathematical computations on it.
History:
In 1996, data science was first mentioned as a topic at the conference of the
International Federation of Classification Societies. Even though the term was
coined by statisticians, in the public image of Data Science, the importance of
computer science and business applications is often much more stressed, in
particular in the era of Big Data.
Already in the 1970s, the ideas of John Tukey changed the viewpoint of
statistics from a purely mathematical setting, e.g., statistical testing, to deriving
hypotheses from data (exploratory setting), i.e., trying to understand the data
before hypothesizing.
Nowadays, these ideas have been combined to form many different definitions of
Data Science. One of the most popular informal definitions of a Data Scientist
was given as the quip:
“Data Scientist is a person who is better at statistics than any programmer and
better at programming than any statistician.”
- Josh Wills
Stanford professor David Donoho writes that data science is not distinguished
from statistics by the size of datasets or the use of computing. He even describes
data science as an applied field that emerged out of traditional statistics. Indeed,
as early as 1997, there was an even more radical view suggesting renaming
statistics to Data Science. And in 2015, a statement was also released about the
role of statistics in Data Science, saying that “statistics and machine learning
play a central role in data science.”
Statistics also shapes how data are collected in the first place: experimental
design is crucial for the reliability, validity, and replicability of our results.
Data exploration:
Exploratory data analysis (EDA) is an approach to analyzing data sets in order
to summarize their main features, mostly in visual form. A mathematical model
may or may not be used, but EDA is mainly intended to see what the data can
teach us beyond the formal task of modeling or hypothesis testing.
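As a minimal sketch of such a first pass (the data and column names below are
simulated and purely illustrative), an exploratory summary in Python might look
like this:

    # Minimal EDA sketch on simulated data; column names are illustrative.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "age": rng.integers(18, 80, size=500),
        "income": rng.lognormal(mean=10, sigma=0.5, size=500),
    })

    print(df.describe())    # summary statistics per column
    print(df.isna().sum())  # missing values per column

    df.hist(figsize=(8, 4), bins=30)  # distribution of each column
    plt.tight_layout()
    plt.show()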
Hypothesis testing was developed by Ronald Fisher, Jerzy Neyman, Karl
Pearson, and Pearson's son, Egon Pearson. It is a statistical method used to
make decisions from experimental data and is widely used when comparing two
or more groups. When testing a hypothesis, we consider how different our
samples are and how large they are. Based on this information, we run a test to
see whether there is a meaningful difference between the groups, or whether the
observed difference is simply a matter of chance.
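As a hedged illustration, a two-sample t-test on simulated data might look like
the following (the group sizes, means, and significance threshold are illustrative
choices, not prescriptions):

    # Two-sample t-test sketch; the data are simulated for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    group_a = rng.normal(loc=10.0, scale=2.0, size=50)  # e.g., control group
    group_b = rng.normal(loc=11.0, scale=2.0, size=50)  # e.g., treatment group

    # Null hypothesis: the two groups have equal means.
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

    # A small p-value (e.g., below 0.05) suggests the observed difference
    # is unlikely to be due to chance alone, so we reject the null hypothesis.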
Statistical modeling:
Statistical modeling is the method of applying statistical analysis to a dataset;
a statistical model is a mathematical representation of observed data.
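As a small illustration of such a representation, the sketch below fits a simple
linear model to simulated data (the coefficients and noise level are made up for
the example):

    # Fitting a simple linear model y = a + b*x + noise to simulated data.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

    X = sm.add_constant(x)      # design matrix with an intercept column
    model = sm.OLS(y, X).fit()  # ordinary least squares fit
    print(model.summary())      # coefficients, standard errors, p-values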
In cases where more than one model is proposed for, e.g., prediction, statistical
tests for comparing models are helpful for ranking the models, e.g., with respect
to their predictive power.
Model selection has become ever more important as the number of proposed
classification and regression models has grown rapidly.
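Formal tests aside, cross-validation is one widely used way to compare
candidate models by predictive power; below is a sketch with scikit-learn on
simulated data (the two candidate models are illustrative choices):

    # Comparing two candidate regression models by cross-validated error.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=5, noise=10.0,
                           random_state=0)

    for name, model in [("linear regression", LinearRegression()),
                        ("random forest", RandomForestRegressor(random_state=0))]:
        scores = cross_val_score(model, X, y, cv=5,
                                 scoring="neg_mean_squared_error")
        print(f"{name}: mean CV MSE = {-scores.mean():.1f}")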
Besides visualization and adequate model storage, the main statistical task in
the reporting step is the communication of uncertainties, together with review.
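For instance, a bootstrap confidence interval is one simple way to report the
uncertainty of an estimate; here is a sketch on simulated data:

    # Bootstrap 95% confidence interval for a sample mean (simulated data).
    import numpy as np

    rng = np.random.default_rng(1)
    sample = rng.exponential(scale=3.0, size=200)  # illustrative data

    # Resample with replacement many times and record each resample's mean.
    boot_means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(10_000)
    ])
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"mean = {sample.mean():.2f}, 95% CI = [{low:.2f}, {high:.2f}]")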
Statistical methods are fundamental for finding structure in data and for
obtaining deeper insight into it, and thus for a successful data analysis. But
ignoring modern statistical thinking, or using simplistic data analytics and
statistical methods, may lead to avoidable fallacies, particularly when big
and/or complex data are involved. Here are several statistical pitfalls that data
scientists should avoid falling into.
• Cherry Picking:
Selecting only those data points or results that support a desired conclusion
while ignoring the rest.
• Data Dredging:
Running many statistical tests on the same data until a seemingly significant
pattern appears by chance.
• Simpson’s Paradox:
A trend that appears in several separate groups of data can disappear or even
reverse when the groups are combined (see the sketch after this list).
• Gambler’s Fallacy:
The Gambler's Fallacy is the belief that because something has happened more
often lately, it is now less likely to happen (and vice versa).
"When drawing inferences from data, people tend to have a gut feeling based
on previous experiences rather than logical explanations."
• Overfitting:
Fitting a model so closely to the training data that it captures noise rather than
the underlying pattern, so that it fails to generalize to new data.
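To make Simpson's Paradox concrete, the sketch below constructs two groups
whose within-group trends are both positive while the pooled trend is negative
(all numbers are fabricated purely for illustration):

    # Simpson's paradox: each group trends upward, the pooled data trend downward.
    import numpy as np

    rng = np.random.default_rng(7)

    # Group 1: small x, high baseline; Group 2: large x, low baseline.
    x1 = rng.uniform(0, 4, 100)
    y1 = 10 + 0.8 * x1 + rng.normal(0, 0.5, 100)
    x2 = rng.uniform(6, 10, 100)
    y2 = 2 + 0.8 * x2 + rng.normal(0, 0.5, 100)

    def slope(x, y):
        return np.polyfit(x, y, deg=1)[0]  # slope of a least-squares line

    print(f"group 1 slope: {slope(x1, y1):+.2f}")  # positive
    print(f"group 2 slope: {slope(x2, y2):+.2f}")  # positive
    pooled = slope(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
    print(f"pooled slope:  {pooled:+.2f}")         # negative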
Conclusion:
Following the above assessment of the capabilities and impacts of statistics, our
conclusion is: statistics is one of the most important disciplines for providing
tools and methods to find structure in and to gain deeper insight into data, and
the most important discipline for analyzing and quantifying uncertainty.