
ROLE OF STATISTICS IN DATA SCIENCE

Kiranbala Nongthombam (20MSM3070)


Department of Mathematics,
University Institute of Sciences,
Chandigarh University, Gharuan,
Mohali, Punjab-140413, India

Abstract

Statistics is one of the most important disciplines providing tools and methods to find structure in data and to gain deeper insight into it, and the most important discipline for analyzing and quantifying uncertainty. Here I give an overview of different proposed structures of Data Science and address the impact of statistics on steps such as data acquisition and enrichment, data exploration, data analysis and modelling, validation, and representation and reporting.

Introduction:

Data Science as a scientific discipline is influenced by informatics, computer science, mathematics, operations research, and statistics, as well as the applied sciences.

Statistics is a mathematical science that gathers, analyzes, interprets, and presents data.

In the real world, statistics is used to work through complicated problems so that data scientists and analysts can search for meaningful patterns and trends in data. Put simply, statistics can be used to extract concrete knowledge from data by performing mathematical computations on it.

History:

In 1996, data science was first mentioned as a topic at the conference of the International Federation of Classification Societies. Even though the term was coined by statisticians, in the public image of Data Science the importance of computer science and business applications is often stressed much more, in particular in the era of Big Data.

Already in the 1970s, the ideas of John Tukey changed the viewpoint of
statistics from a purely mathematical setting, e.g., statistical testing, to deriving
hypotheses from data (exploratory setting), i.e., trying to understand the data
before hypothesizing.
Nowadays, these ideas have been combined and formed into many different definitions of data science. One of the most comprehensive definitions of Data Science was recently given as the formula:

Data science = (statistics + informatics + computing)

Relation of Data Science to Statistics:

“Data Scientist is a person who is better at statistics than any programmer and
better at programming than any statistician.”
- Josh Wills
Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or the use of computing. He even describes data science as an applied field emerging out of traditional statistics. Indeed, as early as 1997 there was an even more radical view suggesting that statistics be renamed Data Science. And in 2015, a statement was released about the role of statistics in Data Science, saying that "statistics and machine learning play a central role in data science."

Statistics is a foundation for Data Science; there is an unbreakable relationship between these two fields. Hence, statistics plays a crucial role by providing tools and methods to find structure in data and to give deeper insights into it.

Major impact of statistics on the most important steps in Data Science:

Data acquisition and enrichment:

Fig. Experiment design


Design of experiments (DOE) is a branch of applied statistics that deals with the planning, execution, analysis, and interpretation of controlled tests to determine the factors that control the value of a parameter or group of parameters. DOE is a versatile method for data collection and interpretation that can be used in a range of test situations.

DOE can be used for -

• systematic generation of new data (data acquisition),
• systematic reduction of existing data bases, and
• tuning (e.g., optimizing) the parameters of algorithms,

as the sketch below illustrates. Thus, experimental design is crucial for the reliability, validity, and replicability of our results.
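As a minimal sketch in Python (the factors and their levels are hypothetical, chosen only for illustration), a full factorial design systematically generates every combination of factor levels to be tested:

```python
from itertools import product

# Hypothetical factors for tuning an algorithm, each with its levels.
factors = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [32, 128],
    "regularization": ["l1", "l2"],
}

# A full factorial design enumerates every combination of factor levels,
# here 2 * 2 * 2 = 8 experimental runs.
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for i, run in enumerate(runs, start=1):
    print(f"run {i}: {run}")
```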

Data exploration:

Exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main features, mostly in visual formats. A mathematical model may or may not be used, but EDA is mainly intended to see what the data can teach us beyond the formal task of modeling or hypothesis testing.

The most important contribution of statistics is the notion of distribution. It allows us to represent variability in the data as well as (a-priori) knowledge of parameters, the concept underlying Bayesian statistics. Distributions also enable us to choose adequate subsequent analytical models and methods.
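As a brief sketch (simulated data stands in for real measurements; the parameters are arbitrary), a histogram with a fitted normal distribution summarizes the variability in a sample:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated measurements standing in for a real data set.
rng = np.random.default_rng(seed=42)
data = rng.normal(loc=10.0, scale=2.0, size=500)

# Fit a normal distribution to the sample and compare it with the histogram.
mu, sigma = stats.norm.fit(data)

plt.hist(data, bins=30, density=True, alpha=0.5, label="data")
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, stats.norm.pdf(x, mu, sigma), label="fitted normal")
plt.legend()
plt.show()
```
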
Statistical data analysis:

Statistical data analysis is a procedure for performing various statistical operations. It is a kind of quantitative research that seeks to quantify the data, and some form of statistical analysis is usually applied. Basically, quantitative data involves descriptive data, such as survey data and observational data.
Finding structure in data and making predictions are the most important steps in
Data Science. Here, in particular, statistical methods are essential since they are
able to handle many different analytical tasks. The following are important
examples of statistical data analysis approaches.

• Hypothesis testing was introduced by Ronald Fisher, Jerzy Neyman, Karl Pearson, and Pearson's son, Egon Pearson. It is a statistical method used to make statistical decisions using experimental data and is widely used when comparing two or more groups. When testing a hypothesis, we care about how different our samples are and how large our sample is. Based on this information, we would like to run a test to see whether there is a meaningful difference or whether it is simply a matter of chance. This is formally done through a process called hypothesis testing.

Five Steps in Hypothesis Testing (a minimal sketch in Python follows the steps):

1. Define the Null Hypothesis
2. Define the Alternative Hypothesis
3. Set the Significance Level (α)
4. Compute the Test Statistic and the corresponding P-Value
5. Draw the Conclusion
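The sketch below walks through these five steps, assuming scipy is available and using simulated samples in place of real experimental data:

```python
import numpy as np
from scipy import stats

# Two hypothetical samples, e.g., outcomes under two treatments.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.6, scale=1.0, size=40)

# Steps 1-2: H0: the group means are equal; H1: they differ.
# Step 3: set the significance level.
alpha = 0.05

# Step 4: a two-sample t-test yields the test statistic and p-value.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 5: draw the conclusion.
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```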

• Regression models the relationships between dependent and explanatory variables, which are typically charted on a scatter plot. The regression line also indicates whether those relationships are strong or weak. Commonly used regression approaches include the following (a simple example follows the list):
1. Simple Linear Regression model
2. Lasso Regression
3. Logistic regression
4. Support Vector Machines
5. Multivariate Regression algorithm
6. Multiple Regression Algorithm
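As a hedged illustration of the simplest case (simple linear regression on made-up paired observations), a least-squares fit and the correlation coefficient can be computed with numpy:

```python
import numpy as np

# Hypothetical paired observations: explanatory x and dependent y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Simple linear regression y = a*x + b via least squares.
a, b = np.polyfit(x, y, deg=1)

# The correlation coefficient indicates whether the relationship is strong or weak.
r = np.corrcoef(x, y)[0, 1]

print(f"fitted line: y = {a:.2f}x + {b:.2f}, correlation r = {r:.2f}")
```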
• The arithmetic mean, commonly referred to as the "average," is the sum of a list of numbers divided by the number of items in the list. The average is useful for assessing a data set's central tendency or for providing a quick description of our data. Another benefit of the average is that it is very simple and easy to compute.
• The standard deviation, often denoted by the Greek letter sigma, is a measure of how results are spread around the mean. A high standard deviation means that the data are spread widely around the mean, whereas a lower standard deviation implies that more of the data cluster close to the mean. The standard deviation is useful in a portfolio of data analysis tools for quickly evaluating the dispersion of data points (both statistics are sketched below).
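A minimal numpy sketch (the measurements are hypothetical) computing both statistics:

```python
import numpy as np

# Hypothetical data set of measurements.
data = np.array([4.2, 5.1, 4.8, 6.0, 5.5, 4.9, 5.3])

mean = data.mean()      # central tendency: sum divided by the number of items
std = data.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

print(f"mean = {mean:.2f}, standard deviation = {std:.2f}")
```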
• Determination of Sample Size: When studying a large data set or population, we do not always need to collect information from every member of the group. This is where sampling comes in. For a sample to be accurate, the trick is to decide on the correct size. Using proportion and standard deviation methods, we can reliably calculate the sample size we need to make our data collection statistically meaningful, as sketched below.
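As a sketch of one common proportion-based approach (Cochran's formula, n = z^2 * p * (1 - p) / e^2, where z is the normal critical value, p the expected proportion, and e the margin of error; the default values below are illustrative):

```python
import math
from scipy import stats

def sample_size_for_proportion(confidence=0.95, p=0.5, margin_of_error=0.05):
    """Cochran's formula: n = z^2 * p * (1 - p) / e^2."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil((z ** 2) * p * (1 - p) / (margin_of_error ** 2))

# About 385 respondents for 95% confidence and a 5% margin of error.
print(sample_size_for_proportion())
```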

Statistical modeling:

Statistical modeling is the process of applying statistical analysis to a dataset. A statistical model is a mathematical representation (or mathematical model) of observed data.

Model validation and model selection

In cases where more than one model is proposed for, e.g., prediction, statistical tests for comparing models are helpful for structuring the models, e.g., concerning their predictive power.

Predictive power is typically assessed by means of so-called resampling methods, where the distribution of power characteristics is studied by artificially varying the subpopulation used to learn the model. Characteristics of such distributions can be used for model selection, as the sketch below illustrates.
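A minimal sketch with scikit-learn (assuming it is installed; the synthetic data and the two candidate models are purely illustrative), where 5-fold cross-validation resamples the data to estimate each model's predictive power:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real prediction problem.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Resampling (here: 5-fold cross-validation) estimates each model's
# predictive power on subpopulations not used for fitting.
for name, model in [("linear", LinearRegression()), ("lasso", Lasso(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```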

Model selection has become more and more important as the number of proposed classification and regression models has increased rapidly.

Representation and reporting


Presenting the data involves using graphs, tables, maps, and other tools to pictorially represent the data. These methods add a visual element to the details, making them much simpler and easier to understand. This visual representation of data is called data visualization.
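As a small sketch with matplotlib (the category counts are invented for illustration), the same data can be presented as a bar chart and a pie chart:

```python
import matplotlib.pyplot as plt

# Hypothetical category counts, e.g., survey responses.
categories = ["A", "B", "C", "D"]
counts = [23, 45, 12, 30]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, counts)  # absolute counts per category
ax1.set_title("Counts per category")
ax2.pie(counts, labels=categories, autopct="%1.0f%%")  # shares of the whole
ax2.set_title("Share per category")
plt.tight_layout()
plt.show()
```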

Visualization to interpret found structures and storing of models in an easy-to-update form are very important tasks in statistical analyses to communicate the results and safeguard data analysis deployment. Deployment is decisive for obtaining interpretable results in Data Science.

Besides visualization and adequate model storing, the main task for statistics is the reporting of uncertainties and review.

Statistical pitfalls / fallacies:

Statistical methods are fundamental for finding structure in data and for obtaining deeper insight into it, and thus for a successful data analysis. But ignoring modern statistical thinking, or using simplistic data analytics/statistical methods, may lead to avoidable fallacies, particularly where big and/or complex data are concerned. Here are several statistical pitfalls that data scientists can avoid falling into.

• Cherry picking:

Cherry picking decreases the legitimacy of experimental outcomes because it reveals just one side of the picture: intentionally choosing the data points that support a specific position at the cost of other data points that reject it.

• Data Dredging:

Data dredging can be described as seeking more information from a dataset than it actually contains.

• Simpson’s Paradox:

Simpson's Paradox was named after Edward Hugh Simpson, the statistician who described this statistical phenomenon: a pattern or trend that appears when data are placed in groups but reverses or disappears when the data are combined, as the sketch below shows.
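A short pandas sketch of the well-known kidney-stone example: treatment A has the higher success rate within each subgroup, yet treatment B looks better in the aggregate:

```python
import pandas as pd

# Classic kidney-stone data: successes and totals per treatment and subgroup.
df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "stone_size": ["small", "large", "small", "large"],
    "successes": [81, 192, 234, 55],
    "total": [87, 263, 270, 80],
})

# Within each subgroup, treatment A has the higher success rate...
by_group = df.assign(rate=df.successes / df.total)
print(by_group[["treatment", "stone_size", "rate"]])

# ...but aggregated over subgroups, treatment B looks better: the paradox.
overall = df.groupby("treatment")[["successes", "total"]].sum()
print(overall.successes / overall.total)
```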
• Survivorship Bias:

Survivorship bias is the error of drawing conclusions from incomplete knowledge or data, i.e., from only the cases that "survived" some selection process.

• Gambler’s Fallacy:

The Gambler's Fallacy is the mistaken belief that because something has happened more often than usual lately, it is now less likely to happen (and vice-versa).

"When drawing inferences from data, people tend to have a gut feeling based
on previous experiences rather than logical explanations."

• Overfitting:

Overfitting refers to the process of designing an overly complex model that is excessively customized to the training dataset and does not perform well on new, unseen data, as the sketch below demonstrates.
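A scikit-learn sketch (synthetic data; the polynomial degrees are arbitrary) showing a model that fits the training data almost perfectly but generalizes poorly to held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy data generated from a simple underlying relationship.
rng = np.random.default_rng(seed=1)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A degree-15 polynomial tracks the training data almost perfectly
# but scores much worse on held-out data: overfitting.
for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree}: train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
```
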
Conclusion

Following the above assessment of the capabilities and impact of statistics, our conclusion is:

The role of statistics in Data Science is immeasurable compared to that of other fields. This shows in particular in the areas of data acquisition and enrichment, as well as in the advanced modelling needed for prediction.

Scientific results based on appropriate approaches can only be achieved by complementing and/or combining mathematical methods and computational algorithms with statistical reasoning, particularly for big data. It is also necessary to avoid the statistical pitfalls above in order to arrive at effective solutions in data science.

References:

Data Science: the impact of statistics: https://link.springer.com
Role of statistics in data science: https://www.topcoder.com/role-of-statistics-in-data-science/
https://en.wikipedia.org/wiki/Data_science
https://asq.org/quality-resources/design-of-experiments
https://towardsdatascience.com/statistical-pitfalls-in-data-science
