MODULE 1

Introduction to Data Science


Introduction: What is Data Science? Big Data and
Data Science hype, and getting past the hype.
Why now? Datafication, the current landscape of
perspectives, and the skill sets needed.
Statistical Inference: Populations and samples,
statistical modelling, probability distributions,
fitting a model.
What is Data Science?

Data science is the study of data to extract meaningful
insights for business.

It is a multidisciplinary approach that combines
principles and practices from the fields of mathematics,
statistics, artificial intelligence, and computer
engineering to analyze large amounts of data.

This analysis helps data scientists ask and answer
questions such as: what happened, why did it happen,
what will happen, and what can be done with the results?
Big Data and Data Science Hype

There’s a lack of definitions around the most basic terminology.

What is “Big Data” anyway? What does “data science” mean?

What is the relationship between Big Data and data science?

Is data science the science of Big Data?

Is data science only the stuff going on in companies like
Google, Facebook, and other tech companies? Why do many
people refer to Big Data as crossing disciplines (astronomy,
finance, tech, etc.) and to data science as only taking place in
tech?

Just how big is big? Or is it just a relative term? These terms
are so ambiguous, they’re well-nigh meaningless.

There’s a distinct lack of respect for the
researchers in academia and industry labs who
have been working on this kind of stuff for years,
and whose work is based on decades (in some
cases, centuries) of work by statisticians,
computer scientists, mathematicians, engineers,
and scientists of all types.

Statisticians already feel that they are studying
and working on the “Science of Data.” That’s their
bread and butter.
Why Now?

We have massive amounts of data about many aspects of
our lives, and, simultaneously, an abundance of inexpensive
computing power.

Shopping, communicating, reading news, listening to music,
searching for information, expressing our opinions: all this
is being tracked online, as most people know.

It’s not just Internet data, though; it’s finance, the medical
industry, pharmaceuticals, bioinformatics, social welfare,
government, education, retail, and the list goes on.

Examples include Amazon recommendation systems, friend
recommendations on Facebook, and film and music recommendations.
Datafication

Datafication has been described as the process of “taking all
aspects of life and turning them into data.”

As examples: Google’s augmented-reality glasses datafy the
gaze, Twitter datafies stray thoughts, and LinkedIn datafies
professional networks.

We are being datafied, or rather our actions are, and
when we “like” someone or something online, we are
intending to be datafied, or at least we should expect to be.

And when we walk around in a store, or even on the
street, we are being datafied in a completely unintentional
way, via sensors, cameras, or Google glasses.

Once we datafy things, we can transform their purpose
and turn the information into new forms of value.
The Current Landscape

Data Science Jobs

Data science jobs call for experts in computer science, statistics,
communication, and data visualization, who also have extensive
domain expertise.

A Data Science Profile

Computer science

Math

Statistics

Machine learning

Domain expertise

Communication and presentation skills

Data visualization
Statistical Inference

Imagine spending 24 hours looking out the window, and
for every minute, counting and recording the number of
people who pass by. Those counts are data captured from
the real world; using them to draw conclusions about the
world, say about foot traffic in general, is inference.

This overall process of going from the world to the data,
and then from the data back to the world, is the field of
statistical inference.

More precisely, statistical inference is the discipline that
concerns itself with the development of procedures,
methods, and theorems that allow us to extract
meaning and information from data that has been
generated by stochastic (random) processes.
Populations and Samples

Population:

It could be any set of objects or units, such as
tweets or photographs or stars.

If we could measure or extract the characteristics
of all those objects, we’d have a complete set of
observations, and the convention is to use N to
represent the total number of observations in
the population.

Samples:

When we take a sample, we take a subset of
the units of size n in order to examine the
observations to draw conclusions and make
inferences about the population.

There are different ways you might go about
getting this subset of data, and you want to be
aware of the sampling mechanism, because it can
introduce biases into the data.
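
As a minimal sketch of simple random sampling (the population here is hypothetical, just unit IDs standing in for tweets or photographs):

import random

# Hypothetical population of N = 10,000 unit IDs.
population = list(range(10_000))
N = len(population)

# Draw a simple random sample of size n: every unit is equally likely
# to be chosen, which is the easiest sampling mechanism to reason about.
n = 100
sample = random.sample(population, n)

print(f"Population size N = {N}, sample size n = {len(sample)}")

Other mechanisms (convenience samples, stratified samples) behave differently, which is why the mechanism matters.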
What is a model?

Humans try to understand the world around them by
representing it in different ways.

Architects capture attributes of buildings through blueprints
and three-dimensional, scaled-down versions.

Molecular biologists capture protein structure with three-
dimensional visualizations of the connections between amino
acids.

Statisticians and data scientists capture the uncertainty and
randomness of data-generating processes with mathematical
functions that express the shape and structure of the data
itself.

A model is our attempt to understand and represent the
nature of reality through a particular lens, be it architectural,
biological, or mathematical.
Statistical modeling

Before you get too involved with the data and start coding,
it’s useful to draw a picture of what you think the underlying
process might be with your model. What comes first? What
influences what? What causes what? What’s a test of that?

But different people think in different ways. Some prefer to
express these kinds of relationships in terms of math. The
mathematical expressions will be general enough that they
have to include parameters, but the values of these
parameters are not yet known.

In mathematical expressions, the convention is to use Greek
letters for parameters and Latin letters for data. So, for
example, if you have two columns of data, x and y, and you
think there’s a linear relationship, you’d write down
y = β₀ + β₁x. You don’t know what β₀ and β₁ are in terms of
actual numbers yet, so they’re the parameters.
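
Expressed as a sketch in Python (the names beta0 and beta1 are just placeholders for the unknown parameters):

# The linear model y = β₀ + β₁x as a function. The data x is known;
# beta0 (intercept) and beta1 (slope) are the parameters to be estimated.
def linear_model(x, beta0, beta1):
    return beta0 + beta1 * x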

Other people prefer pictures and will first draw a
diagram of data flow, possibly with arrows, showing
how things affect other things or what happens over
time. This gives them an abstract picture of the
relationships before choosing equations to express
them.

One place to start is exploratory data analysis (EDA),
which we will cover in a later section. This entails
making plots and building intuition for your particular
dataset. EDA helps out a lot, as well as trial and error
and iteration.
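
A minimal EDA sketch, assuming a numpy/matplotlib environment and a made-up dataset with a roughly linear trend:

import numpy as np
import matplotlib.pyplot as plt

# Simulated dataset: 200 (x, y) pairs; in practice you would load real data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, 200)

# A scatter plot is often the first look: does a line seem plausible?
plt.scatter(x, y, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Is the relationship roughly linear?")
plt.show()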
Probability distributions

The classical example is the height of humans,
following a normal distribution: a bell-shaped
curve, also called a Gaussian distribution,
named after Gauss.
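
The normal distribution has two parameters, the mean μ and the standard deviation σ, with density p(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²)). A minimal simulation sketch (the μ = 170 cm, σ = 10 cm values are made up for illustration):

import numpy as np

# Simulate 10,000 heights (in cm) from a normal distribution.
rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=10, size=10_000)

# The sample mean and standard deviation should land near 170 and 10.
print(heights.mean(), heights.std())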
Fitting a model

Fitting a model means that you estimate the
parameters of the model using the observed
data. You are using your data as evidence to
help approximate the real-world mathematical
process that generated the data.

Fitting the model often involves optimization
methods and algorithms, such as maximum
likelihood estimation, to help get the
parameters.

Fitting the model is when you start actually
coding: your code will read in the data, and
you’ll specify the functional form that you wrote
down on the piece of paper.

Then R or Python will use built-in optimization
methods to give you the most likely values of
the parameters given the data.
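
A minimal fitting sketch in Python, using least squares on simulated data (for a linear model with Gaussian noise, least squares coincides with maximum likelihood estimation):

import numpy as np

# Simulate data from y = 1.5 + 0.8x plus Gaussian noise, so we know
# the "true" parameters the fit should roughly recover.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, 200)

# Least-squares fit of a degree-1 polynomial (a line); polyfit returns
# coefficients highest-degree first, i.e., (slope, intercept).
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
print(f"beta0 ≈ {beta0_hat:.2f}, beta1 ≈ {beta1_hat:.2f}")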
Overfitting

Overfitting is the term used to mean that you
used a dataset to estimate the parameters of
your model, but your model isn’t that good at
capturing reality beyond your sampled data.

You might know this because you have tried to
use it to predict labels for another set of data
that you didn’t use to fit the model, and it
doesn’t do a good job, as measured by an
evaluation metric such as accuracy.
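
A minimal sketch of detecting overfitting with held-out data (all numbers here are made up): a flexible degree-9 polynomial fits the training half better than a line does, but typically does worse on the half it never saw.

import numpy as np

# Simulated data from a truly linear process plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
y = 1.5 + 0.8 * x + rng.normal(0, 0.2, 40)

# Hold out half the data: fit on one half, evaluate on the other.
x_train, x_test = x[:20], x[20:]
y_train, y_test = y[:20], y[20:]

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(degree, mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))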
