Introduction to Data
Science
MODULE 1
CONTENT
Introduction: What is Data Science?
Big Data and Data Science hype
Getting past the hype
Why now? – Datafication
Current landscape of perspectives
Skill sets needed.
Statistical Inference: Populations and samples
Statistical modelling
probability distributions
fitting a model.
Introduction: What is Data Science?
Big Data and Data Science Hype
● There’s a lack of definitions around the most basic terminology.
● There’s a distinct lack of respect for the researchers in academia and industry labs
who have been working on this kind of stuff for years
● The hype is crazy
● Statisticians already feel that they are studying and working on the “Science of
Data.”
● People have said to us, “Anything that has to call itself a science isn’t.”
Getting Past the Hype
A key tension: the difference between academic statistics and industry statistics
Why Now?
Known by people:
● Massive amount of data
● An abundance of inexpensive computing power
● Online tracking
● Data from more sectors and industries

Might not be known by people:
● Datafication: the process of our offline behavior mirroring the online data
collection revolution.
cont’d
It’s that the data itself, often in real time, becomes the building blocks of
data products.
“We’re witnessing the beginning of a massive, culturally saturated feedback
loop where our behavior changes the product and the product changes our
behavior.”
cont’d
● Infrastructure for large-scale data processing
● Memory
● Bandwidth
● Cultural acceptance of technology
● Datafication
cont’d
In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they
discuss the concept of datafication, and their example is how we quantify
friendships with “likes”: it’s the way everything we do, online or otherwise,
ends up recorded for later examination in someone’s data storage units. Or
maybe multiple storage units, and maybe also for sale.
cont’d
● Definition: a process of “taking all aspects of life and turning them into
data.”
● As examples, they mention that “Google’s augmented-reality glasses datafy
the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional
networks.”
● But when we merely browse the Web, we are unintentionally, or at least
passively, being datafied through cookies that we might or might not be
aware of.
● When we walk around in a store, or even on the street, we are being datafied in
a completely unintentional way, via sensors, cameras, or Google glasses.
cont’d
● Once we datafy things, we can transform their purpose and turn the
information into new forms of value.
Current landscape of perspectives
cont’d
● For example, on Quora there’s a discussion from 2010 about “What is Data
Science?” and here’s Metamarket CEO Mike Driscoll’s answer:
● Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and
espresso-inspired statistics.
● But data science is not merely hacking—because when hackers finish debugging
their Bash one-liners and Pig scripts, few of them care about non-Euclidean
distance metrics.
● And data science is not merely statistics, because when statisticians finish
theorizing the perfect model, few could read a tab-delimited file into R if their
job depended on it.
● Data science is the civil engineering of data. Its acolytes possess a practical
knowledge of tools and materials, coupled with a theoretical understanding of
what’s possible.
cont’d
● Nathan Yau’s 2009 post, “Rise of the Data Scientist”, describes the data scientist’s skill set, which includes:
• Statistics (traditional analysis you’re used to thinking about)
• Data munging (parsing, scraping, and formatting data)
• Visualization (graphs, tools, etc.)
cont’d
● Cosma Shalizi basically argues that any statistics department worth its salt does
all the stuff in the descriptions of data science that he sees, and therefore
data science is just a rebranding and unwelcome takeover of statistics.
Statistical Inference
cont’d
● The world we live in is complex, random, and uncertain. At the same time,
it’s one big data-generating machine.
● The processes in our lives are actually data-generating processes.
● Data represents the traces of real-world processes, and exactly which traces
we gather is decided by our data collection or sampling method.
cont’d
● After separating the process from the data collection, there are two
sources of randomness and uncertainty.
○ Namely, the randomness and uncertainty underlying the process
itself
○ the uncertainty associated with your underlying data collection
methods
cont’d
● Once you have all this data, you have somehow captured the world, or
certain traces of the world. But you can’t go walking around with a huge
Excel spreadsheet or database of millions of transactions
● So you need a new idea: simplify those captured traces into something more
comprehensible and concise, namely mathematical models or functions of the
data, known as statistical estimators.
This overall process of going from the world to the data, and then
from the data back to the world, is the field of statistical inference.
cont’d
● Statistical inference is the discipline that concerns itself with the
development of procedures, methods, and theorems that allow us to
extract meaning and information from data that has been generated by
stochastic (random) processes.
Populations and Samples
Population
● In classical statistical literature, a distinction is made between the
population and the sample.
● In statistical inference, “population” isn’t used to describe only people.
It could be any set of objects or units, such as tweets or photographs or
stars.
● If we measure or extract the characteristics of all those
objects, we’d have a complete set of observations, and the convention is
to use N to represent the total number of observations in the
population.
● Example: all the email sent last year forms the population; a single
observation (one email) gives you the list of recipients, date sent, text of
the email, sender, etc.
Sample
● A sample is a subset of the units, of size n, taken in order to examine the
observations to draw conclusions and make inferences about the
population.
● There are different ways you might go about getting this subset of data, and
you need to be careful that the way you do it doesn’t introduce bias.
● If it does, the subset is not a “mini-me” shrunk-down version of the
population, and any conclusions you draw will simply be wrong and distorted.
● Example (email): select 1/10th of the people at random and take all the email
they ever sent, as sketched below.
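A minimal sketch of that sampling step in Python; the population dictionary and its contents are made-up placeholders, not from the text:

```python
import random

# Hypothetical population: each person mapped to the list of emails they sent last year
population = {f"person_{i}": [f"email_{i}_{j}" for j in range(20)] for i in range(1000)}

random.seed(42)                    # reproducible sample
k = len(population) // 10          # roughly 1/10th of the people, chosen at random
sampled_people = random.sample(sorted(population), k)

# The sample: every email ever sent by the selected people
sample_emails = [email for person in sampled_people for email in population[person]]
print(len(sampled_people), len(sample_emails))
```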
Modeling
Introduction
● “Model” can mean different things. Data models are the representation one
chooses to store one’s data, which is the realm of database managers; statistical
models, the focus here, describe the process that generates the data.
Statistical modeling
● Before you get too involved with the data and start coding, it’s useful to draw a
picture of what you think the underlying process might be with your model.
What comes first? What influences what? What causes what? What’s a test of
that?
● Some prefer to express these kinds of relationships in terms of math. The
mathematical expressions will be general enough that they have to include
parameters, but the values of these parameters are not yet known.
● In mathematical expressions, the convention is to use Greek letters for
parameters and Latin letters for data. So, for example, if you have two columns
of data, x and y, and you think there’s a linear relationship, you’d write down
y = β0 + β1x. You don’t know what β0 and β1 are in terms of actual numbers yet, so
they’re the parameters.
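As a tiny illustration in Python, here is the functional form written down before fitting; the parameter values 1.0 and 2.0 are hypothetical, purely for illustration:

```python
# The functional form y = beta0 + beta1 * x, written before the parameters are known.
# beta0 and beta1 play the role of the Greek-letter parameters; x is the data.
def linear_model(x, beta0, beta1):
    return beta0 + beta1 * x

# With hypothetical parameter values beta0 = 1.0 and beta1 = 2.0:
print(linear_model(3.0, 1.0, 2.0))  # y = 1.0 + 2.0 * 3.0 = 7.0
```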
Statistical modeling
● Other people prefer pictures and will first draw a diagram of data flow,
possibly with arrows, showing how things affect other things or what
happens over time. This gives them an abstract picture of the
relationships before choosing equations to express them.
● One place to start is Exploratory Data Analysis (EDA). This entails making
plots and building intuition for your particular dataset; a minimal sketch
follows below.
● EDA helps out a lot, as well as trial and error and iteration.
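A minimal EDA sketch in Python, assuming pandas and matplotlib are available; the file name data.csv and the columns x and y are placeholders, not from the text:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with two columns, x and y
df = pd.read_csv("data.csv")        # placeholder file name

# Build intuition: summary statistics first, then a couple of plots
print(df.describe())

df["x"].hist(bins=30)               # how is x distributed?
plt.xlabel("x")
plt.show()

df.plot.scatter(x="x", y="y")       # does y look roughly linear in x?
plt.show()
```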
Probability distributions
● Probability distributions are the foundation of statistical models.
● A random variable denoted by x or y can be assumed to have a corresponding
probability distribution, p(x), which maps x to a positive real number.
● In order to be a probability density function, we’re restricted to the set of
functions such that if we integrate p(x) to get the area under the curve, it
is 1, so it can be interpreted as probability.
Probability distributions - example 1
● For example, let x be the amount of time until the next bus arrives (measured
in minutes). x is a random variable because there is variation and uncertainty
in the amount of time until the next bus.
● Suppose we know that the time until the next bus has a probability density
function of p(x) = 2e^(−2x).
● If we want to know the likelihood of the next bus arriving in between 12 and
13 minutes, then we find the area under the curve of p(x) between 12 and 13.
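A minimal sketch of that area-under-the-curve calculation in Python, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import integrate

# Density from the example: p(x) = 2 * exp(-2x) for x >= 0
def p(x):
    return 2 * np.exp(-2 * x)

# Sanity check: a probability density must integrate to 1 over its support
total, _ = integrate.quad(p, 0, np.inf)
print(total)                         # ~1.0

# Probability the next bus arrives between 12 and 13 minutes:
# the area under the curve between those two points
prob, _ = integrate.quad(p, 12, 13)
print(prob)                          # equals exp(-2*12) - exp(-2*13) in closed form
```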
Probability distributions - example 2
● If we consider X to be the random variable that represents the amount
of money spent, then we can look at the distribution of money spent
across all users, and represent it as p(X).
● We can then take the subset of users who looked at more than five items
before buying anything, and look at the distribution of money spent
among these users.
● Let Y be the random variable that represents the number of items looked at;
then p(X | Y > 5) would be the corresponding conditional distribution (see the
sketch after this list).
● Note that a conditional distribution has the same properties as a regular
distribution in that when we integrate it, it sums to 1 and has to take
nonnegative values.
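A minimal sketch of looking at that conditional distribution empirically with pandas; the numbers and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical user-level data: money spent and number of items looked at
df = pd.DataFrame({
    "money_spent": [0.0, 12.5, 3.0, 40.0, 7.5, 55.0, 22.0, 0.0],
    "items_viewed": [1, 7, 2, 9, 3, 12, 6, 4],
})

# Empirical view of p(X): money spent across all users
print(df["money_spent"].describe())

# Empirical view of p(X | Y > 5): money spent among users who viewed more than five items
print(df.loc[df["items_viewed"] > 5, "money_spent"].describe())
```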
Probability distributions - example 3
● When we observe data points, i.e., (x1, y1), (x2, y2), . . . , (xn, yn), we are
observing realizations of a pair of random variables.
● When we have an entire dataset with n rows and k columns, we are
observing n realizations of the joint distribution of those k random
variables.
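A tiny illustration with made-up numbers: an n-by-k table read as n realizations of a k-dimensional joint distribution.

```python
import pandas as pd

# Each row (x_i, y_i) is one realization of the pair of random variables (X, Y)
data = pd.DataFrame({
    "x": [1.2, 0.7, 3.4, 2.1],
    "y": [10.0, 6.5, 25.3, 15.8],
})

n, k = data.shape
print(n, k)   # 4 realizations drawn from the joint distribution of 2 random variables
```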
Fitting a model
● Fitting a model means that you estimate the parameters of the model using
the observed data.
● You are using your data as evidence to help approximate the real-world
mathematical process that generated the data.
● Fitting the model often involves optimization methods and algorithms, such
as maximum likelihood estimation, to help get the parameters.
● When you estimate the parameters, they are actually estimators, meaning
they themselves are functions of the data.
● Fitting the model is when you start actually coding: your code will read in the
data, and you’ll specify the functional form that you wrote down on the piece
of paper. Then R or Python will use built-in optimization methods to give you
the most likely values of the parameters given the data.
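A minimal fitting sketch in Python with NumPy, using simulated data in place of real observations; least squares is used as the optimization step (it coincides with maximum likelihood under Gaussian noise):

```python
import numpy as np

# Simulated data standing in for observed (x_i, y_i); in practice you read data in
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 2.0 * x + rng.normal(0, 1, size=100)   # hypothetical true beta0=1.5, beta1=2.0

# Specify the functional form y = beta0 + beta1 * x via a design matrix with an intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit: the estimators are functions of the observed data
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0_hat, beta1_hat = beta_hat
print(beta0_hat, beta1_hat)                      # should land near 1.5 and 2.0
```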