Introduction to Data
Science
MODULE 1
CONTENT
Introduction: What is Data Science?
Big Data and Data Science hype
Getting past the hype
Why now? – Datafication
Current landscape of perspectives
Skill sets needed.
Statistical Inference: Populations and samples
Statistical modelling
probability distributions
fitting a model.
Introduction: What is Data Science?
Big Data and Data Science Hype
● There’s a lack of definitions around the most basic terminology.
● There’s a distinct lack of respect for the researchers in academia and industry labs
who have been working on this kind of stuff for years
● The hype is crazy
● Statisticians already feel that they are studying and working on the “Science of
Data.”
● People have said to us, “Anything that has to call itself a science isn’t.”
Getting Past the Hype
A key tension: the difference between academic statistics and industry statistics
Why Now?
Known by people:
● Massive amount of data
● An abundance of inexpensive computing power
● Online tracking
● Data from more sectors and industries

Might not be known by people:
● Datafication: the process of our offline behavior mirroring the online data
collection revolution.
cont’d
It’s that the data itself, often in real time, becomes the building blocks of
data products.
“We’re witnessing the beginning of a massive, culturally saturated feedback
loop where our behavior changes the product and the product changes our
behavior.”
cont’d
● Infrastructure for large-scale data processing
● Memory
● Bandwidth
● Cultural acceptance of technology
● Datafication
cont’d
In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they
discuss the concept of datafication, and their example is how we quantify
friendships with “likes”: it’s the way everything we do, online or otherwise,
ends up recorded for later examination in someone’s data storage units. Or
maybe multiple storage units, and maybe also for sale.
cont’d
● Definition: a process of “taking all aspects of life and turning them into
data.”
● As examples, they mention that “Google’s augmented-reality glasses datafy
the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional
networks.”
● But when we merely browse the Web, we are unintentionally, or at least
passively, being datafied through cookies that we might or might not be
aware of.
● When we walk around in a store, or even on the street, we are being datafied in
a completely unintentional way, via sensors, cameras, or Google glasses.
cont’d
● Once we datafy things, we can transform their purpose and turn the
information into new forms of value.
Current landscape of perspectives
cont’d
● For example, on Quora there’s a discussion from 2010 about “What is Data
Science?” and here’s Metamarket CEO Mike Driscoll’s answer:
● Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and
espresso-inspired statistics.
● But data science is not merely hacking—because when hackers finish debugging
their Bash one-liners and Pig scripts, few of them care about non-Euclidean
distance metrics.
● And data science is not merely statistics, because when statisticians finish
theorizing the perfect model, few could read a tab-delimited file into R if their
job depended on it.
● Data science is the civil engineering of data. Its acolytes possess a practical
knowledge of tools and materials, coupled with a theoretical understanding of
what’s possible.
cont’d
● Nathan Yau’s 2009 post, “Rise of the Data Scientist”, describes the data scientist’s skill set, which includes:
• Statistics (traditional analysis you’re used to thinking about)
• Data munging (parsing, scraping, and formatting data)
• Visualization (graphs, tools, etc.)
cont’d
● Cosma Shalizi basically argues that any statistics department worth its salt does
all the stuff in the descriptions of data science that he sees, and therefore
data science is just a rebranding and unwelcome takeover of statistics.
Statistical Inference
cont’d
● The world we live in is complex, random, and uncertain. At the same time,
it’s one big data-generating machine.
● The processes in our lives are actually data-generating processes.
● Data represents the traces of real-world processes, and exactly which traces
we gather is decided by our data collection or sampling method.
cont’d
● After separating the process from the data collection, there are two
sources of randomness and uncertainty.
○ Namely, the randomness and uncertainty underlying the process
itself
○ the uncertainty associated with your underlying data collection
methods
cont’d
● Once you have all this data, you have somehow captured the world, or
certain traces of the world. But you can’t go walking around with a huge
Excel spreadsheet or database of millions of transactions
● So you need a new idea: simplify those captured traces into something more
comprehensible and concise, namely mathematical models or functions of the
data, known as statistical estimators.
This overall process of going from the world to the data, and then
from the data back to the world, is the field of statistical inference.
cont’d
● Statistical inference is the discipline that concerns itself with the
development of procedures, methods, and theorems that allow us to
extract meaning and information from data that has been generated by
stochastic (random) processes.
Populations and Samples
Population
● In classical statistical literature, a distinction is made between the
population and the sample.
● In statistical inference, “population” isn’t used to describe only people.
It could be any set of objects or units, such as tweets or photographs or
stars.
● If we measure or extract the characteristics of all those
objects, we’d have a complete set of observations, and the convention is
to use N to represent the total number of observations in the
population.
● Example: all the email sent last year forms the population; a single
observation (one email) gives you the list of recipients, date sent, text of
the email, sender, etc.
Sample
● A sample is a subset of the units, of size n, taken in order to examine the
observations to draw conclusions and make inferences about the
population.
● There are different ways you might go about getting this subset of data, and
you need to be careful that the way you do it doesn’t introduce bias.
● If it does, the subset is not a “mini-me” shrunk-down version of the
population, and any conclusions you draw will simply be wrong and distorted.
● Example (email): select 1/10th of the people at random and take all the email
they ever sent, as sketched below.
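A minimal sketch of that sampling step in Python; the population dictionary and its contents are made-up placeholders, not from the text:

```python
import random

# Hypothetical population: each person mapped to the list of emails they sent last year
population = {f"person_{i}": [f"email_{i}_{j}" for j in range(20)] for i in range(1000)}

random.seed(42)                    # reproducible sample
k = len(population) // 10          # roughly 1/10th of the people, chosen at random
sampled_people = random.sample(sorted(population), k)

# The sample: every email ever sent by the selected people
sample_emails = [email for person in sampled_people for email in population[person]]
print(len(sampled_people), len(sample_emails))
```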
Modeling
Introduction
● “Model” can mean different things. Data models are the representation one
chooses to store one’s data, which is the realm of database managers; statistical
models, the focus here, describe the process that generates the data.
Statistical modeling
● Before you get too involved with the data and start coding, it’s useful to draw a
picture of what you think the underlying process might be with your model.
What comes first? What influences what? What causes what? What’s a test of
that?
● Some prefer to express these kinds of relationships in terms of math. The
mathematical expressions will be general enough that they have to include
parameters, but the values of these parameters are not yet known.
● In mathematical expressions, the convention is to use Greek letters for
parameters and Latin letters for data. So, for example, if you have two columns
of data, x and y, and you think there’s a linear relationship, you’d write down
y = β0 + β1x. You don’t know what β0 and β1 are in terms of actual numbers yet, so
they’re the parameters.
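As a tiny illustration in Python, here is the functional form written down before fitting; the parameter values 1.0 and 2.0 are hypothetical, purely for illustration:

```python
# The functional form y = beta0 + beta1 * x, written before the parameters are known.
# beta0 and beta1 play the role of the Greek-letter parameters; x is the data.
def linear_model(x, beta0, beta1):
    return beta0 + beta1 * x

# With hypothetical parameter values beta0 = 1.0 and beta1 = 2.0:
print(linear_model(3.0, 1.0, 2.0))  # y = 1.0 + 2.0 * 3.0 = 7.0
```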
Statistical modeling
● Other people prefer pictures and will first draw a diagram of data flow,
possibly with arrows, showing how things affect other things or what
happens over time. This gives them an abstract picture of the
relationships before choosing equations to express them.
● One place to start is Exploratory Data Analysis (EDA). This entails making
plots and building intuition for your particular dataset; a minimal sketch
follows below.
● EDA helps out a lot, as well as trial and error and iteration.
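A minimal EDA sketch in Python, assuming pandas and matplotlib are available; the file name data.csv and the columns x and y are placeholders, not from the text:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with two columns, x and y
df = pd.read_csv("data.csv")        # placeholder file name

# Build intuition: summary statistics first, then a couple of plots
print(df.describe())

df["x"].hist(bins=30)               # how is x distributed?
plt.xlabel("x")
plt.show()

df.plot.scatter(x="x", y="y")       # does y look roughly linear in x?
plt.show()
```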
Probability distributions
● Probability distributions are the foundation of statistical models.
● A random variable denoted by x or y can be assumed to have a corresponding
probability distribution, p(x), which maps x to a positive real number.
● In order to be a probability density function, we’re restricted to the set of
functions such that if we integrate p(x) to get the area under the curve, it
is 1, so it can be interpreted as probability.
Probability distributions - example 1
● For example, let x be the amount of time until the next bus arrives (measured
in minutes). x is a random variable because there is variation and uncertainty
in the amount of time until the next bus.
● Suppose we know that the time until the next bus has a probability density
function of p(x) = 2e^(−2x).
● If we want to know the likelihood of the next bus arriving in between 12 and
13 minutes, then we find the area under the curve of p(x) between 12 and 13.
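A minimal sketch of that area-under-the-curve calculation in Python, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import integrate

# Density from the example: p(x) = 2 * exp(-2x) for x >= 0
def p(x):
    return 2 * np.exp(-2 * x)

# Sanity check: a probability density must integrate to 1 over its support
total, _ = integrate.quad(p, 0, np.inf)
print(total)                         # ~1.0

# Probability the next bus arrives between 12 and 13 minutes:
# the area under the curve between those two points
prob, _ = integrate.quad(p, 12, 13)
print(prob)                          # equals exp(-2*12) - exp(-2*13) in closed form
```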
Probability distributions - example 2
● If we consider X to be the random variable that represents the amount
of money spent, then we can look at the distribution of money spent
across all users, and represent it as p(X).
● We can then take the subset of users who looked at more than five items
before buying anything, and look at the distribution of money spent
among these users.
● Let Y be the random variable that represents the number of items looked at;
then p(X | Y > 5) would be the corresponding conditional distribution (see the
sketch after this list).
● Note that a conditional distribution has the same properties as a regular
distribution in that when we integrate it, it sums to 1 and has to take
nonnegative values.
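A minimal sketch of looking at that conditional distribution empirically with pandas; the numbers and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical user-level data: money spent and number of items looked at
df = pd.DataFrame({
    "money_spent": [0.0, 12.5, 3.0, 40.0, 7.5, 55.0, 22.0, 0.0],
    "items_viewed": [1, 7, 2, 9, 3, 12, 6, 4],
})

# Empirical view of p(X): money spent across all users
print(df["money_spent"].describe())

# Empirical view of p(X | Y > 5): money spent among users who viewed more than five items
print(df.loc[df["items_viewed"] > 5, "money_spent"].describe())
```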
Probability distributions - example 3
● When we observe data points, i.e., (x1, y1), (x2, y2), . . . , (xn, yn), we are
observing realizations of a pair of random variables.
● When we have an entire dataset with n rows and k columns, we are
observing n realizations of the joint distribution of those k random
variables.
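A tiny illustration with made-up numbers: an n-by-k table read as n realizations of a k-dimensional joint distribution.

```python
import pandas as pd

# Each row (x_i, y_i) is one realization of the pair of random variables (X, Y)
data = pd.DataFrame({
    "x": [1.2, 0.7, 3.4, 2.1],
    "y": [10.0, 6.5, 25.3, 15.8],
})

n, k = data.shape
print(n, k)   # 4 realizations drawn from the joint distribution of 2 random variables
```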
Fitting a model
● Fitting a model means that you estimate the parameters of the model using
the observed data.
● You are using your data as evidence to help approximate the real-world
mathematical process that generated the data.
● Fitting the model often involves optimization methods and algorithms, such
as maximum likelihood estimation, to help get the parameters.
● When you estimate the parameters, they are actually estimators, meaning
they themselves are functions of the data.
● Fitting the model is when you start actually coding: your code will read in the
data, and you’ll specify the functional form that you wrote down on the piece
of paper. Then R or Python will use built-in optimization methods to give you
the most likely values of the parameters given the data.
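A minimal fitting sketch in Python with NumPy, using simulated data in place of real observations; least squares is used as the optimization step (it coincides with maximum likelihood under Gaussian noise):

```python
import numpy as np

# Simulated data standing in for observed (x_i, y_i); in practice you read data in
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 2.0 * x + rng.normal(0, 1, size=100)   # hypothetical true beta0=1.5, beta1=2.0

# Specify the functional form y = beta0 + beta1 * x via a design matrix with an intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit: the estimators are functions of the observed data
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0_hat, beta1_hat = beta_hat
print(beta0_hat, beta1_hat)                      # should land near 1.5 and 2.0
```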