Module 1 introduces data science as a multidisciplinary field focused on extracting insights from data using statistical inference, modeling, and analysis. It discusses the current landscape of data science jobs, the importance of datafication, and the relationship between big data and data science. Key concepts include statistical inference, populations and samples, modeling, and the challenges of overfitting in predictive models.
MODULE 1
Introduction to Data Science
Introduction: What is Data Science? Big Data and Data Science hype, and getting past the hype. Why now? Datafication. Current landscape of perspectives. Skill sets needed. Statistical Inference: populations and samples, statistical modelling, probability distributions, fitting a model.

What is Data Science?
● Data science is the study of data to extract meaningful insights for business.
● It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data.
● This analysis helps data scientists ask and answer questions such as what happened, why it happened, what will happen, and what can be done with the results.

Big Data and Data Science Hype
● There is a lack of definitions around the most basic terminology. What is "Big Data" anyway? What does "data science" mean? What is the relationship between Big Data and data science? Is data science the science of Big Data?
● Is data science only what goes on in tech companies like Google and Facebook? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) but to data science as taking place only in tech?
● Just how big is big? Or is "big" just a relative term? These terms are so ambiguous that they are well-nigh meaningless.
● There is a distinct lack of respect for the researchers in academia and industry labs who have been working on these problems for years, and whose work builds on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types.
● Statisticians already feel that they are studying and working on the "Science of Data." That is their bread and butter.

Why Now?
● We have massive amounts of data about many aspects of our lives and, simultaneously, an abundance of inexpensive computing power.
● Shopping, communicating, reading news, listening to music, searching for information, expressing our opinions: all of this is being tracked online, as most people know.
● It is not just Internet data, though; it is finance, the medical industry, pharmaceuticals, bioinformatics, social welfare, government, education, retail, and the list goes on.
● Examples include Amazon's recommendation systems, friend recommendations on Facebook, and film and music recommendations.

Datafication
● Datafication is the process of "taking all aspects of life and turning them into data."
● As examples, Google's augmented-reality glasses datafy the gaze, Twitter datafies stray thoughts, and LinkedIn datafies professional networks.
● We are being datafied, or rather our actions are. When we "like" someone or something online, we intend to be datafied, or at least we should expect to be.
● When we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors, cameras, or Google glasses.
● Once we datafy things, we can transform their purpose and turn the information into new forms of value.

The Current Landscape
➔ Data Science Jobs
• Data scientists are expected to be experts in computer science, statistics, communication, and data visualization, and to have extensive domain expertise.
➔ A Data Science Profile
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization

Statistical Inference
● Imagine spending 24 hours looking out the window and, for every minute, counting and recording the number of people who pass by.
● This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.
● More precisely, statistical inference is the discipline concerned with developing the procedures, methods, and theorems that allow us to extract meaning and information from data generated by stochastic (random) processes.

Populations and Samples
➔ Population:
• A population can be any set of objects or units, such as tweets, photographs, or stars.
• If we could measure or extract the characteristics of all those objects, we would have a complete set of observations, and the convention is to use N to represent the total number of observations in the population.
➔ Samples:
• When we take a sample, we take a subset of the units, of size n, in order to examine the observations and draw conclusions and make inferences about the population.
• There are different ways you might go about getting this subset of data, and you want to be aware of the sampling mechanism.

What is a model?
● Humans try to understand the world around them by representing it in different ways.
● Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions.
● Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids.
● Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself.
● A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.

Statistical modeling
● Before you get too involved with the data and start coding, it is useful to draw a picture of what you think the underlying process might be with your model. What comes first? What influences what? What causes what? What is a test of that?
● Different people think in different ways. Some prefer to express these kinds of relationships in terms of math.
● The mathematical expressions will be general enough that they have to include parameters, but the values of these parameters are not yet known.
● In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data. So, for example, if you have two columns of data, x and y, and you think there is a linear relationship, you would write down y = β0 + β1x. You do not yet know the actual values of β0 and β1, so they are the parameters.
● Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time. This gives them an abstract picture of the relationships before choosing equations to express them.
● One place to start is exploratory data analysis (EDA), covered in a later section. This entails making plots and building intuition for your particular dataset. EDA helps a lot, as do trial and error and iteration.

Probability distributions
● The classical example is the height of humans, which follows a normal distribution: a bell-shaped curve, also called a Gaussian distribution, named after Gauss.

Fitting a model
● Fitting a model means estimating the parameters of the model using the observed data. You are using your data as evidence to help approximate the real-world mathematical process that generated the data.
● Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to estimate the parameters.
● Fitting the model is when you start actually coding: your code reads in the data, and you specify the functional form that you wrote down on paper.
● Then R or Python uses built-in optimization methods to give you the most likely values of the parameters given the data.
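The fitting step described above can be sketched in a few lines of Python. This is a minimal illustration, not from the source: it generates synthetic data from an assumed linear process with known parameters (β0 = 2, β1 = 3, chosen only for the demo) and recovers them with NumPy's least-squares polynomial fit.

```python
import numpy as np

# Synthetic data from a known linear process: y = 2 + 3x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=100)

# "Fitting the model": estimate the parameters β0 and β1 by least squares.
# np.polyfit returns coefficients from highest degree down: [β1, β0].
beta1, beta0 = np.polyfit(x, y, deg=1)

print(f"estimated intercept β0 ≈ {beta0:.2f}, slope β1 ≈ {beta1:.2f}")
```

The estimates should land close to the true values of 2 and 3, which illustrates the point above: with Gaussian noise, least squares coincides with maximum likelihood estimation.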
Overfitting
● Overfitting means that you used a dataset to estimate the parameters of your model, but the model is not good at capturing reality beyond your sampled data.
● You might discover this by using the model to predict labels for another set of data that you did not use to fit it, and finding that it does a poor job, as measured by an evaluation metric such as accuracy.
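Overfitting can be seen directly by evaluating a model on data it was not fit on, as described above. The sketch below (not from the source; the sine-shaped process, noise level, and polynomial degrees are arbitrary illustrative choices) fits a modest and a very flexible polynomial to a small training set and compares their errors on a held-out test set.

```python
import numpy as np

# Small training set and larger held-out test set from the same noisy process.
rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def held_out_mse(degree):
    # Fit a polynomial of the given degree to the training data,
    # then measure mean squared error on the unseen test data.
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_test)
    return float(np.mean((pred - y_test) ** 2))

mse_simple = held_out_mse(3)     # modest model
mse_flexible = held_out_mse(12)  # nearly interpolates the 15 training points

print(f"test MSE, degree 3:  {mse_simple:.3f}")
print(f"test MSE, degree 12: {mse_flexible:.3f}")
```

The degree-12 polynomial hugs the 15 training points almost exactly, yet does worse on the held-out data than the simpler fit: it has captured the noise in the sample, not the process that generated it.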