MDS 202: Data Science II with R

Lecture 1 - What is Data Science?∗

Dr. Shatrughan Singh†

Week 1 (6-10 February) 2023

1 LEARNING OBJECTIVES
1.1 Prerequisites for this course
• A general understanding of computer and data systems.
• A basic understanding of how smartphones and other day-to-day life devices work.

1.2 What are you expected to learn from this lecture?

• Definitions and notions of data science.
• How data science is related to other disciplines.
• Computational thinking – a way to solve problems systematically.
• What skills data scientists need.

2 DATA SCIENCE
Data science is a field of study that combines subject-domain expertise, strong computing skills (including
programming in languages such as R, Python, C, and Fortran), and knowledge of statistical and mathematical
principles to draw meaningful insights from a given dataset. Using these skills, a data scientist is expected
to discover hidden patterns in raw data by blending statistical tools and techniques, newly developed
algorithms, and machine learning principles. Hence, a data scientist is a person with multi-disciplinary
expertise spanning mathematics, statistics, IT, and the subject domain.
A data scientist must also possess some non-mathematical skills to understand the patterns in the data and
interpret the results in light of the stated objectives. In general, a data scientist oversees the four
‘A’s of data: Data Architecture, Data Acquisition, Data Analysis, and Data Archiving. In a nutshell, the
following are the skills a data scientist should possess to be successful.
• Must possess knowledge of the application domain
• Must possess strong communication skills to translate back and forth between the data analysis team
and the end user
• Must have an eye for the larger picture of a complex system
• Must have a good understanding of metadata (data that describes the storage and details of other
data)
• Must possess the skills to transform, summarize, and interpret results to draw meaningful inferences
• Must possess strong skills in data display and presentation
∗ Chirag Shah, A Hands-On Introduction to Data Science, Cambridge University Press, 2020
† Amity University Rajasthan (Jaipur), [email protected]

Dr. S. Singh MDS 202: Lec —> 01-03 1


• Must pay attention to the finer details of the data, must not compromise on data quality, and must be
able to suggest ways to improve data quality when the data are poor
• Must practice ethical reasoning while analyzing data or communicating the results of the analysis

Figure 1: Data Science is a field at the intersection of various expertise and domains as shown above.

Why is data science so important now?


The answer is not surprising: we already have a lot of data; we continue to generate a staggering amount of
it at an unprecedented and ever-increasing speed; analyzing data wisely requires competent, well-trained
practitioners; and such analysis can provide actionable insights.
The “5V model” attempts to lay this out in a simple (and catchy) way. These are the five Vs:
1. Velocity: The speed at which data is accumulated.
2. Volume: The size and scope of the data.
3. Variety: The massive array of data and types (structured and unstructured).
4. Veracity: The quality of data being accumulated.
5. Value: Worth of data-generated information for decision making.
Each of these five Vs has increased dramatically in recent years. In particular, the growing volume of
heterogeneous and unstructured data (text, images, and video), together with the possibilities emerging
from its analysis, makes data science ever more essential. Figure 2 shows that the volume of data is
expected to reach 180 zettabytes (ZB) by the end of 2025, which is 90 times the volume available at the
beginning of 2010.

Figure 2: Increase of data volume in last 20 years.

(Source: https://www.statista.com/statistics/871513/worldwide-data-created/, June 2021)


To put this in perspective:
If your computer has a 1 terabyte (TB) hard drive (roughly 1000 GB), 180 ZB is 180 billion times that capacity.
From a different perspective, the world population was around 8 billion at the end of 2022, which means that,
per person, 180 ZB amounts to roughly 22.5 TB of data for every individual in the world.
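The unit conversions behind these comparisons can be checked with a few lines of R. This is only a sketch: it assumes decimal (SI) units, 1 ZB = 10^21 bytes and 1 TB = 10^12 bytes, and the variable names are illustrative rather than taken from the lecture.

```r
# Assumed SI definitions of the units
zb_in_bytes <- 1e21   # 1 zettabyte
tb_in_bytes <- 1e12   # 1 terabyte

projected_zb     <- 180   # projected data volume by the end of 2025
world_population <- 8e9   # approximate population at the end of 2022

# How many 1 TB hard drives would hold 180 ZB?
drives_1tb <- projected_zb * zb_in_bytes / tb_in_bytes

# Share of that volume per person, in TB
tb_per_person <- drives_1tb / world_population

# Volume implied for early 2010 by the 90-fold increase, in ZB
volume_2010_zb <- projected_zb / 90

cat("1 TB drives needed:", format(drives_1tb, big.mark = ","), "\n")
cat("TB per person:", tb_per_person, "\n")
cat("Implied 2010 volume (ZB):", volume_2010_zb, "\n")
```

Under these assumptions the drive count is 180 billion, the per-person share is about 22.5 TB, and the implied 2010 volume is about 2 ZB.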

3 DATA SCIENTIST APPLICATIONS IN VARIOUS SECTORS


What do financial data scientists do?
Through capturing and analyzing new sources of data, building predictive models and running real-time
simulations of market events, they help the finance industry obtain the information necessary to make
accurate predictions.
Example: Fraud detection and risk reduction (customer profiling, past expenditures, and other essential
variables that can be used to analyze the probabilities of risk and default.)
What do policy makers do?
Data science helps governments and agencies gain insights into citizen behaviors that affect the quality of pub-
lic life, including traffic, public transportation, social welfare, community wellbeing, etc. This information,
or data, can be used to develop plans that address the betterment of these areas.
Example: How does an NGO use data to estimate the size of a temporary refugee camp in war zones to
organize the provision of help?
How do DS assist in politics?
Data scientists have been quite successful in constructing the most accurate voter targeting models and
increasing voter participation. In 2016, the campaign to elect Donald Trump was a brilliant example of the
use of data science in social media to tailor individual messages to individual people.

Example: The dark side of this is the infamous Cambridge Analytica data scandal that surfaced in March
2018. This data analytics firm obtained data on approximately 87 million Facebook users from an academic
researcher in order to target political ads during the 2016 US presidential campaign.
How do DS assist in Healthcare?
Healthcare is another area in which data scientists keep changing their research approach and practices.
Though the medical industry has always stored data (e.g., clinical studies, insurance information, hospital
records), the healthcare industry is now awash in an unprecedented amount of information.
Example: Apple has partnered with Stanford Medicine to collect and analyze data from Apple Watch
to identify irregular heart rhythms, including those from potentially serious heart conditions such as atrial
fibrillation, which is a leading cause of stroke.
How do DS assist in Education?
Technology will definitely have a large part to play in the future of education, but how exactly that happens
is still an open question. There is a growing realization among educators and technology evangelists that we
are heading toward more data-driven and personalized use of technology in education. And some of that is
already happening.
Example: Online tools enable evaluation of a much wider range of student actions, such as how long they
devote to readings, where they get electronic resources, and how quickly they master key concepts.

3.1 Data scientists should have at least three basic skills:


1. A strong knowledge of basic statistics and machine learning – or at least enough to avoid mistaking
correlation for causation or extrapolating too much from a small sample size.
2. The computer science skills to take an unruly dataset and use a programming language (like R or
Python) to make it easy to analyze.
3. The ability to visualize and express their data and analysis in a way that is meaningful to somebody
less conversant in data.
Note to Remember:
* Machine learning is certainly a very crucial part of data science today, and it is hard
to do meaningful data science in most domains without at least basic knowledge of machine
learning.

* Data is something raw, meaningless, an object that, when analyzed or converted to a
useful form, becomes information.

3.2 Some Basic Definitions:


1. Data: Information that is factual, such as measurements or statistics, which can be used as a basis
for reasoning, discussion, or prediction.
2. Information: Data that are endowed with meaning and purpose.
3. Science: The systematic study of the structure and behavior of the physical and natural world through
observations and experiments.
4. Data Science: The field of study and practice that involves collection, storage, and processing of data
in order to derive important insights into a problem or a phenomenon.
5. Information Science: A thorough understanding of information considering different contexts and
circumstances related to the data that is created, generated, and shared, mostly by human beings.
6. Business Analytics: The skills, technologies, and practices for continuous iterative exploration and
investigation of past and current business performance to gain insight and be strategic.

7. Computational Thinking: This is a process of using abstraction and decomposition when attacking
a large complex task or designing a large complex system.

4 HANDS-ON EXAMPLE: ANALYZING DATA


Table 1: Average height and weight of Indian Women.

Observation    Height (cms)    Weight (kgs)
1              147             51
2              149             52
3              151             53
4              153             54
5              155             55
6              157             56
7              159             57
8              163             60
9              167             63
10             171             66
11             175             69
12             179             72
13             183             75
14             187             78
15             191             81

Question: On average, how much increase can we expect in weight with an increase of one cm in height?
A simple method is to compute the differences in height (191 − 147 = 44 cms) and weight (81 − 51 = 30
kgs), then divide the weight difference by the height difference, that is, 30/44, leading to 0.68. In other
words, we see that, on average, one cm of height difference leads to a difference of 0.68 kgs in weight.
height <- c(147, 149, 151, 153, 155, 157, 159, 163, 167, 171, 175, 179, 183, 187, 191)
weight <- c(51, 52, 53, 54, 55, 56, 57, 60, 63, 66, 69, 72, 75, 78, 81)
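The endpoint calculation described above can be reproduced directly from the two vectors. The sketch below also computes the slope between each pair of consecutive observations, which is an addition for illustration and not part of the original method. The names `endpoint_slope` and `local_slopes` are hypothetical.

```r
height <- c(147, 149, 151, 153, 155, 157, 159, 163, 167, 171, 175, 179, 183, 187, 191)
weight <- c(51, 52, 53, 54, 55, 56, 57, 60, 63, 66, 69, 72, 75, 78, 81)

# Endpoint method: total change in weight divided by total change in height
endpoint_slope <- (max(weight) - min(weight)) / (max(height) - min(height))
round(endpoint_slope, 2)   # 30 / 44, about 0.68 kg per cm

# Slope between each pair of consecutive observations
local_slopes <- diff(weight) / diff(height)
local_slopes
```

The consecutive slopes are 0.5 kg per cm for the shorter heights and 0.75 kg per cm for the taller ones, so the single figure of 0.68 averages over two different rates of change.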

Plot the data to visualize the pattern and look for something meaningful in it.
plot(height,weight,las=1,
type="b",col="blue",lwd=2,
xaxs = "i", yaxs = "i",
xlim=c(145,195),ylim=c(50,85),
xlab="Height (cms)", ylab="Weight (kgs)")
axis(2, at = seq(50,85,5), las = 2)
axis(1, at = seq(145,195,5), las = 1)

Plot the data again, this time including a regression analysis of the given data to extract meaningful
information.
# Plot the data points
plot(height,weight,las=1,
pch = 21, col = "darkred", bg = "darkred", lwd=2,
xaxs = "i", yaxs = "i",
xlim=c(145,195),ylim=c(50,85),
xlab="Height (cms)", ylab="Weight (kgs)", cex=1.4)
axis(2, at = seq(50,85,5), las = 2)
axis(1, at = seq(145,195,5), las = 1)

Figure 3: Visualization of Height vs. Weight data

abline(lm(weight ~ height), lty = 2,col = "blue", lwd = 2)


fit <- lm(weight ~ height)
rsq <- summary(fit)$r.squared
pv <- anova(fit)$'Pr(>F)'[1]
mtext(bquote(y == ~.(sprintf("%1.2f",fit$coefficients[2]))*x
+ .(sprintf("%1.2f",fit$coefficients[1]))),
side=3, adj=0.05, padj=0.05, line=-2, cex=1) # display equation
mtext(bquote(paste(R^2 == ~.(sprintf("%1.3f",rsq)),", ",
italic(p) <= ~.(sprintf("%1.3f",pv)))), side=3, adj=0.05, padj=0.05,
line=-3.5, cex=1) # display R-sq and p-value

Question: What would you expect the weight to be of an Indian woman who is 145 cms tall?
wt = NULL
ht = 145

wt = 0.69*ht + (-52.37)
paste("The weight of the Indian woman having height of 145 cms is",wt,"(kgs)" )

## [1] "The weight of the Indian woman having height of 145 cms is 47.68 (kgs)"
Question: What would you expect the weight of someone who is 193 cms tall to be?
wt = NULL
ht = 193

wt = 0.69*ht + (-52.37)
paste("The weight of the Indian woman having height of 193 cms is",wt,"(kgs)" )

## [1] "The weight of the Indian woman having height of 193 cms is 80.8 (kgs)"
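The two predictions above plug rounded coefficients (0.69 and -52.37) into the equation by hand. As a hedged alternative, the sketch below lets R use the unrounded coefficients through `predict()` and compares the two approaches. The names `new_heights`, `pred_exact`, and `pred_rounded` are illustrative.

```r
height <- c(147, 149, 151, 153, 155, 157, 159, 163, 167, 171, 175, 179, 183, 187, 191)
weight <- c(51, 52, 53, 54, 55, 56, 57, 60, 63, 66, 69, 72, 75, 78, 81)

fit <- lm(weight ~ height)
coef(fit)   # unrounded intercept and slope

# Heights for which a weight is requested
new_heights <- data.frame(height = c(145, 193))

# Prediction from the fitted model (full-precision coefficients)
pred_exact <- unname(predict(fit, newdata = new_heights))

# Prediction from the hand-rounded equation used in the text
pred_rounded <- 0.69 * new_heights$height - 52.37

round(data.frame(height  = new_heights$height,
                 exact   = pred_exact,
                 rounded = pred_rounded), 2)
```

With full precision the predictions are about 48.35 kgs and 81.69 kgs, compared with 47.68 and 80.8 from the rounded equation, so rounding the slope to two decimals shifts the answer by nearly a kilogram. Both 145 cms and 193 cms also lie just outside the observed range (147 to 191 cms), so either prediction is an extrapolation and should be read with caution.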

Figure 4: Regression of Height vs. Weight data (fitted line: y = 0.69x − 52.37, R² = 0.995, p < 0.001)

Fitting two separate regression lines, one for each range of heights, can give slightly more convincing results.


# Plot the data points for first regression
plot(height[1:7],weight[1:7],las=1,
pch = 21, col = "red", bg = "red", lwd=2,
xaxs = "i", yaxs = "i",
xlim=c(145,195),ylim=c(50,85),
xlab="Height (cms)", ylab="Weight (kgs)", cex=1.4)
axis(2, at = seq(50,85,5), las = 2)
axis(1, at = seq(145,195,5), las = 1)
# Plot the data points for second regression
points(height[8:15],weight[8:15],las=1,
pch = 21, col = "blue", bg = "blue", lwd=2,
cex=1.4)

# First regression line


abline(lm(weight[1:7] ~ height[1:7]), lty = 2,col = "red", lwd = 2)
fit1 <- lm(weight[1:7] ~ height[1:7])
rsq1 <- summary(fit1)$r.squared
pv1 <- anova(fit1)$'Pr(>F)'[1]
mtext(bquote(y == ~.(sprintf("%1.2f",fit1$coefficients[2]))*x
+ .(sprintf("%1.2f",fit1$coefficients[1]))),
side=3, adj=0.05, padj=0.05, line=-2, cex=1,col="red") # display equation
mtext(bquote(paste(R^2 == ~.(sprintf("%1.3f",rsq1)),", ",
italic(p) <= ~.(sprintf("%1.3f",pv1)))), side=3, adj=0.05, padj=0.05,
line=-3.5, cex=1, col="red") # display R-sq and p-value

# Second regression line


abline(lm(weight[8:15] ~ height[8:15]), lty = 2,col = "blue", lwd = 2)

fit2 <- lm(weight[8:15] ~ height[8:15])
rsq2 <- summary(fit2)$r.squared
pv2 <- anova(fit2)$'Pr(>F)'[1]
mtext(bquote(y == ~.(sprintf("%1.2f",fit2$coefficients[2]))*x
+ .(sprintf("%1.2f",fit2$coefficients[1]))),
side=1, adj=0.95, padj=0.05, line=-3.5, cex=1,col="blue") # display equation
mtext(bquote(paste(R^2 == ~.(sprintf("%1.3f",rsq2)),", ",
italic(p) <= ~.(sprintf("%1.3f",pv2)))), side=1, adj=0.95, padj=0.05,
line=-2, cex=1,col="blue") # display R-sq and p-value

Figure 5: Dual Regressions of Height vs. Weight data (first seven observations: y = 0.50x − 22.50, R² = 1.000;
last eight observations: y = 0.75x − 62.25, R² = 1.000)

The single regression calculated earlier and the two separate regressions calculated here give different
predictions for the same heights.
wt1 = NULL
ht1 = 145
# Equation 1
wt1 = 0.50*ht1 + (-22.50)
paste("The weight of the Indian woman having height of 145 cms is",wt1,"(kgs)" )

## [1] "The weight of the Indian woman having height of 145 cms is 50 (kgs)"
wt2 = NULL
ht2 = 193
# Equation 2
wt2 = 0.75*ht2 + (-62.25)
paste("The weight of the Indian woman having height of 193 cms is",wt2,"(kgs)" )

## [1] "The weight of the Indian woman having height of 193 cms is 82.5 (kgs)"
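Equations 1 and 2 can also be obtained from fitted models instead of typed-in coefficients. The sketch below is one possible way to do that, assuming the same split used in the plots (observations 1 to 7 and 8 to 15). Putting the vectors in a data frame named `d` (a hypothetical name) lets `predict()` accept new heights, which the `lm(weight[1:7] ~ height[1:7])` form does not support directly.

```r
height <- c(147, 149, 151, 153, 155, 157, 159, 163, 167, 171, 175, 179, 183, 187, 191)
weight <- c(51, 52, 53, 54, 55, 56, 57, 60, 63, 66, 69, 72, 75, 78, 81)
d <- data.frame(height, weight)

fit_all  <- lm(weight ~ height, data = d)           # single regression
fit_low  <- lm(weight ~ height, data = d[1:7, ])    # shorter heights
fit_high <- lm(weight ~ height, data = d[8:15, ])   # taller heights

# Predictions at 145 cms: single line versus the lower-range line
single_145 <- unname(predict(fit_all, data.frame(height = 145)))
low_145    <- unname(predict(fit_low, data.frame(height = 145)))

# Predictions at 193 cms: single line versus the upper-range line
single_193 <- unname(predict(fit_all,  data.frame(height = 193)))
high_193   <- unname(predict(fit_high, data.frame(height = 193)))

round(c(single_145 = single_145, piecewise_145 = low_145,
        single_193 = single_193, piecewise_193 = high_193), 2)
```

The piecewise fits reproduce the values 50 and 82.5 shown above, because each subset of the data lies exactly on its own straight line. Whether two lines are the better model for real measurements is a separate question that this small table cannot answer.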

4.1 Some More Basic Statistical Definitions:
Statistics:
Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting,
analyzing, interpreting, and drawing conclusions.
Variable:
Characteristic or attribute that can assume different values.
Random Variable:
A variable whose values are determined by chance.
Population:
All subjects possessing a common characteristic that is being studied.
Census:
The collection of data from every element in a population.
Sample:
A subgroup or subset of the population.
Parameter:
Characteristic or measure obtained from a population.
Statistic: (not to be confused with Statistics)
Characteristic or measure obtained from a sample.
Descriptive Statistics:
Collection, organization, summarization, and presentation of data.
Inferential Statistics:
Generalizing from samples to populations using probabilities. Performing hypothesis testing, determining
relationships between variables, and making predictions.
Qualitative Variables (Data):
Variables (data) which assume non-numerical values.
Quantitative Variables (Data):
Variables (data) which assume numerical values.
Discrete Variables (Data):
Variables (data) which assume a finite or countable number of possible values. Usually obtained by counting.
Continuous Variables (Data):
Variables (data) which can assume any value within an interval, so the possible values cannot be counted.
Usually obtained by measurement.
Nominal Level:
Level of measurement which classifies data into mutually exclusive, all inclusive categories in which no order
or ranking can be imposed on the data.
Ordinal Level:
Level of measurement which classifies data into categories that can be ranked. However, the differences
between the ranks are not meaningful.

Interval Level:
Level of measurement which classifies data that can be ranked and differences are meaningful. However, there
is no meaningful zero, so ratios are meaningless.
Ratio Level:
Level of measurement which classifies data that can be ranked, differences are meaningful, and there is a true
zero. True ratios exist between the different units of measure.
Random Sampling:
Sampling in which the data is collected using chance methods or random numbers.
Systematic Sampling:
Sampling in which data is obtained by selecting every k-th object.
Convenience Sampling:
Sampling in which data that is readily available is used.
Stratified Sampling:
Sampling in which the population is divided into groups (called strata) according to some characteristic.
Each of these strata is then sampled using one of the other sampling techniques.
Cluster Sampling:
Sampling in which the population is divided into groups (usually geographically). Some of these groups are
randomly selected, and then all of the elements in those groups are selected.
Self-Selected Survey:
Sampling in which the respondents themselves decide whether or not to be included.
Observational Study:
A study in which the subjects are observed and studied, but no attempt is made to manipulate or modify the
subjects.
Experiment:
A study in which a treatment is applied, and then its effects on the subjects are studied.
Sampling Error:
The difference between the sample result and the true population result that occurs because of chance variation.
Non-sampling Error:
An error that occurs because sample data is incorrectly collected, recorded, or analyzed.
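Several of the definitions above (population, sample, parameter, statistic, sampling error, and the random, systematic, stratified, and cluster sampling schemes) can be illustrated on a small simulated population. Everything in this sketch is made up for illustration: the population of 1000 heights, the four regions, the sample sizes, and the seed are assumptions, not data from the lecture.

```r
set.seed(42)   # arbitrary seed so the sketch is reproducible

# Simulated population: 1000 subjects in four regions of 250 each
population <- data.frame(
  id     = 1:1000,
  region = rep(c("North", "South", "East", "West"), each = 250),
  height = rnorm(1000, mean = 160, sd = 7)
)

# Parameter: a measure obtained from the whole population
mu <- mean(population$height)

# Random sampling: 50 subjects chosen by chance
srs <- population[sample(nrow(population), 50), ]

# Systematic sampling: every k-th subject after a random start
k     <- nrow(population) / 50          # k = 20
start <- sample(k, 1)
sys   <- population[seq(start, nrow(population), by = k), ]

# Stratified sampling: a random sample of 10 from each region (stratum)
strata <- split(population, population$region)
strat  <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 10), ]))

# Cluster sampling: pick 2 regions at random and keep everyone in them
chosen <- sample(unique(population$region), 2)
clus   <- population[population$region %in% chosen, ]

# Statistic: the same measure obtained from a sample
x_bar <- mean(srs$height)

# Sampling error: sample result minus the true population result
sampling_error <- x_bar - mu
c(parameter = mu, statistic = x_bar, sampling_error = sampling_error)
```

Re-running the sketch with a different seed changes the statistic and the sampling error but not the parameter, which is the distinction the definitions draw.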

End of the Lecture !!
