
Week 1

The document is an introduction to a Data Science course taught by Dr. Irfan Yousuf at UET, Lahore. It outlines the importance of data science as a career, the necessary skill set including statistics and programming, and provides an overview of key statistical concepts such as descriptive and inferential statistics, probability distributions, and normal distribution. The course aims to equip students with the foundational knowledge required in the field of data science.


Introduction to Data Science

Dr. Irfan Yousuf


Department of Computer Science (New Campus)
UET, Lahore
(Week 1; January 15 - 19, 2024)
Instructor
• Dr. Irfan Yousuf
[email protected]
Weekly Contents
Why Data Science?
• Data science is one of the top professions today.
• Data is the new driving force behind industries.
• Data science is the career of tomorrow.
Skill Set Needed
• Statistics
• Programming skills
• Multivariable Calculus & Linear Algebra
Statistics
• In the plural form, it refers to a set of numerical data.

• In the singular form, it is an academic discipline.
Data
• Facts and statistics collected for reference or analysis.

• Data are units of information, often numeric, that are collected through observation.

• Data is a collection of facts, such as numbers, words, measurements, observations or just descriptions of things.
What is Statistics
• Statistics is a branch of mathematics that deals with the
scientific collection, organization, presentation, analysis,
and interpretation of numerical data in order to obtain
useful and meaningful information.
Descriptive Statistics
• A statistical method concerned with the collection,
organization, presentation and description of sample data.
Inferential Statistics
• A statistical method concerned with the analysis of sample data, leading to predictions, inferences, interpretations, decisions or conclusions about the entire population.
Population vs. Sample
• Population: The totality of all the elements or persons in which one is interested at a particular time.
• Example: students of the 2018 session of CS-KSK

• Sample: A subset of a population.
• Example: students with CGPA > 3.0
Parameter vs. Statistic
• A parameter is a number describing a whole population.

• A statistic is a number describing a sample.

• With inferential statistics, we use sample statistics to make educated guesses about population parameters.
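The difference between a parameter and a statistic can be sketched in a few lines. The population below is synthetic (200 made-up CGPA values from a seeded random generator) and the sample size of 30 is an arbitrary choice; neither comes from the slides.

```python
import random
import statistics

# Hypothetical population: CGPAs of 200 students (synthetic, seeded for repeatability).
rng = random.Random(42)
population = [round(rng.uniform(2.0, 4.0), 2) for _ in range(200)]

# Parameter: a number describing the whole population.
population_mean = statistics.mean(population)

# Statistic: the same quantity computed on a random sample of 30 students.
sample = rng.sample(population, k=30)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.3f}")
print(f"statistic (sample mean):     {sample_mean:.3f}")
```

The two numbers are usually close but not equal; the gap between them is the sampling error that inferential statistics has to account for.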
Quantitative vs. Qualitative Data
• Quantitative: Numerical information obtained from counting or measuring, which can be manipulated by any fundamental arithmetic operation.
• Examples: Age, Weight, Height

• Qualitative: Descriptive attributes characterized by categorical responses.
• Examples: Gender, Weather, Attitude
Variable
• A variable is any characteristic, number, or quantity that can be measured or counted.

• Independent variables: Variables you manipulate in order to affect the outcome of an experiment, e.g., Age

• Dependent variables: Variables that represent the outcome of the experiment, e.g., Salary
Descriptive vs. Inferential Statistics
• Descriptive: concerned with the collection, organization, presentation and description of sample data.

• Inferential: concerned with the analysis of sample data, leading to predictions, inferences, interpretations, decisions or conclusions about the entire population.
Inferential Statistics
• Inferential statistics takes data from a sample and makes
inferences about the larger population from which the sample
was drawn.
• Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population.

• Define the population we are studying.
• Draw a representative sample from that population.
• Use analyses that incorporate the sampling error.
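One common way to quantify sampling error is the standard error of the mean, s / sqrt(n). The sketch below is an illustration rather than a method named on the slide. It treats the ten travel times that appear later in these slides as a sample.

```python
import math
import statistics

# Ten travel times in minutes, reused here as a sample for illustration.
sample = [25, 26, 26, 28, 28, 32, 33, 34, 34, 35]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)      # sample SD, divides by n - 1

# Standard error of the mean: how much the sample mean is expected
# to vary from one random sample of size n to the next.
standard_error = sample_sd / math.sqrt(n)

print(f"mean = {sample_mean:.2f}, SD = {sample_sd:.2f}, SE = {standard_error:.2f}")
```

A larger sample shrinks the standard error, which is why sample size matters when generalizing to the population.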
Probability Distributions
• A probability distribution is the mathematical function that gives the probabilities of occurrence of the different possible outcomes of an experiment.

• Example: tossing a coin
• Example: throwing a fair die

• Probability distributions are typically defined in terms of probability distribution functions.
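The two examples above can be written out as explicit distributions. This is a minimal sketch: the dictionary names are illustrative, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

# Distribution of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Distribution of a fair coin toss.
coin_pmf = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}

# The probabilities over all possible outcomes must sum to 1.
assert sum(die_pmf.values()) == 1
assert sum(coin_pmf.values()) == 1

# The probability of an event (here: rolling an even number)
# is the sum of the probabilities of its outcomes.
p_even = sum(p for face, p in die_pmf.items() if face % 2 == 0)
print(p_even)  # 1/2
```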
Probability Distribution Functions

Probability Mass
Function (PMF) for
Discrete Data
Cumulative
Distribution
Function (CDF)
Probability Density
Function (PDF) for
Continuous Data
Discrete vs. Continuous Variable
• A discrete variable is a variable that takes on distinct, countable values. In theory, you should always be able to count the values of a discrete variable.

• A continuous variable is a variable that can take on any value within a range. Because the possible values for a continuous variable are infinite, we measure continuous variables rather than count them.
Probability Density Functions (PDFs)
• For a discrete random variable X that takes on a finite or countably infinite number of possible values, we determine P(X=x) for all the possible values of X and call it the probability mass function (pmf).

• For continuous random variables, the probability that X takes on any particular value x is 0, so finding P(X=x) is not useful. Instead, we find the probability that X falls in some interval (a,b), that is, P(a < X < b). We do that using a probability density function (pdf).
Probability Mass Function
Observed travel times:

Day   Travel Time (min)   PMF
1     25                  0.1
2     26                  0.2
3     26                  0.2
4     28                  0.2
5     28                  0.2
6     32                  0.1
7     33                  0.1
8     34                  0.2
9     34                  0.2
10    35                  0.1

Distinct values and their probabilities:

X     P(X=x)
25    0.1
26    0.2
28    0.2
32    0.1
33    0.1
34    0.2
35    0.1
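The PMF column in the table is the relative frequency of each distinct travel time. A minimal sketch of that calculation follows; the variable names are illustrative.

```python
from collections import Counter

# Travel times in minutes for the ten days in the table.
travel_times = [25, 26, 26, 28, 28, 32, 33, 34, 34, 35]

counts = Counter(travel_times)
n = len(travel_times)

# Empirical PMF: P(X = x) = (number of days with time x) / (number of days).
pmf = {x: counts[x] / n for x in sorted(counts)}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p:.1f}")
```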
Cumulative Distribution Function of PMF
Observed travel times:

Day   Travel Time (min)   PMF
1     25                  0.1
2     26                  0.2
3     26                  0.2
4     28                  0.2
5     28                  0.2
6     32                  0.1
7     33                  0.1
8     34                  0.2
9     34                  0.2
10    35                  0.1

Distinct values with PMF and CDF:

X     PMF   CDF
25    0.1   0.1
26    0.2   0.3
28    0.2   0.5
32    0.1   0.6
33    0.1   0.7
34    0.2   0.9
35    0.1   1.0
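The CDF column is the running total of the PMF column, taken in increasing order of X. A short sketch using the PMF values from the table:

```python
from itertools import accumulate

# PMF of the travel time X, as given in the table.
pmf = {25: 0.1, 26: 0.2, 28: 0.2, 32: 0.1, 33: 0.1, 34: 0.2, 35: 0.1}

values = sorted(pmf)

# CDF: F(x) = P(X <= x), the cumulative sum of the PMF up to and including x.
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

for x in values:
    print(f"F({x}) = {cdf[x]:.1f}")
```

The last CDF value is 1 because the probabilities of all possible values sum to 1.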
Probability Density Function
Let the random variable X denote the time a person waits for
an elevator to arrive. Suppose the longest one would need to
wait for the elevator is 2 minutes, so that the possible values of
X (in minutes) are given by the interval [0,2] .
A possible pdf for X is given by:
Probability Density Function

Probability that a person waits less than 30 seconds (or 0.5 minutes):

$P(X < 0.5) = \int_{0}^{0.5} f(x)\,dx$
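The formula for the elevator pdf did not survive in this copy of the slides, so the density below is an assumption chosen only for illustration: a triangular pdf on [0, 2] that peaks at 1 minute. The helper `integrate` is also hypothetical; it is a plain midpoint-rule approximation, not a library call.

```python
def pdf(x: float) -> float:
    """ASSUMED density on [0, 2] (not from the slides): rises to 1 at x = 1, then falls."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0


def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# A valid pdf must have total area 1 over its support.
total_area = integrate(pdf, 0, 2)

# P(X < 0.5): area under the assumed pdf between 0 and 0.5 minutes.
p_less_than_half = integrate(pdf, 0, 0.5)

print(f"total area = {total_area:.4f}")
print(f"P(X < 0.5) = {p_less_than_half:.4f}")
```

Under this assumed density the probability of waiting less than 30 seconds is 0.125. A different pdf would give a different value, but the calculation has the same form.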
Probability Density Function
Continuous random variables have zero point probabilities, i.e., the probability that a continuous random variable equals a single value is always 0.

Probability for a continuous random variable is given by areas under its pdf.
Cumulative Distribution Function of PDF

Let X have pdf f. Then the cdf F is given by

$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
Cumulative Distribution Function of PDF
[Figure: PDF to CDF]
Normal Distribution
• The mean, median and mode are exactly the same.
• The distribution is symmetric about the mean—half the
values fall below the mean and half above the mean.
• The distribution can be described by two values: the mean and
the standard deviation.
Normal Distribution

Day   Time      Day   Time
1     32.14     11    28.24
2     31.30     12    29.10
3     29.17     13    28.34
4     28.15     14    28.50
5     30.30     15    29.26
6     30.41     16    28.29
7     32.37     17    25.36
8     33.19     18    27.18
9     31.19     19    30.29
10    30.37     20    27.15
Normal Distribution
Day   Time    f(x)
1     32.14   0.08
2     31.30   0.13
3     29.17   0.20
4     28.15   0.16
5     30.30   0.19
6     30.41   0.18
7     32.37   0.07
8     33.19   0.04
9     31.19   0.14
10    30.37   0.19
11    28.24   0.16
12    29.10   0.20
13    28.34   0.17
14    28.50   0.18
15    29.26   0.20
16    28.29   0.17
17    25.36   0.02
18    27.18   0.10
19    30.29   0.19
20    27.15   0.10

Mean 29.52
St. Dev 1.96
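The f(x) column can be reproduced by evaluating the normal density at each time, using the mean and standard deviation of the 20 values. The sketch below assumes the sample standard deviation (dividing by n - 1), since that is the form that gives about 1.96 for this data. `normal_pdf` is an illustrative helper name.

```python
import math
import statistics

times = [32.14, 31.30, 29.17, 28.15, 30.30, 30.41, 32.37, 33.19, 31.19, 30.37,
         28.24, 29.10, 28.34, 28.50, 29.26, 28.29, 25.36, 27.18, 30.29, 27.15]

mean = statistics.mean(times)
sd = statistics.stdev(times)   # sample SD (n - 1); assumed to be what the slide used


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


for day, t in enumerate(times, start=1):
    print(f"{day:>3}  {t:6.2f}  {normal_pdf(t, mean, sd):.2f}")
```

Times near the mean get the highest density (about 0.20) and times far from it, such as 25.36, get the lowest.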
Normal Distribution
Time f(x)
25.36 0.02
27.15 0.10
27.18 0.10
28.15 0.16
28.24 0.16
28.29 0.17
28.34 0.17
28.50 0.18
29.10 0.20
29.17 0.20
29.26 0.20
30.29 0.19
30.30 0.19
30.37 0.19
30.41 0.18
31.19 0.14
31.30 0.13
32.14 0.08
32.37 0.07
33.19 0.04
Normal Distribution

Mean 29.52
St. Dev 1.96
M+SD 31.48
M-SD 27.55
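The M-SD and M+SD values mark the band within one standard deviation of the mean. As a rough check, the sketch below counts how many of the 20 times fall inside that band, again assuming the sample standard deviation.

```python
import statistics

times = [32.14, 31.30, 29.17, 28.15, 30.30, 30.41, 32.37, 33.19, 31.19, 30.37,
         28.24, 29.10, 28.34, 28.50, 29.26, 28.29, 25.36, 27.18, 30.29, 27.15]

mean = statistics.mean(times)
sd = statistics.stdev(times)   # sample SD (n - 1), an assumption about the slide's method

# Band within one standard deviation of the mean.
lower, upper = mean - sd, mean + sd

inside = [t for t in times if lower <= t <= upper]
share = len(inside) / len(times)

print(f"M-SD = {lower:.2f}, M+SD = {upper:.2f}")
print(f"{len(inside)} of {len(times)} values ({share:.0%}) lie within one SD of the mean")
```

Here 14 of the 20 values (70%) fall inside the band, close to the roughly 68% expected for normally distributed data.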
Normal Distribution
68-95-99.7 Rule
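The three percentages in the rule are the probabilities that a normal variable lies within 1, 2, and 3 standard deviations of its mean. A minimal sketch using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# P(-k < Z < k) = CDF(k) - CDF(-k) for k = 1, 2, 3 standard deviations.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} SD of the mean: {p:.1%}")
```

This prints roughly 68.3%, 95.4%, and 99.7%. The same percentages hold for any normal distribution, because they depend only on the distance from the mean in units of standard deviation.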
CDF of Normal Distribution
• The cumulative distribution function (cdf) is the probability that the
variable X takes a value less than or equal to x.
• In the accompanying figure, Mean = 0 and SD = 1.
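The normal cdf has no elementary closed form, but it can be written with the error function as 0.5 * (1 + erf((x - mean) / (SD * sqrt(2)))). A short sketch follows; `normal_cdf` is an illustrative name, with defaults matching the Mean = 0, SD = 1 case.

```python
import math


def normal_cdf(x: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """P(X <= x) for a normal variable with the given mean and standard deviation."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


# Half of the probability lies at or below the mean.
print(normal_cdf(0.0))

# The cdf increases from 0 toward 1 as x grows.
for x in (-2, -1, 0, 1, 2):
    print(f"F({x:>2}) = {normal_cdf(x):.4f}")
```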
Z-Distribution
• The standard normal distribution, also called the z-distribution, is a
special normal distribution where the mean is 0 and the standard
deviation is 1.
• Z-scores tell you how many standard deviations away from the mean
each value lies.
Z-Distribution
Z-Score

$z = \frac{x - \mu}{\sigma}$

As the formula shows, the z-score is simply the raw score x minus the population mean μ, divided by the population standard deviation σ.
Z-Distribution
Day   Time
1     26
2     33
3     65
4     28
5     34
6     55
7     25
8     44
9     50
10    36
11    26
12    37
13    43
14    62
15    35
16    38
17    45
18    32
19    28
20    34

Mean is 38.8 minutes
Standard Deviation is 11.4 minutes
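The z-score of each time in the table can be computed directly from the stated mean and standard deviation. The sketch below uses the population standard deviation (dividing by n), which is the form that gives 11.4 for this data.

```python
import statistics

# Times in minutes for the 20 days in the table.
times = [26, 33, 65, 28, 34, 55, 25, 44, 50, 36,
         26, 37, 43, 62, 35, 38, 45, 32, 28, 34]

mean = statistics.mean(times)
sd = statistics.pstdev(times)   # population SD (divides by n)

# z = (raw score - mean) / standard deviation
z_scores = [(t - mean) / sd for t in times]

for day, (t, z) in enumerate(zip(times, z_scores), start=1):
    print(f"{day:>3}  {t:>3}  z = {z:+.2f}")
```

Day 3 (65 minutes) lies about 2.3 standard deviations above the mean, while Day 7 (25 minutes) lies about 1.2 standard deviations below it.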
Summary
• Introduction to Data Science
