
Unit 1 Computational Statistics

The document provides an introduction to statistics, covering key concepts such as data types (categorical and numerical), univariate and bivariate analysis, and measures of central tendency and variability. It also distinguishes between data science and related fields, outlining the roles of various professionals in the domain. Additionally, it includes examples of calculating mean, variance, and standard deviation, emphasizing the importance of these measures in data analysis.

Uploaded by

anseltemp

UNIT-1

Introduction to Statistics
What is Statistics
• The practice or science of collecting,
interpreting and analyzing numerical data in
large quantities, especially for the purpose of
inferring proportions in a whole from those
in a representative sample.
• Collection of methods for planning
experiments, obtaining data and then
organizing, summarizing, presenting,
analyzing, interpreting & drawing conclusions.
Note: the mean is affected by outliers.
Statistical Data: Categorical and Numerical (Discrete, Continuous)

• Categorical data includes categories or groups:


- Car brands- VW, TATA, Suzuki
- Have enrolled for a course- Yes, No
• Numerical data includes discrete and
continuous values:
- No. of vehicles: 0, 1, 2, 3,… (discrete data)
- Weight: 72.23, 68.7, … (continuous data)
Nominal data: categories that cannot be ordered.
Ordinal data: groups and categories that follow a strict order.
The difference between interval and ratio scales comes from whether they have a true zero.
Interval scales have no true zero and can represent values below zero.
Eg, you can measure temperature below 0 degrees Celsius, such as -10 degrees.
Ratio scales have a true zero and never fall below it.
Eg, height and weight are measured from 0 and above, but never fall below it.
Univariate and Bivariate Analysis
• Univariate data –
- This type of data consists of only one variable.
- The analysis of univariate data is thus the
simplest form of analysis since the information
deals with only one quantity that changes.
- It does not deal with cause-and-effect
relationships; the main purpose of the
analysis is to describe the data and find
patterns that exist within it.
- An example of univariate data is height.
- The description of patterns found in this type
of data can be made by drawing conclusions
using central tendency measures (mean, median
and mode), dispersion or spread of data (range,
minimum, maximum, quartiles, variance and
standard deviation) and by using frequency
distribution tables, histograms, pie charts and bar
charts.
• Bivariate data-
- This type of data involves two different
variables.
- The analysis of this type of data deals with
cause-and-effect relationships and is done to
find the relationship between the two
variables.
• Multivariate data –
- When the data involves three or more
variables, it is categorized under multivariate.
- It is similar to bivariate data but contains more
than one dependent variable.
- The way to analyze this data
depends on the goals to be achieved.
- Example, Dataset on cholesterol, blood
pressure and weight to predict heart attack.
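The univariate summaries listed above (frequency table, range, quartiles, mode, median) can be sketched in a few lines of Python. This is only an illustration: the height values are hypothetical, and the quartiles use the default method of the standard-library `statistics.quantiles` function, which is one of several accepted conventions.

```python
import statistics
from collections import Counter

# Hypothetical heights in cm (illustrative only, not data from this unit).
heights_cm = [150, 152, 152, 155, 158, 160, 160, 160, 165, 170]

# Frequency distribution table: value -> number of occurrences.
frequency_table = Counter(heights_cm)

# Dispersion: range and quartiles (default "exclusive" quantile method).
data_range = max(heights_cm) - min(heights_cm)
q1, q2, q3 = statistics.quantiles(heights_cm, n=4)

# Central tendency.
mode = statistics.mode(heights_cm)
median = statistics.median(heights_cm)

for value, count in sorted(frequency_table.items()):
    print(value, count)
print("range:", data_range, "quartiles:", q1, q2, q3)
print("mode:", mode, "median:", median)
```

The second quartile coincides with the median, which is a quick sanity check on the result.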
Distinction between Data Science & Other
related domains
Mean, Median, Mode: Solution
Task 1: Annual income

Mean $ 189,848.18

Median $ 55,000.00

Mode $ 64,000.00
Task-2:
-Income is an example where the mean can be misleading. The correct
measure to use depends on the research that you are conducting.
-Usually, in research on income, we use the median income instead of
the mean income.
-Certain individuals earn much more than others. They are outliers
that pull the mean far away from the typical value: here the mean
($189,848.18) is more than three times the median ($55,000.00).
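A minimal Python sketch of this effect, using only the standard-library `statistics` module. The income figures are hypothetical (they are not the dataset behind Task 1); they only show how one extreme value moves the mean while the median barely changes.

```python
import statistics

# Hypothetical annual incomes in dollars (illustrative only, not the Task 1 data).
incomes = [42_000, 48_000, 55_000, 61_000, 64_000]

# Add one very high earner, who acts as an outlier.
incomes_with_outlier = incomes + [1_500_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print(f"Without outlier: mean={mean_before:,.2f}, median={median_before:,.2f}")
print(f"With outlier:    mean={mean_after:,.2f}, median={median_after:,.2f}")
```

With these numbers the mean jumps from 54,000 to 295,000, while the median only moves from 55,000 to 58,000.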
Measure of Asymmetry- SKEWNESS

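• Skewness indicates whether the data is concentrated on one side.
• Skewness describes where most of the data lies.
• Positive/ Right skew: most of the data lies on the left and the tail extends to the right; typically mean > median.
• Zero/ No skew: the data is symmetric; mean = median.
• Negative/ Left skew: most of the data lies on the right and the tail extends to the left; typically mean < median.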
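The three skew cases can be checked numerically. The sketch below assumes the moment-based definition of skewness (third central moment divided by the 1.5 power of the second), which is one common convention and is not stated in these slides; the helper name `skewness` and the three datasets are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments.

    Not defined for constant data, where m2 is zero.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical datasets, one per case.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]           # long tail to the right
symmetric = [1, 2, 3, 4, 5]                        # no skew
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]    # long tail to the left

for name, data in [("right", right_skewed), ("none", symmetric), ("left", left_skewed)]:
    print(name, round(skewness(data), 3))
```

A positive result indicates right skew, a result near zero indicates symmetry, and a negative result indicates left skew.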
Measures of Variability
• Variance
• Standard Deviation
• Coefficient of variation
Standard Deviation
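• Variance is measured in squared units, so its values tend to be large.
• SD is much smaller and more meaningful, because it is in the same units as the data.
• SD is the preferred measure of variability (for a single dataset), as it is directly interpretable.
• But if we have two or more datasets and want to compare their variability, comparing SDs is not meaningful when the datasets have different units or scales.
• In that case, use the coefficient of variation: the standard deviation divided by the mean.
The higher the coefficient, the greater the dispersion around the mean.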
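A short sketch of why the coefficient of variation is used for comparisons. It assumes the population standard deviation (dividing by n, as in the worked examples of this unit); the prices and the conversion rate of 20 are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (unitless)."""
    return statistics.pstdev(values) / statistics.fmean(values)

# Hypothetical data: the same five prices expressed in two currencies.
# The rate of 20 units per dollar is an assumption for illustration.
prices_usd = [1.0, 2.0, 3.0, 3.0, 5.0]
prices_other = [p * 20 for p in prices_usd]

# The SDs differ by the conversion factor, so they are not comparable.
print(statistics.pstdev(prices_usd), statistics.pstdev(prices_other))

# The coefficients of variation match, because the unit cancels out.
print(coefficient_of_variation(prices_usd), coefficient_of_variation(prices_other))
```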
Harmonic Mean
• Harmonic mean is the reciprocal of the
arithmetic mean of the reciprocals of the
values.
• It is calculated by dividing the number of
observations by the sum of reciprocal of each
number in the series.
• So, if x1, x2, x3,…xn are observations of any
variable X, then
$$H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$$
Use of Harmonic mean
• It is used to find average of classes or
groups in a frequency distribution.
• It gives equal weight to each data point.
• Eg. Four categories of typists take 4, 5, 8 and
10 minutes respectively to type a letter. Find
the average time taken to type the letter.

• H.M. of any frequency distribution, where value $x_i$ occurs with frequency $f_i$:
$$H.M. = \frac{\sum_i f_i}{\sum_i f_i / x_i}$$
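The typist example can be worked in Python as a check. The sketch computes the harmonic mean by hand (number of observations divided by the sum of reciprocals) and compares it with the standard-library `statistics.harmonic_mean`.

```python
import statistics

# Minutes taken by the four categories of typists to type one letter.
minutes = [4, 5, 8, 10]

# Number of observations divided by the sum of the reciprocals.
hm_manual = len(minutes) / sum(1 / x for x in minutes)

# The standard library gives the same result.
hm_library = statistics.harmonic_mean(minutes)

print(round(hm_manual, 2))   # 5.93
print(round(hm_library, 2))  # 5.93
```

So the average time is about 5.93 minutes per letter, lower than the arithmetic mean of 6.75 minutes.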


Mean, variance, standard deviation: notation
Mean (or average), denoted by:
• μ if working with a population
• x̄ if working with samples
Variance, denoted by:
• σ² (for population)
• s² (for sample)
Standard deviation, denoted by:
• σX or σ (for population)
• sX or s (for sample)
Mean – is a simple average of given data values:

●Example: 4, 5, 9, 3, 15, 6

●Mean x̄ = (4+5+9+3+15+6) / 6
= 42/6
= 7
Variance: a measure of how data points
differ from the mean
● Marks of Student A : 30, 50, 70, 100, 100
● Marks of Student B: 70, 70, 70, 70, 70
●The mean (average) of each student’s marks is:
○Marks of Student A : mean = 70
○Marks of Student B : mean = 70
●But we know that the two data sets are not
identical !
●So, variance will show how they are different.
●We want to find a way to represent these two
datasets numerically.
How to Calculate variance?

●If we conceptualize the spread of a
distribution as the extent to which the
values in the distribution differ from the
mean and from each other, then a
reasonable measure of spread might be
the average deviation, or difference of
the values from the mean.
●The raw deviations from the mean always sum to zero, so each deviation is squared before averaging.
●The average of the squared deviations about the mean is called the variance.

For population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

For sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$


Example 1- Variance (Student A)
       Score X   X − X̄         (X − X̄)²
1      30        30-70=-40     1600
2      50        50-70=-20     400
3      70        70-70=0       0
4      100       100-70=30     900
5      100       100-70=30     900
Total  350                     3800

The mean is 350/5 = 70
Variance = 3800/5 = 760
Example 1- Variance (Student B)
       Score X   X − X̄         (X − X̄)²
1      70        70-70=0       0
2      70        70-70=0       0
3      70        70-70=0       0
4      70        70-70=0       0
5      70        70-70=0       0
Total  350                     0

The mean is 350/5 = 70
Variance = 0/5 = 0
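The steps of Example 1 (find the mean, take each deviation, square it, average the squares) can be sketched directly in Python. The helper name `population_variance` is hypothetical; the last line uses the standard-library `statistics.variance`, which divides by n − 1 and therefore gives the sample variance instead.

```python
import statistics

def population_variance(values):
    """Average of the squared deviations about the mean (divide by N)."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / len(values)

student_a = [30, 50, 70, 100, 100]
student_b = [70, 70, 70, 70, 70]

print(population_variance(student_a))  # 760.0
print(population_variance(student_b))  # 0.0

# Sample variance divides by n - 1 instead: 3800 / 4 = 950.
print(statistics.variance(student_a))
```

Both students have a mean of 70, but only the variance shows that Student A's marks are spread out while Student B's are identical.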
Example 2- Variance
Drive   Mark   Mathew
1       28     27
2       22     27
3       21     28
4       26     6
5       18     27
Which driver was more consistent?
Example 2- Variance (Mark)
Drive    Mark's Score X   X − X̄   (X − X̄)²
1        28               5        25
2        22               -1       1
3        21               -2       4
4        26               3        9
5        18               -5       25
Totals   115                       64

X̄ = (28+22+21+26+18)/5 = 23
Example 2- Variance (Mathew)
Drive    Mathew's Score X   X − X̄   (X − X̄)²
1        27                 4        16
2        27                 4        16
3        28                 5        25
4        6                  -17      289
5        27                 4        16
Totals   115                         362

X̄ = (27+27+28+6+27)/5 = 23

Mark’s Variance = 64 / 5 = 12.8
Mathew’s Variance = 362 / 5 = 72.4

Conclusion: Mark has the lower variance; therefore, he is more consistent.
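The comparison in Example 2 can be reproduced with the standard-library `statistics.pvariance` (population variance, dividing by N as the slides do). The dictionary layout and variable names are just one possible way to organize the scores.

```python
import statistics

# Scores from Example 2, one list per driver.
scores = {
    "Mark": [28, 22, 21, 26, 18],
    "Mathew": [27, 27, 28, 6, 27],
}

# Population variance for each driver.
variances = {name: statistics.pvariance(vals) for name, vals in scores.items()}

# The driver with the smallest variance is the most consistent.
most_consistent = min(variances, key=variances.get)

for name, var in variances.items():
    print(f"{name}: variance = {var:.1f}")
print("More consistent:", most_consistent)
```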
Standard Deviation: a measure of
variation of scores about the mean
●We can think of standard deviation as the
average distance to the mean.
●Higher standard deviation indicates higher
spread, less consistency, and less clustering.
●Sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
●Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Example – Standard Deviation
Drive    Mark's Score X   X − X̄   (X − X̄)²
1        28               5        25
2        22               -1       1
3        21               -2       4
4        26               3        9
5        18               -5       25
Totals   115                       64

Mark’s Variance = 64 / 5 = 12.8
Mark’s Standard Deviation = √12.8 ≈ 3.58
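As a sketch of the difference between the population and sample formulas, the snippet below applies both to Mark's scores using the standard-library functions `statistics.pstdev` (divide by N) and `statistics.stdev` (divide by n − 1). Which one applies depends on whether the five drives are treated as the whole population or as a sample; the slides divide by 5, that is, the population form.

```python
import math
import statistics

mark = [28, 22, 21, 26, 18]

# Population standard deviation: sqrt(64 / 5).
population_sd = statistics.pstdev(mark)

# Sample standard deviation: sqrt(64 / 4).
sample_sd = statistics.stdev(mark)

print(round(population_sd, 2))  # 3.58
print(round(sample_sd, 2))      # 4.0
```

The sample value is slightly larger because dividing by n − 1 compensates for estimating the mean from the same data.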
Example- Variance & Standard
Deviation
●You have just measured the heights of your dogs (in mm)
● The heights (at the shoulders) are: 600mm, 470mm,
170mm, 430mm and 300mm.
●Find out the Mean, the Variance, and the Standard Deviation.
●Your first step is to find the Mean:
● Mean = (600 + 470 + 170 + 430 + 300) / 5
= 1970 / 5
= 394
●Now we calculate each dog's difference from the Mean:
206, 76, −224, 36, −94
●To calculate the Variance, take each difference, square it, and
then average the result:
● Variance
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
= (42436 + 5776 + 50176 + 1296 + 8836) / 5
= 108520 / 5
= 21704

● So the Variance σ² is 21,704


●And the Standard Deviation is just the square root of Variance, so:
● Standard Deviation
σ = √21704
= 147.32...
= 147 (to the nearest mm)
●And the good thing about the Standard Deviation is that it is useful. Now we
can show which heights are within one Standard Deviation (147mm) of
the Mean, that is, between 247mm and 541mm: the dogs at 300mm, 430mm
and 470mm.

● So, using the Standard Deviation we have a "standard" way
of knowing what is normal, and what is extra large
or extra small.
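The whole dog-height example can be checked with a few lines of Python. The sketch uses the population formulas from the standard-library `statistics` module, matching the division by 5 in the slides; the variable names are illustrative.

```python
import statistics

# Shoulder heights of the five dogs, in mm.
heights_mm = [600, 470, 170, 430, 300]

mean = statistics.mean(heights_mm)            # 1970 / 5 = 394
variance = statistics.pvariance(heights_mm)   # 108520 / 5 = 21704
sd = statistics.pstdev(heights_mm)            # square root of 21704

# Heights within one standard deviation of the mean.
lower, upper = mean - sd, mean + sd
within = [h for h in heights_mm if lower <= h <= upper]
outside = [h for h in heights_mm if not lower <= h <= upper]

print(mean, variance, round(sd, 2))
print("within one SD:", within)
print("outside one SD:", outside)
```

The 600mm dog falls above the band and the 170mm dog falls below it, which is the "extra large" and "extra small" distinction described above.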
Home Assignment: Database of Real Estate Company
• We need all statistical properties of the data.
• Compute measures of central tendency and
variability and comment on each value.
• Plot Line, Scatter, Box plots, Histogram for
insightful use-cases such as a scatter plot for
age vs price.
Defining Data Science and Big Data
• Data science is an umbrella term that
encompasses data analytics, data mining,
machine learning, and several other related
disciplines.
• It includes collection, ingestion, retrieval and
transformation of large amounts of data
(collectively known as big data).
• The best way to describe data science is via a
Venn diagram created by Drew Conway in
2010:
Data Science Roles
• Data Scientist: Understands data from a specific
business point of view, establishes experimental setup
and provides accurate predictions and insights that
can be used to power critical business decisions.
• Data Analyst: Takes a technical role in
developing, implementing, and maintaining analytic
systems.
• Business Analyst: Responsible for using
data to drive business decisions.
• Statistician: Responsible for creating data-driven
surveys, opinion polls, and questionnaires and
interpreting them.
• Data and Analytics Manager: Plays the role of
assigning duties and operations to the data
science team.
• Database Administrator
• Data Engineer: Data Engineer is responsible
for transforming data into an easily analyzable
format.
• Data Architect: The role of a Data Architect is
to integrate, protect, maintain and expand the
data sources of an organization.
