Statistics For Data Science PDF

Uploaded by

mohitofficial2019

©

DatabaseTown.co
STATISTICS FOR DATA SCIENCE
A. DESCRIPTIVE STATISTICS:

Before discussing descriptive statistics, we first recall the basic concept of data and its
types.

Data:

Data is a collection of factual information, such as numbers, words, observations and
measurements, which can be used for calculation, discussion and reasoning.

TYPES OF DATA:

DATA (plural; the singular form is datum)

Categorical or Qualitative Data: based on descriptive information, e.g. He is a clever boy
- Binomial Data: variable data with only two options, e.g. good or bad, true or false
- Nominal or Unordered Data: variable data in unordered form, e.g. red, green, man
- Ordinal Data: variable data with a proper order, e.g. short, medium, long

Numerical or Quantitative Data: based on numerical information, e.g. He has 2 legs
- Discrete Data: this data is countable, e.g. number of children, whole numbers
- Continuous Data: this data is measurable, e.g. height, width, length
- Interval: no true zero, e.g. temperature (a reading of zero does not mean the absence of temperature)
- Ratio: absolute zero, e.g. height can be zero

The raw dataset is the foundation of data science, and it may be of different kinds:
Structured Data (tabular structure), Unstructured Data (pictures, recordings, messages, PDF
documents and so forth) and Semi-Structured Data.

DATA

STRUCTURED DATA
Formatted, highly organized, easily searchable and understandable by machine language
e.g. name, address, dates, etc.
RDBMS, CRM, ERP are suitable for structured data

UNSTRUCTURED DATA
Unformatted, unorganized, cannot be processed and analyzed by utilizing conventional methods
e.g. text, audio, video, social media activity, etc.
Non-relational and NoSQL databases are best for unstructured data

Furthermore, there are two kinds of data, i.e. population data and sample data.

Population Data:
Population data is the collection of all items of interest. Its size is denoted by ‘N’, and the
numbers we obtain when using a population are called parameters.

Sample Data:
Sample data is a subset of the population. Its size is denoted by ‘n’, and the numbers we
obtain when using a sample are called statistics.

Graphical Representation of Variables in the Form of Graphs & Tables:

i. Bar Chart:
Bar charts are frequently used to display data. In a bar chart, each bar represents a
category and the y-axis shows the frequency, as shown in the figure.

[Figure: bar chart of Series 1, Series 2 and Series 3; y-axis from 0 to 6]

ii. Pie chart:
Pie charts are frequently used to display market share. If we want to see the share of
any item as a part of the total, then we use a pie chart, as shown in the figure below:

[Figure: pie chart titled "Sales" with slices for 1st Qtr, 2nd Qtr, 3rd Qtr and 4th Qtr]

iii. Frequency Distribution Table:

A frequency distribution table shows each category and its corresponding absolute
frequency, as shown in the table below:

Category Frequency
Black 12
Brown 5
Blond 3
Red 7

Relative frequency = frequency / total frequency
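
As a small illustration of the relative frequency formula, the sketch below applies it to the four categories of the table above. The variable names are only illustrative and are not part of the original notes.

```python
# Absolute frequencies copied from the frequency distribution table.
frequency = {"Black": 12, "Brown": 5, "Blond": 3, "Red": 7}

# Total frequency is the sum of all absolute frequencies.
total = sum(frequency.values())

# Relative frequency = frequency / total frequency, per category.
relative_frequency = {
    category: count / total for category, count in frequency.items()
}

for category, share in relative_frequency.items():
    print(f"{category:<6} {frequency[category]:>3} {share:.3f}")
```

The relative frequencies always add up to 1, which is a quick way to check the calculation.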

Measures of central tendency:

A measure of central tendency is a single value that describes a set of data by identifying
the central position within that set of data. It is also called a measure of central location.
The measures of central tendency are:

i. Mean
ii. Median
iii. Mode

i. Mean:
The mean is the most popular measure of central tendency. It is used with both discrete and
continuous data. The mean is equal to the sum of all the values in the data set divided by the
number of values in the data set. Therefore, if we have n values in a data set and they have values
x1, x2, ..., xn, the sample mean, usually denoted by x̄, is given by

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

If we intend to calculate the population mean instead of the sample mean, then we use the
Greek letter µ:

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]

ii. Median:
It is the middle score of a dataset that has been arranged in order of magnitude. In order to
calculate the median, suppose we have the following dataset:

10 20 30 15 20 30 15 20 30 15 20

First of all, we re-arrange this data in order of magnitude from smallest to largest:

10 15 15 15 20 20 20 20 30 30 30

Therefore, in this case the sixth value, 20, is our median. It is the middle mark, as there are 5
scores before it and 5 scores after it. However, if we have an even number of scores like this one,

10 20 30 15 20 15 20 30 15 20

then we re-arrange this data in order of magnitude and we obtain

10 15 15 15 20 20 20 20 30 30

In this case, we take the two middle values (the fifth and the sixth), i.e. 20 and 20, and
average them to get the median, i.e. 20.
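
The mean and median calculations can be checked with a short sketch using Python's standard `statistics` module. The two lists are the datasets from the median example; the variable names are illustrative.

```python
from statistics import mean, median

# The 11-value and 10-value datasets from the median example.
odd_scores = [10, 20, 30, 15, 20, 30, 15, 20, 30, 15, 20]
even_scores = [10, 20, 30, 15, 20, 15, 20, 30, 15, 20]

# Sorting shows the order of magnitude used to locate the middle.
print(sorted(odd_scores))
print(sorted(even_scores))

# Odd count: the single middle value. Even count: average of the two middle values.
median_odd = median(odd_scores)
median_even = median(even_scores)

# Mean: sum of all values divided by the number of values.
mean_odd = mean(odd_scores)

print(median_odd, median_even, round(mean_odd, 2))
```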

iii. Mode:
It is the value that occurs most often in our dataset. A dataset can have no mode, one mode or
multiple modes. It can be found by locating the value with the maximum frequency. For
instance,

[Figure: a distribution with one mode and a distribution with two modes]
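
A minimal sketch of finding the mode by maximum frequency, using `statistics.multimode` (available from Python 3.8). The three small lists are made-up examples, not data from the notes.

```python
from statistics import multimode

# Hypothetical example data.
one_mode = [1, 2, 2, 3]
two_modes = [1, 1, 2, 3, 3]
no_repeats = [1, 2, 3]

# multimode returns every value that shares the maximum frequency.
print(multimode(one_mode))    # [2]
print(multimode(two_modes))   # [1, 3]
print(multimode(no_repeats))  # [1, 2, 3]
```

When every value occurs equally often, as in the last list, `multimode` returns all of them; this is the situation usually described as a dataset with no mode.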

Measure of Asymmetry:

Skewness:
It is the measure of asymmetry that shows whether the observations in a dataset are
concentrated on one side. Skewness can be calculated by the following formula:

\[ \text{skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \right)^3} \]

There are two types of skewness:
i. Right or Positive Skewness
ii. Left or Negative Skewness

i. Right or Positive Skewness:

Right or positive skewness means that the outliers are to the right (long tail to the right) as
shown in figure

[Figure: right-skewed frequency chart; y-axis: Frequency (0 to 40); x-axis: Sugar, Soap, Milk, Shampoo, Rice]

Mean > median

ii. Left or Negative Skewness:

Left or negative skewness means that the outliers are to the left (long tail to the left) as
shown in figure

[Figure: left-skewed frequency chart; y-axis: Frequency (0 to 40); x-axis: Sugar, Soap, Milk, Shampoo, Rice]

Mean < median

However, if mean = median = mode, then there is no skew and the distribution is
symmetrical.
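
The skewness formula can be sketched in a few lines of Python. The function name and the two small datasets are assumptions made for illustration; the calculation uses the third central moment divided by the cube of the sample standard deviation.

```python
from math import sqrt

def skewness(values):
    """Third central moment (divided by n) over the cubed sample standard deviation."""
    n = len(values)
    x_bar = sum(values) / n
    third_moment = sum((x - x_bar) ** 3 for x in values) / n
    sample_sd = sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    return third_moment / sample_sd ** 3

# Hypothetical data: one large outlier on the right, and a symmetric set.
right_skewed = [1, 2, 2, 3, 3, 3, 20]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive: long tail to the right
print(skewness(symmetric))     # zero: no skew
```

In the right-skewed list the mean (about 4.86) is larger than the median (3), which matches the rule Mean > median for positive skewness.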

Variance and Standard Deviation:

Variance and standard deviation measure the dispersion of a set of data points around its
mean value.

Sample Variance formula:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

Population Variance formula:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

Sample Standard Deviation formula:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]

Population Standard Deviation formula:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

Coefficient of Variation:
The coefficient of variation has no unit of measurement, which makes it well suited for
comparing the dispersion of datasets measured on different scales. It is the standard deviation
divided by the mean:

\[ CV = \frac{s}{\bar{x}} \]
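
A short sketch of these dispersion measures with Python's `statistics` module, where `variance` and `stdev` divide by n - 1 (sample formulas) and `pvariance` and `pstdev` divide by N (population formulas). The data reuses the sorted dataset from the median example; treating it as both a sample and a population is only for comparison.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [10, 15, 15, 15, 20, 20, 20, 20, 30, 30, 30]

sample_var = variance(data)       # divides by n - 1
population_var = pvariance(data)  # divides by N
sample_sd = stdev(data)           # square root of the sample variance
population_sd = pstdev(data)      # square root of the population variance

# Coefficient of variation: standard deviation relative to the mean (unit free).
cv = sample_sd / mean(data)

print(round(sample_var, 3), round(population_var, 3))
print(round(sample_sd, 3), round(population_sd, 3))
print(round(cv, 3))
```

For the same data the sample variance is always slightly larger than the population variance, because the sum of squared deviations is divided by n - 1 instead of N.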
Covariance and correlation:
Covariance:
- Covariance is a statistical measure of the systematic relationship between a pair of random
  variables, wherein a change in one variable is accompanied by a corresponding change in the
  other variable.
- The value of covariance lies between -∞ and +∞.
- A covariance of 0 means that the two variables have no linear relationship (this alone does
  not prove that they are independent).
- A positive covariance means that the two variables move together.
- A negative covariance means that the two variables move in opposite directions.

Correlation:
- Correlation is a statistical measure of the systematic relationship between a pair of random
  variables, wherein a movement in one variable is accompanied by a corresponding movement in
  the other variable. It is the covariance scaled by the two standard deviations.
- The value of correlation lies between -1 and +1.
- A correlation of 0 means that the two variables have no linear relationship (this alone does
  not prove that they are independent).
- A correlation of 1 means perfect positive correlation.
- A correlation of -1 means perfect negative correlation.

Sample Covariance formula:

\[ s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} \]

Sample Correlation formula:

\[ r = \frac{s_{xy}}{s_x s_y} \]

Population Covariance formula:

\[ \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} \]

Population Correlation formula:

\[ \rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]
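
The sample covariance and correlation formulas can be sketched directly in Python. The function names and the three small lists are hypothetical and chosen so that the results are easy to verify by hand.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = sqrt(sample_covariance(xs, xs))
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x, a perfect positive linear relationship
z = [4, 1, 0, 1, 4]    # z = (x - 3)^2, fully determined by x but not linearly

print(sample_covariance(x, y), sample_correlation(x, y))
print(sample_covariance(x, z))
```

The covariance of `x` and `z` is 0 even though `z` is computed from `x`, which illustrates why a covariance or correlation of 0 rules out only a linear relationship.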

B. INFERENTIAL STATISTICS:

Probability distribution:
It is a statistical function that describes all the possible values that a random variable
can take within a given range, together with their likelihoods. This range is bounded by the
lowest and the highest possible values, but precisely where a value is likely to fall on the
probability distribution depends on a number of factors, such as the distribution's mean,
standard deviation, skewness, and kurtosis. A few examples of distributions are:
- Normal distribution
- Binomial distribution
- Student’s T distribution
- Uniform distribution
- Poisson distribution

A distribution is often confused with a graph, but in fact it is the rule that determines how
the values are positioned in relation to each other. The graph is only a picture of that rule.

i. Normal Distribution:
It is also known as the Gaussian distribution or Bell Curve. It is widely used in regression
analysis. Many quantities approximately follow this distribution:

 heights of people

 size of things produced by machines
 errors in measurements
 blood pressure
 marks on a test
 stock market information

When data is normally distributed, the distribution is symmetric and

Mean = median = mode

\[ X \sim N(\mu, \sigma^2) \]

where N stands for normal, ~ means "is distributed as", µ is the mean, and σ² is the variance.

ii. Standard Normal Distribution:

It is a normal distribution with a mean of 0 and a standard deviation of 1. Every normal
distribution can be standardized using the following formula:

\[ z = \frac{x - \mu}{\sigma} \]

\[ Z \sim N(0, 1) \]

Standardization makes it possible to compare different normally distributed datasets, test
hypotheses, detect outliers and normality, create confidence intervals and perform regression
analysis.
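
A minimal sketch of standardization: each value has the mean subtracted and is divided by the standard deviation. The heights below are made-up numbers, and the list is treated as a whole population so that `pstdev` is the matching standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical heights in centimetres.
heights_cm = [150, 160, 170, 180, 190]

mu = mean(heights_cm)
sigma = pstdev(heights_cm)

# z = (x - mu) / sigma for every observation.
z_scores = [(x - mu) / sigma for x in heights_cm]

print([round(z, 3) for z in z_scores])
print(round(mean(z_scores), 3), round(pstdev(z_scores), 3))
```

After standardization the values have a mean of 0 and a standard deviation of 1, whatever the original units were.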

iii. Central Limit Theorem:

This theorem states that the distribution of sample means approximates a normal
distribution as the sample size gets larger (assuming that all samples are the same size),
regardless of the shape of the population distribution. Sample sizes of 30 or more are usually
considered enough for the Central Limit Theorem to hold. The main consequence of this theorem
is that the average of the sample means equals the population mean µ, while the standard
deviation of the sample means (the standard error) equals the population standard deviation
divided by the square root of the sample size, σ/√n. Furthermore, an adequately large sample
can describe the characteristics of a population accurately. In the Central Limit Theorem:

 No matter the distribution

 The more samples, closer to Normal (k ->∞)
 The bigger the samples, the closer to Normal (n -> ∞)

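The Central Limit Theorem can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be uniform on [0, 1), the sample size n = 30 and the number of samples k = 2000 are arbitrary choices, and the seed is fixed so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

n = 30     # size of each sample
k = 2000   # number of samples drawn

# Assumed population: uniform on [0, 1), which is flat rather than bell-shaped.
population_mean = 0.5
population_sd = sqrt(1 / 12)

# Draw k samples of size n and record the mean of each one.
sample_means = [mean([random.random() for _ in range(n)]) for _ in range(k)]

print(round(mean(sample_means), 4))       # close to the population mean
print(round(pstdev(sample_means), 4))     # close to population_sd / sqrt(n)
print(round(population_sd / sqrt(n), 4))  # theoretical standard error
```

A histogram of `sample_means` would look approximately bell-shaped even though the population itself is flat.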
Estimators and Estimates:

Estimators:

An estimator is a mathematical function of the sample that tells us how to calculate an
estimate of a parameter from a sample. Among unbiased estimators, the smaller the variance, the
more efficient the estimator. Hence, we need to determine which estimators are “good”. A few
vital criteria for the goodness of an estimator are based on these properties:

- Bias
- Variance
- Mean Square Error

Examples of estimators and the corresponding parameters are given in the table below.

Term          Estimator   Parameter

Mean          x̄           µ
Variance      s²          σ²
Correlation   r           ρ

Estimates:

An estimate is the output value that you get from an estimator. There are the following
types of estimates:

i. Point Estimates – a single value, e.g. 1, 6, 12.34, 0.123

ii. Confidence Interval Estimates – an interval, e.g. (1, 4), (43, 45), (3.22, 5.33), (-0.24,
0.26). We mostly use confidence interval estimates when making inferences because,
unlike a point estimate, an interval conveys how uncertain the estimate is.

Confidence Interval:

It is an interval within which, with a certain percentage of confidence, we expect the
population parameter to fall.

Margin of Error:

A margin of error explains by how much (for example, by how many percentage points) your
results are expected to differ from the real population value. It can be calculated in the
following two ways:

i. Margin of error = Critical value x Standard deviation

ii. Margin of error = Critical value x Standard error of the statistic
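
The second formula can be sketched as follows. The sample values are hypothetical, and the critical value is taken from the standard normal distribution via `statistics.NormalDist` (Python 3.8+), which is a simplification: with a small sample and unknown population variance, a Student's T critical value would normally be used instead and would give a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of measurements.
sample = [52, 48, 50, 47, 53, 51, 49, 50, 52, 48, 51, 49]

confidence = 0.95
alpha = 1 - confidence

# Two-sided critical value from the standard normal distribution (about 1.96).
critical_value = NormalDist().inv_cdf(1 - alpha / 2)

# Standard error of the sample mean.
standard_error = stdev(sample) / sqrt(len(sample))

margin_of_error = critical_value * standard_error
lower = mean(sample) - margin_of_error
upper = mean(sample) + margin_of_error

print(round(margin_of_error, 3))
print(round(lower, 3), round(upper, 3))
```

The confidence interval is the point estimate (here the sample mean) plus and minus the margin of error.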

Student’s T Distribution:

It is mostly used to estimate population parameters when the sample size is small and/or the
population variance is not known. It is very useful in cases where we do not have enough
information or where acquiring the requisite information is too costly. Compared to the normal
distribution, it has fatter tails and a lower peak. For a sample drawn from a normally
distributed population, the following statistic follows the Student’s T distribution:

\[ t_{v,\alpha} = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

where v = n - 1 is the number of degrees of freedom.
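
A short sketch of computing the t statistic and its degrees of freedom. The sample and the hypothesized mean `mu_0` are made-up values; looking up the corresponding critical value or p-value would need a t table or a statistics library and is not shown here.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical small sample and hypothesized population mean.
sample = [118, 122, 125, 130, 121, 127, 124, 119]
mu_0 = 125

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)  # sample standard deviation (divides by n - 1)

# t = (x_bar - mu) / (s / sqrt(n)), with v = n - 1 degrees of freedom.
t_statistic = (x_bar - mu_0) / (s / sqrt(n))
degrees_of_freedom = n - 1

print(round(x_bar, 2), round(s, 3))
print(round(t_statistic, 3), degrees_of_freedom)
```

A negative t statistic simply means that the sample mean lies below the hypothesized mean.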

C. HYPOTHESIS TESTING:
Scientific Method:
The scientific method is a process for gathering data and processing information. It was
first outlined by Sir Francis Bacon (1561-1626) to provide logical, rational problem solving
across many scientific fields. The main principles of the scientific method are systematic
observation, predictability, verifiability and amendment of the hypothesis. The basic steps of
the scientific method are:

- Make an observation that describes the issue
- Make a hypothesis or potential solution to the problem
- Test the hypothesis
- If the hypothesis is supported, then look for further evidence or counter-evidence
- If the hypothesis is not supported, then create a new one or try again
- Draw conclusions and refine the hypothesis

What is hypothesis?
A hypothesis is an assumption based on limited evidence that requires further testing
and experimentation. After further testing, a hypothesis can generally be supported or rejected.

Null Hypothesis (H0):
A null hypothesis is the hypothesis that is to be tested. It is the hypothesis that
the investigator is trying to show to be false. It represents the status quo. The concept of the
null is similar to a person remaining innocent until there is enough evidence to prove guilt.
For instance, someone says that a data engineer's average salary is Rs.1,25,000/-, but in our
opinion this may be wrong, so we perform a statistical test to try to reject this claim. The
claim being tested is called the null hypothesis.

Alternative Hypothesis (H1 or HA):

An alternative hypothesis is the opposite of the null hypothesis and is usually based on our
own opinion. For instance, someone says that a data engineer's average salary is Rs.1,25,000/-,
but in our opinion a data engineer earns less than this. Our claim is called the alternative
hypothesis.

DECISIONS:
After testing, there are two possible decisions: fail to reject (often loosely called
"accept") the null hypothesis, or reject the null hypothesis. Failing to reject the null
hypothesis means there is insufficient evidence to support the change or novelty proposed by
the alternative. Rejecting the null hypothesis means there is sufficient statistical evidence
that the null hypothesis is false.

Level of Significance:
It is the probability of rejecting a null hypothesis by the test when it is really true. It is
denoted by α (Alpha).

Confidence Level:
It is the probability that an interval constructed by the chosen procedure contains the true
value of the parameter. It is denoted by c. The level of significance is connected with the
confidence level by the relationship c = 1 – α. The common levels of significance and the
corresponding confidence levels are given below:

 The level of significance 0.10 is related to the 90% confidence level.

 The level of significance 0.05 is related to the 95% confidence level.
 The level of significance 0.01 is related to the 99% confidence level.

The rejection rule is given below:-

 If p-value ≤ level of significance, then reject the null hypothesis.
 If p-value > level of significance, then do not reject the null hypothesis.

Rejection region:
The rejection region is the set of values of the test statistic for which the null hypothesis
is rejected.

Non rejection region:

The set of all possible values of the test statistic for which the null hypothesis is not
rejected is called the non-rejection region.

A one-sided (one-tailed) test is used when the hypotheses are stated with a directional
inequality sign (<, >, ≤, ≥), for example H0: µ ≥ µ0 against H1: µ < µ0. The rejection region
for a one-sided test lies entirely in one tail of the distribution:

- In the left-tailed test, the rejection region is in the left tail.
- In the right-tailed test, the rejection region is in the right tail.

A two-sided (two-tailed) test is used when the null hypothesis contains the equality sign (=)
and the alternative contains the not-equal sign (≠). The rejection region for a two-sided test
is split between both tails of the distribution.

Statistical Errors:
There are two types of statistical errors:

i. Type I Error (False Positive)
ii. Type II Error (False Negative)

Type-I Error (False Positive):

Type-I error occurs when we reject a null hypothesis that is actually true. The probability
of committing a type-I error is denoted by α (alpha).

Type-II Error (False Negative):

Type-II error occurs when we fail to reject a null hypothesis that is actually false. The
probability of committing a type-II error is denoted by β (beta).

P-value:
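The p-value is the smallest level of significance at which the null hypothesis would be
rejected. Equivalently, it is the probability, assuming the null hypothesis is true, of
observing a result at least as extreme as the one obtained. A smaller p-value means that there
is stronger evidence in support of the alternative hypothesis. Usually, the p-value is reported
with 3 digits after the decimal point (x.xxx).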

The p-value is a number between 0 and 1 and can be interpreted as:

 A small p-value (typically ≤ 0.05) represents strong evidence against the null hypothesis,
so, we reject the null hypothesis.
 A large p-value (> 0.05) represents weak evidence against the null hypothesis, so, we fail
to reject the null hypothesis. 0.05 is often the “cut-off-line”.
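
The rejection rule can be sketched with a left-tailed z-test on the salary example. Only the hypothesized mean of Rs.1,25,000 comes from the notes; the sample mean, the population standard deviation, the sample size and the choice of a z-test (which assumes the population standard deviation is known) are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 125_000         # H0: mean salary is Rs.1,25,000 (from the example above)
sample_mean = 120_000  # hypothetical sample mean
sigma = 15_000         # hypothetical known population standard deviation
n = 36                 # hypothetical sample size
alpha = 0.05           # level of significance

# Test statistic: how many standard errors the sample mean lies from mu_0.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Left-tailed test (H1: mean salary is less than mu_0): area to the left of z.
p_value = NormalDist().cdf(z)

# Rejection rule: reject H0 when p-value <= level of significance.
reject_null = p_value <= alpha

print(round(z, 2), round(p_value, 3), reject_null)
```

With these assumed numbers the test statistic is -2.0 and the p-value is about 0.023, which is below 0.05, so the null hypothesis would be rejected at the 5% level but not at the 1% level.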
