Classx - DS - UNIT 1

The document provides an overview of the use of statistics in data science, focusing on key concepts such as subsets, mean, median, mean absolute deviation, and standard deviation. It explains how to create and interpret two-way frequency tables and their relative frequency counterparts, along with practical examples and exercises. Additionally, it highlights the importance of these statistical measures in real-life applications and data analysis.

Uploaded by

Sushanth Dasari

DATA SCIENCE

Grade X
Chapter 1: Use of Statistics in Data Science
LEARNING OBJECTIVES:
• What are subsets and relative frequency?
• Meaning of mean
• What is median and its usage in data science?
• What is mean absolute deviation?
• What is Standard Deviation?
What is a Subset?

• Many times we come across situations where
we have a lot of data with us.
• However, for analysis, we do not need to consider
the entire data.
• Thus, instead of working with the whole data set,
we can take a certain part of the data for our analysis.
• This smaller set of data that forms a part of a
larger set is known as a Subset.
What is a Subset?
• Subsetting the data is a useful indexing feature
for accessing object elements. It can be used
for selecting and filtering variables and
observations. We subset the data from a data
frame to retrieve a part of the data that we
need for a specific purpose. This helps us to
observe just the required set of data by filtering
out unnecessary content.
• For example, if you have a Table of 100 rows
and 100 columns and you want to perform
certain actions on first 5 rows and first 5
columns, you can separate it out from the main
table.
• This small table of 5 rows and 5 columns is
known as a “Subset” in Data Analytics.
How do we subset data?
Subsetting is a very significant component of
data management and there are several ways
that one can subset data. Let us now understand
different ways of subsetting the data.
Row based Subsetting
In this method of subsetting, we take some
rows from the top or bottom of the table.
Column based Subsetting
In this method, we select specific columns
from dataset for processing.
Data based Subsetting
In this method, we select only the rows whose
values meet a specific condition, for example,
only the cars that are red.
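The three ways of subsetting can be sketched in plain Python. This is only an illustration: the small table of cars, its column names and its values are made up, and a real project would usually use a data frame library instead of a list of dictionaries.

```python
# Hypothetical table of cars: each dictionary is one row,
# each key is one column. All values are made up for illustration.
cars = [
    {"model": "A", "color": "red", "price": 5},
    {"model": "B", "color": "blue", "price": 7},
    {"model": "C", "color": "red", "price": 6},
    {"model": "D", "color": "white", "price": 4},
]

# Row based subsetting: take rows from the top or bottom of the table.
top_rows = cars[:2]
bottom_rows = cars[-2:]

# Column based subsetting: keep only the "model" and "color" columns.
model_color = [{"model": row["model"], "color": row["color"]} for row in cars]

# Data based subsetting: keep only the rows where the color is red.
red_cars = [row for row in cars if row["color"] == "red"]

print([row["model"] for row in red_cars])  # ['A', 'C']
```

Each result is itself a smaller table, that is, a subset of the original data.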
Two-way frequency table

Suppose you are conducting a poll
asking people if they like chocolates.
If you now break down the data into
age categories of (5 – 10 years), (10 –
15 years), and (15 – 20 years), and record
the number of people in each age group who
liked and disliked chocolates, the table
has two dimensions: age group and preference.
This type of table is called a two-way
frequency table.
What is a two-way frequency table?
A two-way table is a statistical table that
shows the observed number or frequency
for two variables. The rows indicate one category,
and the columns indicate the other category.
Two-way frequency tables show how many data
points fit in each category.
The row category in this example is “5-10 years”,
“10-15 years” and “15-20 years”.
The column category is their choice “Like
chocolates” or “Do not like chocolates”.
Each cell tells us the number (or frequency) of the
people.
Interpreting Two Way Frequency Table

In a two-way frequency table, the entries in the table are counts.

The table has several features:
Categories are in the left column and top row

The counts are placed in the center of the table.

The totals are at the end of each row and column.

A sum of all counts (a total) is placed at the bottom right

What is a two-way frequency table?

There is a lot of information that we can get from this
small table.
For example,
How many people were questioned? Answer: 10
How many people like chocolates? Answer: 6
In which age group do people like chocolate the most?
Answer: 10 – 15
Example:
A survey of eighty people (40 men and 40 women) was taken on what genre
of movie they would choose to watch, and the following responses were
recorded:
• 8 men preferred comedy movies.
• 18 men preferred action movies.
• 14 men preferred horror movies.
• 23 women preferred comedy movies.
• 10 women preferred action movies.
• 7 women preferred horror movies.
Two-way table
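The responses listed above can be arranged into a two-way table with row and column totals. The following is a minimal Python sketch; the variable names are chosen here only for illustration.

```python
# Counts taken from the survey responses listed above.
counts = {
    "Men": {"Comedy": 8, "Action": 18, "Horror": 14},
    "Women": {"Comedy": 23, "Action": 10, "Horror": 7},
}
genres = ["Comedy", "Action", "Horror"]

# Totals at the end of each row and column, and the grand total.
row_totals = {group: sum(row.values()) for group, row in counts.items()}
column_totals = {g: sum(counts[group][g] for group in counts) for g in genres}
grand_total = sum(row_totals.values())

# Print the two-way frequency table.
print(f"{'':8}" + "".join(f"{g:>8}" for g in genres) + f"{'Total':>8}")
for group in counts:
    cells = "".join(f"{counts[group][g]:>8}" for g in genres)
    print(f"{group:8}{cells}{row_totals[group]:>8}")
print(f"{'Total':8}" + "".join(f"{column_totals[g]:>8}" for g in genres) + f"{grand_total:>8}")
```

The row totals come to 40 men and 40 women, the column totals to 31 (comedy), 28 (action) and 21 (horror), and the bottom-right total to 80, which matches the number of people surveyed.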
Activity 1.1

Record how many of your friends like cricket and

how many like football. Create a two-way relative
frequency table with the data
Two-way relative frequency table

A two-way relative frequency table is very similar to a two-way frequency
table.
The only difference is that we use percentages instead of counts.
Two-way relative frequency tables show what percentage of the data
points fit in each category.
We can use row relative frequencies or column relative frequencies,
depending on the context of the problem.
Two-way relative frequency table (% given)

Two-way relative frequency tables are helpful when there are different sample sizes in a
dataset. Percentages make it easier to compare the preferences.
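As a rough sketch, row relative frequencies can be computed by dividing each count by its row total. The example below reuses the movie survey counts from the earlier example; choosing row percentages rather than column percentages is an assumption made for illustration.

```python
# Counts from the movie survey (40 men and 40 women).
counts = {
    "Men": {"Comedy": 8, "Action": 18, "Horror": 14},
    "Women": {"Comedy": 23, "Action": 10, "Horror": 7},
}

# Row relative frequency: each count as a percentage of its row total.
row_relative = {
    group: {genre: 100 * n / sum(row.values()) for genre, n in row.items()}
    for group, row in counts.items()
}

for group, row in row_relative.items():
    cells = ", ".join(f"{genre}: {pct:.1f}%" for genre, pct in row.items())
    print(f"{group}: {cells}")
```

Here 20% of the men and 57.5% of the women preferred comedy. Each row adds up to 100%, so the two groups could still be compared even if they had different sizes.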
What is Mean?

Mean is a measure of central tendency.

In data science, the mean is the average value of a data set.
It is the value around which the entire data set is spread out.
The mean of a data set is calculated by adding up all the values in the data set
and then dividing the sum by the number of values in the data set.
Example of Mean
Consider a data set of the 11 numbers from 10 to 20.
Array = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

The mean is calculated by adding up the 11 numbers in the data set and dividing the sum by 11.

Sum of all the numbers = 165
Mean = 165/11 = 15
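The calculation can be checked with a few lines of Python. This is a minimal sketch using only built-in functions.

```python
# The whole numbers from 10 to 20 (range stops before 21).
values = list(range(10, 21))

total = sum(values)          # add up all the values
count = len(values)          # number of values in the data set
mean = total / count         # divide the sum by the count

print(count, total, mean)    # 11 165 15.0
```

Counting the values is the step that is easiest to get wrong: the numbers from 10 to 20 inclusive make 11 values, not 10.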
Activity 1.2

• Height of Ravi: 156cm

• Height of Juhi: 148cm
• Height of Shweta: 151cm
• Height of Kishan: 158cm
What is the mean height?
What is Median?

Median is a second measure of central tendency.

It is the middle value in an ordered data set.
To calculate the median, we must first sort the data set in ascending or descending order.
The exact middle value of the sorted set is the median.
Example of Median
Consider the below data set of 5 values.
Array = [12, 34, 56, 89, 32]

Now let us sort the data set.

Sorted array = [12, 32, 34, 56, 89]

The value at 3rd position is the middle point of the sorted list. So, 34 is our
median for the array.
Example of Median

What if the data set has an even number of records?

For these situations, there will be two middle
points. Thus, we need to calculate the
average of the two to get the median.
The below example illustrates how to
calculate median from an even number
of records.
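Both cases can be sketched with one small Python function. The function name `median` and the extra value 40 in the second data set are chosen here only for illustration.

```python
def median(values):
    """Return the middle value of the sorted data (average of two if even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of records: one exact middle value.
        return ordered[middle]
    # Even number of records: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([12, 34, 56, 89, 32]))      # 34
print(median([12, 34, 56, 89, 32, 40]))  # 37.0
```

In the second call the sorted data is 12, 32, 34, 40, 56, 89, so the two middle values are 34 and 40 and the median is their average, 37.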
Mean vs Median
Mean and median both represent the
central tendency of a data set.
So when do we use median over mean? Median is a more reliable measure
of central tendency, especially in scenarios where there are some irregular
values, also known as outliers.
For example, consider the below scenario.
Your father gets his blood pressure checked every week.
But due to some error in the device,
the recording for one week was too high.
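The effect of one faulty reading can be sketched in Python. The five weekly readings below are hypothetical numbers invented for illustration; the last one plays the role of the device error.

```python
import statistics

# Hypothetical weekly readings; 180 stands in for the faulty recording.
readings = [120, 122, 118, 121, 180]

mean_value = statistics.mean(readings)
median_value = statistics.median(readings)

print(mean_value)    # 132.2
print(median_value)  # 121
```

The single outlier pulls the mean up to 132.2, which is higher than every normal reading, while the median stays at 121, close to the typical values.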
Mean absolute deviation

Mean Absolute Deviation (MAD) is the average of how far away all values in a data
set are from the mean.
The value of the mean absolute deviation gives a good understanding of the
variability of the data set, in other words, how scattered the data set is.
One of the applications of Mean Absolute Deviation in real life is when teachers give
tests to students and then average the results to see if the average score was high,
in between, or too low.
Each average tells a story.
Absolute deviation can further help to see the distance between each of the scores
and the average score.
Example of Mean absolute deviation
Consider the below data set:
12, 16, 10, 18, 11, 19
Step 1: Calculate the mean
Mean = (12 + 16 + 10 + 18 + 11 + 19) / 6 = 86 / 6 ≈ 14.33, which we round off to 14
Step 2: Calculate the distance of each data point from the mean. We need to find
the absolute value. For example, if the distance is -2, then we ignore the negative
sign.
|-2| = 2
Step 3: Calculate the mean of the distances.
Mean of distances = (2 + 2 + 4 + 4 + 3 + 5) / 6 = 3.33
So, 3.33 is our mean absolute deviation, and the mean is 14.
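The three steps can be sketched in Python as follows. One difference from the hand calculation is an assumption of this sketch: it uses the unrounded mean (about 14.33) instead of 14. For this data set the result still rounds to 3.33.

```python
data = [12, 16, 10, 18, 11, 19]

# Step 1: calculate the mean.
mean = sum(data) / len(data)

# Step 2: distance of each data point from the mean, ignoring the sign.
distances = [abs(x - mean) for x in data]

# Step 3: the mean of the distances is the mean absolute deviation.
mad = sum(distances) / len(distances)

print(round(mean, 2), round(mad, 2))  # 14.33 3.33
```

The built-in `abs` function does the "ignore the negative sign" step, so a distance of -2 becomes 2.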
What is Standard Deviation?

The Standard Deviation is a measure of how spread out
numbers are.
It is a summary measure of the differences of each
observation from the mean.
If the differences themselves were added up, the positive
would exactly balance the negative and so their sum
would be zero. This is why the differences are squared
before they are averaged.
How to find Standard Deviation?
In order to find standard deviation:

1. Calculate the mean by adding up all the data pieces and dividing it by the
number of pieces of the data.
2. Subtract mean from every value
3. Square each of the differences
4. Find the average of squared numbers calculated in point number 3 to find the
variance.
5. Lastly, find the square root of variance. That is the standard deviation.
Example
Take the values 1,2,3,5 and 8

Step 1: Calculate the mean

1+2+3+5+8 = 19
19/5 = 3.8 (mean)
Step 2: Subtract mean from every value
1- 3.8= -2.8
2- 3.8= -1.8
3- 3.8= -0.8
5- 3.8= 1.2
8- 3.8= 4.2
Step 3: Square each difference
-2.8*-2.8 = 7.84
-1.8*-1.8 = 3.24
-0.8*-0.8 = 0.64
1.2*1.2 = 1.44
4.2*4.2 = 17.64
Step 4: Calculate the average of the squared numbers
to get the variance

7.84+3.24+0.64+1.44+17.64 = 30.8
30.8/5 = 6.16 (Variance)
Step 5: Find the square root of the variance
The square root of 6.16 = 2.48

Thus, the standard deviation of the values
1, 2, 3, 5 and 8 is 2.48
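The five steps can be reproduced with a short Python sketch. It divides by the number of values, as the steps above do, which is the population standard deviation; the comparison with `statistics.pstdev` is added here only as a cross-check.

```python
import math
import statistics

values = [1, 2, 3, 5, 8]

# Step 1: calculate the mean.
mean = sum(values) / len(values)

# Step 2: subtract the mean from every value.
differences = [x - mean for x in values]

# Step 3: square each difference.
squared = [d ** 2 for d in differences]

# Step 4: the average of the squared differences is the variance.
variance = sum(squared) / len(squared)

# Step 5: the square root of the variance is the standard deviation.
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))  # 3.8 6.16 2.48

# Cross-check with the standard library (population standard deviation).
print(round(statistics.pstdev(values), 2))  # 2.48
```

Note that `statistics.stdev` would give a different result, because it divides by one less than the number of values (the sample standard deviation).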
[Figure: graphical representation of the standard deviation of 2.48 for the values 1, 2, 3, 5 and 8]
A few real-life applications of standard deviation
include:

1. Grading Tests – A teacher may want to know whether
students are performing at about the same level (a low
standard deviation) or whether their scores vary widely
(a high standard deviation).
2. Calculating the results of a Survey – Someone may
want a measure of the reliability of
the responses received in the survey, in order to predict how a
bigger group of people may answer the same questions.

3. Weather Forecasting – A weather forecaster may
analyze the low temperatures forecasted for three
different cities. A low standard deviation suggests
a more reliable weather forecast.
Practice Question
A financial analyst is analyzing the returns of Google stock
and wants to measure the risk of investing in it.
He therefore collects data on
the historical returns of Google for the last five years,
which are as follows:
Year               2018     2017     2016     2015     2014
Returns (%) (xi)   27.70%   36.10%   10.50%   6.80%    -4.60%
Exercises
Objective Type Questions
Please choose the correct option in the questions below.
1. We want to get the cars of red color from the below data set. Which type of
subsetting should be used?
a) Column based subsetting
b) Data based subsetting
c) Row based subsetting
d) None of the above
Answer: b
2. Which is a more accurate measure of central tendency when there
are outliers in the data set?
a) Mean
b) Median
Answer: b
3. Mean absolute deviation is an identifier of the variability of the data
set. Is this a correct statement?
a) Yes
b) No
Answer: a
4. The mean absolute deviation is divided by coefficient of mean absolute deviation to
calculate
a) Variance
b) Median
c) Arithmetic Mean
d) Coefficient of Variation
Answer: c
5. In a manufacturing company, the number of employees in unit A is 40, the mean is Rs. 6400
and the number of employees in unit B is 30 with the mean of Rs. 5500 then the combined
arithmetic mean is
a) 9500
b) 8000
c) 7014.29
d) 6014.29
Answer: d
6. The mean deviation about the mean for the following data:
5, 6, 7, 8, 6, 9, 13, 12, 15 is:
a) 1.5
b) 3.2
c) 2.89
d) 5
Answer: c
7. The arithmetic mean of the numerical values of the deviations of items from some
average value is called the
a) Standard Deviation
b) Range
c) Quartile Deviation
d) Mean Deviation
Answer: d
Standard Questions
1. Explain the different ways of subsetting data.
2. When should we use median over mean?
3. What is Mean Absolute Deviation?
4. What is a two-way relative frequency table? How is it different from a two-way
frequency table?
5. What are two-way frequency tables beneficial for?
6. What is Standard Deviation?
7. How to calculate Standard Deviation?
8. Name five real-life applications of Standard Deviation
9. Explain five real-life situations where subsetting data can be advantageous
