
Classx - DS - UNIT 1

The document provides an overview of the use of statistics in data science, focusing on key concepts such as subsets, mean, median, mean absolute deviation, and standard deviation. It explains how to create and interpret two-way frequency tables and their relative frequency counterparts, along with practical examples and exercises. Additionally, it highlights the importance of these statistical measures in real-life applications and data analysis.

Uploaded by

Sushanth Dasari

DATA SCIENCE

Grade X
Chapter 1: Use of Statistics in Data Science
LEARNING OBJECTIVES:
• What are subsets and relative frequency?
• Meaning of mean
• What is median and its usage in data science?
• What is mean absolute deviation?
• What is Standard Deviation?
What is a Subset?

• We often come across situations where we have a lot of data with us.
• However, for analysis, we do not need to consider the entire data.
• Thus, instead of working with the whole data set, we can take a certain part of the data for our analysis.
• This smaller set of data that forms a part of a larger set is known as a Subset.
What is a Subset?
• Subsetting the data is a useful indexing feature
for accessing object elements. It can be used
for selecting and filtering variables and
observations. We subset the data from a data
frame to retrieve a part of the data that we
need for a specific purpose. This helps us to
observe just the required set of data by filtering
out unnecessary content.
• For example, if you have a Table of 100 rows
and 100 columns and you want to perform
certain actions on first 5 rows and first 5
columns, you can separate it out from the main
table.
• This small table of 5 rows and 5 columns is
known as a “Subset” in Data Analytics.
How do we subset data?
Subsetting is a very significant component of
data management and there are several ways
that one can subset data. Let us now understand
different ways of subsetting the data.
Row based Subsetting
In this method of subsetting, we take some
rows from the top or bottom of the table.
Column based Subsetting
In this method, we select specific columns
from dataset for processing.
Data based Subsetting
In this method, we select only the rows whose values meet a
specific condition, for example all cars that are red.
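The three ways of subsetting can be sketched in plain Python. This is only an illustration: the `cars` table, its column names, and its values are hypothetical and are not taken from the chapter.

```python
# Hypothetical table: each row is a dict, each key is a column name.
cars = [
    {"model": "A", "color": "red", "price": 5},
    {"model": "B", "color": "blue", "price": 7},
    {"model": "C", "color": "red", "price": 6},
    {"model": "D", "color": "white", "price": 9},
]

# Row based subsetting: take some rows from the top or bottom of the table.
first_two_rows = cars[:2]
last_row = cars[-1:]

# Column based subsetting: keep only the selected columns in every row.
wanted_columns = ["model", "price"]
model_and_price = [{col: row[col] for col in wanted_columns} for row in cars]

# Data based subsetting: keep only the rows whose data meets a condition.
red_cars = [row for row in cars if row["color"] == "red"]

print(first_two_rows)
print(model_and_price)
print(red_cars)
```

In practice a data-frame library would usually do this work; lists and dictionaries are used here only to keep the sketch self-contained.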
Two-way frequency table

Consider you are conducting a poll asking people if they like chocolates.
If you now break down the data into age categories of (5 – 10 years), (10 – 15 years), and (15 – 20 years), and plot the number of people who liked and disliked chocolates, then the table would look different.
This type of table is called a two-way frequency table.
What is a two-way frequency table?
A two-way table is a statistical table that shows the
observed number, or frequency, for two variables.
The rows indicate one category, and the columns
indicate the other category.
Two-way frequency tables show how many data
points fit in each category.
The row category in this example is “5-10 years”,
“10-15 years” and “15-20 years”.
The column category is their choice “Like
chocolates” or “Do not like chocolates”.
Each cell tells us the number (or frequency) of the
people.
Interpreting Two Way Frequency Table

In a two-way frequency table, the entries in the table are counts.
The table has several features:
• Categories are in the left column and top row.
• The counts are placed in the center of the table.
• The totals are at the end of each row and column.
• A sum of all counts (a total) is placed at the bottom right.

What is a two-way frequency table?

There is a lot of information that we can get from this small table.
For example,
How many people were questioned? Answer: 10
How many people like chocolates? Answer: 6
In which age group do people like chocolate the most?
Answer: 10 – 15
Example:
A survey of eighty people (40 men and 40 women) was taken on what genre
of movie they would choose to watch, and the following responses were
recorded:
• 8 men preferred comedy movies.
• 18 men preferred action movies.
• 14 men preferred horror movies.
• 23 women preferred comedy movies.
• 10 women preferred action movies.
• 7 women preferred horror movies.
Two-way table
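The responses listed above can be arranged into a two-way table with row and column totals. A minimal sketch in plain Python follows; the layout (men and women as rows, genres as columns) and the variable names are assumptions made for illustration.

```python
# Survey responses from the example, stored as (gender, genre) -> count.
counts = {
    ("Men", "Comedy"): 8, ("Men", "Action"): 18, ("Men", "Horror"): 14,
    ("Women", "Comedy"): 23, ("Women", "Action"): 10, ("Women", "Horror"): 7,
}
genders = ["Men", "Women"]
genres = ["Comedy", "Action", "Horror"]

# Totals at the end of each row and each column, plus the grand total.
row_totals = {g: sum(counts[(g, m)] for m in genres) for g in genders}
column_totals = {m: sum(counts[(g, m)] for g in genders) for m in genres}
grand_total = sum(counts.values())

# Print the table: categories on the left and top, counts in the center.
print(f"{'':<8}" + "".join(f"{m:>8}" for m in genres) + f"{'Total':>8}")
for g in genders:
    cells = "".join(f"{counts[(g, m)]:>8}" for m in genres)
    print(f"{g:<8}{cells}{row_totals[g]:>8}")
totals = "".join(f"{column_totals[m]:>8}" for m in genres)
print(f"{'Total':<8}{totals}{grand_total:>8}")
```

Each row total comes to 40 and the grand total to 80, which matches the 40 men and 40 women surveyed.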
Activity 1.1

Record how many of your friends like cricket and how many like football. Create a two-way relative frequency table with the data.
Two-way relative frequency table

A two-way relative frequency table is very similar to a two-way frequency table.
The only difference is that we consider percentages instead of counts.
Two-way relative frequency tables show the percentage of data points that fit in each category.
We can take the help of row relative frequencies or column relative frequencies;
it depends on the context of the problem.
Two-way relative frequency table (% given)

Two-way relative frequency tables are helpful when there are different sample sizes in a
dataset. Percentages make it easier to compare the preferences.
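As a sketch of how counts turn into relative frequencies, the movie survey counts from the earlier example are reused here. This reuse is an assumption for illustration, as are the variable names. Two versions are shown: each cell as a percentage of everyone surveyed, and row relative frequencies, where each cell is a percentage of its own row total.

```python
# Counts from the movie survey example: rows are genders, columns are genres.
counts = {
    "Men": {"Comedy": 8, "Action": 18, "Horror": 14},
    "Women": {"Comedy": 23, "Action": 10, "Horror": 7},
}
grand_total = sum(sum(row.values()) for row in counts.values())

# Each cell as a percentage of all people surveyed.
overall_pct = {
    gender: {genre: 100 * n / grand_total for genre, n in row.items()}
    for gender, row in counts.items()
}

# Row relative frequency: each cell as a percentage of its own row total.
row_pct = {
    gender: {genre: 100 * n / sum(row.values()) for genre, n in row.items()}
    for gender, row in counts.items()
}

for gender in counts:
    for genre in counts[gender]:
        print(gender, genre,
              f"{overall_pct[gender][genre]:.1f}% of all,",
              f"{row_pct[gender][genre]:.1f}% of row")
```

Column relative frequencies follow the same pattern, dividing each cell by its column total instead.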
Two-way relative frequency table
What is Mean?

Mean is a measure of central tendency.
In data science, the mean is the average value of a data set.
It is the value around which the entire data is spread out.
The mean of a data set is calculated by adding up all the values in the data set
and then dividing the sum by the number of values in the data set.
Example of Mean
Consider that we have a set of 11 numbers, 10 to 20, in a data set.
Array = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

So the mean is calculated by adding up the 11 numbers in the data set and dividing the sum by 11.
Sum of all the numbers = 165
Mean = 165/11 = 15
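The calculation can be checked with a few lines of Python. The set 10 to 20 contains 11 values, so the sum of 165 is divided by 11. The variable names are chosen only for this sketch.

```python
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

total = sum(values)   # add up all the values
count = len(values)   # number of values in the data set
mean = total / count  # divide the sum by the number of values

print(total, count, mean)
```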
Activity 1.2

• Height of Ravi: 156cm
• Height of Juhi: 148cm
• Height of Shweta: 151cm
• Height of Kishan: 158cm
What is the mean height?
What is Median?

Median is a second measure of central tendency


It is a middle value in an ordered data frame
To calculate median, we must order our data set in ascending or descending order.
The exact middle value of the ordered set is nothing but a Median.
If the data set is sorted from smallest value to biggest value, the exact middle value
of the set is the Median.
Example of Median
Consider the below data set of 5 values.
Array = [12, 34, 56, 89, 32]

Now let us sort the data set.


Sorted array = [12, 32, 34, 56, 89]

The value at 3rd position is the middle point of the sorted list. So, 34 is our
median for the array.
Example of Median

What if the data set has an even number of records?
For these situations, there will be two middle points. Thus, we need to calculate the average of the two to get the median.
The below example illustrates how to calculate median from an even number of records.
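Both cases can be sketched with one small function. The five-value array is the one used earlier in this chapter. The six-value array is a hypothetical example added here (the same array with 40 appended), since it is not taken from the chapter.

```python
def median(values):
    """Return the middle value of the sorted data, averaging the two
    middle values when the number of records is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = [12, 34, 56, 89, 32]
even_example = [12, 34, 56, 89, 32, 40]  # hypothetical: one extra value added

print(median(odd_example))   # middle of [12, 32, 34, 56, 89]
print(median(even_example))  # average of 34 and 40 in [12, 32, 34, 40, 56, 89]
```

For the even-sized array the two middle values are 34 and 40, so the median is their average, 37.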
Mean vs Median
Mean and median both represent the central tendency of a data set.
So when do we use median over mean? Median is a more reliable measure
of central tendency, especially in scenarios where there are some irregular
values, also known as outliers.
For example, consider the below scenario.
Your father gets his blood pressure checked every week.
But due to some error in the device, the recording for one week was too high.
That one faulty reading pulls the mean upward, while the median stays close to the usual readings.
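The effect of one faulty reading can be sketched with Python's `statistics` module. The readings below are hypothetical numbers invented for this illustration, with 220 standing in for the erroneous week.

```python
import statistics

# Hypothetical weekly readings; 220 represents the faulty recording.
readings = [120, 122, 118, 121, 220]

mean_value = statistics.mean(readings)
median_value = statistics.median(readings)

print("Mean:", mean_value)      # pulled upward by the single outlier
print("Median:", median_value)  # stays near the typical readings
```

With these made-up numbers the mean is 140.2 while the median is 121, so the median gives the more representative picture.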
Mean absolute deviation

Mean Absolute Deviation (MAD) is the average of how far away all values in a data
set are from the mean.
The value of the mean absolute deviation gives a good understanding of the
variability of the data set, or in other words, how scattered the data set is.
One of the applications of Mean Absolute Deviation in real life is when teachers give
tests to students and then average the results to see if the average score was high,
in between, or too low.
Each average tells a story.
The absolute deviation can further help to see the distance between each of the scores
and the average score.
Example of Mean absolute deviation
Consider the below data set:
12, 16, 10, 18, 11, 19
Step 1: Calculate the mean
Mean = (12 + 16 + 10 + 18 + 11 + 19) / 6 = 86 / 6 = 14.33, which we round off to 14
Step 2: Calculate the distance of each data point from the mean. We need to find
the absolute value. For example, if the distance is -2, then we ignore the negative
sign.
|-2| = 2
Step 3: Calculate the mean of the distances.
Mean of distances = (2 + 2 + 4 + 4 + 3 + 5) / 6 = 3.33
So, 3.33 is our mean absolute deviation, and the mean is 14.
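The three steps can be sketched in Python. One difference from the worked example: this sketch keeps the unrounded mean (about 14.33) rather than rounding it to 14. For this data set the mean absolute deviation comes out as 3.33 either way.

```python
data = [12, 16, 10, 18, 11, 19]

# Step 1: calculate the mean.
mean = sum(data) / len(data)

# Step 2: distance of each data point from the mean, ignoring the sign.
distances = [abs(x - mean) for x in data]

# Step 3: calculate the mean of the distances.
mad = sum(distances) / len(distances)

print(round(mean, 2), round(mad, 2))
```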
What is Standard Deviation?

The Standard Deviation is a measure of how spread out numbers are.
It is a summary measure of the differences of each observation from the mean.
If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero. This is why the differences are squared before they are averaged.
How to find Standard Deviation?
In order to find standard deviation:

1. Calculate the mean by adding up all the data pieces and dividing it by the
number of pieces of the data.
2. Subtract mean from every value
3. Square each of the differences
4. Find the average of squared numbers calculated in point number 3 to find the
variance.
5. Lastly, find the square root of variance. That is the standard deviation.
Example
Take the values 1,2,3,5 and 8

Step 1: Calculate the mean


1+2+3+5+8 = 19
19/5 = 3.8 (mean)
Step 2: Subtract mean from every value
1- 3.8= -2.8
2- 3.8= -1.8
3- 3.8= -0.8
5- 3.8= 1.2
8- 3.8= 4.2
Step 3: Square each difference
-2.8*-2.8 = 7.84
-1.8*-1.8 = 3.24
-0.8*-0.8 = 0.64
1.2*1.2 = 1.44
4.2*4.2 = 17.64
Step 4: Calculate the average of the squared numbers
to get the variance

7.84+3.24+0.64+1.44+17.64 = 30.8
30.8/5 = 6.16 (Variance)
Step 5: Find the square root of the variance
The square root of 6.16 = 2.48

Thus, the Standard deviation of values 1,2,3,5 and 8 is 2.48
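The five steps can be followed line by line in Python. The sketch divides by the number of values, as the steps do, which gives the population standard deviation. The standard library's `statistics.pstdev` uses the same definition and serves as a cross-check, whereas `statistics.stdev` divides by one less than the number of values and would give a larger result.

```python
import math
import statistics

values = [1, 2, 3, 5, 8]

mean = sum(values) / len(values)          # Step 1: calculate the mean
differences = [x - mean for x in values]  # Step 2: subtract mean from every value
squares = [d ** 2 for d in differences]   # Step 3: square each difference
variance = sum(squares) / len(values)     # Step 4: average of the squares
std_dev = math.sqrt(variance)             # Step 5: square root of the variance

print(round(variance, 2), round(std_dev, 2))
print(round(statistics.pstdev(values), 2))  # cross-check with the standard library
```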
Graphically, the standard deviation of 2.48 can be represented like below: (figure not included)
A few real-life applications of standard deviation include:

1. Grading Tests – If a teacher wants to know whether students are performing at the same level or whether their scores are widely spread (a higher standard deviation).
2. Calculating the results of a Survey – If someone wants to have some measure of the reliability of the responses received in the survey, to predict how a bigger group of people may answer the same questions.
3. Weather Forecasting – If a weather forecaster is analyzing the low temperature forecasted for three different cities. A low standard deviation suggests a more reliable weather forecast.
Practice Question
A financial analyst analyzes the returns of Google stock
and wants to measure the risk on returns if investments
are made in that stock. Therefore, he collects data on
the historical returns of Google for the last five years,
which are as follows:
Year               2018     2017     2016     2015     2014
Returns (%) (xi)   27.70%   36.10%   10.50%   6.80%    -4.60%
Exercises
Objective Type Questions
Please choose the correct option in the questions below.
1. We want to get the cars of red color from the below data set. Which type of
subsetting should be used?
a) Column based subsetting
b) Data based subsetting
c) Row based subsetting
d) None of the above
Answer: b
2. Which is a more accurate measure of central tendency when there are outliers in
the data set?
a) Mean
b) Median
Answer: b
3. Mean absolute deviation is an identifier of the variability of the data set. Is this a
correct statement?
a) Yes
b) No
Answer: a
4. The mean absolute deviation is divided by coefficient of mean absolute deviation to
calculate
a) Variance
b) Median
c) Arithmetic Mean
d) Coefficient of Variation
Answer: c
5. In a manufacturing company, the number of employees in unit A is 40, the mean is Rs. 6400
and the number of employees in unit B is 30 with the mean of Rs. 5500 then the combined
arithmetic mean is
a) 9500
b) 8000
c) 7014.29
d) 6014.29
Answer: d
6. The mean deviation about the mean for the following data:
5, 6, 7, 8, 6, 9, 13, 12, 15 is:
a) 1.5
b) 3.2
c) 2.89
d) 5
Answer: c
7. The arithmetic mean of the numerical values of the deviations of items from some
average value is called the
a) Standard Deviation
b) Range
c) Quartile Deviation
d) Mean Deviation
Answer: d
Standard Questions
1. Explain the different ways of subsetting data.
2. When should we use median over mean?
3. What is Mean Absolute Deviation?
4. What is a two-way relative frequency table? How is it different from a two-way
frequency table?
5. What are two-way frequency tables beneficial for?
6. What is Standard Deviation?
7. How to calculate Standard Deviation?
8. Name five real-life applications of Standard Deviation.
9. Explain five real-life situations where subsetting data can be advantageous.
