
Data Science Notes

Subsets refer to smaller portions of a larger dataset, which can be created through row-based, column-based, or data-based subsetting techniques. Two-way frequency tables display the frequency of two variables, while two-way relative frequency tables present this data as percentages. Additionally, measures of central tendency such as mean and median are discussed, along with standard deviation, which quantifies the spread of data around the mean.

Uploaded by

rohansinghnirwan

What are subsets?

When a data set is large, we can take a certain part of the data for our analysis
instead of working with the whole data set. A smaller set of data taken from
a larger set of data is known as a subset.
Row-based subsetting:
Row-based subsetting, also known as filtering or selecting rows, is a technique used
to extract specific rows from a dataset based on certain criteria.
Column-based subsetting:
Column-based subsetting is the process of selecting data from specific columns
of the dataset.
Data-based subsetting:
Data-based subsetting is a technique that extracts a smaller, representative portion
of a larger dataset.
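
The three kinds of subsetting can be sketched in Python. This is only an illustration: the `students` records and their column names are hypothetical, and data-based subsetting is interpreted here as drawing a simple random sample, which is one possible way to get a smaller, representative portion.

```python
import random

# Hypothetical dataset: each row is a dict, each key is a column.
students = [
    {"name": "Asha", "grade": 9, "score": 82},
    {"name": "Ben", "grade": 10, "score": 67},
    {"name": "Chen", "grade": 9, "score": 91},
    {"name": "Dia", "grade": 10, "score": 74},
    {"name": "Eli", "grade": 9, "score": 58},
    {"name": "Fay", "grade": 10, "score": 88},
]

# Row-based subsetting: keep only the rows that meet a condition.
grade_9_rows = [row for row in students if row["grade"] == 9]

# Column-based subsetting: keep every row, but only the selected columns.
wanted_columns = ["name", "score"]
name_and_score = [{col: row[col] for col in wanted_columns} for row in students]

# Data-based subsetting, assumed here to mean a simple random sample.
rng = random.Random(0)  # fixed seed so the sample is repeatable
sample_rows = rng.sample(students, k=3)

print(grade_9_rows)
print(name_and_score)
print(sample_rows)
```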

Two-Way Frequency Tables


A two-way table is a statistical table that shows the observed number or
frequency for two variables. The rows indicate the categories of one variable and the
columns indicate the categories of the other variable.
Interpreting two-way tables
The entries in the table are counts. The table has several features:

• Categories are in the left column and top row.
• The counts are placed in the center of the table.
• The totals are at the end of each row and column.
• The sum of all counts (the grand total) is placed at the bottom right.
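
As a rough sketch, a two-way frequency table with these features can be built with Python's `collections.Counter`. The survey data below (grade and favourite sport) is made up purely for illustration.

```python
from collections import Counter

# Hypothetical survey: one (grade, favourite sport) pair per student.
responses = [
    ("Grade 9", "Cricket"), ("Grade 9", "Football"), ("Grade 9", "Cricket"),
    ("Grade 10", "Football"), ("Grade 10", "Football"), ("Grade 10", "Cricket"),
    ("Grade 9", "Cricket"), ("Grade 10", "Football"),
]

# Count how often each (row category, column category) pair occurs.
counts = Counter(responses)
row_labels = sorted({row for row, _ in responses})
col_labels = sorted({col for _, col in responses})

# Totals at the end of each row and column, plus the grand total.
row_totals = {r: sum(counts[(r, c)] for c in col_labels) for r in row_labels}
col_totals = {c: sum(counts[(r, c)] for r in row_labels) for c in col_labels}
grand_total = sum(counts.values())

# Print the table: categories on the left and top, counts in the center.
print(f"{'':10}" + "".join(f"{c:>10}" for c in col_labels) + f"{'Total':>10}")
for r in row_labels:
    cells = "".join(f"{counts[(r, c)]:>10}" for c in col_labels)
    print(f"{r:10}{cells}{row_totals[r]:>10}")
print(f"{'Total':10}" + "".join(f"{col_totals[c]:>10}" for c in col_labels)
      + f"{grand_total:>10}")
```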
Two-way relative frequency table
A two-way relative frequency table is very similar to a two-way frequency
table. The only difference is that each entry is shown as a percentage of a total
instead of as a count.
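
A minimal sketch of the conversion, assuming each count is divided by the grand total (percentages can also be taken relative to row or column totals). The counts below are hypothetical.

```python
# Hypothetical two-way frequency table, keyed by (row, column).
counts = {
    ("Grade 9", "Cricket"): 3, ("Grade 9", "Football"): 1,
    ("Grade 10", "Cricket"): 1, ("Grade 10", "Football"): 3,
}

grand_total = sum(counts.values())

# Relative frequency: each count as a percentage of the grand total.
relative = {cell: 100 * n / grand_total for cell, n in counts.items()}

for (row, col), pct in relative.items():
    print(f"{row:9} {col:9} {pct:5.1f}%")
```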

Mean:
The mean is a measure of central tendency. In data science, the mean, also called the
simple average, is the average value of a data set: the sum of all values divided by the
number of values. It is a central value around which the data is spread out, and it does
not have to be one of the values in the data set.
Example

• Consider a data set of the 11 numbers from 10 to 20.
• Array = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
• The mean is calculated by adding up the 11 numbers in the data set and dividing by 11.
• Sum of all the numbers = 165
• Mean = 165/11 = 15
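
The arithmetic for the numbers 10 to 20 can be checked with a short Python sketch. It computes the sum and the count separately, then compares the result with `statistics.mean` from the standard library.

```python
from statistics import mean

values = list(range(10, 21))  # the whole numbers 10 to 20 inclusive

total = sum(values)   # sum of all the numbers
count = len(values)   # how many numbers there are

manual_mean = total / count
print(total, count, manual_mean)

# The standard library gives the same value.
print(mean(values))
```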
Median
To calculate the median, we must first sort the data set in ascending or descending
order. The exact middle value of the sorted data set is the median.
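
A small sketch of this procedure in Python. The helper name `middle_value` is hypothetical, and when the count is even the sketch averages the two central values.

```python
from statistics import median

def middle_value(data):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(data)      # the data must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]     # odd count: the exact middle value
    # even count: average of the two central values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_values = [13, 10, 12, 11, 14]
even_values = [4, 1, 3, 2]

print(middle_value(odd_values), median(odd_values))
print(middle_value(even_values), median(even_values))
```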

Mean VS Median
Mean

1. It is the average value of the whole list or array.
2. Even number of elements: add all the elements and divide the sum by the number of
elements.
3. Odd number of elements: add all the elements and divide the sum by the number of
elements (the same rule as for an even number of elements).
Median

1. The median is the middle element of the sorted list, irrespective of how large or small
the other values are.
2. Even number of elements: add the two center elements after sorting the list and divide by 2.
3. Odd number of elements: the middle element of the sorted list.
4. The median is a more robust measure of central tendency, especially in scenarios where
there are some irregular values, also known as outliers.
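
The effect of an outlier on the two measures can be illustrated with a short sketch. The numbers below are made up for the purpose.

```python
from statistics import mean, median

typical = [30, 32, 35, 38, 40]
with_outlier = typical + [400]   # one irregular value added

# Without the outlier, mean and median agree.
print(mean(typical), median(typical))

# With the outlier, the mean is pulled far upward while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```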

Standard Deviation (SD):


Standard deviation represents how much the data is spread out around the mean
(the average).

To find standard deviation:

1. Calculate the mean by adding up all the data values and dividing the sum by the number
of values.
2. Subtract the mean from every value.
3. Square each of the differences.
4. Find the average of the squared differences calculated in step 3. This is the
variance.
5. Lastly, find the square root of the variance. That is the standard deviation.
Example

Values: [1, 2, 3, 5, 8]

1. Calculate the mean
1 + 2 + 3 + 5 + 8 = 19
19/5 = 3.8 (mean)
2. Subtract mean from every value
1 - 3.8 = -2.8
2 - 3.8 = -1.8
3 - 3.8 = -0.8
5 - 3.8 = 1.2
8 - 3.8 = 4.2
3. Square each difference
(-2.8)*(-2.8) = 7.84
(-1.8)*(-1.8) = 3.24
(-0.8)*(-0.8) = 0.64
(1.2)*(1.2) = 1.44
(4.2)*(4.2) = 17.64
4. Calculate the average of the squared numbers to get the variance
7.84 + 3.24 + 0.64 + 1.44 + 17.64 = 30.8
30.8/5 = 6.16 (variance)
5. Find the square root of the variance
The square root of 6.16 ≈ 2.48

Thus, the standard deviation of the values 1, 2, 3, 5 and 8 is approximately 2.48.
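
The five steps can be followed in Python as a rough check of the worked example. Because the squared differences are averaged over all five values, this is the population standard deviation, which `statistics.pstdev` also computes. `statistics.stdev` divides by one less than the number of values and would give a different result.

```python
import math
from statistics import pstdev

values = [1, 2, 3, 5, 8]

# Step 1: calculate the mean.
mean_value = sum(values) / len(values)

# Step 2: subtract the mean from every value.
differences = [v - mean_value for v in values]

# Step 3: square each difference.
squared = [d ** 2 for d in differences]

# Step 4: average the squared differences to get the variance.
variance = sum(squared) / len(values)

# Step 5: the square root of the variance is the standard deviation.
std_dev = math.sqrt(variance)

print(round(mean_value, 2), round(variance, 2), round(std_dev, 2))

# Cross-check against the standard library's population standard deviation.
print(round(pstdev(values), 2))
```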
