Use of Statistics in Data Science
(1) We want to select the red cars from the data set below. Which type of
subsetting should be used?
Name Height Color
Innova 70 White
Swift 50 Red
Amaze 50 Red
Bolero 80 Grey
(a) Column based subsetting
(2) Which is a more accurate measure of central tendency when there are outliers in the
data set?
(a) Mean
(b) Median
(3) Mean absolute deviation is an indicator of the variability of the data set. Is this
statement correct?
(a) Yes
(b) No
(a) Variance
(b) Median
(c) Arithmetic Mean
(d) Coefficient of Variation
Ans: (c) Arithmetic Mean
(5) In a manufacturing company, the number of employees in unit A is 40 with a mean of
Rs. 6,400, and the number of employees in unit B is 30 with a mean of Rs. 5,500. The
combined arithmetic mean is –
(a) 9500
(b) 8000
(c) 7014.29
(d) 6014.29
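The combined mean of two groups is the size-weighted average of the group means. A minimal Python sketch of that formula, using the figures from question (5); the function name `combined_mean` is only an illustrative choice:

```python
def combined_mean(n1, mean1, n2, mean2):
    """Size-weighted average of two group means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Unit A: 40 employees, mean 6400; unit B: 30 employees, mean 5500.
result = combined_mean(40, 6400, 30, 5500)
print(round(result, 2))  # 6014.29
```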
(6) The mean deviation about the mean for the following data: 5, 6, 7, 8, 9, 13, 12, 15 is
(a) 1.5
(b) 3.2
(c) 2.89
(d) 5
(7) The arithmetic mean of the numerical values of the deviations of items from some
average value is called the
Standard Questions
Ans: There are three basic ways of subsetting data:
(a) Row based subsetting: We select some of the rows of the table by their position,
counting from the top. For example, from a table with 8 rows and 6 columns we might
take only the first 4 rows.
(b) Column based subsetting: An original data set often contains a large number of
columns, not all of which are needed for the analysis. In that case we select only the
required columns from the original data set. This method is called column-based
subsetting.
(c) Data based subsetting: Here the rows are selected on the basis of specific values in
the data. In an illustration, the selected rows are usually shown colored.
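The three kinds of subsetting can be sketched in plain Python on the car table from question (1). This is only an illustration: storing the table as a list of dictionaries, and the variable names, are assumptions made for the example.

```python
# Car table from question (1), stored as a list of row dictionaries.
cars = [
    {"Name": "Innova", "Height": 70, "Color": "White"},
    {"Name": "Swift", "Height": 50, "Color": "Red"},
    {"Name": "Amaze", "Height": 50, "Color": "Red"},
    {"Name": "Bolero", "Height": 80, "Color": "Grey"},
]

# (a) Row based subsetting: keep the first two rows by position.
first_rows = cars[:2]

# (b) Column based subsetting: keep only the Name and Color columns.
name_and_color = [{"Name": c["Name"], "Color": c["Color"]} for c in cars]

# (c) Data based subsetting: keep the rows whose Color value is "Red".
red_cars = [c for c in cars if c["Color"] == "Red"]

print([c["Name"] for c in red_cars])  # ['Swift', 'Amaze']
```

Selecting the red cars depends on a value stored in the rows, so it is an instance of data based subsetting.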
Ans: The median is the more accurate measure of central tendency when the data contain
irregular values. Such values are called outliers.
Rahul’s father gets his blood pressure checked every week. In one week, due to a defect in
the machine, the recorded blood pressure was unusually high.
In this illustration, the mean of the readings is pulled away from his regular blood
pressure values by the single faulty reading, whereas the median still correctly shows the
centre point of the data set. So, in a data set that contains outliers, the median is a more
effective measure of central tendency than the mean.
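The effect can be seen in a small Python sketch. The weekly readings below are hypothetical values invented for illustration (the original readings are not given); the last one stands in for the faulty measurement.

```python
import statistics

# Hypothetical weekly readings; 180 plays the role of the faulty measurement.
readings = [120, 122, 118, 121, 119, 180]
regular = readings[:-1]  # the same readings without the outlier

print(statistics.mean(regular), statistics.median(regular))    # 120 120
print(statistics.mean(readings), statistics.median(readings))  # 130 120.5
```

With the single outlier included, the mean moves by 10 while the median moves by only 0.5.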
Ans: Yes. Mean Absolute Deviation (MAD) is the average distance between the values of the
data set and the mean.
Let us consider the following data set and work through the calculation:
12 16 10 18 11 19
Step 1: Calculate the mean of the data set.
(12 + 16 + 10 + 18 + 11 + 19) / 6 = 86 / 6 ≈ 14.33
Step 2: Calculate the distance of each data point from the mean and take its absolute
value. For example, if the distance from the mean is -2, we drop the negative sign (-)
and use 2.
The following table shows the distance of each data point from the mean.
Value Distance from the mean (14.33)
12 2.33
16 1.67
10 4.33
18 3.67
11 3.33
19 4.67
Total 20.00
Step 3: Calculate the mean of the distances.
20 / 6 ≈ 3.33
The mean absolute deviation gives us an idea of the variation in the data set.
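A minimal Python sketch of these steps for the same data set. It uses the unrounded mean (86 / 6 ≈ 14.33); the distances still add up to 20.

```python
data = [12, 16, 10, 18, 11, 19]

# Step 1: mean of the data set.
mean = sum(data) / len(data)

# Step 2: absolute distance of each value from the mean.
distances = [abs(x - mean) for x in data]

# Step 3: mean of the distances.
mad = sum(distances) / len(distances)

print(round(mean, 2))  # 14.33
print(round(mad, 2))   # 3.33
```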
(4) What is a two-way relative frequency table? How is it different from a two-way
frequency table?
Ans: A two-way relative frequency table is similar to a two-way frequency table. The
difference is that it shows percentages instead of counts. A two-way frequency table shows
the number of data points that fall into each category, while a two-way relative frequency
table shows the share of data points in each category. Depending on the problem, we can
use either row relative frequencies or column relative frequencies.
Let us consider a two-way table in which preferences for indoor and outdoor games are
recorded:
Ans: A two-way relative frequency table is most useful when the sample sizes being
compared differ. Preferences can then be compared using percentages.
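As a sketch of how row relative frequencies are obtained, the Python snippet below converts a two-way frequency table into percentages. The group names and counts are hypothetical, invented for illustration; they are not the figures from the original table.

```python
# Hypothetical two-way frequency table: counts of game preference per group.
counts = {
    "Group A": {"Indoor": 10, "Outdoor": 30},  # 40 people surveyed
    "Group B": {"Indoor": 12, "Outdoor": 8},   # 20 people surveyed
}

# Row relative frequencies: divide each count by its row total.
row_relative = {}
for group, row in counts.items():
    row_total = sum(row.values())
    row_relative[group] = {game: n / row_total for game, n in row.items()}

for group, row in row_relative.items():
    print(group, {game: f"{share:.0%}" for game, share in row.items()})
# Group A {'Indoor': '25%', 'Outdoor': '75%'}
# Group B {'Indoor': '60%', 'Outdoor': '40%'}
```

Group B has the larger indoor count (12 against 10) and also the larger indoor share (60% against 25%), but only the shares are directly comparable, because the two groups differ in size.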
Ans: Standard deviation is a measure of how spread out the numbers are. In other words, it
shows how far the data are spread around the mean (average).
For example, it tells us whether all the points lie close to the average or whether they
lie far above or below it.
Ans: We can use the following steps to find the standard deviation of the values 1, 2, 3,
5 and 8:
(a) Calculate the mean by adding up all the data values and dividing by the number of
values.
1 + 2 + 3 + 5 + 8 = 19
19 / 5 = 3.8
(b) Subtract the mean from every value.
1 - 3.8 = -2.8
2 - 3.8 = -1.8
3 - 3.8 = -0.8
5 - 3.8 = 1.2
8 - 3.8 = 4.2
(c) Square each of these differences.
7.84, 3.24, 0.64, 1.44, 17.64
(d) To find the variance, take the average of the squared numbers calculated in step (c).
7.84 + 3.24 + 0.64 + 1.44 + 17.64 = 30.8
30.8 / 5 = 6.16
(e) The standard deviation is the square root of the variance.
√6.16 ≈ 2.48
Hence the standard deviation of the values 1, 2, 3, 5 and 8 is 2.48.
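The same steps can be sketched in Python. The calculation divides by the number of values, so it gives the population standard deviation; the cross-check against `statistics.pstdev` from the standard library is added only to confirm the hand calculation.

```python
import math
import statistics

values = [1, 2, 3, 5, 8]

mean = sum(values) / len(values)             # step (a)
differences = [x - mean for x in values]     # step (b)
squared = [d ** 2 for d in differences]      # step (c)
variance = sum(squared) / len(values)        # step (d)
std_dev = math.sqrt(variance)                # step (e)

print(round(variance, 2))  # 6.16
print(round(std_dev, 2))   # 2.48

# Cross-check against the standard library's population standard deviation.
assert math.isclose(std_dev, statistics.pstdev(values))
```

Note that `statistics.stdev` divides by one less than the number of values (the sample standard deviation) and would therefore return a larger figure for the same data.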
(a) Grading tests: A teacher can use the standard deviation to see whether the students
performed at about the same level (low standard deviation) or whether their scores varied
widely (high standard deviation).
(b) Calculating the results of a survey: Someone who has collected responses from a survey
and wants to measure their reliability can use the standard deviation to predict how a
larger group of people might answer.
(c) Weather forecasting: When the low temperatures forecast for three different cities are
compared, a low standard deviation indicates a more reliable weather forecast.
(d) Marketing: Marketers calculate the standard deviation of the revenue earned after each
advertisement, so that they know how much variation in revenue to expect from a given
advertisement.
(e) Real estate: Real estate agents use the standard deviation of house prices per square
foot in a particular area, so that they can tell their clients how far prices are likely
to differ from their expectations.
(9) Explain five real-life situations where subsetting data can be advantageous
Ans:
Ans:
(2) Calculate the mean of the data set – [56, 89, 76, 58, 58, 65]
(3) Calculate the median of this data set – [56, 89, 76, 58, 58, 65]
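Median = 61.5 (sorted values: 56, 58, 58, 65, 76, 89; the two middle values are 58 and 65,
and (58 + 65) / 2 = 61.5)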
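Both calculations can be checked with a short Python sketch. The mean is computed directly from its definition and the median with the standard library's `statistics.median`, which sorts the values and averages the two middle ones when the count is even.

```python
import statistics

data = [56, 89, 76, 58, 58, 65]

mean = sum(data) / len(data)       # 402 / 6
median = statistics.median(data)   # average of the middle values 58 and 65

print(sorted(data))  # [56, 58, 58, 65, 76, 89]
print(mean)          # 67.0
print(median)        # 61.5
```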