Assignment - Data Analytics
5 Marks Questions:
Categorize and briefly describe all types of data.
Distinguish between skewness and kurtosis using an example.
Usually, 25% of the candidates who appear for interview for the post of assistant in a
nationalized bank are selected for recruitment. If 15 candidates who are going to appear for
interview are randomly selected, find the probability that:
o Exactly 12 will be rejected.
o Fewer than 3 will be rejected.
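A minimal sketch for checking answers to the interview question numerically, assuming each candidate is rejected independently with probability 0.75 (that is, 1 minus the 25% selection rate), so the number of rejections follows a binomial distribution. The helper name `binom_pmf` is illustrative and not part of the assignment.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 15            # candidates interviewed
p_reject = 0.75   # assumption: rejection probability = 1 - 0.25

# Exactly 12 of the 15 are rejected.
p_exactly_12 = binom_pmf(12, n, p_reject)

# Fewer than 3 rejected means 0, 1 or 2 rejections.
p_fewer_than_3 = sum(binom_pmf(k, n, p_reject) for k in range(3))

print(f"P(exactly 12 rejected) = {p_exactly_12:.4f}")
print(f"P(fewer than 3 rejected) = {p_fewer_than_3:.2e}")
```

The same helper can be reused for the other binomial questions in this sheet by changing `n`, `p` and the range of `k`.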
"The quality control manager finds the average number of defective parts coming out in a
production cycle is 5. Find the probability that:
o Exactly 5 parts are defective in a production cycle.
o More than 4 parts are defective"
What is data cleaning? Write down the steps of data cleaning.
What is Association Rule Mining in Data Mining? Elaborate.
A business researcher wants to test the fuel efficiency of two cars A and B and collects the
following information from 12 users of car A and 9 users of car B:
x̄_A = 19 km/litre, S_A = 3.8 km/litre
x̄_B = 24 km/litre, S_B = 4.3 km/litre
Test whether the average mileage offered by car B is better than the average mileage offered by
car A at a significance level of 0.01.
Assignments for Data Analytics
A hotel in a city had an occupancy rate of 70% per day with a standard deviation of
18.2% for 50 days. In order to increase the occupancy rate, the hotel put up flex boards in
prominent locations of the city and found that the occupancy rate rose to 82.7% with a standard
deviation of 19.7% in the next 75 days. Is there enough statistical evidence to conclude that
the occupancy rate has increased by 5% due to the display of flex boards? Test at a significance
level of 0.05.
The number of accidents over 30 months in a city is given by the frequency distribution
presented in the following table:
Number of accidents   1   2   3   4   5   6   7
Frequency             4   5   5   10  2   1   3
Calculate the mean, median and mode for the distribution of accidents.
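One way to verify a hand calculation for the accident table is to expand the frequency distribution into its 30 individual observations and apply Python's `statistics` module. This is a sketch for checking only; the variable names are illustrative.

```python
from statistics import mean, median, mode

accidents = [1, 2, 3, 4, 5, 6, 7]    # number of accidents in a month
frequency = [4, 5, 5, 10, 2, 1, 3]   # number of months with that count

# Expand the table: each value is repeated as many times as its frequency.
observations = [value
                for value, count in zip(accidents, frequency)
                for _ in range(count)]

dist_mean = mean(observations)
dist_median = median(observations)
dist_mode = mode(observations)

print(len(observations), round(dist_mean, 4), dist_median, dist_mode)
```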
"The probability that a person has a certain disease is 0.02. If the disease is present, the medical
test confirms it with a probability of 0.9. If the disease is not present the medical test confirms
the absence of disease with a probability of 0.95. If the medical test indicates that a disease is
present, find the probability that:
o The disease is actually present.
o The disease is actually absent"
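The medical test question is an application of Bayes' theorem. The sketch below uses only the figures stated in the question; the variable names are illustrative.

```python
# Figures taken from the question above.
p_disease = 0.02             # prior P(D)
p_pos_given_disease = 0.90   # P(test positive | disease present)
p_neg_given_healthy = 0.95   # P(test negative | disease absent)

# A false positive is the complement of a correct negative.
p_pos_given_healthy = 1 - p_neg_given_healthy

# Total probability of a positive test.
p_positive = (p_disease * p_pos_given_disease
              + (1 - p_disease) * p_pos_given_healthy)

# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+).
p_disease_given_pos = p_disease * p_pos_given_disease / p_positive
p_healthy_given_pos = 1 - p_disease_given_pos

print(f"P(disease present | positive) = {p_disease_given_pos:.4f}")
print(f"P(disease absent | positive)  = {p_healthy_given_pos:.4f}")
```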
"A training centre prepares students for CAT examination. There are 50 boys and 50 girls. The
following table gives the frequencies of boys and girls good in these skills.
Students   Good in communication   Good in analytic thinking   Good in any one
Boys       20                      45                          50
Girls      30                      30                          50
o Find the probability that a randomly chosen boy is good in both.
o Find the probability that a randomly selected girl is good in both.
o A student is chosen at random. What is the probability that he/she is good in both."
A contractor will not meet the deadline in completing a project due to rain or absence of
workers. He feels that there is a 20% chance of rain, 30% chance for absence of workers and a
15% chance for both. Find the probability that:
o The project is not delayed.
o The project is delayed due to only one reason.
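The contractor question follows from the addition rule P(A or B) = P(A) + P(B) - P(A and B). A short sketch, assuming the project is delayed whenever at least one of the two events occurs:

```python
p_rain = 0.20      # P(rain)
p_absence = 0.30   # P(absence of workers)
p_both = 0.15      # P(rain and absence)

# Addition rule: probability that at least one cause of delay occurs.
p_delayed = p_rain + p_absence - p_both
p_not_delayed = 1 - p_delayed

# Exactly one reason: remove the overlap from each event separately.
p_only_one = (p_rain - p_both) + (p_absence - p_both)

print(round(p_not_delayed, 2), round(p_only_one, 2))
```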
"A branch of a nationalised bank is giving educational loans to students. The persons in-charge
of disbursement of loans claim that 40% of the students do not repay the loan. The manager is
not convinced and takes a random sample of 10 students. If the person in-charge of sanction of
loans is correct, find the probability that:
o three of the 10 students do not repay.
o everybody repays."
A bank manager found that 3 customers arrive on average every 5 minutes at a savings
bank counter. Assume that the customers arrive at random:
o find the probability that 5 customers arrive in a 5 minute interval.
o the manager wants to add one more counter for savings account holder customers if
the probability that more than 5 customers arrive in a 5-minute interval exceeds 0.2.
Will the manager add one more counter?
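Random arrivals at a constant average rate are usually modelled with a Poisson distribution. The sketch below assumes a rate of 3 arrivals per 5-minute interval, as stated in the question; `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3  # average arrivals per 5-minute interval

p_exactly_5 = poisson_pmf(5, lam)

# P(X > 5) = 1 - P(X <= 5)
p_more_than_5 = 1 - sum(poisson_pmf(k, lam) for k in range(6))

# Decision rule from the question: add a counter only if P(X > 5) exceeds 0.2.
add_counter = p_more_than_5 > 0.2

print(f"P(X = 5) = {p_exactly_5:.4f}")
print(f"P(X > 5) = {p_more_than_5:.4f}, add counter: {add_counter}")
```

The same helper applies to the defective-parts question earlier in the sheet by setting `lam = 5`.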
The manager of a TV showroom orders 100 TVs per month. He is unable to fulfill the entire
demand in one month out of 4 months on the average. He estimates the average monthly
demand to be 90 TVs. If the demand for TVs follows a normal distribution, find the standard
deviation of the distribution of demand. How many TVs should he order per month if he wants
the probability of running out of stock to be at most 0.1?
A telemarketing executive has determined that 15% of people contacted will purchase the
product. 12 people are contacted about this product.
(a) Find the probability that among 12 people contacted, 2 will buy the product.
(b) Find the probability that among 12 people contacted, at most 2 will buy the product.
According to a survey on use of smart phones in India, the smart phone users spend 68 minutes
in a day on average in sending messages and the corresponding standard deviation is 12
minutes. Assume that the time spent in sending messages follows a normal distribution.
(a) What proportion of the smart phone users are spending more than 90 minutes in
sending messages daily?
(b) What proportion of the smart phone users are spending less than 20 minutes?
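The two proportions in the smart phone question are tail areas of a normal distribution with mean 68 and standard deviation 12. A sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8):

```python
from statistics import NormalDist

# Time spent on messaging, in minutes, as stated in the question.
messaging_time = NormalDist(mu=68, sigma=12)

# (a) Upper tail: P(X > 90) = 1 - P(X <= 90)
p_more_than_90 = 1 - messaging_time.cdf(90)

# (b) Lower tail: P(X < 20)
p_less_than_20 = messaging_time.cdf(20)

print(f"P(X > 90) = {p_more_than_90:.4f}")
print(f"P(X < 20) = {p_less_than_20:.6f}")
```

The value 20 lies four standard deviations below the mean, so the second proportion is very close to zero.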
The life of a particular TV model produced by a reputed TV manufacturer has a standard
deviation of 2.5 years. In order to estimate the mean life of the TV model, a sample of 40 TVs
was taken and found to have a mean life of 9 years.
o Compute the standard error of the sample mean
o Construct a 90% confidence interval for the mean life of the particular model
The customers coming to a branch of LIC complain that they have to wait for a long time before
paying their premium. The manager knows from his past experience that the standard deviation
of the waiting time is 2.5 minutes. He took a sample of 49 persons, found their average waiting
time to be 12.3 minutes. Construct confidence intervals for the average waiting time at
confidence levels of 90% and 99%.
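With a known population standard deviation, the interval is x̄ ± z·σ/√n. A sketch for the waiting-time question, where `z_interval` is an illustrative helper name:

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """Two-sided confidence interval for the mean when sigma is known."""
    # Critical value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Figures from the question: sigma = 2.5 min, n = 49, sample mean = 12.3 min.
ci_90 = z_interval(12.3, 2.5, 49, 0.90)
ci_99 = z_interval(12.3, 2.5, 49, 0.99)

print("90%:", tuple(round(v, 3) for v in ci_90))
print("99%:", tuple(round(v, 3) for v in ci_99))
```

The 99% interval is wider than the 90% interval because a higher confidence level needs a larger critical value.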
Explain logistic regression with suitable example.
Use the following information to construct confidence intervals for estimating μ.
o 80% confidence interval for x̄ = 55, σ = 12 and n = 32
o 90% confidence interval for x̄ = 3.5, σ = 0.97 and n = 30
How is hard clustering different from soft clustering? Explain with examples.
Define reinforcement learning with an example.
Explain in brief the working of SVM.
List out the requirements for clustering.
Discuss the applications of clustering in detail.
Discuss the steps of K-means clustering algorithm.
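The steps of K-means can be illustrated with a small pure-Python sketch of Lloyd's algorithm. Two choices here are assumptions made for reproducibility, not part of the question: the first k points serve as the initial centroids (random initialisation is more common), and the sample points are made up.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Step 1: pick k initial centroids (assumption: the first k points).
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical sample data: two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```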
Discuss the steps of K-medoids clustering algorithm.
What is a dendrogram? Explain in brief the hierarchical agglomerative algorithm with the help of a
dendrogram.
Draw the flowchart of k-means algorithm.
Name the types of regression and briefly explain.
Explain structured, unstructured, and semi-structured data with example.
List the advantages and disadvantages of the Apriori algorithm.
A medicine is given to two groups (A and B) of patients with the following data:
o Group A (mean survival score 78, standard deviation 18, total patients 200),
o Group B (mean survival score 76, standard deviation 17, total patients 170).
o Is the difference in mean survival score of the two groups statistically significant at the 1%
significance level?
Find out Accuracy, Precision, Recall, Misclassification and F1-score from the following
Confusion Matrix -
Predicted/Actual   Disease   No Disease
Alert              5         3
No Alert           4         8
Demonstrate the steps for finding Accuracy, Precision, Recall, Misclassification and F1-score
from the following confusion matrix -
Actual / Predicted   Spam   Not Spam
Alert                3      3
No Alert             4      10
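The metrics asked for in the two confusion-matrix questions can be computed as sketched below. The orientation is an assumption: rows (Alert / No Alert) are read as predictions and columns as actual classes, so for the disease matrix TP = 5, FP = 3, FN = 4, TN = 8, and for the spam matrix TP = 3, FP = 3, FN = 4, TN = 10. If the tables are meant the other way round, FP and FN swap. The function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of alerts that were correct
    recall = tp / (tp + fn)      # share of actual positives that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Assumed orientation: rows = predicted, columns = actual.
disease = classification_metrics(tp=5, fp=3, fn=4, tn=8)
spam = classification_metrics(tp=3, fp=3, fn=4, tn=10)

for name, metrics in (("disease", disease), ("spam", spam)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```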
Explain the terms support and confidence from the aspect of association rule mining.
Discuss the types of Data Mining with examples.
Analyse the differences between the Apriori algorithm and the FP-Growth algorithm.
Distinguish in detail between clustering and classification.
10 Marks Questions:
An automatic filling machine has to fill eyedrops with a mean of 10 ml per bottle. Both
overfilling and underfilling are not desirable. A quality control inspector takes a sample of 30
bottles every half an hour in order to decide whether he has to stop operation in the case of
overfilling or underfilling beyond a certain level for adjusting the machine. If the sample has a
mean of 9.8 ml with a standard deviation of 0.5 ml, give the decision rule for the quality control
inspector, taking 0.05 as the significance level.
Find out the test type and give the rejection rule for the following data:
A dealer in automobiles wants to know whether 3 different brands of tyres give different
mileage. He gathered the data presented in the following table regarding the life of three brands
of tyres.
27,500
Test whether the average mileage provided by the brands differs significantly at α = 0.05.
The data given in the following table represents the number of units produced per shift by 5
workers using 3 different machines.
Worker   Machine A   Machine B   Machine C
1        48          44          36
2        50          48          40
3        40          37          38
4        45          45          34
5        50          40          44
(a) Test whether the average productivity is the same for different machines.
(b) Test whether the workers differ in their efficiency. Take α = 0.05 for both the cases.
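Both parts of the machine and worker question call for a two-way ANOVA without replication. The sketch below computes the two F statistics from the table; the Python standard library has no F distribution, so the critical values still come from tables (approximately 4.46 for 2 and 8 degrees of freedom and 3.84 for 4 and 8 degrees of freedom at α = 0.05).

```python
# Rows are workers 1-5, columns are machines A, B, C (from the table above).
units = [
    [48, 44, 36],
    [50, 48, 40],
    [40, 37, 38],
    [45, 45, 34],
    [50, 40, 44],
]
n_workers = len(units)
n_machines = len(units[0])

grand_total = sum(sum(row) for row in units)
correction = grand_total**2 / (n_workers * n_machines)

# Sums of squares: total, between workers (rows), between machines (columns).
ss_total = sum(x * x for row in units for x in row) - correction
ss_workers = sum(sum(row)**2 for row in units) / n_machines - correction
ss_machines = sum(sum(col)**2 for col in zip(*units)) / n_workers - correction
ss_error = ss_total - ss_workers - ss_machines

df_machines = n_machines - 1
df_workers = n_workers - 1
df_error = df_machines * df_workers

ms_error = ss_error / df_error
f_machines = (ss_machines / df_machines) / ms_error
f_workers = (ss_workers / df_workers) / ms_error

print(f"F(machines) = {f_machines:.3f} with ({df_machines}, {df_error}) df")
print(f"F(workers)  = {f_workers:.3f} with ({df_workers}, {df_error}) df")
```

Each null hypothesis is rejected when its F statistic exceeds the corresponding critical value.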
An engineering college has the following data regarding placement of its students in the past:
30% of the students got placed in their first sitting. 25% of the students got placed in their
second sitting. 20% of the students got placed in their third sitting. No student is allowed to sit
for more than 3 times. A sample of 10 students is drawn from the present set of students
eligible for placement. Find the probability that:
(c) not more than 3 students will be placed in their third sitting.
Also, find the mean and standard deviation of the number of students who will be
placed.
An oil company has purchased an ‘oil-reserve’ on auction which can yield high-quality oil with
probability 0.6, medium-quality oil with probability 0.2 and no oil with probability 0.2. The site is
drilled and a particular type of soil is found. The probability of getting this type of soil in
presence of high-quality oil, medium-quality oil and no oil are 0.3, 0.5, 0.2, respectively. Find the
probability of finding high-quality oil, medium-quality oil and no oil.
Explain in detail every step of data pre-processing.
The rent of a property is related to its area. Given the area in square feet and rent in rupees,
find the relationship between area and rent using the concept of linear regression. Also predict
the rent for a property of 790 ft².
S. No.   Area (ft²)   Rent (rupees)
1        340          500
2        1080         1700
3        640          1100
4        880          800
5        990          1400
6        510          500
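A least-squares sketch for the rent question. The column roles are an assumption read off the question wording: the second column is taken as area in square feet and the third as rent in rupees. `fit_line` is an illustrative helper name.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Assumed column roles: area in ft², rent in rupees.
area = [340, 1080, 640, 880, 990, 510]
rent = [500, 1700, 1100, 800, 1400, 500]

slope, intercept = fit_line(area, rent)
predicted_rent = intercept + slope * 790

print(f"rent = {intercept:.2f} + {slope:.4f} * area")
print(f"predicted rent for 790 ft²: {predicted_rent:.2f}")
```

The study-time question that follows has the same structure, so the same helper can be applied to its two columns.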
The marks obtained by a student are dependent on his/her study time. Given the study time in
minutes and marks out of 2000, find the relationship between study time and marks using the
concept of linear regression. Also predict the marks for a student if he/she studied for 790
minutes.
S. No.   Study time (minutes)   Marks
1        350                    520
2        1070                   1600
3        630                    1000
4        890                    850
5        940                    1350
6        500                    490
Categorize the clustering techniques in machine learning and explain each category.
The travellers in a train lodge a complaint with the manager of the pantry car that they are to be
provided with 150 ml of coffee per cup. The manager knows that the standard deviation of quantity of
coffee provided in a cup is 20 ml. He took a sample of 50 cups and found that they contained an
average of 145 ml per cup. Should the manager act on the complaint? Assume a 5% level of
significance.
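The coffee complaint can be framed as a one-sample z-test, since the population standard deviation is known. The sketch assumes a left-tailed test (H0: μ = 150 ml against H1: μ < 150 ml), because the complaint is about receiving too little.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 150          # claimed quantity per cup (ml)
sigma = 20          # known population standard deviation (ml)
n = 50              # cups sampled
sample_mean = 145   # observed average (ml)
alpha = 0.05

# Test statistic: distance of the sample mean from mu_0 in standard errors.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Left-tailed test (assumption): critical value and p-value.
z_critical = NormalDist().inv_cdf(alpha)
p_value = NormalDist().cdf(z)
reject_h0 = z < z_critical

print(f"z = {z:.4f}, critical value = {z_critical:.4f}, p-value = {p_value:.4f}")
print("Reject H0:", reject_h0)
```

Rejecting H0 here would support the travellers' complaint at the 5% level; a two-tailed test would use `inv_cdf(alpha / 2)` instead.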
An IT company expects its professionals to spend a maximum of 15% of their time in a day for
taking coffee/tea/lunch and for chatting with their colleagues. Of late, it feels that the IT
professionals spend more than 15% of their time for taking coffee/tea etc. In order to test that,
it took a sample of 30 professionals on a particular day and observed that they took 88 minutes
on average (in an 8-hour working day) with a standard deviation of 20 minutes. Test whether
the company's expectation is violated at a significance level of 5%.
The average daily wage of 15 labourers engaged in the construction sector in Tamil Nadu is 300
rupees with a standard deviation of 25 rupees. The average daily wage of 10 labourers engaged
in the construction sector in Karnataka is 325 rupees with a standard deviation of 35 rupees. Test
whether the daily wages in construction sector in Karnataka at a significance level of 0.5.
A large engineering company in Chennai purchases a particular component from two suppliers,
one in Maharashtra and the other in Haryana. The data regarding 30 orders placed with each of
the suppliers are as follows: The supplier from Maharashtra supplies the components in 12 days
on an average (after receiving the order) with a standard deviation of 3 days. The supplier from
Haryana takes 14 days on an average for delivering the components with a standard deviation of
2 days. Test whether the supplier from Haryana is less prompt in delivering the components at a
significance level of 0.05.
A chain of hotels has opened new branches in Trichy and Madurai. The management wants to
know whether the customers of the new branches are satisfied with the service provided by the
branches. A random sample of 50 customers was taken from the hotel at Trichy and was found
to have a mean service rating of 8.3 out of 10 and a standard deviation of 0.8. A sample of 60
customers of the hotel at Madurai was taken and was found to have a mean of 8.6 out of 10 and
a standard deviation of 0.9. Is the customer service provided by the hotels in Trichy and Madurai
significantly different at a significance level of 0.05?
A departmental store started training its salesmen in order to improve their relationship. In
order to test whether the training was effective, it collected the following data: A sample of 16
untrained employees were able to sell for 6880 rupees per day on average with a standard
deviation of 326.30 rupees. A sample of 11 trained employees were able to sell for 7060 rupees per day
on average with a standard deviation of 248.40 rupees. Can you conclude that there is an
improvement in sales by trained salesmen at a significance level of 0.05?
A chain of restaurants in a city wants to compare 3 of its restaurants regarding the service time
per customer. One of the owners visited the 3 restaurants during the peak hours and noted the
service time for 5 customers in each of the 3 restaurants. The following table gives the details:
3 3 2
4 4 3.5
5.5 5.5 5
4 3 6
Test whether the average service times in the 3 restaurants are significantly
different at a significance level of 0.05.
A company is trying 3 different training methods on its new employees for enabling them to get
familiarized with the company environment and learn the ways the various departments of the
company are working. It collected the data given in the following table regarding the time taken
by employees to complete the training methods.
16 23 19
19 28 25
20 19 20
23 22 17
12 18 16
Test whether the 3 methods are equally effective at a significance level of 0.05.
Let us consider the data regarding the sales of 5 brands of cars in 3 months given in the
following table:
Alto 24 23 23
Swift 17 19 16
Dzire 17 17 15
Wagon R 15 14 13
Bolero 9 11 8
Test whether (i) the mean sales of cars are different in different months (ii) the mean
sales of cars are different for different brands at significance level of 0.05.
Financial Service 4 7 9
BPO 8 6 7
(a) Test whether the incentive scheme is equally satisfying to professionals of different
branches at a significance level of 0.05.
(b) Test whether the scheme is equally satisfying across job types at significance level of
0.05.
Trace the results of using the Apriori algorithm on the grocery store example with support
threshold s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent-item
sets for each database scan. Enumerate all the final frequent item-sets. Also, indicate the
association rules that are generated and highlight the strong ones, sorting them by
confidence.
T2 Milk, Bun
T8 Bread, Ketchup
Trace the results of using the FP-Growth algorithm on the grocery store example with support
threshold s = 33.34% and confidence threshold c = 60%. Create an FP tree. Show in detail the
tree creation during each transaction. Use FP-Growth to discover the frequent item-sets from
this FP tree.
T2 Cakes, Buns
T4 Chips, Coke
T5 Chips, Ketchup