2024 May Data Science and Big Data Analytics Ds Bda Pattern 2019
PB4430 [6262]-43 [Total No. of Pages : 3
T.E. (Computer Engineering)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester- II) (310251)
Time : 2½ Hours] [Max. Marks : 70
Instructions to the candidates:
3) Figures to the right side indicate full marks.
4) Assume suitable data if necessary.
5) Use of Scientific Calculator is permitted.
Q1) a) What is the data preparation phase in the Data Analytics Lifecycle? What are
the Analytics Sandbox and the ETLT process in this phase? [8]
OR
Q2) a) List the activities to be carried out in the model planning and model
building phases. What are the different tools used for these phases? [8]
b) What is linear regression, and what are its primary objectives? What is
Q3) a) What is logistic regression, and how does it differ from linear regression?
What is the sigmoid function, and what role does it play in logistic
regression? [9]
emails are spam or not spam, along with two features: the presence of
the word "offer" (1 for present, 0 for absent) and the presence of the
word "free" (1 for present, 0 for absent). You are tasked with classifying
a new email with the following feature values: "offer"=1 and "free"=1. [9]
Given the training dataset:
Email  Offer  Free  Spam
1      1      0     No
2      0      1     Yes
3      1      1     Yes
4      0      1     No
5      1      1     Yes
Calculate the probability that the new email is spam using Naive Bayes.
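
The calculation asked for above can be checked with a minimal sketch. It assumes plain Naive Bayes with no Laplace smoothing and treats the two features as conditionally independent given the class. The function name `naive_bayes_posterior` and the tuple layout are illustrative choices, not part of the question paper.

```python
from fractions import Fraction

# Training data from the table above: (offer, free, spam)
emails = [
    (1, 0, "No"),
    (0, 1, "Yes"),
    (1, 1, "Yes"),
    (0, 1, "No"),
    (1, 1, "Yes"),
]

def naive_bayes_posterior(rows, offer, free):
    """Return P(class | offer, free) for each class, without smoothing."""
    scores = {}
    for label in ("Yes", "No"):
        in_class = [r for r in rows if r[2] == label]
        # Prior P(class)
        prior = Fraction(len(in_class), len(rows))
        # Likelihoods P(offer | class) and P(free | class)
        p_offer = Fraction(sum(r[0] == offer for r in in_class), len(in_class))
        p_free = Fraction(sum(r[1] == free for r in in_class), len(in_class))
        scores[label] = prior * p_offer * p_free
    # Normalise so the two class scores sum to 1
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

posterior = naive_bayes_posterior(emails, offer=1, free=1)
print(posterior["Yes"])  # 4/5
```

Under these assumptions the unnormalised scores are 3/5 x 2/3 x 3/3 = 2/5 for spam and 2/5 x 1/2 x 1/2 = 1/10 for not spam, so the normalised probability of spam is 4/5.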
OR
Q4) a) How does the Apriori algorithm discover frequent itemsets in a dataset?
What is the role of support and confidence in the context of association
rule mining using the Apriori algorithm? [9]
b) Explain the process of building a decision tree? What are the criteria
Q5) a) Suppose you have the following dataset containing the coordinates of
A  2  3
B  4  7
C  3  5
D  6  9
E  8  6
F  7  8
initial centroids to be (2,3) and (8,6). Compute the new centroids after
centroids.
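
The question text here is incomplete, so the following is only a sketch under stated assumptions: that the task is one iteration of k-means with k = 2, that each row above gives a point's (x, y) coordinates, and that Euclidean distance is used. The helper name `kmeans_step` is illustrative.

```python
import math

# Points A-F from the table above, read as (x, y) pairs (assumed)
points = {
    "A": (2, 3),
    "B": (4, 7),
    "C": (3, 5),
    "D": (6, 9),
    "E": (8, 6),
    "F": (7, 8),
}
initial_centroids = [(2, 3), (8, 6)]

def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for name, p in points.items():
        # Euclidean distance from this point to every centroid
        dists = [math.dist(p, c) for c in centroids]
        clusters[dists.index(min(dists))].append(name)
    new_centroids = []
    for members in clusters:
        # Mean of member coordinates; assumes no cluster ends up empty
        xs = [points[m][0] for m in members]
        ys = [points[m][1] for m in members]
        new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return clusters, new_centroids

clusters, new_centroids = kmeans_step(points, initial_centroids)
print(clusters)       # [['A', 'C'], ['B', 'D', 'E', 'F']]
print(new_centroids)  # [(2.5, 4.0), (6.25, 7.5)]
```

Under these assumptions, A and C fall to the first centroid and B, D, E, F to the second, giving updated centroids (2.5, 4.0) and (6.25, 7.5).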
b) How do you handle noise and irrelevant information in text data during
preprocessing? Explain the terms bag of words and TF-IDF in text
analytics. [9]
OR
Q6) a) Explain, with a suitable example, how hierarchical clustering can be used
for visualizing hierarchical relationships in data. What are some real-world
applications of hierarchical clustering? [9]
b) What is the holdout method, and how does it work? Explain the difference
between training set, validation set, and test set in the holdout method. [9]
Q7) a) What is a histogram? How is it used to visualize the distribution of data?
How is it different from a density plot? [9]
b) What is the Hadoop ecosystem, and what are its primary components?
What is MapReduce, and how does it fit into the Hadoop ecosystem? [9]
OR
Q8) a) What is a box plot? Explain the different components of a box plot.
How do you interpret the median, quartiles, and whiskers in a box plot?
What does the interquartile range (IQR) represent in a box plot? [9]
What is Apache Spark, and how does it complement Hadoop for big