DSBDA Merged
P812 [5870] - 1133
[Total No. of Pages : 2
T.E. (Computer Engineering)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester - II) (310251)
Time : 2½ Hours] [Max. Marks : 70
Instructions to the candidates:
1) Answer Q.1 or Q.2, Q.3 or Q.4, Q.5 or Q.6, Q.7 or Q.8.
2) Neat diagrams must be drawn whenever necessary.
4) Assume suitable data, if necessary.
Q1) a) What is driving data deluge? Explain with one example. [9]
b) What is data science? Differentiate between Business Intelligence and
Data Science. [9]
OR
Q2) a) What are the sources of Big Data? Explain model building phase with
example. [9]
discovery phase. Explain with example. [9]
OR
i) Linear Regression
i) Time series Analysis
ii) TF - IDF. [9]
b) What is clustering? With suitable example explain the steps involved in
k - means algorithm. [9]
OR
i) Confusion matrix
ii) AUC - ROC curve [9]
b) Discuss Holdout method and Random Sub Sampling methods. [9]
Q7) a) With a suitable example explain Histogram and explain its usages. [8]
in brief. [9]
OR
Q8) a) With a suitable example explain and draw a Box plot and explain its
b) Describe the challenges of data visualization. Draw box plot and explain
Total No. of Questions : 8] SEAT No. :
P-3153 [Total No. of Pages : 2
[6003]-354
T.E. (Computer Engineering)
Data Science and Big Data Analytics
(2019 Pattern) (Semester - II) (310251)
Time : 2½ Hours] [Max. Marks : 70
2) Neat diagram must be drawn whenever necessary.
3) Figures to the right indicate full marks.
Q1) a) What is Model Building? Elaborate this phase of data analytics with the
science. [8]
OR
Q2) a) What are the three characteristics of Big Data and what are the main
b) Explain Descriptive, Diagnostic, Predictive analytics. [9]
Q3) a) Explain why decision trees are used. Draw a sample decision tree and
OR
Q5) a) What is text processing? Explain TF-IDF with example. [8]
b) With suitable example, explain the steps involved in k-means algorithm. [9]
OR
Q6) a) Define following terms with respect to confusion matrix : [8]
i) Accuracy
ii) Precision
iii) Recall
iv) AUC-ROC
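The first three terms above can be illustrated with a minimal Python sketch. The function name `confusion_metrics` and the counts passed to it are hypothetical, chosen only for illustration; they do not come from the question paper. AUC-ROC is not covered here because it needs predicted scores rather than four counts.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total   # share of all predictions that are correct
    precision = tp / (tp + fp)     # of predicted positives, how many are correct
    recall = tp / (tp + fn)        # of actual positives, how many are found
    return accuracy, precision, recall


# Hypothetical counts, used only to exercise the formulas
accuracy, precision, recall = confusion_metrics(tp=40, fp=10, fn=5, tn=45)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```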
b) Explain k-fold Cross Validation & Random Subsampling. [9]
Q7) a) With a suitable example, draw a Histogram, boxplot and explain its
usages. [9]
b) Describe the data visualization tool Tableau. List of data visualization
tools. [9]
OR
[9]
b) Explain architecture of Apache-Pig. [9]
Total No. of Questions : 8] SEAT No. :
PB4430 [6262]-43 [Total No. of Pages : 3
T.E. (Computer Engineering)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester- II) (310251)
Time : 2½ Hours] [Max. Marks : 70
Instructions to the candidates:
3) Figures to the right side indicate full marks.
4) Assume suitable data if necessary.
5) Use of Scientific Calculator is permitted.
Q1) a) What is the Data Preparation phase in the Data Analytics Lifecycle? What is
the Analytics Sandbox and ETLT process in this phase? [8]
OR
Q2) a) List out the activities to be carried out in model planning and model
building phase. What are different tools used for these phases? [8]
b) What is linear regression, and what are its primary objectives? What is
Q3) a) What is logistic regression, and how does it differ from linear regression?
What is the sigmoid function, and what role does it play in logistic
regression? [9]
emails are spam or not spam, along with two features: the presence of
the word "offer" (1 for present, 0 for absent) and the presence of the
word "free" (1 for present, 0 for absent). You are tasked with classifying
a new email with the following feature values: "offer"=1 and "free"=1. [9]
Given the training dataset:
Email  Offer  Free  Spam
1      1      0     No
2      0      1     Yes
3      1      1     Yes
4      0      1     No
5      1      1     Yes
Calculate the probability that the new email is spam using Naive Bayes.
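One way to check the calculation is the following minimal Python sketch. It assumes the features are conditionally independent given the class and applies no Laplace smoothing; the function name `naive_bayes_posterior` is illustrative and not part of the question.

```python
from fractions import Fraction

# Training rows from the question: (offer, free, spam)
emails = [
    (1, 0, "No"),
    (0, 1, "Yes"),
    (1, 1, "Yes"),
    (0, 1, "No"),
    (1, 1, "Yes"),
]


def naive_bayes_posterior(rows, offer, free):
    """Return P(class | offer, free) for each class, without smoothing."""
    scores = {}
    for label in ("Yes", "No"):
        in_class = [r for r in rows if r[2] == label]
        prior = Fraction(len(in_class), len(rows))
        p_offer = Fraction(sum(r[0] == offer for r in in_class), len(in_class))
        p_free = Fraction(sum(r[1] == free for r in in_class), len(in_class))
        scores[label] = prior * p_offer * p_free
    total = sum(scores.values())
    # Normalise so the two class probabilities sum to 1
    return {label: score / total for label, score in scores.items()}


posterior = naive_bayes_posterior(emails, offer=1, free=1)
print(posterior["Yes"])  # 4/5
```

By hand: P(Yes) x P(offer=1|Yes) x P(free=1|Yes) = 3/5 x 2/3 x 3/3 = 2/5, and P(No) x P(offer=1|No) x P(free=1|No) = 2/5 x 1/2 x 1/2 = 1/10, so the normalised probability of spam is (2/5) / (2/5 + 1/10) = 4/5.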
OR
Q4) a) How does the Apriori algorithm discover frequent itemsets in a dataset?
What is the role of support and confidence in the context of association
rule mining using the Apriori algorithm? [9]
b) Explain the process of building a decision tree? What are the criteria
Q5) a) Suppose you have the following dataset containing the coordinates of
A  2  3
B  4  7
C  3  5
D  6  9
E  8  6
F  7  8
initial centroids to be (2,3) and (8,6). Compute the new centroids after
centroids.
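A single k-means round for these six points can be sketched in Python as below. Part of the question text is missing in the source, so this sketch assumes Euclidean distance and exactly one assignment-and-update iteration; the helper name `kmeans_round` is illustrative.

```python
import math

# Points and initial centroids as given in the question
points = {"A": (2, 3), "B": (4, 7), "C": (3, 5), "D": (6, 9), "E": (8, 6), "F": (7, 8)}
centroids = [(2, 3), (8, 6)]


def kmeans_round(points, centroids):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    clusters = [[] for _ in centroids]
    for name, p in points.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(name)
    new_centroids = []
    for members in clusters:
        # Mean of the member coordinates (assumes no cluster is empty)
        xs = [points[m][0] for m in members]
        ys = [points[m][1] for m in members]
        new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return clusters, new_centroids


clusters, new_centroids = kmeans_round(points, centroids)
print(clusters)       # [['A', 'C'], ['B', 'D', 'E', 'F']]
print(new_centroids)  # [(2.5, 4.0), (6.25, 7.5)]
```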
b) How do you handle noise and irrelevant information in text data during
preprocessing? Explain the terms bag of words and TF IDF in text
analytics. [9]
OR
Q6) a) Explain how hierarchical clustering can be used for visualizing hierarchical
relationships in data with suitable example? What are some real-world
applications of hierarchical clustering? [9]
b) What is the holdout method, and how does it work? Explain the difference
between training set, validation set, and test set in the holdout method. [9]
Q7) a) What is a histogram? How is it used to visualize the distribution of data?
How is it different from a density plot? [9]
b) What is the Hadoop ecosystem, and what are its primary components?
What is MapReduce, and how does it fit into the Hadoop ecosystem? [9]
OR
Q8) a) What is a box plot? Explain the different components of a box plot?
How do you interpret the median, quartiles, and whiskers in a box plot?
What does the interquartile range (IQR) represent in a box plot? [9]
What is Apache Spark, and how does it complement Hadoop for big
Total No. of Questions : 8] SEAT No. :
PA-1449 [Total No. of Pages : 3
[5926]-65
T.E. (Computer Engg.)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester-II) (310251)
Time : 2½ Hours] [Max. Marks : 70
Instructions to the candidates:
1) Answer Q1 or Q2, Q3 or Q4, Q5 or Q6, and Q7 or Q8.
2) Neat diagram must be drawn wherever necessary.
and steam tables is allowed.
5) Assume suitable data if necessary.
Q1) a) Draw the diagram of data analytics life cycle in big data and briefly explain
its phases. [8]
b) Explain in detail how the model building phase is built by team in data
analytics life cycle? [9]
OR
Q2) a) List and explain the steps in data preparation phase of data analytics life
cycle. [8]
i) ETL
iii) Model selection for data analytics.
Q3) a) What are the types of analytics in big data? Explain in brief. [9]
b) Calculate the support and confidence value for all the possible item sets.[9]
OR
Q4) a) Explain the use of logistic function in logistic regression in detail. [9]
b) Write short note on the following:
i) Removing duplicates from data set.
ii) Handling missing data
iii) Data transformation. [9]
Q5) a) Suppose that, for the given data, the task is to cluster points (with (x, y)
representing location) into three clusters, where the points are.
The distance function is Euclidean distance. Suppose initially we assign
A1, B1 and C1 as the center of each cluster, respectively. Use the k-means
algorithm to show only the three cluster centers after the first
round of execution with steps. [9]
b) Explain the following text analysis steps with suitable example. [8]
i) Part of speech (POS) tagging
ii) Lemmatization
iii) Stemming
OR
Q6) a) Given the confusion matrix, calculate Accuracy, Precision, Recall, Error
Predicted classes
Risk-yes Risk-No
Heart Attack
Q7) a) List the data visualization tools and discuss any four applications of data
visualization along with the use of the suitable plot. [9]
b) List the challenges of data visualization. Explain the types of visualization
with example. [9]
OR
Q8) a) Explain in detail the Hadoop Ecosystem with suitable diagram [9]
b) Write a short note on the following [9]
i) Map reduce.
ii) Pig
iii) Hive
Total No. of Questions : 8] SEAT No. :
P-7545 [Total No. of Pages : 3
[6180]-53
T.E. (Computer Engineering)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester - II) (310251)
Time : 2½ Hours] [Max. Marks : 70
3) Figures to the right side indicate full marks.
4) Assume suitable data if necessary.
5) Use of Scientific calculator is permitted.
Q1) a) Explain Data Analytics Cycle with suitable diagram and its phases. [8]
[9]
OR
Q2) a) List and explain the key roles for successful analytics project. [8]
i) Common Tools for the Model Building
Q3) a) List and explain the various types of analytics in Big data. [9]
b) Calculate the support and confidence value for all the possible item sets. [9]
OR
Q4) a) Explain the need of logistic regression along with its various types. [9]
b) Explain the following terms with suitable example. [9]
i) Removing Duplicates from dataset.
ii) Handling Missing Data
Q5) a) Suppose that, for the given data, the task is to cluster points (with (x, y)
representing location) into three clusters, where the points are A1 (2, 10),
A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9). The
distance function is Euclidean distance. Suppose initially we assign A1,
Use the k-means algorithm to show only the first round of
execution with cluster center.
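The first round for these eight points can be sketched in Python as follows. The line naming the remaining initial centers is cut off in this paper; the sketch assumes A1, B1 and C1, as stated in the same question in paper [5926]-65. The variable names are illustrative.

```python
import math

# Points as listed in the question
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
# Assumed initial centers: A1, B1, C1
centers = [points["A1"], points["B1"], points["C1"]]

# Assignment step: index of the nearest center by Euclidean distance
assignment = {
    name: min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
    for name, p in points.items()
}

# Update step: each new center is the mean of the points assigned to it
new_centers = []
for i in range(len(centers)):
    members = [points[n] for n, c in assignment.items() if c == i]
    new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))

print(new_centers)  # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
```

After this round the clusters are {A1}, {A3, B1, B2, B3, C2} and {A2, C1}.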
b) Explain the following Text Analysis steps with suitable example [9]
i) Part-of-speech (POS) tagging
ii) Lemmatization
OR
Q6) a) Given the confusion matrix, Calculate Accuracy, Precision, Recall, Error
Predicted classes
-Yes -No
classes Yes
No
Q7) a) List the few data visualization tools and discuss any four applications of
data visualization along with the use of the various plots with Python/R
OR
Q8) a) Explain in detail the Hadoop Ecosystem with suitable diagram along with
the various components. [9]
b) Write a short note on the following. [9]
a) Map Reduce
b) Pig