
2024 May Data Science and Big Data Analytics Ds Bda Pattern 2019

The document is an examination paper for a Data Science and Big Data Analytics course, consisting of 8 questions. Candidates must answer specific pairs of questions and are allowed to use scientific calculators. The paper covers various topics including data preparation, regression analysis, clustering, and the Hadoop ecosystem.

Total No. of Questions : 8] SEAT No.

PB4430 [6262]-43 [Total No. of Pages : 3

T.E. (Computer Engineering)

DATA SCIENCE AND BIG DATA ANALYTICS

(2019 Pattern) (Semester- II) (310251)

Time : 2½ Hours] [Max. Marks : 70
Instructions to the candidates:

1) Answer Q.1 or Q.2, Q.3 or Q.4, Q.5 or Q.6, Q.7 or Q.8.


2) Neat diagrams must be drawn wherever necessary.
3) Figures to the right side indicate full marks.
4) Assume suitable data if necessary.
5) Use of Scientific Calculator is permitted.
Q1) a) What is the Data Preparation phase in the Data Analytics Lifecycle? What are the Analytics Sandbox and the ETLT process in this phase? [8]

b) List the different stakeholders of an analytics project. What do they usually expect at the conclusion of a project (key outputs)? [8]



OR
Q2) a) List the activities to be carried out in the model planning and model building phases. What are the different tools used for these phases? [8]
b) What is linear regression, and what are its primary objectives? What is the difference between simple linear regression and multiple linear regression? How do you evaluate the performance of linear regression? [8]



Q3) a) What is logistic regression, and how does it differ from linear regression? What is the sigmoid function, and what role does it play in logistic regression? [9]

b) Suppose you are given a dataset containing information about whether emails are spam or not spam, along with two features: the presence of the word "offer" (1 for present, 0 for absent) and the presence of the word "free" (1 for present, 0 for absent). You are tasked with classifying a new email with the following feature values: "offer" = 1 and "free" = 1. [9]
Given the training dataset:

Email   Offer   Free   Spam
1       1       0      No
2       0       1      Yes
3       1       1      Yes
4       0       1      No
5       1       1      Yes

Calculate the probability that the new email is spam using Naive Bayes.
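As a check on the hand calculation, the following is a minimal Python sketch, not an official model answer. It assumes plain frequency estimates with no Laplace smoothing (the question does not specify smoothing) and normalizes the two class scores so they sum to 1. The names `ROWS` and `naive_bayes_posterior` are illustrative only.

```python
from fractions import Fraction

# Training rows from the table above: (offer, free, spam label)
ROWS = [(1, 0, "No"), (0, 1, "Yes"), (1, 1, "Yes"), (0, 1, "No"), (1, 1, "Yes")]

def naive_bayes_posterior(rows, offer, free):
    """Return P(label | offer, free) for each label.

    Assumption: raw relative frequencies, no smoothing.
    """
    scores = {}
    for label in {r[2] for r in rows}:
        subset = [r for r in rows if r[2] == label]
        prior = Fraction(len(subset), len(rows))
        # Conditional likelihoods, treating the two features as independent
        p_offer = Fraction(sum(1 for r in subset if r[0] == offer), len(subset))
        p_free = Fraction(sum(1 for r in subset if r[1] == free), len(subset))
        scores[label] = prior * p_offer * p_free
    total = sum(scores.values())
    # Normalize so the posteriors over both classes sum to 1
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(ROWS, offer=1, free=1)
print(posterior["Yes"])  # 4/5
```

Under these assumptions the unnormalized scores are 3/5 × 2/3 × 3/3 = 2/5 for spam and 2/5 × 1/2 × 1/2 = 1/10 for not spam, so the normalized probability of spam is 0.8.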
OR

Q4) a) How does the Apriori algorithm discover frequent itemsets in a dataset? What is the role of support and confidence in the context of association rule mining using the Apriori algorithm? [9]
b) Explain the process of building a decision tree. What are the criteria used for splitting nodes in a decision tree? [9]



Q5) a) Suppose you have the following dataset containing the coordinates of points in a 2-dimensional space: [9]



Point   X Coordinate   Y Coordinate
A       2              3
B       4              7
C       3              5
D       6              9
E       8              6
F       7              8

Perform K-means clustering on this dataset with K = 2. Assume the initial centroids to be (2,3) and (8,6). Compute the new centroids after each iteration until convergence, and assign points to their nearest centroids.
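The iterations can be verified with a short Python sketch. It is a plain implementation of the standard assign-then-update loop and assumes Euclidean distance, which the question does not state explicitly. The names `POINTS` and `kmeans` are illustrative only.

```python
# Points from the table above
POINTS = {"A": (2, 3), "B": (4, 7), "C": (3, 5), "D": (6, 9), "E": (8, 6), "F": (7, 8)}

def kmeans(points, initial_centroids, max_iter=100):
    """Run K-means until the assignment stops changing.

    Assumption: squared Euclidean distance decides the nearest centroid.
    """
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        new_assignment = {
            name: min(
                range(len(centroids)),
                key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2,
            )
            for name, (x, y) in points.items()
        }
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members
        for k in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:
                centroids[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment

centroids, assignment = kmeans(POINTS, [(2, 3), (8, 6)])
print(centroids)   # [(2.5, 4.0), (6.25, 7.5)]
print(assignment)  # {'A': 0, 'B': 1, 'C': 0, 'D': 1, 'E': 1, 'F': 1}
```

With these assumptions the first iteration groups A and C around the first centroid and B, D, E, F around the second, giving new centroids (2.5, 4) and (6.25, 7.5). The second iteration leaves the assignment unchanged, so the procedure has converged.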
b) How do you handle noise and irrelevant information in text data during preprocessing? Explain the terms bag of words and TF-IDF in text analytics. [9]

OR

Q6) a) Explain, with a suitable example, how hierarchical clustering can be used for visualizing hierarchical relationships in data. What are some real-world applications of hierarchical clustering? [9]

b) What is the holdout method, and how does it work? Explain the difference between training set, validation set, and test set in the holdout method. [9]
Q7) a) What is a histogram? How is it used to visualize the distribution of data? How is it different from a density plot? [9]
b) What is the Hadoop ecosystem, and what are its primary components? What is MapReduce, and how does it fit into the Hadoop ecosystem? [9]

OR

Q8) a) What is a box plot? Explain the different components of a box plot. How do you interpret the median, quartiles, and whiskers in a box plot? What does the interquartile range (IQR) represent in a box plot? [9]

b) Explain the role of Apache Pig in data processing workflows on Hadoop. What is Apache Spark, and how does it complement Hadoop for big data processing? [9]

