DSBDA Merged

The document outlines an examination paper for a course on Data Science and Big Data Analytics, consisting of multiple questions that cover various topics such as data preprocessing, regression analysis, big data sources, and data visualization. Candidates are instructed to answer specific questions from each section, with guidelines for diagram usage and data assumptions. The exam is structured to assess knowledge on both theoretical concepts and practical applications in data analytics.

Uploaded by

kunalpatil702888
Copyright
© All Rights Reserved

Total No. of Questions : 8] SEAT No.

P812 [5870] - 1133
[Total No. of Pages : 2

T.E. (Computer Engineering)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester - II) (310251)

Time : 2½ Hours] [Max. Marks : 70

Instructions to the candidates:
1) Answer Q.1 or Q.2, Q.3 or Q.4, Q.5 or Q.6, Q.7 or Q.8.
2) Neat diagrams must be drawn whenever necessary.
3) Figures to the right side indicate full marks.
4) Use of logarithmic tables, slide rule, Mollier charts, electronic pocket calculator and steam tables is allowed.
5) Assume suitable data, if necessary.

Q1) a) What is driving data deluge? Explain with one example. [9]

b) What is data science? Differentiate between Business Intelligence and Data Science. [9]

OR

Q2) a) What are the sources of Big Data? Explain the model building phase with an example. [9]

b) Explain big data analytics architecture with a diagram. What is the data discovery phase? Explain with an example. [9]

Q3) a) Explain various data pre-processing steps. Discuss essential Python libraries for preprocessing. [8]

b) What are association rules? Explain the Apriori algorithm in brief. [9]

OR

Q4) a) Explain the following:
i) Linear Regression
ii) Logistic Regression [8]

b) Explain scikit-learn library for matplotlib with example. [9]

Q5) a) Write short note on
i) Time series analysis
ii) TF-IDF [9]

b) What is clustering? With a suitable example, explain the steps involved in the k-means algorithm. [9]

OR

Q6) a) Write short note on
i) Confusion matrix
ii) AUC-ROC curve [9]

b) Discuss Holdout method and Random Sub Sampling methods. [9]

Q7) a) With a suitable example, explain a histogram and its usages. [8]

b) Describe the data visualization tool “Tableau”. Explain its applications in brief. [9]

OR

Q8) a) With a suitable example, explain and draw a box plot and explain its usages. [8]

b) Describe the challenges of data visualization. Draw a box plot and explain its usages. [9]

Total No. of Questions : 8] SEAT No. :

P-3153 [Total No. of Pages : 2
[6003]-354

T.E. (Computer Engineering)
Data Science and Big Data Analytics
(2019 Pattern) (Semester - II) (310251)

Time : 2½ Hours] [Max. Marks : 70

Instructions to the candidates:
1) Answer Q.1 or Q.2, Q.3 or Q.4, Q.5 or Q.6, Q.7 or Q.8.
2) Neat diagrams must be drawn whenever necessary.
3) Figures to the right indicate full marks.
4) Assume suitable data if necessary.
5) Use of Scientific Calculator is permitted.

Q1) a) What is Model Building? Elaborate this phase of data analytics with the help of a suitable example. [9]

b) Explain any three sources of Big Data. Differentiate BI versus Data Science. [8]

OR

Q2) a) What are the three characteristics of Big Data, and what are the main considerations in processing Big Data? [8]

b) Explain Descriptive, Diagnostic, Predictive analytics. [9]

Q3) a) Explain why decision trees are used. Draw a sample decision tree and explain its parts. [9]

b) How does the Apriori algorithm work? Explain with a suitable example. [9]

OR

Q4) a) What is data preprocessing? Explain in detail handling missing data and transformation of data. [9]

b) Explain the Naïve Bayes classifier and its applications. [9]

Q5) a) What is text processing? Explain TF-IDF with example. [8]

b) With a suitable example, explain the steps involved in the k-means algorithm. [9]

OR

Q6) a) Define the following terms with respect to the confusion matrix: [8]
i) Accuracy
ii) Precision
iii) Recall
iv) AUC-ROC

b) Explain k-fold Cross Validation & Random Subsampling. [9]

Q7) a) With a suitable example, draw a histogram and a box plot and explain their usages. [9]

b) Describe the data visualization tool Tableau. List data visualization tools. [9]

OR

Q8) a) What is Data Visualization? Describe the challenges of data visualization. [9]

b) Explain the architecture of Apache Pig. [9]

Total No. of Questions : 8] SEAT No. :

PB4430 [6262]-43 [Total No. of Pages : 3

T.E. (Computer Engineering)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester - II) (310251)

Time : 2½ Hours] [Max. Marks : 70

Instructions to the candidates:
1) Answer Q.1 or Q.2, Q.3 or Q.4, Q.5 or Q.6, Q.7 or Q.8.
2) Neat diagrams must be drawn wherever necessary.
3) Figures to the right side indicate full marks.
4) Assume suitable data if necessary.
5) Use of Scientific Calculator is permitted.

Q1) a) What is the Data Preparation phase in the Data Analytics Lifecycle? What are the Analytics Sandbox and the ETLT process in this phase? [8]

b) List the different stakeholders of an analytics project. What do they usually expect at the conclusion (key outputs) of a project? [8]

OR

Q2) a) List the activities to be carried out in the model planning and model building phases. What are the different tools used for these phases? [8]

b) What is linear regression, and what are its primary objectives? What is the difference between simple linear regression and multiple linear regression? How do you evaluate the performance of linear regression? [8]

Q3) a) What is logistic regression, and how does it differ from linear regression? What is the sigmoid function, and what role does it play in logistic regression? [9]

b) Suppose you are given a dataset containing information about whether emails are spam or not spam, along with two features: the presence of the word "offer" (1 for present, 0 for absent) and the presence of the word "free" (1 for present, 0 for absent). You are tasked with classifying a new email with the following feature values: "offer"=1 and "free"=1. [9]

Given the training dataset:

Email   Offer   Free   Spam
1       1       0      No
2       0       1      Yes
3       1       1      Yes
4       0       1      No
5       1       1      Yes

Calculate the probability that the new email is spam using Naive Bayes.
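
A minimal Python sketch of this calculation, assuming plain relative-frequency estimates (no Laplace smoothing) and the usual Naive Bayes assumption that the two features are conditionally independent given the class. The names `rows` and `naive_bayes_posterior` are illustrative only and do not come from the question paper.

```python
from fractions import Fraction

# Training rows from the question: (offer, free, spam)
rows = [
    (1, 0, "No"),
    (0, 1, "Yes"),
    (1, 1, "Yes"),
    (0, 1, "No"),
    (1, 1, "Yes"),
]

def naive_bayes_posterior(rows, offer, free):
    """Return P(class | offer, free) for each class, without smoothing."""
    scores = {}
    for label in ("Yes", "No"):
        subset = [r for r in rows if r[2] == label]
        prior = Fraction(len(subset), len(rows))
        # Per-feature likelihoods estimated as relative frequencies
        p_offer = Fraction(sum(r[0] == offer for r in subset), len(subset))
        p_free = Fraction(sum(r[1] == free for r in subset), len(subset))
        scores[label] = prior * p_offer * p_free
    total = sum(scores.values())
    # Normalise so the two class posteriors sum to 1
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(rows, offer=1, free=1)
print(posterior["Yes"])  # 4/5
```

Under these assumptions the unnormalised scores are 3/5 × 2/3 × 1 = 2/5 for spam and 2/5 × 1/2 × 1/2 = 1/10 for not spam, which gives a posterior of 0.8 for spam.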

OR

Q4) a) How does the Apriori algorithm discover frequent itemsets in a dataset? What is the role of support and confidence in the context of association rule mining using the Apriori algorithm? [9]

b) Explain the process of building a decision tree. What are the criteria used for splitting nodes in a decision tree? [9]

Q5) a) Suppose you have the following dataset containing the coordinates of points in a 2-dimensional space: [9]

Point   X Coordinate   Y Coordinate
A       2              3
B       4              7
C       3              5
D       6              9
E       8              6
F       7              8

Perform K-means clustering on this dataset with K = 2. Assume the initial centroids to be (2,3) and (8,6). Compute the new centroids after each iteration until convergence, and assign points to their nearest centroids.
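
The iteration can be checked with a short Python sketch. It assumes Euclidean distance and the standard assign-then-update loop (Lloyd's algorithm), neither of which the question states explicitly, and the helper names `mean_point` and `kmeans` are illustrative.

```python
import math

# Points from the question; the dictionary layout is an illustrative choice.
points = {"A": (2, 3), "B": (4, 7), "C": (3, 5), "D": (6, 9), "E": (8, 6), "F": (7, 8)}

def mean_point(members):
    """Coordinate-wise mean of a non-empty list of (x, y) tuples."""
    xs, ys = zip(*members)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def kmeans(points, centroids, max_iter=100):
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        new_assignment = {}
        for name, p in points.items():
            distances = [math.dist(p, c) for c in centroids]
            new_assignment[name] = distances.index(min(distances))
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        updated = []
        for i, old in enumerate(centroids):
            members = [points[n] for n, k in assignment.items() if k == i]
            updated.append(mean_point(members) if members else old)
        centroids = updated
    return centroids, assignment

centroids, assignment = kmeans(points, [(2, 3), (8, 6)])
print(centroids)   # [(2.5, 4.0), (6.25, 7.5)]
print(assignment)  # {'A': 0, 'B': 1, 'C': 0, 'D': 1, 'E': 1, 'F': 1}
```

With these starting centroids the first assignment is {A, C} and {B, D, E, F}, and the second pass leaves it unchanged, so the run converges after one centroid update.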

b) How do you handle noise and irrelevant information in text data during preprocessing? Explain the terms bag of words and TF-IDF in text analytics. [9]

OR

Q6) a) Explain, with a suitable example, how hierarchical clustering can be used for visualizing hierarchical relationships in data. What are some real-world applications of hierarchical clustering? [9]

b) What is the holdout method, and how does it work? Explain the difference between the training set, validation set, and test set in the holdout method. [9]

Q7) a) What is a histogram? How is it used to visualize the distribution of data? How is it different from a density plot? [9]

b) What is the Hadoop ecosystem, and what are its primary components? What is MapReduce, and how does it fit into the Hadoop ecosystem? [9]

OR

Q8) a) What is a box plot? Explain the different components of a box plot. How do you interpret the median, quartiles, and whiskers in a box plot? What does the interquartile range (IQR) represent in a box plot? [9]

b) Explain the role of Apache Pig in data processing workflows on Hadoop. What is Apache Spark, and how does it complement Hadoop for big data processing? [9]

Total No. of Questions : 8] SEAT No. :

PA-1449 [Total No. of Pages : 3
[5926]-65

T.E. (Computer Engg.)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester-II) (310251)

Time : 2½ Hours] [Max. Marks : 70

Instructions to the candidates:
1) Answer Q1 or Q2, Q3 or Q4, Q5 or Q6, and Q7 or Q8.
2) Neat diagram must be drawn wherever necessary.
3) Figures to the right indicate full marks.
4) Use of logarithmic tables, slide rule, Mollier charts, electronic pocket calculator and steam tables is allowed.
5) Assume suitable data if necessary.

Q1) a) Draw the diagram of the data analytics life cycle in big data and briefly explain its phases. [8]

b) Explain in detail how the model building phase is carried out by the team in the data analytics life cycle. [9]

OR

Q2) a) List and explain the steps in the data preparation phase of the data analytics life cycle. [8]

b) Write short note on the following: [9]
i) ETL
ii) Common tools for the model building.
iii) Model selection for data analytics.

Q3) a) What are the types of analytics in big data? Explain in brief. [9]

b) Calculate the support and confidence values for all the possible item sets. [9]

Transaction ID   Items bought
1                Onion, Potato, Cold drink
2                Onion, Burger, Cold drink
3                Eggs, Onion, Cold drink
4                Potato, Milk, Eggs
5                Potato, Burger, Cold drink, Milk, Eggs
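
A small Python sketch of how the two measures can be computed for this table, assuming the usual definitions: support(X) is the fraction of transactions containing every item in X, and confidence(X → Y) is support(X ∪ Y) / support(X). The function names are illustrative, and only itemsets of size 1 and 2 are listed to keep the output short.

```python
from fractions import Fraction
from itertools import combinations

# Transactions from the table above
transactions = [
    {"Onion", "Potato", "Cold drink"},
    {"Onion", "Burger", "Cold drink"},
    {"Eggs", "Onion", "Cold drink"},
    {"Potato", "Milk", "Eggs"},
    {"Potato", "Burger", "Cold drink", "Milk", "Eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Support of every 1- and 2-itemset that occurs at least once
items = sorted(set().union(*transactions))
for size in (1, 2):
    for combo in combinations(items, size):
        s = support(combo)
        if s > 0:
            print(combo, "support =", s)

print(confidence({"Onion"}, {"Cold drink"}))  # 1
print(confidence({"Cold drink"}, {"Onion"}))  # 3/4
```

For example, {Onion, Cold drink} appears in transactions 1, 2 and 3, so its support is 3/5; Onion appears in the same three transactions, so Onion → Cold drink has confidence 1, while Cold drink → Onion has confidence 3/4.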


OR

Q4) a) Explain the use of logistic function in logistic regression in detail. [9]

b) Write short note on the following:
i) Removing duplicates from data set.
ii) Handling missing data
iii) Data transformation. [9]

Q5) a) Suppose that the task is to cluster the given points (with (x, y) representing location) into three clusters, where the points are:

A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)

The distance function is Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only the three cluster centers after the first round of execution, with steps. [9]
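
A sketch of that single round in Python, for checking hand calculations. The function name `one_round` and the data layout are illustrative assumptions; only one assignment step and one centre update are performed, as the question asks.

```python
import math

# Points and initial centres as given in the question
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
initial_centers = [points["A1"], points["B1"], points["C1"]]

def one_round(points, centers):
    """One k-means round: assign points, then recompute the centres."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        # Euclidean distance from this point to each current centre
        distances = [math.dist(p, c) for c in centers]
        clusters[distances.index(min(distances))].append(name)
    new_centers = []
    for members, old in zip(clusters, centers):
        if not members:
            new_centers.append(old)  # keep the centre of an empty cluster
            continue
        xs = [points[n][0] for n in members]
        ys = [points[n][1] for n in members]
        new_centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return clusters, new_centers

clusters, new_centers = one_round(points, initial_centers)
print(clusters)     # [['A1'], ['A3', 'B1', 'B2', 'B3', 'C2'], ['A2', 'C1']]
print(new_centers)  # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
```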

b) Explain the following text analysis steps with suitable example. [8]
i) Part of speech (POS) tagging
ii) Lemmatization
iii) Stemming

OR

Q6) a) Given the confusion matrix, calculate accuracy, precision, recall and error rate, with description, on heart attack risk. [8]

Actual classes            Predicted classes
                          Heart Attack Risk-Yes    Heart Attack Risk-No
Heart Attack Risk-Yes     80                       220
Heart Attack Risk-No      150                      9,500
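
The four measures follow directly from the cell counts. The sketch below assumes the table is read with rows as actual classes and columns as predicted classes, and treats "Risk-Yes" as the positive class, so TP = 80, FN = 220, FP = 150 and TN = 9,500. The function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,    # correct predictions / all cases
        "precision": tp / (tp + fp),      # of predicted positives, how many are right
        "recall": tp / (tp + fn),         # of actual positives, how many are found
        "error_rate": (fp + fn) / total,  # 1 - accuracy
    }

metrics = confusion_metrics(tp=80, fn=220, fp=150, tn=9500)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9628
# precision: 0.3478
# recall: 0.2667
# error_rate: 0.0372
```

The high accuracy next to the low recall illustrates why accuracy alone can mislead on an imbalanced dataset such as this one, where only 300 of 9,950 cases are actual positives.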


b) Explain the TF/IDF (term frequency-inverse document frequency) terms in text analysis with suitable example. [9]

Q7) a) List the data visualization tools and discuss any four applications of data visualization along with the use of the suitable plot. [9]

b) List the challenges of data visualization. Explain the types of visualization with example. [9]

OR

Q8) a) Explain in detail the Hadoop Ecosystem with suitable diagram. [9]

b) Write a short note on the following: [9]
i) MapReduce
ii) Pig
iii) Hive

Total No. of Questions : 8] SEAT No. :

P-7545 [Total No. of Pages : 3
[6180]-53

T.E. (Computer Engineering)
DATA SCIENCE AND BIG DATA ANALYTICS
(2019 Pattern) (Semester - II) (310251)

Time : 2½ Hours] [Max. Marks : 70

Instructions to the candidates:
1) Answer Q1 or Q2, Q3 or Q4, Q5 or Q6, Q7 or Q8.
2) Neat diagrams must be drawn wherever necessary.
3) Figures to the right side indicate full marks.
4) Assume suitable data if necessary.
5) Use of Scientific calculator is permitted.

Q1) a) Explain the Data Analytics Life Cycle and its phases with a suitable diagram. [8]

b) List and explain the various activities involved in identifying potential data resources as a part of the discovery phase in the Data Analytics Life Cycle. [9]

OR

Q2) a) List and explain the key roles for a successful analytics project. [8]

b) Write short note on: [9]
i) Common Tools for the Model Building
ii) Model selection for Data Analytics

Q3) a) List and explain the various types of analytics in Big data. [9]

b) Calculate the support and confidence values for all the possible item sets. [9]

Transaction ID   Items bought
1                Onion, Potato, Cold Drink
2                Onion, Burger, Cold Drink
3                Eggs, Onion, Cold Drink
4                Potato, Milk, Eggs
5                Potato, Burger, Cold Drink, Milk, Eggs

OR

Q4) a) Explain the need of logistic regression along with its various types. [9]

b) Explain the following terms with suitable example. [9]
i) Removing Duplicates from dataset.
ii) Handling Missing Data

Q5) a) Suppose that the task is to cluster the given points (with (x, y) representing location) into three clusters, where the points are A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1 and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only the first round of execution with cluster centers. [8]

b) Explain the following Text Analysis steps with suitable example. [9]
i) Part-of-speech (POS) tagging
ii) Lemmatization

OR

Q6) a) Given the confusion matrix, calculate Accuracy, Precision, Recall and Error rate, with description, on Diabetic Risk. [8]

Actual classes        Predicted classes
                      Diabetic Risk-Yes    Diabetic Risk-No
Diabetic Risk-Yes     90                   210
Diabetic Risk-No      140                  9560

b) Explain the Text Preprocessing steps with suitable example. [9]

Q7) a) List a few data visualization tools and discuss any four applications of data visualization along with the use of the various plots with Python/R or a suitable tool. [9]

b) List the challenges of Data Visualization. Explain the types of visualization with example. [9]

OR

Q8) a) Explain in detail the Hadoop Ecosystem with suitable diagram along with the various components. [9]

b) Write a short note on the following. [9]
i) MapReduce
ii) Pig