
University of Mumbai Examination 2020 Under Cluster - 4 - (Lead College: PCE-New Panvel)

This document contains a past exam for a Big Data and Analytics course at the University of Mumbai. The exam contains 20 multiple choice questions testing concepts related to Hadoop, MapReduce, NoSQL databases, data streams, and graph algorithms like PageRank. It also lists 3 short answer questions to choose from that require explaining Hadoop ecosystem components, relational algebra operations in MapReduce, NoSQL data architectures, distance metrics, the DGIM algorithm, and PageRank.

Uploaded by

yo fire

University of Mumbai

Examination 2020 under cluster _4_ (Lead College: PCE-New Panvel)


Program: BE Computer Engineering
Curriculum Scheme: Rev 2016
Examination: BE Semester VII
Course Code: CSDLO7032 Course Name: Big Data & Analytics
Time: 2 hours Max. Marks: 80
=====================================================================
Q1. Choose the correct option for the following questions. All the questions are
compulsory and carry equal marks.

1.(s) Which of the following tools is designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational databases?

Option A: Apache Sqoop


Option B: Pig
Option C: Mahout
Option D: Flume

2.(s) Point out the wrong statement

Option A: Replication Factor can be configured at a cluster level (default is set to 3) and
also at a file level
Option B: Block Report from each DataNode contains a list of all the blocks that are stored
on that DataNode
Option C: User data is stored on the local file system of DataNodes
Option D: DataNode is aware of the files to which the blocks stored on it belong

3.D Point out the correct statement:

Option A: DataNode is the slave/worker node and holds the user data in the form of Data
Blocks
Option B: Each incoming file is broken into 32 MB by default
Option C: Data blocks are replicated across different nodes in the cluster to ensure a low
degree of fault tolerance
Option D: DataNode is master node and holds the meta data details

4. The output of a mapper task is


Option A: The Key-value pair of all the records of the dataset.
Option B: The Key-value pair of all the records from the input split processed by the mapper
Option C: Only the sorted Keys from the input split
Option D: The number of rows processed by the mapper task

5.M Which of the following operations can’t use the Reducer as a combiner?


Option A: Group by Minimum
Option B: Group by Maximum
Option C: Group by Count
Option D: Group by Average
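
Question 5 depends on whether an aggregate gives the same result when it is applied to partial groups first and then to the partial results. The following is a minimal plain-Python sketch of that idea, not Hadoop code; the data in `chunks` is hypothetical and stands for the values of one key split across two mappers.

```python
from statistics import mean

# Hypothetical values for a single key, split across two mappers.
chunks = [[1, 2, 3], [10]]
all_values = [v for chunk in chunks for v in chunk]

# Min and max give the same answer when applied per chunk and then again
# to the partial results. Count also works when it is written as a sum of
# partial counts.
assert min(min(c) for c in chunks) == min(all_values)
assert max(max(c) for c in chunks) == max(all_values)
assert sum(len(c) for c in chunks) == len(all_values)

# Averaging the per-chunk averages differs from the true average,
# because the chunks have different sizes.
avg_of_avgs = mean(mean(c) for c in chunks)  # average of 2 and 10
true_avg = mean(all_values)                  # 16 / 4

# One common workaround: the combiner emits (sum, count) pairs and the
# reducer divides only at the very end.
partials = [(sum(c), len(c)) for c in chunks]
total, count = map(sum, zip(*partials))
combined_avg = total / count
```

The `(sum, count)` pair works because both components are themselves sums, which can be merged in any order.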

6.s Which of the following is a wrong statement for a document store?

Option A: Documents can contain many different key-value pairs, or key-array pairs, or
even nested documents
Option B: When compared to relational databases, Document stores are more scalable and
provide superior performance
Option C: It requires schema to be defined before you can add data
Option D: Secondary indices are available in Document store

7.D Which architecture is more suitable for NoSQL?


Option A: Shared Nothing
Option B: Shared Memory
Option C: Shared Disk
Option D: Shared All

8.s Which one is not sampling in a data stream?

Option A: Reservoir Sampling


Option B: Biased Reservoir Sampling
Option C: Concise Sampling
Option D: Cosine Sampling

9.m Define exponentially decaying window by

Option A: $\sum_{i=0}^{t-1} a_{t} (1-c)^{i}$

Option B: $\sum_{i=0}^{t-1} a_{t} (1-c)^{i}$

Option C: $\sum_{i=0}^{t} a_{t-1} (1-c)^{i}$

Option D: $\sum_{i=0}^{t-1} a_{t-1} (1-c)^{i}$
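
For reference, the usual textbook formulation of the exponentially decaying window weights the element that arrived i steps ago by (1-c)^i, that is, it sums a_{t-i} (1-c)^i for i = 0 to t-1. The sketch below assumes that formulation and checks it against the incremental update "multiply the running sum by (1-c), then add the new element". The function names and the sample stream are illustrative only.

```python
def decaying_sum_direct(stream, c):
    """Direct evaluation of sum_{i=0}^{t-1} a_{t-i} * (1-c)^i."""
    t = len(stream)
    # With 1-based arrival times, a_{t-i} is stream[t - 1 - i].
    return sum(stream[t - 1 - i] * (1 - c) ** i for i in range(t))


def decaying_sum_incremental(stream, c):
    """Same quantity, maintained one element at a time."""
    s = 0.0
    for a in stream:
        s = s * (1 - c) + a  # age everything seen so far, then add the newcomer
    return s


sample_stream = [1, 0, 1, 1]  # hypothetical bit stream
c = 0.1
direct = decaying_sum_direct(sample_stream, c)
incremental = decaying_sum_incremental(sample_stream, c)
```

The incremental form needs constant memory per tracked quantity, which is why it suits stream processing.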

10.s While devising a Bloom filter, if the filter has 5 bits, initially 0 0 0 0 0, and 2 hash functions
h1(x) = x mod 5 and h2(x) = (2x+3) mod 5 are used, what are the filter bits
after 9 followed by 11 is inserted?

Option A: 01001
Option B: 10001
Option C: 11001
Option D: 00001
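
The insertion in question 10 can be traced mechanically. The sketch below is a minimal Python version using the two hash functions given in the question. It assumes bit position 0 is the leftmost bit of the printed filter, which the question does not state. `insert_all` is an illustrative helper name.

```python
def h1(x: int) -> int:
    return x % 5


def h2(x: int) -> int:
    return (2 * x + 3) % 5


def insert_all(items, n_bits=5):
    """Insert each item by setting the bit chosen by every hash function."""
    bits = [0] * n_bits
    for x in items:
        for h in (h1, h2):
            bits[h(x)] = 1
    # Assumption: position 0 is written first (leftmost).
    return "".join(str(b) for b in bits)


filter_bits = insert_all([9, 11])
print(filter_bits)
```

Here 9 sets positions 4 and 1, and 11 sets positions 1 and 0, so three distinct bits end up set.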

11.d A stream query that is supplied to the DSMS before any relevant data has
arrived is called a

Option A: Continuous Queries


Option B: One time Queries
Option C: Adhoc Queries
Option D: Predefined Queries

12.s The angle between two points in Cosine Distance will range from
Option A: 0 to 90 degrees
Option B: 0 to 180 degrees
Option C: 0 to 360 degrees
Option D: 90 to 180 degrees

13.D Which of the following steps is not performed in the second phase of the CURE algorithm?

Option A: Clustering the remaining points and outputting the final clusters
Option B: merge two clusters if they have a pair of representative points, one from each
cluster, that are sufficiently close.
Option C: Move each of the representative points a fixed fraction of the distance between its
location and the centroid of its cluster.
Option D: Each point P is brought from secondary storage and compared with the
representative points

14.M For the distance function, the triangle inequality guarantees the function is
well-behaved. Which of the following shows the correct distance function for the triangle
inequality?

Option A: d(x,y) = d( x, y) + d( z)
Option B: d(x,y) = d( x,y) + d(x,z)
Option C: d(x,y) = d(x,z) + d(z,y)
Option D: d(x) = d(y) + d(z)

15.s Find the correct Hamming distance between X=111111101 and Y=000111111

Option A: 4
Option B: 5
Option C: 3
Option D: 2
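
The Hamming distance in question 15 is the number of positions at which the two bit strings differ. A short Python sketch, using the two strings from the question (`hamming` is an illustrative function name):

```python
def hamming(x: str, y: str) -> int:
    """Count positions where two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


X = "111111101"
Y = "000111111"
distance = hamming(X, Y)
print(distance)
```

The strings differ in the first three positions and in the eighth position.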

16.D The process of identifying similar users and recommending what similar users like
is called _________ .

Option A: Content Based Recommendation System


Option B: Collaborative Filtering
Option C: Hybrid Recommendation System
Option D: Nearest Neighbor Search

17.M The modified equation for calculating PageRank is


Option A:
Option B:
Option C:
Option D:

18.m A “clique” in a graph is a __________ .

Option A: simple sub-graph


Option B: null graph
Option C: trivial sub-graph
Option D: fully connected sub-graph

19.s The _______ consists of pages that could reach the SCC by following links, but
were not reachable from the SCC.

Option A: out-component
Option B: in-component
Option C: Tendrils
Option D: Tubes

20.D The problems of dead ends and spider traps are solved by a method called
__________

Option A: Stochastic Matrix


Option B: Substochastic Matrix
Option C: Taxation
Option D: Transition Matrix
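
As background for question 20, the sketch below shows a PageRank iteration in which every page keeps a fraction beta of the rank flowing along links and the remaining (1 - beta) is redistributed evenly, the update v' = beta * M * v + (1 - beta) / n. It is a plain-Python sketch under stated assumptions: the four-page graph is a hypothetical example in which page C links only to itself (a spider trap), beta = 0.8 is an arbitrary choice, and every page is assumed to have at least one out-link.

```python
def pagerank(links, beta=0.8, iters=100):
    """Iterate v' = beta * M * v + (1 - beta) / n on an adjacency dict."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Every page first receives its equal share of the redistributed rank.
        nxt = {u: (1.0 - beta) / n for u in nodes}
        # Then each page passes beta of its rank equally along its out-links.
        for u, outs in links.items():
            for v in outs:
                nxt[v] += beta * rank[u] / len(outs)
        rank = nxt
    return rank


# Hypothetical graph: C links only to itself, so it acts as a spider trap.
graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["C"],
    "D": ["B", "C"],
}
ranks = pagerank(graph)
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

Without the redistribution term, C would eventually absorb all of the rank. With it, C still ranks highest but the other pages keep a nonzero share.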

Q2 Solve any Two out of Three (10 marks each) (20 Marks)

A Explain briefly the components of Hadoop Ecosystem with neat diagram.
B With appropriate examples explain how these relational algebra operators are
solved using Map Reduce functions: (i) Selection (ii) Projection (iii) Joins
C Explain different NoSQL data architecture patterns.
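
For part B of Q2, the three operators can be illustrated with a tiny in-memory imitation of the map, shuffle, and reduce phases. This is a sketch in plain Python, not Hadoop API code. The helper `map_reduce`, the relations `R(a, b)` and `S(b, c)`, and their contents are all hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on each record, group values by key, then run reducer per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


R = [(1, 2), (3, 2), (4, 5)]  # hypothetical relation R(a, b)
S = [(2, 7), (5, 8), (6, 9)]  # hypothetical relation S(b, c)


# Selection b == 2 on R: the mapper filters, the reducer passes tuples through.
def select_map(t):
    if t[1] == 2:
        yield t, t


def identity_reduce(key, values):
    yield from values


# Projection of R onto b: the mapper drops attributes, the reducer removes duplicates.
def project_map(t):
    yield (t[1],), None


def dedup_reduce(key, values):
    yield key


# Natural join of R and S on b: the mapper keys each tuple by b and tags its origin,
# the reducer pairs every R-tuple with every S-tuple that shares the key.
def join_map(tagged):
    name, t = tagged
    if name == "R":
        a, b = t
        yield b, ("R", a)
    else:
        b, c = t
        yield b, ("S", c)


def join_reduce(b, values):
    left = [v for tag, v in values if tag == "R"]
    right = [v for tag, v in values if tag == "S"]
    for a in left:
        for c in right:
            yield (a, b, c)


selected = map_reduce(R, select_map, identity_reduce)
projected = map_reduce(R, project_map, dedup_reduce)
joined = map_reduce([("R", t) for t in R] + [("S", t) for t in S], join_map, join_reduce)
```

Selection needs no real work in the reducer, projection uses the grouping step to eliminate duplicates, and the join relies on the shuffle bringing matching keys from both relations to the same reducer.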

Q3 Solve any Two out of Three (10 marks each) (20 Marks)

A Find Manhattan distance (L1-norm) and Euclidean distance (L2-norm) for the
following points: X1 = (1, 2, 2), X2 = (2, 5, 3)
B Explain working of DGIM algorithm to count number of 1's (ones) in a
data stream.
C Explain PageRank with example. Can a website's PageRank ever increase? What
are its chances of decreasing?
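
The two distances asked for in part A of Q3 can be checked with a few lines of Python. This is a minimal sketch using the points given in the question; the function names are illustrative.

```python
import math


def manhattan(p, q):
    """L1-norm distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """L2-norm distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


x1 = (1, 2, 2)
x2 = (2, 5, 3)
l1 = manhattan(x1, x2)  # |1-2| + |2-5| + |2-3|
l2 = euclidean(x1, x2)  # sqrt(1 + 9 + 1)
print(l1, l2)
```

The coordinate differences are 1, 3 and 1, so the L1 distance is 5 and the L2 distance is the square root of 11.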
