NPTEL Assignment 1

1. Which of the following best describes the concept of 'Big Data'?
a. Data that is physically large in size
b. Data that is collected from multiple sources and is of high variety, volume, and velocity
c. Data that requires specialized hardware for storage
d. Data that is highly structured and easily analyzable

Ans- Big Data is characterized by the "Three Vs": variety (different types of data),
volume (large amounts of data), and velocity (speed at which data is generated
and processed). This definition captures the essence of Big Data, distinguishing
it from merely large or structured datasets.

2. Which technology is commonly used for processing and analyzing Big Data in
distributed computing environments?
a. MySQL
b. Hadoop
c. Excel
d. SQLite

Ans- Hadoop is a widely used framework designed for processing and analyzing large datasets in distributed computing environments. It provides a scalable and fault-tolerant way to handle Big Data, unlike MySQL, Excel, or SQLite, which are not typically used for large-scale distributed processing.
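
To make this concrete, below is a minimal word-count job written against Hadoop's MapReduce API, sketched here in Scala; the input and output paths come from the command line and are placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Map phase: emit (word, 1) for every token in an input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // output directory (placeholder)
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}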

3. What is a primary limitation of traditional RDBMS when dealing with Big Data?
a. They cannot handle structured data
b. They are too expensive to implement
c. They struggle with scaling to manage very large datasets
d. They are not capable of performing complex queries

Ans- Traditional Relational Database Management Systems (RDBMS) often face challenges with scalability when handling Big Data, primarily due to their limited ability to distribute data across multiple nodes. They are not inherently designed for the scale required by Big Data.

4. Which component of Hadoop is responsible for distributed storage?
a. YARN
b. HDFS
c. MapReduce
d. Pig

Ans- The Hadoop Distributed File System (HDFS) is the component responsible for storing data across a distributed cluster, providing redundancy and fault tolerance. YARN is for resource management, MapReduce is a processing framework, and Pig is a high-level data flow language.
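
As a small illustration, the sketch below writes and reads a file through Hadoop's FileSystem API in Scala; the NameNode address is a made-up placeholder and would normally come from core-site.xml.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020") // placeholder cluster address
    val fs = FileSystem.get(conf)

    // Write a file; HDFS splits it into blocks and replicates them across DataNodes.
    val out = fs.create(new Path("/tmp/hello.txt"))
    out.writeBytes("hello hdfs\n")
    out.close()

    // Read it back; the NameNode resolves which DataNodes hold the blocks.
    val in = fs.open(new Path("/tmp/hello.txt"))
    val buf = new Array[Byte](64)
    val n = in.read(buf)
    in.close()
    println(new String(buf, 0, n, "UTF-8"))
  }
}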

5. Which Hadoop ecosystem tool is primarily used for querying and analyzing
large datasets stored in Hadoop's distributed storage?
a. HBase
b. Hive
c. Kafka
d. Sqoop

Ans- Hive is a data warehouse tool that provides a SQL-like query language (HiveQL) for querying and analyzing large datasets stored in Hadoop. HBase is a NoSQL database, Kafka is a messaging system, and Sqoop is used for data transfer between Hadoop and relational databases.
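
For example, a HiveQL aggregation can be issued from Scala through a Hive-enabled SparkSession; this is only a sketch, and the web_logs table and its columns are hypothetical.

import org.apache.spark.sql.SparkSession

object TopPages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-query")
      .enableHiveSupport() // requires a Hive metastore and a Spark build with Hive support
      .getOrCreate()

    // HiveQL: find the ten most requested pages in a (hypothetical) logs table.
    spark.sql(
      """SELECT page, COUNT(*) AS hits
        |FROM web_logs
        |GROUP BY page
        |ORDER BY hits DESC
        |LIMIT 10""".stripMargin).show()

    spark.stop()
  }
}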

6. Which YARN component is responsible for coordinating the execution of tasks within containers on individual nodes in a Hadoop cluster?
a. NodeManager
b. ResourceManager
c. ApplicationMaster
d. DataNode

Ans- The NodeManager is the per-node YARN agent responsible for launching containers, managing local resources, and monitoring the execution of tasks on each node. The ResourceManager manages overall cluster resources, the ApplicationMaster handles application-specific resource requests, and the DataNode is part of HDFS.

7. What is the primary advantage of using Apache Spark over traditional
MapReduce for data processing?
a. Better fault tolerance
b. Lower hardware requirements
c. Real-time data processing
d. Faster data processing

Ans- Apache Spark provides faster data processing compared to traditional MapReduce due to its in-memory processing capabilities, which reduce the need for disk I/O operations. This leads to significant performance improvements for iterative algorithms and complex data processing tasks.
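
To see why in-memory processing helps, the Scala sketch below caches a dataset once and then iterates over it; after the first pass Spark reads from executor memory, whereas MapReduce would re-read from disk on every pass. The input path and the arithmetic are illustrative.

import org.apache.spark.sql.SparkSession

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // cache() keeps the parsed records in memory after the first action.
    val points = sc.textFile("hdfs:///data/points.txt") // placeholder path
      .map(_.split(",").map(_.toDouble))
      .cache()

    var acc = 0.0
    for (_ <- 1 to 10) {
      // Each pass reuses the cached RDD instead of re-reading the file.
      acc += points.map(p => p(0) * p(1)).sum() / points.count()
    }
    println(s"result = $acc")
    spark.stop()
  }
}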

8. What is Apache Spark Streaming primarily used for?
a. Real-time data visualization
b. Batch processing of large datasets
c. Real-time stream processing
d. Data storage and retrieval

Ans- Apache Spark Streaming is designed for real-time stream processing, enabling the analysis of live data streams (internally, as a series of small micro-batches). It is not used for batch processing of static datasets, real-time visualization, or data storage and retrieval.
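
A minimal Scala sketch of the DStream API: word counts over five-second micro-batches read from a socket source. The host and port are placeholders (such a stream can be fed locally with nc -lk 9999), and local[2] is needed so one thread can receive while another processes.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-wc").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // five-second micro-batches

    // Placeholder source: a text stream on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    counts.print() // print each batch's counts to the driver log

    ssc.start()
    ssc.awaitTermination()
  }
}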

9. Which operation in Apache Spark GraphX is used to perform triangle counting on a graph?
a. connectedComponents
b. triangleCount
c. shortestPaths
d. pageRank

Ans- The triangleCount operation in Apache Spark GraphX is used to count the number of triangles in a graph, which helps in analyzing the structure and connectivity of the graph.
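
A minimal Scala sketch: vertices 1, 2 and 3 below form the only triangle, so triangleCount reports 1 for each of them and 0 for vertex 4.

import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.sql.SparkSession

object TriangleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("triangles").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Edges are listed with srcId < dstId (the canonical orientation triangleCount expects).
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(1L, 3L, 0), Edge(3L, 4L, 0)
    ))
    val graph = Graph.fromEdges(edges, defaultValue = 0)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Each vertex attribute in the result is the number of triangles that vertex belongs to.
    graph.triangleCount().vertices.collect().foreach {
      case (id, n) => println(s"vertex $id is in $n triangle(s)")
    }
    spark.stop()
  }
}
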
10. Which component in Hadoop is responsible for executing tasks on individual nodes and reporting back to the JobTracker?
a. HDFS Namenode
b. TaskTracker
c. YARN ResourceManager
d. DataNode

Ans- In classic MapReduce (Hadoop 1.x), the TaskTracker is responsible for executing MapReduce tasks on individual nodes and reporting progress and status back to the JobTracker. The HDFS NameNode manages the file system namespace, the YARN ResourceManager allocates resources, and the DataNode stores the actual data.
