
Enroll. No.

_________

MARWADI UNIVERSITY
Faculty of Engineering & Technology
Computer Engineering
B.Tech SEM: 7 MID-SEM. EXAM: I SEPTEMBER: 2024
___________________________________________________________________________
Subject:- BIG DATA ANALYTICS (01CE0719)    Date:- 17/09/2024
Total Marks:- 30    Time:- 75 Minutes
Instructions:
1. Attempt all questions.
2. Make suitable assumptions wherever necessary.
3. Figures to the right indicate full marks.
4. Do not write/sign/indicate/tick-mark anything other than the Enroll. No. at the specified place on the question paper.

Question: 1. [6]

(a) What is the full form of RDBMS? Relational Database Management System
(b) What is the full form of HDFS? Hadoop Distributed File System
(c) What is the full form of SSH? Secure Shell
(d) What is the full form of ACID? Atomicity, Consistency, Isolation, and Durability
(e) What is the full form of YARN in Hadoop? Yet Another Resource Negotiator
(f) What is use of Name Node?
NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and
manages the blocks present on the DataNodes (slave nodes). The NameNode is a highly
available server that manages the file system namespace and controls clients' access to files.

Question: 2. [12]

(a) Compare RDBMS and Hadoop. [6]

(b) Explain Big Data Characteristics in detail. [6]

Big Data is characterized by several distinct attributes that differentiate it from traditional data processing
methods. These characteristics are often summarized by the "Vs" of Big Data (originally the three Vs of Volume,
Velocity, and Variety, later extended with Veracity and Value), though additional dimensions can also be considered. Here's a detailed breakdown:

1. Volume

• Definition: Refers to the sheer amount of data generated and collected. Big Data involves datasets
that are so large that traditional data management tools and methods are insufficient to handle them
effectively.
• Examples: Social media platforms generating terabytes of data daily, financial institutions processing
billions of transactions, or sensors in IoT devices producing continuous streams of data.

2. Velocity

• Definition: The speed at which data is generated, processed, and analyzed. Big Data often involves
real-time or near-real-time processing to gain timely insights and make quick decisions.
• Examples: Streaming data from online activities, real-time analytics for financial markets, or instant
feedback systems in customer service.

3. Variety

• Definition: The different types of data collected from various sources. Big Data includes structured,
semi-structured, and unstructured data.
• Examples: Structured data (e.g., databases), semi-structured data (e.g., XML, JSON), and
unstructured data (e.g., text documents, social media posts, videos).

4. Veracity

• Definition: The trustworthiness and accuracy of the data. Given the massive volume and variety,
ensuring the quality and reliability of Big Data can be challenging.
• Examples: Inconsistent data from different sources, errors in data collection, or biases in data can
affect the validity of insights drawn from the data.

5. Value

• Definition: The usefulness and insights that can be derived from the data. Big Data's value is realized
through effective analysis, which can drive business decisions, innovations, and strategies.
• Examples: Predictive analytics for customer behavior, optimization of supply chains, or new product
development based on user preferences.

Additional Characteristics

Complexity

• Definition: The intricate nature of data relationships and the challenges involved in integrating,
managing, and analyzing diverse data sources.
• Examples: Complex networks of data from different sources that need to be integrated to derive
meaningful insights.

Data Quality

• Definition: Ensures that the data is accurate, consistent, and complete. Poor data quality can
undermine the reliability of analytics and decision-making.
• Examples: Missing values, incorrect data entries, and outdated information that can impact the
outcomes of data analysis.

Scalability

• Definition: The ability to handle growing amounts of data efficiently. Big Data systems must be
scalable to accommodate increasing data volumes without performance degradation.
• Examples: Distributed computing frameworks like Hadoop and Spark that can scale out to handle
large datasets.

Storage

• Definition: The methods and technologies used to store massive amounts of data. Storage solutions
for Big Data must be robust and capable of handling high-throughput and high-capacity requirements.
• Examples: Distributed storage systems such as cloud storage solutions, NoSQL databases, and data
lakes.

Processing

• Definition: Techniques and technologies used to analyze and process Big Data. This includes data
mining, machine learning, and advanced analytics.
• Examples: Batch processing, stream processing, and the use of distributed computing frameworks to
handle and analyze large datasets.

OR

(b) Explain Big Data Architecture with diagram. [6]

Big Data architectures have a number of layers or components. These are the most common:

1. Data sources

Data is sourced from multiple inputs in a variety of formats, including both structured and unstructured.
Sources include relational databases allied with applications such as ERP or CRM, data warehouses, mobile
devices, social media, email, and real-time streaming data inputs such as IoT devices. Data can be ingested in
batch mode or in real-time.

2. Data storage

This is the data receiving layer, which ingests data, stores it, and converts unstructured data into a format
analytic tools can work with. Structured data is often stored in a relational database, while unstructured data
can be housed in a NoSQL database such as MongoDB Atlas. A specialized distributed system like Hadoop
Distributed File System (HDFS) is a good option for high-volume batch processed data in various formats.

3. Batch processing

With very large data sets, long-running batch jobs are required to filter, combine, and generally render the
data usable for analysis. Source files are typically read and processed, with the output written to new files.
Hadoop is a common solution for this.

4. Real-time message ingestion

This component focuses on categorizing the data for a smooth transition into the deeper layers of the
environment. An architecture designed for real-time sources needs a mechanism to ingest and store real-time
messages for stream processing. Messages can sometimes just be dropped into a folder, but in other cases, a
message capture store is necessary for buffering and to enable scale-out processing, reliable delivery, and
other queuing requirements.

5. Stream processing

Once captured, the real-time messages have to be filtered, aggregated, and otherwise prepared for analysis,
after which they are written to an output sink. Options for this phase include Azure Stream Analytics, Apache
Storm, and Apache Spark Streaming.
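
A minimal sketch of this phase, assuming Spark Streaming's Java API; the host, port, batch interval, and word-count logic are illustrative assumptions, with a plain text socket standing in for a real message store such as a queue or broker.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
        // Micro-batches of 5 seconds: incoming messages are filtered/aggregated per batch.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical ingestion source: a text socket on localhost:9999.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print(); // in practice the output sink would be a database, file system, or dashboard
        jssc.start();
        jssc.awaitTermination();
    }
}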

6. Analytical data store

The processed data can now be presented in a structured format – such as a relational data warehouse – for
querying by analytical tools, as is the case with traditional business intelligence (BI) platforms. Other
alternatives for serving the data are low-latency NoSQL technologies or an interactive Hive database.

7. Analysis and reporting

Most Big Data platforms are geared to extracting business insights from the stored data via analysis and
reporting. This requires multiple tools. Structured data is relatively easy to handle, while more advanced and
specialized techniques are required for unstructured data. Data scientists may undertake interactive data
exploration using various notebooks and tool-sets. A data modeling layer might also be included in the
architecture, which may also enable self-service BI using popular visualization and modeling techniques.

Analytics results are sent to the reporting component, which replicates them to various output systems for
human viewers, business processes, and applications. After visualization into reports or dashboards, the
analytic results are used for data-driven business decision making.

8. Orchestration

The cadence of Big Data analysis involves multiple data processing operations followed by data
transformation, movement among sources and sinks, and loading of the prepared data into an analytical data
store. These workflows can be automated with orchestration and data-movement tools such as Apache Oozie
(workflow scheduling), Apache Sqoop (bulk transfer to and from relational systems), or Azure Data Factory.
Question: 3. [12]

(a) Explain Hadoop Distributed File System in detail with diagram. [8]

HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. This open source framework works by rapidly transferring data between nodes. It's
often used by companies that need to handle and store big data. HDFS is a key component of
many Hadoop systems, as it provides a means for managing big data, as well as supporting big
data analytics.

There are many companies across the globe that use HDFS, so what exactly is it and why is it
needed? Let's take a deep dive into what HDFS is and why it may be useful for businesses.

What is HDFS?
HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system
designed to run on commodity hardware.

HDFS is fault-tolerant and designed to be deployed on low-cost commodity hardware. It provides
high-throughput access to application data, is suitable for applications that have large data sets,
and enables streaming access to file system data in Apache Hadoop.

So, what is Hadoop? And how does it vary from HDFS? A core difference between Hadoop and
HDFS is that Hadoop is the open source framework that can store, process and analyze data,
while HDFS is the file system of Hadoop that provides access to data. This essentially means that
HDFS is a module of Hadoop.

The HDFS architecture centers on NameNodes and DataNodes. The NameNode is commodity hardware
that runs the GNU/Linux operating system and the NameNode software. It acts as the master server:
it manages the file system namespace, controls a client's access to files, and oversees file
operations such as renaming, opening, and closing files.

A DataNode is hardware running the GNU/Linux operating system and the DataNode software. For
every (slave) node in an HDFS cluster, you will find a DataNode. These nodes manage the data
storage of their system: they perform read and write operations on the file system when clients request them,
and create, delete, and replicate blocks when the NameNode instructs.

The HDFS meaning and purpose is to achieve the following goals:

Manage large datasets - Organizing and storing huge datasets can be a hard task to handle. HDFS is
used to manage the applications that have to deal with huge datasets. To do this, HDFS should
have hundreds of nodes per cluster.
Detecting faults - HDFS should have technology in place to scan and detect faults quickly and
effectively, as it includes a large amount of commodity hardware where failure of components is a
common issue.
Hardware efficiency - By moving computation close to where the data is stored, HDFS reduces network
traffic and increases processing speed when large datasets are involved.

HDFS components
It's important to know that there are three main components of Hadoop: Hadoop HDFS, Hadoop
MapReduce, and Hadoop YARN. Let's take a look at what these components bring to Hadoop:

Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop. This software
framework is used to write applications to process vast amounts of data.
Hadoop YARN - Hadoop YARN is the resource management component of Hadoop. It schedules and runs
processing for batch, stream, interactive, and graph workloads over data stored in
HDFS.
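
The description above can be made concrete with the HDFS Java client API. The snippet below is a minimal sketch, assuming a cluster whose fs.defaultFS is configured on the classpath (e.g. in core-site.xml); the file path and contents are purely illustrative.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:9000) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write: the client asks the NameNode where to place blocks, then streams bytes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations; the data itself is read from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}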

(b) Which are the challenges for Distributed Computing? [4]

Distributed computing systems face a number of challenges, including:

• Communication
Nodes and processes in a distributed system must exchange messages reliably and in a timely manner to
coordinate their work.

• Scalability
Scalability is a key challenge in distributed real-time systems, especially when tasks are
dynamically added to the schedule.

• Fault tolerance
Fault tolerance is a crucial concern in distributed systems, but it can be difficult to implement due
to the dynamic nature of the system and its complex services.

• Latency issues
Latency is the time it takes for data to travel between different nodes in a distributed system, which
can cause delays.

• Network issues
Distributed systems rely on network communication, so network stability and bandwidth
problems can occur.

• Consistency
Keeping data consistent is difficult because a system with multiple parts and replicas can end up serving
outdated cached resources or conflicting copies of a resource.

• Geographical scalability limitations
Existing distributed systems designed for local-area networks can be difficult to scale because
they are based on synchronous communication.

OR

(a) Explain the applications of the Hadoop Ecosystem in detail. [8]

Introduction: The Hadoop Ecosystem is a platform or suite which provides various services to solve big data
problems. It includes Apache projects and various commercial tools and solutions. There are four major
elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common Utilities. Most of the tools or
solutions are used to supplement or support these major elements. All these tools work collectively to provide
services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:

HDFS: Hadoop Distributed File System


YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling

HDFS:

HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets
of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of
log files.
HDFS consists of two core components i.e.
Name node
Data Node
The Name Node is the prime node; it contains metadata (data about data) and requires comparatively fewer
resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the
distributed environment, which is what makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the
system.
YARN:

Yet Another Resource Negotiator, as the name implies, YARN is the one that helps to manage the resources
across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.
Resource Manager
Node Manager
Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node
Managers handle the allocation of resources such as CPU, memory, and bandwidth per machine and later
report back to the Resource Manager. The Application Manager works as an interface between the Resource
Manager and Node Manager and performs negotiations as per the requirement of the two.
MapReduce:

By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing
logic over the data and helps to write applications which transform big data sets into manageable ones.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
Map() performs sorting and filtering of data and thereby organizes it into groups. Map() generates
a key-value-pair-based result which is later processed by the Reduce() method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce()
takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
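
A minimal word-count sketch of these two functions using the Hadoop MapReduce Java API; the class names and the word-count example itself are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): filters/organizes the input and emits (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce(): aggregates the mapped data, combining the tuples for each key into one count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}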
PIG:
Pig was basically developed by Yahoo. It works on Pig Latin, a query-based language
similar to SQL.

It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken
care of. After the processing, Pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the way Java
runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop
Ecosystem.
HIVE:

With the help of SQL methodology and an SQL-like interface, HIVE performs reading and writing of large data sets.
Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time processing and batch processing. Also, all the SQL datatypes
are supported by Hive, making query processing easier.
Similar to other query-processing frameworks, HIVE comes with two components: JDBC drivers and the
HIVE command line.
The JDBC and ODBC drivers establish connections and data-storage permissions, whereas the
HIVE command line helps in the processing of queries.
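
As a rough illustration of the JDBC side, the sketch below assumes a HiveServer2 instance listening at localhost:10000 and a hypothetical employees table; the endpoint, credentials, and query are assumptions and would differ in a real deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; the hive-jdbc driver must be on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            // HQL is compiled to MapReduce/Tez/Spark jobs under the hood; the client just sees rows.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}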
Mahout:

Mahout provides machine learnability to a system or application. Machine learning, as the name suggests,
helps a system to develop itself based on patterns, user/environment interaction, or
algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and classification
which are nothing but concepts of Machine learning. It allows invoking algorithms as per our need with the
help of its own libraries.
Apache Spark:

It’s a platform that handles all the process consumptive tasks like batch processing, interactive or iterative
real-time processing, graph conversions, and visualization, etc.
It consumes in memory resources hence, thus being faster than the prior in terms of optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch processing,
hence both are used in most of the companies interchangeably.
Apache HBase:

It’s a NoSQL database which supports all kinds of data and thus capable of handling anything of Hadoop
Database. It provides capabilities of Google’s BigTable, thus able to work on Big Data sets effectively.
At times where we need to search or retrieve the occurrences of something small in a huge database, the
request must be processed within a short quick span of time. At such times, HBase comes handy as it gives us
a tolerant way of storing limited data
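
A minimal sketch of such a quick lookup with the HBase Java client, assuming a hypothetical users table with an info column family already exists; the row key and values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u100", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u100"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Dev"));
            table.put(put);

            // Random read by row key: the kind of small, fast lookup HBase is designed for.
            Result result = table.get(new Get(Bytes.toBytes("u100")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}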
Other Components: Apart from all of these, there are some other components too that carry out a huge task in
order to make Hadoop capable of processing large datasets. They are as follows:

Solr, Lucene: These are two services that perform the tasks of searching and indexing with the help of
Java libraries. Lucene is a Java library that also provides a spell-check mechanism;
Solr is a search platform built on top of Lucene.
Zookeeper: There was a huge issue of managing coordination and synchronization among the resources
and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by
performing synchronization, inter-component communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a
single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie coordinator jobs. Oozie workflow
jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those
that are triggered when some data or an external stimulus is given to them.

(b) What is MapReduce? [4]

What is MapReduce?
MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. It takes
away the complexity of distributed programming by exposing two processing steps that developers
implement: 1) Map and 2) Reduce. In the Mapping step, data is split between parallel processing tasks, and
transformation logic can be applied to each chunk of data. Once that is completed, the Reduce phase takes over to
handle aggregating data from the Map set. In general, MapReduce uses the Hadoop Distributed File System
(HDFS) for both input and output. However, some technologies built on top of it, such as Sqoop, allow access
to relational systems.

A MapReduce system is usually composed of three steps (even though it's generalized as the combination of
Map and Reduce operations/functions). The MapReduce operations are:

Map: The input data is first split into smaller blocks. The Hadoop framework then decides how many mappers
to use, based on the size of the data to be processed and the memory block available on each mapper server.
Each block is then assigned to a mapper for processing. Each ‘worker’ node applies the map function to the
local data, and writes the output to temporary storage. The primary (master) node ensures that only a single
copy of the redundant input data is processed.
Shuffle, combine and partition: worker nodes redistribute data based on the output keys (produced by the map
function), such that all data belonging to one key is located on the same worker node. As an optional step,
a combiner (a reducer) can run individually on each mapper server to reduce the data on each mapper even
further, which shrinks the data footprint and makes shuffling and sorting easier. Partitioning (not optional) is the
process that decides how the data is presented to the reducers and assigns each key to a particular reducer.
Reduce: A reducer cannot start while a mapper is still in progress. Worker nodes process each group of
output <key,value> pairs in parallel to produce new <key,value> pairs as output. All the map output values
that have the same key are assigned to a single reducer, which then aggregates the values for that key. Unlike
the map function, which is mandatory to filter and sort the initial data, the reduce function is optional.
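
A hedged sketch of how these steps are wired together with the Hadoop Job API, reusing the WordCountMapper and WordCountReducer classes sketched in the ecosystem answer above; the job name and HDFS input/output paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);    // Map step
        job.setCombinerClass(WordCountReducer.class); // optional combiner run on each mapper
        job.setReducerClass(WordCountReducer.class);  // Reduce step

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}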

Legacy applications and Hadoop-native tools like Sqoop and Pig leverage MapReduce today. There is very
little new MapReduce application development, and no significant contributions are being made to it as an open
source technology.

---Best of Luck---
