
Explain the DStream.

Ans:

DStream, short for Discretized Stream, is the fundamental abstraction in Apache Spark Streaming. It represents a continuous stream of data. Internally, a DStream is represented as a sequence of RDDs (Resilient Distributed Datasets).

Here's how it works:

1. Spark Streaming receives live input data streams and divides the data into batches.
2. These batches are then processed by the Spark engine to generate the final stream of results, also in batches.
3. DStreams can be created either from input data streams from sources such as Kafka and Kinesis, or by applying high-level operations on other DStreams.

In essence, a DStream is a continuous sequence of RDDs, each containing data from a certain interval. This allows Spark Streaming to process real-time data in a fast, scalable, and fault-tolerant manner.
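
A minimal PySpark sketch of a DStream pipeline is shown below; it assumes a local Spark installation and a text source on localhost port 9999 (both hypothetical, e.g. started with netcat), and counts words in 5-second batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# The driver creates the contexts; each batch of the DStream becomes one RDD.
sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # input DStream
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                    # print each batch's counts

ssc.start()
ssc.awaitTermination()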

Compare MapReduce and YARN.

Ans:

Criteria | MapReduce | YARN
Functionality | MapReduce is a processing framework that works on the data. | YARN is a resource management framework that decides how and where that processing should take place.
Architecture | Single master and multiple slaves. If the master node goes down, all the slave nodes stop working. | Multiple masters and slaves. If one master goes down, another master resumes its process and continues the execution.
Single Point of Failure | Yes; if the master node goes down, all the slave nodes stop working. | No; YARN overcomes the single-point-of-failure issue because of its architecture.
Components | It has two components, a mapper and a reducer, for the execution of a program. | Its main components are the ResourceManager and the NodeManagers (with active/standby ResourceManagers for high availability).
Use Case | MapReduce is used for executing a particular job. | YARN is used for managing resources (memory and CPU) and scheduling resource requests from applications.

Explain the NoSQL databases.

Ans:

NoSQL, which stands for "not only SQL" or "non-SQL", is a type of database management system (DBMS) designed to handle and store large volumes of unstructured and semi-structured data. Unlike traditional relational databases, which use tables with pre-defined schemas, NoSQL databases use flexible data models that can adapt to changes in data structures and can scale horizontally to handle growing amounts of data.

NoSQL databases are generally classified into four main categories:

1. Document databases: These store data as semi-structured documents, such as JSON or XML, and can be queried using document-oriented query languages.
2. Key-value stores: These store data as key-value pairs and are optimized for simple, fast read/write operations.
3. Column-family stores: These store data as column families, which are sets of columns treated as a single entity. They are optimized for fast, efficient querying of large amounts of data.
4. Graph databases: These store data as nodes and edges and are designed to handle complex relationships between data.

NoSQL databases are often used in applications with a high volume of data that needs to be processed and analyzed in real time, such as social media analytics, e-commerce, and gaming. However, they may not provide the same level of data consistency and transactional guarantees as traditional relational databases. Therefore, it is important to carefully evaluate the specific needs of an application when choosing a database management system.
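
To make the four categories concrete, the sketch below shows how the same kind of user data might be shaped under each model; the records, keys, and field names are purely illustrative and not tied to any particular database product.

# Document database: a self-contained, semi-structured document (JSON-like).
document = {"_id": "u42", "name": "Asha", "orders": [{"item": "book", "qty": 2}]}

# Key-value store: an opaque value addressed by a single key.
key_value = ("user:u42:session", "eyJ0b2tlbiI6ICJhYmMifQ==")

# Column-family store: a row key mapping to named column families.
column_family_row = {"row_key": "u42",
                     "profile": {"name": "Asha", "city": "Pune"},
                     "activity": {"last_login": "2024-01-01"}}

# Graph database: nodes and an edge describing a relationship.
nodes = [{"id": "u42", "label": "User"}, {"id": "p7", "label": "Product"}]
edge = {"from": "u42", "to": "p7", "type": "PURCHASED"}
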
Explain the messaging system.

Ans:

A messaging system is a technology that enables communication between different parts of a piece of software or between different applications. It plays a crucial role in software engineering because it allows decoupled communication, providing highly efficient, reliable, and asynchronous message transmission among the various components of an application.

Key components of a messaging system include:

1. Sender and Receiver: A sender, or producer, is an application, program, or system that creates and sends a message. A receiver, or consumer, is the entity that gets the message.
2. Message: A single unit of communication, consisting of a header (containing the metadata) and a body (containing the actual content).
3. Channel: The communication medium that carries messages from sender to receiver.
4. Message Broker: An intermediary server responsible for receiving messages from senders and routing them to the correct receivers.
5. Queue: A line of messages waiting to be processed.

These systems can be used for various purposes, such as integration between systems, remote procedure calls (RPCs), document transfer, and event announcements. They are designed to reliably move messages from the sender's computer to the receiver's computer, making the messaging system responsible for transferring data from one application to another. This allows applications to focus on what data they need to share rather than on how to share it.
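
The following minimal, in-process Python sketch illustrates the pattern described above using only the standard library: a producer (sender) puts messages on a queue (acting as the channel), and a consumer (receiver) takes them off asynchronously. A real messaging system would replace the in-memory queue with a broker.

import queue
import threading

channel = queue.Queue()  # acts as the channel/queue between sender and receiver

def producer():
    for i in range(3):
        message = {"header": {"id": i}, "body": f"payload {i}"}
        channel.put(message)          # send the message
    channel.put(None)                 # sentinel: no more messages

def consumer():
    while True:
        message = channel.get()       # receive (blocks until a message arrives)
        if message is None:
            break
        print("received:", message["body"])

threading.Thread(target=producer).start()
consumer()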

Explain Apache Kafka.

Ans:

Apache Kafka is an open-source distributed event streaming platform. It was developed at LinkedIn and is now maintained by the Apache Software Foundation. Here are some key points about Apache Kafka:

1. Event Streaming: Kafka is designed to handle real-time data feeds. An event is any type of action, incident, or change that is identified or recorded by software or applications. Kafka models events as key/value pairs.
2. Distributed and Scalable: Kafka is distributed and scalable, meaning it can handle large volumes of data across a cluster of computers. It uses the concept of topics for categorizing messages, and these topics are split into partitions for parallel processing.
3. Producers and Consumers: Producers are processes that publish data to Kafka topics. Consumers are processes that pull messages off a Kafka topic.
4. Fault Tolerance: Kafka is designed to be fault-tolerant, ensuring that messages are not lost in case of failures and providing guarantees about the delivery of messages.
5. Use Cases: Kafka is used in real-time applications at companies such as Twitter, LinkedIn, and Netflix, for use cases such as real-time analytics, data ingestion, and event-driven architectures.

In summary, Apache Kafka is a powerful tool for handling real-time data in a distributed and fault-tolerant manner.
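
As an illustrative sketch (not part of the original answer), the snippet below uses the third-party kafka-python client, assuming it is installed and a broker is running at localhost:9092; the topic name "events" and the key/value payloads are hypothetical.

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish key/value events to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-42", value=b"clicked_checkout")
producer.flush()

# Consumer: pull messages off the same topic, starting from the beginning.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.key, record.value)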

Describe the HBase architecture in detail.

Ans:

HBase is a column-oriented distributed database developed on top of the Hadoop file system. It is designed to provide scalability by handling a large volume of read and write requests in real time. The architecture of HBase has three main components:

1. HMaster: HMaster is the implementation of the Master server in HBase. It monitors all Region Server instances present in the cluster. HMaster is responsible for assigning regions to region servers as well as handling DDL (create, delete table) operations. It also controls load balancing and failover.
2. Region Server: HBase tables are divided horizontally by row-key range into regions. Region Servers run on the HDFS DataNodes present in the Hadoop cluster. A Region Server is responsible for handling, managing, and executing reads and writes on its set of regions. The default size of a region is 256 MB.
3. ZooKeeper: ZooKeeper acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming, distributed synchronization, and server-failure notification. Clients locate region servers via ZooKeeper.

HBase leverages the basic features of HDFS and builds upon them to provide scalability. It provides real-time read and write access to data in HDFS. Data can be stored in HDFS directly or through HBase. It is an important component of the Hadoop ecosystem.

In summary, HBase is a powerful tool for handling real-time data in a distributed and fault-tolerant manner.
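
As a hedged illustration, the sketch below uses the third-party happybase client, assuming it is installed and the HBase Thrift server is running on localhost; the table name, row key, and column family are hypothetical.

import happybase

connection = happybase.Connection("localhost")
table = connection.table("users")

# Write a row: the row key routes the write to the Region Server that owns
# that key range; columns are addressed as b"family:qualifier".
table.put(b"user42", {b"profile:name": b"Asha", b"profile:city": b"Pune"})

# Real-time read of the same row back from HBase.
row = table.row(b"user42")
print(row[b"profile:name"])

connection.close()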
Describe the Apache Spark architecture.

Ans:

Apache Spark is a distributed computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Here's a detailed explanation of its architecture:

1. Spark Driver: The driver program runs the main() function of the application and creates a SparkContext. The driver splits the Spark application into tasks and schedules them to run on executors.
2. Executors: Executors are processes on worker nodes in charge of running individual tasks in a given Spark job. They run tasks concurrently in separate threads.
3. Cluster Manager: The cluster manager (such as Hadoop YARN, Apache Mesos, or the standalone Spark cluster manager) is responsible for acquiring resources on the Spark cluster and allocating them to a Spark job.
4. Resilient Distributed Datasets (RDDs): The RDD is a fundamental data structure of Spark. It is an immutable distributed collection of objects. RDDs can be created through deterministic operations on data in stable storage or on other RDDs.
5. Directed Acyclic Graph (DAG) Scheduler: The DAG scheduler divides the Spark job into multiple stages. Stages are created based on the transformations in the Spark job. The DAG scheduler then submits the resulting tasks for execution on the cluster.
6. Catalyst Optimizer: The Catalyst optimizer is a query optimization framework in Spark SQL. It optimizes Spark SQL queries for improved performance.
7. SparkContext: The SparkContext is the entry point of any Spark application. It allows the Spark driver to access the cluster through a cluster manager.

In summary, Apache Spark's architecture is designed to handle big data workloads by distributing tasks across a cluster of machines for faster processing and analysis.
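
A minimal PySpark sketch (assuming a local Spark installation) ties these pieces together: the driver creates a SparkContext, transformations build up a DAG lazily, and an action triggers execution of tasks on the executors.

from pyspark import SparkContext

sc = SparkContext("local[*]", "ArchitectureExample")  # driver + SparkContext

rdd = sc.parallelize(range(1, 11))                    # an immutable RDD
squares = rdd.map(lambda x: x * x)                    # transformation (lazy)
total = squares.reduce(lambda a, b: a + b)            # action: triggers the DAG

print(total)   # 385
sc.stop()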

Explain ZooKeeper.

Ans:

Apache ZooKeeper is a distributed, open-source coordination service for distributed applications. It provides a central place for distributed applications to store data, communicate with one another, and coordinate activities. Here are some key points about ZooKeeper:

1. Distributed Coordination: ZooKeeper is used in distributed systems to coordinate distributed processes and services. It exposes a simple set of primitives on which higher-level services for synchronization, configuration maintenance, groups, and naming can be implemented.
2. Znodes: In a distributed system, multiple nodes or machines need to communicate with each other and coordinate their actions. ZooKeeper ensures that these nodes are aware of each other and can coordinate by maintaining a hierarchical tree of data nodes called "znodes", which can be used to store and retrieve data and maintain state information.
3. Primitives: ZooKeeper provides a set of primitives, such as locks, barriers, and queues, that can be used to coordinate the actions of nodes in a distributed system. It also provides features such as leader election, failover, and recovery, which help ensure that the system is resilient to failures.
4. Use Cases: ZooKeeper is widely used in distributed systems such as Hadoop, Kafka, and HBase, and it has become an essential component of many distributed applications.

In summary, ZooKeeper is a crucial tool in distributed systems, providing a reliable and efficient means of coordinating and managing the various components of a distributed application.
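
As a hedged sketch, the snippet below uses the third-party kazoo client, assuming it is installed and a ZooKeeper ensemble is reachable at 127.0.0.1:2181; the znode path and stored value are hypothetical.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a small piece of configuration in a znode.
zk.ensure_path("/app/config")
zk.create("/app/config/db_url", b"jdbc:mysql://db:3306/app", ephemeral=False)

# Read it back; other nodes in the cluster would see the same value.
value, stat = zk.get("/app/config/db_url")
print(value.decode(), stat.version)

zk.stop()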

Compare traditional stream processing systems and Spark Streaming.

Ans:

Traditional Streaming Processing Systems:

• Traditional streaming processing systems struggle with fault recovery, often requiring hot replication and long recovery times.
• They handle stragglers (long-running tasks or slow nodes) suboptimally.
• A delay while the streaming application recovers from a fault or straggler can impact decision-making.
• Overall, they are less efficient at handling stragglers and recovering from faults.

Spark Streaming:

• Spark Streaming is a separate library within the Spark engine designed to process streaming, or continuously flowing, data.
• It employs the DStream API, powered by Spark RDDs (Resilient Distributed Datasets), to partition data into chunks for processing before forwarding them to their destination.
• RDDs are a fundamental data structure in Apache Spark as well as in Spark Streaming.
• They provide fault tolerance and efficient parallel processing of data.
• An RDD is an immutable distributed collection of objects that can be stored in memory or on disk and partitioned across multiple nodes in a cluster.
• RDDs support fault tolerance by dividing data into partitions, each of which can be replicated to multiple nodes in the cluster, enabling recovery from node failures.
• Spark Streaming uses the DStream API, while Structured Streaming uses the DataFrame and Dataset APIs.
• Spark Streaming is designed for continuous transformation.

In summary, while traditional streaming processing systems have their own strengths, Spark Streaming offers a more efficient and fault-tolerant solution for processing real-time data.
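
For contrast with the DStream API, here is a minimal Structured Streaming sketch using the DataFrame API; it assumes a local Spark installation and a text source on localhost port 9999 (both hypothetical).

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Read a stream of lines from a socket into a streaming DataFrame.
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()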

Identify and explain the 5V's of Big data computing.

Ans:

The 5 V's of Big Data are a framework for understanding the complexities of big data. They are as follows:

1. Volume: This refers to the enormous amount of data generated every second from various sources. The term 'Big Data' itself relates to a size that is enormous, and the volume of data plays a crucial role in determining its value.
2. Velocity: This refers to the speed at which new data is generated and moves around. In big data, data flows in from sources such as machines, networks, social media, and mobile phones in a massive and continuous stream.
3. Variety: This refers to the different types of data we can now use. Data can be structured, semi-structured, or unstructured, and can originate from multiple sources.
4. Veracity: This refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable, but the volume often compensates for the lack of quality or accuracy.
5. Value: After considering the other four V's, the fifth V is value. The value of big data isn't about the data itself, but about how organizations use it to create value for their business.

These 5 V's provide a taxonomy for classifying data into manageable categories, simplifying the process of understanding big data and its business value.

Develop a Hadoop 1.0 program to count the number of times words occur in the given input file.

Explain its programming model and its phases. Input File - abc.txt as follows

[ big data computing

Hadoop data is big

Data is Hadoop big

Big is Big computing ]

Ans:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: tokenize each input line and emit (word, 1) for every word.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts for each word and emit (word, total).
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job; input/output paths come from args.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This program follows the MapReduce programming model, which consists of two phases:

1. Map Phase: The input data is divided into chunks, and the map function is applied to each chunk in parallel. In this case, the map function tokenizes the input lines into words and emits a key-value pair for each word: the key is the word, and the value is 1.
2. Reduce Phase: The output from the map phase is shuffled and sorted so that all key-value pairs with the same key are grouped together. The reduce function is then applied to each group of values. In this case, the reduce function sums up the values for each word and emits a key-value pair with the word and its total count.

The main function sets up the job configuration, specifies the input and output paths,
and starts the job. The input file should be placed in HDFS, and the output will be written
to HDFS as well.

Please note that you need to compile and package this program into a JAR file, and
then you can run it using the Hadoop command-line interface. Also, Hadoop and Java
need to be properly installed and configured on your system.

This is a simple example and real-world MapReduce jobs might be more complex,
involving multiple map and reduce stages, complex data transformations, and
additional libraries or tools. But the basic principles remain the same.

Summarize the big data sources and their applications.


Ans:

Big data is produced from a variety of sources and has numerous applications across different industries.

Data Sources:
• Social Media: Platforms like Facebook, Twitter, Instagram, and YouTube generate a vast amount of data through user activities such as posts, likes, shares, and comments.
• Mobile Apps: Applications on smartphones and tablets produce data related to user behavior, usage patterns, and location information.
• Emails: The content, metadata, and attachments of emails can be a valuable source of data.
• Transactions: Purchase records, credit card transactions, and other financial data provide insights into consumer behavior.
• Internet of Things (IoT) Sensors: Devices connected to the IoT, such as smart home devices, wearables, and industrial sensors, generate a continuous stream of data.

Applications:
• Healthcare: Big data analytics can help in predicting disease outbreaks, improving patient care, and advancing medical research.
• Finance: Financial institutions use big data for risk analysis, fraud detection, and customer segmentation.
• Marketing: Companies analyze customer behavior and market trends to improve their marketing strategies.
• Education: Educational institutions use big data to track student performance and improve teaching methods.
• Surveillance: Law enforcement agencies use big data for crime prediction and prevention.

In summary, big data is a vast and diverse field with numerous sources and applications. It is transforming industries by providing valuable insights and aiding decision-making.

Describe the Spark Streaming architecture in detail.

Ans:

Apache Spark Streaming is a real-time processing tool that operates on data in a distributed manner. It treats the stream as a series of micro-batches. Here's a detailed explanation of its architecture:

1. Data Ingestion: Streaming data is first ingested from one or more sources, such as network sockets (TCP), Kafka, Kinesis, or IoT devices.
2. Micro-Batching: Instead of processing the stream one record at a time, the data is divided into small chunks, referred to as batches. These micro-batches are dynamically assigned and processed.
3. Processing: The data is then pushed to the processing stage, where Spark Streaming provides several high-throughput operations such as window, map, join, and reduce.
4. Discretized Streams (DStreams): DStreams are the fundamental abstraction here; they represent streams of data divided into small chunks (batches). Internally, a DStream is represented as a sequence of RDDs.
5. Output: The processed data can be pushed out to filesystems, databases, and live dashboards. Spark's machine learning and graph processing algorithms can also be applied to data streams.
6. Fault Tolerance: Spark Streaming is designed to be fault-tolerant, ensuring that messages are not lost in case of failures and providing guarantees about the delivery of messages.

In summary, Spark Streaming's architecture is designed to handle real-time data in a fast, scalable, and fault-tolerant manner.
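
As a hedged illustration of the windowed processing mentioned in step 3, the sketch below uses the PySpark DStream API; the socket source, port, and checkpoint directory are hypothetical.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedCounts")
ssc = StreamingContext(sc, batchDuration=2)        # 2-second micro-batches
ssc.checkpoint("/tmp/spark-checkpoint")            # recommended for stateful/windowed operations

events = ssc.socketTextStream("localhost", 9999)   # ingestion (could equally be Kafka or Kinesis)
pairs = events.map(lambda e: (e, 1))

# Count events per key over a sliding 30-second window, recomputed every 10 seconds.
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b,
                                      None,        # no inverse function
                                      windowDuration=30,
                                      slideDuration=10)
windowed.pprint()

ssc.start()
ssc.awaitTermination()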

Describe the YARN architecture in detail.

Ans:

YARN is the resource management layer of Hadoop. It was introduced in Hadoop 2.0 to remove the bottleneck on the JobTracker that was present in Hadoop 1.0. YARN separates the resource management layer from the processing layer. It allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run on and process data stored in HDFS (Hadoop Distributed File System), making the system much more efficient.

The main components of the YARN architecture are:

1. Client: It submits MapReduce jobs.
2. Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. It has two major components:
   o Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, which means it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails.
   o Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.
3. Node Manager: It takes care of an individual node in a Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep in sync with the Resource Manager.

Through its various components, YARN can dynamically allocate resources and schedule application processing. For large-volume data processing, it is necessary to manage the available resources properly so that every application can leverage them.

Select the best movie to watch for your friend using big data analytics. Use the star ratings given for each movie by the users, with the following data

(High rating = 5, low rating = 1), (use the related Hadoop layer for processing)

Sr_No | User_Id | Movie_Id | Rating

1 | U228 | M0404 | 2

2 | U722 | M0304 | 3

3 | U298 | M0404 | 1

4 | U484 | M0304 | 4

5 | U111 | M0204 | 3

6 | U707 | M0204 | 2

7 | U123 | M0504 | 2

8 | U555 | M0304 | 5

Ans:

# mapper.py
import sys

# Emit (movie_id, rating) for every valid data line of the form
# Sr_No | User_Id | Movie_Id | Rating, skipping the header row.
for line in sys.stdin:
    data = line.strip().split("|")
    if len(data) == 4 and data[3].strip().isdigit():
        sr_no, user_id, movie_id, rating = [field.strip() for field in data]
        print(f"{movie_id}\t{rating}")

# reducer.py
import sys

current_movie = None
current_rating_sum = 0
current_rating_count = 0

# Input arrives sorted by movie_id, so ratings for the same movie are adjacent.
for line in sys.stdin:
    movie, rating = line.strip().split("\t")
    rating = int(rating)

    if current_movie == movie:
        current_rating_sum += rating
        current_rating_count += 1
    else:
        if current_movie:
            average_rating = current_rating_sum / current_rating_count
            print(f"{current_movie}\t{average_rating}")
        current_rating_sum = rating
        current_rating_count = 1
        current_movie = movie

# Emit the average for the last movie group.
if current_movie:
    average_rating = current_rating_sum / current_rating_count
    print(f"{current_movie}\t{average_rating}")

You would then run a Hadoop Streaming MapReduce job with these scripts as the mapper and reducer. The output will be the average rating for each movie, and the movie with the highest average rating can be considered the best movie to watch.
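
For reference, running this job over the eight ratings listed above would produce the following per-movie averages (keys in sorted order), so M0304, with an average of 4.0, would be the movie to recommend:

M0204   2.5
M0304   4.0
M0404   1.5
M0504   2.0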

Please note that this is a simple example and real-world big data analytics might involve
more complex calculations and data processing steps. Also, this assumes that all
ratings are equally important, which might not be the case in a real-world scenario
where you might want to weigh more recent ratings more heavily, for example. Finally,
this doesn’t take into account the personal preferences of your friend, which could also
be a significant factor in choosing a movie.
Describe the Apache Spark architecture in detail.
Ans:

Apache Spark is a distributed processing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Here's a detailed explanation of its architecture:

1. Spark Driver: The driver program runs the main() function of the application and creates a SparkContext. The driver splits the Spark application into tasks and schedules them to run on executors.
2. Executors: Executors are processes on worker nodes in charge of running individual tasks in a given Spark job. They run tasks concurrently in separate threads.
3. Cluster Manager: The cluster manager (such as Hadoop YARN, Apache Mesos, or the standalone Spark cluster manager) is responsible for acquiring resources on the Spark cluster and allocating them to a Spark job.
4. Resilient Distributed Datasets (RDDs): The RDD is a fundamental data structure of Spark. It is an immutable distributed collection of objects. RDDs can be created through deterministic operations on data in stable storage or on other RDDs.
5. Directed Acyclic Graph (DAG) Scheduler: The DAG scheduler divides the Spark job into multiple stages. Stages are created based on the transformations in the Spark job. The DAG scheduler then submits the resulting tasks for execution on the cluster.
6. Catalyst Optimizer: The Catalyst optimizer is a query optimization framework in Spark SQL. It optimizes Spark SQL queries for improved performance.
7. SparkContext: The SparkContext is the entry point of any Spark application. It allows the Spark driver to access the cluster through a cluster manager.

In summary, Apache Spark's architecture is designed to handle big data workloads by distributing tasks across a cluster of machines for faster processing and analysis.

Explain Big Data Computing

Ans:

Big Data Computing is a field that deals with the processing, storage, and analysis of large volumes
of data, often referred to as “big data”. This data can come from various sources and in various
formats, including structured data (like databases) and unstructured data (like text documents or
social media posts).

Here are some key concepts related to Big Data Computing:


• Volume: Big data involves dealing with data at a very large scale, often in the order of
petabytes or exabytes.
• Velocity: The speed at which data is generated, processed, and analyzed is crucial in big data
computing. Real-time processing and analysis can be a requirement in many applications.
• Variety: Big data can come in various formats - structured, semi-structured, and
unstructured. Handling this variety is a significant aspect of big data computing.
• Veracity: This refers to the quality and reliability of the data. Since big data can come from
various sources, ensuring the accuracy and consistency of the data can be challenging.
• Value: The goal of big data computing is to extract valuable insights from the massive
amounts of data.

Big Data Computing often involves distributed storage and processing frameworks like Hadoop and
Spark, NoSQL databases like MongoDB and Cassandra, and data processing tools like Hive and Pig.
These technologies allow for the storage, processing, and analysis of big data across clusters of
computers.

Summarize the Big Data sources and their applications.

Ans:

Sources of Big Data:

• Social media: Platforms like Facebook, Twitter, and Instagram generate massive amounts of
data every day.
• Transaction Data: This includes purchase records, credit card transactions, and more.
• Machine-Generated Data: This includes data from sensors, medical devices, and telemetry
systems.
• Publicly Available Sources: Government databases, weather data, and other public sources
also contribute to big data.

Applications of Big Data:

• Healthcare: Big data is used for disease detection, health trend analysis, and medical
research.
• Finance: Financial institutions use big data for risk analysis, fraud detection, and customer
segmentation.
• Retail: Big data helps in understanding customer behavior, optimizing supply chains, and
personalizing shopping experiences.
• Transportation: Big data can optimize routes, reduce fuel consumption, and improve
logistics.
• Telecommunications: Telecom companies use big data for network optimization, customer
churn prediction, and service personalization.
Describe the Hadoop architecture.

Ans:
Hadoop follows a master-slave architecture design for data storage and distributed data processing
using MapReduce and HDFS methods.

Hadoop Distributed File System (HDFS): HDFS has a master-slave architecture with two main components:

• NameNode (Master Node): It manages the file system metadata, i.e., it keeps track of all files in the system and tracks the file data across the cluster or multiple machines. There is only one NameNode in a cluster.
• DataNodes (Slave Nodes): These are the nodes that live on each machine in your Hadoop cluster to store the actual data. There can be one or more DataNodes in a cluster.

MapReduce: MapReduce also follows a master-slave architecture with two main components:

• JobTracker (Master Node): It is responsible for resource management, tracking resource availability, and task life-cycle management.
• TaskTrackers (Slave Nodes): They execute tasks upon instruction from the JobTracker and provide task-status information to the JobTracker periodically.

In summary, Hadoop architecture is designed to handle big data in a distributed environment. The
HDFS stores data, and MapReduce processes the data. The master nodes (NameNode and
JobTracker) manage and monitor the slave nodes (DataNode and TaskTracker). The architecture is
designed to be robust and handle failures, as the data is replicated across multiple DataNodes.

Identify the 5V's of Big Data Computing.

Ans:

• Volume: This refers to the vast amounts of data generated every second. In big data, volume
is a significant aspect as it deals with huge amounts of data.
• Velocity: This refers to the speed at which new data is generated and the speed at which
data moves around. In many cases, it’s important to be able to analyze this data in real-time.
• Variety: This refers to the different types of data we can now use. Data today comes in many
different formats: structured data, semi-structured data, unstructured data and even complex
structured data.
• Veracity: This refers to the messiness or trustworthiness of the data. With many forms of big
data, quality and accuracy are less controllable, but the volume often makes up for the lack
of quality or accuracy.
• Value: This refers to our ability to turn our data into value. It’s all well and good having
access to big data but unless we can turn it into value it’s useless.

Big Data is data which is:


1) High Volume
2) High Variety
3) High Velocity

4) All

The Daemon or process of HDFS is:

1) YARN

2) Name Node

3) Job Tracker

4) Hbase

Hbase is:

1) Hadoop language

2) Analytic tool

3) File System

4) In - memory database

Read() of HDFS deals with......functions:

1) Name Node - create()

2) DFS - create()

3) FSDataInputStream - Read()

4) FSDataOutStream - Read()

Develop a Hadoop 1.0 program to count the number of times words occur in the input file. Explain its
programming model and its phases. Input File - abc.txt as follows

[ big data computing

Hadoop data is big

Data is Hadoop data

Big is Big computing ]


Ans:
The MapReduce program in Hadoop for counting the number of times words occur in an input file
follows two main phases:

• Map Phase: In this phase, the input data is divided into chunks, and each chunk is processed
independently. The map function takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value pairs). In the context of
word count, the map function tokenizes the lines of the input file into words and maps each
word with the value of 1.
• Reduce Phase: In this phase, the output from the map phase is taken as input and combined
into a smaller set of tuples. The reduce function sums up the occurrences of each word and
outputs the word counts.

The Hadoop Distributed File System (HDFS) is used to store the input and output of the MapReduce
tasks. The NameNode (master) manages the file system metadata, and the DataNodes (slaves)
store the actual data.

The JobTracker (master) manages the MapReduce jobs, and the TaskTrackers (slaves) execute
tasks upon instruction from the JobTracker and provide task-status information to the JobTracker
periodically.
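
The Java WordCount implementation shown earlier in this document answers this question directly. As an alternative sketch, a Hadoop Streaming version in Python could look like the following (the file names mapper.py and reducer.py are hypothetical):

# mapper.py -- emits (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums counts per word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, count
if current_word is not None:
    print(f"{current_word}\t{current_count}")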
