Bigdata Final

1. Describe the 6V's of Big Data Analytics with diagram


 Volume: Volume refers to the large amount of data that is generated every
day. With the advancement of technology and the widespread use of the
internet and social media, huge amounts of data are being generated
continuously. This data comes in various forms, including structured, semi-
structured, and unstructured data. Structured data is organized data that is
stored in a database, while semi-structured data refers to data that has some
organizational structure, such as XML and JSON files. Unstructured data, on
the other hand, refers to data that has no defined structure, such as social
media posts, videos, and images.
 Velocity: Velocity refers to the speed at which data is generated, collected,
processed, and analyzed. With the increase in the volume of data being
generated, the speed at which it is generated also increases. This means that
big data analytics requires fast and real-time processing of data to extract
meaningful insights.
 Variety: Variety refers to the different types of data that are being
generated. As mentioned earlier, data can come in various forms, including
structured, semi-structured, and unstructured data. Big data analytics
requires the ability to handle various types of data from multiple sources,
including text, audio, video, social media, and IoT sensors.
 Veracity: Veracity refers to the accuracy and reliability of the data being
analyzed. With the increase in the volume and variety of data, it becomes
increasingly difficult to ensure the accuracy and reliability of the data.
Therefore, big data analytics requires the use of data cleansing and data
quality techniques to ensure the veracity of the data.
 Value: Value refers to the ability to extract meaningful insights from the
data being analyzed. The ultimate goal of big data analytics is to derive
actionable insights from the data that can help organizations make informed
decisions. Therefore, big data analytics requires the use of advanced
analytics techniques such as machine learning, predictive analytics, and data
mining to extract insights that can add value to the business.
 Visualization: Visualization refers to the ability to represent the insights
derived from the data in a meaningful and understandable way. Big data
analytics requires the use of data visualization techniques to represent
complex data in a simple and easy-to-understand format. This helps
decision-makers to understand the insights and make informed decisions.

2.Represent diagrammatically the computing model of Hadoop.


Hadoop is an Apache open source framework written in Java that allows
distributed processing of large datasets across clusters of computers using
simple programming models. The Hadoop framework application works in an
environment that provides distributed storage and computation across clusters
of computers. Hadoop is designed to scale up from single server to thousands of
machines, each offering local computation and storage.
Hadoop follows a distributed computing model that allows for the processing of
large amounts of data across multiple computers. The computing model consists
of three main components: the Hadoop Distributed File System (HDFS), the
MapReduce programming model, and the YARN resource manager.

 Hadoop Distributed File System (HDFS): HDFS is the primary storage system used by Hadoop. It is a distributed file system that allows for the storage of large data sets across multiple machines. The data is stored in blocks and replicated across multiple machines to ensure data availability and reliability.
 MapReduce Programming Model: MapReduce is a programming model used by Hadoop for processing large data sets. It consists of two main functions: Map and Reduce. The Map function takes in input data and processes it to generate key-value pairs. The Reduce function takes in the output of the Map function and aggregates the data to generate a final output (a small sketch of this flow is given at the end of this answer).
 YARN Resource Manager: YARN (Yet Another Resource Negotiator) is
the resource manager used by Hadoop. It manages the resources required by
the Hadoop cluster, such as CPU, memory, and disk, and allocates them to
different applications running on the cluster.

Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs −
 Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (128 MB is preferred).
 These files are then distributed across various cluster nodes for further
processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to the node that runs the corresponding reduce task.
 Writing the debugging logs for each job.
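
To make the Map and Reduce functions and the intermediate sort described above concrete, here is a minimal pure-Python sketch of a word-count flow (an illustration of the programming model only, not Hadoop API code; the input lines are hypothetical):

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: aggregate the values for each key after the sort.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

input_lines = ["big data needs big clusters", "hadoop processes big data"]  # hypothetical input
intermediate = sorted(map_phase(input_lines))  # the sort between the map and reduce stages
for word, total in reduce_phase(intermediate):
    print(word, total)

In a real cluster, the map and reduce functions run as separate tasks on different nodes, and HDFS supplies the input and stores the output.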

3.Elaborate on the Hadoop Architecture with block diagram


Hadoop is an open-source framework that enables distributed storage and
processing of large datasets across multiple computers using simple
programming models. The Hadoop architecture comprises four main
components: Hadoop Distributed File System (HDFS), MapReduce, YARN,
and Hadoop Common.
 Hadoop Distributed File System (HDFS)
 MapReduce
 YARN
 Hadoop Common

Hadoop Distributed File System (HDFS): HDFS is the primary storage system used by Hadoop. It is a distributed file system that allows for the storage of large data sets across multiple machines. The data is stored in blocks and replicated across multiple machines to ensure data availability and reliability.
HDFS architecture consists of two main components: NameNode and
DataNode.
 NameNode: It is the master node of the HDFS cluster and is responsible
for maintaining the metadata of the data stored in HDFS, such as the
location of the data blocks, permissions, and access control. NameNode
also manages the replication of data blocks across the DataNodes.
 DataNode: It is the slave node of the HDFS cluster and is responsible for
storing the actual data blocks. DataNode periodically reports its status to
the NameNode and sends heartbeats to let the NameNode know that it is
still alive.
 MapReduce: MapReduce is a programming model used by Hadoop for
processing large data sets. It consists of two main functions: Map and
Reduce. The Map function takes in input data and processes it to generate
key-value pairs. The Reduce function takes in the output of the Map
function and aggregates the data to generate a final output.

The MapReduce architecture consists of three main components: JobTracker, TaskTracker, and JobHistoryServer.
 JobTracker: It is the master node of the MapReduce cluster and is
responsible for scheduling the MapReduce jobs on the TaskTrackers. It
also monitors the progress of the jobs and handles task failures.
 TaskTracker: It is the slave node of the MapReduce cluster and is
responsible for executing the Map and Reduce tasks assigned by the
JobTracker. It also sends heartbeats to the JobTracker to let it know that it
is still alive.
 JobHistoryServer: It is responsible for storing the job history
information, such as job status, start time, end time, and input/output data
locations.

YARN: YARN (Yet Another Resource Negotiator) is the resource manager used by Hadoop. It manages the resources required by the Hadoop cluster, such as CPU, memory, and disk, and allocates them to different applications running on the cluster. The YARN architecture consists of two main components: ResourceManager and NodeManager.
 ResourceManager: It is the master node of the YARN cluster and is
responsible for managing the resources and scheduling the applications
on the NodeManagers. It also maintains the application lifecycle and
monitors the resources used by the applications.
 NodeManager: It is the slave node of the YARN cluster and is
responsible for managing the resources on the individual nodes, such as
CPU, memory, and disk. It also launches and monitors the containers that
execute the application tasks.

Hadoop Common: Hadoop Common is a set of common libraries and utilities used by all the other Hadoop components. It includes various Java libraries and utilities, such as the Hadoop shell, logging, and configuration files.

4. Explain different techniques used in Data Visualization with diagram
The type of data visualization technique you leverage will vary based on the
type of data you’re working with, in addition to the story you’re telling with
your data.
Here are some important data visualization techniques to know:
 Pie Chart
 Bar Chart
 Histogram
 Gantt Chart
 Heat Map
 Box and Whisker Plot
 Waterfall Chart
 Area Chart
 Scatter Plot
 Pictogram Chart

1. Pie Chart

Pie charts are one of the most common and basic data visualization techniques,
used across a wide range of applications. Pie charts are ideal for illustrating
proportions, or part-to-whole comparisons.
Because pie charts are relatively simple and easy to read, they’re best suited for
audiences who might be unfamiliar with the information or are only interested
in the key takeaways. For viewers who require a more thorough explanation of
the data, pie charts fall short in their ability to display complex information.
2. Bar Chart
The classic bar chart, or bar graph, is another common and easy-to-use method
of data visualization. In this type of visualization, one axis of the chart shows
the categories being compared, and the other, a measured value. The length of
the bar indicates how each group measures according to the value.
One drawback is that labeling and clarity can become problematic when there
are too many categories included. Like pie charts, they can also be too simple
for more complex data sets.
3. Histogram

Unlike bar charts, histograms illustrate the distribution of data over a continuous
interval or defined period. These visualizations are helpful in identifying where
values are concentrated, as well as where there are gaps or unusual values.
Histograms are especially useful for showing the frequency of a particular
occurrence. For instance, if you’d like to show how many clicks your website
received each day over the last week, you can use a histogram. From this
visualization, you can quickly determine which days your website saw the
greatest and fewest number of clicks.
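
As a brief illustration of the click-count example above, the following Python/Matplotlib snippet (with made-up numbers) plots a histogram showing how daily click counts are distributed:

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: daily click counts observed over 30 days.
rng = np.random.default_rng(seed=0)
daily_clicks = rng.normal(loc=500, scale=80, size=30)

# The histogram shows how often each range of click counts occurred.
plt.hist(daily_clicks, bins=10, edgecolor="black")
plt.xlabel("Clicks per day")
plt.ylabel("Number of days")
plt.title("Distribution of daily website clicks (sample data)")
plt.show()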
4. Gantt Chart
Gantt charts are particularly common in project management, as they’re useful
in illustrating a project timeline or progression of tasks. In this type of chart,
tasks to be performed are listed on the vertical axis and time intervals on the
horizontal axis. Horizontal bars in the body of the chart represent the duration of
each activity.
Utilizing Gantt charts to display timelines can be incredibly helpful, and enable
team members to keep track of every aspect of a project. Even if you’re not a
project management professional, familiarizing yourself with Gantt charts can
help you stay organized.

5. Heat Map

A heat map is a type of visualization used to show differences in data through variations in color. These charts use color to communicate values in a way that makes it easy for the viewer to quickly identify trends. Having a clear legend is necessary in order for a user to successfully read and interpret a heat map.
There are many possible applications of heat maps. For example, if you want to
analyze which time of day a retail store makes the most sales, you can use a
heat map that shows the day of the week on the vertical axis and time of day on
the horizontal axis. Then, by shading in the matrix with colors that correspond
to the number of sales at each time of day, you can identify trends in the data
that allow you to determine the exact times your store experiences the most
sales.
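
A minimal sketch of the sales heat map described above, using Python/Matplotlib with made-up sales counts (rows are days of the week, columns are opening hours):

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sales counts: 7 days (rows) x 12 opening hours (columns).
rng = np.random.default_rng(seed=1)
sales = rng.integers(low=0, high=50, size=(7, 12))

fig, ax = plt.subplots()
im = ax.imshow(sales, cmap="YlOrRd")  # colour shade encodes the number of sales
ax.set_yticks(range(7))
ax.set_yticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
ax.set_xticks(range(12))
ax.set_xticklabels([f"{h}:00" for h in range(9, 21)])
fig.colorbar(im, ax=ax, label="Sales")  # the legend/scale for the colours
ax.set_title("Sales by day and hour (sample data)")
plt.show()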
6. Box and Whisker Plot

A box and whisker plot, or box plot, provides a visual summary of data through
its quartiles. First, a box is drawn from the first quartile to the third quartile of the data set. A line within the box represents the median. “Whiskers,” or lines, are then
drawn extending from the box to the minimum (lower extreme) and maximum
(upper extreme). Outliers are represented by individual points that are in-line
with the whiskers.
This type of chart is helpful in quickly identifying whether or not the data is
symmetrical or skewed, as well as providing a visual summary of the data set
that can be easily interpreted.

7. Waterfall Chart
A waterfall chart is a visual representation that illustrates how a value changes
as it’s influenced by different factors, such as time. The main goal of this chart
is to show the viewer how a value has grown or declined over a defined period.
For example, waterfall charts are popular for showing spending or earnings over
time.

8. Area Chart

An area chart, or area graph, is a variation on a basic line graph in which the
area underneath the line is shaded to represent the total value of each data point.
When several data series must be compared on the same graph, stacked area
charts are used.
This method of data visualization is useful for showing changes in one or more
quantities over time, as well as showing how each quantity combines to make
up the whole. Stacked area charts are effective in showing part-to-whole
comparisons.

9. Scatter Plot
Another technique commonly used to display data is a scatter plot. A scatter
plot displays data for two variables as represented by points plotted against the
horizontal and vertical axis. This type of data visualization is useful in
illustrating the relationships that exist between variables and can be used to
identify trends or correlations in data.
Scatter plots are most effective for fairly large data sets, since it’s often easier to
identify trends when there are more data points present. Additionally, the closer
the data points are grouped together, the stronger the correlation or trend tends
to be.
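
A short Python/Matplotlib sketch of a scatter plot, using made-up data for two loosely correlated variables (the variable names are hypothetical):

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: advertising spend vs. revenue, with some noise.
rng = np.random.default_rng(seed=2)
ad_spend = rng.uniform(low=1, high=100, size=200)
revenue = 3.0 * ad_spend + rng.normal(loc=0, scale=40, size=200)

plt.scatter(ad_spend, revenue, alpha=0.6)
plt.xlabel("Ad spend")
plt.ylabel("Revenue")
plt.title("Relationship between ad spend and revenue (sample data)")
plt.show()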

10. Pictogram Chart

Pictogram charts, or pictograph charts, are particularly useful for presenting simple data in a more visual and engaging way. These charts use icons to
visualize data, with each icon representing a different value or category. For
example, data about time might be represented by icons of clocks or watches.
Each icon can correspond to either a single unit or a set number of units (for
example, each icon represents 100 units).
In addition to making the data more engaging, pictogram charts are helpful in
situations where language or cultural differences might be a barrier to the
audience’s understanding of the data.

5.Explain the architecture of Materialized Views.


Materialized views are often used in data warehousing to improve query
performance. The architecture of materialized views in big data is similar to that
of traditional databases, but with some differences.
The architecture of materialized views in big data typically involves a distributed file system, such as Hadoop HDFS, and a query engine, such as Apache Hive or Apache Impala.

The source data is stored in the distributed file system, which is typically
partitioned across multiple nodes in a cluster. The query engine provides an
SQL-like interface for querying the data, but instead of executing queries
directly against the source data, it can create materialized views based on the
queries. The materialized views are stored as physical tables in the distributed
file system, which can be queried like any other table.
When a query is executed against the materialized view, the query engine can
read the data directly from the physical table, rather than computing the result
from scratch each time. This can significantly improve query performance,
especially for complex queries that involve aggregations, joins, or other
expensive operations.
However, materialized views in big data also have some limitations. Because
the data is stored in a distributed file system, updates to the source data can be
slow or require complex synchronization mechanisms. Additionally, the
materialized views themselves can take up significant storage space, especially
if they are created based on large datasets or complex queries. As a result,
materialized views in big data are typically used in combination with other
optimization techniques, such as partitioning, indexing, and caching, to achieve
the best possible query performance.

6.Explain CAP Theorem with diagram


The CAP theorem, originally introduced as the CAP principle, can be used to
explain some of the competing requirements in a distributed system with
replication. It is a tool used to make system designers aware of the trade-offs
while designing networked shared-data systems.
The three letters in CAP refer to three desirable properties of distributed
systems with replicated data: consistency (among replicated
copies), availability (of the system for read and write operations)
and partition tolerance (in the face of the nodes in the system being
partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the
desirable properties – consistency, availability, and partition tolerance at the
same time in a distributed system with data replication.
The theorem states that networked shared-data systems can only strongly
support two of the following three properties:

Consistency –
Consistency means that the nodes will have the same copies of a replicated
data item visible for various transactions. A guarantee that every node in a
distributed cluster returns the same, most recent, successful write.
Consistency refers to every client having the same view of the data. There are
various types of consistency models. Consistency in CAP refers to sequential
consistency, a very strong form of consistency.

Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In
simple terms, every node (on either side of a network partition) must be able to
respond in a reasonable amount of time.

Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other.
That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover
from partitions once the partition heals.

The use of the word consistency in CAP and its use in ACID do not refer to
the same identical concept.
In CAP, the term consistency refers to the consistency of the values in
different copies of the same data item in a replicated distributed system.
In ACID, it refers to the fact that a transaction will not violate the integrity
constraints specified on the database schema.

7. Identify the features and explain the Advantages and Disadvantages of NoSQL
NOSQL, which stands for "not only SQL", is a type of database that does not
use the traditional relational model of tables and rows. Instead, it uses a non-
tabular data model, such as document-based, graph-based, key-value pair, or
column-family, to store and manage data.
Features:
 Schemaless: NoSQL databases are schemaless, which means that they do not require a predefined schema to store data. This allows for greater flexibility in data modeling and allows for more dynamic and evolving data structures (see the short sketch after this feature list).
 High availability and fault-tolerance: NoSQL databases are designed to
operate in distributed environments, which allows for replication of data
across multiple nodes. This provides high availability and fault-tolerance,
which ensures that data is always available even if some nodes fail.
 High scalability: NoSQL databases are designed to handle large volumes of
data and a high volume of read/write requests. They are designed to scale
horizontally by adding more nodes to a cluster, which enables them to
handle increasing amounts of data and traffic.
 Performance: NoSQL databases are designed to be fast and efficient in
handling large volumes of data. They are optimized for specific types of data
models, which allows for faster queries and higher throughput.
 Consistency models: NoSQL databases offer various consistency models,
such as eventual consistency and strong consistency. This allows developers
to choose the appropriate consistency model based on their application's
needs.
 Open source: Many NoSQL databases are open-source, which means that
they are free to use and have a large community of developers who
contribute to their development and maintenance.
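
A tiny pure-Python sketch of the schemaless, document-style idea from the feature list above (this models the concept with ordinary dictionaries and is not the API of any particular NoSQL product):

# Each "document" is just a dictionary; documents in the same collection
# do not have to share the same fields (no predefined schema).
customers = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phones": ["555-1234", "555-5678"], "loyalty_tier": "gold"},
]

# New fields can be added to individual documents at any time.
customers[0]["last_order"] = {"order_id": 42, "amount": 199.0}

# Queries simply inspect whatever fields happen to be present.
gold_members = [c for c in customers if c.get("loyalty_tier") == "gold"]
print(gold_members)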
Advantages:
 High scalability and performance: NoSQL databases can easily scale
horizontally, which means that they can handle large amounts of data and a
high volume of read/write requests with ease.
 Flexibility: NoSQL databases can handle different types of data, including
structured, semi-structured, and unstructured data, which makes them a
better choice for big data applications.
 Distributed architecture: NoSQL databases can work in a distributed
environment, making them highly available and fault-tolerant. They can also
handle data replication, which means that they can maintain data consistency
even in the event of node failures.
 No fixed schema: NoSQL databases do not require a fixed schema, which
makes them easy to adapt to changing data structures. They can also handle
semi-structured and unstructured data with ease.
 High availability and fault-tolerance: NoSQL databases can provide high
availability and fault tolerance by replicating data across multiple nodes.
This means that even if one node fails, the data is still available and
accessible.
 Multiple query languages: NoSQL databases support multiple query
languages, which makes it easier for developers to work with them.
Disadvantages:
 Limited querying capabilities: NoSQL databases do not have the same
querying capabilities as traditional relational databases, which may limit
their use in certain applications.
 Lack of standardization: NoSQL databases are not as standardized as
traditional relational databases, which can make it difficult to move data
between different systems.
 Limited transaction support: NoSQL databases do not provide the same
level of transaction support as traditional relational databases, which may not
be suitable for applications that require ACID transactions.
 Limited community support: NoSQL databases are not as widely used as
traditional relational databases, which means that they may have limited
community support and resources available.
 Lack of maturity compared to traditional databases: NoSQL databases
are relatively new compared to traditional relational databases and may not
have the same level of maturity and stability.
 NoSQL databases may not be suitable for applications that require strict data
consistency and integrity, as they do not enforce strict data consistency and
integrity checks like traditional relational databases.

8. Build a model of cloud computing architecture and explain its services

Cloud computing architecture for big data typically involves a distributed computing model that leverages the resources of multiple machines or nodes to
store, process, and analyze large amounts of data. The following is a high-level
model of cloud computing architecture for big data:
 Data storage layer: This layer consists of a distributed file system that is
designed to store large amounts of data across multiple nodes in a cluster.
Examples of distributed file systems include Hadoop Distributed File System
(HDFS) and Amazon S3.
 Data processing layer: This layer consists of a cluster of machines that are
used to process large amounts of data in parallel. The processing is typically
done using a distributed processing framework like Apache Spark, Apache
Flink, or Apache Hadoop MapReduce.
 Data analysis layer: This layer consists of tools and technologies that are
used to analyze the data stored in the data storage layer. Examples include
Apache Hive, Apache Pig, and Apache Impala.
 Data visualization layer: This layer consists of tools and technologies that
are used to visualize the results of data analysis. Examples include Tableau,
Power BI, and QlikView.
 Data streaming layer: This layer consists of tools and technologies that are
used to process and analyze real-time data streams. Examples include
Apache Kafka, Apache Flink, and Apache Storm.
 Data security and management layer: This layer consists of tools and
technologies that are used to manage data security and governance.
Examples include Apache Ranger, Apache Atlas, and AWS Identity and
Access Management.

The services provided by this cloud computing architecture for big data include:
 Scalability: The architecture is designed to scale horizontally by adding
more nodes to the cluster. This allows for the processing of large amounts of
data and high-volume requests.
 Fault-tolerance: The architecture is designed to be fault-tolerant, meaning
that it can continue to operate even if some nodes fail.
 Cost-effectiveness: The architecture is cost-effective because it leverages
commodity hardware and open-source software.
 Flexibility: The architecture is flexible because it can handle various types
of data and can be adapted to different data processing and analysis
requirements.
 Real-time data processing: The architecture can handle real-time data
streams and process them in real-time, allowing for real-time analytics and
insights.
 Security and governance: The architecture provides tools and technologies
for managing data security and governance, ensuring that data is protected
and compliant with regulations.

9. Explain the Hadoop distributed file system design and concept


Hadoop Distributed File System (HDFS) is a distributed file system designed to
store large data sets reliably and efficiently on commodity hardware. HDFS is
one of the key components of the Hadoop ecosystem and is based on the Google
File System (GFS).
The design of HDFS is based on the master-slave architecture. There are two
main components in HDFS: the NameNode and the DataNode. The NameNode
is the master node that manages the file system namespace and regulates access
to files by clients. The DataNodes are the slave nodes that store the actual data.
The following are the key concepts of HDFS:
 Blocks: HDFS stores data in blocks, which are typically 64 MB or 128 MB
in size. Blocks are replicated across multiple DataNodes for fault tolerance
and reliability.
 NameNode: The NameNode is the master node that manages the file system
namespace and regulates access to files by clients. It maintains the metadata
information of the file system, including the directory structure and the
location of each block.
 DataNode: The DataNode is the slave node that stores the actual data
blocks. Each DataNode reports to the NameNode about the blocks it stores
and periodically sends heartbeats to confirm that it is still alive.
 Rack Awareness: HDFS is designed to be rack-aware, which means that it
is aware of the physical network topology of the cluster. It ensures that data
blocks are stored across multiple racks for improved fault tolerance and
network efficiency.
 Replication: HDFS replicates data blocks across multiple DataNodes for
fault tolerance and reliability. By default, each block is replicated three
times, but this can be configured as needed.
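
As a small worked example of the block and replication concepts above (with hypothetical numbers), the following Python snippet estimates how many blocks a file occupies and its raw storage footprint:

import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    # Estimate the block count and raw storage used for one file in HDFS.
    blocks = math.ceil(file_size_mb / block_size_mb)  # the last block may be partially filled
    raw_storage_mb = file_size_mb * replication  # every block is stored 'replication' times
    return blocks, raw_storage_mb

# Hypothetical 1 GB (1024 MB) file with the default settings:
blocks, raw_mb = hdfs_footprint(1024)
print(f"{blocks} blocks, about {raw_mb} MB of raw cluster storage")  # 8 blocks, about 3072 MB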

10. Analyse the concepts of data in Hadoop and explain Hadoop Streaming and Pipes
In Hadoop, data is stored in a distributed file system and processed using a
distributed computing framework. Hadoop supports two types of data:
structured data and unstructured data. Structured data is typically stored in a
relational database, while unstructured data is stored in a variety of formats such
as text, images, audio, and video.
Hadoop Streaming and Pipes are two ways to enable the processing of data in
Hadoop using custom scripts or applications written in languages other than
Java.
Hadoop Streaming:
Hadoop Streaming is a utility in Hadoop that allows users to write MapReduce jobs in any programming language that can read data from standard input and
write data to standard output. It provides a way to use non-Java applications or
scripts with Hadoop MapReduce framework.
The process of using Hadoop Streaming involves creating two scripts - one for
the mapper and the other for the reducer. These scripts can be written in any
programming language that can read data from standard input and write data to
standard output. For example, you can write mapper and reducer scripts in
Python, Perl, Ruby, or even shell script.
Once the scripts are created, Hadoop Streaming enables the communication
between the MapReduce jobs and the custom scripts. The MapReduce jobs read
input data from HDFS, pass it to the mapper script via standard input, and then
the mapper script processes the data and writes the output to standard output.
The reducer script reads the output of the mapper script from standard input and
performs further processing before writing the final output to HDFS.
Hadoop Streaming allows users to leverage their existing knowledge and
expertise in other programming languages and integrate them with Hadoop.
This flexibility makes Hadoop Streaming an important tool for data processing
in Hadoop, as it enables developers to use the language of their choice for data
processing without having to write Java code.
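
A minimal sketch of the mapper and reducer scripts described above, written in Python and following the usual streaming convention of tab-separated key/value lines on standard input and output (the file names mapper.py and reducer.py are illustrative):

# mapper.py -- reads raw text lines from stdin and emits one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

The reducer script then consumes the sorted mapper output:

# reducer.py -- reads the sorted "word<TAB>count" lines from stdin and sums the
# counts for each word (Hadoop sorts the mapper output by key before this step).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

These two scripts are then supplied to the Hadoop Streaming utility, which feeds HDFS input to the mapper and writes the reducer output back to HDFS as described above.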
Hadoop Pipes:
Hadoop Pipes is another way to enable the processing of data in Hadoop using
custom C++ or C programs. It provides a C++ API that allows users to write
MapReduce jobs in C++ and execute them in Hadoop.
The Hadoop Pipes API allows developers to write MapReduce programs in C++
without having to write Java code. The Pipes API provides a way to
communicate with the Hadoop MapReduce framework from C++ programs by
sending and receiving data using Unix pipes.
The working of Hadoop Pipes can be described in the following steps:
 Create the mapper and reducer programs: The first step is to create the
mapper and reducer programs in C++ using the Pipes API. The mapper and
reducer programs should be able to read input data from standard input and
write output data to standard output.
 Compile the programs: Once the programs are created, they need to be
compiled into native executables.
 Upload the programs to HDFS: The next step is to upload the mapper and
reducer programs to the Hadoop Distributed File System (HDFS).
 Run the Hadoop Pipes command: Once the programs are uploaded to
HDFS, you can run the Hadoop Pipes command to execute the MapReduce
job. The command specifies the location of the input data, the location of the
output data, and the location of the mapper and reducer programs.
 Map phase: In the map phase, the input data is divided into splits, and each
split is processed by a separate mapper task. The mapper task reads the input
data from standard input and processes it using the mapper program. The
output of the mapper task is written to standard output.
 Shuffle and sort phase: In the shuffle and sort phase, the output of the
mapper tasks is sorted and partitioned by the key, and then sent to the
reducer tasks.
 Reduce phase: In the reduce phase, the output of the shuffle and sort phase
is processed by the reducer tasks. The reducer task reads the input data from
standard input and processes it using the reducer program. The output of the
reducer task is written to standard output.
 Write the output to HDFS: Finally, the output of the MapReduce job is
written to HDFS.
11. Different failures in the Hadoop file system and how to resolve them
Hadoop Distributed File System (HDFS) is designed to handle failures
gracefully in a distributed environment. There are various types of failures that
can occur in HDFS, and each of them requires a different approach to resolve
them. Here are some common failures in HDFS and their possible solutions:
Namenode failure: Namenode is the master node that manages the file system
namespace and controls access to files. If the Namenode fails, the entire HDFS
cluster becomes unavailable. To resolve this issue, you can use Hadoop High Availability (HA) to run a standby Namenode that can take over in case of a failure; note that the Secondary Namenode only checkpoints the namespace metadata and is not a failover node.
Datanode failure: Datanodes are the slave nodes that store data in HDFS. If a
Datanode fails, the blocks it contains become unavailable. To resolve this issue,
HDFS replicates the data across multiple Datanodes, and if a Datanode fails, the
replicas can be used to ensure data availability. HDFS also uses a heartbeat
mechanism to detect failed Datanodes and remove them from the cluster.
Network failure: HDFS relies on a network connection between nodes to
transfer data. If there is a network failure, data transfer can be interrupted, and
the cluster performance can be affected. To resolve this issue, you can configure
HDFS to use redundant network paths or use a network topology that minimizes
the impact of network failures.
Disk failure: HDFS uses commodity hardware, and disk failures are common
in such hardware. If a disk fails, the blocks it contains become unavailable. To
resolve this issue, HDFS replicates the data across multiple Datanodes, and if a
Datanode fails, the replicas can be used to ensure data availability.
Job failure: MapReduce jobs in Hadoop can fail due to various reasons, such as
programming errors, resource constraints, or hardware failures. To resolve this
issue, you can use the Hadoop job tracker to monitor and diagnose job failures.
You can also configure job retry policies or use a tool like Apache Oozie to
manage job workflows and retries.

12. Explain the architecture of the different file-based data structures
File-based data structures are used to store and manage data in computer
systems. There are several file-based data structures, each with its own
architecture and characteristics. Here are some of the most common file-based
data structures:
 Flat file structure: The flat file structure is the simplest type of file-based data structure. In this structure, data is stored in a single file, and there is no relationship between the data elements. Each record in the file contains a set of fields, and the fields are separated by delimiters such as commas or tabs. This structure is used when the data is simple and does not require any complex relationships or queries (see the short sketch at the end of this answer).
 Hierarchical file structure: In the hierarchical file structure, data is
organized in a tree-like structure, where each record is linked to one or more
records above it. This structure is used when the data has a parent-child
relationship. For example, in a company's organizational hierarchy, each
employee record may be linked to a department record, which is linked to a
division record, and so on.
 Network file structure: The network file structure is similar to the
hierarchical structure, but it allows each record to have multiple parent
records. This structure is used when the data has complex relationships that
cannot be represented in a simple hierarchy. The network structure is more
flexible than the hierarchical structure, but it is also more complex to
implement.
 Relational file structure: The relational file structure is the most widely
used file-based data structure. In this structure, data is organized into tables,
where each table represents an entity, and each row represents a record of
that entity. The tables are related to each other through common fields, and
complex queries can be performed on the data using Structured Query
Language (SQL). The relational structure is flexible, efficient, and scalable,
and it can handle large amounts of data.
 Object-oriented file structure: The object-oriented file structure is used to
store and manage data in object-oriented programming languages such as
Java or C++. In this structure, data is stored as objects, which are instances
of classes that define the data structure and behavior. The object-oriented
structure is more flexible than the relational structure, but it is also more
complex to implement.
Each file-based data structure has its own architecture and characteristics, and
the choice of structure depends on the nature and complexity of the data, as well
as the requirements of the application.
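
As a short illustration of the flat file structure described above, here is a small Python sketch that reads a hypothetical comma-delimited employee file using the standard csv module (the file contents are made up and held in a string to keep the example self-contained):

import csv
import io

# Hypothetical flat file: one record per line, fields separated by commas,
# and no relationships between the records.
flat_file = io.StringIO(
    "empid,empname,empcity\n"
    "1,John,Bangalore\n"
    "2,Radhika,Mysore\n"
)

for record in csv.DictReader(flat_file):
    print(record["empid"], record["empname"], record["empcity"])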
13. Illustrate the PIG Latin relational operators and schemas in PIG programming
PIG is a high-level data flow language used for data processing and analysis on
Hadoop. It is designed to be easy to use and can handle both structured and
unstructured data. PIG Latin is the scripting language used in PIG. It is a
procedural language that provides various relational operators to perform data
processing tasks. Here are some of the relational operators in PIG Latin:
1.LOAD: The LOAD operator is used to load data from an external data source
such as HDFS, local file system, or a remote server. It takes a path to the data
source and an optional schema as input.
Example:
mydata = LOAD '/user/mydata/input' USING PigStorage(',') AS
(name:chararray, age:int, salary:double);

2.FILTER: The FILTER operator is used to select a subset of data based on a condition. It takes a logical expression as input and returns a subset of data that
satisfies the condition.
Example:
mydata_filtered = FILTER mydata BY age > 25;

3.GROUP: The GROUP operator is used to group the data based on one or
more columns. It takes one or more columns as input and returns a bag of tuples
for each group.
Example:
mydata_grouped = GROUP mydata BY name;

4.JOIN: The JOIN operator is used to combine two or more data sets based on
a common column. It takes two or more relations as input and returns a relation
that contains the columns of all the input relations.
Example:
mydata_join = JOIN mydata BY name, mydata2 BY name;
5.FOREACH: The FOREACH operator is used to perform a transformation on
each tuple of a relation. It takes an expression as input and applies it to each
tuple in the relation.
Example:
mydata_transformed = FOREACH mydata GENERATE name, salary *
12 AS annual_salary;

PIG Latin Schemas:


In PIG Latin, schemas are used to describe the structure of the data being
processed. A schema specifies the name and data type of each column in the
relation. Schemas are used in the LOAD operator to define the structure of the
input data and in the FOREACH operator to define the structure of the output
data.
Here is the syntax for defining a schema in PIG Latin:
(column_name1: data_type1, column_name2: data_type2, ...,
column_nameN: data_typeN)
The data types that can be used in PIG Latin schemas include:
 int: Integer value
 long: Long integer value
 float: Floating-point value
 double: Double-precision floating-point value
 chararray: Character string
 bytearray: Byte array
 datetime: Date and time value
 boolean: Boolean value
For example, the following schema defines a relation with three columns: name
(a character string), age (an integer value), and salary (a double-precision
floating-point value):
(name: chararray, age: int, salary: double)
Schemas can also be used to define nested structures. For example, the
following schema defines a relation with two columns: name (a character string)
and address (a nested structure with two fields: street and city):
(name: chararray, address: (street: chararray, city: chararray))
In this example, the nested structure is enclosed in parentheses and the fields are
separated by commas.

14. Formulate the steps for the creation of employee tables with empid,
empname, empaddress, empphoneno using HIVE programming. Address
the following queries:
1. Write a query for insertion of values to the table.
2. Write a query for listing all the employees of the ABC company with the name 'Radhika'.
3. Write a query to list the address of employees located in Sadashivanagar.

Here are the steps to create an employee table with empid, empname,
empaddress, empphoneno using HIVE programming:
1. Open the HIVE shell or any other interface like Hue or Beeline and connect
to the Hadoop cluster.
2. Create a database in HIVE where the employee table will be stored using
the following command:
CREATE DATABASE employee_db;

3. Switch to the newly created database using the following command:
USE employee_db;

4. Create the employee table with empid, empname, empaddress, empphoneno columns using the following command:
CREATE TABLE employee (
empid INT,
empname STRING,
empaddress STRING,
empphoneno STRING
);
1. To insert values into the table, use the following command:
INSERT INTO employee VALUES
(1, 'John', '123 Main St', '555-1234'),
(2, 'Jane', '456 Elm St', '555-5678'),
(3, 'Radhika', '789 Oak St', '555-9012');
This will insert three rows into the table with the specified values.

2. To list all employees of the ABC company with the name 'Radhika', use the
following command:
SELECT * FROM employee WHERE empname = 'Radhika';
This will return a list of all employees with the name 'Radhika'.

3. To list the address of employees located in Sadashivanagar, use the following command:
SELECT empaddress FROM employee WHERE empaddress
LIKE '%Sadashivanagar%';
This will return the address of all employees whose address contains the string
'Sadashivanagar'.
