Data Visualization and Hadoop

Data visualization is the practice of translating data into visual representations like graphs, charts, and maps to help humans understand patterns, trends, and insights more easily. The main goal is to make large datasets comprehensible. Effective data visualization relies on clean data sources and choosing visuals like line charts, bar graphs, or maps that clearly communicate relationships and patterns in the data. Data visualization is important for decision making across many fields and plays a key role in analyzing big data.

What is Data Visualization?

Data visualization is a graphical representation of quantitative information and data using visual elements like graphs, charts, and maps.

Data visualization converts large and small data sets into visuals that are easy for humans to understand and process.

Data visualization tools provide accessible ways to understand outliers, patterns, and
trends in the data.

In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.

Data visualizations are common in everyday life, and they most often appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.

Data visualizations are used to discover unknown facts and trends. You can see
visualizations in the form of line charts to display change over time. Bar and column charts
are useful for observing relationships and making comparisons. A pie chart is a great way
to show parts-of-a-whole. And maps are the best way to share geographical data visually.
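
As a quick, minimal illustration of the bar and pie chart types mentioned above, the Matplotlib sketch below uses a small invented dataset (the category names and numbers are placeholders, not real data):

```python
import matplotlib.pyplot as plt

# Made-up sales figures for four product categories (illustrative only)
categories = ['Books', 'Electronics', 'Clothing', 'Toys']
sales = [120, 340, 210, 90]

# Bar chart: useful for comparing values across categories
plt.figure()
plt.bar(categories, sales)
plt.title('Sales by Category (Bar Chart)')
plt.ylabel('Units sold')
plt.show()

# Pie chart: useful for showing parts of a whole
plt.figure()
plt.pie(sales, labels=categories, autopct='%1.1f%%')
plt.title('Share of Sales (Pie Chart)')
plt.show()
```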

Today's data visualization tools go beyond the basic charts and graphs found in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.

What makes Data Visualization Effective?


Effective data visualizations are created where communication, data science, and design collide. Done right, they distill key insights from complicated data sets into something meaningful and intuitive.

American statistician and Yale professor Edward Tufte believes useful data visualizations consist of “complex ideas communicated with clarity, precision, and efficiency.”
To craft an effective data visualization, you need to start with clean data that is well-
sourced and complete. After the data is ready to visualize, you need to pick the right chart.

After you have decided on the chart type, you need to design and customize your visualization to your liking. Simplicity is essential - you don't want to add any elements that distract from the data.

History of Data Visualization


The concept of using pictures to understand data dates back to the 17th century, when maps and graphs were first used for this purpose; the pie chart was then invented in the early 1800s.

Several decades later, one of the most advanced examples of statistical graphics appeared when Charles Minard mapped Napoleon's invasion of Russia. The map represents the size of the army and the path of Napoleon's retreat from Moscow, with that information tied to temperature and time scales for a more in-depth understanding of the event.
Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few years.

Importance of Data Visualization


Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large amounts of complex data is easier than poring over spreadsheets and reports.

Data visualization is an easy and quick way to convey concepts universally. You can experiment with different presentations by making slight adjustments.

Data visualization has some additional benefits, such as:

o Data visualization can identify areas that need improvement or modification.
o Data visualization can clarify which factors influence customer behavior.
o Data visualization helps you to understand which products to place where.
o Data visualization can help predict sales volumes.

Data visualization tools have been essential for democratizing data and analytics and for making data-driven insights available to workers throughout an organization. They are easier to operate than earlier versions of BI software or traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.

Why Use Data Visualization?


1. To make data easier to understand and remember.
2. To discover unknown facts, outliers, and trends.
3. To visualize relationships and patterns quickly.
4. To ask better questions and make better decisions.
5. To analyze competitors.
6. To improve insights.
What is data visualization?
Data visualization is the practice of translating information into a visual context, such
as a map or graph, to make data easier for the human brain to understand and pull
insights from. The main goal of data visualization is to make it easier to identify
patterns, trends and outliers in large data sets. The term is often used interchangeably
with others, including information graphics, information visualization and statistical
graphics.

Data visualization is one of the steps of the data science process, which states that
after data has been collected, processed and modeled, it must be visualized for
conclusions to be made. Data visualization is also an element of the broader data
presentation architecture (DPA) discipline, which aims to identify, locate, manipulate,
format and deliver data in the most efficient way possible.

Data visualization is important for almost every career. It can be used by teachers to
display student test results, by computer scientists exploring advancements in artificial
intelligence (AI) or by executives looking to share information with stakeholders. It
also plays an important role in big data projects. As businesses accumulated massive
collections of data during the early years of the big data trend, they needed a way to
get an overview of their data quickly and easily. Visualization tools were a natural fit.

Visualization is central to advanced analytics for similar reasons. When a data scientist is writing advanced predictive analytics or machine learning (ML) algorithms, it becomes important to visualize the outputs to monitor results and ensure that models are performing as intended. This is because visualizations of complex algorithms are generally easier to interpret than numerical outputs.
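
For example, a data scientist might plot a model's predictions against the actual values to check that it behaves as intended. The sketch below is a minimal illustration using scikit-learn's LinearRegression on synthetic data; the data and the model are placeholders, not a specific production workflow:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy linear relationship (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 2, size=200)

# Fit a simple model and generate predictions
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Predicted vs. actual: points near the diagonal indicate a well-behaved model
plt.scatter(y, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Model predictions vs. actual values')
plt.show()
```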
(Figure: a timeline depicting the history of data visualization)

Why is data visualization important?


Data visualization provides a quick and effective way to communicate information in
a universal manner using visual information. The practice can also help businesses
identify which factors affect customer behavior; pinpoint areas that need to be
improved or need more attention; make data more memorable for stakeholders;
understand when and where to place specific products; and predict sales volumes.

Other benefits of data visualization include the following:


 the ability to absorb information quickly, improve insights and make faster decisions;
 an increased understanding of the next steps that must be taken to improve the organization;
 an improved ability to maintain the audience's interest with information they can understand;
 an easy distribution of information that increases the opportunity to share insights with everyone involved;
 a reduced need for data scientists, since data is more accessible and understandable; and
 an increased ability to act on findings quickly and, therefore, achieve success with greater speed and fewer mistakes.
Data visualization and big data
The increased popularity of big data and data analysis projects has made visualization more important than ever. Companies are increasingly using machine learning to gather massive amounts of data that can be difficult and slow to sort through, comprehend and explain. Visualization offers a means to speed this up and present information to business owners and stakeholders in ways they can understand.

Big data visualization often goes beyond the typical techniques used in normal
visualization, such as pie charts, histograms and corporate graphs. It instead uses more
complex representations, such as heat maps and fever charts. Big data visualization
requires powerful computer systems to collect raw data, process it and turn it into
graphical representations that humans can use to quickly draw insights.

While big data visualization can be beneficial, it can pose several disadvantages to
organizations. They are as follows:

 To get the most out of big data visualization tools, a visualization specialist
must be hired. This specialist must be able to identify the best data sets
and visualization styles to guarantee organizations are optimizing the use of
their data.

 Big data visualization projects often require involvement from IT, as well as
management, since the visualization of big data requires powerful
computer hardware, efficient storage systems and even a move to the
cloud.

 The insights provided by big data visualization will only be as accurate as


the information being visualized. Therefore, it is essential to have people
and processes in place to govern and control the quality of corporate data,
metadata and data sources.
Examples of data visualization
In the early days of visualization, the most common visualization technique was using
a Microsoft Excel spreadsheet to transform the information into a table, bar graph or
pie chart. While these visualization methods are still commonly used, more intricate
techniques are now available, including the following:

 infographics

 bubble clouds

 bullet graphs

 heat maps (see the sketch after this list)

 fever charts

 time series charts
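
Heat maps, one of the techniques listed above, can be sketched in a few lines with Seaborn; the matrix here is random placeholder data standing in for, e.g., correlations or activity counts:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Random 8x8 matrix used purely for illustration
data = np.random.rand(8, 8)

sns.heatmap(data, cmap='viridis')
plt.title('Heat Map (random placeholder data)')
plt.show()
```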

Some other popular techniques are as follows:

Line charts. This is one of the most basic and common techniques used. Line charts
display how variables can change over time.
Area charts. This visualization method is a variation of a line chart; it displays
multiple values in a time series -- or a sequence of data collected at consecutive,
equally spaced points in time.

Scatter plots. This technique displays the relationship between two variables.
A scatter plot takes the form of an x- and y-axis with dots to represent data points.

Treemaps. This method shows hierarchical data in a nested format. The size of the
rectangles used for each category is proportional to its percentage of the
whole. Treemaps are best used when multiple categories are present, and the goal is to
compare different parts of a whole.

Population pyramids. This technique uses a stacked bar graph to display the
complex social narrative of a population. It is best used when trying to display the
distribution of a population.
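
The following Matplotlib sketch illustrates two of the techniques above with invented data: a simple area chart of two series over time, and a population pyramid built from horizontal bars (all of the numbers are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# Area chart: two made-up series over 12 time periods
t = np.arange(12)
series_a = np.random.randint(10, 30, size=12)
series_b = np.random.randint(5, 20, size=12)
plt.stackplot(t, series_a, series_b, labels=['Series A', 'Series B'])
plt.legend(loc='upper left')
plt.title('Area Chart')
plt.xlabel('Time period')
plt.show()

# Population pyramid: male counts plotted as negative values so the bars mirror each other
age_groups = ['0-14', '15-29', '30-44', '45-59', '60+']
male = np.array([30, 40, 35, 25, 15])
female = np.array([28, 42, 36, 27, 20])
y = np.arange(len(age_groups))
plt.barh(y, -male, label='Male')
plt.barh(y, female, label='Female')
plt.yticks(y, age_groups)
plt.xlabel('Population (thousands, illustrative)')
plt.title('Population Pyramid')
plt.legend()
plt.show()
```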

Common data visualization use cases


Common use cases for data visualization include the following:

Sales and marketing. Research from market and consumer data provider Statista
estimated $566 billion was spent on digital advertising in 2022 and that number will
cross the $700 billion mark by 2025. Marketing teams must pay close attention to
their sources of web traffic and how their web properties generate revenue. Data
visualization makes it easy to see how marketing efforts affect traffic trends over time.

Politics. A common use of data visualization in politics is a geographic map that displays the party each state or district voted for.

Healthcare. Healthcare professionals frequently use choropleth maps to visualize important health data. A choropleth map displays divided geographical areas or regions that are assigned a certain color in relation to a numeric variable. Choropleth maps allow professionals to see how a variable, such as the mortality rate of heart disease, changes across specific territories.

Scientists. Scientific visualization, sometimes referred to in shorthand as SciVis, allows scientists and researchers to gain greater insight from their experimental data than ever before.

Finance. Finance professionals must track the performance of their investment decisions when choosing to buy or sell an asset. Candlestick charts are used as trading tools that help finance professionals analyze price movements over time for assets such as securities, derivatives, currencies, stocks, bonds and commodities. By analyzing how prices have changed over time, data analysts and finance professionals can detect trends.

Logistics. Shipping companies can use visualization tools to determine the best global
shipping routes.

Data scientists and researchers. Visualizations built by data scientists are typically
for the scientist's own use, or for presenting the information to a select audience. The
visual representations are built using visualization libraries of the chosen
programming languages and tools. Data scientists and researchers frequently use open
source programming languages -- such as Python -- or proprietary tools designed for
complex data analysis. The data visualization performed by these data scientists and
researchers helps them understand data sets and identify patterns and trends that
would have otherwise gone unnoticed.
https://www.yellowfinbi.com/blog/10-essential-types-of-data-visualization

https://www.geeksforgeeks.org/data-visualization-tools/

https://www.toptal.com/designers/data-visualization/data-visualization-tools

Challenges for Big Data Visualization or Visual Analytics

The main challenge with visual analytics is applying it to big data problems. Generally, technological challenges such as computation, algorithms, databases, storage, and rendering, along with human-perception challenges such as visual representation, data summarization, and abstraction, are the most common. “The top 5 challenges in extreme-scale visual analytics,” as addressed in the publication by SAS analytics, are as follows:

 Speed requirement: In-memory analysis and expanded memory should be utilized to address this challenge.
 Data understanding: There must be proper tools and professionals who understand the data deeply enough to extract proper insights.
 Information quality: One of the biggest challenges is managing large amounts of data while maintaining its quality. The data needs to be understood and presented in the proper format, which increases its overall quality.
 Meaningful output: Using the proper visualization technique for the data being presented is necessary to produce meaningful output.
 Managing outliers: When you cluster data for favorable outcomes, it is inevitable that some outliers will exist. Outliers cannot be neglected because they might reveal valuable information, and they must be treated separately in separate charts.

Big data visualization poses several challenges due to the unique characteristics of large-scale datasets.
Some of the key challenges include:

1. Scalability: Big data often involves massive volumes of data that exceed the capabilities of traditional
visualization tools. Handling and visualizing such large datasets requires specialized techniques and
infrastructure that can scale to accommodate the data size.

2. Data Variety and Complexity: Big data is characterized by diverse data types, including structured,
semi-structured, and unstructured data. Visualizing complex data types, such as text, images, or
geospatial data, requires advanced techniques and specialized tools.

3. Data Preprocessing: Big data often requires preprocessing and transformation before visualization.
This involves data cleaning, filtering, aggregation, and integration from multiple sources. Preprocessing
can be time-consuming and resource-intensive, especially when dealing with large and heterogeneous
datasets.

4. Real-Time Visualization: Big data is often generated and updated in real-time or at high velocities.
Visualizing streaming data or rapidly changing data in real-time poses challenges in terms of data
ingestion, processing, and rendering to provide up-to-date visual representations.

5. Computation and Performance: Processing and analyzing large datasets for visualization can be
computationally intensive. Handling complex queries, aggregations, and calculations on big data requires
powerful computing resources and efficient algorithms to ensure timely and responsive visualizations.

6. Interactivity and Responsiveness: Big data visualizations should maintain interactivity and
responsiveness even when dealing with large datasets. Users need to be able to explore, filter, and
interact with the visualizations without experiencing significant delays or performance issues.
7. Visualization Design: Designing effective visualizations for big data requires careful consideration of
the information density, representation choices, color schemes, and visual encoding techniques.
Balancing complexity, clarity, and interpretability is essential when dealing with large and intricate
datasets.

8. Data Security and Privacy: Big data often contains sensitive and private information. Ensuring data
confidentiality and privacy while visualizing and sharing big data poses challenges, requiring robust
security measures and anonymization techniques.

9. Interpretation and Insight Extraction: Extracting meaningful insights from big data visualizations can be
challenging due to the vastness and complexity of the data. Identifying patterns, trends, and anomalies
in large datasets requires advanced analytics techniques and interactive exploration tools.

Addressing these challenges requires a combination of advanced visualization techniques, scalable infrastructure, efficient algorithms, and domain expertise. It also necessitates a multidisciplinary approach involving data scientists, visualization experts, and domain specialists to effectively visualize and derive insights from big data.

Analytical techniques play a crucial role in extracting insights and patterns from big data during the
visualization process. Here are some commonly used analytical techniques in big data visualization:
1. Aggregation: Aggregation involves summarizing and condensing large volumes of data into meaningful
subsets or higher-level representations. Aggregating data helps in reducing complexity and providing an
overview of patterns or trends in the data.

2. Filtering: Filtering allows users to focus on specific subsets of data based on specified criteria. It helps
in reducing noise, removing outliers, and highlighting relevant patterns or anomalies within the big data.

3. Sampling: Sampling involves selecting a representative subset of the data to analyze or visualize,
especially when dealing with extremely large datasets. Sampling helps in reducing computational
requirements and enables quicker analysis and visualization.

4. Statistical Analysis: Statistical analysis techniques, such as descriptive statistics, hypothesis testing,
regression analysis, and clustering, can be applied to big data to identify relationships, correlations,
distributions, and other statistical properties. These techniques help in uncovering insights and
understanding the underlying patterns within the data.

5. Machine Learning: Machine learning algorithms and techniques are widely used for analyzing big data
and extracting meaningful patterns. Techniques like classification, regression, clustering, and anomaly
detection can be applied to big data to gain insights, make predictions, or identify hidden patterns.

6. Text Mining and Natural Language Processing (NLP): Text mining and NLP techniques are employed to
analyze and visualize large volumes of text data in big data. These techniques involve tasks such as
sentiment analysis, topic modeling, text classification, and entity recognition, enabling the extraction of
insights from textual information.

7. Time-Series Analysis: Time-series analysis techniques are used to analyze data that changes over time.
These techniques help in identifying trends, seasonality, and patterns in time-dependent data, facilitating
the visualization of temporal relationships and behavior within big data.

8. Graph Analysis: Graph analysis techniques are used to analyze complex networks and relationships
present in big data. Graph algorithms, such as centrality measures, community detection, and path
finding, enable the identification of key nodes, clusters, or structures in interconnected data, which can
be visualized for deeper insights.
9. Geo-Spatial Analysis: Geo-spatial analysis techniques involve analyzing data with location information.
Mapping, spatial clustering, hotspot analysis, and spatial interpolation techniques can be applied to big
data with geo-spatial components to visualize and understand spatial patterns and relationships.

10. Deep Learning: Deep learning techniques, particularly neural networks, are used to analyze and
extract insights from big data that involve complex patterns or high-dimensional data. Deep learning
algorithms are capable of learning hierarchical representations and detecting intricate patterns within
big data.
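
As a minimal pandas sketch of the aggregation, filtering, sampling, and time-series ideas above (items 1-3 and 7), the snippet below works on invented transaction data; the column names and distributions are placeholders chosen for illustration:

```python
import numpy as np
import pandas as pd

# Invented transaction data standing in for a much larger dataset
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'timestamp': pd.date_range('2023-01-01', periods=1000, freq='h'),
    'region': rng.choice(['North', 'South', 'East', 'West'], size=1000),
    'amount': rng.gamma(2.0, 50.0, size=1000),
})

# 1. Aggregation: total and average amount per region
summary = df.groupby('region')['amount'].agg(['sum', 'mean'])
print(summary)

# 2. Filtering: keep only large transactions
large = df[df['amount'] > 200]

# 3. Sampling: a 10% random sample for quick exploration
sample = df.sample(frac=0.1, random_state=0)

# 7. Time-series analysis: daily totals and a 7-day rolling mean
daily = df.set_index('timestamp')['amount'].resample('D').sum()
rolling = daily.rolling(window=7).mean()
print(rolling.tail())
```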

These analytical techniques, along with effective visualization methods, enable data scientists and
analysts to gain valuable insights from big data and communicate them visually. It's important to select
the appropriate analytical techniques based on the specific characteristics of the data and the objectives
of the analysis.
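
As one concrete example of the machine learning technique mentioned above (item 5), the sketch below clusters a synthetic two-dimensional dataset with scikit-learn's KMeans and visualizes the assignment as a scatter plot; the data and the choice of three clusters are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic 2-D data with three loose groups (illustrative only)
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.8, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.8, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.8, size=(100, 2)),
])

# Cluster the points and visualize the assignment
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
plt.scatter(points[:, 0], points[:, 1], c=kmeans.labels_, cmap='viridis', s=15)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            color='red', marker='x', s=100, label='Centroids')
plt.title('K-means clustering of synthetic data')
plt.legend()
plt.show()
```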

Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage, and maintenance of data.
Following are the components that collectively form a
Hadoop ecosystem:

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm
libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are
many other components too that are part of the Hadoop
ecosystem.
All these toolkits or components revolve around one thing, i.e., data. That is the beauty of Hadoop: it revolves around data, which makes working with that data easier.
HDFS:

 HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
 Name Node is the prime node which contains metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system. A small usage sketch follows this list.
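
As an illustration of how applications typically interact with HDFS, the sketch below shells out to the standard `hdfs dfs` command-line interface from Python. It assumes a working Hadoop installation with `hdfs` on the PATH and a local file named `sales.csv`; all paths and file names are placeholders:

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and raise an error if it fails."""
    return subprocess.run(['hdfs', 'dfs', *args], check=True)

# Create a directory in HDFS (placeholder path)
hdfs('-mkdir', '-p', '/user/demo/input')

# Copy a local file into HDFS (assumes ./sales.csv exists locally)
hdfs('-put', '-f', 'sales.csv', '/user/demo/input/sales.csv')

# List the directory to confirm the upload
hdfs('-ls', '/user/demo/input')
```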
YARN:

 Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic over to the data and helps developers write applications which transform big data sets into manageable ones.
 MapReduce makes use of two functions, i.e., Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it into groups. Map() generates a key-value-pair-based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples, as illustrated in the sketch below.
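
To make the Map() and Reduce() roles concrete, here is a minimal pure-Python word-count simulation. It is independent of Hadoop itself and only mirrors the key-value flow described above; the input lines are invented:

```python
from collections import defaultdict

lines = ["big data needs big tools", "hadoop processes big data"]

# Map(): emit a (word, 1) key-value pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce(): aggregate the grouped values into a single count per word
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # e.g. {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}
```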
PIG:
Pig was basically developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow,
processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and
optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:

 With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable as it allows real-time
processing and batch processing both. Also, all the
SQL datatypes are supported by Hive thus, making the
query processing easier.
 Similar to the Query Processing frameworks, HIVE too
comes with two components: JDBC Drivers and HIVE
Command Line.
 JDBC, along with ODBC drivers work on establishing
the data storage permissions and connection whereas
HIVE Command line helps in the processing of
queries.
Mahout:

 Mahout brings machine learning capability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environment interaction, or algorithms.
 It provides various libraries or functionalities
such as collaborative filtering, clustering, and
classification which are nothing but concepts of
Machine learning. It allows invoking algorithms as
per our need with the help of its own libraries.
Apache Spark:

 It’s a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
 It consumes in-memory resources and is thus faster than the prior (MapReduce) in terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used alongside each other in most companies.
Apache HBase:

 It’s a NoSQL database which supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides the capabilities of Google’s BigTable and is therefore able to work on big data sets effectively.
 At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some
other components too that carry out a huge task in order to
make Hadoop capable of processing large datasets. They are
as follows:

 Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. In particular, Lucene is Java-based and also provides a spell-check mechanism; Solr is built on top of Lucene.
 Zookeeper: There was a huge issue with the management of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.

MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data
processing over multiple computing nodes. Under the MapReduce
model, the data processing primitives are called mappers and
reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we
write an application in the MapReduce form, scaling the application
to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple
scalability is what has attracted many programmers to use the
MapReduce model.

The Algorithm
 Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
 MapReduce program executes in three stages, namely map
stage, shuffle stage, and reduce stage.
o Map stage − The map or mapper’s job is to
process the input data. Generally the input
data is in the form of file or directory and
is stored in the Hadoop file system (HDFS). The
input file is passed to the mapper function
line by line. The mapper processes the data and
creates several small chunks of data.
o Reduce stage − This stage is the combination
of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes
from the mapper. After processing, it produces
a new set of output, which will be stored in
the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce
tasks to the appropriate servers in the cluster.
 The framework manages all the details of data-passing
such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on
local disks that reduces the network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server. A word-count sketch of this flow follows below.
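
The word-count example below sketches this map-shuffle-reduce flow in plain Python, in the spirit of Hadoop Streaming, where mappers and reducers are ordinary scripts that exchange tab-separated key-value pairs. Sorting the mapped pairs stands in for the shuffle stage, and the sample input lines are invented:

```python
import itertools

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reducer: pairs arrive grouped by key (as after Hadoop's shuffle stage);
    sum the counts for each word."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

# Local stand-in for an HDFS input file
lines = ["to be or not to be", "that is the question"]

# Map, sort by key to emulate the shuffle stage, then reduce
shuffled = sorted(map_phase(lines), key=lambda kv: kv[0])
for word, total in reduce_phase(shuffled):
    print(f"{word}\t{total}")
```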
Why Apache Pig?
By now, we know that Apache Pig is used with Hadoop, and Hadoop is based on the
Java programming language. Now, the question that arises in our minds is ‘Why Pig?’ The
need for Apache Pig came up when many programmers weren’t comfortable with Java
and were facing a lot of struggle working with Hadoop, especially, when MapReduce tasks
had to be performed. Apache Pig came into the Hadoop world as a boon for all such
programmers.

 After the introduction of Pig Latin, programmers are able to work on MapReduce tasks without writing complicated code in Java.
 To reduce the length of code, the multi-query approach is used by Apache Pig, which reduces development time by 16-fold.
 Since Pig Latin is very similar to SQL, it is comparatively easy to learn Apache Pig if we have a little knowledge of SQL.

For supporting data operations such as filters, joins, ordering, etc., Apache Pig provides
several in-built operations.


Features of Pig Hadoop


There are several features of Apache Pig:
1. In-built operators: Apache Pig provides a very good set of operators for
performing several data operations like sort, join, filter, etc.
2. Ease of programming: Since Pig Latin has similarities with SQL, it is very
easy to write a Pig script.
3. Automatic optimization: The tasks in Apache Pig are automatically
optimized. This makes the programmers concentrate only on the semantics
of the language.
4. Handles all kinds of data: Apache Pig can analyze both structured and
unstructured data and store the results in HDFS.


Apache Pig Architecture


The main reason why programmers have started using Hadoop Pig is that it converts the
scripts into a series of MapReduce tasks making their job easy. Below is the architecture
of Pig Hadoop:
Pig Hadoop framework has four main components:

1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by
the parser. The parser is responsible for checking the syntax of the script,
along with other miscellaneous checks. Parser gives an output in the form of
a Directed Acyclic Graph (DAG) that contains Pig Latin statements, together
with other logical operators represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for
DAG is passed to a logical optimizer. The optimizer is responsible for carrying
out the logical optimizations.
3. Compiler: The role of the compiler comes in when the output from the
optimizer is received. The compiler compiles the logical plan sent by the
optimizer. The logical plan is then converted into a series of MapReduce
tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs,
these jobs are sent to Hadoop in a properly sorted order, and these jobs are
executed on Hadoop for yielding the desired result.
Hadoop Hive
Apache Hive is an open-source data warehouse system that has been built on top
of Hadoop. You can use Hive for analyzing and querying large datasets that are stored in
Hadoop files. Processing structured and semi-structured data can be done by using Hive.

Let’s look at the agenda for this section first:

 What is Hive in Hadoop?


 Why do we need Hadoop Hive?
 Hive Architecture
 Differences Between Hive and Pig
 Features of Apache Hive
 Limitations of Apache Hive

Now, let’s start with this Apache Hive tutorial.

What is Hive in Hadoop?


Don’t you think writing MapReduce jobs is tedious work? Well, with Hadoop Hive, you
can just go ahead and submit SQL queries and perform MapReduce jobs. So, if you are
comfortable with SQL, then Hive is the right tool for you as you will be able to work on
MapReduce tasks efficiently. Similar to Pig, Hive has its own language, called HiveQL
(HQL). It is similar to SQL. HQL translates SQL-like queries into MapReduce jobs, like
what Pig Latin does. The best part is that you don’t need to learn Java to work with
Hadoop Hive.

Hadoop Hive runs on our system and converts SQL queries into a set of jobs for execution on a Hadoop cluster. Basically, Hadoop Hive organizes data into tables, providing a method for attaching structure to data stored in HDFS.
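
For instance, an application can submit a HiveQL query from Python. The sketch below assumes the third-party PyHive client is installed and that a HiveServer2 instance is reachable on the default port; the host name, database, and table are placeholders:

```python
from pyhive import hive  # assumes the PyHive package is installed

# Connect to a HiveServer2 endpoint (placeholder host and database)
conn = hive.connect(host='hive-server.example.com', port=10000, database='default')
cursor = conn.cursor()

# A HiveQL query; Hive translates it into distributed jobs under the hood
cursor.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales
    GROUP BY region
""")

for region, orders in cursor.fetchall():
    print(region, orders)

cursor.close()
conn.close()
```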

Facebook uses Hive to address its various requirements, like running thousands of tasks
on the cluster, along with thousands of users for a huge variety of applications. Since
Facebook has a huge amount of raw data, i.e., 2 PB, Hadoop Hive is used for storing this
voluminous data. It regularly loads around 15 TB of data on a daily basis. Now, many
companies, such as IBM, Amazon, Yahoo!, and many others, are also using and
developing Hive.

Why do we need Hadoop Hive?


Let’s now talk about the need for Hive. To understand that, let’s see what Facebook did
with its big data.

Basically, there were a lot of challenges faced by Facebook before they had finally
implemented Apache Hive. One of those challenges was the size of data that has been
generated on a daily basis. Traditional databases, such as RDBMS and SQL, weren’t able
to handle the pressure of such a huge amount of data. Because of this, Facebook was
looking for better options. It started using MapReduce in the beginning to overcome this problem. But it was very difficult to work with MapReduce as it required mandatory programming expertise in Java. Later on, Facebook realized that Hadoop Hive had the potential to actually overcome the challenges it faced.
Apache Hive helps developers avoid writing complex MapReduce tasks. Hadoop Hive is extremely fast, scalable, and extensible. Since Apache Hive is comparable to SQL, it is easy for SQL developers as well to implement Hive queries.

Additionally, Hive is capable of decreasing the complexity of MapReduce by providing an interface wherein a user can submit various SQL queries. So, technically, you don't need to learn Java for working with Apache Hive.


Hive Architecture
Let’s now talk about the Hadoop Hive architecture and the major working force behind
Apache Hive.

The components of Apache Hive are as follows:

o Driver: The driver acts as a controller receiving HiveQL statements. It begins the execution of statements by creating sessions. It is responsible for monitoring the life cycle and the progress of the execution. Along with that, it also saves the important metadata that has been generated during the execution of the HiveQL statement.
o Metastore: A metastore stores metadata of all tables. Since Hive
includes partition metadata, it helps the driver in tracking the progress
of various datasets that have been distributed across a cluster, hence
keeping track of data. In a metastore, the data is saved in an RDBMS
format.
o Compiler: The compiler performs the compilation of a HiveQL query. It
transforms the query into an execution plan that contains tasks.
o Optimizer: An optimizer performs many transformations on the
execution plan for providing an optimized DAG. An optimizer aggregates
several transformations together like converting a pipeline of joins to a
single join. It can also split the tasks for providing better performance.
o Executor: After the processes of compilation and optimization are
completed, the execution of the task is done by the executor. It is
responsible for pipelining the tasks.


Differences Between Hive and Pig


 Hive is used for data analysis; Pig is used for data and programs.
 Hive is used for processing structured data; Pig is used for semi-structured data.
 Hive has HiveQL; Pig has Pig Latin.
 Hive is used for creating reports; Pig is used for programming.
 Hive works on the server side; Pig works on the client side.
 Hive does not support Avro; Pig supports Avro.


Features of Apache Hive


Let’s now look at the features of Apache Hive:

 Hive provides easy data summarization, analysis, and query support.
 Hive supports external tables, making it feasible to process data without having to store it in HDFS.
 Since Hadoop has a low-level interface, Hive fits in here properly.
 Hive supports the partitioning of data at the data level for better performance.
 There is a rule-based optimizer present in Hive responsible for optimizing logical
plans.
 Hadoop can process external data using Hive.
This code snippet demonstrates how to create a line plot using Matplotlib:

1. Importing libraries:

```python
import matplotlib.pyplot as plt
import numpy as np
```
In this step, we import the necessary libraries for data visualization. Matplotlib is imported as `plt`, and
NumPy is imported as `np`.

2. Generating data:

```python
x = np.arange(0, 10, 0.1)
y = np.sin(x)
```

Here, we generate the x-values using `np.arange()` function, which creates an array of numbers from 0 to
10 (exclusive) with a step size of 0.1. Then, we compute the corresponding y-values using the `np.sin()`
function, which calculates the sine of each element in the x-array.

3. Creating the line plot:

```python
plt.plot(x, y)
```

This line of code creates the line plot using the `plot()` function of Matplotlib. We pass in the x-array as
the first argument and the y-array as the second argument. Matplotlib automatically connects the points
with lines to create the plot.

4. Adding labels and title:


```python
plt.title('Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
```

These lines of code add a title to the plot using `plt.title()`, and label the x-axis and y-axis using
`plt.xlabel()` and `plt.ylabel()` respectively.

5. Displaying the plot:

```python
plt.show()
```

Finally, this line of code displays the line plot on the screen.

When you run this code, it will generate a line plot of the sine function over the range of x-values from 0
to 10. The x-axis represents the values of x, the y-axis represents the corresponding values of sin(x), and
the plot will have the title "Line Plot".

import matplotlib.pyplot as plt
import numpy as np

# Line Plot
x = np.arange(0, 10, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
A second script demonstrates several more chart types with Matplotlib and Seaborn:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Line Plot with Matplotlib
x = np.arange(0, 10, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.title('Line Plot (Matplotlib)')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

# Scatter Plot with Seaborn (keyword arguments are required in recent Seaborn versions)
x = np.random.rand(100)
y = np.random.rand(100)
sns.scatterplot(x=x, y=y)
plt.title('Scatter Plot (Seaborn)')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

# Histogram with Matplotlib
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.title('Histogram (Matplotlib)')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

# Density Plot with Seaborn (histogram with a kernel density estimate overlaid)
data = np.random.randn(1000)
sns.histplot(data, kde=True)
plt.title('Density Plot (Seaborn)')
plt.xlabel('Values')
plt.ylabel('Density')
plt.show()

# Box Plots with Matplotlib and Seaborn, side by side
data = np.random.randn(1000)
plt.subplot(1, 2, 1)
plt.boxplot(data)
plt.title('Box Plot (Matplotlib)')
plt.ylabel('Values')
plt.subplot(1, 2, 2)
sns.boxplot(y=data)
plt.title('Box Plot (Seaborn)')
plt.ylabel('Values')
plt.tight_layout()
plt.show()
