Data Analytics Unit-3 Notes
Big Data technologies are software utilities designed for analyzing, processing, and extracting
information from large, mostly unstructured data that cannot be handled with traditional data
processing software.
Companies need Big Data processing technologies to analyze the massive amounts of real-time
data they collect. They use Big Data technologies to come up with predictions that reduce the risk of failure.
1. Apache Hadoop
Apache Hadoop is one of the most widely used Big Data tools. It is an open-source software
framework developed by the Apache Software Foundation for storing and processing Big Data.
Hadoop stores and processes data in a distributed computing environment across a cluster of
commodity hardware.
Hadoop is an inexpensive, fault-tolerant, and highly available framework that can process data
of any size and format. It is written in Java, and the current stable version is Hadoop 3.1.3.
Hadoop HDFS is regarded as one of the most reliable storage systems available.
Companies using Hadoop include Facebook, LinkedIn, IBM, MapR, Intel, Microsoft, and many
more.
2. Apache Spark
Apache Spark is another popular open-source Big Data tool, designed to speed up Hadoop's big
data processing. The main objective of the Apache Spark project was to keep the advantages of
MapReduce's distributed, scalable, fault-tolerant processing model while making it more
efficient and easier to use.
It provides in-memory computing capabilities to deliver speed. Spark supports both real-time
and batch processing and provides high-level APIs in Java, Scala, Python, and R.
Features of Apache Spark:
Spark can run applications in Hadoop clusters up to 100 times faster in memory and ten times
faster on disk.
Apache Spark can work with different data stores (such as OpenStack, HDFS, and Cassandra),
which gives it more flexibility than Hadoop.
Spark includes the MLlib library, which offers a rich set of machine learning algorithms for
clustering, collaborative filtering, regression, classification, and more.
Apache Spark can run on Hadoop, Kubernetes, Apache Mesos, standalone, or in the cloud.
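To make the in-memory processing concrete, here is a minimal PySpark sketch of a word count; the HDFS input path and application name are illustrative assumptions, not part of the notes above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

# Start (or reuse) a Spark session; on a cluster this script would be run via spark-submit.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file (hypothetical path) and split each line into words.
lines = spark.read.text("hdfs:///data/input.txt")
words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))

# Count occurrences of each word; the shuffle and aggregation happen in memory where possible.
counts = words.groupBy("word").count()
counts.show()

spark.stop()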
3. MongoDB
MongoDB is an open-source NoSQL, document-oriented database developed by MongoDB Inc. in 2009.
It is written in C, C++, and JavaScript and has an easy setup process.
MongoDB is one of the most popular databases for Big Data. It facilitates the management of
unstructured or semi-structured data, or data that changes frequently.
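As a minimal sketch of how document-oriented storage handles flexible, changing data, the following uses the PyMongo driver; the connection URI, database, and collection names are assumptions made for illustration.

from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017")
db = client["analytics"]

# Documents in the same collection need not share a schema.
db.events.insert_one({"user": "alice", "action": "login"})
db.events.insert_one({"user": "bob", "action": "purchase", "amount": 42.5, "tags": ["web"]})

# Query by field value; each result comes back as a Python dict.
for doc in db.events.find({"action": "login"}):
    print(doc)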
Features of MongoDB:
4. Apache Cassandra
Apache Cassandra is an open-source, decentralized, distributed NoSQL (Not Only SQL) database
that provides high availability and scalability without compromising performance.
It is one of the biggest Big Data tools and can accommodate structured as well as unstructured
data. It employs the Cassandra Query Language (CQL) to interact with the database.
Cassandra is a strong fit for mission-critical data because of its linear scalability and
fault tolerance on inexpensive hardware or cloud infrastructure.
Companies like Instagram, Netflix, GitHub, GoDaddy, eBay, Hulu, etc. use Cassandra.
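A brief sketch of interacting with Cassandra through CQL using the Python cassandra-driver; the contact point, keyspace, and table names are illustrative assumptions.

from uuid import uuid4
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumed contact point).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# CQL looks much like SQL: create a keyspace and table, then insert and read rows.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id uuid PRIMARY KEY, name text)")
session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (uuid4(), "alice"))

for row in session.execute("SELECT id, name FROM users"):
    print(row.id, row.name)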
5. Apache Kafka
Apache Kafka is an open-source distributed streaming platform developed by the Apache Software
Foundation. It is a publish-subscribe based, fault-tolerant messaging system and a robust queue
capable of handling large volumes of data.
It allows messages to be passed from one point to another. Kafka is used for building real-time
streaming data pipelines and real-time streaming applications. Kafka is written in Java and Scala.
Apache Kafka integrates very well with Spark and Storm for real-time streaming data analysis.
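A minimal publish-subscribe sketch using the kafka-python client; the broker address and topic name are assumptions made for the example.

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to a topic (broker address assumed).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Consumer: subscribe to the same topic and read messages from the beginning.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # stop after one message in this sketch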
Splunk
Splunk captures, correlates, and indexes data in a searchable repository and generates
insightful graphs, reports, alerts, and dashboards.
Features:
Companies like JPMorgan Chase, Wells Fargo, Verizon, Domino’s, Porsche, etc. use Splunk.
6. QlikView
QlikView is one of the fastest-evolving BI and data visualization tools and is well suited to
transforming raw data into knowledge. QlikView allows users to generate business insights by
exploring how pieces of data are associated with each other and which data is not related.
QlikView brings a whole new level of analysis, value, and insight to existing data stores with
a simple, clean, and straightforward user interface. It enables users to conduct direct or
indirect searches on all data anywhere in the application.
When the user clicks on a data-point, no queries are fired. All the other fields filter themselves
based on user selection. It promotes unrestricted analysis of data, thus helping users to make
accurate decisions.
Features of QlikView:
It provides in-memory storage, which makes the data collection, integration, and analysis
process very fast.
It works on associative data modeling.
The QlikView software automatically derives the relationships between data.
It provides powerful and global data discovery.
It supports social data discovery and mobile data discovery.
7. Qlik Sense
It is a data analysis and data visualization tool. Qlik Sense operates with an associative QIX
engine that enables users to associate and link data from different sources and perform dynamic
searching and selections.
It is used as a data analytics platform by technical as well as non-technical users. For anyone
looking for a tool to present and analyze data in the best possible way, Qlik Sense is a strong
choice.
With a drag and drop interface, the user can easily create an analytical report that is easy to
understand and is in the form of a story. The client team can share applications and reports on a
centralized hub, export the data stories to enhance the business, and share secure data models.
8. Tableau
Tableau is a powerful data visualization and software solution in the Business Intelligence
and analytics industry.
It is well suited to transforming raw data into an easily understandable format without
requiring any technical skill or coding knowledge.
Tableau allows users to work on live datasets, turning raw data into valuable insights and
enhancing the decision-making process.
It offers a rapid data analysis process, which results in visualizations in the form of
interactive dashboards and worksheets. It works in synchronization with other Big Data tools.
Features of Tableau:
In Tableau, with simple drag and drop, one can make visualizations in the form of a Bar chart,
Pie chart, Histogram, Treemap, Boxplot, Gantt chart, Bullet chart, and many more.
Tableau supports a wide range of data sources, from on-premise files (text files, CSV, Excel,
and other spreadsheets), relational databases, non-relational databases, big data platforms,
and data warehouses to cloud data.
It is highly robust and secure.
It allows the sharing of data in the form of visualizations, dashboards, sheets, etc. in real-time.
9. Apache Storm
It is a distributed real-time computational framework. Apache Storm is written in Clojure and
Java. With Apache Storm, we can reliably process our unbounded streams of data. It is a simple
tool and can be used with any programming language.
We can use Apache Storm in real-time analytics, continuous computation, online machine
learning, ETL, and more.
Features of Storm:
10. Apache Hive
Apache Hive is built on top of Hadoop and enables developers to process data stored in
Hadoop HDFS without writing complex MapReduce jobs. Users can interact with Hive through
a CLI (the Beeline shell).
Features of Hive:
Hive provides support for client applications written in different languages.
It reduces the overhead of writing complex MapReduce jobs.
HQL syntax is similar to SQL, so anyone who is familiar with SQL can easily write Hive
queries.
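Besides the Beeline shell, Hive can also be queried from client programs. The sketch below uses the PyHive library against HiveServer2; the host, port, user, and table name are assumptions made for illustration.

from pyhive import hive

# Connect to HiveServer2 (hostname, port, and user are assumptions).
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

# HQL looks like SQL; behind the scenes Hive plans the query as jobs over data in HDFS.
cursor.execute("SELECT page, COUNT(*) AS views FROM web_logs GROUP BY page LIMIT 10")
for page, views in cursor.fetchall():
    print(page, views)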
11. Apache Pig
It is an alternative approach to making MapReduce jobs easier. Pig was developed by Yahoo to
make writing Hadoop MapReduce programs easier. Pig enables developers to use Pig Latin, a
scripting language designed for the Pig framework that runs on the Pig runtime.
Pig Latin consists of SQL-like commands that the compiler converts into MapReduce programs in
the background. Pig translates Pig Latin into MapReduce jobs that perform large-scale data
processing on YARN.
Features of Pig:
Pig allows users to create their own functions for special-purpose processing.
It is best suited for solving complex use cases.
It handles data of all kinds.
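To illustrate how Pig Latin reads, here is a hedged word-count sketch: the script is written out from Python and handed to the pig command line, which compiles it into MapReduce jobs. All paths are hypothetical.

import subprocess
import textwrap

# A small Pig Latin word-count script (input/output paths are assumptions).
script = textwrap.dedent("""
    lines  = LOAD '/user/demo/input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grp    = GROUP words BY word;
    counts = FOREACH grp GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '/user/demo/word_counts';
""")

with open("wordcount.pig", "w") as f:
    f.write(script)

# Pig compiles the script into MapReduce jobs and runs them on the cluster.
subprocess.run(["pig", "wordcount.pig"], check=True)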
12. Presto
Presto is an open-source query engine (SQL-on-Hadoop) developed by Facebook for running
interactive analytic queries against petabytes of data. It allows querying the data where it lives,
including Cassandra, Hive, proprietary data stores, or relational databases.
A single Presto query can merge data from different sources and perform analytics across the
entire organization. It does not depend on Hadoop MapReduce techniques and can retrieve data
very quickly, in sub-seconds to minutes.
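A short sketch of an interactive Presto query from Python, here via the PyHive presto client; the coordinator address, catalog, schema, and table are assumptions.

from pyhive import presto

# Connect to a Presto coordinator (host, port, catalog, and schema are assumed).
conn = presto.connect(host="localhost", port=8080, catalog="hive", schema="default")
cursor = conn.cursor()

# The same SQL could join tables living in different connectors (Hive, Cassandra, MySQL, ...).
cursor.execute("SELECT region, COUNT(*) FROM orders GROUP BY region")
print(cursor.fetchall())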
Features of Presto:
13. Apache Flink
Apache Flink is written in Java and Scala. It can run in all common cluster environments and
performs its computations in memory.
Features of Flink:
14. Apache Sqoop
Sqoop is used when we want to import data into HDFS from a relational database or export data
from HDFS to a relational database, as sketched below.
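A hedged example of a typical Sqoop import invoked from Python; the JDBC URL, credentials file, table, and target directory are all hypothetical.

import subprocess

# Import a MySQL table into HDFS using four parallel mappers (all values are assumptions).
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pw",
    "--table", "orders",
    "--target-dir", "/user/demo/orders",
    "--num-mappers", "4",
], check=True)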
Features of Sqoop:
15. Rapidminer
RapidMiner is one of the most widely used tools for implementing data science. In 2017, it was
ranked #1 in the Gartner Magic Quadrant for Data Science Platforms. It is a powerful data
mining tool for building predictive models.
RapidMiner is an all-in-one tool that covers data preparation, machine learning, and deep
learning.
Features of RapidMiner:
It offers a single platform for data processing, building machine learning models, and
deployment.
It supports the integration of the Hadoop framework with its in-built RapidMiner Radoop.
RapidMiner can generate predictive models through automated modeling.
16. KNIME
The KNIME (Konstanz Information Miner) is an open-source data analytics platform for data
analysis and business intelligence. It is written in Java.
It allows users to visually create data flows, selectively execute analysis steps, and inspect
the results, interactive views, and models. KNIME is a good alternative to SAS.
Features of KNIME:
17. Elasticsearch
Elasticsearch takes unstructured data from different sources and stores it in a sophisticated
format that is highly optimized for language-based searches, as in the brief example below.
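A minimal indexing-and-search sketch using the official Python client (the elasticsearch-py 8.x API is assumed); the index name and documents are illustrative.

from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch node (assumed address).
es = Elasticsearch("http://localhost:9200")

# Index a couple of documents; the field mapping is inferred automatically here.
es.index(index="articles", id=1, document={"title": "Big Data tools", "body": "Hadoop and Spark"})
es.index(index="articles", id=2, document={"title": "Search engines", "body": "Full-text search"})

# Full-text search on the body field.
response = es.search(index="articles", query={"match": {"body": "Hadoop"}})
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"])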
Features of Elasticsearch:
Google File System (GFS)
Basics:
Google developers routinely deal with large files that can be difficult to manipulate using a
traditional computer file system. The size of the files drove many of the decisions
programmers had to make for the GFS's design. Another big concern was scalability, which
refers to the ease of adding capacity to the system. A system is scalable if it's easy to increase
the system's capacity. The system's performance shouldn't suffer as it grows. Google requires
a very large network of computers to handle all of its files, so scalability is a top concern.
Because the network is so huge, monitoring and maintaining it is a challenging task. While
developing the GFS, programmers decided to automate as much of the administrative duties
required to keep the system running as possible. This is a key principle of autonomic
computing, a concept in which computers are able to diagnose problems and solve them in
real time without the need for human intervention. The challenge for the GFS team was to not
only create an automatic monitoring system, but also to design it so that it could work across a
huge network of computers. The key to the team's designs was the concept of simplification.
Files on the GFS tend to be very large, usually in the multi-gigabyte (GB) range. Accessing
and manipulating files that large would take up a lot of the network's bandwidth. Bandwidth
is the capacity of a system to move data from one location to another. The GFS addresses this
problem by breaking files up into chunks of 64 megabytes (MB) each. Every chunk receives a
unique 64-bit identification number called a chunk handle.
By requiring all the file chunks to be the same size, the GFS simplifies resource
allocation. It's easy to see which computers in the system are near capacity and which are
underused. It's also easy to port chunks from one resource to another to balance the workload
across the system.
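The following is a conceptual Python sketch, not Google's implementation, showing the idea of splitting a file into fixed 64 MB chunks and giving each chunk a 64-bit handle.

import secrets

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size described above


def split_into_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (chunk_handle, data) pairs for a file; the handles stand in for GFS chunk handles."""
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            handle = secrets.randbits(64)  # stand-in for a globally unique 64-bit identifier
            yield handle, data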
Architecture:
"Client" refers to any entity that makes a file request. Requests can range from retrieving and
manipulating existing files to creating new files on the system. Clients can be other computers
or computer applications. You can think of clients as the customers of the GFS.
Role of Master:
The master server acts as the coordinator for the cluster. The master's duties include
maintaining an operation log, which keeps track of the activities of the master's cluster. The
operation log helps keep service interruptions to a minimum -- if the master server crashes, a
replacement server that has monitored the operation log can take its place.
The master server also keeps track of metadata, which is the information that describes
chunks. The metadata tells the master server to which files the chunks belong and where they
fit within the overall file.
Upon startup, the master polls all the chunkservers in its cluster. The chunkservers respond by
telling the master server the contents of their inventories. From that moment on, the master
server keeps track of the location of chunks within the cluster.
There's only one active master server per cluster at any one time (though each cluster has
multiple copies of the master server in case of a hardware failure). That might sound like a
good recipe for a bottleneck -- after all, if there's only one machine coordinating a cluster of
thousands of computers, wouldn't that cause data traffic jams? The GFS gets around this
sticky situation by keeping the messages the master server sends and receives very small.
The master server doesn't actually handle file data at all. It leaves that up to the
chunkservers.
Role of ChunkServers:
Chunkservers are the workhorses of the GFS. They're responsible for storing the 64-MB file
chunks. The chunkservers don't send chunks to the master server. Instead, they send requested
chunks directly to the client.
The GFS copies every chunk multiple times and stores it on different chunkservers. Each copy
is called a replica. By default, the GFS makes three replicas per chunk, but users can change
the setting and make more or fewer replicas if desired.
A read request is simple -- the client sends a request to the master server to find out where the
client can find a particular file on the system. The server responds with the location for the
primary replica of the respective chunk. The primary replica holds a lease from the master
server for the chunk in question. If no replica currently holds a lease, the master server
designates a chunk as the primary. It does this by comparing the IP address of the client to the
addresses of the chunkservers containing the replicas. The master server chooses the
chunkserver closest to the client. That chunkserver's chunk becomes the primary. The client
then contacts the appropriate chunkserver directly, which sends the replica to the client.
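A purely conceptual sketch of that read path, with hypothetical lookup and read calls (GFS exposes no public API): the client asks the master only for metadata and then fetches the chunk data directly from a chunkserver.

# Conceptual sketch of a GFS-style read; master and chunkservers are hypothetical objects.
def read_chunk(master, chunkservers, filename, chunk_index):
    # 1. Ask the master where the chunk lives; only small metadata messages cross this link.
    chunk_handle, chunkserver_addr = master.lookup(filename, chunk_index)
    # 2. Fetch the chunk data directly from the chosen chunkserver;
    #    the master never handles file data itself.
    return chunkservers[chunkserver_addr].read(chunk_handle)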
Write requests are a little more complicated. The client still sends a request to the master
server, which replies with the location of the primary and secondary replicas. The client stores
this information in a memory cache. That way, if the client needs to refer to the same replica
later on, it can bypass the master server. If the primary replica becomes unavailable or the
replica changes, the client will have to consult the master server again before contacting a
chunkserver.
The client then sends the write data to all the replicas, starting with the closest replica and
ending with the furthest one. It doesn't matter if the closest replica is a primary or
secondary. Google compares this data delivery method to a pipeline.
Once the replicas receive the data, the primary replica begins to assign consecutive serial
numbers to each change to the file. Changes are called mutations. The serial numbers instruct
the replicas on how to order each mutation. The primary then applies the mutations in
sequential order to its own data. Then it sends a write request to the secondary replicas, which
follow the same application process. If everything works as it should, all the replicas across
the cluster incorporate the new data. The secondary replicas report back to the primary once
the application process is over.
At that time, the primary replica reports back to the client. If the process was successful, it
ends here. If not, the primary replica tells the client what happened. For example, if one
secondary replica failed to update with a particular mutation, the primary replica notifies the
client and retries the mutation application several more times. If the secondary replica doesn't
update correctly, the primary replica tells the secondary replica to start over from the
beginning of the write process. If that doesn't work, the master server will identify the
affected replica as garbage.
What is Hadoop?
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is
very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing).
It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter,
LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was
developed on the basis of it. It states that files are broken into blocks and stored across
the nodes of the distributed architecture.
2. Yarn: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
3. Map Reduce: A framework that lets applications process large volumes of data in parallel
across the cluster by dividing the work into map and reduce tasks.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
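As a small illustration of HDFS in use, the commands below (wrapped in Python for consistency with the other sketches) copy a local file into HDFS and read it back; the file and directory paths are assumptions.

import subprocess

# Create a directory in HDFS, upload a local file, list the directory, and print the file.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "local.csv", "/user/demo/local.csv"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-cat", "/user/demo/local.csv"], check=True)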
HADOOP ARCHITECTURE
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the
NameNode and Job Tracker, whereas each slave node runs a DataNode and Task Tracker.
The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It follows
a master/slave architecture, in which a single NameNode performs the role of master and
multiple DataNodes perform the role of slaves.
Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS is
developed in Java, so any machine that supports Java can easily run the NameNode and DataNode
software.
NameNode
o It manages the file system namespace by executing operations such as opening, renaming,
and closing files.
o It simplifies the architecture of the system.
Data Node
o It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data,
using the NameNode to locate it; the NameNode responds with the relevant metadata.
Task Tracker
o It receives tasks and code from the Job Tracker and applies that code to the data. This
process can also be called a Mapper; a minimal sketch follows.
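To show what a Mapper (and its matching Reducer) looks like in practice, here is a hedged Hadoop Streaming word-count sketch in a single Python file; the streaming jar path and HDFS directories are assumptions.

#!/usr/bin/env python3
# wordcount.py -- used as both mapper and reducer with Hadoop Streaming, e.g.:
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -files wordcount.py \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -input /user/demo/input -output /user/demo/output
import sys


def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so all counts for a given word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()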
Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing
the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in
hours.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is
really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes down or
some other network failure happens, Hadoop uses another copy of the data. Normally, data is
replicated three times, but the replication factor is configurable.
How Hadoop is useful for Big data Analytics?
Example
Let us take an analogy of a restaurant to understand the problems associated with Big Data and
how Hadoop solved that problem.
Bob is a businessman who has opened a small restaurant. Initially, in his restaurant, he used to
receive two orders per hour and he had one chef with one food shelf in his restaurant which
was sufficient enough to handle all the orders.
Now let us compare the restaurant example with the traditional scenario, where data was
generated at a steady rate and traditional systems like an RDBMS were capable enough to handle
it, just like Bob’s chef. Here, you can relate the data storage to the restaurant’s food shelf
and the traditional processing unit to the chef.
Fig: Traditional Scenario
After a few months, Bob thought of expanding his business and therefore started taking
online orders and added a few more cuisines to the restaurant’s menu in order to engage a larger
audience. Because of this transition, the rate at which they were receiving orders rose to an
alarming figure of 10 orders per hour, and it became quite difficult for a single cook to cope
with the situation. Aware of the problem in processing the orders, Bob started thinking about
a solution.
Similarly, in the Big Data scenario, data started getting generated at an alarming rate because
of the introduction of various data growth drivers such as social media, smartphones, etc.
Now, the traditional system, just like the cook in Bob’s restaurant, was not efficient enough to
handle this sudden change. Thus, there was a need for a different kind of solution strategy to
cope with this problem.
After a lot of research, Bob came up with a solution where he hired 4 more chefs to tackle the
huge rate of orders being received. Everything was going quite well, but this solution led to one
more problem. Since four chefs were sharing the same food shelf, the food shelf itself was
becoming the bottleneck of the whole process. Hence, the solution was not as efficient as Bob
had thought.
In other words, the performance of the whole system is driven by the performance of the central
storage unit. Therefore, the moment our central storage goes down, the whole system gets
compromised. Hence, again there was a need to resolve this single point of failure.
Bob came up with another, more efficient solution: he divided the chefs into two hierarchies,
junior chefs and a head chef, and assigned each junior chef a food shelf. Let us assume that the
dish is meat sauce. According to Bob’s plan, one junior chef will prepare the meat and the
other junior chef will prepare the sauce. They will then transfer both the meat and the sauce to
the head chef, who will prepare the meat sauce by combining the two ingredients, and the
combined dish will be delivered as the final order.
Now, you must have got an idea why Big Data is a problem statement and how Hadoop solves
it. As we just discussed above, there were three major challenges with Big Data:
Storing huge data in a traditional system is not possible. The reason is obvious: the storage
will be limited to one system, while the data is increasing at a tremendous rate.
Now we know that storing is a problem, but let me tell you it is just one part of the problem. The
data is not only huge, but it is also present in various formats i.e. unstructured, semi-structured
and structured. So, you need to make sure that you have a system to store different types of
data that is generated from various sources.
Finally, let’s focus on the third problem, which is processing speed.
The time taken to process this huge amount of data is quite high, because the data to be
processed is too large.
Features of Hadoop
Reliability
When machines are working as a single unit, if one of the machines fails, another machine will
take over the responsibility and work in a reliable and fault-tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.
Economical
Hadoop uses commodity hardware (like your PC, laptop). For example, in a small Hadoop
cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB
hard disk and Xeon processors.
But if I had used hardware-based RAID with Oracle for the same purpose, I would end up spending
at least five times more. So, the cost of ownership of a Hadoop-based project is minimized. It
is easier to maintain a Hadoop environment, and it is economical as well. Also, Hadoop is
open-source software, so there is no licensing cost.
Scalability
Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if
you are installing Hadoop on a cloud, you don’t need to worry about scalability, because you
can procure more hardware and expand your setup within minutes whenever required.
Flexibility
Hadoop is very flexible in terms of its ability to deal with all kinds of data. Hadoop can store
and process it all, whether the data is structured, semi-structured, or unstructured.