Data Analytics Unit-3 Notes
Big Data technologies are software utilities designed for analyzing, processing, and extracting
information from large, mostly unstructured data that cannot be handled with traditional data
processing software.
Companies need Big Data processing technologies to analyze the massive amounts of real-time
data they collect. They use Big Data technologies to come up with predictions that reduce the risk of failure.
1. Apache Hadoop
Apache Hadoop is one of the most widely used Big Data tools. It is an open-source software
framework developed by the Apache Software Foundation for storing and processing Big Data.
Hadoop stores and processes data in a distributed computing environment across a cluster of
commodity hardware.
Hadoop is an inexpensive, fault-tolerant, and highly available framework that can process data
of any size and format. It is written in Java, and the current stable version is Hadoop 3.1.3.
Hadoop HDFS is regarded as one of the most reliable storage systems available.
Companies using Hadoop include Facebook, LinkedIn, IBM, MapR, Intel, Microsoft, and many
more.
2. Apache Spark
Apache Spark is another popular open-source Big Data tool, designed to speed up Hadoop's big
data processing. The main objective of the Apache Spark project was to keep the advantages of
MapReduce's distributed, scalable, fault-tolerant processing model while making it more
efficient and easier to use.
It provides in-memory computing capabilities to deliver speed. Spark supports both real-time
and batch processing and provides high-level APIs in Java, Scala, Python, and R.
Features of Apache Spark:
Spark can run applications in Hadoop clusters up to 100 times faster in memory and ten times
faster on disk.
Apache Spark can work with different data stores (such as OpenStack, HDFS, and Cassandra),
which gives it more flexibility than Hadoop.
Spark includes the MLlib library, which offers a rich set of machine learning algorithms for
clustering, collaborative filtering, regression, classification, and more.
Apache Spark can run on Hadoop, Kubernetes, Apache Mesos, standalone, or in the cloud.
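To make the in-memory processing concrete, here is a minimal PySpark sketch of a word count; the HDFS input path and application name are illustrative assumptions, not part of the notes above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

# Start (or reuse) a Spark session; on a cluster this script would be run via spark-submit.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file (hypothetical path) and split each line into words.
lines = spark.read.text("hdfs:///data/input.txt")
words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))

# Count occurrences of each word; the shuffle and aggregation happen in memory where possible.
counts = words.groupBy("word").count()
counts.show()

spark.stop()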
3. MongoDB
MongoDB is an open-source NoSQL, document-oriented database developed by MongoDB Inc. in 2009.
It is written in C, C++, and JavaScript and has an easy setup process.
MongoDB is one of the most popular databases for Big Data. It facilitates the management of
unstructured or semi-structured data, or data that changes frequently.
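As a minimal sketch of how document-oriented storage handles flexible, changing data, the following uses the PyMongo driver; the connection URI, database, and collection names are assumptions made for illustration.

from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017")
db = client["analytics"]

# Documents in the same collection need not share a schema.
db.events.insert_one({"user": "alice", "action": "login"})
db.events.insert_one({"user": "bob", "action": "purchase", "amount": 42.5, "tags": ["web"]})

# Query by field value; each result comes back as a Python dict.
for doc in db.events.find({"action": "login"}):
    print(doc)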
Features of MongoDB:
4. Apache Cassandra
Apache Cassandra is an open-source, decentralized, distributed NoSQL (Not Only SQL) database
that provides high availability and scalability without compromising performance.
It is one of the biggest Big Data tools and can accommodate structured as well as unstructured
data. It employs the Cassandra Query Language (CQL) to interact with the database.
Cassandra is a strong fit for mission-critical data because of its linear scalability and
fault tolerance on inexpensive hardware or cloud infrastructure.
Companies like Instagram, Netflix, GitHub, GoDaddy, eBay, Hulu, etc. use Cassandra.
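A brief sketch of interacting with Cassandra through CQL using the Python cassandra-driver; the contact point, keyspace, and table names are illustrative assumptions.

from uuid import uuid4
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumed contact point).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# CQL looks much like SQL: create a keyspace and table, then insert and read rows.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id uuid PRIMARY KEY, name text)")
session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (uuid4(), "alice"))

for row in session.execute("SELECT id, name FROM users"):
    print(row.id, row.name)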
5. Apache Kafka
Apache Kafka is an open-source distributed streaming platform developed by the Apache Software
Foundation. It is a publish-subscribe based, fault-tolerant messaging system and a robust queue
capable of handling large volumes of data.
It allows messages to be passed from one point to another. Kafka is used for building real-time
streaming data pipelines and real-time streaming applications. Kafka is written in Java and Scala.
Apache Kafka integrates very well with Spark and Storm for real-time streaming data analysis.
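A minimal publish-subscribe sketch using the kafka-python client; the broker address and topic name are assumptions made for the example.

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to a topic (broker address assumed).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Consumer: subscribe to the same topic and read messages from the beginning.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # stop after one message in this sketch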
Splunk
Splunk captures, correlates, and indexes data in a searchable repository and generates
insightful graphs, reports, alerts, and dashboards.
Features:
Companies like JPMorgan Chase, Wells Fargo, Verizon, Domino’s, Porsche, etc. use Splunk.
6. QlikView
QlikView is one of the fastest-evolving BI and data visualization tools and is well suited to
transforming raw data into knowledge. QlikView allows users to generate business insights by
exploring how pieces of data are associated with each other and which data is not related.
QlikView brings a whole new level of analysis, value, and insight to existing data stores with
a simple, clean, and straightforward user interface. It enables users to conduct direct or
indirect searches on all data anywhere in the application.
When the user clicks on a data-point, no queries are fired. All the other fields filter themselves
based on user selection. It promotes unrestricted analysis of data, thus helping users to make
accurate decisions.
Features of QlikView:
It provides in-memory storage, which makes the data collection, integration, and analysis
process very fast.
It works on associative data modeling.
The QlikView software automatically derives the relationships between data.
It provides powerful and global data discovery.
It supports social data discovery and mobile data discovery.
7. Qlik Sense
It is a data analysis and data visualization tool. Qlik Sense operates with an associative QIX
engine that enables users to associate and link data from different sources and perform dynamic
searching and selections.
It is used as a data analytics platform by technical as well as non-technical users. For anyone
looking for a tool to present and analyze data in the best possible way, Qlik Sense is a strong
choice.
With a drag and drop interface, the user can easily create an analytical report that is easy to
understand and is in the form of a story. The client team can share applications and reports on a
centralized hub, export the data stories to enhance the business, and share secure data models.
8. Tableau
Tableau is a powerful data visualization and software solution in the Business Intelligence
and analytics industry.
It is well suited to transforming raw data into an easily understandable format without
requiring any technical skill or coding knowledge.
Tableau allows users to work on live datasets, turning raw data into valuable insights and
enhancing the decision-making process.
It offers a rapid data analysis process, which results in visualizations in the form of
interactive dashboards and worksheets. It works in synchronization with other Big Data tools.
Features of Tableau:
In Tableau, with simple drag and drop, one can make visualizations in the form of a Bar chart,
Pie chart, Histogram, Treemap, Boxplot, Gantt chart, Bullet chart, and many more.
Tableau supports a wide range of data sources, from on-premise files (text files, CSV, Excel,
and other spreadsheets), relational databases, non-relational databases, big data platforms,
and data warehouses to cloud data.
It is highly robust and secure.
It allows the sharing of data in the form of visualizations, dashboards, sheets, etc. in real-time.
9. Apache Storm
It is a distributed real-time computational framework. Apache Storm is written in Clojure and
Java. With Apache Storm, we can reliably process our unbounded streams of data. It is a simple
tool and can be used with any programming language.
We can use Apache Storm in real-time analytics, continuous computation, online machine
learning, ETL, and more.
Features of Storm:
10. Apache Hive
Apache Hive is built on top of Hadoop and enables developers to process data stored in
Hadoop HDFS without writing complex MapReduce jobs. Users can interact with Hive through
a CLI (the Beeline shell).
Features of Hive:
Hive provides support for client applications written in different languages.
It reduces the overhead of writing complex MapReduce jobs.
HQL syntax is similar to SQL, so anyone who is familiar with SQL can easily write Hive
queries.
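Besides the Beeline shell, Hive can also be queried from client programs. The sketch below uses the PyHive library against HiveServer2; the host, port, user, and table name are assumptions made for illustration.

from pyhive import hive

# Connect to HiveServer2 (hostname, port, and user are assumptions).
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

# HQL looks like SQL; behind the scenes Hive plans the query as jobs over data in HDFS.
cursor.execute("SELECT page, COUNT(*) AS views FROM web_logs GROUP BY page LIMIT 10")
for page, views in cursor.fetchall():
    print(page, views)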
11. Apache Pig
It is an alternative approach to making MapReduce jobs easier. Pig was developed by Yahoo to
make writing Hadoop MapReduce programs easier. Pig enables developers to use Pig Latin, a
scripting language designed for the Pig framework that runs on the Pig runtime.
Pig Latin consists of SQL-like commands that the compiler converts into MapReduce programs in
the background. Pig translates Pig Latin into MapReduce jobs that perform large-scale data
processing on YARN.
Features of Pig:
Pig allows users to create their own functions for special-purpose processing.
It is best suited for solving complex use cases.
It handles data of all kinds.
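To illustrate how Pig Latin reads, here is a hedged word-count sketch: the script is written out from Python and handed to the pig command line, which compiles it into MapReduce jobs. All paths are hypothetical.

import subprocess
import textwrap

# A small Pig Latin word-count script (input/output paths are assumptions).
script = textwrap.dedent("""
    lines  = LOAD '/user/demo/input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grp    = GROUP words BY word;
    counts = FOREACH grp GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '/user/demo/word_counts';
""")

with open("wordcount.pig", "w") as f:
    f.write(script)

# Pig compiles the script into MapReduce jobs and runs them on the cluster.
subprocess.run(["pig", "wordcount.pig"], check=True)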
12. Presto
Presto is an open-source query engine (SQL-on-Hadoop) developed by Facebook for running
interactive analytic queries against petabytes of data. It allows querying the data where it lives,
including Cassandra, Hive, proprietary data stores, or relational databases.
A single Presto query can merge data from different sources and perform analytics across the
entire organization. It does not depend on Hadoop MapReduce techniques and can retrieve data
very quickly, in sub-seconds to minutes.
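A short sketch of an interactive Presto query from Python, here via the PyHive presto client; the coordinator address, catalog, schema, and table are assumptions.

from pyhive import presto

# Connect to a Presto coordinator (host, port, catalog, and schema are assumed).
conn = presto.connect(host="localhost", port=8080, catalog="hive", schema="default")
cursor = conn.cursor()

# The same SQL could join tables living in different connectors (Hive, Cassandra, MySQL, ...).
cursor.execute("SELECT region, COUNT(*) FROM orders GROUP BY region")
print(cursor.fetchall())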
Features of Presto:
13. Apache Flink
Apache Flink is written in Java and Scala. It can run in all common cluster environments and
performs its computations in memory.
Features of Flink:
14. Apache Sqoop
Sqoop is used when we want to import data into HDFS from a relational database or export data
from HDFS to a relational database, as sketched below.
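A hedged example of a typical Sqoop import invoked from Python; the JDBC URL, credentials file, table, and target directory are all hypothetical.

import subprocess

# Import a MySQL table into HDFS using four parallel mappers (all values are assumptions).
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pw",
    "--table", "orders",
    "--target-dir", "/user/demo/orders",
    "--num-mappers", "4",
], check=True)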
Features of Sqoop:
15. Rapidminer
RapidMiner is one of the most widely used tools for implementing data science. In 2017, it was
ranked #1 in the Gartner Magic Quadrant for Data Science Platforms. It is a powerful data
mining tool for building predictive models.
RapidMiner is an all-in-one tool that covers data preparation, machine learning, and deep
learning.
Features of RapidMiner:
It offers a single platform for data processing, building machine learning models, and
deployment.
It supports the integration of the Hadoop framework with its in-built RapidMiner Radoop.
RapidMiner can generate predictive models through automated modeling.
16. KNIME
The KNIME (Konstanz Information Miner) is an open-source data analytics platform for data
analysis and business intelligence. It is written in Java.
It allows users to visually create data flows, selectively execute analysis steps, and inspect
the results, interactive views, and models. KNIME is a good alternative to SAS.
Features of KNIME:
17. Elasticsearch
Elasticsearch takes unstructured data from different sources and stores it in a sophisticated
format that is highly optimized for language-based searches, as in the brief example below.
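A minimal indexing-and-search sketch using the official Python client (the elasticsearch-py 8.x API is assumed); the index name and documents are illustrative.

from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch node (assumed address).
es = Elasticsearch("http://localhost:9200")

# Index a couple of documents; the field mapping is inferred automatically here.
es.index(index="articles", id=1, document={"title": "Big Data tools", "body": "Hadoop and Spark"})
es.index(index="articles", id=2, document={"title": "Search engines", "body": "Full-text search"})

# Full-text search on the body field.
response = es.search(index="articles", query={"match": {"body": "Hadoop"}})
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"])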
Features of Elasticsearch:
Google File System (GFS)
Basics:
Google developers routinely deal with large files that can be difficult to manipulate using a
traditional computer file system. The size of the files drove many of the decisions
programmers had to make for the GFS's design. Another big concern was scalability, which
refers to the ease of adding capacity to the system. A system is scalable if it's easy to increase
the system's capacity. The system's performance shouldn't suffer as it grows. Google requires
a very large network of computers to handle all of its files, so scalability is a top concern.
Because the network is so huge, monitoring and maintaining it is a challenging task. While
developing the GFS, programmers decided to automate as much of the administrative duties
required to keep the system running as possible. This is a key principle of autonomic
computing, a concept in which computers are able to diagnose problems and solve them in
real time without the need for human intervention. The challenge for the GFS team was to not
only create an automatic monitoring system, but also to design it so that it could work across a
huge network of computers. The key to the team's designs was the concept of simplification.
Files on the GFS tend to be very large, usually in the multi-gigabyte (GB) range. Accessing
and manipulating files that large would take up a lot of the network's bandwidth. Bandwidth
is the capacity of a system to move data from one location to another. The GFS addresses this
problem by breaking files up into chunks of 64 megabytes (MB) each. Every chunk receives a
unique 64-bit identification number called a chunk handle.
By requiring all the file chunks to be the same size, the GFS simplifies resource
allocation. It's easy to see which computers in the system are near capacity and which are
underused. It's also easy to port chunks from one resource to another to balance the workload
across the system.
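The following is a conceptual Python sketch, not Google's implementation, showing the idea of splitting a file into fixed 64 MB chunks and giving each chunk a 64-bit handle.

import secrets

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size described above


def split_into_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (chunk_handle, data) pairs for a file; the handles stand in for GFS chunk handles."""
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            handle = secrets.randbits(64)  # stand-in for a globally unique 64-bit identifier
            yield handle, data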
Architecture:
"Client" refers to any entity that makes a file request. Requests can range from retrieving and
manipulating existing files to creating new files on the system. Clients can be other computers
or computer applications. You can think of clients as the customers of the GFS.
Role of Master:
The master server acts as the coordinator for the cluster. The master's duties include
maintaining an operation log, which keeps track of the activities of the master's cluster. The
operation log helps keep service interruptions to a minimum -- if the master server crashes, a
replacement server that has monitored the operation log can take its place.
The master server also keeps track of metadata, which is the information that describes
chunks. The metadata tells the master server to which files the chunks belong and where they
fit within the overall file.
Upon startup, the master polls all the chunkservers in its cluster. The chunkservers respond by
telling the master server the contents of their inventories. From that moment on, the master
server keeps track of the location of chunks within the cluster.
There's only one active master server per cluster at any one time (though each cluster has
multiple copies of the master server in case of a hardware failure). That might sound like a
good recipe for a bottleneck -- after all, if there's only one machine coordinating a cluster of
thousands of computers, wouldn't that cause data traffic jams? The GFS gets around this
sticky situation by keeping the messages the master server sends and receives very small.
The master server doesn't actually handle file data at all. It leaves that up to the
chunkservers.
Role of ChunkServers:
Chunkservers are the workhorses of the GFS. They're responsible for storing the 64-MB file
chunks. The chunkservers don't send chunks to the master server. Instead, they send requested
chunks directly to the client.
The GFS copies every chunk multiple times and stores it on different chunkservers. Each copy
is called a replica. By default, the GFS makes three replicas per chunk, but users can change
the setting and make more or fewer replicas if desired.
A read request is simple -- the client sends a request to the master server to find out where the
client can find a particular file on the system. The server responds with the location for the
primary replica of the respective chunk. The primary replica holds a lease from the master
server for the chunk in question. If no replica currently holds a lease, the master server
designates a chunk as the primary. It does this by comparing the IP address of the client to the
addresses of the chunkservers containing the replicas. The master server chooses the
chunkserver closest to the client. That chunkserver's chunk becomes the primary. The client
then contacts the appropriate chunkserver directly, which sends the replica to the client.
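A purely conceptual sketch of that read path, with hypothetical lookup and read calls (GFS exposes no public API): the client asks the master only for metadata and then fetches the chunk data directly from a chunkserver.

# Conceptual sketch of a GFS-style read; master and chunkservers are hypothetical objects.
def read_chunk(master, chunkservers, filename, chunk_index):
    # 1. Ask the master where the chunk lives; only small metadata messages cross this link.
    chunk_handle, chunkserver_addr = master.lookup(filename, chunk_index)
    # 2. Fetch the chunk data directly from the chosen chunkserver;
    #    the master never handles file data itself.
    return chunkservers[chunkserver_addr].read(chunk_handle)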
Write requests are a little more complicated. The client still sends a request to the master
server, which replies with the location of the primary and secondary replicas. The client stores
this information in a memory cache. That way, if the client needs to refer to the same replica
later on, it can bypass the master server. If the primary replica becomes unavailable or the
replica changes, the client will have to consult the master server again before contacting a
chunkserver.
The client then sends the write data to all the replicas, starting with the closest replica and
ending with the furthest one. It doesn't matter if the closest replica is a primary or
secondary. Google compares this data delivery method to a pipeline.
Once the replicas receive the data, the primary replica begins to assign consecutive serial
numbers to each change to the file. Changes are called mutations. The serial numbers instruct
the replicas on how to order each mutation. The primary then applies the mutations in
sequential order to its own data. Then it sends a write request to the secondary replicas, which
follow the same application process. If everything works as it should, all the replicas across
the cluster incorporate the new data. The secondary replicas report back to the primary once
the application process is over.
At that time, the primary replica reports back to the client. If the process was successful, it
ends here. If not, the primary replica tells the client what happened. For example, if one
secondary replica failed to update with a particular mutation, the primary replica notifies the
client and retries the mutation application several more times. If the secondary replica doesn't
update correctly, the primary replica tells the secondary replica to start over from the
beginning of the write process. If that doesn't work, the master server will identify the
affected replica as garbage.
What is Hadoop?
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is
very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing).
It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter,
LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was
developed on the basis of it. It states that files are broken into blocks and stored across
the nodes of the distributed architecture.
2. Yarn: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
3. Map Reduce: A framework that lets applications process large volumes of data in parallel
across the cluster by dividing the work into map and reduce tasks.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
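As a small illustration of HDFS in use, the commands below (wrapped in Python for consistency with the other sketches) copy a local file into HDFS and read it back; the file and directory paths are assumptions.

import subprocess

# Create a directory in HDFS, upload a local file, list the directory, and print the file.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "local.csv", "/user/demo/local.csv"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-cat", "/user/demo/local.csv"], check=True)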
HADOOP ARCHITECTURE
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the
NameNode and Job Tracker, whereas each slave node runs a DataNode and Task Tracker.
The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It follows
a master/slave architecture, in which a single NameNode performs the role of master and
multiple DataNodes perform the role of slaves.
Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS is
developed in Java, so any machine that supports Java can easily run the NameNode and DataNode
software.
NameNode
o It manages the file system namespace by executing operations such as opening, renaming,
and closing files.
o It simplifies the architecture of the system.
Data Node
o It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data,
using the NameNode to locate it; the NameNode responds with the relevant metadata.
Task Tracker
o It receives tasks and code from the Job Tracker and applies that code to the data. This
process can also be called a Mapper; a minimal sketch follows.
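To show what a Mapper (and its matching Reducer) looks like in practice, here is a hedged Hadoop Streaming word-count sketch in a single Python file; the streaming jar path and HDFS directories are assumptions.

#!/usr/bin/env python3
# wordcount.py -- used as both mapper and reducer with Hadoop Streaming, e.g.:
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -files wordcount.py \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -input /user/demo/input -output /user/demo/output
import sys


def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so all counts for a given word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()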
Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing
the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in
hours.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is
really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes down or
some other network failure happens, Hadoop uses another copy of the data. Normally, data is
replicated three times, but the replication factor is configurable.
How Hadoop is useful for Big data Analytics?
Example
Let us take an analogy of a restaurant to understand the problems associated with Big Data and
how Hadoop solved that problem.
Bob is a businessman who has opened a small restaurant. Initially, in his restaurant, he used to
receive two orders per hour and he had one chef with one food shelf in his restaurant which
was sufficient enough to handle all the orders.
Now let us compare the restaurant example with the traditional scenario, where data was
generated at a steady rate and traditional systems like an RDBMS were capable enough to handle
it, just like Bob’s chef. Here, you can relate the data storage to the restaurant’s food shelf
and the traditional processing unit to the chef.
Fig: Traditional Scenario
After a few months, Bob thought of expanding his business and therefore started taking
online orders and added a few more cuisines to the restaurant’s menu in order to engage a larger
audience. Because of this transition, the rate at which they were receiving orders rose to an
alarming figure of 10 orders per hour, and it became quite difficult for a single cook to cope
with the situation. Aware of the problem in processing the orders, Bob started thinking about
a solution.
Similarly, in the Big Data scenario, data started getting generated at an alarming rate because
of the introduction of various data growth drivers such as social media, smartphones, etc.
Now, the traditional system, just like the cook in Bob’s restaurant, was not efficient enough to
handle this sudden change. Thus, there was a need for a different kind of solution strategy to
cope with this problem.
After a lot of research, Bob came up with a solution where he hired 4 more chefs to tackle the
huge rate of orders being received. Everything was going quite well, but this solution led to one
more problem. Since four chefs were sharing the same food shelf, the food shelf itself was
becoming the bottleneck of the whole process. Hence, the solution was not as efficient as Bob
had thought.
In other words, the performance of the whole system is driven by the performance of the central
storage unit. Therefore, the moment our central storage goes down, the whole system gets
compromised. Hence, again there was a need to resolve this single point of failure.
Bob came up with another, more efficient solution: he divided the chefs into two hierarchies,
junior chefs and a head chef, and assigned each junior chef a food shelf. Let us assume that the
dish is meat sauce. According to Bob’s plan, one junior chef will prepare the meat and the
other junior chef will prepare the sauce. They will then transfer both the meat and the sauce to
the head chef, who will prepare the meat sauce by combining the two ingredients, and the
combined dish will be delivered as the final order.
Now, you must have got an idea why Big Data is a problem statement and how Hadoop solves
it. As we just discussed above, there were three major challenges with Big Data:
Storing huge data in a traditional system is not possible. The reason is obvious: the storage
will be limited to one system, while the data is increasing at a tremendous rate.
Now we know that storing is a problem, but let me tell you it is just one part of the problem. The
data is not only huge, but it is also present in various formats i.e. unstructured, semi-structured
and structured. So, you need to make sure that you have a system to store different types of
data that is generated from various sources.
Finally, let’s focus on the third problem, which is processing speed.
The time taken to process this huge amount of data is quite high, because the data to be
processed is too large.
Features of Hadoop
Reliability
When machines are working as a single unit, if one of the machines fails, another machine will
take over the responsibility and work in a reliable and fault-tolerant fashion. Hadoop
infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.
Economical
Hadoop uses commodity hardware (like your PC, laptop). For example, in a small Hadoop
cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB
hard disk and Xeon processors.
But if I had used hardware-based RAID with Oracle for the same purpose, I would end up spending
at least five times more. So, the cost of ownership of a Hadoop-based project is minimized. It
is easier to maintain a Hadoop environment, and it is economical as well. Also, Hadoop is
open-source software, so there is no licensing cost.
Scalability
Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if
you are installing Hadoop on a cloud, you don’t need to worry about scalability, because you
can procure more hardware and expand your setup within minutes whenever required.
Flexibility
Hadoop is very flexible in terms of its ability to deal with all kinds of data. Hadoop can store
and process it all, whether the data is structured, semi-structured, or unstructured.