
BIG DATA TOOLS

AND TECHNIQUES
21CSE222T
Why Big Data
Big Data Analytics
Some key pieces of data from the report (activity generated every minute):

• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4,300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
Characteristics of Big data
5 V’s of Big Data

• Volume refers to the huge amount of data.
• Velocity refers to how fast the data is generated.
• Variety refers to the different forms of the data.
• Veracity refers to how accurate and trustworthy the data is.
• Value refers to the usefulness of the data for the intended purpose.
Big Data – Definition

• Big data is defined as collections of datasets


whose volume, velocity or variety is so large
that it is difficult to store, manage, process and
analyze the data using traditional databases
and data processing tools.
Types of Big Data

• Structured: highly organized data
  Examples: transactions and financial records
• Semi-structured: a combination of structured and unstructured elements
  Example: streaming data from sensors
• Unstructured: data with no predefined organization
  Examples: text, documents and multimedia files
Domains using Big Data Analytics
3Vs of Big Data
Structured Data
• Structured data is data that depends on a data model and resides in fixed fields within a record.
• Structured data is highly organized and formatted.
• Examples of structured data include names, dates, addresses, credit card numbers, stock information, geolocation, and more.
• Typical tools: SQL, PL/SQL, SQLite, Oracle.
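To make this concrete, here is a minimal sketch in Python using the built-in sqlite3 module; the table and column names are invented for illustration. Because every record fits the same fixed fields, standard SQL queries work directly on the data.

  # Minimal sketch: structured data lives in fixed fields of a fixed schema.
  import sqlite3

  conn = sqlite3.connect(":memory:")  # throwaway in-memory database
  conn.execute("CREATE TABLE transactions (id INTEGER, customer TEXT, amount REAL, txn_date TEXT)")
  conn.executemany(
      "INSERT INTO transactions VALUES (?, ?, ?, ?)",
      [(1, "Alice", 250.0, "2024-01-05"), (2, "Bob", 99.5, "2024-01-06")],
  )
  # Every row has the same fields, so standard SQL queries work out of the box.
  for row in conn.execute("SELECT customer, SUM(amount) FROM transactions GROUP BY customer"):
      print(row)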
Unstructured data
• Unstructured data is data that does not fit easily into a data model because the content is context-specific or varying.
• It is highly complex, qualitative, and unorganized.
• This data can be numerical, alphabetical, boolean, or a mix of all of these.
What is big data analytics

• Big data analytics examines large and varied types of data to uncover hidden patterns, correlations and other insights.
Stages of Big data analytics

How does big data analytics work?

• Data Collection: Data is the heart of big data analytics. It is the process of collecting data from various sources, which can include customer reviews, surveys, sensors, social media, etc. The main goal of data collection is to gather as much relevant data as possible: the more data, the richer the insights.
• Data Cleaning (Data Preprocessing): Once we have the data, it often needs
some cleaning. This process involves identifying and dealing with missing values,
correcting errors, and removing duplicates.
• Data Storage
Once the data is collected, it is stored in a repository, such as Hadoop Distributed
File System (HDFS), Amazon S3, or Google Cloud Storage.
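As an illustration of the storage step, the sketch below uses the boto3 library to push a collected file to Amazon S3; the bucket name, object key, and file name are placeholders, and the same idea applies to HDFS or Google Cloud Storage with their own clients.

  # Sketch: store a collected data file in Amazon S3 (bucket and keys are hypothetical).
  import boto3

  s3 = boto3.client("s3")                       # uses credentials configured on the machine
  s3.upload_file(
      Filename="reviews_2024_01.csv",           # local file produced by the collection step
      Bucket="example-bigdata-raw",             # placeholder bucket name
      Key="raw/reviews/reviews_2024_01.csv",    # placeholder object key
  )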
How does big data analytics work?
• Data Processing: Next, we need to process the data. This involves different steps
like organizing, structuring, and formatting it in a way that makes it appropriate
for analysis.
• Data Analysis: Data analysis is performed using various statistical, mathematical,
and machine learning techniques to extract valuable insights from the processed
data.
• Data Visualization: Data analysis results are often presented in the form of
visualizations – charts, graphs, and interactive dashboards.
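A compact sketch of the processing, analysis, and visualization steps using pandas and matplotlib; the input file and column names are assumptions for illustration.

  # Sketch: process -> analyse -> visualise a small slice of collected data.
  import pandas as pd
  import matplotlib.pyplot as plt

  df = pd.read_csv("reviews_2024_01.csv")             # hypothetical file from the storage step
  df = df.dropna().drop_duplicates()                   # processing: clean and structure
  avg_rating = df.groupby("product")["rating"].mean()  # analysis: simple statistic per product
  avg_rating.plot(kind="bar", title="Average rating by product")  # visualisation
  plt.tight_layout()
  plt.show()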
How does big data analytics work?
• Data Governance
Data Governance (data cataloging, data quality management, and data lineage
tracking) ensures the accuracy, completeness, and security of the data.
• Data Management
Big data platforms provide management capabilities that enable organizations to
back up, recover, and archive data.
Types of Big Data Analytics

1.Descriptive Analytics: This type helps us understand past events. In social media,
it shows performance metrics, like the number of likes on a post.
2.Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the reasons behind past events. In healthcare, it identifies the causes of high patient re-admissions.
3.Predictive Analytics: Predictive analytics forecasts future events based on past data. Weather forecasting, for example, predicts tomorrow’s weather by analyzing historical patterns (see the sketch after this list).
4.Prescriptive Analytics: This type not only predicts outcomes but also suggests
actions to optimize them. In e-commerce, it might recommend the best price for a
product to maximize profits.
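A toy sketch of predictive analytics using scikit-learn; the demand figures are invented and the model is deliberately simple.

  # Sketch: predict tomorrow's demand from a short invented history using scikit-learn.
  import numpy as np
  from sklearn.linear_model import LinearRegression

  days = np.arange(1, 8).reshape(-1, 1)                    # days 1..7
  demand = np.array([120, 132, 128, 150, 160, 158, 170])   # invented historical values

  model = LinearRegression().fit(days, demand)
  print("Forecast for day 8:", model.predict([[8]])[0])

In practice a forecasting model would use far more history and more features; the point is only that past data drives the prediction.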
Types of Big Data Analytics
5.Real-time Analytics: Real-time analytics processes data instantly. In stock
trading, it helps traders make quick decisions based on current market conditions.
6.Spatial Analytics: Spatial analytics focuses on location data. For city planning, it
optimizes traffic flow using data from sensors and cameras to reduce congestion.
7.Text Analytics: Text analytics extracts insights from unstructured text data. In the
hotel industry, it can analyze guest reviews to improve services and guest
satisfaction.
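A minimal text-analytics sketch in pure Python: counting frequent words in a few invented guest reviews hints at the topics guests mention most.

  # Sketch: extract simple insight from unstructured text by counting frequent words.
  from collections import Counter
  import re

  reviews = [
      "Great breakfast and friendly staff",
      "Room was clean but breakfast started late",
      "Staff were friendly, breakfast excellent",
  ]
  stopwords = {"and", "but", "was", "were", "the", "a"}
  words = [w for r in reviews for w in re.findall(r"[a-z]+", r.lower()) if w not in stopwords]
  print(Counter(words).most_common(3))   # e.g. breakfast, staff, friendly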
Tools used in Big Data Analytics
Big Data Analytics Technologies
and Tools – Bigdata platform
Apache Hadoop
• Hadoop is an open-source programming framework and server software. It is employed to store and analyze large data sets very fast with the assistance of thousands of commodity servers in a clustered computing environment[6]. Because data is replicated across servers, the failure of one server or disk does not lead to any loss of data.
• It is commonly deployed on Ubuntu and other variants of Linux.
Apache Spark
• Apache Spark is an open-source data-processing engine designed to deliver the computational speed and scalability required for streaming data, graph data, machine learning, and artificial intelligence applications. Spark processes and keeps the data in memory without writing to or reading from disk, which is why it is much faster than disk-based alternatives such as Hadoop MapReduce.
• The solution can be deployed on-premises, in addition to being available on cloud platforms such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure.
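A brief PySpark sketch of this in-memory style of processing, assuming a local Spark installation; the input file and column name are placeholders.

  # Sketch: in-memory processing with PySpark on a local machine.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("spark-demo").master("local[*]").getOrCreate()
  df = spark.read.csv("events.csv", header=True, inferSchema=True)  # placeholder input
  df.cache()                                   # keep the data in memory for repeated queries
  df.groupBy("event_type").count().show()      # hypothetical column name
  spark.stop()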
Big Data Analytics Technologies
and Tools – Bigdata platform
Apache Storm
• Apache Storm is a free and open-source distributed processing system designed to
process high volumes of data streams in real-time, making it suitable for use cases
such as real-time analytics, online machine learning, and IoT applications.
• Storm processes data streams by breaking them down into small units of work,
called “tasks,” and distributing those tasks across a cluster of machines. This
allows Storm to process large amounts of data in parallel, providing high
performance and scalability.
• Apache Storm is available on cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, but it can also be deployed on-premises.
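The snippet below is not Storm code and does not use Storm's API; it is only a conceptual Python sketch of the idea described above: events from a stream are broken into small tasks and fanned out to a pool of parallel workers.

  # Conceptual sketch only (not Apache Storm code): split a stream into small tasks
  # and fan them out to parallel workers, the way Storm distributes tasks over a cluster.
  from multiprocessing import Pool

  def process_event(event):          # stands in for the work a Storm task would do
      return event.upper()

  if __name__ == "__main__":
      stream = ["click", "view", "purchase", "click"]   # stands in for an incoming stream
      with Pool(processes=4) as pool:
          for result in pool.imap_unordered(process_event, stream):
              print(result)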
Big Data Analytics Technologies
and Tools – Bigdata platform
Datameer
• Datameer is a data analytics platform that provides big data processing and
analysis capabilities designed to support end-to-end analytics projects, from data
ingestion and preparation to analysis, visualization, and collaboration.
• Datameer provides a visual interface for designing and executing big data
workflows and includes built-in support for various data sources and analytics
tools. The platform is optimized for use with Hadoop, and provides integration
with Apache Spark and other big data technologies.
• The service is available as a cloud-based platform and on-premise. The on-
premise version of Datameer provides the same features as the cloud-based
platform but is deployed and managed within an organization’s own data center.
Big Data Analytics Technologies
and Tools – Bigdata platform
Snowflake
• Snowflake is a cloud-based data warehousing platform that provides data storage,
processing, and analysis capabilities. It supports structured and semi-structured
data and provides a SQL interface for querying and analyzing data.
• It provides a fully managed service, which means that the platform handles all
infrastructure and management tasks, including automatic scaling, backup and
recovery, and security. It supports integrating various data sources, including other
cloud-based data platforms and on-premise databases.
Databricks
• Databricks is a cloud-based platform for big data processing and analysis based on
Apache Spark. It provides a collaborative work environment for data scientists,
engineers, and business analysts offering features such as an interactive
workspace, distributed computing, machine learning, and integration with popular
big data tools.
Big Data Analytics Technologies
and Tools – Bigdata platform
Cloudera
• Cloudera is a big data platform based on Apache’s Hadoop system. It can handle huge volumes of data. Enterprises regularly store over 50 petabytes in this platform’s Data Warehouse, which handles data such as text, machine logs, and more. Cloudera’s DataFlow also enables real-time data processing.
• Cloudera platform is based on the Apache Hadoop ecosystem and
includes components such as HDFS, Spark, Hive, and Impala, among
others. Cloudera provides a comprehensive solution for managing
and processing big data and offers features such as data warehousing,
machine learning, and real-time data processing. The platform can be
deployed on-premise, in the cloud, or as a hybrid solution.
DISTRIBUTED AND PARALLEL COMPUTING FOR BIG DATA

• Distributed Computing: multiple computing resources are connected in a network, and computing tasks are distributed across these resources.
• Advantages
• Increases speed
• Increases efficiency
• More suitable for processing huge amounts of data in a limited time
Parallel Computing

• Parallel computing also improves the processing capability of a computer system by adding additional computational resources.
• Complex computations are divided into subtasks, handled individually by processing units running in parallel (see the sketch below).
• Concept – processing capability increases with the increase in the level of parallelism.
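A small Python sketch of this idea: a large computation is broken into subtasks that run on separate processing units, and the partial results are then combined.

  # Sketch: divide a complex computation into subtasks handled by parallel workers.
  from concurrent.futures import ProcessPoolExecutor

  def subtask(chunk):
      return sum(x * x for x in chunk)              # each processing unit handles one piece

  if __name__ == "__main__":
      data = list(range(1_000_000))
      chunks = [data[i::4] for i in range(4)]       # break the problem into 4 subtasks
      with ProcessPoolExecutor(max_workers=4) as pool:
          partials = list(pool.map(subtask, chunks))  # subtasks run in parallel
      print(sum(partials))                            # combine the partial results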
BIG DATA PROCESSING TECHNIQUES

• The increase in data is forcing organizations to adopt a data-analysis strategy that can analyze the entire data set in a very short time.
• This is done with powerful hardware components and new software programs.
• The procedure followed by the software applications is:
• Break up the given task
• Survey the available resources
• Assign the subtasks to the nodes
• Issues in the system
• Latency: can be defined as the aggregate delay in the system caused by delays in completing individual tasks.
• System delay also affects data management and communication, affecting the productivity and profitability of an organization.
DISTRIBUTED COMPUTING TECHNIQUE FOR PROCESSING
LARGE DATA
MERITS OF THE SYSTEM

• Scalability: a system with added scalability can accommodate growing amounts of data more efficiently and flexibly.
• Virtualization and Load Balancing Features:
• Load Balancing – the sharing of workload across the various systems.
• Virtualization – creates a virtual environment of the hardware platform, storage devices and OS.
PARALLEL COMPUTING TECHNIQUES

1) Cluster or Grid Computing


• Primarily used in Hadoop.
• Based on a connection of multiple servers in a network (clusters)
• Servers share the workload among them.
• Overall cost may be very high.
2) Massively Parallel Processing (MPP)
• Used in data warehouses.
• A single machine working as a grid is used in the MPP platform.
• Capable of handling the storage, memory and computing activities.
• Software written specifically for the MPP platform is used for optimization.
• MPP platforms such as EMC Greenplum and ParAccel are suited for high-value use cases.
3) High Performance Computing (HPC)
• Offers high performance and scalability by using in-memory computing (IMC).
• Suitable for processing floating-point data at high speeds.
• Used in research and business organizations where the result is more valuable than the cost, or where the strategic importance of the project is of high priority.
DIFFERENCE BETWEEN DISTRIBUTED AND PARALLEL SYSTEMS
CLOUD COMPUTING AND BIG DATA

• Cloud Computing is the delivery of computing services—servers, storage,


databases, networking, software, analytics and more—over the Internet (“the
cloud”).
• Companies offering these computing services are called cloud providers and
typically charge for cloud computing services based on usage, similar to how you
are billed for water or electricity at home.
FEATURES OF CLOUD COMPUTING

• Scalability – addition of new resources to an existing infrastructure.


• increase in the amount of data , requires organization to improve h/w components.
• The new h/w may not provide complete support to the s/w, that used to run
properly on the earlier set of h/w.
• Solution to this problem is using cloud services - that employ the distributed
computing technique to provide scalability.
• Elasticity – Hiring certain resources, as and when required, and paying for those
resources.
• no extra payment is required for acquiring specific cloud services.
• A cloud does not require customers to declare their resource requirements in
advance.
• Resource Pooling - multiple organizations, which use similar kinds of
resources to carry out computing practices, have no need to
individually hire all the resources.
• Self Service – cloud computing involves a simple user interface that
helps customers to directly access the cloud services they want.
• Low Cost – cloud offers customized solutions, especially to
organizations that cannot afford too much initial investment. - cloud
provides pay-us-you-use option, in which organizations need to sign
for those resources only that are essential.
• Fault Tolerance – offering uninterrupted services to customers
CLOUD DEPLOYMENT MODELS

• Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services form various deployment models. They are:
• Public Cloud
• Private Cloud
• Community Cloud
• Hybrid Cloud
Public Cloud (End-User Level Cloud)

• Owned and managed by a company other than the one using it.
• Administered by a third party.
• E.g.: Verizon, Amazon Web Services, and Rackspace.
• The workload is categorized based on service category; hardware customization is possible to provide optimized performance.
• The process of computing becomes very flexible and scalable through customized hardware resources.
• The primary concerns with a public cloud include security and latency.
Private Cloud (Enterprise Level Cloud)

• Remains entirely in the ownership of the organization using it.
• The infrastructure is solely designed for a single organization.
• Can automate several processes and operations that require manual handling in a public cloud.
• Can also provide firewall protection to the cloud, addressing latency and security concerns.
• A private cloud can be either on-premises or hosted externally.
• On-premises: the service is exclusively used and hosted by the single organization itself.
• Hosted externally: the cloud is hosted by an external provider but used exclusively by a single organization and not shared with other organizations.
Community Cloud

• A type of cloud that is shared among various organizations with a common tie.
• Generally managed by a third-party cloud service provider.
• Available on or off premises.
• E.g.: in any state, a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud. Because of this sharing of resources, the data of all citizens of that state can be easily managed by the government organizations.
Hybrid Cloud

• Various internal or external service providers offer services to many organizations.
• In hybrid clouds, an organization can use both types of cloud, i.e. public and private together – for situations such as cloud bursting.
• The organization uses its own computing infrastructure for normal load and accesses the public cloud when there is a high load requirement. The organization using the hybrid cloud can manage an internal private cloud for general use and migrate all or part of an application to the public cloud during peak periods.
CLOUD SERVICES FOR BIG DATA

• In big data, IaaS, PaaS and SaaS clouds are used in the following manner:
• IaaS: the huge storage and computational power requirements of big data are fulfilled by the limitless storage space and computing ability offered by the IaaS cloud.
• PaaS: PaaS offerings of various vendors have started adding popular big data platforms, including MapReduce and Hadoop. These offerings save organizations from a lot of the hassles that occur in managing individual hardware components and software applications.
• SaaS: various organizations need to identify and analyze the voice of customers, particularly on social media. Social media data and platforms are provided by SaaS vendors. In addition, a private cloud facilitates access to enterprise data, which enables these analyses.
IN MEMORY COMPUTING TECHNOLOGY

• In-memory computing (IMC) is another way to improve the speed and processing power of data analysis.
• IMC is used to facilitate high-speed data processing, e.g. IMC can help in tracking and monitoring consumers’ activities and behaviours, which allows organizations to take timely actions to improve customer service and hence customer satisfaction.
• Traditionally, data was stored on external devices known as secondary storage; this data had to be accessed from the external source each time it was needed.
• In IMC technology, the RAM or primary storage space is used for analysing data. RAM helps to increase computing speed.
• Also, the reduction in the cost of primary memory has made it practical to store data in primary memory.
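A hedged sketch of the contrast between disk-based access and in-memory computing, using pandas; the file and column names are placeholders.

  # Sketch: disk-based access vs. in-memory computing.
  import pandas as pd

  # Disk-based: every query re-reads the file from secondary storage.
  def monthly_sales_from_disk(month):
      df = pd.read_csv("sales.csv")                 # placeholder file
      return df[df["month"] == month]["amount"].sum()

  # In-memory: load once into RAM, then answer many queries directly from memory.
  sales_in_memory = pd.read_csv("sales.csv")
  def monthly_sales_in_memory(month):
      return sales_in_memory[sales_in_memory["month"] == month]["amount"].sum()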
INTRODUCTION TO HADOOP
Hadoop
• Hadoop is a framework that allows us to store and process large data
sets in parallel and distributed fashion
Hadoop

• Hadoop is an open-source framework. It is provided by Apache to process and analyze very large volumes of data. It is written in Java and currently used by Google, Facebook, LinkedIn, Yahoo, Twitter, etc.
• Storage: to store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System) to form clusters and store data in a distributed fashion. It works on the write once, read many times principle.
• Processing: the MapReduce paradigm is applied to the data distributed over the network to find the required output.
• Analyze: Pig and Hive can be used to analyze the data.
• Cost: Hadoop is open source, so cost is no longer an issue.
Core Modules of Hadoop

1.HDFS: Hadoop Distributed File System. Files are broken into blocks and stored on nodes across the distributed architecture.
2.YARN: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3.MapReduce: a framework which helps Java programs do parallel computation on data using key/value pairs. The Map task takes input data and converts it into a data set which can be computed as key/value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (see the sketch after this list).
4.Hadoop Common: these Java libraries are used to start Hadoop and are used by the other Hadoop modules.
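As a hedged illustration of the key/value idea in item 3, the sketch below follows the Hadoop Streaming style, where the mapper and reducer are small scripts (here in Python) that read from stdin and write to stdout; a native MapReduce job would normally be written in Java.

  # mapper.py  (sketch): emit a (word, 1) pair for every word read from stdin.
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(word + "\t1")

  # reducer.py (sketch, a separate script): sum the counts emitted for each word.
  import sys
  from collections import defaultdict

  counts = defaultdict(int)
  for line in sys.stdin:
      word, value = line.rsplit("\t", 1)
      counts[word] += int(value)
  for word, total in counts.items():
      print(word + "\t" + str(total))

Hadoop Streaming runs the mapper on each input split, sorts and groups the emitted pairs by key, and feeds them to the reducer, which produces the final result.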
Hadoop
Hadoop Distributed File System
(HDFS)
• HDFS is a distributed file system (DFS) that runs on large clusters and
provides high-throughput access to data.

• HDFS is a highly fault-tolerant system and is designed to work with


commodity hardware.

• HDFS stores each file as a sequence of blocks. The blocks of each file
are replicated on multiple machines in a cluster to provide fault
tolerance.
HDFS
• HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing large
data sets of structured or unstructured data across various nodes and thereby maintaining the
metadata in the form of log files.
• HDFS consists of two core components i.e.
• Name node
• Data Node
• The Name Node is the prime node; it contains metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data. It manages the file system namespace by executing operations such as opening, renaming and closing files.
• DataNode
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
• It performs block creation, deletion, and replication upon instruction from the NameNode.
Hadoop Distributed File System
(HDFS)
• Scalable Storage for Large Files: HDFS has been designed to store
large files (typically from gigabytes to terabytes in size). Large files are
broken into chunks or blocks and each block is replicated across
multiple machines in the cluster.
• Replication: HDFS replicates data blocks to multiple machines in a
cluster which makes the system reliable and fault-tolerant.
• Streaming Data Access: HDFS has been designed for streaming data
access patterns and provides high throughput streaming reads and
writes
• File Appends: HDFS was originally designed to have immutable files.
Files once written to HDFS could not be modified by writing at
arbitrary locations in the file or appending to the file.
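A hedged example of interacting with HDFS from Python by shelling out to the standard hdfs dfs commands; the paths are placeholders, and a configured Hadoop client is assumed to be on the machine.

  # Sketch: copy a local file into HDFS and list the target directory.
  import subprocess

  subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo/input"], check=True)
  subprocess.run(["hdfs", "dfs", "-put", "-f", "events.csv", "/user/demo/input/"], check=True)
  subprocess.run(["hdfs", "dfs", "-ls", "/user/demo/input"], check=True)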
HDFS Architecture
• HDFS has two types of nodes: Namenode and Datanode.
• Namenode manages the filesystem namespace. All the file system
meta-data is stored on the Namenode.
• Namenode is responsible for executing operations such as opening
and closing of files.
• The Namenode checks if the file exists and whether the client has
sufficient permissions to read the file
• The Secondary Namenode helps in the checkpointing process
HDFS Architecture - Datanode
• While the Namenode stores the filesystem meta-data, the Datanodes
store the data blocks and serve the read and write requests.

• Datanodes periodically send heartbeat messages and block reports to


the Namenode.
• While the heartbeat messages tell the Namenode that a Datanode is
alive, the block reports contain information on the blocks on a
Datanode.
YARN
• Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster
resource management system.
• YARN was introduced in Hadoop 2 to improve the MapReduce implementation.
• It supports other distributed computing paradigms as well.
• YARN provides its core services via two types of long-running
daemon: a resource manager (one per cluster) to manage the use of
resources across the cluster,
• and node managers running on all the nodes in the cluster to launch
and monitor containers.
• Container is a bundle of resources allocated by RM (memory, CPU and
network).
key components of YARN
• The key components of YARN are described as follows:
• Resource Manager (RM): the RM acts as the scheduler, allocating cluster resources among applications.

• Application Master (AM): A per-application AM manages the


application’s life cycle. AM is responsible for negotiating resources
from the RM and working with the NMs to execute and monitor the
tasks.
• Node Manager (NM): A per-machine NM manages the user processes
on that machine.
jps (JVM Process Status Tool)

jps stands for JVM (Java Virtual Machine) Process Status Tool.

It lists the Hadoop daemons currently running on a machine:

• DataNode
• Node Manager
• Name Node
• Resource Manager
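A minimal sketch (assuming the JDK's jps tool is on the PATH) that checks which of these daemons are currently running on a node:

  # Sketch: call jps and report which Hadoop daemons appear in its output.
  import subprocess

  output = subprocess.run(["jps"], capture_output=True, text=True).stdout
  daemons = {"NameNode", "DataNode", "ResourceManager", "NodeManager", "SecondaryNameNode"}
  running = [line.split()[1] for line in output.splitlines() if len(line.split()) > 1]
  for d in sorted(daemons):
      print(d, "running" if d in running else "not found")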
HADOOP COMPUTING MODEL
HADOOP CLUSTER ARCHITECTURE
