Unit 1 - Big Data Tools and Techniques
21CSE222T
Why Big Data
Big Data Analytics
Structured Data
• Structured data is data that depends on a data model and resides in a fixed field within a record.
• Structured data is highly organized and formatted.
• Examples of structured data include names, dates, addresses, credit card numbers, stock
information, geolocation, and more.
• It is typically stored and queried with relational database technologies such as SQL, PL/SQL, SQLite, and Oracle.
Unstructured data
• Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying.
• It is highly complex, qualitative, and unorganized.
• This data can be numerical, alphabetical, boolean, or a mix of all of them.
What is Big Data Analytics?
• Data Collection: Data is the heart of Big Data Analytics. This is the process of collecting data from various sources, which can include customer reviews, surveys, sensors, social media, etc. The main goal of data collection is to gather as much relevant data as possible: the more data, the richer the insights.
• Data Cleaning (Data Preprocessing): Once we have the data, it often needs cleaning. This process involves identifying and dealing with missing values, correcting errors, and removing duplicates (a small cleaning sketch follows this list).
• Data Storage
Once the data is collected, it is stored in a repository, such as Hadoop Distributed
File System (HDFS), Amazon S3, or Google Cloud Storage.
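To make the cleaning step above concrete, here is a minimal pandas sketch; the column names and sample values are invented for illustration and the imputation strategy (median/mean fill) is just one possible choice.

import pandas as pd

# Hypothetical raw survey data with duplicate rows and missing values.
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "age":         [34, None, None, 29, 41],
    "rating":      [5, 4, 4, None, 3],
})

cleaned = (
    raw.drop_duplicates(subset="customer_id")   # remove duplicate records
       .assign(age=lambda df: df["age"].fillna(df["age"].median()),     # fill missing ages
               rating=lambda df: df["rating"].fillna(df["rating"].mean()))  # fill missing ratings
)
print(cleaned)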
How does big data analytics work?
• Data Processing: Next, we need to process the data. This involves different steps
like organizing, structuring, and formatting it in a way that makes it appropriate
for analysis.
• Data Analysis: Data analysis is performed using various statistical, mathematical,
and machine learning techniques to extract valuable insights from the processed
data.
• Data Visualization: Data analysis results are often presented in the form of
visualizations – charts, graphs, and interactive dashboards.
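As a minimal sketch of the analysis and visualization steps just described, the snippet below computes a simple descriptive statistic with pandas and charts it with matplotlib; the regions and sales figures are made up.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical processed data: individual sales records per region.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [120.0, 95.5, 130.2, 88.0, 101.3],
})

# Analysis: aggregate the processed data per region.
summary = sales.groupby("region")["amount"].sum()

# Visualization: present the result as a bar chart.
summary.plot(kind="bar", title="Total sales by region")
plt.ylabel("Amount")
plt.tight_layout()
plt.show()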
How does big data analytics work?
• Data Governance
Data Governance (data cataloging, data quality management, and data lineage
tracking) ensures the accuracy, completeness, and security of the data.
• Data Management
Big data platforms provide management capabilities that enable organizations to
make backups, recover, and archive.
Types of Big Data Analytics
1.Descriptive Analytics: This type helps us understand past events. In social media,
it shows performance metrics, like the number of likes on a post.
2.Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the reasons behind past events. In healthcare, it identifies the causes of high patient re-admissions.
3.Predictive Analytics: Predictive analytics forecasts future events based on past data. Weather forecasting, for example, predicts tomorrow’s weather by analyzing historical patterns (a minimal forecasting sketch follows this list).
4.Prescriptive Analytics: This type not only predicts outcomes but also suggests
actions to optimize them. In e-commerce, it might recommend the best price for a
product to maximize profits.
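To make the predictive case (item 3 above) concrete, here is a minimal sketch: fit a linear trend to past observations and extrapolate one step ahead. The temperature values are invented, and a straight-line fit stands in for the far richer models used in real forecasting.

import numpy as np

# Hypothetical past observations: daily temperatures for one week.
days  = np.arange(7)
temps = np.array([21.0, 21.5, 22.1, 22.0, 22.8, 23.1, 23.5])

# Fit a straight-line trend to the historical data.
slope, intercept = np.polyfit(days, temps, deg=1)

# "Forecast" tomorrow (day 7) by extrapolating the trend.
tomorrow = slope * 7 + intercept
print(f"Predicted temperature for day 7: {tomorrow:.1f}")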
Types of Big Data Analytics
5.Real-time Analytics: Real-time analytics processes data instantly. In stock
trading, it helps traders make quick decisions based on current market conditions.
6.Spatial Analytics: Spatial analytics focuses on location data. For city planning, it
optimizes traffic flow using data from sensors and cameras to reduce congestion.
7.Text Analytics: Text analytics extracts insights from unstructured text data. In the
hotel industry, it can analyze guest reviews to improve services and guest
satisfaction.
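As a tiny illustration of text analytics on guest reviews, the sketch below counts the most frequent words after removing a few common stopwords; the reviews and the stopword list are made up, and real text analytics would add steps such as stemming or sentiment scoring.

from collections import Counter
import re

# Hypothetical guest reviews (unstructured text).
reviews = [
    "The room was clean but the wifi was slow",
    "Great breakfast, slow check-in",
    "Clean room, friendly staff, slow wifi",
]

# Tokenize, drop very common words, and count the rest.
stopwords = {"the", "was", "but", "a", "and"}
words = [w for text in reviews
           for w in re.findall(r"[a-z\-]+", text.lower())
           if w not in stopwords]

print(Counter(words).most_common(5))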
Tools used in Big Data Analytics
Big Data Analytics Technologies and Tools – Big Data Platform
Apache Hadoop
• Hadoop is an open-source programming framework and server software. It is employed to store and analyze large data sets very fast with the assistance of thousands of commodity servers in a clustered computing environment [6]. If one server or piece of hardware fails, the data has been replicated elsewhere, so no data is lost.
• It is commonly deployed on Ubuntu and other variants of Linux.
Apache Spark
• Apache Spark is an open-source data-processing engine designed to deliver the computational speed and scalability required for streaming data, graph data, machine learning, and artificial intelligence applications. Spark processes and keeps data in memory without writing to or reading from disk, which is why it is much faster than disk-based alternatives such as Hadoop MapReduce.
• The solution can be deployed on-premise, in addition to being available on cloud
platforms such as Amazon Web Services, Google Cloud Platform, and Microsoft
Azure.
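As a minimal illustration of Spark's in-memory processing model, here is a hedged PySpark sketch that builds a small DataFrame, caches it in memory, and runs an aggregation; it assumes the pyspark package is installed, and the column names and values are invented.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

# Hypothetical click events.
events = spark.createDataFrame(
    [("home", 1), ("search", 3), ("home", 2), ("checkout", 1)],
    ["page", "clicks"],
)

# cache() keeps the DataFrame in memory for repeated use.
events.cache()

# Aggregate clicks per page and print the result.
events.groupBy("page").agg(F.sum("clicks").alias("total_clicks")).show()

spark.stop()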
Big Data Analytics Technologies and Tools – Big Data Platform
Apache Storm
• Apache Storm is a free and open-source distributed processing system designed to
process high volumes of data streams in real-time, making it suitable for use cases
such as real-time analytics, online machine learning, and IoT applications.
• Storm processes data streams by breaking them down into small units of work,
called “tasks,” and distributing those tasks across a cluster of machines. This
allows Storm to process large amounts of data in parallel, providing high
performance and scalability.
• Apache Storm is available on cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, but it can also be deployed on-premise.
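The snippet below is not the Storm API (Storm topologies are typically written in Java); it is only a conceptual Python sketch of the idea described above: break a stream into small units of work and process them in parallel across workers, with a local thread pool standing in for a cluster of machines.

from concurrent.futures import ThreadPoolExecutor

# A pretend stream of incoming sensor readings.
stream = [{"sensor": i % 3, "value": i * 0.5} for i in range(12)]

def task(reading):
    # One small unit of work applied to a single stream element.
    return reading["sensor"], reading["value"] * 2

# Distribute the tasks across a pool of workers, loosely analogous to
# Storm spreading tasks across a cluster (here: threads on one machine).
with ThreadPoolExecutor(max_workers=4) as pool:
    for sensor, doubled in pool.map(task, stream):
        print(f"sensor={sensor} processed value={doubled}")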
Big Data Analytics Technologies and Tools – Big Data Platform
Datameer
• Datameer is a data analytics platform that provides big data processing and
analysis capabilities designed to support end-to-end analytics projects, from data
ingestion and preparation to analysis, visualization, and collaboration.
• Datameer provides a visual interface for designing and executing big data
workflows and includes built-in support for various data sources and analytics
tools. The platform is optimized for use with Hadoop, and provides integration
with Apache Spark and other big data technologies.
• The service is available as a cloud-based platform and on-premise. The on-
premise version of Datameer provides the same features as the cloud-based
platform but is deployed and managed within an organization’s own data center.
Big Data Analytics Technologies and Tools – Big Data Platform
Snowflake
• Snowflake is a cloud-based data warehousing platform that provides data storage,
processing, and analysis capabilities. It supports structured and semi-structured
data and provides a SQL interface for querying and analyzing data.
• It provides a fully managed service, which means that the platform handles all
infrastructure and management tasks, including automatic scaling, backup and
recovery, and security. It supports integrating various data sources, including other
cloud-based data platforms and on-premise databases.
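As a hedged sketch of Snowflake's SQL interface from Python, assuming the snowflake-connector-python package; the account, credentials, warehouse, and table name below are placeholders, not real values.

import snowflake.connector

# Placeholder credentials; replace with real account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Standard SQL query against a hypothetical sales table.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    for region, total in cur.fetchall():
        print(region, total)
finally:
    conn.close()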
Databricks
• Databricks is a cloud-based platform for big data processing and analysis based on
Apache Spark. It provides a collaborative work environment for data scientists,
engineers, and business analysts offering features such as an interactive
workspace, distributed computing, machine learning, and integration with popular
big data tools.
Big Data Analytics Technologies and Tools – Big Data Platform
Cloudera
• Cloudera is a big data platform based on Apache’s Hadoop system. It can handle huge volumes of data. Enterprises regularly store over 50 petabytes in this platform’s Data Warehouse, which handles data such as text, machine logs, and more. Cloudera’s DataFlow also enables real-time data processing.
• Cloudera platform is based on the Apache Hadoop ecosystem and
includes components such as HDFS, Spark, Hive, and Impala, among
others. Cloudera provides a comprehensive solution for managing
and processing big data and offers features such as data warehousing,
machine learning, and real-time data processing. The platform can be
deployed on-premise, in the cloud, or as a hybrid solution.
DISTRIBUTED AND PARALLEL COMPUTING FOR BIG DATA
• The increase in data is forcing organizations to adopt a data analysis strategy that can analyze the entire data set in a very short time.
• This is achieved with powerful hardware components and new software programs.
• The procedure followed by the software applications (a small parallel-processing sketch follows this list):
  • Break up the given task.
  • Survey the available resources.
  • Assign the subtasks to the nodes.
• Issues in the system:
  • Latency: the aggregate delay in the system caused by delays in completing individual tasks.
  • System delay also affects data management and communication, affecting the productivity and profitability of an organization.
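As referenced above, here is a minimal Python sketch of the procedure: break up the task, assign the subtasks to workers, and combine the results. A local process pool stands in for the cluster nodes; the task (summing a range of numbers) is chosen only for illustration.

from multiprocessing import Pool

def subtask(chunk):
    # The piece of work assigned to one "node": sum its chunk.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))

    # Break up the given task into 4 subtasks.
    n = 4
    chunks = [data[i::n] for i in range(n)]

    # Assign the subtasks to the available workers and combine the results.
    with Pool(processes=n) as pool:
        partials = pool.map(subtask, chunks)

    print("total =", sum(partials))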
DISTRIBUTED COMPUTING TECHNIQUE FOR PROCESSING
LARGE DATA
MERITS OF THE SYSTEM
• Scalability: A system with added scalability can accommodate growing amounts of data more efficiently and flexibly.
• Virtualization and Load Balancing Features:
  • Load Balancing – the sharing of workload across various systems (a round-robin sketch follows this list).
  • Virtualization – creates a virtual environment spanning the hardware platform, storage devices and OS.
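To illustrate the load-balancing idea in the simplest possible way, here is a round-robin sketch that spreads incoming requests across a fixed set of servers; the server names and requests are invented, and real load balancers also weigh server health and current load.

from itertools import cycle

# Hypothetical pool of servers sharing the workload.
servers = cycle(["server-1", "server-2", "server-3"])

requests = [f"request-{i}" for i in range(7)]

# Round-robin load balancing: each request goes to the next server in turn.
for req in requests:
    print(f"{req} -> {next(servers)}")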
CLOUD COMPUTING FOR BIG DATA – DEPLOYMENT MODELS
• Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services form various deployment models. They are:
• Public Cloud
• Private Cloud
• Community Cloud
• Hybrid Cloud
Community Cloud
• Type of cloud that is shared among various organizations with a common tie.
• Managed by third-party cloud services.
• Available on or off premises.
• E.g., in any state, a community cloud can be provided so that almost all govt. organizations of that state can share the resources available on the cloud. Because of this sharing of resources on the community cloud, the data of all citizens of that state can be easily managed by the govt. organizations.
Cloud Services for Big Data (IaaS, PaaS, SaaS)
• In big data, IaaS, PaaS and SaaS clouds are used in the following manner.
• IaaS: The huge storage and computational power requirements of big data are fulfilled by the limitless storage space and computing ability offered by an IaaS cloud.
• PaaS: Offerings of various vendors have started adding popular big data platforms, such as MapReduce and Hadoop, to their services. These offerings save organizations from a lot of the hassles that occur in managing individual hardware components and software applications.
• SaaS: Various organizations require identifying and analyzing the voice of customers, particularly on social media. Social media data and platforms are provided by SaaS vendors. In addition, private clouds facilitate access to enterprise data, which enables these analyses.
HADOOP CORE COMPONENTS
1.HDFS: Hadoop Distributed File System. Files are broken into blocks and stored on nodes across the distributed architecture.
2.YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3.MapReduce: A framework which helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed over as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (a pure-Python word-count sketch of this idea follows this list).
4.Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
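As referenced above, here is a pure-Python sketch of the map and reduce idea, not the actual Hadoop Java API: the map step emits key-value pairs, a shuffle groups values by key, and the reduce step combines each group into the final result. The input lines are invented.

from collections import defaultdict

lines = ["big data tools", "big data techniques", "hadoop map reduce"]

# Map: emit a (word, 1) pair for every word in the input.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce: combine the values for each key into the final result.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # e.g. {'big': 2, 'data': 2, ...}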
Hadoop
Hadoop Distributed File System (HDFS)
• HDFS is a distributed file system (DFS) that runs on large clusters and
provides high-throughput access to data.
• HDFS stores each file as a sequence of blocks. The blocks of each file
are replicated on multiple machines in a cluster to provide fault
tolerance.
HDFS
• HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
• Name node
• Data Node
• Name Node is the prime node which contains metadata (data about data), requiring comparatively fewer resources than the data nodes that store the actual data. It manages the file system namespace by executing operations such as opening, renaming and closing files.
• DataNode
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
• It performs block creation, deletion, and replication upon instruction from the NameNode.
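The sketch below is a conceptual illustration, not the HDFS client API: it shows what the NameNode/DataNode split means by cutting a file into fixed-size blocks and assigning each block to several DataNodes, with the resulting block-to-node map standing in for the metadata a NameNode would track. The block size, replication factor, and node names are arbitrary.

from itertools import cycle

BLOCK_SIZE = 4          # bytes per block, tiny for illustration (HDFS defaults to e.g. 128 MB)
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4"]

data = b"example file contents stored in HDFS"

# Split the file into blocks.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Assign each block to REPLICATION distinct DataNodes (NameNode-style metadata).
node_ring = cycle(datanodes)
metadata = {}
for idx, block in enumerate(blocks):
    metadata[f"block-{idx}"] = [next(node_ring) for _ in range(REPLICATION)]

for block_id, replicas in metadata.items():
    print(block_id, "->", replicas)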
Hadoop Distributed File System (HDFS)
• Scalable Storage for Large Files: HDFS has been designed to store
large files (typically from gigabytes to terabytes in size). Large files are
broken into chunks or blocks and each block is replicated across
multiple machines in the cluster.
• Replication: HDFS replicates data blocks to multiple machines in a
cluster which makes the system reliable and fault-tolerant.
• Streaming Data Access: HDFS has been designed for streaming data
access patterns and provides high throughput streaming reads and
writes
• File Appends: HDFS was originally designed to have immutable files. Files once written to HDFS could not be modified by writing at arbitrary locations in the file or by appending to the file; support for appending to existing files was added in later releases.
HDFS Architecture
• HDFS has two types of nodes: Namenode and Datanode.
• Namenode manages the filesystem namespace. All the file system
meta-data is stored on the Namenode.
• Namenode is responsible for executing operations such as opening
and closing of files.
• The Namenode checks if the file exists and whether the client has sufficient permissions to read the file.
• The Secondary Namenode helps in the checkpointing process
HDFS Architecture - Datanode
• While the Namenode stores the filesystem meta-data, the Datanodes
store the data blocks and serve the read and write requests.