0% found this document useful (0 votes)

9 views18 pages

Unit-11 Big Data

This document provides an overview of Big Data, including its characteristics, challenges, and the technologies used for storage and analysis, such as Hadoop and Spark. It discusses various types of analytics, handling techniques, and the importance of understanding data classifications for effective decision-making. Additionally, it highlights case studies demonstrating the application of Big Data in marketing, healthcare, and advertising.

Uploaded by

manishlap4423

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views18 pages

Unit-11 Big Data

Uploaded by

manishlap4423

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 18

Emerging

Technologies for UNIT 11 BIG DATA

Business

Objectives
After studying this unit, you will be able to:
• Understanding the concept of Big Data, its characteristics, and the
challenges associated with it.
• Familiarizing with the Hadoop ecosystem and its components.
• Understanding the basics of MapReduce.
• Learning utility of Pig, a high-level platform for creating MapReduce
programs, to process and analyse data.
• Understanding the basics of machine learning algorithms for Big Data
analytics.

Structure
11.0 Introduction to Big Data
11.0.1 Data Storage and Analysis
11.0.2 Characteristics of Big Data
11.0.3 Big Data Classification
11.0.4 Big Data Handling Techniques
11.0.5 Types of Big Data Analytics
11.0.6 Typical Analytical Architecture
11.0.6 Challenges in Big Data Analytics
11.0.7 Case studies: Big Data in Marketing and Sales, Healthcare, Medicine, and
Advertising
11.1 Hadoop Framework & Ecosystem
11.1.1 Requirement of Hadoop Framework
11.1.2 Map Reduce Framework
11.1.3 Hadoop Yarn and Hadoop Execution Model
11.1.4 Introduction to Hadoop Ecosystem Technologies
11.1.5 Databases: HBase, Hive
11.1.6 Scripting language: Pig, Streaming: Flink, Storm
11.2 Spark Framework
11.3 Machine Learning Algorithms for Big Data Analytics
11.4 Recent Trends in Big Data Analytics
11.5 Summary
11.6 Self–Assessment Exercises
11.7 Keywords
11.8 Further Readings

222
Big Data
11.0 INTRODUCTION TO BIG DATA
Big data refers to the vast amount of structured and unstructured data that is
generated and collected by individuals, organizations, and machines every
day. Data is too large and complex to be processed by traditional data
processing applications, which often have limitations in terms of their
capacity to store, process, and analyse large datasets.

To process and analyse big data, specialized technologies and tools such as
Hadoop, Spark, and NoSQL databases have been developed. These tools
allow organizations to store, process, and analyse large volumes of data
quickly and efficiently. The insights derived from big data can be used to
make informed decisions, identify trends and patterns, improve customer
experiences, and enhance operational efficiency.

11.0.1 Data Storage and Analysis

Data storage and analysis are two critical components of big data processing.
As mentioned earlier, big data is too large and complex to be processed by
traditional data processing applications, and therefore specialized
technologies and tools have been developed to store and analyse big data.
One popular technology for storing big data is Hadoop, which is an open-
source software framework that allows for distributed storage and processing
of large datasets across clusters of computers. Hadoop is designed to handle
both structured and unstructured data, and it uses a distributed file system
called Hadoop Distributed File System (HDFS) to store data across multiple
machines.
Once data is stored, it can be analysed using specialized tools such as Spark,
which is an open-source data processing engine that allows for fast and
efficient processing of large datasets. Spark uses an in-memory processing
model to speed up computations, and it can handle both batch and streaming
data processing. Another popular tool for big data analysis is NoSQL
databases, which are non-relational databases that can handle large volumes
of structured and unstructured data with high scalability and availability.

In addition to these specialized technologies and tools, data scientists and

analysts can use programming languages such as Python and R to perform
data analysis and create visualizations to help interpret and communicate the
results of the analysis. Machine learning algorithms and artificial intelligence
techniques can also be used to derive insights from big data and make
predictions and recommendations.

11.0.2 Characteristics of Big Data

Big data can be characterized by several distinct features as presented in
figure 1, often referred to as "4 Vs", which are volume, velocity, variety, and
veracity.

• Volume: Refers to the sheer amount of data generated by businesses,

individuals, and machines every day. With the increase in the use of IoT
223
Emerging devices and social media, data volumes continue to grow at an
Technologies for
Business unprecedented rate.
• Velocity: Refers to the speed at which data is generated, processed, and
analysed. Big data requires rapid processing of data, including both
streaming and batch processing.
• Variety: Refers to the diverse types of data generated by businesses,
individuals, and machines, including structured, semi-structured, and
unstructured data. Big data requires the ability to manage and analyse
various data types.
• Veracity: Refers to the accuracy and reliability of the data, which may
vary due to data quality issues, noise, and other factors. Big data requires
careful consideration of data quality and data cleansing techniques to
ensure accurate analysis.

Figure 1: Characteristics of Big Data

In addition to these 3 Vs/ 4Vs, big data can also be characterized by several
other features, including:

• Variability: Refers to the inconsistency in the data, which may result

from changes in the data sources, data formats, and data quality.
• Complexity: Refers to the difficulty in understanding and processing
large and diverse datasets, including handling unstructured data, dealing
with data integration challenges, and identifying patterns and trends.
• Accessibility: Refers to the ability to access and share data across
different platforms and systems while ensuring data privacy, security,
and compliance with regulatory requirements.
Understanding the characteristics of big data is essential for businesses and
organizations to effectively collect, manage, and analyse data to gain insights
and make informed decisions.

224
Big Data
11.0.3 Big Data Classification
Big data can be classified based on several different criteria, such as the
source, the structure, the application, and the analytics approach. Here are
some common classifications of big data:

• Structured, semi-structured, and unstructured: This classification is

based on the structure of the data. Structured data is well-organized and
easy to process, like data in a database. Semi-structured data, such as
XML or JSON, has a defined structure but may also contain unstructured
data. Unstructured data, such as social media posts or images, does not
have a specific structure and is difficult to process.
• Internal and external: This classification is based on the source of the
data. Internal data is generated within an organization, such as sales data
or customer data. External data comes from sources outside the
organization, such as social media data, weather data, or financial data.
• Batch and real-time: This classification is based on the velocity of the
data. Batch data processing involves analysing data in large batches,
often overnight or at set intervals. Real-time data processing involves
analysing data as it is generated, like processing stock market data in
real-time.
• Descriptive, diagnostic, predictive, and prescriptive: This
classification is based on the analytics approach. Descriptive analytics
involves summarizing historical data to understand what has happened.
Diagnostic analytics involves identifying the causes of a particular event
or pattern. Predictive analytics involves forecasting future events or
patterns based on historical data. Prescriptive analytics involves
recommending actions based on insights from predictive analytics.

By understanding the different classifications of big data, businesses and

organizations can better plan and implement their big data strategies to
extract insights and drive value from their data.

11.0.4 Big Data Handling Techniques

There are several techniques and tools available to handle big data
effectively. Here are some of the most common big data handling techniques:

• Distributed File Systems: Distributed file systems such as Hadoop

Distributed File System (HDFS) and Apache Cassandra enable
distributed storage and processing of big data across a cluster of
computers.
• In-memory Data Processing: In-memory data processing systems such
as Apache Spark and Apache Flink allow for faster processing of big
data by storing the data in memory rather than on disk.
• NoSQL Databases: NoSQL databases such as MongoDB and Cassandra
are designed to handle unstructured and semi-structured data and provide
high scalability and availability.

225
Emerging • MapReduce: MapReduce is a programming model that is used to
Technologies for
Business process large datasets in parallel across a cluster of computers.
• Data Compression: Data compression techniques such as gzip and
bzip2 can be used to reduce the size of data, making it easier to transfer
and store.
• Data Partitioning: Data partitioning involves dividing a large dataset
into smaller subsets to enable distributed processing.
• Cloud Computing: Cloud computing platforms such as Amazon Web
Services (AWS) and Microsoft Azure provide scalable and cost-effective
solutions for storing and processing big data.
• Machine learning: Machine learning techniques can be used to analyse
big data and identify patterns and insights that can help organizations
make informed decisions.

By using these techniques, businesses and organizations can handle big data
more effectively, extract insights, and derive value from their data.

11.0.5 Types of Big Data Analytics

Big data analytics is the process of examining large and complex datasets to
uncover hidden patterns, unknown correlations, and other useful information
that can help organizations make informed decisions. Big data analytics
involves the use of advanced technologies and tools to process, store, and
analyse large volumes of structured, semi-structured, and unstructured data.
There are several types of big data analytics techniques as presented in table
1 below:

Table 1: Types of Big Data Analytics

Types of Answers Description Level of

Analytics the Advancement
Question
Descriptive What is Uses data aggregation & Low
happening? mining techniques to provide
insight into the past
Diagnostic Why is it Discovers root-cause of the Medium
happening? problem. It has the ability to
isolate all confounding
information
Predictive What’s Historical patterns are used to High
likely to predict specific outcomes
happen?
Prescriptive What do I Applies advanced analytical Very High
need to do? techniques (optimization &
simulation algorithms) to
advice on possible outcomes
& make specific
recommendations

226
Big Data
• Descriptive Analytics: This technique involves summarizing historical
data to understand what has happened in the past.
• Diagnostic Analytics: This technique involves analysing data to
determine the causes of a particular event or pattern.
• Predictive Analytics: This technique involves using statistical models
and machine learning algorithms to forecast future events or patterns
based on historical data.
• Prescriptive Analytics: This technique involves recommending actions
based on insights from predictive analytics.

To effectively implement big data analytics, organizations need to have a

clear understanding of their data sources, objectives, and analytical tools.
They also need to have the necessary infrastructure and skilled personnel to
manage and analyse large and complex datasets.

11.0.6 Typical Analytical Architecture

A typical analytical architecture for big data includes several layers, each of
which serves a specific purpose in the data analytics process. Here are the
typical layers of a big data analytical architecture:

• Data Sources: This layer includes all the sources of data, both internal
and external, that an organization collects and stores. These may include
data from customer transactions, social media, web logs, sensors, and
other sources.
• Data Ingestion and Storage: This layer is responsible for ingesting data
from various sources, processing it, and storing it in a format that can be
easily accessed and analysed. This layer may include technologies such
as Hadoop Distributed File System (HDFS) and NoSQL databases.
• Data Processing and Preparation: This layer is responsible for
cleaning, transforming, and preparing data for analysis. This may include
tasks such as data integration, data cleaning, data normalization, and data
aggregation.

• Analytics Engines: This layer includes the technologies and tools used
for analysing and processing data. This may include machine learning
algorithms, statistical analysis tools, and visualization tools.

• Data Presentation and Visualization: This layer includes the tools used
to present data in a meaningful way, such as dashboards, reports, and
visualizations. This layer is critical for making data accessible and
understandable to non-technical stakeholders.

• Data Governance and Security: This layer is responsible for ensuring

that data is managed in a secure and compliant manner. This may include
access controls, data quality monitoring, and data privacy regulations.

By implementing a big data analytical architecture, organizations can

streamline the data analytics process, extract valuable insights, and make
informed decisions that drive business growth and success.
227
Emerging 11.0.7 Challenges in Big Data Analytics
Technologies for
Business
Big data analytics is a complex and challenging process, and there are several
challenges that organizations face when trying to extract insights and value
from their data. Here are some of the key challenges in big data analytics:

• Data Complexity and Variety: Big data comes in many different forms,
including structured, semi-structured, and unstructured data, which can
be challenging to process and analyse.
• Data Quality: Big data is often incomplete, inconsistent, or inaccurate,
which can lead to erroneous insights and conclusions.
• Data Security and Privacy: Big data often contains sensitive and
confidential information, which must be protected from unauthorized
access and breaches.
• Scalability: As data volumes grow, the analytical architecture must be
able to scale to handle the increased load, which can be challenging and
costly.
• Talent Shortage: There is a shortage of skilled data scientists and
analysts who are able to process and analyse big data effectively.
• Integration: Big data analytics requires integration with multiple
systems and technologies, which can be challenging to implement and
maintain.
• Data Governance: Big data requires careful management and
governance to ensure compliance with regulations and policies.
• Interpreting Results: Big data analytics often produces large and
complex datasets, which can be challenging to interpret and translate into
actionable insights.

Addressing these challenges requires a combination of technology, processes,

and people. Organizations need to invest in robust analytical architectures,
data quality processes, security and privacy protocols, and talent development
to unlock the full potential of big data analytics.

11.0.8 Case Studies: Big Data in Marketing and Sales,

Healthcare, Medicine, and Advertising
Presented here are some examples of how big data is being used in marketing
and sales, healthcare, medicine, and advertising:

• Marketing and Sales: Big data is being used in marketing and sales to
understand customer behaviour and preferences, personalize marketing
messages, and optimize pricing and promotions. For example, Amazon
uses big data to personalize recommendations for individual customers
based on their browsing and purchase history. Walmart uses big data to
optimize pricing and inventory management in its stores. Coca-Cola uses
big data to optimize its vending machine placement, prices, and
promotions based on local weather conditions, events, and consumer
behaviour.
228
Big Data
• Healthcare: Big data is being used in healthcare to improve patient
outcomes, reduce costs, and enable personalized medicine. For example,
IBM's Watson Health is using big data to develop personalized cancer
treatments based on a patient's genetic profile and medical history.
Hospitals and healthcare providers are using big data to predict patient
readmission rates, identify patients at risk of developing chronic
conditions, and optimize resource allocation.

• Medicine: Big data is being used in medicine to accelerate drug

discovery and development, identify new treatments and therapies, and
improve clinical trial design. For example, Pfizer is using big data
analytics to mine existing drug data and identify new therapeutic targets.
Novartis is using big data to improve the efficiency of its clinical trials
and accelerate the development of new drugs.

• Advertising: Big data is being used in advertising to target and

personalize ads based on individual consumer preferences and behaviour.
For example, Google and Facebook use big data to target ads to specific
audiences based on their browsing and search history. Netflix uses big
data to recommend movies and TV shows to individual users based on
their viewing history and preferences.
These are just a few examples of how big data is being used in various
industries and sectors. Big data has the potential to transform the way
organizations operate and make decisions, leading to improved efficiency,
productivity, and innovation.

11.1 HADOOP FRAMEWORK & ECOSYSTEM

Hadoop is an open-source framework that is used for storing and processing
large volumes of data across distributed systems. It was originally developed
by Doug Cutting and Mike Cafarella in 2006 and is now maintained by the
Apache Software Foundation. The Hadoop ecosystem consists of several
components, including:
• Hadoop Distributed File System (HDFS): HDFS is a distributed file
system that stores data across multiple nodes in a cluster. It is designed
to handle large files and is fault-tolerant, ensuring that data is always
available even in the event of a hardware or software failure.

• MapReduce: MapReduce is a programming model and software

framework for processing large data sets across distributed systems. It is
used to parallelize data processing tasks and distribute them across
multiple nodes in a Hadoop cluster.

• YARN: YARN (Yet Another Resource Negotiator) is a resource

management system that is used to manage resources in a Hadoop
cluster. It enables the sharing of resources across multiple applications,
making it easier to run multiple applications simultaneously on a Hadoop
cluster.

• HBase: HBase is a NoSQL database that is used to store large volumes

229
Emerging of structured data. It is built on top of Hadoop and provides real-time
Technologies for
Business access to data stored in HDFS.

• Pig: Pig is a high-level programming language that is used to process

large datasets in Hadoop. It is designed to simplify the programming of
MapReduce jobs and provides a user-friendly interface for data
processing tasks.

• Hive: Hive is a data warehouse system that is built on top of Hadoop. It

provides a SQL-like interface for querying large datasets stored in
HDFS.

• Spark: Spark is a fast and powerful data processing engine that is

designed to handle both batch and real-time processing. It is built on top
of Hadoop and provides a unified platform for data processing, machine
learning, and graph processing.

As depicted in figure 2, Hadoop ecosystem provides a comprehensive and

powerful platform for storing and processing large volumes of data across
distributed systems. It is widely used in industries such as finance, healthcare,
retail, and telecommunications for data processing, analysis, and machine
learning.

Figure 2: Overview of architecture of data warehouse Hadoop ecosystem components

Source: https://www.researchgate.net/figure/Overview-of-architecture-of-data-warehouse-
Hadoop-ecosystem-components_fig3_346482337

11.1.1 Requirement of Hadoop Framework

The Hadoop framework was developed to address the challenges of
processing and analysing large volumes of data that traditional data
processing systems were not designed to handle. Here are some key
requirements that led to the development of Hadoop:

• Scalability: Traditional data processing systems were not designed to

handle the massive volumes of data being generated today. Hadoop was
designed to be scalable, allowing organizations to store and process large
volumes of data across distributed systems.
230
Big Data
• Fault-tolerance: As data volumes grow, the probability of hardware or
software failures also increases. Hadoop was designed to be fault-
tolerant, ensuring that data is always available even in the event of a
hardware or software failure.

• Cost-effectiveness: Traditional data processing systems can be

expensive to scale, both in terms of hardware and software. Hadoop was
designed to be cost-effective, using commodity hardware and open-
source software to reduce costs.
• Flexibility: Traditional data processing systems were designed for
specific types of data and processing tasks. Hadoop was designed to be
flexible, allowing organizations to store and process a wide variety of
data types and perform a range of processing tasks.

• Processing Speed: Traditional data processing systems were not

designed to handle real-time data processing and analysis. Hadoop was
designed to be fast, using parallel processing and distributed computing
to speed up data processing and analysis.
Overall, the Hadoop framework was developed to meet the growing demand
for processing and analysing large volumes of data in a cost-effective,
flexible, and scalable manner. Its architecture and ecosystem have enabled
organizations to develop new applications and use cases for big data
processing and analysis.

11.1.2 Map Reduce Framework

MapReduce is a programming model and software framework that is used to
process and analyse large datasets in parallel across a distributed computing
cluster. The MapReduce framework consists of two main components: Map
and Reduce.
1. Map: The Map component is responsible for processing the input data
and producing a set of key-value pairs as output. Each Map task
processes a portion of the input data and generates intermediate key-
value pairs, which are then passed on to the Reduce tasks.

2. Reduce: The Reduce component takes the intermediate key-value pairs

generated by the Map tasks and combines them to produce a final set of
key-value pairs as output. Each Reduce task processes a subset of the
intermediate data generated by the Map tasks, which are grouped by key.
Key features of the MapReduce framework include scalability, fault
tolerance, data locality, and ease of use. Some popular implementations of
the MapReduce framework include Apache Hadoop, Apache Spark, and
Apache Flink. These frameworks provide a range of tools and libraries that
can be used to build complex data processing workflows using the
MapReduce programming model.

231
Emerging 11.1.3 Hadoop Yarn and Hadoop Execution Model
Technologies for
Business
Hadoop YARN (Yet Another Resource Negotiator) is a resource
management layer that sits between the Hadoop Distributed File System
(HDFS) and the processing engines, such as MapReduce, Spark, and Tez. It
provides a central platform for managing cluster resources, allocating
resources to different applications, and scheduling jobs across a cluster.

The Hadoop execution model involves the following components:

• Client: The client submits a job to the YARN Resource Manager (RM),
which schedules it across the cluster.

• Resource Manager: The Resource Manager is responsible for managing

the resources in the cluster and scheduling jobs. It allocates resources to
each job based on the application requirements and the available
resources in the cluster.

• Node Manager: The Node Manager runs on each node in the cluster and
is responsible for managing the resources on that node, such as CPU,
memory, and disk space. It reports the available resources back to the
Resource Manager, which uses this information to allocate resources to
different applications.

• Application Master: The Application Master is responsible for

managing the lifecycle of a specific application, such as MapReduce or
Spark. It negotiates with the Resource Manager for resources and
monitors the progress of the application.

• Container: A container is a virtualized environment in which an

application runs. It provides an isolated environment for the application
to run in, with its own allocated resources, such as CPU and memory.

The Hadoop execution model is designed to be highly scalable and fault-

tolerant. It allows multiple applications to run concurrently on the same
cluster, with each application running in its own container. If a node fails, the
Resource Manager can redistribute the workload to other nodes in the cluster,
ensuring that the application continues to run without interruption. Largely,
the Hadoop execution model and YARN are essential components of the
Hadoop ecosystem, providing a powerful and flexible platform for processing
and analysing large datasets.

11.1.4 Introduction to Hadoop Ecosystem Technologies

The Hadoop ecosystem is a collection of open-source software tools and
frameworks that are built on top of the Hadoop Distributed File System
(HDFS) and the Hadoop MapReduce programming model. The Hadoop
ecosystem includes a wide range of tools and technologies for data
processing, storage, management, and analysis. Here are some of the most
popular Hadoop ecosystem technologies:

• Apache Spark: Apache Spark is an open-source big data processing

framework that is built on top of Hadoop. It provides a faster and more
232
Big Data
flexible alternative to MapReduce, with support for real-time data
processing, machine learning, and graph processing.
• Apache Hive: Apache Hive is a data warehouse system for querying and
analysing large datasets stored in Hadoop. It provides a SQL-like
interface for querying data and supports a range of data formats,
including structured and semi-structured data.
• Apache Pig: Apache Pig is a high-level data processing language that is
used to simplify the development of MapReduce jobs. It provides a
simple, easy-to-use syntax for writing data processing pipelines.
• Apache HBase: Apache HBase is a NoSQL database that is built on top
of Hadoop. It provides real-time random read and write access to large
datasets and supports low-latency queries.
• Apache Zoo Keeper: Apache ZooKeeper is a centralized service for
maintaining configuration information, naming, providing distributed
synchronization, and group services.
• Apache Oozie: Apache Oozie is a workflow scheduling system that is
used to manage Hadoop jobs. It provides a way to specify dependencies
between jobs and ensures that jobs are executed in the correct order.
• Apache Flume: Apache Flume is a distributed system for collecting,
aggregating, and moving large amounts of log data from various sources
to Hadoop.
• Apache Sqoop: Apache Sqoop is a tool for transferring data between
Hadoop and structured data stores, such as relational databases.

These are just a few examples of the many tools and technologies that are
available in the Hadoop ecosystem. Each of these technologies is designed to
address specific challenges and use cases in big data processing and
analytics. By leveraging the Hadoop ecosystem, organizations can build
powerful, scalable, and cost-effective data processing and analytics solutions.

11.1.5 Databases: HBase, Hive

HBase and Hive are two popular databases in the Hadoop ecosystem that are
used for storing and processing large-scale datasets.

HBase is a NoSQL database that is designed for storing and managing large
volumes of unstructured and semi-structured data in Hadoop. It provides real-
time random read and write access to large datasets, making it ideal for use
cases that require low-latency queries and high-throughput data processing.
HBase is modelled after Google's Bigtable database and is built on top of
Hadoop Distributed File System (HDFS). HBase uses a column-oriented data
model, which allows for efficient storage and retrieval of data, and provides a
powerful API for data manipulation.
Hive, on the other hand, is a data warehouse system for querying and
analysing large datasets stored in Hadoop. It provides a SQL-like interface
for querying data and supports a range of data formats, including structured
and semi-structured data. Hive is modelled after the SQL language, making it
easy for users with SQL experience to work with large-scale datasets in 233
Emerging Hadoop. Hive uses a metadata-driven approach to data management, which
Technologies for
Business allows for easy integration with other tools in the Hadoop ecosystem. Hive
provides a powerful SQL-like language called HiveQL for querying data and
supports advanced features such as user-defined functions, subqueries, and
joins.

Both HBase and Hive are powerful tools in the Hadoop ecosystem, and they
are often used together to provide a complete data management and analysis
solution. HBase is typically used for real-time data processing and low-
latency queries, while Hive is used for complex analytical queries and ad-hoc
data analysis.

11.1.6 Scripting language: Pig, Streaming: Flink, Storm

Pig, Flink, and Storm are popular scripting languages and streaming
frameworks used in the Hadoop ecosystem for big data processing and
analytics.

Pig is a high-level data processing language that is used to simplify the

development of MapReduce jobs. It provides a simple, easy-to-use syntax for
writing data processing pipelines. Pig supports a wide range of data formats
and provides a powerful set of operators for data manipulation. Pig programs
are compiled into MapReduce jobs, which are then executed on a Hadoop
cluster. Pig is often used for ETL (Extract, Transform, Load) tasks and data
preparation.
Flink is a real-time streaming framework that is designed for high-throughput
and low-latency data processing. Flink provides a distributed processing
engine that can process data in real-time as it arrives. It supports a wide range
of data sources and provides a powerful set of operators for data
manipulation. Flink supports a variety of programming languages, including
Java, Scala, and Python. Flink is often used for real-time analytics, machine
learning, and complex event processing.

Storm is another real-time streaming framework that is used for processing

large-scale data streams. Storm provides a distributed processing engine that
can process data in real-time and can scale to handle large volumes of data.
Storm supports a wide range of data sources and provides a powerful set of
operators for data manipulation. Storm is often used for real-time analytics,
machine learning, and event processing.

Both Flink and Storm support stream processing, whereas Pig supports batch
processing. Stream processing is useful in scenarios where data is generated
continuously and needs to be processed in real-time, such as sensor data or
social media feeds. Batch processing is useful in scenarios where large
volumes of data need to be processed in a non-real-time manner, such as ETL
jobs or data warehousing.

11.2 SPARK FRAMEWORK

Spark is a distributed computing framework that is used for big data
processing and analytics. It was developed at the University of California,
234
Big Data
Berkeley and is now maintained by the Apache Software Foundation. Spark
provides an in-memory data processing engine that can handle large-scale
data processing and analytics tasks. It supports a wide range of data sources
and provides a powerful set of operators for data manipulation. Spark is
designed to work with Hadoop Distributed File System (HDFS) and other
data sources, such as Apache Cassandra and Amazon S3.
Spark supports a variety of programming languages, including Java, Scala,
Python, and R. It provides a simple, easy-to-use API for data processing and
analytics tasks, and supports a wide range of applications, including real-time
stream processing, machine learning, and graph processing. Spark includes
several components, including:
• Spark Core: This is the fundamental computing engine of Spark and
provides the distributed task scheduling, memory management, and fault
tolerance features.
• Spark SQL: This is a module for structured data processing that allows
users to query structured data using SQL-like syntax.
• Spark Streaming: This is a module for processing real-time data
streams using Spark.
• Spark MLlib: This is a machine learning library for Spark that provides
a wide range of machine learning algorithms for tasks such as
classification, regression, clustering, and collaborative filtering.
• GraphX: This is a module for processing graph data using Spark.
Spark is known for its high-speed processing and scalability, and it has
become a popular choice for big data processing and analytics tasks. It is
often used in conjunction with Hadoop and other big data technologies to
provide a complete big data processing and analytics solution.

11.3 MACHINE LEARNING ALGORITHMS FOR

BIG DATA ANALYTICS
There are several machine learning algorithms that are commonly used for
big data analytics tasks. Here are some of the most popular ones:

1. Linear Regression: A popular algorithm used to model the relationship

between dependent and independent variables.
2. Logistic Regression: Used to model the probability of a binary outcome.
3. Decision Trees: Used to model decision-making processes and classify
data based on a series of rules.
4. Random Forest: An ensemble learning technique that combines
multiple decision trees to improve accuracy and reduce overfitting.
5. Naive Bayes: A probabilistic algorithm used for classification tasks
based on the Bayes theorem.
6. K-Nearest Neighbour’s (KNN): A non-parametric algorithm used for
classification and regression tasks that is based on the idea of finding the
k closest data points to a given input.
235
Emerging 7. Support Vector Machines (SVM): A popular algorithm used for
Technologies for
Business classification and regression tasks that involves finding the optimal
hyperplane that separates data points in a high-dimensional space.
8. Neural Networks: A family of algorithms used for various tasks, such
as classification, regression, and clustering, that mimic the structure and
function of the human brain.
9. Gradient Boosting: An ensemble learning technique that combines
multiple weak models to create a strong model.
10. Principal Component Analysis (PCA): A dimensionality reduction
technique that reduces the number of features in a dataset by finding the
most important features.

When dealing with big data, it is important to choose algorithms that are
scalable and can handle large amounts of data. Some of these algorithms,
such as KNN and SVM, can be memory-intensive and may not be suitable
for large datasets. In such cases, distributed computing frameworks like
Apache Spark can be used to handle the processing of big data.

11.4 RECENT TRENDS IN BIG DATA ANALYTICS

Big data analytics is an evolving field, and there are several recent trends that
are shaping the future of this domain. Here are some of the key trends in big
data analytics:

1. Real-time Analytics: Real-time data processing and analysis are

becoming increasingly important as businesses seek to make more
informed decisions based on up-to-date information.

2. Edge Computing: Edge computing involves processing data closer to

the source, rather than sending it to a centralized server or cloud. This
trend is gaining traction in industries such as healthcare and
manufacturing, where real-time insights are critical.

3. Cloud-based Analytics: Cloud-based analytics platforms are becoming

increasingly popular, as they offer flexibility, scalability, and cost-
effectiveness. Cloud platforms such as AWS, Azure, and Google Cloud
Platform offer a range of big data tools and services.
4. Artificial Intelligence and Machine Learning: Machine learning and
AI are becoming increasingly important in big data analytics, as they can
help automate data processing and analysis and provide more accurate
insights.

5. Data Privacy and Security: With the increasing amount of data being
collected and analysed, data privacy and security are becoming major
concerns. Businesses must ensure that they are compliant with data
protection regulations and that they are taking steps to protect sensitive
data.
6. Data Democratization: Data democratization involves making data
accessible to all stakeholders in an organization, enabling them to make
236
Big Data
data-driven decisions. This trend is gaining traction as businesses seek to
break down data silos and improve collaboration and communication
across teams.

7. Natural Language Processing (NLP): NLP is a field of AI that

involves analysing and interpreting human language. NLP is becoming
increasingly important in big data analytics, as it can help businesses
extract insights from unstructured data sources such as social media and
customer feedback.
These trends are shaping the future of big data analytics and will continue to
influence the development of new tools and technologies in this field.

11.5 SUMMARY
Big data refers to the large volume of structured and unstructured data that
inundates businesses on a daily basis. Big data analytics is the process of
collecting, processing, and analysing this data to gain insights and make
informed business decisions. The key characteristics of big data are
commonly summarized by the "3Vs": volume, velocity, and variety. To
handle big data, businesses require specialized tools and technologies, such
as the Hadoop ecosystem, which includes HDFS, MapReduce, and YARN, as
well as other technologies like Spark, HBase, and Hive. In addition to
handling the technical challenges of big data, businesses must also address
data privacy and security concerns, and ensure compliance with regulations
such as GDPR and CCPA.

Some of the key trends in big data analytics include real-time analytics, edge
computing, cloud-based analytics, artificial intelligence and machine
learning, data privacy and security, data democratization, and natural
language processing. Commonly, big data analytics has the potential to
provide businesses with valuable insights that can improve their operations,
customer experiences, and bottom lines.

11.6 SELF – ASSESSMENT EXCERCISES

1. What is big data?
2. What are the characteristics of big data?
3. How is big data analysed?
4. What are some applications of big data?
5. What are some challenges associated with big data?
6. What are some recent trends in big data analytics?
7. What are some popular big data platforms and technologies?
8. How is big data used in marketing?
9. What is the future of big data?

11.7 KEYWORDS
A glossary of commonly used terms in big data include: 237
Emerging 1. Big data: Refers to large volumes of structured and unstructured data
Technologies for
Business that inundate businesses on a daily basis.
2. Business intelligence: The use of data analysis tools and technologies to
gain insights into business performance and make informed decisions.
3. Cloud computing: The delivery of computing services, including
storage, processing, and analytics, over the internet.
4. Data cleaning: The process of identifying and correcting errors and
inconsistencies in data.
5. Data governance: The management of data assets, including policies,
procedures, and standards for data quality and security.
6. Data integration: The process of combining data from multiple sources
into a single, unified view.
7. Data lake: A centralized repository for storing large volumes of
structured and unstructured data in its native format.
8. Data mining: The process of extracting useful information from large
volumes of data.
9. Data pipeline: The process of moving data from its source to a
destination for storage, processing, or analysis.
10. Data privacy: The protection of sensitive and personal data from
unauthorized access or disclosure.
11. Data quality: The measure of the accuracy, completeness, and
consistency of data.
12. Data visualization: The process of creating visual representations of
data to aid in understanding and analysis.
13. Data warehousing: The process of collecting and storing data from
multiple sources to create a centralized repository for analysis.
14. Hadoop: A popular open-source big data framework used for storing
and processing large volumes of data.
15. Machine learning: A subset of AI that involves building algorithms and
models that can learn and make predictions based on data.
16. MapReduce: A programming model used to process large volumes of
data in parallel on a distributed system.
17. NoSQL: A non-relational database management system designed for
handling large volumes of unstructured data.
18. Predictive Analytics: The use of statistical models and machine learning
algorithms to make predictions about future events based on historical
data.
19. Spark: An open-source big data processing framework that allows for
fast, in-memory processing of large datasets.
20. Streaming: The process of analysing and processing real-time data as it
is generated.

238
Big Data
11.8 FURTHER READINGS
1. Provost, F., & Fawcett, T. (2013). Data science for business: What you
need to know about data mining and data-analytic thinking. O'Reilly
Media.
2. Zaharia, M., & Chambers, B. (2018). Spark: The definitive guide.
O'Reilly Media.
3. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive
datasets. Cambridge University Press.
4. Marz, N., & Warren, J. (2015). Big data: Principles and best practices of
scalable real-time data systems. Manning Publications.
5. Apache Hadoop: https://hadoop.apache.org/
6. Apache Spark: https://spark.apache.org/
7. Big Data University: https://bigdatauniversity.com/
8. Hortonworks: https://hortonworks.com/
9. Big Data Analytics News: https://www.bigdataanalyticsnews.com/
10. Data Science Central: https://www.datasciencecentral.com/

239

Big Data Analytics 1-5
100% (1)
Big Data Analytics 1-5
63 pages
Big Data Analytics - Lecture Slides
No ratings yet
Big Data Analytics - Lecture Slides
72 pages
Block-2-Unit 5
No ratings yet
Block-2-Unit 5
101 pages
Big Data Tools
No ratings yet
Big Data Tools
29 pages
Big Data Analytics Notes
No ratings yet
Big Data Analytics Notes
33 pages
Big Data S All Units
No ratings yet
Big Data S All Units
122 pages
Cours BI 23 24 Session 4 2
No ratings yet
Cours BI 23 24 Session 4 2
46 pages
Digital Library Proposal
100% (5)
Digital Library Proposal
7 pages
Stream Processing Chapter 2
No ratings yet
Stream Processing Chapter 2
21 pages
BigData UNIT-1
No ratings yet
BigData UNIT-1
19 pages
Unit 1 Big Data Notes
No ratings yet
Unit 1 Big Data Notes
48 pages
It (r20) 4-1 Big Data Analytics Digital Notes
No ratings yet
It (r20) 4-1 Big Data Analytics Digital Notes
84 pages
Big Data Answers
No ratings yet
Big Data Answers
11 pages
Unit 1 Big Data
No ratings yet
Unit 1 Big Data
124 pages
Big Data Complete Notes
No ratings yet
Big Data Complete Notes
33 pages
J Ijdsa 20241005 11
No ratings yet
J Ijdsa 20241005 11
14 pages
Data Analytics Notes Unit 1
No ratings yet
Data Analytics Notes Unit 1
23 pages
BDA Unit 1
No ratings yet
BDA Unit 1
39 pages
Bda 1
No ratings yet
Bda 1
26 pages
Understanding Big Data and NoSQL
No ratings yet
Understanding Big Data and NoSQL
31 pages
Big Data Processing A Review
No ratings yet
Big Data Processing A Review
8 pages
BIG Data1
No ratings yet
BIG Data1
49 pages
BIG DATA 1 Unit
100% (1)
BIG DATA 1 Unit
17 pages
Exam Ref AI-900 Microsoft Azure AI Fundame - Julian Sharp
100% (1)
Exam Ref AI-900 Microsoft Azure AI Fundame - Julian Sharp
371 pages
Big Data and Business Opportunities
100% (1)
Big Data and Business Opportunities
6 pages
Big Data - Unit-I
No ratings yet
Big Data - Unit-I
17 pages
MODULE 1 - ST
No ratings yet
MODULE 1 - ST
13 pages
Big Data A Comprehensive Overview
No ratings yet
Big Data A Comprehensive Overview
25 pages
Unit1 - BDH
No ratings yet
Unit1 - BDH
77 pages
Enterprise Integration Report
No ratings yet
Enterprise Integration Report
7 pages
Big Data Analytics
No ratings yet
Big Data Analytics
32 pages
Musfequr Rahman ID - 191051015
No ratings yet
Musfequr Rahman ID - 191051015
4 pages
BIG Data Analytics 21CSH-471: Computer Science & Engineering
No ratings yet
BIG Data Analytics 21CSH-471: Computer Science & Engineering
17 pages
Big Data With Cloud Computing Discussions and Challenges
No ratings yet
Big Data With Cloud Computing Discussions and Challenges
9 pages
Hadoop PPT
No ratings yet
Hadoop PPT
25 pages
Big Data Unit 1
No ratings yet
Big Data Unit 1
24 pages
Bda Unit1
No ratings yet
Bda Unit1
19 pages
Types of Digital Data: Unit 1 Big Data KCS-061
No ratings yet
Types of Digital Data: Unit 1 Big Data KCS-061
12 pages
What Is Big Data
No ratings yet
What Is Big Data
18 pages
Introduction To Big Data - Report 1
No ratings yet
Introduction To Big Data - Report 1
5 pages
Unit 1 Big Data Analytics Full
No ratings yet
Unit 1 Big Data Analytics Full
29 pages
Chapter 1
No ratings yet
Chapter 1
21 pages
BDAchap 1
No ratings yet
BDAchap 1
15 pages
Data Cleaning and Preprocessing Techniques
No ratings yet
Data Cleaning and Preprocessing Techniques
13 pages
Big Data Overview
No ratings yet
Big Data Overview
18 pages
Chapter 2 - Data Science
No ratings yet
Chapter 2 - Data Science
20 pages
Unit 1 and Unit 2 Notes Bda
No ratings yet
Unit 1 and Unit 2 Notes Bda
11 pages
Bigdata
No ratings yet
Bigdata
12 pages
BIG DATA Notes
No ratings yet
BIG DATA Notes
11 pages
Chapter 2-Data Science
No ratings yet
Chapter 2-Data Science
23 pages
Big Data: Spot Business Trends, Prevent Diseases, C Ombat Crime and So On"
No ratings yet
Big Data: Spot Business Trends, Prevent Diseases, C Ombat Crime and So On"
8 pages
Big Data and Hadoop Self Notes
No ratings yet
Big Data and Hadoop Self Notes
16 pages
CC Becse Unit 4 PDF
No ratings yet
CC Becse Unit 4 PDF
32 pages
Report On Big Data
No ratings yet
Report On Big Data
23 pages
Cb3401-Unit 1
100% (1)
Cb3401-Unit 1
28 pages
Big Data Unit 1 Notes - 240311 - 100703
No ratings yet
Big Data Unit 1 Notes - 240311 - 100703
15 pages
Big Data Summery
No ratings yet
Big Data Summery
9 pages
(IJCST-V9I6P1) :yew Kee Wong
No ratings yet
(IJCST-V9I6P1) :yew Kee Wong
7 pages
Big Data Mining: A Challenge and How To Manage It: Csa Deptt. Pdmce Jitender Csa Deptt. Pdmce
No ratings yet
Big Data Mining: A Challenge and How To Manage It: Csa Deptt. Pdmce Jitender Csa Deptt. Pdmce
3 pages
WEKA Lab Questions Answers
No ratings yet
WEKA Lab Questions Answers
5 pages
BCC (IEEE Format) Big Data
No ratings yet
BCC (IEEE Format) Big Data
2 pages
CC6012NIDataandWebDevelopment 93486
No ratings yet
CC6012NIDataandWebDevelopment 93486
18 pages
Azure Synapse Guidebook
100% (1)
Azure Synapse Guidebook
15 pages
I Jcs It 2015060405
No ratings yet
I Jcs It 2015060405
6 pages
DBMS Three Schema Architecture12
100% (1)
DBMS Three Schema Architecture12
3 pages
Cloud Computing Seminar Abstract
100% (6)
Cloud Computing Seminar Abstract
2 pages
Org Arw Imt SPR 00001 - 4.0 - 1
No ratings yet
Org Arw Imt SPR 00001 - 4.0 - 1
22 pages
International Journal of Engineering Research and Development (IJERD)
No ratings yet
International Journal of Engineering Research and Development (IJERD)
6 pages
Online Restaurant Reservation System Presentation
No ratings yet
Online Restaurant Reservation System Presentation
10 pages
Slowly Changing Dimensions
No ratings yet
Slowly Changing Dimensions
26 pages
Dumpshq Itil 4 Foundation Itil 4 Questions by Dunn 06 06 2022 7qa
No ratings yet
Dumpshq Itil 4 Foundation Itil 4 Questions by Dunn 06 06 2022 7qa
9 pages
01 Introduction
No ratings yet
01 Introduction
7 pages
Kriteria Penerimaan Dan Proses Pengiriman Artikel - Feri A.
No ratings yet
Kriteria Penerimaan Dan Proses Pengiriman Artikel - Feri A.
49 pages
BERT - PLI-Modeling Paragraph-Level Interactions For Legal Case Retrieval
No ratings yet
BERT - PLI-Modeling Paragraph-Level Interactions For Legal Case Retrieval
7 pages
ADO Lecture
No ratings yet
ADO Lecture
57 pages
Field: A Character or A Group of Characters (Alphabetic or Numeric) That Has A Specific Meaning. A Field Is Used To Define and Store Data
No ratings yet
Field: A Character or A Group of Characters (Alphabetic or Numeric) That Has A Specific Meaning. A Field Is Used To Define and Store Data
6 pages
DASHBOARD
No ratings yet
DASHBOARD
5 pages
Information Technology 402 Class X Term 2 Sample Paper 09
No ratings yet
Information Technology 402 Class X Term 2 Sample Paper 09
2 pages
Rancang Bangun Sistem Informasi E-Arsip Berbasis Microsoft Access
No ratings yet
Rancang Bangun Sistem Informasi E-Arsip Berbasis Microsoft Access
18 pages
mc4 PDF
No ratings yet
mc4 PDF
3 pages
Summary Chapter 1 - Database Concepts
No ratings yet
Summary Chapter 1 - Database Concepts
15 pages
Materials Management System - Pharmacy, Main Stores, and Purchasing
No ratings yet
Materials Management System - Pharmacy, Main Stores, and Purchasing
10 pages
Database Management System V1
No ratings yet
Database Management System V1
8 pages
PubChem Database BioInformatics Notes
No ratings yet
PubChem Database BioInformatics Notes
4 pages
JURNAL Surahman
No ratings yet
JURNAL Surahman
6 pages
Summer Training Synopsis PDF
No ratings yet
Summer Training Synopsis PDF
3 pages
Academic Entrepreneurship: Creating An Entrepreneurial Ecosystem
No ratings yet
Academic Entrepreneurship: Creating An Entrepreneurial Ecosystem
3 pages
Chapter 12
No ratings yet
Chapter 12
1 page
Hadoop Ecosystem for Big Data
From Everand
Hadoop Ecosystem for Big Data
Dr. Zemelak Goraga
No ratings yet
The Power of Big Data: Transforming Industries and Shaping the Future
From Everand
The Power of Big Data: Transforming Industries and Shaping the Future
Tom Henricksen
No ratings yet

Unit-11 Big Data

Uploaded by

Unit-11 Big Data

Uploaded by

Emerging

Technologies for UNIT 11 BIG DATA

11.0.1 Data Storage and Analysis

In addition to these specialized technologies and tools, data scientists and

11.0.2 Characteristics of Big Data

• Volume: Refers to the sheer amount of data generated by businesses,

Figure 1: Characteristics of Big Data

• Variability: Refers to the inconsistency in the data, which may result

• Structured, semi-structured, and unstructured: This classification is

By understanding the different classifications of big data, businesses and

11.0.4 Big Data Handling Techniques

• Distributed File Systems: Distributed file systems such as Hadoop

11.0.5 Types of Big Data Analytics

Table 1: Types of Big Data Analytics

Types of Answers Description Level of

To effectively implement big data analytics, organizations need to have a

11.0.6 Typical Analytical Architecture

• Data Governance and Security: This layer is responsible for ensuring

By implementing a big data analytical architecture, organizations can

Addressing these challenges requires a combination of technology, processes,

11.0.8 Case Studies: Big Data in Marketing and Sales,

• Medicine: Big data is being used in medicine to accelerate drug

• Advertising: Big data is being used in advertising to target and

11.1 HADOOP FRAMEWORK & ECOSYSTEM

• MapReduce: MapReduce is a programming model and software

• YARN: YARN (Yet Another Resource Negotiator) is a resource

• HBase: HBase is a NoSQL database that is used to store large volumes

• Pig: Pig is a high-level programming language that is used to process

• Hive: Hive is a data warehouse system that is built on top of Hadoop. It

• Spark: Spark is a fast and powerful data processing engine that is

As depicted in figure 2, Hadoop ecosystem provides a comprehensive and

Figure 2: Overview of architecture of data warehouse Hadoop ecosystem components

11.1.1 Requirement of Hadoop Framework

• Scalability: Traditional data processing systems were not designed to

• Cost-effectiveness: Traditional data processing systems can be

• Processing Speed: Traditional data processing systems were not

11.1.2 Map Reduce Framework

2. Reduce: The Reduce component takes the intermediate key-value pairs

The Hadoop execution model involves the following components:

• Resource Manager: The Resource Manager is responsible for managing

• Application Master: The Application Master is responsible for

• Container: A container is a virtualized environment in which an

The Hadoop execution model is designed to be highly scalable and fault-

11.1.4 Introduction to Hadoop Ecosystem Technologies

• Apache Spark: Apache Spark is an open-source big data processing

11.1.5 Databases: HBase, Hive

11.1.6 Scripting language: Pig, Streaming: Flink, Storm

Pig is a high-level data processing language that is used to simplify the

Storm is another real-time streaming framework that is used for processing

11.2 SPARK FRAMEWORK

11.3 MACHINE LEARNING ALGORITHMS FOR

1. Linear Regression: A popular algorithm used to model the relationship

11.4 RECENT TRENDS IN BIG DATA ANALYTICS

1. Real-time Analytics: Real-time data processing and analysis are

2. Edge Computing: Edge computing involves processing data closer to

3. Cloud-based Analytics: Cloud-based analytics platforms are becoming

7. Natural Language Processing (NLP): NLP is a field of AI that

11.6 SELF – ASSESSMENT EXCERCISES

You might also like