
UNIT 11 BIG DATA

Objectives
After studying this unit, you will be able to:
• Understand the concept of Big Data, its characteristics, and the challenges associated with it.
• Become familiar with the Hadoop ecosystem and its components.
• Understand the basics of MapReduce.
• Learn the utility of Pig, a high-level platform for creating MapReduce programs, to process and analyse data.
• Understand the basics of machine learning algorithms for Big Data analytics.

Structure
11.0 Introduction to Big Data
11.0.1 Data Storage and Analysis
11.0.2 Characteristics of Big Data
11.0.3 Big Data Classification
11.0.4 Big Data Handling Techniques
11.0.5 Types of Big Data Analytics
11.0.6 Typical Analytical Architecture
11.0.7 Challenges in Big Data Analytics
11.0.8 Case Studies: Big Data in Marketing and Sales, Healthcare, Medicine, and Advertising
11.1 Hadoop Framework & Ecosystem
11.1.1 Requirement of Hadoop Framework
11.1.2 MapReduce Framework
11.1.3 Hadoop YARN and Hadoop Execution Model
11.1.4 Introduction to Hadoop Ecosystem Technologies
11.1.5 Databases: HBase, Hive
11.1.6 Scripting language: Pig, Streaming: Flink, Storm
11.2 Spark Framework
11.3 Machine Learning Algorithms for Big Data Analytics
11.4 Recent Trends in Big Data Analytics
11.5 Summary
11.6 Self–Assessment Exercises
11.7 Keywords
11.8 Further Readings

11.0 INTRODUCTION TO BIG DATA
Big data refers to the vast amount of structured and unstructured data that is
generated and collected by individuals, organizations, and machines every
day. This data is too large and complex to be processed by traditional data
processing applications, which often have limitations in terms of their
capacity to store, process, and analyse large datasets.

To process and analyse big data, specialized technologies and tools such as
Hadoop, Spark, and NoSQL databases have been developed. These tools
allow organizations to store, process, and analyse large volumes of data
quickly and efficiently. The insights derived from big data can be used to
make informed decisions, identify trends and patterns, improve customer
experiences, and enhance operational efficiency.

11.0.1 Data Storage and Analysis


Data storage and analysis are two critical components of big data processing.
As mentioned earlier, big data is too large and complex to be processed by
traditional data processing applications, and therefore specialized
technologies and tools have been developed to store and analyse big data.
One popular technology for storing big data is Hadoop, which is an open-
source software framework that allows for distributed storage and processing
of large datasets across clusters of computers. Hadoop is designed to handle
both structured and unstructured data, and it uses a distributed file system
called Hadoop Distributed File System (HDFS) to store data across multiple
machines.
Once data is stored, it can be analysed using specialized tools such as Spark,
which is an open-source data processing engine that allows for fast and
efficient processing of large datasets. Spark uses an in-memory processing
model to speed up computations, and it can handle both batch and streaming
data processing. Another popular tool for big data analysis is NoSQL
databases, which are non-relational databases that can handle large volumes
of structured and unstructured data with high scalability and availability.

In addition to these specialized technologies and tools, data scientists and analysts can use programming languages such as Python and R to perform
data analysis and create visualizations to help interpret and communicate the
results of the analysis. Machine learning algorithms and artificial intelligence
techniques can also be used to derive insights from big data and make
predictions and recommendations.
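As a small illustration of this last point, the following Python sketch (not taken from the unit) uses the pandas library to summarize a dataset; the file name "sales.csv" and the columns region and revenue are assumptions made for the example.

    import pandas as pd

    df = pd.read_csv("sales.csv")                        # load structured data
    print(df.describe())                                 # summary statistics of numeric columns
    totals = df.groupby("region")["revenue"].sum()       # aggregate revenue by region
    print(totals.sort_values(ascending=False).head(10))  # top regions by total revenue

The same style of analysis scales up when the pandas calls are replaced by a distributed engine such as Spark, discussed later in this unit.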

11.0.2 Characteristics of Big Data


Big data can be characterized by several distinct features as presented in
figure 1, often referred to as "4 Vs", which are volume, velocity, variety, and
veracity.

• Volume: Refers to the sheer amount of data generated by businesses, individuals, and machines every day. With the increase in the use of IoT devices and social media, data volumes continue to grow at an unprecedented rate.
• Velocity: Refers to the speed at which data is generated, processed, and
analysed. Big data requires rapid processing of data, including both
streaming and batch processing.
• Variety: Refers to the diverse types of data generated by businesses,
individuals, and machines, including structured, semi-structured, and
unstructured data. Big data requires the ability to manage and analyse
various data types.
• Veracity: Refers to the accuracy and reliability of the data, which may
vary due to data quality issues, noise, and other factors. Big data requires
careful consideration of data quality and data cleansing techniques to
ensure accurate analysis.

Figure 1: Characteristics of Big Data

In addition to these 4 Vs, big data can also be characterized by several
other features, including:

• Variability: Refers to the inconsistency in the data, which may result from changes in the data sources, data formats, and data quality.
• Complexity: Refers to the difficulty in understanding and processing
large and diverse datasets, including handling unstructured data, dealing
with data integration challenges, and identifying patterns and trends.
• Accessibility: Refers to the ability to access and share data across
different platforms and systems while ensuring data privacy, security,
and compliance with regulatory requirements.
Understanding the characteristics of big data is essential for businesses and
organizations to effectively collect, manage, and analyse data to gain insights
and make informed decisions.

11.0.3 Big Data Classification
Big data can be classified based on several different criteria, such as the
source, the structure, the application, and the analytics approach. Here are
some common classifications of big data:

• Structured, semi-structured, and unstructured: This classification is based on the structure of the data. Structured data is well-organized and
easy to process, like data in a database. Semi-structured data, such as
XML or JSON, has a defined structure but may also contain unstructured
data. Unstructured data, such as social media posts or images, does not
have a specific structure and is difficult to process.
• Internal and external: This classification is based on the source of the
data. Internal data is generated within an organization, such as sales data
or customer data. External data comes from sources outside the
organization, such as social media data, weather data, or financial data.
• Batch and real-time: This classification is based on the velocity of the
data. Batch data processing involves analysing data in large batches,
often overnight or at set intervals. Real-time data processing involves
analysing data as it is generated, like processing stock market data in
real-time.
• Descriptive, diagnostic, predictive, and prescriptive: This
classification is based on the analytics approach. Descriptive analytics
involves summarizing historical data to understand what has happened.
Diagnostic analytics involves identifying the causes of a particular event
or pattern. Predictive analytics involves forecasting future events or
patterns based on historical data. Prescriptive analytics involves
recommending actions based on insights from predictive analytics.

By understanding the different classifications of big data, businesses and organizations can better plan and implement their big data strategies to extract insights and drive value from their data.

11.0.4 Big Data Handling Techniques


There are several techniques and tools available to handle big data
effectively. Here are some of the most common big data handling techniques:

• Distributed File Systems: Distributed file systems such as the Hadoop Distributed File System (HDFS) enable distributed storage and processing of big data across a cluster of computers.
• In-memory Data Processing: In-memory data processing systems such
as Apache Spark and Apache Flink allow for faster processing of big
data by storing the data in memory rather than on disk.
• NoSQL Databases: NoSQL databases such as MongoDB and Cassandra
are designed to handle unstructured and semi-structured data and provide
high scalability and availability.

• MapReduce: MapReduce is a programming model that is used to process large datasets in parallel across a cluster of computers.
• Data Compression: Data compression techniques such as gzip and
bzip2 can be used to reduce the size of data, making it easier to transfer
and store.
• Data Partitioning: Data partitioning involves dividing a large dataset
into smaller subsets to enable distributed processing.
• Cloud Computing: Cloud computing platforms such as Amazon Web
Services (AWS) and Microsoft Azure provide scalable and cost-effective
solutions for storing and processing big data.
• Machine learning: Machine learning techniques can be used to analyse
big data and identify patterns and insights that can help organizations
make informed decisions.

By using these techniques, businesses and organizations can handle big data
more effectively, extract insights, and derive value from their data.
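Two of these techniques, compression and partitioning, can be illustrated with a short Python sketch that uses only the standard library; the sample records, the file name "events.csv.gz", and the "country" field are invented for illustration.

    import csv
    import gzip
    from collections import defaultdict

    rows = [{"country": "IN", "amount": "120"}, {"country": "US", "amount": "75"}]

    # Compression: write the records to a gzip-compressed file to reduce storage size.
    with gzip.open("events.csv.gz", "wt", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["country", "amount"])
        writer.writeheader()
        writer.writerows(rows)

    # Partitioning: split the dataset into per-key subsets that could be
    # processed independently (and in parallel) on different machines.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["country"]].append(row)
    print({key: len(subset) for key, subset in partitions.items()})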

11.0.5 Types of Big Data Analytics


Big data analytics is the process of examining large and complex datasets to
uncover hidden patterns, unknown correlations, and other useful information
that can help organizations make informed decisions. Big data analytics
involves the use of advanced technologies and tools to process, store, and
analyse large volumes of structured, semi-structured, and unstructured data.
There are several types of big data analytics techniques as presented in table
1 below:

Table 1: Types of Big Data Analytics

Type of Analytics | Answers the Question | Description | Level of Advancement
Descriptive | What is happening? | Uses data aggregation and data mining techniques to provide insight into the past | Low
Diagnostic | Why is it happening? | Discovers the root cause of a problem and is able to isolate confounding information | Medium
Predictive | What is likely to happen? | Uses historical patterns to predict specific outcomes | High
Prescriptive | What do I need to do? | Applies advanced analytical techniques (optimization and simulation algorithms) to advise on possible outcomes and make specific recommendations | Very High

• Descriptive Analytics: This technique involves summarizing historical
data to understand what has happened in the past.
• Diagnostic Analytics: This technique involves analysing data to
determine the causes of a particular event or pattern.
• Predictive Analytics: This technique involves using statistical models
and machine learning algorithms to forecast future events or patterns
based on historical data.
• Prescriptive Analytics: This technique involves recommending actions
based on insights from predictive analytics.

To effectively implement big data analytics, organizations need to have a clear understanding of their data sources, objectives, and analytical tools. They also need to have the necessary infrastructure and skilled personnel to manage and analyse large and complex datasets.
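The contrast between descriptive and predictive analytics can be sketched in a few lines of Python; the monthly sales figures below are made up, and the scikit-learn library is assumed to be installed.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    sales = np.array([100, 110, 125, 140, 150, 165], dtype=float)  # past six months (illustrative)
    months = np.arange(len(sales)).reshape(-1, 1)

    # Descriptive analytics: summarize what has already happened.
    print("average monthly sales:", sales.mean())

    # Predictive analytics: fit a simple trend and forecast the next month.
    model = LinearRegression().fit(months, sales)
    print("forecast for next month:", model.predict([[len(sales)]])[0])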

11.0.6 Typical Analytical Architecture


A typical analytical architecture for big data includes several layers, each of
which serves a specific purpose in the data analytics process. Here are the
typical layers of a big data analytical architecture:

• Data Sources: This layer includes all the sources of data, both internal
and external, that an organization collects and stores. These may include
data from customer transactions, social media, web logs, sensors, and
other sources.
• Data Ingestion and Storage: This layer is responsible for ingesting data
from various sources, processing it, and storing it in a format that can be
easily accessed and analysed. This layer may include technologies such
as Hadoop Distributed File System (HDFS) and NoSQL databases.
• Data Processing and Preparation: This layer is responsible for
cleaning, transforming, and preparing data for analysis. This may include
tasks such as data integration, data cleaning, data normalization, and data
aggregation.

• Analytics Engines: This layer includes the technologies and tools used
for analysing and processing data. This may include machine learning
algorithms, statistical analysis tools, and visualization tools.

• Data Presentation and Visualization: This layer includes the tools used
to present data in a meaningful way, such as dashboards, reports, and
visualizations. This layer is critical for making data accessible and
understandable to non-technical stakeholders.

• Data Governance and Security: This layer is responsible for ensuring that data is managed in a secure and compliant manner. This may include access controls, data quality monitoring, and compliance with data privacy regulations.

By implementing a big data analytical architecture, organizations can streamline the data analytics process, extract valuable insights, and make informed decisions that drive business growth and success.
11.0.7 Challenges in Big Data Analytics
Big data analytics is a complex and challenging process, and there are several
challenges that organizations face when trying to extract insights and value
from their data. Here are some of the key challenges in big data analytics:

• Data Complexity and Variety: Big data comes in many different forms,
including structured, semi-structured, and unstructured data, which can
be challenging to process and analyse.
• Data Quality: Big data is often incomplete, inconsistent, or inaccurate,
which can lead to erroneous insights and conclusions.
• Data Security and Privacy: Big data often contains sensitive and
confidential information, which must be protected from unauthorized
access and breaches.
• Scalability: As data volumes grow, the analytical architecture must be
able to scale to handle the increased load, which can be challenging and
costly.
• Talent Shortage: There is a shortage of skilled data scientists and
analysts who are able to process and analyse big data effectively.
• Integration: Big data analytics requires integration with multiple
systems and technologies, which can be challenging to implement and
maintain.
• Data Governance: Big data requires careful management and
governance to ensure compliance with regulations and policies.
• Interpreting Results: Big data analytics often produces large and
complex datasets, which can be challenging to interpret and translate into
actionable insights.

Addressing these challenges requires a combination of technology, processes, and people. Organizations need to invest in robust analytical architectures, data quality processes, security and privacy protocols, and talent development to unlock the full potential of big data analytics.

11.0.8 Case Studies: Big Data in Marketing and Sales, Healthcare, Medicine, and Advertising

Presented here are some examples of how big data is being used in marketing and sales, healthcare, medicine, and advertising:

• Marketing and Sales: Big data is being used in marketing and sales to
understand customer behaviour and preferences, personalize marketing
messages, and optimize pricing and promotions. For example, Amazon
uses big data to personalize recommendations for individual customers
based on their browsing and purchase history. Walmart uses big data to
optimize pricing and inventory management in its stores. Coca-Cola uses
big data to optimize its vending machine placement, prices, and
promotions based on local weather conditions, events, and consumer
behaviour.
• Healthcare: Big data is being used in healthcare to improve patient
outcomes, reduce costs, and enable personalized medicine. For example,
IBM's Watson Health is using big data to develop personalized cancer
treatments based on a patient's genetic profile and medical history.
Hospitals and healthcare providers are using big data to predict patient
readmission rates, identify patients at risk of developing chronic
conditions, and optimize resource allocation.

• Medicine: Big data is being used in medicine to accelerate drug discovery and development, identify new treatments and therapies, and
improve clinical trial design. For example, Pfizer is using big data
analytics to mine existing drug data and identify new therapeutic targets.
Novartis is using big data to improve the efficiency of its clinical trials
and accelerate the development of new drugs.

• Advertising: Big data is being used in advertising to target and personalize ads based on individual consumer preferences and behaviour.
For example, Google and Facebook use big data to target ads to specific
audiences based on their browsing and search history. Netflix uses big
data to recommend movies and TV shows to individual users based on
their viewing history and preferences.
These are just a few examples of how big data is being used in various
industries and sectors. Big data has the potential to transform the way
organizations operate and make decisions, leading to improved efficiency,
productivity, and innovation.

11.1 HADOOP FRAMEWORK & ECOSYSTEM


Hadoop is an open-source framework that is used for storing and processing
large volumes of data across distributed systems. It was originally developed
by Doug Cutting and Mike Cafarella in 2006 and is now maintained by the
Apache Software Foundation. The Hadoop ecosystem consists of several
components, including:
• Hadoop Distributed File System (HDFS): HDFS is a distributed file
system that stores data across multiple nodes in a cluster. It is designed
to handle large files and is fault-tolerant, ensuring that data is always
available even in the event of a hardware or software failure.

• MapReduce: MapReduce is a programming model and software framework for processing large data sets across distributed systems. It is
used to parallelize data processing tasks and distribute them across
multiple nodes in a Hadoop cluster.

• YARN: YARN (Yet Another Resource Negotiator) is a resource management system that is used to manage resources in a Hadoop
cluster. It enables the sharing of resources across multiple applications,
making it easier to run multiple applications simultaneously on a Hadoop
cluster.

• HBase: HBase is a NoSQL database that is used to store large volumes of structured data. It is built on top of Hadoop and provides real-time access to data stored in HDFS.

• Pig: Pig is a high-level programming language that is used to process large datasets in Hadoop. It is designed to simplify the programming of
MapReduce jobs and provides a user-friendly interface for data
processing tasks.

• Hive: Hive is a data warehouse system that is built on top of Hadoop. It provides a SQL-like interface for querying large datasets stored in
HDFS.

• Spark: Spark is a fast and powerful data processing engine that is designed to handle both batch and real-time processing. It is built on top
of Hadoop and provides a unified platform for data processing, machine
learning, and graph processing.

As depicted in Figure 2, the Hadoop ecosystem provides a comprehensive and powerful platform for storing and processing large volumes of data across
distributed systems. It is widely used in industries such as finance, healthcare,
retail, and telecommunications for data processing, analysis, and machine
learning.

Figure 2: Overview of architecture of data warehouse Hadoop ecosystem components


Source: https://www.researchgate.net/figure/Overview-of-architecture-of-data-warehouse-Hadoop-ecosystem-components_fig3_346482337

11.1.1 Requirement of Hadoop Framework


The Hadoop framework was developed to address the challenges of
processing and analysing large volumes of data that traditional data
processing systems were not designed to handle. Here are some key
requirements that led to the development of Hadoop:

• Scalability: Traditional data processing systems were not designed to handle the massive volumes of data being generated today. Hadoop was
designed to be scalable, allowing organizations to store and process large
volumes of data across distributed systems.
• Fault-tolerance: As data volumes grow, the probability of hardware or
software failures also increases. Hadoop was designed to be fault-
tolerant, ensuring that data is always available even in the event of a
hardware or software failure.

• Cost-effectiveness: Traditional data processing systems can be expensive to scale, both in terms of hardware and software. Hadoop was
designed to be cost-effective, using commodity hardware and open-
source software to reduce costs.
• Flexibility: Traditional data processing systems were designed for
specific types of data and processing tasks. Hadoop was designed to be
flexible, allowing organizations to store and process a wide variety of
data types and perform a range of processing tasks.

• Processing Speed: Traditional data processing systems were not designed to handle real-time data processing and analysis. Hadoop was
designed to be fast, using parallel processing and distributed computing
to speed up data processing and analysis.
Overall, the Hadoop framework was developed to meet the growing demand
for processing and analysing large volumes of data in a cost-effective,
flexible, and scalable manner. Its architecture and ecosystem have enabled
organizations to develop new applications and use cases for big data
processing and analysis.

11.1.2 MapReduce Framework


MapReduce is a programming model and software framework that is used to
process and analyse large datasets in parallel across a distributed computing
cluster. The MapReduce framework consists of two main components: Map
and Reduce.
1. Map: The Map component is responsible for processing the input data
and producing a set of key-value pairs as output. Each Map task
processes a portion of the input data and generates intermediate key-
value pairs, which are then passed on to the Reduce tasks.

2. Reduce: The Reduce component takes the intermediate key-value pairs generated by the Map tasks and combines them to produce a final set of
key-value pairs as output. Each Reduce task processes a subset of the
intermediate data generated by the Map tasks, which are grouped by key.
Key features of the MapReduce framework include scalability, fault
tolerance, data locality, and ease of use. Some popular implementations of
the MapReduce framework include Apache Hadoop, Apache Spark, and
Apache Flink. These frameworks provide a range of tools and libraries that
can be used to build complex data processing workflows using the
MapReduce programming model.
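A minimal, single-machine Python sketch of the classic word-count example illustrates the two phases described above. In a real Hadoop job the map and reduce functions would run in parallel on many nodes; here the map, shuffle, and reduce phases are simply simulated in memory.

    from collections import defaultdict

    documents = ["big data needs big tools", "data tools for big data"]

    # Map phase: emit (key, value) pairs -- here (word, 1) for every word.
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle phase: group the intermediate values by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: combine the values for each key into a final result.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)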

11.1.3 Hadoop YARN and Hadoop Execution Model
Hadoop YARN (Yet Another Resource Negotiator) is a resource
management layer that sits between the Hadoop Distributed File System
(HDFS) and the processing engines, such as MapReduce, Spark, and Tez. It
provides a central platform for managing cluster resources, allocating
resources to different applications, and scheduling jobs across a cluster.

The Hadoop execution model involves the following components:

• Client: The client submits a job to the YARN Resource Manager (RM),
which schedules it across the cluster.

• Resource Manager: The Resource Manager is responsible for managing the resources in the cluster and scheduling jobs. It allocates resources to
each job based on the application requirements and the available
resources in the cluster.

• Node Manager: The Node Manager runs on each node in the cluster and
is responsible for managing the resources on that node, such as CPU,
memory, and disk space. It reports the available resources back to the
Resource Manager, which uses this information to allocate resources to
different applications.

• Application Master: The Application Master is responsible for managing the lifecycle of a specific application, such as MapReduce or
Spark. It negotiates with the Resource Manager for resources and
monitors the progress of the application.

• Container: A container is a virtualized environment in which an application runs. It provides an isolated environment for the application
to run in, with its own allocated resources, such as CPU and memory.

The Hadoop execution model is designed to be highly scalable and fault-tolerant. It allows multiple applications to run concurrently on the same
cluster, with each application running in its own container. If a node fails, the
Resource Manager can redistribute the workload to other nodes in the cluster,
ensuring that the application continues to run without interruption. Largely,
the Hadoop execution model and YARN are essential components of the
Hadoop ecosystem, providing a powerful and flexible platform for processing
and analysing large datasets.

11.1.4 Introduction to Hadoop Ecosystem Technologies


The Hadoop ecosystem is a collection of open-source software tools and
frameworks that are built on top of the Hadoop Distributed File System
(HDFS) and the Hadoop MapReduce programming model. The Hadoop
ecosystem includes a wide range of tools and technologies for data
processing, storage, management, and analysis. Here are some of the most
popular Hadoop ecosystem technologies:

• Apache Spark: Apache Spark is an open-source big data processing framework that is built on top of Hadoop. It provides a faster and more flexible alternative to MapReduce, with support for real-time data processing, machine learning, and graph processing.
• Apache Hive: Apache Hive is a data warehouse system for querying and
analysing large datasets stored in Hadoop. It provides a SQL-like
interface for querying data and supports a range of data formats,
including structured and semi-structured data.
• Apache Pig: Apache Pig is a high-level data processing language that is
used to simplify the development of MapReduce jobs. It provides a
simple, easy-to-use syntax for writing data processing pipelines.
• Apache HBase: Apache HBase is a NoSQL database that is built on top
of Hadoop. It provides real-time random read and write access to large
datasets and supports low-latency queries.
• Apache ZooKeeper: Apache ZooKeeper is a centralized service for
maintaining configuration information, naming, providing distributed
synchronization, and group services.
• Apache Oozie: Apache Oozie is a workflow scheduling system that is
used to manage Hadoop jobs. It provides a way to specify dependencies
between jobs and ensures that jobs are executed in the correct order.
• Apache Flume: Apache Flume is a distributed system for collecting,
aggregating, and moving large amounts of log data from various sources
to Hadoop.
• Apache Sqoop: Apache Sqoop is a tool for transferring data between
Hadoop and structured data stores, such as relational databases.

These are just a few examples of the many tools and technologies that are
available in the Hadoop ecosystem. Each of these technologies is designed to
address specific challenges and use cases in big data processing and
analytics. By leveraging the Hadoop ecosystem, organizations can build
powerful, scalable, and cost-effective data processing and analytics solutions.

11.1.5 Databases: HBase, Hive


HBase and Hive are two popular databases in the Hadoop ecosystem that are
used for storing and processing large-scale datasets.

HBase is a NoSQL database that is designed for storing and managing large
volumes of unstructured and semi-structured data in Hadoop. It provides real-
time random read and write access to large datasets, making it ideal for use
cases that require low-latency queries and high-throughput data processing.
HBase is modelled after Google's Bigtable database and is built on top of
Hadoop Distributed File System (HDFS). HBase uses a column-oriented data
model, which allows for efficient storage and retrieval of data, and provides a
powerful API for data manipulation.
Hive, on the other hand, is a data warehouse system for querying and
analysing large datasets stored in Hadoop. It provides a SQL-like interface
for querying data and supports a range of data formats, including structured
and semi-structured data. Hive is modelled after the SQL language, making it
easy for users with SQL experience to work with large-scale datasets in Hadoop. Hive uses a metadata-driven approach to data management, which allows for easy integration with other tools in the Hadoop ecosystem. Hive
provides a powerful SQL-like language called HiveQL for querying data and
supports advanced features such as user-defined functions, subqueries, and
joins.

Both HBase and Hive are powerful tools in the Hadoop ecosystem, and they
are often used together to provide a complete data management and analysis
solution. HBase is typically used for real-time data processing and low-
latency queries, while Hive is used for complex analytical queries and ad-hoc
data analysis.
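The following Python sketch is a hedged illustration only: it assumes an HBase Thrift server and a HiveServer2 instance are already running on localhost, that the happybase and PyHive client libraries are installed, and that the table, column-family, and user names shown are hypothetical.

    import happybase
    from pyhive import hive

    # HBase: low-latency random reads and writes on a column-oriented table.
    hbase_conn = happybase.Connection("localhost")       # Thrift server assumed on the default port
    users = hbase_conn.table("users")                    # hypothetical table with a 'profile' column family
    users.put(b"user-001", {b"profile:name": b"Asha", b"profile:city": b"Pune"})
    print(users.row(b"user-001"))

    # Hive: SQL-like (HiveQL) queries over large datasets stored in HDFS.
    hive_conn = hive.Connection(host="localhost", port=10000, username="hive", database="default")
    cursor = hive_conn.cursor()
    cursor.execute("SELECT city, COUNT(*) FROM user_events GROUP BY city")   # hypothetical table
    print(cursor.fetchall())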

11.1.6 Scripting language: Pig, Streaming: Flink, Storm


Pig is a popular scripting language, and Flink and Storm are streaming frameworks; all three are used in the Hadoop ecosystem for big data processing and analytics.

Pig is a high-level data processing language that is used to simplify the development of MapReduce jobs. It provides a simple, easy-to-use syntax for
writing data processing pipelines. Pig supports a wide range of data formats
and provides a powerful set of operators for data manipulation. Pig programs
are compiled into MapReduce jobs, which are then executed on a Hadoop
cluster. Pig is often used for ETL (Extract, Transform, Load) tasks and data
preparation.
Flink is a real-time streaming framework that is designed for high-throughput
and low-latency data processing. Flink provides a distributed processing
engine that can process data in real-time as it arrives. It supports a wide range
of data sources and provides a powerful set of operators for data
manipulation. Flink supports a variety of programming languages, including
Java, Scala, and Python. Flink is often used for real-time analytics, machine
learning, and complex event processing.

Storm is another real-time streaming framework that is used for processing large-scale data streams. Storm provides a distributed processing engine that
can process data in real-time and can scale to handle large volumes of data.
Storm supports a wide range of data sources and provides a powerful set of
operators for data manipulation. Storm is often used for real-time analytics,
machine learning, and event processing.

Both Flink and Storm support stream processing, whereas Pig supports batch
processing. Stream processing is useful in scenarios where data is generated
continuously and needs to be processed in real-time, such as sensor data or
social media feeds. Batch processing is useful in scenarios where large
volumes of data need to be processed in a non-real-time manner, such as ETL
jobs or data warehousing.
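The stream-processing idea can be illustrated without any particular framework. The Python sketch below simulates a stream of events and keeps a running count up to date as each event arrives; this per-event, stateful processing is the core idea that Flink and Storm implement at scale, while a batch tool such as Pig would instead process the whole dataset at once.

    from collections import Counter

    def event_stream():
        # Stand-in for a continuous source such as sensor readings or social media posts.
        for event in ["click", "view", "click", "purchase", "click"]:
            yield event

    running_counts = Counter()
    for event in event_stream():
        running_counts[event] += 1        # update state per event, in arrival order
        print(dict(running_counts))       # the "result so far" after each event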

11.2 SPARK FRAMEWORK


Spark is a distributed computing framework that is used for big data
processing and analytics. It was developed at the University of California,
Berkeley and is now maintained by the Apache Software Foundation. Spark
provides an in-memory data processing engine that can handle large-scale
data processing and analytics tasks. It supports a wide range of data sources
and provides a powerful set of operators for data manipulation. Spark is
designed to work with Hadoop Distributed File System (HDFS) and other
data sources, such as Apache Cassandra and Amazon S3.
Spark supports a variety of programming languages, including Java, Scala,
Python, and R. It provides a simple, easy-to-use API for data processing and
analytics tasks, and supports a wide range of applications, including real-time
stream processing, machine learning, and graph processing. Spark includes
several components, including:
• Spark Core: This is the fundamental computing engine of Spark and
provides the distributed task scheduling, memory management, and fault
tolerance features.
• Spark SQL: This is a module for structured data processing that allows
users to query structured data using SQL-like syntax.
• Spark Streaming: This is a module for processing real-time data
streams using Spark.
• Spark MLlib: This is a machine learning library for Spark that provides
a wide range of machine learning algorithms for tasks such as
classification, regression, clustering, and collaborative filtering.
• GraphX: This is a module for processing graph data using Spark.
Spark is known for its high-speed processing and scalability, and it has
become a popular choice for big data processing and analytics tasks. It is
often used in conjunction with Hadoop and other big data technologies to
provide a complete big data processing and analytics solution.
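A minimal PySpark sketch of this style of processing is shown below. It assumes the pyspark package is installed and that a hypothetical file "transactions.csv" with columns customer_id and amount exists; it is an illustrative sketch rather than a complete application.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("TransactionSummary").getOrCreate()

    # Load a CSV file into a distributed DataFrame.
    df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

    # Spark SQL-style aggregation: total spend per customer, highest first.
    totals = (df.groupBy("customer_id")
                .agg(F.sum("amount").alias("total_amount"))
                .orderBy(F.desc("total_amount")))
    totals.show(10)

    spark.stop()

The same code runs unchanged on a laptop or on a multi-node cluster, which is a key reason for Spark's popularity in big data analytics.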

11.3 MACHINE LEARNING ALGORITHMS FOR BIG DATA ANALYTICS

There are several machine learning algorithms that are commonly used for big data analytics tasks. Here are some of the most popular ones:

1. Linear Regression: A popular algorithm used to model the relationship between dependent and independent variables.
2. Logistic Regression: Used to model the probability of a binary outcome.
3. Decision Trees: Used to model decision-making processes and classify
data based on a series of rules.
4. Random Forest: An ensemble learning technique that combines
multiple decision trees to improve accuracy and reduce overfitting.
5. Naive Bayes: A probabilistic algorithm used for classification tasks
based on the Bayes theorem.
6. K-Nearest Neighbours (KNN): A non-parametric algorithm used for
classification and regression tasks that is based on the idea of finding the
k closest data points to a given input.
7. Support Vector Machines (SVM): A popular algorithm used for classification and regression tasks that involves finding the optimal
hyperplane that separates data points in a high-dimensional space.
8. Neural Networks: A family of algorithms used for various tasks, such
as classification, regression, and clustering, that mimic the structure and
function of the human brain.
9. Gradient Boosting: An ensemble learning technique that combines
multiple weak models to create a strong model.
10. Principal Component Analysis (PCA): A dimensionality reduction
technique that reduces the number of features in a dataset by finding the
most important features.

When dealing with big data, it is important to choose algorithms that are
scalable and can handle large amounts of data. Some of these algorithms,
such as KNN and SVM, can be memory-intensive and may not be suitable
for large datasets. In such cases, distributed computing frameworks like
Apache Spark can be used to handle the processing of big data.
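As a hedged sketch of this idea, the example below trains one of the algorithms listed above (logistic regression) with Spark MLlib, so that the same code could scale to much larger, distributed datasets; the tiny in-memory dataset and the column names income, visits, and label are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("BigDataML").getOrCreate()

    # Toy training data: two numeric features and a binary label.
    data = spark.createDataFrame(
        [(0.5, 1.2, 0.0), (1.5, 0.3, 1.0), (2.1, 0.8, 1.0), (0.2, 1.9, 0.0)],
        ["income", "visits", "label"],
    )

    # Assemble the feature columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=["income", "visits"], outputCol="features")
    train = assembler.transform(data)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show()

    spark.stop()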

11.4 RECENT TRENDS IN BIG DATA ANALYTICS


Big data analytics is an evolving field, and there are several recent trends that
are shaping the future of this domain. Here are some of the key trends in big
data analytics:

1. Real-time Analytics: Real-time data processing and analysis are becoming increasingly important as businesses seek to make more
informed decisions based on up-to-date information.

2. Edge Computing: Edge computing involves processing data closer to the source, rather than sending it to a centralized server or cloud. This
trend is gaining traction in industries such as healthcare and
manufacturing, where real-time insights are critical.

3. Cloud-based Analytics: Cloud-based analytics platforms are becoming increasingly popular, as they offer flexibility, scalability, and cost-
effectiveness. Cloud platforms such as AWS, Azure, and Google Cloud
Platform offer a range of big data tools and services.
4. Artificial Intelligence and Machine Learning: Machine learning and
AI are becoming increasingly important in big data analytics, as they can
help automate data processing and analysis and provide more accurate
insights.

5. Data Privacy and Security: With the increasing amount of data being
collected and analysed, data privacy and security are becoming major
concerns. Businesses must ensure that they are compliant with data
protection regulations and that they are taking steps to protect sensitive
data.
6. Data Democratization: Data democratization involves making data
accessible to all stakeholders in an organization, enabling them to make
data-driven decisions. This trend is gaining traction as businesses seek to
break down data silos and improve collaboration and communication
across teams.

7. Natural Language Processing (NLP): NLP is a field of AI that involves analysing and interpreting human language. NLP is becoming
increasingly important in big data analytics, as it can help businesses
extract insights from unstructured data sources such as social media and
customer feedback.
These trends are shaping the future of big data analytics and will continue to
influence the development of new tools and technologies in this field.

11.5 SUMMARY
Big data refers to the large volume of structured and unstructured data that
inundates businesses on a daily basis. Big data analytics is the process of
collecting, processing, and analysing this data to gain insights and make
informed business decisions. The key characteristics of big data are
commonly summarized by the "4 Vs": volume, velocity, variety, and veracity. To
handle big data, businesses require specialized tools and technologies, such
as the Hadoop ecosystem, which includes HDFS, MapReduce, and YARN, as
well as other technologies like Spark, HBase, and Hive. In addition to
handling the technical challenges of big data, businesses must also address
data privacy and security concerns, and ensure compliance with regulations
such as GDPR and CCPA.

Some of the key trends in big data analytics include real-time analytics, edge
computing, cloud-based analytics, artificial intelligence and machine
learning, data privacy and security, data democratization, and natural
language processing. Overall, big data analytics has the potential to
provide businesses with valuable insights that can improve their operations,
customer experiences, and bottom lines.

11.6 SELF-ASSESSMENT EXERCISES


1. What is big data?
2. What are the characteristics of big data?
3. How is big data analysed?
4. What are some applications of big data?
5. What are some challenges associated with big data?
6. What are some recent trends in big data analytics?
7. What are some popular big data platforms and technologies?
8. How is big data used in marketing?
9. What is the future of big data?

11.7 KEYWORDS
A glossary of commonly used terms in big data includes:
1. Big data: Refers to large volumes of structured and unstructured data that inundate businesses on a daily basis.
2. Business intelligence: The use of data analysis tools and technologies to
gain insights into business performance and make informed decisions.
3. Cloud computing: The delivery of computing services, including
storage, processing, and analytics, over the internet.
4. Data cleaning: The process of identifying and correcting errors and
inconsistencies in data.
5. Data governance: The management of data assets, including policies,
procedures, and standards for data quality and security.
6. Data integration: The process of combining data from multiple sources
into a single, unified view.
7. Data lake: A centralized repository for storing large volumes of
structured and unstructured data in its native format.
8. Data mining: The process of extracting useful information from large
volumes of data.
9. Data pipeline: The process of moving data from its source to a
destination for storage, processing, or analysis.
10. Data privacy: The protection of sensitive and personal data from
unauthorized access or disclosure.
11. Data quality: The measure of the accuracy, completeness, and
consistency of data.
12. Data visualization: The process of creating visual representations of
data to aid in understanding and analysis.
13. Data warehousing: The process of collecting and storing data from
multiple sources to create a centralized repository for analysis.
14. Hadoop: A popular open-source big data framework used for storing
and processing large volumes of data.
15. Machine learning: A subset of AI that involves building algorithms and
models that can learn and make predictions based on data.
16. MapReduce: A programming model used to process large volumes of
data in parallel on a distributed system.
17. NoSQL: A non-relational database management system designed for
handling large volumes of unstructured data.
18. Predictive Analytics: The use of statistical models and machine learning
algorithms to make predictions about future events based on historical
data.
19. Spark: An open-source big data processing framework that allows for
fast, in-memory processing of large datasets.
20. Streaming: The process of analysing and processing real-time data as it
is generated.

11.8 FURTHER READINGS
1. Provost, F., & Fawcett, T. (2013). Data science for business: What you
need to know about data mining and data-analytic thinking. O'Reilly
Media.
2. Zaharia, M., & Chambers, B. (2018). Spark: The definitive guide.
O'Reilly Media.
3. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive
datasets. Cambridge University Press.
4. Marz, N., & Warren, J. (2015). Big data: Principles and best practices of
scalable real-time data systems. Manning Publications.
5. Apache Hadoop: https://hadoop.apache.org/
6. Apache Spark: https://spark.apache.org/
7. Big Data University: https://bigdatauniversity.com/
8. Hortonworks: https://hortonworks.com/
9. Big Data Analytics News: https://www.bigdataanalyticsnews.com/
10. Data Science Central: https://www.datasciencecentral.com/
