Unit-11 Big Data
Objectives
After studying this unit, you will be able to:
• Understand the concept of big data, its characteristics, and the challenges associated with it.
• Become familiar with the Hadoop ecosystem and its components.
• Understand the basics of MapReduce.
• Learn how Pig, a high-level platform for creating MapReduce programs, is used to process and analyse data.
• Understand the basics of machine learning algorithms for big data analytics.
Structure
11.0 Introduction to Big Data
11.0.1 Data Storage and Analysis
11.0.2 Characteristics of Big Data
11.0.3 Big Data Classification
11.0.4 Big Data Handling Techniques
11.0.5 Types of Big Data Analytics
11.0.6 Typical Analytical Architecture
11.0.7 Challenges in Big Data Analytics
11.0.8 Case Studies: Big Data in Marketing and Sales, Healthcare, Medicine, and Advertising
11.1 Hadoop Framework & Ecosystem
11.1.1 Requirement of Hadoop Framework
11.1.2 MapReduce Framework
11.1.3 Hadoop YARN and Hadoop Execution Model
11.1.4 Introduction to Hadoop Ecosystem Technologies
11.1.5 Databases: HBase, Hive
11.1.6 Scripting language: Pig, Streaming: Flink, Storm
11.2 Spark Framework
11.3 Machine Learning Algorithms for Big Data Analytics
11.4 Recent Trends in Big Data Analytics
11.5 Summary
11.6 Self–Assessment Exercises
11.7 Keywords
11.8 Further Readings
11.0 INTRODUCTION TO BIG DATA
Big data refers to the vast amount of structured and unstructured data that is generated and collected by individuals, organizations, and machines every day. This data is too large and complex to be processed by traditional data processing applications, which are limited in their capacity to store, process, and analyse large datasets.
To process and analyse big data, specialized technologies and tools such as
Hadoop, Spark, and NoSQL databases have been developed. These tools
allow organizations to store, process, and analyse large volumes of data
quickly and efficiently. The insights derived from big data can be used to
make informed decisions, identify trends and patterns, improve customer
experiences, and enhance operational efficiency.
In addition to the commonly cited Vs (volume, velocity, and variety, sometimes extended with a fourth V, veracity), big data can also be characterized by several other features.
11.0.3 Big Data Classification
Big data can be classified based on several different criteria, such as the source, the structure, the application, and the analytics approach.
11.0.4 Big Data Handling Techniques
Several techniques are commonly used to handle big data (see the word-count sketch after this list):
• MapReduce: MapReduce is a programming model that is used to process large datasets in parallel across a cluster of computers.
• Data Compression: Data compression techniques such as gzip and
bzip2 can be used to reduce the size of data, making it easier to transfer
and store.
• Data Partitioning: Data partitioning involves dividing a large dataset
into smaller subsets to enable distributed processing.
• Cloud Computing: Cloud computing platforms such as Amazon Web
Services (AWS) and Microsoft Azure provide scalable and cost-effective
solutions for storing and processing big data.
• Machine learning: Machine learning techniques can be used to analyse
big data and identify patterns and insights that can help organizations
make informed decisions.
By using these techniques, businesses and organizations can handle big data
more effectively, extract insights, and derive value from their data.
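To make the MapReduce model concrete, below is a minimal word-count sketch in plain Python that simulates the map, shuffle, and reduce phases on a toy input. It is an illustration of the programming model only; a real Hadoop job would run the same map and reduce logic in parallel across many nodes.

    # A word-count sketch that mimics the MapReduce phases in plain Python.
    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the input.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(word, counts):
        # Reduce: sum all counts emitted for the same word.
        return (word, sum(counts))

    documents = ["big data needs big tools", "data drives decisions"]

    # Shuffle: group the mapped pairs by key, as the framework would.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    for word in sorted(grouped):
        print(reduce_phase(word, grouped[word]))  # e.g. ('big', 2)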
11.0.5 Types of Big Data Analytics
Big data analytics can be grouped into four broad types (a small numeric sketch follows the list):
• Descriptive Analytics: This technique involves summarizing historical
data to understand what has happened in the past.
• Diagnostic Analytics: This technique involves analysing data to
determine the causes of a particular event or pattern.
• Predictive Analytics: This technique involves using statistical models
and machine learning algorithms to forecast future events or patterns
based on historical data.
• Prescriptive Analytics: This technique involves recommending actions
based on insights from predictive analytics.
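As a small illustration of the descriptive and predictive techniques, the sketch below summarizes a toy monthly-sales series and then fits a simple linear trend to forecast the next month. The figures and the choice of a linear model are illustrative assumptions only.

    # Descriptive vs. predictive analytics on a toy monthly-sales series.
    import numpy as np

    sales = np.array([120, 135, 128, 150, 162, 170])  # six months of sales

    # Descriptive: summarize what has happened in the past.
    print("mean:", sales.mean(), "max:", sales.max())

    # Predictive: fit a linear trend and forecast month 7.
    months = np.arange(1, len(sales) + 1)
    slope, intercept = np.polyfit(months, sales, 1)
    print("forecast for month 7:", slope * 7 + intercept)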
11.0.6 Typical Analytical Architecture
A typical big data analytical architecture is organized into the following layers (a toy pipeline sketch follows the list):
• Data Sources: This layer includes all the sources of data, both internal
and external, that an organization collects and stores. These may include
data from customer transactions, social media, web logs, sensors, and
other sources.
• Data Ingestion and Storage: This layer is responsible for ingesting data
from various sources, processing it, and storing it in a format that can be
easily accessed and analysed. This layer may include technologies such
as Hadoop Distributed File System (HDFS) and NoSQL databases.
• Data Processing and Preparation: This layer is responsible for
cleaning, transforming, and preparing data for analysis. This may include
tasks such as data integration, data cleaning, data normalization, and data
aggregation.
• Analytics Engines: This layer includes the technologies and tools used
for analysing and processing data. This may include machine learning
algorithms, statistical analysis tools, and visualization tools.
• Data Presentation and Visualization: This layer includes the tools used
to present data in a meaningful way, such as dashboards, reports, and
visualizations. This layer is critical for making data accessible and
understandable to non-technical stakeholders.
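The flow through these layers can be sketched as a toy pipeline. The functions below are illustrative stand-ins for the ingestion, preparation, and analytics layers, not the API of any real tool.

    # A toy pipeline mirroring the layers above: ingest -> prepare -> analyze.
    raw_records = ['  42 ', 'n/a', '17', '8 ']  # a noisy data source

    def ingest(records):
        # Ingestion and storage layer: collect raw records as-is.
        return list(records)

    def prepare(records):
        # Processing and preparation layer: clean and normalize the data.
        return [int(r.strip()) for r in records if r.strip().isdigit()]

    def analyze(values):
        # Analytics layer: derive a simple aggregate insight.
        return {"count": len(values), "total": sum(values)}

    # Presentation layer: report the result to the user.
    print(analyze(prepare(ingest(raw_records))))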
11.0.7 Challenges in Big Data Analytics
Big data analytics also faces several challenges:
• Data Complexity and Variety: Big data comes in many different forms,
including structured, semi-structured, and unstructured data, which can
be challenging to process and analyse.
• Data Quality: Big data is often incomplete, inconsistent, or inaccurate,
which can lead to erroneous insights and conclusions.
• Data Security and Privacy: Big data often contains sensitive and
confidential information, which must be protected from unauthorized
access and breaches.
• Scalability: As data volumes grow, the analytical architecture must be
able to scale to handle the increased load, which can be challenging and
costly.
• Talent Shortage: There is a shortage of skilled data scientists and
analysts who are able to process and analyse big data effectively.
• Integration: Big data analytics requires integration with multiple
systems and technologies, which can be challenging to implement and
maintain.
• Data Governance: Big data requires careful management and
governance to ensure compliance with regulations and policies.
• Interpreting Results: Big data analytics often produces large and
complex datasets, which can be challenging to interpret and translate into
actionable insights.
11.0.8 Case Studies: Big Data in Marketing and Sales, Healthcare, Medicine, and Advertising
• Marketing and Sales: Big data is being used in marketing and sales to
understand customer behaviour and preferences, personalize marketing
messages, and optimize pricing and promotions. For example, Amazon
uses big data to personalize recommendations for individual customers
based on their browsing and purchase history. Walmart uses big data to
optimize pricing and inventory management in its stores. Coca-Cola uses
big data to optimize its vending machine placement, prices, and
promotions based on local weather conditions, events, and consumer
behaviour.
• Healthcare: Big data is being used in healthcare to improve patient
outcomes, reduce costs, and enable personalized medicine. For example,
IBM's Watson Health is using big data to develop personalized cancer
treatments based on a patient's genetic profile and medical history.
Hospitals and healthcare providers are using big data to predict patient
readmission rates, identify patients at risk of developing chronic
conditions, and optimize resource allocation.
11.1.3 Hadoop YARN and Hadoop Execution Model
Hadoop YARN (Yet Another Resource Negotiator) is a resource management layer that sits between the Hadoop Distributed File System (HDFS) and the processing engines, such as MapReduce, Spark, and Tez. It provides a central platform for managing cluster resources, allocating resources to different applications, and scheduling jobs across a cluster.
Key components in the YARN execution model include (a short sketch querying the Resource Manager follows this list):
• Client: The client submits a job to the YARN Resource Manager (RM),
which schedules it across the cluster.
• Node Manager: The Node Manager runs on each node in the cluster and
is responsible for managing the resources on that node, such as CPU,
memory, and disk space. It reports the available resources back to the
Resource Manager, which uses this information to allocate resources to
different applications.
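One way to observe this in practice is to ask the Resource Manager which applications it is tracking, via its web service. The sketch below queries YARN's ResourceManager REST API; the host name is an assumption made for the example, and 8088 is the Resource Manager's web port in a default configuration.

    # List running YARN applications via the ResourceManager REST API.
    # rm.example.com is a placeholder host; 8088 is the default RM web port.
    import requests

    RM_URL = "http://rm.example.com:8088/ws/v1/cluster/apps"

    response = requests.get(RM_URL, params={"states": "RUNNING"}, timeout=10)
    response.raise_for_status()
    apps = response.json().get("apps") or {}

    for app in apps.get("app", []):
        # Each entry reports the application's id, name, and current state.
        print(app["id"], app["name"], app["state"])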
These are just a few examples of the many tools and technologies that are
available in the Hadoop ecosystem. Each of these technologies is designed to
address specific challenges and use cases in big data processing and
analytics. By leveraging the Hadoop ecosystem, organizations can build
powerful, scalable, and cost-effective data processing and analytics solutions.
11.1.5 Databases: HBase, Hive
HBase is a NoSQL database that is designed for storing and managing large
volumes of unstructured and semi-structured data in Hadoop. It provides real-
time random read and write access to large datasets, making it ideal for use
cases that require low-latency queries and high-throughput data processing.
HBase is modelled after Google's Bigtable database and is built on top of
Hadoop Distributed File System (HDFS). HBase uses a column-oriented data
model, which allows for efficient storage and retrieval of data, and provides a
powerful API for data manipulation.
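As an illustration of these random reads and writes, the sketch below uses the third-party happybase client, which talks to HBase through its Thrift gateway. The host, table, and column family names are assumptions made for the example.

    # Write and read one row in HBase through the Thrift gateway.
    # Assumes a Thrift server at hbase.example.com and an existing table
    # 'users' with column family 'info' (illustrative names).
    import happybase

    connection = happybase.Connection('hbase.example.com')
    table = connection.table('users')

    # Put: columns are addressed as b'family:qualifier'.
    table.put(b'user-001', {b'info:name': b'Asha', b'info:city': b'Pune'})

    # Get: a low-latency random read of the same row.
    row = table.row(b'user-001')
    print(row[b'info:name'].decode())

    connection.close()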
Hive, on the other hand, is a data warehouse system for querying and analysing large datasets stored in Hadoop. It provides a SQL-like interface for querying data and supports a range of data formats, including structured and semi-structured data. Hive is modelled after the SQL language, making it easy for users with SQL experience to work with large-scale datasets in Hadoop. Hive uses a metadata-driven approach to data management, which allows for easy integration with other tools in the Hadoop ecosystem. Hive provides a powerful SQL-like language called HiveQL for querying data and supports advanced features such as user-defined functions, subqueries, and joins.
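A HiveQL query can be issued from Python through a client such as PyHive. In the sketch below, the server address and the web_logs table with its columns are assumptions made for the example; the query itself is ordinary HiveQL.

    # Run a HiveQL aggregation from Python using the PyHive client.
    # Assumes HiveServer2 at hive.example.com:10000 and a table web_logs
    # with columns page and visits (illustrative names).
    from pyhive import hive

    conn = hive.Connection(host='hive.example.com', port=10000)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT page, SUM(visits) AS total_visits
        FROM web_logs
        GROUP BY page
        ORDER BY total_visits DESC
        LIMIT 10
    """)

    for page, total_visits in cursor.fetchall():
        print(page, total_visits)
    conn.close()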
Both HBase and Hive are powerful tools in the Hadoop ecosystem, and they
are often used together to provide a complete data management and analysis
solution. HBase is typically used for real-time data processing and low-
latency queries, while Hive is used for complex analytical queries and ad-hoc
data analysis.
11.1.6 Scripting Language: Pig; Streaming: Flink, Storm
Both Flink and Storm support stream processing, whereas Pig supports batch processing. Stream processing is useful in scenarios where data is generated continuously and needs to be processed in real time, such as sensor data or social media feeds. Batch processing is useful in scenarios where large volumes of data need to be processed in a non-real-time manner, such as ETL jobs or data warehousing.
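The contrast is easy to see with a tiny windowed count. The sketch below assigns events to fixed one-second windows and updates counts as each event arrives, in the spirit of a stream processor; it is plain Python illustrating the concept, not the Flink or Storm API.

    # Tumbling-window event counts, in the spirit of stream processing.
    # Events arrive as (timestamp_in_seconds, value) pairs.
    from collections import Counter

    events = [(0.2, 'click'), (0.7, 'click'), (1.1, 'view'), (1.9, 'click')]

    window_counts = Counter()
    for timestamp, _value in events:
        window = int(timestamp)      # assign the event to a 1-second window
        window_counts[window] += 1   # update state as the event arrives

    for window in sorted(window_counts):
        print(f"window [{window}, {window + 1}): {window_counts[window]} events")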
11.3 MACHINE LEARNING ALGORITHMS FOR BIG DATA ANALYTICS
When dealing with big data, it is important to choose algorithms that are scalable and can handle large amounts of data. Some algorithms, such as k-nearest neighbours (KNN) and support vector machines (SVM), can be memory-intensive and may not be suitable for large datasets. In such cases, distributed computing frameworks like Apache Spark can be used to handle the processing of big data, as the sketch below illustrates.
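For example, a classifier can be trained with Spark's MLlib library so that the work is distributed across a cluster rather than held in one machine's memory. The sketch below fits a logistic regression on a tiny PySpark DataFrame; the toy rows merely stand in for what would normally be a large distributed dataset.

    # Train a classifier with Spark MLlib so the work can be distributed.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("big-data-ml-sketch").getOrCreate()

    # Toy rows standing in for a large distributed dataset.
    df = spark.createDataFrame(
        [(0.5, 1.0, 0), (1.5, 0.2, 0), (3.0, 3.5, 1), (4.2, 2.8, 1)],
        ["x1", "x2", "label"],
    )

    # Assemble the feature columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    train = assembler.transform(df)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show()
    spark.stop()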
11.4 RECENT TRENDS IN BIG DATA ANALYTICS
5. Data Privacy and Security: With the increasing amount of data being collected and analysed, data privacy and security are becoming major concerns. Businesses must ensure that they are compliant with data protection regulations and that they are taking steps to protect sensitive data.
6. Data Democratization: Data democratization involves making data accessible to all stakeholders in an organization, enabling them to make data-driven decisions. This trend is gaining traction as businesses seek to break down data silos and improve collaboration and communication across teams.
11.5 SUMMARY
Big data refers to the large volume of structured and unstructured data that
inundates businesses on a daily basis. Big data analytics is the process of
collecting, processing, and analysing this data to gain insights and make
informed business decisions. The key characteristics of big data are
commonly summarized by the "3Vs": volume, velocity, and variety. To
handle big data, businesses require specialized tools and technologies, such
as the Hadoop ecosystem, which includes HDFS, MapReduce, and YARN, as
well as other technologies like Spark, HBase, and Hive. In addition to
handling the technical challenges of big data, businesses must also address
data privacy and security concerns, and ensure compliance with regulations
such as GDPR and CCPA.
Some of the key trends in big data analytics include real-time analytics, edge
computing, cloud-based analytics, artificial intelligence and machine
learning, data privacy and security, data democratization, and natural
language processing. Overall, big data analytics has the potential to provide businesses with valuable insights that can improve their operations, customer experiences, and bottom lines.
11.7 KEYWORDS
A glossary of commonly used terms in big data:
1. Big data: Refers to large volumes of structured and unstructured data that inundate businesses on a daily basis.
2. Business intelligence: The use of data analysis tools and technologies to
gain insights into business performance and make informed decisions.
3. Cloud computing: The delivery of computing services, including
storage, processing, and analytics, over the internet.
4. Data cleaning: The process of identifying and correcting errors and
inconsistencies in data.
5. Data governance: The management of data assets, including policies,
procedures, and standards for data quality and security.
6. Data integration: The process of combining data from multiple sources
into a single, unified view.
7. Data lake: A centralized repository for storing large volumes of
structured and unstructured data in its native format.
8. Data mining: The process of extracting useful information from large
volumes of data.
9. Data pipeline: The process of moving data from its source to a
destination for storage, processing, or analysis.
10. Data privacy: The protection of sensitive and personal data from
unauthorized access or disclosure.
11. Data quality: The measure of the accuracy, completeness, and
consistency of data.
12. Data visualization: The process of creating visual representations of
data to aid in understanding and analysis.
13. Data warehousing: The process of collecting and storing data from
multiple sources to create a centralized repository for analysis.
14. Hadoop: A popular open-source big data framework used for storing
and processing large volumes of data.
15. Machine learning: A subset of AI that involves building algorithms and
models that can learn and make predictions based on data.
16. MapReduce: A programming model used to process large volumes of
data in parallel on a distributed system.
17. NoSQL: A non-relational database management system designed for
handling large volumes of unstructured data.
18. Predictive Analytics: The use of statistical models and machine learning
algorithms to make predictions about future events based on historical
data.
19. Spark: An open-source big data processing framework that allows for
fast, in-memory processing of large datasets.
20. Streaming: The process of analysing and processing real-time data as it
is generated.
11.8 FURTHER READINGS
1. Provost, F., & Fawcett, T. (2013). Data science for business: What you
need to know about data mining and data-analytic thinking. O'Reilly
Media.
2. Zaharia, M., & Chambers, B. (2018). Spark: The definitive guide.
O'Reilly Media.
3. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive
datasets. Cambridge University Press.
4. Marz, N., & Warren, J. (2015). Big data: Principles and best practices of
scalable real-time data systems. Manning Publications.
5. Apache Hadoop: https://hadoop.apache.org/
6. Apache Spark: https://spark.apache.org/
7. Big Data University: https://bigdatauniversity.com/
8. Hortonworks: https://hortonworks.com/
9. Big Data Analytics News: https://www.bigdataanalyticsnews.com/
10. Data Science Central: https://www.datasciencecentral.com/