Big Data (My Notes)
Unit 1
Big Data Characteristics
The characteristics of Big Data are often summarized as the "Five V's":
Volume
As its name implies, volume refers to the enormous amount of data generated and stored every second by IoT devices, social media, videos, financial transactions, and customer logs. The data generated from these sources can range from terabytes to petabytes and beyond. Managing such large quantities of data requires robust storage solutions and advanced processing techniques; the Hadoop framework, for example, is used to store, access, and process big data.
Facebook generates about 4 petabytes of data per day, that is, roughly 4 million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data [1].
Fig: Minutes spent per day on social apps (Image
source: Recode)
Velocity
Velocity refers to the speed at which data is generated, transmitted, and processed. Sources such as social media feeds, IoT sensors, and financial transactions produce data continuously, often demanding real-time or near-real-time processing.
Variety
Variety refers to the many forms data takes: structured, semi-structured, and unstructured, including text, images, audio, and video (these categories are discussed in detail under "Types of Data" below).
Veracity
Veracity refers to the accuracy and trustworthiness of the data. Ensuring data quality, addressing data discrepancies, and dealing with data ambiguity are all major issues in Big Data analytics.
Value
Value refers to the ability to convert large volumes of data into useful insights. Big Data's ultimate goal is to extract meaningful, actionable insights that lead to better decision-making, new products, enhanced customer experiences, and competitive advantages.
These qualities characterize the nature of Big Data and
highlight the importance of modern tools and
technologies for effective data management,
processing, and analysis.
TYPES OF DATA:
In the context of big data, the differentiation between
structured, semi-structured, and unstructured data is
crucial because of the sheer volume, variety, and
complexity of data being generated. Here’s how these
types of data differ when dealing with big data:
1. Structured Data in Big Data:
Definition: Structured data is well-organized, typically stored in relational databases, and easily accessible and analyzable using traditional tools like SQL. It fits neatly into predefined fields and tables (a short sketch follows this section).
Characteristics:
o Highly organized, follows a predefined
schema.
o Easy to store and analyze using traditional
databases.
o Mostly quantitative data.
o Easily searchable and processable by big data
tools like Hadoop or Spark in conjunction with
relational databases.
Examples in Big Data:
o Financial Transactions: In big data, a bank
processes millions of transactions daily. Each
transaction has structured data like
transaction ID, amount, date, account number,
and customer ID, all organized in tabular
format.
o Retail Sales Data: A retailer like Walmart
generates massive amounts of structured data
from point-of-sale systems, tracking SKU
numbers, quantities, prices, and customer IDs
in a structured manner.
o Sensor Data: IoT devices (e.g., smart meters,
industrial sensors) generate structured data,
such as temperature readings, timestamps,
and device IDs, which are often used in big
data systems to monitor performance in real-
time.
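To make this concrete, here is a minimal sketch of working with structured data using Python's built-in sqlite3 module. The table and column names (transactions, account_no, and so on) are illustrative assumptions, not taken from any particular system.

    import sqlite3

    # Create an in-memory relational database with a fixed schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE transactions (
            txn_id     INTEGER PRIMARY KEY,
            account_no TEXT,
            amount     REAL,
            txn_date   TEXT
        )
    """)

    # Structured rows fit neatly into the predefined columns.
    rows = [
        (1, "ACC-100", 250.00, "2024-01-15"),
        (2, "ACC-101", 75.50, "2024-01-15"),
        (3, "ACC-100", 120.25, "2024-01-16"),
    ]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

    # SQL makes the data easily searchable and aggregatable.
    query = "SELECT account_no, SUM(amount) FROM transactions GROUP BY account_no"
    for account, total in conn.execute(query):
        print(account, total)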
2. Semi-Structured Data in Big Data:
Definition: Semi-structured data does not conform to the rigid structure of relational databases but carries organizational markers (like tags or key-value pairs) that provide flexibility. It is commonly stored in formats like JSON or XML, or in NoSQL databases (a short sketch follows this section).
Characteristics:
o Lacks a fixed schema but contains metadata
or markers.
o More flexible, allowing for rapid and adaptable
data input.
o Requires specialized tools for analysis (e.g.,
NoSQL databases like MongoDB, document
stores).
o Often used in big data systems where diverse
data types need to be processed quickly.
Examples in Big Data:
o Social Media Data: Tweets or Facebook
posts, which include structured metadata
(e.g., user ID, timestamp) alongside
unstructured content (e.g., the text of the
tweet). Platforms process billions of social
media interactions per day.
o Log Files: Web server logs or application logs,
where each log entry has structured elements
(timestamps, IP addresses) mixed with free-
form text (error messages or user actions).
o Emails: An organization may handle massive
amounts of emails. Each email has structured
data (sender, recipient, subject) and semi-
structured data (the body of the email, which
often follows a flexible structure).
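As a small, hedged illustration, the Python sketch below parses a made-up social-media post stored as JSON: the metadata fields behave like structured data, while the free-form text would need further processing. All field names are invented for this example.

    import json

    # A semi-structured record: key-value markers, but no rigid schema.
    raw = """
    {
        "user_id": "u123",
        "timestamp": "2024-01-15T10:32:00Z",
        "text": "Loving the new phone! Battery life is great.",
        "hashtags": ["tech", "review"]
    }
    """

    post = json.loads(raw)

    # Structured metadata is directly addressable...
    print(post["user_id"], post["timestamp"])

    # ...while the free-form text needs text mining or NLP to analyze.
    print(post["text"])

    # Fields may be missing or extra without breaking the format.
    print(post.get("location", "no location recorded"))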
3. Unstructured Data in Big Data:
Definition: Unstructured data lacks a predefined format or organizational framework. It comes in a variety of forms, such as text, images, audio, and video, and requires advanced techniques (like machine learning or natural language processing) for analysis (a short sketch follows this section).
Characteristics:
o No predefined schema or structure.
o Typically qualitative and requires advanced
processing techniques.
o High in volume and variety, often the largest
component of big data.
o Requires tools like Hadoop, Spark, and AI-
based systems for extraction and analysis.
Examples in Big Data:
o Text Data: Millions of customer reviews or
feedback forms generated on e-commerce
platforms like Amazon, where the textual
content is unstructured and needs text mining
to extract insights.
o Video and Image Data: Social media
platforms like YouTube handle enormous
volumes of unstructured video data. Image
recognition and video analysis are required to
process and analyze the data.
o Healthcare Records (Medical Imaging): X-
rays, MRIs, and CT scans in healthcare
systems are unstructured data. The analysis
requires specialized image processing
algorithms to detect patterns or anomalies.
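As a small illustration of text mining on unstructured data, the Python sketch below counts word frequencies across a few invented customer reviews; a real pipeline would add proper tokenization, stop-word removal, and far larger corpora.

    import re
    from collections import Counter

    # Invented unstructured text: no schema, just free-form reviews.
    reviews = [
        "Great product, fast delivery!",
        "Product arrived late, packaging was damaged.",
        "Fast shipping and great quality.",
    ]

    # Lowercase and extract alphabetic tokens: a crude tokenizer.
    words = []
    for review in reviews:
        words.extend(re.findall(r"[a-z]+", review.lower()))

    # Frequency counts are a first step toward extracting insights.
    for word, count in Counter(words).most_common(5):
        print(word, count)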
Summary of Differences in Big Data:

Feature              | Structured Data                                    | Semi-Structured Data                           | Unstructured Data
Organization         | Neatly organized in rows and columns               | Some organizational markers, but not rigid     | No predefined format
Schema               | Fixed schema, well-defined                         | Flexible schema (e.g., tags, key-value pairs)  | No schema or structure
Scalability          | Easier to scale in traditional relational systems  | Scalable in NoSQL or document-oriented systems | Requires advanced tools (Hadoop, Spark)
Storage Tools        | Relational databases (SQL, MySQL)                  | NoSQL databases (MongoDB, Cassandra)           | Distributed storage systems (HDFS, S3)
Examples in Big Data | Financial transactions, retail sales, sensor data  | Social media posts, emails, log files          | Images, videos, text files, customer reviews
Tools for Big Data Processing:
Structured Data: Tools like SQL databases,
Apache Hive, and Google BigQuery are used to
store and process structured data.
Semi-Structured Data: NoSQL databases like MongoDB and Cassandra, and document-based stores, are commonly used (a minimal MongoDB sketch follows this list).
Unstructured Data: Tools like Hadoop and Spark, combined with machine learning techniques (e.g., natural language processing, image recognition), help process and analyze unstructured data.
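As mentioned above, here is a minimal sketch of storing semi-structured documents in MongoDB via the pymongo driver. It assumes a MongoDB server running on localhost:27017; the database and collection names (socialdb, posts) are invented for illustration.

    from pymongo import MongoClient

    # Connect to a local MongoDB instance (assumed to be running).
    client = MongoClient("mongodb://localhost:27017")
    posts = client["socialdb"]["posts"]

    # Documents in one collection need not share a schema.
    posts.insert_one({"user_id": "u123", "text": "New phone!", "likes": 42})
    posts.insert_one({"user_id": "u456", "text": "Stuck in traffic",
                      "hashtags": ["commute"]})

    # Query on whatever fields a document happens to have.
    for doc in posts.find({"user_id": "u123"}):
        print(doc["text"])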
In big data environments, companies deal with vast quantities of all three types of data, necessitating different storage, processing, and analysis strategies to derive meaningful insights.
Unit 2
Crowdsourcing analytics
involves gathering, processing, and analyzing data from
a large group of people or contributors (the "crowd") to
solve problems, generate insights, or make decisions.
This approach leverages the collective intelligence,
skills, and efforts of a diverse group, often through an
open call, to achieve results that may not be possible
through traditional methods or small, specialized
teams.
Key Aspects of Crowdsourcing Analytics:
Data Collection: A large number of individuals
contribute data or insights, often through digital
platforms.
Diverse Contributions: Crowdsourcing leverages
the knowledge, creativity, or feedback from people
with different perspectives.
Analytics Processing: The collected data is analyzed using machine learning, statistical methods, or big data techniques to derive actionable insights (a minimal sketch follows this list).
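To illustrate the processing step, here is a minimal Python sketch under invented assumptions: many contributors label the same items (as on a micro-task platform), and a simple majority vote aggregates their noisy answers into a decision.

    from collections import Counter

    # Invented crowd contributions: (item_id, label) pairs from many workers.
    contributions = [
        ("img_1", "cat"), ("img_1", "cat"), ("img_1", "dog"),
        ("img_2", "dog"), ("img_2", "dog"),
        ("img_3", "cat"), ("img_3", "bird"), ("img_3", "cat"),
    ]

    # Group the labels by item.
    by_item = {}
    for item_id, label in contributions:
        by_item.setdefault(item_id, []).append(label)

    # Majority vote: the simplest aggregation of crowd input.
    for item_id, labels in by_item.items():
        winner, votes = Counter(labels).most_common(1)[0]
        print(item_id, "->", winner, f"({votes}/{len(labels)} votes)")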
Example of Crowdsourcing Analytics:
1. Waze (Traffic and Navigation App):
How it Works: Waze, a popular GPS navigation
app, relies on crowdsourcing to gather real-time
data on traffic conditions, road hazards, accidents,
and speed traps. Millions of users share live data
as they drive, reporting incidents or confirming
road statuses.
Analytics Process: The app aggregates and
analyzes this crowd-contributed data to provide
users with the fastest routes, predict traffic
conditions, and offer estimated arrival times. Waze
also uses machine learning to improve accuracy
and make recommendations based on historical
data.
Benefit: This real-time, crowd-generated data
allows for highly accurate and dynamic traffic
management that improves the driving experience.
2. Kaggle (Crowdsourced Data Science
Competitions):
How it Works: Kaggle is a platform where
companies or researchers post data science
challenges, often offering prize money. A global
community of data scientists competes to create
the best predictive models or analytics solutions.
Analytics Process: Participants use various data
analysis, machine learning, and modeling
techniques to solve problems such as predicting
customer churn, improving healthcare outcomes,
or optimizing product recommendations.
Benefit: Companies gain access to diverse,
innovative solutions from talented data scientists
around the world, often achieving better results
than they would with internal teams.
3. Amazon Mechanical Turk (MTurk):
How it Works: MTurk is a crowdsourcing platform
where businesses post micro-tasks, such as data
labeling, image recognition, or survey participation,
which workers complete for small payments.
Analytics Process: Companies use the crowd to
gather or annotate large datasets, which are then
analyzed using machine learning algorithms or
traditional analytics methods to extract insights.
Benefit: This allows businesses to process vast
amounts of data quickly and cost-effectively by
leveraging a distributed workforce.
Benefits of Crowdsourcing Analytics:
Scalability: Access to a large pool of contributors
makes it easier to scale data collection and
processing.
Diversity: Diverse perspectives and contributions
can lead to more creative solutions and broader
insights.
Cost-Effectiveness: Crowdsourcing is often more
affordable than traditional methods, particularly for
data collection and labeling.
Real-Time Feedback: In cases like Waze,
crowdsourcing allows for real-time data collection
and immediate insights.
In summary, crowdsourcing analytics taps into the
power of the crowd to collect, process, and analyze
data, allowing organizations to solve complex problems
and gain insights that might not be achievable through
traditional means.
Open-source technologies
have played a pivotal role in the development and
expansion of big data ecosystems. They provide cost-
effective and scalable solutions for processing, storing,
analyzing, and visualizing large datasets. Below are
some of the key open-source technologies in big data:
1. Data Storage and Distributed File Systems
Hadoop Distributed File System (HDFS): Part
of the Apache Hadoop ecosystem, HDFS is a
distributed file system that enables the storage of
large datasets across many machines. It’s
designed for scalability and fault tolerance, making
it ideal for managing big data.
Apache HBase: A non-relational, distributed
database that runs on top of HDFS. It is suitable for
real-time, read/write access to large datasets and
is often used for handling unstructured or semi-
structured data.
Apache Cassandra: A highly scalable, NoSQL
distributed database designed to handle large
volumes of data across commodity servers with no
single point of failure. It’s known for high
availability and is widely used for time-series data
and IoT applications.
Ceph: A distributed object store and file system
designed to provide high performance, reliability,
and scalability. It is often used in large-scale
storage systems for cloud computing.
2. Data Processing and Analytics
Apache Hadoop: One of the most well-known
open-source frameworks for distributed storage
and processing of large data sets using the
MapReduce programming model. Hadoop's
ecosystem also includes YARN (Yet Another
Resource Negotiator) for cluster management.
Apache Spark: A fast, in-memory data processing engine that provides high-level APIs for distributed data processing, as well as libraries for SQL (Spark SQL), machine learning (MLlib), and graph processing (GraphX). Spark is often preferred over Hadoop MapReduce for faster data processing (a PySpark word-count sketch follows this list).
Apache Flink: Another distributed data processing
engine, Flink is designed for both batch and real-
time stream processing. It is known for its event-
driven, stateful computations on streams.
Dask: A parallel computing library in Python that
scales workflows from multi-core machines to large
distributed clusters. It integrates well with popular
data science libraries like NumPy and pandas.
Presto (Trino): A distributed SQL query engine
capable of querying large datasets residing in
various data sources like HDFS, S3, or relational
databases. Presto enables fast querying for
interactive analytics.
Apache Beam: A unified programming model
designed to define and execute data processing
pipelines. It runs on multiple execution engines like
Apache Spark, Flink, and Google Cloud Dataflow.
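As promised above, a minimal PySpark word-count sketch. It assumes the pyspark package is installed and that a text file reviews.txt exists; both the file name and the app name are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, lower, split

    # Start a local Spark session.
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read raw text; each line becomes a row with a "value" column.
    lines = spark.read.text("reviews.txt")  # hypothetical input file

    # Split lines into words and count them across the distributed dataset.
    words = lines.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
    counts = words.where(col("word") != "").groupBy("word").count()

    counts.orderBy(col("count").desc()).show(10)
    spark.stop()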
3. Data Streaming
Apache Kafka: A distributed event streaming platform that allows real-time data streams to be published, subscribed to, stored, and processed. It's widely used for real-time analytics, log aggregation, and streaming data pipelines (a small producer/consumer sketch follows this list).
Apache Pulsar: A distributed messaging and
streaming platform, Pulsar is designed for high
throughput and low-latency data distribution. It
also supports multi-tenancy and geo-replication,
making it a strong alternative to Kafka.
Apache Storm: A real-time distributed computing
system that processes large streams of data. It’s
used for real-time analytics and machine learning,
among other applications.
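As noted in the Kafka entry above, here is a hedged sketch using the third-party kafka-python client; it assumes a broker on localhost:9092 and an invented topic name, events. Other clients (e.g., confluent-kafka) would look different.

    from kafka import KafkaConsumer, KafkaProducer

    # Publish a few messages to the (assumed) local broker.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        producer.send("events", f"event-{i}".encode("utf-8"))
    producer.flush()

    # Consume the topic from the beginning.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating when no messages arrive
    )
    for message in consumer:
        print(message.topic, message.offset, message.value.decode("utf-8"))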
4. Data Ingestion
Apache NiFi: A data integration tool that
automates the flow of data between systems. NiFi
provides a graphical user interface to design data
pipelines and is known for its ease of use and
scalability.
Apache Sqoop: A tool designed to transfer bulk
data between Hadoop and structured data stores
like relational databases (e.g., MySQL, Oracle). It is
often used for ETL (Extract, Transform, Load)
operations.
Apache Flume: A distributed service for
collecting, aggregating, and moving large amounts
of log data from various sources into a centralized
data store, such as HDFS.
5. Data Warehousing and Query Engines
Apache Hive: A data warehousing solution built on top of Hadoop, Hive provides a SQL-like interface (HiveQL) to query and manage large datasets stored in HDFS. It translates HiveQL queries into MapReduce jobs (a small query sketch follows this list).
Apache Impala: A high-performance, distributed
SQL engine for Apache Hadoop. Impala allows for
low-latency SQL queries on data stored in HDFS
and Apache HBase, with an emphasis on
interactive analytics.
ClickHouse: A columnar database management
system that’s optimized for high-speed OLAP
(online analytical processing) queries, making it
popular for real-time analytics and data
warehousing.
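As promised in the Hive entry, a minimal sketch of querying Hive from Python with the third-party PyHive library. The host, port, and table name (page_views) are assumptions; PyHive is just one of several ways to connect to HiveServer2.

    from pyhive import hive

    # Connect to an (assumed) HiveServer2 instance.
    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    # HiveQL looks like SQL but executes as distributed jobs over HDFS data.
    cursor.execute(
        "SELECT url, COUNT(*) AS hits FROM page_views "
        "GROUP BY url ORDER BY hits DESC LIMIT 10"
    )
    for url, hits in cursor.fetchall():
        print(url, hits)

    conn.close()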
6. Data Visualization
Apache Superset: An open-source data
exploration and visualization platform that
integrates with many databases, allowing users to
create interactive dashboards and analyze large
datasets through SQL queries.
Grafana: An open-source analytics platform for
monitoring and visualizing metrics from various
data sources. Grafana is often used for time-series
data and real-time system monitoring.
Kibana: Part of the Elastic Stack (formerly ELK
Stack: Elasticsearch, Logstash, and Kibana), Kibana
is an open-source data visualization and
exploration tool that’s commonly used for
analyzing log data and creating dashboards.
7. Machine Learning and AI
Apache Mahout: A library for building scalable
machine learning algorithms, including
classification, clustering, and recommendation
engines. Mahout is designed to work with large
datasets on distributed systems like Hadoop.
H2O.ai: An open-source platform that provides
scalable machine learning and artificial intelligence
capabilities. H2O integrates well with big data
platforms like Hadoop and Spark, offering APIs for
Python, R, and Java.
TensorFlow: Although primarily used for deep
learning, TensorFlow is also capable of handling
large-scale data processing tasks in distributed
environments.
MLlib (Apache Spark): Spark's own machine learning library, MLlib supports various algorithms, including classification, regression, clustering, and collaborative filtering, all running on Spark's fast distributed engine (a small training sketch follows this list).
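As promised in the MLlib entry, a toy sketch using the DataFrame-based pyspark.ml API: it trains a logistic-regression classifier on a few hand-made points. The data is invented and far too small to be meaningful; it only shows the shape of the API.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

    # Tiny invented training set: (feature vector, label) rows.
    train = spark.createDataFrame(
        [
            (Vectors.dense([0.0, 1.1]), 0.0),
            (Vectors.dense([2.0, 1.0]), 1.0),
            (Vectors.dense([2.2, 1.3]), 1.0),
            (Vectors.dense([0.1, 1.2]), 0.0),
        ],
        ["features", "label"],
    )

    # Fit a logistic regression model and apply it back to the data.
    model = LogisticRegression(maxIter=10).fit(train)
    model.transform(train).select("features", "label", "prediction").show()

    spark.stop()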
8. Search and Indexing
Elasticsearch: A distributed search and analytics engine that allows real-time, full-text search and analysis of large datasets. It's widely used for log and event data analysis (an indexing and search sketch follows this list).
Apache Solr: Another powerful, scalable search
engine built on Apache Lucene, Solr provides
distributed indexing, replication, and load-balanced
querying.
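As noted in the Elasticsearch entry, a hedged sketch of indexing and searching documents with the official Elasticsearch Python client (8.x-style API); the local URL and the index name, logs, are assumptions for illustration.

    from elasticsearch import Elasticsearch

    # Connect to an (assumed) local Elasticsearch node.
    es = Elasticsearch("http://localhost:9200")

    # Index a couple of log-like documents.
    es.index(index="logs", document={"level": "ERROR",
                                     "message": "disk full on node-3"})
    es.index(index="logs", document={"level": "INFO",
                                     "message": "nightly backup completed"})
    es.indices.refresh(index="logs")  # make the documents searchable now

    # Full-text search for matching messages.
    result = es.search(index="logs", query={"match": {"message": "disk"}})
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["level"], hit["_source"]["message"])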
These open-source technologies form the backbone of
the big data ecosystem, providing powerful tools for
storage, processing, real-time analysis, machine
learning, and visualization, enabling organizations to
derive insights from massive amounts of data.