Lecture 2
The exponential growth of digital data is primarily driven by social media, IoT (Internet
of Things) devices, sensors, e-commerce, and business transactions.
Example: Every day, social media platforms like Facebook, Twitter, and Instagram
generate petabytes of user-generated content, interactions, and multimedia files.
IoT devices such as smart home sensors, industrial equipment, and wearable devices
continuously collect and transmit real-time data, further fueling Big Data growth.
AI and ML models require vast amounts of structured and unstructured data for
training.
Big Data helps train deep learning models, such as those used in self-driving cars,
facial recognition, fraud detection, and recommendation systems.
Example: AI-powered voice assistants like Siri and Alexa continuously analyze and
learn from user interactions, improving accuracy over time.
Companies use Big Data analytics to gain insights into customer behavior, market
trends, and operational efficiency.
Predictive analytics helps businesses forecast future sales, inventory needs, and risks.
Example: Banks use Big Data to detect fraudulent transactions in real time.
Organizations use Big Data to optimize supply chains, reduce operational costs, and
improve efficiency.
Cloud-based Big Data solutions reduce the need for expensive physical infrastructure.
Example: Retailers use Big Data to optimize inventory levels and reduce wastage,
leading to cost savings.
Financial institutions and cybersecurity firms use Big Data analytics for real-time
fraud detection and anomaly detection.
Example: Credit card companies analyze millions of transactions daily to identify
suspicious activities and prevent fraud.
Governments and organizations use Big Data to optimize urban planning, traffic
management, and energy consumption.
Example: Smart traffic lights adjust signals based on real-time vehicle flow data,
reducing congestion in major cities.
Medical research and personalized medicine rely on Big Data for disease prediction,
drug discovery, and diagnostics.
Example: Genomic sequencing generates vast amounts of data, which is used to identify
genetic disorders and develop precision medicine.
Industries must analyze and manage large volumes of compliance-related data due to
regulations such as GDPR (General Data Protection Regulation) and HIPAA (Health
Insurance Portability and Accountability Act).
Example: GDPR ensures that companies protect users' personal data and provide
transparency in how data is used.
3.5 Environmental Monitoring & Sustainability
Big Data is used in climate modeling, disaster prediction, and efficient resource
management.
Example: Meteorological departments use Big Data analytics to predict hurricanes,
earthquakes, and climate change patterns.
5. Big Data Architecture
Big data architecture is specifically designed to manage the ingestion,
processing, and analysis of data that is too large or complex for conventional
relational databases to store, process, and manage. The solution is to organize
these technologies into a structured big data architecture that can manage and
process such data.
a) Data Sources
All big data solutions start with one or more data sources. A big data
architecture accommodates various data sources and efficiently manages a wide
range of data types. Some common data sources in big data architecture include
transactional databases, logs, machine-generated data, social media and web data,
streaming data, external data sources, cloud-based data, NoSQL databases, data
warehouses, file systems, APIs, and web services.
b) Data Storage
Big Data storage consists of distributed file stores that can hold large, multi-format
files efficiently. A Data Lake is used to store diverse file formats, including
structured, semi-structured, and unstructured data. This storage is primarily used
for batch operations and supports blob storage solutions such as the following
(a short write sketch appears after the list):
HDFS (Hadoop Distributed File System)
Microsoft Azure Blob Storage
AWS S3 (Simple Storage Service)
Google Cloud Storage (GCP Storage)
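As a minimal illustration, the PySpark sketch below lands raw files in a data lake as Parquet. The bucket names, paths, and the event_date column are hypothetical; with the right connector, the same code targets HDFS, Azure Blob Storage, S3, or GCS URIs.

```python
# A minimal sketch: writing raw event data into a data lake as Parquet
# using PySpark. Bucket names, paths, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Read raw JSON events; the path is a placeholder for any supported
# store (hdfs://, s3a://, wasbs://, or gs:// URIs all work with the
# appropriate connector installed)
events = spark.read.json("s3a://example-data-lake/raw/events/2024-01-01/")

# Persist in a columnar format, partitioned for efficient batch reads
# later; this assumes the events carry an event_date field
events.write.mode("append") \
    .partitionBy("event_date") \
    .parquet("s3a://example-data-lake/curated/events/")
```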
c) Batch Processing
Batch processing is a long-running operation that processes data in chunks by
filtering, aggregating, and preparing it for analysis. These jobs read input
data, process it, and generate output files (see the sketch after this list).
Common batch processing tools include:
Hive Jobs (SQL-like querying for batch data)
U-SQL Jobs (Microsoft’s big data processing language)
Apache Sqoop (Data transfer between RDBMS and Hadoop)
Apache Pig (High-level scripting for Hadoop)
Custom MapReduce Jobs (Written in Java, Scala, Python)
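Below is a minimal sketch of such a batch job, written as a custom PySpark application (one of the options above). The paths and column names (store_id, txn_date, amount) are hypothetical.

```python
# A minimal batch job sketch in PySpark: filter, aggregate, and write
# results for analysis. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-batch").getOrCreate()

# Input: raw transaction records previously landed in the data lake
txns = spark.read.parquet("s3a://example-data-lake/curated/transactions/")

# Filter out invalid rows, then aggregate revenue per store and day
daily_totals = (
    txns.filter(F.col("amount") > 0)
        .groupBy("store_id", "txn_date")
        .agg(F.sum("amount").alias("total_revenue"),
             F.count("*").alias("txn_count"))
)

# Generate output files for downstream analysis, as the text describes
daily_totals.write.mode("overwrite").parquet(
    "s3a://example-data-lake/output/daily_store_totals/")
```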
e) Stream Processing
Unlike batch processing, stream processing handles real-time data flows by
consuming, processing, and delivering insights within milliseconds to seconds.
This is achieved using publish-subscribe messaging systems and window-based
data processing techniques.
Apache Spark Streaming (Micro-batch stream processing)
Apache Flink (Low-latency, distributed stream processing)
Apache Storm (Real-time distributed computation)
Processed data is then stored in a sink for further use, as illustrated in the
sketch below.
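The following is a hedged sketch of window-based stream processing using Spark Structured Streaming with a Kafka publish-subscribe source. The broker address, topic name, and the assumption that each message value is a page name are hypothetical.

```python
# A sketch of windowed stream processing with Spark Structured
# Streaming; broker, topic, and message layout are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Subscribe to a Kafka topic (publish-subscribe messaging, as above)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "page-views")
       .load())

# Treat each message value as a page name (a simplifying assumption)
views = raw.selectExpr("CAST(value AS STRING) AS page", "timestamp")

# Count events in 1-minute tumbling windows, tolerating late data
counts = (views
          .withWatermark("timestamp", "2 minutes")
          .groupBy(F.window("timestamp", "1 minute"), "page")
          .count())

# Deliver results to a sink; the console sink is used for illustration
query = (counts.writeStream.outputMode("update")
         .format("console").start())
query.awaitTermination()
```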
f) Analytics-Based Datastore
Once processed, data is stored in a data warehouse or NoSQL database for
querying and analysis. These analytical stores allow faster lookups and advanced
analytics.
HBase (NoSQL database for real-time read/write)
Apache Hive (SQL-based querying on Hadoop)
Spark SQL (Query engine for structured big data processing)
Hive enables metadata abstraction, making it easier to manage and analyze large
datasets.
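As a sketch, processed batch output can be registered as a table and queried with Spark SQL. The table and column names below are hypothetical and carry over from the batch example above.

```python
# A minimal Spark SQL sketch: register processed data as a table and
# query it. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-query").getOrCreate()

# Load the batch output produced earlier and expose it to SQL
totals = spark.read.parquet(
    "s3a://example-data-lake/output/daily_store_totals/")
totals.createOrReplaceTempView("daily_store_totals")

# Analysts can now run familiar SQL over big data for fast lookups
top_stores = spark.sql("""
    SELECT store_id, SUM(total_revenue) AS revenue
    FROM daily_store_totals
    GROUP BY store_id
    ORDER BY revenue DESC
    LIMIT 10
""")
top_stores.show()
```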
h) Orchestration
Orchestration tools automate and manage Big Data workflows, ensuring data
pipelines run efficiently. They enable data transformation, movement, and
scheduling across different sources and destinations. Some common orchestration
tools include (a brief scheduling sketch follows the list):
Apache Oozie (Workflow scheduler for Hadoop)
Apache Airflow (Task orchestration and workflow automation)
Azure Data Factory (Cloud-based ETL and data movement service)
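Below is a brief Apache Airflow sketch that schedules the pipeline described above. The DAG id, task ids, and spark-submit commands are hypothetical; the syntax assumes Airflow 2.4+, where older versions use schedule_interval instead of schedule.

```python
# A sketch of an Apache Airflow DAG orchestrating the batch pipeline;
# DAG id, task ids, and commands are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run the pipeline once per day
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="spark-submit lake_ingest.py",
    )
    aggregate = BashOperator(
        task_id="aggregate_daily_totals",
        bash_command="spark-submit daily_sales_batch.py",
    )
    # Data movement and transformation run in order: ingest, then aggregate
    ingest >> aggregate
```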
6. 5 V's of Big Data