BDA UNIT-1 (Lecture-1)
o Social networking sites: Sites such as Facebook, Google and LinkedIn generate huge amounts
of data every day, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge volumes of logs from
which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data, which
are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish
their plans accordingly; for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their
daily transactions.
The evolution of data management
2. Data warehouses
- Data warehouses aggregate data from multiple sources into a single, central and consistent
data store. They also clean data and prepare it so that it is ready for use, often by
transforming the data into a relational format. Data warehouses are built to support data
analytics, business intelligence and data science efforts.
- Warehouses are mainly used to make a subset of big data readily available to business
users for BI and analysis.
3. Data lakehouses
- Data lakehouses combine the flexibility of data lakes with the structure and querying
capabilities of data warehouses, enabling organizations to harness the best of both solution
types in a unified platform. Lakehouses are a relatively recent development, but they are
becoming increasingly popular because they eliminate the need to maintain two disparate
data systems.
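
As a rough illustration of the warehouse idea described above (aggregate several sources, clean the data, store it in one relational format), the sketch below uses Python's built-in sqlite3 module as a stand-in for a real warehouse. The source data, the table name `sales` and the helper functions are all made up for illustration; a production warehouse would use a dedicated platform and a far more elaborate pipeline.

```python
import sqlite3

# Two hypothetical source systems with inconsistent formats (made-up data).
web_orders = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "bob", "amount": "80"},
]
store_orders = [
    ("ALICE", 40.0),
    ("Carol", 15.25),
]

def clean(name, amount):
    """Normalize one record: trim and title-case the name, cast amount to float."""
    return name.strip().title(), float(amount)

def build_warehouse(conn):
    """Load both sources into a single relational table."""
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, source TEXT)")
    rows = [clean(o["customer"], o["amount"]) + ("web",) for o in web_orders]
    rows += [clean(name, amount) + ("store",) for name, amount in store_orders]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

def total_per_customer(conn):
    """A BI-style query: total spend per customer across all sources."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
build_warehouse(conn)
totals = total_per_customer(conn)
print(totals)
```

Because the names are cleaned before loading, " Alice " from the web source and "ALICE" from the store source end up as the same customer, which is the kind of consistency a warehouse is meant to provide to business users.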
Tools & Technologies used for Storage: Hadoop Distributed File System (HDFS), Amazon S3,
Google Cloud Storage
Tools & Technologies used: Tableau, Power BI, Python (Pandas, NumPy), R.
i. Organizations can use a variety of big data processing tools to transform raw
data into valuable insights.
ii. The three primary big data technologies used for data processing include:
1. Hadoop
Hadoop is an open-source framework that enables the distributed storage and
processing of large datasets across clusters of computers. Its storage layer,
the Hadoop Distributed File System (HDFS), efficiently manages large amounts of data
by spreading it across the machines in the cluster.
2. Apache Spark
Apache Spark is known for its speed and simplicity, particularly when it comes to
real-time data analytics. Because of its in-memory processing capabilities, it excels in data
mining, predictive analytics and data science tasks. Organizations generally turn to it for
applications that require rapid data processing, such as live-stream analytics.
For example, a streaming platform might use Spark to process user activity in real time
to track viewer habits and make instant recommendations.
3. NoSQL databases
NoSQL databases are designed to handle unstructured data, making them a flexible
choice for big data applications. Unlike traditional relational databases, NoSQL solutions
(such as document, key-value and graph databases) are typically built to scale horizontally
across many servers. This flexibility makes them critical for storing data that doesn't fit
neatly into tables.
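
To make the idea of horizontal scaling more concrete, here is a minimal sketch of a key-value store that spreads its keys over several "nodes" by hashing. It does not use any real NoSQL product: the class `ShardedKeyValueStore`, the keys and the documents are hypothetical, and each node is simply a Python dictionary standing in for a separate server.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across several in-memory 'nodes'."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate server (an assumption for illustration).
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

store = ShardedKeyValueStore(num_shards=3)
store.put("user:1", {"name": "Asha", "interests": ["cricket", "music"]})
store.put("user:2", {"name": "Ravi"})  # documents need not share a schema
print(store.get("user:1"))
print([len(shard) for shard in store.shards])  # how many keys each node holds
```

Adding capacity here means adding more shards rather than buying a bigger machine. Real systems typically use more elaborate schemes, such as consistent hashing, so that adding a node does not force most keys to move.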
2. Finance:
5. Smart Cities:
6. Manufacturing:
Data Security and Privacy: Protecting sensitive data from breaches and ensuring compliance
with regulations (GDPR, CCPA).
Data Integration: Combining data from multiple sources with varying formats.
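
The data integration challenge can be sketched in a few lines: two sources describe the same kind of entity but use different formats and field names, so each must be mapped onto one common schema before the data can be combined. The inputs, the field names and the target schema (id, name, city) below are assumptions made up for this example.

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of data in different formats and field names.
csv_source = "id,full_name,city\n1,Asha Rao,Pune\n2,Ravi Kumar,Delhi\n"
json_source = '[{"user_id": 3, "name": "Meera Shah", "location": "Mumbai"}]'

def from_csv(text):
    """Map CSV rows onto the common schema (id, name, city)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"id": int(row["id"]), "name": row["full_name"], "city": row["city"]}
        for row in reader
    ]

def from_json(text):
    """Map JSON records onto the same common schema."""
    return [
        {"id": rec["user_id"], "name": rec["name"], "city": rec["location"]}
        for rec in json.loads(text)
    ]

def integrate(csv_text, json_text):
    """Combine both sources into one list of records, ordered by id."""
    records = from_csv(csv_text) + from_json(json_text)
    return sorted(records, key=lambda r: r["id"])

unified = integrate(csv_source, json_source)
for record in unified:
    print(record)
```

In practice the hard part is usually deciding on the common schema and handling missing or conflicting values, which this small sketch leaves out.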