BDA UNIT-1 (Lecture-1)

Big data refers to large, fast, and complex data sets that traditional methods struggle to process, characterized by volume, velocity, variety, veracity, and value. The evolution of data management has transitioned from early file management to modern data governance and cloud computing, with big data management focusing on data collection, storage, and analytics. Applications of big data span various industries including healthcare, finance, and retail, while challenges include data security, scalability, and integration.


Lecture-1

Introduction to Big Data


- Big data refers to data that is so large, fast, and complex that it is difficult or impossible to
process using traditional methods.
- Big data is also about sheer data volume: large datasets are typically measured in terabytes (TB) or petabytes (PB).

Characteristics of Big Data:


1. Volume: The vast amount of data generated every second. The defining characteristic of big
data is its high volume: the huge amount of data available for collection, produced
continuously from a variety of sources and devices.
2. Velocity: The speed at which data is generated. Today, data is often produced in real time or
near real time, and it must therefore be processed, accessed, and analysed at the same rate
to have any meaningful impact.
3. Variety: Data is heterogeneous, meaning it can come from many different sources and can
be structured, semi-structured, or unstructured. Traditional structured data (such as data in
spreadsheets or relational databases) is now supplemented by unstructured text, images,
audio, and video files, as well as semi-structured formats like sensor data that cannot be
organized in a fixed schema.
4. Veracity: The accuracy and reliability of data. Because big data arrives in such great
quantities and from so many sources, it can contain noise or errors, which can lead to poor
decision-making.
5. Value: The real-world benefits organizations can derive from big data. These benefits range
from optimizing business operations to identifying new marketing opportunities.
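The veracity point above can be made concrete with a small sketch (not from the lecture; the field names and plausibility bounds are illustrative): records with missing or implausible values are flagged before analysis, since they would otherwise distort decisions.

```python
# Illustrative veracity check: keep only records whose readings exist
# and fall in a physically plausible range.
records = [
    {"sensor_id": "s1", "temp_c": 21.5},
    {"sensor_id": "s2", "temp_c": None},    # missing value (noise)
    {"sensor_id": "s3", "temp_c": 999.0},   # implausible reading (error)
    {"sensor_id": "s4", "temp_c": 19.8},
]

def is_reliable(rec):
    """A record is reliable only if its reading exists and is plausible."""
    t = rec.get("temp_c")
    return t is not None and -50.0 <= t <= 60.0

clean = [r for r in records if is_reliable(r)]
noisy = [r for r in records if not is_reliable(r)]
print(len(clean), len(noisy))  # 2 2
```

At big-data scale the same filtering idea runs as a distributed job rather than a list comprehension, but the logic is identical.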

Sources of Big Data


This data comes from many sources, such as:

o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a
day-to-day basis, as they have billions of users worldwide.

o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs,
from which users' buying trends can be traced.

o Weather stations: Weather stations and satellites produce very large volumes of data, which
are stored and processed to forecast the weather.

o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish
their plans accordingly; to do this, they store the data of millions of users.

o Share markets: Stock exchanges across the world generate huge amounts of data through
their daily transactions.
The evolution of data management

1. Early Days: File Management (1950s-60s)

i. Punch Cards and Magnetic Tapes: Data was primarily stored on punch
cards and magnetic tapes. Manual processes were labour-intensive, and
data retrieval was slow.
ii. Flat Files: Organizations began using flat file databases, where data was
stored in simple text files with limited structure.
2. Database Management Systems (1970s)
i. Hierarchical and Network Databases: Early database models included
hierarchical (e.g., IBM's IMS) and network databases, which allowed for
more complex relations among data.
ii. Relational Databases: The introduction of relational database
management systems (RDBMS) by Edgar F. Codd revolutionized data
management. The Structured Query Language (SQL) became the
standard for managing and querying data.
3. Emergence of Data Warehousing (1980s-90s)
i. Data Warehousing: The concept of data warehousing emerged, allowing
organizations to consolidate data from various sources for reporting and
analysis. Tools for ETL (Extract, Transform, Load) processes became
popular.
ii. Decision Support Systems (DSS): Analytical tools and systems began to
interact with data warehouses, enabling better business decision-
making.
4. The Rise of Big Data and NoSQL (2000s)
i. Big Data Technologies: With the explosion of data generated by social
media, mobile devices, and the internet, traditional RDBMS struggled to
handle the volume, velocity, and variety. Technologies like Hadoop and
Spark emerged to manage big data.
ii. NoSQL Databases: Non-relational databases (e.g., MongoDB, Cassandra)
provided flexible data models for unstructured or semi-structured data,
allowing organizations to scale out data storage efficiently.
5. Cloud Computing and Data Management (2010s)
i. Move to the Cloud: Cloud services revolutionized data storage and
management, allowing businesses to scale resources dynamically. Major
providers like AWS, Microsoft Azure, and Google Cloud offered database
services that emphasized elasticity and cost-effectiveness.
ii. Data Lakes: The concept of data lakes emerged, enabling organizations
to store vast amounts of raw data in its native format for later analytics
and processing.
6. Modern Data Management and Governance (2020s)
i. DataOps and Agile Data Management: DataOps methodologies began
to emerge, focusing on the agile and collaborative aspects of data
management, similar to DevOps.
ii. Data Governance and Compliance: As data privacy regulations like GDPR
and CCPA arose, organizations prioritized data governance, data quality,
and compliance management.
iii. AI and Machine Learning: The integration of AI and machine learning
into data management processes improved data insights, automation,
and predictive analytics.
7. Future Trends
i. Decentralized and Federated Data Systems: With the rise of blockchain
technologies and decentralized data architectures, future data
management may involve more distributed and secure ways to handle
data.
ii. Data Fabric and Integration: Concepts like data fabric aim to provide
seamless integration, accessibility, and management across various data
sources and formats.
iii. Enhanced Automation: Continued advancements in AI and automation
are likely to drive more intelligent data management tools that reduce
manual intervention and enhance efficiency.

Big data management


Big data management is the systematic process of data collection, data processing and data analysis
that organizations use to transform raw data into actionable insights.
1. Big data collection
i. This stage involves capturing the large volumes of information, from various sources,
that constitute big data.
ii. To handle the speed and diversity of incoming data, organizations often rely on
specialized big data technologies and processes, such as Apache Kafka for real-time
data streaming and Apache NiFi for data flow automation.
iii. This stage also involves capturing metadata: information about the data's origin,
format and other characteristics. Metadata provides essential context for organizing
and processing the data later on.
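The metadata-capture step described in (iii) can be sketched in plain Python. This is a hypothetical illustration, not the API of Kafka or NiFi; the field names (`source`, `format`, `ingested_at`) are chosen here for the example.

```python
# Hypothetical sketch: wrapping each incoming payload with metadata about
# its origin, format, and capture time, for later organizing and lineage.
import json
import time

def ingest(raw_payload, source, fmt):
    """Attach collection-time metadata to an incoming record."""
    return {
        "metadata": {
            "source": source,            # e.g. "web-clickstream", "weather-api"
            "format": fmt,               # e.g. "json", "csv"
            "ingested_at": time.time(),  # capture timestamp, for lineage
        },
        "payload": raw_payload,
    }

event = ingest(json.dumps({"user": "u42", "action": "click"}),
               "web-clickstream", "json")
print(event["metadata"]["source"])
```

Real collection pipelines do the same thing at scale: every record that enters the system carries enough context to be routed, validated, and audited downstream.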
2. Big data storage
1. Data lakes
- Data lakes are low-cost storage environments designed to handle massive amounts of raw
structured and unstructured data. Data lakes generally don't clean, validate or normalize
data. Instead, they store data in its native format, which means they can accommodate many
different types of data and scale easily.
- Data lakes are ideal for applications where the volume, variety and velocity of big data are
high and real-time performance is less important. They're commonly used to support AI
training, machine learning and big data analytics. Data lakes can also serve as general-
purpose storage spaces for all big data, which can be moved from the lake to different
applications as needed.

2. Data warehouses
- Data warehouses aggregate data from multiple sources into a single, central and consistent
data store. They also clean and prepare the data so that it is ready for use, often by
transforming it into a relational format. Data warehouses are built to support data
analytics, business intelligence and data science efforts.
- Warehouses are mainly used to make some subset of big data readily available to business
users for BI and analysis.

3. Data lakehouses
- Data lakehouses combine the flexibility of data lakes with the structure and querying
capabilities of data warehouses, enabling organizations to harness the best of both solution
types in a unified platform. Lakehouses are a relatively recent development, but they are
becoming increasingly popular because they eliminate the need to maintain two disparate
data systems.

Tools & Technologies used for Storage: Hadoop Distributed File System (HDFS), Amazon S3,
Google Cloud Storage
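The lake/warehouse distinction can be illustrated with a toy in-memory sketch (assumed schema and example records, not any real storage product): the "lake" accepts every record in its native form, while only records that pass a cleaning step reach the "warehouse" in a fixed relational shape.

```python
# Toy illustration of data lake vs. data warehouse storage.
data_lake = []        # raw, heterogeneous records, stored as-is
data_warehouse = []   # cleaned rows with a fixed (order_id, amount) schema

def store(record):
    data_lake.append(record)  # lakes accept everything, no validation
    # ETL step: only well-formed records are normalized into the warehouse
    if isinstance(record, dict) and "order_id" in record and "amount" in record:
        data_warehouse.append((record["order_id"], float(record["amount"])))

store({"order_id": 1, "amount": "19.99", "note": "gift wrap"})
store("free-text log line: user u42 visited /home")   # unstructured: lake only
store({"order_id": 2, "amount": 5.00})

print(len(data_lake), len(data_warehouse))  # 3 2
```

The unstructured log line survives only in the lake; the warehouse holds a smaller, consistent subset ready for BI queries, which mirrors how the two systems divide work in practice.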

3. Big data analytics

Big data analytics comprises the processes organizations use to derive value from their big
data. It involves using machine learning, data mining and statistical analysis tools to
identify patterns, correlations and trends within large datasets.

Tools & Technologies used: Tableau, Power BI, Python (Pandas, NumPy), R.
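As a minimal sketch of the "identify correlations and trends" idea, the snippet below computes a Pearson correlation with only the standard library. The numbers are made up for illustration; in practice the same computation would run over large datasets via Pandas, R, or a BI tool.

```python
# Sketch: do daily ad spend and daily sales move together?
from statistics import mean

ad_spend = [100, 150, 200, 250, 300]
sales    = [11,  15,  22,  24,  31]

def pearson(xs, ys):
    """Pearson correlation coefficient: covariance over the product of spreads."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(ad_spend, sales)
print(round(r, 2))  # close to 1.0: a strong positive trend
```

A value near +1 suggests the two series rise together; a value near 0 suggests no linear relationship. Correlation alone does not establish causation, which is why analytics results still need human interpretation.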

4. Big data processing tools

i. Organizations can use a variety of big data processing tools to transform raw
data into valuable insights.
ii. The three primary big data technologies used for data processing include:

1. Hadoop
Hadoop is an open-source framework that enables the distributed storage and
processing of large datasets across clusters of computers. Its storage layer, the
Hadoop Distributed File System (HDFS), allows large amounts of data to be managed
efficiently across the cluster.
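The processing model Hadoop popularized, MapReduce, can be sketched in plain Python: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group. Hadoop runs these same three phases, but distributed across a cluster; this single-machine version only shows the shape of the computation.

```python
# Word count in the MapReduce style, on one machine.
from collections import defaultdict

documents = ["big data big insights", "data drives decisions"]

# Map: each document yields (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (the word)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group to a total
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts["big"], word_counts["data"])  # 2 2
```

Because map and reduce operate on independent keys, each phase parallelizes naturally, which is exactly what lets Hadoop scale the same logic to petabytes.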

2. Apache Spark

Apache Spark is known for its speed and simplicity, particularly when it comes to real-
time data analytics. Because of its in-memory processing capabilities, it excels in data
mining, predictive analytics and data science tasks. Organizations generally turn to it for
applications that require rapid data processing, such as live-stream analytics.

For example, a streaming platform might use Spark to process user activity in real time
to track viewer habits and make instant recommendations.
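Running real Spark requires a cluster or local installation, so the streaming-platform example above can be mimicked in plain Python: events arrive one at a time, running per-user counts are updated immediately, and a recommendation is available after every event. The event data and recommendation rule here are invented for illustration, not Spark's API.

```python
# Pure-Python mimic of stream processing: update state per event, not per batch.
from collections import Counter

watch_events = [
    ("alice", "sci-fi"), ("bob", "drama"), ("alice", "sci-fi"),
    ("alice", "comedy"), ("bob", "drama"),
]

views_by_genre = {}  # per-user running counts, updated as each event arrives

def on_event(user, genre):
    """Fold one event into the running state and return an instant recommendation."""
    views_by_genre.setdefault(user, Counter())[genre] += 1
    return views_by_genre[user].most_common(1)[0][0]  # the user's top genre so far

latest_rec = None
for user, genre in watch_events:   # stands in for an unbounded live stream
    latest_rec = on_event(user, genre)

print(views_by_genre["alice"]["sci-fi"], latest_rec)  # 2 drama
```

Spark's value is doing this same incremental aggregation in memory, fault-tolerantly, across many machines at once.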

3. NoSQL databases

NoSQL databases are designed to handle unstructured data, making them a flexible
choice for big data applications. Unlike relational databases, NoSQL solutions—such as
document, key-value and graph databases—can scale horizontally. This flexibility makes
them critical for storing data that doesn’t fit neatly into tables.

For example, an e-commerce company might use a NoSQL document database to
manage and store product descriptions, images and customer reviews.
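A toy in-memory "document store" makes the schema flexibility concrete. This is an illustration of the document model only; real stores such as MongoDB add query languages, indexing, replication, and horizontal scaling on top of it.

```python
# Toy document store: documents in one collection need not share fields,
# unlike rows in a relational table.
products = {}  # collection: document id -> document

def insert(doc_id, doc):
    products[doc_id] = doc

# Two differently-shaped documents coexist in the same collection
insert("p1", {"name": "Laptop", "price": 999, "specs": {"ram_gb": 16}})
insert("p2", {"name": "Mug", "price": 9, "reviews": ["Great!", "Chipped on arrival"]})

def find(predicate):
    """Scan the collection for documents matching a caller-supplied test."""
    return [d for d in products.values() if predicate(d)]

cheap = find(lambda d: d["price"] < 100)
print(len(cheap), cheap[0]["name"])  # 1 Mug
```

Because each document is self-contained, such stores can partition a collection across many machines by document id, which is the horizontal scaling the text refers to.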
Tools & Technology used for processing: Apache Hadoop, Apache Spark, Flink, Storm.

Applications of Big Data


1. Healthcare:

o Disease prediction, personalized medicine, patient monitoring.

2. Finance:

o Fraud detection, algorithmic trading, credit scoring.

3. Retail and E-commerce:

o Personalized recommendations, inventory optimization.

4. Social Media and Marketing:

o Sentiment analysis, targeted advertising, influencer analysis.

5. Smart Cities:

o Traffic management, energy efficiency, waste management.

6. Manufacturing:

o Predictive maintenance, quality control, supply chain optimization.

Challenges of Big Data

 Data Security and Privacy: Protecting sensitive data from breaches and ensuring compliance
with regulations (GDPR, CCPA).

 Scalability: Managing and processing exponentially growing data efficiently.

 Data Integration: Combining data from multiple sources with varying formats.

 Real-time Processing: Handling and analyzing data streams in real time.

 Data Governance: Ensuring data quality, lineage, and compliance.
