Bda Unit 1
1) Discuss the challenges for Big Data Analytics.
A) Big Data refers to the massive volumes of structured and unstructured data
generated from various sources at high velocity. This data is often too large and
complex for traditional data processing systems to manage effectively.
Types of Big Data
Big Data can be categorized into three primary types:
1. Structured Data:
o This is data that is organized in a predefined manner, usually in
rows and columns, making it easily searchable and analyzable.
o Examples: Databases, spreadsheets, and any data that can be easily
entered into relational databases.
Advantages:
Easy to Analyze
High Accuracy
Disadvantages:
Limited Flexibility
Inability to Capture Rich Information
2. Unstructured Data:
o This type of data lacks a predefined format or structure, making it
more complex to analyze.
o Examples: Text documents, images, videos, social media posts, and
emails.
Advantages:
Rich Insights
Flexibility
Disadvantages:
Difficult to Analyze
Data Quality Issues
3. Semi-Structured Data:
o This type contains elements of both structured and unstructured
data. While it may have organizational properties to separate data
elements, it does not fit into a strict schema.
o Examples: JSON, XML, and documents stored in NoSQL databases.
Advantages:
Balance of Structure and Flexibility
Ease of Data Integration
Disadvantages:
Complexity in Analysis
Inconsistent Formats
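The difference between the three types can be illustrated with a minimal Python sketch. The sample records, field names, and helper functions below are made up for illustration and are not part of the original notes; only the standard library is used.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured_src = "id,name,amount\n1,Asha,250\n2,Ravi,400\n"

# Semi-structured: self-describing keys, but records may differ in shape.
semi_structured_src = (
    '[{"id": 1, "name": "Asha", "tags": ["new"]}, {"id": 2, "name": "Ravi"}]'
)

# Unstructured: free text with no fields at all.
unstructured_src = "Ravi emailed to say the delivery was late but the product is great."


def total_amount(csv_text):
    """Sum the 'amount' column; possible because the schema is known in advance."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["amount"]) for row in rows)


def collect_tags(json_text):
    """Gather tags, tolerating records where the optional 'tags' key is missing."""
    records = json.loads(json_text)
    return [tag for rec in records for tag in rec.get("tags", [])]


def word_count(text):
    """With no schema, only generic text operations are available out of the box."""
    return len(text.split())


print(total_amount(structured_src))
print(collect_tags(semi_structured_src))
print(word_count(unstructured_src))
```

The structured source can be queried by column name directly, the semi-structured source needs a default for the missing key, and the unstructured text would need further processing (for example, text mining) before any field-level question could be answered.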
Challenges for Big Data Analytics
Despite its potential, Big Data analytics faces several challenges:
1. Data Quality and Cleansing:
o Ensuring that the data is accurate, consistent, and clean is critical.
Poor data quality can lead to incorrect insights and poor
decision-making.
2. Data Integration:
o Combining data from different sources (structured and unstructured)
can be difficult, especially when these sources use various formats
and protocols.
3. Storage and Management:
o Storing vast amounts of data efficiently while maintaining
performance is a significant challenge. This includes choosing the
right technology stack and managing the costs associated with
storage.
4. Scalability:
o As the volume of data grows, systems must be able to scale
effectively without sacrificing performance. This requires robust
architecture and planning.
5. Data Privacy and Security:
o Protecting sensitive data and ensuring compliance with regulations
(like GDPR) represents a major challenge, particularly with the
increase in data breaches.
6. Skill Gap:
o There is often a shortage of skilled professionals who can analyze
Big Data effectively. This includes data scientists, analysts, and
engineers familiar with Big Data technologies.
7. Real-time Processing:
o Analyzing streaming data in real-time poses technical challenges,
as traditional data processing tools may not be able to handle high-
velocity data streams effectively.
8. Interpreting Data:
o Deriving actionable insights from complex datasets can be
daunting, especially when visualizing the data or when decision-
makers lack data literacy.
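As a small illustration of the first challenge (data quality and cleansing), the sketch below normalizes, validates, and de-duplicates a handful of records. The record layout, the `clean_records` helper, and the "must contain @" validity rule are assumptions chosen for the example, not a complete cleansing procedure.

```python
def clean_records(records):
    """Normalize, validate, and de-duplicate raw customer records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Normalize: trim whitespace and lower-case the e-mail address.
        email = (rec.get("email") or "").strip().lower()
        if "@" not in email:
            continue  # drop records that fail a basic validity check
        if email in seen:
            continue  # drop duplicates of an address already kept
        seen.add(email)
        # Collapse repeated spaces and use consistent capitalization.
        name = " ".join((rec.get("name") or "").split()).title()
        cleaned.append({"name": name, "email": email})
    return cleaned


raw = [
    {"name": "  asha  rao ", "email": "Asha@Example.com "},
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "Ravi", "email": None},
    {"name": "meena k", "email": "meena@example.com"},
]
print(clean_records(raw))
```

Of the four raw records, one is a duplicate and one has no e-mail address, so only two survive cleansing. At Big Data scale the same logic would typically run on a distributed framework rather than in a single Python loop.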
2) Define Business Intelligence and explain how business intelligence
systems are implemented.
A) Business intelligence (BI) is a set of practices for collecting, structuring,
and analyzing raw data to turn it into actionable business insights. BI comprises
the methods and tools that transform unstructured data sets and compile them into
easy-to-grasp reports or information dashboards. The main purpose of BI is to
support data-driven decision-making.
Business intelligence process: How does BI work?
The whole process of business intelligence can be divided into five main stages.
1. Data gathering involves collecting information from a variety of sources,
either external (e.g., market data providers, industry analytics, etc.) or
internal (Google Analytics, CRM, ERP, etc.).
2. Data cleaning/standardization means preparing collected data for
analysis by validating data quality, ensuring its consistency, and so on.
3. Data storage refers to loading data in the data warehouse and storing it
for further usage.
4. Data analysis is the automated process of turning raw data into
valuable, actionable information by applying various quantitative and
qualitative analytical techniques.
5. Reporting involves generating dashboards, graphical imagery, or other
forms of readable visual representation of analytics results that users can
interact with or extract actionable insights from.
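The five stages above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions: the two data sources and their rows are hypothetical, an in-memory SQLite database stands in for the data warehouse, and a plain-text table stands in for a dashboard.

```python
import sqlite3


def gather():
    # Stage 1: hypothetical internal (CRM) and external (market feed) sources.
    crm = [{"region": "north", "sales": "1200"}, {"region": "South", "sales": "800"}]
    feed = [{"region": "NORTH ", "sales": "300"}, {"region": "south", "sales": None}]
    return crm + feed


def clean(rows):
    # Stage 2: standardize region names and drop rows with missing sales.
    return [
        (row["region"].strip().lower(), float(row["sales"]))
        for row in rows
        if row["sales"] is not None
    ]


def store(rows):
    # Stage 3: load the cleaned rows into the stand-in warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def analyze(conn):
    # Stage 4: aggregate sales per region.
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


def report(results):
    # Stage 5: render the results as a readable text table.
    return "\n".join(f"{region:<10}{total:>10.2f}" for region, total in results)


results = analyze(store(clean(gather())))
print(report(results))
```

Each function maps to one stage, so a real implementation could swap any single stage (for example, replacing SQLite with an actual warehouse) without changing the others.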
Advantages of BI:
Data driven decision making
Improved efficiency
Enhanced visualization
Data mining
Real time analytics
Disadvantages of BI:
High Costs
Complexity
Data Overload
Dependency on IT
Security and Privacy concerns
Volume
o The name Big Data itself refers to enormous size. Big Data means the vast
volumes of data generated daily from many sources, such as business processes,
machines, social media platforms, networks, human interactions, and many
more.
o Facebook can generate approximately a billion messages, record about
4.5 billion clicks of the "Like" button, and receive more than 350 million
new posts each day. Big data technologies can handle such large amounts of data.
Variety
o Big Data can be structured, semi-structured, or unstructured, and it is
collected from many different sources. In the past, data was collected only
from databases and spreadsheets, but today it arrives in many forms, such as
PDFs, emails, audio, social media posts, photos, and videos.
The data is categorized as below:
Structured data: Structured data follows a defined schema with all the
required columns and is in tabular form. It is stored in a relational database
management system, in relations, i.e., tables. OLTP (Online Transaction
Processing) systems are built to work with structured data.
Semi-structured: In semi-structured data, the schema is not strictly defined,
e.g., JSON, XML, CSV, TSV, and email.
Unstructured Data: Files with no predefined structure, such as log files,
audio files, and image files, are unstructured data. Some organizations have
a great deal of data available but do not know how to derive value from it
because the data is raw.
Veracity
Veracity refers to how reliable the data is. Because data arrives from many
sources of uneven quality, it often has to be filtered or translated before
use, so veracity also covers the ability to handle and manage data
efficiently. Reliable data is essential in business development.
For example, Facebook posts with hashtags have to be filtered before they
can be trusted for analysis.
Value
Value is an essential characteristic of big data. What matters is not simply
the data that we process or store, but the valuable and reliable data that we
store, process, and analyze.
Velocity
Velocity plays an important role compared to the others. Velocity refers to
the speed at which data is created, often in real time. It covers the speed
of incoming data sets, the rate of change, and bursts of activity. A primary
aim of Big Data is to deliver in-demand data rapidly.
Big data velocity deals with the speed at which data flows from sources
such as application logs, business processes, networks, social media sites,
sensors, and mobile devices.
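One simple way to make velocity concrete is to measure how many events arrive within a recent time window. The sketch below is an illustration only: the `RateMeter` class, the 10-second window, and the arrival timestamps are all assumptions made up for the example.

```python
from collections import deque


class RateMeter:
    """Tracks how many events arrived within the last `window` seconds."""

    def __init__(self, window=10.0):
        self.window = window
        self.timestamps = deque()

    def record(self, ts):
        """Register an event at time `ts` (seconds) and return events per second."""
        self.timestamps.append(ts)
        # Evict events that have fallen out of the sliding window.
        while self.timestamps and self.timestamps[0] <= ts - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window


meter = RateMeter(window=10.0)
# Three slow arrivals followed by a burst of four events around t = 11.5 s.
arrivals = [0.0, 1.0, 2.0, 11.5, 11.6, 11.7, 11.8]
rates = [meter.record(ts) for ts in arrivals]
print(rates)
```

The rate drops when old events leave the window and climbs again during the burst, which is the kind of "activity burst" that streaming systems have to absorb.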