Ese Bda
Unit 1
Types of Digital Data
1. Structured Data
Definition: Organized data that is neatly stored in tables with rows and
columns, making it easy to use.
Examples: Bank transactions, student databases, or sales records.
2. Semi-Structured Data
Definition: Data that has some structure but is not completely organized
like a table.
Examples: Emails (with subject lines and body text), JSON, and XML files.
3. Unstructured Data
Definition: Data with no specific format, making it harder to process and
analyze.
Examples: Images, videos, social media posts, or scanned documents.
Each type of data plays an important role and requires special tools for
handling and analysis.
1. Structured Data
Definition: Data that is neatly organized in tables, with rows and columns
(e.g., Excel sheets, databases).
Sources:
o Banking systems (e.g., transaction records).
o E-commerce websites (e.g., customer orders).
o Employee data in companies.
Ease with Structured Data:
o Easy to store, search, and analyze using SQL-based relational databases.
o Tools and techniques for handling structured data are well established.
Example: Data stored in relational databases is structured data.
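As a minimal sketch of why structured data is easy to search, the snippet below loads a few rows into an in-memory SQLite table and filters them with SQL. The table name `transactions`, its columns, and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical sample data: (account, amount) rows, as in a bank transaction table.
ROWS = [("A-101", 250.0), ("A-102", 90.5), ("A-101", 40.0)]

def total_for_account(account):
    """Return the summed amount for one account using a plain SQL query."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE transactions (account TEXT, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", ROWS)
        cur = conn.execute(
            "SELECT SUM(amount) FROM transactions WHERE account = ?", (account,)
        )
        return cur.fetchone()[0]  # None when no row matches
    finally:
        conn.close()

print(total_for_account("A-101"))
```

Because every row follows the same fixed schema, a single declarative query is enough to filter and aggregate the data.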
2. Semi-Structured Data
Definition: Data that has some organization but does not follow a fixed
tabular schema (e.g., JSON, XML files).
Sources:
o Emails (structured headers but unstructured body content).
o Sensor data.
o Social media metadata.
Characteristics:
o Easier to manage than unstructured data.
o Needs specialized tools like NoSQL databases for processing.
Example: HTML, etc. Metadata for this data is available but is not sufficient
to describe it fully.
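The sketch below illustrates what "some structure" can mean in practice: two JSON records that are both well formed but do not share the same fields. The records and the helper names `field_names` and `get_email` are hypothetical.

```python
import json

# Hypothetical records: both are valid JSON, but their fields differ.
RAW = '''[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]}
]'''

def field_names(raw):
    """Return the sorted field names found in each record."""
    return [sorted(record.keys()) for record in json.loads(raw)]

def get_email(record):
    """Read an optional field, falling back to None when it is absent."""
    return record.get("email")

records = json.loads(RAW)
print(field_names(RAW))
print([get_email(r) for r in records])
```

The keys act as self-describing tags, but since no schema forces every record to carry the same ones, code that reads the data has to allow for missing fields.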
3. Unstructured Data
Definition: Data that lacks a clear structure, making it harder to organize
and analyze (e.g., text, images, videos).
Sources:
o Videos, photos, and audio files.
o Social media posts and comments.
o Documents like PDFs or Word files.
Challenges (Issues with Terminology):
o Lack of a fixed format makes it difficult to store and process.
o Requires advanced technologies (like AI and Machine Learning) for
analysis.
Example: memos, chat rooms, PowerPoint presentations, images, videos,
letters, research papers, white papers, the body of an email, etc.
(Refer to the PPT.)
Completed Table:
Structured          Unstructured          Semi-Structured
MS Access           Facebook              XML
MS Excel            Videos                Emails
Relations/Tables    Images
Database            Chat Conversations
Unit 8
Apache Pig is a high-level platform for creating programs that process large
datasets. It runs on Hadoop, simplifying the MapReduce programming model
with a scripting language called Pig Latin.
What is Pig?
Pig is a data flow tool used for analyzing large datasets. It allows users to write
scripts to handle complex data transformations easily.
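To make the data flow idea concrete, here is a rough sketch in plain Python (not Pig Latin) of a filter, group, and aggregate pipeline, which is the kind of step-by-step transformation a Pig script expresses. The records, field layout, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records: (user, page, seconds_spent).
VISITS = [
    ("amy", "home", 12),
    ("bob", "home", 3),
    ("amy", "cart", 40),
    ("bob", "cart", 25),
    ("amy", "home", 8),
]

def long_visits(records, min_seconds):
    """Filter step: keep records lasting at least min_seconds."""
    return [r for r in records if r[2] >= min_seconds]

def group_by_user(records):
    """Group step: collect (page, seconds) pairs under each user."""
    groups = defaultdict(list)
    for user, page, seconds in records:
        groups[user].append((page, seconds))
    return dict(groups)

def total_seconds(groups):
    """Aggregate step: sum the seconds within each group."""
    return {user: sum(s for _, s in rows) for user, rows in groups.items()}

result = total_seconds(group_by_user(long_visits(VISITS, 5)))
print(result)
```

Each function takes the output of the previous one, so the program reads as a chain of transformations rather than as hand-written map and reduce code.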
Unit 4
Hadoop is an open-source framework for storing and processing large datasets
in a distributed computing environment. It uses commodity hardware and
follows the MapReduce programming model for data processing.
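The MapReduce model mentioned above can be sketched as a word count that runs in a single Python process. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop's API; the function names and input lines are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "Big clusters process data"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines, with the framework handling the shuffle between them; here everything runs sequentially for clarity.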
HDFS Architecture
Unit 7
Unit 2
Big data is a collection of large, complex, and varied data sets that are difficult
to store, process, and analyze using traditional data management systems.
Unit 3
('createdby'='Author', 'createdfor'='Company'
);
OK