

Big Data Analysis


Big Data Analysis is the process of examining very large and complex data sets to uncover useful information, such as patterns, trends, and insights, that helps organizations make better decisions.

Key Characteristics of Big Data (The 5 Vs)


Big data is often defined by these five attributes:
1. Volume: Refers to the large size of data being generated daily (e.g., social
media posts, IoT sensors).
2. Velocity: The speed at which data is created and processed (e.g., real-
time stock trading data).
3. Variety: The different types of data (e.g., text, images, videos, logs).
4. Veracity: The uncertainty or reliability of the data (e.g., inconsistent or
incomplete data).
5. Value: The insights and benefits gained from analyzing the data.
Steps in Big Data Analysis
1. Data Collection: Gather data from multiple sources like social media,
sensors, transactions, etc.
2. Data Storage: Use distributed systems like Hadoop or cloud-based
storage to manage the data.
3. Data Cleaning: Remove irrelevant or duplicate data to ensure accuracy.
4. Data Processing: Use tools like Spark or Hadoop MapReduce to process
the data efficiently.
5. Analysis: Apply techniques like:
o Statistical analysis: For finding trends and correlations.
o Machine learning: To predict outcomes or classify data.
o Visualization: Create charts and dashboards for better
understanding.
6. Decision-Making: Use the insights gained to make informed business
decisions.
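To make these steps concrete, here is a minimal PySpark sketch covering steps 3 to 5 (cleaning, processing, and a simple statistical analysis). It assumes PySpark is installed and uses a hypothetical sales.csv file with region and amount columns; it is an illustrative sketch, not a complete pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Steps 2/4: a local Spark session stands in for a real cluster here.
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Step 1: collect data (hypothetical sales.csv with region and amount columns).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Step 3: clean -- drop duplicate rows and rows with missing values.
clean = df.dropDuplicates().dropna()

# Step 5: analyze -- total and average sale amount per region.
summary = (clean.groupBy("region")
                .agg(F.sum("amount").alias("total_sales"),
                     F.avg("amount").alias("avg_sale")))

summary.show()   # Step 6 would act on these figures.
spark.stop()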

Applications of Big Data Analysis


• Business: Customer behavior analysis, targeted marketing, supply chain optimization.
• Healthcare: Predicting diseases, patient care optimization.
• Finance: Fraud detection, risk assessment.
• Retail: Personalized recommendations, inventory management.
• Social Media: Trend analysis, sentiment analysis.

Tools Used in Big Data Analysis


1. Hadoop: For distributed storage and processing.
2. Spark: For fast data processing.
3. Tableau and Power BI: For data visualization.
4. Python and R: For data analysis and modeling.
5. NoSQL Databases: Like MongoDB and Cassandra, for storing unstructured data.

Unit 1
Types of Digital Data
1. Structured Data
Definition: Organized data that is neatly stored in tables with rows and
columns, making it easy to use.
Examples: Bank transactions, student databases, or sales records.
2. Semi-Structured Data
Definition: Data that has some structure but is not completely organized
like a table.
Examples: Emails (with subject lines and body text), JSON, and XML files.
3. Unstructured Data
Definition: Data with no specific format, making it harder to process and analyze.
Examples: Images, videos, social media posts, or scanned documents.
Each type of data plays an important role and requires special tools for handling and analysis.

1. Structured Data
• Definition: Data that is neatly organized in tables, with rows and columns (e.g., Excel sheets, databases).
• Sources:
o Banking systems (e.g., transaction records).
o E-commerce websites (e.g., customer orders).
o Employee data in companies.
• Ease with Structured Data:
o Easy to store, search, and analyze using software like SQL.
o Tools and techniques are well-established for handling structured data.
Example: Data stored in databases is an example of structured data.

Relational database features that ease working with structured data:
1. Insert/update/delete: Data Manipulation Language (DML) operations provide the required ease of data input, storage, access, processing, and analysis.
2. Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space, but the benefits that ensue for search operations are worth the additional writes and storage space.
3. Scalability: The storage and processing capabilities of a traditional RDBMS can be scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
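As an illustration of DML and indexing, the following Python sketch uses the standard-library sqlite3 module; the transactions table and its columns are invented for the example.

import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: a fixed schema of rows and columns.
cur.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, account TEXT, amount REAL)")

# DML: insert, update, delete.
cur.execute("INSERT INTO transactions (account, amount) VALUES ('A101', 250.0)")
cur.execute("UPDATE transactions SET amount = 300.0 WHERE account = 'A101'")
cur.execute("DELETE FROM transactions WHERE amount < 0")

# Indexing: faster SELECTs on account, at the cost of extra writes and storage.
cur.execute("CREATE INDEX idx_account ON transactions (account)")

cur.execute("SELECT id, account, amount FROM transactions WHERE account = 'A101'")
print(cur.fetchall())
conn.close()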

2. Semi-Structured Data
• Definition: Data that has some organization but isn't fully formatted (e.g., JSON, XML files).
• Sources:
o Emails (structured headers but unstructured body content).
o Sensor data.
o Social media metadata.
• Characteristics:
o Easier to manage than unstructured data.
o Needs specialized tools like NoSQL databases for processing.
Examples include XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
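A short Python sketch shows why such data is called semi-structured: fields are tagged, but records need not share one fixed schema (the email record below is invented for illustration).

import json

# A JSON record: every field is tagged, but no schema is enforced.
record = '{"from": "alice@example.com", "subject": "Report", "tags": ["q3", "sales"]}'

email = json.loads(record)             # parse the document
print(email["subject"])                # access a field by its tag
print(email.get("cc", "not present"))  # a field may simply be missing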


3. Unstructured Data
• Definition: Data that lacks a clear structure, making it harder to organize and analyze (e.g., text, images, videos).
• Sources:
o Videos, photos, and audio files.
o Social media posts and comments.
o Documents like PDFs or Word files.
• Challenges:
o Lack of a fixed format makes it difficult to store and process.
o Requires advanced technologies (like AI and Machine Learning) for analysis.
Examples: memos, chat-room conversations, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.

Completed Table:

Structured          Unstructured          Semi-Structured
MS Access           Facebook Videos       XML
MS Excel            Images                Emails
Relations/Tables    Chat Conversations
Database

Unit 8
Apache Pig is a high-level platform for creating programs that process large
datasets. It runs on Hadoop, simplifying the MapReduce programming model
with a scripting language called Pig Latin.

What is Pig?
Pig is a data flow tool used for analyzing large datasets. It allows users to write
scripts to handle complex data transformations easily.

Pig is a component of the Hadoop ecosystem, used to analyze big data or large datasets.

Unit 4
Hadoop is an open-source framework for storing and processing large datasets
in a distributed computing environment. It uses commodity hardware and
follows the MapReduce programming model for data processing.

Distributed Computing Challenges


1. Data Distribution: Efficiently splitting and storing data across nodes.
2. Fault Tolerance: Handling node failures without losing data.
3. Scalability: Seamless addition of new nodes.
4. Data Locality: Moving computation to where data resides to improve
performance.
5. Complexity: Managing distributed systems can be difficult.

Hadoop is an open-source software framework that stores and processes massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
• Massive data storage.
• Faster data processing.
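To show the MapReduce model concretely, here is a minimal word-count sketch in Python, written in the style used with Hadoop Streaming (the file names mapper.py and reducer.py are illustrative; cluster setup and job submission are omitted).

# mapper.py -- map phase: emit a (word, 1) pair for every word in the input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- reduce phase: input arrives sorted by key, so counts for the
# same word are adjacent and can be summed in a single pass.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))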

HDFS Architecture

HDFS follows a master-slave architecture: a single NameNode manages the file-system metadata (the namespace and the mapping of files to blocks), while many DataNodes store the actual data blocks. Files are split into large blocks that are replicated across several DataNodes for fault tolerance.

Unit 7

Unit 2
Big data is a collection of large, complex, and varied data sets that are difficult to store, process, and analyze using traditional data management systems.

Unit 3
Apache Hive is a data warehouse tool built on top of Hadoop that provides an SQL-like language, HiveQL, for querying and managing large datasets.

Create Database in Hive

hive> CREATE DATABASE IF NOT EXISTS firstDB
    > COMMENT "This is my first demo"
    > LOCATION '/user/hive/warehouse/newdb'
    > WITH DBPROPERTIES ('createdby'='Author', 'createdfor'='Company');
OK

Drop Database in Hive

hive> DROP DATABASE IF EXISTS firstDB CASCADE;

Describe Database in Hive

hive> DESCRIBE DATABASE/SCHEMA [EXTENDED] db_name;

Alter Database in Hive

hive> ALTER DATABASE firstDB SET OWNER ROLE admin_role;
