

Big Data Analysis


Big Data Analysis is the process of examining very large and complex data sets to uncover useful information, such as patterns, trends, and insights, that helps organizations make better decisions.

Key Characteristics of Big Data (The 5 Vs)


Big data is often defined by these five attributes:
1. Volume: Refers to the large size of data being generated daily (e.g., social
media posts, IoT sensors).
2. Velocity: The speed at which data is created and processed (e.g., real-
time stock trading data).
3. Variety: The different types of data (e.g., text, images, videos, logs).
4. Veracity: The uncertainty or reliability of the data (e.g., inconsistent or
incomplete data).
5. Value: The insights and benefits gained from analyzing the data.
Steps in Big Data Analysis
1. Data Collection: Gather data from multiple sources like social media,
sensors, transactions, etc.
2. Data Storage: Use distributed systems like Hadoop or cloud-based
storage to manage the data.
3. Data Cleaning: Remove irrelevant or duplicate data to ensure accuracy.
4. Data Processing: Use tools like Spark or Hadoop MapReduce to process
the data efficiently.
5. Analysis: Apply techniques like:
o Statistical analysis: For finding trends and correlations.
o Machine learning: To predict outcomes or classify data.
o Visualization: Create charts and dashboards for better
understanding.
6. Decision-Making: Use the insights gained to make informed business
decisions.
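To make these steps concrete, here is a minimal PySpark sketch covering steps 3 to 5 (cleaning, processing, and a simple statistical analysis). It assumes PySpark is installed and uses a hypothetical sales.csv file with region and amount columns; it is an illustrative sketch, not a complete pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Steps 2/4: a local Spark session stands in for a real cluster here.
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Step 1: collect data (hypothetical sales.csv with region and amount columns).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Step 3: clean -- drop duplicate rows and rows with missing values.
clean = df.dropDuplicates().dropna()

# Step 5: analyze -- total and average sale amount per region.
summary = (clean.groupBy("region")
                .agg(F.sum("amount").alias("total_sales"),
                     F.avg("amount").alias("avg_sale")))

summary.show()   # Step 6 would act on these figures.
spark.stop()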

Applications of Big Data Analysis


• Business: Customer behavior analysis, targeted marketing, supply chain optimization.
• Healthcare: Predicting diseases, patient care optimization.
• Finance: Fraud detection, risk assessment.
• Retail: Personalized recommendations, inventory management.
• Social Media: Trend analysis, sentiment analysis.

Tools Used in Big Data Analysis


1. Hadoop: For distributed storage and processing.
2. Spark: For fast data processing.
3. Tableau and Power BI: For data visualization.
4. Python and R: For data analysis and modeling.
5. NoSQL Databases: Like MongoDB and Cassandra, for storing unstructured data.

Unit 1
Types of Digital Data
1. Structured Data
Definition: Organized data that is neatly stored in tables with rows and
columns, making it easy to use.
Examples: Bank transactions, student databases, or sales records.
2. Semi-Structured Data
Definition: Data that has some structure but is not completely organized
like a table.
Examples: Emails (with subject lines and body text), JSON, and XML files.
3. Unstructured Data
Definition: Data with no specific format, making it harder to process and analyze.
Examples: Images, videos, social media posts, or scanned documents.
Each type of data plays an important role and requires special tools for handling and analysis.

1. Structured Data
• Definition: Data that is neatly organized in tables, with rows and columns (e.g., Excel sheets, databases).
• Sources:
o Banking systems (e.g., transaction records).
o E-commerce websites (e.g., customer orders).
o Employee data in companies.
• Ease with Structured Data:
o Easy to store, search, and analyze using software like SQL.
o Tools and techniques are well-established for handling structured data.
Example: Data stored in databases is an example of structured data.

Relational database features that ease working with structured data:
1. Insert/update/delete: Data Manipulation Language (DML) operations provide the required ease of data input, storage, access, processing, and analysis.
2. Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space, but the benefits that ensue for search operations are worth the additional writes and storage space.
3. Scalability: The storage and processing capabilities of a traditional RDBMS can be scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
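As an illustration of DML and indexing, the following Python sketch uses the standard-library sqlite3 module; the transactions table and its columns are invented for the example.

import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: a fixed schema of rows and columns.
cur.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, account TEXT, amount REAL)")

# DML: insert, update, delete.
cur.execute("INSERT INTO transactions (account, amount) VALUES ('A101', 250.0)")
cur.execute("UPDATE transactions SET amount = 300.0 WHERE account = 'A101'")
cur.execute("DELETE FROM transactions WHERE amount < 0")

# Indexing: faster SELECTs on account, at the cost of extra writes and storage.
cur.execute("CREATE INDEX idx_account ON transactions (account)")

cur.execute("SELECT id, account, amount FROM transactions WHERE account = 'A101'")
print(cur.fetchall())
conn.close()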

2. Semi-Structured Data
• Definition: Data that has some organization but isn't fully formatted (e.g., JSON, XML files).
• Sources:
o Emails (structured headers but unstructured body content).
o Sensor data.
o Social media metadata.
• Characteristics:
o Easier to manage than unstructured data.
o Needs specialized tools like NoSQL databases for processing.
Examples include XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
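A short Python sketch shows why such data is called semi-structured: fields are tagged, but records need not share one fixed schema (the email record below is invented for illustration).

import json

# A JSON record: every field is tagged, but no schema is enforced.
record = '{"from": "alice@example.com", "subject": "Report", "tags": ["q3", "sales"]}'

email = json.loads(record)             # parse the document
print(email["subject"])                # access a field by its tag
print(email.get("cc", "not present"))  # a field may simply be missing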


3. Unstructured Data
• Definition: Data that lacks a clear structure, making it harder to organize and analyze (e.g., text, images, videos).
• Sources:
o Videos, photos, and audio files.
o Social media posts and comments.
o Documents like PDFs or Word files.
• Challenges:
o Lack of a fixed format makes it difficult to store and process.
o Requires advanced technologies (like AI and Machine Learning) for analysis.
Examples: memos, chat-room conversations, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.

Completed Table:

Structured          Unstructured          Semi-Structured
MS Access           Facebook Videos       XML
MS Excel            Images                Emails
Relations/Tables    Chat Conversations
Database

Unit 8
Apache Pig is a high-level platform for creating programs that process large
datasets. It runs on Hadoop, simplifying the MapReduce programming model
with a scripting language called Pig Latin.

What is Pig?
Pig is a data flow tool used for analyzing large datasets. It allows users to write
scripts to handle complex data transformations easily.

Pig is a component of the Hadoop ecosystem, used to analyze big data or large datasets.

Unit 4
Hadoop is an open-source framework for storing and processing large datasets
in a distributed computing environment. It uses commodity hardware and
follows the MapReduce programming model for data processing.

Distributed Computing Challenges


1. Data Distribution: Efficiently splitting and storing data across nodes.
2. Fault Tolerance: Handling node failures without losing data.
3. Scalability: Seamless addition of new nodes.
4. Data Locality: Moving computation to where data resides to improve
performance.
5. Complexity: Managing distributed systems can be difficult.

Hadoop is an open-source software framework that stores and processes massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
• Massive data storage.
• Faster data processing.
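To show the MapReduce model concretely, here is a minimal word-count sketch in Python, written in the style used with Hadoop Streaming (the file names mapper.py and reducer.py are illustrative; cluster setup and job submission are omitted).

# mapper.py -- map phase: emit a (word, 1) pair for every word in the input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- reduce phase: input arrives sorted by key, so counts for the
# same word are adjacent and can be summed in a single pass.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))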

HDFS Architecture

HDFS follows a master-slave architecture: a single NameNode manages the file-system metadata (the namespace and the mapping of files to blocks), while many DataNodes store the actual data blocks. Files are split into large blocks that are replicated across several DataNodes for fault tolerance.

Unit 7

Unit 2
Big data is a collection of large, complex, and varied data sets that are difficult to store, process, and analyze using traditional data management systems.

Unit 3
Apache Hive is a data warehouse tool built on top of Hadoop that provides an SQL-like language, HiveQL, for querying and managing large datasets.

Create Database in Hive

hive> CREATE DATABASE IF NOT EXISTS firstDB
    > COMMENT "This is my first demo"
    > LOCATION '/user/hive/warehouse/newdb'
    > WITH DBPROPERTIES ('createdby'='Author', 'createdfor'='Company');
OK

Drop Database in Hive

hive> DROP DATABASE IF EXISTS firstDB CASCADE;

Describe Database in Hive

hive> DESCRIBE DATABASE/SCHEMA [EXTENDED] db_name;

Alter Database in Hive

hive> ALTER DATABASE firstDB SET OWNER ROLE admin_role;
