Module V

The document discusses advanced analytics technologies and tools, focusing on unstructured data, the Hadoop ecosystem, and various components such as Pig, Hive, and HBase. It highlights the challenges of handling unstructured data, which constitutes a significant portion of organizational data, and outlines methods for processing it. Additionally, it provides an overview of the Hadoop ecosystem, detailing its major elements and components that facilitate big data management and analysis.

Uploaded by

satyamshivam.in


MCA2004 – Big Data Analytics
Module V - Advanced Analytics - technologies and tools
Contents
• Analytics for unstructured data
• The Hadoop ecosystem
• Pig
• Hive
• HBase
• Mahout
• Introduction to NoSQL
Analytics for unstructured data
Unstructured Data
• This is data that does not conform to a data model or is not in a form that a computer program can use easily.
• About 80–90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
Sources of Unstructured Data
• Web Pages
• Images
• Free form text
• Audio
• Video
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with terminology – Unstructured Data
• Structure can be implied despite not being formally defined
• Data with some structure may still be labeled unstructured if the structure doesn’t help with the processing task at hand
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced
Dealing with Unstructured Data
• Data Mining
• Association Rule Mining
• Regression Analysis
• Collaborative Filtering
• Text analysis and text mining
• Natural Language Processing (NLP)
• Noisy text analysis
• Manual tagging with metadata
• Part-of-speech tagging
• Unstructured Information Management Architecture (UIMA)
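As a rough illustration of the text analysis and text mining item above, the sketch below counts term frequencies across a few free-form messages. The stop-word list, the sample messages, and the function names are all made up for this example; real text mining would normally use a dedicated NLP library.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for", "on"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequencies(documents):
    """Count how often each non-stop-word token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts

# Made-up free-form messages standing in for chats or email bodies.
messages = [
    "The server is down, please restart the server.",
    "Restart scheduled for the database server tonight.",
]

freq = term_frequencies(messages)
print(freq.most_common(3))
```

Even this simple count turns free text into something a program can rank and compare, which is the first step of most text mining pipelines.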
| Properties | Structured data | Semi-structured data | Unstructured data |
|---|---|---|---|
| Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data |
| Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency |
| Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole |
| Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema |
| Scalability | It is very difficult to scale DB schema | Its scaling is simpler than structured data | It is more scalable |
| Robustness | Very robust | New technology, not very widespread | - |
| Query performance | Structured queries allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible |
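The comparison above can be made concrete with a small sketch that reads the same order from a structured CSV row, a semi-structured XML fragment, and an unstructured sentence. The order data, field names, and the regular expression are assumptions invented for this illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: a fixed schema (the header row) says where each field lives.
structured = "order_id,customer,amount\n101,Asha,250\n"
row = next(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags; elements and attributes may vary
# from one record to the next without a schema change.
semi_structured = "<order id='101'><customer>Asha</customer><note gift='true'/></order>"
root = ET.fromstring(semi_structured)

# Unstructured: free text; any structure must be inferred. The pattern below
# is an assumption about how such notes happen to be phrased.
unstructured = "Asha called about order 101 and paid 250 rupees for it."
match = re.search(r"order (\d+) and paid (\d+)", unstructured)
extracted = {"order_id": int(match.group(1)), "amount": int(match.group(2))}

print(row["amount"], root.get("id"), root.findtext("customer"), extracted)
```

The CSV and XML readers rely on structure carried by the data itself, while the free-text case works only as long as the sentence matches the guessed pattern.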
The Hadoop ecosystem
• The Hadoop ecosystem is a platform or suite that provides various services to solve big data problems.
• It includes Apache projects and various commercial tools and solutions.
• There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common utilities.
• Most of the tools or solutions are used to supplement or support these major elements.
• All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
The Hadoop ecosystem
The following components collectively form the Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-memory data processing
• Pig, Hive: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Managing the cluster
• Oozie: Job scheduling
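To make the MapReduce entry in the list above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for a word count. It does not use the Hadoop API; the function names and input lines are hypothetical, and a real job would run the mappers and reducers in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine the values for one key into a single count."""
    return key, sum(values)

# Made-up input lines standing in for blocks of a file stored in HDFS.
lines = ["big data needs big tools", "hadoop stores big data"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because each mapper call looks at one line only and each reducer call looks at one key only, the work can be split across many machines without the functions changing.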
Hadoop - Pig
• High-level data flow language for exploring very large datasets
• Provides an engine for executing data flows in parallel on Hadoop
• Compiler that produces sequences of MapReduce programs
• Structure is amenable to substantial parallelization
• Operates on files in HDFS
• Metadata is not required, but is used when available
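The data flow idea described above can be sketched as a chain of load, filter, group, and summarize steps. The example below is a plain Python analogy that runs on one machine, not Pig Latin and not the Pig engine; the records, field layout, and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Made-up records standing in for a file in HDFS: (user, page, seconds).
records = [
    ("asha", "home", 12),
    ("ravi", "home", 3),
    ("asha", "cart", 40),
    ("ravi", "cart", 25),
    ("mina", "home", 8),
]

def load(rows):
    """Load step: yield raw tuples one at a time."""
    yield from rows

def filter_rows(rows, min_seconds):
    """Filter step: keep visits that lasted at least min_seconds."""
    return (r for r in rows if r[2] >= min_seconds)

def group_by_page(rows):
    """Group step: collect tuples under their page key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return groups

def summarize(groups):
    """Summarize step: one (page, visits, total_seconds) tuple per group."""
    return sorted((page, len(rs), sum(r[2] for r in rs)) for page, rs in groups.items())

result = summarize(group_by_page(filter_rows(load(records), min_seconds=5)))
print(result)
```

Each step takes a collection of tuples and returns another one, which is the property that lets an engine such as Pig translate a flow of this shape into a sequence of parallel jobs.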
Key properties of Pig
• Ease of programming: it is trivial to achieve parallel execution of simple, embarrassingly parallel data analysis tasks
• Optimization opportunities: the system can optimize execution automatically, allowing the user to focus on semantics rather than efficiency
• Extensibility: users can create their own functions to do special-purpose processing
Why Hadoop PIG
Apache Hive
Introduction to HBase
