Module V

The document discusses advanced analytics technologies and tools, focusing on unstructured data, the Hadoop ecosystem, and various components such as Pig, Hive, and HBase. It highlights the challenges of handling unstructured data, which constitutes a significant portion of organizational data, and outlines methods for processing it. Additionally, it provides an overview of the Hadoop ecosystem, detailing its major elements and components that facilitate big data management and analysis.

Uploaded by

satyamshivam.in

MCA2004 – Big Data Analytics
Module V - Advanced Analytics - technologies and tools
Contents
• Analytics for unstructured data
• The Hadoop ecosystem
• Pig
• Hive
• Hbase
• Mahout
• Introduction to NoSQL
Analytics for unstructured data
Unstructured Data
• Unstructured data is data that does not conform to a data model or
is not in a form that a computer program can use easily.
• About 80–90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images,
videos, letters, research reports, white papers, email bodies, etc.
Sources of Unstructured Data
• Web Pages
• Images
• Free-form text
• Audio
• Video
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with terminology – Unstructured Data
• Structure can be implied despite not being formally defined
• Data with some structure may still be labeled unstructured if
the structure doesn't help with the processing task at hand
• Data may have some structure, or may even be highly
structured, in ways that are unanticipated or unannounced
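The first point, that structure can be implied without being formally defined, can be illustrated with a minimal Python sketch. The message text, the two patterns, and the function name extract_implied_fields are all invented for illustration; real text would need far more robust patterns.

```python
import re

# Hypothetical free-form message; the wording and addresses are invented.
message = (
    "Hi team, the review meeting moved to 2024-03-15. "
    "Please send questions to priya@example.com or raj@example.org before then."
)

# Patterns for structure that the text implies but no schema declares.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def extract_implied_fields(text):
    """Return the ISO-style dates and email addresses found in free-form text."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "emails": EMAIL_PATTERN.findall(text),
    }

fields = extract_implied_fields(message)
print(fields)
```

The message has no schema, yet a program that knows what to look for can still recover fields from it. Whether such data counts as "unstructured" depends on the task at hand.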
Dealing with Unstructured Data
• Data Mining
• Association Rule Mining
• Regression Analysis
• Collaborative Filtering
• Text analysis and Text Mining
• Natural Language Processing (NLP)
• Noisy-text analysis
• Manual tagging with metadata
• Part-of-speech tagging
• Unstructured Information Management Architecture (UIMA)
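As one illustration of the techniques listed above, the following Python sketch computes the two basic measures used in association rule mining, support and confidence, over documents that have been manually tagged with metadata. The tag sets and the function names are assumptions made for this example; it is a sketch of the idea, not a full rule-mining algorithm such as Apriori.

```python
# Hypothetical transactions: each set holds the tags attached to one document.
transactions = [
    {"invoice", "payment", "overdue"},
    {"invoice", "payment"},
    {"invoice", "refund"},
    {"payment", "overdue"},
    {"invoice", "payment", "refund"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)

# Rule under test: documents tagged "invoice" also tend to be tagged "payment".
rule_support = support({"invoice", "payment"}, transactions)
rule_confidence = confidence({"invoice"}, {"payment"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually kept only if both measures exceed thresholds chosen by the analyst.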
Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema
Scalability | It is very difficult to scale DB schema | Its scaling is simpler than structured data | It is more scalable
Robustness | Very robust | New technology, not very spread | -
Query performance | Structured query allows complex joining | Queries over anonymous nodes are possible | Only textual queries are possible
The Hadoop ecosystem
• The Hadoop ecosystem is a platform, or suite, that provides various
services to solve big data problems.
• It includes Apache projects and various commercial tools and solutions.
• Hadoop has four major elements: HDFS, MapReduce,
YARN, and Hadoop Common utilities.
• Most of the other tools and solutions supplement or support these
major elements.
• All these tools work together to provide services such as ingestion,
analysis, storage, and maintenance of data.
The Hadoop ecosystem
The following components collectively
form the Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-memory data processing
• Pig, Hive: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning
algorithm libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Managing the cluster
• Oozie: Job scheduling
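To make the MapReduce entry above concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases in plain Python, using word count as the task. It does not use the Hadoop API; the function names and sample lines are assumptions for illustration, and on a real cluster the framework would run the phases in parallel across nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical input; on a cluster each line could come from a different HDFS block.
lines = ["big data needs big storage", "hadoop stores big data"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

Because each mapper call sees only one line and each reducer call sees only one key, both phases can be distributed without coordination beyond the shuffle.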
Hadoop - Pig
• High-level data flow language for exploring very large datasets
• Provides an engine for executing data flows in parallel on
Hadoop
• Its compiler produces sequences of MapReduce programs
• The structure of Pig programs is amenable to substantial parallelization
• Operates on files in HDFS
• Metadata is not required, but is used when available
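The data flow style described above can be sketched as a chain of steps, each consuming the output of the previous one. The example below is plain Python on a single machine, not Pig Latin; the comments name the Pig Latin operators (LOAD, FILTER, GROUP, ORDER) that each step loosely corresponds to. The input data and column names are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical input standing in for a file in HDFS; the columns are invented.
raw = io.StringIO(
    "user,page,seconds\n"
    "asha,home,12\n"
    "ben,search,45\n"
    "asha,search,3\n"
    "chen,home,25\n"
    "ben,home,60\n"
)

# LOAD: read the records; the header row acts as optional metadata (a schema).
records = list(csv.DictReader(raw))

# FILTER: keep only visits that lasted at least 10 seconds.
long_visits = [r for r in records if int(r["seconds"]) >= 10]

# GROUP and aggregate: count the remaining visits per page.
visits_per_page = Counter(r["page"] for r in long_visits)

# ORDER: sort pages by descending count, then by name for stable output.
ranked = sorted(visits_per_page.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

Each step is a whole-dataset transformation with no shared state, which is the property that lets a compiler turn such a flow into a sequence of parallel jobs.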
Key properties of Pig
• Ease of programming: it is trivial to achieve parallel execution of
simple, embarrassingly parallel data analysis tasks
• Optimization opportunities: the system can optimize execution, allowing
the user to focus on semantics rather than efficiency
• Extensibility: users can create their own functions to do
special-purpose processing
Why Hadoop PIG
Apache Hive
Introduction to HBase
