Module V - Advanced Analytics - technologies and tools
Contents
• Analytics for unstructured data
• The Hadoop ecosystem
• Pig
• Hive
• HBase
• Mahout
• Introduction to NoSQL
Analytics for unstructured data
Unstructured Data
• Unstructured data does not conform to a data model, or is not in a form that a computer program can use easily.
• About 80–90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, email bodies, etc.
Sources of Unstructured Data
• Web Pages
• Images
• Free-form text
• Audio
• Video
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with terminology – Unstructured Data
• Structure can be implied even when it is not formally defined.
• Data with some structure may still be labeled unstructured if the structure does not help with the processing task at hand.
• Data may have some structure, or may even be highly structured, in ways that are unanticipated or unannounced.
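The first point, that structure can be implied without being formally defined, can be illustrated with a small sketch. The message text, the field names, and the patterns below are all hypothetical and chosen only for illustration; real text would need more robust patterns.

```python
import re

# Hypothetical free-form message: no schema is declared, yet a reader can
# see an order number, a date, and an email address in it.
message = (
    "Hi team, order #48213 shipped on 2023-04-17. "
    "Please send questions to support@example.com. Thanks!"
)

# Each pattern names one piece of the implied structure (illustrative only).
patterns = {
    "order_id": r"#(\d+)",
    "ship_date": r"(\d{4}-\d{2}-\d{2})",
    "contact": r"([\w.+-]+@[\w-]+(?:\.[\w-]+)+)",
}

def extract_fields(text):
    """Return a dict of the fields whose pattern matches the text."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            fields[name] = match.group(1)
    return fields

record = extract_fields(message)
print(record)
```

The text itself never announces these fields; the structure only becomes usable once a program is told where to look for it.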
Dealing with Unstructured Data
• Data Mining
• Association Rule Mining
• Regression Analysis
• Collaborative Filtering
• Text analysis and text mining
• Natural Language Processing (NLP)
• Noisy text analysis
• Manual tagging with metadata
• Part-of-speech tagging
• Unstructured Information Management Architecture (UIMA)
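As a rough illustration of the simplest kind of text analysis listed above, the sketch below tokenizes free text and counts term frequencies. The sample documents and the tiny stop-word list are assumptions made for this example; it does not represent any particular text mining tool.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in", "was"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(documents):
    """Count non-stop-word tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts

# Hypothetical customer feedback, standing in for unstructured text.
docs = [
    "The delivery was late and the package was damaged.",
    "Late delivery again, support was helpful.",
]
freqs = term_frequencies(docs)
print(freqs.most_common(2))
```

Counts like these are often the first step toward turning free text into features that data mining techniques can work with.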
Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Mature transaction management and various concurrency techniques | Transaction management is adapted from DBMS and is not mature | No transaction management and no concurrency
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
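The query row of the comparison can be made concrete with a small sketch that asks a similar question of the same information in three forms. The customer and order data are invented for illustration, and only Python standard library modules are used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: a relational table supports declarative queries and joins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 40.0)]
)
joined = conn.execute(
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
).fetchall()
conn.close()

# Semi-structured: XML is navigated by path rather than by a fixed schema.
xml_doc = ET.fromstring(
    "<customer name='Asha'><order total='25.0'/><order total='40.0'/></customer>"
)
xml_totals = [float(o.get("total")) for o in xml_doc.findall("./order")]

# Unstructured: free text only supports textual search.
note = "Asha placed two orders this week, for 25 and 40 dollars."
mentions_asha = "asha" in note.lower()

print(joined, xml_totals, mentions_asha)
```

The relational form answers an aggregate question in one query, the XML form needs path navigation plus code, and the free-text form can only confirm that a word occurs.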
The Hadoop ecosystem
• The Hadoop ecosystem is a platform, or suite, that provides various services for solving big data problems.
• It includes Apache projects as well as various commercial tools and solutions.
• Hadoop has four major elements: HDFS, MapReduce, YARN, and Hadoop Common utilities.
• Most of the other tools and solutions supplement or support these major elements.
• Together, these tools provide services such as ingestion, analysis, storage, and maintenance of data.
The Hadoop ecosystem
The following components collectively form the Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-memory data processing
• Pig, Hive: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Cluster management
• Oozie: Job scheduling
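To give a feel for the MapReduce programming model named in the list, here is a minimal single-process sketch of word count in plain Python. It only imitates the map, shuffle, and reduce phases; it does not use Hadoop, and the function names and sample input are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input standing in for lines of a file stored in HDFS.
lines = ["big data big ideas", "data tools"]
counts = word_count(lines)
print(counts)
```

On a real cluster the map and reduce calls would run in parallel on different nodes, with the framework handling the grouping step between them.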
Hadoop - Pig
• High-level data flow language for exploring very large datasets
• Provides an engine for executing data flows in parallel on Hadoop
• Its compiler produces sequences of MapReduce programs
• Its structure is amenable to substantial parallelization
• Operates on files in HDFS
• Metadata is not required, but is used when available
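The data flow style described above can be sketched in plain Python as a chain of load, filter, group, and aggregate steps. This is only an analogue of the kind of pipeline a Pig script expresses, not Pig Latin itself; the input records, field layout, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input standing in for a tab-separated file in HDFS.
raw_lines = [
    "alice\tbooks\t12.50",
    "bob\tgames\t30.00",
    "alice\tgames\t20.00",
    "carol\tbooks\t5.00",
]

def load(lines):
    """Load step: parse each line into a (user, category, amount) tuple."""
    for line in lines:
        user, category, amount = line.split("\t")
        yield user, category, float(amount)

def filter_min_amount(records, minimum):
    """Filter step: keep records whose amount is at least `minimum`."""
    return (r for r in records if r[2] >= minimum)

def group_by_user(records):
    """Group step: collect records under their user key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return groups

def total_per_user(groups):
    """Aggregate step: emit one total per user group."""
    return {user: sum(r[2] for r in rows) for user, rows in groups.items()}

totals = total_per_user(group_by_user(filter_min_amount(load(raw_lines), 10.0)))
print(totals)
```

Each step takes a collection of records and produces another, which is what makes this kind of flow straightforward to split across many machines.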
Key properties of Pig
• Ease of programming
  • It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks
• Optimization opportunities
  • Allows the user to focus on semantics rather than efficiency
• Extensibility
  • Users can create their own functions to do special-purpose processing
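As a hedged illustration of extensibility, the sketch below shows what a small user-defined function might look like in Python, one of the languages Pig accepts for such functions. The function `email_domain` and its schema string are hypothetical. The fallback decorator is an assumption added here so the function can also be tried outside Pig, where the `pig_util` module is not available.

```python
try:
    # Available when the script runs as a Pig streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback for running outside Pig (an assumption made for this sketch).
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator

@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lowercase domain of an email address, or None if malformed."""
    if address is None or address.count("@") != 1:
        return None
    local, domain = address.split("@")
    if not local or not domain:
        return None
    return domain.lower()

print(email_domain("User@Example.COM"))
```

Special-purpose cleaning logic like this is the kind of processing that the built-in operators do not cover and that a user-defined function is meant to supply.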
Why Hadoop Pig
Apache Hive
Introduction to HBase