
Assignment 4

1. Explain the ARIMA model?


Ans. An autoregressive integrated moving average, or ARIMA, is a statistical analysis model that uses
time series data to either better understand the data set or to predict future trends.
An autoregressive integrated moving average model is a form of regression analysis that gauges the
strength of one dependent variable relative to other changing variables. The model's goal is to
predict future securities or financial market moves by examining the differences between values in
the series rather than the actual values themselves.
An ARIMA model can be understood by outlining each of its components as follows:
Autoregression (AR) refers to a model in which a changing variable regresses on its own lagged, or
prior, values.
Integrated (I) represents the differencing of raw observations that allows the time series to become
stationary, i.e., data values are replaced by the differences between the data values and the previous values.
Moving average (MA) incorporates the dependency between an observation and a residual error
from a moving average model applied to lagged observations.
Each component functions as a parameter with a standard notation. For ARIMA models, a standard
notation would be ARIMA with p, d, and q, where integer values substitute for the parameters to
indicate the type of ARIMA model used. The parameters can be defined as:
p: the number of lag observations in the model; also known as the lag order.
d: the number of times that the raw observations are differenced; also known as the degree of
differencing.
q: the size of the moving average window; also known as the order of the moving average.
As in a linear regression model, the parameters determine the number and type of terms included. A
value of 0 for a parameter means that the corresponding component is not used in the model. In this
way, the ARIMA model can be constructed to perform the function of an ARMA model, or even a
simple AR, I, or MA model.
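To make the (p, d, q) notation concrete, below is a minimal sketch in Python using the statsmodels library; the series values and the chosen order (1, 1, 1) are illustrative assumptions, not part of the original answer.

```python
# A minimal sketch of fitting an ARIMA(p, d, q) model with statsmodels;
# the data and the order (1, 1, 1) are made up for illustration.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical univariate time series (e.g., monthly closing prices).
series = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0,
                   148.0, 148.0, 136.0, 119.0, 104.0, 118.0])

# order=(p, d, q): p lagged terms (AR), d differences (I), q moving-average terms (MA).
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

# Forecast the next three values of the series.
print(fitted.forecast(steps=3))
```

Setting d to 0 in the order argument reduces the fit to an ARMA model, and (p, 0, 0) to a pure AR model, mirroring the reduction described above.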

2. Explain term frequency-inverse document frequency (TF-IDF)?


Ans. Term Frequency-Inverse Document Frequency (TF-IDF) is a widely known technique in text
processing. This technique allows one to assign each term in a document a weight. Terms with high
frequency within a document receive high weights. In contrast, terms that appear frequently across all
documents of the corpus receive lower weights.
TF-IDF is used in a large variety of applications. Typical use cases include:
Document search.
Document tagging.
Text preprocessing and feature vector engineering for Machine Learning algorithms
TF-IDF is the most fundamental metric used extensively in classification of documents.
Let us try and define these terms:
Term frequency measures how often a certain word occurs in a document relative to the other words
in that document. It is defined as follows:

TF_ij = f_ij / f_max,j

where TF_ij represents the term frequency of the i-th word in the j-th document, f_ij represents the
frequency of that word in the document, and f_max,j represents the frequency of the word that occurs
the maximum number of times in that document.
Hence the term frequency of a word for a particular document can attain a maximum value of 1.

Inverse document frequency, on the other hand, measures how widely a word occurs across all the
documents in a given collection (the documents we want to classify into different categories).
So if there are N documents in total, the IDF of the i-th word, which appears in n_i of them, can be
expressed as follows:
IDF_i = log2(N / n_i)
The terms with the highest TF*IDF scores are considered to characterize a document properly and are the most useful for classifying it.
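To tie the two formulas together, the following is a small Python sketch that computes TF-IDF exactly as defined above; the three toy documents are invented for illustration.

```python
# Compute TF-IDF per the definitions above:
# TF_ij = f_ij / f_max,j  and  IDF_i = log2(N / n_i).
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)

# Number of documents each word appears in (n_i).
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tf_idf(tokens):
    counts = Counter(tokens)
    f_max = max(counts.values())
    return {
        word: (f / f_max) * math.log2(N / doc_freq[word])
        for word, f in counts.items()
    }

for j, tokens in enumerate(tokenized):
    print(f"document {j}:", tf_idf(tokens))
```

Note that a word occurring in every document gets IDF = log2(N/N) = 0, so it contributes nothing to the document's characterization, as the definition intends.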

3. Discuss sentiment analysis in big data analytics?


Ans. Sentiment analysis, also referred to as opinion mining, is an approach to natural language processing
(NLP) that identifies the emotional tone behind a body of text. This is a popular way for
organizations to determine and categorize opinions about a product, service or idea. It involves the
use of data mining, machine learning (ML) and artificial intelligence (AI) to mine text for sentiment
and subjective information.
Sentiment analysis systems help organizations gather insights from unstructured text that
comes from online sources such as emails, blog posts, support tickets, web chats, social media
channels, and forum comments. Algorithms replace manual data processing by implementing rule-
based, automatic or hybrid methods. Rule-based systems perform sentiment analysis based on
predefined, lexicon-based rules, while automatic systems learn from data with machine learning
techniques. A hybrid approach combines both.
In addition to identifying sentiment, opinion mining can extract the polarity (or the amount of
positivity and negativity), subject and opinion holder within the text. Furthermore, sentiment
analysis can be applied at varying scopes, such as the document, paragraph, sentence, and sub-sentence
levels.
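To make the rule-based (lexicon) approach concrete, here is a toy sketch in Python; the lexicon and the example texts are invented for illustration, and real systems use far larger lexicons and handle negation, intensifiers, and context.

```python
# Toy rule-based sentiment scorer: look each word up in a small hand-made
# lexicon and sum the scores to get an overall polarity.
LEXICON = {"great": 1, "love": 1, "helpful": 1,
           "poor": -1, "slow": -1, "terrible": -1}

def polarity(text):
    words = text.lower().split()
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ["The support team was great and very helpful",
               "Terrible app, slow and full of bugs"]:
    print(polarity(review), "-", review)
```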
Vendors that offer sentiment analysis platforms or SaaS products include Brandwatch, Hootsuite,
Lexalytics, NetBase, Sprout Social, Sysomos and Zoho. Businesses that use these tools can review
customer feedback more regularly and proactively respond to changes of opinion within the market.

Applications of sentiment analysis


Sentiment analysis tools can be used by organizations for a variety of applications, including:

Identifying brand awareness, reputation, and popularity at a specific moment or over time.


Tracking consumer reception of new products or features.
Evaluating the success of a marketing campaign.
Pinpointing the target audience or demographics.
Collecting customer feedback from social media, websites or online forms.
Conducting market research.
Categorizing customer service requests.

4. List the features of Mahout and discuss the Mahout machine learning algorithms?
Ans.
We are living in a day and age where information is available in abundance. The information
overload has scaled to such heights that sometimes it becomes difficult to manage our little
mailboxes! Imagine the volume of data and records some of the popular websites (the likes of
Facebook, Twitter, and Youtube) have to collect and manage on a daily basis. It is not uncommon
even for lesser known websites to receive huge amounts of information in bulk.

Normally we fall back on data mining algorithms to analyze bulk data to identify trends and draw
conclusions. However, no data mining algorithm can be efficient enough to process very large
datasets and provide outcomes in quick time, unless the computational tasks are run on multiple
machines distributed over the cloud.

We now have new frameworks that allow us to break down a computation task into multiple
segments and run each segment on a different machine. Mahout is one such data mining framework;
it normally runs coupled with the Hadoop infrastructure in the background to manage huge
volumes of data.

What is Apache Mahout?


A mahout is one who drives an elephant as its master. The name comes from the project's close association
with Apache Hadoop, which uses an elephant as its logo.

Hadoop is an open-source framework from Apache that allows one to store and process big data in a
distributed environment across clusters of computers using simple programming models.

Apache Mahout is an open source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as:

Recommendation
Classification
Clustering
Apache Mahout started as a sub-project of Apache Lucene in 2008. In 2010, Mahout became a top-level
project of Apache.

Features of Mahout
The principal features of Apache Mahout are listed below.

The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment.
Mahout uses the Apache Hadoop library to scale effectively in the cloud.

Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of
data.

Mahout lets applications analyze large sets of data effectively and quickly.

Includes several MapReduce enabled clustering implementations such as k-means, fuzzy k-means,
Canopy, Dirichlet, and Mean-Shift.

Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
Comes with distributed fitness function capabilities for evolutionary programming.

Includes matrix and vector libraries.

Algorithms Supported in Apache Mahout
Apache Mahout implements sequential and parallel machine learning algorithms, which can run on
MapReduce, Spark, H2O, and Flink. The current version of Mahout (0.10.0) focuses on
recommendation, clustering, and classification tasks.

Supported algorithms include:
User-Based Collaborative Filtering
Item-Based Collaborative Filtering
Matrix Factorization with ALS
Matrix Factorization with ALS on Implicit Feedback
Weighted Matrix Factorization, SVD++
Logistic Regression - trained via SGD
Naive Bayes / Complementary Naive Bayes
Random Forest
Hidden Markov Models
Multilayer Perceptron
k-Means Clustering
Fuzzy k-Means
Streaming k-Means
Spectral Clustering
Singular Value Decomposition
Stochastic SVD
PCA
QR Decomposition
Latent Dirichlet Allocation
RowSimilarityJob
ConcatMatrices
Collocations
Sparse TF-IDF Vectors from Text
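As a purely illustrative aside, the sketch below implements item-based collaborative filtering, one of the algorithms listed above, using cosine similarity in plain Python. It does not use Mahout's Java/Scala APIs, and the rating matrix is invented; it only shows the idea behind the algorithm that Mahout provides at scale.

```python
# Toy item-based collaborative filtering with cosine similarity between
# item rating vectors. Not Mahout's API; ratings are made up.
import math

# ratings[user][item] = rating
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 4, "item2": 4, "item4": 2},
    "carol": {"item2": 2, "item3": 5, "item4": 4},
}

def item_vector(item):
    # Ratings given to `item`, keyed by user.
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(a, b):
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(user, item):
    # Weighted average of the user's own ratings on items similar to `item`.
    target = item_vector(item)
    num = den = 0.0
    for other, rating in ratings[user].items():
        sim = cosine(target, item_vector(other))
        num += sim * rating
        den += abs(sim)
    return num / den if den else None

print(predict("alice", "item4"))  # estimate Alice's rating for item4
```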

5. List the features of HBase and discuss its architecture?


Ans. HBase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.

HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access
to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop
Distributed File System (HDFS).
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.

HBase Architecture
HBase has three major components: the client library, a master server, and region servers. Region
servers can be added or removed as per requirement.
Master Server
The master server -

Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.

Handles load balancing of the regions across region servers. It unloads the busy servers and shifts
the regions to less occupied servers.

Maintains the state of the cluster by negotiating the load balancing.

Is responsible for schema changes and other metadata operations such as creation of tables and
column families.

Regions
Regions are nothing but tables that are split up and spread across the region servers.

The region servers have regions that -

Communicate with the client and handle data-related operations.


Handle read and write requests for all the regions under them.
Decide the size of the regions by following the region size thresholds.
When we take a deeper look into a region server, we see that it contains regions and stores.
The store contains the MemStore and HFiles. The MemStore is just like a cache memory: anything that is
entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as
blocks, and the MemStore is flushed.

ZooKeeper
ZooKeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.

ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these
nodes to discover available servers.

In addition to availability, the nodes are also used to track server failures or network partitions.

Clients contact ZooKeeper to locate region servers and then communicate with those region servers directly.

In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
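As a hedged illustration of how a client interacts with the architecture described above, the sketch below uses the third-party happybase Python client, which talks to HBase through a Thrift gateway. The host, table name, and column family are hypothetical, and the table is assumed to already exist.

```python
# Basic HBase client operations via the third-party happybase library.
# Host, table, and column family below are hypothetical assumptions.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # Thrift gateway
table = connection.table("users")

# Write: the put goes to the MemStore of the region server that owns this
# row, and is later flushed to HFiles on HDFS.
table.put(b"row-001", {b"cf:name": b"Alice", b"cf:city": b"Pune"})

# Random read by row key, served by the owning region server.
print(table.row(b"row-001"))

# Scan a range of row keys.
for key, data in table.scan(row_start=b"row-000", row_stop=b"row-100"):
    print(key, data)

connection.close()
```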

6. Define and distinguish Hive and Pig for data analysis?


Ans. Hive
Hive is developed on top of Hadoop. It is a data warehouse framework for querying and analyzing
data that is stored in HDFS. Hive is open-source software that lets programmers analyze large
data sets on Hadoop.
Pig
Apache Pig is a platform for analyzing large datasets. It consists of a high-level language (Pig Latin)
for expressing data analysis programs, along with the infrastructure for evaluating these programs.
Because Pig programs can be highly parallelized, they can handle very large data sets.
Pig vs. Hive:
Pig is a procedural data flow language; Hive is a declarative, SQL-like language.
Pig is used mainly for programming; Hive is used mainly for creating reports.
Pig is used mainly by researchers and programmers; Hive is used mainly by data analysts.
Pig operates on the client side of a cluster; Hive operates on the server side of a cluster.
Pig does not have a dedicated metadata database; Hive uses a dedicated metadata store and SQL-like DDL, defining tables beforehand.
Pig Latin is SQL-like but differs from SQL to a great extent; Hive directly leverages SQL-like syntax and is easy for database experts to learn.
Pig supports the Avro file format; Hive does not.
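To illustrate the declarative-versus-procedural contrast, the sketch below runs a HiveQL query from Python using the third-party PyHive client; the server address, table, and columns are hypothetical. The equivalent Pig Latin data flow is shown only as a comment for comparison.

```python
# Declarative (Hive) style: one SQL-like statement, Hive plans the jobs.
# Server, table, and columns are hypothetical assumptions.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="analyst")
cursor = conn.cursor()

cursor.execute(
    "SELECT category, COUNT(*) AS cnt "
    "FROM sales WHERE amount > 100 "
    "GROUP BY category"
)
for category, cnt in cursor.fetchall():
    print(category, cnt)

conn.close()

# The same analysis in Pig Latin would be a step-by-step data flow, roughly:
#   raw     = LOAD 'sales' AS (category:chararray, amount:double);
#   big     = FILTER raw BY amount > 100;
#   grouped = GROUP big BY category;
#   counts  = FOREACH grouped GENERATE group, COUNT(big);
#   DUMP counts;
```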
