Questions
TF_ij = f_ij / f_max(j)
where TF_ij represents the term frequency of the ith word in the jth document, f_ij represents the
frequency of that word in the document, and f_max(j) represents the frequency of the word that
occurs the maximum number of times in that document.
Hence the term frequency of a word for a particular document can attain a maximum value of 1.
Inverse document frequency (IDF), on the other hand, reflects how widely a word occurs across all the
documents in a given collection (the documents we want to classify into different categories).
So if there are a total of N documents, then the IDF of the ith word, which is present in n_i documents, can be
expressed as follows:
IDF_i = log2(N / n_i)
The terms with the highest TF*IDF values are considered to best characterize, and hence classify, a document.
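As a quick illustration of how the two formulas fit together, here is a minimal sketch (not taken from the source; the documents and words are made-up examples) that scores the words of one document against a small collection:

import java.util.*;

// Minimal TF-IDF sketch for a toy collection of documents.
public class TfIdfExample {
    public static void main(String[] args) {
        // Hypothetical collection of N = 3 documents, each a list of words.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("big", "data", "data", "hadoop"),
                Arrays.asList("mahout", "machine", "learning", "data"),
                Arrays.asList("hbase", "stores", "big", "tables"));
        int N = docs.size();

        // Document frequency n_i: number of documents containing word i.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs) {
            for (String w : new HashSet<>(d)) {
                df.merge(w, 1, Integer::sum);
            }
        }

        // Score the words of the first document.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : docs.get(0)) counts.merge(w, 1, Integer::sum);
        int fMax = Collections.max(counts.values()); // frequency of the most frequent word in this document

        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / fMax;                                   // TF_ij = f_ij / f_max(j)
            double idf = Math.log((double) N / df.get(e.getKey())) / Math.log(2);       // IDF_i = log2(N / n_i)
            System.out.printf("%-8s TF=%.2f IDF=%.2f TF*IDF=%.2f%n", e.getKey(), tf, idf, tf * idf);
        }
    }
}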
4. List out the features of Mahout and discuss the Mahout machine learning algorithms?
Ans.
Mahout - Introduction
We are living in a day and age where information is available in abundance. The information
overload has scaled to such heights that sometimes it becomes difficult to manage our little
mailboxes! Imagine the volume of data and records some of the popular websites (the likes of
Facebook, Twitter, and YouTube) have to collect and manage on a daily basis. It is not uncommon
even for lesser-known websites to receive huge amounts of information in bulk.
Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw
conclusions. However, no data mining algorithm is efficient enough to process very large
datasets and deliver results quickly unless the computational tasks are run on multiple
machines distributed over the cloud.
We now have new frameworks that allow us to break down a computation task into multiple
segments and run each segment on a different machine. Mahout is one such data mining framework;
it normally runs on top of the Hadoop infrastructure, which it uses in the background to manage huge
volumes of data.
Hadoop is an open-source framework from Apache that allows you to store and process big data in a
distributed environment across clusters of computers using simple programming models.
Apache Mahout is an open-source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as the following (a small recommender sketch appears after the list):
Recommendation
Classification
Clustering
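To illustrate the recommendation technique, here is a minimal sketch using Mahout's user-based collaborative-filtering (Taste) API from the 0.x line; the input file ratings.csv, the neighbourhood size of 10, and the user ID are assumptions for this example, not values from the source:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderSketch {
    public static void main(String[] args) throws Exception {
        // "ratings.csv" is a hypothetical file of userID,itemID,rating lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Compare users with Pearson correlation and keep the 10 nearest neighbours.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}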
Apache Mahout started as a sub-project of Apache's Lucene in 2008. In 2010, Mahout became a top-level
project of Apache.
Features of Mahout
The main features of Apache Mahout are listed below.
The algorithms of Mahout are written on top of Hadoop, so Mahout works well in a distributed environment.
Mahout uses the Apache Hadoop library to scale effectively in the cloud.
Mahout offers the coder a ready-to-use framework for performing data mining tasks on large volumes of
data.
Mahout lets applications analyze large sets of data effectively and quickly.
Includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-means,
Canopy, Dirichlet, and Mean-Shift.
Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
Comes with distributed fitness function capabilities for evolutionary programming.
Algorithms Supported in Apache Mahout
Apache Mahout implements sequential and parallel machine learning algorithms, which can run on
MapReduce, Spark, H2O, and Flink. The current version of Mahout (0.10.0) focuses on
recommendation, clustering, and classification tasks.
HBase is a data model, similar to Google's Bigtable, designed to provide quick random access
to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop
Distributed File System (HDFS).
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients (a short client sketch follows this list).
It provides data replication across clusters.
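Since the list above mentions the Java client API, here is a minimal sketch of writing and reading one cell. It assumes a table named "employee" with a column family "personal" already exists and that an hbase-site.xml pointing at the cluster is on the classpath; those names are illustrative, not from the source:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Write one cell: row "row1", column family "personal", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Raju"));
            table.put(put);

            // Read it back with a random-access Get on the same row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}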
HBase Architecture
HBase has three major components: the client library, a master server, and region servers. Region
servers can be added or removed as per requirement.
Master Server
The master server -
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
Handles load balancing of the regions across region servers. It unloads the busy servers and shifts
the regions to less occupied servers.
Is responsible for schema changes and other metadata operations such as creation of tables and
column families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region Server
Region servers hold the regions; each region contains stores, and a store contains a memstore and
HFiles. The memstore works like a cache: anything that is entered into HBase is stored here initially.
Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed.
ZooKeeper
ZooKeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.
ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these
nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or network partitions.
In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
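As a rough sketch of how such ephemeral nodes work, the snippet below registers a fictitious region server under an ephemeral znode using the plain ZooKeeper Java client; the /rs path, the server name, and the connection string are assumptions made for illustration only:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralNodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble; the watcher does nothing here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Make sure the parent path exists (persistent node, assumed layout).
        if (zk.exists("/rs", false) == null) {
            zk.create("/rs", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register this "region server" under an ephemeral znode. The node disappears
        // automatically if the session dies, which is how a master can detect that the
        // server has failed or been partitioned away.
        String path = zk.create("/rs/regionserver-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("Registered at " + path);

        // A master process would list the children of /rs (optionally setting a watch)
        // to discover which servers are currently alive.
        System.out.println(zk.getChildren("/rs", false));

        zk.close();
    }
}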