ML Cloud
SaaS Solutions ∗
Daniel Pop
Institute e-Austria Timişoara
Bd. Vasile Pârvan No. 4, 300223 Timişoara, România
arXiv:1603.08767v1 [cs.DC] 29 Mar 2016
E-mail: [email protected]
mode yet, who are offering machine learning services
to their customers, or big data analysis services can be
noticed in the past 5 years. These initiatives can be either
PaaS/SaaS platforms or products that can be deployed
on private environments.
IBM, Microsoft) and academia (Berkeley, NYU, University of California etc.).

Recent articles, such as those of S. Charrington [3], W. Eckerson [4] and D. Harris [5], review different large-scale ML solution providers that are trying to offer better tools and technologies, most of them based on Hadoop infrastructure, to move forward the novel industry of big data. They aim at improving user experience, product recommendations, or website optimization, with applications in finance, telecommunications, retail, advertising or media.

3 Machine Learning environments from the cloud

Providers in this category offer computer clusters hosted by public cloud providers, such as Amazon EC2, Rackspace etc., pre-installed with statistics software, the preferred packages being the R system, Octave or Maple. These solutions offer scalable high-performance resources in the cloud to their customers, who are freed from the burden of installing and managing their own clusters.

Cloudnumbers.com 1 uses Amazon EC2 2 to set up computer clusters preinstalled with software for scientific computing, such as the R system, Octave or Maple. Customers benefit from a web interface where they can create their own workspaces, configure and monitor the cluster, upload datasets or connect to public databases. On top of the default features of the cloud provider, Cloudnumbers offers high security standards by providing secure encryption for data transmission and storage. Overall, it is an HPC platform in the cloud that is easy to create and effortless to maintain.

CloudStat 3 is a cloud integrated development environment built on the R system. It exposes its functionality via two types of user interface: a console, for users experienced in the R language, and applications, designed as a point-and-click, forms-based interface to R for users with no R programming skills. There is also a CloudStat AppStore where users can choose applications from a growing repository.

Opani 4 offers services similar to Cloudnumbers.com, but additionally helps customers to size their cluster according to their needs: the size of the data and the time frame for processing it. It uses Rackspace's 5 infrastructure and supports environments such as the R system, Node and Python, bundled with map/reduce, visualization, security and version control packages. Results of data analysis processes, named dashboards in Opani, can easily be visualized and shared from desktop or mobile devices.

Approaches in this class are powerful and flexible solutions, offering users the possibility to develop complex ML-DM applications that run in the cloud. Users are freed from the burden of provisioning their own distributed environments for scientific computing, while being able to use their favorite environments. On the other hand, users of these tools need extensive programming experience and strong knowledge of statistics. Perhaps due to this limited audience, the stable providers in this category are fewer than in other categories, some of them (such as CRdata.org) shutting down operations shortly after taking off.

4 Plugins for Machine Learning tools

In this class, statistics applications (e.g. R system, Python) are extended with plugins that allow users to create a Hadoop cluster in the cloud and run time-consuming jobs over large datasets on it. Most of the interest went towards R, for which several extensions are available, compared to Python, for which less effort was invested until recently in supporting distributed processing. In this section we mention several solutions for R and Python.

RHIPE 6 is an R package that implements a map/reduce framework for R, offering access to a Hadoop installation from within the R environment. Using specific R functions, users are able to launch map/reduce jobs executed on the Hadoop cluster; results are then retrieved from HDFS.

Snow 7 [16] and its variants (snowfall, snowFT) implement a framework that is able to express an important class of parallel computations and is easy to use within an interactive environment like R. It supports three types of clusters: socket-based, MPI, and PVM.

The Segue for R 8 project makes it easier to run map/reduce jobs from within the R environment on elastic clusters at Amazon Elastic MapReduce 9.

1 http://cloudnumbers.com
2 http://aws.amazon.com/ec2/
3 http://cs.croakun.com
4 http://opani.com
5 http://rackspace.com
6 http://www.stat.purdue.edu/~sguha/rhipe/doc/html/index.html
7 http://cran.r-project.org/web/packages/available_packages_by_name.html
8 http://code.google.com/p/segue/
9 http://aws.amazon.com/elasticmapreduce/
10 https://store.continuum.io/cshop/anaconda
11 http://continuum.io

Anaconda 10 is a scalable data analytics and scientific computing solution in Python offered by Continuum Analytics 11. It is a collection of packages (NumbaPro –
fast, multi-core and GPU-enabled computation, IOPro – fast data access, and wiseRF Pine – multi-core implementation of the Random Forest) that enables large-scale data management, analysis, visualization and more. It can be installed as a full Python distribution or plugged into an existing installation.

Due to its popularity among ML-DM practitioners (the R system being the preferred tool for such tasks in the past 2 years [15, 10]), efforts have been made recently to parallelize lengthy processes on scalable distributed frameworks (Hadoop). This approach is largely preferred over ML in the cloud due to the possibility of re-using the existing infrastructure of research or industrial (private) data centers. To the best of our knowledge, there are no similar approaches for related mathematical tools, such as Mathematica, Maple or Matlab/Octave, except HadoopLink 12 for Mathematica. The audience of this class of solutions is also highly qualified in programming languages, mathematics, statistics and machine learning algorithms.

5 Distributed Machine Learning libraries

This category offers complex libraries operating on various distributed setups (Hadoop, Dryad, MPI). They allow users to use out-of-the-box algorithms, or implement their own, that are run in parallel mode over a cluster of computers. These solutions do not integrate, nor use, statistics/mathematics software; rather, they offer self-contained packages of optimised, state-of-the-art ML-DM methods and algorithms.

Apache Mahout™ 13 [12] is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform [20]. It started as a collection of independent, "Hadoop-free" components, e.g. "Taste" collaborative-filtering. Its goal is to build scalable machine learning libraries, where scalable has a broader meaning:

• Scalable to reasonably large datasets. Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop [20] using the map/reduce paradigm. However, it does not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

• Scalable to support various business cases. Mahout is distributed under a commercially friendly Apache Software license.

• Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.

Currently Mahout supports mainly four use cases:

• Recommendation mining takes users' behavior and from that tries to find items users might like.

• Clustering takes e.g. text documents and groups them into groups of topically related documents.

• Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.

• Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.

Integration with initiatives such as the graph processing platform Apache Giraph 14 is actively under discussion. An active community is behind this project.

GraphLab 15 [11] is a framework for ML-DM in the Cloud. While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, GraphLab is an abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance, in both shared-memory and distributed settings. It is written in C++ and is able to directly access data from the Hadoop Distributed File System (HDFS) [20]. The authors report out-performing similar approaches by orders of magnitude.

12 https://github.com/shadanan/HadoopLink
13 http://mahout.apache.org
14 http://incubator.apache.org/giraph/
15 http://graphlab.org
16 http://research.microsoft.com/en-us/projects/DryadLINQ/
17 http://msdn.microsoft.com/netframework/future/linq/

DryadLINQ 16 [19, 2] is a LINQ (Language INtegrated Query 17) subsystem developed at Microsoft
Research on top of Dryad [9], a general purpose architecture for execution of data parallel applications. It supports DAG-based abstractions, inherited from Dryad, for implementing data processing algorithms. A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan, which is passed to the Dryad execution platform that ensures efficient and reliable execution of this plan. The authors demonstrate near-linear scaling of execution time with the number of computers used for a job. While the DAG-based abstraction permits rich computational dependencies, it does not naturally express iterative, data parallel, task parallel and dynamic data driven algorithms that are prevalent in ML-DM.

Jubatus 18 [8], started in April 2011, is an online/real-time machine learning platform implemented on a distributed architecture. Compared to Mahout™, it is a next-step platform that offers stream processing and online learning. In online ML, the model is continuously updated with each incoming data sample by fast and not memory-intensive algorithms. It requires no data storage or sharing, only model mixing. It supports classification problems (Passive Aggressive (PA), Confidence Weighted Learning, AROW), PA-based regression, nearest neighbor (LSH, MinHash, Euclid LSH), recommendation, anomaly detection (LOF based on NN) and graph analysis (shortest path, PageRank). In order to efficiently support online learning, Jubatus applies updates to local models; each server then transmits its model difference, and the differences are merged and distributed back to all servers. The mixed model improves gradually thanks to all servers' work.

IBM Parallel Machine Learning Toolbox 19 [13] (PMLT), a joint effort of the Machine Learning group at the IBM Haifa Lab and the Data Analytics department at the IBM Watson Lab, provides tools for execution of data mining and machine learning algorithms on multiple processor environments or on multiple threaded machines. The toolbox comprises two main components: an API for running the users' own machine learning algorithms, and several pre-programmed algorithms which serve both as examples and for comparison. The pre-programmed algorithms include a parallel version of the Support Vector Machine (SVM) classifier, linear regression, transform regression, nearest neighbors, k-means, fuzzy k-means, kernel k-means, PCA, and kernel PCA. One of the main advantages of the PML toolbox is the ability to run it on a variety of operating systems and platforms, from multi-core laptops to supercomputers such as BlueGene. This is because the toolbox incorporates a parallelization infrastructure that completely separates parallel communications, control, and data access from learning algorithm implementation. This approach enables learning algorithm designers to focus on algorithmic issues without having to concern themselves with low-level parallelization issues. It also enables learning algorithms to be deployed on multiple hardware architectures, running either serially or in parallel, without having to change any algorithmic code. The toolbox uses the popular MPI library as the basis for its operation, and is written in C++. Despite our efforts to get the latest news on this project, we found no activity since 2007, except for a chapter in [1] (2012). On the other hand, the toolkit is suited for parallel environments, not for distributed ones.

18 http://jubat.us/
19 https://www.research.ibm.com/haifa/projects/verification/ml_toolbox/index.html

NIMBLE [6] is a sequel project to the Parallel Machine Learning Toolbox, also developed at IBM Research Labs. It exposes a multi-layered framework where developers may express their ML-DM algorithms as tasks. Tasks are then passed to the next layer, an architecture-independent layer, composed of one queue of DAGs of tasks, plus a pool of worker threads that unfold this queue. The next layer is an architecture-dependent layer that translates the generic entities from the upper layer into various runtimes. Currently, NIMBLE supports execution on the Hadoop platform [20] only. Other platforms, such as Dryad [9], are also good candidates, but not yet supported. Advantages of this framework include:

• higher level of abstraction, hiding low-level control and choreography details of most of the distributed and parallel programming paradigms (MR, MPI etc.), allowing programmers to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks

• portability: by providing a specific implementation of the architecture-dependent layer, the same code can be executed on various distributed runtimes

• efficiency and scalability: due to optimisation introduced by DAGs of tasks and co-scheduling, results presented in [6] for the Hadoop runtime show speedup improvement with increasing dataset size and dimensionality.
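The architecture-independent layer described above for NIMBLE (tasks organised as DAGs and unfolded by a pool of worker threads) can be illustrated with a small Python sketch. This is not NIMBLE's API or implementation: the function `run_dag`, its parameters and the example task names are hypothetical, the sketch runs on a single machine, and it executes one DAG rather than a queue of DAGs.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_dag(tasks, deps, max_workers=4):
    """Run a DAG of tasks on a pool of worker threads (illustrative only).

    tasks: dict mapping a task name to a callable that receives a dict
           {dependency name: dependency result}
    deps:  dict mapping a task name to the list of task names it depends on
    Returns a dict mapping each task name to its result.
    """
    # Unsatisfied dependencies of every task that has not been submitted yet.
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    results = {}
    running = {}  # future -> task name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every task whose dependencies are all satisfied.
            ready = [name for name, unmet in remaining.items() if not unmet]
            for name in ready:
                del remaining[name]
                inputs = {d: results[d] for d in deps.get(name, ())}
                running[pool.submit(tasks[name], inputs)] = name

            if not running:
                # Nothing runs and nothing is ready: the graph is not a DAG.
                raise ValueError("cycle or missing dependency in task graph")

            # Block until at least one running task finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                for unmet in remaining.values():
                    unmet.discard(name)
    return results


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    tasks = {
        # Two independent partial sums that may run in parallel.
        "left": lambda _: (sum(data[:3]), 3),
        "right": lambda _: (sum(data[3:]), 3),
        # A combining task that depends on both partial results.
        "mean": lambda r: (r["left"][0] + r["right"][0])
        / (r["left"][1] + r["right"][1]),
    }
    deps = {"mean": ["left", "right"]}
    print(run_dag(tasks, deps)["mean"])  # 3.5
```

The algorithm author only declares tasks and their dependencies; the scheduling loop decides what can run concurrently. Swapping the thread pool for another backend would leave the task definitions untouched, which is the kind of portability the layered design aims for.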
SystemML [7], developed at IBM Research labs like NIMBLE and PMLT, proposes an R-like language (Declarative Machine Learning language) that includes linear algebra primitives, and shows how it can be optimized and compiled down to MapReduce. The authors report an extensive performance evaluation of three ML algorithms (Group Nonnegative Matrix Factorization, Linear regression, PageRank) on varying data and Hadoop cluster sizes.

Table 5 presents a synthesis of the investigated platforms. One can notice that Java is the preferred environment, due to the large adoption and usage of Hadoop as a distributed processing model. The good news is that the most active and lively solutions are the open-source ones. The target audience of this class of products is programmers, system developers and ML experts who need fast, scalable distributed solutions for ML-DM problems.

6 Complex Machine Learning systems

This section presents several solutions for business intelligence and data analytics that share a set of common features: (i) all are deployable on on-premise or in-the-cloud clusters, (ii) they provide a rich set of graphical tools to analyse, explore and visualize large amounts of data, (iii) they expose a rather limited set of ML-DM functions, usually limited to prediction models, and (iv) they utilize Apache Hadoop [20] as processing engine and/or storage environment. They differ in how data is integrated and processed, in the supported data sources, and in the complexity of the system. Here are the best known ones:

Kitenga Analytics 20, recently purchased by Dell, is a native Hadoop application that offers visual ETL, Apache Solr™ 21-based search, natural language processing, Apache Mahout-based data mining, and advanced visualization capabilities. It is a big data environment for sophisticated analysts who want a robust toolbox of analytical tools, all from an easy-to-use interface that does not require understanding of complex programming or the Apache Hadoop stack itself.

Pentaho Business Analytics 22 offers a complete solution for big data analytics, supporting all phases of an analytics process, from pre-processing to advanced data exploration and visualization. It offers (i) a complete visual design tool to accelerate data preparation and modeling, (ii) data integration from NoSQL and relational databases, (iii) distributed execution on the Hadoop platform [20], (iv) instant and interactive analysis (no code, no ETL (Extract, Transform, Load)) and (v) a business analytics platform: data discovery, exploration, visualization and predictive analytics. The main characteristics of the Pentaho solution include:

• MapReduce-based data processing

• Can be configured for different Hadoop distributions (such as Cloudera, Hadapt etc.)

• Data can be loaded and processed into Hadoop HDFS, HBase 23, or Hive 24

• Supports Pig scripts

• Native support for most NoSQL databases, such as Apache Cassandra, DataStax, Apache HBase, MongoDB, 10gen etc.

• Enables performance-optimized data analysis, reporting and data integration for analytic databases (such as Teradata, monetdb, Netezza etc.), through deep integration with native SQL dialects and parallel bulk data loader

• Integration with HPCC (High Performance Computing Cluster) from LexisNexis Risk Solutions 25

• Import/export from/to PMML (Predictive Modeling Markup Language)

• Pentaho Instaview, a visual application to reduce the time needed to deploy data analytics solutions and to help novice users gain insights from their data, in three simple steps: select data source, automatically prepare data for analytics, and visualize and explore built models.

• Pentaho Mobile, an application for iPad that provides interactive business analytics for business users

20 http://www.quest.com/news-release/quest-software-expands-its-big-data-solution-with-new-hadoop-ce-102012-818658.aspx
21 http://lucene.apache.org/solr/
22 http://www.pentaho.org
23 http://hbase.apache.org
24 hive.apache.org
25 http://hpccsystems.com

Their ecosystem is composed of several powerful systems, each of them a complex project of its own:

Pentaho BI Platform/Server: the BI platform is a framework providing core services, such as authentication, logging, auditing and rules engines; it also has a solution engine that integrates all other systems (reporting, analysis, integration and data mining); BI Server is the most well known implementation of the platform, which functions as a web based report
management system, application integration server and lightweight workflow engine.

Pentaho Reporting, based on JFreeReport, is a suite of open-source tools – Pentaho Report Designer, Pentaho Reporting Engine, Pentaho Reporting SDK and the common reporting libraries shared with the entire Pentaho BI Platform – that allows users to create relational and analytical reports from a variety of sources, outputting results in various formats (HTML, PDF, Excel etc.);

Pentaho Data Integration (Kettle) delivers powerful ETL capabilities using a metadata-driven approach with an intuitive, graphical, drag-and-drop design environment;

Pentaho Analysis Service (Mondrian) is an Online Analytical Processing (OLAP) server that supports data analysis in real-time;

Pentaho Data Mining (Weka) is a collection of machine learning algorithms for classification, regression, clustering and association rules.

Platfora 26 delivers in-memory business intelligence with no separate data warehouse or ETL required. Its visual interface, built on HTML5, allows business users to analyse data. Results may be easily shared between users. It relies on a Hadoop cluster, which can be installed either on premise or on cloud providers (Amazon EMR and S3). It is primarily focused on BI features, such as elaborate visualization types (charts, plots, maps) or slice-and-dice operations, but also offers a predictive analysis framework.

Skytree Server 27 is a general purpose machine learning and data analytics system that supports data coming from relational databases, Hadoop systems, or flat files and offers connectors to common statistical packages and ML libraries. The ML methods supported are: Support Vector Machine (SVM), Nearest Neighbor, K-Means, Principal Component Analysis (PCA), Linear Regression, 2-point correlation and Kernel Density Estimation (KDE). Skytree Server connects with analytics front-ends, such as Web services or statistical and ML libraries (R, Weka), for data visualization. Its deployment options include cloud providers or a dedicated cluster based on Linux machines. It also supports customers in estimating the size of the cluster they need by a simple formula (Analytics Requirements Index).

WibiData 28 is a complex solution based on an open source software stack from Apache, combining Hadoop, HBase and Avro with proprietary components. WibiData's machine learning libraries give the tools to start building complex data processing pipelines immediately. WibiData also provides graphical tools to export your data from its distributed data repository into any relational database [21]. In order to simplify data processing using Hadoop, WibiData introduces the concepts of producers, computation functions that update a row in a table, and gatherers, which close the gap between WibiData tables and the key-value pairs processed by the Hadoop MapReduce engine.

We are aware that we could not cover all the solution providers in the field of business intelligence and big data analytics. We tried to cover those who also offer ML components in their applications; many others that focus only on big data analytics, such as Alteryx, SiSense, SAS or SAP, are omitted from this survey. Solutions in this category target mostly business users who need to quickly and easily extract insights from their data, being good candidates for users with less computer or statistics background.

7 Software as a Service providers for Machine Learning

This section focuses on platform-as-a-service or software-as-a-service providers for machine learning problems. They offer their services mainly via RESTful interfaces, and in some (rare) cases the solution may also be installed on-premise (Myrrix), in contrast to the solutions from the previous section, which are mainly systems deployable in private data centers. As a class of ML problems, predictive modeling is the favorite (BigML, Google Prediction API, EigenDog) among these systems. We did not include in this study providers of SQL-over-Hadoop solutions (e.g. Cloudera Impala, Hadapt, Hive) because their main target is not ML-DM, but rather fast, elastic and scalable SQL processing of relational data using the distributed architecture of Hadoop.

BigML 29 is a SaaS approach to machine learning. Users can set up data sources, create, visualize and share prediction models (only decision trees are supported), and use models to generate predictions, all from a Web interface or programmatically using a REST API.

26 http://platfora.com
27 http://skytree.net
28 http://wibidata.com
29 http://bigml.com
30 http://bityota.com

BitYota 30 is a young start-up (2012) SaaS provider of a Big Data warehousing solution. On top of data integration from different sources (relational, NoSQL, HDFS), it also allows customers to run statistics and summarization queries in SQL92, standard R statistics and custom functions written in JavaScript, Perl, or
Python on a parallel analytics engine. Results are visualized by integrating with popular BI tools and dashboards.

Precog 31 has a more elaborate SaaS solution composed of the Precog database, the Quirrel language, and the ReportGrid and LabCoat tools. At the core of Precog is an original (not based on Hadoop or any other NoSQL store), schemaless, columnar database designed for storing and analyzing semi-structured, measured data, such as events (users clicking, engaging, and buying), sensor data, activity stream data, facts, and other kinds of data that do not need to be mutably updated. Precog's functionality is exposed by REST APIs, and client libraries are available in JavaScript, Python, PHP, Ruby, Java, or C#. LabCoat is a GUI tool for the creation and management of Quirrel queries. Quirrel is a highly expressive data analysis language that makes it easy to do in-database analytics, statistics, and machine learning across any kind of measured data. Results are available in JSON or CSV formats. ReportGrid is an HTML5 visualization engine that interactively, or programmatically, builds reports and charts.

Google Prediction API 32 is Google's cloud-based machine learning tool that can help analyze your data. It is closely connected to Google Cloud Storage 33, where training data is stored, and offers its services through a RESTful interface, with client libraries allowing programmers to connect from Java, JavaScript, .NET, Ruby, Python etc. In the first step, the model needs to be trained from data, the supported models being classification and regression for now. After the model is built, one can query it to obtain predictions on new instances. Adding new data to a trained model is called Streaming Training and is also well supported. Recently, a PMML preprocessing feature has been added, i.e. Prediction API supports preprocessing your data against a PMML transform specified using PMML 4.0 syntax; it does not support importing a complete PMML model that includes data. Created models can be shared as hosted models in the marketplace.

EigenDog 34 is a service for scalable predictive modeling, hosted on the Amazon EC2 (for computation) and S3 (for data and model storage) platforms. It builds decision tree models out of data in Weka's ARFF format. Models can be downloaded in binary format and integrated in user applications through the API or the open-source library provided by the vendor.

Metamarkets 35 position themselves as a Data Science-as-a-Service provider, helping users to get insights out of their large datasets. They offer end-users the possibility to perform fast, ad-hoc investigations on data, to discover new and unique anomalies, and to spot trends in data streams, based on statistical models, in an intuitive, interactive and collaborative way. They are focused on business people who are less knowledgeable about statistics and machine learning.

Myrrix 36 is a complete, real-time, scalable recommender system built using Apache Mahout™ (see Section 5). It can be accessed as PaaS using a RESTful interface. It is able to incrementally update the model once new data is available. It is organized in 2 layers – Serving (open source and free) and Computation (Hadoop based) – that can be deployed on-premise as well, either both of them or only one.

Prior Knowledge Veritable API 37 offers Python and Ruby interfaces; users upload data to its servers and build prediction models using Markov Chain Monte Carlo samplers. The company was operating a cloud infrastructure based on Amazon Web Services. SalesForce.com acquired Prior Knowledge at the end of 2012.

Predictobot 38 by Prediction Appliance also aims at making machine learning modeling easier. The user uploads a spreadsheet of data, answers a few questions, and then downloads a spreadsheet with the predictive model. It is going to bring predictive modeling to anyone with the skills to make a spreadsheet. The business is still in stealth mode.

7.1 Text mining as SaaS

Due to the explosion of social media technologies, such as blog platforms (WordPress.com, Blogger etc.), mini-blogging (Twitter), or social networks (Facebook, Google+), increased interest is paid to text mining and natural language processing (NLP) solutions delivered as services to their customers. This is why we devoted an entire subsection to group together software/platform-as-a-service solutions for text mining. Before reviewing available solutions, a short introduction to NLP and text mining is helpful.

31 http://precog.com
32 https://developers.google.com/prediction/
33 https://developers.google.com/storage/
34 https://eigendog.com/#home
35 http://metamarkets.com/
36 http://myrrix.com
37 http://priorknowledge.com
38 http://predictobot.com

While NLP uses linguistically inspired techniques (text is syntactically parsed using information from a formal grammar and a lexicon, and the resulting information is then interpreted semantically and used to extract information) to deeply analyse the document,
text mining is more recent and uses techniques developed in the fields of information retrieval, statistics, and machine learning. In contrast with NLP, text mining does not aim to understand what is "said" in a text, but rather to extract patterns across a large number of documents. Features of text mining include concept/entity extraction, text clustering, summarization, and sentiment analysis.

The size and number of documents that need to be processed, together with real-time processing constraints, contribute to the development of novel, distributed toolkits able to meet demanding users' needs. Website operators are willing to offer text mining features to their visitors with minimum investment and reduced maintenance costs. Thus, more and more providers are offering text mining services through RESTful web services, saving clients from costly infrastructures and deployments. Without aiming at an exhaustive survey of text mining P(S)aaS providers, we mention several of them hereafter:

AlchemyAPI 39 is a cloud-based text mining SaaS platform advertised as providing the most comprehensive set of NLP capabilities of any text mining platform, including: named entity extraction, sentiment analysis, concept tagging, author extraction, relations extraction, web page cleaning, language detection, keyword extraction, quotations extraction, intent mining, and topic categorization. AlchemyAPI uses deep linguistic parsing, statistical natural language processing, and machine learning to analyze content, extracting semantic meta-data: information about people, places, companies, topics, languages, and more. It provides RESTful API endpoints and SDKs in all major programming languages, and responses are encoded in various formats (XML, JSON, RDF). Organizations with specific data security needs or regulatory constraints are offered the possibility to install the solution in their own environment.

NathanApp™ 40 is AI-one's general purpose machine learning PaaS, also available for on-premise deployment as NathanNode™. Like Topic-Mapper, it is ideally suited to learn the meaning of any human language by learning the context of words, only faster and with greater deployment flexibility. NathanApp is a RESTful API using JavaScript and JSON.

TextProcessing 41 is also an NLP API that supports stemming and lemmatization, sentiment analysis, tagging and chunk extraction, phrase extraction and named entity recognition. These services are offered open and free (for limited usage) via RESTful API endpoints; client libraries exist in Java, Python, Ruby, PHP and Objective-C, responses are JSON encoded, and Python NLTK demos are offered to shorten the learning curve. For commercial purposes, clients are offered monthly subscriptions via Mashape.com.

Yahoo! Content Analysis Web Service 42 detects entities/concepts, categories, and relationships within unstructured content. It ranks the detected entities/concepts by their overall relevance, resolves them if possible into Wikipedia pages, and annotates tags with relevant meta-data. The service is available as a YQL table and the response is in XML format. It is freely available for non-commercial usage.

This section presented PaaS solutions addressing, to some extent, machine learning problems. A special sub-section was devoted to the text mining problem because of its prevalence in the ML PaaS landscape. We notice big players, such as Yahoo! or Google, as well as many start-ups with millions of dollars in funding. They offer Web developers the possibility to easily integrate ML intelligence into their sites. Ease of use has prevailed over the functionality offered by these services, so there are only limited options for tweaking the algorithms behind them. Thus, these are good candidates for users with basic ML needs, but are not flexible enough for addressing more advanced problems.

8 Conclusions and future work

Our main findings are synthesized below:
(1) Existing programming paradigms for expressing large-scale parallelism, such as MapReduce (MR) and the Message Passing Interface (MPI), are the de facto choices for implementing ML-DM algorithms. More and more interest has been devoted to MR due to its ability to handle large datasets and its built-in resilience against failures.
(2) Machine learning in distributed environments comes in different approaches, offering viable and cost-effective alternatives to traditional ML and statistical applications, which are not focused on distributed environments [14].
(3) Existing solutions target either experienced, skilled computer scientists, mathematicians and statisticians, or novice users who are happy with no (or few) possibilities to tune the algorithms. End-user support and guidance is largely missing from existing distributed ML-DM solutions.

After reviewing over 30 different offers on the market, we think that there is still room for a scalable,
39 http://www.alchemyapi.com
40 http://ai-one.com
41 http://text-processing.com
42 http://developer.yahoo.com/search/content/V2/contentAnalysis.html
easy to use and deploy solution for ML-DM in the context of the cloud computing paradigm, targeting end-users with less programming or statistical experience who are nevertheless willing to run and tweak advanced scientific ML tasks, such as researchers and practitioners from fields like medicine, finance, telecommunications etc. In this respect, our future plans include prototyping such a distributed system relying on existing distributed ML-DM frameworks, but enhancing them with usability and user-friendliness features.

Acknowledgments

This work was supported by EC-FP7 project FP7-REGPOT-2011-1 284595 (HOST).

References

[1] R. Bekkerman, M. Bilenko and J. Langford (editors) – Scaling up Machine Learning, Cambridge University Press, 2012, summary at http://people.cs.umass.edu/~ronb/scaling_up_machine_learning.htm

[2] M. Budiu, D. Fetterly, M. Isard, F. McSherry, and Y. Yu – Large-Scale Machine Learning using DryadLINQ, in R. Bekkerman, M. Bilenko and J. Langford (editors) – Scaling up Machine Learning, Cambridge University Press, 2012

[3] S. Charrington – Three New Tools Bring Machine Learning Insights to the Masses, February 2012, Read Write Web, http://www.readwriteweb.com/hack/2012/02/three-new-tools-bring-machine.php

[4] W. Eckerson – New technologies for Big Data, http://www.b-eye-network.com/blogs/eckerson/archives/2012/11/new_technologie.php (2012)

[5] D. Harris – 5 low-profile startups that could change the face of big data, January 2012, http://gigaom.com/cloud/5-low-profile-startups-that-could-change-the-face-of-big-data/

[6] A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan – NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce, KDD 11

[7] A. Ghoting et al. – SystemML: Declarative machine learning on mapreduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE 11, pages 231-242, Washington, DC, USA, 2011

[8] S. Hido – Jubatus: Distributed Online Machine Learning Framework for Big Data, XLDB Asia, Beijing, 2012, http://www.slideshare.net/JubatusOfficial/distributed-online-machine-learning-framework-for-big-data

[9] M. Isard et al. – Dryad: distributed data-parallel programs from sequential building blocks. In SIGOPS Operating Systems Review, 2007

[10] KD Nuggets Survey 2012, http://www.kdnuggets.com/software/suites.html

[11] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein – Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, Proceedings of the VLDB Endowment, Vol. 5, No. 8, August 2012, Istanbul, Turkey

[12] S. Owen, R. Anil, T. Dunning, E. Friedman – Mahout in Action, Manning Publications, 2011, ISBN 978-1935182689

[13] E. Pednault, E. Yom-Tov, A. Ghoting – IBM Parallel Machine Learning Toolbox, in R. Bekkerman, M. Bilenko and J. Langford (editors) – Scaling up Machine Learning, Cambridge University Press, 2012

[14] D. Pop, G. Iuhasz – Survey of Machine Learning Tools and Libraries, Institute e-Austria Timişoara Technical Report, 2011

[15] Rexer Analytics Survey 2011, http://www.rexeranalytics.com/Data-Miner-Survey-Results-2011.html

[16] L. Tierney, A. J. Rossini, Na Li – Snow: A parallel computing framework for the R System, Int J Parallel Prog (2009) 37:78–90, DOI 10.1007/s10766-008-0077-2

[17] S. R. Upadhyaya – Parallel approaches to machine learning - A comprehensive survey, Journal of Parallel and Distributed Computing, Volume 73, Issue 3, March 2013, Pages 284–292

[18] B. Werther – Pre-industrial age of big data, June 2012, http://www.platfora.com/pre-industrial-age-of-big-data/

[19] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Kumar Gunda, J. Currey – DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, In OSDI, 2008
[20] Apache Hadoop website, http://hadoop.apache.org (2012)

Daniel Pop received his PhD degree in computer science from West University of Timişoara in 2006. He is currently a senior researcher at the Department of Computer Science, Faculty of Mathematics and Computer Science, West University of Timişoara. His research interests cover high performance computing and distributed computing technologies, machine learning and knowledge discovery and representation, and multi-agent systems. He also has broad experience in the IT industry (15+ years), where he applied agile software development processes, such as SCRUM and Kanban.
Name       Platform      Licensing   Language  Activity
Mahout     Hadoop        Apache 2    Java      High
GraphLab   MPI / Hadoop  Apache 2    C++       High
DryadLINQ  Dryad         Commercial  .NET      Low
Jubatus    ZooKeeper     LGPL 2      C++       Medium
NIMBLE     Hadoop        ?           Java      Low
SystemML   Hadoop        ?           DML       Low
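
As a rough illustration of the MapReduce style that the Hadoop-based frameworks in the table build on, the sketch below computes per-class feature centroids (a building block of simple classifiers) in plain, single-process Python. The toy data, the function names and the choice of centroid computation are assumptions made for illustration only; none of them is taken from Mahout, NIMBLE or SystemML, and a real framework would run the map and reduce phases on different machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy dataset, already split into partitions the way a
# cluster would hold it. Each record is (class label, feature vector).
PARTITIONS = [
    [("spam", [1.0, 0.0]), ("ham", [0.0, 2.0])],
    [("spam", [3.0, 2.0]), ("ham", [2.0, 4.0]), ("spam", [2.0, 1.0])],
]


def map_phase(record):
    """Emit one (key, value) pair: label -> (feature vector, count of 1)."""
    label, features = record
    yield label, (features, 1)


def combine(a, b):
    """Associative merge of two partial (vector sum, count) pairs."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return [x + y for x, y in zip(sum_a, sum_b)], n_a + n_b


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold the partial sums for one key and return its centroid."""
    total, count = reduce(combine, values)
    return key, [x / count for x in total]


def class_centroids(partitions):
    """Run map, shuffle and reduce sequentially over all partitions."""
    emitted = (
        pair
        for partition in partitions
        for record in partition
        for pair in map_phase(record)
    )
    return dict(reduce_phase(k, vs) for k, vs in shuffle(emitted).items())


centroids = class_centroids(PARTITIONS)
```

Because `combine` is associative, the partial sums could be merged per partition before the shuffle, which is the kind of property that lets such computations scale to large datasets.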