Big Data Computing Spark Built-In Libraries

Spark has several built-in libraries for common machine learning, graph processing, streaming, and SQL queries on large datasets. The machine learning library (MLlib) includes algorithms for classification, regression, clustering, collaborative filtering and decomposition. GraphX is a graph processing library with algorithms for tasks like collaborative filtering, community detection, and graph analytics. Spark Streaming allows for large scale streaming computations with exactly-once semantics. Spark SQL enables loading and querying structured data from sources like Hive and JSON.

Spark Built-in Libraries

Dr. Rajiv Misra


Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Computing Vu Pham Spark Built-in Libraries
Introduction
Apache Spark is a fast and general-purpose cluster
computing system for large scale data processing

High-level APIs in Java, Scala, Python and R



Standard Library for Big Data
Big data apps lack libraries of common algorithms

Spark’s generality + support for multiple languages make it suitable to offer this

Much of future activity will be in these libraries
A General Platform



Machine Learning Library (MLlib)
MLlib algorithms:

(i) Classification: logistic regression, linear SVM, naïve Bayes, classification tree
(ii) Regression: generalized linear models (GLMs), regression tree
(iii) Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
(iv) Clustering: k-means
(v) Decomposition: SVD, PCA
(vi) Optimization: stochastic gradient descent, L-BFGS
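
To make one entry in the list above concrete, here is a minimal plain-Python sketch of k-means clustering (Lloyd's algorithm). It is an illustration of the technique only, not MLlib's implementation or API; the function name `kmeans`, its parameters, and the sample points are assumptions made for this example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct input points as initial centers
    for _ in range(iterations):
        # Assignment step: group each point under its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centers.append(center)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # assignments no longer change the centers
        centers = new_centers
    return centers

# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers = sorted(kmeans(points, k=2))
print(centers)
```

On a cluster, the assignment step is the part that parallelizes naturally, since each point can be matched to its nearest center independently.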
GraphX



GraphX

•General graph processing library

•Build graph using RDDs of nodes and edges

•Large library of graph algorithms with composable steps
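
The idea of describing a graph as one collection of nodes and one collection of edges can be sketched in plain Python. This is not the GraphX API; ordinary lists stand in for RDDs, and the vertex ids, names, and the "follows" relation are hypothetical sample data.

```python
from collections import Counter

# Hypothetical data: (vertex_id, name) pairs and (src_id, dst_id, relation) triples.
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]
edges = [(1, 2, "follows"), (1, 3, "follows"), (2, 3, "follows")]

# Count outgoing edges per vertex, similar in spirit to a map followed by a per-key sum.
out_degree = Counter(src for src, _dst, _rel in edges)

# Join edges with vertex attributes to get readable (src, relation, dst) triplets.
names = dict(vertices)
triplets = [(names[s], rel, names[d]) for s, d, rel in edges]

print(dict(out_degree))
print(triplets)
```

Keeping nodes and edges as separate collections is what lets graph steps be composed from ordinary collection operations such as map, join, and aggregate.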



GraphX Algorithms
(i) Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
(ii) Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
(iii) Semi-supervised ML: Graph SSL, CoEM
(iv) Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
(v) Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
(vi) Classification: Neural Networks
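
As an illustration of one algorithm from the Graph Analytics group, the following is a minimal plain-Python sketch of iterative PageRank over an edge list. It is not GraphX's implementation; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the tiny example graph are all assumptions chosen for this sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (src, dst) edges (illustrative only)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleport share.
        new_ranks = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # a dangling node spreads its rank evenly
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks

# Hypothetical three-node graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
print(ranks)
```

Each round only needs a node's current rank and its outgoing edges, which is why this style of computation maps well onto a graph split into node and edge collections.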



Spark Streaming
•Large scale streaming computation
•Ensures exactly-once semantics
•Integrated with Spark → unifies batch, interactive, and streaming computations!
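
Exactly-once semantics means that each record affects the result once, even if a failure causes some input to be processed again. The plain-Python sketch below shows one way to get that effect for a stream handled as a series of small batches: remember which batch ids have already been applied and skip replays. It is an illustration under stated assumptions, not Spark Streaming's internal mechanism; the class `WordCountState` and the sample batches are hypothetical.

```python
from collections import Counter

class WordCountState:
    """Running word counts that fold in each micro-batch at most once."""

    def __init__(self):
        self.counts = Counter()
        self.applied_batches = set()  # ids of batches already folded into counts

    def apply(self, batch_id, lines):
        if batch_id in self.applied_batches:
            return False  # replayed batch: skip so nothing is double counted
        for line in lines:
            self.counts.update(line.split())
        self.applied_batches.add(batch_id)
        return True

state = WordCountState()
batches = [(0, ["spark streaming", "spark sql"]), (1, ["spark mllib"])]
for batch_id, lines in batches:
    state.apply(batch_id, lines)

# Simulate recovery after a failure: batch 1 is delivered a second time.
replayed = state.apply(1, ["spark mllib"])
print(replayed, dict(state.counts))
```

The same word-count logic could be run once over a stored dataset or repeatedly over incoming batches, which is the sense in which batch and streaming computation share one model.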



Spark SQL
Enables loading & querying structured data in Spark

From Hive:

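from pyspark.sql import HiveContext  # Spark 1.x-era API; assumes an existing SparkContext named sc

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()  # keep rows after 2013, then fetch them to the driver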

From JSON:

c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
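
To show what these two queries compute without needing a Spark cluster, here is a plain-Python stand-in that uses only the standard `json` module. It is not Spark SQL; the record contents, including the `year` field on tweets, are hypothetical sample data invented for this sketch.

```python
import json

# Hypothetical contents of tweets.json, one JSON object per line.
raw_lines = [
    '{"text": "hello spark", "year": 2014, "user": {"name": "ann"}}',
    '{"text": "old tweet", "year": 2012, "user": {"name": "raj"}}',
]
tweets = [json.loads(line) for line in raw_lines]

# Same shape as: select text, user.name from tweets
projected = [(t["text"], t["user"]["name"]) for t in tweets]

# Same shape as the Hive example's filter on year > 2013
recent = [t["text"] for t in tweets if t["year"] > 2013]

print(projected)
print(recent)
```

Note how `user.name` in the query reaches into a nested object, so structured loading has to preserve the nesting found in the JSON records.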



Spark Community
Most active open source community in big data
200+ developers, 50+ companies contributing
