Support of Big Data Machine Learning With Apache Spark

Apache Spark is a fast data processing engine for big data that allows distributed processing of large volumes of data. It supports in-memory and on-disk processing and can be used with Hadoop, NoSQL databases, and SQL data stores. The document discusses Spark concepts like RDDs and DataFrames and provides examples of using Spark for machine learning techniques like K-means clustering and decision trees.

© CDOSS Association (contact@cdoss.tech)
I/Apache Spark presentation


Apache Spark is a fast data processing engine dedicated to big data. It allows processing of
large volumes of data in a distributed manner (cluster computing).

Advantages: Speed, Ease of use, Versatility.

Supports in-memory processing, which increases the performance of big data analytical applications.

It can also be used for conventional on-disk processing when the data sets are too large for the system memory.

It is used to process data from the Hadoop Distributed File System (HDFS), NoSQL databases, or relational data stores such as Apache Hive.

History
2009: created within the AMPLab at the University of California, Berkeley, by Matei Zaharia,

2010: released as open source under a BSD license,

2013: donated to the Apache Software Foundation,

2014: promoted to the rank of Top-Level Project by the Apache Foundation.

Spark vs Hadoop
Hadoop: the solution of choice for processing large data sets with "one-pass" computations (MapReduce),
Spark: better suited to use cases that require multi-pass computations (e.g., machine learning),
→ using them together is often the best approach,
→ Spark can run on Hadoop 2 clusters through the YARN resource manager.

Architecture


Spark Data Models: Resilient Distributed Datasets (RDDs)


An RDD is an immutable collection computed from a data source and partitioned across the nodes of the cluster; lost partitions can be recomputed from their lineage, hence "resilient".
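A minimal sketch, runnable in the pyspark shell (where sc, the SparkContext, is already defined; the file path is illustrative):

# An RDD built from an in-memory collection
numbers = sc.parallelize([1, 2, 3, 4])
# An RDD built from a data source
lines = sc.textFile("some_file.txt")    # path is an example
# Transformations are lazy; an action such as collect() triggers the computation
numbers.map(lambda x: x * 2).collect()  # [2, 4, 6, 8]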

Catalyst and Tungsten


- Catalyst is the name of Spark's query optimizer. Originally created for Spark SQL, Catalyst is also used for Datasets and DataFrames. Its role is to rewrite the execution plan of a query (or of an execution workflow) in order to obtain maximum performance.
- The Tungsten project aims to improve Spark's performance, in particular by optimizing the data structures used.
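A small illustrative sketch (the DataFrame content is made up) showing how the plans produced by Catalyst and executed by the Tungsten engine can be inspected from the pyspark shell:

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
# explain(True) prints the parsed, analyzed and Catalyst-optimized logical plans,
# followed by the physical plan that Tungsten executes
df.filter(df["id"] > 1).groupBy("label").count().explain(True)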


Spark Data Models: Performance

When to use RDDs?

- Unstructured data,
- Need for low-level control over the data.

When to use DataFrames?

- Structured or semi-structured data,
- Need for a high level of abstraction over the data,
- Need for high-level transformations and actions.
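A small comparative sketch of the same filter written with both APIs; the file name and column names follow the Iris1.csv data used later in the workshop, and the column index in the RDD version is an assumption:

# RDD API: manual parsing, low-level control (header row removed by hand)
lines = sc.textFile("Iris1.csv")
header = lines.first()
rows = lines.filter(lambda l: l != header).map(lambda l: l.split(","))
long_petals_rdd = rows.filter(lambda c: float(c[1]) > 6.0)   # column index assumed

# DataFrame API: schema-aware, optimized by Catalyst/Tungsten
df = spark.read.csv("Iris1.csv", header=True, inferSchema=True)
long_petals_df = df.filter(df["petal_length"] > 6.0)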

II/Concepts of statistics to know


- Statistical moments
- Mean and median
- Standard deviation
- Kurtosis
- Skewness
- Correlation
- Covariance
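A minimal sketch, assuming a DataFrame df with numeric columns such as petal_length and petal_width (as in the Iris1.csv data loaded later in the workshop), showing how these statistics can be obtained with built-in SQL functions:

from pyspark.sql import functions as F
df.select(
    F.mean("petal_length"),                      # mean
    F.stddev("petal_length"),                    # standard deviation
    F.skewness("petal_length"),                  # skewness (3rd standardized moment)
    F.kurtosis("petal_length"),                  # kurtosis (4th standardized moment)
    F.corr("petal_length", "petal_width"),       # correlation
    F.covar_samp("petal_length", "petal_width")  # sample covariance
).show()
df.approxQuantile("petal_length", [0.5], 0.01)   # approximate median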
III/Machine learning concepts to know
- K-means
- Decision tree
- Random Forest
- Neural Network
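Random Forest and Neural Network are not exercised in the workshop below; a minimal sketch of a Random Forest with the DataFrame-based API, assuming the indexed iris data (speciesIndex label, features vector, trainingData/testData) prepared in part 2) of the workshop:

from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="speciesIndex", featuresCol="features", numTrees=20)
rf_model = rf.fit(trainingData)
rf_predictions = rf_model.transform(testData)
# A feed-forward neural network is available similarly via
# pyspark.ml.classification.MultilayerPerceptronClassifier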


IV/Practical workshop
1)RDD without Catalyst and Tungsten
1)Run pyspark in a terminal.
2)Create an rdd with the following values: 1,2,3,4
rdd = sc.parallelize([1,2,3,4])
3)Multiply the rdd elements by 2 and put the result in rdd1
rdd1 = rdd.map(lambda x: x*2)
4)Filter the even elements of rdd and put the result in rdd2
rdd2 = rdd.filter(lambda x: x%2 == 0)
5)Create an rdd3 with the following values: 1,4,2,2,3
rdd3 = sc.parallelize([1,4,2,2,3])
6)Select the distinct elements of rdd3 and put the result in rdd4
rdd4 = rdd3.distinct()
7)Create an rdd5 with the following values: 1,2,3
rdd5 = sc.parallelize([1,2,3])
8)Create an rdd6 that contains, for each value of rdd5, the couple made of the value and the value plus 5
rdd6 = rdd5.map(lambda x: [x, x+5])
9)Create an rdd7 that contains the values of rdd5 with each value plus 5, in flat mode
rdd7 = rdd5.flatMap(lambda x: [x, x+5])
10)Compute the sum of the rdd5 elements
rdd5.reduce(lambda a, b: a+b)
11)Collect the rdd5 values
rdd5.collect()
12)Take the first two elements of rdd5
rdd5.take(2)
13)Create an rdd8 with the following values: 5,3,1,2
rdd8 = sc.parallelize([5,3,1,2])
14)Select the three largest values of rdd8
rdd8.takeOrdered(3, lambda s: -1*s)
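For reference, a sketch of the results these actions should produce (the element order returned by distinct() may vary across partitions):

rdd1.collect()                          # [2, 4, 6, 8]
rdd2.collect()                          # [2, 4]
rdd4.collect()                          # [1, 2, 3, 4] (order may vary)
rdd6.collect()                          # [[1, 6], [2, 7], [3, 8]]
rdd7.collect()                          # [1, 6, 2, 7, 3, 8]
rdd5.reduce(lambda a, b: a + b)         # 6
rdd5.take(2)                            # [1, 2]
rdd8.takeOrdered(3, lambda s: -1 * s)   # [5, 3, 2]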

1.1)K-means

1) Import the KMeans function from mllib with: from pyspark.mllib.clustering import KMeans.
2) Import the array function with: from numpy import array.
3) Create a file with the following data and save it in your home directory:


0.0 0.0 0.0
0.1 0.1 0.0
0.1 0.0 0.1
9.0 9.2 9.0
9.3 9.0 9.2
9.0 9.2 9.1
4)Load the data with: data = sc.textFile("name_of_your_file").
5)#this step is optional# Collect the data of the RDD with: data.collect().
6)Prepare the data transformation by splitting each row and converting its elements to float with: parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).
Note: in Python, lambda defines an anonymous function (a function without a name).
7)Execute by displaying the transformed data with: parsedData.collect()
8)Start the K-means algorithm with: clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random").
9)Compute and display the cluster assignments with: clusters.predict(parsedData).collect().
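As a possible extra check, the clustering quality can be assessed with the within-set sum of squared errors, which the mllib KMeansModel exposes as computeCost (using the clusters and parsedData objects created above):

wssse = clusters.computeCost(parsedData)   # within-set sum of squared errors for k = 2
print("WSSSE = %s" % wssse)
clusters.clusterCenters                    # the two cluster centers found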

1.2)Decision tree

1)Place the irisnum.csv file from Downloads in your home directory.


2)Load the irisnum.csv data with: data = sc.textFile("irisnum.csv")
3)Import array from numpy with: from numpy import array
4)Prepare the transformation of the rows by separating the values and converting them to float with:
pdata = data.map(lambda line: array([float(x) for x in line.split(',')]))
5)Display pdata with: pdata.collect()
6)Import the LabeledPoint function with:
from pyspark.mllib.regression import LabeledPoint.
7)Create a function called parse that labels a data line received as input (the class value is in the last column) with:
def parse(l):
    return LabeledPoint(l[4], l[0:4])
8)Pass the lines one by one in order to label all the data with:
fdata = pdata.map(lambda l: parse(l))
9)Randomly divide the data in order to have a training base and a test base with:
(trainingData,testData) = fdata.randomSplit([0.8,0.2])
10)Import the function of decision trees with:
from pyspark.mllib.tree import DecisionTree
11)Prepare the model with:
model = DecisionTree.trainClassifier(trainingData, numClasses=3, categoricalFeaturesInfo={})
12)Perform the prediction for the test base with:
predictions = model.predict(testData.map(lambda r: r.features))
13)Build a two-column RDD pairing the predictions with the real values with:
predictionAndLabels = predictions.zip(testData.map(lambda lp: lp.label))
14)Import MulticlassMetrics function for model evaluation with:
from pyspark.mllib.evaluation import MulticlassMetrics.
15)Start the evaluation function with:
metrics = MulticlassMetrics(predictionAndLabels)
16)Calculate the model precision with: precision = metrics.precision()
17)Calculate the recall of the model with recall = metrics.recall()
18)Calculate the f1Score with: f1Score = metrics.fMeasure()
19)Display a title to the results with: print("Summary Stats")
20)Display the model precision with: print("Precision = %s" % precision)
21)Display the model recall with: print("Recall = %s" % recall)
22)Display the model's F1 score with: print("F1 Score = %s" % f1Score)
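In more recent Spark versions, precision(), recall() and fMeasure() without a label argument are deprecated in favour of a single accuracy metric; a hedged alternative using the same MulticlassMetrics object:

accuracy = metrics.accuracy                  # overall accuracy (Spark 2.0+)
print("Accuracy = %s" % accuracy)
print(metrics.confusionMatrix().toArray())   # confusion matrix as a NumPy array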

2)RDD with Catalyst and Tungsten


1) Place the Iris1.csv file in the home directory of the VM.
2) Load this file with:
df = spark.read.load("Iris1.csv", format="csv", sep=",", inferSchema="true", header="true")
3) Here is another possibility to load the file:
df1 = sqlContext.read.format('csv').options(header='true', inferschema='true').load('Iris1.csv')
Note: inferSchema automatically infers the column types.
4) Count the number of lines of the data frame with: df.count()
5) Display the first 10 lines of the data frame with: df.show(10)
6) Filter and display the lines whose petal_lengths are strictly greater than 6 with:
df.filter(df["petal_length"]>6).show()
7) Count the "species" by group with: df.groupBy(df["species"]).count().show()
8) Display the first 10 lines with: df.head(10).
9) Select the different "species" using a sql query with:
df.registerTempTable ("table")
distinct_classes = sqlContext.sql("select distinct species from table")
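The query above only defines a result DataFrame; a small sketch to display it, plus an illustrative aggregate over the same temporary table:

distinct_classes.show()
sqlContext.sql("select species, avg(petal_length) from table group by species").show()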


Decision Tree
1) Transform the data frame df by indexing the class variable "species" and creating a vector of
"features" with:
from pyspark.ml.feature import StringIndexer
speciesIndexer = StringIndexer(inputCol="species", outputCol="speciesIndex")
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=["petal_width","petal_length","sepal_width","sepal_length"], outputCol="features")
data = vectorAssembler.transform(df)
index_model = speciesIndexer.fit(data)
data_indexed = index_model.transform(data)
2) Randomly divide the data into a training base and a test base with:
trainingData, testData = data_indexed.randomSplit([0.8, 0.2], 0)
3) Import the decision trees function with: from pyspark.ml.classification import
DecisionTreeClassifier
4) Configure the model with: dt = DecisionTreeClassifier().setLabelCol("speciesIndex").setFeaturesCol("features")
5) Start training with: model = dt.fit(trainingData).
6) Perform the classification of the test base with: classifications = model.transform(testData)
7) Repeat the evaluation (analogous to the earlier evaluation steps) with:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="speciesIndex", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(classifications)
print("Test set accuracy = " + str(accuracy))
