Support of Big Data Machine Learning With Apache Spark

Apache Spark is a fast data processing engine for big data that allows distributed processing of large volumes of data. It supports in-memory and on-disk processing and can be used with Hadoop, NoSQL databases, and SQL data stores. The document discusses Spark concepts like RDDs and DataFrames and provides examples of using Spark for machine learning techniques like K-means clustering and decision trees.

© CDOSS Association (contact@cdoss.tech)
I/Apache Spark presentation


Apache Spark is a fast data processing engine dedicated to big data. It allows processing of
large volumes of data in a distributed manner (cluster computing).

Advantages: Speed, Ease of use, Versatility.

Supports in-memory processing, which increases the performance of big data analytical applications.

It can also be used for conventional on-disk processing when the data sets are too large for the system memory.

It is used to process data from the Hadoop Distributed File System (HDFS), NoSQL databases, or relational data stores such as Apache Hive.

History
2009: created within the AMPLab at the University of California, Berkeley, by Matei Zaharia,

2010: released as open source under a BSD license,

2013: donated to the Apache Software Foundation,

2014: promoted to the rank of Top-Level Project by the Apache Foundation.

Spark vs Hadoop
Hadoop: the solution of choice for processing large data sets with "one-pass" computations (MapReduce),
Spark: better suited to use cases that require multi-pass computations (e.g., machine learning),
→ using them together is often the best approach,
→ Spark can run on Hadoop 2 clusters through the YARN resource manager.

Architecture


Spark Data Models: Resilient Distributed Datasets (RDDs)


An RDD is an immutable collection computed from a data source and partitioned across the nodes of the cluster; lost partitions can be recomputed from their lineage, hence "resilient".
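A minimal sketch, runnable in the pyspark shell (where sc, the SparkContext, is already defined; the file path is illustrative):

# An RDD built from an in-memory collection
numbers = sc.parallelize([1, 2, 3, 4])
# An RDD built from a data source
lines = sc.textFile("some_file.txt")    # path is an example
# Transformations are lazy; an action such as collect() triggers the computation
numbers.map(lambda x: x * 2).collect()  # [2, 4, 6, 8]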

Catalyst and Tungsten


- Catalyst is the name of Spark's query optimizer. Originally created for Spark SQL, Catalyst is also used for Datasets and DataFrames. Its role is to rewrite the execution plan of a query (or of an execution workflow) in order to obtain maximum performance.
- The Tungsten project aims to improve Spark's performance, in particular by optimizing the data structures used.
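A small illustrative sketch (the DataFrame content is made up) showing how the plans produced by Catalyst and executed by the Tungsten engine can be inspected from the pyspark shell:

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
# explain(True) prints the parsed, analyzed and Catalyst-optimized logical plans,
# followed by the physical plan that Tungsten executes
df.filter(df["id"] > 1).groupBy("label").count().explain(True)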


Spark Data Models: Performance

When to use RDDs?

- Unstructured data,
- Need for low-level control over the data.

When to use DataFrames?

- Structured or semi-structured data,
- Need for a high level of abstraction over the data,
- Need for high-level transformations and actions.
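A small comparative sketch of the same filter written with both APIs; the file name and column names follow the Iris1.csv data used later in the workshop, and the column index in the RDD version is an assumption:

# RDD API: manual parsing, low-level control (header row removed by hand)
lines = sc.textFile("Iris1.csv")
header = lines.first()
rows = lines.filter(lambda l: l != header).map(lambda l: l.split(","))
long_petals_rdd = rows.filter(lambda c: float(c[1]) > 6.0)   # column index assumed

# DataFrame API: schema-aware, optimized by Catalyst/Tungsten
df = spark.read.csv("Iris1.csv", header=True, inferSchema=True)
long_petals_df = df.filter(df["petal_length"] > 6.0)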

II/Concepts of statistics to know


- Statistical moments
- Mean and median
- Standard deviation
- Kurtosis
- Skewness
- Correlation
- Covariance
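A minimal sketch, assuming a DataFrame df with numeric columns such as petal_length and petal_width (as in the Iris1.csv data loaded later in the workshop), showing how these statistics can be obtained with built-in SQL functions:

from pyspark.sql import functions as F
df.select(
    F.mean("petal_length"),                      # mean
    F.stddev("petal_length"),                    # standard deviation
    F.skewness("petal_length"),                  # skewness (3rd standardized moment)
    F.kurtosis("petal_length"),                  # kurtosis (4th standardized moment)
    F.corr("petal_length", "petal_width"),       # correlation
    F.covar_samp("petal_length", "petal_width")  # sample covariance
).show()
df.approxQuantile("petal_length", [0.5], 0.01)   # approximate median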
III/Machine learning concepts to know
- K-means
- Decision tree
- Random Forest
- Neural Network
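Random Forest and Neural Network are not exercised in the workshop below; a minimal sketch of a Random Forest with the DataFrame-based API, assuming the indexed iris data (speciesIndex label, features vector, trainingData/testData) prepared in part 2) of the workshop:

from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="speciesIndex", featuresCol="features", numTrees=20)
rf_model = rf.fit(trainingData)
rf_predictions = rf_model.transform(testData)
# A feed-forward neural network is available similarly via
# pyspark.ml.classification.MultilayerPerceptronClassifier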


IV/Practical workshop
1)RDD without Catalyst and Tungsten
1)Run pyspark in a terminal.
2)Create an rdd with the following values: 1,2,3,4
rdd = sc.parallelize([1,2,3,4])
3)Multiply the rdd elements by 2 and put the result in rdd1
rdd1 = rdd.map(lambda x: x*2)
4)Filter the even elements of rdd and put the result in rdd2
rdd2 = rdd.filter(lambda x: x%2 == 0)
5)Create an rdd3 with the following values: 1,4,2,2,3
rdd3 = sc.parallelize([1,4,2,2,3])
6)Select the distinct elements of rdd3 and put the result in rdd4
rdd4 = rdd3.distinct()
7)Create an rdd5 with the following values: 1,2,3
rdd5 = sc.parallelize([1,2,3])
8)Create an rdd6 that contains, for each value of rdd5, the couple made of the value and the value plus 5
rdd6 = rdd5.map(lambda x: [x, x+5])
9)Create an rdd7 that contains the values of rdd5 with each value plus 5, in flat mode
rdd7 = rdd5.flatMap(lambda x: [x, x+5])
10)Compute the sum of the rdd5 elements
rdd5.reduce(lambda a, b: a+b)
11)Collect the rdd5 values
rdd5.collect()
12)Take the first two elements of rdd5
rdd5.take(2)
13)Create an rdd8 with the following values: 5,3,1,2
rdd8 = sc.parallelize([5,3,1,2])
14)Select the three largest values of rdd8
rdd8.takeOrdered(3, lambda s: -1*s)
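For reference, a sketch of the results these actions should produce (the element order returned by distinct() may vary across partitions):

rdd1.collect()                          # [2, 4, 6, 8]
rdd2.collect()                          # [2, 4]
rdd4.collect()                          # [1, 2, 3, 4] (order may vary)
rdd6.collect()                          # [[1, 6], [2, 7], [3, 8]]
rdd7.collect()                          # [1, 6, 2, 7, 3, 8]
rdd5.reduce(lambda a, b: a + b)         # 6
rdd5.take(2)                            # [1, 2]
rdd8.takeOrdered(3, lambda s: -1 * s)   # [5, 3, 2]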

1.1)K-means

1) Import the KMeans function from mllib with: from pyspark.mllib.clustering import KMeans.
2) Import the array function with: from numpy import array.
3) Create a file with the following data and save it in your home directory:


0.0 0.0 0.0
0.1 0.1 0.0
0.1 0.0 0.1
9.0 9.2 9.0
9.3 9.0 9.2
9.0 9.2 9.1
4)Load the data with: data = sc.textFile("name_of_your_file").
5)#this step is optional# Collect the data of the RDD with: data.collect().
6)Prepare the data transformation by splitting each row and converting its elements to float with: parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).
Note: in Python, lambda defines an anonymous function (a function without a name).
7)Execute by displaying the transformed data with: parsedData.collect()
8)Start the K-means algorithm with: clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random").
9)Compute and display the cluster assignments with: clusters.predict(parsedData).collect().
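As a possible extra check, the clustering quality can be assessed with the within-set sum of squared errors, which the mllib KMeansModel exposes as computeCost (using the clusters and parsedData objects created above):

wssse = clusters.computeCost(parsedData)   # within-set sum of squared errors for k = 2
print("WSSSE = %s" % wssse)
clusters.clusterCenters                    # the two cluster centers found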

1.2)Decision tree

1)Place the irisnum.csv file from Downloads in your home directory.


2)Load the irisnum.csv data with: data = sc.textFile("irisnum.csv")
3)Import array from numpy with: from numpy import array
4)Prepare the transformation of the rows by separating the values and converting them to float with:
pdata = data.map(lambda line: array([float(x) for x in line.split(',')]))
5)Display pdata with: pdata.collect()
6)Import the LabeledPoint function with:
from pyspark.mllib.regression import LabeledPoint.
7)Create a function called parse that labels a data line received as input (the class value is in the last column) with:
def parse(l):
    return LabeledPoint(l[4], l[0:4])
8)Pass the lines one by one in order to label all the data with:
fdata = pdata.map(lambda l: parse(l))
9)Randomly divide the data in order to have a training base and a test base with:
(trainingData,testData) = fdata.randomSplit([0.8,0.2])
10)Import the function of decision trees with:
from pyspark.mllib.tree import DecisionTree
11)Prepare the model with:
model = DecisionTree.trainClassifier(trainingData, numClasses=3, categoricalFeaturesInfo={})
12)Perform the prediction for the test base with:
predictions = model.predict(testData.map(lambda r: r.features))
13)Build a two-column RDD pairing the predictions with the real values with:
predictionAndLabels = predictions.zip(testData.map(lambda lp: lp.label))
14)Import MulticlassMetrics function for model evaluation with:
from pyspark.mllib.evaluation import MulticlassMetrics.
15)Start the evaluation function with:
metrics = MulticlassMetrics(predictionAndLabels)
16)Calculate the model precision with: precision = metrics.precision()
17)Calculate the recall of the model with recall = metrics.recall()
18)Calculate the f1Score with: f1Score = metrics.fMeasure()
19)Display a title to the results with: print("Summary Stats")
20)Display the model precision with: print("Precision = %s" % precision)
21)Display the model recall with: print("Recall = %s" % recall)
22)Display the model's F1 score with: print("F1 Score = %s" % f1Score)
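In more recent Spark versions, precision(), recall() and fMeasure() without a label argument are deprecated in favour of a single accuracy metric; a hedged alternative using the same MulticlassMetrics object:

accuracy = metrics.accuracy                  # overall accuracy (Spark 2.0+)
print("Accuracy = %s" % accuracy)
print(metrics.confusionMatrix().toArray())   # confusion matrix as a NumPy array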

2)RDD with Catalyst and Tungsten


1) Place the Iris1.csv file in the home directory of the VM.
2) Load this file with:
df = spark.read.load("Iris1.csv", format="csv", sep=",", inferSchema="true", header="true")
3) Here is another possibility to load the file:
df1 = sqlContext.read.format('csv').options(header='true', inferschema='true').load('Iris1.csv')
Note: inferSchema automatically infers the column types.
4) Count the number of lines of the data frame with: df.count()
5) Display the first 10 lines of the data frame with: df.show(10)
6) Filter and display the lines whose petal_lengths are strictly greater than 6 with:
df.filter(df["petal_length"]>6).show()
7) Count the "species" by group with: df.groupBy(df["species"]).count().show()
8) Display the first 10 lines with: df.head(10).
9) Select the different "species" using a sql query with:
df.registerTempTable ("table")
distinct_classes = sqlContext.sql("select distinct species from table")
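The query above only defines a result DataFrame; a small sketch to display it, plus an illustrative aggregate over the same temporary table:

distinct_classes.show()
sqlContext.sql("select species, avg(petal_length) from table group by species").show()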


Decision Tree
1) Transform the data frame df by indexing the class variable "species" and creating a vector of
"features" with:
from pyspark.ml.feature import StringIndexer
speciesIndexer = StringIndexer(inputCol="species", outputCol="speciesIndex")
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=["petal_width","petal_length","sepal_width","sepal_length"], outputCol="features")
data = vectorAssembler.transform(df)
index_model = speciesIndexer.fit(data)
data_indexed = index_model.transform(data)
2) Randomly divide the data into a training base and a test base with:
trainingData, testData = data_indexed.randomSplit([0.8, 0.2], 0)
3) Import the decision trees function with: from pyspark.ml.classification import
DecisionTreeClassifier
4) Configure the model with: dt = DecisionTreeClassifier().setLabelCol("speciesIndex").setFeaturesCol("features")
5) Start training with: model = dt.fit(trainingData).
6) Perform the classification of the test base with: classifications = model.transform(testData)
7) Repeat the evaluation (analogous to the earlier evaluation steps) with:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="speciesIndex", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(classifications)
print("Test set accuracy = " + str(accuracy))
