BDA Lec11
Course context: Hadoop File System, MapReduce, Spark, Streaming, Large-scale Data Mining/ML
What will we learn in this lecture?
01. Intro to ML/DM
Machine learning (ML) is a type of artificial intelligence that allows machines to learn from data without being explicitly programmed.
ISO definition:
“ML is a scientific discipline that deals with the construction and study of algorithms that can learn from data.
Such algorithms operate:
1. by building a model based on inputs
2. and using that model to make predictions and decisions, rather than following explicitly programmed instructions.”
Supervised vs Unsupervised Learning
Supervised Learning
• Using labeled historical data and training a model to predict the values of those labels based on various features of the data points.
• Classification (categorical values)
  • E.g., predicting disease, classifying images, ...
• Regression (continuous values)
  • E.g., predicting sales, predicting height, ...
Unsupervised Learning
• No label to predict.
• Trying to find patterns or discover the underlying structure in a given set of data.
• Clustering, anomaly detection, dimensionality reduction.
ML as a Process
● Data Ingestion:
  ○ Browser and mobile application event logs, or accessing external web APIs.
● Data Storage:
  ○ HDFS, Amazon S3, and other filesystems; SQL databases such as MySQL or PostgreSQL; distributed NoSQL data stores such as HBase, Cassandra, and DynamoDB, …
● Data Transformation (see the sketch below):
  ○ Filter out or remove records with bad or missing values
  ○ Fill in bad or missing data
  ○ Apply robust techniques to outliers
  ○ Apply transformations to potential outliers
  ○ Extract useful features
Theoretical vs. Practical Approach to ML for Big Data Analytics
Theoretical approach:
• Understanding the underlying algorithms and mathematical principles.
• Developing a strong foundation in statistical learning theory.
• Implementing models from scratch using libraries like TensorFlow or PyTorch.
Practical approach:
• Learning how to leverage powerful tools and frameworks designed for big data.
• Understanding the distributed computing paradigms used in these tools.
• Applying machine learning techniques to real-world, large-scale datasets.
https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline
Transformers
● Transformers take a DataFrame as input and produce a new DataFrame as
output.
● DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which
can hold a variety of data types. E.g., a DataFrame could have different columns
storing text, feature vectors, true labels, and predictions.
● The class Transformer implements a method transform() that converts one
DataFrame into another.
● For example, VectorAssembler is a Transformer: it takes the input DataFrame and returns a transformed DataFrame with a new column that is a vector representation of all the features.
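A minimal sketch of a Transformer in use; df and the input column names value1 and value2 are assumptions, not part of the lecture's dataset.

import org.apache.spark.ml.feature.VectorAssembler

// A Transformer only needs transform(): no learning, just a column-wise conversion.
val assembler = new VectorAssembler()
  .setInputCols(Array("value1", "value2"))   // assumed numeric input columns
  .setOutputCol("features")                  // new vector column appended to the DataFrame
val assembled = assembler.transform(df)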
Estimators
● Estimator is an abstraction of a learning algorithm that fits a model on a dataset.
● The class Estimator implements a method fit(), which accepts a DataFrame and
produces a Model (Transformer).
● For example, LogisticRegression is an Estimator which returns a LogisticRegressionModel (a Transformer) after learning parameters from the data.
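A minimal sketch of the Estimator-to-Model pattern; train is an assumed DataFrame that already has label and features columns.

import org.apache.spark.ml.classification.LogisticRegression

// Estimator: holds hyperparameters and knows how to learn.
val lr = new LogisticRegression()
// fit() runs the learning algorithm and returns a Model, which is itself a Transformer.
val lrModel = lr.fit(train)
// The fitted Model appends prediction columns when applied to a DataFrame.
val scored = lrModel.transform(train)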
How Does Pipeline Work? (1/3)
● A pipeline is a sequence of stages.
○ Pipeline class has a fit() method which kicks off the entire workflow.
● Stages of a pipeline run in order.
● The input DataFrame is transformed as it passes through each stage.
○ Each stage is either a Transformer or an Estimator.
● E.g., in the figure from the Spark ML Pipeline documentation (linked above), the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). A sketch of this pipeline follows below.
● The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.
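A sketch of that three-stage text-classification Pipeline, following the Spark ML Pipeline documentation; the input column names text and label are assumptions.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1 (Transformer): raw text -> words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// Stage 2 (Transformer): words -> hashed term-frequency feature vectors
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
// Stage 3 (Estimator): learns a LogisticRegressionModel
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))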
How Does Pipeline Work? (2/3)
● Pipeline.fit() is called on the original DataFrame
  ○ a DataFrame with raw text documents and labels
● Tokenizer.transform() splits the raw text documents into words
  ○ adds a new column with words to the DataFrame
● HashingTF.transform() converts the words column into feature vectors
  ○ adds a new column with those vectors to the DataFrame
● LogisticRegression.fit() produces a model (LogisticRegressionModel).
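Continuing the sketch above, one call drives all of these stage-level calls; training is an assumed DataFrame of raw text documents and labels.

// Runs Tokenizer.transform, HashingTF.transform, then LogisticRegression.fit, in order.
val model = pipeline.fit(training)   // returns a PipelineModel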
How Does Pipeline Work? (3/3)
● A Pipeline is an Estimator (DataFrame =[fit]=> Model).
○ After a Pipeline’s fit() runs, it produces a PipelineModel.
● PipelineModel is a Transformer (DataFrame =[transform]=> DataFrame).
○ The PipelineModel is used at test time.
● During execution each stage is called sequentially and, depending on the type of PipelineStage (whether it is a Transformer or an Estimator), its respective transform() or fit() method is called. Take a look at the diagram in the Spark ML Pipeline documentation to get a better understanding of the flow.
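A sketch of using the fitted PipelineModel at test time; test is an assumed DataFrame with the same raw text column as the training data.

// PipelineModel is a Transformer: it replays the fitted stages on new data.
val predictions = model.transform(test)
predictions.select("text", "probability", "prediction").show(5)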
Parameters
● MLlib Estimators and Transformers use a uniform API for specifying parameters.
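For example, parameters can be set with setters on an instance or passed in a ParamMap at fit() time; this is a sketch, the values shown are illustrative, and train is an assumed training DataFrame.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Setter style: configure the instance directly.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(train)

// ParamMap style: override parameters for a specific fit() call.
val paramMap = ParamMap(lr.maxIter -> 20).put(lr.regParam, 0.1)
val model2 = lr.fit(train, paramMap)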
RFormula lets us declare the label and the features with R-style formula syntax, e.g. lab ~ . + color:value1 + color:value2
• lab ~ .
  • This part specifies the dependent variable, which is lab.
  • The dot (.) represents all the other columns in the dataset as independent variables. This means that all columns except lab will be considered as features.
• + color:value1 + color:value2
  • This part adds interaction terms to the model.
  • color:value1 and color:value2 create interaction features between the color column and the value1 and value2 columns, respectively.
  • This means that the model will consider not only the individual effects of color, value1, and value2, but also how these variables interact with each other.
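A sketch of declaring that formula as a Spark MLlib RFormula transformer:

import org.apache.spark.ml.feature.RFormula

// Declares how to turn the raw columns into a label column and a features vector.
val supervised = new RFormula()
  .setFormula("lab ~ . + color:value1 + color:value2")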
Transformation output
In the output we can see the result of our transformation: a column called features that has our previously raw data. What's happening behind the scenes is actually pretty simple. RFormula inspects our data during the fit call and outputs an object that will transform our data according to the specified formula, which is called an RFormulaModel.
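The corresponding calls, continuing the sketch above; df is the assumed input DataFrame with the color, lab, value1, and value2 columns.

// fit() inspects the data (e.g. the distinct values of color) and returns an RFormulaModel.
val fittedRF = supervised.fit(df)
// transform() appends the label and features columns described by the formula.
val preparedDF = fittedRF.transform(df)
preparedDF.show(false)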
Transformation output & Split
Sample rows of the transformed DataFrame:
[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]
The 5th column is a structure representing sparse vectors in Spark. It has three components:
• vector length - in this case all vectors are of length 10 elements
• index array holding the indices of the non-zero elements
• value array holding the non-zero values
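The slide title also mentions the split; a sketch of the usual random train/test split on the prepared DataFrame (the 70/30 ratio is an assumption):

// Random 70/30 split into training and test sets.
val Array(train, test) = preparedDF.randomSplit(Array(0.7, 0.3))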
Estimators
● To create our classifier we instantiate a LogisticRegression, using the default configuration of hyperparameters.
● We then set the label column and the features column; the column names we are setting, label and features, are actually the default names for all Estimators in Spark MLlib.
This code (sketched below) will kick off a Spark job to train the model.
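A sketch of that training step, reusing the train split prepared above:

import org.apache.spark.ml.classification.LogisticRegression

// Default hyperparameters; only the column names are set (and these are already the defaults).
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
// fit() launches the Spark job that trains the LogisticRegressionModel.
val lrModel = lr.fit(train)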
Model & Prediction
● Once complete, you can use the model to make predictions.
● We make predictions with the transform method.
● For example, we can transform our training dataset to see what labels our model
assigned to the training data and how those compare to the true outputs.
● This, again, is just another DataFrame we can manipulate. Let’s perform that
prediction with the following code snippet:
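A sketch of that snippet, reusing lrModel and the training split from above:

// transform() appends prediction, probability, and rawPrediction columns.
val predictions = lrModel.transform(train)
// Compare the model's predictions with the true labels.
predictions.select("features", "label", "prediction").show(5)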
Thanks!
Do you have any questions?