
Big Data Analytics


Dr. Nesma Mahmoud
Lecture 11: Large-Scale ML for Big Data Analytics
Big Data Analytics (In short)
Goal: Generalizations
A model or summarization of the data.

Data/Workflow Frameworks: Spark, MapReduce, Hadoop File System, Streaming
Analytics and Algorithms: Large-scale Data Mining/ML
What will we learn in this lecture?
01. Intro to ML/DM

02. Spark MLlib

03. Pipeline Example


01. Intro to ML/DM
What is ML? ML vs DM?
Machine learning is a specific subset of AI that trains a machine how to learn. (SAS definition)

Machine learning (ML) is a type of artificial intelligence that allows machines to learn from data without being explicitly programmed. (ISO definition)

"ML is a scientific discipline that deals with the construction and study of algorithms that can learn from data.
Such algorithms operate:
1. by building a model based on inputs
2. and using that model to make predictions and decisions, rather than following explicitly programmed instructions."
Supervised vs Unsupervised Learning
Supervised Learning
• Using labeled historical data and training
a model to predict the values of those
labels based on various features of the
data points.
• Classification (categorical values)
• E.g., predicting disease, classifying
images, ...
• Regression (continuous values)
• E.g., predicting sales, predicting height, ...
Unsupervised Learning
• No label to predict.
• Trying to find patterns or discover the
underlying structure in a given set of
data.
• Clustering, anomaly detection,
dimensionality reduction.
ML as a Process
● Data Ingestion: browser and mobile application event logs, or accessing external web APIs.
● Data Storage: HDFS, Amazon S3, and other filesystems; SQL databases such as MySQL or PostgreSQL; distributed NoSQL data stores such as HBase, Cassandra, and DynamoDB, …
● Data Transformation: filter out or remove records with bad or missing values; fill in bad or missing data; apply robust techniques to outliers; apply transformations to potential outliers; extract useful features.
Theoretical vs. Practical Approach to ML for Big Data Analytics

● Theoretical Approach
• Understanding the underlying algorithms and mathematical principles.
• Developing a strong foundation in statistical learning theory.
• Implementing models from scratch using libraries like TensorFlow or PyTorch.
• Already taken in other courses.

● Practical Approach
• Learning how to leverage powerful tools and frameworks designed for big data.
• Understanding the distributed computing paradigms used in these tools.
• Applying machine learning techniques to real-world, large-scale datasets.
• The focus of this course.

The Advanced Analytic Process
● Data collection
● Data cleaning
● Feature engineering
● Training models
● Model tuning and evaluation

Feature engineering
● Feature selection: choose the most relevant features (variables, columns) from the dataset that will contribute to the model's accuracy.
● Feature creation: derive new features from existing ones to capture additional information, for example creating an "age" feature from "date of birth" and "current date".
● Feature transformation: apply transformations such as log transforms, normalization, or one-hot encoding to improve model performance.
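As a small illustration of the feature transformation step, the sketch below indexes a categorical column and one-hot encodes it in PySpark. The column names and values are made up for the example, and it assumes Spark 3.x, where OneHotEncoder is an Estimator.

```python
# Minimal sketch of feature transformation (hypothetical "city"/"income" columns).
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.appName("feature-transform-demo").getOrCreate()

df = spark.createDataFrame(
    [("Cairo", 1000.0), ("Alexandria", 1500.0), ("Cairo", 900.0)],
    ["city", "income"])

# Map the categorical column to numeric indices, then one-hot encode it.
indexer = StringIndexer(inputCol="city", outputCol="cityIndex")
encoder = OneHotEncoder(inputCol="cityIndex", outputCol="cityVec")

indexed = indexer.fit(df).transform(df)
encoded = encoder.fit(indexed).transform(indexed)
encoded.show(truncate=False)
```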
02. Spark MLlib
Core libraries of Apache Spark
Another popular aspect of Spark is
its ability to perform large-scale
machine learning with a built-in
library of machine learning
algorithms called MLlib.

• Spark provides support for statistics and machine learning:
• Supervised learning
• Unsupervised learning
• Deep learning
• MLlib is also comparable to, or even better than, other libraries specialized in large-scale machine learning (e.g., Mahout, scikit-learn).
What is Spark MLlib?
● MLlib is Spark’s machine learning (ML) library.
● Its goal is to make practical machine learning scalable and easy.
● At a high level, it provides tools such as:
○ ML Algorithms: common learning algorithms such as classification,
regression, clustering, and collaborative filtering
■ classification: logistic regression, linear support vector machine (SVM), naive Bayes, multilayer perceptron (MLP), decision trees
■ regression: linear regression, generalized linear regression (GLM), tree-based regressors
■ collaborative filtering: Alternating Least Squares (ALS)
■ clustering: k-means, bisecting k-means, Gaussian mixture models, LDA
○ Featurization: feature extraction, transformation, dimensionality reduction,
and selection
○ Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
○ Persistence: saving and loading algorithms, models, and Pipelines
○ Utilities: linear algebra, statistics, data handling, etc.
What is MLlib?
● MLlib is a package built on Spark.
● It provides interfaces for:
○ • Gathering and cleaning data
○ • Feature engineering and feature selection
○ • Training and Tuning large-scale supervised and unsupervised machine
learning models
○ • Using those models in production
● MLlib consists of two packages.
○ org.apache.spark.mllib
■ • Uses RDDs
■ • It is in maintenance mode (only receives bug fixes, not new features)
○ org.apache.spark.ml
■ • Uses DataFrames
■ • Offers a high-level interface for building machine learning pipelines
High-Level MLlib Concepts
● ML pipelines (spark.ml) provide a uniform set of high-level APIs built on top of
DataFrames to create machine learning pipelines.
Pipeline
● Pipeline is a sequence of algorithms to process and learn from data.
● E.g., a text document processing workflow might include several stages:
○ • Split each document’s text into words.
○ • Convert each document’s words into a numerical feature vector.
○ • Learn a prediction model using the feature vectors and labels.
● Main pipeline components: transformers and estimators

https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline
Transformers
● Transformers take a DataFrame as input and produce a new DataFrame as
output.
● DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which
can hold a variety of data types. E.g., a DataFrame could have different columns
storing text, feature vectors, true labels, and predictions.
● The class Transformer implements a method transform() that converts one
DataFrame into another.
● For example, VectorAssembler is a Transformer: it takes an input DataFrame and returns a transformed DataFrame with a new column that is a vector representation of all the features.
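A minimal PySpark sketch of this behavior (the toy DataFrame and column names are made up):

```python
# Sketch: VectorAssembler is a Transformer — transform() adds a "features"
# vector column assembled from the listed input columns.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("assembler-demo").getOrCreate()

df = spark.createDataFrame([(1.0, 14.3), (8.0, 2.5)], ["value1", "value2"])

assembler = VectorAssembler(inputCols=["value1", "value2"], outputCol="features")
assembler.transform(df).show(truncate=False)
```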

Estimators
● Estimator is an abstraction of a learning algorithm that fits a model on a dataset.
● The class Estimator implements a method fit(), which accepts a DataFrame and
produces a Model (Transformer).
● For example, LogisticRegression is an Estimator that returns a LogisticRegressionModel (a Transformer) after learning parameters from the data.
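A minimal fit() call might look like the following PySpark sketch (the tiny training DataFrame is made up):

```python
# Sketch: an Estimator's fit() consumes a DataFrame and returns a Model (a Transformer).
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("estimator-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense(1.0, 2.0)), (1.0, Vectors.dense(3.0, 4.0))],
    ["label", "features"])

lr = LogisticRegression()   # Estimator
model = lr.fit(train)       # returns a LogisticRegressionModel (a Transformer)
model.transform(train).show()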
How Does Pipeline Work? (1/3)
● A pipeline is a sequence of stages.
○ Pipeline class has a fit() method which kicks off the entire workflow.
● Stages of a pipeline run in order.
● The input DataFrame is transformed as it passes through each stage.
○ • Each stage is either a Transformer or an Estimator.
● E.g., the top row represents a Pipeline with three stages. The first two
(Tokenizer and HashingTF) are Transformers (blue), and the third
(LogisticRegression) is an Estimator (red).
● The bottom row represents data flowing through the pipeline, where
cylinders indicate DataFrames.
How Does Pipeline Work? (2/3)
● Pipeline.fit(): is called on the original DataFrame
○ • DataFrame with raw text documents and labels
● Tokenizer.transform(): splits the raw text documents into words
○ • Adds a new column with words to the DataFrame
● HashingTF.transform(): converts the words column into feature vectors
○ • Adds new column with those vectors to the DataFrame
● LogisticRegression.fit(): produces a model (LogisticRegressionModel).
How Does Pipeline Work? (3/3)
● A Pipeline is an Estimator (DataFrame =[fit]=> Model).
○ After a Pipeline’s fit() runs, it produces a PipelineModel.
● PipelineModel is a Transformer (DataFrame =[transform]=> DataFrame).
○ The PipelineModel is used at test time.
● During execution, each stage is called sequentially, and depending on the type of PipelineStage (whether it is a Transformer or an Estimator), its respective transform() or fit() method is called. Take a look at the diagram to get a better understanding of the flow.
Parameters
● MLlib Estimators and Transformers use a uniform API for specifying parameters.
● A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.
● There are two main ways to pass parameters to an algorithm:
1. Set parameters for an instance.
■ E.g., if lr is an instance of LogisticRegression, one could call
lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API
resembles the API used in spark.mllib package.
2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will
override parameters previously specified via setter methods.
Parameters – examples
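A PySpark sketch of both approaches (the tiny training DataFrame is made up):

```python
# Sketch: two ways to specify parameters for an Estimator.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("params-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense(1.0, 2.0)), (1.0, Vectors.dense(3.0, 4.0))],
    ["label", "features"])

lr = LogisticRegression()

# 1. Set parameters on the instance via setter methods.
lr.setMaxIter(10).setRegParam(0.01)
model1 = lr.fit(train)

# 2. Pass a ParamMap to fit(); these values override the setters above.
param_map = {lr.maxIter: 20, lr.regParam: 0.1}
model2 = lr.fit(train, params=param_map)
# model2 was trained with maxIter=20 and regParam=0.1, overriding the setter values.
```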
03. Pipeline Example
Read Data as DataFrame
● This dataset consists of a categorical label with
two values (good or bad), a categorical
variable (color), and two numerical variables.
While the data is synthetic, let’s imagine that
this dataset represents a company’s customer
health.
● The “color” column represents some
categorical health rating made by a customer
service representative.
● The “lab” column represents the true customer
health.
● The other two values are some numerical
measures of activity within an application (e.g.,
minutes spent on site and purchases).
Suppose that we want to train a classification
model where we hope to predict a binary
variable—the label—from the other values.
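A sketch of loading such a dataset into a DataFrame (the file path and the JSON format are placeholders, not the course's actual file):

```python
# Sketch: reading the example dataset (columns: color, lab, value1, value2).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-data").getOrCreate()

df = spark.read.json("/path/to/simple-ml")   # placeholder path
df.printSchema()
df.show(5, truncate=False)
```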
Feature Engineering with Transformers
● Manipulating these columns is often in pursuit of building features (that
we will input into our model).
● Transformers exist to either cut down the number of features, add more
features, manipulate current ones, or simply to help us format our data
correctly.
○ Transformers add new columns to DataFrames.
● When we use MLlib, all inputs to machine learning algorithms in Spark
must consist of type Double (for labels) and Vector[Double] (for
features).
Feature Engineering with Transformers
● The current dataset does not meet that requirement and therefore we need to
transform it to the proper format.

This transformation can be declared with an R-style formula (used by the RFormula transformer): lab ~ . + color:value1 + color:value2
● lab ~ .
○ This specifies the dependent variable, lab. The dot (.) represents all the other columns in the dataset as independent variables, i.e., every column except lab is treated as a feature.
● + color:value1 + color:value2
○ This adds interaction terms to the model: color:value1 and color:value2 create interaction features between the color column and the value1 and value2 columns, respectively.
○ This means the model considers not only the individual effects of color, value1, and value2, but also how these variables interact with each other.
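A sketch of declaring this transformation with RFormula (assuming the DataFrame df loaded in the earlier sketch):

```python
# Sketch: RFormula builds a "features" vector and a numeric "label" column
# from an R-style formula. Assumes df has columns color, lab, value1, value2.
from pyspark.ml.feature import RFormula

supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
```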
Transformation output
In the output we can see the result of our
transformation—a column called features
that has our previously raw data. What’s
happening behind the scenes is actually
pretty simple. RFormula inspects our data
during the fit call and outputs an object
that will transform our data according to
the specified formula, which is called an
RFormulaModel.
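Continuing the sketch (using the supervised RFormula and df from above):

```python
# Sketch: fitting the RFormula produces an RFormulaModel, whose transform()
# appends the "features" and "label" columns to the DataFrame.
fitted_rf = supervised.fit(df)
prepared_df = fitted_rf.transform(df)
prepared_df.show(5, truncate=False)
```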
Transformation output & Split
Example output rows after the transformation:
[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]
The 5th column is a structure representing sparse vectors in Spark. It has three components:
• vector length – in this case all vectors have length 10
• an index array holding the indices of the non-zero elements
• a value array of the non-zero values
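A sketch of the train/test split referred to in the slide title (the 70/30 proportions and the seed are assumptions):

```python
# Sketch: split the prepared DataFrame into training and test sets.
train, test = prepared_df.randomSplit([0.7, 0.3], seed=42)
```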
Estimators
● To create our classifier we instantiate an instance of LogisticRegression, using the
default configuration or hyperparameters.
● We then set the label columns and the feature columns; the column names we are
setting—label and features—are actually the default labels for all estimators in
Spark MLlib.

● Upon instantiating an untrained algorithm, it becomes time to fit it to data. In this case, this returns a LogisticRegressionModel:

This code will kick off a Spark job to train the model.
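A sketch of the code described above (label and features are the columns produced by RFormula; train is the split from the earlier sketch):

```python
# Sketch: instantiate the Estimator, point it at the label/feature columns,
# and fit it to the training split. fit() kicks off a Spark job.
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features")
fitted_lr = lr.fit(train)   # returns a LogisticRegressionModel
```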
Model & Prediction
● Once complete, you can use the model to make predictions.
● We make predictions with the transform method.
● For example, we can transform our training dataset to see what labels our model
assigned to the training data and how those compare to the true outputs.
● This, again, is just another DataFrame we can manipulate. Let’s perform that
prediction with the following code snippet:
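The original snippet is not reproduced on the slide; the following is a sketch of what it likely does (using the fitted_lr model and train DataFrame from the sketches above):

```python
# Sketch: use the trained model as a Transformer to generate predictions,
# then compare them to the true labels.
fitted_lr.transform(train) \
    .select("label", "prediction") \
    .show(10)
```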
Thanks!
Do you have any questions?
