BDA Lec11
Course context: Hadoop File System, MapReduce, Spark, Streaming, Large-scale Data Mining/ML
What will we learn in this lecture?
01. Intro to ML/DM
Machine learning (ML) is a type of artificial intelligence that allows machines to learn from data without being explicitly programmed.
ISO definition:
“ML is a scientific discipline that deals with the construction and study of algorithms that can learn from data.
Such algorithms operate:
1. by building a model based on inputs
2. and using that model to make predictions and decisions, rather than following explicitly programmed instructions.”
Supervised vs Unsupervised Learning
Supervised Learning
• Using labeled historical data and training a model to predict the values of those labels based on various features of the data points.
• Classification (categorical values)
  • E.g., predicting disease, classifying images, ...
• Regression (continuous values)
  • E.g., predicting sales, predicting height, ...
Unsupervised Learning
• No label to predict.
• Trying to find patterns or discover the underlying structure in a given set of data.
• Clustering, anomaly detection, dimensionality reduction.
ML as a Process
● Data Ingestion:
  ○ Browser and mobile application event logs, or accessing external web APIs.
● Data Storage:
  ○ HDFS, Amazon S3, and other filesystems; SQL databases such as MySQL or PostgreSQL; distributed NoSQL data stores such as HBase, Cassandra, and DynamoDB, …
● Data Transformation (see the sketch below):
  ○ Filter out or remove records with bad or missing values
  ○ Fill in bad or missing data
  ○ Apply robust techniques to outliers
  ○ Apply transformations to potential outliers
  ○ Extract useful features
Theoretical vs. Practical Approach to ML for Big Data Analytics
Theoretical approach:
• Understanding the underlying algorithms and mathematical principles.
• Developing a strong foundation in statistical learning theory.
• Implementing models from scratch using libraries like TensorFlow or PyTorch.
Practical approach:
• Learning how to leverage powerful tools and frameworks designed for big data.
• Understanding the distributed computing paradigms used in these tools.
• Applying machine learning techniques to real-world, large-scale datasets.
https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline
Transformers
● Transformers take a DataFrame as input and produce a new DataFrame as
output.
● DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which
can hold a variety of data types. E.g., a DataFrame could have different columns
storing text, feature vectors, true labels, and predictions.
● The class Transformer implements a method transform() that converts one
DataFrame into another.
● For example, VectorAssembler is a Transformer: it takes the input DataFrame and returns a transformed DataFrame with a new column that is a vector representation of all the features.
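A minimal sketch of a Transformer in use; df and the input column names value1 and value2 are assumptions, not part of the lecture's dataset.

import org.apache.spark.ml.feature.VectorAssembler

// A Transformer only needs transform(): no learning, just a column-wise conversion.
val assembler = new VectorAssembler()
  .setInputCols(Array("value1", "value2"))   // assumed numeric input columns
  .setOutputCol("features")                  // new vector column appended to the DataFrame
val assembled = assembler.transform(df)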
Estimators
● Estimator is an abstraction of a learning algorithm that fits a model on a dataset.
● The class Estimator implements a method fit(), which accepts a DataFrame and
produces a Model (Transformer).
● For example, LogisticRegression is an Estimator which returns a LogisticRegressionModel (a Transformer) after learning parameters from the data.
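A minimal sketch of the Estimator-to-Model pattern; train is an assumed DataFrame that already has label and features columns.

import org.apache.spark.ml.classification.LogisticRegression

// Estimator: holds hyperparameters and knows how to learn.
val lr = new LogisticRegression()
// fit() runs the learning algorithm and returns a Model, which is itself a Transformer.
val lrModel = lr.fit(train)
// The fitted Model appends prediction columns when applied to a DataFrame.
val scored = lrModel.transform(train)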
How Does Pipeline Work? (1/3)
● A pipeline is a sequence of stages.
○ Pipeline class has a fit() method which kicks off the entire workflow.
● Stages of a pipeline run in order.
● The input DataFrame is transformed as it passes through each stage.
○ Each stage is either a Transformer or an Estimator.
● E.g., in the figure from the Spark ML Pipeline documentation (linked above), the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). A sketch of this pipeline follows below.
● The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.
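A sketch of that three-stage text-classification Pipeline, following the Spark ML Pipeline documentation; the input column names text and label are assumptions.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1 (Transformer): raw text -> words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// Stage 2 (Transformer): words -> hashed term-frequency feature vectors
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
// Stage 3 (Estimator): learns a LogisticRegressionModel
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))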
How Does Pipeline Work? (2/3)
● Pipeline.fit() is called on the original DataFrame
  ○ a DataFrame with raw text documents and labels
● Tokenizer.transform() splits the raw text documents into words
  ○ adds a new column with words to the DataFrame
● HashingTF.transform() converts the words column into feature vectors
  ○ adds a new column with those vectors to the DataFrame
● LogisticRegression.fit() produces a model (LogisticRegressionModel).
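Continuing the sketch above, one call drives all of these stage-level calls; training is an assumed DataFrame of raw text documents and labels.

// Runs Tokenizer.transform, HashingTF.transform, then LogisticRegression.fit, in order.
val model = pipeline.fit(training)   // returns a PipelineModel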
How Does Pipeline Work? (3/3)
● A Pipeline is an Estimator (DataFrame =[fit]=> Model).
○ After a Pipeline’s fit() runs, it produces a PipelineModel.
● PipelineModel is a Transformer (DataFrame =[transform]=> DataFrame).
○ The PipelineModel is used at test time.
● During execution each stage is called sequentially and, depending on the type of PipelineStage (whether it is a Transformer or an Estimator), its respective transform() or fit() method is called. Take a look at the diagram in the Spark ML Pipeline documentation to get a better understanding of the flow.
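A sketch of using the fitted PipelineModel at test time; test is an assumed DataFrame with the same raw text column as the training data.

// PipelineModel is a Transformer: it replays the fitted stages on new data.
val predictions = model.transform(test)
predictions.select("text", "probability", "prediction").show(5)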
Parameters
● MLlib Estimators and Transformers use a uniform API for specifying parameters.
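For example, parameters can be set with setters on an instance or passed in a ParamMap at fit() time; this is a sketch, the values shown are illustrative, and train is an assumed training DataFrame.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Setter style: configure the instance directly.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(train)

// ParamMap style: override parameters for a specific fit() call.
val paramMap = ParamMap(lr.maxIter -> 20).put(lr.regParam, 0.1)
val model2 = lr.fit(train, paramMap)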
RFormula lets us declare the label and the features with R-style formula syntax, e.g. lab ~ . + color:value1 + color:value2
• lab ~ .
  • This part specifies the dependent variable, which is lab.
  • The dot (.) represents all the other columns in the dataset as independent variables. This means that all columns except lab will be considered as features.
• + color:value1 + color:value2
  • This part adds interaction terms to the model.
  • color:value1 and color:value2 create interaction features between the color column and the value1 and value2 columns, respectively.
  • This means that the model will consider not only the individual effects of color, value1, and value2, but also how these variables interact with each other.
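A sketch of declaring that formula as a Spark MLlib RFormula transformer:

import org.apache.spark.ml.feature.RFormula

// Declares how to turn the raw columns into a label column and a features vector.
val supervised = new RFormula()
  .setFormula("lab ~ . + color:value1 + color:value2")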
Transformation output
In the output we can see the result of our transformation: a column called features that has our previously raw data. What's happening behind the scenes is actually pretty simple. RFormula inspects our data during the fit call and outputs an object that will transform our data according to the specified formula, which is called an RFormulaModel.
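The corresponding calls, continuing the sketch above; df is the assumed input DataFrame with the color, lab, value1, and value2 columns.

// fit() inspects the data (e.g. the distinct values of color) and returns an RFormulaModel.
val fittedRF = supervised.fit(df)
// transform() appends the label and features columns described by the formula.
val preparedDF = fittedRF.transform(df)
preparedDF.show(false)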
Transformation output & Split
Sample rows of the transformed DataFrame:
[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]
The 5th column is a structure representing sparse vectors in Spark. It has three components:
• vector length - in this case all vectors are of length 10 elements
• index array holding the indices of the non-zero elements
• value array holding the non-zero values
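The slide title also mentions the split; a sketch of the usual random train/test split on the prepared DataFrame (the 70/30 ratio is an assumption):

// Random 70/30 split into training and test sets.
val Array(train, test) = preparedDF.randomSplit(Array(0.7, 0.3))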
Estimators
● To create our classifier we instantiate a LogisticRegression, using the default configuration of hyperparameters.
● We then set the label column and the features column; the column names we are setting, label and features, are actually the default names for all Estimators in Spark MLlib.
This code (sketched below) will kick off a Spark job to train the model.
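A sketch of that training step, reusing the train split prepared above:

import org.apache.spark.ml.classification.LogisticRegression

// Default hyperparameters; only the column names are set (and these are already the defaults).
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
// fit() launches the Spark job that trains the LogisticRegressionModel.
val lrModel = lr.fit(train)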
Model & Prediction
● Once complete, you can use the model to make predictions.
● We make predictions with the transform method.
● For example, we can transform our training dataset to see what labels our model
assigned to the training data and how those compare to the true outputs.
● This, again, is just another DataFrame we can manipulate. Let’s perform that
prediction with the following code snippet:
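A sketch of that snippet, reusing lrModel and the training split from above:

// transform() appends prediction, probability, and rawPrediction columns.
val predictions = lrModel.transform(train)
// Compare the model's predictions with the true labels.
predictions.select("features", "label", "prediction").show(5)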
Thanks!
Do you have any questions?