Machine Learning Spark ML
Department: MCA
Project Guide: Dr. Partha Sarathi Bishnu
SUPERVISED AND
UNSUPERVISED LEARNING
• UNSUPERVISED LEARNING
• THERE IS NO PREDEFINED,
KNOWN SET OF OUTCOMES
• LOOK FOR HIDDEN PATTERNS
AND RELATIONS IN THE DATA
• A TYPICAL EXAMPLE:
CLUSTERING
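As an illustration of clustering, the following is a minimal k-means sketch in plain Python. The function name `kmeans`, the toy points, and the choice of seeding the centroids with the first `k` points are assumptions made for this example, not part of the slides.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny k-means sketch; seeds centroids with the first k points (an assumption)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels

# Hypothetical toy data: two visibly separated groups, no labels supplied.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No outcome is given to the algorithm; the two groups emerge only from the distances between points.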
SUPERVISED AND
UNSUPERVISED LEARNING
• SUPERVISED LEARNING
• FOR EVERY EXAMPLE IN THE DATA THERE IS ALWAYS A
PREDEFINED OUTCOME
• MODELS THE RELATIONS BETWEEN A SET OF DESCRIPTIVE
FEATURES AND A TARGET (FITS A FUNCTION TO THE DATA)
• 2 GROUPS OF PROBLEMS:
• CLASSIFICATION
• REGRESSION
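To make "fitting a function" concrete for the regression case, here is a minimal least-squares sketch for a single feature. The helper name `fit_line` and the toy data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training examples: every feature value comes with a known numeric target.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The target here is a number, which is what makes this a regression problem; with a categorical target the same setup becomes classification.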
SUPERVISED LEARNING
• CLASSIFICATION
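A minimal classification sketch, assuming a 1-nearest-neighbour rule: a new example receives the label of the closest labelled training example. The function name `predict_1nn`, the labels, and the data are hypothetical.

```python
import math

def predict_1nn(train, query):
    """Return the label of the training example closest to `query`."""
    _, label = min(train, key=lambda example: math.dist(example[0], query))
    return label

# Hypothetical labelled data: (features, class label).
train = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

prediction_a = predict_1nn(train, (2.0, 1.5))
prediction_b = predict_1nn(train, (8.5, 8.0))
print(prediction_a, prediction_b)
```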
MACHINE LEARNING AS A
PROCESS
• Define Objectives
- Define measurable and quantifiable goals
- Use this stage to learn about the problem
• Model Building
- Data Splitting
- Feature Engineering
- Estimating Performance
- Evaluation and Model Selection
• Model Evaluation
ML AS A PROCESS: DATA PREPARATION
• Needed for several reasons
• Some Models have strict data requirements
• Scale of the data, data point intervals, etc.
• Some characteristics of the data may dramatically affect model performance
• Time spent on data preparation should not be underestimated
• Raw Data
- Missing Values
- Error Values
- Different Scales
- Dimensionality
- Types Problems
- Many others
• Data Transformation
- Scaling
- Centering
- Skewness
- Outliers
- Missing Values
- Errors
• Data Ready (Modeling phase)
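Two of the transformations named on this slide, handling missing values and centering/scaling, can be sketched as follows. This is a minimal example under stated assumptions: missing values are encoded as `None`, imputation uses the column mean, and scaling uses the population standard deviation. The helper names are hypothetical.

```python
import math

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Center to mean 0 and scale to unit variance (assumes a non-constant column)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Hypothetical raw column with one missing value.
raw = [2.0, None, 4.0, 6.0]
filled = impute_mean(raw)      # [2.0, 4.0, 4.0, 6.0]
scaled = standardize(filled)
print(filled, scaled)
```

After both steps the column has mean 0 and variance 1, which is the kind of input that scale-sensitive models expect.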
ML AS A PROCESS: FEATURE
ENGINEERING
• Determining which predictors (features) to use is one of the most critical questions
• Sometimes we need to add predictors
• Reduce Number:
• Fewer predictors give a more interpretable and less costly model
• Most models are affected by high dimensionality, especially by non-informative predictors
• Wrappers: algorithms that use models as input and performance as output
- Multiple models, adding and removing parameters
- Genetic Algorithms
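The wrapper idea (a model goes in, a performance score comes out, and predictors are added or removed accordingly) can be sketched with greedy forward selection. Everything here is an assumption chosen for illustration: the scoring model is a 1-nearest-neighbour classifier evaluated with leave-one-out accuracy, and the data contains one informative feature (index 0) and one noisy feature (index 1).

```python
import math

def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour model using only `features`."""
    hits = 0
    for i, row in enumerate(rows):
        def distance(j):
            return math.dist([row[f] for f in features], [rows[j][f] for f in features])
        nearest = min((j for j in range(len(rows)) if j != i), key=distance)
        hits += labels[nearest] == labels[i]
    return hits / len(rows)

def forward_selection(rows, labels, n_features):
    """Add one feature at a time for as long as the score strictly improves."""
    selected, best_score = [], 0.0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in candidates]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the model; stop adding predictors
        selected.append(feature)
        best_score = score
    return selected, best_score

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [(1.0, 5.0), (1.2, 1.0), (0.9, 9.0), (3.0, 5.2), (3.1, 0.8), (2.9, 9.1)]
labels = ["a", "a", "a", "b", "b", "b"]
selected, best_score = forward_selection(rows, labels, n_features=2)
print(selected, best_score)
```

The non-informative feature is rejected because adding it lowers the score, which is the behaviour the slide attributes to wrappers.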
ML AS A PROCESS: MODEL
BUILDING
• DATA SPLITTING
• ALLOCATE DATA TO DIFFERENT TASKS
• MODEL TRAINING
• PERFORMANCE EVALUATION
• DEFINE TRAINING, VALIDATION AND TEST SETS
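A minimal sketch of allocating data to training, validation and test sets. The 60/20/20 proportions, the seed, and the function name `split_dataset` are assumptions for this example. In PySpark, `DataFrame.randomSplit` offers a comparable weighted random split.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into train/validation/test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return train_set, validation_set, test_set

# Hypothetical dataset of ten examples, identified by index.
rows = list(range(10))
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

The training set fits the model, the validation set guides model selection, and the test set is held back for the final performance estimate.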