Machine Learning Spark ML
ML in Real Life
Supervised and Unsupervised Learning
• Unsupervised Learning
• There is no predefined, known set of outcomes
• Look for hidden patterns and relations in the data
• A typical example: Clustering
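As an illustration of clustering, here is a minimal sketch of k-means on one-dimensional data using only the Python standard library. The function name `kmeans_1d`, the sample values and the starting centers are assumptions made for this example; they do not come from the slides.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster 1-D values around the given starting centers (k-means sketch)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical data: no labels are given, only raw measurements.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

No outcome is supplied for any value; the two groups emerge from the distances between the points alone.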
Supervised and Unsupervised Learning
• Supervised Learning
• For every example in the data there is always a predefined outcome
• Models the relations between a set of descriptive features and a target (fits a function to the data)
• Two groups of problems:
• Classification
• Regression
Supervised Learning
• Classification
• Predicts which class (a discrete value) a given data sample (a set of descriptive features) belongs to.
• Regression
• Predicts continuous values.
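The difference between the two problem groups can be sketched with one feature and two kinds of target. The example below assumes a least-squares line for regression and a nearest-centroid rule for classification; both are stand-ins chosen for brevity, and all names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def fit_centroids(xs, labels):
    """Nearest-centroid classifier: store the mean feature value per class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(g) / len(g) for label, g in groups.items()}


def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Hypothetical training data: one descriptive feature per example.
sizes = [1.0, 2.0, 3.0, 4.0]

# Regression: the target is a continuous value.
prices = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 5.0 + intercept

# Classification: the target is a discrete class.
labels = ["small", "small", "large", "large"]
centroids = fit_centroids(sizes, labels)
predicted_class = classify(centroids, 3.6)
```

Both models are trained on examples that carry a known outcome; only the type of the outcome differs.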
Machine Learning as a Process
Define Objectives
- Define measurable and quantifiable goals
- Use this stage to learn about the problem
Data Preparation
- Normalization
- Transformation
- Missing Values
- Outliers
Model Building
- Data Splitting
- Feature Engineering
- Estimating Performance
- Evaluation and Model Selection
Model Evaluation
- Study the model's accuracy
- Does it work better than the naïve approach or the previous system?
- Do the results make sense in the context of the problem?
Model Deployment
ML as a Process: Data Preparation
• Needed for several reasons
• Some models have strict data requirements
• Scale of the data, data point intervals, etc.
• Some characteristics of the data can dramatically affect model performance
• The time spent on data preparation should not be underestimated
Raw Data → Data Transformation → Ready Data → Modeling phase
Raw Data
• Missing Values
• Error Values
• Different Scales
• Dimensionality
• Types Problems
• Many others
Data Transformation
• Scaling
• Centering
• Skewness
• Outliers
• Missing Values
• Errors
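Two of the transformations named above, handling missing values and centering and scaling, can be sketched as follows. Mean imputation is only one possible strategy and is an assumption of this example, as are the function names and the sample column; the sketch also assumes the column is not constant, since a zero standard deviation would make the scaling step divide by zero.

```python
import statistics


def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumed non-zero (non-constant column)
    return [(v - mean) / std for v in values]


# Hypothetical raw column with one missing value.
raw = [2.0, None, 4.0, 6.0]
filled = impute_missing(raw)   # missing entry replaced by the observed mean
scaled = standardize(filled)   # centered and scaled copy of the column
```

After these two steps the column has no gaps, a mean of zero and a standard deviation of one, which is the form many scale-sensitive models expect.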
ML as a Process: Feature engineering
• Determining which predictors (features) to use is one of the most critical questions
• Sometimes we need to add predictors
• Reduce Number:
• Fewer predictors give a more interpretable and less costly model
• Most models are affected by high dimensionality, especially by non-informative predictors
• Wrappers: multiple models, adding and removing parameters
• Genetic Algorithms: algorithms that use models as input and performance as output
• Binning predictors
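Binning a predictor means replacing a continuous value with the index of the interval it falls into. A minimal sketch, assuming fixed, hand-picked cut points; the function name `bin_predictor`, the ages and the edges are all hypothetical.

```python
import bisect


def bin_predictor(values, edges):
    """Map each value to the index of the bin it falls into.

    `edges` must be sorted; a value equal to an edge goes to the upper bin.
    """
    return [bisect.bisect_right(edges, v) for v in values]


# Hypothetical predictor (age in years) and hypothetical cut points.
ages = [5, 17, 18, 40, 65, 80]
bins = bin_predictor(ages, edges=[18, 65])
# bins == [0, 0, 1, 1, 2, 2]
```

Two cut points produce three bins, so the model sees a small set of categories instead of the raw ages. Spark ML's `Bucketizer` transformer covers the same idea for DataFrame columns.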
ML as a Process: Model Building
• Data Splitting
• Allocate data to different tasks
• model training
• performance evaluation
• Define Training, Validation and Test sets
• Feature Selection (Review the decision made previously)
• Estimating Performance
• Visualization of results: discover interesting areas of the problem space
• Statistics and performance measures
• Evaluation and Model selection
• The ‘no free lunch’ theorem: no a priori assumptions can be made
• Avoid using favorite models unless NEEDED
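The splitting and evaluation steps above can be sketched as follows: shuffle the data, allocate it to training, validation and test sets, then compare a candidate model against a naive baseline on the validation set. The 60/20/20 proportions, the seed, the data and `candidate_model` are assumptions made for this example.

```python
import random


def split_data(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted targets."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data set of (feature, target) pairs following y = 2x + 1.
rows = [(x, 2 * x + 1) for x in range(20)]
train_set, validation_set, test_set = split_data(rows)

# Naive baseline: always predict the mean target seen in training.
baseline = sum(y for _, y in train_set) / len(train_set)


def candidate_model(x):
    """Hypothetical, already trained model under evaluation."""
    return 2 * x + 1


actual = [y for _, y in validation_set]
baseline_mae = mean_absolute_error(actual, [baseline] * len(actual))
candidate_mae = mean_absolute_error(
    actual, [candidate_model(x) for x, _ in validation_set]
)
```

The test set is left untouched here; it is reserved for a final performance estimate once a model has been selected on the validation set.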