Machine Learning Spark ML

The document discusses machine learning and Spark MLlib. It covers topics like supervised and unsupervised learning, the machine learning process including data preparation, feature engineering, model building, and model evaluation. Spark MLlib can be used to build machine learning models on large datasets.

Uploaded by

Aditya Kumar

Machine Learning with Spark MLlib

Name: Aditya Kumar
Roll No: MCA/40053/22
Department: MCA
Project Guide: Dr. Partha Sarathi Bishnu
Institute Name: Birla Institute of Technology, Mesra, Lalpur

MACHINE LEARNING (ML)

• ML IS A BRANCH OF ARTIFICIAL INTELLIGENCE:
• USES COMPUTING-BASED SYSTEMS TO MAKE SENSE OF DATA
• EXTRACTING PATTERNS, FITTING DATA TO FUNCTIONS, CLASSIFYING DATA, ETC.
• ML SYSTEMS CAN LEARN AND IMPROVE
• WITH HISTORICAL DATA, TIME AND EXPERIENCE
• BRIDGES THEORETICAL COMPUTER SCIENCE AND REAL, NOISY DATA
ML IN REAL-LIFE

SUPERVISED AND UNSUPERVISED LEARNING
• UNSUPERVISED LEARNING
• THERE IS NO PREDEFINED AND KNOWN SET OF OUTCOMES
• LOOK FOR HIDDEN PATTERNS AND RELATIONS IN THE DATA
• A TYPICAL EXAMPLE: CLUSTERING
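As an illustration of clustering, here is a minimal sketch of k-means on one-dimensional data. It uses only the Python standard library rather than Spark MLlib so that it runs without a Spark installation; the function name `kmeans_1d`, the sample values and the starting centroids are assumptions made for this example, not part of the original slides.

```python
from statistics import mean

def kmeans_1d(values, centroids, iterations=10):
    """Alternate two steps: assign each value to its nearest centroid,
    then move each centroid to the mean of the values assigned to it."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical data with two visible groups and no labels attached.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, groups = kmeans_1d(values, centroids=[0.0, 5.0])
print(centroids, groups)
```

No outcome is supplied for any value; the two groups emerge only from the distances between the points.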

SUPERVISED AND UNSUPERVISED LEARNING
• SUPERVISED LEARNING
• FOR EVERY EXAMPLE IN THE DATA THERE IS ALWAYS A PREDEFINED OUTCOME
• MODELS THE RELATIONS BETWEEN A SET OF DESCRIPTIVE FEATURES AND A TARGET (FITS DATA TO A FUNCTION)
• 2 GROUPS OF PROBLEMS:
• CLASSIFICATION
• REGRESSION
SUPERVISED LEARNING
• CLASSIFICATION
• PREDICTS WHICH CLASS A GIVEN SAMPLE OF DATA (SAMPLE OF DESCRIPTIVE FEATURES) IS PART OF (DISCRETE VALUE).
• REGRESSION
• PREDICTS CONTINUOUS VALUES.
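The difference between the two problem groups can be sketched with one tiny model of each kind. This is an illustrative example in plain Python, not Spark MLlib code: the regression model is an ordinary least-squares line, the classifier is a nearest-centroid rule, and all data values and function names are assumptions chosen for the example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def fit_centroids(xs, labels):
    """Nearest-centroid classifier: remember the mean feature value of each class."""
    return {label: mean([x for x, l in zip(xs, labels) if l == label])
            for label in set(labels)}

def classify(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Regression: the target is a continuous value.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
predicted_value = slope * 5 + intercept

# Classification: the target is one of a fixed set of classes.
class_centroids = fit_centroids([1.0, 1.5, 8.0, 9.0],
                                ["small", "small", "large", "large"])
predicted_class = classify(class_centroids, 7.0)
print(predicted_value, predicted_class)
```

Both models are fitted from examples that carry a known outcome; only the type of the outcome differs.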

MACHINE LEARNING AS A PROCESS
• Define Objectives
- Define measurable and quantifiable goals
- Use this stage to learn about the problem
• Data Preparation
- Normalization
- Transformation
- Missing Values
- Outliers
• Model Building
- Data Splitting
- Features Engineering
- Estimating Performance
- Evaluation and Model Selection
• Model Evaluation
- Study models accuracy
- Work better than the naïve approach or previous system
- Do the results make sense in the context of the problem
• Model Deployment
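One way to check that a model works better than the naïve approach is to compare its error with that of a trivial predictor. The sketch below assumes mean absolute error as the measure and "always predict the training mean" as the naïve approach; the numbers, including the model predictions, are hypothetical.

```python
from statistics import mean

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])

train_targets = [10.0, 12.0, 14.0, 16.0]
test_actual = [18.0, 20.0]
model_predictions = [17.5, 20.5]  # hypothetical output of a trained model

# Naive approach: ignore the features and always predict the training mean.
naive_value = mean(train_targets)
naive_predictions = [naive_value] * len(test_actual)

model_error = mean_absolute_error(test_actual, model_predictions)
naive_error = mean_absolute_error(test_actual, naive_predictions)
beats_baseline = model_error < naive_error
print(model_error, naive_error, beats_baseline)
```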
ML AS A PROCESS: DATA PREPARATION
• Needed for several reasons
• Some models have strict data requirements
• Scale of the data, data point intervals, etc.
• Some characteristics of the data may dramatically affect model performance
• Time spent on data preparation should not be underestimated
• Raw Data
- Missing Values
- Error Values
- Different Scales
- Dimensionality
- Types Problems
- Many others
• Data Transformation
- Scaling
- Centering
- Skewness
- Outliers
- Missing Values
- Errors
• Data Ready
• Modeling phase
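Two of the transformations listed above, handling missing values and centering plus scaling, can be sketched as follows. This is a minimal standard-library example, assuming mean imputation and standardization to zero mean and unit standard deviation; other strategies are equally valid, and the function names and values are illustrative only.

```python
from statistics import mean, pstdev

def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

raw = [2.0, None, 4.0, 6.0]   # hypothetical column with one missing value
filled = impute_missing(raw)
scaled = standardize(filled)
print(filled, scaled)
```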

ML AS A PROCESS: FEATURE ENGINEERING
• Determining the predictors (features) to be used is one of the most critical questions
• Sometimes we need to add predictors
• Reduce Number:
• Fewer predictors give a more interpretable and less costly model
• Most models are affected by high dimensionality, especially by non-informative predictors
• Binning predictors
• Wrappers
- Algorithms that use models as input and performance as output
- Multiple models adding and removing parameter
- Genetics Algorithms
• Filters
- Evaluate the relevance of the predictor
- Based normally on correlations
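A correlation-based filter can be sketched in a few lines. The example below assumes the Pearson correlation with the target as the relevance score and an arbitrary threshold of 0.5; the feature names, values and threshold are hypothetical.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def filter_features(features, target, threshold=0.5):
    """Keep the predictors whose absolute correlation with the target
    reaches the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],    # closely follows the target
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],   # nearly unrelated to the target
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]
selected = filter_features(features, target)
print(selected)
```

A filter scores each predictor on its own, before any model is trained, which is what distinguishes it from a wrapper.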

ML AS A PROCESS: MODEL BUILDING
• DATA SPLITTING
• ALLOCATE DATA TO DIFFERENT TASKS
• MODEL TRAINING
• PERFORMANCE EVALUATION
• DEFINE TRAINING, VALIDATION AND TEST SETS

• FEATURE SELECTION (REVIEW THE DECISION MADE PREVIOUSLY)


• ESTIMATING PERFORMANCE
• VISUALIZATION OF RESULTS – DISCOVER INTERESTING AREAS OF THE PROBLEM SPACE
• STATISTICS AND PERFORMANCE MEASURES

• EVALUATION AND MODEL SELECTION


• THE ‘NO FREE LUNCH’ THEOREM: NO A PRIORI ASSUMPTIONS CAN BE MADE
• AVOID THE USE OF FAVORITE MODELS IF NEEDED
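The steps on this slide can be sketched end to end: split the data, train candidate models on the training set, pick one on the validation set, and report its error on the test set. This is an illustrative standard-library example, not Spark MLlib code; the 60/20/20 split, the two candidate models, the error measure and the synthetic data are all assumptions.

```python
import random
from statistics import mean

def split_rows(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle once, then cut into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    a = round(len(shuffled) * train)
    b = a + round(len(shuffled) * validation)
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def fit_mean(rows):
    """Candidate 1: always predict the mean training target."""
    m = mean([y for _, y in rows])
    return lambda x: m

def fit_line(rows):
    """Candidate 2: least-squares line through the training rows."""
    x_bar = mean([x for x, _ in rows])
    y_bar = mean([y for _, y in rows])
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x, _ in rows))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept

def mae(model, rows):
    """Mean absolute error of a fitted model on (x, y) rows."""
    return mean([abs(model(x) - y) for x, y in rows])

# Hypothetical, exactly linear data: y = 3x + 1.
rows = [(float(x), 3.0 * x + 1.0) for x in range(20)]
train_set, validation_set, test_set = split_rows(rows)

candidates = {"mean": fit_mean, "line": fit_line}
fitted = {name: fit(train_set) for name, fit in candidates.items()}

# Select on validation error, not on preference for a particular model.
best = min(fitted, key=lambda name: mae(fitted[name], validation_set))
test_error = mae(fitted[best], test_set)
print(best, test_error)
```

The test set is touched only once, after the choice has been made, so its error is not biased by the selection step.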
