Mla Aiml
Thursday (3PM - 4PM)
Friday (2PM - 3PM)
Monday (2PM - 4PM) - LAB
20 August 2020 Department of CSE, GIT EID 403 and machine learning 2
Syllabus
Module I: Number of hours (LTP) 9 0 6
Machine Learning Fundamentals: Use of Machine Learning, Types of machine learning systems,
machine learning challenges, testing and validating, working with real data, obtaining the data,
visualizing the data, data preparation, training and fine tuning the model.
Syllabus
Module IV: Number of hours (LTP) 9 0 6
Classification, training a binary classifier, performance measures, multiclass classification, error
analysis, multi label classification, multi output classification. Logistic Regression: Classification
using Logistic Regression, Logistic Regression vs. Linear Regression, Logistic Regression with one
Variable and with Multiple Variables
Text Book -1
Text Books and Reference books
1. T2: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.
2. T3: Tom M. Mitchell, “Machine Learning”, First Edition, Tata McGraw-Hill Education.
3. R1: Ethem Alpaydin, “Introduction to Machine Learning”, 2nd Edition, The MIT Press, 2009.
4. R2: Christopher M. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2007.
5. R3: Kevin P. Murphy, “Machine Learning: A Probabilistic Perspective”, The MIT Press, 2012.
Learning theory - Thorndike
Structure-
Learning With Example
X = {0,1,2,3,4,5,6,7,8,9}
Y = {0,2,4,6,8,10,12,14,16,18}
Y = 2X
Learning With Example 2
X = {0,1,2,3,4,5,6,7, 8, 9}
Y = {-1, 1, 3, 5, 7, 9, ?, 13, ?, 17 }
Y = 2X-1
Predicted (Interpolated) Values are : 11,15
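As a minimal sketch of how the two missing values could be recovered, the known pairs can be fitted with an ordinary least-squares line and the line evaluated at X = 6 and X = 8. The helper name fit_line is hypothetical, and this is only one possible way to arrive at the rule Y = 2X - 1; it uses nothing beyond the Python standard library.

```python
# Known (x, y) pairs from the example; x = 6 and x = 8 are the unknowns.
xs = [0, 1, 2, 3, 4, 5, 7, 9]
ys = [-1, 1, 3, 5, 7, 9, 13, 17]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Interpolate the two missing values with the learned rule.
predictions = {x: slope * x + intercept for x in (6, 8)}
print(slope, intercept, predictions)
```

Because the known points lie exactly on one line, the fitted slope and intercept match the rule on the slide up to floating-point rounding.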
Learning With Example 3
X = {0,1,2,3,4,5,6,7, 8, 9}
Y = {0, 1, 2, 0, 1, 2, 0, 1, 2, 0}
Y = X mod 3
Number of classes: 3; class labels: 0, 1, 2
Traditional Approach
Machine Learning Approach
Automatically Adapting to the change
ML can help humans learn
Summarized as -
● Problems for which existing solutions require a lot of hand-tuning or long
lists of rules: one Machine Learning algorithm can often simplify code and
perform better.
Supervised Learning Systems -
A labelled training set; Regression
Supervised Learning Systems -
Some Important Algorithms:
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Un-Supervised Learning Systems -
Some Important Algorithms:
● Clustering
  - K-Means
  - DBSCAN
  - Hierarchical Cluster Analysis (HCA)
● Anomaly Detection
  - One-class SVM
  - Isolation Forest
  - Autoencoders
Un-Supervised Learning Systems -
Some Important Algorithms:
● Dimensionality Reduction: the goal is to compress the data without
losing too much information (one way to do it is to merge highly correlated
features into one)
  - Principal Component Analysis (PCA)
  - t-distributed Stochastic Neighbor Embedding (t-SNE)
  - Autoencoders
  - Kernel PCA
  - Locally Linear Embedding (LLE)
Un-Supervised Learning Systems -
Some Important Algorithms:
● Association rule learning: algorithms that find interesting relations between
attributes
  - Apriori
  - Eclat
Un-Supervised Learning Systems -
Anomaly Detection :
Other Learning Systems -
Semi-Supervised Learning:
- The training data is partially labeled: usually a lot of unlabeled data and a little bit
of labeled data.
Reinforcement Learning:
- The learning system (the agent) selects and performs actions, and gets rewards in return.
- It learns by itself what the best strategy is.
- Example: DeepMind’s AlphaGo program.
Batch & Online Learning -
● In batch learning, the model is incapable of incremental learning.
● It starts by learning from all of the available data offline, and is then
deployed to produce predictions without being fed any new data points.
● For the system to learn about new data, a new version must be trained from
scratch on the full dataset; the old system is then stopped and replaced.
● Batch learning is also called offline learning.
Batch & Online Learning -
● In online learning, the system is trained incrementally by continuously feeding it data instances as they arrive,
either individually or in small groups of instances (mini-batches).
● Online learning is great for systems that receive data in a continuous flow.
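To make the idea of incremental training concrete, here is a minimal sketch of online learning with one stochastic gradient step per incoming instance. The function names, the learning rate, and the synthetic stream (points on y = 2x - 1) are all assumptions made for illustration; a real system would read instances from a live source.

```python
def online_update(weights, x, y, learning_rate=0.01):
    """One stochastic-gradient step on the squared error of y ~ w * x + b."""
    w, b = weights
    error = (w * x + b) - y
    return (w - learning_rate * error * x, b - learning_rate * error)


def data_stream(n_passes=1000):
    """Hypothetical stream yielding (x, y) pairs that follow y = 2x - 1."""
    for _ in range(n_passes):
        for x in range(10):
            yield x, 2 * x - 1


weights = (0.0, 0.0)
for x, y in data_stream():
    # Learn from one instance as it arrives; it does not need to be stored afterwards.
    weights = online_update(weights, x, y)

print(weights)
```

The model never sees the full dataset at once, which is what distinguishes this loop from batch training.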
Instance-Based Vs Model-Based -
● Another way to categorize Machine Learning systems is by how they generalize.
● A good system must be able to generalize to examples it has never seen before.
● There are two main approaches to generalization: instance-based learning and model-based learning.
Instance-Based
● The system learns the examples by heart.
● It generalizes to new cases by comparing them to the learned examples
using a similarity measure.
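A minimal sketch of instance-based learning is a k-nearest-neighbours classifier: it stores the examples and classifies a new case by majority vote among the most similar stored ones. Euclidean distance is assumed as the similarity measure here, and the tiny two-class dataset is made up for illustration.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Similarity measure: smaller distance means more similar."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(training_examples, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples."""
    neighbours = sorted(
        training_examples,
        key=lambda example: euclidean_distance(example[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training examples "learned by heart": (features, label).
training_examples = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

prediction_near_a = knn_predict(training_examples, (1.1, 1.0))
prediction_near_b = knn_predict(training_examples, (5.1, 5.0))
print(prediction_near_a, prediction_near_b)
```

There is no training step beyond storing the examples; all the work happens at prediction time.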
Model-Based
● Build a model from a set of examples.
● Then use that model to make predictions.
Let’s Go to JUPYTER NOTEBOOK
http://localhost:8888/lab/tree/Desktop/handson-ml2-master/handson-ml2-master/01_the_machine_learning_landscape.ipynb
In summary - ML programs
● A typical Machine Learning project looks like this:
• You studied the data.
• You selected a model.
• You trained it on the training data (i.e., the learning algorithm searched for
the model parameter values that minimize a cost function).
• Finally, you applied the model to make predictions on new cases (this is
called inference), hoping that this model will generalize well.
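The training and inference steps above can be sketched as follows, assuming a linear model, a mean-squared-error cost function, and batch gradient descent as the search procedure. The data (points on y = 3x + 2), the learning rate, and the number of steps are illustrative assumptions, not values from the course material.

```python
def mse_cost(w, b, xs, ys):
    """Cost function: mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, n_steps=2000):
    """Search for the parameters (w, b) that minimize the cost by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data following y = 3x + 2.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]

w, b = train(xs, ys)                # training
new_case_prediction = w * 10 + b    # inference on a new case, x = 10
print(w, b, new_case_prediction, mse_cost(w, b, xs, ys))
```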
Challenges of Machine Learning
BAD DATA:
● Insufficient Quantity of Training Data
- The Unreasonable Effectiveness of Data
● Nonrepresentative Training Data
● Poor Quality data - full of errors, outliers, and noise due to poor quality
measurements
- Cleaning up the data
● Irrelevant Features - coming up with a good set of features to train on
● Feature engineering process
- Feature selection: selecting the most useful features,
- Feature extraction: combining existing features,
- Creating new features by gathering new data
Challenges of Machine Learning
BAD algorithms:
● Overfitting: the model performs well on the training data, but it does not
generalize well.
- It happens when the model is too complex relative to the amount and noisiness of the
training data.
● How to Fix:
- Simplify the model by selecting one with fewer parameters (e.g., a
linear model rather than a high-degree polynomial model), by reducing
the number of attributes in the training data, or by constraining the model.
- Gather more training data.
- Reduce the noise in the training data (e.g., fix data errors and remove
outliers).
Challenges of Machine Learning
Regularization:
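- Constraining a model to make it simpler (reducing its degrees of freedom) lowers the risk of overfitting.
- The amount of regularization is controlled by a hyperparameter, which is set before training and must be tuned.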
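As one possible illustration of regularization, the sketch below uses ridge (L2) regularization on a one-parameter model y = w * x, where the closed-form solution is w = sum(x*y) / (sum(x*x) + alpha). The name alpha for the regularization hyperparameter and the small dataset are assumptions chosen for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - w*x)^2) + alpha * w^2 for the model y = w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Hypothetical noisy data, roughly following y = 2x.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

unregularized = ridge_slope(xs, ys, alpha=0)    # plain least squares
regularized = ridge_slope(xs, ys, alpha=10)     # slope is shrunk towards zero
print(unregularized, regularized)
```

A larger alpha constrains the model more strongly; choosing it well is part of hyperparameter tuning.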
Challenges of Machine Learning
- The objective here is to understand underfitting and overfitting, how they degrade model performance, and
the concepts by which they can be properly managed.
- Let us first familiarize ourselves with the following terms.
- Noise: random errors or distortions in the data that obscure the relevant patterns and reduce the performance of the model.
- Bias: error that results from oversimplifying the model. It is usually indicated by high training error and high test
error. An oversimplified model cannot capture the true relationships between input data and output data, either
because it considers too few input features or because training was stopped prematurely. Example:
Linear Regression applied to non-linear data. Thus, high bias leads to underfitting.
- Variance: error that results from a model that is overly sensitive to its particular training data, so that it works well on
the training data but fails badly on test data. It is usually indicated by low training error but high test error; a large
gap between the two is the typical symptom. Example: k-Nearest Neighbors (kNN), Decision Trees. Thus, high variance leads to overfitting.
- Both high bias and high variance lead to high prediction errors. They are shown diagrammatically in the figures on the next
two slides.
Challenges of Machine Learning
- K-Fold Cross-Validation
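A minimal sketch of how K-fold cross-validation partitions the data: the samples are split into k folds, and each fold serves once as the validation set while the remaining folds form the training set. The generator name k_fold_indices is hypothetical, and shuffling is omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

In practice a model would be trained and scored once per fold, and the k scores averaged to estimate the generalization error.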
Challenges of Machine Learning
BAD algorithms:
● Underfitting: occurs when your model is too simple to learn the underlying
structure of the data.
- Reality is just more complex than the model.
- Predictions are bound to be inaccurate, even on the training examples.
● How to Fix:
- Selecting a more powerful model, with more parameters
- Feeding better features to the learning algorithm (feature engineering)
- Reducing the constraints on the model (e.g., reducing the regularization
hyperparameter)
Takeaways -
● Machine Learning is about making machines get better at some task by learning
from data, instead of having to explicitly code rules.
● There are many different types of ML systems: supervised or not, batch or
online, instance-based or model-based, and so on.
● In an ML project you gather data in a training set and feed it to a learning
algorithm.
- If the algorithm is model-based, it tunes some parameters to fit the model to the training set, so that it can make good predictions on new cases.
- If the algorithm is instance-based, it learns the examples by heart and generalizes to new instances using a similarity
measure.
● The system will not perform well if your training set is too small, or if the data is
not representative, is noisy, or is polluted with irrelevant features.
- The model needs to be neither too simple (underfit) nor too complex (overfit).
TESTING and VALIDATION
Training set and testing set - typically split 80 : 20
Training error and generalization error
- Overfitting: training error is low and generalization error is high
- Choice of hyperparameters
- Practical vs real time
- Generalization error is often high in real production scenarios. Why?
Validation - evaluating candidate models on a held-out validation set
Cross-validation - validation on multiple chunks of the training data
Data Mismatch - the validation set and the test set must be as representative as possible of the data expected in production
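The 80 : 20 split can be sketched as below: shuffle the samples with a fixed seed so the split is reproducible, then hold out the first 20% as the test set. The function name, the seed value, and the use of integers as stand-in samples are assumptions for illustration.

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle a copy of the samples and split it into (train_set, test_set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed: same split on every run
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: 100 samples identified by their index.
train_set, test_set = train_test_split(range(100))
print(len(train_set), len(test_set))
```

The test set is put aside and used only once at the end, to estimate the generalization error.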
ML - Definition
● Machine Learning is the field of study that gives computers the ability
to learn without being explicitly programmed —Arthur Samuel, 1959
ML - Project
● Get the data.
● Discover and visualize the data to gain insights.
● Prepare the data for Machine Learning algorithms.
● Select a model and train it.
● Fine-tune your model.
● Present your solution.
● Launch, monitor, and maintain your system.
ML - Project
● Frame the problem
○ What is the business objective?
○ The end goal is not to build a model but to help the organization.
○ The end goal determines how we frame the problem, which algorithms we use, which
performance measures we use, and how much effort we put in.
○ Ex: A Machine Learning pipeline for real estate investments
ML - Project
● Select Performance Measure
- A typical performance measure for regression problems is the Root
Mean Square Error (RMSE).
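For reference, RMSE is the square root of the mean of the squared differences between predictions and true values. A minimal sketch follows; the function name and the three sample values are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between true values and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true values and predictions; the errors are 1, 0 and -2.
example_rmse = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(example_rmse)
```

Because the errors are squared before averaging, large errors weigh more heavily than small ones.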
ML - Project
- Download the data (GitHub, Kaggle, etc.)
- Take a quick look at the data structure (df.head(), df.describe(), df.hist())
- Create a test set - use a stratified split to avoid sampling bias
- Gain insights through visualization - look for correlations
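To illustrate the correlation step, here is a minimal sketch that computes the Pearson correlation coefficient between two attributes with the standard library only. The attribute names and values are made up; with pandas one would normally compute this directly on the DataFrame instead.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes: income and house value for five districts.
income = [1.5, 2.0, 3.5, 5.0, 6.5]
house_value = [100, 150, 230, 310, 420]

correlation = pearson_correlation(income, house_value)
print(correlation)
```

A value close to 1 or -1 indicates a strong linear relationship; a value close to 0 indicates no linear relationship.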
ML - Project
- Prepare the data for ML algorithms
- Write functions instead of doing this manually:
  - This will allow you to reproduce these transformations easily on any dataset.
  - You will gradually build a library of transformation functions that you can reuse in
future projects.
  - You can use these functions in your live system to transform the new data before
feeding it to your algorithms.
  - This will make it possible for you to easily try various transformations and see
which combination of transformations works best.
ML - Project
- Data Cleaning
- Three options for handling missing values:
  - Get rid of the corresponding rows
  - Get rid of the whole attribute
  - Set the missing values to some value (mean, median, zero, etc.)
- Categorical attributes - Ordinal encoding, One-Hot encoding
- Feature Scaling - Min-Max scaling (Normalization), Standardization
- Transformation Pipelines
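The preparation steps above can be sketched as small reusable functions. This is a standard-library illustration under assumed inputs (None marks a missing value, and the column names are hypothetical); in the course notebook the same steps would typically be done with library transformers.

```python
import statistics


def impute_median(values):
    """Replace missing values (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


def one_hot(categories):
    """One-hot encode a categorical column; columns follow sorted label order."""
    labels = sorted(set(categories))
    return [[1 if c == label else 0 for label in labels] for c in categories]


def min_max_scale(values):
    """Min-max scaling (normalization): rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical columns: number of rooms (one value missing) and a location category.
rooms = impute_median([3, None, 5, 4])
encoded = one_hot(["inland", "coast", "inland"])
scaled = min_max_scale(rooms)
standardized = standardize(rooms)
print(rooms, encoded, scaled, standardized)
```

Chaining such functions in a fixed order is the idea behind a transformation pipeline.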
ML - Project
- Training Model
- Sample training with cross validation