1 Introduction To ML
Machine Learning
Ketan Rajawat
IIT Kanpur
What is ML?
• Humans: learn from experience
• I love mangoes
• Too busy to pick out the best ones myself
• Design a robot that can
• learn my taste preferences and
• choose tasty mangoes for me
• Key Idea: Robot should predict whether a mango will taste good
based on easily measurable features
Features
• Potential features
• Variety (type of mango)
• Ripeness level
• Color
• Season of harvest
[Figure: decision tree classifier; example branch: firm with yellowish color]
• Goal: Predict the taste of a new mango (without actually tasting it)
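As a minimal sketch of this idea, the snippet below trains a scikit-learn decision tree on a few mangoes and predicts the taste of a new one. The two binary features (firmness and yellowness), the six training rows, and the rule that only firm, yellowish mangoes are tasty are all assumptions invented for illustration; they are not data from the lecture.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, invented for illustration only.
# Each row: [firmness (0 = soft, 1 = firm), yellowness (0 = green, 1 = yellowish)]
X_train = [
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
    [1, 1],
    [0, 0],
]
# Labels: 1 = tasty, 0 = not tasty (assumed: only firm, yellowish mangoes are tasty).
y_train = [1, 0, 0, 0, 1, 0]

# Learn the taste preferences from the labeled examples.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predict the taste of a new mango without tasting it.
new_mango = [[1, 1]]  # firm and yellowish
prediction = clf.predict(new_mango)[0]
print("tasty" if prediction == 1 else "not tasty")
```

With real mangoes the features would be measured quantities such as variety, ripeness level, color, and season of harvest, encoded as numbers.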
How to improve accuracy?
• More data
• larger training sets usually help
• Unless data is poor quality (e.g. labeling errors)
• More features
• Sometimes (e.g. image and speech recognition)
• But not necessarily (e.g. day of the week, medical diagnostics)
• Reinforcement Learning
•…
Supervised Learning
https://www.javatpoint.com/supervised-machine-learning
Examples of Supervised Learning
Unsupervised Learning
• Depends on dataset
Classification vs. Regression
• Label is either categorical or numerical
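A small sketch of the distinction, assuming scikit-learn and a made-up one-feature dataset: the same inputs are paired once with categorical labels (classification) and once with numerical labels (regression). The feature values and labels are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical single-feature inputs, invented for illustration.
X = [[1], [2], [3], [4]]

# Categorical labels -> classification.
y_categorical = ["low", "low", "high", "high"]
# Numerical labels -> regression.
y_numerical = [1.0, 2.0, 3.0, 4.0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y_categorical)
reg = DecisionTreeRegressor(random_state=0).fit(X, y_numerical)

class_pred = clf.predict([[4]])[0]  # a category
value_pred = reg.predict([[4]])[0]  # a number
print(class_pred, value_pred)
```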
[Figure: dataset table annotated with "Labels", "features", and "Rows are single data point"]
https://www.kaggle.com/datasets/mathchi/diabetes-data-set
How are models trained?
• Start with a basic (untrained) model, e.g. one that always outputs 1
• Input $x_i$, e.g. the row 2 115 0 0 0 35.3 0.134 29
• Model output: $\hat{y}_i = f(x_i)$, here 1
• Actual label: $y_i$, here 0
• Loss: $\ell_i = (\hat{y}_i - y_i)^2$
• Optimization: fine-tune the model to reduce the loss
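The loss computation for the example row can be sketched in a few lines of Python. The function names `model` and `squared_loss` are hypothetical helpers chosen for this illustration; the input row, the constant output 1, and the label 0 are the values shown on the slide.

```python
def model(x):
    """Untrained baseline: ignores the input and always outputs 1."""
    return 1


def squared_loss(y_hat, y):
    """Squared error between the model output and the actual label."""
    return (y_hat - y) ** 2


x_i = [2, 115, 0, 0, 0, 35.3, 0.134, 29]  # feature row from the slide
y_i = 0                                   # actual label

y_hat_i = model(x_i)
loss_i = squared_loss(y_hat_i, y_i)
print(loss_i)  # 1
```

Training then adjusts the model so that the loss, summed or averaged over all training rows, becomes smaller.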
• Would our model predict well over new & unseen data?
• How to assess the ability of our model to generalize to new data?
Training and testing
• Learn model over training data
(minimize training loss)
• Calculate loss over test data
• Which model is better?
Model           A     B     C     D
Training loss   0.9   0.8   1     0.001
Test loss       0.8   0.9   0.7   0.91
Best model: C (lowest test loss)
• Can use other loss functions or other metrics like accuracy
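The comparison above can be sketched in Python, assuming that the preferred model is the one with the lowest test loss. The loss values are the ones for models A to D; the variable names are chosen here for illustration.

```python
# Training and test losses for models A to D.
training_loss = {"A": 0.9, "B": 0.8, "C": 1.0, "D": 0.001}
test_loss = {"A": 0.8, "B": 0.9, "C": 0.7, "D": 0.91}

# Pick the model that does best on data it was not trained on.
best_model = min(test_loss, key=test_loss.get)

# A large gap between test and training loss suggests overfitting.
gap = {name: test_loss[name] - training_loss[name] for name in test_loss}
most_overfit = max(gap, key=gap.get)

print(best_model, most_overfit)  # C D
```

Model D has by far the lowest training loss but the highest test loss, which is why training loss alone is a poor guide for choosing a model.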
https://www.linkedin.com/pulse/traintest-split-versus-cross-fold-validation-william-monroe/
Working example
• Predict if house price is high or low
• The last column contains the house price, normalized to the range 1 to 5.
We consider a price >= 2 to be high and a price < 2 to be low.
https://colab.research.google.com/drive/1ge4xx5A6n3orwM0pBLbquUsceLYHUxBH?usp=sharing
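The thresholding step can be sketched with NumPy as follows. The price values below are hypothetical and are not taken from the dataset used in the notebook; only the threshold of 2 comes from the example.

```python
import numpy as np

# Hypothetical normalized prices in the range 1 to 5 (not from the actual dataset).
prices = np.array([1.0, 1.9, 2.0, 3.5, 5.0])

# Binary label: 1 = high (price >= 2), 0 = low (price < 2).
labels = (prices >= 2).astype(int)
print(labels.tolist())  # [0, 0, 1, 1, 1]
```

This turns the numerical price into a categorical label, so the task becomes classification rather than regression.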
Try out other models
Model                                                                                   Training accuracy (%)   Test accuracy (%)
model = LogisticRegression()                                                            80.6                    81.6
model = SVC(kernel='linear', C=1.0)                                                     83.3                    83.8
model = DecisionTreeClassifier()                                                        100                     84.4
model = RandomForestClassifier()                                                        100                     89
model = KNeighborsClassifier()                                                          77.5                    62.5
model = GaussianNB()                                                                    76.4                    76.7
model = GradientBoostingClassifier()                                                    89                      88.2
model = MLPClassifier()                                                                 76                      76
model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)  # L1 reg.  83.2                    83.9
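A comparison of this kind can be sketched as a loop over scikit-learn models. The snippet below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the house price dataset, an assumed 80/20 train/test split, and only four of the models, so its accuracies will not match the numbers in the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in the notebook).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # learn on training data only
    train_acc = 100 * model.score(X_train, y_train)   # accuracy in percent
    test_acc = 100 * model.score(X_test, y_test)
    results[name] = (train_acc, test_acc)
    print(f"{name}: train {train_acc:.1f}%, test {test_acc:.1f}%")
```

An unconstrained decision tree typically reaches 100% training accuracy, as in the table, so the test accuracy is the number to compare across models.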
Homework
• Study more about decision tree classifier
https://www.youtube.com/watch?v=ZVR2Way4nwQ
https://scikit-learn.org/stable/modules/tree.html
• Book on basics of ML
https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf