Introduction To Machine Learning and Python

This document provides an introduction to machine learning. It discusses how much data is created every day from sources like Google searches, social media posts, and IoT devices. It then gives an overview of machine learning, including common algorithms for classification, regression, unsupervised learning, and reinforcement learning. Some key concepts discussed are bias-variance tradeoff, curse of dimensionality, and splitting data into training, validation, and test sets. The document aims to explain core ideas in machine learning at a high level.


Introduction to Machine Learning
Sourav Nandi
ABOUT.ME/SOURAV.NANDI
Symphony AI, IIT Kanpur
Social Handles- @souravstat
1
How much Data is Created Every Day?
 ‘Google’ has become a Verb! (3.5 billion search queries every day)
 2.5 quintillion bytes of data are produced by us every day. (18 Zeroes!)
 ~90% of the World’s Data was created in the last 2 years – an Accelerating Pace
 Every Day-
➢ ~250 billion emails are sent (45% are Spam- Hit the Unsubscribe button!)
➢ 100+ million photos and videos are shared on Instagram
➢ ~500 million Tweets are made (~45% of Covid-19 tweets estimated to be Bot-Generated!)
 By the end of 2020, ~31 Billion IoT devices were in use. The estimated size of the
entire digital universe will be a whopping 44 zettabytes (21 Zeroes in a ZB!)
 What can we do with all this Data?
2

Source- Statista, IEEE, TechJury, ILS, Raconteur, NPR, Sendpulse


Machine Learning- The Art and Science of
Learning from Data
 We are drowning in Information and starving for
Knowledge — John Naisbitt (Author of ‘Megatrends’)
 Is Learning Possible?
 Generalization/ Pattern Recognition (Easy) vs
Extrapolation/ Finding Higher Dimensional Insights (Hard)

3
Timeline of Machine Learning (Wikipedia)

4
The ML
Family Tree

Image Credit: Vas3k, https://noeliagorod.com/


Classification- Split into Categories
 Usage: Fraud Detection (Online Transaction), Spam Filtering (Email),
Sentiment Analysis (+ve/-ve/neutral), Handwriting Recognition (MNIST) etc.
 Overview of Popular Algorithms
➢ Logistic Regression (GLM* with Logit link), Multinomial Logit
➢ Decision Tree, Random Forest, Bagging, Boosting
➢ Naïve Bayes (Conditional Independence
b/w Features, Given a Category)

* Allows Response (Dependent) Variables to have error distribution models other than a normal distribution
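The logit link mentioned above can be made concrete with a small sketch: logistic regression fit by plain stochastic gradient descent on made-up toy data (my own pure-Python illustration, not code from the slides; in practice a library such as scikit-learn would be used):

```python
import math

def sigmoid(z):
    # Inverse of the logit link: maps a linear score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    # Stochastic gradient descent on the log-loss; w[0] is the intercept.
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi                  # gradient of log-loss w.r.t. the score
            w[0] -= lr * err
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * err * xj
    return w

# Toy 1-D data: class 1 when the feature is large.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
predict = lambda x: sigmoid(w[0] + w[1] * x)
```

The fitted weight on the feature comes out positive, so the predicted probability rises with the feature value, crossing 0.5 near the class boundary.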
Classification (Contd.)
 KNN (K-Nearest Neighbors) Algorithm
(Idea: Find Closest K Neighbors)
 Distance Measures: Euclidean distance,
Mahalanobis distance, Manhattan distance,
Cosine Distance etc
 Support Vector Machine and Kernel Trick

Image Credit: Towards data science
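The KNN idea above fits in a few lines: compute a distance from the query to every training point, keep the K closest, and let them vote (a minimal sketch with toy data of my own, shown with two of the distance measures listed):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, dist=euclidean):
    # train is a list of (point, label) pairs; vote among the k closest points.
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
```

A query near the first cluster is labeled "A" whichever distance is plugged in; swapping the `dist` argument is all it takes to change the metric.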


Regression
 Multiple Linear Regression
(Discussed in earlier classes in
detail)
 Ordinary Least Squares Method:
Computes the unique line (or
hyperplane) that minimizes the sum
of squared distances (usually
vertical) between the observed data
and that line
 Ridge Regression and LASSO
(reducing model complexity to
prevent overfitting, Variable
Selection)

Image Credit: ISLR book (Ref 3)
8
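For simple (one-feature) linear regression, the OLS minimizer described above has a closed form, slope = Sxy / Sxx; a minimal sketch on made-up data:

```python
def ols_line(xs, ys):
    # Slope and intercept minimizing the sum of squared vertical distances.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx     # line passes through the mean point

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]       # roughly y = 2x
slope, intercept = ols_line(xs, ys)
```

Ridge and LASSO modify this objective by adding a penalty on the size of the coefficients, which is what shrinks the model toward lower complexity.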


Unsupervised Learning
 Market Segmentation (Clustering),
Anomaly Detection, Image Compression
(Dimensionality Reduction) etc
 K-means Clustering
 Principal Component Analysis
(Projection into Lower Dimensional
Space, Summarizing Information)

Image Credit: Vas3k
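K-means alternates two steps — assign each point to its nearest center, then move each center to the mean of its cluster. A naive sketch (my own toy example; real use would add random restarts and a convergence check, or call a library):

```python
def kmeans(points, k, iters=20):
    # Lloyd's algorithm, naively seeded with the first k points.
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: nearest center by squared Euclidean distance.
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # Update step: move center to the cluster mean.
                centers[i] = [sum(dim) / len(cl) for dim in zip(*cl)]
    return centers

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = kmeans(points, 2)
```

On these two well-separated blobs the centers settle near (0.33, 0.33) and (9.33, 9.33).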
Unsupervised Learning
 Hierarchical Clustering
 Distance between Clusters
➢ Single Linkage (Min)
➢ Complete Linkage (Max)
➢ Average Linkage (All Pairs)
➢ Centroid Linkage
 Association Rule Mining (Looking
for patterns, eg, analyzing Shopping
behavior, Marketing Strategy)

10
Image Credit: saedsayad.com
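The linkage choices listed above differ only in how a single number is extracted from the pairwise distances between two clusters — a short sketch with toy clusters of my own:

```python
import math

def pairwise(A, B):
    # All cross-cluster distances between clusters A and B.
    return [math.dist(a, b) for a in A for b in B]

def single_linkage(A, B):
    # Distance between the closest pair (min).
    return min(pairwise(A, B))

def complete_linkage(A, B):
    # Distance between the farthest pair (max).
    return max(pairwise(A, B))

def average_linkage(A, B):
    # Mean distance over all cross-cluster pairs.
    d = pairwise(A, B)
    return sum(d) / len(d)

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
```

By construction, single linkage can never exceed average linkage, which can never exceed complete linkage for the same pair of clusters.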
Reinforcement Learning
 Model is trained by having an
Agent interact with environment
 Desired Action gets Rewarded
 “Good Behaviors are Reinforced”
 One of Three fundamental ML
Paradigms (along with Supervised
learning and Unsupervised
learning)
 When in an active Environment,
like Video Games (Super Mario!),
Self-Driving Cars etc
 Goal is to Minimize Error (it may be
difficult to predict all possible moves)

Image Credit: TWIML Online
11
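The reward-driven loop above can be sketched as tabular Q-learning on a toy corridor environment (my own minimal example, not from the slides; states 0..4, the agent gets reward 1 for reaching state 4, so moving right is the behavior that gets reinforced):

```python
import random

N, GOAL = 5, 4
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}   # action-value table
alpha, gamma, eps = 0.5, 0.9, 0.2                        # step size, discount, exploration
random.seed(0)

for _ in range(500):                                     # episodes
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection: explore sometimes, else act greedily.
        if random.random() < eps:
            a = random.choice((-1, +1))
        else:
            a = max((-1, +1), key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N - 1)                   # environment transition
        r = 1.0 if s2 == GOAL else 0.0                   # reward only at the goal
        # Q-learning update toward reward plus discounted best next value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, -1)], Q[(s2, +1)]) - Q[(s, a)])
        s = s2

policy = {s: max((-1, +1), key=lambda act: Q[(s, act)]) for s in range(GOAL)}
```

After training, the greedy policy moves right from every state, and the learned values decay by the discount factor with distance from the goal.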
Diving Deeper into some Core Ideas

12
Correlation does not
Imply Causation

Image Credit: xkcd

13
Bias-Variance Tradeoff:
Underfitting vs Overfitting

14
Curse of Dimensionality
 Problems in High Dimension due to Data Sparsity
 Adding each new dimension (i.e., adding a feature)
increases the amount of data required exponentially
 Separation of Wind Turbines- 2D vs 3D view

Image Credit: deepai.org
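The exponential data requirement above is easy to see with a grid argument: to keep sampled points within a fixed spacing of each other in the unit hypercube, the number of points needed is (points per axis)^d (a small illustration of my own):

```python
def points_needed(d, spacing=0.1):
    # Grid points required to cover [0, 1]^d at the given spacing per axis.
    per_axis = int(1 / spacing)
    return per_axis ** d

for d in (1, 2, 3, 10):
    print(d, points_needed(d))   # 10, 100, 1000, ... grows exponentially in d
```

At spacing 0.1, one dimension needs 10 points but ten dimensions already need ten billion — the sparsity that makes distance-based methods struggle in high dimension.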

15
Comparison of
some Popular
Algorithms

Table Credit: dataiku.com

16
Things to keep in mind
 Splitting the Dataset:
➢ Training Set (Data Sample used to Fit the Model, to get the Parameters)
➢ Validation Set (Tuning Hyperparameters to choose final model)
➢ Test Set (To evaluate the final model, should not be used for training)
 Other Common Pitfalls in ML (Violation of Assumptions):
➢ Non-Linearity (Plotting Residuals against Fitted Values, Non-linear Transformation)
➢ High Leverage Points (Cook’s Distance Plot)
➢ Correlation of Error Terms (eg, Time Series data)- Controlled Experiment
➢ Heteroscedasticity (Non-constant Variance of Error Term)
➢ Multicollinearity (Correlated Predictors, eg. Dummy Variable Trap)
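The three-way split listed above can be sketched in a few lines — shuffle once with a fixed seed, then carve off the partitions (a minimal illustration of my own; library helpers such as scikit-learn's `train_test_split` are the usual choice):

```python
import random

def split_dataset(data, train_frac=0.6, val_frac=0.2, seed=42):
    # Shuffle once, then slice into train / validation / test partitions.
    # The test partition should be touched only for the final evaluation.
    data = list(data)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = split_dataset(range(100))
```

Fixing the seed makes the split reproducible, and the three parts are disjoint by construction.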
17
That’s All, Friends!

STAT&ML Lab is a non-profit
organization to bring young minds
into research projects on
Statistics & Machine Learning.

We aim to provide training and
research projects on Statistics,
Data Science, and ML.

The primary goal of this lab is to
promote research in Statistics in
India and throughout the world.

18
Thanks & References
1) Special Thanks to All the Participants, BKC College, WBSU and STAT & ML Lab:
https://www.ctanujit.org/statml-lab.html
2) Image Credit: Wikipedia, Reddit, SlideShare, me.me, Imgflip, xkcd
3) http://faculty.marshall.usc.edu/gareth-james/ISL/ (ISLR Book)
4) https://developers.google.com/machine-learning/guides/good-data-analysis
5) https://hackernoon.com/choosing-the-right-machine-learning-algorithm-68126944ce1f
6) https://en.wikipedia.org/wiki/Reinforcement_learning
7) Download Latest Version Of this PPT (& other Materials):
https://github.com/souravstat/
8) Please feel free to reach out to me anytime for a discussion:
https://about.me/sourav.nandi , [email protected]

19
