Introduction To Machine Learning and Python
Sourav Nandi
ABOUT.ME/SOURAV.NANDI
Symphony AI, IIT Kanpur
Social Handles- @souravstat
How much Data is Created Every Day?
‘Google’ has become a Verb! (3.5 billion search queries every day)
2.5 quintillion bytes of data are produced by us every day. (18 Zeroes!)
~90% of the World’s Data was created in the last 2 years (Accelerating Pace)
Every Day-
➢ ~250 billion emails are sent (45% are Spam- Hit the Unsubscribe button!)
➢ 100+ million photos and videos are shared on Instagram
➢ ~500 million Tweets are made (~45% of Covid-19 tweets estimated to be Bot-Generated!)
By the end of 2020, ~31 Billion IoT devices were expected to be in use. The estimated size of the entire digital universe will be a whopping 44 zettabytes (21 Zeroes in a ZB!)
What to do with all this Data?
Timeline of Machine Learning (Wikipedia)
The ML Family Tree
* Allows Response (Dependent) Variables to have error distribution models other than a normal distribution
Classification (Contd.)
KNN (K-Nearest Neighbors) Algorithm
(Idea: Find Closest K Neighbors)
Distance Measures: Euclidean distance, Mahalanobis distance, Manhattan distance, Cosine Distance etc.
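The KNN idea above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the toy points, the labels "A"/"B", and the function names (`knn_predict`, `euclidean`, `manhattan`, `cosine_distance`) are assumptions made for this example, and Mahalanobis distance is left out because it needs a covariance matrix.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; undefined for all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(points, labels, (1.1, 1.0), k=3)
print(prediction)
```

Swapping the `distance` argument changes what "closest" means, which is often the most important modelling choice in KNN.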
Support Vector Machine and Kernel Trick
Image Credit: Vas3K
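One way to see the kernel trick: a kernel returns the dot product in a higher-dimensional feature space without ever building that space. The sketch below checks this for the degree-2 polynomial kernel on 2D points. The feature map `phi` and the sample points are assumptions chosen for illustration, not something taken from the slide.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2D point (maps 2D -> 3D)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # dot product computed in the 3D feature space
implicit = poly_kernel(x, y)     # same value computed directly in 2D
print(explicit, implicit)
```

An SVM only needs these dot products, so it can separate data that is not linearly separable in the original space while working with the cheap 2D computation.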
Unsupervised Learning
Hierarchical Clustering
Distance between Clusters
➢ Single Linkage (Min)
➢ Complete Linkage (Max)
➢ Average Linkage (All Pairs)
➢ Centroid Linkage
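The four linkage rules differ only in how the distance between two clusters is summarized. A minimal sketch, assuming Euclidean distance between points and two small made-up clusters:

```python
import math
from itertools import product

def pairwise_distances(cluster_1, cluster_2):
    """All point-to-point Euclidean distances across the two clusters."""
    return [math.dist(p, q) for p, q in product(cluster_1, cluster_2)]

def single_linkage(c1, c2):
    return min(pairwise_distances(c1, c2))       # closest pair

def complete_linkage(c1, c2):
    return max(pairwise_distances(c1, c2))       # farthest pair

def average_linkage(c1, c2):
    d = pairwise_distances(c1, c2)
    return sum(d) / len(d)                       # mean over all pairs

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(c1, c2):
    return math.dist(centroid(c1), centroid(c2))  # distance between cluster means

# Hypothetical clusters for illustration.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]
for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, rule(cluster_a, cluster_b))
```

Hierarchical clustering repeatedly merges the two clusters with the smallest linkage distance, so the choice of rule shapes the resulting dendrogram.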
Association Rule Mining (Looking for patterns, e.g., analyzing Shopping behavior, Marketing Strategy)
Image Credit: saedsayad.com
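Association rules are usually scored with support, confidence and lift. The sketch below computes all three for one rule on a made-up set of shopping baskets; the transactions and item names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
print(lift(baskets, {"bread"}, {"milk"}))
```

A lift above 1 suggests the items are bought together more often than chance would predict; a lift below 1 suggests the opposite.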
Reinforcement Learning
Model is trained by having an Agent interact with an Environment
Desired Actions get Rewarded: “Good Behaviors are Reinforced”
One of Three fundamental ML Paradigms (along with Supervised learning and Unsupervised learning)
Used in active Environments, like Video Games (Super Mario!), Self-Driving Cars etc.
Goal is to Maximize cumulative Reward, i.e., Minimize Error (it may be difficult to predict all possible moves)
Credit: TWIML Online
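The agent, environment and reward loop can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the slide does not name a specific one). The corridor environment, the constants and all names below are assumptions made up for this example.

```python
import random

N_STATES = 5                 # states 0..4; reaching state 4 gives the reward
ACTIONS = (-1, +1)           # index 0 = move left, index 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

def step(state, action):
    """Hypothetical environment: a corridor with walls at both ends."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-value per (state, action)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability EPSILON (or on ties), otherwise act greedily.
            if rng.random() < EPSILON or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q_table = train()
greedy_policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(greedy_policy)
```

The only feedback the agent ever receives is the reward at the goal; the preference for moving right in every state is learned from that signal alone.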
Diving Deeper into some Core Ideas
Correlation does not Imply Causation
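A small simulation can make this concrete: two quantities that never influence each other still correlate strongly when both depend on a hidden third variable (a confounder). The variable names, the noise levels and the seed are assumptions chosen for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)
# Hidden confounder: temperature drives both quantities below.
temperature = [rng.gauss(0, 1) for _ in range(2000)]
ice_cream_sales = [t + rng.gauss(0, 0.5) for t in temperature]
sunburn_cases = [t + rng.gauss(0, 0.5) for t in temperature]

raw_corr = pearson(ice_cream_sales, sunburn_cases)

# Remove the confounder's contribution (known exactly in this simulation).
sales_resid = [s - t for s, t in zip(ice_cream_sales, temperature)]
burn_resid = [b - t for b, t in zip(sunburn_cases, temperature)]
adjusted_corr = pearson(sales_resid, burn_resid)

print(raw_corr, adjusted_corr)
```

The raw correlation is high even though neither variable causes the other; once the confounder is accounted for, the association largely disappears.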
Bias-Variance Tradeoff:
Underfitting vs Overfitting
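Underfitting and overfitting can be illustrated with k-nearest-neighbor regression, where k controls model flexibility. This is a sketch under assumptions: the noisy sine data, the sample sizes and the values of k are made up for this example.

```python
import math
import random

def knn_regress(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN predictor on `data`."""
    return sum((knn_regress(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    """Hypothetical data: noisy observations of sin(x) on [0, 2*pi]."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

train_set, test_set = sample(60), sample(60)

# k=1 memorizes the training data (low bias, high variance);
# k=60 always predicts the global mean (high bias, low variance).
results = {k: (mse(train_set, train_set, k), mse(train_set, test_set, k))
           for k in (1, 7, 60)}
for k, (train_err, test_err) in results.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k=1 the training error is zero because each point is its own nearest neighbor, which says nothing about error on new data; comparing the train and test columns shows where each setting sits on the tradeoff.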
Curse of Dimensionality
Problems in High Dimension due to Data Sparsity
Adding each new dimension (i.e., adding a feature) increases the data set requirement exponentially
Separation of Wind Turbines- 2D vs 3D view
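The exponential growth can be checked with simple arithmetic. The sketch below uses two standard illustrations, both under the assumption that features are spread uniformly over a unit hypercube: the number of grid cells to cover, and the edge length of a sub-cube that captures a fixed fraction of the data. The function names are made up for this example.

```python
def cells_needed(bins_per_axis, n_dims):
    """Number of grid cells when each feature is cut into the same number of bins."""
    return bins_per_axis ** n_dims

def edge_for_fraction(fraction, n_dims):
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / n_dims)

for d in (1, 2, 3, 10):
    print(f"dims={d:2d}  cells={cells_needed(10, d):>12,}  "
          f"edge for 10% of data={edge_for_fraction(0.1, d):.3f}")
```

In 10 dimensions a "local" neighborhood holding 10% of the data must span most of each axis, which is why distance-based methods degrade as features are added.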
Comparison of some Popular Algorithms
Things to keep in mind
Splitting the Dataset:
➢ Training Set (Data Sample used to Fit the Model, to get the Parameters)
➢ Validation Set (Tuning Hyperparameters to choose final model)
➢ Test Set (To evaluate the final model, should not be used for training)
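A three-way split can be sketched as below. The 60/20/20 proportions, the function name `split_dataset` and the fixed seed are assumptions for illustration; the right proportions depend on how much data is available.

```python
import random

def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation and training sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 row ids.
train_rows, val_rows, test_rows = split_dataset(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```

Shuffling before splitting assumes the rows are independent; for time series data a chronological split is usually the safer choice.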
Other Common Pitfalls in ML (Violation of Assumptions):
➢ Non-Linearity (Plotting Residuals against Fitted Values, Non-linear Transformation)
➢ High Leverage Points (Cook’s Distance Plot)
➢ Correlation of Error Terms (eg, Time Series data)- Controlled Experiment
➢ Heteroscedasticity (Non-constant Variance of Error Term)
➢ Multicollinearity (Correlated Predictors, eg. Dummy Variable Trap)
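The Dummy Variable Trap mentioned in the last item can be shown directly: if every category gets its own 0/1 column, the columns always sum to 1 and so duplicate the intercept. A minimal sketch, with a made-up `one_hot` helper and a made-up color feature:

```python
def one_hot(values, drop_first=False):
    """Encode categories as 0/1 columns, optionally dropping the first category."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]   # the dropped category becomes the baseline
    return [[1 if value == category else 0 for category in categories]
            for value in values]

colors = ["red", "green", "blue", "green"]   # hypothetical categorical feature

full = one_hot(colors)                       # columns: blue, green, red
reduced = one_hot(colors, drop_first=True)   # columns: green, red

# Every row of the full encoding sums to 1, exactly like an intercept column,
# so intercept + full encoding is perfectly multicollinear.
print([sum(row) for row in full])
print(reduced)
```

Dropping one category removes the exact linear dependence; the dropped level is then represented by a row of all zeros.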
That’s All, Friends!
Thanks & References
1) Special Thanks to All the Participants, BKC College, WBSU and STAT & ML Lab:
https://www.ctanujit.org/statml-lab.html
2) Image Credit: Wikipedia, Reddit, SlideShare, me.me, Imgflip, xkcd
3) http://faculty.marshall.usc.edu/gareth-james/ISL/ (ISLR Book)
4) https://developers.google.com/machine-learning/guides/good-data-analysis
5) https://hackernoon.com/choosing-the-right-machine-learning-algorithm-68126944ce1f
6) https://en.wikipedia.org/wiki/Reinforcement_learning
7) Download Latest Version Of this PPT (& other Materials):
https://github.com/souravstat/
8) Please feel free to reach out to me anytime for a discussion:
https://about.me/sourav.nandi , [email protected]