Intro To ML

The document provides an overview of machine learning (ML), detailing its types such as classification and regression, and the essential steps in the ML process including data mining, pre-processing, and model training. It emphasizes the importance of data cleaning, scaling, and selecting appropriate algorithms while addressing the concepts of overfitting and underfitting. Additionally, it recommends resources for further learning in ML, including books and online platforms.

Uploaded by

Dion Wisent
Copyright: © All Rights Reserved
Lunch and Mlearn

Dexter Fichuk
https://goo.gl/VaWHrb (content here)
https://www.continuum.io/downloads

http://tiny.cc/conda
What is ML?
Types of ML

Classification
Classification can tell if something is true or false (1 or 0). It could be classifying a picture as a cat or dog, or classifying whether something is a square, triangle or circle.

Regression
Predicting a value based on the input. It could be predicting a credit score, the temperature, stocks, or anything where there are continuous output options (e.g. 2.4893, 1.00049, 59.23).
The Flow
Data Mining: Collecting a Dataset
Mostly doing supervised learning here, meaning that our training set already has outcome labels.
Could also involve creating simulation datasets (transactions, etc.)

Pre-Processing: Cleaning the Data
Detecting the values/features (columns) that matter, removing ones that don’t.
Normalizing/Scaling data.
Sometimes plotting.

Training/Evaluating: Building a Complete Model
Involves testing different algorithms/hyperparameters to find the highest accuracy for the dataset.
Data Mining
● Try searching sites like Kaggle, open data government sites, and the UCI Machine Learning Repository besides Google.
● If simulating the data, make sure to research reasonable ranges and occurrences of different cases.
Pre-Processing
Cleaning your dataset

Involves tasks such as:
● Removing irrelevant features
● Deciding what to do with null entries (replace with column avg., remove row, etc.)
● Scaling inputs and transforming text fields to numerical representations.
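As a rough illustration of the null-entry bullet, the sketch below shows both options on a tiny made-up table: filling a missing entry with the column average, or removing the row. The data and the helper names `fill_with_column_mean` and `drop_rows_with_nulls` are invented for this example; a library such as pandas would usually do this work in practice.

```python
from statistics import mean


def fill_with_column_mean(rows):
    """Replace None entries with the mean of the non-null values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(present))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


def drop_rows_with_nulls(rows):
    """Keep only the rows that have no missing entries."""
    return [row for row in rows if None not in row]


# Hypothetical toy data: two feature columns, None marks a null entry.
rows = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
]

filled = fill_with_column_mean(rows)
dropped = drop_rows_with_nulls(rows)

print(filled)
print(dropped)
```

Filling keeps every row but invents values; dropping keeps only real values but shrinks the dataset. Which is better depends on how much data is missing.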
Data Variables: X (input data), y (output label)

Representing Data
Each row of X is one sample, each column of X is one feature, and y holds the outputs / labels.

X =                          y =
1.1  2.2  3.4  5.6  1.0      1.6
6.7  0.5  0.4  2.6  1.6      2.7
2.4  9.3  7.3  6.4  2.8      4.4
1.5  0.0  4.3  8.3  3.4      0.5
0.5  3.5  8.1  3.6  4.6      0.2
5.1  9.7  3.5  7.9  5.1      5.6
3.7  7.8  2.6  3.2  6.3      6.7
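The same layout can be sketched in plain Python, with X as a list of rows and y as a list of labels. The numbers are the slide's values as best they can be read from the figure; the variable names are only illustrative.

```python
# X: one inner list per sample (row), one position per feature (column).
X = [
    [1.1, 2.2, 3.4, 5.6, 1.0],
    [6.7, 0.5, 0.4, 2.6, 1.6],
    [2.4, 9.3, 7.3, 6.4, 2.8],
    [1.5, 0.0, 4.3, 8.3, 3.4],
    [0.5, 3.5, 8.1, 3.6, 4.6],
    [5.1, 9.7, 3.5, 7.9, 5.1],
    [3.7, 7.8, 2.6, 3.2, 6.3],
]
# y: one output label per sample, in the same order as the rows of X.
y = [1.6, 2.7, 4.4, 0.5, 0.2, 5.6, 6.7]

n_samples, n_features = len(X), len(X[0])
one_sample = X[0]                    # the first row
one_feature = [row[0] for row in X]  # the first column

print(n_samples, n_features)
print(one_sample)
print(one_feature)
```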


Categorical Variables

“red”  “green”  “blue”
1      0        0
0      1        0
0      0        1

If you mapped {red->0, green->1, blue->2}, a linear relationship (an ordering) would be imposed between the values. It is therefore better to perform a categorical transformation on text fields that are options, rather than ratings.

A field such as 5-star ratings (1 to 5) could instead be scaled as 0, 0.25, 0.5, 0.75, and 1.
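A minimal sketch of the categorical transformation shown in the table, often called one-hot encoding. The helper `one_hot_encode` is written for this example only; scikit-learn offers a `OneHotEncoder` class for real use.

```python
def one_hot_encode(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return encoded


# Column order matches the table: "red", "green", "blue".
categories = ["red", "green", "blue"]
encoded = one_hot_encode(["red", "green", "blue", "green"], categories)

for row in encoded:
    print(row)
```

Each colour gets its own column, so no colour is treated as larger or smaller than another.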
Scaling Inputs
Movie Reviews (/5)

Before Scaling   After Scaling
1                .2
3                .6
5                1
2                .4

A field such as 5-star ratings could be scaled by dividing by the maximum of 5, giving 0, 0.2, 0.4, 0.6, 0.8 and 1.

Whether an input should be scaled is largely dependent on the learning algorithm you’re selecting.

Scaling is great for algorithms such as Neural Networks and SVMs.
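The table above divides each review by the maximum score. The sketch below reproduces that and, for comparison, shows min-max scaling, which stretches the observed range to exactly 0 to 1. Both helper names are made up for this example.

```python
def scale_by_max(values, max_value):
    """Divide every value by a known maximum (here: 5 stars)."""
    return [v / max_value for v in values]


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1.

    Assumes the list contains at least two distinct values.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


reviews = [1, 3, 5, 2]  # the "Before Scaling" column

by_max = scale_by_max(reviews, 5)
min_max = min_max_scale(reviews)

print(by_max)   # the "After Scaling" column
print(min_max)
```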
Training A Model
Splitting the Data
Simple Splitting
The gold standard for evaluating a model is testing it on data it has not seen in training. This means holding out a percentage of the dataset (typically 10-20%) and running it through the trained model to see its accuracy.

It’s important to set a random state for the split, so you train and evaluate your model on the same split every time, making your results reproducible.
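A minimal sketch of a simple split with scikit-learn's `train_test_split`. The iris dataset, the 20% test size and the seed value 42 are choices made for this example, not something the slides prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the split with the same random_state selects the same rows.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print((X_test == X_test_again).all())
```

`train_test_split` shuffles the rows by default, so the test set is not simply the last rows of the dataset.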
Training and Testing Data
training set
X =                          y =
1.1  2.2  3.4  5.6  1.0      1.6
6.7  0.5  0.4  2.6  1.6      2.7
2.4  9.3  7.3  6.4  2.8      4.4
1.5  0.0  4.3  8.3  3.4      0.5
0.5  3.5  8.1  3.6  4.6      0.2

BAD SPLIT

test set
5.1  9.7  3.5  7.9  5.1      5.6
3.7  7.8  2.6  3.2  6.3      6.7
Picking an Algorithm
There are many algorithms to choose from, but lucky for us, Scikit-Learn has a ton built in, and they can be used mostly interchangeably. This means different classifiers can be trained in a loop and then plotted to compare performance.
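The loop idea can be sketched as follows. The three classifiers, the iris dataset and the seeds are picked for this example; plotting is left out, and the scores are only printed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Every scikit-learn classifier exposes the same fit/score methods,
# which is what makes them interchangeable inside a loop.
classifiers = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "support vector machine": SVC(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the held-out data

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The `scores` dictionary could then be passed to a plotting library to draw a bar chart of accuracy per classifier.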

Each algorithm has use cases where it does better and may outperform the others on a specific task. There is no master algorithm.

Scikit-Learn has a great cheat sheet for picking algorithms.


Source: https://goo.gl/liKQbr
Generalizing
Overfitting and Underfitting
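[Figure: accuracy (vertical axis) against model complexity (horizontal axis), with one curve for training and one for testing (generalization). Underfitting is on the left, overfitting on the right, and the sweet spot lies in between.]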
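A rough way to see this effect is to grow a model's complexity step by step and compare training and testing accuracy. The sketch below uses decision tree depth as the complexity knob on synthetic data; the dataset settings, depths and seeds are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so that memorising
# the training set does not carry over to unseen data.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = []
for depth in [1, 2, 4, 8, 16, None]:  # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results.append((depth, train_acc, test_acc))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth}: train={train_acc:.2f} test={test_acc:.2f}")
```

Very shallow trees tend to score poorly on both sets (underfitting). A fully grown tree fits the training set perfectly while its testing accuracy usually lags behind (overfitting).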
Algorithms
A few great ones for baselining.
● Gradient Boosting (XGBoost, LightGBM)
● Random Forests
● Multi-Layer Perceptron (NN)
● Support Vector Machines
Parameter Tuning
Each algorithm has a variety of parameters, and there are a few ways of finding optimal ones.
● GridSearch
● RandomSearch
● Hyperopt
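Grid search can be sketched with scikit-learn's `GridSearchCV`, which tries every combination in a parameter grid using cross-validation. The SVM, the grid values and the iris dataset are choices made for this example. scikit-learn's `RandomizedSearchCV` is the random-search counterpart, and Hyperopt is a separate library.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination of these values is tried: 3 x 2 = 6 candidates.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# cv=5 scores each candidate with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Grid search grows quickly as parameters are added, which is one reason random search and Hyperopt appear on the list as alternatives.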
Recap

Data Mining -> Pre-Processing -> Splitting Data -> Training -> Evaluating
Jupyter Notebook Use
Recommended Resources
● Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
● Deep Learning with Python by François Chollet
● Kaggle

github.com/dexterfichuk/ML-Bootcamp

https://goo.gl/VaWHrb
http://scikit-learn.org/
