BMI 704 - Machine Learning Lab


030719
Topics
• Types of Machine Learning
• Supervised - you have data where both the input variables
(features) and the outcome are observed
• Unsupervised - you have data but no observed outcome

• Introduction to Supervised Learning


• Introduction to Unsupervised Learning

• Algorithms and Packages


Supervised Learning
• Outcome
• The variable you want to predict (labelled; Y)
• Continuous or binary

• Features
• i.e. the input variables (Xs)
• Inputs you use to predict the outcome

• Models
• e.g. Diabetes = 0.5*age + 0.2*sex + 2.1*BMI + …
• e.g. Height = 0.2*age + 0.8*sex + 1.3*weight + …
• To predict: 1) pick a person, 2) substitute their features into the model,
3) read off the predicted outcome
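The "substitute their features into the model" step above can be sketched in a few lines of Python. The coefficients are the made-up ones from the slide, and the person's feature values are hypothetical, chosen only for illustration.

```python
# Toy illustration: plug a person's features into a fitted linear model
# to get a predicted outcome. Coefficients are the slide's made-up values.
coefficients = {"age": 0.5, "sex": 0.2, "BMI": 2.1}

def predict(person):
    # Linear model: Y-hat = sum over features of coefficient * value
    return sum(coefficients[name] * person[name] for name in coefficients)

person = {"age": 50, "sex": 1, "BMI": 30}
print(predict(person))  # 0.5*50 + 0.2*1 + 2.1*30 = 88.2
```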
Where does the prediction model come from?
• 1) Pick an algorithm
• Linear model
• Y = X1 + X2 + X3

• 2) Split your data set into train and test sets (e.g. 80/20, 70/30)

• 3) Build your model using the training data set


• Cross-validation to find the best model parameters

• 4) Run your optimized model using the test data set

• 5) Report model performance and your results
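Steps 2-5 above can be sketched end to end. This is a minimal illustration assuming scikit-learn in Python (the course's own tooling is R); the dataset, model, and hyperparameter grid are arbitrary choices, not part of the lab.

```python
# Sketch of the supervised workflow: split, train with cross-validation,
# then evaluate once on the held-out test set.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)

# Step 2: split into train and test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 3 (+ cross-validation): fit on the training set, using 5-fold CV
# to pick the best hyperparameter
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Steps 4-5: run the optimized model on the test set and report performance
print("best alpha:", search.best_params_["alpha"])
print("test R^2:", search.score(X_test, y_test))
```

Note the test set is touched exactly once, after all model selection is done; leaking it into cross-validation inflates the reported performance.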


How well did your algorithm do?
Loss function
• An objective metric to maximize or minimize

Simple regression
• R² - amount of variance explained

Multiple regression with varying model size
• Adjusted R²
• AIC/BIC/Cp
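R² and adjusted R² are easy to compute by hand, which makes the "varying model size" penalty concrete. A minimal sketch using only numpy, on simulated data (the data and coefficients are invented for illustration):

```python
# R^2 and adjusted R^2 by hand for a linear regression.
# n = number of observations, p = number of predictors.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Ordinary least squares with an intercept column
X1 = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(X1, y, rcond=None)[0]
residuals = y - X1 @ beta

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                       # variance explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Adding a useless predictor can only raise R², but adjusted R² (like AIC/BIC/Cp) charges for it, so it can fall.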
How well did your algorithm do?
Classification (Y = binary)
• Receiver operating characteristic (ROC) curve
and area under the curve (AUC)

• If Y = 1 or 0:
• High sensitivity: Y = 1 ➙ Ŷ = 1
• High specificity: Y = 0 ➙ Ŷ = 0
Which model (algorithm) should you use?
Unsupervised Learning
• Not interested in predicting Y; instead, exploratory analysis of the Xs
• Discovering patterns
• Finding subgroups you didn't know were there
• Visualize the results

• Hard to validate results

• Principal component analysis (PCA)


• X1, X2, X3, X4 … Xn
• ➙ create latent variables (PCs)

• A few latent variables capture most of the information in the data
• i.e. the variance explained

• Variance explained: PC1 > PC2 > PC3 …


[Figure: PCA score plot and loading plot; each axis labelled with the % variance explained by that PC]
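The variance-explained ordering (PC1 > PC2 > PC3 …) and the score coordinates can be seen on simulated data. A minimal sketch assuming scikit-learn; the three correlated features are invented for illustration:

```python
# PCA sketch: three features sharing one latent signal collapse onto PC1.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=200)  # one underlying latent variable
# Three noisy copies of the same signal -> highly correlated features
X = np.column_stack([latent + 0.1 * rng.normal(size=200) for _ in range(3)])

pca = PCA(n_components=3)
scores = pca.fit_transform(X)         # score-plot coordinates per sample
print(pca.explained_variance_ratio_)  # sorted: PC1 > PC2 > PC3
```

Here one latent variable drives all three features, so PC1 captures nearly all of the variance; the loadings (`pca.components_`) show how each original X contributes to each PC.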
Unsupervised Learning
• Clustering
• PCA looks to find a low-dimensional representation of the observations that
explains a good fraction of the variance
• Clustering looks to find homogeneous subgroups among the observations

• K-means clustering
• hierarchical clustering
K-means clustering
• Partitions a data set into K distinct, non-overlapping clusters
• You specify how many clusters you want
• The algorithm finds a local optimum
• Run it a few times and compare the different results
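The "run it a few times" advice is built into scikit-learn's K-means as the `n_init` parameter, which restarts from several random initializations and keeps the best local optimum. A minimal sketch on two simulated blobs (data invented for illustration):

```python
# K-means sketch: K = 2 on two well-separated 2D blobs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # blob around (0, 0)
               rng.normal(5, 0.5, size=(50, 2))])  # blob around (5, 5)

# n_init=10: run 10 random starts, keep the lowest-inertia solution
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)              # within-cluster sum of squares (objective)
print(np.bincount(km.labels_))  # cluster sizes
```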
Hierarchical clustering
• Tree-based representation of the observations, called a dendrogram
• Bottom-up (agglomerative) clustering
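A bottom-up clustering can be sketched with scipy: `linkage` records the merge history that a dendrogram would draw, and `fcluster` cuts the tree into a chosen number of groups. The two toy blobs are invented for illustration.

```python
# Agglomerative (bottom-up) clustering sketch with a dendrogram-ready tree.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(5, 0.5, size=(10, 2))])

Z = linkage(X, method="complete")  # merge history, built bottom-up
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself
```

Unlike K-means, you do not have to fix K up front: the same tree can be cut at any height to yield any number of clusters.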
Algorithms and Packages
• ML Algorithms (many, many, many!)
• Basics: linear-based
• Shrinkage Methods
• Lasso and Ridge regression
• ElasticNet
• Non-linear methods
• Spline
• Support Vector Machines
• Tree based methods
• Decision trees
• Random Forests
• Packages in R
• Individual packages for each algorithm - glmnet
• Meta packages – caret
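The slide names R packages; for readers working in Python, a rough analogue of glmnet's shrinkage models exists in scikit-learn (this is an illustrative equivalent, not the course's tooling, and the alpha values are arbitrary):

```python
# Lasso, Ridge, and ElasticNet - scikit-learn's counterparts to glmnet -
# compared by 5-fold cross-validated R^2 on a built-in dataset.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
```

In the same spirit as caret, scikit-learn also provides one uniform fit/predict/score interface across all of its algorithms.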
