Linear regression is an algorithm that models a linear relationship between an
independent variable and a dependent variable in order to predict future outcomes.
It is a statistical method used in data science and machine learning for
predictive analysis.
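As a rough illustration of fitting such a line, the sketch below estimates the intercept and slope by least squares in plain Python. The function name `fit_simple_linear_regression` and the hours/scores data are hypothetical, chosen only for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: hours studied (independent) vs. exam score (dependent)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

intercept, slope = fit_simple_linear_regression(hours, scores)
predicted_for_6_hours = intercept + slope * 6  # prediction for an unseen input
```

Once the two parameters are known, predicting a new outcome is a single multiply-and-add, which is what makes the model easy to interpret.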
Model evaluation aims to define how well the model performs its task. The model's
performance can vary both across use cases and within a single use case, e.g., by
defining different parameters for the algorithm or data selections. Accordingly, we
need to evaluate the model's accuracy at each training run.
Types of evaluations
• Train and test on the same dataset
• Train/test split
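A train/test split can be sketched in a few lines of plain Python. The helper below is a hypothetical stand-in written for this example; scikit-learn offers a function with the same name in `sklearn.model_selection`, but the signature and defaults here are assumptions of this sketch, not that library's API.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of (x, y) pairs
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data, test_fraction=0.3, seed=42)
```

The model is then fitted on `train` only and scored on `test`, so the score reflects rows the model has never seen.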
Regression metrics are quantitative measures used to evaluate the quality of a
regression model. Scikit-learn provides several metrics, each with its strengths
and limitations, to assess how well a model fits the data.
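To make the metrics concrete, the sketch below computes four common ones by hand: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R². The function name and the sample values are hypothetical; in scikit-learn the corresponding functions in `sklearn.metrics` include `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # average size of the errors
    mse = sum(e * e for e in errors) / n    # penalizes large errors more
    rmse = math.sqrt(mse)                   # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                # share of variance explained
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
metrics = regression_metrics(y_true, y_pred)
```

MAE and RMSE are in the units of the target, while R² is unitless, which is one reason several metrics are usually reported together.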
High training accuracy isn't necessarily a good thing
It can be the result of over-fitting
• Over-fit: the model is fitted too closely to the training dataset, so it may
capture noise and fail to generalize to new data
Our models must have high out-of-sample accuracy
How can we improve out-of-sample accuracy?
Multiple linear regression is an extension of simple linear regression, where multiple
independent variables are used to predict the dependent variable.
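Written as an equation, the model might look like the following. The symbols are assumptions of this sketch rather than notation from the slides: \hat{y} is the predicted value, x_1 ... x_n are the independent variables, \theta_0 is the intercept, and \theta_1 ... \theta_n are the coefficients.

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
```

With n = 1 this reduces to simple linear regression.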
Example
Effect of independent variables on the prediction
Do revision time, test anxiety, lecture attendance, and gender have any effect on the
exam performance of students?
Predicting impacts of changes
How much does blood pressure go up (or down) for every unit increase (or decrease) in
the BMI of a patient?
How to estimate the parameters?
• Ordinary Least Squares
Linear algebra operations
Takes a long time for large datasets (10K+ rows)
• An optimization algorithm
Gradient Descent
The proper approach if you have a very large dataset
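The optimization route can be sketched as batch gradient descent on the squared error. This is a minimal illustration, not a tuned implementation: the function name, the learning rate, the iteration count, and the noise-free data are all assumptions made for the example.

```python
def gradient_descent(X, y, learning_rate=0.1, n_iters=2000):
    """Fit y ~ theta[0] + theta[1]*x1 + ... + theta[n]*xn by batch gradient descent."""
    n_samples = len(X)
    n_features = len(X[0])
    theta = [0.0] * (n_features + 1)  # theta[0] is the intercept
    for _ in range(n_iters):
        grad = [0.0] * (n_features + 1)
        for row, target in zip(X, y):
            pred = theta[0] + sum(t * v for t, v in zip(theta[1:], row))
            error = pred - target
            grad[0] += error               # intercept term
            for j, v in enumerate(row):
                grad[j + 1] += error * v   # one term per feature
        # Step against the gradient of half the mean squared error
        theta = [t - learning_rate * g / n_samples for t, g in zip(theta, grad)]
    return theta


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
y = [1 + 2 * a + 3 * b for a, b in X]
theta = gradient_descent(X, y)  # should approach [1, 2, 3]
```

Each iteration only needs sums over the rows, so the approach avoids the matrix operations that make Ordinary Least Squares slow on very large datasets. In practice the learning rate usually has to be chosen with some care: too large and the updates diverge, too small and convergence is slow.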
Classification is a supervised machine learning method in which the model tries to
predict the correct label for a given input.
A supervised learning approach
Categorizing some unknown items into a discrete set of categories or "classes"
The target attribute is a categorical variable
Decision Trees (ID3, C4.5, C5.0)
Naïve Bayes
Linear Discriminant Analysis
k-Nearest Neighbor
Logistic Regression
Neural Networks
Support Vector Machines (SVM)
k-Nearest Neighbors is a method for classifying cases based on their similarity
to other cases
Cases that are near each other are said to be "neighbors"
It is based on the idea that similar cases with the same class labels are near
each other
1. Pick a value for K.
2. Calculate the distance from the unknown case to every case in the training data.
3. Select the K observations in the training data that are "nearest"
to the unknown data point.
4. Predict the response of the unknown data point using the most
popular response value among the K nearest neighbors.
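The procedure above can be sketched in plain Python as follows. The function name `knn_predict`, the choice of Euclidean distance, and the two small clusters of points are assumptions made for illustration; scikit-learn provides a full implementation as `KNeighborsClassifier` in `sklearn.neighbors`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the unknown case to every training case (Euclidean here)
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k training cases that are nearest to the query
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Predict the most popular label among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two small clusters labelled "A" and "B"
train_points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
train_labels = ["A", "A", "A", "B", "B", "B"]

label_near_a = knn_predict(train_points, train_labels, (1.5, 1.5), k=3)
label_near_b = knn_predict(train_points, train_labels, (5.5, 5.5), k=3)
```

An odd K is a common choice for two-class problems because it avoids tied votes.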
A decision tree is a type of supervised machine learning used to categorize or
make predictions based on how a previous set of questions was answered.
1. Choose an attribute from your dataset.
2. Calculate the significance of the attribute in splitting the data.
3. Split the data based on the value of the best attribute.
4. Go to step 1.
[Figure: split into Drug A and Drug B branches; entropy of each branch = ?]
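One common way to measure the significance of an attribute is entropy-based information gain, the criterion used by ID3. The sketch below is a minimal version under that assumption; the function names and the "A"/"B" label lists are hypothetical and do not come from the slides' dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted


# Hypothetical node with four cases of class "A" and four of class "B"
parent = ["A"] * 4 + ["B"] * 4

# Two candidate splits of that node
weak_split = [["A", "A", "A", "B"], ["A", "B", "B", "B"]]
perfect_split = [["A"] * 4, ["B"] * 4]

gain_weak = information_gain(parent, weak_split)
gain_perfect = information_gain(parent, perfect_split)
```

The attribute whose split gives the larger gain would be treated as the better attribute; a split that produces pure branches recovers all of the parent's entropy.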
“A baby learns to crawl, walk, and then
run. We are in the crawling stage when it
comes to applying machine learning.”
Dave Waters