
MODELS

Logistic Regression:
Despite its name, logistic regression is used for classification. This model calculates the probability, p, that an observation belongs to the positive class of a binary target.
If p is greater than or equal to 0.5, we label the observation as 1.
If p is less than 0.5, we label it as 0.

The default probability threshold for logistic regression in scikit-learn is 0.5. This threshold can also be applied to other models that output probabilities, such as KNN.
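A minimal sketch of this workflow, assuming scikit-learn and a small synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative)
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Probability p that each test observation belongs to class 1
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

# Applying the 0.5 threshold by hand; predict() does the same internally
y_pred = (y_pred_probs >= 0.5).astype(int)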

Hyperparameter tuning
Parameters that we specify before fitting a model, like alpha and n_neighbors, are called
hyperparameters.
So, a fundamental step for building a successful model is choosing the correct hyperparameters.
We can try lots of different values, fit all of them separately, see how well they perform, and choose
the best values!

One approach to hyperparameter tuning is called grid search, where we choose a grid of possible hyperparameter values to try. For example, we can search across two hyperparameters for a KNN model: the type of distance metric and the number of neighbors.
We perform k-fold cross-validation for each combination of hyperparameters, compare the mean scores across combinations, and then choose the hyperparameters that performed best, as in the sketch below.
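A minimal sketch with scikit-learn's GridSearchCV on synthetic data (the grid values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=42)

# Grid of hyperparameter values to try for KNN
param_grid = {
    "metric": ["euclidean", "manhattan"],
    "n_neighbors": [3, 5, 7, 9],
}

kf = KFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=kf)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)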

Similarly, alpha and solver are both hyperparameters for Ridge regression: the cross-validation score is calculated for each pair of alpha and solver values, and the best pair is chosen.

Grid search is great. However, the number of fits equals the number of hyperparameter combinations multiplied by the number of folds.
Therefore, it doesn't scale well! Performing 3-fold cross-validation for one hyperparameter with 10 values means 30 fits, while 10-fold cross-validation on 3 hyperparameters with 10 values each means 10 x 10 x 10 = 1,000 combinations, or 10,000 fits!

Hence we can choose random search, which picks random hyperparameter values rather than exhaustively searching through all options.
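A comparable sketch with RandomizedSearchCV, here tuning Ridge's alpha and solver on synthetic data; the parameter ranges and n_iter are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)

param_distributions = {
    "alpha": np.linspace(0.0001, 1, 20),
    "solver": ["lsqr", "sag"],
}

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# n_iter combinations are sampled at random instead of trying all of them
search = RandomizedSearchCV(Ridge(), param_distributions, n_iter=10, cv=kf, random_state=42)
search.fit(X, y)

print(search.best_params_, search.best_score_)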

DATA PREPROCESSING:


Dealing with categorical data
We need to convert categorical features into numeric features. We achieve this by splitting the feature into multiple
binary features called dummy variables, one for each category. Zero means the observation was not
that category, while one means it was.
We create binary features for each genre. As each song has one genre, each row will have a 1 in one of the ten genre columns and zeros in the rest. If a song is not any of the first nine genres, then implicitly it is a rock song. That means we only need nine features, so we can delete the Rock column.
To create dummy variables we can use scikit-learn's OneHotEncoder, or pandas' get_dummies.

If the DataFrame only has one categorical feature, we can pass the entire DataFrame, thus skipping the step of combining variables. If we don't specify a column, the new DataFrame's binary columns will have the original feature name as a prefix, so they will start with "genre_". Notice the original genre column is automatically dropped. Once we have dummy variables, we can fit models as before.
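A minimal sketch with pandas' get_dummies on a hypothetical music DataFrame (column names and values are made up for illustration):

import pandas as pd

# Hypothetical music data with a single categorical feature, "genre"
music_df = pd.DataFrame({
    "popularity": [52, 81, 34],
    "genre": ["Rock", "Jazz", "Blues"],
})

# Passing the whole DataFrame: only the categorical column is encoded,
# the new columns are prefixed with "genre_", and the original column is dropped
music_dummies = pd.get_dummies(music_df, drop_first=True)
print(music_dummies.columns.tolist())   # ['popularity', 'genre_Jazz', 'genre_Rock']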

Handling missing data:


We can first count the missing values in each feature.

A common approach is to remove observations with missing values when they account for less than 5% of all data.
The subset argument of dropna takes the columns in which null values should be checked; if one of those columns has a null value, the entire row is removed, as in the sketch below.
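A minimal sketch on a small made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "genre": ["Rock", np.nan, "Jazz", "Rock"],
    "loudness": [-5.0, -7.2, np.nan, -6.3],
    "popularity": [52, 81, 34, 40],
})

print(df.isna().sum())   # count of missing values in each feature

# Drop any row with a null value in one of the listed columns
df_clean = df.dropna(subset=["genre", "loudness"])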

Another option is to impute missing data. This means making an educated guess as to what
the missing values could be
We can impute the mean of all non-missing entries for a given feature. We can also use other
values like the median. For categorical values we commonly impute the most frequent value.
We must split our data before imputing to avoid leaking test set information to our model,
a concept known as data leakage.

We divide the data into two parts, categorical and numerical, before splitting into training and test sets; using the same random_state for both splits ensures the same rows end up in the training and test sets of each part.
Imputers are also called transformers because of their ability to transform the data.
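A minimal sketch with scikit-learn's SimpleImputer, splitting before imputing; the DataFrame and column names are illustrative:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "genre": ["Rock", np.nan, "Jazz", "Rock", "Blues", "Jazz"],
    "loudness": [-5.0, -7.2, np.nan, -6.3, -4.8, -5.5],
    "popularity": [52, 81, 34, 40, 67, 73],
})

# Same random_state so the same rows land in train/test for both parts
X_train_cat, X_test_cat, y_train, y_test = train_test_split(
    df[["genre"]], df["popularity"], test_size=0.3, random_state=12)
X_train_num, X_test_num, _, _ = train_test_split(
    df[["loudness"]], df["popularity"], test_size=0.3, random_state=12)

imp_cat = SimpleImputer(strategy="most_frequent")   # categorical: most frequent value
X_train_cat = imp_cat.fit_transform(X_train_cat)    # fit on training data only
X_test_cat = imp_cat.transform(X_test_cat)          # no test information leaks in

imp_num = SimpleImputer(strategy="mean")            # numeric: mean of non-missing entries
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)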

Centering and scaling the data:

We normally scale or centre the data to make sure all features are on a similar scale.

Scaling the data:

Different ways of scaling:


Standardization: subtract the mean and divide by the standard deviation, so the data is centered around 0 with variance 1.
Min-max scaling: subtract the minimum and divide by the range, so that the minimum is 0 and the maximum is 1.
We can also normalize the data so that it lies in the range -1 to 1.

Before scaling we split the data to avoid data leakage
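A minimal sketch of standardization with scikit-learn's StandardScaler on synthetic data, splitting before scaling:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit only on the training data
X_test_scaled = scaler.transform(X_test)         # reuse the training mean and std

print(X_train_scaled.mean(), X_train_scaled.std())   # roughly 0 and 1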

Pipelines can be used to chain two (or more) steps together, for example a transformer followed by a model. We initialize the steps in the Pipeline and then call pipeline.fit.
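A minimal sketch, reusing the train/test split from the scaling sketch above:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

steps = [("scaler", StandardScaler()),
         ("knn", KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)

# Scaling and fitting happen together; X_train etc. come from the split above
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)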

Cross Validation in Scaling
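A minimal sketch of cross-validating the pipeline from above, so that scaling is refit inside each fold and no information leaks from the validation fold:

from sklearn.model_selection import KFold, cross_val_score

# The scaler is refit on each fold's training portion, so the validation
# fold never influences the scaling parameters
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=kf)
print(scores.mean())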


UNSUPERVISED LEARNING

Unsupervised learning is a class of machine learning techniques for discovering patterns in data.

Evaluating a cluster:

Cross-tabulations are used to see what the labelled data is indicating.

Take the example of the iris dataset: after clustering the data, we create a cross-table of the cluster labels against the known species.

Because the iris dataset has labels, we can evaluate the clustering using those labels.
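A minimal sketch with k-means on the iris dataset:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

model = KMeans(n_clusters=3, random_state=42)
labels = model.fit_predict(iris.data)

# Cross-tabulation of cluster labels against the known species
df = pd.DataFrame({"labels": labels, "species": iris.target_names[iris.target]})
print(pd.crosstab(df["labels"], df["species"]))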

For any other data that does not have labels, we evaluate the clustering using inertia.

Inertia measures how spread out the samples are within a cluster: it is the sum of squared distances of each sample from the centroid of its cluster.

The lower the inertia, the better the clustering.
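Continuing from the k-means fit above, a minimal sketch of reading and comparing inertia values:

from sklearn.cluster import KMeans

# Inertia of the fitted model: sum of squared distances of the samples
# to their closest cluster centre
print(model.inertia_)

# Comparing inertia for different numbers of clusters
for k in range(1, 6):
    print(k, KMeans(n_clusters=k, random_state=42).fit(iris.data).inertia_)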

FEATURE VARIANCE:
Variance measures the spread of the data values.
In k-means, a feature's variance determines its influence on the clustering, so features with larger variance dominate.
StandardScaler is used to modify feature variance: it transforms every feature to have mean 0 and variance 1.
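A minimal sketch that scales the iris features before k-means using a pipeline:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Scale to equal variance first, so no single feature dominates the clustering
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42))
pipeline.fit(X)
labels = pipeline.predict(X)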

HIERARCHICAL CLUSTERING

Hierarchical clustering is very useful for cluster visualization.

Hierarchical clusterings are visualized with diagrams called dendrograms.

This is done by first forming a cluster for every row and then merging the two closest clusters, one pair at a time.

The linkage function performs the hierarchical clustering, and the dendrogram function is used to visualize it.

Hierarchical clustering is not only a visualization tool: we can also extract clusters from intermediate stages, which can be used in further computations. An intermediate stage in the hierarchical clustering is specified by choosing a height on the dendrogram.
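A minimal sketch with scipy on the iris measurements; the height passed to fcluster (t=4) is an arbitrary illustrative choice:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# Perform agglomerative hierarchical clustering
mergings = linkage(X, method="complete")

# Visualize the merge order as a dendrogram
dendrogram(mergings)
plt.show()

# Extract flat cluster labels at a chosen height on the dendrogram
labels = fcluster(mergings, t=4, criterion="distance")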

T-SNE
T-SNE stands for T-Distributed Stochastic Neighbor Embedding
It has a complicated name, but it serves a very simple purpose. It maps samples from their high-
dimensional space into a 2- or 3-dimensional space so they can be visualized. While some distortion is
inevitable, t-SNE does a great job of approximately representing the distances between the samples.
For this reason, t-SNE is an invaluable visual aid for understanding a dataset.

t-SNE maps samples to 2D or 3D.

TSNE has only a fit_transform method instead of separate fit and transform methods.
We only give the X values to TSNE, not the y values.
A learning rate is also passed to TSNE; the best value differs between datasets, but the normal range is 50 to 200.
Even if the code is the same, rerunning it may produce a different plot; however, the relative positions of the data points do not change.
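A minimal sketch on the iris dataset (learning_rate=100 is just one value in the usual range):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()

# fit_transform only: there is no separate transform() for new samples
model = TSNE(learning_rate=100, random_state=42)
transformed = model.fit_transform(iris.data)   # only X is passed, never y

plt.scatter(transformed[:, 0], transformed[:, 1], c=iris.target)
plt.show()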

DIMENSIONALITY REDUCTION

It finds patterns in data and re-expresses the data in a compressed form.

PCA

It performs dimensionality reduction in 2 steps:

1. Decorrelation (doesn't change the dimension of the data)
2. Dimension reduction

DECORRELATION

In this step, the data is rotated so that it is aligned with the axes, and it is also shifted so that its mean is 0.

PCA has fit and transform methods in sklearn.

The fit() method learns how much to shift the data and how much to rotate it, but does not actually transform it.

The transform() method shifts and rotates the data based on what the fit() method learned, which means transform() can also be used on new, unseen data.

PCA also removes the correlation between features: if two features are linearly correlated at the beginning, they are not linearly correlated after the shifting and rotating.

PCA is called "Principal" Component Analysis because it identifies the principal components of the data and rotates and shifts the data based on those principal components.

They can be found using model.components_.
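A minimal sketch on the iris measurements:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

model = PCA()
model.fit(X)                      # learns the shift (mean) and the rotation
transformed = model.transform(X)  # applies them; also works on unseen data

print(model.components_)          # one row per principal component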


Intrinsic Dimension:

It is the number of PCA features with significant variance.
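Continuing from the PCA fit above, a minimal sketch that plots the variance of each PCA feature to judge the intrinsic dimension:

import matplotlib.pyplot as plt

# Variance of each PCA feature, for the PCA model fitted above
features = range(model.n_components_)
plt.bar(features, model.explained_variance_)
plt.xlabel("PCA feature")
plt.ylabel("variance")
plt.show()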

Non-negative matrix factorization:

It is a dimensionality reduction technique like PCA, but NMF models are interpretable, unlike PCA models.

However, all the features must be non-negative.

NMF expresses documents as combinations of topics, and images as combinations of patterns.
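A minimal sketch on a tiny made-up non-negative matrix (think of it as word counts per document):

import numpy as np
from sklearn.decomposition import NMF

# Tiny made-up non-negative matrix, e.g. word counts per document
samples = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 3.0, 0.0, 1.0],
    [2.0, 1.0, 1.0, 0.0],
])

model = NMF(n_components=2, random_state=42)
nmf_features = model.fit_transform(samples)   # each document as a combination of topics
print(model.components_)                      # each topic as a combination of words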

LINEAR CLASSIFIERS

LINEAR SVC (Support Vector Classifier)

The dot product is written x @ y.
lr.coef_ gives the coefficients of the predictor equation.
lr.intercept_ gives the intercept of the predictor equation.
The predicted class is determined by the sign of the raw model output, coefficients @ features + intercept.
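A minimal sketch with logistic regression on synthetic data, comparing the raw model output with the prediction:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=42)
lr = LogisticRegression().fit(X, y)

# Raw model output for the first example: coefficients @ features + intercept
raw = X[0] @ lr.coef_[0] + lr.intercept_[0]

# The sign of the raw output determines the predicted class
print(raw, lr.predict(X[:1]))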

LOSS FUNCTIONS

Scikit-learn's LinearRegression uses the squared error as its loss function.

The minimization is with respect to the parameters of the model.

Squared errors are not appropriate for classification models.

Hence we use the 0-1 loss: 0 for a correct prediction and 1 for an incorrect one, so that we know how many errors occurred (it is hard to minimize, so models won't use it directly during training).

A general-purpose optimizer such as scipy.optimize.minimize can be used to minimize any such loss function.
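A minimal sketch that recovers linear-regression coefficients by minimizing the squared error with scipy.optimize.minimize on synthetic data (no intercept, for simplicity):

import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)

def squared_loss(w):
    # Sum of squared errors of a linear model with coefficients w
    residuals = y - X @ w
    return np.sum(residuals ** 2)

# Minimize the loss with respect to the model parameters
w_opt = minimize(squared_loss, x0=np.zeros(X.shape[1])).x

# Matches scikit-learn's LinearRegression, which minimizes the same loss
lr = LinearRegression(fit_intercept=False).fit(X, y)
print(w_opt, lr.coef_)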

REGULARIZATION IN LOGISTIC REGRESSION
