MODELS
Logistic Regression:
Despite its name, logistic regression is used for classification. The model calculates the probability, p, that an observation belongs to the positive class of a binary target.
If p is greater than or equal to 0.5, we label the observation as 1.
If p is less than 0.5, we label it as 0.
The default probability threshold for logistic regression in scikit-learn is 0.5. The same threshold applies to other classifiers that output probabilities, such as KNN.
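A minimal sketch with scikit-learn; the synthetic data and default settings are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# predict_proba returns [P(class 0), P(class 1)] for each row
probs = logreg.predict_proba(X_test)[:, 1]
labels = (probs >= 0.5).astype(int)  # the same 0.5 rule predict() applies by default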
Hyperparameter tuning
Parameters that we specify before fitting a model, like alpha and n_neighbors, are called
hyperparameters.
So, a fundamental step for building a successful model is choosing the correct hyperparameters.
We can try lots of different values, fit all of them separately, see how well they perform, and choose
the best values!
Grid search is great. However, the number of fits equals the number of hyperparameter combinations (the product of the number of values tried for each hyperparameter) multiplied by the number of folds.
Therefore, it doesn't scale well! Performing 3-fold cross-validation for one hyperparameter with
10 values means 30 fits, while 10-fold cross-validation on 3 hyperparameters with 10 values
each means 10 x 10 x 10 x 10 = 10,000 fits!
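A sketch of grid search with cross-validation; the model (Lasso) and the alpha values are arbitrary choices for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

kf = KFold(n_splits=3, shuffle=True, random_state=42)
param_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 20, 50, 100, 200, 500]}

# one hyperparameter, 10 candidate values, 3 folds -> 30 fits
grid = GridSearchCV(Lasso(), param_grid, cv=kf)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)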
Dummy variables:
To use a categorical feature such as genre, we convert it to binary dummy variables with pandas' get_dummies, passing drop_first=True to avoid redundant columns. Each new column is named with the original feature name followed by an underscore. Notice the original genre column is automatically dropped. Once we have dummy variables, we can fit models as before.
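A sketch with a hypothetical music_df frame standing in for the dataset:

import pandas as pd

music_df = pd.DataFrame({
    "genre": ["Rock", "Jazz", "Rock", "Pop"],
    "popularity": [70, 55, 65, 80],
})

# drop_first=True avoids redundant columns; the new columns are named
# genre_Jazz, genre_Pop, ... and the original genre column is dropped
music_dummies = pd.get_dummies(music_df, columns=["genre"], drop_first=True)
print(music_dummies.columns)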
A common approach is to remove missing observations if they account for less than 5% of all data.
The subset argument of dropna takes the columns in which to look for null values; if any one of those columns has a null value in a row, the entire row is removed.
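A sketch of dropna with subset; the frame and column names are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "genre": ["Rock", np.nan, "Pop", "Jazz"],
    "popularity": [70, 55, np.nan, 80],
    "duration": [210, 185, 200, 240],
})

# a row is removed if ANY of the subset columns is null in that row
df_clean = df.dropna(subset=["genre", "popularity"])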
Another option is to impute missing data, which means making an educated guess as to what the missing values could be.
We can impute the mean of all non-missing entries for a given feature. We can also use other values, like the median. For categorical features, we commonly impute the most frequent value.
We must split our data before imputing to avoid leaking test set information to our model,
a concept known as data leakage.
Divide the data into categorical and numerical parts and split each into training and test sets; using the same random_state for both splits ensures the same rows end up in the training and test sets of both parts.
Imputers are also known as transformers, for their ability to transform the data.
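A leakage-safe imputation sketch (split first, fit the imputers on the training data only); the frame and column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "genre": ["Rock", np.nan, "Pop", "Jazz", "Rock", "Pop"],
    "popularity": [70, 55, np.nan, 80, 60, 75],
    "liked": [1, 0, 1, 0, 1, 1],
})
X_cat = df[["genre"]].values
X_num = df[["popularity"]].values
y = df["liked"].values

# the same random_state keeps the rows aligned across the two splits
X_cat_train, X_cat_test, y_train, y_test = train_test_split(X_cat, y, random_state=12)
X_num_train, X_num_test, y_train, y_test = train_test_split(X_num, y, random_state=12)

imp_cat = SimpleImputer(strategy="most_frequent")  # categorical: most frequent value
X_cat_train = imp_cat.fit_transform(X_cat_train)
X_cat_test = imp_cat.transform(X_cat_test)         # transform only: no test-set leakage

imp_num = SimpleImputer(strategy="mean")           # numeric: mean (or median)
X_num_train = imp_num.fit_transform(X_num_train)
X_num_test = imp_num.transform(X_num_test)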
We normally scale or centre the features to make sure they are all on the same scale.
Evaluating a cluster:
After clustering, we can evaluate the result by cross-tabulating the cluster labels against known categories with pd.crosstab, as sketched below.
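A sketch using the iris dataset, where known species labels are available to compare the clusters against:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(iris.data)

# rows: cluster labels; columns: counts of each true species
df = pd.DataFrame({"labels": labels, "species": iris.target})
print(pd.crosstab(df["labels"], df["species"]))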
FEATURE VARIANCE:
Variance measures the spread of the data values.
In k-means, a feature's variance determines its influence on the clustering: high-variance features dominate the distance calculations.
StandardScaler is a transformer used for modifying feature variance:
it transforms every feature to have mean 0 and variance 1.
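A sketch chaining StandardScaler and KMeans in a pipeline so no single high-variance feature dominates the distances (wine dataset used for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_wine().data

scaler = StandardScaler()  # each feature -> mean 0, variance 1
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
pipeline = make_pipeline(scaler, kmeans)
labels = pipeline.fit_predict(X)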
HIERARCHICAL CLUSTERING
T-SNE
t-SNE stands for t-distributed stochastic neighbor embedding.
It has a complicated name, but it serves a very simple purpose: it maps samples from their high-dimensional space to a 2- or 3-dimensional space so they can be visualized. While some distortion is inevitable, t-SNE does a great job of approximately representing the distances between the samples. For this reason, t-SNE is an invaluable visual aid for understanding a dataset.
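A sketch on the digits dataset; the learning_rate of 100 is an arbitrary but typical choice (values around 50-200 are common):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
model = TSNE(n_components=2, learning_rate=100, random_state=42)
transformed = model.fit_transform(digits.data)  # TSNE has no separate fit/transform

plt.scatter(transformed[:, 0], transformed[:, 1], c=digits.target, s=5)
plt.show()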
DIMENSIONALITY REDUCTION
PCA
PCA (principal component analysis) is a fundamental dimension reduction technique. It works in two steps: first it decorrelates the data, then it reduces the dimension.
DECORRELATION
In the decorrelation step, the data is rotated so that it is aligned with the coordinate axes, and shifted so that the mean of each feature is 0.
The fit() method learns how much to shift and how much to rotate the data, but does not actually transform it.
The transform() method rotates and shifts the data using what fit() learned; that means transform() can also be applied to new, unseen data.
PCA also removes linear correlation between features: if two features are linearly correlated at the beginning, they are no longer linearly correlated after the shifting and rotating.
PCA is called "principal" because it identifies the principal components, the directions along which the data varies the most, and rotates and shifts the data to align those components with the axes.
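A sketch of the fit/transform split on the iris data, checking that the transformed features are decorrelated:

from scipy.stats import pearsonr
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA()
pca.fit(X)                # learns the mean shift and the rotation; nothing moves yet
X_pca = pca.transform(X)  # actually shifts and rotates the samples

# after PCA the columns are linearly decorrelated: correlation ~ 0
corr, _ = pearsonr(X_pca[:, 0], X_pca[:, 1])
print(round(corr, 6))
print(pca.components_)    # the principal components (the new axes)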
LINEAR CLASSIFIERS
Dot product: x @ y computes the dot product of the arrays x and y.
lr.coef_ gives the coefficients of the prediction equation.
lr.intercept_ gives the intercept of the prediction equation.
The class is identified from the raw model output, coefficients @ features + intercept: if it is positive, predict class 1; otherwise, class 0.
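A sketch verifying on synthetic data that the sign of the raw output matches predict():

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
lr = LogisticRegression()
lr.fit(X, y)

raw = X @ lr.coef_.T + lr.intercept_        # dot product of features and coefficients
pred = (raw.ravel() > 0).astype(int)        # positive raw output -> class 1
print(np.array_equal(pred, lr.predict(X)))  # True: the rule predict() uses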
LOSS FUNCTIONS