Final ML
Machine learning involves teaching a computer to perform tasks using data by automatically extracting algorithms from example data, especially for problems without predefined algorithms like spam detection. It is closely related to statistics and optimization, with a primary focus on
prediction. Structured data: highly organized, made up mostly of tables with rows and columns that define their meaning (e.g. Excel spreadsheets and relational databases). Unstructured data: everything else - emails, text messages, text files, audio files, voicemails, video files, images of pictures, illustrations etc. To make sense of it we need knowledge discovery. Why ML: data = cheap & abundant, knowledge = expensive & scarce; ML enables the extraction of valuable knowledge from the rapidly growing, abundant data, transforming it into actionable insights. Supervised learning: given a set of data and labels, learn a model which will predict a label for new data; D = {Xi, Yi}, learn F: Xi -> Yi. Used to automate manual labour. Training an algorithm on a labeled dataset where it learns from input/output
pairs provided by a "teacher" to automate tasks like classification and regression. It generalizes from known examples to make predictions on new data, with applications ranging from identifying handwritten digits to detecting credit card fraud and enabling self-driving cars. Unsupervised learn.: input data without explicit instructions or known output data,meaning no labels are required discover patterns in
data D= {Xi},group the data into Y classes using a model or function F: Xi ->Y,discovering trending topics,grouping data into clusters,outlier detection, reducing dimensionality,with applications such as identifying topics in blog posts, detecting trends, clustering customer pref. Reinforcement learning inv. a system learning to make a sequence of decisions to achieve a goal by interacting with its environment,
taking actions, receiving feedback, and optimizing its behavior to maximize rewards. The focus is on learning a policy, output -> a sequence of actions leading to the desired outcome, with applications in game playing, robot navigation, and self-driving cars, and challenges arising when the system has limited or unreliable sensory information. Reasoning under uncertainty to make optimal decisions - maximize reward - D = {environment (e), actions (a), rewards (r)}, learn policy and utility functions. Policy: F1: {e, r} -> a; Utility: F2: {a, e} -> r. Instance: Pikachu; Label/Class: Mouse; Features/Attributes: Abilities, Weight, Legendary; Feature Values: lightning rod, 2, yes; Feature Vector: (lightning rod, 2, yes). A model: an equation that links the values of some features to the
predicted value of the target variable;• finding the equation (and coefficients in it) is called ‘building a model’ (see also‘fitting a model’).Score functions/Fit statistics/Score metrics – measures of how well the model fits the data.Feature selection – reducing the number of predictors by selecting the important ones(dimensionality reduction).Feature extraction – reducing the number of predictors by means
of a mathematical operation (e.g., PCA). Two types of SL: Classification - discrete output, e.g. color, gender, yes/no, class membership. Will you pass this course? Examine the statistics of two football teams, and predict which team will win tomorrow's match. Regression - continuous output, e.g. temperature, age, distance, salary. How many points will you get in the exam? Predict the number of Microsoft shares
that will be traded tomorrow.Dummy classifier: Classifies the given data using only simple strategies:Most-frequent, Uniform, ConstantDummy regressor:Makes predictions using simple strategies:-Mean, Median Both dummy:Don’t generate any insight about the data,Serves as a simple baseline to compare against other more complex classifiers/regressors; if performance is negative => worse than
random/dummy. Types of data: images are 2D arrays of numbers (RGB values for each pixel); text - words/letters need to be converted into a format understandable to the computer. Preprocessing: feature extraction and normalization, making raw data suitable for an ML model. Scaling data involves multiplying all instances of a variable by a constant to change the variable's range. Machine learning algorithms generally perform poorly when input numerical attributes have very different scales. Different scaling techniques address this issue: Standard Scaler: uses z-scores (subtracting the mean and dividing by the standard deviation) to normalize data to a mean of 0 and standard deviation of 1; effective for non-skewed data. Robust Scaler: similar to Standard Scaler but uses the median and interquartile range; better for skewed data and handling outliers. MinMax Scaler: scales data to a specified range between x_min and x_max (typically 0 and 1); useful for data with a bounded range or non-Gaussian distribution. Normalizer: rescales data by row, making each row's norm equal to 1; useful when the direction of the data matters, such as in histograms.
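A minimal scikit-learn sketch of the four scalers above (the toy array and its values are made up for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 10000.0]])  # second column has a very different scale and an outlier

print(StandardScaler().fit_transform(X))  # z-scores: mean 0, std 1 per column
print(RobustScaler().fit_transform(X))    # median/IQR per column, less affected by the outlier
print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(Normalizer().fit_transform(X))      # each ROW rescaled to unit norm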
Univariate Transformations: logarithmic, geometric, power transformations. Most ML models perform best with Gaussian-distributed data. Methods to achieve a Gaussian distribution include the Box-Cox and Yeo-Johnson transforms. Parameters can be automatically estimated to minimize skewness and stabilize variance. Binning: separates feature values into n categories (e.g., equally spaced) and replaces all values within a category by a single value, such as the mean. Effective for models with few parameters (e.g., regression), but less effective for models with many parameters (e.g., decision trees). Measuring Classification Success: evaluate how predictive the models are. New data will differ from training data. Avoid overfitting to ensure the model generalizes well. Training and Test Set: split data into training and test sets to avoid overfitting. Training set: used to build the model. Test set: used to evaluate model
performance. Separate data ensures the model generalizes to new, unseen data. Subject-wise splitting. Cross-validation: evaluate/test your model's ability to predict new data, detect overfitting or selection bias; K-fold, Leave-one-out. ML Pipelines are workflows designed to execute a sequence of tasks, including data normalization (scaling), imputation of missing values, dimensionality reduction, and classification, streamlining the process of building and deploying machine learning models. Preprocessing: missing values have no standard encoding (0, blank, NA, NaN). Imputation: replacing a missing value with an estimate for that value - mean/median/kNN/model-driven/iterative. Feature selection: avoid overfitting, faster prediction and training, less storage for model and dataset.
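A sketch of such a pipeline (the dataset, imputation strategy and number of selected features are illustrative assumptions, not from the notes):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # replace missing values with the median
    ("scale", StandardScaler()),                    # z-score scaling
    ("select", SelectKBest(f_classif, k=10)),       # keep the 10 best features (ANOVA F-test)
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)                          # all steps are fit on the training data only
print(pipe.score(X_test, y_test))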
Strategies: univariate statistics (f_regression, f_classif), model-based selection (Lasso), iterative selection (random forests). Univariate statistics involves examining each feature individually, removing those lacking significant relationships with the target variable, and discarding features significant only in combination with others (interactions). Selection based on confidence often involves ANOVA-related methods like f_regression, f_classif, and chi2 in scikit-learn, examining statistics and p-values to determine feature importance. Mutual information is also univariate, but doesn't assume a linear model (like the F statistics do); can be used with SelectKBest etc.; it measures the reduction in uncertainty for one variable given a known value of the other variable. It's important to validate the selected features on a separate validation set to
ensure generalization to unseen data. Model-based selection involves using heuristics and models to identify key features, leveraging methods like Lasso and tree-based models. Iterative approaches, such as forward and backward selection (start with a single feature, find the most important, add it, iterate / fit the model, find the least important, remove it, iterate), as well as RFE (repeatedly training the model and removing the least important features until the desired number is reached), help refine feature sets, though they can be computationally expensive. Categorical variables: data is often viewed as a 2D array of floating-point numbers, with each column representing a continuous feature. Real world: many problems involve discrete, non-numeric categorical features, which lack a natural order. Proper representation significantly affects model
performance and often requires scaling.Ordinal:logical order, Interval:equal intervals on the variable represent equal differences in the measured property,no true 0,Ratio:same as interval,but ratio of the score must make sense and have true 0 value. Encoding methods:One-hot encoding: converts cat. var. into binary features, each representing a category with values of 0 or 1: data_dummies =
pd.get_dummies(data). The resulting one-hot matrix treats all categories as equally distant. Mismatch in categories between training and test sets can lead to incorrect model behavior; it's essential to have the same set of dummy features in both sets to maintain consistent semantics. Count-Based Encoding: represents categorical features based on frequency counts or other numerical metrics. For high-cardinality categorical features (countries), instead of 50
one-hot features, replace the label with the value of a variable aggregated over that label. For regression: people in this state have an average response of y; binary classification: people in this state have likelihood p for class 1; multiclass: one feature per class - a probability distribution. Digital images: the values are all discrete integers. An image can be considered as a large array of discrete dots; each dot has a brightness associated with it; dots = pixels. Arrays and Images: images are represented as matrices (numpy arrays) and can be written as a function f(x,y) -> pixel intensity; types of images: binary, grayscale, color. Binary image (1-bit image): each pixel is either black or white, only 2 possible values for each pixel (0, 1), only need 1 bit per pixel. Grayscale image: each pixel is a shade of gray, from 0 (black) to 255 (white); each pixel can be represented by 8 bits, or
exactly one byte; other grayscale ranges are used, but generally powers of 2. Color images (multi-channel images): a stack of multiple matrices representing the multiple channel values for each pixel, e.g. an RGB color is described by the amount of red, green and blue in it. Confusion matrices: row 1: TN, FP; row 2: FN, TP. Accuracy: (TP+TN)/sum of everything; measures the proportion of correctly classified instances among all; problems with imbalanced datasets; treats all errors equally, ignoring that FP and FN can have different consequences in different applications; accuracy may not reflect these costs; influenced by preprocessing; accuracy does not indicate the level of prediction confidence. Precision: TP/(TP+FP): what proportion of positive identifications is actually correct (no FP => precision of 1). Recall: TP/(TP+FN): what proportion of actual positives was identified correctly (no FN => recall = 1). F-score: 2*(precision*recall)/(precision+recall): balance between precision and recall; use when you want to consider FP and FN; combines them into a single metric when there is an imbalance between the classes.
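A short sketch computing these metrics on made-up predictions:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))  # [[TN FP], [FN TP]]
print(accuracy_score(y_true, y_pred))    # (TP+TN)/all
print(precision_score(y_true, y_pred))   # TP/(TP+FP)
print(recall_score(y_true, y_pred))      # TP/(TP+FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall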
Text data can consist of words, sentences, or entire documents, characterized by high variability such as punctuation, different forms of the same word, typos, and capitalization. Representing text data often involves breaking sentences into words and using one-hot encoding, which can lead to very large, sparse vectors and issues with concatenated vectors' size and sentence length variation, posing challenges for most ML methods (unequal length).
Term Frequency (TF): number of times a term appears in a document / number of terms in the document. Inverse Document Frequency (IDF) = log(N/n), where N = number of documents and n = number of documents the term has appeared in. The IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low. TF-IDF value of a term = TF*IDF. Adv: highly interpretable, each word is an independent feature, simple method, fairly effective approach. Limit: all structure is lost, misspelling, some expressions consist of multiple words. Track expressions with multiple words with n-grams: instead of using individual words as tokens, use groups of n consecutive words ('man' - 1-gram, 'be fool' - 2-gram/bigram). Tokenization: convert sentences to words, can swap out sensitive data; the list of tokens turns into input for additional processing including parsing or text mining.
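A sketch of bag-of-words with n-grams and TF-IDF weighting (toy sentences, illustrative settings):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework", "the cat ate the fish"]

bow = CountVectorizer(ngram_range=(1, 2), stop_words="english")  # unigrams + bigrams, drop stop words
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()  # term frequency weighted by inverse document frequency
print(tfidf.fit_transform(docs).toarray().round(2))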
Stemming: reduces words to their root by removing inflectional suffixes (studies -> studi). Lemmatization: removes inflection by determining the part of speech and using a detailed language database (studies -> study). Restricting the vocabulary: removing unnecessary punctuation and tags, removing stop words - frequent words such as "the", "is" that have low semantic content. Classification: training a model to separate the data into 2 or multiple classes. Regression: fitting the data to describe the relationship between 2 features or between a feature and the label (e.g. a linear relationship). Training Classifiers: classifiers trained on labeled datasets automatically draw a decision boundary between classes. Types of Decision Boundaries: these boundaries can be straight (linear, "stiff") or curved (non-linear, "flexible"). Model Complexity: the complexity of
the model correlates with the flexibility of the decision boundary.Purpose of Decision Boundaries: serve as models to separate different classes based on the data.Creation of Decision Boundaries: Classifiers inherently create these boundaries to distinguish between classes.1-NN:Given a set of labeled instances (training set), new instances (test set) are classified based on their nearest labeled neighbor. K-
NN: K represents the number of labeled neighbors to consider; more computation for testing than for training. Test points are assigned the majority label of the k nearest neighbors. k = N: since all datapoints are considered, the predicted label for a test point will always be the majority label of all datapoints - equivalent to a majority classifier. In case of a tie between predicted labels there are different possibilities, most common = random selection; more neighbors = less complex decision boundary; a higher number of neighbors = straighter boundary. Weights (how much influence points have on one another) in k-NN = an extension of the basic algorithm: not all neighbours get an equal vote (e.g. rating the best restaurant in town, opinions of people living closer to the city count more). Distance weighting = each neighbour has a weight based on its distance to the data point to be classified (the closer a point is to the center of the cell being estimated, the more influence, or weight, it has in the averaging process). Inverse distance weighting - each point has a weight equal to the inverse of its distance to the point to be classified (neighboring points have a higher vote), or the inverse of the square of the distance. Kernel functions
(Gaussian kernel, tricube kernel). If we change the distance function, the results will change. Implication: with distance weighting, k = n is no longer equivalent to a majority-based classifier. Computing distance in k-NN: Euclidean: d = sqrt((x2-x1)^2 + (y2-y1)^2); Manhattan: |x2-x1| + |y2-y1|. K is a positive integer and determines model complexity; the model is the decision boundary that separates the classes; in regression the model is the line that fits the data; smaller k leads to a more complex decision boundary. k too low = danger of overfitting, high complexity; k too high = danger of underfitting, low complexity. How to choose k? Typically odd for an even number of classes; as k decreases, accuracy and computational complexity increase. Rule of thumb: k = sqrt(n).
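A sketch of k-NN classification with uniform vs distance-based voting (dataset and k are illustrative choices):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for weights in ["uniform", "distance"]:            # equal votes vs inverse-distance weighted votes
    knn = KNeighborsClassifier(n_neighbors=15, weights=weights)
    knn.fit(X_train, y_train)                      # "training" only stores the data
    print(weights, knn.score(X_test, y_test))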
Nearest Centroid: with many points around, group them into different categories and choose one special point from each category - this is a centroid. Take the nearest one and compare how close you are to it: find the mean of each cluster and compare how close the test data is to them. The class whose centroid it is closest to, in squared distance, is the predicted class for the new sample. Nearest Shrunken Centroid: each class centroid is "shrunk" towards the overall centroid for all classes by a specified threshold amount, moving towards zero and becoming zero if it hits zero. Example: with a threshold of 2.0, centroids are adjusted accordingly; e.g., 3.2 becomes 1.2, -3.4 becomes -1.4, and 1.2 becomes zero. The new sample is classified with the shrunken class centroids. Distance metrics: Euclidean distance & Manhattan. This method can potentially enhance classifier accuracy by reducing noisy genes and does automatic gene selection.
KNN regression = use k-NN to fit the data: takes the distances and fits the best line on them; you can weight your regressor based on distance; it tries to fit outliers. k-NN classification combines the discrete predictions of the k neighbours; k-NN regression combines continuous predictions, fitting the best line between the neighbors. Adv: learning process cost = 0, no assumptions about the characteristics of the concepts to learn have to be made, complex concepts can be learned by local approximation using simple procedures. Disadv: can't be interpreted, computationally expensive with large datasets, performance depends on the number of dimensions. Regression: supervised learning to predict outputs (y); need: continuous or categorical input features (x), training examples (many x for which y is known), a model (a function that represents the relationship between x & y), a cost function (how well our model approximates the training examples), optimization (finding patterns while minimizing the loss function). Linear R (ordinary least squares): regression!!! Given x predict y, assume the relation y = mx + e, where m is a parameter and e is measurement noise. Goal: estimate m from training data for x and y; the most common approach: minimize the least squares error (LSE) - minimize the distance between the measurements and the regression line, easy to compute. If the line doesn't pass through the origin, introduce a bias (intercept) term b; the parameter b captures the difference between the values of y and their estimates (m*x); it increases complexity, more degrees of freedom. Applications: stock price prediction, housing pricing. Adv: explainable, fast training, interpretable results via its output coefficients. Disadv: assumes linearity, sensitive to outliers, underfits with small, high-dimensional data. Curve
fitting: finding a mathematical function that has the best fit on a series of data points. Smoothing: not looking for an exact fit but for a curve that fits the data approximately. Linear: y = ax + b (1st-degree polynomial) - fits a curve to 2 points; y = ax^2 + bx + c (2nd-degree polynomial) - 3 points; y = ax^3 + bx^2 + cx + d (3rd) - 4 points. Perfect fit: goes through all points; the best fit may not be the perfect fit, it should give the best predictive value; overfitting -> high-degree polynomial. Fit: accuracy of a predictive model; a predictive model close to the observed values; for regression models R^2 = 1 - (SSR/SST). Underfitting: the induced model is not complex enough to model the data, performs badly on training and validation set. Overfitting: the induced model is too complex to model the data (tries to fit the noise), performs better on the training set than on
the validation set. Regularization: overcome overfitting in regression by reducing the magnitude/strength of the coefficients. We don't want overfitting => we limit the variation of the parameters to prevent extreme fits to the training set. Ridge R: reduce model complexity by coefficient shrinkage; the penalty term is controlled by alpha - through alpha we are basically controlling the penalty term; a higher value of alpha -> bigger penalty. The magnitude of the coefficients is reduced. Can't perform feature selection. Lasso R: Least Absolute Shrinkage and Selection Operator; the magnitude of the coefficients is reduced even at small values of alpha; lasso reduces some of the coefficients to 0 - this property is known as feature selection, which is absent in the case of ridge regression. Applications: house price prediction, clinical outcome prediction based on health data.
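A sketch of the shrinkage behaviour: Ridge keeps all coefficients while Lasso zeroes some out as alpha grows (alphas and dataset are illustrative):
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

for alpha in [0.01, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    print(alpha,
          np.abs(ridge.coef_).sum().round(1),   # Ridge shrinks coefficients but keeps them non-zero
          int((lasso.coef_ == 0).sum()))        # Lasso drives some coefficients exactly to zero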
Gradient Descent: m is a parameter, e.g., your weights, biases and activations; η is the learning rate (eta), sometimes called alpha or gamma; J is formally known as the objective function, but most often the cost/loss function. Minimize the loss function J(m) by moving the parameters m in the opposite direction of the gradient of J(m). Logistic R: classifier!!! Uses calculated logits (scores) to predict the target class. Replace the sign(.) in a linear function with a sigmoid or logistic function. Sigmoid Function: assumes a particular functional form (a sigmoid) applied to the linear function of the data. The output is a smooth and differentiable function of the input and the weights. Logistic regression assumes a particular functional form (a sigmoid) applied to the linear function of the data; 1 parameter per data feature plus the bias; features can be discrete or continuous; the output of the model is between 0 and
1.Decision boundary(for classification):single line/contour which separate data points into regions.Probabilistic Interpretation:can be used to model class probability.Loss function:our goal in training is to find the best set of weights and biases that minimizes the loss function.Entropy:amount of information,level of uncertainty.Instead of loss function, we use logarithmic function.Cross Entropy:logarithmic
loss; the predicted class probability is compared to the actual class for output 0 or 1; the score penalizes the predicted probability based on how far it is from the actual value. Algorithm for Logistic R: 1. choose a step size, 2. start with a guess for w, 3. for all features j = 1...N, set (update) the weights, 4. Is there an improvement in the log-likelihood score? 5. If no: stop; if yes: back to step 3. L1 (Lasso) regularization: encourages sparsity, adds the "absolute value of magnitude" of the coefficients as a penalty term; if alpha is zero, we get back the original linear regression; if alpha is very large, almost all coefficients are zero, so it will lead to underfitting. L2 (Ridge) regularization: encourages small weights, adds the "squared magnitude" of the coefficients as a penalty term; if alpha is zero we get back the original model; if alpha is very large, too much weight is added to the penalty -> underfitting. Adv: easily extended to multiple classes, probability distributions available, quick to train, resistant to overfitting, can interpret model coefficients, good accuracy for many simple datasets, multiclass classification problems. Disadv: linear decision boundary, may overfit small high-dimensional datasets; Log reg -> discrete output, loss function -> MSE. Lin Reg: the residual variance is computed as the sum of the squares of the y-coordinates from the data minus the y-coordinates predicted by linear regression.
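A minimal numpy sketch of this idea - fitting slope m and intercept b by gradient descent on the mean squared residual (the learning rate and iteration count are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 200)   # synthetic data with known slope and intercept

m, b, eta = 0.0, 0.0, 0.01                  # initial guesses and learning rate
for _ in range(2000):
    residuals = (m * x + b) - y             # predicted minus actual
    m -= eta * 2 * np.mean(residuals * x)   # step opposite the gradient w.r.t. m
    b -= eta * 2 * np.mean(residuals)       # step opposite the gradient w.r.t. b
print(m, b)                                 # should end up close to 3 and 2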
The coefficients of the least squares regression line are determined by minimizing the sum of the squares of the residuals (y-coordinates - slope*x-coordinates). Gradient descent algorithm: 1. Initialize random weights and bias, 2. Pass an input through the network and compute predicted values from the output layer, 3. Calculate the error between the actual value and the predicted value, 4. Go to each weight which contributes to the error and change its respective value to reduce the error, 5. Reiterate until you find the best weights of the network. Conventional ML: Data -> Feature extraction -> Mapping features to class labels -> Decision; more features lead to more info, but curse of dimensionality: increased complexity, but not performance. Limit of linear classifiers: they classify input based on a linear combination of features; many decisions involve non-linear functions of the input (e.g. classes that can't be separated by a straight line +/-); slope = constant. Non-linear models with complex features: Data -> simple features -> advanced/complex features in deep layers -> mapping features to the labels -> Decision (the algorithm does it itself); we need to tune the parameters; the parameters we need to tune ourselves are called hyperparameters; CNNs for computer vision tasks, speech recognition (highly nonlinear functions). Adv: can compute complex relationships between input features and the target variable; decision trees, SVMs; more flexible, learning interactions and structure. Limit: computationally expensive, require more data to train, less interpretable, slope: changing (hope never to see in a test). Goal: to construct non-linear classifiers that utilize functions of the input variables. Neural network approach: use a large number of simpler functions; the functions
are fixed (Gaussian, Sigmoid, ReLU, polynomial); optimization involves linear combinations of these fixed functions. Layers: Input (1), Hidden (many), Output (1). Artificial Neural Network: neural networks define functions of the inputs (hidden features), computed by neurons; artificial neurons = units. Neural Network Architecture (Multi-Layer Perceptron): each unit computes its value based on a linear combination of the values of the units that point into it, plus an activation function. Has additional layers; to optimize the weights - gradient descent. MLPs consist of an input, one or more hidden, and an output layer. Each layer contains perceptrons (units), with connections (weights) between them. MLPs with 1 hidden layer are most common due to simplicity; however, adding more hidden layers might improve the network's ability to capture intricate patterns in the data. MLPs overcome the limits of SLPs by introducing non-linear activation functions and multiple hidden layers, enabling them to approximate non-linear functions and solve more complex ML problems; optimize hyperparameters through trial and error on a validation set. Representational power: a NN with at least 1 hidden layer is a universal approximator (can represent any function); the capacity of the network increases with more hidden units and more hidden layers.
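A minimal MLP classifier sketch (layer sizes and other hyperparameters are illustrative):
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32, 16),  # two hidden layers
                    activation="relu",
                    solver="adam",                # gradient-based optimization
                    max_iter=2000,
                    random_state=0)
mlp.fit(X_train, y_train)                         # forward + backward passes over many epochs
print(mlp.score(X_test, y_test))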
NN components: input layer (x) - the independent variables; an arbitrary number of hidden layers; output layer (y) - the dependent variable; a set of weights (coefficients) and biases at each layer (W and b); a choice of activation function for each layer (σ). Training NNs: 2 stages: forward propagation (feedforward) and backpropagation. Forward: input data is fed into the neural network, multiplied by weights, passed through activation functions layer
by layer, and the final output is produced and compared to the actual target values to compute the loss or error.calculating the predicted output(y) Backpropagation updates the neural network's weights to minimize the error between predicted and actual target values by computing error gradients and adjusting weights in the opposite direction of the gradient, iterating over multiple epochs with forward
and backward passes to optimize performance.updating the weights and biases.Both The process iterates until the network converges to minimize the loss function and produce accurate predictions, with techniques like regularization, dropout, and batch normalization improving training and preventing overfitting. The forward pass performs inference, and the backward pass updates weights and biases to
reduce error using gradient computation and the chain rule of calculus.Activation fun:applied on the hidden layers,achieve nonlinearity.Popular activation func:Forward pass:performs inference, where input data is fed into the network, and activations are computed layer by layer until the output layer is reached.The output of each neuron is computed as a weighted sum of its inputs passed through an
activation function (e.g., the sigmoid function for hidden layers). After that the error or loss is computed. Backward pass: performs learning - change w and b to reduce the error, using the chain rule on the derivative of the loss function. Gradient descent: updating the weights and biases by increasing or reducing them. Deep learning: multiple hidden layers; can learn a hierarchical feature representation: the first layer extracts some simple
features,more abstract,high-level rep as you go deeper into the layer,the extracted feat are used in the end for classification. MLPs are versatile and effective for general-purpose machine learning tasks, they process the entire input data globally without focusing on local patterns or structures, making them suitable for tasks where input data exhibits simple relationships and does not have spatial or temporal
dependencies, such as tabular data or basic pattern recognition tasks. In contrast, CNNs excel in tasks involving spatial data by leveraging parameter sharing, local information processing, and spatial hierarchies to learn complex features and patterns from images, videos, and other spatial data formats. CNNs automatically learn hierarchical representations of the input data, capturing local features and
gradually combining them to learn global representations, which makes them particularly effective for understanding the content of spatial data. Convolution layer:connect each hidden unit to a small input patch and share the weight across space.the network is called convolutional network.Filters its input to reveal patterns. Pooling layer:by “pooling” filter responses at diff locations we gain robustness to
the exact spatial location of features. Segmentation: input -> pre-processing -> CNN -> post-processing -> output. Parameters and Hyperparameters: for KNN - no parameters, it doesn't do training. Validation set: to tune the hyperparameters; NEVER tune on the test set! Classification problems: irregular boundaries, irregular distributions, imbalanced training sizes, outliers. SVMs can be used for classification and regression; an SVM is a discriminative classifier formally defined by a separating hyperplane and uses only support vectors. Linear SVM focuses on boundary points instead of fitting all the points. Goal: learn a boundary that leads to the largest margin (buffer) from points on both sides. Support vectors = the subset of vectors that support (determine) the boundary; after training an SVM, a support vector is any instance located on the margin; the decision boundary is entirely determined by the support vectors. Any points that are not support vectors have no influence whatsoever; you could remove them, add more points, or move them around, and as long as they stay off the street they won't affect the decision boundary. Decision Function: by computing it, a linear SVM classifier predicts the class of a new instance x; it takes a dataset as input and gives a decision as output; result > 0 -> predicted class (y) is 1 (+), else -> -1 (-); inputs between the margins are of unknown class. Linear SVM classifier - Decision Function: for a dataset of 2 features, the decision function is a 2D plane. Decision boundary - the set of points where the decision function = 0. Dashed lines represent points where the decision function is 1/-1; they are parallel and at equal distance to the decision boundary, forming a margin around it. In training a margin-based classifier like SVM, the
goal is to maximize the margin between data points and the hyperplane while ensuring correct classification through optimization methods like projective gradient descent. By halving the slope, the points where the decision function equals ±1 are positioned farther from the decision boundary, optimizing the classification process. Hard margin SVM aims to find a hyperplane that perfectly separates classes
without allowing any misclassifications, which works well only when the data is linearly separable; however, it's sensitive to outliers and doesn't generalize well. Soft margin SVM introduces a margin of tolerance, allowing for some misclassifications to find a more robust hyperplane, especially when the data is not perfectly separable. It incorporates slack variables to handle misclassifications and balances
between maximizing the margin and minimizing errors using a regularization parameter C.Hinge Loss:Loss function that incorporates a margin of distance from the classification boundary into the cost calculation., penalizes misclassified samples and correctly classified ones that are within a defined margin(smaller w larger mar) from the decision boundary. Linear Soft-Margin SVM loss fun:decision fun,cost
fun:average of your loss func over the entire training set,loss fun(or error):what you are trying to min for a single training example to achieve your objective .Kernel SVMs:One way to make a linear model more flexible by adding more feat.example:by adding interactions or polynomials of the input features.when we transform back this line to original plane,it maps to ellipse boundary.these transformations
are called kernels; they allow for complex decision boundaries, even with few features; work well on low-dimensional and high-dimensional data; important parameters: regularization C, choice of kernel & kernel-specific parameters. The RBF kernel has only 1 parameter, gamma, the inverse of the width of the Gaussian kernel. C and gamma both control the complexity of the model and should be adjusted together. A kernel function is a similarity function; it maps low-dimensional data to high-dimensional data. Disadv: require careful preprocessing, hard to inspect, difficult to understand, don't scale very well with the number of samples, challenging runtime and memory usage, originally designed as a 2-class classifier. SVM similarity function: a technique to tackle nonlinear problems by adding features computed using a similarity function that measures how much each instance resembles a particular landmark. Parameters for the Gaussian RBF kernel: gamma = bandwidth, relates to the scaling of the data. SVM hyperparameters: Linear SVM - C (regularization); Polynomial SVM - C (regularization), d (polynomial degree); RBF SVM - C (regularization), gamma (width of the kernel - only the points close to the hyperplane influence the model).
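A sketch of an RBF-kernel SVM where C and gamma are set together (values are illustrative, not tuned):
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
print(svm[-1].support_vectors_.shape)  # only these points determine the decision boundary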
SVM for regression: the goal is to find a function that deviates by at most a given amount (ε) from the actually obtained targets for all the training data and is at the same time as flat as possible; this allows us to fit nonlinear models to data while considering both smoothness and error, making them versatile tools for various regression tasks. Intro to Prob.: Random variable: an element/event whose status is unknown. Domain: the set of values a random variable can take: Binary (will the stock market go up), Discrete (number of times NL qualified for the WC), Continuous (% change in a stock price). Set theory: P(A ∪ B) = P(A) + P(B) - P(A ∩ B). Conditional prob. P(A=1|B=1): A is true given B is true. Chain rule: a joint distribution can be specified in terms of conditional probabilities. H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X). Types of Classifiers: Instance-based classifiers - use observations directly without models, e.g. K nearest neighbors. Generative: build a generative statistical model, e.g. Bayes classifiers. Discriminative: directly estimate a decision rule, e.g. decision trees. Bayes Rule: p(y|x) = (p(x|y)*p(y))/p(x). Naive Bayes Classifier: pick the class with the highest posterior probability. Discrete example: probability that you will pass the exam if your teacher is Larry: P(x) = P(Larry) = 28/80 = 0.35, P(c) = P(yes) = 56/80 = 0.7, P(x|c) = P(Larry|yes) = 19/56 = 0.34, P(y|x) = 0.34 * 0.7 / 0.35 = 0.68 =>
68%.
Bernoulli Naive Bayes Classifier: assumes binary features. Classify whether a person with [Confident=Yes, Studied=Yes, Sick=No] passes or fails an exam. The prior class probabilities are: P(Pass) = 3/5, P(Fail) = 2/5. Next, we calculate the joint probability of the given features: P(x) = P(Confident=Yes) × P(Studied=Yes) × P(Sick=No) = (3/5) × (3/5) × (2/5) = 0.144. We then compute the likelihood of each feature under both possible outcomes (Pass and Fail). For the Pass outcome: P(Confident=Yes|Result=Pass) = 2/3, P(Studied=Yes|Result=Pass) = 2/3, P(Sick=No|Result=Pass) = 1/3, so the joint probability for Pass is P(X|Result=Pass) × P(Result=Pass) = (2/3) × (2/3) × (1/3) × (3/5) ≈ 0.089. For the Fail outcome: P(Confident=Yes|Result=Fail) = 1/2, P(Studied=Yes|Result=Fail) = 1/2, P(Sick=No|Result=Fail) = 1/2, so the joint probability for Fail is P(X|Result=Fail) × P(Result=Fail) = (1/2) × (1/2) × (1/2) × (2/5) = 0.05. Calculate the posterior probabilities: P(Result=Pass|X) = 0.089/0.144 ≈ 0.62 and P(Result=Fail|X) = 0.05/0.144 ≈ 0.35. Since P(Result=Pass|X) > P(Result=Fail|X), the student is more likely to pass given the conditions.
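A tiny script that reproduces the hand calculation above:
p_pass, p_fail = 3/5, 2/5                      # priors
p_x = (3/5) * (3/5) * (2/5)                    # product of the feature marginals = 0.144
joint_pass = (2/3) * (2/3) * (1/3) * p_pass    # likelihoods given Pass times the prior
joint_fail = (1/2) * (1/2) * (1/2) * p_fail    # likelihoods given Fail times the prior
print(joint_pass / p_x, joint_fail / p_x)      # ~0.62 vs ~0.35 -> predict Pass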
Adv: simple, just doing a bunch of counts, works well with small training data; the class with the highest probability is considered the most likely class. Disadv: parameter estimation - determination of p(x|y) when there are not enough cases of label y. Decision trees: take one feature at a time and test a binary condition, for instance: is the feature larger than 0.5? If yes, grow a node to the left, if no, to the right. Decision trees classify data by using internal nodes to test feature conditions, branching based on the best attribute value, and leaf nodes representing classification outcomes, with decision boundaries perpendicular to the feature axes. Algorithm: choose an attribute on which to descend at each level, condition on earlier choices; generally, restrict only one dimension at a time; declare an output value when you get to the bottom. To construct a useful
decision tree: 1. start from an empty decision tree, 2. split on the next best attribute (determined using information theory, the Gini coefficient, conditional entropy), 3. repeat. Properties: H is always non-negative; if X and Y are independent, then X doesn't tell us anything about Y: H(Y|X) = H(Y); but Y tells us everything about Y: H(Y|Y) = 0; by knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y). Limit size: pick max_depth (increase -> overfit), max_leaf_nodes, min_samples_split. Decision tree regression: regression vs classification - both partition the data. Criteria for classification: Gini, cross entropy, information gain. Criterion for regression: weighted mean squared error. Adv: suitable for multi-class classification, the model is easily interpretable, can handle numerical and categorical data, non-linear, can tolerate missing values. Disadv: prone to overfitting without pruning.
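A small decision-tree sketch with a depth limit to curb overfitting (dataset and max_depth are illustrative):
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))  # the learned axis-aligned splits are directly readable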
Weak learners: a single decision tree does not make great predictions; multiple trees can be combined to create stronger ensemble models. Generalization error = bias + variance + noise (typically a trade-off in relation to model complexity). Ensemble methods: try to reduce the bias and/or variance of weak models by combining several of them together to achieve better performance. Simple (weak/base) models: logistic regression, Naïve Bayes, KNN, decision trees, kernel SVM - these models perform not so well by themselves because they have high bias or too much variance to be robust. Ensemble methods: simple models used as building blocks for designing more complex models by combining several of them through voting, bagging, boosting, stacking. Voting: build different models; classifiers that are most "sure" will vote with more conviction; a classifier
will be most "sure" about a particular part of the space; average the results; more models are better if they are uncorrelated; works with neural networks; can average any models as long as they provide calibrated probabilities. Bagging (Bootstrap Aggregation): reduces variance; a generic way to build "slightly different" models: draw bootstrap samples from the dataset, train many models on the bootstrapped data, then take the average (e.g. RF) - it trains several independent base models and averages their predictions. Boosting: reduces bias; given a weak model, run it multiple times on (reweighted) training data, then let the learned classifiers vote (gradient boosting); an ensemble model with lower bias; several instances of the same base model are trained sequentially. At each iteration, the way to train the current weak learner
depends on the previous weak learners and especially on how they are performing on the data. Stacking: trains many models in parallel and combines them by training a meta-model to output a prediction based on the different weak models' predictions. It learns several different (heterogeneous) weak learners; the output of the base models is used as input to train a meta-model that outputs the final predictions. Classification example: choose some learners: a KNN classifier, LogReg & SVM; choose a NN as the meta-model; the outputs of the 3 weak learners = input to the NN; output of the NN = final prediction. Fitting: 1. split the training data in 2 folds, 2. choose L weak learners and fit them to the data of the first fold, 3. for each of the L weak learners, make predictions for the observations in the second fold, 4. fit the meta-model on the second fold, using the predictions made by the weak learners as inputs. Limitation: only half of the data is used to train the base models and half to train the meta-model. Solution: a k-fold cross-training approach. Bootstrapping: generating samples of size B (called bootstrap samples) from an initial dataset of size N by randomly drawing B observations with replacement; requires a classifier or regression algorithm and a resampling technique. Can be used to evaluate variance or confidence intervals. Create multiple bootstrap samples - each new bootstrap sample acts as another independent dataset; fit weak learners on each sample and aggregate them (average the outputs). Regression: simple average; classification: simple majority vote (hard voting) or highest average probability (soft voting). Gini Index vs Entropy: Gini - less computationally expensive, limited to binary
classification. Entropy: multi-class classification, requires log calculations, works better in highly imbalanced cases. Random Forest: all trees = equal weights. Strong learners composed of multiple trees can be called forests: a bagging method where deep trees, fitted on bootstrap samples, are combined to produce an output with low variance. Shallow trees: less variance but higher bias, better for sequential methods. Deep trees: low bias but high variance, a better choice for a bagging method that is mainly focused on reducing variance. Randomize in 2 ways to reduce correlation: for each tree (bootstrap sample) / for each split (random sampling of features). Classification with RF: the mode of the classes outputted by the trees. Regression: the mean of the values outputted by the trees. Tuning: max_features, around sqrt(n_features) for classification, around n_features for regression.
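A random forest sketch using the rule of thumb max_features = sqrt (dataset and n_estimators are illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())  # accuracy averaged over 5 folds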
AdaBoost: add weak learners one by one, looking for the best possible pair; update the observation weights in the dataset; add the weak learner to the weighted sum according to an update coefficient that expresses the performance of this weak model: the better a weak learner performs, the more it contributes to the strong learner. Start with equal observation weights and iteratively fit weak models, update their coefficients, and adjust the observation weights to focus on misclassified instances, thereby refining the ensemble model. Every classifier has a different weight on the final product; boosting occurs sequentially. Gradient Boosting: additive model; misclassifications are identified; weak learners are combined step by step to form a strong one; introduces the learning rate; many shallow trees, small-size models; uses one-vs-one for multiclass. Gradient Boosting regression: 1. make an initial prediction by calculating the average of the target, 2. compute the residuals, 3. combine the predicted value with the residuals (* learning rate), 4. compute the new residuals, 5. compute the new predicted values. Gradient Boosting with Decision trees: iteratively fits weak learners to the pseudo-residuals (the opposite of the gradient) of the current model, updates the model with a step size determined by the learning rate, and adjusts the pseudo-residuals to refine the ensemble, optimizing for log loss in classification or squared loss in regression.
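A gradient boosting regression sketch - many shallow trees added with a small learning rate (all hyperparameters are illustrative):
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=2, random_state=0)
gbr.fit(X_train, y_train)          # each shallow tree is fit to the current residuals
print(gbr.score(X_test, y_test))   # R^2 on held-out data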
When to use tree-based models: non-linear relationships; a single tree is very interpretable (if small); random forests are very robust, a good benchmark. Bias: how far, on average, does the model's prediction deviate from the correct value? Variance: how far apart are the model's predictions? Underfitting: high bias & low variance - the model's complexity is not sufficient to represent the intrinsic pattern of the data. Overfitting: low bias & high variance - the model complexity is extremely high and it is incapable of generalizing to unseen data. Low bias & variance - the sweet spot, correct and not noisy. High bias & variance - incorrect & noisy. Irreducible error: how noisy the data itself is: what is the variance of the target around its true mean? Reducible error: Bias^2: how much does the average of the estimates deviate from the true mean? Variance: what is the deviation of the estimates around their mean?
Cross-Validation: splitting the data into training/test sets multiple times. Each data point is in the test set exactly once. K-fold cross-validation: splits the data into k sets; the data is trained and tested k times using the different sets. Adv: less luck, shows sensitivity (high variance = high sensitivity). Disadv: increased computational power, imbalance. Stratified c-v: the class frequencies in each fold are the same as in the overall data, making sure there is no class imbalance in the different folds. LOO: k-fold c-v where k = N and N = the number of items in the data; time consuming; generates predictions given the maximum available data; useful to find regular & irregular items from the dataset's point of view. Adv: better estimate of model performance on small datasets; by utilizing all data for training & testing, more accurate generalization performance. Disadv: time & computationally consuming. Shuffle-split c-v: controls test & train size & the number of iterations; a stratified variant is also available.
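A sketch of tuning k for k-NN with grid search and stratified cross-validation, keeping a separate final test set (the parameter grid and dataset are illustrative):
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)  # final test set stays untouched

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
                    cv=StratifiedKFold(n_splits=5))
grid.fit(X_trainval, y_trainval)                 # tuning happens only on the train/validation part
print(grid.best_params_, grid.score(X_test, y_test))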
Cross-validation with groups involves splitting data so that groups relevant to the learning problem, such as different persons or patients, do not appear in both training and test sets, ensuring the model generalizes to new, unseen groups (emotion recognition). Tuning: improves a model's generalization performance by adjusting parameter values, often through simple grid search or grid search with cross-validation, but optimizing on the test set risks compromising its independence, necessitating an additional final test set. AUC: higher (closer to 1) is better, range 0 to 1. ROC: shows the false positive rate (FPR) & true positive rate (TPR); appropriate when the classes are balanced. Precision-recall curve: for imbalanced classes, considers all possible thresholds at once. Macro-average F1 provides an equal-weighted average F1 score across all classes, while weighted F1 adjusts for class imbalance by
weighting the mean of per-class F1 scores based on class support.(bigger classes are important) Micro-average F1 treats all samples equally, computing precision and recall over a single confusion matrix, regardless of class. Metrics for regression: coefficient of determination(R2)- measures the proportion of the variance in the dependent variable (target) that is predictable from the independent
variables(features); ranges from 0(bad) to 1(perfect); provides a clear indication of how well the model fits. MSE- calculates the average of the squared differences between predicted and actual values. Useful for penalizing large errors, sensitive to outliers. Mean Absolute Error-calculates the average of the absolute differences between predicted and actual values, influenced by the scale of the target
variable. Imbalanced data: one class is more frequent than others. Accuracy can be misleading because simple models always predict the majority class. Alternative metrics are needed to evaluate the efficiency of models on imbalanced data. Random undersampling: removing data points from the majority class randomly until the desired balance is achieved (class ratio close to 1:1). Advantages: speed; Disadvantages: data loss, info loss. Random oversampling: duplicating data points from the minority class randomly until balance is achieved (1:1 class ratio). Adv: balances classes, preserves info; Disadv: slow training, risk of overfitting. Edited nearest neighbours: targets noisy data; a technique to reduce the size of KNN algorithms; it identifies and removes data points from the majority class that are
likely noisy based on their nearest neighbors. Two removal strategies: mode-a point is removed if all its neighbours belong to a different class; all- a point is removed if any of its neighbours belongs to a different class. Adv.: reduced training time, improved classification, boundary cleaning (removes misleading outliers). Disadv.: data loss, parameter dependence (ENN depends on the value of k).
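A sketch of these resampling strategies, assuming the third-party imbalanced-learn (imblearn) package is installed; the synthetic dataset and 9:1 imbalance are made up:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                                 # roughly 9:1 imbalance

for sampler in [RandomUnderSampler(random_state=0),
                RandomOverSampler(random_state=0),
                EditedNearestNeighbours(),        # removes majority points whose neighbours disagree
                SMOTE(random_state=0)]:           # synthetic minority samples (described next)
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))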
SMOTE(Synthetic Minority Oversampling technique): for imbalanced data; it generates synthetic data points by interpolating(random number b/n 0 & 1 is multiplied by the difference b/n minority sample and its selected neighbour in the feature space) b/n minority class samples. Adv.: balances classes, preserves info, improves performance. Disadv.: overfitting risk, borderline issues (generated points
might be near to the decision boundary making generalization harder). The curse of dimensionality poses challenges in high-dimensional spaces, affecting data analysis and modelling. Nonparametric methods like histograms are particularly vulnerable. As dimensions increase, the number of bins required grows exponentially, leading to a less meaningful proximity b/n data points. Euclidean distance loses
discriminative power. Solutions would be adjusting parameters or using dimensionality reduction techniques. Challenges when analysing data: data sparsity-data point spread out more thinly when dimensions are increased; increased computational cost; overfitting. Classifier performance: tends to peak at certain number of features and degrades as the number of ft continues to increase, highlighting the
detrimental effect of dimensionality. For dimensionality reduction we could use feature selection - selecting a subset of relevant features focuses the model on the most informative data, discarding the rest (d-k) of the features; feature extraction - creates a new set of features that are combinations of the original ones; dimensionality reduction - preserves the essential info while lowering the dimensions; regularization - penalizes models with too complex decision boundaries, reducing the risk of overfitting. The curse of dimensionality and overfitting are closely related (example: classification), but the first occurs when there is not enough data for the number of features, whereas the second occurs when there are too many features and noise. Principal component analysis (PCA): unsupervised method; all principal components are orthogonal to each other; the maximum number
of principal components you can choose is less than or equal to the number of features; works with unlabeled data; searches for the directions in which the data has the largest variance. It's an unsupervised method that projects data from a high-dimensional space to a lower-dimensional space while preserving as much of the information as possible, using principal components (new features found by the model). Advantages: works on any data type, reduced complexity, improved interpretability, potential for better classification, focuses on capturing most of the variance in the data. Disadvantages: sensitive to outliers, assumes linear relationships between features, leading to ineffective capturing of non-linear ones. Computing PCA: 1. take the whole dataset & ignore class labels, because we focus only on the features; 2. compute the mean (average value across all data points) & covariance (captures the relationship between each pair of features; positive = the features move together; negative = they move in opposite directions); 3. center the data (subtract the mean) - removes the bias; 3.1. scaling to unit variance (optional step) - ensures all features contribute equally to the PCA analysis; 4. obtain eigenvectors (represent the directions of greatest variance in the data; they define the new axes (principal components) in the transformed feature space) & eigenvalues (represent the amount of variance captured by each eigenvector; the eigenvalues are in descending order, with the first eigenvector corresponding to the direction of greatest variance): two methods - Eigenvalue Decomposition (decompose the covariance matrix) & Singular Value Decomposition (SVD) (a more general
technique applicable to non-square matrices); 4.1. project the feature space: project the centered data points onto the chosen eigenvectors to obtain their principal component scores: X = U D V^T, where U = orthogonal matrix over the samples, D = diagonal matrix containing the singular values, V = orthogonal matrix over the features; dropping rows of V^T gives the dimensionality reduction; 5. sort the eigenvalues in descending order; 6. choose the top k eigenvectors corresponding to the k largest eigenvalues; 7. construct the projection matrix used to transform the data; 8. project the data with the projection matrix (the matrix of our concatenated top k eigenvectors). With more than 2 features, PCA rotates the coordinate system so that the first new principal component has the largest variance, the second one the second largest, and so on.
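A PCA sketch that keeps enough components to explain ~95% of the variance (the threshold and dataset are illustrative):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)          # centre (and scale) before PCA

pca = PCA(n_components=0.95, svd_solver="full")       # keep components until 95% explained variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(2))         # variance captured by each component, descending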
The number of principal components used depends on the data size, the complexity of the decision boundary and the classifier used (the total explained variance decides the number of features). The total explained variance should make the amount of lost info sufficiently small. Non-negative matrix factorization: decomposes non-negative data matrices (images, etc.) into lower-dimensional matrices; focuses on data where negative values don't make sense; reduces dimensionality for easier analysis & better model performance. It works by taking the data matrix V and factorizing it into W*H, where W holds basis vectors representing "parts" of the data (m rows, k columns) and H (the coefficient matrix) stores weights for combining basis vectors to reconstruct data points (k rows, n columns). Advantages: interpretable basis vectors (easier interpretation); part-based representation; applications in image & music analysis, text mining, etc. Disadvantages: not ideal for data with inherent negative values; the user-defined number (k) can affect results; no guaranteed unique factorization; can only be applied to non-negative data; non-convex optimization, requires initialization; slow on large datasets. NMF can be viewed as soft clustering: each point is a positive linear combination of weights.
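An NMF sketch on non-negative word counts, factorizing X ≈ W*H (documents and k are made up):
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apples and oranges", "oranges and bananas", "cars and trucks", "trucks and bikes"]
X = CountVectorizer().fit_transform(docs)    # non-negative word counts

nmf = NMF(n_components=2, init="nndsvda", random_state=0)  # k = 2 "parts", chosen by the user
W = nmf.fit_transform(X)     # per-document weights over the k parts
H = nmf.components_          # k basis vectors over the vocabulary
print(W.round(2))
print(H.round(2))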
Matrix factorization: X = AB, where X is the data matrix (n rows, p columns), A is the basis matrix (n rows, k columns), B is the coefficient matrix (k rows, p columns). Latent space: contains a hidden, compressed representation of the data; in ML terms it refers to a lower-dimensional space that captures the essential characteristics of the original, higher-dimensional data. NMF & PCA are latent space transformation techniques. Advantages: dimensionality reduction (allows easier analysis); efficiency; uncovering hidden structure (the latent space can reveal underlying patterns in the data). Applications: recommendation systems; image and text analysis; anomaly detection. Linear interpolation in image space: a mathematical method used to estimate the value of a data point between two known data points. In the context of digital images, it's a technique for approximating the color or intensity of a pixel that falls between existing pixels. Latent features are non-negative. Resizing example: when resizing an image digitally, you're changing the number of pixels in the image. If you enlarge an image (increase the number of pixels), linear interpolation creates new pixels by estimating the color or intensity values (weights) based on the
surrounding pixels in the original image. Limitations: blurring; not ideal for all images.Clustering:Goals:Data exploration: Are there coherent groups? How many groups are there?Data partitioning= divide data by group before further processing.Unsupervised feature extraction= Derive features from clusters or cluster distancesE.g. Clustering techniques,K-means, Hierarchical Clustering, Density Based
Techniques, Gaussian Mixture Models. K-means Clustering = separate n samples into k groups of equal variance (requires the number of clusters to be specified). Algorithm: 1. choose the number K, 2. randomly choose the initial positions of the K centroids, 3. assign each of the points to the "nearest centroid", 4. recompute the centroid positions, 5. if the solution converges -> stop, else go to step 3. Objective function for K-means: aims to find a local
minimum by minimizing squared distances between data points and cluster centroids. Additionally, it facilitates the assignment of new data points to cluster memberships based on the existing clusters' characteristics. Restrictions of Cluster shapes:Voronoi-diagrams of centers= a partition of a plane into regions close a given set of objects,always convex in space .Limitations:only simple cluster shapes,cluster
boundaries are determined from the middle of the centres; can't model covariances well in anisotropically distributed clusters (exhibiting properties with different values when measured in different directions). Computational Properties: by default K-means in sklearn does 10 random restarts with different initializations. For large datasets, K-means initialization may take much longer than clustering; consider using random initialization, in particular for MiniBatchKMeans, which uses mini-batches to reduce the computation time while still attempting to optimize the same objective function (partial_fit). MiniBatchKMeans: subsets of the input data (rather than the whole dataset), randomly sampled in each training iteration. Algorithm: 1. draw samples randomly from the dataset to form a mini-batch and assign them to the nearest centroid, 2. update the centroids by using a convex combination of the average of the samples and the previous samples assigned to that centroid, 3. perform 1 and 2 until convergence or for a fixed number of iterations. Feature extraction using k-means: cluster membership -> categorical features, cluster distances -> continuous features.
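A K-means sketch with its mini-batch variant (k, batch size and the synthetic blobs are illustrative):
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centroid positions
print(km.inertia_)           # within-cluster sum of squares (the objective being minimized)

mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=0).fit(X)
print(mbk.inertia_)          # usually close to the full K-means solution, computed faster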
Hierarchical Clustering: a series of partitions, from a single cluster containing all the data points to N clusters containing 1 data point each; gives a more holistic view; can help with picking the number of clusters; some linkage criteria may lead to imbalanced cluster sizes; fast with sparse connectivity; can restrict to an input "topology" given by any graph. Algorithm: 1. start with N independent clusters: {P1}, {P2}, ..., {PN}; 2. find the two closest (most similar) clusters, and join them; 3. repeat step 2 until all points belong to
the same cluster. Heavily influenced by: the distance metric (Euclidean, Manhattan, Maximum) and the linkage criterion, which determines how clusters are merged: the distance between 2 clusters is a function of the pairwise distances between their points, and the clusters that minimize this function are then combined. Linkage criteria are broadly equivalent for single-point clusters. Complete linkage: maximum distance between the farthest points in each cluster. Single linkage: minimum distance between the closest points in each cluster. Average linkage: average distance of all mixed pairs. Centroid linkage: distance between cluster centroids. Ward linkage: cluster variance (select clusters that maximize the decrease in variance), minimizes the sum of squared differences within all clusters, leads to more equally sized clusters. DBSCAN (Density-
Based Spatial Clustering of Applications with Noise): 2 important hyperparameters: epsilon (distance) and minPts (minimum number of points). Algorithm: 1. a point is picked at random, 2. select the next point randomly from the unvisited points, 3. count the number of points in the neighbourhood (more than minPts => a new cluster is formed), 4. count the number of points in each unvisited neighbourhood (> minPts => add to the cluster), 5. repeat - if < minPts => marked as noise. Density: the number of sample points within a specified radius r (epsilon). Core point: a sample with more than a specified number of points (min_samples) within epsilon (includes samples inside the cluster). Border point: has fewer than min_samples within epsilon, but is in the neighborhood of a core point. Noise point: any point that is not a core point or a border point. Adv: allows complex cluster shapes, can detect outliers, needs only two parameters to adjust; epsilon is hard to pick (can be done based on the number of clusters though); can learn arbitrary cluster shapes. Limitations: varying densities, high-dimensional data. Mixture models: generative model - find p(X). Mixture model assumption: the data is a mixture of a small number of known distributions; each mixture
component distribution can be learned "simply"; each point comes from one particular component. We learn the component parameters and the weights of the components. Gaussian Mixture Models: each component is a Gaussian distribution; there is a multinomial distribution over the components; non-convex optimization; alternately assign points to components and compute the mean and variance (EM algorithm); initialized with K-means, random restarts. Goal: create a parametric density model, allow testing how likely a new point is, clustering. GMM vs K-means: GMM assumes data is generated from a mixture of Gaussian distributions and performs soft clustering, making it flexible in handling clusters of various shapes and sizes, though it's computationally complex. In contrast, k-means assumes data can
be partitioned into k spherical clusters, performs hard clustering, is computationally simpler, but requires specifying the number of clusters in advance. Evaluating clustering results:Elbow plot:graphical tool used in k-means clustering to determine the optimal number of clusters (k) by visualizing the explained variance or within-cluster sum of squares (WCSS) across different k values. The X-axis represents
the number of clusters, and the Y-axis represents the explained variance or WCSS. The optimal k value is typically found at the "elbow point," where the curve sharply bends, indicating that adding more clusters beyond this point provides diminishing returns in terms of variance explained. Silhouette coefficient: metric used to evaluate clustering quality by considering both cohesion within clusters and
separation between clusters. It ranges from -1 to +1, with values closer to +1 indicating well-clustered points, values around 0 suggesting overlapping or indistinct clusters, and values closer to -1 indicating misclassified points. A higher average silhouette coefficient across all data points indicates a better clustering solution, making it a useful metric for assessing clustering performance alongside other
methods like elbow plots.
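A sketch comparing K-means, DBSCAN and a Gaussian mixture on two-moons data, scored with the silhouette coefficient (eps, min_samples and the other settings are illustrative):
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

results = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "dbscan": DBSCAN(eps=0.2, min_samples=5).fit_predict(X),        # -1 marks noise points
    "gmm": GaussianMixture(n_components=2, random_state=0).fit_predict(X),
}
for name, labels in results.items():
    if len(set(labels)) > 1:                                        # silhouette needs at least 2 labels
        print(name, silhouette_score(X, labels))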
Classification: Logistic Reg, NN (Multi-Layer Perceptron), Naïve Bayes, KNN, Decision Trees, Linear & Kernel SVM, Ensemble learning (Random Forests, Gradient Boosting); Dimensionality Reduction (PCA, NMF, t-SNE); Clustering (K-means)
Regression: Linear, Polynomial Reg, NN (MLP) Reg, Bayesian Ridge Reg, KNN Reg, Decision Trees Reg, Linear SVM Reg, Kernel SVM Reg, Ensemble methods (Random Forest Reg, Gradient Boosting Reg)