
Machine Learning with Python

ML: the science that gives computers the ability to learn without being explicitly programmed.

ML Techniques:

 Regression/Estimation: predicting continuous values
 Classification: predicting a class/category
 Clustering: finding the structure of the data, summarization
 Associations: finding frequently co-occurring items/events (e.g., products always bought together by the same client)
 Anomaly detection: discovering unusual cases
 Sequence mining: predicting the next event
 Dimension reduction: reducing the size of the data

Python libraries for machine learning:

 NumPy: to work with N-dimensional arrays
 SciPy: signal processing / optimization / statistics
 Matplotlib: 2D and 3D plotting
 Pandas: to import / manipulate / analyze data
 Scikit-learn: algorithms and tools for ML

ML Pipeline:

All of this process is covered by scikit-learn.
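A minimal sketch of such a pipeline in scikit-learn (the feature matrix X and target y below are synthetic, generated only for illustration):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 2)                                   # placeholder feature matrix
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * np.random.randn(100)   # placeholder target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

pipe = Pipeline([
    ("scale", StandardScaler()),    # preprocessing step
    ("model", LinearRegression()),  # estimator step
])
pipe.fit(X_train, y_train)          # runs every step in order
print(pipe.score(X_test, y_test))   # R^2 on the held-out data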

Supervised learning:

Teach the model by training it with a labeled dataset.

labeled dataset -> classes

Supervised techniques: classification and regression

Unsupervised learning:

The model trains on the dataset and draws conclusions on unlabeled data.

Unsupervised techniques: dimension reduction, clustering, density estimation, market basket analysis

Regression

To predict a continuous value.

Variable types in regression:

Independent (x: the explanatory variable, the cause of y)

Dependent (y: the state, target, final goal to study)

Types of regression (depending on the number of x variables):

 Simple regression (linear and non-linear), e.g., predict CO2 using ENGINESIZE
 Multiple regression (linear and non-linear), e.g., predict CO2 using ENGINESIZE and CYLINDERS

Regression algorithms: Poisson / linear / neural network / decision forest / boosted decision tree / k-nearest neighbors

Simple Linear Regression

X can be continuous or categorical.

Y is always continuous.

Y = b + ax

a and b are the parameters to adjust (the coefficients of the fitted line), with a the gradient (slope) and b the intercept.

MSE (mean squared error) measures the margin of error.
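A small sketch of how a and b can be estimated by least squares (the x and y arrays are made-up numbers, just for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical engine sizes
y = np.array([2.1, 4.2, 6.1, 7.9])   # hypothetical CO2 values

# Least-squares estimates of the slope a and intercept b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
print(a, b)   # the fitted line is y_hat = b + a * x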

Model Evaluation in Regression Models

After building a model we should evaluate it.

Accuracy of a model: how much we can trust this model / the performance of the model.

Evaluation approaches:

Train and test on the same dataset:

Compare actual values with predicted values to measure accuracy.

Training accuracy: % of correct predictions on the dataset the model was trained on. Warning: a very high value could mean the model is overfit.

Out-of-sample accuracy: % of correct predictions on data that the model has not been trained on.

Train/test split:

This method gives a better estimate of out-of-sample accuracy.

K-fold cross validation:

Split the data into K folds, train on K-1 folds and test on the remaining one, rotating through all the folds, then average the accuracies. (See the sketch below.)
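A sketch of both approaches with scikit-learn (the data is synthetic, generated only for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = np.random.rand(100, 1)
y = 2.5 * X[:, 0] + 0.1 * np.random.randn(100)

# Train/test split: estimate out-of-sample accuracy on one held-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Hold-out R^2:", model.score(X_test, y_test))

# K-fold cross validation: average the score over K = 5 different splits
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("5-fold mean R^2:", scores.mean())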

Evaluation Metrics in Regression Models


Error: the difference between a data point and the trend line generated by the algorithm.

Metrics: MSE / RMSE (interpretable in the same units as y) / RAE / RSE

Split the DataFrame cdf into two sets: train (80% of the data) to train the model, and test (20% of the data) to evaluate the model's performance:

import numpy as np

msk = np.random.rand(len(cdf)) < 0.8   # random boolean mask, ~80% True
train = cdf[msk]    # training subset
test = cdf[~msk]    # test subset

from sklearn import linear_model

regr = linear_model.LinearRegression()   # create a linear regression model

# Prepare the training data
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])

regr.fit(train_x, train_y)   # train the model

# Print the coefficient a and the intercept b
print('Coefficients: ', regr.coef_[0][0])   # a
print('Intercept: ', regr.intercept_[0])    # b

# Test the model
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)
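Continuing the sketch above, the test predictions can then be scored with standard regression metrics from sklearn.metrics:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('MAE: %.2f' % mean_absolute_error(test_y, test_y_))
print('MSE: %.2f' % mean_squared_error(test_y, test_y_))
print('R2:  %.2f' % r2_score(test_y, test_y_))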

Multiple Linear Regression:

Y = b + a*x1 + c*x2 + ...

In vector form: Y = zX

z = [b, a, c, ...]: the weight vector of the regression (the parameters to optimize)

X = [1, x1, x2, ...]; the fitted "line" is called a hyperplane.

The objective is to minimize the MSE; to do that we must find the best z.

How to find z:

Linear algebra (closed-form solution; a complex and time-consuming question for large data)

An optimization algorithm (gradient descent: start with random values for the coefficients, compare the errors, and iteratively find the best values; see the sketch below)
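A bare-bones sketch of gradient descent on the MSE for a single feature (the data, learning rate, and iteration count are arbitrary illustrative choices):

import numpy as np

x = np.random.rand(100)
y = 4.0 * x + 1.0 + 0.1 * np.random.randn(100)   # synthetic data with a = 4, b = 1

a, b = 0.0, 0.0    # start from arbitrary coefficient values
lr = 0.1           # learning rate
for _ in range(1000):
    y_hat = b + a * x
    grad_a = -2 * np.mean((y - y_hat) * x)   # dMSE/da
    grad_b = -2 * np.mean(y - y_hat)         # dMSE/db
    a -= lr * grad_a
    b -= lr * grad_b
print(a, b)        # should approach 4 and 1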

Chapter 3: Classification

A supervised learning approach.

Learns the relationship between a set of features and a target class.

The target attribute in classification is a categorical variable with discrete values.

Types of classifiers: decision trees - k-nearest neighbors - logistic regression - neural networks

K-Nearest Neighbours (KNN)

 Finds the closest cases and assigns their class label to our case.
 The distance between two points is a measure of their dissimilarity; it can be computed with the Euclidean distance.

How it works:

 Pick a value for k
 Calculate the distance of the unknown case from all other cases in our dataset
 Select the k observations nearest to our unknown case
 Predict the result using the most popular response value from the k nearest neighbors (the k observations)

How do we calculate the distance of the unknown case from all other cases in our dataset?

 Euclidean distance: d(p, q) = sqrt(sum_i (p_i - q_i)^2)

What is the best value of k for KNN?

k is the number of nearest neighbors to examine.

If k is very low: bad predictions -> overfitting

If k is very high: the model is overly generalized

To find the best k (see the sketch after this list):

 Reserve a part of the data for testing the accuracy of the model
 Start with k = 1 and calculate the prediction accuracy using the test set
 Repeat the process, increasing k, until you find your best k
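A sketch of that search using scikit-learn's KNeighborsClassifier (the dataset is synthetic, generated only for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_acc = 1, 0.0
for k in range(1, 11):                                            # try k = 1..10
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_test, y_test)                               # accuracy on the reserved test set
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)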

We can also use KNN to predict a continuous value (take the average of the values of the nearest neighbors as the prediction for the new case).

Evaluation Metrics in Classification:

These metrics describe the performance of a model.

Compare the values of the test set with the values predicted by the model.

Jaccard index

(y: actual labels, y': predicted labels)

Jaccard is the size of the intersection divided by the size of the union of the two label sets.

F1-score

Another way to verify accuracy, based on the confusion matrix built from the test set:

Rows show the actual (true) labels.

Columns show the values predicted by the model.

F1 = 2 * (precision * recall) / (precision + recall)
Log loss

Measures the performance of a classifier where the predicted output is a probability value in [0, 1].

Per-sample equation: y*log(y') + (1-y)*log(1-y'), with y the true value and y' the predicted probability. This measures how far the predicted value is from the actual label.

Overall: LogLoss = -1/n * sum( y*log(y') + (1-y)*log(1-y') )
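All of these metrics are available in sklearn.metrics; a quick sketch with dummy labels (the values below are made up for illustration):

from sklearn.metrics import jaccard_score, f1_score, confusion_matrix, log_loss

y_true = [1, 0, 1, 1, 0, 1]                 # actual labels
y_pred = [1, 0, 0, 1, 0, 1]                 # predicted labels
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7]     # predicted probabilities for class 1

print(jaccard_score(y_true, y_pred))        # intersection over union of the label sets
print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))     # rows: actual labels, columns: predicted labels
print(log_loss(y_true, y_prob))             # -1/n * sum(y*log(y') + (1-y)*log(1-y'))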

Decision Trees

Built by testing an attribute and branching the cases based on the result of the test.

Each internal node corresponds to a test, each branch corresponds to a result of the test, and each leaf node assigns a class.

How to build a decision tree:

1. Choose an attribute from your dataset.
2. Calculate the significance of the attribute in the splitting of the data (to see whether it is an effective attribute or not).
3. Split the data based on the value of the best attribute.
4. Go back to step 1 for each branch.

Note: a node is pure if 100% of its cases belong to the same category.

Entropy: the amount of randomness or uncertainty in the data. For a binary split: Entropy = -p(A)*log2(p(A)) - p(B)*log2(p(B)).

In decision trees we look for the trees with the lowest entropy in their nodes.

Entropy = 0 is the best.


Example 1: compute the entropy for cholesterol. 6 patients chose drug B when cholesterol is normal:

=> E = 0.811 (which corresponds to a 6-to-2 split: -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811)

Example 2: the same computation for the sex attribute.

Between cholesterol and sex, which is the best attribute to choose? => the tree with the higher information gain after splitting.

Information gain (IG):

Information gain is the increase in the level of certainty after splitting.

Lower entropy => higher information gain.

IG = (entropy before the split) - (weighted entropy after the split)

Example for sex: IG = 0.940 - [(7/14)*0.985 + (7/14)*0.592] = 0.151
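The same computations in Python (the numbers are taken from the examples above):

import numpy as np

def entropy(p):
    # Entropy of a binary split, given the proportion p of one class
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(entropy(6 / 8))                             # cholesterol example: 0.811
ig_sex = 0.940 - (7/14 * 0.985 + 7/14 * 0.592)    # IG for the sex attribute
print(ig_sex)                                     # about 0.151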

Chapter 4

Support Vector Machine (SVM)

 Used for classification
 A supervised algorithm
 Classifies cases by finding a separator
 Data that cannot be separated by a line may be separable by a curve

1. Map the data to a high-dimensional feature space
2. Find a separator

Transforming the data:

 Kernelling: mapping the data to a higher-dimensional space
 The mathematical function used to do this is called a kernel and can be of different types (linear, polynomial, RBF, sigmoid); these methods are included in ML libraries
 To know which function performs best with our dataset, we test and compare

How to find the separator:

SVM is based on the idea of finding the hyperplane that divides the dataset into classes with the biggest margin possible. The hyperplane is learned from the training data.

SVM applications: image recognition - sentiment analysis - spam detection ...
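A sketch with scikit-learn's SVC, testing and comparing two kernels on synthetic data (generated only for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for kernel in ('linear', 'rbf'):                 # compare kernel functions
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))     # accuracy per kernel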

Evaluation:

 Binary or multi-class classification: F1-score, Jaccard index, ...
 Regression: RMSE, MAE
 Clustering: Adjusted Rand Index

Logistic Regression

 Used for classification
 A statistical and ML technique for classifying records
 Logistic regression differs from linear regression: with linear regression we predict continuous values, whereas with logistic regression we predict classes
 The independent variables (X) should be continuous; if they are categorical, we should transform them to continuous values

When to use logistic regression:

 When the target field in your data is categorical and binary
 If you need the probability of your result
 When you need a linear decision boundary
 If you need to understand the impact of the features

Logistic Regression Training

The main objective of training is to adjust the parameters to find the best estimation.
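A minimal sketch with scikit-learn (synthetic data, for illustration only); predict_proba returns the class probabilities mentioned above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(X_test[:3]))         # predicted classes
print(clf.predict_proba(X_test[:3]))   # probability of each class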

Clustering

Groups of cases that have similar characteristics.

Unsupervised.

Clustering algorithms: k-means, k-median

k-means

Computes similarity by computing distances (see the sketch after these steps):

1. Initialize K: choose the number of clusters and randomly place K centroids
2. Calculate the distances
3. Assign each point to the closest centroid
4. Compute new centroids
5. Repeat until the centroids no longer change
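These steps are what scikit-learn's KMeans runs internally; a minimal sketch on synthetic blobs (K = 3 is an assumption for this example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)    # unlabeled data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)    # K = 3 clusters
print(km.labels_[:10])         # cluster assigned to each point
print(km.cluster_centers_)     # the final centroids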
