Machine Learning with Python
ML: the science that gives computers the ability to learn without being explicitly programmed
ML Techniques:
Regression/Estimation: predicting continuous values
Classification: predicting a class / category
Clustering: finding the structure of data, summarization
Associations: finding frequently co-occurring items/events (e.g. products that are always bought together by the same customer)
Anomaly detection: discovering unusual cases
Sequence mining: predicting the next events
Dimension reduction: reducing the size of the data
Python libraries for machine learning:
NumPy: working with N-dimensional arrays
SciPy: signal processing / optimization / statistics
Matplotlib: 2D and 3D plotting
Pandas: importing / manipulating / analyzing data
Scikit-learn: algorithms and tools for ML
ML Pipeline:
All the steps of this process are available in scikit-learn
Supervised learning:
Teach the model by training it with a labeled dataset
Labeled dataset -> classes
Supervised techniques: classification and regression
Unsupervised learning:
The model trains on the dataset and draws conclusions on unlabeled data
Unsupervised techniques: dimension reduction, clustering, density estimation, market basket analysis
Regression
Used to predict a continuous value
Types of variables in regression:
Independent (x: explanatory variables, the causes of y)
Dependent (y: state, target, the final goal to study)
Types of regression, by number of x variables:
Simple regression (linear or non-linear), e.g. predict CO2 using engine size
Multiple regression (linear or non-linear), e.g. predict CO2 using engine size and cylinders
Regression algorithms: Poisson / linear / neural network / decision forest / boosted decision tree / k-nearest neighbors
Simple Linear Regression
X can be continuous or categorical
Y is always continuous
Y = b + ax
a and b are the parameters to adjust (the coefficients of the fit line), with a: gradient (slope) and b: intercept
MSE (mean squared error): measures the error of the fit line
Model Evaluation in Regression Models
After building a model we should evaluate it
Accuracy of a model: how much we can trust this model / performance of a model
Types of evaluation:
Train and test on the same dataset:
Compare actual values with predicted values to measure accuracy
Training accuracy: % of correct predictions when the model is tested on data it was trained on !! a high value could mean overfitting
Out-of-sample accuracy: % of correct predictions on data that the model has not been trained on
Train/Test split:
This method gives a more realistic estimate of out-of-sample accuracy, because the test set is not used for training
K-fold cross-validation:
Evaluation Metrics in Regression Models
Error: difference between a data point and the trend line generated by the algorithm
MSE / RMSE (interpretable: same unit as y) / RAE / RSE
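The four metrics listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `regression_metrics` and the sample values are hypothetical, and RAE and RSE are computed relative to a baseline that always predicts the mean of the actual values.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, RAE and RSE for two equal-length sequences."""
    n = len(actual)
    mean_actual = sum(actual) / n
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e ** 2 for e in errors) / n   # mean squared error
    rmse = math.sqrt(mse)                   # same unit as y
    # RAE: total absolute error relative to always predicting the mean
    rae = sum(abs(e) for e in errors) / sum(abs(a - mean_actual) for a in actual)
    # RSE: total squared error relative to always predicting the mean
    rse = sum(e ** 2 for e in errors) / sum((a - mean_actual) ** 2 for a in actual)
    return {"MSE": mse, "RMSE": rmse, "RAE": rae, "RSE": rse}

# Hypothetical CO2 values, for illustration only
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

The R2 score often reported for regression is 1 - RSE.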
Split the DataFrame cdf into two sets: train (80% of the data) to train the model and test (20% of the data) to evaluate the model's performance:
import numpy as np
from sklearn import linear_model

# Random boolean mask: about 80% of the rows are True (train), the rest False (test)
msk = np.random.rand(len(cdf)) < 0.8
train = cdf[msk]   # training set
test = cdf[~msk]   # test set

regr = linear_model.LinearRegression()  # create a linear regression model
# Prepare the training data
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)  # train the model
# Print the coefficient (a) and the intercept (b)
print('Coefficients: ', regr.coef_[0][0])  # a
print('Intercept: ', regr.intercept_[0])   # b
# Test the model
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)  # predicted values for the test set
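Multiple Linear Regression:
Y = b + a*x1 + c*x2 + …
Y = z^T X
z = [b, a, c, …]: weight vector of the regression (the parameters to optimize)
X = [1, x1, x2, …] / with several x variables the fit is called a hyperplane
The objective is to minimize the MSE; to do that we must find the best z
How to find z:
Linear algebra (ordinary least squares): costly in computation time on large datasets
An optimization algorithm (gradient descent: start with random values for the coefficients, compute the error, and adjust the coefficients step by step to find the best ones)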
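The gradient descent idea can be sketched in plain Python as below. It is a minimal illustration, not a production implementation: the function names, the learning rate, the number of epochs, the seed and the tiny dataset (generated from Y = 1 + 2*x1 + 3*x2) are all assumptions chosen for the example.

```python
import random

def predict(z, x):
    """Y = z^T X with X = [1, x1, x2, ...]; z[0] is the intercept b."""
    return z[0] + sum(w * xi for w, xi in zip(z[1:], x))

def mse(z, rows, targets):
    """Mean squared error of the weight vector z on the dataset."""
    return sum((predict(z, x) - y) ** 2 for x, y in zip(rows, targets)) / len(rows)

def gradient_descent(rows, targets, learning_rate=0.05, epochs=5000, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    # Start with random values for the coefficients
    z = [rng.uniform(-1.0, 1.0) for _ in range(n_features + 1)]
    for _ in range(epochs):
        errors = [predict(z, x) - y for x, y in zip(rows, targets)]
        # Gradient of the MSE with respect to the intercept, then each weight
        grad = [2.0 * sum(errors) / n]
        for j in range(n_features):
            grad.append(2.0 * sum(e * x[j] for e, x in zip(errors, rows)) / n)
        # Move each coefficient a small step against its gradient
        z = [w - learning_rate * g for w, g in zip(z, grad)]
    return z

# Hypothetical data generated from Y = 1 + 2*x1 + 3*x2
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
targets = [1, 3, 4, 6, 8, 9]
z = gradient_descent(rows, targets)
print(z, mse(z, rows, targets))
```

On this noise-free data the weights should approach [1, 2, 3]; on real data the learning rate usually needs tuning and the features usually need scaling.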
Chapter 3: Classification
Supervised learning approach
Learns the relationship between a set of feature variables and a target variable
The target attribute in classification is a categorical variable with discrete values
Types of classification algorithms: decision trees, k-nearest neighbors, logistic regression, neural networks
K-Nearest Neighbors (KNN)
Find the closest cases and assign the same class label to our case
The distance between two points is a measure of their dissimilarity
It can be calculated with the Euclidean distance
How does it work?
1. Pick a value for k
2. Calculate the distance from the unknown case to all other cases in the dataset
3. Select the k observations nearest to the unknown case
4. Predict the result using the most popular response value among the k nearest neighbors
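These steps can be sketched in a few lines of plain Python. The function name `knn_predict` and the sample points are hypothetical, and the Euclidean distance (`math.dist`, Python 3.8+) is assumed as the distance measure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training cases."""
    # Distance from the unknown case to every case in the dataset
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k nearest observations
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Most popular label among the k nearest neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2D cases with two classes
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (6, 5), k=3))
```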
How to calculate the distance from the unknown case to all other cases in the dataset?
Euclidean distance: dis(x1, x2) = sqrt(Σ_i (x1i - x2i)^2)
What is the best value of k for KNN?
k is the number of nearest neighbors to examine
If k is very low: bad predictions -> overfitting
If k is very high: the model is overly generalized
To find the best k:
Reserve part of the data for testing the accuracy of the model
Start with k=1 and calculate the accuracy of prediction using the test set
Repeat the process, increasing k, until you find the best k
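The search for k can be sketched as a simple loop. Everything here is illustrative: the helper names, the range of k values and the small train/test sets are assumptions, and one training case is deliberately mislabeled to show how a very low k can follow noise.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (point, label) pairs."""
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test cases whose predicted label equals the true label."""
    correct = sum(knn_predict(train, point, k) == label for point, label in test)
    return correct / len(test)

# Hypothetical data; the last training case is a noisy "A" inside the "B" region
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B"),
         ((5.5, 5.5), "A")]
# Part of the data reserved for testing
test = [((1.5, 1.5), "A"), ((5.4, 5.4), "B"), ((6.2, 5.5), "B"), ((2, 1.5), "A")]

# Start with k=1 and increase k, recording the accuracy each time
accuracies = {k: accuracy(train, test, k) for k in range(1, 6)}
best_k = max(accuracies, key=accuracies.get)  # first k with the highest accuracy
print(accuracies, best_k)
```

With k=1 the test case next to the noisy point is misclassified; larger values of k outvote the noise.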
KNN can also be used to predict continuous values (the average of the nearest neighbors' values is used to predict the value of the new case)
Evaluation Metrics in Classification:
These metrics explain the performance of a model
Compare the values of the test set with the values predicted by the model
Jaccard index
(y: actual labels, y': predicted labels)
Jaccard index: the size of the intersection divided by the size of the union of the two label sets: J(y, y') = |y ∩ y'| / |y ∪ y'|
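A small sketch of this definition follows. It assumes that the intersection is counted as the number of positions where the predicted label equals the actual label, and that the union is |y| + |y'| minus that count; the function name and the label vectors are hypothetical.

```python
def jaccard_index(actual, predicted):
    """Intersection over union of the actual and predicted label sets."""
    # Intersection: cases where the prediction matches the actual label
    matches = sum(a == p for a, p in zip(actual, predicted))
    # Union: |y| + |y'| - |intersection|
    return matches / (len(actual) + len(predicted) - matches)

# Hypothetical labels: 8 of the 10 predictions are correct
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
score = jaccard_index(y, y_hat)   # 8 / (10 + 10 - 8)
print(round(score, 2))
```

A perfect classifier gets 1.0 and a classifier with no correct prediction gets 0.0.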
F1-score
Another way to check accuracy
Computed on the test set from the confusion matrix:
Rows show the actual (true) labels
Columns show the values predicted by the model
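As an illustration, the table described above (rows = actual labels, columns = predicted labels) and the F1-score of one class can be computed as follows. The function names and the label vectors are hypothetical; F1 is taken as the harmonic mean of precision and recall.

```python
def confusion_matrix(actual, predicted, labels):
    """matrix[a][p] = number of cases with actual label a and predicted label p."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def f1_score(matrix, positive):
    """F1 of the class `positive`: harmonic mean of precision and recall."""
    tp = matrix[positive][positive]
    fp = sum(matrix[a][positive] for a in matrix if a != positive)
    fn = sum(matrix[positive][p] for p in matrix[positive] if p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical test-set labels and model predictions
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
matrix = confusion_matrix(actual, predicted, labels=[0, 1])
f1_class_1 = f1_score(matrix, positive=1)
f1_class_0 = f1_score(matrix, positive=0)
print(matrix, f1_class_1, f1_class_0)
```

Averaging the per-class F1 values gives one overall score for the classifier.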
Log loss
Measures the performance of a classifier whose predicted output is a probability value in [0,1]
Equation for one case: y*log(y') + (1-y)*log(1-y'), with y the true value and y' the predicted value !! this equation measures how far the predicted value is from the actual label
Log loss over the whole test set: LogLoss = -(1/n) Σ (y*log(y') + (1-y)*log(1-y'))
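The formula translates almost directly into Python. In this sketch the function name and the sample probabilities are hypothetical, and the small `eps` clipping is an added assumption to avoid log(0); it is not part of the formula in the notes.

```python
import math

def log_loss(actual, predicted_probs, eps=1e-15):
    """LogLoss = -(1/n) * sum(y*log(y') + (1-y)*log(1-y'))."""
    total = 0.0
    for y, p in zip(actual, predicted_probs):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)

# Hypothetical true labels and predicted probabilities of class 1
y_true = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.2]   # close to the actual labels
bad_probs = [0.2, 0.8, 0.3, 0.9]    # far from the actual labels
print(log_loss(y_true, good_probs), log_loss(y_true, bad_probs))
```

A lower log loss means the predicted probabilities are closer to the actual labels.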
Decision Trees
Built by testing an attribute and branching the cases based on the result of the test
Each node corresponds to a test, each branch corresponds to a result of the test, and each leaf node assigns a class
How to build a decision tree:
1. Choose an attribute from your dataset
2. Calculate the significance of the attribute in splitting the data (to see whether it is an effective attribute or not)
3. Split the data based on the value of the best attribute
4. Go to step 1
!! A node is pure if 100% of its cases belong to the same category
Entropy: the amount of randomness or uncertainty in the data
In decision trees we look for trees with the lowest entropy in their nodes
Entropy = 0 is the best case (pure node)
Ex 1: we calculate the entropy for cholesterol:
6 drug B if cholesterol is normal
=> E = 0.811
Ex 2: sex
Between cholesterol and sex, which is the better attribute to choose? => the one that gives the tree with the higher information gain after splitting
Information gain: IG
Information that can increase the level of certainty after splitting
Lower entropy => higher information gain
IG = (entropy before split) - (weighted entropy after split)
Ex for sex: IG = 0.940 - [(7/14)*0.985 + (7/14)*0.592] ≈ 0.15
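Entropy and information gain can be checked with a short sketch. The function names are hypothetical. The entropy values 0.940, 0.985 and 0.592 and the 7/14 weights come from the notes; the class counts [6, 2] are an assumption (6 drug B plus 2 other cases) chosen because they reproduce E = 0.811.

```python
import math

def entropy(class_counts):
    """Entropy (in bits) of a node, given the number of cases in each class."""
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(entropy_before, branches):
    """branches: list of (weight, entropy) pairs, one per branch after the split."""
    return entropy_before - sum(weight * e for weight, e in branches)

# ASSUMPTION: 6 cases of one class and 2 of the other in the node
e_node = entropy([6, 2])
# Values taken from the notes for the split on sex
ig_sex = information_gain(0.940, [(7 / 14, 0.985), (7 / 14, 0.592)])
print(round(e_node, 3), round(ig_sex, 3))
```

A pure node such as `entropy([8, 0])` has entropy 0, and an evenly mixed node such as `entropy([4, 4])` has entropy 1.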
Chapter 4
Support Vector Machine (SVM)
Used for classification
Supervised algorithm
Classifies cases by finding a separator
Sometimes the data can only be separated by a curve, not a line; in that case:
1. Map the data to a high-dimensional feature space
2. Find a separator
Transforming data:
Kernelling: mapping the data into a higher-dimensional space
The mathematical function used for this is called a kernel and can be of different types:
(linear, polynomial, RBF, sigmoid); these functions are included in ML libraries
To know which function performs best with our dataset, we test and compare
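To make the four kernel types more concrete, here is a sketch of their usual textbook forms in plain Python. Each kernel returns a similarity between two cases that corresponds to a dot product in a higher-dimensional space, without building that space explicitly. The function names and the default values of `degree`, `gamma` and `coef0` are assumptions for illustration; ML libraries expose their own parameters and defaults.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, degree=3, coef0=1.0):
    return (dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    # Depends only on the squared Euclidean distance between u and v
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

# Hypothetical cases with two features each
u, v = (1.0, 2.0), (3.0, 4.0)
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(u, v))
```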
How to find the separator:
SVM is based on the idea of finding a hyperplane that divides the dataset into classes
with the largest possible margin
The hyperplane is learned from the training data
SVM applications: image recognition, sentiment analysis, spam detection …
Evaluation:
Binary or multi-class classification: F1-score, Jaccard index, …
Regression: RMSE, MAE
Clustering: Adjusted Rand Index
Logistic Regression
Used for classification
Statistical and ML technique for classifying records
Logistic regression differs from linear regression: with linear regression we predict continuous values, whereas with logistic regression we predict classes
Independent variables (X) should be continuous; if they are categorical, we should transform them into continuous (numeric) values
When to use logistic regression:
When the target field in your data is categorical (binary)
If you need the probability of your result
When you need a linear decision boundary
If you need to understand the impact of the features
Logistic Regression Training
The main objective of training is to change the parameters to find the best estimation
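One common way to change the parameters is gradient descent on the log loss, sketched below for a single feature. This is an assumption about the training method, not something the notes specify; the function names, learning rate, number of epochs and the tiny dataset are hypothetical.

```python
import math

def sigmoid(t):
    """Maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        probs = [sigmoid(w * x + b) for x in xs]
        # Gradients of the mean log loss with respect to w and b
        grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(probs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: class 1 for positive x, class 0 for negative x
xs = [-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(xs, ys)
prob_low = sigmoid(w * -3.0 + b)    # probability of class 1 at x = -3
prob_high = sigmoid(w * 3.0 + b)    # probability of class 1 at x = 3
print(w, b, prob_low, prob_high)
```

The output is a probability, so a threshold (commonly 0.5) turns it into a class; the sign and size of w indicate the impact of the feature.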
Clustering
Groups cases that have similar characteristics
Unsupervised
Clustering algorithms: k-means, k-median
k-means
Similarity is computed by calculating distances
1- Initialize k: choose the number of clusters and randomly place k centroids
2- Calculate the distance from each point to each centroid
3- Assign each point to the closest centroid
4- Compute the new centroids (the mean of the points in each cluster)
5- Repeat until the centroids no longer change
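The five steps can be sketched in plain Python as follows. The function name, the seed, the iteration cap and the sample points are hypothetical, and the initial centroids are assumed to be k randomly chosen data points (one common way to place them randomly).

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly place k centroids (here: k random data points)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Steps 2-3: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: new centroid = mean of the points in each cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Step 5: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D points forming two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

The result depends on the initial centroids, so in practice the algorithm is often run several times with different starting points.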