ML-python
ML : science that gives computers the ability to learn without being explicitly programmed
ML Techniques :
ML Pipeline :
Supervised learning :
Unsupervised learning :
Regression
Regression algorithms : Poisson / linear / neural network / decision forest / boosted decision tree / k-nearest neighbors
Y is always continuous
Y = b + a*x
a and b are the parameters to adjust (the coefficients of the fitted line), with a the slope (gradient) and b the intercept
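For illustration, a and b can be estimated by least squares; a minimal sketch with np.polyfit on made-up (x, y) values (not course data):
import numpy as np

# Made-up data points; np.polyfit with deg=1 fits a straight line by least squares
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
a, b = np.polyfit(x, y, deg=1)   # returns [slope, intercept]
print(f"Y = {b:.2f} + {a:.2f}*x")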
Accuracy of a model : how much we can trust this model's predictions (its performance)
Types of evaluation :
Training accuracy : % of correct predictions when the model is tested on the same data it was trained on !! a high value can be a sign of overfitting
Out of sample accuracy : % correct predictions on data that the model has not been trained on
Train/Test split :
split the DataFrame cdf into two sets: train (80% of the data) to train the model and test (20% of the data) to evaluate the model's performance:
import numpy as np
from sklearn import linear_model
msk = np.random.rand(len(cdf)) < 0.8     # random 80/20 split of cdf
train, test = cdf[msk], cdf[~msk]
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr = linear_model.LinearRegression()
regr.fit(train_x, train_y)               # train the model
# Test the model
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)
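One way (an assumption, not necessarily the metrics used in the course) to quantify the out-of-sample accuracy is to compare test_y and test_y_ with sklearn's regression metrics:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Out-of-sample evaluation of the predictions test_y_ against the true test_y
print("MAE :", mean_absolute_error(test_y, test_y_))
print("MSE :", mean_squared_error(test_y, test_y_))
print("R2  :", r2_score(test_y, test_y_))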
Multiple linear regression : Y = b + a*x1 + c*x2 + …
In vector form : Y = z·X, with z the vector of coefficients
How to find z :
An optimization algorithm (e.g. gradient descent: start from random values for the coefficients, compare the predictions with the actual values, and keep adjusting until the best coefficients are found)
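A minimal sketch of that idea on synthetic data (the x/y values and learning rate are assumptions, not course code): gradient descent repeatedly adjusts a and b to reduce the mean squared error:
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=200)                   # made-up feature (e.g. engine size)
y = 125 + 39 * x + rng.normal(0, 10, size=200)    # made-up target with noise

a, b = 0.0, 0.0      # coefficients to learn (slope and intercept)
lr = 0.02            # learning rate
for _ in range(20000):
    error = (b + a * x) - y              # prediction error with current a, b
    a -= lr * 2 * np.mean(error * x)     # gradient of the MSE w.r.t. a
    b -= lr * 2 * np.mean(error)         # gradient of the MSE w.r.t. b
print(f"a (slope) = {a:.1f}, b (intercept) = {b:.1f}")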
Chapter 3 : Classification
Classification algorithms : decision trees - k-nearest neighbors - logistic regression - neural networks
K-nearest neighbors (KNN): find the closest cases and assign the same class label to our case
The distance between two points is a measure of their dissimilarity
It can be computed with the Euclidean distance
How to calculate the distance of an unknown case from all other cases in the dataset?
Euclidean distance : dis(x1, x2) = sqrt( Σ_i (x1_i - x2_i)^2 )
To find the best K (see the sketch below):
reserve part of the data for testing the accuracy of the model
start with k=1 and calculate the accuracy of the prediction using the test set
repeat the process, increasing k, until you find the best k
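A sketch of that search, using sklearn's iris dataset as stand-in data (an assumption; the course uses its own dataset):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

accuracies = {}
for k in range(1, 15):                     # try k = 1 .. 14
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, knn.predict(X_test))

best_k = max(accuracies, key=accuracies.get)
print("best k =", best_k, "with test accuracy", accuracies[best_k])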
Jaccard index
F1-score
Confusion matrix : columns show the values predicted by the model (rows show the actual labels)
Log loss
Equation : LogLoss = -(1/n) * Σ [ y*log(y') + (1-y)*log(1-y') ], with y the true label and y' the predicted probability !! this equation measures how far each predicted probability is from the actual label (lower is better)
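All of these metrics are available in sklearn; a sketch with made-up labels and predicted probabilities:
import numpy as np
from sklearn.metrics import jaccard_score, f1_score, confusion_matrix, log_loss

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])             # made-up true labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])             # made-up hard predictions
y_prob = np.array([.9, .2, .4, .8, .1, .7, .6, .3])     # made-up predicted probabilities y'

print("Jaccard index :", jaccard_score(y_true, y_pred))
print("F1-score      :", f1_score(y_true, y_pred))
print("Confusion matrix (rows = actual, columns = predicted):")
print(confusion_matrix(y_true, y_pred))
print("Log loss      :", log_loss(y_true, y_prob))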
Decision Trees
Built by testing an attribute and branching the cases based on the result of the test
Each internal node corresponds to a test, each branch to a result of the test, and each leaf node to a class label
1. Choose an attribute from the dataset
2. Calculate the significance of the attribute in splitting the data (to see whether it is an effective attribute or not)
3. Split the data based on the value of the best attribute
4. Go to step 1
!! A node is pure if 100% of its cases belong to the same category (entropy = 0). Example 1 (splitting by cholesterol): a branch with a 6/2 class split gives E = 0.811
Example 2 : splitting by sex
Between cholesterol and sex, which attribute is the best to split on? => the one whose tree has the higher information gain after splitting
Information gain (IG) = entropy of the data before the split - weighted entropy after the split
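A small sketch of entropy and information gain; the 9/5, 6/2 and 3/3 label counts are illustrative numbers consistent with the E = 0.811 example above, not necessarily the course's exact dataset:
import numpy as np

def entropy(labels):
    # Entropy (in bits) of a list of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent_labels, child_groups):
    # Entropy before the split minus the weighted entropy after the split
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

parent = ["B"] * 9 + ["A"] * 5                           # 9 B / 5 A before splitting
groups = [["B"] * 6 + ["A"] * 2, ["B"] * 3 + ["A"] * 3]  # branches after splitting
print(entropy(["B"] * 6 + ["A"] * 2))   # ~0.811, the impure-branch example above
print(information_gain(parent, groups))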
Chapter 4
SVM is based on the idea of finding a hyperplane that divides the dataset into classes
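A minimal sketch, assuming synthetic blob data and an RBF kernel (both are assumptions, not taken from the course):
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=200, centers=2, random_state=0)   # made-up 2-class data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = svm.SVC(kernel="rbf")      # SVM classifier; kernel choice is an assumption
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))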
Evaluation :
Logistic Regression
Clustering
Unsupervised
k-means