Lab 1: Feature Selection - Class GI2
3AGI+Neprev
Engineering Student:
Objective
The objective of this lab is to review some variable selection techniques for regression models:
embedded methods (Lasso and Ridge), SelectKBest, SelectFromModel, RFE, RFECV, statsmodels.
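For reference, a minimal sketch of how the scikit-learn selectors named above are instantiated (the k and alpha values here are illustrative assumptions, not the lab's settings; X and y are the feature matrix and target built later in the lab):

In [ ]: from sklearn.feature_selection import SelectKBest, f_regression, RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

# filter method: keep the k features with the best univariate F-scores
kbest = SelectKBest(score_func=f_regression, k=10)

# embedded method: keep the features with nonzero Lasso coefficients
sfm = SelectFromModel(Lasso(alpha=1.0))

# wrapper method: recursively eliminate features using a linear model
rfe = RFE(LinearRegression(), n_features_to_select=10)

# each selector exposes fit(X, y) and transform(X), e.g.:
# X_new = kbest.fit_transform(X, y)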
import warnings
warnings.filterwarnings('ignore')
import pandas as pd   # needed for read_csv below

df = pd.read_csv('train_bm.csv')
In [5]: df.head()
[output: first five rows of df; see the fuller listing below]

df.shape
(8523, 12)
Out[3]:
Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object
Out[4]:
['Item_Identifier',
 'Item_Weight',
 'Item_Fat_Content',
 'Item_Visibility',
 'Item_Type',
 'Item_MRP',
 'Outlet_Identifier',
 'Outlet_Establishment_Year',
 'Outlet_Size',
 'Outlet_Location_Type',
 'Outlet_Type',
 'Item_Outlet_Sales']
    Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier
3             FDX07       19.200          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010
5             FDP36       10.395          Regular         0.000000           Baking Goods   51.4008            OUT018
6             FDO10       13.650          Regular         0.012741            Snack Foods   57.6588            OUT013
7             FDP10          NaN          Low Fat         0.127470            Snack Foods  107.7622            OUT027
8             FDH17       16.200          Regular         0.016687           Frozen Foods   96.9726            OUT045
9             FDU28       19.200          Regular         0.094450           Frozen Foods  187.8214            OUT017
10            FDY07       11.800          Low Fat         0.000000  Fruits and Vegetables   45.5402            OUT049
12            FDX32       15.100          Regular         0.100014  Fruits and Vegetables  145.4786            OUT049
13            FDS46       17.600          Regular         0.047257            Snack Foods  119.6782            OUT046
14            FDF32       16.350          Low Fat         0.068024  Fruits and Vegetables  196.4426            OUT013
16            NCB42       11.800          Low Fat         0.008596     Health and Hygiene  115.3492            OUT018
18            DRI11          NaN          Low Fat         0.034238            Hard Drinks  113.2834            OUT027
(remaining columns, starting with Outlet_Establishment_Year, truncated in the original output)
In [7]: df.isnull().sum().sum()   # total number of missing values
Out[7]: 3873

In [22]: df.Item_Weight.isnull().sum()
Out[22]: 1463
In [9]: # replace missing Item_Weight values with the mode (most frequent value)
#df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
df['Item_Weight'] = df['Item_Weight'].fillna(df.Item_Weight.mode()[0])
#df.Item_Weight = df.Item_Weight.fillna(value=df.Item_Weight.mean())
In [10]: df.Item_Weight.isnull().sum()
Out[10]: 0

In [11]: df.Outlet_Size.isnull().sum()
Out[11]: 2410

# fill missing Outlet_Size with the literal category 'Mode'
df['Outlet_Size'] = df['Outlet_Size'].fillna(value='Mode')
In [32]: # no missing values remain
df.isnull().sum().sum()
Out[32]: 0
Out[33]:
Item_Identifier               object
Item_Weight                   object
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object
In [13]: # count the occurrences of each unique value in the 'Outlet_Type' column
df['Outlet_Type'].unique()
df['Outlet_Type'].value_counts()
In [16]: # Encoding categorical variables: LabelEncoder assigns each categorical value an integer
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
dfc = df.copy()   # encode into a copy so df keeps the original labels
dfc.Outlet_Type = encoder.fit_transform(df['Outlet_Type'])
Out[46]:
    Outlet_Type
0             1
1             2
2             1
3             0
4             1
5             2
6             1
7             3
8             1
9             1
10            1
11            1
12            1
13            1
14            1
15            1
16            2
17            1
18            3
19            1
In [47]: #Dummies encoding: converts categorical data into dummy or indicator variables
pd.get_dummies(df['Outlet_Type']).head()
0 0 1 0 0
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 1 0 0
In [48]: df.shape
Out[48]: (8523, 12)
In [49]: df.Outlet_Establishment_Year.value_counts()
Out[49]:
1985    1463
1987     932
1999     930
1997     930
2004     930
2002     929
2009     928
2007     926
1998     555
Name: Outlet_Establishment_Year, dtype: int64
In [ ]: #df.describe(include='float64')
In [ ]: df.Item_Weight.head()
In [18]: df.shape
df = pd.get_dummies(df, columns=['Outlet_Type','Outlet_Establishment_Year','Outlet_Size'])
df.shape
Out[18]: (8523, 26)
In [ ]: df.info()

In [20]: df['Item_Identifier'].value_counts().head()
Out[20]:
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
Name: Item_Identifier, dtype: int64

In [ ]: df.shape

  Item_Identifier  Item_identifier_count
0           FDA15                      8
1           DRC01                      6
2           FDN15                      7
3           FDX07                      6
4           NCD19                      6

In [53]: df.shape
Out[53]: (8523, 27)
In [54]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Location_Type 8523 non-null object
8 Item_Outlet_Sales 8523 non-null float64
9 Outlet_Type_Grocery Store 8523 non-null uint8
10 Outlet_Type_Supermarket Type1 8523 non-null uint8
11 Outlet_Type_Supermarket Type2 8523 non-null uint8
12 Outlet_Type_Supermarket Type3 8523 non-null uint8
13 Outlet_Establishment_Year_1985 8523 non-null uint8
14 Outlet_Establishment_Year_1987 8523 non-null uint8
15 Outlet_Establishment_Year_1997 8523 non-null uint8
16 Outlet_Establishment_Year_1998 8523 non-null uint8
17 Outlet_Establishment_Year_1999 8523 non-null uint8
18 Outlet_Establishment_Year_2002 8523 non-null uint8
19 Outlet_Establishment_Year_2004 8523 non-null uint8
20 Outlet_Establishment_Year_2007 8523 non-null uint8
21 Outlet_Establishment_Year_2009 8523 non-null uint8
22 Outlet_Size_High 8523 non-null uint8
23 Outlet_Size_Medium 8523 non-null uint8
24 Outlet_Size_Mode 8523 non-null uint8
25 Outlet_Size_Small 8523 non-null uint8
26 Item_identifier_count 8523 non-null int64
dtypes: float64(3), int64(1), object(6), uint8(17)
memory usage: 807.5+ KB
In [56]: L0=df.columns.to_list()
print(L0)
df.head()
[output: first five rows, 5 rows × 27 columns]

In [57]: df = df[L0]
df.head()
Out[57]:
[output: first five rows, 5 rows × 27 columns]
In [ ]: # df1: drop the continuous columns (reconstructed from the info() output below)
df1 = df.drop(['Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales'], axis=1)
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Type 8523 non-null object
4 Outlet_Identifier 8523 non-null object
5 Outlet_Location_Type 8523 non-null object
6 Outlet_Type_Grocery Store 8523 non-null uint8
7 Outlet_Type_Supermarket Type1 8523 non-null uint8
8 Outlet_Type_Supermarket Type2 8523 non-null uint8
9 Outlet_Type_Supermarket Type3 8523 non-null uint8
10 Outlet_Establishment_Year_1985 8523 non-null uint8
11 Outlet_Establishment_Year_1987 8523 non-null uint8
12 Outlet_Establishment_Year_1997 8523 non-null uint8
13 Outlet_Establishment_Year_1998 8523 non-null uint8
14 Outlet_Establishment_Year_1999 8523 non-null uint8
15 Outlet_Establishment_Year_2002 8523 non-null uint8
16 Outlet_Establishment_Year_2004 8523 non-null uint8
17 Outlet_Establishment_Year_2007 8523 non-null uint8
18 Outlet_Establishment_Year_2009 8523 non-null uint8
19 Outlet_Size_High 8523 non-null uint8
20 Outlet_Size_Medium 8523 non-null uint8
21 Outlet_Size_Mode 8523 non-null uint8
22 Outlet_Size_Small 8523 non-null uint8
23 Item_identifier_count 8523 non-null int64
dtypes: int64(1), object(6), uint8(17)
memory usage: 607.7+ KB
In [ ]: df2.columns.to_list()
[output: 5 rows × 22 columns]

In [63]: dfn.head()   # dfn: dataframe restricted to the numeric columns used for modeling
[output: 5 rows × 21 columns]
In [64]: dfn.isnull().sum()
Out[64]:
Item_Visibility 0
Item_MRP 0
Item_Outlet_Sales 0
Outlet_Type_Grocery Store 0
Outlet_Type_Supermarket Type1 0
Outlet_Type_Supermarket Type2 0
Outlet_Type_Supermarket Type3 0
Outlet_Establishment_Year_1985 0
Outlet_Establishment_Year_1987 0
Outlet_Establishment_Year_1997 0
Outlet_Establishment_Year_1998 0
Outlet_Establishment_Year_1999 0
Outlet_Establishment_Year_2002 0
Outlet_Establishment_Year_2004 0
Outlet_Establishment_Year_2007 0
Outlet_Establishment_Year_2009 0
Outlet_Size_High 0
Outlet_Size_Medium 0
Outlet_Size_Mode 0
Outlet_Size_Small 0
Item_identifier_count 0
dtype: int64
In [66]: # boxplot of the numeric feature columns
import seaborn as sns
sns.boxplot(data=dfn[['Item_Weight', 'Item_Visibility', 'Item_MRP']], orient='v')
Out[66]: <AxesSubplot:>
In [68]: # outliers in the target
sns.boxplot(data=dfn['Item_Outlet_Sales'], orient='v')
Out[68]: <AxesSubplot:>
The upper whisker of the boxplot sits at $\text{maximum} = Q_3 + 1.5 \cdot IQR$.
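A short sketch of this whisker rule applied to the plotted column (assuming dfn as above):

In [ ]: Q1 = dfn['Item_Outlet_Sales'].quantile(0.25)
Q3 = dfn['Item_Outlet_Sales'].quantile(0.75)
IQR = Q3 - Q1
upper = Q3 + 1.5 * IQR   # points above this value are drawn as outliers
print(upper)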
In [ ]: sns.boxplot(data=df.Item_Outlet_Sales, orient='h')

Out[65]: <seaborn.axisgrid.PairGrid at 0x1b02fe8a278>   # output of a pairplot cell
Correlation:
A correlation matrix is a common tool for comparing the correlation coefficients between different features
(or attributes) in a dataset. It lets you visualize how much (or how little) correlation exists between the
variables.
In [27]: import matplotlib.pyplot as plt
correlations = dfn.corr()   # pairwise correlation matrix of the numeric features
plt.figure(figsize=(10,8))
sns.heatmap(correlations, annot=True)
Out[27]: <AxesSubplot:>
In [28]: X = dfn.drop(['Item_Outlet_Sales'], axis=1)# matrix of features
y = dfn['Item_Outlet_Sales']# vector of labels
X.shape, y.shape
Linear regression
The last column of the dataframe df, 'sales', denoted $\hat{y}$, represents the total sales of some products of a
retail chain. The objective of linear regression is to predict the sales of each product x at a particular outlet
as a function of the variables $x_j$, $j = 1, \dots, d$, writing the prediction $\hat{y}$ as

$$\hat{y} = \sum_{j=0}^{d} w_j x_j$$

(here we set $x_0 = 1$).

The weights and bias are found by minimizing the least-squares error, which is the average over all samples:

$$\min_{w \in \mathbb{R}^{d+1}} f(w) := \frac{1}{2n} \sum_{i=1}^{n} \Big( \sum_{j=0}^{d} w_j x_j^i - y^i \Big)^2 \qquad (1)$$
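As a concrete illustration of problem (1), a minimal NumPy sketch that solves it in closed form (the helper name fit_least_squares is hypothetical, not part of the lab):

In [ ]: import numpy as np

def fit_least_squares(X, y):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column x_0 = 1
    w_hat, *_ = np.linalg.lstsq(Xa, y, rcond=None)  # minimizes ||Xa w - y||^2
    return w_hat                                    # w_hat[0] is the bias w_0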
In [3]: Image('regression1.png')
Out[3]:
In [4]: Image('regression2.png')
Out[4]:
In [5]: Image('regression3.png')
Out[5]:
In [6]: Image('regression4.png')
Out[6]:
Training and test sets
To evaluate the model's performance on unseen data, we split the data into training and test sets: for example,
train the model on 80% of the examples and test on the remaining 20%. We use the train_test_split function from
the scikit-learn library. Set random_state to obtain the same split every time, so that the results are
reproducible. Finally, we print the sizes of the training and test sets to check that the split is correct.
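A minimal sketch of the split described above (80/20; the random_state value is an arbitrary assumption):

In [ ]: from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)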
In [ ]: ## **Feature scaling**
#from sklearn.preprocessing import MinMaxScaler
#scaler = MinMaxScaler()
#from sklearn.preprocessing import RobustScaler
#scaler = RobustScaler()
#from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)
In [ ]: print(X_test.shape)
Metrics for regression: general case
Let $y$ and $\hat{y}$ be the vectors of exact and predicted values, respectively.

Mean absolute error:
$$MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$$

Mean squared error:
$$MSE = \frac{1}{2n} \sum_i (y_i - \hat{y}_i)^2$$

R-squared, or coefficient of determination: $R^2$ measures the agreement between the predicted and exact values:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \;\in\; (-\infty, 1]$$

Equivalently,
$$R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})},$$
where the baseline always predicts the mean $\bar{y}$.

However, when more features are added, R-squared either increases or stays the same, so it cannot show how any
individual feature impacts the model. The adjusted R-squared corrects for this:
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - d - 1}.$$
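For reference, these metrics are available in sklearn.metrics; a minimal sketch, assuming the fitted model LR and the split arrays defined in the surrounding cells (note that scikit-learn's mean_squared_error uses the 1/n convention rather than the 1/(2n) above):

In [ ]: from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = LR.predict(X_test)   # predictions on the test set
print('MAE =', mean_absolute_error(y_test, y_pred))
print('MSE =', mean_squared_error(y_test, y_pred))
print('R2  =', r2_score(y_test, y_pred))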
In [32]: from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(X_train, y_train)
Out[32]: LinearRegression()

Out[48]: 0.5630614279454249   # R2 score (same value as the first R2 printed below)
In [33]: w0 = LR.intercept_   # bias
w_R = LR.coef_                # weights
print('Intercept =', w0)
print('Coefs =', w_R)

Intercept = -207.5275883376171
Coefs = [  -2.0262207  -343.68577229   15.53712456 -1305.69259881
  363.57108798 -237.90354403 1180.02505485   67.03378685
  -29.77356392  292.14669988 -192.70133081 -451.98090278
  -43.28072999  450.56672686  145.89285794 -237.90354403
  -29.77356392  490.14060804  -90.08920286 -370.27784126
    5.97821985]
MAE = 832.7697608114595
MSE = 1270620.5219161748
R2 = 0.5630614279454249
MAE = 851.9198567977413
MSE = 1276730.9218397879
R2 = 0.5637881003820474
Pairs of [actual, predicted] sales values:
[1426.1436, 2489.3108732996498]
[1201.769, 2522.6479061046225]
[1836.2764, 2238.8794424597736]
[2410.8618, 2432.726290160981]
[1549.9824, 3115.3221305422135]
[3169.208, 3616.2284038788393]
[2036.6822, 2905.006552265028]
[824.9262, 1832.827140131652]
[378.1744, 1129.2447358303984]
[1573.9512, 1802.1739343111042]
In [49]: # rank features by absolute LR coefficient (mirrors the pseudo-inverse cell below)
import numpy as np
coefR = pd.Series(np.abs(w_R), feature_names).sort_values(ascending=False)
coefR.plot(kind='bar', title='LR Coefficients')
Out[49]: <AxesSubplot:title={'center':'LR Coefficients'}>

In [50]: k = 15
# L_R: list containing the k selected variables
L_R = list(coefR[0:k].keys())
print(L_R)
Pseudo-inverse and SVD
There exist an orthogonal matrix $U \in \mathbb{R}^{n \times n}$, an orthogonal matrix $V \in \mathbb{R}^{d \times d}$,
and a matrix $\Sigma \in \mathbb{R}^{n \times d}$ whose entries are all zero except for the singular values
$\sigma_1 \ge \dots \ge \sigma_r > 0$ on the diagonal, such that

$$X = U \Sigma V^T.$$

The pseudo-inverse matrix is $X^+ = V \Sigma^+ U^T$. Then the vector $w = X^+ y$ is a minimum-norm solution of the
regression problem (1).
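The cells below use X_trainr, presumably X_train augmented with a leading column of ones for the bias (an assumption consistent with w_p[0] playing the role of the intercept); a minimal sketch of that construction:

In [ ]: import numpy as np
# assumed construction: prepend the bias column x_0 = 1 to X_train
X_trainr = np.hstack([np.ones((X_train.shape[0], 1)), X_train])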
In [57]: w_p = np.linalg.pinv(X_trainr).dot(y_train)   # minimum-norm least-squares solution

In [53]: print(w_p)
print(w0, w_R)
In [59]: # compare the LR weights with the pseudo-inverse weights (skipping the bias w_p[0])
plt.scatter(range(len(w_R)), w_R)
plt.scatter(range(len(w_p[1:X_train.shape[1]+1])), w_p[1:X_train.shape[1]+1])
Out[59]: <matplotlib.collections.PathCollection at 0x1b02fe51c50>
In [62]: d = X_train.shape[1]
d1 = d + 1   # number of features + 1 (for the bias)
print(d)
coefp = pd.Series(np.abs(w_p[1:d1]).reshape(d,), feature_names).sort_values(ascending=False)
coefp.plot(kind='bar', title='Pseudo-inverse Coefficients')

21
Out[62]: <AxesSubplot:title={'center':'Pseudo-inverse Coefficients'}>
Evaluate the model on both the training and test sets (using MAE, MSE, R2).
Use the pseudo-inverse matrix, then evaluate the resulting model.
For each technique used, plot the most relevant variables by importance, then collect the two most decisive
variables in a list (see the sketch below).
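A minimal sketch for the last item, assuming one of the sorted importance Series built above (coefR here):

In [ ]: # coefR is already sorted by decreasing |coefficient|, so the first two
# index labels are the two most decisive variables
top2 = list(coefR.index[:2])
print(top2)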
In [ ]: ## end of lab
In [ ]: # For GI1
n = y.shape[0]
#print(d)
return ?   # to be completed by the student
Gradient descent
The gradient descent algorithm applied to a function $f$ is defined by an initial point $w_0$ and the iteration

$$w_{k+1} = w_k - \alpha_k \nabla f(w_k).$$

The step size $\alpha_k$ is called the learning rate in ML and is usually denoted lr.

Here $A = X_{trainr}^T X_{trainr}$ and $b = X_{trainr}^T y_{train}$. We choose $w_0 \in \mathbb{R}^d$.

The conjugate gradient iteration:

$$\alpha_k = \frac{\|g_k\|^2}{(A d_k, d_k)}, \qquad w_{k+1} = w_k + \alpha_k d_k,$$

$$g_{k+1} = A w_{k+1} - b = g_k + \alpha_k A d_k,$$

$$\beta_k = \frac{\|g_{k+1}\|^2}{\|g_k\|^2}, \qquad d_{k+1} = -g_{k+1} + \beta_k d_k.$$

For each model, compute the prediction vector, then $R^2$ on the test set and the training set.
In [63]: d1 = X_trainr.shape[1]
A = X_trainr.T.dot(X_trainr)
b = X_trainr.T.dot(y_train).reshape(d1, 1)
#d = A.shape[0]
In [ ]: # gradient descent: gradient of f at w is A w - b
lr = 1e-6   # learning rate (illustrative value; to be tuned)
w_f = np.random.rand(d1, 1)   # initialization
for k in range(100):
    w_f = w_f - lr*(A.dot(w_f) - b)   # w_{k+1} = w_k - lr * grad f(w_k); add -lam*w for ridge

# conjugate gradient (the ? placeholders completed from the formulas above)
w_c = np.random.rand(d1, 1)   # initialization
g = A.dot(w_c) - b            # g_0 = A w_0 - b
g1 = -g                       # initial direction d_0 = -g_0
for k in range(20):
    if np.linalg.norm(g) > 10**(-5):
        alpha = g.T.dot(g) / g1.T.dot(A.dot(g1))   # alpha_k = ||g_k||^2 / (A d_k, d_k)
        w_c = w_c + alpha*g1                       # w_{k+1} = w_k + alpha_k d_k
        g0 = g
        g = g0 + alpha*A.dot(g1)                   # g_{k+1} = g_k + alpha_k A d_k
        beta = g.T.dot(g) / g0.T.dot(g0)           # beta_k = ||g_{k+1}||^2 / ||g_k||^2
        g1 = -g + beta*g1                          # d_{k+1} = -g_{k+1} + beta_k d_k
In [ ]: k = 15
# L_c: list containing the 15 selected variables (coefc built from |w_c| like coefp above)
L_c = list(coefc[0:k].keys())
# intersection of L_R and L_c
print(len(set(L_c).intersection(set(L_R))))
In [ ]: # compare the three weight vectors (LR, pseudo-inverse, conjugate gradient)
plt.scatter(range(len(LR.coef_)), LR.coef_)
plt.scatter(range(len(w_p)-1), w_p[1:len(w_p)])
plt.scatter(range(len(w_c)-1), w_c[1:d1])
In [ ]: # end
# R Bessi, September 2023