Lab1 Features Selections-Class-GI2

Optimization Applied to ML

3AGI+Neprev

Projet 1 - Feature selection


Instructor: Radhia Bessi

Student Engineer:

Objective
The goal of this lab is to explore a few feature selection techniques for regression models:
embedded methods (lasso and ridge), KBest, SelectFromModel, RFE, RFECV, statsmodels
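For orientation, here is a minimal sketch of how the scikit-learn selectors named above are typically invoked. It assumes a numeric feature matrix X and target y (built later in this lab); the k=5 and estimator choices are illustrative only, not prescribed by the lab.

In [ ]: # Minimal sketch of the selectors named above (assumes numeric X, y)
from sklearn.feature_selection import SelectKBest, f_regression, RFE, RFECV, SelectFromModel
from sklearn.linear_model import LinearRegression, Lasso

X_kbest = SelectKBest(score_func=f_regression, k=5).fit_transform(X, y)        # k best univariate scores
X_sfm   = SelectFromModel(Lasso(alpha=1.0)).fit_transform(X, y)                # keep large lasso coefficients
X_rfe   = RFE(LinearRegression(), n_features_to_select=5).fit_transform(X, y)  # recursive feature elimination
X_rfecv = RFECV(LinearRegression(), cv=5).fit_transform(X, y)                  # RFE with cross-validation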

Data preprocessing and exploration

Let's start by importing our dataset.

In [1]: import numpy as np


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [2]: # import the dataset 'train_bm.csv'

df =pd.read_csv('train_bm.csv')

# A CSV (comma-separated values) file is a text file in which information is separated by commas

In [5]: df.head()

Out[5]:
  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0           FDA15         9.30          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1           DRC01         5.92          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2           FDN15        17.50          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3           FDX07        19.20          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4           NCD19         8.93          Low Fat         0.000000              Household   53.8614            OUT013  ...

In [ ]: #more description for data set: https://www.kaggle.com/code/rushikeshghate/sales-predictio


In [3]: # shape of dataframe df
df.shape
n=df.shape[0]
d=df.shape[1]
print(n,d)
# variable types
df.dtypes

8523 12
Out[3]:
Item_Identifier               object
Item_Weight float64
Item_Fat_Content object
Item_Visibility float64
Item_Type object
Item_MRP float64
Outlet_Identifier object
Outlet_Establishment_Year int64
Outlet_Size object
Outlet_Location_Type object
Outlet_Type object
Item_Outlet_Sales float64
dtype: object

In [4]: # variable names (as a list)


df.columns.to_list()
#print(cols)

Out[4]:
['Item_Identifier',
'Item_Weight',
'Item_Fat_Content',
'Item_Visibility',
'Item_Type',
'Item_MRP',
'Outlet_Identifier',
'Outlet_Establishment_Year',
'Outlet_Size',
'Outlet_Location_Type',
'Outlet_Type',
'Item_Outlet_Sales']

In [5]: #data description


df.describe(include='all')

Out[5]:
        Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility              Item_Type     Item_MRP Outlet_Identifier  ...
count              8523  7060.000000             8523      8523.000000                   8523  8523.000000              8523  ...
unique             1559          NaN                5              NaN                     16          NaN                10  ...
top               FDW13          NaN          Low Fat              NaN  Fruits and Vegetables          NaN            OUT027  ...
freq                 10          NaN             5089              NaN                   1232          NaN               935  ...
mean                NaN    12.857645              NaN         0.066132                    NaN   140.992782               NaN  ...
std                 NaN     4.643456              NaN         0.051598                    NaN    62.275067               NaN  ...
min                 NaN     4.555000              NaN         0.000000                    NaN    31.290000               NaN  ...
25%                 NaN     8.773750              NaN         0.026989                    NaN    93.826500               NaN  ...
50%                 NaN    12.600000              NaN         0.053931                    NaN   143.012800               NaN  ...
75%                 NaN    16.850000              NaN         0.094585                    NaN   185.643700               NaN  ...
max                 NaN    21.350000              NaN         0.328391                    NaN   266.888400               NaN  ...

In [6]: # general info


df.isnull().sum().sum()
df.head(20)

Out[6]:
   Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0            FDA15        9.300          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1            DRC01        5.920          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2            FDN15       17.500          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3            FDX07       19.200          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4            NCD19        8.930          Low Fat         0.000000              Household   53.8614            OUT013  ...
5            FDP36       10.395          Regular         0.000000           Baking Goods   51.4008            OUT018  ...
6            FDO10       13.650          Regular         0.012741            Snack Foods   57.6588            OUT013  ...
7            FDP10          NaN          Low Fat         0.127470            Snack Foods  107.7622            OUT027  ...
8            FDH17       16.200          Regular         0.016687           Frozen Foods   96.9726            OUT045  ...
9            FDU28       19.200          Regular         0.094450           Frozen Foods  187.8214            OUT017  ...
10           FDY07       11.800          Low Fat         0.000000  Fruits and Vegetables   45.5402            OUT049  ...
11           FDA03       18.500          Regular         0.045464                  Dairy  144.1102            OUT046  ...
12           FDX32       15.100          Regular         0.100014  Fruits and Vegetables  145.4786            OUT049  ...
13           FDS46       17.600          Regular         0.047257            Snack Foods  119.6782            OUT046  ...
14           FDF32       16.350          Low Fat         0.068024  Fruits and Vegetables  196.4426            OUT013  ...
15           FDP49        9.000          Regular         0.069089              Breakfast   56.3614            OUT046  ...
16           NCB42       11.800          Low Fat         0.008596     Health and Hygiene  115.3492            OUT018  ...
17           FDP49        9.000          Regular         0.069196              Breakfast   54.3614            OUT049  ...
18           DRI11          NaN          Low Fat         0.034238            Hard Drinks  113.2834            OUT027  ...
19           FDU02       13.350          Low Fat         0.102492                  Dairy  230.5352            OUT035  ...

Deleting Data points with missing values


In [7]: # count of missing values in each column
df.isnull().sum()
## total number of missing values in the dataframe
df.isnull().sum().sum()

Out[7]: 3873

In [22]: df.Item_Weight.isnull().sum()

Out[22]: 1463

In [25]: #deleting rows containing missing values


df_del = df.dropna(axis=0)
df_del.Item_Weight.isnull().sum()
df.isnull().sum().sum()

Out[25]: 3873

In [26]: # shape before and after removing missing values


print(df.shape,df_del.shape)

(8523, 12) (4650, 12)

In [9]: # replace missing values with the mode (most frequent) of the remaining values
#df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
df['Item_Weight'] = df['Item_Weight'].fillna(df.Item_Weight.mode()[0])

#df.Item_Weight=df.Item_Weight.fillna(value=df.Item_Weight.mean())

In [10]: # Item_Weight variable after missing-value treatment


df['Item_Weight'].isnull().sum()

Out[10]: 0

In [11]: # do the same for the 'Outlet_Size' variable


df['Outlet_Size'].isnull().sum()

Out[11]: 2410

In [12]: ## replace missing 'Outlet_Size' values with a new category labeled 'Mode'

df['Outlet_Size'] = df['Outlet_Size'].fillna(value='Mode')

In [32]: #no missing values
df.isnull().sum().sum()

Out[32]: 0

In [ ]: # Return DataFrame with duplicate rows removed.


df=df.drop_duplicates()
df.shape

Encoding categorical variables


In [33]: df.dtypes

Out[33]:
Item_Identifier               object
Item_Weight object
Item_Fat_Content object
Item_Visibility float64
Item_Type object
Item_MRP float64
Outlet_Identifier object
Outlet_Establishment_Year int64
Outlet_Size object
Outlet_Location_Type object
Outlet_Type object
Item_Outlet_Sales float64
dtype: object

In [13]: # number of occurrences of each unique value in the 'Outlet_Type' series
df['Outlet_Type'].unique()
df['Outlet_Type'].value_counts()

Out[13]:
Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [16]: #Encoding categorical variables : LabelEncoder: assigns each categorical value an integer
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [17]: # create dfc: a dataframe holding the 'Outlet_Type' variable


dfc=pd.DataFrame(df['Outlet_Type'],columns=['Outlet_Type'])

# encode dfc
dfc.Outlet_Type=encoder.fit_transform(df['Outlet_Type'])

In [46]: # print the first rows of dfc


dfc.head(20)
#dfc.value_counts()

Out[46]:
    Outlet_Type
0             1
1             2
2             1
3             0
4             1
5             2
6             1
7             3
8             1
9             1
10            1
11            1
12            1
13            1
14            1
15            1
16            2
17            1
18            3
19            1

In [47]: #Dummies encoding: converts categorical data into dummy or indicator variables

pd.get_dummies(df['Outlet_Type']).head()

Out[47]:
   Grocery Store  Supermarket Type1  Supermarket Type2  Supermarket Type3
0              0                  1                  0                  0
1              0                  0                  1                  0
2              0                  1                  0                  0
3              1                  0                  0                  0
4              0                  1                  0                  0

In [48]: df.shape

Out[48]: (8523, 12)

In [49]: df.Outlet_Establishment_Year.value_counts()

Out[49]:
1985    1463
1987 932
1999 930
1997 930
2004 930
2002 929
2009 928
2007 926
1998 555
Name: Outlet_Establishment_Year, dtype: int64

In [ ]: #df.describe()#include='float64')

In [ ]: df.Item_Weight.head()

In [18]: df.shape
df=pd.get_dummies(df, columns=['Outlet_Type','Outlet_Establishment_Year','Outlet_Size'])
df.shape

Out[18]: (8523, 26)

In [ ]: df.info()

In [20]: temp= df['Item_Identifier'].value_counts()


temp.head()

Out[20]:
FDW13    10
FDG33 10
NCY18 9
FDD38 9
DRE49 9
Name: Item_Identifier, dtype: int64

In [ ]: df.shape

In [21]: df['Item_identifier_count'] = df['Item_Identifier'].apply(lambda x: temp[x])


df[['Item_Identifier','Item_identifier_count']].head()

Out[21]: Item_Identifier Item_identifier_count

0 FDA15 8

1 DRC01 6

2 FDN15 7

3 FDX07 6

4 NCD19 6

In [53]: df.shape

Out[53]: (8523, 27)

In [54]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Location_Type 8523 non-null object
8 Item_Outlet_Sales 8523 non-null float64
9 Outlet_Type_Grocery Store 8523 non-null uint8
10 Outlet_Type_Supermarket Type1 8523 non-null uint8
11 Outlet_Type_Supermarket Type2 8523 non-null uint8
12 Outlet_Type_Supermarket Type3 8523 non-null uint8
13 Outlet_Establishment_Year_1985 8523 non-null uint8
14 Outlet_Establishment_Year_1987 8523 non-null uint8
15 Outlet_Establishment_Year_1997 8523 non-null uint8
16 Outlet_Establishment_Year_1998 8523 non-null uint8
17 Outlet_Establishment_Year_1999 8523 non-null uint8
18 Outlet_Establishment_Year_2002 8523 non-null uint8
19 Outlet_Establishment_Year_2004 8523 non-null uint8
20 Outlet_Establishment_Year_2007 8523 non-null uint8
21 Outlet_Establishment_Year_2009 8523 non-null uint8
22 Outlet_Size_High 8523 non-null uint8
23 Outlet_Size_Medium 8523 non-null uint8
24 Outlet_Size_Mode 8523 non-null uint8
25 Outlet_Size_Small 8523 non-null uint8
26 Item_identifier_count 8523 non-null int64
dtypes: float64(3), int64(1), object(6), uint8(17)
memory usage: 807.5+ KB

In [56]: L0=df.columns.to_list()
print(L0)
df.head()

['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Location_Type', 'Item_Outlet_Sales', 'Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2', 'Outlet_Type_Supermarket Type3', 'Outlet_Establishment_Year_1985', 'Outlet_Establishment_Year_1987', 'Outlet_Establishment_Year_1997', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_1999', 'Outlet_Establishment_Year_2002', 'Outlet_Establishment_Year_2004', 'Outlet_Establishment_Year_2007', 'Outlet_Establishment_Year_2009', 'Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Mode', 'Outlet_Size_Small', 'Item_identifier_count']

Out[56]:
  Item_Identifier Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0           FDA15         9.3          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1           DRC01        5.92          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2           FDN15        17.5          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3           FDX07        19.2          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4           NCD19        8.93          Low Fat         0.000000              Household   53.8614            OUT013  ...

5 rows × 27 columns

In [22]: L0=['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
            'Item_MRP', 'Outlet_Identifier', 'Outlet_Location_Type', 'Item_Outlet_Sales',
            'Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2',
            'Outlet_Type_Supermarket Type3', 'Outlet_Establishment_Year_1985', 'Outlet_Establishment_Year_1987',
            'Outlet_Establishment_Year_1997', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_1999',
            'Outlet_Establishment_Year_2002', 'Outlet_Establishment_Year_2004', 'Outlet_Establishment_Year_2007',
            'Outlet_Establishment_Year_2009', 'Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Mode',
            'Outlet_Size_Small', 'Item_identifier_count']

In [57]: df=df[L0]
df.head()

Out[57]:
  Item_Identifier Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0           FDA15         9.3          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1           DRC01        5.92          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2           FDN15        17.5          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3           FDX07        19.2          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4           NCD19        8.93          Low Fat         0.000000              Household   53.8614            OUT013  ...

5 rows × 27 columns


In [60]: df2 = df.select_dtypes(exclude=['float'])


df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Type 8523 non-null object
4 Outlet_Identifier 8523 non-null object
5 Outlet_Location_Type 8523 non-null object
6 Outlet_Type_Grocery Store 8523 non-null uint8
7 Outlet_Type_Supermarket Type1 8523 non-null uint8
8 Outlet_Type_Supermarket Type2 8523 non-null uint8
9 Outlet_Type_Supermarket Type3 8523 non-null uint8
10 Outlet_Establishment_Year_1985 8523 non-null uint8
11 Outlet_Establishment_Year_1987 8523 non-null uint8
12 Outlet_Establishment_Year_1997 8523 non-null uint8
13 Outlet_Establishment_Year_1998 8523 non-null uint8
14 Outlet_Establishment_Year_1999 8523 non-null uint8
15 Outlet_Establishment_Year_2002 8523 non-null uint8
16 Outlet_Establishment_Year_2004 8523 non-null uint8
17 Outlet_Establishment_Year_2007 8523 non-null uint8
18 Outlet_Establishment_Year_2009 8523 non-null uint8
19 Outlet_Size_High 8523 non-null uint8
20 Outlet_Size_Medium 8523 non-null uint8
21 Outlet_Size_Mode 8523 non-null uint8
22 Outlet_Size_Small 8523 non-null uint8
23 Item_identifier_count 8523 non-null int64
dtypes: int64(1), object(6), uint8(17)
memory usage: 607.7+ KB

In [ ]: df2.columns.to_list()

In [61]: df1 = df.select_dtypes(exclude=['uint8','float'])


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Location_Type 8523 non-null object
8 Item_Outlet_Sales 8523 non-null float64
9 Outlet_Type_Grocery Store 8523 non-null uint8
10 Outlet_Type_Supermarket Type1 8523 non-null uint8
11 Outlet_Type_Supermarket Type2 8523 non-null uint8
12 Outlet_Type_Supermarket Type3 8523 non-null uint8
13 Outlet_Establishment_Year_1985 8523 non-null uint8
14 Outlet_Establishment_Year_1987 8523 non-null uint8
15 Outlet_Establishment_Year_1997 8523 non-null uint8
16 Outlet_Establishment_Year_1998 8523 non-null uint8
17 Outlet_Establishment_Year_1999 8523 non-null uint8
18 Outlet_Establishment_Year_2002 8523 non-null uint8
19 Outlet_Establishment_Year_2004 8523 non-null uint8
20 Outlet_Establishment_Year_2007 8523 non-null uint8
21 Outlet_Establishment_Year_2009 8523 non-null uint8
22 Outlet_Size_High 8523 non-null uint8
23 Outlet_Size_Medium 8523 non-null uint8
24 Outlet_Size_Mode 8523 non-null uint8
25 Outlet_Size_Small 8523 non-null uint8
26 Item_identifier_count 8523 non-null int64
dtypes: float64(3), int64(1), object(6), uint8(17)
memory usage: 807.5+ KB

In [23]: # drop the 'object' variables


dfn= df.select_dtypes(exclude=['object'])
dfn.head()

Out[23]:
   Item_Weight  Item_Visibility  Item_MRP  Item_Outlet_Sales  Outlet_Type_Grocery Store  Outlet_Type_Supermarket Type1  ...
0         9.30         0.016047  249.8092          3735.1380                          0                              1  ...
1         5.92         0.019278   48.2692           443.4228                          0                              0  ...
2        17.50         0.016760  141.6180          2097.2700                          0                              1  ...
3        19.20         0.000000  182.0950           732.3800                          1                              0  ...
4         8.93         0.000000   53.8614           994.7052                          0                              1  ...

5 rows × 22 columns

In [63]: dfn.head()

Out[63]:
   Item_Visibility  Item_MRP  Item_Outlet_Sales  Outlet_Type_Grocery Store  Outlet_Type_Supermarket Type1  ...
0         0.016047  249.8092          3735.1380                          0                              1  ...
1         0.019278   48.2692           443.4228                          0                              0  ...
2         0.016760  141.6180          2097.2700                          0                              1  ...
3         0.000000  182.0950           732.3800                          1                              0  ...
4         0.000000   53.8614           994.7052                          0                              1  ...

5 rows × 21 columns

In [64]: dfn.isnull().sum()

Out[64]:
Item_Visibility                   0
Item_MRP 0
Item_Outlet_Sales 0
Outlet_Type_Grocery Store 0
Outlet_Type_Supermarket Type1 0
Outlet_Type_Supermarket Type2 0
Outlet_Type_Supermarket Type3 0
Outlet_Establishment_Year_1985 0
Outlet_Establishment_Year_1987 0
Outlet_Establishment_Year_1997 0
Outlet_Establishment_Year_1998 0
Outlet_Establishment_Year_1999 0
Outlet_Establishment_Year_2002 0
Outlet_Establishment_Year_2004 0
Outlet_Establishment_Year_2007 0
Outlet_Establishment_Year_2009 0
Outlet_Size_High 0
Outlet_Size_Medium 0
Outlet_Size_Mode 0
Outlet_Size_Small 0
Item_identifier_count 0
dtype: int64

In [66]: #boxplot
sns.boxplot(data = dfn[['Item_Weight' ,'Item_Visibility', 'Item_MRP']], orient = 'v')

Out[66]: <AxesSubplot:>


In [68]: #outliers
sns.boxplot(data = dfn['Item_Outlet_Sales'], orient = 'v')
Out[68]: <AxesSubplot:>

Boxplot and outliers


A boxplot is based on five key numbers:

Q1 = df.var.quantile(0.25): 1st quartile (25th percentile)
median: 2nd quartile (50th percentile)
Q3 = df.var.quantile(0.75): 3rd quartile (75th percentile)
IQR = Q3 - Q1: interquartile range
minimum = Q1 - 1.5 * IQR
maximum = Q3 + 1.5 * IQR

In [ ]: sns.boxplot(data=df.Item_Outlet_Sales,orient='h')

In [67]: # Outliers: extreme values within the dataset


Q1=df.Item_Outlet_Sales.quantile(0.25)
Q3=df.Item_Outlet_Sales.quantile(0.75)
IQR=Q3-Q1
Inf=Q1-1.5*IQR
Sup=Q3+1.5*IQR
#total outlier
len(df.loc[(df.Item_Outlet_Sales>Sup)])+len(df.loc[(df.Item_Outlet_Sales<Inf)])

Out[67]: 186

In [65]: #pairplot: To plot multiple pairwise bivariate distributions in a dataset


sns.pairplot(data = dfn[['Item_Outlet_Sales', 'Item_Visibility', 'Item_MRP']])

Out[65]: <seaborn.axisgrid.PairGrid at 0x1b02fe8a278>
Correlation:
A correlation matrix is a common tool for comparing the correlation coefficients between different features (or attributes) in a dataset. It lets us visualize how much (or how little) correlation exists between the variables.

In [26]: correlations = dfn.corr(method='pearson')

In [27]: plt.figure(figsize=(10,8))
sns.heatmap(correlations, annot = True)

Out[27]: <AxesSubplot:>
In [28]: X = dfn.drop(['Item_Outlet_Sales'], axis=1)# matrix of features
y = dfn['Item_Outlet_Sales']# vector of labels
X.shape, y.shape

Out[28]: ((8523, 21), (8523,))

In [29]: feature_names = X.columns.to_list()


print(feature_names)

['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2', 'Outlet_Type_Supermarket Type3', 'Outlet_Establishment_Year_1985', 'Outlet_Establishment_Year_1987', 'Outlet_Establishment_Year_1997', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_1999', 'Outlet_Establishment_Year_2002', 'Outlet_Establishment_Year_2004', 'Outlet_Establishment_Year_2007', 'Outlet_Establishment_Year_2009', 'Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Mode', 'Outlet_Size_Small', 'Item_identifier_count']

Linear regression
The last column of the dataframe df, the sales, denoted $y$, represents the total sales of some products of a store chain.
The goal of linear regression is to predict the sales of each product $x$ at a particular outlet as a function of the different variables $x_j$, $j = 1, \ldots, d$, by writing the prediction $\hat{y}$ as

$$\hat{y} = \sum_{j=0}^{d} w_j x_j,$$

(where we set $x_0 = 1$).

$w_0$ is called the bias: the prediction when all variables are zero.

$w_1, w_2, w_3, \ldots, w_d$ are the weights.

The weights and the bias are found by minimizing the least-squares error, which is the average over all the samples:

$$\min_{w \in \mathbb{R}^{d+1}} f(w) := \frac{1}{2n} \sum_{i=1}^{n} \Big( \sum_{j=0}^{d} w_j x_j^i - y^i \Big)^2 \qquad (1)$$

The function $f$ can be written as

$$f(w) = \frac{1}{2n} \|Xw - y\|^2 = \frac{1}{n}\Big(\frac{1}{2} w^T A w - b^T w\Big) + \text{const}, \qquad \text{with } A = X^T X \text{ and } b = X^T y.$$

The problem always admits at least one solution, and $w \in \arg\min_{w \in \mathbb{R}^{d+1}} f(w)$ if and only if $X^T X w = X^T y$ (the normal equations).
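As a quick numerical check of the normal equations, numpy's least-squares solver can be used directly on X and y from the cells above; this is a sketch, and X1 is a hypothetical bias-augmented copy of X introduced here.

In [ ]: # Sketch: minimal-norm least-squares solution of min ||X1 w - y||^2
X1 = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)  # prepend the column x0 = 1
w_ls, *_ = np.linalg.lstsq(X1, y, rcond=None)               # solves the normal equations
print(w_ls[:3])                                             # bias and first two weights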

In [2]: from IPython.display import Image

In [3]: Image('regression1.png')

Out[3]:

In [4]: Image('regression2.png')

Out[4]:
In [5]: Image('regression3.png')

Out[5]:

In [6]: Image('regression4.png')

Out[6]:
Training and test sets
To evaluate the model's performance on unseen data, we split the data into training and test sets. We train the model with, for example, 80% of the examples and test with the remaining 20%. We use the train_test_split function from the scikit-learn library. Choose a random_state to get the same split each time, so the results can be reproduced. Finally, print the sizes of the training and test sets to check that the split is correct.

Overfitting and Underfitting


When a model performs very well on training data but poorly on test data (new data), this is known as overfitting: the data used for training is not clean or not sufficient, or the model is too complex or has high variance.
When a model has not learned the patterns in the training data well and is unable to generalize to new data, this is known as underfitting: the data used for training is not clean or not sufficient, or the model is too simple or has high bias.

In [30]: from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=0)  # random_state fixes the split

In [ ]: ## **Features scaling**
#from sklearn.preprocessing import MinMaxScaler
#scaler = MinMaxScaler()
#from sklearn.preprocessing import RobustScaler
#scaler = RobustScaler()
#from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()
#X_train_ = scaler.fit_transform(X_train)
#X_test= scaler.transform(X_test)

In [ ]: print(X_test.shape)
Metrics for regression: general case
$y$ and $\hat{y}$ are the vectors of exact and predicted values, respectively.

Mean absolute error:

$$MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$$

Mean squared error:

$$MSE = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$

R-squared or coefficient of determination: $R^2$ is used to measure the correlation between the predicted and actual values of the dependent variable.

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \; \in \; ]-\infty, 1]$$

$$R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})}$$

The better the model, the higher the R-squared.

But as we add features, R-squared either increases or stays the same, so it cannot tell us how any individual feature impacts the model.

Adjusted R-squared:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-d-1},$$

where $n$ is the number of samples and $d$ the number of features.
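These formulas can be verified numerically against sklearn's metric functions; here is a small sketch where y_true and y_hat are hypothetical toy vectors introduced only for the check.

In [ ]: # Sketch: the formulas above versus sklearn's implementations (toy vectors)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat  = np.array([2.5, 5.0, 4.0, 8.0])
mae = np.mean(np.abs(y_true - y_hat))
mse = np.mean((y_true - y_hat)**2)
r2  = 1 - np.sum((y_true - y_hat)**2) / np.sum((y_true - np.mean(y_true))**2)
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)),
      np.isclose(mse, mean_squared_error(y_true, y_hat)),
      np.isclose(r2, r2_score(y_true, y_hat)))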

To train and test the model we use 'LinearRegression' from the scikit-learn (sklearn) library.

In [31]: from sklearn.linear_model import LinearRegression


LR = LinearRegression()

In [32]: # train the model on the training set


LR.fit(X_train,y_train)

Out[32]: LinearRegression()

In [48]: # the score here is R2


LR.score(X_train,y_train)

Out[48]: 0.5630614279454249

In [33]: w0 = LR.intercept_  # bias
w_R = LR.coef_  # weights

In [34]: print(f'Intercept = {w0}')


print(f'Coefs = {w_R}')

Intercept = -207.5275883376171
Coefs = [ -2.0262207 -343.68577229 15.53712456 -1305.69259881
363.57108798 -237.90354403 1180.02505485 67.03378685
-29.77356392 292.14669988 -192.70133081 -451.98090278
-43.28072999 450.56672686 145.89285794 -237.90354403
-29.77356392 490.14060804 -90.08920286 -370.27784126
5.97821985]

In [36]: from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [44]: # predictions of the model on the training set


y_R_train =LR.predict(X_train)
print(f'MAE = {mean_absolute_error(y_train.values, y_R_train)}')
print(f'MSE = {mean_squared_error(y_train.values, y_R_train)}')
#print(f'RMSE = {rmse(y_test.values, y_pred)}')
print(f'R2 = {r2_score(y_train.values, y_R_train)}')

MAE = 832.7697608114595
MSE = 1270620.5219161748
R2 = 0.5630614279454249

In [45]: # predictions of the model on the test set


y_R_test =LR.predict(X_test)
print(f'MAE = {mean_absolute_error(y_test.values, y_R_test)}')
print(f'MSE = {mean_squared_error(y_test.values, y_R_test)}')
#print(f'RMSE = {rmse(y_test.values, y_pred)}')
print(f'R2 = {r2_score(y_test.values, y_R_test)}')

MAE = 851.9198567977413
MSE = 1276730.9218397879
R2 = 0.5637881003820474

In [42]: test_df = pd.DataFrame(X_test, columns = X.columns)


test_df.head()

Out[42]:
      Item_Weight  Item_Visibility  Item_MRP  Outlet_Type_Grocery Store  Outlet_Type_Supermarket Type1  ...
4931       14.500         0.089960  159.5604                          0                              1  ...
4148       12.150         0.009535   64.5510                          0                              0  ...
7423       11.500         0.017742  129.6626                          0                              1  ...
4836       10.195         0.000000  143.1154                          0                              1  ...
944        21.000         0.049264  195.0478                          0                              1  ...

5 rows × 21 columns

In [47]: for i in range(10):


print([y_test.values[i],y_R_test[i]])

[1426.1436, 2489.3108732996498]
[1201.769, 2522.6479061046225]
[1836.2764, 2238.8794424597736]
[2410.8618, 2432.726290160981]
[1549.9824, 3115.3221305422135]
[3169.208, 3616.2284038788393]
[2036.6822, 2905.006552265028]
[824.9262, 1832.827140131652]
[378.1744, 1129.2447358303984]
[1573.9512, 1802.1739343111042]

In [ ]: import matplotlib.pyplot as plt


In [49]: coefR = pd.Series(np.abs(w_R), feature_names ).sort_values(ascending=False)


coefR.plot(kind='bar', title='LR Coefficients')

Out[49]: <AxesSubplot:title={'center':'LR Coefficients'}>

In [50]: k=15
# L_R: list containing the k selected variables
L_R=list(coefR[0:k].keys())
print(L_R)

['Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type3', 'Outlet_Size_Medium', 'Outlet_Establishment_Year_1999', 'Outlet_Establishment_Year_2004', 'Outlet_Size_Small', 'Outlet_Type_Supermarket Type1', 'Item_Visibility', 'Outlet_Establishment_Year_1997', 'Outlet_Type_Supermarket Type2', 'Outlet_Establishment_Year_2009', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_2007', 'Outlet_Size_Mode', 'Outlet_Establishment_Year_1985']

Regression and the pseudo-inverse


Recall the singular value decomposition (SVD) of $X$: there exist an orthogonal matrix $U \in \mathbb{R}^{n \times n}$, an orthogonal matrix $V \in \mathbb{R}^{d \times d}$, and a matrix $\Sigma \in \mathbb{R}^{n \times d}$ whose entries are nonnegative and zero off the diagonal, such that

$$X = U \Sigma V^T, \qquad \Sigma = \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \end{pmatrix} \in \mathbb{R}^{n \times d}.$$

If we denote by $\Sigma^+ = \mathrm{diag}\big(\tfrac{1}{\sigma_1}, \ldots, \tfrac{1}{\sigma_r}, 0, \ldots, 0\big)$ the $d \times n$ matrix obtained by inverting the nonzero singular values, the pseudo-inverse of $X$ is the matrix

$$X^+ = V \Sigma^+ U^T.$$

Then the vector $w = X^+ y$ is a minimal-norm solution of the regression problem (1).

In [61]: # add a bias column of ones to the train and test matrices


X_trainr=np.concatenate([np.ones((X_train.shape[0],1)),X_train], axis=1)
X_testr=np.concatenate([np.ones((X_test.shape[0],1)),X_test], axis=1)
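Before calling np.linalg.pinv below, here is a sketch of how $X^+$ can be built directly from the SVD formula above; np.linalg.pinv does the equivalent internally.

In [ ]: # Sketch: build X+ = V Σ+ U^T from the (thin) SVD and compare with np.linalg.pinv
U, s, Vt = np.linalg.svd(X_trainr, full_matrices=False)
s_inv = np.where(s > 1e-10, 1.0/s, 0.0)   # invert only the nonzero singular values
X_pinv = Vt.T @ np.diag(s_inv) @ U.T
print(np.allclose(X_pinv, np.linalg.pinv(X_trainr)))  # should agree up to tolerance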

In [57]: w_p=np.linalg.pinv(X_trainr).dot(y_train)

In [53]: print(w_p)
print(w0,w_R)

[ -121.01384688    -2.0262207   -343.68577229    15.53712456
 -1336.53501111   315.49412166  -261.02760391  1161.05464648
    39.75607547   -48.99195149   282.1035425   -215.2364401
  -450.15205623   -48.58128527   440.52356949   140.59230265
  -261.02760391   -48.99195149   449.87498635  -123.22542272
  -398.67145902     5.97821985]
-207.5275883376171 [ -2.0262207 -343.68577229 15.53712456 -1305.69259881
363.57108798 -237.90354403 1180.02505485 67.03378685
-29.77356392 292.14669988 -192.70133081 -451.98090278
-43.28072999 450.56672686 145.89285794 -237.90354403
-29.77356392 490.14060804 -90.08920286 -370.27784126
5.97821985]

In [59]: plt.scatter(range(len(w_R)),w_R)
plt.scatter(range(len(w_p[1:X_train.shape[1]+1])),w_p[1:X_train.shape[1]+1])

Out[59]: <matplotlib.collections.PathCollection at 0x1b02fe51c50>
In [62]: d=X_train.shape[1]
d1=d+1  # number of features + 1 (for the bias)
print(d)
coefp = pd.Series(np.abs(w_p[1:d1]).reshape(d,), feature_names ).sort_values(ascending=False)
coefp.plot(kind='bar', title='Pseudo-inverse Coefficients')

21
Out[62]: <AxesSubplot:title={'center':'Pseudo-inverse Coefficients'}>

In [ ]: # evaluation on the training set


y_p_train=?
# to be completed with the evaluation, but for the new model!!
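A possible completion, shown only as a sketch; it mirrors the evaluation done for LR above, with predictions computed from the bias-augmented matrices.

In [ ]: # Sketch of a possible completion (not the required answer)
y_p_train = X_trainr.dot(w_p)
print(f'MAE = {mean_absolute_error(y_train.values, y_p_train)}')
print(f'MSE = {mean_squared_error(y_train.values, y_p_train)}')
print(f'R2 = {r2_score(y_train.values, y_p_train)}')

y_p_test = X_testr.dot(w_p)
print(f'R2 (test) = {r2_score(y_test.values, y_p_test)}')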

Work to complete for GI2


Redo the same work for the dataset 50_Startups0.csv. You are asked to:
import the dataset
handle the missing values
look for outliers in each variable
convert the State variable to numeric (use dummies)
the feature matrix X contains all variables except 'Profit'; the label y represents 'Profit'.
Extract X and y
compute and interpret the correlation matrix

apply sklearn's linear regression model

evaluate the model on both the test and training sets (with MAE, MSE, R2)
use the pseudo-inverse matrix, then evaluate the resulting model
for each technique used, plot the most relevant variables by importance, then list the two most decisive variables.

Interpretations of the results will greatly enrich your work.

In [ ]: ## end of lab

In [ ]: # Version September 2023, R. Bessi


In [ ]: # For GI1

In [ ]: # Function defining Adjusted R2


def r2_adj(d,y,z):

    n = y.shape[0]
    #print(d)
    return ?
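One possible completion, shown as a sketch; it applies the adjusted R² formula from the metrics section, with z the vector of predictions.

In [ ]: # Sketch of a possible definition (z = predicted values)
def r2_adj(d, y, z):
    n = y.shape[0]
    r2 = r2_score(y, z)                  # plain R2 from sklearn
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)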

Regression via an optimization problem


Classical algorithms

Gradient descent
The gradient descent algorithm applied to a function $f$ is defined by an initial point $w_0$ and the iteration

$$w_{k+1} = w_k - \alpha_k \nabla f(w_k),$$

where $\alpha_k > 0$ is a step length for iteration $k$.

The descent step $\alpha_k$ is called the learning rate in ML and is generally denoted $lr$.

Quadratic case


If $f(w) = \frac{1}{2} w^T A w - b^T w = 0.5\,(Aw, w) - (b, w)$ for $A$ symmetric, recall that $\nabla f(w) = Aw - b$.

Here $A = X_{trainr}^T X_{trainr}$ and $b = X_{trainr}^T\, y_{train}$.

Implement gradient descent with three possible choices of step length:

GPF: fixed step $\alpha = 2/(\lambda_{max}(A) + \lambda_{min}(A))$.

GPO: optimal step $\alpha_k$ (exact line search at each iteration).

Conjugate gradient (GC):

Choose $w_0 \in \mathbb{R}^d$.

Take $g_0 = A w_0 - b = -d_0$.

For $k \ge 0$, compute

$$\alpha_k = \frac{\|g_k\|^2}{(A d_k, d_k)}, \qquad w_{k+1} = w_k + \alpha_k d_k,$$

$$g_{k+1} = A w_{k+1} - b = g_k + \alpha_k A d_k,$$

$$\beta_k = \frac{\|g_{k+1}\|^2}{\|g_k\|^2}, \qquad d_{k+1} = -g_{k+1} + \beta_k d_k.$$

For each model, compute the prediction vector and then $R^2$ on the test set and the training set.

In [63]: d1=X_trainr.shape[1]
A=X_trainr.T.dot(X_trainr)
b=X_trainr.T.dot(y_train).reshape(d1,1)
#d=A.shape[0]


In [ ]: # Compute rho(A) and cond(A): spectral radius and condition number of A (use the numpy library)


rho=
Cond=
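A possible way to fill in these values with numpy, shown as a sketch (A is the symmetric matrix built above, so its eigenvalues are real).

In [ ]: # Sketch of a possible completion
eigs = np.real(np.linalg.eigvals(A))                 # eigenvalues of A (real, since A is symmetric)
rho = np.max(np.abs(eigs))                           # spectral radius
Cond = np.max(np.abs(eigs)) / np.min(np.abs(eigs))   # condition number (ratio of extreme eigenvalues)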

What is the effect of ρ(A) and cond(A) on the GPF and GC algorithms?


In [ ]: plt.scatter(range(len(LR.coef_)), LR.coef_)
plt.scatter(range(len(w_p)-1),w_p[1:len(w_p)])

In [ ]: # Fixed-step gradient method


n=A.shape[0]
d1=A.shape[1]
lr=np.real(2/(np.max(np.linalg.eig(A)[0])+np.min(np.linalg.eig(A)[0])))  # the descent step (learning rate)

w_f=np.random.rand(d1,1)  # initialization
for k in range(100):
    w_f=w_f-lr*(A.dot(w_f)-b)  #-lam*w
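The optimal-step variant ("GPO") mentioned above can be written in the same style; for a quadratic f the exact line-search step is $\alpha_k = \|g_k\|^2 / (g_k^T A g_k)$. A sketch follows (w_o is a name introduced here).

In [ ]: # Sketch: gradient descent with the optimal (exact line-search) step
w_o = np.random.rand(d1, 1)            # initialization
for k in range(100):
    g = A.dot(w_o) - b                 # gradient at w_o
    denom = float(g.T.dot(A.dot(g)))
    if denom < 1e-12:                  # stop if the curvature term vanishes
        break
    alpha = float(g.T.dot(g)) / denom  # optimal step for a quadratic
    w_o = w_o - alpha * g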

In [ ]: # Evaluate the performance of the fixed-step gradient method (use w_f here)

In [ ]: # Implement the conjugate gradient algorithm


# GC method

w_c=np.random.rand(d1,1)  # initialization
g=(A.dot(w_c)-b)
g1=-g
for k in range(20):
    #print(k)
    if np.linalg.norm(g)>10**(-5):
        alpha=?
        w_c =?
        g0=g
        g=?
        beta=?
        g1=-g+beta*g1
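A possible completion of the blanks, shown as a sketch; it follows the GC recursion written above, with g1 playing the role of the direction d_k.

In [ ]: # Sketch of a possible completion (g1 holds the direction d_k)
w_c = np.random.rand(d1, 1)   # initialization
g = A.dot(w_c) - b            # g_0
g1 = -g                       # d_0
for k in range(20):
    if np.linalg.norm(g) > 10**(-5):
        alpha = float(g.T.dot(g)) / float(g1.T.dot(A.dot(g1)))  # ||g_k||^2 / (A d_k, d_k)
        w_c = w_c + alpha * g1
        g0 = g
        g = g0 + alpha * A.dot(g1)                              # g_{k+1} = g_k + alpha A d_k
        beta = float(g.T.dot(g)) / float(g0.T.dot(g0))          # ||g_{k+1}||^2 / ||g_k||^2
        g1 = -g + beta * g1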

In [ ]: # Evaluate the performance of the conjugate gradient method (use w_c here)

In [ ]: # display the obtained solution w_c by importance


coefc =?
coefc.plot?
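A possible completion, following the same pattern as coefR and coefp above (sketch only; it assumes w_c has been computed, e.g. via the GC sketch).

In [ ]: # Sketch of a possible completion
coefc = pd.Series(np.abs(w_c[1:d1]).reshape(d1-1,), feature_names).sort_values(ascending=False)
coefc.plot(kind='bar', title='GC Coefficients')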

In [ ]: k=15
L_c=list(coefc[0:k].keys())
# L_c: list containing the 15 selected variables
# intersection of L_R and L_c
print(len(set(L_c).intersection(set(L_R))))

In [ ]: plt.scatter(range(len(LR.coef_)), LR.coef_)
plt.scatter(range(len(w_p)-1),w_p[1:len(w_p)])

plt.scatter(range(len(w_c)-1), w_c[1:d1])
#plt.scatter(range(len(w_p)-1),w_p[1:len(w_p)])

In [ ]: # end
# R. Bessi, September 2023
