Lab1 Features Selections-Class-GI2

Optimization Applied to ML

3AGI+Neprev

Projet 1 - Feature selection


Instructor: Radhia Bessi

Student Engineer:

Objective
The goal of this lab is to explore a few feature selection techniques for regression models:
embedded methods (lasso and ridge), KBest, SelectFromModel, RFE, RFECV, statsmodels
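For orientation, here is a minimal sketch of how the scikit-learn selectors named above are typically invoked. It assumes a numeric feature matrix X and target y (built later in this lab); the k=5 and estimator choices are illustrative only, not prescribed by the lab.

In [ ]: # Minimal sketch of the selectors named above (assumes numeric X, y)
from sklearn.feature_selection import SelectKBest, f_regression, RFE, RFECV, SelectFromModel
from sklearn.linear_model import LinearRegression, Lasso

X_kbest = SelectKBest(score_func=f_regression, k=5).fit_transform(X, y)        # k best univariate scores
X_sfm   = SelectFromModel(Lasso(alpha=1.0)).fit_transform(X, y)                # keep large lasso coefficients
X_rfe   = RFE(LinearRegression(), n_features_to_select=5).fit_transform(X, y)  # recursive feature elimination
X_rfecv = RFECV(LinearRegression(), cv=5).fit_transform(X, y)                  # RFE with cross-validation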

Data preprocessing and exploration

Let's start by importing our dataset.

In [1]: import numpy as np


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [2]: # import the dataset 'train_bm.csv'

df =pd.read_csv('train_bm.csv')

# A CSV (comma-separated values) file is a text file in which information is separated by commas

In [5]: df.head()

Out[5]:
  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0           FDA15         9.30          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1           DRC01         5.92          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2           FDN15        17.50          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3           FDX07        19.20          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4           NCD19         8.93          Low Fat         0.000000              Household   53.8614            OUT013  ...

In [ ]: #more description for data set: https://www.kaggle.com/code/rushikeshghate/sales-predictio


In [3]: # shape of dataframe df
df.shape
n=df.shape[0]
d=df.shape[1]
print(n,d)
# variable types
df.dtypes

8523 12
Out[3]:
Item_Identifier               object
Item_Weight float64
Item_Fat_Content object
Item_Visibility float64
Item_Type object
Item_MRP float64
Outlet_Identifier object
Outlet_Establishment_Year int64
Outlet_Size object
Outlet_Location_Type object
Outlet_Type object
Item_Outlet_Sales float64
dtype: object

In [4]: # variable names (as a list)


df.columns.to_list()
#print(cols)

Out[4]:
['Item_Identifier',
'Item_Weight',
'Item_Fat_Content',
'Item_Visibility',
'Item_Type',
'Item_MRP',
'Outlet_Identifier',
'Outlet_Establishment_Year',
'Outlet_Size',
'Outlet_Location_Type',
'Outlet_Type',
'Item_Outlet_Sales']

In [5]: #data description


df.describe(include='all')

Out[5]:
        Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility              Item_Type     Item_MRP Outlet_Identifier  ...
count              8523  7060.000000             8523      8523.000000                   8523  8523.000000              8523  ...
unique             1559          NaN                5              NaN                     16          NaN                10  ...
top               FDW13          NaN          Low Fat              NaN  Fruits and Vegetables          NaN            OUT027  ...
freq                 10          NaN             5089              NaN                   1232          NaN               935  ...
mean                NaN    12.857645              NaN         0.066132                    NaN   140.992782               NaN  ...
std                 NaN     4.643456              NaN         0.051598                    NaN    62.275067               NaN  ...
min                 NaN     4.555000              NaN         0.000000                    NaN    31.290000               NaN  ...
25%                 NaN     8.773750              NaN         0.026989                    NaN    93.826500               NaN  ...
50%                 NaN    12.600000              NaN         0.053931                    NaN   143.012800               NaN  ...
75%                 NaN    16.850000              NaN         0.094585                    NaN   185.643700               NaN  ...
max                 NaN    21.350000              NaN         0.328391                    NaN   266.888400               NaN  ...

In [6]: # general info


df.isnull().sum().sum()
df.head(20)

Out[6]:
   Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0            FDA15        9.300          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1            DRC01        5.920          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2            FDN15       17.500          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3            FDX07       19.200          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4            NCD19        8.930          Low Fat         0.000000              Household   53.8614            OUT013  ...
5            FDP36       10.395          Regular         0.000000           Baking Goods   51.4008            OUT018  ...
6            FDO10       13.650          Regular         0.012741            Snack Foods   57.6588            OUT013  ...
7            FDP10          NaN          Low Fat         0.127470            Snack Foods  107.7622            OUT027  ...
8            FDH17       16.200          Regular         0.016687           Frozen Foods   96.9726            OUT045  ...
9            FDU28       19.200          Regular         0.094450           Frozen Foods  187.8214            OUT017  ...
10           FDY07       11.800          Low Fat         0.000000  Fruits and Vegetables   45.5402            OUT049  ...
11           FDA03       18.500          Regular         0.045464                  Dairy  144.1102            OUT046  ...
12           FDX32       15.100          Regular         0.100014  Fruits and Vegetables  145.4786            OUT049  ...
13           FDS46       17.600          Regular         0.047257            Snack Foods  119.6782            OUT046  ...
14           FDF32       16.350          Low Fat         0.068024  Fruits and Vegetables  196.4426            OUT013  ...
15           FDP49        9.000          Regular         0.069089              Breakfast   56.3614            OUT046  ...
16           NCB42       11.800          Low Fat         0.008596     Health and Hygiene  115.3492            OUT018  ...
17           FDP49        9.000          Regular         0.069196              Breakfast   54.3614            OUT049  ...
18           DRI11          NaN          Low Fat         0.034238            Hard Drinks  113.2834            OUT027  ...
19           FDU02       13.350          Low Fat         0.102492                  Dairy  230.5352            OUT035  ...

Deleting Data points with missing values


In [7]: # count of missing values in each column
df.isnull().sum()
## total number of missing values in the dataframe
df.isnull().sum().sum()

Out[7]: 3873

In [22]: df.Item_Weight.isnull().sum()

Out[22]: 1463

In [25]: #deleting rows containing missing values


df_del = df.dropna(axis=0)
df_del.Item_Weight.isnull().sum()
df.isnull().sum().sum()

Out[25]: 3873

In [26]: # shape before and after removing missing values


print(df.shape,df_del.shape)

(8523, 12) (4650, 12)

In [9]: # replace missing values with the mode (most frequent) of the remaining values
#df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
df['Item_Weight'] = df['Item_Weight'].fillna(df.Item_Weight.mode()[0])

#df.Item_Weight=df.Item_Weight.fillna(value=df.Item_Weight.mean())

In [10]: # Item_Weight variable after missing-value treatment


df['Item_Weight'].isnull().sum()

Out[10]: 0

In [11]: # do the same for the 'Outlet_Size' variable


df['Outlet_Size'].isnull().sum()

Out[11]: 2410

In [12]: ## replace missing 'Outlet_Size' values with a new category labeled 'Mode'

df['Outlet_Size'] = df['Outlet_Size'].fillna(value='Mode')

In [32]: #no missing values
df.isnull().sum().sum()

Out[32]: 0

In [ ]: # Return DataFrame with duplicate rows removed.


df=df.drop_duplicates()
df.shape

Encoding categorical variables


In [33]: df.dtypes

Out[33]:
Item_Identifier               object
Item_Weight object
Item_Fat_Content object
Item_Visibility float64
Item_Type object
Item_MRP float64
Outlet_Identifier object
Outlet_Establishment_Year int64
Outlet_Size object
Outlet_Location_Type object
Outlet_Type object
Item_Outlet_Sales float64
dtype: object

In [13]: # number of occurrences of each unique value in the 'Outlet_Type' series
df['Outlet_Type'].unique()
df['Outlet_Type'].value_counts()

Out[13]:
Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [16]: #Encoding categorical variables : LabelEncoder: assigns each categorical value an integer
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [17]: # create dfc: a dataframe holding the 'Outlet_Type' variable


dfc=pd.DataFrame(df['Outlet_Type'],columns=['Outlet_Type'])

# encode dfc
dfc.Outlet_Type=encoder.fit_transform(df['Outlet_Type'])

In [46]: # print the first rows of dfc


dfc.head(20)
#dfc.value_counts()

Out[46]:
    Outlet_Type
0             1
1             2
2             1
3             0
4             1
5             2
6             1
7             3
8             1
9             1
10            1
11            1
12            1
13            1
14            1
15            1
16            2
17            1
18            3
19            1

In [47]: #Dummies encoding: converts categorical data into dummy or indicator variables

pd.get_dummies(df['Outlet_Type']).head()

Out[47]:
   Grocery Store  Supermarket Type1  Supermarket Type2  Supermarket Type3
0              0                  1                  0                  0
1              0                  0                  1                  0
2              0                  1                  0                  0
3              1                  0                  0                  0
4              0                  1                  0                  0

In [48]: df.shape

Out[48]: (8523, 12)

In [49]: df.Outlet_Establishment_Year.value_counts()

Out[49]:
1985    1463
1987 932
1999 930
1997 930
2004 930
2002 929
2009 928
2007 926
1998 555
Name: Outlet_Establishment_Year, dtype: int64

In [ ]: #df.describe()#include='float64')

In [ ]: df.Item_Weight.head()

In [18]: df.shape
df=pd.get_dummies(df, columns=['Outlet_Type','Outlet_Establishment_Year','Outlet_Size'])
df.shape

Out[18]: (8523, 26)

In [ ]: df.info()

In [20]: temp= df['Item_Identifier'].value_counts()


temp.head()

Out[20]:
FDW13    10
FDG33 10
NCY18 9
FDD38 9
DRE49 9
Name: Item_Identifier, dtype: int64

In [ ]: df.shape

In [21]: df['Item_identifier_count'] = df['Item_Identifier'].apply(lambda x: temp[x])


df[['Item_Identifier','Item_identifier_count']].head()

Out[21]: Item_Identifier Item_identifier_count

0 FDA15 8

1 DRC01 6

2 FDN15 7

3 FDX07 6

4 NCD19 6

In [53]: df.shape

Out[53]: (8523, 27)

In [54]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Location_Type 8523 non-null object
8 Item_Outlet_Sales 8523 non-null float64
9 Outlet_Type_Grocery Store 8523 non-null uint8
10 Outlet_Type_Supermarket Type1 8523 non-null uint8
11 Outlet_Type_Supermarket Type2 8523 non-null uint8
12 Outlet_Type_Supermarket Type3 8523 non-null uint8
13 Outlet_Establishment_Year_1985 8523 non-null uint8
14 Outlet_Establishment_Year_1987 8523 non-null uint8
15 Outlet_Establishment_Year_1997 8523 non-null uint8
16 Outlet_Establishment_Year_1998 8523 non-null uint8
17 Outlet_Establishment_Year_1999 8523 non-null uint8
18 Outlet_Establishment_Year_2002 8523 non-null uint8
19 Outlet_Establishment_Year_2004 8523 non-null uint8
20 Outlet_Establishment_Year_2007 8523 non-null uint8
21 Outlet_Establishment_Year_2009 8523 non-null uint8
22 Outlet_Size_High 8523 non-null uint8
23 Outlet_Size_Medium 8523 non-null uint8
24 Outlet_Size_Mode 8523 non-null uint8
25 Outlet_Size_Small 8523 non-null uint8
26 Item_identifier_count 8523 non-null int64
dtypes: float64(3), int64(1), object(6), uint8(17)
memory usage: 807.5+ KB

In [56]: L0=df.columns.to_list()
print(L0)
df.head()

['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Location_Type', 'Item_Outlet_Sales', 'Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2', 'Outlet_Type_Supermarket Type3', 'Outlet_Establishment_Year_1985', 'Outlet_Establishment_Year_1987', 'Outlet_Establishment_Year_1997', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_1999', 'Outlet_Establishment_Year_2002', 'Outlet_Establishment_Year_2004', 'Outlet_Establishment_Year_2007', 'Outlet_Establishment_Year_2009', 'Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Mode', 'Outlet_Size_Small', 'Item_identifier_count']

Out[56]:
  Item_Identifier Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0           FDA15         9.3          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1           DRC01        5.92          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2           FDN15        17.5          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3           FDX07        19.2          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4           NCD19        8.93          Low Fat         0.000000              Household   53.8614            OUT013  ...

5 rows × 27 columns

In [22]: L0=['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
            'Item_MRP', 'Outlet_Identifier', 'Outlet_Location_Type', 'Item_Outlet_Sales',
            'Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2',
            'Outlet_Type_Supermarket Type3', 'Outlet_Establishment_Year_1985', 'Outlet_Establishment_Year_1987',
            'Outlet_Establishment_Year_1997', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_1999',
            'Outlet_Establishment_Year_2002', 'Outlet_Establishment_Year_2004', 'Outlet_Establishment_Year_2007',
            'Outlet_Establishment_Year_2009', 'Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Mode',
            'Outlet_Size_Small', 'Item_identifier_count']

In [57]: df=df[L0]
df.head()

Out[57]:
  Item_Identifier Item_Weight Item_Fat_Content  Item_Visibility              Item_Type  Item_MRP Outlet_Identifier  ...
0           FDA15         9.3          Low Fat         0.016047                  Dairy  249.8092            OUT049  ...
1           DRC01        5.92          Regular         0.019278            Soft Drinks   48.2692            OUT018  ...
2           FDN15        17.5          Low Fat         0.016760                   Meat  141.6180            OUT049  ...
3           FDX07        19.2          Regular         0.000000  Fruits and Vegetables  182.0950            OUT010  ...
4           NCD19        8.93          Low Fat         0.000000              Household   53.8614            OUT013  ...

5 rows × 27 columns


In [60]: df2 = df.select_dtypes(exclude=['float'])


df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Type 8523 non-null object
4 Outlet_Identifier 8523 non-null object
5 Outlet_Location_Type 8523 non-null object
6 Outlet_Type_Grocery Store 8523 non-null uint8
7 Outlet_Type_Supermarket Type1 8523 non-null uint8
8 Outlet_Type_Supermarket Type2 8523 non-null uint8
9 Outlet_Type_Supermarket Type3 8523 non-null uint8
10 Outlet_Establishment_Year_1985 8523 non-null uint8
11 Outlet_Establishment_Year_1987 8523 non-null uint8
12 Outlet_Establishment_Year_1997 8523 non-null uint8
13 Outlet_Establishment_Year_1998 8523 non-null uint8
14 Outlet_Establishment_Year_1999 8523 non-null uint8
15 Outlet_Establishment_Year_2002 8523 non-null uint8
16 Outlet_Establishment_Year_2004 8523 non-null uint8
17 Outlet_Establishment_Year_2007 8523 non-null uint8
18 Outlet_Establishment_Year_2009 8523 non-null uint8
19 Outlet_Size_High 8523 non-null uint8
20 Outlet_Size_Medium 8523 non-null uint8
21 Outlet_Size_Mode 8523 non-null uint8
22 Outlet_Size_Small 8523 non-null uint8
23 Item_identifier_count 8523 non-null int64
dtypes: int64(1), object(6), uint8(17)
memory usage: 607.7+ KB

In [ ]: df2.columns.to_list()

In [61]: df1 = df.select_dtypes(exclude=['uint8','float'])


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 8523 non-null object
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Location_Type 8523 non-null object
8 Item_Outlet_Sales 8523 non-null float64
9 Outlet_Type_Grocery Store 8523 non-null uint8
10 Outlet_Type_Supermarket Type1 8523 non-null uint8
11 Outlet_Type_Supermarket Type2 8523 non-null uint8
12 Outlet_Type_Supermarket Type3 8523 non-null uint8
13 Outlet_Establishment_Year_1985 8523 non-null uint8
14 Outlet_Establishment_Year_1987 8523 non-null uint8
15 Outlet_Establishment_Year_1997 8523 non-null uint8
16 Outlet_Establishment_Year_1998 8523 non-null uint8
17 Outlet_Establishment_Year_1999 8523 non-null uint8
18 Outlet_Establishment_Year_2002 8523 non-null uint8
19 Outlet_Establishment_Year_2004 8523 non-null uint8
20 Outlet_Establishment_Year_2007 8523 non-null uint8
21 Outlet_Establishment_Year_2009 8523 non-null uint8
22 Outlet_Size_High 8523 non-null uint8
23 Outlet_Size_Medium 8523 non-null uint8
24 Outlet_Size_Mode 8523 non-null uint8
25 Outlet_Size_Small 8523 non-null uint8
26 Item_identifier_count 8523 non-null int64
dtypes: float64(3), int64(1), object(6), uint8(17)
memory usage: 807.5+ KB

In [23]: # drop the 'object' variables


dfn= df.select_dtypes(exclude=['object'])
dfn.head()

Out[23]:
   Item_Weight  Item_Visibility  Item_MRP  Item_Outlet_Sales  Outlet_Type_Grocery Store  Outlet_Type_Supermarket Type1  ...
0         9.30         0.016047  249.8092          3735.1380                          0                              1  ...
1         5.92         0.019278   48.2692           443.4228                          0                              0  ...
2        17.50         0.016760  141.6180          2097.2700                          0                              1  ...
3        19.20         0.000000  182.0950           732.3800                          1                              0  ...
4         8.93         0.000000   53.8614           994.7052                          0                              1  ...

5 rows × 22 columns

In [63]: dfn.head()

Out[63]:
   Item_Visibility  Item_MRP  Item_Outlet_Sales  Outlet_Type_Grocery Store  Outlet_Type_Supermarket Type1  ...
0         0.016047  249.8092          3735.1380                          0                              1  ...
1         0.019278   48.2692           443.4228                          0                              0  ...
2         0.016760  141.6180          2097.2700                          0                              1  ...
3         0.000000  182.0950           732.3800                          1                              0  ...
4         0.000000   53.8614           994.7052                          0                              1  ...

5 rows × 21 columns

In [64]: dfn.isnull().sum()

Out[64]:
Item_Visibility                   0
Item_MRP 0
Item_Outlet_Sales 0
Outlet_Type_Grocery Store 0
Outlet_Type_Supermarket Type1 0
Outlet_Type_Supermarket Type2 0
Outlet_Type_Supermarket Type3 0
Outlet_Establishment_Year_1985 0
Outlet_Establishment_Year_1987 0
Outlet_Establishment_Year_1997 0
Outlet_Establishment_Year_1998 0
Outlet_Establishment_Year_1999 0
Outlet_Establishment_Year_2002 0
Outlet_Establishment_Year_2004 0
Outlet_Establishment_Year_2007 0
Outlet_Establishment_Year_2009 0
Outlet_Size_High 0
Outlet_Size_Medium 0
Outlet_Size_Mode 0
Outlet_Size_Small 0
Item_identifier_count 0
dtype: int64

In [66]: #boxplot
sns.boxplot(data = dfn[['Item_Weight' ,'Item_Visibility', 'Item_MRP']], orient = 'v')

Out[66]: <AxesSubplot:>


In [68]: #outliers
sns.boxplot(data = dfn['Item_Outlet_Sales'], orient = 'v')
Out[68]: <AxesSubplot:>

Boxplot and outliers


A boxplot is based on five key numbers:

Q1 = df.var.quantile(0.25): 1st quartile (25th percentile)
median: 2nd quartile (50th percentile)
Q3 = df.var.quantile(0.75): 3rd quartile (75th percentile)
IQR = Q3 - Q1: interquartile range
minimum = Q1 - 1.5 * IQR
maximum = Q3 + 1.5 * IQR

In [ ]: sns.boxplot(data=df.Item_Outlet_Sales,orient='h')

In [67]: # Outliers: extreme values within the dataset


Q1=df.Item_Outlet_Sales.quantile(0.25)
Q3=df.Item_Outlet_Sales.quantile(0.75)
IQR=Q3-Q1
Inf=Q1-1.5*IQR
Sup=Q3+1.5*IQR
#total outlier
len(df.loc[(df.Item_Outlet_Sales>Sup)])+len(df.loc[(df.Item_Outlet_Sales<Inf)])

Out[67]: 186

In [65]: #pairplot: To plot multiple pairwise bivariate distributions in a dataset


sns.pairplot(data = dfn[['Item_Outlet_Sales', 'Item_Visibility', 'Item_MRP']])

Out[65]: <seaborn.axisgrid.PairGrid at 0x1b02fe8a278>
Correlation:
A correlation matrix is a common tool for comparing the correlation coefficients between different features (or attributes) in a dataset. It lets us visualize how much (or how little) correlation exists between the variables.

In [26]: correlations = dfn.corr(method='pearson')

In [27]: plt.figure(figsize=(10,8))
sns.heatmap(correlations, annot = True)

Out[27]: <AxesSubplot:>
In [28]: X = dfn.drop(['Item_Outlet_Sales'], axis=1)# matrix of features
y = dfn['Item_Outlet_Sales']# vector of labels
X.shape, y.shape

Out[28]: ((8523, 21), (8523,))

In [29]: feature_names = X.columns.to_list()


print(feature_names)

['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2', 'Outlet_Type_Supermarket Type3', 'Outlet_Establishment_Year_1985', 'Outlet_Establishment_Year_1987', 'Outlet_Establishment_Year_1997', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_1999', 'Outlet_Establishment_Year_2002', 'Outlet_Establishment_Year_2004', 'Outlet_Establishment_Year_2007', 'Outlet_Establishment_Year_2009', 'Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Mode', 'Outlet_Size_Small', 'Item_identifier_count']

Linear regression
The last column of the dataframe df, the sales, denoted $y$, represents the total sales of some products of a store chain.
The goal of linear regression is to predict the sales of each product $x$ at a particular outlet as a function of the different variables $x_j$, $j = 1, \ldots, d$, by writing the prediction $\hat{y}$ as

$$\hat{y} = \sum_{j=0}^{d} w_j x_j,$$

(where we set $x_0 = 1$).

$w_0$ is called the bias: the prediction when all variables are zero.

$w_1, w_2, w_3, \ldots, w_d$ are the weights.

The weights and the bias are found by minimizing the least-squares error, which is the average over all the samples:

$$\min_{w \in \mathbb{R}^{d+1}} f(w) := \frac{1}{2n} \sum_{i=1}^{n} \Big( \sum_{j=0}^{d} w_j x_j^i - y^i \Big)^2 \qquad (1)$$

The function $f$ can be written as

$$f(w) = \frac{1}{2n} \|Xw - y\|^2 = \frac{1}{n}\Big(\frac{1}{2} w^T A w - b^T w\Big) + \text{const}, \qquad \text{with } A = X^T X \text{ and } b = X^T y.$$

The problem always admits at least one solution, and $w \in \arg\min_{w \in \mathbb{R}^{d+1}} f(w)$ if and only if $X^T X w = X^T y$ (the normal equations).
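As a quick numerical check of the normal equations, numpy's least-squares solver can be used directly on X and y from the cells above; this is a sketch, and X1 is a hypothetical bias-augmented copy of X introduced here.

In [ ]: # Sketch: minimal-norm least-squares solution of min ||X1 w - y||^2
X1 = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)  # prepend the column x0 = 1
w_ls, *_ = np.linalg.lstsq(X1, y, rcond=None)               # solves the normal equations
print(w_ls[:3])                                             # bias and first two weights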

In [2]: from IPython.display import Image

In [3]: Image('regression1.png')

Out[3]:

In [4]: Image('regression2.png')

Out[4]:
In [5]: Image('regression3.png')

Out[5]:

In [6]: Image('regression4.png')

Out[6]:
Training and test sets
To evaluate the model's performance on unseen data, we split the data into training and test sets. We train the model with, for example, 80% of the examples and test with the remaining 20%. We use the train_test_split function from the scikit-learn library. Choose a random_state to get the same split each time, so the results can be reproduced. Finally, print the sizes of the training and test sets to check that the split is correct.

Overfitting and Underfitting


When a model performs very well on training data but poorly on test data (new data), this is known as overfitting: the data used for training is not clean or not sufficient, or the model is too complex or has high variance.
When a model has not learned the patterns in the training data well and is unable to generalize to new data, this is known as underfitting: the data used for training is not clean or not sufficient, or the model is too simple or has high bias.

In [30]: from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=0)  # random_state fixes the split

In [ ]: ## **Features scaling**
#from sklearn.preprocessing import MinMaxScaler
#scaler = MinMaxScaler()
#from sklearn.preprocessing import RobustScaler
#scaler = RobustScaler()
#from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()
#X_train_ = scaler.fit_transform(X_train)
#X_test= scaler.transform(X_test)

In [ ]: print(X_test.shape)
Metrics for regression: general case
$y$ and $\hat{y}$ are the vectors of exact and predicted values, respectively.

Mean absolute error:

$$MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$$

Mean squared error:

$$MSE = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$

R-squared or coefficient of determination: $R^2$ is used to measure the correlation between the predicted and actual values of the dependent variable.

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \; \in \; ]-\infty, 1]$$

$$R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})}$$

The better the model, the higher the R-squared.

But as we add features, R-squared either increases or stays the same, so it cannot tell us how any individual feature impacts the model.

Adjusted R-squared:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-d-1},$$

where $n$ is the number of samples and $d$ the number of features.
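These formulas can be verified numerically against sklearn's metric functions; here is a small sketch where y_true and y_hat are hypothetical toy vectors introduced only for the check.

In [ ]: # Sketch: the formulas above versus sklearn's implementations (toy vectors)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat  = np.array([2.5, 5.0, 4.0, 8.0])
mae = np.mean(np.abs(y_true - y_hat))
mse = np.mean((y_true - y_hat)**2)
r2  = 1 - np.sum((y_true - y_hat)**2) / np.sum((y_true - np.mean(y_true))**2)
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)),
      np.isclose(mse, mean_squared_error(y_true, y_hat)),
      np.isclose(r2, r2_score(y_true, y_hat)))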

To train and test the model we use 'LinearRegression' from the scikit-learn (sklearn) library.

In [31]: from sklearn.linear_model import LinearRegression


LR = LinearRegression()

In [32]: # train the model on the training set


LR.fit(X_train,y_train)

Out[32]: LinearRegression()

In [48]: # the score here is R2


LR.score(X_train,y_train)

Out[48]: 0.5630614279454249

In [33]: w0 = LR.intercept_  # bias
w_R = LR.coef_  # weights

In [34]: print(f'Intercept = {w0}')


print(f'Coefs = {w_R}')

Intercept = -207.5275883376171
Coefs = [ -2.0262207 -343.68577229 15.53712456 -1305.69259881
363.57108798 -237.90354403 1180.02505485 67.03378685
-29.77356392 292.14669988 -192.70133081 -451.98090278
-43.28072999 450.56672686 145.89285794 -237.90354403
-29.77356392 490.14060804 -90.08920286 -370.27784126
5.97821985]

In [36]: from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [44]: # predictions of the model on the training set


y_R_train =LR.predict(X_train)
print(f'MAE = {mean_absolute_error(y_train.values, y_R_train)}')
print(f'MSE = {mean_squared_error(y_train.values, y_R_train)}')
#print(f'RMSE = {rmse(y_test.values, y_pred)}')
print(f'R2 = {r2_score(y_train.values, y_R_train)}')

MAE = 832.7697608114595
MSE = 1270620.5219161748
R2 = 0.5630614279454249

In [45]: # predictions of the model on the test set


y_R_test =LR.predict(X_test)
print(f'MAE = {mean_absolute_error(y_test.values, y_R_test)}')
print(f'MSE = {mean_squared_error(y_test.values, y_R_test)}')
#print(f'RMSE = {rmse(y_test.values, y_pred)}')
print(f'R2 = {r2_score(y_test.values, y_R_test)}')

MAE = 851.9198567977413
MSE = 1276730.9218397879
R2 = 0.5637881003820474

In [42]: test_df = pd.DataFrame(X_test, columns = X.columns)


test_df.head()

Out[42]:
      Item_Weight  Item_Visibility  Item_MRP  Outlet_Type_Grocery Store  Outlet_Type_Supermarket Type1  ...
4931       14.500         0.089960  159.5604                          0                              1  ...
4148       12.150         0.009535   64.5510                          0                              0  ...
7423       11.500         0.017742  129.6626                          0                              1  ...
4836       10.195         0.000000  143.1154                          0                              1  ...
944        21.000         0.049264  195.0478                          0                              1  ...

5 rows × 21 columns

In [47]: for i in range(10):


print([y_test.values[i],y_R_test[i]])

[1426.1436, 2489.3108732996498]
[1201.769, 2522.6479061046225]
[1836.2764, 2238.8794424597736]
[2410.8618, 2432.726290160981]
[1549.9824, 3115.3221305422135]
[3169.208, 3616.2284038788393]
[2036.6822, 2905.006552265028]
[824.9262, 1832.827140131652]
[378.1744, 1129.2447358303984]
[1573.9512, 1802.1739343111042]

In [ ]: import matplotlib.pyplot as plt


In [49]: coefR = pd.Series(np.abs(w_R), feature_names ).sort_values(ascending=False)


coefR.plot(kind='bar', title='LR Coefficients')

Out[49]: <AxesSubplot:title={'center':'LR Coefficients'}>

In [50]: k=15
# L_R: list containing the k selected variables
L_R=list(coefR[0:k].keys())
print(L_R)

['Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type3', 'Outlet_Size_Medium', 'Outlet_Establishment_Year_1999', 'Outlet_Establishment_Year_2004', 'Outlet_Size_Small', 'Outlet_Type_Supermarket Type1', 'Item_Visibility', 'Outlet_Establishment_Year_1997', 'Outlet_Type_Supermarket Type2', 'Outlet_Establishment_Year_2009', 'Outlet_Establishment_Year_1998', 'Outlet_Establishment_Year_2007', 'Outlet_Size_Mode', 'Outlet_Establishment_Year_1985']

Regression and the pseudo-inverse


Recall the singular value decomposition (SVD) of $X$: there exist an orthogonal matrix $U \in \mathbb{R}^{n \times n}$, an orthogonal matrix $V \in \mathbb{R}^{d \times d}$, and a matrix $\Sigma \in \mathbb{R}^{n \times d}$ whose entries are nonnegative and zero off the diagonal, such that

$$X = U \Sigma V^T, \qquad \Sigma = \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \end{pmatrix} \in \mathbb{R}^{n \times d}.$$

If we denote by $\Sigma^+ = \mathrm{diag}\big(\tfrac{1}{\sigma_1}, \ldots, \tfrac{1}{\sigma_r}, 0, \ldots, 0\big)$ the $d \times n$ matrix obtained by inverting the nonzero singular values, the pseudo-inverse of $X$ is the matrix

$$X^+ = V \Sigma^+ U^T.$$

Then the vector $w = X^+ y$ is a minimal-norm solution of the regression problem (1).

In [61]: # add a bias column of ones to the train and test matrices


X_trainr=np.concatenate([np.ones((X_train.shape[0],1)),X_train], axis=1)
X_testr=np.concatenate([np.ones((X_test.shape[0],1)),X_test], axis=1)
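Before calling np.linalg.pinv below, here is a sketch of how $X^+$ can be built directly from the SVD formula above; np.linalg.pinv does the equivalent internally.

In [ ]: # Sketch: build X+ = V Σ+ U^T from the (thin) SVD and compare with np.linalg.pinv
U, s, Vt = np.linalg.svd(X_trainr, full_matrices=False)
s_inv = np.where(s > 1e-10, 1.0/s, 0.0)   # invert only the nonzero singular values
X_pinv = Vt.T @ np.diag(s_inv) @ U.T
print(np.allclose(X_pinv, np.linalg.pinv(X_trainr)))  # should agree up to tolerance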

In [57]: w_p=np.linalg.pinv(X_trainr).dot(y_train)

In [53]: print(w_p)
print(w0,w_R)

[ -121.01384688    -2.0262207   -343.68577229    15.53712456
 -1336.53501111   315.49412166  -261.02760391  1161.05464648
    39.75607547   -48.99195149   282.1035425   -215.2364401
  -450.15205623   -48.58128527   440.52356949   140.59230265
  -261.02760391   -48.99195149   449.87498635  -123.22542272
  -398.67145902     5.97821985]
-207.5275883376171 [ -2.0262207 -343.68577229 15.53712456 -1305.69259881
363.57108798 -237.90354403 1180.02505485 67.03378685
-29.77356392 292.14669988 -192.70133081 -451.98090278
-43.28072999 450.56672686 145.89285794 -237.90354403
-29.77356392 490.14060804 -90.08920286 -370.27784126
5.97821985]

In [59]: plt.scatter(range(len(w_R)),w_R)
plt.scatter(range(len(w_p[1:X_train.shape[1]+1])),w_p[1:X_train.shape[1]+1])

Out[59]: <matplotlib.collections.PathCollection at 0x1b02fe51c50>
In [62]: d=X_train.shape[1]
d1=d+1  # number of features + 1 (for the bias)
print(d)
coefp = pd.Series(np.abs(w_p[1:d1]).reshape(d,), feature_names ).sort_values(ascending=False)
coefp.plot(kind='bar', title='Pseudo-inverse Coefficients')

21
Out[62]: <AxesSubplot:title={'center':'Pseudo-inverse Coefficients'}>

In [ ]: # evaluation on the training set


y_p_train=?
# to be completed with the evaluation, but for the new model!!
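A possible completion, shown only as a sketch; it mirrors the evaluation done for LR above, with predictions computed from the bias-augmented matrices.

In [ ]: # Sketch of a possible completion (not the required answer)
y_p_train = X_trainr.dot(w_p)
print(f'MAE = {mean_absolute_error(y_train.values, y_p_train)}')
print(f'MSE = {mean_squared_error(y_train.values, y_p_train)}')
print(f'R2 = {r2_score(y_train.values, y_p_train)}')

y_p_test = X_testr.dot(w_p)
print(f'R2 (test) = {r2_score(y_test.values, y_p_test)}')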

Work to complete for GI2


Redo the same work for the dataset 50_Startups0.csv. You are asked to:
import the dataset
handle the missing values
look for outliers in each variable
convert the State variable to numeric (use dummies)
the feature matrix X contains all variables except 'Profit'; the label y represents 'Profit'.
Extract X and y
compute and interpret the correlation matrix

apply sklearn's linear regression model

evaluate the model on both the test and training sets (with MAE, MSE, R2)
use the pseudo-inverse matrix, then evaluate the resulting model
for each technique used, plot the most relevant variables by importance, then list the two most decisive variables.

Interpretations of the results will greatly enrich your work.

In [ ]: ## end of lab

In [ ]: # Version September 2023, R. Bessi


In [ ]: # For GI1

In [ ]: # Function defining Adjusted R2


def r2_adj(d,y,z):

    n = y.shape[0]
    #print(d)
    return ?
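One possible completion, shown as a sketch; it applies the adjusted R² formula from the metrics section, with z the vector of predictions.

In [ ]: # Sketch of a possible definition (z = predicted values)
def r2_adj(d, y, z):
    n = y.shape[0]
    r2 = r2_score(y, z)                  # plain R2 from sklearn
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)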

Regression via an optimization problem


Classical algorithms

Gradient descent
The gradient descent algorithm applied to a function $f$ is defined by an initial point $w_0$ and the iteration

$$w_{k+1} = w_k - \alpha_k \nabla f(w_k),$$

where $\alpha_k > 0$ is a step length for iteration $k$.

The descent step $\alpha_k$ is called the learning rate in ML and is generally denoted $lr$.

Quadratic case


If $f(w) = \frac{1}{2} w^T A w - b^T w = 0.5\,(Aw, w) - (b, w)$ for $A$ symmetric, recall that $\nabla f(w) = Aw - b$.

Here $A = X_{trainr}^T X_{trainr}$ and $b = X_{trainr}^T\, y_{train}$.

Implement gradient descent with three possible choices of step length:

GPF: fixed step $\alpha = 2/(\lambda_{max}(A) + \lambda_{min}(A))$.

GPO: optimal step $\alpha_k$ (exact line search at each iteration).

Conjugate gradient (GC):

Choose $w_0 \in \mathbb{R}^d$.

Take $g_0 = A w_0 - b = -d_0$.

For $k \ge 0$, compute

$$\alpha_k = \frac{\|g_k\|^2}{(A d_k, d_k)}, \qquad w_{k+1} = w_k + \alpha_k d_k,$$

$$g_{k+1} = A w_{k+1} - b = g_k + \alpha_k A d_k,$$

$$\beta_k = \frac{\|g_{k+1}\|^2}{\|g_k\|^2}, \qquad d_{k+1} = -g_{k+1} + \beta_k d_k.$$

For each model, compute the prediction vector and then $R^2$ on the test set and the training set.

In [63]: d1=X_trainr.shape[1]
A=X_trainr.T.dot(X_trainr)
b=X_trainr.T.dot(y_train).reshape(d1,1)
#d=A.shape[0]


In [ ]: # Compute rho(A) and cond(A): spectral radius and condition number of A (use the numpy library)


rho=
Cond=
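A possible way to fill in these values with numpy, shown as a sketch (A is the symmetric matrix built above, so its eigenvalues are real).

In [ ]: # Sketch of a possible completion
eigs = np.real(np.linalg.eigvals(A))                 # eigenvalues of A (real, since A is symmetric)
rho = np.max(np.abs(eigs))                           # spectral radius
Cond = np.max(np.abs(eigs)) / np.min(np.abs(eigs))   # condition number (ratio of extreme eigenvalues)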

What is the effect of ρ(A) and cond(A) on the GPF and GC algorithms?


In [ ]: plt.scatter(range(len(LR.coef_)), LR.coef_)
plt.scatter(range(len(w_p)-1),w_p[1:len(w_p)])

In [ ]: # Fixed-step gradient method


n=A.shape[0]
d1=A.shape[1]
lr=np.real(2/(np.max(np.linalg.eig(A)[0])+np.min(np.linalg.eig(A)[0])))  # the descent step (learning rate)

w_f=np.random.rand(d1,1)  # initialization
for k in range(100):
    w_f=w_f-lr*(A.dot(w_f)-b)  #-lam*w
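The optimal-step variant ("GPO") mentioned above can be written in the same style; for a quadratic f the exact line-search step is $\alpha_k = \|g_k\|^2 / (g_k^T A g_k)$. A sketch follows (w_o is a name introduced here).

In [ ]: # Sketch: gradient descent with the optimal (exact line-search) step
w_o = np.random.rand(d1, 1)            # initialization
for k in range(100):
    g = A.dot(w_o) - b                 # gradient at w_o
    denom = float(g.T.dot(A.dot(g)))
    if denom < 1e-12:                  # stop if the curvature term vanishes
        break
    alpha = float(g.T.dot(g)) / denom  # optimal step for a quadratic
    w_o = w_o - alpha * g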

In [ ]: # Evaluate the performance of the fixed-step gradient method (use w_f here)

In [ ]: # Implement the conjugate gradient algorithm


# GC method

w_c=np.random.rand(d1,1)  # initialization
g=(A.dot(w_c)-b)
g1=-g
for k in range(20):
    #print(k)
    if np.linalg.norm(g)>10**(-5):
        alpha=?
        w_c =?
        g0=g
        g=?
        beta=?
        g1=-g+beta*g1
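A possible completion of the blanks, shown as a sketch; it follows the GC recursion written above, with g1 playing the role of the direction d_k.

In [ ]: # Sketch of a possible completion (g1 holds the direction d_k)
w_c = np.random.rand(d1, 1)   # initialization
g = A.dot(w_c) - b            # g_0
g1 = -g                       # d_0
for k in range(20):
    if np.linalg.norm(g) > 10**(-5):
        alpha = float(g.T.dot(g)) / float(g1.T.dot(A.dot(g1)))  # ||g_k||^2 / (A d_k, d_k)
        w_c = w_c + alpha * g1
        g0 = g
        g = g0 + alpha * A.dot(g1)                              # g_{k+1} = g_k + alpha A d_k
        beta = float(g.T.dot(g)) / float(g0.T.dot(g0))          # ||g_{k+1}||^2 / ||g_k||^2
        g1 = -g + beta * g1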

In [ ]: # Evaluate the performance of the conjugate gradient method (use w_c here)

In [ ]: # display the obtained solution w_c by importance


coefc =?
coefc.plot?
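A possible completion, following the same pattern as coefR and coefp above (sketch only; it assumes w_c has been computed, e.g. via the GC sketch).

In [ ]: # Sketch of a possible completion
coefc = pd.Series(np.abs(w_c[1:d1]).reshape(d1-1,), feature_names).sort_values(ascending=False)
coefc.plot(kind='bar', title='GC Coefficients')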

In [ ]: k=15
L_c=list(coefc[0:k].keys())
# L_c: list containing the 15 selected variables
# intersection of L_R and L_c
print(len(set(L_c).intersection(set(L_R))))

In [ ]: plt.scatter(range(len(LR.coef_)), LR.coef_)
plt.scatter(range(len(w_p)-1),w_p[1:len(w_p)])

plt.scatter(range(len(w_c)-1), w_c[1:d1])
#plt.scatter(range(len(w_p)-1),w_p[1:len(w_p)])

In [ ]: # end
# R. Bessi, September 2023
