Project 03: Data Fitting
Applied Mathematics and Statistics for Information Technology
Lecturers:
Ph.D. Vũ Quốc Hoàng
Phan Thị Phương Uyên
Nguyễn Văn Quang Huy
Trần Thị Thảo Nhi
Student:
Đặng Ngọc Tiến
Table of Contents
I. Student information and complete progress:
a. Student information:
b. Complete progress:
III. Library and necessary functions:
IV. Project details:
a. Task a:
b. Task b:
c. Task c:
V. Reference:
I. Student information and complete progress:
a. Student information:
• Name: Đặng Ngọc Tiến
• Student ID: 20127641
• Class: 20CLC11
b. Complete progress:
Task      Complete
Task a    100%
Task b    100%
Task c    100%
Linear regression performs the task of predicting a dependent variable (y) based on a given independent variable (x). In other words, this regression technique finds a linear relationship between the input x and the output y, hence the name Linear Regression.
For example, if X (input) is a person's work experience and Y (output) is their salary, the regression line is the best-fit line for our model.
While training the model we are given:
x: input training data (univariate, i.e. one input variable)
y: labels for the data (supervised learning)
When training, the model fits the best line to predict the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ1 and θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best-fit line, and when we finally use the model for prediction, it will predict the value of y for a new input value of x.
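With θ1 and θ2 as defined above, the fitted regression line can be written as:
ŷ = θ1 + θ2 ∗ x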
The cost function J of linear regression is the Root Mean Squared Error (RMSE) between the predicted y values (pred) and the true y values (y).
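Written out for n samples, this cost function is:
J = RMSE = √( (1/n) ∗ Σᵢ (predᵢ − yᵢ)² )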
The simplest way to split the data is the train-test split method. It randomly partitions the dataset into two subsets (called the training and test sets) so that a predefined percentage of the entire dataset ends up in the training set.
Then, we train our machine learning model on the training set and
evaluate its performance on the test set. In this way, we are always sure
that the samples used for training are not used for evaluation and vice
versa.
• Introduction to Cross-Validation
However, the train-test split method has certain limitations. When the dataset is small, the method is prone to high variance: due to the random partition, the results can be entirely different for different test sets. Why? Because in some partitions, samples that are easy to classify end up in the test set, while in others, the test set receives the 'difficult' ones.
• K-Fold Cross-Validation
In k-fold cross-validation, we first divide our dataset into k equally sized subsets. Then, we repeat the train-test method k times such that each time one of the k subsets is used as the test set and the remaining k-1 subsets are used together as the training set. Finally, we estimate the model's performance by averaging the scores over the k trials.
For example, with k = 3 we train and evaluate our machine-learning model 3 times: each time, two subsets form the training set, while the remaining one acts as the test set.
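A minimal sketch of this procedure in Python, assuming scikit-learn's KFold and a generic model object with fit/predict methods (the names here are illustrative, not the project's code):

import numpy as np
from sklearn.model_selection import KFold

def kfold_rmse(model, X, y, k=5, seed=0):
    """Estimate a model's RMSE by averaging over k train/test splits."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])       # train on the k-1 remaining folds
        y_hat = model.predict(X[test_idx])          # evaluate on the held-out fold
        scores.append(np.sqrt(np.mean((y_hat - y[test_idx]) ** 2)))
    return np.mean(scores)                          # average of the k fold scores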
• Leave-One-Out Cross-Validation
In leave-one-out (LOO) cross-validation, we train our machine-learning model n times, where n is the size of our dataset. Each time, only one sample is used as the test set while the rest are used to train our model.
The final performance estimate is the average of the n individual scores.
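Leave-one-out is simply k-fold cross-validation with k equal to the dataset size; a sketch using scikit-learn's LeaveOneOut (an assumption for illustration, since the project itself uses 5-fold) could look like:

import numpy as np
from sklearn.model_selection import LeaveOneOut

def loo_rmse(model, X, y):
    """Average error over n single-sample test sets, where n = len(y)."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])        # train on all samples but one
        y_hat = model.predict(X[test_idx])           # predict the single held-out sample
        errors.append(abs(float(y_hat[0]) - float(y[test_idx][0])))
    return float(np.mean(errors))                    # average of the n individual scores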
c. Life Expectancy:
Life expectancy data were collected from the WHO and United Nations websites for the years 2000 to 2015 across all countries.
The dataset has 2938 rows and 22 columns; the meaning and data type of each column are described in the dataset documentation (see [3]).
In this project, the following preprocessing steps were applied to the data above (a minimal pandas sketch appears after the list):
1. Remove rows with incomplete information (rows containing NaN values)
2. Keep only the rows belonging to the 95 countries with the largest population
3. Normalize and rename some features: thinness 1-19 years → Thinness age 10-19, thinness 5-9 years → Thinness age 5-9
4. Remove the 2 columns with string values: Country, Status
5. Based on the correlation measure, remove the 9 columns least relevant to the target value (Life expectancy): Population, Measles, Year, infant deaths, Total expenditure, under-five deaths, Hepatitis B, percentage expenditures, Alcohol
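A minimal pandas sketch of these steps, assuming the raw Kaggle file is named "Life Expectancy Data.csv" and that the list of the 95 most populous countries is obtained separately (the raw column names in the CSV contain irregular spacing, so they may need small adjustments):

import pandas as pd

df = pd.read_csv("Life Expectancy Data.csv")

# 1. Remove rows with incomplete information (any NaN value)
df = df.dropna()

# 2. Keep only the rows of the 95 countries with the largest population
# top95 = [...]  # list of country names, obtained separately
# df = df[df["Country"].isin(top95)]

# 3. Normalize and rename the thinness features
df = df.rename(columns={"thinness 1-19 years": "Thinness age 10-19",
                        "thinness 5-9 years": "Thinness age 5-9"})

# 4. Remove the two string-valued columns
df = df.drop(columns=["Country", "Status"])

# 5. Remove the 9 columns least correlated with Life expectancy
df = df.drop(columns=["Population", "Measles", "Year", "infant deaths",
                      "Total expenditure", "under-five deaths", "Hepatitis B",
                      "percentage expenditures", "Alcohol"])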
III. Library and necessary functions:
• NumPy
Create matrices and perform mathematical operations.
• Pandas
Read .csv data files.
• Function sklearn.model_selection.train_test_split[4]
Parameters:
*arrays: sequence of indexables with the same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices, or pandas dataframes.
test_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of
the dataset to include in the test split. If int, represents the absolute
number of test samples. If None, the value is set to the complement of
the train size. If train_size is also None, it will be set to 0.25.
train_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of
the dataset to include in the train split. If int, represents the absolute
number of train samples. If None, the value is automatically set to the
complement of the test size.
random_state: int, RandomState instance or None,
default=None
Controls the shuffling applied to the data before applying the split. Pass
an int for reproducible output across multiple function calls.
shuffle: bool, default=True
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
stratify: array-like, default=None
If not None, data is split in a stratified fashion, using this as the class labels.
Returns
splitting: list, length=2 * len(arrays)
List containing train-test split of inputs.
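A typical call, for example an 80/20 split with a fixed random seed (the variable names are illustrative):

from sklearn.model_selection import train_test_split

# X: feature matrix, y: target vector (Life expectancy)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)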
• Class OLSLinearRegression[5]
o fit(self, X, y):
Fit linear model.
Parameters:
X: {array-like, sparse matrix} of shape (n_samples,
n_features)
Training data.
y: array-like of shape (n_samples,) or (n_samples,
n_targets)
Target values. Will be cast to X’s dtype if necessary.
Returns
self: object
Fitted Estimator.
o get_params(self):
Get parameters for this estimator.
Returns:
Params: dict
Parameter names mapped to their values.
o predict(self, X):
Predict using the linear model.
Parameters:
X: array-like or sparse matrix, shape (n_samples,
n_features)
Samples.
Returns:
C: array, shape (n_samples,)
Returns predicted values.
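A minimal sketch that is consistent with the interface above, solving ordinary least squares with NumPy (the actual Lab 4 implementation may differ, for instance in how the linear system is solved or how the parameters are returned):

import numpy as np

class OLSLinearRegression:
    def fit(self, X, y):
        # Solve the least-squares problem min ||Xw - y||^2 for the weight vector w
        self.w, *_ = np.linalg.lstsq(np.asarray(X, dtype=float),
                                     np.asarray(y, dtype=float), rcond=None)
        return self                       # fitted estimator

    def get_params(self):
        return self.w                     # coefficients of the fitted linear model

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.w   # predicted values X·w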
• rmse(y, y_hat):
Root mean squared error regression loss.
Parameters:
y: array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
y_hat: array-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.
Returns:
Loss: float or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of
floating point values, one for each individual target.
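A sketch of this helper, assuming array-like inputs:

import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between ground-truth and predicted values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))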
• get_features(correlation_threshold):
Return the features whose absolute correlation with the target exceeds correlation_threshold.
Parameters:
correlation_threshold: float
Returns:
features: list
Names of the selected features.
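A sketch of how such a helper could select features, assuming it has access to a pandas DataFrame (here passed explicitly as df) that still contains the target column "Life expectancy"; the project's own implementation may differ:

def get_features(df, correlation_threshold):
    # Absolute correlation of every column with the target value
    corr = df.corr()["Life expectancy"].abs()
    # Keep the features whose |correlation| exceeds the threshold, excluding the target itself
    selected = corr[corr > correlation_threshold].drop("Life expectancy")
    return list(selected.index)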
IV. Project details:
a. Task a:
Requirement a: Use all 10 features provided by the problem (2 points)
• Train only once on all 10 features using the entire training set (train.csv)
• Show the formula of the regression model (compute y from the 10 features in X)
• Report one result on the test set (test.csv) for the newly trained model
Implementation steps:
• Train only once on all 10 features using the entire training set (train.csv): use the fit(X, y) method of the OLSLinearRegression class to train on the 10 features.
• Show the formula of the regression model: use the get_params() method of the OLSLinearRegression class to obtain the coefficients.
Regression formula:
Life expectancy = 0.015101 ∗ X1 + 0.090220 ∗ X2 + 0.042922 ∗ X3 + 0.139289 ∗ X4 − 0.567333 ∗ X5 − 0.000101 ∗ X6 + 0.740713 ∗ X7 + 0.190936 ∗ X8 + 24.505974 ∗ X9 + 2.393517 ∗ X10
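A sketch of how these coefficients and the test error can be obtained, using the OLSLinearRegression class and rmse helper described in section III (the target column name "Life expectancy" is an assumption about the CSV header):

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train, y_train = train.drop(columns=["Life expectancy"]), train["Life expectancy"]
X_test, y_test = test.drop(columns=["Life expectancy"]), test["Life expectancy"]

model = OLSLinearRegression().fit(X_train, y_train)   # train once on all 10 features
print(model.get_params())                             # coefficients of the regression formula above
print(rmse(y_test, model.predict(X_test)))            # test-set error (~7.064)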
Conclusion:
In this part, 100% of the training data is used to build the model, so the accuracy and stability of the resulting model are not fully guaranteed; the error (RMSE) obtained on the test set is about 7.064046430584031.
b. Task b:
Requirement b: Build a model using only 1 feature, find the model that gives
the best results (2 points)
• Test all 10 features provided by the problem
• Use the 5-fold cross-validation method to find the best feature
• Report the 10 corresponding results (average RMSE over the folds) for the 10 models from 5-fold cross-validation
Implementation steps:
• Clone X_train, y_train from the data
• Use the 5-fold cross-validation method on each feature and compute the average RMSE (see the sketch below)
• Since the average RMSE of Schooling is the smallest, we find that the best feature is Schooling
• Retrain the model on the entire training set with the best feature (Schooling)
Show the formula for the best feature regression model:
Life expectancy = 5.5573994 ∗ X (where X is Schooling)
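A sketch of the per-feature 5-fold search, using the OLSLinearRegression class and rmse helper from section III (the fold settings shuffle/random_state are assumptions; X_train and y_train are the training features and target as in task a):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
avg_rmse = {}
for feature in X_train.columns:                       # the 10 candidate features
    scores = []
    for tr, va in kf.split(X_train):
        m = OLSLinearRegression().fit(X_train[[feature]].iloc[tr], y_train.iloc[tr])
        scores.append(rmse(y_train.iloc[va], m.predict(X_train[[feature]].iloc[va])))
    avg_rmse[feature] = np.mean(scores)               # average RMSE of this single-feature model

best_feature = min(avg_rmse, key=avg_rmse.get)        # expected to be 'Schooling'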
Conclusion:
After using the 5-fold cross-validation method, we obtain the average RMSEs, from which we conclude that Schooling is the best feature.
c. Task c:
Requirement c: Students build their own models, find the model that gives the
best results (3 points)
• Build m different models (at least 3), all different from the models in tasks a and b
o A model can be a combination of 2 or more features
o A model can use normalized or transformed features (squared, cubed, ...)
o A model can use features created from 2 or more different features (adding 2 features, multiplying 2 features, ...)
o ...
• Use the 5-fold cross-validation method to find the best model
• Report the m corresponding results (average RMSE) for the m models from 5-fold cross-validation
Implementation steps:
• Explain the reason for choosing the design of the models:
o Use the correlation coefficient[6]
Correlation analysis investigates the relationship between two fields/features of a dataset (e.g. height and weight, or the price of gold and the price of rice). Based on the correlation coefficient, we can make a preliminary assessment of which features affect our model and how strongly. The correlation of a field with itself equals 1, while for any two different fields the coefficient lies between -1 and 1; the closer its absolute value is to 1, the stronger the relationship, and the closer it is to 0, the weaker the relationship.
o Correlation chart:
o Select the features whose absolute correlation with the target is greater than a threshold alpha, with alpha ∈ {0.1, 0.25, 0.5, 0.75}.
o After testing these feature sets with the 5-fold cross-validation method, the RMSE results were not very good, so after some further experimentation I also transformed the selected features by taking their square root.
• From the alpha values of the above design, we obtain the following feature sets:
o P1 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Diphtheria', 'Polio', 'HIV/AIDS', 'Adult Mortality',
'Thinness age 5-9', 'Thinness age 10-19']
o P2 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Diphtheria', 'Polio', 'Adult Mortality', 'Thinness age 5-9',
'Thinness age 10-19']
o P3 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Thinness age 5-9', 'Thinness age 10-19']
o P4 = ['Schooling', 'Income composition of resources']
• Build each model according to its design and add it to the list of candidate models
• Use the 5-fold cross-validation method (average RMSE) to find the best model, as sketched below
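A sketch of how the candidate models can be built and compared, following the steps above and reusing the OLSLinearRegression class and rmse helper from section III (the square-root transform is applied to every selected feature; the exact project code may differ):

import numpy as np
from sklearn.model_selection import KFold

feature_sets = {"P1": P1, "P2": P2, "P3": P3, "P4": P4}   # the lists defined above
kf = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, cols in feature_sets.items():
    Xm = np.sqrt(X_train[cols])                 # square-root transform of the selected features
    scores = []
    for tr, va in kf.split(Xm):
        m = OLSLinearRegression().fit(Xm.iloc[tr], y_train.iloc[tr])
        scores.append(rmse(y_train.iloc[va], m.predict(Xm.iloc[va])))
    results[name] = np.mean(scores)             # average RMSE over the 5 folds

best_model = min(results, key=results.get)      # model with the lowest average RMSE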
Conclusion:
Compared with the error of about 7.064046430584031 obtained in task a, the best model built here achieves an error of about 4.7680380241569695, showing that this model fits the data more closely.
A likely reason is that the square-root transformation reduces the dispersion of the data, and changing which features enter the model reduces the bias contributed by the individual components, leading to a less biased overall model.
V. Reference:
[1] ML | Linear Regression, GeeksforGeeks
[2] Cross-Validation: K-Fold vs. Leave-One-Out, Baeldung on Computer Science
[3] Life Expectancy (WHO), Kaggle
[4] sklearn.model_selection.train_test_split, scikit-learn 1.1.1 documentation
[5] Lab 4 by Ms. Uyen
[6] Correlation - Statistical Techniques, Rating Scales, Correlation Coefficients, and More, Creative Research Systems (surveysystem.com)