
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF SCIENCE


FACULTY OF INFORMATION TECHNOLOGY
---------------o0o---------------

PROJECT 03: DATA FITTING

Applied mathematics and statistics


for information technology

Lecturers:
Ph.D. Vũ Quốc Hoàng
Phan Thị Phương Uyên
Nguyễn Văn Quang Huy
Trần Thị Thảo Nhi

Student:
Đặng Ngọc Tiến

Ho Chi Minh City, August 2022

Table of Contents
I. Student information and completion progress
  a. Student information
  b. Completion progress
II. Introduction to the project
  a. Linear Regression [1]
  b. K-fold cross-validation [2]
  c. Life Expectancy
III. Libraries and necessary functions
  • numpy
  • pandas
  • matplotlib.pyplot
  • seaborn
  • sklearn.model_selection.train_test_split [4]
  • Class OLSLinearRegression [5]
IV. Project details
  a. Task a
  b. Task b
  c. Task c
V. References

I. Student information and completion progress:
a. Student information:
• Name: Đặng Ngọc Tiến
• Student ID: 20127641
• Class: 20CLC11

b. Completion progress:

Task   | Completion
Task a | 100%
Task b | 100%
Task c | 100%

II. Introduction to the project


a. Linear Regression[1]:
Linear Regression is a machine learning algorithm based on supervised learning.
It performs a regression task. Regression models a target prediction value based on
independent variables. It is mostly used for finding out the relationship between
variables and forecasting. Different regression models differ in the kind of
relationship they assume between the dependent and independent variables, and
in the number of independent variables they use.

Linear regression performs the task of predicting a dependent variable value (y)
based on a given independent variable (x). So, this regression technique finds a
linear relationship between x (input) and y (output); hence the name linear
regression.

For example, X (input) could be a person's work experience and Y (output) their
salary; the regression line is then the best-fit line for the model.

While training the model, we are given:
x: input training data (univariate: one input variable/parameter)
y: labels for the data (supervised learning)

When training, the model fits the best line to predict the value of y for a
given value of x. The model obtains the best regression fit line by finding the
best θ1 and θ2 values:
θ1: intercept
θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best fit line. So when we are
finally using our model for prediction, it will predict the value of y for the input
value of x.

Cost Function (J):

By achieving the best-fit regression line, the model aims to predict y values
such that the difference between the predicted and true values is minimal. It is
therefore very important to update the θ1 and θ2 values so as to reach the
values that minimize the error between the predicted y value (pred) and the true
y value (y).

The cost function J of linear regression is the Root Mean Squared Error (RMSE)
between the predicted y value (pred) and the true y value (y):

J = √((1/n) · Σᵢ (predᵢ − yᵢ)²)
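As a small illustration (added here; not part of the original report), the following NumPy sketch fits θ1 and θ2 to toy experience/salary data and evaluates the RMSE cost J; the data values are invented:

    import numpy as np

    # Toy univariate data: x = years of experience, y = salary (invented values)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([30.0, 35.0, 41.0, 44.0, 52.0])

    # Design matrix with a column of ones, so theta1 acts as the intercept
    X = np.column_stack([np.ones_like(x), x])

    # Ordinary least squares solution of X @ theta ≈ y
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    theta1, theta2 = theta

    pred = theta1 + theta2 * x
    J = np.sqrt(np.mean((pred - y) ** 2))  # RMSE cost function
    print(theta1, theta2, J)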

b. K-fold Cross validation[2]:


• Overview
In this tutorial, we’ll talk about two cross-validation techniques in
machine learning: the k-fold and leave-one-out methods. To do so, we’ll
start with the train-test splits and explain why we need cross-validation in
the first place. Then, we’ll describe the two cross-validation techniques and
compare them to illustrate their pros and cons.
• Train-Test Split Method
An important decision when developing any machine learning model is
how to evaluate its final performance. To get an unbiased estimate of the
model’s performance, we need to evaluate it on the data we didn’t use for
training.

The simplest way to split the data is to use the train-test split method. It
randomly partitions the dataset into two subsets (called the training and test
sets) so that a predefined percentage of the entire dataset ends up in the
training set.

Then, we train our machine learning model on the training set and
evaluate its performance on the test set. In this way, we are always sure
that the samples used for training are not used for evaluation and vice
versa.

(Figure omitted: a visual illustration of how the train-test split method works.)

• Introduction to Cross-Validation
However, the train-test split method has certain limitations. When the
dataset is small, the method is prone to high variance. Due to the random
partition, the results can be entirely different for different test sets. Why?
Because in some partitions, samples that are easy to classify get into the
test set, while in others, the test set receives the 'difficult' ones.

To deal with this issue, we use cross-validation to evaluate the performance of
a machine learning model. In cross-validation, we don't divide the dataset into
training and test sets only once. Instead, we repeatedly partition the dataset
into smaller groups and then average the performance in each group. That way,
we reduce the impact of partition randomness on the results.

Many cross-validation techniques define different ways to divide the dataset at
hand. We'll focus on the two most frequently used: the k-fold and the
leave-one-out methods.

• K-Fold Cross-Validation
In k-fold cross-validation, we first divide our dataset into k equally sized
subsets. Then, we repeat the train-test method k times such that each time
one of the k subsets is used as a test set and the remaining k-1 subsets are
used together as a training set. Finally, we estimate the model's performance
by averaging the scores over the k trials.

For example, let's suppose that we have a dataset S = {x₁, x₂, x₃, x₄, x₅, x₆}
containing 6 samples and that we want to perform a 3-fold cross-validation.

First, we divide S into 3 subsets randomly. For instance:

S₁ = {x₁, x₂}
S₂ = {x₃, x₄}
S₃ = {x₅, x₆}

Then, we train and evaluate our machine-learning model 3 times. Each time, two
subsets form the training set, while the remaining one acts as the test set. In
our example, the model is trained on S₂ ∪ S₃ and tested on S₁, then trained on
S₁ ∪ S₃ and tested on S₂, and finally trained on S₁ ∪ S₂ and tested on S₃.

Finally, the overall performance is the average of the model's performance
scores on those three test sets.

• Leave-One-Out Cross-Validation
In leave-one-out (LOO) cross-validation, we train our machine-learning
model n times, where n is our dataset's size. Each time, only one sample
is used as a test set while the rest are used to train our model.

We'll show that LOO is an extreme case of k-fold where k = n. If we apply
LOO to the previous example, we'll have 6 test subsets:

S₁ = {x₁}
S₂ = {x₂}
S₃ = {x₃}
S₄ = {x₄}
S₅ = {x₅}
S₆ = {x₆}

Iterating over them, we use S \ Sᵢ as the training data in iteration
i = 1, 2, …, 6, and evaluate the model on Sᵢ.

The final performance estimate is the average of the six individual scores.
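To make both schemes concrete, here is a short sketch (ours, using scikit-learn's ready-made splitters) that prints the 3-fold and leave-one-out partitions of the 6-sample example:

    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    S = np.arange(1, 7)  # stands in for {x1, ..., x6}

    # 3-fold: each fold serves as the test set once; the rest form the training set
    for train_idx, test_idx in KFold(n_splits=3).split(S):
        print("train:", S[train_idx], "test:", S[test_idx])

    # Leave-one-out: k = n, so each sample is the test set exactly once
    for train_idx, test_idx in LeaveOneOut().split(S):
        print("train:", S[train_idx], "test:", S[test_idx])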

c. Life Expectancy:
Life expectancy data were collected from the WHO and the United Nations
websites, covering the years 2000 to 2015 across all countries.

The dataset has 2938 rows and 22 columns; the meaning and data type of each
column are described in the dataset documentation (see [3]).


In this project, the following preprocessing steps were applied to the data
above (a pandas sketch of these steps follows the list):
1. Remove data rows with incomplete information (rows containing NaN values)
2. Select only the rows related to the top 95 countries with the largest
population
3. Normalize and rename some features: thinness 1-19 years → Thinness age 10-19,
thinness 5-9 years → Thinness age 5-9
4. Remove the 2 columns with string values: Country, Status
5. Based on the correlation measure, remove the 9 columns that are least
relevant to the goal value (Life expectancy): Population, Measles, Year, infant
deaths, Total expenditure, under-five deaths, Hepatitis B, percentage
expenditures, Alcohol
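A sketch of these steps in pandas (ours; the raw Kaggle file's column names differ slightly in spelling and spacing, and top95 is a stand-in for the list of the 95 most populous countries):

    import pandas as pd

    top95 = ["China", "India", "United States of America"]  # truncated, illustrative

    df = pd.read_csv("Life Expectancy Data.csv")
    df = df.dropna()                              # 1. drop rows containing NaN
    df = df[df["Country"].isin(top95)]            # 2. keep the top-95 countries
    df = df.rename(columns={                      # 3. normalize feature names
        "thinness 1-19 years": "Thinness age 10-19",
        "thinness 5-9 years": "Thinness age 5-9",
    })
    df = df.drop(columns=["Country", "Status"])   # 4. drop string columns
    df = df.drop(columns=[                        # 5. drop the 9 least relevant columns
        "Population", "Measles", "Year", "infant deaths", "Total expenditure",
        "under-five deaths", "Hepatitis B", "percentage expenditures", "Alcohol"])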

After preprocessing, the new data has:


• 1180 data lines
• 11 columns of data, including:
o 1 goal value y: Life expectancy
o 10 explanatory features X (features that help predict the target value):
Adult Mortality, BMI, Polio, Diphtheria, HIV/AIDS, GDP, Thinness age 10-19,
Thinness age 5-9, Income composition of resources, Schooling

Students are provided with 2 files:


• train.csv: Contains 1085 samples used to train the model
• test.csv: Contains 95 samples used to test the model

III. Libraries and necessary functions:
• import numpy as np
Create matrices and perform mathematical operations.

• import pandas as pd
Read .csv data files.

• import matplotlib.pyplot as plt
Plot charts.

• import seaborn as sns
Draw the correlation chart.

• from sklearn.model_selection import train_test_split [4]

Split arrays or matrices into random train and test subsets.
A quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data into a single
call for splitting (and optionally subsampling) data in a one-liner.

Parameters
*arrays: sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices, or pandas
dataframes.
test_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of
the dataset to include in the test split. If int, represents the absolute
number of test samples. If None, the value is set to the complement of
the train size. If train_size is also None, it will be set to 0.25.
train_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of
the dataset to include in the train split. If int, represents the absolute
number of train samples. If None, the value is automatically set to the
complement of the test size.
random_state: int, RandomState instance or None,
default=None
Controls the shuffling applied to the data before applying the split. Pass
an int for reproducible output across multiple function calls.
shuffle: bool, default=True
Whether or not to shuffle the data before splitting. If shuffle=False then
stratify must be None.
stratify: array-like, default=None
If not None, data is split in a stratified fashion, using this as the class
labels.

Returns
splitting: list, length = 2 * len(arrays)
List containing the train-test split of the inputs.
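A small usage sketch (ours; the arrays are invented):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
    y = np.arange(10)

    # Hold out 20% of the samples for testing, reproducibly
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)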

• Class OLSLinearRegression[5]

o fit(self, X, y):
Fit linear model.
Parameters:
X: {array-like, sparse matrix} of shape (n_samples,
n_features)
Training data.
y: array-like of shape (n_samples,) or (n_samples,
n_targets)
Target values. Will be cast to X’s dtype if necessary.
Returns:
self: object
Fitted Estimator.
o get_params(self):
Get parameters for this estimator.
Returns:
params: dict
Parameter names mapped to their values.
o predict(self, X):
Predict using the linear model.
Parameters:
X: array-like or sparse matrix, shape (n_samples,
n_features)
Samples.
Returns:
C: array, shape (n_samples,)
Returns predicted values.
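The class body itself comes from Lab 4 [5] and is not reproduced in this report; a minimal normal-equation sketch consistent with the interface above (and with the intercept-free formulas used later) might look like this:

    import numpy as np

    class OLSLinearRegression:
        # Ordinary least squares via the normal equation (illustrative sketch)
        def fit(self, X, y):
            X = np.asarray(X, dtype=float)
            y = np.asarray(y, dtype=float)
            # theta = (X^T X)^(-1) X^T y; pinv is used for numerical stability
            self.theta_ = np.linalg.pinv(X.T @ X) @ X.T @ y
            return self

        def get_params(self):
            return {"theta": self.theta_}

        def predict(self, X):
            return np.asarray(X, dtype=float) @ self.theta_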
• rmse(y, y_hat):
Root mean squared error regression loss.
Parameters:
y: array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
y_hat: array-like of shape (n_samples,) or (n_samples,
n_outputs)
Estimated target values.
Returns:
loss: float or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of
floating point values, one for each individual target.

• get_features(correlation_threshold):
Select the features whose absolute correlation with the goal value exceeds the
threshold (abs(correlation) > correlation_threshold).
Parameters:
correlation_threshold: float
Returns:
features: list
Names of the selected features.
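Minimal sketches of these two helpers (ours; the report's get_features reads the training DataFrame implicitly, while here it is passed in explicitly):

    import numpy as np

    def rmse(y, y_hat):
        # Root mean squared error between ground-truth and predicted targets
        y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
        return np.sqrt(np.mean((y - y_hat) ** 2))

    def get_features(df, target, correlation_threshold):
        # Names of features whose |correlation| with the target exceeds the threshold
        corr = df.corr()[target].drop(target)
        return corr[corr.abs() > correlation_threshold].index.tolist()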

IV. Project details:
a. Task a:
Requirement a: Use all 10 features provided by the problem (2 points)
• Train only once on all 10 features, using the entire training set (train.csv)
• Show the formula of the regression model (compute y from the 10 features in X)
• Report one result on the test set (test.csv) for the newly trained model

Implementation steps:
• Train only once on all 10 features, using the entire training set (train.csv):
use the fit(X, y) method of the OLSLinearRegression class to train on all 10
features.
• Show the formula of the regression model:
use the get_params() method of the OLSLinearRegression class to get the
coefficients.

Use the predict(X_test) method of the OLSLinearRegression class to predict.

Use the rmse(y_test, y_hat) function to compute the RMSE (a sketch of the whole
pipeline follows).
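A sketch of this pipeline (ours; it assumes train.csv/test.csv use "Life expectancy" as the target column name, and uses the OLSLinearRegression and rmse helpers sketched in section III):

    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    X_train, y_train = train.drop(columns=["Life expectancy"]), train["Life expectancy"]
    X_test, y_test = test.drop(columns=["Life expectancy"]), test["Life expectancy"]

    model = OLSLinearRegression().fit(X_train, y_train)  # train once on all 10 features
    print(model.get_params())                            # coefficients of the formula below
    print(rmse(y_test, model.predict(X_test)))           # RMSE on the test set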

Regression formula:
Life expectancy = 0.015101·X₁ + 0.090220·X₂ + 0.042922·X₃ + 0.139289·X₄
− 0.567333·X₅ − 0.000101·X₆ + 0.740713·X₇ + 0.190936·X₈ + 24.505974·X₉
+ 2.393517·X₁₀

Conclusion:
In this part, 100% of the training data was used to build the model, so the
accuracy and stability of the resulting model are not fully guaranteed; the
RMSE on the test set is ≈ 7.064046430584031.

b. Task b:
Requirement b: Build a model using only 1 feature, and find the model that gives
the best results (2 points)
• Test all 10 features provided by the problem
• Use the 5-fold cross-validation method (as required) to find the best feature
• Report the 10 corresponding results (average RMSE) for the 10 models from
5-fold cross-validation

Implementation steps:
• Extract X_train and y_train from the data.

• Use the train_test_split() function to split the data into 5 shuffled parts.

• Apply the 5-fold cross-validation method to find the best feature
(computing the average RMSE for each feature).

• Since the RMSE of Schooling is the smallest, we find that the best feature
is Schooling.

• Retrain the model on the best feature (Schooling) using the entire training
set.

• Use the rmse(y_test, y_hat) function to compute the RMSE (see the sketch
after these steps).
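A sketch of the per-feature cross-validation (ours; KFold is substituted here for the report's repeated train_test_split splitting, and X_train, y_train are as in Task a):

    import numpy as np
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    avg_rmse = {}
    for feature in X_train.columns:
        Xf = X_train[[feature]].to_numpy()
        yv = y_train.to_numpy()
        scores = [rmse(yv[va], OLSLinearRegression()
                       .fit(Xf[tr], yv[tr]).predict(Xf[va]))
                  for tr, va in kf.split(Xf)]
        avg_rmse[feature] = np.mean(scores)  # average RMSE over the 5 folds

    best = min(avg_rmse, key=avg_rmse.get)   # expected: 'Schooling'
    print(best, avg_rmse[best])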

Show the formula of the best single-feature regression model:
Life expectancy = 5.5573994 · X (X = Schooling)

Conclusion:
After applying the 5-fold cross-validation method, we obtain the average RMSEs,
from which we conclude that Schooling is the best feature.

c. Task c:
Requirement c: Students build their own models and find the model that gives
the best results (3 points)
• Build m different models (at least 3), all different from the models in 1a
and 1b
o A model can be a combination of 2 or more features
o Models can use normalized or transformed features (squared, cubed, ...)
o A model can use features created from 2 or more different features
(the sum of 2 features, the product of 2 features, ...)
o ...
• Use the 5-fold cross-validation method (as required) to find the best model
• Report the m corresponding results (average RMSE) for the m models from
5-fold cross-validation

Implementation steps:
• Explain the reasoning behind the model design:
o Use the correlation method [6], based on the correlation coefficient
Correlation analysis investigates the relationship between two
fields/features over a dataset (e.g., height and weight, or the price of
gold and the price of rice). Based on the correlation coefficient, we can
make a preliminary assessment of which features affect our model and by how
much. A field's correlation with itself is 1; between any two different
fields the coefficient lies between -1 and 1. The closer |r| is to 1, the
stronger the (positive or negative) relationship; the closer r is to 0, the
weaker the relationship.
o Correlation chart: (figure omitted: a heatmap of pairwise correlations,
drawn with seaborn)

o Select the features with absolute correlation greater than alpha, for
alpha ∈ {0.1, 0.25, 0.5, 0.75}.
o After testing with the 5-fold cross-validation method, the RMSE results
were not very good, so after further experimentation I transformed the
features by taking their square roots.
• From the alpha values of the design above, we obtain:
o P1 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Diphtheria', 'Polio', 'HIV/AIDS', 'Adult Mortality',
'Thinness age 5-9', 'Thinness age 10-19']
o P2 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Diphtheria', 'Polio', 'Adult Mortality', 'Thinness age 5-9',
'Thinness age 10-19']
o P3 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Thinness age 5-9', 'Thinness age 10-19']
o P4 = ['Schooling', 'Income composition of resources']
• Build each model according to its design and add it to the list of candidate
models.

• Use the 5-fold cross-validation method to find the best model (computing the
average RMSE for each model).

• We find that the best model is the square-root model built on P1 (Sqrt model
1); its coefficients appear in the formula below.

• Use the rmse(y_test, y_hat) function to compute the RMSE (a sketch of the
whole procedure follows).
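A sketch of the model comparison (ours; the feature lists P1-P4 are as above, KFold again replaces the manual splitting, and X_train, y_train are as in Task a):

    import numpy as np
    from sklearn.model_selection import KFold

    candidates = {f"Sqrt model {i}": cols
                  for i, cols in enumerate([P1, P2, P3, P4], start=1)}

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    avg_rmse = {}
    for name, cols in candidates.items():
        Xm = np.sqrt(X_train[cols].to_numpy())  # square-root transformed features
        yv = y_train.to_numpy()
        scores = [rmse(yv[va], OLSLinearRegression()
                       .fit(Xm[tr], yv[tr]).predict(Xm[va]))
                  for tr, va in kf.split(Xm)]
        avg_rmse[name] = np.mean(scores)

    best = min(avg_rmse, key=avg_rmse.get)      # expected: 'Sqrt model 1'
    print(best, avg_rmse[best])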

Show the formula of the best regression model:

Life expectancy = 14.091507·√X₁ + 9.579369·√X₂ + 0.552901·√X₃ + 0.001705·√X₄
+ 0.801555·√X₅ + 0.186563·√X₆ − 3.208658·√X₇ − 0.009331·√X₈ − 0.009752·√X₉
+ 1.818148·√X₁₀

Conclusion:

Compared with the initial error of ≈ 7.064046430584031 in Task a, the error of
this custom model is ≈ 4.7680380241569695, showing that the newly built model
fits the data more closely.
The reason is that the square-root transform reduces the dispersion of the
features, and changing the per-feature submodels changes the bias contributed
by each component, leading to a less biased overall model.

V. References:
[1] "ML | Linear Regression", GeeksforGeeks.
[2] "Cross-Validation: K-Fold vs. Leave-One-Out", Baeldung on Computer Science.
[3] "Life Expectancy (WHO)", Kaggle.
[4] "sklearn.model_selection.train_test_split", scikit-learn 1.1.1
documentation.
[5] Lab 4, by Ms. Uyên.
[6] "Correlation - Statistical Techniques, Rating Scales, Correlation
Coefficients, and More", Creative Research Systems (surveysystem.com).

