Project 03: Data Fitting
Applied Mathematics and Statistics for Information Technology
Lecturers:
Ph.D. Vũ Quốc Hoàng
Phan Thị Phương Uyên
Nguyễn Văn Quang Huy
Trần Thị Thảo Nhi
Student:
Đặng Ngọc Tiến
Table of Contents
I. Student information and complete progress:
a. Student information:
b. Complete progress:
III. Library and necessary functions:
IV. Project details:
a. Task a:
b. Task b:
c. Task c:
V. Reference:
I. Student information and complete progress:
a. Student information:
• Name: Đặng Ngọc Tiến
• Student ID: 20127641
• Class: 20CLC11
b. Complete progress:
Task      Complete
Task a    100%
Task b    100%
Task c    100%
Linear regression performs the task of predicting a dependent variable (y) based on a given independent variable (x). In other words, this regression technique finds a linear relationship between the input x and the output y, hence the name Linear Regression.
For example, if X (input) is a person's work experience and Y (output) is their salary, the regression line is the best-fit line for our model.
While training the model we are given:
x: input training data (univariate, i.e. one input variable)
y: labels for the data (supervised learning)
When training, the model fits the best line to predict the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ1 and θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best-fit line, and when we finally use the model for prediction, it will predict the value of y for a new input value of x.
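With θ1 and θ2 as defined above, the fitted regression line can be written as:
ŷ = θ1 + θ2 ∗ x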
The cost function J of linear regression is the Root Mean Squared Error (RMSE) between the predicted y values (pred) and the true y values (y).
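Written out for n samples, this cost function is:
J = RMSE = √( (1/n) ∗ Σᵢ (predᵢ − yᵢ)² )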
The simplest way to split the data is the train-test split method. It randomly partitions the dataset into two subsets (called the training and test sets) so that a predefined percentage of the entire dataset ends up in the training set.
Then, we train our machine learning model on the training set and
evaluate its performance on the test set. In this way, we are always sure
that the samples used for training are not used for evaluation and vice
versa.
• Introduction to Cross-Validation
However, the train-test split method has certain limitations. When the dataset is small, the method is prone to high variance: due to the random partition, the results can be entirely different for different test sets. Why? Because in some partitions, samples that are easy to classify end up in the test set, while in others, the test set receives the 'difficult' ones.
• K-Fold Cross-Validation
In k-fold cross-validation, we first divide our dataset into k equally sized subsets. Then, we repeat the train-test method k times such that each time one of the k subsets is used as the test set and the remaining k-1 subsets are used together as the training set. Finally, we estimate the model's performance by averaging the scores over the k trials.
For example, with k = 3 we train and evaluate our machine-learning model 3 times: each time, two subsets form the training set, while the remaining one acts as the test set.
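A minimal sketch of this procedure in Python, assuming scikit-learn's KFold and a generic model object with fit/predict methods (the names here are illustrative, not the project's code):

import numpy as np
from sklearn.model_selection import KFold

def kfold_rmse(model, X, y, k=5, seed=0):
    """Estimate a model's RMSE by averaging over k train/test splits."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])       # train on the k-1 remaining folds
        y_hat = model.predict(X[test_idx])          # evaluate on the held-out fold
        scores.append(np.sqrt(np.mean((y_hat - y[test_idx]) ** 2)))
    return np.mean(scores)                          # average of the k fold scores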
• Leave-One-Out Cross-Validation
In leave-one-out (LOO) cross-validation, we train our machine-learning model n times, where n is the size of our dataset. Each time, only one sample is used as the test set while the rest are used to train our model.
The final performance estimate is the average of the n individual scores.
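Leave-one-out is simply k-fold cross-validation with k equal to the dataset size; a sketch using scikit-learn's LeaveOneOut (an assumption for illustration, since the project itself uses 5-fold) could look like:

import numpy as np
from sklearn.model_selection import LeaveOneOut

def loo_rmse(model, X, y):
    """Average error over n single-sample test sets, where n = len(y)."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])        # train on all samples but one
        y_hat = model.predict(X[test_idx])           # predict the single held-out sample
        errors.append(abs(float(y_hat[0]) - float(y[test_idx][0])))
    return float(np.mean(errors))                    # average of the n individual scores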
c. Life Expectancy:
Life expectancy data were collected from the WHO and United Nations websites for the years 2000 to 2015 across all countries.
The dataset has 2938 rows and 22 columns; the meaning and data type of each column are described in the dataset documentation (see [3]).
In this project, the following preprocessing steps were applied to the data above (a minimal pandas sketch appears after the list):
1. Remove rows with incomplete information (rows containing NaN values)
2. Keep only the rows belonging to the 95 countries with the largest population
3. Normalize and rename some features: thinness 1-19 years → Thinness age 10-19, thinness 5-9 years → Thinness age 5-9
4. Remove the 2 columns with string values: Country, Status
5. Based on the correlation measure, remove the 9 columns least relevant to the target value (Life expectancy): Population, Measles, Year, infant deaths, Total expenditure, under-five deaths, Hepatitis B, percentage expenditures, Alcohol
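A minimal pandas sketch of these steps, assuming the raw Kaggle file is named "Life Expectancy Data.csv" and that the list of the 95 most populous countries is obtained separately (the raw column names in the CSV contain irregular spacing, so they may need small adjustments):

import pandas as pd

df = pd.read_csv("Life Expectancy Data.csv")

# 1. Remove rows with incomplete information (any NaN value)
df = df.dropna()

# 2. Keep only the rows of the 95 countries with the largest population
# top95 = [...]  # list of country names, obtained separately
# df = df[df["Country"].isin(top95)]

# 3. Normalize and rename the thinness features
df = df.rename(columns={"thinness 1-19 years": "Thinness age 10-19",
                        "thinness 5-9 years": "Thinness age 5-9"})

# 4. Remove the two string-valued columns
df = df.drop(columns=["Country", "Status"])

# 5. Remove the 9 columns least correlated with Life expectancy
df = df.drop(columns=["Population", "Measles", "Year", "infant deaths",
                      "Total expenditure", "under-five deaths", "Hepatitis B",
                      "percentage expenditures", "Alcohol"])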
III. Library and necessary functions:
• NumPy
Create matrices and perform mathematical operations.
• Pandas
Read .csv data files.
• Function sklearn.model_selection.train_test_split[4]
Parameters:
*arrays: sequence of indexables with the same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices, or pandas dataframes.
test_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of
the dataset to include in the test split. If int, represents the absolute
number of test samples. If None, the value is set to the complement of
the train size. If train_size is also None, it will be set to 0.25.
train_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of
the dataset to include in the train split. If int, represents the absolute
number of train samples. If None, the value is automatically set to the
complement of the test size.
random_state: int, RandomState instance or None,
default=None
Controls the shuffling applied to the data before applying the split. Pass
an int for reproducible output across multiple function calls.
shuffle: bool, default=True
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
stratify: array-like, default=None
If not None, data is split in a stratified fashion, using this as the class labels.
Returns
splitting: list, length=2 * len(arrays)
List containing train-test split of inputs.
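A typical call, for example an 80/20 split with a fixed random seed (the variable names are illustrative):

from sklearn.model_selection import train_test_split

# X: feature matrix, y: target vector (Life expectancy)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)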
• Class OLSLinearRegression[5]
o fit(self, X, y):
Fit linear model.
Parameters:
X: {array-like, sparse matrix} of shape (n_samples,
n_features)
Training data.
y: array-like of shape (n_samples,) or (n_samples,
n_targets)
Target values. Will be cast to X’s dtype if necessary.
Returns
self: object
Fitted Estimator.
o get_params(self):
Get parameters for this estimator.
Returns:
Params: dict
Parameter names mapped to their values.
o predict(self, X):
Predict using the linear model.
Parameters:
X: array-like or sparse matrix, shape (n_samples,
n_features)
Samples.
Returns:
C: array, shape (n_samples,)
Returns predicted values.
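A minimal sketch that is consistent with the interface above, solving ordinary least squares with NumPy (the actual Lab 4 implementation may differ, for instance in how the linear system is solved or how the parameters are returned):

import numpy as np

class OLSLinearRegression:
    def fit(self, X, y):
        # Solve the least-squares problem min ||Xw - y||^2 for the weight vector w
        self.w, *_ = np.linalg.lstsq(np.asarray(X, dtype=float),
                                     np.asarray(y, dtype=float), rcond=None)
        return self                       # fitted estimator

    def get_params(self):
        return self.w                     # coefficients of the fitted linear model

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.w   # predicted values X·w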
• rmse(y, y_hat):
Root mean squared error regression loss.
Parameters:
y: array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
y_hat: array-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.
Returns:
Loss: float or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of
floating point values, one for each individual target.
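A sketch of this helper, assuming array-like inputs:

import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between ground-truth and predicted values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))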
• get_features(correlation_threshold):
Return the features whose absolute correlation with the target exceeds correlation_threshold.
Parameters:
correlation_threshold: float
Returns:
features: list
Names of the selected features.
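A sketch of how such a helper could select features, assuming it has access to a pandas DataFrame (here passed explicitly as df) that still contains the target column "Life expectancy"; the project's own implementation may differ:

def get_features(df, correlation_threshold):
    # Absolute correlation of every column with the target value
    corr = df.corr()["Life expectancy"].abs()
    # Keep the features whose |correlation| exceeds the threshold, excluding the target itself
    selected = corr[corr > correlation_threshold].drop("Life expectancy")
    return list(selected.index)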
IV. Project details:
a. Task a:
Requirement a: Use all 10 features provided by the problem (2 points)
• Train only once on all 10 features using the entire training set (train.csv)
• Show the formula of the regression model (compute y from the 10 features in X)
• Report one result on the test set (test.csv) for the newly trained model
Implementation steps:
• Train only once on all 10 features using the entire training set (train.csv): use the fit(X, y) method of the OLSLinearRegression class to train on the 10 features.
• Show the formula of the regression model: use the get_params() method of the OLSLinearRegression class to obtain the coefficients.
Regression formula:
Life expectancy = 0.015101 ∗ X1 + 0.090220 ∗ X2 + 0.042922 ∗ X3 + 0.139289 ∗ X4 − 0.567333 ∗ X5 − 0.000101 ∗ X6 + 0.740713 ∗ X7 + 0.190936 ∗ X8 + 24.505974 ∗ X9 + 2.393517 ∗ X10
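A sketch of how these coefficients and the test error can be obtained, using the OLSLinearRegression class and rmse helper described in section III (the target column name "Life expectancy" is an assumption about the CSV header):

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train, y_train = train.drop(columns=["Life expectancy"]), train["Life expectancy"]
X_test, y_test = test.drop(columns=["Life expectancy"]), test["Life expectancy"]

model = OLSLinearRegression().fit(X_train, y_train)   # train once on all 10 features
print(model.get_params())                             # coefficients of the regression formula above
print(rmse(y_test, model.predict(X_test)))            # test-set error (~7.064)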
Conclusion:
In this part, 100% of the training data is used to build the model, so the accuracy and stability of the resulting model are not fully guaranteed; the error (RMSE) obtained on the test set is about 7.064046430584031.
b. Task b:
Requirement b: Build a model using only 1 feature, find the model that gives
the best results (2 points)
• Test all 10 features provided by the problem
• Use the 5-fold cross-validation method to find the best feature
• Report the 10 corresponding results (average RMSE over the folds) for the 10 models from 5-fold cross-validation
Implementation steps:
• Clone X_train, y_train from the data
• Use the 5-fold cross-validation method on each feature and compute the average RMSE (see the sketch below)
• Since the average RMSE of Schooling is the smallest, we find that the best feature is Schooling
• Retrain the model on the entire training set with the best feature (Schooling)
Show the formula for the best feature regression model:
Life expectancy = 5.5573994 ∗ X (where X is Schooling)
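A sketch of the per-feature 5-fold search, using the OLSLinearRegression class and rmse helper from section III (the fold settings shuffle/random_state are assumptions; X_train and y_train are the training features and target as in task a):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
avg_rmse = {}
for feature in X_train.columns:                       # the 10 candidate features
    scores = []
    for tr, va in kf.split(X_train):
        m = OLSLinearRegression().fit(X_train[[feature]].iloc[tr], y_train.iloc[tr])
        scores.append(rmse(y_train.iloc[va], m.predict(X_train[[feature]].iloc[va])))
    avg_rmse[feature] = np.mean(scores)               # average RMSE of this single-feature model

best_feature = min(avg_rmse, key=avg_rmse.get)        # expected to be 'Schooling'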
Conclusion:
After using the 5-fold cross-validation method, we obtain the average RMSEs, from which we conclude that Schooling is the best feature.
c. Task c:
Requirement c: Students build their own models, find the model that gives the
best results (3 points)
• Build m different models (at least 3), all different from the models in tasks a and b
o A model can be a combination of 2 or more features
o A model can use normalized or transformed features (squared, cubed, ...)
o A model can use features created from 2 or more different features (adding 2 features, multiplying 2 features, ...)
o ...
• Use the 5-fold cross-validation method to find the best model
• Report the m corresponding results (average RMSE) for the m models from 5-fold cross-validation
Implementation steps:
• Explain the reason for choosing the design of the models:
o Use the correlation coefficient[6]
Correlation analysis investigates the relationship between two fields/features of a dataset (e.g. height and weight, or the price of gold and the price of rice). Based on the correlation coefficient, we can make a preliminary assessment of which features affect our model and how strongly. The correlation of a field with itself equals 1, while for any two different fields the coefficient lies between -1 and 1; the closer its absolute value is to 1, the stronger the relationship, and the closer it is to 0, the weaker the relationship.
o Correlation chart:
o Select the features whose absolute correlation with the target is greater than a threshold alpha, with alpha ∈ {0.1, 0.25, 0.5, 0.75}.
o After testing these feature sets with the 5-fold cross-validation method, the RMSE results were not very good, so after some further experimentation I also transformed the selected features by taking their square root.
• From the alpha values of the above design, we obtain the following feature sets:
o P1 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Diphtheria', 'Polio', 'HIV/AIDS', 'Adult Mortality',
'Thinness age 5-9', 'Thinness age 10-19']
o P2 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Diphtheria', 'Polio', 'Adult Mortality', 'Thinness age 5-9',
'Thinness age 10-19']
o P3 = ['Schooling', 'Income composition of resources', 'BMI',
'GDP', 'Thinness age 5-9', 'Thinness age 10-19']
o P4 = ['Schooling', 'Income composition of resources']
• Build each model according to its design and add it to the list of candidate models
• Use the 5-fold cross-validation method (average RMSE) to find the best model, as sketched below
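A sketch of how the candidate models can be built and compared, following the steps above and reusing the OLSLinearRegression class and rmse helper from section III (the square-root transform is applied to every selected feature; the exact project code may differ):

import numpy as np
from sklearn.model_selection import KFold

feature_sets = {"P1": P1, "P2": P2, "P3": P3, "P4": P4}   # the lists defined above
kf = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, cols in feature_sets.items():
    Xm = np.sqrt(X_train[cols])                 # square-root transform of the selected features
    scores = []
    for tr, va in kf.split(Xm):
        m = OLSLinearRegression().fit(Xm.iloc[tr], y_train.iloc[tr])
        scores.append(rmse(y_train.iloc[va], m.predict(Xm.iloc[va])))
    results[name] = np.mean(scores)             # average RMSE over the 5 folds

best_model = min(results, key=results.get)      # model with the lowest average RMSE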
Conclusion:
Compared with the error of about 7.064046430584031 obtained in task a, the best model built here achieves an error of about 4.7680380241569695, showing that this model fits the data more closely.
A likely reason is that the square-root transformation reduces the dispersion of the data, and changing which features enter the model reduces the bias contributed by the individual components, leading to a less biased overall model.
V. Reference:
[1] ML | Linear Regression, GeeksforGeeks
[2] Cross-Validation: K-Fold vs. Leave-One-Out, Baeldung on Computer Science
[3] Life Expectancy (WHO), Kaggle
[4] sklearn.model_selection.train_test_split, scikit-learn 1.1.1 documentation
[5] Lab 4 by Ms. Uyen
[6] Correlation - Statistical Techniques, Rating Scales, Correlation Coefficients, and More, Creative Research Systems (surveysystem.com)