Data Science - LT
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine
learning to analyze data and to extract knowledge and insights from it.
Data Science Material
5. Find and replace missing values - Check for missing values and replace them with a suitable
value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m,
but the number 140 is larger than 1.8, so scaling is important).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way the "company" can
understand.
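Steps 5 and 6 can be sketched in a few lines with Pandas. This is only an illustration: the column name height_cm and its values are made up, and min-max scaling is just one of several possible ways to normalize data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; np.nan marks a missing value.
df = pd.DataFrame({"height_cm": [140.0, 180.0, np.nan, 160.0]})

# Step 5: replace missing values with the column average.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Step 6: min-max scaling maps the values onto the range 0 to 1.
col = df["height_cm"]
df["height_scaled"] = (col - col.min()) / (col.max() - col.min())

print(df)
```

After scaling, all values share the same range, so they can be compared regardless of the original unit.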
What is Data?
Data is a collection of information.
One purpose of Data Science is to structure data, making it interpretable and easy to work with.
Data can be categorized into two groups:
● Structured data
● Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
Database Table
A database table is a table with structured data.
The following table shows a database table with health data extracted from a sports watch:
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep
30 80 120 240 10 7
30 85 120 250 10 7
45 90 130 260 8 7
45 95 130 270 8 7
75 120 150 320 0 8
Variables
A variable is defined as something that can be measured or counted.
Examples can be characters, numbers or time.
In the example below, we can observe that each column represents a variable.
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep
30 80 120 240 10 7
30 85 120 250 10 7
45 90 130 260 8 7
45 95 130 270 8 7
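As a small sketch, the four rows above can be stored in a Pandas data frame, where every column is one variable. The name health_sample is chosen only for this illustration.

```python
import pandas as pd

# The four rows shown above; each dictionary key becomes one column (variable).
health_sample = pd.DataFrame({
    "Duration": [30, 30, 45, 45],
    "Average_Pulse": [80, 85, 90, 95],
    "Max_Pulse": [120, 120, 130, 130],
    "Calorie_Burnage": [240, 250, 260, 270],
    "Hours_Work": [10, 10, 8, 8],
    "Hours_Sleep": [7, 7, 7, 7],
})

print(list(health_sample.columns))               # the variable names
print(health_sample["Average_Pulse"].tolist())   # all observations of one variable
```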
Python
Python is a programming language widely used by Data Scientists.
Python has built-in mathematical libraries and functions, making it easier to solve mathematical
problems and to perform data analysis.
We will provide practical examples using Python.
To learn more about Python, please visit our Python Tutorial.
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
In this course, we will use the following libraries:
● Pandas - This library is used for structured data operations, such as importing CSV files, creating
data frames, and preparing data
● NumPy - This is a mathematical library. It has a powerful N-dimensional array object, linear
algebra, Fourier transform, etc.
● Matplotlib - This library is used for visualization of data.
● SciPy - This library has linear algebra modules
Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
Example Explained
● Import the Pandas library as pd
● Define data with columns and rows in a variable named d
● Create a data frame using the function pd.DataFrame()
● The data frame contains 3 columns and 5 rows
● Print the data frame output with the print() function
We see that "col1", "col2" and "col3" are the names of the columns.
Do not be confused by the vertical numbers ranging from 0 to 4. They give the position (index) of
each row.
In Python, the numbering of rows starts at zero.
Now, we can use Python to count the columns and rows.
We can use df.shape[1] to find the number of columns:
Example
Count the number of columns:
count_column = df.shape[1]
print(count_column)
We can use df.shape[0] to find the number of rows:
Example
Count the number of rows:
count_row = df.shape[0]
print(count_row)
This chapter shows three commonly used functions when working with Data Science: max(), min(),
and mean().
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep
30 80 120 240 10 7
30 85 120 250 10 7
45 90 130 260 8 7
45 95 130 270 8 7
Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)
Example
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
Before analyzing data, a Data Scientist must extract the data, and make it clean and valuable.
Example
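import pandas as pd
# Load the CSV file; the file name "data.csv" is an assumption
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)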
Example Explained
● Import the Pandas library
● Name the data frame as health_data.
● header=0 means that the headers for the variable names are to be found in the first row
(note that 0 means the first row in Python)
● sep="," means that "," is used as the separator between the values. This is because we are
using the file type .csv (comma separated values)
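The effect of header=0 and sep="," can be tried without a file on disk. The sketch below reads hypothetical CSV text from memory with io.StringIO; the name sample and the values are made up for the illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage
30,80,120,240
30,85,120,250
45,90,130,260
"""

# header=0: the first row holds the column names.
# sep=",": the values are separated by commas.
sample = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

print(sample)
```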
Tip: If you have a large CSV file, you can use the head() function to only show the top 5 rows:
Example
import pandas as pd
print(health_data.head())
Data Cleaning
Look at the imported data. As you can see, the data is "dirty", with wrongly registered or unregistered values:
Example
health_data.dropna(axis=0,inplace=True)
print(health_data)
The result is a data set without NaN rows:
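The behavior of dropna() can be checked on a tiny, made-up data frame. This sketch returns a new data frame instead of using inplace=True; the names dirty and clean are chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one unregistered (NaN) value.
dirty = pd.DataFrame({
    "Duration": [30, 45, 60],
    "Average_Pulse": [80.0, np.nan, 100.0],
})

# axis=0 drops rows (not columns) that contain at least one NaN.
clean = dirty.dropna(axis=0)

print(clean)
```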
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data can be split into three main categories:
1. Numerical - Contains numerical values. Can be divided into two categories:
○ Discrete: Numbers are counted as "whole". Example: You cannot have trained 2.5
sessions, it is either 2 or 3
○ Continuous: Numbers can be of infinite precision. For example, you can sleep for 7
hours, 30 minutes and 20 seconds, or 7.533 hours
2. Categorical - Contains values that cannot be measured up against each other. Example: A
color or a type of training
3. Ordinal - Contains categorical data that can be measured up against each other. Example:
School grades where A is better than B and so on
By knowing the type of your data, you will be able to know what technique to use when analyzing
them.
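The three categories can be represented in Pandas roughly as follows. All values are hypothetical, and the sketch only hints at which operations make sense for each type.

```python
import pandas as pd

# Hypothetical values for each data category.
sessions = pd.Series([2, 3, 1])                                   # numerical, discrete
sleep_hours = pd.Series([7.533, 8.0, 6.25])                       # numerical, continuous
training = pd.Series(["run", "swim", "run"], dtype="category")    # categorical

# Ordinal: categories with an explicit order, here C < B < A.
grades = pd.Series(pd.Categorical(["B", "A", "C"],
                                  categories=["C", "B", "A"],
                                  ordered=True))

print(sleep_hours.mean())        # averages make sense for numerical data
print(training.value_counts())   # counting makes sense for categorical data
print(grades.max())              # ordering makes sense for ordinal data
```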
Data Types
We can use the info() function to list the data types within our data set:
Example
health_data.info()
Result:
We see that this data set has two different types of data:
● float64
● object
We cannot use objects to calculate and perform analysis here. We must convert the type object to
float64 (float64 is a number with a decimal in Python).
We can use the astype() function to convert the data into float64.
The following example converts "Average_Pulse" and "Max_Pulse" into data type float64 (the other
variables are already of data type float64):
Example
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
health_data.info()
Result:
Example
print(health_data.describe())
Result:
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep
75% 60.0 113.75 145.0 307.5 8.0 8.0
Linear Functions
In mathematics a function is used to relate one variable to another variable.
Suppose we consider the relationship between calorie burnage and average pulse. It is reasonable
to assume that, in general, the calorie burnage will change as the average pulse changes - we say
that the calorie burnage depends upon the average pulse.
Furthermore, it may be reasonable to assume that as the average pulse increases, so will the
calorie burnage. Calorie burnage and average pulse are the two variables being considered.
Because the calorie burnage depends upon the average pulse, we say that calorie burnage is the
dependent variable and the average pulse is the independent variable.
The relationship between a dependent and an independent variable can often be expressed
mathematically using a formula (function).
A linear function has one independent variable (x) and one dependent variable (y), and has the
following form:
y = f(x) = ax + b
This function is used to calculate a value for the dependent variable when we choose a value for the
independent variable.
Explanation:
● f(x) = the output (the dependent variable)
● x = the input (the independent variable)
● a = slope = is the coefficient of the independent variable. It gives the rate of change of the
dependent variable
● b = intercept = is the value of the dependent variable when x = 0. It is also the point where
the diagonal line crosses the vertical axis.
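The roles of the slope and the intercept can be checked with a small function. The values a = 2 and b = 80 are example values chosen for this sketch.

```python
def linear(x, a, b):
    """Return y = a*x + b for slope a and intercept b."""
    return a * x + b

a, b = 2, 80  # example slope and intercept

# At x = 0 the output equals the intercept b.
print(linear(0, a, b))

# A one-unit step in x changes the output by the slope a.
print(linear(11, a, b) - linear(10, a, b))
```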
For the function f(x) = 2x + 80, the numbers and variables mean:
● f(x) = The output. This number is where we get the predicted value of Calorie_Burnage
● x = The input, which is Average_Pulse
● 2 = Slope = Specifies how much Calorie_Burnage increases if Average_Pulse increases by
one. It tells us how "steep" the diagonal line is
● 80 = Intercept = A fixed value. It is the value of the dependent variable when x = 0
Graph Explanations:
● The horizontal axis is generally called the x-axis. Here, it represents Average_Pulse.
● The vertical axis is generally called the y-axis. Here, it represents Calorie_Burnage.
● Calorie_Burnage is a function of Average_Pulse, because Calorie_Burnage is assumed to
be dependent on Average_Pulse.
● In other words, we use Average_Pulse to predict Calorie_Burnage.
● The blue (diagonal) line represents the structure of the mathematical function that predicts
calorie burnage.
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep
30 80 120 240 10 7
30 85 120 250 10 7
45 90 130 260 8 7
45 95 130 270 8 7
Example
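import matplotlib.pyplot as plt
health_data.plot(x="Average_Pulse", y="Calorie_Burnage", kind="line")
plt.ylim(ymin=0)
plt.xlim(xmin=0)
plt.show()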
Example Explained
● Import the pyplot module of the matplotlib library
● Plot the data from Average_Pulse against Calorie_Burnage
● kind='line' tells us which type of plot we want. Here, we want to have a straight line
● plt.ylim() and plt.xlim() set the value on which each axis starts. Here, we want the
axis to begin from zero
● plt.show() shows us the output
The code above will produce the following result:
Look at the line. What happens to calorie burnage if average pulse increases from 80 to 90?
We can use the diagonal line to find the mathematical function to predict calorie burnage.
As it turns out:
● If the average pulse is 80, the calorie burnage is 240
● If the average pulse is 90, the calorie burnage is 260
● If the average pulse is 100, the calorie burnage is 280
There is a pattern. If average pulse increases by 10, the calorie burnage increases by 20.
Example
def slope(x1, y1, x2, y2):
    s = (y2 - y1) / (x2 - x1)
    return s
print(slope(80, 240, 90, 260))
Example
import pandas as pd
import numpy as np
x = health_data["Average_Pulse"]
y = health_data["Calorie_Burnage"]
slope_intercept = np.polyfit(x,y,1)
print(slope_intercept)
Example Explained:
● Isolate the variables Average_Pulse (x) and Calorie_Burnage (y) from health_data.
● Call the np.polyfit() function.
● The last parameter of the function specifies the degree of the function, which in this case is
"1".
Tip: A linear function is a first-degree function. In our example the function is linear, which
means that the variable x is raised only to the power of one.
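As a self-contained sketch, np.polyfit() can be run on the ten Average_Pulse and Calorie_Burnage observations from the earlier examples, without loading health_data. The lower-case variable names are chosen for this illustration.

```python
import numpy as np

# The ten observations used in the earlier examples.
average_pulse = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
calorie_burnage = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])

# Degree 1 fits a straight line; the result is ordered [slope, intercept].
slope, intercept = np.polyfit(average_pulse, calorie_burnage, 1)

print(round(slope, 2), round(intercept, 2))
```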
We have now calculated the slope (2) and the intercept (80). We can write the mathematical
function as follows:
Predict Calorie_Burnage by using a mathematical expression:
f(x) = 2x + 80
Task:
Now, we want to predict calorie burnage if average pulse is 135.
Remember that the intercept is a constant. A constant is a number that does not change.
We can now substitute the input x with 135:
f(135) = 2 * 135 + 80 = 350
If average pulse is 135, the calorie burnage is 350.
Example
def my_function(x):
    return 2*x + 80
print(my_function(135))
Try to replace x with 140 and 150.
Example
import matplotlib.pyplot as plt
plt.show()
Example Explained
● Import the pyplot module of the matplotlib library
● Plot the data from Average_Pulse against Calorie_Burnage
● kind='line' tells us which type of plot we want. Here, we want to have a straight line
● plt.ylim() and plt.xlim() set the values on which each axis starts and stops.
● plt.show() shows us the output
Introduction to Statistics
Statistics is the science of analyzing data.
When we have created a model for prediction, we must assess the prediction's reliability.
After all, what is a prediction worth, if we cannot rely on it?
Descriptive Statistics
We will first cover some basic descriptive statistics.
Descriptive statistics summarizes important features of a data set such as:
● Count
● Sum
● Standard Deviation
● Percentile
● Average
● Etc.
It is a good starting point to become familiar with the data.
We can use the describe() function in Python to summarize the data:
Example
print(full_health_data.describe())
Output:
Example
import numpy as np
Max_Pulse = full_health_data["Max_Pulse"]
percentile10 = np.percentile(Max_Pulse, 10)
print(percentile10)
● Max_Pulse = full_health_data["Max_Pulse"] - Isolate the variable Max_Pulse from the full
health data set.
● np.percentile() returns the requested percentile, here the 10% percentile of Max_Pulse.
The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a
Max_Pulse of 120 or lower.
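The meaning of a percentile can be verified on made-up numbers. The values below are hypothetical and are not taken from the full health data set.

```python
import numpy as np

# Hypothetical Max_Pulse values: 101, 102, ..., 200.
max_pulse = np.arange(101, 201)

# The 10% percentile of the values.
p10 = np.percentile(max_pulse, 10)

# Share of observations that are at or below the 10% percentile.
share_at_or_below = np.mean(max_pulse <= p10)

print(p10)
print(share_at_or_below)
```

The share of observations at or below the percentile comes out at one tenth, which is what the 10% percentile promises.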
Standard Deviation
Standard deviation is a number that describes how spread out the observations are.
A mathematical function will have difficulties in predicting precise values, if the observations are
"spread". Standard deviation is a measure of uncertainty.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
Tip: Standard Deviation is often represented by the symbol Sigma: σ
We can use the std() function from Numpy to find the standard deviation of a variable:
Example
import numpy as np
std = np.std(full_health_data)
print(std)
The output:
Coefficient of Variation
The coefficient of variation expresses how large the standard deviation is relative to the mean.
Mathematically, the coefficient of variation is defined as:
Coefficient of Variation = Standard Deviation / Mean
We can do this in Python if we proceed with the following code:
Example
import numpy as np
cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)
The output:
We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard
deviation relative to their mean, compared to Max_Pulse, Average_Pulse and Hours_Sleep.
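For a single variable the calculation can be followed step by step. The sketch below uses the ten Average_Pulse observations from the earlier examples rather than the full data set, so its numbers are not the ones discussed above.

```python
import numpy as np

# The ten Average_Pulse observations from the earlier examples.
average_pulse = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

mean = np.mean(average_pulse)
std = np.std(average_pulse)   # population standard deviation (ddof=0)
cv = std / mean               # coefficient of variation

print(mean, round(std, 2), round(cv, 2))
```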
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation. Or the other way
around, if you multiply the standard deviation by itself, you get the variance!
We will first use the data set with 10 observations to give an example of how we can calculate the
variance:
Duration Average_Pulse Max_Pulse Calorie_Burnage Hours_Work Hours_Sleep
30 80 120 240 10 7
30 85 120 250 10 7
45 90 130 260 8 7
45 95 130 270 8 7
Step 2: For Each Value - Find the Difference From the Mean
2. Find the difference from the mean for each value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
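The remaining steps can be sketched in NumPy: square each difference and take the average of the squares. The sketch assumes the population variance (dividing by the number of observations), which is also what np.var() computes by default.

```python
import numpy as np

average_pulse = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

mean = average_pulse.mean()           # 102.5
differences = average_pulse - mean    # the differences listed above
squared = differences ** 2            # square each difference
variance = squared.mean()             # average of the squared differences

print(variance)
print(np.isclose(variance, np.var(average_pulse)))   # compare with np.var()
print(np.sqrt(variance))                             # the standard deviation
```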
Example
import numpy as np
var = np.var(health_data)
print(var)
The output:
Example
import numpy as np
var_full = np.var(full_health_data)
print(var_full)
The output:
Correlation
Correlation measures the relationship between two variables.
We mentioned that the purpose of a function is to predict a value by converting input (x) to output
(f(x)). We can also say that a function uses the relationship between two variables for
prediction.
Correlation Coefficient
The correlation coefficient measures the relationship between two variables.
The correlation coefficient can never be less than -1 or higher than 1.
● 1 = there is a perfect positive linear relationship between the variables (like Average_Pulse against
Calorie_Burnage)
● 0 = there is no linear relationship between the variables
● -1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours
worked leads to higher calorie burnage during a training session)
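The three cases can be reproduced with np.corrcoef() on made-up data. The arrays below are constructed for the illustration only; the last one has a clear relationship to x that is not linear.

```python
import numpy as np

x = np.arange(10, dtype=float)

# np.corrcoef returns a 2x2 matrix; element [0, 1] is the coefficient between the inputs.
r_positive = np.corrcoef(x, 2 * x + 80)[0, 1]      # y rises exactly with x
r_negative = np.corrcoef(x, -3 * x + 50)[0, 1]     # y falls exactly with x
r_none = np.corrcoef(x, (x - 4.5) ** 2)[0, 1]      # symmetric curve, no linear trend

print(round(r_positive, 2), round(r_negative, 2), round(abs(r_none), 2))
```

The last case shows why a coefficient of 0 only rules out a linear relationship, not every relationship.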
Example
import matplotlib.pyplot as plt
Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
We have plotted fictional data here. The x-axis represents the number of hours worked at our job
before a training session. The y-axis is Calorie_Burnage.
If we work longer hours, we tend to have lower calorie burnage because we are exhausted before
the training session.
The correlation coefficient here is -1.
Example
import pandas as pd
import matplotlib.pyplot as plt
Example of No Linear Relationship (Correlation coefficient = 0)
Here, we have plotted Max_Pulse against Duration from the full_health_data set.
As you can see, there is no linear relationship between the two variables. It means that a longer
training session does not lead to a higher Max_Pulse.
The correlation coefficient here is 0.
Example
import matplotlib.pyplot as plt
Correlation Matrix
A matrix is an array of numbers arranged in rows and columns.
A correlation matrix is simply a table showing the correlation coefficients between variables.
Here, the variables are represented in the first row, and in the first column:
The table above has used data from the full health data set.
Observations:
● We observe that Duration and Calorie_Burnage are closely related, with a correlation
coefficient of 0.89. This makes sense as the longer we train, the more calories we burn
● We observe that there is almost no linear relationship between Average_Pulse and
Calorie_Burnage (correlation coefficient of 0.02)
● Can we conclude that Average_Pulse does not affect Calorie_Burnage? No. We will come
back to answer this question later!
Example
Corr_Matrix = round(full_health_data.corr(),2)
print(Corr_Matrix)
Output:
Using a Heatmap
We can use a Heatmap to Visualize the Correlation Between Variables:
The closer the correlation coefficient is to 1, the greener the squares get.
The closer the correlation coefficient is to -1, the browner the squares get.
Example
import matplotlib.pyplot as plt
import seaborn as sns
correlation_full_health = full_health_data.corr()
axis_corr = sns.heatmap(
    correlation_full_health,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(50, 500, n=500),
    square=True
)
plt.show()
Example Explained:
● Import the library seaborn as sns.
● Use the full_health_data set.
● Use sns.heatmap() to tell Python that we want a heatmap to visualize the correlation matrix.
● Use the correlation matrix. Define the maximal and minimal values of the heatmap. Define
that 0 is the center.
● Define the colors with sns.diverging_palette. n=500 means that we want 500 types of color in
the same color palette.
● square = True means that we want to see squares.
Example
import pandas as pd
import matplotlib.pyplot as plt
Drowning_Accident = [20,40,60,80,100,120,140,160,180,200]
Ice_Cream_Sale = [20,40,60,80,100,120,140,160,180,200]
Drowning = {"Drowning_Accident": [20,40,60,80,100,120,140,160,180,200],
"Ice_Cream_Sale": [20,40,60,80,100,120,140,160,180,200]}
Drowning = pd.DataFrame(data=Drowning)
correlation_beach = Drowning.corr()
print(correlation_beach)
Output:
Linear Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning and in statistical modeling, that relationship is used to predict the outcome of
events.
In this module, we will cover the following questions:
● Can we conclude that Average_Pulse and Duration are related to Calorie_Burnage?
● Can we use Average_Pulse and Duration to predict Calorie_Burnage?
Linear Regression Using One Explanatory Variable
In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear Regression:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
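x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]
# Fit the regression line and keep its slope and intercept
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
# Predicted y value for every x value
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)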
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel ("Calorie_Burnage")
plt.show()
Example Explained:
● Import the modules you need: Pandas, Matplotlib and SciPy
● Isolate Average_Pulse as x. Isolate Calorie_Burnage as y
● Get important key values with: slope, intercept, r, p, std_err = stats.linregress(x, y)
● Create a function that uses the slope and intercept values to return a new value. This new
value represents where on the y-axis the corresponding x value will be placed
● Run each value of the x array through the function. This will result in a new array with new
values for the y-axis: mymodel = list(map(myfunc, x))
● Draw the original scatter plot: plt.scatter(x, y)
● Draw the line of linear regression: plt.plot(x, mymodel)
● Define maximum and minimum values of the axis
● Label the axis: "Average_Pulse" and "Calorie_Burnage"
Output:
Regression Table
The output from linear regression can be summarized in a regression table.
The content of the table includes:
● Information about the model
● Coefficients of the linear regression function
● Regression statistics
● Statistics of the coefficients from the linear regression function
● Other information that we will not cover in this module
Regression Table with Average_Pulse as Explanatory Variable
Example
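import pandas as pd
import statsmodels.formula.api as smf
results = smf.ols("Calorie_Burnage ~ Average_Pulse", data=full_health_data).fit()
print(results.summary())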
Example Explained:
● Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in
Python.
● Use the full_health_data set.
● Create a model based on Ordinary Least Squares with smf.ols(). Notice that the dependent
variable is written first in the formula, followed by ~ and the explanatory variable. Use the
full_health_data data set.
● By calling .fit(), you obtain the variable results. This holds a lot of information about the
regression model.
● Call summary() to get the table with the results of linear regression.
● Dep. Variable: is short for "Dependent Variable". Calorie_Burnage is here the dependent
variable. The Dependent variable is here assumed to be explained by Average_Pulse.
● Model: OLS is short for Ordinary Least Squares. This is a type of model that uses the Least
Squares method.
● Date: and Time: shows the date and time the output was calculated in Python.
The "Coefficients Part" in Regression Table
● Coef is short for coefficient. It is the output of the linear regression function.
The linear regression function can be rewritten mathematically as:
Calorie_Burnage = 0.3296 * Average_Pulse + 346.8662
These numbers mean:
● If Average_Pulse increases by 1, Calorie_Burnage increases by 0.3296 (or 0.3 rounded)
● If Average_Pulse = 0, the Calorie_Burnage is equal to 346.8662 (or 346.9 rounded).
● Remember that the intercept is used to adjust the model's precision of predicting!
Do you think that this is a good model?
Example
def Predict_Calorie_Burnage(Average_Pulse):
    return 0.3296*Average_Pulse + 346.8662
print(Predict_Calorie_Burnage(120))
print(Predict_Calorie_Burnage(130))
print(Predict_Calorie_Burnage(150))
print(Predict_Calorie_Burnage(180))
The "Statistics of the Coefficients Part" in Regression Table
Now, we want to test if the coefficients from the linear regression function have a significant impact
on the dependent variable (Calorie_Burnage).
This means that we want to test whether there is a relationship between Average_Pulse and
Calorie_Burnage, using statistical tests.
There are four components that explain the statistics of the coefficients:
● std err stands for Standard Error
● t is the "t-value" of the coefficients
● P>|t| is called the "P-value"
● [0.025 0.975] represents the confidence interval of the coefficients
We will focus on understanding the "P-value" in this module.
The P-value
The P-value is a statistical number to conclude if there is a relationship between Average_Pulse and
Calorie_Burnage.
We test if the true value of the coefficient is equal to zero (no relationship). The statistical test for
this is called Hypothesis testing.
● A low P-value (< 0.05) means that the coefficient is likely not to equal zero.
● A high P-value (> 0.05) means that we cannot conclude that the explanatory variable affects
the dependent variable (here: if Average_Pulse affects Calorie_Burnage).
● A high P-value is also called an insignificant P-value.
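The decision rule above can be written as a small helper. The function name is_significant and the threshold parameter alpha are chosen for this sketch; 0.001 is a made-up P-value.

```python
def is_significant(p_value, alpha=0.05):
    """Return True when the P-value is below the significance level alpha."""
    return p_value < alpha

# A high P-value: we cannot conclude that there is a relationship.
print(is_significant(0.824))

# A low P-value: the coefficient is likely not equal to zero.
print(is_significant(0.001))
```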
Hypothesis Testing
Hypothesis testing is a statistical procedure to test if your results are valid.
In our example, we are testing if the true coefficient of Average_Pulse and the intercept are equal to
zero.
A hypothesis test has two statements: the null hypothesis and the alternative hypothesis.
● The null hypothesis can be shortly written as H0
● The alternative hypothesis can be shortly written as HA
Mathematically written:
H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
The sign ≠ means "not equal to"
R-Squared
R-Squared and Adjusted R-Squared describe how well the linear regression model fits the data
points:
● A high R-Squared value means that many data points are close to the linear regression
function line.
● A low R-Squared value means that the linear regression function line does not fit the data
well.
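R-Squared can also be computed by hand, which shows what a high or a low value means. The sketch below uses the common definition 1 - (residual sum of squares / total sum of squares); the observed values are hypothetical.

```python
import numpy as np

def r_squared(y, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)      # distance between data and predictions
    ss_tot = np.sum((y - y.mean()) ** 2)    # distance between data and its mean
    return 1 - ss_res / ss_tot

y = [240, 250, 260, 270]   # hypothetical observed values

print(r_squared(y, [240, 250, 260, 270]))   # every prediction hits the data point
print(r_squared(y, [255, 255, 255, 255]))   # always predicting the mean
```

Predictions that hit every data point give the highest possible value, while always predicting the mean gives zero.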
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
x = full_health_data["Duration"]
y = full_health_data["Calorie_Burnage"]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
print(mymodel)
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Duration")
plt.ylabel ("Calorie_Burnage")
plt.show()
Summary - Predicting Calorie_Burnage with Average_Pulse
How can we summarize the linear regression function with Average_Pulse as explanatory variable?
● Coefficient of 0.3296, which means that Average_Pulse has a very small effect on
Calorie_Burnage.
● High P-value (0.824), which means that we cannot conclude a relationship between
Average_Pulse and Calorie_Burnage.
● R-Squared value of 0, which means that the linear regression function line does not fit the
data well.
Example
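import pandas as pd
import statsmodels.formula.api as smf
results = smf.ols("Calorie_Burnage ~ Average_Pulse + Duration", data=full_health_data).fit()
print(results.summary())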
Example Explained:
● Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in
Python.
● Use the full_health_data set.
● Create a model based on Ordinary Least Squares with smf.ols(). Notice that the dependent
variable is written first in the formula, followed by ~ and the explanatory variables. Use the
full_health_data data set.
● By calling .fit(), you obtain the variable results. This holds a lot of information about the
regression model.
● Call summary() to get the table with the results of linear regression.
Output:
Example
def Predict_Calorie_Burnage(Average_Pulse, Duration):
    return 3.1695*Average_Pulse + 5.8434*Duration - 334.5194
print(Predict_Calorie_Burnage(110, 60))
print(Predict_Calorie_Burnage(140, 45))
print(Predict_Calorie_Burnage(175, 20))
The Answers:
● Average pulse is 110 and duration of the training session is 60 minutes = 365 Calories
● Average pulse is 140 and duration of the training session is 45 minutes = 372 Calories
● Average pulse is 175 and duration of the training session is 20 minutes = 337 Calories
Access the Coefficients
Look at the coefficients:
● Calorie_Burnage increases by 3.17 if Average_Pulse increases by one.
● Calorie_Burnage increases by 5.84 if Duration increases by one.
Adjusted R-Squared
There is a problem with R-squared if we have more than one explanatory variable.
R-squared will almost always increase if we add more variables, and will never decrease.
This is because every added variable gives the linear regression function more freedom to fit the
data, even when the variable has no real effect.
If we add random variables that do not affect Calorie_Burnage, we risk falsely concluding that
the linear regression function is a good fit. Adjusted R-squared adjusts for this problem.
It is therefore better to look at the adjusted R-squared value if we have more than one explanatory
variable.
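The adjustment can be sketched with the commonly used formula 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. The function name and the input values below are hypothetical and are not taken from the regression table.

```python
def adjusted_r_squared(r2, n, k):
    """Adjust R-squared for k explanatory variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: the same R-squared is penalized more when more variables are used.
print(adjusted_r_squared(0.80, n=50, k=2))
print(adjusted_r_squared(0.80, n=50, k=10))
```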
The Adjusted R-squared is 0.814.
The value of R-Squared is always between 0 and 1 (0% to 100%).
● A high R-Squared value means that many data points are close to the linear regression
function line.
● A low R-Squared value means that the linear regression function line does not fit the data
well.
Conclusion: The model fits the data points well!
Congratulations! You have now finished the final module of the data science library.