Unit 5
Department of Mathematics
Course Materials
Year : II
Semester : III
Syllabus
Unit I
Introduction, Causality and Experiments, Data Preprocessing: Data cleaning, Data reduction, Data
transformation, Data discretization. Visualization and Graphing: Visualizing Categorical
Distributions, Visualizing Numerical Distributions, Overlaid Graphs, plots, and summary statistics of
exploratory data analysis
Unit-II
Randomness, Probability, Sampling, Sample Means and Sample Sizes
Unit-III
Introduction to Statistics, Descriptive statistics – Central tendency, dispersion, variance, covariance,
kurtosis, five point summary, Distributions, Bayes Theorem, Error Probabilities;
Unit-IV
Statistical Inference; Hypothesis Testing, P-Values, Assessing Models, Decisions and Uncertainty,
Comparing Samples, A/B Testing, Causality.
Unit-V
Estimation, Confidence Intervals, Inference for Regression, Classification, Graphical Models,
Prediction, Updating Predictions.
Text Books:
1. Ani Adhikari and John DeNero, “Computational and Inferential Thinking: The Foundations
of Data Science”, e-book.
Reference Books:
1. Data Mining for Business Analytics: Concepts, Techniques and Applications in R, by Galit
Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr., Wiley
India, 2018.
2. Rachel Schutt & Cathy O’Neil, “Doing Data Science”, O’Reilly, First Edition, 2013.
Unit V
1. Estimation
Estimation refers to the process by which one makes inferences about a population, based on
information obtained from a sample.
Statisticians use sample statistics to estimate population parameters. For example, sample means are
used to estimate population means; sample proportions, to estimate population proportions.
But if the population is very large – for example, if it consists of income of all the households in the
United States – then it might be too expensive and time-consuming to gather data from the entire
population. In such situations, data scientists rely on sampling at random from the population.
This leads to a question of inference: How to make justifiable conclusions about the unknown
parameter, based on the data in the random sample? We will answer this question by using
inferential thinking.
A statistic based on a random sample can be a reasonable estimate of an unknown parameter in the
population. For example, you might want to use the median/mean annual income of sampled
households as an estimate of the median/mean annual income of all households in the U.S.
Types of Estimation
Point estimate: A point estimate of a population parameter is a single value of a statistic. For
example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample
proportion p is a point estimate of the population proportion P.
Interval estimate: An interval estimate is defined by two numbers, between which a population
parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ. It
indicates that the population mean is greater than a but less than b.
2. Confidence Interval
A confidence interval is the mean of your estimate plus and minus the variation in that estimate. It is
the range of values you expect your estimate to fall within, at a given level of confidence, if you
were to repeat the test.
For example, a confidence interval for a population mean has the form
$$\text{Confidence Interval} = \bar{X} \pm Z\,\frac{s}{\sqrt{n}}$$
where $\bar{X}$ is the sample mean, Z is the critical value for the chosen confidence level, s is the sample standard deviation, and n is the sample size.
For example:
Suppose we randomly select a sample of 30 Nobel Prize winners and record their ages. We can calculate the mean and the standard deviation of that sample, and use the sample data to estimate the average age of all Nobel Prize winners.
Solution
The following steps are used to calculate a confidence interval for the population mean:
The sample mean is the sum of all the values divided by the sample size:
Sample mean (x̄) = 62.1, standard deviation (s) = 13.46, sample size (n) = 30.
The margin of error is the difference between the point estimate and the lower and upper bounds.
$$\text{Margin of Error } (E) = t_{\alpha/2}(df)\cdot\frac{s}{\sqrt{n}}$$
$t_{\alpha/2}(df)$ is obtained from the t-distribution and the chosen confidence level.
$\frac{s}{\sqrt{n}}$ (the standard error) is calculated from the sample standard deviation (s) and the sample size (n).
In our example with a sample standard deviation (s) of 13.46 and sample size of 30:
With Python, use the SciPy Stats library's t.ppf() function (the percent point function) to find the t-value for α/2 = 0.025 and 29 degrees of freedom:
import numpy as np
from scipy import stats
print(np.abs(stats.t.ppf(0.025, 29)))
Output:
$t_{0.025}(29) \approx 2.05$
The lower and upper bounds of the confidence interval are found by subtracting the margin of error (E) from, and adding it to, the point estimate (x̄).
Conclusion
The 95% confidence interval for the mean age of Nobel Prize winners is between 57.06 and 67.14
years. That is, we are 95% confident that the population mean falls between 57.06 and 67.14.
Python
import math
import numpy as np
from scipy import stats

x_bar, s, n, confidence_level = 62.1, 13.46, 30, 0.95  # Nobel Prize sample statistics

# Calculate alpha, degrees of freedom (df), the critical t-value, and the margin of error
alpha = 1 - confidence_level
df = n - 1
standard_error = s / math.sqrt(n)
t_score = np.abs(stats.t.ppf(alpha / 2, df))  # stats.norm.ppf(alpha/2) for a z-interval
margin_of_error = t_score * standard_error
print("T-Score:", round(t_score, 3))
print("The 95.0% confidence interval for the population mean is: between", round(x_bar - margin_of_error, 3), "and", round(x_bar + margin_of_error, 3))
Output
T-Score: 2.045
The 95.0% confidence interval for the population mean is: between 57.074 and 67.126
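As a cross-check, SciPy's t.interval() can produce the same interval in one call. This is a minimal sketch assuming the same sample statistics as above (x̄ = 62.1, s = 13.46, n = 30):

import math
from scipy import stats

x_bar, s, n = 62.1, 13.46, 30
# 95% t-interval for the population mean; same bounds as the manual calculation above
lower, upper = stats.t.interval(0.95, n - 1, loc=x_bar, scale=s / math.sqrt(n))
print(round(lower, 3), round(upper, 3))   # 57.074 67.126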
3. Interval Estimation
• Suppose we have a population with some unknown
parameter(s).
Example: Normal(μ,σ)
μ and σ are parameters.
• We need to draw conclusions (make inferences) about the
unknown parameters.
• We select samples, compute some statistics, and make
inferences about the unknown parameters based on the
sampling distributions of the statistics.
• Statistical inference includes:
(1) Estimation of the parameters: point estimation and interval estimation (confidence intervals).
(2) Tests of hypotheses about the parameters.
Point Estimation:
A point estimate of some population parameter θ is a single value $\hat{\theta}$ of a statistic $\hat{\Theta}$. For example, the value $\bar{x}$ of the statistic $\bar{X}$, computed from a sample of size n, is a point estimate of the population mean μ.
Notation:
$Z_a$ is the Z-value leaving an area of a to the right; i.e., $P(Z > Z_a) = a$, or equivalently, $P(Z < Z_a) = 1 - a$.
Confidence Interval for the Mean μ (σ known): A $(1-\alpha)100\%$ confidence interval for μ is
$$\left(\bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \ \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right)
\;\Leftrightarrow\; \bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\;\Leftrightarrow\; \bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
where $Z_{\alpha/2}$ is the Z-value leaving an area of α/2 to the right; i.e., $P(Z > Z_{\alpha/2}) = \alpha/2$, or equivalently, $P(Z < Z_{\alpha/2}) = 1 - \alpha/2$.
Note:
We are $(1-\alpha)100\%$ confident that $\mu \in \left(\bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right)$.
Example 9.2:
The average zinc concentration recorded from a sample of measurements taken in 36 different
locations in a river is found to be 2.6 grams per milliliter. Find a 95% confidence interval (C.I.)
for the mean zinc concentration in the river. Assume that the population standard deviation is 0.3.
Solution:
μ=the mean zinc concentration in the river.
(unknown parameter)
Population: μ = ?? (unknown), σ = 0.3
Sample: n = 36, X̄ = 2.6
$Z_{\alpha/2} = Z_{0.025} = 1.96$
A 95% C.I. for μ is
$$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\;\Leftrightarrow\; \bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
$$\Leftrightarrow\; 2.6 - (1.96)\left(\frac{0.3}{\sqrt{36}}\right) < \mu < 2.6 + (1.96)\left(\frac{0.3}{\sqrt{36}}\right)$$
$$\Leftrightarrow\; 2.6 - 0.098 < \mu < 2.6 + 0.098
\;\Leftrightarrow\; 2.502 < \mu < 2.698
\;\Leftrightarrow\; \mu \in (2.502,\ 2.698)$$
We are 95% confident that μ ∈ (2.502, 2.698).
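This interval can be reproduced in Python with a short sketch (assuming SciPy; the values 2.6, 0.3, and 36 are those of Example 9.2):

import math
from scipy import stats

x_bar, sigma, n = 2.6, 0.3, 36
# 95% z-interval for the mean when sigma is known
lower, upper = stats.norm.interval(0.95, loc=x_bar, scale=sigma / math.sqrt(n))
print(round(lower, 3), round(upper, 3))   # approximately 2.502 and 2.698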
Note:
The maximum error of estimation is $Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$, with $(1-\alpha)100\%$ confidence.
Example:
In Example 9.2, we are 95% confident that the sample mean $\bar{X} = 2.6$ differs from the true mean μ by an amount less than $Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = (1.96)\left(\frac{0.3}{\sqrt{36}}\right) = 0.098$.
Note:
Let e be the maximum amount of the error, that is, $e = Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$; then
$$e = Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;\Leftrightarrow\; \sqrt{n} = Z_{\alpha/2}\,\frac{\sigma}{e} \;\Leftrightarrow\; n = \left(Z_{\alpha/2}\,\frac{\sigma}{e}\right)^{2}$$
Theorem 9.2:
If $\bar{X}$ is used as an estimate of μ, we can then be $(1-\alpha)100\%$ confident that the error (in estimation) will not exceed a specified amount e when the sample size is $n = \left(Z_{\alpha/2}\,\frac{\sigma}{e}\right)^{2}$.
Note:
1. All fractional values of $n = \left(Z_{\alpha/2}\,\sigma/e\right)^{2}$ are rounded up to the next whole number.
2. If σ is unknown, we could take a preliminary sample of size n ≥ 30 to provide an estimate of σ. Then, using $S = \sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^{2}/(n-1)}$ as an approximation for σ in Theorem 9.2, we could determine approximately how many observations are needed to provide the desired degree of accuracy.
Example 9.3:
How large a sample is required in Example 9.2 if we want to be
95% confident that our estimate of μ is off by less than 0.05?
Solution:
Here $Z_{0.025} = 1.96$, σ = 0.3, and e = 0.05, so by Theorem 9.2,
$$n = \left(\frac{(1.96)(0.3)}{0.05}\right)^{2} = 138.3 \approx 139.$$
Hence we need a sample of size at least 139.

Example (σ unknown, small sample): For a sample of size n = 7 with sample mean $\bar{X} = \sum_{i=1}^{n} X_i/n = 10.0$, sample standard deviation s = 0.283, and $t_{0.025}(6) = 2.447$, a 95% C.I. for μ is
$$10.0 - (2.447)\left(\frac{0.283}{\sqrt{7}}\right) < \mu < 10.0 + (2.447)\left(\frac{0.283}{\sqrt{7}}\right)$$
$$\Leftrightarrow\; 10.0 - 0.262 < \mu < 10.0 + 0.262 \;\Leftrightarrow\; 9.74 < \mu < 10.26 \;\Leftrightarrow\; \mu \in (9.74,\ 10.26)$$
We are 95% confident that μ ∈ (9.74, 10.26).
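Both calculations above can be checked with a short Python sketch (assuming SciPy; note that the exact critical values differ slightly from the rounded table values used in the hand computation):

import math
from scipy import stats

# Example 9.3: sample size so that the error is at most e = 0.05 with 95% confidence
sigma, e = 0.3, 0.05
z = stats.norm.ppf(1 - 0.05 / 2)              # approximately 1.96
print(math.ceil((z * sigma / e) ** 2))        # 139

# Small-sample t-interval: n = 7, mean 10.0, s = 0.283
lower, upper = stats.t.interval(0.95, 6, loc=10.0, scale=0.283 / math.sqrt(7))
print(round(lower, 2), round(upper, 2))       # approximately 9.74 and 10.26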
Confidence Interval for the Difference between Two Means (σ₁, σ₂ known): A $(1-\alpha)100\%$ C.I. for $\mu_1 - \mu_2$ is
$$(\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}}$$
or
$$\left((\bar{X}_1 - \bar{X}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}},\ \ (\bar{X}_1 - \bar{X}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}}\right)$$
Example:
Two types of engines, A and B, were compared for gas mileage; 50 experiments were conducted with engine A and 75 with engine B. The average gas mileage for engine A was 36 miles per gallon and the average for engine B was 42 miles per gallon. Find a 96% confidence interval for μB−μA, where μA and μB are the population mean gas mileages for engines A and B, respectively. Assume that the population standard deviations are 6 and 8 for engines A and B, respectively.
Solution:
Engine A: n_A = 50, X̄_A = 36, σ_A = 6
Engine B: n_B = 75, X̄_B = 42, σ_B = 8
A point estimate for μB−μA is X̄_B − X̄_A = 42 − 36 = 6.
Confidence interval:
96% = (1−α)100% ⇔ 0.96 = 1−α ⇔ α = 0.04 ⇔ α/2 = 0.02
$Z_{\alpha/2} = Z_{0.02} = 2.05$
A 96% C.I. for μB−μA is
$$(\bar{X}_B - \bar{X}_A) - Z_{\alpha/2}\sqrt{\frac{\sigma_B^{2}}{n_B} + \frac{\sigma_A^{2}}{n_A}} < \mu_B - \mu_A < (\bar{X}_B - \bar{X}_A) + Z_{\alpha/2}\sqrt{\frac{\sigma_B^{2}}{n_B} + \frac{\sigma_A^{2}}{n_A}}$$
i.e., $(\bar{X}_B - \bar{X}_A) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_B^{2}}{n_B} + \frac{\sigma_A^{2}}{n_A}}$
$$(42 - 36) \pm Z_{0.02}\sqrt{\frac{8^{2}}{75} + \frac{6^{2}}{50}} \;=\; 6 \pm (2.05)\sqrt{\frac{64}{75} + \frac{36}{50}} \;=\; 6 \pm 2.571$$
$$3.43 < \mu_B - \mu_A < 8.57$$
We are 96% confident that μB−μA ∈ (3.43, 8.57).
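A sketch of the same two-sample calculation (assuming SciPy; the exact value $Z_{0.02} \approx 2.054$ gives bounds marginally different from the table value 2.05 used above):

import math
from scipy import stats

x_bar_B, x_bar_A = 42, 36
sigma_B, sigma_A = 8, 6
n_B, n_A = 75, 50

z = stats.norm.ppf(1 - 0.04 / 2)                          # approximately 2.054
se = math.sqrt(sigma_B**2 / n_B + sigma_A**2 / n_A)       # standard error of the difference
diff = x_bar_B - x_bar_A
print(round(diff - z * se, 2), round(diff + z * se, 2))   # approximately 3.42 and 8.58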
Confidence Interval for the Difference between Two Means (variances unknown but equal, small samples): When the population variances are unknown but assumed equal, a pooled estimate $S_p^{2}$ of the common variance is used together with the t-distribution with $n_A + n_B - 2$ degrees of freedom. Example (two types of wire, A and B, with six measurements each):
Solution:
Wire A: n_A = 6, X̄_A = 140.67, S²_A = 7.86690
Wire B: n_B = 6, X̄_B = 138.50, S²_B = 7.10009
A point estimate for μA−μB is X̄_A − X̄_B = 140.67 − 138.50 = 2.17.
Confidence interval:
95% = (1−α)100% ⇔ 0.95 = 1−α ⇔ α = 0.05 ⇔ α/2 = 0.025
ν = df = n_A + n_B − 2 = 10
$t_{\alpha/2} = t_{0.025}(10) = 2.228$
$$S_p^{2} = \frac{(n_A - 1)S_A^{2} + (n_B - 1)S_B^{2}}{n_A + n_B - 2} = \frac{(6-1)(7.86690) + (6-1)(7.10009)}{6 + 6 - 2} = 7.4835$$
$$S_p = \sqrt{S_p^{2}} = \sqrt{7.4835} = 2.7356$$
A 95% C.I. for μA−μB is
$$(\bar{X}_A - \bar{X}_B) - t_{\alpha/2}\,S_p\sqrt{\frac{1}{n_A} + \frac{1}{n_B}} < \mu_A - \mu_B < (\bar{X}_A - \bar{X}_B) + t_{\alpha/2}\,S_p\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}$$
or $(\bar{X}_A - \bar{X}_B) \pm t_{\alpha/2}\,S_p\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}$
$$(140.67 - 138.50) \pm (2.228)(2.7356)\sqrt{\frac{1}{6} + \frac{1}{6}} \;=\; 2.17 \pm 3.519$$
$$-1.35 < \mu_A - \mu_B < 5.69$$
We are 95% confident that μA−μB ∈ (−1.35, 5.69).
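The pooled-variance interval can be reproduced with a short sketch (assuming SciPy; the summary statistics are those given above):

import math
from scipy import stats

n_A, x_bar_A, s2_A = 6, 140.67, 7.86690
n_B, x_bar_B, s2_B = 6, 138.50, 7.10009

df = n_A + n_B - 2
sp = math.sqrt(((n_A - 1) * s2_A + (n_B - 1) * s2_B) / df)   # pooled standard deviation
t = stats.t.ppf(1 - 0.05 / 2, df)                            # approximately 2.228
margin = t * sp * math.sqrt(1 / n_A + 1 / n_B)
diff = x_bar_A - x_bar_B
print(round(diff - margin, 2), round(diff + margin, 2))      # approximately -1.35 and 5.69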
4. Regression analysis
Regression is defined as a statistical method that helps us to analyze and understand the relationship
between two or more variables of interest.
Regression is a statistical method used in finance, investing, and other disciplines that attempts to
determine the strength and character of the relationship between one dependent variable (usually
denoted by Y) and a series of other variables (known as independent variables).
In regression, we normally have one dependent variable and one or more independent variables.
Here we try to “regress” the value of the dependent variable “Y” with the help of the independent
variables.
Applications of Regression
Regression analysis is used for prediction and forecasting. This has substantial overlap with the field
of machine learning. This statistical method is used across different industries such as,
Financial Industry- Understand the trend in the stock prices, forecast the prices, and evaluate risks
in the insurance domain
Marketing- Understand the effectiveness of market campaigns, and forecast pricing and sales of the
product.
Manufacturing- Evaluate the relationships among design and process variables in order to build a better engine with improved performance
Medicine- Forecast the different combinations of medicines to prepare generic medicines for
diseases.
i. Linear Regression
The simplest of all regression types is Linear Regression which tries to establish relationships
between Independent and Dependent variables. Linear Regression is a predictive model used for
finding the linear relationship between a dependent variable and one or more independent variables.
Linear regression establishes the linear relationship between two variables based on a line of best fit.
Linear regression is thus graphically depicted using a straight line with the slope defining how the
change in one variable impacts a change in the other.
Y_Predict = b0 + b1x
$$\text{Slope } (b_1) = \frac{n\,\Sigma xy - (\Sigma x)(\Sigma y)}{n\,\Sigma x^{2} - (\Sigma x)^{2}}$$
$$\text{Intercept } (b_0) = \frac{\Sigma y - b_1\,\Sigma x}{n}$$
Where,
x = Independent variable
b1 = The slope of the line
b0 = The y-intercept
Example:
Consider the relationship between degrees Fahrenheit and degrees Celsius, where the observed (x, y)
data points fall exactly on a line. As you may remember, the relationship is known to be:
F = (9/5)C + 32, or Y = 32 + (9/5)X
Here,
Y = F
X = C
intercept (b0) = 32
slope (b1) = 9/5
That is, if you know the temperature in degrees Celsius, you can use this equation to determine the
temperature in degrees Fahrenheit exactly.
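A quick check of this exact relationship in Python (a sketch; the Celsius values below are chosen arbitrarily): fitting a least-squares line to points generated from F = (9/5)C + 32 recovers the slope 1.8 and the intercept 32.

import numpy as np

C = np.array([0, 10, 20, 30, 40])    # arbitrary Celsius values
F = (9 / 5) * C + 32                  # exact Fahrenheit values
b1, b0 = np.polyfit(C, F, 1)          # least-squares slope and intercept
print(b1, b0)                         # approximately 1.8 and 32.0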
If there is more than one independent variable, the model is called Multiple Linear Regression.
Multiple linear regression is used to estimate the relationship between two or more independent
variables and one dependent variable (e.g., how rainfall, temperature, and the amount of fertilizer
added affect crop growth).
Y=a+b1X1+b2X2+b3X3+...+btXt+u
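A minimal multiple-regression sketch (the rainfall/temperature/fertilizer numbers below are made up purely for illustration, and scikit-learn's LinearRegression is assumed):

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: rainfall (mm), temperature (deg C), fertilizer (kg) -> crop growth
X = np.array([[100, 20, 5], [120, 22, 6], [90, 19, 4], [130, 24, 7], [110, 21, 5]])
y = np.array([30, 36, 27, 41, 33])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # a and b1, b2, b3 in Y = a + b1X1 + b2X2 + b3X3
print(model.predict([[115, 23, 6]]))   # prediction for a new observation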
Python
In the example below, the x-axis represents age, and the y-axis represents speed. We have registered
the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data we collected
could be used in a linear regression:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Code:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # means of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y)
    # predicted response vector and regression line
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

if __name__ == "__main__":
    x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
    y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {:.2f}\nb_1 = {:.2f}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)
Output:
Estimated coefficients:
b_0 = 103.11
b_1 = -1.75
5. Classification
Classification is a process of categorizing data or objects into predefined classes or categories based
on their features or attributes. In machine learning, classification is a type of supervised learning
technique where an algorithm is trained on a labeled dataset to predict the class or category of new,
unseen data.
Binary Classification: In binary classification, the goal is to classify the input into one of two
classes or categories. Example – On the basis of the given health conditions of a person, we have to
determine whether the person has a certain disease or not.
Multiclass Classification: In multi-class classification, the goal is to classify the input into one of
several classes or categories. For example, on the basis of data about different species of flowers,
we have to determine which species our observation belongs to.
Linear Classifiers: Linear models create a linear decision boundary between classes. They
are simple and computationally efficient. Some linear classification models are:
Logistic Regression
Single-layer Perceptron
Non-linear Classifiers: Non-linear models create a more complex, non-linear decision boundary
between classes. Some non-linear classification models are:
K-Nearest Neighbours
Kernel SVM
Naive Bayes
Random Forests
Steps in building a classification model:
1. Problem understanding: The first step is to understand the problem and the classes to be
predicted. For example, suppose we have to predict whether a patient has a certain disease or not,
on the basis of 7 independent variables, called features. This means there can be only two
possible outcomes: the patient has the disease, or the patient does not.
2. Data preparation: Once you have a good understanding of the problem, the next step is to
prepare your data. This includes collecting and preprocessing the data and splitting it into
training, validation, and test sets (see the sketch after this list). In this step, the data is cleaned,
preprocessed, and transformed into a format that can be used by the classification algorithm.
3. Feature Extraction: The relevant features or attributes are extracted from the data that can
be used to differentiate between the different classes.
Suppose our input X has 7 independent features, but only 5 of them influence the label
or target value while the remaining 2 are negligibly correlated or uncorrelated with it;
then we will use only these 5 features for model training.
4. Model Selection: There are many different models that can be used for classification,
including logistic regression, decision trees, support vector machines (SVM), or neural
networks. It is important to select a model that is appropriate for your problem, taking into
account the size and complexity of your data, and the computational resources you have
available.
5. Model Training: Once you have selected a model, the next step is to train it on your training
data. This involves adjusting the parameters of the model to minimize the error between the
predicted class labels and the actual class labels for the training data.
6. Model Evaluation: Evaluating the model: After training the model, it is important to
evaluate its performance on a validation set. This will give you a good idea of how well the
model is likely to perform on new, unseen data.
Log Loss or Cross-Entropy Loss, Confusion Matrix, Precision, Recall, and AUC-
ROC curve are the quality metrics used for measuring the performance of the model.
7. Fine-tuning the model: If the model’s performance is not satisfactory, you can fine-tune it
by adjusting the parameters, or trying a different model.
8. Deploying the model: Finally, once we are satisfied with the performance of the model, we
can deploy it to make predictions on test data, and it can be used for real-world problems.
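A minimal sketch of the data-preparation step (assuming scikit-learn and the iris dataset used later in this unit), splitting the data into training, validation, and test sets:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
# first split off a test set, then carve a validation set out of the remaining data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
print(X_train.shape, X_val.shape, X_test.shape)   # (90, 4) (30, 4) (30, 4)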
Classification Accuracy: The proportion of correctly classified instances over the total number of
instances in the test set. It is a simple and intuitive metric but can be misleading in imbalanced
datasets where the majority class dominates the accuracy score.
Confusion matrix: A table that shows the number of true positives, true negatives, false positives,
and false negatives for each class, which can be used to calculate various evaluation metrics.
Precision and Recall: Precision measures the proportion of true positives over the total number of
predicted positives, while recall measures the proportion of true positives over the total number of
actual positives. These metrics are useful in scenarios where one class is more important than the
other, or when there is a trade-off between false positives and false negatives.
F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) /
(precision + recall). It is a useful metric for imbalanced datasets where both precision and recall are
important.
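A short sketch computing these metrics with scikit-learn (the y_true and y_pred vectors below are made-up labels used only for illustration):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # predicted labels

print(confusion_matrix(y_true, y_pred))                 # rows: actual, columns: predicted
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))          # TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))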
Classification algorithms are widely used in many real-world applications across various domains,
including:
Medical diagnosis
Image classification
Sentiment analysis.
Fraud detection
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score

# load the iris dataset and split it into training and test sets
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=1)
print("y_test.shape", y_test.shape)

# train an SVM classifier (a linear kernel is assumed here) and make predictions
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)
svm_clf_pred = svm_clf.predict(X_test)
print(svm_clf_pred)  # predicted labels
print(y_test)        # actual labels
print("Accuracy:", accuracy_score(y_test, svm_clf_pred))
print("Precision:", precision_score(y_test, svm_clf_pred, average='weighted'))
Output
y_test.shape (90,)
[0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0 ...
 1 1 2 1 2 1 0 0 0 2 0 2 2 2 0 0]
[0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0 ...
 1 1 2 1 2 1 0 0 0 2 0 1 2 2 0 0]
Other Classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# make predictions
gnb_pred = gnb.predict(X_test)

# Decision Tree classifier
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)
# make predictions
dt_pred = dt.predict(X_test)
6. Prediction
The term predictive analytics refers to the use of statistics and modeling techniques to make
predictions about future outcomes and performance. Predictive analytics looks at current and
historical data patterns to determine if those patterns are likely to emerge again. This allows
businesses and investors to adjust where they use their resources to take advantage of possible future
events.
Predictive analytics is a form of technology that makes predictions about certain unknowns in the
future. It draws on a series of techniques to make these determinations, including artificial
intelligence (AI), data mining, machine learning, modeling, and statistics. Predictive models are used
for all kinds of applications, including weather forecasts, creating video games, translating voice to
text, customer service, and investment portfolio strategies.
6.1 Applications
Forecasting
Credit Score
Credit scoring makes extensive use of predictive analytics. When a consumer or business applies for
credit, data on the applicant's credit history and the credit record of borrowers with similar
characteristics are used to predict the risk that the applicant might fail to perform on any credit
extended.
Fraud Detection
Financial services can use predictive analytics to examine transactions, trends, and patterns. If any of
this activity appears irregular, an institution can investigate it for fraudulent activity. This may be
done by analyzing activity between bank accounts or analyzing when certain transactions occur.
Supply Chain
Supply chain analytics is used to predict and manage inventory levels and pricing strategies. Supply
chain predictive analytics use historical data and statistical models to forecast future supply chain
performance, demand, and potential disruptions. This helps businesses proactively identify and
address risks, optimize resources and processes, and improve decision-making. These steps allow
companies to forecast what materials will be on hand at any given moment and whether there will be
any shortages.
Human Resources
Human resources uses predictive analytics to improve various processes, such as forecasting future
workforce needs and skills requirements or analyzing employee data to identify factors that
contribute to high turnover rates. Predictive analytics can also analyze an employee's performance,
skills, and preferences to predict their career progression and help with career development planning
in addition to forecasting diversity or inclusion initiatives.
There are three common techniques used in predictive analytics: decision trees, neural networks,
and regression. Each of these is described below.
Regression
This is the model that is used the most in statistical analysis. Use it when you want to determine
patterns in large sets of data and when there's a linear relationship between the inputs. This method
works by figuring out a formula, which represents the relationship between all the inputs found in
the dataset. For example, you can use regression to figure out how price and other key factors can
shape the performance of a security.
Neural Networks
Neural networks were developed as a form of predictive analytics by imitating the way the human
brain works. This model can deal with complex data relationships using artificial intelligence and
pattern recognition. Use it if you have several hurdles that you need to overcome like when you have
too much data on hand, when you don't have the formula you need to help you find a relationship
between the inputs and outputs in your dataset, or when you need to make predictions rather than
come up with explanations.
Decision Trees
Decision trees are the simplest models because they're easy to understand and dissect. They're also
very useful when you need to make a decision in a short period of time.
import pandas
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

BIKE = pandas.read_csv("day.csv")

bike = BIKE.drop(['dteday'], axis=1)
categorical_col_updated = ['season','yr','mnth','weathersit','holiday']
bike = pandas.get_dummies(bike, columns=categorical_col_updated)  # one-hot encoding (assumed step; it yields 32 feature columns)

# Separating the dependent and independent data variables into two data frames.
X = bike.drop(['cnt'], axis=1)
Y = bike['cnt']

# Splitting the dataset into 80% training data and 20% testing data (random_state assumed).
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)
print("X train input:", X_train.shape)
print("Y Train input:", Y_train.shape)
print("X Test input", X_test.shape)
print("Y Test input", Y_test.shape)
Output
X train input: (584, 32)
Y Train input: (584,)
X Test input (147, 32)
Y Test input (147,)
DT_model = DecisionTreeRegressor(max_depth=5).fit(X_train,Y_train)
DT_predict = DT_model.predict(X_test) #Predictions on Testing data
print("X_test prediction model output", DT_predict.shape)
Output:
X_test prediction model output (147,)
import sklearn
from sklearn.metrics import explained_variance_score, mean_absolute_error,r2_score
print("R2 Score:", sklearn.metrics.r2_score(Y_test,DT_predict))
print("Variance:", sklearn.metrics.explained_variance_score(Y_test,DT_predict))
print("mean_absolute_error", sklearn.metrics.mean_absolute_error(Y_test, DT_predict))
Output:
R2 Score: 0.9675629781060744
Variance: 0.9684464747668272
mean_absolute_error 277.66481694154714
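An ensemble model usually improves on a single depth-limited tree. As a hedged sketch (the RandomForestRegressor below and its settings are assumptions, not taken from these materials), such a model could be fitted and scored on the same split as follows:

from sklearn.ensemble import RandomForestRegressor

# assumption: a random forest regressor trained on the same train/test split as above
RF_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, Y_train)
RF_predict = RF_model.predict(X_test)
print("R2 Score:", sklearn.metrics.r2_score(Y_test, RF_predict))
print("Variance:", sklearn.metrics.explained_variance_score(Y_test, RF_predict))
print("mean_absolute_error", sklearn.metrics.mean_absolute_error(Y_test, RF_predict))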
Output:
R2 Score: 0.9963391318723084
Variance: 0.9963911261654863
mean_absolute_error 75.14739229024944