How to Calculate R^2 with Scikit-Learn
Last Updated: 05 Aug, 2024
The coefficient of determination, denoted as R², is an essential metric in regression analysis. It indicates the extent to which the independent variables account for the variation in the dependent variable.
In this article, we will walk you through calculating R² using Scikit-Learn, a powerful Python library for machine learning.
What is R²?
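R² quantifies the proportion of variance in the dependent variable that can be predicted from the independent variables. A value of 1 indicates that the predictions explain all of the variability, and a value of 0 indicates that they do no better than always predicting the mean of the actual values. For an ordinary least squares fit with an intercept, evaluated on its own training data, R² lies between 0 and 1. In general, though, R² as computed by Scikit-Learn's r2_score can be negative when the predictions are worse than that mean baseline.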
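As a small illustration of the endpoints of that range, the sketch below scores two hand-made predictions with r2_score. The arrays y_true, y_mean_pred, and y_bad_pred are made-up values chosen for this example only; they are not part of the article's dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets, chosen only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Predicting the mean of y_true for every sample gives R² = 0
y_mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, y_mean_pred)

# Predictions worse than the mean baseline give a negative R²
y_bad_pred = y_true[::-1]  # reversed targets: a deliberately poor prediction
r2_bad = r2_score(y_true, y_bad_pred)

print(f"R² of mean predictor: {r2_mean}")     # 0.0
print(f"R² of reversed predictor: {r2_bad}")  # -3.0
```

Here SS_tot is 10 and the reversed prediction has SS_res of 40, so R² = 1 - 40/10 = -3.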
Mathematically, R² is expressed as:
R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}}
Here:
- SS_{res} is the residual sum of squares: the sum of the squared differences between the actual and predicted values.
- SS_{tot} is the total sum of squares: the sum of the squared differences between the actual values and their mean.
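The formula can be checked by hand with NumPy. The following is a minimal sketch that computes SS_res and SS_tot directly and compares the result with r2_score; the four-element arrays are illustrative values, not data from this article.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: squared gaps between actual and predicted values
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared gaps between actual values and their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(f"Manual R²:       {r2_manual}")
print(f"Scikit-Learn R²: {r2_sklearn}")
```

Both values should agree up to floating-point rounding, since r2_score uses the same definition by default.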
Calculating R² with Scikit-Learn for Sample Data
Let's go through an example that calculates R² on sample data generated from a simple linear relationship with added noise.
Step 1: Import Necessary Libraries
import numpy as np
from sklearn.metrics import r2_score
Step 2: Generate Sample Data
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Use the true, noise-free line as the prediction (for demonstration only; no model is fitted)
y_pred = 4 + 3 * X
Step 3: Compute R² Using Scikit-Learn
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
Complete Code
Python
import numpy as np
from sklearn.metrics import r2_score
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Use the true, noise-free line as the prediction (for demonstration only; no model is fitted)
y_pred = 4 + 3 * X
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
Output:
R² (Scikit-Learn Calculation): 0.7639751938835576
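In practice, the predictions usually come from a fitted model rather than from a known formula. The sketch below is one possible variation on the example above: it fits a LinearRegression to the same generated data and computes R² in two ways, with r2_score and with the estimator's score method, which returns R² for regressors. The variable names r2_from_metric and r2_from_score are chosen for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same synthetic data as in the example above
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

# Fit a linear model and predict on the training data
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Two equivalent ways to obtain R² for these predictions
r2_from_metric = r2_score(y, y_pred)
r2_from_score = model.score(X, y)

print(f"R² via r2_score:    {r2_from_metric}")
print(f"R² via model.score: {r2_from_score}")
```

Because least squares minimizes SS_res on the training data, the fitted model's in-sample R² is at least as high as the value obtained from the noise-free line.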
Calculating R² for a Simple Polynomial Regression Problem Using Scikit-Learn
Polynomial regression is a type of regression analysis in which the relationship between the independent variable X and the dependent variable y is modeled as an n-th degree polynomial. We will compute the R² value for a polynomial regression model using Python.
Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
Step 2: Generate Sample Data
We'll create a simple nonlinear dataset:
# Generate random data
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)
Step 3: Prepare Polynomial Features
Transform the input data to include polynomial features up to the desired degree (e.g., degree 2):
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
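To see what this transformation produces, here is a tiny sketch on a hand-made input. The array X_demo is a made-up example; with degree=2 and include_bias=False, each row [x] becomes [x, x²].

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical two-sample input with a single feature
X_demo = np.array([[2.0], [3.0]])

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
X_demo_poly = poly_demo.fit_transform(X_demo)

print(X_demo_poly)
# [[2. 4.]
#  [3. 9.]]
```

The linear model fitted in the next step then learns one coefficient for x and one for x², plus an intercept.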
Step 4: Fit the Polynomial Regression Model
Fit a linear regression model to the polynomial features:
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
Step 5: Calculate R² Using Scikit-Learn
Compute R² for the fitted model with Scikit-Learn's r2_score function, passing the actual values first and the predicted values second:
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
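The R² above is measured on the same data the model was trained on. As a hedged sketch of a common alternative, the code below holds out part of the data and reports R² on both splits. The 80/20 split, the random_state value, and the use of make_pipeline are choices made for this illustration, not part of the original walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as in the steps above
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = (0.5 * X**2 + X + 2 + np.random.randn(100, 1)).ravel()

# Hold out 20% of the samples for evaluation (an assumed split size)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Chain the polynomial expansion and the linear fit into one estimator
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X_train, y_train)

r2_train = r2_score(y_train, poly_model.predict(X_train))
r2_test = r2_score(y_test, poly_model.predict(X_test))

print(f"R² on training data: {r2_train}")
print(f"R² on held-out data: {r2_test}")
```

A held-out R² that is much lower than the training R² can be a sign of overfitting, which an in-sample score alone would not reveal.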
Visualizing the Results
It's often helpful to visualize the polynomial regression curve along with the data points:
plt.scatter(X, y, color='blue', label='Actual')
# Sort the values for better plotting
sorted_indices = X.flatten().argsort()
plt.plot(X[sorted_indices], y_pred[sorted_indices], color='red', linewidth=2, label='Predicted')
plt.title('Actual vs Predicted (Polynomial Regression)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Complete Code
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Generate random data
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
plt.scatter(X, y, color='blue', label='Actual')
# Sort the values for better plotting
sorted_indices = X.flatten().argsort()
plt.plot(X[sorted_indices], y_pred[sorted_indices], color='red', linewidth=2, label='Predicted')
plt.title('Actual vs Predicted (Polynomial Regression)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Output:
R² (Scikit-Learn Calculation): 0.8525067519009746
Polynomial Regression Curve
Conclusion
Calculating R² with Scikit-Learn is straightforward and provides valuable insight into your model's performance. The r2_score function needs only the actual values and the predicted values, so it works the same way whether the predictions come from a known formula, as in the first example, or from a fitted model, as in the polynomial regression example. This makes it a convenient way to check the goodness of fit of any set of predictions against actual data.