Linear Regression Analysis Tutorial - Polynomial Regression

Creator: Muhammad Bilal Alam

What is Polynomial Regression?

Polynomial regression is a type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. The formula for a polynomial regression model of degree n can be written as:

    y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε

where:

* y is the dependent variable
* x is the independent variable
* β₀, β₁, β₂, ..., βₙ are the coefficients of the polynomial regression model
* ε is the error term or residual
* n is the degree of the polynomial

The goal of polynomial regression is to find the values of the coefficients β₀, β₁, β₂, ..., βₙ that minimize the sum of squared errors between the predicted values of y and the actual values of y.

The California Housing Dataset for Multiple Linear Regression

The California Housing Dataset contains information on the median income, housing age and other features for census tracts in California. The dataset was originally published by R. Kelley Pace and Ronald Barry in their 1997 paper "Sparse Spatial Autoregressions" and is available in the sklearn.datasets module.

The dataset consists of 20,640 instances, each representing a census tract in California. There are eight features in the dataset:

* MedInc: Median income in the census tract
* HouseAge: Median age of houses in the census tract
* AveRooms: Average number of rooms per dwelling in the census tract
* AveBedrms: Average number of bedrooms per dwelling in the census tract
* Population: Total number of people living in the census tract
* AveOccup: Average number of people per household in the census tract
* Latitude: Latitude of the center of the census tract
* Longitude: Longitude of the center of the census tract
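To make the degree-n expansion concrete, here is a minimal sketch (the variable names are illustrative, not from the tutorial) showing how scikit-learn's PolynomialFeatures turns a single column x into the terms [1, x, x², x³]; the "linear" regression is then linear in these expanded columns:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single feature with three sample values
x = np.array([[2.0], [3.0], [4.0]])

# Expand x into [1, x, x^2, x^3] (include_bias=True adds the column of ones)
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]
#  [ 1.  4. 16. 64.]]

Fitting ordinary least squares on these columns is exactly the polynomial model above, which is why the coefficients can still be found with LinearRegression.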
Step 1: Import the necessary libraries

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

Step 2: Load the dataset

# Load the California Housing Dataset from sklearn
california = fetch_california_housing()

# Convert the data to a pandas dataframe
california_df = pd.DataFrame(data=california.data, columns=california.feature_names)

# Add the target variable to the dataframe
california_df['MedHouseVal'] = california.target

# Print the first 5 rows of the dataframe
california_df.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422

Step 3: Do Data Preprocessing along with Exploratory Data Analysis

Step 3(a): Check Shape of Dataframe

Checking the shape of the dataframe tells us how many rows and columns we have in the dataset.

# Print the shape of the dataframe
print("Data shape:", california_df.shape)

Data shape: (20640, 9)

Step 3(b): Check Info of Dataframe

This is very useful to quickly get an overview of the structure and properties of a dataset, and to check for any missing or null values that may need to be addressed before performing any analysis or modeling.

california_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Step 3(c): Show Descriptive Statistics of each numerical column

Looking at descriptive statistics in machine learning is important because it gives an overview of the dataset's distribution and key characteristics. Some of the reasons why we should look at descriptive statistics include:

* Understanding the distribution of data: Descriptive statistics provide information about the central tendency and the spread of the data. This information is useful in determining the type of distribution and whether the data is skewed or symmetrical.
* Identifying outliers: Descriptive statistics help to identify any extreme values or outliers in the dataset. These outliers can have a significant impact on the analysis and should be investigated further.

california_df.describe().T

              count         mean          std          min          25%          50%          75%           max
MedInc      20640.0     3.870671     1.899822     0.499900     2.563400     3.534800     4.743250     15.000100
HouseAge    20640.0    28.639486    12.585558     1.000000    18.000000    29.000000    37.000000     52.000000
AveRooms    20640.0     5.429000     2.474173     0.846154     4.440716     5.229129     6.052381    141.909091
AveBedrms   20640.0     1.096675     0.473911     0.333333     1.006079     1.048780     1.099526     34.066667
Population  20640.0  1425.476744  1132.462122     3.000000   787.000000  1166.000000  1725.000000  35682.000000
AveOccup    20640.0     3.070655    10.386050     0.692308     2.429741     2.818116     3.282261   1243.333333
Latitude    20640.0    35.631861     2.135952    32.540000    33.930000    34.260000    37.710000     41.950000
Longitude   20640.0  -119.569704     2.003532  -124.350000  -121.800000  -118.490000  -118.010000   -114.310000
MedHouseVal 20640.0     2.068558     1.153956     0.149990     1.196000     1.797000     2.647250      5.000010

From the descriptive statistics, we can observe the following:

* Outliers: The 'AveRooms', 'AveBedrms', 'Population', and 'AveOccup' columns have high maximum values, indicating the presence of outliers in the data. These outliers may need to be treated or removed before model selection. We will create visuals to see them more clearly.
* Distribution: The 'MedInc', 'HouseAge', and 'MedHouseVal' columns appear to be normally distributed, as the mean and median values are close to each other and the standard deviation is not very high. The 'Latitude' column is skewed to the right, as its mean (35.63) is greater than its median (34.26). The 'Longitude' column is skewed to the left, as its mean (-119.57) is less than its median (-118.49).

Step 3(d): Check for missing values in the Dataframe

This is important because most machine learning algorithms cannot handle missing data and will throw an error if missing values are present. Therefore, it is necessary to check for missing values and impute or remove them before fitting the data into a machine learning model. This helps to ensure that the model is trained on complete and accurate data, which leads to better performance and more reliable predictions. Here we have no missing values, so let's move on.

# Check for missing values
print("Missing values:\n", california_df.isnull().sum())

Missing values:
 MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
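This dataset happens to be complete, but as a hedged illustration of the imputation step mentioned above, a minimal sketch using scikit-learn's SimpleImputer (the median strategy here is just one reasonable choice, not something the tutorial prescribes) might look like:

from sklearn.impute import SimpleImputer

# Replace any NaN in each column with that column's median;
# fit_transform returns a NumPy array with the same shape as the input
imputer = SimpleImputer(strategy='median')
california_imputed = pd.DataFrame(imputer.fit_transform(california_df),
                                  columns=california_df.columns)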
Step 3(e): Check for duplicate values in the Dataframe

Checking for duplicate values is important because duplicates can affect the accuracy of your model. Duplicate values can skew your data and lead to overfitting, where your model is too closely fit to the training data and does not generalize well to new data. We have no duplicate values, so that's good.

california_df.duplicated().sum()

0

Step 3(f)(i): Check for Outliers in the Dataframe

We should check for outliers because they can have a negative impact on machine learning algorithms: they can skew the results of the analysis. Outliers can significantly alter the mean, standard deviation, and other statistical measures, which can misrepresent the true characteristics of the data. Linear regression models are sensitive to outliers and can produce inaccurate results if the outliers are not properly handled or removed. Therefore, it is important to identify and handle outliers appropriately to ensure the accuracy and reliability of the models. In the plots below we can clearly see very high outliers on the right-hand side, so we need to deal with them appropriately.

# Draw a boxplot for each column flagged by the descriptive statistics, plus MedInc
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    ax = sns.boxplot(x=california_df[col])

    # Set the title and axis labels
    ax.set_title(f'Boxplot of {col}')
    ax.set_xlabel(col)

    # Add grid lines and remove top and right spines
    ax.grid(axis='y', linestyle='--', alpha=0.7)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # Show the plot
    plt.show()

[Figures: boxplots of AveRooms, AveBedrms, Population, AveOccup, and MedInc, each showing a long right tail of extreme values]
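As a rough numeric complement to the boxplots, here is a minimal sketch of mine (not part of the original notebook) that counts how many points fall outside the standard 1.5×IQR whiskers in each column:

# Count values outside the 1.5*IQR boxplot whiskers for each feature
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    q1, q3 = california_df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (california_df[col] < q1 - 1.5 * iqr) | (california_df[col] > q3 + 1.5 * iqr)
    print(f'{col}: {mask.sum()} outliers')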
Step 3(f)(ii): Deal with Outliers in the Dataframe using Winsorization

This method involves replacing extreme values with the nearest values that lie within a chosen percentile range. Here we replace values above the 95th percentile with the value at the 95th percentile, and values below the 10th percentile with the value at the 10th percentile. From the visuals below we can clearly see that the data is much more normally distributed afterwards.

# Define the percentile limits for winsorization
pct_lower = 0.1
pct_upper = 0.95

# Apply winsorization to the five columns
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    california_df[col] = np.clip(california_df[col],
                                 california_df[col].quantile(pct_lower),
                                 california_df[col].quantile(pct_upper))
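SciPy ships a dedicated helper for the same idea; as a hedged alternative sketch (my addition, not the notebook's code), scipy.stats.mstats.winsorize caps the given fractions of each tail, mirroring the 10%/5% choice above:

from scipy.stats.mstats import winsorize

# Clip the lowest 10% and highest 5% of values; winsorize returns a masked
# array, so convert back to a plain Series aligned with the dataframe index
capped = pd.Series(np.asarray(winsorize(california_df['AveRooms'],
                                        limits=[0.10, 0.05])),
                   index=california_df.index)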
# Re-draw the boxplots to verify the effect of winsorization
for col in ['AveRooms', 'AveBedrms', 'Population', 'MedInc']:
    ax = sns.boxplot(x=california_df[col])

    # Set the title and axis labels
    ax.set_title(f'Boxplot of {col}')
    ax.set_xlabel(col)

    # Add grid lines and remove top and right spines
    ax.grid(axis='y', linestyle='--', alpha=0.7)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # Show the plot
    plt.show()

[Figures: boxplots of AveRooms, AveBedrms, Population, and MedInc after winsorization, with the extreme right tails removed]
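A quick numeric check of my own (not in the original notebook) that the caps took effect: after clipping, each column's maximum should equal its 95th-percentile cap.

# Sanity check: post-clipping, max of each column equals the upper cap
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    print(f"{col}: max = {california_df[col].max():.3f}, "
          f"95th percentile = {california_df[col].quantile(0.95):.3f}")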
Step 3(g): Check for Skewness using a Histogram

Skewed data can result in biased estimates of model parameters and reduce the accuracy of predictions. Therefore, it is important to assess the distribution of the features and the target variable to identify any potential issues and take appropriate measures to address them. Here almost all the features and the target look normally distributed. There is some skewness in MedHouseVal, but not enough to warrant a transformation.

Note: For learning purposes I have shown how to transform MedHouseVal for skewness in my previous tutorial on Simple Linear Regression. Feel free to check that out.

# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)

# Create histogram
sns.histplot(data=california_df, x='MedHouseVal', kde=True, bins=50, color='#7fc97f')

# Set x and y axis labels and title
plt.xlabel('Median House Value', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Median House Value in California', fontsize=20)

# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add vertical line for mean
mean = california_df['MedHouseVal'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)

# Show the plot
plt.show()

[Figure: Distribution of Median House Value in California, mean 2.07]

The same plot is repeated for each of the remaining columns:

# Histograms of the remaining features, each with its mean marked
for col in ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
            'Population', 'AveOccup', 'Latitude', 'Longitude']:
    plt.figure(figsize=(8, 6))
    sns.histplot(data=california_df, x=col, kde=True, bins=50)
    plt.xlabel(col, fontsize=16)
    plt.ylabel('Frequency', fontsize=16)
    plt.title(f'Distribution of {col} in California', fontsize=20)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    mean = california_df[col].mean()
    plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
    plt.legend(fontsize=14)
    plt.show()

[Figures: distributions of MedInc (mean 3.77), HouseAge (mean 28.64), AveRooms (mean 5.28), AveBedrms (mean 1.06), Population (mean 1345.78), AveOccup (mean 2.89), Latitude (mean 35.63), and Longitude (mean -119.57)]
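To back the visual judgment with a number, here is a hedged sketch of mine (the previous-tutorial transformation referenced above is not reproduced here) that computes each column's skewness and shows the log1p transform one might apply if MedHouseVal's skew were deemed too high:

# Skewness of every column: ~0 is symmetric; beyond about +/-1 is highly skewed
print(california_df.skew())

# If the target's skew were judged too high, a common fix is log1p, which
# compresses the right tail (np.expm1 inverts it when reading predictions back)
medhouseval_logged = np.log1p(california_df['MedHouseVal'])
print('skew before:', california_df['MedHouseVal'].skew(),
      'after:', medhouseval_logged.skew())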
Step 3(h): Create a Vertical Correlation Heatmap

The correlation matrix shows the correlation coefficients between every pair of variables in the dataset. A correlation coefficient ranges from -1 to 1 and measures the strength and direction of the linear relationship between two variables. A coefficient of -1 indicates a perfect negative correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect positive correlation.

# Calculate the correlation matrix
corr_matrix = california_df.corr()

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(6, 12))

# Create a single-column heatmap of each feature's correlation with the target
sns.heatmap(corr_matrix[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False),
            cmap='BrBG', annot=True, fmt='.2f', linewidths=.5, ax=ax)

# Set the title and axis labels
ax.set_title('Correlation Heatmap for California Housing Dataset', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('Features', fontsize=14)

# Rotate the x-axis labels for readability
plt.xticks(rotation=0, ha='right')

# Show the plot
plt.show()

[Figure: single-column correlation heatmap against MedHouseVal, sorted descending: MedInc shows the strongest positive correlation, followed by AveRooms and HouseAge, with Population, AveBedrms, and AveOccup near the bottom]

Step 3(i): Perform Feature Scaling

Feature scaling is the process of transforming the numerical features in a dataset to have similar scales or ranges of values. The purpose of feature scaling is to ensure that all features have the same level of impact on the model and to prevent certain features from dominating the model simply because they have larger values. In linear regression, feature scaling is particularly important because the coefficients of the model represent the change in the dependent variable associated with a one-unit change in the independent variable. Scaling the features to have similar ranges can result in a more accurate and reliable model that more faithfully represents the relationships between the independent variables and the dependent variable.

scaler = StandardScaler()
california_df_scaled = scaler.fit_transform(california_df)
california_df_scaled = pd.DataFrame(california_df_scaled, columns=california_df.columns)
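Under the hood, StandardScaler simply subtracts each column's mean and divides by its standard deviation. A minimal sketch of the equivalent pandas computation (my addition, with the leakage-avoiding variant the tutorial does not use noted in the comments):

# Equivalent by hand: z = (x - mean) / std for every column; StandardScaler
# uses the population standard deviation, hence ddof=0
manual_scaled = (california_df - california_df.mean()) / california_df.std(ddof=0)
print(manual_scaled.head())

# To avoid test-set leakage one would normally fit the scaler on the training
# rows only, then reuse those means/stds on the test rows:
#   scaler.fit(X_train)
#   X_train_s = scaler.transform(X_train)
#   X_test_s = scaler.transform(X_test)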
Step 3(j): Check for Assumptions using Scatter Plots

From the scatter plots, we can observe that there is a linear relationship between the dependent variable (Median House Value) and some of the independent variables, like Median Income and Average Rooms. However, other independent variables, like Longitude, Latitude, and Housing Median Age, do not have a clear linear relationship with the dependent variable. This suggests that a plain linear regression model might not be the best fit for predicting the Median House Value based on these variables.

# Create scatter plots of each feature against the target
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(30, 15))

features = [('Latitude', 'Latitude'), ('Longitude', 'Longitude'),
            ('HouseAge', 'Housing Median Age'), ('AveRooms', 'Average Rooms'),
            ('AveBedrms', 'Average Bedrooms'), ('Population', 'Population'),
            ('AveOccup', 'Average Occupancy'), ('MedInc', 'Median Income')]

for ax, (col, label) in zip(axs.flat, features):
    ax.scatter(california_df_scaled[col], california_df_scaled['MedHouseVal'])
    ax.set_xlabel(label)
    ax.set_ylabel('Median House Value')
    ax.set_title(f'{label} vs Median House Value')

plt.show()

[Figure: 2x4 grid of scatter plots of each scaled feature against Median House Value]

Step 4: Define Dependent and Independent Variables

X = california_df_scaled.drop(['MedHouseVal'], axis=1)
y = california_df_scaled['MedHouseVal']

Step 5: Do a Train-Test Split in the Ratio 70-30

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 6: Search for the Best Polynomial Degree Based on R^2

# Initialize variables for storing the best model and score
best_score = 0
best_degree = 1

# Loop through polynomial degrees from 1 to 10
for degree in range(1, 11):
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_poly_train = poly_features.fit_transform(X_train)
    X_poly_test = poly_features.transform(X_test)

    # Fit a linear regression model to the polynomial features
    poly_reg = LinearRegression()
    poly_reg.fit(X_poly_train, y_train)

    # Evaluate the model on the test set
    y_pred = poly_reg.predict(X_poly_test)
    score = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    # Check if this is the best model so far
    if score > best_score:
        best_score = score
        best_mse = mse
        best_degree = degree
        best_model = poly_reg
        best_features = poly_features

# Print the best degree and score
print("Best degree:", best_degree)
print("R^2 score:", best_score)
print("Mean Squared Error:", best_mse)

Best degree: 4
R^2 score: 0.7564759864738175
Mean Squared Error: 0.2438324527757844
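GridSearchCV is imported in Step 1 but never used. As a hedged alternative sketch (mine, not the notebook's code), the same degree search can be phrased as a cross-validated grid search over a Pipeline, which chooses the degree on validation folds instead of on the held-out test set:

from sklearn.pipeline import Pipeline

# Pipeline: expand the features, then fit OLS; grid-search the degree with 5-fold CV
pipe = Pipeline([('poly', PolynomialFeatures()),
                 ('reg', LinearRegression())])
grid = GridSearchCV(pipe,
                    param_grid={'poly__degree': range(1, 5)},
                    scoring='r2', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

Keeping the test set out of the degree selection gives a less optimistic estimate of generalization than picking the degree that maximizes test-set R^2.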
# Recompute predictions from the best model and plot actual vs predicted values
y_pred = best_model.predict(best_features.transform(X_test))

plt.figure(figsize=(12, 6))
plt.scatter(y_test, y_pred, color='blue')

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Polynomial Regression Results (Degree = {best_degree}, R^2 = {best_score:.2f})')

# Add a diagonal line for reference
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')

# Display the plot
plt.show()

[Figure: scatter of predicted vs actual values with a diagonal reference line, titled 'Polynomial Regression Results (Degree = 4, R^2 = 0.76)']

Conclusion

In this polynomial regression tutorial, we explored how to use polynomial regression to model a non-linear relationship between the features and the target, specifically the median house value in California. We started by loading and cleaning the dataset, and then visualized the relationships between the variables using scatter plots. Next, we split the data into training and testing sets and used polynomial regression to fit a model to the training data. We gradually increased the degree of the polynomial until we achieved a good balance between bias and variance. We then used the model to make predictions on the test data and evaluated its performance using the mean squared error and the R-squared score.