Linear Regression Analysis Tutorial - Polynomial Regression

Creator: Muhammad Bilal Alam

What is Polynomial Regression?

Polynomial regression is a type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. The formula for a polynomial regression model of degree n can be written as:

    y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε

where:

* y is the dependent variable
* x is the independent variable
* β₀, β₁, β₂, ..., βₙ are the coefficients of the polynomial regression model
* ε is the error term or residual
* n is the degree of the polynomial

The goal of polynomial regression is to find the values of the coefficients β₀, β₁, β₂, ..., βₙ that minimize the sum of squared errors between the predicted values of y and the actual values of y.

The California Housing Dataset for Multiple Linear Regression

The California Housing Dataset contains information on the median income, housing age and other features for census tracts in California. The dataset was originally published by R. Kelley Pace and Ronald Barry in their 1997 paper "Sparse Spatial Autoregressions" and is available in the sklearn.datasets module.

The dataset consists of 20,640 instances, each representing a census tract in California. There are eight features in the dataset:

* MedInc: Median income in the census tract
* HouseAge: Median age of houses in the census tract
* AveRooms: Average number of rooms per dwelling in the census tract
* AveBedrms: Average number of bedrooms per dwelling in the census tract
* Population: Total number of people living in the census tract
* AveOccup: Average number of people per household in the census tract
* Latitude: Latitude of the center of the census tract
* Longitude: Longitude of the center of the census tract
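To make the degree-n expansion concrete, here is a minimal sketch (the variable names are illustrative, not from the tutorial) showing how scikit-learn's PolynomialFeatures turns a single column x into the terms [1, x, x², x³]; the "linear" regression is then linear in these expanded columns:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single feature with three sample values
x = np.array([[2.0], [3.0], [4.0]])

# Expand x into [1, x, x^2, x^3] (include_bias=True adds the column of ones)
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]
#  [ 1.  4. 16. 64.]]

Fitting ordinary least squares on these columns is exactly the polynomial model above, which is why the coefficients can still be found with LinearRegression.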
Step 1: Import the necessary libraries

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

Step 2: Load the dataset

# Load the California Housing Dataset from sklearn
california = fetch_california_housing()

# Convert the data to a pandas dataframe
california_df = pd.DataFrame(data=california.data, columns=california.feature_names)

# Add the target variable to the dataframe
california_df['MedHouseVal'] = california.target

# Print the first 5 rows of the dataframe
california_df.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422

Step 3: Do Data Preprocessing along with Exploratory Data Analysis

Step 3(a): Check Shape of Dataframe

Checking the shape of the dataframe tells us how many rows and columns we have in the dataset.

# Print the shape of the dataframe
print("Data shape:", california_df.shape)

Data shape: (20640, 9)

Step 3(b): Check Info of Dataframe

This is very useful to quickly get an overview of the structure and properties of a dataset, and to check for any missing or null values that may need to be addressed before performing any analysis or modeling.

california_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Step 3(c): Show Descriptive Statistics of each numerical column

Looking at descriptive statistics in machine learning is important because it gives an overview of the dataset's distribution and key characteristics. Some of the reasons why we should look at descriptive statistics include:

* Understanding the distribution of data: Descriptive statistics provide information about the central tendency and the spread of the data. This information is useful in determining the type of distribution and whether the data is skewed or symmetrical.
* Identifying outliers: Descriptive statistics help to identify any extreme values or outliers in the dataset. These outliers can have a significant impact on the analysis and should be investigated further.

california_df.describe().T

              count         mean          std          min          25%          50%          75%           max
MedInc      20640.0     3.870671     1.899822     0.499900     2.563400     3.534800     4.743250     15.000100
HouseAge    20640.0    28.639486    12.585558     1.000000    18.000000    29.000000    37.000000     52.000000
AveRooms    20640.0     5.429000     2.474173     0.846154     4.440716     5.229129     6.052381    141.909091
AveBedrms   20640.0     1.096675     0.473911     0.333333     1.006079     1.048780     1.099526     34.066667
Population  20640.0  1425.476744  1132.462122     3.000000   787.000000  1166.000000  1725.000000  35682.000000
AveOccup    20640.0     3.070655    10.386050     0.692308     2.429741     2.818116     3.282261   1243.333333
Latitude    20640.0    35.631861     2.135952    32.540000    33.930000    34.260000    37.710000     41.950000
Longitude   20640.0  -119.569704     2.003532  -124.350000  -121.800000  -118.490000  -118.010000   -114.310000
MedHouseVal 20640.0     2.068558     1.153956     0.149990     1.196000     1.797000     2.647250      5.000010

From the descriptive statistics, we can observe the following:

* Outliers: The 'AveRooms', 'AveBedrms', 'Population', and 'AveOccup' columns have high maximum values, indicating the presence of outliers in the data. These outliers may need to be treated or removed before model selection. We will create visuals to see them more clearly.
* Distribution: The 'MedInc', 'HouseAge', and 'MedHouseVal' columns appear to be normally distributed, as the mean and median values are close to each other and the standard deviation is not very high. The 'Latitude' column is skewed to the right, as its mean (35.63) is greater than its median (34.26). The 'Longitude' column is skewed to the left, as its mean (-119.57) is less than its median (-118.49).

Step 3(d): Check for missing values in the Dataframe

This is important because most machine learning algorithms cannot handle missing data and will throw an error if missing values are present. Therefore, it is necessary to check for missing values and impute or remove them before fitting the data into a machine learning model. This helps to ensure that the model is trained on complete and accurate data, which leads to better performance and more reliable predictions. Here we have no missing values, so let's move on.

# Check for missing values
print("Missing values:\n", california_df.isnull().sum())

Missing values:
 MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
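This dataset happens to be complete, but as a hedged illustration of the imputation step mentioned above, a minimal sketch using scikit-learn's SimpleImputer (the median strategy here is just one reasonable choice, not something the tutorial prescribes) might look like:

from sklearn.impute import SimpleImputer

# Replace any NaN in each column with that column's median;
# fit_transform returns a NumPy array with the same shape as the input
imputer = SimpleImputer(strategy='median')
california_imputed = pd.DataFrame(imputer.fit_transform(california_df),
                                  columns=california_df.columns)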
Step 3(e): Check for duplicate values in the Dataframe

Checking for duplicate values is important because duplicates can affect the accuracy of your model. Duplicate values can skew your data and lead to overfitting, where your model is too closely fit to the training data and does not generalize well to new data. We have no duplicate values, so that's good.

california_df.duplicated().sum()

0

Step 3(f)(i): Check for Outliers in the Dataframe

We should check for outliers because they can have a negative impact on machine learning algorithms: they can skew the results of the analysis. Outliers can significantly alter the mean, standard deviation, and other statistical measures, which can misrepresent the true characteristics of the data. Linear regression models are sensitive to outliers and can produce inaccurate results if the outliers are not properly handled or removed. Therefore, it is important to identify and handle outliers appropriately to ensure the accuracy and reliability of the models. In the plots below we can clearly see very high outliers on the right-hand side, so we need to deal with them appropriately.

# Draw a boxplot for each column flagged by the descriptive statistics, plus MedInc
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    ax = sns.boxplot(x=california_df[col])

    # Set the title and axis labels
    ax.set_title(f'Boxplot of {col}')
    ax.set_xlabel(col)

    # Add grid lines and remove top and right spines
    ax.grid(axis='y', linestyle='--', alpha=0.7)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # Show the plot
    plt.show()

[Figures: boxplots of AveRooms, AveBedrms, Population, AveOccup, and MedInc, each showing a long right tail of extreme values]
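As a rough numeric complement to the boxplots, here is a minimal sketch of mine (not part of the original notebook) that counts how many points fall outside the standard 1.5×IQR whiskers in each column:

# Count values outside the 1.5*IQR boxplot whiskers for each feature
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    q1, q3 = california_df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (california_df[col] < q1 - 1.5 * iqr) | (california_df[col] > q3 + 1.5 * iqr)
    print(f'{col}: {mask.sum()} outliers')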
Step 3(f)(ii): Deal with Outliers in the Dataframe using Winsorization

This method involves replacing extreme values with the nearest values that lie within a chosen percentile range. Here we replace values above the 95th percentile with the value at the 95th percentile, and values below the 10th percentile with the value at the 10th percentile. From the visuals below we can clearly see that the data is much more normally distributed afterwards.

# Define the percentile limits for winsorization
pct_lower = 0.1
pct_upper = 0.95

# Apply winsorization to the five columns
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    california_df[col] = np.clip(california_df[col],
                                 california_df[col].quantile(pct_lower),
                                 california_df[col].quantile(pct_upper))
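SciPy ships a dedicated helper for the same idea; as a hedged alternative sketch (my addition, not the notebook's code), scipy.stats.mstats.winsorize caps the given fractions of each tail, mirroring the 10%/5% choice above:

from scipy.stats.mstats import winsorize

# Clip the lowest 10% and highest 5% of values; winsorize returns a masked
# array, so convert back to a plain Series aligned with the dataframe index
capped = pd.Series(np.asarray(winsorize(california_df['AveRooms'],
                                        limits=[0.10, 0.05])),
                   index=california_df.index)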
# Re-draw the boxplots to verify the effect of winsorization
for col in ['AveRooms', 'AveBedrms', 'Population', 'MedInc']:
    ax = sns.boxplot(x=california_df[col])

    # Set the title and axis labels
    ax.set_title(f'Boxplot of {col}')
    ax.set_xlabel(col)

    # Add grid lines and remove top and right spines
    ax.grid(axis='y', linestyle='--', alpha=0.7)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # Show the plot
    plt.show()

[Figures: boxplots of AveRooms, AveBedrms, Population, and MedInc after winsorization, with the extreme right tails removed]
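A quick numeric check of my own (not in the original notebook) that the caps took effect: after clipping, each column's maximum should equal its 95th-percentile cap.

# Sanity check: post-clipping, max of each column equals the upper cap
for col in ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'MedInc']:
    print(f"{col}: max = {california_df[col].max():.3f}, "
          f"95th percentile = {california_df[col].quantile(0.95):.3f}")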
Step 3(g): Check for Skewness using a Histogram

Skewed data can result in biased estimates of model parameters and reduce the accuracy of predictions. Therefore, it is important to assess the distribution of the features and the target variable to identify any potential issues and take appropriate measures to address them. Here almost all the features and the target look normally distributed. There is some skewness in MedHouseVal, but not enough to warrant a transformation.

Note: For learning purposes I have shown how to transform MedHouseVal for skewness in my previous tutorial on Simple Linear Regression. Feel free to check that out.

# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)

# Create histogram
sns.histplot(data=california_df, x='MedHouseVal', kde=True, bins=50, color='#7fc97f')

# Set x and y axis labels and title
plt.xlabel('Median House Value', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Median House Value in California', fontsize=20)

# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add vertical line for mean
mean = california_df['MedHouseVal'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)

# Show the plot
plt.show()

[Figure: Distribution of Median House Value in California, mean 2.07]

The same plot is repeated for each of the remaining columns:

# Histograms of the remaining features, each with its mean marked
for col in ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
            'Population', 'AveOccup', 'Latitude', 'Longitude']:
    plt.figure(figsize=(8, 6))
    sns.histplot(data=california_df, x=col, kde=True, bins=50)
    plt.xlabel(col, fontsize=16)
    plt.ylabel('Frequency', fontsize=16)
    plt.title(f'Distribution of {col} in California', fontsize=20)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    mean = california_df[col].mean()
    plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
    plt.legend(fontsize=14)
    plt.show()

[Figures: distributions of MedInc (mean 3.77), HouseAge (mean 28.64), AveRooms (mean 5.28), AveBedrms (mean 1.06), Population (mean 1345.78), AveOccup (mean 2.89), Latitude (mean 35.63), and Longitude (mean -119.57)]
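To back the visual judgment with a number, here is a hedged sketch of mine (the previous-tutorial transformation referenced above is not reproduced here) that computes each column's skewness and shows the log1p transform one might apply if MedHouseVal's skew were deemed too high:

# Skewness of every column: ~0 is symmetric; beyond about +/-1 is highly skewed
print(california_df.skew())

# If the target's skew were judged too high, a common fix is log1p, which
# compresses the right tail (np.expm1 inverts it when reading predictions back)
medhouseval_logged = np.log1p(california_df['MedHouseVal'])
print('skew before:', california_df['MedHouseVal'].skew(),
      'after:', medhouseval_logged.skew())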
Step 3(h): Create a Vertical Correlation Heatmap

The correlation matrix shows the correlation coefficients between every pair of variables in the dataset. A correlation coefficient ranges from -1 to 1 and measures the strength and direction of the linear relationship between two variables. A coefficient of -1 indicates a perfect negative correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect positive correlation.

# Calculate the correlation matrix
corr_matrix = california_df.corr()

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(6, 12))

# Create a single-column heatmap of each feature's correlation with the target
sns.heatmap(corr_matrix[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False),
            cmap='BrBG', annot=True, fmt='.2f', linewidths=.5, ax=ax)

# Set the title and axis labels
ax.set_title('Correlation Heatmap for California Housing Dataset', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('Features', fontsize=14)

# Rotate the x-axis labels for readability
plt.xticks(rotation=0, ha='right')

# Show the plot
plt.show()

[Figure: single-column correlation heatmap against MedHouseVal, sorted descending: MedInc shows the strongest positive correlation, followed by AveRooms and HouseAge, with Population, AveBedrms, and AveOccup near the bottom]

Step 3(i): Perform Feature Scaling

Feature scaling is the process of transforming the numerical features in a dataset to have similar scales or ranges of values. The purpose of feature scaling is to ensure that all features have the same level of impact on the model and to prevent certain features from dominating the model simply because they have larger values. In linear regression, feature scaling is particularly important because the coefficients of the model represent the change in the dependent variable associated with a one-unit change in the independent variable. Scaling the features to have similar ranges can result in a more accurate and reliable model that more faithfully represents the relationships between the independent variables and the dependent variable.

scaler = StandardScaler()
california_df_scaled = scaler.fit_transform(california_df)
california_df_scaled = pd.DataFrame(california_df_scaled, columns=california_df.columns)
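Under the hood, StandardScaler simply subtracts each column's mean and divides by its standard deviation. A minimal sketch of the equivalent pandas computation (my addition, with the leakage-avoiding variant the tutorial does not use noted in the comments):

# Equivalent by hand: z = (x - mean) / std for every column; StandardScaler
# uses the population standard deviation, hence ddof=0
manual_scaled = (california_df - california_df.mean()) / california_df.std(ddof=0)
print(manual_scaled.head())

# To avoid test-set leakage one would normally fit the scaler on the training
# rows only, then reuse those means/stds on the test rows:
#   scaler.fit(X_train)
#   X_train_s = scaler.transform(X_train)
#   X_test_s = scaler.transform(X_test)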
Step 3(j): Check for Assumptions using Scatter Plots

From the scatter plots, we can observe that there is a linear relationship between the dependent variable (Median House Value) and some of the independent variables, like Median Income and Average Rooms. However, other independent variables, like Longitude, Latitude, and Housing Median Age, do not have a clear linear relationship with the dependent variable. This suggests that a plain linear regression model might not be the best fit for predicting the Median House Value based on these variables.

# Create scatter plots of each feature against the target
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(30, 15))

features = [('Latitude', 'Latitude'), ('Longitude', 'Longitude'),
            ('HouseAge', 'Housing Median Age'), ('AveRooms', 'Average Rooms'),
            ('AveBedrms', 'Average Bedrooms'), ('Population', 'Population'),
            ('AveOccup', 'Average Occupancy'), ('MedInc', 'Median Income')]

for ax, (col, label) in zip(axs.flat, features):
    ax.scatter(california_df_scaled[col], california_df_scaled['MedHouseVal'])
    ax.set_xlabel(label)
    ax.set_ylabel('Median House Value')
    ax.set_title(f'{label} vs Median House Value')

plt.show()

[Figure: 2x4 grid of scatter plots of each scaled feature against Median House Value]

Step 4: Define Dependent and Independent Variables

X = california_df_scaled.drop(['MedHouseVal'], axis=1)
y = california_df_scaled['MedHouseVal']

Step 5: Do a Train-Test Split in the Ratio 70-30

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 6: Search for the Best Polynomial Degree Based on R^2

# Initialize variables for storing the best model and score
best_score = 0
best_degree = 1

# Loop through polynomial degrees from 1 to 10
for degree in range(1, 11):
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_poly_train = poly_features.fit_transform(X_train)
    X_poly_test = poly_features.transform(X_test)

    # Fit a linear regression model to the polynomial features
    poly_reg = LinearRegression()
    poly_reg.fit(X_poly_train, y_train)

    # Evaluate the model on the test set
    y_pred = poly_reg.predict(X_poly_test)
    score = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    # Check if this is the best model so far
    if score > best_score:
        best_score = score
        best_mse = mse
        best_degree = degree
        best_model = poly_reg
        best_features = poly_features

# Print the best degree and score
print("Best degree:", best_degree)
print("R^2 score:", best_score)
print("Mean Squared Error:", best_mse)

Best degree: 4
R^2 score: 0.7564759864738175
Mean Squared Error: 0.2438324527757844
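GridSearchCV is imported in Step 1 but never used. As a hedged alternative sketch (mine, not the notebook's code), the same degree search can be phrased as a cross-validated grid search over a Pipeline, which chooses the degree on validation folds instead of on the held-out test set:

from sklearn.pipeline import Pipeline

# Pipeline: expand the features, then fit OLS; grid-search the degree with 5-fold CV
pipe = Pipeline([('poly', PolynomialFeatures()),
                 ('reg', LinearRegression())])
grid = GridSearchCV(pipe,
                    param_grid={'poly__degree': range(1, 5)},
                    scoring='r2', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

Keeping the test set out of the degree selection gives a less optimistic estimate of generalization than picking the degree that maximizes test-set R^2.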
# Recompute predictions from the best model and plot actual vs predicted values
y_pred = best_model.predict(best_features.transform(X_test))

plt.figure(figsize=(12, 6))
plt.scatter(y_test, y_pred, color='blue')

# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Polynomial Regression Results (Degree = {best_degree}, R^2 = {best_score:.2f})')

# Add a diagonal line for reference
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')

# Display the plot
plt.show()

[Figure: scatter of predicted vs actual values with a diagonal reference line, titled 'Polynomial Regression Results (Degree = 4, R^2 = 0.76)']

Conclusion

In this polynomial regression tutorial, we explored how to use polynomial regression to model a non-linear relationship between the features and the target, specifically the median house value in California. We started by loading and cleaning the dataset, and then visualized the relationships between the variables using scatter plots. Next, we split the data into training and testing sets and used polynomial regression to fit a model to the training data. We gradually increased the degree of the polynomial until we achieved a good balance between bias and variance. We then used the model to make predictions on the test data and evaluated its performance using the mean squared error and the R-squared score.