
Cia Code

The document discusses preprocessing a real estate price prediction dataset. It describes dropping irrelevant columns, handling missing values by imputing with the median, detecting outliers using z-scores and boxplots, and removing outliers from the dataset.

cia-code

April 23, 2024

Importing data
[32]: import pandas as pd
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')
df

[32]: address bathrooms bedrooms finishedsqft \


0 2243 Franklin St 2.0 2 1463.0
1 2002 Pacific Ave APT 4 3.5 3 3291.0
2 1945 Washington St APT 411 1.0 1 653.0
3 1896 Pacific Ave APT 802 2.5 2 2272.0
4 1840 Washington St APT 603 1.0 1 837.0
.. … … … …
434 2170 Vallejo St APT 101 2.0 3 2145.0
435 2380 Vallejo St 3.5 4 3042.0
436 2430 Vallejo St 7.5 6 4721.0
437 1859 Green St 1.0 2 1306.0
438 2131 Vallejo St APT 3 1.0 1 1100.0

lastsolddate lastsoldprice latitude longitude neighborhood \


0 02-05-2016 1950000 37.795139 -122.425309 Pacific Heights
1 1/22/2016 4200000 37.794429 -122.428513 Pacific Heights
2 12/16/2015 665000 37.792472 -122.425281 Pacific Heights
3 12/17/2014 2735000 37.794706 -122.426347 Pacific Heights
4 12-02-2015 1050000 37.793212 -122.423744 Pacific Heights
.. … … … … …
434 11/14/2012 1650000 37.795777 -122.433024 Pacific Heights
435 10-01-2012 3195000 37.795330 -122.436540 Pacific Heights
436 9/24/2012 7350000 37.795246 -122.437490 Pacific Heights
437 10/18/2011 1349000 37.796588 -122.429641 Pacific Heights
438 03-04-2016 1250000 37.795255 -122.432880 Pacific Heights

totalrooms usecode yearbuilt zipcode


0 7 Condominium 1900 94109
1 7 Condominium 1961 94109
2 3 Condominium 1987 94109
3 6 Condominium 1924 94109

4 3 Condominium 2012 94109
.. … … … …
434 8 Condominium 1914 94123
435 10 SingleFamily 1908 94123
436 13 SingleFamily 1905 94123
437 5 Condominium 1900 94123
438 5 Condominium 1900 94123

[439 rows x 13 columns]

0.1 Data Preprocessing


1. Dropping irrelevant columns that add no intrinsic value to the dataset.

[33]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode',
                   'yearbuilt', 'zipcode']
df.drop(columns=columns_to_drop, inplace=True)

[34]: df.columns

[34]: Index(['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice', 'latitude',


'longitude', 'totalrooms'],
dtype='object')
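
Re-running a cell that calls `df.drop(..., inplace=True)` raises a `KeyError` once the columns are already gone. One hedged way to make the step repeatable is pandas' `errors='ignore'` option. The sketch below uses a tiny made-up frame (the values are illustrative and not taken from the dataset; only the column names echo it).

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
toy = pd.DataFrame({
    "address": ["A St", "B St"],
    "bedrooms": [2, 3],
    "zipcode": [94109, 94123],
})

columns_to_drop = ["address", "zipcode"]

# errors="ignore" skips labels that are not present, so the call is safe to repeat.
toy = toy.drop(columns=columns_to_drop, errors="ignore")
toy = toy.drop(columns=columns_to_drop, errors="ignore")  # second run is a no-op

remaining_columns = list(toy.columns)
print(remaining_columns)
```

Returning a new frame instead of using `inplace=True` also keeps the original data available if a later cell needs it.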

[35]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bathrooms 439 non-null float64
1 bedrooms 439 non-null int64
2 finishedsqft 438 non-null float64
3 lastsoldprice 439 non-null int64
4 latitude 439 non-null float64
5 longitude 437 non-null float64
6 totalrooms 439 non-null int64
dtypes: float64(4), int64(3)
memory usage: 24.1 KB
2. Detecting and Plotting missing values
[36]: missing_values = df.isnull().sum()
print("Missing values in the data:",missing_values)

Missing values in the data: bathrooms 0


bedrooms 0

finishedsqft 1
lastsoldprice 0
latitude 0
longitude 2
totalrooms 0
dtype: int64

[37]: import matplotlib.pyplot as plt

# Calculate the number of missing values in each column


missing_values = df.isnull().sum()

# Plot the missing values


plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar')
plt.title('Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()

3. Handling missing values

[38]: missing_values = df.isnull().sum()
print("Missing values before handling:")
print(missing_values)

Missing values before handling:


bathrooms 0
bedrooms 0
finishedsqft 1
lastsoldprice 0
latitude 0
longitude 2
totalrooms 0
dtype: int64

[39]: # Imputing with the median value of each column
from sklearn.impute import SimpleImputer

numerical_features = df.select_dtypes(include=['float64', 'int64']).columns
imputer = SimpleImputer(strategy='median')
df[numerical_features] = imputer.fit_transform(df[numerical_features])

[40]: missing_values_after = df.isnull().sum()


print("\nMissing values after handling:")
print(missing_values_after)

Missing values after handling:


bathrooms 0
bedrooms 0
finishedsqft 0
lastsoldprice 0
latitude 0
longitude 0
totalrooms 0
dtype: int64
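
For intuition, median imputation can also be written directly in pandas. This is a minimal sketch that swaps `SimpleImputer(strategy='median')` for `DataFrame.fillna` with the column medians; the toy values are made up, and the column names are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd

# Made-up values; one gap per column, plus one large value in finishedsqft.
toy = pd.DataFrame({
    "finishedsqft": [1000.0, np.nan, 1400.0, 5000.0],
    "longitude": [-122.42, -122.43, np.nan, -122.44],
})

# Column-wise medians; NaN entries are skipped by default.
medians = toy.median()

# Fill each column's gaps with that column's own median.
imputed = toy.fillna(medians)

print(imputed)
```

The median of the observed `finishedsqft` values is 1400, whereas the mean would be pulled upward by the 5000 entry, which is the usual argument for the median when a column is skewed.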

[22]: #no missing values

4. Detecting Outliers
[41]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Load the dataset
df = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')  # reloads the raw data, discarding the earlier drop and imputation

# Select numerical columns for outlier detection
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Create box plots for numerical columns to visualize outliers
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(data=df[column])
    plt.title(column)
plt.tight_layout()
plt.show()

# Identify outliers using z-score
z_scores = zscore(df[numerical_columns])
outliers = (z_scores > 3) | (z_scores < -3)

# Plot outliers using scatter plot
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    plt.scatter(df.index, df[column], c=outliers[:, i-1], cmap='coolwarm', alpha=0.5)
    plt.title(column)
    plt.xlabel('Index')
    plt.ylabel(column)
plt.tight_layout()
plt.show()

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[41], line 15
13 plt.figure(figsize=(12, 6))
14 for i, column in enumerate(numerical_columns, 1):
---> 15 plt.subplot(2, 3, i)
16 sns.boxplot(data=df[column])
17 plt.title(column)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)

1422 fig = gcf()


1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428 # If we found an Axes at the position, we can re-use it if the user passed no

1429 # kwargs or if the axes class and kwargs are identical.

1430 if (ax.get_subplotspec() == key
1431 and (kwargs == {}
1432 or (ax._projection_init
1433 == fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure, args)

597 else:
598 if not isinstance(num, Integral) or num < 1 or num > rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <= {rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]

ValueError: num must be an integer with 1 <= num <= 6, not 7
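
The `ValueError` above comes from requesting a seventh subplot in a 2 x 3 grid, which has only six slots; the loop runs over more than six numeric columns. A minimal sketch of one way around it is to size the grid from the number of plots. The helper name `grid_shape` is hypothetical and not part of matplotlib.

```python
import math


def grid_shape(n_plots, n_cols=3):
    """Return (n_rows, n_cols) for a grid with at least n_plots cells."""
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    # Round up so a partially filled last row still gets its own row.
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols


for n in (6, 7, 9):
    print(n, grid_shape(n))
```

Inside the plotting loop, the result would be used as `n_rows, n_cols = grid_shape(len(numerical_columns))` followed by `plt.subplot(n_rows, n_cols, i)`, so the grid grows with the number of columns.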

[42]: df_no_outliers = df[~outliers.any(axis=1)]
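
The z-score rule used here flags a value when it lies more than three standard deviations from its column mean. The sketch below reimplements that rule in NumPy under two stated assumptions: it uses the population standard deviation (`ddof=0`, the default of `scipy.stats.zscore`), and it ignores missing values when computing the mean and standard deviation, which differs from scipy's default `nan_policy='propagate'`. The function name and the toy data are illustrative only.

```python
import numpy as np


def zscore_outlier_mask(values, threshold=3.0):
    """Boolean mask of entries whose |z-score| exceeds threshold, column by column."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values, axis=0)   # NaN-aware column means
    std = np.nanstd(values, axis=0)     # population std (ddof=0)
    z = (values - mean) / std
    # Comparisons involving NaN are False, so missing entries are never flagged.
    return np.abs(z) > threshold


# Toy data: column 0 has one extreme value, column 1 has a missing value.
col_a = np.array([1.0] * 20 + [100.0])
col_b = np.array([5.0, 6.0] * 10 + [np.nan])
data = np.column_stack([col_a, col_b])

mask = zscore_outlier_mask(data)
rows_to_keep = ~mask.any(axis=1)   # same row filter as df[~outliers.any(axis=1)]

print("flagged entries:", int(mask.sum()))
print("rows kept:", int(rows_to_keep.sum()))
```

Handling NaN explicitly matters because, with the default propagation, one missing value can turn every z-score in its column into NaN, and then no outlier in that column is detected.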

[43]: plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(data=df_no_outliers[column])
    plt.title(column)
plt.tight_layout()
plt.show()

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[43], line 3
1 plt.figure(figsize=(12, 6))
2 for i, column in enumerate(numerical_columns, 1):
----> 3 plt.subplot(2, 3, i)
4 sns.boxplot(data=df_no_outliers[column])
5 plt.title(column)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)

1422 fig = gcf()


1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428 # If we found an Axes at the position, we can re-use it if the user passed no

1429 # kwargs or if the axes class and kwargs are identical.


1430 if (ax.get_subplotspec() == key
1431 and (kwargs == {}
1432 or (ax._projection_init
1433 == fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure, args)

597 else:
598 if not isinstance(num, Integral) or num < 1 or num > rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <= {rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]

ValueError: num must be an integer with 1 <= num <= 6, not 7

[29]: #the outliers have been reduced as far as possible

[44]: df.columns

[44]: Index(['address', 'bathrooms', 'bedrooms', 'finishedsqft', 'lastsolddate',


'lastsoldprice', 'latitude', 'longitude', 'neighborhood', 'totalrooms',
'usecode', 'yearbuilt', 'zipcode'],
dtype='object')

[45]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode',
                   'yearbuilt', 'zipcode']
df.drop(columns=columns_to_drop, inplace=True)
df

[45]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \


0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms

0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]

[ ]: #CORRELATION ESTIMATION

[126]: import pandas as pd


import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix


corr_matrix = df.corr()

# Set custom color palette


colors = sns.color_palette("coolwarm", as_cmap=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap=colors, fmt=".2f", linewidths=0.5, cbar=False)

plt.title('Correlation Matrix', fontsize=16, fontweight='bold')


plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=12)
plt.show()
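
Each cell of the correlation matrix is a Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. As a small worked check, the sketch below computes it by hand for `finishedsqft` and `lastsoldprice` using only the first five rows shown in the data preview, then compares the result with `np.corrcoef`. The helper name `pearson` is an illustrative choice, and five rows are far too few to say anything about the full dataset.

```python
import numpy as np

# First five rows of the preview (finishedsqft, lastsoldprice).
sqft = np.array([1463.0, 3291.0, 653.0, 2272.0, 837.0])
price = np.array([1950000.0, 4200000.0, 665000.0, 2735000.0, 1050000.0])


def pearson(x, y):
    """Pearson correlation: centered cross-product over the product of norms."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


r_manual = pearson(sqft, price)
r_numpy = float(np.corrcoef(sqft, price)[0, 1])

print(r_manual, r_numpy)
```

Because the coefficient divides out each column's scale, it is unaffected by the fact that prices are in the millions while areas are in the thousands.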

0.2 PRINCIPAL COMPONENT ANALYSIS
1. Standardization
[52]: df_index = df.index

[53]: numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

[54]: import pandas as pd


from sklearn.preprocessing import StandardScaler

[76]: scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_features)
df_scaled

[76]: 0 1 2 3 4 5 6
0 -0.285556 -0.399122 -0.403174 -0.211435 1.185443 1.148603 0.105818
1 0.698425 0.140000 0.680420 0.624953 0.874861 0.720970 0.105818
2 -0.941543 -0.938244 -0.883323 -0.689106 0.018791 1.152340 -0.909570
3 0.042437 -0.399122 0.076381 0.080372 0.996031 1.010062 -0.148029
4 -0.941543 -0.938244 -0.774252 -0.545990 0.342496 1.357481 -0.909570
.. … … … … … … …
434 -0.285556 0.140000 0.001099 -0.322953 1.464529 0.118894 0.359665
435 0.698425 0.679121 0.532819 0.251367 1.268994 -0.350380 0.867359
436 3.322374 1.757365 1.528090 1.795897 1.232249 -0.477175 1.628900
437 -0.941543 -0.399122 -0.496240 -0.434844 1.819293 0.570418 -0.401876
438 -0.941543 -0.938244 -0.618352 -0.471645 1.236186 0.138114 -0.401876

[439 rows x 7 columns]
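
Standardization replaces each value x with z = (x - mean) / std, computed per column. `StandardScaler` uses the population standard deviation (`ddof=0`), so the same result can be sketched in plain NumPy. The toy matrix below reuses the `bathrooms` and `finishedsqft` values of the first five preview rows; since the scaled table above was computed over all 439 rows, the numbers here will differ from it.

```python
import numpy as np

# Columns: bathrooms, finishedsqft (first five preview rows only).
X = np.array([
    [2.0, 1463.0],
    [3.5, 3291.0],
    [1.0, 653.0],
    [2.5, 2272.0],
    [1.0, 837.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population std, matching StandardScaler
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 per column
print(X_scaled.std(axis=0))   # approximately 1 per column
```

After this step both columns contribute on the same scale, which is the property PCA relies on when the original units differ widely.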

2. Covariance Matrix Computation


[90]: import numpy as np
import pandas as pd

# Fill any remaining missing values with the column mean
data_filled = df.fillna(df.mean())

# Compute the covariance matrix
covariance_matrix = np.cov(data_filled, rowvar=False)
column_names = ['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice', 'latitude',
                'longitude', 'totalrooms']
covariance_df = pd.DataFrame(covariance_matrix, columns=column_names, index=column_names)

# Print the covariance matrix DataFrame
print("Covariance Matrix:")
print(covariance_df)

Covariance Matrix:
bathrooms bedrooms finishedsqft lastsoldprice \
bathrooms 2.329161e+00 2.017295e+00 2.231689e+03 3.157419e+06
bedrooms 2.017295e+00 3.448394e+00 2.356723e+03 3.033648e+06
finishedsqft 2.231689e+03 2.356723e+03 2.845895e+06 3.819981e+09
lastsoldprice 3.157419e+06 3.033648e+06 3.819981e+09 7.253364e+12
latitude 3.289555e-04 8.149072e-05 3.582618e-01 7.795914e+02
longitude -2.819931e-03 -3.999855e-03 -3.881395e+00 -6.205566e+03
totalrooms 4.699683e+00 5.724836e+00 5.793232e+03 7.124227e+06

latitude longitude totalrooms


bathrooms 0.000329 -0.002820 4.699683e+00
bedrooms 0.000081 -0.004000 5.724836e+00

finishedsqft 0.358262 -3.881395 5.793232e+03
lastsoldprice 779.591402 -6205.566010 7.124227e+06
latitude 0.000005 0.000009 2.110409e-04
longitude 0.000009 0.000056 -9.048211e-03
totalrooms 0.000211 -0.009048 1.555414e+01
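
In the matrix above, the variance of `lastsoldprice` (about 7.25e12) dwarfs every other entry because the covariance was computed from the unscaled `df` and not from `df_scaled`. The sketch below illustrates the effect on synthetic data (randomly generated, not the real dataset): the raw covariance is dominated by the large-scale column, while the covariance of the standardized data has a near-unit diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: a small-scale column and a large-scale column that depends on it.
rooms = rng.normal(5.0, 2.0, size=n)
price = 300000.0 * rooms + rng.normal(0.0, 200000.0, size=n)
X = np.column_stack([rooms, price])

# Covariance of the raw data: the price variance swamps everything else.
cov_raw = np.cov(X, rowvar=False)

# Covariance of the standardized data: every column has comparable variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_std = np.cov(X_std, rowvar=False)

print(cov_raw)
print(cov_std)
```

The diagonal of `cov_std` equals n / (n - 1) and not exactly 1, because the standardization divides by the population standard deviation while `np.cov` uses the sample estimate.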

[ ]:

3. Eigen Decomposition
[94]: import pandas as pd

# 'eigenvalues' and 'eigenvectors' are not defined in the cells shown here;
# they are assumed to come from a cell that is missing from this export.

# Convert eigenvalues and eigenvectors to DataFrame
eigen_df = pd.DataFrame({'Eigenvalue': eigenvalues})
eigen_df['Eigenvector'] = [eigenvectors[:, i] for i in range(eigenvectors.shape[1])]

# Print the DataFrame
print("Eigenvalues and Eigenvectors:")
print(eigen_df)

Eigenvalues and Eigenvectors:


Eigenvalue Eigenvector
0 7.253366e+12 [-4.3530400936506286e-07, -4.182401136422737e-…
1 8.341100e+05 [0.0006819708177894264, 0.0009100232661667038,…
2 3.909019e+00 [-0.07635453461451713, -0.34829146033302216, 0…
3 1.175348e+00 [0.15061016522339143, 0.9222261380364992, -7.0…
4 5.325269e-01 [0.9856396143372855, -0.16790163752836, -0.000…
5 5.091827e-05 [0.0008041469959149176, -0.0004697221463473827…
6 3.293921e-06 [4.783139021784667e-05, -8.824796320393624e-06…
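
The cell that computes `eigenvalues` and `eigenvectors` does not appear in this export. A minimal sketch of one common way to obtain them, assuming NumPy's `np.linalg.eigh` (suited to symmetric matrices such as a covariance matrix) followed by a sort into descending order, is shown below on a small made-up matrix. Whether the original notebook used this function is not known.

```python
import numpy as np

# Small symmetric matrix standing in for a covariance matrix (made-up values).
cov = np.array([
    [4.0, 2.0, 0.6],
    [2.0, 3.0, 0.4],
    [0.6, 0.4, 1.0],
])

# eigh returns eigenvalues in ascending order, with eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (most variance) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Sanity check: V diag(w) V^T should rebuild the original matrix.
reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T

print(eigenvalues)
```

Keeping the eigenvectors as columns matches the `eigenvectors[:, i]` indexing used in the surrounding cells.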
4. Rearranging eigenvectors by respective eigenvalues
[96]: import pandas as pd

# 'eigenvectors' contains the eigenvectors computed from the covariance matrix

# Create a DataFrame for eigenvectors
eigenvectors_df = pd.DataFrame(data=eigenvectors,
                               columns=[f'PC{i+1}' for i in range(eigenvectors.shape[1])])

# Print the DataFrame
print("Eigenvectors:")
print(eigenvectors_df)

Eigenvectors:
PC1 PC2 PC3 PC4 PC5 \
0 -4.353040e-07 6.819708e-04 -7.635453e-02 1.506102e-01 9.856396e-01
1 -4.182401e-07 9.100233e-04 -3.482915e-01 9.222261e-01 -1.679016e-01

2 -5.266495e-04 9.999962e-01 2.655423e-03 -7.046468e-05 -4.754296e-04
3 -9.999999e-01 -5.266507e-04 -3.019358e-07 -6.439164e-08 -9.080315e-08
4 -1.074800e-10 -6.271480e-08 1.190288e-04 -2.203851e-05 1.232789e-04
5 8.555430e-10 -7.352065e-07 4.166878e-04 -1.856730e-04 8.704261e-04
6 -9.821963e-07 2.447252e-03 -9.342675e-01 -3.561116e-01 -1.796084e-02

PC6 PC7
0 8.041470e-04 4.783139e-05
1 -4.697221e-04 -8.824796e-06
2 9.181873e-09 -1.982864e-07
3 -6.302828e-10 -2.212895e-10
4 -1.940234e-01 9.809969e-01
5 -9.809964e-01 -1.940235e-01
6 -3.528579e-04 3.782708e-05
5. Selecting the best features k
[106]: from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Impute missing values in df with the column mean
imputer = SimpleImputer(strategy='mean')
imputed_features = imputer.fit_transform(df)

# Apply PCA
pca = PCA()
pca.fit(imputed_features)

[106]: PCA()

[107]: import numpy as np

cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance ratio
plt.plot(range(1, len(cumulative_variance_ratio) + 1),
         cumulative_variance_ratio, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio by Number of Components')
plt.grid(True)
plt.show()

[158]: k = 3 # Number of principal components to keep
best_k = eigenvectors[:, :k]

# Print only the k selected eigenvectors
print("The selected Principal Components are:")
for i in range(k):
    print(f'PC{i+1}: {best_k[:, i]}')

The selected Principal Components are:
PC1: [-4.35304009e-07 -4.18240114e-07 -5.26649488e-04 -9.99999861e-01
-1.07479952e-10 8.55543039e-10 -9.82196278e-07]
PC2: [ 6.81970818e-04 9.10023266e-04 9.99996220e-01 -5.26650652e-04
-6.27147956e-08 -7.35206512e-07 2.44725160e-03]
PC3: [-7.63545346e-02 -3.48291460e-01 2.65542278e-03 -3.01935766e-07
1.19028810e-04 4.16687798e-04 -9.34267523e-01]

[112]: import pandas as pd

# 'best_k' contains the first k eigenvectors
principal_components = pd.DataFrame(best_k,
                                    columns=[f'PC{i+1}' for i in range(best_k.shape[1])])

print("Selected Principal Components:")
print(principal_components)

Selected Principal Components:


PC1 PC2
0 -4.353040e-07 6.819708e-04
1 -4.182401e-07 9.100233e-04
2 -5.266495e-04 9.999962e-01
3 -9.999999e-01 -5.266507e-04
4 -1.074800e-10 -6.271480e-08
5 8.555430e-10 -7.352065e-07
6 -9.821963e-07 2.447252e-03
6. Projection
[119]: import numpy as np

# Project the data in 'df' onto the selected principal components
projected_data = np.dot(df, principal_components)

# Check for NaN values in the projected data
nan_indices = np.isnan(projected_data)

# Alternatively, you can remove rows with NaN values
# projected_data = projected_data[~nan_indices.any(axis=1)]

# Print the shape of the projected data
print("Shape of projected data:", projected_data.shape)

# Print the projected data
print("Projected data:")
print(projected_data)

Shape of projected data: (439, 2)


Projected data:
[[-1.95000050e+06 4.36046102e+02]
[-4.20000115e+06 1.07907716e+03]
[-6.65000252e+05 3.02783870e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.05000030e+06 2.84022673e+02]
[-1.76000055e+06 5.73112033e+02]
[-9.75000478e+05 6.51522680e+02]
[-1.09500035e+06 3.73325414e+02]
[-1.22700038e+06 3.93807188e+02]
[-5.40000342e+05 5.06613770e+02]
[ nan nan]
[-1.49500040e+06 3.62668778e+02]
[-1.05000033e+06 3.47024883e+02]
[-8.90000248e+05 2.37292167e+02]
[-1.08500074e+06 1.11659562e+03]
[-1.56500052e+06 5.75799499e+02]
[-1.70000042e+06 3.47702254e+02]
[-3.75000039e+06 -2.54917964e+02]
[-8.05000278e+05 3.16050002e+02]
[-7.02000271e+05 3.30298130e+02]
[-8.75000286e+05 3.13186776e+02]
[-1.49000041e+06 3.85301614e+02]
[-1.42500032e+06 2.29534625e+02]
[-1.22500046e+06 5.54863924e+02]
[-1.56000057e+06 6.78432374e+02]
[-1.07500044e+06 5.48858145e+02]
[-1.30500046e+06 5.34731107e+02]
[-1.42700037e+06 3.17480987e+02]
[-8.95000330e+05 3.91653426e+02]
[-1.60000049e+06 5.00369389e+02]
[-1.15000009e+07 -1.25847323e+03]
[-1.21100042e+06 4.77233315e+02]
[ nan nan]
[-2.73950108e+06 1.32425486e+03]
[-2.25100107e+06 1.44552143e+03]
[-5.30000343e+05 5.11881186e+02]
[-1.60300067e+06 8.55791445e+02]
[-8.10000531e+05 7.95423179e+02]
[-1.05000037e+06 4.30026161e+02]
[-2.50000100e+06 1.24738448e+03]

[-1.80000060e+06 6.68041584e+02]
[-1.02500065e+06 9.60192920e+02]
[-6.25000258e+05 3.25849888e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.61150059e+06 7.01309677e+02]
[-1.40000006e+07 -2.47009155e+03]
[-1.76000055e+06 5.73112033e+02]
[-1.18500063e+06 8.75928816e+02]
[-1.02500038e+06 4.60190771e+02]
[-7.90000475e+05 6.92953262e+02]
[-1.15500043e+06 5.06728199e+02]
[ nan nan]
[-8.15000500e+05 7.34787698e+02]
[-1.13000043e+06 5.13893632e+02]
[-2.68100147e+06 2.08806798e+03]
[-7.95000360e+05 4.74318378e+02]
[-1.90000062e+06 6.69376315e+02]
[-9.50000351e+05 4.16691475e+02]
[-7.45000255e+05 2.88651712e+02]
[-1.75000064e+06 7.45373924e+02]
[-6.83500242e+05 2.80038435e+02]
[-8.25000435e+05 6.09520735e+02]
[-1.00000045e+06 5.91359038e+02]
[-1.14000038e+06 4.29629872e+02]
[-9.20000386e+05 4.90490776e+02]
[-1.17500050e+06 6.49196199e+02]
[-1.20000059e+06 8.05025255e+02]
[-5.00000318e+06 4.72676112e+03]
[-1.80000075e+06 9.59037127e+02]
[-9.15000381e+05 4.82122479e+02]
[-6.56000212e+05 2.29524020e+02]
[-3.70000099e+06 9.08405035e+02]
[-4.20000115e+06 1.07907716e+03]
[-1.30100043e+06 4.68839330e+02]
[-5.50000174e+05 1.86344469e+02]
[-1.18000066e+06 9.43561823e+02]
[-8.90000377e+05 4.81286351e+02]
[-1.97500088e+06 1.15987619e+03]
[-1.08000064e+06 9.31227134e+02]
[-1.30000037e+06 3.56363279e+02]
[-8.50000356e+05 4.52356605e+02]
[-1.12612546e+06 5.76936620e+02]
[-1.29000061e+06 8.21626454e+02]
[-9.50000425e+05 5.57693390e+02]
[-7.22000319e+05 4.15766689e+02]
[-8.03000402e+05 5.51108906e+02]
[-1.00000041e+06 5.14358474e+02]
[-8.25000469e+05 6.73522085e+02]

[-1.61150059e+06 7.01309677e+02]
[-1.70000076e+06 1.00470626e+03]
[-2.05000087e+06 1.12037740e+03]
[-9.15000370e+05 4.61124150e+02]
[-1.50000077e+06 1.06003254e+03]
[-5.71000071e+06 -1.61164325e+02]
[-2.52500220e+06 3.50722891e+03]
[-9.20000662e+05 1.01548879e+03]
[-1.20000035e+06 3.48031022e+02]
[-5.37500462e+05 7.34933236e+02]
[-8.40000621e+05 9.57621221e+02]
[-9.80000592e+05 8.65894752e+02]
[-9.30000253e+05 2.35221175e+02]
[-6.25000235e+05 2.80852506e+02]
[-4.00000051e+06 -7.85910660e+01]
[-3.99500080e+06 4.74040790e+02]
[-1.95000050e+06 4.36046102e+02]
[-1.00000389e+05 7.12341065e+02]
[-9.70000819e+05 1.29915509e+03]
[-1.15500043e+06 5.06728199e+02]
[-3.90000231e+06 3.36309129e+03]
[-9.00000233e+05 2.06020865e+02]
[-7.25000308e+05 3.93184370e+02]
[-8.25000473e+05 6.80520467e+02]
[-9.40000486e+05 6.74959473e+02]
[-6.20000441e+05 6.73482179e+02]
[-4.95000247e+05 3.39313771e+02]
[-1.60000095e+06 1.38537464e+03]
[-7.50000565e+05 8.75021809e+02]
[-1.49500074e+06 1.01267002e+03]
[-2.47000206e+06 3.25920161e+03]
[-8.22500360e+05 4.66835459e+02]
[-1.27500071e+06 1.01253292e+03]
[-5.37000263e+05 3.57192755e+02]
[-1.19500073e+06 1.06966247e+03]
[-1.24000068e+06 9.58960159e+02]
[-8.10000471e+05 6.81421845e+02]
[-9.65000503e+05 7.01790267e+02]
[-6.25000305e+05 4.13849556e+02]
[-7.19000401e+05 5.71346969e+02]
[-1.38500076e+06 1.07060034e+03]
[-1.31000055e+06 7.00099492e+02]
[-7.55000443e+05 6.42390335e+02]
[-9.30000687e+05 1.06022210e+03]
[-1.30000067e+06 9.31366910e+02]
[-2.15000103e+06 1.39572002e+03]
[-7.25000531e+05 8.18186120e+02]
[-7.75000359e+05 4.76855460e+02]

[-8.49000447e+05 6.24882605e+02]
[-5.59000212e+05 2.55609228e+02]
[-1.90000074e+06 9.05376105e+02]
[-1.32500085e+06 1.26020140e+03]
[-1.95000050e+06 4.36046102e+02]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-1.12500035e+06 3.59526569e+02]
[-1.15000058e+06 8.00361262e+02]
[-6.50000113e+06 4.27796070e+02]
[-4.50000076e+06 2.54083462e+02]
[-2.72000067e+06 5.53524033e+02]
[-1.41800050e+06 5.83219175e+02]
[-8.25000311e+05 3.73519180e+02]
[-1.85300048e+06 4.32126187e+02]
[-4.00000114e+06 1.10542164e+03]
[-1.51000072e+06 9.72774593e+02]
[-3.90000091e+06 6.96080545e+02]
[-1.71000012e+07 -2.20571130e+03]
[-1.00000042e+06 5.43356773e+02]
[-3.65000063e+06 2.37735822e+02]
[-9.95000172e+06 6.53841092e+02]
[-4.35000201e+05 2.66910756e+02]
[-1.25000039e+06 4.11696557e+02]
[-2.00000051e+06 4.44708542e+02]
[-2.82500073e+06 6.39226774e+02]
[-1.70000074e+06 9.54702407e+02]
[-1.90500072e+06 8.60744775e+02]
[-1.95000055e+06 5.23038090e+02]
[-8.25000311e+05 3.72519183e+02]
[-8.00000231e+05 2.28686043e+02]
[-8.80000309e+05 3.54557395e+02]
[-1.15500033e+06 3.24726439e+02]
[-2.53200068e+06 6.26540030e+02]
[-6.50000109e+06 3.49784984e+02]
[-4.07500365e+05 5.85399214e+02]
[-3.80000085e+06 6.11744536e+02]
[-1.20000010e+07 -1.31978680e+03]
[-3.69999995e+06 -1.07360136e+03]
[-5.62700147e+06 1.31155559e+03]
[-6.55000100e+06 1.82462146e+02]
[-2.10000073e+06 8.42045475e+02]
[-1.42500063e+06 8.27537259e+02]
[-2.61000064e+06 5.26456608e+02]
[-7.05000090e+06 -1.48860135e+02]
[-1.15000030e+06 2.64361523e+02]
[-3.21000058e+06 2.48462944e+02]
[-2.66000061e+06 4.64118294e+02]

[-2.15000069e+06 7.42717771e+02]
[-7.99500110e+06 -1.05509351e+01]
[-6.41000180e+05 1.73424022e+02]
[-1.19900043e+06 5.07553032e+02]
[-3.60000283e+06 4.42911322e+03]
[-5.60000175e+06 1.84077042e+03]
[-1.26000040e+06 4.25427531e+02]
[-4.97500126e+06 1.08893128e+03]
[-2.41000069e+06 6.74786235e+02]
[-2.38890021e+07 -2.34511635e+03]
[-1.70000064e+06 7.61705584e+02]
[-7.87500145e+06 6.77640060e+02]
[-1.09950019e+07 7.29495682e+02]
[-5.99499996e+06 -1.65525150e+03]
[-3.75000019e+06 -6.33928254e+02]
[-8.30000294e+05 3.39888491e+02]
[-2.25000069e+06 7.16050843e+02]
[-1.46500039e+06 3.53464922e+02]
[-6.80000262e+05 3.18884019e+02]
[-1.13000036e+06 3.82895893e+02]
[-1.24500059e+06 7.86329314e+02]
[-8.85000459e+05 6.37921810e+02]
[-6.50000349e+06 4.91179488e+03]
[-1.02000036e+06 4.09824224e+02]
[-1.99500069e+06 7.82343205e+02]
[-8.90000358e+05 4.45288934e+02]
[-3.35000104e+06 1.08873523e+03]
[-9.90000389e+05 4.78624226e+02]
[-8.05000323e+05 4.02052807e+02]
[-1.25500042e+06 4.69062222e+02]
[-3.60000105e+06 1.04607399e+03]
[-1.25000041e+06 4.41695588e+02]
[-1.97500071e+06 8.29876760e+02]
[-2.52500100e+06 1.22521631e+03]
[-2.40000062e+06 5.36060627e+02]
[-2.22500097e+06 1.26421668e+03]
[-9.41750376e+05 4.65037032e+02]
[-1.30000046e+06 5.23361055e+02]
[-1.15000028e+06 2.22358324e+02]
[-1.80000094e+06 1.31404404e+03]
[-1.95000074e+06 8.98042818e+02]
[-3.45000067e+06 3.59070272e+02]
[-1.84000077e+06 9.69974109e+02]
[-4.15000213e+06 2.94441993e+03]
[-1.70000074e+06 9.54695066e+02]
[-3.75000122e+06 1.32007210e+03]
[-1.08300041e+06 4.91645709e+02]
[-1.90000068e+06 7.87378316e+02]

[-2.67500079e+06 7.91230527e+02]
[-7.75000124e+06 3.04473906e+02]
[-1.63500061e+06 7.24939055e+02]
[-1.07500041e+06 5.03857974e+02]
[-2.10000073e+06 8.31042728e+02]
[-1.52500061e+06 7.55871411e+02]
[-1.45000050e+06 5.66366354e+02]
[-1.16500049e+06 6.32462789e+02]
[-1.31000057e+06 7.29099897e+02]
[-1.15000052e+06 6.80358358e+02]
[-2.15000078e+06 9.10711332e+02]
[-2.90000086e+06 8.72723245e+02]
[-6.41500174e+06 1.62155299e+03]
[-8.60000308e+06 3.59081937e+03]
[-1.01175018e+07 8.46635381e+02]
[-1.10000041e+06 4.92689594e+02]
[-2.01000050e+06 4.19445469e+02]
[-2.07500042e+06 2.57207856e+02]
[-1.07500051e+06 6.85860643e+02]
[-7.15000320e+05 4.18450800e+02]
[-2.15000090e+06 1.15071714e+03]
[-1.40000082e+06 1.18070651e+03]
[-1.00000037e+06 4.42358746e+02]
[-3.20000085e+06 7.79736396e+02]
[-6.50000266e+05 3.34683539e+02]
[-9.95000610e+05 8.96992056e+02]
[-1.02500042e+06 5.22191447e+02]
[-7.40000094e+06 -1.72200080e+02]
[-7.81000310e+05 3.83690320e+02]
[-4.95000075e+06 1.18093644e+02]
[-5.88001921e+05 3.49236230e+03]
[-2.00000076e+06 9.19721485e+02]
[-5.65000114e+06 6.69442220e+02]
[-7.50000620e+05 9.80022322e+02]
[-1.73500078e+06 1.03127753e+03]
[-1.85000082e+06 1.07070498e+03]
[-8.65000136e+06 3.06486074e+02]
[-3.80000085e+06 6.11744536e+02]
[-6.50000109e+06 3.49784984e+02]
[-9.46000514e+05 7.26797769e+02]
[-1.80000076e+06 9.77044455e+02]
[-7.45000281e+05 3.37653974e+02]
[-3.15000133e+06 1.69107816e+03]
[-1.60000073e+06 9.62373106e+02]
[-1.73000090e+06 1.24690508e+03]
[-3.35000104e+06 1.08873523e+03]
[-1.78000081e+06 1.06957403e+03]
[-3.85000126e+06 1.37241552e+03]

[-2.01000116e+06 1.66844979e+03]
[-3.67500085e+06 6.52573174e+02]
[-1.45000065e+06 8.61368368e+02]
[-1.28000049e+06 5.93891394e+02]
[-9.50000426e+05 5.58688491e+02]
[-3.00000110e+06 1.30506067e+03]
[-8.90000387e+05 5.00292765e+02]
[-3.80000279e+06 4.29876503e+03]
[-3.15000140e+06 1.82607247e+03]
[-5.25000104e+06 6.01106210e+02]
[-1.62500028e+06 1.13202089e+02]
[-1.68500082e+06 1.11861228e+03]
[-1.26100074e+06 1.07391487e+03]
[-7.12500518e+05 7.96768790e+02]
[-1.74000086e+06 1.18363948e+03]
[-7.30000357e+05 4.85550757e+02]
[-8.40000320e+05 3.85620026e+02]
[-6.35000307e+05 4.15583023e+02]
[-8.80000359e+06 4.49049300e+03]
[-1.36800084e+06 1.22555410e+03]
[-1.02500096e+06 1.56019242e+03]
[-1.78200060e+06 6.61517999e+02]
[-1.20000081e+06 1.22402771e+03]
[-2.85500192e+06 2.89643477e+03]
[-9.60000394e+06 4.94419228e+03]
[-1.75000051e+06 5.13371784e+02]
[-1.70000074e+06 9.54702407e+02]
[-2.10000083e+06 1.01604937e+03]
[-8.85000376e+05 4.80922744e+02]
[-9.80000417e+05 5.33890202e+02]
[-1.65000031e+06 1.49037403e+02]
[-8.60000276e+06 2.97081705e+03]
[-7.30000228e+05 2.40551683e+02]
[-1.62900053e+06 5.72099207e+02]
[-1.45000080e+06 1.13236490e+03]
[-2.00000101e+06 1.39371076e+03]
[-1.77000065e+06 7.67838339e+02]
[-5.32000208e+06 2.55023014e+03]
[-2.00000070e+06 8.02712312e+02]
[-1.80000083e+06 1.10504642e+03]
[-6.70000352e+05 4.91149895e+02]
[-4.45000128e+06 1.25642084e+03]
[-1.58950079e+06 1.07990286e+03]
[-1.61000072e+06 9.52102065e+02]
[-2.51500109e+06 1.41548826e+03]
[-4.99900132e+06 1.18728965e+03]
[-6.55000186e+05 1.80050860e+02]
[-2.15000069e+06 7.42717771e+02]

[-3.35000194e+06 2.80673454e+03]
[-1.05000029e+06 2.77023382e+02]
[-8.95000424e+05 5.69655201e+02]
[-1.34900042e+06 4.34559451e+02]
[-1.30000080e+06 1.17136265e+03]
[-1.08500067e+06 9.86596109e+02]
[-1.34000061e+06 8.06301277e+02]
[-2.16500102e+06 1.35980990e+03]
[-9.65000281e+05 2.79790611e+02]
[-8.30000502e+05 7.34887680e+02]
[-9.50000939e+05 1.53369061e+03]
[-7.94000449e+05 6.43844390e+02]
[-6.50000277e+06 3.54478218e+03]
[-1.47500063e+06 8.10199798e+02]
[-2.05000134e+06 2.01038171e+03]
[-4.99000099e+06 5.72030316e+02]
[-5.00000149e+06 1.50876014e+03]
[-9.10000458e+05 6.30758537e+02]
[-7.50000105e+06 1.01425589e+01]
[-1.15000050e+06 6.46361844e+02]
[-1.80000080e+06 1.05204594e+03]
[-1.25000075e+06 1.09169962e+03]
[-1.27500062e+06 8.48533539e+02]
[-7.05000433e+05 6.36716502e+02]
[-8.72500461e+05 6.46505505e+02]
[-2.60000116e+06 1.52672146e+03]
[-1.10000037e+06 4.20693564e+02]
[-2.10000097e+06 1.28204694e+03]
[-1.19900043e+06 5.07553032e+02]
[-1.62500100e+06 1.47820308e+03]
[-2.71000073e+06 6.77793788e+02]
[-2.53200068e+06 6.26540030e+02]
[-3.21000058e+06 2.48462944e+02]
[-1.50000077e+06 1.06003254e+03]
[-1.90300029e+06 4.27993523e+01]
[-3.50005829e+04 1.09757852e+03]
[-7.20000459e+05 6.82816538e+02]
[-1.30000046e+06 5.39365034e+02]
[-3.80000279e+06 4.29876503e+03]
[-4.62100210e+06 2.76636720e+03]
[-7.70000562e+05 8.64489024e+02]
[-1.60000075e+06 1.00736747e+03]
[-4.00000168e+06 2.14540963e+03]
[-6.26000357e+05 5.12324978e+02]
[-1.85300048e+06 4.32126187e+02]
[-1.16600078e+06 1.17093655e+03]
[-3.35000373e+06 6.19278769e+03]
[-1.10000047e+06 6.07692175e+02]

[-4.00000113e+06 1.09340763e+03]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-2.40000077e+06 8.32056547e+02]
[-7.45000014e+06 -1.69352958e+03]
[-1.30750030e+07 2.30405906e+03]
[-2.50000052e+06 3.30386011e+02]
[-4.15000071e+06 2.56414152e+02]
[-2.30000030e+06 -4.02878644e+01]
[-9.00000190e+06 1.23215198e+03]
[-3.50000192e+06 2.72376843e+03]
[-1.25000028e+07 2.09188138e+03]
[-1.30000050e+06 6.12362311e+02]
[-1.51000049e+06 5.29763976e+02]
[-1.67600019e+07 -8.26659911e+02]
[-1.20000016e+07 -2.08787587e+02]
[-2.98900122e+06 1.53084935e+03]
[-2.81000056e+06 3.29121021e+02]
[-2.52500165e+06 2.47022049e+03]
[-2.47500083e+06 9.21551970e+02]
[-1.15000049e+06 6.30358547e+02]
[-1.00900030e+06 2.96617831e+02]
[-7.45000014e+06 -1.69352958e+03]
[-2.40000062e+06 5.36044693e+02]
[-8.30000342e+05 4.29888151e+02]
[-1.41000037e+06 3.32431578e+02]
[-3.00000087e+06 8.70060097e+02]
[-3.50000112e+06 1.20173252e+03]
[-9.50000362e+06 4.37482971e+03]
[-4.15000071e+06 2.56414152e+02]
[-3.62500102e+06 9.90906119e+02]
[-2.50000064e+06 5.53384258e+02]
[-9.20000525e+05 7.55488182e+02]
[-2.50200099e+06 1.22234003e+03]
[-4.75000111e+06 8.56422746e+02]
[-3.40000034e+06 -2.55602169e+02]
[-2.90000037e+06 -6.72769005e+01]
[-5.00000262e+06 3.65976652e+03]
[-1.84500068e+06 7.96341161e+02]
[-1.23000053e+06 6.77231963e+02]
[-6.00000061e+06 -4.25890317e+02]
[-3.17500099e+06 1.04589681e+03]
[-9.65000382e+05 4.70812074e+02]
[-1.33000047e+06 5.49562287e+02]
[-1.00000013e+07 -1.24496790e+02]
[-9.50000182e+06 9.46827929e+02]
[-2.85000104e+06 1.21705736e+03]
[-2.72500105e+06 1.28288961e+03]
[-5.25000238e+06 3.13511444e+03]
[-2.01500055e+06 5.13808492e+02]
[-5.55000131e+06 1.02710488e+03]
[-4.65000138e+06 1.40110678e+03]
[-1.75000058e+06 6.38371312e+02]
[-1.57600071e+06 9.33028246e+02]
[-6.25000159e+06 1.36746268e+03]
[-9.49000623e+05 9.32216179e+02]
[-8.95000549e+05 8.07655211e+02]
[-1.25000055e+06 7.13700706e+02]
[-1.65000090e+06 1.27604208e+03]
[-3.19500116e+06 1.35937026e+03]
[-7.35000147e+06 8.50142342e+02]
[-1.34900050e+06 5.95558160e+02]
[-1.25000041e+06 4.41696443e+02]]

[159]: projected_data.shape

[159]: (439, 2)

[122]: explained_variance_ratio = pca.explained_variance_ratio_

# Print the explained variance ratio values
print("Explained Variance Ratio:", explained_variance_ratio)

Explained Variance Ratio: [9.99999885e-01 1.14996251e-07 5.38924715e-13 1.62041663e-13
 7.34178924e-14 7.01994932e-18 4.54122976e-19]
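
A first ratio this close to 1 is what PCA typically reports when one input column has a far larger variance than the others. The following is a minimal sketch of that effect on synthetic stand-in data; `X_demo` and its three columns are hypothetical and are not the notebook's dataset. It compares the ratios with and without `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: one large-scale column (price-like) and
# two much smaller-scale columns, all independent of each other
X_demo = np.column_stack([
    rng.normal(2_000_000, 1_000_000, size=200),
    rng.normal(2.0, 1.0, size=200),
    rng.normal(1500.0, 500.0, size=200),
])

# PCA on raw values: the large-scale column dominates the first component
ratio_raw = PCA().fit(X_demo).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance
X_demo_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA().fit(X_demo_scaled).explained_variance_ratio_

print("Raw:   ", ratio_raw)
print("Scaled:", ratio_scaled)
```

If the same pattern holds for the real data, fitting PCA on a standardized copy of the features may give a more informative split of variance across components.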

[123]: import numpy as np
import matplotlib.pyplot as plt

# 'explained_variance_ratio' holds the explained variance ratio of each
# principal component, taken from the fitted PCA above

# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

# Plot the cumulative explained variance against the number of components
plt.plot(range(1, len(cumulative_variance_ratio) + 1),
         cumulative_variance_ratio, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Scree Plot')
plt.grid(True)
plt.show()
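
The cumulative curve can also be read programmatically. A small sketch, assuming a variance threshold is the selection rule; the helper name `components_for_threshold` and the example ratios are hypothetical.

```python
import numpy as np

def components_for_threshold(ratios, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    # searchsorted returns the first index where cumulative >= threshold
    index = int(np.searchsorted(cumulative, threshold, side='left'))
    # Cap at the component count in case rounding keeps the total below threshold
    return min(index + 1, len(cumulative))

example_ratios = [0.5, 0.25, 0.25]  # hypothetical explained variance ratios
n_needed = components_for_threshold(example_ratios, 0.75)
print(n_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which selects components by explained variance directly.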

2.Biplot
[180]:

---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[180], line 3
1 import pandas as pd
2 import matplotlib.pyplot as plt
----> 3 from prince import PCA
5 # Assuming X contains your standardized data
6 pca = PCA(n_components=2)

ModuleNotFoundError: No module named 'prince'

[183]: pip install prince

Requirement already satisfied: prince in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages
(0.13.0)
Requirement already satisfied: altair<6.0.0,>=4.2.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (5.3.0)
Requirement already satisfied: pandas<3.0.0,>=1.4.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (2.1.3)
Requirement already satisfied: scikit-learn<2.0.0,>=1.0.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (1.4.2)
Requirement already satisfied: jinja2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (3.1.2)
Requirement already satisfied: packaging in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (23.2)
Requirement already satisfied: jsonschema>=3.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (4.20.0)
Requirement already satisfied: typing-extensions>=4.0.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (4.7.1)
Requirement already satisfied: toolz in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (0.12.1)
Requirement already satisfied: numpy in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (1.26.2)
Requirement already satisfied: pytz>=2020.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2023.3.post1)
Requirement already satisfied: python-dateutil>=2.8.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2023.3)
Requirement already satisfied: scipy>=1.6.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (1.12.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (3.2.0)
Requirement already satisfied: joblib>=1.2.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (1.3.2)
Requirement already satisfied: attrs>=22.2.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (23.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (2023.11.1)
Requirement already satisfied: referencing>=0.28.4 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (0.31.1)
Requirement already satisfied: rpds-py>=0.7.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (0.13.2)
Requirement already satisfied: six>=1.5 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
python-dateutil>=2.8.2->pandas<3.0.0,>=1.4.1->prince) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jinja2->altair<6.0.0,>=4.2.2->prince) (2.1.3)
Note: you may need to restart the kernel to use updated packages.

[notice] A new release of pip available: 22.2.2 -> 24.0


[notice] To update, run: python.exe -m pip install --upgrade pip

[199]: import pandas as pd
import numpy as np
from prince import PCA
from sklearn.impute import SimpleImputer  # required for SimpleImputer below

# Impute missing values with the mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# Wrap the imputed ndarray in a DataFrame before passing it to prince
X_df = pd.DataFrame(X_imputed)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)

[209]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with the mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# df_scaled is assumed to hold the standardized data
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Extract loadings of each feature for the first two principal components
loadings = pca.components_.T[:, :2]

# Plot the first two principal components
plt.figure(figsize=(8, 7))
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.5)

# Plot the loadings as vectors, one arrow and one label per feature
for i, feature in enumerate(df_scaled):
    plt.arrow(0, 0, loadings[i, 0], loadings[i, 1], color='r', alpha=0.5)
    plt.text(loadings[i, 0], loadings[i, 1], feature, color='g', fontsize=10,
             ha='center', va='center')

# Set labels and title
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Biplot of Principal Components')
plt.grid(True)

plt.show()

[189]: import pandas as pd
import numpy as np
from prince import PCA

# Assuming X is your ndarray


X_df = pd.DataFrame(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)

3.Pairplot
[130]: import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt  # plt.show() below needs this import

sns.pairplot(df)
plt.show()

[132]: from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Now proceed with PCA using X_imputed as the input

[135]: df

[135]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \
0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]

[138]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Separate features (X) and target variable (y)
X = df.drop(columns=['lastsoldprice'])
y = df['lastsoldprice']

# Handle missing values using imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Create a DataFrame for the principal components
pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Create a scatterplot matrix with both principal components
combined_df = pd.concat([pc_df, y], axis=1)
sns.pairplot(combined_df, hue='lastsoldprice')
plt.show()
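
Using a continuous column such as `lastsoldprice` as `hue` may give one color and one legend entry per distinct price, which is hard to read. One possible alternative, sketched here under the assumption that three price bands are enough for the plot, is to bin the prices with `pd.qcut` first. The name `price_band` and the labels are illustrative; the six prices are taken from rows visible in the `df` preview.

```python
import pandas as pd

# Six sale prices taken from rows shown in the df preview
prices = pd.Series(
    [665000, 1050000, 1950000, 2735000, 4200000, 7350000],
    name='lastsoldprice',
)

# Split into three equal-sized quantile bins with readable labels
price_band = pd.qcut(prices, q=3, labels=['low', 'mid', 'high'])
price_band.name = 'price_band'

print(pd.concat([prices, price_band], axis=1))
```

A column built this way could then be concatenated with the principal components and passed as `hue='price_band'` instead of the raw price.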

[176]: import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# Assuming X contains your standardized data
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Create a heatmap of the principal components
plt.figure(figsize=(10, 6))
sns.heatmap(principal_components, cmap='coolwarm', annot=True, fmt=".2f",
            cbar=True)
plt.xlabel('Principal Component')
plt.ylabel('Data Point')
plt.title('Heatmap of Principal Components')
plt.show()

[ ]:

Score plot
[152]: import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Assuming your data is stored in a DataFrame called 'df'

# Impute missing values
imputer = SimpleImputer(strategy='mean')  # You can change the strategy as needed
imputed_data = imputer.fit_transform(df)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(imputed_data)

# Perform PCA
pca = PCA(n_components=2)  # Reduce to 2 components for a 2D plot
principal_components = pca.fit_transform(scaled_data)

# Plot the score plot with quadrants
pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
plt.figure(figsize=(8, 6))
plt.scatter(pc_df['PC1'], pc_df['PC2'])
plt.axhline(y=0, color='k', linestyle='--')  # Add horizontal line at y=0
plt.axvline(x=0, color='k', linestyle='--')  # Add vertical line at x=0
plt.text(0.5, 0.5, ' I', fontsize=12, ha='center')
plt.text(-0.5, 0.5, ' II', fontsize=12, ha='center')
plt.text(-0.5, -0.5, ' III', fontsize=12, ha='center')
plt.text(0.5, -0.5, ' IV', fontsize=12, ha='center')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Score Plot ')
plt.grid(True)
plt.show()

0.3 ADDITIONAL EXPLORATION
[148]: from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Note: the model below is evaluated on df_scaled, not on PCA-reduced data

# Initialize the chosen model (RandomForestRegressor here; any other
# regressor can be substituted)
model = RandomForestRegressor()

# Perform 5-fold cross-validation
scores = cross_val_score(model, df_scaled, y, cv=5,
                         scoring='neg_mean_squared_error')

# Print the mean cross-validated MSE (the scores are negated MSE values)
print("Cross-validation Mean Squared Error:", -scores.mean())

Cross-validation Mean Squared Error: 195630660332.1332
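
The square root of this mean MSE is about 4.4e5, which is easier to interpret because it is in the units of the target. The sketch below shows one way to report RMSE while running imputation, scaling, and PCA inside each cross-validation fold through a scikit-learn pipeline. It uses synthetic stand-in data (`X_demo` and `y_demo` are hypothetical) and keeps the target out of the feature matrix. If `df_scaled` was built from all seven columns of `df`, it would contain `lastsoldprice` itself, which this sketch deliberately avoids.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in features with a few missing values
X_demo = rng.normal(size=(120, 6))
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Hypothetical target that depends on the first feature plus noise
y_demo = 3.0 * np.nan_to_num(X_demo[:, 0]) + rng.normal(scale=0.1, size=120)

# Each preprocessing step is refit on the training part of every fold
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    PCA(n_components=2),
    RandomForestRegressor(n_estimators=20, random_state=0),
)

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5,
                         scoring='neg_mean_squared_error')

# Convert the mean negated MSE into an RMSE in the target's units
rmse = np.sqrt(-scores.mean())
print("Cross-validation RMSE:", rmse)
```

Fitting the preprocessing inside the pipeline keeps information from the held-out fold out of the scaler and the PCA.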

[170]: import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Create an imputer object (you can replace 'mean' with 'median' or
# 'most_frequent')
imputer = SimpleImputer(strategy='mean')

# Fit the imputer to df_scaled and transform it
X_imputed = imputer.fit_transform(df_scaled)

# Perform PCA on the imputed, standardized data
pca = PCA(n_components=2)  # Choose the number of components you want
principal_components = pca.fit_transform(X_imputed)

# Get the loadings (coefficients) for each PC
loadings = pca.components_

# Create a DataFrame to display the loadings
loadings_df = pd.DataFrame(loadings, columns=df_scaled.columns,
                           index=[f'PC{i+1}' for i in range(pca.n_components_)])

# Display the loadings DataFrame
print("Loadings (Coefficients) for Each Principal Component:")
print(loadings_df)

Loadings (Coefficients) for Each Principal Component:


0 1 2 3 4 5 6
PC1 0.443793 0.414535 0.467587 0.420565 0.017968 -0.195893 0.443845
PC2 -0.092340 -0.005345 -0.067513 -0.077041 -0.750017 -0.646715 -0.013622
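
The integer column labels make the loadings hard to read. The sketch below attaches feature names and ranks features by absolute loading. It assumes the seven columns follow the order shown in the `df` preview, which the notebook does not confirm; the numeric values are copied from the printed output.

```python
import pandas as pd

# Assumed column order, taken from the df preview (not confirmed by the notebook)
feature_names = ['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice',
                 'latitude', 'longitude', 'totalrooms']

# Loadings copied from the printed output
loadings_named = pd.DataFrame(
    [[0.443793, 0.414535, 0.467587, 0.420565, 0.017968, -0.195893, 0.443845],
     [-0.092340, -0.005345, -0.067513, -0.077041, -0.750017, -0.646715, -0.013622]],
    index=['PC1', 'PC2'],
    columns=feature_names,
)

# Rank features by the magnitude of their loading on each component
top_pc1 = loadings_named.loc['PC1'].abs().sort_values(ascending=False)
top_pc2 = loadings_named.loc['PC2'].abs().sort_values(ascending=False)

print(top_pc1.head(3))
print(top_pc2.head(3))
```

Under that column-order assumption, PC1 weights the size and price columns roughly equally, while PC2 is dominated by the two location columns.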

[164]: df

[164]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \
0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]
