Cia Code

The document discusses preprocessing a real estate price prediction dataset. It describes dropping irrelevant columns, handling missing values by imputing with the median, detecting outliers using z-scores and boxplots, and removing outliers from the dataset.

Uploaded by

SUHANI LARIYA 22112338


cia-code

April 23, 2024

Importing data
[32]: import pandas as pd
df1 = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')
df = pd.DataFrame(df1)
df

[32]: address bathrooms bedrooms finishedsqft \

0 2243 Franklin St 2.0 2 1463.0
1 2002 Pacific Ave APT 4 3.5 3 3291.0
2 1945 Washington St APT 411 1.0 1 653.0
3 1896 Pacific Ave APT 802 2.5 2 2272.0
4 1840 Washington St APT 603 1.0 1 837.0
.. … … … …
434 2170 Vallejo St APT 101 2.0 3 2145.0
435 2380 Vallejo St 3.5 4 3042.0
436 2430 Vallejo St 7.5 6 4721.0
437 1859 Green St 1.0 2 1306.0
438 2131 Vallejo St APT 3 1.0 1 1100.0

lastsolddate lastsoldprice latitude longitude neighborhood \

0 02-05-2016 1950000 37.795139 -122.425309 Pacific Heights
1 1/22/2016 4200000 37.794429 -122.428513 Pacific Heights
2 12/16/2015 665000 37.792472 -122.425281 Pacific Heights
3 12/17/2014 2735000 37.794706 -122.426347 Pacific Heights
4 12-02-2015 1050000 37.793212 -122.423744 Pacific Heights
.. … … … … …
434 11/14/2012 1650000 37.795777 -122.433024 Pacific Heights
435 10-01-2012 3195000 37.795330 -122.436540 Pacific Heights
436 9/24/2012 7350000 37.795246 -122.437490 Pacific Heights
437 10/18/2011 1349000 37.796588 -122.429641 Pacific Heights
438 03-04-2016 1250000 37.795255 -122.432880 Pacific Heights

totalrooms usecode yearbuilt zipcode

0 7 Condominium 1900 94109
1 7 Condominium 1961 94109
2 3 Condominium 1987 94109
3 6 Condominium 1924 94109

4 3 Condominium 2012 94109
.. … … … …
434 8 Condominium 1914 94123
435 10 SingleFamily 1908 94123
436 13 SingleFamily 1905 94123
437 5 Condominium 1900 94123
438 5 Condominium 1900 94123

[439 rows x 13 columns]

0.1 Data Preprocessing

1. Dropping irrelevant columns that add no value to the analysis.

[33]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode', 'yearbuilt', 'zipcode']

df.drop(columns=columns_to_drop, inplace=True)

[34]: df.columns

[34]: Index(['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice', 'latitude',

'longitude', 'totalrooms'],
dtype='object')

[35]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bathrooms 439 non-null float64
1 bedrooms 439 non-null int64
2 finishedsqft 438 non-null float64
3 lastsoldprice 439 non-null int64
4 latitude 439 non-null float64
5 longitude 437 non-null float64
6 totalrooms 439 non-null int64
dtypes: float64(4), int64(3)
memory usage: 24.1 KB
2. Detecting and plotting missing values
[36]: missing_values = df.isnull().sum()
print("Missing values in the data:",missing_values)

Missing values in the data: bathrooms 0

bedrooms 0

finishedsqft 1
lastsoldprice 0
latitude 0
longitude 2
totalrooms 0
dtype: int64

[37]: import matplotlib.pyplot as plt

# Calculate the number of missing values in each column

missing_values = df.isnull().sum()

# Plot the missing values

plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar')
plt.title('Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()

3. Handling missing values

[38]: missing_values = df.isnull().sum()
print("Missing values before handling:")
print(missing_values)

Missing values before handling:

bathrooms 0
bedrooms 0
finishedsqft 1
lastsoldprice 0
latitude 0
longitude 2
totalrooms 0
dtype: int64

[39]: # Impute missing values with the median of each column

from sklearn.impute import SimpleImputer

numerical_features = df.select_dtypes(include=['float64', 'int64']).columns

imputer = SimpleImputer(strategy='median')
df[numerical_features] = imputer.fit_transform(df[numerical_features])

[40]: missing_values_after = df.isnull().sum()

print("\nMissing values after handling:")
print(missing_values_after)

Missing values after handling:

bathrooms 0
bedrooms 0
finishedsqft 0
lastsoldprice 0
latitude 0
longitude 0
totalrooms 0
dtype: int64
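As a rough illustration of what median imputation does, the sketch below fills each missing entry with the median of its column using NumPy only. It is a minimal stand-in for the `SimpleImputer(strategy='median')` call above, not a replacement for it; the function name `impute_median` and the toy values are hypothetical and not taken from the dataset.

```python
import numpy as np

def impute_median(X):
    """Return a copy of X with each NaN replaced by its column's median."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)       # per-column median, ignoring NaN
    rows, cols = np.where(np.isnan(X))      # positions of the missing entries
    X[rows, cols] = medians[cols]
    return X

# Toy data (hypothetical values): one gap in each column
toy = np.array([
    [1000.0, -122.42],
    [np.nan, -122.43],
    [3000.0, np.nan],
    [20000.0, -122.44],
])
filled = impute_median(toy)
print(filled)
```

In the first toy column the median of the observed values is 3000, while the mean would be 8000 because of the single large value. That robustness to extreme values is the usual reason to prefer the median for skewed columns such as floor area or price.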

[22]: #no missing values

4. Detecting Outliers
[41]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Load the dataset (note: reloading the CSV discards the column drops and imputation done above)
df = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')

# Select numerical columns for outlier detection
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Create box plots for numerical columns to visualize outliers (a 2 x 3 grid holds at most six panels)
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(data=df[column])
    plt.title(column)
plt.tight_layout()
plt.show()

# Identify outliers using z-score
z_scores = zscore(df[numerical_columns])
outliers = (z_scores > 3) | (z_scores < -3)

# Plot outliers using scatter plot
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    plt.scatter(df.index, df[column], c=outliers[:, i-1], cmap='coolwarm', alpha=0.5)
    plt.title(column)
    plt.xlabel('Index')
    plt.ylabel(column)
plt.tight_layout()
plt.show()

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[41], line 15
13 plt.figure(figsize=(12, 6))
14 for i, column in enumerate(numerical_columns, 1):
---> 15 plt.subplot(2, 3, i)
16 sns.boxplot(data=df[column])
17 plt.title(column)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)

1422 fig = gcf()

1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428 # If we found an Axes at the position, we can re-use it if the user passed no
1429 # kwargs or if the axes class and kwargs are identical.
1430 if (ax.get_subplotspec() == key
1431 and (kwargs == {}
1432 or (ax._projection_init
1433 == fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure, args)

597 else:
598 if not isinstance(num, Integral) or num < 1 or num > rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <= {rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]

ValueError: num must be an integer with 1 <= num <= 6, not 7
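The error is raised because `plt.subplot(2, 3, i)` only accepts indices 1 to 6, while the loop reaches a seventh numeric column. One possible fix, sketched below, is to derive the grid size from the number of columns instead of hard-coding it. The helper name `subplot_grid` and the placeholder column names are assumptions made for illustration.

```python
import math

def subplot_grid(n_plots, n_cols=3):
    """Return (n_rows, n_cols) with enough cells for n_plots panels."""
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols

# Hypothetical list of seven numeric columns
columns = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
n_rows, n_cols = subplot_grid(len(columns))

# Every loop index now fits the grid, which is the condition
# that plt.subplot(n_rows, n_cols, i) checks before creating an Axes.
indices = list(range(1, len(columns) + 1))
all_fit = all(1 <= i <= n_rows * n_cols for i in indices)
print(n_rows, n_cols, all_fit)
```

In the plotting loop, `plt.subplot(n_rows, n_cols, i)` would then replace `plt.subplot(2, 3, i)`, and the figure height may need to grow with `n_rows`.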

[42]: df_no_outliers = df[~outliers.any(axis=1)]
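The z-score rule used here flags a value when it lies more than three standard deviations from its column mean, and a row is dropped if any of its values is flagged. The sketch below reproduces that rule with NumPy on toy data. Unlike `scipy.stats.zscore` with its default `nan_policy='propagate'`, it ignores missing values when computing the mean and standard deviation, which matters because the reloaded frame still contains NaN. The function name `zscore_outlier_mask` and the toy columns are hypothetical.

```python
import numpy as np

def zscore_outlier_mask(X, threshold=3.0):
    """Boolean mask, True where |z| exceeds threshold (NaN is never flagged)."""
    X = np.asarray(X, dtype=float)
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)      # population std (ddof=0)
    z = (X - mean) / std
    # Comparisons with NaN evaluate to False, so missing entries stay unflagged
    return np.abs(z) > threshold

# Toy data: column 0 has one extreme value, column 1 has one missing value
col_a = np.ones(21)
col_a[20] = 100.0
col_b = np.arange(21.0)
col_b[3] = np.nan
X = np.column_stack([col_a, col_b])

mask = zscore_outlier_mask(X)
kept = X[~mask.any(axis=1)]         # same row filter as df[~outliers.any(axis=1)]
print(mask.sum(axis=0), kept.shape)
```

A fixed threshold of 3 is a convention rather than a rule; for heavily skewed columns such as sale price, an IQR-based rule or a log transform may be worth comparing.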

[43]: plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(data=df_no_outliers[column])
    plt.title(column)
plt.tight_layout()
plt.show()

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[43], line 3
1 plt.figure(figsize=(12, 6))
2 for i, column in enumerate(numerical_columns, 1):
----> 3 plt.subplot(2, 3, i)
4 sns.boxplot(data=df_no_outliers[column])
5 plt.title(column)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)

1422 fig = gcf()

1429 # kwargs or if the axes class and kwargs are identical.

1430 if (ax.get_subplotspec() == key
1431 and (kwargs == {}
1432 or (ax._projection_init
1433 == fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure, args)

ValueError: num must be an integer with 1 <= num <= 6, not 7

[29]: #the outliers have been reduced as far as possible

[44]: df.columns

[44]: Index(['address', 'bathrooms', 'bedrooms', 'finishedsqft', 'lastsolddate',

'lastsoldprice', 'latitude', 'longitude', 'neighborhood', 'totalrooms',
'usecode', 'yearbuilt', 'zipcode'],
dtype='object')

[45]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode', 'yearbuilt', 'zipcode']

df.drop(columns=columns_to_drop, inplace=True)
df

[45]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \

0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms

0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]

[ ]: #CORRELATION ESTIMATION

[126]: import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix

corr_matrix = df.corr()

# Set custom color palette

colors = sns.color_palette("coolwarm", as_cmap=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap=colors, fmt=".2f", linewidths=0.5, cbar=False)

plt.title('Correlation Matrix', fontsize=16, fontweight='bold')

plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=12)
plt.show()
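For reference, `df.corr()` computes Pearson correlation by default: the covariance of two columns divided by the product of their standard deviations. The sketch below spells that out with NumPy on toy data and checks it against `np.corrcoef`. The function name `pearson_corr` and the toy values are hypothetical.

```python
import numpy as np

def pearson_corr(X):
    """Pearson correlation matrix of the columns of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each column
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix
    sd = np.sqrt(np.diag(cov))                 # per-column standard deviation
    return cov / np.outer(sd, sd)

# Toy data: column 1 is an exact linear function of column 0
sqft = np.array([650.0, 840.0, 1460.0, 2270.0, 3290.0])
toy = np.column_stack([sqft, 2.0 * sqft + 1.0, sqft[::-1]])

corr = pearson_corr(toy)
print(np.round(corr, 3))
```

Because correlation is scale-free, it is not affected by the large difference in units between columns such as `lastsoldprice` and `latitude`, unlike the covariance matrix used later.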

0.2 PRINCIPAL COMPONENT ANALYSIS
1. Standardization
[52]: df_index = df.index

[53]: numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

[54]: import pandas as pd

from sklearn.preprocessing import StandardScaler

[76]: scaler = StandardScaler()

scaled_features = scaler.fit_transform(df)

df_scaled = pd.DataFrame(scaled_features)

df_scaled

[76]: 0 1 2 3 4 5 6
0 -0.285556 -0.399122 -0.403174 -0.211435 1.185443 1.148603 0.105818
1 0.698425 0.140000 0.680420 0.624953 0.874861 0.720970 0.105818
2 -0.941543 -0.938244 -0.883323 -0.689106 0.018791 1.152340 -0.909570
3 0.042437 -0.399122 0.076381 0.080372 0.996031 1.010062 -0.148029
4 -0.941543 -0.938244 -0.774252 -0.545990 0.342496 1.357481 -0.909570
.. … … … … … … …
434 -0.285556 0.140000 0.001099 -0.322953 1.464529 0.118894 0.359665
435 0.698425 0.679121 0.532819 0.251367 1.268994 -0.350380 0.867359
436 3.322374 1.757365 1.528090 1.795897 1.232249 -0.477175 1.628900
437 -0.941543 -0.399122 -0.496240 -0.434844 1.819293 0.570418 -0.401876
438 -0.941543 -0.938244 -0.618352 -0.471645 1.236186 0.138114 -0.401876

[439 rows x 7 columns]
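Standardization rescales every column to zero mean and unit variance so that no feature dominates purely because of its units. The sketch below shows the underlying arithmetic with NumPy, using the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows. The function name `standardize` and the toy values are hypothetical.

```python
import numpy as np

def standardize(X):
    """Column-wise (x - mean) / std with population std (ddof=0)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Toy data (hypothetical): rooms, floor area, price on very different scales
toy = np.array([
    [3.0, 650.0, 665000.0],
    [7.0, 1460.0, 1950000.0],
    [6.0, 2270.0, 2735000.0],
    [7.0, 3290.0, 4200000.0],
])
scaled = standardize(toy)
print(np.round(scaled, 3))
```

When wrapping the scaled array in a DataFrame, passing `columns=df.columns` and `index=df.index` to `pd.DataFrame` keeps the original labels instead of the integer headers 0 to 6 shown in the output above.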

2. Covariance Matrix Computation

[90]: import numpy as np
import pandas as pd
data_filled = df.fillna(df.mean())

# Compute the covariance matrix (note: this uses the unscaled df, not df_scaled)

covariance_matrix = np.cov(data_filled, rowvar=False)
column_names = ['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice', 'latitude',
                'longitude', 'totalrooms']
covariance_df = pd.DataFrame(covariance_matrix, columns=column_names, index=column_names)

# Print the covariance matrix DataFrame

print("Covariance Matrix:")
print(covariance_df)

Covariance Matrix:
bathrooms bedrooms finishedsqft lastsoldprice \
bathrooms 2.329161e+00 2.017295e+00 2.231689e+03 3.157419e+06
bedrooms 2.017295e+00 3.448394e+00 2.356723e+03 3.033648e+06
finishedsqft 2.231689e+03 2.356723e+03 2.845895e+06 3.819981e+09
lastsoldprice 3.157419e+06 3.033648e+06 3.819981e+09 7.253364e+12
latitude 3.289555e-04 8.149072e-05 3.582618e-01 7.795914e+02
longitude -2.819931e-03 -3.999855e-03 -3.881395e+00 -6.205566e+03
totalrooms 4.699683e+00 5.724836e+00 5.793232e+03 7.124227e+06

latitude longitude totalrooms

bathrooms 0.000329 -0.002820 4.699683e+00
bedrooms 0.000081 -0.004000 5.724836e+00

finishedsqft 0.358262 -3.881395 5.793232e+03
lastsoldprice 779.591402 -6205.566010 7.124227e+06
latitude 0.000005 0.000009 2.110409e-04
longitude 0.000009 0.000056 -9.048211e-03
totalrooms 0.000211 -0.009048 1.555414e+01
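The variance of `lastsoldprice` (about 7.25e12) is many orders of magnitude larger than every other entry, because the covariance was computed on the unscaled data. The sketch below uses synthetic data to show the effect: on raw values the largest-scale column accounts for nearly all of the total variance, while the covariance of standardized data equals the correlation matrix. All variable names and the generated values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic columns on very different scales (hypothetical, not the real data)
sqft = rng.normal(1500.0, 500.0, size=n)
price = 1000.0 * sqft + rng.normal(0.0, 1e5, size=n)
rooms = sqft / 300.0 + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([rooms, sqft, price])

# Covariance of the raw data: dominated by the price column
cov_raw = np.cov(X, rowvar=False)
share_raw = np.diag(cov_raw) / np.trace(cov_raw)

# Covariance of standardized data (sample std, ddof=1) equals the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_std = np.cov(Z, rowvar=False)

print("share of total variance per raw column:", share_raw)
print("diagonal after standardizing:", np.diag(cov_std))
```

If the goal is for PCA to weigh all features comparably, computing the covariance on the standardized frame (`df_scaled`) rather than on `df` would be the usual choice.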

[ ]:

3. Eigen Decomposition
[94]: import pandas as pd

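# Convert eigenvalues and eigenvectors to DataFrame
# (eigenvalues and eigenvectors are not defined in the cells shown here; they are presumably computed from covariance_matrix in an earlier cell)

eigen_df = pd.DataFrame({'Eigenvalue': eigenvalues})
eigen_df['Eigenvector'] = [eigenvectors[:, i] for i in range(eigenvectors.shape[1])]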

# Print the DataFrame

print("Eigenvalues and Eigenvectors:")
print(eigen_df)

Eigenvalues and Eigenvectors:

Eigenvalue Eigenvector
0 7.253366e+12 [-4.3530400936506286e-07, -4.182401136422737e-…
1 8.341100e+05 [0.0006819708177894264, 0.0009100232661667038,…
2 3.909019e+00 [-0.07635453461451713, -0.34829146033302216, 0…
3 1.175348e+00 [0.15061016522339143, 0.9222261380364992, -7.0…
4 5.325269e-01 [0.9856396143372855, -0.16790163752836, -0.000…
5 5.091827e-05 [0.0008041469959149176, -0.0004697221463473827…
6 3.293921e-06 [4.783139021784667e-05, -8.824796320393624e-06…
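The cell that computes `eigenvalues` and `eigenvectors` does not appear in this export. One common way to obtain them, sketched below, is `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns eigenvalues in ascending order, so they are reversed here to put the largest first. This is an assumption about how the step could be done, not a record of what the notebook ran; `sorted_eigh` and the small matrix are hypothetical.

```python
import numpy as np

def sorted_eigh(cov):
    """Eigenvalues (descending) and matching eigenvectors (as columns)."""
    vals, vecs = np.linalg.eigh(cov)    # ascending order for symmetric input
    order = np.argsort(vals)[::-1]      # indices from largest to smallest
    return vals[order], vecs[:, order]

# Small symmetric matrix standing in for a covariance matrix
cov = np.array([
    [4.0, 2.0],
    [2.0, 3.0],
])
vals, vecs = sorted_eigh(cov)
print(vals)
print(vecs)
```

Each column `vecs[:, i]` pairs with `vals[i]`, which is the same column convention the notebook relies on when it slices `eigenvectors[:, i]`.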
4. Rearranging eigenvectors by respective eigenvalues
[96]: import pandas as pd

# Assuming 'eigenvectors' contains the eigenvectors computed from the covariance matrix

# Create a DataFrame for eigenvectors

eigenvectors_df = pd.DataFrame(data=eigenvectors, columns=[f'PC{i+1}' for i in range(eigenvectors.shape[1])])

# Print the DataFrame

print("Eigenvectors:")
print(eigenvectors_df)

Eigenvectors:
PC1 PC2 PC3 PC4 PC5 \
0 -4.353040e-07 6.819708e-04 -7.635453e-02 1.506102e-01 9.856396e-01
1 -4.182401e-07 9.100233e-04 -3.482915e-01 9.222261e-01 -1.679016e-01

2 -5.266495e-04 9.999962e-01 2.655423e-03 -7.046468e-05 -4.754296e-04
3 -9.999999e-01 -5.266507e-04 -3.019358e-07 -6.439164e-08 -9.080315e-08
4 -1.074800e-10 -6.271480e-08 1.190288e-04 -2.203851e-05 1.232789e-04
5 8.555430e-10 -7.352065e-07 4.166878e-04 -1.856730e-04 8.704261e-04
6 -9.821963e-07 2.447252e-03 -9.342675e-01 -3.561116e-01 -1.796084e-02

PC6 PC7
0 8.041470e-04 4.783139e-05
1 -4.697221e-04 -8.824796e-06
2 9.181873e-09 -1.982864e-07
3 -6.302828e-10 -2.212895e-10
4 -1.940234e-01 9.809969e-01
5 -9.809964e-01 -1.940235e-01
6 -3.528579e-04 3.782708e-05
5. Selecting the best features k
[106]: from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# 'df' is the data, which may still contain missing values

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
imputed_features = imputer.fit_transform(df)

# Apply PCA
pca = PCA()
pca.fit(imputed_features)

[106]: PCA()

[107]: cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance ratio

plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, marker='o', linestyle='-')

plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio by Number of Components')
plt.grid(True)
plt.show()

[158]: k = 3 # Number of selected best features
best_k = eigenvectors[:, :k]

# Print the eigenvectors (note: this loop prints all of them, not only the first k)

for i in range(eigenvectors.shape[1]):
    print("The selected Principal Components are:")
    print(f'PC{i+1}: {eigenvectors[:, i]}')

The selected Principal Components are:
PC1: [-4.35304009e-07 -4.18240114e-07 -5.26649488e-04 -9.99999861e-01
 -1.07479952e-10 8.55543039e-10 -9.82196278e-07]
The selected Principal Components are:
PC2: [ 6.81970818e-04 9.10023266e-04 9.99996220e-01 -5.26650652e-04
 -6.27147956e-08 -7.35206512e-07 2.44725160e-03]
The selected Principal Components are:
PC3: [-7.63545346e-02 -3.48291460e-01 2.65542278e-03 -3.01935766e-07
 1.19028810e-04 4.16687798e-04 -9.34267523e-01]
The selected Principal Components are:
PC4: [ 1.50610165e-01 9.22226138e-01 -7.04646780e-05 -6.43916411e-08
 -2.20385057e-05 -1.85673016e-04 -3.56111624e-01]
The selected Principal Components are:
PC5: [ 9.85639614e-01 -1.67901638e-01 -4.75429552e-04 -9.08031536e-08
 1.23278868e-04 8.70426120e-04 -1.79608433e-02]
The selected Principal Components are:
PC6: [ 8.04146996e-04 -4.69722146e-04 9.18187349e-09 -6.30282826e-10
 -1.94023389e-01 -9.80996398e-01 -3.52857856e-04]
The selected Principal Components are:
PC7: [ 4.78313902e-05 -8.82479632e-06 -1.98286392e-07 -2.21289454e-10
 9.80996888e-01 -1.94023456e-01 3.78270850e-05]

[112]: import pandas as pd

# Assuming 'best_k' contains the first k eigenvectors

principal_components = pd.DataFrame(best_k, columns=[f'PC{i+1}' for i in range(best_k.shape[1])])

print("Selected Principal Components:")

print(principal_components)

Selected Principal Components:

PC1 PC2
0 -4.353040e-07 6.819708e-04
1 -4.182401e-07 9.100233e-04
2 -5.266495e-04 9.999962e-01
3 -9.999999e-01 -5.266507e-04
4 -1.074800e-10 -6.271480e-08
5 8.555430e-10 -7.352065e-07
6 -9.821963e-07 2.447252e-03
6. Projection
[119]: import numpy as np

# 'df' is the original dataset and 'principal_components' holds the selected eigenvectors

# Perform projection
projected_data = np.dot(df, principal_components)

# Check for NaN values in the projected data

nan_indices = np.isnan(projected_data)

# Alternatively, you can remove rows with NaN values

# projected_data = projected_data[~nan_indices.any(axis=1)]

# Print the shape of the projected data

print("Shape of projected data:", projected_data.shape)

# Print the projected data
print("Projected data:")
print(projected_data)

Shape of projected data: (439, 2)

Projected data:
[[-1.95000050e+06 4.36046102e+02]
[-4.20000115e+06 1.07907716e+03]
[-6.65000252e+05 3.02783870e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.05000030e+06 2.84022673e+02]
[-1.76000055e+06 5.73112033e+02]
[-9.75000478e+05 6.51522680e+02]
[-1.09500035e+06 3.73325414e+02]
[-1.22700038e+06 3.93807188e+02]
[-5.40000342e+05 5.06613770e+02]
[ nan nan]
[-1.49500040e+06 3.62668778e+02]
[-1.05000033e+06 3.47024883e+02]
[-8.90000248e+05 2.37292167e+02]
[-1.08500074e+06 1.11659562e+03]
[-1.56500052e+06 5.75799499e+02]
[-1.70000042e+06 3.47702254e+02]
[-3.75000039e+06 -2.54917964e+02]
[-8.05000278e+05 3.16050002e+02]
[-7.02000271e+05 3.30298130e+02]
[-8.75000286e+05 3.13186776e+02]
[-1.49000041e+06 3.85301614e+02]
[-1.42500032e+06 2.29534625e+02]
[-1.22500046e+06 5.54863924e+02]
[-1.56000057e+06 6.78432374e+02]
[-1.07500044e+06 5.48858145e+02]
[-1.30500046e+06 5.34731107e+02]
[-1.42700037e+06 3.17480987e+02]
[-8.95000330e+05 3.91653426e+02]
[-1.60000049e+06 5.00369389e+02]
[-1.15000009e+07 -1.25847323e+03]
[-1.21100042e+06 4.77233315e+02]
[ nan nan]
[-2.73950108e+06 1.32425486e+03]
[-2.25100107e+06 1.44552143e+03]
[-5.30000343e+05 5.11881186e+02]
[-1.60300067e+06 8.55791445e+02]
[-8.10000531e+05 7.95423179e+02]
[-1.05000037e+06 4.30026161e+02]
[-2.50000100e+06 1.24738448e+03]

[-1.80000060e+06 6.68041584e+02]
[-1.02500065e+06 9.60192920e+02]
[-6.25000258e+05 3.25849888e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.61150059e+06 7.01309677e+02]
[-1.40000006e+07 -2.47009155e+03]
[-1.76000055e+06 5.73112033e+02]
[-1.18500063e+06 8.75928816e+02]
[-1.02500038e+06 4.60190771e+02]
[-7.90000475e+05 6.92953262e+02]
[-1.15500043e+06 5.06728199e+02]
[ nan nan]
[-8.15000500e+05 7.34787698e+02]
[-1.13000043e+06 5.13893632e+02]
[-2.68100147e+06 2.08806798e+03]
[-7.95000360e+05 4.74318378e+02]
[-1.90000062e+06 6.69376315e+02]
[-9.50000351e+05 4.16691475e+02]
[-7.45000255e+05 2.88651712e+02]
[-1.75000064e+06 7.45373924e+02]
[-6.83500242e+05 2.80038435e+02]
[-8.25000435e+05 6.09520735e+02]
[-1.00000045e+06 5.91359038e+02]
[-1.14000038e+06 4.29629872e+02]
[-9.20000386e+05 4.90490776e+02]
[-1.17500050e+06 6.49196199e+02]
[-1.20000059e+06 8.05025255e+02]
[-5.00000318e+06 4.72676112e+03]
[-1.80000075e+06 9.59037127e+02]
[-9.15000381e+05 4.82122479e+02]
[-6.56000212e+05 2.29524020e+02]
[-3.70000099e+06 9.08405035e+02]
[-4.20000115e+06 1.07907716e+03]
[-1.30100043e+06 4.68839330e+02]
[-5.50000174e+05 1.86344469e+02]
[-1.18000066e+06 9.43561823e+02]
[-8.90000377e+05 4.81286351e+02]
[-1.97500088e+06 1.15987619e+03]
[-1.08000064e+06 9.31227134e+02]
[-1.30000037e+06 3.56363279e+02]
[-8.50000356e+05 4.52356605e+02]
[-1.12612546e+06 5.76936620e+02]
[-1.29000061e+06 8.21626454e+02]
[-9.50000425e+05 5.57693390e+02]
[-7.22000319e+05 4.15766689e+02]
[-8.03000402e+05 5.51108906e+02]
[-1.00000041e+06 5.14358474e+02]
[-8.25000469e+05 6.73522085e+02]

[-1.61150059e+06 7.01309677e+02]
[-1.70000076e+06 1.00470626e+03]
[-2.05000087e+06 1.12037740e+03]
[-9.15000370e+05 4.61124150e+02]
[-1.50000077e+06 1.06003254e+03]
[-5.71000071e+06 -1.61164325e+02]
[-2.52500220e+06 3.50722891e+03]
[-9.20000662e+05 1.01548879e+03]
[-1.20000035e+06 3.48031022e+02]
[-5.37500462e+05 7.34933236e+02]
[-8.40000621e+05 9.57621221e+02]
[-9.80000592e+05 8.65894752e+02]
[-9.30000253e+05 2.35221175e+02]
[-6.25000235e+05 2.80852506e+02]
[-4.00000051e+06 -7.85910660e+01]
[-3.99500080e+06 4.74040790e+02]
[-1.95000050e+06 4.36046102e+02]
[-1.00000389e+05 7.12341065e+02]
[-9.70000819e+05 1.29915509e+03]
[-1.15500043e+06 5.06728199e+02]
[-3.90000231e+06 3.36309129e+03]
[-9.00000233e+05 2.06020865e+02]
[-7.25000308e+05 3.93184370e+02]
[-8.25000473e+05 6.80520467e+02]
[-9.40000486e+05 6.74959473e+02]
[-6.20000441e+05 6.73482179e+02]
[-4.95000247e+05 3.39313771e+02]
[-1.60000095e+06 1.38537464e+03]
[-7.50000565e+05 8.75021809e+02]
[-1.49500074e+06 1.01267002e+03]
[-2.47000206e+06 3.25920161e+03]
[-8.22500360e+05 4.66835459e+02]
[-1.27500071e+06 1.01253292e+03]
[-5.37000263e+05 3.57192755e+02]
[-1.19500073e+06 1.06966247e+03]
[-1.24000068e+06 9.58960159e+02]
[-8.10000471e+05 6.81421845e+02]
[-9.65000503e+05 7.01790267e+02]
[-6.25000305e+05 4.13849556e+02]
[-7.19000401e+05 5.71346969e+02]
[-1.38500076e+06 1.07060034e+03]
[-1.31000055e+06 7.00099492e+02]
[-7.55000443e+05 6.42390335e+02]
[-9.30000687e+05 1.06022210e+03]
[-1.30000067e+06 9.31366910e+02]
[-2.15000103e+06 1.39572002e+03]
[-7.25000531e+05 8.18186120e+02]
[-7.75000359e+05 4.76855460e+02]

[-8.49000447e+05 6.24882605e+02]
[-5.59000212e+05 2.55609228e+02]
[-1.90000074e+06 9.05376105e+02]
[-1.32500085e+06 1.26020140e+03]
[-1.95000050e+06 4.36046102e+02]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-1.12500035e+06 3.59526569e+02]
[-1.15000058e+06 8.00361262e+02]
[-6.50000113e+06 4.27796070e+02]
[-4.50000076e+06 2.54083462e+02]
[-2.72000067e+06 5.53524033e+02]
[-1.41800050e+06 5.83219175e+02]
[-8.25000311e+05 3.73519180e+02]
[-1.85300048e+06 4.32126187e+02]
[-4.00000114e+06 1.10542164e+03]
[-1.51000072e+06 9.72774593e+02]
[-3.90000091e+06 6.96080545e+02]
[-1.71000012e+07 -2.20571130e+03]
[-1.00000042e+06 5.43356773e+02]
[-3.65000063e+06 2.37735822e+02]
[-9.95000172e+06 6.53841092e+02]
[-4.35000201e+05 2.66910756e+02]
[-1.25000039e+06 4.11696557e+02]
[-2.00000051e+06 4.44708542e+02]
[-2.82500073e+06 6.39226774e+02]
[-1.70000074e+06 9.54702407e+02]
[-1.90500072e+06 8.60744775e+02]
[-1.95000055e+06 5.23038090e+02]
[-8.25000311e+05 3.72519183e+02]
[-8.00000231e+05 2.28686043e+02]
[-8.80000309e+05 3.54557395e+02]
[-1.15500033e+06 3.24726439e+02]
[-2.53200068e+06 6.26540030e+02]
[-6.50000109e+06 3.49784984e+02]
[-4.07500365e+05 5.85399214e+02]
[-3.80000085e+06 6.11744536e+02]
[-1.20000010e+07 -1.31978680e+03]
[-3.69999995e+06 -1.07360136e+03]
[-5.62700147e+06 1.31155559e+03]
[-6.55000100e+06 1.82462146e+02]
[-2.10000073e+06 8.42045475e+02]
[-1.42500063e+06 8.27537259e+02]
[-2.61000064e+06 5.26456608e+02]
[-7.05000090e+06 -1.48860135e+02]
[-1.15000030e+06 2.64361523e+02]
[-3.21000058e+06 2.48462944e+02]
[-2.66000061e+06 4.64118294e+02]

[-2.15000069e+06 7.42717771e+02]
[-7.99500110e+06 -1.05509351e+01]
[-6.41000180e+05 1.73424022e+02]
[-1.19900043e+06 5.07553032e+02]
[-3.60000283e+06 4.42911322e+03]
[-5.60000175e+06 1.84077042e+03]
[-1.26000040e+06 4.25427531e+02]
[-4.97500126e+06 1.08893128e+03]
[-2.41000069e+06 6.74786235e+02]
[-2.38890021e+07 -2.34511635e+03]
[-1.70000064e+06 7.61705584e+02]
[-7.87500145e+06 6.77640060e+02]
[-1.09950019e+07 7.29495682e+02]
[-5.99499996e+06 -1.65525150e+03]
[-3.75000019e+06 -6.33928254e+02]
[-8.30000294e+05 3.39888491e+02]
[-2.25000069e+06 7.16050843e+02]
[-1.46500039e+06 3.53464922e+02]
[-6.80000262e+05 3.18884019e+02]
[-1.13000036e+06 3.82895893e+02]
[-1.24500059e+06 7.86329314e+02]
[-8.85000459e+05 6.37921810e+02]
[-6.50000349e+06 4.91179488e+03]
[-1.02000036e+06 4.09824224e+02]
[-1.99500069e+06 7.82343205e+02]
[-8.90000358e+05 4.45288934e+02]
[-3.35000104e+06 1.08873523e+03]
[-9.90000389e+05 4.78624226e+02]
[-8.05000323e+05 4.02052807e+02]
[-1.25500042e+06 4.69062222e+02]
[-3.60000105e+06 1.04607399e+03]
[-1.25000041e+06 4.41695588e+02]
[-1.97500071e+06 8.29876760e+02]
[-2.52500100e+06 1.22521631e+03]
[-2.40000062e+06 5.36060627e+02]
[-2.22500097e+06 1.26421668e+03]
[-9.41750376e+05 4.65037032e+02]
[-1.30000046e+06 5.23361055e+02]
[-1.15000028e+06 2.22358324e+02]
[-1.80000094e+06 1.31404404e+03]
[-1.95000074e+06 8.98042818e+02]
[-3.45000067e+06 3.59070272e+02]
[-1.84000077e+06 9.69974109e+02]
[-4.15000213e+06 2.94441993e+03]
[-1.70000074e+06 9.54695066e+02]
[-3.75000122e+06 1.32007210e+03]
[-1.08300041e+06 4.91645709e+02]
[-1.90000068e+06 7.87378316e+02]

[-2.67500079e+06 7.91230527e+02]
[-7.75000124e+06 3.04473906e+02]
[-1.63500061e+06 7.24939055e+02]
[-1.07500041e+06 5.03857974e+02]
[-2.10000073e+06 8.31042728e+02]
[-1.52500061e+06 7.55871411e+02]
[-1.45000050e+06 5.66366354e+02]
[-1.16500049e+06 6.32462789e+02]
[-1.31000057e+06 7.29099897e+02]
[-1.15000052e+06 6.80358358e+02]
[-2.15000078e+06 9.10711332e+02]
[-2.90000086e+06 8.72723245e+02]
[-6.41500174e+06 1.62155299e+03]
[-8.60000308e+06 3.59081937e+03]
[-1.01175018e+07 8.46635381e+02]
[-1.10000041e+06 4.92689594e+02]
[-2.01000050e+06 4.19445469e+02]
[-2.07500042e+06 2.57207856e+02]
[-1.07500051e+06 6.85860643e+02]
[-7.15000320e+05 4.18450800e+02]
[-2.15000090e+06 1.15071714e+03]
[-1.40000082e+06 1.18070651e+03]
[-1.00000037e+06 4.42358746e+02]
[-3.20000085e+06 7.79736396e+02]
[-6.50000266e+05 3.34683539e+02]
[-9.95000610e+05 8.96992056e+02]
[-1.02500042e+06 5.22191447e+02]
[-7.40000094e+06 -1.72200080e+02]
[-7.81000310e+05 3.83690320e+02]
[-4.95000075e+06 1.18093644e+02]
[-5.88001921e+05 3.49236230e+03]
[-2.00000076e+06 9.19721485e+02]
[-5.65000114e+06 6.69442220e+02]
[-7.50000620e+05 9.80022322e+02]
[-1.73500078e+06 1.03127753e+03]
[-1.85000082e+06 1.07070498e+03]
[-8.65000136e+06 3.06486074e+02]
[-3.80000085e+06 6.11744536e+02]
[-6.50000109e+06 3.49784984e+02]
[-9.46000514e+05 7.26797769e+02]
[-1.80000076e+06 9.77044455e+02]
[-7.45000281e+05 3.37653974e+02]
[-3.15000133e+06 1.69107816e+03]
[-1.60000073e+06 9.62373106e+02]
[-1.73000090e+06 1.24690508e+03]
[-3.35000104e+06 1.08873523e+03]
[-1.78000081e+06 1.06957403e+03]
[-3.85000126e+06 1.37241552e+03]

[-2.01000116e+06 1.66844979e+03]
[-3.67500085e+06 6.52573174e+02]
[-1.45000065e+06 8.61368368e+02]
[-1.28000049e+06 5.93891394e+02]
[-9.50000426e+05 5.58688491e+02]
[-3.00000110e+06 1.30506067e+03]
[-8.90000387e+05 5.00292765e+02]
[-3.80000279e+06 4.29876503e+03]
[-3.15000140e+06 1.82607247e+03]
[-5.25000104e+06 6.01106210e+02]
[-1.62500028e+06 1.13202089e+02]
[-1.68500082e+06 1.11861228e+03]
[-1.26100074e+06 1.07391487e+03]
[-7.12500518e+05 7.96768790e+02]
[-1.74000086e+06 1.18363948e+03]
[-7.30000357e+05 4.85550757e+02]
[-8.40000320e+05 3.85620026e+02]
[-6.35000307e+05 4.15583023e+02]
[-8.80000359e+06 4.49049300e+03]
[-1.36800084e+06 1.22555410e+03]
[-1.02500096e+06 1.56019242e+03]
[-1.78200060e+06 6.61517999e+02]
[-1.20000081e+06 1.22402771e+03]
[-2.85500192e+06 2.89643477e+03]
[-9.60000394e+06 4.94419228e+03]
[-1.75000051e+06 5.13371784e+02]
[-1.70000074e+06 9.54702407e+02]
[-2.10000083e+06 1.01604937e+03]
[-8.85000376e+05 4.80922744e+02]
[-9.80000417e+05 5.33890202e+02]
[-1.65000031e+06 1.49037403e+02]
[-8.60000276e+06 2.97081705e+03]
[-7.30000228e+05 2.40551683e+02]
[-1.62900053e+06 5.72099207e+02]
[-1.45000080e+06 1.13236490e+03]
[-2.00000101e+06 1.39371076e+03]
[-1.77000065e+06 7.67838339e+02]
[-5.32000208e+06 2.55023014e+03]
[-2.00000070e+06 8.02712312e+02]
[-1.80000083e+06 1.10504642e+03]
[-6.70000352e+05 4.91149895e+02]
[-4.45000128e+06 1.25642084e+03]
[-1.58950079e+06 1.07990286e+03]
[-1.61000072e+06 9.52102065e+02]
[-2.51500109e+06 1.41548826e+03]
[-4.99900132e+06 1.18728965e+03]
[-6.55000186e+05 1.80050860e+02]
[-2.15000069e+06 7.42717771e+02]

[-3.35000194e+06 2.80673454e+03]
[-1.05000029e+06 2.77023382e+02]
[-8.95000424e+05 5.69655201e+02]
[-1.34900042e+06 4.34559451e+02]
[-1.30000080e+06 1.17136265e+03]
[-1.08500067e+06 9.86596109e+02]
[-1.34000061e+06 8.06301277e+02]
[-2.16500102e+06 1.35980990e+03]
[-9.65000281e+05 2.79790611e+02]
[-8.30000502e+05 7.34887680e+02]
[-9.50000939e+05 1.53369061e+03]
[-7.94000449e+05 6.43844390e+02]
[-6.50000277e+06 3.54478218e+03]
[-1.47500063e+06 8.10199798e+02]
[-2.05000134e+06 2.01038171e+03]
[-4.99000099e+06 5.72030316e+02]
[-5.00000149e+06 1.50876014e+03]
[-9.10000458e+05 6.30758537e+02]
[-7.50000105e+06 1.01425589e+01]
[-1.15000050e+06 6.46361844e+02]
[-1.80000080e+06 1.05204594e+03]
[-1.25000075e+06 1.09169962e+03]
[-1.27500062e+06 8.48533539e+02]
[-7.05000433e+05 6.36716502e+02]
[-8.72500461e+05 6.46505505e+02]
[-2.60000116e+06 1.52672146e+03]
[-1.10000037e+06 4.20693564e+02]
[-2.10000097e+06 1.28204694e+03]
[-1.19900043e+06 5.07553032e+02]
[-1.62500100e+06 1.47820308e+03]
[-2.71000073e+06 6.77793788e+02]
[-2.53200068e+06 6.26540030e+02]
[-3.21000058e+06 2.48462944e+02]
[-1.50000077e+06 1.06003254e+03]
[-1.90300029e+06 4.27993523e+01]
[-3.50005829e+04 1.09757852e+03]
[-7.20000459e+05 6.82816538e+02]
[-1.30000046e+06 5.39365034e+02]
[-3.80000279e+06 4.29876503e+03]
[-4.62100210e+06 2.76636720e+03]
[-7.70000562e+05 8.64489024e+02]
[-1.60000075e+06 1.00736747e+03]
[-4.00000168e+06 2.14540963e+03]
[-6.26000357e+05 5.12324978e+02]
[-1.85300048e+06 4.32126187e+02]
[-1.16600078e+06 1.17093655e+03]
[-3.35000373e+06 6.19278769e+03]
[-1.10000047e+06 6.07692175e+02]

[-4.00000113e+06 1.09340763e+03]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-2.40000077e+06 8.32056547e+02]
[-7.45000014e+06 -1.69352958e+03]
[-1.30750030e+07 2.30405906e+03]
[-2.50000052e+06 3.30386011e+02]
[-4.15000071e+06 2.56414152e+02]
[-2.30000030e+06 -4.02878644e+01]
[-9.00000190e+06 1.23215198e+03]
[-3.50000192e+06 2.72376843e+03]
[-1.25000028e+07 2.09188138e+03]
[-1.30000050e+06 6.12362311e+02]
[-1.51000049e+06 5.29763976e+02]
[-1.67600019e+07 -8.26659911e+02]
[-1.20000016e+07 -2.08787587e+02]
[-2.98900122e+06 1.53084935e+03]
[-2.81000056e+06 3.29121021e+02]
[-2.52500165e+06 2.47022049e+03]
[-2.47500083e+06 9.21551970e+02]
[-1.15000049e+06 6.30358547e+02]
[-1.00900030e+06 2.96617831e+02]
[-7.45000014e+06 -1.69352958e+03]
[-2.40000062e+06 5.36044693e+02]
[-8.30000342e+05 4.29888151e+02]
[-1.41000037e+06 3.32431578e+02]
[-3.00000087e+06 8.70060097e+02]
[-3.50000112e+06 1.20173252e+03]
[-9.50000362e+06 4.37482971e+03]
[-4.15000071e+06 2.56414152e+02]
[-3.62500102e+06 9.90906119e+02]
[-2.50000064e+06 5.53384258e+02]
[-9.20000525e+05 7.55488182e+02]
[-2.50200099e+06 1.22234003e+03]
[-4.75000111e+06 8.56422746e+02]
[-3.40000034e+06 -2.55602169e+02]
[-2.90000037e+06 -6.72769005e+01]
[-5.00000262e+06 3.65976652e+03]
[-1.84500068e+06 7.96341161e+02]
[-1.23000053e+06 6.77231963e+02]
[-6.00000061e+06 -4.25890317e+02]
[-3.17500099e+06 1.04589681e+03]
[-9.65000382e+05 4.70812074e+02]
[-1.33000047e+06 5.49562287e+02]
[-1.00000013e+07 -1.24496790e+02]
[-9.50000182e+06 9.46827929e+02]
[-2.85000104e+06 1.21705736e+03]
[-2.72500105e+06 1.28288961e+03]

[-5.25000238e+06 3.13511444e+03]
[-2.01500055e+06 5.13808492e+02]
[-5.55000131e+06 1.02710488e+03]
[-4.65000138e+06 1.40110678e+03]
[-1.75000058e+06 6.38371312e+02]
[-1.57600071e+06 9.33028246e+02]
[-6.25000159e+06 1.36746268e+03]
[-9.49000623e+05 9.32216179e+02]
[-8.95000549e+05 8.07655211e+02]
[-1.25000055e+06 7.13700706e+02]
[-1.65000090e+06 1.27604208e+03]
[-3.19500116e+06 1.35937026e+03]
[-7.35000147e+06 8.50142342e+02]
[-1.34900050e+06 5.95558160e+02]
[-1.25000041e+06 4.41696443e+02]]

[159]: projected_data.shape

[159]: (439, 2)

[122]: explained_variance_ratio = pca.explained_variance_ratio_

# Print the explained variance ratio values

print("Explained Variance Ratio:", explained_variance_ratio)

Explained Variance Ratio: [9.99999885e-01 1.14996251e-07 5.38924715e-13 1.62041663e-13
 7.34178924e-14 7.01994932e-18 4.54122976e-19]
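
The first component above carries almost all of the variance (about 0.9999999). PCA typically reports this when one unscaled column, here probably `lastsoldprice` with values in the millions, dwarfs the others. A minimal sketch on synthetic data (the column scales and the names `X_demo`, `raw_ratio`, `std_ratio` are made up for illustration and are not taken from the dataset) shows how standardizing first changes the ratios:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: a price-like column (millions) and two small-scale columns
price = rng.normal(2_000_000, 1_000_000, size=200)
rooms = rng.normal(5, 2, size=200)
baths = rng.normal(2, 1, size=200)
X_demo = np.column_stack([price, rooms, baths])

# PCA on raw values: the large-variance column dominates the first component
raw_ratio = PCA().fit(X_demo).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X_demo)
std_ratio = PCA().fit(X_std).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", std_ratio)
```

If the real data behaves the same way, fitting PCA on the standardized matrix should give ratios that are easier to interpret.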

[123]: import numpy as np
import matplotlib.pyplot as plt

# 'explained_variance_ratio' holds the explained variance ratio of each
# principal component (computed in the previous cell)

# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

# Plot the cumulative explained variance (scree-style plot)
plt.plot(range(1, len(cumulative_variance_ratio) + 1),
         cumulative_variance_ratio, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Scree Plot')
plt.grid(True)
plt.show()
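
The cumulative curve can also be read off numerically. The sketch below picks the smallest number of components whose cumulative ratio reaches a threshold; the 0.95 threshold is a hypothetical choice and does not come from the notebook, while the ratios are copied from the printed output.

```python
import numpy as np

# Explained variance ratios copied from the printed output
explained_variance_ratio = np.array([
    9.99999885e-01, 1.14996251e-07, 5.38924715e-13, 1.62041663e-13,
    7.34178924e-14, 7.01994932e-18, 4.54122976e-19,
])

cumulative = np.cumsum(explained_variance_ratio)

threshold = 0.95  # hypothetical target, not specified in the notebook
# Index of the first cumulative value >= threshold, converted to a count
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components)
```

With these ratios a single component already passes the threshold.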

2. Biplot
[180]:

---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[180], line 3
1 import pandas as pd
2 import matplotlib.pyplot as plt
----> 3 from prince import PCA
5 # Assuming X contains your standardized data
6 pca = PCA(n_components=2)

ModuleNotFoundError: No module named 'prince'

[183]: pip install prince

Requirement already satisfied: prince in

c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages
(0.13.0)
Requirement already satisfied: altair<6.0.0,>=4.2.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (5.3.0)
Requirement already satisfied: pandas<3.0.0,>=1.4.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (2.1.3)
Requirement already satisfied: scikit-learn<2.0.0,>=1.0.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (1.4.2)
Requirement already satisfied: jinja2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (3.1.2)
Requirement already satisfied: packaging in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (23.2)
Requirement already satisfied: jsonschema>=3.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (4.20.0)
Requirement already satisfied: typing-extensions>=4.0.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (4.7.1)
Requirement already satisfied: toolz in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (0.12.1)
Requirement already satisfied: numpy in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (1.26.2)
Requirement already satisfied: pytz>=2020.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2023.3.post1)
Requirement already satisfied: python-dateutil>=2.8.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2023.3)
Requirement already satisfied: scipy>=1.6.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (1.12.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (3.2.0)
Requirement already satisfied: joblib>=1.2.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (1.3.2)
Requirement already satisfied: attrs>=22.2.0 in

c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (23.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (2023.11.1)
Requirement already satisfied: referencing>=0.28.4 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (0.31.1)
Requirement already satisfied: rpds-py>=0.7.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (0.13.2)
Requirement already satisfied: six>=1.5 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
python-dateutil>=2.8.2->pandas<3.0.0,>=1.4.1->prince) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jinja2->altair<6.0.0,>=4.2.2->prince) (2.1.3)
Note: you may need to restart the kernel to use updated packages.

[notice] A new release of pip available: 22.2.2 -> 24.0

[notice] To update, run: python.exe -m pip install --upgrade pip
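
A `ModuleNotFoundError` next to "Requirement already satisfied" may mean the package was installed for an interpreter the kernel was not using, or that the kernel needed a restart. A small sketch for checking this from inside the notebook (`is_installed` is a hypothetical helper name, not part of the original code):

```python
import importlib.util
import sys

def is_installed(module_name: str) -> bool:
    """Return True if the running interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None

# Shows which Python the kernel is using, to compare with pip's install path
print(sys.executable)
print("prince available:", is_installed("prince"))
```

In Jupyter, running `%pip install prince` rather than a shell-level `pip` installs into the environment of the running kernel.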

[199]: import pandas as pd
import numpy as np
from prince import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# Assuming X is your ndarray

X_df = pd.DataFrame(X_imputed)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)

[209]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# df_scaled is assumed to hold the standardized data
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Extract loadings of each feature for the first two principal components
loadings = pca.components_.T[:, :2]

# Plot the first two principal components
plt.figure(figsize=(8, 7))
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.5)

# Plot the loadings as vectors, one arrow per feature (column of df_scaled)
for i, feature in enumerate(df_scaled):
    plt.arrow(0, 0, loadings[i, 0], loadings[i, 1], color='r', alpha=0.5)
    plt.text(loadings[i, 0], loadings[i, 1], str(feature), color='g',
             fontsize=10, ha='center', va='center')

# Set labels and title

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Biplot of Principal Components')
plt.grid(True)

plt.show()
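
In a biplot the loading vectors have length at most 1, while the scores can spread much further, so the arrows can end up too small to read. One common workaround, sketched here with synthetic data and a hypothetical helper `scale_loadings`, is to stretch the loadings to the range of the scores before drawing them:

```python
import numpy as np
from sklearn.decomposition import PCA

def scale_loadings(scores: np.ndarray, loadings: np.ndarray) -> np.ndarray:
    """Stretch loadings so their largest absolute coordinate matches the scores'."""
    factor = np.abs(scores).max() / np.abs(loadings).max()
    return loadings * factor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))   # synthetic stand-in for standardized data

pca = PCA(n_components=2)
scores = pca.fit_transform(X_demo)   # shape (100, 2)
loadings = pca.components_.T         # shape (4, 2): one row per feature

arrows = scale_loadings(scores, loadings)
# arrows[i, 0] and arrows[i, 1] can then be used in place of
# loadings[i, 0] and loadings[i, 1] in the plt.arrow / plt.text calls.
```

The scaling only changes arrow length, not direction, so the relative orientation of the features is preserved.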

[189]: import pandas as pd
import numpy as np
from prince import PCA

# Assuming X is your ndarray

X_df = pd.DataFrame(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)

3. Pairplot
[130]: import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.pairplot(df)
plt.show()

[132]: from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Now proceed with PCA using X_imputed as the input
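
The choice of `strategy` matters when a column contains extreme values. A toy sketch with made-up numbers (not from the dataset) compares what `'mean'` and `'median'` fill in for the same missing entry:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value and one extreme value (made-up numbers)
X_demo = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(X_demo)
median_filled = SimpleImputer(strategy='median').fit_transform(X_demo)

print(mean_filled[2, 0])    # mean of 1, 2, 3, 100
print(median_filled[2, 0])  # median of 1, 2, 3, 100
```

The mean is pulled toward the extreme value, while the median stays close to the bulk of the data.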

[135]: df

[135]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \
0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]

[138]: import numpy as np

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Separate features (X) and target variable (y)

X = df.drop(columns=['lastsoldprice'])
y = df['lastsoldprice']

# Handle missing values using imputation

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Create a DataFrame for the principal components
pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Create a scatterplot matrix with both principal components

combined_df = pd.concat([pc_df, y], axis=1)
sns.pairplot(combined_df, hue='lastsoldprice')
plt.show()
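
Passing a continuous column such as `lastsoldprice` as `hue` gives nearly every row its own colour. One possible alternative, sketched on synthetic data with a hypothetical `price_band` column, is to bin the target into a few quantile groups first:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for combined_df (PC1, PC2, lastsoldprice)
demo_df = pd.DataFrame({
    'PC1': rng.normal(size=100),
    'PC2': rng.normal(size=100),
    'lastsoldprice': rng.uniform(500_000, 8_000_000, size=100),
})

# Bin the continuous target into four quantile-based groups
demo_df['price_band'] = pd.qcut(
    demo_df['lastsoldprice'], q=4,
    labels=['low', 'mid-low', 'mid-high', 'high'],
)

band_counts = demo_df['price_band'].value_counts().sort_index()
print(band_counts)
# sns.pairplot(demo_df, vars=['PC1', 'PC2'], hue='price_band') would then
# draw four colour groups instead of one per distinct price.
```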

[176]: import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# df_scaled is assumed to hold the standardized data
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Create a heatmap of the principal components
plt.figure(figsize=(10, 6))
sns.heatmap(principal_components, cmap='coolwarm', annot=True, fmt=".2f",
            cbar=True)

plt.xlabel('Principal Component')
plt.ylabel('Data Point')
plt.title('Heatmap of Principal Components')
plt.show()

[ ]:

Score plot
[152]: import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

# Assuming your data is stored in a DataFrame called 'df'

# Impute missing values

imputer = SimpleImputer(strategy='mean')  # You can change the strategy as needed

imputed_data = imputer.fit_transform(df)

# Standardize the data

scaler = StandardScaler()
scaled_data = scaler.fit_transform(imputed_data)

# Perform PCA
pca = PCA(n_components=2) # Reduce to 2 components for a 2D plot
principal_components = pca.fit_transform(scaled_data)

# Plot the score plot with quadrants

pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
plt.figure(figsize=(8, 6))
plt.scatter(pc_df['PC1'], pc_df['PC2'])
plt.axhline(y=0, color='k', linestyle='--') # Add horizontal line at y=0
plt.axvline(x=0, color='k', linestyle='--') # Add vertical line at x=0
plt.text(0.5, 0.5, ' I', fontsize=12, ha='center')
plt.text(-0.5, 0.5, ' II', fontsize=12, ha='center')
plt.text(-0.5, -0.5, ' III', fontsize=12, ha='center')
plt.text(0.5, -0.5, ' IV', fontsize=12, ha='center')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Score Plot ')
plt.grid(True)
plt.show()
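
The impute, standardize, and PCA steps of this cell can also be chained so they always run in the same order. A minimal sketch using scikit-learn's `make_pipeline` on synthetic data (`X_demo` and `pca_pipeline` are illustrative names, not from the notebook):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))   # synthetic data, not the housing dataset
X_demo[3, 1] = np.nan               # inject a missing value

# Same three steps as the score plot cell, chained in order
pca_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    PCA(n_components=2),
)
scores = pca_pipeline.fit_transform(X_demo)
print(scores.shape)
```

The resulting `scores` array plays the role of `principal_components` in the plotting code.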

0.3 ADDITIONAL EXPLORATION
[148]: from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Features come from df_scaled and the target from y (defined in earlier cells).
# If df_scaled still contains the scaled lastsoldprice column, drop it first
# so the target does not leak into the features.

# Initialize your chosen model
model = RandomForestRegressor()  # Example: RandomForestRegressor, you can use any other model

# Perform cross-validation
scores = cross_val_score(model, df_scaled, y, cv=5,
                         scoring='neg_mean_squared_error')

# Print the cross-validation scores

print("Cross-validation Mean Squared Error:", -scores.mean())

Cross-validation Mean Squared Error: 195630660332.1332
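
A mean squared error of this size is hard to judge because it is in squared units of the target. Taking the square root gives an error in the target's own units, assuming `y` is the unscaled `lastsoldprice`:

```python
import math

mse = 195630660332.1332   # value printed by the cross-validation cell
rmse = math.sqrt(mse)     # same units as lastsoldprice
print(f"RMSE: {rmse:,.0f}")
```

This works out to roughly 442,000. scikit-learn can also report it directly via `scoring='neg_root_mean_squared_error'` in `cross_val_score`.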

[170]: import numpy as np

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Create an imputer object

imputer = SimpleImputer(strategy='mean')  # You can replace 'mean' with 'median' or 'most_frequent'

# Fit the imputer to X and transform X

X_imputed = imputer.fit_transform(df_scaled)

# Assuming 'X' is your standardized data matrix

# Perform PCA
pca = PCA(n_components=2) # Choose the number of components you want
principal_components = pca.fit_transform(X_imputed)

# Get the loadings (coefficients) for each PC

loadings = pca.components_

# Create a DataFrame to display the loadings

loadings_df = pd.DataFrame(loadings, columns=df_scaled.columns,
                           index=[f'PC{i+1}' for i in range(pca.n_components_)])

# Display the loadings DataFrame

print("Loadings (Coefficients) for Each Principal Component:")
print(loadings_df)

Loadings (Coefficients) for Each Principal Component:

            0         1         2         3         4         5         6
PC1  0.443793  0.414535  0.467587  0.420565  0.017968 -0.195893  0.443845
PC2 -0.092340 -0.005345 -0.067513 -0.077041 -0.750017 -0.646715 -0.013622
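
The loadings are labelled 0 to 6 because `df_scaled` carries integer column labels. The sketch below attaches names and finds the feature with the largest absolute loading on each component. The column names are an assumption: they presume `df_scaled` kept the column order of `df` as displayed elsewhere in the notebook.

```python
import pandas as pd

# Loadings copied from the printed output. The column names are an assumption:
# they presume df_scaled kept the column order of df.
columns = ['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice',
           'latitude', 'longitude', 'totalrooms']
loadings_df = pd.DataFrame(
    [[0.443793, 0.414535, 0.467587, 0.420565, 0.017968, -0.195893, 0.443845],
     [-0.092340, -0.005345, -0.067513, -0.077041, -0.750017, -0.646715, -0.013622]],
    index=['PC1', 'PC2'], columns=columns,
)

# Feature with the largest absolute loading on each component
top_feature = loadings_df.abs().idxmax(axis=1)
print(top_feature)
```

Under that assumption, PC1 is driven mostly by the size-related columns and PC2 by the two location columns.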

[164]: df

[164]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \

437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]
