0% found this document useful (0 votes)

14 views47 pages

Regression and Eda

This document performs exploratory data analysis on a dataframe containing economic indicators data from 1980 to 2022. It loads the data, cleans it by removing unnecessary columns and rows with missing values, then analyzes the distribution and trends of variables like gold prices, inflation, and unemployment rate using histograms, box plots, and line plots.

Uploaded by

SUHANI LARIYA 22112338

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views47 pages

Regression and Eda

Uploaded by

SUHANI LARIYA 22112338

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 47

regression-and-eda

May 3, 2024

[2]: import pandas as pd

[3]: df=pd.read_excel('C:\\Users\\lariy\\Downloads\\capstone\\final_data.xlsx')
df

[3]: Unnamed: 0 Unnamed: 1 Year Gold Prices (USD) Inflation \

0 NaN NaN 1980-01-04 588.00 13.549202
1 NaN NaN 1980-01-11 623.00 13.549202
2 NaN NaN 1980-01-18 835.00 13.549202
3 NaN NaN 1980-01-25 668.00 13.549202
4 NaN NaN 1980-02-01 676.50 13.549202
… … … … … …
2239 NaN NaN 2022-12-02 1784.75 NaN
2240 NaN NaN 2022-12-09 1796.15 NaN
2241 NaN NaN 2022-12-16 1792.55 NaN
2242 NaN NaN 2022-12-23 1800.70 NaN
2243 NaN NaN 2022-12-30 1813.75 NaN

Unemployment rate Interest rates Oil Prices (USD)

0 6.3 18.9 21.59
1 6.3 18.9 21.59
2 6.3 18.9 21.59
3 6.3 18.9 21.59
4 6.9 18.9 21.59
… … … …
2239 NaN 4.1 93.97
2240 NaN 4.1 93.97
2241 NaN 4.1 93.97
2242 NaN 4.1 93.97
2243 NaN 4.1 93.97

[2244 rows x 8 columns]

[4]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2244 entries, 0 to 2243
Data columns (total 8 columns):

1
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 0 non-null float64
1 Unnamed: 1 0 non-null float64
2 Year 2244 non-null datetime64[ns]
3 Gold Prices (USD) 2244 non-null float64
4 Inflation 2192 non-null float64
5 Unemployment rate 2136 non-null float64
6 Interest rates 2244 non-null float64
7 Oil Prices (USD) 2244 non-null float64
dtypes: datetime64[ns](1), float64(7)
memory usage: 140.4 KB

[7]: df.describe()

[7]: Unnamed: 0 Unnamed: 1 Year Gold Prices (USD) \

count 0.0 0.0 2244 2244.000000
mean NaN NaN 2001-07-02 12:00:00 738.579835
min NaN NaN 1980-01-04 00:00:00 253.800000
25% NaN NaN 1990-10-03 06:00:00 357.787500
50% NaN NaN 2001-07-02 12:00:00 427.575000
75% NaN NaN 2012-03-31 18:00:00 1215.312500
max NaN NaN 2022-12-30 00:00:00 2031.150000
std NaN NaN NaN 506.810257

Inflation Unemployment rate Interest rates Oil Prices (USD)

count 2192.000000 2136.000000 2244.000000 2244.000000
mean 3.180372 6.311798 4.276190 40.127362
min -0.355546 3.900000 0.070000 10.870000
25% 2.069337 5.300000 0.540000 16.540000
50% 2.931204 5.900000 4.160000 27.560000
75% 3.156842 7.800000 6.770000 59.690000
max 13.549202 10.800000 18.900000 95.990000
std 2.402141 1.459679 3.957089 27.380432

[27]: df.columns

[27]: Index(['Year', 'Gold Prices (USD)', 'Inflation', 'Unemployment rate',

'Interest rates', 'Oil Prices (USD)'],
dtype='object')

[8]: #Removing unnecessary columns like "unnamed:0" and "unnamed:1"

import pandas as pd

df = df.drop(columns=['Unnamed: 0', 'Unnamed: 1'])

print(df.head())

2
Year Gold Prices (USD) Inflation Unemployment rate Interest rates \
0 1980-01-04 588.0 13.549202 6.3 18.9
1 1980-01-11 623.0 13.549202 6.3 18.9
2 1980-01-18 835.0 13.549202 6.3 18.9
3 1980-01-25 668.0 13.549202 6.3 18.9
4 1980-02-01 676.5 13.549202 6.9 18.9

Oil Prices (USD)

0 21.59
1 21.59
2 21.59
3 21.59
4 21.59

[133]: # Visualize data using histograms

df['Gold Prices (USD)'].hist()

# Visualize data using box plots

df.boxplot(column='Gold Prices (USD)')

# Visualize data using scatter plots

df.plot.scatter(x='Year', y='Gold Prices (USD)')

[133]: <Axes: xlabel='Year', ylabel='Gold Prices (USD)'>

3
[10]: #Removing missing values from "inflation" and "unemployment rate"
# Assuming your dataframe is named 'df'

# Remove rows with missing values in both 'Inflation' and 'Unemployment rate'␣
↪columns

df = df.dropna(subset=['Inflation', 'Unemployment rate'])

# Display the modified dataframe

print(df.head())

Year Gold Prices (USD) Inflation Unemployment rate Interest rates \

0 1980-01-04 588.0 13.549202 6.3 18.9
1 1980-01-11 623.0 13.549202 6.3 18.9
2 1980-01-18 835.0 13.549202 6.3 18.9
3 1980-01-25 668.0 13.549202 6.3 18.9
4 1980-02-01 676.5 13.549202 6.9 18.9

Oil Prices (USD)

0 21.59

4
1 21.59
2 21.59
3 21.59
4 21.59

[12]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2136 entries, 0 to 2135
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 2136 non-null datetime64[ns]
1 Gold Prices (USD) 2136 non-null float64
2 Inflation 2136 non-null float64
3 Unemployment rate 2136 non-null float64
4 Interest rates 2136 non-null float64
5 Oil Prices (USD) 2136 non-null float64
dtypes: datetime64[ns](1), float64(5)
memory usage: 116.8 KB

[ ]: #missng values are removed , left us with 2136 rows"

[14]: #Visualising distrbution of each variable to understand distribution

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your dataframe is named 'df'

# Set the style for seaborn plots

sns.set(style="whitegrid")

# Plot the distribution of each variable

plt.figure(figsize=(12, 8))

# Plot for 'Gold Prices (USD)' variable

plt.subplot(3, 2, 2)
sns.histplot(df['Gold Prices (USD)'], kde=True)
plt.title('Distribution of Gold Prices (USD)')

# Plot for 'Inflation' variable

plt.subplot(3, 2, 3)
sns.histplot(df['Inflation'], kde=True)
plt.title('Distribution of Inflation')

# Plot for 'Unemployment rate' variable

5
plt.subplot(3, 2, 4)
sns.histplot(df['Unemployment rate'], kde=True)
plt.title('Distribution of Unemployment rate')

# Plot for 'Interest rates' variable

plt.subplot(3, 2, 5)
sns.histplot(df['Interest rates'], kde=True)
plt.title('Distribution of Interest rates')

# Plot for 'Oil Prices (USD)' variable

plt.subplot(3, 2, 6)
sns.histplot(df['Oil Prices (USD)'], kde=True)
plt.title('Distribution of Oil Prices (USD)')

plt.tight_layout()
plt.show()

[15]: import seaborn as sns

import matplotlib.pyplot as plt

# Assuming your dataframe is named 'df'

# Set the style for seaborn plots

6
sns.set(style="whitegrid")

# Plot the trend of each variable over time (year)

plt.figure(figsize=(12, 8))

# Plot for 'Gold Prices (USD)'

plt.subplot(3, 2, 1)
sns.lineplot(x='Year', y='Gold Prices (USD)', data=df)
plt.title('Trend of Gold Prices (USD)')

# Plot for 'Inflation'

plt.subplot(3, 2, 2)
sns.lineplot(x='Year', y='Inflation', data=df)
plt.title('Trend of Inflation')

# Plot for 'Unemployment rate'

plt.subplot(3, 2, 3)
sns.lineplot(x='Year', y='Unemployment rate', data=df)
plt.title('Trend of Unemployment rate')

# Plot for 'Interest rates'

plt.subplot(3, 2, 4)
sns.lineplot(x='Year', y='Interest rates', data=df)
plt.title('Trend of Interest rates')

# Plot for 'Oil Prices (USD)'

plt.subplot(3, 2, 5)
sns.lineplot(x='Year', y='Oil Prices (USD)', data=df)
plt.title('Trend of Oil Prices (USD)')

plt.tight_layout()
plt.show()

7
[17]: import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise relationships between variables

sns.pairplot(df, vars=['Gold Prices (USD)', 'Inflation', 'Unemployment rate',␣
↪'Interest rates', 'Oil Prices (USD)'], kind='scatter')

plt.show()

8
[18]: import seaborn as sns
import matplotlib.pyplot as plt

# Assuming your dataframe is named 'df' with the relevant columns

# Compute the correlation matrix
correlation_matrix = df.corr()

# Plot the correlation matrix as a heatmap

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f",␣
↪square=True)

plt.title('Correlation Matrix between Variables')

plt.show()

9
[ ]: '''it can be seen that relationship between inflation and gold prices is␣
↪positive theoretically , but according to the correlation matrix it is
negative.It maybe due to external factors like govt policies that the data is␣
↪behaving like this. Also this is not so strong relationship'''

[19]: #OUTLIER DETECTION

import pandas as pd

# Assuming your dataframe is named 'df' with the relevant variables

# Define a function to detect outliers using z-score

def detect_outliers_zscore(data, threshold=3):
z_scores = (data - data.mean()) / data.std()
return (z_scores > threshold) | (z_scores < -threshold)

# Define a function to detect outliers using IQR

10
def detect_outliers_iqr(data, threshold=1.5):
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr
return (data < lower_bound) | (data > upper_bound)

# Iterate over each column in the dataframe to detect outliers

outliers = {}
for column in df.columns:
outliers[column] = {
'zscore_outliers': detect_outliers_zscore(df[column]),
'iqr_outliers': detect_outliers_iqr(df[column])
}

# Display outliers for each variable

for column, values in outliers.items():
print(f"Outliers in {column}:")
print("Outliers:")
print(values['zscore_outliers'])
print("IQR Outliers:")
print(values['iqr_outliers'])
print()

Outliers in Year:
Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Year, Length: 2136, dtype: bool
IQR Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False

11
2132 False
2133 False
2134 False
2135 False
Name: Year, Length: 2136, dtype: bool

Outliers in Gold Prices (USD):

Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Gold Prices (USD), Length: 2136, dtype: bool
IQR Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Gold Prices (USD), Length: 2136, dtype: bool

Outliers in Inflation:
Outliers:
0 True
1 True
2 True
3 True
4 True
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Inflation, Length: 2136, dtype: bool

12
IQR Outliers:
0 True
1 True
2 True
3 True
4 True
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Inflation, Length: 2136, dtype: bool

Outliers in Unemployment rate:

Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Unemployment rate, Length: 2136, dtype: bool
IQR Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Unemployment rate, Length: 2136, dtype: bool

Outliers in Interest rates:

Outliers:
0 True
1 True
2 True
3 True

13
4 True
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Interest rates, Length: 2136, dtype: bool
IQR Outliers:
0 True
1 True
2 True
3 True
4 True
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Interest rates, Length: 2136, dtype: bool

Outliers in Oil Prices (USD):

Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False
2132 False
2133 False
2134 False
2135 False
Name: Oil Prices (USD), Length: 2136, dtype: bool
IQR Outliers:
0 False
1 False
2 False
3 False
4 False
…
2131 False
2132 False
2133 False
2134 False
2135 False

14
Name: Oil Prices (USD), Length: 2136, dtype: bool

[21]: import seaborn as sns

import matplotlib.pyplot as plt

# Assuming your dataframe is named 'df' with the relevant variables

# Set the figure size

plt.figure(figsize=(12, 8))

# Iterate over each column in the dataframe and plot box plots with outliers
for i, column in enumerate(df.columns):
plt.subplot(len(df.columns)//2, 2, i+1)
sns.boxplot(x=df[column], showfliers=True)
plt.title(f'Box Plot of {column}')

# Adjust layout
plt.tight_layout()
plt.show()

[20]: import pandas as pd

# Assuming your dataframe is named 'df' with the relevant variables

15
# Define a function to detect outliers using z-score
def detect_outliers_zscore(data, threshold=3):
z_scores = (data - data.mean()) / data.std()
return (z_scores > threshold) | (z_scores < -threshold)

# Define a function to detect outliers using IQR

def detect_outliers_iqr(data, threshold=1.5):
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr
return (data < lower_bound) | (data > upper_bound)

# Initialize a dictionary to store the count of outliers for each variable

outlier_counts = {}

# Iterate over each column in the dataframe to count outliers

for column in df.columns:
# Count outliers using z-score method
zscore_outliers_count = detect_outliers_zscore(df[column]).sum()

# Count outliers using IQR method

iqr_outliers_count = detect_outliers_iqr(df[column]).sum()

# Store the counts in the dictionary

outlier_counts[column] = {
'zscore_outliers_count': zscore_outliers_count,
'iqr_outliers_count': iqr_outliers_count
}

# Display the count of outliers for each variable

for column, values in outlier_counts.items():
print(f"Outlier counts in {column}:")
print("Z-Score Outliers Count:", values['zscore_outliers_count'])
print("IQR Outliers Count:", values['iqr_outliers_count'])
print()

Outlier counts in Year:

Z-Score Outliers Count: 0
IQR Outliers Count: 0

Outlier counts in Gold Prices (USD):

Z-Score Outliers Count: 0
IQR Outliers Count: 0

16
Outlier counts in Inflation:
Z-Score Outliers Count: 104
IQR Outliers Count: 261

Outlier counts in Unemployment rate:

Z-Score Outliers Count: 9
IQR Outliers Count: 0

Outlier counts in Interest rates:

Z-Score Outliers Count: 52
IQR Outliers Count: 52

Outlier counts in Oil Prices (USD):

Z-Score Outliers Count: 0
IQR Outliers Count: 0

[22]: # Define a function to detect outliers using z-score

def detect_outliers_zscore(data, threshold=3):
z_scores = (data - data.mean()) / data.std()
return (z_scores > threshold) | (z_scores < -threshold)

# Define a function to detect outliers using IQR

# Initialize a dictionary to store the outliers for each variable

outliers = {}

# Iterate over each column in the dataframe to detect and store outliers
for column in df.columns:
# Detect outliers using z-score method
zscore_outliers = df[column][detect_outliers_zscore(df[column])]

# Detect outliers using IQR method

iqr_outliers = df[column][detect_outliers_iqr(df[column])]

# Store the outliers in the dictionary

outliers[column] = {
'zscore_outliers': zscore_outliers,
'iqr_outliers': iqr_outliers
}

17
# Display the outlier values for each variable
for column, values in outliers.items():
print(f"Outliers in {column}:")
print("Z-Score Outliers:")
print(values['zscore_outliers'])
print("IQR Outliers:")
print(values['iqr_outliers'])
print()

Outliers in Year:
Z-Score Outliers:
Series([], Name: Year, dtype: datetime64[ns])
IQR Outliers:
Series([], Name: Year, dtype: datetime64[ns])

Outliers in Gold Prices (USD):

Z-Score Outliers:
Series([], Name: Gold Prices (USD), dtype: float64)
IQR Outliers:
Series([], Name: Gold Prices (USD), dtype: float64)

Outliers in Inflation:
Z-Score Outliers:
0 13.549202
1 13.549202
2 13.549202
3 13.549202
4 13.549202
…
99 10.334715
100 10.334715
101 10.334715
102 10.334715
103 10.334715
Name: Inflation, Length: 104, dtype: float64
IQR Outliers:
0 13.549202
1 13.549202
2 13.549202
3 13.549202
4 13.549202
…
1821 0.118627
1822 0.118627
1823 0.118627
1824 0.118627

18
1825 0.118627
Name: Inflation, Length: 261, dtype: float64

Outliers in Unemployment rate:

Z-Score Outliers:
148 10.8
149 10.8
150 10.8
151 10.8
152 10.8
153 10.8
154 10.8
155 10.8
156 10.8
Name: Unemployment rate, dtype: float64
IQR Outliers:
Series([], Name: Unemployment rate, dtype: float64)

Outliers in Interest rates:

Z-Score Outliers:
0 18.9
1 18.9
2 18.9
3 18.9
4 18.9
5 18.9
6 18.9
7 18.9
8 18.9
9 18.9
10 18.9
11 18.9
12 18.9
13 18.9
14 18.9
15 18.9
16 18.9
17 18.9
18 18.9
19 18.9
20 18.9
21 18.9
22 18.9
23 18.9
24 18.9
25 18.9
26 18.9
27 18.9

19
28 18.9
29 18.9
30 18.9
31 18.9
32 18.9
33 18.9
34 18.9
35 18.9
36 18.9
37 18.9
38 18.9
39 18.9
40 18.9
41 18.9
42 18.9
43 18.9
44 18.9
45 18.9
46 18.9
47 18.9
48 18.9
49 18.9
50 18.9
51 18.9
Name: Interest rates, dtype: float64
IQR Outliers:
0 18.9
1 18.9
2 18.9
3 18.9
4 18.9
5 18.9
6 18.9
7 18.9
8 18.9
9 18.9
10 18.9
11 18.9
12 18.9
13 18.9
14 18.9
15 18.9
16 18.9
17 18.9
18 18.9
19 18.9
20 18.9
21 18.9

20
22 18.9
23 18.9
24 18.9
25 18.9
26 18.9
27 18.9
28 18.9
29 18.9
30 18.9
31 18.9
32 18.9
33 18.9
34 18.9
35 18.9
36 18.9
37 18.9
38 18.9
39 18.9
40 18.9
41 18.9
42 18.9
43 18.9
44 18.9
45 18.9
46 18.9
47 18.9
48 18.9
49 18.9
50 18.9
51 18.9
Name: Interest rates, dtype: float64

Outliers in Oil Prices (USD):

Z-Score Outliers:
Series([], Name: Oil Prices (USD), dtype: float64)
IQR Outliers:
Series([], Name: Oil Prices (USD), dtype: float64)

[ ]: '''It can be seen that there are many outliers but share historical event ,␣
↪since the outliers are considered meaningful reflections of historical

events in U.S. history,

it may be appropriate to retain them in the dataset and proceed with the␣
↪analysis.'''

[23]: #Finding multicollinearity

import pandas as pd

21
# Assuming your dataframe is named 'df' with the independent variables as␣
↪columns

# Calculate the correlation matrix

correlation_matrix = df.corr()

# Display the correlation matrix as a table

print("Correlation Matrix:")
print(correlation_matrix)

Correlation Matrix:
Year Gold Prices (USD) Inflation Unemployment rate \
Year 1.000000 0.775752 -0.532349 -0.501394
Gold Prices (USD) 0.775752 1.000000 -0.205732 -0.014243
Inflation -0.532349 -0.205732 1.000000 0.277274
Unemployment rate -0.501394 -0.014243 0.277274 1.000000
Interest rates -0.841198 -0.560575 0.772618 0.399629
Oil Prices (USD) 0.678407 0.772340 -0.298871 0.036252

Interest rates Oil Prices (USD)

Year -0.841198 0.678407
Gold Prices (USD) -0.560575 0.772340
Inflation 0.772618 -0.298871
Unemployment rate 0.399629 0.036252
Interest rates 1.000000 -0.550246
Oil Prices (USD) -0.550246 1.000000

[24]: import pandas as pd

# Assuming your dataframe is named 'df' with the independent variables as␣
↪columns

# Calculate the correlation matrix

correlation_matrix = df.corr()

# Set the threshold values for correlation

threshold_min = 0.7
threshold_max = 0.9

# Initialize a list to store pairs of variables with high correlation

high_correlation_pairs = []

# Iterate through the correlation matrix to identify pairs with correlation␣

↪between the thresholds

for i in range(len(correlation_matrix.columns)):
for j in range(i+1, len(correlation_matrix.columns)):

22
if abs(correlation_matrix.iloc[i, j]) >= threshold_min and␣
↪abs(correlation_matrix.iloc[i, j]) <= threshold_max:
high_correlation_pairs.append((correlation_matrix.columns[i],␣
↪correlation_matrix.columns[j], correlation_matrix.iloc[i, j]))

# Display the pairs of variables with high correlation

print(f"Pairs of variables with correlation between {threshold_min} and␣
↪{threshold_max}:")

for pair in high_correlation_pairs:

print(f"{pair[0]} - {pair[1]}: {pair[2]}")

Pairs of variables with correlation between 0.7 and 0.9:

Year - Gold Prices (USD): 0.7757519467104836
Year - Interest rates: -0.8411983318709919
Gold Prices (USD) - Oil Prices (USD): 0.7723398490361068
Inflation - Interest rates: 0.7726182398554902

[ ]: #PREPROCESSING IS DONE

[31]: # Save the DataFrame with Year and Gold Prices to a CSV file (time series)
df.to_csv("data_revised.csv", index=False)

[ ]: '''Since there is multicollinearity between "Gold Prices (USD) - Oil Prices␣

↪(USD)" and "Inflation - Interest rates" , we will use
PCA '''

[43]: #PCA
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

#Since , PCA does not require time variable , we will drop "Year"
data = df.drop(columns=['Year'])

[44]: data

[44]: Gold Prices (USD) Inflation Unemployment rate Interest rates \

0 588.0 13.549202 6.3 18.90
1 623.0 13.549202 6.3 18.90
2 835.0 13.549202 6.3 18.90
3 668.0 13.549202 6.3 18.90
4 676.5 13.549202 6.9 18.90
… … … … …
2131 1940.8 4.697859 4.6 0.09
2132 1890.9 4.697859 4.6 0.09
2133 1875.7 4.697859 4.6 0.09
2134 1779.3 4.697859 4.6 0.09

23
2135 1843.0 4.697859 4.6 0.09

Oil Prices (USD)

0 21.59
1 21.59
2 21.59
3 21.59
4 21.59
… …
2131 36.86
2132 36.86
2133 36.86
2134 36.86
2135 36.86

[2136 rows x 5 columns]

[46]: # Standardize the features

scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)
print(scaled_data)

[[-0.21158217 4.5466568 -0.00808432 3.63105777 -0.62875152]

[-0.13510079 4.5466568 -0.00808432 3.63105777 -0.62875152]
[ 0.32815787 4.5466568 -0.00808432 3.63105777 -0.62875152]
…
[ 2.60227714 0.71046548 -1.17299696 -1.07620925 -0.05013795]
[ 2.39162556 0.71046548 -1.17299696 -1.07620925 -0.05013795]
[ 2.53082168 0.71046548 -1.17299696 -1.07620925 -0.05013795]]

[47]: scaled_data

[47]: array([[-0.21158217, 4.5466568 , -0.00808432, 3.63105777, -0.62875152],

[-0.13510079, 4.5466568 , -0.00808432, 3.63105777, -0.62875152],
[ 0.32815787, 4.5466568 , -0.00808432, 3.63105777, -0.62875152],
…,
[ 2.60227714, 0.71046548, -1.17299696, -1.07620925, -0.05013795],
[ 2.39162556, 0.71046548, -1.17299696, -1.07620925, -0.05013795],
[ 2.53082168, 0.71046548, -1.17299696, -1.07620925, -0.05013795]])

[48]: # Perform PCA

pca = PCA()
principal_components = pca.fit_transform(scaled_features)

[49]: explained_variance_ratio = pca.explained_variance_ratio_

print("Explained Variance Ratio:")
print(explained_variance_ratio)

24
Explained Variance Ratio:
[0.53362372 0.26151325 0.13667759 0.0456746 0.02251084]

[50]: # Cumulative explained variance

cumulative_explained_variance = explained_variance_ratio.cumsum()
print("Cumulative Explained Variance:")
print(cumulative_explained_variance)

Cumulative Explained Variance:

[0.53362372 0.79513696 0.93181456 0.97748916 1. ]

[52]: import matplotlib.pyplot as plt

# Plot the scree plot

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio,␣
↪marker='o', linestyle='-')

plt.title('Scree Plot')
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.grid(True)

# Find the elbow point

# You can visually identify the elbow point or use a threshold to find the␣
↪point programmatically

threshold = 0.05 # Example threshold for change in explained variance ratio

for i in range(1, len(explained_variance_ratio)):
if explained_variance_ratio[i] - explained_variance_ratio[i-1] < threshold:
elbow_point = i
break

# Mark the elbow point on the plot

plt.axvline(x=elbow_point, color='red', linestyle='--', label='Elbow Point')

plt.legend()
plt.show()

print("Elbow Point (k):", elbow_point)

25
Elbow Point (k): 1

[53]: '''However, if the scree plot exhibits a downward slope and there's no clear␣
↪elbow point,

it may imply that PCA is not suitable for dimensionality reduction in this␣
↪dataset,

or that the variables are not linearly related.'''

[ ]: ##MODEL BUILDING
#Since,we have multicollinearity in data , it is better to use Lasso␣
↪Regression . But for exploration purpose ,we are performing both

#Lasso and Linear Regression and to check the accuracy.

[ ]: ---------------------------------------------------------------------------------------------

[60]: #LINEAR REGRESSION

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

[62]: X = df[['Gold Prices (USD)']] # Independent variable

y = df['Gold Prices (USD)'] # Dependent variable

26
[67]: # Assuming X_train, X_test, y_train, y_test are your training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,␣
↪random_state=42)

[127]: df.dtypes

[127]: Year datetime64[ns]

Gold Prices (USD) float64
Inflation float64
Unemployment rate float64
Interest rates float64
Oil Prices (USD) float64
dtype: object

[68]: linear_reg = LinearRegression()

linear_reg.fit(X_train, y_train)
y_pred_linear = linear_reg.predict(X_test)
rmse_linear = mean_squared_error(y_test, y_pred_linear, squared=False)
print("Linear Regression RMSE:", rmse_linear)

Linear Regression RMSE: 1.2366982197576302e-13

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(

[135]: #Accuracy
from sklearn.metrics import r2_score
r2_linear = r2_score(y_test, y_pred_linear)
print("Linear Regression R-squared:", r2_linear)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[135], line 3
1 #Accuracy
2 from sklearn.metrics import r2_score
----> 3 r2_linear = r2_score(y_test, y_pred_linear)
4 print("Linear Regression R-squared:", r2_linear)

File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\_param_validation
↪py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)

207 try:
208 with config_context(
209 skip_parameter_validation=(
210 prefer_skip_nested_validation or global_skip_validation

27
211 )
212 ):
--> 213 return func(*args, **kwargs)
214 except InvalidParameterError as e:
215 # When the function is just a wrapper around an estimator, we allow
216 # the function to delegate validation to the estimator, but we␣
↪replace

217 # the name of the estimator by the name of the function in the error
218 # message to avoid confusion.
219 msg = re.sub(
220 r"parameter of \w+ must be",
221 f"parameter of {func.__qualname__} must be",
222 str(e),
223 )

File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.
↪py:1180, in r2_score(y_true, y_pred, sample_weight, multioutput, force_finite)

1039 @validate_params(
1040 {
1041 "y_true": ["array-like"],
(…)
1059 force_finite=True,
1060 ):
1061 """:math:`R^2` (coefficient of determination) regression score␣
↪function.

1062
1063 Best possible score is 1.0 and it can be negative (because the
(…)
1178 -inf
1179 """
-> 1180 y_type, y_true, y_pred, multioutput = _check_reg_targets(
1181 y_true, y_pred, multioutput
1182 )
1183 check_consistent_length(y_true, y_pred, sample_weight)
1185 if _num_samples(y_pred) < 2:

File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.
↪py:102, in _check_reg_targets(y_true, y_pred, multioutput, dtype)

68 def _check_reg_targets(y_true, y_pred, multioutput, dtype="numeric"):

69 """Check that y_true and y_pred belong to the same regression task.
70
71 Parameters
(…)
100 correct keyword.
101 """
--> 102 check_consistent_length(y_true, y_pred)

28
103 y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
104 y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)

File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.
↪py:457, in check_consistent_length(*arrays)

455 uniques = np.unique(lengths)

456 if len(uniques) > 1:
--> 457 raise ValueError(
458 "Found input variables with inconsistent numbers of samples: %r"
459 % [int(l) for l in lengths]
460 )

ValueError: Found input variables with inconsistent numbers of samples: [428,␣

↪855]

[ ]: ----------------------------------------------------------------------------------------------

[72]: #LASSO REGRESSION

from sklearn.model_selection import GridSearchCV

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error
lasso_reg = Lasso()
parameters = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
lasso_grid = GridSearchCV(lasso_reg, parameters,␣
↪scoring='neg_mean_squared_error', cv=5)

lasso_grid.fit(X_train, y_train)
y_pred_lasso = lasso_grid.predict(X_test)
rmse_lasso = mean_squared_error(y_test, y_pred_lasso, squared=False)
print("Lasso Regression RMSE:", rmse_lasso)

Lasso Regression RMSE: 2.2337202089187744e-06

[120]:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[120], line 1
----> 1 r2_lasso = r2_score(y_test, y_pred_lasso)

29
2 print("Lasso Regression R-squared:", r2_lasso)

File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\_param_validation
↪py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)

207 try:
208 with config_context(
209 skip_parameter_validation=(
210 prefer_skip_nested_validation or global_skip_validation
211 )
212 ):
--> 213 return func(*args, **kwargs)
214 except InvalidParameterError as e:
215 # When the function is just a wrapper around an estimator, we allow
216 # the function to delegate validation to the estimator, but we␣
↪replace

File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.
↪py:1180, in r2_score(y_true, y_pred, sample_weight, multioutput, force_finite)

1039 @validate_params(
1040 {
1041 "y_true": ["array-like"],
(…)
1059 force_finite=True,
1060 ):
1061 """:math:`R^2` (coefficient of determination) regression score␣
↪function.

30
File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.
↪py:102, in _check_reg_targets(y_true, y_pred, multioutput, dtype)

68 def _check_reg_targets(y_true, y_pred, multioutput, dtype="numeric"):

69 """Check that y_true and y_pred belong to the same regression task.
70
71 Parameters
(…)
100 correct keyword.
101 """
--> 102 check_consistent_length(y_true, y_pred)
103 y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
104 y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)

File␣
↪~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.
↪py:457, in check_consistent_length(*arrays)

455 uniques = np.unique(lengths)

456 if len(uniques) > 1:
--> 457 raise ValueError(
458 "Found input variables with inconsistent numbers of samples: %r"
459 % [int(l) for l in lengths]
460 )

ValueError: Found input variables with inconsistent numbers of samples: [428,␣

↪855]

[ ]: ----------------------------------------------------------------------------------------------

[ ]: #CROSS VALIDATION

[118]: from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
import numpy as np

# Assuming X_train, X_test, y_train, y_test are your training and testing data
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,␣
↪random_state=42)

# Train the Lasso Regression model

lasso = Lasso(alpha=0.1) # You can adjust the alpha parameter as needed
lasso.fit(X_train, y_train) # Fit the model to the training data

# Evaluate the model's performance on training and testing sets

train_preds = lasso.predict(X_train)

31
test_preds = lasso.predict(X_test)

train_mse = mean_squared_error(y_train, train_preds)

test_mse = mean_squared_error(y_test, test_preds)

print("Training MSE:", train_mse)

print("Testing MSE:", test_mse)

# Perform cross-validation
cv_scores = cross_val_score(lasso, X_train, y_train, cv=5,␣
↪scoring='neg_mean_squared_error')

cv_rmse = np.sqrt(-cv_scores)

print("Cross-validation RMSE scores:", cv_rmse)

print("Mean CV RMSE:", np.mean(cv_rmse))

# Calculate the difference between training and testing accuracies

accuracy_diff = np.abs(train_mse - test_mse)
print("Difference between Training and Testing MSE:", accuracy_diff)

Training MSE: 4.776072211470752e-08

Testing MSE: 4.781923557039156e-08
Cross-validation RMSE scores: [0.00022782 0.00022004 0.00019805 0.0002186
0.00023008]
Mean CV RMSE: 0.00021891751522405177
Difference between Training and Testing MSE: 5.851345568403972e-11

[119]: lasso_overall_r2 = lasso.score(X, y)

print("Overall R-squared:", lasso_overall_r2)

Overall R-squared: 0.9999999999997718

[74]: import matplotlib.pyplot as plt

# Plot actual vs. predicted values

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_linear, color='blue', label='Actual vs. Predicted')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--',␣
↪lw=2, color='red', label='Perfect Prediction')

plt.xlabel('Actual Gold Prices')

plt.ylabel('Predicted Gold Prices')
plt.title('Actual vs. Predicted Gold Prices (Linear Regression)')
plt.legend()
plt.grid(True)
plt.show()

C:\Users\lariy\AppData\Local\Temp\ipykernel_17032\3438038971.py:6: UserWarning:
color is redundantly defined by the 'color' keyword argument and the fmt string

32
"k--" (-> color='k'). The keyword argument will take precedence.
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--',
lw=2, color='red', label='Perfect Prediction')

[ ]: #CALCULATING RESIDUALS

[ ]: #checking assumptions

[78]: from sklearn.linear_model import LinearRegression

import numpy as np

# Assuming X is your independent variable and y is your dependent variable

# X and y should be numpy arrays or pandas DataFrames/Series
# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Get the predicted values

predicted_values = model.predict(X)

# Calculate residuals
residuals = y - predicted_values

# Now 'residuals' contains the residuals from the regression model

33
[79]: from scipy.stats import shapiro

# Assuming 'residuals' is a variable containing the residuals from your␣

↪regression model

shapiro_test_statistic, shapiro_p_value = shapiro(residuals)

print("Shapiro-Wilk Test:")
print("Test Statistic:", shapiro_test_statistic)
print("p-value:", shapiro_p_value)

if shapiro_p_value > 0.05:

print("p-value > 0.05. Fail to reject the null hypothesis. Residuals are␣
↪normally distributed.")

else:
print("p-value <= 0.05. Reject the null hypothesis. Residuals are not␣
↪normally distributed.")

Shapiro-Wilk Test:
Test Statistic: 0.5694072286447804
p-value: 2.730521207845291e-58
p-value <= 0.05. Reject the null hypothesis. Residuals are not normally
distributed.

[81]: from statsmodels.stats.diagnostic import het_breuschpagan

import statsmodels.api as sm

# Assuming 'X' is your independent variable(s) and 'y' is your dependent␣

↪variable

# Fit the regression model

X = sm.add_constant(X) # Add constant if necessary
model = sm.OLS(y, X)
results = model.fit()

# Perform the Breusch-Pagan test

bp_test_statistic, bp_p_value, _, _ = het_breuschpagan(results.resid, X)
print("\nBreusch-Pagan Test:")
print("Test Statistic:", bp_test_statistic)
print("p-value:", bp_p_value)

Breusch-Pagan Test:
Test Statistic: 1100.672462731512
p-value: 2.3590049793082233e-241

[ ]: #APPLYING LOG TRANSFORMATION TO GET UNIFORM DISTRIBUTED RESIDUALS

34
[82]: import numpy as np
from scipy.stats import skew

# Calculate skewness for each column

skewness = data.apply(skew)

# Select columns with skewness greater than a threshold (e.g., 0.5)

skewed_cols = skewness[skewness > 0.5].index

# Apply log transformation to skewed columns

data[skewed_cols] = np.log1p(data[skewed_cols])

# Check the distributions after transformation

[97]: #Visualising distrbution of each variable to understand distribution

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming your dataframe is named 'df'

# Set the style for seaborn plots

sns.set(style="whitegrid")

# Plot the distribution of each variable

plt.figure(figsize=(12, 8))

# Plot for 'Gold Prices (USD)' variable

plt.subplot(3, 2, 2)
sns.histplot(data['Gold Prices (USD)'], kde=True)
plt.title('Distribution of Gold Prices (USD)')

# Plot for 'Inflation' variable

plt.subplot(3, 2, 3)
sns.histplot(data['Inflation'], kde=True)
plt.title('Distribution of Inflation')

# Plot for 'Unemployment rate' variable

plt.subplot(3, 2, 4)
sns.histplot(df['Unemployment rate'], kde=True)
plt.title('Distribution of Unemployment rate')

# Plot for 'Interest rates' variable

plt.subplot(3, 2, 5)
sns.histplot(data['Interest rates'], kde=True)
plt.title('Distribution of Interest rates')

35
# Plot for 'Oil Prices (USD)' variable
plt.subplot(3, 2, 6)
sns.histplot(data['Oil Prices (USD)'], kde=True)
plt.title('Distribution of Oil Prices (USD)')

plt.tight_layout()
plt.show()

[92]: residuals_1= np.log(y) - np.log(predicted_values)

[ ]: #AGAIN running tests

[93]: from scipy.stats import shapiro

# Assuming 'log_residuals' is a variable containing the residuals after␣

↪applying the logarithmic transformation

shapiro_test_statistic, shapiro_p_value = shapiro(residuals_1)

print("Shapiro-Wilk Test:")
print("Test Statistic:", shapiro_test_statistic)
print("p-value:", shapiro_p_value)

if shapiro_p_value > 0.05:

36
print("p-value > 0.05. Fail to reject the null hypothesis. Residuals are␣
↪normally distributed.")
else:
print("p-value <= 0.05. Reject the null hypothesis. Residuals are not␣
↪normally distributed.")

Shapiro-Wilk Test:
Test Statistic: 0.45959720762702283
p-value: 2.042106007386736e-62
p-value <= 0.05. Reject the null hypothesis. Residuals are not normally
distributed.

[96]: from statsmodels.regression.linear_model import OLS

from statsmodels.stats.diagnostic import het_breuschpagan

# Fit the OLS regression model with the transformed data

model = OLS(y, X).fit()

# Compute the residuals

residuals = model.resid

# Perform the Breusch-Pagan test

bp_test_statistic, bp_p_value, _, _ = het_breuschpagan(residuals, X)

# Print the results

print("\nBreusch-Pagan Test:")
print("Test Statistic:", bp_test_statistic)
print("p-value:", bp_p_value)

Breusch-Pagan Test:
Test Statistic: 1100.672462731512
p-value: 2.3590049793082233e-241

[ ]: #RESULTS ARE THE SAME

[ ]: ------------------------------------------------------------------

[98]: #Weighted least square model

import numpy as np
import statsmodels.api as sm

# Fit a standard linear regression model

X = df[['Gold Prices (USD)']] # Independent variable
y = df['Gold Prices (USD)'] # Dependent variable

X = sm.add_constant(X) # Add constant if needed

37
model = sm.OLS(y, X).fit()

# Calculate residuals
residuals = model.resid

# Calculate weights based on the inverse of the variance of residuals

weights = 1.0 / np.var(residuals)

# Fit a weighted least squares regression model

weighted_model = sm.WLS(y, X, weights=weights).fit()

# Print summary of the weighted model

print(weighted_model.summary())

WLS Regression Results

==============================================================================
Dep. Variable: Gold Prices (USD) R-squared: 1.000
Model: WLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 3.375e+33
Date: Thu, 02 May 2024 Prob (F-statistic): 0.00
Time: 21:40:04 Log-Likelihood: 58148.
No. Observations: 2136 AIC: -1.163e+05
Df Residuals: 2134 BIC: -1.163e+05
Df Model: 1
Covariance Type: nonrobust
================================================================================
=====
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
-----
const -1.705e-13 1.42e-14 -12.028 0.000 -1.98e-13
-1.43e-13
Gold Prices (USD) 1.0000 1.72e-17 5.81e+16 0.000 1.000
1.000
==============================================================================
Omnibus: 384.389 Durbin-Watson: 0.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 620.132
Skew: 1.302 Prob(JB): 2.19e-135
Kurtosis: 3.430 Cond. No. 1.48e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.48e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

38
[ ]: #CROSS VALIDATION

[99]: ##NEed to perform CROSS VALIDATION

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Assuming X_train and y_train are your training data and labels
# Instantiate the model
model = LinearRegression()

# Perform cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5) # You can adjust␣
↪the number of folds (cv) as needed

# Print the cross-validation scores

print("Cross-validation scores:", cv_scores)

# Calculate the mean and standard deviation of the cross-validation scores

mean_cv_score = cv_scores.mean()
std_cv_score = cv_scores.std()
print("Mean CV score:", mean_cv_score)
print("Standard deviation of CV scores:", std_cv_score)

Cross-validation scores: [1. 1. 1. 1. 1.]

Mean CV score: 1.0
Standard deviation of CV scores: 0.0

[102]: from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,␣
↪random_state=42)

# Fit the model on the training data

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the training data

y_train_pred = model.predict(X_train)
training_score = r2_score(y_train, y_train_pred)

# Evaluate the model on the testing data

y_test_pred = model.predict(X_test)
testing_score = r2_score(y_test, y_test_pred)

39
print("Training R-squared:", training_score)
print("Testing R-squared:", testing_score)

Training R-squared: 1.0

Testing R-squared: 1.0

[ ]: ----------------------------------------------------------------------------------------------

[ ]: #TRYING MODELS

[ ]: #DECISION TREE

[103]: from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assuming X contains your independent variable(s) and y contains your␣

↪dependent variable

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,␣
↪random_state=42)

# Initialize the Decision Tree Regressor

tree_regressor = DecisionTreeRegressor(random_state=42)

# Fit the model on the training data

tree_regressor.fit(X_train, y_train)

# Make predictions on the testing data

y_pred_train = tree_regressor.predict(X_train)
y_pred_test = tree_regressor.predict(X_test)

# Evaluate the model

train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

print("Training R-squared:", train_r2)

print("Testing R-squared:", test_r2)

Training R-squared: 1.0

Testing R-squared: 0.9999932907263452

[105]: y_pred_dt = tree_regressor.predict(X_test)

# Calculate Mean Squared Error (MSE)

mse_dt = mean_squared_error(y_test, y_pred_dt)

40
# Calculate Root Mean Squared Error (RMSE)
rmse_dt = np.sqrt(mse_dt)

print("Decision Tree RMSE:", rmse_dt)

Decision Tree RMSE: 1.1855782531712218

[ ]: ----------------------------------------------------------------------------------------------

[ ]: #RANDOM FOREST

[107]: from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,␣
↪random_state=42)

# Initialize and fit the Random Forest Regression model

rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on training and testing data

y_train_pred_rf = rf_model.predict(X_train)
y_test_pred_rf = rf_model.predict(X_test)

# Calculate R-squared for training and testing data

r2_train_rf = r2_score(y_train, y_train_pred_rf)
r2_test_rf = r2_score(y_test, y_test_pred_rf)

print("Random Forest Regression")

print("Training R-squared:", r2_train_rf)
print("Testing R-squared:", r2_test_rf)

Random Forest Regression

Training R-squared: 0.9999980259321656
Testing R-squared: 0.999994285557632

[ ]: ----------------------------------------------------------------------------------------------

[ ]: #GRADIENT BOOSTING

[108]: # Initialize and fit the Gradient Boosting Regression model

gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)

41
# Make predictions on training and testing data
y_train_pred_gb = gb_model.predict(X_train)
y_test_pred_gb = gb_model.predict(X_test)

# Calculate R-squared for training and testing data

r2_train_gb = r2_score(y_train, y_train_pred_gb)
r2_test_gb = r2_score(y_test, y_test_pred_gb)

print("\nGradient Boosting Regression")

print("Training R-squared:", r2_train_gb)
print("Testing R-squared:", r2_test_gb)

Gradient Boosting Regression

Training R-squared: 0.9999754562135037
Testing R-squared: 0.9999573703979345

[112]: pip install xgboost

Requirement already satisfied: xgboost in

c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages
(2.0.3)Note: you may need to restart the kernel to use updated packages.

Requirement already satisfied: scipy in

c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
xgboost) (1.12.0)
Requirement already satisfied: numpy in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
xgboost) (1.26.2)

[notice] A new release of pip available: 22.2.2 -> 24.0

[notice] To update, run: python.exe -m pip install --upgrade pip

[115]: #Cross validation for Linear Regression

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Assuming X_train, X_test, y_train, y_test are your training and testing data

# 1. Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

42
# Calculate RMSE
y_pred_train_lr = linear_reg.predict(X_train)
rmse_train_lr = np.sqrt(mean_squared_error(y_train, y_pred_train_lr))
y_pred_test_lr = linear_reg.predict(X_test)
rmse_test_lr = np.sqrt(mean_squared_error(y_test, y_pred_test_lr))

# Cross-validation
cv_scores_lr = cross_val_score(linear_reg, X_train, y_train, cv=5,␣
↪scoring='neg_mean_squared_error')

cv_rmse_lr = np.sqrt(-cv_scores_lr)

# Print results
print("Linear Regression:")
print("Training RMSE:", rmse_train_lr)
print("Testing RMSE:", rmse_test_lr)
print("Mean CV RMSE:", cv_rmse_lr.mean())
print("Std CV RMSE:", cv_rmse_lr.std())

Linear Regression:
Training RMSE: 2.2829534657510607e-13
Testing RMSE: 2.3371014992128577e-13
Mean CV RMSE: 2.4273458808647473e-13
Std CV RMSE: 9.99530028073763e-14

[124]:

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:
170 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to
nan.
If these failures are not expected, you can try to debug them by setting
error_score='raise'.

Below are more details about the failures:

43
validate_parameter_constraints(
File "C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\utils\_param_validation.py", line 95, in
validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features'
parameter of RandomForestRegressor must be an int in the range [1, inf), a float
in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto'
instead.

warnings.warn(some_fits_failed_message, FitFailedWarning)
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of
the test scores are non-finite: [ nan -58.93907181 -37.90943391
-39.08803166 -43.19203625
-41.3236575 nan nan nan -57.15259269
-9.60570762 -40.55572985 nan -58.59893245 nan
-73.39584413 -18.16182559 nan -11.13065831 -49.6359911
nan -48.98262037 -11.31604215 -11.67992611 -48.91244679
-15.17315845 -27.45118465 nan -38.86970611 nan
-46.46008978 -76.8895055 -58.00598485 nan nan
-44.38463655 -11.17309129 nan -47.82243001 -43.27158338
-74.06200261 nan nan -17.23503776 -25.69643983
-60.15686948 nan -79.96635582 -27.44276843 -62.70058757
-58.44250228 nan -64.96921744 nan -63.50868913
-64.67395609 -70.42213274 -50.12022647 nan -27.87982487
nan -71.32384802 -25.74420683 nan nan
-73.91215454 -62.70968467 -38.69771057 -71.87860269 nan
-47.75998313 nan -57.17101105 -69.30109927 -27.04269829
nan -21.32999161 nan -11.68845447 -71.50186947
-58.46591787 -63.41533483 nan -25.82496422 -54.44954262
nan -25.74575808 -33.52783331 -32.28189951 nan
-71.64371947 -42.34704253 nan -4.80991776 nan
nan -25.35147124 -79.21123837 nan nan]
warnings.warn(
Best Hyperparameters: {'max_depth': 30, 'max_features': 'log2',
'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 123}
Training RMSE: 0.7535148520604379
Testing RMSE: 1.1135615099471048
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean

44
squared error, use the function'root_mean_squared_error'.
warnings.warn(

[125]: #hyperparameter tunning

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint

# Define the hyperparameters grid

param_dist = {
'n_estimators': randint(50, 200),
'max_depth': [None, 10, 20, 30],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['auto', 'sqrt', 'log2']
}

# Initialize the RandomForestRegressor

rf_regressor = RandomForestRegressor()

# Perform RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf_regressor,␣
↪param_distributions=param_dist, n_iter=100, cv=5,␣

↪scoring='neg_mean_squared_error', random_state=42)

random_search.fit(X_train, y_train)

# Get the best hyperparameters

best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)

# Initialize the RandomForestRegressor with the best hyperparameters

best_rf_regressor = RandomForestRegressor(**best_params)

# Fit the model with the best hyperparameters

best_rf_regressor.fit(X_train, y_train)

# Evaluate the model

train_rmse = mean_squared_error(y_train, best_rf_regressor.predict(X_train),␣
↪squared=False)

test_rmse = mean_squared_error(y_test, best_rf_regressor.predict(X_test),␣

↪squared=False)

print("Training RMSE:", train_rmse)

print("Testing RMSE:", test_rmse)

C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:

45
170 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to
nan.
If these failures are not expected, you can try to debug them by setting
error_score='raise'.

Below are more details about the failures:

--------------------------------------------------------------------------------
170 fits failed with the following error:
Traceback (most recent call last):
File "C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\base.py", line 1467, in wrapper
estimator._validate_params()
File "C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\base.py", line 666, in _validate_params
validate_parameter_constraints(
File "C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\utils\_param_validation.py", line 95, in
validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features'
parameter of RandomForestRegressor must be an int in the range [1, inf), a float
in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto'
instead.

46
-57.23201728 -62.8046166 nan -27.12209161 -51.13025851
nan -27.13044707 -32.43912226 -32.90987822 nan
-69.76462899 -42.51182264 nan -4.74511793 nan
nan -26.98026314 -77.91433524 nan nan]
warnings.warn(
Best Hyperparameters: {'max_depth': 30, 'max_features': 'log2',
'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 123}
Training RMSE: 0.6656633581584913
Testing RMSE: 1.1766919161273477
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(

2,3. Introduction Pandas & Matplotlib
No ratings yet
2,3. Introduction Pandas & Matplotlib
32 pages
Gold Price Forecasting Using Time Series
100% (2)
Gold Price Forecasting Using Time Series
15 pages
Exploratory Data Analysis (EDA) in Python
No ratings yet
Exploratory Data Analysis (EDA) in Python
6 pages
Importing Libraries: Import As Import As Import As From Import As From Import From Import Import
100% (1)
Importing Libraries: Import As Import As Import As From Import As From Import From Import Import
11 pages
Time Series: International University - Vnu HCMC
No ratings yet
Time Series: International University - Vnu HCMC
35 pages
Auto ARIMA - Gold Data (Draft)
No ratings yet
Auto ARIMA - Gold Data (Draft)
36 pages
Machine Learning - Multi Linear Regression Analysis
No ratings yet
Machine Learning - Multi Linear Regression Analysis
29 pages
Mod 3
No ratings yet
Mod 3
50 pages
Package Fextremes': R Topics Documented
No ratings yet
Package Fextremes': R Topics Documented
37 pages
Eco Outlook Axis
No ratings yet
Eco Outlook Axis
26 pages
ModuleAr Merged
No ratings yet
ModuleAr Merged
42 pages
Machine Learning Lab Manual
No ratings yet
Machine Learning Lab Manual
42 pages
Practical Assignment #2 Tests Your Ability
No ratings yet
Practical Assignment #2 Tests Your Ability
31 pages
Chap 1: Preparing Data and A Linear Model: Explore The Data With Some EDA
No ratings yet
Chap 1: Preparing Data and A Linear Model: Explore The Data With Some EDA
27 pages
Course Content
No ratings yet
Course Content
28 pages
ML Expt 1 Description
No ratings yet
ML Expt 1 Description
15 pages
Delhivery Mani
No ratings yet
Delhivery Mani
79 pages
MEE 6070 Data Science and Analytics: Importing Data Using Plotting The Data Checking For Linearity
No ratings yet
MEE 6070 Data Science and Analytics: Importing Data Using Plotting The Data Checking For Linearity
13 pages
ADS Ut2
No ratings yet
ADS Ut2
23 pages
Data Visualization
No ratings yet
Data Visualization
13 pages
Five Year Dataset
No ratings yet
Five Year Dataset
15 pages
Data Science Assignment Submission
No ratings yet
Data Science Assignment Submission
12 pages
Predicting Gold Prices: Working With The Time Series Data
No ratings yet
Predicting Gold Prices: Working With The Time Series Data
15 pages
DataPreparation - Outlier - Treatment ASSIGNMENT 1
100% (1)
DataPreparation - Outlier - Treatment ASSIGNMENT 1
7 pages
Data Visualization EDA-print
No ratings yet
Data Visualization EDA-print
18 pages
Unit 2
No ratings yet
Unit 2
36 pages
Project Intern - Jupyter Notebook
No ratings yet
Project Intern - Jupyter Notebook
16 pages
Project Documentation
No ratings yet
Project Documentation
10 pages
Hbaf 3105
100% (2)
Hbaf 3105
94 pages
AI PDF Presentation
No ratings yet
AI PDF Presentation
9 pages
Edp 3
No ratings yet
Edp 3
16 pages
Case - Study - MachineLearning - NSE - TATA - GLOBAL - Data - Prediction - Jupyter Notebook
No ratings yet
Case - Study - MachineLearning - NSE - TATA - GLOBAL - Data - Prediction - Jupyter Notebook
10 pages
Time Series Analysis of HDFCBANK Stock by Pavan
No ratings yet
Time Series Analysis of HDFCBANK Stock by Pavan
10 pages
Daily Transactions Project On Jupyter Notebook
No ratings yet
Daily Transactions Project On Jupyter Notebook
17 pages
TSF Sri
No ratings yet
TSF Sri
14 pages
ML
No ratings yet
ML
10 pages
Exploratory Data Analysis: by Neha Mathur
No ratings yet
Exploratory Data Analysis: by Neha Mathur
14 pages
Gold Price Prediction Using RandomForestRegressor and ML
No ratings yet
Gold Price Prediction Using RandomForestRegressor and ML
7 pages
Exploratory Data Analysis: by Neha Mathur
No ratings yet
Exploratory Data Analysis: by Neha Mathur
14 pages
Forage 1
No ratings yet
Forage 1
9 pages
Week-6 DS Practical
No ratings yet
Week-6 DS Practical
12 pages
Matplotlib Pandas Guide
No ratings yet
Matplotlib Pandas Guide
7 pages
TIME - ChatGPT Manual 001
No ratings yet
TIME - ChatGPT Manual 001
7 pages
Exercises Part2
No ratings yet
Exercises Part2
7 pages
Time Arima 002
No ratings yet
Time Arima 002
11 pages
Time Series Analysis 1718649022
No ratings yet
Time Series Analysis 1718649022
5 pages
4 PythonPandas
No ratings yet
4 PythonPandas
8 pages
Code
No ratings yet
Code
3 pages
Basic Series Analysis - Gold - Monthly
No ratings yet
Basic Series Analysis - Gold - Monthly
2 pages
Business Analytis C4
No ratings yet
Business Analytis C4
10 pages
DSBDA Prac4 2
No ratings yet
DSBDA Prac4 2
1 page
Exp 12 and 15
No ratings yet
Exp 12 and 15
4 pages
Data Preprocess Steps
No ratings yet
Data Preprocess Steps
2 pages
Financial Risk Analytics: Assignment
No ratings yet
Financial Risk Analytics: Assignment
35 pages
Program 1
No ratings yet
Program 1
1 page
Statistics Project SEM1 Notes
No ratings yet
Statistics Project SEM1 Notes
5 pages
TRANSPORTATION ENGINEERING - SW 1
No ratings yet
TRANSPORTATION ENGINEERING - SW 1
5 pages
Pierian Data - Python For Finance & Algorithmic Trading Course Notes
No ratings yet
Pierian Data - Python For Finance & Algorithmic Trading Course Notes
11 pages
Package Fextremes': September 20, 2011
No ratings yet
Package Fextremes': September 20, 2011
37 pages
Time Series Analysis Cheat Sheet
No ratings yet
Time Series Analysis Cheat Sheet
2 pages
Chapter 2 RRL
No ratings yet
Chapter 2 RRL
22 pages
Foreign Market Entry Mode in The Hotel Industry - The Impact of Country - and Firm-Specific Factors
No ratings yet
Foreign Market Entry Mode in The Hotel Industry - The Impact of Country - and Firm-Specific Factors
15 pages
Validation of The Fracture Mechanics Test METHOD EGF P1-87D (ESIS P1-901ESIS P1-92)
No ratings yet
Validation of The Fracture Mechanics Test METHOD EGF P1-87D (ESIS P1-901ESIS P1-92)
54 pages
Evaluating Core Stability: Reliability and Validity of Hand-Held Dynamometry in Measuring Core Strength
No ratings yet
Evaluating Core Stability: Reliability and Validity of Hand-Held Dynamometry in Measuring Core Strength
50 pages
Handbook of Univariate and Multivariate Data Analysis With IBM SPSS, Second Edition
0% (2)
Handbook of Univariate and Multivariate Data Analysis With IBM SPSS, Second Edition
15 pages
Sales Budget PDF
No ratings yet
Sales Budget PDF
30 pages
Unit 5 Dev 2023
No ratings yet
Unit 5 Dev 2023
23 pages
Contextual Influences On Sources of Academic Self-Efficacy
100% (1)
Contextual Influences On Sources of Academic Self-Efficacy
10 pages
Instant Download Ebook PDF Elementary Statistics A Step by Step Approach 9th Edition PDF Scribd
100% (58)
Instant Download Ebook PDF Elementary Statistics A Step by Step Approach 9th Edition PDF Scribd
41 pages
Accenture
No ratings yet
Accenture
3 pages
Armor Metrics Draft Final
No ratings yet
Armor Metrics Draft Final
16 pages
QM PDF
No ratings yet
QM PDF
48 pages
1.3.2. Feature Engineering and Variable - Transformation
No ratings yet
1.3.2. Feature Engineering and Variable - Transformation
29 pages
1.3 Poorter Et Al. (2010)
No ratings yet
1.3 Poorter Et Al. (2010)
12 pages
On The Relationship Between Socio-Cultural Factors and Language Proficiency (Case Study: Shiraz University MA Students)
No ratings yet
On The Relationship Between Socio-Cultural Factors and Language Proficiency (Case Study: Shiraz University MA Students)
18 pages
Continuous Predictive Modeling A Compara PDF
No ratings yet
Continuous Predictive Modeling A Compara PDF
18 pages
The Prognostic Value of The Hamstring Outcome Scor
No ratings yet
The Prognostic Value of The Hamstring Outcome Scor
6 pages
Turnover Begets Turnover: Nicholas G. Castle, PHD
No ratings yet
Turnover Begets Turnover: Nicholas G. Castle, PHD
10 pages
Vision-Based Real Estate Price Estimation
No ratings yet
Vision-Based Real Estate Price Estimation
9 pages
2020 Deep Learning For Mental Illness Detection Using Brain SPECT Imaging - SpringerLink
No ratings yet
2020 Deep Learning For Mental Illness Detection Using Brain SPECT Imaging - SpringerLink
10 pages
Download
No ratings yet
Download
8 pages
Applied Ergonomics: R.S. Bridger, K. Brasher, A. Dew, S. Kilminster
No ratings yet
Applied Ergonomics: R.S. Bridger, K. Brasher, A. Dew, S. Kilminster
9 pages
Supervised Machine Learning
No ratings yet
Supervised Machine Learning
8 pages
Customer Shopping Trends Dataset: Analysis of Data - Regression Model
No ratings yet
Customer Shopping Trends Dataset: Analysis of Data - Regression Model
4 pages
Stats Tutorial - Instrumental Analysis and Calibration: Errors in The Regression Equation
No ratings yet
Stats Tutorial - Instrumental Analysis and Calibration: Errors in The Regression Equation
4 pages
PSMOD - Chapter 3 - Correlation Regression Analysis
No ratings yet
PSMOD - Chapter 3 - Correlation Regression Analysis
3 pages
How To Read A Quantitative) Journal Article
No ratings yet
How To Read A Quantitative) Journal Article
6 pages
The Ultimate Guide To Auto Cad 2022 3D Modeling For 3d Drawing And Modeling
From Everand
The Ultimate Guide To Auto Cad 2022 3D Modeling For 3d Drawing And Modeling
ALLEN BENTON
No ratings yet
Zizi and The Germs
From Everand
Zizi and The Germs
Leanne Tarvin
No ratings yet
Asia Small and Medium-Sized Enterprise Monitor 2022: Volume I: Country and Regional Reviews
From Everand
Asia Small and Medium-Sized Enterprise Monitor 2022: Volume I: Country and Regional Reviews
Asian Development Bank
No ratings yet