This notebook works through four exercises in Python: (1) predicting housing prices with regression and classification, covering data extraction, preprocessing, model training with Linear Regression and Random Forest, evaluation, and labelling houses as 'Luxury' or 'Affordable' relative to the median sale price; (2) spam email detection with TF-IDF features, SVM, and Naive Bayes; (3) customer churn prediction with tree-based models; and (4) dimensionality reduction and visualization of MNIST images with PCA.


In [1]: !pip install pandas numpy scikit-learn matplotlib seaborn

Requirement already satisfied: pandas in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (2.2.2)


Requirement already satisfied: numpy in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (1.26.4)
Requirement already satisfied: scikit-learn in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (1.5.1)
Requirement already satisfied: matplotlib in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (3.9.2)
Requirement already satisfied: seaborn in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (0.13.2)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: scipy>=1.6.0 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from scikit-learn) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from scikit-learn) (3.5.0)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from matplotlib) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=8 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: six>=1.5 in c:\users\sreba\anaconda3\envs\iris_2\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)

In [22]: #QUESTION 1
#Predicting Housing Prices (Regression & Classification)
import zipfile
import os

In [21]: # List files in the current directory
current_directory = os.getcwd()
files = os.listdir(current_directory)
print(files)

['.anaconda', '.conda', '.condarc', '.continuum', '.ipynb_checkpoints', '.ipython', '.jupyter', '.matplotlib', '.spyder-py3', 'anaconda3', 'AppData', 'Application Data', 'Contacts', 'Cookies', 'Documents',
'Downloads', 'extracted_data', 'Favorites', 'house-prices-advanced-regression-techniques.zip', 'Links', 'Local Settings', 'Music', 'My Documents', 'NetHood', 'NTUSER.DAT', 'ntuser.dat.LOG1', 'ntuser.dat.LOG
2', 'NTUSER.DAT{a2332f18-cdbf-11ec-8680-002248483d79}.TM.blf', 'NTUSER.DAT{a2332f18-cdbf-11ec-8680-002248483d79}.TMContainer00000000000000000001.regtrans-ms', 'NTUSER.DAT{a2332f18-cdbf-11ec-8680-002248483d7
9}.TMContainer00000000000000000002.regtrans-ms', 'ntuser.ini', 'OneDrive', 'PrintHood', 'q1.ipynb', 'Recent', 'Saved Games', 'Searches', 'SendTo', 'Start Menu', 'Templates', 'Videos']

In [5]: import zipfile

# Replace 'your_uploaded_file.zip' with the actual name of your ZIP file
zip_file_path = 'house-prices-advanced-regression-techniques.zip'
extraction_path = 'extracted_data'  # Folder to extract files into

# Create the directory if it doesn't exist
os.makedirs(extraction_path, exist_ok=True)

# Extract the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_path)

print(f"Extracted files to {extraction_path}")

Extracted files to extracted_data

In [6]: import pandas as pd


import os

# Load the dataset


csv_file_path = os.path.join('extracted_data', 'train.csv') # Adjust the filename if necessary
df = pd.read_csv(csv_file_path)

# Preview the dataset


print(df.head())

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \


0 1 60 RL 65.0 8450 Pave NaN Reg
1 2 20 RL 80.0 9600 Pave NaN Reg
2 3 60 RL 68.0 11250 Pave NaN IR1
3 4 70 RL 60.0 9550 Pave NaN IR1
4 5 60 RL 84.0 14260 Pave NaN IR1

LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \


0 Lvl AllPub ... 0 NaN NaN NaN 0 2
1 Lvl AllPub ... 0 NaN NaN NaN 0 5
2 Lvl AllPub ... 0 NaN NaN NaN 0 9
3 Lvl AllPub ... 0 NaN NaN NaN 0 2
4 Lvl AllPub ... 0 NaN NaN NaN 0 12

YrSold SaleType SaleCondition SalePrice


0 2008 WD Normal 208500
1 2007 WD Normal 181500
2 2008 WD Normal 223500
3 2006 WD Abnorml 140000
4 2008 WD Normal 250000

[5 rows x 81 columns]

In [7]: # Check for missing values in the dataset


missing_values = df.isnull().sum()
missing_values[missing_values > 0]

Out[7]: LotFrontage 259


Alley 1369
MasVnrType 872
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
dtype: int64

In [9]: # Fill missing values for numerical columns with the median
for column in df.select_dtypes(include=['float64', 'int64']).columns:
    df[column] = df[column].fillna(df[column].median())

# Fill missing values for categorical columns with the mode
for column in df.select_dtypes(include=['object']).columns:
    df[column] = df[column].fillna(df[column].mode()[0])

# Check again for any remaining missing values
remaining_missing = df.isnull().sum().sum()
print(f"Total remaining missing values: {remaining_missing}")

Total remaining missing values: 0

In [10]: # Selecting features for prediction


features = ['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'GrLivArea', 'BedroomAbvGr']
X = df[features]

# Define target variable


y = df['SalePrice']

# Preview the features and target variable


print(X.head())
print(y.head())

LotArea OverallQual OverallCond YearBuilt GrLivArea BedroomAbvGr


0 8450 7 5 2003 1710 3
1 9600 6 8 1976 1262 3
2 11250 7 5 2001 1786 3
3 9550 7 5 1915 1717 3
4 14260 8 5 2000 2198 4
0 208500
1 181500
2 223500
3 140000
4 250000
Name: SalePrice, dtype: int64

In [11]: from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preview the sizes of the resulting datasets


print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

Training set size: 1168


Testing set size: 292

In [12]: from sklearn.linear_model import LinearRegression


from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Create and fit the Linear Regression model


linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions on the test set


y_pred_linear = linear_model.predict(X_test)

# Calculate evaluation metrics


mae_linear = mean_absolute_error(y_test, y_pred_linear)
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

# Print the results


print(f"Linear Regression - MAE: {mae_linear}, MSE: {mse_linear}, R²: {r2_linear}")

Linear Regression - MAE: 25953.42903568955, MSE: 1657205608.2597325, R²: 0.7839458761601398

In [13]: from sklearn.ensemble import RandomForestRegressor

# Create and fit the Random Forest model


rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set


y_pred_rf = rf_model.predict(X_test)

# Calculate evaluation metrics


mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Print the results


print(f"Random Forest - MAE: {mae_rf}, MSE: {mse_rf}, R²: {r2_rf}")

Random Forest - MAE: 19991.286801206785, MSE: 908571173.644018, R²: 0.8815472576912461

In [14]: # Summary of model performance


print("Model Comparison:")
print(f"{'Model':<20} {'MAE':<15} {'MSE':<15} {'R²':<15}")
print(f"{'Linear Regression':<20} {mae_linear:<15} {mse_linear:<15} {r2_linear:<15}")
print(f"{'Random Forest':<20} {mae_rf:<15} {mse_rf:<15} {r2_rf:<15}")

Model Comparison:
Model MAE MSE R²
Linear Regression 25953.42903568955 1657205608.2597325 0.7839458761601398
Random Forest 19991.286801206785 908571173.644018 0.8815472576912461
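The MSE values above are in squared dollars and hard to read at a glance; taking the square root (RMSE) puts the error back in the same units as SalePrice. A minimal sketch reusing the mse_linear and mse_rf values computed above (this cell is an addition, not part of the original run):

In [ ]: import numpy as np

# Convert MSE (squared dollars) into RMSE (dollars) for an interpretable comparison
rmse_linear = np.sqrt(mse_linear)
rmse_rf = np.sqrt(mse_rf)
print(f"Linear Regression - RMSE: {rmse_linear:,.0f}")
print(f"Random Forest - RMSE: {rmse_rf:,.0f}")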

In [15]: # Define the median sale price


median_price = y.median()

# Create a new column for classification


df['PriceCategory'] = df['SalePrice'].apply(lambda x: 'Luxury' if x > median_price else 'Affordable')

# Preview the new classification column


print(df[['SalePrice', 'PriceCategory']].head())

SalePrice PriceCategory
0 208500 Luxury
1 181500 Luxury
2 223500 Luxury
3 140000 Affordable
4 250000 Luxury

In [16]: # Define features (excluding the target variable)


X_classification = df[['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'GrLivArea', 'BedroomAbvGr']]
y_classification = df['PriceCategory']

# Check the shape of the new features and target


print(f"Features shape: {X_classification.shape}, Target shape: {y_classification.shape}")

Features shape: (1460, 6), Target shape: (1460,)

In [17]: from sklearn.model_selection import train_test_split

# Split the data into training and testing sets


X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(X_classification, y_classification, test_size=0.2, random_state=42)

# Print the sizes of the training and testing sets


print(f"Training set size: {len(X_train_class)}, Testing set size: {len(X_test_class)}")

Training set size: 1168, Testing set size: 292

In [18]: from sklearn.linear_model import LogisticRegression


from sklearn.metrics import classification_report, accuracy_score

# Create and fit the Logistic Regression model


logistic_model = LogisticRegression()
logistic_model.fit(X_train_class, y_train_class)

# Make predictions on the test set


y_pred_class = logistic_model.predict(X_test_class)

# Calculate evaluation metrics


accuracy = accuracy_score(y_test_class, y_pred_class)
report = classification_report(y_test_class, y_pred_class)

# Print the results


print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.8732876712328768
Classification Report:
precision recall f1-score support

Affordable 0.89 0.88 0.88 161


Luxury 0.86 0.86 0.86 131

accuracy 0.87 292


macro avg 0.87 0.87 0.87 292
weighted avg 0.87 0.87 0.87 292

C:\Users\sreba\anaconda3\envs\iris_2\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
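The ConvergenceWarning above indicates that the lbfgs solver stopped before converging; as the message suggests, scaling the features or raising max_iter usually resolves it. A minimal sketch of one possible fix using a StandardScaler pipeline (an addition, not part of the original notebook):

In [ ]: from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale the features so lbfgs converges, and allow extra iterations as a safety margin
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_train_class, y_train_class)
print(f"Scaled Logistic Regression accuracy: {scaled_logistic.score(X_test_class, y_test_class):.4f}")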

In [19]: from sklearn.ensemble import RandomForestClassifier

# Create and fit the Random Forest Classifier model


rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_class, y_train_class)

# Make predictions on the test set


y_pred_rf = rf_model.predict(X_test_class)

# Calculate evaluation metrics


accuracy_rf = accuracy_score(y_test_class, y_pred_rf)
report_rf = classification_report(y_test_class, y_pred_rf)

# Print the results


print(f"Accuracy: {accuracy_rf}")
print("Classification Report:")
print(report_rf)

Accuracy: 0.9315068493150684
Classification Report:
precision recall f1-score support

Affordable 0.94 0.93 0.94 161


Luxury 0.92 0.93 0.92 131

accuracy 0.93 292


macro avg 0.93 0.93 0.93 292
weighted avg 0.93 0.93 0.93 292

In [20]: import pandas as pd

# Create a summary DataFrame for model comparison, reusing the accuracies computed above
model_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest'],
    'Accuracy': [accuracy, accuracy_rf],
})

print(model_comparison)

Model Accuracy
0 Logistic Regression 0.873288
1 Random Forest 0.931507
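A single 80/20 split leaves this comparison sensitive to which rows land in the test set; k-fold cross-validation gives a more stable estimate. A minimal sketch, assuming the X_classification and y_classification variables defined earlier (not part of the original notebook):

In [ ]: from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validated accuracy for both classifiers
for name, model in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                    ('Random Forest', RandomForestClassifier(random_state=42))]:
    scores = cross_val_score(model, X_classification, y_classification, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.4f} (std = {scores.std():.4f})")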

In [ ]: #QUESTION 2
#2. Spam Email Detection (Classification & SVM)

In [23]: import zipfile
import os

# Path to your uploaded ZIP file
zip_file_path = 'emails.csv.zip'  # Update this with your file path
extraction_path = 'extracted_data2'  # Directory to extract to

# Create the directory if it doesn't exist
os.makedirs(extraction_path, exist_ok=True)

# Extract the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_path)

print(f"Extracted files to {extraction_path}")

Extracted files to extracted_data2

In [24]: import pandas as pd

# Load the dataset (replace 'your_file.csv' with the actual filename)


file_path = 'extracted_data2/emails.csv' # Update with your actual file name
df = pd.read_csv(file_path)

# Display the first few rows of the dataset


print(df.head())

text spam
0 Subject: naturally irresistible your corporate... 1
1 Subject: the stock trading gunslinger fanny i... 1
2 Subject: unbelievable new homes made easy im ... 1
3 Subject: 4 color printing special request add... 1
4 Subject: do not have money , get software cds ... 1

In [25]: # Display basic information about the dataset


print(df.info())

# Check for missing values


print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text 5728 non-null object
1 spam 5728 non-null int64
dtypes: int64(1), object(1)
memory usage: 89.6+ KB
None
text 0
spam 0
dtype: int64

In [26]: import re

def clean_text(text):
    # Remove special characters and numbers
    text = re.sub(r'\W', ' ', text)
    # Convert to lowercase
    text = text.lower()
    return text

# Apply the cleaning function to the text column
df['cleaned_text'] = df['text'].apply(clean_text)

# Display the cleaned text
print(df['cleaned_text'].head())

0 subject naturally irresistible your corporate...


1 subject the stock trading gunslinger fanny i...
2 subject unbelievable new homes made easy im ...
3 subject 4 color printing special request add...
4 subject do not have money get software cds ...
Name: cleaned_text, dtype: object

In [27]: from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer


tfidf = TfidfVectorizer()

# Fit and transform the cleaned text


X = tfidf.fit_transform(df['cleaned_text'])

# Target variable
y = df['spam']

print("Feature shape:", X.shape) # Display the shape of the feature matrix

Feature shape: (5728, 37303)

In [28]: from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the sizes of the training and testing sets


print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

Training set size: 4582


Testing set size: 1146

In [29]: from sklearn.svm import SVC


from sklearn.metrics import classification_report

# Create an SVM model with a linear kernel


svm_model = SVC(kernel='linear')

# Train the model


svm_model.fit(X_train, y_train)

# Make predictions on the test set


y_pred = svm_model.predict(X_test)

# Evaluate the model


print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.99 1.00 1.00 856


1 0.99 0.98 0.99 290

accuracy 0.99 1146


macro avg 0.99 0.99 0.99 1146
weighted avg 0.99 0.99 0.99 1146

In [30]: # Create an SVM model with an RBF kernel


svm_model_rbf = SVC(kernel='rbf')

# Train the model


svm_model_rbf.fit(X_train, y_train)

# Make predictions on the test set


y_pred_rbf = svm_model_rbf.predict(X_test)

# Evaluate the model


print(classification_report(y_test, y_pred_rbf))

precision recall f1-score support

0 0.99 1.00 0.99 856


1 1.00 0.96 0.98 290

accuracy 0.99 1146


macro avg 0.99 0.98 0.98 1146
weighted avg 0.99 0.99 0.99 1146

In [31]: from sklearn.naive_bayes import MultinomialNB

# Create a Naive Bayes model


nb_model = MultinomialNB()

# Train the model


nb_model.fit(X_train, y_train)

# Make predictions on the test set


y_pred_nb = nb_model.predict(X_test)

# Evaluate the model


print(classification_report(y_test, y_pred_nb))

precision recall f1-score support

0 0.84 1.00 0.91 856


1 1.00 0.42 0.59 290

accuracy 0.85 1146


macro avg 0.92 0.71 0.75 1146
weighted avg 0.88 0.85 0.83 1146
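Multinomial Naive Bayes is derived for count data, so the low spam recall above (0.42) often improves when the text is vectorized as raw term counts rather than TF-IDF weights. A hedged sketch of that variant; CountVectorizer here is an alternative choice, not what the notebook used:

In [ ]: from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Re-vectorize the cleaned email text as raw counts and retrain Naive Bayes
X_counts = CountVectorizer().fit_transform(df['cleaned_text'])
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_counts, y, test_size=0.2, random_state=42)
nb_counts = MultinomialNB().fit(Xc_train, yc_train)
print(classification_report(yc_test, nb_counts.predict(Xc_test)))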

In [ ]: #QUESTION 3
# Customer Churn Prediction (Classification & Tree-Based Models)

In [32]: import pandas as pd

# Load the dataset


file_path = 'Telco-Customer-Churn.csv' # Replace with your actual file name if different
df = pd.read_csv(file_path)

# Display the first few rows of the dataset


print(df.head())

customerID gender SeniorCitizen Partner Dependents tenure PhoneService \


0 7590-VHVEG Female 0 Yes No 1 No
1 5575-GNVDE Male 0 No No 34 Yes
2 3668-QPYBK Male 0 No No 2 Yes
3 7795-CFOCW Male 0 No No 45 No
4 9237-HQITU Female 0 No No 2 Yes

MultipleLines InternetService OnlineSecurity ... DeviceProtection \


0 No phone service DSL No ... No
1 No DSL Yes ... Yes
2 No DSL Yes ... No
3 No phone service DSL Yes ... Yes
4 No Fiber optic No ... No

TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \


0 No No No Month-to-month Yes
1 No No No One year No
2 No No No Month-to-month Yes
3 Yes No No One year No
4 No No No Month-to-month Yes

PaymentMethod MonthlyCharges TotalCharges Churn


0 Electronic check 29.85 29.85 No
1 Mailed check 56.95 1889.5 No
2 Mailed check 53.85 108.15 Yes
3 Bank transfer (automatic) 42.30 1840.75 No
4 Electronic check 70.70 151.65 Yes

[5 rows x 21 columns]

In [33]: # Check for missing values


print(df.isnull().sum())

# Convert TotalCharges to numeric (there may be some non-numeric values)


df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Drop rows with missing values (if any)


df.dropna(inplace=True)

# Convert the 'Churn' column to a binary numeric variable (1 for Yes, 0 for No)
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Select relevant features and the target variable


# Drop customerID (an identifier), the Churn target, and TotalCharges, which is kept out of the feature set here
X = df.drop(columns=['customerID', 'Churn', 'TotalCharges'])  # Features
y = df['Churn'] # Target

# One-hot encode categorical features


X = pd.get_dummies(X, drop_first=True)

# Display the processed features and target variable shapes


print("Features shape:", X.shape)
print("Target shape:", y.shape)

customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
Features shape: (7032, 29)
Target shape: (7032,)

In [34]: from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the sizes of the training and testing sets


print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])

Training set size: 5625


Testing set size: 1407

In [35]: from sklearn.tree import DecisionTreeClassifier


from sklearn.metrics import classification_report, accuracy_score

# Initialize the Decision Tree Classifier


dt_model = DecisionTreeClassifier(random_state=42)

# Fit the model to the training data


dt_model.fit(X_train, y_train)

# Make predictions on the testing set


y_pred_dt = dt_model.predict(X_test)

# Evaluate the model


accuracy_dt = accuracy_score(y_test, y_pred_dt)
print("Decision Tree Accuracy:", accuracy_dt)

# Print the classification report


print(classification_report(y_test, y_pred_dt))

Decision Tree Accuracy: 0.7157071783937455


precision recall f1-score support

0 0.81 0.80 0.81 1033


1 0.47 0.48 0.48 374

accuracy 0.72 1407


macro avg 0.64 0.64 0.64 1407
weighted avg 0.72 0.72 0.72 1407

In [36]: from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier


rf_model = RandomForestClassifier(random_state=42)

# Fit the model to the training data


rf_model.fit(X_train, y_train)

# Make predictions on the testing set


y_pred_rf = rf_model.predict(X_test)

# Evaluate the model


accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)

# Print the classification report


print(classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.7718550106609808


precision recall f1-score support

0 0.82 0.88 0.85 1033


1 0.59 0.46 0.52 374

accuracy 0.77 1407


macro avg 0.71 0.67 0.68 1407
weighted avg 0.76 0.77 0.76 1407

In [37]: import pandas as pd


import matplotlib.pyplot as plt

# Get feature importances from the Random Forest model


importances = rf_model.feature_importances_

# Create a DataFrame for better visualization


feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
})

# Sort the DataFrame by importance


feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Plot the feature importances


plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'][:10], feature_importance_df['Importance'][:10], color='skyblue')
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances in Random Forest Model')
plt.gca().invert_yaxis()
plt.show()

In [38]: from sklearn.linear_model import LogisticRegression


from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Initialize classifiers
logistic_model = LogisticRegression(max_iter=1000)
decision_tree_model = DecisionTreeClassifier()

# Fit Logistic Regression model


logistic_model.fit(X_train, y_train)
logistic_predictions = logistic_model.predict(X_test)

# Fit Decision Tree model


decision_tree_model.fit(X_train, y_train)
dt_predictions = decision_tree_model.predict(X_test)

# Compare performance
results = {
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest'],
    'Accuracy': [logistic_model.score(X_test, y_test),
                 decision_tree_model.score(X_test, y_test),
                 rf_model.score(X_test, y_test)]
}

# Create DataFrame for results


results_df = pd.DataFrame(results)

# Print classification reports


print("Logistic Regression Classification Report:")
print(classification_report(y_test, logistic_predictions))

print("Decision Tree Classification Report:")


print(classification_report(y_test, dt_predictions))

# Print results DataFrame


print("\nModel Comparison Results:")
print(results_df)

Logistic Regression Classification Report:


precision recall f1-score support

0 0.84 0.89 0.86 1033


1 0.63 0.53 0.57 374

accuracy 0.79 1407


macro avg 0.73 0.71 0.72 1407
weighted avg 0.78 0.79 0.79 1407

Decision Tree Classification Report:


precision recall f1-score support

0 0.81 0.79 0.80 1033


1 0.47 0.50 0.48 374

accuracy 0.71 1407


macro avg 0.64 0.65 0.64 1407
weighted avg 0.72 0.71 0.72 1407

Model Comparison Results:


Model Accuracy
0 Logistic Regression 0.791756
1 Decision Tree 0.714996
2 Random Forest 0.771855
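All three churn models recall only about half of the actual churners (class 1), which matters more than raw accuracy if the goal is retention. One common adjustment is to reweight the minority class during training; a minimal sketch using class_weight='balanced' (an assumption, not something the notebook does):

In [ ]: from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Up-weight churners during training, trading some precision for higher recall on class 1
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_balanced.fit(X_train, y_train)
print(classification_report(y_test, rf_balanced.predict(X_test)))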

In [ ]: #QUESTION 4
# Image Dimensionality Reduction (Dimensionality Reduction & Visualization)

In [44]: import zipfile
import os

# Path to your zip file
zip_file_path = 'archive.zip'
extract_folder = 'extracted_data4'

# Create a directory to extract the files if it doesn't exist
if not os.path.exists(extract_folder):
    os.makedirs(extract_folder)

# Unzip the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_folder)

# List the extracted files
extracted_files = os.listdir(extract_folder)
print("Extracted files:", extracted_files)

Extracted files: ['mnist_test.csv', 'mnist_train.csv']

In [45]: import pandas as pd

# Load the training dataset


train_data = pd.read_csv(os.path.join(extract_folder, 'mnist_train.csv'))

# Display the first few rows of the training dataset


print(train_data.head())

label 1x1 1x2 1x3 1x4 1x5 1x6 1x7 1x8 1x9 ... 28x19 28x20 \
0 5 0 0 0 0 0 0 0 0 0 ... 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0
2 4 0 0 0 0 0 0 0 0 0 ... 0 0
3 1 0 0 0 0 0 0 0 0 0 ... 0 0
4 9 0 0 0 0 0 0 0 0 0 ... 0 0

28x21 28x22 28x23 28x24 28x25 28x26 28x27 28x28


0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0

[5 rows x 785 columns]

In [46]: # Separate features and labels


X = train_data.drop(columns=['label'])
y = train_data['label']

# Standardize the data (mean=0, variance=1)


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Display the shape of the features and labels


print("Features shape:", X_scaled.shape)
print("Labels shape:", y.shape)

Features shape: (60000, 784)


Labels shape: (60000,)

In [47]: from sklearn.decomposition import PCA


import matplotlib.pyplot as plt

# Apply PCA to reduce to 2 dimensions


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize the reduced data


plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.5, s=1)
plt.colorbar(scatter, label='Digit Label')
plt.title('PCA of MNIST Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()

In [48]: # Explained variance ratio


explained_variance = pca.explained_variance_ratio_

# Plot explained variance


plt.figure(figsize=(10, 5))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.title('Explained Variance by PCA Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Variance Explained')
plt.grid()
plt.xticks(range(1, len(explained_variance) + 1))
plt.show()

# Cumulative explained variance


cumulative_variance = explained_variance.cumsum()
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Cumulative Explained Variance by PCA Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.grid()
plt.xticks(range(1, len(cumulative_variance) + 1))
plt.axhline(y=0.95, color='r', linestyle='--') # 95% threshold
plt.show()

In [ ]: # Summary of Dimensionality Reduction with PCA


# PCA effectively reduces the MNIST dataset's dimensionality while retaining a significant portion of variance.
# The first few principal components capture the majority of the information, allowing for efficient visualization and analysis.
# However, there is a trade-off between dimensionality reduction and potential information loss,
# highlighting the importance of balancing computational efficiency with data detail.
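Two principal components capture only a small share of MNIST's variance, so for modelling (rather than 2-D visualization) it is common to keep enough components to reach a variance target instead. A minimal sketch, assuming the X_scaled array from above; passing a float to n_components tells PCA to retain the smallest number of components whose cumulative explained variance reaches that fraction:

In [ ]: from sklearn.decomposition import PCA

# Keep the smallest number of components that explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print("Components retained:", pca_95.n_components_)
print("Reduced feature shape:", X_reduced.shape)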
