Coding Question
**Problem Statement:**
You are given a dataset containing various features of houses along with their prices. Your task is to
build a machine learning model to predict the prices of houses based on their features. You will use the
popular Boston Housing dataset for this task.
**Dataset:**
2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
**Tasks:**
```python
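import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load_boston was removed in scikit-learn 1.2, so this needs an older release
from sklearn.datasets import load_boston

# Load the dataset and convert it to a DataFrame
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target
print(boston_df.head())
# Summary statistics
print(boston_df.describe())
# Check for missing values
print(boston_df.isnull().sum())
# Correlation matrix, drawn as a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(boston_df.corr(), annot=True, cmap='coolwarm')
plt.show()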
```
```python
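from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = boston_df.drop('MEDV', axis=1)
y = boston_df['MEDV']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Hold out 20% of rows for testing; test_size and random_state are assumed values
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)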
```
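As a rough illustration of what the scaling step does, the sketch below standardizes a single column in plain Python. The helper names and the sample values are made up for this example; `StandardScaler` itself works on whole arrays, but it applies the same z-score formula with the population standard deviation. The key point is that the mean and standard deviation are learned once and then reused for any new data.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return the (mean, std) of a column, using the population std."""
    return fmean(column), pstdev(column)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Made-up values standing in for one feature column
train_rooms = [5.0, 6.0, 7.0]
mean, std = fit_standardizer(train_rooms)

# The fitted statistics are reused for new data rather than recomputed
train_scaled = standardize(train_rooms, mean, std)
new_scaled = standardize([6.5], mean, std)
print(train_scaled, new_scaled)
```

After this transformation the fitted column has mean 0 and standard deviation 1, while the new value is expressed on the same scale.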
```python
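from sklearn.linear_model import LinearRegression
# A linear model is assumed here, consistent with the coef_/intercept_ output below
model = LinearRegression()
model.fit(X_train, y_train)
# Model coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)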
```
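To make the printed coefficients and intercept more concrete, here is a minimal sketch of how a linear model turns them into a prediction: the intercept plus a weighted sum of the (scaled) features. The function name and all numbers are hypothetical and are not taken from the fitted model.

```python
def predict_linear(coefficients, intercept, features):
    """Return intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical two-feature model, for illustration only
coefficients = [2.0, -0.5]
intercept = 10.0
scaled_features = [1.0, 2.0]

prediction = predict_linear(coefficients, intercept, scaled_features)
print(prediction)
```

Because the features are standardized, each coefficient can be read as the change in predicted price for a one-standard-deviation change in that feature.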
```python
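from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Plot predicted prices against actual prices
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()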
```
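The two evaluation metrics can also be computed by hand, which may help when interpreting them. The sketch below uses plain Python and made-up numbers; the function names are illustrative and are kept distinct from the scikit-learn functions of similar name.

```python
def mse_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up actual and predicted values, for illustration only
y_true_demo = [20.0, 22.0, 24.0, 26.0]
y_pred_demo = [21.0, 21.0, 25.0, 25.0]

mse_demo = mse_manual(y_true_demo, y_pred_demo)
r2_demo = r2_manual(y_true_demo, y_pred_demo)
print("MSE:", mse_demo)
print("R-squared:", r2_demo)
```

A lower MSE means smaller errors on average, and an R² closer to 1 means the model explains more of the variance in the target.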
```python
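# One new house, with the 13 features in the same order as the training columns
new_data = np.array([[0.1, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.0900, 1, 296.0, 15.3, 396.90, 4.98]])
# Reuse the scaler fitted earlier; do not refit it on the new data
new_data_scaled = scaler.transform(new_data)
predicted_price = model.predict(new_data_scaled)
print("Predicted price:", predicted_price[0])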
```
- The dataset is converted into a DataFrame for easier exploration and manipulation.
- Summary statistics and correlation matrix are generated to understand the data better.
- The dataset is split into training and testing sets using `train_test_split`.
- Mean Squared Error (MSE) and R-squared (R²) are calculated to evaluate the model's performance.
5. **Making Predictions**:
- An example of making a prediction on new data is provided. The new data is scaled using the same
scaler used during training, and the model predicts the house price.
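The train/test split mentioned above can be pictured as shuffling the row indices and setting a fraction aside. The following is a simplified sketch in plain Python, not scikit-learn's implementation of `train_test_split`; the function name, the rounding rule, and the default values are assumptions made for this example.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print("train rows:", train_idx)
print("test rows:", test_idx)
```

Every row lands in exactly one of the two sets, so the model is evaluated only on rows it did not see during training.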
This example covers the full workflow for predicting housing prices with a machine learning model:
data loading, preprocessing, model training, evaluation, and prediction.