### Coding Question: Building a Machine Learning Model to Predict Housing Prices
**Problem Statement:**
You are given a dataset containing various features of houses along with their prices. Your task is to
build a machine learning model to predict the prices of houses based on their features. You will use the
popular Boston Housing dataset for this task.
**Dataset:**
The dataset consists of 13 features (items 1 to 13) and the target variable, MEDV (item 14):
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town
13. LSTAT: % lower status of the population
14. MEDV: median value of owner-occupied homes in $1000s (the target to predict)
**Tasks:**
1. Load and explore the dataset.
2. Preprocess the data.
3. Split the data into training and testing sets.
4. Train a machine learning model (e.g., Linear Regression).
5. Evaluate the model.
6. Make predictions using the trained model.
### Step-by-Step Solution
#### 1. Load and Explore the Dataset
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
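from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Column names, in the order used by the dataset
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# Load the dataset. load_boston() was deprecated in scikit-learn 1.0 and
# removed in 1.2, so fall back to the original data file on newer versions.
try:
    from sklearn.datasets import load_boston
    boston = load_boston()
    data, target = boston.data, boston.target
except ImportError:
    # Alternative given in scikit-learn's removal notice: each record spans
    # two rows of the raw file, so the rows are stitched back together.
    raw_df = pd.read_csv("http://lib.stat.cmu.edu/datasets/boston",
                         sep=r"\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]
boston_df = pd.DataFrame(data, columns=feature_names)
boston_df['MEDV'] = target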
# Display the first few rows of the dataset
print(boston_df.head())
# Summary statistics
print(boston_df.describe())
# Check for missing values
print(boston_df.isnull().sum())
# Correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(boston_df.corr(), annot=True, cmap='coolwarm')
plt.show()
```
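A full correlation heatmap can be hard to scan when only the relationship with the target matters. As a complementary sketch, the features can be ranked by the strength of their correlation with the target. The helper `rank_by_target_correlation` and the synthetic `demo_df` are illustrative assumptions, not part of the solution above; with the real data you would pass `boston_df` and `'MEDV'`.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(df, target):
    """Return feature-target Pearson correlations, strongest first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in data (hypothetical values, only to keep the sketch self-contained)
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)
lstat = rng.normal(12.0, 5.0, n)
noise = rng.normal(0.0, 1.0, n)
medv = 5.0 * rm - 0.8 * lstat + rng.normal(0.0, 1.0, n)
demo_df = pd.DataFrame({'RM': rm, 'LSTAT': lstat, 'NOISE': noise, 'MEDV': medv})

ranking = rank_by_target_correlation(demo_df, 'MEDV')
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top, where they belong for a linear model.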
#### 2. Preprocess and Split the Data
```python
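# Features and target variable
X = boston_df.drop('MEDV', axis=1)
y = boston_df['MEDV']
# Split first, so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize: fit the scaler on the training set only, then apply it to both sets.
# Fitting on the full dataset would leak test-set statistics into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)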
```
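One way to keep scaling and model fitting tied together is scikit-learn's `make_pipeline`, which fits the scaler only on the data passed to `fit` and reuses those statistics at prediction time. The following is a minimal sketch on synthetic data from `make_regression`; the names `X_demo`, `y_demo`, and `pipe` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (13 columns, like the real data)
X_demo, y_demo = make_regression(n_samples=300, n_features=13, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() standardizes using the training data only, then fits the regressor
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# score() scales X_te with the stored training statistics before predicting
test_r2 = pipe.score(X_te, y_te)
print("Test R-squared:", test_r2)
```

With a pipeline, raw (unscaled) rows can be passed straight to `pipe.predict`, so there is no separate scaler to keep track of.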
#### 3. Train a Machine Learning Model (Linear Regression)
```python
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Model coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```
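`LinearRegression` fits ordinary least squares: it chooses the coefficients and intercept that minimize the sum of squared residuals. To make that concrete, the sketch below solves the same kind of problem directly with `np.linalg.lstsq` on synthetic data whose true parameters are known. The data and variable names are hypothetical and serve only as an illustration.

```python
import numpy as np

# Synthetic data with known parameters (hypothetical values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.01, size=100)

# Append a column of ones so the intercept is estimated alongside the coefficients
design = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
coef, intercept = solution[:-1], solution[-1]

print("Coefficients:", coef)
print("Intercept:", intercept)
```

Because the features in the main example are standardized, each entry of `model.coef_` there reads as the change in MEDV per one standard deviation of the corresponding feature.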
#### 4. Evaluate the Model
```python
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
```
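To show what the two metrics measure, here is a small sketch that computes them from their definitions, plus the root mean squared error (RMSE). The function `regression_metrics` and the four sample values are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R-squared) computed from their definitions."""
    n = len(y_true)
    # Sum of squared residuals between actual and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the actual values
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot

mse_demo, rmse_demo, r2_demo = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(mse_demo, rmse_demo, r2_demo)
```

RMSE is in the same units as the target, so for MEDV it reads directly as an error in $1000s, which is often easier to interpret than MSE.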
#### 5. Make Predictions Using the Trained Model
```python
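# Predict on new data (example values, in the same column order as X)
new_data = pd.DataFrame(
    [[0.1, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.0900, 1, 296.0, 15.3, 396.90, 4.98]],
    columns=X.columns,
)
# Reuse the scaler fitted earlier; do not refit it on new data
new_data_scaled = scaler.transform(new_data)
predicted_price = model.predict(new_data_scaled)
# MEDV is expressed in $1000s
print("Predicted price (in $1000s):", predicted_price[0])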
```
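The call to `scaler.transform` matters because the model only understands inputs on the scale it was trained on. As a rough sketch of what happens inside, the snippet below standardizes a new row using the mean and standard deviation stored from the training data. The tiny `train` matrix and the `standardize` helper are hypothetical.

```python
import numpy as np

# Hypothetical training matrix: two features on very different scales
train = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0]])

# Statistics are learned from the training data only.
# StandardScaler uses the population standard deviation (ddof=0), as here.
mean = train.mean(axis=0)
std = train.std(axis=0)

def standardize(rows):
    """Apply the stored training mean and standard deviation."""
    return (np.asarray(rows, dtype=float) - mean) / std

new_row = np.array([[2.0, 500.0]])
print(standardize(new_row))
```

The new row is expressed relative to the training distribution: its first feature sits exactly at the training mean, and its second lies well above it.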
### Explanation of the Code
1. **Loading and Exploring the Dataset**:
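- The Boston Housing dataset is loaded using `load_boston()` from `sklearn.datasets`. This function was deprecated in scikit-learn 1.0 and removed in 1.2, so newer versions need another data source.
- The dataset is converted into a DataFrame for easier exploration and manipulation.
- Summary statistics, a missing-value count, and a correlation heatmap are generated to understand the data better.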
2. **Preprocessing the Data**:
- Features (`X`) and target variable (`y`) are separated.
- The features are standardized to zero mean and unit variance using `StandardScaler`.
- The dataset is split into training (80%) and testing (20%) sets using `train_test_split`, with a fixed `random_state` for reproducibility.
3. **Training the Model**:
- A Linear Regression model is instantiated and trained on the training data.
- Model coefficients and intercept are printed.
4. **Evaluating the Model**:
- Predictions are made on the testing set.
- Mean Squared Error (MSE) and R-squared (R²) are calculated to evaluate the model's performance.
- A scatter plot is generated to visualize the actual vs predicted values.
5. **Making Predictions**:
- An example of making a prediction on new data is provided. The new data is scaled using the same
scaler used during training, and the model predicts the house price.
This example covers the full workflow for building a machine learning model to predict
housing prices: data loading, preprocessing, model training, evaluation, and prediction.