Assignment 9[1]
The objective of this topic is to understand how to use machine learning models for predictive
analysis with popular Python libraries such as Pandas, NumPy, Matplotlib, and Scikit-learn. We will
explore how to preprocess data, choose an appropriate model, train it, and make predictions.
Before developing any model, data must be loaded, cleaned, and explored. This step involves
removing missing values, encoding categorical variables, and exploring the dataset to find patterns.
Example:
import pandas as pd
import matplotlib.pyplot as plt
# Loading data
df = pd.read_csv('data.csv')
# Counting missing values in each column
print(df.isnull().sum())
# Visualizing the distribution of one column
df['column_name'].hist()
plt.show()
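The cleaning and encoding steps described above can be sketched as follows. This is a minimal illustration on a small hand-made table; the column names (area, city, price) and the choice of median filling are assumptions for the example, not part of the assignment's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
df = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0],
    "city": ["Pune", "Delhi", "Pune", None],
    "price": [30.0, 55.0, 42.0, 90.0],
})

# Count missing values per column before cleaning
missing_before = df.isnull().sum()

# Fill the numeric gap with the column median (one possible strategy)
df["area"] = df["area"].fillna(df["area"].median())

# Drop rows whose categorical value is missing
df = df.dropna(subset=["city"])

# One-hot encode the categorical column so models can consume it
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```

Whether to fill or drop missing values depends on the dataset; filling keeps more rows, while dropping avoids introducing guessed values.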
Selecting important features (variables) is essential for building a predictive model. Feature
engineering helps in creating new features that will help the model to predict better.
Example:
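A minimal sketch of feature engineering, assuming a hypothetical housing table. The column names and the reference year below are illustrative and are not taken from the assignment's dataset.

```python
import pandas as pd

# Hypothetical raw features
df = pd.DataFrame({
    "area_sqft": [1500, 1800, 1000],
    "rooms": [3, 4, 2],
    "year_built": [1990, 2010, 2005],
})

# Ratio feature: average area available per room
df["area_per_room"] = df["area_sqft"] / df["rooms"]

# Age feature relative to a fixed, assumed reference year
REFERENCE_YEAR = 2024
df["house_age"] = REFERENCE_YEAR - df["year_built"]

print(df)
```

Derived columns like these can expose relationships that the raw columns only carry indirectly.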
The data needs to be divided into training and testing sets. Typically, we use 80% of the data for
training and 20% for testing.
Example:
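from sklearn.model_selection import train_test_split
X = df.drop('target_column', axis=1)
y = df['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% train, 20% test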
4. Model Selection
For prediction tasks, several machine learning models can be used, such as:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)  # learn the coefficients from the training set
5. Model Evaluation
Once the model is trained, we evaluate its performance using metrics suited to the task: mean
squared error (MSE) and R-squared for regression, or accuracy for classification.
from sklearn.metrics import r2_score
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2}')
Visualization is essential to understand the model's predictions versus the actual values. Matplotlib
and Seaborn are commonly used for visualizing the results of the predictions.
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted')
plt.show()
2. Data Cleaning:
o Remove or fill missing values.
3. Splitting Data:
o Split the dataset into training and testing sets using Scikit-learn's train_test_split().
5. Model Evaluation:
o Evaluate model performance using metrics like accuracy, mean squared error (MSE),
etc.
6. Visualization:
o Visualize the model's predictions and compare them with actual values.
Answer:
The steps in model development include:
6. Model Evaluation: Using metrics like accuracy, mean squared error (MSE), and R-squared.
Q2: What is the importance of splitting the data into training and testing sets?
Answer:
Splitting the data ensures that the model is evaluated on unseen data, which helps in assessing its
performance. The model is trained on the training set and tested on the testing set, allowing us to
determine how well it generalizes to new, unseen data.
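The idea can be illustrated with a small sketch on synthetic data. The data-generating line (a noisy straight line) and the random seeds are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# Hold out 20% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R-squared for regressors
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"Train R-squared: {train_r2:.3f}")
print(f"Test R-squared:  {test_r2:.3f}")
```

The test score is the one that estimates performance on new data; a training score far above the test score is a common sign of overfitting.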
Q3: What is the difference between Linear Regression and Logistic Regression?
Answer:
Linear Regression is used for predicting continuous numerical values (e.g., house prices,
stock prices).
Logistic Regression is used for classification tasks, where the output is categorical (e.g.,
predicting if an email is spam or not).
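A minimal sketch contrasting the two models on the same input. The "hours studied" data, the exam-score rule, and the pass threshold are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical input: hours studied, from 1 to 10
X = np.arange(1, 11, dtype=float).reshape(-1, 1)

# Continuous target (an exam score) -> Linear Regression
y_score = 5.0 * X[:, 0] + 40.0
reg = LinearRegression().fit(X, y_score)
predicted_score = reg.predict([[12.0]])[0]

# Categorical target (pass = 1, fail = 0) -> Logistic Regression
y_pass = (X[:, 0] >= 6).astype(int)
clf = LogisticRegression().fit(X, y_pass)
predicted_label = clf.predict([[9.0]])[0]
pass_probability = clf.predict_proba([[9.0]])[0, 1]

print(f"Predicted score for 12 hours: {predicted_score:.1f}")
print(f"Predicted class for 9 hours: {predicted_label}")
print(f"Estimated pass probability for 9 hours: {pass_probability:.2f}")
```

The regressor outputs a number on a continuous scale, while the classifier outputs a class label together with an estimated probability for each class.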
Q4: What evaluation metrics would you use for regression and classification models?
Answer:
For Regression: Metrics such as Mean Squared Error (MSE), R-squared, Mean Absolute
Error (MAE).
For Classification: Metrics such as Accuracy, Precision, Recall, F1-Score, Confusion Matrix.
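The metrics listed above are available in Scikit-learn's metrics module. The sketch below computes them on small hand-made label and prediction lists, which are assumptions chosen so the numbers are easy to check by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Regression: hypothetical actual and predicted values
y_true_reg = [3.0, 5.0, 7.0]
y_pred_reg = [2.5, 5.0, 8.0]
mae = mean_absolute_error(y_true_reg, y_pred_reg)
mse = mean_squared_error(y_true_reg, y_pred_reg)

# Classification: hypothetical actual and predicted labels
y_true_clf = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_clf = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy = accuracy_score(y_true_clf, y_pred_clf)
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)
# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_clf, y_pred_clf)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}")
print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1: {f1}")
print(cm)
```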
Answer:
Overfitting can be handled by:
Using cross-validation.
Reducing the complexity of the model by pruning decision trees or using fewer features.
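Reducing model complexity can be sketched with a decision tree whose depth is capped, which has a similar effect to pruning. The synthetic sine-wave data, the seeds, and the depth limit of 3 are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Unrestricted tree: free to memorize the training noise
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Depth-limited tree: a simpler model
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("deep", deep), ("shallow", shallow)]:
    train_r2 = tree.score(X_train, y_train)
    test_r2 = tree.score(X_test, y_test)
    print(f"{name}: train R-squared={train_r2:.3f}, test R-squared={test_r2:.3f}")
```

A large gap between the training and test scores of the unrestricted tree would indicate overfitting; the depth-limited tree typically shows a smaller gap.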
Answer:
Feature engineering involves creating new features from existing data that make the predictive
model more effective. It helps the model by improving its ability to detect patterns and relationships
in the data.
Answer:
Cross-validation is a technique where the dataset is split into several subsets (folds), and the model
is repeatedly trained on all but one fold and evaluated on the held-out fold. Averaging the scores
across folds gives a more reliable estimate of how well the model generalizes to unseen data than a
single train/test split, and it helps in detecting overfitting.
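A minimal sketch of 5-fold cross-validation with Scikit-learn's cross_val_score. The synthetic data, the number of folds, and the seeds are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with two features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

# Five folds, shuffled once with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold serves as the evaluation set exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R-squared per fold:", np.round(scores, 3))
print(f"Mean R-squared: {scores.mean():.3f}")
```

The spread of the per-fold scores gives a rough sense of how sensitive the estimate is to the particular split.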
Q8: Explain the term "model evaluation" and list some evaluation metrics.
Answer:
Model evaluation refers to the process of assessing the performance of a trained model using test
data. Common evaluation metrics include:
Mean Squared Error (MSE): Measures the average squared difference between predicted
and actual values (for regression).
R-squared: The proportion of variance in the dependent variable that is predictable from the
independent variables (for regression).
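Both definitions can be written out directly with NumPy and checked against Scikit-learn. The four actual and predicted values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE by hand: {mse}, with Scikit-learn: {mean_squared_error(y_true, y_pred)}")
print(f"R-squared by hand: {r2}, with Scikit-learn: {r2_score(y_true, y_pred)}")
```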
Answer:
The results of a regression model can be visualized by plotting:
Q10: What is the purpose of using Matplotlib and Seaborn in model development?
Answer:
Matplotlib and Seaborn are used for visualizing the data, helping to explore relationships, trends, and
patterns. They are essential for model evaluation, visualizing predictions, and understanding data
distributions.
Task Function/Method
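import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# Load data
df = pd.read_csv('housing_data.csv')
# Feature selection ('target_column' is a placeholder for the price column)
X = df.drop('target_column', axis=1)
y = df['target_column']
# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2}')
# Visualize predictions
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()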