Unit 6
The goal is for the model to learn the relationship between inputs and outputs
so that it can accurately predict outcomes for new, unseen data.
"Supervised Learning is a machine learning paradigm where an algorithm learns
a function that maps an input (X) to an output (Y) based on example input-
output pairs (X, Y), minimizing the error between predicted and actual outputs.“
• Key Characteristics:
✔ Labeled Data – Training data has both inputs and correct outputs.
This unit covers:
✔ What it is 🤔
✔ How it works ⚙️
✔ Key algorithms 🏆
For example:
If you train a model with past exam questions and answers, it will learn patterns
to predict answers for future questions.
2. How Does Supervised Learning Work?
1️⃣ Data Collection – Gather labeled data (e.g., images of cats and dogs with labels).
2️⃣ Model Training – The algorithm learns the mapping from inputs to outputs using the labeled examples.
3️⃣ Evaluation – Test the model on unseen data to check how well it predicts.
4️⃣ Prediction – Use the trained model on new inputs, as sketched below.
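A minimal Python sketch of this workflow (the toy data and model choice are illustrative, not from the source):
from sklearn.linear_model import LogisticRegression

# 1. Data collection: labeled examples (inputs X, correct outputs y)
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# 2. Model training: learn the mapping X -> y
model = LogisticRegression().fit(X, y)

# 3./4. Evaluation & prediction on new, unseen inputs
print(model.predict([[2.5], [11.5]]))  # expected: [0 1]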
Types of Regression:
1. Linear Regression – Models relationships using a straight line (e.g., predicting house prices).
2. Polynomial Regression – Uses polynomial functions to model non-linear relationships.
3. Ridge and Lasso Regression – Regularized versions of linear regression to prevent overfitting.
4. Decision Tree Regression – Splits data into smaller regions for prediction.
5. Random Forest Regression – An ensemble of decision trees for better accuracy.
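A minimal sketch comparing three of these regressors on toy data (the values are illustrative):
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Toy data with a slightly non-linear trend
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.2, 4.8, 7.1])

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)  # polynomial regression
ridge = Ridge(alpha=1.0).fit(X, y)  # L2-regularized linear regression

# Compare predictions for a new input
print(linear.predict([[6]]), poly.predict([[6]]), ridge.predict([[6]]))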
Types of Classification:
1. Logistic Regression – Uses a sigmoid function to predict probabilities (e.g., spam or not spam).
2. K-Nearest Neighbors (KNN) – Classifies based on the majority class of the K nearest neighbors.
3. Support Vector Machine (SVM) – Finds the best boundary (hyperplane) to separate classes.
4. Decision Trees – Splits data into branches based on feature values.
5. Random Forest – Uses multiple decision trees to improve accuracy and reduce overfitting.
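A minimal sketch fitting two of these classifiers on synthetic data (illustrative, not from the source):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=100, n_features=4, random_state=42)

log_reg = LogisticRegression().fit(X, y)             # sigmoid-based probabilities
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # majority vote of 5 neighbors

print(log_reg.predict_proba(X[:1]))  # class probabilities for one sample
print(knn.predict(X[:3]))            # predicted classes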
🚧 Underfitting – When the model is too simple to learn the patterns in the data (solution: use a more complex model or add more features).
Y = mX + b
Where:
Y = Predicted output (dependent variable)
X = Input feature (independent variable)
m = Slope of the line (coefficient)
b = Y-intercept (bias)
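For example, with illustrative values m = 2 and b = 1, an input of X = 3 gives Y = 2(3) + 1 = 7.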
1️⃣ House Price Prediction:
💡 Scenario:
Suppose we want to predict the price of a house based on factors like area (square feet) and the number of bedrooms.
✅ Multiple Linear Regression uses multiple predictors (e.g., house price → size, bedrooms, location).
[Figure: simple linear regression, where the red dashed line is the regression line, showing the trend that salary increases with experience.]
2️⃣ Multiple Linear Regression (House Price vs. Size & Bedrooms):
[Figure: each point in the 3D space represents a house; the red surface is the regression plane fitted through the data points, showing how price varies with both size and number of bedrooms.]
A minimal sketch of fitting such a model appears below.
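A sketch of multiple linear regression with two predictors (the data values are illustrative, not from the source):
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: [size_sqft, bedrooms] -> price
X = np.array([[1400, 3], [1600, 3], [1875, 4], [1100, 2]])
y = np.array([245000, 312000, 308000, 199000])

model = LinearRegression().fit(X, y)
print(model.coef_)                 # one coefficient per feature (size, bedrooms)
print(model.intercept_)            # bias term
print(model.predict([[1500, 3]]))  # predicted price for a new house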
✅ 2.2 Assumptions of Linear Regression
For linear regression to work properly, the following assumptions should hold:
✔ Linearity – The relationship between inputs and output is linear.
✔ Independence – Observations are independent of each other.
✔ Homoscedasticity – Residuals have constant variance across input values.
✔ Normality – Residuals are approximately normally distributed.
✔ No multicollinearity – Input variables are not highly correlated with each other.
✅ Advantages
✔ Simple to implement and interpret.
✔ Works well with small to medium datasets.
✔ Computationally efficient.
❌ Disadvantages
🚧 Assumes a linear relationship, which may not always hold.
🚧 Sensitive to outliers.
🚧 Struggles with multicollinearity (when input variables are highly correlated).
4. Applications of Linear Regression
📊 Predicting House Prices – Based on features like size, location, and number of
rooms.
📈 Stock Market Forecasting – Estimating future stock prices using past trends.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
In this step, we train the Linear Regression model using the given dataset (car
age and speed).
Training means the model learns the relationship between the independent
variable (X: Car Age) and the dependent variable (Y: Car Speed) so that it can
make predictions on new data.
How Does Training Work in Python?
We use the LinearRegression() class from Scikit-Learn to create and train the
model.
Code for Training the Model:
from sklearn.linear_model import LinearRegression
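The snippet above shows only the import; a complete minimal version follows (the car age/speed values are illustrative, since the dataset itself isn't reproduced here):
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: car age in years (X) and speed in km/h (Y)
X = np.array([[2], [4], [6], [8], [10], [12]])
y = np.array([95, 91, 86, 83, 78, 74])

model = LinearRegression()
model.fit(X, y)  # learns the slope (m) and intercept (b)

print(model.coef_[0], model.intercept_)  # m and b
print(model.predict([[7]]))              # predicted speed for a 7-year-old car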
In Linear Regression, various lines can be drawn on a scatter plot based on how well they fit the data.
1. Best Fit Line ✅
✔ The best fit line (also called the regression line) is the one that minimizes the error between actual and predicted values.
✔ It is calculated using the Least Squares Method, ensuring the sum of squared errors is the lowest:
m = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²,  b = Ȳ − mX̄
📌 Equation of the Best Fit Line:
Y = mX + b
where m is the slope and b is the intercept, as defined earlier.
The best fit line shows the trend (older cars tend to move slower).
2. Overfitted Line 🚀
✔ A model that memorizes noise in the training data instead of learning the
general trend.
✔ It has high variance and performs well on training data but poorly on test
data.
📌 Example: A highly complex polynomial curve that fits every data point perfectly but fails on new data.
✅ Solution: Use a simpler model, apply regularization, or gather more training data.
3. Underfitted Line 📉
✔ A model that is too simple to capture the underlying trend in the data.
✔ It has high bias and performs poorly on both training and test data.
✅ Solution: Use a more complex model or include more features in the dataset.
4. Residual Line (Error Line) 📉
✔ A residual line represents the difference between the actual value and the
predicted value.
✔ The goal is to minimize the residuals (errors) to get the best fit.
📌 Example: If the actual car speed is 80 km/h but the model predicts 85 km/h, the residual is 80 − 85 = −5 km/h.
5. Mean Line (Baseline) 📊
✔ A baseline model that predicts the mean of the dependent variable for every input.
✔ Used for comparison to see if Linear Regression performs better.
📌 Example: If the average car speed is 75 km/h, this line predicts 75 km/h for every car, ignoring age.
✅ A good regression model should perform better than the mean line (see the sketch below).
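A minimal sketch of this comparison using scikit-learn's DummyRegressor as the mean line (the data is illustrative):
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Illustrative car data: age (X) vs. speed (y)
X = np.array([[2], [4], [6], [8], [10], [12]])
y = np.array([95, 91, 86, 83, 78, 74])

baseline = DummyRegressor(strategy="mean").fit(X, y)  # always predicts the mean speed
model = LinearRegression().fit(X, y)

# R² of the mean baseline is 0 by definition; the model should score higher
print(baseline.score(X, y), model.score(X, y))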
Regularization in Machine Learning
Regularization adds a penalty on large coefficients to reduce overfitting; Ridge (L2) and Lasso (L1) are the regularized versions of linear regression listed above.
📌 Worked Example: Fitting a Regression Line
For the line Y = mX + c, where c is the intercept:
• Step 1: Calculate the means of X and Y.
• Step 2: Compute the slope m and intercept c using the least squares formulas.
Applying these steps to the example's data (not reproduced here) gives the regression equation:
• Y = 2.2X − 1
🚀 Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) is a crucial step in data analytics that
helps uncover patterns, detect anomalies, and summarize key
characteristics of a dataset before applying machine learning
models. 📊🔍
📌 Why is EDA Important?
• EDA allows data scientists to:
✅ Understand the dataset’s structure and distributions 📊
✅ Detect missing or inconsistent values ❌
✅ Identify correlations and relationships between variables 🔗
✅ Prepare data for further analysis 🛠️
🔍 Looking at & Cleaning Data
• Before analyzing data, we must clean and preprocess it:
✔️Handle missing values (drop, impute, or fill) ❓
✔️Detect and remove duplicate values 📑
✔️Identify outliers and correct them 📉
✔️Standardize data formats (e.g., dates, categories)
import pandas as pd

# Load dataset
df = pd.read_csv("data.csv")
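A minimal sketch of the cleaning steps above (the column name "age" is illustrative, not from the source):
# Remove duplicate rows
df = df.drop_duplicates()

# Impute missing values in a numeric column with its median
df["age"] = df["age"].fillna(df["age"].median())

# Inspect summary statistics to spot outliers
print(df.describe())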
Visualization Tools
🔹 Bar Plots – Compare categories 📊
🔹 Histograms – Show data distribution 📈
🔹 Box Plots – Detect outliers 🎁
🔹 Pair Plots – Identify relationships 🔗
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of a variable
sns.histplot(df["column_name"], bins=20)
plt.show()
❓ Asking & Answering Data Questions
• To extract deeper insights, ask questions like:
🔍 What is the average salary of employees in different departments? 💰
🔍 Which product has the highest sales over time? 📆
🔍 Is there a correlation between age and spending habits? 🤝
# Group data by category: average salary per department
df.groupby("Department")["Salary"].mean()
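For the correlation question, a one-line sketch (the column names are illustrative):
# Correlation between age and spending habits
print(df["Age"].corr(df["Spending"]))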
# Load dataset
df = pd.read_csv("data.csv")

# Quick overview
print(df.head())

# Correlation heatmap (numeric_only avoids errors on text columns)
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
🐍 Seaborn and Scikit-Learn: A Quick Guide
📊 Seaborn: Advanced Data Visualization Library
Seaborn is a Python library built on top of Matplotlib that makes
statistical data visualization easy and attractive. It is widely used for
data exploration and insight generation.
🔹 Features of Seaborn
• ✅ Built-in themes for better aesthetics 🎨
✅ Works well with pandas DataFrames 📑
✅ Supports statistical visualization (e.g., regression plots, pair plots) 📊
✅ Integrates well with Matplotlib 📈
Common Seaborn Plots
# Bar plot of average total bill per day (using Seaborn's built-in "tips" dataset)
df = sns.load_dataset("tips")
sns.barplot(x="day", y="total_bill", data=df, palette="coolwarm")
plt.show()
🔹 Other Visualizations:
Histogram: sns.histplot(df["column"])
Box Plot (Outliers): sns.boxplot(x="category", y="value", data=df)
Heatmap (Correlations): sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
🤖 Scikit-Learn: Machine Learning Library
Scikit-Learn (sklearn) is a powerful Python library for machine learning,
data preprocessing, and model evaluation.
🔹 Features of Scikit-Learn
• ✅ Supports supervised & unsupervised learning models 🔍
✅ Provides data preprocessing tools (e.g., scaling, encoding) ⚙️
✅ Implements various ML models (Regression, Classification,
Clustering) 🤖
✅ Offers performance evaluation metrics 📊
🔹 Example: Linear Regression with Scikit-Learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd

# Sample dataset
df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 5, 4, 5]})

# Split data
X = df[['X']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
1. Generate a pie chart using Matplotlib for the number of meals ordered on each day in the Tips dataset. (Hint: Use df['day'].value_counts())
2. Use Seaborn to create a heatmap showing missing values (if any) in the Titanic dataset. (Hint: Use sns.heatmap(df.isnull(), cmap='coolwarm'))
🚀 End-to-End Example: Using Seaborn & Scikit-Learn for Predicting House Prices
We'll perform Exploratory Data Analysis (EDA) using Seaborn and build a Linear Regression Model using Scikit-Learn to predict house prices. 🏠📊🤖
📌 Step 1: Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
📌 Step 2: Load Dataset
We'll use a sample dataset of house prices.
# Sample dataset (features: size, bedrooms, age, price)
data = {
    "Size": [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    "Bedrooms": [3, 3, 3, 4, 2, 3, 4, 4, 2, 3],
    "Age": [20, 15, 18, 25, 10, 22, 30, 8, 12, 18],
    "Price": [245000, 312000, 279000, 308000, 199000, 219000, 405000, 450000, 215000, 310000]
}
df = pd.DataFrame(data)
📌 Step 3: Exploratory Data Analysis with Seaborn
1️⃣ Correlation Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
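2️⃣ Scatter Plots: the plotting code isn't shown in the source; a minimal sketch of one such plot (size vs. price):
sns.scatterplot(x="Size", y="Price", data=df)
plt.title("House Size vs. Price")
plt.show()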
💡 Insight: Scatter plots help us see trends, like larger houses costing more. 🏠💲
3️⃣ Box Plot to Detect Outliers
sns.boxplot(data=df, palette="Set2")
plt.title("Box Plot for Outlier Detection")
plt.show()
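📌 Step 4: Train & Evaluate the Model. This step isn't shown in the source; below is a minimal sketch consistent with the Step 1 imports (the choice of features is an assumption):
# Features (X) and target (y); feature choice is an assumption
X = df[["Size", "Bedrooms", "Age"]]
y = df["Price"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))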
📌 Step 5: Visualize Predictions
plt.figure(figsize=(8, 5))
sns.scatterplot(x=y_test, y=y_pred, color="blue", label="Predicted vs Actual")
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted House Prices")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color="red", linestyle="--")
plt.legend()
plt.show()
• 💡 Insight: The closer points are to the red diagonal line, the better our
predictions. 🎯
🎯 Final Summary
• ✅ Seaborn helped visualize relationships between house features and
price. 📊
✅ Scikit-Learn built a Linear Regression model to predict house prices.
🤖
✅ The correlation heatmap & scatter plots helped us choose features
for the model. 🔍
✅ The model was evaluated using MSE & R², and its performance was visualized. 🚀
📝 Practice: Create a bar plot to show the average tip for each day using the Tips dataset. (Hint: Use sns.barplot(x='day', y='tip', data=df))