Module 5: Machine Learning

Machine Learning

Supervised Learning and Unsupervised Learning


When Do We Use Machine Learning?
• ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)
Some more examples of tasks that are best solved by using a learning algorithm:
• Recognizing patterns: facial identities or facial expressions; handwritten or spoken words; medical images.
• Generating patterns: generating images or motion sequences.
• Recognizing anomalies: unusual credit card transactions; unusual patterns of sensor readings in a nuclear power plant.
• Prediction: future stock prices or currency exchange rates.
• Supervised learning: discover patterns in data with a known target (class) or label.
• These patterns are then used to predict the value of the target attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
• We want to explore the data to find some intrinsic structure in them.
Classification
Case Study: Classification Example in Business Analytics
• Objective: Predicting whether a customer will churn or not - Telecom.
• Scenario: A telecom company wants to predict whether a customer will leave the service
(churn) based on customer demographics, usage patterns, and service complaints. The
company can use classification techniques to predict binary outcomes (churn or not churn).
• Input Variables (Independent Variables):
• Age of the customer
• Monthly bill amount
• Usage patterns (data usage, talk time)
• Customer service complaints
• Account age
• Contract type (monthly, yearly, etc.)
• Output Variable (Dependent Variable):
• Churn (Yes or No)
• Classification (e.g., predicting customer churn) involves categorical
outcomes, where the goal is to classify observations into distinct
classes.
• These analytical techniques help businesses optimize operations,
forecast outcomes, and make informed decisions.
• Approach: Using logistic regression, decision trees, or random
forests, the company can classify customers as either likely to churn
(1) or not likely to churn (0) based on their characteristics.
• Example Model: A logistic regression model might output a probability of 0.7 for a customer, i.e., an estimated 70% chance of churning. Since 0.7 exceeds the 0.5 threshold, the model would predict churn = Yes.
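A minimal Python sketch of this churn classifier using scikit-learn. The feature columns and data values below are hypothetical, invented only to illustrate the probability-plus-threshold idea; a real model would be trained on the company's customer records.

```python
# Minimal churn-classification sketch (hypothetical data).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical customer records mirroring the input variables above.
df = pd.DataFrame({
    "age":          [25, 40, 33, 58, 45, 29, 62, 37],
    "monthly_bill": [70, 30, 55, 20, 90, 65, 25, 80],
    "complaints":   [3, 0, 1, 0, 4, 2, 0, 5],
    "account_age":  [6, 48, 24, 60, 12, 8, 72, 10],  # in months
    "churn":        [1, 0, 0, 0, 1, 1, 0, 1],        # target: Yes=1, No=0
})

X, y = df.drop(columns="churn"), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# predict_proba returns P(churn); compare it against the 0.5 threshold.
proba = model.predict_proba(X_test)[:, 1]
print("P(churn):", proba)
print("churn prediction (threshold 0.5):", (proba >= 0.5).astype(int))
```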
Regression
Case Study: Regression Example in Business Analytics
• Objective: Predicting future sales based on advertising spend.
• Scenario: A retail company wants to predict its sales for the next quarter
based on how much it spends on advertising. The company has collected
data on previous quarters, including how much it spent on advertising and
the resulting sales.
• Input Variables (Independent Variables):
• Advertising Spend (in dollars)
• Number of campaigns
• Seasonal factors (such as holidays)
• Output Variable (Dependent Variable):
• Sales (in dollars)
• Approach: Using linear regression, the company can build a model that estimates sales from advertising spend. The model fits a relationship (e.g., y = mx + b) where y is the sales, x is the advertising spend, m is the slope (how much sales increase per advertising dollar), and b is the intercept (baseline sales with no advertising).
• Example Model: Sales = 300 + 0.5 * Advertising Spend
This suggests that for every additional dollar spent on advertising, the
company can expect to see an additional $0.50 in sales.
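A minimal Python sketch that fits such a model with scikit-learn. The quarterly figures below are hypothetical, chosen so the fit roughly recovers Sales = 300 + 0.5 * Advertising Spend.

```python
# Minimal sales-regression sketch (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past quarters: advertising spend and resulting sales, in dollars.
spend = np.array([[1000], [2000], [3000], [4000], [5000]])
sales = np.array([800, 1290, 1820, 2310, 2790])   # roughly 300 + 0.5 * spend

model = LinearRegression().fit(spend, sales)
print("intercept b:", model.intercept_)   # ~300 (baseline sales)
print("slope m:", model.coef_[0])         # ~0.5 (extra sales per ad dollar)

# Predict next quarter's sales for a planned $6,000 ad budget.
print("predicted sales:", model.predict([[6000]])[0])
```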
• Regression (e.g., predicting sales based on advertising spend) deals
with continuous outcomes, where the goal is to estimate a numeric
value.
• These analytical techniques help businesses optimize operations,
forecast outcomes, and make informed decisions.
Classification vs. Regression
Decision Tree
• A Decision Tree is a supervised machine learning algorithm used for both
classification and regression tasks. It is a flowchart-like structure where:
• Each internal node represents a "decision" based on a feature (or
attribute) from the input data.
• Each branch represents the outcome of that decision (i.e., the splitting of
data based on the value of the feature).
• Each leaf node represents the final output or prediction (a class label in
classification, or a continuous value in regression).
• The goal of a decision tree is to recursively partition the data into subsets
that are as pure (homogeneous) as possible, based on the target variable.
• Key Concepts:
• Root Node: The starting point of the tree, where the first split of the data
occurs based on the most important feature.
• Branches: These are the outcomes of a decision at each node, leading to
further splits or leaf nodes.
• Leaf Nodes: The final decision or predicted output of the tree, representing
the classification (in classification tasks) or predicted value (in regression
tasks).
• Splitting Criteria: The algorithm uses criteria such as Gini impurity,
information gain (for classification), or mean squared error (for regression)
to determine the best feature to split the data at each step.
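Since the splitting criterion drives the whole algorithm, a small sketch of how Gini impurity scores a candidate split may help. The formula is standard (1 minus the sum of squared class proportions); the labels below are made up for illustration.

```python
# Gini impurity of a set of class labels: 1 - sum(p_k^2).
# A pure node scores 0; a 50/50 binary node scores 0.5.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]

# Impurity decrease of a candidate split, weighting children by size;
# the tree picks the feature/threshold with the largest decrease.
decrease = gini(parent) - (len(left) / len(parent)) * gini(left) \
                        - (len(right) / len(parent)) * gini(right)
print(gini(parent), decrease)   # 0.5 0.5 -- a perfect split
```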
• Working of a Decision Tree:
• Splitting: The data is split at each node based on the feature that best
separates the data according to a splitting criterion.
• Recursive Partitioning: This process is repeated recursively for each
resulting subset of the data until a stopping criterion is met (such as a
maximum depth or when data cannot be split further).
• Prediction: For classification, the tree predicts the majority class in
the leaf node, and for regression, it predicts the average of the target
variable in the leaf node.
Example : Decision Tree
• Imagine you want to predict whether a customer will buy a product
based on their age and income. A decision tree might first split the
data based on age (e.g., age > 30), and then further split based on
income (e.g., income > $50,000). Each final branch would then predict
whether the customer will buy or not.
• Other features, such as time spent on the website (from traffic analysis), could also serve as split candidates.
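A minimal scikit-learn sketch of this buy/not-buy tree. The data points are hypothetical, arranged so the tree learns roughly the age and income splits described above.

```python
# Decision-tree sketch for the buy/not-buy example (hypothetical data).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income in dollars]; target: 1 = buys, 0 = does not buy.
X = [[22, 30_000], [25, 60_000], [35, 52_000], [40, 80_000],
     [28, 45_000], [50, 60_000], [33, 40_000], [45, 90_000]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned splits, e.g. one on age and one on income.
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[38, 55_000]]))   # over 30 and income > $50,000 -> buys
```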
Random Forest
• Random Forest is an ensemble learning algorithm that combines multiple decision trees to improve the accuracy
and robustness of predictions. It is used for both classification and regression tasks. The core idea is to create a
"forest" of decision trees, where each tree is trained on a different random subset of the data and features, and
their collective predictions are averaged (for regression) or voted on (for classification) to produce a final output.
• Key Concepts of Random Forest:
• Ensemble Learning: Random Forest is based on the concept of ensemble learning, which means combining
multiple individual models to produce a more accurate and stable result than any single model.
• Bootstrap Aggregating (Bagging): Random Forest uses a technique called bagging, where:
• Multiple bootstrapped datasets are created by randomly sampling from the original dataset with replacement.
• Each decision tree is trained on a different bootstrapped dataset.
• Random Feature Selection: At each node of the decision tree, Random Forest selects a random subset of
features (rather than considering all features) to decide the best split. This introduces further diversity among
the trees, helping reduce overfitting and improving generalization.
• Voting/Averaging: Once all trees in the forest are trained, the predictions are aggregated:
• For classification, the mode (majority vote) of the predictions from all trees is used as the final class prediction.
• For regression, the average of the predictions from all trees is used as the final predicted value.
How Random Forest Works:
• Step 1: Data Sampling (Bagging):
• Randomly sample with replacement from the training data to create multiple subsets (bootstrapped datasets).
• Step 2: Decision Tree Construction:
• For each bootstrapped dataset, build a decision tree by randomly selecting a subset of features for
each node, and split the data based on the best feature within that subset.
• This results in a variety of decision trees that are different from one another due to the random
sampling of data and features.
• Step 3: Making Predictions:
• After training, each tree in the forest makes a prediction.
• For classification: Perform a majority vote across all trees.
• For regression: Take the average of the predictions from all trees.
• Step 4: Final Output:
• The Random Forest algorithm outputs the aggregated prediction from all trees.
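To make Steps 1-4 concrete, here is a stripped-down sketch that builds the ensemble by hand from individual decision trees; in practice one would use a library implementation such as scikit-learn's RandomForestClassifier (used in the loan example below). The data are synthetic.

```python
# Hand-rolled bagging + majority vote, illustrating Steps 1-4 (synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # 100 samples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic binary target

trees = []
for i in range(25):
    # Step 1: bootstrap sample -- draw n rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: each tree also considers a random feature subset at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Steps 3-4: every tree predicts; the majority vote is the final output.
votes = np.array([t.predict(X[:5]) for t in trees])   # shape (25, 5)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("majority vote:", majority, "true labels:", y[:5])
```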
Random Forest: Example
• The bank has a dataset of past loan applicants and needs to build a model to predict whether a new loan applicant is likely
to default (fail to repay the loan). By using a Random Forest model, the bank aims to better understand the risk associated
with each loan application and make data-driven lending decisions.
• Features (Independent Variables):
• Age: Age of the applicant (e.g., 25, 35, 45).
• Annual Income: Applicant's annual income (e.g., $50,000, $80,000, $120,000).
• Loan Amount: The amount of the loan requested by the applicant (e.g., $10,000, $50,000).
• Loan Term: The term of the loan (e.g., 15 years, 30 years).
• Credit Score: The applicant's credit score (e.g., 600, 750, 800).
• Employment Status: Whether the applicant is employed or self-employed (e.g., employed, self-employed).
• Previous Loan Defaults: Whether the applicant has defaulted on any loans in the past (Yes/No).
• Debt-to-Income Ratio (DTI): A ratio of the applicant's debt to their income (e.g., 0.2, 0.4).
• Marital Status: Whether the applicant is single, married, or divorced (e.g., single, married).
• Home Ownership: Whether the applicant owns a home or rents (e.g., owns, rents).
• Target Variable (Dependent Variable):
• Loan Default: Whether the applicant defaults on the loan (1) or repays the loan successfully (0).
Step-by-Step Process:
• 1. Data Preprocessing:
• Cleaning: Remove any missing data or handle missing values through imputation (e.g., using the median for numerical
variables).
• Encoding: Convert categorical variables like "Employment Status" or "Marital Status" into numerical values using one-hot
encoding or label encoding.
• Feature Scaling: Optionally normalize continuous variables like Annual Income and Loan Amount. Note that tree-based models such as Random Forest are largely insensitive to feature scale, so this step mainly matters if the same pipeline also feeds scale-sensitive models.
• 2. Train the Random Forest Model:
• Random Forest Algorithm:
• The dataset is divided into training and testing sets. The training set is used to build multiple decision trees in the random forest.
• Each tree is trained on a random subset of the data (with replacement) and considers a random subset of features at each node,
which increases model diversity and reduces overfitting.
• 3. Make Predictions:
• Once the model is trained, it predicts whether an applicant will default on their loan based on their features (e.g., age,
income, credit score).
• For Classification: Each tree in the forest makes a prediction (default = 1, no default = 0), and the final prediction is the majority vote
from all trees.
• 4. Model Evaluation:
• Evaluate the model's performance using metrics such as:
• Accuracy: The proportion of correct predictions.
• Precision: The proportion of actual loan defaulters among all predicted defaulters.
• Recall: The proportion of actual loan defaulters that were correctly predicted.
• F1-score: A balance between precision and recall.
• Confusion Matrix: To visualize the true positives, false positives, true negatives, and false
negatives.
• 5. Feature Importance:
• The Random Forest model provides insights into which features are most important in
predicting loan defaults.
• For example, the Credit Score and Previous Loan Defaults might emerge as the most important
features for predicting default, as they directly influence a borrower’s likelihood of repayment.
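An end-to-end sketch of Steps 1-5 in Python with scikit-learn. The column names and rows below are hypothetical stand-ins for the bank's dataset, which would normally be loaded from a file.

```python
# Loan-default Random Forest pipeline, Steps 1-5 (hypothetical data).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age":           [25, 45, 35, 52, 29, 41, 33, 60, 27, 48],
    "annual_income": [50_000, 120_000, 80_000, 90_000, 40_000,
                      110_000, 60_000, 70_000, 45_000, 100_000],
    "credit_score":  [600, 800, 750, 780, 580, 760, 640, 720, 590, 770],
    "dti":           [0.45, 0.15, 0.25, 0.20, 0.50,
                      0.18, 0.40, 0.22, 0.48, 0.19],
    "employment":    ["employed", "self-employed", "employed", "employed",
                      "self-employed", "employed", "employed", "employed",
                      "self-employed", "employed"],
    "default":       [1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Step 1: one-hot encode categoricals; tree models need no feature scaling.
X = pd.get_dummies(df.drop(columns="default"))
y = df["default"]

# Step 2: train on bootstrapped samples with random feature subsets per split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Steps 3-4: predict and evaluate (accuracy, precision, recall, F1).
pred = rf.predict(X_te)
print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred, zero_division=0))

# Step 5: feature importances -- e.g. credit_score and dti may rank highest.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```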
• Example Results:
• After training the Random Forest model, the bank gets the following insights:
• Feature Importance:
• Credit Score: 40%
• Debt-to-Income Ratio (DTI): 25%
• Previous Loan Defaults: 15%
• Annual Income: 10%
• Loan Amount: 5%
• Employment Status: 5%
• These feature importance scores suggest that credit score and DTI ratio are the most
important factors in determining whether an applicant will default on a loan.
• Predictions:
• Applicant A: Predicted to Default (due to a low credit score and high DTI ratio).
• Applicant B: Predicted to Repay (due to a high credit score and low DTI ratio).
Business Use Case:

• Risk Assessment: By using the Random Forest model, the bank can more accurately
assess the risk of lending to an applicant. This allows the bank to better manage its
loan portfolio by identifying high-risk applicants who are likely to default.
• Personalized Loan Offers: For applicants with a lower likelihood of defaulting, the
bank can offer better terms such as lower interest rates, whereas for high-risk
applicants, the bank might offer higher interest rates or ask for additional collateral.
• Prevention of Default: The model helps the bank identify customers who are at a
higher risk of default. The bank can take proactive steps, such as offering financial
counseling or restructuring loans, to help these customers before they default.
• Regulatory Compliance: Predicting defaults accurately can help the bank meet
regulatory requirements by maintaining an appropriate risk profile for its loans and
avoiding financial crises due to excessive bad loans.
• In this example, the Random Forest algorithm is used by a bank to
predict whether a loan applicant is likely to default. By training
multiple decision trees on random subsets of data and features, the
Random Forest model creates a more robust prediction compared to
a single decision tree. The bank can then use this predictive model to
assess risk more accurately, reduce potential defaults, and make more
informed lending decisions.
Unsupervised Learning
• Unsupervised learning refers to a type of machine learning where the
algorithm is given data without labeled outcomes or target variables.
Instead, the algorithm tries to find patterns, relationships, or
groupings in the data. Common techniques in unsupervised learning
include clustering, dimensionality reduction, and anomaly detection.
Clustering
• Clustering: the task of grouping a set of data points such that points in the same group are more similar to each other than to points in other groups (each group is known as a cluster).
• It groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
Types of Clustering
1. Exclusive Clustering: K-means
2. Overlapping Clustering: Fuzzy C-means
3. Hierarchical Clustering: Agglomerative clustering, divisive clustering
4. Probabilistic Clustering: Mixture of Gaussian models
How to Choose a Clustering Algorithm
• A vast collection of algorithms is available.
• Which one should we choose for our problem?
• Choosing the “best” algorithm is a challenge.
• Every algorithm has limitations and works well with certain data distributions.
• It is very hard, if not impossible, to know what distribution the application
data follow.
• The data may not fully follow any “ideal” structure or distribution required by
the algorithms.
• One also needs to decide how to standardize the data, to choose a suitable
distance function and to select other parameter values.
How to Choose a Clustering Algorithm
• Due to these complexities, the common practice is to
• run several algorithms using different distance functions and parameter
settings, and then carefully analyze and compare the results.
• The interpretation of the results must be based on insight into the meaning
of the original data together with knowledge of the algorithms used.
• Clustering is highly application-dependent and, to a certain extent, subjective (personal preferences).
Case Study
• Objective: Segmenting customers into different groups based on their purchasing behavior.
• Scenario:
• A retail company wants to better understand its customer base in order to tailor marketing
strategies, product recommendations, and promotions. The company has data on customers'
past purchases, and they want to identify groups of customers with similar purchasing behavior.
This can be done using clustering techniques, specifically K-means clustering, which groups
similar data points together.
• Data Variables (Features):
• Annual spending (in dollars) on products
• Product category preferences (e.g., electronics, clothing, groceries)
• Frequency of visits (number of store visits or website visits per month)
• Demographic data (e.g., age, location, income level)
• Purchase history (total number of items bought over the last year)
• Approach:
• The company uses K-means clustering, a popular unsupervised learning
technique, to identify groups of customers that behave similarly. Here’s how it
works:
• Preprocessing: Clean and normalize the data to ensure all variables are on a
similar scale (for example, normalizing spending and visit frequency).
• Selecting the number of clusters: The company decides to segment customers
into 3 groups (clusters) based on a business strategy (for example, targeting
"high-value" customers, "frequent shoppers", and "occasional buyers").
• Running the clustering algorithm: K-means will then partition the customer
data into 3 distinct clusters based on the similarity of their purchasing behavior.
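A minimal K-means sketch for this segmentation in Python with scikit-learn. The feature values are hypothetical, shaped to echo the three segments described below.

```python
# K-means customer-segmentation sketch (hypothetical data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: annual spend ($), visits per month, items bought last year.
customers = np.array([
    [12_000,  1,  40], [15_000,  2,  55], [11_000,  1,  35],  # high-value
    [ 2_000, 12, 120], [ 1_800, 15, 140], [ 2_500, 10, 100],  # frequent
    [ 4_000,  1,  10], [ 5_000,  2,  12], [ 3_500,  1,   8],  # occasional
])

# Preprocessing: put spend and visit frequency on a comparable scale.
scaler = StandardScaler().fit(customers)
X = scaler.transform(customers)

# Partition into the 3 business-defined segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignment for each customer

# Assign a new customer (spend $13,000, 1 visit/month, 45 items) to a segment.
print(kmeans.predict(scaler.transform([[13_000, 1, 45]])))
```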
• Example of Clusters Identified:
• Cluster 1: High-Value Customers
• Customers who spend a lot annually but may not visit the store frequently.
• They may prefer high-end products or luxury items.
• These customers are likely more influenced by product quality or exclusivity.
• Cluster 2: Frequent Shoppers
• Customers who visit the store or website frequently, but their total annual spend is moderate.
• They might shop for basic or everyday items and tend to buy in smaller quantities but regularly.
• These customers could be price-sensitive and value promotions or discounts.
• Cluster 3: Occasional Buyers
• Customers who rarely visit but make large purchases when they do.
• These might be customers who prefer to buy in bulk or make seasonal purchases (e.g., holiday
shopping).
• They may be less brand-loyal but still responsive to targeted campaigns.
Results & Use in Business:

• Targeted Marketing: Each group of customers can be targeted with different marketing
strategies. For instance:
• For high-value customers, the company could offer exclusive product releases or loyalty programs.
• For frequent shoppers, they might offer personalized discounts or promotions to drive more frequent
purchases.
• For occasional buyers, the company could offer seasonal sales or reminders about products they
might need based on past purchases.
• Personalized Recommendations: The company can tailor recommendations based on the
identified clusters. For example, customers in the "high-value" group might receive
recommendations for premium products, while those in the "frequent shopper" cluster
could receive suggestions for everyday items.
• Resource Allocation: The company can allocate resources more effectively by focusing on
high-value customers for special promotions and tailoring their inventory based on the
preferences of each cluster.
