Cia 4

This is cia 4

Uploaded by

ronaknsheth2005

NISCHAY JAIN

3 BBA FMA A
2324343

CIA – 3
(Continuous Internal Assessment)

DATA MINING

Topic: Classification Techniques

Under the supervision of:-


Shashidhar Yadav J
(School of Business and Management)
CHRIST (Deemed to be University)
Yeshwanthpur campus

Submitted on:-
20/10/2024
Classification of Data
Data classification is the process of organizing data into predefined categories, which makes the data
easier to manage, analyze, and retrieve. It uses algorithms or rules to assign each data point to an
appropriate class.
In machine learning, classification is a fundamental task in which data points are sorted into
predefined categories or classes based on their features. A model is trained on labeled data to
recognize patterns and relationships, which enables it to predict the class of new, unseen data. There
are two main types: supervised classification (using labeled data) and unsupervised classification
(grouping data based on similarities, also known as clustering).

• Types of Data Classification

1. Supervised Classification:

• Algorithm: Trains a model on labeled data so that it can predict the class of new data.
Examples:
Naive Bayes: Assumes that the features are independent of one another given the class.
Support Vector Machines (SVM): Finds the hyperplane in a high-dimensional space that best
separates the classes.
Decision Trees: Builds a tree-like hierarchy of decisions to sort data into classes.
2. Unsupervised Classification:

• Algorithm: Groups the data into clusters based on similarity; no information about class labels is
required.
Examples:

K-means Clustering: Splits data into K clusters based on the distance to each cluster centre.

Hierarchical Clustering: Produces a hierarchy of clusters.

Density-Based Clustering: Groups data points that lie in dense regions of the data.

3. Example: Customer Segmentation

Suppose a retail company wants to gain insight into its customers. It gathers information such as
age, gender, income level, past purchases, and the web links customers select. By applying
unsupervised classification, for example K-means clustering, it can group customers into clusters
that have similar attributes. This segmentation can then be used to target each group effectively
through marketing promotions, product development, and customer service.
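
The segmentation idea above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in this report: the `kmeans` function, the customer values, and the choice of two clusters are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical customers: (age in years, annual income in thousands).
customers = [(22, 25), (25, 30), (24, 28), (48, 90), (52, 95), (50, 88)]
centroids, labels = kmeans(customers, k=2)
print(labels)  # customers with the same label belong to the same segment
```

In practice the features would usually be scaled first, so that a large-valued attribute such as income does not dominate the distance.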

• The benefits of classification:

1. Improved Data Organization: Classification separates data into categories, which makes
information easier to organize and retrieve.
2. Enhanced Decision Making: Just as grouping the items of a list makes searching easier,
grouping data helps in arriving at informed decisions.
3. Optimized Data Analysis: Classification guides analysis by making patterns, relationships, and
trends in the database easier to find.
4. Tailored Marketing and Customer Engagement: Sorting clients by their traits helps
organizations design marketing strategies that reach each group.
5. Risk Mitigation: In areas such as fraud analysis and security, classification helps recognize
threat patterns and dangers to an organization so that preventive measures can be taken.

• Users of Data Classification:

1. Businesses: Companies use data classification to segment customers, analyze market trends,
optimize operations, and improve customer satisfaction.
2. Healthcare Providers: Hospitals and clinics classify patient data to track medical records, diagnose
diseases, and personalize treatment plans.
3. Government Agencies: Governments classify data for various purposes, such as national security,
law enforcement, and public policy development.
4. Researchers: Scientists and researchers classify data to analyze experimental results, identify
patterns, and make discoveries.
5. Financial Institutions: Banks and other financial institutions classify data to detect fraud, assess
risk, and manage investments.

• Tools for Data Classification:

1. RapidMiner: A comprehensive data mining and machine learning platform offering a wide
range of classification algorithms and visualization tools.
2. Weka: A free, open-source software package providing a collection of machine learning
algorithms, including those for classification.
3. Python Libraries: Python's rich ecosystem includes libraries like scikit-learn, TensorFlow, and
Keras, which offer powerful tools for classification tasks.
4. R: A statistical computing language with numerous packages (e.g., caret, randomForest)
dedicated to classification and other data analysis tasks.
5. SAS: A commercial software suite with advanced analytics capabilities, including classification
algorithms and data mining tools.
SOURCE OF DATASET

• EXPLANATION OF COLUMNS:

1. Campaign Launched (Target):


- Shows if a marketing campaign was launched (Yes/No).
- It's the target variable used for prediction analysis.

2. Market Condition:
- Represents the economic environment (Good, Moderate, Bad).
- Helps determine how favorable the market is for launching a campaign.

3. Competition Intensity:
- Measures the level of competition (Low, Medium, High).
- Higher competition may impact the likelihood of launching a campaign.

4. Budget Allocation ($):


- Reflects the total budget for the campaign in USD.
- Higher budgets can improve campaign reach and effectiveness.

5. Expected Reach (%):


- Shows the estimated percentage of the target audience reached.
- It helps assess the potential impact and success of the campaign.

6. Campaign Timing:
- Indicates the time of day the campaign is scheduled (Morning, Afternoon, Evening).
- Timing can influence audience engagement and campaign effectiveness.

CLASSIFICATION ALGORITHMS

1. DECISION TREE

Decision Trees are supervised learning algorithms used for both classification and regression. They
work like flowcharts that use if-else questions to make decisions. The tree begins with a root node,
which represents the entire sample. At every internal node a question is asked about the data, and the
tree branches according to the answer. This process continues until a terminal node (leaf), which
holds the final decision, is reached.
Decision trees are easy to interpret and are therefore preferred when a decision has to be explained to
an audience, even a non-technical one. They work with both interval/ratio and nominal/ordinal level
data and are relatively insensitive to outliers. However, they may overfit the data they are trained on,
especially when the tree is complex. Decision trees are used in customer segmentation, medical
diagnosis, fraud detection, recommendation, and risk assessment.

For example, suppose a telecommunications firm wishes to predict which customers are most likely to
churn, that is, to cancel their service subscription. It collects customer data such as age, income, and
tenure with the company; usage data such as call time and data usage; and satisfaction survey results.
With a decision tree algorithm, the company can build a model of customer churn based on these
factors. The tree might have branches like:

• If tenure < 1 year:
- If data usage > 20GB/month: High churn risk
- If data usage <= 20GB/month: Low churn risk
• If tenure >= 1 year:
- If customer satisfaction score < 7: High churn risk
- If customer satisfaction score >= 7: Low churn risk
Identifying the customers who are likely to churn gives the company information it can use to develop
ways of retaining them.
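
The example branches above translate directly into nested conditions. The sketch below is only an illustration of how such a tree is read; the function name `churn_risk` and its parameter names are assumptions, and the thresholds are the ones from the example, not values learned from real data.

```python
def churn_risk(tenure_years, data_gb_per_month, satisfaction_score):
    """Return 'High' or 'Low' churn risk using the example branches above."""
    if tenure_years < 1:
        # New customers: data usage decides the outcome.
        return "High" if data_gb_per_month > 20 else "Low"
    # Established customers: the satisfaction score decides the outcome.
    return "High" if satisfaction_score < 7 else "Low"

print(churn_risk(0.5, 35, 8))  # new customer with heavy data usage
print(churn_risk(3, 10, 6))    # long-tenure customer with low satisfaction
```

Each call follows exactly one path from the root to a leaf, which is why a decision tree's prediction can always be explained as a short chain of conditions.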
Step 1: Data insertion

Step 2: Add “Set role” as the operator and set “Campaign Timing” as the label
Step 3: Add the “Decision Tree” operator and connect as below

Step 4: Results
• Interpretations:

1. Root Node (Budget Allocation > 30,500):

• The tree begins with the most informative attribute, Budget Allocation ($). A budget of 30,500 is the
initial threshold.
• If Budget Allocation is greater than 30,500, the campaign is classified further by additional splits.
If Budget Allocation is less than or equal to 30,500, the model places the campaign in the Morning.
This is based on 3 Morning campaigns whose budget does not exceed this threshold.

2. Left Branch (Budget Allocation > 30,500):

• For higher budgets (above 30,500), the tree makes further distinctions.
• If Budget Allocation exceeds 53,500, the campaign is predicted to start in the Morning. The decision
tree shows 2 Morning campaigns in this leaf, which may indicate that the largest budgets are assigned
to morning campaigns because of the strong effect expected from them.

3. Second Split (Budget Allocation between 30,500 and 53,500):

If Budget Allocation is between 30,500 and 53,500, further budget thresholds are applied and the
model brings in a second criterion, namely Expected Reach (%).

(A) Budget Allocation > 33,500:

• If the budget is between 33,500 and 53,500, the decision tree splits again.
• If the budget is more than 50,500, the campaign is classified as an Evening campaign. The model
identifies 2 Evening campaigns in this leaf, which may suggest that campaigns with the higher
budgets in this range are usually run in the evening, perhaps because consumers are more engaged at
that time.

(B) Budget ≤ 50,500 (and above 33,500):

• If the budget is less than or equal to 50,500, the model uses Expected Reach (%).
• If Expected Reach is more than 62.5%, the campaign is predicted to be in the Afternoon. This leaf
contains 6 Afternoon campaigns and 1 Evening campaign.
• If Expected Reach is less than or equal to 62.5%, the campaign is predicted to be in the Morning.
This leaf contains 2 Morning campaigns and 1 Afternoon campaign, which fits the general pattern
that lower-reach campaigns are scheduled in the morning, most likely owing to less competition or
lower expectations from the audience.

(C) Budget Allocation ≤ 33,500:

• For budgets between 30,500 and 33,500, the model predicts Evening campaigns. This is based on
3 Evening campaigns in this budget range, which points to a tendency towards evening targeting
when the overall budget is restricted.

4. Right Branch (Budget Allocation ≤ 30,500):

• Where Budget Allocation is less than or equal to 30,500, the model predicts Morning campaigns.
The decision tree places 3 Morning campaigns in this leaf, which suggests that restricted budgets are
typical of morning launches.

5. General Insights:
• Timing is driven mainly by Budget Allocation, since it is the attribute used at every split except
one. Budgets above 50,500 lead to Morning or Evening campaigns, whereas budgets between 33,500
and 50,500 lead to Afternoon or Morning campaigns.
• Expected Reach acts as a secondary criterion when the budget falls between the two extremes.
Campaigns with higher expected reach are more likely to occur in the Afternoon, while low-reach
campaigns are more likely to be scheduled in the Morning.
• According to the model, Morning campaigns are associated with both the lowest and the highest
budgets, the latter perhaps to attract the attention of audiences at the beginning of the day, while
Evening campaigns are most probable at moderate budgets.
6. Summary:
This decision tree shows that Budget Allocation is instrumental in deciding when to launch
campaigns. With clear thresholds for budget levels and reach expectations, the model provides a
structured approach for timing decisions: Morning for low and high budgets, Afternoon for moderate
budgets with higher expected reach, and Evening for medium budgets or higher-budget campaigns
aimed at engaging evening audiences.
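
As a cross-check, the splits reported in the interpretation can be written as one function, and the leaf counts quoted above can be added up. This is a sketch that assumes the thresholds and counts in the text are complete; the name `predict_timing` and the `leaves` dictionary are introduced here only for illustration.

```python
def predict_timing(budget, expected_reach):
    """Predict campaign timing from the splits reported in the interpretation."""
    if budget <= 30_500:
        return "Morning"
    if budget > 53_500:
        return "Morning"
    if budget <= 33_500:
        return "Evening"
    if budget > 50_500:
        return "Evening"
    # Budget between 33,500 and 50,500: Expected Reach decides.
    return "Afternoon" if expected_reach > 62.5 else "Morning"

# Leaf counts as quoted in the text: (records in leaf, records matching the leaf label).
leaves = {
    "budget <= 30,500 -> Morning": (3, 3),
    "budget > 53,500 -> Morning": (2, 2),
    "30,500 < budget <= 33,500 -> Evening": (3, 3),
    "50,500 < budget <= 53,500 -> Evening": (2, 2),
    "reach > 62.5 -> Afternoon": (7, 6),
    "reach <= 62.5 -> Morning": (3, 2),
}
total = sum(n for n, _ in leaves.values())
correct = sum(c for _, c in leaves.values())
print(total, correct, correct / total)
```

Under these assumptions the leaves cover 20 campaigns, of which 18 match their leaf label. This figure describes how well the tree fits the data it was built on, not how it would perform on unseen campaigns.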

2. NAÏVE BAYES

Naive Bayes is one of the simplest predictive models and has been applied successfully to
classification. It is a statistical method based on Bayes' theorem, which gives the probability of an
event given the available evidence. The key assumption of the Naive Bayes algorithm is that the
features are independent of one another given the class, so the value of one feature does not affect the
probability of another. Although this assumption rarely holds exactly, it keeps the model simple and
often works well in practice. Using class-conditional probabilities and Bayes' theorem, the Naive
Bayes classifier computes the probability that a data point belongs to each class and assigns it to the
class with the highest posterior probability. It is most useful in text categorization problems such as
spam filtering and sentiment analysis, where words are treated as roughly independent of each other.
It is also applied in medical diagnosis, recommendation systems, weather prediction, etc.
Naive Bayes has several variants: Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli
Naive Bayes. Each variant is designed for a specific type of data. Gaussian Naive Bayes is applied
where the feature values are real numbers, Multinomial Naive Bayes where they are counts, in other
words integer values, and Bernoulli Naive Bayes where they are binary, indicating the presence or
absence of a feature.
Simple models can sometimes give surprisingly good results, and Naive Bayes is no exception when
it comes to classification. It is efficient, easy to implement, and can provide accurate results,
particularly when the independence assumption is valid. However, it is not an optimal solution when
the features are not independent, and other algorithms should then be considered.
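
The calculation described above can be sketched for categorical features as follows. This is a minimal illustration under stated assumptions: the six training rows are invented, the function names are hypothetical, and Laplace (add-one) smoothing is added so that unseen feature values do not produce zero probabilities.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return (best_class, posteriors) using Bayes' theorem with add-one smoothing."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(row):
            # Smoothed likelihood P(value | class); features treated as independent.
            score *= (value_counts[(i, label)][value] + 1) / (n + len(values_seen[i]))
        scores[label] = score
    norm = sum(scores.values())
    posteriors = {label: s / norm for label, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical rows: (competition intensity, campaign launched) -> market condition.
rows = [("Low", "Yes"), ("Low", "Yes"), ("Medium", "Yes"),
        ("Medium", "No"), ("High", "No"), ("High", "No")]
labels = ["Good", "Good", "Moderate", "Moderate", "Bad", "Bad"]
model = train_naive_bayes(rows, labels)
best, posteriors = predict(model, ("Low", "Yes"))
print(best, posteriors)
```

The prior and the per-feature likelihoods are simply multiplied together, which is exactly where the independence assumption enters the model.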

Step 1: Insert the data, add the “Split Data” operator, and set the ratios to 0.6 and 0.4
Step 2: Add the “Set Role” operator and set “Market Condition” as the label
Step 3: Add the “Discretize” operator
Step 4: Add the “Naive Bayes” operator and the “Apply Model” operator
Step 5: Results
• Interpretation:

1. Market Condition Distribution:

The Naive Bayes model distinguishes three market conditions based on the distribution of
campaigns: Good, Bad and Moderate. Half of the campaigns in the dataset took place under ‘Bad’
market conditions and only one-sixth (16.7%) under ‘Good’ market conditions, which leaves about
one-third under ‘Moderate’ conditions. Campaigns therefore frequently have to operate under
unfavourable conditions, so strategies suited to Bad or Moderate markets are needed more often than
strategies for Good markets.

2. Class Conditional Probabilities:

Naive Bayes calculates probabilities for each feature (Budget, Expected Reach, etc.) to determine
how likely each feature value is under each market condition:

• The likelihoods for Budget Allocation ($) and Expected Reach (%) show little deviation and lie in a
range of approximately 0.98-0.99. This suggests that budget and expected reach are two important
factors associated with market conditions.
• For example, if the expected reach is moderate and the budget is between 30,000 and 50,000, the
campaign is more likely to fall under Moderate or even Bad market conditions. Where a campaign
has a large budget or a very high expected reach, the market condition is likely to be Good.
• The analysis also revealed a meaningful association between Campaign Launched (Target) and
market conditions. The probability that a campaign was launched, given a Good market condition,
was 92.5 percent, which shows that campaigns are more likely to be launched when conditions are
good.
• By contrast, the probability of a campaign not being launched is rather high under Bad market
conditions, which indicates that unfavorable conditions discourage campaign launches.

3. Competition Intensity:
• Low competition is most strongly associated with Good market conditions, with an estimated
probability of 0.893. This makes sense, as campaigns are more likely to perform well in a good
market with little or no competition, where there is plenty of room for visibility. High competition,
on the other hand, is very strongly associated with Bad market conditions, at 96%.
• Medium competition is most strongly associated with Moderate market conditions, with an
estimated probability of 0.942. This suggests that moderately competitive markets are most likely
where market conditions are neither very favourable nor very unfavourable.

4. Campaign Timing:
• Good market conditions are frequently associated with well-funded afternoon campaigns that aim
to maximize reach and engagement.
• Morning campaigns occur under both Good and Bad market conditions, with probabilities of 48.6%
and 25% respectively. This may suggest that morning campaigns are launched irrespective of market
conditions.
• Evening campaigns are unlikely in Good market conditions, with a probability of 3.57%, but are
more likely in Bad or Moderate markets, with probabilities of 21.95% and 32.815% respectively.
Evening campaigns may therefore be chosen where funds are scarce or where market conditions
suggest that a specific evening audience is the best target.

5. Key Insights:
The model's results indicate that budget allocation and expected reach are closely linked to market
conditions. Large advertising budgets are used when markets are good, while small budgets are used
when the market is bad or moderate.
Market Condition also has a strong influence on the probability of launching a campaign. In Good
market conditions many campaigns are launched, while in Bad conditions campaigns are launched
least often, owing to high competition and unfavourable environments.
Campaign timing is flexible. A larger share of afternoon campaigns is characteristic of good
conditions, while Morning and Evening campaigns are spread across different market conditions
depending on the competition and the available budget.

3. K – NEAREST NEIGHBOURS (K-NN)

K-Nearest Neighbors (KNN) is a very simple yet effective classification model that belongs to the
supervised learning category. The idea of KNN is to find the K training points nearest to the data
point under consideration and to assign it to the class that is most common among those neighbors.
For prediction, it searches the feature space for the “k” data points nearest to the new data point and
assigns the majority label of those points to the new data point. To measure the distance between
points, KNN most often uses the Euclidean distance. However, other distance measures such as
Manhattan, Minkowski, or Hamming distance can also be used, depending on the data. The choice of
the number of neighbors “k” strongly affects the model's results. Too low a value of “k” results in
overfitting; that is, the model adapts too closely to the noise in the data. Conversely, too high a value
of “k” leads to underfitting, where the model becomes too general and fails to learn a suitable
representation.
The most important advantage of KNN is that it needs no training phase: the algorithm simply stores
the dataset and does all of its computation at prediction time. KNN remains relevant for problems
such as recommendation systems, image recognition, and outlier detection because of its
interpretability and ease of implementation.
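
The majority-vote procedure described above can be sketched as follows. This is a minimal illustration with invented data points; the function name `knn_predict` and the use of the vote share as a confidence value are assumptions for the example, not the exact calculation performed by any particular tool.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # train is a list of (feature_tuple, label) pairs; distance is Euclidean.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / k  # vote share serves as a simple confidence

# Hypothetical campaigns: (expected reach %, budget in thousands) -> launched?
train = [((80, 60), "Yes"), ((75, 55), "Yes"), ((72, 48), "Yes"),
         ((55, 35), "No"), ((50, 30), "No"), ((58, 38), "No")]

print(knn_predict(train, (78, 52), k=3))  # all three nearest neighbours agree
print(knn_predict(train, (78, 52), k=5))  # a larger k pulls in dissenting votes
```

With a larger `k`, neighbours from the other class enter the vote and the confidence drops, which illustrates why the choice of “k” matters. Features on very different scales would normally be normalized before computing distances.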

Step 1: Insert the data, add the “Generate ID” operator twice, and follow the “Jump to Tutorial Process” example.
Step 2: Add the “Set Role” operator with “Competition Intensity” as the label
Step 3: Add the “k-NN” and “Apply Model” operators.

Step 4: Results
• Interpretations:

1. X-Axis (Expected Reach %): This axis shows the estimated percentage of the target audience that
the campaign is expected to reach. A higher percentage suggests that the campaign is expected to
reach more of its target audience.

2. Y-Axis (Budget Allocation $): This axis shows how much budget was allocated to each campaign.
Campaigns with higher budgets are expected to be more visible or to target a broader audience.

3. Bubble Color (Campaign Launched): This binary variable shows whether or not the campaign
was launched. A blue bubble indicates that the campaign was run (“Yes”), and a green bubble
indicates that it was not run (“No”).

4. Bubble Size (Confidence): This variable shows how confident the model is in its prediction, that
is, how strongly it places the campaign in either the launched or the non-launched category. It
reflects the certainty of the prediction rather than its measured accuracy.

• Observations:

1. Budget Allocation and Expected Reach: There is a clear positive relationship between Budget
Allocation and Expected Reach. The higher a campaign's budget allocation, the higher its expected
percentage of reach, which is reasonable since a higher budget should increase the campaign's
visibility.

2. Campaign Launched: The blue bubbles (campaigns that were launched) are concentrated in the
higher ranges of Budget Allocation and Expected Reach. This suggests that the campaigns that were
launched had both larger budgets and greater expected reach.

3. Model Confidence: Bigger bubbles mean that the model is more confident in its prediction of
whether a specific campaign was launched. The larger bubbles appear mostly at higher budget and
reach values, which implies that the model is more certain when campaigns have clearer success
indicators (higher budget, greater reach).
4. Unlaunched Campaigns: The non-launched campaigns are shown as green bubbles, and most of
them lie in the low-budget, low-expected-reach quadrant; that is, campaigns with limited budgets and
low expected audience reach were mostly not launched. The model also appears to be confident in
these predictions, as the bubble sizes (confidence levels) in this region show.

• Dataset Insights:
The dataset contains several key variables: Market Condition, Competition Intensity, Budget
Allocation, Expected Reach, and Campaign Timing. Campaigns that were launched (“Yes”) tend to
occur in Good or Moderate Market Conditions with Low or Medium Competition Intensity. They
also have higher budget allocations, averaging above $45,000, and a higher expected reach, above
70%.
Campaigns that were not launched (“No”), on the other hand, are mainly those where:
- the Market Condition is Bad and Competition Intensity is High
- the budget is below $40,000
- the expected reach is below 60%
The evidence suggests that launching a campaign is positively associated with favorable market
conditions and the availability of resources.

• KNN Model:

The interpretation of the results suggests several strategic takeaways for marketing:

1. Budget Considerations: Campaigns with a budget of less than $40,000 tend not to be launched.
A higher budget increases the chances that a campaign is implemented and succeeds.

2. Market and Competition: Favourable market conditions and low competition intensity are major
triggers for campaign initiation. Timing a launch for periods when the market is less crowded could
therefore be more strategic.

3. Confidence in High Reach Predictions: The model gives clear, confident signals for campaign
launches when the expected reach is high. Teams should therefore aim for campaigns that are likely
to reach a larger audience, which in turn makes a launch more probable.

• Conclusion:

The KNN analysis shows a straightforward positive association between budget and campaign
launch, and it stresses the importance of the market environment and the potential audience reach.
With the insights from this analysis, companies can make sure that marketing campaigns are
launched in the right environment.

BIBLIOGRAPHY:
1. RapidMiner
