
1. What is a mathematical model? Explain the different classes of models.
What is a Mathematical Model?
A mathematical model is a representation of a system or
process using mathematical concepts, structures, and equations.
The goal of a mathematical model is to describe the behavior of a
system in a quantitative manner, allowing for analysis,
predictions, and decision-making. Models can range from simple
to complex, depending on the phenomenon being studied.
Mathematical models typically include variables that represent the
system’s components and parameters that govern the
relationships between these variables. These models are used in
various fields such as physics, economics, biology, and
engineering to simulate real-world systems and predict their
behavior under different conditions.

Different Classes of Mathematical Models


Mathematical models can be categorized into several classes
based on their form, structure, or purpose. Below are some
common classes:
• Deterministic Models:
• Definition: In deterministic models, the output is
entirely determined by the initial conditions and
parameters. There is no randomness or uncertainty.
• Characteristics: Given the same initial conditions, a
deterministic model will always yield the same output.
• Examples: Newton’s laws of motion, population models
in biology, or simple economic models like supply and
demand.
• Stochastic Models:
• Definition: Stochastic models incorporate randomness
or uncertainty in the system’s behavior. The output is
probabilistic, meaning that for the same initial
conditions, the system may yield different results.
• Characteristics: These models are often used when
the system is influenced by random factors.
• Examples: Stock market models, weather forecasting,
or models involving queues (e.g., in
telecommunications or manufacturing).
• Continuous Models:
• Definition: Continuous models describe systems that
change in a smooth and continuous manner over time
or space.
• Characteristics: These models use differential
equations to represent the relationship between
variables.
• Examples: Fluid dynamics, heat conduction, or
population growth (represented by continuous functions
such as exponential growth).
• Discrete Models:
• Definition: Discrete models represent systems that
evolve in distinct steps or at discrete points in time,
often in integer increments.
• Characteristics: These models are based on
difference equations and often represent systems
where the state changes in a stepwise manner.
• Examples: Computer algorithms, inventory
management, or population dynamics in a model with
periodic time steps.
• Static Models:
• Definition: Static models represent systems that do not
change over time. They analyze the system's behavior
at a single point in time or at equilibrium.
• Characteristics: These models do not account for time
dynamics, focusing instead on a snapshot of the
system.
• Examples: Structural analysis in engineering,
equilibrium models in economics.
• Dynamic Models:
• Definition: Dynamic models describe systems that
evolve over time, with variables changing as the system
progresses.
• Characteristics: These models often include time as a
key variable and use differential or difference equations
to model the system’s evolution.
• Examples: Population dynamics, climate models, or
the modeling of viral spread.
• Empirical Models:
• Definition: Empirical models are based on observed
data rather than theoretical foundations. They are often
used when the underlying mechanisms of a system are
not well understood.
• Characteristics: These models are built using
statistical techniques like regression analysis, machine
learning, or data fitting.
• Examples: Predicting consumer behavior, sales
forecasting, or modeling energy consumption patterns.
• Theoretical Models:
• Definition: Theoretical models are based on
fundamental principles and laws of nature. They aim to
describe a system based on established theories and
assumptions.
• Characteristics: These models are typically grounded
in scientific laws and principles.
• Examples: Quantum mechanics, classical mechanics,
or economic models based on utility theory.
• Linear Models:
• Definition: Linear models assume that the relationship
between variables is linear, meaning the change in one
variable leads to a proportional change in another.
• Characteristics: These models are simpler to solve
and analyze.
• Examples: Linear regression, input-output models in
economics.
• Non-linear Models:
• Definition: Non-linear models describe relationships where
changes in variables do not lead to proportional changes in
the system’s behavior.
• Characteristics: These models are more complex and may
have multiple solutions or chaotic behavior.
• Examples: Population models with carrying capacity,
predator-prey models, or models in fluid dynamics.
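
Below is a small Python sketch (with purely illustrative, assumed coefficients) contrasting the linear and non-linear classes above: a linear demand model, where changes in the input produce proportional changes in the output, and a logistic growth model with a carrying capacity, where they do not.

```python
import numpy as np

# Linear model: output changes proportionally with the input.
# Illustrative demand model: demand = a - b * price  (a, b are assumed values).
a, b = 100.0, 2.5
prices = np.array([10.0, 20.0, 30.0])
demand = a - b * prices          # equal price steps give equal demand steps

# Non-linear model: logistic population growth with carrying capacity K.
# P(t) = K / (1 + ((K - P0) / P0) * exp(-r * t))   (K, P0, r are assumed values).
K, P0, r = 1000.0, 10.0, 0.3
t = np.arange(0, 30, 5)
population = K / (1 + ((K - P0) / P0) * np.exp(-r * t))

print("Linear demand:", demand)
print("Logistic population:", np.round(population, 1))
```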

Mathematical Models in Advertising


In the context of advertising, mathematical models can be
applied to optimize marketing strategies, predict consumer
behavior, and assess the effectiveness of campaigns. Different
classes of models in advertising may include:
• Deterministic Models in Advertising:
• Used to forecast sales or consumer response based on
fixed parameters like budget, media exposure, and
target audience. For example, a linear model might
predict how a specific increase in advertising spend
leads to a proportional increase in sales.
• Stochastic Models in Advertising:
• Employed to account for uncertainties such as
consumer response variability, economic fluctuations,
or competition. For example, a model may use
probability distributions to simulate consumer purchase
decisions based on different levels of ad exposure.
• Dynamic Models in Advertising:
• Used to model the long-term impact of advertising on
consumer behavior over time. For example, a dynamic
model could track how brand loyalty evolves as a result
of continuous advertising efforts.
• Empirical Models in Advertising:
• Often built using historical data from past advertising
campaigns. These models use statistical techniques to
analyze which factors (e.g., time of day, media type, ad
content) correlate with successful outcomes, like
increased sales or brand awareness.
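
To make the deterministic and stochastic advertising models above concrete, here is a minimal Python sketch; the sales-response coefficients and purchase probabilities are illustrative assumptions, not estimates from real campaign data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Deterministic model: sales are fully determined by the ad budget.
# sales = base_sales + response_rate * ad_spend   (assumed coefficients)
base_sales, response_rate = 500.0, 0.8
ad_spend = 1000.0
deterministic_sales = base_sales + response_rate * ad_spend   # always the same result

# Stochastic model: each exposed consumer buys with some probability,
# so repeated runs with the same inputs give different outcomes.
n_consumers = 10_000
purchase_prob = 0.02 + 0.00001 * ad_spend    # assumed exposure-to-purchase relationship
purchases = rng.binomial(1, min(purchase_prob, 1.0), size=n_consumers)
stochastic_sales = purchases.sum()

print("Deterministic sales forecast:", deterministic_sales)
print("One stochastic simulation of sales:", stochastic_sales)
```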


2. Enlist and explain the different mathematical models for decision making.
3. What is data preparation for data mining?
Classes of models
• There are several classes of mathematical models for decision making, which in turn can be solved by a number of alternative solution techniques.
• Each model class is better suited to represent certain types of decision-making processes.
• The main categories of mathematical models for decision making include:

• Predictive models;
• Pattern recognition and learning models;
• Optimization models;
• Project management models;
• Risk analysis models;
• Waiting line models.

Predictive Models
• Predictive models play a primary role in business intelligence systems, since they are logically placed upstream with respect to other mathematical models and, more generally, to the whole decision-making process.
• Predictions allow input information to be fed into different decision-making processes, arising in strategy, research and development, administration and control, marketing, production and logistics.
• Basically, all departmental functions of an enterprise make some use of predictive information to develop decision making.

Pattern recognition and machine learning models
• The purpose of pattern recognition and learning theory is to understand the mechanisms that regulate the development of intelligence, understood as the ability to extract knowledge from past experience in order to apply it in the future.
• Mathematical models for learning can be used to develop efficient algorithms that extract this knowledge from past observations and apply it to future decisions.
Data Preprocessing in Data Mining



Data preprocessing is an important step in the data mining
process. It refers to the cleaning, transforming, and integrating of
data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it
more suitable for the specific data mining task.
Steps of Data Preprocessing
Data preprocessing is an important step in the data mining
process that involves cleaning and transforming raw data to
make it suitable for analysis. Some common steps in data
preprocessing include:
• Data Cleaning: This involves identifying and correcting errors
or inconsistencies in the data, such as missing values,
outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and
transformation.
• Data Integration: This involves combining data from multiple
sources to create a unified dataset. Data integration can be
challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage
and data fusion can be used for data integration.
• Data Transformation: This involves converting the data into a
suitable format for analysis. Common techniques used in data
transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a
common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is
used to convert continuous data into discrete categories.
• Data Reduction: This involves reducing the size of the
dataset while preserving the important information. Data
reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves
selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
• Data Discretization: This involves dividing continuous data
into discrete categories or intervals. Discretization is often
used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved
through techniques such as equal width binning, equal
frequency binning, and clustering.
• Data Normalization: This involves scaling the data to a
common range, such as between 0 and 1 or -1 and 1.
Normalization is often used to handle data with different units
and scales. Common normalization techniques include min-
max normalization, z-score normalization, and decimal
scaling.
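
As a minimal illustration of the normalization, standardization, and discretization steps listed above, the following Python sketch (using pandas and scikit-learn on a small made-up dataset) shows each technique; the column names and values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A small illustrative dataset (assumed values).
df = pd.DataFrame({"age": [18, 25, 32, 47, 61],
                   "income": [12000, 25000, 40000, 52000, 90000]})

# Normalization: scale each column to the common range [0, 1].
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization: transform each column to zero mean and unit variance.
df_zscore = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Discretization: convert continuous ages into discrete categories (equal-width bins).
df["age_group"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

print(df_minmax.round(2))
print(df_zscore.round(2))
print(df[["age", "age_group"]])
```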
Data preprocessing plays a crucial role in ensuring the quality of
data and the accuracy of the analysis results. The specific steps
involved in data preprocessing may vary depending on the
nature of the data and the analysis goals.
By performing these steps, the data mining process becomes
more efficient and the results become more accurate.
Preprocessing in Data Mining
Data preprocessing is a data mining technique which is used to
transform the raw data in a useful and efficient format.
Steps Involved in Data Preprocessing
1. Data Cleaning: The data can have many irrelevant and
missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.
• Missing Data: This situation arises when some data is
missing in the data. It can be handled in various ways.
Some of them are:
• Ignore the tuples: This approach is suitable only
when the dataset we have is quite large and multiple
values are missing within a tuple.
• Fill the Missing values: There are various ways to do
this task. You can choose to fill the missing values
manually, by attribute mean or the most probable
value.

• Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
• Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment's mean, or boundary values can be used to complete the task.
• Regression: Here data can be smoothed by fitting it to a regression function. The regression used may be linear (one independent variable) or multiple (several independent variables).
• Clustering: This approach groups similar data into clusters. Outliers either fall outside the clusters or may go undetected.
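
A minimal Python sketch of the binning method described above: the values are sorted, split into equal-size segments, and each segment is replaced by its mean (smoothing by bin means). The sample values are assumed.

```python
import numpy as np

# Assumed noisy measurements.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

def smooth_by_bin_means(data, bin_size):
    """Sort the data, split it into equal-size bins, and replace every
    value in a bin by that bin's mean (smoothing by bin means)."""
    data = np.sort(data).astype(float)
    smoothed = data.copy()
    for start in range(0, len(data), bin_size):
        segment = data[start:start + bin_size]
        smoothed[start:start + bin_size] = segment.mean()
    return smoothed

print(smooth_by_bin_means(values, bin_size=4))
# e.g. the first bin [4, 8, 9, 15] becomes [9, 9, 9, 9]
```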
2. Data Transformation: This step is taken in order to transform
the data in appropriate forms suitable for mining process. This
involves following ways:
• Normalization: It is done in order to scale the data values in a
specified range (-1.0 to 1.0 or 0.0 to 1.0)
• Attribute Selection: In this strategy, new attributes are
constructed from the given set of attributes to help the mining
process.
• Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
• Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
3. Data Reduction: Data reduction is a crucial step in the data
mining process that involves reducing the size of the dataset
while preserving the important information. This is done to
improve the efficiency of data analysis and to avoid overfitting of
the model. Some common steps involved in data reduction are:
• Feature Selection: This involves selecting a subset of
relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as
correlation analysis, mutual information, and principal
component analysis (PCA).
• Feature Extraction: This involves transforming the data into a
lower-dimensional space while preserving the important
information. Feature extraction is often used when the original
features are high-dimensional and complex. It can be done
using techniques such as PCA, linear discriminant analysis
(LDA), and non-negative matrix factorization (NMF).
• Sampling: This involves selecting a subset of data points from
the dataset. Sampling is often used to reduce the size of the
dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified
sampling, and systematic sampling.
• Clustering: This involves grouping similar data points
together into clusters. Clustering is often used to reduce the
size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such
as k-means, hierarchical clustering, and density-based
clustering.
• Compression: This involves compressing the dataset while
preserving the important information. Compression is often
used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such
as wavelet compression, JPEG compression, and gif
compression.
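
As one possible illustration of the reduction techniques above, the following Python sketch drops a redundant, highly correlated feature and then keeps a random sample of rows; the dataset and the 0.95 correlation threshold are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed dataset: 'height_cm' and 'height_in' are redundant (highly correlated).
n = 1000
height_cm = rng.normal(170, 10, n)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,
    "weight_kg": rng.normal(70, 12, n),
})

# Feature selection: drop one of any pair of features with |correlation| > 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

# Sampling: keep a random 10% of the rows while preserving overall structure.
sample = reduced.sample(frac=0.1, random_state=0)

print("Dropped redundant features:", to_drop)
print("Reduced shape:", sample.shape)
```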
How is Data Preprocessing Used?
As noted earlier, data preprocessing is important in the early stages of developing machine learning and AI applications. In an AI context, preprocessing is applied to optimize the methods used to cleanse, transform, and structure the data, so that a new model can be built more accurately with less computing power.
A good data preprocessing step produces a set of components or tools that can be used to quickly prototype ideas or run experiments on improving business processes or customer satisfaction. For instance, preprocessing can improve the way data is arranged for a recommendation engine by refining the customer age ranges used for categorisation.
Preprocessing can also make it easier to develop and enrich data for business intelligence (BI), which benefits the business. For instance, customers of different sizes, categories, or regions may behave differently; processing the data into the correct formats in the back end enables BI teams to integrate such findings into BI dashboards.
More broadly, data preprocessing is a sub-process of web mining used in customer relationship management (CRM). Web usage logs are usually preprocessed to arrive at meaningful data sets called user transactions, which are groups of URL references. Sessions may be stored to make user identification possible, along with the websites requested, their sequence, and time of use. Once extracted from raw data, this information can be used, for instance, in consumer analysis, product promotion, or customisation.
QUE 4) What is Data Exploration and its process?
Data exploration is the first step in the journey of extracting insights from raw datasets. It serves as the compass that guides data scientists through the vast sea of information: getting to know the data intimately, understanding its structure, and uncovering valuable nuggets that lie hidden beneath the surface.
• What is Data Exploration?
Data exploration is the initial step in data analysis where you dive
into a dataset to get a feel for what it contains. It's like detective
work for your data, where you uncover its characteristics,
patterns, and potential problems.
• Why is it Important?
Data exploration plays a crucial role in data analysis because it
helps you uncover hidden gems within your data. Through this
initial investigation, you can start to identify:
• Patterns and Trends: Are there recurring themes or
relationships between different data points?
• Anomalies: Are there any data points that fall outside the
expected range, potentially indicating errors or outliers?
How Does Data Exploration Work?
• Data Collection: Data exploration commences with collecting
data from diverse sources such as databases, APIs, or
through web scraping techniques. This phase emphasizes
recognizing data formats, structures, and interrelationships.
Comprehensive data profiling is conducted to grasp
fundamental statistics, distributions, and ranges of the
acquired data.
• Data Cleaning: Integral to this process is the rectification of
outliers, inconsistent data points, and addressing missing
values, all of which are vital for ensuring the reliability of
subsequent analyses. This step involves employing
methodologies like standardizing data formats, identifying
outliers, and imputing missing values. Data organization and
transformation further streamline data for analysis and
interpretation.
• Exploratory Data Analysis (EDA): This EDA phase involves
the application of various statistical tools such as box plots,
scatter plots, histograms, and distribution plots. Additionally,
correlation matrices and descriptive statistics are utilized to
uncover links, patterns, and trends within the data.
• Feature Engineering: Feature engineering focuses on
enhancing prediction models by introducing or modifying
features. Techniques like data normalization, scaling,
encoding, and creating new variables are applied. This step
ensures that features are relevant and consistent, ultimately
improving model performance.
• Model Building and Validation: During this stage, preliminary
models are developed to test hypotheses or predictions.
Regression, classification, or clustering techniques are
employed based on the problem at hand. Cross-validation
methods are used to assess model performance and
generalizability.
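
A minimal pandas sketch of the profiling and EDA steps described above, run on a small assumed sample of customer data (the column names and values are placeholders):

```python
import pandas as pd

# Assumed sample of customer data (illustrative values only).
df = pd.DataFrame({
    "age":    [23, 35, 35, 47, None, 29],
    "income": [28000, 52000, 52000, 61000, 45000, 39000],
    "city":   ["Pune", "Mumbai", "Mumbai", "Delhi", "Pune", "Delhi"],
})

# Basic profiling: shape, data types, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))

# Data quality: missing values and duplicate rows.
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

# Relationships: correlation matrix of the numeric columns.
print(df.select_dtypes("number").corr())

# Visual checks such as histograms and box plots would typically follow,
# e.g. df.hist() with matplotlib installed.
```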
Importance of Data Exploration
• Trend Identification and Anomaly Detection: Data exploration
helps uncover underlying trends and patterns within datasets
that might otherwise remain unnoticed. It facilitates the
identification of anomalies or outliers that could significantly
impact decision-making processes. Detecting these trends
early can be critical for businesses to adapt, strategize, or
take preventive measures.
• Ensuring Data Quality and Integrity: It is essential for
spotting and fixing problems with data quality early on.
Through the resolution of missing values, outliers, or
discrepancies, data exploration guarantees that the
information used in later studies and models is accurate and
trustworthy. This enhances the general integrity and
reliability of the conclusions drawn.
• Revealing Latent Insights: Often, valuable insights might be
hidden within the data, not immediately apparent. Through
visualization and statistical analysis, data exploration
uncovers these latent insights, providing a deeper
understanding of relationships between variables,
correlations, or factors influencing certain outcomes.
• Foundation for Advanced Analysis and Modeling: Data
exploration sets the foundation for more sophisticated
analyses and modeling techniques. It helps in selecting
relevant features, understanding their importance, and
refining them for optimal model performance. Without a
thorough exploration, subsequent modeling efforts might
lack depth or accuracy.
• Supporting Informed Decision-Making: By revealing patterns
and insights, data exploration empowers decision-makers
with a clearer understanding of the data context. This
enables informed and evidence-based decision-making
across various domains such as marketing strategies, risk
assessment, resource allocation, and operational efficiency
improvements.
• Adaptability and Innovation: In a rapidly changing
environment, exploring data allows organizations to adapt
and innovate. Identifying emerging trends or changing
consumer behaviors through data exploration can be crucial
in staying competitive and fostering innovation within
industries.
• Risk Mitigation and Compliance: In sectors like finance or
healthcare, data exploration aids in risk mitigation by
identifying potential fraud patterns or predicting health risks
based on patient data. It also contributes to compliance
efforts by ensuring data accuracy and adhering to regulatory
requirements.
Example of Data Exploration
• Finance: Detecting fraudulent activities through anomalous
transaction patterns. In the financial domain, data
exploration plays a pivotal role in safeguarding institutions
against fraudulent practices by meticulously scrutinizing
transactional data. Here's an elaborate exploration:
• Anomaly Detection Techniques: Data exploration employs
advanced anomaly detection algorithms to sift through vast
volumes of transactional data. This involves identifying
deviations from established patterns, such as irregular
transaction amounts, unusual frequency, or unexpected
locations of transactions.
• Behavioral Analysis: By analyzing historical transactional
behaviors, data exploration discerns normal patterns from
suspicious activities. This includes recognizing deviations
from regular spending habits, unusual timeframes for
transactions, or atypical transaction sequences.
• Pattern Recognition: Through sophisticated data exploration
methods, financial institutions can uncover intricate patterns
that might indicate fraudulent behavior. This could involve
recognizing specific sequences of transactions, correlations
between seemingly unrelated accounts, or unusual clusters
of transactions occurring concurrently.
• Machine Learning Models: Leveraging machine learning
models as part of data exploration enables the creation of
predictive fraud detection systems. These models, trained on
historical data, can continuously learn and adapt to evolving
fraudulent tactics, enhancing their accuracy in identifying
suspicious transactions.
• Real-time Monitoring: Data exploration facilitates the
development of real-time monitoring systems. These
systems analyze incoming transactions as they occur, swiftly
flagging potentially fraudulent activities for immediate
investigation or intervention.
• Regulatory Compliance: Data exploration aids in ensuring
regulatory compliance by detecting and preventing
fraudulent activities that might violate financial regulations.
This helps financial institutions adhere to compliance
standards while safeguarding against financial crimes.
Advantages of data exploration
• Data exploration has several benefits, including:
• Offering a comprehensive understanding of the data set
• Highlighting important features and potential issues
• Providing insight into appropriate analysis techniques
• Guiding future research questions and directions
Disadvantages of data exploration
• Limitations of data exploration may include:
• Difficulty visualizing the high-dimensional data from images,
video, and speech
• May become complex for complicated data structures
• Risk of personal biases that influence the exploration
process
• Skewed interpretations of findings
• Misrepresenting data by choosing the wrong summary
indicator
QUE 5). What are different applications of data mining ?
Technically, data mining is the computational process of
analyzing data from different perspectives, dimensions, angles
and categorizing/summarizing it into meaningful information. Data
Mining can be applied to any type of data e.g. Data Warehouses,
Transactional Databases, Relational Databases, Multimedia
Databases, Spatial Databases, Time-series Databases, World
Wide Web.
Data mining provides competitive advantages in the knowledge
economy. It does this by providing the maximum knowledge
needed to rapidly make valuable business decisions despite the
enormous amounts of available data.
There are many measurable benefits that have been achieved in
different application areas from data mining. So, let’s discuss
different applications of Data Mining:

Scientific Analysis:
Scientific simulations generate bulk data every day. This includes data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analyzing these data. We can now capture and store new data faster than we can analyze the old data already accumulated. Examples of scientific analysis:
• Sequence analysis in bioinformatics
• Classification of astronomical objects
• Medical decision support.
Intrusion Detection:
A network intrusion refers to any unauthorized activity on a digital
network. Network intrusions often involve stealing valuable
network resources. Data mining techniques play a vital role in detecting intrusions, network attacks, and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets, and they help classify relevant data for an Intrusion Detection System (IDS). The IDS generates alarms about foreign invasions detected in the network traffic. For example:
• Detect security violations
• Misuse Detection
• Anomaly Detection
Business Transactions:
Every business transaction is recorded and kept for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. Using this data effectively and in a reasonable time frame for competitive decision-making is one of the most important problems for businesses struggling to survive in a highly competitive world. Data mining helps to analyze these business transactions and identify marketing approaches to support decision-making. Example:
• Direct mail targeting
• Stock trading
• Customer segmentation
• Churn prediction (Churn prediction is one of the most
popular Big Data use cases in business)
Market Basket Analysis:
Market Basket Analysis is a technique for carefully studying the purchases made by customers in a supermarket. It identifies patterns of items that customers frequently buy together. This analysis helps companies design deals, offers, and sales, and data mining techniques help to accomplish the analysis task (see the sketch after the examples below). Example:

• Data mining concepts are in use for Sales and marketing to


provide better customer service, to improve cross-selling
opportunities, to increase direct mail response rates.
• Customer Retention in the form of pattern identification and
prediction of likely defections is possible by Data mining.
• Risk Assessment and Fraud area also use the data-mining
concept for identifying inappropriate or unusual behavior etc.
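
To illustrate the market basket analysis described above, here is a small pure-Python sketch that counts how often pairs of items appear in the same basket — a simplified stand-in for full association-rule mining; the transactions are assumed examples.

```python
from collections import Counter
from itertools import combinations

# Assumed example transactions (one list of items per customer basket).
transactions = [
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
    ["bread", "milk", "eggs", "butter"],
]

# Count how often every pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", round(count / len(transactions), 2))
```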
Education:
For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used by both learners and educators. Using EDM we can perform educational tasks such as:
• Predicting student admission in higher education
• Profiling students
• Predicting student performance
• Evaluating teachers' teaching performance
• Curriculum development
• Predicting student placement opportunities
Research:
A data mining technique can perform predictions, classification,
clustering, associations, and grouping of data with perfection in
the research area. Rules generated by data mining are unique to
find results. In most of the technical research in data mining, we
create a training model and testing model. The training/testing
model is a strategy to measure the precision of the proposed
model. It is called Train/Test because we split the data set into
two sets: a training data set and a testing data set. A training data
set used to design the training model whereas testing data set is
used in the testing model. Example:
• Classification of uncertain data.
• Information-based clustering.
• Decision support system
• Web Mining
• Domain-driven data mining
• IoT (Internet of Things)and Cybersecurity
• Smart farming IoT(Internet of Things)
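
A minimal scikit-learn sketch of the train/test strategy mentioned above: the data is split into a training set used to build the model and a testing set used to measure its precision. The synthetic dataset is an assumption for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumed synthetic dataset standing in for real research data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split into a training set (used to design the model) and a testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate the trained model on data it has never seen.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```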
Healthcare and Insurance:
A pharmaceutical company can examine its recent sales force activity and outcomes to improve the targeting of high-value physicians and to determine which promotional activities will have the greatest effect in the coming months. In the insurance sector, data mining can help predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent customer behavior.
• Claims analysis, i.e., which medical procedures are claimed together.
• Identify successful medical therapies for different illnesses.
• Characterize patient behavior to predict office visits.
Transportation:
A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer merchandise organization can apply data mining to improve its sales process to retailers.
• Determine the distribution schedules among outlets.
• Analyze loading patterns.
Financial/Banking Sector:
A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be
interested in a new credit product.
• Credit card fraud detection.
• Identify ‘Loyal’ customers.
• Extraction of information related to customers.
• Determine credit card spending by customer groups.
CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To build a good relationship with the customer, a business organization needs to collect and analyze data. With data mining technologies, the collected data can be used for analytics.
Fraud detection:
Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming and complex. Data mining provides meaningful patterns, turning data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent; a model is constructed from this data and is then used to identify whether a new record is fraudulent or not.
Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing
company. Data mining tools can be beneficial to find patterns in a
complex manufacturing process. Data mining can be used in
system-level designing to obtain the relationships between
product architecture, product portfolio, and data needs of the
customers. It can also be used to forecast the product
development period, cost, and expectations among the other
tasks.

QUE 6. Explain the data mining process in detail.

Data mining means extracting valuable business information from an extensive database. Just as valuable minerals are extracted (mined) from deep below the Earth, important information is searched for in a vast database.
• Data Mining Process
Data mining refers to a technology that involves the mining or the
extraction of knowledge from extensive amounts of data. Data
Mining is the computational procedure of locating patterns in
massive data sets involving artificial intelligence, machine
learning, statistics, and database systems. The main aim of the
data mining process is to extract information from a data set and
translate it into an understandable structure to be used in the
future. The fundamental properties of data mining are Automatic
discovery of patterns, Prediction of likely outcomes, Creation of
actionable information and Focus on large datasets and
databases.
• Steps in Data Mining Process
The data mining process is split into two parts: Data
Preprocessing and Mining. Data Preprocessing involves data
cleaning, integration, reduction, and transformation, while the
mining part does data mining, pattern evaluation, and knowledge
representation of data.
Data Understanding:
This step involves collecting and exploring the data to gain a
better understanding of its structure, quality, and content. This
includes understanding the sources of the data, identifying any
data quality issues, and exploring the data to identify patterns and
relationships. This step is important because it helps ensure that
the data is suitable for analysis.
Data Cleaning
The foremost step in data mining is the cleaning of data. It holds
importance as dirty data can confuse procedures and produce
inaccurate results if used directly in mining. This step helps
remove noisy or incomplete data from the data collection. Some
methods can clean data themselves, but they are not robust. Data
Cleaning carries out its work through the following steps:
(i) Filling The Missing Data: The missing data can be filled by
various methods such as filling the missing data manually, using
the measure of central tendency, median, ignoring the tuple, or
filling in the most probable value.
(ii) Removing Noisy Data: Random error in the data is called noise. This noise can be removed by the method of binning.
• Binning is applied by sorting all the values and distributing them into bins or buckets.
• Smoothing is performed by consulting the adjacent values in each bin.
• Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
• Smoothing by bin medians: each value in a bin is replaced by the bin median.
• Smoothing by bin boundaries: the minimum and maximum values of a bin are its boundaries, and each value is replaced by the closest boundary value.
• Finally, outliers are identified and inconsistencies are resolved.
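
The missing-value handling in step (i) above can be sketched in pandas as follows; the small table and the choice of fill methods per column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed data with missing entries.
df = pd.DataFrame({
    "age":    [25, np.nan, 32, 40, np.nan],
    "salary": [30000, 42000, np.nan, 52000, 47000],
    "city":   ["Pune", "Mumbai", None, "Mumbai", "Pune"],
})

# Option 1: ignore (drop) tuples with missing values - only sensible for large datasets.
dropped = df.dropna()

# Option 2: fill numeric gaps with a measure of central tendency (mean or median).
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["salary"] = filled["salary"].fillna(filled["salary"].median())

# Option 3: fill categorical gaps with the most probable (most frequent) value.
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)
```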
Data Integration
When multiple data sources are combined for analysis, such as
databases, data cubes, or files, this process is called data
integration. This enhances the accuracy and speed of the mining
process. There are different naming conventions of variables for
different databases, causing redundancies. These redundancies
and inconsistencies can be removed by further data cleaning
without affecting the reliability of the data. Data Integration is
performed using migration Tools such as Oracle Data Service
Integrator and Microsoft SQL.
Modeling:
This step involves building a predictive model using machine
learning algorithms. This includes selecting an appropriate
algorithm, training the model on the data, and evaluating its
performance. This step is important because it is the heart of the
data mining process and involves developing a model that can
accurately predict outcomes on new data.
Evaluation:
This step involves evaluating the performance of the model. This
includes using statistical measures to assess how well the model
is able to predict outcomes on new data. This step is important
because it helps ensure that the model is accurate and can be
used in the real world.
Data Reduction
This technique helps obtain only the relevant data for analysis
from data collection. The volume of the representation is much
smaller while maintaining integrity. Data Reduction is performed
using Naive Bayes, Decision Trees, Neural networks, etc. Some
strategies for the reduction of data are:
• Decreasing the number of attributes in the dataset (Dimensionality Reduction)
• Replacing the original data volume with smaller forms of data representation (Numerosity Reduction)
• The compressed representation of the original data (Data Compression).
Data Transformation
Data Transformation is a process that involves transforming the
data into a form suitable for the mining process. Data is merged to
make the mining process more structured and the patterns easier
to understand. Data Transformation involves mapping of the data
and a code generation process.
Strategies for data transformation are:
• Removal of noise from data using methods like clustering, regression techniques, etc. (Smoothing).
• Summary operations are applied to the data (Aggregation).
• Scaling of data to come within a smaller range (Normalisation).
• Intervals replace the raw values of numeric data (Discretization).
Deployment:
This step involves deploying the model into the production
environment. This includes integrating the model into existing
systems and processes to make predictions in real-time. This step
is important because it allows the model to be used in a practical
setting and to generate value for the organization.
Data Mining
Data Mining is the process of identifying interesting patterns and extracting knowledge from an extensive database. Intelligent methods are applied to extract the data patterns. The data is represented as patterns, and models are structured using classification and clustering techniques.
Pattern Evaluation
Pattern Evaluation is the process that involves identifying
interesting patterns representing the knowledge based on some
measures. Data summarization and visualisation methods make
the data understandable to the user.
Knowledge Representation
Data visualisation and knowledge representation tools represent
the mined data in this step. Data is visualised in the form of
reports, tables, etc.
7. What is Data validation ?
What is Data Validation?
Data validation is the process of ensuring that data is accurate,
complete, consistent, and meets predefined quality standards
before it is used, imported, or processed. It helps prevent errors,
ensure data integrity, and maintain high-quality datasets by
enforcing rules and constraints.
It is particularly critical in data migration, data warehousing, and
ETL (Extract, Transform, Load) processes to ensure that data
from different sources conforms to business rules and avoids
corruption due to inconsistencies.
Key Aspects of Data Validation
• Accuracy:
• Ensuring data matches the expected format and value
range.
• Example: A date field must follow YYYY-MM-DD.
• Completeness:
• Verifying that all required fields are filled.
• Example: Customer records must include a non-null
email address.
• Consistency:
• Maintaining uniformity across related data entries.
• Example: A product price must match its category's
pricing policy.
• Uniqueness:
• Ensuring that duplicate data does not exist in fields
marked as unique.
• Example: Each user should have a unique email ID.
• Business Rules Compliance:
• Enforcing rules specific to the domain.
• Example: An order cannot have a delivery date earlier
than its order date.

Techniques Used for Data Validation


• Constraints in Database Schema:
• Primary Keys: Ensure unique identification of records.
• Foreign Keys: Maintain referential integrity.
• Check Constraints: Validate specific conditions (e.g.,
salary > 0).
• Default Values: Assign default values to fields if no
input is provided.
• Triggers:
• Used to enforce rules by executing logic when certain
events occur (e.g., BEFORE INSERT, AFTER
UPDATE).
• Stored Procedures:
• Encapsulate validation logic to be executed during data
entry or modification.
• Application-Level Validation:
• Performed by front-end or middleware layers before
submitting data to the database.
• Validation Frameworks:
• Advanced systems may integrate frameworks like
PL/SQL packages or external libraries for complex
validations.

Importance in Advanced Systems


• Ensures Data Quality:
• Prevents invalid or corrupt data from being stored,
reducing future errors.
• Enhances Security:
• Avoids SQL injection and other vulnerabilities by
sanitizing inputs.
• Optimizes Performance:
• Avoids processing incorrect or incomplete data during
queries.
• Supports Business Processes:
• Maintains trustworthiness of critical business decisions
based on accurate data.

Examples in Advanced Systems


• Banking Databases:
• Validating transaction amounts are non-negative and
within account balance limits.
• Healthcare Systems:
• Ensuring patient data like age, weight, and blood group
follow standard medical ranges.
• E-Commerce Platforms:
• Validating user input for product reviews or order details
(e.g., valid coupon codes).

Types of Data Validation


There are many types of data validation. Most data validation
procedures will perform one or more of these checks to ensure
that the data is correct before storing it in the database. Common
types of data validation checks include:
1. Data Type Check
A data type check confirms that the data entered has the correct
data type. For example, a field might only accept numeric data. If
this is the case, then any data containing other characters such
as letters or special symbols should be rejected by the system.

2. Code Check
A code check ensures that a field is selected from a valid list of
values or follows certain formatting rules. For example, it is easier
to verify that a postal code is valid by checking it against a list of
valid codes. The same concept can be applied to other items
such as country codes and NAICS industry codes.

3. Range Check
A range check will verify whether input data falls within a
predefined range. For example, latitude and longitude are
commonly used in geographic data. A latitude value should be
between -90 and 90, while a longitude value must be between -
180 and 180. Any values out of this range are invalid.

4. Format Check
Many data types follow a certain predefined format. A common
use case is date columns that are stored in a fixed format like
“YYYY-MM-DD” or “DD-MM-YYYY.” A data validation procedure
that ensures dates are in the proper format helps maintain
consistency across data and through time.

5. Consistency Check
A consistency check is a type of logical check that confirms the
data’s been entered in a logically consistent way. An example is
checking if the delivery date is after the shipping date for a parcel.

6. Uniqueness Check
Some data like IDs or e-mail addresses are unique by nature. A
database should likely have unique entries on these fields. A
uniqueness check ensures that an item is not entered multiple
times into a database.
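
A minimal Python/pandas sketch applying several of the checks above (range, format, uniqueness, and completeness) to an assumed orders table; the column names, the YYYY-MM-DD format, and the rules themselves are illustrative assumptions.

```python
import pandas as pd

# Assumed incoming records to validate.
orders = pd.DataFrame({
    "order_id":   [1, 2, 2, 4],
    "email":      ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
    "amount":     [250.0, -10.0, 99.0, 480.0],
    "order_date": ["2024-01-05", "2024-02-30x", "2024-03-01", "2024-03-12"],
})

errors = []

# Range check: amounts must be non-negative.
if (orders["amount"] < 0).any():
    errors.append("range check failed: negative amount found")

# Format check: dates must follow the YYYY-MM-DD pattern.
if not orders["order_date"].str.match(r"^\d{4}-\d{2}-\d{2}$").all():
    errors.append("format check failed: order_date not in YYYY-MM-DD")

# Uniqueness check: order_id must not contain duplicates.
if orders["order_id"].duplicated().any():
    errors.append("uniqueness check failed: duplicate order_id")

# Completeness check: no required field may be missing.
if orders.isna().any().any():
    errors.append("completeness check failed: missing values present")

print(errors or "all validation checks passed")
```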
Benefits of Data Validation
• Cost and Time Efficiency:
• Reduces errors and saves resources by ensuring clean
data upfront.
• Improved Decision-Making:
• High-quality data leads to better outcomes.
• Compatibility:
• Ensures data is ready for use in other processes or
systems.
Limitations of Data Validation
• Complexity in Large Datasets:
• Validation can be time-consuming for massive datasets.
• Synchronization Challenges:
• Outdated or unsynchronized databases may result in
inconsistencies.

Q.8) What is data transformation?

Data Transformation in the context of advanced database


systems refers to the process of converting data from its original
format or structure into a different format or structure to make it
suitable for analysis, storage, or integration with other systems. It
is a key step in ETL (Extract, Transform, Load) processes, data
integration, and data warehousing.
Key Aspects of Data Transformation:
• Standardization:
• Converting data into a common format or standard.
• Example: Converting all dates into a uniform format like
YYYY-MM-DD.
• Cleaning:
• Removing inconsistencies, duplicates, or errors in data.
• Example: Handling missing values, removing outliers,
or correcting typos.
• Aggregation:
• Summarizing or combining data.
• Example: Calculating total sales for a month from daily
sales data.
• Normalization and Denormalization:
• Normalization: Reducing data redundancy by
organizing it into tables.
• Denormalization: Combining tables to optimize query
performance in some scenarios.
• Data Type Conversion:
• Changing the data type of a column or field.
• Example: Converting a string representation of a
number ("123") into an integer.
• Splitting and Merging:
• Splitting a single column into multiple columns or
merging multiple columns into one.
• Example: Splitting a full name field into "First Name"
and "Last Name."
• Encoding and Decoding:
• Transforming categorical data into numerical data (e.g.,
One-Hot Encoding for machine learning).
• Decoding involves reversing the transformation.
• Data Enrichment:
• Adding external data to enhance the dataset.
• Example: Adding geographical coordinates to address
data.
• Filtering:
• Selecting only relevant data for analysis or storage.
• Example: Filtering records based on specific conditions
like "Age > 18."
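
The following pandas sketch illustrates several of the aspects listed above — standardizing dates, converting types, splitting a column, and one-hot encoding — on an assumed customer table; the names and values are illustrative.

```python
import pandas as pd

# Assumed raw customer data.
df = pd.DataFrame({
    "full_name": ["Asha Rao", "Ravi Kumar"],
    "joined":    ["12/11/2024", "03/01/2023"],
    "points":    ["123", "456"],
    "segment":   ["gold", "silver"],
})

# Standardization: convert all dates to a uniform YYYY-MM-DD format.
df["joined"] = pd.to_datetime(df["joined"], dayfirst=True).dt.strftime("%Y-%m-%d")

# Data type conversion: string numbers to integers.
df["points"] = df["points"].astype(int)

# Splitting: one column into "first_name" and "last_name".
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Encoding: categorical "segment" into numeric indicator (one-hot) columns.
df = pd.get_dummies(df, columns=["segment"])

print(df)
```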

Importance of Data Transformation

Data transformation is important because it improves data


quality, compatibility, and utility. The procedure is critical for
companies and organizations that depend on data to make
informed decisions because it assures the data's accuracy,
reliability, and accessibility across many systems and
applications.
• Improved Data Quality: Data transformation eliminates
mistakes, inserts in missing information, and standardizes
formats, resulting in higher-quality, more dependable, and
accurate data.
• Enhanced Compatibility: By converting data into a suitable
format, companies may avoid possible compatibility difficulties
when integrating data from many sources or systems.
• Simplified Data Management: Data transformation is the
process of evaluating and modifying data to maximize storage
and discoverability, making it simpler to manage and maintain.
• Broader Application: Transformed data is more useable and
applicable in a larger variety of scenarios, allowing enterprises
to get the most out of their data.
• Faster Queries: By standardizing data and appropriately
storing it in a warehouse, query performance and BI tools may
be enhanced, resulting in less friction during analysis.
Advantages and Limitations of Data Transformation
Advantages of Data Transformation
• Enhanced Data Quality: Data transformation aids in the
organisation and cleaning of data, improving its quality.
• Compatibility: It guarantees data consistency between many
platforms and systems, which is necessary for integrated
business environments.
• Improved Analysis: Analytical results that are more accurate
and perceptive are frequently the outcome of transformed
data.

Limitations of Data Transformation

• Complexity: When working with big or varied datasets, the


procedure might be laborious and complicated.
• Cost: The resources and tools needed for efficient data
transformation might be expensive.
• Risk of Data Loss: Inadequate transformations may cause
important data to be lost or distorted.
Applications of Data Transformation

Applications for data transformation are found in a number of


industries:
• Business intelligence (BI) is the process of transforming data
for use in real-time reporting and decision-making using BI
technologies.
• Healthcare: Ensuring interoperability across various
healthcare systems by standardization of medical records.
• Financial Services: Compiling and de-identifying financial
information for reporting and compliance needs.
• Retail: Improving customer experience through data
transformation into an analytics-ready format and customer
behavior analysis.
• Customer Relationship Management (CRM): By converting
customer data, firms may obtain insights into consumer
behavior, tailor marketing strategies, and increase customer
satisfaction.

Examples of Data Transformation:


• Date Transformation:
• Converting 12/11/2024 to 2024-11-12 for
standardization.
• Currency Conversion:
• Transforming financial data from USD to INR using a
conversion rate.
• Data Summarization:
• Aggregating sales data by month or region.
• Merging Datasets:
• Combining data from multiple sources like sales and
customer databases to create a unified dataset.
Data transformation is a vital process in advanced database
systems, enabling systems to handle diverse datasets and extract
actionable insights effectively.

Q9). What is data reduction ?

Data Reduction in the context of Advanced Database Systems


refers to techniques aimed at reducing the volume of data while
preserving its essential properties or patterns. This is particularly
useful for managing large datasets efficiently, improving query
performance, reducing storage costs, and simplifying data
analysis.
Key Goals of Data Reduction:

• Reduce Data Volume: Minimize the amount of data stored


and processed.
• Preserve Information: Ensure that the reduced dataset
retains the meaningful and essential patterns or structures.
• Improve Efficiency: Enhance data processing and query
execution speed by working on a smaller dataset.
Techniques for Data Reduction:
• Dimensionality Reduction:
• Reducing the number of attributes (features) in the
dataset.
• Methods:
• Principal Component Analysis (PCA): Projects
data into fewer dimensions while retaining
variance.
• Singular Value Decomposition (SVD): Matrix
factorization to reduce dimensionality.
• Feature Selection: Selecting the most relevant
attributes based on statistical measures.
• Data Compression:
• Compressing data into a more compact form using
encoding techniques.
• Methods:
• Run-Length Encoding (RLE)
• Huffman Coding
• Wavelet Transformation
• Aggregation:
• Summarizing data at a higher abstraction level.
• Example: Rolling up daily sales data to monthly or
yearly sales.
• Sampling:
• Selecting a representative subset of the data.
• Types:
• Random Sampling: Selects data points randomly.
• Stratified Sampling: Ensures that subsets maintain
the proportion of classes or categories.
• Data Cube Aggregation:
• Creating a multi-dimensional summary (data cube) for
data queries and analysis.
• Often used in OLAP systems.
• Numerosity Reduction:
• Replacing the original data with a smaller
representation.
• Methods:
• Parametric Methods: Model the data (e.g.,
regression, Gaussian distribution).
• Non-Parametric Methods: Histogram-based or
clustering-based approximations.
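
As a concrete sketch of dimensionality reduction, the Python example below standardizes an assumed feature matrix and applies scikit-learn's PCA, keeping only enough principal components to retain 95% of the variance (the dataset and threshold are assumptions).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Assumed dataset: 200 records with 10 partly redundant numeric attributes.
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7)) + 0.05 * rng.normal(size=(200, 7))])

# Standardize, then project onto enough principal components to keep 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("original attributes:", X.shape[1])
print("retained components:", X_reduced.shape[1])
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```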
Applications:
• Big Data Analytics
• Machine Learning (to handle large feature spaces)
• Data Warehousing and OLAP (Online Analytical Processing)
• Real-time Systems (to optimize processing time)
By implementing data reduction, advanced database systems can
efficiently handle large-scale datasets while maintaining
performance and analytical integrity.
10) Differentiate Univariate analysis & Bivariate analysis
Univariate Analysis and Bivariate Analysis are two
fundamental methods of statistical analysis that focus on the
relationships between variables, but they differ in their scope and
complexity.
Univariate Analysis
• Definition: Univariate analysis involves the analysis of a
single variable. The goal is to understand its distribution,
central tendency, and spread without considering
relationships with other variables.
• Purpose: To summarize and describe the main features of
the data for a single variable. It helps in understanding the
basic properties of the variable such as its mean, median,
mode, range, variance, and standard deviation.
• Key Techniques:
• Descriptive Statistics: Mean, median, mode, variance,
standard deviation, etc.
• Visualizations: Histograms, box plots, bar charts, and
frequency distributions.
• Assumptions: Each variable is considered in isolation; other variables are not taken into account.
• Examples:
• Examining the distribution of test scores in a class.
• Analyzing the age of a group of individuals.
Bivariate Analysis
• Definition: Bivariate analysis examines the relationship
between two variables. The goal is to understand how
changes in one variable are associated with changes in
another.
• Purpose: To investigate the association or correlation
between two variables, and determine how one variable
affects the other (if at all).
• Key Techniques:
• Correlation: Pearson's or Spearman's correlation to
assess the strength and direction of the relationship.
• Regression Analysis: Linear or logistic regression to
predict or model the relationship between two variables.
• Cross-tabulation: Contingency tables to examine the
relationship between categorical variables.
• Visualizations: Scatter plots, line graphs, or grouped
bar charts.
• Examples:
• Analyzing the relationship between height and weight.
• Examining the correlation between income and
education level.
Key Differences:
• Number of Variables: Univariate analysis focuses on a single variable; bivariate analysis involves two variables.
• Purpose: Univariate analysis describes and summarizes the distribution of one variable; bivariate analysis explores the relationship or association between two variables.
• Analysis Techniques: Univariate analysis uses descriptive statistics, histograms, and box plots; bivariate analysis uses correlation, regression, scatter plots, and cross-tabulation.
• Type of Insights: Univariate analysis measures central tendency, spread, and shape; bivariate analysis examines how one variable influences or relates to the other.
• Examples: Univariate analysis covers age distribution, test scores, or temperature; bivariate analysis covers the relationship between study time and exam scores, or between income and education.
In summary, univariate analysis is focused on understanding
one variable at a time, while bivariate analysis seeks to explore
the relationship between two variables.
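As a small illustration of the difference, assuming pandas is installed (the study-hours and exam-score values are made up):

# Univariate vs. bivariate analysis on a toy dataset (pandas assumed).
import pandas as pd

df = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10],
    "exam_score":  [55, 63, 70, 82, 90],
})

# Univariate: summarize a single variable on its own.
print(df["exam_score"].describe())                 # mean, std, quartiles, etc.

# Bivariate: relate two variables to each other.
print(df["study_hours"].corr(df["exam_score"]))    # Pearson correlation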

11) Explain Multivariate Analysis


Multivariate Analysis
Multivariate analysis is a statistical technique used to examine
the relationships between three or more variables simultaneously.
The primary goal is to understand the interactions and patterns
among multiple variables, which can provide more complex and
nuanced insights than univariate or bivariate analysis.

Purpose of Multivariate Analysis


• To identify relationships or associations between multiple
variables.
• To analyze the impact of several variables on a single
outcome (dependent variable).
• To reduce the dimensionality of data and detect patterns that
aren't obvious with univariate or bivariate analysis.
• To make predictions or classify data based on the
interaction of multiple variables.

Key Techniques in Multivariate Analysis
• Multiple Regression Analysis:
• Extends bivariate regression to include multiple
independent variables.
• Used to model and analyze the relationship between a
dependent variable and several independent variables.
• Example: Predicting house prices based on multiple
factors like location, size, number of rooms, and age of
the house.
• Principal Component Analysis (PCA):
• A dimensionality reduction technique used to reduce
the number of variables while retaining the most
important information.
• PCA transforms the original variables into a new set of
uncorrelated variables called principal components.
• Example: Reducing a large set of financial metrics (e.g.,
revenue, expenses, profit) into a few principal
components to summarize the performance of a
business.
• Factor Analysis:
• Similar to PCA, but focuses on identifying underlying
factors or latent variables that explain the correlations
between observed variables.
• Often used in social sciences to identify patterns in
survey responses or psychological traits.
• Example: Analyzing survey data to identify latent
constructs like "customer satisfaction" based on several
individual questions.
• Cluster Analysis (or Clustering):
• A technique used to group data points into clusters
based on their similarities across multiple variables.
• Common methods include K-means and hierarchical
clustering.
• Example: Segmenting customers into different groups
based on purchasing behavior, income, and location.
• Multivariate Analysis of Variance (MANOVA):
• Extends analysis of variance (ANOVA) to handle
multiple dependent variables.
• Helps determine if the mean differences among groups
(based on independent variables) are statistically
significant for more than one dependent variable.
• Example: Examining the effect of teaching methods
(independent variable) on students' performance across
multiple subjects (dependent variables).
• Discriminant Analysis:
• Used to classify data into predefined categories based
on multiple independent variables.
• Example: Predicting whether a customer will default on
a loan (binary outcome) based on income, credit score,
and other financial indicators.

Types of Multivariate Data


• Quantitative Multivariate Data: Involves numerical
variables, such as test scores, sales numbers, or height and
weight measurements.
• Qualitative Multivariate Data: Involves categorical
variables, such as gender, product type, or customer
feedback categories.

Advantages of Multivariate Analysis


• Comprehensive Insights: It allows for a more holistic
understanding of the data by considering interactions among
multiple variables.
• Improved Prediction Accuracy: Using multiple variables
together can lead to better and more accurate predictions
than using a single variable.
• Handling Complex Relationships: Multivariate analysis
helps in analyzing more complex scenarios where the
relationship between variables is not linear or
straightforward.
Examples of Multivariate Analysis

• Marketing: Analyzing the effect of advertising spending, product price, and customer demographic data on sales performance.
• Healthcare: Understanding how multiple factors (age, diet,
exercise, genetic predispositions) affect the risk of
developing a disease.
• Finance: Analyzing how multiple economic indicators
(interest rates, inflation, unemployment) influence stock
market performance.
Key Differences from Univariate and Bivariate Analysis
• Univariate analysis: Involves one variable, aiming to
describe or summarize its distribution.
• Bivariate analysis: Involves two variables, focusing on the
relationship between them.
• Multivariate analysis: Involves three or more variables,
looking at the relationships and interactions between several
variables simultaneously.
Summary Comparison
• Variables: Univariate analysis uses one variable; bivariate uses two variables; multivariate uses three or more variables.
• Purpose: Univariate analysis summarizes and describes a single variable; bivariate explores the relationship between two variables; multivariate explores relationships and interactions among multiple variables.
• Techniques: Univariate analysis uses descriptive statistics and histograms; bivariate uses correlation, regression, and scatter plots; multivariate uses multiple regression, PCA, factor analysis, clustering, and MANOVA.
• Examples: Univariate covers age distribution and test scores; bivariate covers height vs. weight and income vs. education; multivariate covers predicting sales based on price, advertising, and customer demographics.
• Complexity: Univariate analysis is simple and one-dimensional; bivariate is moderate and two-dimensional; multivariate is a complex, multidimensional analysis.
In conclusion, multivariate analysis provides a more advanced
and sophisticated approach to data analysis, enabling
researchers and analysts to understand the interactions between
multiple variables, make more accurate predictions, and uncover
patterns that are not apparent in univariate or bivariate analysis.
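As a minimal multivariate sketch, assuming scikit-learn is installed; the house-price features and values below are fabricated for illustration:

# Multiple regression: one outcome modeled from several predictors (scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [size_sqft, rooms, age_years] -> price
X = np.array([[900, 2, 30], [1200, 3, 20], [1500, 3, 15],
              [1800, 4, 10], [2200, 4, 5]])
y = np.array([120_000, 160_000, 200_000, 240_000, 300_000])

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)             # effect of each predictor
print("Prediction:", model.predict([[1600, 3, 12]]))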

12) What is Association rule mining? Explain Single-dimension association rules.
• Association Rule Mining
Association Rule Mining is a technique in data mining that aims
to discover interesting relationships (associations) between
variables in large datasets. The goal is to identify patterns in data
that frequently co-occur, such as items that are often bought
together in market basket analysis, or the likelihood that one
event or condition will happen when another occurs.
The fundamental concept of association rule mining is to uncover
if-then relationships between variables, where the presence of
one item (or event) is associated with the presence of another.
These rules are commonly used in retail, marketing, and
recommendation systems, but they can be applied in various
fields such as healthcare, finance, and web mining.
• Key Components of Association Rules:
An association rule can be represented as: A ⇒ B
Where:
• A is the antecedent (or left-hand side).
• B is the consequent (or right-hand side).
• The rule suggests that if A occurs, then B is likely to occur
as well.
For example, in a retail setting:
• Rule: If a customer buys bread, they are likely to buy butter
as well.
• Antecedent (A): Buys bread.
• Consequent (B): Buys butter.
Metrics Used in Association Rule Mining:
• Support:
• This measures the frequency of a combination of items
occurring in the dataset.
• Formula: Support(A ⇒ B) = (Number of transactions containing both A and B) / (Total number of transactions)
• Example: 5% of transactions include both bread and
butter.
• Confidence:
• This measures the likelihood that item B is bought when
item A is bought.
• Formula: Confidence(A ⇒ B) = (Number of transactions containing both A and B) / (Number of transactions containing A)
• Example: If a customer buys bread, 80% of the time
they also buy butter.
• Lift:
• This measures the strength of the rule over random
chance by comparing the confidence of the rule to the
expected confidence if A and B were independent.
• Formula: Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)
• Example: A lift greater than 1 means the rule is more
likely than by random chance, while a lift of 1 indicates
no association.
• Leverage:
• This metric indicates the difference between the
observed frequency of co-occurrence of items A and B
and the expected frequency if they were independent.
• Formula: Leverage(A ⇒ B) = Support(A ⇒ B) − Support(A) × Support(B)
• Conviction:
• A metric used to measure the implication of the rule,
particularly in cases where confidence is high but lift is
low.
• Formula: Conviction(A ⇒ B) = (1 − Support(B)) / (1 − Confidence(A ⇒ B))
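These metrics can be computed directly from a list of transactions, as in the following sketch (the transactions and item names are invented for illustration):

# Compute support, confidence, and lift for a rule A => B from raw transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def rule_metrics(transactions, a, b):
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_b = sum(1 for t in transactions if b in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    return support, confidence, lift

print(rule_metrics(transactions, "bread", "butter"))   # (0.6, 0.75, 0.9375)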
Single-Dimension Association Rules
Single-dimension association rules refer to association rules that
involve only one dimension or category of items in the dataset.
These rules focus on finding relationships between items within a
single attribute or field.
In a typical single-dimension scenario, items are grouped based
on a specific attribute, and the goal is to identify patterns or
relationships between items within that dimension.
For example, consider a retail market basket where the only
dimension of interest is the product category. A single-
dimension association rule might be:
• Rule: If a customer buys bread, they are likely to buy butter
(both items belong to the category "grocery").
• Antecedent (A): Buys bread (category: grocery).
• Consequent (B): Buys butter (category: grocery).
Here, the rule deals with only one category of items (e.g., grocery
products), even though it relates two different products. Both
items are from the same category, and the rule is derived by
observing the frequency with which they are bought together.
Characteristics of Single-Dimension Association Rules:
• Simplicity: These rules are simpler because they focus on a
single attribute (e.g., product type, age group, etc.).
• Interpretability: Since they involve items within one
category, they are often easier to interpret and can help
businesses target specific segments or categories.
• Common Use Case: Typically used in product
recommendation systems or marketing campaigns targeting
specific product categories (like suggesting complementary
items within the same product group).
Examples of Single-Dimension Association Rules:
• Retail Example:
• Rule: If a customer buys shampoo, they are likely to
buy conditioner.
• Both shampoo and conditioner are in the personal
care category.
• Support: 10% of customers buy both shampoo
and conditioner.
• Confidence: 80% of customers who buy shampoo
also buy conditioner.
• Healthcare Example:
• Rule: If a patient has high blood pressure, they are
likely to have high cholesterol.
• Both conditions are part of the cardiovascular
health category.
• Support: 15% of patients have both high blood
pressure and high cholesterol.
• Confidence: 70% of patients with high blood
pressure also have high cholesterol.
• E-commerce Example:
• Rule: Customers who buy laptops are likely to buy
laptop bags.
• Both items belong to the electronics accessories
category.
• Support: 25% of transactions that include laptops
also include laptop bags.
Benefits of Single-Dimension Association Rules:
• Targeted Recommendations: These rules are useful for
recommending related products within a single category,
making them ideal for upselling or cross-selling within a
specific product line.
• Simplicity: Easier to understand and apply, especially when
the focus is on a single product category or attribute.
• Efficiency: Computationally less complex than multi-
dimensional rules because the analysis is limited to one
category or dimension.
Conclusion
In summary, association rule mining is a powerful technique
used to identify relationships between variables or items in large
datasets. Single-dimension association rules focus on
relationships within a single category or attribute, making them
useful for specific recommendations, like suggesting
complementary products within a particular category (e.g., "if you
buy bread, you might also buy butter"). These rules are valuable
for targeted marketing and recommendation systems, where the
data can be simplified to a single dimension for ease of analysis
and interpretation.
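As a rough illustration of mining single-dimension rules by brute force over one attribute (here, the purchased product only), under arbitrary support and confidence thresholds:

# Brute-force single-dimension rule search: every ordered pair of items from
# one attribute (purchased products), filtered by support and confidence.
from itertools import permutations

transactions = [
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"shampoo", "toothpaste"},
    {"soap", "conditioner"},
    {"shampoo", "conditioner", "soap"},
]
min_support, min_confidence = 0.4, 0.7
items = set().union(*transactions)
n = len(transactions)

for a, b in permutations(items, 2):
    n_a = sum(1 for t in transactions if a in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support, confidence = n_ab / n, n_ab / n_a
    if support >= min_support and confidence >= min_confidence:
        print(f"{a} => {b}  (support={support:.2f}, confidence={confidence:.2f})")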

13. Explain Apriori Algorithm.


Ans: The Apriori algorithm is used to calculate association rules between objects, that is, to determine how two or more objects are related to one another. In other words, it is an association rule learning method that identifies patterns such as "people who bought product A also bought product B".
The primary objective of the Apriori algorithm is to generate association rules between different objects; an association rule describes how two or more objects are related to one another. The Apriori algorithm is also called frequent pattern mining, and it is generally run on a database containing a huge number of transactions. For example, mining the transactions of a store such as Big Bazar helps customers buy related products with ease and increases the store's sales performance.
Consider another example: a pizza shop sells a pizza, a soft drink, and breadsticks as a combo, and offers a discount to customers who buy the combo. Why does the seller do so? He knows that customers who buy pizza often also buy soft drinks and breadsticks, so the combo makes the purchase convenient for customers while also increasing his sales. Similarly, in Big Bazar you will find biscuits, chips, and chocolate bundled together, because the shopkeeper knows these products are frequently bought together, and placing them in one spot makes it comfortable for customers to buy them.
The following three components comprise the Apriori algorithm:
• Support
• Confidence
• Lift
Let's take an example to understand this concept.
Suppose you have 4,000 customer transactions at Big Bazar, and you want to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolate, because customers frequently buy these two items together. Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200 transactions contain both Biscuits and Chocolate. Using this data, we will find the support, confidence, and lift.
• Support
Support refers to the default popularity of an item. It is the number of transactions containing that item divided by the total number of transactions. Hence:
Support(Biscuits) = (Transactions containing Biscuits) / (Total transactions) = 400 / 4000 = 10 percent.
• Confidence
Confidence refers to the likelihood that customers who bought Biscuits also bought Chocolate. It is found by dividing the number of transactions that contain both Biscuits and Chocolate by the number of transactions that contain Biscuits. Hence:
Confidence(Biscuits ⇒ Chocolate) = (Transactions containing both Biscuits and Chocolate) / (Transactions containing Biscuits) = 200 / 400 = 50 percent.
It means that 50 percent of customers who bought biscuits also bought chocolates.
• Lift
Lift measures how much more likely customers are to buy Chocolate when they buy Biscuits, compared with buying Chocolate in general. Consistent with the formula Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B):
Lift(Biscuits ⇒ Chocolate) = Confidence(Biscuits ⇒ Chocolate) / Support(Chocolate)
= 50% / 15% ≈ 3.33 (since Support(Chocolate) = 600 / 4000 = 15 percent)
A lift of about 3.3 means that customers who buy biscuits are roughly 3.3 times more likely to also buy chocolate than a randomly chosen customer. A lift value below one indicates that people are unlikely to buy the two items together; the larger the value, the stronger the combination.
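The same worked example can be checked in a few lines of code, using only the transaction counts given above:

# Verify the worked Big Bazar example from the counts above.
total = 4000
biscuits = 400
chocolate = 600
both = 200

support_biscuits = biscuits / total            # 0.10
support_chocolate = chocolate / total          # 0.15
confidence = both / biscuits                   # 0.50
lift = confidence / support_chocolate          # ~3.33

print(f"Support(Biscuits)  = {support_biscuits:.0%}")
print(f"Confidence(B => C) = {confidence:.0%}")
print(f"Lift(B => C)       = {lift:.2f}")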

14. Explain general Association Rules.


Ans: Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps these dependencies so that they can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, using various rules to discover those relationships in the database.
Association rule learning is one of the important concepts of machine learning and is employed in market basket analysis, web usage mining, continuous production, and more. Market basket analysis is a technique used by large retailers to discover associations between items; in a supermarket, for example, products that are frequently purchased together are placed together.
For example, if a customer buys bread, he is most likely to also buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Association rule learning can be divided into three types of algorithms:
• Apriori
• Eclat
• F-P Growth Algorithm
These algorithms are explained in more detail below.
How does Association Rule Learning work?
Association rule learning works on the concept of if-then statements, such as "if A, then B". The "if" element is called the antecedent, and the "then" element is called the consequent. A relationship between two single items is known as single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations among thousands of data items, several metrics are used. These metrics are given below:
• Support
• Confidence
• Lift
Let's understand each of them:
• Support
Support is the frequency of X, i.e., how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:
Support(X) = Freq(X) / T
• Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often items X and Y occur together given that X occurs. It is the ratio of the number of transactions containing both X and Y to the number of transactions containing X:
Confidence(X ⇒ Y) = Freq(X ∪ Y) / Freq(X)
• Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:
Lift(X ⇒ Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible values:
• Lift = 1: The occurrence of the antecedent and the consequent are independent of each other.
• Lift > 1: The two itemsets are positively dependent on each other; the higher the value, the stronger the association.
• Lift < 1: One item is a substitute for the other, meaning one item has a negative effect on the occurrence of the other.
Types of Association Rule Learning
Association rule learning can be divided into three algorithms:
• Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions, and it uses a breadth-first search and a hash tree to count itemsets efficiently.
It is mainly used for market basket analysis and helps to
understand the products that can be bought together. It can also
be used in the healthcare field to find drug reactions for patients.
• Eclat Algorithm
Eclat algorithm stands for Equivalence Class Transformation.
This algorithm uses a depth-first search technique to find frequent
itemsets in a transaction database, and it generally executes faster than the Apriori algorithm.
• F-P Growth Algorithm
The FP-Growth algorithm stands for Frequent Pattern Growth and is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree), from which the most frequent patterns are extracted.
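In practice these algorithms are rarely hand-coded; as a sketch of typical usage, assuming the third-party mlxtend library is installed (the transactions and thresholds are illustrative only):

# Frequent-itemset mining with FP-Growth and rule generation (mlxtend assumed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "eggs"],
    ["milk", "butter", "eggs"],
    ["bread", "butter", "eggs"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets, then derive rules above a confidence threshold.
frequent = fpgrowth(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])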
Applications of Association Rule Learning
It has various applications in machine learning and data mining.
Below are some popular applications of association rule learning:
• Market Basket Analysis: It is one of the popular examples
and applications of association rule mining. This technique is
commonly used by big retailers to determine the association
between items.
• Medical Diagnosis: Association rules help in identifying the probability of a particular illness given other observed conditions, which supports diagnosis and treatment.
• Protein Sequence: Association rules help in determining the synthesis of artificial proteins.
• It is also used for catalog design, loss-leader analysis, and many other applications.

15. What is data mining? Explain the analysis methodologies.
Ans : Data mining can be seen as a subset of data analytics that
specifically focuses on extracting hidden patterns and knowledge
from data. Historically, a data scientist was required to build, refine,
and deploy models. However, with the rise of AutoML tools, data
analysts can now perform these tasks if the model is not too complex.
The data mining process may vary depending on your specific project
and the techniques employed, but it typically involves the 10 key
steps described below.
1. Define Problem. Clearly define the objectives and goals of your
data mining project. Determine what you want to achieve and how
mining data can help in solving the problem or answering specific
questions.

2. Collect Data. Gather relevant data from various sources, including databases, files, APIs, or online platforms. Ensure that the collected data is accurate, complete, and representative of the problem domain. Modern analytics and BI tools often have data integration capabilities; otherwise, you'll need someone with expertise in data management to clean, prepare, and integrate the data.

3. Prep Data. Clean and preprocess your collected data to ensure its
quality and suitability for analysis. This step involves tasks such as
removing duplicate or irrelevant records, handling missing values,
correcting inconsistencies, and transforming the data into a suitable
format.

4. Explore Data. Explore and understand your data through descriptive statistics, visualization techniques, and exploratory data analysis. This step helps in identifying patterns, trends, and outliers in the dataset and gaining insights into the underlying data characteristics.
5. Select predictors. This step, also called feature
selection/engineering, involves identifying the relevant features
(variables) in the dataset that are most informative for the task. This
may involve eliminating irrelevant or redundant features and creating
new features that better represent the problem domain.

6. Select Model. Choose an appropriate model or algorithm based on the nature of the problem, the available data, and the desired outcome. Common techniques include decision trees, regression, clustering, classification, association rule mining, and neural networks. If you need to understand the relationship between the input features and the output prediction (explainable AI), you may want a simpler model like linear regression. If you need a highly accurate prediction and explainability is less important, a more complex model such as a deep neural network may be better.

7. Train Model. Train your selected model using the prepared dataset. This involves feeding the model with the input data and adjusting its parameters or weights to learn from the patterns and relationships present in the data.

8. Evaluate Model. Assess the performance and effectiveness of your trained model using a validation set or cross-validation. This step helps in determining the model's accuracy, predictive power, or clustering quality and whether it meets the desired objectives. You may need to adjust the hyperparameters to prevent overfitting and improve the performance of your model.

9. Deploy Model. Deploy your trained model into a real-world environment where it can be used to make predictions, classify new data instances, or generate insights. This may involve integrating the model into existing systems or creating a user-friendly interface for interacting with the model.
10. Monitor & Maintain Model. Continuously monitor your model's
performance and ensure its accuracy and relevance over time.
Update the model as new data becomes available, and refine the
data mining process based on feedback and changing requirements.
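Steps 6 through 8 can be illustrated in miniature with a sketch like the following, assuming scikit-learn and a synthetic dataset (the model choice and parameters are arbitrary):

# Steps 6-8 in miniature: select, train, and evaluate a model (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(max_depth=4)   # step 6: a simple, explainable model
model.fit(X_train, y_train)                   # step 7: train on the prepared data
preds = model.predict(X_test)                 # step 8: evaluate on held-out data
print("Accuracy:", accuracy_score(y_test, preds))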

Depending on the technology and your business goals, you can choose from a few different data analysis techniques.
Here’s a brief overview of each to help you better understand
which may be best to use depending on the current questions,
data and scenarios you are considering:
1. Text Analysis: Also known as data mining, text analysis uses databases or data mining tools to find patterns within big data sets, transforming raw data into business insights. With the ability to locate patterns in big sets of data, you can then make better decisions.
2. Statistical Analysis: Working from past data (often displayed in a dashboard), statistical analysis answers the question "What happened?" It can be divided into two categories, described in the next two items:
3. Descriptive Analysis: Descriptive analysis relies on either
complete data or a sample of summarized numerical data to
derive insights like the mean and standard deviation.
4. Inferential Analysis: Inferential analysis draws conclusions about a complete data set from a sample of it; interpreting different samples from the same data set can lead to different conclusions (see the short sketch after this list).
5. Diagnostic Analysis: Taking statistical analysis a step further,
you can use diagnostic analysis to answer why something
happened. Therefore, the name implies what this analysis is used
for - diagnosing the cause of an event.
6. Predictive Analysis: Drawing on previous data, predictive analysis forecasts what will happen in the future before it takes place, providing a reasonable answer to "What is most likely going to happen?" The accuracy of this forecast depends on the quality of the past and current data, or inputs.
7. Prescriptive Analysis: As the name implies, this type of analysis is about prescribing the next steps. It utilises all the previous models of analysis to define the best action to take to resolve a current problem. It is one of the most common forms of analysis that business leaders use today to maintain their competitive edge.
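To make the descriptive/inferential distinction concrete, here is a small sketch assuming NumPy and SciPy are installed (the sample values are invented):

# Descriptive vs. inferential analysis on a small sample (NumPy/SciPy assumed).
import numpy as np
from scipy import stats

sample = np.array([23, 27, 31, 29, 25, 35, 28, 30, 26, 32])

# Descriptive: summarize the sample itself.
print("mean:", sample.mean(), "std:", sample.std(ddof=1))

# Inferential: estimate the population mean with a 95% confidence interval.
ci = stats.t.interval(0.95, len(sample) - 1,          # confidence level, degrees of freedom
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI for the population mean:", ci)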
