DMA Notes

The document discusses data mining techniques including data preprocessing steps like cleaning, integration, reduction and transformation. It also covers data mining techniques like classification, clustering, association rule mining and sequential patterns. Knowledge representation techniques and data visualization methods are also explained.

20 May 2023 20:40

Unit 1
Data Mining :
Data mining is the process of discovering patterns, relationships, and insights from
large sets of data. It involves using various techniques and algorithms to extract
useful information from structured or unstructured data sources. The goal of data
mining is to uncover hidden patterns or knowledge that can be used for decision-
making, predictive modeling, and other data-driven tasks.

Data mining Stages

The data mining process is divided into two parts i.e. Data Preprocessing and Data
Mining. Data Preprocessing involves data cleaning, data integration, data reduction,
and data transformation. The data mining part performs data mining, pattern
evaluation and knowledge representation of data.

1. Data cleaning :
a. Remove noisy data.
b. Fill in missing values.
Methods to smooth noisy data include binning: the data is first sorted, partitioned into
bins, and each bin is smoothed by its mean or boundaries (see the binning sketch after this list).
2. Data Integration :
When multiple heterogeneous data sources such as databases, data cubes or files
are combined for analysis, this process is called data integration. This can help in
improving the accuracy and speed of the data mining process.
3. Data Reduction :
This technique is applied to obtain relevant data for analysis from the collection of
data. The size of the representation is much smaller in volume while maintaining
integrity. Data Reduction is performed using methods such as Naive Bayes, Decision
Trees, Neural network, etc.

• Dimensionality Reduction: Reducing the number of attributes in the dataset.


• Numerosity Reduction: Replacing the original data volume by smaller forms of data
representation.
• Data Compression: Compressed representation of the original data.

4. Data transformation :
In this process, data is transformed into a form suitable for the data mining process.
Data is consolidated so that the mining process is more efficient and the patterns
are easier to understand. Data Transformation involves Data Mapping and code
generation process.
5. Data mining :
Data Mining is the process of identifying interesting patterns and knowledge from a
large amount of data. In this step, intelligent methods are applied to extract
data patterns.
6. Pattern Evaluation :
This step involves identifying interesting patterns representing the knowledge
based on interestingness measures. Data summarization and visualization methods
are used to make the data understandable by the user.
7. Knowledge Representation :
Knowledge representation is a step where data visualization and knowledge
representation tools are used to represent the mined data. Data is visualized in the
form of reports, tables, etc.
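
For the binning method mentioned under data cleaning, here is a minimal Python sketch; the sample values and the bin size are made up for illustration:

```python
# Smoothing noisy values by binning: sort, partition into bins,
# then replace every value in a bin by the bin mean.
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative values only

def smooth_by_bin_means(data, bin_size):
    data = sorted(data)                        # binning starts with sorting
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i:i + bin_size]
        mean = sum(bin_) / len(bin_)           # replace each value by the bin mean
        smoothed.extend([round(mean, 2)] * len(bin_))
    return smoothed

print(smooth_by_bin_means(values, bin_size=3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```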

Data Mining Techniques


Association rules :
Association rule mining finds interesting associations and relationships among large
sets of data items. An association rule shows how frequently an itemset occurs in a
transaction. A typical example is Market Basket Analysis.

Support, confidence, and lift are three important measures used in association rule
mining, specifically in the context of analyzing itemsets and their relationships.
Here's a simple explanation of each term:

Support: Support refers to the frequency or prevalence of a particular itemset in a
dataset. It indicates how often an itemset appears in the dataset relative to the
total number of transactions or instances. A higher support value indicates that the
itemset is more frequently occurring. For example, if "Item A" appears in 100 out of
1000 transactions, the support of "Item A" is 100/1000 = 0.1 or 10%.

Confidence: Confidence measures the reliability or strength of a rule or association
between two itemsets. It quantifies how often an itemset B appears in transactions
containing itemset A. It is expressed as a percentage or decimal between 0 and 1.
For example, if a rule states that "Item A" and "Item B" occur together with a
confidence of 0.8, it means that out of all transactions containing "Item A," 80%
also contain "Item B."

Lift: Lift is a measure of the significance or strength of association between two
itemsets, beyond what would be expected by chance. It compares the likelihood of
two itemsets occurring together to the likelihood of their occurrence if they were
independent of each other. A lift value greater than 1 indicates a positive
association, implying that the occurrence of one itemset increases the likelihood of
the other itemset. A lift value less than 1 suggests a negative association or a "rare
rule," meaning the two itemsets tend to occur together less often than expected; a lift
of exactly 1 indicates that the itemsets are independent.

5. Outlier detection:
This type of data mining technique relates to the observation of data items in the
data set which do not match an expected pattern or expected behavior. It may be
used in various domains such as intrusion detection, fraud detection, etc. It is also
known as outlier analysis or outlier mining. An outlier is a data point that diverges
too much from the rest of the dataset; the majority of real-world datasets contain outliers.
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It involves finding interesting
subsequences in a set of sequences, where the value of a sequence can be
measured in terms of different criteria such as length, occurrence frequency, etc.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis,
clustering, classification, etc. It analyzes past events or instances in the right
sequence to predict a future event.

Knowledge Representation

Logical Representation :

Syntax
• It decides how we can construct legal sentences in logic.
• It determines which symbols we can use in knowledge representation.
• Also, how to write those symbols.

Semantics
• Semantics are the rules by which we can interpret the sentences in the logic.
• It assigns a meaning to each sentence.

4. Production Rules
A production rule system consists of (condition, action) pairs, which mean "if condition,
then action". It has three main parts (see the sketch after this list):

1. The set of production rules
2. Working memory
3. The recognize-act cycle
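
As an illustration of the three parts listed above, here is a minimal Python sketch of a production rule system; the facts and rules are invented for the example:

```python
# Tiny production-rule system: rules are (condition, action) pairs,
# facts live in working memory, and the recognize-act cycle fires any
# rule whose condition matches until no new fact can be added.
working_memory = {"has_fever", "has_cough"}

rules = [
    (lambda wm: {"has_fever", "has_cough"} <= wm, "suspect_flu"),
    (lambda wm: "suspect_flu" in wm,              "recommend_rest"),
]

def recognize_act_cycle(wm, rules):
    changed = True
    while changed:                      # repeat until nothing new is derived
        changed = False
        for condition, action in rules:
            if condition(wm) and action not in wm:
                wm.add(action)          # "act": add the rule's conclusion
                changed = True
    return wm

print(recognize_act_cycle(working_memory, rules))
# working memory now also contains 'suspect_flu' and 'recommend_rest'
```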

5. Semantic Networks

2. Frame Representation

UNIT- 2
Data Reduction
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
4. Discretization
5. Feature selection
6. Data sampling: selecting a subset of the data to work with, rather than using the entire dataset.

OLAP
OLAP (Online Analytical Processing) is a technology and approach used for
multidimensional data analysis. It enables users to quickly and interactively analyze
large volumes of data from different dimensions and perspectives.

OLAP applications are used by a variety of functions within an organization, for example:

• Finance and accounting
• Sales and marketing
• Production

Fundamentally, OLAP has a very simple concept. It pre-calculates most of the
queries that are typically very hard to execute over tabular databases, namely
aggregation, joining, and grouping. These queries are calculated during a process
that is usually called 'building' or 'processing' the OLAP cube. This process typically
happens overnight, so that by the time end users get to work the data has been updated.
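
The "pre-calculate aggregations and groupings" idea can be sketched with an ordinary group-by; the following uses pandas on a made-up sales table, so the column names and figures are purely illustrative:

```python
# A toy "cube build": pre-aggregate sales along two dimensions so that
# later slice/dice queries become simple lookups instead of full scans.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [100, 150, 200, 50, 300],
})

# Pre-computed aggregate (one face of the cube): total amount by region x product.
cube = sales.pivot_table(index="region", columns="product",
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)

# "Slice": all products for one region; "dice": one region and one product.
print(cube.loc["South"])
print(cube.loc["South", "A"])   # 250
```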

Its uses :
1. Multidimensional Analysis
2. Data Integration
3. OLAP Cube
4. Slice and Dice
5. Business Intelligence Integration

Applications
1. Marketing and customer relation management
2. Health
3. Financial analysis
4. Social media analysis
5. Transportation and logistics

Common data visualization methods:
• Box plot (e.g., for stock data)
• Chart
• Histogram
• Bar graph
• Heat map (uses different colors to encode values)
• Pie chart
• Tree map

A "past selling graph" typically refers to a
graphical representation of historical sales data
over a specific period of time. It is a visual
depiction of the sales performance of a product,
service, or organization over time, showing the
trends, patterns, and fluctuations in sales.

Class comparison refers to the process of comparing and analysing
different classes or categories within a dataset to identify patterns,
differences, similarities, or relationships between them. It involves
examining the characteristics, attributes, or behaviours of different
classes and assessing how they differ or relate to each other.

• Data collection
• Dimension Relevance
• Synchronous Generalization
• Presentation of the derived Comparison

1. Mean
2. Median
3. Mode
4. Variance
5. Standard Deviation
6. Correlation
7. Covariance

Mean: The mean, also known as the average, represents the
central tendency of a set of data. It is calculated by summing all
the values and dividing by the total number of observations.
The mean provides an indication of the typical value or average
value of a particular attribute.

Median: The median is the middle value in a sorted list of data.
It divides the data into two equal halves, with 50% of the
observations below and 50% above it. The median is useful for
understanding the central position of the data and is less
affected by extreme values (outliers) compared to the mean.

Mode: The mode represents the most frequently occurring
value in a dataset. It identifies the value or values that appear
with the highest frequency. The mode is useful for identifying
the dominant or prominent values within a specific attribute.

Variance: Variance measures the spread or dispersion of data
around the mean. It quantifies the average squared deviation
of each data point from the mean. Higher variance indicates
greater variability in the data, while lower variance suggests a
more concentrated distribution.

Standard Deviation: The standard deviation is the square root
of the variance. It provides a measure of the average distance
between each data point and the mean. A higher standard
deviation indicates a wider spread of data, while a lower
standard deviation indicates a more tightly clustered
distribution.

Correlation: Correlation measures the strength and direction of
the linear relationship between two attributes. It ranges
from -1 to +1, where -1 represents a perfect negative
correlation, +1 represents a perfect positive correlation, and 0
represents no correlation. Correlation helps in understanding
the association or dependence between attributes.

Covariance: Covariance measures the joint variability between
two attributes. It indicates how changes in one attribute are
related to changes in another attribute. Positive covariance
indicates a positive relationship, negative covariance indicates a
negative relationship, and zero covariance indicates no
relationship.
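
The seven measures above can be computed directly in Python; the sketch below uses the standard statistics module for the first five and the textbook definitions for covariance and correlation, on made-up sample values:

```python
# Descriptive statistics for two made-up attributes x and y.
import statistics as st

x = [2, 4, 4, 4, 5, 5, 7, 9]
y = [1, 3, 2, 5, 4, 6, 8, 9]

mean_x   = st.mean(x)        # 5.0
median_x = st.median(x)      # 4.5
mode_x   = st.mode(x)        # 4
var_x    = st.pvariance(x)   # population variance: 4.0
std_x    = st.pstdev(x)      # population standard deviation: 2.0

# Covariance and Pearson correlation from their definitions.
n = len(x)
mean_y = st.mean(y)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
corr_xy = cov_xy / (std_x * st.pstdev(y))

print(mean_x, median_x, mode_x, var_x, std_x)
print(round(cov_xy, 2), round(corr_xy, 2))   # roughly 4.88 and 0.93
```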

Unit -3

Supervised Learning Technique
Maximum Margin Hyperplane

Three types of association rule mining algorithms (a minimal Apriori sketch follows the list):
1. Apriori
2. ECLAT
3. FP-Growth
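
A minimal, unoptimized sketch of the Apriori idea (keep only itemsets whose support meets a threshold, then extend the survivors level by level) on a made-up transaction list; ECLAT and FP-Growth find the same frequent itemsets using different data structures:

```python
# Naive Apriori: keep itemsets whose support meets min_support,
# then extend the surviving itemsets by one item at a time.
transactions = [{"bread", "milk"},
                {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"},
                {"bread", "milk", "diaper"},
                {"bread", "milk", "beer"}]
min_support = 0.4     # itemset must appear in at least 40% of transactions

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent, k = [], 1
current = [frozenset([i]) for i in items if support([i]) >= min_support]
while current:
    frequent.extend(current)
    k += 1
    # candidate k-itemsets built by joining frequent (k-1)-itemsets
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    current = [c for c in candidates if support(c) >= min_support]

for itemset in frequent:
    print(set(itemset), round(support(itemset), 2))
```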

UNIT -4

Descriptive analysis :
Descriptive analytics refers to the analysis of historical data to gain insights and understand patterns, trends,
and relationships within the data. In the context of descriptive analytics, there are several techniques
commonly used, including data modeling, trend analysis, and simple linear regression.

Data Modeling :

Definition: Data modeling is the process of creating a conceptual representation of data and its relationships
within a specific domain or system.

Purpose: Data modeling helps in organizing and understanding complex data structures. It provides a blueprint
for designing and building databases, ensuring data integrity and consistency.
Types of data models: There are different types of data models, including conceptual, logical, and physical
models.

Conceptual data model: It represents high-level business concepts and relationships between them. It focuses
on the essential elements of the domain and is independent of any specific technology or implementation.

Logical data model: It provides a more detailed representation of the data, including entities, attributes, and
relationships. It is designed to be technology-independent but takes into account specific business
requirements.

Physical data model: It defines the actual implementation details of the data model, including database tables,
columns, indexes, and constraints. It is specific to a particular database management system (DBMS) and
considers performance and storage considerations.

Entity-Relationship (ER) diagram: ER diagrams are commonly used to visually represent data models. They
depict entities as rectangles, attributes as ovals, and relationships as lines connecting the entities.

Relationship types: Data modeling includes defining different relationship types, such as one-to-one, one-to-
many, and many-to-many relationships. These relationships capture how entities are connected and interact
with each other.

Normalization: Data modeling involves applying normalization techniques to eliminate data redundancy and
ensure data consistency.
It also helps in the generation of Data Definition Language (DDL).

Conceptual

Logical :

Physical :
A physical data model is a database-specific model that represents relational data objects (for example, tables,
columns, primary and foreign keys) and their relationships. A physical data model can be used to generate DDL
statements which can then be deployed to a database server.
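
As a small illustration of generating and deploying DDL from a physical model, the sketch below defines two hypothetical tables (the table and column names are invented) and runs the statements against an in-memory SQLite database:

```python
# DDL corresponding to a tiny physical model: two tables, a primary key
# on each, and a foreign key between them.
import sqlite3

ddl = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount      REAL,
    FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)                 # deploy the DDL to the database
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)                           # [('customer',), ('customer_order',)]
```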

Trend Analysis :
Trend analysis is a technique used in technical analysis that attempts to predict future stock price movements
based on recently observed trend data. Trend analysis uses historical data, such as price movements and trade
volume, to forecast the long-term direction of market sentiment.
A trend is a general direction the market is taking during a specified period of time.

Types of Trends to Analyze


There are three main types of market trend for analysts to consider:
• Upward trend: An upward trend, also known as a bull market, is a sustained period of rising prices in a
particular security or market. Upward trends are generally seen as a sign of economic strength and can be
driven by factors such as strong demand, rising profits, and favorable economic conditions.
• Downward trend: A downward trend, also known as a bear market, is a sustained period of falling prices in
a particular security or market. Downward trends are generally seen as a sign of economic weakness and
can be driven by factors such as weak demand, declining profits, and unfavorable economic conditions.
• Sideways trend: A sideways trend, also known as a rangebound market, is a period of relatively stable prices
in a particular security or market. Sideways trends can be characterized by a lack of clear direction, with
prices fluctuating within a relatively narrow range.

Linear Regression Analysis :

• Definition: Simple linear regression is a statistical technique used to understand the relationship between
two variables: an independent variable (often referred to as the predictor variable or X variable) and a
dependent variable (the variable being predicted or Y variable).
• Linear relationship: Simple linear regression assumes a linear relationship between the independent and
dependent variables. It means that a change in the independent variable is expected to result in a
proportional change in the dependent variable.

• Equation: The relationship between the variables is represented by a linear equation of the form Y = a + bX,
where Y is the dependent variable, X is the independent variable, a is the intercept (the value of Y when X is
zero), and b is the slope (the change in Y for a unit change in X).
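
A minimal sketch of fitting Y = a + bX by least squares on made-up points, using the closed-form formulas for the slope and intercept:

```python
# Least-squares estimates for simple linear regression Y = a + bX.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]          # made-up observations

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b * x_bar
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"Y = {a:.2f} + {b:.2f} * X")     # Y = 0.14 + 1.96 * X (approximately)
print(round(a + b * 6, 2))              # predicted Y for X = 6, about 11.9
```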

Logistic Regression :

Note :
1. The data values should be complete, i.e., not missing.
2. The number of data values required per class is 30-50.
Definition :
• Linear regression can only represent continuous data; when it comes to representing
categorical data we require logistic regression.
• It is used for classification.
• It uses the sigmoid function, which converts the independent variable into an expression of
probability ranging over [0, 1] with respect to the dependent variable.

(Figure: logistic regression plot with a cut-off line; data points lying on the cut-off line
are known as unclassified data points.)

Logit transform :
It is a mathematical function used in logistic regression to convert the probability of a binary outcome into a
linear form that can be modeled using regression techniques.

The logit transformation is defined as the logarithm of the odds, i.e. the ratio of the probability of the
event occurring (success) to the probability of the event not occurring (failure). Mathematically, the logit
transformation can be represented as follows:
logit(p) = log(p / (1 - p))

Where:

logit(p) represents the logit transformation of the probability p.
p represents the probability of the event occurring.
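
A small sketch showing that the logit transform and the sigmoid function are inverses: the logit maps a probability onto the whole real line, and the sigmoid maps it back:

```python
# logit(p) = log(p / (1 - p)) maps (0, 1) onto (-inf, +inf);
# sigmoid(z) = 1 / (1 + e^(-z)) maps it back to a probability.
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for p in (0.1, 0.5, 0.8):
    z = logit(p)
    print(p, round(z, 3), round(sigmoid(z), 3))
# 0.1 -2.197 0.1
# 0.5 0.0 0.5
# 0.8 1.386 0.8
```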

Interpreting Regression Models

Interpreting regression models involves understanding the relationships between the independent variables
and the dependent variable, as well as the significance and impact of each variable on the outcome.
Some Key steps in the interpretation are :

1. Correlation Coefficients
2. Intercepts, errors
3. Sign
4. Magnitude and significance
5. Residual analysis: Examine the residuals (the differences between the predicted and actual values) to check
for any patterns or deviations from the assumptions of the regression model.
6. Outliers and influential points: Identify any influential data points or outliers that may disproportionately
affect the regression results.

Implementing Predictive Models

Predictive Model :
Predictive models are mathematical or statistical models that are designed to predict or estimate future
outcomes based on historical data and patterns. These models analyze existing data and relationships between
variables to make predictions about unknown or future events.
Implementation steps (see the sketch after this list) :
1. Define problem
2. Data collection
3. Feature Selection : Analysis, Classification, Regression, Clustering, etc.
4. Model Selection : ANN, SVM, Decision Tree, Random Forest, etc.
5. Model Training :
6. Evaluation
7. Deployment
8. Monitor
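
A minimal end-to-end sketch of steps 2-6, assuming scikit-learn is available, using its built-in iris data and a decision tree (one of the model families named above); this is only an illustration, not a prescribed setup:

```python
# Collect data -> select model -> train -> evaluate, in a few lines.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # data collection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)                # hold-out split

model = DecisionTreeClassifier(max_depth=3, random_state=42)  # model selection
model.fit(X_train, y_train)                              # model training

y_pred = model.predict(X_test)                           # evaluation
print("accuracy:", round(accuracy_score(y_test, y_pred), 3))
```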

Linearization transforms, their uses & limitations

Linearization transforms are mathematical techniques used to transform nonlinear relationships or data into a
linear form. These transformations help in making the relationship between variables more amenable to linear
regression or other linear modeling techniques. Here are some common linearization transforms, their uses,
and limitations:

In general, the [X] transform involves taking the [X] of one or both variables in a relationship. Common examples include:


1. Logarithmic Transform
2. Squared Transform
3. Exponential Transform
4. Power Transform : square root, cube root, and reciprocal transformations.
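
As an illustration of the logarithmic transform, the sketch below generates data from an exponential relationship y = a*e^(b*x) (the constants are made up) and shows that taking log(y) makes it linear in x, so the ordinary least-squares formulas apply:

```python
# Log transform: y = a * exp(b * x) becomes log(y) = log(a) + b * x.
import math

a, b = 2.0, 0.5                        # made-up constants
xs = [0, 1, 2, 3, 4, 5]
ys = [a * math.exp(b * x) for x in xs]

log_ys = [math.log(y) for y in ys]     # now linear in x
# Fit log(y) = c0 + c1 * x with the usual least-squares formulas.
n = len(xs)
mx = sum(xs) / n
my = sum(log_ys) / n
c1 = sum((x - mx) * (ly - my) for x, ly in zip(xs, log_ys)) \
     / sum((x - mx) ** 2 for x in xs)
c0 = my - c1 * mx

print(round(c1, 3), round(math.exp(c0), 3))   # recovers b = 0.5 and a = 2.0
```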

Limitations :
1. Information loss
2. The linearity assumption sometimes becomes invalid
3. Not valid for complex equations
4. Choosing an appropriate linearizing transform can be difficult

Uses :
1. Normalization
2. Regression
3. Outlier Detection
4. Dimension Reduction : PCA
5. Data Standardization

Heuristic Methods :
The heuristic method refers to finding the best possible solution to a problem quickly, effectively, and
efficiently. The word heuristic is derived from an ancient Greek word, 'eurisko.' It means to find, discover, or
search. The method is helpful in getting a satisfactory solution to a much larger problem within a limited time
frame.
Types :
1. Reduction
2. Local Search
3. Dividing
4. Inductive
5. Constructive

Principles
1. Understanding the problem
2. Making a plan
3. Implementing a plan
4. Evaluation

Poisson and Binomial Link Functions

In statistics, link functions are used in generalized linear models (GLMs) to relate the linear predictor of a model
to the mean of the response variable. The link function defines the relationship between the expected value of
the response variable and the linear combination of predictors.

Both the Poisson and binomial distributions are commonly used in GLMs, and they have specific link functions
associated with them.

Poisson Distribution:
The Poisson distribution is used to model count data, where the response variable represents the number of
occurrences of an event in a fixed interval. The link function commonly used for the Poisson distribution is the
logarithmic or log link function. It is defined as:
g(μ) = log(μ)

where g(μ) is the link function and μ is the mean of the Poisson distribution. The log link function ensures that
the predicted values are always positive, which is necessary for count data.

Binomial Distribution:
The binomial distribution is used to model binary or categorical data, where the response variable has two
possible outcomes (e.g., success/failure, yes/no). The link function commonly used for the binomial distribution
is the logit link function. It is defined as:
g(μ) = log(μ / (1 - μ))


where g(μ) is the link function and μ is the probability of success. The logit link function maps the range of
probabilities (0 to 1) to the entire real line, allowing for modeling binary outcomes using linear predictors.
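
Assuming the statsmodels library is available, here is a minimal sketch of fitting a Poisson GLM (log link) and a binomial GLM (logit link) on tiny made-up data:

```python
# Poisson GLM (log link) and binomial GLM (logit link) on toy data.
import numpy as np
import statsmodels.api as sm

x = np.arange(1, 11)
X = sm.add_constant(x)                  # intercept + one predictor

counts = np.array([1, 2, 2, 3, 5, 6, 8, 11, 14, 18])        # count response
poisson_model = sm.GLM(counts, X, family=sm.families.Poisson())   # log link by default
print(poisson_model.fit().params)

successes = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])         # binary response
binomial_model = sm.GLM(successes, X, family=sm.families.Binomial())  # logit link by default
print(binomial_model.fit().params)
```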

It's worth noting that while these are the commonly used link functions for the Poisson and binomial
distributions, other link functions are also possible depending on the specific modeling context and
requirements. The choice of link function can impact the interpretation and performance of the model.

Tests of hypothesis - Wald test, LR test, score test, test of overall regression.

Wald Test:
The Wald test is a statistical test used to assess the significance of individual coefficients in a generalized linear
model (GLM). It is based on the asymptotic normality of the maximum likelihood estimates. The test compares
the estimated coefficient to its estimated standard error, assuming a normal distribution.
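
A minimal sketch of the Wald test for a single coefficient: divide the estimate by its standard error and compare the statistic to a standard normal; the estimate and standard error below are made-up numbers:

```python
# Wald test: z = beta_hat / se(beta_hat), two-sided p-value from N(0, 1).
from scipy.stats import norm

beta_hat = 0.85        # estimated coefficient (made up)
se_beta  = 0.30        # its estimated standard error (made up)

z = beta_hat / se_beta
p_value = 2 * (1 - norm.cdf(abs(z)))

print(round(z, 3), round(p_value, 4))   # z is about 2.833, p about 0.0046
```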

Likelihood Ratio (LR) Test:


The Likelihood Ratio test is used to compare the fit of two nested models in a GLM. It assesses whether adding
or removing a set of predictors significantly improves or worsens the model fit. The LR test is based on
comparing the likelihoods of the two models.

Score Test:
The Score test, also known as the Lagrange Multiplier test, is another test used to assess the significance of
parameters in a GLM. It is based on the score function, which measures the first derivative of the log-likelihood
function with respect to the parameters.

Test of Overall Regression:


The Test of Overall Regression, also known as the global or omnibus test, is used to determine the overall
significance of a GLM. It assesses whether there is a statistically significant relationship between the predictors
and the response variable.

