DMA Notes
Unit 1
Data Mining :
Data mining is the process of discovering patterns, relationships, and insights from
large sets of data. It involves using various techniques and algorithms to extract
useful information from structured or unstructured data sources. The goal of data
mining is to uncover hidden patterns or knowledge that can be used for decision-
making, predictive modeling, and other data-driven tasks.
The data mining process is divided into two parts i.e. Data Preprocessing and Data
Mining. Data Preprocessing involves data cleaning, data integration, data reduction,
and data transformation. The data mining part performs data mining, pattern
evaluation and knowledge representation of data.
1. Data Cleaning:
a. Remove the noisy data.
b. Fill in the missing data. Methods used to smooth noise include binning and sorting (a small binning sketch follows below).
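A minimal sketch of one noise-smoothing method, equal-width binning with pandas (the column name age and the toy values are illustrative assumptions):

import numpy as np
import pandas as pd

# Toy column with noisy values and one missing entry
df = pd.DataFrame({"age": [4, 8, 9, 15, 21, 21, 24, 25, np.nan, 26, 28, 29, 34]})

# Fill the missing value with the column mean (one simple strategy)
df["age"] = df["age"].fillna(df["age"].mean())

# Equal-width binning into 3 bins, then smooth each value by replacing it
# with the mean of its bin ("smoothing by bin means")
df["bin"] = pd.cut(df["age"], bins=3)
df["age_smoothed"] = df.groupby("bin")["age"].transform("mean")
print(df[["age", "bin", "age_smoothed"]])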
2. Data Integration :
When multiple heterogeneous data sources such as databases, data cubes or files
are combined for analysis, this process is called data integration. This can help in
improving the accuracy and speed of the data mining process.
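A tiny sketch of integrating two heterogeneous sources on a shared key with pandas (the table and column names are illustrative assumptions):

import pandas as pd

# Two sources describing the same customers
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
sales = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [250, 120, 90]})

# Combine the sources on the common key so they can be mined together
integrated = crm.merge(sales, on="customer_id", how="left")
print(integrated)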
3. Data Reduction :
This technique is applied to obtain relevant data for analysis from the collection of
data. The size of the representation is much smaller in volume while maintaining
integrity. Data Reduction is performed using methods such as Naive Bayes, Decision
Trees, Neural network, etc.
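One hedged sketch of data reduction by attribute selection with a decision tree, in the spirit of the decision-tree method mentioned above (scikit-learn and the iris data are assumptions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit a small tree and keep only the attributes it finds most informative,
# giving a smaller representation of the data
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
keep = np.argsort(tree.feature_importances_)[-2:]  # top 2 attributes
X_reduced = X[:, keep]
print("kept feature indices:", keep, "reduced shape:", X_reduced.shape)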
4. Data transformation :
In this process, data is transformed into a form suitable for the data mining process.
Data is consolidated so that the mining process is more efficient and the patterns
are easier to understand. Data Transformation involves Data Mapping and code
generation process.
5. Data mining :
Data Mining is a process to identify interesting patterns and knowledge from a
large amount of data. In this step, intelligent methods are applied to extract data patterns.
6. Pattern Evaluation :
This step involves identifying interesting patterns representing the knowledge
based on interestingness measures. Data summarization and visualization methods
are used to make the data understandable by the user.
7. Knowledge Representation :
Knowledge representation is a step where data visualization and knowledge
representation tools are used to represent the mined data. Data is visualized in the
form of reports, tables, etc.
Association rules :
Association rule mining finds interesting associations and relationships among large
sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Support, confidence, and lift are three important measures used in association rule mining, specifically in the context of analyzing itemsets and their relationships. Here's a simple explanation of each term:
• Support: the fraction of all transactions that contain the itemset.
• Confidence: for a rule X → Y, the fraction of transactions containing X that also contain Y.
• Lift: the confidence of X → Y divided by the support of Y; a lift greater than 1 means X and Y appear together more often than expected by chance.
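A small sketch computing these measures for one rule on made-up market-basket transactions:

# Toy transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {bread} -> {milk}
X, Y = {"bread"}, {"milk"}
sup_xy = support(X | Y)
conf = sup_xy / support(X)
lift = conf / support(Y)
print(f"support={sup_xy:.2f} confidence={conf:.2f} lift={lift:.2f}")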
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. The outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets have outliers.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It involves finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trends,
clustering, classification, etc. It analyzes past events or instances in the right
sequence to predict a future event.
Knowledge Representation
Logical Representation :
Syntax:
• It decides how we can construct legal sentences in the logic.
• It determines which symbols we can use in knowledge representation.
• Also, how to write those symbols.
Semantics:
• Semantics are the rules by which we can interpret the sentences in the logic.
• It assigns a meaning to each sentence.
2. Frame Representation
3. Semantic Network Representation
4. Production Rules
A production rules system consists of (condition, action) pairs, which mean "If condition then action". It has mainly three parts: the set of production rules, working memory, and the recognize-act cycle.
UNIT- 2
Data Reduction
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
4. Discretization
5. Feature selection
6. Data sampling: This technique involves selecting a subset of the data to work with, rather than using the entire dataset (see the sketch below).
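A minimal sketch of random sampling with pandas (the 10% fraction is an arbitrary illustrative choice):

import numpy as np
import pandas as pd

# Illustrative "large" dataset
df = pd.DataFrame({"value": np.random.randn(10_000)})

# Work with a 10% random sample instead of the full dataset
sample = df.sample(frac=0.1, random_state=42)
print(len(df), "rows reduced to", len(sample), "rows")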
OLAP (Online Analytical Processing) is a technology and approach used for
multidimensional data analysis. It enables users to quickly and interactively analyze
large volumes of data from different dimensions and perspectives.
Its Uses :
1. Multidimensional Analysis
2. Data Integration
3. OLAP Cube
4. Slice and Dice (see the sketch after this list)
5. Business Intelligence Integration
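A rough sketch of slice and dice using a pandas pivot table as a stand-in for an OLAP cube (the dimension and measure names are illustrative assumptions):

import pandas as pd

# Fact table with two dimensions (region, quarter) and one measure (sales)
facts = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "sales":   [100, 120, 80, 90, 60, 40],
})

# "Cube": aggregate the measure over the two dimensions
cube = facts.pivot_table(values="sales", index="region", columns="quarter", aggfunc="sum")
print(cube)

# Slice: fix one dimension value (only Q1)
print(cube["Q1"])

# Dice: pick a sub-cube (a subset of values on each dimension)
print(cube.loc[["North"], ["Q1", "Q2"]])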
Applications
1. Marketing and customer relation management
2. Health
3. Financial analysis
4. Social media analysis
5. Transportation and logistics
Chart types (a short plotting sketch follows this list):
• Box plot (e.g., stocks)
• Histogram
• Bar graph
• Heat map (different colors)
• Pie chart
• Tree map
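A quick matplotlib sketch of a few of the chart types above (histogram, box plot, heat map) on random data, just to show the basic calls:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(data, bins=20)               # histogram: distribution of values
axes[0].set_title("Histogram")

axes[1].boxplot(data)                     # box plot: median, quartiles, outliers
axes[1].set_title("Box plot")

im = axes[2].imshow(rng.random((5, 5)))   # heat map: values shown as colors
axes[2].set_title("Heat map")
fig.colorbar(im, ax=axes[2])

plt.tight_layout()
plt.show()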
A "past selling graph" typically refers to a
graphical representation of historical sales data
over a specific period of time. It is a visual
depiction of the sales performance of a product,
service, or organization over time, showing the
trends, patterns, and fluctuations in sales.
Class comparison refers to the process of comparing and analysing
different classes or categories within a dataset to identify patterns,
differences, similarities, or relationships between them. It involves
examining the characteristics, attributes, or behaviours of different
classes and assessing how they differ or relate to each other.
• Data collection
• Dimension Relevance
• Synchronous Generalization
• Presentation of the derived Comparison
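A small sketch of class comparison with pandas, summarizing attributes per class so the classes can be compared (column names are illustrative assumptions):

import pandas as pd

# Toy dataset with a class label and two attributes
df = pd.DataFrame({
    "class":  ["A", "A", "A", "B", "B", "B"],
    "income": [30, 35, 32, 60, 58, 65],
    "age":    [25, 30, 28, 45, 50, 48],
})

# Compare the classes on each attribute (mean and standard deviation)
print(df.groupby("class").agg(["mean", "std"]))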
1. Mean
2. Median
3. Mode
4. Variance
5. Standard Deviation
6. Correlation
7. Covariance
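A sketch computing the measures listed above with pandas on made-up data:

import pandas as pd

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])
y = pd.Series([1, 3, 2, 5, 4, 6, 8, 9])

print("mean      :", x.mean())
print("median    :", x.median())
print("mode      :", x.mode().tolist())
print("variance  :", x.var())      # sample variance by default
print("std dev   :", x.std())
print("correlation x,y:", x.corr(y))
print("covariance  x,y:", x.cov(y))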
Unit -3
Supervised Learning Technique
Maximum Margin Hyperplane
Three main types of association rule mining algorithms (a small Apriori sketch follows the list):
1. Apriori
2. ECLAT
3. FP-Growth
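A very small, self-contained sketch of the Apriori idea: grow frequent itemsets level by level, keeping only those that meet minimum support (the transactions and threshold are illustrative, and the full candidate-pruning step is omitted):

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]
min_support = 0.4  # itemset must appear in at least 40% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}
k = 1
# Start with frequent 1-itemsets, then extend them level by level
current = [frozenset([i]) for i in items if support({i}) >= min_support]
while current:
    frequent[k] = current
    k += 1
    # Candidate k-itemsets: unions of frequent (k-1)-itemsets
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    current = [c for c in candidates if support(c) >= min_support]

for size, itemsets in frequent.items():
    print(size, [set(s) for s in itemsets])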
UNIT -4
Descriptive analysis :
Descriptive analytics refers to the analysis of historical data to gain insights and understand patterns, trends,
and relationships within the data. In the context of descriptive analytics, there are several techniques
commonly used, including data modeling, trend analysis, and simple linear regression.
Data Modeling :
Definition: Data modeling is the process of creating a conceptual representation of data and its relationships
within a specific domain or system.
Purpose: Data modeling helps in organizing and understanding complex data structures. It provides a blueprint
for designing and building databases, ensuring data integrity and consistency.
Types of data models: There are different types of data models, including conceptual, logical, and physical
models.
Conceptual data model: It represents high-level business concepts and relationships between them. It focuses
on the essential elements of the domain and is independent of any specific technology or implementation.
Logical data model: It provides a more detailed representation of the data, including entities, attributes, and
relationships. It is designed to be technology-independent but takes into account specific business
requirements.
Physical data model: It defines the actual implementation details of the data model, including database tables,
columns, indexes, and constraints. It is specific to a particular database management system (DBMS) and
considers performance and storage considerations.
Entity-Relationship (ER) diagram: ER diagrams are commonly used to visually represent data models. They
depict entities as rectangles, attributes as ovals, and relationships as lines connecting the entities.
Relationship types: Data modeling includes defining different relationship types, such as one-to-one, one-to-
many, and many-to-many relationships. These relationships capture how entities are connected and interact
with each other.
Normalization: Data modeling involves applying normalization techniques to eliminate data redundancy and ensure data consistency. It also helps in the generation of Data Definition Language (DDL) statements.
Conceptual :
Logical :
Physical :
A physical data model is a database-specific model that represents relational data objects (for example, tables,
columns, primary and foreign keys) and their relationships. A physical data model can be used to generate DDL
statements which can then be deployed to a database server.
Trend Analysis :
Trend analysis is a technique used in technical analysis that attempts to predict future stock price movements
based on recently observed trend data. Trend analysis uses historical data, such as price movements and trade
volume, to forecast the long-term direction of market sentiment.
A trend is a general direction the market is taking during a specified period of time.
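A minimal sketch of trend analysis with a moving average over made-up closing prices (pandas; the window length is an arbitrary choice):

import numpy as np
import pandas as pd

# Made-up daily closing prices
rng = np.random.default_rng(1)
prices = pd.Series(100 + np.cumsum(rng.normal(0.2, 1.0, 60)),
                   index=pd.date_range("2023-01-02", periods=60, freq="B"))

# A 10-day moving average smooths short-term noise and shows the trend direction
trend = prices.rolling(window=10).mean()
print(trend.tail())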
Simple Linear Regression :
• Definition: Simple linear regression is a statistical technique used to understand the relationship between
two variables: an independent variable (often referred to as the predictor variable or X variable) and a
dependent variable (the variable being predicted or Y variable).
• Linear relationship: Simple linear regression assumes a linear relationship between the independent and
dependent variables. It means that a change in the independent variable is expected to result in a
proportional change in the dependent variable.
• Equation: The relationship between the variables is represented by a linear equation of the form Y = a + bX,
where Y is the dependent variable, X is the independent variable, a is the intercept (the value of Y when X is
zero), and b is the slope (the change in Y for a unit change in X).
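A small numpy sketch fitting Y = a + bX by least squares on toy data:

import numpy as np

# Toy data roughly following Y = 2 + 3X plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 2 + 3 * X + rng.normal(0, 1, 50)

# Degree-1 polyfit returns the slope b and intercept a
b, a = np.polyfit(X, Y, 1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Predict Y for a new X value
print("predicted Y at X=12:", a + b * 12)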
Logistic Regression :
Logit transform :
It is a mathematical function used in logistic regression to convert the probability of a binary outcome into a
linear form that can be modeled using regression techniques.
The logit transformation is defined as the logarithm of the odds ratio, which is the ratio of the probability of the
event occurring (success) to the probability of the event not occurring (failure). Mathematically, the logit
transformation can be represented as follows:
logit(p) = log(p / (1 - p))
Where p is the probability of the event occurring, so p / (1 - p) is the odds of the event.
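A tiny numpy sketch of the logit transform and its inverse (the sigmoid), which maps a value on the linear scale back to a probability:

import numpy as np

def logit(p):
    # Log-odds: maps a probability in (0, 1) onto the whole real line
    return np.log(p / (1 - p))

def inv_logit(z):
    # Sigmoid: maps a real number back to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)
print("logit :", z)            # approx [-2.20, 0.00, 2.20]
print("back  :", inv_logit(z)) # recovers [0.1, 0.5, 0.9]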
Interpreting regression models involves understanding the relationships between the independent variables
and the dependent variable, as well as the significance and impact of each variable on the outcome.
Some Key steps in the interpretation are :
1. Correlation Coefficients
2. Intercepts, errors
3. Sign
4. Magnitude and significance
5. Residual analysis: Examine the residuals (the differences between the predicted and actual values) to check
for any patterns or deviations from the assumptions of the regression model.
6. Outliers and influential points: Identify any influential data points or outliers that may disproportionately
affect the regression results.
Implementing Predictive Models
Predictive Model :
Predictive models are mathematical or statistical models that are designed to predict or estimate future
outcomes based on historical data and patterns. These models analyze existing data and relationships between
variables to make predictions about unknown or future events.
Implementation steps (a small end-to-end sketch follows the list):
1. Define the problem
2. Data collection
3. Feature selection: analysis, classification, regression, clustering, etc.
4. Model selection: ANN, SVM, Decision Tree, Random Forest, etc.
5. Model training
6. Evaluation
7. Deployment
8. Monitoring
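A hedged sketch of steps 2 to 6 with scikit-learn, using a toy dataset and a decision tree as the chosen model (all names here are illustrative, not a prescribed pipeline):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data collection (toy dataset standing in for real historical data)
X, y = load_breast_cancer(return_X_y=True)

# Split for training and evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Model selection and training
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Evaluation before deployment and monitoring
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))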
Linearization transforms are mathematical techniques used to transform nonlinear relationships or data into a
linear form. These transformations help in making the relationship between variables more amenable to linear
regression or other linear modeling techniques. Here are some common linearization transforms, their uses,
and limitations:
Limitations :
1. Information loss
2. The linearity assumption sometimes becomes invalid
3. Not valid for complex equations
4. Choosing an appropriate transform can be difficult
Uses :
1. Normalization
2. Regression
3. Outlier Detection
4. Dimension Reduction : PCA
5. Data Standardization
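A quick sketch of one common linearization transform: taking logs of an approximately exponential relationship y = a·e^(bx), so that log(y) = log(a) + bx can be fit with ordinary linear regression (the data are made up):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 40)
y = 2.0 * np.exp(0.8 * x) * rng.normal(1.0, 0.05, 40)  # noisy exponential data

# log(y) is linear in x, so fit a straight line to log(y)
b, log_a = np.polyfit(x, np.log(y), 1)
print(f"estimated a = {np.exp(log_a):.2f}, b = {b:.2f}")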
Heuristic Methods :
The heuristic method refers to finding the best possible solution to a problem quickly, effectively, and
efficiently. The word heuristic is derived from an ancient Greek word, 'eurisko.' It means to find, discover, or
search. The method is helpful in getting a satisfactory solution to a much larger problem within a limited time
frame.
Types :
1. Reduction
2. Local Search
3. Dividing
4. Inductive
5. Constructive
Principles
1. Understanding the problem
2. Making a plan
3. Implementing a plan
4. Evaluation
In statistics, link functions are used in generalized linear models (GLMs) to relate the linear predictor of a model
to the mean of the response variable. The link function defines the relationship between the expected value of
the response variable and the linear combination of predictors.
Both the Poisson and binomial distributions are commonly used in GLMs, and they have specific link functions
associated with them.
Poisson Distribution:
The Poisson distribution is used to model count data, where the response variable represents the number of
occurrences of an event in a fixed interval. The link function commonly used for the Poisson distribution is the
logarithmic or log link function. It is defined as:
g(μ) = log(μ)
where g(μ) is the link function and μ is the mean of the Poisson distribution. The log link function ensures that
the predicted values are always positive, which is necessary for count data.
Binomial Distribution:
The binomial distribution is used to model binary or categorical data, where the response variable has two
possible outcomes (e.g., success/failure, yes/no). The link function commonly used for the binomial distribution
is the logit link function. It is defined as:
g(μ) = log(μ / (1 - μ))
where g(μ) is the link function and μ is the probability of success. The logit link function maps the range of
probabilities (0 to 1) to the entire real line, allowing for modeling binary outcomes using linear predictors.
It's worth noting that while these are the commonly used link functions for the Poisson and binomial
distributions, other link functions are also possible depending on the specific modeling context and
requirements. The choice of link function can impact the interpretation and performance of the model.
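A hedged sketch fitting a Poisson GLM (log link) and a binomial GLM (logit link) on simulated data, assuming statsmodels is available:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = sm.add_constant(x)

# Poisson response with mean mu = exp(0.5 + 0.8*x), i.e. a log link
y_counts = rng.poisson(np.exp(0.5 + 0.8 * x))
poisson_fit = sm.GLM(y_counts, X, family=sm.families.Poisson()).fit()
print("Poisson (log link) coefficients:", poisson_fit.params)

# Binomial response with P(success) = sigmoid(-0.2 + 1.5*x), i.e. a logit link
p = 1 / (1 + np.exp(-(-0.2 + 1.5 * x)))
y_binary = rng.binomial(1, p)
binom_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()
print("Binomial (logit link) coefficients:", binom_fit.params)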
Tests of hypothesis: Wald test, LR test, Score test, test of overall regression.
Wald Test:
The Wald test is a statistical test used to assess the significance of individual coefficients in a generalized linear
model (GLM). It is based on the asymptotic normality of the maximum likelihood estimates. The test compares
the estimated coefficient to its estimated standard error, assuming a normal distribution.
Score Test:
The Score test, also known as the Lagrange Multiplier test, is another test used to assess the significance of
parameters in a GLM. It is based on the score function, which measures the first derivative of the log-likelihood
function with respect to the parameters.