Data Mining
Data Preprocessing
3. If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data.
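A minimal sketch of this trade-off (not from the original notes), using scikit-learn with synthetic data and arbitrarily chosen polynomial degrees: a degree that is too low underfits (high bias), a degree that is too high overfits (high variance), and both show up as worse cross-validated error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.3, size=60)

for degree in (1, 4, 15):  # too simple, roughly balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree={degree:2d}  cross-validated MSE={mse:.3f}")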
5. Scikit-Learn provides a transformer called StandardScaler for standardization, and a transformer called MinMaxScaler for normalization.
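A short sketch of both transformers, assuming scikit-learn is available; the small array is made up for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # standardization: (x - mean) / std per column
print(MinMaxScaler().fit_transform(X))    # normalization: (x - min) / (max - min) per column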
6. Accuracy
The base metric used for model evaluation is often accuracy, describing the number of correct predictions over all predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
(TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.)
Precision
Precision is a measure of how many of the positive predictions made are correct (true positives). The formula for it is:
Precision = TP / (TP + FP)
Recall / Sensitivity
Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as sensitivity. The formula for it is:
Recall = TP / (TP + FN)
F1-Score
F1-score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. The harmonic mean is just another way to calculate an "average" of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for F1-score in this case is:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
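The same four metrics can be computed with scikit-learn; the labels and predictions below are made up for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / all predictions
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))         # 2PR / (P + R)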
7. Log transformation: transforms a skewed distribution into an approximately normal distribution.
Box-Cox Transformation:
It is one of my favorite transformation techniques. All values of lambda from -5 to 5 are considered, and the best value for the data is selected. The "best" value is the one that results in the least skewness of the distribution. The Box-Cox transformation reduces to the log transformation when lambda is zero.
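A rough sketch using SciPy, with a synthetic right-skewed sample; scipy.stats.boxcox searches for the lambda that makes the data closest to normal, and lambda = 0 corresponds to the log transform.

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # strictly positive, right-skewed

x_boxcox, best_lambda = stats.boxcox(x)            # lambda chosen automatically
print("skewness before:", stats.skew(x))
print("skewness after :", stats.skew(x_boxcox), "with lambda =", best_lambda)
print("log transform (the lambda = 0 special case):", stats.skew(np.log(x)))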
1. You basically take the variable that contains missing values as the response 'Y' and the other variables as predictors 'X'. A model is fitted on the rows where 'Y' is observed and then used to predict the missing values of 'Y'. Do this multiple times with random draws of the data and take the mean of the predictions.
The above is a short intuition of how the MICE algorithm roughly works.
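A rough sketch of this idea with scikit-learn's IterativeImputer (an experimental, MICE-inspired API); the array with missing values is made up, and the exact behaviour differs from full multiple imputation.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [np.nan, 8.0, 12.0]])

# Each column with missing values is modelled from the other columns,
# and the model's predictions fill in the gaps over several rounds.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
print(imputer.fit_transform(X))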
12. https://www.turing.com/kb/guide-to-principal-
component-analysis
Basic Algorithms
When multicollinearity occurs, the least-squares estimates are still unbiased, but their variances are large, which results in predicted values being far away from the actual values.
Lambda is the penalty term. The λ given here is denoted by the alpha parameter in the ridge function, so by changing the value of alpha we are controlling the penalty term. The higher the value of alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.
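A small sketch of the alpha/lambda relationship in scikit-learn's Ridge; the regression data and alpha values are arbitrary.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
for alpha in (0.01, 1.0, 100.0):  # larger alpha = larger penalty
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:7.2f}  sum of |coefficients| = {np.abs(coefs).sum():.2f}")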
2. Ridge regression vs. Lasso regression:
- Ridge shrinks the coefficients toward zero, while Lasso encourages some coefficients to be exactly zero.
- Ridge adds a penalty term proportional to the sum of squared coefficients, while Lasso adds a penalty term proportional to the sum of absolute values of the coefficients.
- Ridge is suitable when all features are important, while Lasso is suitable when some features are irrelevant or redundant.
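A sketch of the contrast above: with the same penalty strength, Lasso drives some coefficients exactly to zero while Ridge only shrinks them. The synthetic data (only 3 of 10 features informative) and the alpha value are assumptions for illustration.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("coefficients set exactly to zero by Ridge:", int(np.sum(ridge.coef_ == 0)))
print("coefficients set exactly to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))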
3. Multicollinearity (highly correlated predictor variables) can cause problems when you fit the model and interpret the results.
4.
https://stats.stackexchange.com/questions/88603/why
-is-logistic-regression-a-linear-model
5. https://towardsdatascience.com/decision-trees-
explained-entropy-information-gain-gini-index-ccp-
pruning-4d78070db36c#:~:text=The%20Gini%20index
%20has%20a,and%20maximum%20purity%20is
%200.&text=Now%20that%20we%20have
%20understood,to%20how%20they%20do
%20prediction.
Difference between Gini Index and Entropy
- The Gini index is the probability of misclassifying a randomly chosen element in a set, while entropy measures the amount of uncertainty or randomness in a set.
- The Gini index has a bias toward selecting splits that result in a more balanced distribution of classes, while entropy has a bias toward selecting splits that result in a higher reduction of uncertainty.
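A small sketch computing both impurity measures for a node with class proportions p, using Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)); the example proportions are made up.

import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # skip zero proportions to avoid log(0)
    return -np.sum(p * np.log2(p))

for proportions in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(proportions,
          "gini =", round(gini(proportions), 3),
          "entropy =", round(entropy(proportions), 3))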
KNN does not learn anything during the training period, since it does not build any discriminative function from the training data. In simple words, there is no real training phase for the KNN algorithm: it stores the training dataset and only uses it when the algorithm is asked to make real-time predictions on the test dataset.
As a result, the KNN training step is much faster than that of algorithms which require explicit training, for example Support Vector Machines (SVMs), linear regression, etc. Moreover, since the KNN algorithm does not require any training before making predictions, new data can be added seamlessly without affecting the accuracy of the algorithm. That is why KNN does more computation at test time than at train time.
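A quick sketch of this lazy-learning behaviour with scikit-learn's KNeighborsClassifier on the Iris data; fit() essentially just stores the training set, and the neighbour search happens at prediction time.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                           # cheap: stores the training data
print("test accuracy:", knn.score(X_test, y_test))  # the real work happens here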
7. https://medium.com/analytics-vidhya/mae-mse-
rmse-coefficient-of-determination-adjusted-r-squared-
which-metric-is-better-cd0326a5697e
8. https://www.tutorialspoint.com/what-are-the-
approaches-to-tree-pruning.
9. SVM in detail:-
https://www.geeksforgeeks.org/support-vector-
machine-algorithm/
We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).
By default, the Support Vector Machine implements a hard-margin SVM. It works well only if our data is linearly separable.
If our data is non-separable/non-linear, the hard-margin SVM will not return any hyperplane, as it will not be able to separate the data. This is where the soft-margin SVM comes to the rescue.
The soft-margin SVM allows some misclassification to happen by relaxing the hard constraints of the Support Vector Machine. It is implemented with the help of the regularization parameter (C).
Kernel Trick:- The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts non-separable problems into separable problems. It is mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex data transformations and then finds the procedure to separate the data based on the labels or outputs defined.
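A minimal sketch of both ideas with scikit-learn's SVC on synthetic concentric circles (a non-linearly-separable problem); the C value and the choice of RBF kernel are assumptions for illustration.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)           # soft margin controlled by C
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # kernel trick
print("linear kernel training accuracy:", linear_svm.score(X, y))  # poor on circles
print("RBF kernel training accuracy   :", rbf_svm.score(X, y))     # near 1.0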
11. We use squared error because it gives greater weight to the values that contribute the maximum error. Moreover, the squared error is differentiable while the absolute error is not, which makes the squared error more compatible with gradient-based optimization techniques.
13. Decision trees are not sensitive to noisy data or outliers, since extreme values or outliers never cause much reduction in the residual sum of squares (RSS), because they are never involved in the split. Decision trees are generally robust to outliers. However, due to their tendency to overfit, they are prone to sampling errors: if the sampled training data is somewhat different from the evaluation or scoring data, decision trees tend not to produce great results.
Ensemble Learning
1. https://www.geeksforgeeks.org/bagging-vs-
boosting-in-machine-learning/
2. https://www.mygreatlearning.com/blog/ensemble-
learning/
Or
https://medium.com/@stevenyu530_73989/stacking-
and-blending-intuitive-explanation-of-advanced-
ensemble-methods-46b295da413c
Stacking vs Blending:- The difference between stacking and blending is that stacking uses out-of-fold predictions for the train set of the next layer (i.e. the meta-model), while blending uses a separate validation set (say, 10-15% of the training set) to train the next layer.
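A minimal sketch of stacking with scikit-learn's StackingClassifier, which trains the final (meta) estimator on out-of-fold predictions produced by internal cross-validation; the base learners, dataset, and fold counts are arbitrary choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)  # folds used to generate the out-of-fold predictions for the meta-model
print("stacked model CV accuracy:", cross_val_score(stack, X, y, cv=3).mean())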
3. https://www.knowledgehut.com/blog/data-
science/bagging-and-random-forest-in-machine-
learning
The random forest algorithm avoids and prevents overfitting by using multiple trees, which gives more accurate and precise results. In a single decision tree there is always scope for overfitting, caused by the presence of variance, so its results are less accurate. That is why random forests are better than decision trees.
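A short sketch of that comparison, using a synthetic dataset and default hyperparameters; the forest's averaging over many trees usually gives a better cross-validated score than a single tree.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("single tree CV accuracy  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())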
Gradient Boosting: this algorithm has high predictive power and is nearly ten times faster in performance than other boosting techniques.
Here the term stagewise means that in AdaBoost one weak learner is trained at a time: whatever errors are present after the first weak learner is trained are passed on to the second weak learner during its training, so that the same errors are avoided in the later stages of training weak learners. Hence it is a stagewise method. Because weak learners are added one stage at a time, this is known as the stagewise addition method.
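A minimal sketch of stagewise boosting with scikit-learn's AdaBoostClassifier (shallow trees added one stage at a time, each re-weighting the previous stage's errors); the dataset and hyperparameters are made up.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)       # weak learners are added stage by stage internally
print("test accuracy:", ada.score(X_test, y_test))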
6. https://www.kdnuggets.com/2022/07/kfold-cross-validation.html#:~:text=K
%2Dfold%20Cross%2DValidation%20is,5%2Dfold%20cross%2Dvalidation.
7. Bagging performs well on low-bias, high-variance models: averaging many models trained on bootstrap samples reduces the variance without increasing the bias.
10. The out-of-bag (OOB) score is computed as the proportion of correctly predicted rows from the out-of-bag samples; it is a way of validating the random forest model without a separate hold-out set. A validation score, by contrast, is calculated using a separate validation dataset that is not used to train the model. The validation score is a more accurate estimate of the model's generalization error than the OOB score, but it requires more data.
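A short sketch of the OOB score in scikit-learn; with oob_score=True each tree is evaluated on the rows left out of its bootstrap sample, and the synthetic dataset here is arbitrary.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("out-of-bag score:", forest.oob_score_)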
Clustering
1. K-Means Clustering:-
https://www.analyticsvidhya.com/blog/2019/08/compr
ehensive-guide-k-means-clustering/#What_Is_K-
Means_Clustering?
DBSCAN:- https://www.geeksforgeeks.org/dbscan-
clustering-in-ml-density-based-clustering/
Hierarchical Clustering:-
https://www.geeksforgeeks.org/hierarchical-clustering-
in-data-mining/
The ‘means’ in the K-means refers to averaging of the data; that is,
finding the centroid.
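A tiny sketch of this with scikit-learn's KMeans on synthetic blobs: after fitting, each cluster centre should match (approximately, once the algorithm has converged) the mean of the points assigned to it.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for k, centre in enumerate(km.cluster_centers_):
    assigned_mean = X[km.labels_ == k].mean(axis=0)  # centroid of cluster k's points
    print(k, centre.round(3), assigned_mean.round(3))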
3. https://www.geeksforgeeks.org/data-mining-cluster-
analysis/
4.
These are just a few examples of the many deterministic clustering algorithms that
are available. The best algorithm for your specific problem will depend on the
characteristics of your data set and your specific requirements.
5. https://www.geeksforgeeks.org/different-types-
clustering-algorithm/
8. https://www.analyticsvidhya.com/blog/2021/01/in-
depth-intuition-of-k-means-clustering-algorithm-in-
machine-learning/
9. These plots each show the pairwise distances between 200 random points. They show how the ratio of the standard deviation to the mean of the distances between examples decreases as the number of dimensions increases. This convergence means k-means becomes less effective at distinguishing between examples in high dimensions (the curse of dimensionality).
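A quick numerical sketch of that effect (not the original plots): for 200 random points, the ratio of the standard deviation to the mean of the pairwise distances shrinks as the dimensionality grows.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(200, dim))
    d = pdist(points)  # all pairwise Euclidean distances
    print(f"dimensions={dim:5d}  std/mean of distances = {d.std() / d.mean():.3f}")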
10.
Data Mining
1. https://www.geeksforgeeks.org/kdd-process-in-
data-mining/
Or
https://www.upgrad.com/blog/kdd-process-data-
mining/#:~:text=KDD%20is%20the%20systematic
%20process,and%20discover%20previously
%20unknown%20patterns.
2.
OLAP:-https://www.tutorialspoint.com/dwh/dwh_olap
.htm
OLTP:- https://www.tutorialspoint.com/on-line-
transaction-processing-oltp-system-in-dbms
OLAP vs OLTP:-
Method used: OLAP makes use of a data warehouse, while OLTP makes use of a standard database management system (DBMS).
Application: OLAP is subject-oriented and is used for data mining, analytics, decision making, etc., while OLTP is application-oriented and is used for business tasks.
Task: OLAP provides a multi-dimensional view of different business tasks, while OLTP reveals a snapshot of present business tasks.
3. Data Warehousing:-
https://www.tutorialspoint.com/dwh/dwh_data_ware
housing.htm
Definition: A data warehouse is a database system designed for analysis instead of transactional work, while data mining is the process of analyzing data patterns.
Process: In a data warehouse, data is stored periodically, while in data mining, data is analyzed regularly.
Functionality: Subject-oriented, integrated, time-varying and non-volatile data constitute a data warehouse, while AI, statistics, databases, and machine learning systems are all used in data mining technologies.
Apriori Algorithm:-
https://www.javatpoint.com/apriori-algorithm
6. https://www.javatpoint.com/olap-operations
Common criteria for data purges include the advanced age of the data or
the type of data in question. When a copy of the purged data is saved in
another storage location, the copy is referred to as an archive.
Strategies for data purging are often based on specific industry and legal
requirements. When carried out automatically through business rules,
purging policies can help an organization run more efficiently and reduce
the total cost of data storage both on-premises and in the cloud.
9. https://www.javatpoint.com/data-warehouse-what-
is-data-cube
10.
https://www.tutorialspoint.com/big_data_analytics/bi
g_data_analytics_lifecycle.htm
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better insights and to identify best practices that will enhance health care services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the volume of patients in each category, and the resulting procedures help ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Billions of dollars are lost to fraud. Traditional methods of fraud detection are time-consuming and complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent; a model is constructed using this data, and the technique is then used to identify whether a new record is fraudulent or not.
Apprehending a criminal is not the hard part; bringing out the truth from them is a very challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on. This also includes text mining, which seeks meaningful patterns in data that is usually unstructured text. The information collected from previous investigations is compared, and a model for lie detection is constructed.