Data Mining
Data Preprocessing
3. If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data.
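A minimal sketch of this trade-off (not from the original notes), using scikit-learn with synthetic data and arbitrarily chosen polynomial degrees: a degree that is too low underfits (high bias), a degree that is too high overfits (high variance), and both show up as worse cross-validated error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.3, size=60)

for degree in (1, 4, 15):  # too simple, roughly balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree={degree:2d}  cross-validated MSE={mse:.3f}")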
5. Scikit-Learn provides a transformer called StandardScaler for standardization, and a transformer called MinMaxScaler for normalization.
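A short sketch of both transformers, assuming scikit-learn is available; the small array is made up for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # standardization: (x - mean) / std per column
print(MinMaxScaler().fit_transform(X))    # normalization: (x - min) / (max - min) per column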
6. Accuracy
The base metric used for model evaluation is often accuracy, describing the number of correct predictions over all predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
(TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.)
Precision
Precision is a measure of how many of the positive predictions made are correct (true positives). The formula for it is:
Precision = TP / (TP + FP)
Recall / Sensitivity
Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as sensitivity. The formula for it is:
Recall = TP / (TP + FN)
F1-Score
F1-score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. The harmonic mean is just another way to calculate an "average" of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The formula used for F1-score in this case is:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
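The same four metrics can be computed with scikit-learn; the labels and predictions below are made up for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / all predictions
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))         # 2PR / (P + R)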
7. Log transformation: transforms a skewed distribution into an approximately normal distribution.
Box-Cox Transformation:
It is one of my favorite transformation techniques. All values of lambda from -5 to 5 are considered, and the best value for the data is selected. The "best" value is the one that results in the least skewness of the distribution. The Box-Cox transformation reduces to the log transformation when lambda is zero.
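A rough sketch using SciPy, with a synthetic right-skewed sample; scipy.stats.boxcox searches for the lambda that makes the data closest to normal, and lambda = 0 corresponds to the log transform.

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # strictly positive, right-skewed

x_boxcox, best_lambda = stats.boxcox(x)            # lambda chosen automatically
print("skewness before:", stats.skew(x))
print("skewness after :", stats.skew(x_boxcox), "with lambda =", best_lambda)
print("log transform (the lambda = 0 special case):", stats.skew(np.log(x)))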
1. You basically take the variable that contains missing values as the response 'Y' and the other variables as predictors 'X'. A model is fitted on the rows where 'Y' is observed and then used to predict the missing values of 'Y'. Do this multiple times with random draws of the data and take the mean of the predictions.
The above is a short intuition of how the MICE algorithm roughly works.
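A rough sketch of this idea with scikit-learn's IterativeImputer (an experimental, MICE-inspired API); the array with missing values is made up, and the exact behaviour differs from full multiple imputation.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [np.nan, 8.0, 12.0]])

# Each column with missing values is modelled from the other columns,
# and the model's predictions fill in the gaps over several rounds.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
print(imputer.fit_transform(X))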
12. https://www.turing.com/kb/guide-to-principal-
component-analysis
Basic Algorithms
When multicollinearity occurs, the least-squares estimates are still unbiased, but their variances are large, which results in predicted values being far away from the actual values.
Lambda is the penalty term. The λ given here is denoted by the alpha parameter in the ridge function, so by changing the value of alpha we are controlling the penalty term. The higher the value of alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.
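A small sketch of the alpha/lambda relationship in scikit-learn's Ridge; the regression data and alpha values are arbitrary.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
for alpha in (0.01, 1.0, 100.0):  # larger alpha = larger penalty
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:7.2f}  sum of |coefficients| = {np.abs(coefs).sum():.2f}")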
2. Ridge regression vs. Lasso regression:
- Ridge shrinks the coefficients toward zero, while Lasso encourages some coefficients to be exactly zero.
- Ridge adds a penalty term proportional to the sum of squared coefficients, while Lasso adds a penalty term proportional to the sum of absolute values of the coefficients.
- Ridge is suitable when all features are important, while Lasso is suitable when some features are irrelevant or redundant.
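A sketch of the contrast above: with the same penalty strength, Lasso drives some coefficients exactly to zero while Ridge only shrinks them. The synthetic data (only 3 of 10 features informative) and the alpha value are assumptions for illustration.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("coefficients set exactly to zero by Ridge:", int(np.sum(ridge.coef_ == 0)))
print("coefficients set exactly to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))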
3. Multicollinearity (highly correlated predictor variables) can cause problems when you fit the model and interpret the results.
4.
https://stats.stackexchange.com/questions/88603/why
-is-logistic-regression-a-linear-model
5. https://towardsdatascience.com/decision-trees-
explained-entropy-information-gain-gini-index-ccp-
pruning-4d78070db36c#:~:text=The%20Gini%20index
%20has%20a,and%20maximum%20purity%20is
%200.&text=Now%20that%20we%20have
%20understood,to%20how%20they%20do
%20prediction.
Difference between Gini Index and Entropy
- The Gini index is the probability of misclassifying a randomly chosen element in a set, while entropy measures the amount of uncertainty or randomness in a set.
- The Gini index has a bias toward selecting splits that result in a more balanced distribution of classes, while entropy has a bias toward selecting splits that result in a higher reduction of uncertainty.
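A small sketch computing both impurity measures for a node with class proportions p, using Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)); the example proportions are made up.

import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # skip zero proportions to avoid log(0)
    return -np.sum(p * np.log2(p))

for proportions in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(proportions,
          "gini =", round(gini(proportions), 3),
          "entropy =", round(entropy(proportions), 3))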
KNN does not learn anything during the training period, since it does not build any discriminative function from the training data. In simple words, there is no real training phase for the KNN algorithm: it stores the training dataset and only uses it when the algorithm is asked to make real-time predictions on the test dataset.
As a result, the KNN training step is much faster than that of algorithms which require explicit training, for example Support Vector Machines (SVMs), linear regression, etc. Moreover, since the KNN algorithm does not require any training before making predictions, new data can be added seamlessly without affecting the accuracy of the algorithm. That is why KNN does more computation at test time than at train time.
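A quick sketch of this lazy-learning behaviour with scikit-learn's KNeighborsClassifier on the Iris data; fit() essentially just stores the training set, and the neighbour search happens at prediction time.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)                           # cheap: stores the training data
print("test accuracy:", knn.score(X_test, y_test))  # the real work happens here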
7. https://medium.com/analytics-vidhya/mae-mse-
rmse-coefficient-of-determination-adjusted-r-squared-
which-metric-is-better-cd0326a5697e
8. https://www.tutorialspoint.com/what-are-the-
approaches-to-tree-pruning.
9. SVM in detail:-
https://www.geeksforgeeks.org/support-vector-
machine-algorithm/
We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).
By default, the Support Vector Machine implements a hard-margin SVM. It works well only if our data is linearly separable.
If our data is non-separable/non-linear, the hard-margin SVM will not return any hyperplane, as it will not be able to separate the data. This is where the soft-margin SVM comes to the rescue.
The soft-margin SVM allows some misclassification to happen by relaxing the hard constraints of the Support Vector Machine. It is implemented with the help of the regularization parameter (C).
Kernel Trick:- The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts non-separable problems into separable problems. It is mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex data transformations and then finds the procedure to separate the data based on the labels or outputs defined.
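A minimal sketch of both ideas with scikit-learn's SVC on synthetic concentric circles (a non-linearly-separable problem); the C value and the choice of RBF kernel are assumptions for illustration.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)           # soft margin controlled by C
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # kernel trick
print("linear kernel training accuracy:", linear_svm.score(X, y))  # poor on circles
print("RBF kernel training accuracy   :", rbf_svm.score(X, y))     # near 1.0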
11. We use squared error because it gives greater weight to the values that contribute the maximum error. Moreover, the squared error is differentiable while the absolute error is not, which makes the squared error more compatible with gradient-based optimization techniques.
13. Decision trees are not sensitive to noisy data or outliers, since extreme values or outliers never cause much reduction in the residual sum of squares (RSS), because they are never involved in the split. Decision trees are generally robust to outliers. However, due to their tendency to overfit, they are prone to sampling errors: if the sampled training data is somewhat different from the evaluation or scoring data, decision trees tend not to produce great results.
Ensemble Learning
1. https://www.geeksforgeeks.org/bagging-vs-
boosting-in-machine-learning/
2. https://www.mygreatlearning.com/blog/ensemble-
learning/
Or
https://medium.com/@stevenyu530_73989/stacking-
and-blending-intuitive-explanation-of-advanced-
ensemble-methods-46b295da413c
Stacking vs Blending:- The difference between stacking and blending is that stacking uses out-of-fold predictions for the train set of the next layer (i.e. the meta-model), while blending uses a separate validation set (say, 10-15% of the training set) to train the next layer.
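A minimal sketch of stacking with scikit-learn's StackingClassifier, which trains the final (meta) estimator on out-of-fold predictions produced by internal cross-validation; the base learners, dataset, and fold counts are arbitrary choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)  # folds used to generate the out-of-fold predictions for the meta-model
print("stacked model CV accuracy:", cross_val_score(stack, X, y, cv=3).mean())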
3. https://www.knowledgehut.com/blog/data-
science/bagging-and-random-forest-in-machine-
learning
The random forest algorithm avoids and prevents overfitting by using multiple trees, which gives more accurate and precise results. In a single decision tree there is always scope for overfitting, caused by the presence of variance, so its results are less accurate. That is why random forests are better than decision trees.
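A short sketch of that comparison, using a synthetic dataset and default hyperparameters; the forest's averaging over many trees usually gives a better cross-validated score than a single tree.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("single tree CV accuracy  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())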
Gradient Boosting: this algorithm has high predictive power and is nearly ten times faster in performance than other boosting techniques.
Here the term stagewise means that in AdaBoost one weak learner is trained at a time: whatever errors are present after the first weak learner is trained are passed on to the second weak learner during its training, so that the same errors are avoided in the later stages of training weak learners. Hence it is a stagewise method. Because weak learners are added one stage at a time, this is known as the stagewise addition method.
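A minimal sketch of stagewise boosting with scikit-learn's AdaBoostClassifier (shallow trees added one stage at a time, each re-weighting the previous stage's errors); the dataset and hyperparameters are made up.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)       # weak learners are added stage by stage internally
print("test accuracy:", ada.score(X_test, y_test))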
6. https://www.kdnuggets.com/2022/07/kfold-cross-validation.html#:~:text=K
%2Dfold%20Cross%2DValidation%20is,5%2Dfold%20cross%2Dvalidation.
7. Bagging performs well on low-bias, high-variance models: averaging many models trained on bootstrap samples reduces the variance without increasing the bias.
10. The out-of-bag (OOB) score is computed as the proportion of correctly predicted rows from the out-of-bag samples; it is a way of validating the random forest model without a separate hold-out set. A validation score, by contrast, is calculated using a separate validation dataset that is not used to train the model. The validation score is a more accurate estimate of the model's generalization error than the OOB score, but it requires more data.
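A short sketch of the OOB score in scikit-learn; with oob_score=True each tree is evaluated on the rows left out of its bootstrap sample, and the synthetic dataset here is arbitrary.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("out-of-bag score:", forest.oob_score_)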
Clustering
1. K-Means Clustering:-
https://www.analyticsvidhya.com/blog/2019/08/compr
ehensive-guide-k-means-clustering/#What_Is_K-
Means_Clustering?
DBSCAN:- https://www.geeksforgeeks.org/dbscan-
clustering-in-ml-density-based-clustering/
Hierarchical Clustering:-
https://www.geeksforgeeks.org/hierarchical-clustering-
in-data-mining/
The ‘means’ in the K-means refers to averaging of the data; that is,
finding the centroid.
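A tiny sketch of this with scikit-learn's KMeans on synthetic blobs: after fitting, each cluster centre should match (approximately, once the algorithm has converged) the mean of the points assigned to it.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for k, centre in enumerate(km.cluster_centers_):
    assigned_mean = X[km.labels_ == k].mean(axis=0)  # centroid of cluster k's points
    print(k, centre.round(3), assigned_mean.round(3))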
3. https://www.geeksforgeeks.org/data-mining-cluster-
analysis/
4.
These are just a few examples of the many deterministic clustering algorithms that
are available. The best algorithm for your specific problem will depend on the
characteristics of your data set and your specific requirements.
5. https://www.geeksforgeeks.org/different-types-
clustering-algorithm/
8. https://www.analyticsvidhya.com/blog/2021/01/in-
depth-intuition-of-k-means-clustering-algorithm-in-
machine-learning/
9. These plots each show the pairwise distances between 200 random points. They show how the ratio of the standard deviation to the mean of the distances between examples decreases as the number of dimensions increases. This convergence means k-means becomes less effective at distinguishing between examples in high dimensions (the curse of dimensionality).
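A quick numerical sketch of that effect (not the original plots): for 200 random points, the ratio of the standard deviation to the mean of the pairwise distances shrinks as the dimensionality grows.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(200, dim))
    d = pdist(points)  # all pairwise Euclidean distances
    print(f"dimensions={dim:5d}  std/mean of distances = {d.std() / d.mean():.3f}")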
10.
Data Mining
1. https://www.geeksforgeeks.org/kdd-process-in-
data-mining/
Or
https://www.upgrad.com/blog/kdd-process-data-
mining/#:~:text=KDD%20is%20the%20systematic
%20process,and%20discover%20previously
%20unknown%20patterns.
2.
OLAP:-https://www.tutorialspoint.com/dwh/dwh_olap
.htm
OLTP:- https://www.tutorialspoint.com/on-line-
transaction-processing-oltp-system-in-dbms
OLAP vs OLTP:-
Method used: OLAP makes use of a data warehouse, while OLTP makes use of a standard database management system (DBMS).
Application: OLAP is subject-oriented and is used for data mining, analytics, decision making, etc., while OLTP is application-oriented and is used for business tasks.
Task: OLAP provides a multi-dimensional view of different business tasks, while OLTP reveals a snapshot of present business tasks.
3. Data Warehousing:-
https://www.tutorialspoint.com/dwh/dwh_data_ware
housing.htm
Definition: A data warehouse is a database system designed for analysis instead of transactional work, while data mining is the process of analyzing data patterns.
Process: In a data warehouse, data is stored periodically, while in data mining, data is analyzed regularly.
Functionality: Subject-oriented, integrated, time-varying and non-volatile data constitute a data warehouse, while AI, statistics, databases, and machine learning systems are all used in data mining technologies.
Apriori Algorithm:-
https://www.javatpoint.com/apriori-algorithm
6. https://www.javatpoint.com/olap-operations
Common criteria for data purges include the advanced age of the data or
the type of data in question. When a copy of the purged data is saved in
another storage location, the copy is referred to as an archive.
Strategies for data purging are often based on specific industry and legal
requirements. When carried out automatically through business rules,
purging policies can help an organization run more efficiently and reduce
the total cost of data storage both on-premises and in the cloud.
9. https://www.javatpoint.com/data-warehouse-what-
is-data-cube
10.
https://www.tutorialspoint.com/big_data_analytics/bi
g_data_analytics_lifecycle.htm
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better insights and to identify best practices that will enhance health care services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the volume of patients in each category, and the resulting procedures help ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Billions of dollars are lost to fraud. Traditional methods of fraud detection are time-consuming and complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent; a model is constructed using this data, and the technique is then used to identify whether a new record is fraudulent or not.
Apprehending a criminal is not the hard part; bringing out the truth from them is a very challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on. This also includes text mining, which seeks meaningful patterns in data that is usually unstructured text. The information collected from previous investigations is compared, and a model for lie detection is constructed.