
Introduction to Data Mining Tasks

Data mining tasks can generally be classified into two types based on what a specific task tries to achieve: descriptive tasks and predictive tasks. Descriptive data mining tasks characterize the general properties of the data, whereas predictive data mining tasks perform inference on the available data set in order to predict how new data will behave.

KDD Process in Data Mining




In the context of computer science, “Data Mining” can be referred to as
knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging. Data mining, also known as
Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of
implicit, previously unknown, and potentially useful information from data
stored in databases.
Data mining is needed to extract useful information from large datasets
and use it for prediction or better decision-making. Nowadays, data
mining is used in almost every domain where a large amount of data is stored
and processed.
For example: the banking sector, market basket analysis, and network intrusion
detection.

Data Mining - Issues


Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −

 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without data cleaning, the accuracy of the discovered patterns will be poor.
 Pattern evaluation − Many of the patterns discovered may be uninteresting because they represent common knowledge or lack novelty; pattern evaluation is therefore needed to identify the truly interesting patterns.

Performance Issues
There can be performance-related issues, as follows −

 Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mined knowledge as the database changes, without mining the entire data again from scratch.

Diverse Data Types Issues


 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.
Data Mining Metrics

Data mining is a form of artificial intelligence that uses perception models,
analytical models, and multiple algorithms to simulate the techniques of the human
brain. Data mining helps machines make human-like decisions and choices.

The user of data mining tools has to supply the machine with rules, preferences, and
even experiences to obtain decision support. The main data mining metrics are as follows −

Usefulness − Usefulness involves several metrics that tell us whether the model
provides useful information. For instance, a data mining model that correlates store location
with sales can be both accurate and reliable, yet still not be useful, because that result
cannot be generalized by adding more stores at the same location.

Furthermore, it does not answer the fundamental business question of why specific
locations have more sales. A model that appears successful may also turn out to be
meaningless because it depends on cross-correlations in the data.

Return on Investment (ROI) − Data mining tools find interesting patterns buried
inside the data and develop predictive models. These models have several measures
for denoting how well they fit the data. It is not always clear, however, how to make a
decision based on some of the measures reported as part of a data mining analysis.

Access Financial Information during Data Mining − The simplest way to frame
decisions in financial terms is to augment the raw information that is generally mined so that it
also contains financial data. Many organizations are investing in and developing data
warehouses and data marts.

The design of a warehouse or mart includes considerations about the types of analyses
and the data needed for expected queries. Designing warehouses in a way that allows
access to financial information, along with access to more typical data on product
attributes, user profiles, etc., can be useful.

Converting Data Mining Metrics into Financial Terms − A common data mining
metric is the measure of "Lift". Lift measures what is achieved by using the specific
model or pattern relative to a base rate in which the model is not used. High values
mean that much is achieved. It can then seem that one can simply make a decision based
on lift.

Accuracy − Accuracy is a measure of how well the model correlates an outcome with the
attributes in the data that has been provided. There are several measures of accuracy,
but all of them depend on the data that is used. In reality, values can be missing or
approximate, or the data can have been changed by several processes.

Because data mining is a process of exploration and development, one may decide to accept a certain
amount of error in the data, especially if the data is fairly uniform in its characteristics.
For example, a model that predicts sales for a specific store based on past sales can be
strongly correlated and very accurate, even if that store consistently used the wrong
accounting techniques. Measurements of accuracy should therefore be balanced by
assessments of reliability.

Social Implications of Data Mining

Data mining is the process of finding useful new correlations, patterns, and trends by
sifting through large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques. It is the
analysis of observational datasets to discover unsuspected relationships and to summarize the
records in novel ways that are both understandable and useful to the data owner.

Data mining systems facilitate the identification and classification of
individuals into distinct groups or segments. From the perspective of the commercial firm,
and possibly of the industry as a whole, the use of data mining can be interpreted as a
discriminatory technology in the rational pursuit of profit.

There are various social implications of data mining which are as follows −

Privacy − Privacy is a loaded issue. In recent years privacy concerns have taken on a more
important role in American society as merchants, insurance companies, and government
agencies amass warehouses of personal records.
The concerns that people have over the collection of this data will naturally extend to the
analytic capabilities applied to that data. Users of data mining should start thinking about
how their use of this technology will be affected by legal issues related to
privacy.

Profiling − Data mining and profiling is a developing field that attempts to organize,
understand, analyze, reason about, and use the explosion of data in this information age. The
process involves using algorithms and experience to extract patterns or anomalies that
are very complex, difficult, or time-consuming to identify.

The founder of Microsoft's Exploration Team used complex data mining algorithms to
solve a problem that had haunted astronomers for years: reviewing,
describing, and categorizing 2 billion sky objects recorded over 3 decades. The
algorithms were able to extract the features that characterized sky objects as stars or
galaxies. This developing field of data mining and profiling has many frontiers where it
can be applied.

Unauthorized Use − Trends obtained through data mining, intended to be used for
marketing or other ethical purposes, can be misused. Unethical businesses or
people may use the data obtained through data mining to take advantage of vulnerable
people or to discriminate against a specific group of people. Furthermore, data mining
techniques are not 100 percent accurate; mistakes do occur, and they can have serious
consequences.

Data Mining Techniques


1. Association

Association analysis is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. Association analysis is widely used for market basket or
transaction data analysis. Association rule mining is a significant and exceptionally active area of data
mining research. One method of association-based classification, called associative classification, consists
of two steps. In the first step, association rules are generated using a modified version of the
standard association rule mining algorithm known as Apriori. The second step constructs a classifier based
on the association rules discovered.

2. Classification

Classification is the process of finding a set of models (or functions) that describe and distinguish data
classes or concepts, for the purpose of being able to use the model to predict the class of objects whose
class label is unknown. The derived model is based on the analysis of a set of training data
(i.e. data objects whose class label is known). The derived model may be represented in
various forms, such as classification (if-then) rules, decision trees, and neural networks. Data mining has
different types of classifiers (a minimal usage sketch follows the list below):

 Decision Tree
 SVM (Support Vector Machine)
 Generalized Linear Models
 Bayesian Classification
 Classification by Backpropagation
 K-NN Classifier
 Rule-Based Classification
 Frequent-Pattern Based Classification
 Rough Set Theory
 Fuzzy Logic
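As a hedged usage sketch (assuming scikit-learn is available; the toy feature matrix, labels, and parameter values are invented for illustration), two of the classifiers listed above can be trained and then used to predict the class of an unseen object:

# Minimal sketch: training two of the listed classifiers with scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two numeric features, binary class labels (invented).
X = [[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Predict the class of objects whose class label is unknown
print(tree.predict([[4.0, 4.2]]))  # expected: [1]
print(knn.predict([[1.1, 2.0]]))   # expected: [0]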

3. Prediction

Data prediction is a two-step process, similar to that of data classification. However, for prediction, we do
not use the term "class label attribute" because the attribute for which values are being predicted
is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can
simply be referred to as the predicted attribute. Prediction can be viewed as the construction and use of a
model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a
given object is likely to have.

4. Clustering

Unlike classification and prediction, which analyze class-labeled data objects or attributes, clustering
analyzes data objects without consulting a known class label. In general, the class labels do not exist in
the training data simply because they are not known to begin with; clustering can be used to generate such
labels. The objects are clustered based on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity. That is, clusters of objects are formed so that objects within a cluster
have high similarity to one another but are very dissimilar to objects in other clusters. Each cluster that
is generated can be seen as a class of objects, from which rules can be inferred. Clustering can also
facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that
group similar events together.
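A minimal clustering sketch (assuming scikit-learn's KMeans is available; the points and the choice of two clusters are invented for illustration):

from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups (invented for illustration).
X = [[1, 1], [1.5, 2], [1, 0.5],
     [8, 8], [8.5, 9], [9, 8]]

# Cluster into k = 2 groups; the labels can then be treated as discovered classes.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment per point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # centroid of each discovered cluster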

5. Regression

Regression can be defined as a statistical modeling method in which previously obtained data is used to
predict a continuous quantity for new observations. This technique is also known as the Continuous
Value Classifier. There are two types of regression models: linear regression and multiple linear regression.

6. Artificial Neural network (ANN) Classifier Method

An artificial neural network (ANN), also referred to simply as a "neural network" (NN), is a computational
model based on biological neural networks. It consists of an interconnected collection of artificial
neurons. A neural network is a set of connected input/output units where each connection has a weight
associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input samples. Neural network learning is also called connectionist
learning due to the connections between units. Neural networks involve long training times and are
therefore more appropriate for applications where this is feasible. They require a number of parameters that
are typically best determined empirically, such as the network topology or "structure". Neural networks
have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic
meaning behind the learned weights. These features initially made neural networks less desirable for data
mining.

The advantages of neural networks, however, include their high tolerance to noisy data as well as their
ability to classify patterns on which they have not been trained. In addition, several algorithms have recently
been developed for the extraction of rules from trained neural networks. These factors contribute to the
usefulness of neural networks for classification in data mining.

An artificial neural network is an adaptive system that changes its structure based on the information that
flows through the network during a learning phase. The ANN relies on the principle of learning
by example. There are two classical types of neural networks: the perceptron and the multilayer perceptron.
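As a hedged sketch of the classical perceptron mentioned above (connectionist learning by adjusting connection weights), the loop below learns a toy AND function; the data, learning rate, and epoch count are assumptions chosen for illustration:

import numpy as np

# Toy training set: logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1

# Perceptron learning rule: adjust the weights whenever a sample is misclassified.
for epoch in range(50):
    for xi, target in zip(X, y):
        prediction = 1 if xi @ weights + bias > 0 else 0
        error = target - prediction
        weights += learning_rate * error * xi
        bias += learning_rate * error

print(weights, bias)  # learned connection weights and bias
print([1 if x @ weights + bias > 0 else 0 for x in X])  # typically [0, 0, 0, 1] after convergence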

7. Outlier Detection

A database may contain data objects that do not comply with the general behavior or model of the data.
These data objects are outliers, and the investigation of outlier data is known as outlier mining. An
outlier may be detected using statistical tests that assume a distribution or probability model for the data,
or using distance measures where objects having only a small fraction of "close" neighbors in space are
considered outliers. Rather than using statistical or distance measures, deviation-based techniques
identify outliers by inspecting differences in the main characteristics of objects in a group.
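A minimal statistical sketch of outlier detection (the sample values and the two-standard-deviation threshold are assumptions for illustration, not a rule from these notes):

import numpy as np

# Toy univariate sample with one obvious outlier (invented values).
data = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 25.0])

# Flag points whose z-score is more than 2 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]
print(outliers)  # -> [25.]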

8. Genetic Algorithm

Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary
algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an
intelligent exploitation of random search, provided with historical data, to direct the search into the region of
better performance in the solution space. They are commonly used to generate high-quality solutions for
optimization and search problems. Genetic algorithms simulate the process of natural selection,
in which species that can adapt to changes in their environment are able to survive,
reproduce, and pass on to the next generation. In simple words, they simulate "survival of the fittest" among
individuals of consecutive generations to solve a problem. Each generation consists of a population of
individuals, and each individual represents a point in the search space and a possible solution. Each individual is
represented as a string of characters/integers/floats/bits; this string is analogous to a chromosome.

Advantages of Data Mining:

 Better Decision Making
 Improved Marketing
 Increased Efficiency
 Fraud Detection
 Customer Retention
 Competitive Advantage
 Improved Healthcare

Disadvantages of Data Mining:

While data mining offers many benefits, there are also some disadvantages and challenges associated with
the process. The following are some of the main disadvantages of data mining:

 Data Quality
 Data Privacy and Security
 Ethical Considerations
 Technical Complexity
 Cost

Statistical Methods in Data Mining

Data mining refers to extracting or mining knowledge from large amounts of data. In
other words, data mining is the science, art, and technology of exploring large and
complex bodies of data in order to discover useful patterns. Theoreticians and
practitioners are continually seeking improved techniques to make the process more
efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in
data mining:
 Statistical Analysis: In statistics, data is collected, analyzed, explored, and
presented to identify patterns and trends. Alternatively, it is referred to as
quantitative analysis.
 Non-statistical Analysis: This analysis provides generalized information and
includes sound, still images, and moving images.
In statistics, there are two main categories:
 Descriptive Statistics: The purpose of descriptive statistics is to organize data
and identify the main characteristics of that data. Graphs or numbers are used to summarize
the data. The mean (average), mode, standard deviation (SD), and correlation are some of
the commonly used descriptive statistical measures.
 Inferential Statistics: The process of drawing conclusions from data based on
probability theory and generalizing from a sample to a population. By analyzing sample statistics, you
can infer population parameters and build models of relationships within the
data.
There are various statistical terms that one should be aware of while dealing with
statistics. Some of these are:
 Population
 Sample
 Variable
 Quantitative Variable
 Qualitative Variable
 Discrete Variable
 Continuous Variable
Now, let’s start discussing statistical methods. Statistical analysis is the analysis of raw data using
mathematical formulas, models, and techniques. Through the use of statistical
methods, information is extracted from research data, and different ways are
available to judge the robustness of research outputs.
In fact, the statistical methods used in the data mining field today are typically
drawn from the vast statistical toolkit developed to answer problems arising in
other fields, and they are taught in standard science curricula. It is necessary to
check and test several hypotheses; these hypotheses help us assess
the validity of our data mining endeavor when attempting to draw inferences from
the data under study. When using more complex and sophisticated statistical
estimators and tests, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations,
a variety of statistical methods are available in data mining, some of which are:
 Logistic regression analysis
 Correlation analysis
 Regression analysis
 Discriminant analysis
 Linear discriminant analysis (LDA)
 Classification
 Clustering
 Outlier detection
 Classification and regression trees
 Correspondence analysis
 Nonparametric regression
 Statistical pattern recognition
 Categorical data analysis
 Time-series methods for trends and periodicity
 Artificial neural networks
Now, let’s try to understand some of the important statistical methods used
in data mining:
 Linear Regression: The linear regression method uses the best linear
relationship between the independent and dependent variables to predict the
target variable. To achieve the best fit, the distances between the fitted line and
the actual observations at each point must be as small as possible; a good fit is
one for which no other choice of line would produce a smaller total error. Simple
linear regression and multiple linear regression are the two major types of linear
regression. Simple linear regression predicts the dependent variable by fitting a
linear relationship to a single independent variable, while multiple linear regression
fits the best linear relationship with the dependent variable using multiple
independent variables (a minimal sketch appears at the end of this section).
 Classification: This is a method of data mining in which a collection of data is
categorized so that it can be analyzed and predicted with greater accuracy.
Classifying very large datasets is an effective way to analyze them, and classification
is one of several methods aimed at improving the efficiency of the analysis
process. Logistic regression and discriminant analysis stand out as two
major classification techniques.
o Logistic Regression: Logistic regression is also applied in machine learning
and predictive analytics. In this approach, the dependent
variable is either binary (binary logistic regression) or multinomial (multinomial
logistic regression), i.e. one of two, or one of several, possible
categories. With a logistic regression equation, one can estimate
probabilities describing the relationship between the independent variable
and the dependent variable.
o Discriminant Analysis: Discriminant analysis is a statistical method of
analyzing data based on the measurements of categories or clusters and
categorizing new observations into one or more populations that were
identified a priori. Discriminant analysis models each response class
separately and then uses Bayes' theorem to flip these projections around
to estimate the likelihood of each response category given the value of X.
These models can be either linear or quadratic.
o Linear Discriminant Analysis (LDA): In linear discriminant analysis, each
observation is assigned a discriminant score that classifies it into a
response variable class. These scores are obtained by combining the
independent variables in a linear fashion. The model assumes that
observations are drawn from a Gaussian distribution and that the
predictor variables share a common covariance matrix across all k levels
of the response variable Y.
o Quadratic Discriminant Analysis (QDA): Quadratic discriminant analysis
provides an alternative approach. LDA and QDA both assume Gaussian
distributions for the observations of the Y classes, but unlike LDA, QDA
allows each class to have its own covariance matrix. As a result, the
predictor variables may have different variances across the k levels of Y.
o Correlation Analysis: In statistical terms, correlation analysis captures
the relationship between a pair of variables. The values of such variables
are usually stored in the columns or rows of a database table and represent
properties of an object.
o Regression Analysis: Based on a set of numeric data, regression is a
data mining method that predicts a range of numerical (continuous) values.
For instance, you could use regression to predict the cost of goods and
services based on other variables. Regression models are used across
numerous industries for forecasting financial data, modeling environmental
conditions, and analyzing trends.
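As a hedged sketch of two of the methods above, simple linear regression and logistic regression (the toy data and the use of NumPy/scikit-learn are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# --- Simple linear regression: fit y = a*x + b by least squares (toy data) ---
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
a, b = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
print(a, b)                      # roughly a ~ 2, b ~ 0
print(a * 6 + b)                 # predict the target value for x = 6

# --- Logistic regression: binary dependent variable (toy data) ---
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba([[2.0]]))  # estimated probabilities for each class
print(clf.predict([[3.8]]))        # -> [1]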

Similarity Measures in Data Science


Similarity measures are fundamental tools in data science, enabling us to quantify how
alike two data points are. These measures are pivotal in various applications such as
clustering, classification, and information retrieval. In this section, we will explore some of the most
commonly used similarity measures, their descriptions, and typical applications.

1. Euclidean Distance

Description: Euclidean distance is the straight-line distance between two points in a multi-dimensional
space. It is intuitive and widely used in many applications, especially when the features are
continuous and the scale is consistent across dimensions.

Applications: It is commonly used in clustering algorithms such as k-means, and in nearest
neighbor searches.

2. Cosine Similarity

Description: Cosine similarity measures the cosine of the angle between two vectors. It is particularly
useful in high-dimensional spaces, such as text mining, where it measures the
orientation rather than the magnitude, making it scale-invariant.

Applications: Widely used in text mining and information retrieval, including document similarity in
search engines.

3. Jaccard Similarity

Description: Jaccard similarity measures the similarity between two finite sets by dividing the
size of their intersection by the size of their union. It is useful for comparing categorical
data.

Applications: Commonly used in clustering and classification tasks involving categorical data,
such as market basket analysis.

4. Pearson Correlation Coefficient


Description: Pearson correlation measures the linear correlation between two variables, providing a value
between -1 and 1. It assesses how well a change in one variable predicts a change in another.

Applications: Used in statistical analysis and machine learning to discover and quantify linear
relationships between features.

5. Hamming Distance

Description: Hamming distance measures the number of positions at which the corresponding elements of
two strings are different. It is especially useful for binary or categorical data.

Applications: Used in error detection and correction algorithms, as well as in comparing binary
sequences or categorical variables.
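As a hedged reference sketch of the five measures above in plain Python/NumPy (the example vectors, sets, and strings are invented):

import numpy as np

def euclidean(a, b):
    """Straight-line distance between two points."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (scale-invariant)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(s1, s2):
    """Size of the intersection divided by the size of the union of two sets."""
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2)

def pearson(a, b):
    """Linear correlation coefficient between two variables (-1 to 1)."""
    return float(np.corrcoef(a, b)[0, 1])

def hamming(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(euclidean([0, 0], [3, 4]))                      # 5.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))        # 1.0 (same orientation)
print(jaccard({"milk", "bread"}, {"bread", "eggs"}))  # 0.333...
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))            # 1.0
print(hamming("10110", "10011"))                      # 2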

Applications of Similarity Measures


Similarity measures are pivotal in numerous data science applications, enabling
algorithms to group, classify, and retrieve records based on how alike the data points are. This
capability is essential in fields ranging from text mining to image recognition. Here, we
explore some key applications of similarity measures.

1. Clustering

Clustering involves grouping a set of objects such that objects in the same group (or cluster) are
more similar to each other than to those in other groups. Similarity measures play an essential
role in defining these groups.

 K-Means Clustering: Uses Euclidean distance to partition data into k clusters. Each data point is
assigned to the cluster with the nearest centroid.
 Hierarchical Clustering: Uses various distance metrics (e.g., Euclidean, Manhattan) to construct a hierarchy
of clusters, often visualized as a dendrogram.

 Text Clustering: Uses cosine similarity to group documents with similar content. This is
especially useful for organizing large text corpora.

2. Classification
Classification assigns a label to a new data point based on the characteristics of known
labeled data points. Similarity measures help determine the label by comparing the new point to
existing points.

 K-Nearest Neighbors (k-NN): Classifies a data point based on the majority label among its k
nearest neighbors, often using Euclidean distance or cosine similarity.
 Document Classification: Uses similarity measures like cosine similarity to categorize text documents into
predefined classes.

3. Information Retrieval

Information retrieval systems, such as search engines, rely on similarity measures to rank
documents based on their relevance to a query.

 Search Engines: Use cosine similarity to compare the query vector with document vectors, ranking
documents by their similarity to the query.
 Content-Based Filtering: In recommendation systems, similarity measures (e.g., cosine similarity, Jaccard similarity)
are used to recommend items that are similar to those a user has previously liked.

4. Recommendation Systems

Recommendation systems suggest items to users based on their preferences and behavior, often
using similarity measures to find items or users that are alike.

 Collaborative Filtering: Uses similarity measures like Pearson correlation or cosine similarity to find
users with similar preferences and recommend items they have liked.
 Content-Based Filtering: Recommends items similar to those the user has shown interest in, using
measures like cosine similarity to compare item features.

5. Anomaly Detection

Anomaly detection identifies outliers or unusual data points that differ substantially from the bulk
of the data.

 Mahalanobis Distance: Takes the correlations of the dataset into account to detect multivariate outliers.
 Euclidean Distance: Can be used in simpler contexts to find data points that are far from the
mean or median of the dataset.

6. Natural Language Processing (NLP)

In NLP, similarity measures are used to compare text data, assisting in tasks such as document
clustering, plagiarism detection, and sentiment analysis.

 Word Embeddings: Use cosine similarity to compare word vectors in models like Word2Vec or GloVe,
enabling the identification of semantically similar words.
 Document Similarity: Measures like cosine similarity help in clustering documents or detecting plagiarism by
comparing text content.
7. Image Processing

Image processing involves analyzing and manipulating images, where similarity measures are used to
compare image features.

 Image Retrieval: Uses measures like Euclidean distance on feature vectors (e.g., color histograms,
edge descriptors) to find similar images.
 Face Recognition: Employs measures like cosine similarity on feature vectors extracted from deep learning
models to identify or verify individuals.

8. Bioinformatics

In bioinformatics, similarity measures help compare biological data, such as genetic sequences or
protein structures.

 Sequence Alignment: Uses Hamming distance to compare DNA, RNA, or protein sequences, identifying
similarities and differences that may indicate evolutionary relationships.
 Protein Structure Comparison: Employs measures like RMSD (Root Mean Square Deviation) to compare 3-D
structures of proteins, aiding in the study of their functions and interactions.

Decision Tree

What is a Decision Tree?


A decision tree is a flowchart-like structure used to make decisions or predictions. It consists of nodes
representing decisions or tests on attributes, branches representing the outcome of these decisions, and leaf
nodes representing final outcomes or predictions. Each internal node corresponds to a test on an attribute,
each branch corresponds to the result of the test, and each leaf node corresponds to a class label or a
continuous value.

Structure of a Decision Tree


1. Root Node: Represents the entire dataset and the initial decision to be made.

2. Internal Nodes: Represent decisions or tests on attributes. Each internal node has one or more branches.

3. Branches: Represent the outcome of a decision or test, leading to another node.

4. Leaf Nodes: Represent the final decision or prediction. No further splits occur at these nodes.

How Decision Trees Work?


The process of creating a decision tree involves:

1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain, the best
attribute to split the data is selected.

2. Splitting the Dataset: The dataset is split into subsets based on the selected attribute.

3. Repeating the Process: The process is repeated recursively for each subset, creating a new internal node or
leaf node until a stopping criterion is met (e.g., all instances in a node belong to the same class or a
predefined depth is reached).

Metrics for Splitting


 Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it was randomly
classified according to the distribution of classes in the dataset.
o $Gini = 1 - \sum_{i=1}^{n} (p_i)^2$, where $p_i$ is the probability of an instance being classified into
a particular class.

 Entropy: Measures the amount of uncertainty or impurity in the dataset.

o $Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)$, where $p_i$ is the probability of an instance being
classified into a particular class.

 Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is split on an
attribute.

o $InformationGain = Entropy_{parent} - \sum_{i=1}^{n} \frac{|D_i|}{|D|} \cdot Entropy(D_i)$, where $D_i$ is the i-th subset of $D$ after splitting on an attribute.
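A minimal sketch of these splitting metrics in Python (the class-label lists are invented toy data):

from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropy of the split subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["yes", "yes", "yes", "no", "no"]      # toy class labels at a node
split = [["yes", "yes", "yes"], ["no", "no"]]   # a perfect split on some attribute
print(round(gini(parent), 3))                   # 0.48
print(round(entropy(parent), 3))                # 0.971
print(round(information_gain(parent, split), 3))  # 0.971 (all uncertainty removed)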

Advantages of Decision Trees


 Simplicity and Interpretability: Decision trees are easy to understand and interpret. The visual
representation closely mirrors human decision-making processes.

 Versatility: Can be used for both classification and regression tasks.

 No Need for Feature Scaling: Decision trees do not require normalization or scaling of the data.

 Handles Non-linear Relationships: Capable of capturing non-linear relationships between features and
target variables.

Disadvantages of Decision Trees


 Overfitting: Decision trees can easily overfit the training data, especially if they are deep with many nodes.

 Instability: Small variations in the data can result in a completely different tree being generated.

 Bias towards Features with More Levels: Features with more levels can dominate the tree structure.

Pruning
To overcome overfitting, pruning techniques are used. Pruning reduces the size of the tree by removing
nodes that provide little power in classifying instances. There are two main types of pruning:

 Pre-pruning (Early Stopping): Stops the tree from growing once it meets certain criteria (e.g., maximum
depth, minimum number of samples per leaf).

 Post-pruning: Removes branches from a fully grown tree that do not provide significant power.
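As a hedged illustration (assuming scikit-learn's DecisionTreeClassifier; the toy data and parameter values are invented), pre-pruning can be expressed through constructor limits and post-pruning through cost-complexity pruning:

from sklearn.tree import DecisionTreeClassifier

# Toy training data (invented): two features, binary labels.
X = [[1, 0], [2, 1], [3, 1], [6, 0], [7, 1], [8, 0]]
y = [0, 0, 0, 1, 1, 1]

# Pre-pruning: stop growth early via depth and leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2).fit(X, y)

# Post-pruning: grow the tree, then prune weak branches via cost-complexity (ccp_alpha).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())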

Applications of Decision Trees


 Business Decision Making: Used in strategic planning and resource allocation.

 Healthcare: Assists in diagnosing diseases and suggesting treatment plans.

 Finance: Helps in credit scoring and risk assessment.

 Marketing: Used to segment customers and predict customer behavior.

Neural Network in Data Mining

What is Neural Networks?


A neural network is a computational model inspired by the structure and functioning of the human brain. It
consists of interconnected nodes, typically called neurons or artificial neurons. These neurons are organized
into layers: an input layer, one or more hidden layers, and an output layer. The connections
between neurons, called weights, determine the network's ability to learn from data.

1. Role of Neural Networks:

Neural networks are powerful tools in data mining because of their ability to learn
complex patterns from large datasets. Their adaptability to various kinds of data and problem
domains makes them suitable for a wide range of applications, including:

 Pattern Recognition: Neural networks excel at recognizing patterns within data, making them valuable for
tasks such as image and speech recognition, fraud detection, and medical diagnosis.
 Classification: In classification tasks, neural networks categorize input data into predefined classes.
Applications include email spam detection, sentiment analysis, and disease diagnosis.

 Regression: Neural networks can perform regression tasks by predicting numerical values. This is
useful in scenarios such as predicting stock prices, sales forecasts, and housing prices.

 Clustering: Neural networks can be applied to clustering problems, grouping similar data points. This is
useful in customer segmentation, anomaly detection, and data compression.

2. Data Preparation for Neural Networks:

 Feature Scaling: Neural networks benefit from feature scaling, which ensures that all input
features have a similar scale. Common scaling techniques include normalization and standardization.
 Handling Missing Data: Addressing missing data is important for effective neural network
training. Techniques like imputation or exclusion of incomplete records help maintain data integrity.

 Data Splitting: Datasets are generally split into training, validation, and testing sets. Training sets are
used to train the model, validation sets help in tuning hyperparameters, and testing sets
evaluate the model's performance on unseen data. A minimal sketch of these steps follows this list.

3. Neural Network Architecture for Data Mining:


 Input Layer: The input layer of a neural network consists of neurons corresponding to the features of the
dataset. Each neuron represents a feature, and its values are fed into the network during
training.
 Hidden Layers: Hidden layers are where the network learns and extracts features from the input data. The
number of hidden layers and the number of neurons in each layer are crucial aspects of the network architecture and
are often determined through experimentation.

 Output Layer: The output layer produces the final predictions or classifications. The number of neurons
in this layer depends on the nature of the task: binary classification, multi-class classification, or regression.

4. Training and Optimization:

 Backpropagation: One of the most important algorithms for training neural networks is backpropagation. It
iteratively adjusts the weights by following the gradient of the error with respect to those weights. This
process is critical in ensuring that the difference between the predicted and actual outputs is minimized.
 Activation Functions: Activation functions introduce nonlinearity into the neural network, enabling it to
learn complex relationships. Typical activation functions are the sigmoid, the hyperbolic tangent (tanh), and
rectified linear units (ReLU).

 Regularization: Regularization techniques such as dropout and weight decay are applied during training to
prevent overfitting. These techniques help the model generalize well to new, unseen data.

 Hyperparameter Tuning: The selection of appropriate hyperparameters, such as the learning rate, batch size,
and number of hidden layers, drastically influences the performance of a neural network.
Hyperparameter tuning often involves grid search or random search methods. A hedged usage sketch follows this list.
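A hedged sketch tying these pieces together (assuming scikit-learn's MLPClassifier; the architecture, activation, regularization strength, and learning rate are illustrative hyperparameter choices, not prescriptions):

from sklearn.neural_network import MLPClassifier

# Toy XOR-like data (invented): a problem that needs a hidden layer.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 25   # repeat to give the optimizer more samples
y = [0, 1, 1, 0] * 25

mlp = MLPClassifier(
    hidden_layer_sizes=(16,),    # one hidden layer with 16 neurons
    activation="tanh",           # hyperbolic tangent activation
    alpha=1e-4,                  # L2 weight-decay regularization
    learning_rate_init=0.01,     # step size for the gradient (backpropagation) updates
    max_iter=2000,
    random_state=0,
)
mlp.fit(X, y)
print(mlp.predict([[0, 1], [1, 1]]))  # typically [1, 0] once the network has learned XOR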

Challenges in Data Mining of Neural Networks:


Despite their effectiveness, neural networks pose certain challenges in the context of data mining:

 Overfitting: Neural networks are prone to memorizing the training data, which leads to poor generalization on
new data. Regularization techniques and appropriate validation strategies mitigate this
problem.
 Interpretability: Neural networks are often called 'black box' models because it is difficult to explain
why particular predictions were made. In domains that require transparency, this lack of
interpretability becomes a problem.

 Computational Resources: Training large neural networks is a computationally heavy task that requires
powerful GPUs or TPUs. This is a limiting factor, especially for small-scale projects or organizations with
limited resources.

Neural Networks in Data Mining:


1. Image and Speech Recognition:
Neural networks, especially convolutional neural networks (CNNs), have transformed image and speech
recognition. Applications range from facial recognition in security systems to voice-controlled virtual assistants.
2. Financial Fraud Detection:
In financial institutions, neural networks interpret patterns in transaction data to identify fraudulent
activities. They can detect suspicious behavior and flag possibly fraudulent transactions as they occur.

3. Healthcare and Medical Diagnosis:
In medicine, neural networks process medical images such as X-rays and MRIs to diagnose diseases. They
also help estimate the likelihood of patient survival and potential health hazards based on patient data.

4. Customer Relationship Management (CRM):
Neural networks are used for customer segmentation and personalized marketing in CRM systems. These
systems study customers' behavior and preferences so that businesses can develop targeted marketing
strategies.
5. Natural Language Processing (NLP):
In recent years, recurrent neural networks (RNNs) and transformer models have significantly changed
natural language processing tasks such as language translation, sentiment analysis, and chatbots.

Genetic Algorithm
Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of
random search, provided with historical data, to direct the search into the region of better performance in the solution
space. They are commonly used to generate high-quality solutions for optimization and search
problems.

What is a Genetic Algorithm?


Before understanding the genetic algorithm, let's first understand some basic terminology:
 Population: The population is the subset of all possible or probable solutions that can solve the
given problem.
 Chromosome: A chromosome is one of the solutions in the population for the given problem; a
collection of genes makes up a chromosome.
 Gene: A gene is an element of the chromosome; a chromosome is divided into genes.
 Allele: An allele is the value given to a gene within a particular chromosome.
 Fitness Function: The fitness function is used to determine an individual's fitness level in the
population, i.e. the ability of an individual to compete with other individuals. In every
iteration, individuals are evaluated based on their fitness function.
 Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring better
than the parents. Genetic operators change the genetic composition of the next
generation; the main operators are selection, crossover, and mutation (described below).

Foundation of Genetic Algorithms

Genetic algorithms are based on an analogy with the genetic structure and behavior of chromosomes in a
population. The foundation of GAs, based on this analogy, is as follows –

1. Individuals in the population compete for resources and mate.

2. Those individuals who are successful (fittest) then mate to create more offspring than others.

3. Genes from the "fittest" parents propagate throughout the generation; that is, sometimes parents
create offspring which are better than either parent.

4. Thus each successive generation is better suited to its environment.

Search space

A population of individuals is maintained within the search space. Each individual represents a solution in
the search space for the given problem. Each individual is coded as a finite-length vector (analogous to a
chromosome) of components. These variable components are analogous to genes. Thus a chromosome
(individual) is composed of several genes (variable components).

Fitness Score

A fitness score is given to each individual, indicating the ability of that individual to "compete". Individuals
with optimal (or near-optimal) fitness scores are sought.

The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness
scores. Individuals with better fitness scores are given a greater chance to reproduce than others: they are
selected to mate and produce better offspring by combining the chromosomes of the parents. Since the
population size is static, room has to be created for the new arrivals, so some individuals die and are
replaced by the new arrivals, eventually creating a new generation once all the mating opportunities of the
old population are exhausted. The hope is that, over successive generations, better solutions will emerge
while the least fit die out.

Each new generation has, on average, more "good genes" than the individuals (solutions) of previous
generations; thus each new generation has better "partial solutions" than previous generations. Once
the offspring produced show no significant difference from the offspring of previous populations,
the population has converged, and the algorithm is said to have converged to a set of solutions for the problem.

Operators of Genetic Algorithms

Once the initial generation is created, the algorithm evolves the generation using the following operators –
1) Selection Operator: The idea is to give preference to individuals with good fitness scores and allow
them to pass their genes on to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are selected using the
selection operator and crossover sites are chosen randomly. The genes at these crossover sites are then
exchanged, creating a completely new individual (offspring). For example, two parent bit strings may swap
their tails after a randomly chosen crossover point.
3) Mutation Operator: The key idea is to insert random genes into the offspring to maintain diversity in the
population and avoid premature convergence. For example, a randomly chosen bit of the offspring may be flipped.

The whole algorithm can be summarized as –

1) Randomly initialize population p
2) Determine fitness of population
3) Until convergence repeat:
   a) Select parents from population
   b) Crossover and generate new population
   c) Perform mutation on new population
   d) Calculate fitness for new population

Example problem and solution using Genetic Algorithms


Given a target string, the goal is to produce the target string starting from a random string of the same length.
In the following implementation, the following analogies are made –

 Characters A-Z, a-z, 0-9, and other special symbols are considered genes

 A string generated from these characters is considered a chromosome/solution/individual

The fitness score is the number of characters which differ from the characters of the target string at each
index, so an individual having a lower fitness value is given more preference.
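A compact, hedged implementation of this example (the population size, mutation rate, elitism fraction, and target string are illustrative assumptions):

import random
import string

TARGET = "data mining"                          # illustrative target string
GENES = string.ascii_letters + string.digits + " .,!?-"
POP_SIZE = 200
MUTATION_RATE = 0.1

def random_individual():
    """Random chromosome: a string of random genes with the target's length."""
    return "".join(random.choice(GENES) for _ in TARGET)

def fitness(individual):
    """Number of characters that differ from the target (lower is better)."""
    return sum(a != b for a, b in zip(individual, TARGET))

def crossover(parent1, parent2):
    """Single-point crossover at a randomly chosen site."""
    site = random.randint(1, len(TARGET) - 1)
    return parent1[:site] + parent2[site:]

def mutate(individual):
    """Replace random genes to maintain diversity in the population."""
    return "".join(
        random.choice(GENES) if random.random() < MUTATION_RATE else gene
        for gene in individual
    )

population = [random_individual() for _ in range(POP_SIZE)]
generation = 0
while True:
    population.sort(key=fitness)
    best = population[0]
    if fitness(best) == 0:                      # converged: target reproduced
        print(f"Generation {generation}: {best}")
        break
    # Elitism: keep the fittest 10%, breed the rest from the top 50%.
    next_generation = population[: POP_SIZE // 10]
    while len(next_generation) < POP_SIZE:
        p1, p2 = random.choices(population[: POP_SIZE // 2], k=2)
        next_generation.append(mutate(crossover(p1, p2)))
    population = next_generation
    generation += 1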
