
Data Mining

Definition
Data mining is one of the most useful techniques that helps entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).
Data mining is the process of extracting useful information stored in large databases. It is a powerful tool that helps organizations retrieve useful information from available data warehouses.
Data mining can be applied to relational databases, object-oriented databases, data warehouses, structured and unstructured databases, etc.
Data mining is used in numerous areas such as banking, insurance companies, pharmaceutical companies, etc.

Advantages of Data Mining


The data mining technique enables organizations to obtain knowledge-based data.
Data mining enables organizations to make profitable adjustments in operations and production.
Compared with other statistical data applications, data mining is cost-efficient.
Data mining helps the decision-making process of an organization.
It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
It can be introduced into new systems as well as existing platforms.
It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.
Disadvantages of Data Mining
There is a possibility that organizations may sell useful customer data to other organizations for money. For example, American Express has reportedly sold credit card purchase data of its customers to other organizations.
Many data mining analytics tools are difficult to operate and require advanced training to work with.
Different data mining tools work in different ways due to the different algorithms used in their design. Therefore, selecting the right data mining tool is a very challenging task.
Data mining techniques are not always accurate, which may lead to serious consequences in certain conditions.
Data Mining Applications
Different Data Mining Tasks
There are a number of data mining tasks such as classification, prediction, time-series analysis,
association, clustering, summarization etc. All these tasks are either predictive data mining tasks or
descriptive data mining tasks. A data mining system can execute one or more of the above specified tasks as
part of data mining.

Predictive data mining tasks come up with a model from the available data set that is helpful in predicting unknown or future values of another data set of interest. A medical practitioner trying to diagnose a disease based on the medical test results of a patient is an example of a predictive data mining task.
Descriptive data mining tasks usually find patterns that describe the data and come up with new, significant information from the available data set. A retailer trying to identify products that are purchased together is an example of a descriptive data mining task.
a) Classification
Classification derives a model to determine the class of an object based on its attributes. A collection of records will be available, each record with a set of attributes. One of the attributes will be the class attribute, and the goal of the classification task is to assign a class attribute to a new set of records as accurately as possible.
Classification can be used in direct marketing, that is, to reduce marketing costs by targeting a set of customers who are likely to buy a new product. Using the available data, it is possible to know which customers purchased similar products and which did not purchase in the past. Hence, the {purchase, don't purchase} decision forms the class attribute in this case. Once the class attribute is assigned, demographic and lifestyle information of customers who purchased similar products can be collected and promotional mail can be sent to them directly.
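As a minimal sketch of this classification task (assuming scikit-learn is available; the customer features and labels below are invented purely for illustration), a classifier can be trained on past {purchase, don't purchase} records and then used to assign the class attribute to new customers:

```python
# Hypothetical direct-marketing example: learn {purchase, don't purchase} from past records.
from sklearn.linear_model import LogisticRegression

# Each row: [age, annual_income_in_thousands, bought_similar_product_before (0/1)]
X_train = [[25, 30, 0], [40, 60, 1], [35, 55, 1], [23, 25, 0], [50, 80, 1], [30, 40, 0]]
y_train = ["don't purchase", "purchase", "purchase", "don't purchase", "purchase", "don't purchase"]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Assign the class attribute to new customer records
new_customers = [[28, 45, 1], [45, 35, 0]]
print(model.predict(new_customers))        # predicted class for each new customer
print(model.predict_proba(new_customers))  # class probabilities, useful for targeting
```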
b) Prediction
The prediction task predicts the possible values of missing or future data. Prediction involves developing a model based on the available data, and this model is used to predict future values of a new data set of interest. For example, a model can predict the income of an employee based on education, experience, and other demographic factors like place of residence, gender, etc. Prediction analysis is also used in different areas including medical diagnosis, fraud detection, etc.
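As a minimal sketch of such a prediction model (scikit-learn assumed; the income figures are synthetic and only illustrative), a regression model is fitted on known employee records and used to predict the income of a new employee:

```python
# Hypothetical income prediction from experience and education level.
from sklearn.linear_model import LinearRegression

# Each row: [years_of_experience, education_level (1=school, 2=bachelor, 3=master)]
X = [[1, 1], [3, 2], [5, 2], [7, 3], [10, 3], [12, 2]]
y = [22000, 35000, 42000, 58000, 70000, 66000]  # observed annual income

model = LinearRegression().fit(X, y)
print(model.predict([[6, 3]]))  # predicted income for a new employee
```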
c) Time - Series Analysis
Time series is a sequence of events where the next event is determined by one or more of the preceding
events. Time series reflects the process being measured and there are certain components that affect the
behavior of a process. Time series analysis includes methods to analyze time-series data in order to extract
useful patterns, trends, rules and statistics. Stock market prediction is an important application of time-
series analysis.
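A very small sketch of extracting a trend from a time series (the price values are made up; a trailing moving average is used only to illustrate the idea of smoothing out noise to expose the underlying trend):

```python
# Smooth a (synthetic) series of monthly prices to expose the underlying trend.
prices = [101, 103, 102, 106, 110, 108, 112, 115, 114, 118]

def moving_average(series, window=3):
    # Trailing moving average over each complete window.
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

print(moving_average(prices))  # smoothed series highlighting the upward trend
```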
d) Association
Association discovers the association or connection among a set of items. Association identifies the
relationships between objects. Association analysis is used for commodity management, advertising, catalog
design, direct marketing etc. A retailer can identify the products that normally customers purchase together
or even find the customers who respond to the promotion of the same kind of products. If a retailer finds that beer and nappies are mostly bought together, nappies can be put on sale to promote the sale of beer.
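A minimal sketch of association analysis on market-basket data (the transactions are invented; support and confidence of the rule {nappies} → {beer} are computed directly rather than with a full Apriori implementation):

```python
# Toy market-basket data: compute support and confidence of {nappies} -> {beer}.
transactions = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"milk", "bread"},
    {"beer", "nappies", "milk"},
    {"bread", "nappies"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"beer", "nappies"} <= t)
with_nappies = sum(1 for t in transactions if "nappies" in t)

support = both / n                # how often beer and nappies appear together
confidence = both / with_nappies  # P(beer | nappies)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```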
e) Clustering
Clustering is used to identify data objects that are similar to one another. The similarity can be decided
based on a number of factors like purchase behavior, responsiveness to certain actions, geographical
locations and so on. For example, an insurance company can cluster its customers based on age, residence,
income etc. This group information will be helpful to understand the customers better and hence provide
better customized services.
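A minimal sketch of clustering insurance customers by age and income (scikit-learn assumed; the customer values are synthetic):

```python
# Group customers into two segments with k-means.
from sklearn.cluster import KMeans

customers = [[22, 25000], [25, 27000], [47, 60000], [52, 65000], [46, 58000], [23, 24000]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "typical" customer in each segment
```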
f) Summarization
Summarization is the generalization of data. A set of relevant data is summarized, which results in a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high-level summarized information can be useful for sales or customer relationship teams for detailed customer and purchase behavior analysis. Data can be summarized at different abstraction levels and from different angles.
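A minimal sketch of summarization with pandas (the purchase records are invented): raw transactions are aggregated into one summary row per customer, giving total products, total spending, and offers used.

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer":   ["A", "A", "B", "B", "B", "C"],
    "amount":     [20.0, 35.5, 12.0, 40.0, 8.5, 99.0],
    "offer_used": [1, 0, 0, 1, 1, 0],
})

# Aggregate the detailed records into a higher-level summary per customer
summary = purchases.groupby("customer").agg(
    total_products=("amount", "count"),
    total_spending=("amount", "sum"),
    offers_used=("offer_used", "sum"),
)
print(summary)
```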

KDD and Data mining


The main goal of KDD is to extract knowledge from large databases with the help of data mining methods.
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously
unknown, and potentially valuable information from large datasets. The KDD process is iterative and may require multiple passes through the steps below to extract accurate knowledge from the data. The following steps are included in the KDD process:

1. Data cleaning:
In this step, noise and irrelevant data are removed from the database.
 Cleaning in case of missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with data discrepancy detection and data transformation tools.
2. Data integration:
In this step, heterogeneous data sources are merged into a single data source. Data integration uses data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.

3. Data selection:
Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection. For this, methods such as neural networks, decision trees, Naive Bayes, clustering, and regression can be used.
4. Data transformation:
Data Transformation is defined as the process of transforming data into appropriate form required by mining
procedure. Data Transformation is a two step process:
1. Data Mapping: Assigning elements from source base to destination to capture transformations.
2. Code generation: Creation of the actual transformation program.
5. Data mining:
In this step, the various techniques are applied to extract the data patterns.
6. Pattern evaluation:
Pattern evaluation is defined as identifying interesting patterns representing knowledge based on given interestingness measures. It finds the interestingness score of each pattern and uses summarization and visualization to make the data understandable by the user.
7. Knowledge representation: This is the final step of KDD, which represents the knowledge. It involves presenting the results in a way that is meaningful and can be used to make decisions.
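As a minimal sketch of the first KDD steps on toy data (pandas and scikit-learn assumed; the column names and values are invented): cleaning fills missing values, integration merges two sources, selection keeps the relevant attributes, and transformation scales them into a form suitable for mining.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two heterogeneous sources
crm = pd.DataFrame({"cust_id": [1, 2, 3], "age": [34, None, 45]})
billing = pd.DataFrame({"cust_id": [1, 2, 3], "monthly_spend": [120.0, 80.0, None]})

# 1. Data cleaning: fill missing values with the column mean
crm["age"] = crm["age"].fillna(crm["age"].mean())
billing["monthly_spend"] = billing["monthly_spend"].fillna(billing["monthly_spend"].mean())

# 2. Data integration: merge the sources on the shared key
data = crm.merge(billing, on="cust_id")

# 3. Data selection: keep only the attributes relevant to the analysis
selected = data[["age", "monthly_spend"]]

# 4. Data transformation: scale the attributes before mining
transformed = StandardScaler().fit_transform(selected)
print(transformed)
```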

Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help organizations
make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for
analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their customers’
needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies
in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future trends and
patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large
amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to
implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination,
if the data or models are not properly understood or used.
4. Data Quality: The KDD process heavily depends on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in hardware, software,
and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common problem in machine learning
where a model learns the detail and noise in the training data to the extent that it negatively impacts the
performance of the model on new unseen data.
Difference between KDD and Data Mining

Parameter: Definition
KDD: Refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data.
Data Mining: Refers to a process of extracting useful and valuable information or patterns from large data sets.

Parameter: Objective
KDD: To find useful knowledge from data.
Data Mining: To extract useful information from data.

Parameter: Techniques Used
KDD: Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization.
Data Mining: Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.

Parameter: Output
KDD: Structured information, such as rules and models, that can be used to make decisions or predictions.
Data Mining: Patterns, associations, or insights that can be used to improve decision-making or understanding.

Parameter: Focus
KDD: Focus is on the discovery of useful knowledge, rather than simply finding patterns in data.
Data Mining: Focus is on the discovery of patterns or relationships in data.

Parameter: Role of domain expertise
KDD: Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results.
Data Mining: Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Data Mining – Major Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. The major issues are described below.

Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
o Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
o Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
o Incorporation of background knowledge − To guide discovery process and to express the
discovered patterns, the background knowledge can be used. Background knowledge may
be used to express the discovered patterns not only in concise terms but at multiple levels
of abstraction.
o Data mining query languages and ad hoc data mining − Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
o Presentation and visualization of data mining results − Once the patterns are discovered it
needs to be expressed in high level languages, and visual representations. These
representations should be easily understandable.
o Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.
o Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are of little value, so interestingness measures are needed to guide the search.
Performance Issues
There can be performance-related issues such as follows −
o Efficiency and scalability of data mining algorithms − In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be
efficient and scalable.
o Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.
Diverse Data Types Issues
o Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
o Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

DM METRICS
Data mining metrics generally fall into the categories of accuracy, reliability, and usefulness.
Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has
been provided. There are various measures of accuracy, but all measures of accuracy are dependent on the
data that is used. In reality, values might be missing or approximate, or the data might have been changed by
multiple processes. Particularly in the phase of exploration and development, we might decide to accept a
certain amount of error in the data, especially if the data is fairly uniform in its characteristics. For example,
a model that predicts sales for a particular store based on past sales can be strongly correlated and very
accurate, even if that store consistently used the wrong accounting method. Therefore, measurements of
accuracy must be balanced by assessments of reliability.

Reliability assesses the way that a data mining model performs on different data sets. A data mining
model is reliable if it generates the same type of predictions or finds the same general kinds of patterns
regardless of the test data that is supplied. For example, the model that we generate for the store that used
the wrong accounting method would not generalize well to other stores, and therefore would not be reliable

Usefulness includes various metrics that tell us whether the model provides useful information. For
example, a data mining model that correlates store location with sales might be both accurate and reliable,
but might not be useful, because you cannot generalize that result by adding more stores at the same
location. Moreover, it does not answer the fundamental business question of why certain locations have
more sales. We might also find that a model that appears successful in fact is meaningless, because it is based on cross-correlations in the data.

Measuring the effectiveness or usefulness of a data mining approach is not always straightforward. In fact, different metrics could be used for different techniques and also based on the level of interest. From an
overall business or usefulness perspective, a measure such as Return on Investment (ROI) could be used.
ROI examines the difference between what the data mining technique costs and what the savings or benefits
from its use are. Of course, this would be difficult to measure because the return is hard to quantify. It could
be measured as increased sales, reduced advertising expenditure, or both. In a specific advertising campaign
implemented via targeted catalog mailings, the percentage of catalog recipients and the amount of purchase
per recipient would provide one means to measure the effectiveness of the mailings.

Social implications of data mining


Data mining systems are designed to promote the identification and classification of individuals into different groups or segments. From the perspective of a commercial firm, and possibly for the industry as a whole, the use of data mining can be interpreted as a discriminatory technology in the rational pursuit of profits. There are various social implications of data mining, which are as follows −

Privacy − Privacy is a loaded issue. In recent years, privacy concerns have taken on a more important role in American society as merchants, insurance companies, and government agencies amass warehouses of personal records.
The concerns that people have over the collection of this data naturally extend to the analytic capabilities applied to that data. Users of data mining should start thinking about how their use of this technology will be affected by legal issues associated with privacy.
Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or anomalies that are very complex, difficult, or time-consuming to identify. For example, Microsoft's exploration team used complex data mining algorithms to solve a problem that had haunted astronomers for years: reviewing, describing, and categorizing two billion sky objects recorded over three decades. The algorithms were able to extract the features that characterized sky objects as stars or galaxies. This developing field of data mining and profiling has several frontiers where it can be applied.

Unauthorized Use − Trends obtained through data mining, intended for marketing or other ethical purposes, can be misused. Unethical businesses or people may use the data obtained through data mining to take advantage of vulnerable people or discriminate against a specific group of people. Furthermore, data mining techniques are not 100 percent accurate; thus mistakes do happen, which can have serious consequences.
Data mining has influenced our daily lifestyle in how we work, shop, buy, and search for information; importantly, it saves precious time and offers personalized product recommendations based on our previous history, as on Amazon, Flipkart, etc.

Data mining is emerging in all fields, such as healthcare, finance, marketing, and social media, but there is a particularly large contribution towards healthcare and well-being: data mining software is used to analyze data when developing drugs and to find associations between patients, drugs, and outcomes. It also improves patient satisfaction, provides more patient-centered care, decreases costs, and increases operating efficiency, and insurance organizations can detect medical insurance fraud and abuse through data mining and reduce their losses. Old payment systems have now taken different forms of transactions depending on usage, acceptability, methods, technology, and availability, changing physical financial transactions into virtual payment transactions, so data mining focuses on successful transactions and keeps track of fraudulent ones.
Data mining is also used in web-wide tracking technology that tracks a user's interests while visiting any site. Information about every site visited is recorded and can be used to provide marketers with information reflecting your interests. It is also used for customer relationship management, which helps in providing more customized, personal service to individual customers. By studying browsing and purchasing history on web stores, companies can tailor advertisements and promotions to customer profiles, targeting only those who are interested and less likely to be annoyed by unwanted mailings. This helps in reducing costs, avoiding wasted time, and improving work productivity.

DM FROM A DATABASE PERSPECTIVE

According to William H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."
Let’s take a closer look at each of these key features.

 Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
 Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
 Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
 Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

Based on this information, we view data warehousing as the process of constructing and using data warehouses. The construction of a data warehouse requires data cleaning, data integration, and data consolidation. The utilization of a data warehouse often necessitates a collection of decision support technologies. This allows "knowledge workers" (e.g., managers, analysts, and executives) to use the warehouse to quickly and conveniently obtain an overview of the data, and to make sound decisions based on information in the warehouse. Some authors use the term data warehousing to refer only to the process of data warehouse construction, while the term warehouse DBMS is used to refer to the management and utilization of data warehouses. We will not make this distinction here.

"How are organizations using the information from data warehouses?" Many organizations use this information to support business decision-making activities, including (1) increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending); (2) repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, and by geographic regions in order to fine-tune production strategies; (3) analyzing operations and looking for sources of profit; and (4) managing customer relationships, making environmental corrections, and managing the cost of corporate assets.

Data warehousing is also very useful from the point of view of heterogeneous database integration.
Organizations typically collect diverse kinds of data and maintain large databases from multiple,
heterogeneous, autonomous, and distributed information sources. It is highly desirable, yet challenging,
to integrate such data and provide easy and efficient access to it. Much effort has been spent in the
database industry and research community toward achieving this goal.

The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed to a
client site, a metadata dictionary is used to translate the query into queries appropriate for the individual
heterogeneous sites involved. These queries are then mapped and sent to local query processors. The
results returned from the different sites are integrated into a global answer set. This query-driven
approach requires complex information filtering and integration processes, and competes with local
sites for processing resources. It is inefficient and potentially expensive for frequent queries, especially
queries requiring aggregations.

Data warehousing provides an interesting alternative to this traditional approach. Rather than using a query-driven approach, data warehousing employs an update-driven approach in which information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike online transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system because data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store. Furthermore, query processing in data warehouses does not interfere with the processing at local sources.

Moreover, data warehouses can store and integrate historic information and support complex multidimensional queries. As a result, data warehousing has become popular in industry.
Data Mining Techniques

1. Association
Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transaction data analysis. Association rule mining is a significant and exceptionally active area of data mining research. One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.
2. Classification
Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). The derived model may be represented in various forms, such as classification (if-then) rules, decision trees, and neural networks. Data mining has different types of classifiers:
 Decision Tree
 SVM(Support Vector Machine)
 Generalized Linear Models
 Bayesian classification:
 Classification by Backpropagation
 K-NN Classifier
 Rule-Based Classification
 Frequent-Pattern Based Classification
 Rough set theory
 Fuzzy Logic
Decision Trees: A decision tree is a flow-chart-like tree structure, where each node represents a test on
an attribute value, each branch denotes an outcome of a test, and tree leaves represent classes or class
distributions.
Support Vector Machine (SVM) Classifier Method: The Support Vector Machine is a supervised learning method used for classification and also for regression. When the output of the support vector machine is a continuous value, the learning method is said to perform regression; when the learning method predicts a category label for the input object, it is called classification.
Generalized Linear Models: Generalized Linear Models (GLM) is a statistical technique for linear modeling. GLM provides extensive coefficient statistics and model statistics, as well as row diagnostics. It also supports confidence bounds.
Bayesian Classification: A Bayesian classifier is a statistical classifier. Bayesian classifiers can predict class membership probabilities, for instance, the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem.
Classification by Backpropagation: A backpropagation network learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified to minimize the mean squared error between the network's prediction and the actual class.
K-Nearest Neighbor (K-NN) Classifier Method: The k-nearest neighbor (K-NN) classifier is considered an example-based classifier, which means that the training documents are used for comparison rather than an explicit class representation, such as the class profiles used by other classifiers.
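A minimal sketch of the K-NN classifier described above (scikit-learn assumed; the two-feature samples are invented):

```python
# New points are classified by comparison with the stored training examples.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9]]
y_train = ["class_a", "class_a", "class_b", "class_b"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[1.1, 1.0], [2.9, 3.0]]))  # nearest neighbors decide the label
```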
Rule-Based Classification: Rule-based classification represents knowledge in the form of if-then rules. A rule is assessed according to its accuracy and coverage. If more than one rule is triggered, conflict resolution is needed in rule-based classification.
Frequent-Pattern Based Classification: Frequent pattern discovery (or FP discovery, FP mining, or
Frequent itemset mining) is part of data mining. It describes the task of finding the most frequent and
relevant patterns in large datasets. The idea was first presented for mining transaction databases.
Rough Set Theory: Rough set theory is based on the establishment of equivalence classes within the
given training data. All the data samples forming a similarity class are indiscernible, that is, the samples
are equal with respect to the attributes describing the data.
Fuzzy Logic: Rule-based systems for classification have the disadvantage that they involve sharp cut-offs for continuous attributes. Fuzzy logic is valuable for data mining systems performing grouping/classification.
3. Prediction
Data prediction is a two-step process, similar to that of data classification. However, for prediction we do not use the term "class label attribute" because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered).
4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes, clustering
analyzes data objects without consulting an identified class label. In general, the class labels do not exist
in the training data simply because they are not known to begin with.
5. Regression
Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations. This classifier is also known as the continuous value classifier. There are two types of regression models: linear regression and multiple linear regression models.
6. Artificial Neural network (ANN) Classifier Method
An artificial neural network (ANN), also referred to simply as a "neural network" (NN), is a computational model based on biological neural networks. It consists of an interconnected collection of artificial neurons. A neural network is a set of connected input/output units where each connection has a weight associated with it.
The advantages of neural networks include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for extracting rules from trained neural networks. These features contribute to the usefulness of neural networks for classification in data mining.
7. Outlier Detection
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. The investigation of outlier data is known as outlier mining. An outlier may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects having only a small fraction of "close" neighbors in space are considered outliers. Rather than using statistical or distance measures, deviation-based techniques identify outliers by examining differences in the main characteristics of objects in a group.
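A minimal sketch of the statistical approach to outlier detection (the values are invented): points whose z-score, i.e. distance from the mean in standard deviations, exceeds a threshold are flagged.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 95, 12, 10]  # 95 is the injected outlier
mean = statistics.mean(values)
stdev = statistics.stdev(values)

outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)  # -> [95]
```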
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, provided with historical data to direct the search into regions of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization and search problems.
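A minimal sketch of a genetic algorithm (the fitness function, 4-bit encoding, population size, and mutation rate are arbitrary choices made only for illustration): a population of candidate solutions evolves through selection, crossover, and mutation towards the maximum of f(x) = x(10 − x).

```python
import random

def fitness(x):
    return x * (10 - x)            # maximized at x = 5

def crossover(a, b):
    point = random.randint(1, 3)   # single-point crossover on 4 bits
    mask = (1 << point) - 1
    return (a & ~mask) | (b & mask)

def mutate(x, rate=0.1):
    for bit in range(4):           # flip each bit with a small probability
        if random.random() < rate:
            x ^= 1 << bit
    return x

population = [random.randint(0, 15) for _ in range(6)]
for _ in range(30):                                                 # generations
    parents = sorted(population, key=fitness, reverse=True)[:3]     # selection
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(6)]                                # reproduction

print(max(population, key=fitness))  # tends to converge to x = 5
```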

Statistical Methods in Data Mining


Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in data mining:
 Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify
patterns and trends. Alternatively, it is referred to as quantitative analysis.
 Non-statistical Analysis: This analysis provides generalized information and includes sound, still
images, and moving images.
In statistics, there are two main categories:
 Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the main
characteristics of that data. Graphs or numbers summarize the data. Average, Mode, SD(Standard
Deviation), and Correlation are some of the commonly used descriptive statistical methods.
 Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about populations and
make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics. Some of these
are:
 Population
 Sample
 Variable
 Quantitative Variable
 Qualitative Variable
 Discrete Variable
 Continuous Variable
Now, let's start discussing statistical methods. Statistical analysis is the analysis of raw data using mathematical formulas, models, and techniques. Through the use of statistical methods, information is extracted from research data, and different ways are available to judge the robustness of research outputs.
As a matter of fact, today's statistical methods used in the data mining field are typically derived from the vast statistical toolkit developed to answer problems arising in other fields, and they are taught in science curricula. It is necessary to check and test several hypotheses; such hypothesis testing helps us assess the validity of our data mining endeavor when attempting to draw inferences from the data under study. When more complex and sophisticated statistical estimators and tests are used, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations, a variety of statistical
methods are available in Data Mining and some of these are:
 Logistic regression analysis
 Correlation analysis
 Regression analysis
 Discriminant analysis
 Linear discriminant analysis (LDA)
 Classification
 Clustering
 Outlier detection
 Classification and regression trees
 Correspondence analysis
 Nonparametric regression
 Statistical pattern recognition
 Categorical data analysis
 Time-series methods for trends and periodicity
 Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used in data mining:
 Linear Regression: The linear regression method uses the best linear relationship between the independent and dependent variables to predict the target variable. To achieve the best fit, the distances between the fitted line and the actual observations at each point should be as small as possible; a good fit is one where no other line would produce fewer errors. Simple linear regression and multiple linear regression are the two major types of linear regression. Simple linear regression predicts the dependent variable by fitting a linear relationship to a single independent variable, while multiple linear regression fits the best linear relationship with the dependent variable using multiple independent variables.
 Classification: This is a data mining method in which a collection of data is categorized so that it can be analyzed and predicted with a greater degree of accuracy. Classifying very large datasets is an effective way to analyze them, and classification is one of several methods aimed at improving the efficiency of the analysis process. Logistic regression and discriminant analysis stand out as two major classification techniques.
o Logistic Regression: Logistic regression is also applied in machine learning applications and predictive analytics. In this approach, the dependent variable is either binary (binary logistic regression) or multinomial (multinomial logistic regression), i.e., it takes one of two categories or one of several categories. With a logistic regression equation, one can estimate probabilities regarding the relationship between the independent variable and the dependent variable.
o Discriminant Analysis: Discriminant analysis is a statistical method of analyzing data based on the measurements of categories or clusters and categorizing new observations into one or more populations that were identified a priori. Discriminant analysis models each response class separately and then uses Bayes' theorem to flip these projections around to estimate the likelihood of each response category given the value of X. These models can be either linear or quadratic.
o Linear Discriminant Analysis: In linear discriminant analysis, each observation is assigned a discriminant score to classify it into a response variable class. These scores are obtained by combining the independent variables in a linear fashion. The model assumes that observations are drawn from a Gaussian distribution and that the predictor variables share a common covariance across all k levels of the response variable Y.
o Quadratic Discriminant Analysis: An alternative approach is provided by quadratic discriminant analysis. LDA and QDA both assume Gaussian distributions for the observations of the Y classes. Unlike LDA, QDA considers each class to have its own covariance matrix. As a result, the predictor variables may have different variances across the k levels of Y.
o Correlation Analysis: In statistical terms, correlation analysis captures the relationship
between variables in a pair. The value of such variables is usually stored in a column or rows
of a database table and represents a property of an object.
o Regression Analysis: Based on a set of numeric data, regression is a data mining method that
predicts a range of numerical values (also known as continuous values). You could, for
instance, use regression to predict the cost of goods and services based on other variables. A
regression model is used across numerous industries for forecasting financial data, modeling
environmental conditions, and analyzing trends.
The first step in creating good statistics is having good data that was derived with an aim in mind. There are
two main types of data: an input (independent or predictor) variable, which we control or are able to
measure, and an output (dependent or response) variable which is observed. A few will be quantitative
measurements, but others may be qualitative or categorical variables (called factors).

Similarity Measures
 Similarity measures are mathematical functions used to determine the degree of similarity between
two data points or objects. These measures produce a score that indicates how similar or alike the
two data points are.
 It takes two data points as input and produces a similarity score as output, typically ranging from 0
(completely dissimilar) to 1 (identical or perfectly similar).
 A similarity measure can be based on various mathematical techniques such as Cosine similarity,
Jaccard similarity, and Pearson correlation coefficient.
 Similarity measures are generally used to identify duplicate records, equivalent instances, or clusters.
 Similarity measures also have some well-known properties:
o sim(A, B) = 1 (or maximum similarity) only if A = B
o Typical range: 0 ≤ sim ≤ 1
o Symmetry: sim(A, B) = sim(B, A) for all A and B
Now let’s explore a few of the most commonly used similarity measures in data mining.
Cosine Similarity
Cosine similarity is a widely used similarity measure in data mining and information retrieval. It measures
the cosine of the angle between two non-zero vectors in a multi-dimensional space. In the context of data
mining, these vectors represent the feature vectors of two data points. The cosine similarity score ranges
from 0 to 1, with 0 indicating no similarity and 1 indicating perfect similarity.
The cosine similarity between two vectors is calculated as the dot product of the vectors divided by the product of their magnitudes. This calculation can be represented mathematically as follows -

cos(A, B) = (A · B) / (||A|| ||B||)

where A and B are the feature vectors of two data points, "·" denotes the dot product, and "|| ||" denotes the magnitude of a vector.
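A minimal sketch of the calculation with NumPy (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 - vectors point the same way
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0 - orthogonal vectors
```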

Jaccard Similarity
The Jaccard similarity is another widely used similarity measure in data mining, particularly in text analysis
and clustering. It measures the similarity between two sets of data by calculating the ratio of the intersection
of the sets to their union. The Jaccard similarity score ranges from 0 to 1, with 0 indicating no similarity and
1 indicating perfect similarity.
The Jaccard similarity between two sets A and B is calculated as follows -

J(A, B) = |A ∩ B| / |A ∪ B|

where |A ∩ B| is the size of the intersection of sets A and B, and |A ∪ B| is the size of the union of sets A and B.
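A minimal sketch of the calculation on two small sets of terms (the documents are invented):

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

doc1 = {"data", "mining", "extracts", "patterns"}
doc2 = {"data", "mining", "finds", "rules"}
print(jaccard(doc1, doc2))  # 2 shared terms out of 6 distinct terms -> 0.33...
```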
Pearson Correlation Coefficient
The Pearson correlation coefficient is a widely used similarity measure in data mining and statistical
analysis. It measures the linear correlation between two continuous variables, X and Y. The Pearson
correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative correlation, 0 indicating no
correlation, and +1 indicating a perfect positive correlation. The Pearson correlation coefficient is commonly
used in data mining applications such as feature selection and regression analysis. It can help identify
variables that are highly correlated with each other, which can be useful for reducing the dimensionality of a
dataset. In regression analysis, it can also be used to predict the value of one variable based on the value of
another variable.
The Pearson correlation coefficient between two variables, X and Y, is calculated as follows -

r(X, Y) = cov(X, Y) / (σX · σY)

where cov(X, Y) is the covariance between variables X and Y, and σX and σY are the standard deviations of variables X and Y, respectively.
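A minimal sketch of the calculation with NumPy (the sample values are arbitrary); the manual covariance-based formula is checked against NumPy's built-in correlation function:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))  # cov / (sigma_x * sigma_y)
print(r)
print(np.corrcoef(x, y)[0, 1])  # same value from the built-in
```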
Sørensen-Dice Coefficient
The Sørensen-Dice coefficient, also known as the Dice similarity index or Dice coefficient, is a similarity
measure used to compare the similarity between two sets of data, typically used in the context of text or
image analysis. The coefficient ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfect
similarity. The Sørensen-Dice coefficient is commonly used in text analysis to compare the similarity
between two documents based on the set of words or terms they contain. It is also used in image analysis to
compare the similarity between two images based on the set of pixels they contain.
The Sørensen-Dice coefficient between two sets, A and B, is calculated as follows -

Dice(A, B) = 2 |A ∩ B| / (|A| + |B|)

where |A ∩ B| is the size of the intersection of sets A and B, and |A| and |B| are the sizes of sets A and B, respectively.
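A minimal sketch of the calculation on two small sets of terms (the documents are invented):

```python
def dice(a, b):
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = {"data", "mining", "extracts", "patterns"}
doc2 = {"data", "mining", "finds", "rules"}
print(dice(doc1, doc2))  # 2*2 / (4+4) = 0.5
```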
Choosing The Appropriate Similarity Measure
Choosing an appropriate similarity measure depends on the nature of the data and the specific task at hand.
Here are some factors to consider when choosing a similarity measure -
 Different similarity measures are suitable for different data types, such as continuous or categorical data, text or image data, etc. For example, the Pearson correlation coefficient is only suitable for continuous variables.
 Some similarity measures are sensitive to the scale of measurement of the data.
 The choice of similarity measure also depends on the specific task at hand. For example, cosine
similarity is often used in information retrieval and text mining, while Jaccard similarity is
commonly used in clustering and recommendation systems.
 Some similarity measures are more robust to noise and outliers in the data than others. For example,
the Sørensen-Dice coefficient is less sensitive to noise.
Dissimilarity Measures
 Dissimilarity measures are used to quantify the degree of difference or distance between two objects
or data points.
 Dissimilarity measures can be considered the inverse of similarity measures, where the similarity
measure returns a high value for similar objects and a low value for dissimilar objects, and the
dissimilarity measure returns a low value for similar objects and a high value for dissimilar objects.
1. Euclidean Distance

Description: Euclidean distance is the straight-line distance between points in a multi-dimensional space. It is intuitive and widely used in many applications, especially when the features are continuous and the scale is consistent across dimensions.
Applications: It is commonly used in clustering algorithms such as k-means, and in nearest-neighbor searches.
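A minimal sketch of the distance calculation (the points are arbitrary):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([1.0, 2.0], [4.0, 6.0]))  # 5.0
```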
Applications of Similarity Measures
1. Clustering
Clustering involves grouping a set of objects such that objects in the same group (or cluster) are more similar to each other than to those in other groups. Similarity measures play an essential role in defining these groups.
2. Classification
Classification assigns a label to a new data point based on the characteristics of known labeled data points. Similarity measures help decide the label by comparing the new point to existing points.
3. Information Retrieval
Information retrieval systems, such as search engines, rely on similarity measures to rank documents based on their relevance to a query.
4. Recommendation Systems
Recommendation systems suggest items to users based on their preferences and behavior, often using similarity measures to find items or users that are alike.
5. Anomaly Detection
Anomaly detection identifies outliers or unusual data points that differ substantially from the bulk of the data.
6. Natural Language Processing (NLP)
In NLP, similarity measures are used to compare text data, assisting in tasks such as document clustering, plagiarism detection, and sentiment analysis.
7. Image Processing
Image processing involves analyzing and manipulating images, where similarity measures are used to compare image features.
8. Bioinformatics
In bioinformatics, similarity measures help compare biological data, such as genetic sequences or protein structures.
Decision Tree

Decision trees are a popular and powerful tool used in various fields such as machine learning, data mining, and statistics. They provide a clear and intuitive way to make decisions based on data by modeling the relationships between different variables. This section covers what decision trees are, how they work, their advantages and disadvantages, and their applications.
What is a Decision Tree?
A decision tree is a flowchart-like
structure used to make decisions or
predictions. It consists of nodes
representing decisions or tests on
attributes, branches representing the outcome of these decisions, and leaf nodes representing final outcomes
or predictions. Each internal node corresponds to a test on an attribute, each branch corresponds to the result
of the test, and each leaf node corresponds to a class label or a continuous value.
Structure of a Decision Tree
1. Root Node: Represents the entire dataset and the initial decision to be made.
2. Internal Nodes: Represent decisions or tests on attributes. Each internal node has one or more
branches.
3. Branches: Represent the outcome of a decision or test, leading to another node.
4. Leaf Nodes: Represent the final decision or prediction. No further splits occur at these nodes.
How Decision Trees Work?
The process of creating a decision tree involves:
1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain, the
best attribute to split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
3. Repeating the Process: The process is repeated recursively for each subset, creating a new internal
node or leaf node until a stopping criterion is met (e.g., all instances in a node belong to the same
class or a predefined depth is reached).
Metrics for Splitting
 Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it was randomly classified according to the distribution of classes in the dataset.
o Gini = 1 − Σ (p_i)², summed over the n classes, where p_i is the probability of an instance being classified into class i.
 Entropy: Measures the amount of uncertainty or impurity in the dataset.
o Entropy = − Σ p_i log2(p_i), summed over the n classes, where p_i is the probability of an instance being classified into class i.
 Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is split on an attribute.
o Information Gain = Entropy(parent) − Σ (|D_i| / |D|) × Entropy(D_i), summed over the subsets D_i of D produced by splitting on the attribute.
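A minimal sketch of these splitting metrics computed from class labels (the labels and the example split are invented):

```python
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]  # a perfect split
print(gini(parent), entropy(parent), information_gain(parent, split))
```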
Advantages of Decision Trees
 Simplicity and Interpretability: Decision trees are easy to understand and interpret. The visual
representation closely mirrors human decision-making processes.
 Versatility: Can be used for both classification and regression tasks.
 No Need for Feature Scaling: Decision trees do not require normalization or scaling of the data.
 Handles Non-linear Relationships: Capable of capturing non-linear relationships between features
and target variables.
Disadvantages of Decision Trees
 Overfitting: Decision trees can easily overfit the training data, especially if they are deep with many
nodes.
 Instability: Small variations in the data can result in a completely different tree being generated.
 Bias towards Features with More Levels: Features with more levels can dominate the tree
structure.
Pruning
To overcome overfitting, pruning techniques are used. Pruning reduces the size of the tree by removing
nodes that provide little power in classifying instances. There are two main types of pruning:
 Pre-pruning (Early Stopping): Stops the tree from growing once it meets certain criteria (e.g.,
maximum depth, minimum number of samples per leaf).
 Post-pruning: Removes branches from a fully grown tree that do not provide significant power.
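A minimal sketch of training a pre-pruned decision tree with scikit-learn (the iris dataset is used only for illustration): max_depth and min_samples_leaf act as the early-stopping criteria mentioned above, and the learned tree can be printed as if-then rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5,
                              random_state=0).fit(X, y)

print(export_text(tree))    # readable if-then view of the learned tree
print(tree.predict(X[:3]))  # class predictions for a few samples
```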
Applications of Decision Trees
 Business Decision Making: Used in strategic planning and resource allocation.
 Healthcare: Assists in diagnosing diseases and suggesting treatment plans.
 Finance: Helps in credit scoring and risk assessment.
 Marketing: Used to segment customers and predict customer behavior.

Neural Network in Data Mining


What is Neural Networks?
A neural community is a computational model stimulated by the human brain's structure and functioning. It
consists of interconnected nodes, typically called neurons or artificial neurons. These neurons are organized
into layers consisting of an input layer, one or more hidden layers, and an output layer. The connections
between neurons, called weights, decide the network's ability to research from records.
Types of Neural Networks:
There are numerous types of neural networks, every designed for specific tasks. Some common kinds
encompass:
1. Feedforward Neural Networks (FNN):
Feedforward neural networks are the most effective form of neural networks, in which statistics flow
in a single direction-from the input layer via the hidden layers to the output layer. They are normally
used for classification and regression duties.
2. Recurrent Neural Networks (RNN):
Recurrent neural networks have connections that shape cycles, letting them seize temporal
dependencies in sequential statistics. RNNs are appropriate for responsibilities regarding time series
evaluation, natural language processing, and speech reputation.
3. Convolutional Neural Networks (CNN):
Convolutional neural networks are designed to method grid-like records together with photographs.
They rent convolutional layers to robotically learn hierarchical representations of styles, making
them notably powerful in picture reputation and computer imaginative and prescient obligations.
4. Radial Basis Function Networks (RBFN):
Radial foundation function networks use radial foundation features as activation functions inside the
hidden layers. They are often hired for pattern reputation and feature approximation.
Neural Network Training:
Training a neural network involves adjusting the weights of the connections to minimize the difference between the predicted output and the actual target values. This process typically employs optimization algorithms such as gradient descent. During training, the network learns the underlying patterns in the data, enabling it to make accurate predictions on new, unseen examples.
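A minimal sketch of such a training loop, assuming a single-layer network with a sigmoid output and synthetic data (all names, shapes, and values below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, 3 input features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0) * 1.0   # synthetic binary targets

w = rng.normal(size=3)                           # weights to be learned
b = 0.0
lr = 0.1                                         # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    pred = sigmoid(X @ w + b)                    # forward pass
    grad_w = X.T @ (pred - y) / len(y)           # gradient of the cross-entropy loss
    grad_b = np.mean(pred - y)
    w -= lr * grad_w                             # gradient descent weight update
    b -= lr * grad_b

print(((sigmoid(X @ w + b) > 0.5) == y).mean())  # training accuracy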
Neural Networks in Data Mining:
1. Role of Neural Networks:
Neural networks are powerful tools in data mining because of their ability to learn complex patterns from massive datasets. Their adaptability to various kinds of data and problem domains makes them suitable for a wide range of applications, including:
o Pattern Recognition: Neural networks excel at recognizing patterns within data, making them valuable for tasks such as image and speech recognition, fraud detection, and medical diagnosis.
o Classification: In classification tasks, neural networks categorize input data into predefined classes. Applications include email spam detection, sentiment analysis, and disease diagnosis.
o Regression: Neural networks can perform regression tasks by predicting numerical values. This is useful in scenarios such as predicting stock prices, sales forecasts, and housing prices.
o Clustering: Neural networks can be applied to clustering problems, grouping similar data points. This is useful in customer segmentation, anomaly detection, and data compression.
2. Data Preparation for Neural Networks:
o Feature Scaling: Neural networks benefit from feature scaling, which ensures that all input features have a similar scale. Common scaling strategies include normalization and standardization.
o Handling Missing Data: Addressing missing data is important for effective neural network training. Techniques such as imputation or exclusion of incomplete records help maintain the data's integrity.
o Data Splitting: Datasets are generally split into training, validation, and testing sets. The training set is used to fit the model, the validation set is used to tune hyperparameters, and the testing set is used to evaluate the model's performance on unseen data.
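For example, assuming scikit-learn is available, feature scaling and a train/validation/test split might look like the following sketch; the dataset here is random placeholder data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 10)          # hypothetical feature matrix
y = np.random.randint(0, 2, 1000)     # hypothetical binary labels

# Split into training (60%), validation (20%), and testing (20%) sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Standardize features: fit the scaler on the training set only,
# then apply the same transformation to the validation and test sets
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)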
3. Neural Network Architecture for Data Mining:
o Input Layer: The input layer of a neural network consists of neurons corresponding to the features of the dataset. Each neuron represents one feature, and its values are fed into the network during training.
o Hidden Layers: Hidden layers are where the network learns and extracts features from the input data. The number of hidden layers and the number of neurons in each layer are crucial aspects of the network architecture and are often determined through experimentation.
o Output Layer: The output layer produces the final predictions or classifications. The number of neurons in this layer depends on the nature of the task: binary classification, multi-class classification, or regression.
4. Training and Optimization:
o Backpropagation: One of the most important algorithms for training neural networks is backpropagation. It iteratively adjusts the weights by following the gradient of the error with respect to those weights, minimizing the difference between the predicted and actual outputs.
o Activation Functions: Activation functions introduce nonlinearity into the neural network, enabling it to learn complex relationships. Typical activation functions are the sigmoid, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU).
o Regularization: Regularization techniques such as dropout and weight decay are applied during training to prevent overfitting. These techniques help the model generalize well to new, unseen data.
o Hyperparameter Tuning: The selection of appropriate hyperparameters, such as the learning rate, batch size, and number of hidden layers, strongly influences the performance of a neural network. Hyperparameter tuning often uses grid search or random search.
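As a hedged example of these choices, the sketch below configures a small multi-layer perceptron in scikit-learn and tunes two hyperparameters with grid search; the synthetic dataset, layer sizes, regularization strength, and parameter grid are arbitrary illustrations:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32),  # two hidden layers
                    activation='relu',            # ReLU activation
                    alpha=1e-4,                   # L2 penalty (a form of weight decay)
                    max_iter=500)

# Hyperparameter tuning: grid search over learning rate and architecture
grid = GridSearchCV(mlp,
                    param_grid={'learning_rate_init': [0.001, 0.01],
                                'hidden_layer_sizes': [(32,), (64, 32)]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))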
Challenges in Data Mining of Neural Networks:
Despite their effectiveness, neural networks pose certain challenges in the context of data mining:
o Overfitting: Neural networks are prone to memorizing the training data, so they generalize poorly to new data. Regularization techniques and appropriate validation strategies mitigate this problem.
o Interpretability: Neural networks are often called 'black box' models because it is difficult to explain why they made a particular prediction. In domains that require transparency, this lack of interpretability becomes a problem.
o Computational Resources: Training large neural networks is computationally expensive and requires powerful GPUs or TPUs. This is a limiting factor, especially for small-scale projects or organizations with limited resources.
Applications of Neural Networks in Data Mining:
1. Image and Speech Recognition:
Neural networks, especially convolutional neural networks (CNNs), have transformed image and speech recognition. Applications range from facial recognition in security systems to voice-controlled virtual assistants.
2. Financial Fraud Detection:
In financial institutions, neural networks analyze patterns in transaction data to identify fraudulent activities. They can detect suspicious behavior and flag potentially fraudulent transactions as they occur.
3. Healthcare and Medical Diagnosis:
In medicine, neural networks process medical images such as X-rays and MRIs to diagnose diseases. They
also help determine the likelihood of patient survival and potential health hazards based on patients' data.
4. Customer Relationship Management (CRM):
Neural networks are used for customer segmentation and personalized marketing in CRM systems. These systems analyze customers' behavior and preferences so that businesses can develop targeted marketing strategies.
5. Natural Language Processing (NLP):
In recent years, recurrent neural networks (RNNs) and transformer models have significantly changed natural language processing tasks such as language translation, sentiment analysis, and chatbots.
Future Trends and Developments:
1. Explainable AI (XAI):
In response to the interpretability challenge, Explainable AI (XAI) seeks to increase transparency and comprehensibility in neural networks. Researchers are currently working to create techniques that give insight into how complex models make their decisions.
2. Transfer Learning:
Transfer learning refers to pre-training a neural network on one task and then fine-tuning it on another, closely related task. This approach has proven effective in improving neural networks' efficiency and performance, especially where labeled data is limited.
3. Edge Computing:
Neural networks can be integrated with edge computing devices, thus enabling real-time data
processing at the source. This reduces the extensive transmission of data to centralized servers,
which is beneficial in applications such as IoT and autonomous systems.
Genetic Algorithm
Before understanding the Genetic algorithm, let's first understand basic terminologies to better understand
this algorithm:
 Population: Population is the subset of all possible or probable solutions, which can solve the given
problem.
 Chromosomes: A chromosome is one of the solutions in the population for the given problem; a collection of genes makes up a chromosome.
 Gene: A gene is an element of a chromosome; a chromosome is divided into different genes.
 Allele: Allele is the value provided to the gene within a particular chromosome.
 Fitness Function: The fitness function is used to determine the individual's fitness level in the
population. It means the ability of an individual to compete with other individuals. In every iteration,
individuals are evaluated based on their fitness function.
 Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring that are better than the parents. Genetic operators change the genetic composition of the next generation; the main operators are selection, crossover, and mutation (described below).
Foundation of Genetic Algorithms
Genetic algorithms are based on an analogy with the genetic structure and behavior of chromosomes of the
population. Following is the foundation of GAs based on this analogy –
1. Individuals in the population compete for resources and mate
2. Those individuals who are successful (fittest) then mate to create more offspring than others
3. Genes from the "fittest" parents propagate through the generations; sometimes parents create offspring that are better than either parent.
4. Thus each successive generation is better suited to its environment.
Search space
The population of individuals is maintained within the search space. Each individual represents a solution in the search space for the given problem. Each individual is coded as a finite-length vector of components (analogous to a chromosome). These variable components are analogous to genes. Thus a chromosome (individual) is composed of several genes (variable components).
Fitness Score
A fitness score is given to each individual, indicating its ability to "compete". Individuals with an optimal (or near-optimal) fitness score are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. Individuals with better fitness scores are given a greater chance to reproduce than others: they are selected to mate and produce better offspring by combining the chromosomes of the parents. Since the population size is static, room has to be created for new arrivals, so some individuals die and are replaced by new arrivals, eventually creating a new generation once all the mating opportunities of the old population are exhausted. It is hoped that over successive generations better solutions will arrive while the least fit die out.
Each new generation has, on average, more "good genes" than the individuals of previous generations; thus each new generation has better "partial solutions" than the previous ones. Once the offspring produced show no significant difference from the offspring of previous populations, the population has converged, and the algorithm is said to have converged to a set of solutions for the problem.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves the generation using following operators –
1) Selection Operator: The idea is to give preference to the individuals with good fitness scores and allow
them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are selected using the selection operator, and crossover sites are chosen randomly. The genes at these crossover sites are then exchanged, creating completely new individuals (offspring). For example, with a single crossover point after the third gene, the parents 101|100 and 110|011 produce the offspring 101011 and 110100.
3) Mutation Operator: The key idea is to insert random genes into the offspring to maintain diversity in the population and avoid premature convergence. For example, flipping one randomly chosen bit turns 110100 into 110110.
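A minimal Python sketch of the crossover and mutation operators, assuming binary-encoded chromosomes (the mutation rate and the example parents are arbitrary):

import random

def crossover(parent1, parent2):
    # Single-point crossover: swap gene segments after a random cut point
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chromosome, rate=0.05):
    # Flip each bit with a small probability to preserve diversity
    return [1 - g if random.random() < rate else g for g in chromosome]

p1, p2 = [1, 0, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1]
c1, c2 = crossover(p1, p2)
print(c1, mutate(c2))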
The whole algorithm can be summarized as –
1) Randomly initialize population p
2) Determine fitness of population
3) Until convergence repeat:
a) Select parents from population
b) Crossover and generate new population
c) Perform mutation on new population
d) Calculate fitness for new population
Example problem and solution using Genetic Algorithms
Given a target string, the goal is to produce the target string starting from a random string of the same length. In the following implementation, these analogies are made –
 Characters A-Z, a-z, 0-9, and other special symbols are considered genes
 A string generated from these characters is considered a chromosome/solution/individual
The fitness score is the number of characters that differ from the characters of the target string at the corresponding index, so an individual with a lower fitness value is given more preference.
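A compact sketch of this string-matching GA is given below; the target string, population size, and breeding probabilities are illustrative choices, not prescribed by the text:

import random
import string

TARGET = "data mining"
GENES = string.ascii_letters + string.digits + " .,!?"   # allowed gene values
POP_SIZE = 200

def random_individual():
    return [random.choice(GENES) for _ in range(len(TARGET))]

def fitness(individual):
    # Number of characters that differ from the target (lower is better)
    return sum(1 for g, t in zip(individual, TARGET) if g != t)

def breed(parent1, parent2):
    # For each position, inherit from either parent or mutate to a random gene
    child = []
    for g1, g2 in zip(parent1, parent2):
        r = random.random()
        child.append(g1 if r < 0.45 else g2 if r < 0.90 else random.choice(GENES))
    return child

population = [random_individual() for _ in range(POP_SIZE)]
generation = 0
while True:
    population.sort(key=fitness)          # best (lowest) fitness first
    if fitness(population[0]) == 0:
        break                             # converged: target string reached
    # Elitism: carry the best 10% forward, breed the rest from the top 50%
    next_gen = population[:POP_SIZE // 10]
    while len(next_gen) < POP_SIZE:
        next_gen.append(breed(random.choice(population[:POP_SIZE // 2]),
                              random.choice(population[:POP_SIZE // 2])))
    population = next_gen
    generation += 1

print("Reached", "".join(population[0]), "in", generation, "generations")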