Data Mining
Definition
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).
Data mining is the process of extracting useful information stored in large databases. It is a powerful tool that helps organizations retrieve useful information from available data warehouses.
Data mining can be applied to relational databases, object-oriented databases, data warehouses, structured and unstructured data sources, etc. It is used in numerous areas such as banking, insurance, and pharmaceutical companies.
1. Data cleaning:
In this step, noise and irrelevant data are removed from the database. This includes:
Handling missing values.
Cleaning noisy data, where noise is a random or variance error.
Using data discrepancy detection and data transformation tools.
2. Data integration:
In this step, heterogeneous data sources are merged into a single data source. Data integration is performed using data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load) process.
3. Data selection:
Data selection is the process of deciding which data relevant to the analysis should be retrieved from the data collection. Methods such as neural networks, decision trees, naive Bayes, clustering, and regression can be used for this.
4. Data transformation:
Data transformation is the process of transforming data into the appropriate form required by the mining procedure. Data transformation is a two-step process:
1. Data mapping: assigning elements from the source base to the destination to capture transformations.
2. Code generation: creation of the actual transformation program.
5. Data mining:
In this step, various mining techniques are applied to extract data patterns.
6. Pattern evaluation:
Pattern evaluation is defined as identifying interesting patterns representing knowledge based on given interestingness measures. It finds the interestingness score of each pattern and uses summarization and visualization to make the data understandable to the user.
7. Knowledge representation: This is the final step of KDD, which represents the discovered knowledge. It involves presenting the results in a way that is meaningful and can be used to make decisions. A minimal end-to-end sketch of these steps is shown below.
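The following is a rough, minimal sketch of the KDD steps above in Python with pandas and scikit-learn. The file name customers.csv and the columns age, income, and churned are hypothetical placeholders, and a decision tree stands in for whatever mining technique is actually chosen.

```python
# Hypothetical end-to-end sketch of the KDD steps (cleaning, selection,
# transformation, mining, evaluation) using pandas and scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1-2. Load and clean: drop duplicates and rows with missing values.
df = pd.read_csv("customers.csv").drop_duplicates().dropna()

# 3. Selection: keep only attributes relevant to the analysis (hypothetical columns).
X = df[["age", "income"]]
y = df["churned"]

# 4. Transformation: scale features into a form suitable for mining.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 5. Mining: fit a simple classification model.
model = DecisionTreeClassifier(max_depth=3).fit(X_train_s, y_train)

# 6. Evaluation: measure how accurate/interesting the discovered pattern is.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test_s)))
```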
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help organizations
make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for
analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their customers’
needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies
in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future trends and
patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large
amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge to
implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination,
if the data or models are not properly understood or used.
4. Data Quality: KDD process heavily depends on the quality of data, if data is not accurate or consistent,
the results can be misleading
5. High cost: KDD can be an expensive process, requiring significant investments in hardware, software,
and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common problem in machine learning
where a model learns the detail and noise in the training data to the extent that it negatively impacts the
performance of the model on new unseen data.
Difference between KDD and Data Mining
Parameter | KDD | Data Mining
Objective | To find useful knowledge from data. | To extract useful information from data.
DM METRICS
Data mining metrics generally fall into the categories of accuracy, reliability, and usefulness.
Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has
been provided. There are various measures of accuracy, but all measures of accuracy are dependent on the
data that is used. In reality, values might be missing or approximate, or the data might have been changed by
multiple processes. Particularly in the phase of exploration and development, we might decide to accept a
certain amount of error in the data, especially if the data is fairly uniform in its characteristics. For example,
a model that predicts sales for a particular store based on past sales can be strongly correlated and very
accurate, even if that store consistently used the wrong accounting method. Therefore, measurements of
accuracy must be balanced by assessments of reliability.
Reliability assesses the way that a data mining model performs on different data sets. A data mining
model is reliable if it generates the same type of predictions or finds the same general kinds of patterns
regardless of the test data that is supplied. For example, the model that we generate for the store that used
the wrong accounting method would not generalize well to other stores, and therefore would not be reliable.
Usefulness includes various metrics that tell us whether the model provides useful information. For
example, a data mining model that correlates store location with sales might be both accurate and reliable,
but might not be useful, because you cannot generalize that result by adding more stores at the same
location. Moreover, it does not answer the fundamental business question of why certain locations have
more sales. We might also find that a model that appears successful in fact is meaningless, because it is
based on cross-correlations in the data.
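As a rough illustration of these ideas (not taken from the text above), accuracy can be estimated on data splits the model has not seen, and the spread of scores across different splits gives a crude sense of reliability. The sketch below uses scikit-learn with its bundled Iris data purely as a stand-in.

```python
# A small sketch of the accuracy/reliability distinction: cross-validation
# scores show how a model performs across different subsets of the data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("Per-fold accuracy:", scores.round(3))            # accuracy on each data split
print("Spread across folds:", round(scores.std(), 3))   # smaller spread -> more reliable
```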
Measuring the effectiveness or usefulness of a data mining approach is not always straightforward. In fact, different metrics may be used for different techniques, and the choice also depends on the level of interest. From an
overall business or usefulness perspective, a measure such as Return on Investment (ROI) could be used.
ROI examines the difference between what the data mining technique costs and what the savings or benefits
from its use are. Of course, this would be difficult to measure because the return is hard to quantify. It could
be measured as increased sales, reduced advertising expenditure, or both. In a specific advertising campaign
implemented via targeted catalog mailings, the percentage of catalog recipients and the amount of purchase
per recipient would provide one means to measure the effectiveness of the mailings.
Privacy − Privacy is a loaded issue. In recent years privacy concerns have taken on a more important role in American society as merchants, insurance companies, and government agencies amass warehouses of personal records.
The concerns that people have over the collection of this data will naturally extend to the analytic capabilities applied to that data. Users of data mining should start thinking about how their use of this technology will be affected by legal issues associated with privacy.
Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or anomalies that are very complex, difficult, or time-consuming to identify. For example, the founder of Microsoft's Exploration Team used complex data mining algorithms to solve a problem that had haunted astronomers for years: reviewing, describing, and categorizing two billion sky objects recorded over three decades. The algorithms were able to extract the features that characterized sky objects as stars or galaxies. This developing field of data mining and profiling has several frontiers where it can be applied.
Unauthorized Use − Trends obtained through data mining, although intended for marketing or other ethical purposes, can be misused. Unethical businesses or people can use the information obtained through data mining to take advantage of vulnerable people or discriminate against a specific group of people. Furthermore, data mining techniques are not 100 percent accurate; mistakes do occur, and they can have serious consequences.
Data mining has influenced our daily lifestyle in innovative ways: how we work, how we shop, what we buy, and how we search for information. Importantly, it saves our precious time and offers personalized product recommendations based on our previous history, as on Amazon, Flipkart, etc.
Data mining is emerging in all fields, such as healthcare, finance, marketing, and social media, but its largest contribution is to healthcare and well-being. Data mining software is used to analyze data when developing drugs and to find associations between patients, drugs, and outcomes; it also helps improve patient satisfaction, provide more patient-centered care, decrease costs, and increase operating efficiency. Insurance organizations can detect medical insurance fraud and abuse through data mining and reduce their losses. Traditional payment systems have taken on different forms of transactions depending on usage, acceptability, methods, technology, and availability, changing physical financial transactions into virtual payment transactions, so data mining focuses on characterizing successful transactions and keeping track of fraudulent ones.
Data mining is also used in web-wide tracking technology that tracks users' interests while they visit any site. Information about every site visited is recorded and can later be used to provide marketers with information reflecting your interests. It is also used for customer relationship management, which helps in providing more customized, personal service to individual customers. By studying browsing and purchasing history on web stores, companies can tailor advertisements and promotions to customer profiles, targeting only those who are interested and less likely to be annoyed by unwanted mailings. This helps in reducing costs, reducing wasted time, and improving work productivity.
Based on this information, we view data warehousing as the process of constructing and using data warehouses. The construction of a data warehouse requires data cleaning, data integration, and data
consolidation. The utilization of a data warehouse often necessitates a collection of decision support
technologies. This allows “knowledge workers” (e.g., managers, analysts, and executives) to use the
warehouse to quickly and conveniently obtain an overview of the data, and to make sound decisions
based on information in the warehouse. Some authors use the term data warehousing to refer only to the
process of data warehouse construction, while the term warehouse DBMS is used to refer to the
management and utilization of data warehouses. We will not make this distinction here.
“How are organizations using the information from data warehouses?” Many organizations use this information to support business decision-making activities, including (1) increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending); (2) repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, and by geographic regions in order to fine-tune production strategies; (3) analyzing operations and looking for sources of profit; and (4) managing
customer relationships, making environmental corrections, and managing the cost of corporate assets.
Data warehousing is also very useful from the point of view of heterogeneous database integration.
Organizations typically collect diverse kinds of data and maintain large databases from multiple,
heterogeneous, autonomous, and distributed information sources. It is highly desirable, yet challenging,
to integrate such data and provide easy and efficient access to it. Much effort has been spent in the
database industry and research community toward achieving this goal.
The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed to a
client site, a metadata dictionary is used to translate the query into queries appropriate for the individual
heterogeneous sites involved. These queries are then mapped and sent to local query processors. The
results returned from the different sites are integrated into a global answer set. This query-driven
approach requires complex information filtering and integration processes, and competes with local
sites for processing resources. It is inefficient and potentially expensive for frequent queries, especially
queries requiring aggregations.
Data warehousing provides an interesting alternative to this traditional approach. Rather than using
a query-driven approach, data warehousing employs an update-driven approach in which information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike online transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system because data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store. Furthermore, query processing in data
warehouses does not interfere with the processing at local sources.
Moreover, data warehouses can store and integrate historic information and support complex multidimensional queries. As a result, data warehousing has become popular in industry.
Data Mining Techniques
1. Association
Association analysis is the finding of association rules
showing attribute-value conditions that occur frequently
together in a given set of data. Association analysis is
widely used for a market basket or transaction data
analysis. Association rule mining is a significant and
exceptionally dynamic area of data mining research. One
method of association-based classification, called
associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered. A small frequent-itemset sketch is shown below.
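The sketch below is a simplified, brute-force frequent-itemset counter, not the full Apriori algorithm; the basket data and the 50% minimum support threshold are assumed purely for illustration.

```python
# A minimal brute-force frequent-itemset sketch (not the full Apriori algorithm):
# it enumerates small candidate itemsets and keeps those meeting a minimum support.
from itertools import combinations

transactions = [                       # hypothetical market-basket data
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support = 0.5                      # itemset must appear in >= 50% of baskets
items = sorted(set().union(*transactions))

frequent = {}
for size in (1, 2):
    for candidate in combinations(items, size):
        count = sum(set(candidate) <= t for t in transactions)
        support = count / len(transactions)
        if support >= min_support:
            frequent[candidate] = support

print(frequent)
# e.g. a rule {bread} -> {butter} would have confidence
# support(bread, butter) / support(bread)
```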
2. Classification
Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). The derived model may be represented in various forms, such as classification (if–then) rules, decision trees, and neural networks. Data mining uses different types of classifiers:
Decision Tree
SVM (Support Vector Machine)
Generalized Linear Models
Bayesian Classification
Classification by Backpropagation
K-NN Classifier
Rule-Based Classification
Frequent-Pattern Based Classification
Rough set theory
Fuzzy Logic
Decision Trees: A decision tree is a flow-chart-like tree structure, where each node represents a test on
an attribute value, each branch denotes an outcome of a test, and tree leaves represent classes or class
distributions.
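A minimal decision-tree sketch with scikit-learn is shown below; the bundled Iris dataset is used only as a convenient stand-in.

```python
# A minimal decision-tree classification sketch using scikit-learn and its
# bundled Iris dataset (chosen here only for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # the learned if-then structure of the tree
```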
Support Vector Machine (SVM) Classifier Method: Support Vector Machines are a supervised learning strategy used for classification and also for regression. When the output of the support vector machine is a continuous value, the learning methodology is said to perform regression; and when the learning methodology predicts a category label of the input object, it is known as classification.
Generalized Linear Models: Generalized Linear Models (GLM) is a statistical technique for linear modeling. GLM provides extensive coefficient statistics and model statistics, as well as row diagnostics. It also supports confidence bounds.
Bayesian Classification: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, for instance, the probability that a given sample belongs to a particular class. Bayesian classification is based on the Bayes theorem.
Classification By Backpropagation: Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified to minimize the mean squared error between the network's prediction and the actual class. A small numeric sketch follows below.
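The sketch below is a tiny, from-scratch illustration of the backpropagation idea on the classic XOR problem: the error is computed at the output layer and propagated backward to update the hidden-layer weights. The network size, learning rate, and iteration count are arbitrary choices, and convergence on XOR can depend on the random seed.

```python
# A tiny numeric sketch of backpropagation: a 2-4-1 sigmoid network trained on
# the XOR problem (assumed toy data), propagating the error backward layer by layer.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
lr = 1.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: output-layer gradient first, then propagate to hidden layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent weight updates (mean squared error)
    W2 -= lr * h.T @ d_out / len(X)
    b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X)
    b1 -= lr * d_h.mean(axis=0)

print(np.round(out, 2))   # should move toward [[0], [1], [1], [0]] (seed-dependent)
```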
K-Nearest Neighbor (K-NN) Classifier Method: The k-nearest neighbor (K-NN) classifier is considered an example-based (instance-based) classifier, which means that the training documents themselves are used for comparison, rather than an explicit class representation such as the class profiles used by other classifiers.
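A minimal k-nearest-neighbor sketch with scikit-learn follows; the Iris data and k = 5 are illustrative choices.

```python
# A minimal k-nearest-neighbor sketch with scikit-learn on the bundled
# Iris data (used here purely as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
# Each test point is labeled by a majority vote of its 5 closest training points.
```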
Rule-Based Classification: Rule-based classification represents the knowledge in the form of if-then rules. A rule is assessed according to its accuracy and coverage. If more than one rule is triggered, then conflict resolution is needed in rule-based classification.
Frequent-Pattern Based Classification: Frequent pattern discovery (or FP discovery, FP mining, or
Frequent itemset mining) is part of data mining. It describes the task of finding the most frequent and
relevant patterns in large datasets. The idea was first presented for mining transaction databases.
Rough Set Theory: Rough set theory is based on the establishment of equivalence classes within the
given training data. All the data samples forming a similarity class are indiscernible, that is, the samples
are equal with respect to the attributes describing the data.
Fuzzy Logic: Rule-based systems for classification have the disadvantage that they involve sharp cut-offs for continuous attributes. Fuzzy logic is valuable for data mining frameworks performing grouping/classification.
3. Prediction
Data prediction is a two-step process, similar to that of data classification. However, for prediction, we do not use the term “class label attribute” because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered).
4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes, clustering
analyzes data objects without consulting an identified class label. In general, the class labels do not exist
in the training data simply because they are not known to begin with.
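A minimal clustering sketch follows: k-means groups unlabeled points into clusters without any class labels being supplied. The two synthetic blobs and k = 2 are assumed for illustration.

```python
# A minimal clustering sketch: k-means groups unlabeled points without any
# class labels being supplied (synthetic 2-D data, assumed for illustration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),   # one blob
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),   # another blob
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster centers:\n", kmeans.cluster_centers_)
print("First five labels:", kmeans.labels_[:5])
```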
5. Regression
Regression can be defined as a statistical modeling method in which previously obtained data are used to predict a continuous quantity for new observations. This classifier is also known as the continuous-value classifier. There are two types of regression models: linear regression and multiple linear regression models.
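A minimal linear-regression sketch follows; the synthetic relationship y ≈ 3x + 2 is assumed purely to show how a continuous value is predicted.

```python
# A minimal linear-regression sketch with scikit-learn on synthetic data
# (the relationship y ≈ 3x + 2 is assumed purely for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X, y)
print("Slope:", reg.coef_[0], "Intercept:", reg.intercept_)
print("Prediction at x=4:", reg.predict([[4.0]])[0])   # close to 14
```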
6. Artificial Neural network (ANN) Classifier Method
An artificial neural network (ANN), also referred to simply as a “neural network” (NN), is a computational model inspired by biological neural networks. It consists of an interconnected collection of artificial neurons. A neural network is a set of connected input/output units where each connection has a weight associated with it.
The advantages of neural networks include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute to the usefulness of neural networks for classification in data mining.
7. Outlier Detection
A database may contain data objects that do not comply with the general behavior or model of the data.
These data objects are Outliers. The investigation of OUTLIER data is known as OUTLIER MINING.
An outlier may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects having only a small fraction of “close” neighbors in space are considered outliers. Rather than using statistical or distance measures, deviation-based techniques identify outliers by examining differences in the principal characteristics of objects in a group. A small z-score-based sketch follows below.
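A minimal statistical outlier-detection sketch follows, flagging values whose z-score exceeds a threshold; the sample values and the threshold of 2 are assumed for illustration.

```python
# A minimal statistical outlier-detection sketch: flag values whose z-score
# (distance from the mean in standard deviations) exceeds a threshold.
import numpy as np

data = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7])  # assumed sample
z = (data - data.mean()) / data.std()

threshold = 2.0
outliers = data[np.abs(z) > threshold]
print("Outliers:", outliers)   # 25.0 stands out from the rest
```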
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, using historical data to direct the search into the region of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization and search problems.
Similarity Measures
Similarity measures are mathematical functions used to determine the degree of similarity between
two data points or objects. These measures produce a score that indicates how similar or alike the
two data points are.
It takes two data points as input and produces a similarity score as output, typically ranging from 0
(completely dissimilar) to 1 (identical or perfectly similar).
A similarity measure can be based on various mathematical techniques such as Cosine similarity,
Jaccard similarity, and Pearson correlation coefficient.
Similarity measures are generally used to identify duplicate records, equivalent instances, or clusters.
Similarity measures also have some well-known properties -
o sim(A, B) = 1 (or maximum similarity) only if A = B
o Typical range - 0 ≤ sim ≤ 1
o Symmetry - sim(A, B) = sim(B, A) for all A and B
Now let’s explore a few of the most commonly used similarity measures in data mining.
Cosine Similarity
Cosine similarity is a widely used similarity measure in data mining and information retrieval. It measures
the cosine of the angle between two non-zero vectors in a multi-dimensional space. In the context of data
mining, these vectors represent the feature vectors of two data points. The cosine similarity score ranges
from 0 to 1, with 0 indicating no similarity and 1 indicating perfect similarity.
The cosine similarity between two vectors is calculated as the dot product of the vectors divided by the product of their magnitudes. This calculation can be represented mathematically as follows -
cos_sim(A, B) = (A · B) / (||A|| × ||B||)
where A and B are the feature vectors of two data points, “·” denotes the dot product, and “|| ||” denotes the magnitude of a vector.
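A minimal cosine-similarity sketch in Python follows; the two feature vectors are assumed for illustration.

```python
# A minimal cosine-similarity sketch with NumPy on two assumed feature vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of a and b divided by the product of their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(A, B))   # 1.0: the vectors point in the same direction
```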
Jaccard Similarity
The Jaccard similarity is another widely used similarity measure in data mining, particularly in text analysis
and clustering. It measures the similarity between two sets of data by calculating the ratio of the intersection
of the sets to their union. The Jaccard similarity score ranges from 0 to 1, with 0 indicating no similarity and
1 indicating perfect similarity.
The Jaccard similarity between two sets A and B is calculated as follows -
J(A, B) = |A ∩ B| / |A ∪ B|
where |A ∩ B| is the size of the intersection of sets A and B, and |A ∪ B| is the size of the union of sets A and B.
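A minimal Jaccard-similarity sketch follows; the two small term sets are assumed for illustration.

```python
# A minimal Jaccard-similarity sketch on two assumed sets of terms.
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

doc1 = {"data", "mining", "is", "fun"}
doc2 = {"data", "mining", "is", "useful"}
print(jaccard_similarity(doc1, doc2))   # 3 shared terms out of 5 total -> 0.6
```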
Pearson Correlation Coefficient
The Pearson correlation coefficient is a widely used similarity measure in data mining and statistical
analysis. It measures the linear correlation between two continuous variables, X and Y. The Pearson
correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative correlation, 0 indicating no
correlation, and +1 indicating a perfect positive correlation. The Pearson correlation coefficient is commonly
used in data mining applications such as feature selection and regression analysis. It can help identify
variables that are highly correlated with each other, which can be useful for reducing the dimensionality of a
dataset. In regression analysis, it can also be used to predict the value of one variable based on the value of
another variable.
The Pearson correlation coefficient between two variables, X and Y, is calculated as follows -
r(X, Y) = cov(X, Y) / (σX × σY)
where cov(X, Y) is the covariance between variables X and Y, and σX and σY are the standard deviations of variables X and Y, respectively.
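A minimal Pearson-correlation sketch follows; the paired samples are assumed so that the coefficient comes out close to +1.

```python
# A minimal Pearson-correlation sketch with NumPy on assumed paired samples.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly y = 2x, so r should be near +1

r = np.corrcoef(x, y)[0, 1]               # covariance divided by the std. devs.
print(round(r, 3))
```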
Sørensen-Dice Coefficient
The Sørensen-Dice coefficient, also known as the Dice similarity index or Dice coefficient, is a similarity
measure used to compare the similarity between two sets of data, typically used in the context of text or
image analysis. The coefficient ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfect
similarity. The Sørensen-Dice coefficient is commonly used in text analysis to compare the similarity
between two documents based on the set of words or terms they contain. It is also used in image analysis to
compare the similarity between two images based on the set of pixels they contain.
The Sørensen-Dice coefficient between two sets, A and B, is calculated as follows -
Dice(A, B) = 2 |A ∩ B| / (|A| + |B|)
Euclidean Distance
Description: Euclidean distance is the straight-line distance between points in a multi-dimensional space. It is intuitive and widely used in many applications, especially when the features are continuous and the scale is consistent across dimensions.
Applications: It is commonly used in clustering algorithms such as k-means, and in nearest-neighbor searches.
Applications of Similarity Measures
1. Clustering
Clustering involves grouping a set of objects such that objects in the same group (or cluster) are more similar to each other than to those in other groups. Similarity measures play an essential role in defining these groups.
2. Classification
Classification assigns a label to a new data point based on the characteristics of known, labeled data points. Similarity measures help determine the label by comparing the new point to existing points.
3. Information Retrieval
Information retrieval systems, such as search engines, rely on similarity measures to rank documents based on their relevance to a query.
4. Recommendation Systems
Recommendation systems suggest items to users based on their preferences and behavior, often using similarity measures to find items or users that are alike.
5. Anomaly Detection
Anomaly detection identifies outliers or unusual data points that differ substantially from the majority of the data.
6. Natural Language Processing (NLP)
In NLP, similarity measures are used to compare text data, assisting in tasks such as document clustering, plagiarism detection, and sentiment analysis.
7. Image Processing
Image processing involves analyzing and manipulating images, where similarity measures are used to compare image features.
8. Bioinformatics
In bioinformatics, similarity measures help compare biological data, such as genetic sequences or protein structures.
Neural Network Architecture
o Input Layer: The input layer of a neural network consists of neurons corresponding to the features of the dataset. Each neuron represents a feature, and the feature values are fed into the network during training.
o Hidden Layers: Hidden layers are where the network learns and extracts features from the input data. The number of hidden layers and the number of neurons in each layer are crucial aspects of the network architecture and are often determined through experimentation.
o Output Layer: The output layer produces the final predictions or classifications. The number of neurons in this layer depends on the nature of the task: binary classification, multi-class classification, or regression.
4. Training and Optimization:
o Backpropagation: One of the most important algorithms for training neural networks is backpropagation. It iteratively adjusts the weights by following the gradient of the error with respect to those weights. This process is critical in ensuring that the difference between the predicted and actual outputs is minimal.
o Activation Functions: Activation functions introduce nonlinearity into the neural network, enabling it to learn complex relationships. Typical activation functions are the sigmoid, the hyperbolic tangent (tanh), and rectified linear units (ReLU).
o Regularization: Regularization techniques such as dropout and weight decay are applied during training to prevent overfitting. These techniques help the model generalize well to new, unseen data.
o Hyperparameter Tuning: The selection of appropriate hyperparameters, such as the learning rate, batch size, and number of hidden layers, drastically influences the performance of a neural network. Hyperparameter tuning often involves grid search or random search methods, as in the sketch below.
Challenges in Data Mining of Neural Networks:
Despite their effectiveness, neural networks pose certain challenges in the context of data mining:
o Overfitting: Neural networks are prone to memorizing the training data, which leads to poor generalization on new data. Regularization techniques and appropriate validation strategies mitigate this problem.
o Interpretability: Neural networks are often called 'black box' models because it is difficult to explain why particular predictions were made. In domains that require transparency, this lack of interpretability becomes a problem.
o Computational Resources: Training large neural networks is a heavy computational task that
requires strong GPUs or TPUs. This is a limiting factor, especially for small-scale projects or
organizations with limited resources.
Neural Networks in Data Mining:
1. Image and Speech Recognition:
Neural networks, especially convolutional neural networks (CNNs), have transformed image and
speech recognition. This ranges from facial recognition in security systems to voice-controlled
virtual assistants.
2. Financial Fraud Detection:
In financial institutions, neural networks interpret patterns in transaction information to identify
fraudulent activities. They can detect suspicious behavior and signal possibly fraudulent transactions
as they occur.
3. Healthcare and Medical Diagnosis:
In medicine, neural networks process medical images such as X-rays and MRIs to diagnose diseases. They
also help determine the likelihood of patient survival and potential health hazards based on patients' data.
4. Customer Relationship Management (CRM):
Neural networks are used for customer segmentation and personalized marketing in CRM systems. These systems analyze customers' behavior and preferences so that businesses can develop targeted marketing strategies.
5. Natural Language Processing (NLP):
In recent years, recurrent neural networks (RNNs) and transformer models have significantly changed natural language processing tasks such as language translation, sentiment analysis, and chatbots.
Future Trends and Developments:
1. Explainable AI (XAI):
In response to the interpretability challenge, Explainable AI (XAI) seeks to increase transparency
and comprehensibility in neural networks. Researchers are currently working to create techniques
that give knowledge about how complex models make decisions.
2. Transfer Learning:
Transfer learning refers to pre-training neural networks on one task and subsequently fine-tuning them on another, closely related task. This approach has proven effective in enhancing neural networks' efficiency and performance, especially where labeled data is limited.
3. Edge Computing:
Neural networks can be integrated with edge computing devices, thus enabling real-time data
processing at the source. This reduces the extensive transmission of data to centralized servers,
which is beneficial in applications such as IoT and autonomous systems.
Genetic Algorithm
Before understanding the Genetic algorithm, let's first understand basic terminologies to better understand
this algorithm:
Population: Population is the subset of all possible or probable solutions, which can solve the given
problem.
Chromosomes: A chromosome is one of the solutions in the population for the given problem, and a collection of genes makes up a chromosome.
Gene: A gene is an element of the chromosome; a chromosome is divided into different genes.
Allele: Allele is the value provided to the gene within a particular chromosome.
Fitness Function: The fitness function is used to determine the individual's fitness level in the
population. It means the ability of an individual to compete with other individuals. In every iteration,
individuals are evaluated based on their fitness function.
Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring better than the parents. Genetic operators play a role in changing the genetic composition of the next generation.
Foundation of Genetic Algorithms
Genetic algorithms are based on an analogy with the genetic structure and behavior of chromosomes of the
population. Following is the foundation of GAs based on this analogy –
1. Individuals in the population compete for resources and mate
2. Those individuals who are successful (fittest) then mate to create more offspring than others
3. Genes from the “fittest” parents propagate throughout the generations; that is, sometimes parents create offspring that are better than either parent.
4. Thus each successive generation is better suited to its environment.
Search space
The population of individuals is maintained within the search space. Each individual represents a solution in the search space for the given problem. Each individual is coded as a finite-length vector (analogous to a chromosome) of components. These variable components are analogous to genes. Thus a chromosome (individual) is composed of several genes (variable components).
Fitness Score
A fitness score is given to each individual, which shows the ability of that individual to “compete”. Individuals with an optimal (or near-optimal) fitness score are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. Individuals with better fitness scores are given more chances to reproduce than others: the selected individuals mate and produce better offspring by combining the chromosomes of the parents. The population size is static, so room has to be created for new arrivals; some individuals die and are replaced by new arrivals, eventually creating a new generation once all the mating opportunities of the old population are exhausted. It is hoped that over successive generations better solutions will arrive while the least fit die out.
Each new generation has, on average, more “good genes” than the individuals (solutions) of previous generations, and thus better “partial solutions”. Once the offspring produced show no significant difference from the offspring produced by previous populations, the population has converged, and the algorithm is said to have converged to a set of solutions for the problem.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves the generation using following operators –
1) Selection Operator: The idea is to give preference to the individuals with good fitness scores and allow
them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are selected using the selection operator, and crossover sites are chosen randomly. The genes at these crossover sites are then exchanged, creating a completely new individual (offspring).
3) Mutation Operator: The key idea is to insert random genes into offspring to maintain diversity in the population and avoid premature convergence. A small sketch combining these three operators is shown below.
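A minimal genetic-algorithm sketch follows, combining selection, crossover, and mutation to maximize a toy fitness function (the number of 1-bits in a bit string); all parameters are illustrative choices.

```python
# A minimal genetic-algorithm sketch combining selection, crossover, and
# mutation to maximize a toy fitness function (the number of 1-bits in a
# fixed-length bit string); all parameters here are illustrative choices.
import random

random.seed(0)
GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(chrom):          # fitness score: count of 1-bits
    return sum(chrom)

def select(pop):             # tournament selection: keep the fitter of two
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):       # single-point crossover between two parents
    point = random.randrange(1, GENES)
    return p1[:point] + p2[point:]

def mutate(chrom):           # flip each gene with a small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("Best fitness:", fitness(best), "of", GENES)
```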