Fund Data Science
Data mining refers to filtering, sorting, and classifying data from larger datasets to reveal subtle
patterns and relationships, which helps enterprises identify and solve complex business
problems through data analysis. Data mining software tools and techniques allow
organizations to foresee future market trends and make business-critical decisions at crucial
times.
Data mining is an essential component of data science that employs advanced data analytics to
derive insightful information from large volumes of data. Digging deeper, data mining is a
crucial ingredient of the knowledge discovery in databases (KDD) process, in which data
gathering, processing, and analysis take place at a fundamental level. Businesses rely heavily
on data mining to undertake analytics initiatives in the organizational setup. The analyzed data
sourced from data mining is used for varied analytics and business intelligence (BI) applications,
which draw on real-time data analysis alongside historical information.
With top-notch data mining practices, enterprises can shape their business strategies and
manage their operations better. This can entail refining customer-centric functions, including
advertising, marketing, sales, customer support, finance, HR, etc. Data mining also plays a vital
role in handling business-critical use cases such as cybersecurity planning, fraud detection, risk
management, and several others. Data mining finds applications across industry verticals such
as healthcare, scientific research, sports, governmental projects, etc.
Working:
Data mining is predominantly handled by a group of data scientists, skilled BI professionals,
analytics groups, business analysts, tech-savvy executives, and personnel having a solid
background and inclination toward data analytics.
Fundamentally, machine learning (ML), artificial intelligence (AI), statistical analysis, and data
management are crucial elements of data mining that are necessary to scrutinize, sort, and
prepare data for analysis. Top ML algorithms and AI tools have enabled the easy mining of
massive datasets, including customer data, transactional records, and even log files picked up
from sensors, actuators, IoT devices, mobile apps, and servers.
Data gathering: Data mining begins with the data gathering step, where relevant
information is identified, collected, and organized for analysis. Data sources can
include data warehouses, data lakes, or any other source that contains raw data in a
structured or unstructured format.
Data preparation: In the second step, fine-tuning the gathered data is the prime focus.
This involves several processes, such as data pre-processing, data profiling, and data
cleansing, to fix any data errors. These stages are essential to maintain data quality before
following up with the mining and analysis processes.
Mining the data: In the third step, once data of the desired quality has been prepared, the
data professional selects an appropriate data mining technique. A suitable set of data
processing algorithms is identified, and the model is first trained on sample data before being
run over the entire dataset.
Data analysis and interpretation: In the last step, the results derived in the third
step are used to develop analytical models for making future business decisions. Moreover,
the data science team communicates the results to the concerned stakeholders via data
visualizations and other more straightforward techniques. The information is conveyed in
a manner that makes the content digestible for any non-expert working outside the field
of data science.
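As a minimal end-to-end sketch of these four steps (a Python example using pandas and scikit-learn; the churn dataset, column names, and model choice are invented purely for illustration):

    # Minimal sketch of the four data mining steps on an illustrative, synthetic dataset.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # 1. Data gathering: in practice this would come from a warehouse, lake, or log files.
    data = pd.DataFrame({
        "monthly_spend": [20, 35, 50, 15, 80, 60, 25, 90, 40, 70],
        "support_calls": [5, 1, 0, 7, 0, 1, 6, 0, 2, 1],
        "churned":       [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    })

    # 2. Data preparation: basic cleansing (drop duplicates, fill missing numeric values).
    data = data.drop_duplicates().fillna(data.mean(numeric_only=True))

    # 3. Mining the data: train a model on a sample before applying it more widely.
    X, y = data[["monthly_spend", "support_calls"]], data["churned"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

    # 4. Analysis and interpretation: evaluate and report the result to stakeholders.
    print("Accuracy on held-out data:", accuracy_score(y_test, model.predict(X_test)))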
KDD (Knowledge Discovery in Databases) is a field of computer science, which includes the
tools and theories to help humans in extracting useful and previously unknown information
(i.e., knowledge) from large collections of digitized data. KDD consists of several steps, and
Data Mining is one of them. Data Mining is the application of a specific algorithm to extract
patterns from data. Nonetheless, the terms KDD and Data Mining are often used interchangeably.
KDD
KDD is a computer science field specializing in extracting previously unknown and interesting
information from raw data. KDD is the whole process of trying to make sense of data by
developing appropriate methods or techniques. This process deals with mapping low-level data
into other forms that are more compact, abstract, and useful. This is achieved by creating short
reports, modelling the process that generates the data, and developing predictive models that
can predict future cases.
Due to the exponential growth of data, especially in areas such as business, KDD has become a
very important process for converting this large wealth of data into business intelligence, as
manual extraction of patterns has become practically impossible over the past few decades.
For example, it is currently used for various applications such as social network analysis, fraud
detection, science, investment, manufacturing, telecommunications, data cleaning, sports,
information retrieval, and marketing. KDD is usually used to answer questions such as which
products are likely to yield high profit next year at V-Mart.
1. Goal identification: Develop and understand the application domain and the relevant
prior knowledge and identify the KDD process's goal from the customer perspective.
2. Creating a target data set: Selecting a data set, or focusing on a subset of variables or
data samples, on which discovery is to be performed.
3. Data cleaning and pre-processing: Removing noise and outliers, handling missing values,
and accounting for known changes so that the data is reliable enough for the later steps.
4. Data reduction and projection: Finding useful features to represent the data depending
on the purpose of the task. The effective number of variables under consideration may
be reduced through dimensionality reduction or transformation methods, or invariant
representations for the data can be found.
5. Matching process objectives: Matching the goals of the KDD process defined in step 1 to
a particular data mining method, for example, summarization, classification, regression,
clustering, and others.
6. Modelling and exploratory analysis and hypothesis selection: Choosing the data mining
algorithm(s) and selecting the method(s) to be used for searching for data patterns. This
process includes deciding which models and parameters may be appropriate (e.g., models
for categorical data differ from models on real-valued vectors) and matching a particular
data mining method with the overall criteria of the KDD process (for example, the
end-user might be more interested in understanding the model than in its predictive
capabilities).
7. Data Mining: Searching for patterns of interest in a particular representational form or
a set of such representations, including classification rules or trees, regression, and
clustering. The user can significantly aid the data mining method by carrying out the
preceding steps properly.
Data Mining
Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction
of implicit, previously unknown, and potentially useful information from data stored in
databases.
Data Mining is only a step within the overall KDD process. There are two major Data Mining goals,
defined by the application's purpose: verification or discovery. Verification verifies the user's
hypothesis about the data, while discovery automatically finds interesting patterns.
There are four major data mining tasks: clustering, classification, regression, and association
(summarization). Clustering is identifying similar groups from unstructured data. Classification
is learning rules that can be applied to new data. Regression is finding functions with minimal
error to model the data. Association looks for relationships between variables. Then, the
specific data mining algorithm needs to be selected. Different algorithms like linear regression,
logistic regression, decision trees, and Naive Bayes can be selected depending on the goal.
Patterns of interest are then searched for in one or more symbolic forms. Finally, models are
evaluated on either predictive accuracy or understandability.
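As a brief sketch of selecting between candidate algorithms and judging them by predictive accuracy (scikit-learn on its bundled iris sample data; the two candidate models are arbitrary choices for illustration):

    # Compare two candidate algorithms by cross-validated predictive accuracy.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(max_depth=3),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(name, "mean accuracy:", round(scores.mean(), 3))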
The volume of information we must handle is increasing every day, coming from business transactions,
scientific data, sensor data, pictures, videos, etc. So, we need a system capable of
extracting the essence of the available information and automatically generating reports,
views, or summaries of the data for better decision-making.
Although the two terms KDD and Data Mining are heavily used interchangeably, they refer to two
related yet slightly different concepts.
KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside
the KDD process, which deals with identifying patterns in data. And Data Mining is only the
application of a specific algorithm based on the overall goal of the KDD process.
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined,
and new data can be integrated and transformed to get different and more appropriate results.
2. Data Mining:
1. Data Mining is the process of discovering patterns, correlations, anomalies, and insights
from large datasets.
2. It involves using various techniques from statistics, machine learning, and artificial
intelligence to analyze data and extract valuable information.
3. Data Mining helps in making predictions, identifying trends, segmenting data, and
making data-driven decisions.
4. It is often used for tasks such as customer segmentation, market basket analysis, fraud
detection, and predictive maintenance.
5. Data Mining is closely related to disciplines like machine learning, data analytics, and
business intelligence.
While DBMS focuses on the storage and management of data, Data Mining focuses on
analyzing and extracting useful information from that data. DBMS provides the infrastructure
and tools necessary for storing and accessing data efficiently, while Data Mining techniques
operate on top of this data to derive insights and knowledge. In many cases, Data Mining
utilizes data stored within a DBMS, making them complementary technologies in the broader
realm of data management and analysis.
Drawing on various methods and technologies from the intersection of machine learning,
database management, and statistics, professionals in data mining have devoted their careers
to better understanding how to process huge amounts of data and draw conclusions from
them. But what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed and
used, including association, classification, clustering, prediction, sequential patterns, and
regression.
1. Classification of Data mining frameworks as per the type of data sources mined: This
classification is as per the type of data handled. For example, multimedia, spatial data,
text data, time-series data, World Wide Web, and so on.
2. Classification of data mining frameworks as per the database involved: This
classification is based on the data model involved, for example, object-oriented database,
transactional database, relational database, and so on.
3. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining
functionalities, for example, discrimination, classification, clustering, characterization,
etc. Some frameworks are extensive, offering several data mining functionalities together.
4. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks,
machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented
or database-oriented, etc.
The classification can also take into account the level of user interaction involved in the data
mining procedure, such as query-driven systems, autonomous systems, or interactive
exploratory systems.
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data
by a few clusters inevitably loses certain fine details but achieves simplification: the data is
modelled by its clusters. Historically, data modelling by clustering is rooted in statistics,
mathematics, and numerical analysis. From a machine learning point of view, clusters
correspond to hidden patterns, the search for clusters is unsupervised learning, and the
resulting framework represents a data concept. From a practical point of view, clustering plays
an outstanding role in data mining applications, for example, scientific data exploration, text
mining, information retrieval, spatial database applications, CRM, web analysis, computational
biology, medical diagnostics, and much more.
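A small clustering sketch (k-means from scikit-learn on synthetic two-dimensional points; the data and the choice of two clusters are assumptions for illustration):

    # Group unlabeled points into clusters with k-means (unsupervised learning).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two illustrative groups of 2-D points around different centres.
    points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print("Cluster centres:", kmeans.cluster_centers_)
    print("First five labels:", kmeans.labels_[:5])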
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship
between variables in the presence of other factors. It is used to estimate the likely value of a
specific variable. Regression is primarily a form of planning and modelling. For example, we
might use it to project certain costs, depending on other factors such as availability, consumer
demand, and competition. Primarily, it gives the exact relationship between two or more
variables in the given data set.
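A short regression sketch echoing the cost-projection example above (scikit-learn; the 'demand' and 'cost' figures are made up for illustration):

    # Fit a linear relationship between two variables and use it for projection.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    demand = np.array([[10], [20], [30], [40], [50]])   # explanatory factor
    cost = np.array([120, 190, 270, 340, 410])          # variable to be modelled

    reg = LinearRegression().fit(demand, cost)
    print("Slope:", reg.coef_[0], "Intercept:", reg.intercept_)
    print("Projected cost at demand 60:", reg.predict([[60]])[0])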
4. Association Rules:
This data mining technique helps to discover links between two or more items and finds
hidden patterns in the data set.
Association rules are if-then statements that help show the probability of interactions
between data items within large data sets in different types of databases. Association rule
mining has several applications and is commonly used to discover sales correlations in
transactional data or in medical data sets.
The way the algorithm works is that you start with a body of data, for example, a list of grocery
items that you have been buying for the last six months, and it calculates the percentage of
items being purchased together.
Lift: This measurement technique captures how much the confidence exceeds what would be
expected from how often item B is purchased on its own.
Lift = (Confidence) / ((Item B) / (Entire dataset))
Support: This measurement technique measures how often the items are purchased together,
compared to the overall dataset.
Support = (Item A + Item B) / (Entire dataset)
Confidence: This measurement technique measures how often item B is purchased
when item A is purchased as well.
Confidence = (Item A + Item B) / (Item A)
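These three measures can be computed directly from a transaction list; in the sketch below the grocery transactions and the rule "bread → butter" are invented for illustration:

    # Compute support, confidence and lift for the rule "bread -> butter".
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"butter", "milk"},
        {"bread", "butter", "jam"},
    ]
    n = len(transactions)
    count_a = sum("bread" in t for t in transactions)               # item A
    count_b = sum("butter" in t for t in transactions)              # item B
    count_ab = sum({"bread", "butter"} <= t for t in transactions)  # A and B together

    support = count_ab / n                 # (Item A + Item B) / (Entire dataset)
    confidence = count_ab / count_a        # (Item A + Item B) / (Item A)
    lift = confidence / (count_b / n)      # Confidence / ((Item B) / (Entire dataset))
    print("support", support, "confidence", confidence, "lift", lift)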
5. Outlier detection:
This type of data mining technique relates to observing data items in the data set that do not
match an expected pattern or expected behavior. It may be used in various domains such as
intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining.
An outlier is a data point that diverges too much from the rest of the dataset, and the majority
of real-world datasets contain outliers. Outlier detection plays a significant role in the data
mining field and is valuable in numerous areas such as network intrusion identification, credit
or debit card fraud detection, and detecting outlying values in wireless sensor network data.
6. Sequential Patterns:
This data mining technique helps to discover or recognize similar patterns in transaction data
over a period of time.
7. Prediction:
Prediction uses a combination of other data mining techniques, such as trend analysis,
clustering, and classification. It analyzes past events or instances in the right sequence to
predict a future event.
Data mining, despite its immense potential for extracting valuable insights from data, faces
various problems, issues, and challenges:
1. Data Quality: Poor quality data can significantly impact the results of data mining efforts.
Issues such as missing values, outliers, inaccuracies, and inconsistencies can distort
patterns and lead to erroneous conclusions. Cleaning and preprocessing the data to
ensure its quality is a critical challenge.
2. Scalability: With the exponential growth of data volumes, scalability becomes a significant
concern. Traditional data mining algorithms may struggle to handle large datasets
efficiently. Developing scalable algorithms and leveraging distributed computing
frameworks are essential to process big data effectively.
3. Complexity of Data: Modern datasets are often complex, with high dimensionality,
heterogeneity, and varying structures. Dealing with such diverse data types, including
structured, semi-structured, and unstructured data, poses challenges in data
representation, integration, and analysis.
4. Privacy and Security: Data mining raises privacy and security concerns, particularly when
dealing with sensitive or personal information. Protecting the privacy of individuals and
ensuring data security are paramount, requiring robust encryption, access controls, and
anonymization techniques.
5. Interpretability and Explainability: Complex data mining models, such as deep neural
networks, often lack interpretability, making it challenging to understand how they arrive
at their predictions or decisions. Ensuring model transparency and explainability is crucial,
especially in applications with legal, ethical, or regulatory implications.
6. Bias and Fairness: Data mining models can inherit biases present in the training data,
leading to unfair or discriminatory outcomes. Addressing bias and ensuring fairness in data
mining processes require careful consideration of the data collection process, feature
selection, and model evaluation.
7. Overfitting and Generalization: Overfitting occurs when a model learns to memorize the
training data instead of generalizing from it, leading to poor performance on unseen data.
10. Ethical and Regulatory Compliance: Data mining activities must comply with ethical
guidelines and legal regulations governing data usage, privacy, and consent. Ensuring
ethical conduct, obtaining informed consent, and respecting data ownership rights are
essential considerations in data mining projects.
Addressing these problems, issues, and challenges requires interdisciplinary expertise, including
data science, computer science, statistics, domain knowledge, and ethical considerations.
Collaboration between data scientists, domain experts, policymakers, and ethicists is essential
to develop responsible and effective data mining solutions.
Data mining (DM) finds application across various industries and domains, helping organizations
derive valuable insights from their data. Here are some common applications of data mining:
2. Customer Relationship Management (CRM): Data mining plays a crucial role in CRM by
segmenting customers based on their preferences, purchase history, and behavior. It
enables personalized marketing campaigns, customer retention strategies, and targeted
product recommendations, ultimately leading to improved customer satisfaction and
loyalty.
3. Fraud Detection and Prevention: Data mining techniques are employed in fraud detection
across various industries, including finance, insurance, and e-commerce. By analyzing
transactional data and detecting anomalies or suspicious patterns, organizations can
identify fraudulent activities and take proactive measures to prevent financial losses.
4. Healthcare Analytics: In healthcare, data mining helps analyze electronic health records
(EHRs), medical imaging data, and clinical data to improve patient care, optimize
treatment plans, and predict disease outcomes. It facilitates medical diagnosis, drug
discovery, patient monitoring, and healthcare resource allocation.
5. Market Basket Analysis: Market basket analysis is a data mining technique used in retail
and e-commerce to understand the associations between products purchased together by
customers, for example to guide product placement, promotions, and cross-selling.
6. Supply Chain Management: Data mining aids supply chain management by analyzing
supply chain data, demand forecasts, inventory levels, and logistics information. It helps
optimize inventory management, streamline logistics operations, identify supply chain
risks, and improve overall supply chain efficiency.
8. Social Media Analysis: Data mining techniques are applied in social media analytics to
extract insights from social media platforms such as Twitter, Facebook, and Instagram.
Organizations use sentiment analysis, topic modeling, and social network analysis to
understand customer sentiments, track brand perception, and identify influencers.
9. Text Mining and Natural Language Processing (NLP): Text mining and NLP techniques
analyze unstructured textual data from sources such as emails, customer reviews, and
documents. It facilitates tasks such as document categorization, sentiment analysis,
information extraction, and entity recognition, enabling organizations to derive actionable
insights from text data.
These are just a few examples of the diverse applications of data mining across industries. As data
continues to grow in volume and complexity, the importance of data mining in extracting
meaningful insights and driving informed decision-making will continue to increase.
Data Warehouse
A Data Warehouse is a relational database management system (RDBMS) construct that can be
loosely described as any centralized data repository which can be queried for business benefit.
In contrast to transaction processing systems, it is a database that stores information oriented
to satisfying decision-making requests. It is a group of decision support technologies aimed at
enabling knowledge workers (executives, managers, and analysts) to make better and faster
decisions. Data warehousing therefore supports architectures and tools for business executives
to systematically organize, understand, and use their information to make strategic decisions.
A Data Warehouse environment contains an extraction, transformation, and loading (ETL)
solution, an online analytical processing (OLAP) engine, customer analysis tools, and other
applications that handle the process of gathering information and delivering it to business
users.
A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data
from single and multiple sources.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users. It is not used for daily operations and transaction processing but is used for
making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
It is a database designed for investigative tasks, using data from various applications.
It supports a relatively small number of clients with relatively long interactions.
It includes current and historical data to provide a historical perspective of information.
Its usage is read-intensive.
It contains a few large tables.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view around a particular
subject, such as customer, product, or sales, instead of the organization's ongoing operations
as a whole.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among
different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a
transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, which is transformed from the
source operational RDBMS. Operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: the initial loading of data and access to data. Therefore,
the DW does not require transaction processing, recovery, or concurrency control capabilities,
which allows for a substantial speedup of data retrieval. Non-volatile means that once data
has entered the warehouse, it should not change.
The idea of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin
and Paul Murphy established the "Business Data Warehouse."
In essence, the data warehousing idea was planned to support an architectural model for the
flow of information from operational systems to decision support environments. The concept
attempted to address the various problems associated with this flow, mainly the high costs
associated with it.
In the absence of a data warehousing architecture, a vast amount of redundant storage was
required to support multiple decision support environments. In large corporations, it was
common for various decision support environments to operate independently.
2. Store historical data: A Data Warehouse is required to store time-variant data from the
past. This input can be used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data
warehouse, so the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing the data from different sources to a
common place, the user can effectively bring uniformity and consistency to the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected loads and
types of queries, which demands a significant degree of flexibility and quick response time.
The following stages should be followed by every project for building a Multi Dimensional
Data Model :
Stage 1 :
Assembling data from the client : In the first stage, a Multi Dimensional Data Model collects
the correct data from the client. Mostly, software professionals make clear to the client the
range of data that can be gained with the selected technology and collect the complete data
in detail.
Stage 2 :
Grouping different segments of the system : In the second stage, the Multi Dimensional
Data Model recognizes and classifies all the data into the respective sections they belong to,
which makes the model simpler to apply step by step.
Stage 3 :
Noticing the different proportions : The third stage forms the basis on which the design of
the system rests. In this stage, the main factors are recognized according to the user’s
point of view. These factors are also known as “Dimensions”.
Stage 4 :
Preparing the actual-time factors and their respective qualities : In the fourth stage, the
factors which are recognized in the previous step are used further for identifying the related
qualities. These qualities are also known as “attributes” in the database.
Stage 6 :
Building the Schema to place the data, with respect to the information collected from the
steps above : In the sixth stage, on the basis of the data which was collected previously, a
Schema is built.
For Example :
1. Let us take the example of a firm. The revenue cost of a firm can be recognized on the
basis of different factors such as geographical location of firm’s workplace, products of
the firm, advertisements done, time utilized to flourish a product, etc.
Example 1
2. Let us take the example of the data of a factory which sells products per quarter in
Bangalore. The data is represented in the table given below :
2D factory data
In the presentation above, the factory's sales for Bangalore are shown with respect to the time
dimension, which is organized into quarters, and the item dimension, which is organized
according to the kind of item sold. The facts here are represented in rupees (in thousands).
Now, if we desire to view the sales data in a three-dimensional table, it can be represented as
in the diagram given below, where the 3D sales data is shown in 2D form:
3D data representation as 2D
This data can be represented in the form of three dimensions conceptually, which is shown in
the image below :
3D data representation
1. Measures: Measures are numerical data that can be analyzed and compared, such as
sales or revenue. They are typically stored in fact tables in a multidimensional data model.
2. Dimensions: Dimensions are attributes that describe the measures, such as time,
location, or product. They are typically stored in dimension tables in a multidimensional
data model
3. Cubes: Cubes are structures that represent the multidimensional relationships between
measures and dimensions in a data model. They provide a fast and efficient way to
retrieve and analyze data.
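As a rough sketch, a cube-like view can be assembled from fact rows with a pandas pivot table; the quarters, items, and rupee figures below are invented, loosely echoing the factory example above:

    # Build a simple measures-by-dimensions view (item x quarter) from fact rows.
    import pandas as pd

    facts = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"],
        "item":    ["mobile", "modem", "mobile", "modem",
                    "mobile", "modem", "mobile", "modem"],
        "sales":   [605, 825, 680, 952, 812, 1023, 927, 1038],  # rupees (in thousands)
    })

    cube_2d = facts.pivot_table(values="sales", index="item",
                                columns="quarter", aggfunc="sum")
    print(cube_2d)  # a 2-D slice; adding a "location" column would give a 3-D cube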
Data Cleaning
Data cleaning is a crucial aspect of data warehousing, ensuring that the data stored in the data
warehouse is accurate, consistent, and reliable. Here's how data cleaning is typically performed
in the context of data warehousing:
1. Identifying Data Quality Issues: The first step in data cleaning is to identify data quality
issues within the data warehouse. Common issues include missing values, incorrect data
types, duplicates, inconsistencies, and outliers.
2. Data Profiling: Data profiling involves analyzing the structure, content, and quality of data
within the data warehouse. It helps identify patterns, anomalies, and potential data quality
problems that need to be addressed during the cleaning process.
4. Handling Missing Values: Missing values are common in datasets and can adversely affect
data analysis and decision-making. Data cleaning techniques such as imputation (replacing
missing values with estimated values based on statistical methods) or deletion (removing
records with missing values) are used to handle missing data (a short code sketch of this and
the following step appears after this list).
5. Removing Duplicates: Duplicate records can skew analysis results and waste storage space
in the data warehouse. Data cleaning involves identifying and removing duplicate records
based on key attributes or combinations of attributes.
6. Correcting Errors: Data cleaning may involve identifying and correcting errors in the data,
such as typos, inconsistencies, and inaccuracies. Techniques such as data validation rules,
referential integrity checks, and data cleansing tools are used to identify and rectify errors.
7. Data Transformation: Data transformation involves converting data from one format or
structure to another to ensure compatibility and consistency within the data warehouse.
This may include data normalization, aggregation, or restructuring to meet specific
requirements for analysis and reporting.
9. Automating Data Cleaning Processes: Data cleaning processes can be automated using
tools and software that streamline data profiling, validation, cleansing, and transformation
tasks. Automation reduces manual effort, accelerates data cleaning cycles, and improves
overall data quality.
10. Monitoring and Maintenance: Data cleaning is an ongoing process that requires regular
monitoring and maintenance to ensure data quality over time. Continuous monitoring of
data quality metrics, proactive identification of issues, and timely intervention are essential
for maintaining high-quality data in the data warehouse.
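A minimal sketch of the missing-value handling and duplicate removal described in steps 4 and 5 above (pandas; the table and column names are invented for illustration):

    # Handle missing values by imputation and remove duplicate records.
    import pandas as pd

    records = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "age":         [34, None, None, 45, 29],
        "city":        ["Pune", "Delhi", "Delhi", "Mumbai", "Pune"],
    })

    # Imputation: replace missing ages with the column mean (a simple statistical estimate).
    records["age"] = records["age"].fillna(records["age"].mean())

    # Duplicate removal: drop rows that repeat the same key attributes.
    records = records.drop_duplicates(subset=["customer_id", "city"])
    print(records)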
Data Integration
Data integration is the process of merging data from several disparate sources. While performing
data integration, you must deal with data redundancy, inconsistency, duplication, etc. In data
mining, data integration is a data pre-processing step that merges data from multiple
heterogeneous data sources into a coherent store and provides a unified view of the data.
These sources may include several data cubes, databases, or flat files. The data integration
approach is formally stated as a triple (G, S, M), where G represents the global schema, S
represents the heterogeneous source schemas, and M represents the mapping between source
and global schema queries.
Data integration has been an integral part of data operations because data can be obtained
from several sources. It is a strategy that integrates data from several sources to make it
available to users in a single uniform view that shows their status. The communicating sources
can include multiple databases, data cubes, or flat files. Data fusion merges data from various
diverse sources to produce meaningful results; the consolidated findings must be free of
inconsistencies, contradictions, redundancies, and disparities.
Data integration is important because it gives a uniform view of scattered data while also
maintaining data accuracy. It assists the data mining program in mining meaningful
information, which in turn assists executives and managers in making strategic decisions for
the enterprise's benefit.
The data integration methods are formally characterized as the triple (G, S, M) described above.
Two common approaches are tight coupling and loose coupling:
Tight Coupling : It is the process of using ETL (Extraction, Transformation, and Loading) to
combine data from various sources into a single physical location.
Loose Coupling: With loose coupling, data is kept in the actual source databases. This approach
provides an interface that receives a query from the user, transforms it into a format that the
source databases can understand, and then sends the query to the source databases directly
to obtain the result.
Inconsistencies further increase the level of redundancy within an attribute. Correlation
analysis can be used to detect such redundancy: the attributes are examined to determine
their interdependence on each other, thereby discovering the link between them.
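A small sketch of spotting redundant attributes through correlation analysis (pandas; the merged table with a price recorded in two currencies is an assumption for illustration):

    # Highly correlated attributes are candidates for redundancy after integration.
    import pandas as pd

    merged = pd.DataFrame({
        "price_inr": [100, 250, 400, 800, 1200],
        "price_usd": [1.2, 3.0, 4.8, 9.6, 14.4],  # the same quantity in another unit
        "rating":    [4, 3, 5, 2, 4],
    })
    print(merged.corr())  # price_inr and price_usd correlate at ~1.0, flagging redundancy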
3. Tuple Duplication
Besides redundancy, data integration must also handle duplicate tuples. Duplicate tuples may
appear in the resulting data if a denormalized table was used as a source for data integration.
There are various data integration techniques in data mining. Some of them are
as follows:
1. Manual Integration
This method avoids using automation during data integration. The data analyst collects,
cleans, and integrates the data to produce meaningful information. This strategy is suitable
for a small organization with a limited data set; however, it is time-consuming for huge,
sophisticated, and recurring integrations, because the entire process must be done manually.
2. Middleware Integration
The middleware software is used to take data from many sources, normalize it, and store it
in the resulting data set. When an enterprise needs to integrate data from legacy systems
to modern systems, this technique is used. Middleware software acts as a translator
between legacy and advanced systems; think of it as an adapter that allows two systems
with different interfaces to be connected. It is only applicable to certain systems.
3. Application-based integration
It uses software applications to extract, transform, and load data from disparate sources.
This strategy saves time and effort, but it is a little more complicated because building such
an application requires technical understanding.
5. Data Warehousing
This technique is indirectly related to the uniform access integration technique; the difference
is that the unified view is stored in a separate location, which enables the data analyst to
handle more sophisticated queries. Although it is a promising solution, the unified view or
copy of the data requires its own storage and maintenance, which increases costs.
Integration tools
There are various integration tools in data mining that support these approaches.
Data Transformation
Data transformation changes the format, structure, or values of the data and converts them
into clean, usable data. Data may be transformed at two stages of the data pipeline for data
analytics projects. Organizations that use on-premises data warehouses generally use an ETL
(extract, transform, and load) process, in which data transformation is the middle step. Today,
most organizations use cloud-based data warehouses to scale compute and storage resources
with latency measured in seconds or minutes. The scalability of the cloud platform lets
organizations skip preload transformations and load raw data into the data warehouse, then
transform it at query time.
Data integration, migration, data warehousing, and data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and analytic processes,
and it enables businesses to make better data-driven decisions. The following techniques are
commonly used during data transformation:
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce
any variance or any other noise form.
The concept behind data smoothing is that it identifies simple changes that help predict
trends and patterns. This helps analysts or traders who need to look at a lot of data, which can
often be difficult to digest, to find patterns they would not otherwise see.
Noise can be removed from the data using techniques such as binning, regression, and
clustering:
o Binning: This method splits the sorted data into a number of bins and smooths the
data values in each bin by considering the neighborhood values around them (a short
code sketch follows this list).
o Regression: This method identifies the relation between two attributes so that, given
one attribute, the other can be predicted.
o Clustering: This method groups similar data values into clusters. The values that lie
outside a cluster are known as outliers.
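A minimal sketch of smoothing by bin means (plain NumPy; the values and the choice of four equal-frequency bins are illustrative):

    # Smooth noisy values by replacing each value with the mean of its bin.
    import numpy as np

    values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
    bins = np.array_split(values, 4)   # four equal-frequency (equi-depth) bins
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print(smoothed)  # each value is replaced by the mean of its bin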
2. Attribute Construction
In attribute construction, new attributes are created from the given set of attributes. For
example, suppose we have a data set referring to measurements of different plots, i.e., we
may have the height and width of each plot. Here, we can construct a new attribute 'area'
from the attributes 'height' and 'width'. This also helps in understanding the relations among
the attributes in a data set.
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data
sources into a data analysis description. This is a crucial step since the accuracy of data
analysis insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning
financing or business strategy of the product, pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of
each year. We can aggregate the data to get the enterprise's annual sales report.
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values for
attribute A that are V1, V2, V3, ….Vn.
o Min-max normalization: This method implements a linear transformation on the
original data. Let us consider that we have minA and maxA as the minimum and
maximum values observed for attribute A, and Vi is the value of attribute A that has to
be normalized. Min-max normalization maps Vi to V'i in a new, smaller range
[new_minA, new_maxA].
The formula for min-max normalization is given below:
V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
o Z-score normalization: This method normalizes the value for attribute A using the mean
and standard deviation. The following formula is used for Z-score normalization:
V'i = (Vi - Ᾱ) / σA
Here Ᾱ and σA are the mean and standard deviation for attribute A, respectively.
For example, if the mean and standard deviation for attribute A are $54,000 and
$16,000, then the value $73,600 is normalized by z-score to
(73,600 - 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal
point in the value. This movement of the decimal point depends on the maximum
absolute value of A. The formula for decimal scaling is given below:
V'i = Vi / 10^j
where j is the smallest integer such that max(|V'i|) < 1.
For example, the observed values for attribute A range from -986 to 917, and the
maximum absolute value for attribute A is 986. Here, to normalize each value of
attribute A using decimal scaling, we have to divide each value of attribute A by 1000,
i.e., j=3.
So, the value -986 would be normalized to -0.986, and 917 would be normalized to
0.917.
The normalization parameters, such as the mean, standard deviation, and maximum
absolute value, must be preserved so that future data can be normalized uniformly (a short
code sketch of all three methods follows).
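The three normalization methods above can be sketched in a few lines of Python (NumPy; the sample values are illustrative, with the decimal-scaling case reusing the -986 to 917 range from the text):

    # Min-max, z-score and decimal-scaling normalization of a numeric attribute.
    import numpy as np

    A = np.array([-986.0, 200.0, 917.0, 54.0])

    new_min, new_max = 0.0, 1.0
    min_max = (A - A.min()) / (A.max() - A.min()) * (new_max - new_min) + new_min

    z_score = (A - A.mean()) / A.std()

    j = int(np.ceil(np.log10(np.abs(A).max())))  # here j = 3, since max |value| is 986
    decimal_scaled = A / (10 ** j)

    print(min_max, z_score, decimal_scaled, sep="\n")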
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to
study and analyze. If a data mining task handles a continuous attribute, then its discrete
values can be replaced by constant quality attributes. This improves the efficiency of the
task.
This method is also called a data reduction mechanism as it transforms a large dataset into
a set of categorical data. Discretization also uses decision tree-based algorithms to
produce short, compact, and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, where the
class information is used, and unsupervised discretization, where it is not. Discretization can
also be characterized by the direction in which the process proceeds, i.e., a 'top-down
splitting strategy' or a 'bottom-up merging strategy'.
For example, the values for the age attribute can be replaced by the interval labels such as
(0-10, 11-20…) or (kid, youth, adult, senior).
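A quick sketch of discretizing a continuous age attribute into interval labels (pandas; the cut points and labels mirror the example above):

    # Replace continuous ages with interval labels (discretization).
    import pandas as pd

    ages = pd.Series([3, 9, 15, 24, 37, 52, 68, 80])
    labelled = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                      labels=["kid", "youth", "adult", "senior"])
    print(labelled.tolist())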
6. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches:
o Data cube process (OLAP) approach.
o Attribute-oriented induction (AOI) approach.
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a
higher conceptual level into a categorical value (young, old).
1. Data Discovery: During the first stage, analysts work to understand and identify data in
its source format. To do this, they will use data profiling tools. This step helps analysts
decide what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data mapping is
essential to many data processes, and one misstep can lead to incorrect analysis and ripple
through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source.
These may include structured sources such as databases or streaming sources such as
customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to
write code to complete the transformation. Often, analysts generate this code with the help
of data transformation platforms or tools.
Transforming data can help businesses in a variety of ways. Here are some of the essential
advantages of data transformation, such as:
1. Better Organization: Transformed data is easier for both humans and computers to use.
2. Improved Data Quality: There are many risks and costs associated with bad data. Data
transformation can help your organization eliminate quality issues such as missing values
and other inconsistencies.
3. Perform Faster Queries: You can quickly and easily retrieve transformed data thanks to it
being stored and standardized in a source location.
4. Better Data Management: Businesses are constantly generating data from more and more
sources. If there are inconsistencies in the metadata, it can be challenging to organize and
understand it. Data transformation refines your metadata, so it's easier to organize and
understand.
5. More Use Out of Data: While businesses may be collecting data constantly, a lot of that
data sits around unanalyzed. Transformation makes it easier to get the most out of your
data by standardizing it and making it more usable.
While data transformation comes with a lot of benefits, still there are some challenges to
transforming data effectively, such as:
1. Data transformation can be expensive. The cost is dependent on the specific infrastructure,
software, and tools used to process data. Expenses may include licensing, computing
resources, and hiring necessary personnel.
2. Data transformation processes can be resource-intensive. Performing transformations in an
on-premises data warehouse after loading, or transforming data before feeding it into
applications, can create a computational burden that slows down other operations.
Data can be transformed through several techniques and tools, such as:
1. Scripting: Data transformation through scripting involves using Python or SQL to write the
code that extracts and transforms data. Python and SQL are scripting languages that allow
you to automate certain tasks in a program and to extract information from data sets.
Scripting languages require less code than traditional programming languages, so the
approach is less labour-intensive.
2. On-Premises ETL Tools: ETL tools take the required work to script the data
transformation by automating the process. On-premises ETL tools are hosted on company
servers. While these tools can help save you time, using them often requires extensive
expertise and significant infrastructure costs.
3. Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in the
cloud. These tools are often the easiest for non-technical users to utilize. They allow you to
collect data from any cloud source and load it into your data warehouse. With cloud-based
ETL tools, you can decide how often you want to pull data from your source, and you can
monitor your usage.
Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is a
process that reduces the volume of original data and represents it in a much smaller volume.
Data reduction techniques are used to obtain a reduced representation of the dataset that is
much smaller in volume by maintaining the integrity of the original data.
By reducing the data, the efficiency of the data mining process is improved, which produces
the same analytical results.
Data reduction does not affect the result obtained from data mining; that is, the result
obtained from data mining before and after data reduction is the same, or almost the same.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is
simpler to apply sophisticated and computationally expensive algorithms. The reduction of
the data may be in terms of the number of rows (records) or the number of columns
(dimensions).
1. Dimensionality Reduction
Whenever we encounter weakly relevant data, we keep only the attributes required for our
analysis. Dimensionality reduction eliminates such attributes from the data set under
consideration, thereby reducing the volume of the original data. It reduces the data size as it
eliminates outdated or redundant features. Here are three methods of dimensionality reduction.
i. Wavelet Transform: Here, the data is transformed (for example, by the discrete wavelet
transform) into a vector of coefficients, which can be truncated to retain only the
strongest coefficients and therefore stored more compactly.
ii. Principal Component Analysis: Suppose we have a data set to be analyzed that has
tuples with n attributes. Principal component analysis searches for k orthogonal vectors
of dimension n (with k ≤ n) that can best represent the data. In this way, the original data
can be cast onto a much smaller space, and dimensionality reduction is achieved.
Principal component analysis can be applied to sparse and skewed data.
iii. Attribute Subset Selection: The large data set has many attributes, some of which are
irrelevant to data mining or some are redundant. The core attribute subset selection
reduces the data volume and dimensionality. The attribute subset selection reduces the
volume of data by eliminating redundant and irrelevant attributes. The attribute subset
selection ensures that we get a good subset of original attributes even after eliminating
the unwanted attributes. The resulting probability of data distribution is as close as
possible to the original data distribution using all the attributes.
3. Cluster sample: The tuples in data set D are clustered into M mutually
disjoint subsets. Data reduction can be applied by implementing SRSWOR
(simple random sampling without replacement) on these clusters; a simple
random sample of size s can be drawn from these clusters, where s < M.
4. Stratified sample: The large data set D is partitioned into mutually
disjoint sets called 'strata'. A simple random sample is taken from each
stratum to get stratified data. This method is effective for skewed data.
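A brief sketch of stratified sampling with pandas (the strata, group sizes, and 20 percent fraction are assumptions for illustration):

    # Draw a stratified sample: a simple random sample from each stratum.
    import pandas as pd

    D = pd.DataFrame({
        "stratum": ["A"] * 70 + ["B"] * 20 + ["C"] * 10,  # skewed group sizes
        "value":   range(100),
    })
    stratified = D.groupby("stratum", group_keys=False).sample(frac=0.2, random_state=0)
    print(stratified["stratum"].value_counts())  # roughly 20% drawn from each stratum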
4. Data Compression
This technique reduces the size of the files using different encoding mechanisms, such as
Huffman Encoding and run-length Encoding. We can divide it into two types based on their
compression techniques.
i. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and
minimal data size reduction. Lossless data compression uses algorithms to restore the
precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from
the original data but is still useful enough to retrieve information from. For example,
the JPEG image format uses lossy compression, yet we can still recover an image whose
meaning is equivalent to the original. Methods such as the discrete wavelet transform
and PCA (principal component analysis) are examples of this type of compression.
5. Discretization Operation
The data discretization technique is used to divide attributes of a continuous nature into
data intervals. We replace the many constant values of the attributes with labels of small
intervals, so that the mining results are presented in a concise and easily understandable way.
The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk
space, the less capacity you will need to purchase. Here are some benefits of data reduction,
such as:
Data reduction greatly increases the efficiency of a storage system and directly impacts your
total spending on capacity.
Another example is analytics, where we gather the static data of website visitors. For example,
all visitors who visit the site with the IP address of India are shown under country level.
2. Binning
Binning refers to a data smoothing technique that helps to group a huge number of
continuous values into smaller values. For data discretization and the development of idea
hierarchy, this technique can also be used.
3. Cluster Analysis
Cluster analysis is a commonly used form of data discretization. A clustering algorithm can be
applied to discretize a numerical attribute X by partitioning the values of X into clusters or groups.
Let's understand this concept hierarchy for the dimension location with the help of an example.
A particular city can map with the belonging country. For example, New Delhi can be mapped
to India, and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts with the top with some general information and ends with
the bottom to the specialized information.
Bottom-up mapping
Bottom-up mapping generally starts with the bottom with some specialized information and
ends with the top to the generalized information.
Data discretization is a method of converting attributes values of continuous data into a finite
set of intervals with minimum data loss. In contrast, data binarization is used to transform the
continuous and discrete attributes into binary attributes.
Finding recurrent patterns or item sets in huge datasets is the goal of frequent pattern mining,
a crucial data mining approach. It looks for groups of objects that regularly appear together in
order to expose underlying relationships and interdependence. Market basket analysis, web
usage mining, and bioinformatics are a few areas where this method is important.
The Apriori algorithm, a popular method for finding recurrent patterns, takes a methodical
approach: it generates candidate itemsets, prunes the infrequent ones, and progressively
grows the size of the itemsets until no more frequent itemsets are found. The patterns that
fulfil the required support criteria are identified through this iterative approach.
Apriori Algorithm
The Apriori algorithm is used to calculate the association rules between objects, i.e., how two
or more objects are related to one another. In other words, we can say that the apriori
algorithm is an association rule learning method that analyzes whether people who bought
product A also bought product B.
The primary objective of the apriori algorithm is to create association rules between
different objects. An association rule describes how two or more objects are related to one
another. The Apriori algorithm is a classic algorithm for frequent pattern mining. Generally, you operate the
Apriori algorithm on a database that consists of a huge number of transactions. Let's
understand the apriori algorithm with the help of an example; suppose you go to Big Bazar and
buy different products. It helps the customers buy their products with ease and increases the
sales performance of the Big Bazar. In this tutorial, we will discuss the apriori algorithm with
examples.
Introduction
We take an example to understand the concept better. You must have noticed that the pizza
shop seller often makes a pizza, soft drink, and breadstick combo and offers a discount to
customers who buy the combo. Have you ever wondered why he does so? He thinks that
customers who buy pizza also buy soft drinks and breadsticks, so by making combos he makes
buying easy for the customers and at the same time increases his own sales performance.
Similarly, you go to Big Bazar, and you will find biscuits, chips, and Chocolate bundled together.
It shows that the shopkeeper makes it comfortable for the customers to buy these products in
the same place.
The above two examples are the best examples of Association Rules in Data Mining. It helps us
to learn the concept of apriori algorithms.
The Apriori algorithm is used for mining frequent item sets and the relevant association rules.
Generally, the apriori algorithm operates on a database containing a huge number of
transactions, for example, the items customers buy at a Big Bazar.
Apriori algorithm helps the customers to buy their products with ease and increases the sales
performance of the particular store.
As already discussed above, you need a huge database containing a large number of
transactions. Suppose you have 4,000 customer transactions in a Big Bazar. You have to
calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolates,
because customers frequently buy these two items together. Out of the 4,000 transactions,
400 contain Biscuits and 600 contain Chocolates, and 200 of these transactions include both
Biscuits and Chocolates. Using this data, we will find out the support, confidence, and lift.
1. Support
Support refers to the default popularity of any product. You find the support by dividing
the number of transactions comprising that product by the total number of transactions.
Hence, we get
Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions) = 400/4000 = 10 percent
Support (Chocolates) = (Transactions containing Chocolates) / (Total transactions) = 600/4000 = 15 percent
2. Confidence
Confidence refers to the possibility that the customers bought both biscuits and
chocolates together. So, you need to divide the number of transactions that comprise both
biscuits and chocolates by the total number of transactions to get the confidence.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions
involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.
3. Lift
Considering the above example, lift refers to the increase in the likelihood of selling
chocolates when biscuits are sold. Following the definition given earlier, it is the confidence
divided by the support of chocolates:
Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates) = 50 / 15 ≈ 3.33
It means that customers who buy biscuits are about 3.3 times more likely to buy chocolates
than customers in general. If the lift value is below one, people are unlikely to buy both
items together; the larger the value, the better the combination.
We will understand the algorithm with the help of an example. Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}. The database comprises six transactions, where 1 represents the presence of the product and 0 represents its absence.
Transaction Rice Pulse Oil Milk Apple
t1 1 1 1 0 0
t2 0 1 1 1 0
t3 0 0 0 1 1
t4 1 1 0 1 0
t5 1 1 1 0 1
t6 1 1 1 1 1
Step 1
Make a frequency table of all the products that appear in the transactions. Then keep only those products whose support exceeds the 50 percent threshold, that is, products that appear in more than three of the six transactions. Rice, Pulse, Oil, and Milk cross this threshold, while Apple does not and is dropped. The resulting frequency table lists the products frequently bought by customers.
Step 2
Create pairs of the remaining products, such as RP, RO, RM, PO, PM, and OM, and count how many transactions contain each pair to obtain the corresponding frequency table.
Step 3
Apply the same 50 percent support threshold to the pairs and keep only those whose count exceeds it; in our case, that means a count of more than 3.
Step 4
Now look for sets of three products that customers buy together. From the surviving pairs we get the following combination:
1. RP and RO give RPO
Step 5
Calculate the frequency of the candidate three-product itemsets, and you will get the corresponding frequency table. Applying the threshold again, you can figure out that the customers' set of three products is RPO.
We have considered an easy example to discuss the apriori algorithm in data mining. In reality, you will find thousands of such combinations.
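To make the level-by-level idea concrete, here is a minimal, self-contained Python sketch of Apriori-style frequent itemset mining. The transactions and the minimum support value below are illustrative toy data, not the Big Bazar table above, and the code is a simplified sketch rather than an optimized implementation.

# Toy market-basket data (illustrative).
transactions = [
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "chocolate"},
]
min_support = 0.4   # an itemset must appear in at least 40 percent of the transactions

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent = {}
k = 1
current = [frozenset([item]) for item in items]
while current:
    # Pruning step: keep only the k-itemsets that meet the support threshold.
    survivors = {s: support(s) for s in current if support(s) >= min_support}
    frequent.update(survivors)
    # Apriori property: (k+1)-itemset candidates are built only from surviving k-itemsets.
    current = list({a | b for a in survivors for b in survivors if len(a | b) == k + 1})
    k += 1

for itemset, sup in sorted(frequent.items(), key=lambda kv: (-len(kv[0]), -kv[1])):
    print(set(itemset), round(sup, 2))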
There are various methods used to improve the efficiency of the Apriori algorithm.
Hash-Based Itemset Counting
In hash-based itemset counting, a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent and is therefore excluded.
Transaction Reduction
In transaction reduction, a transaction that does not contain any frequent k-itemset is useless in subsequent scans and can be removed from consideration.
The primary steps to find the association rules in data mining are given below.
Step 1
We have already discussed how to create the frequency table and calculate the itemsets having a support value greater than the threshold support.
In the above example, the RPO combination was the frequent itemset. Now we find all the rules that can be formed using RPO.
You can see that there are six different combinations. In general, if a frequent itemset has n elements, there will be 2^n - 2 candidate association rules.
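The rule-generation step can be sketched in Python as well: every non-empty proper subset of a frequent itemset can act as an antecedent, with the remaining items as the consequent, which gives the 2^n - 2 candidate rules mentioned above. The itemset below is the RPO set from the example.

from itertools import combinations

frequent_itemset = {"Rice", "Pulse", "Oil"}   # the frequent itemset RPO

rules = []
items = sorted(frequent_itemset)
# Every non-empty proper subset is a candidate antecedent; the rest forms the consequent.
for size in range(1, len(items)):
    for antecedent in combinations(items, size):
        consequent = frequent_itemset - set(antecedent)
        rules.append((set(antecedent), consequent))

print(len(rules), "candidate rules")   # 2**3 - 2 = 6
for antecedent, consequent in rules:
    print(antecedent, "->", consequent)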
The FP-Growth Algorithm was proposed by Han et al. in 2000. It is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns, named the frequent-pattern tree (FP-tree). In his study, Han showed that the method outperforms other popular methods for mining frequent patterns, e.g., the Apriori algorithm and TreeProjection. Later works showed that FP-Growth also performs better than other methods such as Eclat and Relim. The popularity and efficiency of the FP-Growth algorithm have led to many studies proposing variations to improve its performance.
The FP-Growth algorithm is an alternative way to find frequent itemsets without generating candidates, which improves performance. To do so, it uses a divide-and-conquer strategy. The core of the method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information. Using this strategy, FP-Growth reduces the search cost by recursively looking for short patterns and then concatenating them into longer frequent patterns.
In large databases, holding the FP tree in the main memory is impossible. A strategy to cope with
this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then mapped
onto a path in the FP-tree. This is done until all transactions have been read. Different
transactions with common subsets allow the tree to remain compact because their paths
overlap.
A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP
tree is to mine the most frequent pattern. Each node of the FP tree represents an item of the
item set.
The root node represents null, while the lower nodes represent the item sets. The associations of
the nodes with the lower nodes, that is, the item sets with the other item sets, are maintained
while forming the tree.
1. One root is labelled as "null" with a set of item-prefix subtrees as children and a
frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the path
reaching the node;
o Node-link: links to the next node in the FP-tree carrying the same item name or
null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
o Item-name: the same as in the corresponding nodes;
o Head of node-link: a pointer to the first node in the FP-tree carrying that item-name.
Additionally, the frequent-item-header table can have the count support for an item. The below
diagram is an example of a best-case scenario that occurs when all transactions have the same
itemset; the size of the FP-tree will be only a single branch of nodes.
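The node and header-table fields described above can be sketched as a small Python data structure. This is only an illustrative sketch of the idea, not Han's original implementation, and it assumes that each transaction is already sorted in descending order of item frequency.

class FPNode:
    # One FP-tree node: item-name, count, parent, children, and node-link.
    def __init__(self, item, parent=None):
        self.item = item          # item-name (None for the null root)
        self.count = 0            # transactions represented by the path reaching this node
        self.parent = parent
        self.children = {}        # item-name -> child FPNode
        self.node_link = None     # next node in the tree carrying the same item-name

def insert_transaction(root, sorted_items, header_table):
    # Insert one transaction whose items are sorted by descending frequency.
    node = root
    for item in sorted_items:
        if item not in node.children:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # Header table keeps the head of each item's node-link chain
            # (new nodes are simply prepended here for brevity).
            child.node_link = header_table.get(item)
            header_table[item] = child
        node = node.children[item]
        node.count += 1

root = FPNode(None)
header_table = {}
for transaction in [["I2", "I1", "I3"], ["I2", "I1", "I3", "I4"], ["I2", "I3", "I4"]]:
    insert_transaction(root, transaction, header_table)
print(root.children["I2"].count)   # 3 transactions share the I2 prefix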
Algorithm by Han
The original algorithm to construct the FP-Tree defined by Han is given below:
1. The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the same as the first step of Apriori. The count of the 1-itemsets in the database is called the support count or frequency of the 1-itemsets.
2. The second step is to construct the FP-tree. For this, create the root of the tree, represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and find the itemset in it. The item with the maximum count is taken at the top, the item with the next lower count below it, and so on, so that each transaction is inserted as a path of items in descending order of their counts; transactions that share a common prefix share the same initial path, and the counts along that path are incremented.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts the set of frequent items, and the second constructs the FP-tree.
Example
Table 1: Transactions
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Count of each item:
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
Items sorted in descending order of count (items below the minimum support count of 3 are removed):
Item Count
I2 5
I1 4
I3 4
I4 4
Build FP Tree
1. The lowest node item, I5, is not considered, as it does not meet the minimum support count; hence it is deleted.
2. The next lower node is I4. I4 occurs in two branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths are {I2,I1,I3:1} and {I2,I3:1}; these form the conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP-tree is constructed from it. This tree contains {I2:2, I3:2}; I1 is not considered, as it does not meet the minimum support count.
4. This path generates all combinations of frequent patterns: {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}.
5. For I3, the prefix paths are {I2,I1:3} and {I2:1}; this generates a two-node FP-tree {I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path is {I2:4}; this generates a single-node FP-tree {I2:4}, and the frequent pattern generated is {I2,I1:4}.
The diagram below depicts the conditional FP-tree associated with the conditional node I3.
Algorithm 2: FP-Growth
Procedure FP-growth(Tree, α)
{
if Tree contains a single prefix path then
{
// Mining the single prefix-path FP-tree
let P be the single prefix-path part of Tree;
let Q be the multipath part with the top branching node replaced by a null root;
for each combination β of the nodes in path P, generate pattern β ∪ α with
support = minimum support of the nodes in β;
let freq_pattern_set(P) be the set of patterns so generated;
}
else let Q be Tree;
for each item ai in Q
{
generate pattern β = ai ∪ α with support = ai.support;
construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
if Treeβ ≠ Ø then
call FP-growth(Treeβ, β);
let freq_pattern_set(Q) be the set of patterns so generated;
}
return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
}
When the FP-tree contains a single prefix path, the complete set of frequent patterns can be
generated in three parts:
1. The single prefix-path P,
2. The multipath Q,
3. And their combinations.
The resulting patterns for a single prefix path are the enumerations of its subpaths with
minimum support. After that, the multipath Q is defined, and the resulting patterns are
processed. Finally, the combined results are returned as the frequent patterns found.
Apriori and FP-Growth algorithms are the most basic FIM algorithms. There are some basic
differences between these algorithms, such as:
Apriori generates frequent patterns by forming itemsets of increasing size (single itemsets, then pairs, then triples), whereas FP-Growth builds an FP-tree and generates frequent patterns from it.
Apriori uses candidate generation, in which frequent subsets are extended one item at a time, whereas FP-Growth builds a conditional FP-tree for every item in the data.
Since Apriori scans the database at every step, it becomes time-consuming when the number of items is large, whereas the FP-tree is built with only two database scans at the beginning, so FP-Growth consumes less time.
Apriori keeps a converted version of the database in memory, whereas FP-Growth keeps a set of conditional FP-trees in memory.
Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by various big retailers to discover the associations between items. We can understand it by taking the example of a supermarket, where all products that are frequently purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter, eggs, or milk, so these products are stored on the same shelf or nearby shelves.
Association rule learning is mainly implemented by the following types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
We will understand these algorithms in later chapters.
Working:
Association rule learning works on the concept of if-then statements, such as: if A, then B. Here the "if" element is called the antecedent, and the "then" statement is called the consequent. A relationship in which we find an association between two single items is known as single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between items, the following metrics are used:
1. Support
2. Confidence
3. Lift
Support
Support tells how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:
Support(X) = Freq(X) / T
where Freq(X) is the number of transactions containing X and T is the total number of transactions.
Confidence
Confidence indicates how often the rule has been found to be true, that is, how often the items X and Y occur together given that X occurs. It is the ratio of the number of transactions containing both X and Y to the number of transactions containing X:
Confidence(X -> Y) = Freq(X ∪ Y) / Freq(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support of X and Y together to the support expected if X and Y were independent of each other:
Lift(X -> Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It has three possible ranges of values:
Lift = 1: X and Y are independent, so the rule carries no useful information.
Lift > 1: X and Y are positively correlated; buying X makes buying Y more likely.
Lift < 1: X and Y are negatively correlated; buying X makes buying Y less likely.
Apriori Algorithm
This algorithm uses frequent datasets to generate association rules. It is designed to work on
the databases that contain transactions. This algorithm uses a breadth-first search and Hash
Tree to calculate the itemset efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for patients.
Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm uses a depth-first search technique to find frequent itemsets in a transaction database, and it generally executes faster than the Apriori algorithm.
Classification is a data analysis task, i.e., the process of finding a model that describes and distinguishes data classes and concepts. It is the problem of identifying which of a set of categories (subpopulations) a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as "Safe" and "Risky" for adopting the project and for further approving it. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the
training set available. The model has to be trained for the prediction of accurate results.
2. Classification Step (Testing Phase): The model is used to predict class labels for test data, and the accuracy of the classification rules is estimated from how well these predictions match the known labels of the test data.
Training and Testing:
Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside so as not to get hurt. Learning to move away is the training part. During testing, if the person sees any heavy object coming towards him or falling on him and moves aside, the system is tested positively; if the person does not move aside, the system is tested negatively. The same is the case with data: it should be trained in order to get accurate and reliable results.
There are certain data types associated with data mining that tell us the format of the data (whether it is in text format or numerical format). Commonly used classification methods include:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
Market Basket Analysis: It is a modeling technique associated with frequent transactions of buying certain combinations of items.
Example: Amazon and many other retailers use this technique. While a customer is viewing some products, suggestions are shown for commodities that other people have bought together in the past.
Weather Forecasting: Changing patterns in weather conditions need to be observed based on parameters such as temperature, humidity, and wind direction. This observation also requires the use of previous records in order to predict the weather accurately.
Advantages:
Classification helps predict class labels for new, unseen data, supports applications such as fraud detection, credit scoring, and medical diagnosis, and many classifiers (for example, decision trees) produce models that are easy to interpret.
Classification in data mining involves assigning predefined labels or categories to instances based
on their features or attributes. While classification algorithms have proven to be powerful
tools, they are not without challenges. Here are some common issues associated with
classification in data mining:
1. Imbalanced Datasets:
When the distribution of classes in the dataset is uneven, classification
algorithms may be biased toward the majority class. This can lead to poor
performance in predicting minority classes.
2. Overfitting:
Overfitting occurs when a model learns the training data too well, capturing
noise or random fluctuations rather than the underlying pattern. This can result
in poor generalization to new, unseen data.
3. Underfitting:
In contrast to overfitting, underfitting occurs when a model is too simple to
capture the underlying structure of the data. This leads to poor performance on
both the training and test datasets.
4. Noise in Data:
Noisy data, which includes errors or irrelevant information, can negatively
impact the performance of classification algorithms. Noisy features may mislead
the model and result in incorrect predictions.
5. Feature Selection and Dimensionality:
The curse of dimensionality can affect the performance of classification
algorithms when dealing with a large number of features. Feature selection or
dimensionality reduction techniques are often needed to mitigate this issue.
6. Categorical and Missing Data:
Many classification algorithms are designed for numerical data. Handling
categorical variables or dealing with missing data requires preprocessing and
may impact the model's performance.
7. Computational Complexity:
Some classification algorithms can be computationally intensive, especially
when dealing with large datasets. Training complex models may require
significant computational resources.
8. Interpretable Models:
Complex models, such as deep neural networks, might lack interpretability,
making it challenging to understand the decision-making process. Interpretable
models are often preferred in scenarios where transparency is crucial.
9. Concept Drift:
The underlying relationships in the data may change over time, leading to
concept drift. Classification models trained on historical data may become less
accurate when applied to current data.
10. Scalability:
Some algorithms may not scale well with the size of the dataset. Training large
models or handling big data can be challenging for certain classification
techniques.
11. Evaluation Metrics:
The choice of evaluation metrics is crucial and may depend on the nature of the
problem. Using inappropriate metrics can lead to a misinterpretation of the
model's performance.
A Decision Tree is a supervised learning method used in data mining for classification and regression. It is a tree that helps us in decision-making. The decision tree represents a classification or regression model as a tree structure: it separates a data set into smaller and smaller subsets while the tree is incrementally developed. The final tree consists of decision nodes and leaf nodes. A decision node has at least two branches, while a leaf node represents a classification or decision and cannot be split further. The uppermost decision node in the tree, which corresponds to the best predictor, is called the root node. Decision trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy is a common way to measure impurity. In a decision tree, it measures the randomness or impurity in a data set.
Information Gain:
Information Gain is the decline in entropy after the dataset is split on an attribute. It is also called entropy reduction. Building a decision tree is all about finding the attribute that returns the highest information gain.
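As a small illustration, entropy and information gain can be computed directly from class labels. The toy labels and attribute values below are made up purely for demonstration.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum(p * log2(p)) over the classes.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Entropy reduction obtained by splitting the labels on one attribute.
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - weighted

labels = ["yes", "yes", "no", "no", "yes", "no"]           # does the customer buy?
ages   = ["youth", "youth", "youth", "senior", "senior", "senior"]
print(round(entropy(labels), 3))                            # 1.0 for a 50/50 split
print(round(information_gain(labels, ages), 3))             # gain from splitting on age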
Example: Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60 percent), which leads to an $8 million profit, and the probability of a bad economy is 0.4 (40 percent), which leads to a $6 million profit.
Not expanding the factory costs $0; the probability of a good economy is 0.6, which leads to a $4 million profit, and the probability of a bad economy is 0.4, which leads to a $2 million profit.
The management team needs to make a data-driven decision on whether to expand, based on the given data. The expected value of expanding is 0.6 × $8M + 0.4 × $6M − $3M = $4.2 million, while the expected value of not expanding is 0.6 × $4M + 0.4 × $2M = $3.2 million, so expanding is the better decision.
Bayesian classification in data mining is a statistical approach to data classification that uses
Bayes' theorem to make predictions about a class of a data point based on observed data. It is
a popular data mining and machine learning technique for modelling the probability of certain
outcomes and making predictions based on that probability.
The basic idea behind Bayesian classification in data mining is to assign a class label to a new
data instance based on the probability that it belongs to a particular class, given the observed
data. Bayes' theorem provides a way to compute this probability by multiplying the prior
probability of the class (based on previous knowledge or assumptions) by the likelihood of the
observed data given that class (conditional probability).
Several types of Bayesian classifiers exist, such as naive Bayes, Bayesian network classifiers,
Bayesian logistic regression, etc. Bayesian classification is preferred in many applications
because it allows for the incorporation of new data (just by updating the prior probabilities)
and can update the probabilities of class labels accordingly.
This is important when new data is constantly being collected, or the underlying distribution
may change over time. In contrast, other classification techniques, such as decision trees or
support vector machines, do not easily accommodate new data and may require re-training of
the entire model to incorporate new information. This can be computationally expensive and
time-consuming.
Bayesian classification is a powerful tool for data mining and machine learning and is widely
used in many applications, such as spam filtering, text classification, and medical diagnosis. Its
ability to incorporate prior knowledge and uncertainty makes it well-suited for real-world
problems where data is incomplete or noisy and accurate predictions are critical.
Bayes’ Theorem in Data Mining
Bayes' theorem is used in Bayesian classification in data mining, which is a technique for
predicting the class label of a new instance based on the probabilities of different class labels
and the observed features of the instance. In data mining, Bayes' theorem is used to compute
the probability of a hypothesis (such as a class label or a pattern in the data) given some
observed event (such as a set of features or attributes). It is named after Reverend Thomas
Bayes, an 18th-century British mathematician who first formulated it.
Bayes' theorem states that the probability of a hypothesis H given some observed evidence E is proportional to the likelihood of the evidence given the hypothesis, multiplied by the prior probability of the hypothesis, as shown below:
P(H | E) = P(E | H) × P(H) / P(E)
where P(H) is the prior probability of the hypothesis, P(E | H) is the likelihood of the evidence given the hypothesis, P(E) is the probability of the evidence, and P(H | E) is the posterior probability of the hypothesis given the evidence.
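To make the idea concrete, here is a minimal naive Bayes sketch for categorical features. The training data, feature names, and the simple add-one smoothing are all illustrative assumptions of this sketch, not a reference implementation.

from collections import Counter, defaultdict

# Toy training data (illustrative): (outlook, temperature) -> play?
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool"), ("sunny", "cool")]
y = ["no", "no", "yes", "yes", "yes"]

priors = Counter(y)                  # class counts for the prior P(class)
cond = defaultdict(Counter)          # (feature index, class) -> counts of feature values
for features, label in zip(X, y):
    for i, value in enumerate(features):
        cond[(i, label)][value] += 1

def predict(features):
    # Choose the class maximizing P(class) * product of P(value | class), add-one smoothed.
    best_label, best_score = None, 0.0
    for label, prior_count in priors.items():
        score = prior_count / len(y)
        for i, value in enumerate(features):
            counts = cond[(i, label)]
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(("sunny", "cool")))    # classify a new, unseen instance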
Rule-based classification in data mining is a technique in which class decisions are taken based on various "if...then..." rules. Thus, we define it as a classification type governed by a set of IF-THEN rules. We write an IF-THEN rule as:
IF condition THEN conclusion
Example: IF age = youth AND student = yes THEN buys_computer = yes
Here, the IF part of the rule is called the rule antecedent or precondition, and the THEN part is called the rule consequent.
In rule-based classification in data mining, there are two factors based on which we can access
the rules. These are:
Coverage of Rule: The fraction of the records which satisfy the antecedent
conditions of a particular rule is called the coverage of that rule.We can calculate
this by dividing the number of records satisfying the rule(n1) by the total number of
records(n). Coverage(R) = n1/n
Accuracy of a rule: The fraction of the records that satisfy the antecedent
conditions and meet the consequent values of a rule is called the accuracy of that
rule. We can calculate this by dividing the number of records satisfying the
consequent values(n2) by the number of records satisfying the rule(n1). Accuracy(R)
= n2/n1
Generally, we convert them into percentages by multiplying them by 100. We do so to make it
easy for the layman to understand these terms and their values.
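The two measures can be checked with a few lines of Python. The records and the rule below are illustrative only.

records = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "no",  "buys_computer": "no"},
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "yes", "buys_computer": "no"},
    {"age": "senior", "student": "no",  "buys_computer": "no"},
]

# Rule: IF age = youth AND student = yes THEN buys_computer = yes
antecedent = lambda r: r["age"] == "youth" and r["student"] == "yes"
consequent = lambda r: r["buys_computer"] == "yes"

n = len(records)
n1 = sum(antecedent(r) for r in records)                     # records satisfying the antecedent
n2 = sum(antecedent(r) and consequent(r) for r in records)   # of those, records also satisfying the consequent

print("Coverage =", n1 / n)    # 2/5 = 0.4
print("Accuracy =", n2 / n1)   # 2/2 = 1.0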
Properties of Rule-Based Classifiers
There are two significant properties of rule-based classification in data mining. They are:
Rules may not be mutually exclusive
Rules may not be exhaustive
Rules may not be mutually exclusive in nature
Many different rules are generated for the dataset, so it is possible and likely that many of
them satisfy the same data record. This condition makes the rules not mutually exclusive.
Since the rules are not mutually exclusive, we cannot decide on classes that cover different
parts of data on different rules. But this was our main objective. So, to solve this problem, we
have two ways:
The first way is using an ordered set of rules. By ordering the rules, we set priority
orders. Thus, this ordered rule set is called a decision list. So the class with the highest
priority rule is taken as the final class.
The second solution can be assigning votes for each class depending on their weights.
So, in this, the set of rules remains unordered.
lazy learner
A "lazy learner" in the context of machine learning refers to an algorithm that postpones the
processing or learning phase until it receives a query for predictions. Instead of building a
model during the training phase, a lazy learner stores instances of the training data and uses
them to make predictions when new, unseen instances are presented. Lazy learners are also
known as instance-based learners or memory-based learners.
One of the most common examples of a lazy learner is the k-Nearest Neighbors (k-NN)
algorithm. Here's an overview of lazy learner classification, using k-NN as an example:
k- Nearest Neighbors (k-NN):
1. Training Phase:
In the training phase, a k-NN classifier doesn't build a model in the traditional
sense.
Instead, it memorizes the training instances and their corresponding class labels.
2. Prediction Phase:
When a new instance needs to be classified, the algorithm identifies the k-
nearest neighbors to the new instance from the training data based on a
distance metric (e.g., Euclidean distance).
The majority class among the k-nearest neighbors is assigned to the new
instance as its predicted class.
3. Characteristics of Lazy Learners:
No Model Building: Lazy learners don't create an explicit model during the
training phase. They store the training instances and use them for predictions.
Adaptability: Lazy learners are adaptive to changes in the dataset, as they can
easily incorporate new instances without retraining the entire model.
Computationally Intensive: The prediction phase can be computationally
expensive, especially in high-dimensional spaces or with large datasets, as it
requires calculating distances for each query.
4. Parameters in k-NN:
k (Number of Neighbors): The choice of the value for k impacts the algorithm's
sensitivity to noise and its ability to capture the underlying structure of the data.
5. Advantages:
Simple Implementation: Lazy learners are easy to implement and understand.
Adaptive to Local Patterns: They can adapt well to local patterns in the data.
6. Disadvantages:
High Computation Cost: The prediction phase can be computationally expensive,
especially with large datasets.
Sensitivity to Noise: Lazy learners can be sensitive to noise or irrelevant
features in the dataset.
7. Use Cases:
Non-Stationary Data: Lazy learners are suitable for scenarios where the data
distribution is non-stationary, and the model needs to adapt to changes over
time.
Local Patterns: When the decision boundaries are complex and vary in different
parts of the feature space.
Lazy learners, like k-NN, are well-suited for certain types of problems, particularly in situations
where the decision boundaries are complex and the relationships within the data are nonlinear
and varied. However, their computational cost can be a drawback in high-dimensional or large
datasets. The choice of whether to use a lazy learner depends on the specific characteristics of
the problem at hand.
Nearest Neighbor
o Suppose a new data point has to be placed in one of two categories, A or B. By calculating the Euclidean distance to the existing points, we find its nearest neighbors: say, three nearest neighbors in category A and two nearest neighbors in category B.
o As the majority of the nearest neighbors are from category A, the new data point must belong to category A.
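A bare-bones k-NN classifier can be written in a few lines of Python; the points and labels below are toy values chosen only to illustrate the distance-and-vote idea.

import math
from collections import Counter

def knn_predict(training_points, training_labels, query, k=3):
    # Classify the query by majority vote among its k nearest training points.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(training_points, training_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7)]
labels = ["A", "A", "A", "B", "B"]
print(knn_predict(points, labels, query=(2, 2), k=3))   # -> A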
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value of K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
o Large values of K can smooth out the effect of noise, but they may blur the class boundaries and make the algorithm miss smaller local patterns.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the new data point and all the training samples.
Prediction
In data mining, prediction involves using models built from historical or existing data to make
informed predictions or estimations about future or unseen data. The goal is to identify
patterns, relationships, or trends within the data that can be generalized to make predictions
on new instances. Predictive modeling is a key aspect of data mining and machine learning.
Here's an overview of the prediction process in data mining:
1. Data Collection and Preprocessing:
The first step is to collect relevant data for the problem at hand. This data is
then preprocessed to clean it, handle missing values, and transform features
into a suitable format for modeling.
2. Feature Selection and Engineering:
Features (attributes or variables) that are most relevant to the prediction task
are selected or engineered to enhance the performance of the predictive model.
3. Training a Predictive Model:
A predictive model is trained using a portion of the available data (training
dataset). The model learns patterns and relationships between input features
and the target variable (the variable to be predicted).
4. Model Evaluation:
The trained model is evaluated using a separate dataset (validation or test
dataset) that the model has not seen during training. Evaluation metrics, such as
accuracy, precision, recall, F1 score, or others, are used to assess the model's
performance.
5. Model Tuning:
If the model's performance is not satisfactory, adjustments are made. This may
involve tuning hyper parameters, changing the model architecture, or modifying
the feature set.
6. Prediction on New Data:
Once the model is trained and validated, it can be used to make predictions on
new, unseen data. These predictions are based on the patterns learned during
the training phase.
7. Deployment:
If the predictive model meets the desired performance criteria, it can be
deployed for use in a real-world environment. This could involve integrating the
model into a software application or a decision-making system.
8. Monitoring and Updating:
Predictive models need to be monitored for performance degradation over time,
especially if the data distribution changes. Models may need to be periodically
updated or retrained to maintain their accuracy.
9. Types of Predictive Models:
Various types of predictive models can be used, including linear regression,
decision trees, support vector machines, neural networks, and ensemble
methods like random forests or gradient boosting.
10. Applications of Prediction in Data Mining:
Predictive modeling is applied in numerous fields, including finance (credit
scoring), healthcare (disease prediction), marketing (customer churn prediction),
and many others.
The prediction process involves a combination of statistical and machine learning techniques,
and the success of the prediction task depends on the quality of the data, the choice of
features, and the appropriateness of the modeling technique for the specific problem.
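As a rough end-to-end illustration of the training, evaluation, and prediction steps, the sketch below uses the scikit-learn library (an assumption; any comparable toolkit would do) with its built-in Iris dataset; the model choice and parameters are arbitrary.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out a test set so the model is evaluated on data it has not seen during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a predictive model on the training portion.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test data.
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

# Predict on new, unseen data (here simply the first test instance).
print("Predicted class:", model.predict(X_test[:1]))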
Accuracy, precision, and recall are common evaluation metrics used in data mining and
machine learning to assess the performance of predictive models. These metrics provide
insights into different aspects of a model's performance, especially in binary or multiclass
classification tasks.
1. Accuracy:
Definition: Accuracy is the ratio of correctly predicted instances to the total number of instances in the dataset:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Components:
TP (True Positives): Instances correctly predicted as positive.
TN (True Negatives): Instances correctly predicted as negative.
FP (False Positives): Instances incorrectly predicted as positive.
FN (False Negatives): Instances incorrectly predicted as negative.
2. Precision:
Definition: Precision is the ratio of correctly predicted positive instances to the total predicted positive instances:
Precision = TP / (TP + FP)
Interpretation: Precision focuses on the accuracy of positive predictions. It is useful when the cost of false positives is high. A high precision indicates a low false-positive rate.
3. Recall (Sensitivity or True Positive Rate):
Definition: Recall is the ratio of correctly predicted positive instances to the total actual positive instances:
Recall = TP / (TP + FN)
4. F1 Score:
Definition: The F1 score is the harmonic mean of precision and recall:
F1 = 2 × Precision × Recall / (Precision + Recall)
Interpretation: The F1 score combines precision and recall into a single metric. It is particularly useful when there is an uneven class distribution.
5. Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUC):
Definition: ROC curve plots the true positive rate (recall) against the false
positive rate at various classification thresholds. AUC quantifies the area under
the ROC curve.
Interpretation: ROC and AUC provide insights into a model's ability to
discriminate between classes across different thresholds. A higher AUC
generally indicates better model performance.
When interpreting these metrics, it's crucial to consider the specific context of the problem and
the associated costs of false positives and false negatives. The choice of evaluation metric
depends on the goals and requirements of the task at hand.
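For reference, all four headline metrics can be derived from the confusion-matrix counts; the counts in the sketch below are made-up numbers for illustration.

def classification_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall, and F1 computed from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"Accuracy={acc:.2f} Precision={prec:.2f} Recall={rec:.2f} F1={f1:.2f}")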
Cluster analysis in data mining means finding groups of objects such that the objects in a group are similar to one another but different from the objects in other groups. In the process of clustering, the data set is divided into groups or classes based on data similarity, and each of these classes is labelled according to its characteristics. Going through a clustering example can help you understand the analysis more thoroughly.
Cluster analysis is an unsupervised learning technique that groups data points based on their similarities. The goal is to create clusters where points within a cluster are more similar to one another than to points in other clusters. Some key aspects of cluster analysis:
It can be used to discover patterns in data without prior knowledge of class labels. The algorithm explores the data and groups similar points together.
Many clustering algorithms exist, including k-means, hierarchical clustering, density-based clustering, etc. Each has its own approach for defining clusters.
The number of clusters must be determined beforehand for some algorithms, like k-means. Other algorithms, like hierarchical clustering, can infer the number of clusters from the data.
Choosing the appropriate clustering algorithm and tuning its parameters, such as the number of clusters, is key to obtaining meaningful clusters from the data.
Cluster analysis is commonly used for exploratory data analysis, market segmentation, social network analysis, and image segmentation.
Evaluating clustering results requires domain knowledge and analysis of cluster characteristics such as tightness and separation. External evaluation measures can also be used if class labels are available.
Overall view of contents in clustering:
Two main approaches to clustering:
1. Partitioning clustering
2. Hierarchical clustering
Partitioning algorithms include:
1. PAM
2. CLARA
3. CLARANS
Hierarchical algorithms include:
1. CURE
2. BIRCH
Density-based algorithms include:
1. DBSCAN
Clustering of categorical data uses algorithms such as:
1. STIRR
2. ROCK
3. CACTUS
Partition methods
Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It requires the data analyst to specify the number of clusters that have to be generated. In the partitioning method, given a database D that contains N objects, the method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). Below, we look at the working of the K-Means algorithm in detail.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions a dataset containing N objects into K clusters, so that the similarity among the data objects inside a cluster (intra-cluster similarity) is high, while the similarity with data objects outside the cluster (inter-cluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of the cluster, and the method minimizes a squared-error criterion. At the start, K objects are chosen randomly from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is then assigned to the nearest cluster based on its distance from the cluster mean.
K-mean clustering:
The goal of clustering is to divide the population or set of data points into a number of groups
so that the data points within each group are more comparable to one another and different
from the data points within the other groups. It is essentially a grouping of things based on
how similar and different they are to one another.
How k-means clustering works?
We are given a data set of items, with certain features, and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the K-
means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the algorithm
represents the number of groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space). The algorithm will
categorize the items into k groups or clusters of similarity. To calculate that similarity, we will
use the Euclidean distance as a measurement.
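The assign-then-update loop can be sketched in plain Python as follows; the toy points, the fixed iteration cap, and the random initial centroids are all simplifying assumptions of this sketch.

import math
import random

def kmeans(points, k, iterations=100, seed=0):
    # Minimal k-means: assign each point to its nearest centroid, then recompute centroids.
    random.seed(seed)
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters

data = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
centroids, clusters = kmeans(data, k=2)
print(centroids)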
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data
points is minimal.
(or) A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.
Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm takes
a Medoid as a reference point.
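A medoid can be found directly from its definition, as in this small sketch (the cluster points are illustrative):

import math

def medoid(points):
    # The medoid is the point whose total distance to all other points is smallest.
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

cluster = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (5.0, 5.0)]
print(medoid(cluster))   # the most centrally located point of the cluster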
There are three types of algorithms for K-Medoids clustering:
1. PAM (Partitioning Around Medoids)
2. CLARA (Clustering Large Applications)
3. CLARANS (Clustering Large Applications based on Randomized Search)
1. PAM
Here are the key features and steps of the PAM algorithm:
Objective:
Like K-Means, PAM aims to partition a dataset into k clusters, where each cluster is
represented by one data point called a "medoid."
A medoid is the most centrally located point in a cluster, minimizing the sum of
dissimilarities (or distances) between it and the other points in the cluster.
Steps:
1. Initialization:
Choose k initial medoids randomly from the dataset.
2. Assignment:
Assign each data point to the nearest medoid based on a dissimilarity metric
(commonly using Euclidean distance or other distance measures).
3. Update Medoids:
For each cluster, consider swapping a non-medoid point with another non-medoid
point to minimize the total dissimilarity within the cluster.
Repeat this process until no further improvement can be made.
4. Repeat:
Iterate steps 2 and 3 until convergence, where convergence occurs when no more
medoid swaps lead to a reduction in total dissimilarity.
Dissimilarity Measure:
PAM can use different dissimilarity measures depending on the nature of the data, but
commonly it uses distance measures such as Euclidean distance.
Advantages of PAM:
More robust to noise and outliers compared to K-Means, as it uses medoids that are
less sensitive to extreme values.
Provides a more interpretable cluster center (medoid) than K-Means, especially when the mean might not be a representative point.
Limitations:
Computationally expensive for large datasets, since many candidate medoid swaps must be evaluated in each iteration, and the number of clusters k must be specified in advance.
Implementation:
PAM is available in the R programming language through the cluster package, but it might not be as commonly implemented in other languages as K-Means.
2. CLARA
CLARA (Clustering Large Applications) is an extension of the Partitioning Around Medoids
(PAM) clustering algorithm designed to handle large datasets that might not fit into
memory. It was proposed by Kaufman and Rousseeuw in 1990 as a scalable version of
PAM for large datasets. CLARA uses a sampling approach to select representative subsets
of the data, applies PAM to these subsets, and then combines the results to provide an
overall clustering solution.
Here are the key features and steps of the CLARA algorithm:
Objective:
Like PAM, CLARA aims to partition a dataset into k clusters using medoids.
CLARA is specifically designed to handle large datasets that do not fit into memory by
using random samples.
Steps:
1. Sampling:
Randomly select multiple samples (subsets) of the data, each of size sampsize.
The samples are chosen without replacement, and each data point can appear
in multiple samples.
2. Apply PAM to Each Sample:
Apply the PAM algorithm to each sampled subset. This involves selecting initial
medoids, assigning points to clusters, and updating the medoids within each
subset.
3. Evaluate Quality:
Evaluate the quality of each clustering solution by calculating the total
dissimilarity across all points in each cluster.
4. Choose Final Clustering:
Select the clustering solution with the lowest overall dissimilarity as the final
result.
Advantages of CLARA:
Handles large datasets that cannot fit into memory by using random samples.
Provides a more robust clustering solution by considering multiple samples.
Limitations:
Computationally more intensive than standard PAM due to the need to apply PAM to
multiple samples.
The choice of the sample size (sampsize) can impact the quality of the clustering
solution.
3. CLARANS
CLARANS (Clustering Large Applications based on Randomized Search) searches randomly among candidate medoid swaps instead of examining them exhaustively. Its main steps are:
1. Input Parameters:
Number of clusters (k)
Maximum number of neighbors examined (max-neighbors)
Number of local minima (num-local)
2. Initialization:
Randomly select k data points as initial medoids.
3. Main Loop:
Generate a random number of neighbors for each medoid.
Swap medoids with neighboring non-medoid points and compute the cost of the
new configuration.
If the new configuration reduces the cost, update the medoid; otherwise, revert
the swap.
Repeat the above steps until local minima are reached.
4. Repeat Main Loop:
Repeat the main loop for a specified number of local minima.
5. Output:
Return the clustering solution with the lowest cost.
Limitations:
The quality of the clustering depends on the number of neighbors examined and the number of local minima searched, and CLARANS can still be computationally costly for very large databases.
Hierarchical Clustering
Agglomerative Clustering (Bottom-Up Approach): It starts with each data point as a single cluster and successively merges clusters until only one cluster, containing all data points, remains.
Steps:
1. Treat each data point as a single cluster.
2. Find the closest (most similar) pair of clusters and merge them into a new
cluster.
3. Repeat step 2 until only one cluster remains.
Linkage Methods: The choice of linkage method determines how the distance between
clusters is measured during the merging process. Common linkage methods include:
Single Linkage: Distance between the closest pair of points in different clusters.
Complete Linkage: Distance between the farthest pair of points in different
clusters.
Average Linkage: Average distance between all pairs of points in different
clusters.
Dendrogram: The result is a dendrogram, a tree-like diagram that illustrates the
hierarchy of clusters.
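As a brief illustration, agglomerative clustering with a chosen linkage can be run with SciPy's hierarchy module (an assumption that SciPy is available); the observations below are toy 2-D points.

from scipy.cluster.hierarchy import linkage, fcluster

X = [[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]]

# Agglomerative clustering with average linkage; Z records the sequence of merges
# and can be drawn as a dendrogram with scipy.cluster.hierarchy.dendrogram.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]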
Divisive Clustering (Top-Down Approach): It starts with all data points in a single cluster and recursively divides clusters until each data point is in its own cluster.
Steps:
1. Treat all data points as a single cluster.
2. Find a cluster to split into two based on a criterion.
3. Repeat step 2 recursively until each data point is in its own cluster.
Thus, divisive hierarchical clustering recursively divides clusters based on certain criteria.
Advantages of Hierarchical Clustering:
It does not require the number of clusters to be specified in advance, and the resulting dendrogram gives an interpretable picture of how clusters merge (or split) at different levels of similarity.
Density-Based Clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning. Data points that lie in the low-density regions separating clusters are considered noise. The neighborhood within a radius ε of a given object is known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of objects, MinPts, then the object is called a core object.
Density-Based Clustering - Background
Eps (ε): the radius that defines the neighborhood of a point.
MinPts: the minimum number of points required in the Eps-neighborhood of a point for that point to be a core point.
Directly density reachable: A point i is directly density reachable from a point k with respect to Eps and MinPts if
i belongs to the Eps-neighborhood of k, and
k is a core point, i.e., its Eps-neighborhood contains at least MinPts points.
Density reachable: A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of points p1, ..., pn with p1 = j and pn = i such that each point p(l+1) is directly density reachable from pl.
Density connected: A point i is density connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density reachable from o with respect to Eps and MinPts.
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of a cluster and can identify clusters of arbitrary shape in a spatial database containing noise (outliers).
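A short sketch of DBSCAN in use, assuming scikit-learn's implementation; the points, eps, and min_samples values are illustrative only.

from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point (which should come out as noise).
X = [[1.0, 1.0], [1.1, 1.2], [0.9, 0.8], [8.0, 8.0], [8.1, 8.2], [7.9, 7.8], [50.0, 50.0]]

# eps is the neighborhood radius (ε); min_samples plays the role of MinPts.
model = DBSCAN(eps=1.0, min_samples=3).fit(X)
print(model.labels_)   # cluster indices; noise points are labelled -1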
Limitations of grid-based clustering:
The choice of grid size can impact the clustering results; an inappropriate grid size may lead to over-smoothing or under-smoothing.
It struggles with clusters of irregular shapes or clusters with varying densities.
It is sensitive to the spatial distribution of data points.
DBSCAN as a Grid-Based Clustering Algorithm:
DBSCAN, although not exclusively grid-based, can be adapted for grid-based clustering
by using a grid structure to efficiently compute spatial relationships.
The algorithm defines clusters as dense regions separated by areas of lower point
density.
Evaluating the performance of a clustering algorithm is crucial to understanding how well it has
grouped similar data points and how meaningful the resulting clusters are. Several metrics and
methods are commonly used to evaluate the quality of clustering. Here are some key aspects
and methods for evaluating clustering results:
1. Internal Evaluation Metrics:
Silhouette Score: Measures how similar an object is to its own cluster compared
to other clusters. A higher silhouette score indicates better-defined clusters.
Davies-Bouldin Index: Computes the average similarity ratio of each cluster with
the cluster that is most similar to it. Lower values indicate better clustering.
2. External Evaluation Metrics:
Adjusted Rand Index (ARI): Measures the similarity between true labels and
predicted labels, adjusted for chance. ARI ranges from -1 to 1, where higher
values indicate better agreement.
Normalized Mutual Information (NMI): Measures the mutual information
between true labels and predicted labels, normalized by the entropy of the
labels. Higher NMI values indicate better clustering.
3. Visual Inspection:
Dendrogram and Scatter Plots: For hierarchical clustering, dendrograms can
provide insights into the hierarchical structure. Scatter plots of clustered data
points can visually reveal the separation of clusters.
TSNE (t-distributed Stochastic Neighbor Embedding) or PCA (Principal
Component Analysis): Dimensionality reduction techniques can help visualize
the data in lower dimensions and inspect the separation between clusters.
4. Cluster Purity:
Cluster Purity: Measures the extent to which each cluster contains
predominantly a single class. Higher purity values indicate better clustering
performance.
5. Fowlkes-Mallows Index:
Fowlkes-Mallows Index: A metric that assesses the similarity between true and
predicted clusters. It considers precision and recall.
6. Computational Efficiency:
Runtime and Memory Usage: Depending on the application, the computational
efficiency of the clustering algorithm may be an important consideration.
7. Domain-Specific Metrics:
Metrics tailored to the specific goals of the clustering task. For example, if the
goal is to identify disease subtypes, domain-specific metrics related to disease
characteristics may be relevant.
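As an illustration of the internal and external measures above, the sketch below assumes scikit-learn and uses toy data with a known ground truth.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score, normalized_mutual_info_score

X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
true_labels = [0, 0, 0, 1, 1, 1]    # ground-truth labels, if available

predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal metric: needs only the data and the predicted labels.
print("Silhouette:", silhouette_score(X, predicted))

# External metrics: compare predicted labels with the known true labels.
print("ARI:", adjusted_rand_score(true_labels, predicted))
print("NMI:", normalized_mutual_info_score(true_labels, predicted))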
1. What is Data Mining?
Answer: Data Mining is the process of analyzing large datasets to identify patterns, trends, and relationships that are not immediately apparent. It uses algorithms, statistical models, and machine learning techniques to explore hidden structures in data. The goal is to extract useful information from raw data, which can be used to make better decisions, predict future trends, and understand underlying structures within the data. For example, in retail, Data Mining can help identify buying patterns of customers and suggest new products based on those patterns.
2. What are the main differences between KDD (Knowledge Discovery in Databases) and
Data Mining?
3. How does Data Mining differ from DBMS (Database Management Systems)?
Answer:
Some of the key techniques used in Data Mining include:
Answer:
Data Mining comes with several challenges:
1. Data Quality: Inconsistent, incomplete, or noisy data can significantly hinder the
data mining process. Cleaning the data is often a time-consuming and essential step.
2. Scalability: As the volume of data increases, it becomes harder to apply traditional Data
Mining techniques efficiently. Handling large datasets with millions of records can
require sophisticated techniques and high computational resources.
3. Interpretability of Models: Data Mining algorithms can generate complex models that
are hard for humans to interpret, which can be a barrier to their practical use.
4. Privacy Concerns: Mining sensitive data, such as personal or financial information, can
raise privacy issues, especially in domains like healthcare or finance.
5. Data Overfitting: Overfitting occurs when a model is too closely fit to the training data
and fails to generalize well to unseen data. This is especially problematic in predictive
modeling.
6. Class Imbalance: In many datasets, some classes are underrepresented, making it difficult to build accurate models, especially for rare events or minority classes.
Answer:A Data Warehouse is a specialized database system that stores historical data from
multiple sources in a centralized repository. Unlike operational databases that store real-time
transaction data, a Data Warehouse is designed for querying and analysis. Data Warehouses
consolidate data from various sources (such as transactional systems, flat files, or external
databases) and present it in a structured, easy-to-analyze format. They support business
intelligence activities like reporting, dashboards, and data analysis by providing a
comprehensive view of an organization's data.
Answer:The Multidimensional Data Model is a way of organizing data in a Data Warehouse using
multiple dimensions to represent various aspects of the data. Data is organized in a structure
known as a cube, where each axis represents a different dimension (such as time, geography, and
product category). The intersection of dimensions holds the actual data values, such as sales
numbers or profit margins. This model allows users to analyze data from different perspectives,
such as viewing sales by month (time), by region (geography), or by product
Answer:Data Cleaning refers to the process of detecting and correcting (or removing) errors,
inconsistencies, and inaccuracies in the data. In the context of Data Warehousing, cleaning is
crucial because Data Warehouses often consolidate data from multiple sources, which may have
different formats, structures, and levels of quality. Cleaning ensures that the data is accurate,
complete, and consistent, which in turn enhances the quality of any analyses performed on it.
Without proper data cleaning, reports and business intelligence insights generated from the data
can be misleading or incorrect, leading to poor decision-making.
Answer:Data Integration is the process of combining data from various sources into a unified
format suitable for analysis. In Data Warehousing, this involves extracting data from disparate
systems, transforming it into a common format, and loading it into the Data Warehouse (ETL
process: Extract, Transform, Load). Data Transformation involves modifying the data to meet
the specific needs of the Data Warehouse, which may include normalizing data, converting
currencies, or dealing with missing values. Transformation ensures that the data is structured in
a way that makes it easy to analyze and query.
Answer:Data Reduction is a technique used to reduce the size of the data stored in a Data
Warehouse while preserving the essential patterns and relationships. Large volumes of data can
slow down query performance and complicate analysis. By applying methods like aggregation,
sampling, or dimensionality reduction, Data Reduction techniques help to streamline data
processing without sacrificing important information. For example, aggregating monthly sales
data into quarterly data reduces the amount of data while still retaining essential trends.
Answer:Discretization refers to the process of converting continuous data into discrete intervals
or categories. In Data Warehousing, this can be useful when working with large sets of
continuous data such as age, income, or temperature. By converting these continuous values
into discrete categories (e.g., age groups like 20-30, 30-40, etc.), it becomes easier to analyze
the data and perform aggregations. Discretization helps simplify the data and can improve the
performance of Data Mining algorithms, as many algorithms perform better with categorical
data.
Answer:Frequent Item Set Mining involves discovering sets of items that frequently appear
together in transactional datasets. This is most commonly used in market basket analysis to
understand which products are often purchased together. For example, in a grocery store,
frequent item sets might reveal that customers often buy bread and butter together. The key
objective is to identify these frequently co-occurring items to help businesses with decisions like
cross-selling, promotions, and inventory management. Techniques like the A-Priori Algorithm
and FP-Growth are widely used to mine frequent item sets.
14. What is the A-Priori Algorithm, and how does it work?
Answer:The A-Priori Algorithm is an algorithm for mining frequent item sets. It works by
iteratively generating candidate item sets of increasing size, starting from individual items, and
checking their frequency in the dataset. The key idea behind A-Priori is that any subset of a
frequent item set must also be frequent. This property allows the algorithm to prune the search
space by eliminating infrequent item sets early in the process. A-Priori uses a breadth-first
search approach, first finding frequent individual items, then pairs of items, triples, and so on,
until no more frequent item sets can be found.
15. What is the FP-Growth Algorithm, and how does it differ from A-Priori?
16. What are Association Rules, and how are they mined?
Answer:Association Rules are used to identify relationships between different variables in large
datasets. These rules are typically written in the form of "If X, then Y," where X and Y are
items or attributes. For example, an association rule could be "If a customer buys milk, they are
likely to buy bread." To mine Association Rules, Data Mining algorithms like A-Priori and FP-
Growth are used to find frequent item sets, and then rules are generated from those item sets
based on metrics like support, confidence, and lift. Support measures the frequency of the rule,
confidence measures the reliability of the rule, and lift measures the strength of the rule
compared to random chance.
1. Overfitting: Overfitting occurs when a model is too complex and starts to capture noise or
irrelevant patterns in the training data. This can result in poor generalization to new, unseen
data.
20. What is the Naive Bayes classifier, and how does it work?
Classification (Continued)
22. What are Lazy Learners (or Learning from your Neighbors) in Classification?
Answer:Lazy Learners, also known as Instance-Based Learning, are algorithms that do not
explicitly build a model during the training phase. Instead, they store all the training instances
and perform classification at prediction time by comparing a new instance to stored examples.
The most common Lazy Learner algorithm is the k-Nearest Neighbors (k-NN) algorithm, where
the class of a new instance is determined by the majority class of its 'k' nearest neighbors in the
training set. Lazy Learners are simple, non-parametric, and can work well for many
classification problems, especially when the decision boundaries are irregular. However, they
can be computationally expensive during prediction, particularly when the dataset is large.
Answer:The k-Nearest Neighbors (k-NN) algorithm is a simple and widely used classification
technique where the class of a data point is determined by the majority vote of its 'k' closest
training data points. The distance between data points is usually measured using metrics like
Euclidean distance. In practice, to classify a new instance, the algorithm finds the 'k' training
examples that are closest to the instance and predicts the class that occurs most frequently
among them. The main advantage of k-NN is its simplicity and effectiveness for non-linear
data. However, it can become inefficient for large datasets, as it requires computing distances to
all points in the training set.
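A compact from-scratch sketch of the voting procedure described above is shown below; the training points, labels, and the value of k are illustrative only:

import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Distance from the query point to every training point (Euclidean).
    dists = [(math.dist(query, x), y) for x, y in zip(train_X, train_y)]
    nearest = sorted(dists)[:k]                      # the k closest neighbours
    votes = Counter(label for _, label in nearest)   # majority vote among them
    return votes.most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
y = ["red", "red", "blue", "blue"]
print(knn_predict(X, y, query=(1.1, 0.9), k=3))      # prints "red"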
24. What are Precision and Recall, and why are they important?
Answer:Precision and Recall are two important evaluation metrics used for
classification models, particularly when dealing with imbalanced classes.
1. Precision measures the accuracy of positive predictions. It is defined as the ratio of true
positive instances to the total predicted positives. A high precision indicates that the
model does a good job of only predicting positives when it is confident that they are
correct.
2. Recall (also known as Sensitivity or True Positive Rate) measures how well the model
identifies positive instances. It is defined as the ratio of true positive instances to the
total actual positives. A high recall indicates that the model correctly identifies most of
the actual positive instances.
Both precision and recall provide valuable insights into a model's performance,
especially in cases where the data is imbalanced (e.g., rare events like fraud detection).
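The two ratios can be computed directly from counts of true positives, false positives, and false negatives, as in this small sketch with invented labels:

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = positive class (e.g., fraud)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)   # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)   # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)   # false negatives

precision = tp / (tp + fp)   # accuracy of the positive predictions
recall = tp / (tp + fn)      # share of actual positives that were found
print(f"precision={precision:.2f}, recall={recall:.2f}")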
25. What is the F1-Score, and when is it useful?
Answer:The F1-Score is the harmonic mean of precision and recall, providing a single
measure of a model's performance, particularly in cases where there is a class imbalance. It
balances the trade-off between precision and recall.
The F1-Score ranges from 0 (worst performance) to 1 (best performance). It is a useful metric
when both false positives and false negatives are important, such as in medical diagnostics or
fraud detection.
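Continuing the small sketch above, the F1-Score is simply the harmonic mean of the two values:

precision, recall = 0.75, 0.75                      # values from the previous sketch
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"F1-Score = {f1:.2f}")                       # 1.0 is best, 0.0 is worst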
27. What are the major categories of clustering methods?
Answer:The main categories of clustering methods are:
1. Partitioning Methods: These methods divide the dataset into a predefined number of
clusters. K-Means is the most popular partitioning algorithm. It works by selecting 'k'
initial centroids and iteratively updating them to minimize the distance between data
points and their corresponding centroids.
2. Hierarchical Methods: These methods build a tree-like structure (dendrogram) of
clusters, which can be either agglomerative (bottom-up) or divisive (top-down).
Agglomerative clustering starts by treating each data point as its own cluster and then
merges the closest clusters. Divisive clustering starts with all data points in one cluster
and iteratively splits it into smaller clusters.
3. Density-Based Methods: These methods, such as DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), group points that are closely packed together
and separate regions with sparse data. DBSCAN can handle clusters of arbitrary
shapes and can also identify noise points (outliers).
4. Grid-Based Methods: These methods divide the data space into a finite number of cells
and perform clustering based on the grid structure. They are particularly useful for
large datasets with high dimensionality.
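As an illustration, the hierarchical and density-based families above could be tried side by side with scikit-learn, assuming that library is available; the data and parameter values are invented for the example:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # two loose 2-D blobs
               rng.normal(3, 0.3, (20, 2))])

hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)   # bottom-up merging
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)            # -1 marks noise points

print("hierarchical cluster sizes:", np.bincount(hier_labels))
print("DBSCAN labels found:", sorted(set(db_labels)))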
28. What is the K-Means clustering algorithm, and how does it work?
Answer:The K-Means algorithm is a partitioning clustering algorithm that divides a dataset into
'k' clusters. It works by first selecting 'k' initial centroids (either randomly or based on some
heuristic) and then assigning each data point to the nearest centroid. Once all points are assigned,
the centroids are recalculated as the mean of all points in their respective clusters. This process
is repeated iteratively until the centroids no longer change, meaning the algorithm has converged.
K-Means is efficient and easy to implement but has limitations, such as sensitivity to the initial
choice of centroids and difficulty handling non-spherical clusters.
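The assign-and-update loop described above can be sketched in NumPy as follows; the data, the value of k, and the simple random initialization are illustrative choices, not a prescribed implementation:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # pick k data points as initial centroids
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])
labels, centroids = kmeans(X, k=2)
print(centroids)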
29. What are the advantages and disadvantages of the K-Means algorithm?
Answer:
1. Advantages:
1. Simple to understand and easy to implement.
2. Computationally efficient and scales well to large datasets.
3. Works well when clusters are compact, roughly spherical, and similar in size.
2. Disadvantages:
1. Sensitive to the initial choice of centroids and can get stuck in local minima.
2. Requires the user to specify the number of clusters ('k') in advance.
3. Struggles with clusters of varying sizes, densities, and shapes.
4. Sensitive to noise and outliers, which can distort the clustering results.
30. What are the common metrics used to evaluate clustering results?
Answer:
There are several evaluation metrics used to assess the quality of clustering results:
1. Silhouette Score: Measures how similar each point is to its own cluster compared to
other clusters. A high silhouette score indicates that the points are well-clustered.
2. Dunn Index: Measures the compactness and separation of clusters. A higher Dunn
Index indicates better clustering.
3. Davies-Bouldin Index: A lower Davies-Bouldin score indicates better clustering, as it
measures the average similarity between each cluster and its most similar cluster.
4. Rand Index: Measures the agreement between two clusterings, by comparing pairs of
points and counting the number of pairs that are either in the same cluster or different
clusters in both clusterings.
5. Adjusted Rand Index (ARI): Adjusts the Rand Index to account for chance, making
it more reliable for comparing clustering results.
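A few of these scores can be computed with scikit-learn, assuming that library is used; the Dunn Index is omitted here because scikit-learn does not provide it directly, and the data and labels are invented for the example:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (25, 2)),       # two well-separated blobs
               rng.normal(4, 0.3, (25, 2))])
true_labels = np.array([0] * 25 + [1] * 25)       # known "ground truth" for the Rand-style scores

pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Silhouette score:   ", silhouette_score(X, pred_labels))          # higher is better
print("Davies-Bouldin:     ", davies_bouldin_score(X, pred_labels))      # lower is better
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))  # 1.0 = perfect agreement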
Problem 1: Apply the A-Priori Algorithm to the following transaction dataset to find the frequent itemsets and generate association rules.
Transaction ID   Items Purchased
T1               A, B, D
T2               B, C, D
T3               A, B, C, D
T4               A, B, C
T5               B, D
T6               A, C, D
We are given the minimum support threshold as 50% (0.5), and the minimum confidence
threshold as 70% (0.7).
Solution:
Step 1: Generate Candidate 1-itemsets The frequency of individual items in the transactions is:
1. A: 4 times (support 4/6 ≈ 0.67)
2. B: 5 times (support 5/6 ≈ 0.83)
3. C: 4 times (support 4/6 ≈ 0.67)
4. D: 5 times (support 5/6 ≈ 0.83)
All four items meet the 50% minimum support, so all of them are frequent 1-itemsets.
Step 2: Generate Candidate 2-itemsets The supports of the candidate pairs are:
1. (A, B): 0.5, (A, C): 0.5, (A, D): 0.5, (B, C): 0.5, (B, D): 0.67, (C, D): 0.5
All six pairs meet the minimum support and are frequent 2-itemsets.
Step 3: Generate Candidate 3-itemsets Each candidate triple, namely (A, B, C), (A, B, D), (A, C, D), and (B, C, D), appears in only 2 of the 6 transactions (support 0.33), so no 3-itemset is frequent and the algorithm stops.
Step 4: Generate Association Rules For the frequent 2-itemsets, we generate rules and calculate their confidence. The rules that meet the 70% minimum confidence are:
1. A → B, A → C, A → D, C → A, C → B, C → D (confidence 3/4 = 0.75 each)
2. B → D, D → B (confidence 4/5 = 0.80 each)
Rules such as B → A, B → C, D → A, and D → C reach only 3/5 = 0.60 confidence and are discarded.
Thus, the frequent itemsets and their corresponding association rules have been mined.
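The counts above can be cross-checked with a short brute-force Python script that enumerates every itemset directly (this is a verification sketch, not the A-Priori Algorithm itself):

from itertools import combinations

transactions = [{"A", "B", "D"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "C"}, {"B", "D"}, {"A", "C", "D"}]
n = len(transactions)
support = lambda s: sum(1 for t in transactions if set(s) <= t) / n

# Frequent itemsets at 50% minimum support
items = sorted({i for t in transactions for i in t})
for k in (1, 2, 3):
    for combo in combinations(items, k):
        if support(combo) >= 0.5:
            print(set(combo), round(support(combo), 2))

# Confidence of the rules that pass the 70% threshold
for x, y in [("A", "B"), ("A", "C"), ("A", "D"), ("C", "A"),
             ("C", "B"), ("C", "D"), ("B", "D"), ("D", "B")]:
    print(f"{x} -> {y}: confidence {support((x, y)) / support((x,)):.2f}")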
Problem 2: Apply the FP-Growth Algorithm to the following transaction dataset (the same transactions as in Problem 1).
Transaction ID   Items Purchased
T1               A, B, D
T2               B, C, D
T3               A, B, C, D
T4               A, B, C
T5               B, D
T6               A, C, D
We are given a minimum support threshold of 50% (0.5), and the goal is to mine frequent
itemsets using the FP-Growth algorithm.
Solution:
Step 1: Construct the Frequency Table The frequency of individual items is calculated:
1. A: 4 times
2. B: 5 times
3. C: 4 times
4. D: 5 times
All four items meet the 50% minimum support (at least 3 of the 6 transactions).
Step 2: Sort Items in Transaction Data The items in each transaction are sorted in descending order of frequency; B and D occur 5 times and A and C occur 4 times, and ties are broken alphabetically, giving the global order B, D, A, C. The sorted transaction dataset looks like this:
Transaction ID   Sorted Items
T1               B, D, A
T2               B, D, C
T3               B, D, A, C
T4               B, A, C
T5               B, D
T6               D, A, C
Step 3: Build the FP-tree Construct the FP-tree by inserting the sorted transactions one at a time, sharing common prefixes and incrementing the node counts:
Root
 |- B (5)
 |    |- D (4)
 |    |    |- A (2)
 |    |    |    |- C (1)
 |    |    |- C (1)
 |    |- A (1)
 |         |- C (1)
 |- D (1)
      |- A (1)
           |- C (1)
Step 4: Mine Frequent Itemsets To mine the frequent itemsets, build the conditional pattern base of each item, starting from the least frequent, and keep the itemsets whose support count is at least 3:
1. For item C, the conditional pattern base is {(B, D, A): 1, (B, D): 1, (B, A): 1, (D, A): 1}. B, D, and A each co-occur with C 3 times, giving the frequent itemsets {B, C}, {C, D}, and {A, C}; no pair of prefix items reaches the threshold, so no larger itemset containing C is frequent.
2. For item A, the conditional pattern base is {(B, D): 2, (B): 1, (D): 1}. B co-occurs with A 3 times and D co-occurs with A 3 times, giving {A, B} and {A, D}; B and D appear together with A only twice, so {A, B, D} is not frequent.
3. For item D, the conditional pattern base is {(B): 4}, giving {B, D}.
Besides the four single items, the frequent itemsets are therefore:
1. {A, B}
2. {A, C}
3. {A, D}
4. {B, C}
5. {B, D}
6. {C, D}
This matches the result obtained with the A-Priori Algorithm in Problem 1.
Step 5: Generate Association Rules From the frequent itemsets, association rules are generated and their confidence is calculated exactly as in Problem 1; for example, B → D has confidence 4/5 = 0.80 and A → B has confidence 3/4 = 0.75.
Thus, the frequent itemsets and association rules have been mined using the FP-Growth algorithm.
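If the third-party mlxtend library is available, the same frequent itemsets can be cross-checked in a few lines (mlxtend and its TransactionEncoder and fpgrowth helpers are an assumption here, not something used in the problem itself):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["A", "B", "D"], ["B", "C", "D"], ["A", "B", "C", "D"],
                ["A", "B", "C"], ["B", "D"], ["A", "C", "D"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets at 50% minimum support, mined with FP-Growth
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))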