Data Mining Notes
SYLLABUS
Module I: Introduction:-Data, Information, Knowledge, KDD, types of data for mining,
Application domains, data mining functionalities/tasks. Data processing—Understanding data,
pre-processing data-Form of data processing, Data cleaning (definition and Phases only), Need
for data integration, Steps in data transformation, Need of data reduction
Module II: Data Warehouses-Databases, Data warehouses, Data Mart, Databases Vs Data
warehouses, Data ware houses Vs Data mart, OLTP OLAP, OLAP operations/functions, OLAP
Multi-Dimensional Models- Data cubes, Star, Snow Flakes, Fact constellation. Association rules-
Market Basket Analysis, Criteria for classifying frequent pattern mining, Mining Single
Dimensional Boolean Association rule-A priori algorithm
Module I
Introduction
Data, Information, Knowledge, KDD, types of data for mining, Application domains, data mining
functionalities/tasks. Data processing—Understanding data, pre-processing data-Form of data
processing, Data cleaning (definition and Phases only), Need for data integration, Steps in data
transformation, Need of data reduction
DATA
• Data is a collection of facts in a raw or unorganized form such as numbers or characters.
• However, without context, data can mean little. For example, 12012012 is just a sequence of
numbers without apparent importance. But if we view it in the context of 'this is a date', we can
easily recognize 12th of January, 2012. By adding context and value to the numbers, they now
have more meaning.
INFORMATION
• Information is prepared data that has been processed, aggregated and organized into a more
human-friendly format that provides more context. Information is often delivered in the form
of data visualizations and reports.
• Information addresses the requirements of a user, giving it significance and usefulness as it is
the product of data that has been interpreted to deliver a logical meaning.
KNOWLEDGE
• Knowledge means the familiarity and awareness of a person, place, events, ideas, issues, ways
of doing things or anything else, which is gathered through learning, perceiving or discovering.
It is the state of knowing something with cognizance through the understanding of concepts,
study and experience.
The volume of information that we must handle is increasing every day, coming from business transactions,
scientific data, sensor data, pictures, videos, etc. So, we need a system capable of extracting the essence
of the available information and automatically generating reports, views or summaries of the data for better
decision-making. This is the goal of Knowledge Discovery in Databases (KDD), which involves the following steps:
1.Data Cleaning: Data cleaning is defined as removal of noisy and irrelevant data from collection.
2.Data Integration: Data integration is defined as combining heterogeneous data from multiple sources
into a common source (Data Warehouse).
3.Data Selection: Data selection is defined as the process where data relevant to the analysis is decided
and retrieved from the data collection.
4.Data Transformation: Data transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. Data transformation is a two-step process:
• Data Mapping: Assigning elements from the source base to the destination to capture transformations.
• Code generation: Creation of the actual transformation program.
5.Data Mining: Data mining is defined as the application of intelligent techniques to extract
potentially useful patterns. The discovered knowledge is then presented to the user, for example to:
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
A short sketch of steps 1-4 on a toy table follows this list.
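A minimal, hedged sketch of steps 1-4 using pandas; the table, column names and values are made-up assumptions purely for illustration, not part of the notes.

# Sketch of KDD preprocessing: cleaning, integration, selection, transformation
import pandas as pd

sales = pd.DataFrame({'cust_id': [1, 2, 2, 3], 'amount': [100, 250, 250, None]})
profiles = pd.DataFrame({'cust_id': [1, 2], 'region': ['North', 'South']})

clean = sales.dropna(subset=['amount']).drop_duplicates()      # 1. data cleaning
merged = clean.merge(profiles, on='cust_id', how='left')       # 2. data integration
selected = merged[['cust_id', 'amount', 'region']].copy()      # 3. data selection
selected['amount_scaled'] = (selected['amount'] - selected['amount'].min()) / (
    selected['amount'].max() - selected['amount'].min())       # 4. data transformation
print(selected)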
DATA MINING TECHNIQUES
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. It
helps to classify data into different classes.
i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial data, text
data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented databases,
transactional databases, relational databases, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine
learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-
oriented, etc.
2.Clustering
• Clustering is a division of information into groups of connected objects. Describing the data by a
few clusters inevitably loses certain fine details, but achieves simplification. It models data
by its clusters. From a historical point of view, clustering is rooted in statistics,
mathematics, and numerical analysis.
• Clustering analysis is a data mining technique used to identify similar data. This technique helps to
recognize the differences and similarities between the data. Clustering is very similar to
classification, but it involves grouping chunks of data together based on their similarities.
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between
variables in the presence of other factors. It is used to estimate the likelihood of a specific
variable. Regression is primarily a form of planning and modeling. For example, we might use it to project
certain costs, depending on other factors such as availability, consumer demand, and competition.
Primarily, it gives the exact relationship between two or more variables in the given data set. A minimal
regression sketch is shown below.
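A hedged sketch of projecting a cost from demand with simple linear regression; the numbers and variable names are invented for illustration.

# Simple linear regression: fit a line and project a new value
import numpy as np
from sklearn.linear_model import LinearRegression

demand = np.array([[10], [20], [30], [40]])   # independent variable
cost = np.array([15, 25, 37, 45])             # dependent variable

model = LinearRegression().fit(demand, cost)
print(model.coef_, model.intercept_)          # estimated relationship
print(model.predict([[50]]))                  # projected cost for demand = 50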
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern
in the data set.
Association rules are if-then statements that help to show the probability of interactions between
data items within large data sets in different types of databases. Association rule mining has several
applications and is commonly used to find sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you start with a collection of data, for example, a list of grocery items
that customers have been buying for the last six months. It then calculates the percentage of items being
purchased together.
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set which do not
match an expected pattern or expected behavior. This technique may be used in various domains like
intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An
outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world
datasets contain outliers.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover
sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the
interestingness of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification,
etc. It analyzes past events or instances in the right sequence to predict a future event.
DATA MINING APPLICATION DOMAINS
2.Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data
on sales, customer purchasing history, goods transportation, consumption and services.
Data mining in retail industry helps in identifying customer buying patterns and trends that lead to
improved quality of customer service and good customer retention and satisfaction. Here is the list of
examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.
3.Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission,
etc. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has become very
important in helping to understand the business.
6.Intrusion Detection
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the availability of
network resources. In this world of connectivity, security has become a major issue. The increased
usage of the internet and the availability of tools and tricks for intruding and attacking networks have
prompted intrusion detection to become a critical component of network administration. Here is the list of areas
in which data mining technology may be applied for intrusion detection −
• Development of data mining algorithm for intrusion detection.
• Association and correlation analysis, aggregation to help select and build discriminating
attributes.
• Analysis of Stream data.
DATA MINING FUNCTIONALITIES/TASKS
Data mining deals with the kind of patterns that can be mined. On the basis of the kind of data
to be mined, there are two categories of functions involved in Data Mining −
• Descriptive
• Classification and Prediction
1.Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −
• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in
a company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived by the following two
ways −
• Data Characterization − This refers to summarizing data of class under study. This class
under study is called as Target Class.
• Data Discrimination − It refers to the mapping or classification of a class with some
predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the
list of kinds of frequent patterns −
• Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
• Frequent Subsequence − A sequence of patterns that occurs frequently, such as
purchasing a camera followed by a memory card.
• Frequent Sub Structure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item-sets or subsequences.
Mining of Associations
Associations are used in retail sales to identify patterns that are frequently purchased
together. This process refers to the process of uncovering the relationship among data and
determining association rules.
For example, a retailer might generate an association rule showing that 70% of the time milk is sold
with bread and only 30% of the time biscuits are sold with bread.
Mining of Clusters
Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming group of
objects that are very similar to each other but are highly different from the objects in other
clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in forms such as classification (IF-THEN) rules, decision trees,
mathematical formulae, or neural networks.
DATA PROCESSING
Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized
by employees throughout an organization.
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data
lakes and data warehouses.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to
as "pre-processing", is the stage at which raw data is cleaned up and organized for the following stage of
data processing.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse
like Redshift), and translated into a language that it can understand.
4. Processing
During this stage, the data input to the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms.
5. Data output/interpretation
The output/interpretation stage is the stage at which data finally becomes usable to non-data scientists. It is
translated, readable, and often in the form of graphs, videos, images, plain text, etc.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for future
use. While some information may be put to use immediately, much of it will serve a purpose later on.
Preprocessing of Data
Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
Each segment is handled separately (a short binning sketch follows this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).
3. Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters may be
treated as outliers, although some outliers may go undetected.
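A hedged sketch of smoothing by equal-frequency binning (replacing values by bin means); the values are made up for illustration.

# Smoothing by bin means: split sorted data into equal-size segments
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)                                   # 4 equal-size segments
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)                                                  # each value replaced by its bin mean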
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. This
involves the following ways (a short normalization/discretization sketch follows the list):
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute
"city" can be converted to "country".
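A hedged sketch of min-max normalization and discretization with pandas; the 'age' values are hypothetical.

# Normalization to the 0.0-1.0 range and discretization into interval levels
import pandas as pd

age = pd.Series([18, 22, 35, 47, 60])
normalized = (age - age.min()) / (age.max() - age.min())             # scaled to 0.0-1.0
levels = pd.cut(age, bins=3, labels=['young', 'middle', 'senior'])   # interval/conceptual levels
print(normalized.tolist())
print(levels.tolist())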
3.Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder when working with such
volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency
and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute
selection, one can use the level of significance and the p-value of the attribute; attributes having a
p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless (see the PCA
sketch after this list).
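A hedged sketch of lossy dimensionality reduction using PCA from scikit-learn; the 3-dimensional points are made up.

# Project 3-dimensional points onto 2 principal components
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 1.1], [0.5, 0.7, 0.2], [2.2, 2.9, 1.0], [1.9, 2.2, 0.9]])
reduced = PCA(n_components=2).fit_transform(X)   # keep 2 components instead of 3
print(reduced.shape)                             # (4, 2)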
Data Integration
Data integration is a data preprocessing technique that involves combining data from
multiple heterogeneous data sources into a coherent data store and providing a unified view of
the data.
These sources may include multiple data cubes, databases or flat files.
The data integration approach is formally defined as a triple <G, S, M>, where
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between queries over the source and global schemas.
A minimal sketch of mapping two sources into one view follows.
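A hedged sketch of integrating two heterogeneous sources (different column names) into one unified view; the table and column names are hypothetical.

# Two sources with different schemas mapped into a single global schema
import pandas as pd

source_a = pd.DataFrame({'CustID': [1, 2], 'Total': [100, 200]})
source_b = pd.DataFrame({'customer_id': [3], 'amount': [150]})

# M: map each source schema (S) onto the global schema (G) = (cust_id, amount)
unified = pd.concat([
    source_a.rename(columns={'CustID': 'cust_id', 'Total': 'amount'}),
    source_b.rename(columns={'customer_id': 'cust_id'}),
], ignore_index=True)
print(unified)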
Module II
Data Warehouses -Databases, Data warehouses, Data Mart, Databases Vs Data warehouses,
Data ware houses Vs Data mart, OLTP OLAP, OLAP operations/functions, OLAP Multi-
Dimensional Models- Data cubes, Star, Snow Flakes, Fact constellation. Association rules-
Market Basket Analysis, Criteria for classifying frequent pattern mining, Mining Single
Dimensional Boolean Association rule-A priori algorithm
Data Warehouse
• A data warehouse is a database, which is kept separate from the organization's
operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to analyze its
business.
• A data warehouse helps executives to organize, understand, and use their data to take
strategic decisions.
• Data warehouse systems help in the integration of diversity of application systems.
• A data warehouse system helps in consolidated historical data analysis.
Data Mart
• A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as Sales or Finance or Marketing.
• Data marts are often built and controlled by a single department within an organization.
Given their single-subject focus, data marts usually draw data from only a few sources.
• The sources could be internal operational systems, a central data warehouse, or
external data.
Database vs Data Warehouse
• A database is a collection of related data that represents some elements of the real world,
whereas a data warehouse is an information system that stores historical and
cumulative data from single or multiple sources.
• A database is designed to record data, whereas a data warehouse is designed to analyze
data.
• A database is an application-oriented collection of data, whereas a data warehouse is a
subject-oriented collection of data.
• Database uses Online Transactional Processing (OLTP) whereas Data warehouse uses
Online Analytical Processing (OLAP).
• Database tables and joins are complicated because they are normalized whereas Data
Warehouse tables and joins are easy because they are denormalized.
• ER modeling techniques are used for designing Database whereas data modeling
techniques are used for designing Data Warehouse.
Data Mart vs Data Warehouse
• Data Warehouse is a large repository of data collected from different sources whereas
Data Mart is only subtype of a data warehouse.
• Data Warehouse is focused on all departments in an organization whereas Data Mart
focuses on a specific group.
• The Data Warehouse design process is complicated, whereas a Data Mart is easy to
design.
• Data Warehouse takes a long time for data handling whereas Data Mart takes a short
time for data handling.
• Data Warehouse size range is 100 GB to 1 TB+ whereas Data Mart size is less than 100
GB.
• Data Warehouse implementation process takes 1 month to 1 year whereas Data Mart
takes a few months to complete the implementation process.
Characteristics of OLTP
An OLTP system is an online database-modifying system. Therefore, it supports database queries such
as insert, update, and delete information from the database.
Consider the point-of-sale system of a supermarket; the sketch below illustrates the kind of
transactional queries such a system can process.
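A hedged sketch of OLTP-style queries for a point-of-sale system using Python's built-in sqlite3 module; the table and column names are hypothetical.

# Typical OLTP operations: insert, update, and a short lookup
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER, price REAL)')
conn.execute("INSERT INTO stock VALUES ('milk', 50, 1.20)")           # insert a new item
conn.execute("UPDATE stock SET qty = qty - 1 WHERE item = 'milk'")    # update after a sale
print(conn.execute("SELECT qty, price FROM stock WHERE item = 'milk'").fetchone())
conn.close()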
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed either by stepping down a concept hierarchy for a dimension or by introducing a new dimension.
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of the data. A small pandas sketch of these operations follows.
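A hedged sketch of OLAP-style operations on a tiny cube using pandas; the dimension values and sales figures are made up.

# Roll-up, slice, dice and pivot on a small fact table
import pandas as pd

cube = pd.DataFrame({
    'city':    ['Delhi', 'Delhi', 'Mumbai', 'Mumbai'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'item':    ['mobile', 'mobile', 'modem', 'mobile'],
    'sales':   [600, 800, 400, 700],
})

rollup = cube.groupby('city')['sales'].sum()                          # roll-up: aggregate over quarters/items
slice_q1 = cube[cube['quarter'] == 'Q1']                              # slice: fix one dimension
dice = cube[(cube['quarter'] == 'Q1') & (cube['item'] == 'mobile')]   # dice: fix two dimensions
pivoted = cube.pivot_table(index='city', columns='quarter', values='sales', aggfunc='sum')  # pivot
print(rollup, slice_q1, dice, pivoted, sep='\n')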
OLAP Multi-Dimensional Models
1. Data Cube
2. Star
3. Snow Flakes
4. Fact constellation
1.Data Cube
A data cube allows data to be modeled and viewed in multiple dimensions; it is defined by
dimensions (such as time, item, location) and facts (numeric measures such as sales).
2.Star
• In the Star Schema of a data warehouse, the center of the star has one fact table
and a number of associated dimension tables.
• It is known as a star schema because its structure resembles a star.
• The Star Schema data model is the simplest type of Data Warehouse schema.
• It is also known as Star Join Schema and is optimized for querying large data sets.
• In a typical Star Schema example, the fact table is at the center and contains keys
to every dimension table, like Dealer_ID, Model_ID, Date_ID, Product_ID, Branch_ID, and
other attributes like Units sold and Revenue (a small join sketch follows).
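A hedged sketch of a star join: a fact table joined to two dimension tables with pandas; the keys follow the example above, but the values are hypothetical.

# Join the central fact table to its dimension tables on the foreign keys
import pandas as pd

fact_sales = pd.DataFrame({'Product_ID': [1, 2], 'Dealer_ID': [10, 11],
                           'Units_sold': [5, 3], 'Revenue': [500, 240]})
dim_product = pd.DataFrame({'Product_ID': [1, 2], 'Product_Name': ['TV', 'Radio']})
dim_dealer = pd.DataFrame({'Dealer_ID': [10, 11], 'Dealer_City': ['Kochi', 'Delhi']})

star = (fact_sales.merge(dim_product, on='Product_ID')
                  .merge(dim_dealer, on='Dealer_ID'))
print(star)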
3.Snowflake
• A Snowflake Schema is an extension of the Star Schema in which some dimension tables are
normalized, splitting the data into additional tables.
4.Fact Constellation
• A Fact Constellation means two or more fact tables sharing one or more dimensions. It is
also called a Galaxy schema.
• Fact Constellation Schema describes a logical structure of data warehouse or data mart.
Fact Constellation Schema can design with a collection of de-normalized FACT, Shared,
and Conformed Dimension tables.
Association Rule
Association rule mining finds interesting associations and relationships among large sets of
data items. This rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis.
• Market Basket Analysis is one of the fundamental techniques used by large retailers to
uncover the association between items.
• In other words, it allows retailers to identify the relationship between items which are
more frequently bought together.
Assume we have a data set of 20 customers who visited the grocery store, out of which 11 made
a purchase:
Customer 1: Bread, egg, papaya and oat packet
Customer 2: Papaya, bread, oat packet and milk
Customer 3: Egg, bread, and butter
Customer 4: Oat packet, egg, and milk
Customer 5: Milk, bread, and butter
Customer 6: Papaya and milk
Customer 7: Butter, papaya, and bread
Customer 8: Egg and bread
Customer 9: Papaya and oat packet
Customer 10: Milk, papaya, and bread
Customer 11: Egg and milk
Here we observe that 3 customers have bought bread and butter together. The outcome of this
technique can be understood simply as "if this, then that" (if a customer buys bread, there is a
good chance the customer will also buy butter). A short calculation of the support and confidence of
this rule is sketched below.
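A hedged sketch computing the support and confidence of the rule bread -> butter over the 11 purchases listed above ('oat' is shorthand for 'oat packet').

# Support = fraction of baskets with both items; confidence = both / baskets with bread
baskets = [
    {'bread', 'egg', 'papaya', 'oat'}, {'papaya', 'bread', 'oat', 'milk'},
    {'egg', 'bread', 'butter'}, {'oat', 'egg', 'milk'}, {'milk', 'bread', 'butter'},
    {'papaya', 'milk'}, {'butter', 'papaya', 'bread'}, {'egg', 'bread'},
    {'papaya', 'oat'}, {'milk', 'papaya', 'bread'}, {'egg', 'milk'},
]
both = sum(1 for b in baskets if {'bread', 'butter'} <= b)
bread = sum(1 for b in baskets if 'bread' in b)
print('support(bread, butter) =', both / len(baskets))    # 3/11
print('confidence(bread -> butter) =', both / bread)      # 3/7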
• Frequent patterns are collections of items which appear in a data set at a significant
frequency (usually greater than a predefined threshold) and can thus reveal association
rules and relations between variables.
• Frequent pattern mining is a research area in data science applied to many domains such
as recommender systems (what are the set of items usually ordered together),
bioinformatics (what are the genes co-expressed in a given condition), decision making,
clustering, website navigation.
• Input data is usually stored in a database or as a collection of transactions.
• A transaction is a collection of items which have been observed together (e.g. the list of
products ordered by a customer during a shopping session or the list of expressed genes
in a condition).
1.Horizontal layout: a two-column structure in which one column contains transaction ids and the
second one the list of associated items, for instance:
transaction1: [item1, item2, item7]
transaction2: [item1, item2, item5]
transactionk: [item1, itemk]
2.Vertical layout: a two-column structure in which one column contains individual item ids and the
second one the associated transaction ids (a small sketch converting the horizontal layout into this
one follows the examples), for instance:
item1: [transaction1, transaction5]
item2: [transaction2, transaction4]
itemk: [transactionk]
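A hedged sketch converting the horizontal layout above into the vertical layout; the transaction and item ids mirror the examples.

# Build the vertical (item -> transactions) layout from the horizontal one
from collections import defaultdict

horizontal = {
    'transaction1': ['item1', 'item2', 'item7'],
    'transaction2': ['item1', 'item2', 'item5'],
}
vertical = defaultdict(list)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].append(tid)
print(dict(vertical))   # e.g. item1: [transaction1, transaction2]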
• The Apriori algorithm uses data organized in the horizontal layout. It is founded on the fact that
if a subset S appears k times in a database, any other subset S1 which contains S will
appear k times or less.
• The algorithm computes the counts for all itemsets of k elements (starting with k = 1).
During the next iterations the previous frequent sets are joined, and thus we create all
candidate itemsets with k + 1 elements; candidates that do not reach the minimum support are pruned.
• The Apriori algorithm produces a large number of candidate itemsets (with possible duplicates)
and performs many database scans (equal to the maximum length of a frequent itemset).
It is thus very expensive to run on large databases.
# Minimal Apriori run; this sketch assumes the third-party 'apyori' package
# (pip install apyori), which exposes an apriori() generator over transactions.
from apyori import apriori

transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
]
# min_support here is an illustrative choice for this toy data
results = list(apriori(transactions, min_support=0.5))
Module III
Classification
Classification Vs Prediction.
Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.
A typical decision tree for the concept buys_computer indicates whether a customer at a company is
likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf
node represents a class (a small training sketch follows).
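A hedged sketch of training a decision tree for buys_computer with scikit-learn; the tiny encoded dataset (age, student) -> buys_computer is made up for illustration.

# Fit a decision tree and print its learned IF-THEN structure
from sklearn.tree import DecisionTreeClassifier, export_text

# age: 0=youth, 1=middle_aged, 2=senior; student: 0=no, 1=yes
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1]]
y = ['no', 'yes', 'yes', 'no', 'yes']          # buys_computer

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=['age', 'student']))
print(tree.predict([[0, 1]]))                  # a young student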
Bayes Classification
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical
classifiers. Bayesian classifiers can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities involved: the
posterior probability P(H|X) and the prior probability P(H).
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
It is not a single algorithm but a family of algorithms where all of them share a common
principle, i.e. every pair of features being classified is independent of each other.
To start with, let us consider a dataset.
Consider a fictional dataset that describes the weather conditions for playing a game of golf.
Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No")
for playing golf.
In tabular form, the dataset has the columns Outlook, Temperature, Humidity, Windy and Play golf.
The dataset is divided into two parts, namely, feature matrix and the response vector.
• The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of
the values of the predictor features. In the above dataset, the features are 'Outlook', 'Temperature',
'Humidity' and 'Windy'.
• The response vector contains the value of the class variable (prediction or output) for each row of
the feature matrix. In the above dataset, the class variable name is 'Play golf'. A small Naive Bayes
sketch on an encoded version of this data follows.
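A hedged sketch of a Naive Bayes classifier on a tiny, made-up integer encoding of the golf data (only Outlook and Windy are used); the rows are illustrative assumptions, not the full dataset.

# Categorical Naive Bayes: learn P(class) and per-feature conditionals, then predict
from sklearn.naive_bayes import CategoricalNB

# Outlook: 0=sunny, 1=overcast, 2=rainy; Windy: 0=false, 1=true
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = ['no', 'no', 'yes', 'yes', 'no', 'yes']   # Play golf

clf = CategoricalNB().fit(X, y)
print(clf.predict([[1, 0]]))                  # overcast, not windy
print(clf.predict_proba([[1, 0]]))            # class membership probabilities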
K-Nearest Neighbors (KNN)
The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new
data points, which means that a new data point will be assigned a value based on how
closely it matches the points in the training set. We can understand its working with the help
of the following steps −
Step 1 − For implementing any algorithm, we need dataset. So during the first step of KNN, we
must load the training as well as test data.
Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K can be any
integer.
Step 3 − For each point in the test data do the following −
• 3.1 − Calculate the distance between the test data and each row of training data with the
help of any of the methods, namely Euclidean, Manhattan or Hamming distance. The
most commonly used method to calculate distance is Euclidean.
• 3.2 − Now, based on the distance value, sort them in ascending order.
• 3.3 − Next, it will choose the top K rows from the sorted array.
• 3.4 − Now, it will assign a class to the test point based on most frequent class of these
rows.
Step 4 − End
Example
The following is an example to understand the concept of K and the working of the KNN algorithm −
Suppose we have a dataset of points belonging to two classes, Blue and Red.
Now, we need to classify a new data point at (60, 60) into the Blue or Red class. We assume K = 3,
i.e. the algorithm finds the three nearest data points to (60, 60).
Of the three nearest neighbours of the new data point, two lie in the Red class; hence the new point
is also assigned to the Red class (a small sketch follows).
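A hedged sketch of the example above with scikit-learn: classify the point (60, 60) with K = 3; the Red/Blue training points are invented to mimic the described scenario.

# KNN with K = 3 and the default Euclidean distance
from sklearn.neighbors import KNeighborsClassifier

X_train = [[55, 60], [58, 62], [62, 58], [20, 25], [25, 20], [30, 30]]
y_train = ['Red', 'Red', 'Blue', 'Blue', 'Blue', 'Red']

knn = KNeighborsClassifier(n_neighbors=3)     # Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[60, 60]]))                # majority class among the 3 nearest points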
IF-THEN Rules
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part (the condition) consists of one or more attribute tests, and these tests
are logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes) => (buys computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Module IV
Cluster Analysis
Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.
Requirements of Clustering
The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Discovery of clusters with attribute shape − The clustering algorithm should be capable
of detecting clusters of arbitrary shape. They should not be bounded to only distance
measures that tend to find spherical cluster of small sizes.
• High dimensionality − The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the time to
perform clustering should grow roughly according to the complexity order of the algorithm. For
example, if we perform K-means clustering, we know it is O(n), where n is the number of
objects in the data. If we raise the number of data objects tenfold, then the time taken to
cluster them should also increase approximately ten times.
2. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find arbitrary shape clusters. They should not be
limited to only distance measurements that tend to discover a spherical cluster of small sizes.
3. Ability to deal with different kinds of attributes:
Algorithms should be capable of being applied to any kind of data, such as interval-based
(numeric) data, binary data, and categorical data.
4. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such
data and may result in poor quality clusters.
6. High dimensionality:
The clustering tools should not only be able to handle low-dimensional data but also
high-dimensional data spaces.
Types of data in cluster analysis
1. Interval-scaled variables
2. Binary variables
3. Nominal, Ordinal, and Ratio variables
4. Variables of mixed types
1.Interval-scaled variables
Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples
include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and
weather temperature.
• The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for
weight, may lead to a very different clustering structure.
2.Binary Variables
A binary variable has only two states, 0 or 1, e.g., smoker (yes or no).
Ordinal Variables
An ordinal variable resembles a categorical variable, except that its states are ordered in a
meaningful sequence, e.g., medal (bronze < silver < gold).
Clustering methods
1.Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k'
partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it will classify
the data into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
• For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to other.
2.K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group, whose members have similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (These can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of each
cluster, and continue until the assignments no longer change. (A minimal K-Means sketch follows.)
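A hedged sketch of K-Means with scikit-learn on made-up 2-D points, using K = 2.

# Fit K-Means and inspect the cluster assignments and final centroids
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # centroids after iterative reassignment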
Types of outliers
Outliers can also come in different flavours, depending on the environment: point
outliers, contextual outliers, or collective outliers.
1. Point outliers are single data points that lie far from the rest of the distribution.
2. Contextual outliers can be noise in data, such as punctuation symbols when performing text
analysis, or a background noise signal when doing speech recognition.
3. Collective outliers can be subsets of novelties in data such as a signal that may indicate
the discovery of new phenomena
Z-score method
The z-score method assumes that the data follow a Gaussian (normal) distribution, which makes
z-score a parametric method. Very frequently, data points are not well described by a Gaussian
distribution; this problem can be solved by applying transformations to the data, i.e. scaling it.
Some Python libraries like SciPy and scikit-learn have easy-to-use functions and classes for an
easy implementation, along with Pandas and NumPy.
After making the appropriate transformations to the selected feature space of the dataset, the z-
score of any data point can be calculated with the following expression:
z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
When computing the z-score for each sample in the data set, a threshold must be specified.
Some good rule-of-thumb thresholds are 2.5, 3, 3.5 or more standard deviations.
By 'tagging' or removing the data points that lie beyond a given threshold, we are classifying
them as outliers. Z-score is a simple yet powerful method to get rid of outliers in data if you are
dealing with roughly Gaussian-distributed (parametric) data. A minimal sketch follows.
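A hedged sketch of z-score outlier tagging on made-up 1-D data with a threshold of 3 standard deviations.

# Compute z-scores and tag points whose absolute z-score exceeds the threshold
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 13, 11, 95])  # 95 is an obvious outlier
z = (data - data.mean()) / data.std()
threshold = 3
print(data[np.abs(z) > threshold])   # points tagged as outliers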