
MUSLIM ASSOCIATION COLLEGE OF ARTS AND SCIENCE

Panavoor, Thiruvananthapuram, Kerala
(Affiliated to the University of Kerala)

Department of Computer Science

CS1641 Data Mining and Warehousing

Name: _____________________________________

Candidate Code: ____________________________


CS1641 Data Mining and Warehousing

SYLLABUS
Module I: Introduction:-Data, Information, Knowledge, KDD, types of data for mining,
Application domains, data mining functionalities/tasks. Data processing—Understanding data,
pre-processing data-Form of data processing, Data cleaning (definition and Phases only), Need
for data integration, Steps in data transformation, Need of data reduction

Module II: Data Warehouses-Databases, Data warehouses, Data Mart, Databases Vs Data
warehouses, Data warehouses Vs Data Mart, OLTP, OLAP, OLAP operations/functions, OLAP
Multi-Dimensional Models- Data cubes, Star, Snow Flakes, Fact constellation. Association rules-
Market Basket Analysis, Criteria for classifying frequent pattern mining, Mining Single
Dimensional Boolean Association rule-Apriori algorithm

Module III: Classification- Classification Vs Prediction, Issues, Decision trees, Bayes


classification- Bayes Theorem, Naïve Bayesian classifier, K Nearest Neighbour method, Rule-
Based classification-Using IF-THEN rules for classification

Module IV: Cluster analysis: definition and Requirements, Characteristics of clustering


techniques, Types of data in cluster analysis, categories of clustering-Partitioning methods, K-
Mean and K - method only, outlier detection in clustering.


Module I

Introduction
Data, Information, Knowledge, KDD, types of data for mining, Application domains, data mining
functionalities/tasks. Data processing—Understanding data, pre-processing data-Form of data
processing, Data cleaning (definition and Phases only), Need for data integration, Steps in data
transformation, Need of data reduction


DATA
• Data is a collection of facts in a raw or unorganized form such as numbers or characters.
• However, without context, data can mean little. For example, 12012012 is just a sequence of
numbers without apparent importance. But if we view it in the context of 'this is a date', we can
easily recognize the 12th of January, 2012. By adding context and value to the numbers, they now
have more meaning.

INFORMATION
• Information is prepared data that has been processed, aggregated and organized into a more
human-friendly format that provides more context. Information is often delivered in the form
of data visualizations, reports, and so on.
• Information addresses the requirements of a user, giving it significance and usefulness as it is
the product of data that has been interpreted to deliver a logical meaning.

KNOWLEDGE
• Knowledge means the familiarity and awareness of a person, place, events, ideas, issues, ways
of doing things or anything else, which is gathered through learning, perceiving or discovering.
It is the state of knowing something with cognizance through the understanding of concepts,
study and experience.

What is Data Mining?


Data Mining is defined as extracting information from huge sets of data. In other words, we can say that
data mining is the procedure of mining knowledge from data.

Why we need Data Mining?

The volume of data that we have to handle is increasing every day, coming from business transactions,
scientific data, sensor data, pictures, videos, etc. So, we need a system capable of extracting the
essence of the information available and automatically generating reports, views or summaries of the
data for better decision-making.

Why Data Mining is used in Business?

Data mining is used in business to make better managerial decisions by:

1. Automatic summarization of data


2. Extracting essence of information stored.
3. Discovering patterns in raw data.


Knowledge Discovery in Databases (KDD)


Data Mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction
of implicit, previously unknown and potentially useful information from data stored in databases.

Steps Involved in KDD Process:

1.Data Cleaning: Data cleaning is defined as removal of noisy and irrelevant data from collection.

• Cleaning in case of Missing values.


• Cleaning noisy data, where noise is a random error or variance in a measured variable.
• Cleaning with Data discrepancy detection and Data transformation tools.

2.Data Integration: Data integration is defined as combining heterogeneous data from multiple sources
into a common source (Data Warehouse).

• Data integration using Data Migration tools.


• Data integration using Data Synchronization tools.
• Data integration using the ETL (Extract, Transform, Load) process.

3.Data Selection: Data selection is defined as the process where data relevant to the analysis is decided
and retrieved from the data collection.

• Data selection using Neural network.


• Data selection using Decision Trees.
• Data selection using Naive Bayes.
• Data selection using Clustering, Regression, etc.

4.Data Transformation: Data Transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. Data Transformation is a two-step process:

• Data Mapping: Assigning elements from source base to destination to capture transformations.
• Code generation: Creation of the actual transformation program.

5.Data Mining: Data mining is defined as the application of clever techniques to extract potentially
useful patterns.

• Transforms task relevant data into patterns.


• Decides purpose of model using classification or characterization.

6.Pattern Evaluation: Pattern Evaluation is defined as identifying interesting patterns
representing knowledge, based on given interestingness measures.

• Find interestingness score of each pattern.


• Uses summarization and Visualization to make data understandable by user.


7.Knowledge representation: Knowledge representation is defined as technique which utilizes


visualization tools to represent data mining results.

• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.

TYPES OF DATA MINING


Data mining includes the utilization of refined data analysis tools to find previously unknown, valid
patterns and relationships in huge data sets.

1. Classification:

This technique is used to obtain important and relevant information about data and metadata. This data
mining technique helps to classify data in different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia, spatial data, text
data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example object-oriented database,
transactional database, relational database, and so on.


iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine
learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-
oriented, etc.

2.Clustering

• Clustering is a division of information into groups of connected objects. Describing the data by a
few clusters loses certain fine details, but achieves simplification. Clustering models data
by its clusters. From a historical point of view, clustering is rooted in statistics,
mathematics, and numerical analysis.
• Clustering analysis is a data mining technique to identify similar data. This technique helps to
recognize the differences and similarities between the data. Clustering is very similar to
classification, but it involves grouping chunks of data together based on their similarities.

3. Regression:

Regression analysis is the data mining technique used to identify and analyze the relationship between
variables in the presence of other factors. It is used to define the probability of a specific
variable. Regression is primarily a form of planning and modeling. For example, we might use it to project
certain costs, depending on other factors such as availability, consumer demand, and competition.
Primarily it gives the exact relationship between two or more variables in the given data set.

4. Association Rules:

This data mining technique helps to discover a link between two or more items. It finds a hidden pattern
in the data set.

Association rules are if-then statements that help to show the probability of interactions between
data items within large data sets in different types of databases. Association rule mining has several
applications and is commonly used to help find sales correlations in transactional data or in medical data sets.

The way the algorithm works is that you have various data, for example a list of grocery items that you
have been buying for the last six months. It calculates the percentage of items being purchased together.

5. Outlier detection:

This type of data mining technique relates to the observation of data items in the data set, which do not
match an expected pattern or expected behavior. This technique may be used in various domains like


intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. An
outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world
datasets contain outliers.

6. Sequential Patterns:

The sequential pattern is a data mining technique specialized for evaluating sequential data to discover
sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the
value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends, clustering, classification,
etc. It analyzes past events or instances in the right sequence to predict a future event.

APPLICATION DOMAINS OF DATA MINING


Data mining is widely used in diverse areas. There are a number of commercial data mining systems
available today, and yet there are many challenges in this field. Here we will discuss the applications
and the trends of data mining.
Here is the list of areas where data mining is widely used −

1. Financial Data Analysis


2. Retail Industry
3. Telecommunication Industry
4. Biological Data Analysis
5. Other Scientific Applications
6. Intrusion Detection

1.Financial Data Analysis


The financial data in banking and financial industry is generally reliable and of high quality which
facilitates systematic data analysis and data mining. Some of the typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis and data
mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.


2.Retail Industry
Data Mining has great application in the Retail Industry because it collects large amounts of data on
sales, customer purchasing history, goods transportation, consumption and services.
Data mining in the retail industry helps in identifying customer buying patterns and trends that lead to
improved quality of customer service and good customer retention and satisfaction. Here is the list of
examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.

3.Telecommunication Industry
Today the telecommunication industry is one of the most emerging industries providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission,
etc. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has become very
important in helping to understand the business.

4.Biological Data Analysis


In recent times, we have seen a tremendous growth in the field of biology such as genomics,
proteomics, functional Genomics and biomedical research. Biological data mining is a very important
part of Bioinformatics. Following are the aspects in which data mining contributes for biological data
analysis −
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.

5.Other Scientific Applications


The applications discussed above tend to handle relatively small and homogeneous data sets for which
statistical techniques are appropriate. Huge amounts of data have also been collected from scientific
domains such as geosciences, astronomy, etc. Following are the applications of data mining in the field
of Scientific Applications −

• Data Warehouses and data preprocessing.


• Graph-based mining.
• Visualization and domain specific knowledge.


6.Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of
network resources. In this world of connectivity, security has become a major issue. The increased
usage of the internet and the availability of tools and tricks for intruding and attacking networks have
prompted intrusion detection to become a critical component of network administration. Here is the list
of areas in which data mining technology may be applied for intrusion detection −
• Development of data mining algorithm for intrusion detection.
• Association and correlation analysis, aggregation to help select and build discriminating
attributes.
• Analysis of Stream data.
DATA MINING FUNCTIONALITIES/TASKS
Data mining deals with the kind of patterns that can be mined. On the basis of the kind of data
to be mined, there are two categories of functions involved in Data Mining −

• Descriptive
• Classification and Prediction

1.Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −

• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters
Class/Concept Description

Class/Concept refers to the data to be associated with the classes or concepts. For example, in
a company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived by the following two
ways −
• Data Characterization − This refers to summarizing data of the class under study. This class
under study is called the Target Class.
• Data Discrimination − It refers to the mapping or classification of a class with some
predefined group or class.


Mining of Frequent Patterns

Frequent patterns are those patterns that occur frequently in transactional data. Here is the
list of kind of frequent patterns −
• Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
• Frequent Subsequence − A sequence of patterns that occur frequently, such as
purchasing a camera being followed by purchasing a memory card.
• Frequent Sub Structure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item-sets or subsequences.
Mining of Association

Associations are used in retail sales to identify patterns that are frequently purchased
together. This process refers to the process of uncovering the relationship among data and
determining association rules.
For example, a retailer generates an association rule that shows that 70% of the time milk is sold
with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations

It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two itemsets, to analyze whether they
have a positive, negative or no effect on each other.
Mining of Clusters

Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming group of
objects that are very similar to each other but are highly different from the objects in other
clusters.

2.Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −

• Classification (IF-THEN) Rules


• Decision Trees
• Mathematical Formulae
• Neural Networks


The list of functions involved in these processes is as follows −

• Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e. data
objects whose class labels are known.
• Prediction − It is used to predict missing or unavailable numerical data values rather
than class labels. Regression Analysis is generally used for prediction. Prediction can
also be used for identification of distribution trends based on available data.
• Outlier Analysis − Outliers may be defined as the data objects that do not comply with
the general behavior or model of the data available.
• Evolution Analysis − Evolution analysis refers to the description and modeling of regularities
or trends for objects whose behavior changes over time.
Data Processing
Data processing occurs when data is collected and translated into usable information. Usually
performed by a data scientist or a team of data scientists, data processing must be done
correctly so as not to negatively affect the end product, or data output.

Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized
by employees throughout an organization.

Six stages of data processing


1. Data collection

Collecting data is the first step in data processing. Data is pulled from available sources, including data
lakes and data warehouses.

2. Data preparation

Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to
as "pre-processing", is the stage at which raw data is cleaned up and organized for the following stage of
data processing.

3. Data input

The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse
like Redshift), and translated into a language that it can understand.


4. Processing

During this stage, the data inputted to the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms.

5. Data output/interpretation

The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is
translated, readable, and often in the form of graphs, videos, images, plain text, etc.

6. Data storage

The final stage of data processing is storage. After all of the data is processed, it is then stored for future
use. While some information may be put to use immediately, much of it will serve a purpose later on.

Preprocessing of Data
Data preprocessing is a data mining technique which is used to transform the raw data in a
useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some data is missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by
attribute mean or the most probable value.

(b). Noisy Data:


Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to
faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
Each segment is handled separately.


2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.

2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process. This
involves following ways:

1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); a minimal
sketch of this is shown after this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute 'city' can be converted to 'country'.

3.Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder while working with
such volumes. To overcome this, we use data reduction techniques. Data reduction aims to increase
storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute
selection, one can use the level of significance and the p-value of the attribute; an attribute having a
p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: Regression Models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless.


Data Integration

Data Integration is a data preprocessing technique that involves combining data from
multiple heterogeneous data sources into a coherent data store and provide a unified view of
the data.
These sources may include multiple data cubes, databases or flat files.
The data integration approach is formally defined as a triple <G, S, M> where
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mapping between the queries of the source and global schemas.


Module II

Data Warehouses -Databases, Data warehouses, Data Mart, Databases Vs Data warehouses,
Data warehouses Vs Data Mart, OLTP, OLAP, OLAP operations/functions, OLAP Multi-
Dimensional Models- Data cubes, Star, Snow Flakes, Fact constellation. Association rules-
Market Basket Analysis, Criteria for classifying frequent pattern mining, Mining Single
Dimensional Boolean Association rule-Apriori algorithm


Data Warehouse
• A data warehouse is a database, which is kept separate from the organization's
operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to analyze its
business.
• A data warehouse helps executives to organize, understand, and use their data to take
strategic decisions.
• Data warehouse systems help in the integration of a diversity of application systems.
• A data warehouse system helps in consolidated historical data analysis.

Data Mart

• A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as Sales or Finance or Marketing.
• Data marts are often built and controlled by a single department within an organization.
Given their single-subject focus, data marts usually draw data from only a few sources.
• The sources could be internal operational systems, a central data warehouse, or
external data.

Database vs Data warehouse

• A database is a collection of related data that represents some elements of the real world,
whereas a data warehouse is an information system that stores historical and cumulative
data from single or multiple sources.
• Database is designed to record data whereas the Data warehouse is designed to analyze
data.
• Database is application-oriented-collection of data whereas Data Warehouse is the
subject-oriented collection of data.
• Database uses Online Transactional Processing (OLTP) whereas Data warehouse uses
Online Analytical Processing (OLAP).
• Database tables and joins are complicated because they are normalized whereas Data
Warehouse tables and joins are easy because they are denormalized.
• ER modeling techniques are used for designing Database whereas data modeling
techniques are used for designing Data Warehouse.


Datamart vs Datawarehouse

• A Data Warehouse is a large repository of data collected from different sources, whereas a
Data Mart is only a subtype of a data warehouse.
• Data Warehouse is focused on all departments in an organization whereas Data Mart
focuses on a specific group.
• Data Warehouse designing process is complicated whereas the Data Mart process is
easy to design.
• Data Warehouse takes a long time for data handling whereas Data Mart takes a short
time for data handling.
• Data Warehouse size range is 100 GB to 1 TB+ whereas Data Mart size is less than 100
GB.
• Data Warehouse implementation process takes 1 month to 1 year whereas Data Mart
takes a few months to complete the implementation process.

Online Transaction Processing (OLTP)

• The full form of OLTP is Online Transaction Processing.


• OLTP is an operational system that supports transaction-oriented applications in a 3-tier
architecture.
• It administers the day to day transaction of an organization.
• OLTP is basically focused on query processing, maintaining data integrity in multi-access
environments as well as effectiveness that is measured by the total number of
transactions per second.

Characteristics of OLTP

Following are important characteristics of OLTP:

• OLTP uses transactions that include small amounts of data.


• Indexed data in the database can be accessed easily.
• OLTP has a large number of users.
• It has fast response times
• Databases are directly accessible to end-users
• OLTP uses a fully normalized schema for database consistency.
• The response time of OLTP system is short.
• It strictly performs only the predefined operations on a small number of records.
• OLTP stores the records of the last few days or a week.
• It supports complex data models and tables.


Type of queries that an OLTP system can Process:

An OLTP system is an online database modifying system. Therefore, it supports database queries such
as inserting, updating, and deleting information from the database.

Consider a point of sale system of a supermarket, following are the sample queries that this
system can process:

• Retrieving the description of a particular product.


• Filtering all products related to the supplier.
• Searching the record of the customer.
• Listing products having a price less than the expected amount.

Online Analytical Processing Server (OLAP)

• Online Analytical Processing Server (OLAP) is based on the multidimensional data


model.
• It allows managers and analysts to get insight into the information through fast,
consistent, and interactive access to information.

OLAP Operations

Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −

• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −

• By climbing up a concept hierarchy for a dimension


• By dimension reduction

Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −

• By stepping down a concept hierarchy for a dimension


• By introducing a new dimension.

Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data.

Multidimensional Data Model

• The multidimensional data model is an integral part of On-Line Analytical Processing, or


OLAP.
• Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries
during interactive sessions, not in batch jobs that run overnight. And because OLAP is
also analytic, the queries are complex.
• The multidimensional data model is designed to solve complex queries in real time.
• The multidimensional data model is composed of logical cubes, measures, dimensions,
hierarchies, levels, and attributes.
• The simplicity of the model is inherent because it defines objects that represent real-
world business entities.
• Analysts know which business measures they are interested in examining, which
dimensions and attributes make the data meaningful, and how the dimensions of their
business are organized into levels and hierarchies.

The following are the multidimensional data models:

1. Data Cube
2. Star
3. Snow Flakes
4. Fact constellation


1.Data Cube

• When data is grouped or combined into multidimensional matrices, these are called Data Cubes.


• A data cube is created from a subset of attributes in the database.
• Specific attributes are chosen to be measure attributes, i.e., the attributes whose values
are of interest.
• Other attributes are selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions.
• Data cube method is an interesting technique with many applications.
• Data cubes could be sparse in many cases because not every cell in each dimension may
have corresponding data in the database.

2.Star

• The Star Schema is a data warehouse schema in which the center of the star has one fact table
and a number of associated dimension tables.
• It is known as star schema as its structure resembles a star.
• The Star Schema data model is the simplest type of Data Warehouse schema.
• It is also known as Star Join Schema and is optimized for querying large data sets.
• In the following Star Schema example, the fact table is at the center which contains keys
to every dimension table like Dealer_ID, Model ID, Date_ID, Product_ID, Branch_ID &
other attributes like Units sold and revenue.

Example of Star Schema Diagram


3.Snowflake

• Snowflake Schema in data warehouse is a logical arrangement of tables in a


multidimensional database such that the ER diagram resembles a snowflake shape.
• A Snowflake Schema is an extension of a Star Schema, and it adds additional
dimensions. The dimension tables are normalized which splits data into additional
tables.
• In the following Snowflake Schema example, Country is further normalized into an
individual table.

Example of Snowflake Schema

4.Fact Constellation Schema

• A Fact constellation means two or more fact tables sharing one or more dimensions. It is
also called Galaxy schema.
• Fact Constellation Schema describes a logical structure of data warehouse or data mart.
Fact Constellation Schema can design with a collection of de-normalized FACT, Shared,
and Conformed Dimension tables.


The Fact Constellation Schema is a sophisticated database design in which it is difficult to summarize
information. A Fact Constellation Schema can be implemented between aggregate fact tables, or by
decomposing a complex fact table into independent simpler fact tables.

Association Rule

Association rule mining finds interesting associations and relationships among large sets of
data items. This rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis.

What is Market basket analysis?

• Market Basket Analysis is one of the fundamental techniques used by large retailers to
uncover the association between items.
• In other words, it allows retailers to identify the relationship between items which are
more frequently bought together.

Let’s understand the concept with an example:

Assume we have a data set of 20 customers who visited the grocery store out of which 11 made
the purchase:
Customer 1: Bread, egg, papaya and oat packet
Customer 2: Papaya, bread, oat packet and milk
Customer 3: Egg, bread, and butter
Customer 4: Oat packet, egg, and milk
Customer 5: Milk, bread, and butter
Customer 6: Papaya and milk
Customer 7: Butter, papaya, and bread
Customer 8: Egg and bread
Customer 9: Papaya and oat packet
Customer 10: Milk, papaya, and bread
Customer 11: Egg and milk

Here we observe that 3 customers have bought bread and butter together. The outcome of this
technique can be understood merely as "if this, then that" (if a customer buys bread, there is a
chance the customer will also buy butter).


What are frequent patterns?

• Frequent patterns are collections of items which appear in a data set at an important
frequency (usually greater than a predefined threshold) and can thus reveal association
rules and relations between variables.
• Frequent pattern mining is a research area in data science applied to many domains such
as recommender systems (what are the set of items usually ordered together),
bioinformatics (what are the genes co-expressed in a given condition), decision making,
clustering, website navigation.
• Input data is usually stored in a database or as a collection of transactions.
• A transaction is a collection of items which have been observed together (e.g. the list of
products ordered by a customer during a shopping session or the list of expressed genes
in a condition).

Criteria for classifying frequent pattern mining

1. Horizontal layout: a two-column structure in which one column contains transaction ids and the
second one the list of associated items, for instance:
transaction1: [item1, item2, item7]
transaction2: [item1, item2, item5]
transactionk: [item1, itemk]

2. Vertical layout: a two-column structure in which one column contains individual item ids and the
second one the associated transaction ids, for instance:
item1: [transaction1, transaction5]
item2: [transaction2, transaction4]
itemk: [transactionk]

Mining Single Dimensional Boolean Association rule-A priori algorithm

• Apriori algorithm uses data organized by horizontal layout. It is founded on the fact that
if a subset S appears k times in a database, any other subset S1 which contains S will
appear k times or less.

• This implies that when deciding on a minimum support threshold (the minimum frequency an
itemset needs to have in order not to be discarded), we can avoid calculating S1 or any
other superset of S if support(S) < minimum support. It can be said that all such
candidates are discarded a priori.

• The algorithm computes the counts for all itemsets of k elements (starting with k = 1).
During the next iterations the previous sets are being joined and thus we create all


possible k + 1 itemsets. The combinations appearing at a frequency below the
minimum support rate are discarded. The iterations end when no further
extensions (joins) are found.


• The Apriori algorithm produces a large number of candidate itemsets (with possible duplicates)
and performs many database scans (equal to the maximum length of a frequent itemset).
It is thus very expensive to run on large databases.

Python Implementation of the Apriori algorithm

# Uses the third-party apyori package (pip install apyori).
from apyori import apriori

# Each inner list is one transaction (one market basket).
transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
]

# Run Apriori with a minimum support threshold; the result is a list of
# frequent itemsets together with the association rules derived from them.
results = list(apriori(transactions, min_support=0.5))


Module III

Classification- Classification Vs Prediction, Issues, Decision trees, Bayes classification- Bayes


Theorem, Naïve Bayesian classifier, K Nearest Neighbour method, Rule-Based classification-
Using IF-THEN rules for classification


Classification

• Classification is a data mining function that assigns items in a collection to target


categories or classes.
• The goal of classification is to accurately predict the target class for each case in the
data.
• For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
• A classification task begins with a data set in which the class assignments are known.
• For example, a classification model that predicts credit risk could be developed based on
observed data for many loan applicants over a period of time.

Classification Vs Prediction.

• Classification is the method of recognizing to which group a new observation belongs, on the
basis of a training data set containing observations whose group membership is known.
• Prediction is the method of estimating missing or unavailable numerical data values for a new
observation.
• A classifier is built to predict categorical (class) labels.
• A predictor is built to predict a continuous-valued function or ordered value.
• In classification, accuracy depends on detecting the class label correctly.
• In prediction, accuracy depends on how well a given predictor can guess the value of the
predicted attribute for new data.
• In classification, the model can be called the classifier.
• In prediction, the model can be called the predictor.


Issues in Data Mining


Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here, we will discuss the major issues
regarding −

• Mining Methodology and User Interaction


• Performance Issues
• Diverse Data Types Issues


Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.

As an example, a decision tree for the concept buy_computer indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a test
on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

• It does not require any domain knowledge.


• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.


Bayes Classification
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the statistical
classifiers. Bayesian classifiers can predict class membership probabilities such as the
probability that a given tuple belongs to a particular class.

Bayes' Theorem

Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −

• Posterior Probability [P(H|X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayes' Classifier:

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
It is not a single algorithm but a family of algorithms where all of them share a common
principle, i.e. every pair of features being classified is independent of each other.
To start with, let us consider a dataset.
Consider a fictional dataset that describes the weather conditions for playing a game of golf.
Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No")
for playing golf.
Here is a tabular representation of our dataset.


The dataset is divided into two parts, namely, the feature matrix and the response vector.
• The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of
the values of the dependent features. In the above dataset, the features are 'Outlook', 'Temperature',
'Humidity' and 'Windy'.
• The response vector contains the value of the class variable (prediction or output) for each row of
the feature matrix. In the above dataset, the class variable name is 'Play golf'.

KNN Algorithm - Finding Nearest Neighbors


K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used
for both classification as well as regression predictive problems. However, it is mainly used for
classification predictive problems in industry. The following two properties would define KNN
well −
• Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase and uses all the data for training while classification.
• Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm
because it doesn't assume anything about the underlying data.

Working of KNN Algorithm

The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new
data points, which means that a new data point will be assigned a value based on how
closely it matches the points in the training set. We can understand its working with the help
of the following steps −
Step 1 − For implementing any algorithm, we need dataset. So during the first step of KNN, we
must load the training as well as test data.
Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K can be any
integer.
Step 3 − For each point in the test data do the following −
• 3.1 − Calculate the distance between the test data and each row of the training data with the
help of any of the methods, namely Euclidean, Manhattan or Hamming distance. The
most commonly used method to calculate distance is Euclidean.
• 3.2 − Now, based on the distance value, sort them in ascending order.
• 3.3 − Next, it will choose the top K rows from the sorted array.
• 3.4 − Now, it will assign a class to the test point based on most frequent class of these
rows.
Step 4 − End

Muslim Association college of Arts and Science Page 31

Downloaded by Midhun Manoj ([email protected])


lOMoARcPSD|23068105

S6 B.Sc Computer Science CS1641 Data Mining and Warehousing

Example
The following is an example to understand the concept of K and the working of the KNN algorithm −
Suppose we have a dataset of points, each belonging to either a blue class or a red class.

Now, we need to classify a new data point (at point 60,60) into the blue or red class. We are
assuming K = 3, i.e. the algorithm finds the three nearest data points.

Among these three nearest neighbors of the new data point, two of them lie in the red class,
hence the new point will also be assigned to the red class.


Rule Based Classification IF-THEN Rules


A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form −
IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes

Points to remember −
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and these tests
are logically ANDed.
• The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.


Module IV

Cluster analysis: definition and Requirements, Characteristics of clustering techniques, Types of


data in cluster analysis, categories of clustering-Partitioning methods, K-Mean and K - method
only, outlier detection in clustering.


Cluster Analysis
Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar objects.
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.

Requirements of Clustering

The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.

• Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data such as interval-based (numerical) data, categorical data, and
binary data.

• Discovery of clusters with attribute shape − The clustering algorithm should be capable
of detecting clusters of arbitrary shape. They should not be bounded to only distance
measures that tend to find spherical clusters of small size.

• High dimensionality − The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.

• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.

• Interpretability − The clustering results should be interpretable, comprehensible, and


usable.


Characteristics of Cluster Analysis

1. Scalability:

Scalability in clustering implies that as we increase the number of data objects, the time to
perform clustering should approximately scale with the complexity order of the algorithm. For
example, if we perform K-means clustering, we know it is O(n), where n is the number of
objects in the data. If we raise the number of data objects 10-fold, then the time taken to
cluster them should also approximately increase 10 times.

2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with attribute shape:

The clustering algorithm should be able to find arbitrary shape clusters. They should not be
limited to only distance measurements that tend to discover a spherical cluster of small sizes.

4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any data such as data based on intervals
(numeric), binary data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Few algorithms are sensitive to such
data and may result in poor quality clusters.

6. High dimensionality:

The clustering tools should be able to handle not only low-dimensional data but also
high-dimensional data.


Types of Data/ Variables Used In Cluster Analysis

1. Interval-Scaled variables
2. Binary variables
3. Nominal, Ordinal, and Ratio variables
4. Variables of mixed types

1.Interval-scaled variables

• Interval-scaled variables are continuous measurements of a roughly linear scale.

Typical examples include weight and height, latitude and longitude coordinates (e.g.,
when clustering houses), and weather temperature.
• The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for
weight, may lead to a very different clustering structure.

2.Binary Variables

• A binary variable is a variable that can take only 2 values.


• For example, a gender variable can generally take 2 values, male and female.

3.Nominal or Categorical Variables


A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow,
blue, green.

Ordinal Variables

• An ordinal variable can be discrete or continuous.


• In this, order is important, e.g., rank.
Ratio-Scaled Variables

Ratio-scaled variable: It is a positive measurement on a nonlinear scale, approximately at an
exponential scale, such as Ae^(Bt) or Ae^(-Bt).


4.Variables Of Mixed Type

• A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled.
These are collectively called mixed-type variables.

Clustering methods
1.Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k'
partitions of the data. Each partition will represent a cluster and k ≤ n. It means that it will classify
the data into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
• For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to another.
2.K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group, whose members have similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Outlier Detection In Clustering.


Outliers are extreme values that deviate from other observations in the data; they may indicate
variability in a measurement, experimental errors or a novelty. In other words, an outlier is an
observation that diverges from the overall pattern of a sample.

Types of outliers

Outliers can be of two kinds: univariate and multivariate.

1. Univariate outliers can be found when looking at a distribution of values in a single


feature space.
2. Multivariate outliers can be found in an n-dimensional space (of n features). Looking at
distributions in n-dimensional spaces can be very difficult for the human brain, which is
why we need to train a model to do it for us.

Outliers can also come in different flavours, depending on the environment: point
outliers, contextual outliers, or collective outliers.

1. Point outliers are single data points that lie far from the rest of the distribution.
2. Contextual outliers can be noise in data, such as punctuation symbols when performing text
analysis or a background noise signal when doing speech recognition.
3. Collective outliers can be subsets of novelties in data such as a signal that may indicate
the discovery of new phenomena


Outlier Detection Method: Z-Score


The z-score or standard score of an observation is a metric that indicates how many standard
deviations a data point is from the sample's mean, assuming a Gaussian distribution.

This makes the z-score a parametric method. Very frequently, data points are not described by a
Gaussian distribution; this problem can be solved by applying transformations to the data, i.e.
scaling it.

Some Python libraries like SciPy and scikit-learn have easy-to-use functions and classes for an
easy implementation, along with Pandas and NumPy.

After making the appropriate transformations to the selected feature space of the dataset, the
z-score of any data point can be calculated with the following expression:

z = (x − mean) / standard deviation

When computing the z-score for each sample on the data set a threshold must be specified.

Some good 'thumb-rule' thresholds can be: 2.5, 3, 3.5 or more standard deviations.


By 'tagging' or removing the data points that lie beyond a given threshold, we are classifying
the data into outliers and non-outliers.

The z-score is a simple, yet powerful method to get rid of outliers in data if you are dealing with
parametric distributions in a low-dimensional feature space. For nonparametric problems,
DBSCAN and Isolation Forests can be good solutions.
