DM Mod 1


Why do we need Data Mining?

Have a look at the points below, which explain why data mining is required.

1. Data mining is the process of examining large sets of data in order to
identify insights within that data. Nowadays, demand in the data
industry is growing rapidly, which has also increased the demand for
data analysts and data scientists.
2. With this technique, we analyze data and convert it into meaningful
information. This helps a business make better and more accurate
decisions.
3. Data mining helps to develop smart marketing decisions, run accurate
campaigns, make predictions, and more.
4. With the help of data mining, we can analyze customer behavior and gain
insight into it. This leads to great success and a data-driven business.

Data mining and its process

Data mining is an iterative process. Take a look at the following steps.

1 – Requirement gathering

Data mining projects start with requirement gathering and understanding. Data
mining analysts or users define the scope of the requirements with the vendor
from a business perspective. Once the scope is defined, we move to the next phase.

2 – Data exploration

In this step, data mining experts gather, evaluate, and explore the
requirements of the project. They understand the problems and challenges and
convert them into metadata. Statistics are then used to identify data
patterns.

3 – Data preparation

Data mining experts convert the data into meaningful information for the
modeling step. They use the ETL process: extract, transform, and load. They
are also responsible for creating new data attributes. Various tools are used
here to present the data in a structured format without changing the meaning
of the data sets.
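The ETL flow described above can be sketched in a few lines of Python. The field names (item, qty, price) and the derived total attribute are hypothetical, chosen only to illustrate the extract, transform, and load stages:

```python
# A minimal ETL sketch. The source format and attribute names are made up
# for illustration; real pipelines read from files, APIs, or databases.

def extract(rows):
    """Extract: pull raw records out of a source (here, CSV-like strings)."""
    return [line.split(",") for line in rows]

def transform(records):
    """Transform: clean values and create a new attribute (total = qty * price)."""
    out = []
    for name, qty, price in records:
        out.append({"item": name.strip().lower(),   # normalize the naming
                    "qty": int(qty),
                    "price": float(price),
                    "total": int(qty) * float(price)})  # derived attribute
    return out

def load(records, store):
    """Load: append the structured records into the target store."""
    store.extend(records)
    return store

warehouse = []
load(transform(extract(["Milk,2,1.50", " Bread ,1,2.00"])), warehouse)
```

Note how the transform step both cleans the data (trimming and lower-casing names) and creates a new attribute, exactly the two responsibilities described above.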

4 – Modeling

Data experts apply their best tools in this step, as it plays a vital role in
the complete processing of the data. Modeling methods are applied to filter
the data appropriately. Modeling and evaluation are correlated steps and are
carried out together to check the parameters. Once the final modeling is done,
the outcome is checked for quality.

5 – Evaluation

This is the filtering step that follows modeling. If the outcome is not
satisfactory, it is fed back into the modeling step. Once the final outcome is
ready, the requirements are checked again with the vendor so that no point is
missed. Data mining experts judge the complete result at the end.

6 – Deployment

This is the final stage of the complete process. Experts present the data to vendors
in the form of spreadsheets or graphs.

What is Data Mining?


The process of extracting information from huge sets of data in order to
identify patterns, trends, and useful insights that allow a business to make
data-driven decisions is called Data Mining.

In other words, we can say that Data Mining is the process of investigating
hidden patterns in data from various perspectives in order to categorize it
into useful information. This information is collected and assembled in
particular areas such as data warehouses, analyzed efficiently with data
mining algorithms, and used to support decision making and other data
requirements, eventually cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information
to find trends and patterns that go beyond simple analysis procedures. Data
mining utilizes complex mathematical algorithms to segment the data and
evaluate the probability of future events. Data Mining is also called
Knowledge Discovery in Databases (KDD).

Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful
information.

Data Mining is similar to Data Science: it is carried out by a person, in a
specific situation, on a particular data set, with an objective. The process
includes various types of services such as text mining, web mining, audio and
video mining, pictorial data mining, and social media mining. It is done
through software that may be simple or highly specialized. By outsourcing data
mining, all the work can be done faster and at low operating cost. Specialized
firms can also use new technologies to collect data that is impossible to
locate manually. There is a vast amount of information available on various
platforms, but very little of it is accessible as knowledge. The biggest
challenge is to analyze the data and extract the important information that
can be used to solve a problem or develop the company. There are many powerful
tools and techniques available for mining data and finding better insights
from it.

Types of Data Mining


Data mining can be performed on the following types of data:

Relational Database:

A relational database is a collection of multiple data sets formally organized
into tables, records, and columns, from which data can be accessed in various
ways without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.

Data warehouses:

A data warehouse is a technology that collects data from various sources
within the organization to provide meaningful business insights. The huge
amount of data comes from multiple places such as Marketing and Finance. The
extracted data is used for analytical purposes and helps in decision-making
for a business organization. The data warehouse is designed for the analysis
of data rather than for transaction processing.

Data Repositories:

A data repository generally refers to a destination for data storage. However,
many IT professionals use the term more specifically to refer to a particular
kind of setup within an IT structure, for example, a group of databases where
an organization has kept various kinds of information.

Object-Relational Database:

A combination of an object-oriented database model and a relational database
model is called an object-relational model. It supports classes, objects,
inheritance, etc.

One of the primary objectives of the object-relational data model is to close
the gap between the relational database and the object-oriented model
practices frequently used in many programming languages, for example, C++,
Java, C#, and so on.

Transactional Database:

A transactional database refers to a database management system (DBMS) that
can undo a database transaction if it is not performed appropriately. Although
this was once a unique capability, today most relational database systems
support transactional database activities.

What type of data can be mined?

1. Flat Files
2. Relational Databases
3. DataWarehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)
1. Flat Files
• Flat files are data files in text or binary form with a
structure that can be easily extracted by data mining
algorithms.
• Data stored in flat files have no relationships or paths among
themselves; if a relational database is stored in flat files,
there will be no relations between the tables.
• Flat files are described by a data dictionary, e.g., a CSV file.
• Application: used in data warehousing to store data, used for
carrying data to and from a server, etc.
2. Relational Databases
• A Relational database is defined as the collection of data
organized in tables with rows and columns.
• Physical schema in Relational databases is a schema which
defines the structure of tables.
• Logical schema in Relational databases is a schema which
defines the relationship among tables.
• The standard API of relational databases is SQL.
• Application: Data Mining, ROLAP model, etc.
3. Data Warehouse
• A data warehouse is a collection of data integrated from
multiple sources that supports queries and decision making.
• There are three types of data warehouse: the enterprise data
warehouse, the data mart, and the virtual warehouse.
• Two approaches can be used to update data in a data warehouse:
the query-driven approach and the update-driven approach.
• Application: business decision making, data mining, etc.
4. Transactional Databases
• A transactional database is a collection of data organized by
timestamps, dates, etc. to represent transactions in databases.
• This type of database can roll back or undo an operation when a
transaction is not completed or committed.
• It is a highly flexible system where users can modify information
without changing any sensitive information.
• It follows the ACID properties of a DBMS.
• Application: banking, distributed systems, object databases,
etc.
5. Multimedia Databases
• Multimedia databases consist of audio, video, image, and text
media.
• They can be stored in object-oriented databases.
• They are used to store complex information in pre-specified
formats.
• Application: digital libraries, video-on-demand, news-on-demand,
musical databases, etc.
6. Spatial Database
• Store geographical information.
• Stores data in the form of coordinates, topology, lines,
polygons, etc.
• Application: Maps, Global positioning, etc.
7. Time-series Databases
• Time-series databases contain data such as stock exchange data
and user-logged activities.
• They handle arrays of numbers indexed by time, date, etc.
• They require real-time analysis.
• Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
• The World Wide Web (WWW) is a collection of documents and
resources such as audio, video, and text, which are identified by
Uniform Resource Locators (URLs), linked by HTML pages, and
accessed via web browsers over the Internet.
• It is the most heterogeneous repository, as it collects data from
multiple sources.
• It is dynamic in nature, as the volume of data is continuously
increasing and changing.
• Application: online shopping, job search, research, studying,
etc.

What kind of patterns can be mined?


Data mining deals with the kinds of patterns that can be mined. On the basis
of the kind of patterns to be mined, there are two categories of functions
involved in data mining −

• Descriptive
• Classification and Prediction

Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −

• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters

Class/Concept Description

Class/Concept refers to the data to be associated with classes or concepts.
For example, in a company, the classes of items for sale include computers and
printers, and concepts of customers include big spenders and budget spenders.
Such descriptions of a class or a concept are called class/concept
descriptions. These descriptions can be derived in the following two ways −
• Data Characterization − This refers to summarizing the data of the class
under study. The class under study is called the target class.
• Data Discrimination − This refers to the mapping or classification of a
class with respect to some predefined group or class.

Mining of Frequent Patterns

Frequent patterns are those patterns that occur frequently in transactional
data. The kinds of frequent patterns are as follows −
• Frequent Item Set − A set of items that frequently appear together,
for example, milk and bread.
• Frequent Subsequence − A sequence of patterns that occurs frequently,
for example, the purchase of a camera followed by the purchase of a
memory card.
• Frequent Substructure − Substructure refers to different structural forms,
such as graphs, trees, or lattices, which may be combined with item sets or
subsequences.

Mining of Associations

Associations are used in retail sales to identify items that are frequently
purchased together. This refers to the process of uncovering relationships
among data and determining association rules.
For example, a retailer might generate an association rule showing that milk
is sold with bread 70% of the time, while biscuits are sold with bread only
30% of the time.
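Percentages like these come from counting baskets. A minimal sketch in Python, using a made-up list of transactions, shows how the support and confidence behind an association rule can be computed:

```python
# Illustrative market baskets; each transaction is a set of purchased items.
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"bread", "biscuits"},
    {"milk", "bread"}, {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent (the strength of the rule)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule "bread -> milk": how often is milk bought when bread is bought?
conf_bread_milk = confidence({"bread"}, {"milk"})
```

Here 3 of the 5 bread baskets also contain milk, so the rule "bread → milk" has 60% confidence.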

Mining of Correlations

This is a kind of additional analysis performed to uncover interesting
statistical correlations between associated attribute-value pairs or between
two item sets, in order to determine whether they have a positive, negative,
or no effect on each other.

Mining of Clusters

A cluster is a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different
from the objects in other clusters.
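As a sketch of how cluster analysis groups similar objects together, here is a minimal one-dimensional k-means in plain Python; the data points and starting centers are invented for illustration:

```python
# A minimal 1-D k-means sketch: points join their nearest center, then each
# center moves to the mean of its cluster, and the two steps repeat.
def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups near 1 and near 9; start the centers far apart.
centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

The objects within each resulting cluster are close to each other and far from the objects in the other cluster, which is exactly the property described above.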

Classification and Prediction


Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects
whose class label is unknown. This derived model is based on the analysis of sets of
training data. The derived model can be presented in the following forms −
• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
The functions involved in these processes are as follows −
• Classification − It predicts the class of objects whose class label is
unknown. Its objective is to find a derived model that describes and
distinguishes data classes or concepts. The derived model is based on the
analysis of a set of training data, i.e., data objects whose class labels
are known.
• Prediction − It is used to predict missing or unavailable numerical data
values rather than class labels. Regression analysis is generally used for
prediction. Prediction can also be used to identify distribution trends
based on available data.
• Outlier Analysis − Outliers are data objects that do not comply with the
general behavior or model of the available data.
• Evolution Analysis − Evolution analysis refers to the description and
modeling of regularities or trends for objects whose behavior changes over
time.
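A derived classification model in IF-THEN rule form can be sketched as a small Python function. The attribute names and thresholds below are hypothetical, echoing the big spender / budget spender concepts mentioned earlier:

```python
# A classification model expressed as IF-THEN rules, one of the derived-model
# forms listed above. Attributes and thresholds are made up for illustration.
def classify_customer(income, spend):
    # IF income is high AND spending is high THEN class = "big spender".
    if income > 50000 and spend > 1000:
        return "big spender"
    # OTHERWISE class = "budget spender".
    return "budget spender"

# Predict the class of an object whose class label is unknown.
label = classify_customer(income=60000, spend=1500)
```

In practice such rules are not written by hand but derived from training data, for example by a decision tree learner; the point here is only the form of the model.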

Technologies used in data mining


Several techniques are used in the development of data mining methods. Some
of them are mentioned below:

1. Statistics:

• Statistics uses mathematical analysis to represent, model, and
summarize empirical data or real-world observations.
• Statistical analysis involves a collection of methods, applicable
to large amounts of data, used to draw conclusions and report trends.
2. Machine learning

• Arthur Samuel defined machine learning as a field of study that gives
computers the ability to learn without being explicitly programmed.
• When new data is entered into the computer, machine learning
algorithms help the model grow or change accordingly.
• In machine learning, an algorithm is constructed to predict outcomes
from the available database (predictive analysis).
• It is related to computational statistics.
The four types of machine learning are:

1. Supervised learning

• It is based on classification.
• It is also called inductive learning. In this method, the desired
outputs are included in the training dataset.
2. Unsupervised learning
Unsupervised learning is based on clustering. Clusters are formed on the
basis of similarity measures and desired outputs are not included in the
training dataset.

3. Semi-supervised learning
Semi-supervised learning includes some desired outputs in the training
dataset to generate the appropriate functions. This method avoids the need
for a large number of labeled examples (i.e., desired outputs).

4. Active learning

• Active learning is a powerful approach for analyzing data efficiently.
• The algorithm is designed in such a way that it selects the data
points for which it requests the desired outputs (the user plays an
important role in this type by supplying them).
3. Information retrieval
Information retrieval deals with uncertain representations of the semantics
of objects (text, images).
For example: finding relevant information in a large collection of documents.

4. Database systems and data warehouse

• Databases are used for recording data as well as for data
warehousing.
• Online Transaction Processing (OLTP) uses databases for day-to-day
transaction purposes.
• To remove redundant data and save storage space, data is
normalized and stored in the form of tables.
• Entity-relationship modeling techniques are used for relational
database management system design.
• Data warehouses are used to store historical data, which helps in
making strategic business decisions.
• A data warehouse is used for Online Analytical Processing (OLAP),
which helps to analyze the data.
5. Decision support system

• A decision support system is a category of information system. It is
very useful in decision making for organizations.
• It is an interactive software-based system that helps decision makers
extract useful information from data and documents in order to make
decisions.

Which kind of Applications are targeted?


Data mining is widely used in diverse areas. A number of commercial data
mining systems are available today, and yet there are many challenges in this
field. In this tutorial, we will discuss the applications and trends of data
mining.

Data Mining Applications


Here is the list of areas where data mining is widely used −

• Financial Data Analysis


• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection

Financial Data Analysis

The financial data in the banking and financial industry is generally
reliable and of high quality, which facilitates systematic data analysis and
data mining. Some of the typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis
and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry

Data mining has great application in the retail industry because it collects
large amounts of data on sales, customer purchasing history, goods
transportation, consumption, and services. The quantity of data collected will
naturally continue to expand rapidly because of the increasing ease,
availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying
patterns and trends, which leads to improved quality of customer service and
good customer retention and satisfaction. Here is a list of examples of data
mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data
mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.

Telecommunication Industry

Today the telecommunication industry is one of the fastest-growing industries,
providing various services such as fax, pager, cellular phone, internet
messenger, images, e-mail, and web data transmission. Due to the development
of new computer and communication technologies, the telecommunication industry
is rapidly expanding. This is why data mining has become very important for
helping to understand the business.
Data mining in the telecommunication industry helps to identify
telecommunication patterns, catch fraudulent activities, make better use of
resources, and improve quality of service. Here is a list of examples in which
data mining improves telecommunication services −
• Multidimensional Analysis of Telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.

Biological Data Analysis

In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics, and biomedical research. Biological
data mining is a very important part of bioinformatics. The following are
aspects in which data mining contributes to biological data analysis −
• Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
• Alignment, indexing, similarity search, and comparative analysis of
multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein
pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.

Other Scientific Applications

The applications discussed above tend to handle relatively small and
homogeneous data sets for which statistical techniques are appropriate. Huge
amounts of data have been collected from scientific domains such as
geosciences and astronomy. Large data sets are also being generated by fast
numerical simulations in fields such as climate and ecosystem modeling,
chemical engineering, and fluid dynamics. The following are applications of
data mining in the field of scientific applications −

• Data Warehouses and data preprocessing.


• Graph-based mining.
• Visualization and domain specific knowledge.

Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity,
confidentiality, or availability of network resources. In this world of
connectivity, security has become a major issue. The increased usage of the
internet and the availability of tools and tricks for intruding into and
attacking networks have made intrusion detection a critical component of
network administration. Here is a list of areas in which data mining
technology may be applied for intrusion detection −
• Development of data mining algorithm for intrusion detection.
• Association and correlation analysis, aggregation to help select and build
discriminating attributes.
• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.

Major issues in Data Mining


Data mining is not an easy task, as the algorithms used can get very complex
and data is not always available in one place. It needs to be integrated from
various heterogeneous data sources. These factors also create some issues.
Here we will discuss the major issues regarding −

• Mining Methodology and User Interaction


• Performance Issues
• Diverse Data Types Issues

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore, data mining needs to
cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because interactivity allows
users to focus the search for patterns, providing and refining data mining
requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used
to guide the discovery process and to express the discovered patterns, not
only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient
and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns
are discovered, they need to be expressed in high-level languages and
visual representations. These representations should be easily
understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. If
data cleaning methods are not used, the accuracy of the discovered patterns
will be poor.
• Pattern evaluation − The patterns discovered may be uninteresting because
they represent common knowledge or lack novelty; evaluating which patterns
are truly interesting is itself a challenge.

Performance Issues
There can be performance-related issues such as the following −
• Efficiency and scalability of data mining algorithms − In order to
effectively extract information from huge amounts of data in databases,
data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such
as the huge size of databases, the wide distribution of data, and the
complexity of data mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms divide the data into
partitions, which are processed in parallel, and the results from the
partitions are then merged. Incremental algorithms update databases without
mining the data again from scratch.

Diverse Data Types Issues


• Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data,
temporal data, etc. It is not possible for one system to mine all these
kinds of data.
• Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on a
LAN or WAN. These data sources may be structured, semi-structured, or
unstructured. Therefore, mining knowledge from them adds challenges to
data mining.

Data Types
Data mining can be conducted on the following data forms.

i) Relational databases
A relational database is a set of records linked together using a set of pre-
defined constraints. These records are arranged in columns and rows in the
form of tables. Tables are used to store data about the items that are to be
described in the database.

A relational database is characterized as a set of data arranged in rows and
columns in database tables. In relational databases, the database structure
can be defined using physical and logical schemas. The physical schema
defines the structure of the tables, while the logical schema defines the
relationships among them. The relational database's standard API is SQL. Its
applications include data processing, the ROLAP model, etc.
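As a sketch of this standard SQL API in action, the following uses Python's built-in sqlite3 module; the table name and rows are made up for illustration:

```python
import sqlite3

# Create an in-memory relational database with one illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("milk", 3), ("bread", 5), ("milk", 2)])

# Data is queried declaratively by content, not by physical location:
# the same SQL works regardless of how the rows are stored on disk.
milk_qty = conn.execute(
    "SELECT SUM(qty) FROM sales WHERE item = ?", ("milk",)).fetchone()[0]
conn.close()
```

The `?` placeholders are the parameter-binding style sqlite3 provides; passing values this way rather than string formatting avoids SQL injection.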

ii) Data warehouses


A data warehouse is built by combining data from several heterogeneous
sources, enabling users to produce analytical reports, run standardized
and/or ad hoc queries, and make decisions. Data warehousing requires data
cleaning, data integration, and data storage. To support historical research,
a data warehouse typically preserves several months or years of data. The
data in a data warehouse is usually loaded from multiple data sources by an
extraction, transformation, and loading (ETL) process. Modern data warehouses
are shifting towards an extract, load, transform (ELT) architecture, in which
all or much of the transformation is carried out on the database that hosts
the data warehouse. It is important to remember that describing the ETL
(Extraction, Transformation, and Loading) method is a very significant part
of a data warehouse's design initiative; ETL activities are the backbone of
the data warehouse.

iii) Transactional databases


To explain what a transactional database is, let's first see what a
transaction entails. A transaction is, in technical terms, a sequence of
actions treated as a single unit of work. A transaction is said to be
complete only if all the activities that are part of it finish successfully.
If any of them fails, the transaction is considered an error, and all of its
actions need to be rolled back or undone.

Every database transaction has a given starting point, followed by steps that
change the data inside the database. In the end, the database either commits
the changes to make them permanent or rolls them back to the starting point,
after which the transaction can be tried again.

Example - the case of a bank transaction. A bank transfer is said to be
accurate only when the amount debited from one account is successfully
credited to the other account. If the amount is withdrawn but not received by
the payee, then it is appropriate to roll the whole transaction back to the
original point.
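The rollback behavior described in the bank example can be sketched with Python's built-in sqlite3 module; the account names and amounts are illustrative:

```python
import sqlite3

# Set up two illustrative accounts and commit them as the starting point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # the connection commits on success, rolls back on an exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        raise RuntimeError("credit step failed")  # simulate the failed credit
except RuntimeError:
    pass  # the transaction was rolled back to the starting point

# The withdrawal was undone: alice's balance is back to its original value.
alice_balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
conn.close()
```

Because the debit and the (failed) credit live inside one transaction, the database never exposes a state where the money has left one account without arriving in the other.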
iv) Database management system
A DBMS is an application for database development and management. It offers a
structured way for users to create, retrieve, update, and manage data. A
person who uses a DBMS to communicate with the database need not be concerned
about how and where the data is processed; the DBMS takes care of it.

A DBMS stores a collection of data in a structured manner and records
information that has some significance. For example, to create a student
database, we add attributes such as student ID, student name, student
address, student mobile number, and student email, and every student record
has the same attribute structure. The DBMS provides the end user with a
reliable, unified view of this data.

v) Advanced database system


Specialized database management systems target a new range of databases such
as NoSQL/NewSQL. New developments in data storage, driven by application
demands such as support for predictive analytics, research, and data
processing, are also supported by advanced database management systems.
Advanced data management has always been at the center of effective database
and information systems. It treats a wealth of different data models and
surveys the foundations of structuring, sorting, storing, and querying data
according to these models.

Data Quality: Why do we preprocess the data?


Many characteristics act as deciding factors for data quality. Incompleteness
and incoherent information are common properties of big databases in the real
world. The factors used for data quality assessment are:

• Accuracy:
There are many possible reasons for flawed or inaccurate data,
e.g., attributes having incorrect values due to human or
computer errors.

• Completeness:
Incomplete data can occur for several reasons; attributes of
interest, such as customer information for sales and transaction
data, may not always be available.

• Consistency:
Incorrect data can also result from inconsistencies in naming
conventions or data codes, or from incoherent input field
formats. Duplicate tuples also require data cleaning.

• Timeliness:
Timeliness also affects the quality of the data. At the end of
the month, several sales representatives fail to file their
sales records on time, and several corrections and adjustments
flow in after the end of the month. The data stored in the
database is therefore incomplete for a time after each month.

• Believability:
This reflects how much users trust the data.

• Interpretability:
This reflects how easily users can understand the data.
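A few of these quality factors, such as completeness and consistency, can be measured with simple code. The sketch below uses made-up customer records, with None marking a missing attribute value:

```python
# Illustrative customer records exhibiting the quality problems above.
records = [
    {"id": 1, "name": "Ann", "city": "NY"},
    {"id": 2, "name": "Bob", "city": None},   # incomplete: missing city
    {"id": 2, "name": "Bob", "city": None},   # duplicate tuple
    {"id": 3, "name": "Cara", "city": "ny"},  # inconsistent naming convention
]

# Completeness: the fraction of records with no missing attribute values.
complete = [r for r in records if all(v is not None for v in r.values())]
completeness = len(complete) / len(records)

# Consistency: normalize the naming convention, then drop duplicate tuples.
for r in records:
    if r["city"] is not None:
        r["city"] = r["city"].upper()
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in records}]
```

Checks like these are exactly what the preprocessing step performs before mining, so that the discovered patterns are not distorted by missing, inconsistent, or duplicated data.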
