DM Mod 1
Have a look at the points below, which describe the stages of a typical data mining
project.
1 – Requirement gathering
Data mining projects start with requirement gathering and understanding. Data
mining analysts or users define the scope of the requirement from the vendor's
business perspective. Once the scope is defined, we move to the next phase.
2 – Data exploration
In this step, data mining experts gather, evaluate, and explore the data relevant to
the requirement or project. They understand the problems and challenges and
convert them into metadata. Statistics are used in this step to identify data
patterns.
3- Data preparation
Data mining experts convert the data into meaningful information for the modeling
step. They use the ETL process: extract, transform, and load. They are also
responsible for creating new data attributes. Various tools are used here to present
the data in a structured format without changing the meaning of the data sets.
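As an illustration only (not part of the original notes), the following Python sketch shows an ETL-style preparation step with pandas; the file names and column names are assumptions.

import pandas as pd

# Extract: read raw transaction data from a flat file (hypothetical file name)
raw = pd.read_csv("transactions.csv")

# Transform: clean values and derive a new attribute
raw = raw.dropna(subset=["amount"])                # drop rows with missing amounts
raw["amount"] = raw["amount"].astype(float)
raw["year_month"] = pd.to_datetime(raw["date"]).dt.to_period("M")  # new attribute

# Load: write the prepared data to a structured file for the modeling step
raw.to_csv("prepared_transactions.csv", index=False)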
4- Modeling
Data experts put their best tools in place for this step, as it plays a vital role in the
complete processing of the data. Suitable modeling methods are applied to the
prepared data. Modeling and evaluation are correlated steps and are carried out
together to check the model parameters. Once the final modeling is done, the
outcome is quality-proven.
5- Evaluation
This is the checking step that follows modeling. If the outcome is not satisfactory,
the data is fed back to the modeling step. Once the final outcome is obtained, the
requirement is checked again with the vendor so that no point is missed. Data mining
experts judge the complete result at the end.
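Purely as an illustrative sketch of the modeling and evaluation loop described above (not from the original notes; the dataset and model choice are assumptions), a simple classifier can be trained and then checked on held-out data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Modeling: fit a model on a training split of the prepared data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Evaluation: check the outcome on unseen data; if it is not satisfactory,
# go back to the modeling step with different parameters
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))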
6- Deployment
This is the final stage of the complete process. Experts present the results to the
vendor in the form of spreadsheets or graphs.
In other words, data mining is the process of investigating hidden patterns in data
from various perspectives and categorizing them into useful information. This
information is collected and assembled in particular areas such as data warehouses,
supports efficient analysis and decision making, and eventually helps in cutting costs
and generating revenue.
Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining utilizes
complex mathematical algorithms to segment the data and evaluate the probability of
future events. Data mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful
information.
Data mining is similar to data science: it is carried out by a person, in a specific
situation, on a particular data set, with an objective. This process includes various
types of services such as text mining, web mining, audio and video mining, pictorial
data mining, and social media mining. It is done through software that may be simple
or highly specialized. By outsourcing data mining, all the work can be done faster and
with lower operating costs. Specialized firms can also use new technologies to collect
data that is impossible to locate manually. There is a huge amount of information
available on various platforms, but very little knowledge is accessible. The biggest
challenge is to analyze the data and extract the important information that can be
used to solve a problem or support company development. There are many powerful
tools and techniques available to mine data and find better insights from it.
Relational Database:
A relational database is a collection of data organized in tables with rows and
columns, where the tables are related to one another through defined relationships.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within
the organization to provide meaningful business insights. The huge amount of data
comes from multiple places such as Marketing and Finance. The extracted data is
utilized for analytical purposes and helps in decision-making for a business
organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
A data repository generally refers to a destination for data storage. However, many
IT professionals use the term more specifically to refer to a particular kind of setup
within an IT structure, for example, a group of databases where an organization has
kept various kinds of information.
Object-Relational Database:
One of the primary objectives of the object-relational data model is to close the gap
between the relational database and the object-oriented modeling practices frequently
used in many programming languages, for example, C++, Java, C#, and so on.
Transactional Database:
A transactional database stores records of transactions along with time stamps and
supports rolling back an operation when a transaction does not complete.
Data mining can be performed on the following kinds of data:
1. Flat Files
2. Relational Databases
3. DataWarehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)
1. Flat Files
• Flat files are defined as data files in text or binary form with
a structure that can be easily extracted by data mining
algorithms.
• Data stored in flat files has no relationships or paths among
the records; for example, if a relational database is stored in
flat files, there will be no relations between the tables.
• Flat files are described by a data dictionary. Eg: CSV file (a
small reading sketch follows this list).
• Application: Used in data warehousing to store data, used in
carrying data to and from a server, etc.
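As a small illustration (not part of the original notes; the file name and columns are assumptions), a CSV flat file can be read with Python's standard csv module:

import csv

# Read a flat file (hypothetical CSV) row by row; the file itself carries
# no relationships between records, only the values in each line
with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["item"], row["amount"])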
2. Relational Databases
• A Relational database is defined as the collection of data
organized in tables with rows and columns.
• Physical schema in Relational databases is a schema which
defines the structure of tables.
• Logical schema in Relational databases is a schema which
defines the relationship among tables.
• The standard API of relational databases is SQL (illustrated in the sketch after this list).
• Application: Data Mining, ROLAP model, etc.
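To illustrate the SQL interface mentioned above (an illustrative sketch, not from the notes; the table and column names are assumptions), a relational table can be queried from Python with the built-in sqlite3 module:

import sqlite3

# Create an in-memory relational table and query it with SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("milk", 2.5), ("bread", 1.8), ("milk", 2.5)])

# SQL is the standard API: aggregate sales per item
for item, total in conn.execute(
        "SELECT item, SUM(amount) FROM sales GROUP BY item"):
    print(item, total)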
3. DataWarehouse
• A datawarehouse is defined as a collection of data integrated
from multiple sources that supports queries and decision making.
• There are three types of
datawarehouse: Enterprise datawarehouse, Data
Mart and Virtual Warehouse.
• Two approaches can be used to update data in
DataWarehouse: Query-driven Approach and Update-
driven Approach.
• Application: Business decision making, Data mining, etc.
4. Transactional Databases
• A transactional database is a collection of data organized by
time stamps, dates, etc., to represent transactions in databases.
• This type of database has the capability to roll back or undo an
operation when a transaction is not completed or committed.
• It is a highly flexible system where users can modify information
without changing any sensitive information.
• It follows the ACID properties of a DBMS.
• Application: Banking, Distributed systems, Object databases,
etc.
5. Multimedia Databases
• Multimedia databases consists audio, video, images and text
media.
• They can be stored on Object-Oriented Databases.
• They are used to store complex information in a pre-specified
formats.
• Application: Digital libraries, video-on demand, news-on
demand, musical database, etc.
6. Spatial Database
• Store geographical information.
• Stores data in the form of coordinates, topology, lines,
polygons, etc.
• Application: Maps, Global positioning, etc.
7. Time-series Databases
• Time-series databases contain data such as stock exchange data
and user-logged activities.
• They handle arrays of numbers indexed by time, date, etc.
• They require real-time analysis (a small example follows this list).
• Application: eXtremeDB, Graphite, InfluxDB, etc.
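As an illustrative sketch only (not from the notes; the values are invented), pandas can hold a small array of numbers indexed by time and aggregate it over time windows:

import pandas as pd

# A numeric series indexed by timestamps, as a time-series store would hold it
idx = pd.date_range("2024-01-01", periods=6, freq="D")
series = pd.Series([10, 12, 11, 15, 14, 13], index=idx)

# Aggregate (resample) the daily values into 2-day averages
print(series.resample("2D").mean())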
8. WWW
• WWW refers to the World Wide Web, a collection of documents
and resources such as audio, video and text that are identified by
Uniform Resource Locators (URLs), accessed through web browsers,
linked by HTML pages, and available via the Internet.
• It is the most heterogeneous repository, as it collects data from
multiple sources.
• It is dynamic in nature, as the volume of data is continuously
increasing and changing.
• Application: Online shopping, Job search, Research, studying,
etc.
Data mining functions fall into two categories −
• Descriptive
• Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computer and printers,
and concepts of customers include big spenders and budget spenders. Such
descriptions of a class or a concept are called class/concept descriptions. These
descriptions can be derived in the following two ways −
• Data Characterization − This refers to summarizing the data of the class under
study. This class under study is called the Target Class.
• Data Discrimination − This refers to the mapping or classification of a class with
some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here
is a list of the kinds of frequent patterns −
• Frequent Item Set − It refers to a set of items that frequently appear together,
for example, milk and bread.
• Frequent Subsequence − A sequence of patterns that occur frequently, such
as purchasing a camera being followed by purchasing a memory card.
• Frequent Sub Structure − Substructure refers to different structural forms,
such as graphs, trees, or lattices, which may be combined with item-sets or
subsequences.
Mining of Association
Associations are used in retail sales to identify items that are frequently purchased
together. This refers to the process of uncovering relationships among data and
determining association rules.
For example, a retailer may generate an association rule showing that 70% of the time
milk is sold with bread and only 30% of the time biscuits are sold with bread.
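To make this kind of rule concrete (an illustrative sketch, not from the notes; the tiny transaction list is invented), the support and confidence of a rule bread → milk can be computed directly:

# Toy transactions (invented) to compute support and confidence of "bread -> milk"
transactions = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "biscuits"},
    {"bread", "milk"}, {"milk"}, {"bread", "milk", "biscuits"},
]

n = len(transactions)
bread = sum("bread" in t for t in transactions)
bread_and_milk = sum({"bread", "milk"} <= t for t in transactions)

print("support(bread, milk) =", bread_and_milk / n)          # fraction of all baskets
print("confidence(bread -> milk) =", bread_and_milk / bread)  # milk given bread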
Mining of Correlations
This is an additional kind of analysis performed to uncover interesting statistical
correlations between associated attribute-value pairs or between two item sets, in
order to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the
objects in other clusters.
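As an illustration only (not from the notes; the data points and the choice of k-means are assumptions), a simple clustering of 2-D points with scikit-learn:

import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (invented data)
points = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
                   [8, 8], [8.2, 7.9], [7.8, 8.1]])

# Form clusters of objects that are similar to each other
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1]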
1. Statistics:
1. Supervised learning
3. Semi-supervised learning
Semi-supervised learning adds some desired outputs (labels) to the training
dataset to generate the appropriate functions. This method generally avoids
the need for a large number of labeled examples (i.e. desired outputs).
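Purely as an illustrative sketch of this idea (not from the notes; the data and model are assumptions), scikit-learn can train on a mix of labeled and unlabeled examples, where unlabeled targets are marked with -1:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

# A few labeled points plus unlabeled ones (label -1 means "unknown")
X = np.array([[0.0], [0.2], [0.9], [1.1], [0.1], [1.0]])
y = np.array([0, 0, 1, 1, -1, -1])

# Self-training: the labeled examples are used to pseudo-label the rest
model = SelfTrainingClassifier(GaussianNB()).fit(X, y)
print(model.predict([[0.15], [0.95]]))   # expected: [0 1]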
4. Active learning
• Databases are used for the purpose of recording data as well as for
data warehousing.
• Online Transaction Processing (OLTP) uses databases for day-to-day
transaction purposes.
• To remove redundant data and save storage space, data is
normalized and stored in the form of tables.
• Entity-Relationship modeling techniques are used for relational database
management system design.
• Data warehouses are used to store historical data, which helps to take
strategic decisions for the business.
• They are used for Online Analytical Processing (OLAP), which helps to
analyze the data.
5. Decision support system
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of
high quality, which facilitates systematic data analysis and data mining. Some of the
typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis
and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry
Data mining has a great application in the retail industry because it collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue
to expand rapidly because of the increasing ease, availability and popularity of the
web.
Data mining in the retail industry helps in identifying customer buying patterns and
trends, which leads to improved quality of customer service and better customer
retention and satisfaction. Here is a list of examples of data mining in the retail
industry −
• Design and Construction of data warehouses based on the benefits of data
mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.
Biological Data Analysis
In recent times, we have seen tremendous growth in the field of biology, such as
genomics, proteomics, functional genomics and biomedical research. Biological data
mining is a very important part of bioinformatics. Following are the aspects in which
data mining contributes to biological data analysis −
• Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
• Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein
pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous
data sets for which statistical techniques are appropriate. Huge amounts of data
have been collected from scientific domains such as geosciences and astronomy, and
large data sets are also being generated by fast numerical simulations in fields such
as climate and ecosystem modeling, chemical engineering and fluid dynamics. Data
mining is therefore increasingly applied in these scientific fields as well.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become
a major issue. The increased usage of the internet and the availability of tools and
tricks for intruding into and attacking networks have prompted intrusion detection to
become a critical component of network administration. Here is a list of areas in
which data mining technology may be applied for intrusion detection −
• Development of data mining algorithm for intrusion detection.
• Association and correlation analysis, aggregation to help select and build
discriminating attributes.
• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.
Performance Issues
There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amount of data in databases, data mining
algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of data, and the
complexity of data mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms divide the data into
partitions, which are processed in parallel; the results from the partitions are
then merged (see the sketch after this list). Incremental algorithms update the
mined results as the database changes, without mining the data again from scratch.
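As an illustration only of the partition-and-merge idea (not from the notes; the data and the simple item-counting task are assumptions), partial results can be computed per partition in parallel and then merged:

from collections import Counter
from multiprocessing import Pool

def count_items(partition):
    # Mine one partition independently (here: simple item counting)
    return Counter(item for basket in partition for item in basket)

if __name__ == "__main__":
    baskets = [["milk", "bread"], ["bread"], ["milk"], ["milk", "bread"]]
    partitions = [baskets[:2], baskets[2:]]           # split the data

    with Pool(2) as pool:
        partial = pool.map(count_items, partitions)   # process partitions in parallel

    merged = sum(partial, Counter())                  # merge the partial results
    print(merged)                                     # Counter({'milk': 3, 'bread': 3})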
Data Types
Data mining can be conducted on the following data forms.
i) Relational databases
A relational database is a set of records that are linked to one another using a set of
pre-defined constraints. These records are arranged in rows and columns in tables.
Tables are used to store data about the items that are to be described in the
database.
A relational database is characterized as a set of data arranged in rows and columns in
the database tables. In relational databases, the database structure can be defined
using a physical and a logical schema. The physical schema describes the structure of
the tables, while the logical schema describes how the tables are related to one
another. The relational database's standard API is SQL. Its applications include data
processing, the ROLAP model, etc.
iii) Transactional databases
There is a given starting point for any database transaction, followed by steps to change
the data inside the database. In the end, before the transaction can be tried again, the
database either commits the changes to make them permanent or rolls the changes
back to the starting point.
Example - The case of a bank transaction. A bank transaction is said to be accurate only
when the amount debited from one account is successfully credited to another account.
If the amount is withdrawn but not received by the recipient, then it is appropriate to roll
back the whole transaction to the original point.
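As an illustrative sketch (not from the notes; the accounts table and amounts are invented), the commit/rollback behaviour can be shown with Python's built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 50.0)])
conn.commit()

try:
    # Debit one account and credit the other inside a single transaction
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'A'")
    raise RuntimeError("simulated failure before the credit step")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'B'")
    conn.commit()                       # make the changes permanent
except Exception:
    conn.rollback()                     # undo the incomplete transaction

print(conn.execute("SELECT name, balance FROM accounts").fetchall())
# [('A', 100.0), ('B', 50.0)]  -> balances unchanged after the rollback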
iv) Database management system
DBMS is an application for database development and management. It offers a
structured way for users to create, retrieve, update, and manage data. A person who
uses a DBMS to communicate with the database need not be concerned about how and
where the data is processed; the DBMS will take care of it.
Data quality can be assessed in terms of the following factors:
• Accuracy:
There are many possible reasons for flawed or inaccurate data,
e.g. attributes having incorrect values because of human or
computer errors.
• Completeness:
Incomplete data can occur for several reasons; attributes of
interest, such as customer information for sales and transaction
data, may not always be available.
• Consistency:
Incorrect data can also result from inconsistencies in naming
conventions or data codes, or from inconsistent formats in input
fields. Duplicate tuples also need cleaning.
• Timeliness:
Timeliness also affects the quality of the data. At the end of the
month, several sales representatives may fail to file their sales
records on time. Several corrections and adjustments also flow in
after the end of the month. The data stored in the database
therefore remains incomplete for a time after each month.
• Believability:
It is reflective of how much users trust the data.
• Interpretability:
It is a reflection of how easily users can understand the data.
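Purely as an illustrative sketch of checking some of these factors in practice (not from the notes; the DataFrame contents are invented), missing values and duplicate tuples can be inspected with pandas:

import pandas as pd

# Invented sales records with a missing value and a duplicate row
df = pd.DataFrame({
    "customer": ["A", "B", "B", None],
    "amount":   [10.0, 5.0, 5.0, 7.5],
})

print(df.isna().sum())                   # completeness: missing values per column
print(df.duplicated().sum())             # consistency: number of duplicate tuples
clean = df.drop_duplicates().dropna()    # a simple cleaning step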