Unit 1
Data Mining:
Data mining is the process of extracting, from huge sets of data, the information needed to identify patterns, trends, and useful insights that allow a business to take data-driven decisions.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster with low operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There is an enormous amount of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or to develop the company. There are many powerful instruments and techniques available to mine data and find better insights from it.
History of Data Mining
The term "Data Mining" was introduced in the 1990s, but data mining is the evolution of a field with an extensive history.
Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the evolution of regression (1800s). The emergence and growing power of computers have boosted data collection, storage, and manipulation as data sets have grown in size and complexity. Explicit, hands-on data investigation has progressively been augmented with indirect, automatic data processing and other computer science discoveries such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).
Data mining origins are traced back to three family lines: Classical statistics, Artificial
intelligence, and Machine learning.
Classical statistics:
Statistics is the basis of most of the technology on which data mining is built, such as regression analysis, standard deviation, standard distribution, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to analyze data and data relationships.
Artificial Intelligence:
Machine Learning:
Advantages of Data Mining:
o It can be induced in the new system as well as the existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.
Data mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine price, consumer preferences, product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization attract customers.
The following are areas where data mining is widely used:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better insights and to identify best practices that will enhance healthcare services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the number of patients in each category. The procedures ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Data Mining in Market Basket Analysis:
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, then you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. The data may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. Using differential analysis, results can be compared between various stores and between customers in different demographic groups.
Data Mining in Education:
Educational data mining (EDM) is a newly emerging field concerned with developing techniques that discover knowledge from data generated in educational environments. EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing learning science. An organization can use data mining to make precise decisions and also to predict students' results. With the results, the institution can concentrate on what to teach and how to teach.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial for finding patterns in a complex manufacturing process. Data mining can be used in system-level design to obtain the relationships between product architecture, product portfolio, and the data needs of customers. It can also be used to forecast product development time, cost, and expectations, among other tasks.
Data Mining in CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with its customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.
Data Mining in Fraud Detection:
Billions of dollars are lost to fraud. Traditional methods of fraud detection are somewhat time-consuming and complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent. A model is constructed from this data, and the technique is then used to identify whether a new record is fraudulent or not.
Data Mining in Lie Detection:
Apprehending a criminal is not difficult, but bringing out the truth from a suspect is a very challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on. This includes text mining, which seeks meaningful patterns in data that is usually unstructured text. The information collected from previous investigations is compared, and a model for lie detection is constructed.
Data Mining in Financial Banking:
The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, causalities, and correlations in business information and market prices that are not immediately evident to managers or executives, because the volume of data is too large or it is generated too rapidly for experts to examine. Managers may use this data for better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.
Data mining can be performed on the following types of data:
1) Relational Database:
A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
2) Data warehouses:
A data warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places, such as marketing and finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.
3) Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has stored various kinds of information.
4) Object-Relational Database:
One of the primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.
5) Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo a database transaction if it is not performed appropriately. Even though this was a unique capability long ago, today most relational database systems support transactional database activities.
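As a minimal illustration of the rollback capability described above, here is a hedged sketch using Python's built-in sqlite3 module; the accounts table and transfer amounts are invented for the example.

```python
import sqlite3

# Sketch of transactional behaviour: a failed transfer is undone by rollback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    # Transfer 30 from alice to bob as one atomic transaction.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("simulated failure before crediting bob")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")  # never reached
    conn.commit()
except Exception:
    conn.rollback()  # undo the partial debit, leaving both balances unchanged

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100.0, 'bob': 50.0}
```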
Challenges of Implementation in Data Mining:
Although data mining is very powerful, it faces many challenges during its implementation. The challenges can be related to performance, data, methods, techniques, and so on. The data mining process becomes effective when the challenges or problems are correctly recognized and adequately resolved.
Incomplete and Noisy Data:
The process of extracting useful data from large volumes of data is data mining. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to errors in data measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than Rs. 500, and the accounting employees enter this information into their system. A person may make a digit mistake when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could also get changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging.
Data Distribution:
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data (including audio, video, and images), complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information is a tough task. Most of the time, new technologies, tools, and methodologies have to be developed or refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals information about the buying habits and preferences of customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method of showing the output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express. But many times, representing the information to the end-user in a precise and simple way is difficult. Since both the input data and the output information can be complicated, very efficient and successful data visualization processes need to be implemented to make the results useful.
The KDD Process
1. Building up an understanding of the application domain
This first step sets the scene for understanding what should be done with the various decisions such as transformation, algorithms, and representation. The individuals in charge of a KDD project need to understand and characterize the objectives of the end-user and the environment in which the knowledge discovery process will occur (including relevant prior knowledge).
2. Selecting and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be utilized for the knowledge discovery process should be determined. This incorporates discovering what data is available, obtaining additional important data, and afterward integrating all the data for knowledge discovery into one data set, including the attributes that will be considered for the process. This process is important because data mining learns and discovers from the available data, which is the evidence base for building the models. If some significant attributes are missing, the entire study may be unsuccessful; from this respect, the more attributes that are considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off with the opportunity for best understanding the phenomena. This trade-off is where the interactive and iterative aspect of KDD takes place: one begins with the best available data set and later expands it and observes the impact in terms of knowledge discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It incorporates data cleaning, for example handling missing values and removing noise or outliers. It might involve complex statistical techniques or the use of a data mining algorithm in this context. For example, when one suspects that a specific attribute is of insufficient reliability or has many missing values, this attribute could become the target of a supervised data mining algorithm: a prediction model for the attribute is created, and missing values can then be predicted. The extent to which one pays attention to this step depends on many factors. Regardless, studying these aspects is important and often revealing in itself with respect to enterprise data frameworks.
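A minimal sketch of this cleansing step, assuming the pandas library is available; the column names, values, and plausibility rule are invented for illustration.

```python
import pandas as pd
import numpy as np

# Handling missing values and removing an implausible record.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 230],   # 230 is an implausible value
    "income": [30000, 42000, 38000, np.nan, 52000, 61000],
})

# Fill missing numeric values with the column median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Drop rows whose age falls outside a plausible range (a simple domain rule).
clean = df[df["age"].between(0, 120)]
print(clean)
```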
4. Data Transformation
In this stage, appropriate data for data mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformation). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the quotient of attributes may often be the most significant factor, rather than each attribute by itself. In business, we may need to think about effects beyond our control as well as efforts and transient issues, for example, studying the impact of advertising accumulation. However, if we do not use the right transformation at the start, we may obtain a surprising effect that gives us insight into the transformation needed in the next iteration. Thus, the KDD process feeds back on itself and leads to an understanding of the transformation required.
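A small sketch of one such attribute transformation, discretizing a numerical attribute into bins, assuming pandas; the column name and bin edges are invented.

```python
import pandas as pd

# Discretize a numerical attribute into three ordinal categories.
df = pd.DataFrame({"income": [18000, 25000, 40000, 75000, 120000]})

df["income_band"] = pd.cut(
    df["income"],
    bins=[0, 30000, 80000, float("inf")],
    labels=["low", "medium", "high"],
)
print(df)
```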
5. Choosing the appropriate Data Mining task
We are now prepared to decide on which kind of data mining to use, for example classification, regression, or clustering. This mainly depends on the KDD objectives and on the previous steps. There are two major goals in data mining: the first is prediction, and the second is description. Prediction is often referred to as supervised data mining, while descriptive data mining incorporates the unsupervised and visualization aspects of data mining. Most data mining techniques are based on inductive learning, where a model is built explicitly or implicitly by generalizing from an adequate number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of available data.
6. Choosing the Data Mining algorithm
Having chosen the technique, we now decide on the tactics. This stage includes selecting a particular method to be used for searching for patterns, which may involve multiple inducers. For example, when considering precision versus understandability, the former is better achieved with neural networks, while the latter is better achieved with decision trees. For each strategy of meta-learning, there are several possibilities for how it can be accomplished. Meta-learning focuses on explaining what causes a data mining algorithm to be successful or not on a specific problem. Thus, this methodology attempts to understand the conditions under which a data mining algorithm is most suitable. Each algorithm has parameters and strategies of learning, such as ten-fold cross-validation or another division of the data for training and testing.
7. Employing the Data Mining algorithm
At last, the implementation of the data mining algorithm is reached. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
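A hedged sketch of steps 6 and 7 together, comparing one control parameter of a decision tree using ten-fold cross-validation; scikit-learn and the Iris dataset are stand-ins, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare settings of the minimum number of instances per leaf
# using ten-fold cross-validation.
X, y = load_iris(return_X_y=True)

for min_leaf in (1, 5, 20):
    model = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    scores = cross_val_score(model, X, y, cv=10)   # ten-fold cross-validation
    print(f"min_samples_leaf={min_leaf}: mean accuracy={scores.mean():.3f}")
```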
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. We also consider the preprocessing steps with regard to their impact on the results of the data mining algorithm (for example, adding a feature in step 4 and repeating from there). This step focuses on the comprehensibility and usefulness of the induced model. The identified knowledge is also recorded for further use. The last step is the usage of, and overall feedback on, the discovery results acquired by data mining.
9. Using the discovered knowledge
Now we are prepared to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. The success of this step determines the effectiveness of the whole KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a certain static snapshot (usually a set of data), but now the data becomes dynamic. Data structures may change (certain attributes may become unavailable), and the data domain might be modified (for example, an attribute may take a value that was not expected previously).
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the
data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns
and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future
trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing
large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge
to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data Quality: The KDD process heavily depends on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new, unseen data.
Difference between KDD and Data Mining (parameter-by-parameter comparison table)
Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates analysis and prediction.
Data Mining Techniques
In recent data mining projects, various major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata, and it helps to classify data into different classes (a short illustrative sketch follows the list below). Data mining frameworks can themselves be classified as follows:
i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled. For example, multimedia, spatial
data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be comprehensive frameworks offering several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented approaches, etc.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
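As the short sketch promised above, here is a minimal illustration of classification as a predictive technique (assigning records to predefined classes); scikit-learn and the Iris dataset are illustrative assumptions, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Learn class boundaries from labelled records, then classify unseen records.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("predicted class of first test record:", clf.predict(X_test[:1])[0])
```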
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details but achieves simplification; it models data by its clusters. From a historical point of view, data modeling puts clustering in a framework rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an important role in data mining applications, for example, scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining technique to identify similar
data. This technique helps to recognize the differences and similarities between the data.
Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
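A minimal clustering sketch, assuming scikit-learn; the two-dimensional points are invented so that two natural groups are obvious.

```python
import numpy as np
from sklearn.cluster import KMeans

# Group records by similarity without any labels.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one natural group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)
```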
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
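A minimal regression sketch, assuming scikit-learn; the advertising-spend and sales figures are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Model the relationship between advertising spend and sales, then project a value.
ad_spend = np.array([[10], [20], [30], [40], [50]])   # spend (thousands of rupees)
sales    = np.array([25,  45,  65,  80, 105])         # units sold (thousands)

model = LinearRegression().fit(ad_spend, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("projected sales for a spend of 60:", model.predict([[60]])[0])
```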
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden
pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have transaction data, for example, a list of grocery items that you have been buying for the last six months; it calculates the percentage of items being purchased together. Three measures are commonly used:
o Lift:
This measure compares the confidence of the rule with how often item B is purchased overall:
Lift(A→B) = Confidence(A→B) / Support(B)
o Support:
This measure shows how often items A and B are purchased together, relative to the overall dataset:
Support(A→B) = (Transactions containing both A and B) / (Total transactions)
o Confidence:
This measure shows how often item B is purchased when item A is purchased:
Confidence(A→B) = (Transactions containing both A and B) / (Transactions containing A)
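A small sketch computing these three measures for a hypothetical rule "bread -> butter"; the grocery transactions are invented.

```python
# Compute support, confidence and lift for the rule "bread -> butter".
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_a       = sum("bread" in t for t in transactions)
count_b       = sum("butter" in t for t in transactions)
count_a_and_b = sum({"bread", "butter"} <= t for t in transactions)

support    = count_a_and_b / n            # P(bread and butter)
confidence = count_a_and_b / count_a      # P(butter | bread)
lift       = confidence / (count_b / n)   # confidence relative to butter's overall frequency

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```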
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains such as intrusion detection and fraud detection. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas such as network intrusion identification, credit or debit card fraud detection, and detecting outliers in wireless sensor network data.
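A minimal outlier detection sketch using the interquartile-range (IQR) rule, assuming NumPy; the transaction amounts are invented.

```python
import numpy as np

# Flag values that fall far outside the interquartile range.
amounts = np.array([120, 130, 125, 118, 122, 127, 990])   # 990 looks anomalous

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print("outliers:", outliers)   # expected: [990]
```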
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
In other words, this data mining technique helps to discover or recognize similar patterns in transaction data over a period of time.
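A small sketch in the spirit of sequential pattern mining, counting ordered item pairs across invented customer purchase sequences; real systems use dedicated algorithms such as GSP or PrefixSpan.

```python
from collections import Counter
from itertools import combinations

# Count ordered pairs (a bought before b) across customer purchase sequences.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "phone", "case"],
    ["phone", "case"],
]

pair_counts = Counter()
for seq in sequences:
    # every ordered pair appearing in this sequence, counted once per customer
    pair_counts.update(set(combinations(seq, 2)))

for (a, b), count in pair_counts.most_common(3):
    print(f"{a} -> {b}: appears in {count} of {len(sequences)} sequences")
```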
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single
and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
1) Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, instead of the organization's global ongoing operations. This is done by excluding data that is not useful with respect to the subject and including all data needed by the users to understand the subject.
2) Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. It requires data cleaning and integration to be performed during data warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.
3) Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
4) Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures in
data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.
History of Data Warehousing
The idea of data warehousing dates to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy established the "business data warehouse."
In essence, the data warehousing idea was intended to provide an architectural model for the flow of information from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it.
In the absence of a data warehousing architecture, a vast amount of space was required to support multiple decision support environments. In large corporations, it was common for various decision support environments to operate independently.
Goals of Data Warehousing
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
2) Store historical data: The data warehouse is required to store time-variant data from the past. This data is used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5) High response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a quick response time.
The following are the functions of data warehouse tools and utilities
Data Transformation − Involves converting the data from legacy format to warehouse
format.
The Multidimensional Data Model is a method used for organizing data in the database, with good arrangement and assembly of the contents of the database.
The Multidimensional Data Model allows users to ask analytical questions associated with market or business trends, unlike relational databases, which allow users to access data only in the form of queries. It lets users rapidly receive answers to their requests by creating and examining the data comparatively quickly.
OLAP (online analytical processing) and data warehousing use multidimensional databases. The model is used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow data to be modeled and viewed from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
The Multidimensional Data Model works on the basis of pre-decided steps.
The following stages should be followed by every project for building a Multi Dimensional
Data Model:
Stage 1: Assembling data from the client: In the first stage, a Multidimensional Data Model collects correct data from the client. Mostly, software professionals explain to the client the range of data that can be obtained with the selected technology and collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, the Multidimensional Data Model recognizes and classifies all the data into the respective sections they belong to, and also makes it problem-free to apply step by step.
Stage 3: Noticing the different proportions: The third stage forms the basis on which the design of the system rests. In this stage, the main factors are recognized according to the user's point of view. These factors are also known as "Dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth
stage, the factors which are recognized in the previous step are used further for identifying the
related qualities. These qualities are also known as “attributes” in the database.
Stage 5: Finding the actuality of the factors listed previously and their qualities: In the fifth stage, the Multidimensional Data Model separates and differentiates the facts from the factors which were collected. These facts play a significant role in the arrangement of a Multidimensional Data Model.
Stage 6: Building the schema to place the data, with respect to the information collected from the steps above: In the sixth stage, a schema is built on the basis of the data collected previously.
For Example:
1. Let us take the example of a firm. The revenue and cost of a firm can be analyzed on the basis of different factors such as the geographical location of the firm's workplace, the firm's products, advertisements done, time utilized to launch a product, etc.
Example 1
2. Let us take the example of the data of a factory which sells products per quarter in Bangalore. The data is represented in the table given below:
(Table: 2D factory data)
In the above presentation, the factory's sales for Bangalore are shown with respect to the time dimension (organized into quarters) and the item dimension (sorted according to the kind of item sold). The facts are represented in rupees (in thousands).
Now, if we desire to view the sales data with a third dimension, location (for example Kolkata, Delhi, and Mumbai), then the three-dimensional data can be represented as a series of two-dimensional tables, one per location, as shown below:
(Table: 3D data representation as 2D)
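A small sketch of this multidimensional view using a pandas pivot table, with the location dimension spread across column groups; all figures, items, and locations are invented.

```python
import pandas as pd

# Quarterly sales by item and location, pivoted into one 2D slice per location.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["phone", "laptop", "phone", "laptop", "phone", "phone"],
    "location": ["Bangalore", "Bangalore", "Bangalore", "Kolkata", "Kolkata", "Kolkata"],
    "amount":   [605, 825, 680, 952, 410, 512],   # sales in thousands of rupees
})

cube = sales.pivot_table(index="item", columns=["location", "quarter"],
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)
```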
Data Cleaning:
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted, duplicated, or incomplete data from a dataset. Even if results and algorithms appear to be correct, they are unreliable if the data is inaccurate. There are numerous ways for data to be duplicated or incorrectly labeled when merging multiple data sources.
o Accuracy: The business's database must contain only extremely accurate data.
Comparing them to other sources is one technique to confirm their veracity. The stored
data will also have issues if the source cannot be located or contains errors.
o Coherence: To ensure that the information on a person or body is the same throughout
all types of storage, the data must be consistent with one another.
o Validity: There must be rules or limitations in place for the stored data. The information
must also be confirmed to support its veracity.
o Uniformity: A database's data must all share the same units or values. Keeping the data uniform does not complicate the process, so it is a crucial component of the data cleansing process.
o Data Verification: Every step of the process, including its appropriateness and effectiveness, must be checked. The analysis, design, and validation stages all play a role in the verification process. The disadvantages often become obvious only after the data has been through a certain number of changes.
o Clean Data Backflow: After quality issues have been addressed, the cleaned data should flow back to replace the dirty data in the original sources, so that legacy applications can profit from it and a subsequent data-cleaning program is not needed.
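A minimal sketch of routine cleaning work (standardizing formats and removing duplicates), assuming pandas; the customer records are invented.

```python
import pandas as pd

# Standardise formats and drop duplicate customer records.
customers = pd.DataFrame({
    "name":  [" Asha Rao", "asha rao", "Vikram Mehta", "Vikram Mehta"],
    "phone": ["98450-12345", "9845012345", "99020 11111", "9902011111"],
})

# Trim and normalise case; keep digits only in phone numbers.
customers["name"]  = customers["name"].str.strip().str.title()
customers["phone"] = customers["phone"].str.replace(r"\D", "", regex=True)

# Remove exact duplicates that remain after standardisation.
customers = customers.drop_duplicates().reset_index(drop=True)
print(customers)
```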
Data Cleansing Tools can be very helpful if you are not confident of cleaning the data yourself or
have no time to clean up all the data sets. There are many data cleaning tools in the market. Here
are some top-ranked data cleaning tools, such as:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure
Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:
o Ability to map the different functions and what your data is intended to do.
o Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
o Using tools for data cleaning will make for more efficient business practices and quicker decision-making.
Data Integration
Data integration is the process of merging data from several disparate sources. While performing data integration, you must deal with data redundancy, inconsistency, duplication, etc. In data mining, data integration is a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store, retaining and providing a unified view of the data.
Data integration is particularly important in the healthcare industry. Integrated data from various
patient records and clinics assist clinicians in identifying medical disorders and diseases by
integrating data from many systems into a single perspective of beneficial information from
which useful insights can be derived. Effective data collection and integration also improve
medical insurance claims processing accuracy and ensure that patient names and contact
information are recorded consistently and accurately. Interoperability refers to the sharing of
information across different systems.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is
spread across different systems, departments, and lines of business, in order to make better
decisions, improve operational efficiency, and gain a competitive advantage.
There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and
analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same data,
making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can
be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be
computationally expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple sources
can be difficult, especially when it comes to ensuring data accuracy, consistency, and
timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of the
system.
8. Integration with existing systems: Integrating new data sources with existing systems can
be a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high,
requiring specialized skills and knowledge.
There are two major approaches to data integration: the "tight coupling approach" and the "loose coupling approach".
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the
integrated data. The data is extracted from various sources, transformed and loaded into a data
warehouse. Data is integrated in a tightly coupled manner, meaning that the data is integrated
at a high level, such as at the level of the entire dataset or schema. This approach is also known
as data warehousing, and it enables data consistency and integrity, but it can be inflexible and
difficult to change or update.
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual
data elements or records. Data is integrated in a loosely coupled manner, meaning that the data
is integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables
data flexibility and easy updates, but it can be difficult to maintain consistency and integrity
across multiple data sources.
Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
And the data only remains in the actual source databases.
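A hedged sketch contrasting the two approaches with pandas: a tiny ETL step that loads a combined warehouse table (tight coupling) versus a function that queries the sources on demand (loose coupling). The source tables, column names, and the federated_lookup helper are invented for illustration.

```python
import pandas as pd

# Two invented source systems with different schemas.
crm_source  = pd.DataFrame({"cust_id": [1, 2], "FullName": ["Asha Rao", "Vikram Mehta"]})
billing_src = pd.DataFrame({"customer": [1, 2], "amount_due": [1200.0, 0.0]})

# Tight coupling (ETL): extract, transform to a common schema, load one warehouse table.
warehouse = (
    crm_source.rename(columns={"FullName": "name"})
    .merge(billing_src.rename(columns={"customer": "cust_id"}), on="cust_id")
)

# Loose coupling (federation): translate the request and fetch from the sources on demand.
def federated_lookup(cust_id):
    name = crm_source.loc[crm_source.cust_id == cust_id, "FullName"].iloc[0]
    due  = billing_src.loc[billing_src.customer == cust_id, "amount_due"].iloc[0]
    return {"cust_id": cust_id, "name": name, "amount_due": due}

print(warehouse)
print(federated_lookup(2))
```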
Moving data from one system to another requires a data pipeline that understands the structure and meaning of the data as well as defines the path it will take through the technical systems. The specific techniques used for data integration depend on the characteristics of the data sources and the requirements of the application.
Data Transformation
Data transformation is a technique used to convert raw data into a suitable format that efficiently eases data mining and the retrieval of strategic information. Data transformation includes data cleaning techniques and data reduction techniques to convert the data into the appropriate form.
Data transformation is an essential data preprocessing technique that must be performed on the
data before data mining to provide patterns that are easier to understand.
Data transformation changes the format, structure, or values of the data and converts them
into clean, usable data. Data may be transformed at two stages of the data pipeline for data
analytics projects. Organizations that use on-premises data warehouses generally use an ETL
(extract, transform, and load) process, in which data transformation is the middle step.
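A small sketch of two common transformations applied before mining, min-max scaling of a numeric attribute and one-hot encoding of a categorical attribute, assuming pandas; the data is invented.

```python
import pandas as pd

# Convert raw values into a form that is easier to mine.
df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "income": [30000, 52000, 41000, 61000],
})

# Min-max scaling of the numeric column to the range [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# One-hot encoding of the categorical column.
df = pd.get_dummies(df, columns=["city"])
print(df)
```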