
1. Data Mining Functionalities:


Data mining functions are used to identify the trends and correlations present in data. Data mining activities can be divided into two categories:
1. Descriptive Data Mining:
It describes what is happening within the data, without any prior hypothesis. The common features of the data set are highlighted.
Examples: count, average, etc.
2. Predictive Data Mining:
It estimates values for attributes whose labels are absent, based on previous observations. For example, judging from the findings of a patient's medical examinations whether he is suffering from a particular disease.
Data Mining Functionalities:
1. Class/Concept Descriptions:
Data entries can be associated with classes or concepts. It is often useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions are called class/concept descriptions.
• Data Characterization:
This refers to summarizing the general characteristics or features of the class under study. For example, to study the characteristics of a software product whose sales increased by 15% two years ago, one can collect data related to such products by running SQL queries.
• Data Discrimination:
It compares the general features of the class under study with those of one or more contrasting classes. The output of this process can be represented in many forms, e.g., bar charts, curves, and pie charts.
2. Mining Frequent Patterns, Associations, and Correlations:
Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
3. Classification and Prediction:
Classification is the process of finding a model that describes and distinguishes
data classes for the purpose of being able to use the model to predict the class of
objects whose class label is unknown.
4. Cluster Analysis:
Classification and prediction analyse class-labelled data objects, whereas clustering analyses data objects without consulting a known class label.
5. Outlier Analysis:
A database may contain data objects that do not comply with the general
behaviour or model of the data. These data objects are outliers.
Most data mining methods discard outliers as noise or exceptions. The analysis of
outlier data is referred to as outlier mining.
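The outlier-analysis idea above can be sketched with a simple z-score rule. This is an illustrative sketch, not from the original text; the function name and data are hypothetical, and real systems typically use more robust methods:

```python
# Illustrative sketch: flag values far from the mean as outliers
# (a simple z-score rule with a configurable threshold).
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    mu = mean(values)
    sigma = stdev(values)
    # A value is flagged when it lies more than `threshold`
    # standard deviations away from the mean.
    return [v for v in values if abs(v - mu) > threshold * sigma]

data = [10, 12, 11, 13, 12, 95]  # 95 clearly deviates from the rest
print(find_outliers(data, threshold=1.5))
```

Note that a single extreme value inflates the standard deviation itself, which is one reason clustering-based or distance-based outlier mining is often preferred.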

2. Data Mining Task Primitives


Each user will have a data mining task in mind, that is, some form of data analysis that he
or she would like to have performed. A data mining task can be specified in the form of
a data mining query, which is input to the data mining system. A data mining query is
defined in terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery in order to
direct the mining process, or examine the findings from different angles or depths.

The data mining primitives specify the following:


The set of task-relevant data to be mined:
This specifies the portions of the database or the set of data in which the user is
interested. This includes the database attributes or data warehouse dimensions of
interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and for evaluating the patterns found.
The interestingness measures and thresholds for pattern evaluation:
They may be used to guide the mining process or, after discovery, to evaluate the
discovered patterns. Different kinds of knowledge may have different interestingness
measures. For example, interestingness measures for association rules include support
and confidence. Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns:
This refers to the form in which discovered patterns are to be displayed, which may
include rules, tables, charts, graphs, decision trees, and cubes.
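The support and confidence measures mentioned among the interestingness primitives can be computed directly from a transaction list. The following sketch (with made-up transactions) shows both measures for one rule:

```python
# Illustrative sketch: support and confidence of the association rule
# {bread} -> {butter} over a small, made-up list of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

n = len(transactions)
# Transactions containing both the antecedent and the consequent.
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
# Transactions containing the antecedent only.
antecedent = sum(1 for t in transactions if "bread" in t)

support = both / n              # fraction of all transactions with both items
confidence = both / antecedent  # of those with bread, fraction also with butter

print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule would then be kept only if both values exceed the user-specified thresholds.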

3. Major Issues in Data Mining:


• Mining different kinds of knowledge in databases:
Different users have different needs and may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.

• Interactive mining of knowledge at multiple levels of abstraction:


The data mining process needs to be interactive because it allows users to
focus the search for patterns, providing and refining data mining requests
based on returned results.
• Incorporation of background knowledge:
Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining:
A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results:
Once patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable by users.
• Handling noisy or incomplete data:
Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation:
This refers to the interestingness of the discovered patterns. Patterns are uninteresting if they merely represent common knowledge or lack novelty.
• Efficiency and scalability of data mining algorithms:
To effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms:
Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update existing mining results when the database changes, without having to mine the data again from scratch.
4. Data Pre-processing:

Data pre-processing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Pre-processing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
• (a) Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.
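Filling missing values by the attribute mean, as described above, can be sketched as follows (an illustrative sketch; the function name and data are hypothetical):

```python
# Illustrative sketch: fill missing values (represented as None)
# in one attribute with the mean of the observed values.
def fill_with_mean(values):
    observed = [v for v in values if v is not None]
    mu = sum(observed) / len(observed)
    return [mu if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(fill_with_mean(ages))  # the None entries become the mean, 30.0
```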
• (b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. One can replace all values in a segment by the segment mean, or use the segment's boundary values.
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters can be treated as outliers (though some outliers may go undetected).
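The binning method described above (smoothing by bin means over equal-size bins of sorted data) can be sketched like this; the data values are a standard textbook-style example:

```python
# Illustrative sketch: smooth sorted data by dividing it into
# equal-size bins and replacing each value with its bin mean.
def smooth_by_bin_means(sorted_values, bin_size):
    result = []
    for i in range(0, len(sorted_values), bin_size):
        bin_ = sorted_values[i:i + bin_size]
        mu = sum(bin_) / len(bin_)
        result.extend([mu] * len(bin_))  # every value takes the bin mean
    return result

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
print(smooth_by_bin_means(data, 3))
```

Smoothing by bin boundaries would instead replace each value with the nearer of the bin's minimum and maximum.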
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
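Normalization to a specified range, mentioned in step 1 above, is commonly done with the min-max formula. A minimal sketch (function name and salary values are illustrative):

```python
# Illustrative sketch: min-max normalization, rescaling values
# linearly into [new_min, new_max] (default [0.0, 1.0]).
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

salaries = [30000, 50000, 70000, 90000]
print(min_max_normalize(salaries))  # smallest maps to 0.0, largest to 1.0
```

A real implementation would also have to handle the degenerate case where all values are equal (hi == lo).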
3. Data Reduction:
Data mining handles huge amounts of data, and analysis becomes harder when working with such volumes. To address this, data reduction techniques are used. They aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:

1. Data Cube Aggregation:
The aggregation operation is applied to the data to construct the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the significance level and p-value of the attribute: an attribute whose p-value is greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.
4. Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
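To make the PCA idea concrete, here is a minimal sketch that projects 2-D points onto their single principal axis, using the closed-form eigendecomposition of the 2x2 covariance matrix. The function name and data are illustrative, and real implementations use a linear-algebra library:

```python
# Illustrative sketch: PCA for 2-D data, reducing each point to its
# coordinate along the leading principal component (2-D -> 1-D).
import math

def pca_2d(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # sample covariance matrix entries
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # leading eigenvalue of [[sxx, sxy], [sxy, syy]] (closed form)
    mean_diag = (sxx + syy) / 2
    delta = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam = mean_diag + delta
    # corresponding eigenvector, normalized to unit length
    vx, vy = (sxy, lam - sxx) if sxy != 0 else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # project each centered point onto the principal axis
    return [x * vx + y * vy for x, y in centered]

points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
print(pca_2d(points))  # points lie on a line, so nothing is lost in 1-D
```

Because the sample points lie exactly on a line, the 1-D projection preserves all of their variance; this is the lossless extreme of a generally lossy reduction.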

5. Data Cleaning:
Data cleaning is one of the important parts of data mining and plays a significant role in building a model; the success or failure of a project often relies on proper data cleaning.
With a well-cleaned dataset, even simple algorithms can achieve good results, which is especially beneficial in terms of computation when the dataset is large.
Obviously, different types of data will require different types of cleaning. However, the following systematic approach can always serve as a good starting point.

Steps involved in Data Cleaning:


1. Removal of unwanted observations
This includes deleting duplicate/redundant or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don't actually fit the specific problem that you're trying to solve.
• Redundant observations greatly reduce efficiency, since the repeated data may bias the results one way or the other, producing unreliable results.
• Irrelevant observations are any type of data that is of no use to us and can be removed directly.
2. Fixing structural errors
The errors that arise during measurement, transfer of data, or other similar situations are called structural errors. Structural errors include typos in feature names, the same attribute appearing under different names, mislabelled classes (i.e., separate classes that should really be the same), and inconsistent capitalization.
• For example, the model will treat "america" and "America" as different classes or values, though they represent the same value; similarly, "red", "yellow", and "red-yellow" may be treated as three different classes, though one class can be included in the other two. Such structural errors make our model inefficient and give poor-quality results.
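A common first pass at fixing such capitalization errors is to normalize every class label to one canonical form. A minimal sketch (function name and labels are illustrative):

```python
# Illustrative sketch: normalize class labels so that spelling variants
# like 'america', 'America ' and 'AMERICA' collapse into one class.
def normalize_labels(labels):
    # strip stray whitespace and lowercase every label
    return [label.strip().lower() for label in labels]

raw = ["America", "america ", "AMERICA", "Canada"]
print(normalize_labels(raw))
```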
3. Managing unwanted outliers
Outliers can cause problems with certain types of models. For example, linear
regression models are less robust to outliers than decision tree models.
Generally, we should not remove outliers unless we have a legitimate reason to remove them. Sometimes removing them improves performance, sometimes not. So one must have a good reason to remove an outlier, such as suspicious measurements that are unlikely to be part of the real data.
4. Handling missing data
Missing data is a deceptively tricky issue in machine learning. We cannot just
ignore or remove the missing observation. They must be handled carefully as
they can be an indication of something important. The two most common
ways to deal with missing data are:
1. Dropping observations with missing values.
Dropping missing values is sub-optimal because when you drop observations, you drop information.
• The fact that the value was missing may be informative in itself.
• Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
2. Imputing the missing values from past observations.
Imputing missing values is sub-optimal because the value was originally missing but you filled it in, which always leads to a loss of information, no matter how sophisticated your imputation method is.
• Again, "missingness" is almost always informative in itself, and you should tell your algorithm if a value was missing.
• Even if you build a model to impute your values, you're not adding any real information. You're just reinforcing the patterns already provided by other features.
Both of these approaches are sub-optimal: dropping an observation means dropping information, thereby reducing the data, while imputing values fills in data that was not present in the actual dataset, which also leads to a loss of information.
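The two strategies, plus the "tell your algorithm a value was missing" advice, can be sketched together on a toy table (all names and values here are illustrative):

```python
# Illustrative sketch: dropping vs. imputing missing values, with a
# missing-value indicator flag recorded alongside the imputed value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
]

# Strategy 1: drop every observation with any missing value.
dropped = [r for r in rows if None not in r.values()]

# Strategy 2: impute missing ages with the attribute mean, and keep
# a flag so downstream algorithms know which values were missing.
observed = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(observed) / len(observed)
imputed = [
    {**r,
     "age": mean_age if r["age"] is None else r["age"],
     "age_was_missing": r["age"] is None}
    for r in rows
]

print(len(dropped), [r["age_was_missing"] for r in imputed])
```

Note how dropping shrinks three rows to one, while imputation keeps all rows but invents a value, which is exactly the trade-off described above.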

6. Classification of Data Mining Systems:


Data Mining is considered as an interdisciplinary field. It includes a set of various
disciplines such as statistics, database systems, machine learning, visualization and
information sciences. Classification of the data mining system helps users to
understand the system and match their requirements with such systems.
Data mining systems can be categorized according to various criteria, as follows:

1. Classification according to the application adapted:
This involves domain-specific applications. For example, data mining systems can be tailored for telecommunications, finance, stock markets, e-mail, and so on.
2. Classification according to the type of techniques utilized:
This considers the degree of user interaction or the technique of data analysis involved, for example machine learning, visualization, pattern recognition, neural networks, and database-oriented or data-warehouse-oriented techniques.
3. Classification according to the types of knowledge mined:
This is based on functionalities such as characterization, association,
discrimination and correlation, prediction etc.
4. Classification according to the types of databases mined:
A data mining system can be classified according to the database it mines, e.g., by the type of data, the data model used, or the application of the data.

7. Challenges of Data Mining:


Nowadays, data mining and knowledge discovery are evolving into a crucial technology for businesses and researchers in many domains. Although data mining is developing into an established and trusted discipline, many pending challenges still have to be solved. Some of these challenges are given below.

1. Security and Social Challenges:
Decision-making strategies rely on data collection and sharing, which requires considerable security. Private information about individuals and other sensitive information is collected for customer profiling and for understanding user behaviour patterns. Illegal access to information and the confidentiality of information are becoming important issues.
2. User Interface:
The knowledge discovered using data mining tools is useful only if it is interesting and, above all, understandable by the user. Good visualization eases the interpretation of mining results and helps users better understand their requirements. Much research is being carried out on good visualization for big data sets, to display and manipulate mined knowledge.
(i) Mining based on level of abstraction: the data mining process needs to be interactive, because this allows users to concentrate on finding patterns, and to present and refine data mining requests based on returned results.
(ii) Integration of background knowledge: previous knowledge may be used to direct the discovery process and to express the discovered patterns.
3. Mining Methodology Challenges:
These challenges are related to data mining approaches and their limitations. Different approaches may perform differently depending on the data under consideration: some algorithms require noise-free data, yet most real data sets contain exceptions and invalid or incomplete information, which complicates the analysis process and in some cases compromises the precision of the results.

4. Complex Data:
Real-world data is heterogeneous: it could be multimedia data containing images, audio, and video, complex data, temporal data, spatial data, time series, natural-language text, etc. It is difficult to handle these various kinds of data and extract the required information, and new tools and methodologies are being developed to do so.
(i) Complex data types: the database can include complex data elements, objects with graphical data, spatial data, and temporal data. It is not practical to mine all these kinds of data with a single system.
(ii) Mining from Varied Sources: The data is gathered from different sources
on Network. The data source may be of different kinds depending on how they
are stored such as structured, semi-structured or unstructured.
5. Performance:
The performance of a data mining system depends on the efficiency of the algorithms and techniques used. Poorly designed algorithms and techniques adversely affect the performance of the data mining process.
(i) Efficiency and Scalability of the Algorithms: The data mining algorithm
must be efficient and scalable to extract information from huge amounts of
data in the database.
(ii) Improvement of mining algorithms: factors such as the enormous size of databases, the wide distribution of data, and the complexity of data mining approaches inspire the creation of parallel and distributed data mining algorithms.
8. Architecture of Data Mining:
Data mining refers to the detection and extraction of new patterns from the already
collected data. Data mining is the amalgamation of the field of statistics and
computer science aiming to discover patterns in incredibly large datasets and then
transforming them into a comprehensible structure for later use.
Architecture of Data Mining:
Basic Working:
1. It all starts when the user submits certain data mining requests; these requests are then sent to the data mining engine for pattern evaluation.
2. These components try to find the answer to the query using the already present database.
3. The extracted metadata is then sent for proper analysis to the data mining engine, which sometimes interacts with the pattern evaluation modules to determine the result.
4. This result is then sent to the front end in an easily understandable manner
using a suitable interface.
The parts of the data mining architecture are described below:

1. Data Sources:
Databases, the WWW, and data warehouses are data sources. The data in these sources may be plain text, spreadsheets, or other forms of media like photos or videos. The WWW is one of the biggest sources of data.
2. Database Server:
The database server contains the actual data ready to be processed. It
performs the task of handling data retrieval as per the request of the user.
3. Data Mining Engine:
It is one of the core components of the data mining architecture that
performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules:
They are responsible for finding interesting patterns in the data and
sometimes they also interact with the database servers for producing the
result of the user requests.
5. Graphical User Interface:
Since the user cannot fully understand the complexity of the data mining process, a graphical user interface helps the user communicate effectively with the data mining system.
6. Knowledge Base:
Knowledge Base is an important part of the data mining engine that is quite
beneficial in guiding the search for the result patterns. Data mining engine
may also sometimes get inputs from the knowledge base. This knowledge
base may contain data from user experiences. The objective of the
knowledge base is to make the result more accurate and reliable.
