Bi Lesson 6
INTELLIGENCE
SIT 303
LESSON 6
Objectives
a) Categories of metadata
b) Roles of metadata
c) Challenges of metadata
d) Define Data mining
e) KDD process
f) Data Preprocessing in Data mining
g) Issues in Data mining
h) Types of data in data mining
i) Advantages and Disadvantages of Data mining
j) Applications in Data mining
Metadata
Metadata is data about data. That is, the data that is used to represent other data is known
as metadata.
For example, the index of a book serves as a metadata for the contents in the book.
In other words, we can say that metadata is the summarized data that leads us to detailed data.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Categories of Metadata
Metadata can be broadly categorized into three categories −
• Business Metadata − It has the data ownership information, business definition, and
changing policies.
• Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
• Operational Metadata − It includes currency of data, data lineage and data quality.
Currency of data means whether the data is active, archived, or purged. Lineage of
data means the history of how the data was migrated and which transformations were applied to it.
Role of Metadata
1. Metadata acts as a directory. This directory helps the decision support
system to locate the contents of the data warehouse.
2. Metadata helps the decision support system map data when
it is transformed from the operational environment to the data warehouse
environment.
3. Metadata also helps in summarization between lightly detailed data and
highly summarized data.
4. Metadata is used for query tools.
5. Metadata is used in extraction and cleansing tools.
6. Metadata is used in reporting tools.
7. Metadata is used in transformation tools.
8. Metadata plays an important role in loading functions.
Challenges for Metadata Management
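i. Smoothing by bin mean
For example, given the values
21, 25, 34, 4, 15, 8, 21, 24, 28
sort first: 4, 8, 15, 21, 21, 24, 25, 28, 34
Select a bin size, e.g. 3 (equal-frequency bins), then replace every value with the mean of its bin:
Bin 1: 4, 8, 15      mean = (4+8+15)/3 = 9      smoothed: 9, 9, 9
Bin 2: 21, 21, 24    mean = (21+21+24)/3 = 22   smoothed: 22, 22, 22
Bin 3: 25, 28, 34    mean = (25+28+34)/3 = 29   smoothed: 29, 29, 29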
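The bin-mean smoothing worked through above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name smooth_by_bin_means is an assumption made for this example, and it uses equal-frequency bins of a fixed size as in the worked example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        # Every member of the bin is replaced by the bin mean.
        smoothed.extend([mean] * len(bin_values))
    return smoothed


data = [21, 25, 34, 4, 15, 8, 21, 24, 28]
print(smooth_by_bin_means(data, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

The printed result matches the three bins in the example: 9, 22 and 29.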
• Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
• Clustering divides the population or data points into a number of groups such that data points in
the same group are more similar to each other than to data points in other
groups.
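As a rough illustration of grouping similar values, here is a tiny one-dimensional k-means sketch. k-means is only one of many clustering methods and is chosen here as an assumption for the example; the function name kmeans_1d, the sample values and the starting centers are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


centers, clusters = kmeans_1d([2, 3, 4, 20, 21, 22], [0.0, 10.0])
print(centers)   # [3.0, 21.0]
print(clusters)  # [[2, 3, 4], [20, 21, 22]]
```

A value that sits far from every center, for example 95 in this data set, would be a candidate outlier.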
2. Data Transformation:
This step transforms the data into forms suitable
for the mining process. It involves the following:
i. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to
1.0 or 0.0 to 1.0). The numerical attributes are scaled up or down to fit
within that range.
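One common way to scale values into a specified range is min-max normalization. The sketch below assumes that technique (the notes do not name a specific formula); the function name min_max_normalize and its parameters are chosen for this example only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


print(min_max_normalize([10, 20, 30]))             # [0.0, 0.5, 1.0]
print(min_max_normalize([10, 20, 30], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
```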
ii. Attribute Selection (attribute construction):
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
For example, a date-of-birth attribute can be
transformed into another property such as is_senior_citizen for each tuple, which
can directly influence the prediction of diseases or chances of survival.
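The is_senior_citizen example can be sketched as follows. The age threshold of 60, the record layout (a dict with a date_of_birth key) and the function names are assumptions made for illustration; the notes do not define them.

```python
from datetime import date

SENIOR_AGE = 60  # assumed threshold, not given in the notes


def age_on(dob, today):
    """Whole years between dob and today."""
    years = today.year - dob.year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1  # birthday has not happened yet this year
    return years


def add_is_senior_citizen(records, today):
    """Return copies of the records with a derived is_senior_citizen attribute."""
    return [
        {**r, "is_senior_citizen": age_on(r["date_of_birth"], today) >= SENIOR_AGE}
        for r in records
    ]


patients = [
    {"name": "A", "date_of_birth": date(1950, 1, 1)},
    {"name": "B", "date_of_birth": date(2000, 6, 15)},
]
for row in add_is_senior_citizen(patients, date(2024, 1, 1)):
    print(row["name"], row["is_senior_citizen"])
# A True
# B False
```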
iii. Discretization:
It is the process of transforming continuous data into a set of small intervals. This is often done
because raw continuous features can be harder to relate to the
target variable, while intervals are easier to interpret.
For example, numeric values can be grouped into intervals (1-10, 11-20), or age can be mapped to labels (young, middle-aged, senior).
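A minimal sketch of discretizing age into the three labels mentioned above. The cut points 30 and 60 are assumptions for this example, since the notes do not say where one interval ends and the next begins.

```python
from bisect import bisect_right

AGE_CUTS = [30, 60]  # assumed boundaries: <30, 30-59, >=60
AGE_LABELS = ["young", "middle-aged", "senior"]


def discretize_age(age):
    """Map a continuous age onto one of the interval labels."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


print([discretize_age(a) for a in (12, 30, 59, 60, 81)])
# ['young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```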
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
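The lossless versus lossy distinction can be illustrated with a short sketch. It does not implement wavelet transforms or PCA; it swaps in zlib compression as a lossless encoding and simple rounding as a lossy one, and the sample readings are made up for the example.

```python
import zlib

readings = [21.04, 21.07, 20.98, 21.01] * 50  # made-up sample data


def lossless_round_trip(values):
    """Encode with zlib and decode again; the original bytes come back exactly."""
    raw = ",".join(f"{v:.2f}" for v in values).encode("ascii")
    compressed = zlib.compress(raw)
    restored = zlib.decompress(compressed)
    return raw, compressed, restored


def lossy_reduce(values):
    """Round to whole numbers; the discarded decimals cannot be recovered."""
    return [round(v) for v in values]


raw, compressed, restored = lossless_round_trip(readings)
print(len(compressed) < len(raw))          # True: the data got smaller
print(restored == raw)                     # True: lossless
print(lossy_reduce(readings) == readings)  # False: lossy
```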
Issues In Data Mining
Issues that need to be addressed by any serious data mining package are:
i. Uncertainty Handling
ii. Dealing with Missing Values
iii. Dealing with Noisy data
iv. The efficiency of algorithms
v. Limiting Knowledge Discovered to only Useful
vi. Incorporating Domain Knowledge
vii. Size and Complexity of Data
viii. Data Selection
ix. Understandability of Discovered Knowledge
x. Consistency between Data and Discovered Knowledge
Types of Data used in Data Mining
1. Relational Database:
A relational database is a collection of multiple data sets formally organized
into tables, records, and columns, from which data can be accessed in various
ways without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.
2. Data warehouses:
A data warehouse is a technology that collects data from various
sources within the organization to provide meaningful business insights.
Large amounts of data come from multiple places such as Marketing and
Finance. The extracted data is used for analytical purposes and helps in
decision-making for a business organization. The data warehouse is designed
for the analysis of data rather than transaction processing.
3. Data Repositories:
A data repository generally refers to a destination for data storage.
However, many IT professionals use the term more specifically to refer to a particular
kind of setup within an IT structure, for example, a group of databases where an
organization keeps various kinds of information.
4. Object-Relational Database:
A combination of an object-oriented database model and relational database
model is called an object-relational model. It supports Classes, Objects,
Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the
gap between the Relational database and the object-oriented model practices
frequently utilized in many programming languages, for example, C++, Java, C#,
and so on.
5. Transactional Database:
A transactional database refers to a database management system (DBMS) that
can undo a database transaction if it is not performed
correctly. Although this was once a unique capability,
today most relational database systems support transactional
database activities.
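The ability to undo a transaction can be seen with Python's built-in sqlite3 module. This is a sketch under assumptions: the accounts table, its CHECK constraint and the transfer function are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; undo everything if any step fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 500), balances(conn))
# False {'alice': 100, 'bob': 50}
print(transfer(conn, "alice", "bob", 30), balances(conn))
# True {'alice': 70, 'bob': 80}
```

In the failed transfer, the credit to bob had already been applied when the debit from alice was rejected; the rollback undoes that credit, so both balances stay unchanged.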
Advantages of Data Mining
1. Data mining enables organizations to obtain knowledge-based
data.
2. Data mining enables organizations to make meaningful modifications in
operation and production.
3. Compared with other statistical data applications, data mining is
cost-efficient.
4. Data mining helps the decision-making process of an organization.
5. It facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
6. It can be introduced into new systems as well as existing platforms.
7. It is a quick process that makes it easy for new users to analyze enormous
amounts of data in a short time.
Disadvantages of Data Mining
1. Organizations may sell useful customer data
to other organizations for money. For example, American
Express has reportedly sold its customers' credit card purchase data to other
organizations.
2. Much data mining analytics software is difficult to operate and needs
advanced training to work with.
3. Different data mining tools operate in distinct ways because of the
different algorithms used in their design. Therefore, selecting the
right data mining tool is a very challenging task.
4. Data mining techniques are not fully precise, which may lead to severe
consequences in certain conditions.
Data Mining Applications
Data mining is used by organizations with intense consumer demands, such as
i. Retail,
ii. Communication,
iii. Financial, and
iv. Marketing companies,
to determine
v. price,
vi. consumer preferences,
vii. product positioning and impact on sales,
viii. customer satisfaction and corporate profits.
Data mining enables a retailer to use point-of-sale records of customer
purchases to develop products and promotions that help the organization to
attract the customer.
Customer relationship management
1. Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and
analytics to gain better insights and to identify best practices that will enhance health care
services and reduce costs. Analysts use data mining approaches such as machine learning,
multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can
be used to forecast the number of patients in each category. These procedures help ensure that patients get
intensive care at the right place and at the right time. Data mining also enables healthcare
insurers to recognize fraud and abuse.