
Bi Lesson 6

The document discusses introduction to business intelligence and data mining. It covers topics like metadata, its categories and roles, data preprocessing techniques like cleaning, transformation, and reduction in data mining. Challenges of metadata management and different types of data in data mining are also discussed.

INTRODUCTION TO BUSINESS

INTELLIGENCE
SIT 303

LESSON 6
Objectives
a) Categories of metadata
b) Roles of metadata
c) Challenges of metadata
d) Define Data mining
e) KDD process
f) Data Preprocessing in Data mining
g) Issues in Data mining
h) Types of data in data mining
i) Advantages and Disadvantages of Datamining
j) Applications in Data mining
Meta data
Metadata is data about data. That is, data that is used to describe other data is known
as metadata.

For example, the index of a book serves as metadata for the contents of the book.

In other words, we can say that metadata is the summarized data that leads us to detailed data.

In terms of data warehouse, we can define metadata as follows.

i. Metadata is the road-map to a data warehouse.

ii. Metadata in a data warehouse defines the warehouse objects.

iii. Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Categories of Metadata
Metadata can be broadly categorized into three categories −

• Business Metadata − It has the data ownership information, business definition, and
changing policies.

• Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.

• Operational Metadata − It includes currency of data, data lineage, and data quality.
Currency of data indicates whether the data is active, archived, or purged. Lineage of
data means the history of the data's migration and the transformations applied to it.
Role of Metadata
1. Metadata acts as a directory. This directory helps the decision support
system to locate the contents of the data warehouse.
2. Metadata helps the decision support system map data when it is
transformed from the operational environment to the data warehouse
environment.
3. Metadata also helps in summarization between lightly detailed data and
highly summarized data.
4. Metadata is used for query tools.
5. Metadata is used in extraction and cleansing tools.
6. Metadata is used in reporting tools.
7. Metadata is used in transformation tools.
8. Metadata plays an important role in loading functions.
Challenges for Metadata Management

1. Metadata in a big organization is scattered across the organization. This
metadata is spread across spreadsheets, databases, and applications.

2. Metadata could be present in text files or multimedia files. To use this
data for information management solutions, it has to be correctly defined.

3. There are no industry-wide accepted standards. Data management
solution vendors have a narrow focus.

4. There are no easy and accepted methods of passing metadata.


Data mining
Introduction to data mining
• Data mining is one of the most useful techniques for helping entrepreneurs,
researchers, and individuals extract valuable information from huge sets of
data.
• The data sources can include databases, data warehouses, the Web, and other
information repositories.
• Data mining is also called Knowledge Discovery in Databases (KDD).
• The knowledge discovery process includes
i. Data cleaning,
ii. Data integration,
iii. Data selection,
iv. Data transformation,
v. Data mining,
vi. Pattern evaluation, and
vii. Knowledge presentation.
KDD process Diagram
Iterative steps KDD
1. Data cleaning: It involves removing inconsistent data and noise.
2. Data integration: It involves combining data from various data sources and
forming a single data source by integrating it all together.
3. Data selection: The data relevant to the analysis task is selected
from the data source.
4. Data transformation: Data is transformed into a form appropriate for
mining, using operations such as summarization or aggregation.
5. Data mining: It is an essential process where intelligent methods are applied
to extract data patterns.
6. Pattern evaluation: The identified patterns are evaluated, and the interesting
patterns are retained as knowledge.
7. Knowledge presentation: The mined knowledge is presented to the user using
visualization and other graphical representation methods.
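The seven steps above can be sketched as a chain of small functions. This is only a toy illustration: the record layout (customer, amount, channel), the function names, and the spending threshold are all assumptions made for the example, not part of the KDD definition.

```python
from collections import Counter

def clean(records):
    # 1. Data cleaning: drop records that contain a missing (None) field.
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    # 2. Data integration: combine several sources into a single list.
    return [r for source in sources for r in source]

def select(records, fields):
    # 3. Data selection: keep only the fields relevant to the task.
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # 4. Data transformation: aggregate the amount spent per customer.
    totals = Counter()
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals):
    # 5. Data mining: rank customers by total spend (a very simple "pattern").
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def evaluate(patterns, min_total):
    # 6. Pattern evaluation: keep only patterns that pass a threshold.
    return [(c, t) for c, t in patterns if t >= min_total]

def present(patterns):
    # 7. Knowledge presentation: format the result for the user.
    return [f"{customer}: {total}" for customer, total in patterns]

# Hypothetical sales records from two sources.
store_a = [{"customer": "ann", "amount": 40, "channel": "shop"},
           {"customer": "bob", "amount": None, "channel": "shop"}]
store_b = [{"customer": "ann", "amount": 70, "channel": "web"},
           {"customer": "cy", "amount": 20, "channel": "web"}]

merged = integrate(clean(store_a), clean(store_b))
totals = transform(select(merged, ["customer", "amount"]))
report = present(evaluate(mine(totals), min_total=100))
print(report)
```

In practice the steps are iterative: the result of pattern evaluation often sends the analyst back to selection or transformation with different choices.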
Data Preprocessing in Data Mining
Data preprocessing is a data mining technique used to transform
raw data into a useful and efficient format.
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning deals with
these; it involves handling missing data, noisy data, etc.

• (a). Missing Data:
This situation arises when some values are missing from the data. It can be handled in
various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset is quite large and
multiple values are missing within a tuple.
• Fill the missing values:
There are various ways to do this. You can fill the missing
values manually, with the attribute mean, or with the most probable value.
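As a minimal sketch of the two approaches above, the following code drops incomplete tuples and, alternatively, fills a numeric column with its mean. The (age, income) records and the use of None to mark a missing value are assumptions made for the example.

```python
from statistics import mean

def drop_incomplete(rows):
    # "Ignore the tuples": keep only rows with no missing (None) values.
    return [row for row in rows if None not in row]

def fill_with_mean(rows, column):
    # "Fill the missing values": replace None in one numeric column
    # with the mean of the known values in that column.
    known = [row[column] for row in rows if row[column] is not None]
    column_mean = mean(known)
    return [
        tuple(column_mean if i == column and v is None else v
              for i, v in enumerate(row))
        for row in rows
    ]

# Hypothetical (age, income) tuples; None marks a missing value.
records = [(25, 3000), (30, None), (None, 4000), (35, 5000)]

complete = drop_incomplete(records)        # only the two complete tuples remain
filled = fill_with_mean(records, column=1) # missing income replaced by 4000
print(complete)
print(filled)
```

Dropping tuples loses half of this tiny dataset, which is why that approach is recommended only for large datasets.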
• (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated
by faulty data collection, data entry errors, etc. It can be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The data is divided into
segments (bins) of equal size, and the values in each bin are then smoothed. Each
segment is handled separately.
• There are 3 approaches to smoothing:
i) Smoothing by bin mean
ii) Smoothing by bin median
iii) Smoothing by bin boundary

i. Smoothing by bin mean
For example, take the values
21, 25, 34, 4, 15, 8, 21, 24, 28
Sort first: 4, 8, 15, 21, 21, 24, 25, 28, 34
Select a bin size, e.g. 3 (equal-frequency bins)
4 8 15      (4+8+15)/3 = 9       9 9 9
21 21 24    (21+21+24)/3 = 22    22 22 22
25 28 34    (25+28+34)/3 = 29    29 29 29

ii. Smoothing by bin median
4 8 15      median = 8     8 8 8
21 21 24    median = 21    21 21 21
25 28 34    median = 28    28 28 28
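
The worked example can be reproduced with a short script, which also covers the third approach, smoothing by bin boundary, where each value is replaced by the nearer of its bin's minimum and maximum. The function names are illustrative, and sending ties to the lower boundary is an assumption, since the rule for ties is not stated here.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    # Sort the data, then cut it into consecutive bins of bin_size values.
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_mean(bins):
    # Replace every value in a bin with the bin mean.
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_median(bins):
    # Replace every value in a bin with the bin median.
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    # Replace every value with the nearer bin boundary (min or max).
    # Assumption: a value exactly halfway goes to the lower boundary.
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

data = [21, 25, 34, 4, 15, 8, 21, 24, 28]
bins = equal_frequency_bins(data, 3)

print(bins)                      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_mean(bins))      # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_median(bins))    # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundary(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```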

• Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).

• Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters
can be treated as outliers, although some outliers may go undetected.
• Clustering divides the population or data points into a number of groups such that data points in
the same group are more similar to each other than to those in other
groups.
2. Data Transformation:
This step transforms the data into forms suitable
for the mining process. It involves the following:
i. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to
1.0 or 0.0 to 1.0). The numerical attributes are scaled up or down to fit
within that range.
ii. Attribute Selection (Attribute Construction):
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
For example, a date of birth attribute can be
transformed into another property such as is_senior_citizen for each tuple, which
may directly influence predictions of diseases, chances of survival, etc.
iii. Discretization:
It is a process of transforming continuous data into a set of small intervals. This is done
because continuous features tend to have a smaller chance of correlation with the
target variable.
For example, a numeric attribute can be split into intervals (1-10, 11-20), or age can be
mapped to labels (young, middle age, senior).

iv. Concept Hierarchy Generation:


Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute “city” can be converted to “country”.
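The four transformations above can be illustrated with a few small functions. This is a sketch under stated assumptions: the senior-citizen threshold of 60, the age cut points of 30 and 60, and the city-to-country table are hypothetical values chosen for the example.

```python
from datetime import date

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Normalization: rescale values linearly into [new_min, new_max].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # all values equal; avoid dividing by zero
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def age_on(birth, today):
    # Age in whole years; subtract one if the birthday has not yet occurred this year.
    return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))

def is_senior_citizen(birth, today, threshold=60):
    # Attribute construction: derive a new boolean attribute from date of birth.
    # The threshold of 60 years is an assumption for illustration.
    return age_on(birth, today) >= threshold

def discretize_age(age):
    # Discretization: map a continuous age to a label (cut points are assumptions).
    if age < 30:
        return "young"
    if age < 60:
        return "middle age"
    return "senior"

# Concept hierarchy generation: a hypothetical city -> country mapping.
CITY_TO_COUNTRY = {"Nairobi": "Kenya", "Mombasa": "Kenya", "Kampala": "Uganda"}

print(min_max_normalize([10, 20, 30]))             # [0.0, 0.5, 1.0]
print(min_max_normalize([10, 20, 30], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
print(is_senior_citizen(date(1950, 6, 15), date(2024, 1, 1)))  # True
print(discretize_age(45))                          # middle age
print(CITY_TO_COUNTRY["Mombasa"])                  # Kenya
```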
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes
harder as the volume of data grows. To deal with this, we use data
reduction techniques. Data reduction aims to increase storage efficiency and reduce data storage and
analysis costs.

The various steps of data reduction are:


1. Data Cube Aggregation:
Aggregation operations are applied to the data to construct the data cube.
2. Attribute Subset Selection:
It is very important to be specific in the selection of attributes. Otherwise, it might lead to
high-dimensional data, on which models are difficult to train due to underfitting/overfitting problems.
Only attributes that add value to model training should be considered, and the rest
can be discarded.
Underfitting means that the model is too simple to capture the pattern in the data, so it
predicts poorly even on the training data.
Overfitting means that the model fits the training data, including its noise, so closely that it
predicts poorly on new data.
3. Numerosity Reduction:
This makes it possible to store a model of the data instead of the whole data, for example:
regression models.

4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
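To make numerosity reduction (step 3 above) concrete, the sketch below fits a straight line to a small dataset by least squares and keeps only the two fitted parameters in place of the original points. The data values are invented for illustration, and simple linear regression is just one possible model choice.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    # Reconstruct an approximate y value from the stored model.
    slope, intercept = model
    return slope * x + intercept

# Hypothetical measurements that follow a roughly linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(xs, ys)   # two numbers stored instead of five (x, y) pairs
print(model)               # roughly (1.99, 0.05)
print(predict(model, 3))   # close to, but not exactly, the original 6.2
```

The reconstruction is approximate, so this is a lossy reduction in the sense described under dimensionality reduction: the original values cannot be recovered exactly from the model.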
Issues In Data Mining
Issues that need to be addressed by any serious data mining package are:
i. Uncertainty Handling
ii. Dealing with Missing Values
iii. Dealing with Noisy data
iv. The efficiency of algorithms
v. Limiting discovered knowledge to only what is useful
vi. Incorporating Domain Knowledge
vii. Size and Complexity of Data
viii. Data Selection
ix. Understandability of Discovered Knowledge
x. Consistency between Data and Discovered Knowledge
Types of Data used in Data Mining

1. Relational Database:
A relational database is a collection of multiple data sets formally organized
into tables, records, and columns, from which data can be accessed in various
ways without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.
2. Data warehouses:
A data warehouse is a technology that collects data from various
sources within the organization to provide meaningful business insights.
The huge amount of data comes from multiple places such as Marketing and
Finance. The extracted data is used for analytical purposes and helps in
decision-making for a business organization. The data warehouse is designed
for the analysis of data rather than transaction processing.
3. Data Repositories:
A data repository generally refers to a destination for data storage.
However, many IT professionals use the term more specifically to refer to a particular
kind of setup within an IT structure, for example, a group of databases where an
organization keeps various kinds of information.
4. Object-Relational Database:
A combination of an object-oriented database model and relational database
model is called an object-relational model. It supports Classes, Objects,
Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the
gap between the Relational database and the object-oriented model practices
frequently utilized in many programming languages, for example, C++, Java, C#,
and so on.
5. Transactional Database:
A transactional database refers to a database management system (DBMS) that
has the ability to undo (roll back) a database transaction if it is not
completed properly. Although this was a unique capability long
ago, today most relational database systems support transactional
database activities.
Advantages of Data Mining
1. The data mining technique enables organizations to obtain knowledge-based
data.
2. Data mining enables organizations to make meaningful modifications in
operation and production.
3. Compared with other statistical data applications, data mining is
cost-efficient.
4. Data mining helps the decision-making process of an organization.
5. It facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
6. It can be introduced into new systems as well as existing platforms.
7. It is a quick process that makes it easy for new users to analyze enormous
amounts of data in a short time.
Disadvantages of data mining
1. There is a possibility that organizations may sell useful customer
data to other organizations for money. According to one report, American
Express has sold its customers' credit card purchase data to other
organizations.
2. Many data mining analytics tools are difficult to operate and need
advanced training to work with.
3. Different data mining instruments operate in distinct ways due to the
different algorithms used in their design. Therefore, the selection of the
right data mining tool is a very challenging task.
4. Data mining techniques are not perfectly precise, which may lead to severe
consequences in certain conditions.
Data Mining Applications
Data mining is used mainly by organizations with intense consumer demands, such as
i. retail,
ii. communication,
iii. financial, and
iv. marketing companies,
in order to determine
v. prices,
vi. consumer preferences,
vii. product positioning and its impact on sales, and
viii. customer satisfaction and corporate profits.
Data mining enables a retailer to use point-of-sale records of customer
purchases to develop products and promotions that help the organization to
attract the customer.
1. Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and
analytics to gain better insights and to identify best practices that will enhance health care
services and reduce costs. Analysts use data mining approaches such as machine learning,
multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can
be used to forecast the number of patients in each category. These procedures help ensure that patients get
intensive care at the right place and at the right time. Data mining also enables healthcare
insurers to recognize fraud and abuse.

2. Data Mining in Market Basket Analysis:


Market basket analysis is a modeling method based on the hypothesis that if you buy a specific
group of products, then you are more likely to buy another group of products. This technique
may enable the retailer to understand the purchase behavior of a buyer. This information may
assist the retailer in understanding the requirements of the buyer and altering the store's
layout accordingly.
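
The hypothesis behind market basket analysis is usually quantified with two measures: support (how often a group of items appears together) and confidence (how often the second group appears given the first). The sketch below computes both on a few invented baskets; the item names and function names are assumptions for illustration.

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in itemset.
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction
    # that also contain the consequent.
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Hypothetical point-of-sale baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread"}))               # 0.75
print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # about 0.67
```

A retailer might read the last figure as "about two thirds of the customers who buy bread also buy milk" and place the two products accordingly.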
3. Data mining in Education:
Educational data mining (EDM) is a newly emerging field concerned with developing
techniques that extract knowledge from the data generated in educational
environments. The objectives of EDM include predicting students' future learning
behavior, studying the impact of educational support, and advancing learning science.
An institution can use data mining to make precise decisions and also to predict
students' results. With these results, the institution can concentrate on what to
teach and how to teach.
4. Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools
can be beneficial to find patterns in a complex manufacturing process.
Data mining can be used in system-level designing to obtain the relationships
between product architecture, product portfolio, and data needs of the customers.
It can also be used to forecast the product development period, cost, and
expectations among the other tasks.
5. Data Mining in CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about obtaining and retaining
customers, enhancing customer loyalty, and implementing customer-oriented
strategies.
To build a good relationship with the customer, a business organization needs to collect
data and analyze it. With data mining technologies, the collected data can be used
for analytics.
6. Data Mining in Fraud detection:
Billions of dollars are lost to fraud. Traditional methods of fraud
detection are time-consuming and complex.
Data mining finds meaningful patterns and turns data into information. An
ideal fraud detection system should protect the data of all the users. Supervised methods
start from a collection of sample records, each classified as fraudulent or
non-fraudulent. A model is constructed using this data, and it is then used to
identify whether a new record is fraudulent or not.
7. Data Mining in Lie Detection:
Apprehending a criminal can be easier than getting the truth out of them, which is a very
challenging task. Law enforcement may use data mining techniques to investigate
offenses, monitor suspected terrorist communications, etc. This includes text
mining, which seeks meaningful patterns in data that is usually unstructured
text. The information collected from previous investigations is compared, and a
model for lie detection is constructed.
8. Data Mining in Financial Banking:
The digitalization of the banking system generates an enormous amount
of data with every new transaction. Data mining techniques can help bankers
solve business-related problems in banking and finance by identifying trends,
causalities, and correlations in business information and market costs that are not
instantly evident to managers or executives, because the data volume is too large or the data is
produced too rapidly for experts to screen. Managers may use these findings for
better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.
THE END
