Bi Lesson 6

The document discusses introduction to business intelligence and data mining. It covers topics like metadata, its categories and roles, data preprocessing techniques like cleaning, transformation, and reduction in data mining. Challenges of metadata management and different types of data in data mining are also discussed.

Uploaded by

calebgaichuhie254


INTRODUCTION TO BUSINESS INTELLIGENCE
SIT 303

LESSON 6
Objectives
a) Categories of metadata
b) Roles of metadata
c) Challenges of metadata
d) Define Data mining
e) KDD process
f) Data Preprocessing in Data mining
g) Issues in Data mining
h) Types of data in data mining
i) Advantages and Disadvantages of Data Mining
j) Applications of Data Mining
Metadata
Metadata is data about data. That is, data that is used to describe other data is known
as metadata.

For example, the index of a book serves as metadata for the contents of the book.

In other words, we can say that metadata is the summarized data that leads us to detailed data.

In terms of a data warehouse, we can define metadata as follows.

i. Metadata is the road-map to a data warehouse.

ii. Metadata in a data warehouse defines the warehouse objects.

iii. Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Categories of Metadata
Metadata can be broadly grouped into three categories −

• Business Metadata − It has the data ownership information, business definition, and
changing policies.

• Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.

• Operational Metadata − It includes currency of data, data lineage, and data quality.
Currency of data means whether the data is active, archived, or purged. Lineage of
data means the history of the data: where it was migrated from and which transformations were applied to it.
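
As a rough illustration of the three categories, the sketch below describes a single warehouse table. The table name sales_fact, the TableMetadata class, and every value in it are invented for this example; they do not come from any particular warehouse product.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Metadata for one warehouse table, grouped by category (illustrative only)."""
    business: dict = field(default_factory=dict)
    technical: dict = field(default_factory=dict)
    operational: dict = field(default_factory=dict)


# Hypothetical table and values, made up for this example.
sales_fact = TableMetadata(
    business={"owner": "Sales department", "definition": "One row per item sold"},
    technical={
        "table": "sales_fact",
        "columns": {"sale_id": "INTEGER", "amount": "DECIMAL(10,2)"},
        "primary_key": "sale_id",
    },
    operational={
        "currency": "active",  # active, archived, or purged
        "lineage": ["extracted from point-of-sale system", "amounts converted to one currency"],
    },
)


def describe(meta: TableMetadata) -> str:
    """Build a one-line directory entry from the three metadata categories."""
    cols = ", ".join(meta.technical["columns"])
    return (f'{meta.technical["table"]} ({cols}) owned by '
            f'{meta.business["owner"]}, status: {meta.operational["currency"]}')


print(describe(sales_fact))
```

The describe function plays the "directory" role: a tool can locate and summarize a warehouse object from its metadata without reading the data itself.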
Role of Metadata
1. Metadata acts as a directory. This directory helps the decision support
system to locate the contents of the data warehouse.
2. Metadata helps the decision support system map data when
data is transformed from the operational environment to the data warehouse
environment.
3. Metadata also helps in summarization between lightly detailed data and
highly summarized data.
4. Metadata is used for query tools.
5. Metadata is used in extraction and cleansing tools.
6. Metadata is used in reporting tools.
7. Metadata is used in transformation tools.
8. Metadata plays an important role in loading functions.
Challenges for Metadata Management

1. Metadata in a big organization is scattered across the organization. This
metadata is spread across spreadsheets, databases, and applications.

2. Metadata could be present in text files or multimedia files. To use this
data for information management solutions, it has to be correctly defined.

3. There are no industry-wide accepted standards. Data management
solution vendors have a narrow focus.

4. There are no easy and accepted methods of passing metadata.

Data mining
Introduction to data mining
• Data mining is one of the most useful techniques that help entrepreneurs,
researchers, and individuals extract valuable information from huge sets of
data.
• The data sources can include databases, data warehouses, the Web, and other
information repositories.
• Data mining is often used as another name for Knowledge Discovery in Databases (KDD),
although strictly it is one step of the KDD process.
• The knowledge discovery process includes
i. Data cleaning,
ii. Data integration,
iii. Data selection,
iv. Data transformation,
v. Data mining,
vi. Pattern evaluation, and
vii. Knowledge presentation.
KDD process Diagram
Iterative steps of KDD
1. Data cleaning: It involves removing inconsistent data and noise.
2. Data integration: It involves combining data from various data sources
into a single data source.
3. Data selection: The data relevant to the analysis task is selected
from the integrated data.
4. Data transformation: Data is transformed into a form appropriate for mining,
using operations such as summarization or aggregation.
5. Data mining: It is the essential step in which intelligent methods are applied
to extract data patterns.
6. Pattern evaluation: The identified patterns are evaluated, and interesting
patterns are kept as knowledge.
7. Knowledge presentation: The mined knowledge is presented to the user through
visualizations and other graphical representation methods.
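
The seven steps above can be sketched as a chain of small functions. This is only a toy under stated assumptions: the two source lists, the function names, and the "best seller" pattern are all hypothetical, and the mining step is a trivial stand-in for a real algorithm.

```python
# Two hypothetical data sources; None marks a missing value.
store_a = [{"item": "milk", "qty": 2}, {"item": "bread", "qty": None}]
store_b = [{"item": "milk", "qty": 3}, {"item": "tea", "qty": 1}]


def clean(rows):
    """Data cleaning: drop records with a missing quantity."""
    return [r for r in rows if r["qty"] is not None]


def integrate(*sources):
    """Data integration: combine all sources into one list."""
    return [r for src in sources for r in src]


def select(rows, items):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in rows if r["item"] in items]


def transform(rows):
    """Data transformation: aggregate the quantity per item."""
    totals = {}
    for r in rows:
        totals[r["item"]] = totals.get(r["item"], 0) + r["qty"]
    return totals


def mine(totals):
    """Data mining (trivial stand-in): find the best-selling item."""
    return max(totals, key=totals.get)


def evaluate(pattern, totals, min_qty=3):
    """Pattern evaluation: keep the pattern only if it is interesting enough."""
    return pattern if totals[pattern] >= min_qty else None


rows = select(integrate(clean(store_a), clean(store_b)), {"milk", "tea"})
totals = transform(rows)
best = evaluate(mine(totals), totals)
# Knowledge presentation: report the result to the user.
print(f"Best seller: {best} ({totals[best]} units)")
```

In practice the steps are iterative: a poor result at the evaluation step usually sends the analyst back to cleaning, selection, or transformation.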
Data Preprocessing in Data Mining
Data preprocessing is a data mining technique used to transform
raw data into a useful and efficient format.
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done
to handle them. It involves handling missing data, noisy data, etc.

• (a). Missing Data:

This situation arises when some values are missing from the data. It can be handled in
various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset is quite large and
multiple values are missing within a tuple.

• Fill the missing values:

There are various ways to do this. You can fill the missing
values manually, with the attribute mean, or with the most probable value.
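
A minimal sketch of both approaches follows. The records, the max_missing threshold, and the function names are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical tuples; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 35, "income": None},
    {"age": None, "income": None},
]


def ignore_tuples(rows, max_missing=1):
    """Drop tuples that have more than max_missing missing values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]


def fill_with_mean(rows, attribute):
    """Replace missing values of one attribute with the mean of its known values."""
    known = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(known)
    return [{**r, attribute: fill if r[attribute] is None else r[attribute]} for r in rows]


kept = ignore_tuples(rows)  # the last tuple has two missing values and is dropped
filled = fill_with_mean(fill_with_mean(kept, "age"), "income")
print(filled)
```

Filling with the mean keeps the tuple but pulls the filled value toward the centre of the distribution, so it is best suited to attributes without strong skew.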
• (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be caused
by faulty data collection, data entry errors, etc. It can be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The data is divided into
segments (bins) of equal size, and each bin is then smoothed separately using one of the
approaches below.
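• There are 3 approaches to smoothing:
i) Smoothing by bin mean
ii) Smoothing by bin median
iii) Smoothing by bin boundary

For example, take the data 21, 25, 34, 4, 15, 8, 21, 24, 28.
Sort first: 4, 8, 15, 21, 21, 24, 25, 28, 34
Select a bin size, e.g. 3 (equal-frequency bins).

i. Smoothing by bin mean (each value is replaced by the mean of its bin)
Bin 1: 4 8 15      (4+8+15)/3 = 9       -> 9 9 9
Bin 2: 21 21 24    (21+21+24)/3 = 22    -> 22 22 22
Bin 3: 25 28 34    (25+28+34)/3 = 29    -> 29 29 29

ii. Smoothing by bin median (each value is replaced by the median of its bin)
Bin 1: 4 8 15      -> 8 8 8
Bin 2: 21 21 24    -> 21 21 21
Bin 3: 25 28 34    -> 28 28 28

iii. Smoothing by bin boundary (each value is replaced by the closer of the bin's minimum and maximum)
Bin 1: 4 8 15      -> 4 4 15
Bin 2: 21 21 24    -> 21 21 24
Bin 3: 25 28 34    -> 25 25 34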
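
The three smoothing approaches can be sketched in a few lines, using the same nine values and a bin size of 3. The function names are illustrative, and rounding the bin mean to a whole number is an assumption made to keep the output tidy.

```python
from statistics import mean, median


def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_mean(bins):
    """Replace every value with the (rounded) mean of its bin."""
    return [[round(mean(b))] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundary(bins):
    """Replace every value with the closer of the bin's min and max (ties go to the min)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


data = [21, 25, 34, 4, 15, 8, 21, 24, 28]
bins = equal_frequency_bins(data, 3)
print(smooth_by_mean(bins))      # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_median(bins))    # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundary(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```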
• Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).
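
For the linear case, a minimal sketch is an ordinary least squares line fit, after which each noisy value is replaced by the value on the fitted line. The data points below are made up and the function name fit_line is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]  # made-up values, roughly y = 2x plus noise
slope, intercept = fit_line(xs, ys)   # slope is about 1.97, intercept about 0.09
smoothed = [slope * x + intercept for x in xs]
print(smoothed)
```

The smoothed values lie exactly on the fitted line, so the random scatter in ys is removed at the cost of any real deviation from linearity.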

• Clustering:
This approach groups similar data into clusters. Outliers either go undetected or
fall outside the clusters.
• Clustering divides the population or data points into a number of groups such that data points in
the same group are more similar to each other than to data points in other
groups.
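
As a deliberately simple stand-in for a real clustering algorithm, the sketch below groups one-dimensional values by the gaps between them and treats single-member clusters as outliers. The max_gap threshold and the planted value 95 are assumptions for the example.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; start a new cluster when the gap to the previous value exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


values = [4, 8, 15, 21, 21, 24, 25, 28, 34, 95]  # 95 is a planted outlier
clusters = gap_clusters(values, max_gap=10)
outliers = [c[0] for c in clusters if len(c) == 1]
print(clusters)
print(outliers)  # [95]
```

Real data is usually multi-dimensional, where an algorithm such as k-means or a density-based method would take the place of the gap rule.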
2. Data Transformation:
This step transforms the data into forms suitable
for the mining process. It involves the following methods:
i. Normalization:
It scales the data values into a specified range (-1.0 to
1.0 or 0.0 to 1.0). The numerical attributes are scaled up or down to fit
within that range.
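
One common way to do this is min-max normalization, which maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min. The sketch below assumes this method; the source does not say which normalization is meant, and the income values are invented.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to scale.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [20000, 35000, 50000]
print(min_max(incomes))             # [0.0, 0.5, 1.0]
print(min_max(incomes, -1.0, 1.0))  # [-1.0, 0.0, 1.0]
```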
ii. Attribute Construction (Attribute Selection):
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
For example, a date of birth attribute can be
transformed into another property such as is_senior_citizen for each tuple, which
may be more directly useful for predicting diseases or chances of survival.
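
The is_senior_citizen example can be sketched as below. The threshold of 60 years is an assumption (the source does not define one), and the patient records and the reference date are made up.

```python
from datetime import date

SENIOR_AGE = 60  # assumed threshold; not defined in the source


def age_on(born: date, today: date) -> int:
    """Whole years elapsed between the date of birth and the reference date."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def add_is_senior(rows, today):
    """Construct the new attribute is_senior_citizen from date_of_birth."""
    return [
        {**r, "is_senior_citizen": age_on(r["date_of_birth"], today) >= SENIOR_AGE}
        for r in rows
    ]


patients = [
    {"name": "A", "date_of_birth": date(1950, 6, 1)},
    {"name": "B", "date_of_birth": date(1990, 6, 1)},
]
result = add_is_senior(patients, today=date(2024, 1, 1))
print([(r["name"], r["is_senior_citizen"]) for r in result])  # [('A', True), ('B', False)]
```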
iii. Discretization:
It is the process of transforming continuous data into a set of small intervals. This can help
because grouped values are often easier to relate to the target variable, and easier to
interpret, than raw continuous values.
For example, a numeric attribute can be replaced by interval labels (1-10, 11-20, ...), or
age can be replaced by conceptual labels (young, middle age, senior).
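
A small sketch of the age example follows. The cut points 30 and 60 are assumptions chosen for illustration; the source gives only the labels.

```python
from bisect import bisect_right

CUTS = [30, 60]                            # assumed cut points between the groups
LABELS = ["young", "middle age", "senior"]


def discretize(age):
    """Map a continuous age to the label of the interval it falls in."""
    return LABELS[bisect_right(CUTS, age)]


ages = [22, 30, 45, 71]
print([discretize(a) for a in ages])  # ['young', 'middle age', 'middle age', 'senior']
```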

iv. Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute "city" can be converted to "country".
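
The city-to-country example can be sketched as a lookup followed by a roll-up. The mapping and the sales figures are invented for the illustration.

```python
from collections import Counter

# Hypothetical concept hierarchy: city (lower level) -> country (higher level).
CITY_TO_COUNTRY = {"Nairobi": "Kenya", "Mombasa": "Kenya", "Kampala": "Uganda"}

sales = [("Nairobi", 120), ("Mombasa", 80), ("Kampala", 50)]  # made-up figures

by_country = Counter()
for city, amount in sales:
    # Replace the city with its country and accumulate at the higher level.
    by_country[CITY_TO_COUNTRY[city]] += amount

print(dict(by_country))  # {'Kenya': 200, 'Uganda': 50}
```

Moving up the hierarchy loses detail (the individual cities) but yields fewer, more general values that are often easier to mine.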
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the
volume grows. Data reduction techniques address this problem. They aim to increase
storage efficiency and reduce data storage and
analysis costs.

The main data reduction strategies are:

1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Attributes should be selected carefully. Keeping every attribute leads to
high-dimensional data, on which models are difficult to train and prone to overfitting or underfitting.
Only attributes that add value to model training should be kept; the rest
can be discarded.
Underfitting means the model is too simple to capture the patterns in the data, so it predicts
poorly even on the training data. Overfitting means the model fits the training data too closely,
noise included, so it predicts well on the training data but poorly on new data.
3. Numerosity Reduction:
This stores a model of the data instead of the whole data, for example
regression models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
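
The lossless versus lossy distinction can be shown with two simple stand-ins: zlib compression (lossless) and rounding (lossy). Neither is a wavelet transform or PCA; they are used here only because they make the round-trip test easy to see. The sensor readings are made up.

```python
import zlib

readings = [20.13, 20.17, 20.11, 20.19] * 50  # made-up sensor values
raw = ",".join(f"{r:.2f}" for r in readings).encode()

# Lossless: decompressing returns exactly the original bytes.
packed = zlib.compress(raw)
lossless_ok = zlib.decompress(packed) == raw

# Lossy: rounding to one decimal place discards detail that cannot be recovered.
rounded = [round(r, 1) for r in readings]
lossy_ok = rounded == readings

print(len(raw), len(packed), lossless_ok, lossy_ok)
```

Here lossless_ok is True and lossy_ok is False: the rounded list is smaller in information content, but the original second decimal is gone for good.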
Issues In Data Mining
Issues that need to be addressed by any serious data mining package are:
i. Uncertainty Handling
ii. Dealing with Missing Values
iii. Dealing with Noisy data
iv. The efficiency of algorithms
v. Limiting Discovered Knowledge to Only What Is Useful
vi. Incorporating Domain Knowledge
vii. Size and Complexity of Data
viii. Data Selection
ix. Understandability of Discovered Knowledge
x. Consistency between Data and Discovered Knowledge
Types of Data used in Data Mining

1. Relational Database:
A relational database is a collection of multiple data sets formally organized
into tables, records, and columns, from which data can be accessed in various
ways without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.
2. Data warehouses:
A data warehouse is a technology that collects data from various
sources within the organization to provide meaningful business insights.
The huge amount of data comes from multiple places such as Marketing and
Finance. The extracted data is used for analytical purposes and helps in
decision-making for a business organization. The data warehouse is designed
for the analysis of data rather than transaction processing.
3. Data Repositories:
A data repository generally refers to a destination for data storage.
However, many IT professionals use the term more specifically to refer to a particular
kind of setup within an IT structure, for example, a group of databases in which an
organization keeps various kinds of information.
4. Object-Relational Database:
A combination of an object-oriented database model and relational database
model is called an object-relational model. It supports Classes, Objects,
Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the
gap between the Relational database and the object-oriented model practices
frequently utilized in many programming languages, for example, C++, Java, C#,
and so on.
5. Transactional Database:
A transactional database refers to a database management system (DBMS) that
can undo (roll back) a database transaction if it is not completed
properly. This was once a distinctive capability, but
today most relational database systems support transactional
database activities.
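
The rollback behaviour can be sketched with Python's built-in sqlite3 module and an in-memory database. The accounts table, the names, and the balances are hypothetical; the CHECK constraint is there only to force the second update to fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts; undo both updates if either one fails."""
    try:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The debit would make the balance negative, so undo the credit as well.
        conn.rollback()
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

After the failed transfer both balances are unchanged, even though the credit to the second account had already been applied inside the transaction.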
Advantages of Data Mining
1. Data mining enables organizations to obtain knowledge-based
data.
2. Data mining enables organizations to make meaningful modifications in
operation and production.
3. Compared with other statistical data applications, data mining is
cost-efficient.
4. Data mining helps the decision-making process of an organization.
5. It facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
6. It can be introduced into new systems as well as existing platforms.
7. It is a quick process that makes it easy for new users to analyze enormous
amounts of data in a short time.
Disadvantages of data mining
1. Organizations may sell useful customer data
to other organizations for money. According to one report, American
Express has sold data on its customers' credit card purchases to other
organizations.
2. Much data mining analytics software is difficult to operate and requires
advanced training to use.
3. Different data mining tools operate in distinct ways because of the
different algorithms used in their design. Therefore, selecting the
right data mining tool is a very challenging task.
4. Data mining techniques are not perfectly accurate, which may lead to severe
consequences in certain conditions.
Data Mining Applications
Data mining is used mainly by organizations with intense consumer demands, such as
i. Retail,
ii. Communication,
iii. Financial, and
iv. Marketing companies.
These organizations use it to determine
i. price,
ii. consumer preferences,
iii. product positioning,
iv. impact on sales,
v. customer satisfaction, and
vi. corporate profits.
Data mining enables a retailer to use point-of-sale records of customer
purchases to develop products and promotions that help the organization
attract customers.
1. Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and
analytics for better insights and to identify best practices that will enhance health care
services and reduce costs. Analysts use data mining approaches such as machine learning,
multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can
be used to forecast the number of patients in each category. These procedures help ensure that patients get
intensive care at the right place and at the right time. Data mining also enables healthcare
insurers to recognize fraud and abuse.

2. Data Mining in Market Basket Analysis:

Market basket analysis is a modeling method based on the hypothesis that if you buy a specific
group of products, you are more likely to buy another group of products. This technique
may enable the retailer to understand the purchase behavior of a buyer. This data may
assist the retailer in understanding the requirements of the buyer and altering the store's
layout accordingly.
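
The hypothesis is usually tested with two standard measures that the text does not name: support (the fraction of transactions containing an itemset) and confidence (how often the second group appears when the first one does). The sketch below computes both on made-up transactions.

```python
# Hypothetical point-of-sale transactions, one set of items per basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # about 0.67
```

A rule such as "bread implies milk" is normally kept only if both measures exceed thresholds chosen by the analyst.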
3. Data mining in Education:
Educational data mining (EDM) is a newly emerging field concerned with developing
techniques that extract knowledge from the data generated in educational
environments. EDM objectives include predicting students' future learning
behavior, studying the impact of educational support, and advancing learning science.
An organization can use data mining to make precise decisions and also to predict
students' results. With the results, the institution can concentrate on what to
teach and how to teach.
4. Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools
can be beneficial to find patterns in a complex manufacturing process.
Data mining can be used in system-level designing to obtain the relationships
between product architecture, product portfolio, and data needs of the customers.
It can also be used to forecast the product development period, cost, and
expectations among the other tasks.
5. Data Mining in CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about obtaining and retaining
customers, enhancing customer loyalty, and implementing customer-oriented
strategies.
To build a good relationship with its customers, a business organization needs to collect
and analyze data. With data mining technologies, the collected data can be used
for analytics.
6. Data Mining in Fraud detection:
Billions of dollars are lost to fraud. Traditional methods of fraud
detection are time-consuming and complex.
Data mining finds meaningful patterns and turns data into information. An
ideal fraud detection system should protect the data of all users. Supervised methods
use a collection of sample records that are classified as fraudulent or
non-fraudulent. A model is constructed from this data and is then used to
identify whether a new record is fraudulent or not.
7. Data Mining in Lie Detection:
Apprehending a suspect is often easier than getting the truth out of them, which is a very
challenging task. Law enforcement may use data mining techniques to investigate
offenses, monitor suspected terrorist communications, etc. These techniques include text
mining, which seeks meaningful patterns in data that is usually unstructured
text. The information collected from previous investigations is compared, and a
model for lie detection is constructed.
8. Data Mining in Financial Banking:
The digitalization of the banking system generates an enormous amount
of data with every new transaction. Data mining can help bankers
solve business-related problems in banking and finance by identifying trends,
causalities, and correlations in business information and market costs that are not
instantly evident to managers or executives, because the data volume is too large or the data is
produced too rapidly for experts to screen. Managers may use this data for
better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.
THE END
