Topic 3 - Data Mining

Uploaded by

Arif Syazmi

DATA MINING

ITC2263 INTRODUCTION TO DATA ANALYTICS

TOPIC 3
What is data mining?

► Data mining is also called knowledge discovery in databases (KDD).
► Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.
► Patterns must be valid, novel, potentially useful, and understandable (Dr Dhaval Patel).
Knowledge Discovery & Data Mining (KDD)

► The main objective of the KDD process is to extract information from data in the context of large databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.
► Knowledge Discovery in Databases is considered an automated, exploratory analysis and modeling of vast data repositories.
► KDD is the organized procedure of recognizing valid, useful, and understandable patterns in huge and complex data sets.
KDD vs Data Mining

► KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories that help humans extract useful and previously unknown information (i.e., knowledge) from large collections of digitized data.
► KDD consists of several steps, and Data Mining is one of them.
► Data Mining is the application of a specific algorithm to extract patterns from data. Nonetheless, the terms KDD and Data Mining are often used interchangeably.
► KDD is a computer science field specializing in extracting previously unknown and interesting information from raw data. KDD is the whole process of trying to make sense of data by developing appropriate methods or techniques.
► For example, it is currently used in applications such as social network analysis, fraud detection, science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval, and marketing.
► KDD is typically used to answer questions such as: which products are likely to yield high profits next year at V-Mart?
► Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.
► Data Mining is only a step within the overall KDD process. There are two major Data Mining goals, defined by the goal of the application: verification and discovery. Verification tests the user's hypothesis about the data, while discovery automatically finds interesting patterns.
► There are four major data mining tasks: clustering, classification, regression, and association (summarization). Clustering identifies groups of similar items in unstructured data. Classification learns rules that can be applied to new data. Regression finds functions that model the data with minimal error. Association looks for relationships between variables.
► Then, the specific data mining algorithm needs to be selected. Different algorithms like
linear regression, logistic regression, decision trees, and Naive Bayes can be selected
depending on the goal. Then patterns of interest in one or more symbolic forms are
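As a rough illustration of the regression task described above (finding a function that models the data with minimal error), the following sketch fits a straight line by ordinary least squares. The data points and the helper name `fit_line` are made up for illustration and do not come from the slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: advertising spend (x) vs. sales (y), exactly y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With real, noisy data the fitted line would not pass through every point; least squares picks the line with the smallest total squared error.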
Why do we need Data Mining?

► The volume of information we have to handle increases every day: business transactions, scientific data, sensor data, pictures, videos, etc.
► So, we need a system that is capable of extracting the essence of the available information and that can automatically generate reports, views, or summaries of data for better decision-making.
Why is Data Mining used in business?

► Data mining is used in business to make better managerial decisions by:
► Automatically summarizing data.
► Discovering patterns in raw data.
► Extracting the essence of the information stored.
Cont’d
► Data Mining is the core of the KDD procedure. It involves applying algorithms that investigate the data, develop a model, and find previously unknown patterns. The model is used to extract knowledge from the data, analyze the data, and make predictions.
► The availability and abundance of data today make knowledge discovery and Data Mining a matter of considerable significance and need.
► Given the recent development of the field, it is not surprising that a wide variety of techniques is now available to specialists and experts.
KNOWLEDGE DISCOVERY IN DATA: A PROCESS

Figure 1: KDD (Sources: Dr Dhaval Patel)

Figure 2: KDD Process (Detail)
CONT’D
► The knowledge discovery process (Figure 2) is iterative and interactive, comprising nine steps.
► The process is iterative at each stage, meaning that moving back to previous steps might be required.
► The process has many creative aspects, in the sense that one cannot present a single formula or a complete scientific classification of the correct decisions for each step and application type.
► Thus, it is necessary to understand the process and the different requirements and possibilities at each stage.
Data Selection in Data Mining

► Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data.
► Data selection precedes the actual practice of data collection. This definition distinguishes data selection from selective data reporting (selectively excluding data that is not supportive of a research hypothesis) and from interactive/active data selection (using collected data for monitoring activities/events, or conducting secondary data analyses).
► The process of selecting suitable data for a research project can impact data integrity.
Why data selection

► The primary objective of data selection is the determination of appropriate data type,
source, and instrument(s) that allow investigators to adequately answer research
questions.
► This determination is often discipline-specific and is primarily driven by the nature of the
investigation, existing literature, and accessibility to necessary data sources.
► Integrity issues can arise when the decisions to select ‘appropriate’ data to collect are
based primarily on cost and convenience considerations rather than the ability of data to
adequately answer research questions.
► Certainly, cost and convenience are valid factors in the decision-making process. However, researchers should assess to what degree these factors might compromise the integrity of the research endeavor.
Types and Sources of Data

► Data types and sources can be represented in a variety of ways. The two primary data types are:
► Quantitative: data represented as numerical figures (interval and ratio level measurements).
► Qualitative: text, images, audio/video, etc.
► Questions to ask when selecting data types and sources:
► What is the research question?
► What is the scope of the investigation? (This defines the parameters of any study. Selected data should not extend beyond the scope of the study.)
► What has the literature (previous research) determined to be the most appropriate data to collect?
► What type of data should be considered: quantitative, qualitative, or a composite of both?
Pre-processing and Cleaning

► Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
► Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.
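The cleaning steps named above (fixing formatting, dropping incomplete rows, removing duplicates) can be sketched in a few lines. The record layout, the choice of `email` as the duplicate key, and the function name `clean_records` are assumptions made for this example only.

```python
def clean_records(records):
    """Normalize formatting, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        # Incorrectly formatted data: trim whitespace and normalize case
        name = (row.get("name") or "").strip().title()
        email = (row.get("email") or "").strip().lower()
        # Incomplete data: skip rows with a missing name or email
        if not name or not email:
            continue
        # Duplicate data: keep only the first row per email address
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

# Hypothetical raw input combined from two sources
raw = [
    {"name": " alice tan ", "email": "Alice@Example.com"},
    {"name": "Alice Tan", "email": "alice@example.com "},  # duplicate
    {"name": "Bob Lee", "email": None},                     # incomplete
    {"name": "chen wei", "email": "chen@example.com"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Whether an incomplete row should be dropped or repaired (for example by filling in a default) depends on the project; this sketch simply drops it.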
Steps Involved in Data Pre-processing

Figure 3: Data Preprocessing
Data Transformation and Reduction

► Raw data is difficult to trace or understand. That's why it needs to be preprocessed before
retrieving any information from it.
► Data transformation is a technique used to convert raw data into a suitable format that makes data mining more efficient and helps retrieve strategic information.
► Data transformation includes data cleaning techniques and a data reduction technique to
convert the data into the appropriate form.
Cont’d

► Data transformation changes the format, structure, or values of the data and converts them
into clean, usable data.
► Data may be transformed at two stages of the data pipeline for data analytics projects: before loading or after loading. Organizations that use on-premises data warehouses generally use an ETL (extract, transform, and load) process, in which data transformation is the middle step.
► Today, most organizations use cloud-based data warehouses, which can scale compute and storage resources with latency measured in seconds or minutes.
► The scalability of the cloud platform lets organizations skip preload transformations, load raw data into the data warehouse, and then transform it at query time (an approach often called ELT: extract, load, transform).
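To make the "load raw data first, transform at query time" idea concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a data warehouse. The table name `raw_sales` and its contents are invented for illustration; a real cloud warehouse would use its own SQL dialect and loading tools.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: store the raw values as-is (amounts arrive as text, regions are untidy)
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [(" north", "100.50"), ("North ", "20"), ("south", "7.25")],
)

# Transform at query time: trim and lowercase the region, cast the amount
rows = conn.execute(
    """
    SELECT lower(trim(region)) AS region,
           SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY lower(trim(region))
    ORDER BY region
    """
).fetchall()
conn.close()

print(rows)
```

In an ETL pipeline the same trimming and casting would happen before the `INSERT`, so the warehouse would only ever hold cleaned values.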
Cont’d

► Data integration, migration, data warehousing, data wrangling may all involve data transformation.
► Data transformation increases the efficiency of business and analytic processes, and it enables businesses to
make better data-driven decisions.
► During the data transformation process, an analyst determines the structure of the data.
► Depending on that structure, data transformation may be:
► Constructive: The data transformation process adds, copies, or replicates data.
► Destructive: The system deletes fields or records.
► Aesthetic: The transformation standardizes the data to meet requirements or parameters.
► Structural: The database is reorganized by renaming, moving, or combining columns.
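The four kinds of transformation listed above can be shown on a tiny set of records. This is only a sketch: the field names (`first`, `last`, `price`, `qty`, `tmp`) and the function name `transform` are hypothetical.

```python
import copy

records = [
    {"first": "Aina", "last": "Rahman", "price": "12.5", "qty": 2, "tmp": "x"},
    {"first": "Ben", "last": "Ong", "price": "4", "qty": 5, "tmp": "y"},
]

def transform(rows):
    """Apply one example of each transformation type to a copy of the rows."""
    out = []
    for row in copy.deepcopy(rows):
        # Aesthetic: standardize the price to a float
        row["price"] = float(row["price"])
        # Constructive: add a derived field
        row["total"] = row["price"] * row["qty"]
        # Destructive: delete a field that is no longer needed
        del row["tmp"]
        # Structural: combine two columns into one
        row["name"] = f'{row.pop("first")} {row.pop("last")}'
        out.append(row)
    return out

result = transform(records)
print(result[0])
```

Working on a deep copy keeps the raw records untouched, which makes it easy to rerun or revise the transformation later.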
Data reduction

► Data reduction techniques aim to preserve the integrity of the data while reducing its volume.
► Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume.
► Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data. By reducing the data, the efficiency of the data mining process is improved while producing the same, or nearly the same, analytical results.
Cont’d

► Data reduction should not materially affect the result obtained from data mining. That means the result obtained from data mining before and after data reduction is the same or almost the same.
► Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction may be in terms of the number of rows (records) or in terms of the number of columns (dimensions).
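A minimal sketch of both kinds of reduction, assuming a made-up dataset of 1,000 records: rows are reduced by keeping every 7th record (systematic sampling) and columns are reduced by dropping a field that is not needed for the analysis. The comparison of means at the end illustrates the "same or almost the same" result.

```python
from statistics import mean

# Hypothetical dataset: 1,000 rows with three columns
rows = [{"id": i, "value": i % 10, "note": "unused"} for i in range(1000)]

# Row reduction: systematic sample that keeps every 7th record
# Column reduction: keep only the fields the analysis needs
reduced = [{"id": r["id"], "value": r["value"]} for r in rows[::7]]

full_mean = mean(r["value"] for r in rows)
reduced_mean = mean(r["value"] for r in reduced)

print(len(rows), len(reduced))
print(full_mean, round(reduced_mean, 3))
```

A systematic sample can be biased when the step size lines up with a repeating pattern in the data (here, a step of 10 would always pick the same `value`), so random sampling is often the safer choice.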
Data Mining Process

► Many different sectors are taking advantage of data mining to boost their business
efficiency, including manufacturing, chemical, marketing, aerospace, etc.
► Therefore, the need for a standard data mining process grew considerably.
► Data mining techniques must be reliable and repeatable by people in the company who have little or no data mining background.
► As a result, a cross-industry standard process for data mining (CRISP-DM) was conceived in 1996 and published as version 1.0 in 2000, after many workshops and contributions from more than 300 organizations.
Cont’d

► Data mining is described as a process of finding hidden, valuable information by evaluating the huge quantity of data stored in data warehouses, using multiple techniques such as Artificial Intelligence (AI), machine learning, and statistics.
► Figure 4 describes the process of data mining.
Cont’d

Figure 4: Data Mining Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM)

► The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases designed as a cyclical method, as shown in Figure 5:

Figure 5: Standard Process
Cont’d

► 1. Business understanding:
► It focuses on understanding the project goals and requirements from a business point of view, then converting this information into a data mining problem definition and a preliminary plan designed to accomplish the target.
► 2. Data Understanding:
► Data understanding starts with an initial data collection and proceeds with operations to get familiar with the data, identify data quality issues, gain first insights into the data, or detect interesting subsets to form hypotheses about hidden information.
Cont’d

► 3. Data Preparation:
► It is typically the most time-consuming phase, by some estimates taking up to 90 percent of the project time.
► It covers all operations to build the final data set from the original raw information.
► Data preparation is likely to be done several times and not in any prescribed order.
► 4. Modeling:
► In modeling, various modeling methods are selected and applied, and their parameters are calibrated to optimal values. Some methods have particular requirements on the form of the data. Therefore, stepping back to the data preparation phase is often necessary.
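Calibrating a parameter to an optimal value can be as simple as trying several candidates and keeping the one that scores best on held-out data (a small grid search). The sketch below tunes a single decision threshold; the validation scores, labels, and candidate values are invented for illustration.

```python
def accuracy(threshold, data):
    """Fraction of (score, label) pairs that the rule 'score >= threshold' gets right."""
    correct = sum((score >= threshold) == label for score, label in data)
    return correct / len(data)

# Hypothetical validation data: (model score, true label)
validation = [
    (0.1, False), (0.2, False), (0.4, False),
    (0.6, True), (0.7, True), (0.9, True),
]

# Candidate parameter values to try
candidates = [0.15, 0.5, 0.8]

# Keep the threshold with the highest validation accuracy
best = max(candidates, key=lambda t: accuracy(t, validation))
print(best, accuracy(best, validation))
```

Real modeling tools tune many parameters at once, but the principle is the same: evaluate each setting on data that was not used to build the model.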
Cont’d

► 5. Evaluation:
► It evaluates the model thoroughly and reviews the steps executed to build it, to ensure that the business objectives are properly achieved.
► A key objective of the evaluation is to determine whether some significant business issue has not been adequately considered.
► At the end of this phase, a decision on the use of the data mining results should be reached.
Cont’d

► 6. Deployment:
► Deployment determines how the outcomes need to be utilized.
► Ways to deploy data mining results include scoring a database, using the results as company guidelines, and interactive internet scoring.
► The information acquired needs to be organized and presented in a way that the client can use. Depending on the requirements, the deployment phase may be as simple as generating a report or as complicated as applying a repeatable data mining method across the organization.
Data Visualization / Evaluation

► Data visualization is the graphical representation of quantitative information and data using visual elements such as graphs, charts, and maps.
► Data visualization converts large and small data sets into visuals that are easy for humans to understand and process.
► Data visualization tools provide accessible ways to see outliers, patterns, and trends in the data.
► In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.
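Even a very simple chart shows how a visual makes differences in the data easier to see than a table of numbers. The sketch below draws a text-only horizontal bar chart; the monthly sales figures and the function name `bar_chart` are made up, and a real project would normally use a plotting library instead.

```python
def bar_chart(counts, width=20):
    """Render a horizontal text bar chart, scaled so the largest value fills `width`."""
    largest = max(counts.values())
    lines = []
    for label, value in counts.items():
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<6} {bar} {value}")
    return "\n".join(lines)

# Hypothetical monthly sales figures
sales = {"Jan": 40, "Feb": 80, "Mar": 20}
print(bar_chart(sales))
```

The bar lengths make it immediately clear that February sold twice as much as January and four times as much as March.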
What makes Data Visualization Effective?

► Effective data visualizations are created where communication, data science, and design meet. Done right, data visualizations turn key insights from complicated data sets into something meaningful and natural.
► To craft an effective data visualization, you need to start with clean data that is
well-sourced and complete. After the data is ready to visualize, you need to pick
the right chart.
► After you have decided the chart type, you need to design and customize your
visualization to your liking. Simplicity is essential - you don't want to add any
elements that distract from the data.
Importance of Data Visualization

► Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large, complex data sets is easier than studying spreadsheets and reports.
► Data visualization is an easy and quick way to convey concepts universally.
► You can experiment with different scenarios by making slight adjustments.
Data visualization has further uses, such as:

► Data visualization can identify areas that need improvement or modification.
► Data visualization can clarify which factors influence customer behavior.
► Data visualization helps you understand which products to place where.
► Data visualization can help predict sales volumes.
Why Use Data Visualization?

► To make data easier to understand and remember.
► To discover unknown facts, outliers, and trends.
► To visualize relationships and patterns quickly.
► To ask better questions and make better decisions.
► To perform competitive analysis.
► To improve insights.
The end
