
DATA MINING

ITC2263 INTRODUCTION TO DATA ANALYTICS


TOPIC 3
What is data mining

► Data mining is also called knowledge discovery and data mining (KDD).
► Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.
► Patterns must be: valid, novel, potentially useful, and understandable.
(Dr Dhaval Patel)
Knowledge Discovery & Data Mining (KDD)

► The main objective of the KDD process is to extract information from data in the context of large databases. It does this by using data mining algorithms to identify what is deemed knowledge.
► Knowledge Discovery in Databases is considered an automated, exploratory analysis and modeling of vast data repositories.
► KDD is the organized procedure of recognizing valid, useful, and understandable patterns from huge and complex data sets.
KDD vs Data Mining

► KDD (Knowledge Discovery in Databases) is a field of computer science, which includes the tools and theories to help humans in extracting useful and previously unknown information (i.e., knowledge) from large collections of digitized data.
► KDD consists of several steps, and Data Mining is one of them.
► Data Mining is the application of a specific algorithm to extract patterns from data.
Nonetheless, KDD and Data Mining are used interchangeably.
► KDD is a computer science field specializing in extracting previously unknown and
interesting information from raw data. KDD is the whole process of trying to make sense
of data by developing appropriate methods or techniques.
► For example, it is currently used for various applications such as social network analysis,
fraud detection, science, investment, manufacturing, telecommunications, data cleaning,
sports, information retrieval, and marketing.
► KDD is usually used to answer questions like: which main products might help to obtain a high profit next year at V-Mart?
► Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data
stored in databases.
► Data Mining is only a step within the overall KDD process. There are two major Data Mining goals, defined by the application's purpose: verification and discovery. Verification verifies the user's hypothesis about the data, while discovery automatically finds interesting patterns.
► There are four major data mining tasks: clustering, classification, regression, and association (summarization). Clustering identifies similar groups in unstructured data. Classification learns rules that can be applied to new data. Regression finds functions with minimal error to model data. Association looks for relationships between variables. (A small illustrative sketch of three of these tasks follows below.)
► Then, the specific data mining algorithm needs to be selected. Different algorithms, such as linear regression, logistic regression, decision trees, and Naive Bayes, can be selected depending on the goal. Patterns of interest are then searched for in one or more symbolic forms.
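As an illustration only (not part of the original slides), the short Python sketch below shows how three of these tasks, clustering, classification, and regression, might be run with the scikit-learn library; the synthetic data and model choices are assumptions made for demonstration.

# Illustrative sketch: synthetic data and assumed scikit-learn models.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                   # 100 records, 2 attributes
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)   # assumed binary label
y_reg = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Clustering: identify similar groups without using labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# Classification: learn rules that can be applied to new records.
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
print("predicted class for a new record:", clf.predict([[0.5, -0.2]]))

# Regression: fit a function with minimal error to model the data.
reg = LinearRegression().fit(X, y_reg)
print("regression coefficients:", reg.coef_)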
Why do we need Data Mining?

► The volume of information that we have to handle, from business transactions, scientific data, sensor data, pictures, videos, etc., is increasing every day.
► So, we need a system that will be capable of extracting the essence
of information available and that can automatically generate
reports, views, or summaries of data for better decision-making.
Why is Data Mining used in business?

►Data mining is used in business to make better managerial decisions by:
►Automatic summarization of data.
►Discovering patterns in raw data.
►Extracting the essence of information stored.
Cont’d
► Data Mining is the core of the KDD procedure, involving the algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is used to extract knowledge from the data, analyze the data, and make predictions.
► The availability and abundance of data today make knowledge discovery and Data Mining a matter of considerable significance and necessity.
► Given the recent development of the field, it is not surprising that a wide variety of techniques is now available to specialists and experts.
KNOWLEDGE DISCOVERY IN DATA:
A PROCESS

Figure 1: KDD (Source: Dr Dhaval Patel)


Figure 2: KDD Process (Detail)
CONT’D
► The knowledge discovery process (Figure 2) is iterative and interactive, and comprises nine steps.
► The process is iterative at each stage, implying that moving back to previous steps might be required.
► The process has many imaginative aspects, in the sense that one cannot present a single formula or a complete scientific categorization of the correct decisions for each step and application type.
► Thus, it is necessary to understand the process and the different requirements and possibilities at each stage.
Data Selection in Data Mining

► Data selection is defined as the process of determining the appropriate data type and source, and suitable instruments to collect data.
► Data selection precedes the actual practice of data collection. This definition distinguishes data selection from selective data reporting (selectively excluding data that is not supportive of a research hypothesis) and from interactive/active data selection (using collected data for monitoring activities/events, or conducting secondary data analyses). The process of selecting suitable data for a research project can impact data integrity.
Why data selection

► The primary objective of data selection is the determination of appropriate data type,
source, and instrument(s) that allow investigators to adequately answer research
questions.
► This determination is often discipline-specific and is primarily driven by the nature of the
investigation, existing literature, and accessibility to necessary data sources.
► Integrity issues can arise when the decisions to select ‘appropriate’ data to collect are
based primarily on cost and convenience considerations rather than the ability of data to
adequately answer research questions.
► Certainly, cost and convenience are valid factors in the decision-making process. However, researchers should assess to what degree these factors might compromise the integrity of the research endeavor.
Types and Sources of Data

► Data types and sources can be represented in a variety of ways. The two primary data types are:
► Quantitative: represented as numerical figures (interval and ratio level measurements).
► Qualitative: text, images, audio/video, etc.
► Questions to consider when selecting data types and sources are given below:
► What is the research question?
► What is the scope of the investigation? (This defines the parameters of any study; selected data should not extend beyond the scope of the study.)
► What has the literature (previous research) determined to be the most appropriate data to collect?
► What type of data should be considered: quantitative, qualitative, or a composite of both?
Pre-processing and Cleaning

► Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
► Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled (a small cleaning sketch follows below).
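A minimal pandas sketch of these cleaning steps is given below; the column names and example values are invented for illustration and are not from the slides.

# Illustrative cleaning sketch: assumed column names and values.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ali", "Ali", "Siti", None],
    "amount": ["120", "120", "abc", "80"],
})

df = df.drop_duplicates()                                    # remove duplicate records
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # fix incorrectly formatted values
df["amount"] = df["amount"].fillna(df["amount"].mean())      # fill missing/invalid amounts
df = df.dropna(subset=["customer"])                          # drop records with no identifier
print(df)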
Steps Involved in Data Pre-processing

Figure 3: Data Preprocessing
Data Transformation and Reduction

► Raw data is difficult to trace or understand. That's why it needs to be preprocessed before
retrieving any information from it.
► Data transformation is a technique used to convert the raw data into a suitable format that
efficiently eases data mining and retrieves strategic information.
► Data transformation includes data cleaning techniques and data reduction techniques to convert the data into the appropriate form.
Cont’d

► Data transformation changes the format, structure, or values of the data and converts them
into clean, usable data.
► Data may be transformed at two stages of the data pipeline for data analytics projects. Organizations that use on-premises data warehouses generally use an ETL (extract, transform, and load) process, in which data transformation is the middle step (a minimal ETL sketch follows this slide).
► Today, most organizations use cloud-based data warehouses to scale compute and storage
resources with latency measured in seconds or minutes.
► The scalability of the cloud platform lets organizations skip preload transformations and load
raw data into the data warehouse, then transform it at query time.
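As a hedged illustration of the ETL pattern mentioned above, the sketch below extracts rows from an assumed CSV file, transforms them, and loads them into a SQLite table standing in for a warehouse; the file name, column name, and table name are all assumptions.

# Illustrative ETL sketch: assumed file name, column name, and target table.
import sqlite3
import pandas as pd

# Extract: read raw data from an assumed source file.
raw = pd.read_csv("sales_raw.csv")

# Transform: clean and reshape before loading (the middle ETL step).
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_date"]).drop_duplicates()

# Load: write the transformed data into the warehouse (SQLite stands in here).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)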
Cont’d

► Data integration, migration, data warehousing, and data wrangling may all involve data transformation.
► Data transformation increases the efficiency of business and analytic processes, and it enables businesses to make better data-driven decisions.
► During the data transformation process, an analyst will determine the structure of the data.
► This could mean that data transformation may be (one example of each is sketched after this list):
► Constructive: The data transformation process adds, copies, or replicates data.
► Destructive: The system deletes fields or records.
► Aesthetic: The transformation standardizes the data to meet requirements or parameters.
► Structural: The database is reorganized by renaming, moving, or combining columns.
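The pandas sketch below gives one assumed example of each kind of transformation in the list above; the column names and the exchange rate are illustrative assumptions.

# Illustrative transformation sketch: assumed column names and exchange rate.
import pandas as pd

df = pd.DataFrame({"first": ["arif"], "last": ["syazmi"], "price_myr": [120.0]})

# Constructive: add a derived field (copies/derives data).
df["price_usd"] = df["price_myr"] * 0.21          # assumed exchange rate

# Destructive: delete a field that is no longer needed.
df = df.drop(columns=["price_myr"])

# Aesthetic: standardize values to meet formatting requirements.
df["first"] = df["first"].str.title()
df["last"] = df["last"].str.title()

# Structural: reorganize by combining and renaming columns.
df["full_name"] = df["first"] + " " + df["last"]
df = df.drop(columns=["first", "last"]).rename(columns={"full_name": "customer"})
print(df)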
Data reduction

► Data reduction techniques ensure the integrity of data while reducing the data.
► Data reduction is a process that reduces the volume of original data and represents it in
a much smaller volume.
► Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data. By reducing the data, the efficiency of the data mining process is improved while producing the same analytical results.
Cont’d

► Data reduction does not affect the result obtained from data mining; the result obtained from data mining before and after data reduction is the same or almost the same.
► Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction may be in terms of the number of rows (records) or the number of columns (dimensions); both are illustrated in the sketch below.
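As an illustration under assumed data sizes, the sketch below reduces a dataset both by rows (random sampling) and by columns (principal component analysis); PCA is one possible technique chosen for the example, not one prescribed by the slides.

# Illustrative reduction sketch: assumed sizes; PCA chosen only for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))      # 10,000 records (rows), 50 dimensions (columns)

# Numerosity reduction: keep a random sample of the rows.
sample = X[rng.choice(len(X), size=1_000, replace=False)]

# Dimensionality reduction: project onto fewer columns while keeping
# most of the variance in the original data.
reduced = PCA(n_components=10, random_state=0).fit_transform(sample)
print(X.shape, "->", sample.shape, "->", reduced.shape)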
Data Mining Process

► Many different sectors are taking advantage of data mining to boost their business efficiency, including manufacturing, chemicals, marketing, aerospace, etc.
► Therefore, the need for a standard data mining process has grown.
► Data mining techniques must be reliable and repeatable by company individuals with little or no knowledge of the data mining context.
► As a result, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced in the late 1990s, after going through many workshops and with contributions from more than 300 organizations.
Cont’d

► Data mining is described as a process of finding hidden, valuable information by evaluating the huge quantity of data stored in data warehouses, using multiple data mining techniques such as Artificial Intelligence (AI), machine learning, and statistics.
► Figure 4 describes the process of data mining.
Cont’d

Figure 4: Data Mining Process
The Cross-Industry Standard Process for Data
Mining (CRISP-DM)

► The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases, designed as a cyclical method, as shown in Figure 5:

Figure 5: Standard Process
Cont’d

► 1. Business Understanding:
► It focuses on understanding the project goals and requirements from a business point of view, then converting this knowledge into a data mining problem definition and a preliminary plan designed to accomplish the target.
► 2. Data Understanding:
► Data understanding starts with an initial data collection and proceeds with activities to become familiar with the data, identify data quality issues, gain better insight into the data, or detect interesting subsets that suggest hypotheses about concealed information.
Cont’d

► 3. Data Preparation:
► It usually takes more than 90 percent of the time.
► It covers all operations needed to build the final data set from the original raw information.
► Data preparation is likely to be done several times, and not in any prescribed order.
► 4. Modeling:
► In modeling, various modeling methods are selected and applied, and their parameters are calibrated to optimal values. Some methods have particular requirements on the form of the data, so stepping back to the data preparation phase may be necessary.
Cont’d

► 5. Evaluation:
► It evaluates the model thoroughly and reviews the steps executed to build it, to ensure that the business objectives are properly achieved.
► The main objective of the evaluation is to determine whether there is some significant business issue that has not been considered adequately.
► At the end of this phase, a decision on the use of the data mining results should be reached (a minimal modeling-and-evaluation sketch follows this slide).
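A minimal sketch of the modeling and evaluation phases, assuming a built-in labelled dataset and a decision-tree model (both chosen only for illustration), might look like this:

# Illustrative modeling/evaluation sketch: assumed dataset and model choice.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Modeling: select a method and calibrate its parameters.
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Evaluation: check the model on held-out data before deciding whether
# the results meet the business objectives.
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))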
Cont’d

► 6. Deployment:
► Deployment refers to how the outcomes need to be utilized.
► Deploying the data mining results can include scoring a database, utilizing the results as company guidelines, or interactive internet scoring.
► The knowledge acquired will need to be organized and presented in a way that can be used by the client. Depending on the demands, the deployment phase may be as simple as generating a report or as complicated as implementing a repeatable data mining process across the organization (a small persistence-and-scoring sketch follows below).
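For illustration, deployment as "scoring" can be as simple as persisting a trained model and reloading it in the application that scores new records; the sketch below assumes scikit-learn and joblib, and the dataset, model, and file name are invented for the example.

# Illustrative deployment sketch: assumed model, dataset, and file name.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Deploy by persisting the trained model so another system can score records.
joblib.dump(model, "scoring_model.joblib")

# Inside the deployed application: load the model and score new records.
scoring_model = joblib.load("scoring_model.joblib")
print(scoring_model.predict(X[:5]))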
Data Visualization / Evaluation

► Data visualization is the graphical representation of quantitative information and data using visual elements like graphs, charts, and maps.
► Data visualization converts large and small data sets into visuals, which are easier for humans to understand and process.
► Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data (a short plotting sketch follows below).
► In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.
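For illustration, the short matplotlib sketch below plots an assumed monthly sales series as a line chart and highlights an outlier; the figures are invented for the example.

# Illustrative visualization sketch: invented monthly sales figures.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 310, 142, 150]      # April is an obvious outlier

plt.figure(figsize=(6, 3))
plt.plot(months, sales, marker="o")
plt.title("Monthly sales (example data)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.annotate("outlier", xy=(3, 310), xytext=(4, 280),  # index 3 = Apr
             arrowprops=dict(arrowstyle="->"))
plt.tight_layout()
plt.show()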
What makes Data Visualization Effective?

► Effective data visualizations are created where communication, data science, and design collide. Done right, data visualization turns complicated data sets into key insights that are meaningful and intuitive.
► To craft an effective data visualization, you need to start with clean data that is well-sourced and complete. Once the data is ready to visualize, you need to pick the right chart.
► After you have decided on the chart type, you need to design and customize your visualization to your liking. Simplicity is essential: you don't want to add any elements that distract from the data.
Importance of Data Visualization

► Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large amounts of complex data is more comfortable than studying spreadsheets and reports.
► Data visualization is an easy and quick way to convey concepts universally.
► You can experiment with different scenarios by making slight adjustments.
Data visualization has some more specialties, such as:

► Data visualization can identify areas that need improvement or modification.
► Data visualization can clarify which factors influence customer behavior.
► Data visualization helps you to understand which products to place where.
► Data visualization can predict sales volumes.
Why Use Data Visualization?

► To make information easier to understand and remember.
► To discover unknown facts, outliers, and trends.
► To visualize relationships and patterns quickly.
► To ask better questions and make better decisions.
► To analyze competitors.
► To improve insights.
The end
