Topic 3 - Data Mining
► The primary objective of data selection is the determination of appropriate data type,
source, and instrument(s) that allow investigators to adequately answer research
questions.
► This determination is often discipline-specific and is primarily driven by the nature of the
investigation, existing literature, and accessibility to necessary data sources.
► Integrity issues can arise when the decisions to select ‘appropriate’ data to collect are
based primarily on cost and convenience considerations rather than the ability of data to
adequately answer research questions.
► Certainly, cost and convenience are valid factors in the decision-making process.
However, researchers should assess to what degree these factors might compromise the
integrity of the research endeavor.
Types and Sources of Data
► Data types and sources can be represented in a variety of ways. The two primary data types are:
► Quantitative data are represented as numerical figures (interval- and ratio-level measurements).
► Qualitative data are text, images, audio/video, etc.
► Questions to answer when selecting data types and sources:
► What is the research question?
► What is the scope of the investigation? (This defines the parameters of the study; selected data should not extend
beyond that scope.)
► What has the literature (previous research) determined to be the most appropriate data to collect?
► What type of data should be considered: quantitative, qualitative, or a composite of both?
Pre-processing and Cleaning
Figure 3: Data Preprocessing
Data Transformation and Reduction
► Raw data is difficult to trace or understand, so it needs to be preprocessed before
any information is retrieved from it.
► Data transformation is a technique used to convert raw data into a suitable format that
makes data mining more efficient and strategic information easier to retrieve.
► Data transformation includes data cleaning and data reduction techniques that
convert the data into the appropriate form.
► Data transformation changes the format, structure, or values of the data and converts them
into clean, usable data.
► Data may be transformed at two stages of the data pipeline for data analytics projects.
Organizations that use on-premises data warehouses generally use an ETL (extract, transform,
and load) process, in which data transformation is the middle step.
► Today, most organizations use cloud-based data warehouses to scale compute and storage
resources with latency measured in seconds or minutes.
► The scalability of the cloud platform lets organizations skip preload transformations and load
raw data into the data warehouse, then transform it at query time (an approach known as ELT:
extract, load, transform).
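The two placements of the transformation step can be sketched with Python's built-in sqlite3 module standing in for a data warehouse. The table names, the sample rows, and the clean_amount helper are hypothetical and chosen only for illustration; this is a minimal sketch, not a production pipeline.

```python
import sqlite3

# Hypothetical raw export: amounts arrive as untidy strings.
raw_rows = [("a1", " 10.50 "), ("a2", "7"), ("a3", " 2.25")]

def clean_amount(text):
    """Transform step: strip whitespace and convert to a float."""
    return float(text.strip())

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse

# Transform before loading: clean in application code, then load clean values.
conn.execute("CREATE TABLE sales_etl (id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_etl VALUES (?, ?)",
    [(row_id, clean_amount(amount)) for row_id, amount in raw_rows],
)
etl_total = conn.execute("SELECT SUM(amount) FROM sales_etl").fetchone()[0]

# Transform at query time: load the raw strings as-is, clean inside the query.
conn.execute("CREATE TABLE sales_raw (id TEXT, amount TEXT)")
conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
elt_total = conn.execute(
    "SELECT SUM(CAST(TRIM(amount) AS REAL)) FROM sales_raw"
).fetchone()[0]

print(etl_total, elt_total)
```

Both routes give the same total; the difference is only where the cleaning work happens and whether the raw values remain available in the warehouse.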
► Data integration, data migration, data warehousing, and data wrangling may all involve data transformation.
► Data transformation increases the efficiency of business and analytic processes, and it enables businesses to
make better data-driven decisions.
► During the data transformation process, an analyst will determine the structure of the data.
► Depending on that structure, the data transformation may be:
► Constructive: The data transformation process adds, copies, or replicates data.
► Destructive: The system deletes fields or records.
► Aesthetic: The transformation standardizes the data to meet requirements or parameters.
► Structural: The database is reorganized by renaming, moving, or combining columns.
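The four kinds of transformation can be illustrated on a small list of records. The field names and values below are hypothetical, and the transform function is only one possible sketch of how each kind might look in code.

```python
# Hypothetical customer records used only for illustration.
records = [
    {"first": "ada", "last": "lovelace", "country": "uk", "fax": None},
    {"first": "alan", "last": "turing", "country": "UK ", "fax": None},
]

def transform(record):
    out = dict(record)  # work on a copy so the input stays unchanged
    # Constructive: add a derived field.
    out["full_name"] = f"{out['first']} {out['last']}".title()
    # Destructive: delete a field that carries no information.
    del out["fax"]
    # Aesthetic: standardize values to one convention.
    out["country"] = out["country"].strip().upper()
    # Structural: rename a column.
    out["surname"] = out.pop("last")
    return out

transformed = [transform(r) for r in records]
print(transformed[0])
```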
Data reduction
► Data reduction techniques reduce the volume of data while preserving its integrity.
► Data reduction is a process that takes the original data and represents it in
a much smaller volume.
► The goal is a reduced representation of the dataset that
is much smaller in volume yet maintains the integrity of the original data.
Reducing the data improves the efficiency of the data mining process while
producing the same, or nearly the same, analytical results.
► Data reduction should not materially affect the result obtained from data mining. That means
the result obtained before and after data reduction is the same
or almost the same.
► Data reduction aims to represent the data more compactly. When the data size is smaller, it
is simpler to apply sophisticated and computationally expensive algorithms.
The reduction may be in terms of the number of rows (records) or
in terms of the number of columns (dimensions).
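A minimal sketch of both directions of reduction, assuming a hypothetical dataset with one informative column and one constant column. Random sampling and dropping constant columns are just two simple techniques chosen for illustration; many others exist.

```python
import random
import statistics

# Hypothetical dataset: 1,000 rows with one informative and one constant column.
rows = [{"value": i % 100, "flag": 1} for i in range(1000)]

# Row reduction: keep a simple random sample of 10% of the records.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(rows, k=100)

# Column reduction: drop columns whose value never varies.
constant_cols = {
    col for col in rows[0]
    if len({row[col] for row in rows}) == 1
}
reduced = [
    {col: val for col, val in row.items() if col not in constant_cols}
    for row in sample
]

full_mean = statistics.mean(row["value"] for row in rows)
sample_mean = statistics.mean(row["value"] for row in reduced)
print(f"full mean={full_mean:.2f}, reduced mean={sample_mean:.2f}")
```

The reduced set has a tenth of the rows and half of the columns, yet the mean of the informative column should stay close to the mean of the full data.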
Data Mining Process
► Many different sectors are taking advantage of data mining to boost their business
efficiency, including manufacturing, chemical, marketing, aerospace, etc.
► Therefore, the need for a standard data mining process grew.
► Data mining techniques must be reliable and repeatable by people in the company with little or
no background in data mining.
► As a result, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was
introduced in the late 1990s, after many workshops and contributions from more than
300 organizations.
Figure 4: Data Mining Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM)
Figure 5: Standard Process
► 1. Business Understanding:
► It focuses on understanding the project goals and requirements from a business point of
view, then converting this information into a data mining problem definition and a preliminary
plan designed to accomplish the target.
► 2. Data Understanding:
► Data understanding starts with an initial data collection and proceeds with operations to
get familiar with the data, to identify data quality issues, to gain first insights into the data, or to detect
interesting subsets that suggest hypotheses about hidden information.
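Getting familiar with the data and spotting quality issues often begins with a simple profile of each field. The sketch below assumes a hypothetical list of ages in which None marks a missing measurement.

```python
import statistics

# Hypothetical raw collection; None marks a missing measurement.
ages = [34, 29, None, 41, 29, None, 55, 38]

present = [a for a in ages if a is not None]
profile = {
    "count": len(ages),
    "missing": len(ages) - len(present),  # a basic data quality indicator
    "min": min(present),
    "max": max(present),
    "median": statistics.median(present),
}
print(profile)
```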
► 3. Data Preparation:
► It usually takes most of the project time (by some estimates, as much as 90 percent).
► It covers all operations to build the final data set from the original raw information.
► Data preparation tasks are likely to be performed several times and not in any prescribed order.
► 4. Modeling:
► In modeling, various modeling methods are selected and applied, and their parameters are
calibrated to optimal values. Some methods have particular requirements on the form of
the data. Therefore, stepping back to the data preparation phase is often necessary.
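Calibrating a parameter to its best value can be sketched with a deliberately tiny model: a single threshold on one feature. The training data, the candidate thresholds, and the predict and accuracy helpers are all hypothetical.

```python
# Hypothetical labeled data: (feature value, class label).
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

def predict(x, threshold):
    """One-parameter model: predict class 1 when x exceeds the threshold."""
    return 1 if x > threshold else 0

def accuracy(data, threshold):
    """Fraction of examples the model classifies correctly."""
    hits = sum(predict(x, threshold) == label for x, label in data)
    return hits / len(data)

# Calibrate the parameter by trying candidate values and keeping the best.
candidates = [0.5, 2.5, 4.5, 6.5, 8.5]
best_threshold = max(candidates, key=lambda t: accuracy(train, t))
print(best_threshold, accuracy(train, best_threshold))
```

In practice the candidates would be scored on data held out from training, so that the chosen value is not tuned to noise in the training set.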
► 5. Evaluation:
► It evaluates the model thoroughly and reviews the steps executed to build it, to
ensure that the business objectives are properly achieved.
► A key objective of the evaluation is to determine whether some significant business issue
has not been considered adequately.
► At the end of this phase, a decision on the use of the data mining results should be
reached.
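The decision at the end of evaluation can be tied to measurable criteria. The sketch below assumes hypothetical held-out labels and predictions and a hypothetical business requirement on recall; real projects would agree on their own metrics and thresholds.

```python
# Hypothetical held-out labels and model predictions (1 = positive class).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix counts.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Hypothetical business criterion agreed on during business understanding.
REQUIRED_RECALL = 0.70
deploy = recall >= REQUIRED_RECALL
print(accuracy, precision, recall, deploy)
```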
► 6. Deployment:
► Deployment refers to how the outcomes need to be utilized.
► Data mining results can be deployed by scoring a database, using the results as company
guidelines, or interactive internet scoring.
► The information acquired needs to be organized and presented in a way that the client
can use. Depending on the demands, the deployment phase may be as simple as generating a report
or as complicated as applying a repeatable data mining method across the organization.
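Scoring a database and generating a report can be sketched with Python's built-in sqlite3 module. The customers table, its columns, and the score function (a stand-in for a trained model) are hypothetical.

```python
import sqlite3

def score(monthly_spend):
    """Hypothetical stand-in for a trained model: flag high-spend customers."""
    return 1 if monthly_spend > 100 else 0

conn = sqlite3.connect(":memory:")  # stand-in for the production database
conn.execute(
    "CREATE TABLE customers (id INTEGER, monthly_spend REAL, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (id, monthly_spend) VALUES (?, ?)",
    [(1, 40.0), (2, 150.0), (3, 101.5)],
)

# Score every record and write the result back to the table.
rows = conn.execute("SELECT id, monthly_spend FROM customers").fetchall()
conn.executemany(
    "UPDATE customers SET score = ? WHERE id = ?",
    [(score(spend), cust_id) for cust_id, spend in rows],
)

# The simplest deployment artifact: a short report of flagged customers.
flagged = [r[0] for r in conn.execute(
    "SELECT id FROM customers WHERE score = 1 ORDER BY id"
)]
print("Flagged customers:", flagged)
```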
Data Visualization / Evaluation