DWM Sem V Module 2 - Introduction To Data Mining, Data Exploration and Data Pre-Processing
● Each user will have a data mining task in mind, that is, some form of data
analysis that he or she would like to have performed.
● A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives.
● These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process, or
examine the findings from different angles or depths.
● The set of task-relevant data to be mined
● The kind of knowledge to be mined
● The background knowledge to be used in the discovery process
● The interestingness measures and thresholds for pattern evaluation
● The expected representation for visualizing the discovered patterns
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
Database Server:
The database server contains the actual data ready to be processed. It performs the task of
handling data retrieval as per the request of the user.
Knowledge Base:
The knowledge base is an important part of the data mining engine that is quite beneficial in guiding the
search for the resulting patterns. The data mining engine may also sometimes get inputs from the knowledge
base, which may contain knowledge drawn from user experiences. The objective of the knowledge base is to
guide the search and help make the discovered patterns more accurate and useful.
Data Mining Techniques
1. Descriptive mining tasks describe the characteristics of the data in a target data set. On the
other hand, predictive mining tasks carry out the induction over the current and past data so
that predictions can be made.
2. In terms of accuracy, descriptive mining is more precise, since it summarizes data that has already
been observed, whereas predictive mining produces probabilistic estimates about data not yet seen.
3. Predictive analysis supports acting on a situation in advance as well as responding to it, while
descriptive analysis only responds to what has already happened.
4. The operations performed in the descriptive approach are standard reporting, query/drill-down
and ad hoc reporting, which answer questions such as –
● what happened?
● where exactly is the problem?
● what is the frequency of the problem?
5. In contrast, predictive mining performs tasks like predictive modelling, forecasting, simulation
and alerts. These answer questions like –
● what will happen next?
● what is the outcome if these trends continue?
● what actions are required to be taken?
Issues in Data Mining
1. Mining Methodology
● Mining different kinds of knowledge in databases:
2. User Interaction
● Interactive mining of knowledge in multiple levels of abstractions
● Incorporation of background knowledge:
● Ad hoc data mining and data mining query languages:
● Presentation and visualization of data mining results:
● Handling noisy & incomplete data
● Pattern evaluation
3. Performance Issues:
● Efficiency and scalability of data mining algorithms
● Parallel, distributed, and incremental mining algorithms
For example, suppose a patient undergoes a medical test that has two possible outcomes. The attribute medical test is
binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is
negative.
■ Binary
■ Nominal attribute with only 2 states (0 and 1)
■ Symmetric binary: both outcomes equally important
■ e.g., gender
■ Asymmetric binary: outcomes not equally important.
■ e.g., medical test (positive vs. negative)
■ Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and
professional rank. Professional ranks can be enumerated in a sequential order: for
example, assistant, associate, and full for professors, and private, private first class,
specialist, corporal, and sergeant for army ranks.
■ Ordinal
■ Values have a meaningful order (ranking) but magnitude between
successive values is not known.
■ Size = {small, medium, large}, grades, army rankings
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real
values.
■ Continuous Attribute
■ Has real numbers as attribute values
■ Practically, real values can only be measured and represented using a finite number
of digits
■ Continuous attributes are typically represented as floating-point variables
Statistical Descriptions of Data
Descriptive Statistics:
Descriptive statistics describes a population or sample through numerical calculations, graphs, or tables;
it provides a summary of the data being studied. There are two categories, as described below.
1. Measure of Central Tendency –
The mode is the value that occurs most frequently in the sample set.
For example, in the data set {2, 3, 3, 5, 7} the mode is 3, because it is the value repeated most often.
2. Measure of Variability –
Measure of Variability is also known as measure of dispersion and used to describe variability
in a sample or population. In statistics, there are three common measures of variability as
shown below:
● (i) Range:
It measures how spread apart the values in a sample or data set are.
Range = Maximum value - Minimum value
● (ii) Variance:
It describes how much the values differ from the mean (the expected value); it is the
average of the squared deviations from the mean:
$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
In this formula, $n$ is the total number of data points, $\bar{x}$ is the mean of the data points, and $x_i$
is an individual data point.
● (iii) Standard Deviation:
It measures the dispersion of a set of data values from its mean:
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
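As an illustrative sketch (not part of the original notes), these measures can be computed for a small made-up sample using only Python's standard library:

```python
import statistics

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # made-up sample values

mode = statistics.mode(data)               # most frequently occurring value
data_range = max(data) - min(data)         # Range = Maximum value - Minimum value
variance = statistics.pvariance(data)      # (1/n) * sum of squared deviations from the mean
std_dev = statistics.pstdev(data)          # square root of the variance

print(mode, data_range, variance, std_dev)
```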
Data Visualization
■ Why data visualization?
■ Gain insight into an information space by mapping data onto graphical primitives
■ Provide qualitative overview of large data sets
■ Search for patterns, trends, structure, irregularities, relationships among data
■ Help find interesting regions and suitable parameters for further quantitative
analysis
■ Provide a visual proof of computer representations derived
Why visualization?
Without visualization, mining and analysis lose much of their importance: data mining is about finding
inferences by analyzing data for patterns, and those patterns can only be communicated effectively
through different visualization techniques.
Techniques:
● Box plots
● Histograms
● Charts
● Tree maps
Box Plots
In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box
plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the
terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.
A boxplot is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample
median, and the first and third quartiles.
Minimum (Q0 or 0th percentile): the lowest data point excluding any outliers.
Maximum (Q4 or 100th percentile): the largest data point excluding any outliers.
Median (Q2 or 50th percentile): the middle value of the dataset.
First quartile (Q1 or 25th percentile): also known as the lower quartile qn(0.25), is the median of the lower half of the dataset.
Third quartile (Q3 or 75th percentile): also known as the upper quartile qn(0.75), is the median of the upper half of the dataset.
An important element used in constructing the box plot, because it determines the highest and lowest data values still considered
feasible (the whisker limits), but which is not part of the five-number summary, is the interquartile range:
Interquartile range (IQR): the distance between the upper and lower quartiles, IQR = Q3 − Q1.
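A hedged illustration (the values are made up) of computing the five-number summary and the IQR, and drawing a box plot with numpy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # made-up sample values

q1, median, q3 = np.percentile(data, [25, 50, 75])      # first quartile, median, third quartile
iqr = q3 - q1                                           # interquartile range
print(min(data), q1, median, q3, max(data), "IQR:", iqr)

plt.boxplot(data)                                       # whiskers extend up to 1.5 * IQR by default
plt.title("Box plot of the sample")
plt.show()
```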
Histogram
A histogram is a graphical display of data using bars of different heights; it describes the distribution of
quantitative data. A histogram divides the variable values into equal-sized intervals, and each bar groups
the numbers falling into one range. Taller bars show that more data falls in that range. A histogram
displays the shape and spread of continuous sample data. It is similar to a vertical bar graph; however,
unlike a vertical bar graph, a histogram shows no gaps between the bars.
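A minimal matplotlib sketch of a histogram over synthetic continuous data (the distribution and bin count are assumptions, not from the notes):

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=500)   # synthetic continuous sample

plt.hist(values, bins=10, edgecolor="black")            # 10 equal-width intervals, no gaps between bars
plt.xlabel("Value range")
plt.ylabel("Frequency")
plt.title("Histogram of the synthetic sample")
plt.show()
```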
Charts
Bar Chart
Charts made of rectangular bars, where the lengths of the bars are proportional to the values they
represent, are known as bar graphs or bar charts. A bar chart can be plotted vertically or horizontally.
Usually it is drawn vertically, where the x-axis represents the categories and the y-axis represents the
measured values.
Line Charts
It is a type of chart which displays information as a series of data points called markers connected by
straight line segments. Line graphs show how a continuous variable changes over time. The variable that
measures time is plotted on the x-axis. The continuous variable is plotted on the y-axis.
Pie Chart
It is a circular statistical graph which is divided into slices to illustrate numerical proportion. The
arc length of each slice is proportional to the quantity it represents.
Scatter plot
A scatterplot is a graphical way to display the relationship between two quantitative sample
variables. It consists of an X axis, a Y axis and a series of dots where each dot represents one
observation from a data set. The position of the dot refers to its X and Y values.
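The chart types above can be sketched with matplotlib; the categories and values below are made-up placeholders:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]          # hypothetical categories
counts = [23, 45, 12, 30]                  # hypothetical measures per category
years = [2018, 2019, 2020, 2021]
sales = [100, 120, 90, 140]                # hypothetical continuous variable over time

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].bar(categories, counts)         # bar chart: x-axis categories, y-axis measured values
axes[0, 0].set_title("Bar chart")

axes[0, 1].plot(years, sales, marker="o")  # line chart: change of a variable over time
axes[0, 1].set_title("Line chart")

axes[1, 0].pie(counts, labels=categories)  # pie chart: slices proportional to the quantities
axes[1, 0].set_title("Pie chart")

axes[1, 1].scatter(counts, sales)          # scatter plot: relationship of two quantitative variables
axes[1, 1].set_title("Scatter plot")

plt.tight_layout()
plt.show()
```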
Major Tasks in Data Preprocessing
■ Data cleaning
■ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation
Data Cleaning
■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
■ incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
■ e.g., Occupation=“ ” (missing data)
■ noisy: containing noise, errors, or outliers
■ e.g., Salary=“−10” (an error)
■ inconsistent: containing discrepancies in codes or names, e.g.,
■ Age=“42”, Birthday=“03/07/2010”
■ Was rating “1, 2, 3”, now rating “A, B, C”
■ discrepancy between duplicate records
■ Intentional (e.g., disguised missing data)
■ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
■ Data is not always available
■ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data may not be considered important at the time of entry
■ not register history or changes of the data
■ Missing data may need to be inferred
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute
varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill in it automatically with
■ a global constant : e.g., “unknown”, a new class?!
■ the attribute mean
■ the attribute mean for all samples belonging to the same class:
smarter
■ the most probable value: inference-based such as Bayesian formula
or decision tree
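A small pandas sketch of the filling strategies above, on a hypothetical table with missing customer income (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "class":  ["low", "low", "high", "high", "high"],
    "income": [30000, np.nan, 90000, np.nan, 85000],
})

dropped = df.dropna()                                    # ignore tuples with missing values
const_fill = df.fillna({"income": -1})                   # fill with a global constant / sentinel value
mean_fill = df.fillna({"income": df["income"].mean()})   # fill with the attribute mean

# Fill with the attribute mean of samples belonging to the same class (smarter)
class_mean_fill = df.copy()
class_mean_fill["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(class_mean_fill)
```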
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitation
■ inconsistency in naming convention
How to Handle Noisy Data?
■ Binning
■ first sort data and partition into (equal-frequency) bins
■ then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc. (a Python sketch of the worked example below is given after this list)
■ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
■ Regression
■ smooth by fitting the data into regression functions
Linear regression finds the best-fitting line between two variables so that one can be used
to predict the other. Using regression to find a mathematical equation that fits the data
helps smooth out the noise (a small sketch is given after this list).
■ Clustering
■ detect and remove outliers
■ Combined computer and human inspection
■ detect suspicious values and check by human (e.g., deal with possible
outliers)
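A short Python sketch (standard library only) that reproduces the equal-frequency binning and smoothing worked example above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins                             # equal-frequency (equi-depth) bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer bin boundary
by_bounds = [[b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b] for b in bins]

print(bins)        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```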
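And a minimal numpy sketch of smoothing by regression: fit a straight line to noisy (synthetic) data and replace the observed values with the fitted ones:

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0 + np.random.normal(scale=2.0, size=x.size)   # synthetic noisy values

slope, intercept = np.polyfit(x, y, deg=1)   # best-fitting line y = slope * x + intercept
y_smoothed = slope * x + intercept           # values on the fitted line, noise smoothed out

print(np.round(y_smoothed, 2))
```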
Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from different sources are different
■ Possible reasons: different representations, different scales, e.g., metric vs. British
units
Data Reduction Strategies
■ Data reduction: Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical results
■ Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
■ Data reduction strategies
■ Dimensionality reduction, e.g., remove unimportant attributes
■ Wavelet transforms (store only a small fraction of the strongest wavelet coefficients)
■ Principal Components Analysis (PCA) (the original data are projected onto a
much smaller space, resulting in dimensionality reduction; a short sketch is given after this list)
■ Feature subset selection, feature creation
■ Numerosity reduction (some simply call it: Data Reduction)
■ Regression and Log-Linear Models
■ Histograms, clustering, sampling
■ Data cube aggregation
■ Data compression
The data compression technique reduces the size of files using different encoding
mechanisms (Huffman encoding and run-length encoding). It can be divided into two types,
lossless and lossy compression, depending on whether the original data can be reconstructed
exactly from the compressed data.
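An illustrative scikit-learn sketch of PCA-based dimensionality reduction on synthetic data (the data shapes and component count are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                         # two underlying independent directions
X = np.hstack([base, base @ rng.normal(size=(2, 3))])    # 100 samples, 5 correlated attributes

pca = PCA(n_components=2)            # keep the two strongest principal components
X_reduced = pca.fit_transform(X)     # project the data onto the much smaller space

print(X.shape, "->", X_reduced.shape)       # (100, 5) -> (100, 2)
print(pca.explained_variance_ratio_)        # fraction of variance kept per component
```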
Numerosity reduction
Min-Max Normalization
In this technique of data normalization, a linear transformation is performed on the original data. The minimum and
maximum values of the attribute A are found, and each value v is replaced according to the formula
$v' = \frac{v - \min_A}{\max_A - \min_A}(new\_max_A - new\_min_A) + new\_min_A$
which maps v into the new range $[new\_min_A, new\_max_A]$.
Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the data A. The formula used is:
$v' = \frac{v - \bar{A}}{\sigma_A}$
where $\bar{A}$ is the mean and $\sigma_A$ the standard deviation of attribute A.
Decimal Scaling Method For Normalization –
It normalizes by moving the decimal point of the values of the data. Each value is divided by a power of 10
large enough that the largest absolute normalized value is less than 1. A data value, vi, is normalized to vi' by
using the formula below –
$v_i' = \frac{v_i}{10^j}$
where j is the smallest integer such that $\max(|v_i'|) < 1$.
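A numpy sketch of the three normalization techniques above (the attribute values are made up):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalization based on the mean and standard deviation."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Decimal scaling: divide by 10^j, the smallest power of 10 with max(|v'|) < 1."""
    j = 0
    while np.abs(v).max() / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values
print(min_max(values))          # [0.    0.125 0.25  0.5   1.   ]
print(np.round(z_score(values), 3))
print(decimal_scaling(values))  # divided by 10^4 -> [0.02 0.03 0.04 0.06 0.1 ]
```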
Discretization
■ Three types of attributes
■ Nominal—values from an unordered set, e.g., color, profession
■ Ordinal—values from an ordered set, e.g., military or academic rank
■ Numeric—quantitative values, e.g., integer or real numbers
■ Discretization: Divide the range of a continuous attribute into intervals
■ Interval labels can then be used to replace actual data values (see the sketch after this list)
■ Reduce data size by discretization
■ Supervised vs. unsupervised
■ Split (top-down) vs. merge (bottom-up)
■ Discretization can be performed recursively on an attribute
■ Prepare for further analysis, e.g., classification
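A small pandas sketch of unsupervised, equal-width discretization where interval labels replace the actual values (the data and labels are assumptions):

```python
import pandas as pd

ages = pd.Series([13, 15, 22, 25, 33, 35, 45, 46, 52, 70])   # hypothetical continuous attribute

# Split the value range into 3 equal-width intervals and replace each value
# with its interval label, reducing the data to a few discrete categories.
labels = pd.cut(ages, bins=3, labels=["low", "mid", "high"])
print(labels.tolist())
```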
Concept Hierarchy Generation
■ Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
■ Concept hierarchies facilitate drilling and rolling in data warehouses to
view data at multiple levels of granularity
■ Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
■ Concept hierarchies can be explicitly specified by domain experts and/or
data warehouse designers
■ Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept
hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values
for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
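A tiny sketch of such a replacement for a numeric age attribute; the cut points (25 and 60) are illustrative assumptions, not values from the notes:

```python
import pandas as pd

ages = pd.Series([16, 23, 35, 47, 58, 67, 72])   # hypothetical low-level numeric values

# Replace numeric ages with higher-level concepts (assumed cut points at 25 and 60)
concepts = pd.cut(ages, bins=[0, 25, 60, 120], labels=["young", "middle-aged", "senior"])
print(concepts.tolist())
```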
Discretization and Concept Hierarchy Generation for
Numerical Data:
Binning: Attribute values can be discretized by distributing the values into bins and replacing
each bin by the bin mean or bin median value. This technique can be applied recursively
to the resulting partitions in order to generate concept hierarchies.
Histogram Analysis: Histograms can also be used for discretization. Partitioning rules can be
applied to define ranges of values. The histogram analysis algorithm can be applied recursively
to each partition in order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has been reached. A
minimum interval size can be used per level to control the recursive procedure; this specifies the
minimum width of a partition, or the minimum number of values in each partition, at each level.
Cluster Analysis: A clustering algorithm can be applied to partition data into clusters or groups. Each
cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level. Each
cluster may be further decomposed into sub-clusters, forming a lower level in the hierarchy. Clusters
may also be grouped together to form a higher conceptual level of the hierarchy.
Segmentation by natural partitioning: Breaking up annual salaries into uniform ranges like
($50,000-$100,000) is often more desirable than ranges like ($51,263.89-$60,765.3) arrived at by
cluster analysis. The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural"
intervals. In general, the rule partitions a given range of data into 3, 4, or 5 equal-width intervals, recursively,
level by level, based on the value range at the most significant digit. The rule can be applied recursively to
each interval, creating a concept hierarchy for the given numeric attribute.
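A simplified Python sketch of the core 3-4-5 split (one level only; it ignores the recursion and the special 2-3-2 handling when there are 7 distinct values at the most significant digit):

```python
import math

def three_four_five_split(low, high):
    """Split [low, high] into 3, 4, or 5 equal-width intervals based on the
    number of distinct values at the most significant digit."""
    msd = 10 ** int(math.floor(math.log10(high - low)))   # most significant digit position
    lo = msd * math.floor(low / msd)                      # round low down at that digit
    hi = msd * math.ceil(high / msd)                      # round high up at that digit
    distinct = int(round((hi - lo) / msd))                # distinct values at the msd
    if distinct in (3, 6, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                                 # 1, 5, 10 (and 7, simplified here)
        parts = 5
    width = (hi - lo) / parts
    return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]

print(three_four_five_split(-351, 4700))
# [(-1000.0, 1000.0), (1000.0, 3000.0), (3000.0, 5000.0)]
```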
Discretization and Concept Hierarchy Generation for
Categorical Data: