DWM Sem V Module 2 - Introduction To Data Mining, Data Exploration and Data Pre-Processing
● Each user will have a data mining task in mind, that is, some form of data
analysis that he or she would like to have performed.
● A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives.
● These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process, or
examine the findings from different angles or depths.
● The set of task-relevant data to be mined
● The kind of knowledge to be mined
● The background knowledge to be used in the discovery process
● The interestingness measures and thresholds for pattern evaluation
● The expected representation for visualizing the discovered patterns
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
Database Server:
The database server contains the actual data ready to be processed. It performs the task of
handling data retrieval as per the request of the user.
Knowledge Base:
The knowledge base is an important part of the data mining engine that is quite beneficial in guiding the
search for the resulting patterns. The data mining engine may also sometimes get inputs from the knowledge
base, which may contain knowledge drawn from user experiences. The objective of the knowledge base is to
guide the search and help make the discovered patterns more accurate and useful.
Data Mining Techniques
1. Descriptive mining tasks describe the characteristics of the data in a target data set. On the
other hand, predictive mining tasks carry out the induction over the current and past data so
that predictions can be made.
2. In terms of accuracy, descriptive mining is more precise, since it summarizes data that has already
been observed, whereas predictive mining produces probabilistic estimates about data not yet seen.
3. Predictive analysis supports acting on a situation in advance as well as responding to it, while
descriptive analysis only responds to what has already happened.
4. The operations performed in the descriptive approach are standard reporting, query/drill-down
and ad hoc reporting, which answer questions such as –
● what happened?
● where exactly is the problem?
● what is the frequency of the problem?
5. In contrast, predictive mining performs tasks like predictive modelling, forecasting, simulation
and alerts. These answer questions like –
● what will happen next?
● what is the outcome if these trends continue?
● what actions are required to be taken?
Issues in Data Mining
1. Mining Methodology
● Mining different kinds of knowledge in databases:
2. User Interaction
● Interactive mining of knowledge in multiple levels of abstractions
● Incorporation of background knowledge:
● Ad hoc data mining and data mining query languages:
● Presentation and visualization of data mining results:
● Handling noisy & incomplete data
● Pattern evaluation
3. Performance Issues:
● Efficiency and scalability of data mining algorithms
● Parallel, distributed, and incremental mining algorithms
For example, suppose a patient undergoes a medical test that has two possible outcomes. The attribute medical test is
binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is
negative.
■ Binary
■ Nominal attribute with only 2 states (0 and 1)
■ Symmetric binary: both outcomes equally important
■ e.g., gender
■ Asymmetric binary: outcomes not equally important.
■ e.g., medical test (positive vs. negative)
■ Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and
professional rank. Professional ranks can be enumerated in a sequential order: for
example, assistant, associate, and full for professors, and private, private first class,
specialist, corporal, and sergeant for army ranks.
■ Ordinal
■ Values have a meaningful order (ranking) but magnitude between
successive values is not known.
■ Size = {small, medium, large}, grades, army rankings
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real
values.
■ Continuous Attribute
■ Has real numbers as attribute values
■ Practically, real values can only be measured and represented using a finite number
of digits
■ Continuous attributes are typically represented as floating-point variables
Statistical Descriptions of Data
Descriptive Statistics:
Descriptive statistics describes a population or sample through numerical calculations, graphs, or tables;
it provides a summary of the data being studied. There are two categories, as described below.
1. Measure of Central Tendency –
The mode is the value that occurs most frequently in the sample set.
For example, in the data set {2, 3, 3, 5, 7} the mode is 3, because it is the value repeated most often.
2. Measure of Variability –
Measure of Variability is also known as measure of dispersion and used to describe variability
in a sample or population. In statistics, there are three common measures of variability as
shown below:
● (i) Range:
It measures how spread apart the values in a sample or data set are.
Range = Maximum value - Minimum value
● (ii) Variance:
It describes how much the values differ from the mean (the expected value); it is the
average of the squared deviations from the mean:
$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
In this formula, $n$ is the total number of data points, $\bar{x}$ is the mean of the data points, and $x_i$
is an individual data point.
● (iii) Standard Deviation:
It measures the dispersion of a set of data values from its mean:
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
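As an illustrative sketch (not part of the original notes), these measures can be computed for a small made-up sample using only Python's standard library:

```python
import statistics

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # made-up sample values

mode = statistics.mode(data)               # most frequently occurring value
data_range = max(data) - min(data)         # Range = Maximum value - Minimum value
variance = statistics.pvariance(data)      # (1/n) * sum of squared deviations from the mean
std_dev = statistics.pstdev(data)          # square root of the variance

print(mode, data_range, variance, std_dev)
```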
Data Visualization
■ Why data visualization?
■ Gain insight into an information space by mapping data onto graphical primitives
■ Provide qualitative overview of large data sets
■ Search for patterns, trends, structure, irregularities, relationships among data
■ Help find interesting regions and suitable parameters for further quantitative
analysis
■ Provide a visual proof of computer representations derived
Why visualization?
Without visualization, mining and analysis lose much of their importance: data mining is about finding
inferences by analyzing data for patterns, and those patterns can only be communicated effectively
through different visualization techniques.
Techniques:
● Box plots
● Histograms
● Charts
● Tree maps
Box Plots
In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box
plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the
terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.
A boxplot is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample
median, and the first and third quartiles.
Minimum (Q0 or 0th percentile): the lowest data point excluding any outliers.
Maximum (Q4 or 100th percentile): the largest data point excluding any outliers.
Median (Q2 or 50th percentile): the middle value of the dataset.
First quartile (Q1 or 25th percentile): also known as the lower quartile qn(0.25), is the median of the lower half of the dataset.
Third quartile (Q3 or 75th percentile): also known as the upper quartile qn(0.75), is the median of the upper half of the dataset.
An important element used in constructing the box plot, because it determines the highest and lowest data values still considered
feasible (the whisker limits), but which is not part of the five-number summary, is the interquartile range:
Interquartile range (IQR): the distance between the upper and lower quartiles, IQR = Q3 − Q1.
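A hedged illustration (the values are made up) of computing the five-number summary and the IQR, and drawing a box plot with numpy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # made-up sample values

q1, median, q3 = np.percentile(data, [25, 50, 75])      # first quartile, median, third quartile
iqr = q3 - q1                                           # interquartile range
print(min(data), q1, median, q3, max(data), "IQR:", iqr)

plt.boxplot(data)                                       # whiskers extend up to 1.5 * IQR by default
plt.title("Box plot of the sample")
plt.show()
```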
Histogram
A histogram is a graphical display of data using bars of different heights; it describes the distribution of
quantitative data. A histogram divides the variable values into equal-sized intervals, and each bar groups
the numbers falling into one range. Taller bars show that more data falls in that range. A histogram
displays the shape and spread of continuous sample data. It is similar to a vertical bar graph; however,
unlike a vertical bar graph, a histogram shows no gaps between the bars.
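A minimal matplotlib sketch of a histogram over synthetic continuous data (the distribution and bin count are assumptions, not from the notes):

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=500)   # synthetic continuous sample

plt.hist(values, bins=10, edgecolor="black")            # 10 equal-width intervals, no gaps between bars
plt.xlabel("Value range")
plt.ylabel("Frequency")
plt.title("Histogram of the synthetic sample")
plt.show()
```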
Charts
Bar Chart
Charts made of rectangular bars, where the lengths of the bars are proportional to the values they
represent, are known as bar graphs or bar charts. A bar chart can be plotted vertically or horizontally.
Usually it is drawn vertically, where the x-axis represents the categories and the y-axis represents the
measured values.
Line Charts
It is a type of chart which displays information as a series of data points called markers connected by
straight line segments. Line graphs show how a continuous variable changes over time. The variable that
measures time is plotted on the x-axis. The continuous variable is plotted on the y-axis.
Pie Chart
It is a circular statistical graph which is divided into slices to illustrate numerical proportion. The
arc length of each slice is proportional to the quantity it represents.
Scatter plot
A scatterplot is a graphical way to display the relationship between two quantitative sample
variables. It consists of an X axis, a Y axis and a series of dots where each dot represents one
observation from a data set. The position of the dot refers to its X and Y values.
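The chart types above can be sketched with matplotlib; the categories and values below are made-up placeholders:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]          # hypothetical categories
counts = [23, 45, 12, 30]                  # hypothetical measures per category
years = [2018, 2019, 2020, 2021]
sales = [100, 120, 90, 140]                # hypothetical continuous variable over time

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].bar(categories, counts)         # bar chart: x-axis categories, y-axis measured values
axes[0, 0].set_title("Bar chart")

axes[0, 1].plot(years, sales, marker="o")  # line chart: change of a variable over time
axes[0, 1].set_title("Line chart")

axes[1, 0].pie(counts, labels=categories)  # pie chart: slices proportional to the quantities
axes[1, 0].set_title("Pie chart")

axes[1, 1].scatter(counts, sales)          # scatter plot: relationship of two quantitative variables
axes[1, 1].set_title("Scatter plot")

plt.tight_layout()
plt.show()
```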
Major Tasks in Data Preprocessing
■ Data cleaning
■ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation
Data Cleaning
■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
■ incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
■ e.g., Occupation=“ ” (missing data)
■ noisy: containing noise, errors, or outliers
■ e.g., Salary=“−10” (an error)
■ inconsistent: containing discrepancies in codes or names, e.g.,
■ Age=“42”, Birthday=“03/07/2010”
■ Was rating “1, 2, 3”, now rating “A, B, C”
■ discrepancy between duplicate records
■ Intentional (e.g., disguised missing data)
■ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
■ Data is not always available
■ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data may not be considered important at the time of entry
■ not register history or changes of the data
■ Missing data may need to be inferred
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute
varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill in it automatically with
■ a global constant : e.g., “unknown”, a new class?!
■ the attribute mean
■ the attribute mean for all samples belonging to the same class:
smarter
■ the most probable value: inference-based such as Bayesian formula
or decision tree
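A small pandas sketch of the filling strategies above, on a hypothetical table with missing customer income (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "class":  ["low", "low", "high", "high", "high"],
    "income": [30000, np.nan, 90000, np.nan, 85000],
})

dropped = df.dropna()                                    # ignore tuples with missing values
const_fill = df.fillna({"income": -1})                   # fill with a global constant / sentinel value
mean_fill = df.fillna({"income": df["income"].mean()})   # fill with the attribute mean

# Fill with the attribute mean of samples belonging to the same class (smarter)
class_mean_fill = df.copy()
class_mean_fill["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(class_mean_fill)
```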
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitation
■ inconsistency in naming convention
How to Handle Noisy Data?
■ Binning
■ first sort data and partition into (equal-frequency) bins
■ then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc. (a Python sketch of the worked example below is given after this list)
■ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
■ Regression
■ smooth by fitting the data into regression functions
Linear regression finds the best-fitting line between two variables so that one can be used
to predict the other. Using regression to find a mathematical equation that fits the data
helps smooth out the noise (a small sketch is given after this list).
■ Clustering
■ detect and remove outliers
■ Combined computer and human inspection
■ detect suspicious values and check by human (e.g., deal with possible
outliers)
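A short Python sketch (standard library only) that reproduces the equal-frequency binning and smoothing worked example above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins                             # equal-frequency (equi-depth) bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer bin boundary
by_bounds = [[b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b] for b in bins]

print(bins)        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```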
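And a minimal numpy sketch of smoothing by regression: fit a straight line to noisy (synthetic) data and replace the observed values with the fitted ones:

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0 + np.random.normal(scale=2.0, size=x.size)   # synthetic noisy values

slope, intercept = np.polyfit(x, y, deg=1)   # best-fitting line y = slope * x + intercept
y_smoothed = slope * x + intercept           # values on the fitted line, noise smoothed out

print(np.round(y_smoothed, 2))
```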
Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from different sources are different
■ Possible reasons: different representations, different scales, e.g., metric vs. British
units
Data Reduction Strategies
■ Data reduction: Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical results
■ Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
■ Data reduction strategies
■ Dimensionality reduction, e.g., remove unimportant attributes
■ Wavelet transforms (store only a small fraction of the strongest wavelet coefficients)
■ Principal Components Analysis (PCA) (the original data are projected onto a
much smaller space, resulting in dimensionality reduction; a short sketch is given after this list)
■ Feature subset selection, feature creation
■ Numerosity reduction (some simply call it: Data Reduction)
■ Regression and Log-Linear Models
■ Histograms, clustering, sampling
■ Data cube aggregation
■ Data compression
The data compression technique reduces the size of files using different encoding
mechanisms (Huffman encoding and run-length encoding). It can be divided into two types,
lossless and lossy compression, depending on whether the original data can be reconstructed
exactly from the compressed data.
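An illustrative scikit-learn sketch of PCA-based dimensionality reduction on synthetic data (the data shapes and component count are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                         # two underlying independent directions
X = np.hstack([base, base @ rng.normal(size=(2, 3))])    # 100 samples, 5 correlated attributes

pca = PCA(n_components=2)            # keep the two strongest principal components
X_reduced = pca.fit_transform(X)     # project the data onto the much smaller space

print(X.shape, "->", X_reduced.shape)       # (100, 5) -> (100, 2)
print(pca.explained_variance_ratio_)        # fraction of variance kept per component
```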
Numerosity reduction
Min-Max Normalization
In this technique of data normalization, a linear transformation is performed on the original data. The minimum and
maximum values of the attribute A are found, and each value v is replaced according to the formula
$v' = \frac{v - \min_A}{\max_A - \min_A}(new\_max_A - new\_min_A) + new\_min_A$
which maps v into the new range $[new\_min_A, new\_max_A]$.
Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the data A. The formula used is:
$v' = \frac{v - \bar{A}}{\sigma_A}$
where $\bar{A}$ is the mean and $\sigma_A$ the standard deviation of attribute A.
Decimal Scaling Method For Normalization –
It normalizes by moving the decimal point of the values of the data. Each value is divided by a power of 10
large enough that the largest absolute normalized value is less than 1. A data value, vi, is normalized to vi' by
using the formula below –
$v_i' = \frac{v_i}{10^j}$
where j is the smallest integer such that $\max(|v_i'|) < 1$.
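A numpy sketch of the three normalization techniques above (the attribute values are made up):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalization based on the mean and standard deviation."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Decimal scaling: divide by 10^j, the smallest power of 10 with max(|v'|) < 1."""
    j = 0
    while np.abs(v).max() / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values
print(min_max(values))          # [0.    0.125 0.25  0.5   1.   ]
print(np.round(z_score(values), 3))
print(decimal_scaling(values))  # divided by 10^4 -> [0.02 0.03 0.04 0.06 0.1 ]
```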
Discretization
■ Three types of attributes
■ Nominal—values from an unordered set, e.g., color, profession
■ Ordinal—values from an ordered set, e.g., military or academic rank
■ Numeric—quantitative values, e.g., integer or real numbers
■ Discretization: Divide the range of a continuous attribute into intervals
■ Interval labels can then be used to replace actual data values (see the sketch after this list)
■ Reduce data size by discretization
■ Supervised vs. unsupervised
■ Split (top-down) vs. merge (bottom-up)
■ Discretization can be performed recursively on an attribute
■ Prepare for further analysis, e.g., classification
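A small pandas sketch of unsupervised, equal-width discretization where interval labels replace the actual values (the data and labels are assumptions):

```python
import pandas as pd

ages = pd.Series([13, 15, 22, 25, 33, 35, 45, 46, 52, 70])   # hypothetical continuous attribute

# Split the value range into 3 equal-width intervals and replace each value
# with its interval label, reducing the data to a few discrete categories.
labels = pd.cut(ages, bins=3, labels=["low", "mid", "high"])
print(labels.tolist())
```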
Concept Hierarchy Generation
■ Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
■ Concept hierarchies facilitate drilling and rolling in data warehouses to
view data at multiple levels of granularity
■ Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
■ Concept hierarchies can be explicitly specified by domain experts and/or
data warehouse designers
■ Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
A concept hierarchy for a given numeric attribute defines a discretization of the attribute. Concept
hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values
for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
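A tiny sketch of such a replacement for a numeric age attribute; the cut points (25 and 60) are illustrative assumptions, not values from the notes:

```python
import pandas as pd

ages = pd.Series([16, 23, 35, 47, 58, 67, 72])   # hypothetical low-level numeric values

# Replace numeric ages with higher-level concepts (assumed cut points at 25 and 60)
concepts = pd.cut(ages, bins=[0, 25, 60, 120], labels=["young", "middle-aged", "senior"])
print(concepts.tolist())
```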
Discretization and Concept Hierarchy Generation for
Numerical Data:
Binning: Attribute values can be discretized by distributing the values into bins and replacing
each bin by the bin mean or bin median value. This technique can be applied recursively
to the resulting partitions in order to generate concept hierarchies.
Histogram Analysis: Histograms can also be used for discretization. Partitioning rules can be
applied to define ranges of values. The histogram analysis algorithm can be applied recursively
to each partition in order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has been reached. A
minimum interval size can be used per level to control the recursive procedure; this specifies the
minimum width of a partition, or the minimum number of values in each partition, at each level.
Cluster Analysis: A clustering algorithm can be applied to partition data into clusters or groups. Each
cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level. Each
cluster may be further decomposed into sub-clusters, forming a lower level in the hierarchy. Clusters
may also be grouped together to form a higher conceptual level of the hierarchy.
Segmentation by natural partitioning: Breaking up annual salaries into uniform ranges like
($50,000-$100,000) is often more desirable than ranges like ($51,263.89-$60,765.3) arrived at by
cluster analysis. The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural"
intervals. In general, the rule partitions a given range of data into 3, 4, or 5 equal-width intervals, recursively,
level by level, based on the value range at the most significant digit. The rule can be applied recursively to
each interval, creating a concept hierarchy for the given numeric attribute.
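A simplified Python sketch of the core 3-4-5 split (one level only; it ignores the recursion and the special 2-3-2 handling when there are 7 distinct values at the most significant digit):

```python
import math

def three_four_five_split(low, high):
    """Split [low, high] into 3, 4, or 5 equal-width intervals based on the
    number of distinct values at the most significant digit."""
    msd = 10 ** int(math.floor(math.log10(high - low)))   # most significant digit position
    lo = msd * math.floor(low / msd)                      # round low down at that digit
    hi = msd * math.ceil(high / msd)                      # round high up at that digit
    distinct = int(round((hi - lo) / msd))                # distinct values at the msd
    if distinct in (3, 6, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                                 # 1, 5, 10 (and 7, simplified here)
        parts = 5
    width = (hi - lo) / parts
    return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]

print(three_four_five_split(-351, 4700))
# [(-1000.0, 1000.0), (1000.0, 3000.0), (3000.0, 5000.0)]
```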
Discretization and Concept Hierarchy Generation for
Categorical Data: