Data Mining Unit I Notes
DATA MINING
Data
Data is a collection of raw facts and figures, such as numbers, text, measurements, and observations, which carries little meaning on its own until it is processed into information.
Types of Attributes
Identifying attribute types is the first step of data preprocessing: we distinguish between the different types of attributes and then preprocess the data accordingly. Attributes fall into two broad groups:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric: Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes – related to names: The values of a nominal attribute are names of things or symbols. Each value represents some category or state, which is why nominal attributes are also referred to as categorical attributes. There is no order (rank or position) among the values of a nominal attribute.
Example: hair colour (black, brown, grey), occupation, or marital status.
2. Binary Attributes: A binary attribute is a nominal attribute with only two values/states, for example yes or no, affected or unaffected, true or false.
Symmetric: Both values are equally important (e.g., gender).
Asymmetric: The two values are not equally important (e.g., the positive and negative outcomes of a medical test).
3. Ordinal Attributes: The values have a meaningful order or ranking among them, but the magnitude of the difference between successive values is not known (e.g., grade: A, B, C, D; or size: small, medium, large).
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative, i.e., a measurable quantity represented by integer or real values. Numeric attributes can be discrete or continuous.
2. Discrete: Discrete data have a finite or countably infinite set of values.
Example: number of items purchased, number of students in a class.
3. Continuous: Continuous data have an infinite number of possible states and are typically of float type; there can be many values between 2 and 3.
Example: height, weight, or temperature.
Types of Data
Flat files: Flat files are actually the most common data source for data mining algorithms,
especially at the research level. Flat files are simple data files in text or binary format with a
structure known by the data mining algorithm to be applied. The data in these files can be
transactions, time-series data, scientific measurements, etc.
Relational databases: A relational database consists of a set of tables, each of which has a set of attributes (columns) and usually stores a large set of tuples (rows). The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query could select the videos grouped by category and count them, as sketched below.
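As an illustrative sketch (the videos table, its columns, and its rows are assumptions, not part of the original notes), such a query might look as follows, shown here through Python's built-in sqlite3 module:

# Sketch: an SQL query that groups videos by category, run via Python's
# built-in sqlite3 module. Table name, columns, and rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE videos (video_id INTEGER, title TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO videos VALUES (?, ?, ?)",
    [(1, "A", "Comedy"), (2, "B", "Comedy"), (3, "C", "Drama")],
)

# Select videos grouped by category, with an aggregate count per category.
for category, count in conn.execute(
    "SELECT category, COUNT(*) FROM videos GROUP BY category"
):
    print(category, count)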
Data mining algorithms using relational databases can be more versatile than data mining
algorithms specifically written for flat files, since they can take advantage of the structure
inherent to relational databases. While data mining can benefit from SQL for data selection,
transformation and consolidation, it goes beyond what SQL can provide, offering capabilities such as prediction, comparison, and the detection of deviations.
Transactional databases
In general, a transactional database consists of a flat file where each record represents a
transaction. A transaction typically includes a unique transaction identity number (trans ID),
and a list of the items making up the transaction (such as items purchased in a store) as
shown below:
SALES (illustrative contents)
trans_ID    list of item IDs
T100        I1, I3, I8, I16
T200        I2, I8
Spatial databases
A spatial database contains spatial-related data, which may be represented in the form of raster or vector data. Raster data consists of n-dimensional bit maps or pixel maps, while vector data are represented by lines, points, polygons or other kinds of processed primitives. Some examples of spatial databases include geographical (map) databases, VLSI chip designs, and medical and satellite image databases.
Data warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and which usually resides at a single site. Data warehouses are constructed via a
process of data cleansing, data transformation, data integration, data loading, and periodic data
refreshing. The figure shows the basic architecture of a data warehouse.
In order to facilitate decision making, the data in a data warehouse are organized around major
subjects, such as customer, item, supplier, and activity. The data are stored to provide
information from a historical perspective and are typically summarized.
A data warehouse is usually modeled by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores
the value of some aggregate measure, such as count or sales amount. The actual physical
structure of a data warehouse may be a relational data store or a multidimensional data cube. It
provides a multidimensional view of data and allows the precomputation and fast accessing of
summarized data.
The data cube structure that stores the primitive or lowest level of information is called a base
cuboid. Its corresponding higher level multidimensional (cube) structures are called (non-base)
cuboids. A base cuboid together with all of its corresponding higher level cuboids form a data
cube. By providing multidimensional data views and the precomputation of summarized data,
data warehouse systems are well suited for On- Line Analytical Processing, or OLAP. OLAP
operations make use of background knowledge regarding the domain of the data being studied
in order to allow the presentation of data at different levels of abstraction. Such operations
accommodate different user viewpoints. Examples of OLAP operations include drill-down and
roll-up, which allow the user to view the data at differing degrees of summarization, as
illustrated in above figure.
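As a minimal sketch of these OLAP operations (using pandas; the dimensions, hierarchy, and figures are illustrative assumptions, not from the notes):

# Sketch: OLAP-style drill-down and roll-up on a toy sales "cube" using pandas.
# All column names and values are illustrative.
import pandas as pd

sales = pd.DataFrame({
    "year":     [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter":  ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "category": ["Comedy", "Drama", "Comedy", "Drama", "Comedy", "Drama"],
    "amount":   [100, 150, 120, 90, 130, 160],
})

# Drill-down view: sales at the finer (year, quarter, category) level of detail.
detail = sales.groupby(["year", "quarter", "category"])["amount"].sum()

# Roll-up: climb the time hierarchy from quarter up to year (coarser summary).
rollup = sales.groupby(["year", "category"])["amount"].sum()

print(detail)
print(rollup)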
Multimedia databases
A multimedia database stores images, audio, and video data, and is used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces.
Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between the evolutions of different variables, as well as the prediction of trends and movements of the variables in time.
The World-Wide Web provides rich, world-wide, on-line information services, where data objects
are linked together to facilitate interactive access. Some examples of distributed information
services associated with the World-Wide Web include America Online, Yahoo!, AltaVista, and
Prodigy.
Data mining
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals to extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Database (KDD). The knowledge discovery process includes Data
cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, and
Knowledge presentation.
Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process is an iterative sequence of the following steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
The architecture of a typical data mining system may have the following major components:
1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server. This component is responsible for fetching the relevant data, based on the user's data mining request.
3. Knowledge base. This is the domain knowledge that is used to guide the search, or
evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction. Knowledge such as
user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness,
may also be included.
4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, and evolution and deviation analysis.
5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
6. Graphical user interface. This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query or task,
providing information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results.
Data Mining Functionalities
Data mining functions are used to describe the trends or correlations uncovered by data mining activities. Broadly, data mining activities can be divided into two categories:
o Descriptive Data Mining: It summarizes what is happening within the data without any prior hypothesis, highlighting the common features of the data set, for example counts and averages.
o Predictive Data Mining: It predicts values or class labels for attributes that are not yet known. Using previously available or historical data, data mining can make predictions about critical business metrics based on trends in the data, for example predicting the volume of business next quarter from performance in the previous quarters over several years, or judging from the findings of a patient's medical examinations whether the patient is suffering from a particular disease.
Data mining functionalities are used to represent the types of patterns that have to be discovered in data mining tasks. Data mining is used extensively in many areas and sectors to predict and characterize data, but the ultimate objective of the data mining functionalities is to observe the various trends in the data. Several functionalities that organized, scientific data mining methods offer are described below.
1. Class/Concept Descriptions
A class or concept refers to a data set or set of features that defines the class or the concept. A class can be a category of items on a shop floor, while a concept could be the abstract idea on which the data may be categorized, such as products to be put on clearance sale versus non-sale products. Class/concept descriptions serve two purposes: one helps with grouping and the other helps with differentiating.
o Data Characterization: This refers to the summary of general characteristics or features of the
class, resulting in specific rules that define a target class. A data analysis technique called
Attribute-oriented Induction is employed on the data set for achieving characterization.
o Data Discrimination: Discrimination is used to separate distinct data sets based on the disparity in attribute values. It compares the features of a class with the features of one or more contrasting classes, and the results can be presented, e.g., as bar charts, curves and pie charts.
2. Mining Frequent Patterns
One of the functions of data mining is finding frequent patterns, i.e., things that occur most commonly in the data. Several kinds of frequent patterns can be found in a dataset:
o Frequent item set:This term refers to a group of items that are commonly found together, such as
milk and sugar.
o Frequent substructure: It refers to the various types of data structures that can be combined with
an item set or subsequences, such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a phone followed by a cover.
3. Association Analysis
It analyses the set of items that generally occur together in a transactional dataset. It is also known
as Market Basket Analysis for its wide use in retail sales. Two parameters are used for determining
the association rules:
Support: It indicates how frequently the item set appears in the transactional database.
Confidence: It is the conditional probability that an item occurs in a transaction given that another item occurs in that transaction.
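A minimal sketch of computing support and confidence for a candidate rule such as {milk} => {sugar} over a handful of transactions (the items, transactions, and rule are illustrative assumptions):

# Sketch: support and confidence of the rule {milk} => {sugar} over a toy
# transactional data set. Items and transactions are illustrative.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
]

antecedent, consequent = {"milk"}, {"sugar"}
n = len(transactions)

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
has_antecedent = sum(1 for t in transactions if antecedent <= t)

support = both / n                   # fraction of transactions with milk and sugar
confidence = both / has_antecedent   # P(sugar | milk)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")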
4. Classification
Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then, decision trees or neural networks to predict a
class or essentially classify a collection of items. A training set containing items whose
properties are known is used to train the system to predict the category of items from an
unknown collection of items.
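A minimal classification sketch using scikit-learn's decision tree (the tiny training set, its feature meanings, and the class labels are illustrative assumptions):

# Sketch: train a decision tree on a labelled training set, then predict the
# class of new, unseen items. Requires scikit-learn; data is illustrative.
from sklearn.tree import DecisionTreeClassifier

# Training set: each item is [age, income]; labels say whether the item buys.
X_train = [[25, 30000], [45, 80000], [35, 60000], [22, 20000]]
y_train = ["no", "yes", "yes", "no"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Predict the category of items from an unknown collection.
print(model.predict([[40, 70000], [23, 25000]]))  # e.g. ['yes' 'no']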
5. Prediction
It predicts unavailable data values or future trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes. The prediction can be of missing numerical values or of increasing/decreasing trends in time-related data. There are primarily two types of predictions in data mining: numeric predictions and class predictions.
o Numeric predictions are made by creating a linear regression model that is based on historical data.
Prediction of numeric values helps businesses ramp up for a future event that might impact the
business positively or negatively.
o Class predictions are used to fill in missing class information for products using a training data set
where the class for products is known.
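A minimal numeric-prediction sketch: a linear regression fitted to historical quarterly volumes and extrapolated to the next quarter (the figures are illustrative assumptions):

# Sketch: numeric prediction with a simple linear regression fitted to
# historical data, using NumPy. The quarterly volumes are illustrative.
import numpy as np

quarters = np.array([1, 2, 3, 4, 5, 6, 7, 8])                   # past quarters
volume   = np.array([100, 110, 118, 130, 141, 150, 162, 171])   # business volume

slope, intercept = np.polyfit(quarters, volume, deg=1)  # fit volume = slope*q + intercept
next_quarter = 9
prediction = slope * next_quarter + intercept
print(f"predicted volume for quarter {next_quarter}: {prediction:.1f}")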
6. Cluster Analysis
In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes
represent the classes. Similar data are grouped together, with the difference being that a class label
is not known. Clustering algorithms group data based on similar features and dissimilarities.
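A minimal clustering sketch with scikit-learn's k-means, grouping unlabelled points by similarity (the points and the choice of two clusters are illustrative assumptions):

# Sketch: grouping unlabelled data by similarity with k-means clustering.
# Requires scikit-learn; the points and number of clusters are illustrative.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one natural group
          [8.0, 8.2], [7.9, 8.1], [8.1, 7.8]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster label assigned to each point
print(kmeans.cluster_centers_)  # centre of each discovered cluster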
7. Outlier Analysis
Outlier analysis is important to understand the quality of data. If there are too many outliers, you
cannot trust the data or draw reliable patterns. Outlier analysis determines whether there is something out of the ordinary in the data and whether it indicates a situation that a business needs to consider and take measures to mitigate. Data that cannot be grouped into any class by the algorithms is pulled up as outliers for further analysis.
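A minimal outlier-analysis sketch using a simple z-score rule (the data and the threshold of 2 standard deviations are illustrative assumptions):

# Sketch: flag outliers as values more than 2 standard deviations from the
# mean (a simple z-score rule). Data and threshold are illustrative.
import numpy as np

values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])  # 25.0 looks suspicious

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]
print("outliers:", outliers)  # flags 25.0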
8. Evolution Analysis
Evolution analysis pertains to the study of data sets that change over time. Evolution analysis models are designed to capture evolutionary trends in data, helping to characterize, classify, cluster or discriminate time-related data.
9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another. It determines how well two numerically measured continuous variables are linked. Researchers can use this type of analysis to see whether there are any possible correlations between the variables in their study.
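A minimal correlation-analysis sketch using the Pearson correlation coefficient via NumPy (the two variables and their values are illustrative assumptions):

# Sketch: measure how strongly two continuous variables are linearly related
# using the Pearson correlation coefficient. Data is illustrative.
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score    = np.array([52, 55, 61, 64, 70, 74])

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(f"Pearson correlation: {r:.2f}")  # close to +1 => strong positive relation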
Interestingness of Patterns
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with
some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also interesting if it
validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.
Objective measures of pattern interestingness are based on the structure of discovered patterns and the statistics underlying them. An objective measure for association rules of the form X => Y is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X U Y), where X U Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y.
Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.
In general, each interestingness measure is associated with a threshold, which may be controlled
by the user. For example, rules that do not satisfy a confidence threshold of, say, 50% can be
considered uninteresting. Rules below the threshold likely reflect noise, exceptions, or minority
cases and are probably of less value.
Other objective interestingness measures include accuracy and coverage for classification(IF-
THEN) rules. In general terms, accuracy tells us the percentage of data that are correctly
classified by a rule. Coverage is similar to support, in that it tells us the percentage of data to
which a rule applies.
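A minimal sketch of computing accuracy and coverage for an IF-THEN classification rule over labelled tuples (the rule and the data are illustrative assumptions):

# Sketch: accuracy and coverage of the rule
#   IF age < 30 THEN class = "student"
# over a toy set of labelled tuples. The rule and data are illustrative.
tuples = [
    {"age": 22, "class": "student"},
    {"age": 25, "class": "student"},
    {"age": 28, "class": "worker"},
    {"age": 40, "class": "worker"},
    {"age": 55, "class": "worker"},
]

covered = [t for t in tuples if t["age"] < 30]             # tuples the rule applies to
correct = [t for t in covered if t["class"] == "student"]  # covered tuples classified correctly

coverage = len(covered) / len(tuples)   # percentage of data the rule applies to
accuracy = len(correct) / len(covered)  # percentage of covered data correctly classified
print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")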
Subjective interestingness measures are based on user beliefs in the data. These measures find
patterns interesting if the patterns are unexpected (contradicting a user’s belief) or offer strategic
information on which the user can act. In the latter case, such patterns are referred to as
actionable. For example, patterns like “a large earthquake often follows a cluster of small
quakes” may be highly actionable if users can act on the information to save lives. Patterns that
are expected can be interesting if they confirm a hypothesis that the user wishes to validate or
they resemble a user’s hunch.
It is often unrealistic and inefficient for data mining systems to generate all possible patterns.
Instead, user provided constraints and interestingness measures should be used to focus the
search. For some mining tasks, such as association, this is often sufficient to ensure the
completeness of the algorithm. Association rule mining is an example where the use of
constraints and interestingness measures can ensure the completeness of mining.
It is highly desirable for data mining systems to generate only interesting patterns. This would be
efficient for users and data mining systems because neither would have to search through the
patterns generated to identify the truly interesting ones. Progress has been made in this direction;
however, such optimization remains a challenging issue in data mining. Measures of pattern
interestingness are essential for the efficient discovery of patterns by target users. Such measures
can be used after the data mining step to rank the discovered patterns according to their
interestingness, filtering out the uninteresting ones. More importantly, such measures can be used
to guide and constrain the discovery process, improving the search efficiency by pruning away
subsets of the pattern space that do not satisfy prespecified interestingness constraints.
Data Mining System Classification
A data mining system can be classified according to the disciplines it draws on, such as:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of
(a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database system
can be classified according to different criteria such as data models, types of data, etc. And the
data mining system can be classified accordingly. For example, if we classify a database
according to the data model, then we may have a relational, transactional, object-relational, or
data warehouse mining system.
Classification Based on the Kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined, i.e., based on functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted, such as:
Finance
Telecommunications
DNA
Stock Markets
E-mail
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively communicate with the data mining system during discovery to direct the mining process or examine the findings from different angles or depths. The data mining primitives specify the set of task-relevant data to be mined, the kind of knowledge to be mined, the background knowledge to be used in the discovery process, the interestingness measures and thresholds for pattern evaluation, and the expected representation for visualizing the discovered patterns. These primitives are discussed below.
Set of task-relevant data to be mined
This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (the relevant attributes or dimensions). In a relational database, the set of task-relevant data can be collected via a relational query involving operations like selection, projection, join, and aggregation. The data collection process results in a new data relation called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task. The initial relation may or may not correspond to a physical relation in the database. Since virtual relations are called views in the field of databases, the set of task-relevant data for data mining is called a minable view.
Background knowledge to be used in the discovery process
This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more general concepts.
o Rolling Up - Generalization of data: Allows data to be viewed at more meaningful and explicit levels of abstraction, making it easier to understand. It also compresses the data, so fewer input/output operations are required.
o Drilling Down - Specialization of data: Concept values replaced by lower-level concepts.
Based on different user viewpoints, there may be more than one concept hierarchy for a given
attribute or dimension.
An example of a concept hierarchy for the attribute (or dimension) age is shown below. User
beliefs regarding relationships in the data are another form of background knowledge.
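A minimal sketch of such a hierarchy for age, written as a mapping from raw values to higher-level concepts (the ranges and labels are illustrative assumptions):

# Sketch: a concept hierarchy for the attribute "age", mapping low-level
# values to higher-level concepts. Ranges and labels are illustrative.
def age_concept(age: int) -> str:
    """Map a raw age value to a higher-level concept."""
    if age < 20:
        return "youth"
    elif age < 40:
        return "young adult"
    elif age < 60:
        return "middle aged"
    return "senior"

# Rolling up replaces raw values with more general concepts:
print([age_concept(a) for a in [15, 23, 37, 45, 68]])
# ['youth', 'young adult', 'young adult', 'middle aged', 'senior']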
Interestingness measures and thresholds for pattern evaluation
Different kinds of knowledge may have different interestingness measures. These measures may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall
simplicity for human comprehension. For example, the more complex the structure of a rule is,
the more difficult it is to interpret, and hence, the less interesting it is likely to be. Objective
measures of pattern simplicity can be viewed as functions of the pattern structure, defined in
terms of the pattern size in bits or the number of attributes or operators appearing in the pattern.
o Certainty (Confidence): Each discovered pattern should have a measure of certainty
associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty
measure for association rules of the form "A =>B" where A and B are sets of items is
confidence. Confidence is a certainty measure. Given a set of task-relevant data tuples, the
confidence of "A=>B" is defined as
Confidence (A=>B) = # tuples containing both A and B /# tuples containing A
o Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness.
It can be estimated by a utility function, such as support. The support of an association pattern
refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true.
Utility (support): usefulness of a pattern
Support (A=>B) = # tuples containing both A and B / total #of tuples
o Novelty: Novel patterns are those that contribute new information or increased performance to the given pattern set, for example a data exception. Another strategy for detecting novelty is to remove redundant patterns.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations. Users must be able to specify the forms of presentation to be used for displaying the discovered patterns. Some representation forms may be better suited than others for particular kinds of knowledge. For example, generalized relations and their corresponding cross tabs or pie/bar charts are good for presenting characteristic descriptions, whereas decision trees are common for classification.
Integration of a Data Mining System with a Data Warehouse
If a data mining system is not integrated with a database or a data warehouse system, then there
will be no system to communicate with. This scheme is known as the non-coupling scheme. In
this scheme, the main focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets.
No Coupling − In this scheme, the data mining system does not utilize any of the database or data
warehouse functions. It fetches the data from a particular source and processes that data using
some data mining algorithms. The data mining result is stored in another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining results either in a file or in a designated place in a database or a data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a data
warehouse system and in addition to that, efficient implementations of a few data mining
primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into the
database or data warehouse system. The data mining subsystem is treated as one functional
component of an information system.
Major Issues in Data Mining
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data sources.
These factors also create some issues.
Mining Methodology
Mining methodologies should consider issues such as data uncertainty, noise, and incompleteness. Some mining methods explore how user-specified measures can be used to assess the interestingness of discovered patterns as well as to guide the discovery process.
Mining various and new kinds of knowledge: Data mining covers a wide spectrum of data
analysis and knowledge discovery tasks, from data characterization and discrimination to
association and correlation analysis, classification, regression, clustering, outlier analysis,
sequence analysis, and trend and evolution analysis. These tasks may use the same database in
different ways and require the development of numerous data mining techniques. Due to the
diversity of applications, new mining tasks continue to emerge, making data mining a dynamic
and fast-growing field. For example, for effective knowledge discovery in information networks,
integrated clustering and ranking may lead to the discovery of high-quality clusters and object
ranks in large networks.
Mining knowledge in multidimensional space: When searching for knowledge in large data
sets, we can explore the data in multidimensional space. That is, we can search for interesting
patterns among combinations of dimensions (attributes) at varying levels of abstraction. Such
mining is known as (exploratory) multidimensional data mining. In many cases, data can be
aggregated or viewed as a multidimensional data cube. Mining knowledge in cube space can
substantially enhance the power and
flexibility of data mining.
Data mining—an interdisciplinary effort: The power of data mining can be substantially
enhanced by integrating new methods from multiple disciplines. For example, to mine data with
natural language text, it makes sense to fuse data mining methods with methods of information
retrieval and natural language processing. As another example, consider the mining of software
bugs in large programs. This form of mining, known as bug mining, benefits from the
incorporation of software engineering knowledge into the data mining process.
Boosting the power of discovery in a networked environment: Most data objects reside in a
linked or interconnected environment, whether it be the Web, database relations, files, or
documents. Semantic links across multiple data objects can be used to advantage in data mining.
Knowledge derived in one set of objects can be used to boost the discovery of knowledge in a
“related” or semantically linked set of objects.
Handling uncertainty, noise, or incompleteness of data: Data often contain noise, errors,
exceptions, or uncertainty, or are incomplete. Errors and noise may confuse the data mining
process, leading to the derivation of erroneous patterns. Data cleaning, data pre-processing,
outlier detection and removal, and uncertainty reasoning are examples of techniques that need to
be integrated with the data mining process.
Pattern evaluation and pattern- or constraint-guided mining: Not all the patterns generated by
data mining processes are interesting. What makes a pattern interesting may vary from user to
user. Therefore, techniques are needed to assess the interestingness of discovered patterns based
on subjective measures. These estimate the value of patterns with respect to a given user class,
based on user beliefs or expectations. Moreover, by using interestingness measures or user-
specified constraints to guide the discovery process, we may generate more interesting patterns
and reduce the search space.
User Interaction
The user plays an important role in the data mining process. Interesting areas of research include
how to interact with a data mining system, how to incorporate a user’s background knowledge in
mining, and how to visualize and comprehend data mining results.
Interactive mining: The data mining process should be highly interactive. Thus, it is important
to build flexible user interfaces and an exploratory mining environment, facilitating the user’s
interaction with the system. A user may like to first sample a set of data, explore general
characteristics of the data, and estimate potential mining results. Interactive mining should allow
users to dynamically change the focus of a search, to refine mining requests based on returned
results, and to drill, dice, and pivot through the data and knowledge space interactively,
dynamically exploring “cube space” while mining.
Presentation and visualization of data mining results: How can a data mining system present
data mining results, vividly and flexibly, so that the discovered knowledge can be easily
understood and directly usable by humans? This is especially crucial if the data mining process is
interactive. It requires the system to adopt expressive knowledge representations, user-friendly
interfaces, and visualization techniques.
Efficiency and Scalability
Efficiency and scalability of data mining algorithms: Data mining algorithms must be efficient
and scalable in order to effectively extract information from huge amounts of data in many data
repositories or in dynamic data streams. In other words, the running time of a data mining
algorithm must be predictable, short, and acceptable by applications. Efficiency, scalability,
performance, optimization, and the ability to execute in real time are key criteria that drive the
development of many new data mining algorithms.
Parallel, distributed, and incremental mining algorithms: The humongous size of many data
sets, the wide distribution of data, and the computational complexity of some data mining
methods are factors that motivate the development of parallel and distributed data-intensive
mining algorithms. Such algorithms first partition the data into “pieces.” Each piece is
processed, in parallel, by searching for patterns. The parallel processes may interact with one
another. The patterns from each partition are eventually merged.
Cloud computing and cluster computing, which use computers in a distributed and
collaborative way to tackle very large-scale computational tasks, are also active research themes
in parallel data mining. In addition, the high cost of some data mining processes and the
incremental nature of input promote incremental data mining, which incorporates new data
updates without having to mine the entire data “from scratch.” Such methods perform knowledge
modification incrementally to amend and strengthen what was previously discovered.
Diversity of Database Types
Handling complex types of data: Diverse applications generate a wide spectrum of new data
types, from structured data such as relational and data warehouse data to semi-structured and
unstructured data; from stable data repositories to dynamic data streams; from simple data
objects to temporal data, biological sequences, sensor data, spatial data, hypertext data,
multimedia data, software program code,Web data, and social network data. It is unrealistic to
expect one data mining system to mine all kinds of data, given the diversity of data types and the
different goals of data mining. Domain- or application-dedicated data mining systems are being
constructed for in depth mining of specific kinds of data. The construction of effective and
efficient data mining tools for diverse applications remains a challenging and active area of
research.
Mining dynamic, networked, and global data repositories: Multiple sources of data are
connected by the Internet and various kinds of networks, forming gigantic, distributed, and
heterogeneous global information systems and networks. The discovery of knowledge from
different sources of structured, semi-structured, or unstructured yet interconnected data with
diverse data semantics poses great challenges to data mining. Mining such gigantic,
interconnected information networks may help disclose many more patterns and knowledge in
heterogeneous data sets than can be discovered from a small set of isolated data repositories.
Web mining, multisource data mining, and information network mining have become
challenging and fast-evolving data mining fields.
Data Preprocessing
Data pre-processing is the process of transforming raw data into an understandable format. It is
also an important step in data mining as we cannot work with raw data. The quality of the data
should be checked before applying machine learning or data mining algorithms. Preprocessing of data is mainly to check the data quality. The quality can be checked by factors such as the following:
Accuracy: Whether the data entered is correct.
Completeness: Whether the data is available and not missing.
Consistency: Whether the same data kept in different places matches.
Timeliness: Whether the data is up to date.
Believability: Whether the data is trustworthy.
Interpretability: Whether the data is easy to understand.
There are four major tasks in data preprocessing: data cleaning, data integration, data reduction, and data transformation.
Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the datasets; it also replaces missing values. Some techniques for handling missing values are:
Standard values like “Not Available” or “NA” can be used to replace the missing values.
Missing values can also be filled manually, but it is not recommended when that dataset is big.
The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
While using regression or decision tree algorithms, the missing value can be replaced by the most probable value predicted by the model.
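A minimal sketch of these replacement strategies with pandas (the columns and their values are illustrative assumptions):

# Sketch: replacing missing values using pandas. Columns are illustrative.
import pandas as pd

income  = pd.Series([30000.0, 42000.0, None, 38000.0, None, 41000.0])
segment = pd.Series(["gold", None, "silver", "gold", None, "silver"])

filled_mean     = income.fillna(income.mean())     # mean, for roughly normal data
filled_median   = income.fillna(income.median())   # median, for skewed data
filled_standard = segment.fillna("Not Available")  # standard placeholder value

print(filled_mean.tolist())
print(filled_standard.tolist())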
Handling noisy data: Noisy data generally means data containing random error or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to the optimization of the model we are using. Here are some techniques for handling noisy data:
Binning: This method is used to smooth or handle noisy data. First the data is sorted, then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin. Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundary: the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
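A minimal sketch of equal-depth binning with smoothing by bin means and bin boundaries (the values and the bin depth are illustrative assumptions):

# Sketch: smoothing sorted data by equal-depth binning, using bin means and
# bin boundaries. The values and the bin depth are illustrative.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3                                                 # values per bin
bins = [values[i:i + depth] for i in range(0, len(values), depth)]

# Smoothing by bin means: each value becomes the mean of its bin.
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value becomes the closest bin boundary.
by_boundaries = [
    [min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
    for b in bins
]

print("bins:         ", bins)
print("by means:     ", by_means)
print("by boundaries:", by_boundaries)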
Regression: This is used to smooth the data and helps handle data when unnecessary data is present. For analysis purposes, regression helps decide which variables are suitable for our analysis.
Clustering: This is used for finding outliers and also for grouping the data. Clustering is generally used in unsupervised learning.
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. Some problems to be considered during data integration are:
Schema integration: Integrates metadata(a set of data that describes other data) from different
sources.
Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that the student id in one database and the student name in another database refer to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may differ when merged; the attribute values in one database may differ from those in another database. For example, the date format may differ, like "MM/DD/YYYY" or "DD/MM/YYYY".
Data Reduction
This process helps in the reduction of the volume of the data, which makes the analysis easier yet
produces the same or almost the same result. This reduction also helps to reduce storage space.
Some of the data reduction techniques are dimensionality reduction, numerosity reduction, and
data compression.
Dimensionality reduction: This process is necessary for real-world applications, as the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set goes down. It combines and merges the attributes of the data without losing their original characteristics, which also reduces storage space and computation time. When the data is highly dimensional, a problem called the "curse of dimensionality" occurs.
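A minimal dimensionality-reduction sketch with principal component analysis from scikit-learn (the synthetic data and the choice of two components are illustrative assumptions):

# Sketch: reduce a 4-dimensional data set to 2 dimensions with PCA.
# Requires scikit-learn and NumPy; the data is randomly generated.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))         # 100 records with 4 attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # same records, now 2 attributes each

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component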
Numerosity Reduction: In this method, the representation of the data is made smaller by
reducing the volume. There will not be any loss of data in this reduction.
Data compression: Data compression means representing the data in a compressed (reduced) form. The compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression discards some information, removing only information that is unnecessary for the analysis.
Data Transformation
The change made in the format or the structure of the data is called data transformation. This
step can be simple or complex based on the requirements. There are some methods for data
transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in knowing the important features of the dataset. By smoothing, we can detect even a simple change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated into a data analysis description. This is an important step, since the accuracy of the results depends on the quantity and quality of the data: when the quality and quantity of the data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization reduces the data
size. For example, rather than specifying the class time, we can set an interval like (3 pm-5 pm,
or 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from 0.0 to 1.0.
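A minimal normalization sketch using min-max scaling to map values into the range 0.0 to 1.0 (the values are illustrative assumptions):

# Sketch: min-max normalization, rescaling values into the range [0, 1].
# The values are illustrative.
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]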