
DWDM UNIT 1

DATA WAREHOUSE
Introduction:
Data warehouses generalize and consolidate data in multidimensional space. The construction of data
warehouses involves data cleaning, data integration and data transformation and can be viewed as an
important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical
processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which
facilitates effective data generalization and data mining. Many other data mining functions, such as
association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance
interactive mining of knowledge at multiple levels of abstraction.
Hence, the data warehouse has become an increasingly important platform for data analysis and on-line
analytical processing and will provide an effective platform for data mining. Therefore, data warehousing
and OLAP form an essential step in the knowledge discovery process. Data warehousing provides
architectures and tools for business executives to systematically organize, understand, and use their data to
make strategic decisions. Data warehouse systems are valuable tools in today’s competitive, fast-evolving
world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data
warehouses. Many people feel that with competition mounting in every industry, data warehousing is the
latest must-have marketing weapon—a way to retain customers by learning more about their needs. “Then,
what exactly is a data warehouse?” Data warehouses have been defined in many ways, making it difficult
to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is
maintained separately from an organization’s operational databases. Data warehouse systems allow for the
integration of a variety of application systems. They support information processing by providing a solid
platform of consolidated historical data for analysis. According to William H. Inmon, a leading architect in
the construction of data warehouse systems, “A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of management’s decision making process”. This
short, but comprehensive definition presents the major features of a data warehouse. The four keywords,
subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data
repository systems, such as relational database systems, transaction processing systems, and file systems.
Let’s take a closer look at each of these key features.

Subject-oriented:
A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather
than concentrating on the day-to-day operations and transaction processing of an organization, a data
warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses
typically provide a simple and concise view around particular subject issues by excluding data that are not
useful in the decision support process.

Integrated:
A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational
databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are
applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Time-variant:
Data are stored to provide information from a historical perspective (e.g., the past 5–10 years). Every key
structure in the data warehouse contains, either implicitly or explicitly, an element of time.

Nonvolatile:
A data warehouse is always a physically separate store of data transformed from the application data found
in the operational environment. Due to this separation, a data warehouse does not require transaction
processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data
accessing: initial loading of data and access of data. In sum, a data warehouse is a semantically consistent
data store that serves as a physical implementation of a decision support data model and stores the
information on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed
as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured
and/or ad hoc queries, analytical reporting, and decision making.

DATA MINING
Introduction:

Data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a
misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than
rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining
from data,” which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not reflect the
emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the
process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer
that carries both “data” and “mining” became a popular choice. Many other terms carry a similar or slightly
different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from
Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge
discovery. Knowledge discovery consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)


2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some
interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to
present the mined knowledge to the user)

Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data
mining step may interact with the user or a knowledge base. The interesting patterns are presented to the
user and may be stored as new knowledge in the knowledge base. Note that according to this view, data
mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for
evaluation.
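To make this sequence concrete, here is a minimal sketch of the seven KDD steps as a small Python pipeline. It assumes pandas; the tables, column names, and the trivial "top spenders" pattern are purely illustrative and not part of any standard library.

import pandas as pd

def clean(df):
    # Step 1: data cleaning - drop duplicate rows and rows with missing values
    return df.drop_duplicates().dropna()

def integrate(*sources):
    # Step 2: data integration - combine multiple sources into one table
    return pd.concat(sources, ignore_index=True)

def select(df, columns):
    # Step 3: data selection - keep only attributes relevant to the task
    return df[columns]

def transform(df):
    # Step 4: data transformation - aggregate (sum) amounts per customer
    return df.groupby("customer_id", as_index=False)["amount"].sum()

def mine(df):
    # Step 5: data mining - here, a trivial "pattern": the top spenders
    return df.nlargest(3, "amount")

def evaluate(patterns, threshold=1000):
    # Step 6: pattern evaluation - keep only patterns above an interestingness threshold
    return patterns[patterns["amount"] > threshold]

def present(patterns):
    # Step 7: knowledge presentation
    print(patterns.to_string(index=False))

if __name__ == "__main__":
    store_a = pd.DataFrame({"customer_id": [1, 2, 1], "amount": [500, 1200, 700]})
    store_b = pd.DataFrame({"customer_id": [2, 3], "amount": [300, 2500]})
    prepared = transform(select(clean(integrate(store_a, store_b)),
                                ["customer_id", "amount"]))
    present(evaluate(mine(prepared)))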

We agree that data mining is a step in the knowledge discovery process. However, in industry, in media,
and in the database research milieu, the term data mining is becoming more popular than the longer term
of knowledge discovery from data. Therefore, in this book, we choose to use the term data mining. We
adopt a broad view of data mining functionality: data mining is the process of discovering interesting
knowledge from large amounts of data stored in databases, data warehouses, or other information
repositories

ARCHITECTURE OF DATA MINING


➢ DATABASE, DATA WAREHOUSE, WORLD WIDE WEB, OTHER REPOSITORIES


Databases, data warehouses, the World Wide Web (WWW), text files, and other documents are the actual
sources of data. A large volume of historical data is needed for mining to be successful. Organizations
generally store data in databases or data warehouses. Data warehouses may contain one or more databases,
text files, spreadsheets, or other kinds of information repositories. The World Wide Web, or internet, is
another big source of data.

➢ DIFFERENT PROCESSES (CLEANING, INTEGRATION, SELECTION)


The data needs to be cleaned, integrated, and selected before passing it to the database or data
warehouse server. Because the data come from different sources and in different formats, they cannot be used
directly for the mining process: the data might not be complete and reliable. So the data must be cleaned,
integrated, and selected. A number of techniques may be applied as part of cleaning, integration, and
selection.

➢ DATABASE OR DATA WAREHOUSE SERVER:


The database or data warehouse server contains the actual data that is ready to be processed. The
server is responsible for retrieving the relevant data based on the data mining request of the user.

➢ DATA MINING ENGINE:


The data mining engine is the core component of any data mining system. It consists of a number of
modules for performing data mining tasks, such as association, classification, prediction, and time-series
analysis.

➢ PATTERN EVALUATION:
The pattern evaluation module is mainly responsible for measuring the interestingness of the patterns by
using a threshold value.

➢ USER INTERFACE OR GRAPHICAL USER INTERFACE


The graphical user interface communicates between the user and the data mining system. This module
helps the user use the system easily and efficiently without knowing the real complexity behind the process.
When the user specifies a query, this module interacts with the data mining system and displays the result in
an easily understandable manner.

➢ KNOWLEDGE BASE:
The knowledge base is helpful in the whole data mining process. It might be useful for guiding
the search or evaluating the interestingness of the result patterns. The knowledge base might even contain
user beliefs and data from user experiences that can be useful in the process of data mining. The data
mining engine might get inputs from the knowledge base to make the result more accurate and reliable. The
pattern evaluation module interacts with the knowledge base on a regular basis to get inputs and also to
update it.
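A minimal, hypothetical sketch of how these modules could fit together in code follows. All class and method names are illustrative assumptions, not a real data mining library API.

class KnowledgeBase:
    def __init__(self):
        self.threshold = 0.5              # interestingness threshold (a user belief)

class WarehouseServer:
    def __init__(self, data):
        self.data = data                  # cleaned, integrated, selected data
    def retrieve(self, query):
        # return only the records relevant to the mining request
        return [row for row in self.data if query(row)]

class MiningEngine:
    def mine(self, records):
        # placeholder for association / classification / clustering modules
        support = len(records)
        return [{"pattern": "frequent purchase", "score": support / 10}]

class PatternEvaluator:
    def __init__(self, kb):
        self.kb = kb
    def evaluate(self, patterns):
        # keep only patterns whose score passes the knowledge-base threshold
        return [p for p in patterns if p["score"] >= self.kb.threshold]

class UserInterface:
    def run(self, server, engine, evaluator, query):
        records = server.retrieve(query)
        patterns = engine.mine(records)
        for p in evaluator.evaluate(patterns):
            print("Interesting pattern:", p)

if __name__ == "__main__":
    kb = KnowledgeBase()
    server = WarehouseServer([{"item": "milk"}] * 7)
    UserInterface().run(server, MiningEngine(), PatternEvaluator(kb),
                        query=lambda row: row["item"] == "milk")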


➢ DEFINITION OF OLAP:

OLAP is an Online Analytical Processing system. An OLAP database stores historical data that has
been fed in by OLTP systems. It allows a user to view different summaries of multidimensional data. Using
OLAP, you can extract information from a large database and analyse it for decision making.

OLAP also allows a user to execute complex queries to extract multidimensional data. In OLAP, even if a
query fails midway, data integrity is not harmed, because the user only retrieves data from a large database
for analysis; the user can simply fire the query again and extract the data.

Transactions in OLAP are long and hence take comparatively more time for processing and require
large space. Transactions in OLAP are also less frequent than in OLTP. Even the tables in an OLAP
database may not be normalized. Examples of OLAP use are viewing a financial report, budgeting,
marketing management, sales reports, etc.

Example:
Any type of Data warehouse system is an OLAP system. Uses of OLAP are as follows:
• Spotify analyses songs played by users to come up with a personalized homepage of their songs and
playlists.
• Netflix movie recommendation system.

➢ BENEFITS OF USING OLAP:

• OLAP creates a single platform for all types of business analytical needs, including planning,
budgeting, forecasting, and analysis.
• The main benefit of OLAP is the consistency of information and calculations.
• Easily apply security restrictions on users and objects to comply with regulations and protect
sensitive data.

➢ DRAWBACKS OF OLAP:

• Implementation and maintenance are dependent on IT professionals because the traditional OLAP
tools require a complicated modeling procedure.
• OLAP tools need cooperation between people of various departments to be effective, which might
not always be possible.

➢ DEFINITION OF OLTP:

OLTP is an Online Transaction Processing system. The main focus of an OLTP system is to record
the current updates, insertions, and deletions during transactions. OLTP queries
are simple and short and hence require less processing time and less space.

An OLTP database gets updated frequently. It may happen that a transaction in OLTP fails midway, which
may affect data integrity, so OLTP has to take special care of data integrity. An OLTP database has normalized
tables (3NF).


The best example of an OLTP system is an ATM, in which we use short transactions to modify the status of
our account. The OLTP system becomes the source of data for OLAP.

Example: -
Uses of OLTP are as follows:
• An ATM center is an OLTP application.
• OLTP handles the ACID properties during data transactions via the application.

• It is also used for online banking, online airline ticket booking, sending a text message, and adding
a book to a shopping cart.

BENEFITS OF OLTP METHOD:

• It administers daily transactions of an organization.


• OLTP widens the customer base of an organization by simplifying individual processes.

DRAWBACKS OF OLTP:

• If an OLTP system faces a hardware failure, online transactions are severely affected.
• OLTP systems allow multiple users to access and change the same data at the same time, which can
sometimes create unexpected conflicts.


Difference Between OLTP and OLAP


Sl. No | Basis for Comparison | OLTP | OLAP
1 | Explanation | Online Transaction Processing | Online Analytical Processing
2 | Queries | Simpler queries | Complex queries
3 | Normalization | Tables are normalized (3NF) | Tables are not normalized
4 | Transaction | Short transactions | Long transactions
5 | Functionality | Online database modifying system | Online database query management system
6 | Method | Uses traditional DBMS | Uses the data warehouse
7 | Response time | Milliseconds | Seconds to minutes
8 | Operation | Allows read/write operations | Only read, and rarely write
9 | Audience | Customer-oriented process | Market-oriented process
10 | User type | Data-critical users such as clerks, DBAs, and database professionals | Data-knowledge users such as knowledge workers, managers, and CEOs
11 | Purpose | Designed for real-time business operations | Designed for analysis of business measures by category and attribute
12 | Number of users | Allows thousands of users | Allows only hundreds of users
13 | Productivity | Increases the user's self-service and productivity | Increases the productivity of business analysts
14 | Process | Provides fast results for daily-used data | Ensures that responses to queries are consistently quick
15 | Data source | OLTP systems are themselves the source of data | Data comes from different OLTP databases
16 | Storage | Database size is from 100 MB to 1 GB | Database size is from 100 GB to 1 TB
17 | Back-up | Complete backups combined with incremental backups | Only needs a backup from time to time; backup is less critical than in OLTP
18 | Usefulness | Helps to control and run fundamental business tasks | Helps with planning, problem-solving, and decision support
19 | Data quality | Data is always detailed and organized | Data might not be organized
20 | Query type | Insert, update, and delete information in the database | Mostly select operations


OLAP (ONLINE ANALYTICAL PROCESSING):

▪ OLAP is a category of software that allows users to analyse information from multiple database systems
at the same time.
▪ It is a technology that enables analysts to extract and view business data from different points of view.
▪ OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating
and viewing reports become easy.
Example:

❖ BASIC ANALYTICAL OPERATIONS OF OLAP:


The types of analytical operations in OLAP are:

1. ROLL-UP.
2. DRILL-DOWN.
3. SLICE.
4. DICE.
5. PIVOT (ROTATE).

✓ ROLL-UP: Roll-up is also known as “consolidation” or “aggregation”. The roll-up operation can
be performed in two ways:
• Reducing dimensions.
• Climbing up a concept hierarchy, i.e., a system of grouping values based on their order or level.

Example:


✓ DRILL-DOWN: In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up
process.
• It can be done via:
i. Moving down the concept hierarchy.
ii. Adding a new dimension.
Example:

✓ SLICE: In slice, one dimension is selected, and a new sub-cube is created.


Example:

✓ DICE: This operation is similar to a slice. The difference is that in dice we select two or more
dimensions, which results in the creation of a sub-cube.
Example:


✓ PIVOT (ROTATE): In pivot, we rotate the data axes to provide an alternative presentation of the data.
Example:
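A minimal sketch of the five operations using pandas (an assumed library choice; the sales figures and dimension names are illustrative):

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "city":    ["Pune", "Pune", "Delhi", "Delhi", "Pune", "Pune", "Delhi", "Delhi"],
    "amount":  [100, 120, 80, 90, 110, 130, 85, 95],
})

# Roll-up: climb the time hierarchy from quarter to year (aggregation)
rollup = sales.groupby(["year", "city"])["amount"].sum()

# Drill-down: move back to the finer quarter level
drilldown = sales.groupby(["year", "quarter", "city"])["amount"].sum()

# Slice: fix one dimension (year = 2023) to obtain a sub-cube
slice_2023 = sales[sales["year"] == 2023]

# Dice: fix two or more dimensions (year = 2023 and city = Pune)
dice = sales[(sales["year"] == 2023) & (sales["city"] == "Pune")]

# Pivot (rotate): swap the axes of the presentation
pivot = sales.pivot_table(values="amount", index="city",
                          columns="quarter", aggfunc="sum")

print(rollup, drilldown, slice_2023, dice, pivot, sep="\n\n")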

DATA CUBE:

A data cube is a three-dimensional (3D) (or higher-dimensional) range of values that is generally used to explain
the time sequence of an image's data. It is a data abstraction used to evaluate aggregated data from a variety of
viewpoints. It is also useful for imaging spectroscopy, since a spectrally resolved image can be depicted as a 3-D
volume.

A data cube can also be described as the multidimensional extensions of two-dimensional tables. It can be
viewed as a collection of identical 2-D tables stacked upon one another. Data cubes are used to represent
data that is too complex to be described by a table of columns and rows. As such, data cubes can go far
beyond 3-D to include many more dimensions.


A data cube is generally used to easily interpret data. It is especially useful when representing data together
with dimensions as certain measures of business requirements. Every dimension of a cube represents a certain
characteristic of the database, for example daily, monthly, or yearly sales. The data included inside a data
cube make it possible to analyze almost all the figures for virtually any or all customers, sales agents,
products, and much more. Thus, a data cube can help to establish trends and analyze performance.

Data cubes are mainly categorized into two categories:


1. Multidimensional data cube (MOLAP).
2. Relational data cube (ROLAP).
DATA MINING:

Extracting or mining knowledge from large amounts of data is called data mining.

Data mining is the process of discovering patterns in large data sets involving methods at the
intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary
subfield of computer science and statistics with an overall goal to extract information (with intelligent
methods) from a data set and transform the information into a comprehensible structure for further use.
Data mining is the analysis step of the "knowledge discovery in databases" process or KDD.

Some of the terms related to data mining are: -


1. Knowledge mining.
2. Knowledge extracting.
3. Pattern analysis.
4. Data Archaeology.
5. Data dredging.
KNOWLEDGE DISCOVERY:

Data mining is not the same as KDD (Knowledge Discovery from Data); it is one step of the KDD process.


Data mining itself is the process of discovering interesting patterns and knowledge from large amounts of data.
Steps in KDD:


1. Data Cleaning − In this step, the noise and inconsistent data is removed.
2. Data Integration − In this step, multiple data sources are combined.
3. Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
4. Data Transformation − In this step, data is transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations.
5. Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
6. Pattern Evaluation − In this step, data patterns are evaluated.
7. Knowledge Presentation − In this step, knowledge is represented.
MAJOR ISSUES IN DATA MINING:
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available in one place; it needs to be integrated from various heterogeneous data sources. These factors
also create some issues. The major issues fall into three groups:

• Mining Methodology and User Interaction.


• Performance Issues.
• Diverse Data Types Issues.


MINING METHODOLOGY AND USER INTERACTION:


• Mining different kinds of knowledge in databases − Different users may be interested in different
kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge
discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used to guide the
discovery process and to express the discovered patterns, not only in concise terms but at multiple
levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once patterns are discovered, they need
to be expressed in high-level languages and visual representations. These representations should
be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and
incomplete objects while mining the data regularities. If data cleaning methods are not available,
the accuracy of the discovered patterns will be poor.
• Pattern evaluation − Many of the patterns discovered may be uninteresting because they represent
common knowledge or lack novelty, so interestingness measures are needed to evaluate them.

PERFORMANCE ISSUES:
• Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient and
scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions that are processed in parallel, and the results from the partitions are then
merged. Incremental algorithms incorporate database updates without mining the data again from
scratch.

DIVERSE DATA TYPES ISSUES:


• Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data etc. It is not possible for one system
to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data
is available from different data sources on a LAN or WAN. These data sources may be structured, semi-
structured, or unstructured. Therefore, mining knowledge from them adds challenges to data
mining.


DATA CLEANING:

Introduction:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies
in the data. In this section, you will study basic methods for data cleaning.

Missing Values:
Imagine that you need to analyze All Electronics sales and customer data. You note that many tuples have
no recorded value for several attributes, such as customer income. How can you go about filling in the
missing values for this attribute? Let’s look at the following methods:

Ignore the Tuple:


This is usually done when the class label is missing (assuming the mining task involves classification).
This method is not very effective, unless the tuple contains several attributes with missing values. It is
especially poor when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually:


In general, this approach is time-consuming and may not be feasible given a large data set with many
missing values.

Use a global constant to fill in the missing value:


Replace all missing attribute values by the same constant, such as a label like “Unknown” or -∞. If
missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they
form an interesting concept, since they all have a value in common—that of “Unknown.” Hence, although
this method is simple, it is not foolproof.

Use the attribute mean to fill in the missing value:


For example, suppose that the average income of All Electronics customers is $56,000. Use this value to
replace the missing value for income.

Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value with the average
income value for customers in the same credit risk category as that of the given tuple.

Use the most probable value to fill in the missing value:


This may be determined with regression, inference-based tools using a Bayesian formalism, or decision
tree induction. For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income. Decision trees, regression, and Bayesian inference
are popular methods for this kind of prediction.
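A minimal sketch of these missing-value strategies using pandas (an assumption; the customer table, column names, and figures are illustrative):

import pandas as pd

customers = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, 60000, None, 30000, None],
})

# 1. Ignore the tuple: drop rows where income is missing
dropped = customers.dropna(subset=["income"])

# 2. Fill with a global constant (a sentinel playing the role of "Unknown")
constant_fill = customers["income"].fillna(-1)

# 3. Fill with the attribute mean
mean_fill = customers["income"].fillna(customers["income"].mean())

# 4. Fill with the mean of the same class (same credit risk category)
class_mean_fill = customers.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(dropped, constant_fill, mean_fill, class_mean_fill, sep="\n\n")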


NOISY DATA:

Introduction: Noise is a random error or variance in a measured variable. Given a numerical attribute
such as, say, price, how can we “smooth” out the data to remove the noise?
Let's look at the following data smoothing techniques.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:

Bin 1: 4, 8, 15

Bin 2: 21, 21, 24

Bin 3: 25, 28, 34

Smoothing by bin means:

Bin 1: 9, 9, 9

Bin 2: 22, 22, 22

Bin 3: 29, 29, 29

Smoothing by bin boundaries:

Bin 1: 4, 4, 15

Bin 2: 21, 21, 24

Bin 3: 25, 25, 34

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.
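A minimal sketch in plain Python that reproduces the bins above, smoothing by bin means and by bin boundaries:

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bin_size = 3

# Equal-frequency binning: three values per bin
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: replace every value with its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closest boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]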

Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be
used to predict the other. Multiple linear regression is an extension of linear regression, where more
than two attributes are involved and the data are fit to a multidimensional surface.

Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.
Many methods for data smoothing are also methods for data reduction involving discretization. For
example, the binning techniques described above reduce the number of distinct values per attribute. This
acts as a form of data reduction for logic-based data mining methods, such as decision tree induction,
which repeatedly make value comparisons on sorted data. Concept hierarchies are a form of data
discretization that can also be used for data smoothing. A concept hierarchy for price, for example, may


map real price values into inexpensive, moderately priced, and expensive, thereby reducing the number of
data values to be handled by the mining process. Some methods of classification, such as neural networks,
have built-in data smoothing mechanisms.

DATA INTEGRATION:

It is likely that your data analysis task will involve data integration, which combines data from multiple
sources into a coherent data store, as in data warehousing. These sources may include multiple databases,
data cubes, or flat files.
Issues in Data Integration:

There are a number of issues to consider during data integration: schema integration, redundancy, and
detection and resolution of data value conflicts. These are explained briefly below.

1. Schema Integration:
• Integrate metadata from different sources.
• Real-world entities from multiple sources must be matched; this is referred to as the entity
identification problem.
For example, how can the data analyst or the computer be sure that customer_id in one database and
customer_number in another refer to the same attribute?
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another attribute or set of
attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis (see the sketch after this list).

3. Detection and resolution of data value conflicts:


• This is the third important issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in
another.
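A minimal sketch of redundancy detection by correlation analysis, assuming numeric attributes and pandas (the column names and figures are illustrative):

import pandas as pd

merged = pd.DataFrame({
    "annual_income":  [40000, 55000, 70000, 90000, 120000],
    "monthly_income": [3333, 4583, 5833, 7500, 10000],   # derivable from annual_income
    "age":            [25, 32, 41, 48, 55],
})

corr = merged.corr()               # Pearson correlation matrix
print(corr)

# A correlation close to +1 or -1 suggests one attribute can be derived from
# the other, so it is a candidate for removal as redundant.
redundant = corr["annual_income"]["monthly_income"] > 0.95
print("monthly_income redundant with annual_income:", redundant)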
DATA REDUCTION
Data reduction methods achieve a condensed description of the original data that is much smaller in
quantity but preserves the quality of the original data.
• Obtaining a reduced representation of the complete data set is called data reduction.
• Mining on the complete data set can take a long time to run.
• With data reduction, the results are the same or almost the same as with the original data.
Some of the strategies of data reduction are:

1. Data cube aggregation


2. Removing unwanted attributes (dimension reduction)
3. Data compression
4. Fitting the data into mathematical models (numerosity reduction)
5. Discretization and concept hierarchy generation


1. Data Cube Aggregation:


This technique is used to aggregate data into a simpler form. For example, imagine that the information
you gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue for
every three months. If you are interested in the annual sales rather than the quarterly totals, the data
can be summarized so that the resulting data set reports the total sales per year instead of per quarter.
The data is thus reduced to a more compact summary.
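A minimal sketch of this roll-up from quarterly to annual totals using pandas (the revenue figures are illustrative):

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [224, 408, 350, 586, 300, 416, 380, 600, 310, 430, 395, 615],
})

# Aggregate: one summarized row per year instead of four quarterly rows
annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)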

2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we retain only the attributes required
for our analysis. Dimension reduction reduces data size as it eliminates outdated or redundant features.

1. Step-wise Forward Selection


The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes
is added to the set, based on its relevance (measured, for example, by a statistical significance test such as a p-value).
Suppose there are the following attributes in the data set in which few attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


2. Step-wise Backward Selection
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the
worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


3. Combination of Forward and Backward Selection
It allows us to remove the worst and select the best attributes, saving time and making the process faster.
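A minimal sketch of greedy forward selection in Python (pandas assumed); the scoring criterion used here (absolute correlation with an assumed target attribute) merely stands in for the relevance test mentioned above:

import pandas as pd

def forward_select(df, target, k):
    selected = []
    remaining = [c for c in df.columns if c != target]
    while len(selected) < k and remaining:
        # pick the remaining attribute that scores best against the target
        best = max(remaining, key=lambda c: abs(df[c].corr(df[target])))
        selected.append(best)
        remaining.remove(best)
    return selected

data = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5],
    "X2": [2, 1, 4, 3, 6],
    "X3": [5, 3, 6, 2, 4],
    "target": [1.1, 2.0, 3.2, 3.9, 5.1],
})
print(forward_select(data, "target", k=2))   # e.g. ['X1', 'X2']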


3. Data Compression:
The data compression technique reduces the size of the files using different encoding mechanisms
(Huffman encoding and run-length encoding). We can divide it into two types based on the compression
technique used.
• Lossless Compression
Encoding techniques such as run-length encoding allow a simple and modest reduction of data size. Lossless
data compression uses algorithms to restore the precise original data from the compressed data.
• Lossy Compression
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples
of this kind of compression. For example, the JPEG image format uses lossy compression, but the result is still
an image whose meaning is equivalent to the original. In lossy data compression, the decompressed data may
differ from the original data but are useful enough to retrieve information from them.
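A minimal sketch of lossless run-length encoding (RLE) in plain Python (the data is illustrative); decoding restores the original exactly:

def rle_encode(values):
    # store each run of repeated values as a (value, count) pair
    encoded, i = [], 0
    while i < len(values):
        run = 1
        while i + run < len(values) and values[i + run] == values[i]:
            run += 1
        encoded.append((values[i], run))
        i += run
    return encoded

def rle_decode(pairs):
    # expand each (value, count) pair back into the original run
    return [v for v, count in pairs for _ in range(count)]

data = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
packed = rle_encode(data)            # [('A', 3), ('B', 2), ('C', 4)]
assert rle_decode(packed) == data    # lossless: the original is fully restored
print(packed)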

4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller
representation of the data rather than the data itself. With parametric methods, only the model parameters
need to be stored; non-parametric methods include clustering, histograms, and sampling.
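A minimal sketch of two non-parametric numerosity-reduction methods, sampling and histograms, in plain Python (the data is randomly generated for illustration):

import random
from collections import Counter

prices = [random.randint(1, 100) for _ in range(10_000)]

# Sampling: keep a small random subset instead of all tuples
sample = random.sample(prices, k=100)

# Histogram: store only bucket counts (bucket width 20) instead of every value
histogram = Counter((p - 1) // 20 for p in prices)
print(len(sample), dict(sorted(histogram.items())))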

5. Discretization & Concept Hierarchy Operation:


Techniques of data discretization are used to divide continuous attributes into intervals. We replace the many
continuous values of an attribute by labels of small intervals. This means that mining results are shown in a
concise and easily understandable way.
• Top-down discretization
If we first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set
of attribute values, and then repeat this on the resulting intervals until the end, the process is known as
top-down discretization, also known as splitting.
• Bottom-up discretization
If we first consider all the continuous values as potential split points and then discard some of them by merging
neighbourhood values into intervals, the process is called bottom-up discretization, also known as merging.

Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as the age value 43) with
high-level concepts (categorical labels such as middle-aged or senior).
For numeric data, the following techniques can be used:
• Binning
Binning is the process of changing numerical variables into categorical counterparts. The number of
categorical counterparts depends on the number of bins specified by the user.
• Histogram analysis
Like binning, a histogram is used to partition the values of an attribute, X, into disjoint
ranges called buckets. There are several partitioning rules:
1. Equal-frequency partitioning: partition the values based on their number of occurrences in the
data set.
2. Equal-width partitioning: partition the values into ranges of a fixed width determined by the number
of bins, e.g. ranges of values spanning 0-20.
3. Clustering: group similar data together.
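A minimal sketch of equal-width partitioning, equal-frequency partitioning, and a simple concept hierarchy for age, using pandas (an assumption; the ages, bin edges, and labels are illustrative):

import pandas as pd

ages = pd.Series([18, 22, 25, 31, 37, 43, 45, 52, 61, 70])

# Equal-width partitioning: fixed-size ranges
equal_width = pd.cut(ages, bins=3)

# Equal-frequency partitioning: roughly the same number of values per bucket
equal_freq = pd.qcut(ages, q=3)

# Concept hierarchy: map low-level ages to high-level labels
hierarchy = pd.cut(ages, bins=[0, 30, 55, 120],
                   labels=["young", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "width": equal_width,
                    "freq": equal_freq, "level": hierarchy}))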


Data Transformation

The data are transformed into forms that are suitable for mining. Data transformation involves the following
steps:

1. Smoothing:
It is a process used to remove noise from the data set using some algorithm. It allows for highlighting the
important features present in the data set and helps in predicting patterns. When collecting data, it can be
manipulated to eliminate or reduce variance or any other form of noise.
The concept behind data smoothing is that it identifies simple changes to help predict different trends and
patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to
digest, to find patterns they would not see otherwise.

2. Aggregation:
Data collection or aggregation is the method of storing and presenting data in a summary format. The data may
be obtained from multiple data sources to integrate these data sources into a data analysis description. This is a
crucial step since the accuracy of data analysis insights is highly dependent on the quantity and quality of the
data used. Gathering accurate data of high quality and a large enough quantity is necessary to produce relevant
results.
The collection of data is useful for everything from decisions concerning financing or business strategy of the
product, pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.

3. Discretization:
It is a process of transforming continuous data into a set of small intervals. Most real-world data mining
activities involve continuous attributes, yet many of the existing data mining frameworks are unable to handle
these attributes directly.
Also, even if a data mining task can manage a continuous attribute, it can significantly improve its efficiency by
replacing the continuous attribute with its discrete values.
For example, age values may be mapped to intervals (1-10, 11-20, ...) or to labels (young, middle-aged, senior).

4. Attribute Construction:
Here new attributes are created from the given set of attributes and applied to assist the mining process. This
simplifies the original data and makes the mining more efficient.

5. Generalization:
It converts low-level data attributes to high-level data attributes using concept hierarchies. For example, age
values initially in numerical form (22, 25) are converted into categorical values (young, old).
Similarly, categorical attributes, such as house addresses, may be generalized to higher-level definitions,
such as town or country.

6. Normalization:
Data normalization involves converting all data variables into a given range.
Techniques that are used for normalization are:
• Min-Max Normalization:
• This transforms the original data linearly.
• Suppose that min_A is the minimum and max_A is the maximum value of an attribute, A.


We have the formula:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

• Where v is the value you want to plot in the new range [new_min_A, new_max_A].


• v' is the new value you get after normalizing the old value.

Solved example:
Suppose the minimum and maximum values for the attribute profit (P) are Rs. 10,000 and Rs. 100,000. We
want to plot the profit in the range [0, 1]. Using min-max normalization, the value of Rs. 20,000 for
the attribute profit is plotted to:

    v' = ((20,000 - 10,000) / (100,000 - 10,000)) * (1 - 0) + 0 = 0.11

And hence, we get the value of v' as 0.11.


• Z-Score Normalization:
• In z-score normalization (or zero-mean normalization), the values of an attribute, A, are normalized
based on the mean of A and its standard deviation.
• A value, v, of attribute A is normalized to v' by computing

    v' = (v - mean_A) / std_A

For example:
Let the mean of an attribute P be 60,000 and its standard deviation be 10,000. Using z-score
normalization, a value of 85,000 for P can be transformed to:

    v' = (85,000 - 60,000) / 10,000 = 2.5

And hence we get the value of v' to be 2.5.


• Decimal Scaling:
• It normalizes the values of an attribute by moving their decimal point.
• The number of places by which the decimal point is moved is determined by the maximum absolute
value of attribute A.
• A value, v, of attribute A is normalized to v' by computing

    v' = v / 10^j

• where j is the smallest integer such that Max(|v'|) < 1.

For example:
• Suppose the values of an attribute P vary from -99 to 99.
• The maximum absolute value of P is 99.
• To normalize the values we divide the numbers by 100 (i.e., j = 2, the number of digits in the
largest absolute value), so that values such as 98 and 97 come out to be 0.98, 0.97, and so on.
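A minimal sketch in plain Python that reproduces the three worked examples above:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # min-max normalization into the range [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # zero-mean normalization
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # move the decimal point j places
    return v / (10 ** j)

print(round(min_max(20_000, 10_000, 100_000), 2))       # 0.11
print(z_score(85_000, 60_000, 10_000))                   # 2.5
print(decimal_scaling(98, 2), decimal_scaling(-99, 2))   # 0.98 -0.99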
