
Data Mining and Warehousing - L1 & L2

This document provides an overview of data mining and data warehousing. It discusses key concepts in data mining including data preparation, artificial intelligence, association rule learning, clustering, classification, and machine learning. It also defines data warehousing as collecting and managing varied data sources to provide business insights. Common data warehouse architectures include shared memory, shared disk, and shared nothing. The document also discusses characteristics, processes, and mapping data warehouses to multiprocessor architectures.

Uploaded by

Deepika Garg

Data Mining and Warehousing

Lecture-1,2

Dr. Shweta Sharma


School of Computing Information Technology
Manipal University Jaipur
India
Data Mining
• Data mining is the process of analyzing massive volumes of data to
discover business intelligence that helps companies solve problems,
mitigate risks, and seize new opportunities. This branch of data
science derives its name from the similarities between searching for
valuable information in a large database and mining a mountain for
ore. Both processes require sifting through tremendous amounts of
material to find hidden value.
Conti…
Data Mining Concepts
Achieving the best results from data mining requires an array of tools and techniques. Some of the most commonly-used functions
include:

• Data cleansing and preparation — A step in which data is transformed into a form suitable for further analysis and processing, such as
identifying and removing errors and missing data.
• Artificial intelligence (AI) — These systems perform analytical activities associated with human intelligence such as planning, learning,
reasoning, and problem-solving.
• Association rule learning — These tools, also known as market basket analysis, search for relationships among variables in a dataset,
such as determining which products are typically purchased together.
• Clustering — A process of partitioning a dataset into a set of meaningful sub-classes, called clusters, to help users understand the
natural grouping or structure in the data.
• Classification — This technique assigns items in a dataset to target categories or classes with the goal of accurately predicting the target
class for each case in the data.
• Data analytics — The process of evaluating digital information and turning it into useful business intelligence.
• Data warehousing — A large collection of business data used to help an organization make decisions. It is the foundational component
of most large-scale data mining efforts.
• Machine learning — A computer programming technique that uses statistical probabilities to give computers the ability to “learn”
without being explicitly programmed.
• Regression — A technique used to predict a range of numeric values, such as sales, temperatures, or stock prices, based on a particular
data set.
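As a rough illustration of association rule learning from the list above, the sketch below computes the two usual rule measures, support and confidence, over a handful of made-up shopping baskets. The transactions and the function names `support` and `confidence` are assumptions for illustration only; production tools use far more efficient algorithms.

```python
# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


if __name__ == "__main__":
    # How often are bread and milk bought together?
    print(support({"bread", "milk"}, transactions))
    # Of the baskets with butter, what share also has bread?
    print(confidence({"butter"}, {"bread"}, transactions))
```

A rule such as "butter implies bread" is usually kept only if both measures exceed thresholds chosen by the analyst.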
Conti…
Advantages of Data Mining
For example, data mining can tell you which prospects are likely to become profitable customers
based on past customer profiles, and which are most likely to respond to a specific offer. With this
knowledge, you can increase your return on investment (ROI) by making your offer to only those
prospects likely to respond and become valuable customers.
• Increasing revenue.
• Understanding customer segments and preferences.
• Acquiring new customers.
• Improving cross-selling and up-selling.
• Retaining customers and increasing loyalty.
• Increasing ROI from marketing campaigns.
• Detecting fraud.
• Identifying credit risks.
• Monitoring operational performance.
Data Warehousing?
• Data warehousing (DW) is the process of collecting and managing
data from varied sources to provide meaningful business insights. A
data warehouse is typically used to connect and analyze business
data from heterogeneous sources. The data warehouse is the core of
a BI system, which is built for data analysis and reporting.
• It is a blend of technologies and components which aids the strategic
use of data. It is electronic storage of a large amount of information
by a business which is designed for query and analysis instead of
transaction processing. It is a process of transforming data into
information and making it available to users in a timely manner to
make a difference.
Data warehouse architecture
A data warehouse system is also known by the following names:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse
How does a Data Warehouse work?
A Data Warehouse works as a central repository where information arrives from one or more data
sources. Data flows into a data warehouse from the transactional system and other relational
databases.
Data may be:
1. Structured
2. Semi-structured
3. Unstructured
The data is processed, transformed, and ingested so that users can access the processed data in the
Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data
warehouse merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyze its customers more
holistically. This helps to ensure that it has considered all the information available. Data
warehousing makes data mining possible. Data mining is looking for patterns in the data that may
lead to higher sales and profits.
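The flow described above can be sketched with Python's built-in `sqlite3` module, using an in-memory database to stand in for the warehouse. The two source extracts, the table names `fact_orders` and `dim_customer`, and the helper functions are all hypothetical; a real warehouse would load from live systems through an ETL tool.

```python
import sqlite3

# Hypothetical source extracts: a transactional system and a CRM export.
orders = [(1, "alice", 120.0), (2, "bob", 80.0), (3, "alice", 50.0)]
crm = [("alice", "Jaipur"), ("bob", "Delhi")]


def build_warehouse():
    """Load both sources into one in-memory database (the 'warehouse')."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.execute("CREATE TABLE dim_customer (customer TEXT PRIMARY KEY, city TEXT)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", orders)
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", crm)
    conn.commit()
    return conn


def revenue_by_city(conn):
    """An analytical query that needs data from both sources at once."""
    rows = conn.execute(
        "SELECT d.city, SUM(f.amount) FROM fact_orders f "
        "JOIN dim_customer d ON d.customer = f.customer "
        "GROUP BY d.city ORDER BY d.city"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(revenue_by_city(build_warehouse()))
```

Neither source alone can answer "revenue by city"; the question becomes answerable only once both are merged in one place.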
Characteristics of data warehousing
• Subject oriented: data are organized by detailed subject and contain only information
relevant for decision support. This provides a more comprehensive view of the organization
• Integrated: data warehouses must place data from different sources into a consistent
format
• Time variant (time series): it contains historical data (daily, weekly, and monthly) in addition
to current data (real-time)
• Nonvolatile: data cannot be changed or updated after it has entered the data
warehouse. Obsolete (old) data are discarded and changes are recorded as new data
• Web based: designed for web-based applications
• Relational/multidimensional: its structure is either relational or multidimensional
• Uses client/server: so that it is easy to access
• Real-time: this is a characteristic of newer data warehouses
• Includes metadata: data about data (how the data are organized and how to use
them)
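The time-variant and nonvolatile characteristics can be illustrated with a small append-only store: a change never overwrites an old row, it is recorded as a new dated row, so past states remain queryable. The class `AppendOnlyHistory` and its methods are assumptions made for this sketch, not part of any warehouse product.

```python
from datetime import date


class AppendOnlyHistory:
    """Toy store: rows are never updated in place; a change is a new dated row."""

    def __init__(self):
        self._rows = []  # each row is (key, value, effective_date)

    def record(self, key, value, effective):
        """Append a new version of key; earlier versions are left untouched."""
        self._rows.append((key, value, effective))

    def as_of(self, key, when):
        """Return the value that was current for key on date `when`, or None."""
        candidates = [
            (eff, val) for k, val, eff in self._rows if k == key and eff <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]


if __name__ == "__main__":
    history = AppendOnlyHistory()
    history.record("cust-1", "Jaipur", date(2023, 1, 1))
    history.record("cust-1", "Delhi", date(2024, 1, 1))
    print(history.as_of("cust-1", date(2023, 6, 1)))
```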
Conti…
• Data mart
A departmental data warehouse that stores only relevant data
(usually smaller than a warehouse)
• Dependent data mart
A subset that is created directly from a data warehouse
• Independent data mart
A small data warehouse designed for a strategic business unit (SBU) or
a department and its source is not the EDW (Enterprise Data
Warehouse)
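One way to picture a dependent data mart is as a filtered view defined directly on a warehouse table. The sketch below uses `sqlite3` with an invented `warehouse_sales` table and a `marketing_mart` view; these names and rows are assumptions, and real marts are often materialized as separate tables or databases instead.

```python
import sqlite3


def build_dependent_mart():
    """Create a toy warehouse table and a departmental mart derived from it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE warehouse_sales (dept TEXT, item TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
        [
            ("marketing", "campaign", 300.0),
            ("finance", "audit", 150.0),
            ("marketing", "survey", 75.0),
        ],
    )
    # The mart holds only the rows relevant to one department and is
    # sourced from the warehouse itself, which is what makes it "dependent".
    conn.execute(
        "CREATE VIEW marketing_mart AS "
        "SELECT item, amount FROM warehouse_sales WHERE dept = 'marketing'"
    )
    return conn


if __name__ == "__main__":
    mart = build_dependent_mart()
    print(mart.execute("SELECT * FROM marketing_mart ORDER BY item").fetchall())
```

An independent data mart, by contrast, would be loaded from its own sources rather than from `warehouse_sales`.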
Conti…
• Operational data stores (ODS)
A type of database often used as an interim (temporary) area for a data
warehouse, especially for customer information files
• Oper marts
An operational data mart. An oper mart is a small-scale data mart
typically used by a single department or functional area in an
organization when they need to analyze operational data
• Enterprise data warehouse (EDW)
A technology that provides a vehicle for pushing data from source
systems into a data warehouse that is used across the enterprise for
decision support
• Metadata
Data about data. In a data warehouse, metadata describe the
contents of a data warehouse and the manner of its use
Data Warehousing Process Overview
• Organizations continuously collect data, information, and knowledge
at an increasingly accelerated rate and store them in computerized
systems
• The number of users needing to access the information continues to
increase as a result of improved reliability and availability of network
access, especially the Internet
Data Objects and Attribute Types
• Data sets are made up of data objects. A data object represents
an entity—in a sales database, the objects may be customers,
store items, and sales; in a medical database, the objects may be
patients; in a university database, the objects may be students,
professors, and courses. Data objects are typically described by
attributes. Data objects can also be referred to as samples,
examples, instances, data points, or objects. If the data objects are
stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns
correspond to the attributes. In this section, we define attributes
and look at the various attribute types.
• Data: it is how the data objects and their attributes are stored.
 
• An attribute is an object's property or characteristic, for example a
person's hair color or the air humidity.
• An attribute set defines an object. The object is also referred to as a
record, instance, or entity.
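The row and column picture above maps naturally onto a small record type: each instance is one data object (a tuple or row), and each field is an attribute (a column). The `Customer` class, its fields, and the sample rows are invented for illustration.

```python
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class Customer:
    """One data object; each field is an attribute of that object."""

    customer_id: int
    name: str
    hair_color: str


# Hypothetical rows of a customer table.
rows = [Customer(1, "Asha", "black"), Customer(2, "Ravi", "brown")]


def attribute_names(cls):
    """Column names of the table, i.e. the attribute set."""
    return [f.name for f in fields(cls)]


def column(objects, attribute):
    """All values of one attribute across the data objects."""
    return [getattr(obj, attribute) for obj in objects]


if __name__ == "__main__":
    print(attribute_names(Customer))
    print(astuple(rows[0]))
    print(column(rows, "hair_color"))
```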
Mapping the data warehouse to a
multiprocessor architecture
To manage a large number of client requests efficiently, database vendors designed parallel hardware
architectures by implementing multiserver and multithreaded systems. This is called interquery
parallelism, in which different server threads handle multiple requests at the same time.
It can be implemented on SMP systems, where it increases throughput and allows the support
of more concurrent users.
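Interquery parallelism can be sketched with a thread pool in which each worker thread serves one independent request. The `SALES` data and the `run_query` and `serve` functions are hypothetical stand-ins for a database server; the sketch shows the structure only and makes no claim about speedup, since Python threads do not run CPU-bound work in parallel in the standard interpreter.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "database": sales amounts per region.
SALES = {"north": [10, 20], "south": [5, 5, 5], "east": [7]}


def run_query(region):
    """One independent client request: total sales for a region."""
    return region, sum(SALES[region])


def serve(requests, workers=3):
    """Handle several requests concurrently, one server thread per request."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_query, requests))


if __name__ == "__main__":
    print(serve(["north", "south", "east"]))
```

Each query here still runs sequentially inside its own thread; splitting a single query across processors would be a different technique.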

A data warehouse can be mapped onto different types of architectures, as follows:

• Shared memory architecture

• Shared disk architecture

• Shared nothing architecture


Multiprocessor Architecture
This architecture is simple to implement, and the key idea is
that a single RDBMS server can potentially utilize all
processors, access all memory and access the entire
database.
There are three DBMS software architecture
styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture
1. Shared Memory Architecture
Tightly coupled shared memory systems, illustrated in the following figure,
have the following characteristics:
• Multiple PUs share memory.
• Each PU has full access to all shared memory through a common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.
Conti…
Parallel processing advantages of shared memory
systems are these:
• Memory access is cheaper than inter-node
communication. This means that internal
synchronization is faster than using the Lock
Manager.
• Shared memory systems are easier to administer
than a cluster.
A disadvantage of shared memory systems for
parallel processing is as follows:
• Scalability is limited by bus bandwidth and latency,
and by available memory.
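As a loose software analogy for the shared memory model, the sketch below lets several threads work on their own chunk of data and then combine results through one variable that all of them can reach, guarded by an in-process lock. The function `parallel_sum` is an assumption for illustration; it models the idea of communicating through shared memory, not the behavior of any particular RDBMS or lock manager.

```python
import threading


def parallel_sum(chunks):
    """Sum several chunks with one thread per chunk, sharing a single total."""
    total = 0
    lock = threading.Lock()  # cheap in-process synchronization

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)  # private work, no coordination needed
        with lock:  # communication happens through shared memory
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(parallel_sum([[1, 2, 3], [4, 5], [6]]))
```

Every worker contends for the same lock and the same memory, which mirrors why such systems stop scaling once the shared bus becomes the bottleneck.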
Shared Disk Architecture
Shared disk systems are typically loosely coupled.
Such systems, illustrated in the following figure, have
the following characteristics:
• Each node consists of one or more PUs and
associated memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-
speed bus.
• Each node has access to the same disks and other
resources.
• A node can be an SMP if the hardware supports it.
• Bandwidth of the high-speed bus limits the
number of nodes (scalability) of the system.
Shared Nothing Architecture
Advantages.
• Shared nothing systems provide for incremental growth.
• System growth is practically unlimited.
• MPPs are good for read-only databases and decision support
applications.
• Failure is local: if one node fails, the others stay up.
Disadvantages
• More coordination is required.
• More overhead is required for a process working on a disk belonging
to another node.
• If there is a heavy workload of updates or inserts, as in an online
transaction processing system, it may be worthwhile to consider
data-dependent routing to alleviate contention.
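Data-dependent routing can be sketched as a function that maps each row's key to the node that owns it, so a request is sent straight to that node instead of touching another node's disk. The node names, the choice of a CRC32 hash, and the `route` and `partition` helpers are assumptions for this sketch; real systems may use range or list partitioning instead.

```python
import zlib

# Hypothetical shared nothing cluster: each node owns its own disk.
NODES = ["node-0", "node-1", "node-2"]


def route(key, nodes=NODES):
    """Pick the node that owns key, using a hash that is stable across runs."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def partition(rows, nodes=NODES):
    """Place each (key, payload) row on the node chosen by route()."""
    placement = {node: [] for node in nodes}
    for key, payload in rows:
        placement[route(key, nodes)].append((key, payload))
    return placement


if __name__ == "__main__":
    rows = [("cust-1", 100), ("cust-2", 250), ("cust-1", 40)]
    for node, owned in partition(rows).items():
        print(node, owned)
```

Because the same key always lands on the same node, all updates for one customer are handled locally, which reduces the cross-node overhead listed among the disadvantages.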
