Data Mining and Warehousing - L1 & L2
Lecture-1,2
• Data cleansing and preparation — A step in which data is transformed into a form suitable for further analysis and processing, for
example by identifying and removing errors and handling missing data.
• Artificial intelligence (AI) — These systems perform analytical activities associated with human intelligence such as planning, learning,
reasoning, and problem-solving.
• Association rule learning — These tools, also known as market basket analysis, search for relationships among variables in a dataset,
such as determining which products are typically purchased together.
• Clustering — A process of partitioning a dataset into a set of meaningful sub-classes, called clusters, to help users understand the
natural grouping or structure in the data.
• Classification — This technique assigns items in a dataset to target categories or classes with the goal of accurately predicting the target
class for each case in the data.
• Data analytics — The process of evaluating digital information and turning it into useful business intelligence.
• Data warehousing — A large collection of business data used to help an organization make decisions. It is the foundational component
of most large-scale data mining efforts.
• Machine learning — A computer programming technique that uses statistical probabilities to give computers the ability to “learn”
without being explicitly programmed.
• Regression — A technique used to predict a range of numeric values, such as sales, temperatures, or stock prices, based on a particular
data set.
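As a rough illustration of association rule learning from the list above, the sketch below computes support and confidence, two measures commonly used to rank rules such as "bread => milk". The transactions and the function names are invented for illustration and are not taken from the lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), transactions) / base


# Count how often each pair of items is bought together.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

A real market basket analysis would typically use an algorithm such as Apriori to avoid enumerating every itemset; this sketch only shows what the measures mean.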
Advantages of Data Mining
For example, data mining can tell you which prospects are likely to become profitable customers
based on past customer profiles, and which are most likely to respond to a specific offer. With this
knowledge, you can increase your return on investment (ROI) by making your offer to only those
prospects likely to respond and become valuable customers.
• Increasing revenue.
• Understanding customer segments and preferences.
• Acquiring new customers.
• Improving cross-selling and up-selling.
• Retaining customers and increasing loyalty.
• Increasing ROI from marketing campaigns.
• Detecting fraud.
• Identifying credit risks.
• Monitoring operational performance.
What is Data Warehousing?
• Data warehousing (DW) is a process for collecting and managing
data from varied sources to provide meaningful business insights. A
data warehouse is typically used to connect and analyze business
data from heterogeneous sources. The data warehouse is the core of
the BI system, which is built for data analysis and reporting.
• It is a blend of technologies and components that aids the strategic
use of data. It is the electronic storage of a large amount of information
by a business, designed for query and analysis rather than
transaction processing. It is also a process of transforming data into
information and making it available to users in time to
make a difference.
Data warehouse architecture
A data warehouse system is also known by the following names:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse
How does a Data Warehouse work?
A Data Warehouse works as a central repository where information arrives from one or more data
sources. Data flows into a data warehouse from the transactional system and other relational
databases.
Data may be:
1. Structured
2. Semi-structured
3. Unstructured
The data is processed, transformed, and ingested so that users can access the processed data in the
Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data
warehouse merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyze its customers more
holistically. This helps to ensure that it has considered all the information available. Data
warehousing makes data mining possible. Data mining is looking for patterns in the data that may
lead to higher sales and profits.
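The flow described above (data arrives from several sources, is transformed, and is loaded into one queryable store) can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for a warehouse; the source records, table names, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical records from two separate source systems.
crm_rows = [
    {"customer_id": 1, "name": " Alice "},
    {"customer_id": 2, "name": "Bob"},
]
sales_rows = [
    (1, "2024-01-05", 120.0),
    (1, "2024-02-10", 80.0),
    (2, "2024-01-20", 50.0),
]


def build_warehouse(crm_rows, sales_rows):
    """Transform both sources and load them into one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE sale (customer_id INTEGER, sale_date TEXT, amount REAL)"
    )
    # Transform step: trim stray whitespace so names are consistent.
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(r["customer_id"], r["name"].strip()) for r in crm_rows],
    )
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?)", sales_rows)
    conn.commit()
    return conn


def total_sales_by_customer(conn):
    """Analytical query that needs data from both sources at once."""
    return conn.execute(
        "SELECT c.name, SUM(s.amount) "
        "FROM customer c JOIN sale s ON s.customer_id = c.customer_id "
        "GROUP BY c.customer_id, c.name ORDER BY c.name"
    ).fetchall()


warehouse = build_warehouse(crm_rows, sales_rows)
print(total_sales_by_customer(warehouse))
```

Once the two sources sit in one database, a SQL client or BI tool can answer questions that neither source could answer alone.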
Characteristics of data warehousing
• Subject oriented: data are organized by detailed subject, containing only information
relevant for decision support. This provides a more comprehensive view of the organization
• Integrated: data warehouses must place data from different sources into a consistent
format
• Time variant (time series): it contains historical data (daily, weekly, and monthly) in addition
to current (real-time) data
• Nonvolatile: data cannot be changed or updated after it has entered the data
warehouse. Obsolete (old) data are discarded and changes are recorded as new data
• Web based: designed for web-based applications
• Relational/multidimensional: its structure is either relational or multidimensional
• Uses client/server architecture: so that it is easy to access
• Real-time: this is a characteristic of newer data warehouses
• Includes metadata: data about data (how the data are organized and how to use
them)
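The time-variant and nonvolatile characteristics can be illustrated with an append-only history, where a change is stored as a new record instead of overwriting the old one. The sketch below is only an illustration; the PriceRecord and PriceHistory names and the sample prices are assumptions, not part of the lecture.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a stored record cannot be modified
class PriceRecord:
    product_id: int
    price: float
    valid_from: date


class PriceHistory:
    """Append-only store: changes are recorded as new rows."""

    def __init__(self):
        self._records = []

    def record(self, product_id, price, valid_from):
        self._records.append(PriceRecord(product_id, price, valid_from))

    def price_as_of(self, product_id, on):
        """Return the price in effect on the given date, or None."""
        candidates = [
            r for r in self._records
            if r.product_id == product_id and r.valid_from <= on
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda r: r.valid_from).price

    def __len__(self):
        return len(self._records)


history = PriceHistory()
history.record(1, 9.99, date(2024, 1, 1))
history.record(1, 10.99, date(2024, 3, 1))  # a price change adds a row

print(history.price_as_of(1, date(2024, 2, 1)))
print(history.price_as_of(1, date(2024, 3, 15)))
```

Because old rows are kept, the same store answers both "what is the price now" and "what was the price on a past date".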
Conti…
• Data mart
A departmental data warehouse that stores only relevant data
(usually smaller than the full warehouse)
• Dependent data mart
A subset that is created directly from a data warehouse
• Independent data mart
A small data warehouse designed for a strategic business unit (SBU) or
a department and its source is not the EDW (Enterprise Data
Warehouse)
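One way to picture a dependent data mart is as a filtered subset defined directly on top of the warehouse. The sketch below models it as a SQL view in an in-memory SQLite database; the warehouse_fact table, its columns, and the build_dependent_mart helper are hypothetical names chosen for the example.

```python
import sqlite3


def build_dependent_mart(conn, department):
    """Create a view holding only one department's rows of the warehouse."""
    # The department becomes part of the view name, so accept only plain
    # identifiers (SQLite does not allow bound parameters inside a view).
    if not department.isidentifier():
        raise ValueError("department must be a simple identifier")
    view_name = f"{department}_mart"
    conn.execute(
        f"CREATE VIEW {view_name} AS "
        f"SELECT fact_id, amount FROM warehouse_fact "
        f"WHERE department = '{department}'"
    )
    return view_name


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_fact "
    "(fact_id INTEGER PRIMARY KEY, department TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_fact VALUES (?, ?, ?)",
    [(1, "marketing", 100.0), (2, "finance", 75.0), (3, "marketing", 40.0)],
)

mart = build_dependent_mart(conn, "marketing")
mart_rows = conn.execute(
    f"SELECT fact_id, amount FROM {mart} ORDER BY fact_id"
).fetchall()
print(mart_rows)
```

An independent data mart, by contrast, would load its own table from the source systems without going through the warehouse.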
Conti…
• Operational data stores (ODS)
A type of database often used as an interim (temporary) staging area for a data
warehouse, especially for customer information files
• Oper marts
An operational data mart. An oper mart is a small-scale data mart
typically used by a single department or functional area in an
organization when they need to analyze operational data
• Enterprise data warehouse (EDW)
A large-scale data warehouse that is used across the enterprise for
decision support. Data is pushed into it from the organization's
source systems
• Metadata
Data about data. In a data warehouse, metadata describe the
contents of a data warehouse and the manner of its use
Data Warehousing Process Overview
• Organizations continuously collect data, information, and knowledge
at an increasingly accelerated rate and store them in computerized
systems
• The number of users needing to access the information continues to
increase as a result of improved reliability and availability of network
access, especially the Internet
Data Objects and Attribute Types
• Data sets are made up of data objects. A data object represents
an entity—in a sales database, the objects may be customers,
store items, and sales; in a medical database, the objects may be
patients; in a university database, the objects may be students,
professors, and courses. Data objects are typically described by
attributes. Data objects can also be referred to as samples,
examples, instances, data points, or objects. If the data objects are
stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns
correspond to the attributes. In this section, we define attributes
and look at the various attribute types.
• Data: the form in which the data objects and their attributes are stored.
• An attribute is a property or characteristic of an object, for example a
person’s hair color or the humidity of the air.
• A set of attributes defines an object. An object is also referred to as a
record, instance, or entity.
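The correspondence between data objects, attributes, and database tuples can be shown with a small sketch. The Customer class and its sample values are hypothetical; the point is only that each object becomes one row and each attribute one column.

```python
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class Customer:
    """A data object; each field is one attribute."""
    customer_id: int
    name: str
    hair_color: str


# Hypothetical data set made up of two data objects.
customers = [
    Customer(1, "Alice", "brown"),
    Customer(2, "Bob", "black"),
]

# Columns of the table correspond to the attributes...
attribute_names = [f.name for f in fields(Customer)]
# ...and rows (tuples) correspond to the data objects.
rows = [astuple(c) for c in customers]

print(attribute_names)
for row in rows:
    print(row)
```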
Mapping the data warehouse to a
multiprocessor architecture
To manage large numbers of client requests efficiently, database vendors take advantage of parallel hardware
architectures by implementing multiserver and multithreaded systems. This is called interquery
parallelism, in which different server threads handle multiple requests at the same time.
It can be implemented on symmetric multiprocessing (SMP) systems, where it increases throughput and allows the support
of more concurrent users.
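The idea of interquery parallelism, where separate threads each serve a different request, can be sketched with a thread pool. This is an illustration of the structure only: the SALES table, run_query, and serve_requests are hypothetical, and in CPython threads mainly help with I/O-bound work rather than giving true CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory table of (region, amount) rows.
SALES = [("north", 120.0), ("south", 80.0), ("north", 50.0)]


def run_query(region):
    """One client request: total sales for a single region."""
    return region, sum(amount for r, amount in SALES if r == region)


def serve_requests(regions, workers=4):
    """Hand each independent request to its own worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order the requests were submitted.
        return list(pool.map(run_query, regions))


results = serve_requests(["north", "south", "east"])
print(results)
```

Each query runs start to finish on one thread, and throughput improves because several queries are in progress at once. No single query is split across processors, which would be intraquery parallelism instead.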