Introduction To Data Mining

Data mining is the process of discovering patterns in large data sets. There are two main approaches: descriptive data mining, which characterizes data properties, and predictive data mining, which is used to predict outcomes. The data mining process involves problem definition, data gathering/preparation, model building/evaluation, and knowledge deployment. A data warehouse is a central repository of integrated data used for analysis and reporting. It contains historical data to help organizations make strategic decisions.

Uploaded by

Bulmi Hilme

Data mining

Presented by: Tek Narayan Adhikari

What is Data Mining?
• “The process of discovering meaningful patterns and trends, often previously unknown, by applying mathematical algorithms to huge amounts of stored data.”
• “Extraction of interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from data in large databases.”
• Data mining is basically concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
The two approaches are:
• Descriptive Data Mining
• Predictive Data Mining

Descriptive Data Mining:
• It characterizes the general properties of the data in the database.
• It finds patterns in the data; the user determines which ones are important.
• Mostly used during data exploration.
• Typical questions answered by descriptive data mining are:
– What is in the data?
– What does it look like?
– Are there any unusual patterns?
– What does the data suggest for customer segmentation?
• The user may have no idea in advance which kinds of patterns are interesting.
• Functionalities of descriptive data mining are: Clustering, Summarization, Visualization, and Association.
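As an illustration of the Association functionality, the following is a minimal sketch that counts how often pairs of items occur together. The transactions, item names, and the helper functions `pair_support` and `confidence` are hypothetical and chosen only for this example; real association mining normally uses a dedicated algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets holding `antecedent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


support = pair_support(transactions)
most_common_pair = max(support, key=support.get)
bread_to_milk = confidence(transactions, "bread", "milk")
```

No question about a specific pair is asked up front: the pair with the highest support simply emerges from the data, which is the descriptive style of analysis.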
Predictive Data Mining:

X → Model → Y

• X: vector of independent variables.
• Y: dependent variable.
• Y = f(X)
• Users often do not care about the model itself; they are simply interested in the accuracy of its predictions.
• The model is trained using known examples, and the unknown function f is learned from the data.
• The more data with known outcomes are available, the better the predictive power of the model tends to be.
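To make Y = f(X) concrete, here is a minimal sketch that learns a straight line from known examples and then predicts an unseen case. It assumes a single independent variable and a linear f, which the slides do not specify; the data and the names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = a * x + b by ordinary least squares on known examples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(model, x):
    """Apply the learned function to a new input."""
    a, b = model
    return a * x + b


# Hypothetical known examples: inputs X with already observed outcomes Y.
known_x = [1.0, 2.0, 3.0, 4.0]
known_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(known_x, known_y)
forecast = predict(model, 5.0)  # outcome for an input not seen in training
```

The caller only uses `predict`; the fitted coefficients stay inside `model`, which mirrors the point that users mostly care about prediction accuracy rather than the model's internals.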
Predictive Data Mining:
• Used to predict outcomes whose inputs are known but whose output values are not yet realized.
• Never 100% accurate.
• A model's performance on past data, where the outcomes are already known, does not guarantee the same accuracy on new data.
• Suitable for data sets whose outcomes are unknown.
• Typical questions answered by predictive models are:
– Who is likely to respond to the next product?
– Which customers are likely to leave in the next six months?
Data Mining Process:

Problem Definition → Data Gathering & Preparation → Model Building & Evaluation → Knowledge Deployment

Problem Definition:
• Focuses on understanding the project objectives and requirements from a business perspective.
• E.g.: How can I sell more of my product to customers? Which customers are most likely to purchase the product?
Data Gathering and Preparation:
• Data collection and exploration.
• Identify data quality problems and patterns in the data.
• The data preparation phase covers all the tasks needed before the model can be built.
• Data preparation tasks are likely to be performed multiple times and not in any prescribed order.
Data Mining Process (cont.)
Model Building and Evaluation:
• Various modeling techniques are applied, and their parameters are calibrated to optimal values.
• Evaluate how well the model satisfies the originally stated business goal.

Knowledge Deployment:
• Use data mining within a target environment.
• Insight and actionable information can be derived from data.
Why Data Mining?

• Data mining is a multidisciplinary field. It can be applied in many domains and can be done using many algorithms and techniques.

Fig: Data mining at the center of its contributing disciplines: database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.
Data Mining Vs. Query Tools
• SQL answers ordinary queries against the database, such as “What is the average turnover?”, whereas data mining tools find interesting patterns and facts, such as “What are the important trends in sales?”
• Data mining is better suited than SQL to trend and pattern analysis, since it uses techniques such as machine learning and genetic algorithms.
• If we know exactly what we are looking for, we use SQL, but if we know only vaguely what we are looking for, we use data mining.
• Hybrid information cannot easily be traced using SQL.
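The contrast can be sketched with Python's built-in `sqlite3` module. The `sales` table and its figures are hypothetical, and the trend check at the end is only a crude stand-in for a real mining algorithm, not a technique taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, turnover REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 100.0), (2, 110.0), (3, 125.0), (4, 120.0), (5, 140.0), (6, 155.0)],
)

# Query-tool style: the question is known exactly, so SQL answers it directly.
(average_turnover,) = conn.execute("SELECT AVG(turnover) FROM sales").fetchone()

# Mining style (crude stand-in): scan the data for a pattern that was not
# named in advance, here the share of months in which turnover increased.
rows = conn.execute("SELECT turnover FROM sales ORDER BY month").fetchall()
values = [row[0] for row in rows]
changes = [later - earlier for earlier, later in zip(values, values[1:])]
rising_share = sum(1 for change in changes if change > 0) / len(changes)
trend = "rising" if rising_share > 0.5 else "not clearly rising"
```

The first query returns exactly what was asked for; the second part has to inspect the whole series before it can say anything, which is closer to how a mining tool works.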
Data Warehouse
• Most organizations run large databases for normal daily transactions; these are called operational databases.
• A data warehouse is a large database built from the operational databases.
• In computing, a data warehouse (DW or DWH), also known as
an enterprise data warehouse (EDW), is a system used for
reporting and data analysis, and is considered a core
component of business intelligence. DWs are central
repositories of integrated data from one or more disparate
sources.
Data Warehouse
• A data warehouse is a database, which is kept separate from the
organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the
organization to analyze its business.
• A data warehouse helps executives to organize, understand, and use their data to make strategic decisions.
• Data warehouse systems help integrate a diversity of application systems.
• A data warehouse system supports the analysis of consolidated historical data.
Data Warehouse

A data warehouse should be:

• Time-dependent
– There must be a connection between the information in the warehouse and the time when it was entered.
– This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
• Non-Volatile
– Data in a warehouse is never updated, but used only for queries.
– End-users who want to update data must use the operational database.
– A data warehouse will always be filled with historical data.
A data warehouse should be:
• Subject Oriented
– Not all the information in the operational database is useful for a data warehouse.
– A data warehouse should be designed especially for decision support and expert systems, with specific related data.
• Integrated
– In operational data, many types of information are used with different names for the same entity.
– In a data warehouse, all entities should be integrated and consistent, i.e. only one name must exist to describe each individual entity.
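The Integrated and Time-dependent properties can be sketched as follows. The source systems, field names, and the helpers `integrate` and `rows_for_period` are hypothetical; the sketch only shows records from two systems being renamed to one canonical vocabulary and stamped with the date they entered the warehouse.

```python
from datetime import date

# Hypothetical mapping from each source system's field names to the single
# canonical name used in the warehouse.
FIELD_MAP = {
    "billing": {"cust_id": "customer_id", "cust_nm": "customer_name"},
    "crm": {"client_no": "customer_id", "full_name": "customer_name"},
}


def integrate(source, record, load_date):
    """Rename source-specific fields and stamp the row with its load date."""
    mapping = FIELD_MAP[source]
    row = {mapping[field]: value for field, value in record.items() if field in mapping}
    row["source_system"] = source
    row["load_date"] = load_date.isoformat()
    return row


def rows_for_period(rows, year, month):
    """Select warehouse rows that were loaded in the given year and month."""
    prefix = f"{year:04d}-{month:02d}"
    return [row for row in rows if row["load_date"].startswith(prefix)]


# Non-volatile: rows are only appended to the warehouse, never updated in place.
warehouse_rows = [
    integrate("billing", {"cust_id": 7, "cust_nm": "A. Sharma"}, date(2023, 1, 31)),
    integrate("crm", {"client_no": 7, "full_name": "A. Sharma"}, date(2023, 2, 28)),
]
february_rows = rows_for_period(warehouse_rows, 2023, 2)
```

Both rows end up with the same field names regardless of where they came from, and the load date makes it possible to source information by period.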
Data Warehouse

Fig: “Architecture of a Data Warehouse”
Components shown: operational data and external data (sources); load manager; warehouse manager; detailed information, summary information, and meta data (stores); query manager. The figure is annotated Data → Information → Decision.

Three-Tier Architecture of Data Warehouse
Data Warehouse
• Load Manager: The system components that perform all the operations necessary to support the extract and load process. It fast-loads the extracted data into a temporary data store and performs simple transformations into a structure similar to the one in the data warehouse.
• Warehouse Manager: Performs all the operations necessary to support the warehouse management process. It analyzes the data to perform consistency and referential checks. It also transforms and merges the source data in the temporary data store into the published data warehouse, creates indexes and business views, updates all existing aggregations, and backs up data in the data warehouse.
• Query Manager: Performs all the operations necessary to support the query management process by directing queries to the appropriate tables. In some cases it also stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
Data Warehouse
• Detailed Information: Stores all the detailed information. The business requirements must be analyzed to determine the level at which to retain detailed information in the data warehouse.
• Summary Information: Stores all the predefined aggregations generated by the warehouse manager. It is a transient area which will change on an ongoing basis in order to respond to changing query profiles. It is essentially a replication of detailed information.
• Meta Data: Meta data is data about data, which describes how information is structured within a data warehouse. It maps data stores to a common view of information within the data warehouse.
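The relationship between the three stores can be sketched in a few lines. The sales rows, the `build_summary` helper, and the layout of the `metadata` dictionary are hypothetical; they only illustrate that summary information is derived from detailed information and that meta data records how.

```python
from collections import defaultdict

# Detailed information: one hypothetical row per individual sale.
detailed = [
    {"branch": "Kathmandu", "month": "2023-01", "dollars_sold": 120.0},
    {"branch": "Kathmandu", "month": "2023-01", "dollars_sold": 80.0},
    {"branch": "Pokhara", "month": "2023-01", "dollars_sold": 50.0},
    {"branch": "Kathmandu", "month": "2023-02", "dollars_sold": 200.0},
]


def build_summary(rows, group_by, measure):
    """Pre-aggregate a measure over the given grouping columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[column] for column in group_by)
        totals[key] += row[measure]
    return dict(totals)


# Summary information: a predefined aggregation that can be rebuilt at any
# time from the detailed rows, which is why it can be treated as transient.
summary = build_summary(detailed, ("branch", "month"), "dollars_sold")

# Meta data: data about data, describing where the summary comes from.
metadata = {
    "summary_by_branch_month": {
        "source": "detailed",
        "group_by": ["branch", "month"],
        "measure": "dollars_sold",
        "aggregation": "sum",
    }
}
```

If query profiles change, a different grouping can be passed to `build_summary` without touching the detailed rows.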
Data Warehouse Models
• From the perspective of data warehouse architecture, we have
the following data warehouse models:
• Virtual Warehouse
• Data mart
• Enterprise Warehouse
Virtual Warehouse
• A set of views over operational databases is known as a virtual warehouse.
• A virtual data warehouse provides a compact view of the data inventory.
• It contains meta data.
• It uses middleware to build connections to different data sources.
• It can be fast, because it lets users filter the most important pieces of data from different legacy applications.
• A virtual warehouse is easy to build.
• Building a virtual warehouse requires excess capacity on the operational database servers.
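A view over an operational table can be sketched with Python's built-in `sqlite3` module. The `orders` table, the `customer_totals` view, and the data are hypothetical; a real virtual warehouse would span several source systems through middleware.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Asha', 50.0), (2, 'Bikash', 75.0), (3, 'Asha', 25.0);

-- The view stores no data of its own; it is evaluated against the
-- operational table each time it is queried.
CREATE VIEW customer_totals AS
    SELECT customer, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer;
""")

totals_before = dict(conn.execute("SELECT customer, total_amount FROM customer_totals"))

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (4, 'Bikash', 25.0)")
totals_after = dict(conn.execute("SELECT customer, total_amount FROM customer_totals"))
```

Because every query on the view runs against the operational table, the analysis load lands on the operational server, which is why excess capacity is needed there.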
Data Mart:
• A data mart is a subset of the information content of a data warehouse that is stored in its own database.
• A data mart may or may not be sourced from an enterprise data warehouse, i.e. it could have been populated directly from source data.
• A data mart can improve query performance simply by reducing the volume of data that needs to be scanned to satisfy the query.
• Data marts are created along functional lines to reduce the likelihood of queries requiring data outside the mart.
• Data marts can help multiple query tools access data by creating their own internal database structures.
• E.g.: department store, banking system.
Enterprise Warehouse
• An enterprise warehouse collects all the information and the subjects spanning an entire organization.
• It provides enterprise-wide data integration.
• The data is integrated from operational systems and external information providers.
• This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Data Warehousing - Schemas

• A schema is a logical description of the entire database.
• It includes the name and description of records of all record types, including all associated data items and aggregates.
• Much like a database, a data warehouse also needs to maintain a schema.
• A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas.
Star Schema
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• The star schema example shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
Snowflake Schema
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike in the Star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The fact constellation example shows two fact tables, namely sales and shipping.
Example of Star Schema
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, state_or_province, country
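The star schema example can be written down as tables with Python's built-in `sqlite3` module. This is a minimal sketch: the time dimension is named `time_dim` here to keep the name distinct from SQL's time functions, the `avg_sales` measure is left out for brevity, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
    month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE branch (
    branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    state_or_province TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER, dollars_sold REAL);

INSERT INTO time_dim (time_key, quarter, year) VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO item (item_key, item_name) VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO branch (branch_key, branch_name) VALUES (1, 'Main');
INSERT INTO location (location_key, city) VALUES (1, 'Kathmandu'), (2, 'Pokhara');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 2, 2000.0),
    (1, 2, 1, 2, 5, 1500.0),
    (2, 1, 1, 1, 1, 1000.0),
    (2, 2, 1, 1, 3, 900.0);
""")

# Each dimension is reached with a single join from the fact table.
sales_by_city_quarter = conn.execute("""
    SELECT l.city, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN location AS l ON f.location_key = l.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()
```

Because every dimension has exactly one table, any breakdown of the measures needs at most one join per dimension.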
Example of Snowflake Schema
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
• supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
• city: city_key, city, state_or_province, country
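The effect of normalizing the item dimension can be sketched with Python's built-in `sqlite3` module. Only the item and supplier tables and a cut-down fact table are modeled, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- supplier_type has moved out of item into its own normalized table.
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item(item_key),
    units_sold INTEGER, dollars_sold REAL);

INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'direct');
INSERT INTO item VALUES
    (1, 'Laptop', 'Acme', 'electronics', 1),
    (2, 'Phone', 'Acme', 'electronics', 2),
    (3, 'Tablet', 'Zen', 'electronics', 1);
INSERT INTO sales_fact VALUES (1, 2, 2000.0), (2, 5, 1500.0), (3, 4, 1200.0);
""")

# Reaching supplier_type now takes two joins: fact -> item -> supplier.
units_by_supplier_type = conn.execute("""
    SELECT s.supplier_type, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN item AS i ON f.item_key = i.item_key
    JOIN supplier AS s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
```

Each supplier type is stored once instead of being repeated on every item row, at the cost of an extra join when querying by supplier.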
Example of Fact Constellation
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type
