01 IntroToDMandDBMS
DATA MINING
WEEK 1
INTRO TO DATA MINING AND DATABASE MANAGEMENT SYSTEMS
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2, 2022
DATA MINING
Application Scenarios
AGGARWAL CH 1
What is Data Mining?
Basic Data Types
Data Mining Tasks
Customer recommendations
www.microkhan.com/2010/09/23/just-rats-in-a-maze-market/
Recommend other products
medium.com/@tomar.ankur287/user-user-collaborative-filtering-recommender-system-51f568489727
Financial transactions
Identify fraud, advertise services
www.freepic.com
Describe a situation in which you have been exposed to a data mining application:
-
-
-
-
-
<rant>
Mining is the process of extracting nuggets of value from a bulk of material.
Data are the bulk; we want nuggets of information.
We mine for information, which we synthesize into knowledge and hopefully transform into wisdom.
</rant>
Very application specific
Good data collection requires understanding of the end goals
Data storage is important and can inform the collection method (and vice versa)
Remember the adage: Garbage In, Garbage Out
Data mining cannot be used to find what is not there
Transform data into features or attributes
Select methods
Build models
Train, test, validate
Deploy, predict, analyze
Feedback, evaluate
How do you deal with missing or incomplete data?
Aggarwal CH1.2:
"""
[…] The entire data mining process is an art form, which is based on the skill of the analyst, and cannot be fully captured by a single technique or building block.
"""
Therefore:
we must become expert in the creation or selection of building blocks, and in the integration of these blocks into a workflow.
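The train, test, validate step of the workflow can be sketched with the standard library alone. This is a minimal illustration, not a method from Aggarwal: the toy data, the 60/20/20 split, and the names split_dataset, fit_majority_class and accuracy are all assumptions made for the example, and the "model" is only a majority-class baseline.

```python
import random
from collections import Counter


def split_dataset(rows, train_frac=0.6, validate_frac=0.2, seed=0):
    """Shuffle rows and split them into train / validate / test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_validate = int(len(shuffled) * validate_frac)
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # whatever is left over
    return train, validate, test


def fit_majority_class(train):
    """'Build a model': here it is just the most common label in the training set."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predicted_label, rows):
    """Fraction of rows whose true label equals the single predicted label."""
    return sum(label == predicted_label for _, label in rows) / len(rows)


# Hypothetical toy data: (feature, label) pairs.
data = [(x, "high" if x > 3 else "low") for x in range(10)]

train, validate, test = split_dataset(data)
model = fit_majority_class(train)
print("validation accuracy:", accuracy(model, validate))
print("test accuracy:", accuracy(model, test))
```

The validation partition is used while choosing between building blocks; the test partition is held back until the end so that the final evaluation is not biased by those choices.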
Feature, x43
(scalar)
Types Examples
Quantitative or Numeric -
Categorical -
Binary -
Set -
Text -
Week starting -
Workshop -
Prac -
Notes -
Aggarwal CH 1.4:
"""
[…] Data mining is all about finding summary relationships between the entries in the data matrix that are
either unusually frequent or unusually infrequent.
"""
General Goal: Finding relationships between entries in the data matrix.
Between rows -> data clustering, outlier analysis
Outlier analysis:
find samples that are different from
the norm
Inverse of clustering
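One simple way to "find samples that are different from the norm" is a z-score test on a single numeric attribute. The sketch below is only an illustration under that assumption; the sample values, the threshold of 2 standard deviations, and the function name find_outliers are hypothetical, and real outlier analysis usually needs methods suited to the data type.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical measurements: six similar samples and one that is far from the rest.
samples = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0]
outliers = find_outliers(samples)
print(outliers)  # [25.0]
```

Note that the outlier itself inflates the mean and standard deviation, which is one reason robust statistics (median-based measures) are often preferred in practice.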
-
-
-
https://www.perthnow.com.au/business/agriculture/big-knickers-a-standout-on-myalup-farm-ng-b881032899z
Data redundancy (storing the same data multiple times)
    Store data once, then link to data from multiple places
Data inconsistency (multiply stored data may not agree)
    -
Data isolation (varying formats of data)
    Defined schema means fewer formats
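"Store data once, then link to it" can be illustrated with two linked tables in an in-memory SQLite database. This is a sketch under assumptions: the department table, its columns, and the sample rows are hypothetical and chosen only to show the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The building is stored once, in the department table.
    CREATE TABLE department (
        dept_name TEXT PRIMARY KEY,
        building  TEXT
    );
    -- Instructors link to a department instead of repeating its building.
    CREATE TABLE instructor (
        id        INTEGER PRIMARY KEY,
        name      TEXT,
        dept_name TEXT REFERENCES department(dept_name)
    );
    INSERT INTO department VALUES ('Physics', 'Building 1');
    INSERT INTO instructor VALUES (1, 'Alice', 'Physics'), (2, 'Bob', 'Physics');
""")

# A single update is enough because the value exists in only one place.
conn.execute("UPDATE department SET building = 'Building 2' WHERE dept_name = 'Physics'")

rows = conn.execute("""
    SELECT i.name, d.building
    FROM instructor AS i
    JOIN department AS d ON i.dept_name = d.dept_name
    ORDER BY i.name
""").fetchall()
print(rows)  # [('Alice', 'Building 2'), ('Bob', 'Building 2')]
```

Had the building been copied into every instructor row, the same change would need one update per copy, and any missed copy would leave the data inconsistent.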
Administrators Users
Physical Data Independence: the ability to modify the physical schema without changing the
logical schema
Applications and users depend on the logical schema but need not know about the physical schema
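A small way to see physical data independence is to add an index, which changes how the data are stored and searched without changing the tables that applications see. The sketch below assumes SQLite and a hypothetical course table; the same query text is run before and after the physical change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (course_id TEXT, title TEXT, credits INTEGER)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [("C1", "Data Mining", 4), ("C2", "Databases", 4), ("C3", "Statistics", 3)],
)

# The application only knows the logical schema: table and column names.
query = "SELECT title FROM course WHERE credits = 4 ORDER BY title"
before = conn.execute(query).fetchall()

# Physical change: add an index. The logical schema is untouched.
conn.execute("CREATE INDEX idx_course_credits ON course (credits)")
after = conn.execute(query).fetchall()

print(before == after)  # True: the application's query did not need to change
```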
Two-tier system
Application emits SQL statements to interact with DBMS
Changing DBMS means changing/updating client apps
Three-tier system
Client app uses an API to make requests to the server app
Server app translates API into SQL
Changing DBMS means changing only the server app
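The three-tier arrangement can be sketched in a single file, with a class standing in for the server app. Everything here is an assumption for illustration: the ServerApp class, its add_course and list_titles methods, and the course table are hypothetical, and a real server app would normally sit behind a network API.

```python
import sqlite3


class ServerApp:
    """Middle tier: the only code that knows SQL and which DBMS is in use."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE course (course_id TEXT PRIMARY KEY, title TEXT)"
        )

    def add_course(self, course_id, title):
        # API call translated into SQL.
        self._conn.execute("INSERT INTO course VALUES (?, ?)", (course_id, title))
        self._conn.commit()

    def list_titles(self):
        rows = self._conn.execute("SELECT title FROM course ORDER BY title").fetchall()
        return [title for (title,) in rows]


def client_app(server):
    """Client tier: uses the API only and contains no SQL."""
    server.add_course("DM101", "Data Mining")
    server.add_course("DB101", "Databases")
    return server.list_titles()


titles = client_app(ServerApp())
print(titles)  # ['Data Mining', 'Databases']
```

Swapping SQLite for another DBMS would mean rewriting ServerApp, while client_app stays as it is.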
Tools for describing data:
Entity-Relationship Model: abstract concept of the
Relational Model: an implementation of the
relation: table
tuple: row
attribute: column
Relations
Attributes
Primary keys
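The three terms can be seen together in a small SQLite session: the table is the relation, its columns are the attributes, each row is a tuple, and the primary key stops two tuples sharing the same identifier. The instructor columns and sample values below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Relation (table) with three attributes (columns); id is the primary key.
conn.execute("""
    CREATE TABLE instructor (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        salary REAL
    )
""")
conn.execute("INSERT INTO instructor VALUES (1, 'Alice', 90000.0)")

# A second tuple with the same primary key value is rejected by the DBMS.
try:
    conn.execute("INSERT INTO instructor VALUES (1, 'Bob', 80000.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# PRAGMA table_info returns one row per column; index 1 holds the column name.
attributes = [col[1] for col in conn.execute("PRAGMA table_info(instructor)")]
tuples = conn.execute("SELECT * FROM instructor").fetchall()

print(attributes)          # ['id', 'name', 'salary']
print(tuples)              # [(1, 'Alice', 90000.0)]
print(duplicate_rejected)  # True
```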
When would you use create?
When would you use insert?
What is the difference between these two commands:
> drop table course;
> delete from course;
Which of the two operations requires a larger change to the database?
> alter table instructor add "notes" varchar(10);
> update instructor set salary=salary*1.05;
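One way to explore these questions is to run the four commands against a throwaway in-memory SQLite database and inspect what is left afterwards. This is a sketch under assumptions: the table layouts and sample rows are hypothetical, and the ALTER statement is written in the ALTER TABLE ... ADD COLUMN form that SQLite accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (course_id TEXT, title TEXT)")
conn.execute("CREATE TABLE instructor (name TEXT, salary REAL)")
conn.execute("INSERT INTO course VALUES ('DM101', 'Data Mining')")
conn.execute("INSERT INTO instructor VALUES ('Alice', 1000.0)")


def table_names():
    """List the tables currently defined in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# delete removes the rows, but the (now empty) table is still defined.
conn.execute("DELETE FROM course")
rows_after_delete = conn.execute("SELECT COUNT(*) FROM course").fetchall()[0][0]
tables_after_delete = table_names()

# drop removes the table definition itself.
conn.execute("DROP TABLE course")
tables_after_drop = table_names()

# alter changes the schema: every tuple gains a new attribute.
conn.execute("ALTER TABLE instructor ADD COLUMN notes VARCHAR(10)")
columns = [col[1] for col in conn.execute("PRAGMA table_info(instructor)")]

# update changes stored values only; the schema stays the same.
conn.execute("UPDATE instructor SET salary = salary * 1.05")
salary = conn.execute("SELECT salary FROM instructor").fetchall()[0][0]

print(rows_after_delete, tables_after_delete)
print(tables_after_drop)
print(columns, salary)
```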
SQLite3 – Basic DB, uses a single file for storage. No user management/permissions. Good for small/simple DBs.
PostgreSQL – Open/free and fully featured DBMS. Overhead to set up/maintain.