
01 IntroToDMandDBMS

This document provides an overview of an introductory data mining course. It discusses several key topics: 1. Application scenarios for data mining like product recommendations, resource management, and medical analysis. 2. Basic concepts of data mining including the three main processes of data collection, feature extraction/cleaning, and analytical processing using methods like classification, clustering, and association rule mining. 3. Different data types like numeric, categorical, text and their usage in data mining. The document provides examples of how to characterize features in a sample dataset.


COMP5009

DATA MINING

WEEK 1
INTRO TO DATA MINING AND DATABASE MANAGEMENT SYSTEMS
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2, 2022
DATA MINING

 Application Scenarios
 What is Data Mining?
 Basic Data Types
 Data Mining Tasks

Reading: Aggarwal Ch 1

COMP5009 – DATA MINING, CURTIN UNIVERSITY


APPLICATION SCENARIOS

 Store product placement
   Customer ease of access or increased sales
 Customer recommendations
   Recommend other products
 Resource use (eg, Power, Internet)
   Ensure availability, detect anomalies
 Traffic flow (data, power, road users)
   Reduce congestion, plan interruptions

Image credits:
www.microkhan.com/2010/09/23/just-rats-in-a-maze-market/
www.maps.google.com
medium.com/@tomar.ankur287/user-user-collaborative-filtering-recommender-system-51f568489727



APPLICATION SCENARIOS

 Medical patient tracking
   Preventative medicine
 Web log analysis
   Detect intrusion or exfiltration events
 Financial transactions
   Identify fraud, advertise services

Image credits:
www.analytics.google.com
www.freepic.com



DATA MINING IN EVERYDAY LIFE

Describe a situation in which you have been exposed to a data mining application:
-
-
-
-
-



WHAT IS DATA MINING?

<rant>
 Mining is the process of extracting nuggets of
value from a bulk of material.
 Data are the bulk, we want nuggets of
information
 We mine for information, which we synthesize
into knowledge and hopefully transform
into wisdom
</rant>



WHAT IS DATA MINING?

Data mining is a process:
1. Data collection
2. Feature extraction and data cleaning
3. Analytical processing

All the fancy buzzwords that you have heard, including Machine Learning, Deep Learning, Artificial Intelligence, and Predictive Analytics, are all part of stage 3.
None of these fancy things are possible without stages 1 and 2.



DATA COLLECTION

 Mining requires that we process an enormous amount of material to find the valuable components
 More data != better
 Better data == better
 Collecting the right data is key
 Data collection should be informed by the expected analysis



DATA COLLECTION

 May require specialized hardware or software
 Manual vs automated collection methods
 Very application specific
 Good data collection requires understanding of the end goals
 Data storage is important and can inform the collection method (and vice versa)

 Context or metadata is still data
 Remember the adage: Garbage In, Garbage Out
 Data mining cannot be used to find what is not there



FEATURE EXTRACTION OR DATA CLEANING

Feature extraction
 Transform data into features or attributes
 What features are relevant / available?
 Usually requires domain knowledge
 Are there any redundant or nuisance features?

Data cleaning
 How do you deal with missing or incomplete data?
 Are partial records useful?
 What storage format is most useful?



ANALYTICAL PROCESSING

 Select methods
 Build models
 Train, test, validate
 Deploy, predict, analyze
 Feedback, evaluate

Aggarwal CH1.2:
"""
[…] The entire data mining process is an art form, which is based on the skill of the analyst, and cannot be fully captured by a single technique or building block.
"""
Therefore: we must become expert in the creation or selection of building blocks, and on the integration of these blocks into a workflow.



BASICS OF DATA TYPES



NOMENCLATURE

Dataset, D (matrix)
Record, X4 (vector)
Feature, x43 (scalar)
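
The nomenclature can be sketched in a few lines of Python. This is only an illustration: it assumes that X4 denotes the fourth record of D and that x43 denotes the third feature of that record, and the numbers are invented.

```python
# Dataset D: a matrix with one row per record and one column per feature.
# The values are invented purely for illustration.
D = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]

# Record X4: the fourth row of D, a vector (Python indexes from 0).
X4 = D[3]

# Feature x43: the third entry of the fourth record, a scalar.
x43 = X4[2]

print(X4)   # the whole record
print(x43)  # a single value
```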



DATA DEPENDENCY

Nondependency-oriented data
 No dependency between attributes
 No dependency between records
 Relationships or similarity do not imply dependency
   Age may be correlated with income, but one is not dependent on the other
 Most common data

Dependency-oriented data
 Implicit or explicit relationships exist between attributes/records
 Network data, or time series data have an explicit relation between records and are thus dependent
 The nature and degree of dependency make analysis harder



DATA DEPENDENCY

 Discuss: Does this data set represent dependency or non-dependency oriented data?



DATA TYPES

Types Examples

 Quantitative or Numeric  -

 Categorical  -

 Binary  -

 Set  -

 Text  -



DATA TYPES

 What data type would you assign to each of the features in this data set?
 Teaching week -

 Week starting -

 Workshop -

 Prac -

 Notes -



BUILDING BLOCKS OF DATA MINING

Aggarwal CH 1.4:
"""
[…] Data mining is all about finding summary relationships between the entries in the data matrix that are
either unusually frequent or unusually infrequent.
"""
General Goal: Finding relationships between entries in the data matrix.
 Between rows -> data clustering, outlier analysis

 Between columns -> association pattern mining, data classification


Learning Modes:
 Supervised -> desired outputs are available

 Unsupervised -> no additional information



DATA MINING TASKS



THE 4 SUPER-PROBLEMS IN DATA MINING

Association pattern mining:
find items that co-occur

Data clustering:
group samples that share some similarity

Data classification:
predict the labels of test samples

Outlier analysis:
find samples that are different from the norm



ASSOCIATION PATTERN MINING

 What patterns exist within the data?
 If you saw someone buying 1kg sugar, large box of tea bags, and a large tin of instant coffee, what would you recommend?
 If you saw someone buying a gas bbq, matches, firestarters, and briquettes, what would you recommend?
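
One simple way to look for such patterns is to count how often pairs of items appear in the same basket. The sketch below is not from the lecture: the baskets are hypothetical, and pair counting is only the most basic form of association pattern mining.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each basket is a set of items.
baskets = [
    {"sugar", "tea bags", "instant coffee", "milk"},
    {"sugar", "tea bags", "milk"},
    {"instant coffee", "milk", "biscuits"},
    {"gas bbq", "matches", "firestarters"},
    {"tea bags", "milk", "biscuits"},
]

# Count how often each pair of items appears in the same basket.
# Sorting the basket gives every pair a consistent (alphabetical) order.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets that contain both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}

for pair, n in pair_counts.most_common(3):
    print(pair, n, support[pair])
```

A pair with high support (here milk and tea bags) is a candidate for a recommendation: a customer who buys one is shown the other.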



DATA CLUSTERING

Identification of groups based on similarity between features

Figure credit: Pietka, Fender & Keane 2015MNRAS.446.3687P
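
As a rough illustration of grouping by similarity, here is a minimal k-means sketch. The slide does not name a clustering algorithm, so k-means is a swapped-in example; the points and the starting centroids are made up.

```python
def kmeans(points, centroids, iterations=10):
    """Very small k-means: returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


# Two features per record; two visually obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```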



DATA CLASSIFICATION

 Use relationships between features to classify records.
 A form of prediction
 Eg: On a rainy day with high heat, normal humidity, and no wind, should we go and play golf?

Outlook   Temperature  Humidity  Windy  Play Golf
Rainy     Hot          High      False  No
Rainy     Hot          High      True   No
Overcast  Hot          High      False  Yes
Sunny     Mild         High      False  Yes
Sunny     Cool         Normal    False  Yes
Sunny     Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Rainy     Mild         High      False  No
Rainy     Cool         Normal    False  Yes
Rainy     Hot          Normal    False  ?
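
One possible way to answer the question is a naive Bayes classifier, sketched below. The slide does not prescribe a method, so this is only one of several reasonable choices. The nine labelled rows are copied from the slide; the absence of smoothing is a simplifying assumption.

```python
from fractions import Fraction

# (Outlook, Temperature, Humidity, Windy) -> Play Golf, copied from the slide.
records = [
    (("Rainy", "Hot", "High", False), "No"),
    (("Rainy", "Hot", "High", True), "No"),
    (("Overcast", "Hot", "High", False), "Yes"),
    (("Sunny", "Mild", "High", False), "Yes"),
    (("Sunny", "Cool", "Normal", False), "Yes"),
    (("Sunny", "Cool", "Normal", True), "No"),
    (("Overcast", "Cool", "Normal", True), "Yes"),
    (("Rainy", "Mild", "High", False), "No"),
    (("Rainy", "Cool", "Normal", False), "Yes"),
]


def naive_bayes_scores(records, query):
    """Score each label as P(label) * product of P(feature value | label)."""
    scores = {}
    for label in {lab for _, lab in records}:
        rows = [features for features, lab in records if lab == label]
        score = Fraction(len(rows), len(records))  # prior P(label)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows if row[i] == value)
            score *= Fraction(matches, len(rows))  # likelihood, no smoothing
        scores[label] = score
    return scores


# Rainy day, high heat, normal humidity, no wind.
query = ("Rainy", "Hot", "Normal", False)
scores = naive_bayes_scores(records, query)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Under these assumptions "No" scores 1/48 and "Yes" scores 4/375, so this classifier would stay home. The two closest rows in the table disagree with each other, so a different method could reasonably give a different answer.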



OUTLIER DETECTION / ANALYSIS

 Inverse of clustering
 Why might outliers be important to identify?
 -
 -
 -
 -

https://www.perthnow.com.au/business/agriculture/big-knickers-a-standout-on-myalup-farm-ng-b881032899z
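
A minimal sketch of outlier detection on a single numeric feature, using a z-score rule. The rule, the threshold of 2, and the readings are assumptions for illustration and are not taken from the lecture.

```python
from statistics import mean, pstdev

# Hypothetical daily resource-use readings with one unusual day.
readings = [10, 11, 9, 10, 12, 10, 11, 50]

mu = mean(readings)
sigma = pstdev(readings)  # population standard deviation
threshold = 2.0           # flag values more than 2 standard deviations away

outliers = [x for x in readings if abs(x - mu) / sigma > threshold]
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so this simple rule can miss outliers when several are present.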



SUMMARY

 Data mining is the process of extracting information from data.
 We use this information to synthesize knowledge which in turn is transformed into wisdom
 There is no one solution, we must become expert in the creation or selection of building blocks, and on the integration of these blocks into a workflow.

Aggarwal Chapters
• 1 (all)



DATABASE MANAGEMENT SYSTEMS

 What is a DBMS?
 Structure of a relational DBMS
 SQL and how to work with a DBMS

Reading: Silberschatz CH 1, 2, 3



NEARLY ALL DIGITAL DATA IS STORED IN A DATABASE
WE USE A DATABASE MANAGEMENT SYSTEM TO STORE/RETRIEVE THIS DATA



WHAT IS A DATABASE MANAGEMENT SYSTEM (DBMS)

 A database (DB) can be thought of as a collection of tables
 The tables consist of rows and columns
 Each column has a defined data-type
 A database-management system (DBMS) consists of a collection of interrelated data and a collection of programs to access those data. [Silberschatz ch 1.10]

[Silberschatz fig 1.1]

WHY USE A DBMS? (INSTEAD OF A FLAT FILE SYSTEM)

Flat file system problems
 Data redundancy (storing the same data multiple times)
 Data inconsistency (multiply stored data may not agree)
 Data isolation (varying formats of data)
 -
 Concurrent edits

Database solutions
 Store data once, then link to data from multiple places
 -
 Defined schema means fewer formats
 Access can be granted per table, or in aggregate
 Atomic transactions
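
To make "atomic transactions" concrete, here is a small sketch using Python's built-in sqlite3 module. The account table, the names, and the balances are hypothetical. The point is that a transfer either happens completely or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table account (name text primary key, "
    "balance integer check (balance >= 0))"
)
conn.executemany("insert into account values (?, ?)", [("A", 30), ("B", 100)])
conn.commit()

# Try to move 50 from A to B. A only holds 30, so the second update
# violates the check constraint and the whole transfer must be undone.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("update account set balance = balance + 50 where name = 'B'")
        conn.execute("update account set balance = balance - 50 where name = 'A'")
except sqlite3.IntegrityError:
    pass  # the credit to B was rolled back together with the failed debit

balances = dict(conn.execute("select name, balance from account"))
print(balances)
```

With flat files the first write would already be on disk when the second one failed, which leaves the data inconsistent.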



TWO CORE COMPONENTS OF A DBMS

The database
 Tables
 Indexes for tables
 Schema to describe table structure and rules for consistency
 A storage array

The management system
 Manage the storage and retrieval of data via
   Data-definition language (DDL)
   Data-manipulation language (DML)
   SQL is both DML + DDL
 Ensure integrity of database
 Control access and permissions



DBMS ADMIN AND USERS

Administrators
 Define schema
 Define storage structure and access method
 Modify schema and physical organization
 Manage security and permissions
 Perform maintenance

Users
 Naive users
 Application programmers
 Sophisticated users (analysts)
 Specialized users: write specialized database applications



INSTANCES AND SCHEMA

 Logical Schema: the overall logical structure of the database


 Customers and accounts in a bank and the relationship between them

 Physical schema: the overall physical structure of the database


 How and where the data are stored

 Physical Data Independence: the ability to modify the physical schema without changing the logical schema
 Applications and users depend on the logical schema but need not know about the physical schema

 Instance: the actual content of the database at a particular point in time



SERVER / CLIENT MODELS FOR DATABASE APPLICATIONS

 Two-tier system
   Application emits SQL statements to interact with DBMS
   Changing DBMS means changing/updating client apps
 Three-tier system
   Client app uses an API to make requests to the server app
   Server app translates API into SQL
   Changing DBMS means changing only the server app
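
The three-tier idea can be sketched in one Python file, with functions standing in for the separate programs. Everything here is hypothetical: the function names, the instructor data, and the use of an in-memory SQLite database in place of a real DBMS server.

```python
import sqlite3

# --- Data tier: an in-memory SQLite database stands in for the DBMS. ---
conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (name text, dept_name text)")
conn.executemany(
    "insert into instructor values (?, ?)",
    [("Ada", "Computing"), ("Marie", "Physics"), ("Alan", "Computing")],
)


# --- Middle tier: the server app turns API calls into SQL. ---
def list_instructors(dept_name):
    rows = conn.execute(
        "select name from instructor where dept_name = ? order by name",
        (dept_name,),
    )
    return [name for (name,) in rows]


# --- Client tier: knows the API only, never the SQL or the DBMS. ---
def client_show_department(dept_name):
    return ", ".join(list_instructors(dept_name))


print(client_show_department("Computing"))
```

If the DBMS were swapped, only `list_instructors` (the server app) would need to change; the client code would stay the same.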



SERVER / CLIENT MODELS FOR DATABASE APPLICATIONS

 How would moving the DBMS, or swapping to a backup instance be handled in each case?
 In the era of mobile and in-browser apps, which of the two models would you use?



DATA MODELS

Tools for describing
 Data
 Data relationships
 Data semantics
 Data constraints

Entity-Relationship Model
 Abstract concept of the database
 Useful in the design process
 Good for people who aren't implementing the DB themselves
 Optional reading Ch 6

Relational Model
 An implementation of the database concept
 Defines formats and relations in database-friendly language
 Required for admin to create a schema



UNIVERSITY DATABASE MODEL EXAMPLE

 Silberschatz use an example of a university DBMS
 There are many tables:
   Some describe objects like classrooms, courses, or people
   Some describe relationships like teaches or takes
 We refer to these tables as relations regardless of what they represent



DATABASE NOMENCLATURE

 relation: table
 tuple: row
 attribute: column
 domain: set of permitted values of an attribute
 Primary key: attribute(s) whose combined value is unique for every tuple
   Can be a single attribute (eg Staff ID)
   Can be a combination of attributes (course_id, sec_id, semester, year)
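
A short sketch of what a composite primary key enforces, using Python's sqlite3 module. The relation name `section` and the inserted values are assumptions; only the four key attributes come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table section (
        course_id text,
        sec_id    text,
        semester  text,
        year      integer,
        primary key (course_id, sec_id, semester, year)
    )
    """
)

conn.execute("insert into section values ('COMP5009', '1', 'S2', 2022)")
# Same course, different sec_id: the combination is still unique, so this is fine.
conn.execute("insert into section values ('COMP5009', '2', 'S2', 2022)")

# Repeating an existing combination violates the primary key.
try:
    conn.execute("insert into section values ('COMP5009', '1', 'S2', 2022)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("select count(*) from section").fetchone()[0]
print(duplicate_rejected, row_count)
```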



SCHEMA
Relation Attributes

Primary key – underlined attribute(s)



SCHEMA

 What do s_ID and i_ID represent here in the advisor relation?
 Do we need attributes to have unique names across all relations?



SCHEMA DIAGRAM

 Relations

 Attributes

 Primary keys

 Foreign keys (arrows)



SCHEMA OPERATIONS

 create: creates new databases, tables, views

 drop: removes commands, views, tables, databases

 alter: modifies existing database schema



CREATING RELATIONS IN SQL
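
A hedged sketch of creating two relations, run through Python's sqlite3 module so it can be tried without installing a server. The attribute lists follow the Silberschatz university example as far as this lecture uses them (name, dept_name, building, salary); ID, budget, and the column sizes are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table department (
        dept_name varchar(20),
        building  varchar(15),
        budget    numeric(12,2),
        primary key (dept_name)
    );

    create table instructor (
        ID        varchar(5),
        name      varchar(20) not null,
        dept_name varchar(20),
        salary    numeric(8,2),
        primary key (ID),
        foreign key (dept_name) references department (dept_name)
    );
    """
)

# sqlite_master is SQLite's own catalogue of schema objects.
tables = [
    row[0]
    for row in conn.execute(
        "select name from sqlite_master where type = 'table' order by name"
    )
]
print(tables)
```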



DATA TYPES

Standard/base types
 char(n) - fixed length string
 varchar(n) - variable length string with max length n
 int, smallint, real, double - numbers
 numeric(p,d) - fixed point number with p digits, including d decimal places
 float(n) - float with precision of at least n digits

Different DBMS will allow additional data types such as
 Date
 Time
 Vector (eg 3-tuple of float)
 Blobs (eg image data)
 Sets



RELATION OPERATIONS

 Select: retrieve data

 Insert: add new data

 Update: modify existing data

 Delete: remove existing data



CREATE, INSERT, DROP, DELETE, UPDATE AND ALTER

Create vs Insert
 When would you use create?
 When would you use insert?

Drop vs Delete
 What is the difference between these two commands:
 > drop table course;
 > delete from course;

Update vs Alter
 Which of the two operations requires a larger change to the database?
 > alter table instructor add "notes" varchar(10);
 > update instructor set salary=salary*1.05;
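
The differences can be checked directly. The sketch below runs the four commands against a throwaway SQLite database; the table contents are hypothetical and the alter statement is written as `alter table`, which is the form SQLite accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table course (course_id text, title text)")
conn.execute("create table instructor (name text, salary real)")
conn.execute("insert into course values ('COMP5009', 'Data Mining')")
conn.execute("insert into instructor values ('Ada', 1000.0)")


def table_names():
    rows = conn.execute("select name from sqlite_master where type = 'table'")
    return sorted(name for (name,) in rows)


# delete removes the rows but keeps the (now empty) relation.
conn.execute("delete from course")
rows_after_delete = conn.execute("select count(*) from course").fetchone()[0]
tables_after_delete = table_names()

# drop removes the relation itself from the schema.
conn.execute("drop table course")
tables_after_drop = table_names()

# update changes data only; the schema is untouched.
conn.execute("update instructor set salary = salary * 1.05")

# alter changes the schema; every existing tuple gains a new, empty attribute.
conn.execute('alter table instructor add "notes" varchar(10)')
columns = [row[1] for row in conn.execute("pragma table_info(instructor)")]

print(rows_after_delete, tables_after_delete, tables_after_drop, columns)
```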



SQL QUERIES

select name, instructor.dept_name, building
from instructor, department
where instructor.dept_name=department.dept_name;



SQL QUERIES

select name, instructor.dept_name, building
from instructor, department;
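
The effect of the where clause can be seen by running the query with and without it on a tiny database. The rows below are hypothetical; the table and column names are the ones used in the queries on these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table department (dept_name text, building text)")
conn.execute("create table instructor (name text, dept_name text)")
conn.executemany(
    "insert into department values (?, ?)",
    [("Physics", "Building A"), ("Computing", "Building B")],
)
conn.executemany(
    "insert into instructor values (?, ?)",
    [("Ada", "Computing"), ("Alan", "Computing"), ("Marie", "Physics")],
)

# With the where clause: each instructor is paired with their own department.
joined = conn.execute(
    "select name, instructor.dept_name, building "
    "from instructor, department "
    "where instructor.dept_name = department.dept_name"
).fetchall()

# Without it: every instructor is paired with every department
# (a Cartesian product), including pairings that make no sense.
product = conn.execute(
    "select name, instructor.dept_name, building "
    "from instructor, department"
).fetchall()

print(len(joined), len(product))
```

With 3 instructors and 2 departments the first query returns 3 rows and the second returns 3 x 2 = 6.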



SQL SOFTWARE

 Sqlite3 – Basic DB, uses single file for storage. No user management/permissions. Good for
small/simple DB.
 PostgreSQL – Open/Free and fully featured DBMS. Overhead to setup/maintain.

 We will explore this further in the practical next week



SUMMARY

 A database-management system (DBMS) consists of a collection of interrelated data and a collection of programs to access those data.
 Visualize as linked tabular data:
   Tables/Rows/Columns -> Relations/Tuples/Attributes
 SQL provides data definition and manipulation functions
   Schema: create, drop, alter
   Data: select, delete, update, insert

Silberschatz Chapters
• 1 (all)
• 2 (all)
• 3.1-3.3, 3.9



NEXT: DATA PREPARATION
AGGARWAL CHAPTER 2

