Informatics: Lecture 6 - Processing Information
Introduction
We have no shortage of data about almost anything of interest.
A well-designed database can make that data easy to access.
SQL can be used for simple interrogations of the data.
A huge amount of useful information lies hidden, however: hence the need for data mining.
So in this lecture we will look at the elements of data mining.
We will begin, however, by looking at simple ways in which our original data may be processed so that the more complex stages later on are not compromised.
Processing data
Regardless of the source of the data, we can encounter a number of issues:
Errors: some data is wrong due to a fault or a simple transcription error.
Outliers: some data is very different to the rest; this can be significant if true.
Calibration: the data may need to be converted to a physical quantity to check.
Test artifact: it is sometimes possible to include an object in the data collection whose properties are well known; we can then check what has been recorded.
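The checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear calibration (gain and offset), the tolerance, the readings and the median-based outlier rule are all hypothetical choices, not something the lecture prescribes.

```python
from statistics import median

def calibrate(raw, gain, offset):
    """Convert a raw reading to a physical quantity (assumed linear model)."""
    return gain * raw + offset

def artifact_ok(recorded, known, tolerance):
    """Check the value recorded for a test artifact against its known value."""
    return abs(recorded - known) <= tolerance

def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` median absolute
    deviations (MAD) from the median. Flagged values still need a human
    decision: an outlier can be an error or a significant true reading."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(v - med) / mad > threshold]

# Hypothetical raw sensor counts and calibration constants.
raw_counts = [512, 520, 508, 515, 900, 511]
temperatures = [calibrate(r, gain=0.1, offset=-30.0) for r in raw_counts]
suspect = flag_outliers(temperatures)

# Hypothetical test artifact known to be at 25.0 units, recorded as 551 counts.
artifact_reading = calibrate(551, gain=0.1, offset=-30.0)
print(suspect, artifact_ok(artifact_reading, known=25.0, tolerance=0.5))
```

The median is used rather than the mean because a single extreme reading would otherwise shift the very baseline it is being compared with.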
With data that begins as analogue, especially audio and video, there are a number of processing methods that can be used to prepare the data for later stages:
Stretch: if the data can range from 0-100 but we only record 0-20, we can stretch the data to use the whole range.
Equalise: we can modify a range of 20-60 to use 0-100.
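Both operations, as described here, amount to a linear mapping from the recorded range onto the full range. A minimal sketch follows; the function name `rescale` and the sample values are illustrative assumptions. Note that histogram equalisation, as used in image processing, is a different non-linear technique and is not what this sketch does.

```python
def rescale(values, in_lo, in_hi, out_lo=0.0, out_hi=100.0):
    """Linearly map values from [in_lo, in_hi] onto [out_lo, out_hi]."""
    if in_hi == in_lo:
        raise ValueError("input range must have non-zero width")
    scale = (out_hi - out_lo) / (in_hi - in_lo)
    return [out_lo + (v - in_lo) * scale for v in values]

# Stretch: data recorded in 0-20 is spread over 0-100.
print(rescale([0, 5, 10, 20], in_lo=0, in_hi=20))   # [0.0, 25.0, 50.0, 100.0]

# A recorded range of 20-60 mapped onto 0-100.
print(rescale([20, 40, 60], in_lo=20, in_hi=60))    # [0.0, 50.0, 100.0]
```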
Filtering
Low-pass filter: removes hiss and noise
High-pass filter: removes rumble and hum
Band-pass: selective filtering
Examples
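As one illustration, a very simple low-pass filter is a moving average, and a crude high-pass filter is whatever remains once the smoothed signal is subtracted. This is a sketch under assumptions: real audio work would normally use properly designed filters, and the window size and sample signals here are hypothetical.

```python
def low_pass(signal, window=3):
    """Moving average: each output is the mean of up to `window` of the
    most recent samples, which smooths out rapid changes (hiss, noise)."""
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def high_pass(signal, window=3):
    """Subtract the smoothed signal, which removes slow changes
    (rumble, hum) and keeps the rapid ones."""
    smooth = low_pass(signal, window)
    return [s - m for s, m in zip(signal, smooth)]

steady = [5, 5, 5, 5]                 # no rapid changes at all
print(low_pass(steady))               # [5.0, 5.0, 5.0, 5.0]
print(high_pass(steady))              # [0.0, 0.0, 0.0, 0.0]

rapid = [1, -1, 1, -1, 1, -1]         # fastest possible alternation
print(low_pass(rapid, window=2))      # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

A band-pass effect can be approximated by applying `high_pass` with a long window and then `low_pass` with a short one, so that only the changes in between survive.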
Knowledge is power
Remember the hierarchy that we aspire to work through:
Data: facts and figures; accuracy is important
Information: organised data for analysis
Knowledge: interpretation to inform action
Application areas
Classification
Use data to predict the category of an object, e.g. someone to lend money to, or perhaps arrest, or perhaps someone who will make a certain kind of purchase, etc.
The result of a classification problem can be a decision tree, which shows how a new object can be classified on the basis of the existing data.
Classification
Data
age   cartype     risk
23    saloon      low
30    sports      low
36    saloon      low
25    hatchback   high
30    saloon      low
23    hatchback   high
30    hatchback   low
25    sports      high
18    saloon      low

Decision tree
Age
  <= 25: Car Type
    saloon: low risk
    sports, hatchback: high risk
  > 25: low risk
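The decision tree above can be written directly as nested conditions and checked against the nine rows of the table. This is a minimal sketch; the function name `classify` is an illustrative choice, and the tree is hand-coded from the slide rather than learned by an algorithm.

```python
# The training rows from the table: (age, cartype, risk).
ROWS = [
    (23, "saloon", "low"),
    (30, "sports", "low"),
    (36, "saloon", "low"),
    (25, "hatchback", "high"),
    (30, "saloon", "low"),
    (23, "hatchback", "high"),
    (30, "hatchback", "low"),
    (25, "sports", "high"),
    (18, "saloon", "low"),
]

def classify(age, cartype):
    """Follow the decision tree: split on age first, then on car type."""
    if age > 25:
        return "low"
    if cartype == "saloon":
        return "low"
    return "high"  # sports or hatchback, age <= 25

# Every existing row should be classified as recorded.
mismatches = [row for row in ROWS if classify(row[0], row[1]) != row[2]]
print(mismatches)                 # []

# A new object: a 22-year-old with a sports car.
print(classify(22, "sports"))     # high
```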
Estimation
Similar to classification in that a model is created.
The model allows the output of a continuous variable to be predicted.
The model could be a mathematical function to predict a value, or could be a theorem which then also predicts a value, or perhaps even a behaviour.
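One common example of a mathematical function used as a model is a straight line fitted by least squares. The sketch below is an assumption about what such a model might look like, not the lecture's own method, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Use the fitted model to estimate a continuous value."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical observations that happen to lie exactly on y = 1 + 2x.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model)               # (1.0, 2.0)
print(predict(model, 10))  # 21.0
```

Unlike the decision tree in classification, the output here is a number on a continuous scale rather than a category.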
Clustering
Can we analyse the data for a set of objects and identify sub-groups and their membership?
We may know the sub-groups and some existing members, and want to know what data helps identify which cluster a new object will belong to.
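For the second case, where the sub-groups and some members are already known, one simple approach is to assign a new object to the group whose centre is nearest. The sketch below assumes two numeric attributes per object and Euclidean distance; the group names and points are hypothetical.

```python
import math

def centroid(points):
    """Mean position of a group of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def assign(new_point, groups):
    """Return the name of the group whose centroid is closest."""
    centres = {name: centroid(members) for name, members in groups.items()}
    return min(centres, key=lambda name: math.dist(new_point, centres[name]))

# Hypothetical known sub-groups and their existing members.
groups = {
    "a": [(0, 0), (2, 0)],
    "b": [(10, 10), (12, 10)],
}
print(assign((1, 1), groups))   # a
print(assign((9, 9), groups))   # b
```

Discovering the sub-groups in the first place (the first question above) needs an iterative method such as k-means, which repeats this assignment step while re-estimating the centres.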
Association
Seeking co-occurrences of groups of data items in a data set.
Association can be in time, i.e. a sequential pattern.
Can be very popular with retailers, to target advertising for related purchases and for store layouts.
Association rules
Rules are of the form X => Y
where X and Y are distinct sets of items
[Figure: Venn diagram of all transactions, showing the transactions with X, the transactions with Y, and their overlap (transactions with X and Y)]
Items bought
milk, eggs, tea
butter, milk, sugar, tea
biscuits, sugar, eggs
tea, coffee, eggs
coffee, chocolate, sugar
Support, Confidence
20%, 50%
40%, 66.7%
20%, 33.3%
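Support and confidence can be computed directly from the five transactions listed above. In this sketch the support of X => Y is taken as the fraction of all transactions containing both X and Y, and the confidence as the fraction of transactions with X that also contain Y. The slide does not say which rules its three value pairs refer to, so the rule {tea} => {milk} is simply chosen here for illustration.

```python
TRANSACTIONS = [
    {"milk", "eggs", "tea"},
    {"butter", "milk", "sugar", "tea"},
    {"biscuits", "sugar", "eggs"},
    {"tea", "coffee", "eggs"},
    {"coffee", "chocolate", "sugar"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Fraction of all transactions containing both X and Y."""
    return count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction also containing Y."""
    return count(x | y, transactions) / count(x, transactions)

x, y = {"tea"}, {"milk"}
s = support(x, y, TRANSACTIONS)
c = confidence(x, y, TRANSACTIONS)
print(f"support {s:.0%}, confidence {c:.1%}")   # support 40%, confidence 66.7%
```

Tea appears in three of the five transactions and milk accompanies it in two of them, which gives 2/5 for support and 2/3 for confidence.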
Association - issues
The number of rules grows exponentially with the number of items.
The user specifies Minimum Support (e.g. 10%) and Minimum Confidence (e.g. 70%) levels.
Which rules are interesting? We need to define "interesting".
Negative rules can also be interesting, e.g. 70% of those buying crisps do not buy cream.
But allowing absence implies millions of useless rules!
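The exponential growth can be seen by counting every possible rule X => Y in which X and Y are non-empty and share no items. The sketch below counts them by brute force (each item goes in X, in Y, or in neither) and compares the result with the closed form 3^n - 2^(n+1) + 1; the function names are illustrative.

```python
from itertools import product

def count_rules(n_items):
    """Count rules X => Y over n items with X, Y non-empty and disjoint."""
    total = 0
    for assignment in product(("X", "Y", "neither"), repeat=n_items):
        if "X" in assignment and "Y" in assignment:
            total += 1
    return total

def closed_form(n_items):
    """3^n assignments, minus those with X empty or Y empty."""
    return 3 ** n_items - 2 ** (n_items + 1) + 1

for n in range(2, 9):
    print(n, count_rules(n), closed_form(n))
```

With only 6 items there are already 602 candidate rules, which is why minimum support and confidence thresholds are needed to prune the search.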
Hierarchies
Items are grouped, e.g. pen, pencil are writing tools.
Can have different rules for groups than for individual items, e.g. strong positive association between crisps and biscuits, but negative associations lower in hierarchy.
Use to define interesting, e.g. rules across groups can be more interesting than rules within groups.
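[Figure: item hierarchy showing a positive (+ve) association between Crisps and Biscuits, with negative (-ve) associations between items lower in the hierarchy]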
Process
Input data from repository
Data pre-processing (cleansing, quality)
Mining patterns
Data post-processing
Pre-processing
We need to understand the data that we are using: type and quality.
This will inform the mining technique to be used.
Data visualisation can also inform the mining process.
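A first look at type and quality can be as simple as recording, for each attribute, which value types occur and how many values are missing. The sketch below assumes the data arrives as a list of dictionaries with `None` marking a missing value; the function name `profile` and the sample rows are hypothetical.

```python
def profile(rows):
    """Summarise each column: the value types seen and the missing count."""
    summary = {}
    for row in rows:
        for column, value in row.items():
            info = summary.setdefault(column, {"types": set(), "missing": 0})
            if value is None:
                info["missing"] += 1
            else:
                info["types"].add(type(value).__name__)
    return summary

# Hypothetical rows: one age is missing and one was entered as text.
rows = [
    {"age": 23, "cartype": "saloon"},
    {"age": None, "cartype": "sports"},
    {"age": "30", "cartype": "saloon"},
]
report = profile(rows)
print(report["age"]["missing"], sorted(report["age"]["types"]))   # 1 ['int', 'str']
```

A column that mixes types, as `age` does here, is a sign that cleansing is needed before any mining technique is applied.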
Target Reading