
CIS62283 02 PreProcessing

The document discusses different types of data and data mining processes. It describes data as a collection of objects and their attributes. There are different types of datasets including records, matrices, documents, transactions, graphs, and ordered data like spatial, temporal, sequential, and genetic data. The data mining process involves data preprocessing steps like handling missing data, smoothing noisy data, and identifying outliers before analysis.

Data Mining

Data – Steps – Preprocessing

Ahmad Afif Supianto


Overview
• Data
• Types of Datasets
• Data Mining Process
• Data Quality
• Data Preprocessing
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
  • Object is also known as record, point, case, sample, entity, or instance

(Columns are attributes; rows are objects.)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  • The same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  • Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But properties of attribute values can be different
      • ID has no limit, but age has a maximum and minimum value
Types of Attributes
• There are different types of attributes
  • Nominal
    • Examples: ID numbers, eye color, zip codes
  • Ordinal
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  • Interval
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit
  • Ratio
    • Examples: temperature in Kelvin, length, time, counts
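The difference between interval and ratio attributes can be illustrated with a small sketch. The temperatures below are made-up values, and `celsius_to_kelvin` is a helper introduced only for this example. The point is that differences are meaningful on both scales, while ratios are meaningful only on the ratio scale (Kelvin).

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale temperature to the ratio-scale Kelvin value."""
    return celsius + 273.15

# Two illustrative temperatures in Celsius (assumed values).
c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = c_high - c_low

# The Celsius ratio is 2.0, but 20 °C is not "twice as hot" as 10 °C,
# because the Celsius zero point is arbitrary.
ratio_celsius = c_high / c_low

# On the Kelvin scale the zero point is absolute, so the ratio is meaningful.
ratio_kelvin = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference, ratio_celsius, round(ratio_kelvin, 3))
```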
Discrete and Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables
  • Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables
Any Questions?
Types of Datasets
• Record
  • Data Matrix
  • Document Data
  • Transaction Data

• Graph
  • World Wide Web
  • Molecular Structures

• Ordered
  • Spatial Data
  • Temporal Data
  • Sequential Data
  • Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
• Each document becomes a 'term' vector,
  • each term is a component (attribute) of the vector,
  • the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
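As a minimal sketch of how a document becomes a term vector, the snippet below counts term occurrences with `collections.Counter`. The sample sentence, the shortened vocabulary, and the `term_vector` helper are assumptions made for illustration. Splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return the number of times each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]

# Assumed vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "Team play beat solo play the coach said so the team kept the ball"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each position in `vector` corresponds to one term (attribute), so a set of documents over the same vocabulary forms a data matrix.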
Transaction Data
• A special type of record data, where
  • each record (transaction) involves a set of items.
  • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
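One way to work with transaction data is to store each transaction as a set of items and then convert it to a binary record with one 0/1 attribute per item. The sketch below uses the five transactions from the table. The variable names are illustrative, and the binary layout is one possible representation, not the only one.

```python
# Transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Every distinct item becomes one binary attribute (sorted for a stable order).
items = sorted(set().union(*transactions.values()))

# One row per transaction: 1 if the item was purchased, 0 otherwise.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, row in binary_rows.items():
    print(tid, row)
```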
Graph Data
• Examples: Generic graph and HTML Links
(Figure: a generic graph with numeric labels.)
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
(Figure: a sequence of transactions; each element of the sequence is a set of items/events.)
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Any Questions?
Data Mining Process
Steps of Data Mining
1. Learning the application domain:
   • relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of effort!)
4. Data reduction and transformation:
   • Find useful features, dimensionality/variable reduction, invariant representation
5. Choosing functions of data mining:
   • summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation:
   • visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  • Noise and outliers
  • Missing values
  • Duplicate data
Noise
• Noise refers to modification of original values
  • Examples: distortion of a person's voice when talking on a poor phone connection and "snow" on a television screen

(Figure: two sine waves, and the same two sine waves with noise added.)
Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
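The slides do not prescribe a detection method. As one common option, the sketch below flags outliers with the interquartile range (IQR) rule, which is a technique swapped in here for illustration. The values are the Taxable Income column (in thousands) from the record table, and the multiplier `k=1.5` is the conventional default, not something taken from the slides.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1                    # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Taxable Income from the record table, in thousands.
incomes_k = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

flagged = iqr_outliers(incomes_k)
print(flagged)
```

Whether a flagged value is an error or a legitimate but unusual object still has to be decided from domain knowledge.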
Missing Values
• Reasons for missing values
  • Information is not collected (e.g., people decline to give their age and weight)
  • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values
  • Eliminate data objects
  • Estimate missing values
  • Ignore the missing value during analysis
  • Replace with all possible values (weighted by their probabilities)
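Two of the strategies listed above, eliminating objects and estimating missing values, can be sketched as follows. The `ages` list is made-up data with `None` marking a missing value, and using the mean as the estimate is only one simple choice among many.

```python
from statistics import mean

# Assumed sample data; None marks a missing value.
ages = [25, None, 31, 40, None, 28]

# Strategy 1: eliminate data objects that have a missing value.
dropped = [age for age in ages if age is not None]

# Strategy 2: estimate missing values, here with the mean of the observed ones.
fill_value = mean(dropped)
imputed = [fill_value if age is None else age for age in ages]

print(dropped)
print(imputed)
```

Dropping objects shrinks the data set, while mean imputation keeps every object but reduces the variability of the attribute.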
Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates, of one another
  • Major issue when merging data from heterogeneous sources

• Examples:
  • Same person with multiple email addresses

• Data cleaning
  • Process of dealing with duplicate data issues
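A minimal sketch of merging near-duplicate records is shown below, using the "same person with multiple email addresses" case. The records, the `normalize` helper, and the choice of the normalized name as the matching key are all assumptions for illustration. Real deduplication usually needs stronger matching than a name alone.

```python
# Assumed records from two sources; the first two describe the same person.
records = [
    {"name": "Ana Putri", "email": "ana@example.com"},
    {"name": "ana  putri", "email": "a.putri@example.org"},
    {"name": "Budi Santoso", "email": "budi@example.com"},
]

def normalize(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

# Merge records that share a normalized name, keeping every email address.
merged = {}
for record in records:
    key = normalize(record["name"])
    person = merged.setdefault(key, {"name": key, "emails": []})
    person["emails"].append(record["email"])

for person in merged.values():
    print(person)
```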
Any Questions?
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)

• Purpose
  • Data reduction
    • Reduce the number of attributes or objects
  • Change of scale
    • Cities aggregated into regions, states, countries, etc.
  • More "stable" data
    • Aggregated data tends to have less variability
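The "less variability" claim can be checked on a toy example. The sketch below aggregates twelve made-up daily values into three period averages and compares the standard deviations. The numbers and the group size of four are assumptions chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean, pstdev

# Assumed daily measurements (12 objects).
daily = [12, 18, 9, 21, 15, 14, 20, 15, 16, 13, 19, 8]

# Aggregate every 4 consecutive days into one object (data reduction).
group_size = 4
period_means = [
    mean(daily[start:start + group_size])
    for start in range(0, len(daily), group_size)
]

print(period_means)
print(round(pstdev(daily), 2), round(pstdev(period_means), 2))
```

The aggregated series has fewer objects and a smaller spread, at the cost of losing the day-level detail.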
Sampling
• Sampling is the main technique employed for data selection.
  • It is often used for both the preliminary investigation of the data and the final data analysis.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Sampling …
• The key principle for effective sampling is the following:
  • Using a sample will work almost as well as using the entire data set, if the sample is representative
  • A sample is representative if it has approximately the same property (of interest) as the original set of data
Types of Sampling
• Simple Random Sampling
  • There is an equal probability of selecting any particular item

• Sampling without replacement
  • As each item is selected, it is removed from the population

• Sampling with replacement
  • Objects are not removed from the population as they are selected for the sample
  • In sampling with replacement, the same object can be picked more than once

• Stratified sampling
  • Split the data into several partitions; then draw random samples from each partition
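The sampling schemes above can be sketched with Python's standard `random` module. The population, the seed, and the `stratified_sample` helper are illustrative assumptions. `Random.sample` draws without replacement and `Random.choices` draws with replacement.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 11))

# Without replacement: each item can appear at most once.
without_replacement = rng.sample(population, k=5)

# With replacement: the same item may be picked more than once.
with_replacement = rng.choices(population, k=5)

def stratified_sample(rows, key, k, rng):
    """Partition rows by key(row), then draw up to k rows from each partition."""
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k=min(k, len(members))))
    return sample

# Assumed rows: (class label, object id), with an imbalanced class split.
rows = [("Yes", i) for i in range(3)] + [("No", i) for i in range(7)]
stratified = stratified_sample(rows, key=lambda row: row[0], k=2, rng=rng)

print(without_replacement, with_replacement, stratified)
```

Drawing the same number from each partition keeps the rare class represented, which a simple random sample of the same size might miss.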
Dimensionality Reduction
• Purpose:
  • Reduce amount of time and memory required by data mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant features or reduce noise

• Techniques
  • Principal Component Analysis
  • Singular Value Decomposition
  • Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in data
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
(Figure: data points plotted on axes x1 and x2.)
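For two attributes (x1, x2), the eigenvectors of the covariance matrix can be computed in closed form, which makes the PCA steps easy to follow without a linear algebra library. The sketch below is restricted to the 2-D case. The `pca_2d` function and the sample points are assumptions for illustration.

```python
import math

def pca_2d(points):
    """Return ((lam1, lam2), direction) for the 2x2 sample covariance matrix.

    lam1 >= lam2 are the eigenvalues; direction is the unit eigenvector of
    lam1, i.e. the axis that captures the largest amount of variation.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector belonging to lam1.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Assumed points lying exactly on the line x2 = 2 * x1.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
(lam1, lam2), direction = pca_2d(points)
print(lam1, lam2, direction)
```

Because these points lie on a line, the second eigenvalue is essentially zero, so projecting onto `direction` reduces the data to one dimension with almost no loss.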
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
  • duplicate much or all of the information contained in one or more other attributes
  • Example: purchase price of a product and the amount of sales tax paid

• Irrelevant features
  • contain no information that is useful for the data mining task at hand
  • Example: students' ID is often irrelevant to the task of predicting students' GPA
Feature Subset Selection
• Techniques:
  • Brute-force approach:
    • Try all possible feature subsets as input to data mining algorithm
  • Embedded approaches:
    • Feature selection occurs naturally as part of the data mining algorithm
  • Filter approaches:
    • Features are selected before data mining algorithm is run
  • Wrapper approaches:
    • Use the data mining algorithm as a black box to find best subset of attributes
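The brute-force idea can be sketched by enumerating every non-empty feature subset and scoring each one with a black-box function. In the sketch below, `toy_score` is a hypothetical stand-in for "train the data mining algorithm on this subset and measure its quality". The feature names and weights are invented for illustration.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty subset of the features (2**n - 1 of them)."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def best_subset(features, evaluate):
    """Return the subset with the highest score from the black-box evaluate."""
    return max(all_subsets(features), key=evaluate)

# Hypothetical usefulness of each feature; student_id carries no information.
USEFUL = {"attendance": 0.3, "study_hours": 0.5}

def toy_score(subset):
    """Stand-in for model quality, with a small penalty per extra feature."""
    gain = sum(USEFUL.get(feature, 0.0) for feature in subset)
    return gain - 0.01 * len(subset)

features = ["student_id", "attendance", "study_hours"]
chosen = best_subset(features, toy_score)
print(chosen)
```

The number of subsets doubles with each added feature, which is why brute force is only practical for a handful of features.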
Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes

• Three general methodologies:
  • Feature Extraction
    • domain-specific
  • Mapping Data to New Space
  • Feature Construction
    • combining features
Discretization Using Class Labels
• Entropy based approach

(Figure: two panels, "3 categories for both x and y" and "5 categories for both x and y".)
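A minimal sketch of the entropy based idea follows: choose the split point on a continuous attribute that gives the lowest weighted class entropy. It finds a single binary split only, whereas a full discretization would apply the step recursively. The data are the Taxable Income and Cheat columns from the record table, and the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best

# Taxable Income (thousands) and Cheat label from the record table.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

threshold, weighted_entropy = best_split(income, cheat)
print(threshold, weighted_entropy)
```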


Attribute Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
  • Simple functions: x^k, log(x), e^x, |x|
  • Standardization and Normalization
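Standardization and normalization can be sketched as below. The slide does not define the two terms, so this assumes the common readings: standardization as the z-score (subtract the mean, divide by the standard deviation) and normalization as min-max rescaling to [0, 1]. The sample values are made up, and both functions assume the values are not all identical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # assumes sigma > 0
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale the values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]

# Assumed sample values with mean 5 and standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]

z_scores = standardize(values)
scaled = min_max(values)
print(z_scores)
print(scaled)
```

Both transformations preserve the order of the values, so each old value can still be identified with its new one.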
Any Questions?
Case Study
• Research title:
  • Predicting Student Graduation Based on Academic Performance Using a Data Mining Approach in the Information Systems Study Program, Faculty of Computer Science, Universitas Brawijaya
• Data collection
  • Information Systems (SI) students, 2011-2016 intake cohorts
  • 1352 records
  • 30 attributes
    • Student ID, gender, admission path, IP Beban semester 1, SKS Beban semester 1, IPK Beban semester 1, SKSK Beban semester 1, IP lulus semester 1, SKS lulus semester 1, IPK lulus semester 1, SKSK lulus semester 1, …, SKSK lulus semester 4, IPK lulus, predikat (graduation honors), and Yudisium (graduation decision)
    • (IP = semester GPA, IPK = cumulative GPA, SKS = semester credits, SKSK = cumulative credits; Beban = attempted, lulus = passed)
Dataset
• Data cleaning
  • Noisy, inconsistent, irrelevant, or empty data
  • SAP students → treated differently
  • Data after cleaning: 522 records
• Attribute selection
  • Focus on academic performance
  • Interview with the head of the SI study program
Data Transformation
Data Transformation
Mining Steps
• Choosing the data mining task
  • Classification
• Choosing the mining algorithm
  • Naïve Bayes algorithm
• Evaluating the algorithm
  • Confusion matrix → model accuracy
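The last evaluation step, going from a confusion matrix to model accuracy, can be sketched as follows. The class labels and predictions are invented for illustration and are not results from the case study.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Share of objects on the diagonal, i.e. predicted label equals actual."""
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / sum(matrix.values())

# Hypothetical labels for five students.
actual = ["on_time", "late", "on_time", "on_time", "late"]
predicted = ["on_time", "late", "late", "on_time", "late"]

matrix = confusion_matrix(actual, predicted)
print(dict(matrix))
print(accuracy(matrix))
```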
Any Questions?
Review Data Mining Process
Business – Question – Driven Process
Any Questions?
Assignment
• Find a dataset
  • UCI Machine Learning, or
  • Kaggle Dataset
• Perform pre-processing on that dataset
  • Discuss with your group (2-3 students) which pre-processing steps could be applied to the dataset
  • Write down sample data and how it is processed
  • Do the work on a sheet of paper → photograph it → upload it to Google Classroom
