Why Data Mining
in/v/140019/Weka-Tutorial-02-Data-Preprocessing-101--Data-Prep#course_14386
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/index.html
https://philippe-fournier-viger.com/spmf/videos/closed_video.mp4
https://dm.cs.tu-dortmund.de/en/mlbits/frequent-pattern-maximal-and-closed/
https://www.youtube.com/watch?v=5H1QxY5nj0o
https://www.youtube.com/watch?v=E1UzOR2fTjU
https://www.youtube.com/watch?v=T8IiEIUY01M
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared
for mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base.
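The seven steps above can be sketched end to end in a few lines. This is a toy illustration only, with made-up helper names and toy data, not anything a real tool such as Weka does internally:

```python
# Toy sketch of the knowledge discovery pipeline; each function
# stands in for one step of the process described above.

def clean(records):
    # 1. Data cleaning: drop records with missing (None) values
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # 2. Data integration: combine multiple data sources
    merged = []
    for s in sources:
        merged.extend(s)
    return merged

def select(records, fields):
    # 3. Data selection: keep only the task-relevant attributes
    return [{f: r[f] for f in fields} for r in records]

def transform(records, key, value):
    # 4. Data transformation: aggregate (here, sum per key)
    totals = {}
    for r in records:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals

def mine(totals, threshold):
    # 5. Data mining + 6. pattern evaluation: keep "interesting" keys
    return {k: v for k, v in totals.items() if v >= threshold}

src_a = [{"item": "milk", "qty": 2}, {"item": "bread", "qty": None}]
src_b = [{"item": "milk", "qty": 3}, {"item": "eggs", "qty": 1}]
data = select(clean(integrate(src_a, src_b)), ["item", "qty"])
patterns = mine(transform(data, "item", "qty"), threshold=2)
print(patterns)  # milk survives cleaning and aggregates to 5
```

Step 7, knowledge presentation, would render `patterns` for the user, e.g. as a chart or a report.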
The preceding view shows data mining as one step in the knowledge discovery process,
albeit an essential one because it uncovers hidden patterns for evaluation.
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.
LAB 1
https://sourceforge.net/projects/weka/
c:\Program Files\Weka-3-8-5\data
Diabetes.arff: Filter -> choose filters, unsupervised, attribute, NumericCleaner; attribute index 6 (mass), minDefault NaN, minThreshold 0.1E-7; OK, Apply.
Select mass, then Edit.
Filter: unsupervised, instance, RemoveWithValues; attribute index 6, matchMissingValues True; OK, Apply. Check mass in Edit: the rows with missing values are removed.
Impute: Undo, then choose filter unsupervised, attribute, ReplaceMissingValues; Apply. Check in Edit that the missing values were replaced.
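The three filters just applied can be sketched in plain Python. This is an illustration of the idea only, not Weka's implementation: NumericCleaner turns impossible values (a BMI of 0) into missing, RemoveWithValues drops rows with missing values, and ReplaceMissingValues imputes the attribute mean:

```python
import math

def numeric_cleaner(values, min_threshold, min_default):
    # Like Weka's NumericCleaner: values below minThreshold become
    # minDefault (here NaN, marking impossible zero readings as missing)
    return [min_default if v < min_threshold else v for v in values]

def remove_missing(values):
    # Like RemoveWithValues with matchMissingValues=True: drop missing rows
    return [v for v in values if not math.isnan(v)]

def impute_mean(values):
    # Like ReplaceMissingValues: fill missing entries with the attribute mean
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]

mass = [33.6, 0.0, 28.1, 0.0, 43.1]   # 0.0 is a recorded but impossible BMI
cleaned = numeric_cleaner(mass, 1e-8, float("nan"))
print(remove_missing(cleaned))        # the two zero rows are gone
print(impute_mean(cleaned))           # the two zeros become the mean
```

Note the order in the lab: removal and imputation are alternatives, which is why the instructions say Undo before switching from one to the other.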
Weather numeric data: Edit; the play probability percentage should be between 0 and 100. Filter: unsupervised, attribute, NumericCleaner; maxThreshold 100, minThreshold 0, maxDefault 100, minDefault 0.
Values 45 to 49 must become 50: closeTo 47, changeTo 50, closeToTolerance 3 (matches values less than 3 away), attribute indices 5; OK, Apply, Edit.
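The closeTo rule can be sketched as follows (an illustration of the behaviour, not Weka's code; the parameter names mirror the dialog):

```python
def close_to_rule(values, close_to, change_to, tolerance):
    # Values strictly less than `tolerance` away from closeTo are
    # replaced by changeTo: closeTo=47, changeTo=50, tolerance=3
    # maps 45..49 to 50, while 44 and 50 are left alone.
    return [change_to if abs(v - close_to) < tolerance else v for v in values]

temps = [44, 45, 46, 48, 49, 50, 51]
print(close_to_rule(temps, close_to=47, change_to=50, tolerance=3))
```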
Diabetes.arff: Filter: unsupervised, attribute, InterquartileRange; Apply. Two new attributes appear at positions 10 and 11 (outlier and extreme values); Edit.
Outlier removal: Filter: unsupervised, instance, RemoveWithValues; attribute index 10, nominal indices: last; Apply. Then Filter: unsupervised, instance, RemoveWithValues; attribute index 11, nominal indices: last; OK, Apply. Save as Diabetes1.arff.
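The InterquartileRange filter flags a value as an outlier or extreme value when it falls outside a multiple of the interquartile range around the quartiles. The sketch below illustrates the idea with a crude quartile estimate and the common default factors (3 for outliers, 6 for extremes); Weka's exact quartile method and defaults may differ:

```python
def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    # Flag each value relative to [Q1 - k*IQR, Q3 + k*IQR].
    # Quartiles here are a rough positional estimate, for illustration.
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1

    def flag(v):
        if v < q1 - extreme_factor * iqr or v > q3 + extreme_factor * iqr:
            return "extreme"
        if v < q1 - outlier_factor * iqr or v > q3 + outlier_factor * iqr:
            return "outlier"
        return "no"

    return [flag(v) for v in values]

data = [10, 11, 12, 11, 10, 12, 11, 200]
print(iqr_flags(data))  # only the 200 is flagged
```

Removing the rows flagged "yes" twice (once per new attribute), as in the lab, corresponds to dropping every value not flagged "no" here.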
Glass.arff
Filter: Unsupervised, attribute, Normalize (used for numeric attributes only). The defaults scale=1, translation=0 give values between 0 and 1; for -1 to +1 choose scale=2 and translation=-1. Select Normalize in the filter bar to edit these options. OK, Apply, Edit, Undo. Save will replace the file, so give a new name.
Filter: Unsupervised, attribute, Standardize (zero mean, unit variance), used for numeric attributes only.
Check each numeric attribute's mean and standard deviation.
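The two rescalings can be sketched in a few lines (illustrative only; whether the standard deviation divides by n or n-1 varies by tool, and n is used here):

```python
def normalize(values, scale=1.0, translation=0.0):
    # Min-max to [0,1], then x*scale + translation, as in Weka's
    # Normalize filter: scale=2, translation=-1 gives [-1, +1].
    lo, hi = min(values), max(values)
    return [((v - lo) / (hi - lo)) * scale + translation for v in values]

def standardize(values):
    # Zero mean, unit variance (z-scores), as in Weka's Standardize filter
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(normalize(x))                           # [0.0, 0.25, 0.5, 0.75, 1.0]
print(normalize(x, scale=2, translation=-1))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
z = standardize(x)                            # mean 0, std 1
```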
Unzip the archive.
Open auto-mpg.data and auto-mpg.names using Notepad; change the extension to .txt.
In Excel's Text Import Wizard, select auto-mpg-data.txt; Next; select delimiters Tab and Space; Next; Finish; put the data at =$A$1.
Weka: Tools, ArffViewer, File, Open, select the CSV file, save as ARFF.
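What the ArffViewer's "save as ARFF" step produces can be sketched as below. This is a simplified illustration of the file format, not Weka's converter: every column is declared numeric here, whereas real ARFF also supports nominal, string, and date attributes:

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    # Emit a minimal ARFF file: @relation, one @attribute per CSV
    # column (all numeric in this sketch), then the @data rows.
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in header]
    lines += ["", "@data"] + [",".join(r) for r in data]
    return "\n".join(lines)

sample = "mpg,horsepower\n18.0,130\n15.0,165\n"
print(csv_to_arff(sample, "auto-mpg"))
```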
Data cleaning using Weka:
Select Edit; you can find a lot of missing values, shown in grey.
Discretize
Open credit-g.arff
Select attribute age. Filter: unsupervised, attribute, Discretize; click on the Discretize bar: attribute indices 13 (for age), bin range precision (limit for decimal values) = 2, bins = 3; Apply. Save as type CSV.
Open the file in Excel, replace the interval values with Old, Middle and Young, and save the file as CSV.
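The unsupervised Discretize filter used above does equal-width binning: the attribute's range is split into `bins` intervals of equal width. A minimal sketch of that idea (illustrative only; the ages and labels are made up, and Weka emits interval strings rather than these labels):

```python
def discretize(values, bins=3, labels=None):
    # Equal-width binning: split [min, max] into `bins` intervals
    # of width (max - min) / bins and assign each value to one.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    out = []
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp max into last bin
        out.append(labels[i] if labels else i)
    return out

ages = [19, 23, 35, 41, 52, 67, 74]
print(discretize(ages, bins=3, labels=["Young", "Middle", "Old"]))
```

Relabelling the three intervals as Young, Middle and Old in Excel, as the lab does, is exactly the `labels` step here.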
Attribute evaluator: InfoGainAttributeEval
Start, check the results, Edit
Weka, filters, unsupervised, attribute, NumericToNominal; click on the bar, attribute indices 1, Apply
5. Normalize
Open iris.arff
7. Best attributes:
Weka, filters, supervised, attribute, AttributeSelection
Weka, Select attributes tab: choose ClassifierSubsetEval; click, classifier, choose NaiveBayes; OK, Start
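InfoGainAttributeEval (used above) scores each attribute by information gain: the class entropy minus the class entropy after splitting on the attribute. A self-contained sketch of that computation, on a made-up toy dataset:

```python
from math import log2

def entropy(labels):
    # H(Y) = -sum p(y) log2 p(y)
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(feature, labels):
    # IG(Y; X) = H(Y) - H(Y | X), the core of InfoGainAttributeEval
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

outlook = ["sunny", "sunny", "rain", "rain"]
play    = ["no",    "no",    "yes",  "yes"]
print(info_gain(outlook, play))  # 1.0: outlook fully determines play
```

ClassifierSubsetEval takes a different approach: instead of a per-attribute score, it evaluates whole attribute subsets by how well a chosen classifier (NaiveBayes here) performs on them.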
8. Finding Outliers
Open file cpu.arff
Two extra columns added. Select column outlier, set class as outlier, visualize
Attribute outlier has two values no(1) and yes(2). We want to remove outliers, so nominal indices=2 or
last.
9. Numeric transform
Iris.arff: Weka filter, unsupervised, attribute, NumericTransform; method name: floor
10. PCA
Open file cpu.arff, filter, unsupervised, attribute, PrincipalComponents, click, variance covered:0.95, ok,
apply.
Check the variance/std deviation on the right; the first component has the maximum variance. Set a threshold of 50% of that maximum: all the other components fall below it, so select them (2, 3, 4, 5) and click Remove.
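PCA centers the data, builds the covariance matrix, and projects onto the eigenvectors with the largest variance; keeping components until 95% of the variance is covered is what the varianceCovered option controls. The sketch below shows the idea for 2-D data only, where the principal axis of the 2x2 covariance matrix has a closed form; it is an illustration, not Weka's PrincipalComponents implementation:

```python
from math import atan2, cos, sin

def pca_2d_scores(points):
    # Center the data, form the 2x2 covariance matrix, and project
    # each point onto the first principal axis (largest variance).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form angle of the first principal axis for a 2x2 matrix
    theta = 0.5 * atan2(2 * cxy, cxx - cyy)
    ax, ay = cos(theta), sin(theta)
    return [(p[0] - mx) * ax + (p[1] - my) * ay for p in points]

pts = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]   # perfectly correlated toy data
scores = pca_2d_scores(pts)                  # one component captures everything
```

For these perfectly correlated points one component carries all the variance, which is the extreme case of the lab's observation that the later components can be removed.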
Sparse dataset