
Reference links:

- https://edurev.in/v/140019/Weka-Tutorial-02-Data-Preprocessing-101--Data-Prep#course_14386
- http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/index.html
- https://philippe-fournier-viger.com/spmf/videos/closed_video.mp4
- https://dm.cs.tu-dortmund.de/en/mlbits/frequent-pattern-maximal-and-closed/
- https://www.youtube.com/watch?v=5H1QxY5nj0o
- https://www.youtube.com/watch?v=E1UzOR2fTjU
- https://www.youtube.com/watch?v=T8IiEIUY01M
- Sequential pattern mining (Bharani Priya)
- SPADE algorithm (GRIETCSEPROJECTS)
- GSP (Shivani Srivarshini)
- https://www.youtube.com/watch?v=KTDZBd638s0 (SPADE)

Why Data Mining?


We live in a world where vast amounts of data are collected daily. Analyzing such data
is an important need, and data mining can meet it by providing tools to discover
knowledge from data.
1.1.2 Data Mining as the Evolution of Information Technology
Data mining can be viewed as a result of the natural evolution of information technology.
The database and data management industry evolved through the development of
several critical functionalities (Figure 1.1): data collection and database creation, data
management (including data storage and retrieval and database transaction processing),
and advanced data analysis (involving data warehousing and data mining). The early
development of data collection and database creation mechanisms served as a prerequisite
for the later development of effective mechanisms for data storage and retrieval,
as well as query and transaction processing. Nowadays numerous database systems
offer query and transaction processing as common practice. Advanced data analysis has
naturally become the next step.
After the establishment of database management systems, database technology
moved toward the development of advanced database systems, data warehousing, and
data mining for advanced data analysis and web-based databases. Advanced database
systems incorporate new and powerful data models such as extended-relational,
object-oriented, and object-relational models. Application-oriented database
systems have flourished, including spatial, temporal, multimedia, active, stream and
sensor, scientific and engineering databases, knowledge bases, and office information
bases.
Advanced data analysis sprang up from the late 1980s onward.
This technology provides a great boost to the database and information
industry, and it enables a huge number of databases and information repositories to be
available for transaction management, information retrieval, and data analysis. Data
can now be stored in many different kinds of databases and information repositories.
One emerging data repository architecture is the data warehouse. This is a repository of
multiple heterogeneous data sources organized under a unified
schema at a single site to facilitate management decision making. Data warehouse
technology includes data cleaning, data integration, and online analytical processing
(OLAP)—that is, analysis techniques with functionalities such as summarization,
consolidation,
and aggregation, as well as the ability to view information from different
angles. Although OLAP tools support multidimensional analysis and decision making,
additional data analysis tools are required for in-depth analysis—for example, data mining
tools that provide data classification, clustering, outlier/anomaly detection, and the
characterization of changes in data over time.
Huge volumes of data have been accumulated beyond databases and data warehouses.
During the 1990s, the World Wide Web and web-based databases (e.g., XML
databases) began to appear. Internet-based global information bases, such as the WWW
and various kinds of interconnected, heterogeneous databases, have emerged and play
a vital role in the information industry. The effective and efficient analysis of data in
such different forms, through the integration of information retrieval, data mining, and
information network analysis technologies, is a challenging task.
In summary, the abundance of data, coupled with the need for powerful data analysis
tools, has been described as a data rich but information poor situation (Figure 1.2).
The fast-growing, tremendous amount of data, collected and stored in large and numerous
data repositories, has far exceeded our human ability for comprehension without powerful
tools. As a result, data collected in large data repositories become “data tombs”—data
archives that are seldom visited. Consequently, important decisions are often made
based not on the information-rich data stored in data repositories but rather on a decision
maker’s intuition, simply because the decision maker does not have the tools to
extract the valuable knowledge embedded in the vast amounts of data. Efforts have
been made to develop expert system and knowledge-based technologies, which typically
rely on users or domain experts to manually input knowledge into knowledge bases.
Unfortunately, however, the manual knowledge input procedure is prone to biases and
errors and is extremely costly and time consuming. The widening gap between data and
information calls for the systematic development of data mining tools that can turn data
tombs into “golden nuggets” of knowledge.

1.2 What Is Data Mining?


It is no surprise that data mining, as a truly interdisciplinary subject, can be defined
in many different ways. Even the term data mining does not really present all the major
components in the picture. Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or KDD.

The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared
for mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base.
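The steps above can be pictured as a chained pipeline, each stage consuming the previous stage's output. A toy sketch (the stage functions are illustrative stand-ins, not a real mining library):

```python
# A toy KDD pipeline: each stage transforms the data produced by the
# previous one, mirroring steps 1-5 above.
def kdd_pipeline(raw, stages):
    data = raw
    for stage in stages:
        data = stage(data)
    return data

clean     = lambda d: [x for x in d if x is not None]  # 1. data cleaning
select    = lambda d: d[:3]                            # 3. data selection
transform = lambda d: [x * 2 for x in d]               # 4. data transformation
mine      = lambda d: {"max": max(d)}                  # 5. pattern extraction

result = kdd_pipeline([1, None, 2, 3, 4], [clean, select, transform, mine])
print(result)  # {'max': 6}
```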
The preceding view shows data mining as one step in the knowledge discovery process,
albeit an essential one because it uncovers hidden patterns for evaluation.
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.

LAB 1

Weka 3: Data Mining with open source machine learning software

https://sourceforge.net/projects/weka/

c:\Program Files\Weka-3-8-5\data

Open Diabetes.arff. Filter -> Choose -> filters, unsupervised, attribute, NumericCleaner; set
attributeIndices to 6 (mass), minDefault to NaN, minThreshold to 0.1E-7; OK, Apply.
Select the mass attribute and use Edit to inspect it.

Filter -> unsupervised, instance, RemoveWithValues; set attributeIndex to 6 and
matchMissingValues to True; OK, Apply. Check mass and use Edit to confirm the rows were removed.
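In terms of the underlying computation, these two filter steps mark implausible values as missing and then drop the affected rows. A rough Python sketch of the same logic (Weka's NumericCleaner and RemoveWithValues are the authoritative implementations; the toy mass column is invented):

```python
import math

def numeric_cleaner(column, min_threshold=1e-8, min_default=float("nan")):
    # NumericCleaner: values below minThreshold are replaced by
    # minDefault (NaN, i.e. "missing", here).
    return [min_default if v < min_threshold else v for v in column]

def remove_with_missing(rows, index):
    # RemoveWithValues with matchMissingValues=True: drop every row
    # whose attribute at `index` is missing.
    return [row for row in rows if not math.isnan(row[index])]

mass = [33.6, 0.0, 28.1]          # a body-mass reading of 0.0 is really missing
cleaned = numeric_cleaner(mass)   # the 0.0 becomes NaN
rows = [[v] for v in cleaned]
kept = remove_with_missing(rows, 0)
print(len(kept))  # 2 rows survive
```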

Impute: Undo, then choose Filter -> unsupervised, attribute, ReplaceMissingValues; Apply; Edit to confirm the values were replaced.
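ReplaceMissingValues imputes each numeric attribute with the mean of its observed values (and nominal attributes with the mode). A minimal sketch of the numeric case:

```python
import math

def replace_missing_with_mean(column):
    # Impute NaN entries with the mean of the non-missing values,
    # as Weka's ReplaceMissingValues does for numeric attributes.
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

print(replace_missing_with_mean([33.6, float("nan"), 28.1]))  # NaN -> 30.85
```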

Open weather.numeric.arff and Edit: the play probability percentage should be between 0 and 100.
Filter -> unsupervised, attribute, NumericCleaner; set maxThreshold to 100, minThreshold to 0,
maxDefault to 100, and minDefault to 0.

Values from 45 to 49 must become 50: set closeTo to 47, changeTo to 50, and closeToTolerance to 3
(meaning a difference of less than 3); set attributeIndices to 5; OK, Apply, Edit.
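The closeTo option can be sketched as a simple replacement rule (parameter names follow the note above; Weka's NumericCleaner is the authoritative implementation):

```python
def close_to(column, target=47, replacement=50, tolerance=3):
    # Values strictly within `tolerance` of `target` become `replacement`,
    # so 45..49 map to 50 but 44 and 50 are left alone.
    return [replacement if abs(v - target) < tolerance else v for v in column]

print(close_to([45, 47, 49, 44, 50]))  # [50, 50, 50, 44, 50]
```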

Open Diabetes.arff. Filter -> unsupervised, attribute, InterquartileRange; Apply. Two new
attributes, Outlier and ExtremeValue, appear at positions 10 and 11; Edit to inspect.

Outlier detection: Filter -> unsupervised, instance, RemoveWithValues; attributeIndex 10,
nominalIndices last; Apply. Repeat with attributeIndex 11, nominalIndices last; OK, Apply.
Save as Diabetes1.arff.
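The InterquartileRange filter flags a value as an outlier when it lies beyond Q3 + 3*IQR (or below Q1 - 3*IQR), and as an extreme value beyond a factor of 6. A sketch using Python's statistics module (its quartile interpolation differs slightly from Weka's, so boundary cases may not match exactly; the toy data is invented):

```python
import statistics

def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    # Flag each value as "no", "outlier", or "extreme" based on its
    # distance from the interquartile range, like Weka's InterquartileRange.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    def flag(v):
        if v > q3 + extreme_factor * iqr or v < q1 - extreme_factor * iqr:
            return "extreme"
        if v > q3 + outlier_factor * iqr or v < q1 - outlier_factor * iqr:
            return "outlier"
        return "no"
    return [flag(v) for v in values]

data = [10, 12, 11, 13, 12, 11, 10, 90]
print(iqr_flags(data))  # the 90 is flagged as "extreme"
```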

Glass.arff

Normalize: open weather.numeric.arff (ARFF = Attribute-Relation File Format).

Filter -> unsupervised, attribute, Normalize (used for numeric attributes only). Select Normalize
in the filter bar to edit its options: scale = 1 and translation = 0 give values between 0 and 1;
for values between -1 and +1, choose scale = 2 and translation = -1. OK, Apply, Edit, Undo.
Save will replace the file, so give it a new name.

Filter -> unsupervised, attribute, Standardize (zero mean, unit variance); used for numeric
attributes only. Check each numeric attribute's mean and standard deviation.
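Normalize rescales each numeric attribute with x' = (x - min) / (max - min) * scale + translation, while Standardize maps it to zero mean and unit variance. A minimal sketch of both (the toy temperature values are invented):

```python
import statistics

def normalize(column, scale=1.0, translation=0.0):
    # Weka's Normalize: the defaults give [0, 1]; scale=2, translation=-1
    # gives [-1, +1].
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) * scale + translation for v in column]

def standardize(column):
    # Weka's Standardize: subtract the mean, divide by the std deviation.
    mean, sd = statistics.mean(column), statistics.stdev(column)
    return [(v - mean) / sd for v in column]

temps = [64, 68, 72, 80]
print(normalize(temps))                           # values in [0, 1]
print(normalize(temps, scale=2, translation=-1))  # values in [-1, +1]
print(standardize(temps))                         # mean 0, unit variance
```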

Rushdi Shams' Weka tutorials:

DataSources: ArffLoader

Evaluation: ClassAssigner

Filters: supervised, AttributeSelection

Visualization: TextViewer


Convert csv files to arff files

Download files from https://archive.ics.uci.edu/dataset/9/auto+mpg

Unzip

Open the files auto-mpg.data and auto-mpg.names using Notepad. Change their extension to .txt.

Open Excel: Data -> Get External Data -> From Text.

Go to Downloads and select auto-mpg.data.txt; Next; select the Tab and Space delimiters; Next;
Finish; place the data at =$A$1.

Insert new row at top

Copy and paste the column names from auto-mpg.names.txt.

Save file as csv

Weka: Tools -> ArffViewer; File -> Open; select the CSV file; save as .arff.
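The ArffViewer conversion essentially prepends an ARFF header to the CSV data. A minimal sketch for a purely numeric file (real datasets also need nominal/string attribute declarations, which Weka infers automatically; the two-column sample is invented):

```python
import csv
import io

def csv_to_arff(csv_text, relation="auto-mpg"):
    # Build an ARFF document: @relation, one @attribute line per column
    # (all assumed numeric here), then the raw rows under @data.
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in header]
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

print(csv_to_arff("mpg,cylinders\n18.0,8\n15.0,8"))
```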
Data Cleaning using Weka:

Open file labor.arff.

Check relation name

Select first attribute

Check for missing values; in this case 2% for the first attribute.

Select Edit; you can find a lot of missing values, shown in grey.

1. Replace missing values using weka:


Go to Filter: weka, filters, unsupervised, attribute, ReplaceMissingValues; Apply.

Discretize

Open credit-g.arff

Select the age attribute: unsupervised, attribute, Discretize; select the Discretize bar; set
attributeIndices to 13 (for age), binRangePrecision (the limit for decimal values) to 2, and
bins to 3; Apply; save as type CSV.

Open the file in Excel, replace the bin values with Old, Middle and Young, and save the file as CSV.
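By default the unsupervised Discretize filter uses equal-width binning: the attribute's range is split into `bins` intervals of equal width. A sketch that also applies the Young/Middle/Old relabelling done in Excel above (the toy ages are invented):

```python
def discretize(column, bins=3, labels=("Young", "Middle", "Old")):
    # Equal-width binning: split [min, max] into `bins` equal intervals
    # and map each value to its interval's label.
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins
    def bin_of(v):
        i = int((v - lo) / width)
        return labels[min(i, bins - 1)]  # the maximum falls in the last bin
    return [bin_of(v) for v in column]

print(discretize([19, 25, 40, 55, 70]))  # ['Young', 'Young', 'Middle', 'Old', 'Old']
```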

2. Info Gain Attribute Evaluator


Open csv file credit-g-nominal.csv in weka

select attributes from top bar

attribute Evaluator

InfoGainAttributeEval

When alerted, answer Yes to use the Ranker search method.

Start

Check Results

Select attributes 17, 19, 18, 8, 11, 16; Remove; Save.
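InfoGainAttributeEval scores each attribute by its information gain with respect to the class: H(class) - H(class | attribute). A minimal sketch for nominal attributes (the toy attribute and class values are invented):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class distribution, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attribute, labels):
    # Information gain = H(class) - H(class | attribute), the score
    # InfoGainAttributeEval ranks attributes by.
    n = len(labels)
    conditional = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# A perfectly predictive attribute gains the full class entropy (1 bit);
# an uninformative one gains nothing.
print(info_gain(["a", "a", "b", "b"], ["good", "good", "bad", "bad"]))  # 1.0
print(info_gain(["a", "b", "a", "b"], ["good", "good", "bad", "bad"]))  # 0.0
```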


3. Change any attribute as class
Open mpg.csv,

Edit

Select mpg, set attribute as class, ok

4. Change Numeric to Nominal


Open diabetes.arff

Select attribute preg- numeric

Weka, filters, unsupervised, attribute, NumericToNominal, Click on bar, attribute indices 1, Apply

5. Normalize
Open iris.arff

Weka, filters, unsupervised, attribute, normalize, apply

Undo, standardize, apply

6. Remove Missing values


Open glass.arff

Select attribute plant-stand. It has missing values

Weka, filters, unsupervised, instance, RemoveWithValues; click the bar; attributeIndices: 2,
invertSelection: true, matchMissingValues: true, OK.

7. Best attributes:
Weka, filters, supervised, attribute, attribute selection

Weka, Select attributes, Choose, ClassifierSubsetEval; click Classifier, Choose, NaiveBayes; OK, Start.

Choose ,tree, j48, ok, start

Find the best attributes


Preprocess: select attributes 1, 3, 4, 5; select Invert; Remove.

Classify with NaiveBayes and see the results.

8. Finding Outliers
Open file cpu.arff

Weka, filters, unsupervised, attribute, InterquartileRange; Apply.

Two extra columns are added. Select the Outlier column, set the class to Outlier, and Visualize.

Weka, filters, unsupervised, instance, RemoveWithValues; click on the bar.

The Outlier attribute has two values, no (1) and yes (2). We want to remove the outliers, so set
nominalIndices to 2 (or last).

attributeIndex: 11, nominalIndices: 2; Classify.

Undo; click on the bar; set detectionPerAttribute to true; Undo.

9. Numeric transform
Open iris.arff: Weka, filter, unsupervised, attribute, NumericTransform; methodName: floor.

10. PCA
Open the file cpu.arff: Filter, unsupervised, attribute, PrincipalComponents; click the bar;
varianceCovered: 0.95; OK, Apply.

Check the variance/standard deviation on the right. It is the maximum variance; set the threshold
to 50% of the maximum. All other attributes have less than 50%. Select them (2, 3, 4, 5) and click Remove.
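PrincipalComponents with varianceCovered = 0.95 keeps the fewest components whose cumulative explained variance reaches 95%. A NumPy sketch of that selection rule (NumPy is assumed available; the random dataset is invented, and Weka additionally normalizes the data and can transform back to the original space):

```python
import numpy as np

def pca_variance_covered(X, variance=0.95):
    # Project onto the fewest principal components whose cumulative
    # explained variance reaches `variance` (cf. varianceCovered=0.95).
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratios, variance)) + 1
    return Xc @ eigvecs[:, :k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0]            # redundant columns: the data really
X[:, 4] = X[:, 1] - X[:, 2]      # spans only three independent directions
reduced = pca_variance_covered(X)
print(reduced.shape)  # three components cover 95% of the variance
```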

Sparse dataset

Open the file sparse.arff and Edit to see the sparse data.


Filter, choose, weka, filter, unsupervised, instance, NonSparseToSparse
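In a sparse ARFF file, each instance stores only its nonzero values as 0-based {index value} pairs, which is what NonSparseToSparse produces. A sketch of the per-row encoding:

```python
def to_sparse(row):
    # Encode one dense instance in ARFF sparse format: keep only the
    # nonzero values as "index value" pairs (0-based indices) in braces.
    pairs = [f"{i} {v}" for i, v in enumerate(row) if v != 0]
    return "{" + ", ".join(pairs) + "}"

print(to_sparse([0, 2, 0, 0, 1]))  # {1 2, 4 1}
```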
