Why Data Mining
in/v/140019/Weka-Tutorial-02-Data-Preprocessing-101--Data-Prep#course_14386
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/index.html
https://philippe-fournier-viger.com/spmf/videos/closed_video.mp4
https://dm.cs.tu-dortmund.de/en/mlbits/frequent-pattern-maximal-and-closed/
https://www.youtube.com/watch?v=5H1QxY5nj0o
https://www.youtube.com/watch?v=E1UzOR2fTjU
https://www.youtube.com/watch?v=T8IiEIUY01M
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared
for mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base.
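The seven steps above can be sketched end to end in a few lines. This is a toy illustration only, with made-up helper names and toy data, not anything a real tool such as Weka does internally:

```python
# Toy sketch of the knowledge discovery pipeline; each function
# stands in for one step of the process described above.

def clean(records):
    # 1. Data cleaning: drop records with missing (None) values
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # 2. Data integration: combine multiple data sources
    merged = []
    for s in sources:
        merged.extend(s)
    return merged

def select(records, fields):
    # 3. Data selection: keep only the task-relevant attributes
    return [{f: r[f] for f in fields} for r in records]

def transform(records, key, value):
    # 4. Data transformation: aggregate (here, sum per key)
    totals = {}
    for r in records:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals

def mine(totals, threshold):
    # 5. Data mining + 6. pattern evaluation: keep "interesting" keys
    return {k: v for k, v in totals.items() if v >= threshold}

src_a = [{"item": "milk", "qty": 2}, {"item": "bread", "qty": None}]
src_b = [{"item": "milk", "qty": 3}, {"item": "eggs", "qty": 1}]
data = select(clean(integrate(src_a, src_b)), ["item", "qty"])
patterns = mine(transform(data, "item", "qty"), threshold=2)
print(patterns)  # milk survives cleaning and aggregates to 5
```

Step 7, knowledge presentation, would render `patterns` for the user, e.g. as a chart or a report.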
The preceding view shows data mining as one step in the knowledge discovery process,
albeit an essential one because it uncovers hidden patterns for evaluation.
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.
LAB 1
https://sourceforge.net/projects/weka/
c:\Program Files\Weka-3-8-5\data
Diabetes.arff: Filter -> choose filters, unsupervised, attribute, NumericCleaner; attribute index 6 (mass), minDefault NaN, minThreshold 0.1E-7; OK, Apply.
Select mass, then Edit.
Filter: unsupervised, instance, RemoveWithValues; attribute index 6, matchMissingValues True; OK, Apply. Check mass in Edit: the rows with missing values are removed.
Impute: Undo, then choose filter unsupervised, attribute, ReplaceMissingValues; Apply. Check in Edit that the missing values were replaced.
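The three filters just applied can be sketched in plain Python. This is an illustration of the idea only, not Weka's implementation: NumericCleaner turns impossible values (a BMI of 0) into missing, RemoveWithValues drops rows with missing values, and ReplaceMissingValues imputes the attribute mean:

```python
import math

def numeric_cleaner(values, min_threshold, min_default):
    # Like Weka's NumericCleaner: values below minThreshold become
    # minDefault (here NaN, marking impossible zero readings as missing)
    return [min_default if v < min_threshold else v for v in values]

def remove_missing(values):
    # Like RemoveWithValues with matchMissingValues=True: drop missing rows
    return [v for v in values if not math.isnan(v)]

def impute_mean(values):
    # Like ReplaceMissingValues: fill missing entries with the attribute mean
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]

mass = [33.6, 0.0, 28.1, 0.0, 43.1]   # 0.0 is a recorded but impossible BMI
cleaned = numeric_cleaner(mass, 1e-8, float("nan"))
print(remove_missing(cleaned))        # the two zero rows are gone
print(impute_mean(cleaned))           # the two zeros become the mean
```

Note the order in the lab: removal and imputation are alternatives, which is why the instructions say Undo before switching from one to the other.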
Weather numeric data: Edit; the play probability percentage should be between 0 and 100. Filter: unsupervised, attribute, NumericCleaner; maxThreshold 100, minThreshold 0, maxDefault 100, minDefault 0.
Values 45 to 49 must become 50: closeTo 47, changeTo 50, closeToTolerance 3 (matches values less than 3 away), attribute indices 5; OK, Apply, Edit.
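The closeTo rule can be sketched as follows (an illustration of the behaviour, not Weka's code; the parameter names mirror the dialog):

```python
def close_to_rule(values, close_to, change_to, tolerance):
    # Values strictly less than `tolerance` away from closeTo are
    # replaced by changeTo: closeTo=47, changeTo=50, tolerance=3
    # maps 45..49 to 50, while 44 and 50 are left alone.
    return [change_to if abs(v - close_to) < tolerance else v for v in values]

temps = [44, 45, 46, 48, 49, 50, 51]
print(close_to_rule(temps, close_to=47, change_to=50, tolerance=3))
```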
Diabetes.arff: Filter: unsupervised, attribute, InterquartileRange; Apply. Two new attributes appear at positions 10 and 11 (outlier and extreme values); Edit.
Outlier removal: Filter: unsupervised, instance, RemoveWithValues; attribute index 10, nominal indices: last; Apply. Then Filter: unsupervised, instance, RemoveWithValues; attribute index 11, nominal indices: last; OK, Apply. Save as Diabetes1.arff.
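The InterquartileRange filter flags a value as an outlier or extreme value when it falls outside a multiple of the interquartile range around the quartiles. The sketch below illustrates the idea with a crude quartile estimate and the common default factors (3 for outliers, 6 for extremes); Weka's exact quartile method and defaults may differ:

```python
def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    # Flag each value relative to [Q1 - k*IQR, Q3 + k*IQR].
    # Quartiles here are a rough positional estimate, for illustration.
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1

    def flag(v):
        if v < q1 - extreme_factor * iqr or v > q3 + extreme_factor * iqr:
            return "extreme"
        if v < q1 - outlier_factor * iqr or v > q3 + outlier_factor * iqr:
            return "outlier"
        return "no"

    return [flag(v) for v in values]

data = [10, 11, 12, 11, 10, 12, 11, 200]
print(iqr_flags(data))  # only the 200 is flagged
```

Removing the rows flagged "yes" twice (once per new attribute), as in the lab, corresponds to dropping every value not flagged "no" here.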
Glass.arff
Filter: Unsupervised, attribute, Normalize (used for numeric attributes only). The defaults scale=1, translation=0 give values between 0 and 1; for -1 to +1 choose scale=2 and translation=-1. Select Normalize in the filter bar to edit these options. OK, Apply, Edit, Undo. Save will replace the file, so give a new name.
Filter: Unsupervised, attribute, Standardize (zero mean, unit variance), used for numeric attributes only.
Check each numeric attribute's mean and standard deviation.
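The two rescalings can be sketched in a few lines (illustrative only; whether the standard deviation divides by n or n-1 varies by tool, and n is used here):

```python
def normalize(values, scale=1.0, translation=0.0):
    # Min-max to [0,1], then x*scale + translation, as in Weka's
    # Normalize filter: scale=2, translation=-1 gives [-1, +1].
    lo, hi = min(values), max(values)
    return [((v - lo) / (hi - lo)) * scale + translation for v in values]

def standardize(values):
    # Zero mean, unit variance (z-scores), as in Weka's Standardize filter
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(normalize(x))                           # [0.0, 0.25, 0.5, 0.75, 1.0]
print(normalize(x, scale=2, translation=-1))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
z = standardize(x)                            # mean 0, std 1
```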
Unzip the archive.
Open auto-mpg.data and auto-mpg.names using Notepad; change the extension to .txt.
In Excel's Text Import Wizard, select auto-mpg-data.txt; Next; select delimiters Tab and Space; Next; Finish; put the data at =$A$1.
Weka: Tools, ArffViewer, File, Open, select the CSV file, save as ARFF.
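What the ArffViewer's "save as ARFF" step produces can be sketched as below. This is a simplified illustration of the file format, not Weka's converter: every column is declared numeric here, whereas real ARFF also supports nominal, string, and date attributes:

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    # Emit a minimal ARFF file: @relation, one @attribute per CSV
    # column (all numeric in this sketch), then the @data rows.
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in header]
    lines += ["", "@data"] + [",".join(r) for r in data]
    return "\n".join(lines)

sample = "mpg,horsepower\n18.0,130\n15.0,165\n"
print(csv_to_arff(sample, "auto-mpg"))
```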
Data cleaning using Weka:
Select Edit; you can find a lot of missing values, shown in grey.
Discretize
Open credit-g.arff
Select attribute age. Filter: unsupervised, attribute, Discretize; click on the Discretize bar: attribute indices 13 (for age), bin range precision (limit for decimal values) = 2, bins = 3; Apply. Save as type CSV.
Open the file in Excel, replace the interval values with Old, Middle and Young, and save the file as CSV.
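The unsupervised Discretize filter used above does equal-width binning: the attribute's range is split into `bins` intervals of equal width. A minimal sketch of that idea (illustrative only; the ages and labels are made up, and Weka emits interval strings rather than these labels):

```python
def discretize(values, bins=3, labels=None):
    # Equal-width binning: split [min, max] into `bins` intervals
    # of width (max - min) / bins and assign each value to one.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    out = []
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp max into last bin
        out.append(labels[i] if labels else i)
    return out

ages = [19, 23, 35, 41, 52, 67, 74]
print(discretize(ages, bins=3, labels=["Young", "Middle", "Old"]))
```

Relabelling the three intervals as Young, Middle and Old in Excel, as the lab does, is exactly the `labels` step here.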
Attribute evaluator: InfoGainAttributeEval
Start, check the results, Edit
Weka, filters, unsupervised, attribute, NumericToNominal; click on the bar, attribute indices 1, Apply
5. Normalize
Open iris.arff
7. Best attributes:
Weka, filters, supervised, attribute, AttributeSelection
Weka, Select attributes tab: choose ClassifierSubsetEval; click, classifier, choose NaiveBayes; OK, Start
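InfoGainAttributeEval (used above) scores each attribute by information gain: the class entropy minus the class entropy after splitting on the attribute. A self-contained sketch of that computation, on a made-up toy dataset:

```python
from math import log2

def entropy(labels):
    # H(Y) = -sum p(y) log2 p(y)
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(feature, labels):
    # IG(Y; X) = H(Y) - H(Y | X), the core of InfoGainAttributeEval
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

outlook = ["sunny", "sunny", "rain", "rain"]
play    = ["no",    "no",    "yes",  "yes"]
print(info_gain(outlook, play))  # 1.0: outlook fully determines play
```

ClassifierSubsetEval takes a different approach: instead of a per-attribute score, it evaluates whole attribute subsets by how well a chosen classifier (NaiveBayes here) performs on them.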
8. Finding Outliers
Open file cpu.arff
Two extra columns added. Select column outlier, set class as outlier, visualize
Attribute outlier has two values no(1) and yes(2). We want to remove outliers, so nominal indices=2 or
last.
9. Numeric transform
Iris.arff: Weka filter, unsupervised, attribute, NumericTransform; method name: floor
10. PCA
Open file cpu.arff, filter, unsupervised, attribute, PrincipalComponents, click, variance covered:0.95, ok,
apply.
Check the variance/std deviation on the right; the first component has the maximum variance. Set a threshold of 50% of that maximum: all the other components fall below it, so select them (2, 3, 4, 5) and click Remove.
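PCA centers the data, builds the covariance matrix, and projects onto the eigenvectors with the largest variance; keeping components until 95% of the variance is covered is what the varianceCovered option controls. The sketch below shows the idea for 2-D data only, where the principal axis of the 2x2 covariance matrix has a closed form; it is an illustration, not Weka's PrincipalComponents implementation:

```python
from math import atan2, cos, sin

def pca_2d_scores(points):
    # Center the data, form the 2x2 covariance matrix, and project
    # each point onto the first principal axis (largest variance).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form angle of the first principal axis for a 2x2 matrix
    theta = 0.5 * atan2(2 * cxy, cxx - cyy)
    ax, ay = cos(theta), sin(theta)
    return [(p[0] - mx) * ax + (p[1] - my) * ay for p in points]

pts = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]   # perfectly correlated toy data
scores = pca_2d_scores(pts)                  # one component captures everything
```

For these perfectly correlated points one component carries all the variance, which is the extreme case of the lab's observation that the later components can be removed.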
Sparse dataset