
1-Introduction To Data Mining-13-12-2024

The document provides an overview of data mining, highlighting its significance due to the exponential growth of data and the need for knowledge extraction. It outlines the evolution of scientific disciplines leading to data science, the knowledge discovery process, and the architecture of data mining systems. Additionally, it discusses the interdisciplinary nature of data mining, its applications, and the challenges posed by traditional data analysis methods.

Uploaded by

naresh.r2021

Introduction to Data Mining

SWE2009 - Data Mining


March 20, 2025 1
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web,
computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …


Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 We are data rich, but information poor.
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science
 Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics, or
linguistics.)
 Computational Science traditionally meant simulation. It grew out of our
inability to find closed-form solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally
accessible
 Scientific info. management, acquisition, organization, query, and visualization
tasks scale almost linearly with data volumes. Data mining is a major new
challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online
Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database
Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information
systems
What Is Data Mining?

 Data mining (knowledge discovery from data)


 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from large amounts of data
 Data mining: a misnomer?
 The mining of gold from rocks or sand is referred to as
gold mining rather than rock or sand mining.
 The mining of coal from rocks or sand is referred to as
coal mining.
 By analogy, “knowledge mining from data” would be the
more accurate name.


What Is Data Mining?

 Alternative names
 Knowledge discovery (mining) in databases
(KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging,
information harvesting, business intelligence,
etc.

 Data mining: searching for knowledge
(interesting patterns) in your data.


KDD: A Definition

Simply stated, data mining refers to
extracting or “mining” knowledge
from large amounts of data, usually
automatically gathered.


KDD: A Definition
KDD is the automatic or semi-automatic
extraction of non-obvious, hidden knowledge
from large volumes of data.

10^6-10^12 bytes: we never see the whole data set, so will put it in
the memory of computers
Then run Data Mining algorithms
What is the knowledge? How to represent and use it?


Data, Information, Knowledge
We often see data as a string of bits, or
numbers and symbols, or “objects” which we
collect daily.

Information is data stripped of redundancy, and
reduced to the minimum necessary to
characterize the data.

Knowledge is integrated information, including
facts and their relations, which have been
perceived, discovered, or learned as our
“mental pictures”.
Knowledge can be considered data at
a high level of abstraction and generalization.


From Data to Knowledge

[Figure: a data table annotated with numerical attributes, categorical attributes, missing values, and class labels]

IF (Headache = No AND Vomiting = Yes AND Temperature = High)
THEN Viral illness = Yes
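
The rule above can be read as a small function over records. The following is only a sketch: the three patient records are made up for illustration, and the function name `viral_illness` is hypothetical. Only the attribute names and the rule itself come from the slide.

```python
# Hypothetical patient records; attribute names follow the rule on the slide.
patients = [
    {"Headache": "No", "Vomiting": "Yes", "Temperature": "High"},
    {"Headache": "Yes", "Vomiting": "No", "Temperature": "Normal"},
    {"Headache": "No", "Vomiting": "Yes", "Temperature": None},  # missing value
]


def viral_illness(record):
    """Apply the IF-THEN rule; return "Yes" when it fires, else None."""
    if (record.get("Headache") == "No"
            and record.get("Vomiting") == "Yes"
            and record.get("Temperature") == "High"):
        return "Yes"
    return None  # the rule says nothing about this record


predictions = [viral_illness(p) for p in patients]
print(predictions)
```

Note that the record with a missing temperature does not match the rule, which is one reason missing values need attention before mining.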



Data Rich Knowledge Poor
People gathered and stored so much data because they
think some valuable assets are implicitly coded within it.
Raw data is rarely of direct benefit.
Its true value depends on the ability to extract information
useful for decision support.
Impractical manual data analysis

How to acquire knowledge for knowledge-based systems
remains the main difficult and crucial problem.
[Figure: a knowledge base and an inference engine, with the source of the knowledge marked “?”]
Tradition: via knowledge engineers
New trend: via automatic programs
Knowledge Discovery (KDD) Process

 Data mining: the core of the knowledge discovery process
[Figure: KDD process, bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
KDD Process - Steps
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis
task are retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by
performing summary or aggregation operations)
5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user)
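
The seven steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the two data sources, the record layout (customer, item, quantity), the function names, and the threshold are all hypothetical, and the "mining" step is just a ranking of totals rather than a real mining algorithm.

```python
from collections import Counter

# Two hypothetical sources of (customer, item, quantity) records.
store_a = [("c1", "milk", 2), ("c2", "bread", None), ("c1", "milk", 2)]
store_b = [("c3", "milk", 1), ("c2", "eggs", 12)]


def clean(records):
    # Step 1: drop records with a missing quantity and exact duplicates.
    seen, out = set(), []
    for record in records:
        if record[2] is not None and record not in seen:
            seen.add(record)
            out.append(record)
    return out


def integrate(*sources):
    # Step 2: combine several sources into one collection.
    return [record for source in sources for record in source]


def select(records, items):
    # Step 3: keep only the records relevant to the analysis task.
    return [record for record in records if record[1] in items]


def transform(records):
    # Step 4: consolidate by aggregating the quantity per item.
    totals = Counter()
    for _, item, quantity in records:
        totals[item] += quantity
    return totals


def mine(totals):
    # Step 5: stand-in for a mining algorithm; rank items by total.
    return totals.most_common()


def evaluate(patterns, min_total):
    # Step 6: keep patterns that pass an interestingness threshold.
    return [(item, total) for item, total in patterns if total >= min_total]


def present(patterns):
    # Step 7: render the surviving patterns for the user.
    return [f"{item}: {total} units" for item, total in patterns]


data = integrate(clean(store_a), clean(store_b))
relevant = select(data, {"milk", "eggs"})
report = present(evaluate(mine(transform(relevant)), min_total=5))
print(report)
```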



Architecture of Typical Data Mining
System



Architecture of a typical data
mining system
 Database, data warehouse, World
Wide Web, or other information
repository:

One or a set of databases, data warehouses,
spreadsheets, or other kinds of information
repositories.

Data cleaning and data integration techniques
may be performed on the data.

 Database or data warehouse server:



Responsible for fetching the relevant data,
based on the user’s data mining request.



Contd….
 Knowledge base:

Domain knowledge that is used to guide the search or
evaluate the interestingness of resulting
patterns.

Such knowledge can include concept hierarchies,
used to organize attributes or attribute values
into different levels of abstraction.

Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based
on its unexpectedness, may also be included.
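
A concept hierarchy can be pictured as a mapping from low-level attribute values to more general ones. The sketch below assumes a hypothetical city-to-country hierarchy and made-up sales records; the names `CITY_TO_COUNTRY` and `generalize` are illustrative only.

```python
from collections import Counter

# Hypothetical concept hierarchy for a location attribute: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}


def generalize(values, hierarchy):
    """Replace each value by its parent concept (one level up)."""
    return [hierarchy.get(value, "unknown") for value in values]


# Made-up city attribute values from four sales records.
sales_cities = ["Vancouver", "Chicago", "Toronto", "Toronto"]
by_country = Counter(generalize(sales_cities, CITY_TO_COUNTRY))
print(by_country)
```

Mining at the country level rather than the city level is one way to view the same data at a higher level of abstraction.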



Contd…
 Data mining engine:

Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.

 Pattern evaluation module:



Uses interestingness measures to focus the search toward interesting patterns.

May use interestingness thresholds to filter out discovered patterns.

The pattern evaluation module may be integrated with the
mining module, depending on the implementation of the
data mining method used.

For efficient data mining, it is highly recommended to
push the evaluation of pattern interestingness as deep as
possible into the mining process so as to confine the
search to only the interesting patterns.
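
As an illustration of threshold-based pattern evaluation, the sketch below scores a few candidate association rules by support and confidence and keeps only those above both thresholds. The transactions, candidate rules, and threshold values are hypothetical. This version filters after the fact; a miner that pushes evaluation into the search would prune candidates as they are generated.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)


# Candidate rules as (antecedent, consequent) pairs.
candidate_rules = [
    ({"bread"}, {"milk"}),
    ({"milk"}, {"bread"}),
    ({"butter"}, {"bread"}),
]

MIN_SUPPORT = 0.5      # assumed threshold
MIN_CONFIDENCE = 0.7   # assumed threshold

interesting = [
    (a, c) for a, c in candidate_rules
    if support(a | c) >= MIN_SUPPORT and confidence(a, c) >= MIN_CONFIDENCE
]
print(interesting)
```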



Contd….
 User interface:

Communicates between users and the data
mining system.

Allows the user to interact with the system by
specifying a data mining query or task.

Provides information to help focus the search.

Supports exploratory data mining based on
the intermediate data mining results.

Allows the user to browse database and data
warehouse schemas or data structures, evaluate
mined patterns, and visualize the patterns in
different forms.


Data Mining and Business
Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Layers, top to bottom:
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: Confluence of Multiple
Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]


Contd….
 DM: an interdisciplinary field
 Set of disciplines including database
systems, statistics, machine learning,
visualization, and information science.
 Other disciplines: neural networks, fuzzy
logic or rough set theory, knowledge
representation, etc.


Data Mining
 Statistics is the study of the collection, organization, analysis,
interpretation and presentation of data.
 Machine learning, a branch of artificial intelligence, concerns
the construction and study of systems that can learn from data.
 For example, a machine learning system could be trained on
email messages to learn to distinguish between spam and
non-spam messages. Examples: decision trees, neural networks, etc.
 A database is an organized collection of data.
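
To make the spam example concrete, here is a minimal sketch of a learner trained on labeled messages. It uses a naive Bayes classifier with Laplace smoothing, which is a substitute chosen for brevity and not one of the techniques named on the slide (trees, neural networks). The four training messages and the labels "spam" and "ham" are made up.

```python
import math
from collections import Counter

# Hypothetical labeled training messages.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# Learn word counts per label and the number of messages per label.
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])


def classify(text):
    """Return the label with the highest log-probability score."""
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(training))  # prior
        for word in text.split():
            # Laplace (add-one) smoothing keeps unseen words from zeroing out.
            likelihood = (word_counts[label][word] + 1) / (
                total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


print(classify("cheap money"))
print(classify("agenda for meeting"))
```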



AI

 Artificial intelligence (AI) is technology and
a branch of computer science that studies
and develops intelligent machines and
software.
 Pattern recognition aims to classify data
(patterns) based on either a priori knowledge or
on statistical information extracted from the
patterns.


Data Mining: Classification
Schemes
 General functionality
 Descriptive data mining
 Predictive data mining
 Different views, different classifications
 Kinds of databases to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted



Data Mining
 Prediction Methods

using some variables to predict unknown or
future values of other variables

 Descriptive Methods

finding human-interpretable patterns
describing the data
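
The contrast between the two families can be sketched on one tiny data set. The variables `hours_studied` and `exam_score` and their values are hypothetical, and ordinary least squares is used here only as one simple example of a prediction method.

```python
# Hypothetical paired observations (names and values are illustrative only).
hours_studied = [1.0, 2.0, 3.0, 4.0]
exam_score = [2.0, 4.0, 6.0, 8.0]

n = len(hours_studied)
mean_x = sum(hours_studied) / n
mean_y = sum(exam_score) / n

# Descriptive: summarize the data we already have.
summary = {"min": min(exam_score), "max": max(exam_score), "mean": mean_y}

# Predictive: fit a least-squares line, then estimate an unseen value.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(hours_studied, exam_score))
    / sum((x - mean_x) ** 2 for x in hours_studied)
)
intercept = mean_y - slope * mean_x


def predict(x):
    """Estimate the score for a value of x not present in the data."""
    return intercept + slope * x


predicted = predict(5.0)
print(summary, predicted)
```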



Why Not Traditional Data
Analysis?
 Tremendous amount of data
 Algorithms must be highly scalable to handle terabytes of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications



Multi-Dimensional View of Data
Mining
 Data to be mined
 Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
 Knowledge to be mined
 Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Machine learning, statistics, visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.



Multi-Dimensional View of Data
Mining
 Data to be mined
1. Relational
2. Data warehouse
3. Transactional
4. Stream
5. Object-oriented
6. Temporal Databases, Sequence Databases, and Time-Series
Databases
7. Spatial and Spatiotemporal
8. Heterogeneous Databases and Legacy Databases
9. Text and multi-media
10. WWW



1. Relational
 A database system, also called a database
management system (DBMS), consists of a collection
of interrelated data, known as a database, and
a set of software programs to manage and
access the data.
 The software programs involve mechanisms
for the definition of database structures; for
data storage; for concurrent, shared, or
distributed data access; and for ensuring the
consistency and security of the information
stored, despite system crashes or attempts
at unauthorized access.
Contd…..
 A relational database is a collection of tables,
each of which is assigned a unique name.
 Each table consists of a set of attributes
(columns or fields) and usually stores a large
set of tuples (records or rows).
 Each tuple in a relational table represents an
object identified by a unique key and
described by a set of attribute values.
 A semantic data model, such as an entity-
relationship (ER) data model, is often
constructed for relational databases.
 An ER data model represents the database
as a set of entities and their relationships.
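
A relational table with named attributes, tuples, and a unique key can be tried out with Python's built-in `sqlite3` module. The `customer` table, its columns, and its rows are hypothetical and chosen only to illustrate the terms above.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A table has a unique name and a set of attributes (columns).
# cust_id is the unique key that identifies each tuple (row).
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Each tuple describes one object by its attribute values.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Smith", "Chicago"), (2, "Lee", "Toronto"), (3, "Patel", "Chicago")],
)

# A relational query retrieves the tuples that match a condition.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY cust_id", ("Chicago",)
).fetchall()
conn.close()
print(rows)
```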
2. Data warehouse
 A repository of information collected from
multiple sources, stored under a unified
schema, and that usually resides at a single
site.

 Constructed via a process of data cleaning,
data integration, data transformation, data
loading, and periodic data refreshing.

 “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of
management’s decision-making process.” (W. H. Inmon)
Contd…
 Usually modeled by a multidimensional
database structure
 Each dimension corresponds to an attribute
or a set of attributes in the schema
 Each cell stores the value of some
aggregate measure, such as count or sales
amount.
 The actual physical structure of a data
warehouse may be a relational data store or
a multidimensional data cube.
 A data cube provides a multidimensional
view of data and allows the pre-
computation and fast accessing of
summarized data.
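
The idea of pre-computing summarized data can be sketched with a dictionary keyed by dimension values, where "*" stands for "all values" of a dimension. The three dimensions (time, location, item), the sales figures, and the "*" convention are assumptions made for this illustration, not a description of any particular warehouse product.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact records: (time, location, item, sales amount).
sales = [
    ("Q1", "Chicago", "TV", 100),
    ("Q1", "Toronto", "TV", 50),
    ("Q2", "Chicago", "phone", 70),
    ("Q1", "Chicago", "phone", 30),
]

# Pre-compute every aggregate cell. Each dimension either keeps its
# value or is rolled up to "*", giving 2**3 = 8 cells per record.
cube = defaultdict(int)
for time, location, item, amount in sales:
    for cell in product((time, "*"), (location, "*"), (item, "*")):
        cube[cell] += amount

# Summaries are now simple lookups instead of scans over the records.
print(cube[("Q1", "*", "*")])       # total sales in Q1
print(cube[("*", "Chicago", "*")])  # total sales in Chicago
print(cube[("*", "*", "*")])        # grand total
```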


3. Transactional
 Consists of a file where each record
represents a transaction.
 A transaction typically includes a unique
transaction identity number (trans ID) and a
list of the items making up the transaction
(such as items purchased in a store).



 The transactional database may have
additional tables associated with it, which
contain other information regarding the
sale, such as the date of the transaction,
the customer ID number, the ID number of
the salesperson and of the branch at which
the sale occurred, and so on.
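
A transactional data set and its associated table might look like the sketch below. All identifiers (transaction IDs, item codes, customer and branch IDs, dates) are made up for illustration.

```python
from collections import Counter

# Each record: a unique transaction ID and the list of items it contains.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I8"],
}

# Hypothetical associated table with other information about each sale.
transaction_info = {
    "T100": {"date": "2025-03-20", "customer_id": "C1", "branch_id": "B1"},
    "T200": {"date": "2025-03-20", "customer_id": "C2", "branch_id": "B1"},
    "T300": {"date": "2025-03-21", "customer_id": "C1", "branch_id": "B2"},
}

# A first mining question: which items are sold most often?
item_counts = Counter(
    item for items in transactions.values() for item in items
)
print(item_counts.most_common(2))
```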



4. Stream
 Data flow in and out of an observation
platform (or window) dynamically

 Unique features:

huge or possibly infinite volume

dynamically changing

flowing in and out in a fixed order

allowing only one or a small number of scans

demanding fast (often real-time) response time.
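
These constraints suggest one-pass algorithms that keep only a small window in memory. The sketch below computes a running average over the most recent readings; the function name `sliding_average`, the window size, and the sensor readings are hypothetical.

```python
from collections import deque


def sliding_average(stream, window_size):
    """Yield the average of the last window_size values, in one pass."""
    window = deque(maxlen=window_size)
    running_sum = 0.0
    for value in stream:
        if len(window) == window_size:
            running_sum -= window[0]  # oldest value is about to be evicted
        window.append(value)
        running_sum += value
        yield running_sum / len(window)


# An iterator stands in for a stream: each value can be read only once.
readings = iter([10, 20, 30, 40])
averages = list(sliding_average(readings, window_size=2))
print(averages)
```

Memory use depends on the window size only, not on the length of the stream, which matters when the volume is huge or possibly infinite.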



4. Stream
 Typical examples of data streams include
various kinds of scientific and engineering
data, time-series data, and data produced
in other dynamic environments, such as
power supply, network traffic, stock
exchange, telecommunications, Web click
streams, video surveillance, and weather or
environment monitoring.



5. Object-oriented
 Each entity is considered as an object.

 Objects that share a common set of properties can
be grouped into an object class.

 Each object is an instance of its class.

 Object classes can be organized into class/subclass
hierarchies so that each class represents
properties that are common to objects in that
class.

 For instance, an employee class can contain
variables like name, address, and birthdate.


Contd…
 Suppose that the class salesperson is a
subclass of the class employee.

 A salesperson object would inherit all of the
variables pertaining to its superclass,
employee.

 In addition, it has all of the variables that
pertain specifically to being a salesperson
(e.g., commission).

 Such a class inheritance feature benefits
information sharing.
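
The employee and salesperson example maps directly onto class inheritance in a programming language. The sketch below uses Python dataclasses; the field types and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    # Variables common to every employee.
    name: str
    address: str
    birthdate: str


@dataclass
class SalesPerson(Employee):
    # Inherits name, address, and birthdate; adds its own variable.
    commission: float = 0.0


# Hypothetical instance of the subclass.
sp = SalesPerson(
    name="Ann", address="1 Main St", birthdate="1990-01-01", commission=0.05
)
print(sp)
print(isinstance(sp, Employee))
```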
6. Temporal Databases, Sequence
Databases, and Time-Series Databases

 A temporal database typically stores relational
data that include time-related attributes. These
attributes may involve several timestamps, each
having different semantics.

 A sequence database stores sequences of
ordered events, with or without a concrete notion of
time. Examples include customer shopping
sequences, Web click streams, and biological
sequences.

 A time-series database stores sequences of
values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly).
Examples include data collected from the stock
exchange, inventory control, and the observation of
natural phenomena (like temperature and wind).
7. Spatial and
Spatiotemporal
 Spatial databases contain spatial-related
information.
 Examples include geographic (map) databases,
very large-scale integration (VLSI) or computer-aided
design databases, and medical and satellite
image databases.
 Spatial data may be represented in raster format,
consisting of n-dimensional bit maps or pixel maps.
 For example, a 2-D satellite image may be
represented as raster data, where each pixel
registers the rainfall in a given area.
 Maps can be represented in vector format, where
roads, bridges, buildings, and lakes are represented
as unions or overlays of basic geometric constructs,
such as points, lines, polygons, and the partitions
and networks formed by these components.
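
The two representations can be contrasted in a few lines. In this sketch the rainfall grid and the road coordinates are made-up values: the raster is a 2-D grid of cells, and the vector object is a road stored as a sequence of points (a polyline).

```python
import math

# Raster format: a 2-D grid where each cell holds the rainfall (mm)
# registered for one area.
rainfall = [
    [0, 2, 5],
    [1, 3, 8],
]
wettest = max(max(row) for row in rainfall)

# Vector format: a road as a polyline built from points.
road = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]

# Length of the road: sum of the lengths of its line segments.
road_length = sum(
    math.hypot(bx - ax, by - ay)
    for (ax, ay), (bx, by) in zip(road, road[1:])
)
print(wettest, road_length)
```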
Contd….
 A spatial database that stores spatial
objects that change with time is called a
spatiotemporal database, from which
interesting information can be mined.
 For example, we may be able to group the trends of
moving objects and identify some strangely
moving vehicles, or distinguish a
bioterrorist attack from a normal outbreak
of the flu based on the geographic spread of
a disease with time.


8. Heterogeneous Databases
and Legacy Databases
 A heterogeneous database consists of a
set of interconnected, autonomous
component databases.

 A legacy database is a group of
heterogeneous databases that combines
different kinds of data systems.

 The heterogeneous databases in a legacy
database may be connected by intra- or
inter-computer networks.


9. Text and multi-media
 Text databases are databases that contain
word descriptions for objects.

 The descriptions are words, sentences, or paragraphs
(product specifications, error or bug reports, warning
messages, summary reports, notes, or other
documents).

 Text databases may be highly unstructured (such as some
Web pages on the World Wide Web).


Contd…
 Some text databases may be somewhat
structured, that is, semi-structured (such as
e-mail messages and many HTML/XML Web
pages).

 Others are relatively well structured (such
as library catalogue databases).

 Text databases with highly regular
structures typically can be implemented
using relational database systems.


Contd….
 Example application: document classification

 Multimedia databases store image, audio, and
video data.

 Used in applications such as picture content-based
retrieval, voice-mail systems, video-on-demand
systems, the World Wide Web, and speech-based
user interfaces that recognize spoken commands.

 Multimedia databases must support large objects, because data
objects such as video can require gigabytes of
storage.


10. WWW
 Distributed information services, such as
Yahoo!, Google, America Online, and
AltaVista, provide rich, worldwide, on-line
information services, where data objects are
linked together to facilitate interactive access.

 Users seeking information of interest traverse
from one object via links to another.

 Capturing user access patterns in such
distributed information environments is called
Web usage mining (or Weblog mining).
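
A very small form of Web usage mining is counting how often users move from one page to another. The sketch assumes a hypothetical, already-parsed log of (user, page) requests in time order; real Web server logs need parsing and session splitting first.

```python
from collections import Counter

# Hypothetical parsed Weblog: (user, requested page), in time order.
weblog = [
    ("u1", "/home"),
    ("u1", "/products"),
    ("u2", "/home"),
    ("u1", "/cart"),
    ("u2", "/products"),
]

last_page = {}           # most recent page seen for each user
transitions = Counter()  # (from_page, to_page) -> number of traversals

for user, page in weblog:
    if user in last_page:
        transitions[(last_page[user], page)] += 1
    last_page[user] = page

print(transitions.most_common())
```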
