
Unit 1

Data mining is the process of extracting knowledge from large data sets using statistical and computational techniques to discover patterns and relationships for informed decision-making. It has applications across various industries, including marketing and healthcare, and is part of the broader Knowledge Discovery in Data (KDD) process, which involves several steps from data cleaning to knowledge presentation. The document also discusses the evolution of database technology, the difference between KDD and data mining, and the functionalities and issues associated with data mining.

School of Computing Science and Engineering

Course Code : Course Name: Data mining and web Algo

Unit – 1
Data Mining

Faculty Name: Mr. Soumalya Ghosh Program Name: B.Tech CSE


What is Data Mining?

• Data mining is the process of extracting knowledge or insights from large amounts of data using various statistical and computational techniques.
• The primary goal of data mining is to discover hidden patterns and relationships in the data that can be used to make informed decisions or predictions.
What is Data Mining?

• This involves exploring the data using various techniques such as
– clustering
– classification
– regression analysis
– association rule mining
– anomaly detection
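As an illustration of clustering, the first technique listed above, here is a minimal sketch of k-means on one-dimensional data. The function name `kmeans_1d`, the starting centers, and the spend figures are all hypothetical and chosen only for this example; the slides do not prescribe any particular algorithm.

```python
from statistics import mean


def kmeans_1d(values, centers, iterations=10):
    """Group numbers around starting centers (a basic k-means sketch)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # centers stopped moving, so the grouping is stable
        centers = new_centers
    return centers, clusters


# Hypothetical monthly spend per customer (illustrative data only).
spend = [12, 15, 14, 90, 95, 88]
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(clusters)
```

On this toy data the low spenders and the high spenders end up in separate groups, although no class labels were supplied.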
Data Mining: Applications

• Data mining has a wide range of applications across various industries, including marketing, finance, healthcare, and telecommunications.
• For example,
– in marketing, data mining can be used to identify customer segments and target marketing campaigns
– in healthcare, it can be used to identify risk factors for diseases and develop personalized treatment plans.
Evolution of Database Technology

• The explosive growth of data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Why is it called Data Mining?

• Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data.
• The term is actually a misnomer.
– Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining.
– Thus, data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long.
– “Knowledge mining,” a shorter term, may not reflect the emphasis on mining from large amounts of data.
• Thus, such a misnomer that carries both “data” and “mining” became a popular choice.
Why is it called Data Mining?

• Many other terms carry a similar or slightly different meaning to data mining, such as
– knowledge mining from data
– knowledge extraction
– data/pattern analysis
– data archaeology
– data dredging
• Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD.
• Alternatively, others view data mining as simply an essential step in the process of knowledge discovery.
Data mining as a step in the process of knowledge discovery

• 1. Data cleaning (to remove noise and inconsistent data)
• 2. Data integration (where multiple data sources may be combined)
• 3. Data selection (where data relevant to the analysis task are retrieved from the database)
• 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
• 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
• 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
• 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
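The seven steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the two "stores", the record layout `(customer, item)`, the function names, and the 50% support threshold are all hypothetical, and the mining step simply counts item pairs rather than running a full algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources: (customer, item); None marks noise.
store_a = [("c1", "bread"), ("c1", "milk"), ("c2", "bread"), ("c2", None)]
store_b = [("c2", "milk"), ("c3", "bread"), ("c3", "pen"), ("c3", "bag")]


def clean(records):
    # Step 1: drop noisy records that have a missing field.
    return [(c, i) for c, i in records if c and i]


def integrate(*sources):
    # Step 2: combine several sources into one collection.
    return [record for source in sources for record in source]


def select(records, relevant_items):
    # Step 3: keep only the data relevant to the analysis task.
    return [(c, i) for c, i in records if i in relevant_items]


def transform(records):
    # Step 4: consolidate records into one basket (set of items) per customer.
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):
    # Step 5: count how often each pair of items occurs together.
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts


def evaluate(counts, n_baskets, min_support):
    # Step 6: keep pairs whose support (share of baskets) meets the threshold.
    supports = {pair: n / n_baskets for pair, n in counts.items()}
    return {pair: s for pair, s in supports.items() if s >= min_support}


def present(patterns):
    # Step 7: render the surviving patterns for the user.
    return [f"{a} & {b}: support {s:.0%}" for (a, b), s in sorted(patterns.items())]


records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, {"bread", "milk", "pen"}))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
print(report)
```

In practice the steps are revisited repeatedly, for example by changing the selection or the threshold after looking at the first report.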
Knowledge Discovery (KDD) Process

[Figure: the KDD process as a pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
– Data mining is the core of the knowledge discovery process.
Difference between KDD and Data Mining

• Although the terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts.
• KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data.
• In other words, Data Mining is only the application of a specific algorithm chosen according to the overall goal of the KDD process.
• KDD is an iterative process in which evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed to get different and more appropriate results.
Data Mining and Business Intelligence

[Figure: pyramid of layers, with increasing potential to support business decisions toward the top. Roles shown beside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Architecture: Typical Data Mining System

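[Figure: layered architecture, top to bottom, with a Knowledge-Base beside the Data Mining Engine and Pattern Evaluation layers]
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine
• Database or Data Warehouse Server
• Data cleaning, integration, and selection
• Data repositories: Database, Data Warehouse, World-Wide Web, Other Info Repositories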
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]
Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Data Mining Functionalities

• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Frequent patterns, association, correlation vs. causality
– Diaper → Beer [0.5%, 75%] (Correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas mileage)
– Predict some unknown or missing numerical values
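The bracketed pair in the Diaper and Beer rule above is not explained on the slide; the sketch below assumes it follows the usual [support, confidence] convention for association rules. Support is the share of all transactions containing both items, and confidence is the share of transactions containing the first item that also contain the second. The function name and the eight baskets are hypothetical and do not reproduce the slide's 0.5% and 75% figures.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    with_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = with_both / n
    # Confidence is undefined when the antecedent never occurs; report 0.0 then.
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical shopping baskets (illustrative data only).
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"bread"},
    {"beer"},
]
support, confidence = support_confidence(baskets, "diaper", "beer")
print(f"diaper -> beer [{support:.1%}, {confidence:.0%}]")
```

A high confidence only says the items tend to appear together; it does not by itself show that buying one causes buying the other, which is the question the slide raises.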
Data Mining Functionalities

• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster houses to find
distribution patterns
– Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general behavior of the data
– Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
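As one possible illustration of outlier analysis, the sketch below flags values that lie far from the median, measured in units of the median absolute deviation (MAD). This is a swapped-in technique chosen for brevity, not a method named in the slides. The function name, the transaction amounts, the scaling constant 0.6745, and the cutoff 3.5 are assumptions for this example; the last two are common conventions rather than fixed rules.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no measurable spread, so nothing can be flagged this way
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical card transaction amounts (illustrative data only).
amounts = [20, 22, 19, 21, 23, 20, 500]
print(mad_outliers(amounts))
```

Whether a flagged value is noise to discard or an exception worth investigating, as in fraud detection, still depends on the application.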
Data Mining - Issues

• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Data Mining Applications
