DM 1
Ravleen Kaur, NSUT
Overview of terms
Data refers to raw facts, figures, and statistics that can be processed to
extract meaningful information. It serves as the foundational element for
analysis, decision-making, and understanding trends in various fields.
Types of Data: By nature
● Qualitative Data (Categorical Data): Descriptive data that cannot be
measured numerically.
○ Examples: Colors, names, categories (e.g., gender, type of car, customer
feedback).
● Quantitative Data (Numerical Data): Data that can be measured and
expressed numerically.
○ Subtypes:
■ Discrete Data: Integer values, countable items.
● Examples: Number of customers, number of products sold.
■ Continuous Data: Any value within a range, often measured.
● Examples: Temperature, time, salary.
Types of Data: By measurement scale
● Nominal Data: Data that can be categorized but not ordered. Examples: Gender, race, or types of fruits.
● Interval Data: Numeric data with meaningful intervals but no true zero point. Examples: Temperature (Celsius or Fahrenheit), IQ scores.
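As a minimal illustration, the sketch below (hypothetical data, assuming pandas is available) shows how the types above typically map onto column dtypes in a DataFrame:

```python
import pandas as pd

# Hypothetical records mixing the data types described above
df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple"],        # nominal (categorical)
    "customers": [12, 7, 9],                      # discrete (countable integers)
    "temperature_c": [21.5, 19.8, 23.1],          # continuous (interval scale)
})
df["fruit"] = df["fruit"].astype("category")      # mark nominal data explicitly

print(df.dtypes)   # category, int64, float64
```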
Performance
The performance of a data mining system relies primarily on the efficiency of the algorithms and techniques used. If the designed algorithms and techniques are not up to the mark, the efficiency of the data mining process will be affected adversely.
Challenges of Implementation in Data Mining
Data Distribution
Practically, it is quite a tough task to collect all the data in a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.
Complex Data
Real-world data is heterogeneous: it could be multimedia data (including audio, video, and images), complex data, spatial data, time series, and so on. Managing these various types of data and extracting all the useful information is a tough task. Most of the time, new technologies, tools, and methodologies have to be refined to obtain specific information.
Data Visualization
In data mining, data visualization is a very important process because it is the primary method for presenting the output to the user in an understandable way. The extracted data should convey the exact meaning of what it intends to express. However, representing the information to the end user in a precise and easy way is often difficult. Because both the input data and the output information can be complicated, very efficient and effective data visualization processes need to be implemented for the presentation to succeed.
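As a minimal sketch of the idea, assuming matplotlib and scikit-learn are available (the data here is synthetic, not from any real mining run), a simple scatter plot is often enough to present discovered clusters to an end user:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for mined records (hypothetical)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# A scatter plot colored by cluster label presents the mining output visually
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title("Clusters found in the data")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
```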
Data Mining vs. Machine Learning
● Techniques: Data mining commonly uses clustering, classification, regression, association rule mining, and anomaly detection; machine learning involves supervised learning, unsupervised learning, reinforcement learning, and deep learning.
● Goal: The primary goal of data mining is to analyze data and summarize it into useful information, often for decision-making; the main aim of machine learning is to build models that can predict outcomes or classify data based on new inputs.
● Human involvement: Data mining requires human involvement for even a minor change in the rules; machine learning can alter the rules according to the environment and provide solutions to a specific problem, with human effort required only while defining the algorithm.
● Components: Data mining uses a data warehouse, a data mining engine, and pattern assessment techniques to produce results; machine learning involves neural networks and algorithms to produce results.
● Applications: Data mining is used in fields like market research, fraud detection, customer segmentation, and more; machine learning is used in applications like image recognition, natural language processing, recommendation systems, and autonomous vehicles.
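A minimal sketch of the contrast, assuming scikit-learn is available: the clustering call summarizes the data without using labels (a typical data mining step), while the classifier learns a model that predicts outcomes for new inputs (the machine learning emphasis):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Data mining emphasis: summarize the data into groups without using labels
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [list(groups).count(k) for k in range(3)])

# Machine learning emphasis: learn a model that predicts classes for new inputs
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print("predicted classes:", model.predict(X[:2]))  # stand-ins for new inputs
```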
Data Mining vs. Database Management Systems (DBMS)
● Aim: Data mining aims to uncover hidden insights and knowledge from data; a DBMS aims to store, retrieve, and manage data efficiently and securely.
● Data handled: Data mining can handle both structured and unstructured data and often involves significant preprocessing; a DBMS primarily manages structured data organized into tables.
● Methods: Data mining utilizes methods like clustering, classification, regression, association rule mining, and anomaly detection; a DBMS utilizes structured query language (SQL) for data retrieval and manipulation.
● Output: Data mining produces patterns, rules, and insights that can inform decision-making; a DBMS provides access to data and supports transaction management and data integrity.
● Usage: Data mining is used in fields like marketing, healthcare, finance, and the social sciences for predictive analytics and decision support; DBMSs are used across domains for data storage, retrieval, and management in applications like enterprise systems, websites, and analytics platforms.
● Tools: Data mining commonly uses specialized software (e.g., RapidMiner, Weka, KNIME) for analysis; DBMS examples include MySQL, Oracle, Microsoft SQL Server, and PostgreSQL.
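To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical sales table: the SQL query retrieves and aggregates stored records (the DBMS side), while the association-rule-style pass below it derives a pattern that is not stored anywhere (the data mining side):

```python
import sqlite3

# In-memory DBMS table with hypothetical sales records
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, product TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", "bread"), ("a", "butter"), ("b", "bread"),
                  ("b", "butter"), ("c", "bread")])

# DBMS side: SQL retrieves and aggregates exactly what is stored
rows = conn.execute(
    "SELECT product, COUNT(*) FROM sales GROUP BY product").fetchall()
print(rows)

# Mining side: derive a pattern not stored in any table, e.g. how often
# customers who buy bread also buy butter (a simple association rule)
baskets = {}
for customer, product in conn.execute("SELECT customer, product FROM sales"):
    baskets.setdefault(customer, set()).add(product)
both = sum(1 for items in baskets.values() if {"bread", "butter"} <= items)
bread = sum(1 for items in baskets.values() if "bread" in items)
print(f"confidence(bread -> butter) = {both / bread:.2f}")
```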
Data Mining vs. OLAP
● Definition: Data mining refers to the field of computer science which deals with the extraction of data, trends, and patterns from huge sets of data; OLAP is a technology for immediate access to data with the help of multidimensional structures.
● Level of detail: Data mining deals with the data summary; OLAP deals with detailed transaction-level data.
● Orientation: Data mining is used for future data prediction; OLAP is used for analyzing past data.
Data Mining vs. Statistics
● Data: Data mining uses numeric and non-numeric data; statistics uses numeric data only.
● Types: The types of data mining are clustering, classification, association, neural networks, sequence-based analysis, visualization, etc.; the types of statistics are descriptive statistics and inferential statistics.
● Data set size: Data mining is suitable for huge data sets; statistics is suitable for smaller data sets.
● Process: Data mining is an inductive process, meaning the generation of new theory from data; statistics is a deductive process and does not indulge in making predictions.
● Cleaning: Data cleaning is a part of data mining; in statistics, clean data is used to implement the statistical method.
● Automation: Data mining requires less user interaction to validate the model, so it is easy to automate; statistics requires user interaction to validate the model, so it is complex to automate.
● Applications: Data mining applications include financial data analysis, the retail industry, the telecommunication industry, biological data analysis, certain scientific applications, etc.; the applications of statistics include biostatistics, quality control, demography, operational research, etc.
Data Mining Process
1. Data collection/gathering: Relevant data for an analytics application is identified and assembled. The data may be located in different source systems, a data warehouse, or a data lake, an increasingly common repository in big data environments that contains a mix of structured and unstructured data. External data sources may also be used. Wherever the data comes from, a data scientist often moves it to a data lake for the remaining steps in the process.
2. Data preparation: This stage includes a set of steps to get the data ready to be mined. It starts with data exploration, profiling, and pre-processing, followed by data cleansing work to fix errors and other data quality issues. Data transformation is also done to make data sets consistent, unless a data scientist is looking to analyze unfiltered raw data for a particular application.
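A minimal data preparation sketch, assuming pandas is available and using hypothetical raw records; it illustrates the cleansing and transformation steps named above (deduplication, type fixes, text normalization, and missing-value handling):

```python
import pandas as pd

# Hypothetical raw data gathered from several sources
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "salary": ["52000", "61000", "61000", None],
    "city": ["Delhi", "delhi ", "delhi ", "Mumbai"],
})

# Data preparation: cleansing and transformation before mining
prepared = (
    raw.drop_duplicates()                                          # remove repeated records
       .assign(salary=lambda d: pd.to_numeric(d["salary"]),        # fix data types
               city=lambda d: d["city"].str.strip().str.title())   # normalize text
       .dropna(subset=["salary"])                                  # handle missing values
)
print(prepared)
```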
Knowledge Base
A knowledge base in such an architecture can provide a
foundation for finding meaningful insights useful for
decision-making, allowing the user to analyze previously
unseen trends or correlations. The goal is to distill the
information into actionable intelligence that can inform
decisions about marketing campaigns, customer segmentation,
product development, and more.
Syntax:
● Syntax is the set of rules that decides how we can construct legal sentences in the logic.
● It determines which symbols we can use in knowledge representation and how to write those symbols; a toy example follows below.
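As an illustration of syntax rules, the sketch below defines a hypothetical mini-grammar for propositional sentences and checks whether a token sequence is a legal sentence under it:

```python
# Toy syntax: Sentence -> Atom | "~" Sentence | "(" Sentence Op Sentence ")"
# Atoms are P, Q, R; Op is "&" or "|". (Hypothetical mini-grammar.)
def is_sentence(tokens):
    ok, rest = parse(tokens)
    return ok and not rest          # legal only if all tokens were consumed

def parse(tokens):
    if not tokens:
        return False, tokens
    head, rest = tokens[0], tokens[1:]
    if head in ("P", "Q", "R"):     # an atom is a legal sentence
        return True, rest
    if head == "~":                 # negation of a legal sentence
        return parse(rest)
    if head == "(":                 # binary connective inside parentheses
        ok, rest = parse(rest)
        if ok and rest and rest[0] in ("&", "|"):
            ok, rest = parse(rest[1:])
            if ok and rest and rest[0] == ")":
                return True, rest[1:]
        return False, tokens
    return False, tokens

print(is_sentence("( P & ~ Q )".split()))   # True: a legal sentence
print(is_sentence("P & Q".split()))         # False: violates the syntax rules
```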
Drawbacks:
➢ Semantic networks take more computational time at runtime, as we need to traverse the complete network tree to answer a question. In the worst case, we may traverse the entire tree only to find that the solution does not exist in the network.
➢ Semantic networks try to model human-like memory (which has about 10^15 neurons and links) to store information, but in practice it is not possible to build such a vast semantic network.
➢ These types of representations are inadequate as they do not have any equivalent quantifier, e.g., for all, for some, none, etc.
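A small sketch of the first drawback, using a hypothetical semantic network stored as a Python dictionary: a negative answer is reached only after the traversal has visited every reachable node and link:

```python
from collections import deque

# Hypothetical semantic network: node -> list of (relation, node) links
network = {
    "Canary": [("is-a", "Bird")],
    "Bird":   [("is-a", "Animal"), ("can", "Fly")],
    "Animal": [("has", "Skin")],
}

def holds(start, relation, target):
    """Breadth-first search over the network; in the worst case every node
    and link is visited before concluding the answer is not in the network."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for rel, nxt in network.get(node, []):
            if rel == relation and nxt == target:
                return True
            if rel == "is-a" and nxt not in seen:   # inherit via is-a links
                seen.add(nxt)
                queue.append(nxt)
    return False

print(holds("Canary", "can", "Fly"))    # True (inherited from Bird)
print(holds("Canary", "can", "Swim"))   # False, but only after full traversal
```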
Knowledge Representation Methods
3. Frame Representation
● A frame is a record-like structure which consists of a collection of attributes and their values to describe an entity in the world.
● Frames are the AI data structure which divides knowledge into substructures by representing stereotyped situations. A frame consists of a collection of slots and slot values.
● These slots may be of any type and size. Slots have names and values, which are called facets.
● Facets: The various aspects of a slot are known as facets. Facets are features of frames which enable us to put constraints on the frames. Example: IF-NEEDED facets are called when the data of a particular slot is needed (as in the sketch below).
● A frame may consist of any number of slots, a slot may include any number of facets, and a facet may have any number of values. A frame is also known as slot-filler knowledge representation in artificial intelligence.
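A minimal frame sketch in Python (a hypothetical design, not a standard library): each slot carries facets such as VALUE and IF-NEEDED, and the IF-NEEDED facet is invoked only when the slot's data is actually needed:

```python
# Minimal frame sketch: a frame is a named set of slots, and each slot
# carries facets such as VALUE and IF-NEEDED (hypothetical design).
class Frame:
    def __init__(self, name, **slots):
        self.name = name
        self.slots = slots          # slot name -> {facet name: facet value}

    def get(self, slot):
        facets = self.slots[slot]
        if "VALUE" in facets:
            return facets["VALUE"]
        if "IF-NEEDED" in facets:   # compute the value only when it is asked for
            return facets["IF-NEEDED"](self)
        return None

# A stereotyped "hotel room" situation described by slots and facets
room = Frame(
    "hotel-room",
    kind={"VALUE": "room"},
    nightly_rate={"VALUE": 100},
    weekly_rate={"IF-NEEDED": lambda f: 7 * f.get("nightly_rate")},
)
print(room.get("kind"))         # room
print(room.get("weekly_rate"))  # 700, computed on demand by the IF-NEEDED facet
```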