DM 1
Ravleen Kaur, NSUT
Overview of terms
Data refers to raw facts, figures, and statistics that can be processed to
extract meaningful information. It serves as the foundational element for
analysis, decision-making, and understanding trends in various fields.
Types of Data: By nature
● Qualitative Data (Categorical Data): Descriptive data that cannot be
measured numerically.
○ Examples: Colors, names, categories (e.g., gender, type of car, customer
feedback).
● Quantitative Data (Numerical Data): Data that can be measured and
expressed numerically.
○ Subtypes:
■ Discrete Data: Integer values, countable items.
● Examples: Number of customers, number of products sold.
■ Continuous Data: Any value within a range, often measured.
● Examples: Temperature, time, salary.
Types of Data: By measurement scale
● Nominal Data: Data that can be categorized but not ordered. Examples: Gender, race, or types of fruits.
● Interval Data: Numeric data with meaningful intervals but no true zero point. Examples: Temperature (Celsius or Fahrenheit), IQ scores.
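As a minimal illustration, the sketch below (hypothetical data, assuming pandas is available) shows how the types above typically map onto column dtypes in a DataFrame:

```python
import pandas as pd

# Hypothetical records mixing the data types described above
df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple"],        # nominal (categorical)
    "customers": [12, 7, 9],                      # discrete (countable integers)
    "temperature_c": [21.5, 19.8, 23.1],          # continuous (interval scale)
})
df["fruit"] = df["fruit"].astype("category")      # mark nominal data explicitly

print(df.dtypes)   # category, int64, float64
```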
Performance
The performance of a data mining system relies primarily on the efficiency of the algorithms and techniques used. If the designed algorithms and techniques are not up to the mark, the efficiency of the data mining process will be affected adversely.
Challenges of Implementation in Data Mining
Data Distribution
Practically, it is quite a tough task to collect all the data in a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.
Complex Data
Real-world data is heterogeneous: it could be multimedia data (including audio, video, and images), complex data, spatial data, time series, and so on. Managing these various types of data and extracting all the useful information is a tough task. Most of the time, new technologies, tools, and methodologies have to be refined to obtain specific information.
Data Visualization
In data mining, data visualization is a very important process because it is the primary method for presenting the output to the user in an understandable way. The extracted data should convey the exact meaning of what it intends to express. However, representing the information to the end user in a precise and easy way is often difficult. Because both the input data and the output information can be complicated, very efficient and effective data visualization processes need to be implemented for the presentation to succeed.
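As a minimal sketch of the idea, assuming matplotlib and scikit-learn are available (the data here is synthetic, not from any real mining run), a simple scatter plot is often enough to present discovered clusters to an end user:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for mined records (hypothetical)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# A scatter plot colored by cluster label presents the mining output visually
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title("Clusters found in the data")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
```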
Data Mining vs. Machine Learning
● Techniques: Data mining commonly uses clustering, classification, regression, association rule mining, and anomaly detection; machine learning involves supervised learning, unsupervised learning, reinforcement learning, and deep learning.
● Goal: The primary goal of data mining is to analyze data and summarize it into useful information, often for decision-making; the main aim of machine learning is to build models that can predict outcomes or classify data based on new inputs.
● Human involvement: Data mining requires human involvement for even a minor change in the rules; machine learning can alter the rules according to the environment and provide solutions to a specific problem, with human effort required only while defining the algorithm.
● Components: Data mining uses a data warehouse, a data mining engine, and pattern assessment techniques to produce results; machine learning involves neural networks and algorithms to produce results.
● Applications: Data mining is used in fields like market research, fraud detection, customer segmentation, and more; machine learning is used in applications like image recognition, natural language processing, recommendation systems, and autonomous vehicles.
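A minimal sketch of the contrast, assuming scikit-learn is available: the clustering call summarizes the data without using labels (a typical data mining step), while the classifier learns a model that predicts outcomes for new inputs (the machine learning emphasis):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Data mining emphasis: summarize the data into groups without using labels
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [list(groups).count(k) for k in range(3)])

# Machine learning emphasis: learn a model that predicts classes for new inputs
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print("predicted classes:", model.predict(X[:2]))  # stand-ins for new inputs
```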
Data Mining vs. Database Management Systems (DBMS)
● Aim: Data mining aims to uncover hidden insights and knowledge from data; a DBMS aims to store, retrieve, and manage data efficiently and securely.
● Data handled: Data mining can handle both structured and unstructured data and often involves significant preprocessing; a DBMS primarily manages structured data organized into tables.
● Methods: Data mining utilizes methods like clustering, classification, regression, association rule mining, and anomaly detection; a DBMS utilizes structured query language (SQL) for data retrieval and manipulation.
● Output: Data mining produces patterns, rules, and insights that can inform decision-making; a DBMS provides access to data and supports transaction management and data integrity.
● Usage: Data mining is used in fields like marketing, healthcare, finance, and the social sciences for predictive analytics and decision support; DBMSs are used across domains for data storage, retrieval, and management in applications like enterprise systems, websites, and analytics platforms.
● Tools: Data mining commonly uses specialized software (e.g., RapidMiner, Weka, KNIME) for analysis; DBMS examples include MySQL, Oracle, Microsoft SQL Server, and PostgreSQL.
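To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical sales table: the SQL query retrieves and aggregates stored records (the DBMS side), while the association-rule-style pass below it derives a pattern that is not stored anywhere (the data mining side):

```python
import sqlite3

# In-memory DBMS table with hypothetical sales records
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, product TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", "bread"), ("a", "butter"), ("b", "bread"),
                  ("b", "butter"), ("c", "bread")])

# DBMS side: SQL retrieves and aggregates exactly what is stored
rows = conn.execute(
    "SELECT product, COUNT(*) FROM sales GROUP BY product").fetchall()
print(rows)

# Mining side: derive a pattern not stored in any table, e.g. how often
# customers who buy bread also buy butter (a simple association rule)
baskets = {}
for customer, product in conn.execute("SELECT customer, product FROM sales"):
    baskets.setdefault(customer, set()).add(product)
both = sum(1 for items in baskets.values() if {"bread", "butter"} <= items)
bread = sum(1 for items in baskets.values() if "bread" in items)
print(f"confidence(bread -> butter) = {both / bread:.2f}")
```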
Data Mining vs. OLAP
● Definition: Data mining refers to the field of computer science which deals with the extraction of data, trends, and patterns from huge sets of data; OLAP is a technology for immediate access to data with the help of multidimensional structures.
● Level of detail: Data mining deals with the data summary; OLAP deals with detailed transaction-level data.
● Orientation: Data mining is used for future data prediction; OLAP is used for analyzing past data.
Data Mining vs. Statistics
● Data: Data mining uses numeric and non-numeric data; statistics uses numeric data only.
● Types: The types of data mining are clustering, classification, association, neural networks, sequence-based analysis, visualization, etc.; the types of statistics are descriptive statistics and inferential statistics.
● Data set size: Data mining is suitable for huge data sets; statistics is suitable for smaller data sets.
● Process: Data mining is an inductive process, meaning the generation of new theory from data; statistics is a deductive process and does not indulge in making predictions.
● Cleaning: Data cleaning is a part of data mining; in statistics, clean data is used to implement the statistical method.
● Automation: Data mining requires less user interaction to validate the model, so it is easy to automate; statistics requires user interaction to validate the model, so it is complex to automate.
● Applications: Data mining applications include financial data analysis, the retail industry, the telecommunication industry, biological data analysis, certain scientific applications, etc.; the applications of statistics include biostatistics, quality control, demography, operational research, etc.
Data Mining Process
1. Data collection/gathering: Relevant data for an analytics application is identified and assembled. The data may be located in different source systems, a data warehouse, or a data lake, an increasingly common repository in big data environments that contains a mix of structured and unstructured data. External data sources may also be used. Wherever the data comes from, a data scientist often moves it to a data lake for the remaining steps in the process.
2. Data preparation: This stage includes a set of steps to get the data ready to be mined. It starts with data exploration, profiling, and pre-processing, followed by data cleansing work to fix errors and other data quality issues. Data transformation is also done to make data sets consistent, unless a data scientist is looking to analyze unfiltered raw data for a particular application.
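A minimal data preparation sketch, assuming pandas is available and using hypothetical raw records; it illustrates the cleansing and transformation steps named above (deduplication, type fixes, text normalization, and missing-value handling):

```python
import pandas as pd

# Hypothetical raw data gathered from several sources
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "salary": ["52000", "61000", "61000", None],
    "city": ["Delhi", "delhi ", "delhi ", "Mumbai"],
})

# Data preparation: cleansing and transformation before mining
prepared = (
    raw.drop_duplicates()                                          # remove repeated records
       .assign(salary=lambda d: pd.to_numeric(d["salary"]),        # fix data types
               city=lambda d: d["city"].str.strip().str.title())   # normalize text
       .dropna(subset=["salary"])                                  # handle missing values
)
print(prepared)
```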
Knowledge Base
A knowledge base in such an architecture can provide a
foundation for finding meaningful insights useful for
decision-making, allowing the user to analyze previously
unseen trends or correlations. The goal is to distill the
information into actionable intelligence that can inform
decisions about marketing campaigns, customer segmentation,
product development, and more.
Syntax:
● Syntax is the set of rules that decides how we can construct legal sentences in the logic.
● It determines which symbols we can use in knowledge representation and how to write those symbols; a toy example follows below.
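As an illustration of syntax rules, the sketch below defines a hypothetical mini-grammar for propositional sentences and checks whether a token sequence is a legal sentence under it:

```python
# Toy syntax: Sentence -> Atom | "~" Sentence | "(" Sentence Op Sentence ")"
# Atoms are P, Q, R; Op is "&" or "|". (Hypothetical mini-grammar.)
def is_sentence(tokens):
    ok, rest = parse(tokens)
    return ok and not rest          # legal only if all tokens were consumed

def parse(tokens):
    if not tokens:
        return False, tokens
    head, rest = tokens[0], tokens[1:]
    if head in ("P", "Q", "R"):     # an atom is a legal sentence
        return True, rest
    if head == "~":                 # negation of a legal sentence
        return parse(rest)
    if head == "(":                 # binary connective inside parentheses
        ok, rest = parse(rest)
        if ok and rest and rest[0] in ("&", "|"):
            ok, rest = parse(rest[1:])
            if ok and rest and rest[0] == ")":
                return True, rest[1:]
        return False, tokens
    return False, tokens

print(is_sentence("( P & ~ Q )".split()))   # True: a legal sentence
print(is_sentence("P & Q".split()))         # False: violates the syntax rules
```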
Drawbacks:
➢ Semantic networks take more computational time at runtime, as we need to traverse the complete network tree to answer a question. In the worst case, we may traverse the entire tree only to find that the solution does not exist in the network.
➢ Semantic networks try to model human-like memory (which has about 10^15 neurons and links) to store information, but in practice it is not possible to build such a vast semantic network.
➢ These types of representations are inadequate as they do not have any equivalent quantifier, e.g., for all, for some, none, etc.
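A small sketch of the first drawback, using a hypothetical semantic network stored as a Python dictionary: a negative answer is reached only after the traversal has visited every reachable node and link:

```python
from collections import deque

# Hypothetical semantic network: node -> list of (relation, node) links
network = {
    "Canary": [("is-a", "Bird")],
    "Bird":   [("is-a", "Animal"), ("can", "Fly")],
    "Animal": [("has", "Skin")],
}

def holds(start, relation, target):
    """Breadth-first search over the network; in the worst case every node
    and link is visited before concluding the answer is not in the network."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for rel, nxt in network.get(node, []):
            if rel == relation and nxt == target:
                return True
            if rel == "is-a" and nxt not in seen:   # inherit via is-a links
                seen.add(nxt)
                queue.append(nxt)
    return False

print(holds("Canary", "can", "Fly"))    # True (inherited from Bird)
print(holds("Canary", "can", "Swim"))   # False, but only after full traversal
```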
Knowledge Representation Methods
3. Frame Representation
● A frame is a record-like structure which consists of a collection of attributes and their values to describe an entity in the world.
● Frames are the AI data structure which divides knowledge into substructures by representing stereotyped situations. A frame consists of a collection of slots and slot values.
● These slots may be of any type and size. Slots have names and values, which are called facets.
● Facets: The various aspects of a slot are known as facets. Facets are features of frames which enable us to put constraints on the frames. Example: IF-NEEDED facets are called when the data of a particular slot is needed (as in the sketch below).
● A frame may consist of any number of slots, a slot may include any number of facets, and a facet may have any number of values. A frame is also known as slot-filler knowledge representation in artificial intelligence.
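A minimal frame sketch in Python (a hypothetical design, not a standard library): each slot carries facets such as VALUE and IF-NEEDED, and the IF-NEEDED facet is invoked only when the slot's data is actually needed:

```python
# Minimal frame sketch: a frame is a named set of slots, and each slot
# carries facets such as VALUE and IF-NEEDED (hypothetical design).
class Frame:
    def __init__(self, name, **slots):
        self.name = name
        self.slots = slots          # slot name -> {facet name: facet value}

    def get(self, slot):
        facets = self.slots[slot]
        if "VALUE" in facets:
            return facets["VALUE"]
        if "IF-NEEDED" in facets:   # compute the value only when it is asked for
            return facets["IF-NEEDED"](self)
        return None

# A stereotyped "hotel room" situation described by slots and facets
room = Frame(
    "hotel-room",
    kind={"VALUE": "room"},
    nightly_rate={"VALUE": 100},
    weekly_rate={"IF-NEEDED": lambda f: 7 * f.get("nightly_rate")},
)
print(room.get("kind"))         # room
print(room.get("weekly_rate"))  # 700, computed on demand by the IF-NEEDED facet
```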