UNIT - II - Data Mining Essentials

The document discusses data mining essentials and community analysis. It covers topics like the KDD process, data types, vectorization, data quality, preprocessing, sampling techniques, supervised and unsupervised learning algorithms. Decision tree learning and other supervised learning algorithms like naive Bayes, k-nearest neighbor and neural networks are explained.


UNIT – II

Community Analysis

Data mining essentials – introduction


• Mountains of raw data are generated daily by individuals on social media.
• Data mining provides the necessary tools for discovering patterns in data.
• This unit covers the general process for analyzing social media data and
ways to use data mining algorithms in this process to extract actionable
patterns from raw data.
• The process of extracting useful patterns from raw data is known as
knowledge discovery in databases (KDD).
Data mining essentials
KDD process
• In the KDD process, data is represented in a tabular format.
• Consider the example of predicting whether an individual who visits
an online bookseller is going to buy a specific book.
• John is an example of an instance. Instances are also called points,
data points, or observations.
• A dataset consists of one or more instances.
KDD Process
• A dataset is represented using a set of features, and an instance is
represented using values assigned to these features. Features are also
known as measurements or attributes.
• Instances – values
• Features – attributes/fields
• An instance such as John, in which the class attribute value is unknown,
is called an unlabeled instance.
KDD process
• A labeled instance is an instance in which the class attribute value is
known.
• The class attribute is optional in a dataset; it is only necessary for
prediction or classification purposes.
• There are different types of features:
• i) continuous features, which take real values
• ii) discrete features, which take values from a countable set
• Types of features can be described by the “levels of measurement”
of Stanley Smith Stevens.
Types of features
• Nominal (categorical) – take values that are often represented as strings. For instance, a
customer’s name is a nominal feature.
• Ordinal – the feature values have an intrinsic order to them. Ex: high/low money spent on
an item.
• Interval – in interval features, in addition to their intrinsic ordering, differences are
meaningful whereas ratios are meaningless.
• Addition and subtraction are allowed.
• Multiplication and division are not allowed.
• Ex:
• 6:16 PM and 3:08 PM. The difference between these two time readings is meaningful (3
hours and 8 minutes); however, there is no meaning to the ratio 6:16 PM / 3:08 PM = 2.

• Ratio – ratio features, as the name suggests, add the additional properties of
multiplication and division. An individual’s income is an example of a ratio feature.
Data

• Individuals generate many types of nontabular data, such as text,
voice, or video.
• These types of data are first converted to tabular data and then
processed using data mining algorithms.
• Example:
voice can be converted to feature values using approximation
techniques such as the fast Fourier transform (FFT), after which data
mining algorithms can be applied.
• To convert text into the tabular format, a vectorization process is used.
Vectorization - Vector Space Model
• A well-known method for vectorization is the vector-space model.
• We are given a set of documents D, where each document is a set of words.
• To convert these textual documents to feature vectors, we can
represent document i with the vector di = (w1,i, w2,i, ..., wN,i),
• where wj,i represents the weight for word j in document i
and N is the number of words used for vectorization.
Vector space model
• To compute wj,i, a simple binary scheme is:
• Set wj,i = 1 when word j exists in document i
• Set wj,i = 0 when word j does not exist in document i
• Another approach is the
term frequency-inverse document frequency (TF-IDF) weighting
scheme.
• In this scheme, wj,i is calculated as wj,i = tfj,i × idfj,

• where tfj,i is the frequency of word j in document i,

• and idfj is the inverse document frequency of word j across all documents:
idfj = log(|D| / number of documents containing word j).
Term frequency-inverse document frequency
(TF-IDF)
• Example: consider the following documents:
d1 = “social media mining”
d2 = “social media data”
d3 = “financial market data”
By applying the TF-IDF vectorization model, we get the following vector values.
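The TF-IDF vectors for these three documents can be computed with a short sketch. The use of a base-2 logarithm for idf is an assumption (a different base only rescales all weights uniformly); the documents and words are exactly those from the example above.

```python
import math

# The three example documents, split into words.
docs = {
    "d1": "social media mining".split(),
    "d2": "social media data".split(),
    "d3": "financial market data".split(),
}

vocab = sorted({w for words in docs.values() for w in words})
n_docs = len(docs)

# df[w]: number of documents in which word w appears
df = {w: sum(w in words for words in docs.values()) for w in vocab}

def tfidf(words):
    # wj,i = tfj,i * idfj, with idfj = log2(|D| / dfj)
    return {w: words.count(w) * math.log2(n_docs / df[w]) for w in vocab}

vectors = {name: tfidf(words) for name, words in docs.items()}
# e.g. "mining" appears only in d1, so vectors["d1"]["mining"] = log2(3) ≈ 1.585,
# while "data" is absent from d1, so vectors["d1"]["data"] = 0.0
```

Note that words appearing in every document would get idf = log2(1) = 0, so TF-IDF automatically downweights terms that do not help distinguish documents.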
Data Quality
• Before applying data mining algorithms, the data quality needs to
be verified.
• The following aspects need to be verified:
• 1. Noise
• 2. Outliers
• 3. Missing values
• 4. Duplicate data
Data Preprocessing
• Data preprocessing should be done before applying data mining
algorithms. Common steps are:
• 1. Aggregation – performed when multiple features need to be combined into a
single one.
• 2. Discretization – the process of converting continuous features to
discrete ones; deciding the continuous range that is assigned
to each discrete value is called discretization.
• 3. Feature selection – selecting appropriate features (columns/fields).
• 4. Feature extraction – deriving new features from other features.
• 5. Sampling – processing a smaller representative set of the data.
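Discretization can be sketched as mapping each continuous value to the index of the bin it falls in. The bin edges and the age feature below are illustrative choices, not taken from the slides.

```python
def discretize(value, edges):
    # Return the index of the first bin edge the value does not exceed;
    # values above every edge fall into the last bin.
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

edges = [12, 19, 64]                         # child / teen / adult / senior
ages = [3, 17, 25, 42, 68]                   # a continuous feature
bins = [discretize(a, edges) for a in ages]  # -> [0, 1, 2, 2, 3]
```

Each continuous age is replaced by a discrete bin index, which downstream algorithms that expect discrete features can then use directly.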
Data Preprocessing
• Three major sampling techniques:
• 1. Random sampling – instances are selected uniformly from the
dataset.
• 2. Sampling with or without replacement:
With replacement – an instance can be selected multiple times.
Without replacement – instances are removed from the selection pool
once selected.
• 3. Stratified sampling – the dataset is first partitioned into multiple
bins; then a fixed number of instances are selected from each bin
using random sampling.
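The three techniques can be sketched with the standard library; the toy dataset of 20 instances and the strata boundaries are assumptions for illustration.

```python
import random

random.seed(7)                 # fixed seed so the sketch is reproducible
data = list(range(1, 21))      # a toy dataset of 20 instances

# 1. Random sampling without replacement: each instance is drawn
#    uniformly and leaves the selection pool once selected.
without_repl = random.sample(data, 5)

# 2. Sampling with replacement: the same instance may be drawn twice.
with_repl = [random.choice(data) for _ in range(5)]

# 3. Stratified sampling: partition the dataset into bins, then draw
#    a fixed number of instances from each bin by random sampling.
strata = [data[:10], data[10:]]
stratified = [x for bin_ in strata for x in random.sample(bin_, 2)]
```

Stratified sampling guarantees each bin contributes the same number of instances, which plain random sampling cannot, since an unlucky draw may leave a bin unrepresented.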
Data Mining Algorithms
• Data mining algorithms can be divided into several categories:
• 1. Supervised learning and
• 2. Unsupervised learning
• In supervised learning,
-----the class attribute exists, and
-----the task is to predict the class attribute value.
• In unsupervised learning
---- the dataset has no class attribute, and
---- our task is to find similar instances in the dataset and group them.
Supervised Learning
• The class attribute values for the dataset are known before running the algorithm.
• This data is called labeled data or training data.
• Instances in this set are tuples in the format (x,y),
----where x is a vector and
----y is the class attribute, commonly a scalar.
Example:
Scalars are simply single numerical values. They represent a single piece of
information without any internal structure.
• Age of a customer (e.g., 35)
• Price of a product (e.g., $19.99)
• Temperature reading (e.g., 22°C)
Supervised Learning
• Supervised learning builds a model that maps x to y.
• The task is to find a mapping m()
• such that m(x) = y.
Supervised Learning
• Supervised learning can be divided into
• 1. classification - When the class attribute is discrete, it is called
classification;
• 2. regression- when the class attribute is continuous, it is regression.
• Classification methods include:
• 1. decision tree learning,
• 2. naive Bayes classifier,
• 3. k-nearest neighbor classifier, and
• 4. classification with network information
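A minimal sketch of one of these classifiers, k-nearest neighbor with k = 1, shows the learned mapping m(x) = y in action. The training instances, their two features, and the class labels are made up for illustration.

```python
import math

# Toy labeled training set: (x, y) tuples, where x is a feature vector
# (e.g. pages viewed, minutes on site) and y is the class attribute.
train = [((1.0, 1.0), "buys"), ((1.2, 0.8), "buys"),
         ((4.0, 4.2), "skips"), ((3.8, 4.0), "skips")]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def predict_1nn(x):
    # m(x): assign the label of the single closest training instance (k = 1)
    return min(train, key=lambda inst: euclidean(inst[0], x))[1]

print(predict_1nn((1.1, 0.9)))   # "buys"  - nearest neighbor is (1.2, 0.8)
print(predict_1nn((4.1, 4.1)))   # "skips" - nearest neighbor is (4.0, 4.2)
```

Here the "model" is simply the stored training set: prediction defers all work to query time, which is why k-nearest neighbor is often called a lazy learner.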
Supervised Learning
• Regression methods are,
• 1. linear regression and
• 2. logistic regression.
• A supervised learning algorithm is run on the training set in a process
known as induction.
Decision Tree Learning