UNIT - II - Data Mining Essentials

The document discusses data mining essentials and community analysis. It covers topics like the KDD process, data types, vectorization, data quality, preprocessing, sampling techniques, supervised and unsupervised learning algorithms. Decision tree learning and other supervised learning algorithms like naive Bayes, k-nearest neighbor and neural networks are explained.


UNIT – II

Community Analysis

Data mining essentials – introduction


• Mountains of raw data are generated daily by individuals on social media.
• Data mining provides the necessary tools for discovering patterns in data.
• This unit covers the general process for analyzing social media data and
ways to use data mining algorithms in this process to extract actionable
patterns from raw data.
• The process of extracting useful patterns from raw data is known as
knowledge discovery in databases (KDD).
Data mining essentials
KDD process
• In the KDD process, data is represented in a tabular format.
• Consider the example of predicting whether an individual who visits
an online bookseller is going to buy a specific book.
• John is an example of an instance. Instances are also called points,
data points, or observations.
• A dataset consists of one or more instances.
KDD Process
• A dataset is represented using a set of features, and an instance is
represented using values assigned to these features. Features are also
known as measurements or attributes.
• Instances – values
• Features – attributes/fields
• An instance such as John, in which the class attribute value is unknown,
is called an unlabeled instance.
KDD process
• A labeled instance is an instance in which the class attribute value is
known.
• The class attribute is optional in a dataset; it is only necessary for
prediction or classification purposes.
• There are different types of features:
• i) continuous features, which take real values
• ii) discrete features, which take values from a countable set
• Types of features can be described by the “levels of measurement”
of Stanley Smith Stevens.
Types of features
• Nominal (categorical) – take values that are often represented as strings. For instance, a
customer’s name is a nominal feature.
• Ordinal – the feature values have an intrinsic order to them. Ex: high/low money spent on
an item.
• Interval – in interval features, in addition to their intrinsic ordering, differences are
meaningful whereas ratios are meaningless.
• Addition and subtraction are allowed.
• Multiplication and division are not allowed.
• Ex:
• 6:16 PM and 3:08 PM. The difference between these two time readings is meaningful (3
hours and 8 minutes); however, there is no meaning to the ratio 6:16 PM / 3:08 PM = 2.

• Ratio – ratio features, as the name suggests, add the additional properties of
multiplication and division. An individual’s income is an example of a ratio feature.
Data

• Individuals generate many types of nontabular data, such as text,
voice, or video.
• These types of data are first converted to tabular data and then
processed using data mining algorithms.
• Example:
voice can be converted to feature values using approximation
techniques such as the fast Fourier transform (FFT), after which data
mining algorithms can be applied.
• To convert text into the tabular format, a vectorization process is used.
Vectorization - Vector Space Model
• A well-known method for vectorization is the vector-space model.
• We are given a set of documents D, where each document is a set of words.
• To convert these textual documents to feature vectors, we can
represent document i with the vector di = (w1,i, w2,i, ..., wN,i),
• where wj,i represents the weight for word j in document i
and N is the number of words used for vectorization.
Vector space model
• To compute wj,i, a simple binary scheme is:
• Set wj,i = 1 when word j exists in document i
• Set wj,i = 0 when word j does not exist in document i
• Another approach is the
term frequency-inverse document frequency (TF-IDF) weighting
scheme.
• In this scheme, wj,i is calculated as wj,i = tfj,i × idfj,

• where tfj,i is the frequency of word j in document i,

• and idfj is the inverse document frequency of word j across all documents:
idfj = log(|D| / number of documents containing word j).
Term frequency-inverse document frequency
(TF-IDF)
• Example: consider the following documents:
d1 = “social media mining”
d2 = “social media data”
d3 = “financial market data”
By applying the TF-IDF vectorization model, we get the following vector values.
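The TF-IDF vectors for these three documents can be computed with a short sketch. The use of a base-2 logarithm for idf is an assumption (a different base only rescales all weights uniformly); the documents and words are exactly those from the example above.

```python
import math

# The three example documents, split into words.
docs = {
    "d1": "social media mining".split(),
    "d2": "social media data".split(),
    "d3": "financial market data".split(),
}

vocab = sorted({w for words in docs.values() for w in words})
n_docs = len(docs)

# df[w]: number of documents in which word w appears
df = {w: sum(w in words for words in docs.values()) for w in vocab}

def tfidf(words):
    # wj,i = tfj,i * idfj, with idfj = log2(|D| / dfj)
    return {w: words.count(w) * math.log2(n_docs / df[w]) for w in vocab}

vectors = {name: tfidf(words) for name, words in docs.items()}
# e.g. "mining" appears only in d1, so vectors["d1"]["mining"] = log2(3) ≈ 1.585,
# while "data" is absent from d1, so vectors["d1"]["data"] = 0.0
```

Note that words appearing in every document would get idf = log2(1) = 0, so TF-IDF automatically downweights terms that do not help distinguish documents.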
Data Quality
• Before applying data mining algorithms, the data quality needs to
be verified.
• The following aspects need to be verified:
• 1. Noise
• 2. Outliers
• 3. Missing values
• 4. Duplicate data
Data Preprocessing
• Data preprocessing should be done before applying data mining
algorithms. Common steps are:
• 1. Aggregation – performed when multiple features need to be combined into a
single one.
• 2. Discretization – the process of converting continuous features to
discrete ones; deciding the continuous range that is assigned
to each discrete value is called discretization.
• 3. Feature selection – selecting appropriate features (columns/fields).
• 4. Feature extraction – deriving new features from other features.
• 5. Sampling – processing a smaller representative set of the data.
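Discretization can be sketched as mapping each continuous value to the index of the bin it falls in. The bin edges and the age feature below are illustrative choices, not taken from the slides.

```python
def discretize(value, edges):
    # Return the index of the first bin edge the value does not exceed;
    # values above every edge fall into the last bin.
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

edges = [12, 19, 64]                         # child / teen / adult / senior
ages = [3, 17, 25, 42, 68]                   # a continuous feature
bins = [discretize(a, edges) for a in ages]  # -> [0, 1, 2, 2, 3]
```

Each continuous age is replaced by a discrete bin index, which downstream algorithms that expect discrete features can then use directly.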
Data Preprocessing
• Three major sampling techniques:
• 1. Random sampling – instances are selected uniformly from the
dataset.
• 2. Sampling with or without replacement:
With replacement – an instance can be selected multiple times.
Without replacement – instances are removed from the selection pool
once selected.
• 3. Stratified sampling – the dataset is first partitioned into multiple
bins; then a fixed number of instances are selected from each bin
using random sampling.
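The three techniques can be sketched with the standard library; the toy dataset of 20 instances and the strata boundaries are assumptions for illustration.

```python
import random

random.seed(7)                 # fixed seed so the sketch is reproducible
data = list(range(1, 21))      # a toy dataset of 20 instances

# 1. Random sampling without replacement: each instance is drawn
#    uniformly and leaves the selection pool once selected.
without_repl = random.sample(data, 5)

# 2. Sampling with replacement: the same instance may be drawn twice.
with_repl = [random.choice(data) for _ in range(5)]

# 3. Stratified sampling: partition the dataset into bins, then draw
#    a fixed number of instances from each bin by random sampling.
strata = [data[:10], data[10:]]
stratified = [x for bin_ in strata for x in random.sample(bin_, 2)]
```

Stratified sampling guarantees each bin contributes the same number of instances, which plain random sampling cannot, since an unlucky draw may leave a bin unrepresented.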
Data Mining Algorithms
• Data mining algorithms can be divided into several categories:
• 1. Supervised learning and
• 2. Unsupervised learning
• In supervised learning,
-----the class attribute exists, and
-----the task is to predict the class attribute value.
• In unsupervised learning
---- the dataset has no class attribute, and
---- our task is to find similar instances in the dataset and group them.
Supervised Learning
• The class attribute values for the dataset are known before running the algorithm.
• This data is called labeled data or training data.
• Instances in this set are tuples in the format (x,y),
----where x is a vector and
----y is the class attribute, commonly a scalar.
Example:
Scalars are simply single numerical values. They represent a single piece of
information without any internal structure.
• Age of a customer (e.g., 35)
• Price of a product (e.g., $19.99)
• Temperature reading (e.g., 22°C)
Supervised Learning
• Supervised learning builds a model that maps x to y.
• The task is to find a mapping m()
• such that m(x) = y.
Supervised Learning
• Supervised learning can be divided into
• 1. classification - When the class attribute is discrete, it is called
classification;
• 2. regression- when the class attribute is continuous, it is regression.
• Classification methods include:
• 1. decision tree learning,
• 2. naive Bayes classifier,
• 3. k-nearest neighbor classifier, and
• 4. classification with network information
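A minimal sketch of one of these classifiers, k-nearest neighbor with k = 1, shows the learned mapping m(x) = y in action. The training instances, their two features, and the class labels are made up for illustration.

```python
import math

# Toy labeled training set: (x, y) tuples, where x is a feature vector
# (e.g. pages viewed, minutes on site) and y is the class attribute.
train = [((1.0, 1.0), "buys"), ((1.2, 0.8), "buys"),
         ((4.0, 4.2), "skips"), ((3.8, 4.0), "skips")]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def predict_1nn(x):
    # m(x): assign the label of the single closest training instance (k = 1)
    return min(train, key=lambda inst: euclidean(inst[0], x))[1]

print(predict_1nn((1.1, 0.9)))   # "buys"  - nearest neighbor is (1.2, 0.8)
print(predict_1nn((4.1, 4.1)))   # "skips" - nearest neighbor is (4.0, 4.2)
```

Here the "model" is simply the stored training set: prediction defers all work to query time, which is why k-nearest neighbor is often called a lazy learner.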
Supervised Learning
• Regression methods are,
• 1. linear regression and
• 2. logistic regression.
• A supervised learning algorithm is run on the training set in a process
known as induction.
Decision Tree Learning