
Unit 5 Batnote

The document discusses text analysis techniques like TF-IDF and POS tagging with examples. It also explains clustering and the K-means algorithm with steps. Additionally, it defines evaluation metrics like confusion matrix and AUC-ROC as well as model evaluation techniques like holdout method and random sub-sampling.


DSBDA UNIT 5 IMP: BIG DATA ANALYTICS AND MODEL EVALUATION

1] Explain text analysis. Explain TF-IDF of terms in text analysis with an example.

Text analysis, also known as text mining, is the process of deriving insights from textual data using natural language processing techniques.

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used method for processing text data. It weights each word by how often it appears in a document (term frequency) and how rare it is across the collection of documents (inverse document frequency), so the highest-scoring terms are those most characteristic of a document, which helps identify the most relevant themes and patterns.

Example: consider a social media platform where users share posts about sports, food, and travel. We can use TF-IDF to uncover the most relevant terms for each category.

In sports-related posts, terms like "score", "team", and "gameplay" might have high TF-IDF scores.

In food-related posts, terms like "recipe", "ingredients", and "cooking technique" might have high TF-IDF scores.

In travel-related posts, terms like "destination", "sightseeing", and "adventure" might have high TF-IDF scores.

TF-IDF thus surfaces the important words in each category, showing which topics people talk about most on social media.

2] What is clustering? With a suitable example, explain the steps involved in the K-means algorithm.

Clustering is a technique used in data analysis to group similar data points together. It is a form of unsupervised learning, meaning that the algorithm does not require labeled data to make predictions.

In the K-means algorithm, the goal is to partition a set of data points into K clusters, where each data point belongs to the cluster with the nearest mean.

The steps involved in the K-means algorithm are as follows:

1. Choose the number of clusters K.
2. Initialize K cluster centroids randomly.
3. Repeat the following steps until convergence:
   a. Assign each data point to the nearest cluster centroid.
   b. Update each cluster centroid to the mean of all data points assigned to that cluster.
4. Convergence is reached when the assignment of data points to clusters stops changing significantly.

Example: suppose we have a dataset of 100 data points with two features (x and y) and we want to cluster them into 3 clusters. We choose K=3, initialize 3 centroids randomly, and repeat the assign-and-update steps above until the cluster memberships no longer change.

3] Write short notes on:

1. Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. It allows visualization of the performance of an algorithm: each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.

2. AUC-ROC: The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC stands for Receiver Operating Characteristic; it is a graphical plot of the true positive rate against the false positive rate for the different possible cut-points of a diagnostic test. AUC (Area Under the Curve) summarizes the ROC curve as a single number.

Definitions with respect to the confusion matrix:

1. Accuracy: the ratio of correctly predicted instances to the total instances in the dataset.
2. Precision: the ratio of correctly predicted positive observations to the total predicted positive observations.
3. Recall (Sensitivity): the ratio of correctly predicted positive observations to all observations in the actual positive class.

4] Explain the holdout method and random sub-sampling method.

The holdout method and random sub-sampling method are both techniques used in machine learning for splitting a dataset into training and testing sets.

1. Holdout Method: the dataset is randomly split into two sets, a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate its performance.

2. Random Sub-Sampling Method: a subset of the data is randomly selected for training and testing multiple times. Each time, a different random subset is chosen, and the model is trained and evaluated on each split.

5] K-means worked example.

Given the points A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), and C2(4, 9):

Step 1: Initialize the cluster centers. Suppose we initially take A1, B1, and C1 as the centers of the three clusters.

Step 2: Calculate distances and assign points to clusters. Compute the distance between each point and each cluster center, and assign each point to the cluster with the closest center.

Step 3: Recalculate the cluster centers. Update each cluster center by averaging the coordinates of the points assigned to it.

Step 4: Repeat steps 2 and 3 until the cluster centers no longer change or a specified number of iterations is reached.

Following these steps, we can determine the cluster centers after the first round of execution.
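The first iteration of the K-means worked example in Q5 can be checked with a short Python sketch (the labels c1, c2, c3 for the three initial centers A1, B1, C1 are mine, not part of the question):

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centers = {"c1": (2, 10), "c2": (5, 8), "c3": (1, 2)}  # A1, B1, C1

# Step 2: assign each point to the nearest cluster center
clusters = {name: [] for name in centers}
for label, p in points.items():
    nearest = min(centers, key=lambda c: dist(p, centers[c]))
    clusters[nearest].append(label)

# Step 3: recompute each center as the mean of its assigned points
new_centers = {}
for name, members in clusters.items():
    xs = [points[m][0] for m in members]
    ys = [points[m][1] for m in members]
    new_centers[name] = (sum(xs) / len(xs), sum(ys) / len(ys))

print(clusters)     # cluster memberships after the first assignment
print(new_centers)  # centers after the first update
```

One round assigns {A1} to the first cluster, {A3, B1, B2, B3, C2} to the second, and {A2, C1} to the third, so the updated centers are (2, 10), (6, 6), and (1.5, 3.5).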

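The TF-IDF weighting from Q1 can be sketched minimally with the plain tf × log(N/df) formulation; the three toy "posts" below are invented for illustration:

```python
from math import log

# Three toy posts standing in for the sports / food / travel categories
docs = {
    "sports": "the team win the game with a great score".split(),
    "food":   "the recipe lists each ingredient for cooking".split(),
    "travel": "the destination is a great place for sightseeing".split(),
}

def tf_idf(term, doc_name):
    words = docs[doc_name]
    tf = words.count(term) / len(words)            # term frequency in this doc
    df = sum(term in d for d in docs.values())     # number of docs containing term
    idf = log(len(docs) / df)                      # inverse document frequency
    return tf * idf

# "score" appears only in the sports post, so it gets a positive weight there;
# "the" appears in every post, so its IDF (and hence TF-IDF) is zero.
print(tf_idf("score", "sports"))
print(tf_idf("the", "sports"))
```

Category-characteristic words like "score" score highest, while words common to every post drop to zero, which is exactly the behaviour the example in Q1 relies on.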
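The confusion-matrix metrics defined in Q3 can be computed by hand on a small invented set of binary labels:

```python
# Toy actual vs predicted labels (invented) to illustrate the Q3 definitions
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)   # sensitivity
print(tp, tn, fp, fn, accuracy, precision, recall)
```

Here tp=3, tn=4, fp=1, fn=2, giving accuracy 0.7, precision 0.75, and recall 0.6; note that the three metrics can differ even on the same predictions.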

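The holdout and random sub-sampling splits from Q4 can be sketched as below; `evaluate` is a placeholder standing in for training and scoring a real model, and the 70/30 split ratio is an illustrative choice:

```python
import random

data = list(range(100))  # 100 toy data points (indices stand in for records)
random.seed(0)           # fixed seed so the sketch is reproducible

# Holdout method: one random 70/30 split into train and test sets
shuffled = random.sample(data, len(data))
train, test = shuffled[:70], shuffled[70:]

def evaluate(train_set, test_set):
    # placeholder "score": a real model would be fit on train_set
    # and scored on test_set; here we just return a dummy number
    return len(test_set) / len(data)

# Random sub-sampling: repeat the random split several times
# and average the score across the repetitions
scores = []
for _ in range(5):
    s = random.sample(data, len(data))
    tr, te = s[:70], s[70:]
    scores.append(evaluate(tr, te))
avg_score = sum(scores) / len(scores)
print(len(train), len(test), avg_score)
```

The averaging over repeated splits is what makes random sub-sampling less sensitive to one unlucky split than a single holdout.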
6] Explain the following text analysis steps with examples.

1. Part of Speech (POS) Tagging: Part of speech tagging involves labeling words in a sentence with their corresponding part of speech, such as nouns, verbs, adjectives, etc. This process helps in understanding the grammatical structure of a sentence.

Example:

Sentence: "The quick brown fox jumps over the lazy dog."

POS Tags:

o The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN

2. Lemmatization: Lemmatization is the process of reducing words to their base or root form (the lemma). For example, the lemma of "running" is "run". Lemmatization helps standardize words to their base form for analysis.

Example:

o Word: "running"
o Lemma: "run"

3. Stemming: Stemming is the process of reducing words to their root or stem by removing suffixes. It is a crude heuristic that chops off the ends of words. For example, the stem of "running" is "run".

Example:

o Word: "running"
o Stem: "run"

7] Explain K-fold cross-validation and random subsampling.

K-fold cross-validation is a technique used in machine learning to evaluate the performance of a model. It involves splitting the dataset into K equal-sized folds, training the model on K-1 folds, and then testing it on the remaining fold. This process is repeated K times, each time using a different fold as the test set. K-fold cross-validation provides a more reliable estimate of the model's performance than a single train-test split.

Random subsampling is another technique used to evaluate a model: the dataset is randomly split into a training and a testing set, and this process is repeated multiple times to get a more stable estimate of the model's performance. (Note that "random under-sampling", which discards majority-class examples to balance an imbalanced dataset, is a different technique despite the similar name.)
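The K-fold bookkeeping described in Q7 can be sketched as follows (the 100-point, 5-fold sizes are illustrative, and n is assumed divisible by k for simplicity):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        # fold i is the test set; everything else is the training set
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

splits = list(k_fold_splits(100, 5))
for train, test in splits:
    print(len(train), len(test))  # 80 20 in each of the 5 rounds
```

Each point lands in exactly one test fold across the K rounds, which is why every data point contributes to both training and evaluation.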

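Returning to the stemming step of Q6, a deliberately crude suffix-stripping stemmer might look like this (real stemmers such as Porter's apply far more rules; the suffix list and minimum-stem length here are invented for the sketch):

```python
# Suffixes to strip, longest first so "-ing" wins over "-s"
SUFFIXES = ["ing", "ed", "ly", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        # only strip if a reasonably long stem (>= 3 letters) remains
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            base = word[:-len(suf)]
            # undo consonant doubling for -ing/-ed: "running" -> "runn" -> "run"
            if suf in ("ing", "ed") and len(base) >= 2 and base[-1] == base[-2]:
                base = base[:-1]
            return base
    return word

print(stem("running"))  # -> "run"
```

Even this toy version shows the "chop off the ends" character of stemming that Q6 describes, as opposed to the dictionary-based normalization of lemmatization.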
