
Unit 5 Batnote

The document discusses text analysis techniques like TF-IDF and POS tagging with examples. It also explains clustering and the K-means algorithm with steps. Additionally, it defines evaluation metrics like confusion matrix and AUC-ROC as well as model evaluation techniques like holdout method and random sub-sampling.


DSBDA UNIT 5 IMP: BIG DATA ANALYTICS AND MODEL EVALUATION

1] Explain text analysis. Explain TF-IDF of terms in text analysis with an example.

Text analysis, also known as text mining, is the process of deriving insights from textual data using natural language processing techniques.

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used method for processing text data. It weights each word by how often it appears in a document (term frequency) and how rare it is across the collection of documents (inverse document frequency), so the highest-scoring terms are those most characteristic of a document, which helps identify the most relevant themes and patterns.

Example: consider a social media platform where users share posts about sports, food, and travel. We can use TF-IDF to uncover the most relevant terms for each category.

In sports-related posts, terms like "score", "team", and "gameplay" might have high TF-IDF scores.

In food-related posts, terms like "recipe", "ingredients", and "cooking technique" might have high TF-IDF scores.

In travel-related posts, terms like "destination", "sightseeing", and "adventure" might have high TF-IDF scores.

TF-IDF thus surfaces the important words in each category, showing which topics people talk about most on social media.

2] What is clustering? With a suitable example, explain the steps involved in the K-means algorithm.

Clustering is a technique used in data analysis to group similar data points together. It is a form of unsupervised learning, meaning that the algorithm does not require labeled data to make predictions.

In the K-means algorithm, the goal is to partition a set of data points into K clusters, where each data point belongs to the cluster with the nearest mean.

The steps involved in the K-means algorithm are as follows:

1. Choose the number of clusters K.
2. Initialize K cluster centroids randomly.
3. Repeat the following steps until convergence:
   a. Assign each data point to the nearest cluster centroid.
   b. Update each cluster centroid to the mean of all data points assigned to that cluster.
4. Convergence is reached when the assignment of data points to clusters stops changing significantly.

Example: suppose we have a dataset of 100 data points with two features (x and y) and we want to cluster them into 3 clusters. We choose K=3, initialize 3 centroids randomly, and repeat the assign-and-update steps above until the cluster memberships no longer change.

3] Write short notes on:

1. Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known. It allows visualization of the performance of an algorithm: each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.

2. AUC-ROC: The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC stands for Receiver Operating Characteristic; it is a graphical plot of the true positive rate against the false positive rate for the different possible cut-points of a diagnostic test. AUC (Area Under the Curve) summarizes the ROC curve as a single number.

Definitions with respect to the confusion matrix:

1. Accuracy: the ratio of correctly predicted instances to the total instances in the dataset.
2. Precision: the ratio of correctly predicted positive observations to the total predicted positive observations.
3. Recall (Sensitivity): the ratio of correctly predicted positive observations to all observations in the actual positive class.

4] Explain the holdout method and random sub-sampling method.

The holdout method and random sub-sampling method are both techniques used in machine learning for splitting a dataset into training and testing sets.

1. Holdout Method: the dataset is randomly split into two sets, a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate its performance.

2. Random Sub-Sampling Method: a subset of the data is randomly selected for training and testing multiple times. Each time, a different random subset is chosen, and the model is trained and evaluated on each split.

5] K-means worked example.

Given the points A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), and C2(4, 9):

Step 1: Initialize the cluster centers. Suppose we initially take A1, B1, and C1 as the centers of the three clusters.

Step 2: Calculate distances and assign points to clusters. Compute the distance between each point and each cluster center, and assign each point to the cluster with the closest center.

Step 3: Recalculate the cluster centers. Update each cluster center by averaging the coordinates of the points assigned to it.

Step 4: Repeat steps 2 and 3 until the cluster centers no longer change or a specified number of iterations is reached.

Following these steps, we can determine the cluster centers after the first round of execution.
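The first iteration of the K-means worked example in Q5 can be checked with a short Python sketch (the labels c1, c2, c3 for the three initial centers A1, B1, C1 are mine, not part of the question):

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centers = {"c1": (2, 10), "c2": (5, 8), "c3": (1, 2)}  # A1, B1, C1

# Step 2: assign each point to the nearest cluster center
clusters = {name: [] for name in centers}
for label, p in points.items():
    nearest = min(centers, key=lambda c: dist(p, centers[c]))
    clusters[nearest].append(label)

# Step 3: recompute each center as the mean of its assigned points
new_centers = {}
for name, members in clusters.items():
    xs = [points[m][0] for m in members]
    ys = [points[m][1] for m in members]
    new_centers[name] = (sum(xs) / len(xs), sum(ys) / len(ys))

print(clusters)     # cluster memberships after the first assignment
print(new_centers)  # centers after the first update
```

One round assigns {A1} to the first cluster, {A3, B1, B2, B3, C2} to the second, and {A2, C1} to the third, so the updated centers are (2, 10), (6, 6), and (1.5, 3.5).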

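The TF-IDF weighting from Q1 can be sketched minimally with the plain tf × log(N/df) formulation; the three toy "posts" below are invented for illustration:

```python
from math import log

# Three toy posts standing in for the sports / food / travel categories
docs = {
    "sports": "the team win the game with a great score".split(),
    "food":   "the recipe lists each ingredient for cooking".split(),
    "travel": "the destination is a great place for sightseeing".split(),
}

def tf_idf(term, doc_name):
    words = docs[doc_name]
    tf = words.count(term) / len(words)            # term frequency in this doc
    df = sum(term in d for d in docs.values())     # number of docs containing term
    idf = log(len(docs) / df)                      # inverse document frequency
    return tf * idf

# "score" appears only in the sports post, so it gets a positive weight there;
# "the" appears in every post, so its IDF (and hence TF-IDF) is zero.
print(tf_idf("score", "sports"))
print(tf_idf("the", "sports"))
```

Category-characteristic words like "score" score highest, while words common to every post drop to zero, which is exactly the behaviour the example in Q1 relies on.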
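The confusion-matrix metrics defined in Q3 can be computed by hand on a small invented set of binary labels:

```python
# Toy actual vs predicted labels (invented) to illustrate the Q3 definitions
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)   # sensitivity
print(tp, tn, fp, fn, accuracy, precision, recall)
```

Here tp=3, tn=4, fp=1, fn=2, giving accuracy 0.7, precision 0.75, and recall 0.6; note that the three metrics can differ even on the same predictions.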

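The holdout and random sub-sampling splits from Q4 can be sketched as below; `evaluate` is a placeholder standing in for training and scoring a real model, and the 70/30 split ratio is an illustrative choice:

```python
import random

data = list(range(100))  # 100 toy data points (indices stand in for records)
random.seed(0)           # fixed seed so the sketch is reproducible

# Holdout method: one random 70/30 split into train and test sets
shuffled = random.sample(data, len(data))
train, test = shuffled[:70], shuffled[70:]

def evaluate(train_set, test_set):
    # placeholder "score": a real model would be fit on train_set
    # and scored on test_set; here we just return a dummy number
    return len(test_set) / len(data)

# Random sub-sampling: repeat the random split several times
# and average the score across the repetitions
scores = []
for _ in range(5):
    s = random.sample(data, len(data))
    tr, te = s[:70], s[70:]
    scores.append(evaluate(tr, te))
avg_score = sum(scores) / len(scores)
print(len(train), len(test), avg_score)
```

The averaging over repeated splits is what makes random sub-sampling less sensitive to one unlucky split than a single holdout.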
6] Explain the following text analysis steps with examples.

1. Part of Speech (POS) Tagging: Part of speech tagging involves labeling words in a sentence with their corresponding part of speech, such as nouns, verbs, adjectives, etc. This process helps in understanding the grammatical structure of a sentence.

Example:

Sentence: "The quick brown fox jumps over the lazy dog."

POS Tags:

o The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN

2. Lemmatization: Lemmatization is the process of reducing words to their base or root form (the lemma). For example, the lemma of "running" is "run". Lemmatization helps standardize words to their base form for analysis.

Example:

o Word: "running"
o Lemma: "run"

3. Stemming: Stemming is the process of reducing words to their root or stem by removing suffixes. It is a crude heuristic that chops off the ends of words. For example, the stem of "running" is "run".

Example:

o Word: "running"
o Stem: "run"

7] Explain K-fold cross-validation and random subsampling.

K-fold cross-validation is a technique used in machine learning to evaluate the performance of a model. It involves splitting the dataset into K equal-sized folds, training the model on K-1 folds, and then testing it on the remaining fold. This process is repeated K times, each time using a different fold as the test set. K-fold cross-validation provides a more reliable estimate of the model's performance than a single train-test split.

Random subsampling is another technique used to evaluate a model: the dataset is randomly split into a training and a testing set, and this process is repeated multiple times to get a more stable estimate of the model's performance. (Note that "random under-sampling", which discards majority-class examples to balance an imbalanced dataset, is a different technique despite the similar name.)
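The K-fold bookkeeping described in Q7 can be sketched as follows (the 100-point, 5-fold sizes are illustrative, and n is assumed divisible by k for simplicity):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        # fold i is the test set; everything else is the training set
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

splits = list(k_fold_splits(100, 5))
for train, test in splits:
    print(len(train), len(test))  # 80 20 in each of the 5 rounds
```

Each point lands in exactly one test fold across the K rounds, which is why every data point contributes to both training and evaluation.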

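Returning to the stemming step of Q6, a deliberately crude suffix-stripping stemmer might look like this (real stemmers such as Porter's apply far more rules; the suffix list and minimum-stem length here are invented for the sketch):

```python
# Suffixes to strip, longest first so "-ing" wins over "-s"
SUFFIXES = ["ing", "ed", "ly", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        # only strip if a reasonably long stem (>= 3 letters) remains
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            base = word[:-len(suf)]
            # undo consonant doubling for -ing/-ed: "running" -> "runn" -> "run"
            if suf in ("ing", "ed") and len(base) >= 2 and base[-1] == base[-2]:
                base = base[:-1]
            return base
    return word

print(stem("running"))  # -> "run"
```

Even this toy version shows the "chop off the ends" character of stemming that Q6 describes, as opposed to the dictionary-based normalization of lemmatization.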
