BDA3


BIG DATA ANALYTICS

MC-5101 (Unit III: Data Mining Techniques for Analysis)


Study Material- 03
Text Mining and Data Mining Techniques for Analysis:
Classification and Clustering
1. Text Mining and its Applications
2. Text Preprocessing
3. BoW and TF-IDF For Creating Features from Text
4. Dimensionality Reduction
5. Web Mining and Types of Web Mining
6. Mining Multimedia Data on the Web
7. Classification Algorithms in Data Mining
8. k-NN Algorithm
9. Decision Tree Classifier
10. Bayesian Classification
11. Support Vector Machines
12. Rule Based Classification
13. Model Selection
14. Overview and Applications of Cluster Analysis in Data Mining
15. Clustering Methods in Data Mining
16. Partitioning Method: k-Means Algorithm and k-Medoids
17. Hierarchical Method: Agglomerative Approach and Divisive Approach
18. Density Based Method: DBSCAN
19. Limitations with Cluster Analysis
20. Outlier Analysis
21. Hadoop Distributed File System (HDFS)
22. MapReduce and its Advantages
23. NoSQL
1) Text Mining and its Applications
• Almost 80% of data in the world resides in an unstructured format
• Therefore, text mining is an extremely valuable practice within organizations.
• Text mining tools and Natural Language Processing (NLP) techniques, such as information extraction,
allow us to transform unstructured documents into a structured format,
enabling analysis and the generation of high-quality insights.
• This, in turn, improves the decision-making of organizations, leading to better business outcomes.
1.1) Text Mining and its Applications…
Text mining often includes the following techniques:
• Information extraction is a technique for extracting domain specific information from texts.
• Text fragments are mapped to fields or template slots that have a definite semantic meaning.
• Text summarization involves identifying, summarizing and organizing related text so that users
can efficiently deal with information in large documents.
• Text categorization involves organizing documents into a taxonomy, thus allowing for more
efficient searches.
• It involves the assignment of subject descriptors or classification codes or abstract concepts to
complete texts.
• Text clustering involves automatically clustering documents into groups where documents within
each group share common features.
1.2) Text Mining and its Applications…
Following are some of the applications of Text Mining:
• Customer service: There are various ways in which we invite customer feedback from our users.
• When combined with text analytics tools, feedback systems such as chatbots, customer surveys, Net-Promoter Scores, online reviews, support
tickets, and social media profiles, enable companies to improve their customer experience with speed.
• Text mining and sentiment analysis can provide a mechanism for companies to prioritize key pain points for their customers, allowing businesses
to respond to urgent issues in real time and increase customer satisfaction.
• Risk management: Text mining also has applications in risk management.
• It can provide insights around industry trends and financial markets by monitoring shifts in sentiment and by extracting information from analyst
reports and whitepapers.
• This is particularly valuable to banking institutions as this data provides more confidence when considering business investments across various
sectors.
• Maintenance: Text mining provides a rich and complete picture of the operation and functionality of products and
machinery.
• Over time, text mining automates decision making by revealing patterns that correlate with problems and preventive and reactive maintenance
procedures.
• Text analytics helps maintenance professionals unearth the root cause of challenges and failures faster.
• Healthcare: Text mining techniques have been increasingly valuable to researchers in the biomedical field,
particularly for clustering information.
• Manual investigation of medical research can be costly and time-consuming; text mining provides an automation method for extracting valuable
information from medical literature.
• Spam filtering: Spam frequently serves as an entry point for hackers to infect computer systems with malware.
• Text mining can provide a method to filter and exclude these e-mails from inboxes, improving the overall user experience and minimizing the risk
of cyber-attacks to end users.
2) Text Preprocessing
• Text preprocessing is an approach for cleaning and preparing text data for use in a specific context.
• The ultimate goal of cleaning and preparing text data is to reduce the text to only the words that you
need for your NLP goals.
• Once you have a clear idea of the type of application you are developing and the source and nature of
text data, you can decide on which preprocessing stages can be added to your NLP pipeline.
• Most of the NLP toolkits on the market include options for all of the preprocessing stages, such as normalization:
• Upper or lowercasing
• Stop word removal ("stop words" are frequently occurring words used to construct sentences)
• Stemming – bluntly removing prefixes and suffixes from a word
• Lemmatization – replacing a single-word token with its root
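A minimal preprocessing sketch using the NLTK toolkit (one of the toolkits alluded to above); it assumes NLTK and its 'punkt', 'stopwords', and 'wordnet' resources have been downloaded, and is illustrative rather than a prescribed pipeline.

# Text preprocessing sketch with NLTK (assumes nltk plus the 'punkt',
# 'stopwords' and 'wordnet' resources are installed).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "This movie is very scary and long"

tokens = word_tokenize(text.lower())                    # lowercasing + tokenization
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]   # stop word removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])              # stemming
print([lemmatizer.lemmatize(t) for t in filtered])      # lemmatization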
3) BoW and TF-IDF For Creating Features from Text
• Consider three short movie reviews; we understand such sentences in a fraction of a second:
• Review 1: This movie is very scary and long
• Review 2: This movie is not scary and is slow
• Review 3: This movie is spooky and good
• But machines simply cannot process text data in raw form. They need us to break down the text into a
numerical format that’s easily readable by the machine.
• This is where the two concepts come into play
• Bag-of-Words (BoW) and
• Term Frequency-Inverse Document Frequency (TF-IDF).
• Both BoW and TF-IDF are techniques that help us convert text sentences into numeric vectors.
3) Bag-of-Words (BoW)
• We will first build a vocabulary from all the unique words in the three reviews, which consists of these 11 words:
• ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
• We can now take each of these words and mark their occurrence in the three movie reviews above with 1s and 0s.
• This will give us 3 vectors for 3 reviews as a Vector Representation:
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good

➢ Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
➢ Vector of Review 2: [1 1 2 0 0 1 1 0 1 0 0]
➢ Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
And that's the core idea behind a Bag of Words (BoW) model.

Drawbacks of using a BoW
• If the new sentences contain new words, then our vocabulary size would increase and, thereby, the length of the vectors would increase too.
• Additionally, the vectors would also contain many 0s, thereby resulting in a sparse matrix (which is what we would like to avoid).
• We are retaining no information on the grammar of the sentences nor on the ordering of the words in the text.
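A BoW sketch using scikit-learn's CountVectorizer, assuming scikit-learn is available; note that the vectorizer lowercases the text and orders the vocabulary alphabetically, so the columns will not match the hand-built ordering above.

# Bag-of-Words sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is very scary and long",       # Review 1
    "This movie is not scary and is slow",     # Review 2
    "This movie is spooky and good",           # Review 3
]

vectorizer = CountVectorizer()          # lowercases and tokenizes by default
bow = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())   # the 11-word vocabulary (alphabetical order)
print(bow.toarray())                        # one count vector per review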
3) Term Frequency-Inverse Document Frequency (TF-IDF)
• Term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important
a word is to a document in a collection or corpus.
• Term Frequency (TF) is a measure of how frequently a term, t, appears in a document, d:
tf(t, d) = n(t, d) / (number of terms in document d)
• Here, in the numerator, n(t, d) is the number of times the term "t" appears in the document "d".
• Thus, each document and term would have its own TF value.
• Example: How to calculate the TF for Review 2: "This movie is not scary and is slow".
• Vocabulary: 'This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'
• Number of words in Review 2 = 8
• TF for the word 'this' = (number of times 'this' appears in Review 2) / (number of terms in Review 2) = 1/8
Similarly,
• TF('movie') = 1/8, TF('is') = 2/8 = 1/4, TF('very') = 0/8 = 0, TF('scary') = 1/8, TF('and') = 1/8,
  TF('long') = 0/8 = 0, TF('not') = 1/8, TF('slow') = 1/8, TF('spooky') = 0/8 = 0, TF('good') = 0/8 = 0

Term   | Review 1 | Review 2 | Review 3 | TF (Review 1) | TF (Review 2) | TF (Review 3)
This   |    1     |    1     |    1     |      1/7      |      1/8      |      1/6
movie  |    1     |    1     |    1     |      1/7      |      1/8      |      1/6
is     |    1     |    2     |    1     |      1/7      |      1/4      |      1/6
very   |    1     |    0     |    0     |      1/7      |       0       |       0
scary  |    1     |    1     |    0     |      1/7      |      1/8      |       0
and    |    1     |    1     |    1     |      1/7      |      1/8      |      1/6
long   |    1     |    0     |    0     |      1/7      |       0       |       0
not    |    0     |    1     |    0     |       0       |      1/8      |       0
slow   |    0     |    1     |    0     |       0       |      1/8      |       0
spooky |    0     |    0     |    1     |       0       |       0       |      1/6
good   |    0     |    0     |    1     |       0       |       0       |      1/6
3) Inverse Document Frequency (IDF)
• Computing just the TF alone is not sufficient to understand the importance of words, thus, we need the IDF value
• IDF is a measure of how important a term is.

IDF(t) = log(number of documents / number of documents containing the term t)

• Example: We can calculate the IDF values for all the words in Review 2:
• IDF('this') = log(number of documents / number of documents containing the word 'this') = log(3/3) = log(1) = 0
Similarly,
• IDF('movie') = log(3/3) = 0
• IDF('is') = log(3/3) = 0
• IDF('not') = log(3/1) = 0.48
• IDF('scary') = log(3/2) = 0.18
• IDF('and') = log(3/3) = 0
• IDF('slow') = log(3/1) = 0.48

Term   | Review 1 | Review 2 | Review 3 | IDF
This   |    1     |    1     |    1     | 0.00
movie  |    1     |    1     |    1     | 0.00
is     |    1     |    2     |    1     | 0.00
very   |    1     |    0     |    0     | 0.48
scary  |    1     |    1     |    0     | 0.18
and    |    1     |    1     |    1     | 0.00
long   |    1     |    0     |    0     | 0.48
not    |    0     |    1     |    0     | 0.48
slow   |    0     |    1     |    0     | 0.48
spooky |    0     |    0     |    1     | 0.48
good   |    0     |    0     |    1     | 0.48

✓ Hence, we see that words like "is", "this", "and", etc., are reduced to 0 and have little importance;
✓ while words like "scary", "long", "good", etc. are words with more importance and thus have a higher value.

3) Compute the TF-IDF score
• We can now calculate the TF-IDF score for every word in Review 2:
• TF-IDF('this', Review 2) = TF('this', Review 2) * IDF('this') = 1/8 * 0 = 0
Similarly,
• TF-IDF('movie', Review 2) = 1/8 * 0 = 0
• TF-IDF('is', Review 2) = 1/4 * 0 = 0
• TF-IDF('not', Review 2) = 1/8 * 0.48 = 0.06
• TF-IDF('scary', Review 2) = 1/8 * 0.18 = 0.023
• TF-IDF('and', Review 2) = 1/8 * 0 = 0
• TF-IDF('slow', Review 2) = 1/8 * 0.48 = 0.06

Similarly, we can calculate the TF-IDF scores for all the words with respect to all the reviews:

Term   | Review 1 | Review 2 | Review 3 | IDF  | TF-IDF (Review 1) | TF-IDF (Review 2) | TF-IDF (Review 3)
This   |    1     |    1     |    1     | 0.00 |       0.000       |       0.000       |       0.000
movie  |    1     |    1     |    1     | 0.00 |       0.000       |       0.000       |       0.000
is     |    1     |    2     |    1     | 0.00 |       0.000       |       0.000       |       0.000
very   |    1     |    0     |    0     | 0.48 |       0.068       |       0.000       |       0.000
scary  |    1     |    1     |    0     | 0.18 |       0.025       |       0.022       |       0.000
and    |    1     |    1     |    1     | 0.00 |       0.000       |       0.000       |       0.000
long   |    1     |    0     |    0     | 0.48 |       0.068       |       0.000       |       0.000
not    |    0     |    1     |    0     | 0.48 |       0.000       |       0.060       |       0.000
slow   |    0     |    1     |    0     | 0.48 |       0.000       |       0.060       |       0.000
spooky |    0     |    0     |    1     | 0.48 |       0.000       |       0.000       |       0.080
good   |    0     |    0     |    1     | 0.48 |       0.000       |       0.000       |       0.080

• Words with a higher score are more important, and those with a lower score are less important.
• TF-IDF gives larger values for less frequent words.
• It also gives a large value for a word that is frequent in a single document but rare across all the documents combined; that is, both its TF and IDF values are high.
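A small pure-Python sketch that reproduces the hand calculation above (TF as count divided by document length, IDF as log10(N/df), no smoothing); library implementations such as scikit-learn's TfidfVectorizer apply a smoothed IDF and normalization, so their numbers would differ.

# Pure-Python TF-IDF sketch following the convention used above:
# tf(t, d) = count(t in d) / len(d),  idf(t) = log10(N / df(t)).
import math

reviews = [
    "this movie is very scary and long".split(),
    "this movie is not scary and is slow".split(),
    "this movie is spooky and good".split(),
]
N = len(reviews)
vocab = sorted({w for doc in reviews for w in doc})
df = {t: sum(t in doc for doc in reviews) for t in vocab}     # document frequency
idf = {t: math.log10(N / df[t]) for t in vocab}

for i, doc in enumerate(reviews, start=1):
    tfidf = {t: (doc.count(t) / len(doc)) * idf[t] for t in vocab}
    print(f"Review {i}:", {t: round(v, 3) for t, v in tfidf.items() if v > 0})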
4) Dimensionality Reduction
• The number of input features, variables, or columns present in a given dataset is known as dimensionality, and
• the process to reduce these features is called dimensionality reduction.
• Handling the high-dimensional data is very difficult in practice, commonly known as
• the curse of dimensionality.
• If the machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor
performance.
• Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.
• Some benefits of applying dimensionality reduction technique to the given dataset are given below:
• By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
• Less computation/training time is required for reduced dimensions of features.
• Reduced dimensions of features of the dataset help in visualizing the data quickly.
• It removes the redundant features (if present) by taking care of multi-collinearity.
4.1) Techniques for Dimensionality Reduction
• Dimensionality reduction is accomplished based on either feature selection or feature extraction.
• Feature selection is based on omitting those features from the available measurements which do not
contribute to class separability. In other words, redundant and irrelevant features are ignored.
a) Variance Thresholds
b) Correlation Thresholds
c) Genetic Algorithms
d) Stepwise Regression- This has two types: forward and backward.
• Feature extraction is for creating a new, smaller set of features that still captures
most of the useful information. This can be done with supervised (e.g. LDA) and unsupervised (e.g.
PCA) methods.
a) Principal Component Analysis (PCA)
b) Linear Discriminant Analysis (LDA)
4.2) Feature selection:
a) Variance Thresholds: This technique looks at the variance of a given feature from one observation to another;
• if the feature's variance does not exceed the given threshold,
• the feature is removed, since it carries little information.
b) Correlation Thresholds: We first calculate all pair-wise correlations. Then, if
• the correlation between a pair of features is above a given threshold,
• we remove the one that has larger mean absolute correlation with other features.
• Like the previous technique, this is also based on intuition and hence the burden of tuning the thresholds in
such a way that the useful information will not be neglected, will fall upon the user.
• Because of those reasons, algorithms with built-in feature selection or algorithms like PCA(Principal
Component Analysis) are preferred over this one.
c) Genetic Algorithms: They are search algorithms that are inspired by evolutionary biology and natural
selection, combining mutation and cross-over to efficiently traverse large solution spaces.
• Genetic Algorithms are used to find an optimal binary vector, where each bit is associated with a feature.
✓ If the bit of this vector equals 1, then the feature is allowed to participate in classification.
✓ If the bit is a 0, then the corresponding feature does not participate.
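A brief sketch of the variance-threshold and correlation-threshold filters described in a) and b) above, assuming pandas and scikit-learn; the feature table X and the cut-off values are made up for illustration.

# Variance-threshold and correlation-threshold sketch (pandas + scikit-learn).
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# X is a hypothetical numeric feature table.
X = pd.DataFrame({
    "f1": [1.0, 1.0, 1.0, 1.001],      # nearly constant -> dropped by the variance threshold
    "f2": [0.1, 0.9, 0.5, 0.3],
    "f3": [0.2, 1.8, 1.0, 0.6],        # exactly 2 * f2 -> dropped by the correlation threshold
})

# a) Variance threshold: remove features whose variance is below the cut-off.
vt = VarianceThreshold(threshold=0.01)
vt.fit(X)
low_variance = X.columns[~vt.get_support()].tolist()

# b) Correlation threshold: for each highly correlated pair, drop the feature
#    with the larger mean absolute correlation to the other features.
corr = X.corr().abs()
to_drop = set()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > 0.95:
            to_drop.add(a if corr[a].mean() > corr[b].mean() else b)

print("Low-variance features:", low_variance)
print("Highly correlated features to drop:", sorted(to_drop))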
4.3) Feature selection…
d) Stepwise Regression: This is a greedy algorithm and commonly has lower performance than methods with built-in
feature selection, such as regularization.
• This has two types: forward and backward.
• For forward stepwise search, we start without any features. Then,
• We train a 1-feature model using each of our candidate features and keep the version with the best performance.
• We would continue adding features, one at a time, until our performance improvements stall.
• Backward stepwise search is the same process, just reversed:
• start with all features in our model and
• then remove one at a time until performance starts to drop substantially.
4.4) Feature selection: Example
A fitness level prediction based on three independent variables is used to show how forward feature selection works.

ID | Calories_burnt | Gender | Plays_Sport? | Fitness_Level
 1 |      121       |   M    |     Yes      |     Fit
 2 |      230       |   M    |     No       |     Fit
 3 |      342       |   F    |     No       |     Unfit
 4 |       70       |   M    |     Yes      |     Fit
 5 |      278       |   F    |     Yes      |     Unfit
 6 |      146       |   M    |     Yes      |     Fit
 7 |      168       |   F    |     No       |     Unfit
 8 |      231       |   F    |     Yes      |     Fit
 9 |      150       |   M    |     No       |     Fit
10 |      190       |   F    |     No       |     Fit

• So, the first step in Forward Feature Selection is to train n models, one per feature, and judge how well each feature works on its own.
• So, if we have three independent variables, we'll train three models, one for each of these three features.
• Let's say we trained the model using the Calories_burnt feature and the Fitness_Level target variable and got an accuracy of 87%.
• We'll next use the Gender feature to train the model, and we acquire an accuracy of 80%.
• And similarly, the Plays_Sport? variable gives us an accuracy of 85%.

✓ At this point, we are going to select the variable that produced the most favourable results.
✓ When these results were compared, the winner was, unsurprisingly, the number of calories burnt.
✓ As a direct result of this, we will select this variable.
4.5) Feature selection: Example conti…
• The next thing we'll do is repeat the previous steps, but this time we'll just add a single variable at
a time.
• Because of this, it makes perfect sense for us to retain the Calories Burned variable as we
proceed to add variables one at a time.
• Consequently, if we add Gender as an illustration, we get an accuracy of 88%.
• We acquire a 91% accuracy when we combine Plays_Sport with Calories_burnt.
• As a result, we will keep it and use it in our model.
• We will keep repeating the process till all the features are considered in improving the model
performance
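A sketch of the same greedy forward-selection idea using scikit-learn's SequentialFeatureSelector; the data reproduces the small table above, while the estimator, the cross-validation setting, and the resulting selection are illustrative assumptions.

# Forward stepwise (greedy) feature selection sketch with scikit-learn.
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "Calories_burnt": [121, 230, 342, 70, 278, 146, 168, 231, 150, 190],
    "Gender":         [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],       # M = 1, F = 0
    "Plays_Sport":    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],       # Yes = 1, No = 0
    "Fitness_Level":  [1, 1, 0, 1, 0, 1, 0, 1, 1, 1],       # Fit = 1, Unfit = 0
})
X, y = data.drop(columns="Fitness_Level"), data["Fitness_Level"]

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,        # keep adding one feature at a time until 2 are selected
    direction="forward",
    scoring="accuracy",
    cv=2,                          # small cv because the toy data set is tiny
)
sfs.fit(X, y)
print("Selected features:", list(X.columns[sfs.get_support()]))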
4.6) Feature Extraction: Principal Component Analysis (PCA)
• PCA is a dimensionality reduction technique that
• identifies important relationships in our data,
• transforms the existing data based on these relationships, and then
• quantifies the importance of these relationships so we can keep the most important relationships.

Objectives of PCA:
1. Reduces attribute space: it is basically a target-independent (unsupervised) procedure that maps
• a large number of variables to a smaller number of factors.
• But there is no guarantee that the resulting dimensions are interpretable.
2. Identifying patterns: PCA can help identify patterns or relationships between variables.
3. Feature extraction: PCA can be used to extract features from a set of variables
• that are more informative or relevant than the original variables.
4. Data compression: PCA can be used to compress large datasets by
reducing the number of variables
• while retaining as much information as possible.
5. Noise reduction: PCA can be used to reduce the noise in a dataset by
• Identifying and removing the principal components that
correspond to the noisy parts of the data.
6. Visualization: PCA can be used to visualize high-dimensional data in
a lower-dimensional space,
• making it easier to interpret and understand.
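A minimal PCA sketch with scikit-learn, projecting the four-dimensional Iris data onto two principal components; the features are standardized first, a common practice since PCA is sensitive to feature scales.

# PCA sketch: project the 4-dimensional Iris data onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)        # standardize before PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_2d.shape)              # (150, 2)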
4.7) Feature Extraction: Linear Discriminant Analysis (LDA)
• Linear Discriminant Analysis (LDA) is a supervised learning algorithm
• used for classification tasks in machine learning.
• It is a technique used to find a linear combination of features that
• best separates the classes in a dataset.

Example:
• Suppose we have two sets of data points belonging to two different classes
• that we want to classify.
• As shown in the given 2D graph, when the data points are plotted on the 2D plane,
• there’s no straight line that can separate the two classes of the data points
completely.
• Hence, in this case, LDA (Linear Discriminant Analysis) is used
• which reduces the 2D graph into a 1D graph
• in order to maximize the separability between the two classes.

Two criteria are used by LDA to create a new axis:


1. Maximize the distance between means of the two classes.
2. Minimize the variation within each class.
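A minimal LDA sketch with scikit-learn; unlike PCA it uses the class labels, and with three classes it can produce at most two discriminant components.

# LDA sketch: supervised projection of the Iris data onto axes that
# maximize between-class separation.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)                  # uses the class labels y, unlike PCA

print("Reduced shape:", X_lda.shape)             # (150, 2)
print("Training accuracy as a classifier:", lda.score(X, y))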
5) Web Mining
• Web Mining can be referred to as discovering interesting and useful information from Web content and usage.
• Web Mining Features:
• Web Server: It maintains the entries of web log pages in the log file. These web log entries help to identify
loyal or potential customers of e-commerce websites or companies.
• Web page: It is considered as a graph like structure, where pages are considered as nodes, hyperlinks as
edges.
o Pages = nodes, hyperlinks = edges
o Ignore content
o Directed graph
• High linkage:
o 8-10 links/page on average
o Power-law degree distribution
• Web Mining Tasks:
1) Generating patterns existing in some websites, like customer buying behavior or navigation of web
sites.
2) The web mining helps to retrieve faster results of the queries or the search text posted on the search
engines like Google, Yahoo etc.
3) The ability to classify web documents according to the search performed on the ecommerce websites
helps to increase businesses and transactions.
5) Types of Web Mining…
• There are three types of web mining:
1. Web Content Mining – mines the content of web pages: text documents, images, audio, video, and structured records.
2. Web Structure Mining – mines the hyperlink structure: inter-document structure and intra-document structure.
3. Web Usage Mining – mines usage data: web server logs, application server logs, and application-level logs.
6) Mining Multimedia Data on the Web
• The websites are flooded with the multimedia data like, video, audio, images, and graphs.
• This multimedia data has different characteristics;
• this is the reason the typical multimedia data mining techniques cannot be applied directly.

• The following are few web-based mining terminologies and algorithms:


• PageRank: This measure rates the importance of a web page based on the number of other pages that link to it.
• HITS: This measure is used to rate web pages
• by determining hubs and authorities from the link structure.
• Page Layout Analysis: It extracts and maintains the page-to-block, block-to-page relationships from link structure of
web pages.
• Vision page segmentation (VIPS) algorithm.
• Block-level Link Analysis: The block-to-block model is quite useful for web image retrieval and web page
categorization.
• It uses two kinds of relationships, i.e., block-to-page and page-to-block.
• Block-Based Link Structure Analysis: The block-to-page relationship gives a more accurate and robust representation
of the link structures of the web.
• It is used to organize the web image pages.
6.1) Automatic Classification of Web Documents
• The categorization of web pages into the respective subjects or domains
• is called classification of web documents.
• For example, news pages may be grouped under domains such as sports, business, and entertainment.

Benefits of Automatic Document Classification System


1. It is a more efficient system of classification, as it produces improved
• accuracy of results and
• speeds up the process of classification.
2. The system incurs lower operational costs.
3. It makes data storage and retrieval easy.
4. It organizes the files and documents in a better streamlined way.
6.2) Automatic Classification…
• The automated document classification of web pages is based on the textual content.
• The model requires initial training phase of document classifiers for each category
• based on training examples.
7. Classification– An Overview
• Classification is a data mining process
• that assigns items in a collection to target categories or classes.
• The objective of classification is
• to accurately predict the target class for each record in the data.
• For example,
• A classification model used to identify loan applicants as low, medium, or high credit risks.
• A classification model that predicts credit risk could be developed based on observed data for many
loan applicants over a period of time.
• A predictive classifier with a numerical target uses a regression algorithm, not a classification algorithm.
7.2. General Approach to Classification
• Data classification is a two-step process, consisting of:
1) A learning step (or training phase)
• where a classification model is constructed
2) A classification step
• where the model is used to predict class labels for given data.
• In the Step 1):
• Because the class label of each training tuple is provided, this step is also known as supervised
learning
• i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs.
• It contrasts with unsupervised learning (or clustering), in which
• the class label of each training tuple is not known, and
• the number or set of classes to be learned may not be known in advance.
• In the Step 2):
• A test set is used, made up of test tuples and their associated class labels.
• They are independent of the training tuples, meaning that they were not used to construct the classifier.
• The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.
7.3. Applications of Classification Models

• Product cart analysis on e-commerce platforms uses the classification technique to associate items into groups and create combinations of products to recommend. This is a very common classification application in data mining.
• Weather patterns can be predicted and classified based on parameters such as temperature, humidity, wind direction, and many more. These classification applications of data mining are used in daily life.
• The public health sector classifies diseases based on parameters like spread rate, severity, and a lot more. This helps in charting out strategies to mitigate diseases and in finding cures.
• Financial institutes use classification to determine defaulters and to help in identifying loan seekers and other categories. These classification applications in data mining help in finding the target audience much more easily.
• Spam detection in e-mails based on the header and content of the document.
• Classification of students according to their qualifications.
• Patients are classified according to their medical history.
• Classification can be used for the approval of credit.
• Facial key points detection.
• Drugs classification.
• Pedestrian detection in automotive car driving.
• Cancer tumor cell identification.
• Sentiment analysis.
8. Classification Algorithms in Data Mining
• Classification is the operation of separating various entities into several classes.
• These classes can be defined by
• business rules, class boundaries, or some mathematical function.
• The classification operation may be based on a relationship between a known class
assignment and characteristics of the entity to be classified.
• This type of classification is called supervised.
• If no known examples of a class are available, the classification is unsupervised.
• The most common unsupervised classification approach is clustering
• Classification algorithm finds relationships between the values of the predictors and the values
of the target.
• Different Classification algorithms use different techniques for finding relationships.
• Data mining has many classifiers/classification algorithms such as:
✓ Logistic regression
✓ Linear regression
✓ K-Nearest Neighbours Algorithm (kNN)
✓ Decision trees
✓ Support Vector Machines
✓ Rule-based Classification
✓ Bayesian Classification
✓ Random Forest
✓ Naive Bayes
8. k-NN (k-Nearest Neighbors) ALGORITHM
• The following is the pseudocode for KNN:
1. Load the data
2. Choose K value
3. For each data point in the data:
o Find the Euclidean distance to all training data samples
o Store the distances on an ordered list and sort it
o Choose the top K entries from the sorted list
o Label the test point based on the majority of classes present in the selected points
4. End

✓ To validate the accuracy of the k-NN classification, a confusion matrix is used.


✓ Other statistical methods such as the likelihood-ratio test are also used for validation.
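A from-scratch sketch of the pseudocode above using NumPy (Euclidean distance plus a majority vote among the k nearest training points); scikit-learn's KNeighborsClassifier offers the same behaviour ready-made.

# k-NN from scratch, following the pseudocode above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Euclidean distance from the test point to every training sample
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # majority vote over the neighbours' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> 0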
8.1. k-NN ALGORITHM…
Some of the areas where the k-Nearest Neighbor algorithm can be used:
• Credit rating: The k-NN algorithm helps determine an individual's credit rating by comparing them with the
ones with similar characteristics.
• Loan approval: Similar to credit rating, the k-nearest neighbor algorithm is beneficial in identifying
individuals who are more likely to default on loans by comparing their traits with similar individuals.
• Data Preprocessing: Datasets can have many missing values. The k-NN algorithm is used for a process called
missing data imputation that estimates the missing values.
• Pattern Recognition: The ability of the k-NN algorithm to identify patterns creates a wide range of
applications. For example, it helps detect patterns in credit card usage and spot unusual patterns. Pattern
detection is also useful in identifying patterns in customer purchase behavior.
• Stock Price Prediction: Since the k-NN algorithm has a flair for predicting the values of unknown entities, it's
useful in predicting the future value of stocks based on historical data.
• Recommendation Systems: Since k-NN can help find users of similar characteristics, it can be used in
recommendation systems. For example, it can be used in an online video streaming platform to suggest
content a user is more likely to watch by analyzing what similar users watch.
• Computer Vision: The k-NN algorithm is used for image classification. Since it’s capable of grouping similar
data points, for example, grouping cats together and dogs in a different class, it’s useful in several computer
vision applications.
8.2. Advantages and Disadvantages of KNN
Some of the advantages of using the k-Nearest Neighbors algorithm are:
• It's easy to understand and simple to implement
• It can be used for both classification and regression problems
• It's ideal for non-linear data since there's no assumption about underlying data
• It can naturally handle multi-class cases
• It can perform well with enough representative data.

The disadvantages of using the k-Nearest Neighbors algorithm:


• Associated computation cost is high as it stores all the training data
• Requires high memory storage
• Need to determine the value of K
• Prediction is slow if the value of N is high
• Sensitive to irrelevant features
9. Decision Tree Classifier
• Decision tree classifier is the most effective and common prediction and classification method.
• Each time it receives an answer, a follow-up question is asked until a conclusion about the class label of the
record is reached.
• A decision tree is a flow chart like tree structure, where each internal node denotes a test on an attribute, each
branch represents an outcome of the test, and leaf nodes represent classes or class distributions.
Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: The training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.
Output: A decision tree.
Method: Recursively partition the samples on the selected splitting attribute; stop when all samples at a node belong to the same class, when no attributes remain, or when no samples remain.

✓ In this version of the algorithm, all attributes are categorical, that is,
✓ discrete-valued.
✓ Continuous-valued attributes must be discretized.
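A minimal decision-tree sketch with scikit-learn; note that scikit-learn's CART implementation handles continuous attributes directly, whereas the algorithm version described above assumes categorical attributes.

# Decision tree classifier sketch with scikit-learn (CART).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)   # limit depth to reduce overfitting
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))          # the learned splits as readable IF-THEN style text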
9.1: Decision Tree Classifier…
The advantages of decision tree approaches are:
• Decision trees are simple to understand and interpret.
• They require little data and are able to handle both numerical and categorical data.
• Decision trees can produce comprehensible rules.
• Decision trees can classify records without much computation.
• Decision trees clearly show which fields for prediction or classification are most important.
• They are robust in nature; therefore,
• they perform well even if their assumptions are somewhat violated by the true model from
which the data were generated.
• Decision trees perform well with large data in a short time.
• Nonlinear relationships between parameters do not affect tree performance.
9.2: Decision Tree Classifier…
The drawbacks of decision tree approaches are:
• Decision trees are less suited to estimation tasks where the goal is to predict a continuous attribute value.
• Decision trees are prone to errors in classification problems with many classes and relatively small numbers of training
instances.
• The decision-making method is computationally costly.
• Each splitting field must be sorted at each node before the best split can be identified.
• Combinations of fields are used in some algorithms and search must be made for optimal combined weights.
• Pruning algorithms can also be costly as many sub-trees of candidates have to be created and compared.
• Data fragmentation: Each split in a tree leads to a reduced dataset under consideration.
• And, hence the model created at the split will potentially introduce bias.
• High variance and instability: as a result of the greedy strategy applied by decision trees, the variance in finding the
right starting point of the tree
• can greatly impact the final result, i.e. small changes early on can have big impacts later.
• So- if for example we draw two different samples from our universe,
• the starting points for both the samples could be very different (and may even be different variables) this
can lead to totally different results.
10. Bayesian Classification
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities,
• such as the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• A simple Bayesian classifier known as the Naïve Bayesian classifier
• has been found to be comparable in performance with
• decision tree and selected neural network classifiers.
• Bayesian classifiers exhibit high accuracy and speed when applied to large databases.
• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
• independent of the values of the other attributes.
• This assumption is called class conditional independence.
• It is made to simplify the computations involved and, in this sense, is considered “Naïve.”
10. Bayesian Classification…
Some of the advantages of the Naïve Bayes Classifier are:
• Naive Bayes is a fast, highly scalable algorithm.
• Naive Bayes can be used for binary and multiclass classification.
• It provides different types of Naive Bayes algorithms like
• GaussianNB, MultinomialNB, BernoulliNB.
• It is a simple algorithm that depends on doing a bunch of counts.
• Great choice for text classification problems.
• It can be easily trained on a small dataset.

The disadvantage of Naïve Bayes Classifier is:


• Naïve Bayes can learn individual features importance but
• can’t determine the relationship among features.

✓ Common applications of Naïve Bayes algorithm are in Spam filtering.


✓ Gmail from Google uses Naïve Bayes algorithm for filtering spam emails.
✓ Sentiment analysis is another area where Naïve Bayes can calculate the probability of emotions expressed in the text
being positive or negative.
✓ Leading web portals may understand the reaction of customers to their new products based on sentiment analysis.
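A tiny text-classification sketch in the spirit of the spam-filtering application above, combining bag-of-words counts with MultinomialNB; the example messages and labels are made up.

# Naive Bayes text-classification sketch (BoW counts + MultinomialNB).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "win a free prize now",        # spam
    "limited offer click here",    # spam
    "meeting agenda for monday",   # ham
    "lunch with the project team", # ham
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["free prize offer"]))        # expected: ['spam']
print(model.predict_proba(["free prize offer"]))  # class membership probabilities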
11. Support Vector Machines (SVM)
• This algorithm looks for
• a linearly separable hyperplane,
• or a decision boundary separating members of one class from the other.
• If such a hyperplane exists, the work is done!
• If such a hyperplane does not exist,
• SVM uses a nonlinear mapping to transform the training data into a higher dimension.
• Then it searches for the linear optimal separating hyperplane.
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can
always be separated by a hyperplane.
• The SVM algorithm finds this hyperplane using support vectors and margins.

• For a general n-dimensional feature space, the defining equation of the separating hyperplane becomes:
θ0 + θ1x1 + θ2x2 + … + θnxn = 0
• If the vector of the weights is denoted by Θ and |Θ| is the norm of this vector, then
• it is easy to see that the size of the maximal margin is 2/|Θ|.
11. SVM…
• Finding the maximal margin hyperplanes and support vectors is
• a problem of convex quadratic optimization.
• It is important to note that the complexity of SVM is characterized by
• the number of support vectors, rather than the dimension of the feature space.
• That is the reason SVM has a comparatively less tendency to overfit.
• If all data points other than the support vectors are removed from the training data set, and the training
algorithm is repeated,
• the same separating hyperplane would be found.
• The number of support vectors provides an upper bound to the expected error rate of the SVM classifier,
• which happens to be independent of data dimensionality.
• An SVM with a small number of support vectors has good generalization,
• even when the data has high dimensionality.

• As a training algorithm, SVM may not be very fast compared to some other classification methods,
• but owing to its ability to model complex nonlinear boundaries, SVM has high accuracy.
• SVM is comparatively less prone to overfitting.
• SVM has successfully been applied to handwritten digit recognition, text classification, speaker identification etc..
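A minimal SVM sketch with scikit-learn's SVC and an RBF kernel (a nonlinear mapping), printing the number of support vectors the trained classifier keeps; the data set and parameters are illustrative.

# SVM sketch: nonlinear (RBF-kernel) classifier on a non-linearly separable data set.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)

print("Test accuracy:", svm.score(X_test, y_test))
print("Support vectors per class:", svm.n_support_)   # complexity depends on these, not on dimensionality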
12. Rule Based Classification

• A rule-based classifier makes use of a set of IF-THEN rules for classification.
• We can express a rule in the following form:
IF condition THEN conclusion
• Let us consider a rule R1:
R1: IF age = youth AND student = yes THEN buy_computer = yes
• Rule Notation: (Condition) → Class Label

Some of the advantages of Rule-Based classifiers:
• They have characteristics quite similar to decision trees.
• These classifiers are as highly expressive as decision trees.
• They are easy to interpret.
• Their performance is comparable to decision trees.
• They can handle redundant attributes.
• They are better suited for handling imbalanced classes.
• However, it is harder for them to handle missing values in the test set.
13. Model Selection
✓ Model evaluation for a classification model: Confusion Matrix
• Model selection is a technique for selecting the best model
• after the individual models are evaluated based on the required criteria.
• Model selection is the problem of choosing one from among a set of candidate models.
• In the case of supervised learning, the three most common approaches are:
• Train, Validation, and Test datasets
• Resampling Methods
• Probabilistic Statistics
The simplest reliable method of model selection involves fitting candidate models on a training set, tuning them on
the validation dataset, and selecting a model that performs the best on the test dataset according to a chosen metric,
such as accuracy or error. A problem with this approach is that it requires a lot of data.
Resampling techniques attempt to achieve the same as the train/val/test approach to model selection, although using
a small dataset. An example is k-fold cross validation where a training set is split into many train/test pairs and a
model is fit and evaluated on each. This is repeated for each model and a model is selected with the best average
score across the k-folds. A problem with this and the prior approach is that only model performance is assessed,
regardless of model complexity.
A third approach to model selection attempts to combine the complexity of the model with the performance of the
model into a score, then select the model that minimizes or maximizes the score. There are three statistical
approaches to estimating how well a given model fits a dataset and how complex the model is.
1. Akaike Information Criterion (AIC). Derived from frequentist probability
2. Bayesian Information Criterion (BIC). Derived from Bayesian probability
3. Minimum Description Length (MDL). Derived from information theory
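A sketch of resampling-based model selection: two candidate models are compared by their average k-fold cross-validation accuracy and the better one is selected; the candidates and data set are illustrative.

# Model selection by k-fold cross-validation: pick the candidate with the
# best average score across the folds.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("Selected model:", best)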
14. CLUSTERING – AN OVERVIEW
• Clustering helps in organizing huge voluminous data into clusters and
• displays interior structure of statistical information.
• Clustering improves the data readiness towards artificial intelligence techniques.
• The process of clustering supports knowledge discovery in data;
• it is used either as a stand-alone tool to gain insight into the data distribution or
• as a preprocessing step for other algorithms.
14.1. General Approach to Clustering
• Cluster analysis is an exploratory discovery process.
• It can be used to discover structures in data without providing an explanation/interpretation.

• Cluster analysis includes two major aspects: clustering and cluster validation.
• Clustering aims at partitioning objects into groups according to a certain criteria.
• To achieve different application purposes, a large number of clustering algorithms have been developed.
• However, because there is no general-purpose clustering algorithm that fits all kinds of applications,
• an evaluation mechanism is required to assess the quality of the clustering results
• produced by different clustering algorithms or
• by a clustering algorithm with different parameters,
so that the user may find a suitable cluster scheme for a specific application.
• The quality assessment process of clustering results is regarded as cluster validation.
• Cluster analysis is an iterative process of clustering and cluster verification by the user facilitated with
• clustering algorithms,
• cluster validation methods,
• visualization and
• domain knowledge to databases.
14.3. Applications of Cluster Analysis
• Clustering analysis is widely utilized in a variety of fields, including
• data analysis, market research, pattern identification, and image processing.
• Earth observation databases use this data to identify
• similar land regions and
• to group houses in a city based on house type, value, and geographic position.
• It is the backbone of search engine algorithms,
• where objects that are similar to each other must be presented together and dissimilar
objects should be ignored.
• Also, it is required to fetch objects that are closely related to a search term, if not
completely related.
• Used in image segmentation in bioinformatics where
• clustering algorithms have proven their worth in detecting cancerous cells from various
medical imagery
– eliminating the prevalent human errors and other bias.
14.3. Applications of Cluster Analysis…
• Clustering effectively detects hidden patterns, rules, constraints, flow etc.
• based on various metrics of traffic density from GPS data and
• can be used for segmenting routes and
• suggesting users with best routes, location of essential services, search for objects
on a map etc.
• Satellite imagery can be segmented to find suitable and arable lands for agriculture.
• Document clustering is effectively being used in preventing the spread of
fake news on Social Media.
• Website network traffic can be divided into various segments,
• so that we can heuristically prioritize the requests;
• this also helps in detecting and preventing malicious activities.
15: Clustering Methods in Data Mining
• For a successful grouping there are two major goals –
(i) Similarity between one data point and another
(ii) Distinction of those similar data points from others which clearly differ from
those points.
• To address the challenges such as scalability, attributes, dimensional,
boundary shape, noise, and interpretation
• there are various types of clustering methods to solve one or many of these
problems.
• Various types of Clustering methods are: ✓ Partitioning Method
✓ Hierarchical Method
✓ Density-based Method
✓ Grid-Based Method
✓ Model-Based Method
✓ Constraint-based Method
16. Partitioning Method: k-Means Algorithm
• In this method, m data points are clustered to form some number of clusters, say k, where each data point belongs to the cluster with the closest mean (centroid).

Algorithm
1. Define the number of clusters (k) to be produced and pick k initial data points as centroids.
2. Calculate the distance from every data point to all the centroids and assign the point to the cluster with the minimum distance.
3. Follow the above step for all the data points.
4. Calculate the average of the data points present in each cluster and set it as the new centroid of that cluster.
5. Repeat from Step 2 until the desired clusters are formed (the centroids no longer change).

Advantages
• Effortless implementation process.
• Dense clusters are produced when clusters are spherical, compared to the hierarchical method.
• Appropriate for large databases.

Disadvantages
• Inappropriate for clusters with different density and size.
• Equivalent results are not produced on iterative runs (results depend on the initial centroids).
• Euclidean distance measures can weigh features unequally due to underlying factors such as scale.
• Unsuccessful for non-linear data sets and categorical data.
• Noisy data and outliers are difficult to handle.

✓ The initial centroids are selected randomly and thus have a large influence on the resulting clusters.
✓ The complexity of the k-means algorithm is O(tkn), where n is the total number of data points, k the number of clusters formed, and t the number of iterations needed to form the clusters.
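A k-means sketch with scikit-learn on synthetic 2-D blobs; the n_init parameter reruns the algorithm with several random initial centroids to reduce the sensitivity to initialization noted above.

# k-means sketch with scikit-learn: cluster 2-D points into k = 3 clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)   # several restarts with different initial centroids
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("Inertia (sum of squared distances to centroids):", kmeans.inertia_)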
16.1 Partitioning Method: k-Medoids or PAM
(Partitioning Around Medoids)
• It is similar in process to the k-means clustering algorithm, with the difference being in the assignment of the center of the cluster: a medoid (an actual data point) is used instead of a mean.
The algorithm is implemented in two steps:
• Build: initial medoids are the innermost objects.
• Swap: a medoid can be swapped with another object until the objective function can no longer be reduced.

Algorithm
1. Initially choose m random points as initial medoids from the given data set.
2. For every data point, assign the closest medoid using a distance metric.
3. The swapping cost is calculated for every selected and non-selected object, given as TCns, where s is a selected and n a non-selected object.
4. If TCns < 0, s is replaced by n.
5. Repeat Steps 2 to 4 until there is no change in the medoids.

Four characteristics to be considered are:
✓ Shift-out membership: movement of an object from the current cluster to another is allowed.
✓ Shift-in membership: movement of an object from outside into the current cluster is allowed.
✓ Update the current medoids: a current medoid can be replaced by a new medoid.
✓ No change: objects stay at their appropriate distances from the clusters.

Advantages
• Effortless understanding and implementation process.
• Can run quickly and converge in a few steps.
• Dissimilarities between the objects are allowed.
• Less sensitive to outliers when compared to k-means.

Disadvantages
• Different initial sets of medoids can produce different clusterings; it is thus advisable to run the procedure several times with different initial sets.
• The resulting clusters may depend upon the units of measurement; variables of different magnitudes can be standardized.
17. Hierarchical Method: Agglomerative and Divisive Approach
• This method decomposes a set of data items into a hierarchy. Depending on how the hierarchical breakdown is generated, we can put hierarchical approaches into different categories. Following are the two approaches:

Agglomerative Approach
• This algorithm is also referred to as the bottom-up approach.
• This approach treats each and every data point as a single cluster and then merges clusters by considering the similarity (distance) between individual clusters,
• until a single large cluster is obtained or some condition is satisfied.
Advantages
• Easy to identify nested clusters.
• Gives better results and ease of implementation.
• Suitable for automation.
• Reduces the effect of the initial values of the clusters on the clustering results.
• Reduces the computing time and space complexity.
Disadvantages
• It can never undo what was done previously.
• Difficulty in handling different-sized clusters and convex shapes leads to an increase in time complexity.
• There is no direct minimization of the objective function.
• Sometimes there is difficulty in identifying the exact number of clusters from the dendrogram.

Divisive Approach
• This approach is also referred to as the top-down approach.
• In this, we consider the entire data sample set as one cluster and continuously split the cluster into smaller clusters iteratively.
• It is done until each object is in its own cluster or the termination condition holds.
• This method is rigid, because once a merging or splitting is done, it can never be undone.
Advantage
• It produces more accurate hierarchies than the bottom-up algorithm in some circumstances.
Disadvantages
• The top-down approach is computationally more complex than the bottom-up approach because we need a second, flat clustering algorithm as a subroutine.
• Use of different distance metrics for measuring distance between clusters may generate different results.
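An agglomerative (bottom-up) clustering sketch using SciPy: build the linkage matrix (the data behind the dendrogram) and cut it into a chosen number of flat clusters; the data set is synthetic.

# Agglomerative hierarchical clustering sketch with SciPy.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.7, random_state=1)

Z = linkage(X, method="ward")                      # merge history (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 flat clusters

print(labels[:10])
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram with matplotlib.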
18. Density Based Method: DBSCAN
(Density-Based Spatial Clustering of Applications with Noise)
• In DBSCAN, a cluster is defined as a group of data points that is highly dense.
• DBSCAN considers two parameters:
• Eps: the maximum radius of the neighborhood around a point.
• MinPts: the minimum number of data points required within the Eps-neighborhood.
• The Eps-neighborhood is defined by the following condition: NEps(q) : { p belongs to D | dist(p, q) ≤ Eps }.
• In order to understand density-based clustering, let us follow a few definitions:
• Core point: a point whose Eps-neighborhood contains at least MinPts points (as specified by the user); such a point is surrounded by a dense neighborhood.
• Border point: a point that lies within the neighborhood of a core point but does not itself have a dense neighborhood; multiple core points can share the same border point.
• Noise/Outlier: a point that does not belong to any cluster.
• Directly Density Reachable: a point p is directly density reachable from a point q with respect to Eps and MinPts if p belongs to NEps(q) and q satisfies the core point condition, i.e., |NEps(q)| ≥ MinPts.
• Density Reachable: a point p is said to be density reachable from a point q with respect to Eps and MinPts if there is a chain of points p1, p2, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density reachable from pi.

Algorithm
1. In order to form clusters, initially consider a random point, say point p.
2. Find all points that are density reachable from point p with respect to Eps and MinPts. The following condition is checked in order to form the cluster:
a. If point p is found to be a core point, then a cluster is obtained.
b. If point p is found to be a border point, then no points are density reachable from point p, and hence we visit the next point of the database.
3. Continue this process until all the points are processed.

Advantages
• It can identify outliers.
• It does not require the number of clusters to be specified in advance.

Disadvantages
• If the density of the data keeps changing, then finding clusters efficiently is difficult.
• It does not suit high-dimensional data well, and the user has to specify the parameters in advance.
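A DBSCAN sketch with scikit-learn; eps and min_samples correspond to the Eps and MinPts parameters above, and the label -1 marks noise/outlier points. The parameter values are illustrative.

# DBSCAN sketch with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps ~ Eps, min_samples ~ MinPts
labels = db.labels_                          # -1 marks noise/outliers

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))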
19. Limitations with Cluster Analysis
• There are two major drawbacks that influence the feasibility of cluster analysis in real world applications in data
mining.
• The first is the difficulty existing automated clustering algorithms have in dealing with arbitrarily shaped data distributions in the datasets.
• The second issue is that, the evaluation of the quality of clustering results by statistics-based methods is time
consuming when the database is large,
• primarily due to the drawback of very high computational cost of statistics based methods for assessing the
consistency of cluster structure between the sampling subsets.
• The implementation of statistics-based cluster validation methods does not scale well in very large
datasets.
• On the other hand, arbitrarily shaped clusters also make the traditional statistical cluster
validity indices ineffective, which makes it difficult to determine the optimal cluster structure.

✓ Cluster analysis is an iterative process of multiple runs; without any user domain knowledge,
✓ it would be inefficient and
✓ unintuitive to satisfy the specific requirements of application tasks in clustering.
20. Outlier Analysis
• In Data Mining, it is common to utilize outlier detection to
• find anomalies,
• find patterns or trends.
• Examples:
• Identifying financial fraud such as credit card hacking or other similar scams makes use of this technology.
• It is utilized to keep track of a customer's changing purchase habits.
• It is used to find and report human-made mistakes in typing.
• It is utilized for troubleshooting and identifying problems with machines and systems.

✓ Outlier detection can be defined as the process of detecting and then excluding outliers from a given set of data.
✓ There are no standardized outlier identification methods because these are mostly dataset-dependent.

Remember two important questions about your database during outlier identification:
(i) What and how many features do I consider for outlier detection? (similarity/diversity)
(ii) Can I assume a distribution of values for the features I have selected? (parametric/non-parametric)

➢ Outlier Detection Techniques:
• Numeric outlier – calculated by the IQR (InterQuartile Range).
• Z-score – the Z-score technique assumes a Gaussian distribution of the data. Outliers are data points that lie in the tails of the distribution and are therefore far from the mean.
• DBSCAN (clustering method) – a non-parametric, density-based outlier detection method. Here, all data points are classified as core points, border points, or noise points.
• Isolation forest – this non-parametric method is suitable for large datasets with one or more dimensional features.

➢ Models for Outlier Detection Analysis:
• Extreme value analysis – in this approach, the largest or smallest values are considered outliers. The Z-test and the Student's t-test are examples. These are good heuristics for the initial analysis of data, but they are not of much value in multivariate settings.
• Linear models – the distance of each data point to a fitted plane (subspace) is calculated and used to detect outliers. PCA (principal component analysis) is an example of a linear model for anomaly detection.
• Probabilistic and statistical models – expectation-maximization (EM) methods are used to estimate the parameters of the model; they then calculate the membership probability of each data point for the fitted distribution, and points with the lowest probability of membership are marked as outliers.
• Proximity-based models – outliers are modelled as points isolated from the other observations. Cluster analysis, density-based analysis, and nearest-neighbour analysis are key approaches of this type.
• Information-theoretic models – outliers increase the minimum code length required to describe a data set.
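A small sketch of the numeric (IQR) and Z-score outlier-detection techniques listed above on a made-up sample; the thresholds (1.5 × IQR, |z| > 2.5) are common conventions, not fixed rules.

# Outlier-detection sketch: the IQR rule and the Z-score rule.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 12, 95])   # 95 is an obvious outlier

# a) IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# b) Z-score rule: points far from the mean in units of standard deviation
#    (a threshold of 2.5-3 is common).
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2.5]

print("IQR outliers:", iqr_outliers)
print("Z-score outliers:", z_outliers)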
21) Hadoop Introduction: Hadoop Distributed File System (HDFS) :

• Hadoop is a collection of several software services that are all freely accessible to the
public and can be used in conjunction with one another.
• It offers a software framework
• for storing a huge amount of data in a variety of locations using Hadoop Distributed File System
(HDFS) and
• for working with that data by utilizing the MapReduce programming style.
• The combination of HDFS and MapReduce creates an architecture that,
• conceals all of the complexity associated with the analysis of big data.
• It is scalable and fault-tolerant.
21.1) Various daemons in Apache Hadoop
• Apache Hadoop includes the five daemons,
• Three, relates to HDFS for the purpose of efficiently managing distributed storage
• NameNode,
• DataNode, and
• Secondary NameNode
• Two, utilized by the MapReduce engine are responsible for both job tracking and job execution
• JobTracker and
• TaskTracker
• Each of the mentioned daemons runs in its own JVM.
21.2) HDFS: NameNode
• NameNode : A single NameNode daemon operates on the master node.
• NameNode is responsible for storing and managing the metadata that is connected with the file system.
• This metadata is stored in a file that is known as fsimage.
• When a client makes a request to read from or write to a file, the metadata is held in a cache that is located within the
main memory so that the client may access it more rapidly.
• The I/O tasks are completed by the slave DataNode daemons, which are directed in their actions by the NameNode.
• The NameNode is responsible for managing and directing
• how files are divided up into blocks,
• selecting which slave node should store these blocks, and
• monitoring the overall health and fitness of the distributed file system.

• Memory and input/output (I/O) are both put to intensive use in the operations that are carried out by the NameNode in the network.
21.3) HDFS: DataNode
• DataNode : A DataNode daemon is present on each slave node, which is a component of the Hadoop cluster.
• DataNodes are the primary storage parts of HDFS.
• They are responsible for storing data blocks and catering to requests to read or write files that are stored on HDFS.
• These are under the authority of NameNode.
• Blocks that are kept in DataNodes are replicated in accordance with the configuration in order to guarantee both
high availability and reliability.
• These duplicated blocks are dispersed around the cluster so that computation may take place more quickly.
21.4) HDFS: Secondary NameNode
• Secondary NameNode : A backup for the NameNode is not provided by the Secondary NameNode.
• The job of the Secondary NameNode is to read the file system at regular intervals, log any changes that have
occurred, and then apply those changes to the fsimage file.
• This assists in updating NameNode so that it can start up more quickly the next time, as shown in Figure.
22) MapReduce: Concept
• MapReduce can refer to either a programming methodology or a software framework.
• Both are utilized in Apache Hadoop.

• Hadoop MapReduce is a programming framework that is made available for creating applications that can process
and analyze massive data sets in parallel on large multi-node clusters of commodity hardware in a manner that is
scalable, reliable, and fault tolerant.
• The processing and analysis of data consist of two distinct stages known as
• the Map phase and the Reduce phase.

• The result of the Map phase is sorted by the Hadoop framework, and this information is then sent as input to the Reduce phase in order to begin parallel reduce jobs (see Figure).
22.1) MapReduce key-value pairs
• In theory,
• a MapReduce job will accept a data set as an input in the form of a key-value pair, and
• it will only produce output in the form of a key-value pair after processing the data set through
MapReduce stages.
• The output of the Map phase, which is referred to as the intermediate results,
• is sent on to the Reduce phase as an input.
22.2) JobTracker and TaskTracker
• On the same lines as HDFS, MapReduce also makes use of a master/slave architecture.
• As illustrated in Figure,
• the JobTracker daemon resides on the master node, while
• the TaskTracker daemon resides on each of the slave nodes.
• The MapReduce processing layer consists of two different daemons i.e. JobTracker and TaskTracker,
JobTracker: The JobTracker service is responsible for monitoring MapReduce tasks that are carried out
on slave nodes and is hosted on the master node.
• The job is sent to the JobTracker by the user through their interaction with the Master node.

TaskTracker: On each slave node that makes up a cluster, a TaskTracker daemon is executed.
• It works to complete MapReduce tasks after accepting jobs from the JobTracker.
22.3) Advantages of Hadoop MapReduce
1. Parallel processing: Data is processed in parallel, which makes processing fast.
2. Data Locality: Map function is generally performed locally on the DFS node where data is stored.
Processing the data locally is very effective for the cost.
3. The MapReduce system makes sure that the processing is performed in a fault-tolerant manner.

• Thus, in MapReduce programming, an entire task can be divided into map task and reduce task.
• Map takes a key-value pair as input and produces a list of <key, value> pairs as output.
• Reduce takes a key and the shuffled list of values for that key as input, and the final output is a key-value pair, as shown in Figure.

There are mainly three operations of MapReduce:


• Map Stage
• Shuffle Stage
• Reduce Stage
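A word-count sketch in the Hadoop Streaming style: a mapper that emits <word, 1> pairs and a reducer that sums the values per key. In a real Hadoop job these would be separate scripts passed to the streaming jar; here they are chained in-process purely for illustration.

# MapReduce word-count sketch (Hadoop Streaming style, run in-process here).
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map stage: emit a <word, 1> pair for every word in the input line.
    for word in line.strip().lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce stage: sum the values for each key.
    return (word, sum(counts))

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Shuffle/sort stage: group the intermediate pairs by key.
intermediate = sorted(pair for line in lines for pair in mapper(line))
for word, group in groupby(intermediate, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))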
23) What is NoSQL
• In order to facilitate the development of cutting-edge applications,
• NoSQL was designed to work with a variety of different data models and
schemas such as
• key-value pairs, multimedia files, documents, columnar data, graphs, external files, and
more.
• Due to the fact that it does not adhere to the guidelines established
by Relational Database Management Systems (RDBMS),
• we cannot query our data using conventional SQL commands.
• We can think of such well-known examples as MongoDB, Neo4J, HyperGraphDB, etc.
