DATA ANALYTICS
For T.Y.B.Sc. Computer Science : Semester – VI
[Course Code CS 364 : Credits – 2]
CBCS Pattern
As Per New Syllabus
Price ₹ 230.00
N5949
DATA ANALYTICS ISBN 978-93-5451-317-6
First Edition : February 2022
© : Authors
The text of this publication, or any part thereof, should not be reproduced or transmitted in any form or stored in any
computer storage system or device for distribution including photocopy, recording, taping or information retrieval system or
reproduced on any disc, tape, perforated media or other information storage device etc., without the written permission of
Authors with whom the rights are reserved. Breach of this condition is liable for legal action.
Every effort has been made to avoid errors or omissions in this publication. In spite of this, errors may have crept in. Any
mistake, error or discrepancy so noted shall be brought to our notice and shall be taken care of in the next edition. It is notified
that neither the publisher nor the authors or seller shall be responsible for any damage or loss of action to any one, of any kind, in
any manner, therefrom. The reader must cross check all the facts and contents with original Government notification or
publications.
DISTRIBUTION CENTRES
PUNE
Nirali Prakashan Nirali Prakashan
(For orders outside Pune) (For orders within Pune)
S. No. 28/27, Dhayari Narhe Road, Near Asian College 119, Budhwar Peth, Jogeshwari Mandir Lane
Pune 411041, Maharashtra Pune 411002, Maharashtra
Tel : (020) 24690204; Mobile : 9657703143 Tel : (020) 2445 2044; Mobile : 9657703145
Email : [email protected] Email : [email protected]
MUMBAI
Nirali Prakashan
Rasdhara Co-op. Hsg. Society Ltd., 'D' Wing Ground Floor, 385 S.V.P. Road
Girgaum, Mumbai 400004, Maharashtra
Mobile : 7045821020, Tel : (022) 2385 6339 / 2386 9976
Email : [email protected]
DISTRIBUTION BRANCHES
DELHI BENGALURU NAGPUR
Nirali Prakashan Nirali Prakashan Nirali Prakashan
Room No. 2 Ground Floor Maitri Ground Floor, Jaya Apartments, Above Maratha Mandir, Shop No. 3,
4575/15 Omkar Tower, Agarwal Road No. 99, 6th Cross, 6th Main, First Floor, Rani Jhanshi Square,
Darya Ganj, New Delhi 110002 Malleswaram, Bengaluru 560003 Sitabuldi Nagpur 440012 (MAH)
Mobile : 9555778814/9818561840 Karnataka; Mob : 9686821074 Tel : (0712) 254 7129
Email : [email protected] Email : [email protected] Email : [email protected]
[email protected] | www.pragationline.com
Also find us on www.facebook.com/niralibooks
Preface …
We take an opportunity to present this Text Book on "Data Analytics" to the students of
Third Year B.Sc. (Computer Science) Semester-VI as per the New Syllabus, June 2021.
The book has its own unique features. It brings out the subject in a very simple and lucid
manner for easy and comprehensive understanding of the basic concepts. The book covers
theory of Introduction to Data Analytics, Overview of Machine Learning, Mining Frequent
Patterns, Associations and Correlations, Social Media and Text Analytics.
A special word of thanks to Shri. Dineshbhai Furia and Mr. Jignesh Furia for showing full faith in us to write this text book. We also thank Mr. Amar Salunkhe and Mr. Akbar Shaikh of M/s Nirali Prakashan for their excellent co-operation.
We also thank Mrs. Yojana Despande, Mr. Ravindra Walodare, Mr. Sachin Shinde, Mr.
Ashok Bodke, Mr. Moshin Sayyed and Mr. Nitin Thorat.
Although every care has been taken to check mistakes and misprints, some errors or omissions may remain. Suggestions from teachers and students for the improvement of this text book shall be most welcome.
Authors
Syllabus …
1. Introduction to Data Analytics (6 Lectures)
• Concept of Data Analytics
• Data Analysis vs Data Analytics
• Types of Analytics
o Diagnostic Analytics
o Predictive Analytics
o Prescriptive Analytics
o Exploratory Analysis
o Mechanistic Analysis
• Mathematical Models:
o Concept
• Model Evaluation:
• Metrics for Evaluating Classifiers:
o Class Imbalance:
AUC, ROC (Receiver Operating Characteristic) Curves
Evaluating Value Prediction Models
2. Machine Learning Overview (6 Lectures)
• Introduction to Machine Learning, Deep Learning, Artificial intelligence
• Applications for Machine Learning in Data Science
• The Modeling Process
o Engineering Features and Selecting a Model
o Training the Model
o Validating the Model
o Predicting New Observations
• Types of Machine Learning
o Supervised Learning
o Unsupervised Learning
o Semi-supervised Learning
o Ensemble Techniques
• Regression Models
o Linear Regression
o Polynomial Regression
o Logistic Regression
• Concept of Classification, Clustering and Reinforcement Learning
3. Mining Frequent Patterns, Associations and Correlations (12 Lectures)
• What kind of Patterns can be Mined
• Class/Concept Description:
o Characterization and Discrimination
o Mining Frequent Patterns
o Associations and Correlations
o Classification and Regression for Predictive Analysis
o Cluster Analysis
o Outlier Analysis
• Mining Frequent Patterns: Market Basket Analysis
• Frequent Itemsets, Closed Itemsets and Association Rules
• Frequent Itemset Mining Methods
• Apriori Algorithm
• Generating Association Rules from Frequent Itemsets
• Improving Efficiency of Apriori Algorithm
• Frequent Pattern Growth (FP-Growth) Algorithm
4. Social Media and Text Analytics (12 Lectures)
• Overview of Social Media Analytics
o Social Media Analytics Process
o Seven Layers of Social Media Analytics
o Accessing Social Media Data
• Key Social Media Analytics Methods
• Social Network Analysis
o Link Prediction
o Community Detection
o Influence Maximization
o Expert Finding
o Prediction of Trust and Distrust among Individuals
• Introduction to Natural Language Processing
• Text Analytics:
o Tokenization
o Bag of Words
o Word weighting: TF-IDF
o n-Grams
o Stop Words
o Stemming and Lemmatization
o Synonyms and Parts of Speech Tagging
o Sentiment Analysis
• Document or Text Summarization
• Trend Analytics
• Challenges to Social Media Analytics
1.0 INTRODUCTION
• An important phase of technological innovation associated with the rise and rapid
development of computer technology came into existence only a few decades ago.
• The technological innovation brought about a revolution in the way people work, first
in the field of science and then in many others, from technology to business, as well as
in day-to-day life.
• In today’s data-driven world, a massive amount of data is collected/generated/produced at remarkable speed and high volume every day. Data allows us to make better predictions about the future.
• To process and analyze this massive/huge amount of data, a new discipline known as data science has been formed. The objective/goal of data science is to extract information from data sources.
• Data science is a collection of techniques used to extract value from data. Data science
has become an essential tool for any organization that collects, stores and processes
data as part of its operations.
• Data science is the task of scrutinizing and processing raw data to reach a meaningful
conclusion. Data science techniques rely on finding useful patterns, connections and
relationships within data.
• Data science applies an ever-changing and vast collection of techniques and
technology from mathematics, statistics, Machine Learning (ML) and Artificial
Intelligence (AI) to decompose complex problems into smaller tasks to deliver insight
and knowledge.
• Analytics is the systematic computational analysis of data. Analytics is the discovery,
interpretation, and communication of meaningful patterns in data.
• Especially valuable in areas rich with recorded information, analytics relies on the
simultaneous application of statistics, computer programming and operations
research to quantify performance.
• Data analysis is a process of inspecting, cleansing, transforming, and modeling data
with the goal of discovering useful information, informing conclusions, and
supporting decision-making.
• Data analytics and all associated strategies and techniques are essential when it comes
to identifying different patterns, finding anomalies and relationships in large
chunks/set of data and making the data or information collected more meaningful and
more understandable.
[Figure: Data Analytics Lifecycle with six phases: 1. Data Discovery, 2. Data Preparation, 3. Model Planning, 4. Model Building, 5. Communicate Results, 6. Operationalize]
• Data preparation phase of the data analytics lifecycle involves data preparation, which
includes the steps to explore, preprocess and condition data prior to modeling and
analysis.
• The data preparation and processing phase involves collecting, processing and
conditioning data before moving to the model building process.
• An analytics sandbox is a platform that allows us to store and process large amounts
of data.
• Data are loaded in the sandbox in three ways namely, ETL (Extract, Transform and
Load), ELT (Extract, Load, and Transform) and ETLT.
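• As an illustration only, the following short Python sketch shows the ETL idea using the pandas library; the file names and column names are hypothetical and not taken from any particular project.

    # Minimal ETL sketch (hypothetical file and column names).
    import pandas as pd

    # Extract: read raw data from a source file.
    raw = pd.read_csv("sales_raw.csv")

    # Transform: clean and condition the data before analysis.
    raw["amount"] = raw["amount"].fillna(0)        # handle missing values
    raw["date"] = pd.to_datetime(raw["date"])      # normalize data types
    clean = raw[raw["amount"] > 0]                 # drop invalid records

    # Load: write the conditioned data into the analytics sandbox.
    clean.to_csv("sandbox_sales_clean.csv", index=False)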
Phase 3 - Model Planning:
• The 3rd phase of the lifecycle is model planning, where the data analytics team members make proper planning of the methods to be adopted and the various workflows to be followed during the next phase of model building.
• Model planning is a phase where the data analytics team members have to analyze the
quality of data and find a suitable model for the project.
Phase 4 - Model Building:
• In this phase the team works on developing datasets for training and testing as well as
for production purposes.
• This phase is based on the planning made in the previous phase, the execution of the
model is carried out by the team.
• Model building is the process where the team has to deploy the planned model in a real-time environment. It allows analysts to solidify their decision-making process by gaining in-depth analytical information.
• The environment needed for the execution of the model is decided and prepared so
that if a more robust environment is required, it is accordingly applied.
Phase 5 - Communicate Results:
• The 5th phase of the life cycle of data analytics checks the results of the project to find whether it is a success or a failure.
• The result is scrutinized by the entire team along with its stakeholders to draw
inferences on the key findings and summarize the entire work done.
• In communicate results phase, the business/organizational values are quantified and
an elaborate narrative on the key findings is prepared.
Phase 6 - Operationalize:
• In the 6th phase, the final reports are prepared and delivered by the team along with the briefings, source code and related technical documents.
• Operationalize phase also involves running the pilot project to implement the model
and test it in a real-time environment.
• As data analytics help build models that lead to better decision making, it, in turn,
adds values to individuals, customers, business sectors and other organizations.
• As soon as the team prepares a detailed report including the key findings, documents and briefings, the data analytics life cycle comes close to its end.
• The next step that remains is to measure the effectiveness of the analysis before submitting the final reports to the stakeholders.
[Figure: Layered framework of data analytics, with a data management layer, an analytics layer and a presentation layer]
2. Data Management Layer: Once, the data has been extracted, data scientists must
perform a number of functions that are grouped under the data management
layer. The data may need to be normalized and stored in certain database
architectures to improve data query and access by the analytics layer. We'll cover taxonomies of database tools including SQL, NoSQL, Hadoop, Spark and other architectures in the upcoming sections.
3. Analytics Layer: In analytics layer, a data scientist uses a number of engines to
implement the analytical functions. Depending on the task at hand, a data scientist
may use one or multiple engines to build an analytics application. A more
complete layer would include engines for optimization, machine learning, natural
language processing, predictive modeling, pattern recognition, classification,
inferencing and semantic analysis.
4. Presentation Layer: The presentation layer includes tools for building
dashboards, applications and user-facing applications that display the results of
analytics engines. Data scientists often mash up several data visualization widgets,
web parts and dashboards (sometimes called Mash boards) on the screen to
display the results using info-graphic reports. These dashboards are active and
display data dynamically as the underlying analytics models continuously update
the results for dashboards.
Fig. 1.3: Types of Data Analytics, from Descriptive Analytics (what happened?) and Diagnostic Analytics through Predictive Analytics to Prescriptive Analytics (how can we make it happen?), moving from hindsight through insight to foresight as difficulty increases
• Descriptive analytics enables learning from the past and assessing how the past might
influence future outcomes.
• Descriptive analytics is valuable as it enables associations to gain from past practices
and helps them in seeing how they may impact future results.
• Descriptive analytics looks at data and analyzes past events for insight as to how to
approach the future.
• It looks at past performance and understands that performance by mining historical
data to look for the reasons behind past success or failure.
Examples:
1. An organization’s records give a past review of its financials, operations, customers and stakeholders, sales and so on.
2. Using descriptive analysis, a data analyst will be able to generate the statistical
results of the performance of the hockey players of team India. For generating
such results, the data may need to be integrated from multiple data sources to gain
meaningful insights through statistical analysis.
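• A minimal descriptive-analytics sketch in Python, using made-up player statistics (the numbers below are assumed purely for illustration):

    # Descriptive analytics sketch: summarizing past performance with pandas.
    import pandas as pd

    players = pd.DataFrame({
        "player":  ["A", "B", "C", "D"],
        "matches": [20, 18, 22, 15],
        "goals":   [12, 7, 15, 4],
    })

    # "What happened?": simple summary statistics of past performance.
    print(players[["matches", "goals"]].describe())
    print((players["goals"] / players["matches"]).round(2))   # goals per match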
root cause initially, such as a reduction in traffic, and from there it will fine-tune the problem after finding the reasons for the downside in website traffic, such as Search Engine Optimization (SEO), social marketing, email marketing and any other factors which are not enabling the website to reach many people.
• Predictive analytics turns data into valuable, actionable information and it uses data
to determine the probable future outcome of an event or a likelihood of a situation
occurring.
Example: Using predictive analysis, a data analyst will be able to predict the
performance of each player of the hockey team for the upcoming Olympics. Such
prediction analysis can help the Indian Hockey Federation to decide on the players'
selection for the upcoming Olympics.
1.4.1 Concept
• In this section, we will see various ways of thinking about models to help shape the
way we build them.
Occam's Razor:
• Occam's razor is a problem-solving principle arguing that simplicity is better than
complexity.
• Named after the 14th century logician and theologian William of Ockham, this theory has been helping many great thinkers for centuries.
• Occam's razor is the problem solving principle, which states that "entities should not
be multiplied beyond necessity", sometimes inaccurately paraphrased as "the simplest
explanation is usually the best one."
• In simple words, Occam's razor is the philosophical principle which states that the simplest explanation is the best explanation.
• Occam's notion of simpler generally refers to reducing the number of assumptions
employed in developing the model.
• With respect to statistical modeling, Occam's razor tells or speaks to the need to
minimize the parameter count of a model.
• Overfitting occurs when a mathematical model tries too hard to achieve accurate
performance on its training data.
• It is the production of an analysis that corresponds too closely or exactly to a
particular set of data and may therefore fail to fit additional data or predict future
observations reliably.
• Overfitting occurs or happens when there are so many parameters that the model can
essentially memorize its training set, instead of generalizing appropriately to
minimize the effects of error and outliers.
• Overfit models tend to perform extremely well on training data, but much less
accurately on independent test data.
• An overfit model is a statistical model. An overfit model contains
more parameters than can be justified by the data.
• Invoking Occam's razor requires that we have a meaningful way to evaluate how
accurately our models are performing. Simplicity is not an absolute virtue, when it
leads to poor performance.
• Deep learning is a powerful technique for building models with millions of
parameters. Despite the danger of overfitting, these models perform extremely well on
a variety of complex tasks.
• Occam would have been suspicious of such models, but would have come to accept those that have substantially more predictive power than the alternatives.
• Appreciate the inherent trade-off between accuracy and simplicity. It is almost always
possible to improve the performance of any model by kludging-on extra parameters
and rules to govern exceptions.
• Complexity has a cost, as explicitly captured in machine learning methods like
LASSO/ridge regression. These techniques employ penalty functions to minimize the
features used in the model.
• Underfitting occurs when a statistical model cannot adequately capture the underlying
structure of the data.
• An under-fitted model is a model where some parameters or terms that would appear
in a correctly specified model are missing.
• Underfitting would occur, for example, when fitting a linear model to non-linear data.
Such a model will tend to have poor predictive performance.
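• The following sketch illustrates underfitting and overfitting by fitting polynomials of different degrees to synthetic non-linear data; the data and degrees are chosen only for demonstration.

    # Underfitting vs. overfitting sketch on noisy non-linear data.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)

    for degree in (1, 3, 15):
        coeffs = np.polyfit(x, y, degree)        # fit a polynomial of this degree
        y_hat = np.polyval(coeffs, x)            # predictions on the training data
        mse = np.mean((y - y_hat) ** 2)          # training error
        print(f"degree = {degree:2d}, training MSE = {mse:.4f}")

    # The degree-1 (linear) model underfits the sine curve, while the degree-15
    # model drives the training error down by chasing noise, i.e., it overfits.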
Bias-Variance Trade-Offs:
• The bias–variance tradeoff is the property of a model that the variance of the
parameter estimated across samples can be reduced by increasing the bias in
the estimated parameters.
• This tension between model complexity and performance shows up in the statistical notion of the bias-variance trade-off:
1. Bias: It is error from incorrect assumptions built into the model, such as restricting
an interpolating function to be linear instead of a higher-order curve.
2. Variance: It is error from sensitivity to fluctuations in the training set. If our
training set contains sampling or measurement error, this noise introduces
variance into the resulting model.
• Errors of bias produce or generate underfit models and they do not fit the training
data as tightly as possible, were they allowed the freedom to do so.
• Underfitting occurs/happens when a statistical model cannot adequately capture the
underlying structure of the data.
• Errors of variance result in overfit models (their quest for accuracy causes overfit
models to mistake noise for signal and they adjust so well to the training data that
noise leads them astray).
• Models that do much better on training data than testing data are overfit models.
Nate Silver’s Principles for Effective Modeling:
• Nate R. Silver is perhaps the most prominent public face of data science today. He
outlines following principles for effective modeling:
Principle #1 (Think Probabilistically): Forecasts which make concrete statements
are less meaningful than those that are inherently probabilistic. The real world is an
uncertain place, and successful models recognize this uncertainty. There are always a
range of possible outcomes that can occur with slight perturbations of reality, and this
should be captured in the model.
Principle #2 (Change the Forecast in Response to New Information): Live models
are much more interesting than dead ones. A model is live if it is continually updating
predictions in response to new information. Fresh information should change the
result of any forecast. Scientists should be open to changing opinions in response to
new data and build the infrastructure that maintains a live model. Any live model should track and display its predictions over time, so the viewer can gauge whether changes accurately reflected the impact of new information.
Principle #3 (Look for Consensus): Data should derive from as many different sources as possible to get a good forecast. Ideally, multiple models should be built,
each trying to predict the same thing in different ways. We should have an opinion as
to which model is the best, but be concerned when it substantially differs from the
herd. Often third parties produce competing forecasts, which you can monitor and
compare against.
Principle #4 (Employ Bayesian Reasoning): The Bayes' theorem has several
interpretations, but perhaps most clearly provides a way to calculate how probabilities
change in response to new evidence. When stated as given below, it provides a way to
calculate how the probability of event A changes in response to new evidence B.
P(A | B) = P(B | A) P(A) / P(B)
Applying Bayes' theorem requires a prior probability P(A), the likelihood of event A
before knowing the status of a particular event B. This might be the result of running a
classifier to predict the status of A from other features or background knowledge
about event frequencies in a population. Without a good estimate for this prior, it is
very difficult to know how seriously to take the classifier.
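• A small worked example of Bayes' theorem, with assumed numbers (a hypothetical spam filter updating its belief after seeing a particular word):

    # Bayes' theorem sketch with assumed probabilities.
    p_spam = 0.2                 # prior P(A): fraction of mail that is spam (assumed)
    p_word_given_spam = 0.6      # likelihood P(B|A) (assumed)
    p_word_given_ham = 0.05      # P(B|not A) (assumed)

    # Total probability of the evidence, P(B).
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

    # Posterior P(A|B) = P(B|A) * P(A) / P(B)
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(round(p_spam_given_word, 3))   # 0.75: the evidence raised P(spam) from 0.2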
• A first-principle model might be a discrete event simulation. Data-driven models are based on observed correlations between input parameters and outcome variables.
• The same basic model might be used to predict tomorrow's weather or the price of a
given stock, differing only on the data it was trained on.
• Machine learning methods make it possible to build an effective and efficient model
on a domain one knows nothing about, provided we are given a good enough training
set.
• Ad hoc models are built using domain specific knowledge to guide their structure and
design.
• These ad hoc models tend to be brittle in response to changing conditions, and difficult
to apply to new tasks.
• In contrast, machine learning models for classification and regression are general, because they employ no problem-specific ideas, only specific data.
• Retrain these models on fresh data, and they adapt to changing conditions.
• Train them on a different data set, and they can do something completely different. By this rubric, general models sound much better than ad hoc ones.
• The truth is that the best models are a mixture of both i.e., theory and data. It is
important to understand the domain as deeply as possible, while using the best data or
information we can in order to fit and evaluate the models.
Linear vs. Non-Linear Models:
• Linear models are governed by equations that weigh each feature variable by a
coefficient reflecting its importance and sum up these values to produce a score.
• Powerful machine learning techniques, (for example, linear regression) can be used to
identify the best possible coefficients to fit training data, yielding very effective
models.
• But generally speaking, the world is not linear. Richer mathematical descriptions include higher-order exponentials, logarithms and polynomials.
• These permit mathematical models that fit training data much more tightly than linear
functions can.
• Generally speaking, it is much harder to find the best possible coefficients to fit non-
linear models.
• But we do not always have to find the best possible fit: deep learning techniques, based on neural networks, offer excellent performance despite the inherent difficulties in optimization.
Blackbox vs. Descriptive Models:
• Black boxes are devices that do their task, but in some unknown manner. Stuff goes in
and stuff comes out, but how the sausage is made is completely impenetrable to
outsiders.
• Looked at as a whole, the network does only one thing. But because they are built
from multiple nested layers (the deep in deep learning), these deep learning models
presume that there are complex features there to be learned from the lower level
inputs.
Stochastic vs. Deterministic Models:
• Demanding a single deterministic prediction from a mathematical model can be a
fool's errand. The world is a complex and critical place of many realities, with events
that generally would not unfold in exactly the same way if time could be run over
again.
• Good forecasting models incorporate such thinking and produce probability
distributions over all possible events.
• Stochastic (meaning “randomly determined") modeling techniques that explicitly build
some notion of probability into the model include logistic regression and Monte Carlo
simulation.
• It is important that the model observe the basic properties of probabilities, including:
o Each probability is a value between 0 and 1: Scores that are not constrained to be in the 0 to 1 range do not directly estimate probabilities. The solution is often to
put the values through a logit() function to turn them into probabilities in a
principled way.
o Rare events do not have probability zero: Any event that is possible must have a greater-than-zero probability of occurrence. Discounting is a way of evaluating the
likelihood of unseen but possible events. Probabilities are a measure of humility
about the accuracy of our model and the uncertainty of a complex world. Models
must be honest in what they do and don't know.
o Probabilities must sum to 1: Independently generating values between 0 and 1 does
not mean that they together add up to a unit probability, over the full event space.
The solution here is to scale these values so that they do, by dividing each by the
partition function. Alternately, rethink the model to understand why they didn't
add up in the first place.
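• The points above can be illustrated with a short sketch (the scores below are arbitrary): a logistic (sigmoid) squashing maps raw scores into the 0 to 1 range, and dividing by the partition function makes a set of values sum to 1.

    # Turning unconstrained scores into probability-like values.
    import numpy as np

    scores = np.array([2.0, -1.0, 0.5])          # arbitrary raw scores

    # Squash each score into (0, 1) with the logistic (sigmoid) function.
    probs_01 = 1.0 / (1.0 + np.exp(-scores))
    print(probs_01)

    # Rescale with the partition function (softmax) so the values sum to 1.
    partition = np.sum(np.exp(scores))
    probs_sum_to_1 = np.exp(scores) / partition
    print(probs_sum_to_1, probs_sum_to_1.sum())  # the sum is 1.0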
• There are two common tasks for data science models namely, classification and value
prediction.
Baseline Models for Classification:
• In classification tasks, we are given a small set of possible labels for any given item,
like (man or woman), (spam or not spam) or (car or truck).
• We seek a system that will generate or produce a label accurately describing a
particular instance of a person, e-mail or vehicle.
• Representative baseline models for classification include:
1. Uniform or Random Selection among Labels: If we have absolutely no prior
distribution on the objects, we might as well make an arbitrary selection using the
broken watch method. Comparing the stock market prediction model against
random coin flips will go a long way to showing how hard the problem is.
2. The most common Label appearing in the Training Data: A large training
dataset usually provides some notion of a prior distribution on the classes.
Selecting the most frequent label is better than selecting them uniformly or
randomly. This is the theory behind the sun-will-rise-tomorrow baseline model.
3. The most Accurate Single-feature Model: Powerful classification baseline
models strive to exploit all the useful features present in a given data set. But it is
valuable to know what the best single feature can do. Occam's razor deems the
simplest and easiest model to be best. Only when the complicated model beats all
single-factor models does it start to be interesting.
4. Somebody else's Model: Often we are not the first person to attempt a particular
task. Our firm/organization may have a legacy model that we are charged with
updating or revising. One of two things can happen/occur when we compare the
model against someone else's work: either we beat them or we don't. If we beat
them, we now have something worth bragging about. If we don't, it is a chance to
learn and improve. Why didn't we win? The fact that we lost gives us certainty that our model can be improved, at least to the level of the other model.
5. Clairvoyance: There are circumstances when even the best possible classification
baseline model cannot theoretically reach 100% accuracy.
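• As a sketch of the first two baselines above, scikit-learn's DummyClassifier can produce both a random-selection baseline and a most-frequent-label baseline on a small made-up label set:

    # Baseline classifiers on hypothetical, imbalanced labels.
    import numpy as np
    from sklearn.dummy import DummyClassifier

    X = np.zeros((10, 1))                        # features are ignored by these baselines
    y = np.array(["ham"] * 8 + ["spam"] * 2)     # assumed label distribution

    for strategy in ("uniform", "most_frequent"):
        baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
        print(strategy, "baseline accuracy =", round(baseline.score(X, y), 2))

    # Any serious classifier should be expected to beat both of these numbers.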
Baseline Models for Value Prediction:
• In value prediction problems, we are given a collection of feature-value pairs (fi, vi) to use to train a function F such that F(fi) = vi.
• Baseline models for value prediction problems follow from similar techniques to what
were proposed for classification, as follows:
1. Mean or Median: Just ignore the features and always output the consensus value of the target. This proves to be quite an informative baseline, because if we cannot substantially beat always guessing the mean, either we have the wrong features or we are working on a hopeless task.
• Evaluating a classifier means measuring how accurately our predicted labels match
the gold standard labels in the evaluation set.
• For the common case of two distinct labels or classes (binary classification), we
typically call the smaller and more interesting of the two classes as positive and the
larger/other class as negative.
• In a spam classification problem, the spam would typically be positive and the ham
(non-spam) would be negative.
• This labeling aims to ensure that identifying the positives is at least as hard as
identifying the negatives, although often the test instances are selected so that the
classes are of equal cardinality.
• There are four possible results of what the classification model could do on any given
instance, which defines the confusion matrix or contingency table shown in Fig. 1.4.
• A confusion matrix contains information about actual and predicted classifications
done by a classifier. Performance of such systems is commonly evaluated using the
data in the matrix.
• A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data for which the true values are
known.
• The confusion matrix itself is relatively simple to understand, but the related
terminology can be confusing. A confusion matrix also known as an error matrix.
• A confusion matrix is a technique for summarizing the performance of a classification
algorithm.
• A confusion matrix is nothing but a table with two dimensions viz. “Actual” and
“Predicted” and furthermore, both the dimensions have “True Positives (TP)”, “True
Negatives (TN)”, “False Positives (FP)”, “False Negatives (FN)”.
Fig. 1.4: Confusion Matrix (actual class versus predicted class, each either positive or negative)
• The explanation of the terms associated with confusion matrix are as follows:
o True Positives (TP): Here our classifier labels a positive item as positive, resulting
in a win for the classifier.
o True Negatives (TN): Here the classifier correctly determines that a member of
the negative class deserves a negative label. Another win.
o False Positives (FP): The classifier mistakenly calls a negative item as a positive,
resulting in a “Type I" classification error.
o False Negatives (FN): The classifier mistakenly declares a positive item as
negative, resulting in a “Type II" classification error.
• Fig. 1.5 shows where these result classes fall in separating two distributions (men and
women), where the decision variable is height as measured in centimeters.
• The classifier under evaluation labels everyone of height ≥ 168 centimeters as male.
The purple regions represent the intersection of both male and female.
• The four possible results in the confusion matrix reflect which instances were
classified correctly (TP and TN) and which ones were not (FN and FP).
Fig. 1.5: Two height distributions (women and men) separated by a threshold classifier at 168 cm; the regions under the curves correspond to the TN, FN, FP and TP outcomes (height in centimeters on the horizontal axis)
Example: Consider a test set of 1100 images, of which 1000 are non-cat images and 100 are cat images. The following figure shows the confusion matrix with TP = 90, FN = 10, TN = 940 and FP = 60.
                          Actual Class
                          Cat (+)                  Non-Cat (-)
Predicted    Cat (+)      TP = 90                  FP = 60 (Type I Error)
Class        Non-Cat (-)  FN = 10 (Type II Error)  TN = 940
Fig. 1.6: Confusion Matrix for the Cat/Non-Cat Example
1.26
Data Analytics Introduction to Data Analytics
True Positive: We predicted positive and it’s true. In Fig. 1.6, we predicted that 90 images are cat images, and they actually are.
True Negative: We predicted negative and it’s true. In Fig. 1.6, we predicted that 940 images are non-cat, and they actually are.
False Positive (Type I Error): We predicted positive and it’s false. In Fig. 1.6, we predicted that 60 images are cat images, but actually they are not.
False Negative (Type II Error): We predicted negative and it’s false. In Fig. 1.6, we predicted that 10 images are non-cat, but actually they are cat images.
Statistic Measures for Classifier:
1. Accuracy: The accuracy of a classifier is the ratio of the number of correct predictions to the total number of predictions. We can calculate accuracy from the confusion matrix with the help of the following formula:
   Accuracy = (TP + TN) / (TP + FP + FN + TN)
2. Precision: The precision measures the proportion of positive values predicted by the classifier that are actually positive. We can calculate precision from the confusion matrix with the help of the following formula:
   Precision = TP / (TP + FP)
3. Recall or Sensitivity: Recall determines the proportion of the positive values that were accurately predicted. Sensitivity or recall means: out of all actual positives, how many did we predict as positive? We can calculate recall from the confusion matrix with the help of the following formula:
   Recall = TP / (TP + FN)
4. F-score: The F-score (or sometimes F1-score) combines precision and recall, returning their harmonic mean. We can calculate the F-score from the confusion matrix with the help of the following formula:
   F = 2 × (Precision × Recall) / (Precision + Recall)
• For the above example, consider the following confusion matrix:
  [[90  60]
   [10 940]]
  TP = 90, FN = 10, TN = 940 and FP = 60
• For the above binary classifier,
  TP + TN = 90 + 940 = 1030 and
  TP + FP + FN + TN = 90 + 60 + 10 + 940 = 1100
  Hence, Accuracy = 1030 / 1100 = 0.9364.
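• The same statistics can be computed directly from the counts of the cat/non-cat example in a few lines of Python:

    # Metrics for the cat/non-cat confusion matrix (TP=90, FP=60, FN=10, TN=940).
    TP, FP, FN, TN = 90, 60, 10, 940

    accuracy = (TP + TN) / (TP + FP + FN + TN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f_score = 2 * (precision * recall) / (precision + recall)

    print("Accuracy  =", round(accuracy, 4))    # 0.9364
    print("Precision =", round(precision, 4))   # 0.6
    print("Recall    =", round(recall, 4))      # 0.9
    print("F-score   =", round(f_score, 4))     # 0.72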
• Consider what happens as we sweep our threshold from left to right over these
distributions.
• Every time we pass over another example, we either increase the number of true
positives (if this example was positive) or false positives (if this example was in fact a
negative).
• At the very left, we achieve true/false positive rates of 0%, since the classifier labeled
nothing as positive at that cutoff.
• Moving as far to the right as possible, all examples will be labeled positively, and
hence both rates become 100%.
• Each threshold in between defines a possible classifier, and the sweep defines a
staircase curve in true/false positive rate space taking us from (0%, 0%) to
(100%, 100%).
• An ROC curve is the most commonly used way to visualize the performance of a
binary classifier and AUC is (arguably) the best way to summarize its performance in a
single number.
• The area under the ROC curve (AUC) is often used as a statistic measuring the quality
of scoring function defining the classifier.
• The best possible ROC curve has an area of 100% × 100 % → 1, while the monkey's
triangle has an area of 1/2. The closer the area is to 1, the better the classification
function is.
• The Area Under the Curve (AUC) is another evaluation metric that we can use for
classification models.
• The 45 degree line is the baseline for which the AUC is 0.5. The perfect model will have
an AUC of 1.0. The closer the AUC to 1.0, the better the predictions.
[Figure: ROC curve with True Positive Rate (Sensitivity) on the vertical axis and False Positive Rate (100 - Specificity) on the horizontal axis, both running from 0 to 100]
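• A minimal sketch of computing ROC points and the AUC with scikit-learn, using a handful of assumed labels and classifier scores:

    # ROC curve points and AUC for assumed labels and scores.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                      # assumed labels
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])   # assumed scores

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # sweep over score thresholds
    print(list(zip(fpr.round(2), tpr.round(2))))        # points on the staircase curve
    print("AUC =", roc_auc_score(y_true, y_score))      # 1.0 is perfect, 0.5 is random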
• Fig. 1.8 shows the absolute error distributions from two models for predicting the year
of authorship of documents from their word usage distribution.
• On the left, we see the error distribution for the monkey, randomly guessing a year
from 1800 to 2005. What do we see? The error distribution is broad and bad, as we
might have expected, but also asymmetric.
• Far more documents produced positive errors than negative ones. Why? The test corpus apparently contained more modern documents than older ones, so the error is more often positive than negative.
[Figure: two histograms of counts against absolute error, one for the random classifier (left) and one for the naive Bayes classifier (right)]
Fig. 1.8: Error Distribution Histograms for Random (Left) and Naive Bayes Classifiers
Predicting the Year of Authorship for Documents (Right)
• In contrast, Fig. 1.8 (right) presents the error distribution for our naïve Bayes classifier
for document dating. This looks much better: there is a sharp peak around zero and
much narrower tails.
• But the longer tail now resides to the left of zero, telling us that we are still calling a
distressing number of very old documents modern. We need to examine some of these
instances, to figure out why that is the case.
• We need a summary statistic reducing such error distributions to a single number, in
order to compare the performance of different value prediction models.
• A commonly-used statistic is Mean Squared Error (MSE), which is computed as
follows:
MSE(Y, Y') = (1/n) Σ (y'i − yi)², where the sum runs over i = 1 to n
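• A short sketch of the MSE computation, with assumed actual and predicted values:

    # Mean Squared Error for a small, assumed value-prediction example.
    import numpy as np

    y_actual = np.array([1855, 1902, 1998, 1950, 2001])      # assumed true years
    y_predicted = np.array([1860, 1890, 2000, 1940, 1995])   # assumed model output

    mse = np.mean((y_predicted - y_actual) ** 2)
    print(mse)   # the average of the squared errors (61.8 here)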
PRACTICE QUESTIONS
Q. I Multiple Choice Questions:
1. Which is a collection of techniques used to extract value from data?
(a) Data science (b) Data analysis
(c) Data analytics (d) Exploratory analytics
2. Which is the science of examining raw data with the purpose of drawing
conclusions about that information?
(a) Data science (b) Data analysis
(c) Data analytics (d) Exploratory analytics
3. Which is a process of inspecting, cleansing, transforming and modeling data with
the goal of discovering useful information or insights?
(a) Data science (b) Data analysis
(c) Data analytics (d) Exploratory analytics
4. The types of data analytics includes,
(a) Descriptive Analytics (what happened?)
(b) Diagnostic Analytics (why did it happen?)
(c) Predictive Analytics (what will happen?)
(d) All of the mentioned
5. Which analytics deals with prediction of future based on the available current and
past data?
(a) Descriptive Analytics (b) Diagnostic Analytics
(c) Mechanistic analytics (d) Predictive Analytics
6. Which data analytics attempts to find hidden, unseen or previously unknown relationships?
(a) Exploratory Analytics (b) Diagnostic Analytics
(c) Mechanistic Analytics (d) Predictive Analytics
7. Which analytics allows data scientists to understand clear alterations in variables which can result in changes in other variables?
(a) Exploratory Analytics (b) Diagnostic Analytics
(c) Mechanistic Analytics (d) Predictive Analytics
8. Which philosophical principle states that the simplest explanation is the best explanation?
(a) Occam's Analysis (b) Occam's mazor
(c) Occam's Analytics (d) Occam's razor
9. Which is the process where team has to deploy the planned model in a real-time
environment?
(a) Model analytics (b) Model building
(c) Model analysis (d) Model science
10. Which analytics looks at data and analyzes past events for insight as to how to
approach the future?
(a) Descriptive Analytics (b) Diagnostic Analytics
(c) Mechanistic Analytics (d) Predictive Analytics
11. Which analytics is a kind of root cause analysis that focuses on the processes and causes, key factors and unseen patterns?
(a) Descriptive Analytics (b) Diagnostic Analytics
(c) Mechanistic Analytics (d) Predictive Analytics
12. Which is a graphical plot that illustrates the performance of a binary classifier?
(a) ROC curve (b) COR curve
(c) ETL curve (d) None of the mentioned
13. First-principle models can employ the full weight of classical mathematics such as,
(a) calculus (b) algebra
(c) geometry (d) All of the mentioned
14. Which analytics uses past data to create a model that answers the question, what will happen?
(a) descriptive (b) diagnostic
(c) predictive (d) prescriptive
Answers
1. (a) 2. (c) 3. (b) 4. (d) 5. (d) 6. (a) 7. (c) 8. (d) 9. (b) 10. (a)
11. (b) 12. (a) 13. (d) 14. (c)
Q. II Fill in the Blanks:
1. The purpose of _______ data analysis is to check for missing data and other
mistakes.
2. Data _______ is a broad term capturing the endeavor of analyzing data into
information into knowledge.
3. _______ is used for the discovery, interpretation, and communication of meaningful
patterns and/or insights in data.
4. Data analytics is defined as, a science of _______ meaningful, valuable information
from raw data.
5. A data _______ works with massive amount of data and responsible for building
and maintaining the data architecture of a data science project.
6. An analytics _______ is a part of data lake architecture that allows you to store and
process large amounts of data.
7. The _______-layer framework of data analytics consists of a data management layer, an analytics engine layer and a presentation layer.
8. EDA is an approach to analyzing datasets to _______ their main characteristics,
often with visual/graphical methods.
2. Recall determines the proportion of the positives values that were accurately
predicted.
3. Analytics defines the science behind the analysis.
4. Data analytics is the process of exploring the data from the past to make
appropriate decisions in the future by using valuable insights.
5. The art and science of refining data to fetch useful insight which further helps in
decision making is known as analysis.
6. Descriptive analytics enables learning from the past and assessing how the past
might influence future outcomes.
7. Overfitting occurs when a model tries too hard to achieve accurate performance
on its training data.
8. Exploratory data analytics attempts to find hidden, unseen, or previously
unknown relationships.
9. Predictive analysis, as the name suggests, deals with prediction of future based on
the available current and past data.
10. We need a summary statistic reducing such error distributions to a single number,
in order to compare the performance of different value prediction models.
11. Predictive analytics is often associated with data science.
12. The accuracy is such a combination, returning the harmonic mean of precision
and recall.
13. Underfitting occurs when a statistical model cannot adequately capture the
underlying structure of the data.
14. A confusion matrix is also known as error matrix.
15. An ROC curve is the most commonly used way to visualize the performance of a
binary classifier.
Answers
1. (T) 2. (T) 3. (T) 4. (T) 5. (F) 6. (T) 7. (T) 8. (T) 9. (T) 10. (T)
11. (T) 12. (F) 13. (T) 14. (T) 15. (T) 16. (T)
Q. IV Answer the following Questions:
(A) Short Answer Questions:
1. What is data science?
2. Define the term analytics.
3. Enlist types of data analytics.
4. Define data analysis.
5. Define mathematical model.
6. What is the purpose of diagnostic analytics?
7. Define class imbalance.
CHAPTER
2
2.0 INTRODUCTION
• Machine learning is a buzzword in today’s technical and data-driven world. Learning is the process of converting experience into expertise or knowledge.
• Machine Learning (ML) is a field of computer science that studies algorithms and
techniques for automating solutions to complex problems that are hard to program
using conventional programming methods.
• The conventional/traditional programming method consists of the following two distinct steps, given a specification for the program (i.e., what the program is supposed to do, but not how):
o The 1st step is to create a detailed design for the program i.e., a fixed set of steps/stages used to solve the problem.
o The 2nd step is to implement the detailed design as a program in a computer language.
• Though data science includes machine learning as one of its fundamental areas of study, machine learning in itself is a vast research area that requires good skills and experience to master.
• The basic idea of machine learning is to allow machines (computers) to independently
learn from the wealth of data that is fed as input into the machine.
• To master machine learning, a learner needs in-depth knowledge of computer fundamentals, programming skills, data modeling and evaluation skills, and probability and statistics.
• With the advancement of new technology, machines are being trained to behave like a
human in decision-making capability.
• In doing so, it is necessary to automate decisions that can be inferred by the machines
with the interaction with the environment and understanding from past knowledge.
• The field of machine learning deals with all those algorithms that help machines to get
self-trained in this process.
• Machine learning techniques are broadly categorized into supervised machine
learning, unsupervised machine learning, and reinforcement learning.
1. Supervised Machine Learning is sometimes described as “learn from the past to predict the future”. Supervised machine learning is a field of learning where the machine learns with the help of a supervisor or instructor.
2. In Unsupervised Machine Learning, the machine learns without any supervision. The goal of unsupervised learning is to model the structure of the data to learn more about the data.
3. Reinforcement Machine Learning happens through interaction with the environment. If we think of a program that eventually starts learning with every encounter with the environment, then the process is called reinforcement learning.
• Today, Deep Learning (DL) is a fast-growing field of research and its applications run
the gamut of structured and unstructured data (text, voice, images, video and so on).
• Deep learning (DL) is a subset of ML and ML is a subset of AI. Nowadays, AI and DL are
the latest technologies that are doing much more.
• They are supporting humans in complex and creative problem-solving by analyzing
vast amounts of data and identifying trends that were previously impossible to detect.
[Figure: Deep Learning (DL) is a subset of Machine Learning (ML), which in turn is a subset of Artificial Intelligence (AI)]
2. Experience (E):
o As name suggests, it is the knowledge gained from data points provided to the
algorithm or model.
o Once, provided with the dataset, the model will run iteratively and will learn some
inherent pattern. The learning thus acquired is called Experience (E).
o Making an analogy with human learning, we can think of this situation as in which
a human being is learning or gaining some experience from various attributes like
situation, relationships etc.
o Supervised, unsupervised and reinforcement learning are some ways to learn or gain experience. The experience gained by our ML model or algorithm will be used to solve the Task (T).
3. Performance (P):
o An ML algorithm is supposed to perform a task and gain experience with the passage of time.
o The measure which tells whether ML algorithm is performing as per expectation
or not is its performance (P).
o The P is basically a quantitative metric that tells how a model is performing the
Task (T) using its Experience (E).
o There are many metrics that help to understand the ML performance, such as
accuracy score, F1 score, confusion matrix, precision, recall, sensitivity etc.
Defining the Learning Task
Improve on task T, with respect to Performance metric P,
based on Experience E
1. Task (T): Playing checkers; Performance (P): % of games won against an arbitrary opponent; Experience (E): playing practice games against itself.
2. Task (T): Recognizing handwritten words; Performance (P): % of words correctly classified; Experience (E): database of human-labeled images of handwritten words.
3. Task (T): Driving on four-lane highways using vision sensors; Performance (P): average distance travelled before a human-judged error; Experience (E): a sequence of images and steering commands recorded while observing a human driver.
4. Task (T): Categorize email messages as spam or legitimate; Performance (P): % of email messages correctly classified; Experience (E): database of emails, some with human-given labels.
2. Capacity and Dimension: The increase in the number of data sources and the globalization and diversification of businesses have led to the exponential growth of data.
3. Speed: As data volume increases, so must the speed at which data is captured and
transformed.
4. Complexity: With the increasing complexity of data, high data quality and
security is required to enable data collection, transformation, and analysis to
achieve expedient decision making.
5. Applicability: These aforementioned factors can compromise the applicability of
the data to business process and performance improvement.
liked or added to cart, brand preferences etc., the product recommendations are
sent to the user. Machine learning is widely used by various e-commerce and
entertainment companies such as Amazon, Netflix etc., for product
recommendation to the user.
in our daily lives, and the huge amount of data those sensors routinely generate.
For example, driverless car development requires millions of images and
thousands of hours of video.
2. Deep learning requires substantial computing power, including high-performance
GPUs that have a parallel architecture, efficient for deep learning. When combined
with cloud computing or distributed computing, this enables the training time for
a deep learning network to be reduced from the usual weeks to hours or even less.
5. Useful for Risky Areas: AI machines can be helpful in situations such as defusing
a bomb, exploring the ocean floor, where to employ a human can be risky.
6. Digital Assistant: AI can be very useful for providing digital assistance to users; for example, AI technology is currently used by various e-commerce websites to show products as per customer requirements.
Disadvantages of Artificial Intelligence:
1. High Cost: The hardware and software requirement of AI is very costly as it
requires lots of maintenance to meet current world requirements.
2. No Original Creativity: Humans are creative and can imagine new ideas, but AI machines cannot beat this power of human intelligence and cannot be creative and imaginative.
3. Increase dependency on machines: With the increment of technology, people are
getting more dependent on devices and hence they are losing their mental
capabilities.
4. No Feelings and Emotions: An AI machine can be an outstanding performer, but it does not have feelings, so it cannot form any kind of emotional attachment with humans, and may sometimes be harmful to users if proper care is not taken.
• The uses for regression and automatic classification are wide ranging, such as the
following:
1. Proactively identifying car parts that are likely to fail (regression).
2. Finding oil fields, gold mines, or archeological sites based on existing sites
(classification and regression).
3. Predicting the number of eruptions of a volcano in a period (regression).
4. Finding place names or persons in text (classification).
5. Recognizing birds based on their whistle (classification).
6. Face recognition or retina recognition, biometric (classification).
7. Predicting which team will win the Champions League in soccer (classification).
8. Identifying profitable customers (regression and classification).
9. Identifying tumors and diseases (classification).
10. Predicting the amount of money a person will spend on product X (regression).
11. Predicting your company’s yearly revenue (regression).
12. Identifying people based on pictures or voice recordings (classification).
13. Retail Marketing (clustering): Retail organizations often use clustering to identify
groups of households that are similar to each other. For example, a retail
organization may collect the following information on households:
o Household income.
o Household size.
o Head of household Occupation.
o Distance from nearest urban area.
o These variables are used to identify the clusters of the families.
They can then feed these variables into a clustering algorithm to identify clusters of households with similar characteristics.
o The organization can then send personalized advertisements or sales letters to
each household based on how likely they are to respond to specific types of
advertisements.
14. Streaming services often use clustering analysis to identify viewers who have
similar behavior. For example, a streaming service may collect the following data
about individuals:
o Minutes watched per day.
o Total viewing sessions per week.
o Number of unique shows viewed per month.
Using these metrics, a streaming service can perform cluster analysis to identify
high usage and low usage users so that they can know who they should spend most
of their advertising dollars on, (clustering).
• These techniques essentially allow us to train a model using a few lines of code.
• Advanced data science techniques require the scientist to be capable of using heavy
mathematical calculations and then implement modern data science techniques to use
with these calculations.
• Once we have trained the model, the next thing to do is check whether it works as we
intended it to.
• Training time, i.e., the number of hours needed to train the model, also plays a vital role in determining the selection of the model. It is directly related to the accuracy of the obtained model.
2. K-folds cross validation is the strategy of dividing the data set into k parts and using each part once as a test data set while using the others as a training data set. This has the advantage that we use all the data available in the data set (a sketch is shown after this list).
3. Leave-1 out is an approach similar to k-folds, but each part contains only a single observation: you always leave one observation out and train on the rest of the data. This is used only on small data sets, so it’s more valuable to people evaluating laboratory experiments than to big data analysts.
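• A minimal sketch of k-folds cross validation with scikit-learn, on a synthetic data set (the model and data are placeholders, not a recommendation):

    # 5-fold cross validation sketch on synthetic classification data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    model = LogisticRegression(max_iter=1000)

    # Each of the 5 folds is used once as the test set and 4 times for training.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores)          # accuracy on each fold
    print(scores.mean())   # average accuracy over the folds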
• Regularization is another popular term in machine learning. When applying the term
regularization, we incur a penalty for every extra variable used to construct the
model.
o With L1 regularization we ask for a model with as few predictors as possible. This
is important for the model’s robustness: simple solutions tend to hold true in more
situations.
o The L2 regularization aims to keep the variance between the coefficients of the
predictors as small as possible.
• Overlapping variance between predictors in a model makes it hard to make out the
actual impact of each predictor. Keeping their variance from overlapping will increase
interpretability.
• To keep it simple, regularization is mainly used to stop a model from using too many features and thus prevent over-fitting.
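• As an illustrative sketch, scikit-learn's Lasso (L1) and Ridge (L2) estimators apply exactly these kinds of penalties; the penalty strength alpha below is an assumed value.

    # L1 (Lasso) vs. L2 (Ridge) regularization on synthetic regression data.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)   # L1: pushes many coefficients to exactly 0
    ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients, keeping variance small

    print("Non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
    print("Largest Ridge coefficient:", round(float(np.abs(ridge.coef_).max()), 2))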
2. Unsupervised Learning doesn’t rely on labeled data and attempts to find patterns in a data set without human interaction.
3. Semi-supervised Learning needs labeled data, and therefore human interaction, to find patterns in the data set, but it can still progress toward a result and learn even if it is passed unlabeled data as well.
[Figure: Supervised learning: the input and its known output are fed to the learning algorithm, and the error between the predicted and actual output is used to adjust the model]
• Based on the ML tasks, supervised learning algorithms can be divided into two classes
namely, Classification and Regression.
1. Classification:
• The key objective of classification-based tasks is to predict categorical output labels or
responses for the given input data. The output will be based on what the model has
learned in training phase.
• As we know that the categorical output response means unordered and discrete
values, hence each output response will belong to a specific class or category.
• Classification refers to the process of predicting discrete output values for an input. For example, given an input, predicting whether a student will pass or fail the exam.
2. Regression:
• The key objective of regression-based tasks is to predict output labels or responses which are continuous numeric values, for the given input data. The output will be based on what the model has learned in its training phase.
• Basically, regression models use the input data features (independent variables) and
their corresponding continuous numeric output values (dependent or outcome
variables) to learn specific association between inputs and corresponding outputs.
• In regression problems, the task of the machine learning model is to predict a continuous value. For example, for a given input, predict the marks obtained by a student on an exam.
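• The contrast can be seen in a small sketch with made-up study-hours data: a regression model predicts a continuous mark, while a classification model predicts a discrete pass/fail label.

    # Regression (predict a mark) vs. classification (predict pass/fail).
    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # assumed feature
    marks = np.array([35, 42, 50, 55, 62, 70, 78, 85])           # continuous target
    passed = (marks >= 50).astype(int)                           # discrete target (0/1)

    regressor = LinearRegression().fit(hours, marks)
    classifier = LogisticRegression().fit(hours, passed)

    print(regressor.predict([[5.5]]))    # a continuous value, e.g. an expected mark
    print(classifier.predict([[5.5]]))   # a class label: 0 = fail, 1 = pass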
How Supervised Learning Works?
• Supervised learning is the type of machine learning in which machines are trained using well-labeled training data and, on the basis of that data, predict the output. Labeled data means input data that is already tagged with the correct output.
• In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
• Supervised learning is a process of providing input data as well as correct output data
to the machine learning model.
• The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
• For example, we have x (input variables) and Y (output variable). Now, apply an
algorithm to learn the mapping function from the input to output as: Y=f(x).
• In supervised learning, models are trained using labeled dataset, where the model
learns about each type of data.
• Once the training process is completed, the model is tested on test data (a portion of the data set held out from training) and then it predicts the output.
• Fig. 2.5 shows working of supervised learning. Suppose we have a dataset of different
types of shapes which includes square, rectangle, triangle and Polygon.
• Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labeled
as a Square.
o If the given shape has three sides, then it will be labeled as a triangle.
o If the given shape has six equal sides then it will be labeled as hexagon.
• Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
Fig. 2.5: Working of Supervised Learning (labeled shapes such as square, triangle and hexagon are used for model training; the trained model then predicts the labels of the shapes in the test data)
4. Logistic Regression
5. Support Vector Machines (SVMs)
k-Nearest-Neighbors (kNN):
• The k-NN algorithm is a type of supervised ML algorithm which can be used for both
classification as well as regression predictive problems.
• The k-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
• The k-NN algorithm stores all the available data and classifies a new data point based on similarity; this means that when new data appears, it can easily be classified into a well-suited category using the k-NN algorithm.
• The k-NN algorithm can be used for regression as well as for classification but mostly
it is used for the classification problems.
• The k-NN is a type of classification where the function is only approximated locally
and all computation is deferred until function evaluation.
• The k-NN algorithm uses ‘feature similarity’ to predict the values of new data-points
which further means that the new data point will be assigned a value based on how
closely it matches the points in the training set.
Need for k-NN Algorithm:
• Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need the k-NN algorithm.
• With the help of k-NN, we can easily identify the category or class of a particular
dataset.
Fig. 2.7: Scatter Plot of the Existing Data Points in the Two Categories
• Now, we need to classify the new data point (the black dot at point 60, 60) into either the gray or the black class. Assuming k = 3, the algorithm finds the three nearest data points, as shown in Fig. 2.8.
Fig. 2.8: The Three Nearest Neighbours of the New Data Point (k = 3)
• Fig. 2.8 shows the three nearest neighbours of the new data point. Among those three, two lie in the black class; hence, the new point is also assigned to the black class.
• Take another example, we have a new data point and we need to put it in the required
category (See Fig. 2.9).
Fig. 2.9: A New Data Point to be placed in either Category A or Category B
• Firstly, we choose the number of neighbours; here we choose k = 5. Three of the five nearest neighbours are from Category A, so by majority vote this new data point must belong to Category A.
Fig. 2.10: With k = 5, the New Data Point is Assigned to Category A
Advantages of k-NN:
1. The k-NN algorithm is simple and easy to implement.
2. The k-NN is a versatile algorithm, as we can use it for classification as well as regression.
3. The k-NN is very useful for nonlinear data because there is no assumption about
data in this algorithm.
Disadvantages of k-NN:
1. The k-NN algorithm gets significantly slower as the number of examples and/or
predictors/independent variables increase.
2. The k-NN algorithm is computationally a bit expensive algorithm because it stores
all the training data.
3. The k-NN algorithm requires high memory storage.
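The following is a minimal, illustrative sketch (not from the text) of k-NN classification with scikit-learn; the Iris data set and the choice k = 3 are assumptions made only so the example runs.
# Minimal sketch of k-NN classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# The classifier stores the training data; a new point is assigned to the
# class held by the majority of its k nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))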
Decision Tree:
• In supervised learning the decisions are performed on the basis of features of the
given dataset. The decision tree is a graphical representation for getting all the
possible solutions to a problem/decision based on given conditions.
• Decision tree is a supervised learning technique that can be used for both
classification and regression problems
• A decision tree is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf node
represents the outcome.
• A decision tree builds classification or regression models in the form of a tree
structure. The decisions are performed on the basis of features of the given dataset.
2.31
Data Analytics Machine Learning Overview
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• Fig. 2.11 shows the general structure of a decision tree. The root node is where the decision tree starts; it represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node. Decision nodes are used to make any decision and have multiple
branches.
Fig. 2.11: General Structure of a Decision Tree (root node, decision nodes and leaf nodes)
How does the Decision Tree Work?
• In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree.
• This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next
node.
• For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree.
• Fig. 2.12 shows an example of decision tree. Suppose there is a candidate who has a
job offer and wants to decide whether he should accept the offer or Not. So, to solve
this problem, the decision tree starts with the root node (Salary attribute).
• The root node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node further gets
split into one decision node (Cab facility) and one leaf node.
• Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer).
2.32
Data Analytics Machine Learning Overview
Fig. 2.12: Decision Tree for the Job Offer Example (the root node tests whether the salary is between Rs 50,000 and Rs 80,000; the next decision node tests whether the office is near to home; the next tests whether a cab facility is provided; the leaf nodes are "Accepted offer" and "Declined offer")
Advantages of Decision Tree:
1. Decision trees are simple to understand and interpret.
2. Decision trees are able to handle both numerical and categorical data.
3. Decision trees work well with large datasets.
4. Decision trees are fast and accurate.
Disadvantages of Decision Tree:
1. A small change in the training data can result in a large change in the tree and, consequently, in the final predictions.
2. Decision tree performance is not good if there are many uncorrelated variables in the data set.
3. Decision trees are generally easy to use, but building them, particularly huge ones with numerous divisions or branches, is complex.
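The following is a minimal, illustrative sketch (not from the text) of a decision tree classifier with scikit-learn; the Iris data set and the depth limit are assumptions chosen only so the example runs.
# Minimal sketch of a decision tree classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The tree repeatedly splits the data on the feature that best separates the
# classes, starting at the root node and ending at the leaf nodes.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # textual view of the learned decision rules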
Support Vector Machine (SVM):
• Support Vector Machines (SVMs) are powerful yet flexible supervised machine
learning algorithms which are used both for classification and regression problems.
• The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future.
• The objective of the support vector machine algorithm is to find a hyperplane in an
N-dimensional space (N- the number of features) that distinctly classifies the data
points.
2.33
Data Analytics Machine Learning Overview
Fig. 2.14: SVM: Possible Hyperplanes (left) and the Optimal Hyperplane with the Maximum Margin and its Support Vectors (right)
• The important concepts in SVM are explained below:
1. Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
2. A hyperplane is the decision plane or boundary that divides a set of objects belonging to different classes.
3. Margin may be defined as the gap between the two lines drawn through the closest data points of the different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
2.34
Data Analytics Machine Learning Overview
Advantages of SVM:
1. SVM offers great accuracy.
2. SVM work well with high dimensional space.
3. It is effective in cases where number of dimensions is greater than the number of
samples.
4. It uses a subset of training points in the decision function (called support vectors),
so it is also memory efficient.
Disadvantages of SVM:
1. SVMs have a high training time and hence, in practice, are not suitable for large datasets.
2. SVM also does not perform very well when the data set has more noise, i.e., when the target classes are overlapping.
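The following is a minimal, illustrative sketch (not from the text) of an SVM classifier with scikit-learn; the breast cancer data set, the RBF kernel and the scaling step are assumptions chosen only so the example runs.
# Minimal sketch of an SVM classifier with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The SVC finds the hyperplane with the maximum margin between the two
# classes; scaling the features first usually helps the optimisation.
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))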
Naïve Bayes:
• Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes’
theorem and used for solving classification problems. Bayes' theorem is also known as
Bayes' Rule or Bayes' law.
• The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as follows:
1. Naïve is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
2. Bayes is called Bayes because it depends on the principle of Bayes' Theorem.
• Naïve Bayes classifier is one of the simple and most effective classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
• Naïve Bayes is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
• It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors.
• In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
• Naive Bayes model is easy to build and particularly useful for very large data sets.
Naive Bayes is used for creating classifiers. Suppose we want to sort out (classify)
fruits of different kinds from a fruit basket.
• We may use features such as the color, size and shape of a fruit. For example, any fruit that is red in color, round in shape and about 9 cm in diameter may be considered an apple.
• So to train the model, we would use these features and test the probability that a given
feature matches the desired constraints. The probabilities of different features are
then combined to arrive at a probability that a given fruit is an Apple.
2.35
Data Analytics Machine Learning Overview
• Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):
P(c | x) = [P(x | c) × P(c)] / P(x)
where P(x | c) is the likelihood, P(c) is the class prior probability, P(c | x) is the posterior probability and P(x) is the prior probability of the predictor.
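The following is a minimal, illustrative sketch (not from the text) of a Gaussian Naïve Bayes classifier with scikit-learn; the Iris data set is an assumption chosen only so the example runs.
# Minimal sketch of a Gaussian Naive Bayes classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each feature is treated as independent given the class (the "naive"
# assumption); per-class feature likelihoods are combined via Bayes' theorem.
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
print("Class priors :", nb.class_prior_.round(2))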
• Instead, the model itself finds the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things.
• Unsupervised learning can be defined as, a type of machine learning in which models
are trained using unlabeled dataset and are allowed to act on that data without any
supervision.
• In unsupervised learning, data is unlabeled, so the learning algorithm is left to find
commonalities among its input data.
• As unlabeled data are more abundant than labeled data, machine learning methods
that facilitate unsupervised learning are particularly valuable.
• The system itself must then decide which features it will use to group the input data.
• The training process extracts the statistical properties of the training set and groups of
similar vectors into classes or clusters.
• Unsupervised learning is often used for anomaly detection including for fraudulent
credit card purchases, and recommender systems that recommend what products to
buy next.
[Figure: Unsupervised learning, in which unlabeled Input is passed to the Unsupervised Learning algorithm to produce the Output]
[Figure: An unsupervised model examining the input data and finding a pattern on its own ("I can see a pattern")]
• In the k-means algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and the centroid is minimized.
• Less variation within a cluster means that the data points within that cluster are more similar to each other.
• The k-means clustering is an unsupervised learning algorithm, which groups the
unlabeled dataset into different clusters.
• Here k defines the number of pre-defined clusters that need to be created in the
process, as if k=2, there will be two clusters, and for k=3, there will be three clusters,
and so on.
• It allows us to cluster the data into different groups and a convenient way to discover
the categories of groups in the unlabeled dataset on its own without the need for any
training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data point
and their corresponding clusters.
• The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the clusters stop improving. The value of k should be predetermined in this algorithm.
• The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.
• Hence, each cluster has data points with some commonalities, and it is away from
other clusters.
• Fig. 2.17 shows the working of the k-means clustering algorithm.
Fig. 2.17: The Dataset Before and After Applying k-Means
Fig. 2.18: Two Clusters (Cluster A and Cluster B) with their Centroids
• The k-means algorithm is extremely simple and easy to understand and implement.
• We begin by randomly assigning each example from the data set to a cluster and calculate the centroid of each cluster as the mean of all its member examples.
• We then iterate over the data set to determine whether an example is closer to its own cluster or to the alternate cluster (given that k = 2).
• If the example is closer to the alternate cluster, it is moved to that cluster and both centroids are recalculated. This process continues until no example moves to the alternate cluster.
• As illustrated, k-means partitions the example data set into k clusters without any
understanding of the features within the example vectors (that is, without
supervision).
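The following is a minimal, illustrative sketch (not from the text) of k-means clustering with scikit-learn; the synthetic blob data and k = 2 are assumptions chosen only so the example runs.
# Minimal sketch of k-means clustering with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

# k-means assigns every point to its nearest centroid, recomputes each
# centroid as the mean of its members, and repeats until the assignments
# stop changing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 labels  :", labels[:10])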
Example for k-means Algorithm:
• Suppose we have two variables A1 and A2. The x-y axis scatter plot of these two
variables is given below:
2.40
Data Analytics Machine Learning Overview
• Let us take the number of clusters as k = 2; that is, we will try to group the dataset into two different clusters. We need to choose k random points or centroids to form the clusters. These points can either be points from the dataset or any other points. Here we select two points as the k points, which are not part of our dataset.
• Now we will assign each data point of the scatter plot to its closest k-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate
the distance between two points. So, we will draw a median between both the
centroids.
• From the above image, it is clear that the points on the left side of the line are near to k1, the black centroid, and the points to the right of the line are close to the gray centroid. Let us color them black and gray for clear visualization.
2.41
Data Analytics Machine Learning Overview
• As we need to find the closest cluster, we repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points in each cluster, and the new centroids are found as shown below:
• Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be given below:
• From the above image, we can see that one gray point is on the left side of the line and two black points are on the right of the line, so these three points will be reassigned to the new centroids.
2.42
Data Analytics Machine Learning Overview
• As reassignment has taken place, we again repeat the step of finding new centroids or k-points: we compute the center of gravity of each cluster, so the new centroids will be as shown below:
• With the new centroids we again draw the median line and reassign the data points, giving the image below:
• We can see in the above image that no data point changes sides of the line, which means the clusters have stabilized and our model is formed.
• As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be shown below:
2. Cross Marketing: This means working with other businesses that complement your own, not competitors. For example, vehicle dealerships and manufacturers run cross-marketing campaigns with oil and gas companies for obvious reasons.
3. Catalog Design: The selection of items in a business's catalog is often designed so that the items complement each other and buying one item leads to buying another. These items are therefore often complements or very closely related.
Apriori Algorithm:
• The Apriori algorithm was proposed by Agrawal and Srikant in 1994.
• The Apriori algorithm helps in building/creating association rules from the frequent itemsets that remain after the Apriori property is applied during each iteration.
Anomaly Detection:
• Anomaly detection is an unsupervised ML method.
• Anomaly detection is used to find out the occurrences of rare events or observations
that generally do not occur.
Difference between Supervised and Unsupervised Learning:
• Machine learning defines basically two types of learning namely, supervised and
unsupervised learning. But both the techniques are used in different scenarios and
with different datasets.
• Supervised learning is a machine learning method in which models are trained using labeled data. Unsupervised learning is another machine learning method in which patterns are inferred from unlabeled input data.
• The following table lists the differences between supervised learning and unsupervised learning:
Sr. No. 1:
Supervised Learning: Both input and output variables are provided, on the basis of which the output can be predicted, and the probability of its correctness is higher.
Unsupervised Learning: Only input variables are provided and no output variable is available, so the outcome or resultant learning depends on one's own intellectual observation.
Sr. No. 2:
Supervised Learning: Supervised learning is treated as a highly accurate and trustworthy method, so its accuracy and correctness are better as compared to unsupervised learning.
Unsupervised Learning: Unsupervised learning is a comparatively less accurate and trustworthy method.
Sr. No. 3:
Supervised Learning: Supervised learning algorithms are trained using labeled data.
Unsupervised Learning: Unsupervised learning algorithms are trained using unlabeled data.
contd. …
• Thus, using the knowledge of both labeled and unlabeled data, the model can classify
unseen documents in the future.
Fig. 2.19: Semi-supervised Learning: a Large Pool of Unlabeled Data (WWW HTML documents) combined with Expert Labels for a small part of the Unlabeled Data
• Take, for example, the plot in Fig. 2.20. In this case, the data has only two labeled observations ("Buy" and "Does not buy"); normally this is too few to make valid predictions.
Fig. 2.20: The Plot has only Two Labeled Observations - too few for Supervised Learning, but enough to Start with an Unsupervised or Semi-supervised Approach
• A common semi-supervised learning technique is label propagation. In this technique,
we start with a labeled data set and give the same label to similar data points.
• This is similar to running a clustering algorithm over the data set and labeling each
cluster based on the labels they contain.
• If we were to apply this approach to the data set in Fig. 2.20, we might end up with
something like Fig. 2.21.
• The previous Fig. 2.20 shows that the data has only two labeled observations, far too
few for supervised learning. The Fig. 2.21 shows how we can exploit the structure of
the underlying data set to learn better classifiers than from the labeled data only.
• The data is split into two clusters by the clustering technique; we only have two
labeled values, but if we’re bold we can assume others within that cluster have that
same label (buyer or non-buyer), as depicted here. This technique isn’t flawless; it’s
better to get the actual labels if we can.
Fig. 2.21: After Label Propagation, the Two Clusters are Labeled as Buyers and Non-buyers
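The following is a minimal, illustrative sketch (not from the text) of label propagation with scikit-learn's LabelPropagation; the synthetic data, the k-NN kernel and keeping one labeled point per class are assumptions chosen only so the example runs.
# Minimal sketch of label propagation on mostly unlabeled data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation

X, y_true = make_blobs(n_samples=200, centers=2, random_state=0)

# Hide almost all labels: -1 means "unlabeled" for LabelPropagation.
y_partial = np.full_like(y_true, -1)
for cls in (0, 1):
    y_partial[np.where(y_true == cls)[0][0]] = cls   # one labeled point per class

model = LabelPropagation(kernel='knn', n_neighbors=10)
model.fit(X, y_partial)
y_pred = model.transduction_          # labels propagated to every point
print("Agreement with true labels:", (y_pred == y_true).mean())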
• A special approach to semi-supervised learning worth mentioning here is active learning, in which the program points out the observations it wants to see labeled for its next round of learning, based on criteria we have specified.
• For example, we might set it to try and label the observations the algorithm is least
certain about or we might use multiple models to make a prediction and select the
points where the models disagree the most.
Advantages of Semi-supervised Machine Learning Algorithms:
1. It is easy to understand and simple to implement.
2. It reduces the amount of annotated data used.
3. It is a stable algorithm.
4. It has high efficiency.
Disadvantages of Semi-supervised Machine Learning Algorithms:
1. Iteration results are not stable.
2. It is not applicable to network-level data.
3. It has low accuracy.
• Ensemble methods are the machine learning technique that combines several base
models in order to produce one optimal predictive model.
• Ensemble methods are techniques that aim at improving the accuracy of results in
models by combining multiple models instead of using a single model.
• The following figure shows the basic concept of ensemble methods/techniques.
[Figure: Ensemble methods, in which the individual predictions of Model 1, Model 2, …, Model N are combined into a final single model]
• Ensemble methods are models composed of multiple weaker models that are trained independently; their predictions are combined into a single model, which makes the predictions more accurate and improves performance.
• Bagging, boosting and random forests are examples of ensemble methods/techniques.
1. Bagging Ensemble Technique:
• Bagging (short for bootstrap aggregating) is mainly applied in classification and regression.
• Bagging involves two steps, namely bootstrapping and aggregation (a short code sketch is given after this description).
o Bootstrapping is a sampling technique where samples are derived from the
whole population (set) using the replacement procedure. The sampling with
replacement method helps make the selection procedure randomized. The base
learning algorithm is run on the samples to complete the procedure.
o Aggregation in bagging is done to incorporate all possible outcomes of the
prediction and randomize the outcome. Without aggregation, predictions will not
be accurate because all outcomes are not put into consideration. Therefore, the
aggregation is based on the probability bootstrapping procedures or on the basis
of all outcomes of the predictive models.
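The following is a minimal, illustrative sketch (not from the text) of bagging with scikit-learn's BaggingClassifier; the data set and the number of estimators are assumptions, and the default base learner (a decision tree) is used.
# Minimal sketch of bagging with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bootstrapping: each of the 50 base learners is trained on a sample drawn
# from the training set with replacement. Aggregation: their votes are combined.
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_train, y_train)
print("Test accuracy:", bag.score(X_test, y_test))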
2. Boosting Ensemble Technique:
• Boosting is an ensemble technique that learns from previous predictor mistakes to
make better predictions in the future.
• The technique combines several weak base learners to form one strong learner, thus
significantly improving the predictability of models.
• Boosting works by arranging weak learners in a sequence, such that each learner learns from the mistakes of the previous learner in the sequence, creating progressively better predictive models (a short code sketch follows).
• Boosting takes many forms, including gradient boosting, Adaptive Boosting
(AdaBoost), CatBoost and XGBoost (Extreme Gradient Boosting).
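The following is a minimal, illustrative sketch (not from the text) of boosting with scikit-learn's AdaBoostClassifier; the data set and the number of estimators are assumptions chosen only so the example runs.
# Minimal sketch of boosting (AdaBoost) with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners are fitted one after another; each new learner puts more
# weight on the examples the previous learners misclassified.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)
print("Test accuracy:", boost.score(X_test, y_test))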
2. Averaging: In this method, we take an average of predictions from all the models
and use it to make the final prediction.
3. Weighted Averaging: All models are assigned different weights defining the importance of each model for the prediction. For example, if two of our colleagues are critics while the others have no prior experience in this field, then the answers given by these two colleagues are given more importance than those of the other people.
• If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
• The linear regression model provides a sloped straight line representing the
relationship between the variables, (see Fig. 2.22).
Y
Dependent Variable
s
oint
t ap
Da
Line of
regression
Independent Variables X
Fig. 2.22
• Mathematically the relationship can be represented with the help of following
equation,
Y = aX + b … (2.1)
Here, Y = dependent variables (target variables), X = Independent variables (predictor
variables), a and b are the linear coefficients.
• To quantify the total error the regression line has committed, all the individual errors have to be added up.
• However, since the errors carry both positive and negative signs, some errors might cancel each other out and therefore would not be reflected in the overall error computation.
• It is therefore, necessary that the error is squared and one of the popular ways of
computing the error could be the Root Mean Squared Error (RMSE). The lower the
RMSE, the better is the regression model.
• Equation 2.1 can be modified, in line with the definition of the linear regression model, to incorporate an error term. The modified equation is shown in Equation 2.2, in which ε is the error term:
Y = a0 + a1X + ε … (2.2)
Here,
Y = Dependent Variable (Target Variable)
X = Independent Variable (predictor Variable)
1. Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a positive linear relationship.
Fig. 2.23: Positive Linear Relationship (the line of equation will be: Y = a0 + a1x)
2. Negative Linear Relationship: If the dependent variable decreases on the Y-axis
and independent variable increases on the X-axis, then such a relationship is
called a negative linear relationship.
Fig. 2.24: Negative Linear Relationship (the line of equation will be: Y = –a0 + a1x)
• Linear regression can be further divided into two types of the algorithm:
1. Simple Linear Regression: If a single independent variable is used to predict the
value of a numerical dependent variable, then such a linear regression algorithm
is called simple linear regression.
2. Multiple Linear Regression: If more than one independent variable is used to
predict the value of a numerical dependent variable, then such a linear regression
algorithm is called multiple linear regression.
• The relationship between variables in the linear regression model can be shown in
the Fig. 2.25. Here, we are predicting the salary of an employee on the basis of the
year of experience.
Fig. 2.25: Linear Regression Example: Salary (Y) Predicted from Years of Experience (X)
Example on Fitting of Linear Regression: The following data gives the expenditure on R&D and the profit of a company:
Profit 50 60 40 70 85 100
Expenditure on R&D 0.40 0.40 0.30 0.50 0.60 0.80
1. Find the regression equation of profit on R & D expenditure.
2. Estimate the profit when expenditure on R & D is budgeted at Rs 1 Crore.
3. Find the correlation coefficient.
4. What proportion of variability in profit is explained by variability in expenditure
on R&D.
Solution:
From the scatter diagram of the data, we observe that profit and expenditure on R&D are highly positively correlated.
In this example, let X denote the expenditure on R&D and Y denote the profit.
Profit (Y) | Expenditure on R&D (X) | X² | Y² | XY | Ŷ | (Y − Ŷ) | (Y − Ŷ)²
50  | 0.4 | 0.16 | 2500  | 20 | 55.3125  | −5.3125 | 28.22266
60  | 0.4 | 0.16 | 3600  | 24 | 55.3125  |  4.6875 | 21.97266
40  | 0.3 | 0.09 | 1600  | 12 | 43.125   | −3.125  | 9.765625
70  | 0.5 | 0.25 | 4900  | 35 | 67.5     |  2.5    | 6.25
85  | 0.6 | 0.36 | 7225  | 51 | 79.6875  |  5.3125 | 28.22266
100 | 0.8 | 0.64 | 10000 | 80 | 104.0625 | −4.0625 | 16.50391
ΣY = 405 | ΣX = 3 | ΣX² = 1.66 | ΣY² = 29825 | ΣXY = 222 | | | Σ(Y − Ŷ)² = 110.9375
Step 1: Means
x̄ = Σx/n = 3/6 = 0.5;   ȳ = Σy/n = 405/6 = 67.5
Step 2: Variances
Var(X) = Σx²/n − x̄² = 1.66/6 − (0.5)² = 0.026667;   Var(Y) = Σy²/n − ȳ² = 29825/6 − (67.5)² = 414.5833
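The remaining steps of this solution fall on pages not reproduced here. The following is a short Python sketch (not from the original text) of how the solution can be completed from the sums in the table above; it assumes the R&D expenditure is measured in crores of rupees, so a budget of Rs 1 crore corresponds to X = 1.
# Sketch: completing the worked example numerically from the table sums.
n, sum_x, sum_y, sum_x2, sum_y2, sum_xy = 6, 3, 405, 1.66, 29825, 222
x_bar, y_bar = sum_x / n, sum_y / n                    # 0.5, 67.5
var_x = sum_x2 / n - x_bar ** 2                        # 0.026667
var_y = sum_y2 / n - y_bar ** 2                        # 414.5833
cov_xy = sum_xy / n - x_bar * y_bar                    # 3.25

b1 = cov_xy / var_x                                    # slope = 121.875
b0 = y_bar - b1 * x_bar                                # intercept = 6.5625
print("Regression of profit on R&D:  Y =", b0, "+", b1, "X")
print("Estimated profit at X = 1   :", b0 + b1 * 1)    # 128.4375
r = cov_xy / (var_x * var_y) ** 0.5
print("Correlation coefficient r   :", round(r, 4))    # approx. 0.9774
print("Proportion explained (r^2)  :", round(r * r, 4))  # approx. 0.9554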
Program for Linear Regression:
# The opening lines of this program fall on a page not reproduced here; the
# setup below is an assumed reconstruction (the column selection is illustrative).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pd.read_csv('stsc.csv')        # study hours vs. marks scored
X = dataset.iloc[:, :-1].values          # independent variable: study hours
y = dataset.iloc[:, -1].values           # dependent variable: marks scored
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)          # fit the simple linear regression model
y_pred = regressor.predict(X_test)       # predicted marks for the test data
y_test
# graphical representation
# plot for the TRAIN data
plt.scatter(X_train, y_train, color='Green')                 # plotting the observation points
plt.plot(X_train, regressor.predict(X_train), color='Red')   # plotting the regression line
plt.title("Study in hours vs Marks_Scored (Training set)")   # stating the title of the graph
plt.xlabel("Study in hours")                                 # adding the name of x-axis
plt.ylabel("Marks scored in %")                              # adding the name of y-axis
plt.show()                                                   # specifies end of graph
# plot for the TEST data
plt.scatter(X_test, y_test, color='Green')
plt.plot(X_train, regressor.predict(X_train), color='Brown') # plotting the regression line
plt.title("Study in hours vs Marks_Scored (Testing set)")
plt.xlabel("Study in hours")
plt.ylabel("Marks scored in %")
plt.show()
Output:
stsc.csv(application/vnd.ms-excel)-214 bytes, last modified: 2/1/2022-100% done
y = b0 + b1x1 + b2x1²
36.73998103448068
• The nature of target or dependent variable is dichotomous, which means there would
be only two possible classes.
• In simple words, the dependent variable is binary in nature having data coded as
either 1 (stands for success/yes) or 0 (stands for failure/no).
• Logistic regression is used for predicting the categorical dependent variable using a
given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value.
• The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, logistic regression gives probabilistic values which lie between 0 and 1.
• Logistic regression is very similar to linear regression except in how the two are used.
• Linear regression is used for solving Regression problems, whereas Logistic
regression is used for solving the classification problems.
• The logistic regression primarily deals with binary output such as play versus not-play
or success versus failure.
• In logistic regression, multiple independent variables are mapped to a single
dependent variable. Popularly two types of logistic regression are found-binary and
multinomial.
• Binary logistic regression is used when the dependent variable partitions the output
class into two subsets and independent variables are found to be either categorical or
continuous.
• However, if the dependent variable divides the target class into more than two subsets
then the logistic regression is multinomial.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1). A logistic curve is an S-shaped
or sigmoid shape.
Sigmoid Function:
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• The sigmoid function maps any real value into another value within a range of
0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
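For reference, the sigmoid (logistic) function mentioned above is not written out in the text; its standard form is
σ(z) = 1 / (1 + e^(−z))
so that large positive values of z map close to 1, large negative values map close to 0, and z = 0 maps to exactly 0.5, which is the usual threshold value.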
• In logistic regression, we use the concept of a threshold value, which decides between the outcomes 0 and 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
• The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
• Logistic regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
• Logistic regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
• Fig. 2.28 shows the S-shaped logistic (sigmoid) function.
Fig. 2.28: The S-Curve of the Logistic Function with a Threshold Value of 0.5 (e.g., y = 0.8 lies above the threshold and y = 0.3 below it)
• The classes can be divided into positive or negative. The output comes under the
probability of positive class if it lies between 0 and 1.
• For our implementation, we are interpreting the output of hypothesis function as
positive if it is ≥ 0.5, otherwise negative.
Program for Logistic Regression:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

students = {'cmat': [680, 650, 690, 700, 640, 720, 680, 710, 730, 640, 620, 680, 700, 650, 760],
            'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3, 1.7, 2.7, 3.7, 3.7, 3.3, 3.3],
            'work_exp': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6, 4, 3],
            'status': [1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]}
# status is the admission status: 1 for admitted, 0 for not admitted
df = pd.DataFrame(students, columns=['cmat', 'gpa', 'work_exp', 'status'])
print(df)
X = df[['cmat', 'gpa', 'work_exp']]
y = df['status']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,
random_state=0)
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
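The remainder of this program falls on a page not reproduced here; judging from the seaborn and metrics imports above, it presumably evaluates the fitted model. The following lines are an assumed sketch of such an evaluation, reusing the variable names from the program above.
# Sketch (not from the original): confusion matrix of actual vs. predicted
# admission status drawn as a heatmap, followed by the overall accuracy.
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
plt.show()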
Fig. 2.29: Classification: Data Points Separated into Class A and Class B
• The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of classifiers:
1. Binary Classifier: If the classification problem has only two possible outcomes,
then it is called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
2. Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Classification Techniques:
1. Decision Tree Classification: Decision trees are also known as classification trees.
Decision trees approach the classification problem by partitioning the data into
purer subsets based on the values of the input attributes. The attributes that help
achieve the cleanest levels of such separation are considered significant in their
influence on the target variable and end up at the root and closer-to-root levels of
the tree. The output model is a tree framework that can be used for the prediction of new unlabeled data.
2. k-NN Classification: The k-NN (k-Nearest Neighbor) classification is based on the
principle that any objects in nature that have similarities tend to be in close
proximity to each other.
3. Support Vector Machine (SVM) Classification: SVM is one of the most widely
used classification algorithms both for separating linear or non-linear data. With
linear data, the objective is to find that linear hyperplane that separates the
instances of two different classes by the maximal distance apart.
4. Random Forest Classification: Another commonly used classification technique in supervised machine learning is Random Forest classification. Random Forest can be used for both classification and regression problems. The random forest, as the name indicates, is a forest, but a forest of what? It is a forest
2.66
Data Analytics Machine Learning Overview
of decision trees. As seen in Fig. 2.30, decision trees are generated with randomly
drawn instances from the training set. The trees are constructed as shown in the
algorithm. However, the final class of classification of the input instance is based
on the majority voting as shown in Fig. 2.30.
Fig. 2.30: Random Forest: Decision Trees (Tree 1 … Tree n) built from Bootstrap Samples of the Dataset; their Class Predictions (Class A, Class B) are Combined by Majority Voting into the Final Class
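The following is a minimal, illustrative sketch (not from the text) of a random forest classifier with scikit-learn; the data set and the number of trees are assumptions chosen only so the example runs.
# Minimal sketch of a random forest classifier with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is grown on a bootstrap sample, with a random subset
# of features considered at every split; the forest predicts by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))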
• Before explaining the different ways to implement clustering, the different types of
clusters have to be defined.
• Based on a data point’s membership to an identified group, a cluster can be:
1. Overlapping Clusters: These are also known as multi-view clusters. The cluster
groups are not exclusive and each data object may belong to more than one
cluster. For example, a customer of a company can be grouped in a high-profit
customer cluster and a high-volume customer cluster at the same time.
2. Exclusive or Strict Partitioning Clusters: Each data object belongs to exactly one cluster, as in the example shown in Fig. 2.31. This is the most common type of cluster.
3. Fuzzy or Probabilistic Clusters: Each data point belongs to all cluster groups with varying degrees of membership from 0 to 1. For example, in a dataset with clusters A, B, C and D, a data point can be associated with all the clusters with degrees A = 0.5, B = 0.1, C = 0.4 and D = 0. Instead of a definite association of a data point with one cluster, fuzzy clustering associates a probability of membership with every cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
4. Hierarchical Clusters: In this type of cluster, each child cluster can be merged to
form a parent cluster. For example, the most profitable customer cluster can be
further divided into a long-term customer cluster and a cluster with new
customers with high-value purchases.
Fig. 2.31: Exclusive (Strict Partitioning) Clusters: a Petal Width vs. Petal Length Plot showing Cluster 1, Cluster 2 and Cluster 3
k-Means Clustering:
• The k-means clustering technique identifies a cluster based on a central prototype
record.
• The k-Means clustering is a prototype-based clustering method where the dataset is
divided into k-clusters.
DBSCAN Clustering:
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base
algorithm for density-based clustering.
• Density-based clustering refers to unsupervised learning methods that identify
distinctive groups/clusters in the data.
• It is based on the idea that a cluster in data space is a contiguous region of high point
density, separated from other such clusters by contiguous regions of low point
density.
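The following is a minimal, illustrative sketch (not from the text) of DBSCAN with scikit-learn; the synthetic two-moons data and the eps and min_samples values are assumptions chosen only so the example runs.
# Minimal sketch of DBSCAN density-based clustering with scikit-learn.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Points with enough neighbours within eps form dense regions; contiguous
# dense regions become clusters, and isolated points are labelled -1 (noise).
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)
print("Clusters found:", len(set(labels) - {-1}))
print("Noise points  :", list(labels).count(-1))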
Self-Organizing Maps:
• A Self-Organizing Map (SOM) is a powerful visual clustering technique that evolved
from a combination of neural networks and prototype-based clustering.
• The SOM was introduced by Teuvo Kohonen in the 1980s. This technique is also
known as Kohonen networks or Kohonen map. SOM is sometimes also referred to by
a more specific name, Self-Organizing Feature Maps (SOFM).
• A SOM is a form of neural network where the output is an organized visual matrix,
usually a two-dimensional grid with rows and columns.
• Fig. 2.32 shows the typical framing of a Reinforcement Learning (RL) scenario.
• An agent takes actions in an environment, which is interpreted into a reward and a
representation of the state, which are fed back into the agent.
Fig. 2.32: Typical Framing of a Reinforcement Learning Scenario: the Agent performs an Action in the Environment; an Interpreter converts the result into a Reward and a State, which are fed back to the Agent
Fig. 2.32
• Fig. 2.33 shows a real-world example of Reinforcement Learning (RL).
Fig. 2.33: A Real-world Reinforcement Learning Example: the agent (1) observes the environment and (2) selects an action using its policy
7. Value is the expected long-term return with the discount factor applied, as opposed to the short-term reward. The value of a state is the total aggregated reward that the agent can expect to receive in the future if it starts from that state.
Advantages of Reinforcement Machine Learning:
1. RL is used to solve complex problems that cannot be solved by conventional
techniques.
2. The solutions obtained by RL are very accurate.
3. RL model will undergo a rigorous training process that can take time. This can
help to correct any errors.
4. Due to RL’s learning ability, it can be used with neural networks. This can be
termed as deep reinforcement learning.
5. When it comes to creating simulators, object detection in automatic cars, robots,
etc., reinforcement learning plays a great role in the models.
Disadvantages of Reinforcement Machine Learning:
1. RL needs a lot of data and a lot of computation.
2. Too much reinforcement learning can lead to an overload of states which can
diminish the results.
3. The RL algorithm is not preferable for solving simple problems; using it for simpler problems is not worthwhile.
4. RL needs a lot of data to feed the model, which consumes time and a lot of computational power.
5. RL models require a lot of training data to develop accurate results.
6. When it comes to building RL models on real-world examples, the maintenance
cost is very high.
PRACTICE QUESTIONS
Q. I Multiple Choice Questions:
1. Which is the process of converting experience into expertise or knowledge?
(a) Learning (b) Writing
(c) Listening (d) All of the mentioned
2. Machine Learning (ML) techniques are broadly categorized into,
(a) supervised ML (b) unsupervised ML
(c) reinforcement ML (d) All of the mentioned
3. Which is the simulation of human intelligence processes by machines, especially
computer systems?
(a) Machine Learning (ML) (b) Artificial Intelligence (AI)
(c) Deep Learning (d) All of the mentioned
15. Which methods are techniques that create multiple models and then combine
them to produce improved/accurate results?
(a) Machine dependant (b) Ensemble
(c) Computer machine independent (d) All of the mentioned
16. Which is a form of data analysis that extracts models describing important data
classes?
(a) Regression (b) Binary classifier
(c) Classification (d) None of the mentioned
17. Boosting takes many forms, including
(a) AdaBoost (b) XGBoost
(c) Gradient boosting (d) All of the mentioned
18. Which clustering partitions the data based on variation in the density of records in
a dataset?
(a) HIERARCHICALSCAN (b) TRESCAN
(c) DBSCAN (d) None of the mentioned
Answers
1. (a) 2. (d) 3. (b) 4. (c) 5. (a) 6. (d) 7. (c) 8. (b) 9. (a) 10. (d)
11. (c) 12. (b) 13. (a) 14. (d) 15. (b) 16. (c) 17. (d) 18. (c)
Q. II Fill in the Blanks:
1. _______ learning algorithms build a model based on sample data, known as training
data, in order to make predictions or decisions.
2. Reinforcement learning is a _______ -based Machine Learning (ML) technique.
3. _______ learning is a subset of ML and ML is a subset of AI.
4. The ultimate goal of _______ is to make machines as intelligent as humans.
5. Unsupervised learning is a learning method in which a machine learns _______ any
supervision.
6. Classification refers to process of predicting _______ output values for an input.
7. In regression problems the task of machine learning model is to _______ a
continuous value.
8. Speech recognition is a process of converting _______ instructions into text and is known as computer speech recognition.
9. A _______ consists of constructs of information called features or predictors and a
target or response variable.
10. _______ of the model is extremely important because it determines whether the model works in real-life conditions.
11. The _______ data means some input data is already tagged with the correct output.
12. _______ learning is a process of providing input data as well as correct output data
to the machine learning model.
13. The _______ tree is a graphical representation for getting all the possible solutions
to a problem/decision based on given conditions.
14. _______ is the gap between two lines on the closet data points of different classes.
15. _______ Bayes algorithm is a supervised learning algorithm
16. Unsupervised learning is a machine learning technique in which models are not
supervised using _______ dataset.
17. _______ cluster analysis is used for exploratory data analysis in order to find the
hidden patterns.
18. An _______ rule is an unsupervised learning method which is used for finding the
relationships between variables in the large database.
19. Clustering is a method of grouping the objects into clusters such that objects with
most _______ remains into a group and has less or no similarities with the objects of
another group.
20. The _______ algorithm involves telling the algorithm how many possible clusters (or k) there are in the dataset.
21. Basket Data Analysis is to analyze the _______ of purchased items in a single basket
or single purchase.
22. Apriori algorithm uses _______ datasets to generate association rules.
23. Anomaly or _______ detection identifies the data points that are significantly
different from other data points in a dataset.
24. _______ -supervised learning falls between unsupervised learning (with no labeled
training data) and supervised learning (with only labeled training data).
25. Regression analysis is a set of _______ processes for estimating the relationships
among variables.
26. Linear regression shows the _______ relationship, which means it finds how the
value of the dependent variable is changing according to the value of the
independent variable.
27. The dataset used in Polynomial regression for training is of _______ -linear nature.
28. _______ regression is a supervised learning classification algorithm used to predict
the probability of a target variable.
29. The _______ function maps any real value into another value within a range of
0 and 1.
30. Random _______ can be built using bagging in tandem with random attribute
selection.
31. The algorithm which implements the classification on a dataset is known as a
_______.
11. The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
12. Classification algorithms are used when the output variable is categorical, which
means there are two classes such as Yes-No, Male-Female, True-false, etc.
13. If the classification problem has only two possible outcomes, then it is called as
Binary Classifier.
14. If a classification problem has more than two outcomes, then it is called as Multi-
class Classifier.
15. Clustering is supervised learning techniques in machine learning.
16. Decision trees are also known as classification trees.
17. Support Vector Machine (SVM) classification is used for separating linear or non-
linear data.
18. The k-means clustering is a hierarchical-based clustering method where the
dataset is divided into k-clusters.
19. Hierarchical clustering is a process where a cluster hierarchy is created based on
the distance between data points.
Answers
1. (T) 2. (F) 3. (T) 4. (T) 5. (T) 6. (T) 7. (T) 8. (F) 9. (T) 10. (T)
11. (T) 12. (T) 13. (T) 14. (T) 15. (F) 16. (T) 17. (T) 18. (F) 19. (T)
Q. IV Answer the following Questions:
(A) Short Answer Questions:
1. Define machine learning.
2. Define deep learning?
3. List types of machine learning.
4. Enlist three parameters for machine learning.
5. Define classification and regression.
6. Define reinforcement machine learning.
7. State any two uses of machine learning.
8. Define Neural Networks (NNs).
9. Define Artificial intelligence (AI).
10. List AI applications. Any two.
11. Define model.
12. Define supervised machine learning.
13. Give purpose of k-NN algorithm.
14. Define decision tree.
15. What is the purpose of SVM?
16. Give use of Naïve Bayes.
CHAPTER 3
Mining Frequent Patterns, Associations and Correlations
3.0 INTRODUCTION
• In today's information age, a huge amount of data is available and it is increasing day by day. This data is of no use until it is converted into useful information.
• It is necessary to analyze this huge amount of data and extract useful information from it.
• Data mining is a technique for extracting information from huge sets of data. Data mining is the procedure of mining knowledge from data.
Overview of Data Mining:
• We live in a world where vast amounts of data are collected daily. Analyzing such data
is an important need. Data mining can meet this need by providing tools to discover
knowledge from data.
• Data mining is an interdisciplinary subfield of computer science and statistics with an
overall goal to extract information (with intelligent methods) from a data set and
transform the information into a comprehensible structure for further use.
• Data mining is defined as "extracting or mining knowledge from massive amounts of data."
• Some people view data mining as an essential step in the process of knowledge
discovery.
• The steps involved in Fig. 3.1 are explained below:
Step 1: Data Cleaning: In this step, noise and inconsistent data are removed and/or cleaned.
Fig. 3.1: Data Mining as a Step in the Knowledge Discovery Process (cleaning and integration → data warehouse → selection and transformation → data mining → patterns → evaluation and presentation → knowledge)
• A substructure can refer to different structural forms (e.g., graphs, trees, or lattices)
that may be combined with itemsets or subsequences.
• If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining
frequent patterns leads to the discovery of interesting associations and correlations
within data.
Mining of Association:
• Associations are used in retail sales to identify patterns that are frequently purchased
together.
• Association mining refers to the process of uncovering relationships among data and determining association rules.
• For example, a retailer may generate an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations:
• It is a kind of additional analysis performed to uncover interesting statistical
correlations between associated-attribute-value pairs or between two item sets to
analyze that if they have positive, negative or no effect on each other.
Fig. 3.2: A Classification Model can be represented in various Forms: (a) IF-THEN Rules, (b) A Decision Tree, (c) Neural Network
• Regression also encompasses the identification of distribution trends based on the
available data.
• Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process.
• Such attributes will be selected for the classification and regression process. Other
attributes, which are irrelevant, can then be excluded from consideration.
• Classification predicts the class of objects whose class label is unknown. Its objective is
to find a derived model that describes and distinguishes data classes or concepts.
• The derived model is based on the analysis set of training data i.e. the data object
whose class label is well known.
• Prediction is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction.
• Prediction can also be used for identification of distribution trends based on available
data.
3.1.4 Cluster Analysis
• Cluster refers to a group of similar kind of objects. Clustering is the process of
identifying the natural groupings in a dataset.
• Cluster analysis refers to forming group of objects that are very similar to each other
but are highly different from the objects in other clusters.
• Cluster analysis has been widely used in many applications such as Business
Intelligence (BI), Image pattern recognition, Web search, Security and so on.
o In business intelligence, clustering can be used to organize a large number of
customers into groups, where customers within a group share strong similar
characteristics. This facilitates the development of business strategies for
enhanced customer relationship management.
• However, in some applications (e.g., fraud detection) the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is
referred to as outlier analysis or anomaly mining.
• Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that are
remote from any other cluster are considered outliers.
• Rather than using statistical or distance measures, density-based methods may
identify outliers in a local region, although they look normal from a global statistical
distribution view.
• For example, outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of unusually large amounts for a given account number in
comparison to regular charges incurred by the same account. Outlier values may also
be detected with respect to the locations and types of purchase or the purchase
frequency.
• Suppose that we are given the Electronic store relational database relating to purchases. A data mining system may find association rules like:
age(X, "20..29") ∧ income(X, "20K..29K") ⇒ buys(X, "laptop")  [support = 2%, confidence = 60%]
• The rule indicates that of the customers under study, 2% are 20 to 29 years of age with an income of 20,000 to 29,000 and have purchased a laptop at the Electronic store.
• There is a 60% probability that a customer in this age and income group will purchase a laptop.
• Note that there is an association between more than one attribute or predicate (i.e.,
age, income, and buys).
• Adopting the terminology used in multidimensional databases, where each attribute is
referred to as a dimension, the above rule can be referred to as a multidimensional
association rule.
• Typically, association rules are considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold.
• Additional analysis can be performed to uncover interesting statistical correlations
between associated attribute-value pairs.
Association Rule Metrics/Measures:
• Various metrics are in place to help us understand the strength of association between
these two itemsets.
1. Support: The support of a rule x → y (where x and y are each items/events etc.) is
defined as the proportion of transactions in the data set which contain the item set x
as well as y. So,
Support (x → y) = (Number of transactions which contain the itemsets x and y) / (Total number of transactions)
Whereas,
Support (x) = (Number of transactions which contain the itemset x) / (Total number of transactions)
2. Confidence: The confidence of a rule x → y is defined as:
Confidence (x → y) = Support (x → y) / Support (x)
• So, it is the ratio of the number of transactions that include all items in the consequent
(y in this case), as well as the antecedent (x in this case) to the number of transactions
that include all items in the antecedent (x in this case).
• For example, Support (milk → bread) = 0.4 means that milk and bread are purchased together in 40% of all transactions. Confidence (milk → bread) = 0.5 means that if there are 100 transactions containing milk, then 50 of them will also contain bread.
(v) AB → D
Support = 3/4 = 75%
Confidence = (3/4) / (4/4) = 75%
(vi) D → AB
Support = 3/4 = 75%
Confidence = (3/4) / (3/4) = 100%
(vii) AD → B
Support = 3/4 = 75%
Confidence = (3/4) / (3/4) = 100%
(viii) B → AD
Support = 3/4 = 75%
Confidence = (3/4) / (4/4) = 75%
(ix) BD → A
Support = 3/4 = 75%
Confidence = (3/4) / (3/4) = 100%
(x) A → BD
Support = 3/4 = 75%
Confidence = (3/4) / (4/4) = 75%
Lift:
• Lift measures dependency between X and Y.
Lift (x → y) = Support (x → y) / (Support (x) × Support (y)) = Confidence (x → y) / Support (y)
• The numerator for lift is the proportion of transactions where x and y occur jointly.
The denominator is an estimate of the expected joint occurrence of x and y, assuming
that they occur independently.
• A lift value of 1 indicates that x and y jointly occur in transactions with the frequency
that would be expected by chance alone.
• Values much larger or much smaller than 1 are the interesting ones: values less than 1 mean x and y are negatively correlated, while values greater than 1 indicate positive correlation.
• Lift is vulnerable to noise in small datasets, because infrequent itemsets can have very high lift values.
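• Continuing the hypothetical sketch above (reusing its support() helper and transactions list), lift can be computed as follows; the value of about 0.83 for milk → bread is below 1, indicating a slight negative correlation in that toy data:
def lift(antecedent, consequent, transactions):
    # Lift(x -> y) = Support(x U y) / (Support(x) * Support(y)).
    return (support(antecedent | consequent, transactions) /
            (support(antecedent, transactions) * support(consequent, transactions)))

print(lift({"milk"}, {"bread"}, transactions))  # ~0.83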
Fig. 3.4: Relationship between Frequent Itemsets, Closed Frequent Itemsets and Maximal Frequent Itemsets (maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets)
• Closed frequent itemsets are more widely used than maximal frequent itemsets because, when efficiency is more important than space, they provide us with the support of their subsets, so no additional pass is needed to find this information.
• The analytical process that finds frequent itemset and associations from data sets is
called frequent pattern mining or association rule mining.
• Let us see an example to understand the concepts of frequent itemset, closed itemset and maximal itemset. Suppose we have four documents in corpus C = {D1, D2, D3, D4}; documents D1, D2, D3 and D4 are represented by sets of terms as depicted in Table 3.1. Table 3.2 shows all itemsets generated from the corpus with their support.
• In the given example, suppose 50% is the threshold value (minimum support) for an itemset to be frequent. Table 3.3 shows the list of frequent itemsets for corpus C.
Example 1: If the total no. of items is 7, then how many possible itemsets can be generated?
Solution: The total no. of itemsets which can be generated using 7 items is 2^7 = 128.
Transaction ID Items
1 A,B,C,E
2 A,C,D,E
3 B,C,E
4 A,C,D,E
5 C,D,E
6 A,D,E
• Let us say min_support = 0.5. This is fulfilled if min_support_count >= 3.
o A frequent itemset X ∈ F is maximal if it does not have any frequent supersets.
o A frequent itemset X ∈ F is closed if it has no superset with the same frequency.
Find Frequent Item Set:
• A frequent itemset is simply a set of items occurring a certain percentage of the time. In a dataset, an itemset is considered frequent if its frequency/occurrence is equal to or more than min_support % (or min_support_count).
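• A minimal Python sketch (using the six-transaction dataset shown above and min_support_count = 3) that enumerates the frequent itemsets and flags which of them are closed and which are maximal:
from itertools import combinations

transactions = [
    {"A", "B", "C", "E"},
    {"A", "C", "D", "E"},
    {"B", "C", "E"},
    {"A", "C", "D", "E"},
    {"C", "D", "E"},
    {"A", "D", "E"},
]
min_count = 3
items = sorted(set().union(*transactions))

def count(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if set(itemset) <= t)

# Enumerate all frequent itemsets with their support counts.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        c = count(cand)
        if c >= min_count:
            frequent[frozenset(cand)] = c

for itemset, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    supersets = [s for s in frequent if itemset < s]
    is_maximal = len(supersets) == 0                      # no frequent superset
    is_closed = all(frequent[s] < c for s in supersets)   # no superset with the same count
    print(sorted(itemset), c, "closed" if is_closed else "", "maximal" if is_maximal else "")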
• Following Table 3.4 is showing the all frequent item set on shopping dataset.
Table 3.4: Frequent Itemsets
Itemsets Frequency/Occurrence Decision
1-itemset {A} = 4 Frequent.
{B} = 2 Not frequent => ignore.
{C} = 5 Frequent.
{D} = 4 Frequent.
{E} = 6 Frequent.
2-itemset {A,B} = 1 Not frequent => ignore.
{A,C} = 3 Frequent.
{A,D} = 3 Frequent.
{A,E} = 4 Frequent.
{B,C} = 2 Not frequent => ignore.
{B,D} = 0 Not frequent => ignore.
{B,E} = 2 Not frequent => ignore.
{C,D} = 3 Frequent.
{C,E} = 5 Frequent.
{D,E} = 4 Frequent.
3-itemset {A,B,C} = 1 Not frequent => ignore.
{A,B,D} = 0 Not frequent => ignore.
{A,B,E} = 1 Not frequent => ignore.
{A,C,D} = 2 Not frequent => ignore.
{A,C,E} = 3 Frequent
{A,D,E} = 3 Frequent
{B,C,D} = 0 Not frequent => ignore
{B,C,E} = 2 Not frequent => ignore
{C,D,E} = 3 Frequent
4-itemset {A,B,C,D} = 0 Not frequent => ignore
{A,B,C,E} = 1 Not frequent => ignore
{B,C,D,E} = 0 Not frequent => ignore
• Following Table 3.6 is showing the all maximal frequent item set on shopping Dataset.
Table 3.6: Maximal Itemsets
Item Sets   Frequency/Occurrence   Decision
1-itemset {A} = 4 Not Maximal due to its frequent superset.
{C} = 5 Not Maximal frequent due to its frequent superset.
{D} = 4 Not Maximal frequent due to its frequent superset.
{E} = 6 Not Maximal frequent due to its frequent superset.
2-itemset {A,C} = 3 Not Maximal frequent due to its frequent superset.
{A,D} = 3 Not Maximal frequent due to its frequent superset.
{A,E} = 4 Not Maximal frequent due to its frequent superset.
{C,D} = 3 Not Maximal frequent due to its frequent superset.
{C,E} = 5 Not Maximal frequent due to its frequent superset.
{D,E} = 4 Not Maximal frequent due to its frequent superset.
3-itemset {A,C,E} = 3 Maximal frequent.
{A,D,E} = 3 Maximal Frequent.
{C,D,E} = 3 Maximal Frequent.
• Association rules are statements used to find relationships between seemingly unrelated data in a database. These rules are useful for analyzing and predicting customer behavior.
• Support and confidence are the two measures used for generating association rules. The frequent items are selected using a support threshold and a confidence threshold. The values for these thresholds are predefined by the users.
• Mining association rules consist of following two-step approach:
1. Frequent Itemset Generation: Generate all itemsets whose support ≥ minsup.
2. Rule Generation: Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset.
• Initially, the transactional dataset (also called the 'corpus') is scanned to uncover the set of 1-itemsets (singleton sets). 1-itemsets which do not satisfy the minimum support condition are removed.
• The consequential set is represented by S1. Now, S1 is utilized to uncover S2, the set of frequent 2-itemsets, which is utilized to uncover S3, and so on.
• The process continues until no new k-itemsets can be determined.
Algorithm for Apriori:
Input: D, database.
min_sup: Minimum Support Threshold.
Output: L, Frequent itemsets in database.
1. Scan the database to determine the support of each one-itemset, compare it with min_sup and get the frequent one-itemsets (L1).
2. Use the Apriori property: join L(k-1) with itself to find the candidate k-itemsets (Ck).
3. Scan the database to find the support of each candidate k-itemset, compare it with min_sup and get the frequent k-itemsets (Lk).
4. Repeat Steps 2 to 3 until the candidate itemset is null. When it is null, generate all subsets of each frequent itemset (for rule generation).
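• A minimal, illustrative Python implementation of the above steps (a sketch, not the book's code; it works on a list of transaction sets and returns every frequent itemset with its support count):
from itertools import combinations

def apriori(transactions, min_count):
    # Step 1: count 1-itemsets and keep the frequent ones (L1).
    items = sorted(set().union(*transactions))
    Lk = {}
    for i in items:
        c = sum(1 for t in transactions if i in t)
        if c >= min_count:
            Lk[frozenset([i])] = c
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Step 2: join L(k-1) with itself to form candidate k-itemsets (Ck) ...
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    candidates.add(union)
        # ... and prune candidates with an infrequent (k-1)-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Step 3: scan the database, count each candidate, keep the frequent ones (Lk).
        Lk = {}
        for cand in candidates:
            c = sum(1 for t in transactions if cand <= t)
            if c >= min_count:
                Lk[cand] = c
        frequent.update(Lk)
        k += 1  # Step 4: repeat until no candidate itemsets remain.
    return frequent

# Hypothetical small run with minimum support count = 2.
data = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(data, 2))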
Example 1: Consider, total number of transactions is 15 and Min Support = 20%.
TID List of Items
1 A1, A5, A6, A8
2 A2, A4, A8
3 A4, A5, A7
4 A2, A3
5 A5, A6, A7
6 A2, A3, A4
7 A2, A6, A7, A9
8 A5
9 A8
10 A3, A5, A7
11 A3, A5, A7
12 A5, A6, A8
13 A2, A4, A6, A7
14 A1, A3, A5, A7
15 A2, A3, A9
Itemset   SupportCount
A2   6
A3   6
A4   4
A5   8
A6   5
A7   7
A8   4
Scan 2: Calculate SupportCount for frequent two itemset (C2) using L1.
Itemset   SupportCount   Itemset   SupportCount
{A2, A3} 3 {A4, A5} 1
{A2, A4} 3 {A4, A6} 1
{A2, A5} 0 {A4, A7} 2
{A2, A6} 2 {A4, A8} 1
{A2, A7} 2 {A5, A6} 3
{A2, A8} 1 {A5, A7} 5
{A3, A4} 1 {A5, A8} 2
{A3, A5} 3 {A6, A7} 3
{A3, A6} 0 {A6, A8} 2
{A3, A7} 3 {A7, A8} 0
{A3, A8} 0
Prune 2: Prune C2 by comparing each support count with the minimum support count, i.e., 3, and keep in L2 those itemsets whose SupportCount ≥ 3.
Itemset SupportCount
{A2, A3} 3
{A2, A4} 3
{A3, A5} 3
{A3, A7} 3
{A5, A6} 3
{A5, A7} 5
{A6, A7} 3
Scan 3: Calculate SupportCount for the three-itemset candidates (C3) using L2.
Itemset SupportCount
{A2, A3, A4} 1
{A3, A5, A7} 3
{A5, A6, A7} 1
Prune 3: Prune C3 by comparing each support count with the minimum support count, i.e., 3, and keep in L3 those three-itemsets whose SupportCount ≥ 3.
Itemset SupportCount
{A3, A5, A7} 3
At the end, {A3, A5, A7} represents the frequent itemset in the transactions.
Example 2: Trace the results of Apriori algorithm for following transaction with
minimum support threshold 3.
TID   Items Bought
1 M, T, B
2 E, T, C
3 M, E, T, C
4 E, C
5 J
From this calculate the frequency of all items,
Items Bought   Support
M 2
E 3
T 3
C 3
J 1
B 1
• In general, confidence does not have an anti-monotone property (as already explained in the Apriori algorithm); c(ABC → D) can be larger or smaller than c(AB → D).
• But the confidence of rules generated from the same itemset does have an anti-monotone property. For example,
L = {A, B, C, D} : c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• If we follow this anti-monotone property, then we can prune most of the low-confidence rules.
• We start with a frequent itemset {A,B,C,D} and start forming rules with just one
consequent. Remove the rules failing to satisfy the minconf condition.
• Now, start forming rules using combinations of consequents from the remaining ones. Keep repeating until only one item is left in the antecedent. This process has to be done for all frequent itemsets.
• In the Apriori algorithm, at the end of the example, we get a frequent itemset X = {A3, A5, A7}, but the task of rule generation is still not achieved.
• Association rules can be generated from X. The non-empty subsets of X are {A3, A5,
A7}, {A3, A5}, {A5, A7}, {A3, A7}, {A3}, {A5} and {A7}. The resulting association rules
are as shown below, each listed with its confidence.
Association Rules   Confidence = Sup(X ∪ Y) / Sup(X)
{A3, A5} → A7 3/3 = 100%
{A3, A7} → A5 3/3 = 100%
{A5, A7} → A3 3/5 = 60%
A3 → {A5, A7} 3/6 = 50%
A5 → {A3, A7} 3/8 = 37.5%
A7 → {A3, A5} 3/7 = 42.8%
• Now, as per the confidence values, we can select the rules from the frequent itemset. For example, if the confidence threshold is 75%, then two rules will be generated from the frequent itemset {A3, A5, A7}:
{A3, A5} → A7
{A3, A7} → A5
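• A minimal sketch of this rule-generation step in Python (using the support counts obtained above for {A3, A5, A7} and its subsets; not the book's code):
from itertools import combinations

support_count = {
    frozenset(["A3"]): 6, frozenset(["A5"]): 8, frozenset(["A7"]): 7,
    frozenset(["A3", "A5"]): 3, frozenset(["A3", "A7"]): 3, frozenset(["A5", "A7"]): 5,
    frozenset(["A3", "A5", "A7"]): 3,
}
X = frozenset(["A3", "A5", "A7"])
min_conf = 0.75

for r in range(1, len(X)):
    for ante in combinations(sorted(X), r):
        antecedent = frozenset(ante)
        consequent = X - antecedent
        conf = support_count[X] / support_count[antecedent]
        if conf >= min_conf:
            print(sorted(antecedent), "->", sorted(consequent), f"confidence = {conf:.0%}")
# Prints only {A3, A5} -> {A7} and {A3, A7} -> {A5}, both with 100% confidence.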
• Since the Apriori algorithm was first introduced, and as experience has accumulated, there have been many attempts to devise more efficient algorithms for frequent itemset mining, including approaches such as hash-based techniques, partitioning, sampling, and using the vertical data format.
• Several refinements have been proposed that focus on reducing the number of
database scans, the number of candidate itemsets counted in each scan or both.
• “How can we further improve the efficiency of Apriori-based mining?” Many
variations of the Apriori algorithm have been proposed that focus on improving the
efficiency of the original algorithm.
• The Apriori algorithm is the most classic association rule mining algorithm.
Association rule mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets and the discovery of interesting and
valuable correlation relationship among huge amounts of transaction records.
• The objective of association rule mining is to find rules indicating that the occurrence of one event can lead to the occurrence of another.
• Some improvement techniques for the Apriori algorithm are given below:
1. Hash-based technique (hashing itemsets into corresponding buckets):
A hash-based technique can be used to reduce the size of the candidate k-itemsets,
Ck, for k > 1.
2. Transaction reduction (reducing the number of transactions scanned in future iterations): The goal of the transaction reduction technique is to reduce the number of transactions scanned in future iterations by discarding those that cannot contain any frequent itemset. A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration because subsequent database scans for j-itemsets, where j > k, will not need to consider such a transaction.
3. Partitioning (partitioning the data to find candidate itemsets): A partitioning
technique can be used that requires just two database scans to mine the frequent
itemsets. In data set partitioning technique, data or a set of transactions is
partitioned into smaller segments for the purpose of finding candidate itemsets.
4. Sampling (mining on a subset of the given data): Sampling is useful when efficiency matters more than accuracy. It is based on mining a subset of the given data. The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D (a small sketch of this idea appears after this list). In this way, we trade off some degree of accuracy against
efficiency. The S sample size is such that the search for frequent itemsets in S can
be done in main memory, and so only one scan of the transactions in S is required
overall. Because we are searching for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets.
5. Dynamic itemset counting (adding candidate itemsets at different points during a scan): In the dynamic itemset counting technique, candidate itemsets are added at different start points during a scan, provided all of their subsets are estimated to be frequent. A dynamic itemset counting technique was proposed in which the
database is partitioned into blocks marked by start points. In this variation, new
candidate itemsets can be added at any start point, unlike in Apriori, which
determines new candidate itemsets only immediately before each complete
database scan. The technique uses the count-so-far as the lower bound of the
actual count. If the count-so-far passes the minimum support, the itemset is added
into the frequent itemset collection and can be used to generate longer candidates.
This leads to fewer database scans than with Apriori for finding all the frequent
itemsets.
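• A minimal sketch of the sampling idea from point 4 above (it assumes the illustrative apriori() helper defined earlier and a list of transaction sets):
import random

def sample_and_mine(transactions, min_support, sample_fraction=0.3, seed=42):
    random.seed(seed)
    sample_size = max(1, int(len(transactions) * sample_fraction))
    S = random.sample(transactions, sample_size)
    # A slightly lowered threshold on the sample reduces the chance of missing
    # globally frequent itemsets.
    min_count = max(1, int(min_support * sample_size * 0.9))
    return apriori(S, min_count)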
• Although above improvements were used to improve the efficiency of Apriori
algorithm, reduce the size of candidate itemsets and lead to good performance gain,
still they have following two limitations:
1. It is difficult to handle a large number of candidate itemsets.
2. It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching.
• It then divides the compressed database into a set of conditional databases (a special
kind of projected database), each associated with one frequent item or “pattern
fragment” and mines each database separately.
• For each “pattern fragment,” only its associated data sets need to be examined.
Therefore, this approach may substantially reduce the size of the data sets to be
searched, along with the “growth” of patterns being examined.
• The FP-growth algorithm is used to mine the complete set of frequent itemsets. It
creates FP tree to compress a large dataset.
• In FP tree nodes, frequent items are arranged in such a manner that more frequently
occurring nodes have better chances of sharing nodes than the less frequently
occurring ones.
• The FP-growth algorithm uses a divide-and-conquer method to decompose the mining task and then avoids candidate generation by considering only sub-databases (conditional databases).
• The FP-growth method performance shows that it is efficient and scalable for mining
both long and short frequent patterns, and is about an order of magnitude faster than
the Apriori algorithm.
• FP-growth algorithm preserves whole information for frequent pattern mining. FP-
growth algorithm is as follows:
1. Construct the Conditional Pattern Base for Each Node in the FP-Tree:
(i) Start at the frequent-item header table of the FP-tree.
(ii) Traverse the FP-tree by following the link of each frequent item.
(iii) Accumulate all of the transformed prefix paths of that item to form a conditional pattern base.
2. Construct a Conditional FP-Tree from each Conditional Pattern Base:
(i) Accumulate the count for each item in the base.
(ii) Construct the FP-tree for the frequent items of the pattern base.
3. Recursively Mine the Conditional FP-Trees and Grow the Frequent Patterns Obtained so Far:
(i) If the conditional FP-tree contains a single path, simply enumerate all the patterns.
Example 1: Consider the following transaction database D.
Database D (Transaction ID, Items, Item Count)
After that, all the items are arranged in sequential (priority) order and the FP-tree is constructed. Then, considering the items as suffixes for database D, the items in each transaction are arranged according to this priority.
Priority Table:
Item   Count
I5   4
I7   3
I2   3
I3   3
I4   3
(FP-tree constructed from the priority table, rooted at a Null node.)
After constructing the FP-tree from the priority table, each item except I5 (the highest-priority item) is then mined; the mining of the FP-tree is summarized in Table 3.7.
Table 3.7: FP-Tree of Priority Table
Item   Conditional Pattern Base   Conditional FP-Tree   Frequent Pattern Generated
I4 (I1, I2, I3 : 1), (I2, I3 : 1), (I1 : 1) (I2, I3 : 2), (I1 : 2) (I2, I3, I4 : 1) (I1, I4 : 2)
I3 (I1, I2 : 2) (I2 : 1) (I1, I2 : 2), (I2 : 3) (I1, I2, I3 : 3) (I2, I3) : 3
I2 (I1 : 2) (I1 : 2) (I1, I2 : 2)
I1 (I5 : 3) (I5 : 3) (I5, I1 : 3)
Example 2: Generate the FP tree for following transaction dataset with minimum
support 3.
TID   Items Bought
100 F, A, C, D, G, I, M, P
200 A, B, C, F, L, M, O
300 B, F, H, J, O
400 B, C, K, S, P
500 A, F, C, E, L, P, M, N
Solution: First calculate the frequency of each item.
Item Frequency Item Frequency
A 3 J 1
B 3 K 1
C 4 L 2
D 1 M 3
E 1 N 1
F 4 O 2
G 1 P 3
H 1 S 1
I 1
Now the frequent pattern set (L) is built, which will contain all elements whose frequency is greater than or equal to 3.
L = {(F : 4), (C : 4), (A : 3), (B : 3), (M : 3), (P : 3)}
Next, we will build the ordered frequent items for each transaction.
TID   Items Bought   Ordered Frequent Items
100   F, A, C, D, G, I, M, P   F, C, A, M, P
200   A, B, C, F, L, M, O   F, C, A, B, M
300   B, F, H, J, O   F, B
400   B, C, K, S, P   C, B, P
500   A, F, C, E, L, P, M, N   F, C, A, M, P
Now we will build the data structure (FP-tree) for the first transaction.
Fig. 3.5: FP-tree after inserting the first transaction (Root → F:1 → C:1 → A:1 → M:1 → P:1)
For the next transaction, we will increase the frequency count of the items already present and add the new items to the data structure. So, for TID 200:
Fig. 3.6: FP-tree after inserting TID 200 (Root → F:2 → C:2 → A:2, branching into M:1 → P:1 and B:1 → M:1)
Like this, we will increase the frequency counts of the items for the next transactions. After all transactions are inserted, the final tree is shown in Fig. 3.7.
Fig. 3.7: Final FP-tree (Root → F:4 → C:3 → A:3 → M:2 → P:2, with additional branches A:3 → B:1 → M:1, F:4 → B:1, and Root → C:1 → B:1 → P:1)
From this we will build the conditional pattern base, starting from the bottom to the top.
Item   Conditional Pattern Base
P {{F, C, A, M : 2}, {C, B : 1}}
M {{F, C, A : 2}, {F, C, A, B : 1}}
B {{F, C, A : 1}, {F : 1}, {C : 1}}
A {{F, C : 3}}
C {{F : 3}}
F φ
Item   Conditional Pattern Base   Conditional FP-Tree
P {{F, C, A, M : 2}, {C, B : 1}} {C : 3}
M {{F, C, A : 2}, {F, C, A, B : 1}} {F, C, A : 3}
B {{F, C, A : 1}, {F : 1}, {C : 1}} φ
A {{F, C : 3}} {F, C : 3}
C {{F : 3}} {F : 3}
F φ φ
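• The output below was produced with the pyfpgrowth Python library. A minimal sketch of the kind of program that generates such output is given here (the transaction list is hypothetical and is shown only for illustration):
import pyfpgrowth

transactions = [
    ["Milk", "Saffron", "Bournvita"],
    ["Milk", "Saffron"],
    ["Bournvita", "Saffron", "Wafer"],
    ["Bournvita", "Wafer"],
]

# Mine all patterns occurring at least once, then derive rules with confidence >= 0.5.
patterns = pyfpgrowth.find_frequent_patterns(transactions, 1)
rules = pyfpgrowth.generate_association_rules(patterns, 0.5)

print("Frequent Pattern")
print(patterns)
print("Rules")
print(rules)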
Output:
Requirement already satisfied: pyfpgrowth in /usr/local/lib/python3.7/dist-packages (1.0).
Frequent Pattern
{('Milk',): 2, ('Milk', 'Saffron'): 2, ('Bournvita', 'Milk'): 1,
('Bournvita', 'Milk', 'Saffron'): 1, ('Wafer',): 2, ('Saffron',
'Wafer'): 1, ('Bournvita', 'Wafer'): 2, ('Bournvita', 'Saffron',
'Wafer'): 1, ('Bournvita',): 3, ('Saffron',): 3, ('Bournvita',
'Saffron'): 2}
Rules
{('Bournvita',): (('Saffron',), 0.6666666666666666),
('Bournvita', 'Milk'): (('Saffron',), 1.0),
('Bournvita', 'Saffron'): (('Wafer',), 0.5),
('Bournvita', 'Wafer'): (('Saffron',), 0.5),
('Milk',): (('Bournvita', 'Saffron'), 0.5),
('Milk', 'Saffron'): (('Bournvita',), 0.5),
('Saffron',): (('Bournvita',), 0.6666666666666666),
('Saffron', 'Wafer'): (('Bournvita',), 1.0),
('Wafer',): (('Bournvita', 'Saffron'), 0.5)}
Advantages of FP-growth Algorithm:
1. The FP-growth algorithm scans the database only twice which helps in decreasing
computation cost.
2. The FP-growth algorithm uses divide and conquer method so the size of
subsequent conditional FP-tree is reduced.
3. The FP-growth method transforms the problem of finding long frequent patterns
into searching for shorter ones in much smaller conditional databases recursively.
Disadvantages of FP-growth Algorithm:
1. The FP-growth algorithm is difficult to be used in an interactive mining process as
users may change the support threshold according to the rules which may lead to
repetition of the whole mining process.
2. The FP-growth algorithm is not suitable for incremental mining.
3. When the dataset is large, it is sometimes unrealistic to construct a main memory
based FP-tree.
PRACTICE QUESTIONS
Q. I Multiple Choice Questions:
1. Which is the process as extracting information from huge sets of data?
(a) Data Mining (b) Big Data Mining
(c) Data Processing (d) None of the mentioned
12. Consider the itemset {p,q,r,s}. Which of the following statements is always true?
(a) confidence(pqr → s) ≥ confidence (pq → rs)
(b) confidence(pqr → s) ≥ confidence (pq → s)
(c) confidence(pqr → s) ≤ confidence (pq → rs)
(d) confidence(pqr →s) ≤ confidence(pq→s)
13. Consider three itemsets V1 = {oil, soap, toothpaste}, V2 = {oil,soap}, V3 = {oil}. Which
of the following statements are correct?
(a) support(V1) > support(V2) (b) support(V3) > support(V2)
(c) support(V1) > support(V3) (d) support(V2) > support(V3)
14. In the following data table, if the support threshold is (greater than or equal to) 0.2
the frequent 4-itemsets are:
Transaction ID Itemsets
1 {a, b, d, e}
2 {b, c, d}
3 {a, b, d, e}
4 {a, c, d, e}
5 {b, c, d, e}
6 {b, d, c}
7 {c, d}
8 {a, b, c}
9 {a, d, e}
10 {b, c}
Answers
1. (a) 2. (d) 3. (b) 4. (c) 5. (d) 6. (a) 7. (b) 8. (a) 9. (c) 10. (a)
11. (b) 12. (a) 13. (b) 14. (d) 15. (a) 16. (c) 17. (a) 18. (b) 19. (c) 20. (b)
21. (d) 22. (c) 23. (b) 24. (c) 25. (d)
Q. II Fill in the Blanks:
1. Data mining is the process of collecting massive amounts of raw data and
transforming that data into useful _______.
2. Frequent pattern growth is a method of mining frequent itemsets _______ candidate
generation.
3. ______, adopts a divide-and-conquer strategy for finding the complete set of
frequent itemsets.
4. All nonempty subsets of a frequent itemset must also be _______.
5. To improve the efficiency of the level-wise generation of frequent itemsets, the Apriori _______ is used to reduce the search space.
6. Association _______ are considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold.
7. Frequent _______ are patterns that appear frequently in a data set.
8. The descriptions of a class or a concept are called _______ descriptions.
9. Data _______ refers to summarizing data of class under study.
10. _______ analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster).
11. Frequent pattern mining is also called as _______ rule mining.
Answers
1. information 2. without 3. FP-growth 4. frequent
5. property 6. rules 7. patterns 8. class/concept
9. Characterization 10. Cluster 11. association
18. A database has six transactions. Let min-sup = 50% and min-conf = 75%. Find all
frequent item sets using Apriori algorithm. List all the strong association rules.
T4 {Carrots, Mango}
t4 {I2, I5}
22. Consider following database and find out the frequent item sets using Apriori
algorithm with min-sup = 50%.
CHAPTER 4
Social Media and Text Analytics
4.0 INTRODUCTION
• In the information age, communication, the act of exchanging information by speaking, writing or using some other medium, has exploded.
• The most basic communication theory states that communication consists of a sender,
a message, a channel where the message travels, noise or interference and a receiver.
• In recent years, social media has gained significant popularity and become an
essential medium of communication.
• Merriam-Webster (America's most trusted online dictionary for English word
definitions, meanings and pronunciation) defines social media as, "forms of electronic
communication through which users create online communities to share information,
ideas, personal messages and other content”. OR
• As per guidelines given by Government of India, ‘Department of Electronics and
Information Technology’: “Social Media in recent times has become synonymous with
Social Networking sites such as Facebook or MicroBlogging sites such as Twitter.
However, very broadly social media can be defined as, any web or mobile based
platform that enables an individual or agency to communicate interactively and
enables exchange of user generated content.”
• Critical characteristics of social media are:
1. Connectedness: This characteristic basically evaluates a social media platform's ability to connect and re-connect people interested in the same topics and domains. The connectedness property of social media is also ensured by all-time availability for users on a variety of media and access devices including PCs, laptops, mobile phones etc.
2. Collaboration: The connections achieved on this media, enable people to
collaborate and create knowledge. Such collaborations can be either open or
• These social networks are useful to study the relationships between individuals,
groups, social units or societies.
• Social media analytics is the process of collecting, tracking and analyzing data from
social networks.
• Fig. 4.1 illustrates the five-year growth figures for the number of social media users conveyed by the Digital 2019 report.
• The statistical figures reveal drastic increase in the growth of social media users
(almost double) throughout the world considering from year 2014 to year 2019.
Fig. 4.1: Number of Social Media Users (in millions) from Year 2014 to Year 2019 (1,157 in 2014; 2,078 in 2015; 2,307 in 2016; 2,796 in 2017; 3,196 in 2018; 3,484 in 2019)
• Web 2.0 and social media facilitate the creation of vast amounts of digital content that
represents a valuable data source for researchers and companies alike.
• Social media analytics relies on new and established statistical and machine learning
techniques to derive meaning from large amounts of textual and numeric data.
• The analysis of social networks is centered on the fundamental theory that the social
network is made up of the relations and interaction between users and within units
rather than by the properties of the user itself.
• Social Network Analysis (SNA) is the general reference to the process of investigating
social networks or structures within social media through the use of networks,
knowledge graphs and graph theory.
• In this section, we will discuss social media analytics methods and demonstrate how social media analytics can be applied in a variety of contexts to deliver useful insight.
Benefits of Social Media Analytics:
1. The continuous monitoring, capturing and analyzing of social media data can provide valuable information for decision-making.
2. Social media analytics gives us the ability to track and analyze the growth of the community on social media sites and the activities and behavior of the people using the sites.
3. Governments from around the world are starting to realize the potential of data
analytics in making timely and effective decisions.
(Figure: the three stages of the social media analytics process. Data Capturing: gathering data from multiple sources, preprocessing the gathered data, and extracting relevant information from the preprocessed data. Data Understanding: removing noisy data and performing the required analytics, such as sentiment analysis and recommender analysis. Data Presentation: summarizing and generating the findings, and presenting the findings.)
• Before analyzing the data, noise removal may be required for better accuracy in data analysis.
• Preprocessing of data covers the treatment of noisy, missing or corrupted data, which involves several fields and techniques such as statistical analysis, machine learning, deep learning, computer vision, natural language processing, and/or data mining.
• The data understanding stage, in social media analytics process lies in the middle of
the social media analytics process and forms the core and most important stage in the
entire process.
• Data analysis is the set of activities/tasks that assist in transforming raw data into
insight, which in turn leads to a new base of knowledge and business value.
• In other words, data analysis is the phase that takes filtered data as input and
transforms that into information of value to the analysts.
• Once the data analysis is performed, the analyzed data is further presented to the next
stage in social media analytics process, which is the data presentation stage.
• The evaluation of results or outcomes in the last stage mostly depends on the findings
in the data understanding stage.
• If proper analysis techniques are not used in the data understanding stage in social
media analytics process, the findings may lead to fully incorrect output generation.
• Hence, immense care needs to be taken in the data understanding stage, in social
media analytics process so that the right tools and techniques are used while
analyzing the relevant data.
3. Data Presentation:
• This stage is the final stage in the social media analytics process. In data presentation
stage, the results are summarized and evaluated to gain significant insights.
• These final results or outcomes are then presented mostly using proper data
visualization tools to present the output or result in an easy and simple interpretable
form.
• Note that data presentation is what the users get as output/result at the end and hence,
no matter how big the data is in volume, the data visualization graphic(s) should make
the output easily understandable for the data analysts.
• Interactive data visualization has led us to a new era of revolution where graphics
have led us to easy data analysis and decision making.
• For example, data visualization can help in identifying outliers in data, improving
response time of analysts to quickly identify issues, displaying data in a concise
format, providing easier visualization of patterns etc.
• The most challenging part in data representation is to learn how data visualization
works and which visualization tool serves the best purpose for analyzing precise
information in a given case.
• It is important to understand that the three stages of the social media analytics process usually work iteratively rather than linearly.
• If the models generated in this stage (data understanding), in social media analytics
process fails to uncover useful results during analysis, the process turns back to the
data capturing stage to further capture additional data that may increase the
predictive and analysis power.
• Similarly, if in this stage (data understanding), the results that are generated is not
convincing or have low predictive power, then there is a necessity to turn back to the
data capturing or data understanding stage to tune the data and/or the parameters
used in the analytics model.
• Thus, this entire process of social media analytics may go through several iterations
before the final results are generated and presented.
(Figure: the 7 layers of social media analytics: (1) Text, (2) Networks, (3) Actions, (4) Mobile, (5) Hyperlinks, (6) Location and (7) Search Engines.)
• Examples include the Facebook friendship network and the Twitter follower network. Network analytics seeks to identify influential nodes (e.g., people and organizations) and their position in the network.
3. Layer 3 (Actions):
• Actions in social media mainly include the actions performed by users while using
social media such as clicking on like or dislike button, sharing posts, creating new
events or groups, accepting a friend request and so on.
• Data analysts often carry out actions analytics using social media data for measuring
various factors such as popularity of a product or person, recent trends followed by
users and popularity of user groups.
• Social media actions analytics deals with extracting, analyzing and interpreting the actions performed by social media users, including likes, dislikes, shares, mentions and endorsements.
• Action analytics in social media are mostly used to measure popularity and influence
over social media.
4. Layer 4 (Mobile):
• Mobile analytics is comparatively a recent trend in social media analytics. Mobile
analytics focuses on analysis of user engagement with mobile applications.
• Mobile analytics are usually carried out for marketing analysis to attract those users
who are highly engaged with a mobile application.
• The in-app analysis is another common analysis carried out in mobile analytics. The
in-app analysis concentrates on the kind of activities and interaction of users with an
application.
5. Layer 5 (Hyperlinks):
• Hyperlinks (or links) are commonly found in almost all Web pages that allow
navigation of one Web page to another.
• Hyperlink analytics involves extracting, analyzing and interpreting social media hyperlinks (e.g., in-links and out-links).
o The hyperlink into a Web page is called as in-link. The number of in-links to a Web
page is referred to as in-degree.
o The hyperlink out of a Web page is called as out-link. The number of out-links
from a web page is referred to as out-degree.
• Hyperlink analysis can reveal, for example, Internet traffic patterns and sources of the
incoming or outgoing traffic to and from a source.
• In simple words, hyperlink analytics is all about analyzing and interpreting social media hyperlinks.
6. Layer 6: (Location):
• Location analytics is concerned with mining and mapping the locations of social media
users, contents and data.
• Location analytics is also known as geospatial analysis or simply spatial analytics. This
analytics is carried out to gain insight from the geographic content of social media
data.
• Real-time location analytics is often carried out by data analysts for Business
Intelligence (BI). For example, the courier services used by social media sites need to
keep track of the locations of delivery in real-time.
• In location analytics, historical geographic data is also often used to bring an increase
in sales and profit in businesses.
7. Layer 7 (Search Engines):
• The search engines analytics focuses on analyzing historical search data for gaining a
valuable insight into a range of areas, including trends analysis, keyword monitoring,
search result and advertisement history and advertisement spending statistics.
• Search engine analytics pays attention to analyzing historical search data to generate
informative search engine statistics and these statistical results can then be used for
Search Engine Optimization (SEO) and Search Engine Marketing (SEM).
Fig. 4.4: The Social Media Analytics Life Cycle: business objectives at the centre, surrounded by the six steps (1) Identification, (2) Extraction, (3) Cleaning, (4) Analyzing, (5) Visualization and (6) Interpretation
• Fig. 4.4 shows six general steps common for all social media analytics processes. In the
beginning, the business objectives need to be clearly defined and then all the six stages
of the social media analytics cycle are carried out one after another until the business
objectives are all fully satisfied.
• Let us now understand the contribution of each of the steps in the analytics life cycle.
Step 1 - Identification:
• The identification step is mainly concerned in identifying the correct source of data for
carrying out analysis.
• The identification step is a crucial step in the social media analytics life cycle that
decides which data to consider among the vast, varied and diverse data that is
collected from various social media platforms.
• The decision on which data is to be considered for mining is mainly governed by the
business/organization objectives that are needed to be achieved.
Step 2 - Extraction:
• Once the accurate and/or appropriate social media data source is identified for a
specific analysis, the next step is to use a suitable API for data extraction.
• Through these APIs (Application Programming Interfaces) that almost every social
media service companies have, it is possible to access a required small portion of
social media data hosted in the database.
• Number of other specialized tools can help in data extraction from social media sites.
However, the extraction of data from social media sites is to be done following all
privacy and ethical issues concerned with mining data from social media platforms.
Step 3 - Cleaning:
• Data cleaning is done as a data preprocessing step used to remove the unwanted
and/or unnecessary data so as to reduce the quantity of data to be analyzed. Cleaning
data is usually done to handle irrelevant or missing data.
• In this step, the data is cleaned by filling in the missing values, smoothing any noisy
data, identifying and removing outliers, and resolving any inconsistencies. Filling up
the missing values in data is known as the imputation of missing data.
• The data cleaning method to be adapted for filling up the missing values depends on
the pattern of data used and the nature of analysis to be performed with the data.
• The technique, smoothing is also used during data preprocessing for data analysis.
Smoothing is intended to identify trends in the presence of noisy data when the
pattern of the trend is unknown.
• For data cleaning, outliers should also be excluded from the data set as much as
possible as these outliers may mislead the analysis process resulting in incorrect
results.
• For this reason, an important preprocessing step is to correct the data by following
some data cleaning techniques.
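• A minimal sketch of such cleaning with the pandas library (the column names and values are hypothetical):
import pandas as pd

df = pd.DataFrame({
    "likes":  [120, None, 95, 4000, 110, 102, None, 98],
    "shares": [10, 12, None, 300, 9, 11, 10, 13],
})

# Imputation: fill missing values with the column median.
df = df.fillna(df.median(numeric_only=True))

# Smoothing: a 3-point rolling mean to dampen noisy spikes.
df["likes_smoothed"] = df["likes"].rolling(window=3, min_periods=1).mean()

# Outlier handling: drop rows whose z-score magnitude exceeds 2 (a simple rule of thumb).
z = (df["likes"] - df["likes"].mean()) / df["likes"].std()
df_clean = df[z.abs() <= 2]
print(df_clean)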
Step 4 - Analyzing:
• In analyzing step, the clean data is used by data analysts for analyzing data for
business insights.
• The main objective in analyzing step is to generate meaningful insights from the social
media data fed as input for analysis. Choosing the right tools and techniques is an
important factor for accurate data analysis.
• The analyzing step often involves a detailed step-by-step procedure to arrive at a
meaningful insight for correct and fruitful decision-making.
Step 5 - Visualization:
• Once the data is analyzed, it is preferred to be presented in a graphical and/or pictorial
format as it is said that the human brain processes visual content better than plain
textual information. This is where the role of data visualization in data analytics
comes into play.
• Interactive data visualization has led to a new era of revolution where graphics lead to
easy and simple analysis and decision making.
• The right data visualization tool to be used depends on which layer of social media
data analytics is to be dealt with.
• For example, if it is required to display network data, one can display it through a network chart; the 'word cloud' visualization can be used for representing textual data.
Step 6 - Interpretation:
• The final stage/step in the social media analytics life cycle is the interpretation of
results. Data interpretation step involves translating outcomes in meaningful business
solutions.
• At data interpretation step, the various data visualizations generated from the
previous stage are studied to understand the results and finally give a meaningful
summary of the entire analytics that is carried out on the data.
• Social media data refers to all of the raw insights and information collected from individuals' social media networks.
• To work with social media data, it is required to get access to data that is generated
from various social media sites.
• Social media scraping is the task of extracting the unstructured data generated from
these social media sites.
• There are various social media scrapers developed which act as an excellent tool for
extracting social media data.
• Few such prominent scrapers often used by data analysts and data scientists include
Octoparse, OutWit Hub, Scrapinghub, Parsehub, and Dexi.io.
• Most of the social media network sites have their APIs (Application Programming
Interfaces) that can be used by data analysts or data scientists to access the social
media data found in these sites and also integrate the various other features of the
APIs into applications.
• The data is often collected with tools which communicate with the respective API of
the social media platform, if one exists and crawl the data.
• The APIs can be differentiated among each other based on the features provided by
the APIs, the popularity of the social media site to which the API is connected, the cost
of using the APIs and the ease in which each API can be used for data analysis.
• Some of the prominent social media APIs used for accessing social media data are
explained below:
1. Facebook API:
• Facebook provides a platform, where people come to socialize, talk and share their
views with each other.
• Facebook is a social networking site commonly used by a large number of people to interact with their families and friends, and also for making business appearances or meeting online with other users.
• Facebook content can be accessed through Facebook APIs free of cost. One of Facebook's commonly used APIs is the Facebook Graph API.
• Facebook Graph API is commonly used by social media researchers to access Facebook
data. Facebook APIs also help in posting images, creating new stories, accessing
Facebook pages and so on.
• In Facebook there is also a provision of using the Facebook Marketing API that helps
create applications for services and product marketing.
2. YouTube API:
• The YouTube platform is owned by Google. The YouTube’s basic functionality is video
and music sharing.
• The API of YouTube provides options to YouTube users to play or upload videos,
manage playlists, search music and videos and several other functionalities.
• The YouTube API also has provisions for analyzing videos, subscribing to videos, and scheduling live streaming broadcasts.
3. Instagram API:
• Instagram is a photo and video sharing social networking platform owned by
Facebook.
• The Instagram social media platform has a provision of photo and video sharing
among users. The sharing of data can be done either publically or only among
followers of Instagram users.
• The concept of hashtags became popular on Instagram, which allows users to provide highlights of the topics portrayed in their feeds.
• Instagram also has number of APIs built for specific purposes. One most popular API is
the Instagram Graph API that is used by analysts or data scientists to access the data of
Instagram accounts used for business operations.
• Mostly, Instagram content can be accessed through Instagram APIs free of cost. Such
Instagram content is often used by researchers from the social media analytics
community to generate meaningful insights about some particular topic, product or
individual.
4. Twitter API:
• Twitter is a social networking platform which enables registered users to read and
post short messages called tweets. The tweet contains rich social media content.
• The length of a tweet message was originally limited to 140 characters (now 280) and users are also able to upload photos or short videos.
• It is the place, where people are more inclined to listen than to speak. Twitter is one of
the popular social media services provided online.
• A bulk amount of Twitter users post messages called tweets that contain rich social media content. Twitter content can be accessed through Twitter APIs free of cost (a small illustrative sketch of programmatic access appears after this list of APIs).
• However, there are paid versions of Twitter APIs which provide more accessibility to
data as well as reliability. The APIs provided by Twitter are categorized based on the
type of service it provides.
• For example, the Search API allows access to tweets to retrieve historical content, the advertisement API helps in creating advertisement campaigns, and the Account Activity API is used to access Twitter account activities.
5. LinkedIn API:
• LinkedIn is the world's largest professional social networking site. It offers a platform where users or companies do B2B marketing.
• LinkedIn allows the profile owners to share employment and personal information
with others. This site focuses on user’s professional identities.
• Every popular social media network site, as well as online discussion forums and news sites, develops its own APIs to provide access to content that can be used for data analysis to generate interesting results that can help in product promotion, political campaigns, influence maximization, information diffusion, business profit uplifts, and many more.
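• As an illustration of accessing such APIs programmatically, the following is a minimal sketch using the third-party tweepy client for the Twitter API (an assumption; a valid bearer token from the Twitter developer portal is required, and the placeholder below must be replaced):
import tweepy  # third-party Python client for the Twitter API

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # hypothetical credential placeholder

# Search recent tweets matching a keyword via the recent-search endpoint.
response = client.search_recent_tweets(query="data analytics", max_results=10)

for tweet in response.data or []:
    print(tweet.id, tweet.text)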
• Social media crawling has become a buzzword even in the student community as
students carry out projects on social media analysis by using real-time social media
datasets.
o Static Structure Mining: It works with snapshots of data of a social network that
is stored within a specified time period. In this case, the analysis is thus carried out
on a static social network and focus is given on the structural regularities of the
static network graph.
o Dynamic Structure Mining: It uses dynamic data that constantly keeps changing
with time. In this case, the analysis is thus carried out on a dynamic social network
and focus is given on unveiling the changes in the pattern of data with the change
in time.
• Following program shows the source code to display a simple social network graph
that consists of six nodes and nine edges. To display the graph, the networkx Python
library is imported and used.
• The edges between nodes are created one by one and then the graph is displayed for
visualization of the network.
• The various other network information such as the number of nodes and edges,
network density, etc. is also provided in the output.
• The program also displays the degree value, the clustering coefficient value and the
eccentricity value of each node in the graph.
• The clustering coefficient value of each node in the program help in assessing the
degree to which the nodes in the social graph tend to cluster together. The eccentricity
value determines the maximum graph distance to be traveled between a node and any
other nodes in the graph.
** Program for Social Network Graph
import networkx as nx
import matplotlib.pyplot as plt
from operator import itemgetter

# Build an undirected social network graph with six nodes and nine edges
G = nx.Graph()
G.add_edge("A", "B")
G.add_edge("A", "C")
G.add_edge("A", "D")
G.add_edge("A", "E")
G.add_edge("B", "C")
G.add_edge("B", "D")
G.add_edge("B", "E")
G.add_edge("F", "E")
G.add_edge("F", "D")

# Draw and display the graph
nx.draw_networkx(G)
plt.show()

# Displaying graph information
print(nx.info(G))
density = nx.density(G)
print('Network density:', density)

# Displaying degree of each node
print(nx.degree(G))

# Displaying top 3 nodes based on highest degree
degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')
sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)
print('Top 3 nodes by degree:')
for d in sorted_degree[:3]:
    print(d)

# Displaying clustering coefficient of each node
print('Clustering Coefficients of given graph :', nx.clustering(G))

# Displaying eccentricity of each node
print('Eccentricity of given graph:', nx.eccentricity(G))
Output:
Graph with 6 nodes and 9 edges
Network density: 0.6
[('A', 4), ('B', 4), ('C', 2), ('D', 3), ('E', 3), ('F', 2)]
Top 3 nodes by degree:
('A', 4)
('B', 4)
('D', 3)
Clustering Coefficients of given graph : {'A': 0.5, 'B': 0.5, 'C': 1.0, 'D': 0.3333333333333333, 'E': 0.3333333333333333, 'F': 0}
Eccentricity of given graph: {'A': 2, 'B': 2, 'C': 3, 'D': 2, 'E': 2, 'F': 3}
• Fig. 4.5 shows the network graph that is displayed using the draw_networkx() function
of the networkx library.
• The output of the above program consists of basic social network graph information
such as the number of nodes, the number of edges, average degree of all nodes
considered together, network density and the top three nodes based on the highest
degree.
• Also, the above program output shows the degree value, the clustering coefficient
value, and the eccentricity value of each node.
(Figure: the similarity-based approach, in which similarity measures produce ordered scores and top-ranked pairs, and the learning-based approach, in which similarity features and other features are fed to learning models such as a classifier or a probabilistic model to produce positive instances.)
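• As a small illustration of the similarity-based approach sketched above, non-adjacent node pairs of the earlier social graph can be scored with the Jaccard coefficient and ranked; such similarity scores are commonly used to predict which links are likely to form (a minimal sketch, not from the text):
import networkx as nx

# Reusing the six-node social graph built in the earlier program.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
              ("B", "C"), ("B", "D"), ("B", "E"), ("F", "E"), ("F", "D")])

# Jaccard coefficient for every non-adjacent pair, ranked from highest to lowest.
scores = sorted(nx.jaccard_coefficient(G), key=lambda x: x[2], reverse=True)
print("Top ranked candidate pairs:")
for u, v, score in scores[:3]:
    print(f"({u}, {v}) -> {score:.2f}")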
• A user belonging to the same community is expected to share similar tastes, likes and
dislikes which helps in the prediction of what products a user is likely to buy, which
movie a user is likely to watch, what services a user may be interested in, and so on.
• Fig. 4.7 shows the various categories of approaches followed for community detection
in social media networks.
• The community detection categories are broadly divided into, the traditional
clustering community detection methods, the link-based community detection
methods, the topic-based community detection methods and the topic-link based
community detection methods.
Fig. 4.7: Categories of Community Detection Methods: traditional clustering methods (e.g., hierarchical clustering), link-based methods, topic-based methods (e.g., Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation) and topic-link based methods (e.g., Community-Author-Recipient-Topic and Topic-Link LDA)
3. Topic-based Methods:
o The topic-based community detection methods emphasize the generation of
communities based on the common topic of interests.
o In topic-based community detection methods, what is mainly explored is finding
communities that are topically similar and do not consider any emphasis on the
strength of connections between nodes.
o Two standard topic-based community detection methods are Probabilistic Latent
Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA).
4. Topic-link Based Methods:
o The topic-link based methods community detection methods are the most common
approaches used nowadays for community detection in social networks.
o These community detection methods are hybrid as these approaches consider both
the strength of connections between nodes as well as finding communities that are
topically similar.
o Thus, the topic-link based methods consider the disadvantages of using only one
single method i.e., link-based or topic-based for community detection, and
combine both the methods to give more accurate results.
o Two standard topic-link based community detection methods are Community-
Author-Recipient-Topic (CART) and Topic-Link LDA.
• Community detection in social media analytics plays a major role not only in social
networks but also in various other fields such as information networks, sociology,
biology, politics and economics.
• The main challenges of community detection lie when the network of nodes to
consider is vast and dynamic which needs special methods to deal with the complexity
of the network.
• An important application that community detection in social media analytics has
found in network science is the prediction of missing links and the identification of
false links in the network.
• Nowadays, viral marketing is considered an effective tool adopted by companies and organizations for the promotion of brands and the publicity of organizations.
• Viral marketing is a strategy that uses existing social networks to promote a product
mainly on various social media platforms.
• What is done in case of viral marketing is to initially use an influence maximization
technique to,
o find a set of few influential users of a social network, and
o influence those users about the goodness and usefulness of a product so that it can
create a cascade of influence of buying the same product by the users’ friends.
• The user’s friends will again, in turn, recommend or publicize the same product to
their friends and this helps in easy and simple product promotion.
• This product promotion strategy is also adopted in other domains such as political
campaigns, movie recommendations, company publicity and so on.
• The main challenge in influence maximization is to generate the best influential users
(the seed set) in the social media network.
• Fig. 4.8 shows the generic influence maximization model in which an unweighted
social graph is fed as input to the model.
• The social graph contains past action propagation traces which are then used by an
influence diffusion technique to learn the weights of each edge. Now the unweighted
social graph is converted to a weighted graph.
• A weighted graph is again provided to a standard influence maximization algorithm to
generate the seed set. This seed set is considered as output for the entire influence
maximization model.
Fig. 4.8: Generic Influence Maximization Model (past action propagation traces → weighted social graph → standard influence propagation algorithm → seed set)
• There are number of standard information diffusion models that can be used to assign
weights to the un-weighted social graph.
• Some of information diffusion models include the Independent Cascade (IC) model, the
Linear Threshold (LT) model, and the Weighted Cascade (WC) model.
• In general, an information diffusion model considers the entire social media network
as a graph G consisting of vertices (or nodes) and edges (or connection between
nodes).
• This graph can be represented as G = (V, E) where, V denotes the vertices of the social
network and E denotes the edges between the nodes.
• The influence diffusion model to be used for influence maximization is chosen based
on the nature of the complexity of the social networking site.
• In real cases, it is not easy to decide which diffusion model will work the best for a
particular social media network and a standard model is usually chosen to provide an
optimum result.
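• A minimal sketch of a single run of the Independent Cascade (IC) model on a small hypothetical weighted graph (this only simulates diffusion from a given seed set; it is not a full influence maximization algorithm):
import random
import networkx as nx

def independent_cascade(G, seeds, rng_seed=0):
    # Each newly activated node gets one chance to activate each inactive
    # successor, succeeding with probability equal to the edge weight.
    random.seed(rng_seed)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for u in frontier:
            for v in G.successors(u):
                if v not in active and random.random() < G[u][v]["weight"]:
                    active.add(v)
                    new_frontier.append(v)
        frontier = new_frontier
    return active

# Hypothetical weighted directed graph (edge weights are activation probabilities).
G = nx.DiGraph()
G.add_weighted_edges_from([("t", "w", 0.5), ("t", "x", 0.5), ("w", "u", 0.25),
                           ("x", "z", 0.25), ("u", "z", 0.25)])
print("Activated nodes:", independent_cascade(G, {"t"}))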
• NLP mainly aims for the interconnection between natural languages and computers
that means how to analyze and model a high volume of natural language data.
• NLP will always remain a standard requirement in the field of data science. NLP focuses on bridging the gap between human communication and computer understanding.
• NLP is a collection of techniques for working with human language. Examples would
include flagging e-mails as spam, using Twitter to assess public sentiment and finding
which text documents are about similar topics.
• Natural Language Understanding helps machines “read” text (or another input such as
speech) by simulating the human ability to understand a natural language such as
English.
• NLP is a subfield of linguistics, computer science, and artificial intelligence concerned
with the interactions between computers and human language, in particular how to
program computers to process and analyze large amounts of natural language data.
• The goal is a computer capable of "understanding" the contents of documents,
including the contextual nuances of the language within them.
• The technology can then accurately extract information and insights contained in the
documents as well as categorize and organize the documents themselves.
• In recent years, online social networking has revolutionized interpersonal
communication.
• The newer research on language analysis in social media has been increasingly
focusing on the latter's impact on our daily lives, both on a personal and a professional
level. NLP is one of the most promising avenues for social media data processing.
• NLP is a scientific challenge to develop powerful methods and algorithms which
extract relevant information from a large volume of data coming from multiple
sources and languages in various formats.
• The field of NLP is found to be highly beneficial for resolving ambiguity in the various
languages spoken worldwide and is a key area of study for text analytics as well as
speech recognition.
• NLP is the technology that is used by machines to understand, analyze, manipulate,
and interpret human's languages.
• NLP, as an important branch of data science, plays a vital role in extracting insights
from the input text. Industry experts have predicted that the demand for NLP in data
science will grow immensely in the years to come.
• One of the key areas where NLP is playing a pivotal role in data science is while
dealing with multi-channel data like mobile data or any social media data.
• Through the use of NLP, these multichannel data are being assessed and evaluated to
understand customer sentiments, moods, and priorities.
• Language is a method of communication with the help of which we can speak, read
and write.
• NLP is a subfield of Computer Science that deals with Artificial Intelligence (AI), which
enables computers to understand and process human language.
• Text processing has a direct application to Natural Language Processing, also known
as NLP.
• NLP is aimed at processing the languages spoken or written by humans when they
communicate with one another.
• Fig. 4.11 shows the phases or logical steps in natural language processing.
Fig. 4.11: Phases of NLP: an input sentence passes through Morphological Processing (using a lexicon), Syntax Analysis (using a grammar), Semantic Analysis (using semantic rules) and Pragmatic Analysis (using contextual information) to produce the target representation
• Let us discuss the phases or logical steps in natural language processing:
1. Morphological Processing is the first phase of NLP. The purpose of this phase is to
break chunks of language input into sets of tokens corresponding to paragraphs,
sentences and words. For example, a word like “uneasy” can be broken into two
sub-word tokens as “un-easy”.
2. Syntax Analysis is the second phase of NLP. The purpose of this phase is twofold: to check whether a sentence is well formed and to break it up into a structure that shows the syntactic relationships between the different words. For example, a sentence like "The school goes to the boy" would be rejected by the syntax analyzer or parser.
3. Semantic Analysis is the third phase of NLP. The purpose of this phase is to draw
exact meaning, or you can say dictionary meaning from the text. The text is
checked for meaningfulness. For example, semantic analyzer would reject a
sentence like “Hot ice-cream”.
4. Pragmatic Analysis is the fourth phase of NLP. Pragmatic analysis simply fits the
actual objects/events, which exist in a given context with object references
obtained during the last phase (semantic analysis). For example, the sentence “Put
the banana in the basket on the shelf” can have two semantic interpretations and
pragmatic analyzer will choose between these two possibilities.
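• As an illustration, the short sketch below (not one of the book's program listings; it assumes NLTK and its data packages are installed) shows the first two phases in code: word tokenization as morphological processing and part-of-speech tagging as a step toward syntax analysis.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sentence = "The boy goes to the school"
tokens = nltk.word_tokenize(sentence)   # morphological processing: break input into tokens
print(tokens)
print(nltk.pos_tag(tokens))             # part-of-speech tags used during syntax analysis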
• There are two very different schools of thought in NLP, namely, statistical NLP and
linguistic NLP.
o The statistical school of NLP solves language-understanding problems by using massive corpora of training data to find statistical patterns in language.
o The linguistic school focuses on understanding language as language, with
techniques such as identifying which words are verbs or parsing the structure of a
sentence.
Examples of NLP Applications:
• Today, NLP is an emerging technology that drives many of the forms of AI we see in everyday use.
• For today’s and tomorrow’s increasingly cognitive applications, the use of NLP in
creating a seamless and interactive interface between humans and machines will
continue to be a top priority.
• Some common applications of NLP are explained below:
1. Automatic text summarization is a technique which creates a short, accurate
summary of longer text documents.
2. Machine Translation (MT) is basically the process of translating text in one source language into another language.
3. Spelling correction and grammar correction is a very useful feature of word
processor software like Microsoft Word. NLP is widely used for this purpose.
4. Question-answering, another main application of natural language processing
(NLP), focuses on building systems which automatically answer the question
posted by user in their natural language.
5. Sentiment analysis is another important application of NLP. Online e-commerce companies like Amazon, eBay, etc., use sentiment analysis to identify the opinions and sentiments of their customers online.
6. Speech engines like Siri, Google Voice, Alexa are built on NLP so that we can
communicate with them in our natural language.
• Pre-processing of text documents plays a vital role in any text analytics based application.
• The pre-processing steps have an enormous effect on the quality of the knowledge output during text analytics.
• Since the phrase "Garbage in, garbage out" applies to every text analytics process, the input text data should be improved to enhance the quality of the text analytics.
• Even when a good text analytics algorithm is applied, redundant, irrelevant and noisy input data hamper the quality of the output.
• So preparation and filtration of the raw input data makes sense before feeding it to the text analytics process.
• In general text pre-processing steps include Tokenization, Bag of words, Word
weighting: TF-IDF, n-Grams, stop words, Stemming and lemmatization, synonyms and
parts of speech tagging.
4.5.1 Tokenization
• The first part of any NLP process is simply breaking a piece of text into its constituent parts (usually words); this process is called tokenization.
• Tokenization also removes unwanted elements from a text document, converts all the letters to lower case, and removes punctuation marks.
• The output of tokenization is a representation of the text document as a stream of terms.
o Convert to Lower Case: Each term of the document is converted into lower case so that words like "hello" and "Hello" are treated as the same term for analysis.
o Remove White Space, Numbers and Punctuation: White space, numbers and punctuation marks like ",", "?" etc. are removed from text documents as they have no significant contribution for clustering.
• In tokenization the text is cut into pieces called "tokens" or "terms." These tokens are the most basic unit of information we will use for the model.
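• A minimal sketch of such tokenization (an illustrative example, not the book's program) using plain Python is given below.
import re
text = "Hello, World! NLP makes text analytics easier in 2022."
text = text.lower()                    # convert every letter to lower case
text = re.sub(r'[^a-z\s]', ' ', text)  # remove punctuation marks and numbers
tokens = text.split()                  # split on white space into a stream of terms
print(tokens)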
• Right off the cuff, we might want to consider the following extensions to the word
vector:
1. There are a staggering number of words in English and an infinite number of
potential strings that could appear in text. We need some way to cap them off.
2. Some words are much more informative than others – we want to weight them by
importance.
3. Some words don’t usually matter at all. Things such as “I” and “is” are often called
“stop words,” and we may want to just throw them out at the beginning.
4. The same word can come in many forms. We may want to turn every word into a
standardized version of itself, so that “ran,” “runs,” and “running” all become the
same thing. This is called “lemmatization.”
5. Sometimes, several words have the same or similar meanings. In this case, we
don’t want a vector of words so much as a vector of meanings. A “synset” is a
group of words that are synonyms of each other, so we can use synsets rather than
just words.
6. Sometimes, we care more about phrases than individual words. A set of n words in
order is called an “n-gram,” and we can use n-grams in place of words.
• NLP is a deep, highly specialized area. If we want to work on a cutting-edge NLP project that tries to develop real understanding of human language, the knowledge we will need goes well beyond the scope of this chapter.
• However, simple NLP is a standard tool in the data science toolkit and unless we end
up specializing in NLP, this chapter should give us what we need. When it comes to
running code that uses bag-of-words, there is an important thing to note.
• Mathematically, we can think of word vectors as normal vectors (an ordered list of numbers, with different indices corresponding to different words in the language).
• There is often no need to actually specify which words correspond to which vector
indices and doing so would add a human-indecipherable layer in the data processing,
which is usually a bad idea.
• In fact, for many applications, it is not even necessary to explicitly enumerate the set
of all words being captured.
• This is not just about human readability: the vectors being stored are often quite
sparse, so it is more computationally efficient to only store the nonzero entries.
• We might wonder why we would bother to think of a map from strings to floats as a
mathematical vector. The reason is that vector operations, such as dot products, end
up playing a central role in NLP.
• NLP pipelines will often include stages where they convert from map representations
of word vectors to more conventional sparse vectors and matrices, for more
complicated linear algebra operations such as matrix decompositions.
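• For example, a word vector can be stored as a simple map from words to counts, with the dot product computed over only the nonzero entries. The sketch below is a hedged illustration of this idea, not a specific library's implementation.
from collections import Counter

def bag_of_words(tokens):
    return Counter(tokens)          # sparse word vector: word -> count

def dot(v1, v2):
    # sum over only the words present in both sparse vectors
    return sum(count * v2[word] for word, count in v1.items() if word in v2)

d1 = bag_of_words("the cat sat on the mat".split())
d2 = bag_of_words("the dog sat on the log".split())
print(dot(d1, d2))                  # measures the overlap between the two documents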
4.5.4 n-Grams
• An n-gram is a sequence of n words. A piece of text containing M words can be broken into a collection of M − n + 1 n-grams, as shown in Fig. 4.12 (which contains 2-grams).
• We can create a bag-of-words out of n-grams, run TF-IDF on them, or model them with a Markov chain, just as if they were normal words.
• The problem with n-grams is that there are so many potential ones out there. Most n-grams that appear in a piece of text will occur only once, with the frequency decreasing the larger n is.
• The general approach here is to only look at n-grams that occur more than a certain
number of times in the corpus.
Fig. 4.12: Breaking a piece of text into 2-grams
• In NLP, n-grams are used for a variety of things. Some examples include auto-completion of sentences (such as the suggestions we see in Gmail these days), auto spell check and, to a certain extent, checking the grammar of a given sentence.
Basic Concept of n-gram with Example:
• An n-gram is a contiguous sequence of n items from a given sequence of text or
speech.
• The items can be phonemes, syllables, letters, words or base pairs according to the
application.
• The n-grams typically are collected from a text or speech corpus. An n-gram of size 1 is
referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”.
• In prediction applications such as next-word prediction, 2-grams (bigrams), 3-grams (trigrams) and 4-grams (quadgrams) are commonly used.
N = 1 : "This is a sentence" → unigrams: this, is, a, sentence
N = 2 : "This is a sentence" → bigrams: this is, is a, a sentence
N = 3 : "This is a sentence" → trigrams: this is a, is a sentence
Fig. 4.13
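• The same decomposition can be produced with a few lines of Python (an illustrative sketch, assuming the text has already been tokenized into words).
def ngrams(tokens, n):
    # a text of M words yields M - n + 1 n-grams
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is a sentence".lower().split()
print(ngrams(tokens, 1))   # unigrams: ['this', 'is', 'a', 'sentence']
print(ngrams(tokens, 2))   # bigrams: ['this is', 'is a', 'a sentence']
print(ngrams(tokens, 3))   # trigrams: ['this is a', 'is a sentence']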
• Stop words are words that do not have the potential to characterize the content of the text; they are filtered out of the token stream using a stop word list.
1. Stemming:
o If a search query contains the word "organize", then all documents which contain any form of that word (organizes, organizing, organized) are expected to match.
o A word can be viewed as a combination of a stem and affixes, so such queries require that the affixes be removed and the words be converted to their stem form.
o Stemming generally refers to a crude heuristic process that removes derivational affixes. Stemming is thus a simplified form of morphological analysis, which just finds the stem of the word.
o Text miners work with many stemming algorithms, such as the Porter stemmer, Snowball stemmer, etc.
o Stemming is a rather crude method for cataloging related words; it essentially
chops off letters from the end until the stem is reached.
o This works fairly well in most cases, but unfortunately English has many
exceptions where a more sophisticated process is required.
2. Lemmatization:
o Lemmatization technique is like stemming. The output we will get after
lemmatization is called ‘lemma’, which is a root word rather than root stem, the
output of stemming.
o After lemmatization, we get a valid word with the same meaning. A method that switches any kind of word to its base root form is called lemmatization.
o In contrast to stemming, lemmatization looks beyond word reduction and
considers a language’s full vocabulary to apply a morphological analysis to words.
The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’.
o Lemmatization is typically seen as much more informative than simple stemming. Lemmatization looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.
Difference between Stemming and Lemmatization:
1. Speed: Stemming is faster because it chops words without knowing the context of the word in the given sentence. Lemmatization is slower than stemming, but it knows the context of the word before proceeding.
2. Approach: Stemming is a rule-based approach. Lemmatization is a dictionary-based approach.
3. Accuracy: Stemming is less accurate. Lemmatization is more accurate than stemming.
4. Output: When converting a word into its root form, stemming may create a word with no dictionary meaning. Lemmatization always gives a valid dictionary word when converting into root form.
5. Usage: Stemming is preferred when the meaning of the word is not important for the analysis (e.g., spam detection). Lemmatization is recommended when the meaning of the word is important for the analysis (e.g., question answering).
6. Example: Stemming: "Studies" => "Studi". Lemmatization: "Studies" => "Study".
• Stemming and lemmatization are broadly utilized in text mining, which is the method of analyzing text written in natural language and extracting high-quality information from it.
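• The contrast between the two can be seen with NLTK's Porter stemmer and WordNet lemmatizer (a minimal sketch, assuming NLTK and the WordNet data are installed).
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')    # lexical data used by the WordNet lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))           # 'studi' - crude suffix chopping
print(lemmatizer.lemmatize("studies"))   # 'study' - a valid dictionary word
print(stemmer.stem("mice"))              # the stemmer cannot reduce irregular forms
print(lemmatizer.lemmatize("mice"))      # 'mouse'
print(lemmatizer.lemmatize("running", pos="v"))   # 'run' when the part of speech is given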
4.5.7 Synonyms
• Generally, all Natural Language Processing (NLP) applications deal with text frequency analysis and text indexing.
• In such text analysis activities it is always useful to compress the vocabulary without losing meaning, because it saves a lot of memory. To accomplish this, we must define a mapping of each word to its synonyms.
• Intuitively, when we work with text analysis, the words themselves are less important
than the “meaning.”
• This suggests that we might want to collapse related terms such as "big" and "large" into a single identifier. These identifiers are often called "synsets", for sets of synonyms.
• Most NLP packages utilize synsets as a major component of their ability to understand a piece of text and to establish semantic relations within the text.
• The simplest use of synsets is to take a piece of text and replace every word with its
corresponding synset.
• Ultimately, we think of synsets as constituting the vocabulary of a “clean” language,
one in which there is a one-to-one matching between meanings and words.
• Replacing words with synsets usually runs into a major problem: in general, a word can belong to several synsets, because a word can have several distinct meanings. So a translation from the original language to "synset-ese" is not always feasible.
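• The synsets of a word can be inspected with NLTK's WordNet interface (a sketch under the assumption that the WordNet corpus has been downloaded); note how a single word such as "big" belongs to several synsets.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
for synset in wordnet.synsets("big")[:3]:       # a word may belong to several synsets
    print(synset.name(), "->", synset.lemma_names())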
List of Stopwords :
{"didn't", "wouldn't", "she's", 'too', 'their', 'is', 'off', 'them',
'doing', 'themselves', 'again', 'any', 'out', 'mustn', 'a', 'her',
'how', "you've", 'mightn', 'itself', 'not', 'such', 's', 'from',
'which', 'those', 'shan', 'yours', 'i', 'him', 'there', 'your',
'below', "shan't", 't', 'myself', 'his', 'of', 'hasn', 'having', 'in',
'had', 'so', 'have', 'other', "won't", 'aren', "that'll", 'm', "it's",
"you're", "wasn't", 'should', 'weren', "couldn't", 'did', 'down',
'are', 'has', 'y', 'was', 'over', "should've", 'wouldn', "hadn't",
"mustn't", 'no', 'yourselves', 'what', "hasn't", 'the', 'between',
"don't", 'nor', 'do', 'ours', 'through', 'each', 'you', 'can', 'me',
'it', 'ain', 'once', 'to', 'when', 'd', 'because', 'she', 'or',
'here', 'couldn', 'been', 'won', "you'll", 'with', 'at', 'all',
'further', 'herself', 'then', 'but', 'hadn', 'most', 'both', 'until',
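• The list printed above appears to come from NLTK's built-in English stop word corpus; a minimal sketch that loads such a list (assuming NLTK is installed) is given below.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))   # English stop word list as a set
print(len(stop_words))                         # roughly 180 words in recent NLTK versions
print(stop_words)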
• There are more complicated versions of sentiment analysis that can, for example,
determine complicated emotional content such as anger, fear, and elation.
• But the most common examples focus on "polarity", i.e., where a sentiment falls on the positive–negative continuum.
• The following sentiment analysis program takes airline tweet records as input. The program categorizes tweets into positive, negative and neutral categories and displays the results with the help of a pie chart and a bar chart.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the US Airline tweets dataset
data_source_url = "https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv"
airtweets = pd.read_csv(data_source_url)
# Enlarge the default figure size so the charts are readable (8 x 6 inches assumed)
plsize = plt.rcParams["figure.figsize"]
plsize[0] = 8
plsize[1] = 6
plt.rcParams["figure.figsize"] = plsize
print("Data in CSV file")
print(airtweets.head())
# Pie chart: overall distribution of positive, negative and neutral tweets
print("Distribution of sentiments across all tweets ")
airtweets.airline_sentiment.value_counts().plot(
    kind='pie', autopct='%1.0f%%', colors=["black", "orange", "green"])
plt.show()
# Bar chart: sentiment counts broken down per airline
airline_sentiment = airtweets.groupby(
    ['airline', 'airline_sentiment']).airline_sentiment.count().unstack()
airline_sentiment.plot(kind='bar', color=['black', 'blue', 'cyan'])
plt.show()
Output:
Data in CSV file
tweet_id ... user_timezone
0 570306133677760513 ... Eastern Time (US & Canada)
1 570301130888122368 ... Pacific Time (US & Canada)
2 570301083672813571 ... Central Time (US & Canada)
3 570301031407624196 ... Pacific Time (US & Canada)
4 570300817074462722 ... Pacific Time (US & Canada)
[5 rows x 15 columns]
Distribution of sentiments across all tweets
<matplotlib.axes._subplots.AxesSubplot at 0x7f93e058d950>
• Simple sentiment analysis is often done with manual lists of keywords. If words such as "bad" and "terrible" occur a lot in a piece of text, it strongly suggests that the overall tone is negative. That is what simple keyword-based sentiment analysis programs do.
• Slightly more sophisticated versions are based on plugging bag-of-words into machine
learning pipelines.
• Commonly, we will classify the polarity of some pieces of text by hand and then use
them as training data to train a sentiment classifier.
• This has the massive advantage that it will implicitly identify key words you might not
have thought of and will figure out how much each word should be weighted.
• The extensions of the bag-of-words model, such as n-grams, can also be utilized to identify phrases such as "nose dive," which deserve a very large weight in sentiment analysis but whose constituent words don't mean much.
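• A hedged sketch of such a pipeline (illustrative only, with made-up training sentences) using scikit-learn's bag-of-words vectorizer and a Naive Bayes classifier is given below.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great service and friendly staff",          # hypothetical hand-labelled data
         "terrible delay, very bad experience",
         "the flight was okay, nothing special"]
labels = ["positive", "negative", "neutral"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
                      MultinomialNB())
model.fit(texts, labels)                                    # train on labelled polarity data
print(model.predict(["very bad service and long delay"]))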
• Lastly in the following program, the weighted frequency is plugged in place of the
corresponding words found in the original sentences. The sum of the weighted
frequencies is then found and then the sentences are sorted and arranged in
descending order of sum. The summarized text is finally displayed as output.
Program for Text Summarization:
# Install the required parsing libraries (needed only once, e.g. in Colab)
!pip install beautifulsoup4
!pip install lxml
import bs4 as bs
import urllib.request
import re
import nltk
import heapq
nltk.download('punkt')
nltk.download('stopwords')
# Fetch the Wikipedia article and collect the text of all paragraph tags
url_data = urllib.request.urlopen("https://en.wikipedia.org/wiki/Indian_Space_Research_Organisation")
article = url_data.read()
parsedart = bs.BeautifulSoup(article, 'lxml')
para = parsedart.find_all('p')
arttext = ""
for p in para:
    arttext += p.text
# Removing square brackets (citation markers) and extra spaces
arttext = re.sub(r'\[[0-9]*\]', ' ', arttext)
arttext = re.sub(r'\s+', ' ', arttext)
# Removing special characters and digits (this copy is used only for word counts)
formart_text = re.sub('[^a-zA-Z]', ' ', arttext)
formart_text = re.sub(r'\s+', ' ', formart_text)
# Tokenize sentences
slist = nltk.sent_tokenize(arttext)
stopwords = nltk.corpus.stopwords.words('english')
# Find the frequency of occurrence of all non-stop words
word_frequencies = {}
for word in nltk.word_tokenize(formart_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
# Calculating weighted frequency of each word
max_frequncy = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word] / max_frequncy)
# Calculating sentence scores as the sum of the weighted frequencies of their words
sentence_scores = {}
for sent in slist:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
# Select the seven highest-scoring sentences as the summary
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
# Displaying the summarized text
print('\n\n Summarized Text : \n', summary)
Output:
Summarized Text :
It is one of six government space agencies in the world which
possess full launch capabilities, deploy cryogenic engines, launch
extra-terrestrial missions and operate large fleets of artificial
satellites. Polar Satellite Launch Vehicle or PSLV is the first
medium-lift launch vehicle from India which enabled India to launch
all its remote-sensing satellites into Sun-synchronous orbit.
Parallelly, another solid fueled rocket Augmented Satellite Launch
Vehicle based upon SLV-3 was being developed technologies to launch
satellites into geostationary orbit. It undertakes the design and
development of space rockets, satellites, explores upper atmosphere
and deep space exploration missions. Augmented or Advanced Satellite
Launch Vehicle (ASLV) was another small launch vehicle realised in
1980s to develop technologies required to place satellites into
geostationary orbit. ISRO is the primary agency in India to perform
o Every business sector nowadays follows trend analysis for the prediction of its profitability in the upcoming years.
o This has made trend analysis a vital research topic, focused on generating more accurate predictions of the future.
3. Temporal Trend Analysis:
o This trend analysis allows one to examine and model the change in the value of a
feature or variable in a dataset over time.
o The best example of temporal trend analysis is time-series analysis, which is discussed in detail below.
o Time series analysis deals with statistical data that are placed in chronological
order, that is, in accordance with time.
o It deals with two variables, one being time and the other being a particular
phenomenon.
• Time series can be constituted by three components namely short-term movement
(periodic changes), long-term movement (secular trend), and random movement
(irregular trend).
• Time series modeling and forecasting have vital significance in various practical
analysis domains. A good amount of real research work is being done on a daily basis
in this research area for several years.
• The collected data for time series analysis is considered as the past observations based
on which forecasting is done for analyzing future trends.
• Various fitting models have been developed for time series forecasting that suits well
for a particular area such as business, finance, engineering, and economics. One of the
important characteristics of the time series is its stationarity.
• A time series is considered stationary if its behavior does not change over time. This
indicates that the observed values vary about the same level, i.e., the variability is
constant over time.
• In turn, the statistical properties of the stationary time series such as the mean and
variance also remain constant over time.
• However, most of the time-series problems that are encountered are non-stationary. Non-stationary time series do not have a constant mean or variance and follow a trend by drifting either upward or downward.
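• A simple way to inspect stationarity informally is to compare the rolling mean and standard deviation of a series over time; the sketch below (illustrative only, with synthetic data) uses pandas for this.
import numpy as np
import pandas as pd

# synthetic monthly series with an upward drift, i.e. non-stationary
values = pd.Series(np.arange(36) + np.random.normal(0, 2, 36),
                   index=pd.date_range("2020-01-01", periods=36, freq="MS"))
rolling_mean = values.rolling(window=12).mean()
rolling_std = values.rolling(window=12).std()
print(rolling_mean.dropna().head())   # a steadily rising mean indicates a trend
print(rolling_std.dropna().head())    # variability stays roughly constant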
• For example, social media data can be analyzed to gain insights into issues, trends,
influential actors and other kinds of information.
• Social media analytics tools are used for gathering data from social platforms to help
guide marketing strategies. Examples of some of the Social Media analytical tools
include Keyhole, AgoraPulse, Google Analytics and many more.
• Social media data are governed by the properties of high volume, high velocity, as well
as high variety.
• This makes social media complex to deal with though it has a plus point as it carries
valuable insights that can be analyzed for fruitful decision-making.
• Some of the major challenges faced in dealing with such complex data in social media
analytics are explained below:
1. Unstructured Data: Unstructured data (or unstructured information) refers to information that does not have a predefined data model and is not organized in a predefined manner. Social media data can be of various forms – textual (e.g.,
comments, tweets, etc.), graphical (e.g., images, videos, etc.), action-oriented
(clicking like button, accepting friend-request, etc.), and relation-based (e.g.,
friends, followers, etc.). This makes social media data to be highly unstructured
and poses a challenge to the data analyst or data scientist for intelligent decision-
making. The unstructured and uncertain nature of this kind of big data presents a
new kind of challenge: how to evaluate the quality of data and manage the value of
data within data analytics. Shared files, images, videos, audio, comments, and
messages are all unstructured data types.
2. High Volume and Velocity of Data: The volume refers to the size of Data. Velocity
refers to the speed at which the data is getting accumulated. Social media data gets
generated every flicker of a second and capturing and analyzing data that is high
in volume and velocity is a real challenge. Imagine the number of likes given in
Facebook posts by users per second and the number of tweets posted by users on
Twitter. Capturing and analyzing such bulk amounts of data requires special
sophisticated tools that are often used by data analysts or data scientists to
generate required results.
3. Diversity of Data: Both social media users and the data these users generate are
diverse. The users belong to various cultures, regions, and countries and the data
generated are of various data types and multilingual. Not every data is crucial to
be studied and analyzed. Finding and capturing only important content from such
noisy diverse data is again a challenging and time-consuming task.
4. Organizational Level Issues: Nowadays, many organizations are spending huge amounts of money on developing their resources for collecting, managing and analyzing data sets from social media. However, they do not clearly
understand how to ethically use social media analytics. Most organizations lack
ethical data control practices like well-defined standards and procedures for
sourcing, analyzing and sharing big data.
• To meet up the above-mentioned challenges related to social media data, a large
number of tools have been developed and are still undergoing better advancements.
• Each tool may prove to be advantageous in dealing with one of the layers of social
media analytics.
• For example, the Netminer tool is often used for dealing with social network data, the
Google Analytics tool is powerful in dealing with action-oriented social media data,
and the Lexalytics tool is good for handling textual social media data.
PRACTICE QUESTIONS
Q. I Multiple Choice Questions:
1. Which refers to the means of interactions among people in which they create,
share, and/or exchange information and ideas in virtual communities?
(a) Social media (b) Social communication network
(c) Social data transfer (d) All of the mentioned
2. Web based social network services include,
(a) Twitter (b) Facebook
(c) LinkedIn (d) All of the mentioned
3. Which is a type of complex network and can be described as a social structure
composed of a set of social users with social interactions between them?
(a) social media (b) social network
(c) social relation (d) None of the mentioned
4. Which is the process of tracking, collecting and analyzing data from social
networks?
(a) Social media analysis (b) Social network analytics
(c) Social media analytics (d) None of the mentioned
5. Which is the automatic discovery of new, previously unknown, information from
unstructured textual data?
(a) text analytics (b) graphics analytics
(c) video analytics (d) data analytics
6. Social media APIs includes,
(a) Facebook API (b) YouTube API
(c) Twitter API (d) All of the mentioned
7. How many layers play a vital role in contributing to social media input for gaining
useful insights?
(a) 5 (b) 6
(c) 7 (d) 9
8. Which is a collection of techniques for working with human language?
(a) NDP (b) NLP
(c) NML (d) NDLP
9. Methods in trend analysis includes,
(a) geographic trend analysis (b) intuitive trend analysis
(c) temporal trend analysis (d) All of the mentioned
10. Which is the process of identifying the most important meaningful information in
a document and compressing them into a shorter version preserving its overall
meanings?
(a) Text reporting (b) Text observations
(c) Text summarization (d) Text planning
11. Which analytics mainly involves determining the possible drifts or trends over a
period of time?
(a) Trend (b) Social media
(c) Social network (d) None of the mentioned
12. Which of the following tools are used to handle the challenges related to social media data?
(a) Google Analytics tool (dealing with action-oriented social media data)
(b) Lexalytics tool (handling textual social media data)
(c) Netminer tool (dealing with social network data)
(d) All of the mentioned
13. Which is the problem of predicting the existence of a link between two entities in a
social network?
(a) Link observation (b) Link prediction
(c) Link analysis (d) Link analytics
14. Which can be used in machine learning to detect groups with similar properties
and extract groups for various reasons?
(a) Community transfer (b) Community processing
(c) Community detection (d) None of the mentioned
15. Which analytics is carried out to gain insight from the geographic content of social
media data?
(a) Location (b) Community
(c) Text (d) Trend
16. Which analysis means to identify the view or emotion behind a situation?
(a) Trend (b) Location
(c) Sentiment (d) None of the mentioned
17. Which uses data (unstructured or semistructured) from a variety of sources such
as Media, Web and so on?
(a) Location mining (b) Trend mining
(c) Sentiment mining (d) Text mining
Answers
1. (a) 2. (d) 3. (b) 4. (c) 5. (a) 6. (d) 7. (c) 8. (b) 9. (d) 10. (c)
11. (a) 12. (d) 13. (b) 14. (c) 15. (a) 16. (c) 17. (d)
Q. II Fill in the Blanks:
1. _______ media are interactive technologies that allow the creation or
sharing/exchange of information, ideas, interests, and other forms of expression
via virtual communities and networks.
2. _______ analytics applications require clear, interpretable results and actionable
outcomes to achieve the desired result.
3. Data _______ means valid data identification.
4. In POS we identify whether words in a sentence are nouns, verbs, adjectives etc.,
and can be done in _______ for NLP.
5. Social media has gained significant popularity and become an essential medium of
_______.
6. The text _______ also referred to as text data mining or text mining is the process of
obtaining high-quality information from text.
7. Social media analytics is the process of gathering and analyzing data from social
_______ such as Facebook, Instagram, LinkedIn and Twitter.
8. _______ analysis is the set of activities that assist in transforming raw data into meaningful insights.
9. Interactive data visualization has led us to a new era of revolution where _______
have led us to easy analysis and decision making.
10. The textual _______ of social media mainly include tweets, textual posts, comments,
status updates, and blog posts.
11. SNA is the process of investigating social _______ through the use of networks and
graph theory.
12. _______ structure mining uses dynamic data that constantly keeps changing with
time.
13. _______ detection techniques are useful for social media algorithms to discover
people with common interests and keep them tightly connected.
Answers
1. Social 2. Text 3. capture 4. NLTK
5. communication 6. analytics 7. networks 8. Data
9. graphics 10. messages 11. structures 12. Dynamic
13. Community 14. Topic 15. clustering 16. expert
17. NLP 18. tokens 19. vector 20. TF-IDF
21. n-gram 22. Stop 23. same 24. tuples
25. insights 26. classification 27. summary 28. tools
29. volume 30. sentiment 31. search
Q. III State True or False:
1. Social media refers to the means of interactions among people in which they
create, share, and/or exchange information and ideas in virtual communities such
as public posts, video chat etc. and networks.
2. Social media analytics is the ability to gather and find meaning in data gathered
from social networks.
3. Data identification is the process of identifying the subsets of available data to
focus on for analysis.
4. The social media analysis process which comprises of three stages namely data
capturing, data understanding and data presentation.
5. Geographic trend analysis is mainly involved in analyzing the trend of products,
users or other elements within or across geographic locations.
6. Social media network analytics focuses on the networking structure of the social
media data which indicates the connection between users based on the concept of
friends and followers.
7. Most of the social media Websites such as Facebook, Twitter, LinkedIn, YouTube
etc., have their APIs (Application Programming Interfaces) that can be used by
data analysts to access the social media data found in these sites.
8. Search engine analytics pays attention to analyzing historical search data to
generate informative search engine statistics.
9. Tokenization setting splits the documents into words constructing a word vector
known as Bag-of-Words (BoW).
10. TF-IDF stands for the Term Frequency-Inverse Document Frequency.
11. Stemming is a technique used to extract the base form of the words by removing
affixes from them.
12. Location analysis is used to identify the emotions conveyed by the unstructured
text.
13. Text analytics helps analysts extract meanings, patterns, and structure hidden in
unstructured textual data.
14. Rule based text classification categorizes text into organized clusters by using a set
of linguistic rules.
15. Historical trends are analyzed to determine future trends for a given phenomenon
or feature.
16. Structured data refers to information that either does not have a predefined data
format or model.
17. The volume refers to the size of Data. Velocity refers to the speed at which the data
is getting accumulated.
18. Sentiment analysis is very commonly used in social media analytics by considering
as input the comments, tweets, reviews, discussions, emails or feedbacks provided
in social media by several online users.
19. The cycle of social media analytics consists of six steps namely, identification,
extraction, cleaning, analyzing, visualization, and interpretation.
20. Influence maximization is the problem of finding a small subset of nodes (seed
nodes) in a social network that could maximize the spread of influence.
21. NLP enables computers to understand and process human language.
22. Expert finding is concerned about finding persons who are knowledgeable on a
given topic.
23. Social media data is the information that is collected from an organization's profiles across different social media networks.
Answers
1. (T) 2. (T) 3. (T) 4. (F) 5. (T) 6. (T) 7. (T) 8. (T) 9. (T) 10. (T)
11. (T) 12. (F) 13. (T) 14. (T) 15. (T) 16. (F) 17. (T) 18. (F) 19. (T) 20. (T)
21. (T) 22. (T) 23. (T)
Q. IV Answer the following Questions:
(A) Short Answer Questions:
1. Define social media.
2. What is the purpose of social media?
3. Define text analytics.
4. Define social media analytics.
5. Define social network.
6. List social media sites.
7. Define location text analytics.
8. Define tokenization.
9. What is the purpose of n-grams?