DataAnalytics_LabBook
Modern College of Arts, Science and Commerce (Autonomous),
Ganeshkhind, Pune-16
LAB COURSE I
SECTION I
DATA ANALYTICS
T.Y.B.SC.
(COMPUTER SCIENCE)
SEMESTER-VI
Name
College Name
Academic Year
HOD
Teacher In charge Dept. of Computer Science
Prepared by:
Ms. Prerana Sarode, Modern College of Arts, Science and Commerce (Autonomous), Ganeshkhind, Pune-16
Prof. Kumod Sapkal, Modern College of Arts, Science and Commerce (Autonomous), Ganeshkhind, Pune-16
Table of Contents
Assignment 1: Linear and Logistic Regression
No. of slots: 02
Objectives
Apply appropriate analytic techniques and tools to analyze data, create models, and identify insights
that can lead to actionable results.
Apply modeling and data analysis techniques, specifically linear and logistic regression, to the solution of real-world business problems.
Reading
You should read the following topics before starting this exercise
The modeling process, Engineering features and selecting a model, Training the model, Validating
the model, Predicting new observations
Types of machine learning
Regression models
Concept of classification, clustering and reinforcement learning
Ready Reference and Self Activity
Machine Learning -
Machine learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experiences on
its own. The term machine learning was first introduced by Arthur Samuel in 1959.
Definition: Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning algorithms
build a mathematical model that helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics together for creating predictive
models.
Machine learning can be classified into three types:
1. Supervised learning 2. Unsupervised learning 3. Reinforcement learning
1) Supervised Learning - Supervised learning is a type of machine learning method in which we provide
sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output.
2) Unsupervised Learning - Unsupervised learning is a learning method in which a machine learns without
any supervision.
3) Reinforcement Learning - Reinforcement learning is a feedback-based learning method, in which a
learning agent gets a reward for each right action and gets a penalty for each wrong action. The agent learns
automatically with these feedbacks and improves its performance. In reinforcement learning, the agent
interacts with the environment and explores it. The goal of an agent is to get the most reward points, and
hence, it improves its performance.
Regression Analysis-
Regression analysis is a statistical method to model the relationship between a dependent (target)
variable and one or more independent (predictor) variables.
Regression analysis helps us to understand how the value of the dependent variable changes
corresponding to an independent variable when the other independent variables are held fixed.
Regression is a supervised learning technique which helps in finding the correlation between variables
and enables us to predict the continuous output variable based on one or more predictor variables.
It is mainly used for prediction, forecasting, time series modeling, and determining the cause-and-effect
relationship between variables.
In Regression, we plot a graph between the variables which best fits the given datapoints, using this
plot, the machine learning model can make predictions about the data.
"Regression shows a line or curve that passes through all the datapoints on target-predictor graph
in such a way that the vertical distance between the datapoints and the regression line is minimum."
The distance between datapoints and line tells whether a model has captured a strong relationship or
not.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each type has its
own importance on different scenarios, but at the core, all the regression methods analyze the effect of the
independent variable on dependent variables. Here in this assignment we will learn Linear Regression and
Logistic Regression in detail.
Linear Regression:
Linear regression is a statistical regression method which is used for predictive analysis.
It is one of the simplest and easiest regression algorithms; it models the relationship
between continuous variables.
It is used for solving the regression problem in machine learning.
Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence called linear regression.
If there is only one input variable (x), then such linear regression is called simple linear regression.
And if there is more than one input variable, then such linear regression is called multiple linear
regression.
The relationship between variables in a linear regression model can be visualized with a scatter plot
and a fitted straight line. For example, we can predict the salary of an employee on the basis of years of experience.
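As an illustrative sketch (not part of the original exercise), a simple linear regression on a small made-up experience/salary sample could be fitted with scikit-learn as follows :
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up sample data : years of experience (X) and salary (y).
years_experience=np.array([1,2,3,4,5,6,7,8]).reshape(-1,1)
salary=np.array([30000,35000,41000,46000,52000,58000,63000,70000])

regressor=LinearRegression()
regressor.fit(years_experience,salary) # Train the model on the sample data.
print("Slope : ",regressor.coef_[0]) # Increase in salary per year of experience.
print("Intercept : ",regressor.intercept_)
print("Predicted salary for 10 years : ",regressor.predict([[10]])[0])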
Logistic Regression:
Logistic regression is another supervised learning algorithm which is used to solve the classification
problems. In classification problems, we have dependent variables in a binary or discrete format such
as 0 or 1.
Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or
False, Spam or not spam, etc.
It is a predictive analysis algorithm which works on the concept of probability.
Logistic regression is a type of regression, but it differs from the linear regression algorithm in
terms of how it is used.
Logistic regression uses the sigmoid (logistic) function to map predicted values to probabilities
between 0 and 1. This sigmoid function is used to model the data in logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values
below the threshold level are rounded down to 0.
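As a small illustrative sketch of the sigmoid and threshold idea (the values below are arbitrary, not taken from the lab exercises) :
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z)) # Maps any real value into the range (0, 1).

scores=np.array([-3.0,-0.5,0.0,1.2,4.0])
probabilities=sigmoid(scores)
predictions=(probabilities>=0.5).astype(int) # Threshold level of 0.5
print("Probabilities : ",probabilities)
print("Predicted classes : ",predictions) # [0 0 1 1 1]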
Self-Activity
6. Residual analysis (check the results of model fitting to know whether the model is satisfactory)
plt.scatter(X_test,y_test,color="green") # Plot a graph of X_test vs y_test
plt.plot(X_train,regressor.predict(X_train),color="red",linewidth=3) # Regression line
plt.title('Regression (Test Set)')
plt.xlabel('HP')
plt.ylabel('MSRP')
plt.show()
Here we plot a scatter plot of X_test against y_test and draw the regression line over it.
plt.scatter(X_train,y_train,color="blue") # Plot a graph of X_train vs y_train
plt.plot(X_train,regressor.predict(X_train),color="red",linewidth=3) # Regression line
plt.title('Regression (Training Set)')
plt.xlabel('HP')
plt.ylabel('MSRP')
plt.show()
Sample Example -
The goal is to build a logistic regression model in Python in order to determine whether candidates would get
admitted to a prestigious university.
Here, there are two possible outcomes: Admitted (represented by the value of ‘1’) vs. Rejected (represented
by the value of ‘0’).
You can then build a logistic regression in Python, where:
The dependent variable represents whether a person gets admitted; and
The 3 independent variables are the GMAT score, GPA and Years of work experience
2. Reading and understanding the data (eventually do appropriate transformations: cleaning, filling nulls, removing duplicates, etc.)
data = pd.read_csv(r"C:\TYBSC\Student_Score.csv") # dataset
6. Print the test data and the predicted data (predictions on the test set).
Diving deeper into the results, print two components in the Python code:
print (x_test)
print (y_pred)
The prediction was also made for those 10 records (where 1 = admitted, while 0 = rejected):
In the actual dataset (from step-1), you’ll see that for the test data, we got the correct results 8 out of 10
times:
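The complete script for this example is not reproduced here; a minimal sketch of the workflow, assuming a CSV file with the hypothetical columns gmat, gpa, work_experience and admitted, could look like this :
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical file and column names, used only for illustration.
candidates=pd.read_csv("candidates.csv")
X=candidates[['gmat','gpa','work_experience']] # Independent variables
y=candidates['admitted'] # Dependent variable (1 = admitted, 0 = rejected)

x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

logistic_regression=LogisticRegression()
logistic_regression.fit(x_train,y_train)
y_pred=logistic_regression.predict(x_test)

print(confusion_matrix(y_test,y_pred)) # Correct vs. wrong predictions on the test set
print("Accuracy : ",accuracy_score(y_test,y_pred)) # e.g. 0.8 for 8 correct out of 10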
SET B
2. Use the iris dataset. Write a Python program to view some basic statistical details like percentile,
mean, std etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'. Apply logistic
regression on the dataset to identify the different species (setosa, versicolor, virginica) of Iris
flowers given just 4 features: sepal and petal lengths and widths. Find the accuracy of the
model.
Assignment Evaluation
Assignment 2: Frequent Itemset Mining and Association Rules using Apriori
Objectives
● To understand the impact of finding frequent patterns from large datasets.
● To learn the Apriori Algorithm which is used for frequent itemsets mining.
● To understand Association Rule Mining.
● To write and learn implementation of such concepts with Python.
Reading
You should read the following topics before starting this exercise:
Ready Reference
Frequent Itemset Mining: Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases, relational databases, and
other information repositories.
Association Mining searches for frequent items in the data-set. In frequent mining usually the
interesting associations and correlations between item sets in transactional and relational
databases are found.
If two items X and Y are purchased frequently, then it is good to put them together in stores
or provide a discount offer on one item on purchase of the other item. This can really increase
sales. For example, it is likely to find that if a customer buys milk and bread, he/she also
buys butter. So the association rule is [‘milk’, ‘bread’] => [‘butter’].
Applications: Market Basket Analysis is one of the key techniques used by large retailers to
uncover associations between items. Other applications include catalog design, loss-leader analysis,
clustering, classification, recommendation systems, etc.
Apriori is an algorithm for frequent item set mining and association rule learning over
relational databases. The name of the algorithm is based on the fact that the algorithm uses
prior knowledge of frequent itemset properties. Apriori employs an iterative approach known
as a levelwise search, where k-itemsets are used to explore (k+1)-itemsets
Support:
The support of an item I is defined as the ratio of the number of transactions containing the
item I to the total number of transactions, expressed as:
Support(I) = (Number of transactions containing I) / (Total number of transactions)
Confidence:
This is measured by the proportion of transactions with item I1 in which item I2 also appears,
i.e. given that the item on the left hand side (antecedent) is purchased, how often the item on the
right hand side (consequent) is also purchased. It is expressed as:
Confidence(I1 => I2) = (Number of transactions containing both I1 and I2) / (Number of transactions containing I1)
Lift:
Lift is the ratio between the confidence of the rule and the support of the consequent, expressed as:
Lift(antecedent => consequent) = Confidence(antecedent => consequent) / Support(consequent)
Lift(antecedent => consequent) = 1 means that there is no correlation within the itemset; > 1
means that there is a positive correlation within the itemset, i.e., products in the itemset
(antecedent and consequent) are more likely to be bought together; < 1 means that there is a
negative correlation within the itemset, i.e., products in the itemset (antecedent and consequent)
are unlikely to be bought together.
The general steps for association rule mining with Apriori are:
1. Define the minimum support and confidence for the association rule
2. Take all the subsets in the transactions with higher support than the minimum support
3. Take all the rules of these subsets with higher confidence than minimum confidence
4. Sort the association rules in the decreasing order of lift.
5. Visualize the rules along with confidence and support.
In this assignment you will analyze collections of market baskets and will determine frequent
itemsets and association rules present in the collections.
Python libraries
Python has many libraries for apriori implementation.
i. Mlxtend (apriori)
ii. Apyori (apriori)
iii. pypi (efficient_apriori)
The apriori module from mlxtend library provides fast and efficient apriori implementation.
Parameters
df : One-Hot-Encoded DataFrame or DataFrame that has 0 and 1 or True and False as
values
min_support : Floating point value between 0 and 1 that indicates the minimum support
required for an itemset to be selected.
(Number of observations with item) / (Total number of observations)
use_colnames : This preserves the column names for the itemsets, making the output more
readable.
max_len : Max length of itemset generated. If not set, all possible lengths are evaluated.
verbose : Shows the number of iterations if >= 1 and low_memory is True. If >= 1 and
low_memory is False, shows the number of combinations.
low_memory : If True, uses an iterator to search for combinations above min_support. Note that
low_memory=True should only be used for large datasets when memory resources are limited,
because this implementation is approx. 3-6x slower than the default.
The function returns a pandas DataFrame with columns ['support', 'itemsets'] of all itemsets
whose support is >= min_support, up to the maximum itemset length max_len (if max_len is not None).
Leverage computes the difference between the observed frequency of A and C appearing
together and the frequency that would be expected if A and C were independent. A leverage
value of 0 indicates independence.
A high conviction value means that the consequent is highly dependent on the antecedent.
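As a hedged end-to-end sketch of how these parameters and metrics fit together, the mlxtend apriori and association_rules functions can be applied to a tiny made-up list of transactions :
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Tiny made-up market baskets, used only for illustration.
transactions=[['milk','bread','butter'],
              ['milk','bread'],
              ['bread','butter'],
              ['milk','bread','butter'],
              ['milk','jam']]

# One-hot encode the transactions into a True/False DataFrame.
encoder=TransactionEncoder()
onehot=encoder.fit(transactions).transform(transactions)
df=pd.DataFrame(onehot,columns=encoder.columns_)

# Frequent itemsets above the minimum support threshold.
frequent_itemsets=apriori(df,min_support=0.4,use_colnames=True)

# Association rules filtered on minimum confidence and sorted by lift.
rules=association_rules(frequent_itemsets,metric="confidence",min_threshold=0.6)
rules=rules.sort_values('lift',ascending=False)
print(rules[['antecedents','consequents','support','confidence','lift']])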
Self-Activity
Dataset Sources
https://www.kaggle.com/datasets/sivaram1987/association-rule-learningapriori
https://github.com/shivang98/Market-Basket-Optimization
https://www.kaggle.com/datasets/hemanthkumar05/market-basket-optimization
https://www.kaggle.com/datasets/irfanasrullah/groceries
Lab Assignments
SET A:
1. Create the following dataset in python
SET B:
SET C:
Write a Python program to implement the apriori algorithm. Test the code on any standard dataset.
Assignment Evaluation
Assignment 3: Text Analytics and Sentiment Analysis
Objectives
To understand the concept of sentiment analysis.
To learn various methodologies for analysis on text including text analytics, tokenization, frequency
distribution, stopwords, stemming, lemmatization, part-of-speech tagging.
To write Python scripts using various libraries for sentiment analysis with the natural language processing
toolkit, classifying emotions on the basis of labels, i.e. Positive, Negative and Neutral. Also, to use the
wordcloud package for word comparison.
To perform analysis on social media data such as Facebook, Twitter, YouTube.
To graphically represent the analyzed data.
Reading
You should read the following topics before starting the exercise :
The need for data analysis using natural language processing. Basics of Python libraries such as
pandas, matplotlib, numpy, scikit-learn, nltk and the VADER tool used to perform the data analysis.
Ready Reference
Python Libraries for performing text and Sentiment Analysis :
Natural Language Toolkit (NLTK) :
NLTK is a Python Package for performing Natural Language Processing on human language data which is
mostly unstructured. It mainly focuses on analyzing textual data. It supports different natural language
processing algorithms such as Tokenization, Frequency Distribution, Stopwords, Lexicon Normalization,
Stemming, Lemmatization, POS Tagging. These are considered as pre-processing steps to perform text
analytics.
Installation of NLTK : You can use any IDE to perform Python programming for the following tasks. Here
the Spyder IDE is used. Install the package with pip install nltk, then run import nltk followed by
nltk.download(). A downloader window will appear; click on download to download all the supporting NLTK packages.
You can also download all NLTK packages using the Python statement :
nltk.download('all')
If all the packages are not needed, then individual packages can also be installed by passing the
package name to nltk.download().
Syntax : nltk.download('package_name')
For example : nltk.download('punkt')
Tokenization : It is the first step to perform text analytics. Tokenization means breaking down a textual
paragraph into small chunks such as words or sentences. It is classified into two sections :
Sentence Tokenization and Word Tokenization : Sentence Tokenization breaks the text into sentences
whereas Word Tokenization breaks the text into words.
Example :
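(A minimal sketch, assuming the same sample paragraph that is used in the frequency distribution example below :)
# Import sentence and word tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize
paragraph_text="""Hello all, Welcome to Python Programming Academy. Python
Programming Academy is a nice platform to learn new programming skills. It is
difficult to get enrolled in this Academy."""
# Sentence Tokenization
print("Tokenized Sentences :\n",sent_tokenize(paragraph_text),"\n")
# Word Tokenization
print("Tokenized Words :\n",word_tokenize(paragraph_text))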
Output :
Tokenized Sentences :
['Hello all, Welcome to Python Programming Academy.', 'Python Programming Acade
my is a nice platform to learn new programming skills.', 'It is difficult to get
enrolled in this Academy.']
Tokenized Words :
['Hello', 'all', ',', 'Welcome', 'to', 'Python', 'Programming', 'Academy', '.',
'Python', 'Programming', 'Academy', 'is', 'a', 'nice', 'platform', 'to', 'learn'
, 'new', 'programming', 'skills', '.', 'It', 'is', 'difficult', 'to', 'get', 'en
rolled', 'in', 'this', 'Academy', '.']
Frequency Distribution :
The frequency distribution helps to understand how many times each word occurs in the
given textual data.
Example :
# Import word_tokenize
from nltk.tokenize import word_tokenize
# Import FreqDist package belonging to nltk.probability
from nltk.probability import FreqDist
# Textual data for word tokenization
paragraph_text="""Hello all, Welcome to Python Programming Academy. Python
Programming Academy is a nice platform to learn new programming skills. It is
difficult to get enrolled in this Academy."""
# Word Tokenization
tokenized_words=word_tokenize(paragraph_text)
# Pass the tokenized words to FreqDist
frequency_distribution=FreqDist(tokenized_words)
print(frequency_distribution)
Output :
<FreqDist with 24 samples and 32 outcomes>
To find most common words using Frequency Distribution, add the following lines in above code :
print(frequency_distribution.most_common(2))
Output :
Stopwords : Stopwords are considered as noise in textual data. For example, if the text contains words
such as is, are, am, a, this, the, an etc., then they are treated as stopwords.
These stopwords need to be removed from the actual text for further processing. Using NLTK, first identify
and create a list of stopwords in the given text, then remove them from the original content. Before working with
stopwords, make sure to download them by using the following :
import nltk
nltk.download('stopwords')
To list all the stopwords :
from nltk.corpus import stopwords
# It will find the stopwords in the English language.
stop_words_data=set(stopwords.words("english"))
print(stop_words_data)
Output :
{'wouldn', 'down', 'was', 'any', 'themselves', 'on', 'how', 'y', 'them', 'do',
'as', "couldn't", 'wasn', 'can', 'yourself', "mightn't", 'm', "wasn't", 'yours',
"haven't", 'have', 'their', 'from', 'with', 'through', 'been', 'couldn', 'here',
'your', 'above', 'same', 'ours', 'now', 'isn', 'that', 'just', 'further',
'only', "won't", 'having', 'these', 'won', 'himself', 'ourselves', 'which',
"you're", 'while', 'of', "doesn't", "should've", "mustn't", 'hadn', 'are',
'not', 'he', 'she', 'am', 'an', 'most', 'whom', 'where', 'than', 'didn',
"isn't", 'shouldn', 'what', 'mustn', 'some', 'very', 'should', 'ain', "you'd",
'yourselves', 'own', 'but', 'we', 't', 'out', 'such', 'in', 've', 'this',
'shan', 'about', 'over', 'both', 'all', 'why', 'i', 'being', "wouldn't", 'll',
'myself', 'between', 'has', "didn't", 'hers', 'hasn', "she's", 'other', 'if',
'itself', 'below', "aren't", 'too', 'under', 'herself', 'be', 'after', 'off',
're', 'during', 'until', 'our', "shouldn't", 'into', 'don', 'again', 'nor',
'needn', "that'll", "weren't", 'no', 'so', 'then', 'before', 'his', 'its',
'few', 'doing', "don't", "you'll", "hadn't", 'because', 'there', 'did', 'my',
"needn't", "it's", 'they', 'for', 'does', 'is', 'a', 'against', 'who', 'and',
"shan't", 'o', 'weren', 'him', 'or', 'theirs', 'were', 'had', 'doesn', 'you',
'haven', 'those', 'me', 'when', 's', 'd', 'it', 'up', 'by', 'each', 'once',
'aren', "you've", 'her', "hasn't", 'to', 'more', 'will', 'mightn', 'the', 'at',
'ma'}
Removing Stopwords :
The above words in the output are predefined stopwords in the English language. If any of these words occur in
user-defined textual data, they can be removed as follows :
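(A minimal sketch, assuming the same sample paragraph as earlier; it builds the filtered_words_list that is reused in the stemming example further below :)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
paragraph_text="""Hello all, Welcome to Python Programming Academy. Python
Programming Academy is a nice platform to learn new programming skills. It is
difficult to get enrolled in this Academy."""
stop_words_data=set(stopwords.words("english"))
tokenized_words=word_tokenize(paragraph_text)
# Keep only the words that are not stopwords.
filtered_words_list=[]
for word in tokenized_words:
    if word.lower() not in stop_words_data:
        filtered_words_list.append(word)
print("Tokenized Words : \n",tokenized_words,"\n")
print("Filtered Words : \n",filtered_words_list,"\n")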
Output :
Tokenized Words :
Filtered Words :
Stemming : Stemming is a process of linguistic normalization that reduces words to their root form by
removing derivational affixes. For example : writing, wrote, written can be stemmed or reduced to write.
Example :
# Same code as previous example to remove stop words from tokenized words
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
stemmed_text_words=[]
for words in filtered_words_list:
    stemmed_text_words.append(porter_stemmer.stem(words))
print("Filtered Words : \n",filtered_words_list,"\n")
print("Stemmed Words : \n",stemmed_text_words,"\n")
Lemmatization : Lemmatization is a process of reducing words to their base words (lemmas), which are
linguistically correct forms. For example : the word “Running” will be lemmatized to “run”. Before that,
download the package “wordnet” belonging to nltk as follows :
import nltk
nltk.download('wordnet')
# Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
word_text="running"
print("Lemmatized Word : ",lemmatizer.lemmatize(word_text,"v"))
Output :
Lemmatized Word : run
POS Tagging : The POS (Part-of-Speech) tagging is basically used to identify the grammatical group of
the given words, i.e. Noun, Pronoun, Verb, Adjective, Adverb etc., on the basis of their context.
Before that download the package “averaged_perceptron_tagger” belonging to nltk as follows :
import nltk
nltk.download('averaged_perceptron_tagger')
# Part-of-Speech Tagging
import nltk
from nltk.tokenize import word_tokenize
text_data="Hello all, Welcome to Python programming"
tokenized_data=word_tokenize(text_data)
print(nltk.pos_tag(tokenized_data))
Output :
[('Hello', 'NNP'), ('all', 'DT'), (',', ','), ('Welcome', 'NNP'), ('to', 'TO'),
('Python', 'NNP'), ('programming', 'NN')]
Text Summarization :
Text summarization is an NLP technique that extracts text from a large amount of data. It is the process of
identifying the most important meaningful information in a document and compressing it into a shorter
version by preserving its meaning. Types: Extractive summarization and Abstractive summarization
To perform extractive summarization, we calculate the sentence weights and choose the first ‘n’ sentences
with maximum weight. The weights are calculated on the basis of the word frequencies
Steps:
1. Preprocess the text
2. Create the word frequency table
3. Tokenize the sentence
4. Score the sentences: Term frequency
5. Generate the summary
Sample code
import nltk
nltk.download('all')
#Preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample paragraph to summarize (any text paragraph can be used here).
text="""Hello all, Welcome to Python Programming Academy. Python Programming
Academy is a nice platform to learn new programming skills. It is difficult to
get enrolled in this Academy."""

# Remove special characters and digits, then lowercase and tokenize into words.
formatted_text=re.sub('[^a-zA-Z]',' ',text)
formatted_text=re.sub(r'\s+',' ',formatted_text).lower()
stopWords=set(stopwords.words("english"))
words=word_tokenize(formatted_text)
# Creating a frequency table of words
wordfreq = {}
for word in words:
    if word in stopWords:
        continue
    if word in wordfreq:
        wordfreq[word] += 1
    else:
        wordfreq[word] = 1
#Compute the weighted frequencies
maximum_frequency = max(wordfreq.values())
for word in wordfreq.keys():
    wordfreq[word] = (wordfreq[word]/maximum_frequency)
# Creating a dictionary to keep the score of each sentence
sentences = sent_tokenize(text)
sentenceValue = {}
for sentence in sentences:
    for word, freq in wordfreq.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
import heapq
summary = ''
summary_sentences = heapq.nlargest(4, sentenceValue, key=sentenceValue.get)
summary = ' '.join(summary_sentences)
print(summary)
Sentiment Analysis using VADER : VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and
rule-based sentiment analysis tool available in NLTK. Before using it, download the VADER lexicon :
import nltk
nltk.download('vader_lexicon')
Examples : Let’s consider some text statements expressing different emotions and analyzing them using
VADER.
Example 1 :
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader_analyzer=SentimentIntensityAnalyzer()
text1="I am feeling good" # The text is positive.
print(vader_analyzer.polarity_scores(text1))
Output :
{'neg': 0.0, 'neu': 0.185, 'pos': 0.815, 'compound': 0.5267}
It has given the ‘pos’ value as 0.815, which is the maximum of all the values, since the statement is positive.
Similarly, we can check it on other emotions as well.
Example 2 :
Output :
Example 3 : Consider the following example to get the overall rating about a statement
i.e. overall whether it is positive, negative or neutral.
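(A minimal sketch, using the conventional compound-score thresholds of +0.05 and -0.05; the statement itself is only a sample :)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

vader_analyzer=SentimentIntensityAnalyzer()
text3="The movie was good but the ending was disappointing" # Sample statement
scores=vader_analyzer.polarity_scores(text3)
print(scores)
# Classify the overall rating from the compound score.
if scores['compound']>=0.05:
    print("Overall Rating : Positive")
elif scores['compound']<=-0.05:
    print("Overall Rating : Negative")
else:
    print("Overall Rating : Neutral")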
Output :
Word cloud is basically a data visualization technique to represent the textual content where the size of each
visualized word implies its importance, frequency and intensity. It is a good tool to visualize the text and
perform sentiment analysis to find the frequency of words having positive, negative or neutral emotions.
Now to perform sentiment analysis on the above dataset and create a wordcloud, consider the following code :
(Here, we will represent Positive words with green color, Negative words with red color and Neutral words
with white color)
# Create dictionaries to store positive and negative words with polarity.
positive_words=dict()
negative_words=dict()
# Sentiment Analysis
sentiment_analyzer=SentimentIntensityAnalyzer()
for i in words:
    if not i.lower() in stop_words_data: # It will remove stopwords.
        polarity=sentiment_analyzer.polarity_scores(i)
        if polarity['compound']>=0.05: # Positive Sentiment
            positive_words[i]=polarity['compound']
        if polarity['compound']<=-0.05: # Negative Sentiment
            negative_words[i]=polarity['compound']
# Append the positive and negative words from the dictionaries to the lists positive[] and negative[]
positive=[]
negative=[]
for key,value in positive_words.items():
    positive.append(key)
for key,value in negative_words.items():
    negative.append(key)
# Create a dictionary to mention the colors : green for positive and red for negative
coloured_words={"green":positive,"red":negative}
# Colour assignment helper : WordCloud.recolor() calls this object for each word,
# so __call__() returns the colour chosen by get_colour().
class ColourAssignment:
    def __init__(self,coloured_words,default):
        self.coloured_words=coloured_words
        self.default=default
    def get_colour(self,word):
        try:
            colour=next(
                colour for (colour,words) in self.coloured_words.items()
                if word in words)
        except StopIteration:
            colour=self.default
        return colour
    def __call__(self,word,**kwargs):
        return self.get_colour(word)
# To print the plot
from wordcloud import WordCloud
import matplotlib.pyplot as plt
word_cloud=WordCloud(collocations=False,background_color='black').generate(movies_reviews1)
# Neutral words will be visible as white (the default colour passed below)
group_color=ColourAssignment(coloured_words,'white')
word_cloud.recolor(color_func=group_color)
plt.figure()
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Output :
When it comes to social media such as Twitter, Facebook, YouTube etc., a bulk of data is available which
needs to be examined or analyzed to interpret the opinions people convey in different formats. Sentiment
analysis basically puts this subjective information in the form of emotions.
To get tweets through the Twitter API, a Twitter account is needed and an App has to be registered. Follow the
steps below :
First create a Twitter account if you do not have one. Visit https://twitter.com/i/flow/signup and create
an account. An existing account can also be used.
Now create an App on Twitter Developer using following link :
https://developer.twitter.com/en/apps
Now click on “Create an App” button to create an application to get the API key for
credentials. It will ask to apply for a Developer Account.
Click on Apply and continue. And then answer the questions visible on the screen.
After submitting the request, you will receive a confirmation email from Twitter. Then we can get
the keys. Visit the following link for App creation.
https://developer.twitter.com/en/portal/register/welcome
Libraries used for twitter data analysis :
1. tweepy : It is a Python library which is used to access the Twitter API. To install
tweepy, use the following command :
pip install tweepy
You can also use the “Bearer Token” to perform authentication. For this code, the Bearer
Token is used. If you want to use another approach, refer to
https://docs.tweepy.org/en/stable/authentication.html. You can find the Bearer Token here :
Before using the Bearer Token, make sure the app has “Elevated” access. When an app is first
created, it comes with “Essential” access; to use the Bearer Token directly, “Elevated” access is
required.
In case the token expires, it can be regenerated as well. Add the following lines
of code :
auth=tweepy.OAuth2BearerHandler("Your Bearer Token")
api=tweepy.API(auth)
# Get the tweets on the basis of Hash Tags or Keywords.
search_tag=input("Enter the Hash Tag or Keyword for which you want to get the
tweets : ")
no_of_tweets=int(input("How many tweets you want ? "))
# Iterate over the tweets.
tweets=tweepy.Cursor(api.search_tweets, q=search_tag).items(no_of_tweets)
# Create a list to store all the tweets.
tweet_list=[]
for tweet in tweets:
    tweet_list.append(tweet.text)
print(tweet_list)
Output : (Example)
Enter the Hash Tag or Keyword for which you want to get the tweets : #sadhguru
More Analysis on Twitter Data : We can further perform different analysis on gathered
data as follows :
First select the user ID on which analysis is to be done.
Then we can find various information related to tweets such as 'created_at', 'id',
'id_str', 'text', 'truncated', 'entities', 'metadata', 'source', 'in_reply_to_status_id',
'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str',
'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors',
'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited',
'retweeted', 'lang', 'possibly_sensitive'.
Example :
# Select a specific user by using a twitter user ID.
user_id=input("Enter a Twitter user ID : ")
no_of_tweets=int(input("How many tweets you want ? "))
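# (Assumed fetching step, since the call is not shown in this listing : get the
# user's timeline with the authenticated tweepy API object created earlier.)
tweets=api.user_timeline(screen_name=user_id,count=no_of_tweets,tweet_mode="extended")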
for tweet_info in tweets:
    print("Tweet ID : ",tweet_info.id) # Tweet ID
    print("Created at : ",tweet_info.created_at) # Date on which the tweet was created.
    print("Tweet : ",tweet_info.full_text) # Tweet
    print("Retweet count : ",tweet_info.retweet_count) # Number of retweets on each tweet.
    print("\n")
Output : (Example)
Tweet ID : 1502113329103487012
Created at : 2022-03-11 02:45:00+00:00
Tweet : Kriya Yoga requires nothing but dedication towards the practice. As you
refine your energies, there is no way you can remain untransformed.
#SadhguruQuotes https://t.co/byjrSIld2u
Retweet count : 1696
Tweet ID : 1501771371562471426
Created at : 2022-03-10 04:06:11+00:00
Tweet : Congratulations @CISFHQrs for your courageous & committed
contribution to Nation Building for more than five decades. Bharat is proud
& grateful for your stellar service. May you continue to inspire Peace &
Prosperity. Best Wishes. –Sg #CISFRaisingDay2022
Retweet count : 1595
Tweet ID : 1501750941187551232
Created at : 2022-03-10 02:45:00+00:00
Tweet : You cannot change the past. You can only experience the present moment.
The future must be crafted the way you want. #SadhguruQuotes
https://t.co/eTCAmU3gOl
Retweet count : 2510
Tweet ID : 1501624364889825281
Created at : 2022-03-09 18:22:02+00:00
Tweet : Machel, #VelliangiriMountains are a Cascade of Grace. Their Power has
empowered millions & will continue to empower future populations. Wonderful
your #Sadhanapada culminated here; it was beautiful to have you & Renee.
Journey on- sing, dance, also transform lives. Blessings. –Sg
https://t.co/y2qV6EBM2k
Retweet count : 1483
Visualizing Twitter Data : We can visualize the twitter data in multiple ways on the
basis of attributes returned by Twitter API.
Example : To visualize the number of re-tweets on each tweet :
First create a DataFrame so that it will become easy to get the attributes of Twitter API.
Then create a plot (e.g. Pie Plot) to get the number of re-tweets on each tweet.
import tweepy
import pandas as pd
import matplotlib.pyplot as plt
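# The lists below (tweet_id, tweet_created_at, tweet_full_text, tweet_retweet_count,
# tweet_favorite_count) are assumed to have been filled while iterating over the
# fetched tweets, e.g. tweet_id.append(tweet_info.id) for every tweet_info.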
twitter_data={'id':tweet_id,'created_at':tweet_created_at,'full_text':tweet_full_text,'retweet_count':tweet_retweet_count,'favorite_count':tweet_favorite_count}
# DataFrame
twitter_dataframe=pd.DataFrame(twitter_data)
# Plotting Pie Graph for retweets on each tweet.
twitter_dataframe['retweet_count'].plot.pie()
plt.show()
Output :
Enter a Twitter user ID : Tesla
How many tweets you want ? 10
Now to plot the likes and re-tweets received on each tweet, add the following script :
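(A minimal sketch, assuming the twitter_dataframe built above; a grouped bar chart of likes and re-tweets per tweet is one straightforward option :)
# Bar chart of likes and re-tweets for each fetched tweet.
twitter_dataframe[['favorite_count','retweet_count']].plot.bar(figsize=(12,5))
plt.xlabel("Tweet number")
plt.ylabel("Count")
plt.legend(["Likes","Retweets"])
plt.show()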
Consider the following script to plot the Time Series for likes and re-tweets along with
dates on which the tweets were published.
# Time Series
time_likes=pd.Series(data=twitter_dataframe['favorite_count'].values,index=twitter_dataframe['created_at'])
time_likes.plot(figsize=(16,4),label="likes",legend=True,color="magenta")
time_retweets=pd.Series(data=twitter_dataframe['retweet_count'].values,index=twitter_dataframe['created_at'])
time_retweets.plot(figsize=(16,4),label="retweets",legend=True,color="blue")
time_retweets.plot(figsize=(16,4),label="retweets",legend=True,color="blue")
plt.show()
You can also get information regarding the tweets, such as the total number of likes and re-tweets on
each tweet, which tweet has the maximum count of likes and which got the maximum re-tweets.
for tweet_info in tweets:
    print("Tweet ID : ",tweet_info.id) # Tweet ID
    print("Created at : ",tweet_info.created_at) # Date on which the tweet was created.
    print("Tweet : ",tweet_info.full_text) # Tweet
    print("Retweet count : ",tweet_info.retweet_count) # Number of retweets on each tweet.
    print("Favorite count : ",tweet_info.favorite_count)
    print("\n")
Output : (Example)
Enter a Twitter user ID : SadhguruJV
How many tweets you want ? 5
Total number of tweets : 5
Total number of likes on each tweet : 16863
Total number of retweets on each tweet : 6180
Number of likes for most liked tweet : 5964
Number of retweets for the most retweeted tweet : 2207
Tweet ID : 1502475716935442442
Created at : 2022-03-12 02:45:00+00:00
Tweet : Only if you invest your emotions in what matters to you, will life become
powerful and really meaningful. #SadhguruQuotes https://t.co/EWJ2Aneqps
Retweet count : 447
Favorite count : 1247
Tweet ID : 1502375448885432326
Created at : 2022-03-11 20:06:34+00:00
Tweet : #SaveSoil #MoU #CARICOM
@GastonBrowne @AntiguaOpm @SkerritR @PhilipJPierreLC @pmharriskn @antiguagov
@SaintLuciaGov @skngov @molwynjoseph @SamMarshallMP @machelmontano @armandarton
@GlobalCitizenFo @cpsavesoil @PMOIndia https://t.co/RMXpcgW12d
Retweet count : 461
Favorite count : 1026
Tweet ID : 1502375423451164672
Created at : 2022-03-11 20:06:28+00:00
Tweet : A historic moment marked by the first #SaveSoil MoUs signed by the pearls of
the ocean. Governments of Antigua & Barbuda, Dominica, St Lucia, and St Kitts &
Nevis — may your commitment to soil revitalization be an inspiration to the rest of the
world. -Sg @CARICOMorg #CARICOM https://t.co/0glWuMlFBy
Retweet count : 1074
Favorite count : 2806
Tweet ID : 1502151438419464196
Created at : 2022-03-11 05:16:26+00:00
Tweet : Sir Vivian Richards & Lord Ian Botham - a joy to meet you during my Antigua
visit for the #SaveSoil movement. Your achievements in cricket & beyond are
commendable. Please join me in restoring our world’s Soil, the basis of all Life on
Earth. -Sg @ivivianrichards @BeefyBotham https://t.co/M53Ckhu0Lg
Retweet count : 2207
Favorite count : 5964
Tweet ID : 1502113329103487012
Created at : 2022-03-11 02:45:00+00:00
Tweet : Kriya Yoga requires nothing but dedication towards the practice. As you refine
your energies, there is no way you can remain untransformed. #SadhguruQuotes
https://t.co/byjrSIld2u
Retweet count : 1991
Favorite count : 5820
Sentiment Analysis on Twitter Data : We can also perform sentiment analysis on gathered twitter data.
Here two libraries will be needed, i.e. TextBlob and Vader.
1. textblob : It is a Python library which is used for processing textual data. It is built on top of NLTK
module and offers a simple API to access its methods to perform basic Natural Language Processing
tasks. To install textblob, use the following command :
pip install textblob
2. VADER description has already been given in previous topic for Sentiment Analysis using NLTK.
positive_tweets=0
negative_tweets=0
neutral_tweets=0
polarity_of_tweets=0
# (Assumed loop header, since the original listing starts mid-loop : 'analysis'
# is a TextBlob object and polarity_score comes from VADER.)
positive_tweets_list=[]
negative_tweets_list=[]
neutral_tweets_list=[]
vader_analyzer=SentimentIntensityAnalyzer()
for tweet in tweets:
    analysis=TextBlob(tweet.text)
    polarity_score=vader_analyzer.polarity_scores(tweet.text)
    positive_score=polarity_score['pos']
    negative_score=polarity_score['neg']
    compound_score=polarity_score['compound']
    polarity_of_tweets+=analysis.sentiment.polarity
    if negative_score>positive_score:
        negative_tweets_list.append(tweet.text)
        negative_tweets+=1
    elif positive_score>negative_score:
        positive_tweets_list.append(tweet.text)
        positive_tweets+=1
    elif positive_score==negative_score:
        neutral_tweets_list.append(tweet.text)
        neutral_tweets+=1
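# sentiment_percentage() is not defined in this listing; a minimal assumed helper
# that converts a count into a percentage of all fetched tweets could be :
def sentiment_percentage(part,whole):
    return 100*float(part)/float(whole)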
positive_tweets=sentiment_percentage(positive_tweets,no_of_tweets)
negative_tweets=sentiment_percentage(negative_tweets,no_of_tweets)
neutral_tweets=sentiment_percentage(neutral_tweets,no_of_tweets)
polarity_of_tweets=sentiment_percentage(polarity_of_tweets,no_of_tweets)
positive_tweets=format(positive_tweets,'.1f')
negative_tweets=format(negative_tweets,'.1f')
neutral_tweets=format(neutral_tweets,'.1f')
Output :
Enter the Hash Tag or Keyword for which you want to get the tweets : amazonIN
Positive Tweets :
["Back to my weekend's favourite activity. Some insights\n\n56 days,\n12
emails\n>30 calls.\n\nAnd the amazing collaboratio… https://t.co/38JoPHM0aw",
'@amazonIN i am unable to rest my Amazon password and i am trying to call
180030001593 on this number but every tim… https://t.co/QNEeqtA1cm', 'RT
@amazonIN: Soundbar Days is back with exciting offers & great discount from
popular brands! Get up to 55% off on bestselling soundbars,…', '@amazonIN
@amazon @AmitAgarwal \nAmazon app is not performing well like add to cart, save
later , move to cart obse… https://t.co/DPnt2Zg2pL']
Negative Tweets :
Neutral Tweets :
Downloading the twitter datasets online : The online available datasets containing
Twitter data can be downloaded and different analytics can be performed on them.
Example : https://www.kaggle.com/crowdflower/twitter-user-gender-classification
Self-Activity : Download and analyze the data using the above link and apply different
analytics techniques on it.
Getting data using Facebook Access Token
Now create an app to get the token to be used for further processing.
Click on Create App and then select the type of app to be created. Multiple options will
be available, i.e. Business, Consumer, Instant Games, Gaming, Workplace, None. You can
read the details and select an option. If the Business option is selected, it creates an
app which manages business assets like Pages, Events, Groups, Ads, Messenger and the
Instagram Graph API using the available business permissions, features and products.
After an app gets created, you can get the Access Token as follows :
Go to : https://developers.facebook.com/tools/explorer/
Now click on Generate Access Token. Proceed to the next step by clicking on
Continue.
The Access Token will be visible in Access Token input box.
Now allow the necessary permissions to access the Facebook pages as well.
Click on “Add a Permission” dropdown and select the permissions from it.
From “User or Pages” dropdown select “Get User Token” again to get the Token with
revised permissions.
Now if we want to see the details of publicly available Facebook users or pages,
change the request in the request url box.
Example : If we want to get the details of Facebook Page “Sanganak Academy” then
change the name of page as : SanganakAcademy?fields=id,name. Before that,
make sure to allow the permissions for accessing that page as well.
You can see the posts on this page as well by changing the url as :
SanganakAcademy?fields=id,name,posts. You can do the same to access other
information as well.
The Facebook data can be accessed using Python script as follows :
Import the necessary libraries :
import requests
import time
import pickle
import random
Provide Access Token and get the URL to access the data :
# Access Token
access_token="Your Access Token"
# In Graph URL, provide the correct version of the Graph API. Here v13.0 is currently used.
graphURL="https://graph.facebook.com/v13.0/"
# Request URL to get the relevant data of Facebook Page Sanganak Academy.
# You can access any other page by using its ID as well.
requestURL="SanganakAcademy?fields=id,name,posts{message,created_time,comments.limit(
0).summary(true), likes.limit(0).summary(true)}"
actual_url=graphURL+requestURL
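# (Assumed fetching step : the listing below uses 'received_data' without showing
# how it is obtained. One likely form with the requests library imported above :)
response=requests.get(actual_url,params={"access_token":access_token}).json()
received_data=response["posts"]["data"] # List of post dictionaries from the Graph API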
Create a DataFrame using Pandas library.
# Create a dataframe using pandas library
import pandas as pd
fb_dataframe=pd.DataFrame(received_data)
fb_dataframe=pd.json_normalize(received_data)
print(fb_dataframe)
# Columns
print("Columns in Dataframe : ")
for col in fb_dataframe.columns:
    print(col)
Here columns in the dataframe are :
message
created_time
id
comments.data
comments.summary.order
comments.summary.total_count
comments.summary.can_comment
likes.data
likes.summary.total_count
likes.summary.can_like
likes.summary.has_liked
time_likes=pd.Series(data=fbdata_with_dates['likes.summary.total_count'].values,index=fbdata_with_dates['created_time'])
time_likes.plot(figsize=(16,4),label="likes",legend=True)
plt.show()
And then click on “Generate Access Token”. Now copy the generated access token
and use it for further analysis.
Creating a Facebook post : To post something on Facebook Page wall, use put_object()
method of Facebook Graph API.
import facebook
access_token="Your Page Access Token"
fb=facebook.GraphAPI(access_token)
fb.put_object(parent_object='me', connection_name='feed', message='Hello all...Welcome to Sanganak Academy')
Here, 511268930590718 is the Facebook Post ID. To get “Facebook Page ID”, Open the
Facebook Page and in About section, you will find the Facebook Page ID.
Now combine Facebook Page ID and Facebook Post ID as pageid_postid. For example : If
Page ID = 12345 and Post ID = 511268930590718, then combine it as
12345_511268930590718.
Liking a post :
Syntax : graph_api_object.put_like(object_id='post_id')
Example :
# Liking a facebook post.
fb.put_like(object_id='12345_511268930590718')
(Refer this https://facebook-sdk.readthedocs.io/en/latest/api.html for more information
about Facebook SDK.)
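Analysis on YouTube Data : The snippets below assume that YouTube video statistics have already been loaded into a pandas DataFrame named youtube_data with columns such as likes, dislikes, comment_count and publish_time (for example, from one of the publicly available YouTube trending-video CSV datasets). A minimal sketch of that setup, with a hypothetical file name, could be :
import datetime
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical file name; any YouTube statistics CSV with these columns will work.
youtube_data=pd.read_csv("youtube_video_statistics.csv")
print(youtube_data[['likes','dislikes','comment_count','publish_time']].head())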
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (12, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
fig = plt.figure()
axis2 = fig.add_axes([0.8, 0, 0.75, 0.75], aspect=1)
# Pie chart for reactors
pie_vars = ['Likers','Dislikers','Commenters']
pie_values = [youtube_data['likes'].sum(),youtube_data['dislikes'].sum(),youtube_data['comment_count'].sum()]
axis2.pie(pie_values,labels=pie_vars,autopct='%1.2f%%')
axis2.set_title("Types of reactors")
plt.show()
# Convert the publish_time strings into datetime objects.
for i in range(youtube_data.shape[0]):
    date_time_obj = datetime.datetime.strptime(youtube_data['publish_time'].at[i],'%Y-%m-%dT%H:%M:%S.000Z')
    youtube_data['publish_time'].at[i] = date_time_obj
date=[]
year=[]
month=[]
day=[]
for i in range(youtube_data.shape[0]):
    d = youtube_data['publish_time'][i].date()
    y = youtube_data['publish_time'][i].date().year
    m = youtube_data['publish_time'][i].date().month
    days = youtube_data['publish_time'][i].date().day
    date.append(d) # Storing dates
    year.append(y) # Storing years
    month.append(m) # Storing months
    day.append(days) # Storing days
youtube_data.drop(['publish_time'], inplace=True,axis=1)
youtube_data['publish_time']=date
youtube_data['year']=year
youtube_data['month'] = month
youtube_data['day'] = day
# Year wise statistics of dislikes.
plt.scatter(youtube_data['year'], youtube_data['dislikes'], c="red")
plt.xlabel("Year")
plt.ylabel("Dislikes")
plt.show()
Lab Assignments
SET A
1. Consider any text paragraph. Preprocess the text to remove any special characters and digits. Generate
the summary using extractive summarization process.
2. Consider any text paragraph. Remove the stopwords. Tokenize the paragraph to extract words and
sentences. Calculate the word frequency distribution and plot the frequencies. Plot the wordcloud of
the text.
3. Consider the following review messages. Perform sentiment analysis on the messages.
i. I purchased headphones online. I am very happy with the product.
ii. I saw the movie yesterday. The animation was really good but the script was ok.
iii. I enjoy listening to music
iv. I take a walk in the park everyday
4. Perform text analytics on WhatsApp data :
Write a Python script for the following :
i. First Export the WhatsApp chat of any group. Read the exported “.txt” file using open() and read()
functions.
ii. Tokenize the read data into sentences and print it.
iii. Remove the stopwords from data and perform lemmatization.
iv. Plot the wordcloud for the given data.
Set B
1. Consider the following dataset :
https://www.kaggle.com/datasets/prasertk/top-1000-instagram-influencers
Write a Python script for the following :
i. Read the dataset and find the top 5 Instagram influencers from India.
ii. Find the Instagram account having least number of followers.
iii. Read the column “Category”, remove stopwords and plot the wordcloud to find the keywords which
will imply that in which category maximum accounts are created.
iv. Group the Instagram accounts category wise.
v. Visualize the dataset and plot the relationship between Followers and Authentic engagement columns.
Set C
Q.2 Write a Python script to read the Tweets using Twitter API and tweepy library to perform the
following tasks :
i. Authenticate Twitter API (Using Bearer Token)
ii. Get the tweets using Keywords or Hash Tags.
iii. Find the total number of likes and retweets on each tweet.
iv. Find the most liked tweet and print its text.
v. Visualize the tweets and plot the time series for likes and retweets along with dates on which tweets
are published.
1. Import and format the data into a DataFrame using pandas library. Example : For working with Facebook
Posts, read the JSON file available in “Posts” folder as :
import pandas as pd
facebook_dataframe=pd.read_json("your_posts.json")
Similarly you can work with other downloaded data and read the JSON files available in them.
2. Now perform data cleaning operation on created dataframe and remove unnecessary columns.
3. Perform multiple statistical analysis such as finding the posts by date, number of likes on a post, comments
on a post.
4. Perform sentiment analysis to find the polarity scores and classify the posts text in three categories i.e.
positive, negative and neutral posts.
Assignment Evaluation