WEB AND SOCIAL MEDIA ANALYTICS LAB
For
B. Tech IV Year I Semester
(COMPUTER SCIENCE AND ENGINEERING)
(DATA SCIENCE)
(R18 Regulations)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(DATA SCIENCE)
Sreyas Institute of Engineering and Technology
An UGC Autonomous Institution
Prepared by B. Venkata Varma
4. Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(DATA SCIENCE)
JNTU HYDERABAD
WEB AND SOCIAL MEDIA ANALYTICS LAB
CO-PO MAPPING:
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 3 3 3 3 3 - 2 3 3 - - 3
CO2 3 3 3 2 2 2 - 3 3 - - 3
CO3 3 3 3 3 3 - - 3 3 - - 3
CO4 3 3 3 3 3 - - 3 3 - - 3
CO5 3 3 3 - 3 - - 3 3 - - 3
AVG 3 3 3 3 3 2 2 3 3 2 2 3
CO-PSO MAPPING:
PSO1 PSO2
CO1 - 2
CO2 - 1
CO3 - 1
CO4 - 2
CO5 - 1
AVG 0 2
2. Stop Word List: Have a predefined list of stop words (e.g., provided by NLP
libraries or custom lists).
3. Filtering: Remove words from the text that are in the stop word list.
Example:
The NLTK library maintains a list of around 179 stopwords (shown below) that can be used to filter stopwords from the text. You may also add or remove stopwords from the default list.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
print(stopwords.words('english'))
Out:-
‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘our’, ‘ours’, ‘ourselves’, ‘you’, “you’re”, “you’ve”, “you’ll”,
“you’d”, ‘your’, ‘yours’, ‘yourself’, ‘yourselves’, ‘he’, ‘him’, ‘his’, ‘himself’, ‘she’, “she’s”,
‘her’, ‘hers’, ‘herself’, ‘it’, “it’s”, ‘its’, ‘itself’, ‘they’, ‘them’, ‘their’, ‘theirs’, ‘themselves’,
‘what’, ‘which’, ‘who’, ‘whom’, ‘this’, ‘that’, “that’ll”, ‘these’, ‘those’, ‘am’, ‘is’, ‘are’,
‘was’, ‘were’, ‘be’, ‘been’, ‘being’, ‘have’, ‘has’, ‘had’, ‘having’, ‘do’, ‘does’, ‘did’, ‘doing’,
‘a’, ‘an’, ‘the’, ‘and’, ‘but’, ‘if’, ‘or’, ‘because’, ‘as’, ‘until’, ‘while’, ‘of’, ‘at’, ‘by’, ‘for’,
‘with’, ‘about’, ‘against’, ‘between’, ‘into’, ‘through’, ‘during’, ‘before’, ‘after’, ‘above’,
‘below’, ‘to’, ‘from’, ‘up’, ‘down’, ‘in’, ‘out’, ‘on’, ‘off’, ‘over’, ‘under’, ‘again’, ‘further’,
‘then’, ‘once’, ‘here’, ‘there’, ‘when’, ‘where’, ‘why’, ‘how’, ‘all’, ‘any’, ‘both’, ‘each’,
‘few’, ‘more’, ‘most’, ‘other’, ‘some’, ‘such’, ‘no’, ‘nor’, ‘not’, ‘only’, ‘own’, ‘same’, ‘so’,
‘than’, ‘too’, ‘very’, ‘s’, ‘t’, ‘can’, ‘will’, ‘just’, ‘don’, “don’t”, ‘should’, “should’ve”, ‘now’,
‘d’, ‘ll’, ‘m’, ‘o’, ‘re’, ‘ve’, ‘y’, ‘ain’, ‘aren’, “aren’t”, ‘couldn’, “couldn’t”, ‘didn’, “didn’t”,
‘doesn’, “doesn’t”, ‘hadn’, “hadn’t”, ‘hasn’, “hasn’t”, ‘haven’, “haven’t”, ‘isn’, “isn’t”,
‘ma’, ‘mightn’, “mightn’t”, ‘mustn’, “mustn’t”, ‘needn’, “needn’t”, ‘shan’, “shan’t”,
‘shouldn’, “shouldn’t”, ‘wasn’, “wasn’t”, ‘weren’, “weren’t”, ‘won’, “won’t”, ‘wouldn’,
“wouldn’t”
import nltk
nltk.download('stopwords')

def stopword_elimination(text):
    stopwords = nltk.corpus.stopwords.words('english')
    filtered_words = [word for word in text.split() if word.lower() not in stopwords]
    return filtered_words

if __name__ == '__main__':
    text = "This is a sample text with stopwords."
    filtered_words = stopword_elimination(text)
    print(filtered_words)
Output
['sample', 'text', 'stopwords.']
b) Stemming
Stemming also reduces words to their root forms, but unlike lemmatization, the stem itself may not be a valid word in the language.
NLTK provides several stemming functions based on different algorithms; we will use PorterStemmer here. In practice you would usually perform either stemming or lemmatization, not both; we apply stemming to our data purely for illustration. We define a custom function stemming() that returns the text with each word converted to its stem, and finally apply it to the Twitter dataframe.
import nltk
from nltk.stem import PorterStemmer

def stemming(text):
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in text.split()]
    return stemmed_words

text = "This is a sample text with stemming."
print(stemming(text))
Output
['thi', 'is', 'a', 'sampl', 'text', 'with', 'stemming.']
from nltk.stem import PorterStemmer

def stemming(text):
    porter = PorterStemmer()
    result = []
    for word in text:
        result.append(porter.stem(word))
    return result

# Test
text = ['Connects', 'Connecting', 'Connections', 'Connected', 'Connection', 'Connectings', 'Connect']
stemmed_words = stemming(text)
print(stemmed_words)
Output
['connect', 'connect', 'connect', 'connect', 'connect', 'connect', 'connect']
C) Lemmatization
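Lemmatization reduces each word to its dictionary base form (lemma); unlike stemming, the result is always a valid word. The program for this step is not reproduced in this manual, so the following is a minimal sketch that combines NLTK's WordNetLemmatizer with the stopword removal shown earlier; its exact output may differ slightly from the sample output shown below.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

def lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # Tokenize, lowercase, drop punctuation and stopwords, then lemmatize each token
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(word) for word in tokens
            if word.isalpha() and word not in stop_words]

if __name__ == '__main__':
    text = "This is a sample text with lemmatization."
    print(lemmatization(text))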
Output
['this','sample','text','lemmatization']
D) POS tagging
Part-of-speech (POS) tagging assigns a grammatical category (determiner, noun, verb, and so on) to each token in the text. Here we define a custom function pos_tagging() that tokenizes the text with NLTK's word_tokenize() and labels each token with nltk.pos_tag(), and finally we apply this function to a sample sentence.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

if __name__ == '__main__':
    text = "This is a sample text with POS tagging."
    tagged_tokens = pos_tagging(text)
    print(tagged_tokens)
Output
[('This','DT'),('is','VBZ'),('a','DT'),('sample','NN'),('text','NN'),('with','IN'), ('POS',
'NN'), ('tagging', 'VBG')]
E) Lexical analysis
Lexical analysis is the process of converting a sequence of characters in a source code file
into a sequence of tokens that can be more easily processed by a compiler or interpreter. It
is often the first phase of the compilation process and is followed by syntax analysis and
semantic analysis.
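To illustrate the compiler sense of lexical analysis described above, here is a small sketch (not part of the original lab program) that uses Python's built-in tokenize module to split a one-line statement into (token type, token text) pairs.
import io
import tokenize

def lex(source_code):
    # Break Python source text into (token type name, token string) pairs
    tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
    return [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]

if __name__ == '__main__':
    print(lex("total = price * 2 + tax\n"))
In this lab, the same idea is applied to natural-language text: the program below tokenizes a sentence and tags each token with its part of speech.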
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def lexical_analysis(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

if __name__ == '__main__':
    text = "This is a sample text with lexical analysis."
    tagged_tokens = lexical_analysis(text)
    print(tagged_tokens)
Output
[('This','DT'),('is','VBZ'),('a','DT'),('sample','NN'),('text','NN'),('with','IN'),
('lexical','JJ'),('analysis','NN')]
F) Sentiment analysis
Sentiment analysis assigns polarity scores (negative, neutral, positive, and a combined compound score) to a piece of text. Here we use NLTK's VADER SentimentIntensityAnalyzer.
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def sentiment_analysis(text):
    analyzer = SentimentIntensityAnalyzer()
    sentiment = analyzer.polarity_scores(text)
    return sentiment

if __name__ == '__main__':
    text = "This is a sample text with positive sentiment."
    sentiment = sentiment_analysis(text)
    print(sentiment)
Output
{ 'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.5574}
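The compound score can be converted into a single label; below is a small sketch using the commonly cited VADER thresholds of ±0.05 (the thresholds are a convention, not part of the original program).
def label_sentiment(scores):
    # Map VADER's compound score to a coarse sentiment label
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    return 'neutral'

print(label_sentiment({'neg': 0.0, 'neu': 0.625, 'pos': 0.375, 'compound': 0.5574}))  # positive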
3. Web analytics
a. Web usage data (web server log data, clickstream analysis)
import pandas as pd

def web_usage_analysis(log_file):
    # Read web log data from CSV file
    try:
        log_data = pd.read_csv(log_file)
    except Exception as e:
        print(f"Error reading file: {e}")
        return

    # Check if necessary columns exist
    required_columns = ['user_id', 'session_id', 'timestamp']
    if not all(col in log_data.columns for col in required_columns):
        print("Missing required columns in the log data.")
        return

    # Group by user to count requests per user
    user_requests = log_data.groupby('user_id')['session_id'].count()

    # Display the results
    print("Web Requests per User:")
    print(user_requests)

# Example usage with a log file
log_file = '/content/web_log.csv'  # Example file path
web_usage_analysis(log_file)
Output
Web Requests per User:
user_id
101 3
102 2
103 1
104 2
105 2
Name: session_id, dtype: int64
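The same log can also be viewed as a clickstream. As a further sketch (assuming the same /content/web_log.csv with user_id, session_id and timestamp columns used above), the snippet below measures how long each session lasted:
import pandas as pd

def session_duration_analysis(log_file):
    # Duration of each session = time between its first and last logged request
    log_data = pd.read_csv(log_file, parse_dates=['timestamp'])
    durations = log_data.groupby('session_id')['timestamp'].agg(lambda ts: ts.max() - ts.min())
    print("Session durations:")
    print(durations)

# session_duration_analysis('/content/web_log.csv')  # same sample log file as above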
b. Hyperlink data
Hyperlink (link structure) analysis looks at the anchors on a page: which URLs the page links to and how often each target appears.
import requests
import bs4

def hyperlink_analysis(url):
    # Send a request to the URL and parse the HTML
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.content, 'html.parser')

    # Find all hyperlinks in the page
    links = soup.find_all('a')

    # Count how many times each target URL appears
    link_counts = {}
    for link in links:
        anchor_text = link.text
        href = link.get('href', '')  # Get href, handle if it doesn't exist
        if href not in link_counts:
            link_counts[href] = 0
        link_counts[href] += 1

    # Report the counts
    for href, count in link_counts.items():
        print(f"{href}: {count}")

if __name__ == '__main__':
    # The sample output below appears to come from Google's home page
    hyperlink_analysis('https://www.google.com')
To run the program, first install the required packages:
pip install requests
pip install bs4
Output (python hyperlink_analysis.py):
/search: 5
/maps: 1
/shopping: 1
/about: 1
https://policies.google.com/privacy: 1
/intl/en/policies/terms/: 1
4. Search engine optimization: implement spamdexing
Spamdexing (keyword stuffing) tries to manipulate a search engine's ranking by padding a page's text with repeated target keywords. The function below removes stopwords from the input text and then stuffs in each target keyword several times.
import nltk
nltk.download('stopwords')

def spamdexing(text):
    # Load English stopwords from NLTK
    stopwords = nltk.corpus.stopwords.words('english')
    # Define the keywords to be added
    keywords = ['keyword1', 'keyword2', 'keyword3']
    # Filter the text by removing stopwords
    filtered_text = [word for word in text.split() if word.lower() not in stopwords]
    # Stuff each keyword into the filtered text several times
    for keyword in keywords:
        filtered_text.extend([keyword] * 8)
    return filtered_text

if __name__ == '__main__':
    text = "This is a sample text with stopwords."
    print(spamdexing(text))
Output
['sample', 'text', 'stopwords.', 'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword1', 'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword2', 'keyword3', 'keyword3', 'keyword3', 'keyword3', 'keyword3', 'keyword3', 'keyword3', 'keyword3']
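A quick way to see why such a page looks spammy is keyword density: the share of all words taken up by each repeated term. A minimal sketch, assuming the stuffed word list produced above:
from collections import Counter

def keyword_density(words):
    # Fraction of all words accounted for by each distinct term
    counts = Counter(word.lower() for word in words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

stuffed = ['sample', 'text', 'stopwords.'] + ['keyword1'] * 8 + ['keyword2'] * 8 + ['keyword3'] * 8
density = keyword_density(stuffed)
print(f"keyword1 density: {density['keyword1']:.0%}")  # roughly 30% of all words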
a. Conversion Statistics
import requests

def get_conversion_data(conversion_id):
    url = 'https://analytics.google.com/analytics/v3/data/ga'
    params = {
        'ids': f'ga:{conversion_id}',
        'start-date': '2023-01-01',
        'end-date': '2023-08-01',
        'metrics': 'ga:conversions',
        'dimensions': 'ga:date',
        'samplingLevel': '1'
    }
    response = requests.get(url, params=params)
    return response.json()

if __name__ == '__main__':
    conversion_id = '1234567890'
    conversion_data = get_conversion_data(conversion_id)
    print(conversion_data)
Output
The output of the program will depend on the data returned by the API. However, it might include the following information:
• The conversion rate
• The number of conversions
• The number of visitors
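For reference, the conversion rate is simply conversions divided by visitors. A minimal sketch with made-up counts (not taken from any real report):
conversions = 45   # hypothetical count
visitors = 1200    # hypothetical count
conversion_rate = conversions / visitors * 100
print(f"Conversion rate: {conversion_rate:.2f}%")  # prints: Conversion rate: 3.75%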
b. Visitor Profiles
To create Visitor Profiles in Python, you need to analyze visitor data from sources like an
API, database, or CSV files. Visitor profiles typically include attributes such as
demographics, preferences, behavior patterns, and interaction history. Here's how you can
structure your approach:
import pandas as pd
import matplotlib.pyplot as plt

# visitor_data: records collected from an API, database, or CSV file as described above.
# A small illustrative placeholder is used here so the example can run on its own.
visitor_data = [
    {'gender': 'F', 'location': 'Hyderabad', 'visits': 12, 'purchases': 3, 'conversion_rate': 0.25},
    {'gender': 'M', 'location': 'Chennai', 'visits': 8, 'purchases': 1, 'conversion_rate': 0.12},
    {'gender': 'F', 'location': 'Hyderabad', 'visits': 5, 'purchases': 2, 'conversion_rate': 0.40},
]

# Convert to DataFrame
df = pd.DataFrame(visitor_data)

# Group by gender
gender_summary = df.groupby('gender').agg({'visits': 'mean', 'purchases': 'mean', 'conversion_rate': 'mean'})
print("\nGender-Based Summary:")
print(gender_summary)

# Plot profiles
plt.figure(figsize=(10, 6))
df.groupby('location')['visits'].sum().plot(kind='bar', color='skyblue')
plt.title('Visits by Location')
plt.xlabel('Location')
plt.ylabel('Total Visits')
plt.show()
c. Traffic Sources
# Excerpt: the check below belongs inside a function that requests traffic-source data
# from the Google Analytics API; the surrounding function is not shown in this manual.
if response.status_code == 200:
    return response.json()
else:
    print(f"Error: {response.status_code} - {response.text}")
    return None

if __name__ == '__main__':
    profile_id = '1234567890'           # Replace with your actual profile ID
    access_token = 'YOUR_ACCESS_TOKEN'  # Replace with a valid access token
    # traffic_sources holds the result returned by that function;
    # save_data_to_json() is defined elsewhere in the full program.
    if traffic_sources:
        save_data_to_json(traffic_sources)
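The fragment above omits the function that actually fetches the data. Below is a hedged sketch of what such a get_traffic_sources() helper might look like, modeled on the get_conversion_data() function earlier; the metric/dimension names ga:sessions and ga:source and the Bearer-token header are assumptions, not taken from the original code.
import requests

def get_traffic_sources(profile_id, access_token):
    # Hypothetical helper modeled on get_conversion_data() above
    url = 'https://analytics.google.com/analytics/v3/data/ga'
    params = {
        'ids': f'ga:{profile_id}',
        'start-date': '2023-01-01',
        'end-date': '2023-08-01',
        'metrics': 'ga:sessions',
        'dimensions': 'ga:source',
    }
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(url, params=params, headers=headers)
    if response.status_code == 200:
        return response.json()
    print(f"Error: {response.status_code} - {response.text}")
    return None
The error responses shown below are what the API returns when the access token is missing or invalid, or when the profile ID is malformed.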
Missing Authentication Credentials:-
Error: 401 - {
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential.",
    "errors": [
      {
        "message": "Request is missing required authentication credential.",
        "domain": "global",
        "reason": "required"
      }
    ]
  }
}
Invalid Profile ID:-
Error: 400 - {
  "error": {
    "code": 400,
    "message": "Invalid value 'ga:123456'. Values must match the pattern 'ga:[0-9]+'.",
    "errors": [
      {
        "message": "Invalid value 'ga:123456'. Values must match the pattern 'ga:[0-9]+'.",
        "domain": "global",
        "reason": "invalid"
      }
    ]
  }
}