Ml-1-Guided-Bus Report
Table of Contents
Part 1: Easy Visa Project
Introduction
Business Context
Role of OFLC
Objective
Data Description
Methodology
Results
Recommendations
Conclusion
Part 2: Text Analytics Project
Introduction
Sentiment Analysis Labels
Data Preprocessing
Sentiment Analysis
Conclusion
Future Work
References
Part 1: Easy Visa Project
Context:
Business communities in the United States are facing high demand for human resources, but one of the
constant challenges is identifying and attracting the right talent, which is perhaps the most important element
in remaining competitive. Companies in the United States look for hard-working, talented, and qualified
individuals both locally and abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to
work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on
their wages or working conditions by ensuring US employers' compliance with statutory requirements when
they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office
of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United
States and grants certifications in those cases where employers can demonstrate that there are not sufficient
US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in
the area of intended employment.
Objective:
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and
permanent labor certifications. This was a nine percent increase in the overall number of processed
applications from the previous year. The process of reviewing every case is becoming a tedious task as the
number of applicants is increasing every year.
The increasing number of applicants every year calls for a Machine Learning based solution that can help in
shortlisting the candidates with higher chances of visa approval. OFLC has hired the firm EasyVisa for data-
driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a
classification model:
Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the
drivers that significantly influence the case status.
Data Description
The data contains the different attributes of the employee and the employer. The detailed data dictionary is given
below.
prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of
intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not
underpaid compared to other workers offering the same or similar service in the same area of employment.
unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
Executive Summary
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United
States to work on either a temporary or permanent basis. The act also protects US workers against
adverse impacts on their wages or working conditions by ensuring US employers' compliance with
statutory requirements when they hire foreign workers to fill workforce shortages.
The increasing number of applicants every year calls for a Machine Learning based solution that can
help in shortlisting the candidates with higher chances of visa approval. We analyzed the data provided
and built classification models to:
▪ Recommend a suitable profile for the applicants for whom the visa should be certified or denied,
based on the drivers that significantly influence the case status.
1. Introduction
Overview of Business Challenge: Companies in the United States face challenges in identifying and attracting
the right talent, both locally and internationally, impacting their competitiveness.
Purpose of the Project: The project aims to facilitate the visa approval process by leveraging machine learning
techniques to analyze applicant profiles.
2. Business Context
Challenges Faced by US Companies: Discussion on the demand for human resources, the role of talent
acquisition in competitiveness, and the importance of efficient visa processing.
Role of OFLC: Explanation of the Office of Foreign Labor Certification's role in administering immigration
programs and processing visa applications.
3. Objective
Goals of the Project: To develop a machine learning-based solution for shortlisting visa applicants and
recommending suitable profiles for visa certification.
4. Data Description
Explanation of Data Attributes: Detailed description of the dataset attributes relevant to visa approval
processes.
Relevance to Visa Approval Process: Discussion on how each attribute contributes to the analysis and decision-
making process.
5. Methodology
Approach to Data Analysis: Description of the methodology used for data preprocessing, model development,
and evaluation.
Model Development Strategy: Explanation of the strategy for building and training the classification model.
6. Results
Findings from Model Analysis: Presentation of key findings and insights derived from analyzing the model's
predictions.
Insights into Visa Approval Factors: Discussion on factors influencing visa approval and their significance.
7. Recommendations
Strategies for Facilitating Visa Approvals: Recommendations for OFLC and EasyVisa on improving the efficiency
and accuracy of the visa approval process.
To prioritize limited resources towards screening a batch of applications for those most likely to be certified:
• Sort applications by level of education and review the higher levels of education first.
• Sort applications by previous job experience and review those with experience first.
• Divide applications for jobs into those with an hourly wage and those with an annual
wage, sort each group by the prevailing wage, then review applications for salaried jobs first.
As stated previously, the Gradient Boosting classifier performs the best of all the models
created. However, as shown above, the tuned Decision Tree model performs only marginally worse by
F1 score and is a far simpler model. This model may be preferable if post-hoc explanations of
individual predictions are required.
• Furthermore, OFLC should examine more thoroughly why the certification or denial of an application
can be predicted so well from just three decision nodes, as shown above.
• For those in less skilled, entry-level, and/or hourly jobs, the system appears far less likely to certify
applications; the model's drivers indicate which criteria are influencing visa approvals for this group.
▪ EDA (univariate and bivariate analysis), duplicate value check, missing value treatment, outlier check
(treatment if needed)
• Decision Tree, Random Forest, Bagging, Boosting Classifiers (AdaBoost, Gradient Boosting, XGBoost),
Stacking Classifier
Univariate Analysis
▪ The distribution of the number of employees per employer is heavily right-skewed
▪ The average and median annual salary are approximately USD 70,000, which seems plausible
▪ The trend appears plausible, with outliers in the higher income bracket between USD 200,000 and USD
300,000
▪ There are several very low salaries as well, which appear incorrect and require further investigation
(Figure: distribution of the number of employees)
Observations on prevailing wage
▪ The majority of applicants hold either a bachelor's degree (40%) or a master's degree (38%), while a
minority hold a doctorate (8%) or only a high school diploma (13%)
▪ Around 58% of applicants have prior job experience and 42% do not
Explanation of Data Attributes: The dataset provides comprehensive information about visa applicants and
their employers, crucial for understanding the factors influencing visa approval. Attributes include case_id,
continent, education_of_employee, has_job_experience, requires_job_training, no_of_employees,
yr_of_estab, region_of_employment, prevailing_wage, unit_of_wage, full_time_position, and case_status.
Relevance to Visa Approval Process: Each attribute offers valuable insights into applicant profiles and employer
characteristics, aiding in the development of a classification model for visa approval prediction.
Educational Background of Applicants: A substantial portion of applicants hold either a bachelor's degree (40%)
or a master's degree (38%). However, a minority possess either a doctorate (8%) or only a high school diploma
(13%). Understanding the distribution of educational qualifications among applicants provides insights into the
skill levels and expertise sought by US employers.
Prior Job Experience: Approximately 58% of visa applicants have prior job experience, while 42% do not. This
highlights the importance of assessing work history as a factor in visa approval likelihood. Applicants with prior
job experience may demonstrate a higher level of employability and integration into the US workforce.
Irrespective of the continent the employee is from, more cases are certified than denied
▪ The trend observed w.r.t % certification for continents is Europe > Africa > Asia > Oceania >
▪ As expected, the % of visa certifications is higher for applicants with job experience than for those without
Bivariate Analysis
Bivariate analysis examines relationships between pairs of variables, revealing insights into how
different factors interact and their impact on visa approval outcomes.
Continent vs. Case Status: Investigates approval rates across continents to identify regional patterns.
Education vs. Prevailing Wage: Explores the correlation between education level and offered wage,
indicating market demand for qualifications.
Job Experience vs. Full-Time Position: Assesses if experienced candidates are more likely to secure
full-time roles, revealing employer preferences.
Region of Employment vs. No. of Employees: Examines regional economic trends by correlating
employment regions with company sizes.
Year of Establishment vs. Prevailing Wage: Analyzes wage growth over time to understand market
maturity and industry trends.
Case Status vs. Prevailing Wage: Investigates the impact of salary on approval rates, guiding optimal
wage strategies.
This analysis offers actionable insights into visa approval dynamics, aiding in informed decision-
making to enhance approval likelihood.
Individuals with higher education levels often seek well-paid job opportunities abroad. This
inclination can influence visa application patterns and outcomes. For the Easy Visa Project:
Education and Visa Intent: Higher-educated individuals may target countries like the US for lucrative
job prospects.
Impact on Visa Approval: Visa applicants with advanced degrees may prioritize regions with strong
job markets and competitive salaries, affecting approval rates.
Analysis Focus: Explore how education levels relate to salary expectations and visa outcomes to
understand applicant motivations.
Recommendations: Tailor visa policies and employer strategies to attract and retain highly educated
professionals with competitive compensation packages.
Understanding this trend aids in developing effective visa processing strategies and attracting top
talent globally.
Regional Talent Requirements: Various regions have distinct needs for talent with diverse
educational backgrounds, influencing visa approval dynamics.
Diverse Educational Needs: Different regions prioritize specific skill sets and educational
qualifications based on their economic sectors and workforce demands.
Visa Approval Implications: Understanding regional talent requirements is crucial for predicting visa
approval likelihood, as applicants with relevant educational backgrounds may align more closely
with regional needs.
Analysis Significance: Exploring the correlation between educational backgrounds and regional
employment demands can provide insights into visa approval patterns across different geographic
areas.
Strategic Recommendations: Tailoring visa processing strategies to align with regional talent
requirements can enhance approval rates and support economic growth by meeting workforce
needs effectively.
Acknowledging regional variations in talent demands informs targeted visa processing approaches,
facilitating efficient matching of applicants with regional employment opportunities.
Regional Employment: Different regions have unique job markets with varying demands for specific
skills and qualifications. Visa applicants must align their profiles with the employment landscape of
their desired region.
Continent: Visa applicants originate from diverse continents, each with its own educational systems,
cultural norms, and workforce characteristics. Understanding these differences can provide insights
into the diverse pool of applicants.
Job Experience: The presence or absence of job experience among visa applicants is a critical factor
in assessing their employability and potential contribution to the workforce. Experienced candidates
may be more desirable to employers seeking immediate productivity.
Job Training Requirements: Some positions may require specific training or qualifications beyond
formal education. Assessing whether applicants require job training can help determine their
readiness for employment and their potential impact on the workforce.
These factors collectively influence visa approval decisions and highlight the importance of aligning
applicant profiles with regional employment needs and expectations. Understanding these dynamics
enables better-informed decision-making in visa processing and talent acquisition strategies.
Purpose of Prevailing Wage: The prevailing wage is established by the US government to ensure that
both local talent and foreign workers are fairly compensated. It aims to prevent the undercutting of
wages and protect the interests of workers in the US labor market.
Hypothesis: We hypothesize that there may be a correlation between prevailing wage and visa
status. Higher prevailing wages might attract more visa applicants, potentially leading to a higher
number of certified visas. Conversely, lower prevailing wages could indicate less demand for foreign
workers, resulting in a higher proportion of denied visas.
Data Exploration: We will explore the distribution of prevailing wages among certified and denied
visa cases to identify any patterns or trends. This analysis will help us understand if prevailing wage
levels influence visa approval decisions.
Statistical Testing: We may conduct statistical tests, such as chi-square tests or t-tests, to determine
if there is a significant difference in prevailing wage distributions between certified and denied visa
cases.
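As a hedged illustration of this testing step, the sketch below compares prevailing wages between certified and denied cases with a Welch t-test and a Mann-Whitney U test. The file name and the column values ("Certified"/"Denied") are assumptions for illustration, not details confirmed by the dataset.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("EasyVisa.csv")  # hypothetical file name

# Split prevailing wages by case outcome (column values assumed)
certified = df.loc[df["case_status"] == "Certified", "prevailing_wage"].dropna()
denied = df.loc[df["case_status"] == "Denied", "prevailing_wage"].dropna()

# Welch's t-test: does not assume equal variances between the two groups
t_stat, t_p = stats.ttest_ind(certified, denied, equal_var=False)

# Mann-Whitney U: non-parametric alternative, safer for skewed wage distributions
u_stat, u_p = stats.mannwhitneyu(certified, denied, alternative="two-sided")

print(f"Welch t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={u_p:.4f}")
```

The non-parametric test is included because the univariate analysis showed prevailing wages are skewed with high-value outliers.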
Implications: Understanding the relationship between prevailing wage and visa status can inform
policy decisions and visa processing strategies. It can help employers anticipate the impact of
prevailing wage changes on their ability to hire foreign workers and guide foreign workers in
selecting job opportunities aligned with prevailing wage trends.
By analyzing the data, we aim to provide insights into the dynamics between prevailing wage levels
and visa approval outcomes, contributing to informed decision-making in labor certification
processes and immigration policy.
Objective: Determine if prevailing wage units (e.g., Hourly, Weekly) influence visa certification.
Rationale: Unit choice may reflect job nature and industry norms, impacting certification.
Data Exploration: Examine prevailing wage unit distributions among certified and denied visas.
Comparative Analysis: We will compare the certification rates across different units of prevailing
wage to identify any significant differences. Statistical tests, such as chi-square tests or ANOVA, may
be employed to assess the significance of these differences.
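A minimal sketch of such a comparative test is given below, assuming the same hypothetical EasyVisa.csv file and a case_status column taking the values Certified/Denied. It builds a contingency table of wage unit against case outcome and applies a chi-square test of independence.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("EasyVisa.csv")  # hypothetical file name

# Contingency table of wage unit vs. case outcome
ct = pd.crosstab(df["unit_of_wage"], df["case_status"])

# Chi-square test of independence between wage unit and certification outcome
chi2, p, dof, expected = chi2_contingency(ct)

# Certification rate per wage unit for a quick comparative view
cert_rate = (ct["Certified"] / ct.sum(axis=1)).sort_values(ascending=False)
print(cert_rate)
print(f"chi-square={chi2:.2f}, dof={dof}, p={p:.4f}")
```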
Interpretation: Identify units associated with higher/lower certification rates, revealing visa
processing biases or challenges.
Implications: Understanding the impact of prevailing wage units on visa certification can inform
policy decisions and visa processing strategies. It can help employers and applicants anticipate the
likelihood of visa approval based on the unit of prevailing wage specified in job offers.
By conducting this analysis, we aim to uncover insights into the relationship between prevailing
wage units and visa certification outcomes, contributing to a deeper understanding of the factors
influencing the labor certification process.
Outlier Check:
Identify Potential Outliers: Utilize statistical methods such as the Interquartile Range (IQR) or z-score
to identify potential outliers in numerical features.
Visual Inspection: Plot box plots or scatter plots to visually inspect the distribution of data points and
identify any observations that lie far from the bulk of the data.
Handling Outliers:
Removal: Remove outliers if they are determined to be erroneous or not representative of the
underlying data distribution.
Winsorization: Cap extreme values by replacing them with the nearest non-outlier data point.
Clipping: Set a threshold beyond which values are clipped to prevent them from influencing the
analysis.
Impact Assessment: Assess the impact of outlier removal or treatment on the dataset and
subsequent analysis, ensuring that meaningful information is retained.
Documentation: Document the rationale behind outlier detection and any actions taken to address
outliers for transparency and reproducibility.
By systematically checking for outliers and implementing appropriate measures, we can ensure the
robustness and reliability of the dataset for subsequent analysis and modeling tasks.
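The sketch below illustrates the IQR-based outlier check and winsorization described above, applied to the prevailing wage. The file name and the conventional 1.5-IQR fence are assumptions, not choices documented in the report.

```python
import pandas as pd

df = pd.read_csv("EasyVisa.csv")  # hypothetical file name

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return lower/upper fences using the interquartile range."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(df["prevailing_wage"])
outliers = df[(df["prevailing_wage"] < lower) | (df["prevailing_wage"] > upper)]
print(f"{len(outliers)} potential outliers outside [{lower:.0f}, {upper:.0f}]")

# Winsorization: cap extreme values at the fences instead of dropping rows
df["prevailing_wage_capped"] = df["prevailing_wage"].clip(lower=lower, upper=upper)
```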
Model Selection: Choose a base model suitable for bagging, such as decision trees, random forests,
or gradient boosting machines (GBM).
Bagging Ensemble: Implement the bagging ensemble method, which involves training multiple
instances of the base model on different subsets of the training data and aggregating their
predictions.
Hyperparameter Tuning:
Grid Search: Define a grid of hyperparameters for the base model and bagging ensemble.
Scoring Metric: Choose an appropriate scoring metric (e.g., accuracy, precision, recall) to optimize
the model's performance.
Model Building:
Base Model Training: Train the base model on each subset of the training data, considering the
selected hyperparameters.
Bagging Ensemble Construction: Combine the predictions of the base models using averaging or
voting to generate the final ensemble prediction.
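The sketch below shows one way this bagging-plus-grid-search workflow could be set up with scikit-learn. X and y stand for the already encoded feature matrix and the binary case_status target, and the parameter grid is illustrative rather than the grid actually used in the project.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# X, y: encoded features and binary target (1 = Certified), assumed to exist
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)
# note: scikit-learn versions before 1.2 use `base_estimator` instead of `estimator`

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_samples": [0.6, 0.8, 1.0],
    "estimator__max_depth": [3, 5, None],
}

search = GridSearchCV(bagging, param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```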
Evaluation:
Validation Set: Evaluate the bagging ensemble model on a validation set to assess its performance.
Metrics: Calculate relevant evaluation metrics to measure the model's effectiveness, considering
both bias and variance.
Fine-Tuning:
Iterative Process: Iterate through the hyperparameter tuning and model building steps to refine the
model further.
Trade-Offs: Balance model complexity and generalization performance to achieve the desired trade-
off between bias and variance.
Validation Performance: Validate the final model's performance on a holdout test set to ensure its
generalization ability.
Robustness Testing: Conduct robustness testing to assess the model's stability across different
datasets and scenarios.
Evaluation Metrics:
Choose appropriate evaluation metrics based on the problem domain and business objectives, such
as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC-ROC).
Cross-Validation:
Utilize cross-validation techniques (e.g., k-fold cross-validation) to estimate the performance of each
model reliably on multiple subsets of the training data.
Performance Comparison:
Evaluate the performance of each model using the selected metrics and cross-validation results.
Compare the mean and variance of evaluation metrics across different models to assess consistency
and stability.
Statistical Testing:
Conduct statistical tests, such as paired t-tests or Wilcoxon signed-rank tests, to determine if
performance differences between models are statistically significant.
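As a sketch of this comparison, per-fold F1 scores computed on a shared cross-validation split can be compared with a Wilcoxon signed-rank test. The two models and the assumed X and y are placeholders for the project's candidate models and prepared data.

```python
from scipy.stats import wilcoxon
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Per-fold F1 scores on identical folds make the models directly comparable
scores = {name: cross_val_score(m, X, y, scoring="f1", cv=cv) for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: mean F1={s.mean():.3f}, std={s.std():.3f}")

# Paired, non-parametric test of the per-fold differences
stat, p = wilcoxon(scores["decision_tree"], scores["gradient_boosting"])
print(f"Wilcoxon signed-rank: stat={stat:.3f}, p={p:.4f}")
```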
Business Context:
Consider the specific requirements and constraints of the business problem when interpreting
model performance.
Prioritize metrics that align with business goals and objectives, such as minimizing false positives in
fraud detection or maximizing recall in healthcare diagnostics.
Identify the model(s) that consistently outperform others across various evaluation metrics and
statistical tests.
Take into account computational complexity, interpretability, and scalability when selecting the final
model.
Sensitivity Analysis:
Perform sensitivity analysis to assess the robustness of the selected model(s) to changes in
hyperparameters, training data, or input features.
Validate the performance of the selected model(s) on a holdout validation set to ensure
generalization to unseen data.
Provide explanations for the selected model(s)' predictions to enhance trust and understanding
among stakeholders.
Use techniques such as feature importance plots, SHAP values, or partial dependence plots to
interpret model behavior.
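A minimal sketch of how such interpretations could be produced is shown below, assuming a fitted tree-based model best_model, a held-out X_test/y_test split, and feature_names for the encoded columns; these names are placeholders rather than objects defined in the report.

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Impurity-based importances (available on tree ensembles such as gradient boosting)
importances = pd.Series(best_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))

# Permutation importance on held-out data is less biased toward high-cardinality features
perm = permutation_importance(
    best_model, X_test, y_test, scoring="f1", n_repeats=10, random_state=42
)
perm_importances = pd.Series(perm.importances_mean, index=feature_names)
print(perm_importances.sort_values(ascending=False).head(10))
```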
Document the rationale behind model selection, including performance comparison results,
statistical tests, and business considerations.
Present the final model(s) along with their evaluation metrics, interpretation, and recommendations
in a clear and concise manner for stakeholders.
By following these steps, we can systematically compare model performance and select the most
appropriate model for deployment, ensuring alignment with business objectives and robustness to
real-world challenges.
To provide actionable insights and recommendations based on the analysis conducted with the
Gradient Boosting classifier, here are some key points to consider:
Feature Importance:
Identify the most important features that contribute significantly to the model's predictions.
These features can provide valuable insights into the factors driving the outcome variable.
Model Performance:
Evaluate the model's performance metrics such as accuracy, precision, recall, and F1-score to assess
how well it generalizes to unseen data.
Compare the performance of the Gradient Boosting classifier with other models to determine its
effectiveness.
Over-fitting:
Check for signs of over-fitting by comparing the model's performance on the training and test
datasets. If the model performs significantly better on the training data than on the test data, it may
be over-fitting.
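A quick over-fitting check along these lines, assuming a fitted model gb_model and the train/test split sketched earlier:

```python
from sklearn.metrics import f1_score

# Compare train vs. test F1 for the fitted gradient boosting model
train_f1 = f1_score(y_train, gb_model.predict(X_train))
test_f1 = f1_score(y_test, gb_model.predict(X_test))
print(f"train F1={train_f1:.3f}, test F1={test_f1:.3f}")
# A large train-test gap suggests over-fitting; shallower trees, a lower learning
# rate, or subsampling are common remedies for gradient boosting.
```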
Hyperparameter Tuning:
Consider fine-tuning the hyperparameters of the Gradient Boosting classifier to optimize its
performance further. Techniques like grid search or random search can help find the best
combination of hyperparameters.
Data Quality:
Ensure that the quality of the input data is high. This includes addressing missing values, handling
outliers, and encoding categorical variables appropriately.
Model Interpretability:
Enhance the interpretability of the model by visualizing decision trees or feature importance plots.
This can help stakeholders understand the rationale behind the model's predictions.
Business Impact:
Translate the model's predictions into actionable insights that can drive business decisions. For
example, identify applicant profiles with a high probability of visa denial and prioritize their review
accordingly.
Continuous Monitoring:
Implement a system for monitoring the model's performance over time and updating it as new data
becomes available. This ensures that the model remains accurate and relevant in dynamic business
environments.
Feedback Loop:
Establish a feedback loop where insights from the model are used to refine business processes and
data collection strategies. This iterative approach can lead to continuous improvement and better
decision-making.
Document the entire analysis process, including data preprocessing, model training, evaluation, and
interpretation. Communicate the findings and recommendations effectively to stakeholders in a
clear and understandable manner.
By following these recommendations, we can leverage the insights gained from the Gradient
Boosting classifier to drive business outcomes and make informed decisions.
Part 2: Text Analytics Project
PROBLEM STATEMENT
To analyze every tweet sent by Donald Trump prior to and during his presidency to see how he used the
Twitter platform. We will be exploring and analyzing relevant columns of the data. We will be applying all
necessary text preprocessing steps on the content column of the tweet data. Also, given that he was
suspended from Twitter, we will analyze his tweets for a better understanding of his usage and activity
pattern on Twitter.
- The data contains Content, Date, No. of Retweets, and No. of Favorites columns for the tweets.
- The data contains a pre-labelled sentiment (positive and negative) column for the content of each tweet.
- We will be applying all necessary text preprocessing steps on the content column of the tweet data.
1. Introduction:
Project Overview: This report presents the findings of a text analytics project focused on analyzing
tweets sent by Donald Trump prior to and during his presidency. The objective is to understand how
he utilized the Twitter platform and identify patterns in his usage and activity.
Importance of the Analysis: Twitter has been a significant communication tool for political figures,
and understanding Trump's tweets can provide insights into his messaging strategies, public
engagement, and impact on public discourse.
2. Data Description:
The dataset comprises tweets from Donald Trump spanning the period from 2009
to 2020. It includes columns such as tweet ID, tweet content, date, number of retweets, number of
favorites, mentions, and pre-labeled sentiment for each tweet.
Relevance of Columns: The analysis will focus on exploring and analyzing relevant columns,
particularly the tweet content, date, retweets, favorites, and sentiment.
3. Methodology:
Text Preprocessing: Apply necessary text preprocessing steps, such as tokenization, lowercasing,
punctuation removal, and stopword removal, to clean the tweet content data (a preprocessing sketch
follows this list).
Exploratory Data Analysis (EDA): Conduct EDA to gain insights into the distribution of tweet
sentiments, trends over time, popular topics, and engagement metrics (retweets and favorites).
Sentiment Analysis: Utilize the pre-labeled sentiment column to analyze the overall sentiment of
Trump's tweets and identify any trends or patterns.
Usage Patterns: Analyze Trump's tweeting frequency over time to understand his activity pattern on
Twitter, including any significant events or milestones.
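The following is a minimal preprocessing sketch for the content column, using regular expressions and NLTK stopwords. The exact cleaning rules (URL and mention removal, the minimum token length) are assumptions made for illustration rather than the project's final pipeline.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list:
    """Lowercase, strip URLs/mentions/punctuation, split into tokens, drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|@\w+|#", " ", text)   # remove URLs, mentions, '#'
    text = re.sub(r"[^a-z\s]", " ", text)         # keep letters only
    return [t for t in text.split() if t not in STOPWORDS and len(t) > 2]

print(preprocess("MAKE AMERICA GREAT AGAIN! https://t.co/example @example_user"))
```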
4. Results:
Sentiment Analysis Results: Present the distribution of tweet sentiments (positive, negative) and
identify the most common themes associated with each sentiment category.
Engagement Metrics: Investigate the relationship between tweet content, sentiment, and
engagement metrics (retweets, favorites) to assess the impact of tweet sentiment on audience
engagement.
5. Recommendations:
Content Optimization: Suggest ways to optimize tweet content for maximizing audience engagement
and fostering positive sentiment.
Count Missing Values: Determine the number of missing values in each column to understand the
extent of missingness.
Missing Value Patterns:
Missingness Distribution: Visualize the distribution of missing values across columns to identify any
patterns or trends.
Correlation with Other Variables: Explore if missing values in one column are correlated with missing
values in other columns.
Data Collection Process: Investigate if missing values are due to data collection errors or limitations.
Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR):
Assess the randomness of missingness to determine if it follows any specific pattern.
Imputation Techniques: Consider various imputation methods such as mean imputation, median
imputation, mode imputation, or predictive imputation based on the data characteristics and
missingness pattern.
Deletion: Decide whether to delete rows or columns with missing values based on the impact on the
analysis and the proportion of missing values.
Domain Knowledge: Use domain knowledge to determine the most appropriate handling strategy
for missing values in each column.
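A short missing-value audit sketch is shown below, assuming the tweets are loaded into a pandas DataFrame from a hypothetical trump_tweets.csv with content and mentions columns; the handling choices are examples guided by the patterns above, not decisions made in this report.

```python
import pandas as pd

tweets = pd.read_csv("trump_tweets.csv")  # hypothetical file name

# Count and percentage of missing values per column
report = pd.DataFrame({
    "missing": tweets.isna().sum(),
    "percent": tweets.isna().mean().mul(100).round(2),
}).sort_values("missing", ascending=False)
print(report)

# Example handling, guided by the patterns above rather than applied blindly
tweets = tweets.dropna(subset=["content"])          # content is essential for text analysis
tweets["mentions"] = tweets["mentions"].fillna("")  # a missing mention list means "none"
```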
Analyze the number of characters in tweets based on sentiment (positive and negative)
Data Preparation:
Extract the tweet content and sentiment labels (positive or negative) from the dataset: Start by
extracting the necessary columns from the dataset, including the tweet content and the sentiment
labels associated with each tweet.
For each tweet, calculate the number of characters in the tweet content: Iterate through each tweet
in the dataset and count the number of characters in the tweet content. This can be easily done by
using built-in string functions or libraries in programming languages like Python.
Grouping by Sentiment:
Group the tweets based on their sentiment labels (positive and negative): Split the dataset into two
groups based on the sentiment labels (positive and negative). This segregation will allow you to
analyze the character counts separately for tweets with positive and negative sentiments.
Calculate the average number of characters for tweets in each sentiment group: Compute the
average character count for tweets in the positive sentiment group and the negative sentiment
group separately. This will give you an idea of the typical tweet length for each sentiment category.
Visualize Results:
Create a bar chart or box plot to visualize the average character count for positive and negative
sentiment tweets: Visualize the average character counts using appropriate plots such as bar charts
or box plots. This graphical representation will make it easier to compare the tweet lengths between
different sentiment categories.
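This character-count comparison could look like the sketch below, reusing the tweets DataFrame from the previous sketch and assuming the sentiment labels are 'positive' and 'negative'.

```python
import matplotlib.pyplot as plt

# Characters per tweet, then the average per sentiment group
tweets["char_count"] = tweets["content"].str.len()
avg_chars = tweets.groupby("sentiment")["char_count"].mean()
print(avg_chars)

avg_chars.plot(kind="bar", title="Average tweet length by sentiment")
plt.ylabel("characters")
plt.tight_layout()
plt.show()
```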
Extract the tweet content and sentiment labels (positive or negative) from the dataset.
Tokenization:
Split the tweet content into individual word tokens, then remove any punctuation marks, special
characters, or stopwords from the tokens.
Grouping by Sentiment:
Group the tokens based on their associated sentiment labels (positive and negative).
Calculate the frequency of each word in the tokens for both positive and negative sentiment groups.
This can be done by creating a dictionary or using libraries like NLTK or scikit-learn in Python.
Identify the most frequent words for both positive and negative sentiment groups.
Sort the words based on their frequency and select the top N words to analyze.
Visualization:
Create visualizations such as word clouds, bar charts, or histograms to display the top words for each
sentiment group.
Word clouds are particularly effective for visually representing word frequency, with larger words
indicating higher frequency.
Conduct statistical tests, such as chi-square tests or t-tests, to determine if there are significant
differences in word frequencies between positive and negative sentiment groups.
This step is optional but can provide additional insights into the relationship between sentiment and
word usage.
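A sketch of the frequency count using collections.Counter and the preprocess helper defined earlier; the top-20 cutoff is an arbitrary illustration.

```python
from collections import Counter

top_words = {}
for label in ["positive", "negative"]:
    tokens = []
    for text in tweets.loc[tweets["sentiment"] == label, "content"]:
        tokens.extend(preprocess(text))
    # Most frequent words for this sentiment group
    top_words[label] = Counter(tokens).most_common(20)

print(top_words["positive"][:10])
print(top_words["negative"][:10])
```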
Aggregate the tweet time-stamps by hour to count the number of tweets posted during each hour of
the day.
Visualization:
Create a bar chart or line plot to visualize the number of tweets posted during each hour of the day.
The x-axis represents the hours of the day (e.g., 0 to 23), and the y-axis represents the count of
tweets.
Determine the hour with the highest number of tweets to identify the most active hour on Twitter.
This hour represents the time period when Twitter users are most engaged or active.
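A sketch of the hourly aggregation and plot, assuming the date column parses cleanly to datetimes:

```python
import matplotlib.pyplot as plt
import pandas as pd

tweets["date"] = pd.to_datetime(tweets["date"])
tweets_per_hour = tweets["date"].dt.hour.value_counts().sort_index()

tweets_per_hour.plot(kind="bar", title="Tweets by hour of day")
plt.xlabel("hour of day (0-23)")
plt.ylabel("number of tweets")
plt.tight_layout()
plt.show()

print("Most active hour:", tweets_per_hour.idxmax())
```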
Interpretation:
Consider factors such as user demographics, time zones, and topical events that may influence tweet
activity patterns.
Additional Analysis (Optional):
Conduct further analysis to explore trends in tweet activity over different days of the week or
months of the year.
Compare tweet activity across different user groups or hash-tags to identify patterns and trends.
Extract the tweet timestamps, number of retweets, and number of favorites from the dataset.
Ensure that the timestamp data is in a suitable format for date-time manipulation.
Yearly Aggregation:
Group the tweets by year and calculate the total number of tweets, retweets, and favorites for each
year.
Visualization:
Create separate line plots or bar charts to visualize the trends in the number of tweets, retweets,
and favorites over the years.
Plot the years on the x-axis and the corresponding counts on the y-axis.
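A yearly aggregation sketch, reusing the parsed date column from the previous sketch; the retweets and favorites column names are assumptions about the dataset's schema.

```python
import matplotlib.pyplot as plt

yearly = (
    tweets.assign(year=tweets["date"].dt.year)
          .groupby("year")
          .agg(tweets=("content", "count"),
               retweets=("retweets", "sum"),
               favorites=("favorites", "sum"))
)
print(yearly)

# One panel per metric to compare trends over the years
yearly.plot(subplots=True, figsize=(8, 6))
plt.tight_layout()
plt.show()
```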
Interpretation:
Analyze the trends in tweet, retweet, and favorite counts over the years to identify any significant
changes or patterns.
Consider factors such as the growth of Twitter users, changes in user engagement, and the impact of
major events or trends on tweet activity.
Insights:
Identify which years had the highest and lowest tweet, retweet, and favorite counts.
Look for correlations between tweet activity and external factors such as political events, cultural
phenomena, or platform changes.
Word-cloud
Text Preprocessing:
Perform text preprocessing steps such as tokenization, lowercase conversion, punctuation removal,
and stopword removal to clean the tweet content.
Optionally, you can perform lemmatization or stemming to reduce words to their base form.
Use a word cloud generation library (e.g., the wordcloud package in Python) to create the word cloud.
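A word cloud sketch for positive tweets using the wordcloud package and the preprocess helper from earlier; the sentiment label values are again assumed.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join the cleaned tokens of all positive tweets into a single string
positive_text = " ".join(
    " ".join(preprocess(text))
    for text in tweets.loc[tweets["sentiment"] == "positive", "content"]
)

wc = WordCloud(width=800, height=400, background_color="white").generate(positive_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```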
Prominent Words:
The word cloud prominently features words that are frequently used in positive tweets. These words
appear larger in size, indicating higher frequencies.
Common Themes:
Upon closer inspection, we observe several common themes and topics that dominate the positive
tweet content. These themes may include words related to:
Happiness: Words conveying joy, happiness, or satisfaction, indicating positive emotions and
contentment.
Inspiration: Words that inspire or motivate others, encouraging positive attitudes and actions.
Love and Friendship: Words related to love, friendship, and bonding, reflecting positive relationships
and connections.
Contextual Analysis:
It's important to consider the context in which these words are used within positive tweets. The
surrounding text and sentiment of the tweets can provide valuable context and deeper insights into
the positive themes and sentiments expressed.
User Engagement:
Positive tweets often garner higher levels of user engagement, including likes, retweets, and
comments. The positive sentiment conveyed through these tweets may resonate with a broader
audience, leading to increased interaction and engagement on the platform.
Positive tweets have the potential to uplift and inspire others, fostering a supportive and optimistic
online community. By analyzing the word cloud, we can identify the key elements that contribute to
the creation of positive content and its impact on Twitter users.
In conclusion, the positive tweet word cloud provides a snapshot of the prevailing themes and
sentiments expressed in positive tweets. It highlights the importance of positivity in social media
discourse and the potential for positive content to foster engagement, inspiration, and connection
among users.
Negative Tweet Word Cloud Analysis
The negative tweet word cloud offers a visual representation of the most frequently used words in
tweets associated with negative sentiment. By examining the word cloud, we can uncover prevalent
themes and topics that evoke negative reactions from Twitter users. Here's a breakdown of the key
insights derived from the negative tweet word cloud:
6. Conclusion:
Key Takeaways: Summarize the main findings and insights from the analysis, highlighting significant
trends, patterns, and implications.
Future Research: Identify potential areas for further research and analysis to deepen understanding
of Twitter usage patterns and their impact on public discourse.
7. References:
Cite relevant sources and references used in the analysis and methodology development.
Observations:
Scope of Analysis: The analysis encompasses tweets sent by Donald Trump between 2009 and 2020,
covering a significant period before and during his presidency.
Content and Sentiment Labeling: The dataset includes tweet content, date, number of retweets,
number of favorites, mentions, and pre-labeled sentiment (positive and negative) for each tweet.
Tweet Frequency: The frequency of tweets varies over time, with potentially notable spikes during
significant events, policy announcements, or political developments.
Engagement Metrics: The number of retweets and favorites provides insights into the level of
engagement and resonance of Trump's tweets with his audience.
Sentiment Analysis: Pre-labeled sentiment allows for the analysis of the overall sentiment
distribution of Trump's tweets and identification of patterns in positive and negative sentiment over
time.
Text Preprocessing: Text preprocessing steps such as tokenization, lowercasing, punctuation removal,
and stopword removal are necessary to clean the tweet content data for analysis.
Wordcloud Generation: Word clouds can be generated separately for positive and negative tweets
to visualize the most frequent words associated with each sentiment category.
Interpretation: Analyzing the content and sentiment of Trump's tweets can provide insights into his
communication strategies, public engagement, and impact on public discourse.
Summary:
The analysis of Donald Trump's tweets presents an opportunity to gain insights into his
communication style, messaging strategies, and the public's response to his tweets. By examining
the frequency, content, and sentiment of his tweets, we can identify key themes, trends, and
patterns over time. Furthermore, the generation of word clouds for positive and negative tweets
allows for a visual representation of the most frequently used words associated with each sentiment
category, providing additional context and understanding. Overall, this analysis contributes to a
deeper understanding of Trump's Twitter usage and its implications for political discourse and public
opinion.