Ml-1-Guided-Bus Report
Table of Contents
Part 1: Easy Visa Project
Introduction
Business Context
Role of OFLC
Objective
Data Description
Methodology
Results
Recommendations
Conclusion
Part 2: Text Analytics Project
Introduction
Sentiment Analysis Labels
Data Preprocessing
Sentiment Analysis
Conclusion
Future Work
References
Part 1: Easy Visa Project
Context:
Business communities in the United States are facing high demand for human resources, but one of the
constant challenges is identifying and attracting the right talent, which is perhaps the most important element
in remaining competitive. Companies in the United States look for hard-working, talented, and qualified
individuals both locally and abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to
work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on
their wages or working conditions by ensuring US employers' compliance with statutory requirements when
they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office
of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United
States and grants certifications in those cases where employers can demonstrate that there are not sufficient
US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in
the area of intended employment.
Objective:
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and
permanent labor certifications. This was a nine percent increase in the overall number of processed
applications from the previous year. The process of reviewing every case is becoming a tedious task as the
number of applicants is increasing every year.
The increasing number of applicants every year calls for a Machine Learning based solution that can help in
shortlisting the candidates with higher chances of visa approval. OFLC has hired the firm EasyVisa for data-
driven solutions. As a data scientist at EasyVisa, you have to analyze the data provided and, with the help of a
classification model:
Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the
drivers that significantly influence the case status.
Data Description
The data contains the different attributes of the employee and the employer. The detailed data dictionary is given
below.
prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of
intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not
underpaid compared to other workers offering the same or similar service in the same area of employment.
unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
Executive Summary
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United
States to work on either a temporary or permanent basis. The act also protects US workers against
adverse impacts on their wages or working conditions by ensuring US employers' compliance with
statutory requirements when they hire foreign workers to fill workforce shortages.
The increasing number of applicants every year calls for a Machine Learning based solution that can
help in shortlisting the candidates with higher chances of visa approval. We analyzed the data provided
and built classification models to:
▪ Recommend a suitable profile for the applicants for whom the visa should be certified or denied,
based on the drivers that significantly influence the case status.
1. Introduction
Overview of Business Challenge: Companies in the United States face challenges in identifying and attracting
the right talent, both locally and internationally, impacting their competitiveness.
Purpose of the Project: The project aims to facilitate the visa approval process by leveraging machine learning
techniques to analyze applicant profiles.
2. Business Context
Challenges Faced by US Companies: Discussion on the demand for human resources, the role of talent
acquisition in competitiveness, and the importance of efficient visa processing.
Role of OFLC: Explanation of the Office of Foreign Labor Certification's role in administering immigration
programs and processing visa applications.
3. Objective
Goals of the Project: To develop a machine learning-based solution for shortlisting visa applicants and
recommending suitable profiles for visa certification.
4. Data Description
Explanation of Data Attributes: Detailed description of the dataset attributes relevant to visa approval
processes.
Relevance to Visa Approval Process: Discussion on how each attribute contributes to the analysis and decision-
making process.
5. Methodology
Approach to Data Analysis: Description of the methodology used for data preprocessing, model development,
and evaluation.
Model Development Strategy: Explanation of the strategy for building and training the classification model.
6. Results
Findings from Model Analysis: Presentation of key findings and insights derived from analyzing the model's
predictions.
Insights into Visa Approval Factors: Discussion on factors influencing visa approval and their significance.
7. Recommendations
Strategies for Facilitating Visa Approvals: Recommendations for OFLC and EasyVisa on improving the efficiency
and accuracy of the visa approval process.
To prioritize limited resources towards screening a batch of applications for those most likely to be certified:
• Sort applications by level of education and review the higher levels of education first.
• Sort applications by previous job experience and review those with experience first.
• Divide applications for jobs into those with an hourly wage and those with an annual
wage, sort each group by the prevailing wage, then review applications for salaried jobs first.
As stated previously, the Gradient Boosting classifier performs the best of all the models
created. However, as shown above, the tuned Decision Tree model performs only marginally worse by
F1 score and is a far simpler model. This model may be preferable if post-hoc explanations of
individual predictions are required.
• Furthermore, OFLC should examine more thoroughly why the certification or denial of an application
can be predicted so well from just three decision nodes, as shown above.
• For those in less skilled, entry-level, and/or hourly jobs, the system appears far less likely to certify
applications; the model's drivers indicate which criteria are influencing visa approvals for this group.
▪ EDA (univariate and bivariate analysis), duplicate value check, missing value treatment, outlier check
(treatment if needed)
• Decision Tree, Random Forest, Bagging, Boosting Classifiers (AdaBoost, Gradient Boosting, XGBoost),
Stacking Classifier
Univariate Analysis
▪ The distribution of the number of employees per employer is heavily right-skewed
▪ The average and median annual salary are approximately USD 70,000, which seems plausible
▪ The trend appears plausible, with outliers in the higher income bracket between USD 200,000 and USD
300,000
▪ There are several very low salaries as well, which appear incorrect and require further investigation
(Figure: distribution of the number of employees)
Observations on prevailing wage
▪ The majority of applicants hold either a bachelor's degree (40%) or a master's degree (38%), while a
minority hold a doctorate (8%) or only a high school diploma (13%)
▪ Around 58% of applicants have prior job experience and 42% do not
Explanation of Data Attributes: The dataset provides comprehensive information about visa applicants and
their employers, crucial for understanding the factors influencing visa approval. Attributes include case_id,
continent, education_of_employee, has_job_experience, requires_job_training, no_of_employees,
yr_of_estab, region_of_employment, prevailing_wage, unit_of_wage, full_time_position, and case_status.
Relevance to Visa Approval Process: Each attribute offers valuable insights into applicant profiles and employer
characteristics, aiding in the development of a classification model for visa approval prediction.
Educational Background of Applicants: A substantial portion of applicants hold either a bachelor's degree (40%)
or a master's degree (38%). However, a minority possess either a doctorate (8%) or only a high school diploma
(13%). Understanding the distribution of educational qualifications among applicants provides insights into the
skill levels and expertise sought by US employers.
Prior Job Experience: Approximately 58% of visa applicants have prior job experience, while 42% do not. This
highlights the importance of assessing work history as a factor in visa approval likelihood. Applicants with prior
job experience may demonstrate a higher level of employability and integration into the US workforce.
Irrespective of the continent the employee is from, more cases are certified than denied
▪ The trend observed w.r.t % certification for continents is Europe > Africa > Asia > Oceania >
▪ As expected, the % of visa certifications is higher for applicants with job experience than for those without
Bivariate Analysis
Bivariate analysis examines relationships between pairs of variables, revealing insights into how
different factors interact and their impact on visa approval outcomes.
Continent vs. Case Status: Investigates approval rates across continents to identify regional patterns.
Education vs. Prevailing Wage: Explores the correlation between education level and offered wage,
indicating market demand for qualifications.
Job Experience vs. Full-Time Position: Assesses if experienced candidates are more likely to secure
full-time roles, revealing employer preferences.
Region of Employment vs. No. of Employees: Examines regional economic trends by correlating
employment regions with company sizes.
Year of Establishment vs. Prevailing Wage: Analyzes wage growth over time to understand market
maturity and industry trends.
Case Status vs. Prevailing Wage: Investigates the impact of salary on approval rates, guiding optimal
wage strategies.
This analysis offers actionable insights into visa approval dynamics, aiding in informed decision-
making to enhance approval likelihood.
Individuals with higher education levels often seek well-paid job opportunities abroad. This
inclination can influence visa application patterns and outcomes. For the Easy Visa Project:
Education and Visa Intent: Higher-educated individuals may target countries like the US for lucrative
job prospects.
Impact on Visa Approval: Visa applicants with advanced degrees may prioritize regions with strong
job markets and competitive salaries, affecting approval rates.
Analysis Focus: Explore how education levels relate to salary expectations and visa outcomes to
understand applicant motivations.
Recommendations: Tailor visa policies and employer strategies to attract and retain highly educated
professionals with competitive compensation packages.
Understanding this trend aids in developing effective visa processing strategies and attracting top
talent globally.
Regional Talent Requirements: Various regions have distinct needs for talent with diverse
educational backgrounds, influencing visa approval dynamics.
Diverse Educational Needs: Different regions prioritize specific skill sets and educational
qualifications based on their economic sectors and workforce demands.
Visa Approval Implications: Understanding regional talent requirements is crucial for predicting visa
approval likelihood, as applicants with relevant educational backgrounds may align more closely
with regional needs.
Analysis Significance: Exploring the correlation between educational backgrounds and regional
employment demands can provide insights into visa approval patterns across different geographic
areas.
Strategic Recommendations: Tailoring visa processing strategies to align with regional talent
requirements can enhance approval rates and support economic growth by meeting workforce
needs effectively.
Acknowledging regional variations in talent demands informs targeted visa processing approaches,
facilitating efficient matching of applicants with regional employment opportunities.
Regional Employment: Different regions have unique job markets with varying demands for specific
skills and qualifications. Visa applicants must align their profiles with the employment landscape of
their desired region.
Continent: Visa applicants originate from diverse continents, each with its own educational systems,
cultural norms, and workforce characteristics. Understanding these differences can provide insights
into the diverse pool of applicants.
Job Experience: The presence or absence of job experience among visa applicants is a critical factor
in assessing their employability and potential contribution to the workforce. Experienced candidates
may be more desirable to employers seeking immediate productivity.
Job Training Requirements: Some positions may require specific training or qualifications beyond
formal education. Assessing whether applicants require job training can help determine their
readiness for employment and their potential impact on the workforce.
These factors collectively influence visa approval decisions and highlight the importance of aligning
applicant profiles with regional employment needs and expectations. Understanding these dynamics
enables better-informed decision-making in visa processing and talent acquisition strategies.
Purpose of Prevailing Wage: The prevailing wage is established by the US government to ensure that
both local talent and foreign workers are fairly compensated. It aims to prevent the undercutting of
wages and protect the interests of workers in the US labor market.
Hypothesis: We hypothesize that there may be a correlation between prevailing wage and visa
status. Higher prevailing wages might attract more visa applicants, potentially leading to a higher
number of certified visas. Conversely, lower prevailing wages could indicate less demand for foreign
workers, resulting in a higher proportion of denied visas.
Data Exploration: We will explore the distribution of prevailing wages among certified and denied
visa cases to identify any patterns or trends. This analysis will help us understand if prevailing wage
levels influence visa approval decisions.
Statistical Testing: We may conduct statistical tests, such as chi-square tests or t-tests, to determine
if there is a significant difference in prevailing wage distributions between certified and denied visa
cases.
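As a hedged illustration of this testing step, the sketch below compares prevailing wages between certified and denied cases with a Welch t-test and a Mann-Whitney U test. The file name and the column values ("Certified"/"Denied") are assumptions for illustration, not details confirmed by the dataset.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("EasyVisa.csv")  # hypothetical file name

# Split prevailing wages by case outcome (column values assumed)
certified = df.loc[df["case_status"] == "Certified", "prevailing_wage"].dropna()
denied = df.loc[df["case_status"] == "Denied", "prevailing_wage"].dropna()

# Welch's t-test: does not assume equal variances between the two groups
t_stat, t_p = stats.ttest_ind(certified, denied, equal_var=False)

# Mann-Whitney U: non-parametric alternative, safer for skewed wage distributions
u_stat, u_p = stats.mannwhitneyu(certified, denied, alternative="two-sided")

print(f"Welch t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={u_p:.4f}")
```

The non-parametric test is included because the univariate analysis showed prevailing wages are skewed with high-value outliers.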
Implications: Understanding the relationship between prevailing wage and visa status can inform
policy decisions and visa processing strategies. It can help employers anticipate the impact of
prevailing wage changes on their ability to hire foreign workers and guide foreign workers in
selecting job opportunities aligned with prevailing wage trends.
By analyzing the data, we aim to provide insights into the dynamics between prevailing wage levels
and visa approval outcomes, contributing to informed decision-making in labor certification
processes and immigration policy.
Objective: Determine if prevailing wage units (e.g., Hourly, Weekly) influence visa certification.
Rationale: Unit choice may reflect job nature and industry norms, impacting certification.
Data Exploration: Examine prevailing wage unit distributions among certified and denied visas.
Comparative Analysis: We will compare the certification rates across different units of prevailing
wage to identify any significant differences. Statistical tests, such as chi-square tests or ANOVA, may
be employed to assess the significance of these differences.
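A minimal sketch of such a comparative test is given below, assuming the same hypothetical EasyVisa.csv file and a case_status column taking the values Certified/Denied. It builds a contingency table of wage unit against case outcome and applies a chi-square test of independence.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("EasyVisa.csv")  # hypothetical file name

# Contingency table of wage unit vs. case outcome
ct = pd.crosstab(df["unit_of_wage"], df["case_status"])

# Chi-square test of independence between wage unit and certification outcome
chi2, p, dof, expected = chi2_contingency(ct)

# Certification rate per wage unit for a quick comparative view
cert_rate = (ct["Certified"] / ct.sum(axis=1)).sort_values(ascending=False)
print(cert_rate)
print(f"chi-square={chi2:.2f}, dof={dof}, p={p:.4f}")
```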
Interpretation: Identify units associated with higher/lower certification rates, revealing visa
processing biases or challenges.
Implications: Understanding the impact of prevailing wage units on visa certification can inform
policy decisions and visa processing strategies. It can help employers and applicants anticipate the
likelihood of visa approval based on the unit of prevailing wage specified in job offers.
By conducting this analysis, we aim to uncover insights into the relationship between prevailing
wage units and visa certification outcomes, contributing to a deeper understanding of the factors
influencing the labor certification process.
Outlier Check:
Identify Potential Outliers: Utilize statistical methods such as the Interquartile Range (IQR) or z-score
to identify potential outliers in numerical features.
Visual Inspection: Plot box plots or scatter plots to visually inspect the distribution of data points and
identify any observations that lie far from the bulk of the data.
Handling Outliers:
Removal: Remove outliers if they are determined to be erroneous or not representative of the
underlying data distribution.
Winsorization: Cap extreme values by replacing them with the nearest non-outlier data point.
Clipping: Set a threshold beyond which values are clipped to prevent them from influencing the
analysis.
Impact Assessment: Assess the impact of outlier removal or treatment on the dataset and
subsequent analysis, ensuring that meaningful information is retained.
Documentation: Document the rationale behind outlier detection and any actions taken to address
outliers for transparency and reproducibility.
By systematically checking for outliers and implementing appropriate measures, we can ensure the
robustness and reliability of the dataset for subsequent analysis and modeling tasks.
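The sketch below illustrates the IQR-based outlier check and winsorization described above, applied to the prevailing wage. The file name and the conventional 1.5-IQR fence are assumptions, not choices documented in the report.

```python
import pandas as pd

df = pd.read_csv("EasyVisa.csv")  # hypothetical file name

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return lower/upper fences using the interquartile range."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(df["prevailing_wage"])
outliers = df[(df["prevailing_wage"] < lower) | (df["prevailing_wage"] > upper)]
print(f"{len(outliers)} potential outliers outside [{lower:.0f}, {upper:.0f}]")

# Winsorization: cap extreme values at the fences instead of dropping rows
df["prevailing_wage_capped"] = df["prevailing_wage"].clip(lower=lower, upper=upper)
```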
Model Selection: Choose a base model suitable for bagging, such as decision trees, random forests,
or gradient boosting machines (GBM).
Bagging Ensemble: Implement the bagging ensemble method, which involves training multiple
instances of the base model on different subsets of the training data and aggregating their
predictions.
Hyperparameter Tuning:
Grid Search: Define a grid of hyperparameters for the base model and bagging ensemble.
Scoring Metric: Choose an appropriate scoring metric (e.g., accuracy, precision, recall) to optimize
the model's performance.
Model Building:
Base Model Training: Train the base model on each subset of the training data, considering the
selected hyperparameters.
Bagging Ensemble Construction: Combine the predictions of the base models using averaging or
voting to generate the final ensemble prediction.
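The sketch below shows one way this bagging-plus-grid-search workflow could be set up with scikit-learn. X and y stand for the already encoded feature matrix and the binary case_status target, and the parameter grid is illustrative rather than the grid actually used in the project.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# X, y: encoded features and binary target (1 = Certified), assumed to exist
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)
# note: scikit-learn versions before 1.2 use `base_estimator` instead of `estimator`

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_samples": [0.6, 0.8, 1.0],
    "estimator__max_depth": [3, 5, None],
}

search = GridSearchCV(bagging, param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```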
Evaluation:
Validation Set: Evaluate the bagging ensemble model on a validation set to assess its performance.
Metrics: Calculate relevant evaluation metrics to measure the model's effectiveness, considering
both bias and variance.
Fine-Tuning:
Iterative Process: Iterate through the hyperparameter tuning and model building steps to refine the
model further.
Trade-Offs: Balance model complexity and generalization performance to achieve the desired trade-
off between bias and variance.
Validation Performance: Validate the final model's performance on a holdout test set to ensure its
generalization ability.
Robustness Testing: Conduct robustness testing to assess the model's stability across different
datasets and scenarios.
Evaluation Metrics:
Choose appropriate evaluation metrics based on the problem domain and business objectives, such
as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC-ROC).
Cross-Validation:
Utilize cross-validation techniques (e.g., k-fold cross-validation) to estimate the performance of each
model reliably on multiple subsets of the training data.
Performance Comparison:
Evaluate the performance of each model using the selected metrics and cross-validation results.
Compare the mean and variance of evaluation metrics across different models to assess consistency
and stability.
Statistical Testing:
Conduct statistical tests, such as paired t-tests or Wilcoxon signed-rank tests, to determine if
performance differences between models are statistically significant.
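As a sketch of this comparison, per-fold F1 scores computed on a shared cross-validation split can be compared with a Wilcoxon signed-rank test. The two models and the assumed X and y are placeholders for the project's candidate models and prepared data.

```python
from scipy.stats import wilcoxon
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Per-fold F1 scores on identical folds make the models directly comparable
scores = {name: cross_val_score(m, X, y, scoring="f1", cv=cv) for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: mean F1={s.mean():.3f}, std={s.std():.3f}")

# Paired, non-parametric test of the per-fold differences
stat, p = wilcoxon(scores["decision_tree"], scores["gradient_boosting"])
print(f"Wilcoxon signed-rank: stat={stat:.3f}, p={p:.4f}")
```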
Business Context:
Consider the specific requirements and constraints of the business problem when interpreting
model performance.
Prioritize metrics that align with business goals and objectives, such as minimizing false positives in
fraud detection or maximizing recall in healthcare diagnostics.
Identify the model(s) that consistently outperform others across various evaluation metrics and
statistical tests.
Take into account computational complexity, interpretability, and scalability when selecting the final
model.
Sensitivity Analysis:
Perform sensitivity analysis to assess the robustness of the selected model(s) to changes in
hyperparameters, training data, or input features.
Validate the performance of the selected model(s) on a holdout validation set to ensure
generalization to unseen data.
Provide explanations for the selected model(s)' predictions to enhance trust and understanding
among stakeholders.
Use techniques such as feature importance plots, SHAP values, or partial dependence plots to
interpret model behavior.
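A minimal sketch of how such interpretations could be produced is shown below, assuming a fitted tree-based model best_model, a held-out X_test/y_test split, and feature_names for the encoded columns; these names are placeholders rather than objects defined in the report.

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Impurity-based importances (available on tree ensembles such as gradient boosting)
importances = pd.Series(best_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))

# Permutation importance on held-out data is less biased toward high-cardinality features
perm = permutation_importance(
    best_model, X_test, y_test, scoring="f1", n_repeats=10, random_state=42
)
perm_importances = pd.Series(perm.importances_mean, index=feature_names)
print(perm_importances.sort_values(ascending=False).head(10))
```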
Document the rationale behind model selection, including performance comparison results,
statistical tests, and business considerations.
Present the final model(s) along with their evaluation metrics, interpretation, and recommendations
in a clear and concise manner for stakeholders.
By following these steps, we can systematically compare model performance and select the most
appropriate model for deployment, ensuring alignment with business objectives and robustness to
real-world challenges.
To provide actionable insights and recommendations based on the analysis conducted with the
Gradient Boosting classifier, here are some key points to consider:
Feature Importance:
Identify the most important features that contribute significantly to the model's predictions.
These features can provide valuable insights into the factors driving the outcome variable.
Model Performance:
Evaluate the model's performance metrics such as accuracy, precision, recall, and F1-score to assess
how well it generalizes to unseen data.
Compare the performance of the Gradient Boosting classifier with other models to determine its
effectiveness.
Over-fitting:
Check for signs of over-fitting by comparing the model's performance on the training and test
datasets. If the model performs significantly better on the training data than on the test data, it may
be over-fitting.
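A quick over-fitting check along these lines, assuming a fitted model gb_model and the train/test split sketched earlier:

```python
from sklearn.metrics import f1_score

# Compare train vs. test F1 for the fitted gradient boosting model
train_f1 = f1_score(y_train, gb_model.predict(X_train))
test_f1 = f1_score(y_test, gb_model.predict(X_test))
print(f"train F1={train_f1:.3f}, test F1={test_f1:.3f}")
# A large train-test gap suggests over-fitting; shallower trees, a lower learning
# rate, or subsampling are common remedies for gradient boosting.
```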
Hyperparameter Tuning:
Consider fine-tuning the hyperparameters of the Gradient Boosting classifier to optimize its
performance further. Techniques like grid search or random search can help find the best
combination of hyperparameters.
Data Quality:
Ensure that the quality of the input data is high. This includes addressing missing values, handling
outliers, and encoding categorical variables appropriately.
Model Interpretability:
Enhance the interpretability of the model by visualizing decision trees or feature importance plots.
This can help stakeholders understand the rationale behind the model's predictions.
Business Impact:
Translate the model's predictions into actionable insights that can drive business decisions. For
example, identify applicant profiles with a high probability of visa denial and prioritize their review
accordingly.
Continuous Monitoring:
Implement a system for monitoring the model's performance over time and updating it as new data
becomes available. This ensures that the model remains accurate and relevant in dynamic business
environments.
Feedback Loop:
Establish a feedback loop where insights from the model are used to refine business processes and
data collection strategies. This iterative approach can lead to continuous improvement and better
decision-making.
Document the entire analysis process, including data preprocessing, model training, evaluation, and
interpretation. Communicate the findings and recommendations effectively to stakeholders in a
clear and understandable manner.
By following these recommendations, we can leverage the insights gained from the Gradient
Boosting classifier to drive business outcomes and make informed decisions.
Part 2: Text Analytics Project
PROBLEM STATEMENT
To analyze every tweet sent by Donald Trump prior to and during his presidency to see how he used the
Twitter platform. We will be exploring and analyzing relevant columns of the data. We will be applying all
necessary text preprocessing steps on the content column of the tweet data. Also, given that he was
suspended from Twitter, we will analyze his tweets for a better understanding of his usage and activity
pattern on Twitter.
- The data contains Content, Date, No. of Retweets, and No. of Favorites columns for the tweets.
- The data contains a pre-labelled sentiment (positive and negative) column for the content of each tweet.
- We will be applying all necessary text preprocessing steps on the content column of the tweet data.
1. Introduction:
Project Overview: This report presents the findings of a text analytics project focused on analyzing
tweets sent by Donald Trump prior to and during his presidency. The objective is to understand how
he utilized the Twitter platform and identify patterns in his usage and activity.
Importance of the Analysis: Twitter has been a significant communication tool for political figures,
and understanding Trump's tweets can provide insights into his messaging strategies, public
engagement, and impact on public discourse.
2. Data Description:
The dataset comprises tweets from Donald Trump spanning the period from 2009
to 2020. It includes columns such as tweet ID, tweet content, date, number of retweets, number of
favorites, mentions, and pre-labeled sentiment for each tweet.
Relevance of Columns: The analysis will focus on exploring and analyzing relevant columns,
particularly the tweet content, date, retweets, favorites, and sentiment.
3. Methodology:
Text Preprocessing: Apply necessary text preprocessing steps, such as tokenization, lowercasing,
punctuation removal, and stopword removal, to clean the tweet content data (a preprocessing sketch
follows this list).
Exploratory Data Analysis (EDA): Conduct EDA to gain insights into the distribution of tweet
sentiments, trends over time, popular topics, and engagement metrics (retweets and favorites).
Sentiment Analysis: Utilize the pre-labeled sentiment column to analyze the overall sentiment of
Trump's tweets and identify any trends or patterns.
Usage Patterns: Analyze Trump's tweeting frequency over time to understand his activity pattern on
Twitter, including any significant events or milestones.
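The following is a minimal preprocessing sketch for the content column, using regular expressions and NLTK stopwords. The exact cleaning rules (URL and mention removal, the minimum token length) are assumptions made for illustration rather than the project's final pipeline.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list:
    """Lowercase, strip URLs/mentions/punctuation, split into tokens, drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|@\w+|#", " ", text)   # remove URLs, mentions, '#'
    text = re.sub(r"[^a-z\s]", " ", text)         # keep letters only
    return [t for t in text.split() if t not in STOPWORDS and len(t) > 2]

print(preprocess("MAKE AMERICA GREAT AGAIN! https://t.co/example @example_user"))
```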
4. Results:
Sentiment Analysis Results: Present the distribution of tweet sentiments (positive, negative) and
identify the most common themes associated with each sentiment category.
Engagement Metrics: Investigate the relationship between tweet content, sentiment, and
engagement metrics (retweets, favorites) to assess the impact of tweet sentiment on audience
engagement.
5. Recommendations:
Content Optimization: Suggest ways to optimize tweet content for maximizing audience engagement
and fostering positive sentiment.
Count Missing Values: Determine the number of missing values in each column to understand the
extent of missingness.
Missing Value Patterns:
Missingness Distribution: Visualize the distribution of missing values across columns to identify any
patterns or trends.
Correlation with Other Variables: Explore if missing values in one column are correlated with missing
values in other columns.
Data Collection Process: Investigate if missing values are due to data collection errors or limitations.
Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR):
Assess the randomness of missingness to determine if it follows any specific pattern.
Imputation Techniques: Consider various imputation methods such as mean imputation, median
imputation, mode imputation, or predictive imputation based on the data characteristics and
missingness pattern.
Deletion: Decide whether to delete rows or columns with missing values based on the impact on the
analysis and the proportion of missing values.
Domain Knowledge: Use domain knowledge to determine the most appropriate handling strategy
for missing values in each column.
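A short missing-value audit sketch is shown below, assuming the tweets are loaded into a pandas DataFrame from a hypothetical trump_tweets.csv with content and mentions columns; the handling choices are examples guided by the patterns above, not decisions made in this report.

```python
import pandas as pd

tweets = pd.read_csv("trump_tweets.csv")  # hypothetical file name

# Count and percentage of missing values per column
report = pd.DataFrame({
    "missing": tweets.isna().sum(),
    "percent": tweets.isna().mean().mul(100).round(2),
}).sort_values("missing", ascending=False)
print(report)

# Example handling, guided by the patterns above rather than applied blindly
tweets = tweets.dropna(subset=["content"])          # content is essential for text analysis
tweets["mentions"] = tweets["mentions"].fillna("")  # a missing mention list means "none"
```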
Analyze the number of characters in tweets based on sentiment (positive and negative)
Data Preparation:
Extract the tweet content and sentiment labels (positive or negative) from the dataset: Start by
extracting the necessary columns from the dataset, including the tweet content and the sentiment
labels associated with each tweet.
For each tweet, calculate the number of characters in the tweet content: Iterate through each tweet
in the dataset and count the number of characters in the tweet content. This can be easily done by
using built-in string functions or libraries in programming languages like Python.
Grouping by Sentiment:
Group the tweets based on their sentiment labels (positive and negative): Split the dataset into two
groups based on the sentiment labels (positive and negative). This segregation will allow you to
analyze the character counts separately for tweets with positive and negative sentiments.
Calculate the average number of characters for tweets in each sentiment group: Compute the
average character count for tweets in the positive sentiment group and the negative sentiment
group separately. This will give you an idea of the typical tweet length for each sentiment category.
Visualize Results:
Create a bar chart or box plot to visualize the average character count for positive and negative
sentiment tweets: Visualize the average character counts using appropriate plots such as bar charts
or box plots. This graphical representation will make it easier to compare the tweet lengths between
different sentiment categories.
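This character-count comparison could look like the sketch below, reusing the tweets DataFrame from the previous sketch and assuming the sentiment labels are 'positive' and 'negative'.

```python
import matplotlib.pyplot as plt

# Characters per tweet, then the average per sentiment group
tweets["char_count"] = tweets["content"].str.len()
avg_chars = tweets.groupby("sentiment")["char_count"].mean()
print(avg_chars)

avg_chars.plot(kind="bar", title="Average tweet length by sentiment")
plt.ylabel("characters")
plt.tight_layout()
plt.show()
```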
Extract the tweet content and sentiment labels (positive or negative) from the dataset.
Tokenization:
Split the tweet content into individual word tokens, then remove any punctuation marks, special
characters, or stopwords from the tokens.
Grouping by Sentiment:
Group the tokens based on their associated sentiment labels (positive and negative).
Calculate the frequency of each word in the tokens for both positive and negative sentiment groups.
This can be done by creating a dictionary or using libraries like NLTK or scikit-learn in Python.
Identify the most frequent words for both positive and negative sentiment groups.
Sort the words based on their frequency and select the top N words to analyze.
Visualization:
Create visualizations such as word clouds, bar charts, or histograms to display the top words for each
sentiment group.
Word clouds are particularly effective for visually representing word frequency, with larger words
indicating higher frequency.
Conduct statistical tests, such as chi-square tests or t-tests, to determine if there are significant
differences in word frequencies between positive and negative sentiment groups.
This step is optional but can provide additional insights into the relationship between sentiment and
word usage.
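A sketch of the frequency count using collections.Counter and the preprocess helper defined earlier; the top-20 cutoff is an arbitrary illustration.

```python
from collections import Counter

top_words = {}
for label in ["positive", "negative"]:
    tokens = []
    for text in tweets.loc[tweets["sentiment"] == label, "content"]:
        tokens.extend(preprocess(text))
    # Most frequent words for this sentiment group
    top_words[label] = Counter(tokens).most_common(20)

print(top_words["positive"][:10])
print(top_words["negative"][:10])
```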
Aggregate the tweet time-stamps by hour to count the number of tweets posted during each hour of
the day.
Visualization:
Create a bar chart or line plot to visualize the number of tweets posted during each hour of the day.
The x-axis represents the hours of the day (e.g., 0 to 23), and the y-axis represents the count of
tweets.
Determine the hour with the highest number of tweets to identify the most active hour on Twitter.
This hour represents the time period when Twitter users are most engaged or active.
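A sketch of the hourly aggregation and plot, assuming the date column parses cleanly to datetimes:

```python
import matplotlib.pyplot as plt
import pandas as pd

tweets["date"] = pd.to_datetime(tweets["date"])
tweets_per_hour = tweets["date"].dt.hour.value_counts().sort_index()

tweets_per_hour.plot(kind="bar", title="Tweets by hour of day")
plt.xlabel("hour of day (0-23)")
plt.ylabel("number of tweets")
plt.tight_layout()
plt.show()

print("Most active hour:", tweets_per_hour.idxmax())
```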
Interpretation:
Consider factors such as user demographics, time zones, and topical events that may influence tweet
activity patterns.
Additional Analysis (Optional):
Conduct further analysis to explore trends in tweet activity over different days of the week or
months of the year.
Compare tweet activity across different user groups or hash-tags to identify patterns and trends.
Extract the tweet timestamps, number of retweets, and number of favorites from the dataset.
Ensure that the timestamp data is in a suitable format for date-time manipulation.
Yearly Aggregation:
Group the tweets by year and calculate the total number of tweets, retweets, and favorites for each
year.
Visualization:
Create separate line plots or bar charts to visualize the trends in the number of tweets, retweets,
and favorites over the years.
Plot the years on the x-axis and the corresponding counts on the y-axis.
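A yearly aggregation sketch, reusing the parsed date column from the previous sketch; the retweets and favorites column names are assumptions about the dataset's schema.

```python
import matplotlib.pyplot as plt

yearly = (
    tweets.assign(year=tweets["date"].dt.year)
          .groupby("year")
          .agg(tweets=("content", "count"),
               retweets=("retweets", "sum"),
               favorites=("favorites", "sum"))
)
print(yearly)

# One panel per metric to compare trends over the years
yearly.plot(subplots=True, figsize=(8, 6))
plt.tight_layout()
plt.show()
```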
Interpretation:
Analyze the trends in tweet, retweet, and favorite counts over the years to identify any significant
changes or patterns.
Consider factors such as the growth of Twitter users, changes in user engagement, and the impact of
major events or trends on tweet activity.
Insights:
Identify which years had the highest and lowest tweet, retweet, and favorite counts.
Look for correlations between tweet activity and external factors such as political events, cultural
phenomena, or platform changes.
Word-cloud
Text Preprocessing:
Perform text preprocessing steps such as tokenization, lowercase conversion, punctuation removal,
and stopword removal to clean the tweet content.
Optionally, you can perform lemmatization or stemming to reduce words to their base form.
Use a word cloud generation library (e.g., the wordcloud package in Python) to create the word cloud.
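A word cloud sketch for positive tweets using the wordcloud package and the preprocess helper from earlier; the sentiment label values are again assumed.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join the cleaned tokens of all positive tweets into a single string
positive_text = " ".join(
    " ".join(preprocess(text))
    for text in tweets.loc[tweets["sentiment"] == "positive", "content"]
)

wc = WordCloud(width=800, height=400, background_color="white").generate(positive_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```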
Prominent Words:
The word cloud prominently features words that are frequently used in positive tweets. These words
appear larger in size, indicating higher frequencies.
Common Themes:
Upon closer inspection, we observe several common themes and topics that dominate the positive
tweet content. These themes may include words related to:
Happiness: Words conveying joy, happiness, or satisfaction, indicating positive emotions and
contentment.
Inspiration: Words that inspire or motivate others, encouraging positive attitudes and actions.
Love and Friendship: Words related to love, friendship, and bonding, reflecting positive relationships
and connections.
Contextual Analysis:
It's important to consider the context in which these words are used within positive tweets. The
surrounding text and sentiment of the tweets can provide valuable context and deeper insights into
the positive themes and sentiments expressed.
User Engagement:
Positive tweets often garner higher levels of user engagement, including likes, retweets, and
comments. The positive sentiment conveyed through these tweets may resonate with a broader
audience, leading to increased interaction and engagement on the platform.
Positive tweets have the potential to uplift and inspire others, fostering a supportive and optimistic
online community. By analyzing the word cloud, we can identify the key elements that contribute to
the creation of positive content and its impact on Twitter users.
In conclusion, the positive tweet word cloud provides a snapshot of the prevailing themes and
sentiments expressed in positive tweets. It highlights the importance of positivity in social media
discourse and the potential for positive content to foster engagement, inspiration, and connection
among users.
Negative Tweet Word Cloud Analysis
The negative tweet word cloud offers a visual representation of the most frequently used words in
tweets associated with negative sentiment. By examining the word cloud, we can uncover prevalent
themes and topics that evoke negative reactions from Twitter users. Here's a breakdown of the key
insights derived from the negative tweet word cloud:
6. Conclusion:
Key Takeaways: Summarize the main findings and insights from the analysis, highlighting significant
trends, patterns, and implications.
Future Research: Identify potential areas for further research and analysis to deepen understanding
of Twitter usage patterns and their impact on public discourse.
7. References:
Cite relevant sources and references used in the analysis and methodology development.
Observations:
Scope of Analysis: The analysis encompasses tweets sent by Donald Trump between 2009 and 2020,
covering a significant period before and during his presidency.
Content and Sentiment Labeling: The dataset includes tweet content, date, number of retweets,
number of favorites, mentions, and pre-labeled sentiment (positive and negative) for each tweet.
Tweet Frequency: The frequency of tweets varies over time, with potentially notable spikes during
significant events, policy announcements, or political developments.
Engagement Metrics: The number of retweets and favorites provides insights into the level of
engagement and resonance of Trump's tweets with his audience.
Sentiment Analysis: Pre-labeled sentiment allows for the analysis of the overall sentiment
distribution of Trump's tweets and identification of patterns in positive and negative sentiment over
time.
Text Preprocessing: Text preprocessing steps such as tokenization, lowercasing, punctuation removal,
and stopword removal are necessary to clean the tweet content data for analysis.
Wordcloud Generation: Word clouds can be generated separately for positive and negative tweets
to visualize the most frequent words associated with each sentiment category.
Interpretation: Analyzing the content and sentiment of Trump's tweets can provide insights into his
communication strategies, public engagement, and impact on public discourse.
Summary:
The analysis of Donald Trump's tweets presents an opportunity to gain insights into his
communication style, messaging strategies, and the public's response to his tweets. By examining
the frequency, content, and sentiment of his tweets, we can identify key themes, trends, and
patterns over time. Furthermore, the generation of word clouds for positive and negative tweets
allows for a visual representation of the most frequently used words associated with each sentiment
category, providing additional context and understanding. Overall, this analysis contributes to a
deeper understanding of Trump's Twitter usage and its implications for political discourse and public
opinion.