
Assignment 2: Data Scraping, SEM Replication, and Analysis

Objective:
This assignment assesses your practical skills in data acquisition, data preprocessing,
and the application of SEM using a programming language (preferably Python). It tests
your ability to replicate published research and critically analyze the results.

Task:
Part 0: Paper Selection:
1. Choose one of the five papers you reviewed in Assignment 1. Clearly state which
paper you have chosen and explain your reasoning. Factors to consider might
include:
a. The feasibility of replicating the analysis with the data you can scrape.
b. The clarity and completeness of the methodology described in the paper.
c. Your personal interest in the research topic.
d. The availability of code or detailed model specifications from the original
authors (this is a bonus, but not required).

Part 1: Data Scraping


1. Subreddit Selection: Choose a single, active subreddit that is relevant to your
selected paper. The subreddit should have a reasonable volume of posts and
comments over the past year. Clearly state the subreddit you have chosen and
provide a brief justification for your selection (e.g., relevance to a specific research
area, high activity level). Avoid subreddits with overly sensitive or potentially
harmful content.
2. Data Scraping: Write a functional Python script to scrape data from the chosen
subreddit. You may use the Reddit API or a non-API approach (e.g., using requests
and BeautifulSoup). Your script should (a minimal sketch follows this list):
• Collect data for a period of one year. Specify the exact date range you are
collecting data for.
• Extract, at minimum, the following information for each post and comment:
o Post/Comment ID
o Post Title (for posts)
o Post/Comment Body (text)
o Author (username)
o Timestamp (date and time)
o Upvotes/Downvotes (or score)
o Number of comments (for posts)
o Parent ID (for comments, to reconstruct conversation threads)
• Handle potential issues such as:
o Rate limiting (from the Reddit API or website).
o Bot detection (if using a non-API approach).
o Missing or incomplete data.
o Changes to the website structure (if using a non-API approach).
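
A minimal sketch of an API-based scraper using PRAW is shown below. The subreddit
name, credentials, date range, and output filename are placeholders, and the field
names simply mirror the list above. Note that Reddit's listing endpoints return only
roughly the 1,000 most recent posts, so covering a full year may require repeated
runs or search-based queries; PRAW's built-in rate-limit handling is assumed.

    import praw
    import pandas as pd
    from datetime import datetime, timezone

    # Credentials come from https://www.reddit.com/prefs/apps (placeholders here).
    reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                         client_secret="YOUR_CLIENT_SECRET",
                         user_agent="assignment2-scraper by u/YOUR_USERNAME")

    # Example date range: calendar year 2024 (adjust to your chosen window).
    START = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
    END = datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp()

    rows = []
    for post in reddit.subreddit("YOUR_SUBREDDIT").new(limit=None):
        if not (START <= post.created_utc < END):
            continue
        rows.append({"id": post.id, "type": "post", "title": post.title,
                     "body": post.selftext, "author": str(post.author),
                     "timestamp": post.created_utc, "score": post.score,
                     "num_comments": post.num_comments, "parent_id": None})
        post.comments.replace_more(limit=0)  # drop "load more comments" stubs
        for c in post.comments.list():
            rows.append({"id": c.id, "type": "comment", "title": None,
                         "body": c.body, "author": str(c.author),
                         "timestamp": c.created_utc, "score": c.score,
                         "num_comments": None, "parent_id": c.parent_id})

    pd.DataFrame(rows).to_csv("reddit_data.csv", index=False)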
3. Scraping Workflow Documentation (PPT): Create a single-slide PowerPoint
presentation that clearly outlines your scraping workflow. This should include:
• A diagram or flowchart illustrating the steps of your scraping process.
• A brief description of the libraries/tools you used.
• An explanation of how you handled rate limiting and/or bot detection.
• A description of any data cleaning or preprocessing steps performed during the
scraping process (e.g., handling HTML entities, removing deleted comments).
• Any limitations or challenges encountered during scraping.

Part 2: SEM Replication


1. Data Preprocessing: Prepare the scraped data for SEM analysis (a sketch of one
such step follows this list). This will likely involve:
• Text preprocessing (e.g., tokenization, stemming/lemmatization, stop word
removal, handling of special characters and URLs).
• Feature engineering (e.g., creating variables based on text analysis, such
as sentiment scores, topic proportions, or measures of linguistic
complexity).
• Creating any necessary dummy variables or interaction terms.
• Handling missing data (e.g., through imputation or deletion).
• Scaling or transforming variables as needed.
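
As one illustration, the sketch below removes deleted comments, cleans the text, and
derives a sentiment-score feature using NLTK's VADER analyzer. The column names
(carried over from the scraping sketch) and the choice of VADER are assumptions;
your chosen paper may call for different features, such as topic proportions or
linguistic-complexity measures.

    import re
    import pandas as pd
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time lexicon download

    df = pd.read_csv("reddit_data.csv")
    df = df[~df["body"].isin(["[deleted]", "[removed]"])].dropna(subset=["body"])

    def clean(text: str) -> str:
        text = re.sub(r"http\S+", "", text)    # strip URLs
        text = re.sub(r"&\w+;", " ", text)     # strip leftover HTML entities
        return re.sub(r"\s+", " ", text).strip().lower()

    df["clean_body"] = df["body"].astype(str).map(clean)
    sia = SentimentIntensityAnalyzer()
    df["sentiment"] = df["clean_body"].map(lambda t: sia.polarity_scores(t)["compound"])
    df.to_csv("reddit_preprocessed.csv", index=False)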
2. SEM Implementation: Implement the SEM model from the chosen paper using
Python. The semopy package is recommended (a minimal sketch follows this list),
but you may use other suitable libraries (e.g., lavaan in R via rpy2, or even manual
matrix calculations if you are comfortable with that). Your code should:
• Clearly define the latent variables, observed variables, and their
relationships.
• Specify the estimation method (matching the original paper if possible).
• Calculate and report appropriate model fit indices.
• Estimate the path coefficients and their significance.
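
A minimal semopy sketch is shown below. The measurement and structural equations
are placeholders; replace them with the latent variables and paths specified in your
chosen paper. The obj argument selects the estimation objective (here "MLW",
Wishart maximum likelihood), assuming semopy's standard interface.

    import pandas as pd
    import semopy

    df = pd.read_csv("reddit_preprocessed.csv")

    # Placeholder specification in lavaan-style syntax:
    # "=~" defines a latent variable's indicators, "~" defines a regression path.
    desc = """
    Engagement =~ score + num_comments
    Engagement ~ sentiment
    """

    model = semopy.Model(desc)
    model.fit(df, obj="MLW")
    print(model.inspect())              # path coefficients, std. errors, p-values
    print(semopy.calc_stats(model).T)   # fit indices: chi-square, CFI, TLI, RMSEA, ...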
3. Result Replication: Attempt to replicate the key results of the original study as
closely as possible. This may not be perfectly achievable due to differences in
data, sample size, or specific implementation details, but you should strive for the
closest possible replication.

Part 3: Machine Learning Model Development and Comparison


1. Problem Framing for ML: Based on the research question and variables from
the paper you chose to replicate, define a specific prediction task that is suitable
for an ML model. This is crucial: SEM and ML are often used for different purposes,
although both can be used to predict the same outcome:
• SEM: Primarily focuses on testing hypothesized relationships between
latent and observed variables (explanatory modeling).
• ML: Primarily focuses on prediction (predictive modeling).
• Therefore, you need to translate the research question into a concrete
prediction problem. Examples:
o If the SEM explores factors influencing user engagement on
Reddit: Your ML task could be to predict the number of upvotes a post will
receive based on its text content, author features, and time of posting.
o If the SEM examines the relationship between sentiment and stock
market movements (using Twitter data): Your ML task could be to
predict the direction of stock price movement (up/down) based on
aggregated sentiment scores from tweets.
o If the SEM investigates the impact of online communities on
political polarization: Your ML task could be to classify users into
different ideological groups based on their posting behavior.
• Clearly state the prediction task you've chosen and justify why it's a
relevant and meaningful comparison point to the SEM analysis. Explain
what your target variable (the thing you're trying to predict) and your
predictor variables (the features you'll use to make the prediction) will be.
A sketch of one possible framing follows this list.
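
Continuing the first example above, the sketch below frames upvote prediction as a
supervised regression problem. The feature set is an assumption chosen for
illustration; yours should mirror the constructs in your SEM.

    import pandas as pd

    df = pd.read_csv("reddit_preprocessed.csv")
    posts = df[df["type"] == "post"].copy()

    # Simple illustrative features: sentiment, text length, and hour of posting.
    posts["hour"] = pd.to_datetime(posts["timestamp"], unit="s").dt.hour
    posts["text_length"] = posts["clean_body"].str.len()

    y = posts["score"]                               # target variable
    X = posts[["sentiment", "text_length", "hour"]]  # predictor variables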
2. Model Selection and Justification: Choose at least two different ML models
that are appropriate for your prediction task. Consider models like:
• Regression models: Linear Regression, Ridge/Lasso Regression, Support
Vector Regression (for predicting continuous target variables).
• Classification models: Logistic Regression, Support Vector Machines,
Random Forests, Gradient Boosting Machines (e.g., XGBoost, LightGBM),
Naive Bayes (for predicting categorical target variables).
• Deep learning models: If appropriate for your data and task, you could
consider Transformers, Recurrent Neural Networks (RNNs, especially
LSTMs or GRUs) for text data, or feedforward neural networks for other
types of data.
• Justify your model choices. Explain why each model is suitable for the task
and the type of data you have.
3. Model Training and Evaluation (a minimal sketch follows this list):
• Data splitting and pre-processing: Explain your strategy for dividing the
data into training, validation, and test sets.
• Train each of your chosen ML models on the training data.
• Use the validation set to tune hyperparameters (e.g., regularization
strength, number of trees in a random forest, learning rate) using
appropriate techniques like cross-validation.
• Evaluate the performance of each model on the test set using appropriate
metrics for your prediction task. Examples:
o Regression: Mean Squared Error (MSE), Root Mean Squared Error
(RMSE), R-squared, Mean Absolute Error (MAE).
o Classification: Accuracy, Precision, Recall, F1-score, AUC-ROC
(Area Under the Receiver Operating Characteristic Curve).
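
A minimal training-and-evaluation sketch with scikit-learn is shown below, reusing
the X and y from the framing sketch. The two models, the hyperparameter grids, and
the metrics are illustrative, not prescribed.

    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.linear_model import Ridge
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    # Hold out a test set; GridSearchCV supplies validation via 5-fold CV.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    models = {
        "ridge": GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5),
        "random_forest": GridSearchCV(
            RandomForestRegressor(random_state=42),
            {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=5),
    }

    for name, gs in models.items():
        gs.fit(X_train, y_train)
        preds = gs.best_estimator_.predict(X_test)
        rmse = mean_squared_error(y_test, preds) ** 0.5
        print(name, gs.best_params_, "RMSE:", rmse, "R2:", r2_score(y_test, preds))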
4. Comparison with SEM: This is the critical part. Compare the results of your ML
models with the SEM results. Do not expect the ML models to "replicate" the SEM
results. Instead, focus on the following (a comparison sketch follows this list):
• Predictive Power: How well do the ML models predict the target variable
compared to the implied predictions from the SEM? You might need to
derive a way to make predictions from the SEM.
• Feature Importance: For ML models that provide feature importance
scores, examine which features are most important for prediction. How do
these features relate to the variables and relationships in the SEM? Do
they provide any insights that the SEM might have missed?
• Complementary Insights: Discuss how the ML and SEM approaches
provide complementary perspectives on the research problem. SEM helps
understand the underlying mechanisms, while ML focuses on predictive
accuracy. Highlight the strengths and weaknesses of each approach.
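
For the feature-importance part of the comparison, the sketch below pulls
importances from the fitted random forest and lists them next to the SEM path
estimates. It assumes the variable names match across the two analyses and that
model is the fitted semopy object from Part 2.

    import pandas as pd

    rf = models["random_forest"].best_estimator_
    importances = pd.Series(rf.feature_importances_, index=X.columns,
                            name="rf_importance").sort_values(ascending=False)

    paths = model.inspect()                     # semopy estimates from Part 2
    paths = paths[paths["op"] == "~"][["lval", "rval", "Estimate", "p-value"]]

    print(importances)
    print(paths)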

Part 4: Reporting and Analysis (Modified to include ML)


1. Modeling Pipeline Documentation: (Additions in bold)
a. Data Preprocessing: (No changes)
b. Model Assumptions: A clear statement of the assumptions underlying the
SEM model and the ML models you implemented, and a discussion of
whether those assumptions are likely to be met by your data.
c. Model Parameter Fine-Tuning: Describe any parameter tuning or model
adjustments you made for both the SEM and the ML models. Justify
any changes you made.
d. Post-Model Analysis: Describe any analyses you performed after fitting the
SEM and training the ML models.
2. Results and Discussion Report (LaTeX/Overleaf): (Additions in bold)
a. Introduction: (No changes)
b. Methods: Summarize your data scraping, preprocessing, SEM
implementation, and ML model development and evaluation.
c. Results: Present your key findings, including model fit indices, path
coefficients, and ML model performance metrics. Use tables and
figures.
d. Discussion: Compare your results to those of the original study. Compare
the SEM and ML results, focusing on predictive power, feature
importance, and complementary insights. Discuss the limitations of
your replication and comparison, and any potential areas for future
research.
e. Conclusion: Summarize your main conclusions and their implications.

Deliverables:
• Datafile: The scraped and preprocessed data in CSV format.
• Source Code: Your Python script(s) for data scraping, SEM analysis, and ML model
training and evaluation, well-commented and organized.
• Scraping Workflow PPT: The single-slide PowerPoint presentation describing your
scraping workflow.
• Modeling Pipeline Documentation: A detailed, written description of your modeling
pipeline (as a separate document, e.g., a Markdown or text file).
• Results and Discussion Report: A 2-3 page report in LaTeX format (PDF).

Evaluation Criteria:
• Data Scraping: Completeness, accuracy, and efficiency of the data scraping process.
Effective handling of potential issues.
• Data Preprocessing: Appropriateness and thoroughness of preprocessing steps. Clear
justification for choices made.
• SEM Implementation: Correct implementation of the SEM model, including model
specification, estimation, and fit assessment.
• ML Model Development: Appropriate choice of ML models, proper training and
evaluation procedures, and clear justification of choices.
• SEM and ML Comparison: Thoughtful and insightful comparison of the two
approaches, focusing on relevant aspects like predictive power and feature
importance.
• Result Replication: Degree of success in replicating the key findings of the original
study.
• Analysis and Interpretation: Thoughtful and insightful comparison of results,
discussion of limitations, and identification of potential areas for future research.
• Documentation and Reporting: Clear, concise, and well-organized documentation of
all steps, including code comments, workflow descriptions, and the final report.
• Code Quality: Readability, efficiency, and adherence to good coding practices.
• LaTeX Report Quality: Proper use of LaTeX syntax; a well-formatted, structured
report with a professional look.

Academic Integrity and Use of AI Tools


• Original Writing Requirement: All written content in your PowerPoint presentation,
modeling pipeline documentation, and LaTeX report must be your own original work.
The use of Large Language Models (LLMs) or other Generative AI tools (e.g., ChatGPT,
Bard) to generate text for your explanations, justifications, discussions, or
conclusions is strictly prohibited. Any use of such tools for generating written content
will be considered a violation of academic integrity and will result in the assignment
being rejected.
• Code Assistance (Permitted with Disclosure): The use of AI coding assistants
(e.g., GitHub Copilot) for code-related tasks (such as syntax suggestions, debugging,
or generating boilerplate code) is permitted, provided that you clearly acknowledge
their use. If you use an AI coding assistant, include a brief statement in your code
comments indicating which parts of the code were assisted by the tool. The core logic
and structure of your code (including both the scraping and modeling components)
must still be your own. You must be able to fully explain any code you submit,
including the rationale behind your design choices and implementation details.
• Plagiarism: All sources (including code snippets from online resources) must be
properly cited. Any instance of plagiarism (presenting someone else's work as your
own) will result in the assignment being rejected.
