API Data Collection

The document outlines a plan to collect health-related discussions from Twitter and Reddit, storing the data in MongoDB. It details the steps for setting up API access, fetching data using Tweepy and PRAW, and automating the data collection process. Additionally, it emphasizes the importance of testing API rate limits, filtering irrelevant data, and setting up logging for errors.

Uploaded by

manfredbaraka33

Phase 1: Collecting Data from Twitter & Reddit

✅ Goal: Gather health-related discussions from Twitter (X) and Reddit and store them in
MongoDB.

Step 1: Set Up API Access


1.1 Get API Keys

You need API access for both Twitter (X) and Reddit:

📌 Twitter API (X)

1. Sign up at the Twitter Developer Portal.
2. Create a project and get the API keys and Bearer Token.
3. Use Tweepy or httpx to interact with the API.

📌 Reddit API

1. Go to Reddit Apps.
2. Click "Create App" and select "Script".
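3. Save the client_id, client_secret, and user_agent. The username and password are only needed if the script must act as a logged-in user; read-only fetching works without them.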
4. Use PRAW (Python Reddit API Wrapper) to fetch data.
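
Once the keys from both lists above are in hand, it is usually safer to keep them out of the scripts themselves. The following is a minimal sketch, assuming the credentials are exported as environment variables; the variable names and the load_credentials helper are hypothetical and not part of Tweepy or PRAW.

```python
import os

# Hypothetical variable names; pick whatever fits your deployment.
REQUIRED_VARS = (
    "TWITTER_BEARER_TOKEN",
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
)


def load_credentials(env=None):
    """Return the required credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to tweepy.Client and praw.Reddit in place of the hard-coded placeholder strings used in the steps below.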

Step 2: Fetch Twitter Data


2.1 Install Dependencies
pip install tweepy pymongo

2.2 Fetch Tweets

Use Tweepy with Twitter's API v2. The recent search endpoint returns tweets from roughly the last seven days, so you can query it for health-related keywords.

import tweepy
import pymongo

# Twitter API credentials
BEARER_TOKEN = "your_bearer_token"

client = tweepy.Client(bearer_token=BEARER_TOKEN)

# MongoDB connection
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")
db = mongo_client["health_sentiment"]
tweets_collection = db["tweets"]

# Fetch recent English tweets, excluding retweets.
# "mental health" is quoted so that it matches as a phrase.
query = '("mental health" OR vaccine OR covid OR anxiety) lang:en -is:retweet'
tweets = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=["created_at", "text", "author_id"],
)

# Store in MongoDB (tweets.data is None when nothing matches)
for tweet in tweets.data or []:
    tweets_collection.insert_one({
        "platform": "twitter",
        "author_id": tweet.author_id,
        "text": tweet.text,
        "timestamp": tweet.created_at,
    })

print("Tweets saved successfully!")

2.3 Automate Data Collection

• Run the script every 15 minutes using a cron job or FastAPI background task.
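
Running the same search every 15 minutes will return many tweets that were already stored, so plain inserts would create duplicates. One way to handle this is sketched below, assuming each document is keyed on the tweet's id; the tweet_id field name and both helper functions are illustrative additions, not part of the original script.

```python
def tweet_to_document(tweet):
    """Build the MongoDB document for one tweet (tweet_id is an added field)."""
    return {
        "tweet_id": tweet.id,
        "platform": "twitter",
        "author_id": tweet.author_id,
        "text": tweet.text,
        "timestamp": tweet.created_at,
    }


def save_tweet(collection, tweet):
    """Insert the tweet, or overwrite the existing copy if it was seen before."""
    doc = tweet_to_document(tweet)
    collection.update_one(
        {"tweet_id": doc["tweet_id"]},  # match on the tweet's own id
        {"$set": doc},
        upsert=True,                    # create the document if none matches
    )
```

In the fetch loop, save_tweet(tweets_collection, tweet) would replace the insert_one call. A unique index on tweet_id would be a reasonable complement, though that is a design choice rather than a requirement.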

Step 3: Fetch Reddit Data


3.1 Install Dependencies
pip install praw pymongo

3.2 Fetch Reddit Posts

Use PRAW to fetch the current hot posts from relevant subreddits.

import praw
import pymongo

# Reddit API credentials (read-only access)
reddit = praw.Reddit(
    client_id="your_client_id",
    client_secret="your_client_secret",
    user_agent="health_sentiment_scraper",
)

# MongoDB connection
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")
db = mongo_client["health_sentiment"]
reddit_collection = db["reddit_posts"]

# Fetch posts from health-related subreddits
subreddits = ["health", "Coronavirus", "mentalhealth"]
for subreddit in subreddits:
    for post in reddit.subreddit(subreddit).hot(limit=50):
        reddit_collection.insert_one({
            "platform": "reddit",
            "subreddit": subreddit,
            # Deleted accounts have no author object
            "author": post.author.name if post.author else "unknown",
            "text": post.title + " " + post.selftext,
            "timestamp": post.created_utc,
        })

print("Reddit posts saved successfully!")

3.3 Automate Data Collection

• Schedule this script to run every hour for fresh data.
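
If a cron job is not available, an in-process loop is one possible alternative. This is a minimal sketch, not the method the plan prescribes: run_periodically is a hypothetical helper, and the max_runs and sleep parameters exist only to make it easy to stop and to test.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, waiting interval_seconds between runs.

    Runs forever when max_runs is None; returns the number of completed runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Skip the wait after the final run
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

For the hourly Reddit collection, the fetch code would be wrapped in a function and passed in as run_periodically(fetch_reddit_posts, 3600), where fetch_reddit_posts is a name assumed here. An exception in the job stops the loop, so a long-running deployment would likely want error handling around the call.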

Step 4: Store & Structure Data in MongoDB


We store data in two collections:

1. tweets
2. reddit_posts

Schema Example
{
    "platform": "twitter",
    "author_id": "123456",
    "text": "The new vaccine rollout is promising!",
    "timestamp": "2025-03-31T10:00:00Z"
}

{
    "platform": "reddit",
    "subreddit": "mentalhealth",
    "author": "user123",
    "text": "I've been feeling anxious about the vaccine lately...",
    "timestamp": 1711862400
}
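
The two examples store the timestamp in different forms: a date-time for tweets and a Unix epoch number for Reddit posts. If both collections are later queried together, it may help to convert the Reddit value on the way in. A minimal sketch, assuming created_utc is seconds since the epoch in UTC (to_utc_datetime is a hypothetical helper name):

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


# The Reddit timestamp from the schema example
print(to_utc_datetime(1711862400).isoformat())  # 2024-03-31T05:20:00+00:00
```

Storing to_utc_datetime(post.created_utc) instead of the raw number would give both collections a date type that MongoDB can sort and range-query consistently.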

Next Steps
🔹 Test API Rate Limits – Ensure you don’t get blocked.
🔹 Filter Irrelevant Data – Remove spammy or promotional posts.
🔹 Set Up Logging – Save API errors and failed requests.
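
The filtering and logging steps above could start from something like the sketch below. It is only an illustration under stated assumptions: the marker phrases, the looks_promotional check, and the call_with_retries helper are all hypothetical, and a real spam filter would need more than substring matching. For rate limits on the Twitter side, tweepy.Client also accepts wait_on_rate_limit=True, which makes it pause until the limit resets.

```python
import logging
import time

logger = logging.getLogger("collector")

# Hypothetical marker phrases; tune these against real data.
PROMO_MARKERS = ("buy now", "discount", "promo code", "free shipping")


def looks_promotional(text):
    """Return True if the text contains any of the promotional markers."""
    lowered = text.lower()
    return any(marker in lowered for marker in PROMO_MARKERS)


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), logging each failure and retrying with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            logger.exception("Attempt %d of %d failed", attempt, attempts)
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x ... the base delay before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Calling logging.basicConfig(filename="collector.log", level=logging.INFO) once at startup would send these error records to a file. Posts for which looks_promotional returns True can simply be skipped before the insert.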

Want help with setting up a cron job or FastAPI background tasks? 🚀
