API Data Collection
✅ Goal: Gather health-related discussions from Twitter (X) and Reddit and store them in
MongoDB.
You need API access for both Twitter (X) and Reddit:
📌 Reddit API
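1. Go to the Reddit Apps page (https://www.reddit.com/prefs/apps).
2. Click "Create App" and select the "script" app type.
3. Save the client_id and client_secret. Together with a user_agent string of your choosing and your Reddit username and password, these are the credentials PRAW needs.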
4. Use PRAW (Python Reddit API Wrapper) to fetch data.
📌 Twitter (X) API
Use Tweepy with Twitter's API v2 to search for recent tweets containing health-related keywords.
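import os
import tweepy
import pymongo

# Bearer token read from the environment (the variable name is an assumption)
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# MongoDB Connection
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")
db = mongo_client["health_sentiment"]
tweets_collection = db["tweets"]

# Search recent tweets (example query; replace with your own health keywords)
tweets = client.search_recent_tweets(
    query="(health OR vaccine) lang:en -is:retweet",
    tweet_fields=["author_id", "created_at"],  # not returned by default
    max_results=100,
)

# Store in MongoDB (tweets.data is None when nothing matches)
for tweet in tweets.data or []:
    tweets_collection.insert_one({
        "platform": "twitter",
        "author_id": tweet.author_id,
        "text": tweet.text,
        "timestamp": tweet.created_at
    })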
Run the script every 15 minutes using a cron job or a FastAPI background task.
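As an alternative to cron or FastAPI, the 15-minute schedule can also be approximated with a plain in-process loop. The sketch below is only an illustration: `run_every` and its parameters are hypothetical names, not part of any library, and the `sleep` argument exists only so the pause can be replaced in tests.

```python
import time


def run_every(job, interval_seconds=15 * 60, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, pausing interval_seconds between runs.

    max_runs=None loops forever; the number of completed runs is returned
    when a limit is given.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:
            # Keep the loop alive if a single collection run fails
            print(f"collection run failed: {exc}")
        runs += 1
        # Skip the pause after the final run
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

With cron, the equivalent schedule expression is `*/15 * * * *`, followed by the command that starts the collection script.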
import praw

# MongoDB collection (db is the handle created in the Twitter script above)
reddit_collection = db["reddit_posts"]
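The Reddit fragment above stops after selecting the collection. One way the fetch-and-store step might look with PRAW is sketched below. The environment variable names, the `mentalhealth` subreddit, the limit of 100, and the helper names `submission_to_doc`, `collect_posts`, and `main` are all assumptions made for illustration. Storing the title together with the body as `text` is also a choice of this sketch, not something the original specifies.

```python
import os


def submission_to_doc(submission):
    """Map a PRAW submission to the reddit_posts document shape."""
    return {
        "platform": "reddit",
        "subreddit": submission.subreddit.display_name,
        # author is None when the account has been deleted
        "author": submission.author.name if submission.author else None,
        # Title plus body; link posts have an empty selftext
        "text": f"{submission.title}\n{submission.selftext}".strip(),
        "timestamp": submission.created_utc,  # Unix epoch seconds
    }


def collect_posts(reddit, collection, subreddit_name="mentalhealth", limit=100):
    """Insert the newest posts of one subreddit; return how many were stored."""
    stored = 0
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        collection.insert_one(submission_to_doc(submission))
        stored += 1
    return stored


def main():
    # Imported here so the helpers above can be used without these packages
    import praw
    import pymongo

    # Credentials from the Reddit app setup (variable names are assumptions)
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ["REDDIT_USER_AGENT"],
        username=os.environ["REDDIT_USERNAME"],
        password=os.environ["REDDIT_PASSWORD"],
    )
    db = pymongo.MongoClient("mongodb://localhost:27017/")["health_sentiment"]
    print(collect_posts(reddit, db["reddit_posts"]), "posts stored")
```

Calling `main()` runs one collection pass. Keeping the mapping in its own function makes it possible to check the document shape without network access.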
The health_sentiment database uses two collections:
1. tweets
2. reddit_posts
Schema Example
{
    "platform": "twitter",
    "author_id": "123456",
    "text": "The new vaccine rollout is promising!",
    "timestamp": "2025-03-31T10:00:00Z"
}
{
    "platform": "reddit",
    "subreddit": "mentalhealth",
    "author": "user123",
    "text": "I've been feeling anxious about the vaccine lately...",
    "timestamp": 1711862400
}
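The two example documents store `timestamp` differently: an ISO 8601 string for Twitter and Unix epoch seconds for Reddit. If a single representation is preferred for later queries, a conversion helper along these lines could be applied before inserting. This is a sketch only: `to_utc_datetime` is a hypothetical name, and treating naive datetimes as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_utc_datetime(value):
    """Convert an ISO 8601 string, epoch seconds, or datetime to aware UTC."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive datetimes are already in UTC
            return value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc)
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    # fromisoformat() rejects a trailing "Z" before Python 3.11
    return to_utc_datetime(datetime.fromisoformat(value.replace("Z", "+00:00")))
```

PyMongo stores `datetime` values as BSON dates, so both collections could then be sorted and range-filtered on `timestamp` in the same way.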
Next Steps
🔹 Test API Rate Limits – Ensure you don’t get blocked.
🔹 Filter Irrelevant Data – Remove spammy or promotional posts.
🔹 Set Up Logging – Save API errors and failed requests.
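The steps above could start from something like the following sketch. It pairs a simple keyword filter for promotional posts with a retry wrapper that logs every failed request and backs off exponentially. The keyword list, the logger name, and the helpers `is_promotional` and `with_retries` are hypothetical; a real filter would need tuning against collected data.

```python
import logging
import re
import time

logger = logging.getLogger("collector")  # logger name is an assumption

# Example promotional phrases only; extend after inspecting real posts
SPAM_PATTERN = re.compile(
    r"\b(buy now|discount|promo code|free shipping)\b", re.IGNORECASE
)


def is_promotional(text):
    """Return True when the text contains one of the example spam phrases."""
    return bool(SPAM_PATTERN.search(text))


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure, log the error and retry with doubling delays."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            # logger.exception records the traceback of the failed request
            logger.exception("request failed (attempt %d of %d)", attempt, attempts)
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

To write these records to a file, call `logging.basicConfig(filename="collector.log", level=logging.INFO)` once at startup (the file name is an example). For the Twitter side, `tweepy.Client` also accepts `wait_on_rate_limit=True`, which makes the client sleep until the rate limit resets instead of raising an error.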