
Sma 3

The document outlines a Python script for cleaning and processing social media data from a CSV file, focusing on Facebook and Instagram posts. It includes steps for data cleaning, feature engineering, and storing the cleaned data in a MongoDB database. Additionally, it provides a sample dataset and demonstrates how to verify data insertion and calculate average engagement rates by platform.

Uploaded by Ganesh Panigrahi

EXPERIMENT NO: 3

import pandas as pd
from pymongo import MongoClient

# Step 1: Load the raw social media data (assuming a CSV file)
df = pd.read_csv('social_media_data.csv')

# Step 2: Data Cleaning

# 2.1 Handle missing values
# For simplicity, we can drop rows with missing values (or use imputation)
df.dropna(inplace=True)
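Dropping rows is the simplest option, but it discards data; when rows are scarce, median imputation is a common alternative. A minimal sketch with made-up values (not part of the original dataset):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps, mirroring the numeric columns used above
toy = pd.DataFrame({
    "likes": [350, np.nan, 300],
    "shares": [50, 75, np.nan],
})

# Fill numeric gaps with each column's median instead of dropping rows
toy = toy.fillna(toy.median(numeric_only=True))
print(toy["likes"].tolist())  # [350.0, 325.0, 300.0]
```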

# 2.2 Remove duplicates
df.drop_duplicates(inplace=True)

# 2.3 Filter data (e.g., we may only be interested in Facebook and Instagram posts)
df = df[df['platform'].isin(['Facebook', 'Instagram'])]

# 2.4 Correct column types (e.g., ensure 'post_date' is in datetime format)
df['post_date'] = pd.to_datetime(df['post_date'])
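By default `pd.to_datetime` raises on unparseable strings; if the CSV might contain malformed dates, `errors="coerce"` turns them into `NaT` so they can be filtered out afterwards. A sketch with invented values:

```python
import pandas as pd

dates = pd.Series(["2024-01-01", "not a date"])
parsed = pd.to_datetime(dates, errors="coerce")  # malformed rows become NaT
print(parsed.isna().tolist())  # [False, True]
```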

# 2.5 Remove posts with no engagement at all (zero likes, shares, and comments)
df = df[(df['likes'] > 0) | (df['shares'] > 0) | (df['comments'] > 0)]

# Step 3: Feature Engineering (Optional)

# 3.1 Calculate engagement rate: (likes + shares + comments) / followers
df['engagement_rate'] = (df['likes'] + df['shares'] + df['comments']) / df['followers']
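As a sanity check, with the first row of the sample data generated below (350 likes, 50 shares, 120 comments, 15000 followers), the rate works out to (350 + 50 + 120) / 15000 ≈ 0.0347, i.e. roughly 3.5% engagement:

```python
likes, shares, comments, followers = 350, 50, 120, 15000
engagement_rate = (likes + shares + comments) / followers
print(round(engagement_rate, 4))  # 0.0347
```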

# 3.2 Extract relevant date parts (optional)
df['year'] = df['post_date'].dt.year
df['month'] = df['post_date'].dt.month
df['day'] = df['post_date'].dt.day
df['weekday'] = df['post_date'].dt.weekday
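Note that `.dt.weekday` is zero-based: Monday = 0 through Sunday = 6. For example, 2024-01-01 (the first date in the sample data) was a Monday:

```python
import pandas as pd

d = pd.to_datetime(pd.Series(["2024-01-01", "2024-01-07"]))
print(d.dt.weekday.tolist())  # [0, 6]  (Monday, Sunday)
```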

# Step 4: Connect to MongoDB and store the cleaned data

# Create a connection to MongoDB (local or cloud)
client = MongoClient('mongodb://localhost:27017/') # MongoDB connection string
db = client['social_media_data_db']  # Database (created lazily on first insert)
collection = db['posts']  # Collection (created lazily on first insert)

# Convert the cleaned dataframe to a dictionary format
records = df.to_dict(orient='records')
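`orient='records'` produces one dict per row, which is exactly the shape `insert_many` expects. A quick illustration with a hypothetical one-row frame:

```python
import pandas as pd

demo = pd.DataFrame({"platform": ["Facebook"], "likes": [350]})
print(demo.to_dict(orient="records"))  # [{'platform': 'Facebook', 'likes': 350}]
```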

# Insert the cleaned data into MongoDB
collection.insert_many(records)

# Step 5: Verify data insertion
print(f"Inserted {len(records)} records into MongoDB.")

# You can verify by querying MongoDB directly or using Python:
result = collection.find().limit(5)  # Display the first 5 records
for record in result:
    print(record)

# Print the first 5 Facebook posts
for post in collection.find({"platform": "Facebook"}).limit(5):
    print(post)

# Print the average engagement rate by platform
avg_engagement = collection.aggregate([
    {"$group": {"_id": "$platform",
                "avg_engagement_rate": {"$avg": "$engagement_rate"}}}
])
for record in avg_engagement:
    print(record)
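The same per-platform average can be cross-checked in pandas without a running MongoDB instance; a sketch with invented engagement rates:

```python
import pandas as pd

rates = pd.DataFrame({
    "platform": ["Facebook", "Instagram", "Facebook"],
    "engagement_rate": [0.030, 0.060, 0.050],
})
# Equivalent of the $group/$avg pipeline above
avg = rates.groupby("platform")["engagement_rate"].mean()
print(avg.to_dict())
```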

dbcreate.py
import pandas as pd

# Define the sample data as a dictionary
data = {
"post_date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05",
"2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09", "2024-01-10"],
"platform": ["Facebook", "Instagram", "Twitter", "Instagram", "Facebook",
"Twitter", "Instagram", "Facebook", "Twitter", "Instagram"],
"likes": [350, 500, 300, 450, 400, 280, 550, 600, 320, 470],
"shares": [50, 75, 40, 80, 60, 50, 90, 100, 45, 85],
"comments": [120, 200, 100, 150, 110, 90, 250, 130, 110, 180],
"followers": [15000, 12000, 18000, 13000, 16000, 17000, 14000, 15000, 16000, 12500],
"hashtags": ["#business #growth", "#marketing #innovation", "#growth #success",
"#socialmedia #strategy", "#branding #entrepreneur",
"#leadership #business", "#digitalmarketing #startup",
"#innovation #content", "#productivity #success", "#inspiration #growth"]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv("social_media_data.csv", index=False)

print("Sample social media data saved to 'social_media_data.csv'")
