
Set-A

The document contains a unit test for a course on Essentials of Data and Text Processing, featuring Python code that creates a DataFrame of movie data, performs one-hot encoding, and calculates statistical measures. It includes visualizations such as histograms and boxplots for ratings and duration, as well as a scatter plot of ratings versus votes. The test demonstrates data manipulation and analysis using the pandas library.

Uploaded by

Dhruvin Patel

Name: Jivani Dhairya Pravinbhai

Enrollment no: 202203100110120


Class: TYBCA
Div: B
Date: 12/02/2025
Subject: Essentials of Data and Text Processing (CS5006)
Unit Test-1(SET:A)

Q-1)

import pandas as pd

# Create dummy movie data
movie_data = [
["The Shawshank Redemption", "Drama", 9.3, 142, 2500000],
["The Godfather", "Crime, Drama", 9.2, 175, 1800000],
["The Dark Knight", "Action, Crime, Drama", 9.0, 152, 2400000],
["Inception", "Action, Adventure, Sci-Fi", 8.8, 148, 2100000],
["Pulp Fiction", "Crime, Drama", 8.9, 154, 1900000],
["Fight Club", "Drama", 8.8, 139, 1950000],
["Forrest Gump", "Drama, Romance", 8.8, 142, 1850000],
["Matrix", "Action, Sci-Fi", 8.7, 136, 1750000],
["Goodfellas", "Biography, Crime, Drama", 8.7, 146, 1650000],
["The Silence of the Lambs", "Crime, Drama, Thriller", 8.6, 118, 1350000],
["Interstellar", "Adventure, Drama, Sci-Fi", 8.6, 169, 1600000],
["Saving Private Ryan", "Drama, War", 8.6, 169, 1400000],
["The Green Mile", "Crime, Drama, Fantasy", 8.6, 189, 1200000],
["Gladiator", "Action, Adventure, Drama", 8.5, 155, 1300000],
["The Departed", "Crime, Drama, Thriller", 8.5, 151, 1250000],
["The Prestige", "Drama, Mystery, Sci-Fi", 8.5, 130, 1150000],
["The Lion King", "Animation, Adventure, Drama", 8.5, 88, 950000],
["Whiplash", "Drama, Music", 8.5, 106, 850000],
["The Usual Suspects", "Crime, Mystery, Thriller", 8.5, 106, 1050000],
["Eternal Sunshine of the Spotless Mind", "Drama, Romance, Sci-Fi", 8.3, 108, 950000]
]

# Create DataFrame
columns = ['Title', 'Genre', 'Rating', 'Duration', 'Votes']
df = pd.DataFrame(movie_data, columns=columns)

# Save to Excel (writing .xlsx needs an Excel writer engine such as openpyxl installed)
df.to_excel('movie_data.xlsx', index=False)

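# Display the first few rows
print("First few rows of the dataset:")
print(df.head())

# Display column types and non-null counts
# (df.info() prints its report directly and returns None, so it is not wrapped in print)
print("\nDataset Info:")
df.info()

# Display basic statistics
print("\nBasic Statistics:")
print(df.describe())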

print("\nData successfully saved to 'movie_data.xlsx'")

Q-2)

import pandas as pd

movie_data = [
["The Shawshank Redemption", "Drama", 9.3, 142, 2500000],
["The Godfather", "Crime, Drama", 9.2, 175, 1800000],
["The Dark Knight", "Action, Crime, Drama", 9.0, 152, 2400000],
["Inception", "Action, Adventure, Sci-Fi", 8.8, 148, 2100000],
["Pulp Fiction", "Crime, Drama", 8.9, 154, 1900000],
["Fight Club", "Drama", 8.8, 139, 1950000],
["Forrest Gump", "Drama, Romance", 8.8, 142, 1850000],
["Matrix", "Action, Sci-Fi", 8.7, 136, 1750000],
["Goodfellas", "Biography, Crime, Drama", 8.7, 146, 1650000],
["The Silence of the Lambs", "Crime, Drama, Thriller", 8.6, 118, 1350000],
["Interstellar", "Adventure, Drama, Sci-Fi", 8.6, 169, 1600000],
["Saving Private Ryan", "Drama, War", 8.6, 169, 1400000],
["The Green Mile", "Crime, Drama, Fantasy", 8.6, 189, 1200000],
["Gladiator", "Action, Adventure, Drama", 8.5, 155, 1300000],
["The Departed", "Crime, Drama, Thriller", 8.5, 151, 1250000],
["The Prestige", "Drama, Mystery, Sci-Fi", 8.5, 130, 1150000],
["The Lion King", "Animation, Adventure, Drama", 8.5, 88, 950000],
["Whiplash", "Drama, Music", 8.5, 106, 850000],
["The Usual Suspects", "Crime, Mystery, Thriller", 8.5, 106, 1050000],
["Eternal Sunshine of the Spotless Mind", "Drama, Romance, Sci-Fi", 8.3, 108, 950000]
]

# Create DataFrame
columns = ['Title', 'Genre', 'Rating', 'Duration', 'Votes']
df = pd.DataFrame(movie_data, columns=columns)

# One-hot encoding for genres (one 0/1 column per genre found in the comma-separated strings)
genre_dummies = df['Genre'].str.get_dummies(sep=', ')

# One-hot encoding for ratings (we'll bin the ratings into categories)
# Note: every rating in this dataset lies between 8.3 and 9.3, so all rows fall into the '7-10' bin
rating_bins = pd.cut(df['Rating'], bins=[0, 3, 5, 7, 10], labels=['0-3', '3-5', '5-7', '7-10'])
rating_dummies = pd.get_dummies(rating_bins, prefix='Rating')

# Merge the one-hot encoded data with the original dataframe
df_encoded = pd.concat([df, genre_dummies, rating_dummies], axis=1)

print("\nData with One-Hot Encoding:")
print(df_encoded.head())
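
The encoding step in Q-2 can be checked on a smaller scale. The sketch below is only an illustration: the three-row `sample` frame is a hypothetical stand-in for the full movie table, and the names `genre_flags` and `rating_flags` are not part of the original answer. It shows how `str.get_dummies` splits the comma-separated genres, and how `pd.cut` with the same bin edges places every high rating in the single '7-10' category, leaving the other rating columns at zero.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the full movie table.
sample = pd.DataFrame({
    "Genre": ["Drama", "Crime, Drama", "Action, Sci-Fi"],
    "Rating": [9.3, 9.2, 8.7],
})

# Split each genre string on ", " and build one indicator column per distinct genre.
genre_flags = sample["Genre"].str.get_dummies(sep=", ")

# Same bin edges and labels as the Q-2 answer.
rating_bins = pd.cut(sample["Rating"], bins=[0, 3, 5, 7, 10],
                     labels=["0-3", "3-5", "5-7", "7-10"])
rating_flags = pd.get_dummies(rating_bins, prefix="Rating")

print(genre_flags)
# Number of rows that landed in each rating bin.
print(rating_flags.sum())
```

If the rating columns are meant to separate the movies, narrower bin edges inside the 8 to 10 range would be one option to consider; the edges used here are kept as in the original answer.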
Q-3)

rating_mean = df['Rating'].mean()
rating_median = df['Rating'].median()
rating_mode = df['Rating'].mode()[0]

duration_mean = df['Duration'].mean()
duration_median = df['Duration'].median()
# mode() returns every tied value in ascending order; [0] keeps only the smallest one
duration_mode = df['Duration'].mode()[0]

print("\nMeasures of Central Tendency:")
print(f"Rating - Mean: {rating_mean}, Median: {rating_median}, Mode: {rating_mode}")
print(f"Duration - Mean: {duration_mean}, Median: {duration_median}, Mode: {duration_mode}")

# Measures of variation (pandas std() and var() default to the sample versions, ddof=1)
rating_range = df['Rating'].max() - df['Rating'].min()
rating_std = df['Rating'].std()
rating_variance = df['Rating'].var()

duration_range = df['Duration'].max() - df['Duration'].min()
duration_std = df['Duration'].std()
duration_variance = df['Duration'].var()

print("\nMeasures of Variation:")
print(f"Rating - Range: {rating_range}, Std: {rating_std}, Variance: {rating_variance}")
print(f"Duration - Range: {duration_range}, Std: {duration_std}, Variance: {duration_variance}")

# Skewness of the ratings: positive means a longer right tail, negative a longer left tail
rating_skewness = df['Rating'].skew()

print("\nSkewness of Ratings Distribution:")
print(f"Skewness: {rating_skewness}")
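
One detail of the Q-3 answer is worth a closer look: `mode()[0]` reports a single value even when several values are tied. The following sketch, which uses the durations copied from the `movie_data` list and an illustrative variable name `modes`, lists all tied modes so the tie is visible.

```python
import pandas as pd

# Durations copied from the movie_data list in the answer above.
durations = pd.Series([142, 175, 152, 148, 154, 139, 142, 136, 146, 118,
                       169, 169, 189, 155, 151, 130, 88, 106, 106, 108])

# Series.mode() returns every value tied for the highest count, sorted ascending.
modes = durations.mode()

print(modes.tolist())   # -> [106, 142, 169]
print(modes[0])         # -> 106, the value that mode()[0] reports
```

Three durations each occur twice, so reporting only the first one understates the tie; printing the full list may be the clearer choice.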
Q-4)

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.hist(df['Rating'], bins=10, color='skyblue', edgecolor='black')
plt.title('Distribution of Movie Ratings')
plt.xlabel('Ratings')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.boxplot(df['Rating'])
plt.title('Boxplot of Ratings')

plt.subplot(1, 2, 2)
plt.boxplot(df['Duration'])
plt.title('Boxplot of Duration')

plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
plt.scatter(df['Votes'], df['Rating'], color='blue', alpha=0.7)
plt.title('Scatter Plot of Ratings vs. Votes')
plt.xlabel('Votes')
plt.ylabel('Ratings')
plt.grid(True)
plt.show()
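
The scatter plot shows the relationship between votes and ratings only visually. As an optional complement that is not part of the original answer, the relationship can also be summarised with a Pearson correlation coefficient. The sketch below assumes the same rating and vote values as the `movie_data` list; the names `ratings`, `votes`, and `pearson_r` are illustrative.

```python
import pandas as pd

# Ratings and vote counts copied from the movie_data list, in the same row order.
ratings = pd.Series([9.3, 9.2, 9.0, 8.8, 8.9, 8.8, 8.8, 8.7, 8.7, 8.6,
                     8.6, 8.6, 8.6, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.3])
votes = pd.Series([2500000, 1800000, 2400000, 2100000, 1900000,
                   1950000, 1850000, 1750000, 1650000, 1350000,
                   1600000, 1400000, 1200000, 1300000, 1250000,
                   1150000, 950000, 850000, 1050000, 950000])

# Series.corr() computes the Pearson correlation by default.
pearson_r = ratings.corr(votes)

print(f"Pearson correlation between Rating and Votes: {pearson_r:.3f}")
```

A value close to 1 would indicate that higher-rated movies in this sample also tend to have more votes, which is the pattern the scatter plot is meant to reveal.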
