Dsa and ML 10
A PROJECT REPORT ON
An Activity Report submitted in partial fulfillment of the requirements for the 6th
semester of
BACHELOR OF ENGINEERING (B.E)
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING ENGINEERING
SUBMITTED BY
JOSHI AVANTIKA (3GN21AI013)
J SIRI CHANDANA (3GN21AI011)
ESHA ELNOORKAR (3GN21AI008)
ANJALI ARTHAM (3GN21AI0005)
PROBLEM STATEMENT
Analyze and visualize video game sales data to identify trends in sales
performance, genre popularity, and platform success, offering insights for market analysis,
strategic planning, and understanding consumer preferences in the gaming industry.
STEPS TO BE FOLLOWED
Data Visualization
• Histograms
• Scatter Plots
• Pair Plots
Hypothesis Testing
• Data Cleaning and Preparation
• Exploratory Data Analysis (EDA)
• Feature Selection
• Model Training and Evaluation
• Visualization
The code in this report uses Python for data analysis, visualization, and machine learning. The libraries it
relies on are listed below:
1. Importing Libraries:
• import pandas as pd: Imports the Pandas library for data manipulation and analysis.
• import numpy as np: Imports the NumPy library for numerical operations.
• import matplotlib.pyplot as plt: Imports Matplotlib's pyplot module for creating visualizations.
• import seaborn as sns: Imports the Seaborn library for statistical data visualization.
• sklearn: scikit-learn, a Python library that provides supervised and unsupervised learning algorithms. It is
used here for SVR, RandomForestClassifier, LabelEncoder, and the evaluation metrics.
LET'S BEGIN!!!
# Importing Libraries
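The import cell itself is not reproduced in this report. A minimal sketch of what it could look like, collecting the libraries listed above together with the scikit-learn and SciPy names that the later code cells call, is given here. The exact set of imports in the original notebook is an assumption.

```python
# Core data-handling and plotting libraries named in the list above
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # stats.gamma is used by the sales histogram

# scikit-learn pieces called by the modelling cells in this report
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (mean_squared_error, r2_score,
                             accuracy_score, classification_report)
```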
# Data Cleaning
# Drop rows with missing values
df_cleaned = df.dropna()
   Name                      Platform  Year  Genre         NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0  Wii Sports                Wii       2006  Sports           41.49     29.02      3.77         8.46         82.74
1  Super Mario Bros.         NES       1985  Platform         29.08      3.58      6.81         0.77         40.24
2  Mario Kart Wii            Wii       2008  Racing           15.85     12.88      3.79         3.31         35.82
3  Wii Sports Resort         Wii       2009  Sports           15.75     11.01      3.28         2.96         33.00
4  Pokemon Red/Pokemon Blue  GB        1996  Role-Playing     11.27      8.89     10.22         1.00         31.37
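The cleaning step above can be tried on a small frame to see how many rows dropna() removes. This is a sketch only: the toy rows below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sales table; values are illustrative, not from the dataset
df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006, np.nan, 2008, 2009],
    "Publisher": ["Pub X", "Pub Y", None, "Pub X"],
    "Global_Sales": [1.2, 0.8, 0.5, 2.1],
})

# Count missing values per column before deciding how to handle them
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value, as in the report
df_cleaned = df.dropna()
print(f"Rows before: {len(df)}, rows after: {len(df_cleaned)}")
```

Checking the per-column counts first shows whether dropping rows discards a small or a large share of the data.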
# Encode categorical variables
o/p-
First 5 rows of standardized X:
[[ 5.14583529 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]]
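The encoding code is not shown in this report. Output of this shape (many columns whose values repeat across rows) is what one would expect from one-hot encoding followed by standardization, so the sketch below assumes that approach. The toy data, the choice of Global_Sales as the target, and the names X_train_scaled and X_test_scaled are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned table; values are illustrative only
df_cleaned = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "GB", "DS", "Wii"],
    "Genre": ["Sports", "Platform", "Racing", "Role-Playing", "Puzzle", "Sports"],
    "NA_Sales": [4.1, 2.9, 1.6, 1.1, 0.9, 1.5],
    "Global_Sales": [8.2, 4.0, 3.5, 3.1, 2.0, 3.3],
})

# Assumed target: Global_Sales; every other column is treated as a feature
y = df_cleaned["Global_Sales"]
X = pd.get_dummies(df_cleaned.drop(columns="Global_Sales"),
                   columns=["Platform", "Genre"], dtype=float)

# Split before scaling so the test rows do not influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on training rows
X_test_scaled = scaler.transform(X_test)        # reuse them on the test rows

print("First rows of standardized X:")
print(X_train_scaled[:5])
```

Fitting the scaler on the training split only keeps information from the test rows out of the model inputs.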
# Correlation Heatmap
o/p-
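The heatmap code is not reproduced here. A minimal sketch of how a correlation heatmap of the sales columns could be drawn is shown below, assuming Pearson correlation over the numeric columns. The toy values and the colour settings are illustrative choices, not the report's.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the sales columns; values are illustrative only
df = pd.DataFrame({
    "NA_Sales": [4.1, 2.9, 1.6, 1.1, 0.9],
    "EU_Sales": [2.9, 0.4, 1.3, 0.9, 0.3],
    "JP_Sales": [0.4, 0.7, 0.4, 1.0, 0.2],
    "Global_Sales": [8.2, 4.0, 3.5, 3.1, 1.5],
    "Genre": ["Sports", "Platform", "Racing", "Role-Playing", "Puzzle"],
})

# Pearson correlation is only defined for numeric columns, so select those first
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between sales columns")
plt.tight_layout()
plt.show()
```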
# Visualization:
# SVM
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
svm_regressor = SVR(kernel='linear')
svm_regressor.fit(X_train, y_train)
y_pred = svm_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred) # Use r2_score for regression
print("Mean Squared Error:", mse)
print("R-squared:", r2) # Print the R-squared score
o/p-
0.997263407834269
o/p-
0.1942313593126726
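# Random Forest
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Encode Platform as integers; the encoder is fitted on the training split only
le = LabelEncoder()
X_train['Platform'] = le.fit_transform(X_train['Platform'])
X_test['Platform'] = le.transform(X_test['Platform'])
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print("Random Forest Classifier Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest Classifier Report:\n", classification_report(y_test, y_pred_rf))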
o/p-
0.22829088677508438
# Visualization:
Sales Histogram
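from scipy import stats
plt.figure(figsize=(25,30))
sales_columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
# One histogram per sales column, each with a fitted gamma curve
# Note: sns.distplot is deprecated in recent seaborn releases
for i, column in enumerate(sales_columns):
    plt.subplot(3,2,i+1)
    sns.distplot(data[column], bins=20, kde=False, fit=stats.gamma)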
o/p-
Publisher comparison
plt.figure(figsize=(30, 15))
sns.barplot(x='Publisher', y='Sale_Price', hue='Sale_Area', data=comp_publisher)
plt.xticks(fontsize=14, rotation=90)
plt.yticks(fontsize=14)
plt.show()
o/p-
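The bar plot above expects comp_publisher in long format, with one row per publisher and sales area. The report does not show how that frame is built. One possible construction, assuming it comes from summing the regional sales columns per publisher and reshaping with melt, is sketched below on made-up data.

```python
import pandas as pd

# Toy stand-in for the sales table; publishers and values are illustrative only
data = pd.DataFrame({
    "Publisher": ["Pub X", "Pub X", "Pub Y", "Pub Z"],
    "NA_Sales": [4.0, 1.0, 2.0, 0.5],
    "EU_Sales": [3.0, 0.5, 1.0, 0.2],
    "JP_Sales": [0.5, 0.2, 1.5, 0.1],
    "Other_Sales": [0.8, 0.1, 0.3, 0.1],
})
region_columns = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Total sales per publisher and region (wide format: one column per region)
totals = data.groupby("Publisher", as_index=False)[region_columns].sum()

# Reshape to long format: one row per (publisher, region) pair
comp_publisher = totals.melt(id_vars="Publisher", value_vars=region_columns,
                             var_name="Sale_Area", value_name="Sale_Price")
print(comp_publisher.head())
```

The long format is what lets seaborn use hue='Sale_Area' to draw one bar per region for each publisher.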
# Visualization:
# Use positional values so the lookup does not depend on the DataFrame index
publisher = top_publisher_count['Publisher'].values
plt.figure(figsize=(30, 15))
g = sns.barplot(x='Year', y='Count', data=top_publisher_count)
# Annotate each bar with its publisher name and count
for index, value in enumerate(top_publisher_count['Count'].values):
    g.text(index, value + 5, f"{publisher[index]}----{value}", color='#000', size=14,
           rotation=90, ha="center")
plt.xticks(rotation=90)
plt.show()
o/p-
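The frame top_publisher_count is used above but its construction is not shown. A sketch of one way to obtain it, assuming it holds the publisher with the most titles in each year, follows. The toy rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the sales table; years and publishers are illustrative only
data = pd.DataFrame({
    "Year": [2006, 2006, 2006, 2007, 2007, 2008],
    "Publisher": ["Pub X", "Pub X", "Pub Y", "Pub Y", "Pub Y", "Pub Z"],
})

# Number of titles per (year, publisher) pair
counts = (data.groupby(["Year", "Publisher"]).size()
              .reset_index(name="Count"))

# Keep the row with the highest count within each year
top_idx = counts.groupby("Year")["Count"].idxmax()
top_publisher_count = counts.loc[top_idx].reset_index(drop=True)
print(top_publisher_count)
```

Resetting the index gives the rows positions 0, 1, 2, ... in plotting order, which matches the bar positions used when annotating the chart.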
Total revenue by region
plt.figure(figsize=(12, 8))
sns.barplot(x='region', y='sale', data = top_sale_reg)
o/p-
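The frame top_sale_reg needs a 'region' column and a 'sale' column. The report does not show how it is computed. The sketch below assumes it is the column-wise sum of the four regional sales columns, and uses made-up values.

```python
import pandas as pd

# Toy stand-in for the regional sales columns; values are illustrative only
data = pd.DataFrame({
    "NA_Sales": [4.0, 1.0, 2.0],
    "EU_Sales": [3.0, 0.5, 1.0],
    "JP_Sales": [0.5, 0.2, 1.5],
    "Other_Sales": [0.8, 0.1, 0.3],
})

# Sum each regional column, then turn the resulting Series into a two-column frame
top_sale_reg = (data[["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]]
                .sum()
                .rename_axis("region")
                .reset_index(name="sale")
                .sort_values("sale", ascending=False))
print(top_sale_reg)
```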
# Visualization:
plt.figure(figsize=(30, 10))
sns.countplot(x="Year", data=data, hue='Genre',
              order=data.Year.value_counts().iloc[:5].index)
plt.xticks(size=16, rotation=90)
o/p-
# Visualization:
Pie chart
plt.figure(figsize=(10, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
o/p-
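The pie chart call above takes sizes and labels, which are not defined in this report. As a sketch, they could be derived from the number of titles per genre. That choice is an assumption, since the report does not say what the pie chart shows, and the toy rows below are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the Genre column; values are illustrative only
data = pd.DataFrame({
    "Genre": ["Sports", "Sports", "Racing", "Platform", "Racing", "Sports"],
})

# Assumed quantity: number of titles per genre, largest first
genre_counts = data["Genre"].value_counts()
labels = genre_counts.index.tolist()
sizes = genre_counts.values.tolist()

plt.figure(figsize=(10, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis("equal")  # keep the pie circular
plt.show()
```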
# Visualization: