
Dsa and ML 10

Uploaded by

Esha Elnoorkar
Copyright
© © All Rights Reserved

VISVESVARAYA TECHNOLOGICAL UNIVERSITY,

Jnana Sangama, Belgaum-590018

A PROJECT REPORT ON

“ANALYZE AND VISUALIZE VIDEO GAME SALES”

An Activity Report submitted in partial fulfillment of the requirements for the 6th
semester of
BACHELOR OF ENGINEERING (B.E)
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING ENGINEERING
SUBMITTED BY
JOSHI AVANTIKA (3GN21AI013)
J SIRI CHANDANA (3GN21AI011)
ESHA ELNOORKAR (3GN21AI008)
ANJALI ARTHAM (3GN21AI0005)

UNDER THE GUIDANCE OF


Prof. JASMINEET KAUR ARORA
DR. HARISH JOSHI

GURU NANAK DEV ENGINEERING COLLEGE, BIDAR


MAILOOR ROAD, BIDAR, KARNATAKA-585403
CHAPTER 1

PROBLEM STATEMENT

Analyze and visualize video game sales data to identify trends in sales
performance, genre popularity, and platform success, offering insights for market analysis,
strategic planning, and understanding consumer preferences in the gaming industry.

STEPS TO BE FOLLOWED

Exploratory Data Analysis

• Installing Libraries and Modules


• Loading Data
• Data Inspection
• Understanding Variables
• Data Wrangling
• Feature Engineering

Data Visualization

• Histograms
• Scatter Plots
• Pair Plots

Hypothesis Testing
• Data Cleaning and Preparation
• Exploratory Data Analysis (EDA)
• Feature Selection
• Model Training and Evaluation
• Visualization

Machine Learning Models


• Random Forests
• SVM
• Decision Tree
IMPORT LIBRARIES AND MODULES

The project relies on the following Python libraries for data analysis, visualization, and machine learning.
Each import is described below:

1. Importing Libraries:

• import pandas as pd: Imports the Pandas library for data manipulation and analysis.

• import numpy as np: Imports the NumPy library for numerical operations.

• import matplotlib.pyplot as plt: Imports Matplotlib's pyplot module for creating visualizations.

• import seaborn as sns: Imports the Seaborn library for statistical data visualization.

• Sklearn: Scikit-learn is a library in Python that provides many unsupervised and supervised learning
algorithms.

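2. Scikit-learn Modules:

• from sklearn.model_selection import train_test_split: Splits the data into training and testing sets.

• from sklearn.model_selection import GridSearchCV: Searches a parameter grid with cross-validation to tune
model hyperparameters.

• from sklearn.svm import SVR: Imports the Support Vector Regression model.

• from sklearn.metrics import mean_squared_error, r2_score: Imports metrics for evaluating regression models.
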
Purpose:
Analyzing and visualizing video game sales helps identify market trends, forecast revenue, assess marketing
effectiveness, manage inventory, and gain insights into consumer behavior and competitive performance. This
guides strategic decisions and improves overall business outcomes.

LET'S BEGIN!!!

# Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Loading the dataset

# Source of the dataset (kept for reference)
url = 'https://raw.githubusercontent.com/snanilim/video-games-sales-analysis-and-visualization/main/vgsales.csv'
# Read a local copy of the file; pd.read_csv(url) would load it from the source instead
df = pd.read_csv('vgsales.csv')

# Display the first few rows of the dataset
df.head()
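
The outline lists Data Inspection and Understanding Variables as steps, but the report shows no code for them. The following is a minimal sketch of what such an inspection might look like. It assumes a small hypothetical `sample` frame with the same column names as vgsales.csv; the values are illustrative and are not taken from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample mirroring the vgsales.csv columns (illustrative values only).
sample = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Name": ["Game A", "Game B", "Game C"],
    "Platform": ["Wii", "NES", "Wii"],
    "Year": [2006.0, 1985.0, np.nan],
    "Genre": ["Sports", "Platform", "Racing"],
    "Publisher": ["Pub X", "Pub X", None],
    "NA_Sales": [4.0, 2.0, 1.0],
    "EU_Sales": [2.0, 1.0, 0.5],
    "JP_Sales": [0.5, 1.0, 0.2],
    "Other_Sales": [0.5, 0.2, 0.1],
    "Global_Sales": [7.0, 4.2, 1.8],
})

shape = sample.shape                                      # (rows, columns)
dtypes = sample.dtypes                                    # data type of each column
missing = sample.isnull().sum()                           # missing values per column
summary = sample.describe()                               # summary statistics for numeric columns
unique_counts = sample[["Platform", "Genre"]].nunique()   # distinct categories per column

print(shape)
print(dtypes)
print(missing)
print(summary)
print(unique_counts)
```

On the real dataset the same calls would be made on `df` right after loading it.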

# Data Cleaning

# Count missing values per column
print(df.isnull().sum())

# Drop rows with missing values
df_cleaned = df.dropna()

# Check for duplicates
print(df_cleaned.duplicated().sum())

# Drop duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Drop columns that are not needed
# (check the available columns first with print(df_cleaned.columns))
df_cleaned = df_cleaned.drop(columns=['Rank', 'Publisher'])

# Convert 'Year' to integer
df_cleaned['Year'] = df_cleaned['Year'].astype(int)

# Display cleaned data
df_cleaned.head()
o/p-
Rank 0
Name 0
Platform 0
Year 271
Genre 0
Publisher 58
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
Global_Sales 0
dtype: int64
0
   Name                      Platform  Year  Genre         NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0  Wii Sports                Wii       2006  Sports        41.49     29.02     3.77      8.46         82.74
1  Super Mario Bros.         NES       1985  Platform      29.08     3.58      6.81      0.77         40.24
2  Mario Kart Wii            Wii       2008  Racing        15.85     12.88     3.79      3.31         35.82
3  Wii Sports Resort         Wii       2009  Sports        15.75     11.01     3.28      2.96         33.00
4  Pokemon Red/Pokemon Blue  GB        1996  Role-Playing  11.27     8.89      10.22     1.00         31.37

# Encode categorical variables

# One-hot encoding for categorical features


df_encoded = pd.get_dummies(df_cleaned, columns=['Platform', 'Genre'])

# Check the encoded dataframe


df_encoded.head()

o/p-
First 5 rows of standardized X:
[[ 5.14583529 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]
 [-0.22321422 -0.11867924 -0.03106919 ... -0.0426875  -0.01545278 -0.01317869]]
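
The output above refers to a standardized X, but the report does not show the scaling step. The following is a minimal sketch of how such an array could be produced. It assumes scikit-learn's `StandardScaler` was applied to the one-hot encoded features; the `toy` frame is a hypothetical stand-in for `df_cleaned`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df_cleaned (illustrative values only).
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "GB"],
    "Genre": ["Sports", "Platform", "Racing", "Sports"],
    "NA_Sales": [4.0, 2.0, 1.0, 1.0],
})

# One-hot encode the categorical columns, as in the report.
encoded = pd.get_dummies(toy, columns=["Platform", "Genre"])
X = encoded.astype(float)

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("First 5 rows of standardized X:")
print(X_scaled[:5])
```

Standardizing puts the sales figures and the 0/1 indicator columns on a comparable scale, which matters for distance-based models such as SVM.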

# Correlation Heatmap
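
The code for the heatmap is not included in the report. The following is a minimal sketch, assuming the plot was drawn with seaborn's `heatmap` over the numeric sales columns. The `toy` frame and the plot title are hypothetical stand-ins.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for df_cleaned (illustrative values only).
toy = pd.DataFrame({
    "Genre": ["Sports", "Platform", "Racing", "Sports"],
    "NA_Sales": [4.0, 2.0, 1.0, 3.0],
    "EU_Sales": [2.0, 1.0, 0.5, 1.5],
    "JP_Sales": [0.5, 1.0, 0.2, 0.1],
})
toy["Global_Sales"] = toy[["NA_Sales", "EU_Sales", "JP_Sales"]].sum(axis=1)

# Correlation is only defined for numeric columns, so text columns are left out.
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between sales columns")
fig.tight_layout()
```

Calling `plt.show()` afterwards displays the figure in an interactive session.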

o/p-
# Visualization:

# Train and evaluate models


# Define features (X) and target (y)
X = df_cleaned.drop(columns=['Genre', 'Name'])  # Dropping the target and the 'Name' column
y = df_cleaned['Genre']

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

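# SVM (Support Vector Regression)
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# SVR is a regressor and needs numeric inputs and a numeric target, so it cannot
# be fitted on the categorical 'Genre' target defined above.
# Assumption: predict 'Global_Sales' from the four regional sales columns.
region_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    df_cleaned[region_cols], df_cleaned['Global_Sales'], test_size=0.2, random_state=42)

svm_regressor = SVR(kernel='linear')
svm_regressor.fit(Xr_train, yr_train)
y_pred = svm_regressor.predict(Xr_test)
mse = mean_squared_error(yr_test, y_pred)
r2 = r2_score(yr_test, y_pred)  # Use r2_score for regression
print("Mean Squared Error:", mse)
print("R-squared:", r2)  # Print the R-squared score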

o/p-
0.997263407834269
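
`GridSearchCV` is imported at the top of the script but is not used anywhere in the report. The following is a sketch of how it could tune the SVR. The synthetic data, the pipeline with scaling, and the parameter grid are all assumptions made for illustration and are not taken from the project.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-ins: four "regional sales" columns and their sum as the target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 4))
y = X.sum(axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features, then fit a linear SVR; both steps are tuned as one estimator.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="linear"))

# Hypothetical grid: regularization strength C and tube width epsilon.
param_grid = {"svr__C": [0.1, 1, 10], "svr__epsilon": [0.01, 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = r2_score(y_test, search.predict(X_test))
print("Best parameters:", best_params)
print("Test R-squared:", test_r2)
```

The grid search refits the best combination on the full training split, so `search.predict` can be used directly on the test data.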

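# Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# The tree needs numeric inputs, so label-encode the text column 'Platform'.
# The encoder is fitted on both splits so that every platform has a code.
X_train, X_test = X_train.copy(), X_test.copy()
platform_le = LabelEncoder().fit(pd.concat([X_train['Platform'], X_test['Platform']]))
X_train['Platform'] = platform_le.transform(X_train['Platform'])
X_test['Platform'] = platform_le.transform(X_test['Platform'])

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)
print("Decision Tree Classifier Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Decision Tree Classifier Report:\n", classification_report(y_test, y_pred_dt))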

o/p-
0.1942313593126726

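# Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Label-encode 'Platform' if it is still text; fitting on both splits
# avoids errors for platforms that occur only in the test set.
if not pd.api.types.is_numeric_dtype(X_train['Platform']):
    X_train, X_test = X_train.copy(), X_test.copy()
    le = LabelEncoder().fit(pd.concat([X_train['Platform'], X_test['Platform']]))
    X_train['Platform'] = le.transform(X_train['Platform'])
    X_test['Platform'] = le.transform(X_test['Platform'])

rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print("Random Forest Classifier Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest Classifier Report:\n", classification_report(y_test, y_pred_rf))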

o/p-
0.22829088677508438
# Visualization:

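Sales Histogram
from scipy import stats  # provides stats.gamma for the fitted curve

data = df  # assumption: the plots in this section use the full dataset as 'data'

plt.figure(figsize=(25, 30))
sales_columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
for i, column in enumerate(sales_columns):
    plt.subplot(3, 2, i + 1)
    # sns.distplot is deprecated in recent seaborn releases; it is kept here because it accepts fit=
    sns.distplot(data[column], bins=20, kde=False, fit=stats.gamma)
plt.show()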

o/p-
Publisher comparison

plt.figure(figsize=(30, 15))
sns.barplot(x='Publisher', y='Sale_Price', hue='Sale_Area', data=comp_publisher)
plt.xticks(fontsize=14, rotation=90)
plt.yticks(fontsize=14)
plt.show()
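
The `comp_publisher` frame plotted above is not defined in the report. The following is one way it could be built, assuming it holds regional sales totals for the top publishers in long format. The toy `data` frame and the choice of the top two publishers are hypothetical; only the column names `Publisher`, `Sale_Area`, and `Sale_Price` come from the plotting call.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset (illustrative values only).
data = pd.DataFrame({
    "Publisher": ["Pub X", "Pub X", "Pub Y", "Pub Z"],
    "NA_Sales": [4.0, 2.0, 1.0, 0.5],
    "EU_Sales": [2.0, 1.0, 0.5, 0.5],
    "JP_Sales": [0.5, 1.0, 0.2, 0.1],
    "Other_Sales": [0.5, 0.2, 0.1, 0.1],
})
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Total sales per publisher and region.
by_publisher = data.groupby("Publisher")[region_cols].sum()

# Keep the publishers with the largest combined sales (two here, for brevity).
top_names = by_publisher.sum(axis=1).nlargest(2).index
top = by_publisher.loc[top_names]

# Reshape to long format: one row per (publisher, region) pair.
comp_publisher = top.reset_index().melt(
    id_vars="Publisher",
    value_vars=region_cols,
    var_name="Sale_Area",
    value_name="Sale_Price",
)
print(comp_publisher)
```

The long format is what lets `sns.barplot` draw one bar per region for each publisher through the `hue` argument.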

o/p-
# Visualization:

# Visualization: Top publisher by Count each year

top_publisher = data[['Year', 'Publisher']]
top_publisher_df = top_publisher.groupby(by=['Year', 'Publisher']).size().reset_index(name='Count')
# Keep, for each year, the publisher(s) with the highest release count
top_publisher_idx = top_publisher_df.groupby(by=['Year'])['Count'].transform('max') == top_publisher_df['Count']
top_publisher_count = top_publisher_df[top_publisher_idx].reset_index(drop=True)
# If several publishers tie in a year, keep only the last one
top_publisher_count = top_publisher_count.drop_duplicates(subset=["Year", "Count"], keep='last').reset_index(drop=True)

publisher = top_publisher_count['Publisher']

plt.figure(figsize=(30, 15))
g = sns.barplot(x='Year', y='Count', data=top_publisher_count)
# Label each bar with the publisher name and its count
for index, value in enumerate(top_publisher_count['Count'].values):
    g.text(index, value + 5, str(publisher[index] + '----' + str(value)), color='#000', size=14,
           rotation=90, ha="center")
plt.xticks(rotation=90)
plt.show()

o/p-
Total revenue by region

plt.figure(figsize=(12, 8))
sns.barplot(x='region', y='sale', data = top_sale_reg)
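
The `top_sale_reg` frame used above is not defined in the report. The following is a minimal sketch of how it could be built, assuming it holds the total sales of each region. The toy `data` frame is hypothetical; the column names `region` and `sale` come from the plotting call.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset (illustrative values only).
data = pd.DataFrame({
    "NA_Sales": [4.0, 2.0, 1.0, 0.5],
    "EU_Sales": [2.0, 1.0, 0.5, 0.5],
    "JP_Sales": [0.5, 1.0, 0.2, 0.1],
    "Other_Sales": [0.5, 0.2, 0.1, 0.1],
})
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Sum each regional column, then turn the result into a two-column frame.
top_sale_reg = data[region_cols].sum().reset_index()
top_sale_reg.columns = ["region", "sale"]

# Sort so the largest market is plotted first.
top_sale_reg = top_sale_reg.sort_values("sale", ascending=False).reset_index(drop=True)
print(top_sale_reg)
```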

o/p-
# Visualization:

Top 5 years games release by genre:

plt.figure(figsize=(30, 10))
sns.countplot(x="Year", data=data, hue='Genre',
order=data.Year.value_counts().iloc[:5].index)
plt.xticks(size=16, rotation=90)

o/p-
# Visualization:

Pie chart

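# Assumption: the pie shows each region's share of total sales,
# reusing the regional totals in top_sale_reg
labels = top_sale_reg['region']
sizes = top_sale_reg['sale']

plt.figure(figsize=(10, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)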

o/p-
# Visualization:
