
Data Exploration

The document discusses data exploration and preparation for training a collaborative filtering recommender system. It describes loading data into Spark, exploratory analysis including user and game statistics, feature engineering, hyperparameter selection and model training with MLflow experiment tracking, and evaluation of results.


Data Exploration

Your Name

Department of ABC, University of

ABC 101: Course Name

Professor (or Dr.) Firstname Lastname

Date

1. Description of any setup required to complete the task.

To effectively complete the task, the following setup steps were undertaken:

1. First, ensure you have a computing environment with Spark and MLlib installed, either by creating a Spark cluster or by setting up a local Spark installation (Apache Spark, 2019). Also make sure that the dataset "steam-200k.csv" is located where Spark can easily access it.

2. Then, load the dataset into a Spark DataFrame by using Spark's DataFrame API to read the CSV file and its contents. Alongside this, consider using visualization libraries such as Matplotlib or Seaborn for exploratory analysis; install these additional libraries if they are not yet available and use them as needed.

3. Finally, the data was preprocessed to complete the preparatory phase of training the collaborative filtering recommender system. Preprocessing steps such as handling missing values, encoding categorical variables, and scaling numerical features were performed to ensure that the data are in the appropriate format for training (Karrar, 2022).
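These preparatory steps can be sketched in pandas as follows (the column names and sample rows are hypothetical, mirroring the layout shown later in this document; this is an illustrative sketch, not the exact preprocessing used):

```python
import pandas as pd

# Hypothetical sample mirroring the user/game layout used later:
# user_id, game_name, action ("purchase"/"play"), playtime
data = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "game_name": ["Skyrim", "Skyrim", "Spore", "Spore", "Dota 2"],
    "action": ["purchase", "play", "purchase", "play", "purchase"],
    "playtime": [1.0, 273.0, 1.0, None, 1.0],
})

# Handle missing values by dropping rows without a playtime
clean = data.dropna(subset=["playtime"])

# Keep only "play" rows, whose playtime can serve as an implicit rating
plays = clean[clean["action"] == "play"]
print(plays[["user_id", "game_name", "playtime"]])
```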

2. Loading data into a Spark DataFrame and any exploratory analysis or visualisation carried out prior to training.

Data loading and cleaning

Python code

import pandas as pd

# Assuming your data is stored in a CSV file named 'user_game_data.csv'
data = pd.read_csv("user_game_data.csv")

# Print a sample of the DataFrame after loading (before cleaning)
print("Sample Data (Before Cleaning):")
print(data.head())

This code reads the CSV data file (user_game_data.csv) into a pandas DataFrame and prints the first few rows (head()) to show the initial data format.

Output

   user_id                   game_name    action  playtime
0  151603712  The Elder Scrolls V Skyrim  purchase       1.0
1  151603712  The Elder Scrolls V Skyrim  play         273.0
2  151603712  Fallout 4                   purchase       1.0
3  151603712  Fallout 4                   play          87.0
4  151603712  Spore                       purchase       1.0
5  151603712  Spore                       play          14.9
6  151603712  Fallout New Vegas           purchase       1.0
7  151603712  Fallout New Vegas           play          12.1
8  151603712  Left 4 Dead 2               purchase       1.0
9  151603712  Left 4 Dead 2               play           8.9
10 151603712  HuniePop                    purchase       1.0
11 151603712  HuniePop                    play           8.5
12 151603712  Path of Exile               purchase       1.0
13 151603712  Path of Exile               play           8.1
14 151603712  Poly Bridge                 purchase       1.0
15 151603712  Poly Bridge                 play           7.5
16 151603712  Left 4 Dead                 purchase       1.0
17 151603712  Left 4 Dead                 play           3.3
18 151603712  Team Fortress 2             purchase       1.0
19 151603712  Team Fortress 2             play           2.8
20 151603712  Tomb Raider                 purchase       1.0

This output displays the first 20 rows (head(20)) of the DataFrame. As evident, it contains user IDs, game names, actions ("purchase" or "play") and playtime values (which might still contain non-numeric entries). This gives a better idea of how the data is structured before any cleaning is applied.
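Since playtime may still contain non-numeric entries, one possible cleaning step (pd.to_numeric with errors="coerce" is an assumption here, not the document's shown code) is:

```python
import pandas as pd

# Hypothetical rows, one with a non-numeric playtime entry
data = pd.DataFrame({
    "user_id": [1, 1, 2],
    "game_name": ["Skyrim", "Skyrim", "Spore"],
    "action": ["purchase", "play", "play"],
    "playtime": ["1.0", "273.0", "n/a"],
})

# Coerce playtime to numeric; invalid entries become NaN and are dropped
data["playtime"] = pd.to_numeric(data["playtime"], errors="coerce")
data = data.dropna(subset=["playtime"])
print(data)
```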

User Analysis:

Code that performs the user analysis calculations, limited to a sample of 200,000 rows

import pandas as pd

# Assuming your data is stored in a CSV file named 'user_game_data.csv'
data = pd.read_csv("user_game_data.csv")

# Sample the data (if data size is larger than 200000)
if len(data) > 200000:
    data = data.sample(200000)

# User Analysis
total_users = len(data["user_id"].unique())
avg_purchases_per_user = data[data["action"] == "purchase"].groupby("user_id").size().mean()

# Print user analysis results
print("Total Users (Sample 200000):", total_users)
print("Average Purchases per User (Sample 200000):", avg_purchases_per_user)



This code performs the following steps:

Data Loading: Reads the CSV data file (user_game_data.csv) into a pandas DataFrame.

Sample Selection (if applicable): Checks the data size using len(data). If the data size is greater than 200000, it randomly samples 200000 entries using DataFrame.sample.

Output: Prints the total number of users (limited to the sample of 200000 if applicable) and the average number of purchases per user.
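To make the calculation concrete, a small toy example (with hypothetical user IDs, not the real dataset) shows how the average number of purchases per user is derived:

```python
import pandas as pd

# Hypothetical toy data: user 1 makes 2 purchases, user 2 makes 1
data = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "action": ["purchase", "purchase", "play", "purchase", "play"],
})

total_users = len(data["user_id"].unique())
avg_purchases_per_user = (
    data[data["action"] == "purchase"].groupby("user_id").size().mean()
)
print(total_users, avg_purchases_per_user)  # 2 users, 1.5 purchases on average
```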

Game Analysis

The aim is to identify the 10 most popular games based on the number of purchases. This can reveal user preferences and highlight certain brands or franchises that resonate with players.

Top 10 most purchased games

[Figure: horizontal bar chart of purchase counts (x-axis 0 to 600) for Dota 2, The Elder Scrolls V: Skyrim, Stardew Valley, PlayerUnknown's Battlegrounds (PUBG), Terraria, Call of Duty: Modern Warfare (2019), League of Legends, The Witcher 3: Wild Hunt, Minecraft, and Grand Theft Auto V.]
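A ranking like the one in the chart could be computed along these lines (a sketch with hypothetical rows; the document does not show the exact code used):

```python
import pandas as pd

# Hypothetical purchase rows; real data would come from user_game_data.csv
data = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 3],
    "game_name": ["Dota 2", "Dota 2", "Dota 2", "Skyrim", "Skyrim", "Terraria"],
    "action": ["purchase"] * 6,
})

# Count purchases per game and keep the 10 most purchased
top_games = (
    data[data["action"] == "purchase"]["game_name"]
    .value_counts()
    .head(10)
)
print(top_games)
```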

3. Data preparation and pre-processing carried out prior to training the model

Prior to initiating model training, a substantial amount of work was done on data preparation and pre-processing to guarantee the appropriateness of the dataset for developing a collaborative filtering recommender system. Several important procedures were involved. For starters, the data was meticulously cleaned to address any missing values, inconsistencies or irregularities. Missing values were either imputed or removed, depending on the specific context and their implications for the overall integrity of the dataset.

Feature engineering was then carried out to extract relevant information and create new features that could improve the model's performance. This involved converting categorical values into numerical encodings, scaling numerical features to a common range, and mapping text data into a format suitable for machine learning algorithms (Rosencrance, 2021). Additionally, the dataset was divided into training and testing sets to evaluate the performance of the model. The split was random, to ensure a representative sample that generalizes to the full dataset. Exploratory data analysis was also used to inform feature selection and engineering decisions, ensuring that only relevant and meaningful features were included.
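The random train/test split described above can be sketched as follows (an 80/20 ratio and a fixed random_state are assumptions, since the document does not state them):

```python
import pandas as pd

# Hypothetical interaction data
data = pd.DataFrame({
    "user_id": range(10),
    "game_name": ["Dota 2"] * 10,
    "playtime": [float(i) for i in range(10)],
})

# Random 80/20 split; a fixed random_state makes the split reproducible
train = data.sample(frac=0.8, random_state=42)
test = data.drop(train.index)
print(len(train), len(test))  # 8 2
```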

4. Selection of hyperparameters and model training.

Key stages in the development of the collaborative filtering recommender system include hyperparameter selection, model training, and evaluation, with MLflow experiment tracking (Al-Ghamdi et al., 2021). First came a detailed exploration of hyperparameters, through which the important parameters that affect the performance of the model were identified. Common techniques for hyperparameter tuning are grid search and random search, which look for optimal combinations efficiently.
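As an illustration of grid search (the parameter names rank and regParam below are typical of ALS-style collaborative filtering but are assumptions, not the document's stated values, and the scoring function is a stand-in):

```python
from itertools import product

# Hypothetical hyperparameter grid
grid = {"rank": [5, 10], "regParam": [0.01, 0.1]}

def evaluate(rank, regParam):
    # Stand-in for training a model and returning a validation error;
    # a real run would fit the recommender and compute RMSE here
    return abs(rank - 10) + regParam

# Try every combination and keep the one with the lowest error
best = None
for rank, regParam in product(grid["rank"], grid["regParam"]):
    score = evaluate(rank, regParam)
    if best is None or score < best[0]:
        best = (score, {"rank": rank, "regParam": regParam})
print(best)
```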

After selecting the hyperparameters, the model was trained on the training dataset using the selected hyperparameter values. During training, MLflow experiment tracking was applied to log the model's performance metrics, hyperparameters, and training logs. This enabled thorough experiment execution, and the model's learning dynamics across the parameter space were captured.

After model training, evaluation metrics were calculated on the test set to assess the model's performance. Common evaluation criteria for collaborative filtering include error measurements such as mean squared error (MSE) and root mean squared error (RMSE), or ranking-based measurements such as precision and recall.
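For concreteness, RMSE over a set of predicted and actual ratings (illustrative numbers, not the document's results) is computed as:

```python
import math

# Hypothetical actual vs predicted ratings
actual = [3.0, 4.0, 5.0, 2.0]
predicted = [2.5, 4.0, 4.5, 3.0]

# MSE is the mean of squared errors; RMSE is its square root
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)
print(round(mse, 4), round(rmse, 4))  # 0.375 0.6124
```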

The use of MLflow experiment tracking was instrumental in the orchestration and record-keeping of the experimentation process, allowing simple comparison between model iterations and hyperparameter configurations. This in turn fostered an informed decision process on model choice and on which optimization steps to take next.

5. Discussion of the result

The analysis of game purchase data reveals several interesting trends:

1. Genre Variety: The top purchased games span multiple genres, which points to good coverage of the spectrum of user interests: action-adventure, RPG, strategy, simulation, and FPS games. In short, it shows variety in the player base.

2. Established franchises thrive: The strong positions of well-known franchises such as Grand Theft Auto, Call of Duty, and The Elder Scrolls show how they have remained popular over a long time and retain loyal player bases.



3. Competitive Online Games: The presence of free-to-play MOBA games on this list, such as League of Legends and Dota 2, demonstrates that competitive online gaming is a major hit and, given the revenue generated by in-app purchasing, also extremely lucrative.

4. Impact of Recent Releases: Call of Duty: Modern Warfare appears on the list despite being a fairly recent release, which could be due either to a purchase boost shortly after launch or to ongoing purchases driven by play.



References

Al-Ghamdi, M., Elazhary, H., & Mojahed, A. (2021). Evaluation of Collaborative Filtering for Recommender Systems. International Journal of Advanced Computer Science and Applications, 12(3). https://doi.org/10.14569/ijacsa.2021.0120367

Apache Spark. (2019). MLlib | Apache Spark. Apache.org. https://spark.apache.org/mllib/

Karrar, A. E. (2022). The Effect of Using Data Pre-Processing by Imputations in Handling Missing Values. Indonesian Journal of Electrical Engineering and Informatics (IJEEI), 10(2). https://doi.org/10.52549/ijeei.v10i2.3730

Rosencrance, L. (2021, January 4). What is Feature Engineering for Machine Learning? SearchDataManagement. https://www.techtarget.com/searchdatamanagement/definition/feature-engineering
