Data Exploration
Your Name
Date
To complete the task, the following setup steps were undertaken:
1. First, ensure that a computing environment with Spark and MLlib installed is available.
This can be done either by creating a Spark cluster or by setting up a local Spark installation
(Apache Spark, 2019). Also, make sure that the dataset "steam-200k.csv" is located
2. Then, load the dataset into a Spark DataFrame, using Spark's DataFrame API to
read the CSV file and its contents. For exploratory analysis, visualization libraries
such as Matplotlib or Seaborn can also be used; install them if they are not already
present and use them as needed.
3. The data was then prepared for model training: handling missing values, encoding
categorical variables, and scaling numerical features were performed to ensure that the
data is in an appropriate format for training (Karrar, 2022).
2. Loading data into Spark Data Frame and any exploratory analysis or visualisation
Python code
import pandas as pd
data = pd.read_csv("user_game_data.csv")
print(data.head())
This code reads the CSV data file (user_game_data.csv) into a pandas DataFrame and prints the
Output
This output displays the first rows of the DataFrame. Note that head() as called in the code
returns five rows by default; displaying 20 rows would require head(20). The output shows user
IDs, game names, actions ("purchase" or "play"), and playtime values, which might still contain
non-numeric entries. This gives a better idea of how the data is structured before any cleaning
is applied.
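The non-numeric playtime entries mentioned above can be detected before cleaning. The following is a minimal sketch, assuming pandas is available; the column names "game", "action" and "playtime" and all sample rows are hypothetical stand-ins, since the real file layout is not shown here.

```python
import pandas as pd

# Hypothetical rows that mimic the columns described above (not real data).
data = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "game": ["Dota 2", "Dota 2", "Terraria", "Terraria"],
    "action": ["purchase", "play", "purchase", "play"],
    "playtime": ["1.0", "12.5", "1.0", "n/a"],
})

# Convert playtime to numbers; entries that cannot be parsed become NaN.
data["playtime"] = pd.to_numeric(data["playtime"], errors="coerce")

# Count the entries that failed to parse so they can be handled during cleaning.
n_invalid = int(data["playtime"].isna().sum())
print(data.dtypes)
print("non-numeric playtime entries:", n_invalid)
```

Coercing to NaN rather than raising an error keeps the rows available, so the later choice between imputation and removal remains open.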
User Analysis:
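Code that incorporates user analysis calculations, limited to a random sample of at most 200000 rows
import pandas as pd
data = pd.read_csv("user_game_data.csv")
# Sample only when the file has more than 200000 rows; sample() raises ValueError otherwise
if len(data) > 200000:
    data = data.sample(200000)
# User Analysis: number of distinct users in the (possibly sampled) data
total_users = data["user_id"].nunique()
print("Total users:", total_users)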
Data Loading: Reads the CSV data file (user_game_data.csv) into a pandas DataFrame.
If the data size is greater than 200000, it randomly samples 200000 entries using
DataFrame.sample.
Output: Prints the total number of users (limited to the sample of 200000 if applicable) and the
Game Analysis
The aim is to identify the 10 most popular games based on the number of purchases. This
can reveal user preferences and highlight brands or franchises that resonate with players.
Purchase Count
Dota 2
Stardew Valley
Terraria
League of Legends
Minecraft
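A ranking by purchase count of this kind could be produced roughly as follows. This is only a sketch, assuming pandas and hypothetical column names ("game", "action"); the rows are made up for illustration and do not reproduce the counts from the analysis.

```python
import pandas as pd

# Illustrative rows only; the real dataset is not reproduced here.
data = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "game": ["Dota 2", "Dota 2", "Dota 2", "Terraria", "Terraria", "Dota 2"],
    "action": ["purchase", "purchase", "purchase", "purchase", "purchase", "play"],
})

# Keep purchase events only, so that play sessions do not inflate the ranking.
purchases = data[data["action"] == "purchase"]

# value_counts() sorts in descending order; head(10) keeps the ten most purchased games.
top_games = purchases["game"].value_counts().head(10)
print(top_games)
```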
3. Data preparation and pre-processing carried out prior to training the model
Prior to model training, a substantial amount of data preparation work was carried out.
First, the data was cleaned to address any missing values, inconsistencies, or
irregularities. Missing values were handled by either imputation or removal, depending on
the specific context and the implications towards
Feature engineering was then applied to extract relevant information and create new features that
can improve the model's performance. This involved encoding categorical classes as numerical
values, scaling numerical features to a common range, and mapping text data into a format
suitable for machine learning algorithms (Rosencrance, 2021). Additionally, the dataset was
divided into training and testing sets to evaluate the performance of the model. The split was
random to ensure a representative sample that still reflects the distribution of the overall data.
Exploratory data analysis was also used to inform feature selection and engineering
decisions, ensuring that only relevant and meaningful features were included.
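The preparation steps described in this section (imputation, categorical encoding, scaling, and a random train/test split) can be sketched as below. This is an illustration under stated assumptions rather than the exact pipeline used: the column names, the median imputation, the min-max scaling, the 80/20 ratio, and the seed are all choices made for the example, and the rows are invented.

```python
import pandas as pd

# Invented example rows; "game" and "playtime" are assumed column names.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "game": ["A", "B", "A", "C", "B", "C", "A", "B", "C", "A"],
    "playtime": [1.0, None, 5.0, 2.0, 8.0, 3.0, None, 4.0, 10.0, 6.0],
})

# Impute missing playtime with the column median (dropna() would be the removal option).
df["playtime"] = df["playtime"].fillna(df["playtime"].median())

# Encode the categorical game name as an integer id; game_labels maps ids back to names.
df["game_id"], game_labels = pd.factorize(df["game"])

# Min-max scale playtime to the range [0, 1].
pt_min, pt_max = df["playtime"].min(), df["playtime"].max()
df["playtime_scaled"] = (df["playtime"] - pt_min) / (pt_max - pt_min)

# Random 80/20 split; the fixed seed makes the split reproducible.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(len(train), "training rows,", len(test), "test rows")
```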
Key stages in the development of the collaborative filtering recommender system include
hyperparameter selection, model training, evaluation, and MLflow experiment tracking
(Al-Ghamdi et al., 2021). The first is a detailed exploration of hyperparameters, through which
the parameters that most affect the performance of the model can be identified. Common
techniques for hyperparameter tuning are grid search and random search, which systematically
search for well-performing combinations.
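The difference between grid search and random search can be illustrated with a short sketch. The parameter names ("rank", "reg_param"), their candidate values, and the validation_error function are hypothetical: a synthetic function stands in for training and validating a recommender so that the example runs without data.

```python
import itertools
import random

# Hypothetical search space; names and values are assumptions for illustration.
param_grid = {"rank": [5, 10, 20], "reg_param": [0.01, 0.1, 1.0]}

def validation_error(params):
    """Stand-in for training a model and returning its validation error.

    A synthetic bowl-shaped function is used so the sketch needs no data.
    """
    return (params["rank"] - 10) ** 2 / 100 + (params["reg_param"] - 0.1) ** 2

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    keys = list(grid)
    candidates = [dict(zip(keys, values))
                  for values in itertools.product(*grid.values())]
    return min(candidates, key=score_fn)

def random_search(grid, score_fn, n_trials, seed=0):
    """Evaluate n_trials randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in grid.items()}
                  for _ in range(n_trials)]
    return min(candidates, key=score_fn)

best_grid = grid_search(param_grid, validation_error)
best_random = random_search(param_grid, validation_error, n_trials=4)
print("grid search:", best_grid)
print("random search:", best_random)
```

Grid search evaluates all nine combinations here, while random search evaluates only four, so it is cheaper but may miss the best combination.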
After selecting the hyperparameters, the model was trained on the training dataset using the
selected values. During training, MLflow experiment tracking was applied to track and log the
model’s performance metrics, hyperparameters, and training logs.
This enabled a thorough experiment execution, and the model’s learning dynamics across a
After model training, evaluation metrics were calculated on the test set to assess the
model's performance and overall predictive ability. Common evaluation criteria for collaborative
filtering include error measures such as mean squared error (MSE), root mean square error (RMSE), or
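For reference, the two error measures named above, MSE and RMSE, can be computed as in the following sketch. The rating values are invented for illustration and are not results from the model.

```python
import math

def mse(actual, predicted):
    """Mean squared error: the average of the squared prediction errors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: the square root of MSE, in the units of the target."""
    return math.sqrt(mse(actual, predicted))

# Invented values for illustration only.
actual = [3.0, 1.0, 4.0, 2.0]
predicted = [2.0, 1.0, 6.0, 2.0]

print("MSE:", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Because RMSE is expressed in the same units as the predicted quantity, it is often easier to interpret than MSE.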
The use of MLflow experiment tracking was instrumental in organizing and recording the
experiments, allowing straightforward comparison between model iterations and hyperparameter
configurations. This supports an informed decision on model choice and on which optimization
steps to take next. An essential element that played a
1. Genre Variety: Top purchased games spread across genres, which points to good
2. Established Franchises Thrive: The strong positions of well-known franchises such as Grand
Theft Auto, Call of Duty, and The Elder Scrolls show how they could have continued to be popular
3. Competitive Online Games: The presence of free-to-play MOBA games on this list,
such as League of Legends and Dota 2, demonstrates that competitive online gaming is a
major hit and, given these games' substantial revenue from in-app purchases, also
extremely lucrative.
4. Impact of Recent Releases: Call of Duty: Modern Warfare appears on the list despite
being a fairly recent release. This could be either due to purchase boosts shortly after
References
Al-Ghamdi, M., Elazhary, H., & Mojahed, A. (2021). Evaluation of Collaborative Filtering for
10(2). https://doi.org/10.52549/ijeei.v10i2.3730
Rosencrance, L. (2021, January 4). What is Feature Engineering for Machine Learning?
SearchDataManagement.
https://www.techtarget.com/searchdatamanagement/definition/feature-engineering