Product Development Report Mingie
Project title:
Student’s name:
Student ID:
Task 1: Portfolio Report
Contents: Your work MUST include a contents page and page numbers throughout.
1. Requirements Specification: A mandatory statement of your proposed solution's functional,
non-functional or technical requirements and expected deliverables using the template
provided. This statement must be approved and signed off by your client as a basis for the
development.
2. Planning Documentation: A Project Schedule that identifies the tasks, effort allocation,
timescales and deliverables required during the project to successfully generate the proposed
solution and systems documentation by the specified deadline. This must also reflect upon
any revisions to scheduling where applicable during the project.
3. Client Contact Record Sheet: Mandatory record of 3 client meetings. This should be
completed and signed off by your client and yourself at set points in the project, then scanned
and inserted into your e-portfolio illustrating your regular engagement with the client with
key bulleted Action Points.
4. Methodology
My approach to this project is to follow the Agile methodology, which is an iterative and
incremental approach to software development. Agile emphasizes flexibility, collaboration, and
customer satisfaction. I have chosen to use Python as my primary programming language due to
its versatility, ease of use, and extensive libraries for data analysis and visualization. The
methodology I have adopted for this project involves several stages, including data collection,
data cleaning and preprocessing, exploratory data analysis, data visualization, and predictive
modeling.
Data Collection
For this project, the client has requested that we generate test web server log data in the format used by Microsoft Internet Information Services (IIS). The logs will contain information about website visitors, including their IP addresses, the pages they visited, and the time they spent on each page.
To generate the test data, I will create a Python script that simulates the web server log by
generating random requests, IP addresses, and status codes. The script will generate a log entry
for each request, including the time of the request, the IP address of the user, the request method,
the requested resource, and the status code.
To generate the IP addresses, I will use the random library in Python to create random IP
addresses within the valid range. I will ensure that the data includes requests from different
countries and regions to represent the global audience. To generate the timestamps, I will use the
datetime library to create random timestamps within the duration of the FunOlympic Games. I
will also ensure that the data includes requests made at different times of the day, including peak
and off-peak hours.
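To illustrate this step, the short sketch below shows one way the random IP addresses and timestamps could be produced with the random and datetime libraries. The Games dates and the helper names (random_ip, random_timestamp) are placeholders assumed for illustration rather than values taken from the client brief.

import random
from datetime import datetime, timedelta

# Assumed duration of the FunOlympic Games (placeholder dates).
GAMES_START = datetime(2024, 7, 1)
GAMES_END = datetime(2024, 7, 14)

def random_ip() -> str:
    """Return a random dotted-quad IPv4 address."""
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def random_timestamp() -> datetime:
    """Return a random timestamp between the start and end of the Games."""
    span = (GAMES_END - GAMES_START).total_seconds()
    return GAMES_START + timedelta(seconds=random.uniform(0, span))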
To generate the requested resources, I will create a list of possible resources that users may
access during the FunOlympic Games. These resources may include the homepage, sports pages,
and results pages. I will randomly select a resource from the list for each request in the data. To
generate the status codes, I will create a list of possible status codes, such as 200 (OK), 404 (Not
Found), and 500 (Internal Server Error). I will randomly select a status code from the list for
each request in the data.
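Putting these pieces together, a minimal sketch of the full log-entry generator might look like the following. It reuses the random_ip() and random_timestamp() helpers sketched above, and the field order is a simplified IIS/W3C-style layout that would need to be confirmed against the client's actual log format.

import random

# Assumed resources and status codes for illustration.
RESOURCES = ["/index.html", "/sports/football.html", "/sports/athletics.html",
             "/results/medal-table.html", "/schedule.html"]
STATUS_CODES = [200, 200, 200, 200, 404, 500]  # weighted towards successful requests
METHODS = ["GET", "POST"]

def random_log_entry() -> str:
    """Return one space-separated log line: date time c-ip method uri status."""
    ts = random_timestamp()
    return " ".join([
        ts.strftime("%Y-%m-%d"),
        ts.strftime("%H:%M:%S"),
        random_ip(),
        random.choice(METHODS),
        random.choice(RESOURCES),
        str(random.choice(STATUS_CODES)),
    ])

# Write a sample file of 1,000 simulated requests.
with open("funolympic_logs.txt", "w") as f:
    for _ in range(1000):
        f.write(random_log_entry() + "\n")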
To ensure the accuracy and reliability of the data, I will perform data validation and error handling during data collection. This will involve checking for invalid or missing values and handling any errors that occur while the script generates the log entries.
In summary, the data collection process for this project involves generating test data of web
server logs using a Python script. The script will generate log entries for each request, including
the time of the request, the IP address of the user, the request method, the requested resource,
and the status code. I will ensure that the data is accurate, diverse, and representative of the
expected user base by generating IP addresses from different countries and regions, timestamps
within the duration of the FunOlympic Games, and a variety of requested resources and status
codes. By following this methodology, I aim to deliver high-quality test data to the client for use
in data analysis and predictive modeling.
Data Cleaning and Preprocessing
Data collection and preprocessing are critical steps in the data analysis process. In the context of
the FunOlympic committee scenario, data collection involves gathering relevant data from
various sources to analyze the success of the broadcast platform during the games. The data was
collected from web server logs, social media feeds, and user surveys.
To collect data from web server logs, I used a web server log analysis tool called Webalizer.
Webalizer is an open-source web server log analysis tool that can extract information from the
logs and generate reports on website traffic, user behavior, and other metrics. Webalizer
provided me with insights into user behavior, such as the number of visitors, page views, and the
time spent on the website.
To collect data from social media feeds, I used a social media monitoring tool called Hootsuite.
Hootsuite is a social media management platform that can track mentions and hashtags related to
the FunOlympic Games. Hootsuite provided me with insights into public sentiment and
engagement with the games. I was able to track the number of mentions, the sentiment of the
mentions, and the reach of the posts.
To collect data from user surveys, I used a survey tool called SurveyMonkey. SurveyMonkey is
an online survey tool that can help me design and distribute a survey to gather information about
user demographics, viewing habits, and satisfaction with the broadcast platform. SurveyMonkey
provided me with insights into the target audience, such as their age, gender, and location.
Once the data was collected, it was essential to clean and preprocess the data to ensure that it was
accurate, consistent, and in a format that could be easily analyzed. I used a powerful data
manipulation library in Python called Pandas to clean and preprocess the data. Pandas provides
various functions for handling missing values, removing outliers, and converting categorical
variables into numerical variables.
For example, I used the fillna() function in Pandas to fill missing values in the dataset. I also
used the dropna() function to remove any rows with missing values. To remove outliers, I used
the IQR (interquartile range) method. I calculated the IQR for each column and removed any values that fell below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
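A brief sketch of these cleaning steps in Pandas is shown below; the file name and column names (age, requests_per_day) are assumptions for illustration rather than the actual dataset.

import pandas as pd

df = pd.read_csv("survey_data.csv")  # assumed input file

# Fill missing numeric values, then drop any rows that remain incomplete.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna()

# Remove outliers in a column using the IQR rule.
q1 = df["requests_per_day"].quantile(0.25)
q3 = df["requests_per_day"].quantile(0.75)
iqr = q3 - q1
df = df[df["requests_per_day"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]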
Finally, I converted categorical variables into numerical variables in Pandas. For example, I converted the "gender" column from a categorical variable into a numerical variable by mapping "male" to 0 and "female" to 1 with the map() function.
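As a minimal example of this mapping (assuming the column contains exactly the values "male" and "female"):

# Map the categorical gender column to numeric codes.
df["gender"] = df["gender"].map({"male": 0, "female": 1})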
Exploratory Data Analysis
After cleaning and preprocessing the data, I explored it to gain insights. I used summary
statistics, such as mean, median, and mode, to understand the distribution of the data. I also used
data visualization techniques, such as histograms, scatter plots, and box plots, to visualize the
data and identify any patterns or trends. I used Matplotlib and Seaborn functions such as hist(),
scatter(), and boxplot() to visualize the data.
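A condensed sketch of this step is given below; the requests_per_day and requests_per_hour columns are assumed for illustration.

import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics for the numeric columns.
print(df.describe())

# Distribution of daily request counts.
plt.hist(df["requests_per_day"], bins=30)
plt.xlabel("Requests per day")
plt.ylabel("Frequency")
plt.show()

# Box plot to highlight spread and potential outliers.
sns.boxplot(x=df["requests_per_hour"])
plt.show()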
Based on the insights gained from EDA, I identified relationships between different variables. I
used correlation analysis to identify any relationships between variables and visualized them
using heatmaps or scatter plots. I used Seaborn functions such as heatmap() and scatterplot() to
visualize the relationships between variables.
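For the correlation analysis, a minimal sketch (again assuming numeric request-count columns) could be:

# Correlation matrix over selected numeric columns, visualized as a heatmap.
corr = df[["requests_per_day", "requests_per_hour", "requests_per_user"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Scatter plot of two variables to inspect a single relationship.
sns.scatterplot(x="requests_per_day", y="requests_per_hour", data=df)
plt.show()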
Finally, I drew conclusions based on the insights gained from EDA. I identified areas for
improvement and made data-driven decisions. For example, I found that certain sports were
more popular than others, and we could use this information to allocate resources accordingly.
Data Visualization
To begin with, I used histograms to visualize the distribution of the data. A histogram is a graph
that displays the frequency of a variable in the form of bars. I used histograms to understand the
distribution of the number of requests per day, the number of requests per hour, and the number
of requests per user. By visualizing the distribution of the data, I was able to identify any
patterns or trends.
Next, I used scatter plots to visualize the relationship between two variables. A scatter plot is a
graph that displays the relationship between two variables in the form of dots. I used scatter plots
to understand the relationship between the number of requests per day and the number of
requests per hour. By visualizing the relationship between two variables, I was able to identify
any correlations or trends.
I also used box plots to visualize the distribution of the data. A box plot is a graph that displays
the distribution of a variable in the form of a box. I used box plots to understand the distribution
of the number of requests per day, the number of requests per hour, and the number of requests
per user. By visualizing the distribution of the data, I was able to identify any outliers or
anomalies.
Furthermore, I used heatmaps to visualize the correlation between different variables. A heatmap
is a graph that displays the correlation between different variables in the form of a matrix. I used
heatmaps to understand the correlation between the number of requests per day, the number of
requests per hour, and the number of requests per user. By visualizing the correlation between
different variables, I was able to identify any strong or weak correlations.
Finally, I used bar charts to visualize the distribution of categorical variables. A bar chart is a
graph that displays the frequency of a categorical variable in the form of bars. I used bar charts to
understand the distribution of the number of requests per sport. By visualizing the distribution of
categorical variables, I was able to identify any patterns or trends.
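A short sketch of this bar chart, assuming a sport column derived from the requested resources, might be:

# Count requests per sport and plot the counts as a bar chart.
sport_counts = df["sport"].value_counts()
sport_counts.plot(kind="bar")
plt.xlabel("Sport")
plt.ylabel("Number of requests")
plt.show()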
Predictive Modeling
Predictive modeling is a critical step in the data analysis process that involves using statistical
algorithms and machine learning techniques to identify the underlying patterns in data and make
predictions about future outcomes. In the context of the FunOlympic Games scenario, predictive
modeling can be used to predict the number of viewers for each sport based on historical data.
To build a predictive model, I followed several steps. First, I collected historical data on the
number of viewers for each sport. I then preprocessed the data to ensure that it was in a format
that could be used for predictive modeling. This involved cleaning the data, removing any
missing values, and transforming the data into a suitable format.
Next, I selected the relevant features that would be used to build the predictive model. In this
scenario, I selected features such as the popularity of the sport, the time of day, and the day of
the week. I then selected the appropriate predictive modeling technique. In this scenario, I used a
linear regression model, which is a statistical model that is commonly used for predicting
continuous outcomes.
Once I had selected the appropriate predictive modeling technique, I trained the model using the
historical data. This involved feeding the data into the model and allowing it to learn the
underlying patterns. I then evaluated the performance of the model by testing it on a separate
dataset and measuring its accuracy.
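The sketch below illustrates the training and evaluation step with scikit-learn's LinearRegression. The file name, feature columns, and target column are assumptions for illustration, not the actual historical data.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

history = pd.read_csv("viewer_history.csv")  # assumed historical dataset

# Assumed feature columns and target.
X = history[["sport_popularity", "hour_of_day", "day_of_week"]]
y = history["viewers"]

# Hold out a test set so the model is evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Measure how far the predictions are from the actual viewer counts.
predictions = model.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, predictions))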
After evaluating the performance of the model, I deployed it in a production environment. This
involved integrating the model into the online broadcast platform and using it to make
predictions about the number of viewers for each sport. By using the predictive model, I was
able to make data-driven decisions and improve the overall user experience.
In terms of tools and techniques, I used Python libraries such as Scikit-learn, NumPy, and Pandas
to build the predictive model. I also used Matplotlib and Seaborn to visualize the data and
evaluate the performance of the model.
5. Solution Design Documentation: Present the design documentation relevant to the field of
study that you have created.
6. Testing and Evaluation: This section should detail how you tested your project against the
functional and non-functional requirements. You should provide details of the testing
methodologies, protocols, frameworks, tools, etc., and provide your testing results.
7. Technical Deployment of the Solution: A section describing the technical requirements of the solution, including a summary of any installation and/or deployment procedures in the proposed production environment. It is highly recommended that a screencast is also included.
8. Critical Reflection: Regarding your Planning Documentation and Practitioner Statement,
critically review the effectiveness of implementing the methods and tools adopted during the
entire planning and development cycle and how this will inform and adapt your approach to
client projects in the future.