
STUDENT ASSESSMENT SUBMISSION AND DECLARATION

When submitting evidence for assessment, each student must sign a declaration confirming that
the work is their own.

Student name: Yousef Hindi Assessor name: Ayah Karajah

Issue date (1st submission): 10/4/2023
Submission date (1st submission): 31/5/2023
Submitted on:

In case of resubmission
Issue date: 7/6/2023
Submission date: 10/6/2023
Submitted on:

Programme: Higher National Diploma in Computing

Assignment number and title: Understand and Build Analytical Models

Plagiarism
Plagiarism is a particular form of cheating. Plagiarism must be avoided at all costs and students
who break the rules, however innocently, may be penalised. It is your responsibility to ensure
that you understand correct referencing practices. As a university-level student, you are
expected to use appropriate references throughout and to keep carefully detailed notes of all
your sources for material you have used in your work, including any material
downloaded from the Internet. Please consult the relevant unit lecturer or your course tutor if
you need any further advice.

Student declaration
I certify that the assignment submission is entirely my own work and I fully understand the
consequences of plagiarism. I understand that making a false declaration is a form of
malpractice.

Student signature: Yousef Hindi Date: 4/17/2023

Milestone 1
1.

Predictive analytical models are a category of data analytics


techniques that use historical data to make predictions about
future events or outcomes. These models rely on various statistical
algorithms and machine learning techniques to identify patterns
and trends in the data, which are then used to forecast future
occurrences. Some common predictive analytical models include:

 Regression models: Regression models, such as linear regression,


logistic regression, or polynomial regression, are used to model the
relationship between one or more independent variables (features)
and a dependent variable (target). These models can help predict
continuous or categorical outcomes based on the values of the
independent variables.
 Decision trees: Decision trees are a type of model that recursively
splits the data into subsets based on specific conditions or rules. The
end goal is to create a tree structure that can predict the target
variable based on the input features. Decision trees can be used for
both classification and regression tasks.
 Support vector machines (SVM): SVM is a supervised learning
algorithm that can be used for classification or regression tasks. It
works by finding the optimal hyperplane that separates the data
into different classes, maximizing the margin between the classes.
 Neural networks: Neural networks are a type of model inspired by
the human brain, consisting of layers of interconnected nodes or
neurons. These models are particularly adept at handling large and
complex datasets and can be used for a wide range of predictive
tasks, including image recognition, natural language processing, and
financial forecasting.
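
To make the decision-tree idea above concrete, here is a minimal, hypothetical scikit-learn sketch; the feature names and values are invented for illustration and are not taken from GYC's data.

```python
# Minimal decision-tree sketch (toy data invented for illustration).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "monthly_income": [1500, 5200, 800, 7400, 3100, 6100],
    "credit_score":   [580,  710,  500, 760,  640,  700],
    "bought_car":     [0,    1,    0,   1,    0,    1],
})

X = train[["monthly_income", "credit_score"]]
y = train["bought_car"]

# Fit a shallow tree; each split is a simple rule on one feature.
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Predict the class of a new, unseen customer.
new_customer = pd.DataFrame({"monthly_income": [4000], "credit_score": [690]})
print(model.predict(new_customer))
```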

Prescriptive analytical models go beyond predictive analytics


by not only forecasting future outcomes but also recommending
specific actions to optimize those outcomes. Prescriptive analytics
focuses on finding the best course of action for a given situation,
taking into account various constraints, objectives, and potential
trade-offs. These models often combine optimization techniques,
simulation, and heuristics to determine the most effective solutions
for complex decision-making problems. Some common techniques
used in prescriptive analytics include:

 Linear programming: Linear programming is an optimization


technique used to find the best solution for a problem by minimizing
or maximizing an objective function subject to a set of linear
constraints. This method can be used to allocate resources
efficiently, optimize production schedules, or maximize profits.
 Integer programming: Integer programming is an extension of linear
programming, where some or all of the decision variables are
restricted to integer values. This technique is useful for problems
that involve discrete decisions, such as workforce scheduling,
facility location, or vehicle routing.
 Genetic algorithms: Genetic algorithms are inspired by the process
of natural selection and use a combination of selection, crossover,
and mutation operations to explore the solution space of a problem.
This technique is particularly useful for solving complex
optimization problems with large and diverse search spaces.
 Simulated annealing: Simulated annealing is a stochastic
optimization method that mimics the process of annealing in
metallurgy. The technique uses a combination of random search and
local search to find optimal or near-optimal solutions for complex
problems with many local minima or maxima.
 Monte Carlo simulation: Monte Carlo simulation is a technique that
uses random sampling and statistical modeling to estimate the
probability distribution of an outcome. It can be used to model
uncertainty in decision-making processes, evaluate potential risks,
and estimate the impact of different scenarios on the objective.
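
To illustrate the linear-programming technique listed above, the following is a small, hypothetical sketch using SciPy; the profit figures and resource limits are invented purely for demonstration.

```python
# Tiny linear program with SciPy (hypothetical numbers).
# Maximize profit 40*x1 + 30*x2; linprog minimizes, so the objective is negated.
from scipy.optimize import linprog

c = [-40, -30]              # negated profit per unit of products x1 and x2
A_ub = [[1, 1],             # labour hours:   x1 + x2 <= 40
        [2, 1]]             # machine hours: 2*x1 + x2 <= 60
b_ub = [40, 60]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(res.x, -res.fun)      # optimal production quantities and the maximized profit
```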

Predictive and prescriptive analytical models can


significantly help GYC in optimizing its sales and marketing efforts.
While predictive models use historical data to forecast future
outcomes, prescriptive models go a step further by recommending
specific actions that can be taken to achieve desired results. Here
are four examples of how GYC can use these models in their work:

 Customer Segmentation: By using predictive analytics, GYC can


segment its customers based on their likelihood of purchasing or
renting a car. Factors such as occupation, monthly income, credit
score, and financial status can be used to create a predictive model
that identifies potential customers. GYC can then target its
marketing efforts and offer tailored promotions to these customer
segments, improving the effectiveness of its campaigns.
 Churn Prediction: GYC can use predictive models to identify
customers who are likely to discontinue their business with the
company (i.e., churn). By analyzing factors like customer
satisfaction, frequency of interaction with the company, and other
relevant data points, GYC can predict which customers are at risk of
churning. Once identified, the company can take proactive
measures to retain these customers, such as offering special deals
or improved customer service.
 Inventory Optimization: Prescriptive analytics can be used to help
GYC manage its inventory of new and used cars effectively. By
analyzing sales trends, customer preferences, and external factors
like seasonality and market conditions, GYC can optimize its
inventory levels to ensure it has the right mix of cars to meet
customer demand. This would help the company minimize inventory
costs while maximizing sales opportunities.
 Pricing Strategy: GYC can leverage prescriptive analytics to optimize
its pricing strategy. By analyzing customer price sensitivity,
competitor pricing, and market conditions, the company can
identify the optimal price points for different car models and
customer segments. This would help GYC increase sales and profits
while remaining competitive in the market.
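
The segmentation described above would typically be built as a propensity (classification) model, but clustering is another common way to form customer segments; the sketch below is a hypothetical illustration using scikit-learn's KMeans on two of the attributes mentioned, with invented values.

```python
# Hypothetical clustering-based segmentation sketch (column names and values assumed).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "Monthly Income": [1200, 5400, 900, 7200, 3000, 6500, 1500, 4800],
    "Credit Score":   [560,  720,  510, 770,  630,  740,  590,  700],
})

# Scale the features so income does not dominate the distance calculation.
X = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
customers["Segment"] = kmeans.labels_
print(customers.groupby("Segment").mean())   # profile of each segment
```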

By implementing these predictive and prescriptive analytical


models, GYC can make data-driven decisions to optimize its sales
and marketing efforts, ultimately leading to increased customer
satisfaction, higher sales, and improved profitability.

2.

The following compares three methods in terms of their input, process, output, and typical tools.

Method: Correlation Analysis

Input: Correlation analysis requires two or more quantitative variables as input. For example, in a dataset containing information about a person's age and monthly income, the input for correlation analysis could be the "age" and "monthly income" columns.

Process: Correlation analysis begins with identifying the quantitative variables to analyze and choosing the appropriate correlation method (Pearson, Spearman, or Kendall). A correlation coefficient is calculated to quantify the strength and direction of the relationship, producing a value between -1 (strong negative relationship) and 1 (strong positive relationship). The results are then interpreted, often visualized via a scatter plot or correlation matrix, and statistically tested for significance to determine that the observed correlation is not due to chance. Utilizing statistical software or programming languages simplifies these steps, delivering insights quickly and effectively.

Output: The output of correlation analysis is a correlation matrix with the correlation coefficients, which are numerical values between -1 and 1. Each cell in the matrix indicates the correlation between two variables. A positive value signifies a positive correlation, meaning both variables increase or decrease together, whereas a negative value signifies a negative correlation, meaning as one variable increases, the other decreases, and vice versa. In addition to the correlation matrix, a scatter plot or a heatmap is often used to visually represent the correlation, with stronger correlations highlighted by a different or more intense color. Moreover, the significance (p-value) of the correlation is also an important output. This is used to test the hypothesis that the correlation is significantly different from zero. If the p-value is small (usually less than 0.05), we can reject the null hypothesis and conclude that the correlation is significant. In sum, correlation analysis provides insights about the strength, direction, and significance of the relationships between pairs of variables in the dataset.

Tool: Correlation analysis can be performed with many different statistical software tools, as well as general-purpose programming languages that have statistical capabilities. Here are a few examples:

1. Python: The pandas library provides the `corr()` function, which can be used to compute pairwise correlation of columns.
2. R: The `cor()` function computes correlation between numeric columns in a dataframe.
3. SPSS: A popular statistical software package for the social sciences, which also has the ability to perform correlation analysis.
4. Excel: Microsoft Excel also has the functionality to perform basic correlation analysis between two series of data.
5. SAS: Another popular statistical software suite used in enterprises, which can perform a wide range of statistical analyses, including correlation.
6. Tableau: A data visualization tool that can be used to visualize correlation matrices or correlation scatter plots.
7. Power BI: Similar to Tableau, Power BI also has the ability to create visualizations to understand correlations.

The choice of tool largely depends on the scale of the analysis, the complexity of the data, and the preferences or proficiency of the user.
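
A minimal sketch of the Python option from the Tool list, using pandas' `corr()` (and SciPy for the p-value mentioned in the Output description) on toy data mirroring the age/income example:

```python
# Correlation-analysis sketch (toy data mirroring the age/income example above).
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age":            [23, 31, 45, 52, 28, 39],
    "monthly income": [1800, 2600, 4100, 4800, 2200, 3500],
})

print(df.corr(method="pearson"))   # correlation matrix; "spearman"/"kendall" also work

r, p_value = stats.pearsonr(df["age"], df["monthly income"])
print(r, p_value)                  # coefficient and its significance (p-value)
```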
Method: Logistic Regression

Input: Logistic regression is used for binary classification tasks, so the inputs are features that characterize the instances. For example, if you're predicting whether a customer will buy a product or not, you might use features like age, income, and past purchasing behavior.

Process: The typical process for logistic regression is as follows:
1. Feature Selection
2. Data Preprocessing
3. Model Training: Fit the logistic regression model to your data. This involves learning the weights for each feature that best predict the output label from the input features.
4. Model Evaluation: Evaluate the performance of the model using appropriate metrics (e.g., accuracy, precision, recall, F1-score, ROC AUC). This often involves using a separate test set that was not used during training.
5. Model Optimization: Fine-tune the model or adjust the parameters to improve its predictive performance. This could involve techniques like regularization to prevent overfitting.

Output: The output of a logistic regression model is a probability that the given input point belongs to a certain class.

Tool: Python: Scikit-Learn's LogisticRegression function can be used to implement logistic regression.
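
A minimal sketch of the scikit-learn route named in the Tool entry, following the training and evaluation steps above on invented toy data:

```python
# Logistic-regression sketch (toy data; features and labels invented for illustration).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "age":      [22, 35, 47, 52, 29, 41, 33, 58],
    "income":   [1500, 3200, 5400, 6100, 2100, 4700, 2800, 7000],
    "will_buy": [0, 0, 1, 1, 0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    data[["age", "income"]], data["will_buy"], test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # model training
probabilities = model.predict_proba(X_test)[:, 1]                 # probability of class 1
print(accuracy_score(y_test, model.predict(X_test)), probabilities)
```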
Method: Random Forests

Input: Random forests can be used for both regression and classification tasks. The inputs are features that characterize the instances. For example, if you're predicting house prices, you might use features like size, location, and age of the house.

Process: The process for random forests is similar to logistic regression, but with some differences in the training phase:
 Feature Selection
 Data Preprocessing
 Model Training: Fit the random forest model to your data. This involves building many decision trees and training each one on a different subset of the data.
 Evaluate the performance of the model using appropriate metrics. For regression tasks, this might be mean squared error or mean absolute error. For classification tasks, this might be accuracy, precision, recall, F1-score, or ROC AUC.
 Fine-tune the model or adjust the parameters to improve its predictive performance. This could involve adjusting the number of trees in the forest, the maximum depth of the trees, or other parameters.

Output: The output of a random forest depends on whether it's used for a regression or a classification task. For regression tasks, the output is a continuous value. For classification tasks, the output is a class label.

Tool: Python: Scikit-Learn's RandomForestClassifier and RandomForestRegressor functions can be used to implement random forests for classification and regression tasks, respectively.
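
A matching sketch for the Tool entry, here for the classification case (the regression case is analogous with RandomForestRegressor); the data is invented for illustration:

```python
# Random-forest sketch (classification variant; toy data invented for illustration).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

houses = pd.DataFrame({
    "size_m2":   [60, 120, 85, 200, 45, 150],
    "age_years": [30, 5, 15, 2, 50, 8],
    "expensive": [0, 1, 0, 1, 0, 1],    # class label; a price column would suit regression
})

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(houses[["size_m2", "age_years"]], houses["expensive"])

new_house = pd.DataFrame({"size_m2": [100], "age_years": [10]})
print(model.predict(new_house))         # predicted class label for the new house
```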

3)
Types of models:

1. Descriptive Models: These models describe the relationship
between different variables in the dataset. They're typically
used for understanding underlying processes or behaviors.
Examples include clustering and association rule mining.

2. Predictive Models: These models use known results to


develop (or train) a model that can predict the values for
different or new data. Examples include regression models,
decision trees, random forests, support vector machines, and
neural networks.

3. Prescriptive Models: These models suggest actions to


benefit from predictions and answer the question: "What
should be done?" They prescribe what action to take to
eliminate a future problem or take full advantage of a
promising trend. Examples include optimization models and
decision tree models.

4. Generative Models: These models try to learn the true data


distribution of the training set so as to generate new data
points. They often provide a probabilistic description of the
observations. Examples include Gaussian Mixture Models
and Generative Adversarial Networks (GANs).

5. Discriminative Models: These models differentiate between


different kinds of data instances. They learn the boundaries
between different classes in a dataset. Examples include
Logistic Regression and Support Vector Machines.

6. Supervised Models: These models are trained on a labeled


dataset, i.e., a dataset where the target variable (or outcome)
is known. Examples include linear regression, logistic
regression, and deep neural networks.

7. Unsupervised Models: These models work on unlabeled


data or data of unknown structure. They are used to draw
inferences from datasets consisting of input data without
labeled responses. Examples include k-means clustering and
hierarchical clustering.

8. Semi-Supervised Models: These models use a combination
of a small amount of labeled data and a large amount of
unlabeled data during training. Examples include self-training
models and multi-view training models.

9. Reinforcement Learning Models: These models learn how


to perform an action from experience. They are trained using
rewards and punishments derived from their actions and
state transitions. Examples include Q-learning and Deep Q
Networks.

Each type of analytic model – descriptive, predictive, and


prescriptive – has its own advantages and disadvantages:

Descriptive Analytics:

Pros:

1. Simplicity: Descriptive analytics are generally simpler and


quicker to implement as they summarize raw data and
present interpretable information.

2. Foundation for Other Analytics: They form the base of the


pyramid of business analytics, which further includes
predictive and prescriptive analytics. They help provide
context for the more complex forms of analytics.

Cons:

1. Lack of Deep Insights: Descriptive analytics only look at


what has happened, they don't provide insights into why it
happened or what will happen in the future.

2. Dependence on Quality of Data: The conclusions drawn


from descriptive analytics are only as good as the data
collected. Poor data quality can lead to misleading or
inaccurate conclusions.

Predictive Analytics:

Pros:

1. Future Insights: Predictive analytics provide forecasts about
the future, which can help businesses plan and make
proactive decisions.

2. Risk Mitigation: They can help identify potential risks and


opportunities, thereby allowing organizations to mitigate
risks and take advantage of potential opportunities.

Cons:

1. Data Requirements: Predictive models require large


amounts of accurate historical data for training. If data
quality is poor, the predictions will likely be inaccurate.

2. Complexity: Developing and implementing predictive


models can be complex and require specialized skills.

Prescriptive Analytics:

Pros:

1. Actionable Insights: Prescriptive analytics not only predict


future outcomes but also suggest actions to benefit from the
predictions.

2. Optimization: They can help organizations optimize their


operations, strategies, and decision-making processes.

Cons:

1. Complexity: Prescriptive analytics are the most complex


form of analytics and require sophisticated tools, algorithms,
and skilled personnel.

2. Implementation Challenges: The recommendations from


prescriptive analytics might be theoretically optimal but
could be difficult or impractical to implement due to real-
world constraints.

4)
Analytical models play a crucial role in optimizing business
decisions and strategies. They offer an evidence-based approach to
understanding complex systems and forecasting future trends,
enabling organizations to be more proactive and strategic in their
decision-making.

1. Increased Efficiency: Analytical models can significantly


increase operational efficiency. They enable businesses to
identify bottlenecks in their operations and provide insight into
areas where the process can be streamlined. For example,
descriptive analytics can help identify patterns and trends in
sales, allowing businesses to better align their production or
inventory management with consumer demand.

2. Risk Mitigation: Predictive analytics help identify potential risks


and opportunities. By forecasting future trends or events,
organizations can proactively develop strategies to mitigate
risks. For instance, a predictive model could warn a financial
institution about a potential default from a client, enabling them
to take preemptive action.

3. Improved Decision-Making: Analytical models can significantly


enhance decision-making capabilities by providing data-backed
insights. They help decision-makers understand patterns,
correlations, and trends in their data, leading to more informed
and strategic decisions. Prescriptive analytics takes this a step
further by not just predicting future outcomes but also providing
recommendations on the best course of action to take.

4. Competitive Advantage: Organizations that effectively leverage


analytical models often gain a competitive edge. They are better
equipped to understand market trends, customer behavior, and
operational efficiency, which can drive innovation and keep
them ahead of their competitors. Predictive analytics can also
provide insights into future market trends, allowing companies
to be the first-movers in leveraging new opportunities.

5. Cost Savings: Analytical models can identify inefficiencies and
wastage in operations, leading to significant cost savings. They
can help optimize resource allocation, improve supply chain
management, and reduce downtime, all of which can lead to
reduced operational costs.

6. Enhanced Customer Experience: With the help of analytical


models, businesses can gain a deeper understanding of their
customers' behavior and preferences, enabling them to deliver a
more personalized customer experience. For example,
predictive analytics can help identify what products or services a
customer is likely to be interested in, allowing for more targeted
marketing efforts.

Despite the complexities and challenges associated with


implementing and managing analytical models, the benefits they
offer are substantial. They can transform data into actionable
insights, drive efficiency, and facilitate more strategic decision-
making. As data continues to grow in volume and complexity, the
use of these models will likely become even more critical in the
business world. Organizations that can effectively leverage these
tools will be well-positioned to gain a competitive edge, improve
their operations, and drive innovation.

Milestone 2

1)

The average credit score represents the overall creditworthiness of


a group of individuals or entities.

A scatter chart with credit score on the X-axis, monthly income on
the Y-axis, and the legend indicating car ownership (Yes/No) can
provide insights into the relationship between credit score, income,
and car ownership.

1. Credit Score vs. Monthly Income:

- Scatter distribution: The scatter plot will show individual data


points representing individuals or entities, with each point
positioned based on their credit score and corresponding monthly
income. This visual representation helps identify any patterns or
trends.

- Correlation: Observing the scatter plot, you can assess the


correlation between credit score and monthly income. A positive
correlation would indicate that as credit score increases, monthly
income tends to increase as well.

2. Car Ownership:

- Legend differentiation: The legend indicating car ownership


(Yes/No) provides an additional dimension to the scatter plot. It
helps distinguish between individuals who own a car (marked as
'Yes') and those who do not (marked as 'No').

- Grouping and comparison: By using different colors or symbols


for car ownership, you can easily identify and compare the income
and credit score distribution for car owners and non-car owners.

3. Insights:

- Income levels: Analyzing the scatter plot can reveal if there are
income patterns among car owners and non-car owners. You can
observe whether car owners generally have higher or lower
incomes compared to non-car owners.

- Credit score patterns: By examining the scatter plot, you can


determine if there is any association between credit score and car
ownership. It may show whether individuals with higher credit
scores are more likely to own cars.

- Outliers and exceptions: The scatter plot might highlight outliers,


which are data points that deviate significantly from the general
trend. These outliers can provide insights into unique cases where
individuals with low credit scores or incomes still own cars or vice
versa.

4. Decision-making:

- Risk assessment: The scatter plot can aid in assessing the risk
associated with lending to individuals based on their credit scores,
incomes, and car ownership. It can help identify high-income
individuals with low credit scores who own cars or individuals with
high credit scores and incomes who do not own cars.

- Marketing and targeting: The scatter plot can assist in


understanding the target audience for car-related products or
services. It can identify potential customers based on their credit
scores and incomes, helping businesses tailor their marketing
strategies accordingly.
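
A hedged sketch of how such a scatter chart could be drawn with pandas and matplotlib, assuming the data sits in a DataFrame with 'Credit Score', 'Monthly Income', and 'Car' (Yes/No) columns; the file name is hypothetical:

```python
# Scatter-chart sketch (column names taken from the discussion; file name hypothetical).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("gyc_customers.csv")

fig, ax = plt.subplots()
for owns_car, group in df.groupby("Car"):      # one series per legend entry ("Yes"/"No")
    ax.scatter(group["Credit Score"], group["Monthly Income"], label=owns_car, alpha=0.6)

ax.set_xlabel("Credit Score")
ax.set_ylabel("Monthly Income")
ax.legend(title="Car ownership")
plt.show()
```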

A stacked column chart with the X-axis representing different
occupations and the Y-axis displaying the count of entities can
provide valuable insights into the distribution of entities across
various occupations.

1. Occupation Distribution:

- Visual representation: The stacked column chart represents


different occupations as distinct categories along the X-axis. The
height of each column corresponds to the count of entities
belonging to that specific occupation.

- Comparison of occupations: The chart allows for a visual


comparison of the distribution of entities across different
occupations. You can observe the relative sizes of the columns and
identify which occupations have a higher or lower number of
entities.

2. Insights from the Chart:

- Occupation popularity: The chart reveals the popularity or


prevalence of different occupations within the dataset. You can
identify which occupations have a larger number of entities,
suggesting their higher representation or demand.

- Occupations with significant presence: By examining the tallest


columns, you can identify occupations that have a substantial
number of entities associated with them. This information can be
useful for understanding dominant professions within a specific
dataset or industry.

- Occupations with low representation: Conversely, you can also


identify occupations with relatively smaller column heights,
indicating a lower count of entities. This can be insightful for
identifying niche or specialized professions within the dataset.

3. Comparison between Occupation Categories:

- Proportional representation: The stacked column chart enables


a comparison of the relative distribution of entities across different
occupation categories. By comparing the heights of the columns
within each occupation category, you can observe which specific
occupations are more prevalent within that category.

- Occupation mix: The chart allows you to assess the diversity or


concentration of occupations. If a single occupation dominates a
category with a significantly larger count, it suggests a higher
concentration of entities in that particular profession within the
dataset.

4. Decision-making and Analysis:

- Workforce planning: The chart can aid in workforce planning by


providing insights into the representation of various occupations. It
helps identify occupations that have a high demand or are
underrepresented, allowing organizations to focus their
recruitment or training efforts accordingly.

- Industry analysis: Analyzing the distribution of occupations can


help assess the composition of the workforce within a specific
industry or sector. It provides a snapshot of the types of
occupations prevalent within the dataset, allowing for industry-
specific analysis and decision-making.

- Diversity and inclusion: The chart can also be used to analyze
diversity and inclusion efforts within an organization. By assessing
the representation of different occupations, organizations can
identify areas where diversity may be lacking and take steps to
promote inclusivity.

The median of a column called "Number of Children" represents


the middle value when the values are arranged in ascending or
descending order. Here's a discussion of the median and its
significance:

1. Central tendency: The median is a measure of central tendency


that helps identify the typical or representative value within the
"Number of Children" column. Unlike the mean, which can be
influenced by extreme values, the median provides a more robust
representation of the middle value.

2. Balancing effect: When considering the number of children, the


median helps balance out the impact of extreme values or outliers
that may skew the distribution. It focuses on the value that divides
the data into two equal halves, providing a more stable
representation of the typical number of children.

3. Assessing family size: The median of the "Number of Children"


column gives insights into the typical family size within the dataset.
It helps identify the point at which half of the values lie above and
half lie below, giving a sense of the common number of children
within the group.

4. Distribution analysis: By comparing the median with other
measures of central tendency, such as the mean or mode, you can
gain insights into the distribution of the "Number of Children"
column. If the median is close to the mean, it suggests a relatively
symmetrical distribution, while a significant difference might
indicate skewness.

5. Decision-making and planning: The median can be valuable for


decision-making and planning purposes related to family-oriented
services, housing, education, or social policy. For example, if
analyzing a dataset related to housing, the median number of
children can help inform decisions on the design and size of family-
friendly homes or the allocation of resources for schools and
childcare facilities.

6. Comparison and benchmarks: The median can also be used to


compare different groups or subgroups within the dataset. By
calculating the median for subsets of the data, such as different age
groups or regions, you can compare the typical number of children
and identify variations or trends across those groups.

A pie chart with the legend indicating finance status and the values
representing the count of entities can provide insights into the
distribution of entities based on their finance status.

1. Finance Status Distribution:

- Visual representation: The pie chart displays different finance


statuses as distinct categories, represented by slices of the pie. The
size of each slice corresponds to the proportionate count of entities
associated with that particular finance status.

- Overall distribution: The chart provides a quick and intuitive


visualization of the distribution of entities across different finance
statuses, showcasing the relative sizes of each category.

2. Insights from the Chart:

- Finance status breakdown: The pie chart allows you to observe


the composition of finance statuses within the dataset. You can
identify the different finance status categories and their relative
prevalence.

- Proportional representation: By comparing the sizes of the pie


slices, you can gain insights into the relative distribution of entities
across finance statuses. This can help identify dominant or
significant finance status categories within the dataset.

- Minority or majority statuses: The chart can reveal whether


certain finance statuses are more prevalent or less common. It
enables you to identify if there is a majority finance status category
that encompasses a significant proportion of the entities, or if there
are multiple finance status categories with comparable
representation.

3. Decision-making and Analysis:

- Financial analysis: The chart can aid in analyzing the financial
health or stability of the entities within the dataset. By observing
the distribution across finance statuses, you can gain insights into
the overall financial condition, such as the prevalence of entities
with strong financial status versus those with weaker financial
status.

- Risk assessment: The pie chart can assist in assessing the risk
associated with entities based on their finance status. It helps
identify the proportion of entities in different financial conditions,
which can guide decision-making processes related to lending,
investments, or partnerships.

- Targeting and segmentation: Understanding the distribution of


finance statuses can inform targeted marketing or outreach efforts.
By identifying the prevalent finance statuses, organizations can
tailor their strategies to address the specific needs or challenges
associated with each category.

4. Comparison and Context:

- Benchmarking: The pie chart allows for comparisons between


different finance status categories. You can assess the relative sizes
of the slices and compare the distribution of entities across various
contexts, such as comparing finance statuses between different
industries or geographic regions.

- Temporal analysis: If the dataset includes time-based


information, comparing pie charts over different time periods can
reveal trends or changes in finance statuses. This can be useful for
tracking improvements or deterioration in financial conditions.

A 100% stacked bar chart with the Y-axis representing car
ownership, the X-axis representing years of employment, and the
legend indicating the number of children can provide insights into
the relationship between these variables.

1. Car Ownership and Years of Employment:

- Visualization: The 100% stacked bar chart displays the


relationship between car ownership and years of employment.
Each bar represents a specific value on the X-axis (years of
employment), and the height of each segment within the bar
indicates the proportion of individuals with a particular car
ownership status (e.g., Yes or No).

- Comparative analysis: The chart allows for a comparison of car


ownership proportions across different years of employment. You
can observe how car ownership varies based on the duration of
employment.

2. Insights from the Chart:

- Car ownership trends: By examining the segments within each


bar, you can identify patterns or trends in car ownership based on
years of employment. This can reveal whether car ownership tends
to increase or decrease as the number of years of employment
progresses.

- Car ownership distribution: The chart provides insights into the


distribution of car ownership status across various levels of
employment experience. It can reveal which groups (based on
years of employment) are more likely to own a car or have a higher
proportion of car owners.

- Employment tenure impact: Analyzing the chart can help


understand how years of employment influence car ownership
decisions. It allows for the observation of any associations between
longer employment tenures and higher car ownership rates.

3. Number of Children and Car Ownership:

- Legend differentiation: The legend indicating the number of


children allows for further differentiation within each car ownership
segment. It provides insight into how the number of children
impacts car ownership, allowing for a more detailed analysis.

- Comparative analysis: By comparing the segments within each


bar, you can observe how the number of children influences car
ownership proportions across different years of employment.

4. Decision-making and Analysis:

- Understanding car ownership patterns: The 100% stacked bar


chart can aid in understanding the relationship between car
ownership, years of employment, and the number of children. It
can inform decisions related to transportation planning, vehicle
financing, or marketing strategies targeting specific groups based
on their car ownership status.

- Family dynamics and car ownership: Analyzing the impact of the


number of children on car ownership can provide insights into
family dynamics and financial considerations. It can guide decisions
related to family-oriented services, such as designing family-friendly
vehicles or determining the target market for child-related
products.

2)
In the data preparation phase, several crucial activities are performed to refine raw data into a format that can be easily and effectively analyzed.

The first of these activities is data cleaning, which involves identifying and rectifying any errors, inconsistencies, or inaccuracies within the data. These could include incorrect data entries, missing or incomplete data, or discrepancies among data sources. Through data cleaning, we strive to enhance the overall accuracy and reliability of the dataset.

Next is data discretization, which refers to the process of converting continuous data into discrete or categorical data. Discretization simplifies data and facilitates its comprehension and analysis by transforming it into a form that is more understandable and usable for specific analytical methods. For example, a continuous range of ages could be discretized into age groups such as '18-24', '25-34', and so on.

Following discretization, data aggregation is carried out. This process entails combining data in a way that provides a comprehensive, summarized view of the dataset. Data is grouped based on certain attributes or conditions, such as time, location, or specific categories. This step allows for a broader analysis, highlighting patterns and trends at a high level.

Lastly, data reduction is performed to decrease the volume of data that needs to be analyzed, while still maintaining the integrity and usefulness of the dataset for analysis. This can be achieved through methods such as attribute selection, where irrelevant or redundant attributes are removed, or through dimensionality reduction techniques, which reduce the number of random variables under consideration. This process makes subsequent data analysis more efficient and manageable, without compromising the quality of the insights gained.

Through these systematic steps, the data preparation phase prepares raw data to be thoroughly and accurately analyzed, thereby facilitating informed, data-driven decision making.

3)

Data Preparation & Cleaning:

This block of code performs data cleaning and conversion on
the 'Monthly Income' column of a DataFrame named
'dataset'. Firstly, it removes dollar sign symbols and leading
or trailing whitespace from each entry in 'Monthly Income'. It
then defines a function, `convert_income`, which takes as
input a string that represents income. This function removes
the 'usd' string and commas from the income string, converts
it to lower case, and strips off any remaining whitespace. If
the income string is empty, the function returns NaN. If the
string contains the letter 'k' (representing thousands), this
letter is removed and the remaining number is converted to
a float and multiplied by 1000. If the income string contains
no 'k', it is simply converted to a float. Finally, the
`convert_income` function is applied to all entries in the
'Monthly Income' column twice, effectively converting all
income entries to numerical format in the original units (i.e.,
dollars, not thousands of dollars).
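
The original code block is not reproduced in this document, so the following is a hedged reconstruction consistent with the description above; the sample values are invented.

```python
# Reconstruction of the 'Monthly Income' cleaning step (details inferred from the text).
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame.
dataset = pd.DataFrame({'Monthly Income': ['$1,500 ', '2.5k usd', '', '$4000']})

# Remove dollar signs and surrounding whitespace.
dataset['Monthly Income'] = (dataset['Monthly Income'].astype(str)
                             .str.replace('$', '', regex=False).str.strip())

def convert_income(income):
    income = str(income).lower().replace('usd', '').replace(',', '').strip()
    if income in ('', 'nan'):
        return np.nan
    if 'k' in income:                                   # e.g. '2.5k' -> 2500.0
        return float(income.replace('k', '')) * 1000
    return float(income)

dataset['Monthly Income'] = dataset['Monthly Income'].apply(convert_income)
print(dataset)
```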

This code is transforming the 'Number of Children' and


'Years of Employment' columns of the 'dataset' DataFrame to
a numerical data type. Initially, the 'Number of Children'
column is converted to a numeric data type using the
`pd.to_numeric` function, which converts any invalid parsing
to NaN (as specified by the 'coerce' argument). The data
types of all columns in the dataset are then printed for
verification. The 'Years of Employment' column is processed
next. It is first converted to a string data type, then the
`str.extract` function is used to extract numerical part from
the string entries (assumed to be sequences of digits). The
resulting 'Years of Employment' column, now consisting only
of numbers, is then converted to a numeric data type, and
the data types of all columns are printed again to verify
successful conversion.
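
Again, a hedged reconstruction of the described step (the exact original code is not shown), using invented sample values:

```python
# Reconstruction of the numeric-conversion step for the two columns described above.
import pandas as pd

dataset = pd.DataFrame({
    'Number of Children':  ['2', 'three', '0', None],
    'Years of Employment': ['5 years', '12', 'approx. 3 years', None],
})

dataset['Number of Children'] = pd.to_numeric(dataset['Number of Children'], errors='coerce')
print(dataset.dtypes)

# Extract the digit sequence from free-text entries such as '5 years', then convert.
dataset['Years of Employment'] = pd.to_numeric(
    dataset['Years of Employment'].astype(str).str.extract(r'(\d+)')[0], errors='coerce')
print(dataset.dtypes)
```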

The code is designed to handle missing values in the 'dataset'


DataFrame. It specifically focuses on four numerical columns:
'Monthly Income', 'Credit Score', 'Years of Employment', and
'Number of Children'. For each of these columns, the `.fillna`
method is used to replace any missing values (NaNs) with the
mean of the respective column. This is a common technique
for handling missing numerical data as it maintains the
overall distribution of values in the column. The
`inplace=True` argument ensures that the changes are
made directly in the original DataFrame. After handling the
missing values, the 'Number of Children' column is converted
to integer type using the `astype(int)` method, effectively
removing any decimal places from the values in this column.
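
A hedged reconstruction of the numerical missing-value handling described above (toy values; the text mentions `inplace=True`, which the reassignment used here is equivalent to):

```python
# Reconstruction of the missing-value handling for the four numerical columns.
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    'Monthly Income':      [1500.0, np.nan, 4000.0],
    'Credit Score':        [610.0, 720.0, np.nan],
    'Years of Employment': [np.nan, 5.0, 12.0],
    'Number of Children':  [2.0, np.nan, 1.0],
})

numeric_cols = ['Monthly Income', 'Credit Score', 'Years of Employment', 'Number of Children']
for col in numeric_cols:
    dataset[col] = dataset[col].fillna(dataset[col].mean())   # replace NaNs with the column mean

# With no NaNs left, the child count can be stored as an integer.
dataset['Number of Children'] = dataset['Number of Children'].astype(int)
print(dataset)
```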

The provided code block is managing missing values in the


categorical columns of the 'dataset' DataFrame. These
columns are 'Occupation', 'Finance Status', 'Finance History',
and 'Car'. For each column, the code replaces any missing
values (NaNs) with the mode of the respective column, which
is the most frequently occurring value. This is a common
technique for handling missing categorical data, as it doesn't
skew the distribution of the categories. The '[0]' following
'mode()' is necessary because 'mode()' returns a Series, and
we only want the first (and in many cases, the only) value
from that Series. The changes are made directly in the
original DataFrame due to the 'inplace=True' parameter.
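
A hedged reconstruction of the described step, with invented sample values:

```python
# Reconstruction of the categorical missing-value handling described above (toy values).
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    'Occupation':      ['Engineer', np.nan, 'Teacher', 'Engineer'],
    'Finance Status':  ['Good', 'Good', np.nan, 'Poor'],
    'Finance History': [np.nan, 'Clean', 'Clean', 'Defaulted'],
    'Car':             ['Yes', 'No', np.nan, 'Yes'],
})

for col in ['Occupation', 'Finance Status', 'Finance History', 'Car']:
    # mode() returns a Series, so [0] selects the most frequent value.
    dataset[col] = dataset[col].fillna(dataset[col].mode()[0])

print(dataset)
```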

This Python code block leverages the `LabelEncoder`


functionality from the `sklearn` library to convert categorical
data, or text data, into numerical data that a machine
learning model can understand. The process begins with the
creation of an instance of `LabelEncoder`, which is assigned
to the variable `le`. Then, the code selects all categorical
features from the dataset by looking for all columns of the
type 'object' (which typically means they contain strings), and
stores them in `categorical_features`. Next, the code iterates
over each feature in `categorical_features`. For each feature,
it applies the `fit_transform` function from the LabelEncoder
`le` to the column in the original dataset. This function
effectively learns all unique labels in the column and maps
them to a numerical value. These numerical values then
replace the original categorical values in the dataset. As a
result, all the categorical features in the dataset are
transformed into numerical representations that can be
understood by a machine learning algorithm.
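
A hedged reconstruction of the label-encoding step, with invented sample values:

```python
# Reconstruction of the label-encoding step described above (toy values).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

dataset = pd.DataFrame({
    'Occupation':   ['Engineer', 'Teacher', 'Engineer', 'Nurse'],
    'Car':          ['Yes', 'No', 'Yes', 'No'],
    'Credit Score': [610, 720, 680, 590],        # numeric column, left untouched
})

le = LabelEncoder()
categorical_features = dataset.select_dtypes(include='object').columns

for feature in categorical_features:
    dataset[feature] = le.fit_transform(dataset[feature])   # map each unique label to an integer

print(dataset)
```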

Data Discretization:

This code snippet is an example of data discretization. It


takes the 'Monthly Income' column of the dataset and divides
it into three discrete categories - 'Low Income', 'Middle
Income', and 'High Income'. These categories are determined
by the specified ranges in 'bins_income'. The 'pd.cut()'
function is used for this purpose, which bins values into
discrete intervals.

The function creates a new column 'Income Category' in the


dataset where each person's income is categorized based on
the interval it falls into. For instance, a person with a monthly
income of $1500 will fall into the 'Low Income' category,
whereas a person with a monthly income of $5000 will be
classified as 'Middle Income', and so on. This technique of
discretization is particularly useful in turning a continuous
variable into a categorical one, thus simplifying the data and
aiding in creating more interpretable models and analyses.

This code is another instance of data discretization. It


categorizes the 'Credit Score' column of the dataset into four
distinct groups - 'Poor', 'Fair', 'Good', and 'Excellent'. The
credit scores are divided into these categories based on the
ranges provided in 'bins_credit_score'.

The function 'pd.cut()' is used, creating a new column in the


dataset named 'Credit Score Category', where each person's
credit score is grouped based on the interval it belongs to.
For example, a person with a credit score of 550 will fall into
the 'Poor' category, while a person with a credit score of 750
will be classified as 'Good', and so forth.

Data discretization in this way is beneficial in transforming a


continuous attribute, such as credit score, into a categorical
one. It aids in simplifying the data and facilitates the creation
of easier to interpret models and analyses. It's particularly
useful when there are clear thresholds or levels in the data,
as is the case with credit scores.
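
A hedged reconstruction covering both discretization steps; the bin edges below are illustrative and chosen to match the worked examples in the text (e.g., $1,500 as 'Low Income', a 750 score as 'Good'), not copied from the original code:

```python
# Reconstruction of the two pd.cut() discretization steps (bin edges illustrative).
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'Monthly Income': [1500, 5000, 12000],
                        'Credit Score':   [550, 680, 750]})

bins_income = [0, 3000, 8000, np.inf]
labels_income = ['Low Income', 'Middle Income', 'High Income']
dataset['Income Category'] = pd.cut(dataset['Monthly Income'],
                                    bins=bins_income, labels=labels_income)

bins_credit_score = [0, 600, 700, 800, 900]
labels_credit = ['Poor', 'Fair', 'Good', 'Excellent']
dataset['Credit Score Category'] = pd.cut(dataset['Credit Score'],
                                          bins=bins_credit_score, labels=labels_credit)

print(dataset)
```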

Data Aggregation:

This code is an example of data aggregation. It calculates the


mean 'Monthly Income' and 'Credit Score' for each unique
'Occupation' in the dataset. The `groupby()` function is used
to group the data based on 'Occupation', and then the
`mean()` function calculates the average of 'Monthly Income'
and 'Credit Score' for each group.

Data aggregation is a process where information is gathered


and expressed in a summary form, typically for statistical
analysis. Here, the aim is to get a more granular
understanding of the average 'Monthly Income' and 'Credit
Score' according to the different occupations in the dataset.
This allows us to compare these values across different
occupations, which could reveal useful insights and patterns.
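
A hedged reconstruction of the described aggregation, with invented sample values:

```python
# Reconstruction of the aggregation step described above (toy values).
import pandas as pd

dataset = pd.DataFrame({
    'Occupation':     ['Engineer', 'Teacher', 'Engineer', 'Nurse', 'Teacher'],
    'Monthly Income': [5200, 2800, 6100, 3500, 3000],
    'Credit Score':   [710, 650, 730, 680, 660],
})

occupation_summary = dataset.groupby('Occupation')[['Monthly Income', 'Credit Score']].mean()
print(occupation_summary)
```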

Data Reduction:

This code block is an example of feature selection, a data


reduction technique. In this code, a feature selection process
is applied to the dataset to identify the best features that will
contribute to the model's predictive performance.
`SelectKBest` from the `sklearn.feature_selection` module
is used with the `chi2` score function to select the best 5
features.

Firstly, the 'Income Category' and 'Credit Score Category'


columns are transformed using the LabelEncoder `le`, which
converts categorical data into numerical data.

Next, the `fit` method is applied to the feature selector


(`best_features`) using the feature matrix `X` and target
vector `y`. This will rank the features based on their
importance, determined by the chi-squared (`chi2`)
statistical test for non-negative features.
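
A hedged reconstruction of the feature-selection step; the original code is not shown, so the target is assumed to be the encoded 'Car' column and the sample values are invented (all non-negative, as chi-squared requires):

```python
# Reconstruction of the SelectKBest / chi2 feature-selection step described above.
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

dataset = pd.DataFrame({
    'Monthly Income':        [1500, 5000, 12000, 800, 7000, 3000],
    'Credit Score':          [550, 680, 750, 520, 710, 640],
    'Years of Employment':   [1, 6, 15, 0, 10, 4],
    'Number of Children':    [0, 2, 3, 1, 2, 0],
    'Income Category':       ['Low Income', 'Middle Income', 'High Income',
                              'Low Income', 'Middle Income', 'Low Income'],
    'Credit Score Category': ['Poor', 'Fair', 'Good', 'Poor', 'Good', 'Fair'],
    'Car':                   ['No', 'Yes', 'Yes', 'No', 'Yes', 'No'],
})

le = LabelEncoder()
for col in ['Income Category', 'Credit Score Category', 'Car']:
    dataset[col] = le.fit_transform(dataset[col])

X = dataset.drop(columns='Car')        # feature matrix
y = dataset['Car']                     # assumed target vector

best_features = SelectKBest(score_func=chi2, k=5)
best_features.fit(X, y)

# Rank the features by their chi-squared score.
print(pd.Series(best_features.scores_, index=X.columns).sort_values(ascending=False))
```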

4)
1. Handling of Mixed Data Types: Certain columns, such as 'Monthly
Income', 'Number of Children', and 'Years of Employment', that were
expected to contain numeric values actually contained mixed data
types (including strings and numbers). This required a substantial
amount of data cleaning, such as removing unnecessary characters
and converting data types.

2. Handling of Missing Values: The dataset had a significant number
of missing values, which can greatly affect the quality of the analysis.
Missing values in the numerical columns were filled with the column
mean, while those in the categorical columns were filled with the
mode. For the 'Number of Children' column, NA values remained when
changing it to an integer type, and these had to be filled before the
conversion could succeed.

3. Encoding Categorical Values: The dataset contained categorical


variables ('Income Category' and 'Credit Score Category') that had to
be converted to numerical ones. This step is crucial for the machine
learning algorithms to process the data. The error "could not
convert string to float: 'High Income'" came up because the
algorithm tried to convert a category into a number, which was
resolved by applying encoding methods.

5)
Data preparation is a crucial step in the data analysis process and
here are the main reasons:

1. Improving Data Quality: Data collected from various sources


often contains errors, inconsistencies, and missing values. Data
preparation helps in identifying and correcting these issues to
improve the overall data quality. High-quality data leads to more
reliable and accurate results.

2. Enabling Better Decision Making: Clean and well-prepared


data provides a better foundation for analysis, leading to more
informed decision-making. For instance, it can provide insights
into patterns and trends that are not immediately apparent.

3. Ensuring Data Consistency: Datasets often contain


inconsistencies, especially when the data is collected from
multiple sources. Data preparation ensures consistency, which is
critical for a valid analysis.

4. Reducing Errors in Modeling: In machine learning and data


mining, the quality of the input data directly impacts the output.
Errors in data can lead to incorrect predictions and insights. By
preparing data, you minimize the potential for error in the
modeling phase.

5. Facilitating Data Integration: Data preparation includes


combining data from different sources (data integration). This
process is essential when working with big data, where you
might need to combine structured and unstructured data from
various sources.

6. Improving Efficiency: Data preparation can also help to


improve efficiency. By automating the data preparation process,
you can save time and resources that can be better used for
data analysis and interpretation.

The meticulous attention given to the data preparation phase in
any data-driven project is well justified considering its profound
impact on the outcomes of the analysis. It serves as the foundation
for the entire analytical process, inherently influencing the quality
and accuracy of the results. Without it, the risk of generating
misleading conclusions or erroneous predictive models increases
substantially due to inherent errors, inconsistencies, or missing
values within the raw data. Moreover, effective data preparation
enhances the efficiency of the analysis, mitigating the need for
repeated corrections during later stages, thereby saving valuable
time and resources. Lastly, it ensures the relevance of the data
being analyzed, allowing us to direct our focus toward variables
pertinent to the study's objectives, thereby avoiding unnecessary
diversions and promoting more insightful and actionable
outcomes.

Model Implementation
1)

In light of the complexities and multifaceted nature of the dataset


at hand, I have elected to employ the Random Forests algorithm for
the task of customer classification. The principal reasons that
motivated this selection are threefold. Firstly, Random Forests are
renowned for their robustness and flexibility. As an ensemble
learning method that combines multiple decision trees, they are
uniquely equipped to handle a vast array of data types, as well as
inconsistencies such as missing values. Secondly, Random Forests
are less prone to overfitting than their single tree counterparts. This
is due to their use of bagging (Bootstrap AGGregatING), where
multiple subsets of the data are independently used to train
separate decision trees, the results of which are then averaged or
majority voted to create a more generalizable model. Lastly,
Random Forests offer intrinsic feature importance evaluations,
which can provide valuable insights into the factors driving the
classification decision. This is not only useful for improving the
model's performance, but also for understanding the underlying
processes and relationships within the data. Therefore, considering
these benefits, it is reasonable to conclude that Random Forests
present a compelling choice for our analytical purposes.
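As a minimal sketch of how this choice could be implemented with scikit-learn, the snippet below uses synthetic data that merely stands in for the prepared GYC customer features; in practice X and y would come from the cleaned dataset, with y being the 'Car' label.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared customer data (illustrative only).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# An ensemble of bagged decision trees: each tree is fitted on a
# bootstrap sample and the forest votes/averages their predictions.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
# Intrinsic feature-importance scores, one per input column.
print("Feature importances:", model.feature_importances_)
```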

2)
1. Confusion Matrix: This visualization is beneficial in a binary
classification problem such as ours ("will buy a car" or "will not buy
a car"). A confusion matrix provides a clear picture of how well the
model is performing by showing true positives, true negatives, false
positives, and false negatives. It provides insight into the instances
the model got correct, as well as the ones it got wrong. A short code sketch after point 2 below shows how both of these evaluations can be produced.

2. AUC (Area Under the Curve): This is a popular evaluation metric used in binary classification tasks. Specifically, AUC is used to assess
the performance of a machine learning classifier by measuring the
quality of its predictions and the ability to distinguish between
positive and negative samples. The Receiver Operating
Characteristic (ROC) curve is the graphical representation used to
calculate the AUC. The ROC curve plots the True Positive Rate (TPR)
on the y-axis against the False Positive Rate (FPR) on the x-axis.
Each point on the curve represents a different classification
threshold, which determines how the classifier classifies the
samples. The AUC represents the area under the ROC curve. It
provides a single scalar value that quantifies the classifier's performance, ranging from 0 to 1. An AUC of 1 indicates a perfect classifier that achieves a TPR (sensitivity) of 1 while maintaining an FPR of 0, i.e., perfect specificity. An AUC of 0.5 suggests a random classifier
with no predictive power, while an AUC below 0.5 indicates a
classifier that performs worse than random. A higher AUC generally
indicates a better classifier, as it signifies a higher ability to
distinguish between positive and negative samples. It also implies
that the classifier is more likely to rank a randomly chosen positive
sample higher than a randomly chosen negative sample. The AUC is
widely used because it is insensitive to class imbalance and the
specific classification threshold. It provides a concise summary of a
classifier's performance, allowing for easy comparison between
different models or variations of the same model. Overall, AUC is a
valuable metric for assessing and comparing the performance of
binary classifiers, providing insights into their discriminative
capabilities and predictive accuracy.
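As referenced under point 1, here is a minimal sketch of how both evaluations could be produced with scikit-learn, assuming the fitted model and the X_test/y_test split from the earlier sketch:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Hard predictions feed the confusion matrix; class probabilities feed the ROC/AUC.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))

# One (FPR, TPR) point per classification threshold, and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print("AUC:", roc_auc_score(y_test, y_prob))
```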

Model Evaluation
1)
Reflecting on the results of the predictive model, it appears to have
delivered a commendable performance. The overall accuracy of the
model is 0.88, implying that the model is correctly predicting the
outcome in 88% of instances, which is quite high.

Looking deeper into the results, the confusion matrix provides further insight. In terms of true negatives, the model correctly
predicted 47 instances where customers would not buy a car. On
the other hand, the model incorrectly predicted that 12 customers
would not buy a car when, in reality, they would - these are the
false negatives.

Regarding the predictions where customers would buy a car, the
model correctly predicted 85 instances, which are the true
positives. The model incorrectly predicted 6 instances where
customers would buy a car, but they didn't - these are the false
positives.
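These counts are consistent with the reported accuracy of 0.88, as a quick check shows:

```python
tn, fn, tp, fp = 47, 12, 85, 6              # counts from the confusion matrix above
accuracy = (tp + tn) / (tp + tn + fp + fn)  # (85 + 47) / 150
print(accuracy)                             # 0.88
```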

In summary, the model seems to be performing well, with a high number of true positives and true negatives and a lower number of
false positives and false negatives. That said, we could still explore
ways to reduce the number of false predictions to further improve
the model's performance. This could involve refining the feature
set, tuning the model's parameters, or even exploring different
modeling techniques.

Based on the evaluation metrics and the confusion matrix, I believe the model performed quite well. An accuracy of 0.88 indicates that
the model is making correct predictions at a high rate. It suggests
that the model has a strong ability to correctly distinguish between
customers who will and will not buy a car.

The confusion matrix further supports this positive assessment. The model had significantly more true positives and true negatives
than false negatives and false positives. It means the model was
more often correct in its predictions than not. However, the
existence of any false predictions points to the fact that there's still
room for improvement.

As the individual who implemented this model, I am satisfied with its performance. Nonetheless, I'm also motivated to continue
refining it. Future work might involve using different features,
tweaking the parameters of the Random Forest algorithm, or
applying a different machine learning algorithm altogether to see if
we can achieve better results. Overall, the model appears robust,
but the work of a data scientist is never really done; there's always
more to explore and optimize.

2)

1. Hyperparameter Tuning: Every machine learning model, including
Random Forest, comes with a set of hyperparameters that control
the behavior of the model. For the Random Forest, these could
include the number of decision trees (n_estimators), the depth of
the trees (max_depth), and the minimum number of samples
required to split a node (min_samples_split), among others. These
hyperparameters can significantly impact the model's performance.
Therefore, using techniques like Grid Search or Random Search, we
can tune these hyperparameters to improve the model's accuracy.
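A minimal sketch of how such a grid search might be set up with scikit-learn is shown below; the parameter grid is only an illustrative starting point, and X_train/y_train are assumed to hold the training split from the earlier sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the real search space should be chosen to suit the data.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",   # optimize the same metric used for evaluation
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```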

This change produced a lower AUC than the original model.

2. Balancing the Dataset: If our dataset is imbalanced, meaning that one class of the output variable 'Car' (will buy or will not buy a car)
has many more examples than the other, it could bias the Random
Forest model to predict the majority class more often. This could
result in a high accuracy but a low recall for the minority class,
which might not be ideal depending on our business objective. We
could overcome this issue by oversampling the minority class or
undersampling the majority class to create a balanced dataset,
which could improve the overall performance of the model.
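One way this resampling might be done with scikit-learn's resample utility is sketched below, assuming X_train and y_train are NumPy arrays from the earlier split; the code detects the minority class rather than assuming which class it is.

```python
import numpy as np
from sklearn.utils import resample

# Identify the minority and majority classes from the training labels.
classes, counts = np.unique(y_train, return_counts=True)
minority_class = classes[np.argmin(counts)]
majority_class = classes[np.argmax(counts)]

X_minority = X_train[y_train == minority_class]
X_majority = X_train[y_train == majority_class]

# Oversample the minority class until both classes are the same size.
X_minority_up = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=42
)
X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate([
    np.full(len(X_majority), majority_class),
    np.full(len(X_minority_up), minority_class),
])
print("Balanced class counts:", dict(zip(*np.unique(y_balanced, return_counts=True))))
```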

This change reduced the overall accuracy.

3. Feature Engineering: Although we have already performed feature
selection to reduce the dimensionality of the dataset, we could also
perform feature engineering to create new features that might be
more informative for our model. This could involve creating
interaction features, which are new features that represent the
interaction between two or more existing features. It could also
involve creating polynomial features to capture non-linear
relationships between the features and the output variable. Feature
engineering could potentially improve the performance of the
model by providing it with more relevant information for making
predictions.
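A small sketch of the kind of interaction and polynomial features described here, using scikit-learn; the column names 'Income' and 'Age' and the tiny example frame are illustrative assumptions, not the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Tiny illustrative frame with two numeric customer attributes.
df = pd.DataFrame({"Income": [30000, 52000, 75000], "Age": [25, 40, 58]})

# A hand-crafted interaction feature combining two existing columns.
df["Income_x_Age"] = df["Income"] * df["Age"]

# Automatically generated polynomial and interaction terms (degree 2),
# which can help capture non-linear relationships.
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[["Income", "Age"]])
print(poly.get_feature_names_out(["Income", "Age"]))
```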

This option performed better than the balancing approach, but still did not beat the original result.

Given the results after trying three different enhancement methods, I think my initial choices were the best for this dataset. After all, an AUC of 0.87 is pretty good.

3)
1. Different algorithms: While Random Forests performed
reasonably well, it could be beneficial to explore different
algorithms. For example, Gradient Boosting algorithms, such as
XGBoost or LightGBM, are also powerful tools for classification
tasks. Even exploring deep learning models might be beneficial,
given sufficient data.

2. Feature engineering: This refers to the process of creating new features or modifying existing features to better represent the
underlying patterns in the data. This could involve creating
interaction terms, polynomial features, or even domain-specific
features. For instance, you could explore creating a new feature
that represents the ratio of income to the number of children. This
might help the model better capture nuances in the data.

3. Addressing class imbalance: If the target classes are imbalanced (i.e., there are far more examples of one class than the other), this
could lead to biased predictions favoring the majority class. In such
cases, techniques like SMOTE (Synthetic Minority Over-sampling
Technique) or ADASYN (Adaptive Synthetic) could be used to
balance the classes. Similarly, adjusting the class weights in the
model training process could also help handle class imbalance.
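To illustrate points 1 and 3 together, here is a minimal sketch that applies SMOTE and then fits a gradient-boosting alternative to the Random Forest; it assumes the imbalanced-learn package is installed and that X_train/y_train and X_test/y_test come from the earlier split.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier

# SMOTE synthesizes new minority-class samples by interpolating between
# existing minority-class neighbours; apply it to the training split only.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("Class counts after SMOTE:", Counter(y_resampled))

# A gradient-boosting classifier as an alternative to the Random Forest.
gbt = GradientBoostingClassifier(random_state=42)
gbt.fit(X_resampled, y_resampled)
print("Test accuracy:", gbt.score(X_test, y_test))
```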

4)
1. Iterative development and validation: One important practice
in model development is to follow an iterative approach where
you continuously develop, test, validate, and refine your model.
This includes splitting the data into a training set, a validation
set, and a test set. The training set is used to build the model,
the validation set is used to tune parameters and choose the
best-performing model, and the test set is used to provide an
unbiased evaluation of the final model fit on the training data.
Furthermore, techniques like cross-validation should be
implemented, where the data is divided into 'k' subsets and the
model is trained 'k' times, each time using a different subset as
the test set. This technique helps to prevent overfitting and gives
a better indication of how the model might perform on unseen
data. A brief sketch of this split-and-cross-validate workflow is shown after this list.

2. Understanding the problem and the data: Good analytical models start with a clear understanding of the problem and the
data. This involves detailed exploratory data analysis to
understand the underlying structures, relationships, and
patterns in the data. It also involves understanding the business
context or the real-world implications of the problem being
solved. Involving domain experts can also be beneficial. They
can provide valuable insights into how the model should be
constructed, what features might be important, and how the
model's predictions will be used. Understanding the problem
and the data will guide the selection of the appropriate
preprocessing techniques, model type, and evaluation metrics.
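As mentioned in point 1, a brief sketch of the split-and-cross-validate workflow with scikit-learn is given below; the 60/20/20 split ratios are illustrative, and X and y are assumed to hold the prepared features and target.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative 60/20/20 split into training, validation and test sets.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation on the training data gives a more stable
# estimate of generalization than a single train/validation score.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("Mean cross-validated AUC:", scores.mean())

# Final, unbiased check on the untouched test set.
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```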

5)
In reflecting on our work thus far, I'd like to highlight the transformative impact that both descriptive and predictive analytics have had on our operations.

Starting with descriptive analytics, we've been able to scrutinize our data at a granular level, revealing patterns and trends that were
previously unknown to us. This process served as a kind of
flashlight, illuminating insights about our business operations and
our customer base that we may not have otherwise recognized. It’s
like providing a detailed map of our business landscape that tells us
where we stand.

Then, as we moved to predictive analytics, we unlocked the ability to look into the future with a degree of confidence. We
implemented a sophisticated machine learning model, the Random
Forest classifier, that provided us with estimations of future
customer behavior. Although we experienced challenges during the
initial setup, the model eventually yielded promising results. The
real value of the model, however, isn't just in the numbers. It’s in
the power of anticipation and the ability to respond proactively to
emerging trends, a substantial competitive advantage in today's
dynamic marketplace.

Following a series of enhancements and optimizations to the model, we made significant strides in refining its predictive
capabilities. Although the journey was marked by ups and downs,
we learned invaluable lessons about the nature of our data, the
importance of fine-tuning our approaches, and the necessity of
balancing our data.

By integrating these descriptive and predictive analytics models, we've equipped our business with a more robust and intelligent
decision-making framework. We now have the tools not only to
understand our historical performance but also to predict and
shape our future. This dual capability is a game-changer, enabling
us to be both reactive and proactive in a data-driven way. And in
today's business climate, I truly believe that being data-driven is no
longer just an advantage - it's a necessity.
