Final Project - Data Analytics Case 1

The document presents a data analytics case study focused on Classic Models, a B2B company selling scale replicas of classic vehicles. It outlines the company's challenges in understanding market performance, product sales, and customer segmentation, and formulates research questions and hypotheses to guide data analysis. The case study emphasizes the importance of data-driven decision-making to optimize inventory management and enhance sales performance across different markets.

Case Study Project #1: Classic Models

Data Analytics Case Study 1 (DAMO-501-3)

Team

Jorge Luis Corzo Ruda

Wilson Kwesi Bli

Javier Alberto Correa Obregón

Omer Rahim

Khawar Malik

Instructor: Patty Zakaria, PhD

13/12/2024

Table of Contents

Chapter 1: Background, Problem Definition and Research Questions
1.1 Background
1.2 Problem Statement
1.3 Understanding the Database Entities & Relations
1.4 Research questions
Chapter 2: Hypothesis Formulation
Hypothesis One: What is the relationship in sales performance between France, Spain, and USA?
Hypothesis Two: What is the sales trend in different product lines?
Hypothesis Three: What are the trends in customer segmentation?
Hypothesis Four: What is the relationship between credit limit and amount spent?
Chapter 3: Data Collection and SQL Queries
Hypothesis One
Hypothesis Two
Hypothesis Three
Hypothesis Four
Chapter 4: Data Understanding
Hypothesis Two
Hypothesis Four
Chapter 5: Data Visualization
Chapter 6: Model Building
Chapter 7 – Model Evaluation
Conclusion

Chapter 1: Background, Problem Definition and Research Questions

1.1 Background

Classic Models is a B2B company dedicated to the sale and distribution of scale replicas of classic vehicles, including cars, motorcycles, planes, ships, trains, trucks, buses, and vintage cars. These products are designed for children and collectors of all ages who are passionate about vehicles and scale models. Its primary customers are specialized wholesalers, such as toy stores, gift shops, and hobby stores, among others.

For this project, we have a database that supports key aspects of commercial operations, including customer information, payments, product inventory, delivery times, employee management, orders, offices, deliveries, and returns. This enables the company to understand and track sales and business processes through data analysis. The analysis also helps identify potential problems and find solutions, supporting informed decision-making aimed at increasing revenue.

1.2 Problem Statement

Classic Models faces significant challenges in understanding the performance of its different markets, key product sales, and customer segments, and in defining strategies to optimize sales. It must also manage inventory levels effectively, as improper handling can disrupt supply chain operations and adversely impact sales performance across markets, product lines, and customer segments. Due to the large number of products managed by Classic Models, there is a possibility of inadequate inventory management, which could lead to serious problems both inside and outside the company.

If demand is not met effectively, the company could be negatively impacted in areas such as reputation, brand perception, and sales, which in turn could lead to the loss of customers and, consequently, affect profitability. Additionally, the lack of knowledge about inventory and market trends presents a significant risk. However, understanding the sales performance of the top-performing markets, the sales trends of the different product lines by regional location, and how credit lines can increase customers' purchasing power will not only boost revenue but also help the business understand the market, make data-driven decisions, and close the gap between the business and its customers.

1.3 Understanding the Database Entities & Relations

Figure 1- ER Diagram

1. Customers: Stores information about Classic Models' customers, including names, addresses, and contact details. The assigned sales representative can also be retrieved from this table.

2. Employees: Contains details about employees, including their names, job titles, supervisors, offices by region, and customer distribution.

3. Orders: Records the orders placed by customers. It makes it possible to validate the purchase orders made by each customer.

4. OrderDetails: Provides detailed data about the products in each order, including quantities and prices.

5. Products: Stores product information such as names, prices, and descriptions.

6. ProductLines: Classifies products into broader categories (e.g., Classic Cars, Motorcycles). This table is crucial for analyzing sales trends by product category.

7. Payments: Tracks customer payments, including dates and amounts.

8. Offices: Contains information related to office locations.

1.4 Research questions

1. What is the relationship in sales performance between France, Spain, and USA?

2. What is the sales trend in different product lines?

3. What are the trends in customer segmentation?

4. What is the correlation between credit limit and amount spent?

1.4.1 The Importance of Data Analysis in each research question

With data analysis, strategies can be identified and generated to mitigate difficulties related

to inventory management. For example, by understanding performance by region, the business can

identify which areas demand higher quantities of products and how to manage this to maintain a

controlled supply based on demand. It is also possible to identify sales trends by product line,

which is key to understanding which items have higher turnover in the inventory, helping us

maintain controlled production and avoid overloading warehouse stocks.

Understanding consumption patterns is essential in inventory management, as identifying

customer groups based on their preferences allows us to adjust inventory levels to meet specific

demands according to segmentation. Finally, identifying credit and spending patterns, and

understanding how they relate, can provide relevant information for future strategies in credit

allocation, anticipating the needs of customers with higher purchasing capabilities.

Chapter 2: Hypothesis Formulation

Hypothesis One: What is the relationship in sales performance between France, Spain, and USA?

• Null Hypothesis (H₀): There is no relationship in sales performance between France,

Spain, and USA.

• Alternative Hypothesis (H₁): There is a relationship in sales performance between France,

Spain, and USA.

This hypothesis is relevant because it aims to identify whether there is a relationship between sales performance and region. Knowing this makes it possible to adapt inventory capacity to the needs of each region and avoid supply chain disruptions or excess inventory by location. It is also important for understanding the purchasing power of the different markets and for pricing products in line with customers' income levels.

Lastly, this insight can help the business identify the marketing and sales strategies that work best across markets and regions, and how to improve the low-performing market territories.

Hypothesis Two: What is the sales trend in different product lines?

• Null Hypothesis (H₀): There is no significant sales trend in different product lines.

• Alternative Hypothesis (H₁): There is a significant sales trend in different product lines.

Identifying the sales trend in different product lines will help the business understand which product lines or models are selling and will go a long way toward shaping sales strategies so that resources are not wasted. For example, the business can focus more on the products that are selling well, or it can choose to direct more marketing toward the products that are not selling.

Moreover, understanding the sales pattern aids data-driven decisions and efficient management of resources. This leads to better returns on investment and sales efforts.

Hypothesis Three: What are the trends in customer segmentation?

• Null Hypothesis (H₀): There are no changes in customer trends based on segmentation.

• Alternative Hypothesis (H₁): There are changes in customer trends based on

segmentation.

Another key aspect of the business is understanding the background and segmentation of customers, such as their age group. This will help the business introduce and optimize customer-focused strategies with respect to customers' preferences, demographics, age, involvement in marketing activities, and other characteristics.

Again, the insight gained from this hypothesis is crucial for resource allocation and targeted

marketing efforts. The business can invest in high growth segments and adjust strategies for

segments showing decline.

Lastly, identifying the trends in customer segmentation supports long-term planning and

competitiveness. The business can stay informed and be proactive to respond to evolving

market dynamics, staying ahead of competitors by addressing new opportunities and mitigating

potential risks associated with customer behaviors.

Hypothesis Four: What is the relationship between credit limit and amount spent?

• Null Hypothesis (H₀): There is no relationship between the credit limit and the amount

spent.

• Alternative Hypothesis (H₁): There is a relationship between the credit limit and the

amount spent.

Testing the hypothesis about the correlation between credit limits and the amount spent could give the business useful insights for predicting customers' spending behavior. If there is a close correlation between the credit limit and the amount spent, the business can introduce policies that raise credit limits to increase customers' purchasing power.

The business can also identify the risk levels of its customer segments and, as a result, identify high-value customers who can be targeted with marketing and individual credit policies to maximize revenue. Strategic decisions regarding product financing and loyalty programs can also be initiated from this insight.

Chapter 3: Data Collection and SQL Queries

Hypothesis One: To answer research question one, it is necessary to query the database to identify the three countries with the highest total sales. For this, we create a temporary result set using a common table expression (CTE). The CTE joins the customers, orders, and orderdetails tables, groups the results by country, orders them by total sales, and limits the output to three rows, one for each of the three countries. In addition, the data is filtered so that orders with a cancelled status are excluded.

Once this temporary table is obtained, we proceed to the main query, which returns the country name, the order number, the product code, the quantity ordered, the price per unit, the line total, the order date, and the order status. This query also requires joins in order to access the necessary data.

Figure 2
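
Because the query in Figure 2 is available only as an image, the following is a minimal sketch of how the CTE and main query described above could look. It is not the team's actual query: the key columns (customerNumber, orderNumber), the product codes, and all sample rows are assumptions made for illustration, and SQLite is used through Python only so that the sketch runs on its own.

```python
import sqlite3

# Hypothetical miniature of the schema; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerNumber INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (orderNumber INTEGER PRIMARY KEY, orderDate TEXT,
                     status TEXT, customerNumber INTEGER);
CREATE TABLE orderdetails (orderNumber INTEGER, productCode TEXT,
                           quantityOrdered INTEGER, priceEach REAL);
INSERT INTO customers VALUES (1,'USA'),(2,'France'),(3,'Spain'),(4,'Italy');
INSERT INTO orders VALUES
  (10,'2004-01-05','Shipped',1),(11,'2004-02-10','Shipped',2),
  (12,'2004-03-15','Shipped',3),(13,'2004-04-20','Shipped',4),
  (14,'2004-05-25','Cancelled',4);
INSERT INTO orderdetails VALUES
  (10,'A1',40,100.0),(11,'A2',30,90.0),(12,'A3',20,80.0),
  (13,'A4',10,50.0),(14,'A4',90,200.0);
""")

TOP3_QUERY = """
WITH top_countries AS (
    -- Three countries with the highest non-cancelled sales
    SELECT c.country,
           SUM(od.quantityOrdered * od.priceEach) AS totalSales
    FROM customers c
    JOIN orders o ON o.customerNumber = c.customerNumber
    JOIN orderdetails od ON od.orderNumber = o.orderNumber
    WHERE o.status != 'Cancelled'
    GROUP BY c.country
    ORDER BY totalSales DESC
    LIMIT 3
)
-- Main query: order lines for those three countries only
SELECT c.country, o.orderNumber, od.productCode, od.quantityOrdered,
       od.priceEach, od.quantityOrdered * od.priceEach AS lineTotal,
       o.orderDate, o.status
FROM customers c
JOIN orders o ON o.customerNumber = c.customerNumber
JOIN orderdetails od ON od.orderNumber = o.orderNumber
JOIN top_countries t ON t.country = c.country
WHERE o.status != 'Cancelled'
ORDER BY lineTotal DESC;
"""

rows = conn.execute(TOP3_QUERY).fetchall()
countries = [row[0] for row in rows]
print(countries)
```

In this toy data the large cancelled order for Italy is ignored, so Italy does not displace any of the three countries with real sales.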

The next step is to prepare the data for analysis by checking for outliers and then standardizing the dataset, in order to avoid bias and improve model accuracy.
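
The report does not state which outlier rule or scaling method was applied. As one possible illustration, the sketch below flags outliers with the common 1.5 × IQR rule and standardizes values to z-scores. The function names and the sample quantities are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def standardize(values):
    """Convert values to z-scores (mean 0, population std. dev. 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical quantityOrdered sample, not taken from the dataset.
quantity_ordered = [20, 25, 30, 32, 35, 36, 38, 40, 45, 97]

outliers = iqr_outliers(quantity_ordered)
z_scores = standardize(quantity_ordered)
print(outliers)
```

Standardizing after the outlier check, as described above, keeps a single extreme value from dominating the mean and standard deviation used for scaling.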

Hypothesis Two: As with question one, it is important to analyze the monthly sales trends by product line, for example to identify whether a line shows consistent sales or seasonal peaks. For this reason, a main query is developed that retrieves the product lines, sorts the results by date, and returns the total sales value for each line and period. In addition, for greater clarity, the results are filtered to exclude orders with a cancelled status.

With this query, it will be possible to identify patterns and plan inventory strategies with respect to marketing or sales.

Figure 3
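
As a rough, runnable approximation of the monthly aggregation described for this hypothesis, the sketch below groups non-cancelled sales by product line and month. The schema fragment, the product codes, and the rows are invented for illustration, and the month is extracted with SQLite's strftime function, which may differ from the date function used in the original query.

```python
import sqlite3

# Hypothetical schema fragment and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (productCode TEXT PRIMARY KEY, productLine TEXT);
CREATE TABLE orders (orderNumber INTEGER PRIMARY KEY, orderDate TEXT,
                     status TEXT);
CREATE TABLE orderdetails (orderNumber INTEGER, productCode TEXT,
                           quantityOrdered INTEGER, priceEach REAL);
INSERT INTO products VALUES ('P1','Classic Cars'),('P2','Motorcycles');
INSERT INTO orders VALUES
  (1,'2004-01-10','Shipped'),(2,'2004-01-20','Shipped'),
  (3,'2004-02-05','Shipped'),(4,'2004-02-15','Cancelled');
INSERT INTO orderdetails VALUES
  (1,'P1',10,100.0),(2,'P1',5,100.0),(2,'P2',4,50.0),
  (3,'P2',6,50.0),(4,'P1',99,100.0);
""")

MONTHLY_QUERY = """
SELECT p.productLine,
       strftime('%Y-%m', o.orderDate) AS orderMonth,
       SUM(od.quantityOrdered * od.priceEach) AS totalSales
FROM orderdetails od
JOIN orders o ON o.orderNumber = od.orderNumber
JOIN products p ON p.productCode = od.productCode
WHERE o.status != 'Cancelled'
GROUP BY p.productLine, orderMonth
ORDER BY orderMonth, p.productLine;
"""

monthly = conn.execute(MONTHLY_QUERY).fetchall()
for line, month, total in monthly:
    print(month, line, total)
```

Each output row is one product line in one month, which is the shape needed to look for steady sales or seasonal peaks.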

As we did for the data from research question one, we check for outliers and standardize the dataset for further analysis.

Hypothesis Three: To answer research question three, it is necessary to perform a query that categorizes customers by the total value of their orders, using specific criteria, and analyzes their behavior through useful metrics such as the number of orders and the total value of their purchases. For this reason, a main query is carried out that returns the customer number, customer name, and country. It also performs counting operations, which give the number of orders per customer, and sum operations, which give the total value of the orders placed.

Finally, a CASE expression categorizes the customers as 'High value', 'Mid value' or 'Low value'. To make the results clearer and more reliable, the query excludes orders with Cancelled status.

Figure 4
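
The segmentation logic can be sketched as follows. The report does not state the cut-off values used for each bracket, so HIGH_CUTOFF and MID_CUTOFF below are purely hypothetical, as are the sample customers and the key column names.

```python
import sqlite3

# Hypothetical thresholds: the actual cut-offs are not given in the report.
HIGH_CUTOFF = 100_000
MID_CUTOFF = 50_000

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerNumber INTEGER PRIMARY KEY,
                        customerName TEXT, country TEXT);
CREATE TABLE orders (orderNumber INTEGER PRIMARY KEY, status TEXT,
                     customerNumber INTEGER);
CREATE TABLE orderdetails (orderNumber INTEGER, productCode TEXT,
                           quantityOrdered INTEGER, priceEach REAL);
INSERT INTO customers VALUES (1,'A','USA'),(2,'B','France'),(3,'C','Spain');
INSERT INTO orders VALUES (1,'Shipped',1),(2,'Shipped',1),(3,'Shipped',2),
                          (4,'Shipped',3),(5,'Cancelled',3);
INSERT INTO orderdetails VALUES
  (1,'P1',1000,100.0),(2,'P1',200,100.0),(3,'P1',600,100.0),
  (4,'P1',100,100.0),(5,'P1',5000,100.0);
""")

SEGMENT_QUERY = """
SELECT c.customerNumber, c.customerName, c.country,
       COUNT(DISTINCT o.orderNumber) AS totalOrders,
       SUM(od.quantityOrdered * od.priceEach) AS totalOrderValue,
       CASE
           WHEN SUM(od.quantityOrdered * od.priceEach) >= :high
               THEN 'High value'
           WHEN SUM(od.quantityOrdered * od.priceEach) >= :mid
               THEN 'Mid value'
           ELSE 'Low value'
       END AS segment
FROM customers c
JOIN orders o ON o.customerNumber = c.customerNumber
JOIN orderdetails od ON od.orderNumber = o.orderNumber
WHERE o.status != 'Cancelled'
GROUP BY c.customerNumber, c.customerName, c.country
ORDER BY totalOrderValue DESC;
"""

result = conn.execute(
    SEGMENT_QUERY, {"high": HIGH_CUTOFF, "mid": MID_CUTOFF}
).fetchall()
segments = {row[1]: row[5] for row in result}
order_counts = {row[1]: row[3] for row in result}
print(segments)
```

COUNT(DISTINCT o.orderNumber) is used because joining to the order lines repeats each order once per line; a plain COUNT(*) would count lines rather than orders.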

Hypothesis Four: We carry out a similar process to retrieve the dataset, running a query that examines customers' expenses and evaluates whether they are spending within their credit limit. The query therefore retrieves the customer, the country, and each customer's credit limit, and sums the quantities of products ordered multiplied by their price per unit to obtain each customer's total expenses.

This query requires joins between the customers, orders, and orderdetails tables. Finally, the results are filtered so that orders with cancelled status are excluded and only customers whose total expense is greater than their credit limit are shown. We check for outliers and standardize the dataset for further analysis.

Figure 5
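
One way to express the filter on total expense versus credit limit is a HAVING clause applied after aggregation, as in this sketch. It assumes a creditLimit column on the customers table; the key columns and sample rows are likewise hypothetical, and the team's query may be structured differently.

```python
import sqlite3

# Hypothetical rows; creditLimit on customers is an assumed column name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customerNumber INTEGER PRIMARY KEY,
                        customerName TEXT, country TEXT, creditLimit REAL);
CREATE TABLE orders (orderNumber INTEGER PRIMARY KEY, status TEXT,
                     customerNumber INTEGER);
CREATE TABLE orderdetails (orderNumber INTEGER, productCode TEXT,
                           quantityOrdered INTEGER, priceEach REAL);
INSERT INTO customers VALUES
  (1,'A','USA',50000),(2,'B','France',100000),(3,'C','Spain',20000);
INSERT INTO orders VALUES (1,'Shipped',1),(2,'Shipped',2),
                          (3,'Shipped',3),(4,'Cancelled',3);
INSERT INTO orderdetails VALUES
  (1,'P1',800,100.0),(2,'P1',300,100.0),
  (3,'P1',100,100.0),(4,'P1',500,100.0);
""")

OVER_LIMIT_QUERY = """
SELECT c.customerNumber, c.customerName, c.country, c.creditLimit,
       SUM(od.quantityOrdered * od.priceEach) AS totalSpent
FROM customers c
JOIN orders o ON o.customerNumber = c.customerNumber
JOIN orderdetails od ON od.orderNumber = o.orderNumber
WHERE o.status != 'Cancelled'
GROUP BY c.customerNumber, c.customerName, c.country, c.creditLimit
HAVING SUM(od.quantityOrdered * od.priceEach) > c.creditLimit
ORDER BY totalSpent DESC;
"""

over_limit = conn.execute(OVER_LIMIT_QUERY).fetchall()
print(over_limit)
```

The status filter sits in WHERE because it applies to individual orders, while the credit-limit comparison sits in HAVING because it applies to each customer's aggregated total.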

Chapter 4: Data Understanding

In hypothesis one, we want to determine whether there is a relationship in sales performance between the three top-performing countries for the Classic Models business with respect to the quantity ordered. We started by running descriptive statistics to understand the data. The mean, minimum and maximum values, variance, and standard deviation help us understand the variables and decide which further tests to perform to gain insights for building the models.

Figure 6 – Descriptive Analysis of Hypothesis One

The table provides a summary of key descriptive statistics for the dataset, specifically

focusing on three metrics: quantityOrdered, priceEach, and lineTotal. Here’s a detailed

breakdown addressing the metrics:

1. N (Valid Data): There are 1,630 valid observations for each variable, with no missing

values. This indicates that the dataset is complete and reliable for analysis.

2. Quantity Ordered:

• Mean: 35.77 units per order, indicating the average order size across the dataset.

• Standard Deviation: 9.867, showing moderate variability in order size.

• Minimum and Maximum: Orders range from 6 units to 97 units, highlighting a significant

spread in order sizes.

3. Price Each:

• Mean: $90.61 per unit, representing the average product price.

• Standard Deviation: 37.23684, reflecting high variability in product prices.

• Minimum and Maximum: Prices range from $26.55 to $214.30, suggesting a diverse

product portfolio with varying price points.

4. Line Total (Total Sales per Transaction):

• Mean: $3,246.34 per transaction.

• Standard Deviation: $1,669.50, indicating high variability in sales values per order.

• Minimum and Maximum: Sales values range from $531 to $11,170.52, reflecting both

small and large-scale transactions.

Observations:

The data shows a wide range of order sizes, product prices, and sales values, which is

critical for evaluating sales performance across the three countries (France, Spain, USA).

Identification of Key Metrics, Trends, and Patterns Relevant to the Research Question

Key Metrics:

• Total Sales (Sum of Line Total): $5,291,532.59. This is the overall revenue generated in

the dataset. Comparing this figure across France, Spain, and the USA will reveal regional

performance.

• Average Order Size (Mean Quantity Ordered): Indicates typical buying behavior. For

deeper insights, this needs to be analyzed separately for each country.

• Price Variability: The standard deviation and range in priceEach suggest diverse product

offerings, which might appeal differently across regions.

Trends and Patterns:

• The variability in lineTotal suggests that order value is influenced by both quantityOrdered

and priceEach. Exploring these relationships by country can uncover differences in

purchasing habits.

• The maximum sales value of $11,170.52 suggests high-value transactions, which could be

tied to specific countries or products.

Clear Interpretation of Findings and Insights Derived from the Data Analysis.

• The dataset exhibits substantial variability in order size, price, and sales value, which

provides an opportunity to analyze differences across France, Spain, and the USA.

• High variability in prices and sales values suggests that customer purchasing behavior and

product demand might differ significantly between countries.

• France, Spain, and the USA likely contribute differently to the total sales of $5.29M.

Identifying the contribution of each country is essential to understand their relative

performance.

Hypothesis two: This hypothesis examines the sales trend across the different product lines (classic cars, motorcycles, planes, ships, trains, trucks and buses, and vintage cars). The goal is to learn how the total sales of these distinct product lines change over time.

Figure 7 – Descriptive Analysis of Hypothesis two

Figure 8 – Box plot of Hypothesis two

Thorough Exploration and Analysis of the Collected Data

The statistics describe the distribution of sales data for a sample size of 181 observations:

• Mean (Average): The mean sales amount is $51,742.19, which represents the central

tendency of the data.

• Median: The median sales value is $36,552.33, indicating that half the sales values are

below this figure and half are above.

• Mode: Not available (#N/A), meaning no sales amount occurs more frequently than others

in the data.

• Standard Deviation: Sales data show a high variability, with a standard deviation of

$56,889.24.

• Range: The difference between the highest ($415,952.81) and lowest ($1,860.93) sales is

$414,091.88, demonstrating a wide spread in sales figures.

• Skewness: With a skewness of 3.39, the sales distribution is heavily right-skewed,

suggesting that most sales values are relatively low, but a few large values pull the

distribution to the right.

• Kurtosis: The kurtosis value of 16.72 indicates a distribution with extreme outliers, likely

driven by a few exceptionally high sales.

• Sum: Total sales amount across all observations is approximately $9,365,336.43.
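
For reference, skewness and kurtosis figures of this kind can be computed as sketched below. The sketch assumes the bias-corrected sample formulas commonly used by spreadsheet and statistics packages (with kurtosis reported as excess kurtosis); the report does not say which tool or formula produced its values, and the sample data are hypothetical.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Bias-corrected sample skewness."""
    n = len(values)
    m, s = mean(values), stdev(values)
    cubed = sum(((v - m) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((v - m) / s) ** 4 for v in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical samples, not the project's sales data.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [10, 12, 11, 13, 12, 11, 95]

print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_tailed))
```

A symmetric sample yields a skewness of about zero, while a sample with one very large value yields a clearly positive skewness, which is the pattern reported for the sales data.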

Identification of Key Metrics, Trends, and Patterns Relevant to the Research Question.

• The research question focuses on sales trends in different product lines. Although the

provided statistics summarize the total sales data, here are observations that might guide

further investigation:

• High Variability: The standard deviation and range suggest that sales vary significantly

between product lines or other groupings (e.g., by region, time period, or customer

segment).

• Right-Skewed Distribution: The large skewness and kurtosis values indicate a small

number of extremely high-performing sales, potentially tied to specific product lines or

promotional campaigns.

• Median vs. Mean: The mean is much higher than the median, emphasizing that the

distribution is influenced by outliers or high sales values.

Clear Interpretation of Findings and Insights Derived from the Data Analysis

• Key Insight 1: The sales data is unevenly distributed, with some extreme values

significantly influencing the overall metrics. This implies that certain product lines or

groups are likely dominating total sales figures.

• Key Insight 2: The high range and standard deviation suggest that product performance

varies significantly, potentially highlighting opportunities to improve underperforming

product lines or to understand what drives the success of top-performing ones.

• Key Insight 3: Outliers and right skewness imply that a small number of sales contribute

disproportionately to the total, which may align with specific customer behaviors or

product-line preferences.

In hypothesis three, we want to understand customer segmentation and spending behavior. We segmented the customers into clusters based on spending behavior and order value: high-value, mid-value, and low-value customers. Below are the descriptive statistics of the data:

Figure 9– Descriptive Analysis of Hypothesis three

Figure 10 – Box plot of Hypothesis three

Thorough Exploration and Analysis of the Collected Data

TotalOrders:

• Mean: Customers, on average, placed 3.27 orders.

• Median: The median value is 3, showing that half of the customers placed three or

fewer orders.

• Mode: The most frequently occurring number of orders is 3.

• Standard Deviation: A value of 2.77 indicates some variation in the number of orders,

but most customers remain close to the mean.

• Range: The range of orders spans from 1 to 25, highlighting outliers where a few

customers placed significantly more orders.

• Skewness (6.33): The distribution is highly right-skewed, indicating that most

customers placed a low number of orders while a small group placed many.

• Kurtosis (45.03): The high kurtosis suggests extreme outliers, further reinforcing that

some customers are contributing a disproportionately large number of orders.

TotalOrderValue:

• Mean: The average total order value per customer is $95,564.66.

• Median: The median value is $79,306.57, below the mean, which is consistent with a right-skewed distribution.

• Standard Deviation: With a standard deviation of $93,286.80, the total order value

varies widely across customers.

• Range: The total order value ranges from $7,918.60 to $773,642.18, indicating

substantial disparities in spending patterns among customers.

• Skewness (5.59): The highly positive skew reflects a few customers with exceptionally

high spending.

• Kurtosis (36.97): The high kurtosis shows that a small proportion of customers

significantly impact total sales through their large order values.

Identification of Key Metrics, Trends, and Patterns Relevant to the Research Question.

The research question examines trends in customer segmentation. Key trends and patterns

include:

• Order Frequency: Many customers place a small number of orders (median = 3), but

a few customers have very high order counts, possibly representing a loyal or bulk-

buying segment.

• High Variance in Spending: The disparity in total order values suggests distinct

customer groups, ranging from low-spending occasional buyers to high-value repeat

customers.

• Significant Outliers: A small subset of customers drives a large proportion of total

sales and order frequency, as evidenced by high skewness and kurtosis for both metrics.

Potential Segments:

• Low-Value, Infrequent Buyers: Customers with low total order value and a small

number of orders.

• High-Value, Frequent Buyers: A few customers contributing significantly to total

revenue through frequent orders and high spending.

• Moderate Buyers: Customers with spending and frequency near the mean or median.

Clear Interpretation of Findings and Insights Derived from the Data Analysis

• The data is consistent with a Pareto-like ("80/20") pattern, in which a small segment of high-value customers accounts for a disproportionately large share of total revenue.

• The high variability and presence of outliers highlight the importance of personalized

marketing and customer retention strategies for top-performing customer segments.

• There is an opportunity to increase engagement among low-value customers by

understanding their purchase behavior and incentivizing repeat purchases.

By segmenting customers effectively, actionable strategies can be developed to drive growth

and optimize customer relationships.
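
The 80/20 split itself is not computed in the statistics above. A minimal sketch of how the share of revenue coming from the top 20% of customers could be checked is shown below; the function name and the sample order values are hypothetical.

```python
import math


def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, math.ceil(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical total order values for ten customers.
total_order_value = [400, 400, 25, 25, 25, 25, 25, 25, 25, 25]

share = top_share(total_order_value)
print(f"Top 20% of customers account for {share:.0%} of revenue")
```

Applied to the per-customer TotalOrderValue column, a result close to 0.8 would support the 80/20 reading, while a much lower value would suggest a weaker concentration.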

Hypothesis Four: The final hypothesis aims to establish a relationship between customers' credit limit and their spending behavior. It seeks to analyze how the available credit amount affects customers' purchasing decisions, influencing both the frequency and volume of their purchases. This behavior can provide key insights for Classic Models in optimizing its inventory, allowing it to adjust stock levels based on consumption trends and anticipate demand. In this way, the company can improve inventory management efficiency, reducing costs and ensuring product availability without overstocking.

Figure 11 – descriptive statistics Hypothesis 4

Exploration of Data

1. Credit Limit:

• Mean: 88,150.91

• Range: 21,000 to 227,600

• Standard Deviation: 36,881.21

2. Total Spent:

• Mean: 122,289.62

• Range: 22,314.36 to 773,624.18

• Standard Deviation: 116,478.01

Figure 12 – Box plot Total spent, box plot credit limit, Hypothesis 4

Figure 13 – scatter plot Total spent, scatter plot credit limit, Hypothesis 4

Two outliers were identified in the two analyzed variables. To prevent these data points from affecting the correlation, and given that the distribution of the data is normal, it was decided to replace them with the mean. This decision is justified because the outliers represent only a few observations, so replacing them minimizes their impact on the analysis and preserves the integrity and representativeness of the results. Additionally, a scatter plot was used to verify that the relationship between the variables is linear. This visual confirmation of linearity supports the feasibility of performing a reliable and appropriate correlation analysis.
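
The preparation and correlation steps described here can be illustrated with the following sketch. It makes two assumptions that are not stated in the report: outliers are flagged with the 1.5 × IQR rule, and the replacement value is the mean of the remaining (non-outlier) observations. The data are hypothetical.

```python
from math import sqrt
from statistics import mean, quantiles


def replace_outliers_with_mean(values, k=1.5):
    """Replace IQR-rule outliers with the mean of the non-outlier values."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    fill = mean(kept)
    return [v if low <= v <= high else fill for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values in thousands, not the project's data.
credit_limit = [21, 40, 55, 70, 85, 100, 115, 130]
total_spent = [20, 30, 40, 50, 60, 70, 80, 500]

cleaned_spent = replace_outliers_with_mean(total_spent)
r_before = pearson_r(credit_limit, total_spent)
r_after = pearson_r(credit_limit, cleaned_spent)
print(round(r_before, 3), round(r_after, 3))
```

Comparing the coefficient before and after the replacement shows how strongly a single extreme observation can influence a Pearson correlation.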

Clear Interpretation of Findings and Insights Derived from the Data Analysis

• Total Spent:

• The high standard deviation (116,478 USD) and wide range (22,314.36 - 773,624.2 USD)

indicate substantial variability in customer spending.

• Skewness (4.57) and kurtosis (22.57) suggest a heavily skewed distribution with extreme

outliers.

• Credit Limit:

• A smaller standard deviation (36,881.21 USD) compared to total spent suggests less

variability in credit limits.

• Skewness (1.46) and kurtosis (4.45) indicate a moderate skew towards higher credit limits,

with fewer extreme values than total spent.

Chapter 5: Data Visualization
Hypothesis One:

Figure 14 - Clustered Bar of Quantity Ordered and Price by Each Country

The chart shows the "Quantity Ordered" and "Price Each" for three countries: France, Spain, and the USA. This type of data visualization allows for an easy comparison of these key metrics across the selected countries.

France has the lowest quantity ordered, at 35 units at a price of $90. Spain and the USA have the same quantity ordered, even though the USA has the highest price per unit. This could mean that USA customers have more purchasing power, and the business can learn from the marketing and sales strategy used there to improve performance in other regions.

Hypothesis Two:

Figure 15 - Pie Chart for total sales per year

The figure above shows that 2004 contributes the highest percentage of sales, 46.39%, which is almost half of the total across all sales years. The business can investigate the key factors that led to the success of 2004, then reintroduce and optimize them for 2005, which currently shows the lowest sales.

Figure 16 - Pie Chart of product line sales from 2003-2005

This pie chart breaks down the sales contributions of different product lines over the 2003-2005

period. The key insights are:

• Classic Cars had the highest sales at 16.02% of the total.

• Motorcycles and Vintage Cars contributed 14.36% and 14.92%, respectively.

• Planes, Ships, and Trains had the lowest sales contributions, at 12.71%, 13.26%, and 12.71%, respectively.

So, the data suggests that the Classic Cars product line was the strongest performer in terms of

sales over this 3-year period, while the Planes, Ships, and Trains product lines had the lowest sales

contributions.

Hypothesis Three:

Figure 17 - Customer Segmentation Per Total Orders

Most of the customers fall in the “High value” bracket. This can help the product and marketing teams introduce more customer retention and credit policies to keep these customers, while working on improving the sales performance of the low-value customers.

Figure 18 - Total Orders by Each Territory

Figure 18 illustrates the distribution of orders across four geographic territories:

APAC, EMEA, Japan, and NA. The largest share of orders, 34%, comes from NA (North

America), followed closely by EMEA (Europe, Middle East, and Africa) with 33%. Japan

accounts for a moderate share of 25%, while APAC (Asia-Pacific) contributes the smallest

portion at 8%. Combined, NA and EMEA dominate the order distribution, comprising 67% of

the total. This chart emphasizes the significant performance disparity between territories, with

NA leading and APAC showing the least activity. Such insights can help businesses strategize

and focus on underperforming regions like APAC to improve their order distribution.

Territory Country

APAC Australia

APAC New Zealand

APAC Singapore

EMEA Austria

EMEA Belgium

EMEA Denmark

EMEA Finland

EMEA France

EMEA Germany

EMEA Ireland

EMEA Italy

EMEA Norway

EMEA Spain

EMEA Sweden

EMEA Switzerland

EMEA UK

Japan Hong Kong

Japan Japan

Japan Philippines

Japan Singapore

NA Canada

NA USA

Table 1 – Territories by Countries

Hypothesis Four:

Figure 19 - Total Spending by Each Country

Figure 19 details the total spending amounts across various countries. The y-axis represents the

total spent in monetary terms, while the x-axis lists the countries. Notably, the USA shows the

highest total spending at $2,334,180.24, followed by Singapore with $954,584.74 and France with

$672,136.39. Other countries, such as Australia, Italy, and Japan, exhibit moderate spending levels,

while smaller totals are observed for countries like Sweden, Belgium, and Norway.

Figure 19 complements Figure 18’s order distribution by territory, where NA (North

America) held the largest share of orders (34%), and APAC had the lowest (8%). The dominance

of the USA in spending aligns with NA’s high contribution to the order share. Similarly, the

substantial spending by Singapore (in APAC) indicates its pivotal role within a territory that

otherwise accounted for the smallest order share. In contrast, spending in European countries like

France and Germany reflects EMEA’s strong showing in the pie chart, where it contributed 33%

of the orders.

Insight:

The line chart reveals the granular spending dynamics within each territory. While NA and EMEA

dominate order distribution, the spending within these regions is heavily concentrated in a few

countries like the USA, France, and Singapore. This concentration highlights the strategic importance

of these countries to the overall revenue distribution and suggests a need to explore growth

opportunities in underperforming regions such as APAC and low-spending European countries.

Chapter 6: Model Building


Hypothesis One:

Figure 20

Figure 21

Interpretation of Levene's Test for Homogeneity of Variances:

Hypothesis for Levene's Test:

o Null Hypothesis (H₀): The variances across the groups (France, Spain, and USA) are

equal.

o Alternative Hypothesis (H₁): The variances across the groups are not equal.

Results:

o Based on the Levene Statistic, the p-values (Sig.) for all tests (Mean, Median, Median

with adjusted df, and Trimmed Mean) are greater than 0.05:

▪ Based on Mean: Sig. = 0.617

▪ Based on Median: Sig. = 0.801

▪ Based on Trimmed Mean: Sig. = 0.715

Conclusion:

Since all p-values are greater than 0.05, we fail to reject the null hypothesis. This means the

assumption of homogeneity of variances is met, and it is appropriate to proceed with interpreting

the ANOVA results.
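
To make the mechanics of Levene's test concrete, here is a minimal sketch in plain Python. It relies on the standard definition of the Levene statistic as a one-way ANOVA F statistic computed on absolute deviations from each group's center. The function names and the sample values are hypothetical and are not the report's data; the reported statistics came from the authors' own statistical output.

```python
from statistics import mean, median


def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [mean(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def levene_statistic(groups, center=mean):
    """Levene's W: the ANOVA F of absolute deviations from each group's center."""
    deviations = [[abs(x - center(g)) for x in g] for g in groups]
    return one_way_f(deviations)


# Hypothetical lineTotal values per country (illustration only).
france = [10.0, 12.0, 14.0]
spain = [20.0, 22.0, 24.0]
usa = [30.0, 32.0, 34.0]

# Same spread in every group, so the statistic is zero despite different means.
print("Based on mean:  ", levene_statistic([france, spain, usa]))
print("Based on median:", levene_statistic([france, spain, usa], center=median))

# Groups with different spread produce a positive statistic.
print("Unequal spread: ", levene_statistic([[1, 2, 3], [0, 2, 4]]))
```

The `center` argument mirrors the "Based on Mean" and "Based on Median" rows of the output above. Converting the statistic to a p-value requires the F distribution, which this sketch leaves to a statistics package.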

Interpretation of ANOVA Results:

Hypothesis for ANOVA:

o Null Hypothesis (H₀): There is no significant difference in sales performance (lineTotal)

between France, Spain, and USA.

o Alternative Hypothesis (H₁): There is a significant difference in sales performance (lineTotal)

between these countries.

Results:

o F-Statistic: 0.160

o p-value (Sig.): 0.852

Conclusion:

Since the p-value = 0.852 is greater than 0.05, we fail to reject the null hypothesis. This

indicates that there is no statistically significant difference in sales performance (lineTotal)

between France, Spain, and the USA.

Summary:

• Levene's Test: Variances are equal across the groups.

• ANOVA: No significant difference in sales performance between the countries.

We can conclude that, based on this sample, there is no evidence that sales performance differs

across France, Spain, and the USA; no country shows a statistically higher or lower performance

than the others.

Hypothesis Two:

Figure 22

Figure 23

Figure 24

Interpretation of Welch’s ANOVA and Regular ANOVA Outputs:

Hypotheses Recap:

• Null Hypothesis (H₀): There is no significant sales trend in different product lines.

• Alternative Hypothesis (H₁): There is a significant sales trend in different product lines.
Step 1: Levene's Test for Homogeneity of Variances

(Levene’s test was run first; because it showed unequal variances, Welch’s test was used.)

Levene’s Test Result (p < 0.001, displayed as 0.000):

Indicates that variances across product lines are significantly different (p < 0.05). Therefore, the

assumption of homogeneity of variances is violated, justifying the use of Welch’s ANOVA.

Step 2: Welch’s ANOVA Result

• Welch Statistic: 13.575

• Degrees of Freedom: df1 = 2, df2 = 48.417

• Significance (Sig.): p < 0.001 (displayed as 0.000)

Interpretation:

• Since p < 0.05, we reject the null hypothesis.

• This indicates that there is a significant difference in total sales trends between the

different product lines.

Conclusion:

• Based on Welch’s ANOVA, there is a significant difference in total sales trends across

product lines. Therefore, we reject the null hypothesis and conclude that sales

performance varies significantly by product line.
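
For readers who want to see how the Welch statistic and its fractional degrees of freedom arise, the sketch below implements the standard Welch one-way ANOVA formulas in plain Python. The function name and the sample groups are hypothetical; it does not reproduce the report's figures, which come from the authors' own output.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA on a list of numeric groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Each group is weighted by n_i / s_i^2, so high-variance groups count for less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_weight = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    numerator = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)  # usually not a whole number
    return numerator / denominator, df1, df2


# Hypothetical sales totals for three groups with unequal spread (illustration only).
groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 6.0, 7.0]]
f_stat, df1, df2 = welch_anova(groups)
print(f"Welch F = {f_stat:.3f}, df1 = {df1}, df2 = {df2:.3f}")
```

The non-integer second degrees of freedom (such as the 48.417 reported above) is a normal feature of Welch's adjustment rather than a sign of an error.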

Hypothesis Three:

Figure 25

Figure 26

Figure 27

Interpretation of Levene's Test Output:

• Levene Statistic (Based on Mean) = 0.992

• Significance (Sig.) = 0.322

Since p > 0.05, we fail to reject the null hypothesis of Levene's Test. This means there is no

significant difference in variances across the groups (the customer segments).

Therefore, we can assume that the variances are equal.

Interpretation of One-Way ANOVA Output:

ANOVA Table:

• Sum of Squares (Between Groups) = 19.778

• Sum of Squares (Within Groups) = 725.324

• F-Statistic = 1.295

• Significance (Sig.) = 0.279

Key Interpretation:

• The F-statistic (1.295) is calculated by dividing the Mean Square Between Groups

(9.889) by the Mean Square Within Groups (7.635).

• The p-value (Sig.) for the ANOVA is 0.279, which is greater than 0.05.
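
The arithmetic behind the F-statistic can be checked directly from the table values. The sketch below does so in Python; note that the degrees of freedom are not stated in the text and are inferred here from the reported sums of squares and mean squares, so they should be treated as an assumption.

```python
# Values reported in the ANOVA table above.
ss_between = 19.778
ss_within = 725.324

# Assumption: degrees of freedom inferred from the reported mean squares
# (19.778 / 9.889 = 2 and 725.324 / 7.635 is about 95); the text does not state them.
df_between = 2
df_within = 95

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"MS between = {ms_between:.3f}")
print(f"MS within  = {ms_within:.3f}")
print(f"F          = {f_stat:.3f}")
```

Under these inferred degrees of freedom, the recomputed mean squares and F-statistic agree with the reported values to three decimal places.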

Conclusion Based on the ANOVA Output:

• Since the p-value (0.279) is greater than 0.05, we fail to reject the null hypothesis (H₀).

o Null Hypothesis (H₀): There are no significant differences in totalOrders across

the different customerSegment groups.

o Alternative Hypothesis (H₁): There are significant differences in totalOrders

across the different customerSegment groups.

• Interpretation: There is no statistically significant difference in the total number of

orders between the customer segments.

Final Conclusion:

• The results of the One-Way ANOVA indicate that customer segment does not have a

significant effect on the total number of orders.

Hypothesis Four:

Figure 28

Interpretation of the Pearson Correlation Output:

Correlation Table:

• Pearson Correlation between creditLimit and totalSpent = 0.839

• Significance (Sig.) < 0.001 (displayed as 0.000)

• N (Number of observations) = 55

Hypotheses Recap:

• Null Hypothesis (H₀): There is no relationship between the credit limit and the amount

spent.

• Alternative Hypothesis (H₁): There is a relationship between the credit limit and the

amount spent.

Key Interpretation:

1. Pearson Correlation Coefficient (0.839):

o The correlation coefficient of 0.839 indicates a strong positive relationship

between the credit limit and the amount spent. As the credit limit increases, the

amount spent tends to increase as well.

2. Significance (p < 0.001, displayed as 0.000):

o Since the p-value is less than 0.05, the result is statistically significant. This

means we reject the null hypothesis (H₀) and conclude that there is a relationship

between the credit limit and the amount spent.

3. Correlation Strength:

o The strong positive correlation (0.839) suggests that the two variables, creditLimit

and totalSpent, are strongly related.

Conclusion:

Based on the Pearson correlation result, we reject the null hypothesis and accept the

alternative hypothesis. This indicates that there is a significant positive relationship between the

credit limit and the amount spent.
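
The following sketch shows how a Pearson coefficient is computed and how its significance can be assessed through the usual t statistic with n - 2 degrees of freedom. The function names and the small sample of creditLimit and totalSpent pairs are hypothetical; only r = 0.839 and N = 55 are taken from the output above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: no linear relationship, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Hypothetical creditLimit / totalSpent pairs (illustration only).
credit_limit = [20000.0, 50000.0, 80000.0, 110000.0]
total_spent = [15000.0, 60000.0, 70000.0, 120000.0]
print(f"r on the toy data = {pearson_r(credit_limit, total_spent):.3f}")

# Reported values from the correlation table: r = 0.839, N = 55.
print(f"t for the reported r = {t_statistic(0.839, 55):.2f}")
```

For the reported values the t statistic is a little above 11 on 53 degrees of freedom, far beyond the two-tailed 5% critical value of roughly 2, which is consistent with the very small p-value in the table.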

Chapter 7 – Model Evaluation

The analysis presented in hypothesis one provides a rigorous evaluation of the sales

performance data across three countries: France, Spain, and the USA. The evaluation is carried out

using two key statistical tests - Levene's Test for Homogeneity of Variances and One-Way

ANOVA.

Levene's Test is used to assess the assumption of equal variances across the three country

groups. The results indicate that the p-values for all the test variants (Mean, Median, Median with

adjusted df, and Trimmed Mean) are greater than the significance level of 0.05. This means the

null hypothesis of equal variances cannot be rejected, and the assumption of homogeneity of

variances is met.

The One-Way ANOVA is then employed to evaluate whether there are any significant

differences in sales performance (lineTotal) between the three countries. The analysis reveals an

F-statistic of 0.160 and a corresponding p-value of 0.852, which is greater than the 0.05

significance level. Therefore, the null hypothesis of no significant difference in sales performance

between the countries cannot be rejected.

In summary, the model evaluation based on these statistical tests leads to the following

conclusions:

• The assumption of homogeneity of variances is met, indicating the appropriateness of

proceeding with the ANOVA analysis.

• The ANOVA results show no statistically significant difference in sales performance

between France, Spain, and the USA.

These findings suggest that, based on the given sample, sales performance is consistent across the

three countries, and no country demonstrates a statistically higher or lower performance than the

others.

The strengths of this analysis lie in the rigorous application of well-established statistical

methods, Levene's Test and One-Way ANOVA, to assess the underlying assumptions and draw

conclusions about the sales performance data. The use of these standard benchmarks and criteria

provides a robust evaluation of the model's performance in addressing the research question.

Overall, the model evaluation presented in the document provides a solid foundation for

understanding the sales performance dynamics across the three countries, while also highlighting

the need for ongoing monitoring and analysis to fully address the research question.

For hypothesis two, we employ Levene's Test for Homogeneity of Variances and Welch's

ANOVA to address the research question.

Levene's Test is used to evaluate the assumption of equal variances across the product line groups.

The results indicate that the p-value is less than the significance level of 0.05, rejecting the null

hypothesis of equal variances. This violation of the homogeneity of variances assumption justifies

the use of Welch's ANOVA, a more robust alternative to the standard ANOVA.

Welch's ANOVA is then applied to examine the differences in total sales trends

between the various product lines. The analysis reveals a Welch statistic of 13.575 with a

corresponding p-value below 0.001 (displayed as 0.000), which is less than the 0.05 significance level. This leads to the

rejection of the null hypothesis, indicating that there is a statistically significant difference in sales

performance across the product lines.

The strengths of this model evaluation lie in the rigorous application of established

statistical methods, Levene's Test and Welch's ANOVA, to assess the underlying assumptions and

draw conclusions about the sales trends. The use of Welch's ANOVA, which is more appropriate

when the homogeneity of variances assumption is violated, provides a robust and reliable analysis.

However, it is important to note that the implications of these findings are limited to the specific

dataset and context provided. Further research may be necessary to explore the potential drivers

or factors influencing the observed differences in sales performance across product lines.

Additionally, the discussion could be strengthened by considering the practical significance of the

results and their potential implications for product management, pricing strategies, or resource

allocation decisions.

Overall, the model evaluation provides a solid foundation for understanding the sales trends

across different product lines, while also highlighting the need for ongoing monitoring and analysis

to fully address the research question.

Hypothesis three is analyzed through a thorough evaluation of the differences in total

orders across various customer segments using One-Way ANOVA.

Key elements of the model evaluation:

• Levene's Test for Homogeneity of Variances:

✓ The Levene Statistic of 0.992 and a corresponding p-value of 0.322 (greater than 0.05)

indicate that the assumption of equal variances across the customer segments is met.

✓ This justifies the use of the standard One-Way ANOVA, as the homogeneity of variances

assumption is not violated.

• One-Way ANOVA Results:

✓ The ANOVA table shows an F-statistic of 1.295 and a p-value of 0.279, which is greater

than the significance level of 0.05.

✓ This means the null hypothesis of no significant differences in total orders across customer

segments cannot be rejected.

The strengths of this model evaluation lie in the rigorous application of Levene's Test and One-

Way ANOVA, which are well-established benchmarks for assessing the equality of variances and

differences between group means, respectively.

The clear articulation of the hypotheses, the interpretation of the Levene's Test and ANOVA

results, and the final conclusion provide a comprehensive and statistically sound evaluation of the

model's performance. However, the analysis is limited to the specific dataset and customer segment

groupings provided. The implications of the findings may not directly translate to other contexts

or customer segmentation approaches without further validation.

Additionally, while the ANOVA results indicate no statistically significant differences in total

orders across the customer segments, there may still be practical or business-relevant differences

that warrant further investigation. The analysis could be strengthened by considering the

magnitude of the differences, even if they do not meet the statistical significance threshold.
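
One common way to express the magnitude of the differences is eta squared, the share of total variation attributable to the grouping factor. The sketch below computes it from the sums of squares reported for hypothesis three; the choice of eta squared as the effect-size measure is an illustration and not part of the original analysis.

```python
# Sums of squares reported in the One-Way ANOVA table for hypothesis three.
ss_between = 19.778
ss_within = 725.324

# Eta squared: proportion of total variation explained by customer segment.
eta_squared = ss_between / (ss_between + ss_within)

print(f"eta squared = {eta_squared:.3f}")
```

The result is roughly 0.027, meaning customer segment accounts for under 3% of the variation in totalOrders in this sample. By commonly used rules of thumb that is a small effect, which is in line with the non-significant F-statistic.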

In conclusion, the model evaluation demonstrates a robust and rigorous assessment of the

differences in total orders across customer segments. The findings provide valuable insights, but

their practical implications should later be considered within the broader context of the Classic

Models company.

For hypothesis four, the relationship between credit limit and total amount spent was tested using

Pearson correlation analysis.

The key findings from the model evaluation are:

• Pearson Correlation Coefficient: The analysis reveals a strong positive correlation of 0.839

between credit limit and total amount spent. This indicates a significant positive

relationship between the two variables, suggesting that as credit limit increases, the total

amount spent tends to increase as well.

• Statistical Significance: The reported p-value (displayed as 0.000, i.e. p < 0.001) is below the standard significance

level of 0.05, allowing the rejection of the null hypothesis. This means the observed

correlation is statistically significant, providing strong evidence that there is a meaningful

relationship between credit limit and total amount spent.

• Correlation Strength: The strong positive correlation coefficient of 0.839 signifies a robust

relationship between the two variables. This suggests that the credit limit is a strong

predictor of the total amount spent by customers.
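
To put a number on how strong a predictor the credit limit is, the coefficient can be squared to give the proportion of variance in one variable that is linearly associated with the other. This is a small worked calculation using only the reported r; it assumes a simple linear relationship between the two variables.

```python
# Pearson correlation reported for creditLimit and totalSpent.
r = 0.839

# Coefficient of determination for a simple linear relationship.
r_squared = r ** 2

print(f"r squared = {r_squared:.3f}")
```

The value is about 0.70, so roughly 70% of the variation in totalSpent is linearly associated with creditLimit in this sample. This describes association only and does not show that raising credit limits would increase spending.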

The strengths of this model evaluation lie in the application of the well-established Pearson

correlation analysis, which is a widely recognized benchmark for assessing the linear relationship

between two variables. The clear articulation of the hypotheses, the interpretation of the correlation

coefficient, and the assessment of statistical significance provide a comprehensive and rigorous

evaluation of the model's performance.

While the results indicate a strong positive relationship between credit limit and total amount spent,

further research would be needed to infer the underlying drivers and mechanisms behind this

relationship. Additionally, the analysis is limited to the specific dataset provided, and the

implications may not generalize to different contexts or populations without additional validation.

Overall, the model evaluation presented in the document provides a robust and statistically sound

assessment of the relationship between credit limit and total amount spent. The findings can serve

as a valuable foundation for further research, customer segmentation, credit risk management, or

targeted marketing strategies.

Conclusion

This comprehensive analysis examined four key hypotheses using rigorous statistical methods to

gain insights into the sales performance and customer behavior patterns within the Classic Models

company.

In Hypothesis one, the One-Way ANOVA analysis revealed no statistically significant differences

in sales performance across the France, Spain, and USA country groups. This suggests that sales

performance is consistent across these regions, with no country demonstrating a higher or lower

performance than the others.

Hypothesis two explored differences in sales trends across various product lines. By leveraging

Welch's ANOVA to account for unequal variances, the analysis found a significant difference in

total sales trends between the product lines. This indicates that sales performance varies

considerably depending on the specific product offering.

The evaluation of Hypothesis three utilized One-Way ANOVA to examine differences in total

orders across customer segments. The results showed no statistically significant differences,

implying that the customer segment does not have a substantial effect on the total number of orders

placed.

Finally, Hypothesis four assessed the relationship between credit limit and total amount spent using

Pearson correlation analysis. The strong positive correlation coefficient of 0.839, along with the

statistically significant p-value, provided evidence of a meaningful positive relationship between

these two variables.

Overall, the statistical analyses presented in this document offer valuable insights into the

company's sales dynamics and customer behaviors. The rigorous application of well-established

benchmarks, such as Levene's Test, One-Way ANOVA, Welch's ANOVA, and Pearson correlation,

has delivered a comprehensive and reliable evaluation of the developed models. These findings

can inform strategic decision-making, product management, credit risk assessment, and targeted

marketing initiatives within the Classic Models company.
