BA Report
Context Setting
Nestlé Pakistan functions as a division of the multinational food and beverage corporation Nestlé
S.A., based in Switzerland. After acquiring Milkpak Ltd. in 1988, Nestlé Pakistan began
nationwide operations. The company's business strategy in Pakistan is centered on
offering premium food and drink items while supporting the national economy and culture.
Dairy, infant nutrition, bottled water, and other food items that are suited to regional preferences
and dietary requirements are all part of Nestlé Pakistan's product line (Nestlé Pakistan, 2023).
Nestlé Professional is the division of Nestlé dedicated to serving Pakistan's out-of-home food
and beverage sector. It supplies food and beverage solutions to catering services, hotels,
restaurants, and other foodservice businesses nationwide. Its products, which are made
specifically for professional use in the Pakistani market, include coffee systems, beverage
solutions, culinary products, and dessert solutions (Nestlé Professional Pakistan, 2023).
The Nestlé Professional department's Area Sales Manager (ASM) provided us with the data, which
includes detailed information on trade assets in the Faisalabad area, one of Pakistan's largest
cities. (Images of the trade assets are included in the appendix.)
Scope of Topic
The dataset under analysis is the Nestlé Pakistan Physical Verification list, which
combines data from two sources: trade asset data from SAP and sales data from SalesFlo. These
datasets are linked by a unique Asset code. The data was merged, converted to CSV format, and
then loaded into R for analysis.
Research Questions
RQ1- What variables, such as asset type and manager, significantly influence sales in the
Pakistani market?
RQ2- How can missing values (NA) in the dataset be effectively addressed using supervised and
unsupervised machine learning techniques?
Literature pertaining to the highlighted issue
Direct academic literature is not relevant because this dataset is business-specific and used
internally by Nestlé Pakistan. Nonetheless, the dataset fulfills essential internal purposes:
Finding missing assets in the Pakistani market; auditing trade assets in various Pakistani regions;
and evaluating sales performance among various managers in the nation.
1. Data Preprocessing: Taking into account the unique characteristics of data collection in
Pakistan, this includes merging datasets, managing missing values, and cleaning data.
2. Exploratory Data Analysis: Examining how factors relate to one another and how they affect
sales in Pakistan.
3. Unsupervised/Supervised Learning: We used KNN, Naive Bayes, association rule mining, and
decision trees to model the Manager variable effectively.
Data Selection
The Nestlé Pakistan Physical Verification (PV) dataset is perfect for our study because of its
broad and varied structure, which offers crucial insights into the financial and operational aspects
of trade assets. This dataset effortlessly connects SAP trade asset data with SalesFlo sales
performance data using a unique Asset Code. By integrating monthly sales data, asset-specific
information, and managerial oversight into a single dataset, it provides a thorough picture of
trade assets in the Faisalabad region.
This presents a unique opportunity to employ machine learning approaches for imputation,
which can help develop more accurate models for analysis, even though the dataset contains a
large number of missing values. These imputation techniques not only increase the dataset's usability
but also demonstrate how advanced machine learning techniques can be used in practice to solve
real-world data issues. The dataset's integrated structure allows for the comparison of audit
findings with financial performance, providing valuable insights into asset management, sales
trends, and regional performance. It is a helpful instrument for addressing significant business
difficulties and optimizing asset use due to its practical significance.
For handling NA values, we will employ supervised and unsupervised methods such as KNN, Naive Bayes, association rule mining, and decision trees.
Data Cleaning
The data cleaning procedure comprised several steps to ensure the dataset was accurate,
consistent, and ready for analysis.
Understanding the dataset and spotting possible problems were the main goals of this first round.
A summary of each column's distribution, mean, and range of values was given via summary
statistics. To find data gaps, missing values were quantified. Duplicate rows were found to make
sure every entry was unique, and an overview of the dataset's first rows provided information
about its composition and structure. Additionally, data types were reviewed to locate
discrepancies, such as dates stored as numeric values. Examining unique values, especially for
categorical columns like "Distribution" and "Verification," helped understand dataset variability.
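The inspection steps above can be sketched as follows. The report's analysis was carried out in R on the full PV dataset; this Python fragment applies the same checks (missing-value counts and duplicate detection) to a tiny hypothetical extract that reuses the report's column names.

```python
import csv
import io

# Tiny hypothetical extract with the report's column names (not real data).
raw = """Asset.Code,Verification,Manager,Acquisition.Cost
A001,Verified,Ali Azeem,150000
A002,,Umar Manzar,98000
A001,Verified,Ali Azeem,150000
A003,Not Found,,120000
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Quantify missing (empty) values per column, mirroring the NA audit step.
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}

# Flag duplicate rows so every entry can be confirmed unique.
seen, duplicates = set(), 0
for r in rows:
    key = tuple(r.values())
    duplicates += key in seen
    seen.add(key)
```

On this toy extract the audit finds one missing Verification value, one missing Manager value, and one duplicate row, exactly the kinds of gaps the first cleaning pass was designed to surface.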
Data Manipulation
New columns, Average.Sales and Total.Sales, were generated to compile and examine sales
performance over the seven-month period. To make the analysis clearer, these columns were
repositioned for better visibility.
The Manager column was derived by combining ASM and Res.Per, using logical rules to reduce
missing information. If both columns matched, the value was kept; otherwise, the non-missing
value was used. By giving Res.Per values priority, ambiguous references such as "Ayesha/Umar"
were eliminated, simplifying managerial data for analysis.
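The coalescing rules just described can be expressed as a small function. This is a hypothetical Python paraphrase of logic that was actually implemented in R; the function name is ours.

```python
def combine_manager(asm, res_per):
    """Coalesce ASM and Res.Per into one Manager value.

    Rules paraphrased from the cleaning step above: Res.Per takes
    priority when present; otherwise an unambiguous ASM value is used;
    ambiguous references like "Ayesha/Umar" are dropped and left as NA.
    """
    if res_per:
        return res_per
    if asm and "/" not in asm:
        return asm
    return None  # left as NA for later imputation
```

Applied row by row, this removes the "Ayesha/Umar" ambiguity while preserving every unambiguous manager assignment.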
Data from two sources had to be combined for the project: SAP's trade asset data and SalesFlo's
sales performance data. To guarantee an accurate merge, these datasets were connected using the
unique Asset.Code. Excel's XLOOKUP function was initially used to integrate the data, and the
combined dataset was saved as a CSV file for further analysis in R.
Duplicate columns such as Serial.ids and Found. were eliminated to increase uniformity and
usability. Both sources used the same column names and data types, and NA was used to indicate
missing or conflicting values. To make numerical columns compatible with analytical
procedures, they were transformed.
A thorough examination of asset performance was made possible by this integration, which
produced a single dataset that combined sales performance measures and physical asset
verification. This approach facilitated trend identification and generated actionable insights for
strategic decision-making.
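The keyed join described above can be sketched as follows. The report's merge used Excel XLOOKUP and then R; this Python dictionary join reproduces the same Asset.Code lookup on illustrative rows, with missing sales records becoming None (NA).

```python
# Trade asset data from SAP, keyed by Asset.Code (toy rows, not real data).
sap = {
    "A001": {"Asset.Description": "NESCAFE FTP 30", "Acquisition.Cost": 150000},
    "A002": {"Asset.Description": "EZ Care Mini-Quattro", "Acquisition.Cost": 98000},
}
# Sales performance data from SalesFlo, keyed by the same Asset.Code.
salesflo = {
    "A001": {"Total.Sales": 420000},
    # A002 has no sales record; its Total.Sales becomes None (NA).
}

# Left join: every SAP asset keeps its row; sales fields fill in where found.
merged = {
    code: {**asset, **salesflo.get(code, {"Total.Sales": None})}
    for code, asset in sap.items()
}
```

The same pattern scales to the full dataset: assets without a matching sales record surface immediately as NA values, which is exactly what the later imputation steps address.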
Visualizations
Seasonal and performance-based differences in sales among managers from April to October are
depicted in the "Monthly Sales Trend of Managers" graph. One noteworthy finding is that Ali
Azeem's sales have been steadily increasing, peaking at Rs. 6M in October. Given that sales are
predicted to expand further during the winter, when demand for particular beverages (such as
coffee or hot drinks) may increase, this trend indicates that Ali Azeem's methods or the market
demand in his region correspond well with seasonal or product-specific needs.
However, Umar Manzar's sales fluctuated, showing steep declines in May and September
followed by a rebound in October. This could indicate problems with market stability,
operational inconsistencies, or external factors like seasonal shifts in consumer demand. A more
thorough analysis of these changes is required in order to comprehend and address the causes
generating this instability.
Shahiq Ahmad Khan's and Ayesha Iftikhar's sales performance is consistent but modest.
Shahiq's steady sales of over Rs. 2M demonstrate operational consistency, while Ayesha's sluggish
growth suggests untapped potential. Given that sales are steady but stagnant, the ambiguous
"Ayesha/Umar" assignment underscores the need for more precisely defined roles and
responsibilities to promote growth.
Overall, the graph's findings indicate that sales are impacted by the seasons, with a forecasted
rise in the winter due to increased demand for Nestlé Professional's hot beverage products. This
demonstrates how important it is to adjust strategies to capitalize on seasonal opportunities while
controlling volatility and improving manager consistency.
Revenue generation for assets aged 0-5 years exhibits a slight negative trend as acquisition costs
increase. This suggests that younger, more costly assets may initially struggle to generate a
profit. The clustering of data points at lower costs with varied revenue indicates that less
expensive assets in this category do not have an immediate return on investment. These
trends could be the consequence of underutilization or ineffective use of newly acquired assets.
However, for assets that are five years or older, average revenue and purchase costs have a
positive correlation. The durability and long-term value of expensive equipment are highlighted by
the fact that long-term, high-cost assets usually yield more income over time. This trend lends
credence to the notion that costly assets are more profitable over the long run since they last
longer and require fewer replacements.
The disparity between the two age groups emphasizes how crucial it is to make strategic
investments in long-lasting, superior assets. Although costly assets might seem to offer few
short-term benefits, their long-term performance suggests that they are ultimately more
beneficial to the business. The significance of striking a balance between short-term and
long-term asset management techniques is highlighted by this realization.
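The cost-versus-revenue trends described for the two age groups amount to within-group correlations. As a sketch, the Pearson coefficient can be computed by hand; the figures below are synthetic and only echo the direction of the reported trends, not actual report values.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Synthetic cost/revenue pairs echoing the two observed patterns.
young_cost, young_rev = [100, 150, 200, 250], [50, 45, 42, 40]  # 0-5 years
old_cost, old_rev = [100, 150, 200, 250], [40, 55, 70, 90]      # 5+ years

r_young = pearson(young_cost, young_rev)  # slightly negative
r_old = pearson(old_cost, old_rev)        # positive
```

A negative coefficient for the younger group and a positive one for the older group would reproduce the disparity discussed above.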
Machine Learning
Association Rules
The Apriori method was used to identify association rules centered on the Manager variable on
the right-hand side (RHS). After the dataset was transformed into a transactional format, the
algorithm was run with lower thresholds for support (0.015) and confidence (0.5) in order to
capture a larger range of patterns. The objective was to comprehensively understand the
relationships within the data by identifying rules applicable to all managers.
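The core of this procedure, computing support and confidence for candidate rules with Manager on the RHS, can be sketched by hand. The report used a proper Apriori implementation on the transactional dataset; the transactions below are toy item sets, and the thresholds match the ones stated above (support 0.015, confidence 0.5).

```python
from itertools import combinations

# Toy transactions in "feature=value" item form (illustrative only).
transactions = [
    {"Platform=Soluble", "Verification=Verified", "Manager=Shahiq"},
    {"Platform=Soluble", "Verification=Verified", "Manager=Shahiq"},
    {"Platform=Cold", "Verification=Verified", "Manager=Ali"},
    {"Platform=Cold", "Manager=Ali"},
    {"Platform=Chiller", "Manager=Umar"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
managers = [i for i in items if i.startswith("Manager=")]
others = [i for i in items if not i.startswith("Manager=")]

rules = []  # (lhs_items, rhs_manager, confidence)
for rhs in managers:
    for n in (1, 2):
        for lhs in combinations(others, n):
            lhs = frozenset(lhs)
            if support(lhs) == 0:
                continue
            conf = support(lhs | {rhs}) / support(lhs)
            if support(lhs | {rhs}) >= 0.015 and conf >= 0.5:
                rules.append((tuple(sorted(lhs)), rhs, round(conf, 2)))
```

Each surviving rule reads "if the LHS items appear, the RHS manager appears with at least 50% confidence"; Apriori's contribution in practice is pruning this candidate search efficiently.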
Despite these adjustments, the generated rules primarily pertain to only two managers, Shahiq
and Ali. Even with reduced support and confidence requirements, no rules were identified for
other managers, suggesting a limitation in the dataset or algorithm’s ability to generalize across
all managerial categories. The scatter plot and graph visualization further illustrate that the
identified rules are concentrated within these two managers. (See Appendix)
While the Apriori method demonstrated utility in identifying strong rules for Shahiq and Ali, its
overall applicability is limited in this context. However, it proves useful in addressing missing
values (NA) or understanding patterns specifically for these two managers.
KNN
The graph illustrates the relationship between model accuracy and k (the number of neighbors),
showing that k=1 yields the best accuracy, which then steadily declines as k increases. According to this
pattern, the dataset can contain unique local clusters that are best categorized by utilizing the
nearest neighbor method. Accuracy is decreased as k rises because the addition of farther-flung
neighbors introduces noise or overlapping patterns from other classes. The technique shows that
although KNN may be good at capturing localized patterns, it may not be able to handle datasets
with uneven or unclear class separations. This suggests that while manager-specific attributes
may be clustered in some cases, they are not generally applicable.
KNN Cross Validation
To evaluate the performance of the KNN algorithm, 5-fold cross-validation was applied. This
approach provided a reliable estimate of the model’s accuracy by dividing the dataset into five
subsets, using each subset as a validation set while training on the rest of the data. Numerical
features were standardized to ensure fair distance calculations, and the target variable, Manager,
was included after preprocessing.
The analysis tested k-values ranging from 1 to 15, in odd numbers only. The highest
cross-validation accuracy of 65 percent was observed at k = 1. As k increased, accuracy steadily
declined, with a slight anomaly at k = 11, where accuracy briefly improved but remained below
the peak value at k = 1. These results suggest that smaller k-values, particularly k = 1, provided
the most precise predictions by focusing on the nearest neighbor, while larger values reduced
accuracy by introducing less relevant neighbors into the decision process. The findings confirm k
= 1 as the optimal value for this dataset, achieving the highest predictive accuracy.
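The procedure above can be sketched from scratch. The report's cross-validation was run in R on the real dataset; this Python version applies the same scheme (5 folds, odd k from 1 to 15, majority vote over nearest neighbors) to synthetic, well-separated data, so its accuracy numbers are illustrative only.

```python
import random
from collections import Counter

# Synthetic 2-feature points in three separable "manager" classes.
random.seed(0)
data = [([random.gauss(m, 1.0), random.gauss(m, 1.0)], label)
        for m, label in [(0, "Ali"), (4, "Shahiq"), (8, "Umar")]
        for _ in range(20)]
random.shuffle(data)

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (squared distance)."""
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return Counter(label for _, label in nearest[:k]).most_common(1)[0][0]

def cv_accuracy(k, folds=5):
    """5-fold cross-validation: each fold is held out once for testing."""
    n, correct = len(data), 0
    for f in range(folds):
        test = data[f * n // folds:(f + 1) * n // folds]
        train = data[:f * n // folds] + data[(f + 1) * n // folds:]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / n

accuracies = {k: cv_accuracy(k) for k in range(1, 16, 2)}
```

On real data the features would be standardized first, as described above, so that no single column dominates the distance calculation.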
Naive Bayes
Based on Bayes' theorem, Naive Bayes (NB) is a probabilistic machine learning technique that is
frequently applied to classification tasks. It functions well with both numerical and categorical
data and assumes that features are independent. In this assignment, NB achieved an accuracy of
85%, successfully classifying sales data for managers like Ali Azeem, whose traits were
distinctive and consistently recognized. Misclassifications, especially for Ayesha Iftikhar and
Umar Manzar, however, point to possible class imbalances in the dataset as well as overlapping
feature distributions.
The model finds patterns in the data by identifying which features are most correlated with
specific managers. It illustrates how some managers have distinct sales-related attributes, while
others have overlapping traits that cast doubt on predictions. By providing useful insights into
manager identification, this analysis facilitates better decision-making in circumstances
where values are missing.
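A from-scratch sketch of categorical Naive Bayes makes the mechanics concrete. The report's model was fitted in R on the real features; the rows below are toy data, and the Laplace smoothing constant is a nominal choice of ours.

```python
from collections import Counter, defaultdict

# Toy (features, manager) rows; illustrative only.
rows = [
    ({"Platform": "Soluble", "Verification": "Verified"}, "Shahiq"),
    ({"Platform": "Soluble", "Verification": "Verified"}, "Shahiq"),
    ({"Platform": "Cold", "Verification": "Verified"}, "Ali"),
    ({"Platform": "Cold", "Verification": "Not Found"}, "Ali"),
    ({"Platform": "Chiller", "Verification": "Verified"}, "Umar"),
]

priors = Counter(label for _, label in rows)
counts = defaultdict(Counter)  # (label, feature) -> value frequencies
for feats, label in rows:
    for f, v in feats.items():
        counts[(label, f)][v] += 1

def predict(feats):
    """Pick the manager maximizing P(manager) * prod P(feature | manager)."""
    best, best_score = None, float("-inf")
    for label, n in priors.items():
        score = n / len(rows)
        for f, v in feats.items():
            # Laplace smoothing, assuming a nominal 3 values per feature.
            score *= (counts[(label, f)][v] + 1) / (n + 3)
        if score > best_score:
            best, best_score = label, score
    return best
```

The independence assumption shows up directly in the product of per-feature probabilities; overlapping feature distributions between managers shrink the gap between these products, which is the mechanism behind the misclassifications noted above.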
Decision Tree
Decision trees are supervised machine learning techniques used for classification and regression
tasks. By splitting the dataset into subgroups based on feature values, a decision tree produces a
tree-like structure in which each node represents a decision rule. The decision tree used
in this project was built to classify managers in the Nestlé sales dataset with an accuracy of 88%.
It splits the data hierarchically based on parameters like asset descriptions and distribution
networks in order to effectively capture patterns and correlations within the dataset.
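How a tree chooses a split can be sketched with a Gini-impurity calculation. This is an illustrative Python fragment, not the report's R model; the toy rows merely echo the finding, described below, that the Distribution feature separates Ali Azeem cleanly.

```python
from collections import Counter

# Toy (features, manager) rows; illustrative only.
rows = [
    ({"Distribution": "M Afzal & Brothers-Bhera", "Platform": "Cold"}, "Ali Azeem"),
    ({"Distribution": "M Afzal & Brothers-Bhera", "Platform": "Soluble"}, "Ali Azeem"),
    ({"Distribution": "Other", "Platform": "Soluble"}, "Shahiq"),
    ({"Distribution": "Other", "Platform": "Cold"}, "Umar"),
]

def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(feature):
    """Impurity reduction achieved by splitting on one feature."""
    parent = gini([label for _, label in rows])
    children = {}
    for feats, label in rows:
        children.setdefault(feats[feature], []).append(label)
    weighted = sum(len(ls) / len(rows) * gini(ls) for ls in children.values())
    return parent - weighted

# The tree's first split is the feature with the largest gain.
best = max(["Distribution", "Platform"], key=split_gain)
```

On these rows, splitting on Distribution yields a pure Ali Azeem node, so it wins the first split, mirroring the behavior of the fitted tree described below.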
The decision tree offers an explanation for the data by emphasizing key traits that distinguish
managers. The first split, which is based on the Distribution characteristic, clearly separates Ali
Azeem's data. This indicates that M Afzal & Brothers-Bhera, his distribution channel, is unique
and frequently associated with him. Other divisions demonstrate that specific asset descriptions,
such as those for the NESCAFÉ FTP 30 and EZ Care Mini-Quattro machines, are crucial in
distinguishing between Umar Manzar and Ayesha Iftikhar. Within some branches, the tree also
emphasizes asset codes as a secondary differentiator. Certain codes may be assigned to particular
managers, although this cannot be confirmed from the data.
According to the findings, several characteristics—such as asset descriptions and
distribution—have a high correlation with particular managers and are therefore important
differentiators. For managers like Ali Azeem and Shahiq Ahmad Khan, the dataset's clear
patterns are responsible for the comparatively high accuracy. However, there may be some
classification issues due to overlapping traits for other managers, such as Ayesha Iftikhar and
Umar Manzar. Less clear divisions in the data may result from this overlap, which may be caused
by identical assets, common duties, or regions covered. Hence, it is a good model overall, but it
remains weak at differentiating between Ayesha and Umar. Of the models tested, it is the best
suited for imputing the NA values.
Recommendations
Consolidate Data Sources: Data is currently gathered from two different platforms: SalesFlo for
sales performance data and SAP for trade asset data. By combining these into a single platform,
inconsistencies and redundancies will be removed, guaranteeing that data is reliably collected
and readily available. Better analysis and decision-making will be made possible by this
integration, which will also decrease manual labor, increase efficiency, and offer a single
perspective of operations.
Remove Lost or Destroyed Assets' Entries: Assets with missing values and unconfirmed statuses
have likely been lost or destroyed. Since these entries don't reflect active, functional assets,
keeping them can result in skewed analysis and untrustworthy insights.
Assets must be regularly verified to ensure their validity before such entries are excluded. By
focusing on active assets, this approach not only makes the dataset more reliable but also ensures
that resources and analysis efforts are directed toward pertinent data, which supports better
operational efficiency and decision-making.
Emphasis on Asset Utilization Efficiency: Long-term profitability trends show that durable,
high-quality assets require less maintenance or replacement and generate larger returns. By
prioritizing these assets, Nestlé Pakistan can reduce operating costs, increase asset reliability, and
ultimately achieve better financial results.
Limitations
Restricted Regional Coverage: The data set only contains information from the Faisalabad
(FSD) region, which comprises about 700 observations. While this is a great starting point for
identifying patterns and issues in the data, the results have limited generalizability across
Pakistan. Expanding the dataset to include all regions would provide for a more complete view
of asset performance and sales trends.
Effects of Data Cleaning: By removing unnecessary columns and missing values, data cleaning
techniques significantly reduced the number of observations that could be examined. Although
this process was necessary to ensure the dataset's reliability and integrity, it may have
inadvertently overlooked potentially useful data elements that could have impacted the results.
Exclusion of Zero Sales Observations: The asset cost vs average sales visualization study did
not include observations with zero sales. Even though this omission makes it simpler to focus on
assets that are in use, it leaves out information that may be essential to understanding
underutilization or operational inefficiencies related to the reasons why certain assets generate no
revenue.
Missing Data Imputation: Even though missing data was addressed, not all gaps were filled
because of the constraints in completely adopting advanced imputation techniques with
comparable managers like Ayesha and Umar. The quality and thoroughness of the study could be
impacted by this limitation.
Single-Year Data: The sample may not fully represent changes in sales over multiple years
because it only covers seven months. A longer time frame would allow for better examination of
sales trends.
Appendix
Variable Dictionary
1. Asset.Code: Unique identifier for each asset, used to link datasets and track specific
assets across systems.
2. Verification: Status of physical verification for the asset, ensuring its existence and
condition.
3. Distribution: The distribution channel or vendor associated with the asset.
4. Manager: Name of the manager responsible for the asset or related sales.
5. Asset.Description: Detailed description of the asset, including product type or model
(e.g., NESCAFÉ machines).
6. Type: Abbreviated form of the asset description.
7. Platform: The broad asset category, namely Soluble, Cold, or Chiller.
8. Date.of.Acquisition: The date when the asset was acquired.
9. Acquisition.Cost: Initial cost of the asset.
10. Dep.for.Year: Depreciation value for the asset for the current year.
11. Accumul.Dep: Total accumulated depreciation for the asset.
12. Curr.Book.Val: Current book value of the asset after accounting for depreciation.
13. Cost.Center: Financial cost center associated with the asset for accounting purposes.
14. Average.Sales: Average sales generated by the asset over the reporting period.
15. Total.Sales: Total sales generated by the asset over the reporting period.
16. April.Sales: Sales generated in April.
17. May.Sales: Sales generated in May.
18. June.Sales: Sales generated in June.
19. July.Sales: Sales generated in July.
20. August.Sales: Sales generated in August.
21. September.Sales: Sales generated in September.
22. October.Sales: Sales generated in October.