Text-Based Product Matching - Semi-Supervised Clustering Approach
Alicja Martinek∗†, Szymon Łukasik∗†, Amir H. Gandomi‡

∗ NASK National Research Institute, Poland
† AGH University of Kraków, Poland
‡ University of Technology Sydney, Australia

Email: [email protected], [email protected], [email protected]

arXiv:2402.10091v1 [cs.DB] 1 Feb 2024

Abstract: Matching identical products present in multiple product feeds constitutes a crucial element of many e-commerce tasks, such as comparing product offerings, dynamic price optimization, and selecting an assortment personalized for the client. It corresponds to the well-known machine learning task of entity matching, with its own specificity, like omnipresent unstructured data or inaccurate and inconsistent product descriptions. This paper aims to present a new philosophy of product matching utilizing a semi-supervised clustering approach. We study the properties of this method by experimenting with the IDEC algorithm on a real-world dataset using predominantly textual features and fuzzy string matching, with more standard approaches as a point of reference. Encouraging results show that unsupervised matching, enriched with a small annotated sample of product links, could be a possible alternative to the dominant supervised strategy, which requires extensive manual data labeling.

Index Terms: product matching, semi-supervised learning, deep learning

I. INTRODUCTION

Online shopping and purchasing services via e-commerce platforms are constantly gaining popularity. This is due to the expanding digitalization of the retail sector and the growing utilization of advanced algorithms embedded in such platforms. According to Eurostat, the average share of European Union-based companies making a profit from e-commerce sales grew in ten years from 14% to 19% in 2021, peaking at 20.6% in 2022 [1]. The same source reports a growth of 20 percentage points in the proportion of e-shoppers among all Internet users, from 55% in 2012 to 75% in 2022 [2]. Meanwhile, in the United States, the share of online sales in the total market rose from 5.1% to 15.4% between Q1 2012 and Q2 2023 [3]. In such a setting, retailers are competing with each other not only in terms of profits, but also in the cutting-edge technologies which drive the numbers.

The aforementioned increase in the number of transactions generates high volumes of data. This is where artificial intelligence thrives the most. So far, machine learning – the common branch of AI – has been used in many different areas of e-commerce. The most popular include, but are not limited to: recommendation engines [4], creating targeted marketing campaigns [5], [6], purchase predictions [7], dynamic pricing [8], [9] and optimization of retailer resources [10]. A data-driven approach to e-commerce is beneficial for both sides of the deal. In the end, the customer receives a personalized and seamless experience whilst the retailer becomes more competitive amongst the other sellers.

Product matching, an instance of the common entity matching task, is one of the key exercises for retailers. Its goal is to match identical products from two different product feeds. This is not a trivial task given the nature of textual data. Product descriptions often require specialized background knowledge and can be characterized by several different modes. This all accounts for the importance of high-reliability product matching. Once completed, it allows for comparisons between offerings of the same product, which can be further used for dynamic price optimization and selecting a fitted assortment for the client. These actions are means of fulfilling the ultimate goal of every company - profit maximization.

The aim of this paper is to show how product matching can be achieved with a semi-supervised clustering algorithm. Such an approach makes it possible to exploit all the benefits of unsupervised methods. Keeping in mind that data labeling is expensive and time-consuming, our motivation is to gain fully from widely available data and enrich it with a smaller sample of annotated data. It has been demonstrated that the described method increases the accuracy of generated clusters [11]. Our framework also includes text-mining algorithms used for feature engineering. We focus on similarity measures as they are the go-to calculations for text comparison tasks. The main contribution of this paper is to present a new view on the product matching problem. We do not invent a new algorithm but try to show a fresh approach to handling such a task.

The paper is structured as follows. After this introduction, Section 2 gives a comprehensive review of current solutions to the product matching task. In addition, it also describes the feature engineering process. The following Section presents a deep dive into the proposed framework, delivering the details of the algorithm implementation. Preliminary results of conducted experiments are provided and thoroughly analyzed in Section 4. Finally, Section 5 covers conclusions and perspectives for possible future improvements to the proposed solution.

II. RELATED WORK

A. Product matching

Among the problems that can be solved with machine learning algorithms, product matching task is constantly
gaining importance due to the exploding amount of data from online platforms. Retailers can leverage this information to better suit their offerings.

The core of product matching lies in obtaining pairs of matching goods, based on so-called product feeds. The feeds can originate from different sources, hence discrepancies in available attributes can become an obstacle. Such differences can manifest in distinct product taxonomies, reporting prices in varied currencies, including/excluding taxes or shipping costs, or inconsistent formulation of product names. The trick lies in conscious feature engineering that takes into consideration all possible data issues, not to mention missing information.

Carefully processed data that describes pairwise relations between items of merged feeds can be used to redefine the nature of the product matching problem. At this stage, the described task becomes a binary classification exercise. The target variable in such a setting takes a value of 1 when paired products represent the same physical good, and it becomes 0 otherwise. Models used for classification tasks usually output the probability of a record belonging to the given class (in this study, meaning that the products describe the same item).

There are many approaches to solving the entity matching task. Most existing solutions are based on supervised learning methods. They include, but are not limited to, adapting XGBoost [12], using advanced natural language processing models such as BERT [13]–[16], and incorporating deep neural networks [17]–[19] or fuzzy matching [20]. There even exist attempts to incorporate Large Language Models such as ChatGPT in the task of entity matching [21], [22].

Another important group of methods takes advantage of various types of data available in on-line selling systems. The multi-modal approach uses both the textual representation of a product and images of the given good [23], [24]. An interesting bridge between the multi-modal concept and semi-supervised methodology can be found in [25], where the authors present a self-training ensemble model called GREED.

The second family of algorithms represents the unsupervised style of learning. These models can be successfully used for product matching. Existing solutions include text-mining techniques; however, they are reported to be outperformed by supervised methods [26]. The biggest advantage of unsupervised learning is that it can be performed on unlabelled data. The trade-off between performance and cost of labeling is inevitable and has to be faced by researchers and practitioners. This is the motivation to use a novel approach of semi-supervised algorithms, as they can overcome the aforementioned trade-off. The concept of constrained clustering [27]–[29] is a great example of such an algorithm and it will be described further in the paper.

B. Transforming textual data into numerical features

Correct representation of data is a key factor in modeling. Textual data, whilst comprehensible for humans, cannot be understood the same way by machines. In order to overcome this obstacle, textual data has to be transformed into its numerical counterpart. There exist a variety of methods for achieving the aforementioned goal. These methods include a bag of words algorithm, TF-IDF vectorization, and generating word embeddings. Another simple yet powerful concept revolves around similarity or distance measuring, proving to be successful in product matching [30]. Examples of popular distance metrics are:
• Levenshtein distance and its Damerau-Levenshtein extension [31] - these are calculated based on the number of edits one has to make in order to transform one string into another. They are found in a wide spectrum of applications, inter alia, in spell checking and fuzzy string matching.
• Jaccard distance [32] - represents a token-level distance that compares sets of tokens present in both strings.
• Euclidean (L2) and cosine distances - they operate on word embeddings, which are vectors generated to map words into n-dimensional space. Such a transformation allows calculating the intuitive Euclidean distance as well as the magnitude-independent cosine distance.

The content of features derived from the textual data can vary and is limited only by researchers' ingenuity. The selection of features is critical as attributes describing the data, regardless of the undertaken methodology, have a direct impact on the performance of the model, following the "garbage in, garbage out" principle.

C. Clustering

Clustering is a classic example of an unsupervised learning method. It can be implemented by the standard k-means algorithm, which generates groupings based on the distances between data points and computed centroids (centers of clusters). It reassigns the cluster numbers and adjusts centroids until stopping criteria are met. Limitations of this approach do not negatively influence our solution of product matching: the number of clusters is known a priori and is equal to 2 (matching and distinct products within a pair) and all features lie within the same range of values.

Fig. 1. Example of Must Link (solid line) and Can't Link (dashed line) constraints

An extension of the clustering algorithm – constrained clustering – allows feeding links describing relationships between
data in such a way that they are included in the cluster assignment process. There are two types of these links:
1) Must Link (ML) - pairing samples belonging to the same cluster,
2) Can't Link (CL) - defining points that should not be grouped together in one cluster.

Figure 1 presents the impact of feeding constraints into the algorithm. Note that the data point in the middle was assigned to cluster B, due to the presence of the Can't Link constraint between it and a member of a well-defined cluster A.

There are many implementations of the k-means algorithm that adopt constraints. They include COP k-means [33], the MPC k-means method, and others that can be found in the conclust package in R [34].

A more advanced proposal in the field of constrained clustering involves deep learning algorithms [28], [35]–[37]. Deep clustering solutions utilize autoencoders that learn the data representation themselves. The biggest contribution of DEC (Deep Embedded Clustering), and its improved version called IDEC, is the introduction of clustering loss. It is used during the training process to optimize the objective of assigning clusters. IDEC, as an extension, brings in additional losses that aim to include more advanced constraints in the learning phase. Those losses include instance difficulty loss, triplet loss, and global size constraint loss. Results obtained on benchmark datasets (MNIST, Fashion MNIST, Reuters) were reported to outperform other known methods [28].

D. Evaluation metrics

Appropriate measurement of model performance is as important as any other part of the machine learning pipeline. Despite accuracy being the most intuitive way of assessing performance, it can often be deceiving. This happens when the tackled problem is defined on an imbalanced dataset. It is the case in product matching, where the majority of relations between products from different feeds are of the "no-match" type. In such a scenario, other metrics have to be used in order to fully describe the model's ability to classify correctly.

One of these metrics is the F Score, often referred to as the F1 Score [38]. It builds upon the concepts of precision and recall, which are derived from the confusion matrix. It is expressed as follows:

\[ \text{precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \text{recall} = \frac{TP}{TP + FN} \tag{2} \]

\[ \text{F Score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{1}{2}(FP + FN)}, \tag{3} \]

where TP refers to True Positives, FP to False Positives and FN to False Negatives.

Precision and recall describe quantitatively how good the model is in capturing relevant signals and how much of the captured signal is relevant. In other words, precision reflects the fraction of True Positives among all samples labeled as positive. On the other hand, recall represents how many samples were positively classified compared to all positive records.

Another metric used for evaluation in this study is strictly intended for clustering problems. Rand Index describes the similarity between two clusterings [39]. It is calculated as:

\[ RI = \frac{\text{number of agreeing pairs}}{\text{number of all pairs}}. \tag{4} \]

It checks if pairs of samples are classified the same way as in the true labels. The value of RI ranges between 0 and 1, with 1 representing entirely matching labels.

III. PROPOSED ALGORITHM

Our approach is based on the novel philosophy that product matching could be treated as a semi-supervised task with the knowledge about matching and non-matching products incorporated into the clustering constraints. Improved Deep Embedded Clustering (IDEC) is used as the constrained clustering engine.

The solution proposed in this paper will be studied in the context of a standard dataset devoted to the task of product matching. The Skroutz dataset [40] contains information sourced from online shopping platforms. For this particular research, we used the "Compact Cameras" subcategory of available product classes. We decided to use only one category of items because in such a setting the problem of distinguishing the same entities is less straightforward than comparing different groups of products. Each category uses words specific to the domain, hence intragroup data should be even better for investigating the robustness of the proposed approach. On the other hand, working with one category of products more closely resembles the routine of small retailers, as they are often specialized in a given group of products and do not want to compare their goods with all available others.

In the given dataset, a single entity is described with the following fields: title, product category, and an ID used to identify the same products. Products defined in this manner are paired together via cross-join and labeled according to whether they represent the same physical item or not. The target variable was assembled based on the equality of the aforementioned IDs. Data was sampled to be imbalanced - only 25% of the 20000 generated pairs are marked as matching goods.

The textual data, in order to be comprehensible to the algorithms, has to be changed into numerical vectors. Features engineered for the clustering task are as follows:
1) Fuzzy matching based - title ratio, title partial ratio, and title token set ratio [41]. Generated features utilize the Levenshtein distance algorithm.
2) Distance metric - Jaccard distance which measures dissimilarity between two sets. It is expressed as a ratio of intersection to the union of sets of all tokens found in two product titles.

\[ \mathrm{Jaccard}(P_1, P_2) = \frac{|P_1 \cap P_2|}{|P_1 \cup P_2|}, \tag{5} \]
where $P_1$ and $P_2$ correspond to sets of tokens found within the titles. Word tokenizer divides the string describing the product into tokens.
3) Comparison of numbers found in the product titles - calculated in a similar fashion as Jaccard distance while taking only numbers into consideration. This feature proves to be useful due to its ability to compare product properties, model numbers, and technical details, which often are numerical.

As a result, a pair of goods is represented by a numerical vector of 5 elements.

Our approach in general defines the product matching problem as a classification task. Despite the unsupervised nature of clustering algorithms, knowing true labels does not disrupt the algorithm's work, whilst allowing us to evaluate the performance of unsupervised and semi-supervised learning approaches. The task of matching products is then reformulated as an exercise of deciding if a generated pair of products refers to the same entity or not. Such an approach cancels out one of the biggest shortcomings of clustering algorithms - the requirement for users to know the number of clusters prior to the calculations. In the case of the problem being solved in this paper, the number of clusters is equal to 2.

IV. EXPERIMENTAL SETTINGS AND RESULTS

In experiments run for this research, we tested the performance of the IDEC algorithm under varying sets of constraints applied to calculations. We examined the impact of increasing the amount of Must Link and Can't Link constraints separately. The third experiment tested the influence of the balance between 0-0 and 1-1 pairs in Must Link constraints. We changed only one parameter at a time in order to draw conclusions about effects directly associated with the altered setting. We ran the calculations 10 times for each set of parameters while keeping the same set of constrained pairs for those runs. Further tests included comparison with other available methods, performance on different datasets and measuring the impact of various data distributions. IDEC was run with the following parameters: batch size = 256, learning rate = 0.001, activation = ReLU, input dimension = 5, encoder dimensions = [200, 400, 800], decoder dimensions = [800, 400, 200], number of epochs = 20, ML and CL penalty = 1.

We split the dataset into train and test subsets. The fraction of matching pairs was preserved in both sets. Training data included 13400 samples, whereas the test set had 6600 records.

A. Constraints impact

Using constraints together with clustering introduces several new parameters that can have an impact on the algorithm's performance. In this research we analyzed the significance of:
1) Varying the number of Must Link Constraints - percentage numbers are presented with regard to the amount of matching pairs in the training dataset.
2) Changing the amount of Can't Link Constraints - also references the number of "Ones" present in training. Records are sampled without replacement, hence the number of Can't Link constraints cannot exceed the number of Ones.
3) Modifying the fraction of 1-1 pairs in Must Link Constraints - given that '1' means that products represent the same good and '0' denotes a lack of match within a pair, there is a possibility of defining a Must Link pair as a relation of 0-0 or 1-1. This parameter changes the balance between pairs of both types.

The preceding variables were changed in the given ranges (where numbers in brackets represent the actual number of constraints):
• Must Link Constraints: 5% (167), 10% (335), 15% (502), 20% (670), 25% (837), 30% (1005), 40% (1340), 50% (1675), 60% (2010), whilst the amount of CL was set to 10% and the fraction of 1-1 pairs to 100%;
• Can't Link Constraints: 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70% (2345), 80% (2680), 90% (3015), whilst the amount of ML was set to 50% and the fraction of 1-1 pairs to 100%;
• Shifting the fraction of 1-1 pairs: from 0% to 100% with the interval of 10 whilst ML and CL were set to 50% and 20% correspondingly.

Fig. 2. Impact of increasing the amount of Must Link Constraints

The effect that the amount of Must Link constraints has on the performance of clustering can be seen in Figure 2. Surprisingly, an increase in the amount of information about 1-1 pairs does not generate more accurate results. Furthermore, adding more constraints led to a substantial increase in the run time of the network training process.

Contrary to the ML constraints, expanding the set of Can't Link information about the pairs contributes to achieving better results. Figure 3 depicts the relationship between the metric values and the percentage of CL constraints. The lowest F Score of 0.841 was observed for 5% of constraints, whereas the highest value of 0.908 is associated with the run having 70% of them. This analysis shows that sometimes less is better. Intuition would suggest the best performance at 100%, whereas obtained numbers show that higher amounts
of information can decrease the quality of results, which is a perfect example of over-fitting. It is also important to highlight that Can't Link constraints are easier to obtain in real-world scenarios - as they could be generated automatically, without costly surveys of retail experts.

Fig. 3. Impact of increasing the amount of Can't Link Constraints

Figure 4 shows where the perfect balance between types of Must Link Constraints lies. Observation of F Score curiously shows that the best results are achieved when there are 20% of 1-1 pairs (constraints on actually matching products) and 80% of 0-0 pairs. We suspect that given the imbalanced nature of the data, increasing the signal for True Positives is treated by the model as forcing the outliers to be grouped together. Then instead of guidance on how to cluster points, the model gets explicit instructions resulting in over-fitting and therefore reducing the overall performance. This hypothesis is also supported by the results of the first experiment (Figure 2) where, given a constant balance between types of pairs, a higher amount of constraints (simply increasing the volume of 1-1 pairs) presented to the model did not improve its quality.

Fig. 4. Impact of increasing the amount of 1-1 pairs in Must Link Constraints

B. Comparison with other methods

The experiment designed in this paper aimed to investigate the advantage of semi-supervised algorithms in product matching over the traditional unsupervised and supervised learning represented by the k-means and XGBoost methods, respectively. In addition, a more modern method was used to assess the product pairs. The DeepMatcher algorithm [42] implements 3 ways of solving entity matching problems: SIF (Smooth Inverse Frequency), which is a simple, aggregate model; RNN, which takes sequence into consideration; and Attention, which is aware of both sequences and data similarity.

F Score was maximized in order to select the best threshold for DeepMatcher classification. Results selected in that manner were then compared with the best runs of IDEC, k-means and XGBoost with respect to the set of constraints discovered in the previous experiment.

Table I presents the results of the best runs in each category of varied parameters. The best run was picked with respect to the F Score as it is the most descriptive of the algorithm's performance given the task being solved. The table reports the average and standard deviation calculated over all runs with particular settings. Results demonstrate that enriching data with some additional information, namely constraints, leads to higher quality results. An increase of 0.07 in the F Score measurements was observed, compared to the k-means algorithm. Despite this being a relatively small improvement, it can lead to a significant gain in the task of finding True Positives - pairs of products that match. This effect is especially desirable with the real-world application of the presented framework in mind. The amount of real data that can be analyzed with this methodology far exceeds the sample of 20000 records used in this study. In big-data problems, even the smallest upgrade goes a long way.

Analysis of XGBoost results and its mirrored IDEC runs shows that Deep Clustering performs better in all of the measured scenarios. It is worth mentioning that the Deep Clustering approach outperformed the k-means algorithm in every reported metric. DeepMatcher, regardless of the model used, performs worse than the proposed algorithm with respect to all evaluation metrics.

To further compare the k-means and IDEC algorithms, Table II presents examples of product pairs that were impossible for the simple algorithm to classify correctly. Analysis of these results allows drawing the conclusion that IDEC is more versatile and robust due to its ability to operate regardless of the length of compared strings, presence of typos, or distinctive ways of presenting product parameters.

C. Other datasets

Additionally, to fully examine the quality of the proposed algorithm, it was run on other data. In addition to the "Compact Cameras" category from the Skroutz dataset, a mixture of other camera-related categories was generated. This dataset, called cameras all, includes Analog Cameras, Mirrorless Cameras and Digital Single Lens Reflex Cameras. As for benchmarking standards a WDC (Web Data Commons) dataset was used for
TABLE I
COMPARISON OF K-MEANS, IDEC, XGBOOST AND DEEPMATCHER ALGORITHMS.

Algorithm               | Experiment   | Accuracy       | F Score        | Rand Index
k-means                 | -            | 0.924 ± 0.0001 | 0.848 ± 0.0004 | 0.859 ± 0.0002
IDEC-ML5-CL10-F100      | Must Link    | 0.941 ± 0.0052 | 0.893 ± 0.0078 | 0.889 ± 0.0093
IDEC-ML50-CL70-F100     | Can't Link   | 0.956 ± 0.0025 | 0.917 ± 0.0044 | 0.915 ± 0.0045
IDEC-ML50-CL20-F40      | Fraction 1-1 | 0.956 ± 0.0011 | 0.917 ± 0.0023 | 0.916 ± 0.0020
XGBoost-ML5-CL10-F100   | Must Link    | 0.865 ± 0.0213 | 0.782 ± 0.0290 | 0.767 ± 0.0307
XGBoost-ML50-CL70-F100  | Can't Link   | 0.871 ± 0.0234 | 0.789 ± 0.0339 | 0.777 ± 0.0352
XGBoost-ML50-CL20-F40   | Fraction 1-1 | 0.910 ± 0.0242 | 0.837 ± 0.0407 | 0.837 ± 0.0339
DeepMatcher-attention   | -            | 0.934 ± 0.0076 | 0.838 ± 0.0158 | 0.877 ± 0.0133
DeepMatcher-rnn         | -            | 0.894 ± 0.0063 | 0.791 ± 0.0089 | 0.811 ± 0.0099
DeepMatcher-sif         | -            | 0.859 ± 0.0197 | 0.643 ± 0.0314 | 0.758 ± 0.0283
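
The three metric columns of Table I follow from Eqs. (1)-(4). A minimal Python sketch of how they could be computed, assuming binary labels (1 = match, 0 = no match) stored in plain lists; the function names are illustrative and not taken from the authors' implementation, and the pairwise loop in `rand_index` is quadratic, so it is only meant for small samples:

```python
from itertools import combinations


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 marks a matching pair."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f_score(y_true, y_pred):
    """F Score in the second form of Eq. (3): TP / (TP + (FP + FN) / 2)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = tp + 0.5 * (fp + fn)
    # Returning 0.0 when there are no positives at all is a convention assumed here.
    return tp / denom if denom else 0.0


def rand_index(y_true, y_pred):
    """Rand Index as in Eq. (4); needs at least two samples."""
    agree = total = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        agree += same_true == same_pred  # pair treated the same way in both labelings
        total += 1
    return agree / total


# Toy labels, not data from the study.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), f_score(y_true, y_pred), rand_index(y_true, y_pred))
```

The Rand Index does not change when the two cluster ids are swapped, whereas accuracy and F Score require the cluster ids to be mapped onto the match / no-match classes first.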

TABLE II
EXAMPLES OF PAIRS MISCLASSIFIED BY K-MEANS WHILE BEING CORRECTLY CLASSIFIED BY IDEC.

Product 1 | Product 2 | Type
panasonic lumix bridge camera dc fz82 eu ka269712 | panasonic lumix fz82 black eos 24 dosis i eos 60 dosis choris karta | TP
nikon coolpix a100 purple thiki nikon doro vna974e1 | compact fotografiki nikon coolpix a100 purle se 3 atokes dosis | TP
aquapix w1024 b 10017 adiavrochi kamera mavri 10 mp | easypix aquapix w1024 splash red | TP
olympus tough tg 5 digital camera | olympus tg 5 red eos 24 dosis i eos 60 dosis choris karta | TP
sony dsc hx60 | sony dsc rx10 iii | TN
canon powershot sx730 hs silver | canon powershot sx620 hs red 1073c003 | TN
fotografiki michani nikon coolpix a100 red | fotografiki michani olympus tg 5 red | TN
cybershot dsc rx100m3 | sony cybershot dsc rx10 m2 se 12 atokes dosis | TN
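
To make the set-based title features concrete on a pair like those in Table II, the sketch below evaluates the ratio from Eq. (5) once on title tokens and once on the numbers found in the titles. It rests on several assumptions: whitespace splitting stands in for the unspecified word tokenizer, every maximal run of digits is treated as one number, the function names are hypothetical, and the example is the Canon row of Table II split at the second occurrence of the brand name.

```python
import re


def overlap_ratio(a, b):
    """|A ∩ B| / |A ∪ B| as written in Eq. (5).

    Returns 0.0 when both sets are empty (a convention assumed here).
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def token_overlap(title_1, title_2):
    # Whitespace split stands in for the paper's unspecified word tokenizer.
    return overlap_ratio(title_1.lower().split(), title_2.lower().split())


def number_overlap(title_1, title_2):
    # Each maximal run of digits counts as one "number" (an assumption).
    return overlap_ratio(re.findall(r"\d+", title_1), re.findall(r"\d+", title_2))


p1 = "canon powershot sx730 hs silver"
p2 = "canon powershot sx620 hs red 1073c003"
print(token_overlap(p1, p2))   # 3 shared tokens out of 8 distinct ones
print(number_overlap(p1, p2))  # the two titles share no digit runs
```

The two titles share brand and series tokens, yet their numbers do not overlap at all, which is the kind of signal the number-comparison feature is meant to expose. Note that the ratio in Eq. (5) as printed is the Jaccard similarity; one minus it gives the corresponding distance, and which of the two is fed to the model is not stated here.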

more calculations. Data about cameras (wdc cameras) from Version 2.0 of the Large-Scale Product Matching Dataset [43] was employed to further check the robustness of the semi-supervised approach. The IDEC algorithm used the constraints combination ML5-CL10-F100.

TABLE III
IDEC PERFORMANCE ON VARIOUS DATASETS.

Dataset      | Accuracy       | F Score        | Rand Index
cameras      | 0.941 ± 0.0052 | 0.893 ± 0.0078 | 0.889 ± 0.0093
cameras all  | 0.899 ± 0.0296 | 0.815 ± 0.0375 | 0.820 ± 0.0452
wdc cameras  | 0.789 ± 0.0284 | 0.665 ± 0.0505 | 0.668 ± 0.0313

Table III shows consistent, high performance achieved on diverse data sources. A drop in metric values for the wdc cameras data might be caused by the fact that the given dataset mixes product feeds from many pages and multiple languages. As a result, some pairs are cross-lingual, which might be hard for fuzzy features to reflect in the training data.

D. Class distribution impact

Another quality assessment test concerns the algorithm's performance at varying data distributions. For this experiment, datasets with an increasing percentage of matching pairs (ones) were synthetically generated from the cameras data. Percentages used for these datasets were as follows: 1, 3, 5, 10, 15, 20, 25. The IDEC algorithm was run with the same constraints setup as in the previous experiment (ML5-CL10-F100). Figure 5 presents an increasing trend for all of the reported metrics. The threshold of 10% brings a substantial increase in the F Score value, and when this share is doubled, all metrics exceed the value of 0.8.

It is worth mentioning that even the WDC Gold Standard uses datasets whose percentage of positive pairs ranges from 37% to 48% in the training data, for the classes of shoes and computers respectively. This is far from both the tested and real-life scenarios.

Fig. 5. Impact of increasing the amount of matching pairs in the dataset

The implementation of the tested algorithms along with the full results of this study can be found in the repository [URL undisclosed for peer-review].

V. CONCLUSION

The given paper presents a framework that can be utilized for the problem of product matching. The nature of the data and the given task requires the solution to be functional with as little information as possible. Online shopping websites often do not bear consistent information across all offerings, not to mention cross-platform discrepancies. The proposed
solution uses only titles of the products and derives simple text-based features which, unlike generating word embeddings, do not require high computational power.

Results obtained with the Deep Clustering approach outperform standard k-means algorithm, XGBoost method as well as deep learning based solutions. These results demonstrate that enriching data with constraints, transforming the problem from the unsupervised to the semi-supervised domain, leads to a positive outcome. An increase in F Score metric of 0.07 can have a high impact on real-world applications of such product matching concepts. More advanced tests proved presented solution to be robust and high performing on diverse datasets as well as under various class distributions.

There exist multiple ways of improving this research and extending it to examine more possibilities for boosting the model's results. We tested IDEC performance while changing only one parameter at a time, but this could be expanded to diverging at least two attributes concurrently. Another path for improvement could utilize alternative existing constrained clustering algorithms such as COP k-means or the MPC

[11] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
[12] S. Łukasik, A. Michałowski, P. A. Kowalski, and A. H. Gandomi, “Text-based product matching with incomplete and inconsistent items descriptions,” in International Conference on Computational Science, pp. 92–103, Springer, 2021.
[13] J. Tracz, P. I. Wójcik, K. Jasinska-Kobus, R. Belluzzo, R. Mroczkowski, and I. Gawlik, “Bert-based similarity learning for product matching,” in Proceedings of Workshop on Natural Language Processing in E-Commerce, pp. 66–75, 2020.
[14] R. Peeters, C. Bizer, and G. Glavaš, “Intermediate training of bert for product matching,” small, vol. 745, no. 722, pp. 2–112, 2020.
[15] R. Peeters and C. Bizer, “Supervised contrastive learning for product matching,” in Companion Proceedings of the Web Conference 2022, ACM, apr 2022.
[16] Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan, “Deep entity matching with pre-trained language models,” Proceedings of the VLDB Endowment, vol. 14, pp. 50–60, Sept. 2020.
[17] J. Li, Z. Dou, Y. Zhu, X. Zuo, and J.-R. Wen, “Deep cross-platform product matching in e-commerce,” Information Retrieval Journal, vol. 23, no. 2, pp. 136–158, 2020.
[18] A. Alabdullatif and M. Aloud, “Araprodmatch: A machine learning approach for product matching in e-commerce,” International Journal of Computer Science & Network Security, vol. 21, no. 4, pp. 214–222,
2021.
version of it. Adding more features, for example related to [19] R. Peeters and C. Bizer, “Supervised contrastive learning for product
the spread of prices of products, can be a valuable extension matching,” arXiv preprint arXiv:2202.02098, 2022.
of the modeling process as well. [20] K. Amshakala and R. Nedunchezhian, “Using fuzzy logic for product
matching,” in Computational Intelligence, Cyber Security and Compu-
It has to be pointed out that within the subject of product tational Models (G. S. S. Krishnan, R. Anitha, R. S. Lekshmi, M. S.
matching of online offerings datasets are countless. In the Kumar, A. Bonato, and M. Graña, eds.), (New Delhi), pp. 171–179,
era of widely available web scrapers, there is a possibility Springer India, 2014.
[21] R. Peeters and C. Bizer, “Entity matching using large language models,”
of gathering various features at the very source of the data 2023.
- the on-line selling platforms. Solutions, as well as the [22] R. Peeters and C. Bizer, “Using chatgpt for entity matching,” 2023.
data volumes, are only limited by the computational costs [23] K. Gupte, L. Pang, H. Vuyyuri, and S. Pasumarty, “Multimodal product
matching and category mapping: Text+ image based deep neural net-
of running algorithms and storing records. For these reasons, work,” in 2021 IEEE International Conference on Big Data (Big Data),
product matching is an important task worth further research pp. 4500–4505, IEEE, 2021.
and development. [24] M. Wilke and E. Rahm, “Towards multi-modal entity resolution for
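
As a supplementary illustration of the class-distribution experiment (subsection D), the sketch below shows one possible way to derive datasets with a controlled share of matching pairs. It assumes simple subsampling of the matching pairs from a labeled pool; the function name `resample_to_match_share`, the toy pool, and the subsampling strategy itself are assumptions made for illustration and may differ from the generation procedure actually used in the study.

```python
import random


def resample_to_match_share(pairs, share, seed=0):
    """Return (features, label) pairs whose fraction of label-1 items is close to `share`.

    All non-matching pairs are kept; matching pairs are drawn without replacement.
    """
    if not 0 < share < 1:
        raise ValueError("share must be strictly between 0 and 1")
    matches = [p for p in pairs if p[1] == 1]
    non_matches = [p for p in pairs if p[1] == 0]
    # n_match / (n_match + n_non) = share  =>  n_match = share * n_non / (1 - share)
    n_match = round(share * len(non_matches) / (1 - share))
    if n_match > len(matches):
        raise ValueError("not enough matching pairs for the requested share")
    rng = random.Random(seed)
    subset = non_matches + rng.sample(matches, n_match)
    rng.shuffle(subset)
    return subset


# Toy pool (hypothetical): 1000 non-matching and 400 matching pairs.
pool = [((i,), 0) for i in range(1000)] + [((i,), 1) for i in range(400)]

for pct in (1, 3, 5, 10, 15, 20, 25):
    data = resample_to_match_share(pool, pct / 100)
    actual = sum(label for _, label in data) / len(data)
    print(f"target {pct:>2}%  size {len(data):>4}  actual share {actual:.3f}")
```

Keeping every non-matching pair and varying only the number of matches means the dataset size grows with the target share; holding the total size fixed instead would be an equally reasonable design choice.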
product matching.,” in GvDB, 2021.
R EFERENCES [25] H. Tzaban, I. Guy, A. Greenstein-Messica, A. Dagan, L. Rokach,
and B. Shapira, “Product bundle identification using semi-supervised
[1] Eurostat, “E-commerce sales,” isoc ec eseln2 dataset, Eurostat, Septem- learning,” in Proceedings of the 43rd International ACM SIGIR Con-
ber 2023. ference on Research and Development in Information Retrieval, SIGIR
[2] Eurostat, “E-commerce continues to grow in the eu,” tech. rep., Eurostat, ’20, (New York, NY, USA), p. 791–800, Association for Computing
Spetember 2023. Machinery, 2020.
[3] Statista, “E-commerce as share of total U.S. retail sales from 1st quarter [26] A. Primpeli, R. Peeters, and C. Bizer, “The wdc training dataset and gold
2010 to 3rd quarter 2021,” dataset, Statista, September 2023. standard for large-scale product matching,” in Companion Proceedings
[4] D. Shankar, S. Narumanchi, H. Ananya, P. Kompalli, and K. Chaudhury, of The 2019 World Wide Web Conference, pp. 381–386, 2019.
“Deep learning based large scale visual recommendation and search for [27] M. Okabe and S. Yamada, “Clustering using boosted constrained k-
e-commerce,” arXiv preprint arXiv:1703.02344, 2017. means algorithm,” Frontiers in Robotics and AI, vol. 5, 2018.
[5] R. Gubela, A. Bequé, S. Lessmann, and F. Gebert, “Conversion uplift in [28] H. Zhang, S. Basu, and I. Davidson, “A framework for deep constrained
e-commerce: A systematic benchmark of modeling strategies,” Interna- clustering-algorithms and advances,” in Joint European Conference on
tional Journal of Information Technology & Decision Making, vol. 18, Machine Learning and Knowledge Discovery in Databases, pp. 57–72,
no. 03, pp. 747–791, 2019. Springer, 2019.
[6] L. Zhou, “Product advertising recommendation in e-commerce based [29] E. Bair, “Semi-supervised clustering methods,” Wiley Interdisciplinary
on deep learning and distributed expression,” Electronic Commerce Reviews: Computational Statistics, vol. 5, no. 5, pp. 349–361, 2013.
Research, vol. 20, no. 2, pp. 321–342, 2020. [30] N. Gali, R. Mariescu-Istodor, and P. Fränti, “Similarity measures for
[7] R. Gupta and C. Pathak, “A machine learning framework for predicting title matching,” in 2016 23rd International Conference on Pattern
purchase by online customers based on dynamic pricing,” Procedia Recognition (ICPR), pp. 1548–1553, IEEE, 2016.
Computer Science, vol. 36, pp. 599–605, 2014. [31] L. Yujian and L. Bo, “A normalized levenshtein distance metric,” IEEE
[8] Y. Narahari, C. Raju, K. Ravikumar, and S. Shah, “Dynamic pricing Transactions on Pattern Analysis and Machine Intelligence, vol. 29,
models for electronic business,” sadhana, vol. 30, no. 2, pp. 231–256, pp. 1091–1095, June 2007.
2005. [32] G. Ivchenko and S. Honov, “On the jaccard similarity test,” Journal of
[9] R. Maestre, J. Duque, A. Rubio, and J. Arévalo, “Reinforcement learning Mathematical Sciences, vol. 88, no. 6, pp. 789–794, 1998.
for fair dynamic pricing,” in Proceedings of SAI Intelligent Systems [33] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al., “Constrained k-
Conference, pp. 120–135, Springer, 2018. means clustering with background knowledge,” in Icml, vol. 1, pp. 577–
[10] J. Li, T. Wang, Z. Chen, G. Luo, et al., “Machine learning algorithm 584, 2001.
generated sales prediction for inventory optimization in cross-border e- [34] CRAN, “conclust: Pairwise constraints clustering,” package, Apr 2022.
commerce,” International Journal of Frontiers in Engineering Technol- [35] X. Guo, L. Gao, X. Liu, and J. Yin, “Improved deep embedded clustering
ogy, vol. 1, no. 1, 2019. with local structure preservation.,” in Ijcai, pp. 1753–1759, 2017.
[36] H. Zhang, T. Zhan, S. Basu, and I. Davidson, “A framework for deep
constrained clustering,” Data Mining and Knowledge Discovery, vol. 35,
no. 2, pp. 593–620, 2021.
[37] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for
clustering analysis,” in International conference on machine learning,
pp. 478–487, PMLR, 2016.
[38] L. A. Jeni, J. F. Cohn, and F. De La Torre, “Facing imbalanced
data–recommendations for the use of performance metrics,” in 2013
Humaine association conference on affective computing and intelligent
interaction, pp. 245–251, IEEE, 2013.
[39] J. M. Santos and M. Embrechts, “On the use of the adjusted rand index
as a metric for evaluating supervised classification,” in International
conference on artificial neural networks, pp. 175–184, Springer, 2009.
[40] Kaggle, “Skroutz dataset for product matching,” dataset, Apr 2022.
[41] G. A. Rao, G. Srinivas, K. V. Rao, and P. P. Reddy, “A partial ratio
and ratio based fuzzy-wuzzy procedure for characteristic mining of
mathematical formulas from documents,” IJSC—ICTACT J Soft Comput,
vol. 8, no. 4, pp. 1728–1732, 2018.
[42] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep,
E. Arcaute, and V. Raghavendra, “Deep learning for entity matching:
A design space exploration,” in Proceedings of the 2018 International
Conference on Management of Data, SIGMOD ’18, (New York, NY,
USA), p. 19–34, Association for Computing Machinery, 2018.
[43] A. Primpeli, R. Peeters, and C. Bizer, “The wdc training dataset and
gold standard for large-scale product matching,” pp. 381–386, 05 2019.

You might also like