
AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types

Xin Luna Dong1, Xiang He1, Andrey Kan1, Xian Li1, Yan Liang1, Jun Ma1, Yifan Ethan Xu1, Chenwei Zhang1, Tong Zhao1, Gabriel Blanco Saldana1, Saurabh Deshpande1, Alexandre Michetti Manduca1, Jay Ren1, Surender Pal Singh1, Fan Xiao1, Haw-Shiuan Chang2*, Giannis Karamanolakis3*, Yuning Mao4*, Yaqing Wang5*, Christos Faloutsos6*, Andrew McCallum2, Jiawei Han4

1 Amazon   2 University of Massachusetts Amherst   3 Columbia University   4 University of Illinois at Urbana-Champaign   5 State University of New York at Buffalo   6 Carnegie Mellon University

1 {lunadong,xianghe,avkan,xianlee,ynliang,junmaa,xuyifa,cwzhang,zhaoton,saldanag,sdeshpa,manduca,renjie,srender,fnxi}@amazon.com   2 {hschang,mccallum}@cs.umass.edu   3 [email protected]   4 {yuningm2,hanj}@illinois.edu   5 [email protected]   6 [email protected]

* Research conducted at Amazon.

arXiv:2006.13473v1 [cs.AI] 24 Jun 2020

ABSTRACT

Can one build a knowledge graph (KG) for all products in the world? Knowledge graphs have firmly established themselves as valuable sources of information for search and question answering, and it is natural to wonder whether a KG can contain information about products offered at online retail sites. There have been several successful examples of generic KGs, but organizing information about products poses many additional challenges, including sparsity and noise of structured data for products, complexity of the domain with millions of product types and thousands of attributes, heterogeneity across a large number of categories, and a large and constantly growing number of products.

We describe AutoKnow, our automatic (self-driving) system that addresses these challenges. The system includes a suite of novel techniques for taxonomy construction, product property identification, knowledge extraction, anomaly detection, and synonym discovery. AutoKnow is (a) automatic, requiring little human intervention; (b) multi-scalable, i.e., scalable in multiple dimensions (many domains, many products, and many attributes); and (c) integrative, exploiting rich customer behavior logs. AutoKnow has been operational in collecting product knowledge for over 11K product types.

CCS CONCEPTS

• Information systems → Graph-based database models.

KEYWORDS

knowledge graphs, taxonomy enrichment, attribute importance, data imputation, data cleaning, synonym finding

ACM Reference Format:
Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, Jun Ma, Yifan Ethan Xu, Chenwei Zhang, Tong Zhao, Gabriel Blanco Saldana, Saurabh Deshpande, Alexandre Michetti Manduca, Jay Ren, Surender Pal Singh, Fan Xiao, Haw-Shiuan Chang, Giannis Karamanolakis, Yuning Mao, Yaqing Wang, Christos Faloutsos, Andrew McCallum, Jiawei Han. 2020. AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23-27, 2020, Virtual Event, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3394486.3403323

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
KDD '20, August 23-27, 2020, Virtual Event, CA, USA
© 2020 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-7998-4/20/08.
https://doi.org/10.1145/3394486.3403323

1 INTRODUCTION

A knowledge graph (KG) describes entities and relations between them; for example, between entities Amazon and Seattle, there can be a headquarters_located_at relation. The past decade has witnessed broad applications of KGs in search (e.g., by Google and Bing) and question answering (e.g., by Amazon Alexa or Google Home). How to automatically build a knowledge graph with comprehensive and high-quality data has been a hot topic for research and industry practice in recent years. In this paper, we answer this question for the Retail Product domain. Rich product knowledge can significantly improve e-Business shopping experiences through product search, recommendation, and navigation.

Existing industry successes in knowledge curation are mainly in popular domains such as Music, Movie, and Sport [4, 12]. Two common features of such domains make them pioneer domains for knowledge collection. First, rich data in structured form and of decent quality already exist for these domains. Taking Movie as an example, in addition to common knowledge sources such as Wikipedia and WikiData, other authoritative movie data sources include IMDb (https://www.imdb.com/), and so on. Second, the complexity of the domain schema is manageable. Continuing with the Movie domain, the Freebase knowledge graph [4] contains 52 entity types and 155 relationships [36] for this domain. An ontology to describe these types and relationships can be manually defined within weeks, especially by leveraging existing data sources.

The retail product domain presents a set of new challenges for knowledge collection.
[Figure 1: We propose AutoKnow, a pipeline that constructs a product knowledge graph from Catalog data and user logs. AutoKnow fixes incorrect values (e.g., "Flavor:Cherry" for Product 1) and imputes missing values (e.g., "Flavor:Choc." for Product 2); however, it does not impute where it is inapplicable (e.g., Color applies to wrapped candies such as Product 3, but does not apply to pretzel snack Product 1). It also extends the taxonomy (e.g., Pretzels) and finds synonyms (e.g., Chocolate vs. Choc.). Reported gains: number of types up 3X; defect rate down by up to 68 percentage points.]
C1- Structure-sparsity: First, except for a few categories such as electronics, structured data are sparse and noisy across nearly all data sources. This is because the majority of product data reside in catalogs from e-Business websites such as Amazon, eBay, and Walmart, and these often rely on data contributed by retailers. In contrast to publishers of digital products like movies and books, retailers in the retail business mainly list product features in titles and descriptions instead of providing structured attribute information, and may even abuse those structured attribute fields for convenience in selling products [35, 37]. As a result, structured knowledge needs to be mined from textual product profiles (e.g., titles and descriptions). Thousands of product attributes, billions of existing products, and millions of new products emerging on a daily basis require fully automatic and efficient knowledge discovery and update mechanisms.
C2- Domain-complexity: Second, the domain is much more complex. The number of product types approaches millions, and there are various relationships between the types, such as sub-types (e.g., swimsuit vs. athletic swimwear), synonyms (e.g., swimsuit vs. bathsuit), and overlapping types (e.g., fashion swimwear vs. two-piece swimwear). Product attributes vastly differ between types (e.g., compare TVs and dog food) and also evolve over time (e.g., older TVs did not have WiFi connectivity). All of these make it hard to design a comprehensive ontology and keep it up-to-date, thus calling for automatic solutions.
C3- Product-type-variety: Third, the variety of product types makes it even harder to train knowledge enrichment and cleaning models. Product attributes, value vocabularies, and text patterns in product titles and descriptions often differ across types. Even neighboring product types can have different attributes; for example, Coffee and Tea, which share the same parent Drink, describe size using different vocabularies and patterns (e.g., "Ground Coffee, 20 Ounce Bag, Rainforest Alliance Certified" vs. "Classic Tea Variety Box, 48 Count (Pack of 1)"). On the one hand, training one single model is inadequate to achieve good results for all the different types of products. On the other hand, collecting training data for each of the thousands to millions of product types is extremely expensive, and implausible for less-popular types. Maintaining a huge number of models also brings big system overhead.

With all of these challenges, the solutions behind existing KGs, both in industry (e.g., Google Knowledge Graph, Bing Satori Graph [12]) and in the research literature (e.g., Yago [10], NELL [6], Diadem [11], Knowledge Vault [8]), cannot directly apply to the retail product domain, as we will further discuss in Section 7.

In this paper, we present our solution, which we call AutoKnow (Figure 1). AutoKnow starts with building a product type taxonomy (i.e., types and hypernym relationships) and deciding applicable product attributes for each type; after that, it imputes structured attributes, cleans up noisy values, and identifies synonym expressions for attribute values. Imagine how an autonomous-driving vehicle perceives and understands the environment using all the signals available with minimal human intervention. AutoKnow is self-driving in the same sense, with the following features.

• Automatic: First, it trains machine learning models and requires very little manual effort. In addition, we leverage existing Catalog data and customer behavior logs to generate training data, eliminating the need for manual labeling for the majority of the models and allowing extension to new domains without extra effort.

• Multi-scalable: Our system is scalable in multiple dimensions. It is extensible to new values and is not constrained to existing vocabularies in the training data. It is extensible to new types, as it trains one-size-fits-all models for thousands of types, and the models behave differently for different product types to achieve the best results.

• Integrative: Finally, the system applies self-guidance, and uses customer behavior logs to identify important product attributes to focus efforts on.
A few techniques play a critical role in allowing us to scale up to the large number of product types we need to generate knowledge for. First, we leverage the graph structure that naturally applies to knowledge graphs (entities can be considered as nodes and relationships as edges) and to the taxonomy (types and hypernym relationships form a tree structure), and apply Graph Neural Networks (GNN) for learning. Second, we take product categorization as an input signal to train our models, and combine our tasks with product categorization for multi-task training to achieve better performance. Third, we strive to learn with limited labels to alleviate the burden of manual training data creation, relying heavily on weak supervision (e.g., distant supervision) and on semi-supervised learning. Fourth, we mine both facts and heterogeneous expressions for the same concept (i.e., type, attribute value) from customer behavior logs, abundant in the retail domain.

More specifically, we make the following contributions.

(1) Operational system: We describe AutoKnow, a comprehensive end-to-end solution for product knowledge collection, covering components from ontology construction and enrichment, to data extraction, cleaning, and normalization. A large part of AutoKnow has been deployed and operational in collecting over 1B product knowledge facts for over 11K distinct product types, and the knowledge has been used for Amazon search and product detail pages.

(2) Technical novelty: We invented a suite of novel techniques that together allow us to scale up knowledge discovery to thousands of product types. The techniques range from NLP and graph mining to anomaly detection, and leverage state-of-the-art techniques in GNNs, transformers, and multi-task learning.

(3) Empirical study: We describe our practice on real-world e-Business data from Amazon, showing that we are able to extend the existing ontology by 2.9X and considerably increase the quality of structured data, on average improving precision by 7.6% and recall by 16.4%.


Whereas our paper focuses on the retail domain and our experiments were conducted on Amazon data, the techniques can easily be applied to other e-Commerce datasets, and adapted to other domains with hierarchical taxonomies, rich text profiles, and customer behavior logs, such as finance, phylogenetics, and biomedical studies.

[Figure 2: AutoKnow architecture, containing an ontology suite (Taxonomy Enrichment, Relation Discovery) to enrich the product ontology, and a data suite (Data Imputation, Data Cleaning, Synonym Discovery) to enrich product structured data. Inputs are Catalog data and behavioral signals (e.g., search logs, reviews, Q&A); outputs include {product, attribute, value} and {value, synonym, value} triples forming the broad graph.]
2 DEFINITION AND SYSTEM OVERVIEW

2.1 Product Knowledge Graph

A KG is a set of triples in the form of (subject, predicate, object). The subject is an entity with an ID, and this entity belongs to one or multiple types. The object can be an entity or an atomic value, such as a string or a number. The predicate describes the relation between the subject and the object. For example, (prod_id, hasBrand, brand_id) is a triple between two entities, whereas (prod_id, hasSugarsPerServing, "32") is a triple between an entity and an atomic value. One can consider the entities and atomic values as nodes in the graph, and predicates as edges that connect the nodes.

For simplicity of problem definition, in this paper we focus on a special type of knowledge graph, which we call a broad graph. The broad graph is a bipartite graph G = (N1, N2, E), where nodes in N1 represent entities of one particular type, called the topic type; nodes in N2 represent attribute values (that can be entities or atomic values); and edges in E connect each entity with its attribute values. The edges are labeled with the corresponding attribute names (Figure 1). In other words, a broad graph contains only two layers, and thus contains attribute values only for entities of the topic type. We focus on broad graphs where the topic type is product. Once a broad graph is built, one can imagine stacking broad graphs layer by layer to include knowledge about other types of entities (e.g., brand), and eventually arrive at a rich, comprehensive graph.
Product types form a tree-structured taxonomy, where the root represents all products, each node represents a product type, and each edge represents a sub-type relationship. For example, Coffee is a sub-type of Drink.

We assume two sources of input. First, we assume the existence of a product Catalog, including a product taxonomy, a set of product attributes (not necessarily distinguished between different product types), a set of products, and attribute values for each product. We assume that each product has a product profile that includes title, description, and bullet points, and in addition a set of structured values, where the title is required and the other fields are optional. Second, we assume the existence of customer behavior logs, such as the query and purchase log, customer reviews, and Q&A. We next formally define the problem we solve in this paper.

Problem definition: Let C = (T, A, P) be a product Catalog, where (1) T = (T, H) denotes a product taxonomy with a set of product types T and the hypernym relationships H between types in T; (2) A denotes a set of product attributes; and (3) P = {PID, {T}, {(A, V)}} contains for each product (PID is the ID) a set of product types {T} and a set of attribute-value pairs {(A, V)}. Let L denote customer behavior logs. Product Knowledge Discovery takes C and L as input, and enriches the product knowledge by adding new types and hypernym relationships to T, and new product types and attribute values for each product in P.
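To make these definitions concrete, the following is a minimal sketch (our illustration, not code from the AutoKnow system) of the catalog C = (T, A, P) and its flattening into broad-graph triples; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Taxonomy:
    """T = (T, H): product types plus hypernym (parent, child) edges."""
    types: Set[str] = field(default_factory=set)
    hypernyms: Set[Tuple[str, str]] = field(default_factory=set)

@dataclass
class Product:
    """One entry of P: the product ID, its types, and attribute-value pairs."""
    pid: str
    types: List[str] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Catalog:
    """C = (T, A, P)."""
    taxonomy: Taxonomy
    attributes: Set[str]
    products: List[Product]

def broad_graph_triples(c: Catalog) -> List[Tuple[str, str, str]]:
    """Flatten P into the (subject, predicate, object) edges of the broad graph."""
    triples = []
    for p in c.products:
        triples += [(p.pid, "hasType", t) for t in p.types]
        triples += [(p.pid, a, v) for a, v in p.attributes.items()]
    return triples
```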
2.2 System Architecture

Figure 2 depicts the architecture of our AutoKnow system. It has five components, categorized into two function suites.

Ontology suite: The ontology suite contains two components: Taxonomy enrichment and Relation discovery. Taxonomy enrichment identifies new product types not existing in the input taxonomy T and decides the hypernym relationships between the newly discovered types and existing types, using them to enrich T. Relation discovery decides, for each product type T ∈ T and attribute A ∈ A, whether A applies to type T and, if so, how important A is when customers make purchase decisions for these products, captured by an importance score.

Data suite: The data suite contains three components: Data imputation, Data cleaning, and Synonym discovery. Data imputation derives new (attribute, value) pairs for each product in P from product profiles and existing structured attributes. Data cleaning identifies anomalies in the existing data in P and in the newly imputed values. Synonym discovery associates synonyms between product types and between attribute values.

Each component is independent, automatic, and multi-scalable; on the other hand, the components are well pieced together. The early components provide guidance and richer data for later components; for example, relation discovery identifies important and meaningful relations for the data suite, and data imputation provides richer data for synonym discovery. The later components fix errors from earlier parts of the pipeline; for example, data cleaning removes mistakes from data imputation.

Table 1 summarizes how each component of AutoKnow employs the aforementioned techniques. We next describe in Section 3 how we build the ontology and in Section 4 how we improve the data. To facilitate understanding of our design choices, for each component we present a comparison of our proposed solution with the state of the art, show ablation studies, and show real examples in Appendix A. Unless otherwise mentioned, we use the Grocery domain in the US market and the flavor attribute to illustrate our results, but we have observed the same trends across different domains and attributes. We describe the details of the experimental setup and the end-to-end empirical study in Section 5, and the lessons we learned in Section 6.

Table 1: Key techniques employed by each component.

Technique            | AK-Taxonomy | AK-Relations | AK-Imputation | AK-Cleaning | AK-Synonyms
Graph structure      |      X      |              |               |             |      X
Taxonomy signal      |             |              |       X       |      X      |
Distant supervision  |      X      |              |       X       |      X      |
Behavior information |      X      |      X       |               |             |      X
3 ENRICHING THE ONTOLOGY

3.1 Taxonomy Enrichment

Problem definition: Online catalog taxonomies are often built and maintained manually by human experts (e.g., taxonomists), which is labor-intensive and hard to scale up, leaving out fine-grained product types. How to discover new product types and attach them to the existing taxonomy in a scalable, automated fashion is critical to address the C2- Domain-complexity challenge.

Formally, given an existing product taxonomy T = (T, H), Taxonomy Enrichment extends it to T' = (T ∪ T', H ∪ H'), where T' is the set of new product types, and H' is the additional set of hypernym relationships between types in T and in T'.
Key techniques: The product domain poses its own unique challenges for taxonomy construction. A product type and its instances, or a type and its sub-types, are unlikely to be mentioned in the same sentence as in other domains (such as "big US cities like Seattle"), so traditional methods like Hearst patterns do not apply. Our key intuition is that since product types are very important, they are frequently mentioned in product titles (see Table 2 for an example) and search queries (e.g., "k-cups dunkin donuts dark"); we thus leverage existing resources such as product profiles in the Catalog C and search queries in the behavior logs L to effectively supervise the taxonomy enrichment process.

We enrich the product taxonomy in two steps. We first discover new types T' from product titles or customer search queries by training a type extractor. Then, we attach candidate types in T' to the existing taxonomy T by solving a hypernym classification problem. We next briefly describe the high-level ideas of each step; details can be found in [22].
Type extraction: Type extraction discovers new product types mentioned in product titles or search queries. For the purpose of recognizing new product types from product titles, it is critical that we are able to extract types not included in the training data. Thus, we adopt an open-world tagging model and formulate type extraction as a "BIOE" sequential labeling problem. In particular, given the product's title sequence (x_1, x_2, ..., x_L), the model outputs the sequence (y_1, y_2, ..., y_L), where y_i ∈ {B, I, O, E}, representing "begin", "inside", "outside", and "end", respectively. Table 2 illustrates an example of sequential labels obtained using OpenTag [37]: "ice cream" is labeled as product type, and "black cherry cheesecake" as product flavor.

Table 2: Example of input (text) / output (BIOE tag) sequences for the type and flavor of an ice cream product.

Input:  Ben & Jerry's black cherry cheesecake ice cream
Output: O   O  O       B-flavor I-flavor E-flavor B-type E-type

To train the model, we adopt distant supervision to generate the training labels. For product titles, we look for product types in the Catalog provided by retailers (restricted to the existing product types), and generate BIOE tags when types are explicitly and exactly mentioned in their titles. For queries, we look for the type of the purchased product in the query to generate BIOE tags. Once the extraction models are trained, we apply them to product titles and queries. New types from titles are taken as T', and those from queries, albeit noisier, will be used for type attachment.
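The distant-supervision tagging just described can be illustrated with a small sketch (ours; real preprocessing, tokenization, and matching are more careful than exact lower-cased matching):

```python
def bioe_tags(tokens, span, label):
    """Return BIOE tags marking exact occurrences of `span` in `tokens`."""
    tags = ["O"] * len(tokens)
    low = [t.lower() for t in tokens]
    span = [t.lower() for t in span]
    n = len(span)
    for i in range(len(tokens) - n + 1):
        if low[i:i + n] == span:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + n - 1):
                tags[j] = f"I-{label}"
            if n > 1:
                tags[i + n - 1] = f"E-{label}"
    return tags

title = "Ben & Jerry's black cherry cheesecake ice cream".split()
print(bioe_tags(title, "ice cream".split(), "type"))
# ['O', 'O', 'O', 'O', 'O', 'O', 'B-type', 'E-type'], matching Table 2
```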
Type Attachment: Type attachment organizes extracted types into the existing taxonomy. We thus solve a binary classification problem, where the classifier determines whether the hypernym relationship exists between two types T ∈ T, T' ∈ T'.

Our key intuition is to capture various signals from customer behavior with a GNN-based module. In particular, we first construct a graph where the nodes represent types, products, and queries, and the edges represent various relationships, including 1) product co-viewing, 2) a query leading to a product purchase, and 3) the type mentioned in a query or a product (according to the extraction). The GNN model allows us to refine the node representations using the neighborhood information on the graph. Finally, the type representation for each T ∈ T ∪ T' is combined with semantic features (e.g., word embeddings) of the type names and fed to the classifier.

To train the model, we again apply distant supervision. We use the type hypernym pairs in the existing taxonomy as the supervision to generate positive labels, and generate five negative labels per positive by randomly replacing the hyponym type with other product types.
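A sketch of this negative-label generation, under the assumption that negatives are drawn uniformly from the remaining types (function and variable names are ours):

```python
import random

def attachment_examples(gold_pairs, all_types, neg_per_pos=5, seed=7):
    """gold_pairs: set of (hypernym, hyponym) pairs from the existing taxonomy;
    all_types: list of candidate types. Emits (hypernym, candidate, label)."""
    rng = random.Random(seed)
    examples = []
    for parent, child in gold_pairs:
        examples.append((parent, child, 1))
        for _ in range(neg_per_pos):
            fake = rng.choice(all_types)
            while fake == child or (parent, fake) in gold_pairs:
                fake = rng.choice(all_types)  # avoid accidental positives
            examples.append((parent, fake, 0))
    return examples
```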
Component Evaluation: For product type extraction in the Grocery domain, we obtained 87.5% precision according to MTurk evaluation; in comparison, Noun Phrase (NP) chunking obtains 12.3% precision and AutoPhrase [30] obtains 20.9% precision.

For type attachment, we took hypernym relationships from the existing taxonomy as ground truth, randomly sampled 80% for model training, 10% as the validation set, and 10% for testing. We measured both Edge-F1 (F-measure on parent-child relationships) and Ancestor-F1 (F-measure on ancestor-child relationships) [2, 21]. Table 3 shows that our GNN-based model significantly outperforms the state-of-the-art baselines, improving Edge-F1 by 54.3% over HiDir [33] and by 17.7% over MSejrKu [29]. Ablation tests show that both the multi-hop GNN and the user behavior signals increase the performance.

Table 3: AK-Taxonomy improves over state-of-the-art by 17.7% on Edge-F1.

Method                               | Edge-F1       | Ancestor-F1
Substr [5]                           | 10.7          | 52.9
HiDir [33]                           | 40.5          | 66.4
MSejrKu [29]                         | 53.1          | 76.7
Type-Attachment                      | 62.5          | 84.2
  w/o. multi-hop (≥2) GNN            | 50.4 (↓12.1%) | 75.9 (↓8.3%)
  w/o. user behavior (query↔product) | 60.1 (↓2.4%)  | 83.0 (↓1.2%)
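For reference, the Edge-F1 reported in Table 3 is standard F1 computed over edge sets; a minimal version of the metric might look as follows (our sketch, following the definitions in [2, 21]):

```python
def edge_f1(predicted, gold):
    """F1 over sets of (parent, child) edges; Ancestor-F1 applies the same
    computation to transitively expanded (ancestor, descendant) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(predicted), tp / len(gold)
    return 2 * p * r / (p + r)
```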
3.2 Relation Discovery

Problem definition: In a catalog there are often thousands of product attributes; however, different sets of attributes apply to different product types (e.g., flavor applies to snacks, but not to shampoos), and among them, only a small portion have a big influence on customer shopping decisions (e.g., brand is more likely to affect shopping decisions for snacks, but less so for fruits). Understanding applicability and importance will help filter values for non-applicable attributes and prioritize enrichment and cleaning for important attributes. Thus, how to identify applicable and important attributes for thousands of types is another key problem to solve to address the C2- Domain-complexity challenge.

Formally, given a product taxonomy T = (T, H) and a set of product attributes A, Relation Discovery decides for each (T, A) ∈ T × A: (1) whether attribute A applies to products of type T, denoted by (T, A) → {0, 1}; and (2) how important A is for purchase decisions on products of T, denoted by (T, A) → [0, 1]. Here, we do not consider newly extracted types in T', since they are often sparse.

Key techniques: Intuitively, important attributes will be frequently mentioned by sellers and buyers, whereas inapplicable attributes will appear rarely. Previous approaches explored this intuition, but either leveraged only one text source at a time (e.g., only customer reviews) or combined sources according to a pre-defined rule [15, 27, 31]. Here we train a classification model to decide attribute applicability, and a regression model to decide attribute importance; a sketch of both follows the feature list below. We use Random Forests for both models and employ two types of features reflecting the behavior of customers:
• Seller behavior, captured by the coverage of attribute values for a particular product type, and the frequency of mentions of attribute values in product profiles.

• Buyer behavior, captured by the frequency of mentions of attribute values in search queries, reviews, Q&A sessions, etc.
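A minimal scikit-learn sketch of the two Random Forest models; the feature columns and values below are hypothetical stand-ins for the seller/buyer signals above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# One row per (type, attribute) pair. Hypothetical columns: catalog coverage,
# profile mention frequency, query mention frequency, review mention frequency.
X = np.array([[0.85, 0.40, 0.30, 0.25],   # e.g., (Snacks, flavor)
              [0.02, 0.01, 0.00, 0.01]])  # e.g., (Shampoo, flavor)
y_applicable = np.array([1, 0])           # MTurk majority vote
y_importance = np.array([0.8, 0.0])       # averaged influence grades

applicability = RandomForestClassifier(n_estimators=100).fit(X, y_applicable)
importance = RandomForestRegressor(n_estimators=100).fit(X, y_importance)
print(applicability.predict(X), importance.predict(X))
```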
For a given (T, A) pair, we compute features that correspond to the different signals (e.g., reviews, search logs, etc.). To this end, we estimate frequencies of mentions of attribute values in the corresponding text sources (see details in Appendix B). Note that sellers are required to provide certain applicable attributes (e.g., barcode). These attributes have high coverage, but they are not always important for shopping decisions and appear rarely in customer logs. We thus train two different models for applicability and importance to capture such subtle differences.

We collect manual annotations for training, both in-house and using MTurk. In the latter case, for a given (T, A) pair, we asked six MTurk workers whether the attribute A applies to products of type T, and how likely A is to influence shopping decisions for products of type T. Applicability is decided by majority voting, and importance is decided by averaging the influence-likelihood grades. Once we have trained the models, we apply them to all (T, A) pairs to decide applicability and importance.
Component Evaluation: We collected two datasets. The first dataset contains 807 applicability and importance labels for 11 common attributes (e.g., brand, flavor, scent) and 79 randomly sampled product types. The second dataset contains 240 applicability labels for 7 product types (e.g., Shampoo, Coffee, Battery) and 180 attributes for which there are values in the Catalog. We combined the data, used 80% for training and 20% for testing, and report results in Table 4. Our results show that, compared with the strongest single signal (coverage), the various behavior signals improved F-measure by 4.7% for applicability and improved Spearman correlation for importance by 1.9X. Ablation tests show that both buyer features and seller features contribute to the final results.

Table 4: AK-Relations outperforms using only coverage features on both applicability prediction (by 4.7%) and importance prediction (1.9X). Here "Seller features" does not include the "Coverage features".

Method            | Precision | Recall | F1   | Spearman
Coverage features | 0.90      | 0.80   | 0.85 | 0.39
Seller features   | 0.90      | 0.84   | 0.87 | 0.72
Buyer features    | 0.86      | 0.83   | 0.84 | 0.68
All features      | 0.94      | 0.84   | 0.89 | 0.74
4 ENRICHING AND CLEANING KNOWLEDGE

4.1 Data Imputation

Problem definition: The Data Imputation component addresses the C1- Structure-sparsity challenge by extracting structured values from product profiles to increase coverage. Formally, given product information (PID, {T}, {(A, V)}), Data imputation extracts new (A, V) pairs for each product from its profiles (i.e., title, description, and bullets).

State-of-the-art techniques have solved this problem for a single type-attribute pair (T, A), obtaining high extraction quality with BIOE sequential labeling combined with active learning [37]. Equation (1) shows sequential labeling with a BiLSTM and a CRF:

    (y_1, y_2, ..., y_L) = CRF(BiLSTM(e_{x_1}, e_{x_2}, ..., e_{x_L})),    (1)

where e_{x_i} is the embedding of x_i, usually initialized with pre-trained word embeddings such as GloVe [25] and fine-tuned during model training. As an example, the output sequence tags in Table 2 show that "black cherry cheesecake" is a flavor of the product.
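A minimal PyTorch sketch of the BiLSTM portion of Equation (1) (ours; the CRF decoding layer and the attention mechanism used by OpenTag [37] are omitted):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Per-token scores over BIOE tags; a CRF on top would model tag transitions."""
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # e_{x_i}, e.g., GloVe-initialized
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):                      # (batch, seq_len)
        states, _ = self.bilstm(self.emb(token_ids))
        return self.proj(states)                       # (batch, seq_len, n_tags)

scores = BiLSTMTagger(vocab_size=30000, n_tags=9)(torch.randint(0, 30000, (1, 8)))
```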
However, this technique does not scale up to thousands to millions of product types and the tens to hundreds of attributes that apply to each type. How to train an extraction model that acknowledges the differences between product types is critical to scale up sequential labeling and address the C3- Product-type-variety challenge.

Key techniques: We propose an innovative taxonomy-aware sequence tagging approach that makes predictions conditioned on the product type. We describe the high-level ideas next; details can be found in [16].

We extended the sequence tagging described in Equation (1) in two ways. First, we condition model predictions on the product type T ∈ T:

    (y_1, y_2, ..., y_L) = CRF(CondSelfAtt(BiLSTM(e_{x_1}, e_{x_2}, ..., e_{x_L}), e_T)),    (2)

where e_T is the pre-trained hyperbolic-space embedding (Poincaré [23]) of product type T, known to preserve the hierarchical relations between taxonomy nodes, and CondSelfAtt is a conditional self-attention layer that allows e_T to influence the attention weights.

Second, to better identify tokens that indicate the product type, and to address the problem that products can be mis-classified or product type information can be missing in a catalog, we employ multi-task learning: training sequence tagging and product categorization at the same time, with a shared BiLSTM layer.
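One simple way to realize the CondSelfAtt idea of Equation (2), with the type embedding e_T shifting the attention logits, is sketched below; this is our simplification, not the exact architecture of [16]:

```python
import torch
import torch.nn as nn

class CondSelfAtt(nn.Module):
    """Self-attention whose weights are conditioned on a product-type embedding."""
    def __init__(self, dim, type_dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.type_bias = nn.Linear(type_dim, dim)  # lets e_T shift the queries

    def forward(self, h, e_type):
        # h: (batch, seq, dim) token states from the BiLSTM; e_type: (batch, type_dim)
        q = self.q(h) + self.type_bias(e_type).unsqueeze(1)
        att = torch.softmax(q @ self.k(h).transpose(1, 2) / h.size(-1) ** 0.5, dim=-1)
        return att @ h  # type-dependent re-weighting of the same token states
```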
We again adopt the distant-supervision approach to automatically generate the training sequence labels, by text matching between product profiles and the available attribute values in the Catalog. The trained model is then applied to all (PID, A) pairs to predict missing values.

Component Evaluation: In Table 5, we show the performance evaluation and ablation studies for flavor extraction across 4,000 types of products in the Grocery domain. Compared to the baseline BiLSTM-CRF model adopted by the current state of the art [37], both CondSelfAtt and MultiTask learning, when applied alone, improve the F1 score by at least 7.0%; the combination of the two improves F1 by 10.1%.

Table 5: AK-Imputation improves over state-of-the-art by 10.1% on F1 for flavor extraction across 4,000 types.

Model              | Vocab size | Precision | Recall | F1
BiLSTM-CRF [37]    | 6756       | 70.3      | 49.6   | 57.5
AK-Imputation      | 13093      | 70.9      | 57.8   | 63.3
  w/o. CondSelfAtt | 9528       | 74.5      | 53.2   | 61.5
  w/o. MultiTask   | 12558      | 68.8      | 57.0   | 61.9

4.2 Data Cleaning

Problem definition: The structured data contributed by retailers are often error-prone because of misunderstanding or intentional abuse of the product attributes. Detecting anomalies and removing them is thus another important aspect of addressing the C1- Structure-sparsity challenge. Formally, given product information (PID, {T}, {(A, V)}), Data cleaning identifies (A, V) pairs that are incorrect for the product, such as (A = flavor, V = "1 lb. box") for a box of chocolate, or (A = color, V = "100% Cotton") for a shirt.

Abedjan et al. [1] have had success in cleaning values of types like numerical and date/time. We focus our discussion on textual attributes, which are often considered the most challenging to clean. Similar to data imputation, the key is to address the C3- Product-type-variety challenge so that we can scale up to nearly millions of types. In particular, we need to answer the question: how do we identify anomalous values inconsistent with product profiles for a large number of product types?

Key techniques: Our intuition is that an attribute value should be consistent with the context provided by the product profiles. We propose a transformer-based [32] neural net model that jointly processes signals from the textual product profile (D) and the product taxonomy T via a multi-head attention mechanism to decide whether a triple (PID, A, V) is correct (i.e., whether V is the correct value of attribute A for product PID). The model is capable of learning from raw textual input without extensive feature engineering, making it ideal for scaling to thousands of types.

The raw input to the model is the concatenation of the token sequences in D, T, and V. For the i-th token in the sequence, a learnable embedding vector e_i is constructed by summing up three embedding vectors of the same dimension:

    e_i = e_i^FastText + e_i^Segment + e_i^Position,    (3)

where e_i^FastText is the pre-trained FastText embedding [3] of token i, e_i^Segment is a segment embedding vector that indicates to which source sequence (D, T, or V) token i belongs, and e_i^Position is a positional embedding [32] that indicates the relative location of token i in the sequence. The sequence of embeddings [e_1, e_2, ...] is propagated through a multi-layer transformer model whose output embedding vector e_Out captures the distilled representations of all three input sequences. Finally, e_Out passes through a dense layer followed by a sigmoid node to produce a single score between 0 and 1, indicating the consistency of D and V; in other words, the likelihood of the input triple (PID, A, V) being correct (see Appendix C for details and Figure 4 for the model architecture).

To train the cleaning model, we adopt distant supervision to automatically generate training labels from the input Catalog. We generate positive examples by selecting high-frequency values that appear across multiple brands; then, for each positive example, we randomly apply one of the following three procedures to generate a negative example: 1) we build a vocabulary vocab(A) for each attribute A and replace a catalog value V of A with a randomly selected value from vocab(A); 2) we randomly select n-grams from the product title that do not contain the catalog value V, where n is a random number drawn according to the distribution of token lengths in vocab(A); 3) we randomly pick the value of another attribute A' ≠ A to replace V. At inference time, we apply our model to every (PID, A, V) triple and consider those with a low confidence score as incorrect.

Component Evaluation: As shown in Table 6, evaluation on the flavor attribute for the Grocery domain, on 2,230 labels across 223 types, shows that our model improves PRAUC over the state-of-the-art anomaly detection technique [18] by 75.3%; considering the product taxonomy in addition to product profiles improved PRAUC by 6.7%.

Table 6: AK-Cleaning improves over state-of-the-art anomaly detection by 75.3% on PRAUC. R@P=0.7 shows the recall when the precision is 0.7, etc.

Model                  | PRAUC | R@P=0.7 | R@P=0.8 | R@P=0.9 | R@P=0.95
Anomaly Detection [18] | 32.0  | 2.4     | 1.3     | 1.3     | 1.3
AK-Cleaning            | 56.1  | 59.6    | 39.8    | 26.0    | 20.7
  w/o. Taxonomy        | 52.6  | 52.6    | 36.2    | 22.4    | 3.0

4.3 Synonym Finding

We finally very briefly discuss how we identify synonyms with the same semantic meaning, including spelling variants (e.g., Reese's vs. reese), acronyms or abbreviations (e.g., FTO vs. fair trade organic), and semantic equivalence (e.g., lite vs. low sugar). Aligning synonymous values is another important aspect of addressing the C1- Structure-sparsity challenge, and how to train a domain-specific model to distinguish identical values from highly-similar values is a key question to answer.

Our method has two stages. First, we apply collaborative filtering [17] on customer co-view behavior signals to retain product pairs with high similarity, and take their attribute values as candidate pairs for synonyms. Such a candidate set is very noisy, hence requires heavy filtering. Second, we train a simple logistic regression model to decide whether a candidate pair has exactly the same meaning. The features we consider include edit distance, pre-trained MT-DNN model [19] scores, and features regarding distinct vs. common words. The features regarding distinct vs. common words play a critical role in the model; they focus on three sets of words: words appearing only in the first candidate but not the second, and vice
versa, and words shared by the two candidates. Between every two out of these three sets, edit distance and embedding similarity are computed and used as features.

An experiment on 2,500 candidate pairs (143 positive; half used for training) shows a PRAUC of 0.83 on Grocery flavor; removing the distinct-word features reduces the PRAUC to 0.79.
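The distinct-vs-common word sets described above reduce to simple set operations; a sketch (ours), after which edit distance and embedding similarity would be computed between every two of the three sets:

```python
def word_sets(cand1: str, cand2: str):
    """Split a candidate pair into only-in-first, only-in-second, shared words."""
    w1, w2 = set(cand1.lower().split()), set(cand2.lower().split())
    return w1 - w2, w2 - w1, w1 & w2

print(word_sets("fair trade organic", "fair trade cert"))
# ({'organic'}, {'cert'}, {'fair', 'trade'})
```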
5 EXPERIMENTAL RESULTS

We now present the product knowledge graph (PG) produced by the AutoKnow pipeline. We show that we built a graph with over 1 billion triples for over 11K product types, and significantly improved the accuracy and completeness of the data. Note that we have already compared each individual component with the state of the art in previous sections, so here we only compare the PG with the raw input data.

5.1 Input Data and Resulting Graph

Raw Data: AutoKnow takes the Amazon Product Catalog, including the product type taxonomy, as input. We chose products in four domains (i.e., high-level categories): Grocery, Health, Beauty, and Baby. These domains have in common that they contain quite sparse structured data; on the other hand, the numbers of types and the sizes vary from domain to domain. We consider products that have at least one page view in a randomly chosen month.

Table 7 shows statistics of the domains. For each domain, there are hundreds to thousands of types in the Catalog taxonomy, and the median number of products per type is thousands to tens of thousands. The Amazon Catalog contains thousands of attributes; however, for each individual product most of the attributes do not apply. Thus, for each product there are typically tens to hundreds of populated values. We say an attribute is covered by a type if at least one product of that type has a value in the Catalog for the attribute. As shown in the statistics, each domain involves thousands of attributes, and the median number of attributes per product type is 100-250.

Table 7: Statistics for raw data used as input to AutoKnow.

Product Domain      | Grocery | Health | Beauty | Baby
#types              | 3,169   | 1,850  | 990    | 697
med. #products/type | 1,760   | 18,320 | 27,150 | 28,700
#attributes         | 1,243   | 1,824  | 1,657  | 1,511
med. #attrs/type    | 113     | 195    | 228    | 206
Building a Graph: We implemented AutoKnow in a distributed setting using Apache Spark 2.4 on an Amazon EMR cluster, and Python 3.6 on individual hosts. Deep learning was implemented using TensorFlow, and the AWS Deep Graph Library (https://www.dgl.ai/) was used to implement the Graph Neural Networks for AK-Taxonomy. The AK-Relations component was implemented using Spark ML. The AK-Imputation component used an AWS SageMaker (https://aws.amazon.com/sagemaker/) instance for training.

Resulting PG: Key characteristics of our PG are shown in Table 8. The table presents aggregated statistics for the four product domains. We highlight that after product type extraction, we increased the size of the taxonomy by 2.9X, from 6.7K to 19K; some types appear in multiple domains, and there are 11K unique types. We show how much we improved the quality of the structured knowledge in the next section.

Table 8: Aggregated statistics describing our PG on four product domains (Grocery, Baby, Beauty, Health).

#Triples | #Attributes | #Types | #Products
>1B      | >1K         | >19K   | >30M

5.2 Quality Evaluation

Metrics: We report precision, recall, and F-measure of the knowledge. To understand how much of a gap there is in providing correct structured data for each attribute, we also define a new metric, called defect rate: the percentage of (product, attribute) pairs with missing or incorrect values. Specifically, consider an attribute A. Let c be the number of products with correct values for A, w be the number of products with a wrong value for A, s be the number of products where A does not apply but there is a value (e.g., flavor for shoes), m be the number of products where A applies but the value is missing, and t be the number of products within scope. We compute applicability, the percentage of products where A applies, as (c + w + m)/t; coverage as (c + s + w)/t; precision as c/(c + s + w); recall as c/(c + w + m); and defect rate as D = (w + s + m)/(c + w + s + m).
We consider three types of triples: triples with product types, such as (product-1, hasType, Television); triples with attribute values, such as (product-2, hasBrand, Sony); and triples depicting entity relations, such as (chocolate, isSynonymOf, choc). We report precision for each type of triple. Computing recall is hard, especially for type triples and relation triples, since it is nearly impossible to find all applicable types and synonyms; we thus report it only for triples with attribute values. For triples involving products, we used product-popularity-weighted sampling.

Type triples: Table 9 shows the quality of product-type triples measured by MTurk workers on 300 labels per domain. MoE shows the margin of error with a confidence level of 95%. AutoKnow obtained an average precision of 87.7% and increased the number of types in each domain by 2.9X on average.

Table 9: AutoKnow achieved 87.7% type precision and increased the number of types by 2.9X.

          | Grocery | Health | Beauty | Baby   | Avg
precision | 93.89%  | 84.60% | 82.24% | 89.97% | 87.68%
MoE       | 2.71%   | 4.08%  | 4.32%  | 3.40%  | 3.63%
#types    | 3368    | 7276   | 4102   | 4368   | 4778
increase  | 1.1X    | 3.9X   | 4.1X   | 2.4X   | 2.9X

Attribute triples: We start by choosing three text-valued attributes that are fairly popular across all 4 domains, and evaluate each (domain, attribute) pair on 200 samples (Table 10). Note that even though these are popular among all text-valued attributes except brand, the applicability is still fairly low (<10% in most cases), showing the big diversity within each domain. We made three observations. First, we significantly increased the quality of the data (precision up by 7.6%, recall up by 16.4%, F-measure up by 14.1%, and defect rate down by 14.4%). Second, the quality of the generated data is often correlated with the precision of the raw data. For example, the precision of the data in the Baby domain is very low; as a result, the PG data also have low precision. On the other hand, recall tends to have less effect; for example, Attribute 2 has a recall of 1.4% for Grocery, but we are able to boost it to 81.0% with reasonable precision (77.3%). Third, there is still huge room for improvement: the defect rate is at best 21.1%, and can be over 90% in certain cases. We also note that production requires high accuracy, so we trade recall for precision in the production system.

Next, we randomly chose 8 (type, attribute) pairs for the top-5 important attributes to report our results at a finer granularity (Table 18 in Appendix A). These attribute values are categorical (much smaller vocabulary) or binary, leading to higher extraction quality. AutoKnow obtained an average precision of 95.0% and improved the recall by 4.3X.

Table 10: PG improves over input data on average by 7.6% (percentage points) on precision, 16.4% on recall, and 14.4% on defect rate, with an average MoE of 6.56% on precision/recall.

Grocery       | Attr 1 Input | Attr 1 PG | Attr 2 Input | Attr 2 PG | Attr 3 Input | Attr 3 PG
Applicability |          38.51%          |          7.53%           |          10.00%
Precision     | 68.61%       | 82.59%    | 49.94%       | 77.30%    | 55.10%       | 55.10%
Recall        | 37.17%       | 83.15%    | 1.43%        | 80.96%    | 54.58%       | 54.59%
F-measure     | 48.22%       | 82.87%    | 2.78%        | 79.09%    | 54.84%       | 54.85%
Defect Rate   | 62.91%       | 21.14%    | 98.58%       | 30.72%    | 49.50%       | 49.49%

Health        | Attr 1 Input | Attr 1 PG | Attr 2 Input | Attr 2 PG | Attr 3 Input | Attr 3 PG
Applicability |          1.35%           |          0.59%           |          57.92%
Precision     | 70.54%       | 84.00%    | 59.11%       | 70.00%    | 78.00%       | 61.75%
Recall        | 59.69%       | 49.92%    | 69.50%       | 63.45%    | 47.13%       | 69.92%
F-measure     | 64.66%       | 62.62%    | 63.89%       | 66.56%    | 58.76%       | 65.58%
Defect Rate   | 49.69%       | 52.34%    | 48.31%       | 45.45%    | 55.04%       | 38.28%

Beauty        | Attr 1 Input | Attr 1 PG | Attr 2 Input | Attr 2 PG | Attr 3 Input | Attr 3 PG
Applicability |          0.04%           |          0.54%           |          4.82%
Precision     | 18.83%       | 48.00%    | 71.98%       | 76.00%    | 62.00%       | 61.68%
Recall        | 69.44%       | 69.44%    | 65.21%       | 59.95%    | 53.26%       | 62.98%
F-measure     | 29.62%       | 56.76%    | 68.43%       | 67.03%    | 57.30%       | 62.32%
Defect Rate   | 82.35%       | 59.02%    | 40.19%       | 42.76%    | 48.51%       | 39.50%

Baby          | Attr 1 Input | Attr 1 PG | Attr 2 Input | Attr 2 PG | Attr 3 Input | Attr 3 PG
Applicability |          0.0011%         |          0.09%           |          55.82%
Precision     | 1.45%        | 0.03%     | 8.22%        | 10.60%    | 42.00%       | 49.54%
Recall        | 9.79%        | 9.79%     | 3.83%        | 50.92%    | 44.13%       | 56.39%
F-measure     | 2.53%        | 0.06%     | 5.23%        | 17.55%    | 43.04%       | 52.74%
Defect Rate   | 98.72%       | 99.97%    | 97.30%       | 89.63%    | 60.81%       | 50.46%
Defect Rate 98.72% 99.97% 97.30% 89.63% 60.81% 50.46% ucts. Common practice is to structure product types into a tree
structure for “subtypes”, and classify each product into a single
Table 11: AutoKnow cleaned 1.77M incorrect values for two (ideally leaf) type. However, we miss multiple parents; for example,
attributes with a precision of 90%. “knife” has subtypes “chef’s knife”, “hunting knife”, which corre-
spondingly is also a subtype of “Cutlery & Knife” and “Hunting
Grocery Recall@ 90% Recall@ 80% #Removed %Removed kits”. Also, there is often no clear cut for product types: one product
Attribute 1 36.58% 58.20% 1,381,277 55.06%
can be both fashion swimwear and two-piece swimwear. In general,
Attribute 2 9.92% 13.29% 320,960 59.40%
Health Recall@ 90% Recall@ 80% #Removed %Removed
we need to extend concepts to model broadly subtype, synonym,
Attribute 1 76.72% 85.93% 30,215 32.63% and overlapping relationships; for each product, we can simply ex-
Attribute 2 3.18% 21.14% 14,110 20.92% tract the type from its title, and infer other types according to the
Beauty Recall@ 90% Recall@ 80% #Removed %Removed relationship between product types.
Attribute 1 94.33% 97.16% 10,926 62.31% Noisy data: Second, the large volume of noises can deteriorate
Attribute 2 46.05% 69.74% 7,651 11.87% the performance of the imputation and cleaning models. This can be
observed from our lower quality of knowledge in the Baby domain,
Baby Recall@ 90% Recall@ 80% #Removed %Removed
Attribute 1 87.24% 95.31% 2,673 66.17%
Attribute 2 52.07% 59.24% 1,956 73.04% caused by wrong product types and many inapplicable values in
Catalog. We propose aggressive data cleaning before using the
For example, the precision of the data in the Baby domain is very data for training, and training a multi-task end-to-end model that
low; as a result, the PG data also have low precision. On the other imputes missing values and identifies wrong or inapplicable values.
hand, recall tends to have less effect; for example, Attribute 2 has More than text: Third, product profile is not the only source
a recall of 1.4% for Grocery, but we are able to boost it to 81.0% of product knowledge. A study on randomly sampled 22 (type,
with reasonable precision (77.3%). Third, there is still huge space for attribute) pairs shows that 71.3% values can be derived from Ama-
improvement: the defect rate is at best 21.1%, and can be over 90% in zon product profiles, an additional 3.8% can be derived from Ama-
certain cases. We also note that production requires high accuracy, zon product images, and 24.9% have to be sought from external
so we trade recall with precision in the production system. sources such as manufacturer websites. This observation hints that
Next, we randomly chose 8 (type, attr) pairs for top-5 important a promising direction is to enhance AutoKnow with image pro-
attributes to report our results at a finer granularity (Table 18 in cessing capabilities and web extraction.
Appendix A). The attribute values are categorical (much smaller
vocabulary) or binary, leading to higher extraction quality. Auto- 7 RELATED WORK
Know obtained an average precision of 95.0% and improved the Industry KG systems typically rely on manually defined ontology
recall by 4.3X. and curate knowledge from Wikipedia and a number of structured
To highlight how data cleaning improves the data quality, we sources (e.g., Freebase[4], Bing Satori [12], YAGO [10], YAGO2 [14]).
show in Table 11 the noisy values we have removed for the same Research systems conduct web extraction, but again observing
7 RELATED WORK

Industry KG systems typically rely on manually defined ontologies and curate knowledge from Wikipedia and a number of structured sources (e.g., Freebase [4], Bing Satori [12], YAGO [10], YAGO2 [14]). Research systems conduct web extraction, but again observe a pre-defined ontology and focus on named entities such as people, movies, and companies (e.g., NELL [6], Knowledge Vault [8], DeepDive [7]). By comparison, this work extracts product knowledge from product profiles, where the structure, sparsity, and noise level are very different from webpages; many attribute values are free text or numbers instead of named entities. Incorporating taxonomy knowledge into machine learning models and utilizing customer behavior signals for supervision are two themes employed throughout this work to improve model performance.

The product knowledge graph described in [34] differs from our work in that it focuses on training product embeddings to represent the co-view/complement/substitute relationships defined therein, while this work focuses on collecting factual knowledge about products (e.g., product types and attribute values). Recent product property extraction systems [35, 37] apply tagging to product profiles, but consider a single product type. Web extraction systems [26, 28] extract product knowledge from semi-structured websites, and those techniques are orthogonal to ours.

In addition to end-to-end systems, there have been solutions for individual components, including ontology definition [4, 10], entity identification [10], relation extraction [20], hierarchical embedding [23], linkage [13, 24], and knowledge fusion [8, 9]. We apply these techniques whenever appropriate, and improve them to address the unique challenges of the product domain.

8 CONCLUSIONS

This paper describes our experience in building a broad knowledge graph for products of thousands of types. We applied a suite of ML methods to automate ontology construction, knowledge enrichment, and cleaning for a large number of products with frequent changes. With these techniques we built a knowledge graph that significantly improves the completeness, accuracy, and consistency of the data compared to the Catalog. Our efforts also shed light on how we may further improve by going both broader and deeper in product graph construction.

REFERENCES

[1] Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang. 2016. Detecting data errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment 9, 12 (2016), 993-1004.
[2] Mohit Bansal, David Burkett, Gerard De Melo, and Dan Klein. 2014. Structured Learning for Taxonomy Induction with Belief Propagation. In ACL.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5 (2017), 135-146.
[4] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD. ACM, 1247-1250.
[5] Georgeta Bordea, Els Lefever, and Paul Buitelaar. 2016. SemEval-2016 task 13: Taxonomy extraction evaluation (TExEval-2). In SemEval-2016. 1081-1091.
[6] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, Vol. 5. Atlanta, 3.
[7] Christopher De Sa, Alex Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. 2016. DeepDive: Declarative knowledge base construction. ACM SIGMOD Record 45, 1 (2016), 60-67.
[8] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In SIGKDD. ACM, 601-610.
[9] Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang. 2014. From Data Fusion to Knowledge Fusion. PVLDB (2014).
[10] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge unifying WordNet and Wikipedia. In WWW. 697-706.
[11] Tim Furche, Georg Gottlob, Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, and Cheng Wang. 2012. DIADEM: domain-centric, intelligent, automated data extraction methodology. In WWW. ACM, 267-270.
[12] Yuqing Gao, Jisheng Liang, Benjamin Han, Mohamed Yakout, and Ahmed Mohamed. 2018. Building a large-scale, accurate and fresh knowledge graph. In SIGKDD.
[13] Pankaj Gulhane, Amit Madaan, Rupesh Mehta, Jeyashankher Ramamirtham, Rajeev Rastogi, Sandeep Satpal, Srinivasan H Sengamedu, Ashwin Tengli, and Charu Tiwari. 2011. Web-scale information extraction with Vertex. In ICDE. 1209-1220.
[14] Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence 194 (2013), 28-61.
[15] Andrew Hopkinson, Amit Gurdasani, Dave Palfrey, and Arpit Mittal. 2018. Demand-Weighted Completeness Prediction for a Knowledge Base. In NAACL. 200-207.
[16] Giannis Karamanolakis, Jun Ma, and Xin Luna Dong. 2020. TXtract: Taxonomy-Aware Knowledge Extraction for Thousands of Product Categories. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[17] Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76-80.
[18] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 413-422.
[19] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-Task Deep Neural Networks for Natural Language Understanding. In ACL. 4487-4496.
[20] Colin Lockard, Xin Luna Dong, Arash Einolghozati, and Prashant Shiralkar. 2018. CERES: Distantly Supervised Relation Extraction from the Semi-structured Web. PVLDB (2018), 1084-1096.
[21] Yuning Mao, Xiang Ren, Jiaming Shen, Xiaotao Gu, and Jiawei Han. 2018. End-to-End Reinforcement Learning for Automatic Taxonomy Induction. In ACL. 2462-2472.
[22] Yuning Mao, Tong Zhao, Andrey Kan, Chenwei Zhang, Xin Luna Dong, Christos Faloutsos, and Jiawei Han. 2020. OCTET: Online Catalog Taxonomy Enrichment with Self-Supervision. In SIGKDD.
[23] Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In NIPS. 6338-6347.
[24] George Papadakis, Jonathan Svirsky, Avigdor Gal, and Themis Palpanas. 2016. Comparative analysis of approximate blocking techniques for entity resolution. Proceedings of the VLDB Endowment 9, 9 (2016), 684-695.
[25] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP. 1532-1543.
[26] Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, and Divesh Srivastava. 2015. Dexter: large-scale discovery and extraction of product specifications on the web. Proceedings of the VLDB Endowment 8, 13 (2015), 2194-2205.
[27] Simon Razniewski, Vevake Balaraman, and Werner Nutt. 2017. Doctoral advisor or medical condition: Towards entity-specific rankings of knowledge base properties. In International Conference on Advanced Data Mining and Applications.
[28] Martin Rezk, Laura Alonso Alemany, Lasguido Nio, and Ted Zhang. 2019. Accurate product attribute extraction on the field. In ICDE. 1862-1873.
[29] Michael Schlichtkrull and Héctor Martínez Alonso. 2016. MSejrKu at SemEval-2016 task 14: Taxonomy enrichment by evidence ranking. In SemEval.
[30] Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering 30, 10 (2018), 1825-1837.
[31] Shengjie Sun, Dong Yang, Hongchun Zhang, Yanxu Chen, Chao Wei, Xiaonan Meng, and Yi Hu. 2018. Important Attribute Identification in Knowledge Graph. arXiv preprint arXiv:1810.05320 (2018).
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998-6008.
[33] Jingjing Wang, Changsung Kang, Yi Chang, and Jiawei Han. 2014. A hierarchical Dirichlet model for taxonomy expansion for search engines. In WWW.
[34] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. 2020. Product Knowledge Graph Embedding for E-commerce. In Proceedings of the 13th International Conference on Web Search and Data Mining. 672-680.
[35] Huimin Xu, Wenting Wang, Xinnian Mao, Xinyu Jiang, and Man Lan. 2019. Scaling up Open Tagging from Tens to Thousands: Comprehension Empowered Attribute Value Extraction from Product Title. In ACL. 5214-5223.
[36] Dongxu Zhang, Subhabrata Mukherjee, Colin Lockard, Xin Luna Dong, and Andrew McCallum. 2019. OpenKI: Integrating Open Information Extraction and Knowledge Bases with Relation Inference. arXiv preprint arXiv:1904.12606 (2019).
[37] Guineng Zheng, Subhabrata Mukherjee, Xin Luna Dong, and Feifei Li. 2018. OpenTag: Open Attribute Value Extraction from Product Profiles. In SIGKDD.
A EXAMPLES
Here we provide example outputs produced by each of the five components (Figure 2). Taxonomy enrichment and relation discovery results are shown in Tables 13, 14, and 15. Next, data imputation, cleaning, and synonym finding results are shown in Figure 3 and Tables 16 and 17. Finally, we also show additional evaluation results for the entire pipeline on a sample of type-attribute pairs in Table 18 (see evaluation details in Section 5).
Table 13: Examples of type extraction results.

Source   Text                                                         Product Type
Product  4 Country Pasta Homemade Style Egg Pasta - 16-oz bag         Egg Pasta
Product  Hamburger Helper Lasagna Pasta, Four Cheese, 10.3 Ounce      Lasagna Pasta
         (Pack of 6)
Product  COFFEE MATE The Original Powder Coffee Creamer 35.3 Oz.      Coffee Creamer
         Canister Non-dairy, Lactose Free, Gluten Free Creamer
Query    mccormick paprika 8.5 ounce                                  paprika
Query    flax seeds raw                                               flax seeds
Table 14: Examples of detected product type hypernyms.

Child Type              Parent Type
Coconut flour           Baking flours & meals
Tilapia                 Fresh fish
Fresh cut carnations    Fresh cut flowers
Bock beers              Lager & pilsner beers
Pinto beans             Dried beans
Table 15: Attributes identified as most important for two example types.

Cereals          Shampoo
brand            brand
ingredients      hair type
flavor           number of items
number of items  ingredients
energy content   liquid volume

Figure 3: Examples of extracted attribute values from OpenTag and TXtract.
B ATTRIBUTE APPLICABILITY AND IMPORTANCE
Recall that for each (product type, attribute) pair we need to identify whether the attribute applies and how important the attribute is (e.g., whether color applies to Shoes, and how important color is for Shoes). To this end, we independently train a Random Forest classifier to predict applicability and a Random Forest regressor to predict importance scores (for applicable attributes). In both cases, we consider each (product type, attribute) pair as an instance, and we label a sample of such pairs with either applicability or importance labels (Section 3.2). Sample prediction results for attribute importance are shown in Table 15.
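As a concrete illustration, below is a minimal scikit-learn sketch of the two predictors, not our production pipeline; the arrays `features`, `applicable`, and `importance` are hypothetical stand-ins for the labeled (product type, attribute) data, and the feature layout anticipates the coverage and mention-frequency features described next.

```python
# Minimal sketch of the two predictors, assuming scikit-learn is available.
# `features`, `applicable`, and `importance` are hypothetical stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# One row per (product type, attribute) pair: here, 61 columns stand in for
# the coverage feature plus the 60 mention-frequency features described next.
features = rng.random((500, 61))
applicable = rng.integers(0, 2, size=500)   # binary applicability labels
importance = rng.random(500)                # importance labels in [0, 1]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, applicable)

# The importance regressor is trained only on pairs labeled as applicable.
mask = applicable == 1
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(features[mask], importance[mask])

new_pairs = rng.random((3, 61))
print(clf.predict(new_pairs))   # does the attribute apply to the type?
print(reg.predict(new_pairs))   # how important is the attribute?
```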
In both models, we use the same set of features that characterize how relevant the attribute is for the given product type. The first feature is coverage, which is the proportion of products that have a non-missing attribute value. Next, a range of features are based on the frequency of attribute mentions in different text sources. Consider a text source s (e.g., product descriptions, reviews, etc.), and suppose that all products of the required type are indexed from 1 to n. Note that a particular product i can have several pieces of text of type s (e.g., a product might have several reviews); let l(s, i) denote the number of such pieces. A feature based on signal s is then defined as

x(s) = (1/n) Σ_{i=1}^{n} (1/l(s,i)) Σ_{j=1}^{l(s,i)} M(s, i, j).

Here M(s, i, j) = 1 if the j-th piece of text of signal s associated with product i mentions the attribute (e.g., whether the j-th review of the i-th shoes mentions color), and otherwise M(s, i, j) = 0.

We consider two implementations of M(s, i, j), and accordingly, for each s we compute two features. First, M(s, i, j) = 1 if text piece j contains the attribute value of product i (e.g., whether the review for product i contains the color of this product). Second, M(s, i, j) = 1 if text piece j contains any common attribute value for this product type (i.e., whether the review for product i contains any frequent color values among Shoes). We consider a value to be common if it is among the top 30 most frequent values within the given type. We consider several text signals (e.g., product titles, reviews, search queries, etc.) and compute 30 features as described above. Finally, for each feature we also consider an alternative where products are weighted by popularity, and thus in total we have 60 features.
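For concreteness, the following minimal sketch computes x(s) for one (product type, attribute) pair and one text source (reviews); the `products` structure and the example values are illustrative assumptions, not our data model.

```python
# Minimal sketch of the mention-frequency feature x(s) for one text source s
# (here, reviews); `products` is a hypothetical stand-in for the real data.

def mentions(text: str, values: set) -> bool:
    """M(s, i, j): does this piece of text mention any of the given values?"""
    text = text.lower()
    return any(v.lower() in text for v in values)

def x_of_s(products: list, values_for) -> float:
    """x(s) = (1/n) * sum_i [ (1/l(s,i)) * sum_j M(s, i, j) ],
    where values_for(p) gives the value set that M checks for product p."""
    total = 0.0
    for p in products:
        pieces = p["reviews"]          # the l(s, i) text pieces of source s
        if pieces:
            vals = values_for(p)
            total += sum(mentions(t, vals) for t in pieces) / len(pieces)
    return total / len(products)

products = [
    {"color": "red",  "reviews": ["fits well", "great black laces"]},
    {"color": "blue", "reviews": ["nice shade of blue"]},
]
TOP_COLORS = {"red", "blue", "black"}  # top values within the type (truncated)

# First implementation of M: the product's own attribute value.
print(x_of_s(products, lambda p: {p["color"]}))   # -> 0.5
# Second implementation of M: any common value for the product type.
print(x_of_s(products, lambda p: TOP_COLORS))     # -> 0.75
```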
Table 16: Example errors found by the cleaning model.

Product profile                                         Attribute  Attribute value
Love of Candy Bulk Candy - Pink Mint Chocolate          Flavor     Pink
Lentils - 6lb Bag
Scott's Cakes Dark Chocolate Fruit & Nut Cream          Flavor     snowflake box
Filling Candies with Burgandy Foils in a 1 Pound
Snowflake Box
Lucky Baby - Baby Blanket Envelope Swaddle Winter       Flavor     Color 1
Wrap Coral Fleece Newborn Blanket Sleeper Infant
Stroller Wrap Toddlers Baby Sleeping Bag (color 1)
ASUTRA Himalayan Sea Salt Body Scrub Exfoliator +       Scent      vitamin c body scrub -
Body Brush (Vitamin C), 12 oz | Ultra Hydrating,                   12oz & body brush
Gentle, Moisturizing | All Natural & Organic Jojoba,
Sweet Almond, Argan Oils
Folgers Simply Smooth Ground Coffee, 2 Count            Scent      2Packages (Breakfast
(Medium Roast), 31.1 Ounce                                         Blend, 31.1 oz)
Table 17: Examples of discovered flavor and scent synonym pairs.

Flavor synonyms
herb and garlic              herb & garlic
macadamia nut                macadamia
roasted oolong tea           roasted oolong
decaffeinated honey lemon    decaf honey lemon
zero carb vanilla            zero cal vanilla

Scent synonyms
basil (sweet)                sweet basil
rose flower                  rose
aloe lubricant               aloe lube
unscented                    uncented
moonlight path               moonlit path
Table 18: AutoKnow obtained an average precision of 95.0% and improved the recall by 4.3X for important categorical/binary attributes. Recall gain is the ratio of AutoKnow's recall to the baseline recall; e.g., the 7.3X gain for pair-1 implies a baseline recall of roughly 9.6%.

Scope   Precision  Recall  Recall gain
pair-1  91.05%     70.18%  7.3X
pair-2  97.06%     19.70%  5.2X
pair-3  97.12%     36.13%  1.3X
pair-4  93.87%     37.72%  10.5X
pair-5  95.88%     25.01%  1.2X
pair-6  90.42%     87.46%  2.8X
pair-7  97.97%     55.95%  1.4X
pair-8  96.44%     87.49%  4.5X
Figure 4: Cleaning model architecture.

C TAXONOMY-AWARE SEMANTIC CLEANING
The cleaning model detects whether or not a triple (PID, A, V) is correct (i.e., whether V is the correct value of attribute A for product PID) by attending to its taxonomy node and semantic signals in the product profile. Let D = [d_1, ..., d_{n_D}], T = [t_1, ..., t_{n_T}], and V = [v_1, ..., v_{n_V}] be the token sequences of the product description, product taxonomy, and target attribute value, respectively. We construct the input sequence S by concatenating D, T, and V and inserting special tokens "[CLS]" and "[SEP]" as follows:

S = concat([CLS], D, [SEP], T, [SEP], V) := [s_1, ..., s_{n_S}],   (4)

where n_S = n_D + n_T + n_V + 3. We then map each s_i ∈ S to an embedding vector e_i ∈ R^d as the summation of three embedding vectors of the same dimension d:

e_i = e_i^FastText + e_i^Segment + e_i^Position,   i = 1, ..., n_S,   (5)

where e_i^FastText is the pretrained FastText embedding [3] of s_i, and e_i^Segment is a segment embedding vector defined as

e_i^Segment = e_D if s_i ∈ D;   e_T if s_i ∈ T;   e_V if s_i ∈ V,   (6)

and e_i^Position is the position embedding vector of the location of s_i in the sequence (i.e., i), for which we adopt the same construction used in [32]. Here the e_i^FastText's and e_i^Position's are frozen (not trainable), while e_D, e_T, and e_V are randomly initialized and jointly trained with the other model parameters.

The embedding sequence [e_i]_{i=1}^{n_S} is propagated through a multi-layer transformer model, where the number of layers, the number of heads, and the hidden dimension are hyperparameters. The final embedding vector of the special token [CLS], denoted by e^Out, captures the distilled representations of all three input sequences. It is passed through a dense layer followed by a sigmoid node to produce a single score between 0 and 1, indicating the likelihood of the input triple (PID, A, V) being correct. See Figure 4 for an illustration of the model architecture.

In Table 16 we give examples of attribute value errors detected by the cleaning model.
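The PyTorch sketch below mirrors this architecture under the assumptions stated above (frozen token and position embeddings, trainable segment embeddings, a transformer encoder with a sigmoid head on the [CLS] position). The vocabulary, tokenization, FastText lookup, and hyperparameter values are hypothetical placeholders, not our production configuration; sinusoidal position embeddings are assumed, following [32].

```python
# Minimal PyTorch sketch of the cleaning model; vocabulary, tokenization,
# the FastText table, and all hyperparameters are hypothetical placeholders.
import math
import torch
import torch.nn as nn

class CleaningModel(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        # Stand-in for the pretrained FastText table; frozen, as in Eq. (5).
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.token_emb.weight.requires_grad = False
        # Frozen sinusoidal position embeddings, following [32].
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_emb", pe)
        # Trainable segment embeddings e_D, e_T, e_V of Eq. (6); this sketch
        # adds an extra segment (index 0) for the [CLS]/[SEP] special tokens.
        self.seg_emb = nn.Embedding(4, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)   # dense layer before the sigmoid

    def forward(self, token_ids, segment_ids):
        # Eq. (5): e_i = e_i^FastText + e_i^Segment + e_i^Position.
        n = token_ids.size(1)
        e = self.token_emb(token_ids) + self.seg_emb(segment_ids) + self.pos_emb[:n]
        h = self.encoder(e)
        # Read the [CLS] position (index 0) and squash it to a score in (0, 1).
        return torch.sigmoid(self.head(h[:, 0, :])).squeeze(-1)

# token_ids encodes S = [CLS] D [SEP] T [SEP] V (Eq. 4);
# segment ids: 0 = special tokens, 1 = D, 2 = T, 3 = V.
model = CleaningModel(vocab_size=1000)
token_ids = torch.randint(0, 1000, (2, 20))
segment_ids = torch.randint(0, 4, (2, 20))
print(model(token_ids, segment_ids))   # two scores: P(triple is correct)
```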