HMM-based address parsing with massive synthetic training data generation

X Li, H Kardes, X Wang, A Sun
Proceedings of the 4th International Workshop on Location and the Web, 2014 - dl.acm.org
Record linkage is the task of identifying which records in one or more data collections refer to the same entity, and the address is one of the most commonly used fields in databases. Hence, segmentation of raw addresses into a set of semantic fields is the primary step in this task. In this paper, we present a probabilistic address parsing system based on the Hidden Markov Model. We also introduce several novel approaches to synthetic training data generation to build robust models for noisy real-world addresses, obtaining 95.6% F-measure. Furthermore, we demonstrate the viability and efficiency of this system for large-scale data by scaling it up to parse billions of addresses.
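
To make the idea of HMM-based address segmentation concrete, here is a minimal sketch of Viterbi decoding over address tokens. It is not the authors' system: the label set, the `SUFFIXES` list, and every probability below are hypothetical values picked by hand for illustration, whereas the abstract describes models trained on synthetic data.

```python
import math

# Hypothetical label set and hand-picked probabilities, for illustration only;
# none of these values come from the paper.
STATES = ["number", "street", "suffix", "city"]

START = {"number": 0.7, "street": 0.2, "suffix": 0.05, "city": 0.05}

TRANS = {
    "number": {"number": 0.05, "street": 0.85, "suffix": 0.05, "city": 0.05},
    "street": {"number": 0.05, "street": 0.30, "suffix": 0.55, "city": 0.10},
    "suffix": {"number": 0.05, "street": 0.05, "suffix": 0.05, "city": 0.85},
    "city":   {"number": 0.05, "street": 0.05, "suffix": 0.05, "city": 0.85},
}

# Assumed list of street-type words used by the toy emission model.
SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road", "blvd"}


def emission(state, token):
    """Toy emission probability based on coarse token features."""
    tok = token.lower().strip(".,")
    if tok.isdigit():
        return 0.9 if state == "number" else 0.03
    if tok in SUFFIXES:
        return 0.8 if state == "suffix" else 0.07
    # Any other word is treated as equally likely to be a street or city word.
    return 0.45 if state in ("street", "city") else 0.05


def viterbi(tokens):
    """Return the most probable label sequence, computed in log space."""
    if not tokens:
        return []
    scores = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
               for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for tok in tokens[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emission(s, tok)))
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Follow back-pointers from the best final state to recover the path.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


labels = viterbi("123 Main St Springfield".split())
```

With these toy parameters, `labels` assigns one field label to each token of the input address. In a trained system the start, transition, and emission tables would be estimated from labeled addresses rather than set by hand.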