Analysis of Geographic Queries in A Search Engine Log

This document analyzes geographic queries in a search engine log to identify opportunities for improving geographic search engines. The authors manually examined thousands of queries to observe typical properties of geographic queries. They then built a classifier to separate 36 million queries into geographic and non-geographic queries. Key findings include the most common types of geographic terms used, how geographic queries related to the websites visited and users issuing them, and a proposed new taxonomy of geographic query types. The goal is to provide insights into how people write geographic queries and how search engines can better process them.

Analysis of Geographic Queries in a Search Engine Log

Qingqing Gan, Polytechnic University, Brooklyn, NY 11201
Josh Attenberg, Polytechnic University, Brooklyn, NY 11201
Alexander Markowetz, University of Science & Technology, Hong Kong, S.A.R
Torsten Suel, Polytechnic University, Brooklyn, NY 11201
[email protected] [email protected] [email protected] [email protected]
ABSTRACT
Geography is becoming increasingly important in web search. Search engines can often return better results to users by analyzing features such as user location or geographic terms in web pages and user queries. This is also of great commercial value as it enables location-specific advertising and improved search for local businesses. As a result, major search companies have invested significant resources into geographic search technologies, also often called local search.

This paper studies geographic search queries, i.e., text queries such as “hotel new york” that employ geographical terms in an attempt to restrict results to a particular region or location. Our main motivation is to identify opportunities for improving geographical search and related technologies, and we perform an analysis of 36 million queries of the recently released AOL query trace. First, we identify typical properties of geographic search (geo) queries based on a manual examination of several thousand queries. Based on these observations, we build a classifier that separates the trace into geo and non-geo queries. We then investigate the properties of geo queries in more detail, and relate them to web sites and users associated with such queries. We also propose a new taxonomy for geographic search queries.

Categories and Subject Descriptors
H.3.1 [Information Systems]: Content Analysis and Indexing—Indexing methods; H.3.3 [Information Systems]: Information Search and Retrieval—Search process

General Terms
Measurement, Human Factors

Keywords
web search, geographic search, local search, query log mining

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
LocWeb 2008, April 22, 2008, Beijing, China.
ACM 978-1-60558-160-6/08/04 ...$5.00.

1. INTRODUCTION

Over the last decade, search engines have become the primary means of locating information for many people. For this reason, researchers have started investigating available search query logs, in order to better understand what people are searching for, how they are searching, and how this process can be improved. A number of recent studies [30, 11, 29, 4, 25] have looked at query logs from various perspectives, including Computer Science, Library and Information Science, and Social Sciences. Our perspective is primarily from Computer Science, where researchers mine query logs and click-through behavior to optimize system performance or provide more accurate results.

While the Web has removed many geographical limitations in media, communications, and e-commerce, many geographical aspects of the physical world are nonetheless reflected in the Web’s content and structure. As a result, geography often provides a useful and intuitive constraint for Web search. This paper investigates geographic search queries, i.e., keyword queries that employ geographical terms in order to obtain search results related to a particular geographical location or area. Typical examples are “hotels new york”, “building codes in seattle”, “virgina historical sites”, or “unemployment long island”. Such queries frequently contain names of cities, states, or countries – often abbreviated, e.g., “CA”, “NYC”, or “SF”. Alternately, they may contain street names, informal synonyms (e.g., “big apple”), or refer to landmarks and neighborhoods (e.g., “SoHo” in New York). In some cases, users include zip codes or phone numbers.

Because of geography’s important role in search requests, and the significant commercial potential of such queries (e.g., for hotels, real estate, or local businesses), search companies have recently invested significant resources into geographic (geo) search technologies (also called local search), i.e., methods aimed at giving improved answers to geographic search requests. Approaches range from integration of business directories (yellow pages) to answer fairly simple but lucrative queries (e.g., for hotels, shops, and restaurants), to a more detailed analysis of queries, page content, and site and link structure in order to facilitate more general queries. Geo search applications can use a standard keyword interface and extract geographic terms from queries, employ graphic interfaces such as interactive maps, or use the current location of a mobile user. In general, geo search engines combine knowledge regarding how people use geographic terms in queries, how such terms are used in pages, and how sites are organized and linked with respect to geography. They commonly also use external data sources, in particular gazetteers listing the names and locations of states, cities, or businesses. Geo search technology has recently been studied by a number of researchers, mainly focusing on the extraction of geographic information from page content and structure [22, 24, 2, 14, 20, 9], indexing and query processing [38, 7, 35, 21], and the automatic identification of geographic queries [10, 36].

Our main objective is to identify opportunities for improving geographic search engines. However, our observations should be of more general interest. We investigate real world queries of a large query log from a standard (non-geographic) search engine, namely 36 million queries from AOL. We study how people write geographic queries and how these should be processed by search engines. Our paper builds on work in [28] and [37] that analyzed geographic queries.

We are interested in what types of geographic queries (informational, navigational, transactional) users issue, what types of geographic terms they employ, and what they are looking for. We also study what sites users visited as a result of a geo query, how different geographic terms were used by the same user, and what non-geographic terms are associated with geographic terms.

The remainder of this paper is organized as follows. Section 2 provides a basic background and an overview of related work. Section 3 introduces the data set. Section 4 shows how geographic features can be used to classify queries into geo and non-geo queries. The next three sections investigate geographic properties of queries, users, and sites, respectively. The main focus lies on our taxonomy of geographic queries. Finally, we conclude in Section 8.

2. BACKGROUND AND RELATED WORK

There is significant literature on search engine logs, including studies of general search logs [30, 11, 29, 4, 25], and various papers focusing on special types of users and collections, e.g., multi-media search [12], intranet search [31], blog search [23], or search in other languages [19]. In particular, Kamvar and Baluja [15] studied the characteristics of mobile queries submitted to Google’s search services for
PDA and cellular phones. We note that while mobile and geographic (local) search are often thought of as being closely related technologies, they are certainly not the same. It can be argued that many mobile queries are in fact geographic in nature, and that for certain types of queries it may make sense to return results related to the current position of the user. Kamvar and Baluja [15] investigate various features of mobile queries, including query length and topics (but not geography), focusing on the user interface aspects of small screens and limited input capabilities. In contrast, we focus on queries issued by desktop and laptop users to a general search engine.

Search queries can be categorized according to several dimensions. Broder [5] first proposed three distinct categories of queries: (i) navigational, (ii) informational, and (iii) transactional. Of particular importance to our approach is the work by Rose and Levinson [27] who expanded Broder’s work into a more detailed taxonomy, also consisting of three categories but differentiated further into ten search goals:

A. Navigational: The user has a distinct Web site or page in mind that he knows or assumes to exist. Navigational queries often contain fragments of URLs or names of organizations. The user commonly clicks on only one result, taking him directly to the desired page.

B. Informational: These queries are similar to those traditionally studied in IR, i.e., the user wants information about a certain topic, either broad (e.g., “history us”) or narrow (e.g., “special nutrition for wound care”). Here, users often follow several of the resulting links.
• Closed: queries seek a single, closed answer.
• Open: queries seek open-ended answers or answers of unlimited depth.
• Undirected: queries target anything or everything about a particular topic.
• Advice: queries seek advice or instructions to complete a task.
• Locate: queries attempt to detect where a real world good or service can be obtained.
• List: queries search for lists of good pages on a topic, e.g., a Yahoo or ODP directory.

C. Resource: These queries target resources, not web documents.
• Download: queries target a resource which must be downloaded to be useful.
• Entertainment: queries search for pages which when viewed may provide entertainment.
• Interact: queries look for pages which require further interaction, for instance map or weather services.
• Obtain: queries seek documents which are useful on or off the computer, such as tax forms or government documents.

In [27, 17], researchers studied users’ navigational behavior (in particular, click-through behavior), since a user’s goal cannot always be inferred by just looking at a query. They find that over 60% of queries were informational, and a large fraction of the other nearly 40% seemed to seek commercial transactions, rather than request product information. Distributions of search taxonomies are subject to changes in search technology and user behavior: somebody who a few years ago may have looked for the Web site of a company (navigational) for product information may now be willing and able to order the item directly from the site (locate). In this paper, we use the classification in [27], utilizing click-through data to identify the information need reflected by a query.

We are also interested in examining geo-queries categorized according to the topic-based taxonomy of Spink et al. [30]. Here queries are assigned to one of eleven categories according to what topic most closely matches their intent. These categories are, in decreasing order of the fraction of all queries in a general query log within each category:
1. Entertainment
2. Pornography
3. Business, travel, employment
4. Computers
5. Science and medicine
6. People, places, things, odds and ends
7. Society and religion
8. Education, humanitarian interests
9. The arts
10. Government
11. Unknown and other

Even before the web, researchers studied how to exploit geographic information embedded in documents for better text search and analysis; see [16] for a good overview of early work. Initial work on geographic search on the web appears in [6, 9, 22], and in recent years a significant amount of research has addressed this new challenge. Geographic queries were previously studied by Sanderson and Kohler [28] and by Zhang et al. [37]. The former provides a brief study of some of the properties of geographic queries, in particular frequency, topics, length, and spatial relationships. The latter study focuses on the issue of geo modification in consecutive queries, i.e., how users modify their choice of geographic terms when the previous query did not provide satisfactory results.

Suppose a user looking for a nearby yoga class searches for “yoga park slope” (a neighborhood in Brooklyn). When this search returns poor results, she might try “yoga new york” and be swamped by many irrelevant results. Finally, “yoga brooklyn” satisfies her information need. For a single search task, she had to re-write the same query several times. One goal of geographic search technology is to avoid successive query modification through proper analysis of queries and collections. The automatic rewriting method in [37] provides one such approach (also related to the query expansion technique for geographic search in [8]). Our work here expands on [28] by providing a more in-depth analysis of the properties of geo queries. This paper also investigates the relationship between geography, page topic, and users, and is to our knowledge the first work in this direction.

Closely related to the analysis of geographic queries is the automatic detection of geo queries [10, 36, 37], and in general of geographic terms in text data [18, 2]. In particular, automatic detection is highly useful for measuring the statistical properties of geo queries in large logs. Such detection can be based either on individual queries, or can include past queries, past click-through behavior, or results returned by the engine. There have been many proposals on how to use knowledge mined from search query logs, such as click-through information, repeated identical or related queries by the same or different users, or co-occurrences of terms in queries, to deliver improved search results to users [3, 13, 33, 26, 34, 32, 1]. The study of geographic queries by the same or different users, or of click-through behavior on such queries, is also of interest in this context.

3. IDENTIFYING GEO QUERIES

This section lays the foundation for our study. We describe the underlying data, discuss basic geographic properties, and introduce a taxonomy of geographic queries. The relative frequency of geographical queries as well as their subtypes is evaluated on a manually geo-coded query set. Finally, we propose two classifiers to classify the entire query trace. These classifiers are highly accurate, as evaluated on the manually geo-coded samples. We then use these classifiers to aid in our subsequent statistical evaluation of the entire trace.

3.1 Underlying Data

We study a trace of the AOL search engine, recording queries of roughly 650,000 users over three months in early 2006. The trace consists of about 36 million lines of data, each containing five fields:
AnonID: an anonymous user-ID
Query: the actual query terms
QueryTime: when the query was issued
Item-Rank: the rank of the clicked result
ClickURL: the host-level result the user clicked on (if any)

In case the user clicked on multiple results to a single query, these events are recorded in the form of extra lines. For an in-depth description of the data, see [25].

Although real-life queries are often malformed and misspelled, the user’s intent is usually quite clear. For example, “www.footballcamps-atlanta.google” is clearly malformed, but it is apparent what the user was looking for. Similarly, “noweign cruise lines” is misspelled, but has a clear intention.¹ When classifying queries by hand, we label according to the intent of the user, not according to any mistakes, when possible. This is done using the methodology of Rose and Levinson [27], utilizing click-through data for clarification when queries alone are insufficient for determining intent. The rationale is that query classification per se should be interested in a user’s intent, not her way of expressing this intent. Also, most advanced search engines realize users’ mistakes and propose corrected versions of the query. Due to limited resources, we do not perform spell-checking when performing automatic classification on the entire query trace.

¹ Norwegian Cruise Line is a large cruise operator.

To detect geographic terms in queries, we use the US Census Bureau’s gazetteer, which contains names and locations of counties, their subdivisions (district, borough, barrio), places (town, city, village, etc.), and ZIP Codes for all 50 states.

3.2 Hand-Tagging Geo Queries

We begin by extracting an initial sample of 6000 random queries from the data set. After discarding all queries consisting exclusively of URLs and some badly misspelled or malformed queries, 4495 queries remain. These are examined manually, and assigned one of four labels, according to their geographic intent and their use of common geographic terms. Thus, for each query we decide if it has a geographic intent, and if it contains the name of a city, county, or state according to the gazetteer. Note that other geographic terms also appear frequently, such as street names or names of landmarks or places of interest (e.g., “statue of liberty” or “empire state building”). The four categories are: (i) Geographic queries that contain a city, country or state name as a geographic term. (ii) Geographic queries that do not contain such terms. (iii) Non-geographic queries seemingly containing a geographic term, e.g., “whitney houston”. This category includes many entity names, such as “Kentucky Fried Chicken”, “New York Times” or “First Niagara Bank”. (iv) Non-geographic queries without geographic terms. The numerical results of this classification are presented in Table 3.1.

Types of Queries             Share of Queries
Geo with Geo terms           12.01%
Geo without Geo terms         0.93%
Non-Geo with Geo terms       24.44%
Non-Geo without Geo terms    62.62%

Table 3.1: Geo vs. non-geo queries.

Table 3.1 may give the impression that only 13% of the queries pursue a geographically focused task, but the real percentage should be somewhat higher. The AOL query trace is based on a standard search engine, with no explicit geo capabilities. Many users with a geographical search task in mind may only use such search engines to find a Web site that will allow them to restrain the geographic focus of their query in a second step. In our random sample, for example, we find about twenty-five requests for mapping services (e.g., mapquest.com). These users are most likely pursuing a geographic search task. Similarly, users searching for “craigslist” will have to specify a metropolitan area of interest as soon as they access www.craigslist.org. Many queries for retail chains, e.g., Radio Shack, Nordstrom, or Target, are likely geographic in nature as users often seek to locate a store using the company’s web site. We did not evaluate the number of such queries, as it would be difficult to guess if a user is interested in finding a local store or making an online purchase. In any case, 13% is probably an underestimate of the frequency of geographic search tasks.

In our experiments, we only consider geographic entities within the United States; thus, queries that refer to international locations or to the US as a whole are ignored. The rationale behind this decision is that any automatic query classifier needs to incorporate some understanding of the language issues, ambiguities and difficulties associated with the geographic query terms from a particular region. Such information is usually compiled for a single region or country at a time; for this reason, local search engines are commonly launched on a per-country basis. Since we are best able to manage these issues within the geographic and linguistic confines of the United States, we chose to restrict our work to queries focused there.

After manual classification, we discovered 582 queries with geographic intent out of 4495 queries in the sample. We then looked at the query length (number of terms) of these queries; the results are shown in Table 3.2. Note that the columns titled “Non-Geo” and “Geo” indicate the distribution of geographic and non-geographic queries in terms of query length; thus, 14.48% of all geographic queries have 2 terms. The column titled “Geo of all” depicts the percentage of all queries with a given number of terms which have a geographic intent; thus, 18.78% of all queries with three terms are geographic queries.

Num. Query Terms    Non-Geo    Geo       Geo of all
1                   25.54%      1.03%     0.52%
2                   33.95%     14.48%     5.22%
3                   19.54%     35.04%    18.78%
4                   10.47%     26.21%    24.56%
5                    5.19%     17.93%    30.86%
> 5                  5.31%      5.31%    11.19%

Table 3.2: Number of terms in geo and non-geo queries.

This table confirms what was noticed in [28] and [37]: geo queries tend to have more terms than non-geo queries, and conversely the likelihood that a query is a geo query increases with the number of terms. However, one has to be very careful in interpreting these results. It should be expected that many classes of specialized queries, say geographic queries, people queries, or product queries, have more terms than average. If we imagine that each term in a query is chosen from some distribution, then the likelihood that a geo term (or people term, or product term) is present, and/or that a geographic or people or product intent is present, increases with the number of terms. Note also that classes such as geographic and health queries are not mutually exclusive, and that a longer query may be more likely to be in several classes. Thus, it is not impossible that most or even all such specialized classes of queries of interest have an above average number of terms. Finally, a very short query is less likely to be recognized as a geographic query even if the underlying intent is geographic (e.g., a query “walmart” that tries to find the closest store on the company website). Related to this, [37] reports that 12.7% of query rewrites add a geo-specific term; thus, the original query probably had geographic intent. A good geographic search engine might use the user’s location and previous geographic queries to return likely results of interest without a rewrite by the user.

3.3 Taxonomies for Geo-Search Queries

Following Rose and Levinson [27], we classified about 500 geo queries and about 500 non-geo queries from our sample into eleven distinct categories according to the apparent goal of the user, as inferred from the query itself and the associated click-through data. The results, given in Figure 3.1, show significant differences between geo and non-geo queries. Geo queries are more frequently aimed at locating goods and services; non-geo queries are more likely aimed at entertainment, downloads, or lists of pages with further information.
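The two kinds of percentages in Table 3.2 are different normalizations of the same hand-labeled counts: the “Non-Geo” and “Geo” columns divide by the size of each class, while “Geo of all” divides by the number of queries of a given length. The following is a minimal Python sketch of that bookkeeping, not the authors' code; the function names and the six toy labels are hypothetical, and only the counts 582 and 4495 come from the text.

```python
from collections import Counter

BUCKETS = ["1", "2", "3", "4", "5", ">5"]


def length_bucket(query: str) -> str:
    """Bucket a query by its number of whitespace-separated terms."""
    n = len(query.split())
    return str(n) if n <= 5 else ">5"


def length_table(labeled):
    """labeled: iterable of (query, is_geo) pairs from a hand-tagged sample.

    Returns {bucket: (share_of_non_geo, share_of_geo, geo_of_all)}.
    """
    geo, non_geo = Counter(), Counter()
    for query, is_geo in labeled:
        (geo if is_geo else non_geo)[length_bucket(query)] += 1
    n_geo, n_non = sum(geo.values()), sum(non_geo.values())
    table = {}
    for b in BUCKETS:
        total = geo[b] + non_geo[b]
        table[b] = (
            non_geo[b] / n_non if n_non else 0.0,  # "Non-Geo" column
            geo[b] / n_geo if n_geo else 0.0,      # "Geo" column
            geo[b] / total if total else 0.0,      # "Geo of all" column
        )
    return table


# Toy, hand-made labels for illustration only (NOT taken from the AOL trace).
sample = [
    ("hotels new york", True),
    ("yoga brooklyn", True),
    ("building codes in seattle", True),
    ("craigslist", False),
    ("whitney houston", False),
    ("ellsworth kelly prints", False),
]
table = length_table(sample)

# Overall share of geo queries in the hand-tagged sample reported in the text.
geo_share = 582 / 4495  # roughly 13%
```

On a real sample the first two columns each sum to 100%, which is the same per-class normalization that makes the bars of each color in Figure 3.1 sum to 1.0.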
Figure 3.1: Distribution of geo and non-geo queries according to the taxonomy of Rose-Levinson. Note that the bars in each color sum up to a total of 1.0.

Navigational queries of a geographic nature often point to regional sections of a nation-wide corporation or service. We observe two typical cases: (1) Site-Wide. The geographic term is used to distinguish the desired Web site from other similar Web sites. For example, “DMV ny” targets www.nydmv.state.ny.us, while “DMV ca” targets www.dmv.ca.gov. Similarly, many different cities have bars or restaurants with identical names (e.g., Joe’s Pizza) that are not affiliated in any way. (2) Site-Internal. Here the non-geographic terms already determine the desired Web site, and the geographic term targets a particular page or item inside this site (e.g., “craigslist boston”).

The difference between “locate” queries in the context of geo vs. non-geo queries is pronounced. Most geo-query “locate” searches consist of the name of a particular store or a search for a service in an area, e.g., “florists phoenix” or “crobar nyc”, while a typical non-geographic counterpart may contain the name of a good to buy online, such as “ellsworth kelly prints”. Also, while there are many navigational queries among the geo queries, a majority of these are searches for local or state government agencies. Many “open” geographic queries are searches for local media, news, or people. Such topical differences are not conveyed by the taxonomy of Rose and Levinson.

Next, we turn to the topical classification scheme used by Spink et al. [30], which also consists of eleven categories, listed in Section 2. Labelling the same set of geo and non-geo queries, we get the results shown in Figure 3.2.

Figure 3.2: Distribution of geo and non-geo queries according to the topicality classification of Spink et al.

We again see some obvious differences in several categories. Categories two and four are exclusively non-geographic: there were no queries asking for local pornography or local information about computers. Category 6 is dominated by geographic queries. There are frequent requests for local news and events, local government services, and weather. On the other hand, many non-geo queries were about celebrities and national news. In category 5 (science and medicine), there were many queries for local medical services, but unsurprisingly very little local physics or other sciences. Category 8 shows that much information about schools and education is sought at the local level, for all levels of education. The same applies to category 10; there are frequent searches for branches of local government and official forms and information (e.g., about zoning laws and taxes). But like the taxonomy of Rose and Levinson, Spink’s taxonomy also does not capture some important differences between geo and non-geo queries, which often occur within a category.

To address this, we propose a new query taxonomy for geographic queries that combines aspects of topicality and desired type of interaction. We came up with 23 categories as follows:
1. Tourism/Travel: hotels, maps, flights, transport, local attractions
2. Government: searches for government entities, info, and laws
3. Real Estate: houses, apartments, and commercial real estate
4. Education: requests for educational or school related information
5. Business: non-online business related searches, except when in another category
6. Night Life: including restaurants, entertainment, and casinos
7. Undirected: broad informational requests for a topic
8. Medical: hospitals, doctors, and general health and medical information
9. Media: news, radio, papers, magazines, and other media
10. Employment: searches seeking employment opportunities
11. Automotive: requests for automotive information and searches for automotive businesses
12. Civic: searches seeking civic, religious, and non-profit organizations
13. Closed: seeking an answer to a specific question
14. Obtain: seeking a specific document or resource that is useful on or off the computer
15. List: searches for a site which can provide further information. Seeking a hub rather than an authority
16. Advice: requests for advice or directions to complete a task
17. Downloads: requesting software or files to be downloaded to a user’s computer
18. Interactive: requesting pages which require further interaction in order to be useful
19. People: seeking individual people
20. Open: open ended questions or requests for information
21. e-Business: attempts to find an online retailer of a product or service
22. Entertainment: queries seeking to be entertained by the contents of a page. Including pornography and pictures
23. Navigational: requests clearly looking for a specific web site

We note here that this taxonomy is specifically designed to allow better understanding of geo queries, and in particular the first twelve classes capture common types of queries that we found in our trace. The distribution of geo and non-geo queries in this finer-grained, hybrid taxonomy is shown in Figure 3.3. As we see, geo queries focus on the first 13 categories, and are less frequent in the others (with the exception of category 20). While there is a significant number of commercial geo queries for hotels, restaurants, cafes, real estate, and local businesses, one interesting observation was the large number of local queries about government, civil organizations, education, and media that may not be well served by the current generation of geo search technology that is heavily focused on the former cases.

Figure 3.3: Distribution of geo and non-geo queries according to our hybrid classification
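One way to work with the hybrid taxonomy is to store its 23 class names in order and to normalize hand-assigned labels separately for geo and non-geo queries, as the distributions in Figure 3.3 do. The sketch below is illustrative only: the helper names and the toy labels are assumptions, and none of the values are taken from the AOL trace.

```python
from collections import Counter

# The 23 classes of the hybrid taxonomy, in the order listed in Section 3.3.
HYBRID_TAXONOMY = [
    "Tourism/Travel", "Government", "Real Estate", "Education", "Business",
    "Night Life", "Undirected", "Medical", "Media", "Employment",
    "Automotive", "Civic", "Closed", "Obtain", "List", "Advice",
    "Downloads", "Interactive", "People", "Open", "e-Business",
    "Entertainment", "Navigational",
]


def category_distribution(labels):
    """Normalize the labels of ONE query class so that the shares sum to 1.0."""
    counts = Counter(labels)
    unknown = set(counts) - set(HYBRID_TAXONOMY)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in HYBRID_TAXONOMY}


def share_in_first(dist, k=13):
    """Total share falling into the first k categories of the taxonomy."""
    return sum(dist[c] for c in HYBRID_TAXONOMY[:k])


# Hypothetical hand-assigned labels, for illustration only.
geo_labels = ["Tourism/Travel", "Government", "Night Life", "Open"]
non_geo_labels = ["Entertainment", "Downloads", "Navigational", "Open"]

geo_dist = category_distribution(geo_labels)
non_geo_dist = category_distribution(non_geo_labels)
```

Comparing `share_in_first(geo_dist)` with `share_in_first(non_geo_dist)` gives a single number for how strongly each class concentrates in categories 1 to 13; normalizing each class on its own keeps the comparison independent of how many geo and non-geo queries were labeled.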
4. QUERY CLASSIFICATION
The sample data set used in the previous section is of insufficient
size for many tasks. For example, making statements about frequently
appearing terms in geographic queries requires more information than
our sample set allows. Categorizing the entire AOL trace by hand
is, however, not feasible. Instead, we use the manually labeled sample
to bootstrap two classifiers. The first differentiates geographical
queries from those without geographic intent, while the second
classifies geographic queries roughly according to informational versus
navigational queries. As our experiments show, both classifiers are
sufficiently accurate, and thus they are subsequently used to classify
all 36 million queries.
The biggest challenge in geographic query classification comes from
ambiguous geographic terms. It is obvious to readers of the yellow
press that queries such as “Paris Hilton” do not commonly refer to
hotels in the capital of France. Similarly, “Cadillac” commonly targets
automobiles, not a city in Michigan. In order to disambiguate
queries containing these terms, we have to inspect their other terms.
Abbreviations of state names such as “CA” often indicate a geographic
meaning. This rule of thumb, however, does not apply to certain states
like “MD”, “LA”, or “OR”. Many such cases are hard to classify, even
for humans.

4.1 Geo Non-Geo Classification
This first classifier detects geographical queries in two stages. First,
a simple filter removes all queries without any geographic terms. In
other words, queries with no locality terms are classified as non-geo
queries; as shown earlier, this affects about 1% of all queries that are
geographic but have no city, county, or state name. After applying
this filter, we are left with queries falling into categories “geo with
geo terms” and “non-geo with geo terms”. These are then classified
according to the following features:

Property & Tourism: Does the query contain terms about properties
  or hotels? In particular: apartment, balcony, bath, bathroom, bed and
  breakfast, bedroom, building, condo, condominium, duplex, estate,
  flats, garage, home, hotel, house, inn, kitchen, lawn, lease, lodge,
  lodging, map, motel, property, real estate, realestate (sic), rental,
  renting, sublet, view, villa, waterfront, and their plural forms,
  e.g., apartments.
State: Does the query contain a state name, or its abbreviation?
State-Pos: The position of the state name from the end of the query;
  e.g., 0 if it is the rightmost term in the query. We notice that
  when a state name is included in a query, the state name often
  appears at the end of the query.
Ambiguous State-Abbreviation: Does one of the following state
  abbreviations appear as the only locality information in the query:
  “OH”, “OR”, “MD”, “AS” (American Samoa)? These abbreviations are
  often used in a non-geographic sense.
City: Does the query contain a city name?
County: Does the query contain a county name?
County-follow: If the answer is true for the previous question, is the
  county name followed by a word such as “county”, “village”, “co”,
  “borough”, etc.? People searching for a county or city often append
  such indicative terms.
State-follow: If a city or county term appears in a query, does the term
  occur directly before or after a state name? The city or county must
  be inside that particular state.
Place-Size: If a city or county term is found, how large is its
  population? If it is a very populous city or county in the US, it is
  most likely that the query searches for that city/county. On the other
  hand, a small city is the target of few search queries.
Geo-Web-Freq: If a city, county or state name is present, what is the
  frequency of this term in general Web documents?
Geo-Query-Freq: If a city, county or state name is present, what is
  the frequency of this term in general search queries?
Place-Person: If a city, county or state name is present, could this
  term also be a person’s first or last name? First and last names
  were obtained from the US Census Bureau.
Name-Place: If a city, county or state name is present, does this term
  appear prior to a last name or after a first name?

As shown in Table 3.1, there are actually more non-geographic queries
containing geo terms than there are geographic queries. In order
to produce a good classifier, we used training data consisting of 50%
geographic queries with geographic terms and 50% geographic queries
without geographic terms. In total, the training set consisted of
around 1,200 queries.
Utilizing the popular machine learning software Weka
(http://www.cs.waikato.ac.nz/ml/weka/), we evaluate our decision-tree
based classifier using ten-fold cross validation. About 90.69% of all
queries were correctly classified; see Table 4.1 for the results. Note
that this accuracy is measured on the already filtered data, i.e., the
classifier differentiates between geo and non-geo queries that both
contain geographic terms. If used on all queries, its accuracy would be
higher. Our classifier compares favorably to that of [10] in terms of
accuracy. After applying the classifier to the entire AOL log, around
13.39% of all queries are identified as having geographic intent.

Class      Precision   Recall   F-Measure
Non-Geo    0.911       0.899    0.905
Geo        0.903       0.915    0.909

Table 4.1: Accuracy of the Geo-NonGeo Classifier

4.2 Informational vs. Navigational Queries
It is not feasible to automatically classify geographic queries
according to any of the fine-grained taxonomies illustrated in Section 3.3.
From a user’s point of view there is a clear distinction between
navigational and resource queries. A user wants to either find a website, or
find a resource, e.g., buy something. However, the resulting queries often
look similar, and can even be identical. Consider a user investigating
the latest sportswear. She might search for “adidas”, a navigational
query to learn about available models. But a user intending to buy
shoes online might also enter “adidas” and then proceed to the online
store. This query now targets a resource; the query is the same, but
the user’s intention is very different. Thus, it is clearly not possible to
infer user intent from queries alone, even for a human classifier.
However, we can resort to a cruder taxonomy which is still meaningful and
that allows for automatic classification. We hence limit ourselves to
two simple categories, navigational and informational. The first
contains all queries that are navigational according to the definition of
Rose and Levinson, or that request a download. The second category
contains all other queries.
This classifier differs from the previous one in that it does not look at
the query terms, but instead looks at users’ click-through data. The
underlying assumption is that for a navigational query, a user only clicks on
a single result, as suggested in [17]. For an informational query, she
may instead follow several links. This hypothesis is captured by the
following two features used by our classifier (for a detailed explanation
of both features, see [17]):

Avg. number of clicks per query: This feature represents how many
  results a user clicks on after issuing a query. This number is
  averaged over all users who issued a particular query.
Click distribution: This feature is based on the intuition that most
  clicks resulting from a navigational query focus on a few popular
  URLs. The click distribution of a query is defined according
  to the number of times each distinct URL associated with the same
  query was clicked. We look at 5 measures of distribution:
  average, mean, standard deviation, skew, and kurtosis.

Additionally, we investigate:
Geo-URL: Does the clicked URL contain the name of a city, county
  or state?

The resulting classifier is reasonably accurate. Given a training set
of around 400 hand labeled queries distributed evenly between
informational and navigational, the classifier achieves an accuracy of
87.94%. Note that we only select queries with more than 10 clicks
to evaluate our classifier. If a user issued an identical query several
times and every time followed the same result, then we counted only a
single click. Table 4.2 shows the accuracy numbers for this classifier.

Class           Precision   Recall   F-Measure
Informational   0.85        0.951    0.898
Navigational    0.928       0.789    0.853

Table 4.2: Accuracy of Info-Navi Classifier

Query Granularity   Top-5 terms
city level          “hotel”, “beach”, “city”, “news”, “auto”
county level        “county”, “real estate”, “house”, “property”, “home”
state level         “jobs”, “lottery”, “sale”, “park”, “department”

Table 5.2: Top-5 query terms
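
The two click-based features of Section 4.2 (average number of clicks
per query and the click distribution) can be made concrete with a small
sketch. The Python below is only an illustration under stated
assumptions: the (user, query, url) record layout, the function name
click_features, and the use of population moments with excess kurtosis
are hypothetical choices, not taken from the paper or from [17]. As in
the evaluation above, a user who repeats a query and follows the same
result is counted as a single click.

```python
from collections import Counter, defaultdict
from math import sqrt


def click_features(click_log):
    """Per-query click features from (user, query, url) records.

    The record layout and the moment conventions are assumptions made
    for illustration only.
    """
    # A repeated (user, query, url) click is counted only once.
    unique_clicks = set(click_log)

    clicks_by_user = defaultdict(Counter)  # query -> user -> clicks
    clicks_by_url = defaultdict(Counter)   # query -> url -> clicks
    for user, query, url in unique_clicks:
        clicks_by_user[query][user] += 1
        clicks_by_url[query][url] += 1

    features = {}
    for query, per_user in clicks_by_user.items():
        # Feature 1: clicks per query, averaged over its users.
        avg_clicks = sum(per_user.values()) / len(per_user)

        # Feature 2: moments of the per-URL click counts.
        counts = list(clicks_by_url[query].values())
        n = len(counts)
        mean = sum(counts) / n
        std = sqrt(sum((c - mean) ** 2 for c in counts) / n)
        if std > 0:
            skew = sum((c - mean) ** 3 for c in counts) / (n * std ** 3)
            kurtosis = (
                sum((c - mean) ** 4 for c in counts) / (n * std ** 4) - 3.0
            )
        else:
            # Every clicked URL received the same number of clicks.
            skew = kurtosis = 0.0

        features[query] = {
            "avg_clicks": avg_clicks,
            "mean": mean,
            "std": std,
            "skew": skew,
            "kurtosis": kurtosis,
        }
    return features
```

Under the paper's hypothesis, a navigational query would tend to show
an average close to one click per user, with the clicks concentrated on
very few URLs; the resulting feature dictionary could then be fed to
any standard classifier.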

5. GEOGRAPHIC QUERY PROPERTIES
There are important differences between geo and non-geo queries;
users look for different “things” when searching locally than globally.
The classifiers presented in the previous section facilitate the study of
properties of geo queries on a large scale. First, we classify the
entire AOL trace into geo and non-geo queries. Then, we analyze term
frequencies for both types of queries. Finally, we explore the
distribution of geographic and non-geographic queries in different topical
categories as well as geographic distribution.

Term         Likelihood to appear in a geographic query
estate       81.61%
shores       81.59%
cemeteries   81.05%
appraiser    80.98%
lodging      80.79%

Table 5.3: Terms most likely to appear in geographic queries
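
The likelihood values of Table 5.3 are ratios of term counts: the
number of times a term appears in geographic queries divided by the
number of times it appears in the whole log, restricted to terms seen
more than 1000 times (Section 5.3). A minimal sketch of that
computation follows. The function name geo_term_likelihood, the
(query_text, is_geo) input layout, whitespace tokenization, and the
excluded parameter (standing in for the geo terms and stop words that
are not counted) are hypothetical choices for illustration, not the
authors' code.

```python
from collections import Counter


def geo_term_likelihood(queries, min_count=1000, excluded=frozenset()):
    """Fraction of each term's occurrences that fall in geo queries.

    queries: iterable of (query_text, is_geo) pairs, where is_geo is the
    label produced by a geo/non-geo classifier (assumed input layout).
    excluded: terms to skip, e.g. place names and stop words.
    """
    total = Counter()  # occurrences in the whole log
    geo = Counter()    # occurrences in geographic queries only
    for text, is_geo in queries:
        for term in text.lower().split():
            if term in excluded:
                continue
            total[term] += 1
            if is_geo:
                geo[term] += 1

    # Keep only terms appearing more than min_count times, which
    # reduces noise from infrequent terms.
    return {
        term: geo[term] / count
        for term, count in total.items()
        if count > min_count
    }
```

Sorting the returned dictionary by value in descending order and taking
the first five entries would give a ranking of the same form as Table
5.3.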

5.1 Frequent Terms
Table 5.1 outlines the five most frequent terms for geographic and
non-geographic queries, taken from the results of our automatic
classifier. Note that no geo terms (city, county, or state names) or stop
words are counted; this applies to all remaining sections. Unsurprisingly,
the most frequent terms in non-geographic queries are unrelated to
geography, while other terms are more likely to appear in geo queries than
in non-geo ones.

Query Type           Top-5 terms
non-geographic       “free”, “google”, “new”, “yahoo”, “pictures”
general geographic   “hotel(s)”, “sale”, “real estate”, “beach”, “home(s)”

Table 5.1: Top-5 query terms

5.2 Frequent Terms at Varying Granularity
Do geographic queries at different granularity (e.g., county vs. city)
address different information needs? This is indeed the case, as shown
in Table 5.2, which outlines the most frequent terms at different
granularities. (We note here that county vs. city is not just a different
granularity, but also often an indication of more rural or suburban versus
urban environments, complicating the picture a bit. City residents are
often more likely to refer to their location by city name rather than the
county the city is located in, which may have little relevance to them.)

5.3 Indicative Terms
Some terms of a non-geographic nature are more likely to appear in geo
queries than in non-geo queries, and vice versa. Table 5.3
displays the five terms that are most likely to be in a geo query.
This is computed as the number of times a term appears in geographic
queries divided by the number of instances in which the term appears
in the general query log. This could be used to further improve the
performance of our classifier. For example, the term “estate” is much
more likely to appear in a geo query. Here, we only take into account
query terms which appear more than 1000 times in the whole query
log, reducing noise induced by infrequent terms.

5.4 Geo Queries and Topical Categories
In Section 3, we showed that geo and non-geo queries focus on
different search topics. To explore this notion in the larger dataset, we
relate our queries to web sites covered by the Open Directory Project
(ODP). Thus, we assume that a query falls into some category iff the
clicked URL (i.e., website, since click-through data is provided on a
site level only) associated with this query is covered under that
category. We limit ourselves to the ODP top-level categories. For each
category, Figure 5.1 shows the number of geo and non-geo queries.

Figure 5.1: Query distribution over different topics

Note that we filter out duplicate query/click pairs from the same
user. A small portion of sites are covered by more than one category.
Of course, categories are not entirely exclusive. In particular, many
sites (e.g., a local football club) are commonly classified by location
(“regional”) as well as topic (“sports”). Obviously, the “regional”
category applies to a larger number of geographic queries. In order to
compare geo and non-geo queries in terms of their distribution over
topics, we removed the regional category and plotted the results again,
shown in Figure 5.2. We can see that geographical queries clearly tend
towards a few categories in ODP, such as Society and Sports. This
also includes a large number of clicks on pages of religious, civic, and
governmental sites.

5.5 Geo Query Distribution over US States
This section investigates how geo queries are distributed among
different states in the US. A geographic query includes at least one
location term, i.e., a city, county, or state name. We assign a state to
each query according to this term. In the case that only a city name
is found and is associated with more than one state, we associate this
query with the city having the largest population. For example, there
are more than five places named “Brooklyn” in the US, but we assign
“New York” as the state for any such query.
In our experiments, we look at the popularity of different states in
geographic queries. The five most popular states are: Florida,
California, Texas, New York, and Ohio. Combined, queries about those five
states account for 36.72% of all geographic queries in our data set. This
is not surprising as these are also very populous states. Also, people
show different interests for different states. For example, “Kids and
teens” is the most popular topic in both Florida and New York, while
the same topic is the least popular one in other states (possibly due
to the importance of tourism for these states). Detailed results on this
experiment are omitted for space reasons.

Figure 5.2: Query distribution, without “regional”

6. GEO PROPERTIES OF WEB SITES

6.1 Geo vs Non-Geo Sites
In the previous section, we investigated geo queries. In this
section, we extend our study to sites that are commonly associated with
such queries. In particular, we look at what sites are mostly visited
by clicking through on geo queries, and how such sites are distributed
over topics and associated with geo terms. Figure 6.1 divides all sites
receiving more than 10 clicks into ten bins. Bins are assigned
according to the fraction of these queries that were geo queries. Thus, the
first column on the left represents sites visited exclusively from geo
queries, while the rightmost column represents sites visited only from
non-geo queries. We can see that there is a strong bimodal behavior;
many sites are either mostly geo or mostly non-geo in nature when
characterized by the queries used to visit them. There is also a
reasonable number of sites, shown in columns 2 to 4, that have mostly
non-geo queries but also some geo queries; such sites may have some
limited amount of geographic information on their site, such
as a store location or company address.

Figure 6.1: Distribution of sites according to the queries that are
used to find them (x-axis: % of geo queries in each site, in ten bins;
y-axis: % of total sites)

Based on this, we define a geo site as a site where more than 80% of
its associated queries are geo queries. Those sites where more than 80% of
the associated queries are non-geographic in nature, we call non-geo
sites. Next, we look at the differences between geo and non-geo sites.

6.2 Geo Sites and Top-Level Domains
In Figure 6.2, we look at how geo and non-geo sites are distributed
among different top-level domains. We see that .gov and .org sites
are more often visited via geo queries, as such sites are more often
associated with local government and civil organizations.

Figure 6.2: Distribution of geo/non-geo queries for different top
level domains

6.3 Geo Sites and Topical Categories
Now we investigate the topical distribution of geo and non-geo
sites, using again the ODP hierarchy. Confirming our previous
findings, we see that geo sites are more likely to be associated with the
regional category. In fact, the vast majority of geo sites that were found
in ODP were in the regional category. This indicates that our way of
defining a geo site could in fact be used to identify good candidates
for the regional category. More detailed results are again omitted due
to space constraints.

Figure 6.3: Distribution of sites in different categories

6.4 Local vs National sites
Some sites seem to appear only in results for queries regarding a
particular area (say, “www.brooklynyoga.com” for Brooklyn), while
other sites are associated with geographic query terms from around
the country. Examples of such sites include “www.realtor.com” and
“travel.yahoo.com”. This tells us that some sites have a broad
geographic relevance while others provide a service only to a particular
area. In additional experiments, omitted for space reasons, we studied
the properties of such local versus nationwide sites. In summary, as
shown in this section, geo queries can be used to mine interesting facts
about the sites that are visited via those queries.

7. GEOGRAPHIC USER PROPERTIES
This section studies user behavior in connection with geographic
search tasks. Due to space constraints, we can only summarize some
of our observations. We focused on users with at least 200 geographic
queries, and then manually examined the users’ searching behavior,
looking at the following questions:
Do users repeatedly conduct searches on the same geographic area?
The answer is yes. Indeed, one could probably easily infer the
hometowns of many of these users from the geo terms in their queries, as
users exhibit a tendency to conduct searches for local services. The
non-geo terms associated with a user’s geo terms also reveal much of
a user’s relationship with an area. Thus, if terms such as “school”,
“yoga” or “real estate” tend to appear with geo terms, we have reason
to believe that the user lives nearby. On the other hand, terms like
“hotel” or “vacation” might indicate the user lives somewhere else.
Do people in a single session of querying reformulate their queries,
trying different names for the same area? That is, how frequent
is geo modification, as discussed in Section 2? Indeed, not too
often. There are different ways to define search sessions. Manually
checking the search history, we can identify instances when a person
changes the topic of a search, and thus define a user search session
as a series of queries on a similar topic over a continuous block of
time. This period can vary from several minutes to several days, as
long as a user stays focused on a topic. When people search for local
information or services, they are often fairly confident about the
appropriate geo terms. Thus, when users modify their queries, they
more often modify the non-geo terms. Users occasionally change the
geographic constraint present in the query while maintaining the
non-geographic portion of the information request. We found that in most
of these cases, the user is querying about a location away from their
likely home. The geographic terms are sometimes adjusted to point to
different parts of a city, since in some cases a tourist or traveler may be
flexible about where to go for a temporary stay. We note that the state
names show very strong consistency across a user’s search session.
How are user queries clustered locally? For a particular user, one
can derive their main geographical focus as the state or area addressed
by most of the geo queries of this user. This is likely the place of
residence of the user. Similarly, one can define secondary and further
clusters, potentially recent travel destinations of this user.

8. CONCLUSION
In this paper, we investigated geographic properties of search
queries. Though our main objective was to derive new techniques for
geographic search engines, we believe our observations are of general
interest. Our main contributions here are a more detailed study
of geographic search queries, a new taxonomy for such queries, and
experiments that relate such queries to the sites that are visited and the
users that pose them. We believe that with improved understanding
of users’ query goals and websites’ informational content, search
engines can take measures to improve response relevance. Due to space
constraints, we had to omit many details of our results.
There are many intriguing open questions left by our work. In
particular, we would like to explore additional properties of the web
sites associated with geographic queries, and of geographic search
sessions, and study how user behavior on geo queries (particularly
click-through data) can be harvested for better geographic search.

9. REFERENCES
[1] E. Agichtein and Z. Zheng. Identifying “best bet” Web search results by mining
past user behavior. In Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge
Discovery and Data Mining (KDD), pages 902–908, 2006.
[2] E. Amitay, N. Har’El, R. Sivan, and A. Soffer. Web-a-where: geotagging web
content. In Proc. of the 27th Ann. Int. ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 273–280, 2004.
[3] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query
log. In Proc. of the Sixth ACM SIGKDD Int. Conf. on Knowledge Discovery and
Data Mining (KDD), pages 407–416, 2000.
[4] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. Grossman, and O. Frieder. Hourly
analysis of a very large topically categorized Web query log. In Proc. of the 27th
Annual Int. ACM SIGIR Conf. on Research and Development in Information
Retrieval, pages 321–328, 2004.
[5] A. Broder. A taxonomy of Web search. SIGIR Forum, 36(2):3–10, 2002.
[6] O. Buyukkokten, J. Cho, H. Garcia-Molina, L. Gravano, and N. Shivakumar.
Exploiting Geographical Location Information of Web Pages. In 2nd Int. Workshop
on the Web and Databases (WebDB), pages 91–96, 1999.
[7] Y. Chen, T. Suel, and A. Markowetz. Efficient query processing in geographic web
search engines. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data,
pages 277–288, 2006.
[8] T. M. Delboni, K. A. V. Borges, and A. H. F. Laender. Geographic Web search
based on positioning expressions. In Proc. of the Workshop on Geographic
Information Retrieval, pages 61–64, 2005.
[9] J. Ding, L. Gravano, and N. Shivakumar. Computing geographical scopes of web
resources. In Proc. of the 26th Int. Conf. on Very Large Data Bases (VLDB), pages
545–556, 2000.
[10] L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein. Categorizing Web queries
according to geographical locality. In Proc. of the 12th Int. Conf. on Information
and Knowledge Management, pages 325–333, 2003.
[11] B. J. Jansen and U. Pooch. A review of Web searching studies and a framework for
future research. J. of the American Society for Information Science and Technology,
52(3):235–246, 2001.
[12] B. J. Jansen, A. Spink, and J. Pedersen. An analysis of multimedia searching on
AltaVista. In Proc. of the 5th ACM SIGMM Int. Workshop on Multimedia
Information Retrieval (MIR), pages 186–192, 2003.
[13] T. Joachims. Optimizing search engines using clickthrough data. In Proc. of the
eighth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD),
pages 133–142, 2002.
[14] C. B. Jones, A. I. Abdelmoty, D. Finch, G. Fu, and S. Vaid. The SPIRIT spatial
search engine: Architecture, ontologies and spatial indexing. In Proc. of the 3rd Int.
Conf. on Geographic Information Science, pages 125–139, 2004.
[15] M. Kamvar and S. Baluja. A large scale study of wireless search behavior: Google
mobile search. In Proc. of the SIGCHI conference on Human Factors in Computing
Systems, pages 701–709, 2006.
[16] R. R. Larson. Geographic information retrieval and spatial browsing. In L. Smith
and M. Gluck, editors, GIS and Libraries: Patrons, Maps and Spatial Information,
pages 81–124, 1996.
[17] U. Lee, Z. Liu, and J. Cho. Automatic identification of user goals in Web search. In
Proc. of the 14th Int. Conf. on the World Wide Web, pages 391–400, 2005.
[18] J. L. Leidner. Toponym resolution in text: Which sheffield is it? In Proc. of the 27th
Int. ACM SIGIR Conf. on Research and Development in Information Retrieval,
pages 602–602, 2004.
[19] D. Lewandowski. Query types and search topics of German Web search engine
users. Information Services and Use, 26:261–269, 2006.
[20] A. Markowetz, Y.-Y. Chen, T. Suel, X. Long, and B. Seeger. Design and
implementation of a geographic search engine. In 8th Int. Workshop on the Web and
Databases (WebDB), 2005.
[21] B. Martins, M. Silva, and L. Andrade. Indexing and ranking in GeoIR systems. In
Proc. of the 2. Int. Workshop on Geo-IR, 2005.
[22] K. McCurley. Geospatial mapping and navigation of the web. In Proc. of the 10th
Int. Conf. on the World Wide Web, pages 221–229, 2001.
[23] G. Mishne and M. de Rijke. A study of blog search. In Proc. of the European Conf.
on Information Retrieval, pages 289–301, 2006.
[24] Y. Morimoto, M. Aono, M. Houle, and K. McCurley. Extracting spatial knowledge
from the web. In Proc. of the Symp. on Applications and the Internet, pages
326–333, 2003.
[25] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In Proc. of the 1st
Int. Conf. on Scalable Information Systems, 2006.
[26] F. Radlinski and T. Joachims. Query chains: Learning to rank from implicit
feedback. In Proc. of the Eleventh ACM SIGKDD Int. Conf. on Knowledge
Discovery in Data Mining, pages 239–248, 2005.
[27] D. E. Rose and D. Levinson. Understanding user goals in Web search. In Proc. of
the 13th Int. Conf. on the World Wide Web, pages 13–19, 2004.
[28] T. Sanderson and J. Kohler. Analyzing geographic queries. In Proc. of the
Workshop on Geographic Information Retrieval, 2005.
[29] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large
Web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
[30] A. Spink, D. Wolfram, M. B. J. Jansen, and T. Saracevic. Searching the Web: the
public and their queries. J. of the American Society for Information Science and
Technology, 52(3):226–234, 2001.
[31] D. Stenmark. One week with a corporate search engine: A time-based analysis of
intranet information seeking. In Proc. of the Americas’ Conf. on Information
Systems, 2005.
[32] B. Tan, X. Shen, and C. Zhai. Mining long-term search history to improve search
accuracy. In Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery
and Data Mining (KDD), pages 718–723, 2006.
[33] Q. Tan, X. Chai, W. Ng, and D. Lee. Applying co-training to clickthrough data for
search engine adaptation. In Proc. of the 9th Int. Conf. on Database Systems for
Advanced Applications (DASFAA), 2004.
[34] J. Teevan, E. Adar, R. Jones, and M. Potts. History repeats itself: Repeat queries in
yahoo’s logs. In Proc. of the 29th Annual International ACM SIGIR Conf. on
Research and Development in Information Retrieval, pages 703–704, 2006.
[35] S. Vaid, C. B. Jones, H. Joho, and M. Sanderson. Spatio-textual indexing for
geographical search on the web. In Proc. of 9th Int. Symp. on Spatial and Temporal
Databases (SSTD), 2005.
[36] L. Wang, C. Wang, X. Xie, J. Forman, Y. Lu, W.-Y. Ma, and Y. Li. Detecting
dominant locations from search queries. In Proc. of the 28th Annual Int. ACM
SIGIR Conf. on Research and Development in Information Retrieval, 2005.
[37] V. Zhang, B. Rey, E. Stipp, and R. Jones. Geomodification in query rewriting. In
Proc. of the Workshop on Geographic Information Retrieval, 2006.
[38] Y. Zhou, X. Xie, C. Wang, Y. Gong, and W. Ma. Hybrid index structures for
location-based web search. In Proc. of the 14th ACM Int. Conf. on Information and
Knowledge Management, pages 155–162, 2005.
