ChatGPT As A Mapping Assistant
1 Florida International University, Miami, FL, USA
{ljuhasz,bguan}@fiu.edu
2 Department of Computer Science, Maynooth University, Co. Kildare, Ireland
[email protected]
3 Geomatics Sciences, University of Florida, Ft. Lauderdale, FL 33144, USA
[email protected]

Paper submitted to The Fourth Spatial Data Science Symposium #SDSS2023.
1 Introduction
Generative Artificial Intelligence (AI) is a type of AI that can produce various
types of content, including text, imagery, audio, code, and simulations. It has
gained enormous attention since the public release of ChatGPT in late 2022.
ChatGPT is an example of a Large Language Model (LLM), which is a form of
generative AI that produces human-like language. Since the launch of ChatGPT, interest in applying such models to domain-specific tasks has grown rapidly.
Fig. 1. OpenStreetMap roads and Mapillary images in the study area near Downtown
Miami
GeoAI has been part of the GIScience discourse in recent years. For example,
Janowicz et al. [7] elaborated on whether it was possible to develop an artificial
GIS analyst that passes a domain-specific Turing test. While this question is still largely unanswered, our study contributes early steps in this direction
by utilizing an LLM (ChatGPT) and a multimodal pre-training method (BLIP-
2) to connect visual and language information in the context of mapping. We
explore the larger question of whether generative AI is a useful tool for creating and enriching map databases and, more specifically, investigate how accurately ChatGPT can suggest OSM tags for roads based on descriptions of street-level images, and what level of detail and additional context improves these suggestions.
2 Study setup
2.1 Data sources and preparation
In OSM, geographic features are annotated with key-value pairs to assign the
correct feature category to them, a process called tagging [14]. For example,
roads are assigned a "highway"=<value> tag where <value> indicates a specific
road category, such as "residential" for a residential street.
OSM data is not homogeneous: individual users may perceive roads differently and therefore assign different "highway" values to the same type of road. A list of "highway" tag values was established to better describe the
meaning of each road category in OSM. Furthermore, the difference between
some road categories, e.g. “primary” and “secondary”, is administrative rather than visual. For example, a 2-lane road in rural areas could be considered primary, whereas a more heavily trafficked road in an urban environment might be categorized as secondary. To consider semantic
road categories rather than individual "highway" values as one of the evalua-
tion methods, "highway" tag values representing similar roads in our dataset
were grouped into four categories (Table 1).

Table 1. Grouping distinct "highway" tag values into semantically similar categories.
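To make this grouping concrete, the sketch below shows one way to encode such a mapping in Python. The category names and value assignments are illustrative assumptions only, since Table 1's exact groupings are not reproduced here.

# Hypothetical grouping of OSM "highway" values into semantic categories.
# The actual assignments are listed in Table 1; the names and memberships
# below are illustrative assumptions only.
SEMANTIC_GROUPS = {
    "major": ["motorway", "trunk", "primary"],
    "minor": ["secondary", "tertiary", "unclassified"],
    "residential": ["residential", "living_street"],
    "service": ["service", "track"],
}

# Reverse lookup: map an individual "highway" value to its semantic category.
VALUE_TO_GROUP = {
    value: group
    for group, values in SEMANTIC_GROUPS.items()
    for value in values
}

def semantic_category(highway_value):
    """Return the semantic road category for a raw "highway" tag value."""
    return VALUE_TO_GROUP.get(highway_value)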
Figure 2 shows the methodology to obtain OSM roads of interest with corre-
sponding Mapillary street-level images. First, all OSM roads with a "highway"=* tag were extracted within the study area. Then, short sections (<50 m), inaccessible roads, sidewalks along roadways, and roads without street-level photo
coverage were excluded. Retained OSM roads were matched with corresponding
Mapillary photographs, so that each road segment would have at least one representative Mapillary image. Lastly, a list of objects detected in the corresponding image was also extracted from the Mapillary API. These inputs were further used as described in Section 2.3.
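The paper does not specify the extraction tooling, so the following is a minimal sketch that queries the public Overpass API directly and applies the exclusion filters described above. The bounding box, the access-tag heuristic, and the sidewalk check are placeholder assumptions.

from math import radians, sin, cos, asin, sqrt
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"
BBOX = "25.76,-80.20,25.78,-80.18"  # south,west,north,east (placeholder, not the study area)

# Fetch all ways with any "highway" tag, including their geometry.
query = f'[out:json]; way["highway"]({BBOX}); out geom;'
ways = requests.post(OVERPASS_URL, data={"data": query}).json()["elements"]

def length_m(geometry):
    """Approximate way length in meters (haversine over consecutive vertices)."""
    total = 0.0
    for a, b in zip(geometry, geometry[1:]):
        lat1, lon1, lat2, lon2 = map(
            radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
        h = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        total += 2 * 6371000 * asin(sqrt(h))
    return total

def keep(way):
    """Apply the exclusions from Section 2.1 (simplified heuristics)."""
    tags = way.get("tags", {})
    if tags.get("footway") == "sidewalk":           # sidewalks along roadways
        return False
    if tags.get("access") in ("no", "private"):     # inaccessible roads
        return False
    return length_m(way.get("geometry", [])) >= 50  # short sections (<50 m)

roads = [w for w in ways if keep(w)]

Matching each retained road with its nearest Mapillary image and retrieving detected objects would then follow via the Mapillary API; that step is omitted here.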
2.2 Resources
AI tools and models We utilize GPT-3.5-turbo, an advanced language model developed by OpenAI. It is an upgraded version of GPT-3, designed to offer improved performance and capabilities while retaining the large-scale architecture of its predecessor, enabling it to generate coherent and contextually relevant text [15]. GPT-3.5-turbo serves as a powerful tool for natural language
processing, content generation, and other language-related applications. In our
study it is used to suggest OSM tagging based on pre-constructed prompts using
the content of street-level images. The model was accessed through the OpenAI
API.
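A minimal sketch of this step, assuming the pre-v1.0 openai Python client that was current in 2023 (the paper does not show its request code, and the temperature setting below is an assumption):

import openai  # pre-v1.0 client interface (assumption)

openai.api_key = "sk-..."  # placeholder key

def suggest_tags(prompt):
    """Send a pre-constructed prompt to GPT-3.5-turbo and return the raw reply."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # assumption: the paper's sampling settings are not stated
    )
    return response["choices"][0]["message"]["content"]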
BLIP-2 [11] is a state-of-the-art, scalable multimodal pre-training method designed to equip LLMs with the capability to understand images while keeping their parameters entirely frozen. This approach leverages frozen pre-trained unimodal models and a proposed Querying Transformer (Q-Former), sequentially pre-trained for vision-language representation learning and vision-to-language generative learning.
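For reference, BLIP-2 can be run off the shelf via Hugging Face transformers. The checkpoint choice and image path below are assumptions, as the paper does not name the exact variant used.

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

MODEL = "Salesforce/blip2-opt-2.7b"  # assumed checkpoint
processor = Blip2Processor.from_pretrained(MODEL)
model = Blip2ForConditionalGeneration.from_pretrained(
    MODEL, torch_dtype=torch.float16).to("cuda")

image = Image.open("mapillary_image.jpg")  # placeholder path

# Captioning: an empty text prompt yields an unconditional caption.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
caption = processor.batch_decode(
    model.generate(**inputs), skip_special_tokens=True)[0].strip()

# Visual Q&A: the same model answers free-form questions about the image.
question = "Question: how many lanes does the road have? Answer:"
inputs = processor(images=image, text=question,
                   return_tensors="pt").to("cuda", torch.float16)
answer = processor.batch_decode(
    model.generate(**inputs), skip_special_tokens=True)[0].strip()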
Analysts Three analysts were tasked to describe the visual content of street-
level images (captioning), and to answer a few questions regarding the image
content (visual Q&A). Two (human) analysts were undergraduate students at
Florida International University with previous GIS coursework. BLIP-2 was used
to perform the same task as human analysts, and its responses were recorded as
the third (artificial) analyst. Analysts were deliberately not given any guidelines
as to how to describe images so that their answers would not be biased by prior
knowledge about OSM and mapping. Table 2 lists questions and tasks performed
by analysts.
Analysts’ answers differ in their level of detail. For example, BLIP-2’s and
Analyst #2’s captions were significantly shorter on average (9 and 11 words, re-
spectively) than Analyst #1’s (37 words). BLIP-2’s responses were also found to
be more generic (e.g. "a city street with tall buildings in the background")
than human analysts’. This allows us to explore the effect of providing increasing
detail on tag suggestion accuracy.
2.3 Methodology

Figure 3 shows the methodology for suggesting tags for an OSM road. For each
retained road in the area, the corresponding Mapillary images were shown to an-
alysts described in Section 2.2. Analysts created an image caption and answered
simple questions as described in Table 2. These responses in combination with
additional context were used to build prompts for an LLM to suggest OSM tags.
To explore what influences the accuracy of suggested tags, a series of prompts
were developed that differ in the level of detail that is presented to the LLM.
All prompts start with the following message that provides context and in-
structs the model about the expected output format.
Based on the following context that was derived from a street-level pho-
tograph showing the street, recommend the most suitable tagging for an
OpenStreetMap highway feature. Omit the ’oneway’ and ’lit’ tags if the
answer to the corresponding questions is no or N/A. Format your sug-
gested key-value pairs as a JSON. Your response should only contain this
JSON.
Finally, Object detection and locational context (OD + LC) are combined into a new scenario that supplies both types of additional context to the language model.
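Putting the scenarios together, prompt assembly might look as sketched below. The field labels and the exact wording of the added context are assumptions, since only the shared preamble is quoted above.

PREAMBLE = (
    "Based on the following context that was derived from a street-level "
    "photograph showing the street, recommend the most suitable tagging for "
    "an OpenStreetMap highway feature. Omit the 'oneway' and 'lit' tags if "
    "the answer to the corresponding questions is no or N/A. Format your "
    "suggested key-value pairs as a JSON. Your response should only contain "
    "this JSON."
)

def build_prompt(caption, qa_answers, detected_objects=None, location=None):
    """Assemble the prompt for one of the four scenarios:
    baseline, OD, LC, or OD + LC."""
    parts = [PREAMBLE, f"Description: {caption}"]
    parts += [f"{question} {answer}" for question, answer in qa_answers.items()]
    if detected_objects:  # OD scenario: objects from the Mapillary API
        parts.append("Objects detected in the image: " + ", ".join(detected_objects))
    if location:          # LC scenario: locational context
        parts.append(f"The street is located {location}.")
    return "\n".join(parts)

# OD + LC scenario for one road:
# prompt = build_prompt(caption, answers, detected_objects=objects,
#                       location="near Downtown Miami")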
The last step in the process is to supply the prompts described above to GPT-3.5-turbo (referred to as ChatGPT for simplicity). The model responds with a JSON document containing the suggested OSM tagging for the roadway, e.g. {"highway": "primary", "lanes": 3}, which can be compared to the original OSM tags of the same roadway.
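A sketch of recording one suggestion, reusing suggest_tags and build_prompt from the sketches above; the fallback for replies that wrap the JSON in extra text is a defensive assumption, not a step described in the paper.

import json

def parse_suggestion(raw):
    """Parse ChatGPT's JSON reply into a tag dictionary."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: extract the first {...} span if extra text surrounds it.
        return json.loads(raw[raw.find("{"): raw.rfind("}") + 1])

suggested = parse_suggestion(suggest_tags(prompt))
# e.g. {"highway": "primary", "lanes": 3}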
The final dataset contains 94 OSM highway features and their original tags. For the four scenarios and three analysts described above, ChatGPT's recommendations based on the corresponding prompts were also recorded for each roadway, resulting in 12 tagging suggestions per road. These suggestions are then compared to the original OSM tags to assess the accuracy of a particular scenario and analyst.
3 Results
3.1 Accuracy of suggesting road categories
Table 3 lists the correctness of ChatGPT-suggested road categories based on two different methods. First, we consider the historical "highway" values of an OSM road. A suggestion was considered correct if the current or any previous version of the corresponding OSM highway value matched ChatGPT’s suggested tag.
This step takes into account differences in how individual mappers may perceive
road features (e.g. primary vs. secondary). The second method is based on se-
mantic road categories listed in Table 1. Considering groups of roads as opposed
to individual "highway" values mitigates the fact that OSM tagging often fol-
lows administrative roles that are difficult to infer from photographs. Table 3
reports the accuracy of individual analysts across the four scenarios as well as
the average correctness for analysts (values at the bottom of the table) and for scenarios (values in different rows).
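The two evaluation methods translate into simple checks. The sketch below assumes each road's "highway" edit history is available (e.g. retrieved from the OSM history API, a detail not covered above) and reuses the VALUE_TO_GROUP lookup sketched in Section 2.1.

def correct_by_history(suggested_value, highway_history):
    """Method 1: correct if the suggestion matches the current or any
    previous "highway" value of the OSM way."""
    return suggested_value in highway_history

def correct_by_group(suggested_value, current_value, value_to_group):
    """Method 2: correct if both values fall into the same semantic
    road category (Table 1)."""
    group = value_to_group.get(suggested_value)
    return group is not None and group == value_to_group.get(current_value)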
BLIP-2 achieved the lowest accuracy among the three analysts, followed by Analysts #2 and #1, respectively. This mirrors the level of detail with which analysts described photographs, suggesting that, in general, more detailed image captions may lead to more accurate tag suggestions by ChatGPT.
This is further supported by the average accuracy achieved in different sce-
narios. The baseline scenario, which used prompts based purely on the visual description of street-level photographs, achieved a suggestion accuracy of 30-40% on average across the three analysts. Providing additional context in other scenarios increased this accuracy. Additional locational context, i.e. specifying that the roads are located near Downtown Miami (LC scenario), increased suggestion accuracy by 4.3-8.2% on average, depending on the evaluation method. This can potentially be explained by regional differences in OSM tagging practices.
Future research will also focus on adding and refining traffic-related information in road maps, e.g. speed limits, biking infrastructure, and turn restrictions.
While experiments like this are useful initial steps, we urge the GIScience community to go beyond simply applying AI in geographic contexts and to focus on synergistic research that advances both the spatial sciences and AI research (see e.g. [12,6]). There are multiple potential extensions of this work along this idea
that go beyond the case study presented in this paper. For example, exploration
of a multimodal conversational AI agent for spatial data science is a promising
research direction. In theory, incorporating a spatial understanding component
in a multimodal AI system allows it to comprehend and analyze geospatial data.
This could result in a method for the AI to interpret and interact with geospatial
data, similar to how BLIP-2 enables language models to understand images.
Future research should focus on exploring the potential of this integration, and
on deepening our understanding of the theoretical and practical aspects of this
fusion. This, in turn, will advance the field and lay the groundwork
for future innovations in comprehensive multimodal AI systems for geospatial
science.
References
1. Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z.,
Yu, T., Chung, W., Do, Q.V., Xu, Y., Fung, P.: A Multitask, Multilingual, Mul-
timodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
(Feb 2023). https://doi.org/10.48550/arXiv.2302.04023, http://arxiv.org/abs/
2302.04023, arXiv:2302.04023 [cs]
2. Dwivedi, Y.K., Kshetri, N., Hughes, L., Slade, E.L., Jeyaraj, A., Kar, A.K.,
Baabdullah, A.M., Koohang, A., Raghavan, V., Ahuja, M., Albanna, H., Al-
bashrawi, M.A., Al-Busaidi, A.S., Balakrishnan, J., Barlette, Y., Basu, S., Bose, I.,
Brooks, L., Buhalis, D., Carter, L., Chowdhury, S., Crick, T., Cunningham, S.W.,
Davies, G.H., Davison, R.M., Dé, R., Dennehy, D., Duan, Y., Dubey, R., Dwivedi,
R., Edwards, J.S., Flavián, C., Gauld, R., Grover, V., Hu, M.C., Janssen, M.,
Jones, P., Junglas, I., Khorana, S., Kraus, S., Larsen, K.R., Latreille, P., Laumer,
S., Malik, F.T., Mardani, A., Mariani, M., Mithas, S., Mogaji, E., Nord, J.H.,
O’Connor, S., Okumus, F., Pagani, M., Pandey, N., Papagiannidis, S., Pappas,
I.O., Pathak, N., Pries-Heje, J., Raman, R., Rana, N.P., Rehm, S.V., Ribeiro-
Navarrete, S., Richter, A., Rowe, F., Sarker, S., Stahl, B.C., Tiwari, M.K., van
der Aalst, W., Venkatesh, V., Viglia, G., Wade, M., Walton, P., Wirtz, J., Wright,
R.: “So What if ChatGPT Wrote It?” Multidisciplinary Perspectives on Opportunities, Challenges and Implications of Generative Conversational AI for Research, Practice and Policy. International Journal of Information Management 71, 102642 (2023)
15. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training Language Models to Follow Instructions with Human Feedback (Mar 2022). https://doi.org/10.48550/arXiv.2203.02155, http://arxiv.org/abs/2203.02155, arXiv:2203.02155 [cs]
16. Reynolds, L., McDonell, K.: Prompt Programming for Large Language Mod-
els: Beyond the Few-Shot Paradigm. In: Extended Abstracts of the 2021
CHI Conference on Human Factors in Computing Systems. pp. 1–7. CHI
EA ’21, Association for Computing Machinery, New York, NY, USA (May
2021). https://doi.org/10.1145/3411763.3451760, https://dl.acm.org/doi/10.
1145/3411763.3451760
17. Wang, S., Scells, H., Koopman, B., Zuccon, G.: Can ChatGPT Write a
Good Boolean Query for Systematic Review Literature Search? (Feb 2023).
https://doi.org/10.48550/arXiv.2302.03495, http://arxiv.org/abs/2302.03495,
arXiv:2302.03495 [cs]