Available online at www.sciencedirect.com

Journal of Retailing 100 (2024) 130–147

www.elsevier.com/locate/jretai

Unlocking the Potential of Web Data for Retailing Research ✩

Jonne Y. Guyt a,∗, Hannes Datta b, Johannes Boegershausen c
a Amsterdam Business School, University of Amsterdam, the Netherlands
b Tilburg School of Economics and Management, Tilburg University, the Netherlands
c Rotterdam School of Management, Erasmus University Rotterdam, the Netherlands

Available online 4 March 2024

Abstract
Web data collected via web scraping and application programming interfaces (APIs) has opened many new avenues for retail innovations
and research opportunities. Yet, despite the abundance of online data on retailers, brands, products, and consumers, its use in retailing research
remains limited. To spur the increased use of web data, we aim to achieve three goals. First, we review existing retailing applications using
web data. Second, we demystify the use of web data by discussing its value in the context of existing retail data sets and to-be-constructed
primary web datasets. Third, we provide a hands-on guide to help retailing researchers incorporate web data collection into their research
routines. Our paper is accompanied by a mock-up digital retail store (music-to-scrape.org) that researchers and students can use to learn to
collect web data using web scraping and APIs.
© 2024 The Author(s). Published by Elsevier Inc. on behalf of New York University.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

Keywords: Web data; Web scraping; APIs; Retailing; Retail; AI.

Introduction

The retailing landscape and the scope of retailing research are evolving (Gielens and Roggeveen 2023), largely spurred by technological progress. Indeed, the internet has revolutionized access to and the exchange of information on products and services, empowered the creation of novel business models, and led to significant retail innovations, generating many novel research areas (Ratchford et al. 2022). A benefit of the digitization of retailing is the large-scale availability of web data on consumers, brands, retailers, and markets. Web data, defined as any data source publicly available on the internet and shown on digital devices (Boegershausen et al. 2022), is uniquely positioned to aid researchers in tackling relevant and novel questions.

However, web data usage has seen a limited uptake in retailing studies compared to other major marketing fields. To illustrate, fewer than thirty articles in the Journal of Retailing in the last 25 years have used web data. Most applications have been geared towards analyzing reviews and (product) recommendations. Almost all articles using web data focus on textual and numeric data, although various other data types exist. Despite large-scale geographic availability, most studies primarily use US data. Finally, most researchers using web data relied on web scraping instead of application programming interfaces (APIs).1

Our article identifies idiosyncratic challenges in retailing-based research contributing to the slow uptake of web data. Publicly available web data offers retailing researchers numerous opportunities to augment traditional data sources or to compile novel datasets on trends in the evolving retail sector. Yet, the relative richness of existing proprietary datasets (such as NielsenIQ’s Consumer Panel Data and Retail Scanner Data,

✩ All authors contributed equally. We thank participants at the Special Session of the Retailing SIG at EMAC 2023 for comments on an earlier version. Research support from Tilburg Science Hub (Thierry Lahaije) and tilburg.ai (Marijn Bransen, Jonas Klein) is gratefully acknowledged. This work was supported by the Marketing Science Institute (grant #4000678).
∗ Corresponding author. E-mail address: [email protected] (J.Y. Guyt).
1 Web scraping is defined as the collection of data by downloading the HTML web page and extracting elements of interest. Application Programming Interface (API) offers programmatic access to company information (see Web Appendix A in Boegershausen et al. 2022).

https://doi.org/10.1016/j.jretai.2024.02.002
0022-4359/© 2024 The Author(s). Published by Elsevier Inc. on behalf of New York University. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
J.Y. Guyt, H. Datta and J. Boegershausen Journal of Retailing 100 (2024) 130–147

GfK’s ConsumerScan, Kantar’s Worldpanel, or firms’ loyalty program data), in addition to primary data collection (i.e., surveys and experiments) has diminished the attractiveness of such alternative sources. Existing, well-established proprietary datasets typically offer depth in some dimensions (e.g., sales metrics for established brands and retailers for many years) but often are difficult or costly to obtain and have limited information in other dimensions (e.g., product metadata, tracking of emerging retail formats). In addition, it is unclear how to combine web data with traditional datasets, which often cover past behavior and are released with a significant time lag. Overcoming these challenges requires researchers to align data in time and accurately match many products, consumers, or retailers.

This article intends to demystify the use of web data for academic retailing researchers.2 With this goal in mind, we first substantiate the significance of web data compared to existing datasets by reviewing studies using web data published in the Journal of Retailing. Building on Boegershausen et al. (2022), we coded key characteristics and data sources from this body of work, highlighting the untapped potential of web data in retail research. We discuss scraping methods (web scraping vs. APIs), data sources scraped, the geographic coverage and the type of data, and highlight papers engaging in longitudinal data collection.

Second, we distill key themes that can facilitate future applications in retailing research using web data. We identify underutilized web sources and applications and explain how researchers can use them to (i) improve measures, (ii) increase the diversity of retailing research, (iii) overcome limitations of other methods, and (iv) study emerging retail formats and trends.

Third, to further ease the adoption of web scraping for retailing researchers, educators, and students, we have developed a mock-up digital platform (https://music-to-scrape.org) and offer R code to learn web scraping in a controlled environment. Our guide zooms in on the critical stages of extracting data and building the shells around it: looping (e.g., to extract many products), scheduling (e.g., to build longitudinal datasets), and the infrastructure (e.g., to collect data remotely).

We conclude with reflections on important data and source selection issues in the study of retail phenomena. Our discussion on the idiosyncratic retailing research considerations includes the usage of web data aggregators, archival versions of websites, and guidance on what data to collect to facilitate the inclusion of control variables and matching with other sources. Finally, we provide an outlook charting pathways for applying generative AI (Gen AI)/large language models (LLM) and big-team science (Forscher et al. 2023) to empower researchers to kickstart web data collections. In what follows, we provide an overview of how studies published in the Journal of Retailing have used web data, identify themes that can be used to maximize the benefits of web data for retailing research, provide an introduction to music-to-scrape.org and the framework for collecting web data, and discuss data source selection, LLMs and Gen AI, and big team science. We conclude in Section 6.

Web data usage in retailing research

To better understand how web scraping and APIs have been used in retail research, we closely inspect where and how they have been used in the Journal of Retailing. We follow Boegershausen et al. (2022) and identify 28 articles. We depict the scattered evolution of this research area in Fig. 1 (see Appendix A for a list of articles and collection details). Next, we leverage our coding to show how and from where researchers have gathered which information to explore retailing research questions and phenomena. Interestingly, most of the topics explored with web data remain on the periphery rather than the “core” retailing research settings and questions (Gielens and Roggeveen 2023).

Use of web scraping vs. APIs

Retailing researchers rely primarily on web scraping (68%) and, to a lesser extent, on APIs (14%) to extract web data. The extraction approach for the remaining articles was either manual or unclear. Additionally, two articles reused existing web data sets (i.e., the Yelp Academic Dataset and SNAP library at Stanford University). As in the marketing discipline at large (Boegershausen et al. 2022), the most widely used data sources in retailing research are Amazon (n = 4) and Google Trends (n = 4). For example, Pan and Zhang (2011) collected 41,405 Amazon.com reviews from three categories (consumer electronics, software, and healthcare products) to explore which factors make user reviews more helpful. Besides e-commerce websites and search engines, researchers have also gathered from auction sites (e.g., eBay, n = 5), movie sites (e.g., RottenTomatoes, n = 5), and online review platforms (e.g., Yelp, n = 2).

Data sources

Half of all articles using web data relied on a single source (n = 14). Only two articles leveraged data from many web sources (i.e., ten or more sources). First, Meiseberg (2016) complemented her focal dataset of scraped book reviews from Amazon’s German store with other web data from various German book retailers and authors’ homepages. Second, for an early exploration of the nascent e-commerce market, Venkatesan, Mehta and Bapna (2007) gathered 22,209 price quotes for 1,880 products from the websites of 233 different retailers to examine how retailer characteristics (e.g., service quality) interact with market characteristics (e.g., competitive intensity) in shaping online price dispersion.

2 Web scraping is also used in practice. For example, in The Netherlands, hiiper (https://hiiper.nl/) offers “current scraping data from e-commerce websites” as part of their consulting services to Dutch retailers. Other companies like Zyte (https://zyte.com) offer commercial web scraping infrastructures for corporate clients.


Fig. 1. Web data in the Journal of Retailing. Notes: Evolution of published articles in the Journal of Retailing that use web data in absolute (bars) and relative
(line) numbers. For a list of articles, see Appendix A.

Geographic coverage

Despite the broad accessibility of diverse sources covering retailers and markets worldwide, web scraping research in retailing is highly geographically concentrated. Half of all articles used only US data (n = 14), such as data from Amazon.com (e.g., Pan and Zhang 2011) or biddingfortravel.com (e.g., Joo, Mazumdar and Raj 2012). In addition, most scraped datasets rely on sources in the English language (n = 22, 79%). The remaining articles feature a combination of English and Chinese-language sources (n = 2), Chinese-language sources (n = 2), and German-language sources (n = 2). We did not find any articles drawing on sources in the third- to sixth-most-spoken languages in the world (i.e., Spanish, Hindi, Standard Arabic, and French, adding up to 1.5 billion speakers). We also did not identify a single article examining web data from retailers and markets in South America or Africa. The overrepresentation of US and English-language sources is at odds with the potential of web data to make research more diverse and less WEIRD (i.e., Western, educated, industrialized, rich, and democratic; Henrich, Heine and Norenzayan 2010b). Even if the textual data is restricted to English due to dictionary-based text analyses, only a single article leverages data from retailers in non-WEIRD countries (i.e., English-language reviews of the Dubai Mall; Joy et al. 2023).

Data types

Extant retailing research using web data has focused primarily on textual and numeric data (n = 26, 93%), typically studying subjects like product sales (e.g., movie box office performance) and online reviews (e.g., review texts, helpfulness votes). Only two articles collected other types of data: Kübler et al. (2024) scraped 97,997 product reviews and the corresponding images from Amazon.com to explore under which conditions images posted by users boost the helpfulness of consumer reviews. Figueiredo, Larsen and Bean (2021) collected images from the review platform Yelp to enrich a qualitative dataset about celebrity chef Marcus Samuelsson and his Red Rooster Harlem restaurant. We did not identify any articles collecting video or audio data.

Longitudinal data collections

Only a handful of articles (n = 5, 18%) extracted data from one or more sources multiple times. An illustrative example of such a dataset is Zhao, Zhao and Deng (2016), who investigated online gray markets for branded products. To explore sellers’ and buyers’ behavior in gray markets, the authors built a panel dataset based on automatically extracted data about counterfeit Coach handbags from Taobao.com once per week over more than eight months. To match products between sources, Zhao et al. leveraged the Coach style numbers from the official Coach websites (i.e., coach.com and china.coach.com).

Maximizing the benefits of web data in retailing research

Web data can empower retailing researchers to provide better answers to existing research questions or study entirely new research questions (George et al. 2016). There are numerous retailing research topics for which web data could play a critical role, including showrooming, omnichannel marketing, augmented and virtual reality, artificial intelligence and chatbots, spatial computing, dynamic pricing, the sharing economy, subscriptions, and third-party digital platforms. Importantly, as the scope of retailing research widens (Gielens and Roggeveen 2023), embracing web scraping as a


part of the methodological toolkit will be essential for retailing researchers given that the traditional, mostly proprietary, and often expensive datasets rarely contain information about these phenomena.

Next, we discuss how retailing researchers can maximize the value of web data by outlining how web data (i) can be used to improve existing measures, (ii) can expand the geographic coverage and diversity of retailing research, (iii) allows researchers to study topics that are hard to study otherwise, and (iv) facilitates the examination of emerging retailing phenomena in a timely fashion.

Web data to collect better measures of existing phenomena

We first outline some ‘quick wins.’ These quick wins come in the form of better data (e.g., more granularity, additional control variables) and provide researchers with better measures and information about the generalizability of findings. Numerous APIs allow for the collection of better control variables. For example, in examining cross-national variations in market response across the Indo-Pacific Rim region, such as price and distribution elasticities, Datta et al. (2022) enrich a GfK sales dataset with data about different national holidays using the HolidayAPI. Leveraging this API is particularly practical as the dates of many national holidays (particularly in this region) shift yearly. Several other similar APIs may enable retailing researchers to augment existing datasets without major time alignment issues. For example, researchers might require sub-national level data (e.g., state, county) about economic activity. However, such data from official sources is often only available with significant time lags. Thus, researchers might draw on APIs like 505economics.com that offer geospatial insights to proxy economic activity (see also Chen and Nordhaus 2011).

Another valuable type of web data to collect is complementary marketing mix information. Consider research using retail scanner data. A recent review of 493 studies by Lu et al. (2023) suggests that most studies (63%) use datasets that contain only actual prices rather than regular and discount prices. As such, researchers only observe prices paid, with limited or no information about regular prices and discounts, resulting in biased elasticities. Heuristics and complex methodological solutions have been used previously to infer the regular and discount price from the evolution of the actual price (Fok et al. 2006; Geyskens, Gielens and Gijsbrechts 2010; Lu et al. 2023) but all contain mismeasurements that introduce biases (Lu et al. 2023). While, at times, the data obtained is used as control variables (e.g., Geyskens, Gielens and Gijsbrechts 2010 focus on the introduction of private labels on brand choice), it also plays a central role in other studies (e.g., Guyt and Gijsbrechts 2018 focus on the impact of promotions and discounts). Hence, web data, where retailers typically indicate the regular and discount price, can complement and disambiguate information, resulting in better (control and focal) variables.

Expanding the diversity and geographical coverage of retailing research

Web scraping provides access to data on diverse populations of consumers and markets worldwide (Kosinski et al. 2016). Web data enables researchers to move beyond the typical Western, educated, industrialized, rich, and democratic samples often used in marketing and retailing research. Studying diverse populations is important, given that most consumers and retailers are located outside WEIRD markets (Henrich, Heine and Norenzayan 2010a). Leveraging web data from diverse geographical locations allows researchers to examine consumers, retailers, marketplaces, and platforms exposed to different competitive dynamics. Extending the geographic coverage of retailing research can further increase confidence in the generalizability of research findings (Maner 2016; Rad, Martingano and Ginges 2018) and their impact on retailing practice worldwide. For example, how retailers respond to policy changes such as soda or ‘health’ taxes may differ depending on contextual factors. For example, several South American countries have implemented progressive taxes. Notably, Colombia implemented a tax on “ultra-processed foods” recently, which will be implemented in a stepwise manner (Sanchez 2023). Web data can inform how retailers react to such new measures in non-WEIRD markets.

Web data can also boost the diversity of retailing formats studied in established markets such as North America and Europe. Specifically, in an era of increased geographic mobility and cultural diversity, becoming an entrepreneur remains an important starting point for many immigrants to the US and Europe (Peñaloza and Gilly 1999). Yet, despite an estimated market size of approximately USD 50B in the US (IBIS World 2023), the challenges of ethnic retailers have received scant attention in retailing research. For example, researchers can study strategies allowing these retailers to effectively serve diverse tastes of their own ethnic and non-ethnic customer bases. Likewise, researchers could study how immigrant entrepreneurs’ experience abroad spills over to retailing practices in their homelands (Balachandran and Hernandez 2021).

Leveraging web data to overcome limitations of established methods

A crucial underutilized benefit of web scraping is that it lets researchers examine phenomena unobtrusively, which is difficult with more established methods. Because researchers collect information about behaviors after they occurred naturally (Hoover et al. 2018), web data typically avoids many common challenges in studying such phenomena with experiments or surveys (e.g., social desirability concerns). For instance, Chen and Berger (2013) collected data from an online forum to examine how controversy influences participation in online discussions. Web scraping also allows researchers to record behaviors retailers prefer not to disclose, such as the usage of tracking tools on their websites (Trusov, Ma and


Jamal 2016), engagement in illicit behaviors such as affiliate fraud (Edelman and Brandi 2015), or how business activities cause adverse societal outcomes such as noise that bothers local residents (Ozer, Greenwood and Gopal 2024). Web data could also be deployed to study marketplaces and “retailers” hidden from the public eye (e.g., on the Dark Web; Thomaz and Hulland 2021).

In light of the enormous amount and diversity of data available compared to conventional data, web data is also ideally positioned to study relatively narrow categories (e.g., extremely unusual consumer groups; Bright 2017) and collect the diverse data (e.g., audio, video) necessary to explore the increasingly multimodal communication of retailers across various digital platforms (Grewal et al. 2022). The digital footprints left by retailers and consumers create an enormous volume of data not only in terms of the total number of cases but also in terms of the number and frequency of traces of a single actor (e.g., one consumer) over time (Matz and Netzer 2017; Adjerid and Kelley 2018). Researchers can use this data to construct panels capturing actors’ behavior over time as a function of variables of theoretical interest (e.g., Moore 2012) or examine how effects unfold over time (e.g., Datta, Knox and Bronnenberg 2018). The real-time nature of online data can allow researchers to study consumer behavior at high granularity, such as in seconds, minutes, hours, or days, something difficult to accomplish with experiments or surveys.

Next to deductive retailing research, web data provides enormous potential for research focused on inductive theory-building. While various qualitative approaches, such as netnography (Kozinets 2002), leverage web data, these approaches typically rely on manual web data extraction. Few netnographic studies in marketing have leveraged web scraping or APIs (e.g., Arvidsson and Caliandro 2016). However, rich consumer and corporate narratives on blogs and access to online communities from idiosyncratic samples can be fruitful bases for generating novel and relevant retailing theories (see also Figueiredo, Larsen and Bean 2021).

Exploring emerging retail formats and trends

Finally, web data allows researchers to study nascent marketing and societal phenomena (Boegershausen et al. 2022). This is particularly relevant for retailing researchers, given the profound digital transformation (Verhoef et al. 2021), the emergence of new players and retail formats not (yet) covered by traditional data providers, and increasing (looming) regulation. Over the last two decades, many of the most disruptive retailing trends have emerged online, from e-commerce marketplaces (e.g., Amazon) to ride-hailing services (e.g., Uber). Established retailers and brands encounter numerous challenges from a new class of digital-first competitors relying on direct-to-consumer and consumer-to-consumer business models (Gielens and Steenkamp 2019; Muller 2020).

Consider, for example, the emergence of SHEIN, an ultra-fast fashion retailer that has significantly disrupted various major retailing categories. With approximately 200 million downloads, the Singapore-based Chinese e-commerce retailer was the most downloaded shopping app in the world in 2022 (Curry 2024). Its aggressive growth during the pandemic catapulted its total revenue to a level on par with major legacy fast-fashion retailers such as Inditex and H&M. SHEIN’s business model is based on a rapidly changing assortment of fashion items priced at very low unit prices (e.g., $4 shirts and $6 dresses). Their “real-time” approach to assortment management (i.e., rapid prototyping and production) offers a unique opportunity to explore the evolving retailing landscape.

On the consumer side, researchers could study how the brand has built a cult following online based on a TikTok trend called “SHEIN hauls,” wherein consumers buy many individual items and broadcast their purchases to their audience. By 2021, such videos had attracted over 2.5 billion views (Gan 2021). Researchers could leverage the TikTok Research API to explore what makes such videos engaging as well as what factors shape the outcomes of these videos for brands (e.g., brand attitudes) and content creators (e.g., new followers).

From a societal perspective, policymakers and non-governmental organizations have sounded the alarm about the sustainability and environmental impact of fast-fashion players such as SHEIN. Web data could play a critical role in quantifying the environmental impact (e.g., via supply chain practices and environmental waste produced by end consumers) of an emerging class of ultra-cheap retailers such as SHEIN and other marketplaces (e.g., TEMU) that heavily rely on steeply discounted goods (for similar explorations of societal outcomes of marketing strategies and business models, see van Lin et al. 2023; Ozer, Greenwood and Gopal 2024; Hsu and Kovács 2021). More generally, policymakers have shown increased interest in regulating retailing to promote Sustainable Development Goals (SDGs) as well as consumer welfare. Examples are the implementation of health-related and environmental initiatives to foster responsible retailing, such as soda taxes and bottle bills (Keller and Guyt 2023a; Keller, Guyt and Grewal 2023; Seiler, Tuchman and Yao 2021), or evaluating the determinants of consumer demand in light of platform regulation (e.g., Pachali and Datta 2024). Web data can either inform policy or document how looming or applied regulation affects the behavior of retailers either directly (e.g., pricing decisions) or indirectly (e.g., assortment, recommendations).

Another major retailing trend over the last decade is the emergence and growth of retail formats empowering small sellers to reach many consumers, such as the handmade and vintage goods marketplace Etsy. Marketplaces like Etsy allow researchers to explore more personalized, small-scale forms of retailing (Schnurr et al. 2022; Fuchs et al. 2022). Web data scraped on Etsy, for example, can help researchers enhance the ecological validity of their research by complementing experimental studies with field data or creating a rich set of real-life stimuli (e.g., sellers’ pages) that can be used in experimental studies (Boegershausen et al. 2022). Gathering many real-world stimuli facilitates the creation of more comprehensive stimulus-sampling paradigms. In these designs, participants are exposed to multiple instances of each experimental manipulation (e.g., various profiles of service providers that vary along variables of theoretical interest; Howe and Monin 2017). Stimulus-sampling paradigms boost the generalizability of effects and reduce the risk that an effect is driven by idiosyncratic features of certain experimental stimuli, such as the wording on an Etsy seller’s page shown to study participants (Judd, Westfall and Kenny 2017).

These cases illustrate how web scraping can be used to study emerging phenomena quickly without relying on corporate partners (e.g., Walmart) or syndicated services (e.g., NielsenIQ). But these cases are only a sample of a much larger set of possibilities. The list of applications of web data is wide-ranging, from using Weedmaps3 to study dynamics in the cannabis retailing industry after the legalization of cannabis (e.g., Hsu, Koçak and Kovács 2018) to collecting price data from the Ethereum blockchain to explore the factors driving consumers’ evaluation of non-fungible tokens (e.g., Hofstetter, Fritze and Lamberton 2024).

Getting started with web scraping and APIs

This section focuses on how to get started with collecting web data. Specifically, we develop a practical guide featuring four essential steps: (1) data extraction and storage (e.g., which data points to extract), (2) looping to collect data for multiple units (e.g., extracting information for many products), (3) scheduling the extraction (e.g., to run weekly), and (4) deciding on which infrastructure the data collection runs (e.g., a local computer or the cloud). In Fig. 2, we depict the data collection as a nested process in which additional layers (e.g., looping) are built around already developed parts (e.g., data extraction).4 Next, we discuss the key considerations for each stage and a range of commonly used tools.

Fig. 2. A Nested Approach for Developing Web Data Collections. Notes: The figure depicts the four crucial steps for extracting web data. Researchers typically start by directly extracting data and building a loop to automate the process for multiple units (e.g., products or users). Researchers then schedule the data extraction (e.g., hourly, weekly), and finally make infrastructure decisions regarding where to run the data collection and how to store the data during and after the research project.

Building a web scraper for music-to-scrape.org

To guide novices in collecting web data, we developed a mock-up retailing platform called music-to-scrape.org (https://music-to-scrape.org).5 Like real-life retailing platforms, music-to-scrape.org has a desktop and mobile version and offers data via web scraping and APIs. Learning how to extract data from music-to-scrape.org’s various subpages (e.g., landing page, user profile page, artist or song page) and endpoints (e.g., as documented in the platform’s API) will equip retailing researchers with the versatile skill set required to collect real-life data from the web (e.g., tracking promoted products from the landing page at walmart.com or repeatedly extracting prices and reviews from product pages at Amazon). Fig. 3 shows a screenshot of music-to-scrape.org.

In what follows, we present an exemplary web scraper for extracting song metadata (here, a song’s number of plays) from music-to-scrape.org using R, the open-source programming language for statistical computing. An exemplary code snippet for data extraction using the platform’s API is provided in Appendix B. More tutorials and code snippets are available at https://music-to-scrape.org/. Readers with experience in designing web data studies can skip this subsection.

Step 1: Data extraction

Researchers must first connect with the online data source, locate desired information on the website, and save that information.6 For example, assume a researcher is interested in obtaining song metadata. They can begin by visiting the song page at https://music-to-scrape.org. While exploring this website using a web browser, the researcher can not only find details such as the song name and the artist’s name (e.g., “Is It You” by “Lee Ritenour”) but also observe that the song’s ID is included in the website’s URL (https://music-to-scrape.org/song?song-id=SOHMZNL12A58A8001A). This ID serves as a means to programmatically access this information.

We next use the R package rvest to connect to the site (lines 2 and 10 in Table 1A). Subsequently, we extract information on the song’s number of plays using the data point’s

4 The process applies to both data collection types: web scraping and APIs. Not all process steps are necessary for every data collection: for example, to only capture data from the landing page using web scraping, one does not need step 2 (e.g., looping over categories) and can proceed directly to steps 3-4 (scheduling and infrastructure).
5 Inspired by Zyte’s https://books.toscrape.com project, music-to-scrape.org is a dynamic platform based on thousands of users’ simulated music listening behavior. As an open-source project, the research community can extend music-to-scrape.org at https://github.com/tilburgsciencehub/music-to-scrape.
6 Connecting to the data source can be done through various ways: down-
3 An efficient way to collect data from Weedmaps.com is an “undocu- loading “web data” (such as reading the HTML code of a website or down-
mented” API. Undocumented APIs are typically publicly accessible but come loading particular files), browser emulation (such as remotely controlling a
without the documentation designed to facilitate the adoption of regular (doc- browser that can be instructed to click, scroll, and capture data), and phone
umented) APIs. emulation (particularly useful for capturing app data).

135
J.Y. Guyt, H. Datta and J. Boegershausen Journal of Retailing 100 (2024) 130–147

Fig. 3. Screenshot of music-to-scrape.org. Notes: Screenshot from https://music- to- scrape.org depicting the platform’s landing page with dynamic and simulated
data suited to learning how to extract data using web scraping and APIs.

Table 1A
R Code for Step 1 (Data Extraction).

Readers can access the code at https:// osf.io/ bk7e9/ , which may be updated over time for enhancements or fixes.

136
J.Y. Guyt, H. Datta and J. Boegershausen Journal of Retailing 100 (2024) 130–147

Table 1B
R Code for Step 2 (Looping).

Readers can access the code at https:// osf.io/ bk7e9/ , which may be updated over time for enhancements or fixes.

unique “address” on the website (lines 13–17).7 For this, we ers). Researchers can make use of so-called loops, which re-
use so-called selectors, which we have identified using the peatedly execute a set of instructions (e.g., “extract the song
browser’s “inspect mode” by hovering over individual ele- title and artist name”) a specified number of times or for a
ments of the website.8 Finally, researchers save the data. Here, specific list of items (e.g., “extract artist names for all songs
we save the song ID and corresponding data point and store in the soul category”). Hence, we extend the previous exam-
it in a table (lines 20–23). The easiest is to write it into CSV ple by assuming access to a list of song IDs to capture the
(Comma Separated Value) files with columns for each vari- play count information repeatedly.9 We list these song IDs in
able and rows for each observation. Line 25 of the code in Table 1B. To facilitate repeat execu-
tion of the data extraction of step 1, we “wrap” that code
Step 2: Looping in a function. Line 29 starts the loop: for each song ID in
In most web scraping projects, researchers seek to capture the list of song IDs, the function extract_data() is executed
information for multiple units (e.g., many products or retail- (lines 6–22). In other words, in our example, the extraction
is executed for each song ID in our list – i.e., five times.
7 Web scraping works with any data displayed in a browser, i.e., it is
Finally, we use a timer to reduce the data extraction speed
not restricted to text-only information. For example, researchers can capture
to 1 second per page (see Line 32), ensuring that the website
audio files, video content, or images. is not contacted more than necessary. Throttling requests by
8 Website creators typically assign styles (e.g., font, size) and functionality limiting the number of requests to a website in a given time-
(e.g., an action when clicked) to specific elements (e.g., a button or a cus- frame is critical when collecting data from real-life platforms
tomer review) on a website. To do so, they use classes and attributes in the to prevent server overload and avoid being blocked. Respect-
HTML code of a website (“selectors”). These selectors, in turn, can be used
to extract desired data from a website. One useful tool to identify specific
elements is the SelectorGadget (https:// selectorgadget.com/ ). We note several 9 We can also use a web scraper to capture this information. For our ex-

types of selectors can be used (e.g., CSS selectors, Xpaths, class names, and ample, a search for “love” in the search bar of music-to-scrape.org yielded
HTML attributes and attribute-value pairs; see, e.g., Mitchell 2018). a list of many songs. We present five of these IDs in the example.

137
J.Y. Guyt, H. Datta and J. Boegershausen Journal of Retailing 100 (2024) 130–147

Table 1C
Process of Setting Up Scheduled Tasks in Step 3 (Scheduling).

A. Preparation for Scheduling

• Save R script from Step 2.
• Test script by running it manually to make sure it works without errors
• Consider adding functionality in your R script to log output or errors, which can be useful for troubleshooting if your script is running automatically.

B. Starting the Scheduler

C. Removing Scheduled Tasks

• Use taskscheduler_ls() to list all scheduled tasks. • Use cron_ls() to see all scheduled jobs.
• If you want to remove a task, use • If you want to remove a task, use
taskscheduler_delete(taskname = "My cron_rm("my_script_id")
Task")

Readers can access the code at https:// osf.io/ bk7e9/ , which may be updated over time for enhancements or fixes.

ing extraction limits, or excluding certain sections that are our example, we assume we would like to extract information
flagged by the firm as not-to-be-scraped10 are vital for ethical every hour.
data collection, fostering good relations with website admin- While data extraction can be timed in R, it is better
istrators, and ensuring sustained access to the data source. to schedule the data extraction at a “higher level,” i.e., at
the operating system level, to ensure stability. Tools are
operating-system-specific (Task Scheduler for Windows and
Step 3: Scheduling
cron for Mac/Linux). In addition, researchers can consider
The web scraper designed in steps 1 and 2 extracts data
setting up monitoring systems to track the performance of
for multiple songs. Researchers can use scheduling if seeking
scrapers and notify researchers of successes, failures, or
to build a longitudinal data collection–i.e., capturing infor-
data anomalies. For example, if a scraper fails to run at its
mation for a set of units multiple times over longer periods
scheduled time, an alert can be triggered (e.g., an email),
(e.g., weeks, months). The extraction timing should be pri-
allowing the researcher to investigate and resolve the issue
marily motivated by the frequency of the focal phenomenon
promptly. Table 1C provides example code for scheduling on
(see Boegershausen et al. 2022). For example, checking for
Windows and Mac/Linux operating systems.
a retailer’s opening times every day may not be required be-
cause such information is not updated daily. Product stocks,
in turn, can be captured multiple times a day to see when Step 4: Infrastructure
retailers replenish stocks or whether actual stockouts occur. As web scraping projects grow in size and complexity, the
This step can be skipped for single-shot data extractions. In need for scalable infrastructure becomes more pressing and
critical. Starting with a single computer may be adequate for
small-scale, exploratory projects, but larger endeavors may
10 For example, firms use a small text file, named robots.txt, which cod-
require moving to cloud services. The infrastructure of a web
ifies which sections of a website can be extracted, at which interval
(Boegershausen et al. 2022). Recognizing the need for ethical web scraping,
scraping project generally consists of one or more computers
firms have recently co-founded the Ethical Web Data Collection Initiative running the extraction software (steps 1–3) and an attached
(https://ethicalwebdata.com). storage space (e.g., database or filesystem).
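To make the extraction, looping, and throttling steps concrete, together with the robots.txt check mentioned in the footnote on ethical scraping, the following is a minimal sketch rather than the authors' implementation. Their code is written in R with rvest and is available at the OSF link; this sketch uses only the Python standard library. The CSS class name `plays`, the `USER_AGENT` string, and all function names are hypothetical choices for illustration, and the real markup of music-to-scrape.org may differ.

```python
import csv
import time
import urllib.request
from html.parser import HTMLParser
from urllib import robotparser

BASE_URL = "https://music-to-scrape.org"
USER_AGENT = "research-scraper-example"  # hypothetical name for this sketch


class ClassTextParser(HTMLParser):
    """Collect the text of the first element carrying a given CSS class.

    Simplification: assumes the element contains no nested tag of the same name.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False
        self._open_tag = None  # tag name while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if not self.found and self.css_class in classes:
            self.found = True
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            self.chunks.append(data)


def extract_plays(html, css_class="plays"):
    """Step 1: locate the data point via a selector (class name is assumed)."""
    parser = ClassTextParser(css_class)
    parser.feed(html)
    digits = "".join(ch for ch in "".join(parser.chunks) if ch.isdigit())
    return int(digits) if digits else None


def fetch(url):
    """Step 1: connect to the data source and download the page's HTML."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def robots_checker(robots_txt_lines):
    """Build a function that says whether robots.txt permits a given URL."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt_lines)
    return lambda url: rules.can_fetch(USER_AGENT, url)


def collect(song_ids, fetch_page=fetch, allowed=lambda url: True, delay=1.0):
    """Step 2: loop over song IDs, pausing `delay` seconds after each request."""
    rows = []
    for song_id in song_ids:
        url = f"{BASE_URL}/song?song-id={song_id}"
        if allowed(url):
            rows.append({"song_id": song_id, "plays": extract_plays(fetch_page(url))})
            time.sleep(delay)  # throttle so the site is not contacted too often
    return rows


def save_csv(rows, path):
    """Step 1 (storage): one column per variable, one row per observation."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["song_id", "plays"])
        writer.writeheader()
        writer.writerows(rows)
```

For a live run, the robots.txt lines would first be downloaded, for example with `fetch(BASE_URL + "/robots.txt").splitlines()`, passed to `robots_checker`, and the resulting function handed to `collect` as `allowed`. Passing the page downloader as a parameter keeps the parsing logic testable without contacting the website.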


Table 2
Software Stacks for Web Scraping.

 | Code-based scrapers | Non-code based scrapers | LLM-based scrapers
Project Type | Durable, long-term data collection; full control over the process and highly customizable | Durable, long-term data collections; web data collections without much customization | Prototyping, one-time data collection
Supported Steps | | |
1) Extraction | √ | √ | √
2) Looping | √ | √ | Limited
3) Scheduling | √ | √ | ✕
4) Infrastructure | Customizable (e.g., databases, files) | Limited | Limited
Tools | R (rvest, RSelenium), Python (BeautifulSoup, Scrapy, Selenium), Task Scheduler (Windows), cron (Mac/Linux) | Often commercial tools (e.g., Octoparse, Import.io, ParseHub, WebHarvy) | ChatGPT, Bard
Cost | Low – Moderate (only development and infrastructure) | Moderate – High (costs typically depend on volume of data collected) | Low – Moderate (subscription required for more advanced versions)

Choice of computation infrastructure. Researchers can execute their code locally or in the cloud. For instance, the previous example was run locally on a researcher’s computer. However, running the scraper on a personal computer may not be practical if data needs to be collected repeatedly over many weeks or months. In these circumstances, a server at a research institution or commercial cloud infrastructure may be more suitable. For self-programmed data collection, renting computers from cloud providers like Amazon Web Services, Microsoft Azure, or Google Cloud is an option. These providers offer preconfigured images with pay-by-the-hour flexibility. While low-powered systems often suffice, costs can rise quickly with heavier usage.
Choice of storage. In our current example, data is stored locally, risking data loss for longer data collections. Remote databases in the cloud are preferable for larger projects. Data security and privacy are important, particularly when dealing with personal or sensitive information (for extensive discussion and solutions, see Boegershausen et al. 2022). Databases offer the additional benefits of maintaining metadata about the collection, enabling multiple computers to collectively collect data, or facilitating logging and monitoring to safeguard quality.

Commonly used tools

While our example leverages R, the steps described can be implemented using different types of tools: code-based, non-code-based commercial tools, and LLM (Large Language Model)-based approaches. Each category of tools aligns with the steps of extraction, looping, scheduling, and infrastructure.
In code-based tools, technologies such as R (employing such packages as rvest and RSelenium) and Python (using such libraries as BeautifulSoup, Scrapy, and selenium) are often used and are free of cost. On the other hand, non-code-based commercial tools such as Octoparse, Import.io, ParseHub, and WebHarvy offer a more user-friendly, “worry-free” solution to web scraping where costs typically depend on the scale of the data collection. Although they might be pricier, they provide a hassle-free environment, especially for users who prefer a more straightforward approach. LLM-based tools such as ChatGPT and similar models are emerging as innovative web scraping and data collection solutions. While they currently offer limited functionality in looping and scheduling, they excel in prototyping one-time data collections. More recent and advanced models require a subscription. Table 2 compares different software tools for web scraping projects.

Novel opportunities and reflections

The main goal of our article is to encourage more retailing researchers to consider how they can incorporate web data into their research. Thus, we conclude by offering a reflection on how to address retailing-specific challenges in collecting web data and provide an outlook on leveraging generative AI and Big Team Science to jumpstart web data usage across the discipline.

Selecting retailing data and sources

Both historical and future web data can be useful for retailing research. While historical data may be harder to obtain, we outline three ways to leverage web data. Specifically, we propose two strategies to collect data from past periods: (1) draw from nonprimary data providers such as aggregators and (2) leverage archival versions of the target websites. Finally, we also outline the critical role of (3) collection design to effectively match web data with other data.
Surveying and extracting from aggregators. Dedicated external parties (i.e., nonprimary data providers) may have routinely captured (part of) the data of interest. For example, HeissePreisse (https://heisse-preise.io), created by an Austrian developer to monitor food prices daily, contains pricing data on products sold at the larger Austrian, German, and Slovenian retailers. Importantly, the data is easily retrievable and contains a historical overview since 2017.¹¹ Similarly, Tweakers’ Pricewatch (https://tweakers.net/pricewatch/) is an aggregator that tracks average and minimum prices of an exhaustive list of consumer electronics in more than 3,000 shops in The Netherlands, often dating back to the time when the product was launched. We encourage researchers to explore these and similar aggregators covering many product categories and geographic regions.
Leveraging archival versions of web data. The ‘Wayback Machine’ (https://archive.org/web/), part of the Internet Archive, is among the most popular tools available to researchers who want to travel back to a static version of a website. The Wayback Machine allows researchers to query for a link and check whether historical snapshots of websites are documented. The availability of such snapshots is largely driven by public demand, but researchers can also save pages for future use.
Collection design. By anticipating research questions and agendas, researchers vastly increase the options to leverage publicly available web data. We delineate two distinct philosophies on web data collection: a targeted versus comprehensive data collection approach. In targeted data collection, researchers focus on the data required to answer a specific research question. For example, should a researcher require information on the availability, variety, and price data of e-cigarettes before and after new legislation is introduced, a programmatic effort can be made to collect this data. In contrast, a researcher adopting a comprehensive approach collects all data that may facilitate exploring multiple research questions. For example, researchers can focus on retailers or platforms and navigate to the specific retailer’s website (e.g., Walmart.com). When adopting this approach, researchers store the entire web page rather than a specific element (as opposed to the approach in Step 1 of “Building a Web Scraper”). This approach simplifies the steps outlined in Fig. 2 at the expense of storing more data.¹² The web scraper could follow all first-degree links (i.e., any HTML links found on the landing page) and store these pages.
The two philosophies provide notably different advantages to researchers. We juxtapose the two philosophies in Appendix C using research process criteria (e.g., data coverage, flexibility to shift research interest, and ability to create additional control variables) and technical criteria (e.g., resource intensity, robustness, and ease of matching). The ability to match to other data sources (e.g., NielsenIQ or GfK data) is particularly relevant to retailing research. To facilitate matching, we recommend collecting a great variety of data related to the focal data of interest. In Appendix C, we elaborate on the type of data to collect (e.g., EANs, brand names, flavors, but also visuals) and methods (e.g., deterministic, fuzzy matching, usage of internal search functions of web sources, and Generative AI) to match web data and other archival datasets.

Using Gen AI/LLMs for web scraping

Besides matching, generative AI (GenAI) and large language models (LLMs) offer many opportunities for web scraping (see also Krosnick and Oney 2023). Their application extends to several key areas: coding, data discovery, enrichment, and analysis support. We briefly discuss some of the most promising areas of GenAI deployment.
Coding. In the context of coding, setting up a basic web scraper is often straightforward, but scaling it for reliable, long-term operation presents a significant challenge. Here, specialized large language models (such as GPT) can be invaluable. They can assist researchers in navigating complex HTML code to identify relevant tags or ensuring efficient scheduling for web scraping tasks. This support is crucial for developing scraping solutions that require advanced coding skills. For instance, a GPT could help figure out how to write the initial code in different programming languages to retrieve specific elements from a website.
Regarding data discovery and enrichment, LLMs can expand the scope and depth of web scraping in retail research. These models can assist in identifying diverse and relevant datasets or websites, thus preventing researchers from relying solely on popular or familiar sources (Boegershausen et al. 2022). This feature is particularly beneficial for exploring data from countries with which a researcher may not be familiar. For data enhancement, LLMs can automate the execution of complex prompts across various data units. A practical example is the analysis of newspaper articles focusing on specific retailers. Here, an LLM can systematically identify retailer names within articles, facilitating the creation of a comprehensive retailer database. LLMs can also be instrumental in linking data across different databases, like matching unique product IDs, or performing tasks like data imputation. This automation enhances the richness of the collected data and streamlines the research process.
LLMs can contribute significantly to the analysis phase by restructuring data or performing exploratory analyses. However, this aspect of LLMs in web scraping requires further exploration and development to fully reach its potential and ensure its effective integration into the research workflow. Appendix D includes example prompts illustrating how GenAI can help with coding, data discovery and enrichment, and analysis support.

Collecting data in big teams

Most web data collection efforts are ad-hoc, project-specific, and led by one or a few researchers, leading to constraints in time and product coverage. This approach limits the robustness of data collection efforts and hinders reuse in other projects. We propose a big-team science approach to data collection to overcome these limitations. Big-team science involves extensive collaboration across various research groups, institutions, disciplines, cultures, and continents. This approach has been increasingly adopted in other scientific fields, such as psychology, to address generalizability, selection, and computational reproducibility challenges (Forscher et al. 2023).
For retail data collection, one of the foremost challenges is enhancing scalability and ensuring long-term operation. Pooling resources benefits researchers in several key areas: (a) more comprehensive exploration of promising web data sources, (b) development of more robust coding solutions, (c) continuous operation of web scraping along with vigilant monitoring of data quality, and (d) effective dissemination and accessibility of the data sets for download, which includes comprehensive documentation enabling other researchers to use these rich datasets (e.g., Gebru et al. 2020). To kickstart such an approach, we recommend following the steps outlined above, focusing on scalability (i.e., multiple collections concurrently) and distribution of expertise (e.g., researchers with innovative ideas to collect data and technically trained software engineers to implement the projects).
We hope researchers explore collaborative data collection and dissemination to create datasets with widely shared documentation and source code for public reuse. These initiatives could also set benchmarks for meaningful reporting and evaluation. Academic journals should consider inviting registered-report-like dataset submissions to incentivize researchers to pursue ambitious data collection efforts (see also Cavallo and Rigobon 2016). Our Appendix E contains exemplary web data projects to be tackled via big-team science.

Conclusion

The continued growth of online commerce and rapidly evolving consumer behavior drive the retail industry’s digital transformation. Retailing researchers can embrace these forces by using web scraping and APIs to collect novel datasets capturing these emerging phenomena. To facilitate a broader adoption of web data across the entire retailing discipline, we provide resources to get started and offer hitherto missing guidance on overcoming challenges that have inhibited a broader adoption of web scraping in retailing. Web data offers unprecedented data on consumers, retailers, and markets. We hope our article encourages researchers to leverage web data to investigate crucial retailing questions and phenomena.

¹¹ The Heisse-Preise project was initiated by a disgruntled developer in an attempt to provide insights into pricing trends and the alleged dearth of competition in the Austrian market. The web scraper uses the APIs of retailers to collect data. As a result of the project, the Austrian government focused on creating a legal framework in which retailers of a certain size need to make available and standardize information regarding a product’s price and other details via APIs. The code is freely available on GitHub and can be ported to different countries, which we revisit in “Collection design.”
¹² One trade-off that researchers can make ahead of time is to exclude saving any images, vastly reducing the disk space required but foregoing the possibility of visual analytics at a later point.

Appendix A. Overview of Journal of Retailing articles using web data

To identify articles in the Journal of Retailing using web data, we follow the approach of Boegershausen et al. (2022). Specifically, our initial search comprised various terms describing the process of collecting web data (e.g., scrap*, crawl*, Application Programming Interface) as well as the names of specific retailers and platforms (e.g., Yelp, TripAdvisor, Twitter, TikTok). We iteratively expanded the list of search terms based on our inspection of the initial articles discovered with the search terms (e.g., adding additional sources like “BoxOfficeMojo” or “Baidu”) (Table A1).
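As an illustration of how truncated search terms such as scrap* and crawl* can be applied to article text, the sketch below turns each term into a case-insensitive regular expression. It is a hypothetical reconstruction, not the authors' search procedure: the term list is only the handful of examples named above, and the function names are invented for this example.

```python
import re

# Example terms taken from the text; the authors' full list is longer.
TERMS = ["scrap*", "crawl*", "application programming interface", "Yelp", "TripAdvisor"]


def compile_term(term):
    """Turn a term with an optional trailing * into a whole-word, case-insensitive regex."""
    stem = re.escape(term.rstrip("*"))
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(rf"\b{stem}{suffix}\b", re.IGNORECASE)


PATTERNS = {term: compile_term(term) for term in TERMS}


def matched_terms(text):
    """Return the search terms that occur in `text`, in the order of TERMS."""
    return [term for term, pattern in PATTERNS.items() if pattern.search(text)]
```

A truncated stem also matches unrelated words (scrap* matches "scrapbook"), so hits from such a search are candidates that still require manual inspection.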

Table A1
Articles published in Journal of Retailing using web data.

Author (year) | Title
Tang and Xing (2001) | Will the growth of multi-channel retailing diminish the pricing efficiency of the web?
Suter and Hardesty (2005) | Maximizing earnings and price fairness perceptions in online consumer-to-consumer auctions
Gopal et al. (2006) | From Fatwallet to eBay: An investigation of online deal-forums and sales promotions
Venkatesan, Mehta and Bapna (2007) | Do market characteristics impact the relationship between retailer characteristics and online prices?
Duan, Gu and Whinston (2008) | The dynamics of online word-of-mouth and product sales - An empirical investigation of the movie industry
Popkowski Leszczyc, Qiu and He (2009) | Empirical Testing of the Reference-Price Effect of Buy-Now Prices in Internet Auctions
Aggarwal, Vaidyanathan and Venkatesh (2009) | Using Lexical Semantic Analysis to Derive Online Brand Positions: An Application to Retail Marketing Research
Hu and Wang (2010) | Country-of-Origin Premiums for Retailers in International Trades: Evidence from eBay’s International Markets
Pan and Zhang (2011) | Born Unequal: A Study of the Helpfulness of User-Generated Product Reviews
Joo, Mazumdar and Raj (2012) | Bidding Strategies and Consumer Savings in NYOP Auctions
Fay, Xie and Feng (2015) | The Effect of Probabilistic Selling on the Optimal Product Mix
Wang, Liu and Fang (2015) | User Reviews Variance, Critic Reviews Variance, and Product Sales: An Exploration of Customer Breadth and Depth Effects
Moon and Song (2015) | The Roles of Cultural Elements in International Retailing of Cultural Products: An Application to the Motion Picture Industry
Nejad, Amini and Babakus (2015) | Success Factors in Product Seeding: The Role of Homophily
Gong, Smith and Telang (2015) | Substitution or Promotion? The Impact of Price Discounts on Cross-Channel Sales of Digital Movies
Wu and Lee (2016) | Limited Edition for Me and Best Seller for You: The Impact of Scarcity versus Popularity Cues on Self versus Other-Purchase Behavior
Meiseberg (2016) | The Effectiveness of E-tailers’ Communication Practices in Stimulating Sales of Niche versus Popular Products
Zhao, Zhao and Deng (2016) | An Empirical Investigation of Online Gray Markets
Verma et al. (2019) | Are Low Price and Price Matching Guarantees Equivalent? The Effects of Different Price Guarantees on Consumers’ Evaluations
Marchand and Marx (2020) | Automated Product Recommendations with Preference-Based Explanations
Figueiredo, Larsen and Bean (2021) | The Cosmopolitan Servicescape
Feng and Fay (2022) | An empirical investigation of forward-looking retailer performance using parking lot traffic data derived from satellite imagery
Ravula, Jha and Biswas (2022) | Relative persuasiveness of repurchase intentions versus recommendations in online reviews
Kovacheva, Nikolova and Lamberton (2022) | Will he buy a surprise? Gender differences in the purchase of surprise offerings
Joy et al. (2023) | Co-creating affective atmospheres in retail experience
Gu and Wu (2023) | Highlighting supply-abundance increases attraction to small-assortment retailers
Kübler et al. (2024) | The effect of review images on review helpfulness: A contingency approach
Cui, Zhu and Chen (2024) | Where you live matters: The impact of offline retail density on mobile shopping app usage

Appendix B. Exemplary data extraction using APIs can be found in the API documentation at https://api.
music- to- scrape.org/docs.
Application Programming Interfaces (APIs) are typically Compared to web scraping (which would initially re-
well-documented, including instructions on “downloading” turn the HTML source code of a website), the re-
specific data. sponse of an API call is often in the JSON file for-
Using the API from music-to-scrape.org, the code block mat. JSON files store information in a more hierar-
in Table B1 shows how to extract fictitious data on the chical and condensed, ‘data-only’ format. Compared to
artist “Lee Ritenour.” The structure behind the API call HTML source code, this data-only format strips all over-

Table B1
R code for step 1 (data extraction).

Readers can access the code at https:// osf.io/ bk7e9/ , which may be updated over time for enhancements or fixes.

142
J.Y. Guyt, H. Datta and J. Boegershausen Journal of Retailing 100 (2024) 130–147

Identifying areas of interest and collecting a large body of data from the landing page and
head (e.g., formatting) and is, therefore, efficient in its

Broad or explorative research projects or agendas where research questions are emerging
usage.

X-degree links (e.g., retailer’s landing page and pages linked on the landing page)

Moderate – High (can exploit naturally occuring unanticipated policy changes)

Appendix C. Targeted vs. comprehensive web data
collections

Moderate – High (includes many and potentially large files, e.g., pictures)1

Moderate – High (additional data may contain applicable information)

In Table C1, we juxtapose two distinct web data collec-
tion philosophies: a targeted versus comprehensive approach.
The targeted approach is suited for concrete and specific re-

Moderate (additional data may provide matching identifiers)

1 A trade-off that researchers can make is to exclude saving images, vastly reducing the disk space required but foregoing the possibility of visual analytics at a later point.
search questions, whereas researchers with an entire research
agenda or multiple research questions may benefit from the comprehensive approach. The main differentiator in terms of research flow in the comprehensive approach is the increased flexibility in analyzing new focal variables and adding control variables. From a programming perspective, the targeted approach is more efficient, at the cost of being more likely to break in case of a website update. We refer to Table C1 for additional information and discuss the implications for matching thereafter.

Comprehensive web data collection
High (code is not prone to changes in web format)
Low - Moderate (allows for breadth but not depth)

Targeted web data collection
Identifying elements of interest on the website and collecting those exhaustively (e.g., pricing of all products of a retailer)
Focused research projects with clear research question(s) or predictions
High (allows for collecting more depth)
Low (only elements of interest)
Low (no additional characteristics for matching)
Low (only if no alternative data is needed)
Low (elements or layout may change)
N/A (no additional data collected)

Data collection design to facilitate matching

The ability to match to other data sources is particularly relevant to retailing research. While the approach (targeted or comprehensive) influences the variety of data collected, researchers can cast a smaller or wider net even within each approach. Collecting a greater variety of data related to the focal data of interest is beneficial. For example, for a 330 ml can of Coca-Cola, this would entail documenting the information of interest (i.e., price), but also metadata (e.g., URL and date of data collection) and any other information provided on the website (e.g., barcode, EAN, images, ingredients, flavors, and brand names). These characteristics may also be used as control variables. It is crucial to create matching tables during data collection. In the case of the 330 ml Coca-Cola can, the matching table could contain a unique ID (e.g., a URL or EAN and a retailer indicator) and relevant metadata (e.g., data collection date).
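A matching table of the kind described above can be kept as a small keyed file that is written during data collection. The following is a minimal sketch in Python, under stated assumptions: the class `MatchRecord`, its field names, the `retailer:key` format of the unique ID, and the example URL and EAN are all hypothetical and serve only as illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MatchRecord:
    """One row of a matching table (all field names are illustrative)."""
    retailer: str           # retailer indicator
    url: str                # product page URL
    ean: Optional[str]      # EAN if found on the page, otherwise None
    collected_on: str       # data collection date, e.g. "2024-01-15"

    @property
    def uid(self) -> str:
        # Assumption: prefer the EAN as key and fall back to the URL.
        key = self.ean if self.ean else self.url
        return f"{self.retailer}:{key}"


def build_matching_table(records):
    """Return CSV text with one row per unique ID (first record seen wins)."""
    unique = {}
    for rec in records:
        unique.setdefault(rec.uid, rec)
    buffer = io.StringIO()
    fields = ["uid", "retailer", "url", "ean", "collected_on"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for uid, rec in unique.items():
        writer.writerow({"uid": uid, **asdict(rec)})
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values, not real collected data.
    records = [
        MatchRecord("retailer_a", "https://example.com/p/1", "5449000000996", "2024-01-15"),
        MatchRecord("retailer_b", "https://example.com/p/2", None, "2024-01-15"),
    ]
    print(build_matching_table(records))
```

Combining the retailer indicator with the EAN (or the URL as a fallback) keeps the same product at two retailers apart while still allowing a join on the EAN alone.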

Table C1
Targeted vs. comprehensive web data collections.

Description
Data coverage
Resource intensity (e.g., computing infrastructure, storage)
Ease of matching (e.g., based on EAN)
Flexibility to change data collection plan
Robustness to environmental changes (e.g., website layout changes)
Ability to create additional control variables (e.g., in review process)

Notes: …

Methods used in matching

Unique identifiers (UIDs) can be used as ‘keys’ to match data. Commonly used UIDs are product codes (e.g., Universal Product Code or European Article Number, UPC or EAN, respectively) allocated by a central authority. EANs and UPCs come in different lengths but are unique and can be converted using simple rules. If UPCs are available, a deterministic match is possible, whereas their absence leads to probabilistic matches. Examples of using the collected UPCs are found in Keller and Guyt (2023b). It is particularly important for researchers to note that UPCs might not always be visible when browsing a particular website but may still be present in the metadata or the accompanying visuals. For example, retailers may use UIDs internally to track SKUs and still rely on these when searching for a product on the website, yet not display them on the user-facing side. Such
UIDs are often found in the raw HTML files. Alternatively, pictures of SKUs can contain the UPC. Storing these pictures allows researchers to obtain the UPC relatively easily using available packages. Open-source software such as ZBar Bar Code Reader can identify bar codes from images and videos.

Nevertheless, researchers may find themselves without a UPC or alternative UID to facilitate deterministic matching. In such cases, probabilistic matching using heuristics or ‘fuzzy’ matching techniques may still provide relief. To engage in such efforts, researchers can focus on specific characteristics (e.g., size, brand name, flavor, ingredients) and write custom code to determine matches. This custom code can contain heuristics (e.g., if brand names are equal, flavor is equal, and size is equal, it is a match) or can utilize fuzzy matching routines that allow for ‘distances’ between characteristics. Such fuzzy matching can be particularly useful for non-standardized characteristics, such as a description of the product.

We propose two less frequently discussed alternative ways to facilitate matching: utilizing (i) the internal search function of retailers and platforms or (ii) generative AI (e.g., ChatGPT). Regarding the internal search functions of retailers and platforms, we note that firms are incentivized to accurately display the product for which a customer is searching. If the product is available, the retailer or platform will optimize its search function to match products to their characteristics. Such searches can be done using any available product characteristics that have been collected. Researchers could programmatically use the search function to find potential matches. This may also provide fruitful avenues for determining competitors of searched products.

While the field is rapidly evolving, there are clear use cases of GenAI, such as LLMs, for matching purposes. LLMs can perform matching on existing data or find auxiliary data using the ability of LLMs to navigate the web. Verifying the accuracy of the matches is important. Such checks can be executed using an inter-coder reliability measure in which a random sample of matches is selected and hand-coded for accuracy.

Appendix D. Using LLMs for web scraping projects

Table D1 illustrates various use cases and prompts for using LLMs in web scraping projects.

Appendix E. Exemplary big-team science web data projects

Table E1 highlights exemplary web data collections that could be tackled using big-team science.

Table D1
Using LLMs for web scraping.

Area | Goal | Example prompts¹
Coding | Identify elements in web page | “Identify the html_tags that allow me to locate the price of products in the following web page.”
Coding | Suggest methods to extract elements | “Can you write code in R using Rvest that scrapes prices from the following website?”
Coding | Develop code in different languages | “Below, I have some R code which scrapes prices of a website. Could you translate the code into Python so it does the exact same thing?”
Coding | Debug and fix code | “My Python scraper is failing [place error or code underneath] to parse dates correctly from a webpage. Can you suggest a fix?”
Coding | Suggest code improvements | “Look at my code below which tries to scrape the website [insert website]. Could you give me some suggestions (if any) that could improve this script?”
Coding | Get a script to start a data collection | “I’d like to regularly monitor product names and prices at [insert website]. Which coding language would you recommend me to scrape the information with and could you provide code I could use as a starting point?”
Data Discovery & Enrichment | Extract data from a single web page | “I need the product names and corresponding prices of this webshop [insert website]. Please provide them in an Excel sheet.”¹
Data Discovery & Enrichment | Identify similar data sources | “I have data on prices of sodas at Walmart in the US, can you provide me with other [relevant retailers / countries] I should inspect?”
Data Discovery & Enrichment | Identify additional data sources | “I have data on sodas including the EAN, can you provide me with datasets with nutritional information on EAN codes?”
Analysis Support | Restructuring data to get “clean” output | “Given the following HTML [insert HTML underneath], how would I extract the product name and price using Python, R, and Puppeteer?”
Analysis Support | Check data for anomalies | “I have a scraper that collects data on prices from X, can you write an R function for me that verifies that all prices are in USD?”
Analysis Support | Recommending and creating data visualizations | “I have data on [insert which data you have] scraped from a website. It contains [insert what your data is about], please suggest 5 ways to visualize this data.”
Analysis Support | Performing sentiment analysis | “I have [describe dataset] with reviews about products from Amazon. Could you help me perform a sentiment analysis?”²

Notes: Some prompts may not function with language models lacking internet access, like ChatGPT 3.5. Using just a webpage URL might be ineffective, even with internet-enabled models. A better approach is to upload a website’s HTML file, obtained by saving a webpage (e.g., Amazon.com’s homepage). We note that GPTs may be available to assist researchers with specialized data extraction tasks.
¹ For an exemplary chat, see https://chat.openai.com/share/22f38836-1e6a-4aa4-9ec3-b99a7e74393d.
² For an exemplary chat, see https://chat.openai.com/share/de6fdf7f-5bc3-42e2-9604-b63987f777f9.
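To make the kind of starting-point code requested in the prompts above more concrete, the following is a minimal sketch in Python of extracting product names and prices from a saved HTML file and flagging prices that are not in USD. It rests on several assumptions: the markup in `SAMPLE_HTML` (a `div` with class `product` holding `span` elements with classes `name` and `price`) is hypothetical, the names `ProductParser`, `extract_products`, and `non_usd_prices` are illustrative, and the USD check simply tests for a leading dollar sign. Real retailer pages use their own tags and class names, so the selectors would need to be adapted.

```python
from html.parser import HTMLParser

# Hypothetical saved HTML; real pages will be structured differently.
SAMPLE_HTML = """
<div class="product"><span class="name">Cola can 330 ml</span><span class="price">$0.89</span></div>
<div class="product"><span class="name">Orange soda 330 ml</span><span class="price">€0.95</span></div>
"""


class ProductParser(HTMLParser):
    """Collect name and price text from spans inside product divs."""

    def __init__(self):
        super().__init__()
        self.products = []   # list of dicts with keys "name" and "price"
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.products.append({"name": "", "price": ""})
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a name or price span.
        if self._field:
            self.products[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Parse an HTML string and return the list of product dicts."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


def non_usd_prices(products):
    """Return products whose price string does not start with a dollar sign."""
    return [p for p in products if not p["price"].startswith("$")]


if __name__ == "__main__":
    products = extract_products(SAMPLE_HTML)
    print(products)
    print(non_usd_prices(products))
```

Working from a saved HTML string, as the Notes suggest, keeps the example independent of network access and of changes to a live website.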
Table E1
Exemplary “big-team science” web data projects.

Category | Explanation | Exemplary Platforms/Websites
E-commerce Websites | Scrape data to generate a database of historical prices, customer reviews, product availability, product pictures, and nutritional facts. | Amazon, eBay, Alibaba, Target, Wayfair
Social Media Platforms | Monitor for brand mentions, customer sentiment, trending products, and visual content analysis for product pictures. | X, Facebook, Instagram, Pinterest
Price Comparison Websites | Collect data to understand pricing strategies across different retailers and regions, including related products from recommendation engines. | Honey, CamelCamelCamel, Slickdeals
Consumer Forums and Review Sites | Capture customer reviews on understudied platforms, including discussions on product recommendations and related products. | Quora, Trustpilot, Yelp
International Retailer Websites | Compare retail strategies and product offerings globally. Uncover insights into regional market preferences and global retail trends, including nutritional facts. | Tesco (UK), Carrefour (France), Flipkart (India), Picnic (The Netherlands)
Mobile App Data | Capture data from retail mobile apps, including user engagement. | Walmart App, Amazon Shopping, The Home Depot
Cross-Platform Retail Data | Collect and integrate data from various online and offline platforms for a holistic view of the retail landscape, including product images and metadata. | Combination of online retailers, brick-and-mortar store data, and specialized e-commerce platforms like Shopify stores

References

Adjerid, Idris & Ken Kelley (2018), “Big Data in Psychology: A Framework for Research Advancement,” American Psychologist, 73 (7), 899–917.
Aggarwal, Praveen, Rajiv Vaidyanathan & Alladi Venkatesh (2009), “Using Lexical Semantic Analysis to Derive Online Brand Positions: An Application to Retail Marketing Research,” Journal of Retailing, 85 (2), 145–58.
Arvidsson, Adam & Alessandro Caliandro (2016), “Brand Public,” Journal of Consumer Research, 42 (5), 727–48.
Balachandran, Sarath & Exequiel Hernandez (2021), “Mi Casa Es Tu Casa: Immigrant Entrepreneurs as Pathways to Foreign Venture Capital Investments,” Strategic Management Journal, 42 (11), 2047–83.
Boegershausen, Johannes, Hannes Datta, Abhishek Borah & Andrew T. Stephen (2022), “Fields of Gold: Scraping Web Data for Marketing Insights,” Journal of Marketing, 86 (5), 1–20.
Bright, Jonathan (2017). “Big Social Science: Doing Big Data in the Social Sciences”. In Nigel G. Fielding, Raymond M. Lee, and Grant Blank (Eds.), The Sage Handbook of Online Research Methods (pp. 125–139). London, UK: Sage.
Cavallo, Alberto & Roberto Rigobon (2016), “The Billion Prices Project: Using Online Prices for Measurement and Research,” Journal of Economic Perspectives, 30 (2), 151–78.
Chen, Xi & William D. Nordhaus (2011), “Using Luminosity Data as a Proxy for Economic Statistics,” Proceedings of the National Academy of Sciences, 108 (21), 8589–94.
Chen, Zoey & Jonah Berger (2013), “When, Why, and How Controversy Causes Conversation,” Journal of Consumer Research, 40 (3), 580–93.
Cui, Xuebin, Ting Zhu & Yubo Chen (2024), “Where You Live Matters: The Impact of Offline Retail Density on Mobile Shopping App Usage,” Journal of Retailing forthcoming.
Curry, David (2024). “Shein Revenue and Usage Statistics (2024).” https://web.archive.org/web/20240120134207/https://www.businessofapps.com/data/shein-statistics/.
Datta, Hannes, George Knox & Bart J. Bronnenberg (2018), “Changing Their Tune: How Consumers’ Adoption of Online Streaming Affects Music Consumption and Discovery,” Marketing Science, 37 (1), 5–21.
Datta, Hannes, Harald J. van Heerde, Marnik G. Dekimpe & Jan-Benedict E.M. Steenkamp (2022), “Cross-National Differences in Market Response: Line-Length, Price, and Distribution Elasticities in Fourteen Indo–Pacific Rim Economies,” Journal of Marketing Research, 59 (2), 251–70.
Duan, Wenjing, Bin Gu & Andrew B. Whinston (2008), “The Dynamics of Online Word-of-Mouth and Product Sales—An Empirical Investigation of the Movie Industry,” Journal of Retailing, 84 (2), 233–42.
Edelman, Benjamin & Wesley Brandi (2015), “Risk, Information, and Incentives in Online Affiliate Marketing,” Journal of Marketing Research, 52 (1), 1–12.
Fay, Scott, Jinhong Xie & Cong Feng (2015), “The Effect of Probabilistic Selling on the Optimal Product Mix,” Journal of Retailing, 91 (3), 451–67.
Feng, Cong & Scott Fay (2022), “An Empirical Investigation of Forward-Looking Retailer Performance Using Parking Lot Traffic Data Derived from Satellite Imagery,” Journal of Retailing, 98 (4), 633–46.
Figueiredo, Bernardo, Hanne Pico Larsen & Jonathan Bean (2021), “The Cosmopolitan Servicescape,” Journal of Retailing, 97 (2), 267–87.
Fok, Dennis, Csilla Horváth, Richard Paap & Philip Hans Franses (2006), “A Hierarchical Bayes Error Correction Model to Explain Dynamic Effects of Price Changes,” Journal of Marketing Research, 43 (3), 443–61.
Forscher, Patrick S., Eric-Jan Wagenmakers, Nicholas A. Coles, Miguel Alejandro Silan, Natália Dutra, Dana Basnight-Brown & Hans Ijzerman (2023), “The Benefits, Barriers, and Risks of Big-Team Science,” Perspectives on Psychological Science, 18 (3), 607–23.
Fuchs, Christoph, Ulrike Kaiser, Martin Schreier & Stijn M.J. van Osselaer (2022), “The Value of Making Producers Personal,” Journal of Retailing, 98 (3), 486–95.
Gan, Tammy (2021). “Why Are Massive Shein Hauls So Popular on Tiktok?” https://web.archive.org/web/20240205200211/https://greenisthenewblack.com/shein-ultra-fast-fashion-consumerism-tiktok-influencer/.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford (2020). “Datasheets for Datasets.” arXiv preprint arXiv:1803.09010.
George, Gerard, Ernst C. Osinga, Dovev Lavie & Brent A. Scott (2016), “Big Data and Data Science Methods for Management Research,” Academy of Management Journal, 59 (5), 1493–507.
Geyskens, Inge, Katrijn Gielens & Els Gijsbrechts (2010), “Proliferating Private-Label Portfolios: How Introducing Economy and Premium Private Labels Influences Brand Choice,” Journal of Marketing Research, 47 (5), 791–807.
Gielens, Katrijn & Anne L. Roggeveen (2023), “Editorial: So, What Is Retailing? The Scope of Journal of Retailing,” Journal of Retailing, 99 (2), 169–72.
Gielens, Katrijn & Jan-Benedict E.M. Steenkamp (2019), “Branding in the Era of Digital (Dis)Intermediation,” International Journal of Research in Marketing, 36 (3), 367–84.
Gong, Jing, Michael D. Smith & Rahul Telang (2015), “Substitution or Promotion? The Impact of Price Discounts on Cross-Channel Sales of Digital Movies,” Journal of Retailing, 91 (2), 343–57.
Gopal, Ram D., Bhavik Pathak, Arvind K. Tripathi & Fang Yin (2006), “From Fatwallet to Ebay: An Investigation of Online Deal-Forums and Sales Promotions,” Journal of Retailing, 82 (2), 155–64.


Grewal, Dhruv, Dennis Herhausen, Stephan Ludwig & Francisco Villarroel Ordenes (2022), “The Future of Digital Communication Research: Considering Dynamics and Multimodality,” Journal of Retailing, 98 (2), 224–40.
Gu, Yangjie & Yuechen Wu (2023), “Highlighting Supply-Abundance Increases Attraction to Small-Assortment Retailers,” Journal of Retailing, 99 (3), 420–39.
Guyt, Jonne & Els Gijsbrechts (2018), “On Consumer Choice Patterns and the Net Impact of Feature Promotions,” International Journal of Research in Marketing, 35 (3), 490–508.
Henrich, Joseph, Steven J. Heine & Ara Norenzayan (2010a), “Most People Are Not WEIRD,” Nature, 466 (7302), 29.
Henrich, Joseph, Steven J. Heine & Ara Norenzayan (2010b), “The Weirdest People in the World?,” Behavioral and Brain Sciences, 33 (2–3), 61–83.
Hofstetter, Reto, Martin P. Fritze & Cait Lamberton (2024), “Beyond Scarcity: A Social Value-Based Lens for NFT Pricing,” Journal of Consumer Research forthcoming.
Hoover, Joseph, Morteza Dehghani, Kate Johnson, Rumen Iliev & Jesse Graham (2018). “Into the Wild: Big Data Analytics in Moral Psychology”. In Jesse Graham, and Kurt Gray (Eds.), The Atlas of Moral Psychology (pp. 525–536). New York: Guilford Press.
Howe, Lauren C. & Benoît Monin (2017), “Healthier Than Thou? “Practicing What You Preach” Backfires by Increasing Anticipated Devaluation,” Journal of Personality and Social Psychology, 112 (5), 718–735.
Hsu, Greta, Özgecan Koçak & Balázs Kovács (2018), “Co-Opt or Coexist? A Study of Medical Cannabis Dispensaries’ Identity-Based Responses to Recreational-Use Legalization in Colorado and Washington,” Organization Science, 29 (1), 172–90.
Hsu, Greta & Balázs Kovács (2021), “Association between County Level Cannabis Dispensary Counts and Opioid Related Mortality Rates in the United States: Panel Data Study,” BMJ, 372, m4957.
Hu, Ye & Xin Wang (2010), “Country-of-Origin Premiums for Retailers in International Trades: Evidence from Ebay’s International Markets,” Journal of Retailing, 86 (2), 200–7.
IBIS World (2023). “Ethnic Supermarkets in the US - Market Size, Industry Analysis, Trends and Forecasts (2024-2029).” https://web.archive.org/web/20240205200524/https://www.ibisworld.com/united-states/market-research-reports/ethnic-supermarkets-industry/#IndustryStatisticsAndTrends.
Joo, Mingyu, Tridib Mazumdar & S.P. Raj (2012), “Bidding Strategies and Consumer Savings in NYOP Auctions,” Journal of Retailing, 88 (1), 180–8.
Joy, Annamma, Jeff Jianfeng Wang, Davide C. Orazi, Seyee Yoon, Kathryn LaTour & Camilo Peña (2023), “Co-Creating Affective Atmospheres in Retail Experience,” Journal of Retailing, 99 (2), 297–317.
Judd, Charles M., Jacob Westfall & David A. Kenny (2017), “Experiments with More Than One Random Factor: Designs, Analytic Models, and Statistical Power,” Annual Review of Psychology, 68 (1), 601–25.
Keller, Kristopher O. & Jonne Y. Guyt (2023a). The Dark Side of Bottle Bills: How Price Increases, Rather Than Bottle Deposits, Generate Sales Losses for Retailers. University of North Carolina at Chapel Hill working paper https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4572494.
Keller, Kristopher O. & Jonne Y. Guyt (2023b), “A War on Sugar? Effects of Reduced Sugar Content and Package Size in the Soda Category,” Journal of Marketing, 87 (5), 698–718.
Keller, Kristopher O., Jonne Y. Guyt & Rajdeep Grewal (2023), “Soda Taxes and Marketing Conduct,” Journal of Marketing Research forthcoming.
Kosinski, Michal, Yilun Wang, Himabindu Lakkaraju & Jure Leskovec (2016), “Mining Big Data to Extract Patterns and Predict Real-Life Outcomes,” Psychological Methods, 21 (4), 493–506.
Kovacheva, Aleksandra, Hristina Nikolova & Cait Lamberton (2022), “Will He Buy a Surprise? Gender Differences in the Purchase of Surprise Offerings,” Journal of Retailing, 98 (4), 667–84.
Kozinets, Robert V. (2002), “The Field Behind the Screen: Using Netnography for Marketing Research in Online Communities,” Journal of Marketing Research, 39 (1), 61–72.
Krosnick, Rebecca & Steve Oney (2023). “Promises and Pitfalls of Using LLMs for Scraping Web UIs”. CHI ’23 Workshop: The Future of Computational Approaches for Understanding and Adapting User Interfaces.
Kübler, Raoul V., Lara Lobschat, Lina Welke & Hugo van der Meij (2024), “The Effect of Review Images on Review Helpfulness: A Contingency Approach,” Journal of Retailing forthcoming.
Lu, Huidi, Ralf van der Lans, Kristiaan Helsen & Dinesh K. Gauri (2023), “Depart: Decomposing Prices Using Atheoretical Regression Trees,” International Journal of Research in Marketing, 40 (4), 781–800.
Maner, Jon K. (2016), “Into the Wild: Field Research Can Increase Both Replicability and Real-World Impact,” Journal of Experimental Social Psychology, 66 (1), 100–6.
Marchand, André & Paul Marx (2020), “Automated Product Recommendations with Preference-Based Explanations,” Journal of Retailing, 96 (3), 328–43.
Matz, Sandra C. & Oded Netzer (2017), “Using Big Data as a Window into Consumers’ Psychology,” Current Opinion in Behavioral Sciences, 18 (1), 7–12.
Meiseberg, Brinja (2016), “The Effectiveness of E-Tailers’ Communication Practices in Stimulating Sales of Niche Versus Popular Products,” Journal of Retailing, 92 (3), 319–32.
Mitchell, Ryan (2018). Web Scraping with Python: Collecting Data from the Modern Web. Sebastopol, CA: O’Reilly Media.
Moon, Sangkil & Reo Song (2015), “The Roles of Cultural Elements in International Retailing of Cultural Products: An Application to the Motion Picture Industry,” Journal of Retailing, 91 (1), 154–70.
Moore, Sarah G. (2012), “Some Things Are Better Left Unsaid: How Word of Mouth Influences the Storyteller,” Journal of Consumer Research, 38 (6), 1140–54.
Muller, Eitan (2020), “Delimiting Disruption: Why Uber Is Disruptive, but Airbnb Is Not,” International Journal of Research in Marketing, 37 (1), 43–55.
Nejad, Mohammad G., Mehdi Amini & Emin Babakus (2015), “Success Factors in Product Seeding: The Role of Homophily,” Journal of Retailing, 91 (1), 68–88.
Ozer, Gorkem Turgut, Brad N. Greenwood & Anandasivam Gopal (2024), “Noisebnb: An Empirical Analysis of Home-Sharing Platforms and Residential Noise Complaints,” Information Systems Research forthcoming.
Pachali, Max & Hannes Datta (2024). What Drives Demand for Playlists on Spotify? Tilburg University working paper https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4079693.
Pan, Yue & Jason Q. Zhang (2011), “Born Unequal: A Study of the Helpfulness of User-Generated Product Reviews,” Journal of Retailing, 87 (4), 598–612.
Peñaloza, Lisa & Mary C. Gilly (1999), “Marketer Acculturation: The Changer and the Changed,” Journal of Marketing, 63 (3), 84–104.
Popkowski Leszczyc, Peter T.L., Chun Qiu & Yongfu He (2009), “Empirical Testing of the Reference-Price Effect of Buy-Now Prices in Internet Auctions,” Journal of Retailing, 85 (2), 211–21.
Rad, Mostafa Salari, Alison Jane Martingano & Jeremy Ginges (2018), “Toward a Psychology of Homo Sapiens: Making Psychological Science More Representative of the Human Population,” Proceedings of the National Academy of Sciences, 115 (45), 11401–5.
Ratchford, Brian, Gonca Soysal, Alejandro Zentner & Dinesh K. Gauri (2022), “Online and Offline Retailing: What We Know and Directions for Future Research,” Journal of Retailing, 98 (1), 152–77.
Ravula, Prashanth, Subhash Jha & Abhijit Biswas (2022), “Relative Persuasiveness of Repurchase Intentions Versus Recommendations in Online Reviews,” Journal of Retailing, 98 (4), 724–40.
Sanchez, Camilo (2023). “The Most Ambitious ‘Health Tax’ in Latin America Debuts in Colombia.” https://english.elpais.com/international/2023-11-01/the-most-ambitious-health-tax-in-latin-america-debuts-in-colombia.html#.
Schnurr, Benedikt, Christoph Fuchs, Elisa Maira, Stefano Puntoni, Martin Schreier & Stijn M.J. van Osselaer (2022), “Sales and Self: The Noneconomic Value of Selling the Fruits of One’s Labor,” Journal of Marketing, 86 (3), 40–58.


Seiler, Stephan, Anna Tuchman & Song Yao (2021), “The Impact of Soda Taxes: Pass-through, Tax Avoidance, and Nutritional Effects,” Journal of Marketing Research, 58 (1), 22–49.
Suter, Tracy A. & David M. Hardesty (2005), “Maximizing Earnings and Price Fairness Perceptions in Online Consumer-to-Consumer Auctions,” Journal of Retailing, 81 (4), 307–17.
Tang, Fang-Fang & Xiaolin Xing (2001), “Will the Growth of Multi-Channel Retailing Diminish the Pricing Efficiency of the Web?,” Journal of Retailing, 77 (3), 319–33.
Thomaz, Felipe & John Hulland (2021). Shining a Light on the Dark Web: An Examination of the Abnormal Structure of Illegal Digital Marketplaces. Oxford University working paper.
Trusov, Michael, Liye Ma & Zainab Jamal (2016), “Crumbs of the Cookie: User Profiling in Customer-Base Analysis and Behavioral Targeting,” Marketing Science, 35 (3), 405–26.
van Lin, Arjen, Aylin Aydinli, Marco Bertini, Erica van Herpen & Julia von Schuckmann (2023), “Does Cash Really Mean Trash? An Empirical Investigation into the Effect of Retailer Price Promotions on Household Food Waste,” Journal of Consumer Research, 50 (4), 663–82.
Venkatesan, Rajkumar, Kumar Mehta & Ravi Bapna (2007), “Do Market Characteristics Impact the Relationship between Retailer Characteristics and Online Prices?,” Journal of Retailing, 83 (3), 309–24.
Verhoef, Peter C., Thijs Broekhuizen, Yakov Bart, Abhi Bhattacharya, John Qi Dong, Nicolai Fabian & Michael Haenlein (2021), “Digital Transformation: A Multidisciplinary Reflection and Research Agenda,” Journal of Business Research, 122, 889–901.
Verma, Swati, Abhijit Guha, Abhijit Biswas & Dhruv Grewal (2019), “Are Low Price and Price Matching Guarantees Equivalent? The Effects of Different Price Guarantees on Consumers’ Evaluations,” Journal of Retailing, 95 (3), 99–108.
Wang, Feng, Xuefeng Liu & Eric Fang (2015), “User Reviews Variance, Critic Reviews Variance, and Product Sales: An Exploration of Customer Breadth and Depth Effects,” Journal of Retailing, 91 (3), 372–89.
Wu, Laurie & Christopher Lee (2016), “Limited Edition for Me and Best Seller for You: The Impact of Scarcity Versus Popularity Cues on Self Versus Other-Purchase Behavior,” Journal of Retailing, 92 (4), 486–99.
Zhao, Kexin, Xia Zhao & Jing Deng (2016), “An Empirical Investigation of Online Gray Markets,” Journal of Retailing, 92 (4), 397–410.