Research:Social Memory about Chileans in Wikipedia

From Meta, a Wikimedia project coordination wiki
This is an archived version of this page, as edited by Ptbeytia (talk | contribs) at 18:42, 5 April 2024 (Results). It may differ significantly from the current version.
Created: 21:14, 4 April 2024 (UTC)
Collaborators: Carlos Cruz Infante
Duration: 2023-6 – 2024-6
Keywords: Wikidata, social memory, digital discourse, knowledge gaps

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


This project investigates Chilean biographies on Wikipedia, examining how content gaps evolve across generations of notable people. Our goal is to explore these gaps in various occupational domains (e.g., politics, science, art, and sport) and to use these trends to understand how Wikipedia frames the history of notable Chileans, that is, which aspects of that history it highlights.

Numerous studies have shown that Wikipedia biographies tend to make specific categories of people more visible. For example, they focus disproportionately on men (Hinnosaar, 2019) born in the Global North (Beytía, 2020), who lived in the last century (Samoilenko et al., 2017), and who excelled in professions such as mass arts or popular sports (Reznik & Shatalov, 2016).

We have investigated biographical content asymmetries related to gender and place of birth:

In a study based on the Networked Pantheon dataset (Beytía & Schobin, 2020), we found that only five countries in the Global North account for more than 62% of Wikipedia's biographical coverage. In addition, we estimated that the inequality in coverage between countries reaches a Gini coefficient of 0.84 (Beytía, 2020).

We examined the written and visual asymmetries between men's and women's biographies in the ten most widely spoken languages (Beytía et al., 2022). We concluded that (a) most of the male bias arises when selecting who will have a biography, (b) written and visual asymmetries do not follow the same patterns of disparity, (c) men's biographies tend to have more images across languages, and (d) women's biographies have, on average, better visual quality.

In another study, we proposed a general theoretical framework to closely observe content asymmetries in Wikipedia, which was tested with research findings on gender gaps (Beytía & Wagner, 2022). Our "Visibility Layers" model serves to analyze content inequalities across three editorial stages (content selection, building, and positioning) that contribute to making groups of biographies more or less visible.
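
The two concentration measures mentioned in the first study above (the share of coverage held by the top countries and the Gini coefficient) can be illustrated with a minimal Python sketch. The function names `gini` and `top_share` and the toy counts are assumptions made for illustration only; they are not data from Beytía (2020), and the page does not state which estimator that study used. The sketch uses a standard rank-based formula for the Gini coefficient.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)  # ascending order is required by the rank formula
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_share(values, k):
    """Share of the total held by the k largest values."""
    ordered = sorted(values, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical biography counts per country (toy numbers, not real data).
toy_counts = [50, 20, 10, 10, 10]
print(round(gini(toy_counts), 3), round(top_share(toy_counts, 2), 3))
```

With real per-country biography counts in place of `toy_counts`, the same two functions would give figures comparable in kind to the 62% share and the 0.84 coefficient reported above.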

The literature mentioned above does not analyze knowledge asymmetries from a historical perspective. A second line of research has analyzed temporal evolution by looking at specific variables. For instance, biographies have been examined to observe the evolution of occupations across generations (Jara-Figueroa et al., 2019), variation in migration patterns (Menini et al., 2017), and changes in biographical ties (Schich et al., 2014). These studies analyze content-specific variables (occupational distribution, migration, biographical relationships) rather than the joint evolution of multiple content asymmetries across generations. Therefore, they do not offer a comprehensive look at how biographies frame the history of any specific human group.

In a recent study, we attempted to develop a more comprehensive approach, examining how Wikipedia frames the history of sociology in multiple and interrelated information structures (Beytía & Müller, 2022). There, we studied content structures and asymmetries across all generations of notable sociologists.

Through this research, we aim to replicate the “Wikipedia-framing” approach, albeit with a different subject matter. Instead of studying the history of a specific scientific discipline, we propose investigating how Wikipedia frames the history of notable people in a specific country and across different professions. To our knowledge, this topic has not been investigated so far, especially not through a joint analysis of multiple content asymmetries that vary across generations.

Methods

We will focus on specific occupational categories that represent socially diverse areas of notable people: politicians, scientists, artists, and sportspersons. Previous research has shown that, in the most recent generations of notable people (those born since the 1950s), these occupational domains are the ones with the highest number of biographies on Wikipedia (Reznik & Shatalov, 2016; Yu et al., 2016). Except for sportspersons, those same categories are also the dominant occupations in the generations born since the beginning of Chile's history (early 19th century).

We will analyze the biographies in those occupational domains in the following stages:

1. Data extraction: we will create databases of people with Chilean nationality in the four occupational domains using Wikidata. From that source, we will extract essential biographical information (name, gender, year of birth, place of birth, year of death, place of death, a portrait, and a hyperlink to the biography). We will also obtain information on the notables’ participation in sub-domains (i.e., political parties, scientific disciplines, sports branches, and artistic disciplines). As an approximation of the communicative influence of each notable, we will record the number of languages in which each biography is available.

2. Preprocessing: we will manually check the data for correctness. We will also complete the records by adding any relevant biography that is available on Wikipedia but missing from the extracted data.

3. Natural language processing (NLP): we will perform NLP on the Spanish biographies to recognize relevant discursive entities (people, events, places, dates, and organizations). We will start our analysis from the hyperlinks extracted from Wikidata. Then, we will extract the biographical text and use the Python library spaCy to perform named-entity recognition (NER). We will do this processing separately for each occupational domain. Finally, we will review the lists of recognized entities (people, events, places, dates, and organizations) to check for repeated terms, incomplete names, or nonsense phrases.

4. Research products: we will prepare (1) an open database of notable Chileans in Wikipedia, (2) an open database of the main entities that we found in the NLP, and (3) scripts and documentation on how to extract this data from Wikidata and perform the named-entity recognition process with Wikipedia biographies. All this material will be released on an open-access project archiving platform (such as SocArXiv) under CC BY-SA 4.0 or a more permissive license.
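
The data extraction and NLP stages above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual scripts: the function names, the constants, and the User-Agent string are hypothetical, and the Wikidata identifiers (Q298 for Chile, P27 for country of citizenship, P106 for occupation, and the four occupation items) should be verified before use. The `wikibase:sitelinks` count covers all linked Wikimedia pages, so it only approximates the number of language editions. `extract_entities` expects an already loaded spaCy pipeline, so nothing here requires network access or a downloaded model until it is actually called.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata identifiers; verify each one before relying on it.
CHILE = "Q298"
OCCUPATIONS = {
    "politics": "Q82955",   # politician
    "science": "Q901",      # scientist
    "art": "Q483501",       # artist
    "sport": "Q2066131",    # athlete
}


def build_query(occupation_qid, country_qid=CHILE):
    """Return a SPARQL query for humans with one citizenship and occupation."""
    return f"""
SELECT ?person ?personLabel ?genderLabel ?birth ?death ?article ?sitelinks WHERE {{
  ?person wdt:P31 wd:Q5 ;
          wdt:P27 wd:{country_qid} ;
          wdt:P106/wdt:P279* wd:{occupation_qid} ;
          wikibase:sitelinks ?sitelinks .
  OPTIONAL {{ ?person wdt:P21 ?gender . }}
  OPTIONAL {{ ?person wdt:P569 ?birth . }}
  OPTIONAL {{ ?person wdt:P570 ?death . }}
  OPTIONAL {{ ?article schema:about ?person ;
                      schema:isPartOf <https://es.wikipedia.org/> . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "es,en" . }}
}}
"""


def run_query(query):
    """Send the query to the Wikidata Query Service and return result rows."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # Hypothetical User-Agent; replace it with real contact details.
    request = urllib.request.Request(
        url, headers={"User-Agent": "chilean-biographies-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


def extract_entities(text, nlp):
    """Run a loaded spaCy pipeline and return (entity text, label) pairs."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def clean_entities(entities, min_chars=3):
    """Drop very short fragments and merge case or spacing duplicates.

    Returns (text, label, count) tuples, most frequent first.
    """
    counts = {}
    for text, label in entities:
        normalized = " ".join(text.split())  # collapse repeated whitespace
        if len(normalized) < min_chars:
            continue
        key = (normalized.casefold(), label)
        first_seen, n = counts.get(key, (normalized, 0))
        counts[key] = (first_seen, n + 1)
    rows = [(text, key[1], n) for key, (text, n) in counts.items()]
    return sorted(rows, key=lambda row: (-row[2], row[0]))
```

A possible use would be `rows = run_query(build_query(OCCUPATIONS["science"]))`, followed by `nlp = spacy.load("es_core_news_sm")` and `clean_entities(extract_entities(biography_text, nlp))` for each biography text. The pretrained Spanish spaCy pipelines document the labels PER, LOC, ORG, and MISC, so recognizing dates and events would likely need an additional component. The automatic cleanup only reduces the manual review described in stage 3; it does not replace it.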

Timeline

Here we will add a short timeline with the main milestones and deliverables (if any) for this project.

Policy, Ethics and Human Subjects Research

We do not see ethical concerns with this research topic and methods. Also, we will not disrupt Wikipedians' work in our research stages.

Results

We will deliver four outputs:

1. An open database with biographies of Chileans and their essential characteristics.

2. An open database with entity recognition analysis of those biographies.

3. Open-access documentation to extract the data and create similar projects (e.g. to analyze Wikipedia's social memory on other countries).

4. A report of results presented at the Wiki Workshop 2024 and at Wikipedia Chile.

As a next step, we expect to publish our analyses in an academic journal and create interactive tools that allow Wikipedians to explore their biases or knowledge gaps easily.

Resources

Here we will provide links to presentations, blog posts, or other materials to disseminate our work.

References

Beytía, P. (2020). The Positioning Matters: Estimating Geographical Bias in the Multilingual Record of Biographies on Wikipedia. Companion Proceedings of the Web Conference 2020, 806–810.

Beytía, P., Agarwal, P., Redi, M., & Singh, V. (2022). Visual Gender Biases in Wikipedia: A Systematic Evaluation across the Ten Most Spoken Languages. AAAI Conference on Web and Social Media (ICWSM).

Beytía, P., & Müller, H.-P. (2022). Towards a Digital Reflexive Sociology: Using Wikipedia’s Biographical Repository as a Reflexive Tool. Poetics, 101732.

Beytía, P., & Schobin, J. (2020). Networked Pantheon: A Relational Database of Globally Famous People. Research Data Journal for the Humanities and Social Sciences, 5, 1–16. https://doi.org/10.1163/24523666-00501002

Beytía, P., & Wagner, C. (2022). Visibility Layers: A Framework for Facing the Complexity of the Gender Gap in Wikipedia Content. Internet Policy Review. https://doi.org/10.31235/osf.io/5ndkm

Entman, R. M. (1993). Framing: Toward Clarification of a Fractured Paradigm. Journal of Communication, 43(4), 51–58. https://doi.org/10.1111/j.1460-2466.1993.tb01304.x

Entman, R. M. (2007). Framing bias: Media in the distribution of power. Journal of Communication, 57(1), 163–173.

Hinnosaar, M. (2019). Gender inequality in new media: Evidence from Wikipedia. Journal of Economic Behavior & Organization, 163, 262–276.

Jara-Figueroa, C., Yu, A. Z., & Hidalgo, C. A. (2019). How the medium shapes the message: Printing and the rise of the arts and sciences. PloS One, 14(2), e0205771.

Menini, S., Sprugnoli, R., Moretti, G., Bignotti, E., Tonelli, S., & Lepri, B. (2017). Ramble on: Tracing movements of popular historical figures. Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 77–80.

Redi, M., Gerlach, M., Johnson, I., Morgan, J., & Zia, L. (2021). A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft). ArXiv:2008.12314 [Cs]. http://arxiv.org/abs/2008.12314

Reznik, I., & Shatalov, V. (2016). Hidden revolution of human priorities: An analysis of biographical data from Wikipedia. Journal of Informetrics, 10(1), 124–131.

Samoilenko, A., Lemmerich, F., Weller, K., Zens, M., & Strohmaier, M. (2017). Analysing timelines of national histories across wikipedia editions: A comparative computational approach. Eleventh International AAAI Conference on Web and Social Media.

Schich, M., Song, C., Ahn, Y.-Y., Mirsky, A., Martino, M., Barabási, A.-L., & Helbing, D. (2014). A network framework of cultural history. Science, 345(6196), 558–562.

Wagner, C., Graells-Garrido, E., Garcia, D., & Menczer, F. (2016). Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science.

Wikipedia. (2023). Wikipedia:Purpose. In Wikipedia. https://en.wikipedia.org/w/index.php?title=Wikipedia:Purpose&oldid=1144512635

Yu, A. Z., Ronen, S., Hu, K., Lu, T., & Hidalgo, C. A. (2016). Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data, 3, 150075.