
[GOAL] Add termbox language code mul to reduce redundancy in Wikidata Labels and Aliases
Open, High | Public | Goal

Description

Milestones:

User story:
As a Wikidata editor,
I want to avoid repeating identical labels in hundreds of languages
in order to reduce the amount of redundant content that needs to be maintained on Wikidata.

Problem:
We have many labels that are in principle identical across different languages (see examples section). This has some bad consequences:

  • editors having to create and maintain redundant content (copying the same thing to most/all languages creates massive amounts of edits and is a huge waste of resources)
  • need of storing redundant information that burdens our systems (e.g. the Query Service)

Solution:
Introduce a new language code that all languages fall back to. This will be particularly helpful for Unicode characters, scientific articles, and codes, as well as for names in Latin script (as we do not have an elaborate fallback system for that script yet). We will test whether this solution (only one new language code) is good enough, or whether we need more specific language codes after all to model a useful fallback chain.

This task

  • Add "mul" as a new monolingual language code.
  • Have other languages fall back to it (Translatewiki fallback chain > "mul" > "en")

Community takes over

  • Community creates guidelines and help pages on how to use the new code, e.g.
    • What if one Latin-script language prefers one form (e.g. "Philip L. Brown") and another Latin-script language prefers another form (e.g. "Philip Larry Brown" or "Philip Brown")?
    • In what cases should the Latin-script label be used for "mul" instead of the native label (while still making sure that re-users can identify the native label via a property)?
    • etc.
  • Community gives feedback after some months about how the new code and guidelines work
    • Based on the feedback we might iterate on the approach if necessary.

Ideas for the future

  • start to show a warning if someone wants to add the mul-label in a different language
  • include the experience in a possible future solution for multilingual descriptions (Abstract Descriptions)
  • re-evaluate if the final fallback to “en” is still appropriate
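
The first idea above (warning when someone re-adds the "mul" label in another language) could be sketched roughly as follows. This is only an illustration in Python, not Wikibase code: the function name `redundant_label_warning`, the plain dict of labels, and the use of exact string equality are all assumptions made for the example.

```python
from typing import Dict, Optional


def redundant_label_warning(
    labels: Dict[str, str], lang: str, new_label: str
) -> Optional[str]:
    """Return a warning if `new_label` merely repeats the "mul" label.

    Hypothetical helper: `labels` maps language codes to the labels an
    item already has. Exact string equality is an assumption; real
    guidelines might want a looser comparison.
    """
    mul_label = labels.get("mul")
    if lang != "mul" and mul_label is not None and new_label == mul_label:
        return (
            f'Label "{new_label}" for "{lang}" repeats the "mul" label '
            "and can be omitted."
        )
    return None


# The item already has a "mul" label, so repeating it in "de" triggers a warning.
item_labels = {"mul": "Neotrogla curvata"}
print(redundant_label_warning(item_labels, "de", "Neotrogla curvata"))
# A genuinely different label does not.
print(redundant_label_warning(item_labels, "de", "Höhlenstaublaus"))
```

The check only warns and never blocks the edit, which matches the wording "show a warning" in the list above.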

Mockup:

[Mockup image: image.png]

Examples:
This will be useful in many different places:

Names

Unicode characters

Codes

Scientific articles

Translatewiki fallback chain:

Examples:
ami > zh-tw, zh-hant, zh-hans
zh-tw > zh-hant, zh-hans
zh-hant > zh-hans
zh-hans > []

de-at > de
de > []

en-gb > en
en > []

Hard-coded fallback chain:

old

  • Translatewiki fallback chain > "en"

new

  • Translatewiki fallback chain > "mul" > "en"
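
To make the old and new chains concrete, here is a minimal sketch in Python. It is not the Wikibase implementation: the table only contains the example languages listed in this task, the names `TRANSLATEWIKI_FALLBACKS`, `fallback_chain` and `resolve_label` are made up for illustration, and skipping a code that is already in the chain (e.g. "en" for "en-gb") is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Fallbacks copied from the examples in this task; not a complete table.
TRANSLATEWIKI_FALLBACKS: Dict[str, List[str]] = {
    "ami": ["zh-tw", "zh-hant", "zh-hans"],
    "zh-tw": ["zh-hant", "zh-hans"],
    "zh-hant": ["zh-hans"],
    "zh-hans": [],
    "de-at": ["de"],
    "de": [],
    "en-gb": ["en"],
    "en": [],
}


def fallback_chain(lang: str, use_mul: bool = True) -> List[str]:
    """Return the language codes to try, in order, for `lang`.

    use_mul=False gives the old chain (Translatewiki chain > "en"),
    use_mul=True the new one (Translatewiki chain > "mul" > "en").
    """
    chain = [lang] + TRANSLATEWIKI_FALLBACKS.get(lang, [])
    tail = ["mul", "en"] if use_mul else ["en"]
    for code in tail:
        if code not in chain:  # assumption: never list a code twice
            chain.append(code)
    return chain


def resolve_label(
    labels: Dict[str, str], lang: str, use_mul: bool = True
) -> Optional[Tuple[str, str]]:
    """Return (language code, label) for the first code in the chain that has a label."""
    for code in fallback_chain(lang, use_mul):
        if code in labels:
            return code, labels[code]
    return None


print(fallback_chain("de-at", use_mul=False))  # old chain
print(fallback_chain("de-at"))                 # new chain
print(resolve_label({"mul": "Neotrogla curvata"}, "ami"))
```

With the new chain, an item that only carries a "mul" label still resolves for every language in the table, whereas the old chain finds nothing unless an "en" label exists.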

Community communication:

  • The interested Community needs to be aware of the new code and of the necessity to create guidelines and help pages on how to use it.
  • We need to be available for the Community when they create guidelines and to collect feedback.

Original:
This task is to add support for a "mul" language code for labels and aliases. For any benefits of this code to be properly reaped, all language codes should ultimately fall back to "mul"—which I believe would be achieved by adding it as a fallback for the "en" code.

(If it is more desirable, codes for "mul-latn", "mul-cyrl", etc. could be created, in which case e.g. only those codes using the Latin script would fall back to "mul-latn".)

Possibly related tasks: T258242 T256003 T43807


Event Timeline


@Lucas_Werkmeister_WMDE that should have been. I removed Mahir256 as they seem to be doing nonsensical edits to subscribers.

The presence of the user I am removing—not just here, but in any other discussion forum really—has made @Nikki (and others, both actually and potentially) sufficiently uncomfortable directly opining here and in those other fora that others like myself relay their opinions here for them. Unless that user wishes to impugn the credibility or emotional strength of Nikki and those other individuals, I contend that my actions in this regard are entirely sensical.

Changes for task description are:

Columns: Before my edit | After my edit (later deleted without an explanation or justification) | Reason

  • Before: editors having to create and maintain redundant content (copying the same thing to most/all languages creates massive amounts of edits and is a huge waste of resources)
    After: editors having to create and maintain redundant content (copying the same thing to most/all languages could create massive amounts of edits and is a huge waste of resources)
    Reason: all descriptions and labels can be added in a single edit

  • Before: Problem
    After: Problem
    Reason: [header repeated for clarity]

  • Before: (none)
    After: users tend to fill in empty label fields, especially when a description in the language is present
    Reason: (none)

  • Before: (none)
    After: empty label fields may result in suboptimal string additions
    Reason: It should be easy to find diffs for such edits on name items.

  • Before: (none)
    After: fall-back is generally ill understood
    Reason: people wouldn't fill in labels if the fallback was understood

  • Before: Example
    After: Example
    Reason: [header repeated/expanded for clarity]

  • Before: persons (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q5, as of now 9.2M) have in most cases the same label and the same aliases repeated in different languages, e.g. https://www.wikidata.org/wiki/Q42
    After: persons (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q5, as of now 9.2M) have in most cases the same label and the same aliases repeated in different languages, e.g. https://www.wikidata.org/wiki/Q42 Labels generally differ by script (Latin script and all others)
    Reason: (none)

  • Before: given names and family names (https://w.wiki/3zWT, which counts Q202444 and Q101352 including subclasses, as of now 590k): in all cases the same labels are repeated in different same-script languages, e.g. https://www.wikidata.org/wiki/Q21448867.
    After: given names and family names (https://w.wiki/3zWT, which counts Q202444 and Q101352 including subclasses, as of now 590k): in all cases the same labels are repeated in different same-script languages, e.g. https://www.wikidata.org/wiki/Q21448867. This is to avoid translations being added (e.g. "John"@en and "Giovanni"@it shouldn't be on the same item).
    Reason: (none)

  • Before: taxa (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q16521, as of now 3.1M) the species "Neotrogla curvata" - has "Neotrogla curvata" as the label 411 times.
    After: taxa (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q16521, as of now 3.1M) the species "Neotrogla curvata" - has "Neotrogla curvata" as the label 411 times. Latinized names should be generally available as fallback.
    Reason: (none)

  • Before: Codes
    After: Codes and abbreviations
    Reason: [header repeated for clarity]

  • Before: (none)
    After: metric ton - should have "t" as alias in Latin script languages, "т" as alias for Cyrillic languages
    Reason: is there an issue with this sample?

  • Before: Scientific articles
    After: Scientific articles
    Reason: [header repeated for clarity]

  • Before: (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q13442814, as of now 42M): in many cases the same label is repeated in different languages (e.g. https://www.wikidata.org/wiki/Q27860672). In some cases, there could be articles with parallel titles in different languages (e.g. https://www.wikidata.org/wiki/Q59238742
    After: (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q13442814, as of now 42M): in many cases the same label is repeated in different languages (e.g. https://www.wikidata.org/wiki/Q27860672). Generally the original title is available (or a translation to English). Original non-English titles are frequently missing. In some cases, there could be articles with parallel titles in different languages (e.g. https://www.wikidata.org/wiki/Q59238742. One title for @en , one for @it,
    Reason: (none)

  • Before: Open questions
    After: Open questions
    Reason: [header repeated for clarity]

  • Before: What are all the mul-<script> codes that we should start with?
    After: What are all the mul-<script> codes that we should start with? mul-latn seems the most frequent
    Reason: I think everybody agrees about the frequency (except Mahir)

  • Before: (none)
    After: Can items still be found when no label is present in the language?
    Reason: A developer statement that this won't be an issue should be sufficient to delete it from the task description. @Lucas_Werkmeister_WMDE can you confirm?

  • Before: (none)
    After: Search results are currently (also) ranked by the number of labels, how to ensure ranking still works?
    Reason: A developer statement that this won't be an issue should be sufficient to delete it from the task description. @Lucas_Werkmeister_WMDE can you confirm?

  • Before: (none)
    After: Should the "mul-latn" label be displayed in a grayed out form when a description is present?
    Reason: (none)

  • Before: (none)
    After: How will this work in LUA infoboxes? Currently users copy en labels to ca/cs/da/es/nb even when the fallback works.
    Reason: A developer statement that this won't be an issue should be sufficient to delete it from the task description. @Lucas_Werkmeister_WMDE can you confirm?

  • Before: (none)
    After: How to prevent now-empty label fields from being filled with inappropriate labels (loss of data quality)?
    Reason: (none)

Can you explain the remainder of your deletions? Above is what you deleted from the description.

@Esc3300 For clarification: Lucas and I spent a lot of time yesterday on getting everything to a point where we believe it is sensible and the remaining questions are clarified. It'd be good to concentrate the discussion on those remaining points now because otherwise we can not move this forward. As there is a strong desire from several editors to get this done I want to push this to the point where we can actually pick it up.

please refrain from editing the task description while the discussion is ongoing. It is inappropriate to masquerade your personal opinion as the hard-won consensus that we are trying to achieve here.

This message (especially the latter part) is enough of a reason to undo the changes made to the ticket since Nikki's comments were added, irrespective of what opinions I may hold of any of it (which should not be assumed as was done in the diff that stains this task). The sea lion I am removing from this ticket is also free to impugn Lucas's or Lydia's credibility or emotional strength as well.

@Esc3300 For clarification: Lucas and I spent a lot of time yesterday on getting everything to a point where we believe it is sensible and the remaining questions are clarified. It'd be good to concentrate the discussion on those remaining points now

Ok. What's the proposal for the various points in how it may backfire? And finally which script do you want to start with?

please refrain from editing the task description while the discussion is ongoing. It is inappropriate to masquerade your personal opinion as the hard-won consensus that we are trying to achieve here.

This message (especially the latter part) is enough of a reason

@Mahir Can you explain which parts the latter part covers? If not, please refrain from making such comments in phabricator or elsewhere.

For those who would like a clarification,

please refrain from editing the task description while the discussion is ongoing.

this is the former part of Lucas's message

It is inappropriate to masquerade your personal opinion as the hard-won consensus that we are trying to achieve here.

and this is the latter part.

(More on the "sea lion" term.)

Apparently there is a disagreement between Lucas and his manager about description editing.

Can you at least explain which parts you consider my personal opinion and which ones are not supported by a consensus (ideally with a link to the relevant discussion)?

There is no disagreement.
We are spending a lot of time discussing things that currently don't move this forward and do not help get to a meaningful consensus. So one final try. We need input on the final remaining discussion points as I laid out in T285156#7384455. Let's please concentrate on those now so that we can then update the task description once we have heard everyone.

If this is the only open point, can you summarize how the open points mentioned in the task description have been addressed?

Sure.

  • Could this solution somehow backfire? -> several answers in this thread that we will weigh and see if they warrant any action
  • What are all the mul-<script> codes that we should start with? -> none, we are just going with mul for now as I said in my comment
  • How exactly should be the fallback chain for these mul codes? -> no fallback within the mul codes because we only have one. fallback to and from other languages is in my remaining questions

Can you propose something?

Step #3 mentions constraints. What will they be?

I understand that you are keen to get this done, but compared to other new language codes, we are still moving quite fast. I think none of us wants this to go into a dead end.

Thank you all for your input on this! We will put this in development right after the no deploy weeks. Special thanks go to @Nikki and @Mahir256, for driving and enlightening this issue, and to @Amire80 and @Epidosis, for your valuable input!

@Esc3300: You also gave helpful input and we appreciate the effort! At the same time, your style of engagement and your continued disagreement with the direction that we took in the deliberation seem to have ultimately led to some demotivating arguments and loops in the discussion. I am sad to see that all of this resulted in a bad climate and a frustrating experience for some of the discussion's participants. It is essential for Lydia and me that - especially for hard decisions like these - we still maintain an open and welcoming climate for all people involved, as well as a worthwhile and productive discussion. This is why we would like to ask you for your help in fostering more open and welcoming discussions that respect our process in the future.

Manuel changed the subtype of this task from "Task" to "Goal". (Jul 14 2022, 12:47 PM)
Manuel renamed this task from "Add termbox language code mul" to "Add termbox language code mul to reduce redundancy in Wikidata Labels and Aliases". (Jul 14 2022, 12:49 PM)
Manuel updated the task description.
Manuel renamed this task from "Add termbox language code mul to reduce redundancy in Wikidata Labels and Aliases" to "[GOAL] Add termbox language code mul to reduce redundancy in Wikidata Labels and Aliases". (Feb 22 2023, 11:53 AM)