Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOI web translator: Offer the current page as a choice among the multiple #3130

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

zoe-translates
Copy link
Collaborator

If the DOI translator turns out to be the one that matches the current page, and if the DOI (this) translator detects the item type as "multiple", offer the current page as one of the choices (indeed, the first choice).

In the item selection dialog, this special choice will show up as "Current Webpage", followed by the page title in parenthesis if the title is present.

The specific situation that triggers this behavior is the following

  1. None other than the DOI web translator (lowest priority) matches the current page
  2. The current page itself is not the main resource identified by DOI present on the current page (e.g. a page with its own DOI in URL)

Resolves #3126.

@AbeJellinek
Copy link
Member

  • Webpage -> Web Page, I think. We use that spelling in the Connector ("Web Page with Snapshot") and in the name of the item type.
  • We may want to implement the Levenshtein check suggested in the original issue. A lot of the time, one of the DOI results is the DOI for the page, and in that case we don't want to offer the "Current Web Page" option.

@zoe-translates
Copy link
Collaborator Author

I feel that a potentially easier de-dup approach is to compare the distances in the URL rather than in the title. The URLs will be subjected to some normalization (e.g. keeping the hostname & pathname only, case normalization, removing extra slashes and dots in paths, etc.) first.

@zoe-translates
Copy link
Collaborator Author

Hmm, no, I think the title may actually work better than URL, because the URL of the special "current page" item may look quite unlike the URL reported by the DOI-generated item. Example: https://physics.aps.org/articles/v16/127 EM-generated page item has the same URL, but its own DOI (mentioned on the page in a href hence picked up by the DOI web translator) resolves to https://link.aps.org/doi/10.1103/Physics.16.127

@zoe-translates
Copy link
Collaborator Author

@AbeJellinek, following your suggestions I added a de-dup routine that checks whether the current web page's URL or title matches a DOI-resolved item's, based on a normalized Levenshtein distance. The metric and threshold are rather arbitrary but I think they'll catch the obvious dups.

@dstillman
Copy link
Member

Webpage -> Web Page, I think. We use that spelling in the Connector ("Web Page with Snapshot") and in the name of the item type.

I've been meaning to change that for about a decade, so "Webpage" is fine, but it should really be a localized string from the connector. Can we just return the current URL and leave it to the connector to sub in a different string there if it matches?

@zoe-translates
Copy link
Collaborator Author

A personal wish of mine is to have easy access to proper localization in translator scripts. It would be great if the Connector could support that.

Can we just return the current URL and leave it to the connector to sub in a different string there if it matches?

This is certainly doable. Until the connector can do the substitution, it will fall back to displaying the URL, which may look a bit weird but could be a bigger problem with screen readers.

@dstillman
Copy link
Member

Until the connector can do the substitution, it will fall back to displaying the URL, which may look a bit weird but could be a bigger problem with screen readers.

It should just be the current page title, no?

@zoe-translates
Copy link
Collaborator Author

Right now, it is the text Current Web Page, followed by the page title in parentheses.

Example: Current Web Page (Dark Star Hypothesis Sees the Light of Day) for https://physics.aps.org/articles/v16/127

@dstillman
Copy link
Member

Yeah, so it should just be the page title, and the connector can change it if the URL matches the current page.

@zoe-translates
Copy link
Collaborator Author

What else does the connector need, in order to do this substitution? Does it need the translator to pass a { string: Item } (or an Object resembling an Item) mapping into Zotero.selectItems?

@dstillman
Copy link
Member

Unless I'm misunderstanding something, it doesn't need to do anything other than put the current page first.

select[doi] = item.title || "[" + item.DOI + "]";
}
}
let select = buildSelections(items);
Zotero.selectItems(select, function (selectedDOIs) {
Copy link
Collaborator Author

@zoe-translates zoe-translates Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To continue the discussion from #3130 (comment) ...

@dstillman

I'm afraid I may be misunderstanding something.

My understanding is that we're discussing about the text that is to appear in the "Zotero Item Selector" popup for the special "current web page" item. The object select here, passed to Zotero.selectItems(), is responsible for what the user sees in the Item Selector. This object is a simple mapping of string (DOI) -> string (title as shown to the user in the Item Selector).

If I understand it, you would be saying that here what the translator should do was to put the current page's title as the value for the special item.

What I was asking was whether this could be sufficient for the "text substitution" (into the localized string for "current web page") to be implemented in the Connector. I was asking because I didn't understand how that could be done without any URL info present in the select object. But I may be missing something.

@AbeJellinek
Copy link
Member

Maybe an empty string as the key instead?

@zoe-translates
Copy link
Collaborator Author

Here's more context about why I was asking

https://github.com/zotero/zotero-connectors/blob/d5f025de9b4f513535cbf4639c6b59bf115d790d/src/common/itemSelector/itemSelector.js#L48-L49

It's clear that the first argument (mapping of selectable items) passed to Zotero.selectItems() can be a { string: Item } mapping, although in practice we almost always pass a string-string mapping in the translator script. This is also partly due to the fact that we don't expect the connector to do anything substantive with that piece of data.

But for @dstillman's idea to work smoothly with translators, I think a { string: Item } mapping may be necessary here, if text substitution for the title is to be done in the Connector.

@AbeJellinek, what do you mean by "as key"?

@zoe-translates
Copy link
Collaborator Author

zoe-translates commented Sep 8, 2023

Based on how pre-selection is implemented (see e.g. PubMed.js)

translators/PubMed.js

Lines 172 to 175 in 5187948

items["u" + uid] = {
title: ZU.trimInternal(title),
checked: checkbox && checkbox.checked
};

I created a PoC for a mechanism to tell the item selector about the current webpage among them. The following is a demo in Scaffold: https://github.com/zotero/zotero/compare/40b30f2b632d54bc31f94715d4251687af38744c...zoe-translates:zotero:PoC-make-it-possible-to-identify-current-page-by-translator-for-multiple-selection?expand=1

This puts the onus on the translator, so on the connector side there's minimal change to existing code and little new logic.

Below is how it looks in Item Selector

Current page is already one of the items referred to by DOI on the page itself:
Proof of concept for identifying the current page (adding label to one of the items)

Current page as an option in addition to the DOI-resolved items (always the first one):
Proof of concept for identifying the current page (an extra item, always the first one)

On the translator side, it is supposed to be like this:

translators/DOI.js

Lines 281 to 286 in bb8479e

select[doi]
= {
title: item.title,
isCurrentPage: (doi === possibleCurrentWebPageDOI
|| doi === MAGIC_INVALID_DOI)
};

Of course, the ultimate goal is to do i18n properly, without using a hard-coded text label.

@dstillman @AbeJellinek I wonder how you feel about this solution?

@AbeJellinek
Copy link
Member

AbeJellinek commented Sep 13, 2023

So the issues here as I see them are:

  1. DOI translator uses DOIs as the keys in its selectItems() object, instead of URLs like most translators.
  2. We don't want to do something too specific to the DOI translator here - other translators might want a similar mechanism in the future
  3. We don't necessarily want to implement a full localization system for translators just to solve this issue. We've been fine without it in the past, and localization of attachment titles, which is most needed, is probably better handled in the client (Localize attachment titles zotero#3088) than by updating our 500+ (600+?) existing translators to call a new getString() whenever they need a localizable string.
  4. Any solution we implement for this should ideally be backwards-compatible with older client/connector versions. An older runtime should just show an unlocalized title.

I'm fine with the approach @zoe-translates proposes above, but I sort of like the idea that an index === 0 && key === document.location.href-based approach would automatically localize selectItems() dialogs for other translators that do something similar. I know we have one or two that use selectItems() on a book page with a list of sections and allow you to select the full book (current page) and/or its sections.

But why not use the page URL as the key, just for this special case? The translator would call selectItems() with something like:

Zotero.selectItems({
	'https://www.example.com/some/page': 'Page title',
	'10.48905/some/other/DOI': 'Title of a paper that got mentioned',
	/* and so on */
});

And the selectItems handler in the connector would automatically localize the title of that first entry because its key matches the current URL. That would work for all translators, the vast majority of which use URLs as keys.

@zoe-translates
Copy link
Collaborator Author

Thanks for the explanation @AbeJellinek

Let me try to summarize the current proposals

  1. Identify the selectable item as the current web page (the page on which the connector was invoked) by the URL. If a key of the object passed to Z.selectItems() is equal to the current page's location, it is given special treatment. (from you)

  2. In the object passed to Z.selectItems(), signal the identification of a particular item as the current page in the value rather than the key. This is done by setting that value to an object that has the property isCurrentPage set to true. (from me)

Notice that to take advantage of either, most translator code would have to be changed anyway, because you have to put the current page in the selectable items as an offer and be ready to handle it in the translator.

Some pros and cons of these proposals that I can think of -

  • First proposal, pros:

    • It follows the informal convention of URLs as keys; the translator author don't have to use a special key for that.
    • The identification is unique because the keys in the object are unique.
  • First proposal, cons:

    • For translators not using URLs as the keys, extra work in the translator has to be done anyway to special-case the current page. This can happen when the translator is based on identifiers (DOI, Google Scholar, some others based on API calls). (More about this later)
    • The page location is in general volatile and subject to manipulation live by scripts on the page. This might interfere in certain edge cases (?)
    • Being implicit from the translator, it cannot be disabled (any real use case for disabling it?)
  • Second proposal, pros:

    • Relatively little needs to be done on the side of the connector/ingester side.
    • Translator author has full control.
  • Second proposal, cons:

    • A new feature that requires the awareness of the translator author and therefore cannot be picked up by existing code without change.
    • It's possible to mark multiple items as "current page"; there's no inherent guard against this.

It seems to me that the first approach is probably better because its benefits are more concrete.

BTW, you mentioned "index === 0". Does this mean that the logic in the connector will check the iteration order of the key as an additional condition? I don't think this is necessary. In the case of my second screenshot in the previous comment (from the page https://physics.aps.org/articles/v16/127), one of the DOIs gathered from the page content happens to refer to the page itself, a fact determined by inspecting the DOI resolution results. If we are to check the ordering in addition to the "key == url" condition, in the translator code we would have to maintain the condition. This means creating a new object, put that current-page one first by insertion order, and copy the rest. Which I consider is just extra complexity.

@AbeJellinek
Copy link
Member

This means creating a new object, put that current-page one first by insertion order, and copy the rest. Which I consider is just extra complexity.

Yeah, fair.

The page URL changing is a concern, but I think in practice the likelihood that it changes between when doWeb() is called and when selectItems() is called is low.

@zoe-translates
Copy link
Collaborator Author

Alright! I think we have a plan. Thanks for these discussions!

@AbeJellinek So I'll go on to make the necessary changes to this DOI translator (which will still be backward-compatible even before Connector gets updated). Meanwhile, do you mind if I go on to update Connector, Scaffold, and translation-server with the proposed changes?

zoe-translates added a commit to zoe-translates/translators that referenced this pull request Sep 15, 2023
In anticipation of updates to the Connector in the item-selector, where
a special key -- the translation-initiating page's URL -- will be used
to label it as the current page with proper localization, changes
are made here to take advantage of that upcoming feature, and display
non-localized title label as fallback before the Connector is updated.

(Cf. zotero#3130 (comment))

The logic of doing so is a bit non-trivial, and the code is now more
extensively commented.

Some minor simplifications are done in passing.
@AbeJellinek
Copy link
Member

Meanwhile, do you mind if I go on to update Connector, Scaffold, and translation-server with the proposed changes?

Go for it.

@dstillman
Copy link
Member

I serialized the DOI Content Negotiation requests. Can you rebase on the latest DOI.js?

…iple

If the DOI translator turns out to be the one that matches the current
page, and if the DOI (this) translator detects the item type as
"multiple", offer the current page as one of the choices (indeed, the
first choice).

In the item selection dialog, this special choice will show up as
"Current Webpage", followed by the page title in parenthesis if the
title is present.

The specific situation that triggers this behavior is the following

1. None other than the DOI web translator (lowest priority) matches the
   current page
2. The current page itself is not the main resource identified by DOI
   present on the current page (e.g. a page with its own DOI in URL)

Resolves zotero#3126.
…ultiple

Use a dissimilarity metric based on item URL and title to de-duplicate
the special "current web page" item. A test case is added for this
scenario.

Also, add a visual cue (text "possibly current web page" in parentheses
after the title) in the item selection dialog for the choice that may be
referring to the current web page.
…esult

Also, prefer "web page" to "webpage" in human-readable text
In anticipation of updates to the Connector in the item-selector, where
a special key -- the translation-initiating page's URL -- will be used
to label it as the current page with proper localization, changes
are made here to take advantage of that upcoming feature, and display
non-localized title label as fallback before the Connector is updated.

(Cf. zotero#3130 (comment))

The logic of doing so is a bit non-trivial, and the code is now more
extensively commented.

Some minor simplifications are done in passing.
…tions

- Guard against the very rare case when empty object could be passed to
  Z.selectItems()
- Eliminate superfluous intermediate variable
- Check if URL domains form subdomains rather than simple substrings
- Path comparison is verbatim up to any final trailing slash
- Title comparison is done after stripping out punctuations
@zoe-translates zoe-translates force-pushed the issue3126-DOI-offer-current-website-as-choice-multiple branch from deff79d to 676e9a8 Compare October 18, 2023 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

DOI: Offer current webpage as first option if not present in DOI results
3 participants