zbMATH: Updates and multiple fixes. #3052

Open
wants to merge 7 commits into
base: master

zoe-translates
Collaborator

  • Updated the selectors/XPaths to match the current state of the site.
  • Prefer selector to XPath to simplify code.
  • Made the scrape()/doWeb() functions async.
  • Changes to keyword/tag handling: the returned tags now contain MSC numbers, their readable labels, and the "Keywords" content.
  • Strip the duplicated characters in MSC labels and abstracts that had been caused by inline MathML rendered by MathJax. In abstracts, the math content is replaced by their LaTeX annotation, surrounded by the dollar signs ($ $), to mark the places where math text appeared.
  • Prefer the cleaner permalinks in URL fields.
  • Updated test cases.

Resolves #3039

Collaborator

@adam3smith adam3smith left a comment

A couple of small things -- I'd want dstillman or AbeJellinek to chime in on the handling of LaTeX in fields

@@ -158,16 +216,16 @@ var testCases = [
 "date": "2012",
 "DOI": "10.1002/rsa.20472",
 "ISSN": "1042-9832",
-"abstractNote": "We prove that a given tree TT on n vertices with bounded maximum degree is contained asymptotically almost surely in the binomial random graph G(n,(1+ε)lognn)G\\left(n,\\frac {(1+\\varepsilon)\\log n}{n}\\right) provided that TT belongs to one of the following two classes: \n\n(1)TT has linearly many leaves; (2)TT has a path of linear length all of whose vertices have degree two in TT.",
-"extra": "MSC2010: 05C05 = Trees\nMSC2010: 05C80 = Random graphs (graph-theoretic aspects)\nZbl: 1255.05045",
+"abstractNote": "We prove that a given tree $T$ on n vertices with bounded maximum degree is contained asymptotically almost surely in the binomial random graph $G\\left(n,\\frac {(1+\\varepsilon)\\log n}{n}\\right)$ provided that $T$ belongs to one of the following two classes: \n\n(1)$T$ has linearly many leaves; (2)$T$ has a path of linear length all of whose vertices have degree two in $T$.",
Collaborator

I don't quite know what to do about this -- we don't actually support TeX in Zotero fields (other than the new notes), so this is a bit messy, but I'm also not sure what else we could do.

Collaborator Author

@zoe-translates zoe-translates Jun 15, 2023

No, we don't, and yes this is a bit messy here. I'd like to hear more thoughts about this too.

Before this change, a one-letter math symbol rendered by MathJax came out as "TT". It's even worse now: without the change it would become "TTT" under newer MathJax. In addition, more complicated MathML loses meaning when converted to text in the usual way. For instance, the fraction line is lost, so "log n over n" becomes "lognn" in the text.

In other words, without further processing, meaning could be easily destroyed, and silently. It's difficult to spot the change from "T" to "TTT" in the wall of text.

So I chose to preserve the LaTeX-y annotation as a substitute and mark it as such with $ .. $. This at least signals to the reader that there used to be some rendered math here, and the LaTeX source is in principle a lossless substitute.
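
To make the trade-off concrete, here is a minimal sketch in plain JavaScript. It does not touch the DOM and is not the translator's code: the segment shape (`rendered`, `assistive`, `tex`) and both function names are hypothetical, and it assumes the duplication comes from the page carrying more than one text copy of each formula.

```javascript
// Hypothetical model of a scraped paragraph: plain strings are ordinary text,
// objects stand for one MathJax-rendered formula.
//   rendered  - text of the visible rendering
//   assistive - text of a second, hidden copy of the same formula (assumption)
//   tex       - the LaTeX source kept alongside the formula
const abstractSegments = [
	"a given tree ",
	{ rendered: "T", assistive: "T", tex: "T" },
	" on n vertices"
];

// Naive extraction: concatenating every piece of text picks up every copy
// of the formula, so a single symbol is duplicated.
function naiveText(segments) {
	return segments
		.map(seg => (typeof seg === "string" ? seg : seg.rendered + seg.assistive))
		.join("");
}

// Substitution: keep only the LaTeX source and mark it with $ ... $.
function laTeXifiedText(segments) {
	return segments
		.map(seg => (typeof seg === "string" ? seg : "$" + seg.tex + "$"))
		.join("");
}

console.log(naiveText(abstractSegments));      // a given tree TT on n vertices
console.log(laTeXifiedText(abstractSegments)); // a given tree $T$ on n vertices
```

The second form is longer, but the $ markers make it obvious where math used to be, which the silent "TT" does not.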

Member

Yeah, I think this is a reasonable approach. And, honestly, we could probably support math in abstract fields pretty easily (just showing as $…$ in edit mode).

Collaborator Author

BTW, this is roughly consistent with the arXiv translator's output (e.g. see https://arxiv.org/abs/2306.07357). There, the abstract is handed to us by the OAI API, which is a verbatim copy of what the preprint author puts into that field.

// Clean up the MathJaX-rendered text in elements. Returns a clone of the node
// with the duplicate-causing elements removed and the LaTeX math text
// converted to text nodes (surrounded with $ $ if laTeXify = true).
function cleanupMath(element, laTeXify = true) {
Collaborator

I'd want to hear from @dstillman or @AbeJellinek what Zotero's view is on handling LaTeX/MathJaX in fields. It's currently not supported, so adding things like $$ doesn't do any good, but given the nature of the translator it might still make sense?

Member

I think generally we just save the LaTeX as is, even though nothing in Zotero will render it. I hadn't thought about it very much but I think I agree with the approach here - try to use the rendered Unicode version of the LaTeX for short fields, keep the LaTeX in the abstract and similar.

@zoe-translates
Collaborator Author

zoe-translates commented Jun 15, 2023

Title with math probably needs cleaning, too. I'll do this later. Updated in f3cb2e3.

Examples:

https://zbmath.org/7694014
https://zbmath.org/7693571

- Eliminated a little bit of dead code.
- Used less cryptic syntax for the "guard" logic.
- Removed unnecessary use of `Zotero.Utilities` namespace for top-level
  names.
The title may contain math text, and it's currently not very well
understood by the BibTeX import translator. A more reliable way is to
scrape the page.
@zoe-translates
Collaborator Author

zoe-translates commented Jun 16, 2023

Hmm, there's more to that.

The title cleanup should be applied in getSearchResults() too, to prevent the "TTT" effect in the item selection dialogue.

And I'll reconsider how to clean up item.title: This will go into the full-text PDF filename. So we'd prefer less special characters such as $ or \. Since math in titles is usually limited to simple inline elements, I'll probably choose the inner text of the rendered MathML over the TeX source.

@dstillman
Member

This will go into the full-text PDF filename. So we'd prefer less special characters such as $ or \

That doesn't matter. Zotero will automatically remove any characters that aren't valid in filenames.

@zoe-translates
Collaborator Author

I don't think fs compatibility would be a problem.

The problem is that characters such as $, being LaTeX control characters, would look out of place in a filename, and this may not be what the user expects.

As an example, for this article https://zbmath.org/7695752
The filename
BMO ε-regularity ... .pdf
would look more consistent with usual filenames than would
BMO $\varepsilon$-regularity ... .pdf.

Also, for users less familiar with the rules of shell variable expansion, $ in filenames might cause havoc when they use the command line.

@dstillman
Member

Ah, got it.

- In titles, the math is typically brief inline text. By using rendered
  text instead of LaTeX source, the saved PDF's file names will contain
  fewer "special" characters. This improves interoperability, and reduces
  the likelihood of certain user errors with "special" characters in
  file names. In addition, the look and feel will be closer to what
  users normally expect.
- In tags, which are already brief, the LaTeX math is a distraction.

Note that the MathML extraction routine is tested against most typical
MathJax preferences (and with the Firefox "Native MML" extension). If
Assistive MathML is turned off (default on, and strongly recommended),
the result will be more accurate. In the extreme case of Assistive
MathML off and SVG rendering on, the math text may disappear altogether.
@zoe-translates
Collaborator Author

Errrh, wrong commit message of 2f77a8a. It should've read

"If Assistive MathML is turned off (default on, and strongly recommended), the result will be less accurate."

@zoe-translates
Collaborator Author

In this comment to the issue #3039 (comment), the user suggested that the callNumber field be set to the Zbl ID without any Zbl or Zbl: prefix (because libraryCatalog is already set to zbMATH?). @adam3smith, @nonobsense, please let me know if this is the correct understanding? And is this how it "should" be done?

I'm asking because I see that in items translated from arXiv, the archiveID field's value includes the arXiv: prefix.

@adam3smith
Collaborator

I don't love identifiers in callNumber fields in the first place. We are typically putting identifiers (as opposed to actual call numbers) into Extra. To make sense in Extra, they do need the prefix.
arxiv: is part of the Archive ID because arXiv includes arXiv as part of the identifier. Generally, where identifiers aren't otherwise self-identifying (like, say, DOIs), it makes sense to store them with a namespace-y prefix.

@zoe-translates
Collaborator Author

To make sense in Extra, they do need the prefix

This is what the code does atm. The identifier shows up as Zbl: [...] on a line in the extra.
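
As a rough illustration of that convention (not the translator's actual code), the sketch below appends a prefixed identifier line to an Extra string. The helper name `addExtraLine` and the plain `item` object are hypothetical; the Zbl number is the one from the test case quoted earlier in this thread.

```javascript
// Hypothetical helper: append a "Label: value" line to an Extra string,
// unless that exact line is already present.
function addExtraLine(extra, label, value) {
	const line = `${label}: ${value}`;
	const lines = extra ? extra.split("\n") : [];
	if (!lines.includes(line)) lines.push(line);
	return lines.join("\n");
}

// Stand-in for a Zotero item; only the "extra" field matters here.
const item = { extra: "" };
item.extra = addExtraLine(item.extra, "Zbl", "1255.05045");
console.log(item.extra); // Zbl: 1255.05045
```

Keeping the prefix on the line is what makes the bare number self-identifying once it sits among other Extra entries.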

@zoe-translates
Collaborator Author

There are a few more issues.

  • When we save multiple items from search results, MathJax source is not rendered (we only get the static DOM). If we want to "laTeXify", that's not a big problem: we can simply do a text search and replace, because \( \) are clear markers of MathJax source. But there will be no rendered text, even for a single Greek letter from e.g. \alpha. I'm adding a few more lines to get rid of \( \) when we "laTeXify".
  • Another problem is saving "snapshot". Do we need this?
		item.attachments = [{
			title: "Snapshot",
			document: doc
		}];

It may be useful to save a fully rendered page as a single file, for the math, but I'm not sure it's worth it.

When saving items from a multiple-result search page, try to be more
consistent with the behavior of single-item saving, by converting the \(
\) delimiters for MathJax source into $ $ in abstracts.
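
A minimal sketch of such a delimiter conversion, assuming the unrendered abstract is available as a plain string. The function name `mathJaxDelimitersToDollars` is hypothetical, and the regular expression is one possible approach rather than the code in this PR.

```javascript
// Hypothetical helper: turn MathJax inline delimiters \( ... \) into $ ... $.
// The "s" flag lets a formula span line breaks; the lazy quantifier stops at
// the first closing \) so adjacent formulas are converted separately.
function mathJaxDelimitersToDollars(text) {
	// A replacer function is used instead of a replacement string because
	// "$" has a special meaning in String.prototype.replace patterns.
	return text.replace(/\\\((.+?)\\\)/gs, (_, tex) => "$" + tex + "$");
}

const raw = String.raw`Let \(\alpha\) be a root.`;
console.log(mathJaxDelimitersToDollars(raw)); // Let $\alpha$ be a root.
```

Text without \( \) passes through unchanged, so the same call is harmless on abstracts that contain no math.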
Development

Successfully merging this pull request may close these issues.

zbMATH does not record the url nor the Zbl identifier
4 participants