BDTD file importer #2852

Open: wants to merge 8 commits into base: master

felipeaf

Import search results from Biblioteca digital brasileira de teses e dissertações (BDTD; Brazilian digital library of theses and dissertations; https://bdtd.ibict.br), exported as JSON.

felipeaf and others added 4 commits July 16, 2022 02:20
@AbeJellinek
Member

Thanks, a couple things:

  1. I don't think this should be an import translator at all. Just keep the web part and make the JSON translation routine internal. We don't normally have import translators for specific sites' formats without good justification.
  2. The file is huge! Almost 31,000 lines and over a megabyte in size? Please trim down the test cases - there's no way we need that much.
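
To illustrate point 1, here is a minimal sketch of a JSON mapping routine kept as an ordinary internal helper of a web translator rather than as a standalone import translator. The record shape (`title`, `authors`, `abstract`, `year`, `subjects`) and the helper name `recordToThesis` are hypothetical and not taken from the PR or from BDTD's actual export; the output is a plain object using Zotero's item field names.

```javascript
// Hypothetical record shape; none of these field names are confirmed for BDTD.
function recordToThesis(record) {
  const item = {
    itemType: 'thesis',
    title: (record.title || '').trim(),
    creators: [],
    tags: [],
  };
  for (const name of record.authors || []) {
    // Assumes "Last, First" ordering; a name without a comma becomes lastName only.
    const [lastName, firstName = ''] = name.split(',').map(s => s.trim());
    item.creators.push({ lastName, firstName, creatorType: 'author' });
  }
  if (record.abstract) item.abstractNote = record.abstract;
  if (record.year) item.date = String(record.year);
  for (const subject of record.subjects || []) {
    item.tags.push({ tag: subject });
  }
  return item;
}
```

A web translator could call a helper like this from its scraping code once it has fetched the JSON for a record, so the site-specific format never needs to be exposed as an import format.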

@AbeJellinek
Member

AbeJellinek commented Jul 26, 2022

OK, looks like the huge file size is coming from the second test. A few issues contributing:

  1. We need to call selectItems() on search results pages. The translator shouldn't be importing every single search result without giving the user a chance to choose.
  2. It should allow you to select from among the search results visible on the current page. So if I'm on a page with 10 search results displayed, I should get ten options in the selectItems() dialog. Right now it automatically imports every single search result on every single page of the search (hence the gigantic file size).
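
The two points above can be sketched roughly as follows, under stated assumptions. `visibleRecords` is a hypothetical array of `{ id, title }` objects for the results shown on the current page only, and the `selectItems` parameter is a stand-in for `Zotero.selectItems`, injected so the snippet runs outside Zotero. The helper names are illustrative and not part of the PR.

```javascript
// Build the id -> title map that a selection dialog would display.
// Only records from the current page are passed in, so ten visible
// results yield at most ten options.
function buildSelectionMap(visibleRecords) {
  const items = {};
  for (const record of visibleRecords) {
    if (!record.id || !record.title) continue; // skip incomplete entries
    items[record.id] = record.title.trim();
  }
  return items;
}

// Ask the user which results to import and return the chosen ids.
// `selectItems` is assumed to resolve to the chosen subset of the map,
// or to a falsy value if the user cancels.
async function chooseFromPage(visibleRecords, selectItems) {
  const items = buildSelectionMap(visibleRecords);
  if (Object.keys(items).length === 0) return [];
  const selected = await selectItems(items);
  if (!selected) return [];
  return Object.keys(selected);
}
```

In an actual translator, the returned ids would then be scraped one by one, so nothing is imported that the user did not pick.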

@felipeaf
Author

Hi! Actually, the last commit was a mistake. I hadn't finished the web translator; when I opened the pull request, I had the file importer working and a reasonably small test file. I will try to finish the web part soon.
About the import translator, I can change it if it's not relevant to Zotero. But let me give some context about this site: it is a service maintained by the Brazilian government, and it includes theses from several Brazilian universities.

@felipeaf
Author

OK, I reverted the last commit, because pushing it to master was a mistake, so now the head has just a small test case and a file importer that works.

About the web importer: I said before that I would finish it soon, but I didn't know how web importers work, and now I see that the JSON importer I wrote is not useful for importing just one page. This JSON format is one of two formats in which the user can download the full search results by clicking an "export" button (the other is CSV; both look nonstandard). It's a download option for the user, not JSON used by the page itself. To import just one page, it would be better to parse the HTML DOM instead.

But that seems to me like a different use case. The user can download and import a full search result in order to do a systematic literature review. As I said, BDTD is maintained by the government, and I guess it has some relevance for Brazilian researchers, because all master's and doctoral theses from a lot of Brazilian universities are there.

@AbeJellinek
Member

This should be a web translator and should support both search results and individual item pages. A site's size isn't an argument for making its JSON schema into an import translator - if that schema isn't used by more than just one site or as an interchange format, there's no point.

I'd imagine there's a way to export a single item as JSON, even if it's not exposed in the frontend - is there not?

@felipeaf
Author

felipeaf commented Jul 27, 2022

Hi! I've found a way to get that JSON for only the items visible on the page, and I'll try to finish the web translator the right way later, but I have a problem. Zotero already has some web translator that partially works with BDTD. It imports only title, authors and year, but a lot of data is missing (including abstract and tags). I checked, and there is no reference to BDTD URLs in the repository at all, but I guess something works because the BDTD site is an instance of the VuFind software, which is used a lot for this kind of site. The problem is that I don't know how to check which translator is doing that, or whether I can write a more specific one.

* Actually, that JSON link is an API from VuFind (check https://vufind.org/wiki/development:apis:search#search_api). I don't know if it can be reused with other VuFind-based sites. That JSON is too weird to be a standard, it should be based in some local setting.
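
As a rough illustration of fetching JSON for only the current page, the sketch below turns a search-results page URL into a search API URL. It assumes the API accepts the same `lookfor`, `type` and `page` query parameters as the results page, plus a `limit` parameter for page size; those names, the example host, the API path, and the helper `searchPageToApiUrl` are all assumptions to be checked against the VuFind API documentation and the target site.

```javascript
// Copy the search parameters of a results page onto an API endpoint URL.
// `apiSearchUrl` is passed in explicitly because the API path differs
// between installations; the value used below is hypothetical.
function searchPageToApiUrl(pageUrl, apiSearchUrl, limit = 20) {
  const page = new URL(pageUrl);
  const api = new URL(apiSearchUrl);
  for (const key of ['lookfor', 'type', 'page']) {
    const value = page.searchParams.get(key);
    if (value !== null) api.searchParams.set(key, value);
  }
  // A results URL without an explicit page number is treated as page 1.
  if (!api.searchParams.has('page')) api.searchParams.set('page', '1');
  api.searchParams.set('limit', String(limit));
  return api.toString();
}

const apiUrl = searchPageToApiUrl(
  'https://vufind.example.org/Search/Results?lookfor=redes+neurais&type=AllFields&page=3',
  'https://vufind.example.org/api/v1/search',
  10
);
console.log(apiUrl);
```

The `limit` value would need to match the number of results the page actually displays for the JSON to line up with what the user sees.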

@AbeJellinek
Member

> That JSON is too weird to be a standard, it should be based in some local setting.

Why? It seems like a standard feature according to that page...

What isn't working well with the current VuFind translator? Can we just add some fixes there instead of writing a new one with a different API?


2 participants