A user journey in OpenAIRE services through the lens of repository managers (I – OpenAIRE interoperability guidelines, the content acquisition policy and the graph expansion)
A user journey in OpenAIRE services through the lens of repository managers - #OpenREPO2019 workshops 1st part
1. @openaire_eu
A user journey in
OpenAIRE services
through the lens of repository managers
Pedro Príncipe, University of Minho; Alessia Bardi, CNR-ISTI; André Vieira, University of Minho; Jochen Schirrwagen, Bielefeld University
OR2019 Workshop – 1st part
3. 09:00 – Welcome and introduction, Pedro Príncipe
09:20 – OpenAIRE graph expansion: an academic graph aggregating all information required to deliver
monitoring tools
09:50 – OpenAIRE content acquisition policy and the new terms of agreement for content providers.
10:05 – Explore service demo (and beta test drive) + Showcase metadata quality issues
10:30-11:00 – Coffee break
11:00 – OpenAIRE interoperability guidelines overview
11:10 – Guidelines for Literature Repositories: implementation and early adopters
11:30 – RCAAP use case & HAPLO use case
11:50 – OpenAIRE Validator demo - testing the compliance against version 4
12:00 – Breakout groups - discussion
12:20 – Wrap-up
AGENDA
14th International Open Repositories Conference, June 10th, Hamburg, Germany
6. Research
communities
Researchers (All)
Content providers
Innovators
Research
managers
Funders
Building the OpenAIRE research graph and the Dashboard services
Infrastructure
Validation
Cleaning De-duplication
Inference
Project community
Funder Funding
Product
Publication
Data Software
Organization
TERMS
OF USE
Harvesting Uploading
Brokering
Source
ORP
Publications
repositories
Data
repositories
Hybrid
repositories
Registries
OA
Journals
Software
repositories
Content Providers Research
Infras
GUIDE
LINES
11. OpenAIRE
graph expansion
An academic graph aggregating all
information required to deliver monitoring
tools
Slides by Paolo Manghi
Presented by Alessia Bardi
Institute of Information Science and Technologies - CNR
13. Providing an open metadata
research graph of interlinked
scientific products, with access
rights information, linked to
funding information and research
communities
The OpenAIRE research graph
Open
Complete
De-duplicated
Transparent
Participatory
Decentralized
Trusted
15. • Repositories and publishers
Download from URLs in harvested metadata: 6.8Mi
Machine-learning on OA URLs from large aggregators (DOAJ,
CrossRef/Unpaywall): 3.3 Mi (downloaded, under integration in
BETA)
• Publishers metadata/PDFs via CORE-UK
ResourceSync
Springer Open Access, etc.: 750K
Open Access articles sources
16. Mining results: links
Project community
FunderFunding
Product
Publication Research Data Software
Organization
Source
Other res.
products
5.17M
1.96M
1.24M
218M
fundedBy
affiliated
refersTo
similarTo
75k
40k
relatedWith
17. • Document classification: 3.86M of pubs with at least one
class assigned:
arxiv: 2.35Mi, meshEuroPmc: 3.64Mi, acm: 832k
• Document properties
New abstracts: 1.3Mi
• Document references
168.44M bibliographic references for 5.33M pubs
• Document external links
PDB reference extraction: 320k references (68k of unique pubs)
Mining results: properties
21. De-duplicated
Entity type | # Collected records | # Records after cleaning and de-duplication | # Identified duplicates
Publications | ~343M | ~94M | ~249M
Data | ~5.2M | ~4.6M | ~600K
Software | ~150K | ~134K | ~20K
Other | ~5M | ~4.5M | ~500K
Organisations | ~380K | ~220K | ~160K
More information about the de-duplication framework used by OpenAIRE can be found
by searching Zenodo for:
• “De-duplicating the OpenAIRE Scholarly Communication Big Graph” (poster)
• “GDup: De-Duplication of Scholarly Communication Big Graphs”
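The duplicate counts in the table can be cross-checked with simple arithmetic: duplicates are the collected records minus the records left after cleaning and de-duplication. The sketch below uses the rounded figures from the slide; the dictionary and the function name `duplicate_stats` are illustrative and not part of any OpenAIRE tool.

```python
from typing import Tuple

# Rounded figures from the slide: (collected records, records after de-duplication).
counts = {
    "Publications": (343_000_000, 94_000_000),
    "Data": (5_200_000, 4_600_000),
    "Software": (150_000, 134_000),
    "Other": (5_000_000, 4_500_000),
    "Organisations": (380_000, 220_000),
}


def duplicate_stats(collected: int, after: int) -> Tuple[int, float]:
    """Return (number of duplicates, share of collected records that were duplicates)."""
    duplicates = collected - after
    return duplicates, duplicates / collected


for entity, (collected, after) in counts.items():
    dupes, share = duplicate_stats(collected, after)
    print(f"{entity}: {dupes:,} duplicates ({share:.0%} of collected)")
```

With these rounded inputs the software row gives 16,000 duplicates, which the slide reports as roughly 20K; the other rows match the slide's figures.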
22. • Rely on quality scholarly
communication sources of
different kinds
Participatory
• Include solutions and content
from any interested and known
content provider in scholarly
communication
Institutional repositories
Aggregators
Data archives
Software repositories
Research infrastructure sources
Funder grant databases
Authors & Orgs entity registries
Publishers & journals
23. • Metadata in the graph includes provenance when harvested
and reliability indicators when obtained from mining
Transparent
24. • Preservation and ownership beyond OpenAIRE
Exchanged with other graph initiatives
Redistributed via subscription and notification to
contributing data sources (provide.openaire.eu)
• Openly accessible via APIs
(develop.openaire.eu)
Decentralized
25. • Authors in the loop to enrich their ORCID record
• Validation of end-user ”claims”
Trusted (in progress)
29. For building community-specific
gateways to Open Science
A virtual environment to implement Open Science publishing practices
Report to funders
Uptake of Open Science
publishing practices
Research Impact
All the relevant
research products
DEPOSIT ANYTHING
… linked
On demand publishing on services of
Research Infrastructures
30. Ongoing collaborations
Research Infrastructures/initiatives Disciplinary research communities
OpenAIRE - EOSC Hub - EC meeting | Amsterdam | 15th Dec 2017
• Sustainable Development Solutions
Network (Greece)
• Agricultural and Food Science
• Fisheries and Aquaculture Management
• European Marine Science
• Neuroinformatics
• Digital Humanities and Cultural Heritage
36. • Funders
• Trends in research fields: new (multidisciplinary) disciplines
• Projects
• Interconnections, possible liaisons
• Institutions
• OA/OS behavior, ability to attract cross-funder grants
Added value functionalities
39. • Funders
• Recent and past EC and other funders’ activities (representing various
funding levels)
• Checking compliance to funder mandates
• Institutions
• Collaboration network (by institution) via projects and products
• Projects
• Compare project portfolio against that of other similar institutions
(anonymized?)
Added value functionalities
40. Tell us!
Thought about an added
value functionality we did
not mention?
Explore the BETA graph
and tell us how to improve!
https://beta.explore.openaire.eu
43. ALL Literature, Research data,
Software, Other research products
www.openaire.eu/policies
Open Access & non-Open Access material
44. ALL Literature, Research data, Software, Other research products
OpenAIRE Content Acquisition Policy + complete
Respecting the OpenAIRE guidelines (DataCite metadata)
Using PIDs with resolvers
released 05-Oct-2018,
https://doi.org/10.5281/zenodo.1446408
www.openaire.eu/content-aquisition-policy
45. ALL SCIENTIFIC RESEARCH
PRODUCTS
literature, dataset, software,
other research products
METADATA QUALITY
with minimal quality
conditions under which
metadata can be accepted
OF ALL ACCESS LEVELS
open, closed, metadata only
what data/metadata
we collect
46. It’s important that the access
level of a record is made clear
Each record must contain a
PID (or URL) that resolves to a
splash page
It is vital that the access level
of a record is clear (by an
access level statement at
record level, alternatively by
the use of specific OAI-sets)
how we process
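The two conditions on this slide (a clear access level and a PID or URL that resolves to a splash page) can be sketched as a simple pre-harvest check. This is a minimal illustration, not the OpenAIRE Validator: the record layout (a dict with `rights` and `identifiers` lists) and the function names are assumptions, and the check is purely syntactic, so it does not contact the network to confirm that an identifier really resolves.

```python
from urllib.parse import urlparse

# Access-level terms in the info:eu-repo vocabulary (assumed here as the accepted set).
ACCESS_LEVELS = {
    "info:eu-repo/semantics/openAccess",
    "info:eu-repo/semantics/embargoedAccess",
    "info:eu-repo/semantics/restrictedAccess",
    "info:eu-repo/semantics/closedAccess",
}


def looks_resolvable(identifier: str) -> bool:
    """Rough syntactic check: an http(s) URL with a host part. No network access."""
    parsed = urlparse(identifier)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_record(record: dict) -> list:
    """Return a list of human-readable problems found in a hypothetical record dict."""
    problems = []
    if not any(r in ACCESS_LEVELS for r in record.get("rights", [])):
        problems.append("no recognised access level")
    if not any(looks_resolvable(i) for i in record.get("identifiers", [])):
        problems.append("no PID or URL that could resolve to a splash page")
    return problems


good = {
    "rights": ["info:eu-repo/semantics/openAccess"],
    "identifiers": ["https://doi.org/10.5281/zenodo.1446408"],
}
bad = {"rights": ["openAccess"], "identifiers": ["12345"]}
print(check_record(good))
print(check_record(bad))
```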
47. Metadata describing Open Access and non-Open Access material
will be included and links to other products will be resolved
where this is possible.
48. Metadata describing Open Access and
non-Open Access material will be
included and links to other products
will be resolved where this is possible
(i.e. the provided PIDs have a resolver).
as stated in the Content Acquisition Policy, published Oct. 2018
https://doi.org/10.5281/zenodo.1446408
55. EGI Application database
OMICS DI
Kaggle
ReactToMe
DOECode
Unpaywall
New data sources
56. Metadata Quality Challenges
Issue | Affects | Proposed Solutions
Missing values | Indexing, discovery, reuse | Curation by repository team; use OpenAIRE Validator, Broker service
Missing links and identifiers | Interlinking with other research products; contextualisation | ScholXplorer, Broker service
Lack of controlled values | Discovery | Use agreed controlled vocabularies according to OpenAIRE Guidelines
Mandatory values only | Discovery and reuse | Broker service
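A repository team could spot two of the issues above (missing values, and records that carry mandatory values only) with a small local report before exposing records for harvesting. The sketch below is an illustration under stated assumptions: the field lists `MANDATORY` and `OPTIONAL` are placeholders chosen for the example and are not the lists defined in the OpenAIRE Guidelines, and the record is a plain dict.

```python
# Placeholder field lists for illustration only; consult the guidelines for the real ones.
MANDATORY = ("title", "creator", "date", "type", "rights", "identifier")
OPTIONAL = ("description", "subject", "relation", "language")


def is_empty(value) -> bool:
    """Treat None, blank strings and empty lists as missing values."""
    return value is None or (isinstance(value, str) and not value.strip()) or value == []


def quality_report(record: dict) -> dict:
    """Report missing mandatory fields and whether the record has mandatory values only."""
    missing_mandatory = [f for f in MANDATORY if is_empty(record.get(f))]
    present_optional = [f for f in OPTIONAL if not is_empty(record.get(f))]
    return {
        "missing_mandatory": missing_mandatory,
        "mandatory_only": not missing_mandatory and not present_optional,
    }


minimal = {
    "title": "T", "creator": "A", "date": "2019",
    "type": "article", "rights": "open", "identifier": "x",
}
print(quality_report(minimal))
print(quality_report({**minimal, "title": "  ", "subject": ["repositories"]}))
```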
57. • Open Access version coming from one of the sources:
https://explore.openaire.eu/search/publication?articleId=dedup_wf_001::0ea9b3d0d7300315854e7f25e499d2b9
• Document classification:
https://explore.openaire.eu/search/publication?articleId=od______1874::6331f80a2b9758f56609a874e9addc26
• For more: look at the Content Provider Dashboard
Records enhanced by de-duplication
58. • Link to software (re-use):
https://explore.openaire.eu/search/publication?articleId=od________18::4405ffb18cc37d73d0daff3650e48f82
• Link from a software to its «main» publication:
https://explore.openaire.eu/search/software?softwareId=openaire____::949d7264f0efb7a27e521fee9c59209b
• Software not available on GitHub, but on
SoftwareHeritage only:
https://explore.openaire.eu/search/software?softwareId=openaire____::8bf2fbf6cb1f0c9552ca0a6fd0aecfbc
Records enhanced by full-text mining (1)
59. • Reference to a Research infrastructure:
https://explore.openaire.eu/search/publication?articleId=nora_uio__no::3197de1949480eb9f3fc82ba26ad2e25
• Link to project:
https://explore.openaire.eu/search/publication?articleId=dedup_wf_001::4be652d611c4bbcf897118bdb564c557
Records enhanced by full-text mining
60. • Take care of your PIDs:
https://explore.openaire.eu/search/dataset?datasetId=dedup_wf_001::69a0263a2925140e015c4470779f79c1
Quality issues
61. (Aaltodoc Publication Archive, DSpace)
comparison of OAI-PMH OpenAIRE endpoint/set and standard endpoint/set
• different number of records (due to former OpenAIRE Content Acquisition Policy)
completeListSize="12413" vs. completeListSize="36886"
• non-normalised resource types
<dc:type>info:eu-repo/semantics/article</dc:type> vs.
<dc:type>A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä</dc:type>
• non-normalised or missing access levels
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights> vs.
<dc:rights>openAccess</dc:rights>
A not so rare example
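The two non-normalised cases in this example can be handled with a small mapping step before records are exposed. The sketch below is only an illustration: the single entry in `TYPE_MAP` is taken from the pairing shown on the slide, a real repository would need its own complete mapping, and the set of access terms and the function names are assumptions rather than part of DSpace or OpenAIRE software.

```python
from typing import Optional

PREFIX = "info:eu-repo/semantics/"

# Local resource-type labels mapped to controlled terms (one entry, from the slide).
TYPE_MAP = {
    "A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä": PREFIX + "article",
}

# Bare access-level terms that are assumed to be safe to prefix.
KNOWN_ACCESS = {"openAccess", "embargoedAccess", "restrictedAccess", "closedAccess"}


def normalise_type(value: str) -> Optional[str]:
    """Return the controlled dc:type term, or None when no mapping is known."""
    value = value.strip()
    if value.startswith(PREFIX):
        return value
    return TYPE_MAP.get(value)


def normalise_rights(value: str) -> Optional[str]:
    """Return the prefixed dc:rights access term, or None for unrecognised values."""
    value = value.strip()
    bare = value[len(PREFIX):] if value.startswith(PREFIX) else value
    return PREFIX + bare if bare in KNOWN_ACCESS else None


print(normalise_type("A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä"))
print(normalise_rights("openAccess"))
```

Returning None for unmapped values, rather than guessing, keeps the gaps visible so that the repository team can curate them.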
64. Evolution of OpenAIRE-Guidelines
2010: Literature Guidelines v1
2012: Literature Guidelines v2; Data Guidelines v1
2013: Literature Guidelines v3
2014: Data Guidelines v2
2015: CRIS-CERIF Guidelines v1
2018: Guidelines for institutional and thematic repos. v4.0; CRIS-CERIF v1.1
2018: Guidelines for Software Repositories; Other Research Products
65. Diversity of Research Results from Different Types of
Sources
Publications
• Article
• Preprint
• Report
• …
Datasets
• Dataset
• Collection
• Clinical Trials
• …
Software
• Research
Software
• …
Other Research
Products
• Service
• Workflow
• Interactive
Resource
• …
Institutional/
publication
repositories
Journals/
publishers
Data
repositories
Other
Products
repositories
Software
repositories
66. Metadata Goals in OpenAIRE
Goal | Metadata Groups
Discovery and Citability | Descriptive metadata
Accessibility and Reuse | Access Rights, License Conditions
Contextualization | Research Project, Linked Research Artefacts
Interoperability | Identifier for Entities, Controlled Vocabularies
Reporting | Funding Reference
TDM | File Location, License Conditions
68. Metadata describing Open Access
and non-Open Access material will
be included and links to other
products will be resolved where
this is possible (i.e. the provided
PIDs have a resolver).
as stated in the
Content
Acquisition Policy
70. OpenAIRE Guidelines for
Literature Repository
Managers v4.0
http://dx.doi.org/10.5281/zenodo.1299203
(Released Nov-2018)
• Established standards: Dublin
Core and DataCite metadata
scheme
• To describe different kinds of
scholarly works
• Defines Application Profile
• Controlled Vocabularies and
Persistent Identifiers for
different entities
71. required
● PIDs for scholarly publications (with versioning)
● Deposition of content with LTP programme (e.g. CLOCKSS)
● Article-level metadata in an interoperable, non-proprietary format,
under a CC0 public domain dedication, incl. funding information
● Machine-readable information on Open Access status and
the license
Plan S - Requirements and Recommendations
72. recommended (strongly)
● PIDs for authors (e.g., ORCID), funders, funding programmes and
grants, institutions, and other relevant entities.
● Registering self-archiving policy of the venue in SHERPA/RoMEO.
● Availability for download of full text for all publications (including
supplementary text and data), e.g. JATS XML.
● Direct deposition of publications by the publisher into … Open
Access repositories that fulfil the Plan S criteria.
● OpenAIRE compliance of the metadata.
● Linking to data, code, and other research outputs.
● Openly accessible data on citations according to the standards by
the Initiative for Open Citations (I4OC).
Plan S - Requirements and Recommendations
73. Implementation in Repositories
Software | Supported Version | Status | Comments
DSpace | 7 (in prep.); 5 & 6 (in test) | In preparation - DSpace OpenAIRE 4.0 WG | Implementations by PT repos RCAAP for v.5; 70 days effort (WG timeline plans); documentation will be available ASAP
EPrints | All | Contacted | May need funding via Jisc or OpenAIRE
Invenio / Zenodo | | On their roadmap |
Islandora | | Contacted |
Librecat | | Contacted |
OPUS | 4 (in prod.) | Contacted |
MyCoRe | | Contacted |
HAL | | Contacted | May have very limited resources
Fedora | | Will contact |
Haplo | | Implemented |
76. ● Guidelines at https://openaire-guidelines-for-literature-repository-managers.readthedocs.io/en/v4.0.0/
● Schema and examples on GitHub
https://github.com/openaire/guidelines-literature-repositories
References
79. Why?
• Need for a specific format for scientific publications
• More specific metadata fields
• Support for hierarchical information (concept of entities)
• Metadata alignment with other services (DataCite, OpenAIRE, …)
86. What Changes? On Search Portal…
• New harvesting process
• Support for multi metadata schema (oai_dc; xoai; oai_openaire)
• Support for type of resource (repository, journal,…) and type of metadata
schema
• New transformation and validation processes
• DRIVER Types → COAR Types
• New ways to present the information
• On the user interface
• On OAI-PMH
• On REST API
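Supporting several metadata schemas on OAI-PMH means that a harvester selects the schema through the `metadataPrefix` argument of the request. The sketch below builds `ListRecords` request URLs for the three prefixes named on this slide. The base URL and the set name are placeholders, not the real RCAAP endpoint, and the function name is made up for the example; only the URL is constructed, no request is sent.

```python
from typing import Optional
from urllib.parse import urlencode

# Metadata prefixes named on the slide.
PREFIXES = ("oai_dc", "xoai", "oai_openaire")


def list_records_url(base_url: str, metadata_prefix: str, set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for one metadata schema."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; replace with the repository's actual OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai/request"

for prefix in PREFIXES:
    print(list_records_url(BASE_URL, prefix))
```

Harvesting the same set once per prefix and comparing the number of records returned is one way to notice differences between endpoints like those in the Aaltodoc example.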
87. What Changes? On Policies level…
• New harvesting policy for the National Harvester
• Important for content regulation
• Development of a new profile for OpenAIRE 4 on the RCAAP Validator
• Important to align with national and international services and
developments
88. Final Considerations
• Pilot DSpace 5 instance with the OpenAIRE 4 guidelines implemented
• All the information needed to implement them in the DSpace software will be made available
Participation in the DSpace OpenAIRE Working Group
• New harvesting rules and a pilot with OpenAIRE 4 soon
• Mappings of information
• Some information gaps remain between services (information may be lost)
• Already some suggestions for Guidelines OpenAIRE 4.1
91. • How to optimize the information exchange (metadata
and fulltext) between repositories and OpenAIRE? How
to reduce the burden to repository managers?
• How could we help - what kind of support would you
like to have from OpenAIRE?
• What are the major metadata quality issues and how to
solve them?
Breakout groups – questions: