Crossref & Unpaywall
This section describes the procedure used to integrate the contents of Crossref and Unpaywall into the OpenAIRE Graph.
Data acquisition
The dataset containing all the Crossref records is obtained via a complete data dump on a monthly basis. The Unpaywall dataset is no longer updated, but its latest snapshot (Dec 2021) is used to enrich the Crossref contents.
Process
The following describes the process applied to the Crossref and Unpaywall contents.
Crossref filtering
Crossref records are discarded if they:
- have a blank title, examples:
  - 10.1093/rheumatology/41.7.837
  - 10.1093/qjmed/95.7.430
  - 10.1371/journal.pone.0171434.g005
- have one of the following publishers: "Test accounts", "CrossRef Test Account"
  - Examples from https://api.crossref.org/works?query.publisher-name=%22Test%20accounts%22
    - 10.1007/bf00344543
    - 10.1007/bf00186154
    - 10.1306/64ed947a-1724-11d7-8645000102c1865d
- have authors matching the following invalid names: ",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na"
  - Examples for "none" author from https://api.crossref.org/works?query.author=%22none%22
    - 10.4007/annals.2016.184.3.11
    - 10.4007/annals.2012.176.1.6
    - 10.2172/6393585
  - Examples for "test" author from https://api.crossref.org/works?query.author=%22test%22
    - 10.5116/ijme.54ca.a5ae
    - 10.5755/j01.ss.71.2.544
    - 10.5755/j01.ee.22.2.319
- have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these are considered test records)
  - Examples from https://api.crossref.org/works?query.author=Addie+Jackson&query.publisher-name=%22Elsevier%20BV%22
    - 10.2139/ssrn.2082156
    - 10.2139/ssrn.2202300
    - 10.2139/ssrn.2255657
- do not have one of the following values in the field type: "book-section", "book", "book-chapter", "book-part", "book-series", "book-set", "book-track", "edited-book", "reference-book", "monograph", "journal-article", "dissertation", "other", "peer-review", "proceedings", "proceedings-article", "reference-entry", "report", "report-series", "standard", "standard-series", "posted-content", "dataset"
  - Examples:
    - 10.1371/journal.pone.0171434.g005
    - 10.7554/elife.21052.049
    - 10.1371/journal.pcbi.1005379.s006

Records with type=dataset are mapped into OpenAIRE research products of type dataset. All others are mapped as OpenAIRE research products of type publication.
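
The filtering criteria above can be sketched as a single predicate over a Crossref record. This is a minimal illustration, not the production code: the function names `keep_record` and `entity_type` are hypothetical, the record is assumed to be a dictionary shaped like the Crossref REST API JSON (`title` as a list, `author` entries with `given` and `family`), and comparing author names as a lower-cased "given family" string is an assumption.

```python
# Hypothetical sketch of the Crossref filtering rules described above.
INVALID_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na",
}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def author_full_name(author):
    # Assumption: names are compared as lower-cased "given family".
    parts = [author.get("given") or "", author.get("family") or ""]
    return " ".join(p.strip() for p in parts if p.strip()).lower()


def keep_record(record):
    """Return True if the record survives all the filtering criteria."""
    titles = record.get("title") or []
    if not any(str(t).strip() for t in titles):
        return False  # blank title
    publisher = record.get("publisher") or ""
    if publisher in INVALID_PUBLISHERS:
        return False  # test publisher
    names = [author_full_name(a) for a in record.get("author") or []]
    if any(name in INVALID_AUTHOR_NAMES for name in names):
        return False  # invalid author name
    if publisher == "Elsevier BV" and "addie jackson" in names:
        return False  # empirically identified test records
    return record.get("type") in ACCEPTED_TYPES


def entity_type(record):
    """Map the Crossref type to the OpenAIRE research product type."""
    return "dataset" if record.get("type") == "dataset" else "publication"
```

A record that passes `keep_record` would then be mapped according to `entity_type`, so only records with type=dataset become datasets.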
Mapping Crossref properties into the OpenAIRE Graph
Properties in OpenAIRE research products are set based on the logic described in the following table:
OpenAIRE Research Product field path | Crossref path(s) | Notes |
---|---|---|
id | doi | id in the form doi_________::md5(doi) |
dateofcollection | indexed.datetime | |
lastupdatetimestamp | indexed.timestamp | |
type | type | Using the dnet:result_typologies vocabulary, we look up the instance.type synonym to determine which main entity is generated |
originalId | doi, clinical-trial-number, alternative-id | |
pid | | The scheme tells the type of PID, the value contains the actual value |
pid.scheme | | Default value: doi |
pid.value | doi | The doi is normalised and lower-cased |
maintitle | title | |
subtitle | subtitle | |
author | author | if available the sequence is mapped to rank and the ORCID is also mapped |
author.name | author.given | |
author.surname | author.family | |
author.fullname | author.given author.family | |
author.rank | | based on the order, starts from 1 |
author.pid | | only if the ORCID is available |
author.pid.id.scheme | | Default 'pending_orcid' (meaning that it is not an id confirmed by ORCID) |
author.pid.id.value | author.ORCID | |
author.pid.provenance.provenance | | Default 'Harvested' |
author.pid.provenance.trust | | Default '0.9' |
description | abstract | |
subject | subject | with classid='keywords', i.e. no controlled vocabularies for Crossref subjects |
publicationdate | issued.datetime or, if not available, created.datetime | |
publisher | publisher | |
source | source | only if the record is not of type book |
source | concatenation of container-title.head + "ISBN: " + ISBN.head | only if the record is of type book |
container | | It is set only for publications that carry information about the journal they were published in. |
container.name | container-title.head | |
container.issnOnline | issn-type.value | if issn-type.type='electronic' |
container.issnPrinted | issn-type.value | if issn-type.type='print' |
container.vol | volume | |
container.sp | page | before '-' |
container.ep | page | after '-' |
instance | | One instance is created with the DOI URL |
instance.accessright | | Values in instance.accessright.code and instance.accessright.label are set based on license and dateofacceptance. UNKNOWN: if the license is blank; OPEN ACCESS: if the license is a CC license, an ACS license, or an APA license (considered OPEN also by Unpaywall, see Unpaywall FAQ for details), or if it is an OUP license, but only after 12 months from the publication date; EMBARGO: OUP license, before 12 months from the publication date; CLOSED: if there is a license not covered by the previous cases |
instance.accessright.code | | Code from the COAR vocabulary for access right |
instance.accessright.label | | One of: OPEN, RESTRICTED, CLOSED, EMBARGO |
instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right |
instance.accessright.openAccessRoute | | only if instance.accessright.value = 'OPEN ACCESS'. Default is hybrid. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
instance.license | license.URL | If there is a license.content-version='vor' , then this is used. Otherwise the first license entry is used. |
instance.pid | | The scheme tells the type of PID, the value contains the actual value |
instance.pid.scheme | | Default value: doi |
instance.pid.value | doi | The doi is normalised and lower-cased |
instance.publicationdate | issued.datetime or, if not available, created.datetime | |
instance.refereed | | set to peerReviewed only if relation.has-review.id is not empty, UNKNOWN otherwise. |
instance.type | subtype | mapped using the OpenAIRE vocabulary for research products typologies |
instance.url | doi | Full URL of the DOI |
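
A few of the mapping rules in the table (identifier generation, page splitting, and license selection) can be illustrated with a short sketch. The helper names below are hypothetical, and two points are assumptions rather than statements from the table: DOI normalisation is reduced to trimming and lower-casing, and the md5 hash is computed over the normalised DOI.

```python
import hashlib


def normalise_doi(doi):
    # Assumption: the real normalisation may do more than trim and lower-case.
    return doi.strip().lower()


def openaire_id(doi):
    """Build an identifier of the form doi_________::md5(doi)."""
    digest = hashlib.md5(normalise_doi(doi).encode("utf-8")).hexdigest()
    return "doi_________::" + digest


def split_pages(page):
    """Split the Crossref page field into (container.sp, container.ep)."""
    start, sep, end = page.partition("-")
    start_page = start.strip() or None
    end_page = (end.strip() or None) if sep else None
    return start_page, end_page


def pick_license_url(licenses):
    """Prefer the license with content-version 'vor', else the first one."""
    if not licenses:
        return None
    for entry in licenses:
        if entry.get("content-version") == "vor":
            return entry.get("URL")
    return licenses[0].get("URL")
```

For instance, `split_pages("837-845")` yields the start and end page as separate values, while a page value without a dash only sets the start page.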
All other fields of the JSON schema not mentioned in the table contain empty values.
All the records from Crossref are related to the datasource with name=Crossref and id=openaire____::081b82f96300b6a6e3d282bad31cb6e2.
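
The rules for instance.accessright in the table above can be sketched as a small decision function. This is only an illustration under stated assumptions: recognising CC, ACS, APA, and OUP licenses by URL substring is hypothetical (the patterns are not taken from the OpenAIRE code base), exactly 12 elapsed months is treated as "after 12 months", and the reference date is passed in explicitly as `today`.

```python
from datetime import date

# Assumption: illustrative URL markers, not the patterns used in production.
OPEN_LICENSE_MARKERS = ("creativecommons.org", "pubs.acs.org", "apa.org")
OUP_LICENSE_MARKER = "academic.oup.com"


def months_between(start, end):
    """Number of whole months elapsed between two dates."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


def access_right(license_url, publication_date, today):
    """Return UNKNOWN, OPEN ACCESS, EMBARGO or CLOSED for a license URL."""
    if license_url is None or not license_url.strip():
        return "UNKNOWN"
    url = license_url.lower()
    if any(marker in url for marker in OPEN_LICENSE_MARKERS):
        return "OPEN ACCESS"
    if OUP_LICENSE_MARKER in url:
        # Assumption: exactly 12 months counts as past the embargo.
        if months_between(publication_date, today) >= 12:
            return "OPEN ACCESS"
        return "EMBARGO"
    return "CLOSED"
```

Passing `today` as a parameter keeps the function deterministic, which makes the 12-month boundary easy to test.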
Possible improvements:
- Map clinical-trial-number and alternative-id in alternateIdentifiers?
- Verify if Crossref has a property for language, country, container.issnLinking, container.iss, container.edition, container.conferenceplace and container.conferencedate
- Different approach to set the refereed field and improve its coverage?
Map Crossref links to projects/funders
Links to funding available in Crossref are mapped as funding relationships (ResearchProduct -- isProducedBy -- Project) applying the following mapping:
Funder | Grant code | Link to |
---|---|---|
DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} or name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in award | Link to H2020 project |
DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in award | Link to FP7 project |
DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in award | Link to FP7 or H2020 project |
DOI: 10.13039/100000001 | award | Link to NSF project |
DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | award | Link to ANR project |
DOI: 10.13039/501100002341 | award | Link to Academy of Finland project |
DOI: 10.13039/501100001602 | award, removing the initial 'SFI' if present | Link to SFI project |
DOI: 10.13039/501100000923 | award | Link to ARC project |
DOI: 10.13039/501100000038 | award is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to NSERC (unidentified project) |
DOI: 10.13039/501100000155 | award is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to SSHRC (unidentified project) |
DOI: 10.13039/501100000024 | award is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to CIHR (unidentified project) |
DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | award | Link to CONICYT project |
DOI: 10.13039/501100003448 | series of 4-9 digits in award | Link to GSRT project |
DOI: 10.13039/501100010198 | award | Link to SGOV project |
DOI: 10.13039/501100004564 | series of 4-9 digits in award | Link to MESTD project |
DOI: 10.13039/501100003407 | award | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (unidentified project) is also generated |
DOI: {10.13039/501100006588, 10.13039/501100004488} | award, removing 'Project No' and 'HRZZ' prefix, if present | Link to HRZZ or MZOS project |
DOI: 10.13039/501100006769 | award | Link to Russian Science Foundation project |
DOI: 10.13039/501100001711 | award after '_' and before '/' | Link to SNSF project |
DOI: 10.13039/501100004410 | award | Link to TUBITAK project |
DOI: 10.13039/100004440 OR name: 'Wellcome Trust Masters Fellowship' | award | Link to Wellcome Trust specific project and to the unidentified project. |
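
The grant code rules in the table above can be sketched as small functions selected by funder DOI. This is a hedged illustration: the function names and the `RULES` dictionary are hypothetical, only a handful of funders from the table are included, taking the first run of 4-9 digits (rather than some other match) is an assumption, and returning `None` when an SNSF award contains no '_' is also an assumption.

```python
import re

DIGITS = re.compile(r"\d{4,9}")


def digits_rule(award):
    """Return the first series of 4-9 digits in the award, if any."""
    match = DIGITS.search(award)
    return match.group(0) if match else None


def sfi_rule(award):
    """Remove the initial 'SFI', if present."""
    code = award.strip()
    if code.startswith("SFI"):
        code = code[3:]
    return code.strip()


def snsf_rule(award):
    """Keep the part of the award after '_' and before '/'."""
    if "_" not in award:
        return None  # assumption: no code can be derived without '_'
    return award.split("_", 1)[1].split("/", 1)[0]


# Hypothetical dispatch table covering a few rows of the mapping.
RULES = {
    "10.13039/100000001": str.strip,       # NSF: award as is
    "10.13039/501100001602": sfi_rule,     # SFI
    "10.13039/501100001711": snsf_rule,    # SNSF
    "10.13039/501100003448": digits_rule,  # GSRT
    "10.13039/501100004564": digits_rule,  # MESTD
}


def grant_code(funder_doi, award):
    """Return the project code to link, or None if no rule applies."""
    rule = RULES.get(funder_doi)
    return rule(award) if rule else None
```

Funders whose award is ignored (NSERC, SSHRC, CIHR) would not need an entry here, since they are linked to an unidentified project regardless of the award value.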
Intersect Crossref with Unpaywall by DOI
The fields we consider from Unpaywall are:
- is_oa
- best_oa_location
- oa_status

The records of Crossref that intersect by DOI with Unpaywall records are enriched with one additional instance with the following properties:
OpenAIRE Research Product field path | Unpaywall field path | Notes |
---|---|---|
instance | | created only if is_oa is true and a best_oa_location is available |
instance.accessright | | default value Open Access: we do not add instances if Unpaywall says there is no open version |
instance.accessright.code | | Open Access code from the COAR vocabulary for access right |
instance.accessright.label | | Always OPEN |
instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right |
instance.accessright.openAccessRoute | oa_status | |
instance.url | best_oa_location | |
instance.license | best_oa_location.license | |
instance.pid | | The scheme tells the type of PID, the value contains the actual value |
instance.pid.scheme | | Default value: doi |
instance.pid.value | doi | The doi is normalised and lower-cased |
For the definition of Unpaywall's oa_status, refer to the Unpaywall FAQ.
The record will also feature a relation to the Unpaywall data source: name="UnpayWall", id=openaire____::8ac8380272269217cb09a928c8caa993.
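
The enrichment step can be sketched as a function that builds the additional instance from an Unpaywall record. This is a minimal illustration under assumptions: `unpaywall_instance` is a hypothetical name, the output dictionary is a simplified stand-in for the OpenAIRE instance structure, and reading the URL from `best_oa_location["url"]` relies on the Unpaywall data format rather than on the table above.

```python
def unpaywall_instance(doi, unpaywall_record):
    """Return the extra instance, or None if there is no open version."""
    location = unpaywall_record.get("best_oa_location")
    if not unpaywall_record.get("is_oa") or not location:
        return None  # no instance is added without an open location
    return {
        "accessright": {
            "label": "OPEN",
            "openAccessRoute": unpaywall_record.get("oa_status"),
        },
        # Assumption: the location URL lives in the 'url' field.
        "url": location.get("url"),
        "license": location.get("license"),
        "pid": {"scheme": "doi", "value": doi.strip().lower()},
    }
```

A Crossref record whose DOI matches an Unpaywall record with `is_oa` set to false keeps only its original instance.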