Version: 5.1.2

DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID

DOIBoost is a dataset that combines research outputs and links among them from a selection of data sources. It enriches the records available on Crossref with what is available on Unpaywall, Microsoft Academic Graph (MAG), and ORCID, intersecting all those datasets by DOI. As a consequence, DOIBoost does not contain any record from MAG, Unpaywall, or ORCID that does not have a DOI available in Crossref.

Each Crossref record is enriched with:

  • ORCID identifiers of authors from ORCID
  • Open Access instance (with OA color/route and license) from Unpaywall
  • the following information from MAG:
    • abstracts
    • MAG identifiers of authors
    • affiliation (result - organization) relationships
    • subjects (MAG FieldsOfStudy)
    • conference or journal information

The Open Access status is also set by intersecting the journal information of a record with the journal lists available from DOAJ and the Gold ISSN list.

Inputs

  • Crossref: dump available to Crossref subscribers via the MetadataPlus service, updated once a month.
  • Microsoft Academic Graph: version downloaded on 2021-02-15. We plan to take the latest version in Dec 2021, before MAG is retired.
  • ORCID: baseline dump obtained on 2020-10-13, updated every week from the ORCID public API.
  • Unpaywall: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot).

The construction of the DOIBoost dataset consists of the following phases:

Process

The following section describes the processing steps needed to build DOIBoost starting from the input data.

Crossref filtering

Records in Crossref are ruled out if they meet any of the following criteria:

  • have blank title, examples:
    • 10.1093/rheumatology/41.7.837
    • 10.1093/qjmed/95.7.430
    • 10.1371/journal.pone.0171434.g005
  • have one of the following publishers: "Test accounts", "CrossRef Test Account"
  • have no authors with valid names, where valid means: not blank and different from all strings in this list: List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")
  • have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these are test records)
  • have a value in the field type that is not one of the following: "book-section", "book", "book-chapter", "book-part", "book-series", "book-set", "book-track", "edited-book", "reference-book", "monograph", "journal-article", "dissertation", "other", "peer-review", "proceedings", "proceedings-article", "reference-entry", "report", "report-series", "standard", "standard-series", "posted-content", "dataset"
    • Examples:
      • 10.1371/journal.pone.0171434.g005
      • 10.7554/elife.21052.049
      • 10.1371/journal.pcbi.1005379.s006

Records with type=dataset are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
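
The filtering criteria above can be sketched as a single predicate. This is a minimal illustration, not the DOIBoost implementation: it assumes records shaped like Crossref JSON (title as a list, author as a list of objects with given and family), and it assumes that author names are compared case-insensitively after joining given and family. The function names are hypothetical.

```python
# Strings that do not count as valid author names (from the list above).
INVALID_NAMES = {",", "none none", "none, none", "none &na;", "(:null)",
                 "test test test", "test test", "test", "&na; &na;"}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def full_name(author):
    # Assumption: the name to validate is "given family".
    parts = (author.get("given"), author.get("family"))
    return " ".join(p for p in parts if p).strip()


def has_valid_author(record):
    names = (full_name(a).lower() for a in record.get("author", []))
    return any(n and n not in INVALID_NAMES for n in names)


def keep_record(record):
    """Return True if the Crossref record survives all the filters."""
    titles = record.get("title") or []
    if not any(t.strip() for t in titles):
        return False  # blank title
    publisher = record.get("publisher", "")
    if publisher in TEST_PUBLISHERS:
        return False
    if not has_valid_author(record):
        return False
    authors = record.get("author", [])
    if publisher == "Elsevier BV" and any(full_name(a) == "Addie Jackson" for a in authors):
        return False  # empirically a test record
    return record.get("type") in ACCEPTED_TYPES


def openaire_result_type(record):
    # type=dataset maps to an OpenAIRE dataset, everything else to a publication.
    return "dataset" if record.get("type") == "dataset" else "publication"
```

A record is kept only if it passes every check, so the order of the checks affects performance but not the outcome.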

Mapping Crossref properties into the OpenAIRE Graph

Properties in OpenAIRE results are set based on the logic described in the following table:

| OpenAIRE Result field path | Crossref path(s) | Notes |
| --- | --- | --- |
| id | doi | id in the form doi_________::md5(doi) |
| dateofcollection | indexed.datetime | |
| lastupdatetimestamp | indexed.timestamp | |
| type | type | dataset if the Crossref type is dataset, publication otherwise (based on the filtering logics described above) |
| originalId | doi, clinical-trial-number, alternative-id | |
| pid | | The scheme tells the type of PID, the value contains the actual value |
| pid.scheme | | Default value: doi |
| pid.value | doi | The doi is normalised and lower-cased |
| maintitle | title | |
| subtitle | subtitle | |
| author | author | if available the sequence is mapped to rank and the ORCID is also mapped |
| author.name | author.given | |
| author.surname | author.family | |
| author.fullname | author.given author.family | |
| author.rank | | based on the order, starts from 1 |
| author.pid | | only if the ORCID is available |
| author.pid.id.scheme | | Default 'pending_orcid' (meaning that it is not an id confirmed by ORCID) |
| author.pid.id.value | author.ORCID | |
| author.pid.provenance.provenance | | Default 'Harvested' |
| author.pid.provenance.trust | | Default '0.9' |
| description | abstract | |
| subject | subject | with classid='keywords', i.e. no controlled vocabularies for Crossref subjects |
| publicationdate | issued.datetime or, if not available, created.datetime | |
| publisher | publisher | |
| source | source | only if the record is not of type book |
| source | concatenation of container-title.head + "ISBN: " + ISBN.head | only if the record is of type book |
| container | | It is set only for publications with information about the journal it was published in. |
| container.name | container-title.head | |
| container.issnOnline | issn-type.value | if issn-type.type='electronic' |
| container.issnPrinted | issn-type.value | if issn-type.type='print' |
| container.vol | volume | |
| container.sp | page | before '-' |
| container.ep | page | after '-' |
| instance | | One instance is created with the DOI URL |
| instance.accessright | | Values in instance.accessright.code and instance.accessright.label are set based on license and dateofacceptance. UNKNOWN: if the license is blank. OPEN ACCESS: if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see Unpaywall FAQ for details) or if OUP license, but only after 12 months from the publication date. EMBARGO: OUP license, before 12 months from the publication date. CLOSED: if there is a license not covered by the previous cases |
| instance.accessright.code | | Code from the COAR vocabulary for access right |
| instance.accessright.label | | One of: OPEN, RESTRICTED, CLOSED, EMBARGO |
| instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right |
| instance.accessright.openAccessRoute | | only if instance.accessright.value = 'OPEN ACCESS'. Default is hybrid. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
| instance.license | license.URL | If there is a license.content-version='vor', then this is used. Otherwise the first license entry is used. |
| instance.pid | | The scheme tells the type of PID, the value contains the actual value |
| instance.pid.scheme | | Default value: doi |
| instance.pid.value | doi | The doi is normalised and lower-cased |
| instance.publicationdate | issued.datetime or, if not available, created.datetime | |
| instance.refereed | | set to peerReviewed only if relation.has-review.id is not empty, UNKNOWN otherwise. |
| instance.type | subtype | mapped using the OpenAIRE vocabulary for result typologies |
| instance.url | doi | Full URL of the DOI |
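
A few of the simpler rules in the table (the id, the author rank, and the page split) can be sketched as follows. This is an illustration under stated assumptions: the normalisation shown is only trimming and lower-casing, the md5 is assumed to be the hexadecimal digest of the normalised DOI, the behaviour for a page value without '-' is a guess, and the author pid is flattened compared to the author.pid.id.* paths in the table. All function names are hypothetical.

```python
import hashlib


def normalise_doi(doi):
    # Assumption: normalisation is limited to trimming and lower-casing.
    return doi.strip().lower()


def openaire_id(doi):
    # id in the form doi_________::md5(doi)
    digest = hashlib.md5(normalise_doi(doi).encode("utf-8")).hexdigest()
    return f"doi_________::{digest}"


def split_pages(page):
    """Return (container.sp, container.ep) from a Crossref page string."""
    start, sep, end = page.partition("-")
    first = start.strip() or None
    last = (end.strip() or None) if sep else None  # assumption: no '-' means no end page
    return first, last


def map_authors(crossref_authors):
    mapped = []
    # The rank follows the order of the Crossref author list and starts from 1.
    for rank, a in enumerate(crossref_authors, start=1):
        given, family = a.get("given", ""), a.get("family", "")
        author = {
            "name": given,
            "surname": family,
            "fullname": f"{given} {family}".strip(),
            "rank": rank,
        }
        if a.get("ORCID"):
            # Simplified pid: scheme and value only.
            author["pid"] = {"scheme": "pending_orcid", "value": a["ORCID"]}
        mapped.append(author)
    return mapped
```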

All other fields of the Json schema not mentioned in the table contain empty values.
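
The rules for instance.license and instance.accessright are the most intricate part of the mapping, so a sketch may help. The decision order follows the table; everything else is an assumption: the URL markers used to recognise ACS, APA, and OUP licenses are illustrative placeholders, "after 12 months" is read as "on or after the first anniversary of the publication date", and the function names are hypothetical.

```python
from datetime import date

# Illustrative markers only; the real matching rules are not given in this document.
OPEN_MARKERS = (
    "creativecommons.org",                          # CC licenses
    "pubs.acs.org/page/policy/authorchoice",        # assumed ACS marker
    "apa.org/pubs/journals/resources/open-access",  # assumed APA marker
)
OUP_MARKER = "academic.oup.com/journals/pages/open_access/funder_policies"  # assumed


def pick_license(licenses):
    """Prefer the license with content-version 'vor', else the first entry."""
    for lic in licenses:
        if lic.get("content-version") == "vor":
            return lic.get("URL")
    return licenses[0].get("URL") if licenses else None


def one_year_after(d):
    try:
        return d.replace(year=d.year + 1)
    except ValueError:  # 29 February in a leap year
        return d.replace(year=d.year + 1, day=28)


def access_right(license_url, publication_date, today):
    if not license_url or not license_url.strip():
        return "UNKNOWN"
    url = license_url.lower()
    if any(marker in url for marker in OPEN_MARKERS):
        return "OPEN ACCESS"
    if OUP_MARKER in url:
        # OUP content is embargoed for the first 12 months.
        if today >= one_year_after(publication_date):
            return "OPEN ACCESS"
        return "EMBARGO"
    return "CLOSED"
```

Passing today as a parameter rather than calling date.today() inside the function keeps the rule deterministic and easy to test.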

All the records from Crossref are related to the datasource with name=Crossref and id=openaire____::081b82f96300b6a6e3d282bad31cb6e2

Possible improvements:

  • map clinical-trial-number and alternative-id in alternateIdentifiers?
  • Verify if Crossref has a property for language, country, container.issnLinking, container.iss, container.edition, container.conferenceplace and container.conferencedate
  • Different approach to set the refereed field and improve its coverage?

Links to funding available in Crossref are mapped as funding relationships (result -- isProducedBy -- project) applying the following mapping:

| Funder | Grant code | Link to |
| --- | --- | --- |
| DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} or name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in award | Link to H2020 project |
| DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in award | Link to FP7 project |
| DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in award | Link to FP7 or H2020 project |
| DOI: 10.13039/100000001 | award | Link to NSF project |
| DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | award | Link to ANR project |
| DOI: 10.13039/501100002341 | award | Link to Academy of Finland project |
| DOI: 10.13039/501100001602 | award, removing the initial 'SFI' if present | Link to SFI project |
| DOI: 10.13039/501100000923 | award | Link to ARC project |
| DOI: 10.13039/501100000038 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (unidentified project) |
| DOI: 10.13039/501100000155 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (unidentified project) |
| DOI: 10.13039/501100000024 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (unidentified project) |
| DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | award | Link to CONICYT project |
| DOI: 10.13039/501100003448 | series of 4-9 digits in award | Link to GSRT project |
| DOI: 10.13039/501100010198 | award | Link to SGOV project |
| DOI: 10.13039/501100004564 | series of 4-9 digits in award | Link to MESTD project |
| DOI: 10.13039/501100003407 | award | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (unidentified project) is also generated |
| DOI: {10.13039/501100006588, 10.13039/501100004488} | award, removing 'Project No' and 'HRZZ' prefix, if present | Link to HRZZ or MZOS project |
| DOI: 10.13039/501100006769 | award | Link to Russian Science Foundation project |
| DOI: 10.13039/501100001711 | award after '_' and before '/' | Link to SNSF project |
| DOI: 10.13039/501100004410 | award | Link to TUBITAK project |
| DOI: 10.13039/100004440 or name: 'Wellcome Trust Masters Fellowship' | award | Link to Wellcome Trust specific project and to the unidentified project. |
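
The grant-code rules in the table can be sketched as small extraction functions selected by funder DOI. Only four funders are shown, and the details are assumptions: the regular expression takes the first run of 4 to 9 digits, the SNSF rule uses the first '_' and the first '/', and the short funder labels and function names are hypothetical.

```python
import re


def digits_4_to_9(award):
    # First run of 4-9 digits in the award; longer runs are cut at 9 digits.
    match = re.search(r"\d{4,9}", award)
    return match.group(0) if match else None


def sfi_code(award):
    # Remove the initial 'SFI' if present.
    return award[3:].strip() if award.startswith("SFI") else award


def snsf_code(award):
    # Keep what comes after '_' and before '/'.
    after = award.split("_", 1)[1] if "_" in award else award
    return after.split("/", 1)[0]


def identity(award):
    return award


# Funder DOI -> (funder label, grant code extractor); a subset of the table.
EXTRACTORS = {
    "10.13039/100000001": ("NSF", identity),
    "10.13039/501100001602": ("SFI", sfi_code),
    "10.13039/501100001711": ("SNSF", snsf_code),
    "10.13039/501100003448": ("GSRT", digits_4_to_9),
}


def project_link(funder_doi, award):
    """Return (funder, project code) or None if no link can be built."""
    entry = EXTRACTORS.get(funder_doi)
    if entry is None:
        return None
    funder, extract = entry
    code = extract(award)
    return (funder, code) if code else None
```

For example, project_link("10.13039/501100001711", "CRSII2_154434/1") yields the SNSF code 154434 under these assumptions.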

Intersect Crossref with Unpaywall by DOI

The fields we consider from Unpaywall are:

  • is_oa
  • best_oa_location
  • oa_status

The Crossref results that intersect by DOI with Unpaywall records are enriched with one additional instance with the following properties:

| OpenAIRE Result field path | Unpaywall field path | Notes |
| --- | --- | --- |
| instance | | created only if is_oa and a best_oa_location is available |
| instance.accessright | | default value Open Access: we do not add instances if Unpaywall says there is no open version |
| instance.accessright.code | | Open Access code from the COAR vocabulary for access right |
| instance.accessright.label | | Always OPEN |
| instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right |
| instance.accessright.openAccessRoute | oa_status | |
| instance.url | best_oa_location | |
| instance.license | best_oa_location.license | |
| instance.pid | | The scheme tells the type of PID, the value contains the actual value |
| instance.pid.scheme | | Default value: doi |
| instance.pid.value | doi | The doi is normalised and lower-cased |

For the definition of Unpaywall's oa_status, refer to the Unpaywall FAQ.

The record will also feature a relation to the UnpayWall data source: name="UnpayWall", id=openaire____::8ac8380272269217cb09a928c8caa993.
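
The creation of the additional instance can be sketched as follows. It is a minimal illustration that assumes the Unpaywall record is a dictionary with the fields listed above and that the URL is read from the url field of best_oa_location; the function name and the flattened shape of the returned instance are hypothetical.

```python
def unpaywall_instance(doi, unpaywall_record):
    """Build the extra instance, or return None if there is no open version."""
    location = unpaywall_record.get("best_oa_location")
    if not unpaywall_record.get("is_oa") or not location:
        return None
    return {
        "accessright": {
            "label": "OPEN",  # always OPEN for Unpaywall instances
            "openAccessRoute": unpaywall_record.get("oa_status"),
        },
        "url": location.get("url"),  # assumption: URL taken from best_oa_location.url
        "license": location.get("license"),
        "pid": {"scheme": "doi", "value": doi.strip().lower()},
    }
```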

Intersect with ORCID

The fields we consider from ORCID are:

  • doi
  • authors, a list of authors, each with optional name, surname, creditName, oid

| OpenAIRE field path | ORCID path | Notes |
| --- | --- | --- |
| pid | doi | |
| author.name | capitalize(name) | only mapped if not blank |
| author.surname | capitalize(surname) | only mapped if not blank |
| author.fullname | | if name and surname are not blank, they are concatenated (capitalize(name) capitalize(surname)), otherwise we use the creditName |
| author.pid | | only if the ORCID is available |
| author.pid.id.scheme | | Default orcid (meaning that it is confirmed by ORCID, in contrast to the orcid_pending set from Crossref and Unpaywall) |
| author.pid.id.value | oid | |
| author.pid.provenance.provenance | | Default Harvested |
| author.pid.provenance.trust | | Default 0.9 |

The records are enriched with the ORCID identifiers of their authors.

The current approach is:

  • if the number of authors from Crossref equals the number of authors from ORCID, then we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between authors' names, surnames, or fullnames, depending on which properties are available;
  • if the two lists have a different number of authors, then we take the longer one and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between authors' names, surnames, or fullnames, depending on which properties are available.
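
The current approach can be sketched as follows. The code computes the Jaro-Winkler similarity (1.0 means identical), the usual counterpart of the distance. Several points are assumptions made for illustration: the acceptance threshold of 0.9, the comparison on lower-cased full names only, the tie-break in favour of Crossref when both lists have the same number of PIDs, and the flat author dictionaries with fullname, name, surname, and pid keys.

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a = [c for c, u in zip(s1, used1) if u]
    b = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(x != y for x, y in zip(a, b)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    score = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):  # common prefix, at most 4 characters
        if x != y:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)


def name_key(author):
    full = author.get("fullname") or " ".join(
        p for p in (author.get("name"), author.get("surname")) if p)
    return full.strip().lower()


def pick_base(crossref, orcid):
    """Return (list to enrich, list providing PIDs)."""
    if len(crossref) == len(orcid):
        pids = lambda authors: sum(1 for a in authors if a.get("pid"))
        return (crossref, orcid) if pids(crossref) >= pids(orcid) else (orcid, crossref)
    return (crossref, orcid) if len(crossref) > len(orcid) else (orcid, crossref)


def enrich_authors(crossref, orcid, threshold=0.9):
    base, other = pick_base(crossref, orcid)
    donors = [o for o in other if o.get("pid") and name_key(o)]
    result = [dict(a) for a in base]  # do not mutate the inputs
    for author in result:
        key = name_key(author)
        if author.get("pid") or not key or not donors:
            continue
        best = max(donors, key=lambda o: jaro_winkler(key, name_key(o)))
        if jaro_winkler(key, name_key(best)) >= threshold:
            author["pid"] = best["pid"]
    return result
```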

Miriam will modify the process to ensure that:

  • the list of authors from Crossref always "wins"
  • the identifiers from ORCID always "win"

Intersect with Microsoft Academic Graph

Important Notes

  • Only papers with DOI are considered
  • Since the same DOI can have multiple versions of an item with different MAG PaperId values, we only take one per DOI (the last one we process). We call this dataset Papers_distinct
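
The construction of Papers_distinct can be sketched as a last-one-wins grouping by DOI. The sketch assumes that papers are dictionaries with the MAG columns PaperId and Doi, and that DOIs are compared after lower-casing; the function name is hypothetical.

```python
def papers_distinct(papers):
    """Keep one paper per DOI: the last one processed wins."""
    by_doi = {}
    for paper in papers:
        doi = paper.get("Doi")
        if not doi:
            continue  # only papers with a DOI are considered
        by_doi[doi.strip().lower()] = paper  # later papers overwrite earlier ones
    return by_doi
```

Which record ends up in Papers_distinct therefore depends on the processing order, not on any property of the record itself.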

When mapping MAG records to the OpenAIRE Graph, we consider the following MAG tables:

  • PaperAbstractsInvertedIndex: for the paper abstracts
  • Authors: for the authors. The MAG data is pre-processed by grouping authors by PaperId
  • Affiliations and PaperAuthorAffiliations: to generate links between publications and organisations
  • Journals and ConferenceInstances: joined with Papers_distinct to have the information about the venues where the paper was published
  • TO BE REMOVED PaperUrls: to create one instance for the OpenAIRE publication
  • FieldsOfStudy: to add subjects
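
Two of the pre-processing steps on these tables can be sketched briefly: rebuilding an abstract from PaperAbstractsInvertedIndex and grouping authors by PaperId. The sketch assumes that the inverted index is a JSON object with an IndexLength and an InvertedIndex mapping each word to its positions, and that author rows are dictionaries carrying a PaperId; the function names are hypothetical.

```python
import json
from collections import defaultdict


def rebuild_abstract(inverted_index_json):
    """Turn a MAG inverted index back into plain text."""
    data = json.loads(inverted_index_json)
    words = [""] * data["IndexLength"]
    for word, positions in data["InvertedIndex"].items():
        for position in positions:
            words[position] = word
    return " ".join(w for w in words if w)


def group_authors_by_paper(author_rows):
    """Group author rows by PaperId, preserving the input order."""
    grouped = defaultdict(list)
    for row in author_rows:
        grouped[row["PaperId"]].append(row)
    return dict(grouped)
```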

The records are enriched with:

  • abstracts
  • MAG identifiers of authors
  • affiliation relationships
  • subjects (MAG FieldsOfStudy)
  • conference or journal information (in the journal field) TODO: or container, in case of the dump?
  • [TO BE REMOVED] instances with URL from MAG

Enrich DOIBoost3 with hosting data sources (hostedby) and access right information

In this phase, we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells if the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a journal.[l|e]issn that match are enriched as follows:

  • Each instance gains the hostedby information corresponding to the journal
  • If the journal is open access, the access rights of the instances are also set to Open Access with gold route (because, by construction, the journals we know to be open come from DOAJ or the Gold ISSN list)

The hostedby of records that do not match are set to the Unknown Repository.
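
This phase can be sketched as a lookup by ISSN followed by an update of every instance. The data shapes are assumptions made for illustration: journals are dictionaries with name, issn, eissn, lissn, and an optional open_access flag, and each record carries a list of issns and a list of instances. The function names are hypothetical.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"


def build_issn_index(journals):
    """Index journals by each of their ISSNs (issn, eissn, lissn)."""
    index = {}
    for journal in journals:
        for key in ("issn", "eissn", "lissn"):
            if journal.get(key):
                index[journal[key]] = journal
    return index


def enrich_hostedby(record, issn_index):
    # The first ISSN of the record that matches a known journal wins.
    journal = next(
        (issn_index[i] for i in record.get("issns", []) if i in issn_index), None)
    for instance in record.get("instances", []):
        if journal is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = journal["name"]
        if journal.get("open_access"):
            # Open journals come from DOAJ or the Gold ISSN list, hence the gold route.
            instance["accessright"] = {"label": "OPEN", "openAccessRoute": "gold"}
    return record
```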

References

The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:

  • La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: 10.5281/zenodo.1441071