Version: 6.2.2

Datacite

This section describes the aggregation workflow used to gather bibliographic material from Datacite, together with the corresponding mapping.

Datacite datasource

Datacite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs.

Datacite API

The DataCite REST API allows users to retrieve, query, and browse Datacite metadata records. In particular, it exposes a method for harvesting new records incrementally.

https://api.datacite.org/dois?page[cursor]=$CURSOR&page[size]=$NUMBER_OF_ITEM_PER_PAGE&query=updated:[$FROM_DATE_TIMESAMP TO $TO_DATE_TIMESAMP]

This API request uses the following variables:

  • CURSOR: The value of the cursor to iterate the pages; the cursor is extracted from each API response and used in the next request.
  • NUMBER_OF_ITEM_PER_PAGE: (max 1000) defines how many records must be returned within each API response.
  • FROM_DATE_TIMESAMP, TO_DATE_TIMESAMP: the start and end of the timestamp interval in which the requested records were last updated.
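
As an illustration of how the variables above fit into the request, here is a minimal sketch that assembles the URL. The function name `build_harvest_url` and the sample values are hypothetical. The timestamps are treated as opaque values, since their exact format is not specified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.datacite.org/dois"
MAX_PAGE_SIZE = 1000  # upper bound stated for NUMBER_OF_ITEM_PER_PAGE


def build_harvest_url(cursor, page_size, from_ts, to_ts):
    """Build the incremental harvesting URL (hypothetical helper)."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    params = {
        "page[cursor]": cursor,
        "page[size]": page_size,
        # Range query on the `updated` field of each record.
        "query": f"updated:[{from_ts} TO {to_ts}]",
    }
    # urlencode percent-encodes brackets and colons and turns spaces into '+'.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_harvest_url("1", 1000, 1600000000, 1700000000))
```

The cursor for each subsequent request would be taken from the previous API response, as described above.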

Each record contains two pieces of information needed for incremental harvesting:

  • isActive: tells if the record is deleted (isActive:false)
  • updated: timestamp of last update
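
A harvester could use these two fields roughly as sketched below. This is only an illustration: it assumes both fields sit under the record's `attributes` object (as `updated` does in the mapping table further down), and it assumes that a record without `isActive` counts as active. The function name is hypothetical.

```python
def split_by_status(records):
    """Partition harvested records into (active, deleted) lists."""
    active, deleted = [], []
    for record in records:
        attributes = record.get("attributes", {})
        # Assumption: a missing isActive flag means the record is active.
        if attributes.get("isActive", True):
            active.append(record)
        else:
            deleted.append(record)
    return active, deleted


if __name__ == "__main__":
    sample = [
        {"id": "10.1234/a", "attributes": {"isActive": True, "updated": 1}},
        {"id": "10.1234/b", "attributes": {"isActive": False, "updated": 2}},
    ]
    active, deleted = split_by_status(sample)
    print(len(active), len(deleted))
```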

Collection Workflow

The collection workflow is responsible for aggregating new records. Each record is stored locally in a table with the following schema:

  • DOI: The DOI of the Datacite record (it is the primary key)
  • update_timestamp: the last update date timestamp
  • json: the native record JSON

The metadata collection process identifies the most recent record date available locally and uses that date to populate the FROM_DATE_TIMESAMP variable when requesting records from the Datacite API. The records in the API response are added to the local storage in upsert mode.
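
The upsert logic can be sketched as follows. SQLite is used here purely as a stand-in for the local storage, which this page does not specify. The table name `datacite_records`, the function names, and the integer type of `update_timestamp` are assumptions made for the example.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the local table with the three-column schema described above."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datacite_records ("
        "doi TEXT PRIMARY KEY, "
        "update_timestamp INTEGER NOT NULL, "
        "json TEXT NOT NULL)"
    )
    return conn


def latest_timestamp(conn):
    """Return the most recent update timestamp, or None if the table is empty."""
    row = conn.execute(
        "SELECT MAX(update_timestamp) FROM datacite_records"
    ).fetchone()
    return row[0]


def upsert_records(conn, records):
    """Insert new records and overwrite existing ones that share a DOI."""
    rows = [(doi, ts, json.dumps(native)) for doi, ts, native in records]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO datacite_records "
            "(doi, update_timestamp, json) VALUES (?, ?, ?)",
            rows,
        )
```

In this sketch, the value returned by `latest_timestamp` would be used as FROM_DATE_TIMESAMP for the next harvesting run.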

Datacite Mapping

Entity Mapping

The table below describes the mapping from the Datacite JSON records to the OpenAIRE Graph dump format.

| OpenAIRE Result field path | Datacite record JSON path | # Notes |
|---|---|---|
| id | \attributes\doi | id in the form doi_________::md5(doi) |
| instance; instance.type | \attributes\types\resourceType; \attributes\types\resourceTypeGeneral; attributes\types\schemaOrg | Use the vocabulary dnet:publication_resource to find a synonym of one of these terms and get the instance.type. |
| type | \attributes\types\resourceType; \attributes\types\resourceTypeGeneral; attributes\types\schemaOrg | Using the dnet:result_typologies vocabulary, we look up the instance.type synonym to generate one of the following main entities: publication, dataset, software, otherresearchproduct |
| pid | \attributes\doi | scheme = doi |
| originalid | \attributes\doi | |
| dateofcollection | attributes\updated | the timestamp is expressed in milliseconds; we convert it to the "yyyy-MM-dd'T'HH:mm:ssZ" format |
| author | \attributes\creators | Each creator is mapped to an author entity with the subfields below. If the record has no creator, it is skipped |
| author.fullname | \attributes\creators\name | if name is not defined, we construct it from the given and family names |
| author.rank | | Incremental index starting from 1 |
| author.name | \attributes\creators\givenName | |
| author.surname | \attributes\creators\familyName | |
| author.pid | \attributes\creators\nameIdentifiers | the list of pids associated with the creator |
| author.pid.scheme | \attributes\creators\nameIdentifiers | mapped with the vocabulary dnet:pid_types |
| author.pid.value | \attributes\creators\nameIdentifiers/nameIdentifier | the pid value |
| maintitle | \attributes\titles | Titles whose title type is null or Main |
| subtitle | \attributes\titles | Titles whose title type is Subtitle, since the title type vocabulary in OpenAIRE uses the Datacite title type vocabulary |
| date section | | For each date, and in particular for DOIs starting with 10.14457, we apply a Thai date fix: the date is converted to ThaiBuddhistDate and reformatted to the local one (see ticket #6791) |
| publicationdate | \attributes\dates | where dateType is issued |
| publicationdate | \attributes\publicationYear | we create a date in the format 01-01-publicationYear |
| embargoenddate | \attributes\dates | where dateType is available |
| subject | \attributes\subject | scheme=keywords |
| description | \attributes\descriptions | |
| publisher | \attributes\publisher | |
| language | \attributes\language | cleaned by using the vocabulary dnet:languages |
| instance.license | \attributes\rightsList | if the rights value starts with http and matches a particular regex |
| instance.accessright | \attributes\rightsList | if not present: unknown; if the datasource is Figshare: open; if embargo_date < today(): OPEN |
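
A few of the rules in the table above can be sketched in code. This is an illustrative reading of the notes, not the actual implementation. All function names are hypothetical. Several details are assumptions: the DOI is hashed exactly as given (any normalization is not described here), the timestamp is rendered in UTC, and a missing `name` is rebuilt as "givenName familyName".

```python
import hashlib
from datetime import datetime, timezone

ID_PREFIX = "doi_________::"


def make_openaire_id(doi):
    """id in the form doi_________::md5(doi); the DOI is hashed as given."""
    return ID_PREFIX + hashlib.md5(doi.encode("utf-8")).hexdigest()


def format_date_of_collection(updated_ms):
    """Convert a millisecond timestamp to yyyy-MM-dd'T'HH:mm:ssZ (UTC assumed)."""
    moment = datetime.fromtimestamp(updated_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S%z")


def author_fullname(creator):
    """Use `name` when present, else join given and family name (order assumed)."""
    name = creator.get("name")
    if name:
        return name
    parts = [creator.get("givenName"), creator.get("familyName")]
    return " ".join(part for part in parts if part)


def map_authors(creators):
    """Map creators to author entries, with rank starting from 1."""
    return [
        {
            "rank": rank,
            "fullname": author_fullname(creator),
            "name": creator.get("givenName"),
            "surname": creator.get("familyName"),
        }
        for rank, creator in enumerate(creators, start=1)
    ]


def publication_date_from_year(publication_year):
    """Build the 01-01-publicationYear date used when only the year is known."""
    return f"01-01-{publication_year}"
```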

Relation Mapping

| OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Target type | # Notes |
|---|---|---|---|
| isProducedBy/produces | attributes\fundingReferences | result/project | only when the funding reference matches the pattern (info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*) |
| IsProvidedBy/provides | | result/datasource | Datasource is always set to Datacite |
| isHostedBy/host | \attributes\relationships\client\id | result/datasource | we defined a curated map clientId/Datasource; if we find a match, we create a hostedBy relation |
| isRelatedTo | \attributes\relatedIdentifiers | result/result | we create relationships whenever the pid of the target is resolved in the Research Graph |
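
The funding reference pattern from the table above can be applied as in the following sketch. The function name is hypothetical. The table does not say which field of a funding reference carries the matched string, so the function simply takes that string as input. It is also an assumption that the whole string must match and that the six-digit group is the project's grant identifier.

```python
import re

# Pattern copied from the relation mapping table.
H2020_PATTERN = re.compile(r"(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)")


def extract_h2020_grant_id(funding_reference):
    """Return the six-digit group if the string matches the pattern, else None."""
    match = H2020_PATTERN.fullmatch(funding_reference)
    return match.group(2) if match else None


if __name__ == "__main__":
    print(extract_h2020_grant_id("info:eu-repo/grantagreement/ec/h2020/654321/EU"))
```

Only funding references for which a grant identifier is extracted would lead to an isProducedBy/produces relation.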