Version: 10.3.0

PIDs and identifiers

One of the challenges in keeping the contents of the OpenAIRE Graph stable is making its identifiers and records stable over time. There are many barriers to this, because the Graph maintains a map of data sources that is subject to constant variation: records in repositories change in content, original IDs, and PIDs, and may disappear or reappear; the same holds for a repository or the metadata collection it exposes. In addition, the mappings applied to the original contents may change and improve over time to keep up with changes in the input records.

PID Authorities

One front concerns how identity is attributed to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some authoritative sources, while considering all other sources “unstable” by definition. Examples of authoritative sources are Crossref and DataCite; examples of non-authoritative ones are institutional repositories and aggregators. PIDs from the authoritative sources form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction.

This policy defines a list of data sources that are considered authoritative for a specific type of PID they provide. Its effect is twofold:

OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible for creating them;
PIDs are included in the graph according to a strict criterion: the PID types declared in the table below are mapped as PIDs only when they are collected from the corresponding PID authority data source.

PID Type	Authority
doi	Crossref, Datacite
pmc, pmid	Europe PubMed Central, PubMed Central
arXiv	arXiv.org e-Print Archive
uniprot	Protein Data Bank
ena	Protein Data Bank
pdb	Protein Data Bank

There is one exception: Handles are minted by many repositories, and listing them all would not be viable. To avoid losing them as PIDs, Handles bypass the PID authority filtering rule. In all other cases, PIDs collected from a source that is not the corresponding authority are included in the graph as alternate identifiers.
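The filtering rule can be illustrated with a minimal Python sketch. The names `PID_AUTHORITIES` and `classify_pid` are hypothetical and are not part of any OpenAIRE code base; the mapping simply transcribes the table above, and treating PID types that are missing from the table as alternate identifiers is an assumption.

```python
# Hypothetical transcription of the PID authority table above.
PID_AUTHORITIES = {
    "doi": {"Crossref", "Datacite"},
    "pmc": {"Europe PubMed Central", "PubMed Central"},
    "pmid": {"Europe PubMed Central", "PubMed Central"},
    "arXiv": {"arXiv.org e-Print Archive"},
    "uniprot": {"Protein Data Bank"},
    "ena": {"Protein Data Bank"},
    "pdb": {"Protein Data Bank"},
}


def classify_pid(pid_type: str, datasource: str) -> str:
    """Return "pid" or "alternateIdentifier" for a PID collected from a data source."""
    # Handles bypass the authority filter because they are minted by many repositories.
    if pid_type.lower() == "handle":
        return "pid"
    # Assumption: a PID type that is not listed in the table has no authority,
    # so it is always mapped as an alternate identifier.
    authorities = PID_AUTHORITIES.get(pid_type, set())
    return "pid" if datasource in authorities else "alternateIdentifier"
```

Under these assumptions, a DOI collected from Crossref is kept as a PID, while the same DOI collected from an institutional repository becomes an alternate identifier.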

Delegated authorities

When a record is aggregated from multiple sources that are considered authoritative for minting specific PIDs, different mappings could be applied to each copy and, depending on the case, this could result in inconsistent field values. To overcome this issue, such records are included in the graph only once. To this end, the concept of "delegated authorities" defines a list of data sources that assign PIDs to their scientific products from a given PID minter.

This "selection" can be performed when the entities in the graph that share the same identifier are grouped together. The list of delegated authorities currently includes:

Delegated data source	Delegating data source	PID Type
Zenodo	Datacite	doi
RoHub	W3ID	w3id
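The selection step might look like the following Python sketch. It assumes that, within a group of records sharing the same identifier, the copy collected from the delegated data source is the one that is kept; the text does not state this explicitly. The names `DELEGATED_AUTHORITIES` and `pick_record`, as well as the record layout, are hypothetical.

```python
from typing import Dict, List

# (delegating data source, PID type) -> delegated data source, from the table above.
DELEGATED_AUTHORITIES = {
    ("Datacite", "doi"): "Zenodo",
    ("W3ID", "w3id"): "RoHub",
}


def pick_record(group: List[Dict[str, str]], pid_type: str) -> Dict[str, str]:
    """Choose one record from a non-empty group sharing the same identifier."""
    # Data sources delegated for this PID type.
    delegated = {
        target
        for (_source, ptype), target in DELEGATED_AUTHORITIES.items()
        if ptype == pid_type
    }
    # Assumption: the copy from the delegated data source is preferred.
    for record in group:
        if record["datasource"] in delegated:
            return record
    # No delegated copy in the group: fall back to the first record.
    return group[0]
```

For example, given a Datacite copy and a Zenodo copy of the same DOI, this sketch returns the Zenodo copy, so the record enters the graph only once.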

Identifiers in the Graph

OpenAIRE assigns an internal identifier to each object it collects. By default, the internal identifier is generated as sourcePrefix::md5(localId) where:

sourcePrefix is a namespace prefix of 12 chars assigned to the data source at registration time
localId is the identifier assigned to the object by the data source
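As a minimal sketch, the default scheme can be written in Python as follows. The function name `default_identifier` and the example prefix are hypothetical; encoding the local identifier as UTF-8 and using the hexadecimal MD5 digest are assumptions, since the text only gives the pattern sourcePrefix::md5(localId).

```python
import hashlib


def default_identifier(source_prefix: str, local_id: str) -> str:
    """Build sourcePrefix::md5(localId) for an object collected from a data source."""
    if len(source_prefix) != 12:
        raise ValueError("sourcePrefix must be exactly 12 characters long")
    # Assumption: the local identifier is hashed as UTF-8 and rendered as hex.
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


# Hypothetical prefix and local identifier, for illustration only.
example = default_identifier("example_____", "oai:example.org:123")
```

Because the identifier is derived from localId, any change of the local identifier at the source yields a different internal identifier.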

After years of operation, we can say that:

localId values are generally unstable
objects can disappear from sources
PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos)

Therefore, when the record is collected from an authoritative source:

the identity of the record is built from the PID, e.g. pidTypePrefix::md5(lowercase(doi)) for a DOI
the PID is added in a pid element of the data model

When the record is collected from a source which is not authoritative for any type of PID:

the identity of the record is built as usual from the local identifier
the PID, if available, is added as alternateIdentifier

Currently, the following data sources are used as "PID authorities":

PID Type	Prefix (12 chars)	Authority
doi	`doi_________`	Crossref, Datacite, Zenodo
pmc	`pmc_________`	Europe PubMed Central, PubMed Central
pmid	`pmid________`	Europe PubMed Central, PubMed Central
arXiv	`arXiv_______`	arXiv.org e-Print Archive
ena	`ena_________`	EMBL-EBI
pdb	`pdb_________`	EMBL-EBI
uniprot	`uniprot_____`	EMBL-EBI
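The PID-based scheme can be sketched in Python as below. The helper names `pid_prefix` and `pid_identifier` are hypothetical. The sketch assumes that each prefix is the PID type padded with underscores to 12 characters, which matches the table above, and that lowercasing applies to every PID type, although the text shows it only for DOIs.

```python
import hashlib


def pid_prefix(pid_type: str) -> str:
    """Pad the PID type with underscores to 12 characters, as in the table above."""
    return pid_type.ljust(12, "_")


def pid_identifier(pid_type: str, pid_value: str) -> str:
    """Build pidTypePrefix::md5(lowercase(pid)) for a record from a PID authority."""
    # Assumption: the value is lowercased for all PID types, hashed as UTF-8,
    # and rendered as a hexadecimal digest.
    digest = hashlib.md5(pid_value.lower().encode("utf-8")).hexdigest()
    return f"{pid_prefix(pid_type)}::{digest}"
```

Lowercasing before hashing means that two spellings of the same DOI that differ only in letter case map to the same identifier.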

OpenAIRE also performs duplicate identification (see the dedicated section for details). All duplicates are merged into a representative record, which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot reuse the identifier of one of the merged records).
