Version: 11.2.0

Person and relations

This document describes the pipeline that introduces Person entities into the OpenAIRE graph and populates the relations connecting persons to other entities: research outputs, organizations, and projects. The pipeline consists of several sequential steps, each building on the output of the previous one.

Overview

The OpenAIRE graph represents researchers as first-class entities of type Person, identified primarily through their ORCID iD. The construction of this sub-graph involves extracting person profiles, deriving participation and affiliation relations from multiple sources, enriching persons with bibliometric indicators, and finally pruning any person node that remains isolated — i.e., connected to no other entity in the graph.

Step 1 — Extraction of Person Entities and Base Relations

The first step materializes Person entities from the ORCID public data dump and simultaneously derives two types of relations: affiliation to organizations and participation in funded projects.

Person entities are created by mapping each ORCID author profile to the internal Person schema. Every profile is included regardless of whether the researcher has publicly visible works. The entity captures basic biographical information (given name, family name, biography, alternative names), the set of persistent identifiers associated with the researcher (including the ORCID iD itself and any other PIDs declared in the profile), and provenance metadata pointing to ORCID as the data source.

Author–organization affiliation relations (AuthorAffiliation) are derived from the employment records in the ORCID dump. Only employment records carrying a ROR identifier for the affiliated organization are considered. Where the organization matching the ROR identifier has been merged into a deduplicated organization in the graph, the relation is redirected to the canonical (merged) organization identifier. Multiple employment records for the same person–organization pair are consolidated into a single relation, and their employment periods (start date, end date) are merged into one deduplicated list.
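The consolidation described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the record fields (`orcid`, `ror`, `start`, `end`), the `canonical_org` lookup, and all identifiers in the sample data are assumptions made for the example.

```python
from collections import defaultdict


def consolidate_affiliations(employments, canonical_org):
    """Build one relation per person-organization pair.

    employments: iterable of dicts with hypothetical keys
                 'orcid', 'ror', 'start', 'end'.
    canonical_org: hypothetical lookup from a ROR-based organization id
                   to the id of the merged organization representing it.
    """
    periods = defaultdict(set)
    for rec in employments:
        ror = rec.get("ror")
        if not ror:
            continue  # employments without a ROR identifier are ignored
        org = canonical_org.get(ror, ror)  # redirect to the merged id if any
        # A set removes identical (start, end) periods for the same pair.
        periods[(rec["orcid"], org)].add((rec.get("start"), rec.get("end")))

    relations = []
    for (person, org), spans in sorted(periods.items()):
        ordered = sorted(spans, key=lambda s: (s[0] or "", s[1] or ""))
        relations.append(
            {"person": person, "organization": org, "periods": ordered}
        )
    return relations


# Illustrative data only: identifiers are placeholders.
employments = [
    {"orcid": "0000-0000-0000-0001", "ror": "ror:a", "start": "2015", "end": "2018"},
    {"orcid": "0000-0000-0000-0001", "ror": "ror:a", "start": "2015", "end": "2018"},
    {"orcid": "0000-0000-0000-0001", "ror": "ror:a-old", "start": "2019", "end": None},
    {"orcid": "0000-0000-0000-0001", "ror": None, "start": "2010", "end": "2012"},
]
canonical_org = {"ror:a": "org:merged-1", "ror:a-old": "org:merged-1"}
relations = consolidate_affiliations(employments, canonical_org)
```

In this sample, the four employment records collapse into a single relation with two distinct periods: the duplicate is dropped, the record without a ROR identifier is skipped, and both ROR identifiers resolve to the same merged organization.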

Person–project participation relations (ProjectParticipation) are sourced from the funder database entries that record the participation of ORCID-identified researchers in funded projects. For each record, the person identifier is derived from the ORCID iD and the project identifier is mapped to the internal OpenAIRE project identifier. Where available, the role held by the researcher within the project is also captured.

Step 2 — Propagation of Project Participation through Deliverables

A researcher may participate in a funded project without being explicitly listed in the project metadata, but may nevertheless be an author of a deliverable or technical report that is linked to that project in the graph. This step exploits that signal to infer additional ProjectParticipation relations.

The process selects publications of specific instance types — deliverables and technical reports — that are connected to a project via an outcome relation. For each such publication, the authors who carry an ORCID or ORCID-pending identifier are identified, and a ProjectParticipation relation is created linking the person to the project associated with the deliverable.
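As a rough sketch of this selection, the snippet below walks over publications, keeps those typed as deliverables or technical reports, and emits one person-project pair per ORCID-identified author. The instance type labels, the PID type labels, and the record layout are assumptions for illustration and may differ from the vocabulary used in the graph.

```python
# Assumed labels; the graph's actual vocabulary may differ.
DELIVERABLE_TYPES = {"deliverable", "technical report"}
ORCID_PID_TYPES = {"orcid", "orcid_pending"}


def infer_participations(publications, outcome_relations):
    """Return the set of (orcid, project_id) pairs inferred from deliverables.

    publications: list of dicts with hypothetical keys 'id',
                  'instance_types' and 'authors' (each author has 'pids').
    outcome_relations: list of (publication_id, project_id) pairs.
    """
    projects_by_pub = {}
    for pub_id, project_id in outcome_relations:
        projects_by_pub.setdefault(pub_id, set()).add(project_id)

    inferred = set()
    for pub in publications:
        if not DELIVERABLE_TYPES & set(pub["instance_types"]):
            continue  # not a deliverable or technical report
        for project_id in projects_by_pub.get(pub["id"], ()):
            for author in pub["authors"]:
                for pid in author.get("pids", []):
                    if pid["type"] in ORCID_PID_TYPES:
                        inferred.add((pid["value"], project_id))
    return inferred


# Illustrative data only.
publications = [
    {"id": "pub1", "instance_types": ["deliverable"],
     "authors": [{"pids": [{"type": "orcid", "value": "0000-0000-0000-0001"}]},
                 {"pids": []}]},
    {"id": "pub2", "instance_types": ["article"],
     "authors": [{"pids": [{"type": "orcid_pending", "value": "0000-0000-0000-0002"}]}]},
    {"id": "pub3", "instance_types": ["technical report"],
     "authors": [{"pids": [{"type": "orcid_pending", "value": "0000-0000-0000-0002"}]}]},
]
outcome_relations = [("pub1", "projA"), ("pub2", "projA"), ("pub3", "projB")]
inferred = infer_participations(publications, outcome_relations)
```

The article `pub2` is linked to a project but contributes nothing, because only deliverables and technical reports are treated as a participation signal.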

The newly inferred relations are then merged with the participation relations produced in Step 1. Where a person–project pair already has a participation record (for example, one derived directly from project metadata), the two records are reconciled: the explicit record takes precedence, but the inferred one contributes any additional information not already present. The merged dataset replaces the previous participation dataset.
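One way to picture the reconciliation is a field-by-field merge in which explicit values win and inferred values only fill gaps. The sketch below assumes each record is a dictionary keyed by the person-project pair; the field names `role` and `source` are hypothetical.

```python
def merge_participations(explicit, inferred):
    """Merge two dicts keyed by (person, project) into one.

    Values are dicts of attributes. Explicit attributes take precedence;
    inferred attributes are used only where the explicit record has no value.
    """
    merged = {}
    for key in explicit.keys() | inferred.keys():
        record = dict(inferred.get(key, {}))  # start from the inferred record
        for field, value in explicit.get(key, {}).items():
            # Overwrite unless the explicit value is missing and the
            # inferred record already supplies something for this field.
            if value is not None or field not in record:
                record[field] = value
        merged[key] = record
    return merged


# Illustrative data only.
explicit = {("person1", "projA"): {"role": "coordinator", "source": None}}
inferred = {
    ("person1", "projA"): {"role": None, "source": "deliverable"},
    ("person2", "projA"): {"role": None, "source": "deliverable"},
}
merged = merge_participations(explicit, inferred)
```

Here `person1` keeps the explicit role and gains the inferred `source`, while `person2` appears only because of the inferred relation.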

Step 3 — Authorship and Co-authorship Relations

This step produces Authorship relations, which connect a Person to a research output, and CoAuthorship relations, which connect two persons who have co-authored at least one product. It draws on two complementary sources of information.

The first source is the OpenAIRE graph itself: for every result that has at least one author carrying an ORCID or ORCID-pending identifier, an Authorship relation is created between the person and the result.

The second source is a set of affiliation-enriched records that have been processed by affro, an affiliation resolution algorithm that takes raw affiliation strings declared by authors and attempts to match them against organization identifiers (ROR or OpenOrgs). These enriched records are joined with the graph results using AIDeR, the OpenAIRE person name matching algorithm, which establishes whether the author in the graph result and the author in the enriched record are the same individual. When a match is found, the graph author carries an ORCID, and the enriched record carries affiliation information, an Authorship relation is created that includes a DeclaredAffiliation property: the raw affiliation string as declared by the author, together with the matched organization identifier(s), confidence score, and provenance. Additional author-level metadata captured in this step includes the author's role (using the CRediT taxonomy where available) and whether the author is the corresponding author.

Since the same deduped graph record may originate from the merging of multiple original records, and each of those originals may carry affiliation information from different sources, the authorship records are grouped by person–product pair and reconciled before being written to the graph. The reconciliation retains the richest available affiliation information and resolves conflicts where the same affiliation appears with different levels of confidence or from different sources.
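The text does not spell out the reconciliation policy, so the following is only one plausible reading, stated as an assumption: within each person-product pair, affiliations are keyed by matched organization, and when the same organization appears more than once the entry with the highest confidence is kept. The record layout and field names are hypothetical.

```python
def reconcile_authorships(records):
    """Group authorship records by (person, product) and merge affiliations.

    Assumed policy (not confirmed by the source): keep one affiliation per
    matched organization, preferring the highest confidence score.
    """
    best = {}
    for rec in records:
        per_org = best.setdefault((rec["person"], rec["product"]), {})
        for aff in rec.get("affiliations", []):
            current = per_org.get(aff["org"])
            if current is None or aff["confidence"] > current["confidence"]:
                per_org[aff["org"]] = aff
    return {
        pair: sorted(per_org.values(), key=lambda a: a["org"])
        for pair, per_org in best.items()
    }


# Illustrative data only: two original records merged into one graph record.
records = [
    {"person": "A", "product": "r1", "affiliations": [
        {"raw": "Univ. X", "org": "ror:x", "confidence": 0.7, "source": "datacite"},
    ]},
    {"person": "A", "product": "r1", "affiliations": [
        {"raw": "University X, Dept. Y", "org": "ror:x", "confidence": 0.9, "source": "publishers"},
        {"raw": "Inst. Z", "org": "ror:z", "confidence": 0.6, "source": "publishers"},
    ]},
]
reconciled = reconcile_authorships(records)
```

Other policies are equally conceivable, for example ranking sources before comparing confidence, so this should be read as a shape for the grouping step rather than a description of the actual rules.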

CoAuthorship relations are derived directly from the Authorship dataset: any two persons who appear as authors of the same product are considered co-authors. Relations are aggregated across all shared products, and the coauthoredProducts counter records how many products the pair has co-authored.
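The derivation amounts to enumerating author pairs per product and counting how often each pair occurs. A minimal sketch, assuming authorships are available as (person, product) pairs and that each pair of persons is emitted once with the identifiers in sorted order (the `source` and `target` field names are hypothetical; `coauthoredProducts` is the counter named above):

```python
from collections import Counter, defaultdict
from itertools import combinations


def derive_coauthorships(authorships):
    """authorships: iterable of (person_id, product_id) pairs."""
    authors_by_product = defaultdict(set)
    for person, product in authorships:
        authors_by_product[product].add(person)

    counts = Counter()
    for authors in authors_by_product.values():
        # Sorting makes (a, b) and (b, a) count as the same pair.
        for a, b in combinations(sorted(authors), 2):
            counts[(a, b)] += 1

    return [
        {"source": a, "target": b, "coauthoredProducts": n}
        for (a, b), n in sorted(counts.items())
    ]


# Illustrative data only.
authorships = [
    ("A", "p1"), ("B", "p1"), ("C", "p1"),
    ("A", "p2"), ("B", "p2"),
]
coauthorships = derive_coauthorships(authorships)
```

Note that a product with n identified authors yields n(n-1)/2 pairs, so products with very long author lists dominate the size of this dataset.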

All relations produced in this step are tagged with OpenAIRE as the collector and carry a provenance trust score of 0.85.

Step 4 — Enrichment of Person Entities with Bibliometric Indicators

Once authorship relations are in place, Person entities are enriched with two aggregate bibliometric indicators: total downloads and total citations. These are computed by propagating the usage and citation metrics already associated with individual research outputs onto the persons who authored them.

For each result type in the graph (publications, datasets, software, other research products), results that have at least one author with an ORCID or ORCID-pending identifier are selected. For each such result that also carries usage metrics, an intermediate record is emitted for each ORCID-identified author, associating that ORCID with the download count and citation count of the result.

These per-result contributions are then aggregated by ORCID: the download and citation counts are summed across all results attributed to the same researcher. The resulting aggregate indicators are joined back to the Person entities and stored as measures on the person record. Persons with no computable metrics receive no measure entries rather than zero values.
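The two phases, emitting per-result contributions and summing them per ORCID, can be sketched as below. The record layout (`orcids`, `metrics`, `downloads`, `citations`, `measures`) is assumed for the example; results without metrics are skipped, and persons without any contribution are left without a `measures` entry.

```python
from collections import defaultdict


def aggregate_by_orcid(results):
    """Sum downloads and citations per ORCID across all results."""
    totals = defaultdict(lambda: {"downloads": 0, "citations": 0})
    for result in results:
        metrics = result.get("metrics")
        if not metrics:
            continue  # result carries no usage metrics
        for orcid in set(result["orcids"]):
            totals[orcid]["downloads"] += metrics.get("downloads", 0)
            totals[orcid]["citations"] += metrics.get("citations", 0)
    return dict(totals)


def attach_measures(persons, totals):
    """Join totals back to persons; no entry is added when nothing was computed."""
    enriched = []
    for person in persons:
        record = dict(person)
        if person["orcid"] in totals:
            record["measures"] = dict(totals[person["orcid"]])
        enriched.append(record)
    return enriched


# Illustrative data only.
results = [
    {"orcids": ["A", "B"], "metrics": {"downloads": 10, "citations": 3}},
    {"orcids": ["A"], "metrics": {"downloads": 5, "citations": 1}},
    {"orcids": ["C"], "metrics": None},
]
totals = aggregate_by_orcid(results)
persons = attach_measures([{"orcid": "A"}, {"orcid": "B"}, {"orcid": "C"}], totals)
```

Leaving `measures` absent for person `C`, rather than writing zeros, keeps "no data available" distinguishable from "measured and found to be zero".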

Step 5 — Removal of Isolated Person Nodes

The final step ensures the graph does not contain Person entities that are not connected to any other node. A person is considered non-isolated if it appears in at least one relation of any of the following types: Authorship, CoAuthorship, ProjectParticipation, or AuthorAffiliation.

In addition to pruning isolated persons, this step also enforces referential integrity across all person-related relation types. For each relation type, only those relations whose endpoints both resolve to existing entities in the graph are retained:

Authorship relations are kept only if both the person and the product exist in the graph.
CoAuthorship relations are kept only if both persons exist in the graph.
ProjectParticipation relations are kept only if both the person and the project exist in the graph.
AuthorAffiliation relations are kept only if both the person and the organization exist in the graph.

The cleaned relation sets overwrite the previous versions, and the person dataset is then filtered to retain only those persons who appear in at least one of the surviving relations. This guarantees that every Person node in the final graph is reachable from at least one other entity.
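The whole step can be pictured as two filters applied in sequence: first the relations, then the persons. The sketch below models entities as sets of identifiers and relations as (source, target) pairs with the person as source; this representation is an assumption made for the example.

```python
def prune(persons, products, projects, organizations, relations):
    """Return (surviving_persons, cleaned_relations).

    persons, products, projects, organizations: sets of entity ids.
    relations: dict mapping a relation type to a list of (source, target)
               pairs, where the source is always a person id.
    """
    valid_targets = {
        "Authorship": products,
        "CoAuthorship": persons,
        "ProjectParticipation": projects,
        "AuthorAffiliation": organizations,
    }
    # Referential integrity: keep relations whose endpoints both exist.
    cleaned = {
        rel_type: [
            (s, t) for s, t in relations.get(rel_type, [])
            if s in persons and t in targets
        ]
        for rel_type, targets in valid_targets.items()
    }
    # A person survives if it appears in at least one surviving relation.
    connected = set()
    for rel_type, pairs in cleaned.items():
        for s, t in pairs:
            connected.add(s)
            if rel_type == "CoAuthorship":
                connected.add(t)  # both endpoints are persons
    return persons & connected, cleaned


# Illustrative data only.
persons = {"p1", "p2", "p3", "p4"}
relations = {
    "Authorship": [("p1", "r1"), ("p2", "r-missing")],
    "CoAuthorship": [("p1", "p3"), ("p1", "p-missing")],
    "ProjectParticipation": [("p4", "prj-missing")],
    "AuthorAffiliation": [],
}
surviving, cleaned = prune(persons, {"r1"}, {"prj1"}, {"org1"}, relations)
```

In this sample, `p2` and `p4` are removed because their only relations point to entities that do not exist, which shows why the integrity check has to run before the isolation check.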

Entity and Relation Types

Type	Description
`Person`	A researcher identified by ORCID
`Authorship`	Links a person to a research output they authored
`CoAuthorship`	Links two persons who have co-authored one or more products
`ProjectParticipation`	Links a person to a funded project they participated in
`AuthorAffiliation`	Links a person to an organization they have been affiliated with

Data Sources

Source	Used for
ORCID public data dump	Person profiles, author–organization affiliations
Project participation database	Direct person–project participation records
OpenAIRE graph (results + relations)	Authorship, indirect project participation via deliverables, bibliometric indicators
affro-enriched records (oaire, oalex, publishers, datacite)	Declared affiliations in authorship relations
