Enrichment from ORCID
OpenAIRE enhances publication metadata by incorporating author information from ORCID. This involves adding persistent identifiers to authors and leveraging ORCID data to improve author disambiguation.
How does the enrichment work?
The following steps outline how ORCID information is integrated into the OpenAIRE Graph:
Extracting Author and Work Information and creating ORCID-Work pairs
OpenAIRE extracts the following from ORCID profiles:
- Author information: ORCID, family name, given name, other names, credit name
- Work information: Persistent identifiers (DOI, PMC, PMID, arXiv, handle)
For each work identified by a persistent identifier (PID), a pair is created linking the ORCID to the work PID. For example, if an ORCID profile (orcid1) has a DOI (doi1) and a PMC (pmc1) associated with it, the following pairs are generated:
- P1: <orcid1, doi1>
- P2: <orcid1, pmc1>
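The pair creation above can be sketched in a few lines of Python. This is only an illustration: the `Pid` and `OrcidAuthor` types and the `make_pairs` helper are hypothetical names, not part of OpenAIRE's code, and the identifiers are the placeholders used in the example.

```python
from typing import NamedTuple

class Pid(NamedTuple):
    schema: str  # e.g. "doi", "pmc", "pmid", "arxiv", "handle"
    value: str

class OrcidAuthor(NamedTuple):
    orcid: str
    given_name: str
    family_name: str

def make_pairs(author, work_pids):
    """Create one <author, pid> pair for each work PID in the profile."""
    return [(author, pid) for pid in work_pids]

# Placeholder values mirroring the example: orcid1 with doi1 and pmc1.
author = OrcidAuthor("orcid1", "Given", "Family")
pairs = make_pairs(author, [Pid("doi", "doi1"), Pid("pmc", "pmc1")])
```

Each element of `pairs` corresponds to one of the pairs P1 and P2 listed above.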
Grouping by work persistent identifier
ORCID-Work pairs are grouped by the work's persistent identifier to identify multiple authors contributing to the same work. For example, two ORCIDs (orcid1 and orcid2) associated with the same DOI (doi1) result in a structure like:
<doi1, [orcid1, orcid2]>
Note:
- The term "orcidx" refers to a structure containing the ORCID identifier along with the author's name information (family name, given name, other names, and credit name) as extracted from the ORCID profile.
- The term "doix" refers to a structure containing the schema and value of the persistent identifier. In the example, "doix" is <"doi", "10....">
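A minimal sketch of the grouping step, assuming the pairs are available as an in-memory list. The `group_by_pid` helper is a hypothetical name, and the plain strings and tuples stand in for the "orcidx" and "doix" structures described in the note.

```python
from collections import defaultdict

def group_by_pid(pairs):
    """Group <orcid, pid> pairs into <pid, [orcid, ...]> structures."""
    grouped = defaultdict(list)
    for orcid, pid in pairs:
        grouped[pid].append(orcid)
    return dict(grouped)

# A PID is represented here as a (schema, value) tuple.
pairs = [
    ("orcid1", ("doi", "doi1")),
    ("orcid1", ("pmc", "pmc1")),
    ("orcid2", ("doi", "doi1")),
]
grouped = group_by_pid(pairs)
```

With this input, the entry for `("doi", "doi1")` holds both orcid1 and orcid2, while the entry for `("pmc", "pmc1")` holds only orcid1.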
Matching with the Graph result and enriching the author metadata
For each persistent identifier pair, OpenAIRE searches for a corresponding result in the Graph based on the pair's schema and value. Once a match is found, OpenAIRE attempts to identify the corresponding authors within the result by comparing them to the authors listed in the ORCID profile. This process employs an algorithm called author name disambiguation to establish the correct matches. Successful matches allow OpenAIRE to enrich the result's author information with the ORCID identifier from the profile.
Author name disambiguation algorithm
The process involves comparing authors from two sets: those extracted from the graph (graph authors) and those derived from ORCID profiles (ORCID authors) that share the same persistent identifier pair. For each graph author, the algorithm iterates through the following matching strategies, ordered by decreasing confidence:
- Exact fullname match: If the full name of a graph author exactly matches the full name (constructed by concatenating the author given name and family name) of one author in the ORCID list, a match is found.
- Exact reversed fullname match: Similar to the previous strategy, but the ORCID full name is constructed by concatenating family name and given name.
- Ordered token match: Author names are tokenized into individual words. The tokens are then sorted and compared for exact matches or abbreviations. This strategy is applied to names with at least two words whose word counts differ by two or less. It allows for variability in the name; a worked example is given in the following.
- Exact match of ORCID credit name: If the graph author's full name matches an ORCID author's credit name, a match is found.
- Exact match of ORCID other names: The graph author's full name is compared to each other name listed in the ORCID profile.
Upon identifying a match, the graph author's information is enriched with the corresponding ORCID data, and the matched ORCID author is removed from the comparison list. This process continues until no further matches can be found.
By applying this multi-faceted approach, OpenAIRE aims to maximize the accuracy of author identification and linking.
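The cascade of strategies and the match-and-remove behaviour can be sketched as follows. This is an illustrative sketch, not OpenAIRE's implementation: all names (`OrcidAuthor`, `match_authors`, the strategy functions) are hypothetical, the ordered token match is left out to keep the example short, and each strategy is assumed to run as one pass over the still-unmatched graph authors, as in the worked example later in this page.

```python
from dataclasses import dataclass, field

@dataclass
class OrcidAuthor:
    orcid: str
    given_name: str
    family_name: str
    credit_name: str = ""
    other_names: list = field(default_factory=list)

def exact_fullname(name, author):
    return name == f"{author.given_name} {author.family_name}"

def reversed_fullname(name, author):
    return name == f"{author.family_name} {author.given_name}"

def credit_name(name, author):
    return bool(author.credit_name) and name == author.credit_name

def other_names(name, author):
    return name in author.other_names

# Ordered by decreasing confidence; the ordered token match would sit
# between reversed_fullname and credit_name and is omitted for brevity.
STRATEGIES = [exact_fullname, reversed_fullname, credit_name, other_names]

def match_authors(graph_names, orcid_authors, strategies=STRATEGIES):
    """Return {index of graph author: ORCID} for every match found."""
    remaining = list(orcid_authors)
    matched = {}
    for strategy in strategies:
        for idx, name in enumerate(graph_names):
            if idx in matched:
                continue
            for candidate in remaining:
                if strategy(name, candidate):
                    matched[idx] = candidate.orcid
                    remaining.remove(candidate)  # each ORCID author is used once
                    break
    return matched

graph = ["Robert Stein", "Marek Kowalski", "Albert Kong"]
orcid = [
    OrcidAuthor("orcid1", "Marek", "Kowalski"),
    OrcidAuthor("orcid2", "Kong", "Albert"),
]
result = match_authors(graph, orcid)
```

Here `result` maps the second graph author to orcid1 (exact full name) and the third to orcid2 (reversed full name), while Robert Stein stays unmatched.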
Author name disambiguation example
Consider the following author lists:
- Graph List: Robert Stein, Sjoert van Velzen, Marek Kowalski, Anna Franckowiak, James C. A. Miller-Jones, Sara Frederick, Itai Sfaradi, Assaf Horesh, Albert Kong, Ryan Foley
- ORCID List: Marek Kowalski, Itai Sfaradi, James Carl Miller-Jones, Assaf Horesh, Kong Albert, Ryan Foley
The graph list contains the full names of the authors as found in the metadata. Any potential ambiguities in splitting names into components (like first name and last name) are addressed by the first three steps. The ORCID list names are expressed as the concatenation of the given name and the family name as provided in the ORCID profile (i.e. for "Kong Albert", Kong is the given name and Albert is the family name in the ORCID profile). For simplicity, other names and credit names are excluded from this list, since the corresponding strategies reduce to an exact match comparison.
Algorithm Application
First, the Exact fullname match strategy is applied. Each graph author's full name is compared to every full name in the ORCID list until a match is found. A full name in the ORCID list is constructed by concatenating the given name and family name in the order provided. If an exact match is found, the ORCID identifier is used to enrich the corresponding graph author's record, and the ORCID author is removed from the list for subsequent comparisons. By applying this strategy, a match is found for Marek Kowalski, Itai Sfaradi, Assaf Horesh, and Ryan Foley.
Then the Exact reversed fullname match strategy is applied to the graph and ORCID authors that were not matched in the previous step:
- Graph List: Robert Stein, Sjoert van Velzen, Anna Franckowiak, James C. A. Miller-Jones, Sara Frederick, Albert Kong
- ORCID List: James Carl Miller-Jones, Kong Albert
The process is similar to step one, but the ORCID full name is constructed by reversing the order of given name and family name. This step accommodates variation in name formatting. As before, if an exact match is found, the ORCID identifier is used to update the metadata of the graph author, and the ORCID author is removed from the list for subsequent comparisons. With this strategy, a match is found for Albert Kong.
The third step is the application of the Ordered token match strategy to the authors that remain to be matched. Before looking at a running example, let us describe how the strategy works.
The tokens from the two lists are pairwise compared. The outcome of each comparison falls into one of three categories:
- No Match: This occurs when the initial characters of the compared tokens differ, or when the entire words don't match despite sharing the same starting character. In the latter case the mismatch indicates that the authors are different, and the comparison process terminates.
- Short Match: A short match happens when both tokens begin with the same character, but one token consists solely of that character.
- Long Match: The two compared words correspond exactly.
When a no match is caused by different initial characters, the algorithm advances in the list whose current token is lexicographically lower and compares that list's next token with the current token of the other list. This makes the comparison tolerant of words missing from one of the two names.
A successful match (short or long) advances the comparison to the next token in both lists. This iterative process continues until either the comparison terminates with a no match or both token lists have been exhausted.
If both lists have been exhausted, a match is found if:
- At least one long match exists
- The sum of short and long matches equals the length of the shorter token list, indicating that all the words in the shorter list have a match in the longer one.
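The comparison rules above can be sketched as a two-pointer walk over the sorted token lists. This is a sketch under stated assumptions, not OpenAIRE's code: `tokens` and `ordered_token_match` are hypothetical names, names are split on whitespace and dots (hyphenated words stay whole), and the acceptance criteria are evaluated as soon as either list runs out, so leftover tokens in the longer list count as unmatched.

```python
import re

def tokens(name):
    """Split a name on whitespace and dots, then sort the tokens."""
    return sorted(t for t in re.split(r"[\s.]+", name) if t)

def ordered_token_match(name_a, name_b, max_len_diff=2):
    a, b = tokens(name_a), tokens(name_b)
    # Applied only to names with at least two words and similar length.
    if len(a) < 2 or len(b) < 2 or abs(len(a) - len(b)) > max_len_diff:
        return False
    i = j = 0
    short_matches = long_matches = 0
    while i < len(a) and j < len(b):
        x, y = a[i], b[j]
        if x[0] != y[0]:
            # Different initials: skip the lexicographically lower token.
            if x < y:
                i += 1
            else:
                j += 1
        elif len(x) == 1 or len(y) == 1:
            # Same initial and one token is only that character.
            short_matches += 1
            i += 1
            j += 1
        elif x == y:
            long_matches += 1
            i += 1
            j += 1
        else:
            # Same initial but different words: the authors differ.
            return False
    return (long_matches >= 1
            and short_matches + long_matches == min(len(a), len(b)))

matched = ordered_token_match("James C. A. Miller-Jones", "James Carl Miller-Jones")
```

For the two names in the call above, the walk skips `A`, records a short match for `C` against `Carl`, and two long matches for `James` and `Miller-Jones`, so `matched` is `True`.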
Going back to the example, the authors that remain to find a match for are:
- Graph List: Robert Stein, Sjoert van Velzen, Anna Franckowiak, James C. A. Miller-Jones, Sara Frederick
- ORCID List: James Carl Miller-Jones
Let us consider directly the names that can be matched by this strategy:
- graph name = James C. A. Miller-Jones
- ORCID name = James Carl Miller-Jones
The two names are broken down into individual words, or tokens, which are sorted alphabetically to standardize the comparison process:
- graph = A C James Miller-Jones
- ORCID = Carl James Miller-Jones
The comparison process works as follows:
- A and Carl are compared. No match, since the initial characters are different. The graph list is moved one step ahead for the next comparison.
- C and Carl are compared. A short match is detected, since both start with the same character and the graph word is only that character. Both lists are moved one step ahead for the next comparison.
- James and James are compared. A long match is detected. Both lists are moved one step ahead for the next comparison.
- Miller-Jones and Miller-Jones are compared. A long match is found. The lists are exhausted and the computation ends.
Since at least one long match exists and the sum of long and short matches equals the length of the shorter list, the match is confirmed and the graph author can be enriched with the ORCID information.
The ORCID list is empty after the application of the third strategy, and the author name disambiguation process ends.
Note: the remaining two strategies (credit name and other names) reduce to an exact name match, so they are not illustrated here.
Note: Even if the third strategy can subsume the first two, they are applied before it for efficiency. In this way, a match can be claimed as soon as the first pair of matching names is found. If only the third strategy were applied, all the comparisons would have to be done and a way to determine the best match would be needed before claiming a match. Example:
- graph = Mario Enrico Rossi, Mario Rossi
- ORCID = Mario Rossi
Applying only the third strategy, we would associate Mario Rossi's ORCID with Mario Enrico Rossi if this one came first in the author list.
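The effect of the strategy order on this example can be reproduced with a small sketch. The function names are hypothetical, and `loose_token_match` is a deliberately simplified stand-in for the ordered token match (it only checks that every word of the shorter name appears in the longer one), which is enough to show the behaviour on these two names.

```python
def exact_match(graph_name, orcid_name):
    return graph_name == orcid_name

def loose_token_match(graph_name, orcid_name):
    # Simplified stand-in for the ordered token match: every word of the
    # shorter name must appear in the longer one (no abbreviation handling).
    a, b = set(graph_name.split()), set(orcid_name.split())
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return shorter <= longer

def first_match(graph_names, orcid_name, strategies):
    """Return the first graph name accepted, trying strategies in order."""
    for strategy in strategies:
        for name in graph_names:
            if strategy(name, orcid_name):
                return name
    return None

graph = ["Mario Enrico Rossi", "Mario Rossi"]
only_third = first_match(graph, "Mario Rossi", [loose_token_match])
cascade = first_match(graph, "Mario Rossi", [exact_match, loose_token_match])
```

With only the token-based strategy, `only_third` is the first author in the list, Mario Enrico Rossi; with the exact match applied first, `cascade` is Mario Rossi.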