Affiliation matching

Short description: The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers). Depending on the data source, we currently employ two distinct methodologies:

The first method revolves around affiliations extracted from PDF and XML documents, which are subsequently matched with organizations within the OpenAIRE database.
The second concerns affiliations retrieved from platforms such as Crossref, PubMed, and DataCite, which are matched to organizations in the ROR database.

Algorithmic details of the first method

The buckets concept

To get the best possible results, the algorithm would ideally compare every affiliation with every organization. However, this approach would be very slow, because it would involve processing the Cartesian product (all possible pairs) of millions of affiliations and thousands of organizations. To avoid this, IIS introduces the concept of buckets. A bucket is a smaller group of affiliations and organizations that have been selected to be matched with one another. The matching algorithm compares only those affiliations and organizations that belong to the same bucket.
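To illustrate why bucketing reduces the work, the following minimal Python sketch groups affiliations and organizations by a hypothetical bucket key (the first lowercased token of the name) and generates pairs only within each bucket. The key function and the sample data are assumptions made for illustration; they are not the grouping rules that IIS actually uses.

```python
from collections import defaultdict
from itertools import product


def bucket_key(name: str) -> str:
    """Hypothetical bucket key: the first lowercased token of the name."""
    tokens = name.lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(affiliations, organizations):
    """Yield only the (affiliation, organization) pairs that share a bucket."""
    buckets = defaultdict(lambda: ([], []))
    for aff in affiliations:
        buckets[bucket_key(aff)][0].append(aff)
    for org in organizations:
        buckets[bucket_key(org)][1].append(org)
    for affs, orgs in buckets.values():
        yield from product(affs, orgs)


affiliations = ["Warsaw University, Poland", "Athens Medical School"]
organizations = ["Warsaw University", "Athens University", "Oxford University"]

pairs = list(candidate_pairs(affiliations, organizations))
# The full Cartesian product would contain 2 * 3 = 6 pairs;
# bucketing keeps only the pairs whose names share a first token.
print(len(pairs), "pairs instead of", len(affiliations) * len(organizations))
```

A key this coarse would miss many true matches, which is one reason for combining several grouping methods in one matching process.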

Affiliation matching process

Every affiliation in a given bucket is compared with every organization in the same bucket multiple times, each time using a different algorithm (voter). Each voter is assigned a number (match strength) that describes the estimated correctness of its comparison results. Every affiliation-organization pair matched by at least one voter is assigned a match strength greater than 0; the actual value depends on the voters, and its calculation is described later in this section.

It is very important that the algorithm groups the affiliations and organizations properly, i.e. those that have a chance to match should end up in the same bucket. To support this, the affiliation matching module allows different methods of dividing the affiliations and organizations into buckets to be defined, and all of these methods can be used in a single matching process. The service that groups the affiliations and organizations into buckets by a specific method and then joins them into pairs is called a Joiner.

Every joiner can be linked with many different voters, which decide whether the joined affiliation-organization pairs match. By providing new joiners and voters, one can extend the matching algorithm with new methods for matching affiliations with organizations, thus adjusting the algorithm to specific needs.

All the affiliations and organizations are processed sequentially by all the matchers. In every matcher, a joiner groups them into pairs, and these pairs are then processed by all the voters in the matcher. Every affiliation-organization pair that has been matched at least once is assigned a match strength that depends on the match strengths of the voters that indicated the pair is a match.

NOTE: Many organizations can be matched with a given affiliation, each with a different match strength. The user of the module can set a match strength threshold, which limits the results to those matches whose match strength is greater than the threshold.

Calculation of the match strength of the affiliation-organization pair matched by multiple matchers

It often happens that the given affiliation-organization pair is returned as a match by more than one matcher, each time with a different match strength. In such a case the match with the highest match strength will be selected.
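The selection rule above, together with the user-defined threshold mentioned in the note, can be sketched in a few lines of Python. The function name, the tuple layout, and the identifiers are hypothetical and serve only as an illustration.

```python
def select_matches(matcher_results, threshold=0.0):
    """Keep the highest strength per (affiliation, organization) pair.

    matcher_results: iterable of (affiliation_id, organization_id, strength)
    tuples; the same pair may appear once per matcher that returned it.
    Only pairs with a strength strictly greater than `threshold` are kept.
    """
    best = {}
    for aff_id, org_id, strength in matcher_results:
        key = (aff_id, org_id)
        if strength > best.get(key, 0.0):
            best[key] = strength
    return {pair: s for pair, s in best.items() if s > threshold}


results = [
    ("aff-1", "org-1", 0.6),  # returned by one matcher
    ("aff-1", "org-1", 0.9),  # same pair, returned by another matcher
    ("aff-1", "org-2", 0.4),  # weaker match with a different organization
]
selected = select_matches(results, threshold=0.5)
```

With the sample data, only the pair `("aff-1", "org-1")` remains, carrying the higher of its two strengths.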

Calculation of the match strength of the affiliation-organization pair within a single matcher

Every voter has a match strength in the range (0, 1]. A voter's match strength is the ratio of correct matches to all matches reported by that voter; it is estimated from real data and hundreds of matches prepared by hand.

The match strength of a given affiliation-organization pair is based on the match strengths of all the voters in the matcher that indicated the pair is a match. It is always less than or equal to 1 and greater than or equal to the match strength of each single voter that matched the pair.

The total match strength is calculated in such a way that each consecutive voter reduces (by its match strength) the gap of uncertainty about the correctness of the given match.
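One reading of this rule is that the uncertainty gap starts at 1 and every matching voter with strength s multiplies it by (1 - s), so that the total strength equals 1 minus the product of all (1 - s). The Python sketch below implements that interpretation; the function name is hypothetical and the formula is inferred from the description above rather than taken from the module's code.

```python
def combined_match_strength(voter_strengths):
    """Combine the strengths of all voters that matched a pair.

    Assumed interpretation: each voter shrinks the remaining
    uncertainty gap by a factor of (1 - strength).
    """
    gap = 1.0
    for strength in voter_strengths:
        if not 0.0 < strength <= 1.0:
            raise ValueError("voter match strength must be in (0, 1]")
        gap *= 1.0 - strength
    return 1.0 - gap


# Two voters with strengths 0.6 and 0.8:
# the gap goes 1.0 -> 0.4 -> 0.08, so the total is about 0.92.
total = combined_match_strength([0.6, 0.8])
```

Under this interpretation the result never exceeds 1 and is never lower than the strongest single voter, which agrees with the bounds stated above.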

Parameters:

input
- input_document_metadata: ExtractedDocumentMetadata avro datastore location. Document metadata is the source of affiliations.
- input_organizations: Organization avro datastore location.
- input_document_to_project: DocumentToProject avro datastore location with imported document-to-project relations. These relations (along with inferred document-project and project-organization relations) are used to generate document-organization pairs, which are used as a hint for matching affiliations.
- input_inferred_document_to_project: DocumentToProject avro datastore location with inferred document-to-project relations.
- input_project_to_organization: ProjectToOrganization avro datastore location. These relations (along with imported and inferred document-project relations) are used to generate document-organization pairs, which are used as a hint for matching affiliations.
output
- MatchedOrganization avro datastore location with publications matched to organizations.

Limitations: -

Environment: Java, Spark

References: -

Authority: ICM • License: AGPL-3.0 • Code: CoAnSys/affiliation-organization-matching

Algorithmic details of the second method

Overview

run_affro is the main entry point of the AffRo affiliation-matching pipeline. It takes a raw affiliation string (as it appears in a publication metadata record) and returns a list of matched organizations with their identifiers (ROR and/or OpenOrgs), confidence scores, status and location information.

from core import run_affro

results = run_affro("Department of Physics, University of Milan, Italy")

High-Level Pipeline

raw affiliation string
        │
        ▼
 direct_mapping()          ← fast rule-based lookup for known institute families
        │
        ├─ match found ──► run_affro_(shortened_aff)  +  direct results
        │
        └─ no match ─────► run_affro_(raw_aff)
                                │
                                ▼
                        normalise & stem
                                │
                         dix_name lookup   ← exact key lookup
                                │
              ┌─────────────────┼────────────────────┐
              │                                      │
        single match                          multiple matches
              │                                      │
              ▼                                      ▼
    build_result_list()     ◀---- filter by 'first'/'top_level'/'parent'
      (algorithm path)                  
                                                     │
                                              still ambiguous?
                                                     │
                                                     ▼
                                            (algorithm fallback)

Stage 1 — direct_mapping(aff)

File: helpers/direct_mapping.py

A fast, rule-based pre-processor that recognises affiliation strings belonging to specific institute families (Fraunhofer, CNR, Max Planck, Helmholtz, Leibniz, FORTH, Demokritos, IRCCS, …).

What it does

Produces a normalized, stemmed version of the raw string.
Checks for the presence of family-specific keywords (e.g. fraunhofer, cnr, max planck).
For each recognized family, iterates over pre-built sorted key lists (longest keys first for specificity) and checks whether the key appears close enough to the trigger word using a character-distance heuristic:
```
distance(aff, trigger_word, key) < len(key) + len(trigger_word) + threshold
```
When a key matches, appends its ROR/OpenOrgs ID to assigned and removes the matched substring from the affiliation string (producing shorten_aff).
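The closeness check can be sketched as follows. The source does not define `distance`, so this sketch assumes it is the length of the smallest character window that covers the first occurrence of both the trigger word and the key; `span_distance`, `key_is_close`, and the default threshold of 10 are hypothetical names and values chosen for illustration.

```python
def span_distance(aff: str, trigger_word: str, key: str):
    """Assumed definition of the distance: size of the smallest window
    covering the first occurrence of both substrings, or None if either
    substring is missing from the affiliation."""
    t = aff.find(trigger_word)
    k = aff.find(key)
    if t < 0 or k < 0:
        return None
    start = min(t, k)
    end = max(t + len(trigger_word), k + len(key))
    return end - start


def key_is_close(aff: str, trigger_word: str, key: str, threshold: int = 10) -> bool:
    """Apply the inequality from the text to the assumed distance."""
    d = span_distance(aff, trigger_word, key)
    return d is not None and d < len(key) + len(trigger_word) + threshold


near = "fraunhofer instit industrial engineering, stuttgart"
far = ("fraunhofer society, department of very long unrelated words, "
       "instit industrial engineering")

print(key_is_close(near, "fraunhofer", "instit industrial engineering"))
print(key_is_close(far, "fraunhofer", "instit industrial engineering"))
```

Under this definition the threshold bounds how many unrelated characters may sit between the trigger word and the key.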

Returns

[list_of_direct_results, shortened_aff_string]
# list_of_direct_results: [] if nothing was matched, otherwise list of result dicts
# shortened_aff_string:   original affiliation with matched parts stripped out

The result dicts from direct mapping use provenance = "affro_direct".

Stage 2 — run_affro_(raw_aff_string)

File: core.py

The core matching logic, applied after (or instead of) direct_mapping.

Step 2.1 — normalization of `raw_aff_string`

File: helpers/functions.py

A lightweight, fast normalization pass that produces a single flat string key used for dix_name lookup. It does not segment the affiliation — that is left for the algorithm path.

Main transformations applied (in order):

| Step | Transformation |
|---|---|
| 1 | `unidecode` (remove accents / transliterate) |
| 2 | `process_parentheses` (keep parens with univ/hospital; drop others) |
| 3 | `replace_comma_spaces`, `replace_double_consonants`, `replace_underscore` |
| 4 | Lowercase, `replace_roman_numerals`, `remove_stop_words` |
| 5 | Remove non-alphanumeric except `,;/:.−` |
| 6 | `remove_multi_digit_numbers` |
| 7 | Replace `:`, `;`, `/`, `—` → `,` |
| 8 | `normalize_organization_names` (stem `university` → `univer`, `institution` → `instit`, etc.) |

Returns: a single normalised string, e.g.:
"univer milan, italy" for input "University of Milan, Italy"

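As a rough illustration of this normalization pass, the sketch below implements a simplified subset of the listed steps using only the Python standard library. It is not the code in helpers/functions.py: `unicodedata` stands in for `unidecode`, the stop-word list is an assumption, only the two stems named above are included, and the order of operations is simplified.

```python
import re
import unicodedata

STOP_WORDS = {"of", "the", "and", "for", "at", "in"}         # assumed list
STEMS = {"university": "univer", "institution": "instit"}    # stems named above


def normalize_sketch(raw: str) -> str:
    """Simplified stand-in for the normalization pass (subset of the steps)."""
    # Strip accents (stdlib stand-in for unidecode).
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Treat ':', ';' and '/' as segment separators, like ','.
    text = re.sub(r"[:;/]", ",", text)
    # Replace any other unexpected character with a space.
    text = re.sub(r"[^a-z0-9,.\s-]", " ", text)
    segments = []
    for segment in text.split(","):
        words = [STEMS.get(w, w) for w in segment.split() if w not in STOP_WORDS]
        if words:
            segments.append(" ".join(words))
    return ", ".join(segments)


print(normalize_sketch("University of Milan, Italy"))
```

For the example input used in this section, the sketch produces the same key, `"univer milan, italy"`.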
Step 2.2 — `dix_name` lookup

dix_name is a dictionary loaded from jsons/dix_name.json.gz.

Structure:

{
  "instit information science techn": [
    {
      "id": "https://ror.org/05kacka20",
      "city": ["pisa"],
      "country": ["italy"],
      "label": "cnr",
      "first": "y"
    },
    ...
  ]
}
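A minimal sketch of how such a dictionary could drive the fast path: an exact key with a single candidate is accepted directly, several candidates are narrowed by the `first` flag, and anything still ambiguous is handed to the algorithm path. The function name, the `"n"` flag value, and the `org:example-*` identifiers are hypothetical; the sketch also leaves out the `top_level` and `parent` filters shown in the pipeline diagram.

```python
def fast_lookup(key, dix_name):
    """Return one candidate for an exact key, or None when the
    algorithm path is needed (no entry, or still ambiguous)."""
    candidates = dix_name.get(key, [])
    if len(candidates) == 1:
        return candidates[0]
    preferred = [c for c in candidates if c.get("first") == "y"]
    if len(preferred) == 1:
        return preferred[0]
    return None


dix_name = {
    # Entry taken from the structure shown above.
    "instit information science techn": [
        {"id": "https://ror.org/05kacka20", "city": ["pisa"],
         "country": ["italy"], "label": "cnr", "first": "y"},
    ],
    # Hypothetical entries; "n" is an assumed value for non-primary candidates.
    "univer example": [
        {"id": "org:example-1", "city": [], "country": [], "label": None, "first": "y"},
        {"id": "org:example-2", "city": [], "country": [], "label": None, "first": "n"},
    ],
    "instit example": [
        {"id": "org:example-3", "city": [], "country": [], "label": None, "first": "y"},
        {"id": "org:example-4", "city": [], "country": [], "label": None, "first": "y"},
    ],
}

hit = fast_lookup("instit information science techn", dix_name)
```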

Algorithm Path — produce_result(input, simU, simG, limit)

Used when the fast path fails. Called with simU=0.42, simG=0.82, limit=500.

`create_df_algorithm(raw_aff_string, radius_u)` — `helpers/create_input.py`

Segments and enriches the affiliation string into a structured input representation.

Steps:

clean_string() — full normalisation (includes insert_space_between_lower_and_upper, replace_newlines_with_space, replace_double_consonants, etc.)
remove_outer_parentheses, remove_leading_numbers
description(clean_aff) → detects countries present in the string
substrings_dict(reduce(clean_aff)) — segments the affiliation on ,;/:| and - and applies normalize_organization_names to each segment
replace_abbr_univ — expands abbreviations like "u Milan" → "univer Milan"
Merges protected terms (e.g. "univer california") with adjacent city/country tokens
Removes city-only or remove-list tokens
shorten_keywords([x], radius_u) — further reduces keywords
valueToCategory(keyword) — classifies each keyword (Academia, Hospitals, Specific, …)

Returns:

[clean_aff, light_aff, aff_list, countries_list, keys_list]
# clean_aff:      normalised full string
# light_aff:      comma-joined list of segments
# aff_list:       list of {index, keywords, category} dicts
# countries_list: detected country names
# keys_list:      special category keys found
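The segmentation step and the shape of `aff_list` can be illustrated with a short sketch. It covers only the split on the listed separators plus a toy classifier; the classifier rules are assumptions and do not reproduce `valueToCategory`.

```python
import re


def toy_category(keyword: str):
    """Hypothetical classifier; the real valueToCategory uses its own rules."""
    if "univer" in keyword:
        return "Academia"
    if "hospital" in keyword:
        return "Hospitals"
    return None


def segment_affiliation(clean_aff: str):
    """Split on , ; / : | and - and build {index, keywords, category} dicts."""
    parts = [p.strip() for p in re.split(r"[,;/:|\-]", clean_aff)]
    return [
        {"index": i, "keywords": p, "category": toy_category(p)}
        for i, p in enumerate(p for p in parts if p)
    ]


aff_list = segment_affiliation("dept physics, univer auckland; new zealand")
```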

`find_name(input, dix_name, simU, simG, limit)` — `helpers/find_name.py`

Matches each keyword segment against dix_name candidates, using similarity scoring.

Steps:

get_candidates(countries_list, keys_list) → restricts the search space by country and special category keys (intersection of dix_country_legalnames and dix_key_legalnames).
For each keyword s:
- If s is directly in candidates → exact "lucky" match (score = 1).
- Otherwise → find_candidate(s, ...) applies cosine similarity / edit-distance scoring against candidates, bounded by simU (universities) or simG (others).
index_multiple_matchings(pairs) detects keywords matched by >1 candidate.
best_sim_score(...) resolves multi-matched keywords using the full clean/light affiliation string.
unique_subset(best0, best1) de-duplicates.

Returns: [[name, score], ...]
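The per-keyword logic (exact match first, otherwise a similarity score checked against a threshold) can be sketched as below. This is an illustration only: `difflib.SequenceMatcher` stands in for the cosine and edit-distance scoring, the rule that a keyword containing `univer` counts as a university is an assumption, and `match_keyword` is a hypothetical name.

```python
from difflib import SequenceMatcher


def match_keyword(keyword, candidates, sim_u=0.42, sim_g=0.82):
    """Return [name, score] for the best candidate, or None.

    An exact hit gets score 1; otherwise the best similarity must reach
    sim_u for university-like keywords or sim_g for everything else.
    """
    if keyword in candidates:
        return [keyword, 1]
    threshold = sim_u if "univer" in keyword else sim_g  # assumed rule
    best_name, best_score = None, 0.0
    for name in candidates:
        score = SequenceMatcher(None, keyword, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return [best_name, round(best_score, 2)]
    return None


candidates = ["univer milan", "univer turin", "max planck society"]
exact = match_keyword("univer milan", candidates)
fuzzy = match_keyword("univer milano", candidates)
miss = match_keyword("national research council", ["max planck society"])
```

The looser threshold for universities reflects the two `sim` parameters described above; a keyword that clears neither threshold yields no match.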

`find_id(aff_input, best_names, dix_name, simG)` — `helpers/find_id.py`

Resolves each matched name to a specific organization ID, disambiguating when a name maps to multiple organizations in different countries/cities.

Disambiguation cascade (in order):

| Step | Strategy |
|---|---|
| 1 | City and Country match |
| 2 | Country direct match |
| 3 | Special country synonyms (US states, UK variants, …) |
| 4 | City match (city not embedded in org name) |
| 5 | Country appears in affiliation |
| 6 | Country appears in both affiliation and org name |
| 7 | Specific/Acronym category → prefer `top_level`, then `parent` |
| 8 | Fallback: `first == 'y'` for non-department, non-lab, non-low-prob-country orgs |

Returns: [[name, score, id], ...] (deduplicated, highest score per ID kept)
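The cascade idea, trying the strictest strategy first and falling through to weaker ones, can be sketched as follows. Only steps 1, 2, 4 and 8 are modelled, in simplified form; the function name, the `"n"` flag value, and the sample identifiers are hypothetical.

```python
def pick_candidates(candidates, aff_countries, aff_cities):
    """Return the candidates kept by the first strategy that keeps any."""
    strategies = [
        # Step 1: both city and country appear in the affiliation.
        lambda c: bool(set(c["country"]) & aff_countries)
        and bool(set(c["city"]) & aff_cities),
        # Step 2: country match only.
        lambda c: bool(set(c["country"]) & aff_countries),
        # Step 4: city match only.
        lambda c: bool(set(c["city"]) & aff_cities),
        # Step 8: fall back to the primary organization for the key.
        lambda c: c.get("first") == "y",
    ]
    for keep in strategies:
        kept = [c for c in candidates if keep(c)]
        if kept:
            return kept
    return []


candidates = [
    {"id": "org:a", "city": ["cambridge"], "country": ["united kingdom"], "first": "y"},
    {"id": "org:b", "city": ["cambridge"], "country": ["united states"], "first": "n"},
]

by_location = pick_candidates(candidates, {"united states"}, {"cambridge"})
by_fallback = pick_candidates(candidates, set(), set())
```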

`disamb(input, id_list_, dix_id)` — `helpers/disambiguation.py`

Final post-processing to resolve cases where multiple organizations were matched.

Logic:

| Condition | Action |
|---|---|
| Single result | Return as-is |
| No country detected in affiliation | Keep same-country results |
| More active results than detected countries | Filter by country (with special handling for country names like US, UK, ...) |
| Otherwise | Return all results |

Returns: Full result list (see Output Schema below).

Output Schema

Each item in the returned list is a dictionary:

| Field | Type | Description |
|---|---|---|
| `provenance` | `str` | `"affro"` (algorithm path) or `"affro_direct"` (direct mapping) |
| `version` | `str` | Pipeline version (e.g. `"3.3.0"`) |
| `pid` | `str` | `"ror"` or `"openorgs"` |
| `value` | `str` | The organization identifier (ROR ID or OpenOrgs ID) |
| `name` | `str` | Official organization name |
| `confidence` | `float` | Match confidence score (0–1) |
| `status` | `str` | `"active"`, `"inactive"`, `"withdrawn"`, or `"merged"` |
| `country` | `list[str]` | Country or countries associated with the organization |

Example output:

[
  {
    "provenance": "affro",
    "version": "3.3.0",
    "pid": "ror",
    "value": "https://ror.org/019kf3481",
    "name": "OpenAIRE Non-Profit Civil Partnership",
    "confidence": 1,
    "status": "active",
    "country": ["greece"]
  }
]

NOTE: When an organization is inactive or withdrawn, AffRo also appends the active successor(s) from dix_id[id]['status'][1] as separate entries in the list.
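The successor handling described in the note can be sketched as follows, assuming the `status` field has the `[primary_status, [successor_ids]]` layout of `dix_id`. The function name, the identifiers, and the sample entries are hypothetical.

```python
def with_successors(org_id, dix_id):
    """Return the matched ID followed by its successors when the
    organization is inactive or withdrawn."""
    status, successors = dix_id[org_id]["status"]
    ids = [org_id]
    if status in ("inactive", "withdrawn"):
        ids.extend(s for s in successors if s not in ids)
    return ids


# Hypothetical entries following the [primary_status, [successor_ids]] layout.
dix_id = {
    "org:old": {"name": "Old Institute", "status": ["inactive", ["org:new"]]},
    "org:new": {"name": "New Institute", "status": ["active", []]},
}

expanded = with_successors("org:old", dix_id)
```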

Data Dictionaries

`dix_name` — `jsons/dix_name.json.gz`

Maps normalised name keys to a list of candidate organizations. Each candidate has:

| Field | Type | Description |
|---|---|---|
| `id` | `str` | ROR URI or OpenOrgs ID |
| `first` | `str` | `"y"` if this is the canonical/primary org for this key |
| `label` | `str` \| `null` | Family label (e.g. `"fraunhofer"`, `"cnr"`) used by direct mapping |
| `country` | `list[str]` | Country names |
| `city` | `list[str]` | City names |

`dix_id` — `jsons/dix_id.json.gz`

Maps organization IDs to metadata:

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Official name |
| `country` | `list[str]` | Country |
| `status` | `list` | `[primary_status, [successor_ids]]` |
| `top_level` | `str` | `"y"` if the org has no parent |
| `parent` | `str` | `"y"` if the org is a parent to others |

Usage

Command-line

Run a quick test directly from the terminal (no script needed):

python -c "from affro.core import run_affro; import json; print(json.dumps(run_affro('Department of Chemistry, ETH Zurich, Switzerland'), indent=2))"

Expected behaviour by case

| Input | Path taken | Reason |
|---|---|---|
| `"University of Cambridge"` | `dix_name` exact match | `"univer cambridge"` found in `dix_name` |
| `"Fraunhofer, Institute for Industrial Engineering, Stuttgart"` | Direct mapping | `"fraunhofer"` + `"instit industrial engineering"` triggers `direct_mapping` |
| `"Dept. of Physics, Univ. of Auckland, NZ"` | Algorithm path | Lucky key not in `dix_name` |
| Inactive ROR org | Fast path + successor | Status list contains successor ID → appended to result |

Error Handling

Any exception inside run_affro_ is caught, logged to stdout with the input string, and an empty list [] is returned.
An empty result list [] indicates no match was found or an error occurred.
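Because an empty list covers both "no match" and "error", calling code may want to handle it as a single case. The sketch below shows one way to consume a result list that follows the Output Schema; `best_active_match` and its `min_confidence` default are hypothetical and not part of AffRo.

```python
def best_active_match(results, min_confidence=0.5):
    """Return the most confident active match, or None if the list is
    empty (no match or an error) or nothing clears the threshold."""
    active = [
        r for r in results
        if r["status"] == "active" and r["confidence"] >= min_confidence
    ]
    return max(active, key=lambda r: r["confidence"], default=None)


# The example output shown in the Output Schema section.
results = [
    {
        "provenance": "affro",
        "version": "3.3.0",
        "pid": "ror",
        "value": "https://ror.org/019kf3481",
        "name": "OpenAIRE Non-Profit Civil Partnership",
        "confidence": 1,
        "status": "active",
        "country": ["greece"],
    }
]

match = best_active_match(results)
no_match = best_active_match([])
```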

Module Dependencies

core.py
├── helpers/functions.py          # string cleaning, dix_name/dix_id loading, regex, utils
├── helpers/create_input.py       # create_df_algorithm, valueToCategory, substrings_dict
├── helpers/matching.py           # find_candidate, get_candidates, best_sim_score, cosine similarity
├── helpers/find_name.py          # find_name
├── helpers/find_id.py            # find_id, disambiguation helpers
├── helpers/disambiguation.py     # disamb, convert_to_result
└── helpers/direct_mapping.py     # direct_mapping, _build_label_keys

Limitations: -

Environment: Python

References: -

Authority: OpenAIRE • License: AGPL-3.0 • Code: AffRo on GitHub
