
Enrichment by mining

OpenAIRE collects the full-texts of publications in order to apply Text and Data Mining (TDM) algorithms to them and enrich the Graph with inference links.

The collection of the full-texts is handled by the internal PDF Aggregation Service. Using the publications' URLs from the OpenAIRE Graph together with state-of-the-art algorithms, this service crawls the web to locate and download the full-texts of open access publications, focusing on the most recent ones. It respects the servers of the repositories and publishers and avoids overloading them.
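One common way to avoid overloading a server is to enforce a minimum delay between consecutive requests to the same host. The sketch below illustrates that idea only; the class name `HostThrottle`, its parameters, and the example URLs are assumptions made for illustration and do not describe the actual implementation of the PDF Aggregation Service.

```python
from urllib.parse import urlparse


class HostThrottle:
    """Hypothetical helper: tracks when each host may be contacted again."""

    def __init__(self, min_delay_seconds: float = 1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url: str, now: float) -> float:
        """Return the seconds to wait before requesting `url` and reserve the slot."""
        host = urlparse(url).netloc
        allowed_at = max(self.next_allowed.get(host, 0.0), now)
        # The following request to this host must come at least min_delay later.
        self.next_allowed[host] = allowed_at + self.min_delay
        return allowed_at - now


throttle = HostThrottle(min_delay_seconds=2.0)
first = throttle.wait_time("https://repo.example.org/record/1", now=100.0)
second = throttle.wait_time("https://repo.example.org/record/2", now=100.5)
other = throttle.wait_time("https://publisher.example.com/article/9", now=100.5)
print(first, second, other)  # same host is delayed, a different host is not
```

A crawler would sleep for the returned number of seconds before issuing the request, so that hosts are throttled independently of each other.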

The service orchestrates a distributed execution system in the cloud, with multiple microservices running in parallel to efficiently process and download a large number of publications. The microservices store the generated report records for the publications in a database and the full-texts in an S3 Object Store.
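The pattern of parallel workers that each produce a report record and store a file can be sketched on a single machine with a thread pool. Everything here is hypothetical: the `ReportRecord` fields, the `process` and `run_batch` functions, and the plain dictionary standing in for the object store are assumptions for illustration, not the service's real schema or API, and no network access is performed.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReportRecord:
    """Hypothetical per-publication report (not the service's real schema)."""
    publication_id: str
    source_url: str
    status: str                # "downloaded" or "failed"
    object_key: Optional[str]  # key of the stored file, if any


def process(publication_id, url, object_store):
    """Hypothetical worker: pretends to download a file and stores it."""
    if not url.startswith("https://"):
        return ReportRecord(publication_id, url, "failed", None)
    key = f"full-texts/{publication_id}.pdf"
    object_store[key] = b"%PDF-placeholder"  # stands in for an object-store upload
    return ReportRecord(publication_id, url, "downloaded", key)


def run_batch(assignments, object_store, workers=4):
    """Process (publication_id, url) pairs in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, pid, url, object_store)
                   for pid, url in assignments]
        return [future.result() for future in futures]


store = {}
records = run_batch(
    [("pub1", "https://repo.example.org/1"), ("pub2", "ftp://repo.example.org/2")],
    store,
)
for record in records:
    print(record.publication_id, record.status, record.object_key)
```

In a real deployment the workers would be separate services and the records would go to a database; the sketch only shows how the report records and the stored files are kept apart.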

At the publication-page level, it applies text-mining algorithms to analyze the structure of the page, extract the full-text URL, and download the file. It also tracks various performance indicators during execution in order to optimize the crawling speed.
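As a rough illustration of extracting a full-text URL from a publication page, the sketch below first looks for the `citation_pdf_url` meta tag that many repositories and publishers expose for scholarly indexing, and falls back to the first link ending in `.pdf`. This is a simplified heuristic under stated assumptions, not the algorithm used by the service; the names `FullTextLinkFinder` and `find_full_text_url` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FullTextLinkFinder(HTMLParser):
    """Hypothetical parser that collects candidate full-text links."""

    def __init__(self):
        super().__init__()
        self.meta_url = None    # from <meta name="citation_pdf_url">
        self.anchor_url = None  # first <a href="...pdf"> found

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            name = (attributes.get("name") or "").lower()
            if name == "citation_pdf_url" and self.meta_url is None:
                self.meta_url = attributes.get("content")
        elif tag == "a" and self.anchor_url is None:
            href = attributes.get("href") or ""
            # Ignore the query string when checking the file extension.
            if href.lower().split("?")[0].endswith(".pdf"):
                self.anchor_url = href


def find_full_text_url(page_html, page_url):
    """Return an absolute full-text URL, or None if no candidate is found."""
    finder = FullTextLinkFinder()
    finder.feed(page_html)
    candidate = finder.meta_url or finder.anchor_url
    return urljoin(page_url, candidate) if candidate else None


page = """
<html><head>
<meta name="citation_pdf_url" content="/files/paper.pdf">
</head><body><a href="supplement.pdf">Supplement</a></body></html>
"""
print(find_full_text_url(page, "https://repo.example.org/record/1"))
```

The meta tag is preferred over anchors because a page can link to several PDF files (supplements, licenses) that are not the publication itself.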

The PDF Aggregation Service is also capable of bulk-importing full-texts from compatible data sources, which increases the collection speed of full-texts.

The different Text and Data Mining (TDM) algorithms used in the graph-enrichment process are grouped in the following categories.

📄️ Affiliation matching

Short description: The goal of the affiliation matching module is to match affiliation strings (identified in full-text PDFs or in scholarly databases, such as Crossref) with persistent organization identifiers (e.g., ROR identifiers).

📄️ Citation matching

Short description: 1303.6906 [1].

📄️ Classifiers

Short description: A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes.

📄️ Documents similarity

Short description: The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score in the range [0, 1]; the higher the score, the more similar the documents are expected to be with respect to their content.

📄️ Extraction of acknowledged concepts

Short description: Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioentities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.

📄️ Extraction of cited concepts

Short description: Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.

📄️ Metadata extraction

Short description: The metadata extraction algorithm is responsible for extracting plaintext and metadata from PDF documents. It is based on the CERMINE project.
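To make the [0, 1] similarity score mentioned under Documents similarity more concrete, the sketch below computes a cosine similarity over word counts, a standard measure that yields 0 for texts with no shared terms and 1 for texts with identical term distributions. It is an illustrative stand-in, not the algorithm used by the OpenAIRE module; the function names and the simple tokenizer are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of word-count vectors; always within [0, 1]."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    squares_a = sum(v * v for v in counts_a.values())
    squares_b = sum(v * v for v in counts_b.values())
    norm = math.sqrt(squares_a * squares_b)
    return dot / norm if norm else 0.0


abstract_1 = "Text mining enriches the research graph with new links"
abstract_2 = "New links are added to the research graph by text mining"
abstract_3 = "Protein folding dynamics in yeast"
print(cosine_similarity(abstract_1, abstract_2))
print(cosine_similarity(abstract_1, abstract_3))
```

Because word counts are never negative, the score cannot fall below 0, which matches the range stated for the similarity links.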