Version: 10.5.0

Protein Data Bank

This section documents the mapping used to integrate metadata and links from Protein Data Bank in the OpenAIRE Graph.

Input Data

The input data consists of protein structures downloaded from the Protein Data Bank FTP repository. These structures are preprocessed by a series of scripts available at https://github.com/sandrolabruzzo/proteinDBMetadataExtractor, which extract the metadata of each protein and of its associated publications.

Metadata Schema of Input Data

This schema defines the structure of metadata associated with an entry from the Protein Data Bank (PDB). Each instance of this schema captures key bibliographic and publication-related information for a specific PDB entry.

Fields:

  • pdb (string): The unique 4-character identifier for the PDB entry (e.g., "1XYZ"). This serves as the primary key for the record.
  • title (string): The full title of the publication associated with the PDB entry. This typically describes the research work that led to the structure determination.
  • authors (List of strings): A list of the full names of the authors who contributed to the publication associated with the PDB entry. Each element in the list is one author's name.
  • doi (string): The Digital Object Identifier (DOI) for the primary publication related to the PDB entry. This provides a persistent link to the article. Can be an empty string if not available.
  • pmid (string): The PubMed ID (PMID) for the primary publication related to the PDB entry. This identifies the article in the PubMed database. Can be an empty string if not available.
  • date (string): The deposition date of the PDB entry, typically in a standardized format (e.g., "YYYY-MM-DD"). This indicates when the structure was initially submitted to the PDB.
  • revDate (List of RelevantDate objects): A list of revision dates for the PDB entry. Each element is an instance of the RelevantDate class, which is not defined in this section; it presumably carries the revision date itself and possibly a description of the changes.
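
As a rough illustration, the schema above could be modelled with Python dataclasses. This is only a sketch: the class name PDBRecord and the fields of RelevantDate (date, description) are assumptions, since this section does not define them, and the defaults simply mirror the "empty string if not available" convention described for doi and pmid.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelevantDate:
    # Hypothetical fields: the schema only says this class holds a revision
    # date and possibly a description of the changes.
    date: str
    description: str = ""


@dataclass
class PDBRecord:
    pdb: str                      # 4-character PDB identifier, e.g. "1XYZ"
    title: str                    # title of the associated publication
    authors: List[str] = field(default_factory=list)
    doi: str = ""                 # empty string when not available
    pmid: str = ""                # empty string when not available
    date: str = ""                # deposition date, e.g. "YYYY-MM-DD"
    revDate: List[RelevantDate] = field(default_factory=list)


# Example instance with made-up values, for illustration only.
record = PDBRecord(
    pdb="1XYZ",
    title="Example structure title",
    authors=["Jane Doe"],
    date="2001-02-03",
)
```

Using default_factory for the list fields avoids sharing one mutable list between records.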

Mapping

The table below describes the mapping from the preprocessed records to the OpenAIRE Graph Dataset format.

OpenAIRE Research Product field path | PDB record field xpath | Notes

Publication Mapping
id | //pdb | id in the form pmid_________::md5(pdb)
maintitle | //title |
publicationdate | //date | clean and normalize the format of the date to be YYYY-mm-dd
instance.license | CC 0 | According to https://www.ebi.ac.uk/pdbe/about/public-data-access-statement

Author Mapping
author.fullname | //authors |
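
The field mapping above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual OpenAIRE implementation: the function names, the list of accepted input date formats, the choice to hash the pdb value exactly as given (UTF-8, no case folding), and the simplified output dictionary are all hypothetical. Only the identifier pattern, the target field names, and the CC 0 license value come from the table.

```python
import hashlib
from datetime import datetime

# Prefix copied verbatim from the mapping table.
ID_PREFIX = "pmid_________"
# Assumed candidate input formats; the formats found in real records may differ.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")


def make_id(pdb: str) -> str:
    """Build the identifier as <prefix>::md5(pdb)."""
    digest = hashlib.md5(pdb.encode("utf-8")).hexdigest()
    return f"{ID_PREFIX}::{digest}"


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or "" if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""


def map_record(record: dict) -> dict:
    """Map a preprocessed PDB record (a plain dict) to a simplified product."""
    return {
        "id": make_id(record["pdb"]),
        "maintitle": record.get("title", ""),
        "publicationdate": normalize_date(record.get("date", "")),
        "instance": {"license": "CC 0"},
        "author": [{"fullname": name} for name in record.get("authors", [])],
    }


# Made-up input record, for illustration only.
product = map_record(
    {"pdb": "1XYZ", "title": "Example title", "authors": ["Jane Doe"], "date": "2001/02/03"}
)
```

Returning an empty string for an unparseable date is one possible policy; a real pipeline might instead log the record or keep the raw value.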

Relation Mapping

OpenAIRE Relation Semantic and inverse | Source/Target type | Notes
IsSupplementTo/IsSupplementedBy | //doi or //pmid | we create relationships between the BioEntity and the pubmed publication
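
A possible way to derive these relations from a preprocessed record is sketched below. Several points are assumptions rather than documented behaviour: the function name, the dictionary layout of a relation, the direction (the PDB entry IsSupplementTo the publication, with IsSupplementedBy as the inverse), and the choice to emit one pair of relations for every identifier that is present instead of preferring the DOI over the PMID.

```python
def build_relations(record: dict) -> list:
    """Create IsSupplementTo / IsSupplementedBy pairs for a PDB record.

    The relation layout used here is hypothetical and heavily simplified.
    """
    relations = []
    entity = {"type": "pdb", "value": record["pdb"]}
    for pid_type in ("doi", "pmid"):
        value = record.get(pid_type, "").strip()
        if not value:
            # The schema allows empty strings when an identifier is missing.
            continue
        publication = {"type": pid_type, "value": value}
        relations.append(
            {"source": entity, "target": publication, "relClass": "IsSupplementTo"}
        )
        relations.append(
            {"source": publication, "target": entity, "relClass": "IsSupplementedBy"}
        )
    return relations


# Made-up record with a PMID but no DOI, for illustration only.
rels = build_relations({"pdb": "1XYZ", "doi": "", "pmid": "12345678"})
```

Records with neither a DOI nor a PMID yield no relations in this sketch.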