InterMine Integration Primary Key Configuration

This document describes the configuration that the InterMine system's Integration uses to decide when two objects are identical. Two objects are deemed identical if, for at least one primary key used for their class, the values of all the fields in that key match. Primary keys are defined in a global file called "<name of model>_keyDefs.properties", which must be on the classpath.

Not all data sources will contain sufficient data for all of the primary keys, so each data source needs an additional configuration file defining which primary keys should be used when integrating data. These files should be in the resources directory of the data source, and should be called "<name of data source>_keys.properties".

Global primary key configuration file

This file is a Java properties file, so each entry is a single line of the form "property name = property value". A line that begins with a hash character is a comment, and blank lines are allowed. The file defines a set of named primary keys for each class. A primary key defined on a class also applies to all of its subclasses. The file should be created in <mine>/dbmodel/resources.

To define a primary key, enter a line in the following form:

Classname.primary_key_name = field1, field2

This line means that the class "Classname" and all its subclasses have a primary key called "primary_key_name" that matches two objects if the values of both of the fields "field1" and "field2" are identical. Only attributes and references can be used as fields in a primary key, not collections.
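
The matching rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not InterMine's actual code. The objects are modelled as plain dictionaries, the function names and sample values are hypothetical, and the treatment of missing values (a missing value never matches) is an assumption that this document does not confirm.

```python
# Sketch of the matching rule described above; not InterMine's implementation.
# Objects are modelled as dicts of field name -> value (an assumption).

def key_matches(obj_a, obj_b, fields):
    """Return True if both objects hold equal, non-missing values for every field."""
    for field in fields:
        a, b = obj_a.get(field), obj_b.get(field)
        # Assumption: a missing value never matches anything.
        if a is None or b is None or a != b:
            return False
    return True


def are_identical(obj_a, obj_b, primary_keys):
    """Two objects are identical if at least one primary key matches."""
    return any(key_matches(obj_a, obj_b, fields) for fields in primary_keys.values())


# Two primary keys for a class, as key name -> list of fields.
gene_keys = {
    "key_identifier_org": ["identifier", "organism"],
    "key_symbol_org": ["symbol", "organism"],
}

# Made-up sample objects.
gene_1 = {"identifier": "G-0001", "symbol": "abc", "organism": "org-A"}
gene_2 = {"identifier": "G-0002", "symbol": "abc", "organism": "org-A"}
gene_3 = {"identifier": "G-0001", "symbol": "xyz", "organism": "org-B"}

print(are_identical(gene_1, gene_2, gene_keys))  # True: symbol and organism both match
print(are_identical(gene_1, gene_3, gene_keys))  # False: organism differs in every key
```

Note that gene_1 and gene_3 share an identifier but are still not merged, because every field of a key has to match.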

Here is a short example of the configuration file. For a real-world example, see the configuration file we use for the FlyMine system.

Gene.key_identifier_org=identifier, organism
Gene.key_symbol_org=symbol, organism
Gene.key_organismdbid=organismDbId
Gene.key_ncbiGeneNumber=ncbiGeneNumber
Protein.key_identifier_org=identifier, organism
Protein.key_primaryacc=primaryAccession
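
To make the structure of this file concrete, the following Python sketch reads the example above into a nested dictionary of class name, key name and field list. The parser is deliberately simplified and is an assumption for illustration: it handles only "name = value" lines, hash comments and blank lines, not the full Java properties syntax (line continuations, escapes, alternative separators). The function name parse_key_defs is hypothetical.

```python
def parse_key_defs(text):
    """Parse 'Classname.key_name = field1, field2' lines into nested dicts."""
    key_defs = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines that start with a hash.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        # Split at the last dot: everything before it is the class name.
        class_name, _, key_name = name.strip().rpartition(".")
        fields = [f.strip() for f in value.split(",") if f.strip()]
        key_defs.setdefault(class_name, {})[key_name] = fields
    return key_defs


EXAMPLE = """\
# Global primary key definitions
Gene.key_identifier_org=identifier, organism
Gene.key_symbol_org=symbol, organism
Gene.key_organismdbid=organismDbId
Gene.key_ncbiGeneNumber=ncbiGeneNumber
Protein.key_identifier_org=identifier, organism
Protein.key_primaryacc=primaryAccession
"""

key_defs = parse_key_defs(EXAMPLE)
for class_name, keys in key_defs.items():
    for key_name, fields in keys.items():
        print(class_name, key_name, fields)
```

The result shows that Gene has four alternative primary keys and Protein has two, and that a key such as key_identifier_org is made up of two fields.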

Data source keys configuration files

For each data source, there is a configuration file listing the primary keys that can be used when integrating that data source. Like the global file, it is a Java properties file. The file lists the primary keys by name for each class. If a class is not mentioned, then instances of that class will never be merged with other objects. Unlike in the global configuration, each class must be mentioned only once in this file. For each class, there should be a line like the following:

Classname = primary_key_name, primary_key_name2

This line means that the class "Classname" and all its subclasses have two primary keys available for this data source, called "primary_key_name" and "primary_key_name2", both of which must be defined in the global configuration.
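
Because each class may appear only once, it can be worth checking a data source file for repeated class names before using it. The Python sketch below is a hypothetical helper, not part of InterMine, and it uses the same simplified line format as the examples in this document. It is useful mainly because java.util.Properties keeps only the last value when a property name is repeated, so a duplicate line would otherwise silently replace an earlier one.

```python
from collections import Counter


def duplicated_classes(text):
    """Return the class names that appear on more than one line, sorted."""
    names = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines that start with a hash.
        if not line or line.startswith("#"):
            continue
        names.append(line.partition("=")[0].strip())
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# A deliberately wrong file: Gene is listed twice.
BAD_SOURCE_KEYS = """\
Protein=key_primaryacc
Gene=key_organismdbid
Gene=key_symbol_org
"""

# The corrected file lists both keys on one line.
GOOD_SOURCE_KEYS = """\
Protein=key_primaryacc
Gene=key_organismdbid, key_symbol_org
"""

print(duplicated_classes(BAD_SOURCE_KEYS))   # ['Gene']
print(duplicated_classes(GOOD_SOURCE_KEYS))  # []
```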

Here is an example of a configuration file for the UniProt data source for our FlyMine system:

Protein=key_primaryacc
Organism=key_taxonid
Synonym=key_synonym
Publication=key_pubmed
Comment=key_comment
DataSet=key_title
DataSource=key_name
Gene=key_organismdbid
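
The relationship between the two kinds of file can be sketched as follows: each key name in a data source file is looked up in the global definitions to find the fields it stands for. This Python sketch is an illustration under stated assumptions, not InterMine's code. The function names are hypothetical, the parser is simplified, the global definitions are trimmed to two lines, and inheritance is ignored (a key defined on a superclass would not be found here, although the real system applies keys to subclasses).

```python
def parse_properties(text):
    """Parse simple 'name = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return props


def resolve_source_keys(global_text, source_text):
    """Map each class in the source file to {key name: fields}.

    Also returns the 'Class.key' names that have no global definition.
    """
    global_defs = parse_properties(global_text)  # "Class.key" -> fields
    resolved, undefined = {}, []
    for class_name, key_names in parse_properties(source_text).items():
        for key_name in key_names:
            full_name = class_name + "." + key_name
            fields = global_defs.get(full_name)
            if fields is None:
                undefined.append(full_name)
            else:
                resolved.setdefault(class_name, {})[key_name] = fields
    return resolved, undefined


# Trimmed global definitions (illustrative subset only).
GLOBAL_KEYS = """\
Gene.key_organismdbid=organismDbId
Protein.key_primaryacc=primaryAccession
"""

# Three lines from a data source file.
SOURCE_KEYS = """\
Protein=key_primaryacc
Gene=key_organismdbid
Organism=key_taxonid
"""

resolved, undefined = resolve_source_keys(GLOBAL_KEYS, SOURCE_KEYS)
print(resolved)
print(undefined)
```

Here Organism.key_taxonid is reported as undefined only because the trimmed global definitions above leave it out; a complete global file would be expected to define every key that a data source file names.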