InterMine Integration Primary Key Configuration
This document describes the configuration that the InterMine integration system uses to identify objects that are identical to each other. Two objects are deemed identical if their fields match for at least one primary key defined for their class. Primary keys are defined in a global file called "<name of model>_keyDefs.properties", which must be on the classpath.
Not all data sources will contain sufficient data for all of the primary keys, so each data source needs an additional configuration file defining which primary keys should be used when integrating data. These files should be in the resources directory of the data source, and should be called "<name of data source>_keys.properties".
Global primary key configuration file
This file is a Java properties file, so each entry is a single line of the form "property name = property value". A line is a comment if it begins with a hash character, and blank lines are allowed. The file defines a set of named primary keys for each class. A primary key defined on a class also applies to all of its subclasses. This file should be created in <mine>/dbmodel/resources.
To define a primary key, enter a line in the following form:
Classname.primary_key_name = field1, field2
This line means that the class "Classname" and all its subclasses have a primary key called "primary_key_name" that matches two objects if the values of both of the fields "field1" and "field2" are identical. Only attributes and references can be used as fields in a primary key, not collections.
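The matching rule for a single key can be sketched in a few lines of Python. This is an illustration of the rule as described here, not InterMine's implementation: the function names, the dictionary representation of objects, and the sample gene values are all hypothetical. The sketch also assumes that a missing (None) value never matches; how null values are actually treated is not covered by this document.

```python
def parse_key_definition(line):
    """Split 'Classname.key_name = field1, field2' into its three parts."""
    name, _, value = line.partition("=")
    class_name, _, key_name = name.strip().partition(".")
    fields = [f.strip() for f in value.split(",") if f.strip()]
    return class_name, key_name, fields


def matches_on_key(obj_a, obj_b, fields):
    """True when both objects hold identical values for every key field.

    Assumption: a field that is missing or None on either object
    prevents a match.
    """
    return all(
        obj_a.get(f) is not None and obj_a.get(f) == obj_b.get(f)
        for f in fields
    )


class_name, key_name, fields = parse_key_definition(
    "Gene.key_identifier_org = identifier, organism"
)

# Hypothetical objects, represented as plain dictionaries.
gene_a = {"identifier": "CG1234", "organism": "7227", "symbol": "abc"}
gene_b = {"identifier": "CG1234", "organism": "7227", "symbol": None}
gene_c = {"identifier": "CG1234", "organism": "9606", "symbol": "abc"}

print(matches_on_key(gene_a, gene_b, fields))  # identifier and organism agree
print(matches_on_key(gene_a, gene_c, fields))  # organism differs
```

Fields that are not part of the key, such as "symbol" here, play no role in the comparison.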
Here is a short example of the configuration file. The configuration file we use for the FlyMine system is a good example.
Gene.key_identifier_org=identifier, organism
Gene.key_symbol_org=symbol, organism
Gene.key_organismdbid=organismDbId
Gene.key_ncbiGeneNumber=ncbiGeneNumber
Protein.key_identifier_org=identifier, organism
Protein.key_primaryacc=primaryAccession
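To make the structure of the example above concrete, the following Python sketch reads the same definitions into a nested dictionary of class name, key name and field list. It is a deliberately simplified reader written for illustration: the function name is hypothetical, and it handles only "name = value" lines, hash comments and blank lines. Real Java properties files also allow other separators and line continuations, which are ignored here.

```python
from collections import defaultdict

GLOBAL_KEY_DEFS = """\
# Global primary key definitions (content of the short example)
Gene.key_identifier_org=identifier, organism
Gene.key_symbol_org=symbol, organism
Gene.key_organismdbid=organismDbId
Gene.key_ncbiGeneNumber=ncbiGeneNumber

Protein.key_identifier_org=identifier, organism
Protein.key_primaryacc=primaryAccession
"""


def load_key_definitions(text):
    """Return {class name: {key name: [field, ...]}} from properties text."""
    definitions = defaultdict(dict)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and hash comments carry no definitions
        name, _, value = line.partition("=")
        class_name, _, key_name = name.strip().partition(".")
        definitions[class_name][key_name] = [
            field.strip() for field in value.split(",") if field.strip()
        ]
    return dict(definitions)


definitions = load_key_definitions(GLOBAL_KEY_DEFS)
for class_name, keys in definitions.items():
    print(class_name, keys)
```

Note that a class such as Gene appears on several lines, one per primary key, and each key may list one or more fields.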
Data source keys configuration files
For each data source, there is a configuration file providing a list of the primary keys that can be used when integrating that data source. It is again a Java properties file. The file lists the primary keys by name for each class. If a class is not mentioned, then instances of that class will never be merged with other objects. Each class must be mentioned only once in this file, unlike in the global configuration, where a class appears once per primary key. For each class, there should be a line like the following:
Classname = primary_key_name, primary_key_name2
This line means that the class "Classname" and all its subclasses have two primary keys available for this data source, called "primary_key_name" and "primary_key_name2", both of which should be defined in the global configuration.
Here is an example of a configuration file for the UniProt data source for our FlyMine system:
Protein=key_primaryacc
Organism=key_taxonid
Synonym=key_synonym
Publication=key_pubmed
Comment=key_comment
DataSet=key_title
DataSource=key_name
Gene=key_organismdbid
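The relationship between a data source file and the global file can be checked with a small script. The Python sketch below is hypothetical tooling, not part of InterMine: the function names are invented, the global definitions are limited to the Gene and Protein keys from the short global example, and the entry "key_missing" is deliberately wrong to show what gets reported. It also ignores subclasses, so a key inherited from a superclass would be reported as undefined.

```python
# Global definitions, limited to the classes in the short global example.
GLOBAL_KEYS = {
    "Gene": {
        "key_identifier_org": ["identifier", "organism"],
        "key_symbol_org": ["symbol", "organism"],
        "key_organismdbid": ["organismDbId"],
        "key_ncbiGeneNumber": ["ncbiGeneNumber"],
    },
    "Protein": {
        "key_identifier_org": ["identifier", "organism"],
        "key_primaryacc": ["primaryAccession"],
    },
}

# Two lines in the style of the UniProt example, plus one invalid key name.
SOURCE_KEYS_TEXT = """\
Protein=key_primaryacc
Gene=key_organismdbid, key_missing
"""


def load_source_keys(text):
    """Return {class name: [key name, ...]}, rejecting repeated classes."""
    source_keys = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        class_name, _, value = line.partition("=")
        class_name = class_name.strip()
        if class_name in source_keys:
            # Each class must be mentioned only once in a data source file.
            raise ValueError(f"class listed more than once: {class_name}")
        source_keys[class_name] = [k.strip() for k in value.split(",") if k.strip()]
    return source_keys


def undefined_keys(source_keys, global_keys):
    """List (class, key) pairs that the global definitions do not contain."""
    return [
        (class_name, key)
        for class_name, keys in source_keys.items()
        for key in keys
        if key not in global_keys.get(class_name, {})
    ]


source_keys = load_source_keys(SOURCE_KEYS_TEXT)
print(undefined_keys(source_keys, GLOBAL_KEYS))
# A class that is not mentioned has no keys, so its instances are never merged.
print(source_keys.get("Organism", []))
```

A check of this kind is one way to catch a misspelled key name before running an integration; whether InterMine performs a similar validation itself is not stated here.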
