How to create a new data source

Introduction

The aim of this tutorial is to create a new source, specifically one that imports data from a file in the InterMine data XML format. This is done in two parts: the first creates the files that describe the new source, and the second configures an individual mine to use them. The XML file containing the data can be produced in any language; InterMine includes Java and Perl APIs to assist.

Thanks to Anthony Smith from the MRC Dunn Human Nutrition Unit for providing this tutorial.

Part 1 - create files

  1. Create a new source in bio/sources/ by creating a new directory there and naming it anything you deem appropriate; for this example we will use new-source. This directory will contain the configuration for the source, the keys that define how the data should be integrated, and any Java code required (we don't need any here). The data itself can live elsewhere.
  1. In the new source folder create a new file called build.xml with the following contents. This hooks the new source into InterMine's ant build system; all it does is include targets from an InterMine ant file.
<project name="new-source" default="default" basedir=".">
  <description>build, test, package new-source</description>

  <import file="../../../imbuild/source.xml"/>

</project>
  1. Create another file in the new source folder called project.properties. This file just states that the data file will be an XML file. It should have the following contents:
compile.dependencies = intermine/integrate/main

have.file.xml.tgt = true

There are other source types to retrieve from various flat file formats and databases.

  1. Create a new file in the source folder called new-source_additions.xml. This file details any extensions to the data model needed to store data from this source. Everything else is generated automatically from the model description, so this is all we need to do to add to the model. The file is in the same format as a complete Model description.

To add to an existing class the contents should be similar to the example code below. The class name is a class already in the model, the attribute name is the name of the new field to be added and the type describes the type of data to be stored. In the example the Protein class will be extended to include a new attribute called extraData which will hold data as a string.

<?xml version="1.0"?>
<classes>

  <class name="org.flymine.model.genomic.Protein" is-interface="true">
    <attribute name="extraData" type="java.lang.String"/>   
  </class>

</classes>

To create a new class from scratch the new-source_additions.xml file should include contents similar to the example below:

<?xml version="1.0"?>
<classes>

  <class name="org.flymine.model.genomic.NewFeature" extends="org.flymine.model.genomic.LocatedSequenceFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>  
    <attribute name="confidence" type="java.lang.Double"/>
  </class>

</classes>

The name of the class must include the Java package name; the class's own name is the part after the final dot. The extends clause is optional and is used to inherit (include all the attributes of) an existing class. In this case we extend LocatedSequenceFeature, an InterMine class that represents any genome feature. is-interface should always be set to true. The attribute lines, as before, define the names and types of the data to be stored. A new class will be created with the name NewFeature that extends LocatedSequenceFeature.

To cross-reference this with another class, use XML similar to the example below:

<class name="org.flymine.model.genomic.NewFeature" extends="org.flymine.model.genomic.LocatedSequenceFeature" is-interface="true">
  <reference name="protein" referenced-type="org.flymine.model.genomic.Protein" reverse-reference="features"/>
</class>

In the example above we create a link from NewFeature to the Protein class via the reference named protein. To complete the link, a reverse reference may be added to Protein to point back at the NewFeature. This is optional: the reference could be one-way. Here we define a collection called features, which means that for every NewFeature that references a Protein, that Protein will include it in its features collection. Note that because features is a collection, a Protein can link to multiple NewFeatures, but NewFeature.protein is a reference, so each NewFeature can link to only one Protein.

The reverse entry needs to be added to Protein (still in the same file):

<class name="org.flymine.model.genomic.Protein" is-interface="true">
  <collection name="features"  referenced-type="org.flymine.model.genomic.NewFeature" reverse-reference="protein"/>
</class>

So, the final additions XML should look like:

<?xml version="1.0"?>
<classes>

  <class name="org.flymine.model.genomic.Protein" is-interface="true">
    <attribute name="extraData" type="java.lang.String"/> 
    <collection name="features"  referenced-type="org.flymine.model.genomic.NewFeature" reverse-reference="protein"/>  
  </class>

  <class name="org.flymine.model.genomic.NewFeature" extends="org.flymine.model.genomic.LocatedSequenceFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>  
    <attribute name="confidence" type="java.lang.Double"/>
    <reference name="protein" referenced-type="org.flymine.model.genomic.Protein" reverse-reference="features"/>
  </class>

</classes>

Note - if all the data you wish to load is already modelled in InterMine then you don't need an additions file.
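
Because a reference and its reverse reference have to name each other, a quick consistency check on the additions file can catch typos before a build. The following is a minimal sketch using only the Python standard library; it is not part of InterMine, and the function name find_unmatched_links is hypothetical. It assumes both ends of each link are declared in the additions file itself, so a reverse reference that lives in the core model would be reported as unmatched.

```python
import xml.etree.ElementTree as ET

# The final additions file from this tutorial, inlined for illustration.
ADDITIONS = """<?xml version="1.0"?>
<classes>
  <class name="org.flymine.model.genomic.Protein" is-interface="true">
    <attribute name="extraData" type="java.lang.String"/>
    <collection name="features" referenced-type="org.flymine.model.genomic.NewFeature" reverse-reference="protein"/>
  </class>
  <class name="org.flymine.model.genomic.NewFeature" extends="org.flymine.model.genomic.LocatedSequenceFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
    <attribute name="confidence" type="java.lang.Double"/>
    <reference name="protein" referenced-type="org.flymine.model.genomic.Protein" reverse-reference="features"/>
  </class>
</classes>
"""


def find_unmatched_links(additions_xml):
    """Return descriptions of reverse-references that do not point back.

    Hypothetical helper: it only looks at classes declared in the given
    additions XML, not at the rest of the model.
    """
    root = ET.fromstring(additions_xml)

    # (class name, field name) -> (referenced type, reverse-reference name)
    links = {}
    for cls in root.findall("class"):
        for field in cls:
            if field.tag in ("reference", "collection"):
                links[(cls.get("name"), field.get("name"))] = (
                    field.get("referenced-type"),
                    field.get("reverse-reference"),
                )

    problems = []
    for (cls_name, field_name), (target, reverse) in links.items():
        if reverse is None:
            continue  # one-way link, nothing to match
        # The field on the other side must point straight back at this one.
        if links.get((target, reverse)) != (cls_name, field_name):
            problems.append(
                f"{cls_name}.{field_name} expects {target}.{reverse} to point back"
            )
    return problems


if __name__ == "__main__":
    print(find_unmatched_links(ADDITIONS))  # prints [] for the file above
```

An empty list means every reverse-reference in the file is matched; each string in a non-empty list names the link whose counterpart is missing or points elsewhere.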

  1. A new directory called resources needs to be created within the source directory. Within this new directory create a file called new-source_keys.properties. Here we can define primary keys that will be used to integrate data from this source with any existing objects in the database. We want to integrate proteins by their (UniProt) primaryAccession attribute, so we define that this source should use the key:
Protein=key_primaryacc

Note that we don't expect any other data source to provide NewFeature objects, so they don't need to be integrated and no key is defined for them. The possible keys are defined in the following file, and new keys can be added if necessary: trunk/bio/core/props/resources/genomic_keyDefs.properties
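
The fixed files from Part 1 can also be generated with a short script. The following is a minimal sketch in Python, not an InterMine tool: the function name scaffold_source and its parameters are hypothetical, and the file contents are copied from the steps above. It deliberately does not write the additions file, since that depends on your model.

```python
from pathlib import Path

# Contents taken from the build.xml step; {name} is the source name.
BUILD_XML = """<project name="{name}" default="default" basedir=".">
  <description>build, test, package {name}</description>

  <import file="../../../imbuild/source.xml"/>

</project>
"""

# Contents taken from the project.properties step.
PROJECT_PROPERTIES = """compile.dependencies = intermine/integrate/main

have.file.xml.tgt = true
"""


def scaffold_source(sources_dir, name, keys):
    """Create <sources_dir>/<name> with build.xml, project.properties
    and resources/<name>_keys.properties.

    keys maps a class name to a key name, e.g. {"Protein": "key_primaryacc"}.
    Existing files with the same names are overwritten.
    """
    source_dir = Path(sources_dir) / name
    resources_dir = source_dir / "resources"
    resources_dir.mkdir(parents=True, exist_ok=True)

    (source_dir / "build.xml").write_text(BUILD_XML.format(name=name))
    (source_dir / "project.properties").write_text(PROJECT_PROPERTIES)

    # One "Class=key" line per entry, as in the keys file described above.
    key_lines = "".join(f"{cls}={key}\n" for cls, key in keys.items())
    (resources_dir / f"{name}_keys.properties").write_text(key_lines)
    return source_dir


if __name__ == "__main__":
    # Assumes the script is run from the root of the checkout.
    print(scaffold_source("bio/sources", "new-source", {"Protein": "key_primaryacc"}))
```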

Part 2 - including your source in a Mine

  1. From InterMine/FlyMine version 12, the model additions file described above is automatically merged into the final model when a mine is built.

In previous versions, the additions file needed to be explicitly configured by adding a line to the -merge-models target in mine/dbmodel/build.xml:

  <target name="-merge-models" if="doing.build.db">
    ...
    <merge-additions file="bio/sources/new-source/new-source_additions.xml"/>
    ...
  </target>

In the current system the additions files from all sources mentioned in project.xml are automatically added to the data model. The -merge-models target in dbmodel/build.xml is now optional.

  1. In the project.xml file, in the root of your mine directory (e.g. malaramine), the following entries should be added and altered accordingly:
<source name="new-source-name" type="new-source">
<property name="src.data.file" location="my_data_dir/example.xml"/>
</source>

If you have more than one file you can set this up to point at a directory:

<source name="new-source-name" type="new-source">
<property name="src.data.dir" location="my_data_dir/source_files/"/>
</source>

The first line defines the name you wish to give to the source and its type, which is the name of the directory in 'bio/sources'. The second line defines the location of the data file (or, with src.data.dir, of the directory containing the data files).

The data file should have the same format as the XML below:

<?xml version="1.0"?>

<items>

   <item id="0_1" class="" implements="http://www.flymine.org/model/genomic#NewFeature">
      <attribute name="identifier" value="feature1" />
      <attribute name="confidence" value="0.8" />
      <reference name="protein" ref_id="0_3" /> 
   </item>

   <item id="0_2" class="" implements="http://www.flymine.org/model/genomic#NewFeature">
      <attribute name="identifier" value="feature2" />
      <attribute name="confidence" value="0.37" />
      <reference name="protein" ref_id="0_3" /> 
   </item>

   <item id="0_3" class="" implements="http://www.flymine.org/model/genomic#Protein">
      <attribute name="primaryAccession" value="Q8I5D2" />
      <attribute name="extraData" value="proteinInfo"/>
      <collection name="features">
        <reference ref_id="0_1" />
        <reference ref_id="0_2" />
      </collection>
   </item>

</items>
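
Every ref_id in the data file has to match the id of an item in the same file, so it can be worth checking this before loading. The following is a minimal sketch in Python using only the standard library; the function name dangling_ref_ids is hypothetical and the check is not part of InterMine.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the data file above, inlined for illustration.
SAMPLE_ITEMS = """<?xml version="1.0"?>
<items>
   <item id="0_1" class="" implements="http://www.flymine.org/model/genomic#NewFeature">
      <attribute name="identifier" value="feature1" />
      <reference name="protein" ref_id="0_3" />
   </item>
   <item id="0_3" class="" implements="http://www.flymine.org/model/genomic#Protein">
      <attribute name="primaryAccession" value="Q8I5D2" />
      <collection name="features">
        <reference ref_id="0_1" />
      </collection>
   </item>
</items>
"""


def dangling_ref_ids(items_xml):
    """Return the sorted ref_id values that match no item id in the file."""
    root = ET.fromstring(items_xml)
    known_ids = {item.get("id") for item in root.iter("item")}
    # iter() walks the whole tree, so this covers references that sit
    # directly in an item as well as those nested inside a collection.
    used_ids = {ref.get("ref_id") for ref in root.iter("reference")}
    return sorted(used_ids - known_ids)


if __name__ == "__main__":
    print(dangling_ref_ids(SAMPLE_ITEMS))  # prints [] for the sample above
```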

An example of a Perl script to create InterMine data XML can be found at: trunk/bio/scripts/intermine_items_example.pl
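
Since the data file can be written in any language, the same structure can also be produced without the InterMine APIs. The following is a minimal sketch in Python using only the standard library; the function names make_item and items_xml are hypothetical, and the id scheme (0_1, 0_2, ...) simply mirrors the example above rather than any InterMine requirement known to this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used in the implements attribute of the example data file.
NAMESPACE = "http://www.flymine.org/model/genomic#"


def make_item(parent, item_id, class_name, attributes):
    """Append an <item> with the given attributes to parent and return it."""
    item = ET.SubElement(
        parent,
        "item",
        {"id": item_id, "class": "", "implements": NAMESPACE + class_name},
    )
    for name, value in attributes.items():
        ET.SubElement(item, "attribute", {"name": name, "value": str(value)})
    return item


def items_xml(accession, features):
    """Build items XML for one Protein and its NewFeatures.

    features is a list of (identifier, confidence) pairs.
    """
    items = ET.Element("items")
    protein_id = "0_1"
    protein = make_item(items, protein_id, "Protein", {"primaryAccession": accession})
    collection = ET.SubElement(protein, "collection", {"name": "features"})

    for number, (identifier, confidence) in enumerate(features, start=2):
        feature_id = f"0_{number}"
        feature = make_item(
            items,
            feature_id,
            "NewFeature",
            {"identifier": identifier, "confidence": confidence},
        )
        # Link both directions: NewFeature.protein and Protein.features.
        ET.SubElement(feature, "reference", {"name": "protein", "ref_id": protein_id})
        ET.SubElement(collection, "reference", {"ref_id": feature_id})

    return ET.tostring(items, encoding="unicode")


if __name__ == "__main__":
    print(items_xml("Q8I5D2", [("feature1", 0.8), ("feature2", 0.37)]))
```

The output is written on a single line; that has no effect on the XML content, and it can be saved to the file named by src.data.file.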

  1. Create the database as usual. Note that unless the 'clean' target (which deletes the build directory) is run in mine/dbmodel/, any changes will be appended to the current model structure and any unwanted classes/attributes will remain.
  1. The source should now be included when building the mine.

See: RunningABuild, AnatomyOfASource