Last modified on 29/05/09 10:07:18

The chado-db source

We have developed an InterMine DataLoader that can use a GMOD Chado database as a source for an InterMine warehouse. The eventual aim is to allow import of any Chado database with some configuration. This will provide a web environment to perform rapid, complex queries on Chado databases with minimal development effort. If you are interested in importing your Chado database, please contact us.

Converter

The converter for this source is the ChadoDBConverter class. This class controls which ChadoProcessors are run. A ChadoProcessor class corresponds to a chado module. For example, the sequence module is processed by the SequenceProcessor and the stock module is processed by the StockProcessor.
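The relationship between the converter and its processors can be pictured with a minimal sketch. All names below (ModuleProcessor, SequenceModuleSketch, StockModuleSketch, ConverterSketch) are hypothetical stand-ins rather than the real InterMine classes, and the method bodies only return a description instead of creating objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a ChadoProcessor: one processor per Chado module.
interface ModuleProcessor {
    // Returns a short description; a real processor would create InterMine objects.
    String process(String databaseName);
}

// Illustrative stand-in for the processor of the sequence module.
class SequenceModuleSketch implements ModuleProcessor {
    public String process(String databaseName) {
        return "sequence module of " + databaseName;
    }
}

// Illustrative stand-in for the processor of the stock module.
class StockModuleSketch implements ModuleProcessor {
    public String process(String databaseName) {
        return "stock module of " + databaseName;
    }
}

// Hypothetical stand-in for the converter: it only decides which processors run, in order.
class ConverterSketch {
    private final List<ModuleProcessor> processors;

    ConverterSketch(List<ModuleProcessor> processors) {
        this.processors = new ArrayList<>(processors);
    }

    // Runs every configured processor against the named database and collects the results.
    List<String> run(String databaseName) {
        List<String> log = new ArrayList<>();
        for (ModuleProcessor processor : processors) {
            log.add(processor.process(databaseName));
        }
        return log;
    }
}
```

Keeping one processor per module means that support for another Chado module could be added without touching the existing processors.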

Chado tables

The chado-db source is able to integrate objects from a Chado database. Currently only tables from the Chado sequence and stock modules are read.

These tables are queried from the chado database:

  • feature (notes)
    • used to create objects in the ObjectStore
    • The default configuration only supports features that have a Sequence Ontology type (e.g. gene, exon, chromosome)
    • Each new feature in InterMine will be a sub-class of LocatedSequenceFeature.
  • featureloc (notes)
    • used to create Location objects and to set the chromosomeLocation reference in each LocatedSequenceFeature
  • feature_relationship (notes)
    • used to find part_of relationships between features
    • this information is used to create parent-child references and collections
    • examples include setting the transcripts collection in the Exon objects and the gene reference in the Transcript class.
  • dbxref and feature_dbxref
    • used to create Synonym objects for external identifiers of features
    • the Synonyms will be added to the synonyms collection of the relevant LocatedSequenceFeature
  • featureprop
    • used to set fields in features based on properties
    • an example from the FlyBase database: the LocatedSequenceFeature.cytoLocation field is set using the cyto_range featureprop
  • synonym and feature_synonym
    • used to create extra Synonym objects for chado synonyms and to set fields in features
    • the Synonyms will be added to the synonyms collection of the relevant LocatedSequenceFeature
  • cvterm and feature_cvterm
    • used to set fields in features and to create synonyms based on CV terms
  • pub, feature_pub and db
    • used to set the publications collection in the new LocatedSequenceFeature objects.
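
As an illustration of how feature_relationship rows could be turned into parent-child references, here is a minimal in-memory sketch. It assumes the usual Chado reading of a row (the subject feature is part_of the object feature); the PartOfSketch and Relationship names and the use of plain integer ids are hypothetical, and no database access is shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartOfSketch {
    // One feature_relationship row: subject is related to object by the given type.
    static final class Relationship {
        final int subjectId;
        final int objectId;
        final String type;

        Relationship(int subjectId, int objectId, String type) {
            this.subjectId = subjectId;
            this.objectId = objectId;
            this.type = type;
        }
    }

    // Groups parent (object) ids by child (subject) id, keeping only part_of rows.
    static Map<Integer, List<Integer>> parentsBySubject(List<Relationship> rows) {
        Map<Integer, List<Integer>> parents = new LinkedHashMap<>();
        for (Relationship row : rows) {
            if ("part_of".equals(row.type)) {
                parents.computeIfAbsent(row.subjectId, id -> new ArrayList<>()).add(row.objectId);
            }
        }
        return parents;
    }
}
```

With a grouping like this, an exon that is part_of two transcripts ends up with both in its list of parents, which is the kind of information needed to fill a collection such as Exon.transcripts.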

Additionally, the StockProcessor class reads the tables from the Chado stock module, e.g. stockcollection, stock, stock_genotype.

Default configuration

The default configuration of ChadoDBConverter queries the feature table for only a limited list of feature types. The list can be changed by sub-classing the ChadoDBConverter class and overriding the getFeatureList() method. The featureloc, feature_relationship and pub tables will then be queried to create locations, parent-child relationships and publications (respectively).
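
The overriding pattern can be sketched as follows. DefaultConverterSketch is a hypothetical stand-in for the real converter class: the return type of getFeatureList(), the default list and the accepts() helper are all assumptions made for illustration, so check the actual signature before copying this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the converter; the real getFeatureList() signature may differ.
class DefaultConverterSketch {
    // Assumed default: a short list of Sequence Ontology type names.
    protected List<String> getFeatureList() {
        return Arrays.asList("gene", "exon", "chromosome");
    }

    // Illustrative helper: a feature row is kept only if its type is in the list.
    boolean accepts(String featureType) {
        return getFeatureList().contains(featureType);
    }
}

// A sub-class that widens the default list with one more feature type.
class ExtendedConverterSketch extends DefaultConverterSketch {
    @Override
    protected List<String> getFeatureList() {
        List<String> types = new ArrayList<>(super.getFeatureList());
        types.add("mRNA");
        return types;
    }
}
```

Building on super.getFeatureList() rather than returning a fresh list keeps the sub-class in step with any later change to the defaults.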

Converter configuration

Sub-classes can control how the Chado tables are used by overriding the getConfig() method and returning a configuration map.
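
A rough sketch of what such an override could look like is given below. The shape of the map is an assumption made purely for illustration (a string key of the form table:name mapped to the field to set); the real configuration map is likely to be structured differently, and ConfigBaseSketch, FlyBaseConfigSketch and fieldFor() are hypothetical names. The cyto_range example is taken from the featureprop entry in the table list above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the class whose getConfig() is overridden.
class ConfigBaseSketch {
    // Assumed shape: key is "<table>:<name>", value is the field to set on the feature.
    protected Map<String, String> getConfig() {
        return new HashMap<>();
    }

    // Illustrative lookup: which field, if any, should a row with this name set?
    String fieldFor(String table, String name) {
        return getConfig().get(table + ":" + name);
    }
}

// A sub-class that maps the cyto_range property onto the cytoLocation field.
class FlyBaseConfigSketch extends ConfigBaseSketch {
    @Override
    protected Map<String, String> getConfig() {
        Map<String, String> config = new HashMap<>(super.getConfig());
        config.put("featureprop:cyto_range", "cytoLocation");
        return config;
    }
}
```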

Source configuration

Example source configuration for reading from the C.elegans Chado database:

    <source name="chado-db-wormbase-c_elegans" type="chado-db" dump="true">
      <property name="source.db.name" value="wormbase"/>
      <property name="genus" value="Caenorhabditis"/>
      <property name="species" value="elegans"/>
      <property name="taxonId" value="6239"/>
      <property name="dataSourceName" value="WormBase"/>
      <property name="dataSetTitle" value="WormBase C.elegans data set"/>
    </source>

Sub-classing the converter

The processor classes can be sub-classed to allow organism-specific or database-specific configuration. To do that, create your class (perhaps in bio/sources/chado-db/main/src/) and set the processors property in your source element. For example, for reading the FlyBase Chado database there is a FlyBaseProcessor, which can be configured like this:

    <source name="chado-db-flybase-dmel" type="chado-db">
    ...
      <property name="processors"
                value="org.intermine.bio.dataconversion.FlyBaseProcessor"/>
    ...
    </source>
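
As a rough illustration of how a processors value like the one above might be turned into objects, the sketch below instantiates classes by name using reflection. It rests on two assumptions that are not confirmed by this page: that the property may hold several whitespace-separated class names, and that each class has a no-argument constructor (the real processors may well take constructor arguments). ProcessorLoaderSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class ProcessorLoaderSketch {
    // Splits the property value on whitespace and instantiates each named class.
    static List<Object> load(String processorsProperty) {
        List<Object> instances = new ArrayList<>();
        for (String className : processorsProperty.trim().split("\\s+")) {
            try {
                Class<?> cls = Class.forName(className);
                // Assumption: a public no-argument constructor exists.
                instances.add(cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot create processor: " + className, e);
            }
        }
        return instances;
    }
}
```

Failing with the offending class name in the message makes a typo in the property value easier to track down.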

Current uses

FlyMine uses the chado-db source for reading the Drosophila genomes from the FlyBase Chado database. The FlyBaseProcessor sub-class is used for configuration and to handle FlyBase special cases.

modMine, for the modENCODE project, uses the chado-db source to read D. melanogaster data from FlyBase and C. elegans data from the WormBase Chado database. The WormBaseProcessor sub-class is used for configuration when reading from WormBase.

Implementation

There is more detail about the implementation on the ChadoDBSourceImplNotes page.