The chado-db source
We have developed an InterMine DataLoader to read data from the GMOD Chado database schema into an InterMine warehouse. The InterMine core data model is broadly similar to that of Chado. The eventual aim is to allow import of any Chado database with some configuration. This will provide a web environment for running rapid, complex queries on Chado databases with minimal development effort. If you are interested in importing your Chado database, please contact us at dev[at]flymine.org.
Chado tables
The chado-db source is able to integrate objects from a Chado database. Currently only tables from the Chado sequence module are read.
These tables are queried from the Chado database:
- feature
  - used to create objects in the ObjectStore
  - the default configuration only supports features that have a Sequence Ontology type (e.g. gene, exon, chromosome)
  - each new feature in InterMine will be a sub-class of LocatedSequenceFeature
- featureloc
  - used to create Location objects and set the chromosomeLocation reference in each LocatedSequenceFeature
- feature_relationship
  - used to find part_of relationships between features
  - this information is used to create parent-child references and collections
  - examples include setting the transcripts collection in the Exon objects and the gene reference in the Transcript objects
- dbxref and feature_dbxref
  - used to create Synonym objects for external identifiers of features
  - the Synonyms will be added to the synonyms collection of the relevant LocatedSequenceFeature
- featureprop
  - used to set fields in features based on properties
  - an example from the FlyBase database: the LocatedSequenceFeature.cytoLocation field is set using the cyto_range featureprop
- synonym and feature_synonym
  - used to create extra Synonym objects for Chado synonyms
  - the Synonyms will be added to the synonyms collection of the relevant LocatedSequenceFeature
- pub, feature_pub and db
  - used to set the publications collection in the new LocatedSequenceFeature objects
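To illustrate how part_of rows from feature_relationship can be turned into parent-child collections, here is a minimal, self-contained sketch. It is not the converter's actual code: the Relationship class, the childrenByParent helper and the numeric ids are hypothetical, and it assumes the usual Chado convention that the subject of a part_of relationship is the child and the object is the parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartOfSketch {
    /** Hypothetical, simplified view of one feature_relationship row with its type name resolved. */
    static final class Relationship {
        final int subjectId;   // child feature (assumed to come from subject_id)
        final int objectId;    // parent feature (assumed to come from object_id)
        final String typeName; // relationship type name, e.g. "part_of"

        Relationship(int subjectId, int objectId, String typeName) {
            this.subjectId = subjectId;
            this.objectId = objectId;
            this.typeName = typeName;
        }
    }

    /** Groups child feature ids under their parent feature id, keeping only part_of rows. */
    static Map<Integer, List<Integer>> childrenByParent(List<Relationship> rows) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        for (Relationship row : rows) {
            if (!"part_of".equals(row.typeName)) {
                continue; // other relationship types are ignored in this sketch
            }
            result.computeIfAbsent(row.objectId, key -> new ArrayList<>()).add(row.subjectId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Relationship> rows = Arrays.asList(
            new Relationship(101, 10, "part_of"),    // exon 101 is part of transcript 10
            new Relationship(102, 10, "part_of"),    // exon 102 is part of transcript 10
            new Relationship(10, 1, "part_of"),      // transcript 10 is part of gene 1
            new Relationship(10, 7, "derives_from")  // not part_of, so skipped
        );
        // prints {10=[101, 102], 1=[10]}
        System.out.println(childrenByParent(rows));
    }
}
```

Each map entry corresponds to one parent-side collection (for example, the exons of a transcript); the child-side reference is the same pairing read in the opposite direction.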
Converter
The converter for this source is the ChadoDBConverter class.
Default configuration
By default, ChadoDBConverter queries the feature table for only a limited list of feature types. The list can be changed by sub-classing the ChadoDBConverter class and overriding the getFeatureList() method. The featureloc, feature_relationship and pub tables are then queried to create locations, parent-child relationships and publications, respectively.
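The override described above might look roughly like the following sketch. The real ChadoDBConverter is not reproduced here: StandInChadoDBConverter is a hypothetical stand-in so the example compiles on its own, and the List<String> return type, the describeTypeFilter helper and the type names in both lists are assumptions. Check the actual signature of getFeatureList() in the InterMine source before adapting it.

```java
import java.util.Arrays;
import java.util.List;

class FeatureListSketch {
    /** Hypothetical stand-in for ChadoDBConverter; the real class and its signatures may differ. */
    static class StandInChadoDBConverter {
        // Assumed shape: the Sequence Ontology type names to read from the feature table.
        protected List<String> getFeatureList() {
            return Arrays.asList("gene", "exon", "chromosome");
        }

        // Illustrative helper only: shows that the list decides which feature types are read.
        String describeTypeFilter() {
            return "feature types: " + String.join(", ", getFeatureList());
        }
    }

    /** Organism-specific sub-class that widens the list of feature types. */
    static class MyOrganismChadoDBConverter extends StandInChadoDBConverter {
        @Override
        protected List<String> getFeatureList() {
            return Arrays.asList("gene", "mRNA", "exon", "chromosome");
        }
    }

    public static void main(String[] args) {
        StandInChadoDBConverter converter = new MyOrganismChadoDBConverter();
        // prints feature types: gene, mRNA, exon, chromosome
        System.out.println(converter.describeTypeFilter());
    }
}
```

Because only the list is overridden, the rest of the conversion (locations, relationships, publications) would be inherited unchanged from the base class.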
Source configuration
Example source configuration for reading from the C. elegans Chado database:
<source name="chado-db-wormbase-c_elegans" type="chado-db" dump="true">
  <property name="source.db.name" value="wormbase"/>
  <property name="genus" value="Caenorhabditis"/>
  <property name="species" value="elegans"/>
  <property name="taxonId" value="6239"/>
  <property name="dataSourceName" value="WormBase"/>
  <property name="dataSetTitle" value="WormBase C.elegans data set"/>
</source>
Sub-classing the converter
The ChadoDBConverter class can be sub-classed to allow organism-specific or database-specific configuration. To do that, create your class (perhaps in bio/sources/chado-db/main/src/) and set the converter.class property in your source element, e.g.:
<property name="converter.class" value="org.intermine.bio.dataconversion.FlyBaseChadoDBConverter"/>
Current uses
FlyMine uses the chado-db source to read the D. melanogaster and D. pseudoobscura genomes from the FlyBase Chado database. The FlyBaseChadoDBConverter sub-class is used for configuration and to handle FlyBase special cases.
modENCODE-mine uses this source to read D. melanogaster data from the FlyBase Chado database and C. elegans data from the WormBase Chado database. The WormBaseChadoDBConverter sub-class is used for configuration.
