Last modified on 30/01/09 11:11:30

Implementation notes for the chado-db source

(See ChadoDBSource for a user level description of this source.)

The chado-db source is implemented by the ChadoDBConverter class which runs the ChadoProcessors that have been configured in the project.xml. The configuration looks like this:

<source name="chado-db-some-database" type="chado-db">
...
      <property name="processors"
                value="org.intermine.bio.dataconversion.ChadoSequenceProcessor
                       org.intermine.bio.dataconversion.StockProcessor"/>
...

ChadoDBConverter.process() creates an object for each configured ChadoProcessor in turn, then calls that object's process() method.
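The dispatch loop can be pictured roughly as follows. This is only a sketch, assuming the processors property is a whitespace-separated list of class names, each with a no-argument constructor. SketchProcessor, the two processor classes and ProcessorRunnerSketch are hypothetical stand-ins invented for illustration; the real ChadoDBConverter and ChadoProcessor signatures may well differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ChadoProcessor (not the InterMine class).
interface SketchProcessor {
    String process();
}

class SketchSequenceProcessor implements SketchProcessor {
    public String process() {
        return "sequence";
    }
}

class SketchStockProcessor implements SketchProcessor {
    public String process() {
        return "stock";
    }
}

class ProcessorRunnerSketch {
    // Split the property value on whitespace, create one object per class
    // name, and call process() on each object in the configured order.
    static List<String> runAll(String processorsProperty) {
        List<String> results = new ArrayList<>();
        for (String className : processorsProperty.trim().split("\\s+")) {
            try {
                Class<?> cls = Class.forName(className);
                SketchProcessor processor =
                    (SketchProcessor) cls.getDeclaredConstructor().newInstance();
                results.add(processor.process());
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot run processor " + className, e);
            }
        }
        return results;
    }
}
```

Because the classes are looked up by name, the order of the names in the property value is the order in which the processors run.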

Chado sequence module table processing

ChadoSequenceProcessor processes the sequence module from Chado. The process() method handles each table in turn by calling processFeatureTable(), processFeatureCVTermTable(), etc.

Each table processing method calls a result set method, eg. processFeatureTable() calls getFeatureTableResultSet() and then processes each row. The returned ResultSet may not always include all rows from the Chado table. For example, the getFeatures() method returns a subset of the possible feature types, and that list is used when querying the feature table.
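To illustrate how a list of feature types can restrict the rows that come back, here is a sketch that builds a parameterised query against the Chado feature and cvterm tables. The class name, method name and the exact SQL are assumptions made for illustration, not the query that getFeatureTableResultSet() actually runs.

```java
import java.util.Collections;
import java.util.List;

class FeatureQuerySketch {
    // Build a query that only returns features whose type name is in the
    // given list. One "?" placeholder is emitted per type so the values can
    // be bound later with PreparedStatement.setString().
    static String buildFeatureQuery(List<String> featureTypes) {
        if (featureTypes.isEmpty()) {
            throw new IllegalArgumentException("at least one feature type is required");
        }
        String placeholders =
            String.join(", ", Collections.nCopies(featureTypes.size(), "?"));
        return "SELECT f.feature_id, f.name, f.uniquename, c.name AS type"
            + " FROM feature f JOIN cvterm c ON f.type_id = c.cvterm_id"
            + " WHERE c.name IN (" + placeholders + ")";
    }
}
```

Called with the types gene and mRNA, the sketch produces a WHERE clause ending in IN (?, ?), so rows of any other feature type never reach the row-processing code.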

Generally each row is made into an appropriate object, eg. in processFeatureTable(), feature table rows correspond to BioEntity objects. Some rows of some tables are ignored (ie. not turned into objects) based on configuration.

Reading the feature table

Handled by ChadoSequenceProcessor.processFeatureTable().

For each feature it calls ChadoSequenceProcessor.makeFeatureData(), which may be overridden by subclasses. If makeFeatureData() returns null (eg. because the subclass does not need that feature) the row is discarded, otherwise processing of the feature continues.

Based on the configuration, fields in the BioEntity are set using feature.uniquename and feature.name from Chado.

If the residues column in the feature is set, create a Sequence object and add it to the new BioEntity.
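The override-and-discard behaviour described in this section can be sketched as below. All class and field names are hypothetical and the method signatures are simplified. In particular, the fixed mapping of uniquename and name onto primaryIdentifier and symbol is an assumption, since the real mapping is driven by configuration, and a plain string stands in for the Sequence object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one row of the feature table.
class FeatureRowSketch {
    final String uniquename;
    final String name;
    final String type;
    final String residues; // null when the residues column is not set

    FeatureRowSketch(String uniquename, String name, String type, String residues) {
        this.uniquename = uniquename;
        this.name = name;
        this.type = type;
        this.residues = residues;
    }
}

// Hypothetical stand-in for the object built from a feature row.
class FeatureDataSketch {
    String primaryIdentifier;
    String symbol;
    String sequence; // stands in for the Sequence object
}

class BaseFeatureProcessorSketch {
    // Subclasses may override this and return null to skip a row.
    protected FeatureDataSketch makeFeatureData(FeatureRowSketch row) {
        FeatureDataSketch data = new FeatureDataSketch();
        data.primaryIdentifier = row.uniquename; // assumed mapping
        data.symbol = row.name;                  // assumed mapping
        return data;
    }

    List<FeatureDataSketch> processFeatureRows(List<FeatureRowSketch> rows) {
        List<FeatureDataSketch> kept = new ArrayList<>();
        for (FeatureRowSketch row : rows) {
            FeatureDataSketch data = makeFeatureData(row);
            if (data == null) {
                continue; // the row is discarded
            }
            if (row.residues != null) {
                data.sequence = row.residues;
            }
            kept.add(data);
        }
        return kept;
    }
}

// Example subclass that is only interested in gene features.
class GeneOnlyProcessorSketch extends BaseFeatureProcessorSketch {
    @Override
    protected FeatureDataSketch makeFeatureData(FeatureRowSketch row) {
        return "gene".equals(row.type) ? super.makeFeatureData(row) : null;
    }
}
```

Keeping the null check in the base class means a subclass only has to decide whether it wants a feature; the rest of the row handling stays in one place.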

Reading the featureloc table

Handled by ChadoSequenceProcessor.processLocationTable().

This method is passed a result set containing the start position, end position and other information from the featureloc table. For each row of the result set it will:

  • store a Location object
  • set chromosomeLocation in the associated LocatedSequenceFeature
  • set the chromosome reference in the LocatedSequenceFeature if the srcfeature from the featureloc table is a chromosome feature
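The three steps above can be sketched for a single row as follows. The classes are hypothetical stand-ins for Location and LocatedSequenceFeature, and the lookup map and parameter list are assumptions. The sketch follows the bullets as written and sets chromosomeLocation for every row; whether the real code restricts that to chromosome srcfeatures is not stated here.

```java
import java.util.Map;

// Hypothetical stand-in for LocatedSequenceFeature.
class LocatedFeatureSketch {
    final String type;
    LocationSketch chromosomeLocation;
    LocatedFeatureSketch chromosome;

    LocatedFeatureSketch(String type) {
        this.type = type;
    }
}

// Hypothetical stand-in for Location.
class LocationSketch {
    final int start;
    final int end;
    final LocatedFeatureSketch feature;
    final LocatedFeatureSketch locatedOn;

    LocationSketch(int start, int end,
                   LocatedFeatureSketch feature, LocatedFeatureSketch locatedOn) {
        this.start = start;
        this.end = end;
        this.feature = feature;
        this.locatedOn = locatedOn;
    }
}

class LocationRowSketch {
    // Handle one featureloc row. Returns null when either feature is unknown,
    // for example because its feature row was discarded earlier.
    static LocationSketch processRow(Map<Integer, LocatedFeatureSketch> featuresById,
                                     int featureId, int srcFeatureId,
                                     int start, int end) {
        LocatedFeatureSketch feature = featuresById.get(featureId);
        LocatedFeatureSketch src = featuresById.get(srcFeatureId);
        if (feature == null || src == null) {
            return null;
        }
        LocationSketch location = new LocationSketch(start, end, feature, src);
        feature.chromosomeLocation = location;
        if ("chromosome".equals(src.type)) {
            feature.chromosome = src; // only when the srcfeature is a chromosome
        }
        return location;
    }
}
```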

Reading the feature_relationship table

Handled by ChadoSequenceProcessor.processRelationTable().

This method calls getFeatureRelationshipResultSet() to return the relations of interest. The relations will be used to create references and collections.

The method will automatically attempt to find and set the appropriate references and collections for part_of relations. As an example, if there is a part_of relation in the table between Gene and Transcript and there is a Gene.transcript reference or a Gene.transcripts collection, it will be set.

There are two modes of operation, controlled by the subjectFirst parameter. If it is true, the results are ordered by the subject_id of the feature_relationship table, so we get results like:

  gene1_feature_id  relation_type  protein1_feature_id
  gene1_feature_id  relation_type  protein2_feature_id
  gene2_feature_id  relation_type  protein1_feature_id
  gene2_feature_id  relation_type  protein2_feature_id

(Assuming the unlikely case where two genes are related to two proteins)

If subjectFirst is false we get results like:

  gene1_feature_id  relation_type  protein1_feature_id
  gene2_feature_id  relation_type  protein1_feature_id
  gene1_feature_id  relation_type  protein2_feature_id
  gene2_feature_id  relation_type  protein2_feature_id

The first case is used when we need to set a collection in the gene; the second is used when we need to set a collection in the proteins.
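One way to see why the ordering matters: when rows arrive sorted by the id that owns the collection, all members of one collection are consecutive, so each collection can be finished as soon as the leading id changes. The sketch below uses hypothetical classes and an in-memory sort in place of the SQL ordering, and treats the first column of the tables above as the subject.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one feature_relationship row.
class RelationRowSketch {
    final String subjectId;
    final String relationType;
    final String objectId;

    RelationRowSketch(String subjectId, String relationType, String objectId) {
        this.subjectId = subjectId;
        this.relationType = relationType;
        this.objectId = objectId;
    }
}

class RelationGroupingSketch {
    // Stand-in for the database ordering: subject first or object first.
    static List<RelationRowSketch> order(List<RelationRowSketch> rows, boolean subjectFirst) {
        Comparator<RelationRowSketch> bySubject = Comparator.comparing(r -> r.subjectId);
        Comparator<RelationRowSketch> byObject = Comparator.comparing(r -> r.objectId);
        List<RelationRowSketch> sorted = new ArrayList<>(rows);
        sorted.sort(subjectFirst
            ? bySubject.thenComparing(byObject)
            : byObject.thenComparing(bySubject));
        return sorted;
    }

    // Walk the ordered rows once; a collection is complete when the owner changes.
    static List<String> collect(List<RelationRowSketch> ordered, boolean subjectFirst) {
        List<String> finished = new ArrayList<>();
        String currentOwner = null;
        List<String> members = new ArrayList<>();
        for (RelationRowSketch row : ordered) {
            String owner = subjectFirst ? row.subjectId : row.objectId;
            String member = subjectFirst ? row.objectId : row.subjectId;
            if (currentOwner != null && !owner.equals(currentOwner)) {
                finished.add(currentOwner + "=" + members);
                members = new ArrayList<>();
            }
            currentOwner = owner;
            members.add(member);
        }
        if (currentOwner != null) {
            finished.add(currentOwner + "=" + members);
        }
        return finished;
    }
}
```

With subjectFirst set to true, the four example rows yield one collection per gene; with false, one collection per protein. If the rows were not ordered, the single pass in collect() would split one owner's members across several partial collections.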

Reading the cvterm table

Handled by ChadoSequenceProcessor.processFeatureCVTermTable().