Changes between Version 1 and Version 2 of GFF

Author:
julie (IP: 192.168.128.116)
Timestamp:
23/11/09 16:32:10 (10 months ago)
Comment:

--

  • GFF

    v1 v2  
    11= BioSources > gff = 
    22 
    3 See GettingStarted for a detailed look on how to create a GFF source. 
     3See [http://intermine.org/wiki/GettingStarted#a8.3TheGFF3source GettingStarted] for a detailed look at how to create a GFF source. 
    44 
    5 An example project.properties file: 
    65 
     6!InterMine includes a parser to load valid GFF3 files.  The creation of features, sequence features (usually chromosomes), locations and standard attributes is taken care of automatically.   
     7  
     8 * Many elements can be configured by properties in `project.xml`.  To deal with any specific attributes or perform custom operations on each feature, you can write a handler in Java, which is called as each line of the GFF file is read. 
     9 * Other `gff3` properties can be configured in `project.xml`.  The properties set for `malaria-gff` are: 
     10   * `gff3.seqClsName = Chromosome` 
     11    * the ids in the first column represent `Chromosome` objects, e.g. MAL1 
     12   * `gff3.taxonId = 36329` 
     13    * taxon id of malaria 
     14   * `gff3.dataSourceName = PlasmoDB` 
     15    * the data source for features and their identifiers; this is used for the !DataSet (evidence) and synonyms. 
     16   * `gff3.seqDataSourceName = PlasmoDB` 
     17    * the source of the seqids (chromosomes) sometimes differs from the source of the features described 
     18   * `gff3.dataSetTitle = PlasmoDB P. falciparum genome`  
     19    * a !DataSet object is created as evidence for the features; it is linked to a !DataSource (PlasmoDB) 
     20 
     21 * In some cases specific code is required to deal with attributes in the GFF file and any special cases.  A specific `source` can be created to contain this code and any necessary additions to the data model.  For malaria GFF we need a handler to switch which fields from the file are set as `primaryIdentifier` and `symbol`/`secondaryIdentifier` in the features created.  This makes the identifiers match those from !UniProt, a common issue when integrating multiple data sources. 
     22  * From the example above, by default: `ID=gene.46311;description=hypothetical%20protein;Name=PFA0210c` would make `Gene.primaryIdentifier` be `gene.46311` and `Gene.symbol` be `PFA0210c`.  We need `PFA0210c` to be the `primaryIdentifier`. 
     23 * The `malaria-gff` source is held in the `bio/sources/example-sources/malaria-gff` directory.  The `project.properties` file in this directory has two properties of interest: 
     24   {{{ 
     25   # set the source type to be gff 
     26   have.file.gff=true 
     27 
     28   # specify a Java class to be called on each row of the gff file to cope with attributes 
     29   gff3.handlerClassName = org.intermine.bio.dataconversion.MalariaGFF3RecordHandler 
     30   }}} 
     31 * Look at the `MalariaGFF3RecordHandler` class in `bio/sources/example-sources/malaria-gff/main/src/org/intermine/bio/dataconversion`.  This code changes which fields the `ID` and `Name` attributes from the GFF file are assigned to. 
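
The identifier switch described above can be illustrated with a minimal stand-alone sketch. It does not use the !InterMine API and is not the code of `MalariaGFF3RecordHandler`; the class name `GffIdentifierSwap` and its methods are hypothetical. It parses the attributes column from the example line, then shows the default assignment next to the switched one. Storing the old `ID` as `secondaryIdentifier` (rather than `symbol`) is an assumption made for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: plain Java, no InterMine classes involved.
class GffIdentifierSwap {

    /** Parse a GFF3 attributes column ("key=value;key=value") into an ordered map. */
    static Map<String, String> parseAttributes(String column) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String pair : column.split(";")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed entries
            }
            String key = pair.substring(0, eq).trim();
            attributes.put(key, decode(pair.substring(eq + 1).trim()));
        }
        return attributes;
    }

    /** Percent-decode a value; '+' is protected first so it is not turned into a space. */
    static String decode(String value) {
        try {
            return URLDecoder.decode(value.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Default assignment described in the text: ID -> primaryIdentifier, Name -> symbol. */
    static Map<String, String> defaultFields(Map<String, String> attributes) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("primaryIdentifier", attributes.get("ID"));
        fields.put("symbol", attributes.get("Name"));
        return fields;
    }

    /** Switched assignment: Name -> primaryIdentifier, ID -> secondaryIdentifier (assumed). */
    static Map<String, String> swappedFields(Map<String, String> attributes) {
        Map<String, String> fields = new LinkedHashMap<>();
        String name = attributes.get("Name");
        if (name == null) {
            // Without a Name attribute there is nothing to switch; keep the default.
            return defaultFields(attributes);
        }
        fields.put("primaryIdentifier", name);
        fields.put("secondaryIdentifier", attributes.get("ID"));
        return fields;
    }

    public static void main(String[] args) {
        String column = "ID=gene.46311;description=hypothetical%20protein;Name=PFA0210c";
        Map<String, String> attributes = parseAttributes(column);
        System.out.println("default: " + defaultFields(attributes));
        System.out.println("swapped: " + swappedFields(attributes));
    }
}
```

In the real source this logic lives in the class named by `gff3.handlerClassName`, which is called for each row of the GFF file.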
    732{{{ 
    8 # project.properties for malaria-gff source 
    9 compile.dependencies = intermine/objectstore/main,\ 
    10                        intermine/integrate/main, \ 
    11                        bio/core/main, \ 
    12                        bio/sources/example-sources/malaria-gff/main 
    13  
    14 have.file.gff3 = true 
    15  
    16 gff3.handlerClassName = org.intermine.bio.dataconversion.MalariaGFF3RecordHandler 
     33> less ~/svn/dev/bio/sources/example-sources/malaria-gff/main/src/org/intermine/bio/dataconversion/MalariaGFF3RecordHandler.java 
    1734}}} 
    1835