!InterMine includes a parser to load valid GFF3 files. The creation of features, sequence features (usually chromosomes), locations and standard attributes is taken care of automatically.

 * Many elements can be configured by properties in `project.xml`. To deal with any specific attributes, or to perform custom operations on each feature, you can write a handler in Java that is called for each line of GFF as it is read.
 * Other `gff3` properties can be configured in `project.xml`. The properties set for `malaria-gff` are:
   * `gff3.seqClsName = Chromosome`
     * the ids in the first column represent `Chromosome` objects, e.g. MAL1
   * `gff3.taxonId = 36329`
     * taxon id of malaria
   * `gff3.dataSourceName = PlasmoDB`
     * the data source for features and their identifiers; this is used for the !DataSet (evidence) and synonyms
   * `gff3.seqDataSourceName = PlasmoDB`
     * the source of the seqids (chromosomes), which is sometimes different from the source of the features described
   * `gff3.dataSetTitle = PlasmoDB P. falciparum genome`
     * a !DataSet object is created as evidence for the features; it is linked to a !DataSource (PlasmoDB)

 * In some cases specific code is required to deal with attributes in the GFF file and any special cases. A specific `source` can be created to contain this code and any necessary additions to the data model. For malaria GFF we need a handler to switch which fields from the file are set as `primaryIdentifier` and `symbol`/`secondaryIdentifier` in the features created. This is needed to match the identifiers from !UniProt; it is quite a common issue when integrating data from multiple sources.
 * From the example above, by default `ID=gene.46311;description=hypothetical%20protein;Name=PFA0210c` would set `Gene.primaryIdentifier` to `gene.46311` and `Gene.symbol` to `PFA0210c`. We need `PFA0210c` to be the `primaryIdentifier`.
 * The `malaria-gff` source is held in the `bio/sources/example-sources/malaria-gff` directory. Look at the `project.properties` file in this directory; there are two properties of interest:
{{{
# set the source type to be gff
have.file.gff=true

# specify a Java class to be called on each row of the gff file to cope with attributes
gff3.handlerClassName = org.intermine.bio.dataconversion.MalariaGFF3RecordHandler
}}}
 * Look at the `MalariaGFF3RecordHandler` class in `bio/sources/example-sources/malaria-gff/main/src/org/intermine/bio/dataconversion`. This code changes which fields the `ID` and `Name` attributes from the GFF file are assigned to.
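
The identifier switch described above can be illustrated with a small standalone sketch. It does not use the !InterMine API and is not the code in `MalariaGFF3RecordHandler`; the class name `GffIdentifierSketch` and its methods are hypothetical, and placing the original `ID` value in `secondaryIdentifier` is an assumption about one reasonable arrangement. It only shows how the attributes column of the example line might map to fields by default and after the switch.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of InterMine. */
public class GffIdentifierSketch {

    /** Splits a GFF3 attributes column ("key=value;key=value") into a map. */
    public static Map<String, String> parseAttributes(String column) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String pair : column.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip empty or malformed entries
            }
            String key = pair.substring(0, eq).trim();
            // GFF3 values are percent-encoded and '+' is a literal plus sign.
            // URLDecoder would turn '+' into a space, so protect it first.
            String raw = pair.substring(eq + 1).trim().replace("+", "%2B");
            attributes.put(key, URLDecoder.decode(raw, StandardCharsets.UTF_8));
        }
        return attributes;
    }

    /** Default mapping described in the text: ID -> primaryIdentifier, Name -> symbol. */
    public static Map<String, String> defaultFields(Map<String, String> attributes) {
        Map<String, String> gene = new LinkedHashMap<>();
        gene.put("primaryIdentifier", attributes.get("ID"));
        gene.put("symbol", attributes.get("Name"));
        return gene;
    }

    /**
     * Switched mapping: Name -> primaryIdentifier.
     * Keeping the ID value as secondaryIdentifier is an assumption of this sketch.
     */
    public static Map<String, String> switchedFields(Map<String, String> attributes) {
        Map<String, String> gene = new LinkedHashMap<>();
        gene.put("primaryIdentifier", attributes.get("Name"));
        gene.put("secondaryIdentifier", attributes.get("ID"));
        return gene;
    }

    public static void main(String[] args) {
        String column = "ID=gene.46311;description=hypothetical%20protein;Name=PFA0210c";
        Map<String, String> attributes = parseAttributes(column);
        System.out.println("default:  " + defaultFields(attributes));
        System.out.println("switched: " + switchedFields(attributes));
    }
}
```

In a real source the same decision would be made inside the class named by `gff3.handlerClassName`, working on the feature created for each GFF line rather than on plain maps.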