Last modified on 15/11/10 00:08:55

Writing a Parser using Perl to Write InterMine-Items XML

In this tutorial we will create a new Perl parser to parse pathway information from KEGG.

Opening the stub parser

In the source properties panel, click the "Open parser to edit" button. This opens a stub parser for you to fill in.

Add code to parse the KEGG files

There are two KEGG files to load; both are tab-delimited.

  • map_title.tab contains the mapping between KEGG pathway identifier and pathway titles
  • pfa_gene_map.tab contains the mapping between gene identifier and pathway identifiers
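
Because the parser depends on both of these files, it may help to confirm that they exist and are readable before any parsing starts. The following is a minimal sketch in plain Perl; the helper name unreadable_files is hypothetical and is not part of the InterMine Perl API, and the temporary file only stands in for a real data file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the paths that are undefined, missing,
# not plain files, or not readable by the current user.
sub unreadable_files {
    my @paths = @_;
    return grep { !( defined $_ && -f $_ && -r _ ) } @paths;
}

# Demonstration with one temporary file that exists and one path that does not.
my ( $fh, $existing ) = tempfile( UNLINK => 1 );
close $fh;

my @bad = unreadable_files( $existing, "$existing.does-not-exist" );
print "Unreadable: @bad\n";
```

In the parser itself, the same check could be applied to the two file names taken from @ARGV, dying with a usage message if the returned list is not empty.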

Before starting on the code, you might want to read about the InterMine Perl API.

Steps

  1. Redefine the argument parsing to include the data files we need (you might also want to error-check these):
    my ( $model_file, $out_file, $pathway_file, $gene_mappings_file ) = @ARGV;
    
  2. Define a couple of variables we will want to refer to:
    my $data_source = 'Kegg';
    my $taxon_id = 36329;
    my %pathway_with;
    
  3. Make a DataSource object, with its name set to our source, i.e. 'Kegg':
    my $datasource_item = $document->add_item(
        'DataSource',
        'name' => $data_source,
    );
    
  4. Make a DataSet object, with its name and dataSource fields set:
    my $dataset_item = $document->add_item(
        'DataSet',
        'name'       => $data_source . ' data set for taxon id: ' . $taxon_id,
        'dataSource' => $datasource_item,
    );
    
  5. Make an Organism object, with its taxonId and dataSets fields set. Note that collections take array-refs of items, even if there is only one item:
    my $org_item = $document->add_item(
        'Organism',
        taxonId => $taxon_id,
        dataSets => [$dataset_item],
    );
    
  6. Read kegg/map_title.tab into the %pathway_with hash.
    • format: pathway_id<tab>pathway_title
      open(my $pathways, '<', $pathway_file) or die "Cannot open $pathway_file: $!";
      while (<$pathways>) {
          chomp;
          my ($id, $title) = split(/\t/);
      ...
      
    • For each line, create an item of type "Pathway", with its name field set to the value of pathway_title, its identifier field set to the value of pathway_id, and a reference to the dataset item in its dataSets field.
      my $item = $document->add_item(
          'Pathway',
          'identifier' => $id,
          'name'       => $title,
          'dataSets'   => [$dataset_item],
      );
      
      • We will need to look the item up by its id later, so store it in the hash with the id as its key.
        $pathway_with{$id} = $item;
        
  7. Read in kegg/pfa_gene_map.tab:
    open(my $gene_mappings, '<', $gene_mappings_file) or die "Cannot open $gene_mappings_file: $!";
    while (<$gene_mappings>) {
        chomp;
    
    • format: gene_identifier<tab>pathway_id[<space>pathway_id ...]
      my ($gene_id, $pathway_string) = split(/\t/);
      my @pathway_ids = split(/\s/, $pathway_string);
      
    • Get the pathway items for the pathway ids:
      my @pathway_items = @pathway_with{@pathway_ids};
      
      • Create an item of type "Gene", with its primaryIdentifier field set to the value of gene_identifier, its organism field set to our organism item, its pathways field set to a reference to the array of pathway items we just made, and the same dataSets field as above. Calling add_item also stores the item in the document's list of items to write out:
        my $item = $document->add_item(
           'Gene',
           primaryIdentifier => $gene_id,
           organism          => $org_item,
           pathways          => \@pathway_items,
           dataSets          => [$dataset_item],
        );
        
  8. Write the items to the XML document:
    $document->write();
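
The file-reading logic in steps 6 and 7 can be tried out on its own, without the InterMine modules. The sketch below is an illustration under stated assumptions: the sample data is made up, plain hashes stand in for the Pathway and Gene items, and the sub names read_pathways and read_gene_mappings are hypothetical. It reads from in-memory strings so that it runs without the KEGG files, and it drops pathway ids that were not seen in the title file, which is a design choice of this sketch rather than something the tutorial prescribes.

```perl
use strict;
use warnings;

# Made-up sample data in the two formats described in the steps above.
my $pathway_data = "path01\tExample pathway one\npath02\tExample pathway two\n";
my $gene_data    = "GENE_A\tpath01 path02\nGENE_B\tpath02 path99\n";

# Read "pathway_id<tab>pathway_title" lines into a hash keyed by pathway id.
sub read_pathways {
    my ($fh) = @_;
    my %title_for;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;    # skip blank lines
        my ( $id, $title ) = split /\t/, $line, 2;
        $title_for{$id} = $title;
    }
    return \%title_for;
}

# Read "gene_identifier<tab>pathway_id[<space>pathway_id ...]" lines.
# Pathway ids that are not keys of $title_for are dropped (sketch's choice).
sub read_gene_mappings {
    my ( $fh, $title_for ) = @_;
    my %pathways_of;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        my ( $gene_id, $pathway_string ) = split /\t/, $line, 2;
        my @pathway_ids = split ' ', ( $pathway_string // '' );
        $pathways_of{$gene_id} = [ grep { exists $title_for->{$_} } @pathway_ids ];
    }
    return \%pathways_of;
}

# Open the strings as in-memory filehandles in place of the real files.
open( my $pathways_fh, '<', \$pathway_data ) or die "Cannot open pathway data: $!";
my $title_for = read_pathways($pathways_fh);
close $pathways_fh;

open( my $genes_fh, '<', \$gene_data ) or die "Cannot open gene data: $!";
my $pathways_of = read_gene_mappings( $genes_fh, $title_for );
close $genes_fh;

for my $gene_id ( sort keys %$pathways_of ) {
    print "$gene_id: @{ $pathways_of->{$gene_id} }\n";
}
```

In the real parser, the places where this sketch fills plain hashes are where the calls to $document->add_item from steps 6 and 7 would go.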
    

Memory Management

Notice that although the additions file defines pathways as having a collection of genes, and genes as having a collection of pathways, we only set pathways on the genes, and not vice versa. This is because the intermine-items-xml-file source loader automatically creates the reverse references for many-to-many relationships. Setting both sides can result in large memory leaks when dealing with larger data sets.
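
To see why setting one side is enough, note that the reverse side of a many-to-many relationship can always be derived from the forward side. The sketch below illustrates this with plain hashes and made-up identifiers; it is not the loader's actual implementation, and the sub name invert_mapping is hypothetical.

```perl
use strict;
use warnings;

# Made-up stand-in data: each gene lists its pathways (one side only).
my %pathways_of = (
    GENE_A => [ 'path01', 'path02' ],
    GENE_B => ['path02'],
);

# Derive the reverse side (pathway id -> gene ids) from the one-sided data.
sub invert_mapping {
    my ($forward) = @_;
    my %genes_in;
    for my $gene_id ( sort keys %$forward ) {
        push @{ $genes_in{$_} }, $gene_id for @{ $forward->{$gene_id} };
    }
    return \%genes_in;
}

my $genes_in = invert_mapping( \%pathways_of );
print "$_: @{ $genes_in->{$_} }\n" for sort keys %$genes_in;
```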

Completed Example

For a completed example: see KeggParserExamplePerl

Next: WorkshopNov2010