Last modified on 15/11/10 00:22:32

Create a new parser to load KEGG pathway data

In this tutorial we will create a new parser that loads pathway information from KEGG, along with the mapping of genes to particular pathways.

1. Copy a partially completed Java file

We now need to add a Java class that will parse the KEGG files and create Items that will be loaded into the malariamine database.

The make_source script created a skeleton Java class to which you would normally add the parsing code:

> less ~/svn/dev/bio/sources/kegg-demo/main/src/org/intermine/bio/dataconversion/KeggDemoConverter.java

For this tutorial we will use a partially completed parser for KEGG data. Copy this into your source directory now.

> cp ~/svn/dev/bio/tutorial/malariamine/examples/KeggDemoConverter.java ~/svn/dev/bio/sources/kegg-demo/main/src/org/intermine/bio/dataconversion/KeggDemoConverter.java

Now open this file in an editor. The process() method is called automatically by the InterMine build system. In the example KEGG parser, it checks which file is being read and calls the appropriate method.
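
The file-dispatch idea can be sketched on its own. This is a minimal stand-alone illustration, not the code in KeggDemoConverter.java: the class name FileDispatchSketch and the method handlerFor() are hypothetical, and it assumes the decision is made on the file name alone.

```java
import java.io.File;

// Hypothetical stand-alone sketch; not part of the InterMine API.
class FileDispatchSketch {

    // Returns the name of the method that would handle the given file,
    // or "skip" for files this parser does not recognise.
    static String handlerFor(File currentFile) {
        String name = currentFile.getName();
        if ("pfa_gene_map.tab".equals(name)) {
            return "processGeneMapFile";
        } else if ("map_title.tab".equals(name)) {
            return "processMapTitleFile";
        }
        return "skip";
    }

    public static void main(String[] args) {
        // The directory name "kegg" is made up for the example.
        System.out.println(handlerFor(new File("kegg/pfa_gene_map.tab")));
        System.out.println(handlerFor(new File("kegg/map_title.tab")));
    }
}
```

In the real converter the two branches would call processGeneMapFile() and processMapTitleFile() directly rather than return their names.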

2. Add code to parse the pfa_gene_map.tab file

Write the following code in the processGeneMapFile() method. It needs to go inside the while loop:

  • read the gene id from the first column of the file:
     String geneId = line[0];
    
  • you will need to create a Gene Item for each gene id and set the primaryIdentifier attribute
      Item gene = createItem("Gene");
      gene.setAttribute("primaryIdentifier", geneId);
    
  • each gene should reference an organism. Create and store the organism only once: the existing getOrganism() method creates and stores the organism Item the first time it is called.
      gene.setReference("organism", getOrganism());
    

This line calls the getOrganism() method that is already written in the Java file.

  • The second column contains a space-separated list of the pathway ids that the gene is a member of. For each pathway id in the file, create a Pathway Item and add it to the pathways collection of the Gene Item.
      String[] pathwayIds = line[1].split(" ");
                
      // add each pathway to the Gene.pathways collection
      for (String pathwayId : pathwayIds) {
        // getPathway() will create a new pathway or fetch it from a map if already seen
        Item pathway = getPathway(pathwayId);
        gene.addToCollection("pathways", pathway);
      }
    
  • note that you only need to create an Item for each pathway once; you can keep them in a map keyed by the KEGG id. Add this method to the end of your class:
        private Item getPathway(String pathwayId) {
            Item pathway = pathwayMap.get(pathwayId);
            if (pathway == null) {
                pathway = createItem("Pathway");
                pathway.setAttribute("identifier", pathwayId);
                pathwayMap.put(pathwayId, pathway);
            }
            return pathway;
        }
    
  • all the fields of the gene item are now set so we can store it:
      store(gene);
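
To see the column handling from the steps above in isolation, here is a minimal sketch that needs no InterMine classes. The class GeneMapLineSketch and its methods are hypothetical, and the sample line in main() is made-up data, not taken from pfa_gene_map.tab. It assumes the file is tab-delimited and that the line array in the tutorial holds one entry per column. Unlike the tutorial code, it also skips empty tokens produced by repeated spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the InterMine API.
class GeneMapLineSketch {

    // Splits one raw line into columns (assumes tab-delimited input).
    static String[] columns(String rawLine) {
        return rawLine.split("\t");
    }

    // The gene id is in the first column.
    static String geneId(String[] line) {
        return line[0];
    }

    // The second column is a space-separated list of pathway ids.
    static List<String> pathwayIds(String[] line) {
        List<String> ids = new ArrayList<>();
        for (String id : line[1].split(" ")) {
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up example line: one gene id, then two pathway ids.
        String[] line = columns("GENE_0001\t00010 00020");
        System.out.println(geneId(line));
        System.out.println(pathwayIds(line));
    }
}
```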
    

For reference you can see a complete parser here: example solution

3. Add code to parse the map_title.tab file

The map_title.tab file contains names for each pathway id. Now add code to the processMapTitleFile method to read this file and store pathways with names.

  • First get the pathway id and name from the current line of the file:
        String pathwayId = line[0];
        String pathwayName = line[1];
    
  • Fetch the pathway item and set the name:
        Item pathway = getPathway(pathwayId);
        pathway.setAttribute("name", pathwayName);
    
  • We've now finished processing the pathway, so we can store it:
        store(pathway);
    

Note: the pathway Items are created and held in a map keyed by KEGG identifier. processGeneMapFile() only needs the internal id of the generated Item to add it to the gene's pathways collection. Because both methods go through the pathway map, the two files can be processed in either order with the same result.
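
The order independence described in the note can be illustrated with a small sketch. It uses a hypothetical Pathway class as a stand-in for an InterMine Item, so it shows only the get-or-create map pattern, not the real Item or store() behaviour. The pathway id and name in main() are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the InterMine API.
class PathwayMapSketch {

    // Stand-in for an InterMine Item, for illustration only.
    static class Pathway {
        final String identifier;
        String name;

        Pathway(String identifier) {
            this.identifier = identifier;
        }
    }

    private final Map<String, Pathway> pathwayMap = new HashMap<>();

    // Returns the existing pathway for this id, creating it on first use.
    Pathway getPathway(String pathwayId) {
        Pathway pathway = pathwayMap.get(pathwayId);
        if (pathway == null) {
            pathway = new Pathway(pathwayId);
            pathwayMap.put(pathwayId, pathway);
        }
        return pathway;
    }

    int size() {
        return pathwayMap.size();
    }

    public static void main(String[] args) {
        PathwayMapSketch sketch = new PathwayMapSketch();
        // As if seen first in the gene map file...
        Pathway fromGeneFile = sketch.getPathway("00010");
        // ...and later in the title file, where the name is set.
        Pathway fromTitleFile = sketch.getPathway("00010");
        fromTitleFile.name = "Example pathway name";
        System.out.println(fromGeneFile == fromTitleFile);
        System.out.println(fromGeneFile.name);
    }
}
```

Whichever file is read first, both lookups return the same object, so a name set while reading one file is visible through the reference obtained while reading the other.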

To test your converter, try running the uniprot-malaria and kegg-demo sources together, then look at the integrated data:

# in malariamine/integrate
> (cd ../dbmodel; ant clean build-db) && ant -Dsource=uniprot-malaria,kegg-demo -v

There is an example Java converter here: KeggExampleConverter.java.


You should now have a working source to integrate KEGG pathway data into MalariaMine.

Next: WorkshopNov2010