Last modified on 21/05/09 10:37:03

MalariaMine setup

Throughout the workshop we will build a small Mine, MalariaMine, with data from P. falciparum. We are using malaria because the data sets are relatively small and so load quickly. We will use existing InterMine parsers to create MalariaMine with genome data (GFF3 and FASTA), UniProt protein information and GO annotation. You will write a new parser to integrate some KEGG pathway data.

Definition of terms:

  • InterMine - the system we have developed to enable construction of integrated databases with corresponding web interfaces.
  • Mine - we use the term 'Mine' to refer to an instance of InterMine with specific data - examples of existing Mines are FlyMine and modMine. All of the configuration to create a new Mine is held under a single directory of the same name and makes use of code from the core InterMine system.
  • sources - each type of data included in a Mine is termed a source. A source may include code to parse data from a particular format (e.g. GO, UniProt) and information on how to integrate it. A Mine can include data from several sources.
  • webapp - the web application component of InterMine deployed to provide query access to the database.

Creating the new Mine

All configuration to create a new Mine is held in a directory in svn/dev. Change into the directory where you checked out the InterMine source code and look at the sub-directories:

% cd svn/dev
% ls

The directories include:

  • intermine - the core InterMine code, this is generic code to work with any data model.
  • bio - code to deal specifically with biological data, it includes sources to load data from standard biological formats.
  • flymine - the configuration used to create FlyMine, a useful reference when building your own Mine.
  • testmodel - a non-biological test data model used for testing the core InterMine system.
  • imbuild - InterMine's ant-based build system; you shouldn't need to edit anything here.

Any Mine needs to be a top level directory in your InterMine checkout. There is a make_mine script for creating the bare-bones structure for a Mine with the given name. In svn/dev run:

% bio/scripts/make_mine MalariaMine

You will see the message:

Created malariamine directory for MalariaMine

A malariamine directory has been created with several sub-directories. Change into the directory and look at what was created:

% cd malariamine
% ls

We will look at each of the sub-directories in much more detail later. They are:

  • dbmodel - contains information about the data model to be used and ant targets relating to the data model and database creation.
  • integrate - provides ant targets for loading data into the Mine.
  • postprocess - ant targets to run post-processing operations on the data in the Mine once it is integrated.

In addition there are two default.xxx.properties files which we won't need to edit and a project.xml file.

Note that most generated directories have a project.properties file and a short build.xml file; these are used by the InterMine build system.

Project.xml

This file acts as the central configuration for the Mine. There are some properties at the top that we won't need to edit; for reference you can see what they are used for here.

The automatically generated file has empty <sources> and <post-processing> sections. There is a project.xml already prepared to define a new MalariaMine, copy it to this directory now and look at it:

% cp ../bio/tutorial/malariamine/project.xml .
% less project.xml

<sources>

The <source> elements list and configure the data sources to be loaded. Each one has a type that corresponds to a directory in svn/dev/bio/sources; these directories include parsers to retrieve data and information on how it will be integrated. The name can be anything and can be the same as type, but using a more specific name allows you to define specific integration keys (more on this later).

<source> elements can have several properties: src.data.dir, src.data.file and src.data.includes are all used to define locations of files that the source should load. Other properties are used as parameters to parsing code.
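
As a quick check of which sources a project.xml configures, the following shell sketch prints the name and type of each <source> element. The function name list_sources is hypothetical (it is not part of InterMine), and the pattern assumes that name comes before type on a single line, as in the examples in this tutorial.

```shell
# Print "name type" for each <source> element in a project.xml file.
# Hypothetical helper, not part of InterMine.
# Assumes name="..." is immediately followed by type="..." on the same line.
# Usage: list_sources project.xml
list_sources() {
    sed -n 's/.*<source name="\([^"]*\)" type="\([^"]*\)".*/\1 \2/p' "$1"
}
```

Each type printed should match a directory name under svn/dev/bio/sources.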

<post-processing>

Specific operations can be performed on the Mine once data is loaded; these are listed here as <post-process> elements. We will look at these in more detail later.

Data to load

The InterMine checkout includes a tar file with data to load into MalariaMine. These are real, complete data sets for P. falciparum. We will load genome annotation from PlasmoDB, protein data from UniProt and GO annotation also from PlasmoDB.

Copy this to some local directory (your home directory is fine for this workshop) and extract the archive:

% cd
% cp svn/dev/bio/tutorial/malariamine/malaria-data.tar.gz .
% tar -zxvf malaria-data.tar.gz

In your malariamine directory, edit project.xml to point each source at the extracted data by replacing DATA_DIR with /home/username. In the training room your unix username will be partXX. For example, the uniprot-malaria source:

  <sources>
    <source name="uniprot-malaria" type="uniprot">
      <property name="uniprot.organisms" value="36329"/>
      <property name="src.data.dir" location="DATA_DIR/malaria/uniprot/"/>
    </source>
    ...
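
If you prefer not to edit the file by hand, the replacement can be sketched in shell as below. The function name set_data_dir is hypothetical, and the sketch assumes the directory you pass contains no | or & characters, which have a special meaning in the sed expression.

```shell
# Replace every DATA_DIR placeholder in a file with a real directory.
# Hypothetical helper, not part of InterMine.
# Usage: set_data_dir project.xml /home/username
set_data_dir() {
    file=$1
    dir=$2
    # Use | as the sed delimiter because the directory contains slashes.
    # Write to a temporary file first, then move it over the original.
    sed "s|DATA_DIR|$dir|g" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```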

The project.xml file is now ready to use.
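
Before moving on, it may be worth confirming that the paths in project.xml exist on disk. The following is a minimal sketch, assuming that file and directory paths are given in location="..." attributes, one per line, as in the uniprot-malaria example; the function name check_locations is hypothetical.

```shell
# Print every location="..." path in a project.xml that does not exist on disk.
# Hypothetical helper, not part of InterMine.
# Usage: check_locations project.xml
check_locations() {
    sed -n 's/.*location="\([^"]*\)".*/\1/p' "$1" | while IFS= read -r path; do
        [ -e "$path" ] || echo "missing: $path"
    done
}
```

No output means every location was found. A line still containing DATA_DIR will be reported as missing.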

Properties file

We need to set up a malariamine.properties file, as we did for the tests, to configure the database names and locations.

If you don't already have a .intermine directory in your home directory, create one now:

% cd
% mkdir .intermine

Create a file called malariamine.properties in the .intermine/ directory under your home directory with the contents as below. Edit this file:

  • set the username and password; in the training room these will both be 'partXX', which is your unix username.

(File contents included from trunk/bio/tutorial/malariamine/malariamine.properties)

Create databases

Finally, we need to create production-malaria and common-tgt-items-malaria postgres databases as specified in the malariamine.properties file:

% createdb production-malaria
% createdb common-tgt-items-malaria
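
To confirm that both databases now exist, a small shell sketch like the one below can help. The function name missing_dbs is hypothetical. It reads a list of database names on standard input, one per line, and prints any of the two required names that are absent. The usage line assumes psql can connect as your user without further options.

```shell
# Print each required database name that is absent from the list on stdin.
# Hypothetical helper, not part of InterMine.
# Usage (assumes psql can connect as your user):
#   psql -At -l | cut -d'|' -f1 | missing_dbs
missing_dbs() {
    existing=$(cat)
    for db in production-malaria common-tgt-items-malaria; do
        # -x matches the whole line, so partial names do not count.
        printf '%s\n' "$existing" | grep -qx "$db" || echo "$db"
    done
}
```

No output means both databases were found.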

For an introduction to postgres commands see: PostgresBasics.

You are now ready to start building MalariaMine. Next we will look at the data model and the creation of a database schema.


Next: WorkshopDataModel