Technical Overview

This document is a brief introduction to how InterMine works.

General overview of InterMine

The basic architecture of an InterMine system is shown above. At the core of InterMine is the ObjectStore, a custom object/relational mapping system written in Java and optimised for read-only databases. The ObjectStore is accessed from a Struts/JSP/Ajax web application and runs its queries against a PostgreSQL relational database: it accepts object queries from a client (the web application), generates SQL to execute in the underlying database, and materialises objects from the results.

Query Optimisation

Before final execution, each SQL query is passed to the QueryOptimiser (a Java program developed to improve InterMine performance), which can rewrite any SQL query to make use of precomputed tables (analogous to materialized views). The whole query, or parts of it, may be replaced by one or more precomputed tables; estimates of execution time from the PostgreSQL database are used to decide which of the possible rewritten queries will be fastest. This replicates the behaviour of many commercial database systems, but our aim was to use only open source software.
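The selection step described above can be sketched in a few lines of Java. This is a minimal illustration, not the QueryOptimiser's actual code: the `Candidate` class, the `pickCheapest` method, the table names and the cost figures are all hypothetical. The costs stand in for whatever estimates PostgreSQL supplies (for instance through `EXPLAIN`, which is an assumption about the mechanism).

```java
import java.util.Comparator;
import java.util.List;

/** Toy model of cost-based selection among rewritten queries (all names hypothetical). */
class OptimiserSketch {

    /** One way of running the query, paired with an estimated cost. */
    static final class Candidate {
        final String sql;
        final double estimatedCost;

        Candidate(String sql, double estimatedCost) {
            this.sql = sql;
            this.estimatedCost = estimatedCost;
        }
    }

    /** Returns the candidate with the lowest estimated cost. */
    static Candidate pickCheapest(List<Candidate> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble((Candidate c) -> c.estimatedCost))
                .orElseThrow(() -> new IllegalArgumentException("no candidate queries"));
    }

    public static void main(String[] args) {
        // Invented costs; in practice they would come from the database planner.
        List<Candidate> candidates = List.of(
                // The original query, always kept as a candidate.
                new Candidate("SELECT g.* FROM gene g JOIN protein p ON p.gene_id = g.id", 1200.0),
                // Whole query replaced by a (hypothetical) precomputed table.
                new Candidate("SELECT * FROM precomp_gene_protein", 85.0),
                // Only part of the query replaced by a precomputed table.
                new Candidate("SELECT g.* FROM gene g JOIN precomp_protein p ON p.gene_id = g.id", 430.0));

        System.out.println(pickCheapest(candidates).sql);
    }
}
```

In this sketch the original query is always one of the candidates, so creating a new precomputed table only ever adds options; it cannot make the chosen plan worse according to the estimates.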

Precomputed tables can be created while the database system is running in production, so performance can be adapted to match actual usage. Any new template query can be precomputed immediately from the web application to ensure it runs quickly. This approach separates the definition of the data model from performance optimisation, making it possible to tailor the performance of an InterMine warehouse for types of queries not known at design time.

Object/Relational mapping

The InterMine ObjectStore is a custom object/relational mapping implementation designed for high query performance in a read-only database. It stores serialised copies of objects to minimise table reads when dealing with inheritance. All 'classes' in the object model are actually interfaces; the concrete classes are created dynamically at runtime, which allows the InterMine data model to handle multiple inheritance. A data model is defined at the object level by an XML file. Java objects, the relational database schema and all model-specific parts of the web application are generated automatically, reducing the maintenance overhead when data model changes are required.

Data are loaded as Java objects or as XML that conforms to the specified model. Integration of data from multiple sources is configured by defining how equivalent objects from different sources should be merged. As different data sources may provide different fields, multiple 'keys' can be defined for a particular type. For example, Gene objects may be merged according to an 'identifier' field or a 'symbol' field. A priority configuration system is used to resolve conflicts between data sources.
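The point made in this section that model 'classes' are interfaces implemented at runtime can be illustrated with a small sketch. It uses the JDK's `java.lang.reflect.Proxy` as a stand-in; this page does not say which mechanism InterMine itself uses, so treat the approach as an assumption. The `Gene` and `Transposon` interfaces and the `create` method are hypothetical and invented for the example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

/** Sketch: model types are interfaces; instances are generated at runtime. */
class DynamicModelSketch {

    // Hypothetical model interfaces, invented for this example.
    interface Gene {
        String getSymbol();
        void setSymbol(String symbol);
    }

    interface Transposon {
        Integer getLength();
        void setLength(Integer length);
    }

    /** Creates one object implementing every given interface, backed by a field map. */
    static Object create(Class<?>... interfaces) {
        Map<String, Object> fields = new HashMap<>();
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            if (method.getDeclaringClass() == Object.class) {
                // equals, hashCode and toString are also routed through the handler.
                switch (name) {
                    case "equals": return proxy == args[0];
                    case "hashCode": return System.identityHashCode(proxy);
                    default: return "Dynamic" + fields;
                }
            }
            if (name.startsWith("set") && args != null && args.length == 1) {
                fields.put(name.substring(3), args[0]); // setSymbol -> "Symbol"
                return null;
            }
            if (name.startsWith("get") && args == null) {
                return fields.get(name.substring(3));   // getSymbol -> "Symbol"
            }
            throw new UnsupportedOperationException(name);
        };
        return Proxy.newProxyInstance(
                DynamicModelSketch.class.getClassLoader(), interfaces, handler);
    }

    public static void main(String[] args) {
        // A single object that is both a Gene and a Transposon.
        Object item = create(Gene.class, Transposon.class);
        ((Gene) item).setSymbol("abc");
        ((Transposon) item).setLength(1500);

        System.out.println(item instanceof Gene && item instanceof Transposon);
        System.out.println(((Gene) item).getSymbol() + " " + ((Transposon) item).getLength());
    }
}
```

Because a Java class can implement any number of interfaces, an object built this way can belong to several model types at once, which single class inheritance would not allow.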

Biological data

InterMine can operate on any data model but we provide an extension specifically for handling biological data. This includes a core data model based on the Sequence Ontology and a series of sources, each of which includes Java code to parse data from a particular data format; for example, UniProtKB, protein interactions and GO annotations each have their own source. Supporting an established data standard makes it easier to incorporate future data into the database. For example, protein interaction data can be represented by the PSI-MI standard, and by supporting this standard in InterMine we can easily accommodate future data published in this format.

Data can also be loaded from an InterMine XML format, allowing the parsing code to be written in a language other than Java. Each source can add classes and fields to extend the data model if required (for example, the protein interaction source adds a ProteinInteraction class) and defines how the data should be integrated. Constructing an InterMine data warehouse (for example, FlyMine) means configuring which sources should be included and specifying the particular organisms or data files to include. This system reduces the development required to update FlyMine and add new data types. It also makes it possible to construct comparable data warehouses for different organisms and data sets.
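The integration that each source defines relies on the keys and priorities described under Object/Relational mapping. A minimal sketch of that idea follows, assuming a simplified model in which an object is a map of field values and priority is a single ordered list of source names. The `Item` class, the `equivalent` and `merge` methods, the source names and the field values are all hypothetical; InterMine's real configuration format is not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of key-based merging with source priorities (all names hypothetical). */
class IntegrationSketch {

    /** An object as delivered by one source: the source name plus its field values. */
    static final class Item {
        final String source;
        final Map<String, String> fields;

        Item(String source, Map<String, String> fields) {
            this.source = source;
            this.fields = fields;
        }
    }

    /** True if both items carry the same value for at least one key field. */
    static boolean equivalent(Item a, Item b, List<String> keys) {
        for (String key : keys) {
            String value = a.fields.get(key);
            if (value != null && value.equals(b.fields.get(key))) {
                return true;
            }
        }
        return false;
    }

    /** Position in the priority list; sources that are not listed rank last. */
    private static int rank(String source, List<String> priority) {
        int index = priority.indexOf(source);
        return index < 0 ? Integer.MAX_VALUE : index;
    }

    /** Merges two equivalent items; on a conflict the higher-priority source wins. */
    static Map<String, String> merge(Item a, Item b, List<String> priority) {
        boolean aWins = rank(a.source, priority) <= rank(b.source, priority);
        Item high = aWins ? a : b;
        Item low = aWins ? b : a;
        Map<String, String> merged = new TreeMap<>(low.fields);
        merged.putAll(high.fields); // higher-priority values overwrite conflicting ones
        return merged;
    }

    public static void main(String[] args) {
        // sourceB has no 'identifier', so the match is made on 'symbol'.
        Item fromA = new Item("sourceA",
                Map.of("identifier", "G0001", "symbol", "abc", "name", "name from source A"));
        Item fromB = new Item("sourceB",
                Map.of("symbol", "abc", "name", "name from source B", "length", "1500"));

        List<String> geneKeys = List.of("identifier", "symbol");
        List<String> priority = List.of("sourceA", "sourceB");

        if (equivalent(fromA, fromB, geneKeys)) {
            System.out.println(merge(fromA, fromB, priority));
        }
    }
}
```

Here the two items are matched on `symbol` because only one of them has an `identifier`, which is why more than one key per type is useful. The conflicting `name` is taken from the higher-priority source, while non-conflicting fields from both sources are kept.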

See: GettingStarted, InterMineOverview
