Loading Data

The InterMine system is designed to integrate data from multiple sources. It is a data warehouse, so it should be rebuilt from scratch each time the data need updating; there is no facility to curate or update data.

Sources and Mines

Each type of data to be loaded is defined as a 'source'. A source is a directory that contains everything needed to parse and integrate a particular type of data. Common sources are provided for several biological data types, and you can create your own. The sources to include, and the files they load, are configured in a project XML file in each Mine.
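
To make the idea of a project file concrete, the sketch below parses a small XML document that lists sources and the location of the files each one loads. The element names, attribute names, source names and paths are assumptions chosen for illustration; check them against the project XML file of a real Mine before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical project file: the element and attribute names, source names
# and paths below are illustrative assumptions, not a verified schema.
PROJECT_XML = """
<project>
  <sources>
    <source name="example-gff" type="gff">
      <property name="src.data.dir" location="/data/example/gff"/>
    </source>
    <source name="example-fasta" type="fasta">
      <property name="src.data.dir" location="/data/example/fasta"/>
    </source>
  </sources>
</project>
"""


def list_sources(xml_text):
    """Return (name, type, data location) for each source, in file order."""
    root = ET.fromstring(xml_text)
    result = []
    for source in root.iter("source"):
        prop = source.find("property")
        location = prop.get("location") if prop is not None else None
        result.append((source.get("name"), source.get("type"), location))
    return result


for name, source_type, location in list_sources(PROJECT_XML):
    print(f"{name}: type={source_type}, data={location}")
```

In this sketch each source entry pairs a name with a source type and a data directory, which mirrors the description above: the source says how to parse the data, and the project file says which sources to run and where their input files live.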

Data model

InterMine uses an object-based data model which can easily adapt to include new types of data. Each source can include extra fields or classes as 'additions' in a model XML format. The database schema and Java object code are automatically created from the model XML.
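
As a rough illustration of how 'additions' extend a model, the sketch below represents a model as a mapping from class name to fields and merges a source's additions into it. The dictionary representation, the `merge_additions` helper and the class and field names are hypothetical; InterMine itself works from model XML files and generates the schema and Java classes from them.

```python
def merge_additions(core_model, additions):
    """Return a new model: the core classes plus classes/fields from additions.

    Both arguments map a class name to a {field name: field type} dictionary.
    This is an illustrative stand-in for merging model XML, not InterMine code.
    """
    merged = {cls: dict(fields) for cls, fields in core_model.items()}
    for cls, fields in additions.items():
        target = merged.setdefault(cls, {})
        for field, field_type in fields.items():
            existing = target.get(field)
            if existing is not None and existing != field_type:
                raise ValueError(
                    f"{cls}.{field}: {existing} conflicts with {field_type}"
                )
            target[field] = field_type
    return merged


# Hypothetical core model and one source's additions.
core = {"Gene": {"primaryIdentifier": "String", "symbol": "String"}}
extra = {
    "Gene": {"length": "Integer"},              # extra field on an existing class
    "Protein": {"primaryAccession": "String"},  # entirely new class
}

model = merge_additions(core, extra)
for cls, fields in sorted(model.items()):
    print(cls, sorted(fields))
```

The merged model is the union of the core classes and every source's additions, so adding a source never requires editing the core model by hand.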

Data Integration

It is likely that data loaded from different sources will provide information about the same objects. For example, a GFF file may provide the genome location of a feature while a FASTA file provides its sequence. The data integration system makes sure that corresponding genes loaded from the two sources become the same gene and that any conflicts between the data are resolved.
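
The idea can be sketched as a merge keyed on an identifier, with a per-source priority order deciding conflicts. This is a simplified illustration under stated assumptions, not InterMine's implementation: the `integrate` function, the `primaryIdentifier` key, the source names and the record values are all hypothetical.

```python
def integrate(records, priority):
    """Merge (source, record) pairs that share a primary key.

    `priority` lists source names from most to least trusted. When two
    sources supply different values for the same field, the value from
    the more trusted source is kept.
    """
    rank = {source: i for i, source in enumerate(priority)}
    merged = {}
    owner = {}  # (key, field) -> source that supplied the current value
    for source, record in records:
        key = record["primaryIdentifier"]
        obj = merged.setdefault(key, {})
        for field, value in record.items():
            if value is None:
                continue
            previous = owner.get((key, field))
            if previous is None or rank[source] < rank[previous]:
                obj[field] = value
                owner[(key, field)] = source
    return merged


# Hypothetical records: the GFF source supplies a location, the FASTA source
# supplies a sequence, and the two disagree about the length.
records = [
    ("gff", {"primaryIdentifier": "gene-001",
             "location": "chr1:100-1300", "length": 1200}),
    ("fasta", {"primaryIdentifier": "gene-001",
               "sequence": "ATGC", "length": 1203}),
]

genes = integrate(records, priority=["fasta", "gff"])
print(genes["gene-001"])
```

Both records end up as a single gene object that carries the location from one source and the sequence from the other, and the conflicting length is taken from whichever source is listed first in the priority order.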


Tutorials

  • MalariaMine - build a data warehouse and web interface with example P. falciparum data
  • SourceHowto - guide to creating a new 'source' to load your own data
  • MineHowTo - overview of creating a new InterMine instance

Reference Documents

Creating a FlyMine mirror

See Also: ModelDescription, PrimaryKeys, PriorityConfig, ItemsXmlFormat, ItemsAPIJava, ItemsAPIPerl