Loading Data
The InterMine system is designed to integrate data from multiple sources. Because it is a data warehouse, it should be rebuilt from scratch each time the data need updating; there is no facility to curate or update data in place.
Sources and Mines
Each type of data to be loaded is defined as a 'source'. A source is a directory that contains everything needed to parse and integrate a particular type of data. Common sources are provided for several biological data types, and you can create your own. The sources to include, and the files they load, are configured in a project XML file in each Mine.
Data model
InterMine uses an object-based data model which can easily adapt to include new types of data. Each source can include extra fields or classes as 'additions' in a model XML format. The database schema and Java object code are automatically created from the model XML.
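As a rough illustration of deriving a database schema from a model description, the sketch below reads a small model XML fragment and emits one table definition per class. The XML element and attribute names, the type mapping, and the function tables_from_model are assumptions made for this example; it does not reproduce InterMine's own generator or its exact model format.

```python
import xml.etree.ElementTree as ET

# Hypothetical 'additions' fragment: one class with two attributes.
ADDITIONS_XML = """
<classes>
  <class name="Gene">
    <attribute name="symbol" type="java.lang.String"/>
    <attribute name="length" type="java.lang.Integer"/>
  </class>
</classes>
"""

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {
    "java.lang.String": "TEXT",
    "java.lang.Integer": "INTEGER",
}


def tables_from_model(xml_text):
    """Return one CREATE TABLE statement per <class> element."""
    root = ET.fromstring(xml_text.strip())
    statements = []
    for cls in root.findall("class"):
        # Every table gets a surrogate id column in this sketch.
        columns = ["id INTEGER PRIMARY KEY"]
        for attr in cls.findall("attribute"):
            sql_type = SQL_TYPES[attr.get("type")]
            columns.append(f"{attr.get('name')} {sql_type}")
        table = cls.get("name").lower()
        statements.append(f"CREATE TABLE {table} ({', '.join(columns)});")
    return statements


if __name__ == "__main__":
    for statement in tables_from_model(ADDITIONS_XML):
        print(statement)
```

The point of the sketch is only that the model XML is the single description from which other artefacts are generated, so adding a field to the XML is enough to change the generated schema.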
Data Integration
It is likely that data loaded from different sources will provide information about the same objects. For example, a GFF file may provide the genome location of a feature while a FASTA file provides its sequence. The data integration system makes sure that corresponding genes loaded from the two sources become the same gene and that any conflicts between the data are resolved.
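The idea can be sketched in a few lines: records from different sources are matched on a key field, merged into one object, and conflicting values are settled by a per-field source priority. This is a simplified sketch under stated assumptions, not InterMine's implementation; the names Record, PRIMARY_KEYS, PRIORITY and integrate, the key field, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str   # name of the source that produced the record
    cls: str      # model class, e.g. "Gene"
    fields: dict  # field name -> value


# Hypothetical configuration: which field identifies an object of each class,
# and which source wins when two sources disagree about a field.
PRIMARY_KEYS = {"Gene": "primaryIdentifier"}
PRIORITY = {"Gene.length": ["gff", "fasta"]}  # earlier in the list wins


def integrate(records):
    """Merge records that share a primary key into single objects."""
    merged = {}  # (class, key value) -> merged field dict
    origin = {}  # ((class, key value), field) -> source of current value
    for rec in records:
        key = (rec.cls, rec.fields[PRIMARY_KEYS[rec.cls]])
        obj = merged.setdefault(key, {})
        for name, value in rec.fields.items():
            if name not in obj:
                # First value seen for this field: just store it.
                obj[name] = value
                origin[(key, name)] = rec.source
            elif obj[name] != value:
                # Conflict: consult the priority list for this field.
                order = PRIORITY.get(f"{rec.cls}.{name}")
                if order is None:
                    raise ValueError(f"unresolved conflict on {rec.cls}.{name}")
                if order.index(rec.source) < order.index(origin[(key, name)]):
                    obj[name] = value
                    origin[(key, name)] = rec.source
    return merged


if __name__ == "__main__":
    records = [
        Record("gff", "Gene", {"primaryIdentifier": "GENE0001",
                               "location": "chr1:100-500", "length": 400}),
        Record("fasta", "Gene", {"primaryIdentifier": "GENE0001",
                                 "sequence": "ATGC", "length": 399}),
    ]
    print(integrate(records))
```

In this example both records describe the same gene, so the result is one object carrying the location from the GFF record and the sequence from the FASTA record; the disagreement over length is resolved in favour of the GFF source because it is listed first in the priority for that field.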
Tutorials
- MalariaMine - build a data warehouse and web interface with example P. falciparum data
- SourceHowto - guide to creating a new 'source' to load your own data
- MineHowTo - overview of creating a new InterMine instance
Reference Documents
- DataIntegration - overview of how data integration works
- Sources
  - BioSources - a list of the common biological sources available (needs work)
  - AnatomyOfASource - the directory structure and common files found in all sources
  - SourceHowto - guide to creating a new source
- Mines
  - AnatomyOfAMine - the directory structure and common files found in all mines
  - ProjectXMLFormat - specify what to load into a mine
  - Properties files needed for loading data
  - MineHowTo - guide to setting up a new mine
- ChadoDBSource - creating an InterMine warehouse from a chado database
- BioDataModel - about the core model
- SourceProjectProperties
- RunningABuild - how to manage data loading
- SourceTypes - types of sources available (database, flat file, DAG, Items XML, GFF)
- DataLoadingPerformance - how to make data loading faster
Creating a FlyMine mirror
- MirrorHowTo - set up a mirror of FlyMine
- FlyMineOwnData - set up a mirror of FlyMine and integrate your own data
See Also: ModelDescription, PrimaryKeys, PriorityConfig, ItemsXmlFormat, ItemsAPIJava, ItemsAPIPerl
