InterMine Bio sources
This page lists the current sources available for use in InterMine. All the sources here are found in bio/sources. Look at flymine/project.xml for examples of how to use them.
You can also add your own sources to load custom file formats; see How to create a source for more information. In addition, the tutorial contains detailed steps on creating sources for a variety of different data formats.
Most of the configuration done in the config files is optional: if no config entry exists, the default behaviour is followed. For instance, if no chromosomes are specified in ensembl_config.properties, then all chromosomes are processed. There are exceptions to this rule, however.
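The "no entry means no restriction" rule can be illustrated with a minimal Python sketch. This is not InterMine's own code: the function names and the `chromosomes` property key are hypothetical, chosen only to show how an absent or empty entry falls back to processing everything.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def chromosomes_to_process(available, props, key="chromosomes"):
    """Return the chromosomes to load.

    `key` is a hypothetical property name. If it is missing or empty,
    every available chromosome is returned (the default behaviour).
    """
    raw = props.get(key, "")
    configured = {c.strip() for c in raw.split(",") if c.strip()}
    if not configured:
        return list(available)
    # Keep the original order of `available`, restricted to the configured set.
    return [c for c in available if c in configured]
```

Calling `chromosomes_to_process(["2L", "2R", "X"], {})` returns all three chromosomes, while a config containing `chromosomes = X, 2L` restricts the result to those two.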
Common sources
These are commonly used sources that you may want to use to load data into your own InterMine instance.
| source | data | model | config | project.xml |
| --- | --- | --- | --- | --- |
| biogrid | Loads genetic and protein interaction data from BioGRID. These data include both high-throughput studies and conventional focused studies and have been curated from the literature. | genes, proteins, interactions | biogrid_config.properties - which gene identifiers are populated. | All files specified in project.xml will be processed. |
| biopax | Loads data from files in BioPAX level 2 format. | pathways, proteins | biopax_config.properties - which gene identifiers are populated. | organisms, datasource name, curated (true or false) |
| chado-db | Loads data from a Chado database. | Any chado features, eg. chromosome location, genes, proteins. | none | See chado-db docs |
| ensembl | Loads Ensembl genomes from a downloaded MySQL database. | chromosomes, genes, transcripts, exons, protein sequences, CDSs | ensembl_config.properties - which chromosomes are processed. | See Ensembl |
| entrez-organism | All other sources refer to organisms only by their NCBI taxonomy id. This source should be included at the end of the build. It will select the taxonIds loaded into the Organism class, fetch details via the Entrez web service and fill in the organism names in the database. | updates fields for organism created by other sources | none | |
| fasta | Loads features and their sequences. Creates a feature for each entry in a FASTA file and sets its sequence; the class of feature to create is set once for the whole file. | protein, sequence | none | See FASTA |
| gff | This isn't a source itself but genome annotation from gff files can be loaded easily by creating a new source of type gff. See redfly, malaria-gff and tiffin for examples. | features | Configuration is added to the project.properties file and an optional handler can be added to deal with data in the attributes section of the gff file. | See GFF |
| go-annotation | Loads gene association files that link GO terms to genes or proteins. | genes, go terms | go-annotation_config.properties - which gene identifiers are populated. Note: currently the gene_association file must be sorted by the 'product' (gene/protein/etc.) identifier in column 2. This saves memory while loading. If your file isn't already sorted by column 2, you can sort it with: sort -k 2,2 input_file > output_file | |
| go | Load the Gene Ontology term ids, names and definitions, and the relationships between terms. Should be loaded if the go-annotation source is used. | go terms | none | |
| intact | Load interactions data from IntAct. | genes, interactions | psi-intact_config.properties - which gene identifiers are set | organisms - If none are configured, all interactions are stored. |
| intermine-items-xml-file | Use this source to load Items XML conforming to the data model directly into the production database. | any | none | |
| intermine-items-large-xml-file | Use this source to load Items XML conforming to the data model into the production database. It uses an intermediate database so that it can cope with very large files that would otherwise cause memory problems. | any | none | |
| interpro | Loads protein domain data from InterPro. | protein domains | none | |
| kegg-pathway | Links genes to the KEGG pathways that they operate in. | pathways, genes | kegg_config.properties - which gene identifier fields are populated, mapping from organism taxonId to abbreviation | Only taxonIds specified in project.xml are downloaded. If no taxonIds are configured, all are loaded. |
| pdb | Loads protein structure data from PDB. | proteins, protein structures | none | Each structure is in its own PDB file. PDB files don't contain a taxonId, so files must live in taxonId-specific directories, e.g. /7227. The converter processes the directories specified in project.xml; if none are configured, it processes all directories. |
| psi-mi-ontology | Include this source when loading psi data to fill in details of ontology terms used. Loads the psi-mi.obo file. | ontology terms | none | |
| pubmed | Loads publication data from PubMed. | publications | none | The entire file is downloaded, but only entries for taxonIds set in project.xml are loaded. If none are configured, all entries are processed. |
| so | This source loads no data but adds a class in the data model for every term in SOFA - a reduced version of the sequence ontology. SO terms represent biological features such as gene, exon, 3' UTR. The version of SOFA we use is a little out of date. You should include it if you are loading genome annotation. | see the SO file for complete list. | none | |
| snp | Loads SNP data from a downloaded MySQL database. | SNPs, chromosomes | none | See Ensembl |
| treefam | Loads homologue data from Treefam. | genes, homologues | treefam_config.properties - which gene identifiers are set | geneFile - location of gene file, organisms - taxonIds of organisms to process |
| uniprot | Loads protein information from UniProtKB XML files. | protein sequences, lengths and molecular weights, links between proteins and genes, features located on proteins, UniProt keywords, references and comments. This source can optionally load InterPro protein domains and GO terms | uniprot_config.properties - where to get identifiers for genes and which field should be unique | See UniProt |
| uniprot-fasta | Loads sequences for proteins loaded in UniProt source | sequences | none | See: UniProt |
| uniprot-keywords | Loads definitions of UniProtKB keywords. Should be included if loading data from uniprot. | updates uniprot keyword definitions | none | See UniProt |
| update-publications | All publications are referred to by PubMed id by other sources. This source should be included at the end of the build. It will query all PubMed ids from the database, fetch details from the Entrez web service and fill in Publication objects. | updates publications loaded by other sources | none | |
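The go-annotation row requires the gene association file to be sorted by the identifier in column 2. Where the `sort` command is not available, the same ordering can be sketched in Python. This is an illustration only, under stated assumptions: the file is tab-delimited, header or comment lines begin with `!`, and the whole file fits in memory. The function names are hypothetical and not part of InterMine.

```python
def product_id(line):
    """Return the column 2 value of a tab-delimited line ('' if absent)."""
    cols = line.rstrip("\n").split("\t")
    return cols[1] if len(cols) > 1 else ""


def sort_by_product_id(lines):
    """Sort annotation lines by column 2, keeping '!' header lines first.

    Python's sort is stable, so lines that share an identifier keep
    their original relative order.
    """
    header = [ln for ln in lines if ln.startswith("!")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("!")]
    records.sort(key=product_id)
    return header + records


def sort_file(input_path, output_path):
    """Read input_path, write the sorted lines to output_path."""
    with open(input_path, encoding="utf-8") as src:
        lines = src.readlines()
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.writelines(sort_by_product_id(lines))
```

Because this version reads the entire file at once, the `sort` command is likely the better choice for very large annotation files.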
Specific sources
These sources load Drosophila-specific data sets into FlyMine. We don't expect you to re-use them unless you are creating a Drosophila warehouse. All of these sources are located in bio/sources/flymine.
- affy-probes
- anoest
- anopheles-identifiers
- anoph-expr
- arbeitman-items-xml
- bdgp-clone
- bdgp-insitu
- drosdel-gff
- drosophila-homology
- fly-anatomy-ontology
- flyatlas
- fly-development-ontology
- fly-fish
- fly-misc-cvterms
- flyreg
- flyrnai-screens
- homophila
- long_oligo
- miranda
- protein_structure
- redfly
- rnai
- tiffin
- tiffin-expression
- tiling_path
See FlyMine for more information about these datasets. Look at flymine/project.xml for examples of how to use these sources.
Other loaders
These loaders were written by InterMine users.
- CHEBI
- Disease ontology
- haem-atlas
- HGNC
- HuGE
- Mammalian phenotypes
- pride
- stitch
