InterMine Bio sources
This page lists the sources currently available in InterMine, that is, the data formats InterMine can already load. You can also add your own sources to load custom file formats; see How to create a source for more information.
All the sources here are found in bio/sources. Look at flymine/project.xml for examples of how to use them. The specific classes added to the model by each source are specified in bio/sources/<sourcename>/<sourcename>_additions.xml.
Common sources
These are commonly used sources that you may want to use to load data into your own InterMine instance.
chado-db
Loads data from a Chado database. See chado-db docs for more.
ensembl
Loads Ensembl genomes from a downloaded MySQL database.
entrez-organism
All other sources refer to organisms only by their NCBI taxonomy id. This source should be included at the end of the build. It will select the taxonIds loaded into the Organism class, fetch details via the Entrez web service and fill in the organism names in the database.
fasta
Loads features and their sequences. Creates one feature for each entry in a FASTA file and sets its sequence. The class of feature to create is set once for the whole file.
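The behaviour described for the fasta source can be illustrated with a minimal Python sketch. It is not InterMine's loader: the `Feature` record, the `read_fasta` function and the choice of the first header word as identifier are all assumptions made for illustration. It only shows the idea of one feature per entry, with a single class name applied to the whole file.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Feature:
    """Hypothetical record; not an InterMine class."""
    class_name: str   # one value for the whole file, e.g. "Chromosome"
    identifier: str   # assumption: first word of the FASTA header line
    sequence: str     # residues joined across wrapped lines


def read_fasta(lines: Iterable[str], class_name: str) -> Iterator[Feature]:
    """Yield one Feature per FASTA entry, all of the same class."""
    identifier = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if identifier is not None:
                yield Feature(class_name, identifier, "".join(chunks))
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield Feature(class_name, identifier, "".join(chunks))


# Made-up two-entry file, used only to exercise the sketch.
example = """>2L example chromosome arm
ACGTAC
GGTA
>2R
TTGACA
""".splitlines()

features = list(read_fasta(example, "Chromosome"))
for feature in features:
    print(feature.class_name, feature.identifier, len(feature.sequence))
```

Because the class name is an argument to the whole read rather than a property of each entry, a file mixing feature types would need to be split before loading under this model.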
gff
This isn't a source itself, but genome annotation from GFF files can be loaded easily by creating a new source of type gff. See redfly, malaria-gff and tiffin for examples. Configuration is added to the project.properties file, and an optional handler can be added to deal with data in the attributes section of the GFF file.
go-annotation
Loads gene association files that link GO terms to genes or proteins. This source creates an association for each parent of the term actually assigned. The Gene.goAnnotation collection contains the terms actually applied to the gene; Gene.allGoAnnotation also includes all the parents of those terms. So, to find all genes that have a particular GO term or any of its children, query Gene.allGoAnnotation.
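The difference between the two collections can be sketched in Python as follows. This is an illustration of the idea only, not the loader's code. The toy ontology, its term ids ("GO:A" and so on) and the gene names are invented, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: term -> direct parents (not real GO ids).
PARENTS: Dict[str, Set[str]] = {
    "GO:C": {"GO:B"},
    "GO:B": {"GO:A"},
    "GO:D": {"GO:A"},
    "GO:A": set(),
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every parent of `term`, direct or indirect."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def all_go_annotation(direct: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Directly assigned terms plus all of their parents."""
    result = set(direct)
    for term in direct:
        result |= ancestors(term, parents)
    return result


def genes_with_term(term: str, annotation: Dict[str, Set[str]]) -> List[str]:
    return sorted(gene for gene, terms in annotation.items() if term in terms)


# Plays the role of Gene.goAnnotation: only the terms actually assigned.
go_annotation = {"gene1": {"GO:C"}, "gene2": {"GO:D"}}
# Plays the role of Gene.allGoAnnotation: assigned terms and their parents.
all_annotation = {
    gene: all_go_annotation(terms, PARENTS) for gene, terms in go_annotation.items()
}

print(genes_with_term("GO:B", go_annotation))   # [] : GO:B was never assigned directly
print(genes_with_term("GO:B", all_annotation))  # ['gene1'] : found via child GO:C
print(genes_with_term("GO:A", all_annotation))  # ['gene1', 'gene2']
```

Storing the parent associations up front is what lets a single lookup on the "all" collection answer "this term or any of its children" without walking the ontology at query time.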
go
Loads the Gene Ontology term ids, names and definitions, and the relationships between terms. This source should be loaded if the go-annotation source is used.
intermine-items-xml-file
Use this source to load Items XML conforming to the data model directly into the production database.
intermine-items-large-xml-file
Use this source to load Items XML conforming to the data model into the production database. It uses an intermediate database so that it can cope with very large files that would otherwise cause memory problems.
kegg-pathway
Link genes to KEGG pathways that they operate in.
orthologue
Load InParanoid orthologue sqltable files. This will create Orthologue objects to link genes/translations between organisms and Paralogues within an organism.
psi
Loads data from the PSI molecular interactions format, version 2.5. This format is used to represent protein-protein interactions; files are available from, for example, IntAct. A ProteinInteraction object is created for each interaction.
psi-mi-ontology
Include this source when loading psi data to fill in details of the ontology terms used. Loads the psi-mi.obo file.
so
This source loads no data but adds a class to the data model for every term in SOFA, a reduced version of the Sequence Ontology. SO terms represent biological features such as gene, exon and 3' UTR. The version of SOFA we use is a little out of date. Include this source if you are loading genome annotation.
uniprot
Load protein information from UniProtKB XML files. UniProt data loaded includes: protein sequences, lengths and molecular weights, links between proteins and genes, features located on proteins, UniProt keywords, references and comments. This source can optionally load InterPro protein domains.
uniprot-keywords
Loads definitions of UniProtKB keywords. Include this source if you are loading data from uniprot.
update-publications
All publications are referred to by PubMed id by other sources. This source should be included at the end of the build. It will query all PubMed ids from the database, fetch details from the Entrez web service and fill in Publication objects.
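The "collect ids, then fetch details" step can be sketched in Python as below. This is an assumption-laden illustration, not the source's implementation. It builds request URLs for the NCBI E-utilities `esummary` endpoint, but the page does not say which Entrez call the source actually makes. The batch size, the helper names and the PubMed ids are made up, and no network request is sent.

```python
from typing import Iterable, Iterator, List
from urllib.parse import urlencode

# NCBI E-utilities summary endpoint; assumed here, not confirmed by this page.
ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def batches(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """De-duplicate the ids and yield them in fixed-size groups."""
    unique = sorted(set(ids), key=int)
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


def esummary_url(pubmed_ids: List[str]) -> str:
    """Build one request URL covering a whole batch of PubMed ids."""
    query = urlencode({"db": "pubmed", "id": ",".join(pubmed_ids)})
    return ESUMMARY + "?" + query


# Made-up ids standing in for the PubMed ids queried from the database.
ids_in_database = ["103", "101", "102", "101"]

urls = [esummary_url(batch) for batch in batches(ids_in_database, size=2)]
for url in urls:
    print(url)
```

Running the step once at the end of the build means each publication is requested a single time, however many sources referred to it.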
Specific sources
These sources load Drosophila-specific data sets into FlyMine. We don't expect you to reuse them unless you are creating a Drosophila warehouse.
- flyatlas
- flyreg
- flyrnai-screens
- homophila
- long_oligo
- protein_structure
- redfly
- rnai
- tiffin
- tiling_path
- drosdel
- region
- malaria-gff
Retired Sources
These sources are due to be removed or replaced. We don't recommend that you use them.
- mage
- flybase-gff
- interpro
