Last modified on 12/01/12 11:35:22

InterMine Bio sources

This page lists the current sources available for use in InterMine. All the sources here are found in bio/sources. Look at flymine/project.xml for examples of how to use them.

You can also add your own sources to load custom file formats; see "How to create a source" for more information. In addition, the tutorial contains detailed steps on creating sources for a variety of data formats.

Most of the configuration in the config files is optional: if no config entry exists, the default behaviour is followed. For instance, if no chromosomes are specified in ensembl_config.properties, then all chromosomes are processed. There are exceptions to this rule, however.
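The "no entry means process everything" rule can be pictured with a small sketch. This is not InterMine code: the function names and the `chromosomes` property key are hypothetical, chosen only to illustrate the fallback, and the real key names in ensembl_config.properties may differ.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def chromosomes_to_process(props, available, key="chromosomes"):
    """Return the configured subset, or everything when nothing is configured.

    `key` is a hypothetical property name used for illustration only.
    """
    configured = [c.strip() for c in props.get(key, "").split(",") if c.strip()]
    if not configured:
        # No config entry: fall back to the default and process all chromosomes.
        return list(available)
    wanted = set(configured)
    return [c for c in available if c in wanted]
```

With an empty config, `chromosomes_to_process` returns every available chromosome; with an entry such as `chromosomes = 2L, X`, it returns only those two.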

Common sources

These are commonly used sources that you may want to use to load data into your own InterMine instance.

| source | data | model | config | project.xml |
|---|---|---|---|---|
| biogrid | Loads genetic and protein interaction data from BioGRID. These data include both high-throughput studies and conventional focused studies and have been curated from the literature. | genes, proteins, interactions | biogrid_config.properties - which gene identifiers are populated. | All files specified in project.xml will be processed. |
| biopax | Loads data from files in BioPAX level 2 format. | pathways, proteins, genes | biopax_config.properties - which gene identifiers are populated. | See Reactome |
| chado-db | Loads data from a Chado database. | Any chado features, e.g. chromosome location, genes, proteins. | none | See chado-db docs |
| ensembl | Loads Ensembl data from a downloaded MySQL database, or accesses it via a script using their API. | chromosomes, genes, transcripts, exons, protein sequences, CDSs, SNPs | ensembl_config.properties - which chromosomes are processed. | See Ensembl |
| ensembl-compara | Loads Ensembl Compara data from public BioMart. | homologues | | See EnsemblCompara |
| entrez-organism | All other sources refer to organisms only by their NCBI taxonomy id. This source should be included at the end of the build. It will select the taxonIds loaded into the Organism class, fetch details via the Entrez web service and fill in the organism names in the database. | updates fields for organism created by other sources | none | |
| fasta | Loads features and their sequences. Creates a feature for each entry in a fasta file and sets the sequence; the class of the feature to create is set for the whole file. | protein, sequence | none | See FASTA |
| gff | This isn't a source itself, but genome annotation from gff files can be loaded easily by creating a new source of type gff. See redfly, malaria-gff and tiffin for examples. | features | Configuration is added to the project.properties file and an optional handler can be added to deal with data in the attributes section of the gff file. | See GFF |
| go-annotation | Loads gene association files that link GO terms to genes or proteins. | genes, go terms | go-annotation_config.properties - determines type of object annotated and which identifiers are created. | Note: Currently we require that the gene_association file is sorted by the 'product' (gene/protein/etc) identifier in column 2. This is to save memory while loading. If your file isn't already sorted by column 2 you can use this command: sort -k 2,2 input_file > output_file |
| go | Loads the Gene Ontology term ids, names and definitions, and the relationships between terms. Should be loaded if the go-annotation source is used. | go terms | none | |
| intact | Loads interactions data from IntAct. | genes, interactions | psi-intact_config.properties - which gene identifiers are set | organisms - If none are configured, all interactions are stored. |
| intermine-items-xml-file | Use this source to load Items XML conforming to the data model directly into the production database. | any | none | |
| intermine-items-large-xml-file | Use this source to load Items XML conforming to the data model into the production database. It uses an intermediate database so that it can cope with very large files that would otherwise cause memory problems. | any | none | |
| interpro | Data from InterPro. | protein domains | none | |
| kegg-orthologues | Loads homologues from KEGG. | homologues, genes | kegg_config.properties - which gene identifier fields are populated, mapping from organism taxonId to abbreviation | Only taxonIds specified in the project.xml file are downloaded. If no taxonIds are configured, all are loaded. |
| kegg-pathway | Links genes to the KEGG pathways that they operate in. | pathways, genes | kegg_config.properties - which gene identifier fields are populated, mapping from organism taxonId to abbreviation | Only taxonIds specified in the project.xml file are downloaded. If no taxonIds are configured, all are loaded. |
| miranda | miRBase Targets from the Sanger Institute. | MiRNATargets, MRNAs, genes | Configuration is added to the project.properties file and an optional handler can be added to deal with data in the attributes section of the gff file. | See GFF |
| pdb | Data from PDB. | proteins, protein structures | none | Each structure is its own PDB file. A PDB file doesn't contain a taxonId, so files must live in taxonId-specific directories, e.g. /7227. The converter processes the directories specified in project.xml. If nothing is configured, it processes all directories. |
| psi-mi-ontology | Include this source when loading psi data to fill in details of the ontology terms used. Loads the psi-mi.obo file. | ontology terms | none | |
| pubmed | Data from PubMed. | publications | none | The entire file is downloaded, but only taxonIds set in project.xml will be loaded. If nothing is configured, all entries are processed. |
| reactome | Flat file from Reactome mart. | pathways and genes | none | See Reactome |
| so | This source loads no data but adds a class in the data model for every term in SOFA - a reduced version of the sequence ontology. SO terms represent biological features such as gene, exon, 3' UTR. The version of SOFA we use is a little out of date. You should include it if you are loading genome annotation. | See the SO file for the complete list. | none | |
| snp | Loads SNP data from a downloaded MySQL database. | SNPs, chromosomes | none | See Ensembl |
| treefam | Loads homologue data from TreeFam. See TreeFam. | genes, homologues | treefam_config.properties - which gene identifiers are set | geneFile - location of gene file, organisms - taxonIds or organisms to process. See TreeFam |
| uniprot | Loads protein information from UniProtKB XML files. | protein sequences, lengths and molecular weights, links between proteins and genes, features located on proteins, UniProt keywords, references and comments. This source can optionally load InterPro protein domains and GO terms. | uniprot_config.properties - where to get identifiers for genes and which field should be unique | See UniProt |
| uniprot-fasta | Loads sequences for proteins loaded in the UniProt source. | sequences | none | See UniProt |
| uniprot-keywords | Loads definitions of UniProtKB keywords. Should be included if loading data from uniprot. | updates uniprot keyword definitions | none | See UniProt |
| update-publications | All publications are referred to by PubMed id by other sources. This source should be included at the end of the build. It will query all PubMed ids from the database, fetch details from the Entrez web service and fill in Publication objects. | updates publications loaded by other sources | none | |
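The go-annotation entry above asks for the gene association file to be sorted by the identifier in column 2. As an alternative to the `sort` command, the same preparation can be sketched in Python. This is a minimal illustration, not part of InterMine: the function names are hypothetical, it assumes tab-separated columns and that comment lines start with `!` (as in the GAF format), and it orders by Python string comparison, which may differ from the locale-dependent ordering of `sort`.

```python
def product_id(line):
    """Return the identifier in column 2 of a tab-separated line ('' if absent)."""
    parts = line.rstrip("\n").split("\t")
    return parts[1] if len(parts) > 1 else ""


def sort_by_product_id(lines):
    """Keep '!' comment lines first, then data lines sorted by column 2.

    The sort is stable, so rows for the same identifier keep their
    original relative order and end up adjacent to each other.
    """
    header = [ln for ln in lines if ln.startswith("!")]
    data = [ln for ln in lines if ln.strip() and not ln.startswith("!")]
    data.sort(key=product_id)
    return header + data
```

To use it, read the input file into a list of lines, pass the list to `sort_by_product_id`, and write the result to a new file. Unlike the `sort` command, this sketch holds the whole file in memory, so it is only suitable for files that fit comfortably in RAM.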

Specific sources

These sources load Drosophila-specific data sets into FlyMine. We don't expect you to re-use them unless you are creating a Drosophila warehouse. All of these sources are located in bio/sources/flymine.

  • affy-probes
  • anoest
  • anopheles-identifiers
  • anoph-expr
  • arbeitman-items-xml
  • bdgp-clone
  • bdgp-insitu
  • drosdel-gff
  • drosophila-homology
  • fly-anatomy-ontology
  • flyatlas
  • fly-development-ontology
  • fly-fish
  • fly-misc-cvterms
  • flyreg
  • flyrnai-screens
  • homophila
  • long_oligo
  • protein_structure
  • redfly
  • rnai
  • tiffin
  • tiffin-expression
  • tiling_path

See FlyMine for more information about these datasets. Look at flymine/project.xml for examples of how to use these sources.

Other loaders

These loaders were written by InterMine users.

  • CHEBI
  • Disease ontology
  • haem-atlas
  • HGNC
  • HuGE
  • Mammalian phenotypes
  • pride
  • stitch