Last modified on 02/11/11 12:01:09

UniProt

This page describes the source for loading UniProt data.

1 Data

This source loads data from the UniProt website.

The UniProt source expects the data files to be split by organism and named with the taxon ID as a prefix:

TAXONID_uniprot_sprot.xml
TAXONID_uniprot_trembl.xml
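
As a minimal sketch of this naming convention, the expected file names for a set of taxon IDs can be listed as follows. The helper name expected_uniprot_files is hypothetical and is not part of InterMine.

```python
def expected_uniprot_files(taxon_ids):
    """Return the file names the naming convention implies for each taxon ID."""
    names = []
    for taxon_id in taxon_ids:
        # Two files per taxon ID: the sprot file and the trembl file.
        names.append(f"{taxon_id}_uniprot_sprot.xml")
        names.append(f"{taxon_id}_uniprot_trembl.xml")
    return names


print(expected_uniprot_files(["7227", "10116"]))
```

A check like this can be useful before a build, to confirm that every configured organism has its files in the data directory.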

2 Config

2.1 uniprot_config.properties

2.1.1 gene identifier fields

You can specify which gene fields are assigned when UniProt data is loaded. An example entry:

10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
10116.secondaryIdentifier.dbref = Ensembl
10116.symbol.name = primary

The format for the file is:

<TAXON_ID>.<IDENTIFIER_FIELD> = <VALUE>
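
To make the format concrete, here is a rough sketch that groups such lines by taxon ID. It is only an illustration of the file layout, not how InterMine itself reads the file; the function name parse_gene_identifier_config is hypothetical.

```python
from collections import defaultdict


def parse_gene_identifier_config(text):
    """Group `<TAXON_ID>.<FIELD> = <VALUE>` lines by taxon ID."""
    config = defaultdict(dict)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments and anything that is not a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        taxon_id, _, field = key.strip().partition(".")
        if not taxon_id.isdigit():
            continue  # not a per-organism entry, e.g. feature.types
        config[taxon_id][field] = value.strip()
    return dict(config)


EXAMPLE = """\
10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
10116.secondaryIdentifier.dbref = Ensembl
10116.symbol.name = primary
"""

rat = parse_gene_identifier_config(EXAMPLE)["10116"]
print(rat["primaryIdentifier.dbref"])
```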

An example UniProt entry for rat (taxon ID 10116): http://www.uniprot.org/uniprot/Q923K9.xml

The second line of that configuration, 10116.primaryIdentifier.dbref = RGD, sets gene.primaryIdentifier to the id attribute of the RGD dbReference (here 619834):

	<dbReference type="RGD" id="619834" key="33">
	<property type="gene designation" value="Acf"/>
	</dbReference>

The third line, 10116.secondaryIdentifier.dbref = Ensembl, sets gene.secondaryIdentifier to the id attribute of the Ensembl dbReference (here ENSRNOG00000033195):

	<dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
	<property type="organism name" value="Rattus norvegicus"/>
	</dbReference>

The last line, 10116.symbol.name = primary, sets gene.symbol to the text of the <name> element whose type is primary (here A1cf):

	<gene>
	<name type="primary">A1cf</name>
	<name type="synonym">Acf</name>
	<name type="synonym">Asp</name>
	</gene>

The values for symbol.name can be primary, ORF or ordered locus.
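
The mapping described in this section can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not the converter InterMine uses: the function name resolve_gene_fields is hypothetical, the <entry> wrapper is assembled by hand from the excerpts on this page, and a real UniProt entry is namespaced and much larger.

```python
import xml.etree.ElementTree as ET

# Hand-assembled fragment using only the excerpts shown on this page.
ENTRY = """
<entry>
  <gene>
    <name type="primary">A1cf</name>
    <name type="synonym">Acf</name>
    <name type="synonym">Asp</name>
  </gene>
  <dbReference type="RGD" id="619834" key="33">
    <property type="gene designation" value="Acf"/>
  </dbReference>
  <dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
    <property type="organism name" value="Rattus norvegicus"/>
  </dbReference>
</entry>
"""

# The rat settings from the example entry, without the taxon ID prefix.
CONFIG = {
    "primaryIdentifier.dbref": "RGD",
    "secondaryIdentifier.dbref": "Ensembl",
    "symbol.name": "primary",
}


def resolve_gene_fields(entry, config):
    """Map each configured gene field to the value found in the entry."""
    gene = {}
    for key, wanted in config.items():
        field, _, source = key.partition(".")
        if source == "dbref":
            # Take the id attribute of the first dbReference of that type.
            node = entry.find(f"dbReference[@type='{wanted}']")
            gene[field] = node.get("id") if node is not None else None
        elif source == "name":
            # Take the text of the gene name with that type.
            node = entry.find(f"gene/name[@type='{wanted}']")
            gene[field] = node.text if node is not None else None
    return gene


print(resolve_gene_fields(ET.fromstring(ENTRY), CONFIG))
```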

2.1.2 protein feature types

You can also configure which protein features to load.

  • to load specific feature types only, specify them like so:
    # in uniprot_config.properties
    feature.types = helix
    
  • to load NO feature types:
    # in uniprot_config.properties
    feature.types = NONE
    
  • to load ALL feature types, do not specify any feature types: remove the feature.types line from this config file. Loading all feature types is the default behaviour.
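
The three cases above amount to a small decision rule, sketched here for illustration. The function name should_load_feature is hypothetical, and the comma separator for several types is an assumption: this page only shows a single type.

```python
def should_load_feature(feature_type, feature_types_value=None):
    """Decide whether a feature type is loaded for a given feature.types value.

    feature_types_value is None when the line is absent from the config file.
    """
    if feature_types_value is None:
        return True  # default behaviour: load all feature types
    value = feature_types_value.strip()
    if value == "NONE":
        return False  # load no feature types
    # ASSUMPTION: several types are comma-separated; the page shows only one.
    allowed = {t.strip() for t in value.split(",") if t.strip()}
    return feature_type in allowed


print(should_load_feature("helix", "helix"))
```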

2.2 project.xml

    <!-- FlyMine's uniprot entry in the project.xml file -->
    <source name="uniprot" type="uniprot" >
      <property name="uniprot.organisms" value="7227 7237 6239 7165 7460 4932 9606 10090"/>
      <property name="src.data.dir" location="/data/uniprot/current/split"/>
      <property name="createinterpro" value="true"/>
      <property name="creatego" value="true"/>
    </source>

3 fasta

    <!-- FlyMine's uniprot fasta entry in the project.xml file -->
    <source name="uniprot-fasta" type="fasta">
      <property name="fasta.taxonId" value="7227 7237 6239 7165 7460 4932 9606 10090"/>
      <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
      <property name="fasta.classAttribute" value="primaryAccession"/>
      <property name="fasta.dataSetTitle" value="UniProt data set"/>
      <property name="fasta.dataSourceName" value="UniProt"/>
      <property name="src.data.dir" location="/data/uniprot/current"/>
      <property name="fasta.includes" value="uniprot_sprot_varsplic.fasta"/>
      <property name="fasta.sequenceType" value="protein" />
      <property name="fasta.loaderClassName"
                value="org.intermine.bio.dataconversion.UniProtFastaLoaderTask"/>
    </source>

