UniProt
This page describes the source for loading UniProt data.
1 Data
This source loads data from the UniProt website.
The UniProt source expects the data files to follow this naming convention, where TAXONID is the organism's NCBI taxon ID:
TAXONID_uniprot_sprot.xml
TAXONID_uniprot_trembl.xml
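As a minimal sketch of that naming convention, the helper below builds the two file names the source would look for, given a taxon ID. The function name expected_uniprot_files is hypothetical and is not part of InterMine.

```python
def expected_uniprot_files(taxon_id):
    """Return the Swiss-Prot and TrEMBL file names expected for one taxon ID.

    Hypothetical helper for illustration only; it just applies the
    TAXONID_uniprot_sprot.xml / TAXONID_uniprot_trembl.xml pattern.
    """
    taxon = str(taxon_id)
    return [f"{taxon}_uniprot_sprot.xml", f"{taxon}_uniprot_trembl.xml"]


# Example: the file names for rat (taxon ID 10116)
print(expected_uniprot_files(10116))
```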
2 Config
2.1 uniprot_config.properties
2.1.1 gene identifier fields
You can specify which gene fields are assigned when UniProt data is loaded. An example entry:
10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
10116.secondaryIdentifier.dbref = Ensembl
10116.symbol.name = primary
The format for the file is:
<TAXON_ID>.<IDENTIFIER_FIELD> = <VALUE>
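To make the key format concrete, here is a rough sketch that groups such entries by taxon ID. It is not how InterMine reads the file; parse_gene_config is a hypothetical name, and the sketch assumes one entry per line with '#' marking comments.

```python
from collections import defaultdict

# The rat example from above, one entry per line.
EXAMPLE = """\
10116.uniqueField = primaryIdentifier
10116.primaryIdentifier.dbref = RGD
10116.secondaryIdentifier.dbref = Ensembl
10116.symbol.name = primary
"""


def parse_gene_config(text):
    """Group '<TAXON_ID>.<IDENTIFIER_FIELD> = <VALUE>' lines by taxon ID."""
    config = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        taxon_id, _, field = key.strip().partition(".")
        if not taxon_id.isdigit():
            continue  # not a per-taxon entry, e.g. feature.types
        config[taxon_id][field] = value.strip()
    return dict(config)


rat = parse_gene_config(EXAMPLE)["10116"]
print(rat["primaryIdentifier.dbref"])
```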
The examples that follow are taken from a rat UniProt entry: http://www.uniprot.org/uniprot/Q923K9.xml
The second line of that configuration sets gene.primaryIdentifier to the id attribute of the RGD dbReference:
<dbReference type="RGD" id="619834" key="33">
  <property type="gene designation" value="Acf"/>
</dbReference>
The third line sets gene.secondaryIdentifier to the id attribute of the Ensembl dbReference:
<dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
  <property type="organism name" value="Rattus norvegicus"/>
</dbReference>
The last line sets gene.symbol to the text of the <name> element whose type is primary:
<gene>
  <name type="primary">A1cf</name>
  <name type="synonym">Acf</name>
  <name type="synonym">Asp</name>
</gene>
The values for symbol.name can be primary, ORF or ordered locus.
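The mapping described above can be sketched as a small lookup over the entry XML. This is an illustration under assumptions, not the converter's actual code: gene_fields is a hypothetical name, the snippet wraps the fragments shown above in a bare <entry> element, and it omits the XML namespace that real UniProt files declare.

```python
import xml.etree.ElementTree as ET

# Simplified entry assembled from the fragments shown above (no namespace).
ENTRY = """\
<entry>
  <gene>
    <name type="primary">A1cf</name>
    <name type="synonym">Acf</name>
    <name type="synonym">Asp</name>
  </gene>
  <dbReference type="RGD" id="619834" key="33">
    <property type="gene designation" value="Acf"/>
  </dbReference>
  <dbReference type="Ensembl" id="ENSRNOG00000033195" key="30">
    <property type="organism name" value="Rattus norvegicus"/>
  </dbReference>
</entry>
"""

# The rat settings, without the taxon ID prefix.
RAT_CONFIG = {
    "primaryIdentifier.dbref": "RGD",
    "secondaryIdentifier.dbref": "Ensembl",
    "symbol.name": "primary",
}


def gene_fields(entry_xml, config):
    """Resolve each configured gene field against one entry (hypothetical helper)."""
    root = ET.fromstring(entry_xml)
    fields = {}
    for key, value in config.items():
        field, _, method = key.partition(".")
        if method == "dbref":
            # Take the id attribute of the dbReference with the given type.
            node = root.find(f"dbReference[@type='{value}']")
            fields[field] = node.get("id") if node is not None else None
        elif method == "name":
            # Take the text of the gene name with the given type.
            node = root.find(f"gene/name[@type='{value}']")
            fields[field] = node.text if node is not None else None
    return fields


print(gene_fields(ENTRY, RAT_CONFIG))
```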
2.1.2 protein feature types
You can also configure which protein features to load.
- to load specific feature types only, specify them like so:
# in uniprot_config.properties
feature.types = helix
- to load NO feature types:
# in uniprot_config.properties
feature.types = NONE
- to load ALL feature types, do not specify any feature types: remove the feature.types line from this config file. Loading all feature types is the default behaviour.
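The three cases above can be summarised as a small decision function. This is only a sketch of the described behaviour: should_load_feature is a hypothetical name, and the comma separator for listing several types is an assumption, since the example on this page shows a single value.

```python
def should_load_feature(feature_type, feature_types_setting=None):
    """Decide whether a feature type is loaded under a feature.types setting.

    feature_types_setting is the raw property value, or None when the
    feature.types line is absent from uniprot_config.properties.
    """
    if feature_types_setting is None:
        return True  # no feature.types line: load all feature types (default)
    setting = feature_types_setting.strip()
    if setting == "NONE":
        return False  # load no feature types
    # ASSUMPTION: multiple types are comma-separated; the page only shows one value.
    allowed = {t.strip() for t in setting.split(",") if t.strip()}
    return feature_type in allowed


print(should_load_feature("helix", "helix"))
```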
2.2 project.xml
# FlyMine's uniprot entry in project.xml file
<source name="uniprot" type="uniprot" >
<property name="uniprot.organisms" value="7227 7237 6239 7165 7460 4932 9606 10090"/>
<property name="src.data.dir" location="/data/uniprot/current/split"/>
<property name="createinterpro" value="true"/>
<property name="creatego" value="true"/>
</source>
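For checking an entry like this before a build, the sketch below reads the properties out of the source element and splits the organism list into taxon IDs. source_properties is a hypothetical helper, not an InterMine API, and the XML is copied from the FlyMine example above.

```python
import xml.etree.ElementTree as ET

# FlyMine's uniprot source entry, as shown above.
SOURCE = """\
<source name="uniprot" type="uniprot">
  <property name="uniprot.organisms" value="7227 7237 6239 7165 7460 4932 9606 10090"/>
  <property name="src.data.dir" location="/data/uniprot/current/split"/>
  <property name="createinterpro" value="true"/>
  <property name="creatego" value="true"/>
</source>
"""


def source_properties(source_xml):
    """Map each property name to its 'value' or 'location' attribute."""
    root = ET.fromstring(source_xml)
    props = {}
    for prop in root.findall("property"):
        props[prop.get("name")] = prop.get("value", prop.get("location"))
    return props


props = source_properties(SOURCE)
taxon_ids = props["uniprot.organisms"].split()  # space-separated taxon IDs
print(len(taxon_ids), "organisms; data in", props["src.data.dir"])
```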
3 fasta
# FlyMine's uniprot fasta entry in project.xml file
<source name="uniprot-fasta" type="fasta">
<property name="fasta.taxonId" value="7227 7237 6239 7165 7460 4932 9606 10090"/>
<property name="fasta.className" value="org.intermine.model.bio.Protein"/>
<property name="fasta.classAttribute" value="primaryAccession"/>
<property name="fasta.dataSetTitle" value="UniProt data set"/>
<property name="fasta.dataSourceName" value="UniProt"/>
<property name="src.data.dir" location="/data/uniprot/current"/>
<property name="fasta.includes" value="uniprot_sprot_varsplic.fasta"/>
<property name="fasta.sequenceType" value="protein" />
<property name="fasta.loaderClassName"
value="org.intermine.bio.dataconversion.UniProtFastaLoaderTask"/>
</source>
