Hints for improving data loading performance
Java options
Loading data can be memory intensive, so some Java options should be tuned to improve performance. See the note about setting ANT_OPTS.
Storing Items in order
When loading objects into the production ObjectStore, the order of loading can have a big impact on performance. It is important to store objects before any other objects that reference them. For example, if we have a Gene with a Publication in its evidence collection and a Synonym referencing the Gene, the objects should be stored in the order: Publication, Gene, Synonym. (If, for example, the Gene is stored after the Synonym, a placeholder object is stored in the Gene's place and is later replaced by the real Gene. This extra step takes time.)
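As a rough illustration of the ordering rule above, the sketch below derives a storage order from a reference graph using a topological sort. It is not InterMine code: the storage_order function and the references dictionary are hypothetical names, and the graph simply encodes the Gene/Publication/Synonym example from the text.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def storage_order(references):
    """Return class names ordered so that referenced objects come first.

    `references` (hypothetical structure) maps each class name to the set
    of class names it refers to.
    """
    # TopologicalSorter treats the mapped values as predecessors, so every
    # referenced class is emitted before the classes that point at it.
    return list(TopologicalSorter(references).static_order())


# Hypothetical reference graph taken from the example in the text.
references = {
    "Gene": {"Publication"},  # a Gene has a Publication in its evidence collection
    "Synonym": {"Gene"},      # a Synonym references the Gene
}

print(storage_order(references))  # ['Publication', 'Gene', 'Synonym']
```

If the graph contains a cycle, static_order() raises graphlib.CycleError; in that case no ordering avoids every forward reference, so some placeholders would presumably remain unavoidable.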
Objects are loaded in the order in which Items are stored by converter code, or the order in which they appear in an Items XML file. Once Items have been stored in the common-tgt-items database (during the build or using ant -Dsource=sourcename -Daction=retrieve), you can check whether the order can be improved with this SQL query:
SELECT classnamea, name, classnameb, count(*)
FROM (SELECT DISTINCT itema.classname AS classnamea, name,
             itemb.classname AS classnameb, itemb.identifier
      FROM item AS itema, reference, item AS itemb
      WHERE itema.id = itemid AND refid = itemb.identifier
        AND itema.id < itemb.id) AS a
GROUP BY classnamea, name, classnameb;
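If the query returns no results, the order cannot be improved. Each row counts the distinct Items of classnameb that were stored after an Item of classnamea that references them. The example below shows that 27836 Gene Items were stored after Synonyms that reference them; subject is the name of the reference in Synonym.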
                  classnamea                  |  name   |                classnameb                 | count
----------------------------------------------+---------+-------------------------------------------+-------
 http://www.flymine.org/model/genomic#Synonym | subject | http://www.flymine.org/model/genomic#Gene | 27836
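To make the logic of the check concrete, here is a small in-memory sketch of the same idea: given Items in storage order, count the referenced Items that appear after an Item pointing at them. The dictionary layout, the forward_references function, and the identifiers are hypothetical and only loosely mirror the item and reference tables used by the query.

```python
from collections import Counter


def forward_references(items):
    """Count referenced items that are stored after an item referencing them.

    `items` is a list of dicts in storage order, each with the hypothetical
    keys "identifier", "classname" and "references" (reference name ->
    identifier of the referenced item).
    """
    position = {item["identifier"]: i for i, item in enumerate(items)}
    classname = {item["identifier"]: item["classname"] for item in items}
    seen = set()
    for i, item in enumerate(items):
        for name, target in item["references"].items():
            # A forward reference: the target is stored later than the referrer.
            if target in position and i < position[target]:
                # Mirror SELECT DISTINCT: one entry per referenced item.
                seen.add((item["classname"], name, classname[target], target))
    return Counter((a, name, b) for a, name, b, _ in seen)


# Hypothetical data: the first Synonym is stored before the Gene it references.
items = [
    {"identifier": "0_1", "classname": "Synonym", "references": {"subject": "0_2"}},
    {"identifier": "0_2", "classname": "Gene", "references": {}},
    {"identifier": "0_3", "classname": "Synonym", "references": {"subject": "0_2"}},
]

for key, count in forward_references(items).items():
    print(key, count)  # ('Synonym', 'subject', 'Gene') 1
```

An empty result corresponds to the query returning no rows: every referenced Item was stored before the Items that point at it.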
Still to add: recommended hardware, postgres config
See Also: RunningABuild
