Last modified 3 years ago Last modified on 21/05/09 12:36:22

Fetching related objects in the Results object - solving the N + 1 reads problem

Introduction

This document describes a particular performance problem inherent in object-oriented database systems, and how it is tackled in the InterMine system

The problem is that following references from objects will result in a request to the database to fetch the data. Each of these database requests is small, and returns a tiny amount of data for a large amount of effort spent. If many of these operations are performed, the efficiency of the database system is greatly reduced.

Why is this needed?

Consider the situation where a database has two object classes - A and B. Objects of type A each have a reference to an object of type B. The user writes a program that generates a query for all objects of type A, and iterates through the entire Results object, calling toString on each object. Unfortunately, the toString method of class A uses the reference to the object of type B. Therefore, the program performs lots of single object fetches of objects of type B. In fact, it performs N + 1 queries in total, where N is the number of objects of type A - hence it is called the "N + 1 reads problem".

First coping strategy - caches

The first coping strategy that the InterMine system uses is a cache in the ObjectStore that indexes by the ID of the object. When a proxy reference is accessed, the ObjectStore first looks in the cache. This way, if the user can arrange for the B objects to be in the cache already, the problem no longer occurs.

 ObjectStoreFastCollectionsImpl

The second mechanism that the InterMine system provides is a specific ObjectStore that behaves differently to the main ObjectStore. This is a subclass of  ObjectStorePassthruImpl, and it explicitly materialises every reference and collection present in the objects returned by the underlying ObjectStore. This is a fairly crude method, but it works well for the InterMine data translation pipeline. It could be retrofitted with a mechanism for specifying which collections and references are to be materialised.

A similar function is provided by the  ObjectStoreFastCollectionsForTranslatorImpl, although it has more knowledge of the actual data requirements of the user than ObjectStoreFastCollectionsImpl.

Outer Joins

One can explicitly create a query that returns an extra column containing the contents of the reference to the B objects. This will ensure that the B objects are in the cache. In fact, the objects are guaranteed to be in the cache if they are still strongly-referenced in the virtual machine, which will almost certainly be the case if you are iterating through the results. Alternatively, you can bypass the cache entirely and just read the B objects out of the extra column in the results. Outer Joins are a feature of IQL, described in PathExpressions?.

Automatic N + 1 reads problem correction

There are currently no plans to implement an automatic system to detect and solve generic N + 1 reads problems, although it would not be too difficult.